Hacking Molbank: Creating a Graphical Table of Contents
Molbank is an Open Access collection of single-compound articles on synthetic chemistry. Previous articles on Depth-First have highlighted Molbank's practice of including machine-readable molecular representations of its content, and its very liberal policy on mirroring and robots. In this article, we'll take advantage of both of these features to build something that was left out of Molbank: a graphical table of contents.
The Graphical Table of Contents (GTOC)
The Molbank Graphical Table of Contents (Molbank GTOC) is available online. It consists of a single Web page containing a grid of color 2-D chemical structures representing the contents of Molbank. Each structure is hyperlinked into the Molbank site itself. Clicking on the structure takes you to the complete synthetic procedure and characterization data.

Prerequisites, Downloading, and Running
To run this project, you'll need Ruby CDK. A recent article described the small amount of system configuration required for Ruby CDK on Linux. Another article showed how to install Ruby CDK on Windows.
The complete source code for this project can be downloaded from RubyForge. A subdirectory called demo contains the pre-built final result.
After unpacking the molbank-0.1.0 archive, the demo application can be run:
$ cd molbank-0.0.1
$ ruby test.rb
Problems, We've Got Problems
Several problems were uncovered while building the Molbank GTOC. This is to be expected with any data produced "in the wild" rather than within the safety of an Ivory Tower. Here are the main categories:
Blank Images: The entry for M52 is blank. Checking the underlying molfile reveals four instances of bond stereo flags set to "6," a problem common to many of the blank images in the GTOC. According to the Molfile specification, a value of 6 indicates "Down, double bonds," whatever that means. Given that the molecules shown in M52 only have one possible stereo bond, and that the Molfile specification relies on 2-D coordinates to encode double-bond geometry, an encoding inconsistency or incorrect stereo interpretation may be the cause.
Images Containing an "R" Atom Label: Entry M53 shows an "R" group at what should be the carbonyl carbon. The underlying molfile contains several less-common entries in the properties block, a common feature of images containing "R" in the GTOC.
Molfile not Found: Entry M95 has no associated molfile because it simply reports errata for other articles. M253-M259, on the other hand, lack molfiles because the articles were "withdrawn before publication." M347 describes a cyclodextrin for which, understandably, no molfile was provided. There are also a couple of cases in which a link to a molfile is provided, but the file is not available, such as M352.
Broken Molfiles: The molfile for M162 encodes its line endings as two carriage returns and a newline, giving rise to the appearance of blank lines after data lines. This is something the Molfile specification strictly forbids. Apparently, the underlying CDK molfile reader can only handle one carriage return and a newline. Perhaps the extra return was introduced as the file was copied into and out of text editors on various operating systems in preparation for uploading it to Molbank. Another common problem was binary files being used for molfiles, such as with M402. These files don't appear to be compressed with either Zip or GZip and their nature is currently unknown.
Bogus Molfiles: For reasons I still can't understand, the molfile for M407 encodes ethylene. So do several other Molbank molfiles. Other common dummy molfiles include toluene, benzene, and ethane.
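Several of the problems above can be spotted before a molfile ever reaches a structure renderer. The following is only a rough sketch in plain Ruby, not the code used to build the GTOC: the method names are hypothetical, the binary check is a crude heuristic, and the stereo check assumes the fixed-width V2000 layout (atom and bond counts in the first six columns of line 4, bond stereo in columns 10-12 of each bond line).

```ruby
# Hypothetical helpers for pre-screening molfiles; standard library only.

# Turn "\r\r\n", "\r\n", and lone "\r" line endings into a single "\n".
def normalize_line_endings(text)
  text.gsub(/\r+\n/, "\n").gsub(/\r/, "\n")
end

# Crude heuristic (an assumption): a NUL byte means the file is not text.
def binary?(data)
  data.bytes.include?(0)
end

# Return the bond stereo value of every bond in a V2000 molfile.
def bond_stereo_flags(molfile)
  lines = normalize_line_endings(molfile).split("\n", -1)
  counts = lines[3].to_s              # line 4 is the counts line
  atom_count = counts[0, 3].to_i      # columns 1-3
  bond_count = counts[3, 3].to_i      # columns 4-6
  bond_lines = lines[4 + atom_count, bond_count] || []
  bond_lines.map { |line| line[9, 3].to_i }  # columns 10-12
end

# A made-up two-atom molfile using the doubled carriage returns seen in M162.
sample = [
  "sample", "  hand-written", "",
  "  2  1  0  0  0  0  0  0  0  0999 V2000",
  "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
  "    1.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
  "  1  2  1  6",
  "M  END"
].join("\r\r\n")

puts "binary: #{binary?(sample)}"
puts "stereo flags: #{bond_stereo_flags(sample).inspect}"
```

Run over the whole collection, a check like this could at least sort entries into "needs line-ending repair," "not text at all," and "carries unusual stereo flags" before any images are drawn.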
After cataloging the problems that exist with the Molbank dataset and the software used to mine it, two interesting questions come into focus:
What can be done to help Molbank fix the most obvious problems in their molfiles and would they accept these improvements?
How can "real" datasets like Molbank help developers build better cheminformatics software? (a graphical Molfile Debugger Utility would come in handy...)
Clearly, the connection between Open Access, Open Source, and Open Data is very strong and runs very deep.
Behind the Scenes
The Ruby Molbank GTOC generator works by connecting to the www.mdpi.net server to get its data in real-time. Internally, the software creates a map of the Molbank website so that the molfile (and URL) for any article can be retrieved on demand. Each readable molfile is used to create a 2-D image using Ruby CDK. As a final step, the index.html page is generated, linking the 2-D images to a specific URL for a Molbank article. This file is produced with eRuby using a previously-described technique.
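The last step, turning a list of rendered images into a hyperlinked grid, can be sketched with Ruby's bundled ERB library. This is an illustration rather than the project's actual template: the Entry struct, the render_index method, the image paths, and the example.org URLs are all placeholders invented for the example.

```ruby
require "erb"

# Hypothetical record: article id, article URL, and path to the rendered image.
Entry = Struct.new(:id, :url, :image)

# Minimal page template; each structure image links to its article.
TEMPLATE = <<~HTML
  <html>
    <body>
      <% entries.each do |entry| %>
      <a href="<%= ERB::Util.html_escape(entry.url) %>"><img src="<%= ERB::Util.html_escape(entry.image) %>" alt="<%= ERB::Util.html_escape(entry.id) %>"></a>
      <% end %>
    </body>
  </html>
HTML

# Render the template; `entries` is visible to ERB through the local binding.
def render_index(entries)
  ERB.new(TEMPLATE).result(binding)
end

entries = [
  Entry.new("M52", "http://example.org/molbank/m52.htm", "images/m52.png"),
  Entry.new("M53", "http://example.org/molbank/m53.htm", "images/m53.png")
]

File.write("index.html", render_index(entries))
```

Entries whose molfiles could not be read would simply be left out of the list, or given a placeholder image, before rendering.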
Conclusions
Building a Graphical Table of Contents for Molbank is not that difficult given the power of Ruby and Molbank's forward-thinking attitude toward mirroring and robots. In working on this project, several problems were uncovered, both with Molbank's data and with the software used to mine it.
In some ways, the software described here and its output are less interesting than the larger questions they raise:
How do scientific journals best serve not only their readers, but developers who want to provide new ways to use the journal?
How far does copyright extend in scientific publications? For example, are molfiles copyrightable? If so, at what level of detail are they not? If atom coordinates or some other kind of non-essential information is left out, does that change anything?
In what other practical ways could the connection between Open Source, Open Data, and Open Access be explored?
These and many related questions are waiting just around the corner. As Open Access becomes more viable, both technically and commercially, look to Open Source and Open Data to provide the synergies that will unlock its true potential.
Molbank and the Convergence of Open Access, Open Data, and Open Source in Chemistry
Molbank, published by Molecular Diversity Preservation International, is one of the oldest of a handful of Open Access journals in chemistry. Although its longevity is a remarkable accomplishment in itself, there is much more to Molbank than meets the eye. Just below the surface is a feature so revolutionary, yet simple, that chemistry publishers years from now will wonder why they didn't implement it sooner.
A Molbank article consists of a short monograph on a single compound, or possibly two. This may strike some scientists as a strange way to publish results, and it is unusual. On the other hand, this system offers vast potential to capture useful, but "unpublishable" findings that would otherwise be lost. Back when scientists actually read hardcopy journals, such a system would never have been feasible. Today, with hard drive space measured in terabytes, fiber optics cables crisscrossing the planet, Internet connectivity for almost everyone, and servers that can be had for virtually nothing, this system not only looks perfectly feasible, but preferable in many ways to the status quo.
Here's the revolutionary part: each article that Molbank publishes is accompanied by a publicly-available, machine-readable file encoding the structure of the article's subject molecule. That's it. There's nothing tricky or high-tech about it. In fact, the practice is about as low-tech as you could imagine. The file format in which structures are encoded, molfile, dates back at least fifteen years, and nearly every piece of chemistry software - both end-user and developer tools - can handle it. What makes Molbank's practice revolutionary is that no other chemistry journal, Open Access or subscription-based, currently does this.
Why does the simple inclusion of a publicly-available molfile encoding molecular structures in a paper matter so much? This is where the other two entities of the trinity named in this article's title come into play: Open Source and Open Data. By providing a mechanism for a computer to decipher the chemistry in a paper, Molbank has opened the door to a host of highly-productive integration activities that nobody outside of Chemical Abstracts Service has even been able to contemplate, let alone prepare for.
This article is the first in a series aimed at exploring the wide-open space that Molbank has created. Rather than arguing my point with words, I'll actually build working demonstrations of what is now easily within reach. At the same time, I'll document my work on this blog. I'm not sure where all of this will end up, but I do hope to shine some light on a vital, although currently obscure, component of the Open Access debate.
We Have Met the Enemy and He Is Us
The basic problem of the primary literature is that the material to be published grows more rapidly than the number of people or institutions interested in buying and/or using it. A smaller, but still nagging, difficulty is that unit costs increase more rapidly than publishers are able to increase unit productivity.
... But in the last analysis, the primary literature would easily be able to continue basically unchanged, were it not for the fact that the demand has stabilized, while the supply of material has not yet done so.
-David E. Gushee J. Chem. Doc. 1970, 10, 30-32
Gushee goes on to discuss the decline of ACS journal subscription rates and the simultaneous increase in total pages printed and journals published. One wonders to what extent these trends continued over the last 36 years and how this phenomenon may be driving the current escalation in journal costs.
About this "price squeeze" and a publisher's inability to escape it, Gushee writes:
A scientific society cannot, however, control cost as the typical business can. In journal publishing, the only real cost we can save is the page we don't print. And to restrict the number of pages printed is to interfere with the dissemination of knowledge, which is, after all, the basic reason the Society exists in the first place.
There are many interesting tidbits in this Back to the Future article, but perhaps none more so than the following:
Should the number of pages go over some critical number, then we get into a position of having to charge such a high price that individuals can no longer afford the journal. Chemical Abstracts, as an entity, reached that point some years ago and can no longer be considered a publication for individual subscriptions.
How expensive does a journal need to become before it can no longer be considered a publication for individual libraries? When that point is reached, who is responsible?
The Open Access Ecosystem
What happens to an article in an Open Access journal that shuts down? Recently, this question was raised on the Blue Obelisk mailing list about an article published in the Internet Journal of Chemistry (IJC). Because the lights now appear to be out for good at IJC, are its articles lost forever?
The good news is that by retaining copyright, authors of Open Access articles have the right to copy or reprocess their work in any form they see fit. If a traditional subscription-based journal shuts down, the fate of its entire article collection is up to the publisher, who is in nearly all cases the sole copyright holder. It's remarkable that self-respecting scientists would knowingly allow the fruits of their hard work to meet with such a fate. With Open Access, the author is in control of keeping their article publicly visible.
The bad news is that keeping an article publicly visible is the last thing most scientists want to spend valuable time and energy on. After all, that's what the journal was there for, wasn't it? Given the technical barriers to self-archiving Open Access content, who could blame them? First, an author needs to find a server willing to host their content. After that comes learning the software to get the article onto the server. Then comes the need to decide on the archival format, being ever-mindful of the hamburger effect. Of course, authors would probably want some assurance that the location of this article won't change and will be "permanently" available. Does a DOI need to be re-assigned? And let's not forget about how the poor reader is supposed to find these articles (some would say that Google is the answer, but I would disagree). Expecting each author to solve these problems on his or her own simply won't work. There must be a better way.
To my knowledge, there is no solution to the Open Access archiving problem. But if history is any guide, this is a huge opportunity that will soon disappear. Maybe a SourceForge-like repository for Open Access content would work. Perhaps something less structured would be enough. The profit motive would certainly come into play, as the successful solution to this problem would easily have thousands, if not tens of thousands, of regular users. Whatever form the solution might take, it would most likely be a simple system built by a small organization using off-the-shelf components. I would expect nothing less from a disruptive technology like Open Access.
As one or more solutions to the Open Access archival problem begin to gain traction, other opportunities may arise and be exploited by enterprising individuals and small organizations. And so on, until a thriving ecosystem becomes established.
Proponents have been debating the "how" of Open Access for some time now. Maybe it's time to start thinking about what comes after the Open Access transition.
Electric Cars and Open Access
Markets are developed with fine products that customers desire to own. No salesman can take a marginal product into the marketplace and have any hope of establishing a sustainable consumer base. Consumers will not be forced into a purchase that they do not want. Mandates will not work in a consumer-driven, free market economy. For electric vehicles to find a place in the market, respectable products comparable to today's gasoline-powered cars must be available.
-William Glaub, Chrysler Corporation, 1995
-Cited in The Innovator's Dilemma by Clayton Christensen
The value of scientific research—especially in chemistry—exists long after its publication and certainly well past 6 months. Moreover, it will be difficult to maintain a cost-efficient, high-caliber peer review and permanent archiving system if scientific societies have just six months to recoup costs before mandating free access. The prospect of “free” access to literature may seem good, but high quality literature at an affordable price is better.
-E. Ann Nally, ACS President, 2006 Update on Open Access
The 2006 documentary Who Killed the Electric Car? should give anyone involved with Open Access reason to ponder. The film tells the story of GM's electric car experiment (the EV1) in California and its eventual failure. Open Access bears a striking resemblance to the electric car, including the ways parties on both sides frame the discussion, the market dynamics involved, and steps by government to mandate a solution. Can Open Access avoid the fate of the EV1?
In August 2005, the Beilstein-Institut launched Beilstein Journal of Organic Chemistry (BJOC). It was the first Open Access journal in chemistry to have the backing of a large publisher, and understandably, expectations were high. The journal's aims represent a significant departure from established practice in chemical publishing:
Beilstein Journal of Organic Chemistry offers organic chemists a unique opportunity to publish their research rapidly in an Open Access medium that is freely available online to researchers worldwide. In doing so it not only offers authors uniquely wide visibility and high impact, but it also ensures that their work is part of the permanent, publicly available archive of science. Open Access does not compromise the high quality of the articles published. All manuscripts submitted to the journal are subject to rigorous peer review.
While rigorous peer review has always been an objective in chemical publication, the idea of creating a permanent, publicly-available Open Access archive had only previously been attempted by a few daring, lesser-known publishers. Even more provocatively, BJOC waived its author submission fee, making the journal free to readers and authors alike.
Back in 2005, many were asking tough questions about the BJOC revenue model. How will BJOC maintain high publication quality while taking money from neither subscribers nor authors? Who will eventually end up paying for this service and when?
The question less frequently asked is the subject of this article: "Will BJOC be able to attract the same flow of high-quality manuscript submissions as its competitors?" The answer seemed so self-evident that merely asking bordered on the absurd. Of course it would! Scientists everywhere seemed to be calling for Open Access, and here stood a publisher offering it at no cost to subscribers or authors - at least for a while. What was there not to like?
Fifteen months later, discussions of cost and quality, although no doubt important, are nowhere to be found. Whereas originally BJOC maintained that authors' fees would be waived for an indefinite period of time, their position now suggests that authors' fees might never be charged:
The publication costs for Beilstein Journal of Organic Chemistry are covered by the journal, so authors do not need to pay an article processing charge.
Anyone who has watched BJOC's front page for the last few months will no doubt have noticed a puzzling trend: the journal releases on average fewer than three papers per month, which actually represents a slight decrease from its rate at launch.

We can only assume that BJOC's editors didn't set out to produce a journal that publishes fewer articles in an entire month than Journal of Organic Chemistry publishes in a day. It then follows that publication-quality manuscripts are simply not being sent to BJOC in significant quantity.
We could certainly propose a few hypotheses at this point. For example, Peter Murray-Rust points to the "citation economy" and the role a journal's prestige plays in an author's journal selection process. He also points out that most journals take time to develop, and so it may be too early to judge the success of BJOC.
It would be hard to deny the role of prestige in scientific publication - but if this is the only explanation, then how do new journals ever come into being? It wasn't too long ago that Organic Letters was an upstart in a market long dominated by Tetrahedron Letters. Within two to three years, the tables had turned decisively. Consider also Chemistry: An Asian Journal. Although only a few months old, it out-publishes BJOC by a factor of 10:1.
Metcalfe's law states that in any communications network, among which scientific journals can clearly be counted, the value of the network is proportional to the square of the number of users. Large, existing journals have a significant advantage in this respect. A journal without regular readers can't possibly hope to attract manuscript submissions, no matter how revolutionary its publishing model.
I won't be offering any concrete hypotheses at this point. I'll simply return to the analogy I started this article with: the documentary Who Killed the Electric Car?. How could a car "supported" by apparently so many fail to find a market? The documentary names several possible factors, including the government, consumers, the auto industry, inferior battery technology, and alternative technologies. These factors no doubt played roles in the failure of the EV1, but maybe there's another way to look at it.

A very different kind of analysis of the failure of the electric car is provided in Clayton Christensen's landmark book The Innovator's Dilemma. Christensen views electric cars as a "Disruptive Technology", the defining characteristics of which are:
It performs worse in one or more areas, but is typically simpler, more reliable, or more convenient than existing technologies.
Its performance trajectory is steeper than that of existing technologies.
It is built from off-the-shelf components.
It is less profitable than existing technologies.
Leading firms' most profitable customers generally can't use it and don't want it.
It is first commercialized in emerging or insignificant markets.
Large organizations are fundamentally incapable of successfully bringing it to market.
According to Christensen, the electric car died because GM failed to recognize that it was dealing with a Disruptive Technology and to act accordingly. Could it be that the less-than-enthusiastic reception of BJOC results from the same failure on the part of the Beilstein-Institut? Future articles will explore this idea.


