Hacking Molbank: Creating a Graphical Table of Contents
Molbank is an Open Access collection of single-compound articles on synthetic chemistry. Previous articles on Depth-First have highlighted Molbank's practice of including machine-readable molecular representations of its content, and its very liberal policy on mirroring and robots. In this article, we'll take advantage of both of these features to build something that was left out of Molbank: a graphical table of contents.
The Graphical Table of Contents (GTOC)
The Molbank Graphical Table of Contents (Molbank GTOC) is available online. It consists of a single Web page containing a grid of color 2-D chemical structures representing the contents of Molbank. Each structure is hyperlinked into the Molbank site itself. Clicking on the structure takes you to the complete synthetic procedure and characterization data.

Prerequisites, Downloading, and Running
To run this project, you'll need Ruby CDK. A recent article described the small amount of system configuration required for Ruby CDK on Linux. Another article showed how to install Ruby CDK on Windows.
The complete source code for this project can be downloaded from RubyForge. A subdirectory called demo contains the pre-built final result.
After unpacking the molbank-0.1.0 archive, the demo application can be run:
$ cd molbank-0.0.1 $ ruby test.rb
Problems, We've Got Problems
Several problems were uncovered while building the Molbank GTOC. This is to be expected with any data produced "in the wild" rather than within the safety of an Ivory Tower. Here are the main categories:
Blank Images The entry for M52 is blank. Checking the underlying molfile reveals four instances of bond stereo flags set to "6," a problem common to many of the blank images in the GTOC. According to the Molfile specification, a value of 6 indicates "Down, double bonds," whatever that means. Given that the molecules shown in M52 only have one possible stereo bond, and that the Molfile specification relies on 2-D coordinates to encode double-bond geometry, an encoding inconsistency or incorrect stereo interpretation may be the cause.
Images Containing an "R" Atom Label Entry M53 shows an "R" group at what should be the carbonyl carbon. The underlying molfile contains several less-common entries in the properties block, a common feature of images containing "R" in the GTOC.
Molfile not Found Entry M95 has no associated Molfile because it simply reports errata for other articles. M253-M259, on the other hand, lack molfiles because the articles were "withdrawn before publication." M347 describes a cyclodextrin for which, understandably, no molfile was provided. There are also a couple of cases in which a link to a molfile is provided, but is not available, such as M352.
Broken Molfiles The Molfile for M162 encodes its line endings as two carriage returns and a newline, giving rise to the appearance of blank lines after data lines. This is something the Molfile specification strictly forbids. Apparently, the underlying CDK molfile reader can only handle one carriage return and a newline. Perhaps the extra return was introduced as the file was copied into and out of text editors on various operating systems in preparation for uploading it to Molbank. Another common problem was binary files being used for molfiles, such as with M402. These files don't appear to be compressed with either Zip or GZip and their nature is currently unknown.
Bogus Molfiles For reasons I still can't understand, the Molfile for M407 encodes ethylene. So do several other Molbank molfiles. Other common dummy molfiles include toluene, benzene, and ethane.
After cataloging the problems that exist with the Molbank dataset and the software used to mine it, two interesting questions come into focus:
What can be done to help Molbank fix the most obvious problems in their molfiles and would they accept these improvements?
How can "real" datasets like Molbank help developers build better cheminformatics software? (a graphical Molfile Debugger Utility would come in handy...)
Clearly, the connection between Open Access, Open Source, and Open Data is very strong and runs very deep.
Behind the Scenes
The Ruby Molbank GTOC generator works by connecting to the www.mdpi.net server to get its data in real-time. Internally, the software creates a map of the Molbank website so that the molfile (and URL) for any article can be retrieved on demand. Each readable molfile is used to create a 2-D image using Ruby CDK. As a final step, the index.html page is generated, linking the 2-D images to a specific URL for a Molbank article. This file is produced with eRuby using a previously-described technique.
Conclusions
Building a Graphical Table of Contents for Molbank is not that difficult given the power of Ruby, and Molbank's forward-thinking attitude toward mirroring and robots. In working on this project, several problems were uncovered, both with Molbank's data, and the software used to mine it.
In some ways, the software described here and its output are less interesting than the larger questions they raise:
How do scientific journals best serve not only their readers, but developers who want to provide new ways to use the journal?
How far does copyright extend in scientific publications? For example, are molfiles copyrightable? If so, at what level of detail are they not? If atom coordinates or some other kind of non-essential information is left out, does that change anything?
In what other practical ways could the connection between Open Source, Open Data, and Open Access be explored?
These and many related questions are waiting just around the corner. As Open Access becomes more viable, both technically and commercially, look to Open Source and Open Data to provide the synergies that will unlock its true potential.
Molbank and the Convergence of Open Access, Open Data, and Open Source in Chemistry
Molbank, published by Molecuar Diversity Preservation International, is one of the oldest of a handful of Open Access journals in chemistry. Although its longevity is a remarkable accomplishment in itself, there is much more to Molbank than meets eye. Just below the surface is a feature so revolutionary, yet simple, that chemistry publishers years from now will wonder why they didn't implement it sooner.
A Molbank article consists of a short monograph on a single compound, or possibly two. This may strike some scientists as a strange way to publish results, and it is unusual. On the other hand, this system offers vast potential to capture useful, but "unpublishable" findings that would otherwise be lost. Back when scientists actually read hardcopy journals, such a system would never have been feasible. Today, with hard drive space measured in terabytes, fiber optics cables crisscrossing the planet, Internet connectivity for almost everyone, and servers that can be had for virtually nothing, this system not only looks perfectly feasible, but preferable in many ways to the status quo.
Here's the revolutionary part: each article that Molbank publishes is accompanied by a publicly-available, machine-readable file encoding the structure of the article's subject molecule. That's it. There's nothing tricky or high-tech about it. In fact, the practice is about as low-tech as you could imagine. The file format in which structures are encoded, molfile, dates back at least fifteen years, and nearly every piece of chemistry software - both end-user and developer tools - can handle it. What makes Molbank's practice revolutionary is that not a single chemistry journal, Open Access or subscription-based, currently does this.
Why does the simple inclusion of a publicly-available molfile encoding molecular structures in a paper matter so much? This is where the second two entities of the trinity named in this article's title come into play: Open Source and Open Data. By providing a mechanism for a computer to decipher the chemistry in a paper, Molbank has opened the door to a host of highly-productive integration activities that nobody outside of Chemical Abstract Service has even been able to contemplate, let alone prepare for.
This article is the first in a series aimed at exploring the wide-open space that Molbank has created. Rather than arguing my point with words, I'll actually build working demonstrations of what is now easily within reach. At the same time, I'll document my work on this blog. I'm not sure where all of this will end up, but I do hope to shine some light on a vital, although currently obscure, component of the Open Access debate.
Readily Available, Without Infringements or Restrictions
...If we consider that one of the purposes of publication is to offer testable data, then it would seem that a minimum requirement would be that where computer programs and their results are presented, the author will make source code available on request. ACS could render good service by undertaking the distribution of such requested code. Furthermore, I would make it a condition for publication that such source code be provided. If the scientist is unwilling to disclose his code because he wishes to engage in a commercial venture, then I suggest that he be invited to take out a paid advertisement in the journal and be denied the privilege of publication to promote his product.
-John Figueras J. Chem. Inf. Comput. Sci. 1984, 24, 276
Science moves forward only insofar as observations can be validated and put to use by a third party. Chemical informatics is no different from any other field in this respect. Yet publications of the type Mr. Figueras opposed can still be found in 2006. Why is this?
At issue isn't just software. The ACS has recently spoken out on the necessity of open data sets. As a condition for publication, any data reported in a manuscript must now either appear in Supplementary Material or be “readily available, without infringements or restrictions.” Although this is a positive development, the wait continues for an equivalent statement on the availability of source code.
Open software systems and open data packages are most useful when they can be readily found by others and used together. In an effort to work on this problem, several individuals, including myself, formed The Blue Obelisk group. Through this group and others like it, like-minded researches can begin to reap the benefits of openness enjoyed by other fields.
Older posts: 1 2

