Making the Case: In Silico Prediction of Ames Test Mutagenicity

Posted by Rich Apodaca Thu, 28 Dec 2006 15:09:00 GMT

The two models (SAm and AIm) and the RHC [robust hybrid classifier] were implemented in C++ using OpenBabel 1.100.2 libraries (http://openbabel.sourceforge.net/wiki/Main_Page).

The AI model (AIm) is based on the LAZAR system (http://www.predictive-toxicology.org/lazar/index.html) developed by C. Helma...

-Paolo Mazzatorta, Liên-Anh Tran, Benoît Schilter, and Martin Grigorov J. Chem. Inf. Model.

Yet another appearance of Open Source software in the primary cheminformatics literature comes by way of a paper from Mazzatorta, Tran, Schilter, and Grigorov of the Nestlé Research Center. This work employs two Open Source libraries: lazar, a tool for the prediction of toxic properties of chemical structures; and Open Babel, a widely-used, low-level library for cheminformatics. lazar, in turn, is based on both Open Babel and the GNU Scientific Library (GSL), a numerical library. Unfortunately, the Nestlé authors don't indicate whether the source code for their system is publicly available. Nevertheless, their work gives a taste of the kinds of synergies that inevitably develop through the use of Open Source software.

Scaffolding

Posted by Rich Apodaca Thu, 21 Dec 2006 15:39:00 GMT

Linus Torvalds, for example, didn't actually try to write Linux from scratch. Instead, he started by reusing code and ideas from Minix, a tiny Unix-like operating system for PC clones. Eventually all the Minix code went away or was completely rewritten -- but while it was there, it provided scaffolding for the infant that would eventually become Linux.

-Eric Steven Raymond, The Cathedral and the Bazaar

The creation of Linux is, in part, the stuff of legend. But ESR does make an interesting observation on one role that Open Source software can play in building complex systems. Rather than viewing Open Source software as a permanent fixture of a larger system, why not view it as temporary scaffolding to be replaced, in part or in full, after the system has been fully implemented and all of its requirements are known?

Hacking Molbank: Creating a Graphical Table of Contents

Posted by Rich Apodaca Mon, 11 Dec 2006 15:00:00 GMT

Molbank is an Open Access collection of single-compound articles on synthetic chemistry. Previous articles on Depth-First have highlighted Molbank's practice of including machine-readable molecular representations of its content, and its very liberal policy on mirroring and robots. In this article, we'll take advantage of both of these features to build something that was left out of Molbank: a graphical table of contents.

The Graphical Table of Contents (GTOC)

The Molbank Graphical Table of Contents (Molbank GTOC) is available online. It consists of a single Web page containing a grid of color 2-D chemical structures representing the contents of Molbank. Each structure is hyperlinked into the Molbank site itself. Clicking on the structure takes you to the complete synthetic procedure and characterization data.

Prerequisites, Downloading, and Running

To run this project, you'll need Ruby CDK. A recent article described the small amount of system configuration required for Ruby CDK on Linux. Another article showed how to install Ruby CDK on Windows.

The complete source code for this project can be downloaded from RubyForge. A subdirectory called demo contains the pre-built final result.

After unpacking the molbank-0.1.0 archive, the demo application can be run:

$ cd molbank-0.0.1
$ ruby test.rb

Problems, We've Got Problems

Several problems were uncovered while building the Molbank GTOC. This is to be expected with any data produced "in the wild" rather than within the safety of an Ivory Tower. Here are the main categories:

  • Blank Images: The entry for M52 is blank. Checking the underlying molfile reveals four instances of bond stereo flags set to "6," a problem common to many of the blank images in the GTOC. According to the Molfile specification, a value of 6 indicates "Down, double bonds," whatever that means. Given that the molecules shown in M52 only have one possible stereo bond, and that the Molfile specification relies on 2-D coordinates to encode double-bond geometry, an encoding inconsistency or incorrect stereo interpretation may be the cause.

  • Images Containing an "R" Atom Label: Entry M53 shows an "R" group at what should be the carbonyl carbon. The underlying molfile contains several less-common entries in the properties block, a common feature of images containing "R" in the GTOC.

  • Molfile not Found: Entry M95 has no associated Molfile because it simply reports errata for other articles. M253-M259, on the other hand, lack molfiles because the articles were "withdrawn before publication." M347 describes a cyclodextrin for which, understandably, no molfile was provided. There are also a couple of cases in which a link to a molfile is provided, but is not available, such as M352.

  • Broken Molfiles: The Molfile for M162 encodes its line endings as two carriage returns and a newline, giving rise to the appearance of blank lines after data lines. This is something the Molfile specification strictly forbids. Apparently, the underlying CDK molfile reader can only handle one carriage return and a newline. Perhaps the extra return was introduced as the file was copied into and out of text editors on various operating systems in preparation for uploading it to Molbank. Another common problem was binary files being used for molfiles, such as with M402. These files don't appear to be compressed with either Zip or GZip and their nature is currently unknown.

  • Bogus Molfiles: For reasons I still can't understand, the Molfile for M407 encodes ethylene. So do several other Molbank molfiles. Other common dummy molfiles include toluene, benzene, and ethane.
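Several of the problems listed above could be caught before a molfile ever reaches the CDK reader. What follows is only a rough sketch in plain Ruby, not the code used to build the GTOC. The module and method names (MolfileCheck, normalize_newlines, binary?, bond_stereo_flags) are made up for illustration, and the parsing assumes the fixed-width V2000 layout, in which the fourth line holds the atom and bond counts and each bond line carries its stereo flag in columns 10-12.

```ruby
# Hypothetical pre-flight checks for Molbank molfiles (V2000 layout assumed).
module MolfileCheck
  # Collapse CR CR LF, CR LF, and bare CR line endings into a single LF.
  # Works on raw bytes so stray non-UTF-8 bytes do not raise an error.
  def self.normalize_newlines(text)
    text.b.gsub(/\r+\n/, "\n").tr("\r", "\n")
  end

  # Treat any control byte other than tab, LF, VT, FF, or CR as a sign
  # that the file is binary rather than a text molfile.
  def self.binary?(text)
    text.each_byte.any? { |byte| byte < 9 || (byte > 13 && byte < 32) }
  end

  # Return the stereo flag of every bond, read from the fixed-width bond block.
  def self.bond_stereo_flags(text)
    lines = normalize_newlines(text).split("\n")
    counts = lines[3].to_s                  # counts line: aaabbb...
    atom_count = counts[0, 3].to_i
    bond_count = counts[3, 3].to_i
    bond_lines = lines[4 + atom_count, bond_count] || []
    bond_lines.map { |line| line[9, 3].to_i }
  end
end
```

A generator could skip any file for which binary? returns true, pass the output of normalize_newlines to the reader instead of the raw download, and log every entry whose bond_stereo_flags include 6 so that blank images can be traced back to their source.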

After cataloging the problems that exist with the Molbank dataset and the software used to mine it, two interesting questions come into focus:

  • What can be done to help Molbank fix the most obvious problems in their molfiles and would they accept these improvements?

  • How can "real" datasets like Molbank help developers build better cheminformatics software? (a graphical Molfile Debugger Utility would come in handy...)

Clearly, the connection between Open Access, Open Source, and Open Data is very strong and runs very deep.

Behind the Scenes

The Ruby Molbank GTOC generator works by connecting to the www.mdpi.net server to get its data in real-time. Internally, the software creates a map of the Molbank website so that the molfile (and URL) for any article can be retrieved on demand. Each readable molfile is used to create a 2-D image using Ruby CDK. As a final step, the index.html page is generated, linking the 2-D images to a specific URL for a Molbank article. This file is produced with eRuby using a previously-described technique.
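The final step, turning a list of rendered entries into index.html, can be sketched with Ruby's standard ERB library. This is an illustration under assumptions rather than the generator's actual template: the Entry struct, the GtocPage class, the four-column grid, and the example.org URLs are all hypothetical, and the image paths stand in for files that Ruby CDK would already have written.

```ruby
require 'erb'

# One GTOC cell: article ID, link target, and path of the rendered image.
Entry = Struct.new(:id, :article_url, :image_path)

# Hypothetical page builder; the real generator's template is not shown here.
class GtocPage
  include ERB::Util # provides h() for HTML escaping

  TEMPLATE = <<~HTML
    <html>
      <body>
        <table>
        <% @rows.each do |row| %>
          <tr>
          <% row.each do |entry| %>
            <td><a href="<%= h(entry.article_url) %>"><img src="<%= h(entry.image_path) %>" alt="<%= h(entry.id) %>"/></a></td>
          <% end %>
          </tr>
        <% end %>
        </table>
      </body>
    </html>
  HTML

  # Split the flat entry list into rows of at most `columns` cells.
  def initialize(entries, columns: 4)
    @rows = entries.each_slice(columns).to_a
  end

  # Evaluate the template with this object's instance variables in scope.
  def render
    ERB.new(TEMPLATE).result(binding)
  end
end

# Placeholder entries; the URLs and image paths are illustrative only.
entries = %w[M52 M53 M162].map do |id|
  Entry.new(id, "http://example.org/molbank/#{id}", "images/#{id}.png")
end
puts GtocPage.new(entries, columns: 2).render
```

Keeping the markup in a template and the data in plain structs means the grid layout can change without touching the code that maps the site or renders the images.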

Conclusions

Building a Graphical Table of Contents for Molbank is not that difficult given the power of Ruby, and Molbank's forward-thinking attitude toward mirroring and robots. In working on this project, several problems were uncovered, both with Molbank's data, and the software used to mine it.

In some ways, the software described here and its output are less interesting than the larger questions they raise:

  • How do scientific journals best serve not only their readers, but developers who want to provide new ways to use the journal?

  • How far does copyright extend in scientific publications? For example, are molfiles copyrightable? If so, at what level of detail are they not? If atom coordinates or some other kind of non-essential information is left out, does that change anything?

  • In what other practical ways could the connection between Open Source, Open Data, and Open Access be explored?

These and many related questions are waiting just around the corner. As Open Access becomes more viable, both technically and commercially, look to Open Source and Open Data to provide the synergies that will unlock its true potential.

Molbank and the Convergence of Open Access, Open Data, and Open Source in Chemistry

Posted by Rich Apodaca Thu, 30 Nov 2006 15:01:00 GMT

Molbank, published by Molecular Diversity Preservation International, is one of the oldest of a handful of Open Access journals in chemistry. Although its longevity is a remarkable accomplishment in itself, there is much more to Molbank than meets the eye. Just below the surface is a feature so revolutionary, yet simple, that chemistry publishers years from now will wonder why they didn't implement it sooner.

A Molbank article consists of a short monograph on a single compound, or possibly two. This may strike some scientists as a strange way to publish results, and it is unusual. On the other hand, this system offers vast potential to capture useful, but "unpublishable" findings that would otherwise be lost. Back when scientists actually read hardcopy journals, such a system would never have been feasible. Today, with hard drive space measured in terabytes, fiber-optic cables crisscrossing the planet, Internet connectivity for almost everyone, and servers that can be had for virtually nothing, this system not only looks perfectly feasible, but preferable in many ways to the status quo.

Here's the revolutionary part: each article that Molbank publishes is accompanied by a publicly-available, machine-readable file encoding the structure of the article's subject molecule. That's it. There's nothing tricky or high-tech about it. In fact, the practice is about as low-tech as you could imagine. The file format in which structures are encoded, molfile, dates back at least fifteen years, and nearly every piece of chemistry software - both end-user and developer tools - can handle it. What makes Molbank's practice revolutionary is that not a single chemistry journal, Open Access or subscription-based, currently does this.

Why does the simple inclusion of a publicly-available molfile encoding molecular structures in a paper matter so much? This is where the second two entities of the trinity named in this article's title come into play: Open Source and Open Data. By providing a mechanism for a computer to decipher the chemistry in a paper, Molbank has opened the door to a host of highly-productive integration activities that nobody outside of Chemical Abstracts Service has even been able to contemplate, let alone prepare for.

This article is the first in a series aimed at exploring the wide-open space that Molbank has created. Rather than arguing my point with words, I'll actually build working demonstrations of what is now easily within reach. At the same time, I'll document my work on this blog. I'm not sure where all of this will end up, but I do hope to shine some light on a vital, although currently obscure, component of the Open Access debate.

Eleven Free Cheminformatics Scripting Environments

Posted by Rich Apodaca Tue, 14 Nov 2006 16:13:00 GMT

A recent question on Yahoo's chemoinf forum got me thinking about free cheminformatics scripting environments. If you've ever wanted to learn an object-oriented scripting language such as Ruby, Python, Perl, or Groovy in the context of cheminformatics, there are many good options to choose from. Few experiences expand a programmer's horizons more than learning one of these freedom languages. This is especially true for developers who, like myself, come from a background involving the safety languages C++ and Java.

Below is a complete roundup of Open Source cheminformatics scripting environments, grouped by language. If closed, commercial offerings were included, this list would, of course, be longer. In the interest of full disclosure, I am the author of RCDK and have worked on OBRuby.

  1. Ruby Chemistry Development Kit (RCDK) - IUPAC nomenclature translation, 2-D structure layout, 2-D color rendering. RCDK combines the capabilities of three Open Source Java toolkits with the agility of the Ruby platform, all in an easy-to-install package. Parse IUPAC nomenclature. Create 2-D coordinates for SMILES strings and IUPAC names. Render anti-aliased color 2-D molecular images in SVG, PNG, and JPG format.

  2. Ruby/Open Babel: OBRuby - A recent addition to the growing family of alternative programming interfaces offered by the C++ toolkit Open Babel. Interconvert several molecular languages including SMILES, molfile, CML, PDB, and InChI. Perform sophisticated molecular queries with SMARTS pattern matching.

  3. Chemruby Rubyforge Site - A pure Ruby toolkit with portions written in C to speed performance. Although I successfully installed Chemruby on my system, I can't use it due to a failed dependency on a library called "dbm".

  4. Molruby - Parse SDFiles in Ruby or on the command line. Molruby is clearly a project in its early stages. On the other hand, if you're interested in learning Ruby, Molruby's small size may be suited to getting familiar with key concepts.

  5. PyDaylight - A "Pythonic", "thick" interface to the popular Daylight toolkit. The author, Andrew Dalke, has done a great deal to promote the idea of applying scripting languages to cheminformatics. Unfortunately, Daylight's toolkit isn't yet offered under an Open Source license, making it difficult for me to evaluate the PyDaylight interface.

  6. Python/Open Babel - Access a good chunk of the impressive Open Babel API through Python. I needed to perform a small modification to get OBPython working on my system. After that, this package worked exactly as advertised.

  7. Python/CDK - Use Jython to access the complete CDK API using Python. Jython is a Java implementation of the Python interpreter, and so this use of the CDK lets developers combine their favorite Java and Python software.

  8. FROWNS (Python) - Loosely based on the PyDaylight API by Andrew Dalke. Read and write SMILES and Molfiles. Perform SMARTS queries, work with fingerprints, and enumerate molecular cycles. With optional Graphviz support, render 2-D molecular images.

  9. Perl/Open Babel - Use Open Babel from Perl. I was unsuccessful in building OBPerl on my system; your mileage may vary.

  10. Perlmol - Read and write a number of common formats including SMILES, molfile, SLN, and PDB. Query by molecular and reaction pattern. Installation on my system went smoothly. One of the best-documented projects on this list.

  11. Groovy/CDK - Groovy is a relatively new object-oriented scripting language for Java. I found no Internet references on using Groovy with CDK in English, although it should be simple to do. If you read Japanese, try this link. Stay tuned for more on this interesting combination.
