Never Draw the Same Molecule Twice: Image Metadata for Cheminformatics

Posted by Rich Apodaca Wed, 01 Aug 2007 10:17:00 GMT

The graphical language of 2D structures has served chemistry well for the last 100 years. Ironically, this language, which is so useful for human communication, is extraordinarily difficult for machines to understand. Heroic efforts at recognizing structures in digital raster images, such as OSRA and those recently summarized by Egon Willighagen, along with a handful of others, have tackled this problem with varying degrees of success.

The problem remains unsolved, and continues to be one of the most difficult technical challenges in cheminformatics. But the pace at which non-machine-readable images are generated has accelerated dramatically in the last two years with the emergence of numerous free chemical databases.

What if 2D structure images simply contained all of the information needed for machine processing in the first place?

This idea isn't as far-fetched as it may sound initially. As discussed in a recent D-F article, both GChemPaint and ACD ChemSketch have been claimed to be capable of encoding machine-readable structure information.

Previous D-F articles have described "Firefly", the codename for a new lightweight 2D structure editor designed specifically for the Web. With major work on the editor's user interface complete, more recent efforts have focused on implementing a 2D rendering toolkit, and with it a mechanism to encode structural information within 2D molecular images.

As a demonstration of what is now possible, consider the structure of GlaxoSmithKline's diabetes treatment rosiglitazone (Avandia), depicted as a PNG image at the beginning of this article. At first glance, the image appears to be just like any other image of a 2D molecular structure. But it is not, for embedded within it are the connection table and 2D atom coordinates of rosiglitazone encoded as an industry-standard molfile.
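How Firefly performs the embedding is not described here. As a minimal sketch of one way it could be done, the Python below uses only the standard library to insert a PNG `tEXt` chunk carrying a molfile. The keyword `molfile`, the helper names, and the one-atom placeholder molfile (methane, not rosiglitazone) are all assumptions made for illustration.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# Placeholder molfile (a single carbon atom), NOT the rosiglitazone record.
MOLFILE = "\n".join([
    "methane (placeholder)",
    "  sketch",
    "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
]) + "\n"


def make_chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Serialize one PNG chunk: length, type, data, CRC-32 over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def tiny_png() -> bytes:
    """Build a 1x1 grayscale PNG so the example needs no external file."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b"\x00\x00")  # filter byte 0 + one black pixel
    return (PNG_SIGNATURE + make_chunk(b"IHDR", ihdr)
            + make_chunk(b"IDAT", idat) + make_chunk(b"IEND", b""))


def embed_text(png: bytes, keyword: str, text: str) -> bytes:
    """Return a copy of `png` with a tEXt chunk inserted right after IHDR."""
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    # IHDR is always first: 4 (length) + 4 (type) + 13 (data) + 4 (CRC) bytes.
    ihdr_end = len(PNG_SIGNATURE) + 25
    payload = keyword.encode("latin-1") + b"\x00" + text.encode("latin-1")
    return png[:ihdr_end] + make_chunk(b"tEXt", payload) + png[ihdr_end:]


if __name__ == "__main__":
    annotated = embed_text(tiny_png(), "molfile", MOLFILE)
    with open("molecule.png", "wb") as handle:
        handle.write(annotated)
```

Because decoders skip ancillary chunks they do not use, an image annotated this way should still display as an ordinary picture in viewers that know nothing about chemistry.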

Given the right software, a computer can interpret the structural information encoded in the rosiglitazone image and precisely re-create the original molecular representation. A graphical diagnostic tool bundled with Firefly was equipped with code for precisely this purpose.
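The reading side can be sketched in a similar spirit. The following Python walks the chunks of a PNG file, checks each CRC, and returns the text stored under a given keyword. It assumes the structure was stored in an uncompressed `tEXt` chunk under the hypothetical keyword `molfile`; the function names are illustrative and are not taken from Firefly.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def iter_chunks(png: bytes):
    """Yield (type, data) for each chunk, verifying the CRC along the way."""
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        chunk_type = png[pos + 4:pos + 8]
        data = png[pos + 8:pos + 8 + length]
        (crc,) = struct.unpack(">I", png[pos + 8 + length:pos + 12 + length])
        if zlib.crc32(chunk_type + data) & 0xFFFFFFFF != crc:
            raise ValueError("corrupt chunk: " + chunk_type.decode("latin-1"))
        yield chunk_type, data
        pos += 12 + length


def read_text(png: bytes, keyword: str):
    """Return the text of the first tEXt chunk with this keyword, or None."""
    for chunk_type, data in iter_chunks(png):
        if chunk_type == b"tEXt":
            key, _, value = data.partition(b"\x00")
            if key.decode("latin-1") == keyword:
                return value.decode("latin-1")
    return None


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as handle:
        molfile = read_text(handle.read(), "molfile")
    print(molfile if molfile is not None else "no embedded molfile found")
```

The recovered string can then be handed to any toolkit that parses molfiles, which is what makes the round trip lossless where image recognition is not.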

This tool can work with molfile-encoded PNG images just as easily as it can with molfiles: they can be opened, and the resulting molecule can be further edited, saved in another format, or re-written as an embedded-molfile PNG image.

The first step is to select the PNG image from a local hard drive:

Opening this image produces a fully-editable version of the original molecule:

Obviously, nothing limits this technique to molfiles. InChI, SMILES, CML, or any other molecular encoding scheme would work just as well.
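To illustrate, here is a hedged sketch in Python of how several encodings might sit side by side in one image, each under its own keyword. PNG offers both plain-text (`tEXt`) and zlib-compressed (`zTXt`) text chunks, and the helper below can build either. The keywords `smiles` and `inchi` are assumptions rather than an established convention, and ethanol stands in for a real structure.

```python
import struct
import zlib


def make_chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Serialize one PNG chunk: length, type, data, CRC-32 over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def text_chunk(keyword: str, text: str, compress: bool = False) -> bytes:
    """Build a tEXt chunk, or a zlib-compressed zTXt chunk when compress=True."""
    key = keyword.encode("latin-1")
    body = text.encode("latin-1")
    if compress:
        # zTXt layout: keyword, NUL, compression method 0 (deflate), zlib stream.
        return make_chunk(b"zTXt", key + b"\x00\x00" + zlib.compress(body))
    return make_chunk(b"tEXt", key + b"\x00" + body)


def decode_text_chunk(chunk_type: bytes, data: bytes):
    """Return (keyword, text) for the data field of a tEXt or zTXt chunk."""
    key, _, rest = data.partition(b"\x00")
    if chunk_type == b"zTXt":
        rest = zlib.decompress(rest[1:])  # skip the compression-method byte
    return key.decode("latin-1"), rest.decode("latin-1")


# Ethanol as a stand-in molecule; the keywords are illustrative only.
records = {
    "smiles": "CCO",
    "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
}
# Store the short SMILES as plain text and the longer InChI compressed.
chunks = [text_chunk(k, v, compress=(k == "inchi")) for k, v in records.items()]
```

Each element of `chunks` is a complete chunk that could be spliced into a PNG file anywhere between the `IHDR` and `IEND` chunks. Compression pays off mainly for verbose formats such as CML or large molfiles; for a short SMILES string the plain-text form is likely the smaller of the two.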

Using molecular-encoded PNG images as a Web-ready replacement for the Word/ChemDraw OLE technology may be one application of this approach. With a large corpus of these images, chemical Web spidering and data mining would be possible on a scale unimaginable today. As always, these possibilities reinforce the desperate need for high-quality tools that chemists actually want to use, and which simultaneously yield machine-readable output.

Comments

  1. Joerg Kurt Wegner Wed, 01 Aug 2007 20:06:09 GMT

    Can you also add a copyright notice, source, molecular properties, or even the microformats mentioned by Egon http://chem-bla-ics.blogspot.com/2007/07/rdf-ing-molecular-space.html

    I would suggest using EXIF http://en.wikipedia.org/wiki/Exchangeable_image_file_format

    I think this is important, because people need acknowledgement and this is one way of doing it. Besides, this is a very primitive insurance against copyright fraud.

    Cheers, Joerg

  2. Rich Apodaca Wed, 01 Aug 2007 21:59:08 GMT

    The format is really quite flexible - you can add anything you want in formats ranging from gzip compressed to straight text. This lack of constraints brings its own problems.

    It may be too early to say what the best format for encoding actually is.

  3. Egon Willighagen Thu, 02 Aug 2007 07:44:49 GMT

    Rich, this is brilliant. This actually links two of the projects on which I have summer students work: one on a chemical editor [1], the other working on chemistry for Strigi [2]! Strigi is the desktop search engine, and it would index both the PNG file, and the chemical file inside the PNG annotation. Do you have a test file for Alexandr and the other Strigi people to play with?

    I guess we do not get to see the source code (I would love to do this for JChemPaint or Bioclipse), but I guess we can figure it out... Are you embedding CML or an MDL molfile? Does it also directly embed the InChI?

    [1] http://progz-jchem.blogspot.com/
    [2] http://neksa.blogspot.com/