Never Draw the Same Molecule Twice: Viewing Image Metadata

Posted by Rich Apodaca Wed, 08 Aug 2007 07:40:00 GMT

Chemists are accustomed to embedding live molecular objects in their documents with Microsoft Word/ChemDraw. These objects can then be reprocessed and embedded into other documents, such as PowerPoint presentations, saving enormous amounts of time. What if the same feature were available with Web documents?

A recent D-F article proposed a method to encode molecular structure data within commonly-used Web image formats such as PNG. That article contained an embedded image of GlaxoSmithKline's diabetes treatment rosiglitazone (Avandia) encoded by a rendering toolkit built with Firefly. I claimed that this image contained the complete connection table and atom coordinates as embedded metadata. In this article, I'll show a simple method to read this metadata.

Metadata is a standard part of the PNG specification; reading it requires nothing more than software capable of recognizing it. I recently found a Web-based, cross-platform method for doing so. The Image Metadata Viewer by FileFormat.info accepts an uploaded image file and returns that image's metadata. Let's try it with the image of rosiglitazone.

After saving the image to my hard drive, uploading it to FileFormat.info and pressing start, I can see that the image contains metadata:

The metadata can be viewed either as XML or as plain text. Choosing plain text (second option) gives me the complete molfile, stored as a key/value hash (molfile=[molfile]).
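For readers who would rather not upload anything, the same key/value pairs can be pulled out locally. What follows is only a minimal sketch in Python, and it rests on an assumption the viewer output doesn't confirm: that the molfile sits in an uncompressed PNG tEXt chunk. Compressed (zTXt) and international (iTXt) text chunks are not handled, and the function name read_text_chunks is mine.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_text_chunks(png_bytes):
    """Return a dict of keyword -> text for every tEXt chunk in a PNG.

    Assumption: the metadata is stored in uncompressed tEXt chunks.
    If a keyword occurs more than once, the last value wins.
    """
    if png_bytes[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")

    entries = {}
    pos = 8  # first chunk starts right after the 8-byte signature
    while pos + 8 <= len(png_bytes):
        # Each chunk: 4-byte big-endian length, 4-byte type, data, 4-byte CRC.
        length, chunk_type = struct.unpack(">I4s", png_bytes[pos:pos + 8])
        data = png_bytes[pos + 8:pos + 8 + length]
        if chunk_type == b"tEXt":
            # tEXt payload: Latin-1 keyword, a null byte, then Latin-1 text.
            keyword, _, text = data.partition(b"\x00")
            entries[keyword.decode("latin-1")] = text.decode("latin-1")
        elif chunk_type == b"IEND":
            break
        pos += 12 + length  # header (8) + data + CRC (4)
    return entries
```

Given the bytes of the saved image, read_text_chunks(data).get("molfile") would return the molfile text if the assumption holds, and None otherwise.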

Clearly, reading metadata is not a problem given the right software. But this leaves the question of how metadata is encoded in the first place - especially in a programming language such as Java. Like everything else, it's not difficult when you know how. Stay tuned for the answer.

Never Draw the Same Molecule Twice: Image Metadata for Cheminformatics

Posted by Rich Apodaca Wed, 01 Aug 2007 06:17:00 GMT

The graphical language of 2D structures has served chemistry well for the last 100 years. Ironically, this language, which is so useful for human communication, is extraordinarily difficult for machines to understand. Heroic efforts at digital raster image recognition, such as OSRA and those recently summarized by Egon Willighagen, in addition to a handful of others, have tried to tackle this problem with varying degrees of success.

The problem remains unsolved, and continues to be one of the most difficult technical challenges in cheminformatics. But the pace at which non-machine-readable images are generated has accelerated dramatically in the last two years with the emergence of numerous free chemical databases.

What if 2D structure images simply contained all of the information needed for machine processing in the first place?

This idea isn't as far-fetched as it may sound initially. As discussed in a recent D-F article, both GChemPaint and ACD ChemSketch have been claimed to be capable of encoding machine-readable structure information.

Previous D-F articles have described "Firefly", the codename for a new lightweight 2D structure editor designed specifically for the Web. With major work on the editor's user interface complete, more recent efforts have focused on implementing a 2D rendering toolkit, and with it a mechanism to encode structural information within 2D molecular images.

As a demonstration of what is now possible, consider the structure of GlaxoSmithKline's diabetes treatment rosiglitazone (Avandia), depicted as a PNG image at the beginning of this article. At first glance, the image appears to be just like any other image of a 2D molecular structure. But it is not, for embedded within it are the connection table and 2D atom coordinates of rosiglitazone encoded as an industry-standard molfile.
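To make the idea concrete, here is a rough sketch in Python of how a molfile could be tucked into an existing PNG. It is not Firefly's implementation, which this article doesn't describe. It assumes the text is stored in an uncompressed tEXt chunk placed directly after the image header, and the helper names (make_chunk, embed_text, blank_png) and the placeholder molfile text are mine, for illustration only.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type, data):
    """Build one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def embed_text(png_bytes, keyword, text):
    """Return a copy of the PNG with a tEXt chunk inserted after IHDR.

    Assumption: keyword and text fit in Latin-1, as tEXt requires.
    """
    if png_bytes[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    # IHDR is always the first chunk; find where it ends.
    (ihdr_length,) = struct.unpack(">I", png_bytes[8:12])
    ihdr_end = 8 + 12 + ihdr_length
    payload = keyword.encode("latin-1") + b"\x00" + text.encode("latin-1")
    return png_bytes[:ihdr_end] + make_chunk(b"tEXt", payload) + png_bytes[ihdr_end:]


def blank_png(width=1, height=1):
    """Build a tiny white 8-bit grayscale PNG to stand in for a rendering."""
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)
    # Each scanline is a filter byte (0 = none) followed by the pixels.
    raw = b"".join(b"\x00" + b"\xff" * width for _ in range(height))
    return (PNG_SIGNATURE
            + make_chunk(b"IHDR", ihdr)
            + make_chunk(b"IDAT", zlib.compress(raw))
            + make_chunk(b"IEND", b""))


# Placeholder text only; a real image would carry the full molfile.
PLACEHOLDER_MOLFILE = "placeholder molfile body\nM  END\n"
tagged_png = embed_text(blank_png(), "molfile", PLACEHOLDER_MOLFILE)
```

Because image viewers ignore text chunks they don't care about, the tagged file should display exactly as the original does, with the structure data along for the ride.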

Given the right software, a computer can interpret the structural information encoded in the rosiglitazone image and precisely re-create the original molecular representation. A graphical diagnostic tool bundled with Firefly was equipped with code for precisely this purpose.

This tool can work with molfile-encoded PNG images just as easily as it can with molfiles; they can be opened and the resulting molecule can be further edited, saved in another format, or re-written as an embedded-molfile PNG image.

The first step is to select the PNG image from a local hard drive:

Opening this image produces a fully-editable version of the original molecule:

Obviously, nothing limits this technique to molfiles. InChI, SMILES, CML, or any other molecular encoding scheme would work just as well.

Using molecular-encoded PNG images as a Web-ready replacement for the Word/ChemDraw OLE technology may be one application of this approach. With a large corpus of these images, chemical Web spidering and data mining would be possible on a scale unimaginable today. As always, these possibilities reinforce the desperate need for high-quality tools that chemists actually want to use, and which simultaneously yield machine-readable output.
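As a small taste of what such mining might look like, the sketch below walks a directory of PNG files and collects whatever structure text it finds. It is a Python illustration under stated assumptions, not an existing tool: the structures are assumed to live in uncompressed tEXt chunks under a known keyword ("molfile" by default), and the function names embedded_text and scan_directory are mine. The file is read chunk by chunk, so the pixel data is skipped rather than loaded.

```python
import struct
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def embedded_text(path, keyword="molfile"):
    """Return the text stored under keyword in a tEXt chunk, or None."""
    wanted = keyword.encode("latin-1")
    with open(path, "rb") as handle:
        if handle.read(8) != PNG_SIGNATURE:
            return None  # not a PNG, whatever the file extension says
        while True:
            header = handle.read(8)
            if len(header) < 8:
                return None  # truncated file
            length, chunk_type = struct.unpack(">I4s", header)
            if chunk_type == b"tEXt":
                found, _, text = handle.read(length).partition(b"\x00")
                if found == wanted:
                    return text.decode("latin-1")
                handle.seek(4, 1)  # skip this chunk's CRC
            elif chunk_type == b"IEND":
                return None
            else:
                handle.seek(length + 4, 1)  # skip data and CRC unread


def scan_directory(root, keyword="molfile"):
    """Map each PNG under root (by relative path) to its embedded text."""
    root = Path(root)
    results = {}
    for path in sorted(root.rglob("*.png")):
        text = embedded_text(path, keyword)
        if text is not None:
            results[path.relative_to(root).as_posix()] = text
    return results
```

Pointing scan_directory at a folder of saved images would return only the files that actually carry a structure, which is the raw material a spider or indexer would need.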