Never Draw the Same Molecule Twice: Writing PNG Image Metadata with Python

Posted by Rich Apodaca Wed, 29 Aug 2007 21:17:00 GMT

A recent D-F article discussed a method for encoding machine-readable molecular structure information as image metadata. This article generated some interest among developers. For example, Noel O'Boyle posted code for reading PNG image metadata with Python. The popularity of Python in cheminformatics makes this approach especially attractive.

But how would you write PNG image metadata with Python? The obvious approach of setting Image.info and then calling Image.save doesn't appear to work. Given my limited knowledge of Python, the answer had to come from elsewhere.

Fortunately, Nick Galbreath wrote in to offer a solution. Using Python, PIL, and an undocumented class, Nick has developed a small wrapper function that writes metadata for PNG images. In fact, Nick is fast on his way to becoming a PNG metadata expert, if reluctantly so. His blog is worth checking out and contains several useful techniques for image manipulation.
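Nick's wrapper is not reproduced here. As a rough illustration of what any such function has to do at the file level, the sketch below uses only the Python standard library (no PIL) to insert a tEXt chunk into an existing PNG. The function names (make_chunk, add_text_chunk, tiny_png) and the "molfile" key are assumptions chosen for this example, and the one-pixel image exists only so the code can run on its own.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, then CRC-32 over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def add_text_chunk(png_bytes: bytes, key: str, value: str) -> bytes:
    """Return a copy of png_bytes with a tEXt chunk inserted right after IHDR."""
    if not png_bytes.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    if not 1 <= len(key) <= 79:
        raise ValueError("PNG keywords must be 1 to 79 characters long")
    # IHDR is always first: 4-byte length + 4-byte type + 13 data bytes + 4-byte CRC.
    ihdr_end = len(PNG_SIGNATURE) + 4 + 4 + 13 + 4
    # tEXt payload: Latin-1 keyword, a null separator, then Latin-1 text.
    payload = key.encode("latin-1") + b"\x00" + value.encode("latin-1")
    return png_bytes[:ihdr_end] + make_chunk(b"tEXt", payload) + png_bytes[ihdr_end:]


def tiny_png() -> bytes:
    """Build a 1x1 grayscale PNG in memory (demonstration input only)."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b"\x00\x00")  # filter byte + one black pixel
    return (PNG_SIGNATURE + make_chunk(b"IHDR", ihdr)
            + make_chunk(b"IDAT", idat) + make_chunk(b"IEND", b""))


if __name__ == "__main__":
    # Placeholder text stands in for a real molfile.
    tagged = add_text_chunk(tiny_png(), "molfile", "placeholder molfile text")
    with open("tagged.png", "wb") as handle:
        handle.write(tagged)
```

The tEXt chunk type is limited to Latin-1 text; anything outside that range would need an iTXt chunk instead, which this sketch does not handle.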

Never Draw the Same Molecule Twice: Viewing Image Metadata

Posted by Rich Apodaca Wed, 08 Aug 2007 11:40:00 GMT

Chemists are accustomed to embedding live molecular objects in their documents with Microsoft Word/ChemDraw. These objects can then be reprocessed and embedded into other documents, such as PowerPoint presentations, saving enormous amounts of time. What if the same feature were available with Web documents?

A recent D-F article proposed a method to encode molecular structure data within commonly-used Web image formats such as PNG. That article contained an embedded image of GlaxoSmithKline's diabetes treatment rosiglitazone (Avandia) encoded by a rendering toolkit built with Firefly. I claimed that this image contained the complete connection table and atom coordinates as embedded metadata. In this article, I'll show a simple method to read this metadata.

Metadata is a standard part of the PNG specification; to read it requires nothing more than software capable of recognizing it. I recently found a Web-based, cross-platform method for doing so. The Image Metadata Viewer by FileFormat.info accepts an uploaded image file and returns that image's metadata. Let's try it with the image of rosiglitazone.

After saving the image to my hard drive, uploading it to FileFormat.info and pressing start, I can see that the image contains metadata:

The metadata can be viewed either as XML or as plain text. Choosing plain text (the second option) gives me the complete molfile, stored as a key/value pair (molfile=[molfile]).

Clearly, reading metadata is not a problem given the right software. But this leaves the question of how metadata is encoded in the first place - especially in a programming language such as Java. Like everything else, it's not difficult when you know how. Stay tuned for the answer.

Never Draw the Same Molecule Twice: Image Metadata for Cheminformatics

Posted by Rich Apodaca Wed, 01 Aug 2007 10:17:00 GMT

The graphical language of 2D structures has served chemistry well for the last 100 years. Ironically, this language, which is so useful for human communication, is extraordinarily difficult for machines to understand. Heroic efforts at digital raster image recognition, such as OSRA and those recently summarized by Egon Willighagen, in addition to a handful of others, have tried to tackle this problem with varying degrees of success.

The problem remains unsolved, and continues to be one of the most difficult technical challenges in cheminformatics. But the pace at which non-machine-readable images are generated has accelerated dramatically in the last two years with the emergence of numerous free chemical databases.

What if 2D structure images simply contained all of the information needed for machine processing in the first place?

This idea isn't as far-fetched as it may sound initially. As discussed in a recent D-F article, both GChemPaint and ACD ChemSketch are reportedly capable of encoding machine-readable structure information.

Previous D-F articles have described "Firefly", the codename for a new lightweight 2D structure editor designed specifically for the Web. With major work on the editor's user interface complete, more recent efforts have focused on implementing a 2D rendering toolkit, and with it a mechanism to encode structural information within 2D molecular images.

As a demonstration of what is now possible, consider the structure of GlaxoSmithKline's diabetes treatment rosiglitazone (Avandia), depicted as a PNG image at the beginning of this article. At first glance, the image appears to be just like any other image of a 2D molecular structure. But it is not, for embedded within it are the connection table and 2D atom coordinates of rosiglitazone encoded as an industry-standard molfile.

Given the right software, a computer can interpret the structural information encoded in the rosiglitazone image and precisely re-create the original molecular representation. A graphical diagnostic tool bundled with Firefly was equipped with code for precisely this purpose.

This tool can work with molfile-encoded PNG images just as easily as it can with molfiles; they can be opened, and the resulting molecule can be further edited, saved in another format, or re-written as an embedded-molfile PNG image.

The first step is to select the PNG image from a local hard drive:

Opening this image produces a fully-editable version of the original molecule:

Obviously, nothing limits this technique to molfiles. InChI, SMILES, CML, or any other molecular encoding scheme would work just as well.

Using molecular-encoded PNG images as a Web-ready replacement for the Word/ChemDraw OLE technology may be one application of this approach. With a large corpus of these images, chemical Web spidering and data mining would be possible on a scale unimaginable today. As always, these possibilities reinforce the desperate need for high-quality tools that chemists actually want to use, and which simultaneously yield machine-readable output.

The Chemically-Aware Web: Are We There Yet?

Posted by Rich Apodaca Wed, 13 Sep 2006 17:25:00 GMT

Recently, I wrote a tutorial on embedding 2-D molecular renderings into webpages as Scalable Vector Graphics (SVG). This tutorial also contained a small experiment on the current chemical informatics capabilities of the Web.

Here is a scenario from the near future: Joe is writing a review on Cephalosporin C that he wants to publish the modern way - directly to the Web. An entirely new concept in scientific publishing has started to take hold. Rather than submitting scientific articles to publishers, who then make hamburger out of them and strip authors of their rights to reproduce their own work, a new system in which journals simply aggregate content already on the Web is gaining momentum. Some journals specialize in only including the very best scientific Web content available, and so enjoy a prestige factor. It's still a peer review system, but with inversion of control. The trick for scientists is getting their work indexed, and so noticed, in the first place.

Joe just downloaded a new 2-D structure editor, FooChemPaint, that he heard can make the structure drawings in his review structure-searchable. Every chemist he knows is talking about a new free search engine called Haystac (Haystac Ain't Chmoogle) that lets them substructure-search the web. For some reason, you need to create your structures using FooChemPaint if you want your own documents to be included in the search results.

After Joe finishes drawing Cephalosporin C with FooChemPaint, he chooses the File->Save As... menu item. Instead of saving as a JPG or PNG like he's done with other software, he saves the image as SVG. He then embeds the SVG into his review using a procedure similar to the one I outlined previously.

From Joe's perspective, he hasn't done anything very new. But unknown to Joe, FooChemPaint has automatically inserted the InChI identifier of Cephalosporin C as metadata into his SVG document. This enables ordinary search engines such as Google to associate the InChI with his SVG. The best part is that the entire process is essentially invisible to Joe.

Haystac is a web application that presents users with an online structure editor for preparing molecular queries. When a structure query is submitted, Haystac searches its molecular database for matches. This database, in turn, was built by a web spider specifically designed to look for InChI identifiers, maybe with the help of Google's Web API. One of Haystac's records for the structure of Cephalosporin C points to Joe's review article.

Science fiction? Maybe. This is where the experiment comes in. Before I submitted the article on SVG, I manually annotated the SVG of Alprazolam with the corresponding InChI. The XML source can be viewed in Firefox by right-clicking on the SVG image and choosing This Frame->View Frame Source, or alternatively here. Below is a fragment of the XML:

<svg ...>
  <rdf:RDF
    xmlns:rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs = "http://www.w3.org/2000/01/rdf-schema#"
    xmlns:dc = "http://purl.org/dc/elements/1.1/" >
    <rdf:Description about="http://depth-first.com"
      dc:title="InChI=1/C17H13ClN4/c1-11-20-21-16-10-19-17(12-5-3-2-4-6-12)14-9-13(18)7-8-15(14)22(11)16/h2-9H,10H2,1H3"
      dc:format="image/svg+xml"
      dc:language="en" >
      <dc:creator>
        <rdf:Bag>
          <rdf:li>Richard L. Apodaca</rdf:li>
        </rdf:Bag>
      </dc:creator>
    </rdf:Description>
  </rdf:RDF>

  <!-- etc. -->
</svg>
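For a sense of what a Haystac-style spider would have to do once it has fetched such a document, here is a minimal Python sketch that pulls InChI strings out of the dc:title attribute of any rdf:Description element. The function name find_inchis and the trimmed sample document are assumptions for illustration; the sample mirrors the fragment above, with an SVG namespace declaration added so that it parses as standalone XML.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"


def find_inchis(svg_text: str) -> list:
    """Return every dc:title value on an rdf:Description that looks like an InChI."""
    root = ET.fromstring(svg_text)
    found = []
    for description in root.iter("{%s}Description" % RDF_NS):
        title = description.get("{%s}title" % DC_NS, "")
        if title.startswith("InChI="):
            found.append(title)
    return found


# Trimmed stand-in for the annotated Alprazolam SVG shown above.
SAMPLE_SVG = """<svg xmlns="http://www.w3.org/2000/svg">
  <rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
    <rdf:Description about="http://depth-first.com"
      dc:title="InChI=1/C17H13ClN4/c1-11-20-21-16-10-19-17(12-5-3-2-4-6-12)14-9-13(18)7-8-15(14)22(11)16/h2-9H,10H2,1H3"
      dc:format="image/svg+xml"
      dc:language="en" />
  </rdf:RDF>
</svg>"""

if __name__ == "__main__":
    for inchi in find_inchis(SAMPLE_SVG):
        print(inchi)
```

A real spider would also need to fetch pages, locate embedded SVG documents, and cope with malformed XML, none of which is attempted here.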

Today I searched for the title of my article in Google and found it. I then searched for the InChI in the SVG metadata and did not find it. Currently, a search for this InChI shows only one hit, from the DrugBank database.

The experiment failed in its stated goal of getting the InChI of Alprazolam indexed by Google via the metadata in its SVG rendering. Was it the formatting of my RDF tags? Is metadata just indexed more slowly than other content? Does Google just ignore metadata to avoid keyword stuffing by Search Engine Optimization tricksters? Are embedded SVG documents ignored by Google altogether? Whatever the reason, the technical barriers to a system like this working today are very low and dropping rapidly.