Testing Automatic Chemical Structure Recognition with OSRA

Posted by Rich Apodaca Thu, 07 Feb 2008 15:45:00 GMT

Countless chemical structures exist only as raster images (JPG, GIF, BMP), on a printed page, or in a PDF. While perfectly readable to humans, they are very difficult for machines to read. Given the sheer number of these structures that have been produced over the last few decades, the only hope of excavating them from their current data tombs is computer recognition of some kind. This article discusses OSRA, an open source software package designed to do for chemical structures what Optical Character Recognition did for the printed word.

An online version of OSRA was used to read PNG images of chemical structures produced by an application based on ChemWriter. Both aliased and antialiased images were used and atom coloring was disabled:

Structure interpretation failed for the antialiased image at both 300 and 72 DPI resolution. This was the SMILES that was produced at 72 DPI; the one produced at the 300 DPI setting was not much more encouraging.

However, using the aliased image at 72 DPI produced the correct structure.

Could the failure to recognize the antialiased image be due to a problem with the ChemWriter application's rasterization method? Apparently not. When a screen capture utility was used to produce the image from the ChemWriter application window, the wrong structure was again produced. Here, the PNG was encoded not by a Java program but by the underlying operating system (Linux), using a standard screen capture utility.

To test the idea that line thickness might play a role in determining the quality of OSRA's interpretation, the antialiased image below was submitted:

Still, the incorrect structure was produced.

Apparently, images of 2D structures in which antialiasing has been applied cause difficulties for OSRA.

Fortunately, the ChemWriter-based application embedded the full connection table of the molecule into all of its images as metadata, so an optical recognition step is unnecessary.

Provided that no antialiasing has been applied to images, OSRA would seem to be a capable tool for converting rasterized 2D chemical structures into machine-readable format.

Image Credit: jspad

Never Draw the Same Molecule Twice: Viewing Image Metadata

Posted by Rich Apodaca Wed, 08 Aug 2007 11:40:00 GMT

Chemists are accustomed to embedding live molecular objects in their documents with Microsoft Word/ChemDraw. These objects can then be reprocessed and embedded into other documents, such as PowerPoint presentations, saving enormous amounts of time. What if the same feature were available with Web documents?

A recent D-F article proposed a method to encode molecular structure data within commonly-used Web image formats such as PNG. That article contained an embedded image of GlaxoSmithKline's diabetes treatment rosiglitazone (Avandia) encoded by a rendering toolkit built with Firefly. I claimed that this image contained the complete connection table and atom coordinates as embedded metadata. In this article, I'll show a simple method to read this metadata.

Metadata is a standard part of the PNG specification; reading it requires nothing more than software capable of recognizing it. I recently found a Web-based, cross-platform method for doing so. The Image Metadata Viewer by FileFormat.info accepts an uploaded image file and returns that image's metadata. Let's try it with the image of rosiglitazone.

After saving the image to my hard drive, uploading it to FileFormat.info and pressing start, I can see that the image contains metadata:

The metadata can be viewed either as XML or as plain text. Choosing plain text (second option) gives me the complete molfile, stored as a key/value hash (molfile=[molfile]).
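The same lookup can also be sketched locally. The Python below is a minimal illustration, not the method used by FileFormat.info: it assumes the key/value pair is stored in a standard PNG text chunk (tEXt or zTXt), which may or may not be how the rosiglitazone image was written, and the function name read_png_text is invented for this example.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_png_text(data: bytes) -> dict:
    """Return the key/value pairs found in tEXt and zTXt chunks of PNG bytes."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    entries = {}
    pos = 8
    while pos + 8 <= len(data):
        # Each chunk: 4-byte big-endian length, 4-byte type, body, 4-byte CRC.
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        pos += 12 + length
        if ctype == b"tEXt":
            # keyword, NUL separator, then uncompressed Latin-1 text
            key, _, value = body.partition(b"\x00")
            entries[key.decode("latin-1")] = value.decode("latin-1")
        elif ctype == b"zTXt":
            # keyword, NUL, one compression-method byte, then zlib data
            key, _, rest = body.partition(b"\x00")
            entries[key.decode("latin-1")] = zlib.decompress(rest[1:]).decode("latin-1")
        elif ctype == b"IEND":
            break
    return entries
```

Given the bytes of a saved image (for example, read from a hypothetical local file named rosiglitazone.png), read_png_text(data).get("molfile") would return the embedded molfile if the assumption about the chunk type and key holds. The sketch does not verify chunk CRCs and ignores iTXt chunks.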

Clearly, reading metadata is not a problem given the right software. But this leaves the question of how metadata is encoded in the first place - especially in a programming language such as Java. Like everything else, it's not difficult when you know how. Stay tuned for the answer.

Never Draw the Same Molecule Twice: Image Metadata for Cheminformatics

Posted by Rich Apodaca Wed, 01 Aug 2007 10:17:00 GMT

The graphical language of 2D structures has served chemistry well for the last 100 years. Ironically, this language, which is so useful for human communication, is extraordinarily difficult for machines to understand. Heroic efforts at digital raster image recognition such as OSRA and those recently summarized by Egon Willighagen, in addition to a handful of others, have tried to tackle this problem with varying degrees of success.

The problem remains unsolved, and continues to be one of the most difficult technical challenges in cheminformatics. But the pace at which non-machine-readable images are generated has accelerated dramatically in the last two years with the emergence of numerous free chemical databases.

What if 2D structure images simply contained all of the information needed for machine processing in the first place?

This idea isn't as far-fetched as it may sound initially. As discussed in a recent D-F article, both GChemPaint and ACD ChemSketch have been claimed to be capable of encoding machine-readable structure information.

Previous D-F articles have described "Firefly", the codename for a new lightweight 2D structure editor designed specifically for the Web. With major work on the editor's user interface complete, more recent efforts have focused on implementing a 2D rendering toolkit, and with it a mechanism to encode structural information within 2D molecular images.

As a demonstration of what is now possible, consider the structure of GlaxoSmithKline's diabetes treatment rosiglitazone (Avandia), depicted as a PNG image at the beginning of this article. At first glance, the image appears to be just like any other image of a 2D molecular structure. But it is not, for embedded within it are the connection table and 2D atom coordinates of rosiglitazone encoded as an industry-standard molfile.

Given the right software, a computer can interpret the structural information encoded in the rosiglitazone image and precisely re-create the original molecular representation. A graphical diagnostic tool bundled with Firefly was equipped with code for precisely this purpose.

This tool can work with molfile-encoded PNG images just as easily as it can with molfiles; they can be opened and the resulting molecule can be further edited, saved in another format, or re-written as an embedded-molfile PNG image.

The first step is to select the PNG image from a local hard drive:

Opening this image produces a fully-editable version of the original molecule:

Obviously, nothing limits this technique to molfiles. InChI, SMILES, CML, or any other molecular encoding scheme would work just as well.

Using molecular-encoded PNG images as a Web-ready replacement for the Word/ChemDraw OLE technology may be one application of this approach. With a large corpus of these images, chemical Web spidering and data mining would be possible on a scale unimaginable today. As always, these possibilities reinforce the desperate need for high quality tools that chemists actually want to use, and which simultaneously yield machine-readable output.

Editable and Searchable 2D Molecular Images

Posted by Rich Apodaca Mon, 30 Jul 2007 12:01:00 GMT

Word processing replaced the typewriter for the simple reason that documents could be prepared and edited so much more quickly. If Web authoring replaces conventional word processors, it will be for the simple reason that Web documents can be found, distributed, reprocessed, and combined with other content so much more effectively. The peculiar nature of chemical structure information complicates chemistry's transition to Web authoring. This article, the first in a series, discusses some of the challenges that lie ahead.

State of the Art: Word/ChemDraw

Microsoft Word allows 2D molecular graphics, typically created with ChemDraw, to be embedded in documents and later edited. Those images can then be copied into PowerPoint presentations and reused in a variety of other Windows-specific products. This practice has become so widespread throughout industry and academia that few chemists even think about the technology that many of them rely on several times a week.

Chemical Structures are Peculiar

A 2D molecular image, like the one depicting fluoxetine at the top of this article, is a peculiar beast. On one level, it's a picture that anybody can look at. But on another level, it's a type of object for which manipulation by humans and computers is extremely useful. The combination of Microsoft Word and ChemDraw lets chemists conveniently manage the dual nature of chemical structures.

Live Molecular Images

Why would anybody want to create editable and searchable 2D molecular graphics such as JPGs, PNGs, and SVGs? Alas, technology has a way of moving on just when we're getting comfortable with it (an especially difficult concept for typewriter manufacturers who went bust during the 1980s, and the dedicated word processor manufacturers who followed).

Consider the number of Word and PowerPoint documents you read last week compared to the number of Web pages. Chances are the ratio is at least 1:10. The trend shows no signs of reversing itself.

Although Web authoring tools have been very slow to reach the average user, the blogging explosion has led to rapid evolution in the field. As tools like WordPress, Movable Type, and even Wikipedia race to satisfy the needs of power authors, the average user will rather unexpectedly discover that they have access to perfectly capable tools that let them abandon their over-engineered (and expensive) word processors to experiment with Web publishing.

The Wikipedia Chemistry/Structure Drawing Workgroup hints at what lies ahead for chemistry. Two tools, GChemPaint and ACD ChemSketch, now enable molecular structure information to be embedded in images.

As chemistry turns to the Web as its primary publication medium, chemists will need the same ability to deal with chemical structures offered by their current tools of choice. In articles to follow, I'll discuss some ways this could happen.