Testing Automatic Chemical Structure Recognition with OSRA

By Richard L. Apodaca

2008-02-07T00:00:00.000Z

Countless chemical structures exist only as raster images (JPG, GIF, or BMP files), on a printed page, or in a PDF. While perfectly readable to humans, they are very difficult for machines to read. Given the sheer number of these structures produced over the last few decades, the only hope of excavating them from their current data tombs is computer recognition of some kind. This article discusses OSRA, an open source software package designed to do for chemical structures what Optical Character Recognition did for the printed word.

An online version of OSRA was used to read PNG images of chemical structures produced by an application based on ChemWriter. Both aliased and antialiased images were used and atom coloring was disabled:

OSRA

Structure interpretation failed for the antialiased image at both 300 and 72 DPI resolution. This was the SMILES that was produced at 72 DPI; the one produced at the 300 DPI setting was not much more encouraging.

However, submitting the aliased image at 72 DPI produced the correct structure.

Could the failure to recognize the antialiased image be due to a problem with the ChemWriter application's rasterization method? Apparently not. When a screen capture utility was used to produce the image from the ChemWriter application window, the wrong structure was again produced. Here, the PNG was encoded not by a Java program but by the underlying operating system (Linux) through a standard screen capture utility.

To test the idea that line thickness might play a role in determining the quality of OSRA's interpretation, the antialiased image below was submitted:

Thin Lines

Still, the incorrect structure was produced.

Apparently, antialiased images of 2D structures cause difficulties for OSRA.
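Since the outcome seems to depend on whether antialiasing was applied, it may help to check an image before submitting it. The following is a minimal sketch, not part of OSRA or ChemWriter: it assumes the image has already been decoded into rows of grayscale values (0 to 255), and the function name looks_antialiased is hypothetical. The idea is that an aliased black-on-white drawing contains only two gray levels, while antialiasing introduces intermediate ones along the edges of lines.

```python
def looks_antialiased(pixels):
    """Return True if a grayscale image has more than two distinct levels.

    `pixels` is assumed to be a list of rows, each a list of ints in 0..255.
    An aliased line drawing should contain only a background and a
    foreground level; intermediate grays suggest antialiasing.
    """
    levels = {value for row in pixels for value in row}
    return len(levels) > 2


# Tiny hand-made examples: a hard-edged dot and a soft-edged dot.
aliased = [
    [255, 255, 255],
    [255, 0, 255],
    [255, 255, 255],
]
antialiased = [
    [255, 200, 255],
    [200, 0, 200],
    [255, 200, 255],
]

print(looks_antialiased(aliased))
print(looks_antialiased(antialiased))
```

This heuristic only applies to monochrome drawings; images with atom coloring enabled would need a different test.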

Fortunately, the ChemWriter-based application embedded the full connection table of the molecule into all of its images as metadata, so an optical recognition step is unnecessary.
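How such metadata might be read back can be sketched with the standard PNG tEXt chunk. This is an illustration under assumptions: the article does not say which chunk type or keyword the ChemWriter-based application uses, so the keyword "molfile" and the helper names below are hypothetical, and a short SMILES string stands in for a full connection table. The sketch builds a one-pixel PNG carrying a text chunk, then walks the chunks to recover it.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type, data):
    """Assemble one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def build_demo_png(keyword, text):
    """Build a 1x1 grayscale PNG with a single tEXt chunk (demo data only)."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    text_data = keyword.encode("latin-1") + b"\x00" + text.encode("latin-1")
    idat = zlib.compress(b"\x00\xff")  # filter byte + one white pixel
    return (
        PNG_SIGNATURE
        + make_chunk(b"IHDR", ihdr)
        + make_chunk(b"tEXt", text_data)
        + make_chunk(b"IDAT", idat)
        + make_chunk(b"IEND", b"")
    )


def read_text_chunks(png_bytes):
    """Return a dict of keyword -> text for every tEXt chunk in a PNG."""
    if png_bytes[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    texts = {}
    offset = 8
    while offset + 8 <= len(png_bytes):
        length, chunk_type = struct.unpack(">I4s", png_bytes[offset:offset + 8])
        data = png_bytes[offset + 8:offset + 8 + length]
        if chunk_type == b"tEXt":
            keyword, _, text = data.partition(b"\x00")
            texts[keyword.decode("latin-1")] = text.decode("latin-1")
        if chunk_type == b"IEND":
            break
        offset += 12 + length  # 4 length + 4 type + data + 4 CRC
    return texts


# "molfile" is a hypothetical keyword; "CCO" stands in for a connection table.
demo = build_demo_png("molfile", "CCO")
print(read_text_chunks(demo))
```

Reading the metadata this way is exact, which is why it is preferable to optical recognition whenever the image carries it.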

Provided that no antialiasing has been applied to the images, OSRA would seem to be a capable tool for converting rasterized 2D chemical structures into a machine-readable format.