Googling for Molecules with InChIMatic and Firefly

A series of D-F articles have discussed InChIMatic, a Web application that lets you structure-search the Web using popular search engines such as Google. Recent articles have also described Firefly, a lightweight 2D structure editor designed especially for the Web.
Today, the first alpha release of Firefly is available for use with InChIMatic.
Despite its small size of only 103K, the Firefly applet offers a number of advanced features:
A clean interface with major functionality in plain sight.
Antialiased rendering. Pressing the "+" and "-" keys will zoom in and out to reveal rendering detail.
User-overridable bond length and angle constraints. When dragging a bond, use Shift to relax both angle and length constraints, or Ctrl to relax only angle constraints.
Automatic inside-outside double bond rendering.
Built-in molfile import/export. Use the File->Import Molfile and File->Export Molfile options to paste a molfile from, or copy one to, your system clipboard.
Automatic implicit hydrogen detection. The quadrant for hydrogen placement is chosen based on the bonds surrounding the atom.
Twenty levels of undo/redo. The commands can be issued either from the menu or with Ctrl-Z/Ctrl-Y.
Persistent molecule. When you visit another page and come back, Firefly remembers the molecule you were working on.
No digital certificate authorization. Just start using it.
Firefly also incorporates a number of keyboard shortcuts to speed up structure drawing:
1-9 keys: Builds a chain with the indicated number of carbons.
a key: Phenyl (aromatic) ring. When hovering over a bond, fuses the ring to the bond. When hovering over an atom, fuses the ring to that atom, if possible, or sprouts the ring.
f, l, r, i keys: The elements F, Cl, Br, and I, respectively.
z and t keys: The elements Si and Sn, respectively.
b, c, n, o, s, and p keys: The elements B, C, N, O, S, and P, respectively.
[delete] and [backspace] keys: Deletes whatever is underneath the cursor.
To use these shortcuts, simply hover the cursor over an atom and press the key on your keyboard.
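For readers who think in code, the shortcut list above amounts to a small lookup table. The sketch below is only an illustration in Python, not Firefly's own Java code; the function name `action_for_key` and the action tuples it returns are hypothetical.

```python
# Element shortcuts exactly as listed above.
ELEMENT_KEYS = {
    "f": "F", "l": "Cl", "r": "Br", "i": "I",
    "z": "Si", "t": "Sn",
    "b": "B", "c": "C", "n": "N", "o": "O", "s": "S", "p": "P",
}


def action_for_key(key: str):
    """Map a key name to a hypothetical (action, argument) tuple, or None."""
    if len(key) == 1 and key in "123456789":
        # Digit keys build a carbon chain of that length.
        return ("chain", int(key))
    if key == "a":
        # The "a" key adds a phenyl ring.
        return ("ring", "phenyl")
    if key in ELEMENT_KEYS:
        # Letter keys set the element under the cursor.
        return ("element", ELEMENT_KEYS[key])
    if key in ("delete", "backspace"):
        # Both keys delete whatever is under the cursor.
        return ("delete", None)
    return None
```

A real editor would also need to know what is under the cursor (atom, bond, or nothing) before applying the action; that part is left out here.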
Being an alpha release, this version of Firefly still has room for improvement. Your feedback is important. Please send questions, comments, and suggestions to the email address found under Firefly's "Help" menu.
Never Draw the Same Molecule Twice: Viewing Image Metadata
Chemists are accustomed to embedding live molecular objects in their documents with Microsoft Word/ChemDraw. These objects can then be reprocessed and embedded into other documents, such as PowerPoint presentations, saving enormous amounts of time. What if the same feature were available with Web documents?
A recent D-F article proposed a method to encode molecular structure data within commonly used Web image formats such as PNG. That article contained an embedded image of GlaxoSmithKline's diabetes treatment rosiglitazone (Avandia) encoded by a rendering toolkit built with Firefly. I claimed that this image contained the complete connection table and atom coordinates as embedded metadata. In this article, I'll show a simple method to read this metadata.
Metadata is a standard part of the PNG specification; reading it requires nothing more than software capable of recognizing it. I recently found a Web-based, cross-platform method for doing so. The Image Metadata Viewer by FileFormat.info accepts an uploaded image file and returns that image's metadata. Let's try it with the image of rosiglitazone.
After saving the image to my hard drive, uploading it to FileFormat.info and pressing start, I can see that the image contains metadata:

The metadata can be viewed either as XML or as plain text. Choosing plain text (second option) gives me the complete molfile, stored as a key/value pair (molfile=[molfile]).
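The same lookup can be done locally with a few lines of code. The following is a minimal sketch in Python, assuming the molfile is stored in a standard PNG text chunk (tEXt, or its compressed variant zTXt) under the keyword "molfile". Which chunk type the rosiglitazone image actually uses is an assumption, and the function name `read_png_text` is hypothetical. International text chunks (iTXt) are not handled.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_png_text(data: bytes) -> dict:
    """Return the tEXt and zTXt keyword/text pairs found in PNG bytes."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    entries = {}
    pos = 8
    while pos + 8 <= len(data):
        # Each chunk: 4-byte big-endian length, 4-byte type, data, 4-byte CRC.
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if chunk_type == b"tEXt":
            # Keyword and text are separated by a single NUL byte.
            key, _, value = body.partition(b"\x00")
            entries[key.decode("latin-1")] = value.decode("latin-1")
        elif chunk_type == b"zTXt":
            key, _, rest = body.partition(b"\x00")
            # rest[0] is the compression method (0 = deflate); the rest is zlib data.
            entries[key.decode("latin-1")] = zlib.decompress(rest[1:]).decode("latin-1")
        elif chunk_type == b"IEND":
            break
        pos += 12 + length
    return entries
```

With the image saved locally under a name of your choosing, `read_png_text(open(path, "rb").read()).get("molfile")` would return the embedded molfile as a string, or `None` if no such keyword is present.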
Clearly, reading metadata is not a problem given the right software. But this leaves the question of how metadata is encoded in the first place - especially in a programming language such as Java. Like everything else, it's not difficult when you know how. Stay tuned for the answer.
Never Draw the Same Molecule Twice: Image Metadata for Cheminformatics
The graphical language of 2D structures has served chemistry well for the last 100 years. Ironically, this language, which is so useful for human communication, is extraordinarily difficult for machines to understand. Heroic efforts at digital raster image recognition such as OSRA and those recently summarized by Egon Willighagen, in addition to a handful of others, have tried to tackle this problem with varying degrees of success.
The problem remains unsolved, and continues to be one of the most difficult technical challenges in cheminformatics. But the pace at which non-machine-readable images are generated has accelerated dramatically in the last two years with the emergence of numerous free chemical databases.
What if 2D structure images simply contained all of the information needed for machine processing in the first place?
This idea isn't as far-fetched as it may sound initially. As discussed in a recent D-F article, both GChemPaint and ACD ChemSketch have been claimed to be capable of encoding machine-readable structure information.
Previous D-F articles have described "Firefly", the codename for a new lightweight 2D structure editor designed specifically for the Web. With major work on the editor's user interface complete, more recent efforts have focused on implementing a 2D rendering toolkit, and with it a mechanism to encode structural information within 2D molecular images.
As a demonstration of what is now possible, consider the structure of GlaxoSmithKline's diabetes treatment rosiglitazone (Avandia), depicted as a PNG image at the beginning of this article. At first glance, the image appears to be just like any other image of a 2D molecular structure. But it is not, for embedded within it are the connection table and 2D atom coordinates of rosiglitazone encoded as an industry-standard molfile.
Given the right software, a computer can interpret the structural information encoded in the rosiglitazone image and precisely re-create the original molecular representation. A graphical diagnostic tool bundled with Firefly was equipped with code for precisely this purpose.
This tool can work with molfile-encoded PNG images just as easily as it can with molfiles; they can be opened and the resulting molecule can be further edited, saved in another format, or re-written as an embedded-molfile PNG image.
The first step is to select the PNG image from a local hard drive:

Opening this image produces a fully-editable version of the original molecule:

Obviously, nothing limits this technique to molfiles. InChI, SMILES, CML, or any other molecular encoding scheme would work just as well.
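To make the idea concrete, here is a rough sketch in Python of one way a text payload could be added to an existing PNG file. It is not Firefly's Java implementation, which is not shown in this article. It assumes the standard PNG tEXt chunk is used, and the function names `make_chunk` and `embed_text` are hypothetical.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type: bytes, body: bytes) -> bytes:
    """Assemble one PNG chunk: length, type, data, CRC-32 over type + data."""
    crc = zlib.crc32(chunk_type + body) & 0xFFFFFFFF
    return struct.pack(">I", len(body)) + chunk_type + body + struct.pack(">I", crc)


def embed_text(png: bytes, keyword: str, text: str) -> bytes:
    """Return a copy of the PNG with a tEXt chunk inserted before IEND."""
    if png[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    # tEXt payload: Latin-1 keyword, a NUL separator, then Latin-1 text.
    body = keyword.encode("latin-1") + b"\x00" + text.encode("latin-1")
    # IEND carries no data, so a well-formed file ends with exactly 12 bytes of it.
    iend = len(png) - 12
    if png[iend + 4:iend + 8] != b"IEND":
        raise ValueError("PNG does not end with an IEND chunk")
    return png[:iend] + make_chunk(b"tEXt", body) + png[iend:]
```

Calling `embed_text(png_bytes, "molfile", molfile_text)` would attach a molfile; swapping the keyword for something like "smiles" or "inchi" (names chosen here only for illustration) would attach a different encoding. The pixel data is untouched, so the image still displays normally in any viewer.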
Using molecular-encoded PNG images as a Web-ready replacement for the Word/ChemDraw OLE technology may be one application of this approach. With a large corpus of these images, chemical Web spidering and data mining would be possible on a scale unimaginable today. As always, these possibilities reinforce the desperate need for high quality tools that chemists actually want to use, and which simultaneously yield machine-readable output.
Editable and Searchable 2D Molecular Images
Word processing replaced the typewriter for the simple reason that documents could be prepared and edited so much more quickly. If Web authoring replaces conventional word processors, it will be for the simple reason that Web documents can be found, distributed, reprocessed, and combined with other content so much more effectively. The peculiar nature of chemical structure information complicates chemistry's transition to Web authoring. This article, the first in a series, discusses some of the challenges that lie ahead.
State of the Art: Word/ChemDraw
Microsoft Word allows 2D molecular graphics, typically created with ChemDraw, to be embedded in documents and later edited. Those images can then be copied into PowerPoint presentations and reused in a variety of other Windows-specific products. This practice has become so widespread throughout industry and academia that few chemists even think about the technology that many of them rely on several times a week.
Chemical Structures are Peculiar
A 2D molecular image, like the one depicting fluoxetine at the top of this article, is a peculiar beast. On one level, it's a picture that anybody can look at. But on another level, it's a type of object for which manipulation by humans and computers is extremely useful. The combination of Microsoft Word and ChemDraw lets chemists conveniently manage the dual nature of chemical structures.
Live Molecular Images
Why would anybody want to create editable and searchable 2D molecular graphics such as JPGs, PNGs, and SVGs? Alas, technology has a way of moving on just when we're getting comfortable with it (an especially difficult concept for typewriter manufacturers who went bust during the 1980s, and the dedicated word processor manufacturers who followed).
Consider the number of Word and PowerPoint documents you read last week compared to the number of Web pages. Chances are the ratio is at least 1:10. The trend shows no signs of reversing itself.
Although Web authoring tools have been very slow to reach the average user, the blogging explosion has led to rapid evolution in the field. As tools like WordPress, Movable Type, and even Wikipedia race to satisfy the needs of power authors, the average user will rather unexpectedly discover that they have access to perfectly capable tools that let them abandon their over-engineered (and expensive) word processors to experiment with Web publishing.
The Wikipedia Chemistry/Structure Drawing Workgroup hints at what lies ahead for chemistry. Two tools, GChemPaint and ACD ChemSketch, now enable molecular structure information to be embedded in images.
As chemistry turns to the Web as its primary publication medium, chemists will need the same ability to deal with chemical structures offered by their current tools of choice. In articles to follow, I'll discuss some ways this could happen.
Top Ten Best-Selling Drugs Worldwide (2006)
If you haven't had a chance to do so yet, IMS Health's recent Intelligence.360 on the global pharmaceutical industry is worth reading. One noteworthy set of data contained in the report is a list of the top ten best-selling drugs worldwide for 2006.
A list of chemical names and numbers by itself is not that useful. However, adding chemical structures has a way of prompting better questions and generating many more ideas. In that spirit, I've created an online table of the ten best-selling drugs worldwide for 2006. This table contains the 2D chemical structure, generic name, trade name, global sales in US$, company, and indication for each drug.
A new software package codenamed "Firefly" was used to generate the chemical structures in the table. Firefly is a lightweight 2D editor and rendering library written in Java. A series of articles on Firefly can be found on Depth-First.
This link takes you to the table.

