Ferrocene and Beyond: A Solution to the Molecular Representation Problem

December 19, 2006

The representation of molecular structure decisively determines the scope of a chemical computer program. Our goal is to provide a versatile computer-oriented molecular structure representation for chemical information storage and retrieval as well as for computer-assisted synthesis design. Structural formulas describe molecular structure on the proper level of abstraction for these applications. ... It is therefore desirable that the computer-oriented representation of molecular structure be as expressive as the structural formulas.

Andreas Dietz, J. Chem. Inf. Comput. Sci. 1995, 35, 787-802

A recent Depth-First article highlighted the difficulty that existing molecular languages have in communicating the generalized, multi-atom bonding present in metallocenes such as ferrocene. For software and Web services that do not interact with the outside world, the Ferrocene Problem may not be a big deal. But for the growing number that do, the Ferrocene Problem is but the tip of a very large iceberg.

Today's Weird-Looking Molecule is Tomorrow's Molecule of the Month

Consider the problem of axial chirality, such as that present in certain biaryls. None of the molecular languages currently in widespread use (InChI, SMILES, Molfile, or CML) provides a mechanism to faithfully represent and communicate this structural motif. In the 1980s, axial chirality was a novelty. Today it is ubiquitous. Consider this graphical abstract from the current issue of Organic Letters:

If you were asked to create an application capable of distinguishing substituted (R) and (S) binol enantiomers, could you do it? If your system needed to reliably interact with the outside world, could it do so? If you're working with any of the cheminformatics tools currently in widespread use, chances are good that the answers to these questions would be "no".

Do you still think of metallocenes as curiosities studied by a handful of organometallic chemists? Consider this J. Org. Chem. ASAP contents article describing one of the most fundamental transformations in organic chemistry:

The problem only gets worse as concepts like axial and planar chirality are increasingly co-mingled with multi-atom bonding. For example, consider the following graphical abstract, taken from J. Org. Chem. ASAP contents:

These molecules, and many others like them, were used in the context of organic chemistry. Moreover, the papers describing their use were published in widely respected journals specializing in organic chemistry. Yet dozens of popular cheminformatics tools specifically designed for use with organic chemistry are incapable of faithfully representing the most interesting features of these molecules. In other words, the problem is both real and immediate.

Chemistry relentlessly marches forward, revealing even greater molecular information problems on the horizon. For software to remain relevant, it must be based on tools that are up to the challenge.

A Solution

The system proposed by Dietz offers a solution to nearly all of the bonding and stereochemistry problems of existing molecular languages. As a tradeoff, Dietz's system is significantly more complicated to implement. This places an increased burden on software to make the system as simple and understandable as possible.
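To make the idea of multi-atom bonding concrete, here is a minimal Python sketch. It is neither Octet's API nor Dietz's formal definition: the `BondingSystem` class, its fields, and the atom numbering are hypothetical names chosen for illustration. The only assumption is that a bond can be modeled as a count of electrons shared by a set of atoms, where that set may contain more than two members. Hydrogens and the second ring are omitted to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BondingSystem:
    """Hypothetical record: electrons shared by a set of two or more atoms."""
    atoms: frozenset   # indices of the participating atoms
    electrons: int     # number of electrons shared across those atoms


def is_pairwise(system):
    """True if the system is an ordinary two-atom bond."""
    return len(system.atoms) == 2


# One cyclopentadienyl ring bound to iron.
# Atoms 0-4 are the ring carbons; atom 5 is Fe (illustrative numbering).
ring_sigma = [
    BondingSystem(frozenset({i, (i + 1) % 5}), 2)  # C-C sigma bonds around the ring
    for i in range(5)
]

# The six pi electrons of the ring (ionic counting convention) are shared
# by all five carbons and the metal. No list of atom pairs can say this.
ring_metal = BondingSystem(frozenset({0, 1, 2, 3, 4, 5}), 6)

systems = ring_sigma + [ring_metal]
multi_atom = [s for s in systems if not is_pairwise(s)]

print(len(systems), "bonding systems,", len(multi_atom), "spanning more than two atoms")
```

A representation limited to atom pairs has to approximate `ring_metal` with five separate Fe-C bonds or with a charge-separated structure, and both choices misstate the bonding. Allowing a single entry to span many atoms removes the need for that workaround, at the cost of a more involved data model.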

Java and XML Implementations

Any specification, if it is to become more than just an academic exercise, requires a software implementation. Fortunately, both a software implementation and an XML Schema have been developed for Dietz's system, and both are freely available.

The software implementation can be found in the Java framework Octet. In addition to fully implementing Dietz's specification, Octet enables ring perception, substructure and query structure matching, breadth-first traversal, and of course, depth-first traversal. Add-on libraries are available for 2-D structure depiction and for Molfile and SMILES input and output. A CDK News article discusses CDKTools, a bridge to the Chemistry Development Kit. Octet remains, to my knowledge, the first and only implementation of the Dietz system.

The first, and to my knowledge only, XML implementation of the Dietz molecular representation system is FlexMol (Flexible Molecular Object Language). A commented W3C schema is distributed with Octet. Browser-ready HTML documentation can be found here, or from the sidebar links under "APIs and Schema Documentation." Octet is able to read and write FlexMol documents, providing an open, end-to-end solution to the problem of representing and transmitting molecules containing "nonstandard" bonding and stereochemistry.

Conclusions

Both FlexMol and Octet are convenient tools for working with the Dietz molecular representation system. Future articles in this series will show how they can be used to solve current, real-world molecular representation problems.