A Molecular Language for Modern Chemistry: Getting Started with FlexMol
Existing molecular languages are limited in their ability to represent such commonplace features as multi-center bonding and axial chirality. The practical outcome of these limitations can be seen in PubChem's four separate entries for ferrocene and the inability to fully represent many molecules now in common use by organic chemists.
A recent article touched on a molecular representation system that was capable of far greater expressive power than those currently in use. In this article, I'll introduce FlexMol, an XML implementation of this advanced molecular representation system.
What is FlexMol?
FlexMol is an XML-based molecular language that's designed to allow the faithful representation of any molecule, regardless of its peculiarities. The following is a list of features that FlexMol can encode:
Multi-atom, multi-electron bonds
All known forms of stereochemistry, including axial chirality (e.g., allenes and biaryls), planar chirality (e.g., metallocenes), and non-tetrahedral stereocenters (e.g., square planar and octahedral metal complexes)
Non-natural isotopic distributions and pure isotopes
Virtual hydrogens (similar to "implicit hydrogens") through mandatory, explicit enumeration
Electronic spin, enabling the differentiation of spin states
What Does FlexMol Look Like?
Let's start with the simple example of benzene:
<?xml version="1.0" standalone="yes"?>
<!-- Benzene, represented as "1,3,5-cyclohexatriene" -->
<molecule>
<constitution>
<atoms>
<atom id="C0" symbol="C" hydrogens="1" ionization="4"></atom>
<atom id="C1" symbol="C" hydrogens="1" ionization="4"></atom>
<atom id="C2" symbol="C" hydrogens="1" ionization="4"></atom>
<atom id="C3" symbol="C" hydrogens="1" ionization="4"></atom>
<atom id="C4" symbol="C" hydrogens="1" ionization="4"></atom>
<atom id="C5" symbol="C" hydrogens="1" ionization="4"></atom>
</atoms>
<bonding>
<bond source="C0" target="C1" bondingElectrons="2"></bond>
<bond source="C1" target="C2" bondingElectrons="4"></bond>
<bond source="C2" target="C3" bondingElectrons="2"></bond>
<bond source="C3" target="C4" bondingElectrons="4"></bond>
<bond source="C4" target="C5" bondingElectrons="2"></bond>
<bond source="C0" target="C5" bondingElectrons="4"></bond>
</bonding>
</constitution>
</molecule>
The above representation divides the structure of benzene into two main elements - atoms and bonding. Both of these elements are in turn subelements of the constitution element, which specifies atom connectivity. Had we been representing a molecule with stereochemical features, the above document could have also contained a configuration element, a conformation element, or both.
Within the atoms element are definitions for each of the six degenerate carbon atoms of benzene. Each atom is assigned a unique ID for use elsewhere in the document, an atomic symbol, the number of hydrogens bonded to each atom, and the effective ionization state of each atom. The mandatory hydrogens attribute specifies "virtual" hydrogens, or those associated with an atom without being full-fledged nodes in the graph representation.
The bonding element defines all of the bonding arrangements within benzene. In this case, benzene is being represented as "cyclohexatriene" with alternating single and double bonds; below we'll see how to use FlexMol to represent delocalized (aromatic) bonding. Each bond specifies a source atom, a target atom, and the number of bonding electrons.
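Because FlexMol is plain XML, the structure just described can be inspected with any standard parser. The following is a minimal sketch in Python that uses only the standard library's xml.etree.ElementTree rather than Octet. The helper name summarize is hypothetical, and the sketch assumes that every atom carries a hydrogens attribute and every bond a bondingElectrons attribute, as in the benzene document above.

```python
import xml.etree.ElementTree as ET

# The "cyclohexatriene" document from above, without the comment and declaration.
BENZENE = """<molecule>
<constitution>
<atoms>
<atom id="C0" symbol="C" hydrogens="1" ionization="4"></atom>
<atom id="C1" symbol="C" hydrogens="1" ionization="4"></atom>
<atom id="C2" symbol="C" hydrogens="1" ionization="4"></atom>
<atom id="C3" symbol="C" hydrogens="1" ionization="4"></atom>
<atom id="C4" symbol="C" hydrogens="1" ionization="4"></atom>
<atom id="C5" symbol="C" hydrogens="1" ionization="4"></atom>
</atoms>
<bonding>
<bond source="C0" target="C1" bondingElectrons="2"></bond>
<bond source="C1" target="C2" bondingElectrons="4"></bond>
<bond source="C2" target="C3" bondingElectrons="2"></bond>
<bond source="C3" target="C4" bondingElectrons="4"></bond>
<bond source="C4" target="C5" bondingElectrons="2"></bond>
<bond source="C0" target="C5" bondingElectrons="4"></bond>
</bonding>
</constitution>
</molecule>"""


def summarize(flexmol_text):
    """Return (atom ids, virtual hydrogen count, total electrons in bond elements)."""
    root = ET.fromstring(flexmol_text)  # root is the <molecule> element
    atoms = root.findall("./constitution/atoms/atom")
    bonds = root.findall("./constitution/bonding/bond")
    atom_ids = [atom.get("id") for atom in atoms]
    hydrogens = sum(int(atom.get("hydrogens")) for atom in atoms)
    electrons = sum(int(bond.get("bondingElectrons")) for bond in bonds)
    return atom_ids, hydrogens, electrons


print(summarize(BENZENE))
```

For the cyclohexatriene form this reports six atoms, six virtual hydrogens, and 18 bonding electrons (three two-electron bonds and three four-electron bonds).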
In many situations, the above representation of benzene will not suffice. What if we want to describe the one-electron ionization of benzene to form the benzene radical cation? Using the "cyclohexatriene" form of benzene makes it impossible to select the correct bond from which to take electrons.
Instead, we could use a more physically meaningful representation of benzene, such as that shown below:
<?xml version="1.0" standalone="yes"?>
<!-- Benzene, represented with a delocalized pi-system -->
<molecule>
<constitution>
<atoms>
<atom id="C0" symbol="C" hydrogens="1" ionization="4"></atom>
<atom id="C1" symbol="C" hydrogens="1" ionization="4"></atom>
<atom id="C2" symbol="C" hydrogens="1" ionization="4"></atom>
<atom id="C3" symbol="C" hydrogens="1" ionization="4"></atom>
<atom id="C4" symbol="C" hydrogens="1" ionization="4"></atom>
<atom id="C5" symbol="C" hydrogens="1" ionization="4"></atom>
</atoms>
<bonding>
<bond source="C0" target="C1" bondingElectrons="2"></bond>
<bond source="C1" target="C2" bondingElectrons="2"></bond>
<bond source="C2" target="C3" bondingElectrons="2"></bond>
<bond source="C3" target="C4" bondingElectrons="2"></bond>
<bond source="C4" target="C5" bondingElectrons="2"></bond>
<bond source="C0" target="C5" bondingElectrons="2"></bond>
<bondingSystem bondingElectrons="6">
<connections>
<atomPair source="C0" target="C1"></atomPair>
<atomPair source="C1" target="C2"></atomPair>
<atomPair source="C2" target="C3"></atomPair>
<atomPair source="C3" target="C4"></atomPair>
<atomPair source="C4" target="C5"></atomPair>
<atomPair source="C0" target="C5"></atomPair>
</connections>
</bondingSystem>
</bonding>
</constitution>
</molecule>
This is certainly more verbose, but what does it buy us? Notice the bondingSystem subelement at the end of the bonding element. Here we define an extended six-atom, six-electron bonding system that much more closely reflects the true nature of benzene's pi-system. Now it's obvious that this is the bonding motif from which to take an electron to make the benzene radical cation.
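To make the point about ionization concrete, here is a rough sketch in Python, again using only the standard library and not Octet. The helpers ring_document and remove_pi_electron are hypothetical names introduced for illustration. The sketch only lowers the bondingElectrons count of the single bondingSystem; whether a complete FlexMol radical cation also requires changes to other attributes is not covered here.

```python
import xml.etree.ElementTree as ET


def ring_document(size, pi_electrons):
    """Build a FlexMol-style carbon ring: sigma bonds plus one delocalized bondingSystem."""
    molecule = ET.Element("molecule")
    constitution = ET.SubElement(molecule, "constitution")
    atoms = ET.SubElement(constitution, "atoms")
    bonding = ET.SubElement(constitution, "bonding")

    ids = [f"C{i}" for i in range(size)]
    for atom_id in ids:
        ET.SubElement(atoms, "atom", id=atom_id, symbol="C",
                      hydrogens="1", ionization="4")

    # Neighbouring pairs, with the ring-closing pair written as C0 -> last atom.
    pairs = [(ids[i], ids[i + 1]) for i in range(size - 1)] + [(ids[0], ids[-1])]
    for source, target in pairs:
        ET.SubElement(bonding, "bond", source=source, target=target,
                      bondingElectrons="2")

    system = ET.SubElement(bonding, "bondingSystem",
                           bondingElectrons=str(pi_electrons))
    connections = ET.SubElement(system, "connections")
    for source, target in pairs:
        ET.SubElement(connections, "atomPair", source=source, target=target)
    return molecule


def remove_pi_electron(molecule):
    """Take one electron from the molecule's only bondingSystem."""
    systems = molecule.findall("./constitution/bonding/bondingSystem")
    if len(systems) != 1:
        raise ValueError("expected exactly one bondingSystem")
    count = int(systems[0].get("bondingElectrons"))
    if count < 1:
        raise ValueError("bondingSystem has no electrons to remove")
    systems[0].set("bondingElectrons", str(count - 1))


benzene = ring_document(6, 6)
remove_pi_electron(benzene)
print(ET.tostring(benzene, encoding="unicode"))
```

With the localized "cyclohexatriene" form there is no single element to edit in this way, which is exactly the ambiguity the delocalized representation removes. The same ring_document helper called with a size of 5 and 6 pi electrons produces a five-membered ring with a six-electron bonding system.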
Next, consider the cyclopentadienyl anion, which possesses a five-atom, six-electron Hückel aromatic bonding system. We can apply the same principles used in representing benzene's pi-system to the representation of the cyclopentadienyl anion's pi-bonding:
<?xml version="1.0" standalone="yes"?>
<!-- Cyclopentadienyl Anion -->
<molecule>
<constitution>
<atoms>
<atom id="C0" symbol="C" hydrogens="1" ionization="4"></atom>
<atom id="C1" symbol="C" hydrogens="1" ionization="4"></atom>
<atom id="C2" symbol="C" hydrogens="1" ionization="4"></atom>
<atom id="C3" symbol="C" hydrogens="1" ionization="4"></atom>
<atom id="C4" symbol="C" hydrogens="1" ionization="4"></atom>
</atoms>
<bonding>
<bond source="C0" target="C1" bondingElectrons="2"></bond>
<bond source="C1" target="C2" bondingElectrons="2"></bond>
<bond source="C2" target="C3" bondingElectrons="2"></bond>
<bond source="C3" target="C4" bondingElectrons="2"></bond>
<bond source="C0" target="C4" bondingElectrons="2"></bond>
<bondingSystem bondingElectrons="6">
<connections>
<atomPair source="C0" target="C1"></atomPair>
<atomPair source="C1" target="C2"></atomPair>
<atomPair source="C2" target="C3"></atomPair>
<atomPair source="C3" target="C4"></atomPair>
<atomPair source="C0" target="C4"></atomPair>
</connections>
</bondingSystem>
</bonding>
</constitution>
</molecule>
In the above representation, all carbon atoms are equivalent - something difficult, if not impossible, to achieve with most other molecular languages. Furthermore, the representation of delocalized bonding closely matches what most chemists would describe. We could get even more sophisticated and place individual electrons into three separate bonding systems in analogy with molecular orbitals - it really depends on what we'd like to emphasize.
This is well and good for aromaticity, but how can all of this help solve the Ferrocene Problem? Just as with cyclopentadienyl anion and benzene, in the representation of ferrocene below, we're taking advantage of FlexMol's support for multi-atom bonding. In this case, we define three bondingSystems, each of which contains six electrons. We could have just as easily created a single eighteen-electron, eleven-atom bonding system. Our choice of representation again depends on what we're trying to emphasize.
<?xml version="1.0" standalone="yes"?>
<!-- Ferrocene -->
<molecule>
<constitution>
<atoms>
<atom id="C0" symbol="C" hydrogens="1" ionization="4"></atom>
<atom id="C1" symbol="C" hydrogens="1" ionization="4"></atom>
<atom id="C2" symbol="C" hydrogens="1" ionization="4"></atom>
<atom id="C3" symbol="C" hydrogens="1" ionization="4"></atom>
<atom id="C4" symbol="C" hydrogens="1" ionization="4"></atom>
<atom id="C5" symbol="C" hydrogens="1" ionization="4"></atom>
<atom id="C6" symbol="C" hydrogens="1" ionization="4"></atom>
<atom id="C7" symbol="C" hydrogens="1" ionization="4"></atom>
<atom id="C8" symbol="C" hydrogens="1" ionization="4"></atom>
<atom id="C9" symbol="C" hydrogens="1" ionization="4"></atom>
<atom id="Fe10" symbol="Fe" hydrogens="0" ionization="8"></atom>
</atoms>
<bonding>
<bond source="C0" target="C1" bondingElectrons="2"></bond>
<bond source="C1" target="C2" bondingElectrons="2"></bond>
<bond source="C2" target="C3" bondingElectrons="2"></bond>
<bond source="C3" target="C4" bondingElectrons="2"></bond>
<bond source="C0" target="C4" bondingElectrons="2"></bond>
<bond source="C5" target="C6" bondingElectrons="2"></bond>
<bond source="C6" target="C7" bondingElectrons="2"></bond>
<bond source="C7" target="C8" bondingElectrons="2"></bond>
<bond source="C8" target="C9" bondingElectrons="2"></bond>
<bond source="C5" target="C9" bondingElectrons="2"></bond>
<bondingSystem bondingElectrons="6">
<connections>
<atomPair source="C0" target="C1"></atomPair>
<atomPair source="C1" target="C2"></atomPair>
<atomPair source="C2" target="C3"></atomPair>
<atomPair source="C3" target="C4"></atomPair>
<atomPair source="C0" target="C4"></atomPair>
<atomPair source="C0" target="Fe10"></atomPair>
<atomPair source="C1" target="Fe10"></atomPair>
<atomPair source="C2" target="Fe10"></atomPair>
<atomPair source="C3" target="Fe10"></atomPair>
<atomPair source="C4" target="Fe10"></atomPair>
</connections>
</bondingSystem>
<bondingSystem bondingElectrons="6">
<connections>
<atomPair source="C5" target="C6"></atomPair>
<atomPair source="C6" target="C7"></atomPair>
<atomPair source="C7" target="C8"></atomPair>
<atomPair source="C8" target="C9"></atomPair>
<atomPair source="C5" target="C9"></atomPair>
<atomPair source="C5" target="Fe10"></atomPair>
<atomPair source="C6" target="Fe10"></atomPair>
<atomPair source="C7" target="Fe10"></atomPair>
<atomPair source="C8" target="Fe10"></atomPair>
<atomPair source="C9" target="Fe10"></atomPair>
</connections>
</bondingSystem>
<bondingSystem bondingElectrons="6">
<connections>
<atomPair source="C0" target="C1"></atomPair>
<atomPair source="C1" target="C2"></atomPair>
<atomPair source="C2" target="C3"></atomPair>
<atomPair source="C3" target="C4"></atomPair>
<atomPair source="C0" target="C4"></atomPair>
<atomPair source="C0" target="Fe10"></atomPair>
<atomPair source="C1" target="Fe10"></atomPair>
<atomPair source="C2" target="Fe10"></atomPair>
<atomPair source="C3" target="Fe10"></atomPair>
<atomPair source="C4" target="Fe10"></atomPair>
<atomPair source="C5" target="C6"></atomPair>
<atomPair source="C6" target="C7"></atomPair>
<atomPair source="C7" target="C8"></atomPair>
<atomPair source="C8" target="C9"></atomPair>
<atomPair source="C5" target="C9"></atomPair>
<atomPair source="C5" target="Fe10"></atomPair>
<atomPair source="C6" target="Fe10"></atomPair>
<atomPair source="C7" target="Fe10"></atomPair>
<atomPair source="C8" target="Fe10"></atomPair>
<atomPair source="C9" target="Fe10"></atomPair>
</connections>
</bondingSystem>
</bonding>
</constitution>
</molecule>
The same principles outlined for ferrocene apply equally to other metallocenes. FlexMol can also represent a host of otherwise tough cases such as nonclassical carbocations, allylmetal complexes, resonance-stabilized radicals and ions, and transition states.
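One way to check what a multi-atom bonding system actually spans is to collect the atoms named by its atomPair elements. The sketch below does this in Python with the standard library only. The function name describe_systems is hypothetical, and the input is just the first bondingSystem of the ferrocene document above, wrapped in its bonding element.

```python
import xml.etree.ElementTree as ET

# First bondingSystem of the ferrocene document: one ring plus the iron atom.
FRAGMENT = """<bonding>
<bondingSystem bondingElectrons="6">
<connections>
<atomPair source="C0" target="C1"></atomPair>
<atomPair source="C1" target="C2"></atomPair>
<atomPair source="C2" target="C3"></atomPair>
<atomPair source="C3" target="C4"></atomPair>
<atomPair source="C0" target="C4"></atomPair>
<atomPair source="C0" target="Fe10"></atomPair>
<atomPair source="C1" target="Fe10"></atomPair>
<atomPair source="C2" target="Fe10"></atomPair>
<atomPair source="C3" target="Fe10"></atomPair>
<atomPair source="C4" target="Fe10"></atomPair>
</connections>
</bondingSystem>
</bonding>"""


def describe_systems(element):
    """Return one (electron count, sorted member atom ids) tuple per bondingSystem."""
    summaries = []
    for system in element.iter("bondingSystem"):
        members = set()
        for pair in system.iter("atomPair"):
            members.add(pair.get("source"))
            members.add(pair.get("target"))
        summaries.append((int(system.get("bondingElectrons")), sorted(members)))
    return summaries


for electrons, members in describe_systems(ET.fromstring(FRAGMENT)):
    print(f"{len(members)}-atom, {electrons}-electron system: {members}")
```

Run against this fragment, the sketch reports a six-atom, six-electron system made up of C0 through C4 and Fe10. Passing the root of a full document instead would yield one tuple per bondingSystem.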
Why XML?

XML provides several often-cited advantages:
Availability of standardized parser and output libraries
Human readability
Adequate mapping to Object-Oriented models for most purposes
Nothing about FlexMol prevents it from being built on top of another data-interchange format. Two of the most interesting alternatives to XML are JavaScript Object Notation (JSON) and YAML. JSON in particular seems to have learned from XML's experiences and so represents a platform worthy of serious consideration.
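As a rough illustration of how little ties FlexMol to XML, the following Python sketch converts an element tree into nested dictionaries and serializes them as JSON. The tag/attributes/children layout is purely an assumption made for this example; no JSON form of FlexMol is defined in this article.

```python
import json
import xml.etree.ElementTree as ET

# A one-atom fragment in the style of the documents above.
ATOMS = '<atoms><atom id="C0" symbol="C" hydrogens="1" ionization="4"></atom></atoms>'


def element_to_dict(element):
    """Map an element to a dict of its tag, attributes, and child elements (hypothetical layout)."""
    return {
        "tag": element.tag,
        "attributes": dict(element.attrib),
        "children": [element_to_dict(child) for child in element],
    }


as_json = json.dumps(element_to_dict(ET.fromstring(ATOMS)), indent=2)
print(as_json)
```

A purpose-built JSON schema would probably be flatter than this generic mapping, but the round trip shows that the atom and bonding data carry over without loss.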
What About Chemical Markup Language?
Chemical Markup Language (CML) is a widely used XML-based molecular language. So why invent yet another XML language for chemistry? Currently, CML does not solve the molecular representation problems discussed in this article and those preceding it. Although FlexMol and CML are both built on XML, they are aimed at different problems. In this respect, FlexMol and CML are complementary.
Where's the Software?
Any language needs software to make it useful. To simplify the use of FlexMol, it is fully supported by Octet, an Open Source framework written in Java. Supporting FlexMol in other cheminformatics toolkits will likely be challenging due to impedance mismatch; FlexMol can precisely encode a variety of structural concepts that simply don't exist elsewhere.
Conclusions
Existing molecular languages lack the expressive power to represent many structural motifs in widespread use by today's chemists. FlexMol was designed to solve this problem. Future articles in this series will demonstrate how FlexMol documents can be read and written, as well as showing some techniques for manipulating the resulting molecular representations.
Ferrocene and Beyond: A Solution to the Molecular Representation Problem
The representation of molecular structure decisively determines the scope of a chemical computer program. Our goal is to provide a versatile computer-oriented molecular structure representation for chemical information storage and retrieval as well as for computer-assisted synthesis design. Structural formulas describe molecular structure on the proper level of abstraction for these applications. ... It is therefore desirable that the computer-oriented representation of molecular structure be as expressive as the structural formulas.
-Andreas Dietz, J. Chem. Inf. Comput. Sci. 1995, 35, 787-802
A recent Depth-First article highlighted the difficulty that existing molecular languages have in communicating the generalized, multi-atom bonding present in metallocenes such as ferrocene. For software and Web services that do not interact with the outside world, the Ferrocene Problem may not be a big deal. But for the growing number that do, the Ferrocene Problem is but the tip of a very large iceberg.
Today's Weird-Looking Molecule is Tomorrow's Molecule of the Month
Consider the problem of axial chirality, such as that present in certain biaryls. None of the molecular languages currently in widespread use (InChI, SMILES, Molfile, or CML) provide a mechanism to faithfully represent and communicate this structural motif. In the 1980s, axial chirality was a novelty. Today it is ubiquitous. Consider this graphical abstract from the current issue of Organic Letters:

If you were asked to create an application capable of distinguishing substituted (R) and (S) binol enantiomers, could you do it? If your system needed to reliably interact with the outside world, could it do so? If you're working with any of the cheminformatics tools currently in widespread use, chances are good that the answers to these questions would be "no".
Do you still think of metallocenes as curiosities studied by a handful of organometallic chemists? Consider this J. Org. Chem. ASAP contents article describing one of the most fundamental transformations in organic chemistry:

The problem only gets worse as concepts like axial and planar chirality are increasingly co-mingled with multi-atom bonding. For example, consider the following graphical abstract, taken from J. Org. Chem. ASAP contents:

These molecules, and many others like them, were used in the context of organic chemistry. Moreover, the papers describing their use were published in widely respected journals specializing in organic chemistry. Yet dozens of popular cheminformatics tools specifically designed for use with organic chemistry are incapable of faithfully representing the most interesting features of these molecules. In other words, the problem is both real and immediate.
Chemistry relentlessly marches forward, revealing even greater molecular information problems on the horizon. For software to remain relevant, it must be based on tools that are up to the challenge.
A Solution
The system proposed by Dietz offers a solution to nearly all of the bonding and stereochemistry problems of existing molecular languages. As a tradeoff, Dietz's system is significantly more complicated to implement. This places an increased burden on software to make the system as simple and understandable as possible.
Java and XML Implementations
Any specification, if it is to become more than just an academic exercise, requires a software implementation. Fortunately, for Dietz's system both a software implementation and an XML Schema have been developed and are freely available.
The software implementation can be found in the Java framework Octet. In addition to fully implementing Dietz's specification, Octet enables ring perception, substructure and query structure matching, breadth-first traversal, and of course, depth-first traversal. Add-on libraries are available for 2-D structure depiction, and Molfile and SMILES input and output. A CDK News article discusses CDKTools, a bridge to the Chemistry Development Kit. Octet remains, to my knowledge, the first and only implementation of the Dietz system.
The first, and to my knowledge only, XML implementation of the Dietz molecular representation system is FlexMol (Flexible Molecular Object Language). A commented W3C schema is distributed with Octet. Browser-ready HTML documentation is available from the sidebar links under "APIs and Schema Documentation." Octet is able to read and write FlexMol documents, providing an open, end-to-end solution to the problem of representing and transmitting molecules containing "nonstandard" bonding and stereochemistry.
Conclusions
Both FlexMol and Octet are convenient tools for working with the Dietz molecular representation system. Future articles in this series will show how they can be used to solve current, real-world molecular representation problems.