The Fundamental Cheminformatics Toolset

Posted by Rich Apodaca Tue, 08 Jan 2008 14:37:00 GMT

Reference: W.J. Howe and T.R. Hogadone, J. Chem. Inf. Model.

Imagine you need to create a cheminformatics system that's useful to chemists in their daily work. What tools would you absolutely need, regardless of the specific system you're building?

The answer to this question is hardly academic. If you're looking for ways to disproportionately improve the state of cheminformatics, improving the performance of one or more of its fundamental tools would seem to be a logical path.

Here, in no particular order, are my picks for the five fundamental cheminformatics tools:

  • 2D Structure Editor. Ubiquitous yet mostly ignored, the 2D structure editor is the last mile connecting cheminformaticians with laboratory chemists. Take away the structure editor for data entry and building queries, and most cheminformatics systems become useless to the average chemist.

  • 2D Structure Renderer. Chemists expect their cheminformatics systems to communicate with them the way that other chemists do - through 2D chemical structures. Rendering software makes this possible. Like the 2D structure editor, structure renderers are a widely ignored yet critical link between producers and consumers of cheminformatics software. Although the 2D renderer and editor need not be related, the two technologies are so similar that most 2D editors are based on a related 2D rendering engine.

  • Structure Query System. The purpose of the vast majority of cheminformatics systems is to produce a set of chemical structure results based on a structure query. The structure query system makes this possible. As the datasets that chemists deal with become ever larger, the ability to specify query structures at a high level of detail, and retrieve the results efficiently, becomes increasingly important. This is an area ripe for big improvements.

  • Low-Level Cheminformatics Toolkit. Most cheminformatics systems involve one or more elements specific to their problem domain. For example, predictive tools may use molecular descriptors. A robust and versatile low-level cheminformatics toolkit makes it possible to build problem-specific cheminformatics libraries. This toolkit may or may not be used in the 2D structure editor and renderer, depending on whether an adequate text-based molecular language is available (see below).

  • Text-Based Molecular Language. Cheminformatics systems are frequently built from components developed independently by multiple groups. These components may be written in different programming languages, may even run on different operating systems, and may need to communicate over a network connection. A well-specified, open, text-based molecular language makes it possible for them to interoperate. Two widely used examples are MDL's molfile format and Daylight's SMILES, both of which have significant limitations.
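To make the contrast between the two formats concrete, here is a minimal Python sketch showing ethanol written both ways, along with a toy reader that pulls the atoms and bonds out of the molfile. The molfile text is hand-written for illustration, and `read_connection_table` is a hypothetical helper, not part of any toolkit. It also takes a shortcut: it splits lines on whitespace, whereas the molfile format actually defines fixed-width columns, so it only works for small molecules like this one.

```python
# Ethanol in two text-based molecular languages.
ETHANOL_SMILES = "CCO"

# Hand-written V2000 molfile for the same molecule (illustrative coordinates).
ETHANOL_MOLFILE = """ethanol
  hand-written example

  3  2  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.5000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.2500    1.2990    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0
  2  3  1  0
M  END
"""


def read_connection_table(molfile):
    """Return (atoms, bonds) from a small V2000 molfile.

    Simplification: fields are split on whitespace. A real reader
    would slice the fixed-width columns the format defines.
    """
    lines = molfile.splitlines()

    # Line 4 is the counts line: atom count, then bond count.
    counts = lines[3].split()
    n_atoms, n_bonds = int(counts[0]), int(counts[1])

    atom_lines = lines[4:4 + n_atoms]
    bond_lines = lines[4 + n_atoms:4 + n_atoms + n_bonds]

    # Atom line: x, y, z, element symbol, ...
    atoms = [line.split()[3] for line in atom_lines]

    # Bond line: first atom index, second atom index, bond order, ...
    bonds = [tuple(int(field) for field in line.split()[:3])
             for line in bond_lines]
    return atoms, bonds


if __name__ == "__main__":
    atoms, bonds = read_connection_table(ETHANOL_MOLFILE)
    print(ETHANOL_SMILES, atoms, bonds)
```

The SMILES string carries the same connectivity in three characters, but no coordinates. The molfile spells out every atom and bond along with a 2D layout. Which trade-off is better depends on what the receiving component needs.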

One of the reasons I consider this set of cheminformatics tools in particular to be fundamental is the perennial need to use and improve them. Elements of each of these tools can be seen, for example, in the COUSIN system developed by Howe and Hogadone at Upjohn over 25 years ago. Comparison of this system with PubChem shows just how little the basic problems change, despite major changes in underlying technology.

What are your fundamental cheminformatics tools and which of them are you working to improve?

Debabelization

Posted by Rich Apodaca Wed, 08 Nov 2006 19:32:00 GMT

Today, we find Chemical Abstracts with over two million compounds coded in a connectivity table system and ISI with close to a million compounds coded in WLN. The U.S. Patent Office has large files coded in the Hayward notation; the IDC has large numbers of compounds in its CT and GREMAS Code. Derwent has a sizable patent file coded in one fragment code, and many journal literature compounds coded in the Ring Code fragment code. There are a number of individual companies and government agencies with over 100,000 compounds coded in "a" system. And almost all companies synthesizing new compounds have some internal system for their compounds. Finally, there are many universities with a wide variety of coded structure files.

- Charles E. Granito, J. Chem. Doc. 1973, 13, 72-74

The situation described by Granito in 1973 seems eerily familiar today. The names of the players, the technologies, and encoding systems have changed, but the problem of multiple incompatible molecular languages has persisted for over 30 years.

This problem will become even more pronounced in the near future as free chemistry databases on the Web continue their rapid proliferation. In Granito's world of closed, proprietary databases and unevenly distributed computer power, interoperability was an afterthought; in the coming world of free, open databases and the ubiquitous computer networks that connect to them, interoperability will be taken for granted.

Granito goes on to observe that "there is no one 'best' system" for molecular representation. And he's right. Molecular languages evolve within a particular problem domain, just as human languages evolve within a specific cultural context. This isn't to say that a molecular language can't be creatively adapted to serve purposes for which it was never designed. Trying to do so is, after all, how new languages are conceived.

Consider InChI, which is both a molecular identification system and a line notation, and Chemical Markup Language (CML), an XML language. There are vast areas of chemistry in which using either InChI or CML will be problematic - particularly polymers, organometallics, and inorganic chemistry. And let's not ignore new molecular representation problems brewing on the horizon, like small-molecule tertiary structure. Yet for pure organic chemistry as most of us know it today, InChI and CML may well be optimal.

The problem is that both InChI and CML compete with simpler, entrenched alternatives - SMILES and molfile, respectively. Even MDL, the author of the original molfile specification, is having difficulty gaining acceptance for its new molfile format, despite significant technical advantages.

If history is any guide, we can look forward to at least as many molecular languages in the next thirty years as we've seen in the last thirty. It wasn't long ago that WLN was viewed as the language of the future. Now it just looks cryptic. For this we can thank a combination of technology advances and the emergence of a far simpler alternative, SMILES. A similar fate, more likely than not, awaits all molecular languages currently in use.

Will there ever be a universal molecular language, and is there any point in trying to invent one? Every area of chemistry introduces its own peculiarities not shared by any of the others. Yet all users want the simplest language possible. These two contradictory forces make it unlikely that a universal language will ever appear. Instead, the most successful new molecular languages are likely to be agile - simple, easy to learn, cheap to implement, and quickly adaptable in the face of new chemical concepts and advances in computer technology.