Hashing and the Universal Molecular Identifier
The impossibility of separating the nomenclature of a science from the science itself, is owing to this, that every branch of physical science must consist of three things; the series of facts which are the objects of the science, the ideas which represent these facts, and the words by which these ideas are expressed. ... -Antoine Lavoisier Traité Elémentaire de Chimie
The idea of creating a system in which a short sequence of characters can be used in place of a molecular structure traces its roots to the beginnings of modern chemistry. As the Web continues to become the information platform of choice, even in chemistry, the need to communicate molecular information in a Web-ready format increases.
InChI
Recently, InChI has emerged as a candidate for this position - or more precisely, InChIKey has. For the unfamiliar, InChI is a program that converts a variety of molecular structures into a sequence of characters; the same structure always gives the same sequence.
InChI is but one of a very large class of known line notations in chemistry. Unlike most other line notations, it combines the following ideas into one system:
- Normalization. Many bonding arrangements, such as the nitro group and tautomers, can be represented in multiple equivalent forms. InChI is capable of eliminating alternative representations.
- Symmetrization. InChI views molecules as networks of typed atoms connected by equivalent links. This eliminates the artificial distinction between single and double bonds in aromatic molecules such as benzene.
- Canonicalization. Each atom in a molecule is assigned a reproducibly unique index, enabling equivalent inputs with different atom numberings to produce identical output.
- Representation. InChIs are represented as a single line of text through the use of canonical numbering and rules for representing atom types.
- Fixed Length. Unique to InChI, InChIKey enables very long InChI output to be compressed into a fixed-length string. The process is for all intents and purposes one-way.
This was no easy task, and is a remarkable accomplishment in itself.
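To make the canonicalization idea concrete, here is a minimal sketch. It is not InChI's algorithm: it simply tries every renumbering of a tiny molecular graph and keeps the lexicographically smallest result. The function name `canonical_form` and the (atoms, bonds) input format are assumptions made for illustration, and the brute-force search is only practical for a handful of atoms.

```python
from itertools import permutations


def canonical_form(atoms, bonds):
    """Return a canonical (atoms, bonds) pair for a small molecular graph.

    atoms: list of element symbols, indexed by atom number.
    bonds: iterable of (i, j) index pairs (undirected, bond orders ignored).
    Every possible renumbering is tried and the smallest result is kept,
    so two inputs that differ only in atom numbering give the same output.
    """
    n = len(atoms)
    best = None
    for perm in permutations(range(n)):
        # perm[old_index] is the new index assigned to that atom.
        new_atoms = [None] * n
        for old, new in enumerate(perm):
            new_atoms[new] = atoms[old]
        new_bonds = sorted(tuple(sorted((perm[i], perm[j]))) for i, j in bonds)
        candidate = (tuple(new_atoms), tuple(new_bonds))
        if best is None or candidate < best:
            best = candidate
    return best


# Ethanol's heavy atoms, numbered two different ways.
ethanol_a = canonical_form(["C", "C", "O"], [(0, 1), (1, 2)])
ethanol_b = canonical_form(["O", "C", "C"], [(0, 1), (1, 2)])
print(ethanol_a == ethanol_b)  # True
```

Real canonicalization algorithms avoid the factorial cost by refining atom classes instead of enumerating every ordering, but the goal is the same: one output per structure, whatever the input numbering.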
InChI Limitations
But like any technology, InChI has limitations. Most importantly, InChI can't uniquely represent several important classes of molecular species, including those possessing axial and planar chirality, organometallics containing multicentered bonds, and polymers.
Extending InChI to encompass these important classes will be no easy task. Unlike the situation when InChI was being developed, many research groups around the world now use InChI - any change that breaks existing InChIs or their Keys would be disruptive, and even possibly ignored by large segments of the community.
The difficulty of the task is compounded by the technical limitations in dealing with line notations. The qualities that tend to make them compact (e.g., unidimensionality) tend to be the same qualities that make them difficult to extend.
Taking a Step Back
What if we could go back to the five points above and create a system from scratch that addressed them all - and in a way in which all forms of chirality, bonding, and polymers were included? How would we do it?
No More Line Notations
Our most important decision might be to avoid the use of a line notation altogether. Given the last of those points (Fixed Length), we know that anything we come up with, no matter how verbose, can be reduced to a fixed-length string that can be readily used in URLs and presented to end users for copy/paste operations and visual comparison.
Combining a full-featured file format with hashing offers a way to reap the benefits of a rich language for describing molecules while retaining a convenient method for using them on the Web. (For an example of a site that makes extensive use of hashes in URLs and on Web pages, see GitHub.)
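As a rough illustration of the hashing step, the sketch below reduces a text description of any length to a fixed-length key. It assumes the description has already been put into canonical form by some other means. The function name `structure_key` is hypothetical, and SHA-1 is only one of many hash functions that could be used.

```python
import hashlib


def structure_key(canonical_text: str) -> str:
    """Return a fixed-length hex key for a canonical structure description."""
    return hashlib.sha1(canonical_text.encode("utf-8")).hexdigest()


short_key = structure_key("a short canonical description")
long_key = structure_key("x" * 100_000)  # stand-in for a very verbose file
print(len(short_key), len(long_key))  # 40 40
```

The key is short enough for a URL path segment, but it can't be turned back into the structure. Anything that needs the structure itself still has to look it up by key.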
What About Molfile?
Instead of inventing our own 'standard', what if we based our identifier on a file format already in widespread use? What about MDL V2000 molfiles?
At a minimum, we would need to specify:
- An algorithm that could convert any molfile into canonical form.
- A hashing algorithm, of which there are many good ones to choose from, such as plain old SHA-1.
Provided we could find solutions to the problems above, this would be an attractive option. We'd simply be layering additional constraints on a file format with a massive installed base. In fact, we would end up with something possessing nearly equivalent qualities to InChIKey, but with the advantage that we're extending an existing, widely-used standard.
Although we still would have the problem of not being able to deal with multicentered bonding and axial chirality, at least we'd have the possibility to include polymers.
But we'd still be leaving out a large portion of chemical space.
What About Another File Format?
No publicly-described file format that I'm aware of has all of the qualities needed to serve as a basis for a universal molecular identifier. Although some come close, close is not enough.
Every file format has something to teach. It may be possible to borrow the best elements of each, but this would be neither simple nor risk-free.
Conclusions
Hashing offers an attractive method for converting detailed, machine-readable descriptions of molecular structure into a fixed-length string suitable for use on the Web. The problem then boils down to identifying, or inventing, the means to create these descriptions.
Whether the tradeoffs in using an identifier that can't be decoded are worth the benefits of one that can be readily encoded and shared is another question altogether.
The First InChIKey Collision
Noel O'Boyle raises an important question about the canonicalization algorithms used in chemical line notations such as SMILES and InChI:
Can their canonicalisation procedures ensure that two identical molecular graphs result in the same canonical SMILES or InChI?
This isn't merely an academic exercise. One of the best uses for InChI that I've found is as a private primary key in a molecular database. For example, instead of performing a computationally intensive exact structure search to determine if a newly-submitted molecule already exists as part of a chemical registration system, we can simply convert the molecule into a string with InChI and query an indexed field in the 'molecule' table, using code highly-optimized by our database system.
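A minimal sketch of that lookup follows. It uses SQLite from the Python standard library as a stand-in for whatever database a real system would use. The table layout and the `register` helper are assumptions for illustration only.

```python
import sqlite3


def open_registry():
    """Create an in-memory registry with a uniquely indexed InChI column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE molecule ("
        " id INTEGER PRIMARY KEY,"
        " inchi TEXT NOT NULL UNIQUE)"  # UNIQUE gives us the index used for lookups
    )
    return conn


def register(conn, inchi):
    """Return (id, is_new) for an InChI, inserting a row only if it is absent."""
    row = conn.execute(
        "SELECT id FROM molecule WHERE inchi = ?", (inchi,)
    ).fetchone()
    if row is not None:
        return row[0], False
    cur = conn.execute("INSERT INTO molecule (inchi) VALUES (?)", (inchi,))
    conn.commit()
    return cur.lastrowid, True


conn = open_registry()
ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
first = register(conn, ethanol)
second = register(conn, ethanol)  # same structure submitted again
print(first, second)
```

The duplicate check is an ordinary indexed string comparison. No structure search is involved, which is the whole appeal of this approach.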
Not only that, but the way InChIs are generated assures us that some of the trickier problems in creating chemical registration systems, such as stereoisomer identification and tautomer detection, have been addressed at virtually no cost.
If there ever were a failure in a line notation canonicalization algorithm, it would probably be detected by a database maintainer noticing that the system failed to register a molecule that clearly wasn't already present, or registered a molecule that already was present.
In other words, this kind of 'bug' might be difficult to detect.
If it's any consolation, PubChem maintains the world's largest collection of freely-downloadable InChI-structure correlations (tens of millions), and as far as I know has not encountered a single InChI failure.
Another kind of failure is possible, though highly unlikely, when using InChIKey, which is generated from a cryptographic hash of the full InChI. From the Official InChI site:
There is a finite, but very small probability of finding two structures with the same InChIKey. For duplication of only the first block of 14 characters this is 1.3% in 10^9, equivalent to a single collision in one of 75 databases of 10^9 compounds each.
Those are pretty slim odds indeed. But they're not zero, either. As far as I know, not one InChIKey collision has been reported to date. But that doesn't mean one won't be found later on today.
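The quoted figure can be sanity-checked with the standard birthday approximation. The sketch below assumes the first block of an InChIKey carries about 65 bits of hash. That number is my understanding of the InChIKey design, not something stated in the quote. The function name is hypothetical.

```python
import math


def collision_probability(n_items: int, hash_bits: int) -> float:
    """Birthday approximation: chance of at least one collision among n_items
    values drawn uniformly from a space of 2**hash_bits possibilities."""
    space = 2 ** hash_bits
    return -math.expm1(-n_items * (n_items - 1) / (2 * space))


p = collision_probability(10**9, 65)
print(f"probability: {p:.2%}")  # close to the quoted 1.3%
print(f"roughly one collision per {1 / p:.0f} databases of 10^9 compounds")
```

Under that assumption the estimate comes out between 1.3% and 1.4%, in line with the "one of 75 databases" wording.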
If you wanted to create an efficient system for finding InChIKey collisions, how would you go about doing it?
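One simple starting point, sketched below, is to group full InChIs by their key and report any key that maps to more than one distinct InChI. The sketch assumes the (InChIKey, InChI) pairs have already been generated elsewhere. The function name is hypothetical and the strings in the usage example are placeholders, not real keys.

```python
from collections import defaultdict


def find_collisions(pairs):
    """pairs: iterable of (inchikey, inchi) tuples.

    Returns a dict mapping each key seen with more than one distinct
    InChI to the set of InChIs that share it.
    """
    seen = defaultdict(set)
    for key, inchi in pairs:
        seen[key].add(inchi)
    return {key: inchis for key, inchis in seen.items() if len(inchis) > 1}


# Placeholder data: KEY-A is deliberately reused for two different InChIs.
sample = [
    ("KEY-A", "InChI=placeholder-1"),
    ("KEY-B", "InChI=placeholder-2"),
    ("KEY-A", "InChI=placeholder-1"),  # exact duplicate, not a collision
    ("KEY-A", "InChI=placeholder-3"),  # same key, different InChI
]
print(find_collisions(sample))
```

For collections too large to hold in memory, the same grouping could be done by sorting on the key or by letting a database enforce a unique index on it.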
Why Chemical Abstracts Service Might Not Want To Use InChI
In late March a company called Outsell published an article on InChI by Daniel Pollock that has created a bit of a stir. The article, titled "Chemical Bonding InChI by InChI", was retracted by Outsell, who write:
On March 30th we published an Insights piece titled “Chemical Bonding InChI by InChI.” We have removed it from our archives. We pride ourselves on publishing independent, fact-based research that provides strong, substantiated analysis and recommendations about key market dynamics. In the Important Details section of the piece, we did a good job of explaining details about The International Chemical Identifier as an emerging industry standard used to describe chemical substances. However, in the Implications section we published information about Chemical Abstract Service’s highly-regarded SciFinder product that was incorrect, and we did not cite a sufficiently balanced set of references in developing our argument.
Further, it is our practice to avoid speculating about an organization’s stance on a topic without reaching out to the organization for on-the-record research briefings. Overall, the tone of the piece could be taken to single out CAS as being late in responding to the trends, and in our view the research and analysis did not support it.
We regret that this piece didn’t hold up to our internal standards and that it was not caught in our internal peer-review process before it was published. Even more, we regret that it didn’t live up to the high standards you are accustomed to expecting from us. We apologize for the circumstances leading up to this.
The reasons for retraction seem to revolve around statements made about SciFinder and speculations made in the article about the Chemical Abstracts Service (CAS) response to InChI.
The original article is available through Google's cache.
The statements about SciFinder and the CAS response to InChI appear to have been made in two paragraphs:
The current gold standard for identifying chemical substances are proprietary Chemical Abstracts Service (CAS) Registry Numbers, owned and operated by the American Society of Chemistry [sic] (ACS). We do not yet know if CAS plans to map its database to InChI. However, given that CAS has been criticised for its proprietary approach in the past, and took until April 2008 to release a web based version of its flagship SciFinder database, in Outsell’s opinion we may have to wait a while yet.
However, we do hope that this is not the case since it is important that information providers do not Balkanize their information if they are not to get lost in the web (see Insights 18 July 2008, Nature Publishing Group Sets the Cat Amongst the Pigeons of Open Access, But Maybe We’re All Missing the Point). The point here is that open standards can benefit all by making information (products) easier to discover, and this speaks to one of the core demands of the networked environment. So, for example, CAS’s index of 40 million substances is not threatened by open standards and, in fact, our view is that mapping CAS numbers to an [sic] standard such as InChI can only help to make it more accessible. And with over 20 million substances now indexed by ChemSpider, the InChI could emerge as a - if not the - industry standard index of chemical substances on the web.
The retraction has prompted at least four reactions so far:
The Sceptical Chymist asks "Outsell now say that the original article wasn’t balanced and that the ‘tone of the piece could be taken to single out CAS as being late in responding to the trends’. Surely readers could make that judgement for themselves?"
Antony Williams notes that "Conspiracy theories are already moving around the community."
Peter Murray-Rust suspects "In short the best guess is that CAS see InChIs as a threat (I’ll discuss the foolishness of this below) and that they put pressure on Outsell to retract."
Activity on the CHMINF-L list.
For what it's worth, there are legitimate technical and legal arguments for why CAS may not want to "embrace" InChI just yet. Among them:
no specification that would enable the creation of an independent implementation;
single implementation with an open source license that has come under legitimate criticism;
fails to generate unique identifier for many molecules likely present in CAS Registry (e.g., ferrocene, (R)- and (S)- BINAP);
InChI itself is problematically long (breaks HTML layouts, for example) for even medium-sized molecules;
InChIKey has no mechanism for backward-compatibility with newer versions that may fix bugs or add features to the existing implementation;
the "final" version has only very recently been available.
I have no idea whether any of these factors are important to CAS now or in the future. I also don't know why the Outsell article was retracted beyond what's available on their site.
What I can say is that it's inaccurate to portray InChI as anything but a technology with some potential at this point. The Outsell article painted a picture of something a bit more, and this may explain, in part, the retraction.
Update: it now appears that the retraction on the Outsell site has been shortened to: "On March 30th we published an Insights piece titled "Chemical Bonding InChI by InChI." We have removed it from our archives."
Build a RESTful Chemical Registration System from Scratch Part 1: Tools of the Trade
A chemical registration system forms the core of most database-driven cheminformatics applications. Yet detailed instructions, in the traditional literature or otherwise, on how to create one from free components are surprisingly rare. This article introduces a new Depth-First series aimed at bringing together several tutorials written over the last year to create a RESTful chemical registration system that anyone can build, run, and adapt to their own needs.
Defining the Problem
Whether you're building or designing a database-driven chemical informatics system, at some point you'll face the problem of getting molecules into and out of your database. This is where chemical registration systems come in. eMolecules has created a summary on the subject. It defines the main responsibilities of a chemical registration system as ensuring:
Structural novelty - The same molecule never gets stored twice.
Structural normalization - When multiple representations of a molecule are possible (e.g., tautomers and charge-separated forms), only one is used.
Structure drawing - Present a chemical structure recognizable to chemists.
Consistent relationships among related compounds - The system must decide what to do with various salt forms (or other mixtures) of a particular compound a user might decide to register. There are many options, but they must be applied consistently.
Reasonable behavior when a structure is (partially) unknown - Not every compound of interest will have a known chemical structure. Sometimes the structure will only be partially known, as in the case of double bond geometry and absolute stereochemistry.
Security - Enough said? There's always more.
Reasonable behavior when changes are made to structures - The system must be able to respond well to the inevitable: a user changes their mind about the structure s/he entered.
The system we'll build in this tutorial won't deal with all of these responsibilities, but it will handle most of them. In addition, it will address some other problems as well.
The Approach
We'll be building a Web Service, which is defined by the W3C as "a software system designed to support interoperable machine-to-machine interaction over a network."
The reason is simple: we want our chemical registration system to be addressable from anywhere in the world, and we want to use it as an interchangeable, technology-agnostic, loosely-coupled component to build more complex chemistry Web applications.
We want this system to be as easy to deploy as possible on any hardware. Lengthy configuration processes, source code compilation, and exotic dependencies are out. Drag-and-drop deployment, self-contained packaging, convention over configuration, and platform-independent binaries are in.
REST?
There are currently multiple competing approaches for creating Web Services. One of the most flexible and straightforward to implement is Representational State Transfer (REST). In a nutshell, REST leverages the full HTTP protocol for passing messages to and from the server. This simple idea has some powerful implications for the design of the system, which will be explored in articles to come.
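As a rough illustration of what "leveraging the full HTTP protocol" can mean, the sketch below maps HTTP methods onto operations against an in-memory store. The `/molecules/<key>` URL layout and the `handle` function are hypothetical and are not the design this series will arrive at. No real networking is involved.

```python
def handle(method, path, body, store):
    """Dispatch one request against an in-memory registry.

    Returns a (status_code, response_body) tuple. The HTTP method alone
    decides whether a resource is read, created/replaced, or removed.
    """
    prefix = "/molecules/"
    if not path.startswith(prefix) or len(path) == len(prefix):
        return 404, ""
    key = path[len(prefix):]
    if method == "GET":
        return (200, store[key]) if key in store else (404, "")
    if method == "PUT":
        created = key not in store
        store[key] = body
        return (201 if created else 200), body
    if method == "DELETE":
        if key in store:
            del store[key]
            return 204, ""
        return 404, ""
    return 405, ""  # method not allowed on this resource


store = {}
print(handle("PUT", "/molecules/abc", "molfile text here", store))
print(handle("GET", "/molecules/abc", "", store))
print(handle("DELETE", "/molecules/abc", "", store))
print(handle("GET", "/molecules/abc", "", store))
```

Each molecule gets its own URL, and the verb carries the intent, so clients need no service-specific vocabulary beyond HTTP itself.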
Tools
We will use a number of free, open technologies in the creation of our system:
Technology Platform - Java will be used exclusively due to its massive installed base, platform-independence, and high performance.
Cheminformatics Toolkit - MX will supply the main interface between chemistry and Java.
Unique Identifier - InChI will be used to assign unique identifiers to compounds stored in the registry.
Server - Jetty will supply basic HTTP functionality.
Servlet - Restlet will simplify the implementation of REST using the servlet specification.
Database - H2 will provide fast, portable, zero-administration SQL support.
Object Persistence - The exact method of persisting Java objects hasn't been settled yet, but Active Objects looks quite interesting, especially when combined with H2.
Conclusions
Chemical registration systems play a vital role in enabling data-driven chemistry applications. This article introduced the problems registration systems typically solve and outlined a plan for implementing one using only free, open components. The next article in this series will discuss the design of the registry Web Service.
Image Credit: Phillip Torrone
Mr. InChI: Tear Down This Wall
InChI, useful as it may be, has some important limitations. One of the biggest relates to portability. The InChI source code is written in C, meaning that developers in other languages need to jump through hoops of varying degrees of difficulty to get InChI to work with their development platform of choice. Compounding the problem is the near-total lack of documentation that would guide third-party implementers in creating their own de-novo InChI generators.
Like it or not, if you do InChI and you don't develop in C or C++, you'll eventually face the gnarly problem of how to integrate this oddball native library into your code base and maintain it.
But, you may argue, InChI is written in C and C source is portable across platforms. What's the big deal?
True enough, but C binaries most definitely are not portable. That means that your application or library needs to become aware of differences in its target platforms - in most cases far too aware.
If you're working in a platform-independent language like Java, Python, or Ruby, this can drive you nuts. If not for the single InChI library dependency, you could distribute one version of your application or library and be done with it.
With InChI in the mix, you'll need to worry about all kinds of things you shouldn't have to. Linux, Windows, or OS X? 32-bit or 64-bit? Intel or PowerPC?
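To give a feel for the bookkeeping involved, here is a sketch of the kind of lookup a wrapper around a native library ends up carrying. The `libexample` file naming scheme and the tables of supported platforms are hypothetical. Real wrappers define their own conventions.

```python
import platform

# Hypothetical tables: operating system -> (tag, shared library extension),
# and CPU name as reported by the OS -> architecture tag.
OS_TAGS = {
    "linux": ("LINUX", ".so"),
    "windows": ("WINDOWS", ".dll"),
    "darwin": ("MAC", ".dylib"),
}
ARCH_TAGS = {
    "i386": "X86", "i686": "X86", "x86": "X86",
    "x86_64": "X86_64", "amd64": "X86_64",
    "ppc": "PPC", "powerpc": "PPC",
}


def native_library_name(system=None, machine=None):
    """Pick the prebuilt binary matching an OS/CPU pair.

    Defaults to the platform the interpreter is running on; raises
    RuntimeError when no binary was built for that combination.
    """
    system = (system or platform.system()).lower()
    machine = (machine or platform.machine()).lower()
    if system not in OS_TAGS or machine not in ARCH_TAGS:
        raise RuntimeError(f"no prebuilt binary for {system}/{machine}")
    os_tag, extension = OS_TAGS[system]
    return f"libexample-{os_tag}-{ARCH_TAGS[machine]}{extension}"


print(native_library_name("Linux", "i686"))     # libexample-LINUX-X86.so
print(native_library_name("Windows", "AMD64"))  # libexample-WINDOWS-X86_64.dll
```

Every entry in those tables is a binary that someone has to compile, test, and ship. Any platform missing from them simply fails at load time.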
It's not like there aren't various solutions to the problem. Several articles have appeared on Depth-First describing some workarounds, but each introduces its own limitations:
From C Source Code to Platform-Independent Executable Jarfile: Using NestedVM to Build JInChI
A Simple and Portable Ruby Interface to InChI - Part 2: Silencing Console Output
Another option for Java is to use the Java Native Interface InChI Wrapper. This library, written by Sam Adams and Jim Downing, is distributed with precompiled InChI binaries, which makes integration a little easier.
But for one small example of the kinds of limitations even this seemingly good solution brings, and the kind of valuable time that gets wasted on the C InChI dependency, consider this JRuby console output:
$ jirb
irb(main):001:0> require 'jniinchi-0.5-jar-with-dependencies.jar'
=> true
irb(main):002:0> import 'net.sf.jniinchi.JniInchiWrapper'
=> ["net.sf.jniinchi.JniInchiWrapper"]
irb(main):003:0> JniInchiWrapper.loadLibrary
ERROR net.sf.jnati.deploy.NativeLibraryLoader - Error loading native library: /home/rich/.jnati/repo/jniinchi/1.6/LINUX-X86/libJniInchi-1.6-LINUX-X86.so
java.lang.UnsatisfiedLinkError: /home/rich/.jnati/repo/jniinchi/1.6/LINUX-X86/libJniInchi-1.6-LINUX-X86.so: libstdc++.so.5: cannot open shared object file: No such file or directory
Something that should be easy as pie is anything but.
By the way, the shared object in question was alive and well on my filesystem. A similar error occurs with Jython.
There may or may not be a solution to this problem. But let's not lose sight of the bigger question - why does the problem exist in the first place and repeat itself with frustrating regularity?
If the InChI team want InChI and InChIKey to become a truly universal identifier, a clearly-written specification, documented C source code, and a validation suite are essential. Until then we'll have to keep dodging bullets on our way around Checkpoint Charlie.

