Five Questions About the InChI Resolver
Yesterday the Royal Society of Chemistry (RSC) and ChemZoo (of ChemSpider fame) announced a plan to collaborate on the creation of an InChI Resolver service. From the announcement:
Using the InChI - an IUPAC standard identifier for compounds - scientists can share and contribute their own molecular data and search millions of others from many web sources. The RSC/ChemSpider InChI Resolver will give researchers the tools to create standard InChI data for their own compounds, create and use search engine-friendly InChIKeys to search for compounds, and deposit their data for others to use in the future.
…
The InChI Resolver will be based on ChemSpider's existing database of over 21 million chemical compounds and will provide the first stable environment to promote the use and sharing of compound data. 'ChemSpider hosts the largest and most diverse online database of chemical structures sourced from over 150 different data sources' adds Antony Williams of ChemSpider, 'We have embraced the InChI identifier as a key component of our platform and the basis of our structure searches and integration path to a number of other resources. We have delivered a number of InChI-based web services and, with the introduction of the InChI Resolver, we hope to continue to expand the utility and value of both InChI and the ChemSpider service.'
It's encouraging to see a major scientific publisher lend its support to InChI, which is further evidence of the identifier's broad adoption. An InChI key resolver is also something I've previously said might be a good idea.
Still, InChI and the InChI key represent a significant change in platform for the field of chemistry, in which CAS Registry Numbers are the gold standard for chemical identification.
If we've learned anything from the last 30 years of information technology, it's that once a platform (no matter how dysfunctional) becomes entrenched, nothing short of a game-changing strategy and herculean effort can replace it. The failure of Windows Vista offers a stark reminder of the power of an entrenched platform. Closer to home, the failure of V3000 molfiles to gain significant traction against V2000 offers another.
With these thoughts in mind, here are some questions about the new InChI Resolver service:
- What problem is the service really trying to solve? Although it might be obvious to those close to the situation, it's not quite clear to me. Many, if not most, of the desktop cheminformatics packages sold today can generate InChIs. It's also possible to embed InChI in text documents without using a Web service. Convenient it's not, which may be the point. But if that's the case, then the focus of the service should be convenience, simplicity, and ease of use.
- How hard would it be to crack an InChI hash? Before dismissing this as impossible, consider that an InChI key is a hash rather than strong encryption, and a short one at that. Attacks on hashing and encryption schemes have a long history in computer science. Given the regularity of InChI syntax, how hard would it be to create software that can computationally recover the InChI that was used to generate an InChI key? What alternative hashing method might make it easier to do so? If there is one, it would become the standard, not the one currently being used.
- How will the authenticity of a hashed InChI from an untrusted source be verified? An InChI key might take the form of 'AAAAAAAAAAA-BBBBBBB-XYZ'. Given an arbitrary InChI key provided by an untrusted third party, how would you independently verify that it actually represents a valid key? In the absence of software like that described in the second question, it would be impossible.
- What about BINOLs and Ferrocenes? InChI can't distinguish between stereoisomers arising from axial chirality, like that found in widely used molecules such as BINOL. There are multiple ways to represent organometallics such as ferrocene using InChI, and each gives rise to a different InChI key. This is a Bad Thing.
- Why bother with an InChI key at all? Consider a hypothetical InChI key: 'AAAAAAAAAAA-BBBBBBB-XYZ'. To an end user uninterested in information technology, why does it matter how the key was generated? One selling point might be that given an arbitrary key, the chemical structure it represents can be decoded independently of any service. But that service is the core of the RSC/ChemSpider proposal, and it will apparently only be able to resolve previously deposited InChI keys. Sound familiar? This is essentially how the CAS Registry system works, except the CAS system can differentiate BINOL stereoisomers, uniquely identify organometallics, and even handle polymers and complex mixtures.
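To make the questions about hashing and resolution concrete, here is a minimal sketch of a lookup-based resolver. Everything in it is an assumption made for illustration: `toy_key` is a truncated SHA-256 digest standing in for the real InChI key algorithm (it does not produce real InChI keys), and `ToyResolver` is a hypothetical in-memory table, not the RSC/ChemSpider design.

```python
import hashlib
from typing import Dict, Optional


def toy_key(inchi: str) -> str:
    """Stand-in for an InChI key: a truncated SHA-256 digest of the InChI.

    Assumption for illustration only; this is NOT the real InChI key
    algorithm. It only mimics the one-way, fixed-length property.
    """
    return hashlib.sha256(inchi.encode("utf-8")).hexdigest()[:24]


class ToyResolver:
    """Hypothetical resolver: a lookup table from key to deposited InChI."""

    def __init__(self) -> None:
        self._table: Dict[str, str] = {}

    def deposit(self, inchi: str) -> str:
        # Store the InChI under its key and hand the key back.
        key = toy_key(inchi)
        self._table[key] = inchi
        return key

    def resolve(self, key: str) -> Optional[str]:
        # Only keys that were deposited earlier can be resolved.
        return self._table.get(key)


def verify(key: str, claimed_inchi: str) -> bool:
    # A key can be checked only if someone supplies the InChI it claims
    # to represent; the key alone carries no recoverable structure.
    return toy_key(claimed_inchi) == key


METHANE = "InChI=1S/CH4/h1H4"
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"

resolver = ToyResolver()
methane_key = resolver.deposit(METHANE)
ethanol_key = toy_key(ETHANOL)  # computed locally, never deposited

print(resolver.resolve(methane_key))  # the deposited InChI comes back
print(resolver.resolve(ethanol_key))  # None: the resolver has never seen it
print(verify(methane_key, METHANE))   # True: re-hashing confirms the pair
```

In this sketch, a key that was never deposited resolves to nothing, and a key from an untrusted source can be checked only by re-hashing a candidate InChI. Short of the kind of inversion software the second question imagines, the lookup table is the only way back to a structure.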
Within the RSC/ChemZoo proposal is a gem of an idea. The CAS Registry system is closed and in all likelihood will remain so. Verifying the authenticity of CAS number/chemical structure assignments is a big problem made worse by the closed nature of the CAS Registry system. Chemists must have a reliable method to reference chemical structures. There are no doubt many possible solutions to this problem, with a big payoff to the field of chemistry for the one that actually works.