Five Questions About the InChI Resolver
Yesterday the Royal Society of Chemistry (RSC) and ChemZoo (of ChemSpider fame) announced a plan to collaborate on the creation of an InChI Resolver service. From the announcement:
Using the InChI - an IUPAC standard identifier for compounds - scientists can share and contribute their own molecular data and search millions of others from many web sources. The RSC/ChemSpider InChI Resolver will give researchers the tools to create standard InChI data for their own compounds, create and use search engine-friendly InChIKeys to search for compounds, and deposit their data for others to use in the future.
...
The InChI Resolver will be based on ChemSpider's existing database of over 21 million chemical compounds and will provide the first stable environment to promote the use and sharing of compound data. 'ChemSpider hosts the largest and most diverse online database of chemical structures sourced from over 150 different data sources' adds Antony Williams of ChemSpider, 'We have embraced the InChI identifier as a key component of our platform and the basis of our structure searches and integration path to a number of other resources. We have delivered a number of InChI-based web services and, with the introduction of the InChI Resolver, we hope to continue to expand the utility and value of both InChI and the ChemSpider service.'
It's encouraging to see a major scientific publisher lend its support to InChI in further evidence of the broad adoption of the identifier. And an InChI key resolver is something I've previously said might be a good idea.
Still, InChI and InChI Key represent a significant change in platform for the field of chemistry, in which CAS Registry Numbers are the gold standard for chemical identification.
If we've learned anything from the last 30 years of information technology, it's that once a platform (no matter how dysfunctional) becomes entrenched, nothing short of a game-changing strategy and herculean effort can replace it. The failure of Windows Vista offers a stark reminder of the power of an entrenched platform. Closer to home, the failure of V3000 molfiles to gain significant traction against V2000 offers another.
With these thoughts in mind, here are some questions about the new InChI Resolver service:
What problem is the service really trying to solve? Although it might be obvious to those close to the situation, it's not quite clear to me. Many, if not most, of the desktop cheminformatics packages sold today now have support for generating InChIs. It's also possible to embed InChI in text documents without using a Web service. Convenient it's not, which may be the point. But if that's the case then the focus of the service should be convenience, simplicity, and ease of use.
How hard would it be to crack an InChI hash? Before dismissing this as impossible, consider that an InChI key is a form of encryption, and a weak one at that. Breaking encryption schemes has a long history in computer science. Given the regularity of InChI syntax, how hard would it be to create software that can computationally provide the InChI that was used to generate an InChI key? What alternative hashing method might make it easier to do so? If there is one, it would become the standard, not the one currently being used.
How will the authenticity of a hashed InChI from an untrusted source be verified? An InChI key might take the form of 'AAAAAAAAAAA-BBBBBBB-XYZ'. Given an arbitrary InChI key provided by an untrusted third party, how would you independently verify that it actually represents a valid key? In the absence of software like that described in Question 2, it would be impossible.
What about BINOLs and Ferrocenes? InChI can't distinguish between stereoisomers arising from axial chirality such as that found in widely-used molecules such as BINOL. There are multiple ways to represent organometallics such as ferrocene using InChI, and each will give rise to a unique InChI key. This is a Bad Thing.
Why bother with an InChI key at all? Consider a hypothetical InChI key: 'AAAAAAAAAAA-BBBBBBB-XYZ'. To an end user uninterested in information technology, why does it matter how the key was generated? One selling point might be that given an arbitrary key, the chemical structure it represents can be decoded independently of any service. But that service is the core of the RSC/ChemSpider proposal - and it will apparently only be able to resolve previously-deposited InChI keys. Sound familiar? This is essentially how the CAS Registry system works, except the CAS system can differentiate BINOL stereoisomers, uniquely identify organometallics, and even handle polymers and complex mixtures.
Within the RSC/ChemZoo proposal is a gem of an idea. The CAS Registry system is closed and in all likelihood will remain forever so. Verifying the authenticity of CAS number/chemical structure assignments is a big problem made worse by the closed nature of the CAS Registry system. Chemists must have a reliable method to reference chemical structures. There are no doubt many solutions to this problem with big payoffs to the field of chemistry for the one that actually works.


Just a few comments:
ad 2/ - a hash is not encryption. There is an infinite number of possible input values that would give one particular hash. Therefore it is not possible to 'decode' a hash, only to create an input that yields the same hash. That input would, however, most likely not have the form of an InChI or, if it did, would not represent any meaningful structure. Therefore an automatic system for decrypting InChIKeys is not just a matter of cracking the hash.
ad 5/ For CAS number you need a service in both ways - to obtain CAS and to resolve it. For InChIKey only resolving needs a service, creation is independent.
Beda, thanks for pointing out the mistake in my terminology. Straight hashing is of course one-way by definition because each hash has an infinite number of reverse hashes.
On the other hand, there may be ways to combine a hashcode with non-hash data (carbon count, molecular mass, etc.) while taking advantage of the InChI syntax to effect an InChI Key resolution algorithmically. The key would be a hybrid of hash and non-hash data.
That still might not limit the possibilities enough. But depending on the application, getting a family of possible InChIs from a hash may be acceptable.
As for practicality... that's another matter. Brute force seems to be the only way to do it on face value.
Still, without the ability to find at least one InChI that could arise from a given InChI key, how would a resolver service verify validity of user input?
Regarding Q5, if a resolver service is needed for lookup, doesn't that mean that everyone will still need to use the service? In other words, how many users will only generate InChI keys and never resolve them?
And if the system requires a resolver service to fully function, why not use a more flexible method for generating keys?
The hash function is SHA-256, which is a "cryptographic hash function", meaning that it's designed so #2 is hard. As a consequence, #3 is impossible. It's not possible to tell that the hash can come from a real molecule, other than brute force.
Why is #3 important? What's the abuse you're trying to prevent? If someone submits fraudulent data, it'll be found out once anyone takes a look at it. Enough of those and the provider will likely be kicked out.
Having the hash does let people answer one question: can I determine that a given molecule is novel in an external database without revealing the structure?
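A minimal sketch of that novelty check, assuming a stand-in key function. The name `toy_key` and its 14-character truncation are hypothetical and are not the real InChIKey algorithm; the point is only that the database side can publish keys and the querying side can test membership without sending a structure.

```python
import hashlib


def toy_key(identifier: str, length: int = 14) -> str:
    """Return a short fixed-length digest of an identifier string.

    Illustrative stand-in only: a truncated SHA-256 hex digest,
    NOT the actual InChIKey encoding.
    """
    digest = hashlib.sha256(identifier.encode("utf-8")).hexdigest()
    return digest[:length].upper()


def is_novel(identifier: str, known_keys: set[str]) -> bool:
    """True if the identifier's key is absent from the published key set."""
    return toy_key(identifier) not in known_keys


# The external database exposes only keys, never the structures behind them.
database_keys = {
    toy_key("InChI=1S/CH4/h1H4"),   # methane
    toy_key("InChI=1S/H2O/h1H2"),   # water
}

# The querying side hashes locally and only compares keys.
print(is_novel("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3", database_keys))  # ethanol
print(is_novel("InChI=1S/CH4/h1H4", database_keys))                  # methane
```

A miss tells the database nothing about the query structure, which is the property being described; a hit only confirms that both sides hold the same identifier.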
BTW, I understand that there will be a new InChI hash format in a few months.
So if a less secure hash function were used and the full identifier included some connection-table derived information...
It's not so much abuse but good old-fashioned confusion about the 'right' way to generate the key. Without a method to verify that a key is in fact valid for use with a given system, there's no way to determine that anybody should be 'kicked out.'
I'm not talking about adding a checksum - CAS numbers have that, too. I'm talking about knowing that an actual valid InChI was used to generate a particular InChI Key.
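For reference, here is roughly what the CAS check digit does and does not tell you. This is a sketch of the commonly documented rule (weight each digit by its position counted from the right, sum, take the result modulo 10); the function names are mine.

```python
def cas_check_digit(body_digits: str) -> int:
    """Check digit for the digits of a CAS number, hyphens and check digit removed."""
    total = sum(
        position * int(digit)
        for position, digit in enumerate(reversed(body_digits), start=1)
    )
    return total % 10


def is_well_formed_cas(cas: str) -> bool:
    """True if the string has the NNNN-NN-N shape and a matching check digit."""
    parts = cas.split("-")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return False
    first, second, check = parts
    if not (2 <= len(first) <= 7 and len(second) == 2 and len(check) == 1):
        return False
    return cas_check_digit(first + second) == int(check)


print(is_well_formed_cas("7732-18-5"))  # water
print(is_well_formed_cas("7732-18-4"))  # same digits, wrong check digit
print(is_well_formed_cas("1234-56-6"))  # passes the arithmetic regardless of what it refers to
```

The check catches a mistyped digit, but a number that passes it says nothing about whether a real structure stands behind it. That second kind of verification is the one I'm asking about.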
If the system only accepts InChI Keys generated by the system itself (i.e., the user needs to draw the structure using a tool the system provides), then why even bother with the InChI Key at all? Why not develop a method that takes full advantage of the Web as an information exchange platform and which can handle all of chemistry?
In my view, the inability to verify the authenticity of an InChI key makes the system about as useful as CAS numbers from the perspective of resolution.
So why not either:
a) keep on using CAS numbers?
b) create a key-like identifier not based on InChI at all?
Very interesting idea. But you can hash any identifier to get the same result. For example, a CAS number resolver could simply accept a hashed form as input. Unless a match were found, the system would have no idea what molecule or CAS number the user queried.
My concern is that InChI hashing systems requiring a resolver bring the system one step closer to looking like the dominant platform - the CAS Registry.
The more like the entrenched platform an upstart looks, the less likely it is to do anything significant. There tends to be only one winner in the platform game.
It looks like you want the hash to contain something which can be used to verify the hash comes from a real molecule. If you do that then you need a longer hash to get the same uniqueness assurances you have from using SHA-256. Either you have a higher likelihood of hash collision (because some keys occur more often) or you need longer keys.
Hashing CAS #s wouldn't work, for two reasons: it would still require something which converts the structure into CAS# (that's Beda's point) and CAS#s are too dense and predictable. That is, I can brute force generate them, get the hash, and thus get the mapping from hash query to original structure. "Brute force" here means generating 100 million identifiers, not the 2**(8*14) needed to enumerate the InChI hashes.
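To make the "dense and predictable" point concrete, here is a small sketch of the brute-force table. It assumes the hashed form is a plain SHA-256 of the CAS string (my assumption, any fixed hash would do) and only enumerates a thousand numbers rather than the full range.

```python
import hashlib


def cas_check_digit(body_digits: str) -> int:
    """CAS check digit: position-weighted digit sum (from the right) modulo 10."""
    return sum(
        position * int(digit)
        for position, digit in enumerate(reversed(body_digits), start=1)
    ) % 10


def format_cas(n: int) -> str:
    """Turn an integer of at least four digits into a well-formed CAS-style string."""
    body = str(n)
    return f"{body[:-2]}-{body[-2:]}-{cas_check_digit(body)}"


def hashed(cas: str) -> str:
    return hashlib.sha256(cas.encode("ascii")).hexdigest()


def build_reverse_table(start: int, stop: int) -> dict[str, str]:
    """Map hash -> CAS string for every integer in [start, stop)."""
    return {hashed(format_cas(n)): format_cas(n) for n in range(start, stop)}


# A tiny slice of the space; the full enumeration is the same loop run longer.
table = build_reverse_table(773_000, 774_000)

# Someone sends only a hash as a "private" query...
query = hashed("7732-18-5")
# ...and the table gives the original identifier straight back.
print(table.get(query))
```

Because every candidate can be generated in sequence, the hash hides nothing. The same loop is not available for structure-derived keys, since there is no comparably small list of inputs to walk through.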
If the registration system works as I think it does then it's easy to check if there's a registration problem. You do the search with the key, get a result, go to the provider and see what they have. If the structure they have doesn't generate the hash they say it does, there's a problem.
Of course, if they have good data but generate the key wrong, then there's a different problem. The data provider should be checking that the entire flow works correctly, but not everyone does that. There are also a few things where you have to just know which InChI flags to use to generate the key. ChemSpider et al. will likely provide guidance in that.
My complaint with all this is that I still don't see why I should be using InChIs. The cInChI program is only designed to turn an SD file into an InChI and while it's possible to turn an InChI into an SD file, it's not a primary goal. Yes, many of the data vendors and software vendors include InChI but I think it's because it's easy to add, combined with peer pressure. It just becomes a feature tick.
Sort of, but not quite. For a specific example, what if the molecular weight of the molecule were inserted between the hash part of the key and the checksum:
AAAAAAAAAAA-BBBBBBB-1024.77-XYZ
I know this adds to the length of the key, but this is just a thought experiment at this point.
Then it would be a bit simpler to discover the InChI because only those molecules with a molecular weight of 1024.77 need be considered.
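As a rough sketch of the thought experiment (everything here is hypothetical: the `toy_hash` stand-in, the `HASH-WEIGHT` layout with the checksum left out, and the three-molecule candidate pool), the weight field acts as a cheap first filter before any hashing is attempted:

```python
import hashlib
from typing import Iterable, NamedTuple


class Candidate(NamedTuple):
    inchi: str
    weight: float  # molecular weight, however the scheme defines it


def toy_hash(inchi: str) -> str:
    """Stand-in hash block; NOT the real InChIKey encoding."""
    return hashlib.sha256(inchi.encode("utf-8")).hexdigest()[:14].upper()


def make_hybrid_key(candidate: Candidate) -> str:
    """Hypothetical key: hash block, then weight to two decimals."""
    return f"{toy_hash(candidate.inchi)}-{candidate.weight:.2f}"


def resolve(hybrid_key: str, candidates: Iterable[Candidate]) -> list[str]:
    """Return InChIs whose weight field and hash block both match the key."""
    hash_part, weight_part = hybrid_key.rsplit("-", 1)
    # Step 1: discard everything with a different weight field (no hashing needed).
    shortlist = [c for c in candidates if f"{c.weight:.2f}" == weight_part]
    # Step 2: hash only the survivors.
    return [c.inchi for c in shortlist if toy_hash(c.inchi) == hash_part]


pool = [
    Candidate("InChI=1S/CH4/h1H4", 16.04),
    Candidate("InChI=1S/H3N/h1H3", 17.03),
    Candidate("InChI=1S/H2O/h1H2", 18.02),
]

key = make_hybrid_key(pool[2])
print(resolve(key, pool))
```

The sketch still needs a pool of candidate structures to filter, so the hard part is unchanged: generating that pool for an arbitrary weight.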
How will the resolver system 'go to the provider to see what they have'?
PubChem only knows about one compound with weight 1024.77, cid=15940270. Given the precision you specify, I don't think it would be that hard to enumerate all possible molecular formulas which can produce it, though of course the combinatorics of generating all such structures will likely still be large.
That is, how many structures can you make with the formula C61H78ClN7O5 ?
How does that weight help things? For example, if I'm trying to put fake data in the system, wouldn't I just generate numbers which match real formulas?
Which molecular weight values should I use to generate the number? What if some of the atoms are isotopically tagged?
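The question of which values to use is easy to show with the formula above. This sketch computes the weight of C61H78ClN7O5 from two atomic-weight tables; both tables are values I am assuming for illustration (one set of five-to-six-figure weights of the kind widely tabulated, one more heavily abridged), and the parser only handles simple formulas without parentheses or isotopes.

```python
import re

# Assumed tables, for illustration only.
WEIGHTS_LONGER = {"C": 12.0107, "H": 1.00794, "N": 14.0067, "O": 15.9994, "Cl": 35.453}
WEIGHTS_ABRIDGED = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "Cl": 35.45}


def formula_weight(formula: str, table: dict[str, float]) -> float:
    """Sum atomic weights for a simple formula such as 'C61H78ClN7O5'."""
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += table[symbol] * (int(count) if count else 1)
    return total


formula = "C61H78ClN7O5"
print(f"{formula_weight(formula, WEIGHTS_LONGER):.2f}")
print(f"{formula_weight(formula, WEIGHTS_ABRIDGED):.2f}")
```

The first table gives about 1024.77 and the second about 1024.79, so two people computing the "same" weight field for the same molecule can disagree in the second decimal place before isotopes even enter the picture.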
As to the phrase "go to the provider", that's not done through the resolver. I assume it will work like Google. You search in one place, it gives you results, which includes links to the primary sources. Hence "go to the provider" means "follow a URL."
A lot.
Depends on how efficiently you can eliminate candidate structures based on the rest of the identifier. Brute force is not practical - you'd need a way to do it more intelligently.
Well, you could use average masses for all natural abundance atoms and exact masses for the rest - just like you'd do anyway.
How does that help with the problem of verifying that the key came from a real InChI?
"How does that help with the problem of verifying that the key came from a real InChI?"
Since I still don't understand your concern about this - are you worried about accidental or deliberate corruption of the database? False positives or false negatives? - I can't really answer.
Short version: do a search, get a response, use the response to go to the provider, get the original compound data from the provider, double-check that the compound data generates the correct InChI key.
To me that seems obvious.
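In code, the double-check in the short version might look something like the sketch below. The `Record` fields, the example URLs, and the `toy_key` stand-in are all hypothetical; a real check would plug in an actual InChIKey generator as `make_key` and fetch the structure from the provider's page.

```python
import hashlib
from typing import Callable, Iterable, NamedTuple


class Record(NamedTuple):
    claimed_key: str   # the key the provider registered
    structure: str     # what the provider actually serves (here, an InChI string)
    source_url: str    # where the search result points


def toy_key(structure: str) -> str:
    """Stand-in key generator; NOT the real InChIKey algorithm."""
    return hashlib.sha256(structure.encode("utf-8")).hexdigest()[:14].upper()


def find_mismatches(
    records: Iterable[Record], make_key: Callable[[str], str]
) -> list[Record]:
    """Return records whose structure does not regenerate the claimed key."""
    return [r for r in records if make_key(r.structure) != r.claimed_key]


water = "InChI=1S/H2O/h1H2"
methane = "InChI=1S/CH4/h1H4"

records = [
    Record(toy_key(water), water, "http://provider.example/1"),
    # Key registered for water, but the page serves methane.
    Record(toy_key(water), methane, "http://provider.example/2"),
]

for bad in find_mismatches(records, toy_key):
    print("key does not match structure at", bad.source_url)
```

Nothing here requires inverting the hash. The check runs forward from the provider's data, which is why it can be automated and sampled per provider.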
Both, since I know both will happen. The identifier will get copied by hand and corrupted that way. It will get mis-copied with the system clipboard. And there's always that other class of internet user who I'd politely ask to get a life.
If a machine can't verify what's on the other end of the link and take aggressive moves to correct misinformation, the system will quickly cease to be useful.
IMO, people-powered curation is not an option - it may seem to work in the short term, but organic chemistry is just too big.
The size of chemistry space isn't relevant to this discussion. InChI keys assume space is small - from the birthday paradox, well under 2**64 compounds.
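The birthday-paradox estimate is quick to compute. This sketch uses the standard approximation p ≈ 1 - exp(-n(n-1)/(2N)) for n items hashed uniformly into N = 2**bits values; `hash_bits` is left as a parameter, and no particular figure for the InChI key is assumed here.

```python
import math


def collision_probability(n_items: int, hash_bits: int) -> float:
    """Approximate chance that at least two of n_items share a hash value."""
    space = 2 ** hash_bits
    exponent = n_items * (n_items - 1) / (2 * space)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-exponent)


for n in (10**8, 2**32):
    print(n, collision_probability(n, hash_bits=64))
```

With 64 bits, a hundred million items collide with probability well under one in a thousand, while 2**32 items collide with probability around 0.39. In general the comfortable range ends near the square root of the hash space, long before the space itself is exhausted.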
Curation is not only people-powered, it's assisted by software. I found format errors in various bioinformatics data sets not because I hand inspected them but because I developed a tool to help me with my checks.
Bioinformatics data sets have a comparable number of records (96 million in GenBank), with sources from lightly-curated machine generated sequencing data to heavily-curated, closely analyzed sequences. Why doesn't their success give you any confidence in cheminformatics data sets?
The size of chemistry space also isn't relevant because you should be looking at the trustworthiness of the providers. You can sample their error rates easily, especially with machine assist, and judge if it's good enough. There are only going to be tens or hundreds of providers, I think. Or rather, most of the data will come from a small number of groups, and perhaps a lot of groups will contribute a few tens of structures, but no long-tail effect.
What error rate is acceptable?
Google, to use an extreme example, is not directly curated, but the algorithms are based on a lot of experience and software assistance.
From my point of view, InChI key for normal people now is mostly about web searching (it was designed for this and should be better than InChI itself). Do you have a molecule on your web? Put an InChI key there along with other data. Then when a user wants to find this structure, he generates an InChI key, puts it into Google and sees what comes out. Neither the creator of the website, nor the user needs to resolve the InChI key. On the other hand, should InChI key be used as a replacement of/alternative to CAS number (that means published as the only information about a structure), then the resolver would be needed. In fact the mere existence of the resolver may make it necessary to have a resolver :), because more and more people will start to use InChI key as CAS Nr. replacement. I do not think that this is a bad idea, because InChI key has at least one side that does not require a resolver and should also be much more open. I agree with you that before InChI key is adopted in this role, the matter of axial chirality should be resolved.
Interesting discussion. As a searcher with decades of experience (I preceded the online age), I'd like to vote for use of both InChI and CASRN. Of course, the CASRN will be known only if it's a known compound (and I'd strongly suggest that if the submitter is not sure if they have a known compound or not, that they at least do a CA Registry File search). If it is a new compound, submit it to CAS and they will assign you a CASRN.
If one has any doubts about the structure in a CAS database, check out the original reference(s) and alert CAS to any errors. They'll fix them. Agreed, not very open, but it is precise and believe me, that's what a searcher wants: precision.
I was not aware that InChI has a problem with axial chirality. That indeed must be fixed before widespread use.
This is an interesting discourse. However, it is misleading to imply that the CAS Registry Number System is "closed." Hundreds of thousands of scientists and students worldwide have easy and convenient access to accurate substance information in the CAS Registry through SciFinder and STN.
My apologies for the multiple posts, I was under the impression that I had mistyped during the "human" validation phase of my post.
@Crystal, thanks for the feedback. We seem to have different definitions of what "open" means. This is an entire topic in itself worthy of in-depth discussion, if you'd be interested.
About the multiple posting, my bad. Several first-time commenters have run into the same thing. The software I'm running required some ugly hacks to get reCAPTCHA working on all platforms. It's something I've been meaning to correct, but it won't be easy.