Hashing InChIs

Posted by Rich Apodaca Wed, 09 May 2007 14:01:00 GMT

The InChI team has announced a proposal for a standardized InChI hashing mechanism. This would create a free, fixed-length, alphanumeric molecular identifier.

This is an excellent proposal. One of the biggest problems in working with InChIs (and other line notations such as SMILES) is that even medium-sized molecules produce very long identifiers. Another problem is the use of characters that must be escaped in URLs. The hashing proposal addresses both of these issues, getting very close to creating the optimal molecular identifier.

For example, imagine the convenience of being able to refer to a molecule by a universally-recognized, machine-generated string like the one shown below:

AAAAAAAAAAA-BBBBBBB-XYZ

This is something that actually stands a chance of getting printed on reagent bottles, in catalogs, in patent applications, or anywhere else chemists are using chemical information. Aside from its length, it's not too different from that other molecular identifier system, but without the perpetual use tax.

There are at least three downsides to this approach:

  1. For most purposes, hashing is a one-way process. It would become virtually impossible to computationally convert this hashed identifier back into its InChI or molecular representation. On the other hand, this could create a market for cryptography experts in cheminformatics. A hashed-InChI lookup service would start to look very useful.

  2. Because of the one-way nature of hashing, the authenticity of a hashed InChI couldn't be directly verified. Checksums will help, but the fundamental problem remains. InChI itself can be decoded, and therefore authenticated.

  3. It's possible, although extremely unlikely, that two different molecules will end up having the same hashed InChI. Reducing the collision probability means increasing the length of the identifier.
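
The trade-off in point 3 can be estimated with the standard birthday approximation, p ≈ 1 - exp(-n^2 / 2N), where n is the number of molecules and N is the number of possible identifiers. The Ruby sketch below assumes a hash that spreads identifiers uniformly; the function name, the 26-letter alphabet, and the example figures are illustrative assumptions, not details of the InChI proposal.

```ruby
# Approximate probability that at least two of n_molecules share a hashed
# identifier, assuming a uniformly distributed hash (birthday approximation).
# alphabet_size = 26 (uppercase letters only) is an assumption.
def collision_probability(n_molecules, id_length, alphabet_size = 26)
  space = alphabet_size.to_f**id_length # number of possible identifiers
  1.0 - Math.exp(-(n_molecules.to_f**2) / (2.0 * space))
end

# Compare a few identifier lengths for a collection of 100 million molecules.
[10, 14, 18].each do |len|
  p = collision_probability(100_000_000, len)
  puts format('%2d characters: p = %.3g', len, p)
end
```

Each added character multiplies the identifier space by the alphabet size, so a few extra characters reduce the collision probability by orders of magnitude.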

As in any design decision, the question is whether the benefits outweigh the disadvantages.

Anyone is free to develop their own InChI hash system. Several people, myself included, already have. But by introducing a standard mechanism, the InChI team has the potential to create both a free and easy-to-use molecular identifier.
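
For a sense of what such a home-grown scheme might look like, here is a minimal Ruby sketch that maps an InChI to a fixed-length, uppercase alphanumeric string. It is not the mechanism the InChI team proposed: the choice of SHA-256, the base-36 alphabet, the 14-character default, and the name inchi_hash are all assumptions made for illustration.

```ruby
require 'digest'

# Hypothetical helper: derive a fixed-length, URL-safe identifier from an
# InChI string. SHA-256 and base 36 are illustrative choices only.
def inchi_hash(inchi, length = 14)
  number = Digest::SHA256.hexdigest(inchi).to_i(16) # digest as one big integer
  base36 = number.to_s(36).upcase                   # digits 0-9 and letters A-Z
  base36.rjust(length, '0')[-length, length]        # keep the last `length` characters
end

inchi = 'InChI=1/C7H6O2/c8-7(9)6-4-2-1-3-5-6/h1-5H,(H,8,9)'
puts inchi_hash(inchi)
```

Because the result contains only digits and uppercase letters, it needs no escaping in a URL. As noted above, though, it cannot be turned back into a structure without a lookup table.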

Update: InChI Canonicalization Algorithm

Posted by Rich Apodaca Sat, 05 May 2007 12:34:00 GMT

An older article on the InChI canonicalization algorithm has been restored and updated. The revised article contains a direct link to the InChI Technical Manual PDF file, which I uploaded to SourceForge for convenience.

Structure Diagram Generation

Posted by Rich Apodaca Wed, 11 Apr 2007 10:20:00 GMT

Given a molecule with no 2D coordinates, how would you render a human-readable view? This problem arises in many situations, but most commonly in the context of interpreting line notations such as IUPAC nomenclature, SMILES, or InChI. Whatever the solution you come up with, you'll come face-to-face with the structure diagram generation (SDG) problem.

Generating 2D molecular coordinates is a fundamental (and remarkably difficult) problem in cheminformatics. Discussions in the primary literature date back to at least the 1970s with Chemical Abstracts Service's pioneering large-scale efforts. A recent article from Chemical Computing Group (CCG) described the design and implementation of an advanced SDG system. To my knowledge, the only open source implementation of an SDG system is found in the Chemistry Development Kit, and by extension Ruby CDK.

The SDG problem plays an important role in the aesthetics of chemical structure diagrams, as mentioned by two readers. To render a molecule aesthetically, 2D coordinates must minimize confusing atom overlaps, unconventional orientations, and unusual bond angles.
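
One of these criteria, atom overlap, is simple to quantify once coordinates exist. The Ruby sketch below counts atom pairs that sit closer together than a threshold. The function name, the [x, y] array layout, and the 0.5 threshold are assumptions for illustration and are not taken from CDK or any other SDG system.

```ruby
# Count pairs of atoms whose 2D positions are closer than min_distance.
# coords is an array of [x, y] pairs; the 0.5 default is an arbitrary choice.
def overlap_count(coords, min_distance = 0.5)
  coords.combination(2).count do |a, b|
    Math.hypot(a[0] - b[0], a[1] - b[1]) < min_distance
  end
end

layout = [[0.0, 0.0], [1.0, 0.0], [1.1, 0.0]] # the last two atoms nearly coincide
puts overlap_count(layout)
```

A layout routine could use a count like this as one term in a score to minimize, alongside penalties for unusual bond angles.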

The role of SDG in cheminformatics can only continue to increase in importance, especially as more and more structures are automatically generated through mining the primary literature, the Internet, old PDFs, and other sources. With all of these new computer-generated structures will come the need to make them readily understandable to a chemist through SDG.

Customize InChI Output with Rino

Posted by Rich Apodaca Mon, 19 Mar 2007 10:30:00 GMT

Rino is a toolkit for working with the IUPAC International Chemical Identifier (InChI) in Ruby. Because it's based on the IUPAC/NIST InChI toolkit, Rino can be configured using a variety of useful options. This article summarizes those options and provides an illustrative example.

Complete List of InChI Command Line Options

The following is a complete summary of the IUPAC/NIST InChI toolkit command line options:

  • SNon Exclude stereo (Default: Include Absolute stereo)

  • SRel Relative stereo

  • SRac Racemic stereo

  • SUCF Use Chiral Flag: On means Absolute stereo, Off - Relative

  • SUU Include omitted unknown/undefined stereo

  • NEWPS Narrow end of wedge points to stereocenter (default: both)

  • SPXYZ Include Phosphines Stereochemistry

  • SAsXYZ Include Arsines Stereochemistry

  • RecMet Include reconnected metals results

  • FixedH Mobile H Perception Off (Default: On)

  • AuxNone Omit auxiliary information (default: Include)

  • NoADP Disable Aggressive Deprotonation (for testing only)

  • Compress Compressed output

  • DoNotAddH Don't add H according to usual valences: all H are explicit

  • Wnumber Set time-out per structure in seconds; W0 means unlimited

  • SDF:DataHeader Read from the input SDfile the ID under this DataHeader

  • NoLabels Omit structure number, DataHeader and ID from InChI output

  • Tabbed Separate structure number, InChI, and AuxInfo with tabs

  • OutputSDF Convert InChI created with default aux. info to SDfile

  • InChI2InChI Convert InChI string into InChI string for validation purposes

  • SdfAtomsDT Output Hydrogen Isotopes to SDfile as Atoms D and T

  • STDIO Use standard input/output streams

  • FB (or FixSp3Bug) Fix bug leading to missing or undefined sp3 parity

  • WarnOnEmptyStructure Warn and produce empty InChI for empty structure
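
When options are assembled programmatically, a misspelled flag is easy to miss. The Ruby sketch below checks names against the list above before prefixing them with "-". The helper inchi_flags and its timeout: keyword are hypothetical and are not part of Rino or the InChI toolkit. The SDF:DataHeader option is left out because it takes a value.

```ruby
# Option names copied from the list above (Wnumber is handled via timeout:).
KNOWN_OPTIONS = %w[
  SNon SRel SRac SUCF SUU NEWPS SPXYZ SAsXYZ RecMet FixedH AuxNone NoADP
  Compress DoNotAddH NoLabels Tabbed OutputSDF InChI2InChI SdfAtomsDT STDIO
  FB FixSp3Bug WarnOnEmptyStructure
].freeze

# Hypothetical helper: turn option names into "-Name" flag strings,
# raising on anything that is not in the list.
def inchi_flags(*names, timeout: nil)
  unknown = names.map(&:to_s) - KNOWN_OPTIONS
  raise ArgumentError, "unknown InChI option(s): #{unknown.join(', ')}" unless unknown.empty?

  flags = names.map { |name| "-#{name}" }
  flags << "-W#{timeout}" if timeout # Wnumber: time-out in seconds, W0 = unlimited
  flags
end

p inchi_flags(:FixedH, :AuxNone, timeout: 10)
```

Assuming reader.options behaves like an Array, as the << call in the test code suggests, the result could be appended with reader.options.concat(inchi_flags(:FixedH, :AuxNone)).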

A Test

The following code displays the InChI for benzoic acid with and without mobile hydrogen atom perception. It requires both Rino and Ruby CDK. The latter library is used to convert a SMILES string into a molfile for use by Rino.

require 'rubygems'
require_gem 'rcdk'
require_gem 'rino'
require 'rcdk/util'

molfile = RCDK::Util::Lang.smiles_to_molfile 'c1ccccc1C(=O)O' # benzoic acid
reader = Rino::MolfileReader.new
inchi = reader.read(molfile)

puts "Default (mobile hydrogen perception on):\n#{inchi}\n\n"

reader.options << '-FixedH'
inchi = reader.read(molfile)

puts "With -FixedH (fixed-hydrogen layer added):\n#{inchi}"

The -FixedH flag used by the reader the second time switches off mobile hydrogen perception, as the option list above describes, and causes a fixed-hydrogen layer (/f) to be appended to the InChI output. Some InChI authors use this form of InChI and others don't. PubChem is an example of a large InChI author that does include the fixed-hydrogen layer, as their entry for benzoic acid demonstrates. To perform an exact match of your InChIs with theirs, the -FixedH flag must be set.

Running the Test

Running the test code produces the following output:

Default (mobile hydrogen perception on):
InChI=1/C7H6O2/c8-7(9)6-4-2-1-3-5-6/h1-5H,(H,8,9)

With -FixedH (fixed-hydrogen layer added):
InChI=1/C7H6O2/c8-7(9)6-4-2-1-3-5-6/h1-5H,(H,8,9)/f/h8H
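
The two strings differ only in their trailing layers, which is easy to confirm by splitting on "/". This is a rough sketch rather than a full InChI parser: the inchi_layers helper is hypothetical, and it ignores subtleties such as multi-component structures.

```ruby
# Hypothetical helper: split an InChI into version, formula, and remaining layers.
def inchi_layers(inchi)
  version, formula, *layers = inchi.sub(/\AInChI=/, '').split('/')
  { version: version, formula: formula, layers: layers }
end

default = inchi_layers('InChI=1/C7H6O2/c8-7(9)6-4-2-1-3-5-6/h1-5H,(H,8,9)')
fixed_h = inchi_layers('InChI=1/C7H6O2/c8-7(9)6-4-2-1-3-5-6/h1-5H,(H,8,9)/f/h8H')

# Layers present only in the -FixedH output.
p fixed_h[:layers] - default[:layers] # => ["f", "h8H"]
```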

Conclusions

When matching InChIs generated by other authors, it's best to adopt their processing conventions. Rino makes it convenient to do so through its full support for the standard IUPAC/NIST command line options.

Eleven Qualities of The Perfect Line Notation for the Web

Posted by Rich Apodaca Wed, 14 Mar 2007 10:18:00 GMT

If you had to design the perfect line notation for the Web, what would it look like? This is hardly an academic exercise given the central role played by line notations in information systems. For a variety of reasons, existing line notations may not be the right match for the Web. This article explores this question and outlines the main qualities needed by a Web-friendly line notation.

A Few Lines About Line Notations

A line notation is any system that converts a molecular structure into a single line of text. Chemists have been using line notations for over 140 years - long before the advent of computers. Because of their versatility, line notations are frequently used in situations they were not designed for. When this happens, limitations become apparent, resulting in renewed efforts to build a better system.

As noted previously, the invention of new line notations is a field whose popularity ebbs and flows over time. Currently, the three most important line notations are:

  • IUPAC Nomenclature
  • Simplified Molecular Input Line Entry System (SMILES)
  • IUPAC International Chemical Identifier (InChI)

Each of these systems has its own unique characteristics. IUPAC nomenclature is the oldest and most widely-used line notation. It appears in numerous contexts, including Web pages, peer-reviewed journals, reports, patents, MSDS sheets, catalogs, and reagent bottles. By comparison, SMILES is a distant second in popularity. Its main role has been to facilitate machine entry of structural information by humans, like this. InChI is the newest of the bunch. It serves both as a line notation and as a unique identifier requiring no central authority.

The Perfect Line Notation for the Web

The emergence of the Web as a standard information delivery platform has refocused the attention of many developers on the line notation problem. With this idea in mind, here are some guesses about the qualities of the ideal Web-friendly line notation.

  1. Readily Encodable and Decodable by Humans. There's something unnerving about a line notation that can't easily be deciphered by humans. Is this really the right string? Did I copy it completely? This problem surfaces with every line notation, but some fare better than others. IUPAC nomenclature, for example, is one of the first things taught in many beginning organic chemistry classes. It's complicated, but still understandable by non-experts.

  2. Readily Encodable and Decodable by Machines. It may be relatively simple for humans to read and write IUPAC nomenclature, but not so for machines. Software that reads and writes SMILES, on the other hand, is by comparison easy to write. This explains the abundance of software packages that handle SMILES and the scarcity of those that handle IUPAC nomenclature.

  3. Uses URI-Safe Characters Only. A URI uniquely identifies every document on the Internet. Why can't a line notation be used in combination with a URI to uniquely identify every molecule? One reason is that every line notation currently in use contains characters unsafe for use in URIs. Any line notation designed for use on the Web needs to avoid these characters in its syntax. Update: InChI doesn't use unsafe characters, but it does use the reserved characters "=", "?", and "/". These characters may therefore need to be escaped, depending on the context.

  4. Encodes All Molecules. Buried within every line notation is an opinion on what chemistry is really about. To operate on the Web, these opinions need to be as closely aligned as possible with those of chemists themselves. Several Depth-First articles have discussed the limitations of existing line notations as molecular languages.

  5. Compact. Nobody wants to look at or manipulate a line of text that's longer than it needs to be. Of course, the more expressive a line notation is, the more verbose it will be. In other words, qualities 4 and 5 will always be in conflict.

  6. Canonicalizable. A line notation supports canonicalization when it specifies rules that can be guaranteed to always generate the same line notation for a given molecule. This feature enables many labor-saving assumptions. For example, a canonical representation makes a great identifier in a database, reducing the cost of storing and retrieving structural information.

  7. Explicit Hydrogen Atom Encoding. SMILES makes few requirements regarding hydrogen atom encoding. As a result, each software implementation is left to its own devices. The resulting confusion is the price paid for the convenience (Quality 1) of a compact notation (Quality 5).

  8. Hierarchical Structure. One of InChI's innovations was the introduction of a hierarchical encoding system. This system, also referred to as InChI "layers", enables a molecule to be viewed at several levels of resolution: as a molecular formula; as a network of atoms; as a network of atoms containing hydrogen atoms; as an atomic network with stereochemistry; and so on. I'm unaware of any reports in which this feature has been exploited in a practical way, although they aren't difficult to imagine.

  9. Flat Structure. By grouping structural features into layers (Quality 8), InChI introduces a lot of complexity that is absent in SMILES and even IUPAC nomenclature. This complexity, in part, makes it difficult for both humans and machines to properly encode InChIs (Qualities 1 and 2). Given this complexity, and the fact that the utility of hierarchical encoding has yet to be conclusively demonstrated, it may be better to avoid it.

  10. Open Source Software Implementation. No encoding standard in today's world stands a chance of gaining acceptance without an open source reference implementation. InChI broke new ground in this area and should serve as a model for any system that follows.

  11. Unencumbered by Patents. The success of molfile and SMILES as de facto standards derives partly from the decision made by their authors to refrain from patenting their languages. As a result, developers are motivated to build their own implementations, rather than invent yet another language.
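
To make Quality 3 concrete, the Ruby sketch below percent-encodes an InChI for use as a URL query value and then decodes it again, using the standard library's URI module. The example.com address and the inchi parameter name are placeholders, not a real service.

```ruby
require 'uri'

inchi = 'InChI=1/C7H6O2/c8-7(9)6-4-2-1-3-5-6/h1-5H,(H,8,9)'

# Escape reserved characters such as "=", "/", "(", ")" and ",".
escaped = URI.encode_www_form_component(inchi)
url = "http://example.com/lookup?inchi=#{escaped}"

puts url
puts URI.decode_www_form_component(escaped) == inchi # round trip restores the InChI
```

The escaped form is noticeably longer than the original, which works against Quality 5.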

Conclusions

A robust and modern line notation system is a key technology for chemically enabling the Web. Existing line notations, although useful in many contexts, were not designed with this particular role in mind. The time has come to consider whether a new line notation system, designed specifically with the Web and modern chemistry in mind, might offer a better solution.

Photo credit: Wenwen - Flickr
