From InChI to Image with Ruby Open Babel and Ruby CDK
Like SMILES, InChI is a line notation that can be used to encode and store chemical information relatively efficiently. Although there are a number of scenarios where this strategy is used, what many of them have in common is the need to eventually convert an InChI into a human-readable form. In most cases, this form will be a 2D chemical structure. This article will show how a small Ruby library can convert InChI strings into color PNG images with the help of Ruby Open Babel and Ruby CDK.
The Library
Our library accepts an InChI as input and produces a scaled PNG image as output. It re-uses part of a previously-discussed library for the interconversion of SMILES and InChI.
    require 'rubygems'
    require 'openbabel'
    require_gem 'rcdk'
    require 'rcdk/util'

    module InChI
      # Shared Open Babel converter, configured once: InChI in, SMILES out.
      @@to_smiles = OpenBabel::OBConversion.new
      @@to_smiles.set_in_and_out_formats 'inchi', 'smi'

      # Writes a PNG of the given width and height for the structure
      # encoded by inchi.
      def inchi_to_png inchi, path_to_png, width, height
        smiles = inchi_to_smiles inchi
        RCDK::Util::Image.smiles_to_png smiles, path_to_png, width, height
      end

      private

      # Converts an InChI to SMILES; raises if Open Babel can't parse it.
      def inchi_to_smiles inchi
        mol = OpenBabel::OBMol.new
        @@to_smiles.read_string(mol, inchi) or raise "Can't parse InChI: #{inchi}."
        @@to_smiles.write_string(mol).strip
      end
    end

Testing
Our library can be tested by saving it to a file called inchi.rb and using interactive Ruby (the warning can safely be ignored for now):

    $ irb
    irb(main):001:0> require 'inchi'
    ./inchi.rb:3:Warning: require_gem is obsolete. Use gem instead.
    /usr/local/lib/ruby/gems/1.8/gems/rcdk-0.3.0/lib/rcdk/java.rb:26:Warning: require_gem is obsolete. Use gem instead.
    => true
    irb(main):002:0> include InChI
    => Object
    irb(main):003:0> inchi='InChI=1/C23H27FN4O2/c1-15-18(23(29)28-10-3-2-4-21(28)25-15)9-13-27-11-7-16(8-12-27)22-19-6-5-17(24)14-20(19)30-26-22/h5-6,14,16H,2-4,7-13H2,1H3' #risperidone
    => "InChI=1/C23H27FN4O2/c1-15-18(23(29)28-10-3-2-4-21(28)25-15)9-13-27-11-7-16(8-12-27)22-19-6-5-17(24)14-20(19)30-26-22/h5-6,14,16H,2-4,7-13H2,1H3"
    irb(main):004:0> inchi_to_png inchi, 'risperidone.png', 300, 300
    => nil
This code produces the following image:

Our library can also be used on more complicated molecules, for example Brevetoxin:
    $ irb
    irb(main):001:0> require 'inchi'
    ./inchi.rb:3:Warning: require_gem is obsolete. Use gem instead.
    /usr/local/lib/ruby/gems/1.8/gems/rcdk-0.3.0/lib/rcdk/java.rb:26:Warning: require_gem is obsolete. Use gem instead.
    => true
    irb(main):002:0> include InChI
    => Object
    irb(main):003:0> inchi='InChI=1/C49H70O13/c1-26-17-36-39(22-45(52)58-36)57-44-21-38-40(62-48(44,4)23-26)18-28(3)46-35(55-38)11-7-6-10-31-32(59-46)12-8-14-34-33(54-31)13-9-15-43-49(5,61-34)24-42-37(56-43)20-41-47(60-42)30(51)19-29(53-41)16-27(2)25-50/h6-8,14,25-26,28-44,46-47,51H,2,9-13,15-24H2,1,3-5H3/b7-6-,14-8-' #brevetoxin a
    => "InChI=1/C49H70O13/c1-26-17-36-39(22-45(52)58-36)57-44-21-38-40(62-48(44,4)23-26)18-28(3)46-35(55-38)11-7-6-10-31-32(59-46)12-8-14-34-33(54-31)13-9-15-43-49(5,61-34)24-42-37(56-43)20-41-47(60-42)30(51)19-29(53-41)16-27(2)25-50/h6-8,14,25-26,28-44,46-47,51H,2,9-13,15-24H2,1,3-5H3/b7-6-,14-8-"
    irb(main):004:0> inchi_to_png inchi, 'brevetoxin.png', 300, 200
    => nil
This produces the following image:

Conclusions
While our library could certainly be improved, it solves what otherwise would be a very difficult problem conveniently. Areas for further work include error handling and improving the appearance of the images (the latter is the aim of Firefly). Despite the fact that three programming languages are used (Ruby, C++, and Java), this complexity is neatly encapsulated behind a simple Ruby interface.
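As a small step toward the error handling mentioned above, a sanity check could run before a string ever reaches Open Babel. The following is only a sketch: the method name check_inchi is hypothetical and not part of the library, and it verifies nothing beyond the "InChI=" prefix and a non-empty body, so a string that passes may still fail to parse.

```ruby
# Hypothetical pre-check, not part of the library above: rejects input
# that cannot be an InChI before any toolkit is involved.
def check_inchi(inchi)
  text = inchi.to_s.strip
  # Accepts a version prefix such as "InChI=1/" followed by a body
  # that contains no whitespace.
  unless text =~ /\AInChI=\d+[A-Za-z]?\/\S+\z/
    raise ArgumentError, "Doesn't look like an InChI: #{inchi.inspect}"
  end
  text
end

puts check_inchi("InChI=1/C8H7ClO/c1-6(10)7-2-4-8(9)5-3-7/h2-5H,1H3")
```

A check like this could be called at the top of inchi_to_png so that obviously wrong input fails with a clear message rather than a parser error.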
Everything Old is New Again: Wiswesser Line Notation (WLN)
Sometimes, searching through the attic of scientific ideas turns up unexpected treasures. Like old clothing styles that suddenly become fashionable again, the passage of time has a way of making old ideas relevant by supplying new context. Those ideas that once enjoyed widespread popularity followed by complete obscurity are especially interesting. This article talks about one of them and why it may matter again.
Some History
Wiswesser Line-Formula Chemical Notation (WLN) was the most popular of perhaps a dozen actively-used line notation systems during the 1960s and 1970s. Developed by William J. Wiswesser over a period of many years starting in the 1940s, WLN contains a surprising number of modern ideas about chemistry and information. At one point a serious contender for the position now held by IUPAC nomenclature, WLN has become so obscure that few chemists have even heard of it and no modern software can manipulate it. Even finding information on the basic grammar of WLN is difficult: almost all of this documentation is contained in out-of-print books.
A Guide
To my surprise, WLN is both easy to understand and easy to use. As far as canonicalized line notations go, WLN is far easier to comprehend than either InChI or Canonical SMILES. Even more surprisingly, WLN actually meets more than a few of the requirements for the ideal line notation for the Web. I was always struck by claims that high school graduates with little chemistry background could be trained to encode WLN in a few weeks; this now seems very plausible.
My guide is Elbert Smith's short 1968 book The Wiswesser Line-Formula Chemical Notation. I was able to pick up a used copy in excellent condition for under $30.00 from Amazon.
Some Examples
Functional groups, carbon chains, and rings play central roles in WLN. Unlike modern line notations that emphasize atoms, WLN is designed to mirror the way that chemists actually think about chemistry.
Consider acetone:

    1V1
The two "1"s stand for saturated one-carbon chains, i.e. methyl groups. The "V" stands for a carbon doubly-bonded to oxygen.
Given nothing more than the above example, the encoding of diethyl ether should be completely clear:

    2O2
"O" simply stands for a divalent oxygen atom.
The benzene ring is one of the most ubiquitous functional groups in organic chemistry. Wiswesser knew this and wanted to make it easy to encode aromatic compounds. His solution is simplicity itself. Consider acetophenone:

    1VR
The "R" stands for a benzene ring. WLN canonicalization gives it the lowest priority and this is why it appears last.
What about disubstituted aromatics? Consider 4-chloroacetophenone:

    GR DV1
The "G" symbol stands for chlorine. The " DV1" stands for the 4-acyl substituent. Here, the "D" denotes the 4-position. The 3-position would result in " CV1", and the 2-position would give " BV1". The space character means that the character following it should be interpreted as a ring locant.
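The locant letters follow the alphabet, which makes them easy to generate. Here is a minimal sketch, assuming that "A" corresponds to position 1 (the text above only shows B, C, and D); both method names are hypothetical.

```ruby
# Ring locant letters as described in the text: B = 2-position,
# C = 3-position, D = 4-position (so A is assumed to be position 1).
def ring_locant(position)
  raise ArgumentError, "position must be 1..26" unless (1..26).cover?(position)
  ('A'.ord + position - 1).chr
end

# Hypothetical helper: formats a ring substituent the way " DV1" is
# written above, with a leading space marking the locant.
def ring_substituent(position, group)
  " #{ring_locant(position)}#{group}"
end

# 4-chloroacetophenone as given in the text.
puts "GR" + ring_substituent(4, "V1")
```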
WLN uses a very simple system of canonicalization based on alphanumeric order. Priority increases in the direction: (1) symbols; (2) numbers in numerical order; and (3) letters in alphabetical order (with the exception of R which has lower priority than symbols). Coding generally begins at the substituent assigned the highest priority. This explains why 4-chloroacetophenone is not coded as "1VR DG".
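The ordering can be written down as a comparison on single characters. This sketch is only my reading of the rule as stated above, not a full WLN canonicalizer: the method names are hypothetical, and the relative order of symbols among themselves is an assumption because the rule doesn't specify it.

```ruby
# One reading of the ordering stated above, applied to single characters:
# R lowest, then other symbols, then digits, then letters.
def wln_priority(char)
  case char
  when 'R'     then [0, 0]
  when /[0-9]/ then [2, char.to_i]
  when /[A-Z]/ then [3, char.ord]
  else              [1, char.ord]  # order among symbols is an assumption
  end
end

# Returns whichever of the two characters has the higher priority.
def higher_priority(a, b)
  (wln_priority(a) <=> wln_priority(b)) >= 0 ? a : b
end

# Letters outrank numbers, so coding starts at G rather than 1.
puts higher_priority('G', '1')
```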
Advantages of WLN
WLN is remarkably compact, especially when compared to SMILES and InChI. For example, consider the InChI for 4-chloroacetophenone, which is eight times longer than the corresponding WLN:
    InChI=1/C8H7ClO/c1-6(10)7-2-4-8(9)5-3-7/h2-5H,1H3

Additionally, it's readily apparent to a human observer when a WLN is not properly coded - after all, the language was designed to be both read and written by humans rather than machines. Anyone can look at "GR DV1" and deduce almost instantly that it contains a carbonyl group (V), a phenyl group (R), a chloro group (G), and a methyl group (1).
And if this functional group recognition is easy for humans, it's orders of magnitude easier for machines. It's not difficult at all to imagine very sophisticated and fast molecular query systems that do nothing more than simple processing of the ASCII text contained within WLN strings.
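To make the idea concrete, here is a rough sketch of functional group recognition by plain text processing. It is not a WLN parser: the symbol table covers only the symbols introduced in this article, the method name wln_groups is hypothetical, and the only grammar rule it applies is that a character following a space is a ring locant.

```ruby
# Symbol meanings taken only from the examples in this article.
WLN_GROUPS = {
  'V' => 'carbonyl',
  'R' => 'benzene ring',
  'G' => 'chlorine',
  'O' => 'divalent oxygen'
}.freeze

# Returns the groups found in a WLN string. A letter right after a space
# is a ring locant, not a group symbol, so it is skipped.
def wln_groups(wln)
  groups = []
  wln.scan(/( ?)([A-Z]|\d+)/) do |space, token|
    next unless space.empty?
    if token =~ /\A\d+\z/
      groups << "#{token}-carbon chain"
    elsif WLN_GROUPS.key?(token)
      groups << WLN_GROUPS[token]
    end
  end
  groups
end

p wln_groups('GR DV1')
```

A query such as "all compounds containing a carbonyl" then reduces to filtering a list of WLN strings on the result of this method.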
Conclusions
It's very unlikely that WLN will ever be resurrected for the purpose of replacing existing line notations. On the other hand, WLN offers many potentially useful concepts for those creating new line notations. As they say, history doesn't repeat itself, but it frequently rhymes.
Interconvert (Almost) Any SMILES and InChI with Ruby Open Babel
SMILES and InChI are the two most widely-used line notations in cheminformatics. Not surprisingly, there are many situations in which it's useful to interconvert the two. This article shows a simple method for doing so using Ruby Open Babel.
Parsing InChIs
Version 1.01 of the IUPAC/NIST C InChI toolkit introduced the ability to parse InChIs. This capability has subsequently been incorporated into Open Babel, and by extension, Ruby Open Babel. It's this capability that we'll take advantage of.
A Simple Library
The following library provides everything we need to convert between SMILES and InChI via Ruby:
    require 'openbabel'

    module InChI
      # One converter per direction, each configured once.
      @@to_smiles = OpenBabel::OBConversion.new
      @@to_inchi = OpenBabel::OBConversion.new

      @@to_smiles.set_in_and_out_formats 'inchi', 'smi'
      @@to_inchi.set_in_and_out_formats 'smi', 'inchi'

      # Converts an InChI to SMILES; raises if Open Babel can't parse it.
      def inchi_to_smiles inchi
        mol = OpenBabel::OBMol.new
        @@to_smiles.read_string(mol, inchi) or raise "Can't parse InChI: #{inchi}."
        @@to_smiles.write_string(mol).strip
      end

      # Converts a SMILES string to InChI; raises if Open Babel can't parse it.
      def smiles_to_inchi smiles
        mol = OpenBabel::OBMol.new
        @@to_inchi.read_string(mol, smiles) or raise "Can't parse SMILES #{smiles}."
        @@to_inchi.write_string(mol).strip
      end
    end

Testing the Library
After saving the above code to a file named inchi.rb, we can interactively convert SMILES and InChIs:
    $ irb
    irb(main):001:0> require 'inchi'
    => true
    irb(main):002:0> include InChI
    => Object
    irb(main):003:0> smiles = inchi_to_smiles "InChI=1/C14H12/c1-3-7-13(8-4-1)11-12-14-9-5-2-6-10-14/h1-12H/b12-11-"
    => "c1ccc(cc1)C(/[H])=C(/[H])c1ccccc1"
    irb(main):004:0> inchi = smiles_to_inchi smiles
    => "InChI=1/C14H12/c1-3-7-13(8-4-1)11-12-14-9-5-2-6-10-14/h1-12H/b12-11-"

In the above test, the InChI for cis-stilbene is converted into a SMILES string which is then converted back to InChI form with complete fidelity, including alkene geometry. Note that this would not have been possible using the approach that was previously discussed in which molfiles were used as intermediate data structures.
What about chiral centers? Here the results are mixed. For example, when the round-trip conversion is applied to propranolol (PubChem, Video), the configuration of the stereocenter is inverted:
    $ irb
    irb(main):001:0> require 'inchi'
    => true
    irb(main):002:0> include InChI
    => Object
    irb(main):003:0> smiles = inchi_to_smiles "InChI=1/C16H21NO2/c1-12(2)17-10-14(18)11-19-16-9-5-7-13-6-3-4-8-15(13)16/h3-9,12,14,17-18H,10-11H2,1-2H3/t14-/m1/s1"
    => "CC(C)NC[C@@H](COc1cccc2ccccc12)O"
    irb(main):004:0> inchi = smiles_to_inchi smiles
    => "InChI=1/C16H21NO2/c1-12(2)17-10-14(18)11-19-16-9-5-7-13-6-3-4-8-15(13)16/h3-9,12,14,17-18H,10-11H2,1-2H3/t14-/m0/s1"
However, the same round-trip conversion of 1-phenylethanol works without inversion of stereochemistry:
    $ irb
    irb(main):001:0> require 'inchi'
    => true
    irb(main):002:0> include InChI
    => Object
    irb(main):003:0> smiles = inchi_to_smiles " InChI=1/C8H10O/c1-7(9)8-5-3-2-4-6-8/h2-7,9H,1H3/t7-/m0/s1"
    => "C[C@@H](c1ccccc1)O"
    irb(main):004:0> inchi = smiles_to_inchi smiles
    => "InChI=1/C8H10O/c1-7(9)8-5-3-2-4-6-8/h2-7,9H,1H3/t7-/m0/s1"
The most likely explanation is that under certain conditions, Open Babel incorrectly interprets and/or writes stereo parities.
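Spotting this kind of discrepancy by eye is tedious, so a layer-by-layer comparison helps. The sketch below uses plain string processing and no toolkit. The method names are hypothetical, and it assumes each layer prefix appears only once, which holds for the simple InChIs shown here but not for every InChI.

```ruby
# Splits a simple InChI into its layers, keyed by each layer's prefix
# letter. Assumes each prefix occurs once, as in the examples above.
def inchi_layers(inchi)
  version, formula, *rest = inchi.strip.split('/')
  layers = { 'version' => version, 'formula' => formula }
  rest.each { |layer| layers[layer[0, 1]] = layer[1..-1] }
  layers
end

# Lists the layers whose contents differ between two InChIs.
def differing_layers(before, after)
  a = inchi_layers(before)
  b = inchi_layers(after)
  (a.keys | b.keys).reject { |key| a[key] == b[key] }
end

before = "InChI=1/C16H21NO2/c1-12(2)17-10-14(18)11-19-16-9-5-7-13-6-3-4-8-15(13)16/h3-9,12,14,17-18H,10-11H2,1-2H3/t14-/m1/s1"
after  = "InChI=1/C16H21NO2/c1-12(2)17-10-14(18)11-19-16-9-5-7-13-6-3-4-8-15(13)16/h3-9,12,14,17-18H,10-11H2,1-2H3/t14-/m0/s1"
p differing_layers(before, after)
```

For the propranolol pair above, only the "m" layer is reported, which matches the inversion seen in the transcript.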
One More Gotcha
On my system (Linux Mandriva 2007.1), attempting to perform the round-trip test on glucose resulted (reproducibly) in a segfault:
    $ irb
    irb(main):001:0> require 'inchi'
    => true
    irb(main):002:0> include InChI
    => Object
    irb(main):003:0> smiles = inchi_to_smiles "InChI=1/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2/t2-,3-,4+,5-,6?/m1/s1"
    => "C([C@H]1[C@H]([C@@H]([C@H](C(O)O1)O)O)O)O"
    irb(main):004:0> inchi = smiles_to_inchi smiles
    ./inchi.rb:20: [BUG] Segmentation fault
    ruby 1.8.6 (2007-03-13) [i686-linux]

    Aborted
The same segfault was obtained when using the babel command-line utility:
    $ babel -ismi -oinchi
    C([C@H]1[C@H]([C@@H]([C@H](C(O)O1)O)O)O)O [Return]
    Segmentation fault
Conclusions
As you can see, Ruby Open Babel makes short work of interconverting SMILES and InChIs. Despite problems with stereochemical configuration and segfaults on reading certain SMILES strings, the approach outlined here offers a quick and economical way to interconvert a variety of SMILES and InChIs.
My InChI Runneth Over
The only solution to this problem I've found is to set the CSS overflow property to "scroll":
    InChI=1/C50H70O14/c1-25(24-51)14-28-17-37(52)50(8)41(54-28)19-33-34(61-50)18-32-29(55-33)10-9-12-46(4)42(58-32)23-49(7)40(62-46)21-39-47(5,64-49)13-11-30-44(60-39)26(2)15-31-36(56-30)22-48(6)38(57-31)20-35-45(63-48)27(3)16-43(53)59-35/h9-10,16,24,26,28-42,44-45,52H,1,11-15,17-23H2,2-8H3/b10-9-

Hashing InChIs
The InChI team has announced a proposal for a standardized InChI hashing mechanism. This would create a free, fixed-length, alphanumeric molecular identifier.
This is an excellent proposal. One of the biggest problems in working with InChIs (and other line notations such as SMILES) is that even medium-sized molecules produce very long identifiers. Another problem is the use of characters that must be escaped in URLs. The hashing proposal addresses both of these issues, getting very close to creating the optimal molecular identifier.
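The URL problem is easy to demonstrate with Ruby's standard library. This short sketch percent-encodes a small InChI for use as a query value; the choice of URI.encode_www_form_component is mine, and other escaping schemes would differ in detail.

```ruby
require 'uri'

inchi = "InChI=1/C8H7ClO/c1-6(10)7-2-4-8(9)5-3-7/h2-5H,1H3"

# Percent-encode the InChI for use as a URL query value. Characters such
# as "/", "=", "(", ")" and "," are all replaced by escape sequences.
escaped = URI.encode_www_form_component(inchi)

puts escaped
puts "#{inchi.length} characters before, #{escaped.length} after escaping"
```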
For example, imagine the convenience of being able to refer to a molecule by a universally-recognized, machine-generated string like the one shown below:
AAAAAAAAAAA-BBBBBBB-XYZ
This is something that actually stands a chance of getting printed on reagent bottles, in catalogs, in patent applications, or anywhere else chemists are using chemical information. Aside from its length, it's not too different from that other molecular identifier system, but without the perpetual use tax.
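To get a feel for what such an identifier might look like, here is a toy sketch. It is emphatically not the InChI team's proposed algorithm: SHA-256, the letter mapping, and the 11-7 grouping are all assumptions chosen only to resemble the first two blocks of the placeholder above, and the method name is hypothetical.

```ruby
require 'digest/sha2'

# Illustration only: SHA-256 and the 11-7 grouping are assumptions chosen
# to resemble the placeholder above, not the proposed standard.
def inchi_hash_sketch(inchi)
  # Map the 16 hex digits onto the letters A-P so the result is alphabetic.
  letters = Digest::SHA256.hexdigest(inchi.strip).tr('0-9a-f', 'A-P')
  "#{letters[0, 11]}-#{letters[11, 7]}"
end

puts inchi_hash_sketch("InChI=1/C8H7ClO/c1-6(10)7-2-4-8(9)5-3-7/h2-5H,1H3")
```

Whatever the size of the molecule, the output has the same length and contains nothing that needs escaping in a URL.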
There are at least three downsides to this approach:
For most purposes, hashing is a one-way process. It would become virtually impossible to computationally convert this hashed identifier back into its InChI or molecular representation. On the other hand, this could create a market for cryptography experts in cheminformatics. A hashed-InChI lookup service would start to look very useful.
Because of the one-way nature of hashing, the authenticity of a hashed InChI couldn't be directly verified. Checksums will help, but the fundamental problem remains. InChI itself can be decoded, and therefore authenticated.
It's possible, although extremely unlikely, that two different molecules will end up having the same hashed InChI. Reducing the collision probability means increasing the length of the identifier.
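The trade-off between identifier length and collision risk can be estimated with the standard birthday-bound approximation. The sketch below assumes hash values are spread evenly, and the collection size of 100 million molecules is a hypothetical figure picked only for illustration.

```ruby
# Birthday-bound approximation: chance that at least two of `count`
# items share a value when hashes are spread evenly over 2**bits values.
def collision_probability(count, bits)
  x = count.to_f * (count - 1) / (2.0 * 2.0**bits)
  # For very small x, 1 - exp(-x) rounds to zero in floating point,
  # so fall back on the first-order approximation x.
  x < 1e-9 ? x : 1.0 - Math.exp(-x)
end

[32, 64, 128].each do |bits|
  printf("%3d bits: %.3e\n", bits, collision_probability(100_000_000, bits))
end
```

Each additional bit roughly halves the collision probability once that probability is small, which is the length-versus-safety trade-off described above.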
As in any design decision, the question is whether the benefits outweigh the disadvantages.
Anyone is free to develop their own InChI hash system. Several, including me, already have. But by introducing a standard mechanism, the InChI team has the potential to create both a free and easy-to-use molecular identifier.

