Interconvert (Almost) Any SMILES and InChI with Ruby Open Babel
SMILES and InChI are the two most widely used line notations in cheminformatics. Not surprisingly, there are many situations in which it's useful to interconvert the two. This article shows a simple method for doing so using Ruby Open Babel.
Parsing InChIs
Version 1.01 of the IUPAC/NIST C InChI toolkit introduced the ability to parse InChIs. This capability has subsequently been incorporated into Open Babel, and by extension, Ruby Open Babel. It's this capability that we'll take advantage of.
A Simple Library
The following library provides everything we need to convert between SMILES and InChI via Ruby:
require 'openbabel'

module InChI
  @@to_smiles = OpenBabel::OBConversion.new  # reads InChI, writes SMILES
  @@to_inchi = OpenBabel::OBConversion.new   # reads SMILES, writes InChI

  @@to_smiles.set_in_and_out_formats 'inchi', 'smi'
  @@to_inchi.set_in_and_out_formats 'smi', 'inchi'

  def inchi_to_smiles inchi
    mol = OpenBabel::OBMol.new

    @@to_smiles.read_string(mol, inchi) or raise "Can't parse InChI: #{inchi}."
    @@to_smiles.write_string(mol).strip  # drop surrounding whitespace
  end

  def smiles_to_inchi smiles
    mol = OpenBabel::OBMol.new

    @@to_inchi.read_string(mol, smiles) or raise "Can't parse SMILES: #{smiles}."
    @@to_inchi.write_string(mol).strip
  end
end

Testing the Library
After saving the above code to a file named inchi.rb, we can interactively convert SMILES and InChIs:
$ irb
irb(main):001:0> require 'inchi'
=> true
irb(main):002:0> include InChI
=> Object
irb(main):003:0> smiles = inchi_to_smiles "InChI=1/C14H12/c1-3-7-13(8-4-1)11-12-14-9-5-2-6-10-14/h1-12H/b12-11-"
=> "c1ccc(cc1)C(/[H])=C(/[H])c1ccccc1"
irb(main):004:0> inchi = smiles_to_inchi smiles
=> "InChI=1/C14H12/c1-3-7-13(8-4-1)11-12-14-9-5-2-6-10-14/h1-12H/b12-11-"
In the above test, the InChI for cis-stilbene is converted into a SMILES string, which is then converted back to InChI form with complete fidelity, including alkene geometry. Note that this would not have been possible using the previously discussed approach, in which molfiles were used as intermediate data structures.
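The round-trip test above can be automated for a list of InChIs. What follows is only a minimal sketch: the helper name round_trip and its parameters to_smiles and to_inchi are hypothetical, not part of Ruby Open Babel. The converters are passed in as callables so the helper can be tried without Open Babel installed.

```ruby
# Round-trip each InChI through SMILES and record what came back.
# `to_smiles` and `to_inchi` are any callables mapping String -> String
# (hypothetical parameter names; they are not part of Ruby Open Babel).
def round_trip(inchis, to_smiles, to_inchi)
  inchis.map do |original|
    begin
      smiles = to_smiles.call(original)
      result = to_inchi.call(smiles)
      { inchi: original, smiles: smiles, result: result,
        ok: result == original }
    rescue StandardError => e
      # A parse failure raised by a converter is recorded, not propagated.
      { inchi: original, error: e.message, ok: false }
    end
  end
end
```

With inchi.rb loaded and InChI included, the two converters could be supplied as method(:inchi_to_smiles) and method(:smiles_to_inchi). Entries whose :ok value is false are the ones worth inspecting by hand.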
What about chiral centers? Here the results are mixed. For example, when the round-trip conversion is applied to propranolol, the configuration of the stereocenter is inverted (note the change from /m1 to /m0 in the final InChI):
$ irb
irb(main):001:0> require 'inchi'
=> true
irb(main):002:0> include InChI
=> Object
irb(main):003:0> smiles = inchi_to_smiles "InChI=1/C16H21NO2/c1-12(2)17-10-14(18)11-19-16-9-5-7-13-6-3-4-8-15(13)16/h3-9,12,14,17-18H,10-11H2,1-2H3/t14-/m1/s1"
=> "CC(C)NC[C@@H](COc1cccc2ccccc12)O"
irb(main):004:0> inchi = smiles_to_inchi smiles
=> "InChI=1/C16H21NO2/c1-12(2)17-10-14(18)11-19-16-9-5-7-13-6-3-4-8-15(13)16/h3-9,12,14,17-18H,10-11H2,1-2H3/t14-/m0/s1"
However, the same round-trip conversion of 1-phenylethanol works without inversion of stereochemistry:
$ irb
irb(main):001:0> require 'inchi'
=> true
irb(main):002:0> include InChI
=> Object
irb(main):003:0> smiles = inchi_to_smiles " InChI=1/C8H10O/c1-7(9)8-5-3-2-4-6-8/h2-7,9H,1H3/t7-/m0/s1"
=> "C[C@@H](c1ccccc1)O"
irb(main):004:0> inchi = smiles_to_inchi smiles
=> "InChI=1/C8H10O/c1-7(9)8-5-3-2-4-6-8/h2-7,9H,1H3/t7-/m0/s1"
The most likely explanation is that under certain conditions, Open Babel incorrectly interprets and/or writes stereo parities.
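Spotting a one-character difference between two long InChIs by eye is error-prone. The sketch below splits each InChI into its slash-separated layers and reports which ones differ. The helper names inchi_layers and differing_layers are hypothetical, as are the :version and :formula keys used for the two layers that carry no prefix letter. It also assumes that each prefix letter occurs only once, which holds for the InChIs shown in this article but not for every InChI.

```ruby
# Split an InChI into layers keyed by prefix letter. The version and
# formula layers have no prefix letter, so they are stored under the
# hypothetical keys :version and :formula.
# Simplification: assumes each prefix letter appears at most once.
def inchi_layers(inchi)
  version, formula, *rest = inchi.strip.sub(/\AInChI=/, '').split('/')
  layers = { version: version, formula: formula }
  rest.each do |layer|
    next if layer.empty?
    layers[layer[0].to_sym] = layer[1..-1]
  end
  layers
end

# Return the keys of the layers on which two InChIs disagree.
def differing_layers(a, b)
  la, lb = inchi_layers(a), inchi_layers(b)
  (la.keys | lb.keys).reject { |key| la[key] == lb[key] }
end

before = "InChI=1/C16H21NO2/c1-12(2)17-10-14(18)11-19-16-9-5-7-13-6-3-4-8-15(13)16/h3-9,12,14,17-18H,10-11H2,1-2H3/t14-/m1/s1"
after  = "InChI=1/C16H21NO2/c1-12(2)17-10-14(18)11-19-16-9-5-7-13-6-3-4-8-15(13)16/h3-9,12,14,17-18H,10-11H2,1-2H3/t14-/m0/s1"
differing_layers(before, after)  # => [:m]
```

For the propranolol round trip, only the /m layer changes, while connectivity, hydrogens, and the /t layer are identical.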
One More Gotcha
On my system (Linux Mandriva 2007.1), attempting to perform the round-trip test on glucose resulted (reproducibly) in a segfault:
$ irb
irb(main):001:0> require 'inchi'
=> true
irb(main):002:0> include InChI
=> Object
irb(main):003:0> smiles = inchi_to_smiles "InChI=1/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2/t2-,3-,4+,5-,6?/m1/s1"
=> "C([C@H]1[C@H]([C@@H]([C@H](C(O)O1)O)O)O)O"
irb(main):004:0> inchi = smiles_to_inchi smiles
./inchi.rb:20: [BUG] Segmentation fault
ruby 1.8.6 (2007-03-13) [i686-linux]
Aborted
The same segfault was obtained when using the babel command-line utility:
$ babel -ismi -oinchi
C([C@H]1[C@H]([C@@H]([C@H](C(O)O1)O)O)O)O [Return]
Segmentation fault
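A segfault in native code kills the whole Ruby interpreter, so a rescue clause cannot catch it. One possible workaround is to run each risky conversion in a child process. The sketch below assumes a Unix-like system where Ruby's fork is available. The helper name isolated is hypothetical and not part of Ruby Open Babel.

```ruby
# Run the given block in a forked child process and return its result
# as a String, or nil if the child crashed or exited abnormally.
# Assumes a platform that supports Kernel#fork (not Windows).
def isolated
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    writer.write(yield.to_s)  # send the block's result to the parent
    writer.close
    exit!(0)                  # skip at_exit handlers in the child
  end
  writer.close
  output = reader.read        # returns at EOF, even if the child died
  reader.close
  Process.wait(pid)
  $?.success? ? output : nil  # killed by a signal or non-zero exit: nil
end

# With inchi.rb loaded, a call might look like:
#   inchi = isolated { smiles_to_inchi(smiles) }
```

A nil result then marks an input that crashed or failed in the child, while the calling script keeps running. The cost is one fork per conversion, which may matter for large data sets.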
Conclusions
As you can see, Ruby Open Babel makes short work of interconverting SMILES and InChIs. Despite problems with stereochemical configuration and segfaults on reading certain SMILES strings, the approach outlined here offers a quick and economical way to interconvert a variety of SMILES and InChIs.


The crash is a known bug with some SMILES -> InChI conversions with 2.1.0 and is fixed in the SVN trunk and branch for 2.1.1.
I'll take a look at the stereo issue -- I think that may also be fixed in the latest code.
No, unfortunately the stereo issue is present in the 2.1 branch and is a new bug. We'll see what we can do ASAP.
Would running this code on a dataset of 3D structures yield some useful bug reports? I think that if we could finally nail SMILES support, this would be a good thing. Maybe once Geoff and co. fix this problem, you could run the code on PubChem or ZINC.
Noel,
I've been thinking along exactly the same lines myself...
Similarly, we can run OB on every entry in the PDB. At least we can find out if anything breaks the parser...
The biggest limiting factor in my testing recently has been in my use of a laptop with a small, slow hard drive. I simply don't have the disk space to keep PubChem or ZINC or the PDB around.
That will change in a few months...
But I think SMILES support in Open Babel is pretty robust -- it powers eMolecules, with somewhere north of 10 million molecules. Craig James reported all sorts of SMILES and SMARTS errors.
If either (or both) of you would like to try on ZINC or PubChem or PDB, I suspect we'll uncover more lurking bugs. We're getting closer to "industrial strength" though.
Regarding SMILES parsers, check out the recent article by Andrew Dalke (whom I also met at Sheffield) at: http://www.dalkescientific.com/writings/diary/archive/2007/06/25/smiles_states.html
I've downloaded the remediated PDB (a somewhat easier test set, I suspect, than the raw PDB, but we have to start somewhere), and will attempt to read in all of the files with pybel over the weekend. Converting with babel gives lots of error messages, but no problems converting...do we want to follow these up at some point?
Regarding 'industrial strength'...coming from a scripting background, I am a big believer in regression and unit tests, and I think they are the only way to ensure a rock solid parser. Any code submitted that breaks a test should just be reverted. This way incremental improvements are guaranteed.
Andrew Dalke's recent warning about parsing unfiltered InChIs in a production environment is also worth reading.