Interconvert (Almost) Any SMILES and InChI with Ruby Open Babel

Posted by Rich Apodaca Mon, 25 Jun 2007 12:45:00 GMT

SMILES and InChI are the two most widely used line notations in cheminformatics. Not surprisingly, it's often useful to interconvert the two. This article shows a simple method for doing so using Ruby Open Babel.

Parsing InChIs

Version 1.01 of the IUPAC/NIST C InChI toolkit introduced the ability to parse InChIs. This capability has subsequently been incorporated into Open Babel, and by extension, Ruby Open Babel. It's this capability that we'll take advantage of.

A Simple Library

The following library provides everything we need to convert between SMILES and InChI via Ruby:

require 'openbabel'

module InChI
  @@to_smiles = OpenBabel::OBConversion.new
  @@to_inchi = OpenBabel::OBConversion.new
  @@to_smiles.set_in_and_out_formats 'inchi', 'smi'
  @@to_inchi.set_in_and_out_formats 'smi', 'inchi'

  def inchi_to_smiles inchi
    mol = OpenBabel::OBMol.new

    @@to_smiles.read_string(mol, inchi) or raise "Can't parse InChI: #{inchi}."
    @@to_smiles.write_string(mol).strip
  end

  def smiles_to_inchi smiles
    mol = OpenBabel::OBMol.new

    @@to_inchi.read_string(mol, smiles) or raise "Can't parse SMILES: #{smiles}."
    @@to_inchi.write_string(mol).strip
  end
end

Testing the Library

After saving the above code to a file named inchi.rb, we can interactively convert SMILES and InChIs:

$ irb
irb(main):001:0> require 'inchi'
=> true
irb(main):002:0> include InChI
=> Object
irb(main):003:0> smiles = inchi_to_smiles "InChI=1/C14H12/c1-3-7-13(8-4-1)11-12-14-9-5-2-6-10-14/h1-12H/b12-11-"
=> "c1ccc(cc1)C(/[H])=C(/[H])c1ccccc1"
irb(main):004:0> inchi = smiles_to_inchi smiles
=> "InChI=1/C14H12/c1-3-7-13(8-4-1)11-12-14-9-5-2-6-10-14/h1-12H/b12-11-"

In the above test, the InChI for cis-stilbene is converted into a SMILES string, which is then converted back to InChI with complete fidelity, including alkene geometry. Note that this would not have been possible with the previously discussed approach, in which molfiles served as intermediate data structures.
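
The round trip shown above can be wrapped in a small helper so that it is easy to repeat for other structures. This is only a sketch: RoundTrip and round_trip are hypothetical names that are not part of Ruby Open Babel, and the helper assumes it is given some object that responds to the inchi_to_smiles and smiles_to_inchi methods defined in the library above.

```ruby
# Result of one InChI -> SMILES -> InChI round trip.
# (RoundTrip is a hypothetical helper, not part of Ruby Open Babel.)
RoundTrip = Struct.new(:input, :smiles, :output) do
  # True when the regenerated InChI is identical to the one we started with.
  def faithful?
    input == output
  end
end

# `converter` is assumed to respond to inchi_to_smiles and smiles_to_inchi,
# for example an object extended with the InChI module from this article.
def round_trip(converter, inchi)
  smiles = converter.inchi_to_smiles(inchi)
  RoundTrip.new(inchi, smiles, converter.smiles_to_inchi(smiles))
end
```

With the library loaded, something like round_trip(Object.new.extend(InChI), inchi).faithful? should report whether a given InChI survives the trip unchanged.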

What about chiral centers? Here the results are mixed. For example, when the round-trip conversion is applied to propranolol (PubChem, Video), the configuration of the stereocenter is inverted.

$ irb
irb(main):001:0> require 'inchi'
=> true
irb(main):002:0> include InChI
=> Object
irb(main):003:0> smiles = inchi_to_smiles "InChI=1/C16H21NO2/c1-12(2)17-10-14(18)11-19-16-9-5-7-13-6-3-4-8-15(13)16/h3-9,12,14,17-18H,10-11H2,1-2H3/t14-/m1/s1"
=> "CC(C)NC[C@@H](COc1cccc2ccccc12)O"
irb(main):004:0> inchi = smiles_to_inchi smiles
=> "InChI=1/C16H21NO2/c1-12(2)17-10-14(18)11-19-16-9-5-7-13-6-3-4-8-15(13)16/h3-9,12,14,17-18H,10-11H2,1-2H3/t14-/m0/s1"

However, the same round-trip conversion of 1-phenylethanol works without inversion of stereochemistry:

$ irb
irb(main):001:0> require 'inchi'
=> true
irb(main):002:0> include InChI
=> Object
irb(main):003:0> smiles = inchi_to_smiles "InChI=1/C8H10O/c1-7(9)8-5-3-2-4-6-8/h2-7,9H,1H3/t7-/m0/s1"
=> "C[C@@H](c1ccccc1)O"
irb(main):004:0> inchi = smiles_to_inchi smiles
=> "InChI=1/C8H10O/c1-7(9)8-5-3-2-4-6-8/h2-7,9H,1H3/t7-/m0/s1"

The most likely explanation is that under certain conditions, Open Babel incorrectly interprets and/or writes stereo parities.
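
To see exactly where two InChIs disagree, it can help to compare them layer by layer. The following is a rough sketch in plain Ruby with no Open Babel dependency. The names inchi_layers and differing_layers are hypothetical, and the code assumes a simple InChI in which each layer prefix (c, h, b, t, m, s) appears at most once, which holds for the examples in this article but not for InChIs with fixed-hydrogen or isotopic sections.

```ruby
# Split an InChI into a hash of layers keyed by their one-letter prefix.
# The version and formula layers have no prefix letter, so they are stored
# under the keys "version" and "formula".
# Assumption: each prefix occurs only once in the string.
def inchi_layers(inchi)
  version, formula, *rest = inchi.strip.split("/")
  layers = { "version" => version, "formula" => formula }
  rest.each { |layer| layers[layer[0]] = layer[1..-1] }
  layers
end

# Return the keys of the layers whose contents differ between two InChIs.
def differing_layers(a, b)
  la = inchi_layers(a)
  lb = inchi_layers(b)
  (la.keys | lb.keys).select { |key| la[key] != lb[key] }
end
```

Applied to the two propranolol InChIs above, differing_layers returns ["m"]: the /t14- layer is unchanged and only the /m layer flips from 1 to 0, which is how the inversion shows up in the string.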

One More Gotcha

On my system (Mandriva Linux 2007.1), the round-trip test on glucose reproducibly ended in a segfault:

$ irb
irb(main):001:0> require 'inchi'
=> true
irb(main):002:0> include InChI
=> Object
irb(main):003:0> smiles = inchi_to_smiles "InChI=1/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2/t2-,3-,4+,5-,6?/m1/s1"
=> "C([C@H]1[C@H]([C@@H]([C@H](C(O)O1)O)O)O)O"
irb(main):004:0> inchi = smiles_to_inchi smiles
./inchi.rb:20: [BUG] Segmentation fault
ruby 1.8.6 (2007-03-13) [i686-linux]

Aborted

The same segfault was obtained when using the babel command-line utility:

$ babel -ismi -oinchi
C([C@H]1[C@H]([C@@H]([C@H](C(O)O1)O)O)O)O
[Return]
Segmentation fault
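
Because a segfault in the native library takes the whole Ruby interpreter down with it, one way to keep a longer-running script alive is to perform each conversion in a child process. The sketch below is one possible approach rather than anything provided by Ruby Open Babel: run_isolated is a hypothetical helper, it relies on fork and so only works on POSIX systems, and it assumes the block's return value can be serialized with Marshal (a plain string such as an InChI can).

```ruby
# Run a block in a forked child process so that a crash inside a native
# extension cannot take down the calling process. POSIX only (uses fork).
# Returns [:ok, value], [:crashed, signal_number] or [:failed, exit_status].
def run_isolated
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    result = yield
    writer.write(Marshal.dump(result)) # send the result back to the parent
    writer.close
    exit!(0)                           # skip at_exit handlers in the child
  end
  writer.close
  payload = reader.read                # returns once the child closes its end
  reader.close
  _, status = Process.wait2(pid)
  if status.success?
    [:ok, Marshal.load(payload)]
  elsif status.signaled?
    [:crashed, status.termsig]
  else
    [:failed, status.exitstatus]
  end
end
```

With the library from this article loaded, a call such as run_isolated { smiles_to_inchi(smiles) } should come back with a :crashed or :failed result for the glucose SMILES instead of aborting irb.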

Conclusions

As you can see, Ruby Open Babel makes short work of interconverting SMILES and InChIs. Despite problems with stereochemical configuration and segfaults when converting certain SMILES strings to InChI, the approach outlined here offers a quick and economical way to interconvert a variety of SMILES and InChIs.

Comments

  1. Geoff Mon, 25 Jun 2007 13:33:13 GMT

    The crash is a known bug with some SMILES -> InChI conversions with 2.1.0 and is fixed in the SVN trunk and branch for 2.1.1.

    I'll take a look at the stereo issue -- I think that may also be fixed in the latest code too.

  2. Geoff Mon, 25 Jun 2007 13:48:20 GMT

    No, unfortunately the stereo issue is present in the 2.1 branch and is a new bug. We'll see what we can do ASAP.

  3. baoilleach Mon, 25 Jun 2007 14:43:45 GMT

    Would running this code on a dataset of 3D structures yield some useful bug reports? I think that if we could finally nail SMILES support, this would be a good thing. Maybe once Geoff and co. fix this problem, you could run the code on PubChem or ZINC.

  4. Rich Apodaca Tue, 26 Jun 2007 03:22:15 GMT

    Noel,

    I've been thinking along exactly the same lines myself...

  5. baoilleach Wed, 27 Jun 2007 16:50:01 GMT

    Similarly, we can run OB on every entry in the PDB. At least we can find out if anything breaks the parser...

  6. Geoff Thu, 28 Jun 2007 22:06:59 GMT

    The biggest limiting factor in my testing recently has been in my use of a laptop with a small, slow hard drive. I simply don't have the disk space to keep PubChem or ZINC or the PDB around.

    That will change in a few months...

    But I think SMILES support in Open Babel is pretty robust -- it powers eMolecules, with somewhere north of 10 million molecules. Craig James reported all sorts of SMILES and SMARTS errors.

    If either (or both) of you would like to try on ZINC or PubChem or PDB, I suspect we'll uncover more lurking bugs. We're getting closer to "industrial strength" though.

  7. baoilleach Fri, 29 Jun 2007 07:40:57 GMT

    Regarding SMILES parsers, check out the recent article by Andrew Dalke (whom I also met at Sheffield) at: http://www.dalkescientific.com/writings/diary/archive/2007/06/25/smiles_states.html

    I've downloaded the remediated PDB (a somewhat easier test set, I suspect, than the raw PDB, but we have to start somewhere), and will attempt to read in all of the files with pybel over the weekend. Converting with babel gives lots of error messages, but no problems converting...do we want to follow these up at some point?

    Regarding 'industrial strength'...coming from a scripting background, I am a big believer in regression and unit tests, and I think they are the only way to ensure a rock solid parser. Any code submitted that breaks a test should just be reverted. This way incremental improvements are guaranteed.

  8. Rich Apodaca Fri, 29 Jun 2007 13:11:30 GMT

    Andrew Dalke's recent warning about parsing unfiltered InChIs in a production environment is also worth reading.