Interconvert (Almost) Any SMILES and InChI with Ruby Open Babel

Posted by Rich Apodaca Mon, 25 Jun 2007 08:45:00 GMT

SMILES and InChI are the two most widely used line notations in cheminformatics, so it's often useful to interconvert them. This article shows a simple way to do so with Ruby Open Babel.

Parsing InChIs

Version 1.01 of the IUPAC/NIST C InChI toolkit introduced the ability to parse InChIs. This capability has subsequently been incorporated into Open Babel, and by extension, Ruby Open Babel. It's this capability that we'll take advantage of.

A Simple Library

The following library provides everything we need to convert between SMILES and InChI via Ruby:

require 'openbabel'

module InChI
  @@to_smiles = OpenBabel::OBConversion.new
  @@to_inchi = OpenBabel::OBConversion.new
  @@to_smiles.set_in_and_out_formats 'inchi', 'smi'
  @@to_inchi.set_in_and_out_formats 'smi', 'inchi'

  def inchi_to_smiles inchi
    mol = OpenBabel::OBMol.new

    @@to_smiles.read_string(mol, inchi) or raise "Can't parse InChI: #{inchi}."
    @@to_smiles.write_string(mol).strip
  end

  def smiles_to_inchi smiles
    mol = OpenBabel::OBMol.new

    @@to_inchi.read_string(mol, smiles) or raise "Can't parse SMILES: #{smiles}."
    @@to_inchi.write_string(mol).strip
  end
end

Testing the Library

After saving the above code to a file named inchi.rb, we can interactively convert SMILES and InChIs:

$ irb
irb(main):001:0> require 'inchi'
=> true
irb(main):002:0> include InChI
=> Object
irb(main):003:0> smiles = inchi_to_smiles "InChI=1/C14H12/c1-3-7-13(8-4-1)11-12-14-9-5-2-6-10-14/h1-12H/b12-11-"
=> "c1ccc(cc1)C(/[H])=C(/[H])c1ccccc1"
irb(main):004:0> inchi = smiles_to_inchi smiles
=> "InChI=1/C14H12/c1-3-7-13(8-4-1)11-12-14-9-5-2-6-10-14/h1-12H/b12-11-"

In the above test, the InChI for cis-stilbene is converted into a SMILES string, which is then converted back to InChI with complete fidelity, including alkene geometry. This would not have been possible using the previously discussed approach, in which molfiles served as intermediate data structures.

What about chiral centers? Here the results are mixed. For example, when the round-trip conversion is applied to propranolol, the configuration of the stereocenter is inverted, as the change from /m1 to /m0 in the final InChI shows:

$ irb
irb(main):001:0> require 'inchi'
=> true
irb(main):002:0> include InChI
=> Object
irb(main):003:0> smiles = inchi_to_smiles "InChI=1/C16H21NO2/c1-12(2)17-10-14(18)11-19-16-9-5-7-13-6-3-4-8-15(13)16/h3-9,12,14,17-18H,10-11H2,1-2H3/t14-/m1/s1"
=> "CC(C)NC[C@@H](COc1cccc2ccccc12)O"
irb(main):004:0> inchi = smiles_to_inchi smiles
=> "InChI=1/C16H21NO2/c1-12(2)17-10-14(18)11-19-16-9-5-7-13-6-3-4-8-15(13)16/h3-9,12,14,17-18H,10-11H2,1-2H3/t14-/m0/s1"

However, the same round-trip conversion of 1-phenylethanol works without inversion of stereochemistry:

$ irb
irb(main):001:0> require 'inchi'
=> true
irb(main):002:0> include InChI
=> Object
irb(main):003:0> smiles = inchi_to_smiles "InChI=1/C8H10O/c1-7(9)8-5-3-2-4-6-8/h2-7,9H,1H3/t7-/m0/s1"
=> "C[C@@H](c1ccccc1)O"
irb(main):004:0> inchi = smiles_to_inchi smiles
=> "InChI=1/C8H10O/c1-7(9)8-5-3-2-4-6-8/h2-7,9H,1H3/t7-/m0/s1"

The most likely explanation is that under certain conditions, Open Babel incorrectly interprets and/or writes stereo parities.
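
One way to see exactly where a round trip went wrong is to compare the two InChIs layer by layer. The following is a minimal sketch in plain Ruby and is not part of Open Babel. The helper names inchi_layers and differing_layers are made up for illustration, and the parser assumes each layer prefix appears at most once. That holds for the simple InChIs shown here but not in general (for example, when fixed-hydrogen or isotopic layers repeat a prefix).

```ruby
# Split an InChI string into a hash of layer prefix => layer body.
# Assumption: every prefix occurs at most once in the string.
def inchi_layers(inchi)
  version, formula, *rest = inchi.strip.split('/')
  layers = {
    'version' => version.to_s.sub('InChI=', ''),
    'formula' => formula
  }
  # Every later layer starts with a one-letter prefix (c, h, b, t, m, s, ...).
  rest.each { |layer| layers[layer[0, 1]] = layer[1..-1] }
  layers
end

# Return the prefixes of all layers whose contents differ between two InChIs.
def differing_layers(a, b)
  la = inchi_layers(a)
  lb = inchi_layers(b)
  (la.keys | lb.keys).reject { |key| la[key] == lb[key] }
end

before = "InChI=1/C16H21NO2/c1-12(2)17-10-14(18)11-19-16-9-5-7-13-6-3-4-8-15(13)16/h3-9,12,14,17-18H,10-11H2,1-2H3/t14-/m1/s1"
after  = "InChI=1/C16H21NO2/c1-12(2)17-10-14(18)11-19-16-9-5-7-13-6-3-4-8-15(13)16/h3-9,12,14,17-18H,10-11H2,1-2H3/t14-/m0/s1"

p differing_layers(before, after)  # => ["m"]
```

For the two propranolol InChIs, only the m layer differs, which is consistent with an inverted stereocenter rather than a change in connectivity.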

One More Gotcha

On my system (Linux Mandriva 2007.1), attempting to perform the round-trip test on glucose resulted (reproducibly) in a segfault:

$ irb
irb(main):001:0> require 'inchi'
=> true
irb(main):002:0> include InChI
=> Object
irb(main):003:0> smiles = inchi_to_smiles "InChI=1/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2/t2-,3-,4+,5-,6?/m1/s1"
=> "C([C@H]1[C@H]([C@@H]([C@H](C(O)O1)O)O)O)O"
irb(main):004:0> inchi = smiles_to_inchi smiles
./inchi.rb:20: [BUG] Segmentation fault
ruby 1.8.6 (2007-03-13) [i686-linux]

Aborted

The same segfault was obtained when using the babel command-line utility:

$ babel -ismi -oinchi
C([C@H]1[C@H]([C@@H]([C@H](C(O)O1)O)O)O)O
[Return]
Segmentation fault
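
Because a segfault in the underlying native library takes the whole Ruby process down with it, one defensive option is to run each conversion in a child process. The sketch below uses only Ruby's standard fork and IO.pipe. The name run_isolated is a hypothetical helper, not part of Ruby Open Babel, and fork is not available on Windows.

```ruby
# Run the given block in a forked child process and return its result
# as a string. Returns nil if the child was killed by a signal
# (for example a segfault) or exited with a non-zero status.
def run_isolated
  reader, writer = IO.pipe

  pid = fork do
    reader.close
    writer.write(yield.to_s)  # an exception here makes the child exit non-zero
    writer.close
    exit!(0)                  # skip at_exit handlers inherited from the parent
  end

  writer.close
  output = reader.read        # read until the child closes its end of the pipe
  reader.close
  Process.wait(pid)

  $?.success? ? output : nil
end
```

With the InChI module above included, a call such as run_isolated { smiles_to_inchi(smiles) } should return nil for the glucose SMILES instead of aborting irb, assuming the crash stays confined to the child process.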

Conclusions

As you can see, Ruby Open Babel makes short work of interconverting SMILES and InChIs. Despite problems with stereochemical configuration and segfaults when converting certain SMILES strings, the approach outlined here offers a quick and economical way to interconvert a variety of SMILES and InChIs.

My InChI Runneth Over

Posted by Rich Apodaca Thu, 17 May 2007 08:59:00 GMT

The only solution to this problem I've found is to set the CSS overflow property to "scroll":

InChI=1/C50H70O14/c1-25(24-51)14-28-17-37(52)50(8)41(54-28)19-33-34(61-50)18-32-29(55-33)10-9-12-46(4)42(58-32)23-49(7)40(62-46)21-39-47(5,64-49)13-11-30-44(60-39)26(2)15-31-36(56-30)22-48(6)38(57-31)20-35-45(63-48)27(3)16-43(53)59-35/h9-10,16,24,26,28-42,44-45,52H,1,11-15,17-23H2,2-8H3/b10-9-

Strings and Things

Posted by Rich Apodaca Wed, 25 Apr 2007 09:28:00 GMT

I ran across John Bradshaw's excellent presentation Strings and Things. Part historical overview, part explanation of the SMILES/SMARTS line notation systems, Bradshaw's slides are chock full of interesting tidbits.

My favorite: slide 29 - "Line notations are dead." It's a wonderful illustration of why predicting the future of technology is so tricky. The light pen became the mouse, the computer display became color, and Digital fell off a cliff. SMILES and SMARTS are the only things to have survived.

Structure Diagram Generation

Posted by Rich Apodaca Wed, 11 Apr 2007 10:20:00 GMT

Given a molecule with no 2D coordinates, how would you render a human-readable view? This problem arises in many situations, but most commonly in the context of interpreting line notations such as IUPAC nomenclature, SMILES, or InChI. Whatever the solution you come up with, you'll come face-to-face with the structure diagram generation (SDG) problem.

Generating 2D molecular coordinates is a fundamental (and remarkably difficult) problem in cheminformatics. Discussions in the primary literature date back to at least the 1970s with Chemical Abstracts Service's pioneering large-scale efforts. A recent article from Chemical Computing Group (CCG) described the design and implementation of an advanced SDG system. To my knowledge, the only open source implementation of an SDG system is found in the Chemistry Development Kit, and by extension Ruby CDK.

The SDG problem plays an important role in the aesthetics of chemical structure diagrams, as mentioned by two readers. To render a molecule aesthetically, 2D coordinates must minimize confusing atom overlaps, unconventional orientations, and unusual bond angles.

The role of SDG in cheminformatics can only continue to increase in importance, especially as more and more structures are automatically generated through mining the primary literature, the Internet, old PDFs, and other sources. With all of these new computer-generated structures will come the need to make them readily understandable to a chemist through SDG.

Creating Canonical SMILES with Ruby Open Babel

Posted by Rich Apodaca Tue, 03 Apr 2007 11:59:00 GMT

Unlike many data types, molecular structure representations are not normally unique. Each numbering scheme you choose for the atoms and bonds of a molecule gives rise to a completely accurate, but degenerate, molecular representation. This is one of the fundamental peculiarities of chemical information, and the focus of much research activity over the last sixty or so years. One of the most widely used approaches to this problem is canonicalization.

This article discusses the SMILES canonicalization capability in the upcoming Open Babel 2.1 release. Among several other enhancements, this release will also feature a brand new Ruby interface. By way of preview, this article will demonstrate just how convenient it has now become to generate canonical SMILES strings with Ruby.

Consider the putative rodenticide aminopterin, the structure of which is shown above. Regardless of whether it turns out to be the culprit in the recent pet food poisoning case, it's a relatively complex molecule. And with this complexity comes many possible representations. Here's one of just hundreds, if not thousands, of possible SMILES strings for this molecule:

Nc3nc(N)c2nc(CNc1ccc(C(=O)N[C@@H](CCC(=O)O)C(=O)O)cc1)cnc2n3

If you were developing a database of molecules and needed to support exact structure searching, how would you do it? One way would be to convert a query molecule to a canonical SMILES string, and then simply look for that string in an index of your database's canonical SMILES. This is useful because it allows us to convert a chemistry-specific problem (exact structure search) into a generic computer science problem (text matching).

We can create a simple Ruby library to convert any SMILES string into an Open Babel canonical SMILES string:

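require 'openbabel'

# Converts arbitrary SMILES strings into Open Babel canonical SMILES.
class Can
  def initialize
    @conversion = OpenBabel::OBConversion.new
    # 'smi' reads SMILES; 'can' writes canonical SMILES.
    @conversion.set_in_and_out_formats 'smi', 'can'
  end

  def convert smiles
    mol = OpenBabel::OBMol.new

    @conversion.read_string(mol, smiles) or raise "Can't parse SMILES: #{smiles}."
    @conversion.write_string(mol).strip
  end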
end

Save this code as a file called can.rb in your working directory. The library can then be used, for example, via interactive Ruby (irb):

$ irb
irb(main):001:0> require 'can'
=> true
irb(main):002:0> c=Can.new
=> #<Can:0x...>
irb(main):003:0> puts c.convert('Nc3nc(N)c2nc(CNc1ccc(C(=O)N[C@@H](CCC(=O)O)C(=O)O)cc1)cnc2n3')
OC(=O)CC[C@@H](NC(=O)c1ccc(NCc2cnc3nc(N)nc(N)c3n2)cc1)C(=O)O
=> nil
irb(main):004:0> puts c.convert('C1=CC(=CC=C1C(=O)N[C@@H](CCC(=O)O)C(=O)O)NCC2=CN=C3C(=N2)C(=NC(=N3)N)N')
OC(=O)CC[C@@H](NC(=O)c1ccc(NCc2cnc3nc(N)nc(N)c3n2)cc1)C(=O)O
=> nil

As you can see, both SMILES strings for aminopterin were converted into the same canonical SMILES string.
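
The exact-structure search idea described earlier can be sketched with an ordinary Ruby Hash keyed by canonical SMILES. This is only an illustration under stated assumptions: StructureIndex and its methods are hypothetical names, and the canonicalizer can be any object that responds to convert and returns a string, such as an instance of the Can class above.

```ruby
# Hypothetical exact-structure index: maps canonical SMILES to record ids.
class StructureIndex
  # canonicalizer: any object with a convert(smiles) method returning a String.
  def initialize(canonicalizer)
    @canonicalizer = canonicalizer
    @index = {}
  end

  # Store a record id under the canonical form of its SMILES.
  def add(id, smiles)
    @index[key_for(smiles)] = id
  end

  # Look up a query SMILES; returns the stored id, or nil if absent.
  def find(smiles)
    @index[key_for(smiles)]
  end

  private

  # Canonicalize and strip any trailing newline so keys compare cleanly.
  def key_for(smiles)
    @canonicalizer.convert(smiles).strip
  end
end
```

With Open Babel installed, StructureIndex.new(Can.new) would let either of the two aminopterin SMILES strings above retrieve a record that was stored under the other, since both reduce to the same key. The lookup itself is plain text matching, as described.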

Unlike InChI, which uses a "standard" canonicalization algorithm, SMILES canonicalization varies by software package. As a result, the SMILES canonicalization described here will be most useful within a software package, but probably not externally to it, at least initially.

Ruby is still an upstart language in cheminformatics. But tools like Ruby CDK and Ruby Open Babel offer ample opportunities for learning what this remarkable language can do for the development of chemistry applications.
