Easily Convert IUPAC Nomenclature to SMILES, InChI, or Molfile with Rubidium

Posted by Rich Apodaca Fri, 19 Oct 2007 14:05:00 GMT

A recent article introduced Rubidium, a cheminformatics toolkit written in Ruby. One of Ruby's strengths is the speed with which it enables disparate pieces of code to be glued together - even if they're written in different programming languages. In this article, we'll see how Rubidium can be extended to provide support for converting IUPAC nomenclature into SMILES, InChI, or Molfile formats.

About Rubidium

Rubidium is a cheminformatics toolkit written in Ruby. Rubidium is currently configured to run on JRuby, although future versions may also work with Matz' Ruby Implementation (MRI) via Ruby Java Bridge.

Rubidium will eventually be packaged as a RubyGem and hosted on RubyForge. For now, the toolkit consists of a running library that will be updated and documented on this blog.

The Library

The library extends the CDK module presented in the previous article in this series. The main change is the addition of an IUPACReader class, based on Peter Corbett's excellent OPSIN library:

class IUPACReader
  import 'java.io.StringReader'
  import 'uk.ac.cam.ch.wwmm.opsin.NameToStructure'
  import 'org.openscience.cdk.io.CMLReader'
  import 'org.openscience.cdk.ChemFile'

  def initialize
    @iupac_reader = NameToStructure.new
    @cml_reader = CMLReader.new
  end

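  # Parses an IUPAC name with OPSIN and returns the first CDK molecule.
  def read name
    # A nil result from OPSIN is treated as a parse failure.
    cml = @iupac_reader.parse_to_cml(name)

    raise "Could not parse '#{name}'." unless cml

    # Hand the CML to the CDK's reader as XML text.
    @cml_reader.set_reader StringReader.new(cml.to_xml)

    chem_file = @cml_reader.read ChemFile.new

    chem_file.chem_sequence(0).chem_model(0).molecule_set.molecule(0)
  end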
end

Using this additional functionality requires nothing more than copying the OPSIN jarfile into the lib directory of your JRuby installation. You'll also need to place the CDK jarfile in this directory if you haven't done so already.

The complete Rubidium library can be downloaded here.

A Test

We can test Rubidium's IUPAC nomenclature parsing abilities with jirb. For example, to convert from name to SMILES:

$ jirb
irb(main):001:0> require 'cdk'
=> true
irb(main):002:0> c=CDK::Conversion.new
=> #<CDK::Conversion:0x46ca65 ... >
irb(main):003:0> c.set_formats 'iupac', 'smi'
=> "smi"
irb(main):004:0> c.convert '1,4-dichlorobenzene'
=> "C=1C=C(C=CC=1Cl)Cl"

To convert from name to InChI (in the same jirb session):

irb(main):005:0> c.set_out_format 'inchi'
=> "inchi"
irb(main):006:0> c.convert '1,4-dichlorobenzene'
=> "InChI=1/C6H4Cl2/c7-5-1-2-6(8)4-3-5/h1-4H"

And to convert from name to Molfile (also in the same jirb session):

irb(main):007:0> c.set_out_format 'mol'
=> "mol"
irb(main):008:0> c.convert '1,4-dichlorobenzene'
=> "\n  CDK    10/19/07,7:59\n\n  8  8  0  0  0  0  0  0  0  0999 V2000\n    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    0.0000    0.0000    0.0000 Cl  0  0  0  0  0  0  0  0  0  0  0  0\n    0.0000    0.0000    0.0000 Cl  0  0  0  0  0  0  0  0  0  0  0  0\n  1  2  2  0  0  0  0 \n  2  3  1  0  0  0  0 \n  3  4  2  0  0  0  0 \n  4  5  1  0  0  0  0 \n  5  6  2  0  0  0  0 \n  6  1  1  0  0  0  0 \n  7  1  1  0  0  0  0 \n  8  4  1  0  0  0  0 \nM  END\n"
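The Conversion class itself is not listed in this article. As a rough illustration of how a name reader could slot into a format-keyed conversion API, here is a plain-Ruby stand-in that mimics the session above. Everything in it is an assumption made for illustration: the ToyConversion name, the READERS and WRITERS tables, and the lookup-table "reader" that knows exactly one name. It calls neither OPSIN nor the CDK; the two output strings are simply copied from the jirb session.

```ruby
# Hypothetical stand-in for a format-keyed conversion API; not Rubidium's code.
class ToyConversion
  # Toy "molecule": precomputed outputs copied from the jirb session above.
  KNOWN = {
    '1,4-dichlorobenzene' => {
      'smi'   => 'C=1C=C(C=CC=1Cl)Cl',
      'inchi' => 'InChI=1/C6H4Cl2/c7-5-1-2-6(8)4-3-5/h1-4H'
    }
  }.freeze

  # Readers turn input text into a molecule; writers turn a molecule into text.
  READERS = {
    'iupac' => lambda { |name| KNOWN[name] or raise "Could not parse '#{name}'." }
  }.freeze

  WRITERS = {
    'smi'   => lambda { |mol| mol['smi'] },
    'inchi' => lambda { |mol| mol['inchi'] }
  }.freeze

  # Selects the reader and writer; returns the output format name.
  def set_formats(input, output)
    raise "Unknown input format '#{input}'." unless READERS.key?(input)
    @reader = READERS[input]
    set_out_format(output)
  end

  # Swaps only the writer, leaving the reader in place.
  def set_out_format(output)
    raise "Unknown output format '#{output}'." unless WRITERS.key?(output)
    @writer = WRITERS[output]
    output
  end

  def convert(text)
    @writer.call(@reader.call(text))
  end
end

c = ToyConversion.new
c.set_formats 'iupac', 'smi'
puts c.convert('1,4-dichlorobenzene')
c.set_out_format 'inchi'
puts c.convert('1,4-dichlorobenzene')
```

With this kind of design, supporting a new input language means registering one more reader; the writers and the calling code stay unchanged.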

Conclusions

By re-using a simple conversion API together with another Java library, we've given Rubidium the ability to translate IUPAC nomenclature into other molecular languages. The additional code was both easy to write and easy to test. Future articles will discuss the packaging, distribution, and further elaboration of Rubidium.

JRuby for Cheminformatics: Parsing SMILES Simply

Posted by Rich Apodaca Tue, 09 Oct 2007 12:40:00 GMT

The previous article in this series outlined some reasons to consider JRuby for cheminformatics. Now I'll show how easy it is to get started by describing how to parse SMILES strings with the help of the Chemistry Development Kit (CDK).

What About Ruby CDK?

A number of Depth-First articles have discussed Ruby CDK. This library runs on top of C-Ruby, otherwise known as Matz' Ruby Implementation (MRI). Ruby Java Bridge connects MRI to a Java Virtual Machine under Ruby CDK.

This article, and the others to follow, will instead discuss the use of the CDK and other Java libraries from JRuby. In contrast to MRI, JRuby is a pure Java implementation of the Ruby language. This approach offers some important advantages which will be highlighted along the way.

Installing JRuby

JRuby is not difficult to install. On Linux, the steps are:

  1. Install JDK Version 1.4 or higher.

  2. Download and unpack the most recent JRuby release - at the time of this writing, version 1.0.1.

  3. Add the JRuby bin directory to your path.

  4. There is no Step 4. ;-)

Installing CDK for JRuby

Installing CDK so that it works on JRuby is similarly quite simple:

  1. Download the most recent CDK jarfile - at the time of this writing, version 1.0.1.

  2. Move the CDK jarfile to your JRuby lib directory.

Testing CDK for JRuby

You can verify that your new CDK for JRuby installation works with jirb:

$ jirb
irb(main):001:0> require 'java'
=> true
irb(main):002:0> include_class 'org.openscience.cdk.smiles.SmilesParser'
=> ["org.openscience.cdk.smiles.SmilesParser"]

You should notice that jirb takes a few seconds to initialize the JVM, whereas irb starts almost instantly.

A Library to Read SMILES

We can write a short library to read SMILES strings using the CDK:

require 'java'
include_class 'org.openscience.cdk.smiles.SmilesParser'

module Daylight
  @@smiles_parser = SmilesParser.new

  def read_smiles smiles
    @@smiles_parser.parse_smiles smiles
  end
end

Notice the use of the Rubyesque method name parse_smiles rather than parseSmiles. This is just one of the built-in conveniences offered by JRuby.

Testing the Library

Saving the library as a file called daylight.rb lets us test it using interactive JRuby:

$ jirb
irb(main):001:0> require 'daylight'
=> true
irb(main):002:0> include Daylight
=> Object
irb(main):003:0> mol = read_smiles 'c1ccccc1'
=> #
irb(main):004:0> mol.atom_count
=> 6

As you can see, the benzene SMILES has been parsed correctly. Again, notice the use of the Rubyesque method name atom_count, rather than the CDK Java bean convention method name getAtomCount. This feature makes it easy to ignore the fact you're using a Java library and get on with writing your Ruby code. Brilliant!
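The two renamings mentioned above (parseSmiles to parse_smiles, getAtomCount to atom_count) follow a simple pattern. The sketch below reproduces that pattern in plain Ruby for illustration only; the rubyesque helper is a made-up name, and this is not how JRuby implements its aliases internally.

```ruby
# Illustration only: mimics the Java-to-Ruby method name pattern described above.
def rubyesque(java_name)
  # Bean getter prefix: getAtomCount -> AtomCount
  name = java_name.sub(/\Aget(?=[A-Z])/, '')
  # Insert "_" at each lowercase-to-uppercase boundary, then downcase.
  name.gsub(/([a-z\d])([A-Z])/, '\1_\2').downcase
end

puts rubyesque('parseSmiles')
puts rubyesque('getAtomCount')
```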

Conclusions

This article has shown how to install JRuby and begin to write some simple cheminformatics programs with a distinctive Ruby flavor. Although the focus was on SMILES parsing, there's much more functionality to be found within the CDK and other cheminformatics libraries written in Java. Future articles will outline some of the possibilities.

Ruby CDK One-Liners: Create a Molfile With 2D Atom Coordinates From Arbitrary SMILES Strings

Posted by Rich Apodaca Thu, 20 Sep 2007 18:18:00 GMT

A very common operation in cheminformatics is the interconversion of molfiles and SMILES strings. Usually, converting from SMILES gives a molfile in which all atoms have coordinates of (0,0,0). Sometimes you just need more than that. The following Ruby CDK code will accept an arbitrary SMILES string and return a molfile with fully-assigned 2D atom coordinates:

require 'rubygems'
require 'rcdk'
require 'rcdk/util'
include RCDK::Util

XY.coordinate_molfile Lang.smiles_to_molfile('c1ccccc1')

Looking at it this way, those four lines of require/include statements seem pretty darn verbose.
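To tell whether a given molfile still needs coordinates, it's enough to look at its atom block. The following plain-Ruby sketch assumes a V2000 molfile, in which the fourth line is the counts line (atom count in the first three columns) and each atom line stores x, y, and z in three 10-column fields. The helper names and both sample molfiles are hand-made for illustration; neither helper is part of Ruby CDK.

```ruby
# Returns [x, y, z] for every atom in a V2000 molfile.
def atom_coordinates(molfile)
  lines = molfile.lines.map(&:chomp)
  atom_count = lines[3][0, 3].to_i          # counts line: first 3 columns
  lines[4, atom_count].map do |line|
    [line[0, 10], line[10, 10], line[20, 10]].map(&:to_f)
  end
end

# True when every atom sits at (0,0,0), i.e. no layout has been done.
def needs_layout?(molfile)
  atom_coordinates(molfile).all? { |xyz| xyz.all?(&:zero?) }
end

# Hand-made two-atom sample, modeled on CDK output with unassigned coordinates.
ZEROED = "\n  CDK    10/19/07,7:59\n\n" \
         "  2  1  0  0  0  0  0  0  0  0999 V2000\n" \
         "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n" \
         "    0.0000    0.0000    0.0000 Cl  0  0  0  0  0  0  0  0  0  0  0  0\n" \
         "  1  2  1  0  0  0  0 \n" \
         "M  END\n"

# Same sample with an arbitrary non-zero x coordinate on the second atom.
LAID_OUT = ZEROED.sub('    0.0000    0.0000    0.0000 Cl',
                      '    1.5000    0.0000    0.0000 Cl')

puts needs_layout?(ZEROED)
puts needs_layout?(LAID_OUT)
```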

Everything Old is New Again: Wiswesser Line Notation (WLN)

Posted by Rich Apodaca Fri, 20 Jul 2007 12:46:00 GMT

Sometimes, searching through the attic of scientific ideas turns up unexpected treasures. Like old clothing styles that suddenly become fashionable again, the passage of time has a way of making old ideas relevant by supplying new context. Those ideas that once enjoyed widespread popularity followed by complete obscurity are especially interesting. This article talks about one of them and why it may matter again.

Some History

Wiswesser Line-Formula Chemical Notation (WLN) was the most popular of perhaps a dozen actively-used line notation systems during the 1960s and 1970s. Developed by William J. Wiswesser over a period of many years starting in the 1940s, WLN contains a surprising number of modern ideas about chemistry and information. At one point a serious contender for the position now held by IUPAC nomenclature, WLN has become so obscure that few chemists have even heard of it and no modern software can manipulate it. Even finding information on the basic grammar of WLN is difficult: almost all of this documentation is contained in out-of-print books.

A Guide

To my surprise, WLN is both easy to understand and easy to use. As far as canonicalized line notations go, WLN is far easier to comprehend than either InChI or Canonical SMILES. Even more surprisingly, WLN actually meets more than a few of the requirements for the ideal line notation for the Web. I was always struck by claims that high school graduates with little chemistry background could be trained to encode WLN in a few weeks; this now seems very plausible.

My guide is Elbert Smith's short 1968 book The Wiswesser Line-Formula Chemical Notation. I was able to pick up a used copy in excellent condition for under $30.00 from Amazon.

Some Examples

Functional groups, carbon chains, and rings play central roles in WLN. Unlike modern line notations that emphasize atoms, WLN is designed to mirror the way that chemists actually think about chemistry.

Consider acetone:

1V1

The two "1"s stand for saturated one-carbon chains, i.e. methyl groups. The "V" stands for a carbon doubly-bonded to oxygen.

Given nothing more than the above example, the encoding of diethyl ether should be completely clear:

2O2

"O" simply stands for a divalent oxygen atom.

The benzene ring is one of the most ubiquitous functional groups in organic chemistry. Wiswesser knew this and wanted to make it easy to encode aromatic compounds. His solution is simplicity itself. Consider acetophenone:

1VR

The "R" stands for a benzene ring. WLN canonicalization gives it the lowest priority and this is why it appears last.
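The three examples so far are regular enough to expand mechanically. The sketch below is a toy, not a WLN parser: it assumes only the symbols introduced above (numbers, V, O, and a terminal R) in unbranched strings, and it rejects everything else. The SMILES fragments chosen for each symbol and the helper name are my own.

```ruby
# Toy expansion of the few WLN symbols introduced so far into SMILES.
WLN_TO_SMILES = {
  'V' => 'C(=O)',     # carbon doubly bonded to oxygen
  'O' => 'O',         # divalent oxygen
  'R' => 'c1ccccc1'   # benzene ring (only sensible at the end of the string here)
}.freeze

def simple_wln_to_smiles(wln)
  wln.scan(/\d+|./).map do |token|
    if token =~ /\A\d+\z/
      'C' * token.to_i   # saturated, unbranched chain of n carbons
    else
      WLN_TO_SMILES.fetch(token) do
        raise ArgumentError, "Unsupported WLN symbol '#{token}'"
      end
    end
  end.join
end

puts simple_wln_to_smiles('1V1')  # acetone
puts simple_wln_to_smiles('2O2')  # diethyl ether
puts simple_wln_to_smiles('1VR')  # acetophenone
```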

What about disubstituted aromatics? Consider 4-chloroacetophenone:

GR DV1

The "G" symbol stands for chlorine. The " DV1" stands for the 4-acyl substituent. Here, the "D" denotes the 4-position. The 3-position would result in " CV1", and the 2-position would give " BV1". The space character means that the character following it should be interpreted as a ring locant.
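The locant letters map onto ring positions alphabetically. A minimal sketch of that mapping follows; the text above only gives B, C, and D, so treating A as position 1 is an assumption, and both helper names are invented for this example.

```ruby
# Ring locant letter -> ring position (B = 2, C = 3, D = 4, as described above).
def locant_to_position(locant)
  unless locant =~ /\A[A-Z]\z/
    raise ArgumentError, "Expected a single capital letter, got '#{locant}'"
  end
  locant.ord - 'A'.ord + 1
end

# Splits a ring substituent such as " DV1" into its position and its symbols.
def parse_ring_substituent(text)
  match = /\A ([A-Z])(.+)\z/.match(text)
  raise ArgumentError, "Not a ring substituent: '#{text}'" unless match
  { position: locant_to_position(match[1]), symbols: match[2] }
end

p parse_ring_substituent(' DV1')
p parse_ring_substituent(' BV1')
```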

WLN uses a very simple system of canonicalization based on alphanumeric order. Priority increases in the direction: (1) symbols; (2) numbers in numerical order; and (3) letters in alphabetical order (with the exception of R which has lower priority than symbols). Coding generally begins at the substituent assigned the highest priority. This explains why 4-chloroacetophenone is not coded as "1VR DG".
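The ordering rule as stated here can be written as a sort key. This is a simplified sketch of the rule in the paragraph above, not the full WLN canonicalization procedure: it ranks strings character by character, treats multi-digit numbers one digit at a time, and orders symbols among themselves by character code, which is an assumption on my part.

```ruby
# Priority of one character: R lowest, then symbols, then digits, then letters.
def symbol_priority(char)
  case char
  when 'R'     then [-1, 0]          # R ranks below everything, even symbols
  when /[A-Z]/ then [2, char.ord]    # letters: later in the alphabet ranks higher
  when /\d/    then [1, char.to_i]   # digits: larger ranks higher
  else              [0, char.ord]    # symbols, including the space
  end
end

# Picks the candidate coding that starts with the highest-priority symbols.
def preferred_coding(candidates)
  candidates.max_by { |wln| wln.chars.map { |c| symbol_priority(c) } }
end

puts preferred_coding(['1VR DG', 'GR DV1'])
puts preferred_coding(['RV1', '1VR'])
```

The second call shows the R exception at work: because R ranks below the digit 1, acetophenone is coded starting from the methyl end.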

Advantages of WLN

WLN is remarkably compact, especially when compared to SMILES and InChI. For example, consider the InChI for 4-chloroacetophenone, which is roughly eight times as long as the corresponding WLN:

InChI=1/C8H7ClO/c1-6(10)7-2-4-8(9)5-3-7/h2-5H,1H3
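The size comparison is easy to check by counting characters in the two strings given in this article:

```ruby
wln   = 'GR DV1'
inchi = 'InChI=1/C8H7ClO/c1-6(10)7-2-4-8(9)5-3-7/h2-5H,1H3'

# Ratio of InChI length to WLN length for 4-chloroacetophenone.
ratio = inchi.length.to_f / wln.length
puts format('%d vs %d characters, ratio %.1f', inchi.length, wln.length, ratio)
```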

Additionally, it's readily apparent to a human observer when a WLN is not properly coded - after all, the language was designed to be both read and written by humans rather than machines. Anyone can look at "GR DV1" and deduce almost instantly that it contains a carbonyl group (V), a phenyl group (R), a chloro group (G), and a methyl group (1).

And if this functional group recognition is easy for humans, it's orders of magnitude easier for machines. It's not difficult at all to imagine very sophisticated and fast molecular query systems that do nothing more than simple processing of the ASCII text contained within WLN strings.
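As a sketch of what such text-only processing might look like, the following plain-Ruby code lists the groups in a WLN string and filters a small collection by symbol. It only knows the symbols covered in this article, and the group labels and helper names are my own. Ring locants are stripped first so that a locant letter is not mistaken for a group symbol.

```ruby
# Labels for the WLN symbols discussed in this article only.
WLN_GROUPS = {
  'V' => 'carbonyl',
  'R' => 'benzene ring',
  'G' => 'chlorine',
  'O' => 'divalent oxygen'
}.freeze

# Removes ring locants: a space followed by one capital letter.
def strip_locants(wln)
  wln.gsub(/ [A-Z]/, '')
end

# Lists the groups encoded in a WLN string, in order of appearance.
def wln_groups(wln)
  strip_locants(wln).scan(/\d+|[A-Z]/).map do |token|
    if token =~ /\A\d+\z/
      "#{token}-carbon chain"
    else
      WLN_GROUPS.fetch(token, "unknown symbol #{token}")
    end
  end
end

# Substructure-style query: does the string contain the given symbol?
def contains_group?(wln, symbol)
  strip_locants(wln).include?(symbol)
end

p wln_groups('GR DV1')
p ['1V1', '2O2', '1VR', 'GR DV1'].select { |w| contains_group?(w, 'V') }
```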

Conclusions

It's very unlikely that WLN will ever be resurrected for the purpose of replacing existing line notations. On the other hand, WLN offers many potentially useful concepts for those creating new line notations. As they say, history doesn't repeat itself, but it frequently rhymes.

Interconvert (Almost) Any SMILES and InChI with Ruby Open Babel

Posted by Rich Apodaca Mon, 25 Jun 2007 12:45:00 GMT

SMILES and InChI are the two most widely-used line notations in cheminformatics. Not surprisingly, there are many situations in which it's useful to interconvert the two. This article shows a simple method for doing so using Ruby Open Babel.

Parsing InChIs

Version 1.01 of the IUPAC/NIST C InChI toolkit introduced the ability to parse InChIs. This capability has subsequently been incorporated into Open Babel, and by extension, Ruby Open Babel. It's this capability that we'll take advantage of.

A Simple Library

The following library provides everything we need to convert between SMILES and InChI via Ruby:

require 'openbabel'

module InChI
  @@to_smiles = OpenBabel::OBConversion.new
  @@to_inchi = OpenBabel::OBConversion.new
  @@to_smiles.set_in_and_out_formats 'inchi', 'smi'
  @@to_inchi.set_in_and_out_formats 'smi', 'inchi'

  def inchi_to_smiles inchi
    mol = OpenBabel::OBMol.new

    @@to_smiles.read_string(mol, inchi) or raise "Can't parse InChI: #{inchi}."
    @@to_smiles.write_string(mol).strip
  end

  def smiles_to_inchi smiles
    mol = OpenBabel::OBMol.new

    @@to_inchi.read_string(mol, smiles) or raise "Can't parse SMILES: #{smiles}."
    @@to_inchi.write_string(mol).strip
  end
end

Testing the Library

After saving the above code to a file named inchi.rb, we can interactively convert SMILES and InChIs:

$ irb
irb(main):001:0> require 'inchi'
=> true
irb(main):002:0> include InChI
=> Object
irb(main):003:0> smiles = inchi_to_smiles "InChI=1/C14H12/c1-3-7-13(8-4-1)11-12-14-9-5-2-6-10-14/h1-12H/b12-11-"
=> "c1ccc(cc1)C(/[H])=C(/[H])c1ccccc1"
irb(main):004:0> inchi = smiles_to_inchi smiles
=> "InChI=1/C14H12/c1-3-7-13(8-4-1)11-12-14-9-5-2-6-10-14/h1-12H/b12-11-"

In the above test, the InChI for cis-stilbene is converted into a SMILES string which is then converted back to InChI form with complete fidelity, including alkene geometry. Note that this would not have been possible using the approach that was previously discussed in which molfiles were used as intermediate data structures.

What about chiral centers? Here the results are mixed. For example, when the round-trip conversion is applied to propranolol (PubChem, Video), the configuration of the stereocenter is inverted:

$ irb
irb(main):001:0> require 'inchi'
=> true
irb(main):002:0> include InChI
=> Object
irb(main):003:0> smiles = inchi_to_smiles "InChI=1/C16H21NO2/c1-12(2)17-10-14(18)11-19-16-9-5-7-13-6-3-4-8-15(13)16/h3-9,12,14,17-18H,10-11H2,1-2H3/t14-/m1/s1"
=> "CC(C)NC[C@@H](COc1cccc2ccccc12)O"
irb(main):004:0> inchi = smiles_to_inchi smiles
=> "InChI=1/C16H21NO2/c1-12(2)17-10-14(18)11-19-16-9-5-7-13-6-3-4-8-15(13)16/h3-9,12,14,17-18H,10-11H2,1-2H3/t14-/m0/s1"

However, the same round-trip conversion of 1-phenylethanol works without inversion of stereochemistry:

$ irb
irb(main):001:0> require 'inchi'
=> true
irb(main):002:0> include InChI
=> Object
irb(main):003:0> smiles = inchi_to_smiles " InChI=1/C8H10O/c1-7(9)8-5-3-2-4-6-8/h2-7,9H,1H3/t7-/m0/s1"
=> "C[C@@H](c1ccccc1)O"
irb(main):004:0> inchi = smiles_to_inchi smiles
=> "InChI=1/C8H10O/c1-7(9)8-5-3-2-4-6-8/h2-7,9H,1H3/t7-/m0/s1"

The most likely explanation is that under certain conditions, Open Babel incorrectly interprets and/or writes stereo parities.
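Spotting this kind of discrepancy by eye is tedious, so it helps to compare the two InChIs layer by layer. The sketch below is plain Ruby with no Open Babel dependency. It assumes each layer after the formula appears at most once and is identified by its first letter, which holds for the InChIs in this article but not for every InChI; the helper names are invented for this example.

```ruby
# Splits an InChI into a hash of layers keyed by prefix letter.
def inchi_layers(inchi)
  version, formula, *rest = inchi.strip.split('/')
  layers = { 'version' => version, 'formula' => formula }
  rest.each { |segment| layers[segment[0]] = segment[1..-1] }
  layers
end

# Returns { layer => [before, after] } for every layer that differs.
def differing_layers(before, after)
  a = inchi_layers(before)
  b = inchi_layers(after)
  (a.keys | b.keys).each_with_object({}) do |key, diff|
    diff[key] = [a[key], b[key]] unless a[key] == b[key]
  end
end

# The propranolol InChIs from the session above, before and after the round trip.
before = 'InChI=1/C16H21NO2/c1-12(2)17-10-14(18)11-19-16-9-5-7-13-6-3-4-8-15(13)16/h3-9,12,14,17-18H,10-11H2,1-2H3/t14-/m1/s1'
after  = 'InChI=1/C16H21NO2/c1-12(2)17-10-14(18)11-19-16-9-5-7-13-6-3-4-8-15(13)16/h3-9,12,14,17-18H,10-11H2,1-2H3/t14-/m0/s1'

p differing_layers(before, after)
```

For the propranolol pair, only the /m layer differs; an empty hash means the round trip preserved every layer.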

One More Gotcha

On my system (Linux Mandriva 2007.1), attempting to perform the round-trip test on glucose resulted (reproducibly) in a segfault:

$ irb
irb(main):001:0> require 'inchi'
=> true
irb(main):002:0> include InChI
=> Object
irb(main):003:0> smiles = inchi_to_smiles "InChI=1/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2/t2-,3-,4+,5-,6?/m1/s1"
=> "C([C@H]1[C@H]([C@@H]([C@H](C(O)O1)O)O)O)O"
irb(main):004:0> inchi = smiles_to_inchi smiles
./inchi.rb:20: [BUG] Segmentation fault
ruby 1.8.6 (2007-03-13) [i686-linux]

Aborted

The same segfault was obtained when using the babel command-line utility:

$ babel -ismi -oinchi
C([C@H]1[C@H]([C@@H]([C@H](C(O)O1)O)O)O)O
[Return]
Segmentation fault

Conclusions

As you can see, Ruby Open Babel makes short work of interconverting SMILES and InChIs. Despite problems with stereochemical configuration and segfaults on reading certain SMILES strings, the approach outlined here offers a quick and economical way to interconvert a variety of SMILES and InChIs.
