JRuby for Cheminformatics - Parsing IUPAC Nomenclature with OPSIN

October 12, 2007

Recent articles have discussed the use of JRuby for cheminformatics. We've seen how to parse SMILES strings, and read or write InChIs. In this article, we'll see how easy it is to parse IUPAC nomenclature from JRuby using Peter Corbett's OPSIN library.

Installation

After installing JRuby, download the OPSIN jar file and copy it into your JRuby lib directory. That's all there is to it.
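A quick sanity check can confirm that the jar was picked up. The following is a minimal sketch, not part of OPSIN or JRuby: the helper name opsin_available? is hypothetical, and it assumes the class name used later in this article. It returns false when run under a Ruby other than JRuby.

```ruby
# Returns true only when running under JRuby with the OPSIN class loadable.
# opsin_available? is a hypothetical helper name used for illustration.
def opsin_available?
  return false unless RUBY_PLATFORM == 'java' # JRuby reports 'java' here

  require 'java'
  java_import 'uk.ac.cam.ch.wwmm.opsin.NameToStructure'
  true
rescue NameError, LoadError
  # JRuby raises NameError when a Java class cannot be loaded.
  false
end

if opsin_available?
  puts 'OPSIN is ready'
else
  puts 'OPSIN is not available (not running JRuby, or the jar is not on the classpath)'
end
```

If the check fails under JRuby, the jar is most likely in the wrong directory.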

A Simple Library

We can write a simple library to convert an IUPAC name into a CML document:

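require 'java'

# The OPSIN jar must be on the classpath (see Installation).
java_import 'uk.ac.cam.ch.wwmm.opsin.NameToStructure'

module IUPAC
  # A single parser instance shared by every caller.
  @@nts = NameToStructure.new

  # Returns the CML for an IUPAC name as a string; raises if parsing fails.
  def read_name(name)
    cml = @@nts.parse_to_cml(name)

    raise "Could not parse '#{name}'." unless cml

    cml.to_xml
  end
end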

The read_name method accepts an IUPAC name as a string and returns a CML document as a string. If the input can't be parsed, an exception is raised.
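Because read_name raises on failure, a script that converts many names may want to collect the failures rather than stop at the first one. The following is a minimal sketch under that assumption. The helper convert_names and the stand-in converter fake_read_name are hypothetical names introduced here so the example runs without OPSIN. It rescues StandardError, which covers the RuntimeError raised by read_name.

```ruby
# Convert many names, collecting failures instead of stopping at the first.
# The block should behave like read_name: return a string on success and
# raise on failure. convert_names is a hypothetical helper.
def convert_names(names)
  results  = {}
  failures = {}

  names.each do |name|
    begin
      results[name] = yield(name)
    rescue StandardError => e
      failures[name] = e.message
    end
  end

  [results, failures]
end

# Stand-in converter so the sketch runs without OPSIN on the classpath.
# Under JRuby with iupac.rb loaded, pass { |name| read_name(name) } instead.
fake_read_name = lambda do |name|
  raise "Could not parse '#{name}'." if name.strip.empty?
  "<cml><!-- #{name} --></cml>"
end

results, failures = convert_names(['benzene', '  ']) { |n| fake_read_name.call(n) }
puts "parsed: #{results.keys.inspect}"
puts "failed: #{failures.keys.inspect}"
```

Returning the failures alongside the results keeps the error messages available for later inspection.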

Testing the Library

We can test the library by saving it as a file called iupac.rb and invoking jirb:

$ jirb
irb(main):001:0> require 'iupac'
=> true
irb(main):002:0> include IUPAC
=> Object
irb(main):003:0> read_name('4-iodobenzoic acid')

This returns the XML shown below, which has been re-formatted for clarity:

<cml xmlns="http://www.xml-cml.org/schema">
  <molecule id="m1">
    <atomArray>
      <atom id="a1" elementType="C">
        <label value="1" />
      </atom>
      <atom id="a2" elementType="C">
        <label value="2" />
      </atom>
      <atom id="a3" elementType="C">
        <label value="3" />
      </atom>
      <atom id="a4" elementType="C">
        <label value="4" />
      </atom>
      <atom id="a5" elementType="C">
        <label value="5" />
      </atom>
      <atom id="a6" elementType="C">
        <label value="6" />
      </atom>
      <atom id="a7" elementType="C" />
      <atom id="a8" elementType="O" />
      <atom id="a9" elementType="O" />
      <atom id="a10" elementType="I">
        <label value="1" />
      </atom>
    </atomArray>
    <bondArray>
      <bond atomRefs2="a1 a2" order="2" />
      <bond atomRefs2="a2 a3" order="1" />
      <bond atomRefs2="a3 a4" order="2" />
      <bond atomRefs2="a4 a5" order="1" />
      <bond atomRefs2="a5 a6" order="2" />
      <bond atomRefs2="a6 a1" order="1" />
      <bond atomRefs2="a7 a1" order="1" />
      <bond atomRefs2="a7 a8" order="2" />
      <bond atomRefs2="a7 a9" order="1" />
      <bond atomRefs2="a10 a4" order="1" />
    </bondArray>
  </molecule>
</cml>

This simple Ruby library has parsed the name '4-iodobenzoic acid' and returned a string containing the CML representation of the molecule. If we had wanted read_name to return a traversable XML object model instead, we could have returned the cml object itself rather than the result of calling to_xml on it.
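Since read_name returns a plain string, the result can also be handed to any Ruby XML parser. The following is a minimal sketch, assuming REXML is available. The helper name element_counts is hypothetical, and the sample document is an abbreviated, made-up fragment in the same shape as the output above. It counts atoms by element symbol.

```ruby
require 'rexml/document'

# Count the <atom> elements in a CML string, grouped by elementType.
# element_counts is a hypothetical helper, not part of OPSIN.
def element_counts(cml_string)
  doc = REXML::Document.new(cml_string)
  counts = Hash.new(0)

  # Walk every descendant element and keep only the atoms.
  doc.root.each_recursive do |element|
    next unless element.name == 'atom'
    counts[element.attributes['elementType']] += 1
  end

  counts
end

# Abbreviated sample in the same shape as the CML shown above.
sample = <<XML
<cml xmlns="http://www.xml-cml.org/schema">
  <molecule id="m1">
    <atomArray>
      <atom id="a1" elementType="C" />
      <atom id="a2" elementType="O" />
      <atom id="a3" elementType="O" />
    </atomArray>
  </molecule>
</cml>
XML

element_counts(sample).each { |symbol, n| puts "#{symbol}: #{n}" }
```

Under JRuby, element_counts(read_name('4-iodobenzoic acid')) applied to the CML listed above would count seven C, two O and one I atom. Hydrogens do not appear because that output does not list them explicitly.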

Conclusions

Whenever a "new" programming language comes up, one objection is raised regardless of the language's merit: "Yeah, but where's the software?" With JRuby, the question doesn't arise. We can draw on the full scope of the massive Java development effort of the last ten years, which includes several excellent cheminformatics libraries. With very little effort, we have a working cheminformatics platform based on a widely used, versatile, dynamic, object-oriented scripting language. Future articles will discuss extensions to this platform and some applications.