Easily Convert IUPAC Nomenclature to SMILES, InChI, or Molfile with Rubidium

Posted by Rich Apodaca Fri, 19 Oct 2007 14:05:00 GMT

A recent article introduced Rubidium, a cheminformatics toolkit written in Ruby. One of Ruby's strengths is the speed with which it enables disparate pieces of code to be glued together - even if they're written in different programming languages. In this article, we'll see how Rubidium can be extended to provide support for converting IUPAC nomenclature into SMILES, InChI, or Molfile formats.

About Rubidium

Rubidium is a cheminformatics toolkit written in Ruby. Rubidium is currently configured to run on JRuby, although future versions may also work with Matz' Ruby Implementation) (MRI) via Ruby Java Bridge.

Rubidium will eventually be packaged as a RubyGem and hosted on RubyForge. For now, the toolkit consists of a running library that will updated and documented on this blog.

The Library

The library extends the CDK module presented in the previous article in this series. The main change is the addition of an IUPACReader class, based on Peter Corbett's excellent OPSIN library:

class IUPACReader
  import 'java.io.StringReader'
  import 'uk.ac.cam.ch.wwmm.opsin.NameToStructure'
  import 'org.openscience.cdk.io.CMLReader'
  import 'org.openscience.cdk.ChemFile'

  def initialize
    @iupac_reader = NameToStructure.new
    @cml_reader = CMLReader.new
  end

  def read name
    cml = @iupac_reader.parse_to_cml(name)

    raise "Could not parse '#{name}'." unless cml

    @cml_reader.set_reader StringReader.new(cml.to_xml)

    chem_file = @cml_reader.read ChemFile.new

    chem_file.chem_sequence(0).chem_model(0).molecule_set.molecule(0)
  end
end

Using this additional functionality requires nothing more than copying the OPSIN jarfile into the lib directory of your JRuby installation. You'll also need to place the CDK jarfile in this directory if you haven't done so already.

The complete Rubidium library can be downloaded here.

A Test

We can test Rubidium's IUPAC nomenclature parsing abilities with jirb. For example, to convert from name to SMILES:

$ jirb
irb(main):001:0> require 'cdk'
=> true
irb(main):002:0> c=CDK::Conversion.new
=> #<CDK::Conversion:0x46ca65 ... >
irb(main):003:0> c.set_formats 'iupac', 'smi'
=> "smi"
irb(main):004:0> c.convert '1,4-dichlorobenzene'
=> "C=1C=C(C=CC=1Cl)Cl"

To convert from name to InChI (in the same jirb session):

irb(main):005:0> c.set_out_format 'inchi'
=> "inchi"
irb(main):006:0> c.convert '1,4-dichlorobenzene'
=> "InChI=1/C6H4Cl2/c7-5-1-2-6(8)4-3-5/h1-4H"

And to convert from name to Molfile (also in the same jirb session):

irb(main):007:0> c.set_out_format 'mol'
=> "mol"
irb(main):008:0> c.convert '1,4-dichlorobenzene'
=> "\n  CDK    10/19/07,7:59\n\n  8  8  0  0  0  0  0  0  0  0999 V2000\n    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    0.0000    0.0000    0.0000 Cl  0  0  0  0  0  0  0  0  0  0  0  0\n    0.0000    0.0000    0.0000 Cl  0  0  0  0  0  0  0  0  0  0  0  0\n  1  2  2  0  0  0  0 \n  2  3  1  0  0  0  0 \n  3  4  2  0  0  0  0 \n  4  5  1  0  0  0  0 \n  5  6  2  0  0  0  0 \n  6  1  1  0  0  0  0 \n  7  1  1  0  0  0  0 \n  8  4  1  0  0  0  0 \nM  END\n"

Conclusions

By re-using a simple conversion API together with another Java library, we've given Rubidium the ability to translate IUPAC nomenclature into other molecular languages. The additional code was both easy to write and easy to test. Future articles will discuss the packaging, distribution, and further elaboration of Rubidium.

JRuby for Cheminformatics: Reading and Writing InChIs Via the Java Native Interface 2

Posted by Rich Apodaca Wed, 10 Oct 2007 12:21:00 GMT

The increased use of the InChI identifier is making the reading and writing of InChIs a standard cheminformatics capability. Recent articles have discussed the advantages of JRuby for cheminformatics. One disadvantage of JRuby is that code written in C can't be directly used. The presents a potential problem for libraries, such as the InChI toolkit, that are written in C. Fortunately, the solution is simple. Today's tutorial will demonstrate how InChIs can be both read and written using the C-InChI toolkit via JRuby and the excellent JNI-InChI library.

About JNI-InChI

The JNI-InChI library, written by Jim Downing and Sam Adams, wraps the C InChI toolkit in a Java Native Interface. This low-level toolkit is suitable for building more complex software, but lacks many features present in the C InChI toolkit. For example, JNI-InChI doesn't directly interconvert SMILES or molfile with InChI. For that you'd need to build a support library. If you're building a toolkit from scratch, this lightweight approach can be a significant advantage.

The JNI-InChI binary distribution jarfile includes the compiled native InChI library. In this sense it's virtually indistinguishable from any other Java library. This simplified packaging makes it exceptionally easy to use JNI-InChI from JRuby, as we'll see below.

Installation

JRuby can be installed as described previously. To install the JNI-InChI library for JRuby, simply copy the current release jarfile into the lib directory of your JRuby installation. That's all there is to it.

A Simple Library

We can now write a simple library to read InChIs via JRuby:

require 'java'

include_class 'net.sf.jniinchi.JniInchiInput'
include_class 'net.sf.jniinchi.JniInchiInputInchi'
include_class 'net.sf.jniinchi.JniInchiWrapper'

module IUPAC
  def read_inchi inchi
    input = JniInchiInputInchi.new inchi

    JniInchiWrapper.getStructureFromInchi input
  end
end

Testing the Library

By saving the above library to a file called iupac.rb, we can parse InChIs via JRuby:

$ jirb
irb(main):001:0> require 'iupac'
=> true
irb(main):002:0> include IUPAC
=> Object
irb(main):003:0> output = read_inchi 'InChI=1/C14H10/c1-3-7-13-11(5-1)9-10-12-6-2-4-8-14(12)13/h1-10H'
=> #
irb(main):004:0> output.num_atoms
=> 14
irb(main):005:0> output.num_bonds
=> 16

Writing InChIs

Because JNI-InChI is a low-level toolkit, writing InChIs is feasible, but not trivial. We must first construct a representation, and then get the InChI for it. For example, we could get the InChI for methane as follows:

$ jirb
irb(main):001:0> require 'java'
=> true
irb(main):002:0> include_class 'net.sf.jniinchi.JniInchiInput'
=> ["net.sf.jniinchi.JniInchiInput"]
irb(main):003:0> include_class 'net.sf.jniinchi.JniInchiAtom'
=> ["net.sf.jniinchi.JniInchiAtom"]
irb(main):004:0> include_class 'net.sf.jniinchi.JniInchiWrapper'
=> ["net.sf.jniinchi.JniInchiWrapper"]
irb(main):005:0> input = JniInchiInput.new
=> #
irb(main):006:0> a1 = input.add_atom JniInchiAtom.new(0,0,0, "C")
=> #
irb(main):007:0> a1.set_implicit_h(4)
=> nil
irb(main):008:0> output = JniInchiWrapper.get_inchi input
=> #
irb(main):009:0> output.get_inchi
=> "InChI=1/CH4/h1H4"

Fortunately, we don't have to work that hard. The Chemistry Development Kit, through JNI-InChI, supports reading and writing of InChIs via a variety of molecular languages, including SMILES and molfile. More on that later, though.

Conclusions

Provided that a Java Native Interface exists for a C library, it can be used from JRuby. Future articles will discuss the use of other cheminformatics libraries written in either C or C++ from JRuby, and their integration with pure Java and Ruby libraries.

Streamlining Cheminformatics on the Web: Let InChI Do the Heavy Lifting and Get Some REST 11

Posted by Rich Apodaca Mon, 01 Oct 2007 14:53:00 GMT

A recent Depth-First article discussed the advantages of minimal Web APIs in Cheminformatics. Recently, Antony Williams unveiled some simplified ChemSpider URL schemes, mainly from the perspective of enabling Google indexing. However, it's possible to take this scheme much, much further. Here I present a proposal for radically simplifying (and unifying) the development of cheminformatics Web APIs and the software that interacts with them.

The New ChemSpider URLs

ChemSpider now has several new kinds of URLs. For the purposes of this article, the most interesting of these are of the format:

These URLs may seem unremarkable, but there's much more than meets the eye. They let anonymous developers query ChemSpider about specific substances - without needing to know much at all about how ChemSpider itself works. Goodbye API. Goodbye API support. Goodbye API documentation. Goodbye angle brackets. Hello to getting stuff done. It's all very RESTful. Well, at least it could be that way with some minor modification.

Some Recommendations

ChemSpider hasn't quite reached that place where the API just disappears. The problem is that the ChemSpider URLs listed above point to query results pages, not compound summary pages. Were these URLs to redirect to a summary page, we could construct the following URLs to extract ChemSpider resources (I've replaced the '=' sign with a '/' for simplicity):

  • .../InChIKey/DEIYFTQMQPDXOT-RERXVCSDCZ Get all resources for the molecule identified by the given InChIKey - i.e., "Compound summary page"

  • .../InChIKey/DEIYFTQMQPDXOT-RERXVCSDCZ/molfile.mol Get the molfile for the molecule identified by the given InChIKey

  • .../InChIKey/DEIYFTQMQPDXOT-RERXVCSDCZ/small_image.png Get the small image for the molecule indentified by the given InChIKey.

  • .../InChIKey/DEIYFTQMQPDXOT-RERXVCSDCZ/large_image.png Get the large image for the molecule identified by the given InChIKey.

  • .../InChIKey/DEIYFTQMQPDXOT-RERXVCSDCZ/citations.xml Get the list of citations for the molecule identified by the given InchIKey, in XML format.

Jane, a developer building Web applications on top of this new ChemSpider API, would immediately notice that things just work. Let's say her online database stores IC50s at the dopamine D2 receptor. On the summary page for each molecule, she wants to link out to the ChemSpider compound summary page, if available. She would simply construct the InChIKey on her server, build the needed ChemSpider URL and GET it. An HTTP 404 would indicate no molecule with that Key exists on ChemSpider and so no link would be shown. An HTTP 200 would indicate ChemSpider has the molecule, and so the link would appear.

Conclusions

It would be interesting enough if ChemSpider adopted a system like that described here. But the real power of this approach would emerge if multiple Web services were to adopt it. By following a simple set of conventions, these services would enable third party developers to elegantly mashup all manner of cheminformatics resources into applications unimaginable today.

Technically, there's nothing that prevents this system from being implemented on every free chemistry database in existence today. However, doing so would transfer a significant degree of control from service operators to third-party developers. Not all providers will be comfortable with that idea.

Cheminformatics Web service providers need to carefully consider whether they're trying to develop a platform or an integrated service. As history has shown, the strategies, and upside potential, for each approach can differ dramatically.

InChI for Newbies

Posted by Rich Apodaca Thu, 27 Sep 2007 12:39:00 GMT

The IUPAC International Chemical Indentifier (InChI) provides an open system for converting molecular representations into strings of text. Because text processing is one of the things computers do very well, InChI serves as an important link between chemistry and computer science.

Unfortunately, the InChI documentation is rather scattered. To help remedy this problem, I have selected some links to Depth-First articles discussing InChI. These articles span a wide range of perspectives. They have been divided into categories for easier navigation, although many contain information that any user of InChI could find useful.

InChI in Context

Hacking InChI

InChI and the Web

InChIMatic

Please feel free to link to your favorite InChI resource in the comments section.

Image Generator Credit: txt2pic.com

Building the Chemically-Aware Web: TotallySynthetic and InChIMatic

Posted by Rich Apodaca Mon, 24 Sep 2007 14:54:00 GMT

Recent D-F articles have discussed InChIMatic, a Web application that lets you search the Web for chemical structures by simply drawing them. InChIMatic takes advantage of InChI, a system for representing molecular structures as a strings of text, and Google, which indexes these text strings. In this article, I'll show InChIMatic in action as it quickly finds a molecule discussed in a review of Overman's Sarain A synthesis appearing in Paul Docherty's TotallySynthetic blog.

You Can Skip this Step

The TotallySynthetic review lists three InChIs at the bottom, but which structures, out of the many discussed, do these represent? We need to know so that we can enter these structures into InChIMatic. This is, of course a step only needed because we're testing the system, not because we're using the system the way it was designed to be used.

A recent D-F article discussed a method for converting InChIs into 2D structures using Ruby. It has the advantage of being easily adaptable to building chemically-aware Web spiders. And it's 100% Open Source.

Running this library over TotallySynthetic's InChIs yields the three images above. Notice, we have some problems. The first and third images lack stereochemistry. The second has a trans- double bond instead of the cis- stereochemistry encoded by the InChI. There are good reasons for each of these problems, which I hope to address in later articles. For now, it's sufficient that we can clearly make the connection between the TotallySynthetic InChIs and structures in the Sarain A review.

Run the Search

We can test this system by pointing our browser to inchimatic.com. Entering one of the structures and clicking "Search" takes us directly to a link for the TotallySynthetic site, courtesy of Google. Unfortunately, the link doesn't currently point to the article itself. This issue may resolve itself as the Googlebot continues to index the TotallySynthetic site.

A Technical Note

If you spend any time working with InChIs, you'll notice that they're very long. So long, in fact, that they break many Web page layouts. There have been many attempts to fix the long-InChI problem, but Paul may have found the answer by trying the simplest thing that could possibly work.

If you inspect the HTML source for the TotallySynthetic article, you'll find that Paul has inserted hard returns (br elements) to manually break his InChIs, including the one we just located with InChIMatic (first in the list) the first and last structures above, both of which can be found with InChIMatic:

<p><small>InChI=1/C29H33NO4Si/c1-5-32-28(31)26-25(34-27(30-26)22-15-9-6-10-16-22)21-33-35(29(2,3)4,23-17-11-7-12-18-23)24-19-13-8-14-20-24<br />
/h6-20,25-26H,5,21H2,1-4H3/t25-,26-/m0/s1 InChI=1/C18H25NO6S/c1-14-9-11-15(12-10-14)26(22,23)19(17(21)25-18(2,3)4)13-7-6-8-16(20)24-5/h6,8-12H,7,13H2,1-5H3/b8-6- InChI=1/C47H58N2O10SSi/c1-10-56-43(51)47(36(32-41(50)55-9)30-31-49(44(52)59-45(3,4)5)60(53,54)37-28-26-34(2)27-29-37)40<br />

(58-42(48-47)35-20-14-11-15-21-35)33-57-61(46(6,7)8,38-22-16-12-17-23-38)39-24-18-13-19-25-39/h11-29,36,40H,10,30-33H2,1-9H3<br />
/t36-,40-,47-/m0/s1</small>
</p>

In other words, fixing the long InChI/Google indexing problem may be as simple as just inserting br elements when needed. More on this later, though.

Conclusions

This article has shown a working demonstration that uses free tools to build self-organizing, highly distributed, searchable chemical databases. Although the system is far from perfect, it does provide a glimpse at what can be done right now with relatively little effort. Starting with this basic idea, we can begin to think about a variety of fast, free, user-friendly services that make finding molecules on the Web, and publishing their wherabouts, as easy as using Google and WordPress. But that's a story for another time.

Older posts: 1 2 3 4 ... 8