Easily Convert IUPAC Nomenclature to SMILES, InChI, or Molfile with Rubidium
A recent article introduced Rubidium, a cheminformatics toolkit written in Ruby. One of Ruby's strengths is the speed with which it enables disparate pieces of code to be glued together - even if they're written in different programming languages. In this article, we'll see how Rubidium can be extended to provide support for converting IUPAC nomenclature into SMILES, InChI, or Molfile formats.
About Rubidium
Rubidium is a cheminformatics toolkit written in Ruby. Rubidium is currently configured to run on JRuby, although future versions may also work with Matz' Ruby Implementation (MRI) via Ruby Java Bridge.
Rubidium will eventually be packaged as a RubyGem and hosted on RubyForge. For now, the toolkit consists of a running library that will be updated and documented on this blog.
The Library
The library extends the CDK module presented in the previous article in this series. The main change is the addition of an IUPACReader class, based on Peter Corbett's excellent OPSIN library:
class IUPACReader
  import 'java.io.StringReader'
  import 'uk.ac.cam.ch.wwmm.opsin.NameToStructure'
  import 'org.openscience.cdk.io.CMLReader'
  import 'org.openscience.cdk.ChemFile'

  def initialize
    @iupac_reader = NameToStructure.new
    @cml_reader = CMLReader.new
  end

  def read name
    cml = @iupac_reader.parse_to_cml(name)
    raise "Could not parse '#{name}'." unless cml
    @cml_reader.set_reader StringReader.new(cml.to_xml)
    chem_file = @cml_reader.read ChemFile.new
    chem_file.chem_sequence(0).chem_model(0).molecule_set.molecule(0)
  end
end
Using this additional functionality requires nothing more than copying the OPSIN jarfile into the lib directory of your JRuby installation. You'll also need to place the CDK jarfile in this directory if you haven't done so already.
The complete Rubidium library can be downloaded here.
A Test
We can test Rubidium's IUPAC nomenclature parsing abilities with jirb. For example, to convert from name to SMILES:
$ jirb
irb(main):001:0> require 'cdk'
=> true
irb(main):002:0> c=CDK::Conversion.new
=> #<CDK::Conversion:0x46ca65 ... >
irb(main):003:0> c.set_formats 'iupac', 'smi'
=> "smi"
irb(main):004:0> c.convert '1,4-dichlorobenzene'
=> "C=1C=C(C=CC=1Cl)Cl"
To convert from name to InChI (in the same jirb session):
irb(main):005:0> c.set_out_format 'inchi'
=> "inchi"
irb(main):006:0> c.convert '1,4-dichlorobenzene'
=> "InChI=1/C6H4Cl2/c7-5-1-2-6(8)4-3-5/h1-4H"
And to convert from name to Molfile (also in the same jirb session):
irb(main):007:0> c.set_out_format 'mol'
=> "mol"
irb(main):008:0> c.convert '1,4-dichlorobenzene'
=> "\n CDK 10/19/07,7:59\n\n 8 8 0 0 0 0 0 0 0 0999 V2000\n 0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n 0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n 0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n 0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n 0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n 0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n 0.0000 0.0000 0.0000 Cl 0 0 0 0 0 0 0 0 0 0 0 0\n 0.0000 0.0000 0.0000 Cl 0 0 0 0 0 0 0 0 0 0 0 0\n 1 2 2 0 0 0 0 \n 2 3 1 0 0 0 0 \n 3 4 2 0 0 0 0 \n 4 5 1 0 0 0 0 \n 5 6 2 0 0 0 0 \n 6 1 1 0 0 0 0 \n 7 1 1 0 0 0 0 \n 8 4 1 0 0 0 0 \nM END\n"
Conclusions
By re-using a simple conversion API together with another Java library, we've given Rubidium the ability to translate IUPAC nomenclature into other molecular languages. The additional code was both easy to write and easy to test. Future articles will discuss the packaging, distribution, and further elaboration of Rubidium.
JRuby for Cheminformatics: Parsing IUPAC Nomenclature with OPSIN
Recent articles have discussed the use of JRuby for cheminformatics. We've seen how to parse SMILES strings, and read or write InChIs. In this article, we'll see how easy it is to parse IUPAC nomenclature from JRuby using Peter Corbett's OPSIN library.
Installation
After installing JRuby, simply download the OPSIN jarfile and copy it to your JRuby lib directory. You're done.
A Simple Library
We can write a simple library to convert an IUPAC name into a CML document:
require 'jruby'
import 'uk.ac.cam.ch.wwmm.opsin.NameToStructure'
module IUPAC
  @@nts = NameToStructure.new

  def read_name name
    cml = @@nts.parse_to_cml(name)
    raise "Could not parse '#{name}'." unless cml
    cml.to_xml
  end
end
The read_name method accepts an IUPAC name as a string and returns a CML document as a string. If the input can't be parsed, an exception is raised.
Testing the Library
We can test the library by saving it as a file called iupac.rb and invoking jirb:
$ jirb
irb(main):001:0> require 'iupac'
=> true
irb(main):002:0> include IUPAC
=> Object
irb(main):003:0> read_name('4-iodobenzoic acid')
This returns the XML shown below, which has been re-formatted for clarity:
<cml xmlns="http://www.xml-cml.org/schema">
<molecule id="m1">
<atomArray>
<atom id="a1" elementType="C">
<label value="1" />
</atom>
<atom id="a2" elementType="C">
<label value="2" />
</atom>
<atom id="a3" elementType="C">
<label value="3" />
</atom>
<atom id="a4" elementType="C">
<label value="4" />
</atom>
<atom id="a5" elementType="C">
<label value="5" />
</atom>
<atom id="a6" elementType="C">
<label value="6" />
</atom>
<atom id="a7" elementType="C" />
<atom id="a8" elementType="O" />
<atom id="a9" elementType="O" />
<atom id="a10" elementType="I">
<label value="1" />
</atom>
</atomArray>
<bondArray>
<bond atomRefs2="a1 a2" order="2" />
<bond atomRefs2="a2 a3" order="1" />
<bond atomRefs2="a3 a4" order="2" />
<bond atomRefs2="a4 a5" order="1" />
<bond atomRefs2="a5 a6" order="2" />
<bond atomRefs2="a6 a1" order="1" />
<bond atomRefs2="a7 a1" order="1" />
<bond atomRefs2="a7 a8" order="2" />
<bond atomRefs2="a7 a9" order="1" />
<bond atomRefs2="a10 a4" order="1" />
</bondArray>
</molecule>
</cml>
This simple Ruby library has parsed the name '4-iodobenzoic acid' and has returned a string containing the CML representation for the molecule. If we had wanted the read_name method to return a traversable XML object model, we could have enabled that as well.
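As a rough illustration of working with the returned string, the sketch below tallies atoms by element in a CML document. The count_elements helper and the abbreviated sample document are hypothetical and not part of OPSIN or the library above. A regular expression is used only to keep the sketch free of dependencies; an XML parser would be the more robust choice.

```ruby
# Count atoms by element symbol in a CML string.
# Hypothetical helper for illustration; not part of OPSIN or the IUPAC module.
# A regex keeps the sketch dependency-free; a real application would more
# likely use an XML parser.
def count_elements(cml)
  counts = Hash.new(0)
  cml.scan(/<atom\b[^>]*\belementType="([A-Za-z]+)"/).flatten.each do |symbol|
    counts[symbol] += 1
  end
  counts
end

# Abbreviated sample in the same shape as the output shown above.
cml = <<~XML
  <molecule id="m1">
    <atomArray>
      <atom id="a1" elementType="C" />
      <atom id="a2" elementType="O" />
      <atom id="a3" elementType="O" />
    </atomArray>
  </molecule>
XML

puts count_elements(cml).inspect
```

Note that the CML shown above lists only heavy atoms, so a tally like this would not include hydrogens.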
Conclusions
One of the objections raised whenever the issue of "new" programming languages comes up, regardless of their merit, is the age-old refrain "Yeah, but where's the software?" With JRuby, we bypass this question altogether. We can leverage the full scope of the massive Java development effort over the last ten years, which includes several excellent cheminformatics libraries. With virtually no effort, we have a working cheminformatics platform based on a widely-used, versatile and dynamic object-oriented scripting language. Future articles will discuss extensions to this platform and some applications.
Eleven Qualities of The Perfect Line Notation for the Web
If you had to design the perfect line notation for the Web, what would it look like? This is hardly an academic exercise given the central role played by line notations in information systems. For a variety of reasons, existing line notations may not be the right match for the Web. This article explores this question and outlines the main qualities needed by a Web-friendly line notation.
A Few Lines About Line Notations
A line notation is any system that converts a molecular structure into a single line of text. Chemists have been using line notations for over 140 years - long before the advent of computers. Because of their versatility, line notations are frequently used in situations they were not designed for. When this happens, limitations become apparent, resulting in renewed efforts to build a better system.
As noted previously, the invention of new line notations is a field whose popularity ebbs and flows over time. Currently, the three most important line notations are:
- IUPAC Nomenclature
- Simplified Molecular Input Line Entry System (SMILES)
- IUPAC International Chemical Identifier (InChI)
Each of these systems has its own unique characteristics. IUPAC nomenclature is the oldest and most widely-used line notation. It appears in numerous contexts, including Web pages, peer-reviewed journals, reports, patents, MSDS sheets, catalogs, and reagent bottles. By comparison, SMILES is a distant second in popularity. Its main role has been to facilitate machine entry of structural information by humans, like this. InChI is the newest of the bunch. It serves both as a line notation and as a unique identifier requiring no central authority.
The Perfect Line Notation for the Web
The emergence of the Web as a standard information delivery platform has refocused the attention of many developers on the line notation problem. With this idea in mind, here are some guesses about the qualities of the ideal Web-friendly line notation.
Readily Encodable and Decodable by Humans. There's something unnerving about a line notation that can't easily be deciphered by humans. Is this really the right string? Did I copy it completely? This problem surfaces with every line notation, but some fare better than others. IUPAC nomenclature, for example, is one of the first things taught in many beginning organic chemistry classes. It's complicated, but still understandable by non-experts.
Readily Encodable and Decodable by Machines. It may be relatively simple for humans to read and write IUPAC nomenclature, but not so for machines. Software that reads and writes SMILES, on the other hand, is by comparison easy to write. This explains the abundance of software packages that handle SMILES and the scarcity of those that handle IUPAC nomenclature.
Uses URI-Safe Characters Only. A URI uniquely identifies every document on the Internet. Why can't a line notation be used in combination with a URI to uniquely identify every molecule? One reason is that every line notation currently in use contains characters unsafe for use in URIs. Any line notation designed for use on the Web needs to avoid these characters in its syntax. Update: InChI doesn't use unsafe characters, but it does use the reserved characters "=", "?", and "/". These characters may therefore need to be escaped, depending on the context.
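To make the escaping issue concrete, here is a minimal Ruby sketch that percent-encodes an InChI for use as a URI query value, using the standard library's URI module. The InChI is the 1,4-dichlorobenzene string from earlier on this page; treating the whole string as a single form value is an assumption made for illustration.

```ruby
require 'uri'

inchi = 'InChI=1/C6H4Cl2/c7-5-1-2-6(8)4-3-5/h1-4H'

# Percent-encode characters that carry special meaning in URIs, such as
# "=" and "/", so the string can be used as a query value.
escaped = URI.encode_www_form_component(inchi)
puts escaped

# Decoding recovers the original string unchanged.
puts URI.decode_www_form_component(escaped) == inchi
```

The escaped form is safe to embed, but it is longer and harder to read, which works against compactness and human readability.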
Encodes All Molecules. Buried within every line notation is an opinion on what chemistry is really about. To operate on the Web, these opinions need to be as closely aligned as possible with those of chemists themselves. Several Depth-First articles have discussed the limitations of existing line notations as molecular languages.
Compact. Nobody wants to look at or manipulate a line of text that's longer than it needs to be. Of course, the more expressive a line notation is, the more verbose it will be. In other words, qualities 4 and 5 will always be in conflict.
Canonicalizable. A line notation supports canonicalization when it specifies rules that can be guaranteed to always generate the same line notation for a given molecule. This feature enables many labor-saving assumptions. For example, a canonical representation makes a great identifier in a database, reducing the cost of storing and retrieving structural information.
Explicit Hydrogen Atom Encoding. SMILES makes few requirements regarding hydrogen atom encoding. As a result, each software implementation is left to its own devices. The resulting confusion is the price paid for the convenience (Quality 1) of a compact notation (Quality 5).
Hierarchical Structure. One of InChI's innovations was the introduction of a hierarchical encoding system. This system, also referred to as InChI "layers", enables a molecule to be viewed at several levels of resolution: as a molecular formula; as a network of atoms; as a network of atoms containing hydrogen atoms; as an atomic network with stereochemistry; and so on. I'm unaware of any reports in which this feature has been exploited in a practical way, although they aren't difficult to imagine.
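A small sketch of how the layers might be pulled apart in Ruby follows. The inchi_layers helper is hypothetical. It assumes the formula is the second slash-separated field and that each later layer starts with a one-letter prefix that appears only once, which holds for simple InChIs such as the one used here but not for every InChI.

```ruby
# Split an InChI into its slash-separated layers.
# Hypothetical helper for illustration. Assumes the formula is the second
# field and that each later layer has a unique one-letter prefix.
def inchi_layers(inchi)
  version, formula, *rest = inchi.split('/')
  layers = { 'version' => version.sub('InChI=', ''), 'formula' => formula }
  rest.each { |layer| layers[layer[0]] = layer[1..-1] }
  layers
end

layers = inchi_layers('InChI=1/C6H4Cl2/c7-5-1-2-6(8)4-3-5/h1-4H')
puts layers['formula']  # molecular formula layer
puts layers['c']        # connectivity layer
puts layers['h']        # hydrogen layer
```

Comparing two molecules at a coarser level of resolution then amounts to comparing only some of these entries, for example the formula and connectivity layers.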
Flat Structure. By grouping structural features into layers (Quality 8), InChI introduces a lot of complexity that is absent in SMILES and even IUPAC nomenclature. This complexity, in part, makes it difficult for both humans and machines to properly encode InChIs (Qualities 1 and 2). Given this complexity, and the fact that the utility of hierarchical encoding has yet to be conclusively demonstrated, it may be better to avoid it.
Open Source Software Implementation. No encoding standard in today's world stands a chance of gaining acceptance without an open source reference implementation. InChI broke new ground in this area and should serve as a model for any system that follows.
Unencumbered by Patents. The success of molfile and SMILES as de facto standards derives partly from the decision made by their authors to refrain from patenting their languages. As a result, developers are motivated to build their own implementations, rather than invent yet another language.
Conclusions
A robust and modern line notation system is a key technology for chemically enabling the Web. Existing line notations, although useful in many contexts, were not designed with this particular role in mind. The time has come to consider whether a new line notation system, designed specifically with the Web and modern chemistry in mind, might offer a better solution.
From IUPAC Name to Molecular Formula with Ruby CDK
Recently, a question was raised on the Yahoo cheminf group list regarding the conversion of IUPAC names into molecular formulas. This can be done quickly with Ruby CDK, as this article will show.
Prerequisites
This tutorial requires Ruby CDK, which in turn requires Ruby Java Bridge (RJB). A recent Depth-First article described the minimal system configuration required to run RJB on Linux. Another article showed how to install RJB on Windows.
A Small Library
The following library will convert IUPAC nomenclature into molecular formulas with Ruby:
require 'rubygems'
require_gem 'rcdk'
require 'rcdk'
require 'rcdk/util'
module Formulator
  @@hydrogen_adder = Rjb::import('org.openscience.cdk.tools.HydrogenAdder').new

  def get_formula(iupac_name)
    mol = RCDK::Util::Lang.read_iupac iupac_name
    @@hydrogen_adder.addExplicitHydrogensToSatisfyValency mol
    analyzer = Rjb::import('org.openscience.cdk.tools.MFAnalyser').new(mol)
    analyzer.getMolecularFormula
  end
end
Save this code as a file named formulator.rb in your working directory.
Testing the Library
The Formulator library can be tested with the following code:
require 'formulator'
include Formulator
get_formula 'benzene' # => "C6H6"
get_formula '4-(3,4-dichlorophenyl)-N-methyl-1,2,3,4-tetrahydronaphthalen-1-amine' # => "C17H17NCl2"
Limitations
You may run across classes of structures that are not recognized by Ruby CDK. This is due to limitations of the underlying OPSIN library. For example, OPSIN does not yet recognize fused heterocycle names such as 'imidazo[2,1-b][1,3]thiazole'.
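When converting many names, it can help to collect the failures rather than stop at the first one. The sketch below assumes that an unrecognized name surfaces as a Ruby exception derived from StandardError; whether Ruby CDK behaves this way is not confirmed here. The convert_all helper, the KNOWN table, and the lookup lambda are all hypothetical stand-ins so that the example runs without Ruby CDK. In practice the block might simply call get_formula.

```ruby
# Apply a name-to-formula converter to several names, recording failures
# instead of stopping at the first one. The converter is passed as a block
# so this sketch runs without Ruby CDK.
def convert_all(names)
  results = { converted: {}, failed: [] }
  names.each do |name|
    begin
      results[:converted][name] = yield(name)
    rescue StandardError
      results[:failed] << name
    end
  end
  results
end

# Stand-in converter for illustration only: a lookup table that raises on
# unknown names, mimicking a parser that rejects some input.
KNOWN = { 'benzene' => 'C6H6' }.freeze
lookup = ->(name) { KNOWN.fetch(name) { raise "Could not parse '#{name}'." } }

names = ['benzene', 'imidazo[2,1-b][1,3]thiazole']
results = convert_all(names) { |name| lookup.call(name) }
puts results[:converted].inspect
puts results[:failed].inspect
```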
Conclusions
Ruby CDK makes short work of converting IUPAC names into molecular formulas. This is just one example of the kind of conversion that's possible. For example, a recent article discussed the conversion of IUPAC names to color 2-D structures.
Due to Ruby's position as both a highly functional scripting language and as the foundation for the popular Web application framework Ruby on Rails, a variety of IUPAC nomenclature translation applications are just a few lines of code away.
Google for Molecules with InChIMatic

InChIMatic is a simple Web application that uses Google to perform exact structure searches on the Web. After drawing your structure in the editor window, click the "InChI!" button to get a link. This link takes you to a Google query that displays matches for your molecule. You'll need both Java and JavaScript enabled in your browser to use InChIMatic.
The Technical Details
The technology at the heart of InChIMatic is the IUPAC International Chemical Identifier (InChI). An InChI is an alphanumeric string that uniquely identifies a molecular structure. By converting molecular structures to text, InChI makes it easy to use standard Internet tools to do exact structure searches.
The earliest reference in the peer-reviewed literature to using Google for searching InChIs is contained in a 2005 paper. More recently, a service called QueryChem has taken this idea one step further by using the Google API to perform substructure searches based on InChI.
InChIMatic works differently. Unlike a raw Google search, InChIMatic builds a Google query link for you. Unlike QueryChem, InChIMatic doesn't use the Google API and so has none of its restrictions. This does result in a limitation: InChIMatic can currently be used only for exact structure queries.
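The idea of building a query link can be sketched in a few lines of Ruby. This is not InChIMatic's actual code: the search_link helper, the base URL, the q parameter, and the use of double quotes to request an exact match are all assumptions made for illustration.

```ruby
require 'uri'

# Build a search link for an exact-match query on an InChI string.
# The base URL and the "q" parameter are assumptions for illustration;
# they are not taken from InChIMatic's source.
def search_link(inchi, base = 'https://www.google.com/search')
  query = URI.encode_www_form('q' => %("#{inchi}"))
  "#{base}?#{query}"
end

link = search_link('InChI=1/C6H4Cl2/c7-5-1-2-6(8)4-3-5/h1-4H')
puts link
```

Because the link is just a string, no API key or server-side search call is needed, which is consistent with the absence of API restrictions described above.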
The InChIMatic Web application has been discussed in greater technical detail in a previous article. The rapid Web application development framework Ruby on Rails made building InChIMatic a snap. InChIMatic is served by the Ruby application container Mongrel, which is hosted on a Linux server running Apache. Rino provided the Ruby interface to the IUPAC/NIST InChI toolkit. The 2-D structure editor is Java Molecular Editor (JME) by Peter Ertl, which is used with his kind permission.
Aside from JME, all components of InChIMatic, from the operating system it runs on to the InChI system itself, are Open Source software.
Using InChI to Raise the Visibility of Your Content
InChIMatic returns many Google results for common molecules. But less common, known molecules return no hits at all. Three factors are responsible: (1) Google doesn't index all InChIs on the Internet; (2) few content providers currently use InChI; and (3) there is no standard and convenient mechanism to embed InChIs into Web pages for indexing by Google.
For these reasons, I consider InChI to be bleeding edge technology. Some will find it useful, most will not. Unfortunately, this state of affairs will persist until problems (1) and (3) are solved.
Nevertheless, if you're technically adventurous, InChIMatic offers a relatively painless way to begin incorporating InChIs into your content and verifying that they get indexed. There's no software to download, install, or upgrade. Forget about operating system incompatibilities (hopefully!). Just point your Java-enabled browser to inchimatic.com.
Although there's no standard method to encode InChIs in Web pages, some interesting ideas have been put forward. Egon Willighagen has proposed a system based on RDFa. Future iterations of InChIMatic may include support for generating scripts and/or markup for including InChIs into blogs and other online content.
Conclusions
InChI is a complex new technology in need of easy-to-use tools. InChIMatic is one such tool that makes it possible to perform exact structure queries using Google.
One of the exciting things about Web applications is how quickly they can evolve. If in trying out InChIMatic you find something you'd like changed or added, please feel free to write me.

