From SMILES to InChI with OBRuby
SMILES and InChI are two commonly used molecular line notations. Although each has its advantages and limitations, the novelty of InChI and the ubiquity of SMILES make the SMILES to InChI conversion especially useful. Many of the situations in which the need for this conversion will arise are particularly well-suited for the Ruby programming language. A recent article described how RCDK and Rino could be used to accomplish this conversion. This article will show how Open Babel can be used from Ruby to effect the same conversion.
OBRuby
OBRuby is a SWIG-generated Ruby interface to the Open Babel library. Although OBRuby doesn't expose all aspects of the Open Babel API, nearly everything that can be done in C++ Open Babel can now be done in Ruby. For example, all OBConversion permutations should be available, including SMILES to InChI.
A Small Ruby Library
Let's create a small Ruby library for converting SMILES strings into InChI identifiers. Save the following into a file called convert.rb:
require 'openbabel'
class Convertor
  def initialize
    @conv = OpenBabel::OBConversion.new
    @conv.set_in_and_out_formats('smi', 'inchi')
  end
  def get_inchi(smiles)
    mol = OpenBabel::OBMol.new
    @conv.read_string(mol, smiles)
    @conv.write_string(mol)
  end
end
There's nothing tricky here. We've simply created a Ruby class that makes the SMILES to InChI conversion as simple as one method call to an instance.
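Converting more than one SMILES string at a time is a small step from here. The following is only a minimal sketch and not part of OBRuby: convert_all is a hypothetical helper that accepts any object responding to get_inchi, such as a Convertor instance. It strips surrounding whitespace on the assumption that the string returned by write_string may end in a newline. A stand-in converter with canned output is used so that the sketch runs without Open Babel installed.

```ruby
# Hypothetical helper (not part of OBRuby): maps each SMILES string to
# the InChI returned by any object that responds to get_inchi.
def convert_all(smiles_list, converter)
  smiles_list.each_with_object({}) do |smiles, results|
    inchi = converter.get_inchi(smiles).to_s.strip
    # Record nil when the converter produced no output for this SMILES.
    results[smiles] = inchi.empty? ? nil : inchi
  end
end

# Stand-in for Convertor so the sketch runs without Open Babel.
# It returns canned strings; it performs no real conversion.
class FakeConvertor
  CANNED = { 'C' => "InChI=1/CH4/h1H4\n" }

  def get_inchi(smiles)
    CANNED.fetch(smiles, '')
  end
end

results = convert_all(['C', 'not-smiles'], FakeConvertor.new)
results.each { |smiles, inchi| puts "#{smiles} => #{inchi.inspect}" }
```

With OBRuby installed, passing Convertor.new in place of FakeConvertor.new should be all that changes.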
Testing the Library
A good way to test this library is through Interactive Ruby (irb). For example, to find the InChI of caffeine:
require 'convert'
c = Convertor.new
puts c.get_inchi('Cn1cnc2c1c(=O)n(C)c(=O)n2C') # caffeine
# => InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3
Chiral SMILES
I applied this simple Ruby conversion library to the (S)-methamphetamine record in PubChem:
- Isomeric SMILES: C[C@@H](CC1=CC=CC=C1)NC
- PubChem InChI: InChI=1/C10H15N/c1-9(11-2)8-10-6-4-3-5-7-10/h3-7,9,11H,8H2,1-2H3/t9-/m0/s1
My results were:
- Isomeric SMILES: C[C@@H](CC1=CC=CC=C1)NC
- OBRuby InChI: InChI=1/C10H15N/c1-9(11-2)8-10-6-4-3-5-7-10/h3-7,9,11H,8H2,1-2H3/t9-/m1/s1
As you can see, there is a discrepancy in the two stereo layers ('m0' vs. 'm1'). The same InChI is generated by Open Babel using either OBRuby or the Worldwide Molecular Matrix. Substituting the SMILES string representing the opposite configuration at carbon generates the InChI with opposite configuration (R), which again is opposite to that of (R)-methamphetamine in PubChem.
At this point, it is unclear whether Open Babel or PubChem is producing the correct InChI for the methamphetamine enantiomers. I suspect Open Babel is correct. By creating a molfile of (S)-methamphetamine with JME and running cInChI over it, I got the same output as with the Open Babel conversions. I've found similar differences between PubChem and Open Babel InChIs in every chiral molecule I've looked at.
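Spotting a one-character difference between two long identifiers by eye is error-prone. The layer-by-layer comparison described above can be sketched in plain Ruby. The helper names inchi_layers and differing_layers are made up for this illustration, and the parsing is deliberately simplified: it assumes that every layer prefix appears only once, which holds for the identifiers shown here but not for every InChI.

```ruby
# Splits an InChI string into its layers. The first two fields
# (version and formula) have no prefix letter, so they are keyed
# as 'version' and 'formula'; later layers are keyed by first letter.
def inchi_layers(inchi)
  fields = inchi.sub(/\AInChI=/, '').split('/')
  layers = { 'version' => fields[0], 'formula' => fields[1] }
  fields.drop(2).each { |field| layers[field[0, 1]] = field[1..-1] }
  layers
end

# Returns the keys of the layers whose contents differ.
def differing_layers(inchi_a, inchi_b)
  a = inchi_layers(inchi_a)
  b = inchi_layers(inchi_b)
  (a.keys | b.keys).reject { |key| a[key] == b[key] }
end

pubchem = 'InChI=1/C10H15N/c1-9(11-2)8-10-6-4-3-5-7-10/h3-7,9,11H,8H2,1-2H3/t9-/m0/s1'
obruby  = 'InChI=1/C10H15N/c1-9(11-2)8-10-6-4-3-5-7-10/h3-7,9,11H,8H2,1-2H3/t9-/m1/s1'
puts differing_layers(pubchem, obruby).inspect  # => ["m"]
```

Only the 'm' layer differs, which matches the 'm0' vs. 'm1' discrepancy noted above.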
Conclusions
The conversion of SMILES, and other molecular languages, into InChI identifiers can be expected to become a recurring need as the popularity of InChI increases. Combining the formidable translation capabilities of Open Babel with the comfort and convenience of Ruby offers a powerful new technique for doing so.
OBRuby: A Ruby Interface to Open Babel

And the LORD said, Behold, the people is one, and they have all one language; and this they begin to do: and now nothing will be restrained from them, which they have imagined to do.
-Genesis 11:6
Open Babel is a widely used Open Source chemical informatics toolkit written in C++. Although originally designed as a molecular language translator, Open Babel also supports SMARTS pattern recognition, molecular fingerprints, molecular superposition, and other features.
Open Babel currently offers interfaces for two scripting languages: Python and Perl. Recently, Geoff Hutchison and I have been working to add Ruby to that list. This article reports our success in doing so and provides a glimpse of what might now be possible.
OBRuby
The upcoming release of Open Babel (version 2.1.0) will come complete with a Ruby interface. For those interested in trying it out sooner, a package called OBRuby can be downloaded now. OBRuby compiles against revision 1577 of the Open Babel SVN trunk. It has been tested with Linux and Mac OS X, and will probably work on Windows with minor modifications. The approach outlined here is known to fail with Open Babel 2.0.2.
OBRuby is a technology demonstration. The Ruby scripting support included with Open Babel 2.1.0 may differ in some details from OBRuby. My purpose in this article is simply to demonstrate what is now possible. Please read through the install scripts (they're short) to be sure you're comfortable with what they do.
Here was my OBRuby installation process:
- Download the Open Babel SVN trunk revision 1577 or later.
- cd trunk
- configure, make, (as root) make install
- (as root) ldconfig (necessary on my system - perhaps not on yours)
- cd OBRUBY_DIR
- ruby build.rb
- (as root) make install
One last wrinkle: the build.rb script included with OBRuby is something of a hack. It hardcodes the location of the Open Babel library on line 6:
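@@ob_dir='/usr/local'
If Open Babel is installed somewhere else on your system, change this line to match. Once OBRuby is installed, it can be tested from irb:
$ irb
irb(main):001:0> require 'openbabel'
=> true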
A return value of true shows that the installation was successful. An error message about libopenbabel.so not being found indicates that your system can't find your Open Babel libraries. Be sure you've installed Open Babel and either run ldconfig or set LD_LIBRARY_PATH.
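The same check can be scripted rather than typed into irb. This is a minimal sketch; library_available? is a hypothetical helper name, and it relies only on the fact that Ruby raises LoadError when a library or its shared objects cannot be loaded.

```ruby
# Tries to load a library and reports whether it succeeded.
# Returns true if the library loaded, false otherwise.
def library_available?(name)
  require name
  true
rescue LoadError => e
  warn "Could not load '#{name}': #{e.message}"
  false
end

if library_available?('openbabel')
  puts 'OBRuby is installed.'
else
  puts 'OBRuby not found. Check that Open Babel is installed and that'
  puts 'ldconfig has been run or LD_LIBRARY_PATH is set.'
end
```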
The majority of OBRuby was autogenerated by SWIG. A future article will detail how this was done - with an eye toward developing a Java interface to Open Babel.
Building an OBMol From SMILES
With installation out of the way, let's fire up OBRuby and take her for a test drive. The following code can either be entered with IRB or saved to a file and executed with the ruby interpreter:
require 'openbabel'
include OpenBabel
smi2mol = OBConversion.new
smi2mol.set_in_format("smi")
mol = OBMol.new
smi2mol.read_string(mol, 'CC(C)CCCC(C)C1CCC2C1(CCC3C2CC=C4C3(CCC(C4)O)C)C') # cholesterol, no chirality
mol.add_hydrogens
puts "Cholesterol has #{mol.num_atoms} atoms, including hydrogens."
puts "Its molecular weight is #{mol.get_mol_wt} and its molecular formula is #{mol.get_formula}."
SMARTS Matching
One of the most useful features of Open Babel is its SMARTS pattern matching capability. This can conveniently be accessed from OBRuby by first instantiating an OBSmartsPattern, passing the SMARTS pattern of interest to the instance's init method, matching the pattern against a molecule, and retrieving the hit set:
require 'openbabel'
include OpenBabel
smi2mol = OBConversion.new
smi2mol.set_in_format("smi")
mol = OBMol.new
smiles = 'CC(C)CCCC(C)C1CCC2C1(CCC3C2CC=C4C3(CCC(C4)O)C)C' # cholesterol, no chirality
smi2mol.read_string(mol, smiles)
mol.add_hydrogens
pattern=OBSmartsPattern.new
smarts = 'C1CCCCC1'
pattern.init(smarts)
pattern.match(mol)
hits = pattern.get_umap_list # => indices of two cyclohexane rings
puts "Found #{hits.size} instances of the SMARTS pattern '#{smarts}' in the SMILES string #{smiles}. Here are the atom indices:"
hits.each_with_index do |hit, index|
  print "Hit #{index}: [ "
  hit.each do |atom_index|
    print "#{atom_index} "
  end
  puts "]"
end
Found 2 instances of the SMARTS pattern 'C1CCCCC1' in the SMILES string CC(C)CCCC(C)C1CCC2C1(CCC3C2CC=C4C3(CCC(C4)O)C)C. Here are the atom indices:
Hit 0: [ 12 17 16 15 14 13 ]
Hit 1: [ 20 25 24 23 22 21 ]
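A symmetric pattern such as a six-membered ring can be laid onto the same six atoms in several ways, which is why only two hits are reported rather than many. The "u" in get_umap_list stands for unique. Assuming that unique means "covers a different set of atoms", the idea can be sketched in plain Ruby. The raw match data below is made up for illustration, and unique_hits is a hypothetical helper, not an Open Babel call.

```ruby
# Keeps one match per distinct set of atoms. Two matches that list the
# same atoms in a different order are treated as the same hit.
def unique_hits(matches)
  matches.uniq { |match| match.sort }
end

# Made-up raw matches for illustration: the first ring hit from the
# output above, listed in three different orders, plus the second
# ring hit.
ring_a = [12, 17, 16, 15, 14, 13]
raw_matches = [ring_a, ring_a.rotate(1), ring_a.rotate(2).reverse,
               [20, 25, 24, 23, 22, 21]]

unique_hits(raw_matches).each_with_index do |hit, index|
  puts "Hit #{index}: [ #{hit.join(' ')} ]"
end
```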
Finding Your Way
Using a new library like OBRuby can take some getting used to. An excellent source of information is Open Babel's online API documentation. Another source is Ruby itself.
For example, let's say you've instantiated an OBMol, but can't remember the exact name of the method that counts the number of atoms. Just use Object.methods.sort:
require 'openbabel'
mol = OpenBabel::OBMol.new
mol.methods.sort # => see output below
Conclusions
OBRuby combines the dynamic programming language Ruby with the highly functional toolkit Open Babel. Further augmenting OBRuby's capabilities with the web application framework Rails and/or the Ruby Chemistry Development Kit offers even more possibilities. Future articles will bring some of them to life.
CDK, the Ruby Way: RCDK-0.2.0
Ruby Chemistry Development Kit (RCDK) version 0.2.0 is now available. This version adds built-in support for Structure-CDK, a 2-D rendering framework. Simplifying the use of this library is a convenience layer enabling many common tasks to be accomplished with a single line of Ruby code.
Installing RCDK-0.2.0 is simple. From the command line (as root):
# gem install rcdk
Be prepared for a bit of a wait as the large RCDK RubyGem downloads and is installed.
If RCDK-0.1.0 is already installed on your system, version 0.2.0 can peacefully co-exist with it. Ruby will automatically load the most recent version of RCDK, and you can dynamically load the earlier version in your own code. If you'd like to uninstall RCDK-0.1.0 anyway, use the following (also as root):
# gem uninstall rcdk
Follow the menu to uninstall RCDK-0.1.0 and you're done.
If you haven't done so already, there is one bit of additional configuration. You'll need to update your LD_LIBRARY_PATH to point to the location of your system's native Java libraries. On Linux with Sun's JDK, this can be done with the following:
$ export LD_LIBRARY_PATH=$JAVA_HOME/jre/lib/i386:$LD_LIBRARY_PATH
This assumes that JAVA_HOME was already set. If not, it will need to point to your system's Java installation directory.
The whole process can be automated by including the above line at the end of your .bash_profile file (or equivalent).
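Before loading RCDK, it can help to confirm that the environment is actually set up this way. The following sketch does so from Ruby. Both helper names are hypothetical, and the i386 default simply mirrors the export line above; other platforms use other directory names.

```ruby
# Builds the directory that holds the native Java libraries, following
# the $JAVA_HOME/jre/lib/i386 layout used above. The 'arch' default is
# an assumption; other platforms use other directory names.
def java_native_lib_dir(java_home, arch = 'i386')
  File.join(java_home, 'jre', 'lib', arch)
end

# Returns true if 'dir' is one of the entries in a colon-separated path.
def on_library_path?(dir, library_path)
  library_path.to_s.split(':').include?(dir)
end

java_home = ENV['JAVA_HOME']
if java_home.nil? || java_home.empty?
  puts 'JAVA_HOME is not set.'
else
  dir = java_native_lib_dir(java_home)
  status = on_library_path?(dir, ENV['LD_LIBRARY_PATH']) ? 'includes' : 'is missing'
  puts "LD_LIBRARY_PATH #{status} #{dir}"
end
```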
As a simple demonstration, let's say you'd like to depict the 2-D structure encoded by a SMILES string as a 200x200 pixel PNG image. With RCDK-0.2.0, this can be done with the following Ruby code (which can be entered interactively via irb):
require 'rubygems'
require_gem 'rcdk'
require 'util'
smiles = 'Oc1ccccc1' #phenol
RCDK::Util::Image.smiles_to_png(smiles, 'phenol.png', 200, 200)
This code creates phenol.png in your current directory:

Of course, there's much more to RCDK than just SMILES depiction. Future articles will describe some of the many possibilities.
Hacking PubChem: Query by SMILES
Recently, I showed how a simple PubChem API could be built from a few lines of Ruby code. The API we created could retrieve a molfile and a 2-D molecular rendering given a PubChem compound ID (CID). In this tutorial, we'll see how a SMILES query mechanism can be added to the API, enabling CIDs to be retrieved from any valid SMILES string. We'll also see how to extend this capability to retrieving a 2-D image from PubChem by submitting a SMILES string.
Credits
The API that follows is based on the pubchem.rb file found in Chemruby by Tadashi Kadowaki and Nobuya Tanaka.
Defining the Problem
We want to create a PubChem API that returns an Array of CIDs given any valid SMILES string. The API will communicate with the publicly available molecular database PubChem using HTTP.
In some cases, PubChem associates more than one CID with a given molecular structure. For example, querying the SMILES string c1ccccc1 (benzene) finds both benzene and C-14 benzene. The software needs to handle these cases as well.
Prerequisites
The only thing you'll need for this tutorial is Ruby, preferably v1.8 or better.
Code
Create a file called query.rb in your working directory containing the following code:
require 'uri'
require 'net/http'
# A simple SMILES query for PubChem based on the file <tt>pubchem.rb</tt>,
# and originally part of Chemruby (http://rubyforge.org/project/chemruby).
# Distributed under Ruby's License.
#
# Copyright (C) 2005, 2006 KADOWAKI Tadashi <kado@kuicr.kyoto-u.ac.jp>
# TANAKA Nobuya <tanaka@kuicr.kyoto-u.ac.jp>
# APODACA Richard <r_apodaca@users.sf.net>
class PubChemQuery
  @@host="pubchem.ncbi.nlm.nih.gov"
  @@searchpath="/search/"
  @@query="PreQSrv.cgi"
  @@boundary="-----boundary-----"
  # Synthetic form data. Lifted from Chemruby <tt>pubchem.rb</tt>
  @@data = [
    @@boundary, "Content-Disposition: form-data; name=\"mode\"", "", "simplequery",
    @@boundary, "Content-Disposition: form-data; name=\"queue\"", "", "ssquery",
    @@boundary, "Content-Disposition: form-data; name=\"simple_searchdata\"", "", '%s',
    @@boundary, "Content-Disposition: form-data; name=\"simple_searchtype\"", "", "fs",
    @@boundary, "Content-Disposition: form-data; name=\"maxhits\"", "", '%s',
    @@boundary].join("\x0d\x0a")
  # Returns an <tt>Array</tt> of CIDs matching <tt>smiles</tt>. If no matches are found,
  # <tt>nil</tt> is returned.
  def self.query_by_smiles(smiles, maxhits = 100)
    form_response = post_form(smiles, maxhits)
    wait_response = process_wait_page(form_response)
    url = get_report_url(wait_response)
    url ? process_report(url) : nil
  end
  # Helper class methods follow. A bare <tt>private</tt> does not hide methods
  # defined with <tt>def self.</tt>, so they are made private at the end of the class.
  # Returns the response to posting the initial search form.
  def self.post_form(smiles, maxhits)
    response = ''
    Net::HTTP.start(@@host, 80) do |http|
      response = http.post(@@searchpath + @@query, @@data % [smiles, maxhits],
        {
          'Content-Type' => "multipart/form-data; boundary=#{@@boundary}",
          'Referer' => "http://pubchem.ncbi.nlm.nih.gov/search/"
        }).body
    end
    response
  end
  # Processes the wait page displayed after submission of the search form.
  def self.process_wait_page(body)
    response = ''
    if m = /url="([^"]+)"/.match(body)
      Net::HTTP.start(@@host, 80) do |http|
        response = http.get(@@searchpath + m[1]).body
      end
    end
    response
  end
  # Returns the URL, as a <tt>String</tt>, to the search report, given the specified
  # body of the wait page.
  def self.get_report_url(body)
    url = nil
    Net::HTTP.start(@@host, 80) do |http|
      while /setTimeout\('document.location.replace\("([^"]+)"\);', (\d+)\)/ =~ body do
        sleep($2.to_f/100)
        response = http.get(URI.parse($1).to_s)
        body = response.body
        url = response['location']
      end
    end
    url
  end
  # Extracts CIDs from the search report contained at <tt>url</tt>.
  def self.process_report(url)
    cid = Array.new
    Net::HTTP.start(@@host, 80) do |http|
      # text format
      url.sub!(/cmd=Select\+from\+History/, 'cmd=Text&dopt=Brief')
      http.get(url).body.scan(/\d+: CID: (\d+)/).each do |id|
        cid.push(id[0])
      end
    end
    cid
  end
  private_class_method :post_form, :process_wait_page, :get_report_url, :process_report
end
You might want to manually submit a SMILES query to PubChem as a refresher on how this webapp works. Briefly, the contents of the SMILES search field are read, and a wait screen appears, typically for three seconds. You are then redirected to a results report page containing thumbnail images of the hits and their CIDs.
The PubChemQuery class contains a single public class method, query_by_smiles. This method builds a form to submit, based on the supplied SMILES string and optional maxhits argument. It then waits until PubChem indicates that the query is about to finish processing. The URL for the results report page is then parsed. If a nonempty URL was found, then its page is loaded, and CIDs are scraped. Otherwise, the method returns nil.
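The scraping step can be tried without touching the network. The sketch below applies the same regular expression that process_report uses to a hypothetical piece of report text; the sample text and the second CID are invented for illustration, and the real PubChem report layout may differ. It also shows that a report with no hits yields an empty Array rather than nil. Since an empty Array is truthy in Ruby, callers may want to check for both cases.

```ruby
# Same pattern that process_report uses to pull CIDs out of the
# text-format report.
CID_PATTERN = /\d+: CID: (\d+)/

# Returns the CIDs found in 'report_text' as an Array of Strings.
def extract_cids(report_text)
  report_text.scan(CID_PATTERN).map { |captures| captures[0] }
end

# Hypothetical report text, invented to exercise the pattern; the real
# PubChem report layout may differ.
sample_report = "1: CID: 241\n2: CID: 12345\n"

puts extract_cids(sample_report).inspect   # => ["241", "12345"]
puts extract_cids('no hits here').inspect  # => []
```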
Usage
Using PubChemQuery consists of invoking its class method query_by_smiles. You can do so either via the Ruby interpreter (ruby), or preferably through Interactive Ruby (irb).
require 'query'
smiles = "c1cccc(Cl)c1(Cl)" # 1,2-dichlorobenzene
puts "Searching CID(s) for SMILES, #{smiles} ..."
cid = PubChemQuery.query_by_smiles(smiles)
puts cid # => 7239
Layering Complexity
We can combine the SMILES query API discussed here with the molfile and image retrieval discussed in the earlier Hacking PubChem article.
Let's say you'd like to download PubChem's 2-D image of imatinib (Gleevec) by submitting its SMILES string. Copy the file named pubchem.rb, provided in the original PubChem tutorial, into your working directory. Now you can programmatically download imatinib's 2-D image from PubChem based only on a SMILES string, for example:
require 'pubchem'
require 'query'
smiles="Cc3ccc(NC(=O)c2ccc(CN1CCN(C)CC1)cc2)cc3Nc5nccc(c4cccnc4)n5" #imatinib
puts "Searching CID(s) for SMILES, #{smiles} ..."
cid = PubChemQuery.query_by_smiles(smiles)
if cid
  puts "CID found: #{cid[0]}"
  filename = cid[0] + ".png"
  puts "Writing image to #{filename} ..."
  PubChem.write_image(cid[0], filename)
else
  puts "No CID for #{smiles} was found."
end
This produces an image of imatinib called 5291.png in your working directory:

Wrapping Up
As you can see, we're just scratching the surface. The approach outlined here offers nearly unlimited possibilities for repackaging PubChem's own content, and mashing this content up with that of other sites. Happy hacking!
Hacking NMRShiftDB
NMRShiftDB is an open web database of peer-reviewed NMR chemical shifts compiled by volunteers. As of this writing, it contains 22,429 measured spectra from 18,986 structures, and reports 927 registered users. The database code itself is open source.
Although NMRShiftDB has a web interface, its architecture is designed to simplify writing programs that use it. A previous article showed how a working PubChem API could be written with just a few lines of Ruby. This time, I'll show how the same thing can be done for NMRShiftDB.
Ingredients
This tutorial uses Arton's excellent Ruby Java Bridge, the installation and use of which has been previously discussed. Also used is Ruby's InChI interface, Rino, for which installation instructions are here.
Create a working directory called nmr. Into this directory, copy cdk-20060714.jar, which can be downloaded here.
Code
Create a file called nmr.rb containing the following Ruby code:
require 'net/http'
require 'smi2inchi'
# A very simple NMRShiftDB Web API.
class NMRFetcher
  # Creates a <tt>Translator</tt> instance.
  def initialize
    @translator = Translator.new
  end
  # Returns an XML record, as a string, for the molecule
  # with SMILES matching <tt>smiles</tt> and spectrum type
  # matching <tt>spectrumtype</tt> (13C, 1H, 15N and 31P).
  def get_record(smiles, spectrumtype)
    body = nil
    inchi = (smi2inchi(smiles)).gsub('InChI=', 'inchi=')
    path = '/NmrshiftdbServlet?nmrshiftdbaction=exportcmlbyinchi&' + inchi + '&spectrumtype=' + spectrumtype
    Net::HTTP.start('nmrshiftdb.ice.mpg.de') do |http|
      response = http.get(path)
      body = response.body
    end
    if !valid_record?(body)
      return nil
    end
    body
  end
  private
  def valid_record?(body)
    !body.eql?('No such molecule or spectrum')
  end
  def smi2inchi(smiles)
    @translator.translate(smiles)
  end
end
The magic in the above code is nothing more than a simple HTTP request sent to nmrshiftdb.ice.mpg.de, contained in the get_record method. This request encodes an InChI identifier, which is generated from the SMILES string passed as an argument. We also specify a spectrum type.
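To make the request format concrete, the path construction can be pulled out and run on its own, with no network access, Java, or Rino required. In this sketch, nmrshiftdb_path is a hypothetical helper that mirrors the string handling in get_record, and the benzene InChI is supplied by hand instead of being generated from SMILES.

```ruby
# Builds the request path the same way get_record does: the 'InChI='
# prefix becomes the lowercase 'inchi=' query parameter.
def nmrshiftdb_path(inchi, spectrumtype)
  '/NmrshiftdbServlet?nmrshiftdbaction=exportcmlbyinchi&' +
    inchi.gsub('InChI=', 'inchi=') +
    '&spectrumtype=' + spectrumtype
end

# InChI for benzene, written with the '1/' version prefix used
# elsewhere in these articles.
benzene = 'InChI=1/C6H6/c1-2-4-6-5-3-1/h1-6H'
puts nmrshiftdb_path(benzene, '13C')
```

Note that the InChI is placed in the path without any URL escaping, exactly as in get_record.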
Now create a file called smi2inchi.rb, containing the following Ruby code:
ENV['CLASSPATH'] = './cdk-20060714.jar'
require 'rubygems'
require_gem 'rjb'
require_gem 'rino'
require 'rjb'
StringWriter = Rjb::import 'java.io.StringWriter'
SmilesParser = Rjb::import 'org.openscience.cdk.smiles.SmilesParser'
MDLWriter = Rjb::import 'org.openscience.cdk.io.MDLWriter'
# Converts a SMILES string into an InChI identifier using
# the CDK Library (Java) and the Rino Library (Ruby/C).
class Translator
  def initialize
    @smiles_parser = SmilesParser.new
    @mdl_writer = MDLWriter.new
    @mol2inchi = Rino::MolfileReader.new
  end
  # Returns an InChI identifier from the specified SMILES string.
  # Uses the CDK classes SmilesParser and MDLWriter to generate
  # a molfile from a SMILES string. Then this molfile is
  # parsed by Rino::MolfileReader.
  def translate(smiles)
    mol = @smiles_parser.parseSmiles(smiles)
    sw = StringWriter.new
    @mdl_writer.setWriter(sw)
    @mdl_writer.write(mol)
    @mol2inchi.read(sw.toString)
  end
end
The description and use of this code were discussed in a recent article on generating InChI identifiers from SMILES strings.
Before using the code we've just created you'll need to set the LD_LIBRARY_PATH (or equivalent) to point to the native Java libraries. On Linux with Sun's JDK, this is done from the command line with:
$ export LD_LIBRARY_PATH=$JAVA_HOME/jre/lib/i386:$LD_LIBRARY_PATH
Using the NMRFetcher class is just a matter of creating an instance, and invoking get_record with the desired SMILES string and spectrum type (1H, 13C). Doing so returns a CML document containing the structure of the compound and its spectrum. If no record matches, the method returns nil. The code below gives an example in which the CML output is pretty-printed using the wonderful Ruby API for XML, REXML:
require "rexml/document"
require 'nmr'
nmr = NMRFetcher.new
smiles = 'c1ccccc1' #benzene, to keep things simple
type = '13C'
record = nmr.get_record(smiles, type)
if record #pretty-print the CML record using REXML
  file = File.new('result.xml', 'w')
  (REXML::Document.new(record)).write(file, 0)
  file.close
else #write an error
  File.open('result.error', 'w') do |file|
    file << 'No record of SMILES: ' + smiles
  end
end
Save this code in a file called test.rb and run it from the command line:
$ ruby test.rb
Alternatively, it can be entered interactively and played with using irb:
$ irb
irb(main):001:0>
Output
The program produces the following Chemical Markup Language output in a file called result.xml:
<cml>
<molecule title='Benzene' id='nmrshiftdb7901' xmlns='http://www.xml-cml.org/schema/cml2/core'>
<atomArray xmlns='http://www.xml-cml.org/schema'>
<atom elementType='C' y2='0.7625' x2='-1.4063' id='a1' formalCharge='0' hydrogenCount='0'/>
<atom elementType='C' y2='0.35' x2='-2.1207' id='a2' formalCharge='0' hydrogenCount='0'/>
<atom elementType='C' y2='-0.475' x2='-2.1207' id='a3' formalCharge='0' hydrogenCount='0'/>
<atom elementType='C' y2='-0.8875' x2='-1.4063' id='a4' formalCharge='0' hydrogenCount='0'/>
<atom elementType='C' y2='-0.475' x2='-0.6918' id='a5' formalCharge='0' hydrogenCount='0'/>
<atom elementType='C' y2='0.35' x2='-0.6918' id='a6' formalCharge='0' hydrogenCount='0'/>
</atomArray>
<bondArray xmlns='http://www.xml-cml.org/schema'>
<bond atomRefs2='a1 a2' order='S' id='b1'/>
<bond atomRefs2='a2 a3' order='D' id='b2'/>
<bond atomRefs2='a3 a4' order='S' id='b3'/>
<bond atomRefs2='a4 a5' order='D' id='b4'/>
<bond atomRefs2='a5 a6' order='S' id='b5'/>
<bond atomRefs2='a1 a6' order='D' id='b6'/>
</bondArray>
</molecule>
<spectrum moleculeRef='nmrshiftdb7901' xmlns:cml='http://www.xml-cml.org/dict/cml' xmlns:cmlDict='http://www.xml-cml.org/dict/cmlDict' xmlns:siUnits='http://www.xml-cml.org/units/siUnits' type='NMR' xmlns:macie='http://www.xml-cml.org/dict/macie' xmlns:units='http://www.xml-cml.org/units/units' id='nmrshiftdb15502' xmlns:subst='http://www.xml-cml.org/dict/substDict' xmlns:nmr='http://www.nmrshiftdb.org/dict' xmlns='http://www.xml-cml.org/schema/cml2/spect'>
<conditionList xmlns='http://www.xml-cml.org/schema'>
<scalar dataType='xsd:string' units='siUnits:k' dictRef='cml:temp'>298</scalar>
<scalar dataType='xsd:string' units='siUnits:hertz' dictRef='cml:field'>Unreported</scalar>
</conditionList>
<metadataList xmlns='http://www.xml-cml.org/schema'>
<metadata name='nmr:OBSERVENUCLEUS' content='13C'/>
</metadataList>
<peakList xmlns='http://www.xml-cml.org/schema'>
<peak xUnits='units:ppm' peakShape='sharp' xValue='128.5' id='p0' atomRefs='a1 a2 a3 a4 a5 a6'/>
</peakList>
</spectrum>
</cml>
The kind of output produced by NMRFetcher and NMRShiftDB could be used in a variety of ways. Notice, near the bottom of the document, how peak assignments are made relative to the atom labels in the molecule declaration. It should be possible, for example, to create interactive 2-D structure diagrams from this document in which a user mouses over an atom and gets a C-13 chemical shift.
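The atom-to-shift lookup that such a diagram would need can be sketched with REXML. This is only one possible approach: shifts_by_atom is a hypothetical helper, and the CML fragment is a trimmed-down copy of the peak list from the record above. Elements are matched by local name, so the namespace declarations in the record do not need special handling.

```ruby
require 'rexml/document'

# Maps each atom id to the chemical shift (in ppm) of the peak
# assigned to it, based on the atomRefs attribute of each <peak>.
def shifts_by_atom(cml)
  shifts = {}
  REXML::Document.new(cml).root.each_recursive do |element|
    next unless element.name == 'peak'
    ppm = element.attributes['xValue'].to_f
    element.attributes['atomRefs'].to_s.split.each { |atom_id| shifts[atom_id] = ppm }
  end
  shifts
end

# Trimmed-down fragment of the record above, kept short for illustration.
cml = <<XML
<cml>
  <spectrum xmlns='http://www.xml-cml.org/schema/cml2/spect'>
    <peakList xmlns='http://www.xml-cml.org/schema'>
      <peak xUnits='units:ppm' peakShape='sharp' xValue='128.5' id='p0' atomRefs='a1 a2 a3 a4 a5 a6'/>
    </peakList>
  </spectrum>
</cml>
XML

shifts_by_atom(cml).each { |atom_id, ppm| puts "#{atom_id}: #{ppm} ppm" }
```

For benzene, all six carbons share the single peak at 128.5 ppm, so every atom id maps to the same value.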
NMRShiftDB is a valuable and free online resource for NMR spectroscopy. Programmatically mixing its capabilities with free software and other online services offers numerous opportunities to build innovative chemical informatics systems.

