Looking at InChIs

Posted by Rich Apodaca Tue, 26 Sep 2006 14:35:00 GMT

InChI identifiers can be viewed both as unique molecular keys and as a language encoding molecular structure. With the right software, it is possible to decode any InChI to arrive at a human-readable molecular structure. This tutorial will show how to convert InChI identifiers into 2-D molecular renderings using open source tools.

Prerequisites

The InChI to 2-D image conversion process requires two pieces of software:

  • Rino decodes InChI identifiers into molfiles. The resulting atomic coordinates are set to zero.

  • RCDK assigns coordinates to the molfile produced by Rino, and renders the result.
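Because Rino's output has every atom at the origin, it can be handy to test whether a molfile still needs a layout step before rendering it. The following is a minimal sketch, not part of Rino or RCDK: the method name zero_coordinates? is hypothetical, it assumes a V2000 molfile with fixed-width columns, and the two-atom molfile is hand-written for illustration.

```ruby
# Returns true when every atom in a V2000 molfile sits at the origin,
# i.e. the molfile still needs 2-D coordinate generation.
def zero_coordinates?(molfile)
  lines = molfile.lines.map(&:chomp)
  atom_count = lines[3][0, 3].to_i   # counts line: first three columns
  atom_lines = lines[4, atom_count]  # atom block follows the counts line
  atom_lines.all? do |line|
    x = line[0, 10].to_f
    y = line[10, 10].to_f
    z = line[20, 10].to_f
    [x, y, z].all?(&:zero?)
  end
end

# Hand-written two-atom fragment (C-C), used only for illustration.
flat = [
  '',
  '  example',
  '',
  '  2  1  0  0  0  0  0  0  0  0999 V2000',
  '    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0',
  '    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0',
  '  1  2  1  0  0  0  0',
  'M  END'
].join("\n")

# Same fragment with the first atom moved off the origin.
laid_out = flat.sub('    0.0000    0.0000    0.0000 C',
                    '    1.5000    0.0000    0.0000 C')

puts zero_coordinates?(flat)      # every atom at the origin
puts zero_coordinates?(laid_out)  # one atom has been moved
```

A check like this could guard the call to the coordinate generator, so that molfiles which already carry coordinates are rendered unchanged.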

Bring on the Code

The following Ruby code illustrates how the InChI for the pesticide fipronil (Regent) can be translated into a PNG image:

require 'rubygems'
require_gem 'rino'
require_gem 'rcdk'
require 'util'

inchi = 'InChI=1/C12H4Cl2F6N4OS/c13-5-1-4(11(15,16)17)2-6(14)8(5)24-10(22)9(7(3-21)23-24)26(25)12(18,19)20/h1-2H,22H2' #fipronil
reader = Rino::InChIReader.new
molfile1 = reader.read(inchi) # lacks 2-D atomic coordinates
molfile2 = RCDK::Util::XY.coordinate_molfile(molfile1) # has 2-D atomic coordinates

RCDK::Util::Image.molfile_to_png(molfile2, 'fipronil.png', 350, 300)

Running this code produces the image fipronil.png in your working directory:

Limitations

The technique illustrated here is subject to the same limitations as the underlying software. For Rino, this means that stereochemistry is ignored. For RCDK, this means that implicit hydrogen atoms, isotopes, and charges are omitted, and that the layout of macrocycles and other complex ring systems may look unrefined.
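Since stereochemistry is dropped silently, it may be worth checking up front whether an InChI carries any stereo layers. The sketch below is plain string handling and does not use Rino; the method name stereo_layers is hypothetical, and it assumes the stereo layers are the ones prefixed with /b, /t, /m and /s. The second InChI (alanine with one defined stereocenter) is included only as an example.

```ruby
# Layer prefixes that carry stereochemical information in an InChI.
STEREO_LAYER_PREFIXES = %w[b t m s].freeze

# Returns the stereo layers found in an InChI string (empty if none).
def stereo_layers(inchi)
  inchi.split('/')
       .drop(2) # skip the version and formula segments
       .select { |layer| STEREO_LAYER_PREFIXES.include?(layer[0]) }
end

fipronil = 'InChI=1/C12H4Cl2F6N4OS/c13-5-1-4(11(15,16)17)2-6(14)8(5)24-10(22)9(7(3-21)23-24)26(25)12(18,19)20/h1-2H,22H2'
alanine  = 'InChI=1/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1'

p stereo_layers(fipronil) # => []
p stereo_layers(alanine)  # => ["t2-", "m0", "s1"]
```

A non-empty result means the rendered image will show less than the identifier actually encodes.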

Other Software that Does This

To my knowledge, only one other Open Source package, BKChem, is capable of rendering InChIs as described here. BKChem's underlying InChI translation and depiction software, OASA, can also be accessed online. For comparison, OASA produces the following image for the fipronil InChI:

The PubChem editor can also translate and render InChIs, but no source code appears to be available. PubChem's InChI translation and rendering output for fipronil is:

The Chemistry Development Kit, on which RCDK is based, was recently upgraded to support reading InChI identifiers. For some time, CDK has been able to generate 2-D atomic coordinates.

More information on InChI software can be found at Beda Kosata's InChI.info site.

The Final Word

Within certain limitations, it is quite feasible to programmatically obtain a 2-D molecular image for any InChI identifier. Combining this capability with other chemical informatics software and services offers numerous possibilities to use InChI in innovative ways.

Decoding InChIs with Rino

Posted by Rich Apodaca Tue, 19 Sep 2006 14:07:00 GMT

InChI identifiers are unique, ASCII-based molecular identifiers well-suited for chemical informatics on the Web. But they are also much more than that. Encoded in every InChI is all of the information needed to reconstruct a valid, machine-readable molecular representation. This tutorial shows how Open Source tools can be used to construct a molfile representation from an InChI identifier with the help of new features in the Rino toolkit for Ruby. The ability of Rino to produce InChI identifiers from molfile input has already been discussed.
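To get a feel for how much is packed into the identifier, it helps to split one into its slash-separated layers. This is only a sketch of the text structure, not a parser that rebuilds a molecule: the method name inchi_layers is hypothetical, and it assumes each layer after the formula starts with a one-letter prefix (repeated prefixes would overwrite each other).

```ruby
# Splits an InChI string into a Hash of its layers, keyed by
# 'version', 'formula', or the one-letter layer prefix.
def inchi_layers(inchi)
  version, formula, *rest = inchi.split('/')
  layers = { 'version' => version.sub('InChI=', ''), 'formula' => formula }
  rest.each { |layer| layers[layer[0]] = layer[1..-1] }
  layers
end

layers = inchi_layers('InChI=1/C6H6/c1-2-4-6-5-3-1/h1-6H')
puts layers['version'] # 1
puts layers['formula'] # C6H6
puts layers['c']       # 1-2-4-6-5-3-1 (connectivity)
puts layers['h']       # 1-6H (hydrogens)
```

The connectivity and hydrogen layers are what a full decoder such as the InChI-API works from when it reconstructs the atoms and bonds.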

Credits

What follows was in part inspired by helpful comments posted by Sam Adams, author of the JNI InChI Wrapper, and Dmitrii Tchekhovskoi, co-author of the InChI software.

A Demo with cInChI

The newest release of the IUPAC InChI-API toolkit can now translate an InChI identifier into a molfile. This consists of a two-step process:

  1. Convert a simple InChI into a full InChI with Auxiliary Information (AuxInfo).
  2. Convert the full InChI into a molfile.

You can get a feel for how this process works by using the cInChI command-line program. Create a file called test.txt containing the following InChI (for benzene):

InChI=1/C6H6/c1-2-4-6-5-3-1/h1-6H

Now, run cInChI:

$ touch temp.txt
$ ./cInChI-1 test.txt temp.txt -InChI2Struct

The first line creates an empty temporary file, temp.txt. The second line runs cInChI, which writes the full InChI into this file. The -InChI2Struct parameter tells cInChI to generate an InChI with Auxiliary Information.

Now, create an empty file, benzene.mol and run cInChI with the -OutputSDF option:

$ touch benzene.mol
$ ./cInChI-1 temp.txt benzene.mol -OutputSDF

If everything worked, you should now have a molfile called benzene.mol, describing benzene, in your working directory. All atom coordinates will be zero, because coordinate generation is outside the scope of the InChI project. This has important implications for stereochemistry (see below). Of course, other free libraries can generate aesthetically-pleasing 2-D molecular coordinates.
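The two cInChI steps above can also be strung together from a script. Here is a small sketch that only builds the two command lines; the method name inchi_to_molfile_commands and its keyword arguments are hypothetical, while the executable name and the -InChI2Struct and -OutputSDF options are the ones used in the demo.

```ruby
# Builds the two cInChI invocations from the demo as argument arrays:
# step 1 writes the full InChI (with AuxInfo), step 2 writes the molfile.
def inchi_to_molfile_commands(inchi_file, molfile,
                              executable: './cInChI-1', temp_file: 'temp.txt')
  [
    [executable, inchi_file, temp_file, '-InChI2Struct'],
    [executable, temp_file, molfile, '-OutputSDF']
  ]
end

commands = inchi_to_molfile_commands('test.txt', 'benzene.mol')
commands.each { |command| puts command.join(' ') }

# To actually run them (needs the cInChI binary in the working directory):
# commands.each { |c| system(*c) or abort("failed: #{c.join(' ')}") }
```

Keeping the commands as arrays, rather than one shell string, avoids quoting problems if a file name contains spaces.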

Hello, Rino

Rino is a thin Ruby wrapper around the InChI-API toolkit, which is written in C. An earlier article described the use of the automatic wrapper generator SWIG to write the C glue code that Rino interfaces with. The current version of Rino (v0.2.0) uses this approach to Ruby interface generation.

The current version of Rino can conveniently be installed by executing the following (as root):

# gem install rino

Earlier today, this command gave me "404 Not Found" errors, although it has since started working. The cause is not clear, but the errors seem to occur within the first 24 hours after a Gem is uploaded. If you run into problems, the Rino RubyGem can also be downloaded and installed locally.

If you've already installed Rino-0.1.0, the new version can happily cohabitate with it. RubyGems by default installs the most recent version of Rino unless you specify otherwise. If you'd like to uninstall Rino-0.1.0 do the following (as root):

# gem uninstall rino

You should get a menu of Rino versions to uninstall.

A Ruby Demo

The following Ruby code demonstrates the use of Rino to translate an InChI identifier into a molfile:

require 'rubygems'
require_gem 'rino'

inchi = 'InChI=1/C6H6/c1-2-4-6-5-3-1/h1-6H' # benzene
reader = Rino::InChIReader.new
molfile = reader.read(inchi)

p molfile # => prints the molfile for benzene

If you'd like even more control, you can directly access the InChI run method, which provides all of the capabilities of running cInChI from the command line:

require 'rubygems'
require_gem 'rino'

input = 'input.txt'   # a valid file in your working dir
output = 'output.txt' # also a valid file

Rino::InChI.run(['', input, output])

Limitations

The InChI->molfile implementation in the InChI-API toolkit does not reproduce stereochemical information. For example, passing an InChI of a molecule containing a single tetrahedral stereocenter results in a molfile lacking stereo parities. Further, an explicit hydrogen atom is added to the stereogenic atom in the molfile output. Being based entirely on the InChI-API, Rino inherits these behaviors.

Rino is based on a very simple interface into InChI's main method. This has the advantage that anything that can be done with the cInChI command line application can also be done with Rino. It carries the disadvantage that the convenience classes InChIReader and MolfileReader use a less than elegant system of temporary disk files for input-output. Future versions of Rino should address this issue, a task that may be simplified by SWIG.
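The temporary-file pattern mentioned above can be sketched with Ruby's standard Tempfile class. This is not Rino's actual implementation: the method name convert_via_tempfiles is hypothetical, and the converter is a stand-in lambda where a real version would call something like Rino::InChI.run with the two file paths.

```ruby
require 'tempfile'

# Runs a file-based converter on in-memory text. The converter is any
# callable taking (input_path, output_path); the result is read back
# from the output file and both temporary files are removed afterwards.
def convert_via_tempfiles(text, converter)
  input = Tempfile.new('inchi-in')
  output = Tempfile.new('inchi-out')
  begin
    input.write(text)
    input.close   # flush to disk so the converter can read it
    output.close  # the converter reopens the file by path
    converter.call(input.path, output.path)
    File.read(output.path)
  ensure
    input.close!  # close and delete
    output.close!
  end
end

# Hypothetical stand-in converter: copies the input, upper-cased.
fake_converter = lambda do |input_path, output_path|
  File.write(output_path, File.read(input_path).upcase)
end

puts convert_via_tempfiles('benzene', fake_converter) # BENZENE
```

The caller only ever sees strings, which is what makes the disk round trip tolerable as a stopgap.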

Other InChI Parsers

To my knowledge, three Open Source InChI parsers, besides the InChI-API and Rino, exist. They are:

  • Ninja. A Java library that performs low-level InChI parsing, and is designed as a platform for more sophisticated parsers. While it does not create molfiles from InChIs, it can be used as a foundation for software that does. Ninja is used in the molecular language framework, Rosetta, although this work is far from complete.

  • BKChem. Beda Kosata's 2-D structure editor, which is written in Python. The similarities between Ruby and Python make this codebase a potentially useful starting point for a pure Ruby InChI parser.

  • JNI InChI Wrapper. Also a wrapper for the InChI-API. When used in combination with the Chemistry Development Kit, this package has been reported to produce molfiles from InChI identifiers.

More information on InChI software capabilities can be found at Beda Kosata's InChI info site.

Wrapping Up

The translation of InChI identifiers into other molecular representation systems will become more important as InChI gains traction. Mashups involving InChI translation offer many tantalizing opportunities for innovative chemical informatics applications. Future articles will discuss some of them.

Taking a SWIG of InChI

Posted by Rich Apodaca Sat, 16 Sep 2006 14:43:00 GMT

The IUPAC InChI developer toolkit is written in C. It is currently the only Open Source software capable of generating InChI identifiers. Software that needs to write InChIs must use the C toolkit in one form or another. This poses a problem for the large amount of chemical informatics software being written in other languages. In this article, I'll explain how the Open Source tool SWIG can solve this problem in a semi-automated way. The same concepts can, in principle, be used to link any library written in C/C++ with another language.

Prerequisites

This tutorial uses Ruby as the language that InChI will be linked with. You'll therefore need both Ruby and the Ruby development libraries installed. You'll also need SWIG and possibly the SWIG development libraries.

Use the Source, Luke

After downloading and unpacking InChI-1-API v1.0.1, collect all header (*.h) and source (*.c) files into a directory called inchi. These files can be found in the following two directories:

  • InChI-1-API/cInChI/common
  • InChI-1-API/cInChI/main

Find the Main Method

This tutorial will create an interface into the InChI main() function. This function is found on line 149 of the file ichimain.c. For reasons I won't get into here, rename this method run and change the second argument type to char **. Also, add a prototype for the run function directly above line 149:

int run( int argc, char **argv ); // new line added

int run( int argc, char **argv ) // formerly line 149

Create the Interface File

The focal point of SWIG is the interface file. This file specifies the C functions you want to link into and some items to help in doing so. Create a file called libinchi.i containing the following:

/* The name of this module. */
%module libinchi

/*
 * Tells SWIG to treat char ** as a special case.
 */
%typemap(in) (int argc, char **argv) {

 /* Get the length of the array */
 int size = RARRAY($input)->len; 
 int i;
 $1 = ($1_ltype) size;
 $2 = (char **) malloc((size+1)*sizeof(char *));

 /* Get the first element in memory */
 VALUE *ptr = RARRAY($input)->ptr; 
 for (i=0; i < size; i++, ptr++)
   /* Loop body: convert Ruby Object String to char* */
   $2[i]= STR2CSTR(*ptr);

 $2[i]=NULL; /* End of list (i == size here) */
}

/*
 * Cleans up the char ** array created before 
 * the function call.
 */
%typemap(freearg) char ** {
 free((char *) $1);
}

/*
 * Function definition from ichimain.c.
 */
extern int run(int argc, char **argv);

The interface file has three main parts. The first part (line 2, the %module directive) names the module. The second part (lines 7-30, the two %typemap blocks) makes the necessary Ruby/C datatype conversions. The last part (line 35, the extern declaration) tells SWIG which InChI functions we want to be able to access from Ruby.

Take a SWIG

At this point, SWIG has everything it needs to autogenerate our glue code. This can be done by:

$ swig -ruby libinchi.i

This command should have created a new source file, libinchi_wrap.c, that contains all of the C glue code for our library. We'll have a look at the most important part of this file shortly.

Create a Makefile

We'll need a makefile with which to compile our library. Fortunately, Ruby makes this very easy. Create a file called extconf.rb containing the following Ruby code:

require 'mkmf'

create_makefile('libinchi')

A makefile can now be generated by:

$ ruby extconf.rb

Build the Library

Our library can now be built with:

$ make

Use InChI from Ruby

We are now done with the basics. You can verify that the process worked through Interactive Ruby (irb):

$ irb
irb(main):001:0> require 'libinchi'
=> true

The return value of true shows that Ruby loaded and recognized the binary library we just built (libinchi.so). We are now able to use this library as if it were written in Ruby.

Use the Library

To test the library, copy a molfile called test.mol into your inchi directory. Now run this code:

require 'libinchi'

Libinchi.run(['', 'test.mol'])

You should get a lot of output from the InChI library. If you take a look at the inchi directory contents, a new file, test.mol.txt, has been created. It contains the InChI identifier of the molecule contained in your molfile. This software also created a log file (test.mol.log) and a problem file (test.mol.prb).

You may be wondering why the first element in the Array passed to Libinchi.run is empty. The reason is that by convention a C main method expects its first argument to be the name of the program itself. The InChI main method takes this into account, and so the Array simply leaves its first element blank.
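The blank first element is easy to forget, so a tiny helper can take care of it. This is only a sketch: the method name main_style_argv is hypothetical, and it assumes options are written with a leading dash, as on the cInChI command line.

```ruby
# Builds the Array handed to a wrapped C main(): element 0 is the
# program-name placeholder, followed by file names and options.
def main_style_argv(input, output = nil, options = [])
  argv = ['', input]
  argv << output if output
  argv.concat(options.map { |o| o.start_with?('-') ? o : "-#{o}" })
end

p main_style_argv('test.mol')
# => ["", "test.mol"]
p main_style_argv('temp.txt', 'benzene.mol', ['OutputSDF'])
# => ["", "temp.txt", "benzene.mol", "-OutputSDF"]
```

The result of the first call is exactly the Array passed to Libinchi.run in the test above.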

Customize the Library

Have a look at the libinchi_wrap.c file that SWIG created. At the bottom of this file should be a function called Init_libinchi:

SWIGEXPORT(void) Init_libinchi(void) {
  int i;

  SWIG_InitRuntime();
  mLibinchi = rb_define_module("Libinchi");

  for (i = 0; swig_types_initial[i]; i++) {
    swig_types[i] = SWIG_TypeRegister(swig_types_initial[i]);
    SWIG_define_class(swig_types[i]);
  }

  rb_define_module_function(mLibinchi, "run", _wrap_run, -1);
}

This is what Ruby uses to map C functions to Ruby modules, classes, and methods. In this case, the C run method is being mapped to a module called Libinchi which has a run method.

Let's say that you'd prefer a module name of InChI with a method called write_inchi. The following changes to Init_libinchi will accomplish this:

SWIGEXPORT(void) Init_libinchi(void) {
  int i;

  SWIG_InitRuntime();
  mLibinchi = rb_define_module("InChI");

  for (i = 0; swig_types_initial[i]; i++) {
    swig_types[i] = SWIG_TypeRegister(swig_types_initial[i]);
    SWIG_define_class(swig_types[i]);
  }

  rb_define_module_function(mLibinchi, "write_inchi", _wrap_run, -1);
}

Run make again. Now the following can be used to write the InChI information for test.mol:

require 'libinchi'

InChI.write_inchi(['', 'test.mol'])
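One thing to keep in mind is that libinchi_wrap.c is regenerated every time swig is run, so hand edits to it are lost. An alternative, sketched below, is to leave the generated code alone and do the renaming in a thin Ruby wrapper. The Libinchi module defined here is a stub so the example runs on its own; with the real extension it would be replaced by require 'libinchi'.

```ruby
# Stand-in for the compiled extension (hypothetical stub): it simply
# returns the argument Array it was given instead of running InChI.
module Libinchi
  def self.run(argv)
    argv
  end
end

# Friendlier Ruby-side names layered on top of the generated module.
module InChI
  # Writes InChI output for the given molfile by delegating to the
  # SWIG-generated entry point, supplying the blank program name.
  def self.write_inchi(molfile_path)
    Libinchi.run(['', molfile_path])
  end
end

p InChI.write_inchi('test.mol') # => ["", "test.mol"]
```

The wrapper also hides the blank first argument from callers, which the C-level rename does not.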

Summing Up

SWIG simplifies the job of connecting high-level languages like Ruby to C/C++ libraries. Although not illustrated in the simple example above, SWIG offers several advanced tools for creating rich library interfaces. Given the large amount of chemical informatics software written in C/C++, and the increasing interest by developers in scripting languages such as Ruby, the SWIG approach is likely to be broadly useful in several areas of chemical informatics integration.

The C InChI toolkit appears in a few other Open Source projects including Open Babel, the Chemistry Development Kit via the JNI InChI Wrapper, and Rino. To my knowledge, none use SWIG. This will soon change as the approach described here becomes incorporated into Rino.

On a more general note, the availability of the InChI source code under an Open Source license is essential to developing and distributing the kind of integration library discussed here. We can only hope that others working in chemical informatics see the wisdom in a system that creates healthy software ecosystems wherever it takes hold.

The Automatic Encoding of Chemical Structures

Posted by Rich Apodaca Tue, 05 Sep 2006 13:46:00 GMT

No advantages accrue to the chemist from knowing how to generate and how to interpret a chemical code. Codes are needed only for the mechanical manipulation of chemical structures. Clearly then, if the coding of chemical compounds could be accomplished automatically, this automatic conversion would relieve the chemist of a considerable burden.

-Alfred Feldman et al. J. Chem. Doc. 1963, 3, 187-189

The success of any new molecular encoding method relies, in part, on its invisibility to its prospective users. After all, why should anyone bother to learn yet another molecular language, especially one designed with computers in mind? Yet these encoding systems are critical in connecting chemical information and information technologies. How can any new encoding method be made part of existing workflows invisibly?

Feldman and his group at Walter Reed faced a similar problem in the early 1960s. American Cyanamid had been using a modified typewriter to prepare attractive 2-D chemical structures, purely for human consumption. Feldman's idea was to modify the typewriter design still further such that a computer-usable molecular code would be recorded as a byproduct of preparing the structure diagram. The typist could remain blissfully unaware of the mechanical magic beneath, and get on with his or her job. The idea was later adapted by Shell to produce a more cost-effective device.

The structure editor has long since replaced the chemical typewriter. But the same forces are at work with today's new molecular encoding methods, especially InChI. To what extent are scientists themselves being given the tools to leverage these new technologies, without having to become aware of them? What will these new tools look like and how will they differ from what came before?

Hacking NMRShiftDB

Posted by Rich Apodaca Mon, 04 Sep 2006 13:28:00 GMT

NMRShiftDB is an open web database of peer-reviewed NMR chemical shifts compiled by volunteers. As of this writing, it contains 22,429 measured spectra from 18,986 structures, and reports 927 registered users. The database code itself is open source.

Although NMRShiftDB has a web interface, its architecture is designed to simplify writing programs that use it. A previous article showed how a working PubChem API could be written with just a few lines of Ruby. This time, I'll show how the same thing can be done for NMRShiftDB.

Ingredients

This tutorial uses Arton's excellent Ruby Java Bridge, the installation and use of which has been previously discussed. Also used is Ruby's InChI interface, Rino, for which installation instructions are here.

Create a working directory called nmr. Into this directory, copy cdk-20060714.jar, which can be downloaded here.

Code

Create a file called nmr.rb containing the following Ruby code:

require 'net/http'
require 'smi2inchi'

# A very simple NMRShiftDB Web API.
class NMRFetcher

  # Creates a <tt>Translator</tt> instance.
  def initialize
    @translator = Translator.new
  end

  # Returns an XML record, as a string, for the molecule
  # with SMILES matching <tt>smiles</tt> and spectrum type
  # matching <tt>spectrumtype</tt> (13C, 1H, 15N and 31P).
  def get_record(smiles, spectrumtype)
    body = nil
    inchi = (smi2inchi(smiles)).gsub('InChI=', 'inchi=')
    path = '/NmrshiftdbServlet?nmrshiftdbaction=exportcmlbyinchi&' + inchi + '&spectrumtype=' + spectrumtype

    Net::HTTP.start('nmrshiftdb.ice.mpg.de') do |http|
      response = http.get(path)
      body = response.body
    end

    if !valid_record?(body)
      return nil
    end

    body
  end

private

  def valid_record?(body)
    !body.eql?('No such molecule or spectrum')
  end

  def smi2inchi(smiles)
    @translator.translate(smiles)
  end
end

The magic in the above code is nothing more than a simple HTTP request sent to nmrshiftdb.ice.mpg.de, contained in the get_record method. This request encodes an InChI identifier, which is generated from the SMILES string passed as an argument. We also specify a spectrum type.
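The request path can also be assembled with Ruby's URI module, which escapes each parameter value. This is a sketch rather than a tested replacement for get_record: the method name nmrshiftdb_path is hypothetical, and whether the servlet accepts percent-encoded slashes in the inchi parameter is an assumption I have not verified. The parameter names and spectrum types are the ones used in the code above.

```ruby
require 'uri'

# Spectrum types listed in the NMRFetcher documentation comment.
SPECTRUM_TYPES = %w[13C 1H 15N 31P].freeze

# Builds the servlet path for an InChI and a spectrum type.
def nmrshiftdb_path(inchi, spectrum_type)
  unless SPECTRUM_TYPES.include?(spectrum_type)
    raise ArgumentError, "unknown spectrum type: #{spectrum_type}"
  end

  query = URI.encode_www_form(
    'nmrshiftdbaction' => 'exportcmlbyinchi',
    'inchi' => inchi.sub('InChI=', ''), # same value the gsub above produces
    'spectrumtype' => spectrum_type
  )
  "/NmrshiftdbServlet?#{query}"
end

puts nmrshiftdb_path('InChI=1/C6H6/c1-2-4-6-5-3-1/h1-6H', '13C')
```

Rejecting unknown spectrum types before the request is sent saves a round trip to the server for a simple typo.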

Now create a file called smi2inchi.rb, containing the following Ruby code:

ENV['CLASSPATH'] = './cdk-20060714.jar'
require 'rubygems'
require_gem 'rjb'
require_gem 'rino'
require 'rjb'

StringWriter = Rjb::import 'java.io.StringWriter'

SmilesParser = Rjb::import 'org.openscience.cdk.smiles.SmilesParser'
MDLWriter = Rjb::import 'org.openscience.cdk.io.MDLWriter'

# Converts a SMILES string into an InChI identifier using
# the CDK Library (Java) and the Rino Library (Ruby/C).
class Translator

  def initialize
    @smiles_parser = SmilesParser.new
    @mdl_writer = MDLWriter.new
    @mol2inchi = Rino::MolfileReader.new
  end

  # Returns an InChI identifier from the specified SMILES string.
  # Uses the CDK classes SmilesParser and MDLWriter to generate
  # a molfile from a SMILES string. Then this molfile is
  # parsed by Rino::MolfileReader.
  def translate(smiles)
    mol = @smiles_parser.parseSmiles(smiles)

    sw = StringWriter.new

    @mdl_writer.setWriter(sw)
    @mdl_writer.write(mol)

    @mol2inchi.read(sw.toString)
  end
end

The description and use of this code was discussed in a recent article on generating InChI identifiers from SMILES strings.

Before using the code we've just created you'll need to set the LD_LIBRARY_PATH (or equivalent) to point to the native Java libraries. On Linux with Sun's JDK, this is done from the command line with:

$ export LD_LIBRARY_PATH=$JAVA_HOME/jre/lib/i386:$LD_LIBRARY_PATH

Using the NMRFetcher class is just a matter of creating an instance, and invoking get_record with the desired SMILES string and spectrum type (1H, 13C). Doing so returns a CML document containing the structure of the compound and its spectrum. If no record matches, the method returns nil. The code below gives an example in which the CML output is pretty-printed using the wonderful Ruby API for XML, REXML:

require "rexml/document"
require 'nmr'

nmr = NMRFetcher.new
smiles = 'c1ccccc1' #benzene, to keep things simple
type = '13C'
record = nmr.get_record(smiles, type)

if record #pretty-print the CML record using REXML
  file = File.new('result.xml', 'w')

  (REXML::Document.new(record)).write(file, 0)

  file.close
else #write an error
  File.open('result.error', 'w') do |file|
    file << 'No record of SMILES: ' + smiles
  end
end

The above code can be put into a file (test.rb) and run:

$ ruby test.rb

Alternatively, it can be entered interactively and played with using irb:

$ irb
irb(main):001:0>

Output

The program produces the following Chemical Markup Language output in a file called result.xml:

<cml>
  <molecule title='Benzene' id='nmrshiftdb7901' xmlns='http://www.xml-cml.org/schema/cml2/core'>
    <atomArray xmlns='http://www.xml-cml.org/schema'>
      <atom elementType='C' y2='0.7625' x2='-1.4063' id='a1' formalCharge='0' hydrogenCount='0'/>
      <atom elementType='C' y2='0.35' x2='-2.1207' id='a2' formalCharge='0' hydrogenCount='0'/>
      <atom elementType='C' y2='-0.475' x2='-2.1207' id='a3' formalCharge='0' hydrogenCount='0'/>
      <atom elementType='C' y2='-0.8875' x2='-1.4063' id='a4' formalCharge='0' hydrogenCount='0'/>
      <atom elementType='C' y2='-0.475' x2='-0.6918' id='a5' formalCharge='0' hydrogenCount='0'/>
      <atom elementType='C' y2='0.35' x2='-0.6918' id='a6' formalCharge='0' hydrogenCount='0'/>
    </atomArray>
    <bondArray xmlns='http://www.xml-cml.org/schema'>
      <bond atomRefs2='a1 a2' order='S' id='b1'/>
      <bond atomRefs2='a2 a3' order='D' id='b2'/>
      <bond atomRefs2='a3 a4' order='S' id='b3'/>
      <bond atomRefs2='a4 a5' order='D' id='b4'/>
      <bond atomRefs2='a5 a6' order='S' id='b5'/>
      <bond atomRefs2='a1 a6' order='D' id='b6'/>
    </bondArray>
  </molecule>
  <spectrum moleculeRef='nmrshiftdb7901' xmlns:cml='http://www.xml-cml.org/dict/cml' xmlns:cmlDict='http://www.xml-cml.org/dict/cmlDict' xmlns:siUnits='http://www.xml-cml.org/units/siUnits' type='NMR' xmlns:macie='http://www.xml-cml.org/dict/macie' xmlns:units='http://www.xml-cml.org/units/units' id='nmrshiftdb15502' xmlns:subst='http://www.xml-cml.org/dict/substDict' xmlns:nmr='http://www.nmrshiftdb.org/dict' xmlns='http://www.xml-cml.org/schema/cml2/spect'>
    <conditionList xmlns='http://www.xml-cml.org/schema'>
      <scalar dataType='xsd:string' units='siUnits:k' dictRef='cml:temp'>298</scalar>
      <scalar dataType='xsd:string' units='siUnits:hertz' dictRef='cml:field'>Unreported</scalar>
    </conditionList>
    <metadataList xmlns='http://www.xml-cml.org/schema'>
      <metadata name='nmr:OBSERVENUCLEUS' content='13C'/>
    </metadataList>
    <peakList xmlns='http://www.xml-cml.org/schema'>
      <peak xUnits='units:ppm' peakShape='sharp' xValue='128.5' id='p0' atomRefs='a1 a2 a3 a4 a5 a6'/>
    </peakList>
  </spectrum>
</cml>

The kind of output produced by NMRFetcher and NMRShiftDB could be used in a variety of ways. Notice, near the bottom of the document, how peak assignments are made relative to the atom labels in the molecule declaration. It should be possible, for example, to create interactive 2-D structure diagrams from this document in which a user mouses over an atom and gets a C-13 chemical shift.
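As a first step toward that kind of mouse-over display, the peak assignments can be turned into a lookup table keyed by atom id. The sketch below uses REXML and walks the element tree by hand instead of using XPath, to sidestep the default namespaces in the CML. The method names each_element_in and shifts_by_atom are hypothetical, and the CML fragment is a trimmed copy of the output above.

```ruby
require 'rexml/document'

# Visits an element and all of its descendant elements.
def each_element_in(element, &block)
  yield element
  element.each do |child|
    each_element_in(child, &block) if child.is_a?(REXML::Element)
  end
end

# Returns a Hash mapping atom ids to the shifts (ppm) assigned to them.
def shifts_by_atom(cml)
  shifts = Hash.new { |hash, key| hash[key] = [] }
  each_element_in(REXML::Document.new(cml).root) do |element|
    next unless element.name == 'peak'
    value = element.attributes['xValue'].to_f
    element.attributes['atomRefs'].to_s.split.each do |atom_id|
      shifts[atom_id] << value
    end
  end
  shifts
end

cml = <<~CML
  <cml>
    <spectrum xmlns='http://www.xml-cml.org/schema/cml2/spect'>
      <peakList xmlns='http://www.xml-cml.org/schema'>
        <peak xUnits='units:ppm' xValue='128.5' id='p0' atomRefs='a1 a2 a3'/>
      </peakList>
    </spectrum>
  </cml>
CML

shifts = shifts_by_atom(cml)
shifts.each { |atom, values| puts "#{atom}: #{values.join(', ')} ppm" }
```

Each atom id in the table matches an id in the atomArray, so the same key can be used to find the atom's 2-D position.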

NMRShiftDB is a valuable and free online resource for NMR spectroscopy. Programmatically mixing its capabilities with free software and other online services offers numerous opportunities to build innovative chemical informatics systems.
