From SMILES to InChI: Rino, CDK, and Ruby Java Bridge

Posted by Rich Apodaca Sat, 26 Aug 2006 19:37:00 GMT

Integrating Ruby and Java is fast and easy with Ruby Java Bridge (RJB), which was discussed previously. In this article, I'll show how RJB can be used to solve a practical chemical informatics problem - the conversion of SMILES strings into InChI identifiers.

Prerequisites

This tutorial is aimed at Linux users, but you should be able to accomplish the same thing in Windows and Mac OS X, although these systems have not been tested. You'll need to install a few software packages if you haven't done so already: Ruby; Ruby Gems; RJB; CDK; and Rino. After installing RubyGems, RJB and Rino can both be installed from the command line (as root):

# gem install rjb
# gem install rino

Next, create a working directory, smi2inchi. Into this directory, move a copy of the full CDK-2006714 jarfile. That's it for libraries, so let's move onto the translator itself.

The Translator

The Translator class consists of a small piece of Ruby code gluing CDK's SmilesParser and MDLWriter with the Ruby InChI library Rino. Rino is a thin Ruby wrapper around the IUPAC InChI library, which is in turn written in C.

ENV['CLASSPATH'] = './cdk-20060714.jar'

require 'rubygems'
require_gem 'rjb'
require_gem 'rino'
require 'rjb'

StringWriter = Rjb::import 'java.io.StringWriter'

SmilesParser = Rjb::import 'org.openscience.cdk.smiles.SmilesParser'
MDLWriter = Rjb::import 'org.openscience.cdk.io.MDLWriter'

# Converts a SMILES string into an InChI identifier using
# the CDK Library (Java) and the Rino Library (Ruby/C).
class Translator

  def initialize
    @smiles_parser = SmilesParser.new
    @mdl_writer = MDLWriter.new
    @mol2inchi = Rino::MolfileReader.new
  end

  # Returns an InChI identifier from the specified SMILES string.
  # Uses the CDK classes SmilesParser and MDLWriter to generate
  # a molfile from a SMILES string. Then this molfile is
  # parsed by Rino::MolfileReader.
  def translate(smiles)
    mol = @smiles_parser.parseSmiles(smiles)

    sw = StringWriter.new

    @mdl_writer.setWriter(sw)
    @mdl_writer.write(mol)

    @mol2inchi.read(sw.toString)
  end
end
Add the above code to a file called smi2inchi.rb. The first line points the CLASSPATH environment variable, which is needed by RJB, to the CDK library. Lines 3-6 include the RJB and Rino RubyGems. Lines 8-11 import the built-in Java class StringWriter and the CDK Java classes SmilesParser and MDLWriter using RJB's syntax. The core of the class consists of the translate method, which simply coordinates the pieces. Using the Translator class consists of creating an instance, and invoking its translate method on a SMILES string:
require 'smi2inchi'

translator = Translator.new
inchi = translator.translate 'c1ccccc1'

p inchi # => "InChI=1/C6H6/c1-2-4-6-5-3-1/h1-6H"
The above code fragment can be saved to a text file (e.g. test.rb) and invoked with the Ruby interpreter:
$ ruby test.rb

Alternatively, it can be entered interactively with the Interactive Ruby Interpreter (irb):

$ irb
irb(main):001:0>

With just a few lines of Ruby, we've solved a real problem. This example integrates software from three different programming languages: Ruby, C, and Java. Given the variety of chemical informatics software written in these languages, Ruby Java Bridge offers numerous integration possibilities.

A First Look at Modular Chemical Descriptor Language (MCDL)

Posted by Rich Apodaca Sun, 20 Aug 2006 00:04:00 GMT

The Modular Chemical Descriptor Language (MCDL) was developed to address the need for linear representation of structural and other chemical information for chemical databases, E-journals and the Internet.

-Andrei A. Gakh and Michael N. Burnett J. Chem. Inf. Comput. Sci. 2001, 41, 1494-1499

Molecular line notations reduce a molecular structure to a string of ASCII characters. This is helpful in a variety of situations: as a method of text-based structure input; as a compact representation that can be stored and transmitted over a network; and in some cases as a method for uniquely identifying a molecular structure. The development of line notations is one of the oldest pursuits in chemical informatics.

MCDL has a lot in common with InChI. Both languages are modular in the sense that succeeding levels of structural complexity are represented by individual “modules” (MCDL) or “layers” (InChI): constitution; connectivity; and stereochemistry. Both languages sport free developer toolkits written in C (the InChI toolkit, and LINDES, respectively). Interactive structure-drawing tools even exist for both languages (the interactive MCDL tool was recently released).

MCDL and InChI also differ in some significant ways. One of the biggest differences is that InChI separates hydrogen atoms and their parent atoms into separate layers, whereas MCDL places hydrogen atoms together with the atom to which they are attached. Another difference is in the approach to canonicalization. InChI uses a relatively complex system not unlike that of canonical SMILES. In contrast, MCDL uses a simpler system based on ASCII lexical ordering of atom types. On the non-technical side, InChI carries the endorsement of IUPAC, whereas MCDL is the work of independent developers.

MCDL and InChI approach the problem of developing an internet-ready line notation from different angles. It will be interesting to see how each evolves.

107 Years of Line-Formula Notations (1861-1968)

Posted by Rich Apodaca Fri, 18 Aug 2006 06:50:00 GMT

L E5 B666 FVTJ A1 E1 OQ

Thus, within the short period of just seven years after the birth of structural chemistry in 1861, virtually all of the main ideas relating to line-formula descriptions were conceived and published. No basically new practices appeared for some 79 years. Then, within an identically brief period of just seven years (1947-1954), virtually all of the fundamental features of structure-delineating chemical notations appeared in the international chemical literature.

-William J. Wiswesser J. Chem. Doc. 1968, 8, 146-150

Apparently, advances in chemical line notations have a history of occurring in clusters. Perhaps the development of InChI will spawn a renaissance in the development and use of line notations. Is there room (or need) for multiple line notation languages, each filling a particular niche, or can a universal line notation ever be developed? Will currently popular line notations such as SMILES and InChI seem as cumbersome in 30 years as Wiswesser Line Notation does today?

Ruby and InChI: The Rino Library

Posted by Rich Apodaca Thu, 17 Aug 2006 11:02:00 GMT

Rino is a Ruby libraryfor generating InChI identifiers. Currently, molfile input is supported. The following code illustrates basic usage.

reader = Rino::MolfileReader.new
inchi = reader.read(IO.Read('some_molfile.mol'))

puts "The identifier is: #{inchi}"

RDoc documentation can be viewed online here.

Rino was implemented as an extension of the IUPAC InChI library, which is written in C. To keep the extension as simple as possible, Rino's point of entry into the IUPAC library is the inchi_main main method. File input is provided with temporary files that are deleted when a MolfileReader is garbage collected. This has the advantage that all options available to the IUPAC InChI console application are also available from within Rino. For example, activating compressed InChI output can be done by simply appending the option to a MolfileReader's options array.

reader.options << '-Compress'

Rino is available as both a Ruby Gem and as a source package from RubyForge. If you're new to Ruby, the RubyGems package is recommended. After installing RubyGems, Rino can be installed with the following shell command (executed as root):

gem install rino

To load Rino, simply execute the following code prior to creating a MolfileReader.

require 'rubygems'
require_gem 'rino'

The use of the Chemruby library in combination with Rino offers some interesting possibilites for chemical infomatics with Ruby.

InChI Canonicalization Algorithm

Posted by Rich Apodaca Sat, 12 Aug 2006 21:38:00 GMT

The InChI canonicalization algorithm uniquely numbers the atoms of a molecule. To date, the only implementation is found in the C source code of the InChI software. To enable new InChI implementations, for example in other programming languges, the complete canonicalization procedure is needed. Although it has not been published formally, the information exists in two messages posted to the inchi-discuss mailing list by Dmitrii Tchekhovskoi. To make this information more accessible, these messages have been compiled and re-formatted. The resulting document applies to v1.0 of the IUPAC InChI software. The following article refers to the InChI Technical Manual, which can be downloaded from SourceForge.

Background

Below is a general brief description of the InChI canonicalization algorithm. I did not dare to include canonicalization steps involved in the treatment of mobile hydrogens; included is only a brief Note on how it is implemented.

The order of minimization may be found in the section IV.e Canonicalization, InChI Tech. Of course all details are in the code.

The minimization itself is a highly technical and boring issue; an essential part of it is built on the well-known B. D. McKay algorithm (ref. 5 in the InChI Tech. Man.) which itself is not an easy to read text. Given our limited resources, we decided to postpone the detailed documentation of the canonicalization. However, the InChI code is freely available and the references to the corresponding variables in the code are given in the Figure 30, InChI Tech. Man.

Here is a very brief description (leaving aside mobile H treatment and almost all technical details; almost all numerical examples below refer to 2-chlorobutane).

Major Step A: hydrogenless constitution.

  1. The atoms (ignoring terminal hydrogens) that are vertices in the "molecular graph" are given numerical "colors" in this order of precedence: (a) Ordering number in the sequence: C, other atoms in alphabetic order, bridging H. In case of C4H9Cl all C will be given 1, Cl will be given 2 (b) Number of connections (number of bonds). In 2-chlorobutane CH3CH2CH(Cl)CH3 these are (in brackets): C[1]C[2]C3C[1] The resultant "ordered lists of colors" presented in order of the atoms in the semistructural formula CH3CH2CH(Cl)CH3 are: C: 1, 1 C: 1, 2 C: 1, 3 Cl: 2, 1 C 1, 1

  2. Atoms are assigned new colors according to lexicographical comparison of the "color lists", in ascending order [for example, (1,1) < (1,2) < (2,1); (1, 2) < (1, 2, 1)]

    C: 1, 1 => 2 C: 1, 2 => 3 C: 1, 3 => 4 Cl: 2, 1 => 5 C 1, 1 => 2

    You may notice here an unimportant detail: each color is equal to the number of atoms that have this or smaller color.

  3. Atoms are assigned new "ordered lists of colors": the first in the list is the color of the atom, the rest are sorted in ascending order colors of other atoms, connected to this atom:

    C: 2, 3 C: 3, 2, 3 C: 4, 2, 3, 5 Cl: 5, 4 C 2, 4

  4. Atoms are assigned new colors according to lexicographical comparison of the "color lists", in ascending order

    C: 2, 3 => 1 C: 3, 2, 3 => 3 C: 4, 2, 3, 5 => 4 Cl: 5, 4 => 5 C 2, 4 => 2

  5. Steps 3-4 are repeated until all new colors are different or no more changes occur (for 2-chlorobutane the colors - canonical numbers - have already been found)

    The resultant colors produce a so called equitable partition, in a way which is conceptually almost same as the intermediate result of the SMILES-2 algorithm "SMILES. 2. Algorithm for Generation of Unique SMILES Notation" by Weininger, D.; Weininger, A.; Weininger, J.L.; JCICS Vol. 29, pp. 97-101, 1989.

  6. If some of the colors are still identical, then the smallest is picked up and reduced to the previous color + 1. For example, if colors are (this example does not refer to 2-chlorobutane): 1,2,5,5,5,7,7 then the smallest duplicated color is 5, the previous color is 2. A color of one of the colored-5-atoms will be reduced from 5 to 2+1=3.

  7. Repeat steps 3-6 until all colors become different (this is almost same as obtaining the final result of the SMILES-2 algorithm) and save the "connection table". To make the reading easier, the process of obtaining this table (actually, a list of number) is split into 3 steps. (a)The connection table is made out of segments, ordered in ascending order of the color of the first atom in a segment. The number of the segments is the number of atoms. Each segment starts with the color of an atom and is followed by a colon and a sorted list of the colors of atoms, connected to it: 1:3; 2:4; 3:1,4; 4:2,3,5; 5:4; (b) Since this connection table contains each connection 2 times (for example, the bond between atoms of color 1 and 3 is in the segments "1:3" and "3:1"), it is rewritten by excluding colors that are greater than the first color in the segment: 1; 2; 3:1; 4:2,3; 5:4; (c) The delimiters now are redundant because the members of each segment are always smaller than the first member of the segment. This is the final connection table to be saved and used later: 1, 2, 3, 1, 4, 2, 3, 5, 4

  8. There could be a great deal of arbitrariness in choosing the atom whose color was to be reduced at step 6 (in the example, 3 atoms have color 5; each of them could be chosen). Therefore, repeat step 7 for all possible sequences of choosing the atoms whose color is reduced. Lexicographically compare each obtained connection table to the previously saved and keep the smallest one together with the assignment of the colors to the atoms. These colors are the canonical numbers for the hydrogenless structure. If two connection tables are identical then atoms that have same colors in two connection tables belong to the same equivalence class; this information is saved and used. The equivalence class is the smallest color in the equivalence group.

    (You may find this approach in, for example, "A Computer-Oriented Linear Canonical Notational System for the Representation of Organic Structures with Stereochemistry" by Agarwal, K.K.; Gelernter, H.L.; JCICS v.34, pp.463-479, 1994. However, the implemented in InChI algorithm from Ref 5 allows to avoid a combinatorial explosion in typical chemical structures, obtain equivalence classes, and even find the order of the permutation group and its generators)

    So far we got a canonical numbering (colors) for a hydrogenless structure and the canonical equivalence classes (=the smallest color in each set of equivalent atoms).

  9. Make new colors out of the canonical equivalence classes and repeat steps 3-8 if these colors are different from the colors previously used at Step 3. Obtain the new minimal connection table. Use these classes as initial colors in the next steps (If equivalence classes are, for example, 1, 1, 1, 4, 4, 5, 5, 5 then the corresponding colors are 3, 3, 3, 5, 5, 8, 8, 8)

Major Step B. Add hydrogen atoms to the structure

[A Forward Note: in the first reading you may want to skip it. In case of mobile H the steps are somewhat different, namely: (m-a)Add a list of only those H that are not mobile (similar to B.1 below) and minimize both the connection table (it will be same) and the list. (m-b)Add mobile groups as pseudoatoms connected by directed edges (it means that these pseudoatoms are not included in the connection table segments of the real atoms) to the atoms where the mobile H and possibly negative charges may reside and canonicalize this structure. Number of H and (-) in the groups are in one more list to minimize. The result is the Mobile H canonical numbering and the corresponding equivalence classes, including equivalence classes of the mobile H (and possibly negative charge) groups. Mobile groups that have only negative charges are not included in this process. (m-c) Add isotopic list (see Major Step C below) to the number of lists to be minimized. Do not include in it the exchangeable isotopic atoms H. The result of the minimization is the Mobile H canonical numbering and equivalence classes for the isotopic structure. (f-a) For the fixed mobile H (FixedH option) start with the results of (m-a) and add a list of the fixed positions of the mobile H (colors of the atoms where these H reside) and numbers of these atoms H. The result of the minimization is the Fixed-H canonical numbering and equivalence classes. (f-b) Add isotopic list (see Major Step C below). The minimization result is the Fixed-H canonical numbering and equivalence classes for the isotopic Fixed-H structure.]

Use previously obtained equivalence classes at Step A.9 and use the previously obtained minimal connection table for the comparison. Run Steps A.3-8 with the following difference: each time the connection tables are compared at Step A.8, in case of identical connection tables also compare the list of terminal atoms H in the following form: 1, numberof_H(1), 2, numberofH(2), ...n, numberof_H(n) where number_of_H(c) is the number of terminal atoms H attached to the atom that has color c; n = number of atoms. Save the found this way minimal list of the terminal atoms together with the assignment of the colors to the atoms. Also obtain the equivalence classes as it was done earlier.

We now have the canonical colors (numbering) of the non-isotopic non-tautomeric structure.

Major Step C. Add isotopic composition to the structure

If the structure is isotopic then add one more list to compare if the connection tables and the lists of terminal atoms H are same: 1, isoweight(1), 2, isoweight(2), ...n, iso_weight(n)

where iso_weight(c) is the "isotopic weight" of the atom to which the color c was assigned. For each atom the isotopic weight is calculated according to the formula:

iso_weight=nH1 + 32(nH2 + 32(nH3 + 32*shift))

nH1 = number of terminal atoms of protium attached to the atom nH2 = number of terminal atoms of deuterium attached to the atom nH2 = number of terminal atoms of tritium attached to the atom shift = [(integral) mass of the isotopic atom] - [rounded average atomic mass]

Note: hydrogen H is treated differently from it isotope protium: H has "natural" isotopic composition while protium is treated as an isotopic atom

In case of a not isotopic atom the shift = 0 by definition. If the atom is isotopic and its mass number is greater or equal to the rounded average atomic mass (that is, shift is not negative) then the shift is incremented to avoid shift=0 for isotopes.

If the formula produces iso_weight=0 (the atom and the attached H are not isotopic) then isoweight(c) is set equal to LONGMAX from include file limits.h (for a 32-bit systems it is usually 2,147,483,647 - greater than any iso_weight). This forces isotopic atoms assume the least possible canonical numbers.

Repeat Major Step B, adding the list of isotopic weights to those already minimized.

At this point we are finished with the modified B.D.McKay"s algorithm.

It should be pointed out that for the sake of simplicity, avoiding dependence on the hardware or operating system, and possibility to reproduce the results "by hand", the efficiency of the original B.D.McKay"s algorithm has been reduced. The greatest impact is due to abandoning hashing for the connection table comparison and introducing additional to the connection table lists to be minimized. Also the implemented algorithm for calculating the equitable partition from the given colors is less effective than the one suggested in Ref. 5. All further improvements introduced by B.D.McKay after publishing Ref. 5 in his famous Nauty program are not implemented in the InChI code.

Major Step D. Stereochemistry

  1. For the found canonical colors (numbers) calculate double bond (>X=Y<) and cumulene (>W=X=Y=Z<) parities: For each atom at the ends of the double bond or cumulene find connected to it by a single bond atom that has larger canonical number. If these two found atoms are in "cis" positions then the parity is (-), otherwise the parity is (+) Save parities list: c1[1],c2[1],p[1],c1[2],c2[2],p[2],...,c1[n1],c2[n1],p[n1] arranged in ascending order of (c1[i],c2[i]) pairs where n1=number of possibly stereogenic double bonds and cumulenes c1[i]>c2[i] are colors of the atoms at the end of a double bond or cumulene p[i] is the parity ("u" > "?" > "+" > "-") Order: let a1>a2 and b1>b2 be the colors of the atoms for two double bonds, (a1,a2) and (b1,b2). (a1,a2) > (b1,b2) if and only if ((a1 > b1) || (a1==b1 && a1 > b1)) (here the C programming language notation is used)

  2. For each allene >X=Y=Z< consider a tetrahedron that has as its apices the four atoms connected by single bonds to the allene atoms X and Z. If you look at other apices from the apex that has the smallest canonical number and see canonical numbers of these three other apices arranged in ascending order clockwise then the parity is (+), otherwise it is (-). Save parities list: c[1],p[1],c[2],p[2],...,c[n2],p[n2] arranged in ascending order of c[i] where n2=number of possibly stereogenic allenes c[i] are the colors of atoms Y p[i] are the parities

  3. For each possibly stereogenic atom consider a tetrahedron that has as its apices the four atoms connected this possibly stereogenic atom. If you look at other apices from the apex that has the smallest canonical number and see canonical numbers of these three other apices arranged in ascending order clockwise then the parity is (+), otherwise it is (-). Save parities list: c[1],p[1],c[2],p[2],...,c[n3],p[n3] arranged in ascending order of c[i] where n3=number of possibly stereogenic atoms c[i] are the colors of the atoms p[i] are the parities

    Note. Terminal hydrogen atoms do not have colors (canonical numbers). In parity calculations, hydrogen atoms are assumed to have colors less than the smallest color of other atoms, that is, less than 1. The values of their colors c are assumed to be: c[H] < c[protium] < c[deuterium] < c[tritium] < 1 In a special case of all four connected to the same atom the atom is not stereogenic. In case of a tetrahedral atom that has only 3 bonds (for example, >S=O or >N-) the direction of the lone electron pair is used as one more bond; c[lone pair] < c[H].

  4. Repeat steps 1-3 for all other mappings of the canonical numbers on the atoms that produce same results as in Major Step B or C and find the mapping(s) that produce the lexicographically smallest result in this order of the lists: D.1, D.2, D.3. To each result apply a heuristic to detect possibly stereogenic elements that in reality are not stereogenic; if such elements have been found then remove their parities and repeat D.1-4.

  5. Repeat steps 1-4 for the spatially inverted structure. Accept the one that has smaller stereo (D.1 stereo should be same). Set "inverted" flag if the inverted stereo was selected.

    This description refers to a single component.

Older posts: 1 ... 6 7 8