Decoding InChIs: An Introduction to Ninja

Posted by Rich Apodaca Fri, 12 Jan 2007 01:36:00 GMT

InChI is both a molecular identifier and a molecular language. As the use of InChIs spreads, there will be an increasing need to convert InChIs to molecular structures. In this article, I'll introduce a software package called "Ninja" that can serve as a foundation for writing InChI parsers in a variety of toolkits and programming languages.

About Ninja

Ninja is a low-level Java toolkit for parsing InChIs. Its main purpose is to break an InChI into a set of tokens that are then assigned meaning consistent with the InChI specification. Ninja is intended as a platform on which full-fledged InChI parsers can be built. As such, it is both small and portable. Ninja was developed with Sun's JDK-1.4, although earlier versions should also work. The Ninja project is hosted on SourceForge, from which the complete source can be downloaded.
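
To give a feel for what "breaking an InChI into tokens" involves, here is a minimal sketch in Ruby. It is not Ninja's API (Ninja is written in Java); the function name tokenize_inchi and the shape of the returned hash are made up for illustration. It also assumes a single-component InChI with a formula layer followed by layers that each begin with a one-letter prefix.

```ruby
# Hypothetical sketch, not Ninja's API: split an InChI into its version,
# its formula, and a list of [prefix, text] layer tokens.
def tokenize_inchi(inchi)
  body = inchi.sub(/\AInChI=/, '')
  version, formula, *rest = body.split('/')
  # Each remaining layer starts with a one-letter prefix such as 'c' or 'h'.
  layers = rest.map { |layer| [layer[0, 1], layer[1..-1]] }
  { version: version, formula: formula, layers: layers }
end

tokens = tokenize_inchi('InChI=1/C6H6/c1-2-4-6-5-3-1/h1-6H')
puts tokens[:version]   # prints 1
puts tokens[:formula]   # prints C6H6
tokens[:layers].each { |prefix, text| puts "#{prefix}: #{text}" }
# prints c: 1-2-4-6-5-3-1
# prints h: 1-6H
```

A real parser then has to assign meaning to each token, which is where most of the work lies.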

Printing an InChI Report

Ninja can print a descriptive summary of any InChI from the command line. Because java -jar ignores the classpath, the jarfile must be named explicitly. For example, with a copy of the Ninja jarfile (lib/ninja-0.1.4.jar) in your working directory, the report for benzene's InChI can be printed with:

$ java -jar ninja-0.1.4.jar "InChI=1/C6H6/c1-2-4-6-5-3-1/h1-6H"

This command produces the following output:

[Parsing InChI] InChI=1/C6H6/c1-2-4-6-5-3-1/h1-6H
----------------------------------------------------
[Version Name] 1
----------------------------------------------------
[Entity Count] 1
----------------------------------------------------
[Entity (1)]
[Main Layer]
[Formula] C6H6
[Heavy atoms] 6
[Atom(1)] label = C fixed-h-count = 1
[Atom(2)] label = C fixed-h-count = 1
[Atom(3)] label = C fixed-h-count = 1
[Atom(4)] label = C fixed-h-count = 1
[Atom(5)] label = C fixed-h-count = 1
[Atom(6)] label = C fixed-h-count = 1
[Connection table]
    1  2  3  4  5  6
 1     +  +
 2  +        +
 3  +           +
 4     +           +
 5        +        +
 6           +  +
[Mobile Hydrogen Groups] 0
[Charge Layer]
[Charge] 0
[Protonation] 0
[Stereo Layer]
[Bond Stereo Units] 0
[Atom Stereo Units] 0
[Isotopic Layer]
[Isotopes]
[Stereo Layer]
[Bond Stereo Units] 0
[Atom Stereo Units] 0
[Fixed Hydrogen Layer]

This report shows that, as expected, the benzene InChI encodes six carbon atoms, each of which has one attached fixed hydrogen. The connections among these carbons are represented by a connection table. The last half of the report refers to elements that are missing in the benzene InChI, but which are nevertheless reportable if present. The readability of this report could be enhanced, particularly for complex InChIs, by indenting items to indicate nesting.
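
The connection table and hydrogen counts in the report can be reproduced by hand from the /c and /h layers. The Ruby sketch below shows the idea under strong simplifying assumptions: parse_connections handles only unbranched connection layers (no parentheses or commas), and parse_hydrogens handles only simple hydrogen layers without mobile-hydrogen groups. Both function names are hypothetical and have nothing to do with the Ninja API.

```ruby
# Hypothetical helpers, not part of Ninja.

# "1-2-4-6-5-3-1" => sorted list of bonded atom-number pairs.
# Assumes an unbranched layer: each atom is bonded to the next one listed.
def parse_connections(c_layer)
  atoms = c_layer.split('-').map { |n| n.to_i }
  atoms.each_cons(2).map { |a, b| [a, b].sort }.sort
end

# "3-7,9,11H,8H2,1-2H3" => { atom number => attached hydrogen count }.
# Atoms accumulate until a token carrying an H is reached; that token's
# count (1 when omitted) applies to every accumulated atom.
def parse_hydrogens(h_layer)
  counts = {}
  pending = []
  h_layer.split(',').each do |token|
    match = token.match(/\A(\d+)(?:-(\d+))?(H(\d*))?\z/)
    raise ArgumentError, "unsupported token: #{token}" unless match
    first = match[1].to_i
    last = match[2] ? match[2].to_i : first
    pending.concat((first..last).to_a)
    next unless match[3] # no H yet: the count comes from a later token
    h_count = match[4].empty? ? 1 : match[4].to_i
    pending.each { |atom| counts[atom] = h_count }
    pending = []
  end
  counts
end

p parse_connections('1-2-4-6-5-3-1')
# prints [[1, 2], [1, 3], [2, 4], [3, 5], [4, 6], [5, 6]]
puts parse_hydrogens('1-6H').values.inject(:+) # prints 6
```

The six pairs correspond to the twelve "+" marks in the symmetric connection table above, and the hydrogen total matches the six hydrogens in C6H6.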

A lot about the InChI language can be understood from reading reports such as the one above. The terms used in both the reports and the Ninja API are taken directly from the InChI Technical Manual.

Conclusions

Ninja is a low-level toolkit for decoding InChIs. Although it can be used as a standalone application, its real utility is as a library. Future articles will discuss using Ninja as a library.

Anatomy of a Cheminformatics Web Application: InChIMatic

Posted by Rich Apodaca Fri, 15 Dec 2006 15:49:00 GMT

InChI is an open molecular identifier system. Although InChIs obviate the need for a central registration authority, they are complex enough that they must be generated by computer. Currently, a few desktop molecular editors can generate InChI identifiers. But wouldn't it be more convenient if this capability existed in a simple Web application that could be used from any computer - anywhere? This article describes a Web application called "InChIMatic", which does just that.

In this article, I'll show how Java Molecular Editor (JME), a lightweight 2-D structure editor, can be extended to produce InChI identifiers through server-side software written in Ruby, rather than by extending the applet with Java code.

Downloads and Prerequisites

InChIMatic requires Ruby on Rails and the Rino InChI toolkit. Both of these libraries can be installed using the RubyGems packaging system.

The complete InChIMatic source package can be downloaded from RubyForge. For convenience, a copy of JME is included with the distribution. The author, Peter Ertl, has kindly given permission for the bundled JME applet to be used with InChIMatic. For other uses, consult the JME homepage.

Running InChIMatic

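After unpacking the source package, start the Rails development server from the application directory:

$ cd inchimatic-0.0.2
$ ruby script/server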

Pointing your browser to http://localhost:3000/inchi/input, drawing a structure in the JME window, and pressing the "InChI!" button will produce the corresponding InChI in the area below.

Behind the Scenes

The JME applet itself provides no capabilities for generating InChI identifiers. This functionality is instead provided by the Rails server via the Rino InChI library.

Let's say Susan wants to get the InChI for 3,4-dichlorophenol. After entering the structure into the JME window, she presses the "InChI!" button. This sets in motion the following sequence of events:

  1. The JavaScript function writeMolfile() is called. This retrieves a molfile representation of 3,4-dichlorophenol from JME, which is then written to the hidden field molfile.

  2. A Rails listener notices that the hidden text field has been updated and so invokes the InChIMatic ajax_inchi action. This is a Rails Ajax action that will update only a portion of the InChIMatic window. For more detail on this Rails Ajax technique, see the previous Anatomy of a Cheminformatics Web Application article.

  3. The ajax_inchi action retrieves the contents of the hidden molfile field. This molfile is then used to generate an InChI using Rino. This InChI is then saved to the instance variable inchi.

  4. The contents of the InChIMatic area partitioned by the results div are then updated with the InChI obtained in Step 3. The JME applet itself is unaffected by this operation, allowing Susan to further elaborate her molecule, if she'd like.
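
Steps 3 and 4 can be sketched without Rails at all. The class below is a hypothetical, framework-free stand-in for the ajax_inchi action, not the InChIMatic source. The name AjaxInchi, the injected generator, and the HTML escaping are all assumptions made for illustration. The generator is passed in because the actual Rino call isn't shown in this article.

```ruby
# Hypothetical, framework-free stand-in for the ajax_inchi action.
class AjaxInchi
  HTML_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' }

  attr_reader :inchi

  # generator: anything that responds to call(molfile) and returns an
  # InChI string. In InChIMatic this role is played by Rino.
  def initialize(generator)
    @generator = generator
  end

  # Step 3: read the hidden molfile field and generate the InChI.
  # Step 4: return only the text that replaces the contents of the
  # results div; the rest of the page, including JME, is untouched.
  def call(params)
    molfile = params['molfile'].to_s
    @inchi = molfile.strip.empty? ? '' : @generator.call(molfile)
    @inchi.gsub(/[&<>"]/) { |char| HTML_ESCAPES[char] }
  end
end

# A fake generator keeps the sketch runnable without Rino installed.
fake_generator = lambda { |molfile| "InChI=<placeholder for #{molfile.length} characters>" }
action = AjaxInchi.new(fake_generator)
puts action.call('molfile' => 'abc')
# prints InChI=&lt;placeholder for 3 characters&gt;
```

Because the action returns only a fragment, the browser never reloads the applet, which is what lets Susan keep editing her molecule.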

So What? Re-Thinking the Role of Applets

JME is, by itself, incapable of generating InChIs. Yet InChIMatic provides this capability as if it existed natively. In other words, a lightweight, fast-loading, and responsive 2-D editor can be extended on the server side, rather than on the client side. The difference is imperceptible to the user, but ripe with potential for the developer.

One of the most common, and completely valid, complaints about Java applets is that they take too long to load. Offloading some of the functionality currently being bundled in applets onto a Web server offers one way to combat the problem. Furthermore, combining Java applets with Ajax and powerful Web application frameworks like Ruby on Rails offers virtually limitless opportunities to re-think the role of Java applets in Web application development.

Conclusions

JME's strength comes, perhaps ironically, from its limited functionality. By using some simple Web programming techniques, JME can be extended with server-side programming. The advantages, compared to extending the JME applet itself with Java on the client side, are significant. Future articles in this series will explore some of the possibilities.

The Problem with Ferrocene

Posted by Rich Apodaca Tue, 12 Dec 2006 14:53:00 GMT


Four different Compound Identifiers. Three different canonical SMILES. Three different InChIs. This is how Ferrocene is represented in PubChem. Even more strangely, none of the bonding arrangements accurately reflects the way most chemists would think about the compound.

It's not a Good Thing to list the same compound under four different entries in the same molecular database. In the best case it's inconvenient. In the worst case it can cause information that does exist to act as if it does not. I'm only guessing, but I suspect that behind the scenes at PubChem one or more chemical informatics tools are being pushed beyond their area of expertise.

SMILES, InChI, Molfile, and CML are molecular languages that were designed primarily with organic compounds in mind. In this world, bonds occur in neat two-atom units with an even (integer) number of shared electrons. This system falls apart in the world of organometallic chemistry, where multi-atom bonding is commonplace. The same problem also crops up when describing de-localized organic ions and radicals. Multi-atom bonding even rears its head in something as prosaic as the aromaticity of naphthalene and benzene. (To be fair, InChI has less of a problem here than most other molecular languages because of its focus on atoms rather than bonds.)

Is the problem serious enough to do something about it? Forty years ago, metallocenes were a novelty - today they're ubiquitous. They're key components of new materials, catalysts, and perhaps eventually even drugs. They're abundant in every major chemical supplier's catalog. Every respectable journal runs at least one article per issue in which metallocenes play an important role. It seems unlikely that the problem with Ferrocene and its multi-atom bonding cousins can continue to be swept under the rug much longer.

Maybe the problem lies with the deficiencies of the molecular languages currently in use. After all, it seems unlikely that any system can ever become the "universal" language of chemical informatics. On the other hand, the problem may instead arise from these languages, and their limitations, figuring too prominently in the design of the underlying software.

Debabelization

Posted by Rich Apodaca Wed, 08 Nov 2006 14:32:00 GMT

Today, we find Chemical Abstracts with over two million compounds coded in a connectivity table system and ISI with close to a million compounds coded in WLN. The U.S. Patent Office has large files coded in the Hayward notation; the IDC has large numbers of compounds in its CT and GREMAS Code. Derwent has a sizable patent file coded in one fragment code, and many journal literature compounds coded in the Ring Code fragment code. There are a number of individual companies and government agencies with over 100,000 compounds coded in "a" system. And almost all companies synthesizing new compounds have some internal system for their compounds. Finally, there are many universities with a wide variety of coded structure files.

-Charles E. Granito, J. Chem. Doc. 1973, 13, 72-74

The situation described by Granito in 1973 seems eerily familiar today. The names of the players, the technologies, and encoding systems have changed, but the problem of multiple incompatible molecular languages has persisted for over 30 years.

This problem will become even more pronounced in the near future as free chemistry databases on the Web continue their rapid proliferation. In Granito's world of closed, proprietary databases and unevenly distributed computer power, interoperability was an afterthought; in the coming world of free, open databases, and ubiquitous computer networks that connect to them, interoperability will be taken for granted.

Granito goes on to observe that "there is no one 'best' system" for molecular representation. And he's right. Molecular languages evolve within a particular problem domain, just as human languages evolve within a specific cultural context. This isn't to say that a molecular language can't be creatively adapted to serve purposes for which it was never designed. Trying to do so is, after all, how new languages are conceived.

Consider InChI, which is both a molecular identification system and a line notation, and Chemical Markup Language (CML), an XML language. There are vast areas of chemistry in which using either InChI or CML will be problematic - particularly polymers, organometallics, and inorganic chemistry. And let's not ignore new molecular representation problems brewing on the horizon, such as small-molecule tertiary structure. Yet for pure organic chemistry as most of us know it today, InChI and CML may well be optimal.

The problem is that both InChI and CML compete with simpler, entrenched alternatives - SMILES and molfile, respectively. Even MDL, the author of the original molfile specification, is having difficulty gaining acceptance for its new molfile format, despite significant technical advantages.

If history is any guide, we can look forward to at least as many molecular languages in the next thirty years as we've seen in the last thirty. It wasn't long ago that WLN was viewed as the language of the future. Now it just looks cryptic. For this we can thank a combination of technology advances and the emergence of a far simpler alternative, SMILES. A similar fate, more likely than not, awaits all molecular languages currently in use.

Will there ever be a universal molecular language and is there any point in trying to invent one? Every area of chemistry introduces its own peculiarities not shared by any of the others. Yet all users want the simplest language possible. These two contradictory forces ensure that a universal language is unlikely to ever appear. In other words, the most successful new molecular languages are likely to be agile - simple, easy to learn, cheap to implement, and quickly adaptable in the face of new chemical concepts and advances in computer technology.

From SMILES to InChI with OBRuby

Posted by Rich Apodaca Fri, 03 Nov 2006 15:50:00 GMT

SMILES and InChI are two commonly used molecular line notations. Although each has its advantages and limitations, the novelty of InChI and the ubiquity of SMILES make the SMILES to InChI conversion especially useful. Many of the situations in which the need for this conversion will arise are particularly well-suited for the Ruby programming language. A recent article described how RCDK and Rino could be used to accomplish this conversion. This article will show how Open Babel can be used from Ruby to effect the same conversion.

OBRuby

OBRuby is a SWIG-generated Ruby interface to the Open Babel library. Although OBRuby doesn't expose all aspects of the Open Babel API, nearly everything that can be done in C++ Open Babel can now be done in Ruby. For example, all OBConversion permutations should be available, including SMILES to InChI.

A Small Ruby Library

Let's create a small Ruby library for converting SMILES strings into InChI identifiers. Save the following into a file called convert.rb:

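require 'openbabel'

# Converts SMILES strings to InChI identifiers using Open Babel.
class Convertor
  def initialize
    @conv = OpenBabel::OBConversion.new

    # Read SMILES ('smi') and write InChI ('inchi').
    @conv.set_in_and_out_formats('smi', 'inchi')
  end

  # Returns the InChI for the given SMILES string.
  def get_inchi(smiles)
    mol = OpenBabel::OBMol.new

    @conv.read_string(mol, smiles)
    @conv.write_string(mol).strip # drop trailing whitespace from the writer
  end
end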

There's nothing tricky here. We've simply created a Ruby class that makes the SMILES to InChI conversion as simple as one method call to an instance.

Testing the Library

A good way to test this library is through Interactive Ruby (irb). For example, to find the InChI of caffeine:

require './convert' # load convert.rb from the current directory

c = Convertor.new

puts c.get_inchi('Cn1cnc2c1c(=O)n(C)c(=O)n2C') # caffeine
# => InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3

Chiral SMILES

I applied this simple Ruby conversion library to the (S)-methamphetamine record in PubChem:

  • Isomeric SMILES: C[C@@H](CC1=CC=CC=C1)NC
  • PubChem InChI: InChI=1/C10H15N/c1-9(11-2)8-10-6-4-3-5-7-10/h3-7,9,11H,8H2,1-2H3/t9-/m0/s1

My results were:

  • Isomeric SMILES: C[C@@H](CC1=CC=CC=C1)NC
  • OBRuby InChI: InChI=1/C10H15N/c1-9(11-2)8-10-6-4-3-5-7-10/h3-7,9,11H,8H2,1-2H3/t9-/m1/s1

As you can see, there is a discrepancy between the two stereo layers ('m0' vs. 'm1'). The same InChI is generated by Open Babel using either OBRuby or the Worldwide Molecular Matrix. Substituting the SMILES string representing the opposite configuration at carbon generates the InChI with opposite configuration (R), which again is opposite to that of (R)-methamphetamine in PubChem.
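
Spotting a one-character difference between two long InChIs by eye is error-prone, so it can help to compare them layer by layer. The following Ruby sketch does this with two hypothetical helpers, inchi_layers and differing_layers. It assumes each one-letter layer prefix appears at most once, which holds for the two InChIs above but not for InChIs that repeat stereo layers in their isotopic or fixed-hydrogen sections.

```ruby
# Hypothetical helpers for comparing two InChIs layer by layer.

# Index the layers of an InChI by version, formula, and one-letter prefix.
def inchi_layers(inchi)
  parts = inchi.sub(/\AInChI=/, '').split('/')
  layers = { 'version' => parts[0], 'formula' => parts[1] }
  (parts[2..-1] || []).each { |layer| layers[layer[0, 1]] = layer[1..-1] }
  layers
end

# Return the keys of all layers whose contents differ between a and b.
def differing_layers(a, b)
  left = inchi_layers(a)
  right = inchi_layers(b)
  (left.keys | right.keys).select { |key| left[key] != right[key] }
end

pubchem = 'InChI=1/C10H15N/c1-9(11-2)8-10-6-4-3-5-7-10/h3-7,9,11H,8H2,1-2H3/t9-/m0/s1'
obruby  = 'InChI=1/C10H15N/c1-9(11-2)8-10-6-4-3-5-7-10/h3-7,9,11H,8H2,1-2H3/t9-/m1/s1'

p differing_layers(pubchem, obruby) # prints ["m"]
```

Only the 'm' layer differs; the formula, connections, hydrogens, and the 't' and 's' layers are identical.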

At this point, it is unclear whether Open Babel or PubChem is producing the correct InChI for the methamphetamine enantiomers. I suspect Open Babel is correct. By creating a molfile of (S)-methamphetamine with JME and running cInChI over it, I got the same output as with the Open Babel conversions. I've found similar differences between PubChem and Open Babel InChIs in every chiral molecule I've looked at.

Conclusions

The conversion of SMILES, and other molecular languages, into InChI identifiers can be expected to become a recurring need as the popularity of InChI increases. Combining the formidable translation capabilities of Open Babel with the comfort and convenience of Ruby offers a powerful new technique for doing so.
