Nomenclature translation is the process of converting a human-readable chemical name into a machine-readable notational scheme such as a connection table. It plays a key role in linking the older chemical literature to modern information technologies, such as the Internet.
Buried deep within the Chemistry Development Kit (CDK) is a library for nomenclature translation called ChemNomParse. At the heart of ChemNomParse is a remarkable piece of software called the Java Compiler Compiler (JavaCC), a parser generator and lexical analyzer generator for Java. A FAQ on JavaCC is available here.
This tutorial demonstrates how freely-available, open source tools can be used to parse an IUPAC chemical name and generate its corresponding 2-D structure rendering. A closely-related tutorial on generating 2-D structures from SMILES strings may be helpful as background.
This tutorial uses Arton's Ruby Java Bridge, the installation and use of which has been outlined previously. In addition, you'll need to download Structure-CDK v0.1.2, also previously discussed. Be sure to download v0.1.2, as two upgrades have been released since the package was originally described. This tutorial has been tested on Mandriva Linux 2006.
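Create a working directory called nom. From the lib directory of the Structure-CDK distribution, copy structure-cdk-0.1.2.jar into your nom working directory. The CLASSPATH set in depict.rb also references ./cdk-20060714.jar, so that jar needs to be in the same directory.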
Create a file called depict.rb and copy the following code into it:
require 'rjb'  # Ruby Java Bridge; needed for Rjb::import below
# Both jars are expected in the current working directory.
ENV['CLASSPATH'] = './cdk-20060714.jar:./structure-cdk-0.1.2.jar'
NomParser = Rjb::import 'org.openscience.cdk.iupac.parser.NomParser'
StructureDiagramGenerator = Rjb::import 'org.openscience.cdk.layout.StructureDiagramGenerator'
ImageKit = Rjb::import 'net.sf.structure.cdk.util.ImageKit'
class Depictor
  def initialize
    @sdg = StructureDiagramGenerator.new
  end
  # Both methods rely on a nom_to_mol(nom) helper that is not shown in this listing.
  def depict_png(nom, width, height, path_to_png)
    ImageKit::writePNG(nom_to_mol(nom), width, height, path_to_png)
  end
  def depict_svg(nom, width, height, path_to_svg)
    ImageKit::writeSVG(nom_to_mol(nom), width, height, path_to_svg)
  end
end
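The listing calls nom_to_mol but does not define it. What follows is only a minimal sketch of what such a helper might look like, not the original code. It assumes that NomParser exposes a static generate method taking the name, and that StructureDiagramGenerator offers setMolecule, generateCoordinates, and getMolecule. The parser and layout objects are passed in as parameters here (an illustrative choice) so the sketch can be exercised without a JVM.

```ruby
# Hypothetical stand-in for the nom_to_mol helper that depict.rb calls.
# parser: an object responding to generate(name), e.g. the imported NomParser (assumed API)
# sdg:    a layout object, e.g. a StructureDiagramGenerator instance (assumed API)
def nom_to_mol(nom, parser, sdg)
  mol = parser.generate(nom)  # parse the chemical name into a molecule
  sdg.setMolecule(mol)        # hand the molecule to the layout engine
  sdg.generateCoordinates     # compute 2-D coordinates
  sdg.getMolecule             # return the molecule with coordinates attached
end
```

Inside Depictor, the same body would presumably use NomParser and @sdg directly and take only the name as its argument.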
After you save this file, you'll need to set your LD_LIBRARY_PATH on Unix (or the equivalent on another OS). This tells RJB where to find Java's native libraries. Because of RJB's current design, LD_LIBRARY_PATH needs to be set from the command line, rather than from within a Ruby process.
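Since the variable cannot be fixed up from inside Ruby, it can help to check it before loading RJB. The sketch below only inspects the value. It assumes the library of interest is named libjvm.so, which is typical on Linux but may differ on your system. The helper name and its parameters are made up for illustration and are not part of RJB.

```ruby
# Returns the LD_LIBRARY_PATH entries that contain the named library.
# lib_name and the exists predicate are illustrative assumptions.
def jvm_library_dirs(ld_library_path, lib_name = 'libjvm.so', exists = File.method(:exist?))
  ld_library_path.to_s
                 .split(File::PATH_SEPARATOR)
                 .reject(&:empty?)
                 .select { |dir| exists.call(File.join(dir, lib_name)) }
end

# Possible use at the top of a script, before require 'rjb':
#   if jvm_library_dirs(ENV['LD_LIBRARY_PATH']).empty?
#     warn 'No LD_LIBRARY_PATH entry contains libjvm.so; set it in the shell first.'
#   end
```

The check reports a likely misconfiguration early instead of leaving it to surface as a load error.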
Using the Depictor class is as simple as creating an instance and invoking depict_png on it:
depictor = Depictor.new
depictor.depict_png('2-phenylcyclohexan-1-ol', 300, 300, 'output.png')
Executing the above code either through the Ruby interpreter (ruby) or via Interactive Ruby (irb) produces a PNG image of the chiral auxiliary shown below:
Other names correctly recognized by ChemNomParse include:
Many chemical names, ranging from the simple to the complicated, were not recognized at all by ChemNomParse. Some examples are:
- 2-methyl-5-prop-1-en-2-yl-cyclohex-2-en-1-one (carvone)
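A survey like this one can be automated by feeding a list of names to the translator and recording which ones raise an error. The sketch below is one way to do that. It assumes that a failed parse surfaces in Ruby as an exception descending from StandardError, which is how RJB is expected to report Java exceptions but is not verified here. The helper name is hypothetical.

```ruby
# Splits names into those the block accepts and those it rejects.
# The block stands in for the real call, for example:
#   { |n| depictor.depict_png(n, 300, 300, 'out.png') }
def partition_names(names)
  recognized = []
  rejected = {}
  names.each do |name|
    begin
      yield name
      recognized << name
    rescue StandardError => e
      rejected[name] = e.message  # keep the reason for later inspection
    end
  end
  [recognized, rejected]
end
```

This only separates names that fail outright from names that parse. A name that parses into the wrong structure still lands in the recognized list and has to be checked by eye.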
Some names were incorrectly interpreted due to misassigned locants. For example, 2-chloro-3-hydroxybutanoic acid produced the incorrectly assigned structure shown below:
ChemNomParse can accurately recognize chemical names representing simple substitutions on basic hydrocarbon scaffolds. More complicated structures, such as heterocycles, bicyclic systems, and systems involving nested substituents, do not appear to be handled at all. It is not clear to what extent these limitations reflect a small dictionary of morphemes (the basic nomenclature building blocks) versus deeper design issues.
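To make the dictionary hypothesis concrete, here is a toy tokenizer that splits a name into locants and morphemes by greedy longest match. It is not how ChemNomParse works (its grammar is generated by JavaCC), and the morpheme table is invented for illustration. It merely shows how a missing dictionary entry alone is enough to make a name unrecognizable.

```ruby
# Toy morpheme table for illustration only; it is NOT ChemNomParse's dictionary.
TOY_MORPHEMES = %w[cyclo phenyl methyl chloro hydroxy meth eth prop but hex an en ol yl].freeze

# Greedy longest-match tokenizer. Returns an array of [type, text] pairs,
# or nil as soon as some part of the name matches nothing in the table.
def tokenize_name(name, morphemes = TOY_MORPHEMES)
  by_length = morphemes.sort_by { |m| -m.length }  # try longer morphemes first
  rest = name.downcase
  tokens = []
  until rest.empty?
    if (locant = rest[/\A\d+/])
      tokens << [:locant, locant]
      rest = rest[locant.length..-1]
    elsif rest.start_with?('-')
      rest = rest[1..-1]                           # hyphens only separate tokens
    elsif (morpheme = by_length.find { |m| rest.start_with?(m) })
      tokens << [:morpheme, morpheme]
      rest = rest[morpheme.length..-1]
    else
      return nil                                   # unknown building block
    end
  end
  tokens
end
```

With this table, 2-phenylcyclohexan-1-ol breaks down into 2, phenyl, cyclo, hex, an, 1, ol, while pyridine yields nil because none of its pieces is in the dictionary. Correctly handling nested substituents or locant assignment would need more than a bigger table, which is the "deeper design" side of the question.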
Despite its limitations, ChemNomParse is an interesting piece of open source software for working with chemical nomenclature. From this simple tutorial, it can be seen that nomenclature translation, when combined with other capabilities such as 2-D rendering, offers many exciting possibilities.