Chemical Nomenclature Translation

September 10, 2006

... We report here the development of a computer program for converting chemical names into connection tables, a process we call "nomenclature translation." ... this process provides an alternate method of structure registration by allowing a new substance to be input via a structurally descriptive systematic name instead of only as a connection table taken from a structural diagram.

G. G. Vander Stouw et al. J. Chem. Doc. 1974, 14, 185-193

Systematic nomenclature is one of the oldest forms of line notation. As a result, it can be found widely in papers, patents, spreadsheets, and other documents. Any software that can convert systematic nomenclature, such as IUPAC names, into a computer-based representational system, such as a connection table, has the potential to unlock vast amounts of legacy chemical information by making it structure-searchable.
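To make the idea of converting a name into a connection table concrete, here is a minimal sketch in Python. It is a toy written for this illustration and is not taken from any of the packages discussed in this article. The function name, the stem table, and the tuple layout of the connection table are all assumptions, and it handles only unbranched alkanes from methane through hexane, with hydrogens left implicit.

```python
# Toy illustration only: unbranched alkanes, methane through hexane.
# The names below (STEMS, name_to_connection_table) are hypothetical
# and do not come from ChemNomParse, OPSIN, or any other package.

STEMS = {"meth": 1, "eth": 2, "prop": 3, "but": 4, "pent": 5, "hex": 6}


def name_to_connection_table(name):
    """Return (atoms, bonds) for an unbranched alkane name.

    atoms is a list of element symbols (hydrogens are implicit).
    bonds is a list of (atom_index_a, atom_index_b, bond_order) tuples
    that use 0-based indices into atoms.
    """
    cleaned = name.strip().lower()
    if not cleaned.endswith("ane"):
        raise ValueError(f"unsupported name: {name!r}")

    stem = cleaned[: -len("ane")]
    if stem not in STEMS:
        raise ValueError(f"unsupported name: {name!r}")

    carbon_count = STEMS[stem]
    atoms = ["C"] * carbon_count
    # Join each carbon to the next one in the chain with a single bond.
    bonds = [(i, i + 1, 1) for i in range(carbon_count - 1)]
    return atoms, bonds
```

Calling name_to_connection_table("propane") returns three carbon atoms and two single bonds, while a name outside the toy grammar, such as "benzene", raises ValueError. A real translator has to cope with a far larger grammar, including locants, substituents, rings, and stereochemistry, which is where most of the difficulty lies.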

Vander Stouw and his group at Chemical Abstracts Service (CAS) developed the first working system for name-to-structure conversion. Their interest in an automated process stemmed from its potential to greatly accelerate the rate at which the chemical literature could be indexed. Instead of a human creating a computer representation by manually parsing a systematic name from a paper, a computer could do it error-free at a fraction of the cost. These factors are still at work today, and the pool of raw chemical information has grown dramatically since 1974.

Nomenclature translation has been more widely investigated than the related problem of 2-D raster image interpretation, although the driving forces in both cases are the same. There are, of course, several proprietary packages for nomenclature translation. An important disadvantage of all of them is a distinct lack of customizability.

Open source nomenclature translation options have been very limited. One of the first such packages was ChemNomParse by David Robinson, Bhupinder Sandhu, and Stephen Tomkinson at the University of Manchester. ChemNomParse has since been made part of the Chemistry Development Kit (CDK). Although its capabilities are relatively limited, ChemNomParse is very useful for the design it embodies.

More recently, Peter Corbett at Cambridge has developed a package called OPSIN. Egon Willighagen wrote about integrating OPSIN into the desktop software package Bioclipse. OPSIN's source can be found in the project's SVN repository.

The most exciting potential for chemical nomenclature translation is realized when this capability is blended with other chemical informatics technologies. Future articles in this series will show how ChemNomParse and OPSIN can be used with other open source tools to create rich chemical informatics systems.