Agile Chemical Informatics Development with CDK and Ruby: RCDK-0.3.0
Ruby Chemistry Development Kit (RCDK) version 0.3.0 is now available from RubyForge. RCDK enables the complete CDK API to be accessed from Ruby. This release adds support for IUPAC nomenclature translation and tighter Java integration.
Dependencies
RCDK requires Ruby, the Ruby developer libraries, a working build toolchain, and Ruby Java Bridge (RJB). This latter dependency can be satisfied during the RCDK installation process if the RubyGems method is used (see 'Installation').
Installation
RCDK can be conveniently installed using the RubyGems packaging mechanism:
# gem install rcdk
Alternatively, the source package and RubyGem can be downloaded here.
Tighter Java Integration
RCDK-0.3.0 introduces the previously described mechanism for mapping Java packages onto Ruby modules. For example, a Java ArrayList can be created through the new jrequire command:
require 'rubygems'
require_gem 'rcdk'
jrequire 'java.util.ArrayList'
list = Java::Util::ArrayList.new
list.size # => 0
IUPAC Nomenclature Translation
RCDK's most important new chemical informatics feature is made possible by Peter Corbett's excellent IUPAC nomenclature translation library OPSIN. It can either be used directly with jrequire, or indirectly through RCDK's convenience library RCDK::Util:
require 'rubygems'
require_gem 'rcdk'
require 'rcdk/util'
mol = RCDK::Util::Lang.read_iupac 'quinoline'
mol.getAtomCount # => 10
There are two things to notice here. First, no jrequire statement is needed when using the RCDK::Util library. Second, there is a multi-second delay after read_iupac is invoked. OPSIN itself introduces this delay during the NameToStructure constructor call, and RCDK inherits this behavior. However, after the first invocation of read_iupac, subsequent calls to this method are very fast.
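The first-call delay described above is what you would expect from a library that builds one expensive object and then reuses it. The sketch below illustrates that pattern in plain Ruby. SlowParser and NameReader are hypothetical stand-ins, not part of RCDK or OPSIN, and the sleep call merely simulates a costly constructor.

```ruby
# Hypothetical stand-in for an object that is expensive to construct,
# such as OPSIN's NameToStructure. Not part of RCDK or OPSIN.
class SlowParser
  def initialize
    sleep 0.2 # simulate slow start-up work
  end

  def parse(name)
    "parsed:#{name}"
  end
end

# Hypothetical wrapper that builds the parser once, on first use,
# and reuses it for every later call.
module NameReader
  def self.parser
    @parser ||= SlowParser.new
  end

  def self.read(name)
    parser.parse(name)
  end
end

# Runs the block and returns [result, elapsed_seconds].
def timed
  start = Time.now
  result = yield
  [result, Time.now - start]
end

first_result, first_time = timed { NameReader.read('quinoline') }
second_result, second_time = timed { NameReader.read('pyridine') }

puts "first call:  #{first_time} s"  # includes construction of the parser
puts "second call: #{second_time} s" # the cached parser is reused
```

Only the first call pays the construction cost; every later call goes straight to the cached instance.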
Let's decorate the quinoline nucleus with some substituents and render a 2-D image of the result. Execute the following code, either through the Ruby interpreter (ruby) or through Interactive Ruby (irb):
require 'rubygems'
require_gem 'rcdk'
require 'rcdk/util'
RCDK::Util::Image.iupac_to_png('3-chloro-4-(2-aminopropyl)-6-mercapto-8-(2-hydroxyphenyl)-quinoline-2-carboxylic acid', 'test.png', 300, 300)
Running this code produces the following image in your working directory:

Be Agile
RCDK marries the agility of the Ruby language with the functionality of three Open Source chemical informatics libraries: CDK, OPSIN, and Structure-CDK. Future articles will discuss some simple applications of this powerful combination.
Scripting Java with Ruby: Yet Another Java Bridge
New technologies that compete with older ones need to provide a clear upgrade path if they are to succeed. A case in point is Ruby. Many Java developers' reaction to this language has less to do with its capabilities and more to do with previous investments in Java. What good is a new language if the special library X that you depend on needs to be rewritten from scratch?
Previous articles, starting with this one, have discussed Ruby Java Bridge (RJB) as a Java-Ruby integration tool. Two additional articles discussed RJB in the context of mapping Java packages onto Ruby modules and Java-Ruby integration on Windows. RJB currently provides the mechanism whereby the full Chemistry Development Kit (CDK) API can be used in Ruby with Ruby CDK.
Another option for Java-Ruby integration is JRuby, a Java implementation of the Ruby interpreter. JRuby offers tight integration with the Java Virtual Machine, which will be ideal in many situations. In other situations, it will not be the best choice. For example, one of the advantages of RJB over JRuby is that the standard C-Ruby implementation can be used. This in turn offers, for example, full Rails functionality and access to C extensions. A disadvantage of RJB is that, being written in C, it requires a working build toolchain for installation.
I've seen one report of a Macintosh installation of RJB that failed. Without a Mac of my own, I can't confirm if this is indeed a problem. But this report also pointed me to a third approach to Ruby-Java integration, Yet Another Java Bridge (YAJB). YAJB is different from both JRuby and RJB in that it extends the C implementation of Ruby with a Java bridge written in pure Java. In theory, it should run on any platform that both Ruby and Java run on.
YAJB-0.8.1 installed on my system without a hitch. From the root directory of the distribution:
# ruby setup.rb
Using YAJB was straightforward. A Java Vector instance could be instantiated and manipulated using familiar syntax:
require 'yajb/jbridge'
include JavaBridge
v = jnew "java.util.Vector"
v.add("one")
v.add("two")
v.size # => 2
v.elementAt(1) # => "two"
Good integration tools can make the difference between actually using new technologies and simply observing them. Java developers interested in using Ruby now have at least three good options to choose from: JRuby, RJB, and YAJB.
From IUPAC Nomenclature to 2-D Structures With OPSIN
A previous article introduced OPSIN, an Open Source Java library for decoding IUPAC chemical nomenclature. In this tutorial, you'll see how OPSIN can, when interfaced with freely-available chemical informatics software, generate 2-D structure diagrams from IUPAC names.
Prerequisites
This tutorial requires Ruby CDK (RCDK), which in turn requires Ruby, Java, and the Ruby Java Bridge. Tutorials detailing the installation of RCDK on both Windows and Linux platforms are available.
In addition, you'll need a copy of the standalone jarfile opsin-big-0.1.0.jar. Future versions of RCDK will integrate the OPSIN jarfile, making this step unnecessary.
Outlining the Problem and a Solution
We'd like to create a simple Ruby class with a method that accepts an IUPAC chemical name as input and produces a PNG image of the corresponding molecule as output. OPSIN accepts IUPAC names as input, but it only produces Chemical Markup Language (CML) as output. The CML output lacks 2-D coordinates, and OPSIN itself has no 2-D rendering capabilities.
We'll use RCDK to augment OPSIN's capabilities. Thanks to CDK's built-in CML support, RCDK can read CML and generate an AtomContainer representation. RCDK also supports the assignment of 2-D coordinates to an AtomContainer via CDK's StructureDiagramGenerator. To produce the PNG image, we'll use the 2-D rendering capability made possible through Structure-CDK, which is a built-in component of RCDK.
A Simple Ruby Library
Create a working directory and copy opsin-big-0.1.0.jar into it. Next, create a file called depictor.rb containing the following Ruby code:
require 'rubygems'
require_gem 'rcdk'
require 'rcdk'
Java::Classpath.add('opsin-big-0.1.0.jar')
require 'util'
# A simple IUPAC->2-D structure converter.
class Depictor
  @@StringReader = import 'java.io.StringReader'
  @@NameToStructure = import 'uk.ac.cam.ch.wwmm.opsin.NameToStructure'
  @@CMLReader = import 'org.openscience.cdk.io.CMLReader'
  @@ChemFile = import 'org.openscience.cdk.ChemFile'
  def initialize
    # The NameToStructure constructor is slow, so build it once and reuse it.
    @nts = @@NameToStructure.new
    @cml_reader = @@CMLReader.new
  end
  # Writes a <tt>width</tt> by <tt>height</tt> PNG to
  # <tt>filename</tt> for the molecule described by
  # <tt>iupac_name</tt>.
  def depict_png(iupac_name, filename, width, height)
    cml = @nts.parseToCML(iupac_name)
    raise("Can't parse name: #{iupac_name}") unless cml
    molfile = cml_to_molfile(cml)
    RCDK::Util::Image.molfile_to_png(molfile, filename, width, height)
  end
  private
  # Reads the CML into a CDK molecule, assigns 2-D coordinates,
  # and returns the result as a molfile string.
  def cml_to_molfile(cml)
    string_reader = @@StringReader.new(cml.toXML)
    @cml_reader.setReader(string_reader)
    chem_file = @cml_reader.read(@@ChemFile.new)
    molecule = chem_file.getChemSequence(0).getChemModel(0).getSetOfMolecules.getMolecule(0)
    molecule = RCDK::Util::XY.coordinate_molecule(molecule)
    RCDK::Util::Lang.get_molfile(molecule)
  end
end
Testing, Testing
A short test will demonstrate the capabilities of the Depictor library. Add the following to a file called test.rb in your working directory (or enter it interactively with irb):
require 'depictor'
depictor = Depictor.new
name = '3,3-dimethyl-7-oxo-6-[(2-phenylacetyl)amino]-4-thia-1-azabicyclo[3.2.0]heptane-2-carboxylic acid' #Penicillin G
depictor.depict_png(name, 'out.png', 300, 300)
Running this test produces a 300x300 PNG image of Penicillin G, named out.png, in your working directory:

As you can see, this simple library and test code has:
- correctly parsed the rather complex IUPAC name (3,3-dimethyl-7-oxo-6-[(2-phenylacetyl)amino]-4-thia-1-azabicyclo[3.2.0]heptane-2-carboxylic acid) to a valid CML representation
- converted this representation to a CDK AtomContainer
- assigned 2-D coordinates
- rendered a PNG image in color
Notice how the thiaazabicyclo[3.2.0] system, complete with properly placed substituents, was flawlessly identified and parsed.
If you entered the above test code interactively via IRB, you may have noticed a multi-second delay in instantiating Depictor. This latency results from a sluggish NameToStructure constructor in OPSIN. A similar delay also occurs in OPSIN's pure-Java unit tests. Once Depictor is instantiated, however, image generation occurs relatively quickly.
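Because the start-up cost is paid once per instance, it makes sense to create a single Depictor and reuse it for a whole batch of names. The following is a minimal sketch, assuming only an object that responds to depict_png(name, filename, width, height) like the Depictor above. The render_all helper and its file-naming scheme are illustrative and not part of RCDK.

```ruby
# Renders each named molecule to "<label>.png" in dir, reusing one depictor.
# Returns [written, failed]: a hash of label => filename for successes and
# a hash of label => error message for names that could not be rendered.
def render_all(depictor, names_by_label, dir = '.', size = 300)
  written = {}
  failed = {}
  names_by_label.each do |label, iupac_name|
    filename = File.join(dir, "#{label}.png")
    begin
      depictor.depict_png(iupac_name, filename, size, size)
      written[label] = filename
    rescue StandardError => e
      # Record the failure and keep going with the rest of the batch.
      failed[label] = e.message
    end
  end
  [written, failed]
end

# Usage with the Depictor class (requires RCDK and the OPSIN jarfile):
#   depictor = Depictor.new   # slow: constructed once
#   written, failed = render_all(depictor,
#     'caffeine'  => '1,3,7-trimethylpurine-2,6-dione',
#     'quinoline' => 'quinoline')
```

Collecting failures instead of aborting means that one unparseable name does not stop the remaining images from being written.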
The unusual orientation of the beta-lactam carbonyl group is determined by CDK's StructureDiagramGenerator. The source of this behavior will be explored in a future article.
More Examples
To illustrate some of the capabilities of the OPSIN-RCDK combination, a few more examples are provided below.
One of OPSIN's more surprising features is how well it handles heterocycles. For example, the IUPAC name for caffeine (1,3,7-trimethylpurine-2,6-dione) is translated to:
As another example, consider the tetrazole (1-[2-hydroxy-3-propyl-4-[3-(2H-tetrazol-5-yl)propoxy]phenyl]ethanone):
Highly substituted benzene rings and carboxylic acids are also translated accurately, as in 3-acetamido-5-(acetyl-methyl-amino)-2,4,6-triiodo-benzoic acid (Metrizoate):
How about a hairy-looking macrocycle name with multiple levels of morpheme nesting (3,6-diamino-N-[[15-amino-11-(2-amino-3,4,5,6-tetrahydropyrimidin-4-yl)-8-[(carbamoylamino)methylidene]-2-(hydroxymethyl)-3,6,9,12,16-pentaoxo-1,4,7,10,13-pentazacyclohexadec-5-yl]methyl]hexanamide)? Not a problem:
Limitations
In my tests of the OPSIN library, one structure appeared to be incorrectly parsed - N-(5-chloro-2-methyl-phenyl)-2-methoxy-N-(2-oxooxazolidin-3-yl)acetamide:
There are actually two problems with the output. First, an oxygen atom and a methyl group overlap near the top of the diagram. This cosmetic issue is related to CDK's StructureDiagramGenerator. Second, the oxazolidine nitrogen atom is misplaced by OPSIN. The correct 2-D image of this molecule, obtained from PubChem, is shown below:
Conclusions
It's not common to find an early-development Open Source project with the sophistication of OPSIN. The smooth handling of nested morphemes, aromatic heterocycles, macrocycles, and a good fraction of what I threw at it leads me to believe that a well-designed and extensible nomenclature parsing engine lies at OPSIN's core. More on that later, though.
What could you do with a powerful Open Source IUPAC nomenclature parser? The answer to that one question could fill a three-volume series. Suffice it to say that OPSIN, in combination with other Open Source software, offers virtually limitless potential for indexing, collecting, repackaging, reprocessing, and mashing up vast amounts of chemical information. Because of its Open Source license, OPSIN can be extended and otherwise modified to fit your particular needs. Future articles will highlight some of the possibilities.
Compiling C to Java Bytecode
In the ideal world for many Java developers, all software would be written in Java. The reality is that a great deal of software is written in other languages, one of the most widespread of which is C. This article discusses a unique approach to working with C code from Java, producing 100% pure Java bytecode that runs anywhere Java does.
JNI - The Standard Solution
The standard solution to working with C code from Java has been the Java Native Interface (JNI). In this approach, the Java Virtual Machine (JVM) is able to treat a native binary library as if it were written in Java. This is a clever solution that does what it claims to.
Unfortunately, JNI introduces a platform dependency - the very thing Java was designed to avoid. Depending on the details of the native library, this platform dependency may effectively banish your software from platforms that it would otherwise run on without modification. The Eclipse team, for example, has had to deal with the platform dependence issues of the Standard Widget Toolkit (SWT) for some time now. Even if a workable solution is developed, deployment is an order of magnitude more complex when native libraries are involved.
It doesn't have to be this way. What if it were possible to compile C source code directly into Java bytecode?
A Better Way
Axiomatic Solutions has an answer to this problem called Axiomatic Multi-Platform C (AMPC). This software can compile C source files directly into Java class files.
Axiomatic offers a free demo version of AMPC, which can be downloaded here. The demo is rather limited; it expires after fifteen days and lacks certain key features available in the full version, such as multiplication and division.
For those serious about AMPC, the full version can be had for $2999.00. This is a hefty sum. But depending on who you are and what you're trying to do, AMPC may be the most cost-effective solution available.
AMPC is not the only C to Java conversion option. Another program, C2J, is free (as in beer) software from Novasoft that translates C source into Java source. Jazillian also converts C source into Java source, with an emphasis on readability. Links to more C to Java solutions are available from this page.
A Simple Demo
To learn more about AMPC, I downloaded and installed the Demo version 1.5.1. It installed without a hitch.
AMPC actually consists of two components - a command-line utility and an IDE. Those of you used to Eclipse will be somewhat disappointed with AMPC's IDE, which is based on SciTE. For this reason, I spent most of my time with the command-line utility.
I decided the venerable Hello World application should be my first stop. I saved this version to a file called hello.c:
#include <stdio.h>
int main(void)
{
    printf("Hello World - From C!\n");
    return 0;
}
>compile hello.c
This produced the file hello.class. Running this class with Java confirmed that this process does indeed work:
>java hello
Hello World - From C!
A More Complex Demo
One of the key differences between C and Java is that C has pointers and Java does not. So how does AMPC handle a simple program that uses pointers? Very well, it turns out. For this test, I used the following source code, which I lifted from this tutorial:
#include <stdio.h>
int j, k;
int *ptr;
int main(void)
{
    j = 1;
    k = 2;
    ptr = &k;
    printf("\n");
    printf("j has the value %d and is stored at %p\n", j, (void *)&j);
    printf("k has the value %d and is stored at %p\n", k, (void *)&k);
    printf("ptr has the value %p and is stored at %p\n", ptr, (void *)&ptr);
    printf("The value of the integer pointed to by ptr is %d\n", *ptr);
    return 0;
}
j has the value 1 and is stored at 0x2824
k has the value 2 and is stored at 0x2826
ptr has the value 0x2826 and is stored at 0x2c78
The value of the integer pointed to by ptr is 2
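This article does not describe how AMPC represents pointers internally. One common way to emulate them on a runtime without raw memory access is to treat memory as a flat array and addresses as offsets into it. The Ruby sketch below mirrors the C program under that assumption. The Memory class, the base address, and the 4-byte cell size are all invented for illustration and say nothing about AMPC's actual design.

```ruby
# Illustrative flat-memory model: addresses are integer offsets from a
# base, and every variable occupies one fixed-size cell.
class Memory
  CELL_SIZE = 4 # assumed size of an int or pointer, in bytes

  def initialize(base = 0x1000)
    @base = base
    @cells = []
  end

  # Reserves a cell, stores an initial value, and returns its address.
  def allocate(value = 0)
    @cells << value
    @base + (@cells.size - 1) * CELL_SIZE
  end

  # Equivalent of reading *address.
  def load(address)
    @cells.fetch(index_of(address))
  end

  # Equivalent of *address = value.
  def store(address, value)
    @cells[index_of(address)] = value
  end

  private

  # Converts an address back into an array index, rejecting bad addresses.
  def index_of(address)
    offset = address - @base
    raise ArgumentError, 'misaligned address' unless (offset % CELL_SIZE).zero?
    raise IndexError, 'address out of range' unless (0...@cells.size * CELL_SIZE).cover?(offset)
    offset / CELL_SIZE
  end
end

mem = Memory.new
j_addr = mem.allocate(1)    # int j = 1;
k_addr = mem.allocate(2)    # int k = 2;
ptr_addr = mem.allocate(0)  # int *ptr;
mem.store(ptr_addr, k_addr) # ptr = &k;

ptr = mem.load(ptr_addr)
printf("j has the value %d and is stored at 0x%x\n", mem.load(j_addr), j_addr)
printf("k has the value %d and is stored at 0x%x\n", mem.load(k_addr), k_addr)
printf("ptr has the value 0x%x and is stored at 0x%x\n", ptr, ptr_addr)
printf("The value of the integer pointed to by ptr is %d\n", mem.load(ptr))
```

In this model a pointer is just an integer stored in a cell, so taking an address, dereferencing, and storing through a pointer all reduce to array operations.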
So What?
Libraries written in C are of course quite common in chemical informatics and computational chemistry. Although most of these are legacy libraries developed long ago, some are more recent.
A case in point is the InChI library, the only implementation of which is written in C. It has been suggested that the best solution to using InChI from Java is JNI. However, for the reasons outlined above, this is not really the solution that Java developers want. I, and others, have argued that a pure Java implementation is the best solution - but porting is an expensive proposition, given the complexity of the InChI code.
Perhaps applying AMPC, C2J, Jazillian, or similar software to the InChI library would offer the best of both worlds. That is, assuming these approaches can be made to work.
A future article will detail my attempts to translate the InChI library to Java with C2J.
The Final Word
The limited nature of the AMPC demo prevents me from evaluating whether the full version can be used to compile real libraries, like InChI, directly into Java bytecode. However, if my experiences with the demo version are predictive, AMPC may well be a viable option for chemical informatics integration efforts.
Toward an Open, Worldwide Chemical Information Network
...Whatever your views of the present situation may be, I think there is general agreement that more attention will be given in the next few years to the information network concept. The hardware capability for such a network is well assured; in fact, the capability exists today. The real question is when, and under what conditions, the chemical community will determine that an economic need exists for a network that will tie together a wide range of chemical information services.
-Walter M. Carlson J. Chem. Doc. 1965, 5, 1-3
Several online chemical information services, including PubChem, NMRShiftDB, and ZINC, have emerged in a relatively short period of time. As these systems go from being toys for hackers to essential components of scientific workflow, their true potential will be unlocked by developing innovative ways to tie these disparate systems together.
This is not unlike the situation Carlson was describing in his 1964 luncheon speech before the ACS Division of Chemical Literature. Technologies have changed radically, but the fundamental problem of integrating disparate chemical information systems remains unsolved and ripe with possibilities.
A future in which Chemical Abstracts Service no longer dominates the collection and distribution of chemical information is looking more possible than ever before. If recent history is any guide to this future, we can look to an array of semi-independent, open systems using open standards and operating on a global scale to become the new focal point. In fact, the capability exists today.

