Cheminformatics in Any Language with MX Part 1: Scala
One of the best ways to keep current with new developments in programming is to try new languages. Fortunately, there are so many to choose from that finding an unfamiliar language is less of a problem than finding one that's interesting enough to make the effort worthwhile.
In this new series of articles, we'll explore many of the most interesting programming languages through the lens of cheminformatics. Our test subject will be MX, the lightweight, cross-platform, cross-language cheminformatics toolkit written in Java. These tutorials are not designed to teach you a new language. Instead, they will focus on quickly getting you to the point of being able to perform basic cheminformatics operations within the target language.
Why Scala?
Scala is a general-purpose programming language developed by Martin Odersky and first released in 2003. It was designed to embrace two important programming paradigms: object-oriented programming and functional programming. Scala, like many other languages, has been implemented on the Java Virtual Machine (JVM), positioning it to take full advantage of a rich selection of useful libraries.
For Web developers, Scala comes complete with a few interesting frameworks, the most popular of which appears to be Lift. Because the entire framework runs on a standard JVM, there's no need to change the way you've come to rely on deploying Java-based web applications. Apparently, Twitter is finding Scala to be very useful as a platform for creating Web applications that scale.
As will be true of most of the languages in this series, my knowledge of Scala is extremely limited and pieced together from Web resources assembled on the spur of the moment and in great haste. Your comments, corrections, and suggestions are welcome.
Getting Started
Scala can be installed by downloading the current version and following these instructions. Make sure to update your environment variables. On my Ubuntu Linux system, the installation worked without a hitch. This tutorial will guide you through the basics of the language and its toolset.
Working With MX
Let's begin by creating a molecule with Scala using MX. To do this, we'll need to let Scala know where to find the MX library. Download the current MX jarfile and place it in your working directory. We can then invoke an interactive Scala shell with MX support through:
$ scala -classpath mx-0.108.1.jar
We can now load benzene and verify that it has the correct number of bonds.
scala> import com.metamolecular.mx.io.Molecules
import com.metamolecular.mx.io.Molecules

scala> val benzene = Molecules.createBenzene
benzene: com.metamolecular.mx.model.Molecule = com.metamolecular.mx.model.DefaultMolecule@2d5534

scala> benzene.countBonds
res0: Int = 6
We can also determine that benzene has one ring, as expected:
scala> import com.metamolecular.mx.io.Molecules
import com.metamolecular.mx.io.Molecules

scala> val benzene = Molecules.createBenzene
benzene: com.metamolecular.mx.model.Molecule = com.metamolecular.mx.model.DefaultMolecule@117b450

scala> import com.metamolecular.mx.ring.HanserRingFinder
import com.metamolecular.mx.ring.HanserRingFinder

scala> val finder = new HanserRingFinder
finder: com.metamolecular.mx.ring.HanserRingFinder = com.metamolecular.mx.ring.HanserRingFinder@1c0cb76

scala> val rings = finder.findRings(benzene)
rings: java.util.Collection[java.util.List[com.metamolecular.mx.model.Atom]] = [[com.metamolecular.mx.model.DefaultMolecule$AtomImpl@fe087b, com.metamolecular.mx.model.DefaultMolecule$AtomImpl@1def3f5, com.metamolecular.mx.model.DefaultMolecule$AtomImpl@62974e, com.metamolecular.mx.model.DefaultMolecule$AtomImpl@bbaa16, com.metamolecular.mx.model.DefaultMolecule$AtomImpl@9ba045, com.m...

scala> rings.size
res0: Int = 1
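As a cross-check on the "one ring" result, the expected count can be worked out without any toolkit. The following is a minimal sketch, assuming the molecule is supplied as a simple hydrogen-suppressed graph with no duplicate bonds; RingCount and its members are hypothetical helpers and are not part of MX. It computes the cyclomatic number (bonds minus atoms plus connected components), which is the number of independent rings.

```scala
// Hypothetical helper, not part of MX: treats a molecule as a plain undirected graph.
object RingCount {

  /** Cyclomatic number: bonds - atoms + connected components.
    * Atoms are numbered 0 until atomCount; each bond is a pair of atom indices. */
  def cyclomaticNumber(atomCount: Int, bonds: Seq[(Int, Int)]): Int = {
    // Simple union-find over atom indices, used only to count components.
    val parent = Array.tabulate(atomCount)(identity)

    def find(i: Int): Int = {
      var root = i
      while (parent(root) != root) root = parent(root)
      root
    }

    for ((a, b) <- bonds) {
      val ra = find(a)
      val rb = find(b)
      if (ra != rb) parent(ra) = rb
    }

    val components = (0 until atomCount).map(find).distinct.size
    bonds.size - atomCount + components
  }

  // Hydrogen-suppressed benzene: six carbons joined in a single cycle.
  val benzeneBonds: Seq[(Int, Int)] = (0 until 6).map(i => (i, (i + 1) % 6))
}
```

For benzene, RingCount.cyclomaticNumber(6, RingCount.benzeneBonds) evaluates to 1. For fused systems an exhaustive ring finder can report more rings than this number: naphthalene has a cyclomatic number of 2 but three rings in total once the ten-membered outer ring is counted.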
Conclusions
Scala is a neat little language with a lot to offer. With MX and interactive Scala, it's possible to quickly explore ways this new language can be used in cheminformatics.
Credit: thanks to Rajarshi Guha for some of the inspiration for this series.
MX Performance Comparison #3: Substructure Search in MX and CDK
Substructure search is a fundamental cheminformatics operation. MX, the open source cheminformatics toolkit, contains an implementation based on the VF monomorphism algorithm. How fast is it? Let's compare it to CDK's UniversalIsomorphismTester.

The full report is available here. The full source code can be found on GitHub.
This test reads the molecules contained in a 416-record SD file into memory during setup. Then, during the test phase, each of these molecules is compared for a substructure relationship to a benzene molecule. As you can see, MX ran this test nearly five times faster than CDK.
MX and CDK differ in the algorithms used for substructure match. Whereas MX uses a variant of VF, CDK uses a variant of Ullmann. As noted by the VF creators, these two algorithms have very different performance characteristics, with VF always outperforming Ullmann. The performance gap increases quickly with increasing graph size.
CDK will soon have a new substructure matcher based on Rajarshi Guha's implementation. It will be interesting to directly compare this new CDK matcher to the one used by MX.
As is the case with the MX/CDK ring perception comparison, it should be noted that the MX substructure matcher implementation is optimized for readability and correctness, but not performance. A number of interesting opportunities exist for increasing the performance of the MX substructure matcher.
One point to note: MX and CDK differed in the number of hits found, with MX detecting all 416 and CDK finding 412. This is most likely due to the presence of isotopically-labelled benzenes in the dataset. Depending on your interpretation of a substructure match, either CDK or MX could be returning the "correct" answer.
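The two interpretations can be made concrete with a small sketch. It is not taken from MX or CDK, and it makes no claim about how either toolkit actually compares atoms; AtomLabel and both predicates are hypothetical names introduced only for illustration.

```scala
// Hypothetical illustration of two atom-matching policies; not MX or CDK code.
object AtomMatchSketch {

  /** Minimal stand-in for an atom: element symbol plus an optional isotope mass number. */
  final case class AtomLabel(symbol: String, isotope: Option[Int] = None)

  /** Lenient policy: element symbols must agree; isotope labels are ignored. */
  def matchesIgnoringIsotope(query: AtomLabel, target: AtomLabel): Boolean =
    query.symbol == target.symbol

  /** Strict policy: element symbols and isotope labels must both agree. */
  def matchesWithIsotope(query: AtomLabel, target: AtomLabel): Boolean =
    query.symbol == target.symbol && query.isotope == target.isotope
}
```

Under the lenient policy an unlabelled query carbon matches a carbon-13 atom in the target, so a labelled benzene counts as a hit. Under the strict policy it does not. A difference of this kind would be enough to produce slightly different hit counts on the same dataset.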
MX Performance Comparison #2: Exhaustive Ring Perception in MX and CDK
Benchmarking can be a useful first step in optimizing the performance of software. Recently a group of developers including myself began creating an open suite of benchmarks for cheminformatics. Currently, two open source cheminformatics toolkits are included: MX and CDK.
Ring perception is the foundation of many cheminformatics algorithms, so performance is an important issue. How do MX and CDK compare? See for yourself:

This benchmark finds all rings in a collection of 416 substituted benzenes created from a PubChem query. Timing starts after an in-memory collection of hydrogen-suppressed molecules has been created to avoid differences in IO performance. As you can see, MX is about 44% faster than CDK. Both toolkits find the same number of total rings in the dataset (2,179).
To run the benchmark yourself, use the GitHub repository.
One anecdotal observation: The number of iterations (10 warmup, 5 test) is lower than usual because CDK appeared to run slower and slower with each iteration. By the time 18 iterations had been made, my system was at a standstill. The cause is not clear. The setup as run avoids this behavior.
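The warmup-then-measure pattern behind those iteration counts can be sketched in a few lines. This is only an illustration of the idea and says nothing about how Japex is implemented or configured; MiniBench and meanNanos are hypothetical names.

```scala
// Hypothetical micro-benchmark helper; illustrates warmup plus timed runs, not Japex itself.
object MiniBench {

  /** Runs `body` `warmup` times without timing, then `runs` times with timing.
    * Returns the mean wall-clock time per timed run, in nanoseconds. */
  def meanNanos(warmup: Int, runs: Int)(body: => Unit): Double = {
    require(warmup >= 0, "warmup must not be negative")
    require(runs > 0, "runs must be positive")

    // Warmup iterations give the JVM a chance to load classes and JIT-compile hot code.
    var i = 0
    while (i < warmup) { body; i += 1 }

    // Timed iterations.
    val start = System.nanoTime()
    i = 0
    while (i < runs) { body; i += 1 }
    (System.nanoTime() - start).toDouble / runs
  }
}
```

With the counts used above, a call would look like MiniBench.meanNanos(warmup = 10, runs = 5) { findAllRings() }, where findAllRings stands in for whatever operation is being measured. A dedicated framework does considerably more than this, which is one reason to prefer it for published comparisons.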
Both CDK and MX implement the Hanser algorithm, although even a quick glance at the respective sources will reveal big differences in implementation. The MX implementation was optimized for readability and correctness, but not performance. As such, there may be some low-hanging fruit to be had from the simplest of optimizations.
For more details, see the full report.
Open Benchmarks for Cheminformatics: First Performance Comparison Between CDK and MX
The previous article in this series discussed Japex in the context of creating open cheminformatics benchmarks. If you're not familiar with it, Japex is a microbenchmarking framework written in Java that does for benchmarking what Ant does for building projects. Among its many interesting features is the ability to generate bar charts for performance comparisons.
Recently I finished building the first direct performance comparison between CDK and MX, two open source cheminformatics toolkits. The chart below summarizes the test.

You can read the full report for yourself here. This test compares the relative speed of loading a 33-record SD file and summing the calculated molecular masses from each record. As you can see, CDK is about 19% faster than MX on the system I looked at.
It should be pointed out that I'm no expert with Japex, so it's possible that I've introduced a source of error into this comparison that could affect the outcome.
Benchmarking is clearly a process, not an endpoint. In the months ahead, expect to see many more benchmarking comparisons, both between MX and other toolkits, and within MX itself.
Build a RESTful Chemical Registration System from Scratch Part 1: Tools of the Trade
A chemical registration system forms the core of most database-driven cheminformatics applications. Yet detailed instructions, in the traditional literature or otherwise, on how to create one from free components are surprisingly rare. This article introduces a new Depth-First series aimed at bringing together several tutorials written over the last year to create a RESTful chemical registration system that anyone can build, run, and adapt to their own needs.
Defining the Problem
Whether you're building or designing a database-driven chemical informatics system, at some point you'll face the problem of getting molecules into and out of your database. This is where chemical registration systems come in. eMolecules has created a summary on the subject. It defines the main responsibilities of a chemical registration system as ensuring:
Structural novelty - The same molecule never gets stored twice.
Structural normalization - When multiple representations of a molecule are possible (e.g., tautomers and charge-separated forms), only one is used.
Structure drawing - Present a chemical structure recognizable to chemists.
Consistent relationships among related compounds - The system must decide what to do with various salt forms (or other mixtures) of a particular compound a user might decide to register. There are many options, but they must be applied consistently.
Reasonable behavior when a structure is (partially) unknown - Not every compound of interest will have a known chemical structure. Sometimes the structure will only be partially known, as in the case of double bond geometry and absolute stereochemistry.
Security - Enough said? There's always more.
Reasonable behavior when changes are made to structures - The system must be able to respond well to the inevitable: a user changes their mind about the structure s/he entered.
The system we'll build in this tutorial won't deal with all of these responsibilities, but it will handle most of them. In addition, it will address some other problems as well.
The Approach
We'll be building a Web Service, which is defined by the W3C as "a software system designed to support interoperable machine-to-machine interaction over a network."
The reason is simple: we want our chemical registration system to be addressable from anywhere in the world, and we want to use it as an interchangeable, technology-agnostic, loosely-coupled component to build more complex chemistry Web applications.
We want this system to be as easy to deploy as possible on any hardware. Lengthy configuration processes, source code compilation, and exotic dependencies are out. Drag-and-drop deployment, self-contained packaging, convention over configuration, and platform-independent binaries are in.
REST?
There are currently multiple competing approaches for creating Web Services. One of the most flexible and straightforward to implement is Representational State Transfer (REST). In a nutshell, REST leverages the full HTTP protocol for passing messages to and from the server. This simple idea has some powerful implications for the design of the system, which will be explored in articles to come.
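To make "leveraging the full HTTP protocol" a little more concrete, here is a minimal sketch of how HTTP methods and resource paths might map onto registry operations. The /compounds URL layout, the action names, and the route function are all assumptions made for illustration; the series has not yet defined its resource design, and no server library is involved.

```scala
// Hypothetical routing table for a registry Web Service; the URL layout is an assumption.
object RestSketch {

  sealed trait Action
  case object ListCompounds extends Action
  case object RegisterCompound extends Action
  final case class FetchCompound(id: String) extends Action
  final case class ReplaceCompound(id: String) extends Action
  final case class DeleteCompound(id: String) extends Action
  case object NotSupported extends Action

  /** Maps an HTTP method and request path to the operation it would trigger. */
  def route(method: String, path: String): Action = {
    // "/compounds/42" becomes List("compounds", "42"); empty segments are dropped.
    val segments = path.split('/').filter(_.nonEmpty).toList

    (method.toUpperCase, segments) match {
      case ("GET", List("compounds"))        => ListCompounds
      case ("POST", List("compounds"))       => RegisterCompound
      case ("GET", List("compounds", id))    => FetchCompound(id)
      case ("PUT", List("compounds", id))    => ReplaceCompound(id)
      case ("DELETE", List("compounds", id)) => DeleteCompound(id)
      case _                                 => NotSupported
    }
  }
}
```

The point of the sketch is that the verb carries the intent and the path names the resource, so no operation name needs to be embedded in the URL or the message body.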
Tools
We will use a number of free, open technologies in the creation of our system:
Technology Platform - Java will be used exclusively due to its massive installed base, platform-independence, and high performance.
Cheminformatics Toolkit - MX will supply the main interface between chemistry and Java.
Unique Identifier - InChI will be used to assign unique identifiers to compounds stored in the registry.
Server - Jetty will supply basic HTTP functionality.
Servlet - Restlet will simplify the implementation of REST using the servlet specification.
Database - H2 will provide fast, portable, zero-administration SQL support.
Object Persistence - The exact method of persisting Java objects hasn't been settled yet, but Active Objects looks quite interesting, especially when combined with H2.
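The structural novelty requirement, combined with a unique identifier such as InChI, suggests a simple lookup-before-insert pattern. The following is a minimal in-memory sketch, assuming a canonical identifier string has already been computed for each structure. RegistrySketch, Registry, and register are hypothetical names, and nothing here reflects the persistence design the series will eventually settle on.

```scala
// Hypothetical in-memory registry keyed by a canonical identifier string (e.g. an InChI).
object RegistrySketch {

  /** Immutable registry: maps each identifier to the registry ID it was assigned. */
  final case class Registry(entries: Map[String, Int] = Map.empty, nextId: Int = 1) {

    /** Registers an identifier and returns the updated registry plus the assigned ID.
      * An identifier that is already present gets its existing ID back, so the
      * same structure is never stored twice. */
    def register(identifier: String): (Registry, Int) =
      entries.get(identifier) match {
        case Some(existingId) =>
          (this, existingId)
        case None =>
          val updated = copy(entries = entries + (identifier -> nextId), nextId = nextId + 1)
          (updated, nextId)
      }
  }
}
```

Registering the same identifier twice returns the same ID both times and leaves the registry with a single entry. The approach is only as good as the identifier: two representations of one compound are recognized as duplicates only if they reduce to the same string, which is where structural normalization comes in.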
Conclusions
Chemical registration systems play a vital role in enabling data-driven chemistry applications. This article introduced the problems registration systems typically solve and outlined a plan for implementing one using only free, open components. The next article in this series will discuss the design of the registry Web Service.
Image Credit: Phillip Torrone

