Introducing MX: Lightweight and Free Cheminformatics Tools for Java 5

Posted by Rich Apodaca Fri, 21 Nov 2008 18:28:00 GMT

If you want to build cheminformatics software of any kind, you'll need a basic toolkit. Ideally, this toolkit contains all of the low-level functionality used over and over in your projects. Tools for building an in-memory molecular representation, exact- and substructure comparison, and reading/writing molfiles all fall into this category. Also ideally, this toolkit should be free. Not free in the sense of free to use if you work at a university, free to try, or even free to use provided that you make your changes public when you redistribute the toolkit. But free in the sense of "do whatever you want with it and all you have to do is include a copyright notice."

This article introduces MX, a suite of lightweight and free cheminformatics tools for Java designed to fill these needs.

Download

A Google Code page has been set up for MX. Both a source distribution and compiled jarfile representing MX in its current state can be downloaded.

A subsequent article will show how to get started with MX.

Origins: A Chemical Structure Editor for Web Applications

In 2007 my company, Metamolecular, set out to build a lightweight and easy-to-use chemical structure editor for Web applications. Realizing the increasing importance the Web would play as a chemical communication medium in the next decade, a truly Web-based, platform-independent alternative to ChemDraw and ISIS/Draw seemed to be a good direction to pursue. The resulting product became known as ChemWriter.

Minimizing deployment footprint was a key consideration with ChemWriter; the last thing a chemist using a Web site wants to spend his or her time doing is waiting around for a large applet to download. With many chemical structure editor applets available today, download times on the order of one minute or longer are not uncommon. This is simply unacceptable.

To create ChemWriter, an ultra-lightweight cheminformatics toolkit was need. How lightweight? We were targeting 100 KB for the complete editor. A good chemical structure editor is a fairly complex piece of UI software involving multiple drawing tools with state-dependent behavior, not to mention some fairly sophisticated vector graphics rendering and molfile input/output. The only way we could reach our 100 KB target for ChemWriter is if the basic cheminformatics toolkit were 20 KB or smaller.

At the time, there was no cheminformatics toolkit, free or otherwise, that could fill the need.

So it was created from scratch.

High Performance in ChemPhoto

Eventually, the same cheminformatics toolkit used in ChemWriter was adapted for ChemPhoto, the chemical structure imaging application.

ChemPhoto was designed to dynamically display 100,000 or more 2D chemical structures in a grid-like GUI using minimal memory. Rather than pre-loading all 100,000 molecule objects into memory, which would not be feasible on most systems, ChemPhoto uses a lazy approach in which an in-memory index of the target SD file is built. Every time a new structure needs to be displayed to the user during a scrolling event, it's created from scratch: the molfile text is loaded from disk, a molecule object is created, the molecule is rendered, and then the entire construct is thrown away.

The performance of ChemPhoto was so good, even though everything was being created on demand and immediately thrown away, that it appeared the cheminformatics toolkit being used had potential in high-performance situations as well.

Substructure Search and Mapping

Recently, Rajarshi Guha reported his port of the VF library to Java for use with the Chemistry Development Kit (CDK). This began a thought process starting with "how can it be improved" and leading to the conclusion that the creation of flexible, Java-centric substructure search utilities would offer the most bang for the buck. A subsequent article described a simple strategy that could be used to get there.

To implement this idea, a cheminformatics toolkit was needed. The one used successfully in ChemWriter and ChemPhoto was an ideal candidate.

The result, a complete substructure search and mapping utility built from scratch, is available in MX under the package com.metamolecular.mx.map.

Free to Use Anytime, Anyplace - No Strings Attached

Licenses can be a problem with nearly all open source cheminformatics toolkits. If your work is mostly done in an academic environment for free, you're likely to experience no problem at all. However, if you run a company that sells licenses to software containing code you'd rather not reveal to the world, the reciprocity provisions in licenses such as those in the GPL, Mozilla (MPL), and IBM (CPL) families lead to major problems.

The problem isn't so much the open source license itself - it's the fact that the original copyright owners either won't give their permission to dual-license their contributions, or in many cases, can't even be tracked down to ask.

This is an unacceptable position for a software distributor wanting to use open source as a cost-effective means to boost their developer productivity.

To address these issues, MX is being distributed under the extremely permissive MIT License. In a nutshell it says you are free to modify and incorporate MX into any software you distribute without any obligation to release a line of your own source code. It also says if MX doesn't do the job, you're on your own. And that's about all it says. Your only obligation is to include the original copyright notice on all copies or substantial portions of the software.

To my knowledge, only one major cheminformatics toolkit is licensed under an academic-style open source license - RDKit, which is licensed under the New BSD License.

Conclusions

A basic cheminformatics toolkit is a vital component of most chemistry-related software. For maximal cost-effectiveness as a software distributor, a free toolkit licensed under a permissive open source license is ideal. MX is a free and lightweight cheminformatics toolkit written in Java that has been used successfully in two commercial products.

Future articles will describe the many ways MX can be used and extended.

Substructure Search From Scratch in Java Part 1: The Atom Mapping Problem

Posted by Rich Apodaca Mon, 17 Nov 2008 19:17:00 GMT

One of the most important capabilities in cheminformatics is mapping the atoms of a query structure onto the atoms of a target structure. Although useful in itself, the main value of atom mapping comes from the software that gets built on top of it: exact structure comparators, substructure search systems, and query atom/bond search systems such as SMARTS. The fundamental nature of atom mapping means that correctness, efficiency and adaptability are essential features of a good mapping implementation. Recently, a D-F article made the case that atom mapping software written in Java needs to be Java-centric to achieve these goals. This article, the first in a series that describes a complete substructure search system written in Java, takes the first step by offering some simple interface definitions and code for the atom mapping problem.

The Problem

Given a query molecule (query) and a target molecule (target), our atom mapping software needs to find ways to match the atoms of query onto target such that the mapping describes a substructure embedded in target. The software might stop at one mapping, continue on to find all of them, or stop at some point in the middle. It all depends on the specific cheminformatics problem we're trying to solve.

The Recursive Function

Our implementation will gradually build up an atom mapping by traversing the atoms of query in depth-first order and trying to map each found atom onto an atom in target. At each step in the process, we will have a partial atom map that maps some of the atoms in query onto target. That map, and any other information needed to complete the analysis will be kept in an instance of a class implementing the State interface.

A State will be manipulated by a recursive method, mapFirst that returns when the first atom map is found:

// create a list to hold atom maps
List<Map<Atom, Atom>> maps = new ArrayList<Map<Atom, Atom>>();

// create initial state
State state = ...; 

boolean mapFirst(State state)
{
  if (state.isDead())
  {
    return false;
  }

  if (state.isGoal())
  {
    maps.add(state.getMap());

    return true;
  }

  boolean found = false;

  while (!found && state.hasNextCandidate())
  {
    Match candidate = state.nextCandidate();

    if (state.isMatchFeasible(candidate))
    {
      State nextState = state.nextState(candidate);
      found = mapFirst(nextState);

      nextState.backTrack();
    }
  }

  return found;
}

Comparison of the mapFirst method to the pseudocode VF algorithm Match procedure given in the previous article shows some similarities. In fact, something similar to the mapFirst method forms the basis of many atom mappers in use today.

Although it may be clear from the code, it's worth re-iterating that each time mapFirst is recursively called, an attempt is made to branch off a new State that maps an additional pair of atoms from query to target. If that branch leads to a possible solution, it's followed. Otherwise the next possible mapping is explored.

The State Interface

The recursive mapFirst method determines all of the methods the State interface needs to define:

public interface State
{
  /**
   * Returns the current mapping of query atoms onto target atoms.
   * This map is shared among all states obtained through nextState.
   */
  public Map<Atom, Atom> getMap();

  /**
   * Returns true if another candidate match can be found or
   * false otherwise.
   */
  public boolean hasNextCandidate();

  /**
   * Returns the next candidate match.
   */
  public Match nextCandidate();

  /**
   * Returns true if the given match will work with the current
   * map, or false otherwise.
   */
  public boolean isMatchFeasible(Match match);

  /**
   * Returns true if all atoms in the query molecule have been
   * mapped.
   */
  public boolean isGoal();

  /**
   * Returns true if no match will come from this State.
   */
  public boolean isDead();

  /**
   * Returns a state in which the atoms in match have been
   * added to the current mapping.
   */
  public State nextState(Match match);

  /**
   * Returns this State's atom map to its original condition.
   */
  public void backTrack();
}

Finally, State uses an instance of the Match class:

public class Match
{
  private Atom query;
  private Atom target;

  public Match(Atom query, Atom target)
  {
    this.query = query;
    this.target = target;
  }

  public Atom getQueryAtom()
  {
    return query;
  }

  public Atom getTargetAtom()
  {
    return target;
  }
}

Conclusions

With just a few lines of Java, we've managed to reduce the fundamental cheminformatics problem of atom mapping to the far simpler problem of implementing the State interface.

How many ways are there to implement the State interface? Probably as many as there are subgraph isomorphism algorithms. Notice that the way we've set up the problem lets us use the same recursive method to test all State implementations, an essential prerequisite for benchmarking and optimization.

Future articles in this series will describe one way to implement the State interface.

Image Credit: Dollar Bin

One of These Things is Not Like The Others 11

Posted by Rich Apodaca Thu, 13 Nov 2008 01:49:00 GMT

You can't get very far in cheminformatics without the ability to compare one molecule to another to find either an exact structure or substructure match. For example, if you want to build chemical databases, a good substructure matcher comes in very handy. As luck would have it, the substructure match problem (a variant of the subgraph isomorphism problem) is both computationally expensive and difficult implement. This article discusses one approach to the problem.

Background

Recently, Rajarshi Guha described some benchmarking studies suggesting that it was possible to greatly improve the speed of the Chemistry Development Kit (CDK) substructure matching code. His code employed the widely-used Ullmann algorithm.

There's just one problem: the Ullmann algorithm detects edge-induced isomporphisms. This means, for example, that if your query molecule is propane and your test molecule is cyclopropane, you won't find a match with an Ullmann-backed tool. I'm still not sure if it's possible to modify an Ullmann implementation to make its matches node-induced. Based on the implementations I've seen, the answer appears to be "no."

For substructure matching, we need an atom-induced isomorphism algorithm.

What's Wrong with Existing Implementations?

To begin with, it must be pointed out that working isomorphism code is valuable and hard-won.

Having said that, many Java implementations are written in a way that makes optimization difficult at best. Some start out as C code that then gets ported, mostly verbatim. Other are written with an understandable emphasis on speed over readability. For developers used to working with classes, objects, shallow loops, and short methods with expressive names, the impedance mismatch can be jarring to say the least.

Here's an example, taken from the CDK, that while functional, would take a great deal of time to understand well enough to change:

public static List makeAtomsMapOfBondsMap(List l, IAtomContainer g1, IAtomContainer g2) {
  if(l==null)
    return(l);
  List result = new ArrayList();
  for (int i = 0; i < l.size(); i++) {
    IBond bond1 = g1.getBond(((RMap) l.get(i)).getId1());
    IBond bond2 = g2.getBond(((RMap) l.get(i)).getId2());
    IAtom[] atom1 = BondManipulator.getAtomArray(bond1);
    IAtom[] atom2 = BondManipulator.getAtomArray(bond2);
    for (int j = 0; j < 2; j++) {
      List bondsConnectedToAtom1j = g1.getConnectedBondsList(atom1[j]);
      for (int k = 0; k < bondsConnectedToAtom1j.size(); k++) {
        if (bondsConnectedToAtom1j.get(k) != bond1) {
          IBond testBond = (IBond)bondsConnectedToAtom1j.get(k);
            for (int m = 0; m < l.size(); m++) {
              IBond testBond2;
              if (((RMap) l.get(m)).getId1() == g1.getBondNumber(testBond)) {
                testBond2 = g2.getBond(((RMap) l.get(m)).getId2());
                for (int n = 0; n < 2; n++) {
                  List bondsToTest = g2.getConnectedBondsList(atom2[n]);
                  if (bondsToTest.contains(testBond2)) {
                    RMap map;
                    if (j == n) {
                      map = new RMap(g1.getAtomNumber(atom1[0]), g2.getAtomNumber(atom2[0]));
                    } else {
                      map = new RMap(g1.getAtomNumber(atom1[1]), g2.getAtomNumber(atom2[0]));
                    }
                    if (!result.contains(map)) {
                      result.add(map);
                    }
                    RMap map2;
                    if (j == n) {
                      map2 = new RMap(g1.getAtomNumber(atom1[1]), g2.getAtomNumber(atom2[1]));
                    } else {
                      map2 = new RMap(g1.getAtomNumber(atom1[0]), g2.getAtomNumber(atom2[1]));
                    }
                    if (!result.contains(map2)) {
                      result.add(map2);
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  return (result);
}

VFLib

Rajarshi's implementation of substructure search was based on a Java port of the VFLib C++ library. VFLib was developed by an Italian group to compare the performance of the VF algorithm with that of Ullmann.

VFLib defines a single interface (State) that a variety of subgraph isomorphism matchers can implement in order to work interchangeably.

What makes this so interesting is that when you can boil a software problem down to implementing an interface, it can become orders of magnitude simpler. But more on that later.

Another interesting aspect of VFLib is that the code can be easily converted from an edge-induced implementation to a node-induced implementation. In other words, if we had a Java port of the VFLib2 code, we could begin to build families of Java-based substructure matchers that could be easily compared and optimized.

The View from 10,000 Feet

One of the difficult aspects of implementing subgraph isomorphism algorithms is dividing the process up into understandable chunks. One way forward might be to look for commonalities among all of the approaches currently used. What might those be? Here are some possibilities:

  • Recursion. At the heart of any implementation typically lives a method that repeatedly calls itself (without creating a stack overflow).

  • Gradual accumulation of state. What's that recursive method doing? Building up a map of the atoms from a query structure to a target structure, one pair of atoms at a time. Sometimes it fails and needs to go back to the last successful match. Sometimes it succeeds and needs to report that information to avoid accessing an out-of-bounds index. At every stage, the accumulated state must be sufficient to finish the mapping attempt.

  • Mapping comes for free. The implementation typically uses an internal map to keep track of what it's done, so getting one mapping (or more) of the query structure onto the target tends to be as easy as simply detecting that a match exists.

  • Optimization heuristics. Where to begin, what order to compare structural features, and what features should be compared anyway? The possibilities for taking advantage of simple optimization rules are significant. It should, therefore, be easy to run many implementations side-by-side in performance tests.

A paper describing the VF algorithm, and the way VFLib implements it is freely available.

In it, a high-level overview of the VF algorithm is presented:

PROCEDURE Match(s)
  INPUT: an intermediate state s; the initial state s0 has M(s0)=∅
  OUTPUT: the mappings between the two graphs
  IF M(s) covers all the nodes of G2 THEN
    OUTPUT M(s)
  ELSE
    Compute the set P(s) of the pairs candidate for inclusion in M(s)
    FOREACH (n, m)∈ P(s)
      IF F(s, n, m) THEN
        Compute the state s' obtained by adding (n, m) to M(s)
        CALL Match(s')
      END IF
    END FOREACH
     Restore data structures
  END IF
END PROCEDURE

The Match(s) procedure plays the role of recursive function, while s and s' play the dual roles of state accumulators and feature comparators.

VFLib, together with the paper describing it, does a good job of breaking the process up into manageable chunks from which unit tests, interface definitions, and ultimately working code can created in a variety of languages.

Conclusions

Substructure matching is one of the most difficult and most useful cheminformatics tasks. Although many Java cheminformatics toolkits support substructure search, their implementations can be difficult to understand, modify, and optimize. VFLib has some interesting features that could help to change that.

Image Credit: Santa Rosa OLD SKOOL

Fast Substructure Search Using Open Source Tools Part 1: Fingerprints and Databases 6

Posted by Rich Apodaca Thu, 02 Oct 2008 23:27:00 GMT

For anyone working in a chemistry-related job, chemical databases are ubiquitous. A printed list of IUPAC names, a spreadsheet containing CAS numbers, and a set of hand-drawn structures on index cards are all primitive chemical databases. They aren't nearly as useful as they could be to either the creator or his/her collaborators, but they are databases nevertheless. Anyone who has spent time in industry or academics knows that these low-tech chemical databases are everywhere. And they become more of a problem as more information is moved into electronic format.

All articles in this series:

The Problem: Structure Search is Hard

Many of the low-tech chemical databases that professional chemists routinely share and work with would become orders of magnitude more useful if they were converted into substructure-searchable databases and published to the Web. Although there has been a great deal of effort toward this end in the last few years, there's still much, much more that could be done.

One of the main problems in creating a substructure-searchable chemical database is implementing the substructure search capability itself. This one requirement has done more to stifle the free flow of chemical information than perhaps any other. Solving the problem appears very difficult on first or second glance, and it is very difficult if you don't have the right tools. Many companies offer solutions - but at a price, both in terms of money and time, that is simply out of reach.

What can you do if you're just getting started with modest requirements and budget?

About This Series

This article, the first in a series, will describe the creation of a chemical substructure search engine using exclusively well-maintained and robust open source tools: Open Babel for generating fingerprints and peforming atom-by-atom searches; MySQL as a relational database; and Ruby as a scripting language.

Each of these three components is a commodity that can be replaced with any one of a number of open-source or proprietary substitutes, maximizing flexibility and minimizing vendor lock-in.

Other Resources

Norbert Haider of the University of Vienna has written a very useful tutorial on creating a structure-searchable database using free tools, which is part of a larger series. That series differs from this one in the technology stack used and the level of detail to be provided. The series of articles to appear here will spell out the low-level series of steps needed to create a working substructure search system. It's hoped that taking this perspective makes clear the steps needed to apply the approach to alternative technology platforms.

Binary Fingerprints and Relational Databases

At the heart of the system we'll build is the chemical fingerprint which is a (usually) lossy binary representation of a chemical structure. Creating a binary fingerprint is like putting every chemical structure, known or unknown into just one bin out of a very large, but finite set of bins. Although the same molecule is guaranteed to always go into the same bin, more than one molecule can be placed into each bin. This is a general feature of all hashing schemes.

Andrew Dalke has written an excellent series of articles on fingerprints and what can be done with them. Another good overview is available from Daylight. This article will assume you know what fingerprints are and how they can be used to compare chemical structures.

The problem with binary fingerprints is that they are generally several hundred bits long - too long to be represented in a form that allows direct and rapid query by a relational database system. They need to be broken up - but how?

A widely-used approach (and the one that will be taken here) involves breaking up the fingerprint into a series of integers that are stored in the database.

For example, let's say we have a 1024-bit fingerprint. We could represent this as a number from 0 to 2^1024, which of course is way to big for most computers to handle today. We could, however, represent this fingerprint as a series of sixteen 64-bit integers (which are available on most systems).

So, the binary fingerprint:

1111111101111111110110111011011000101000011000011010011100010000
1001100010101101000110100010110011101100100000100100000111010100
0101010000101011001010011001000100011001100000101100111010001110
1001000101001010000001011001100101101011111111011000111100000111
1010101100100101000100001100011001010111001001110101101100010010
0011101011101110110011111010000010111001100101001001101010110001
1100111000010100000100110111101001011100010111010001010101101101
0010001111111010111011110110000000001010111011111001111001111101
0101011100011111110111011110011110100110010110010101011001011111
0110100001111001101111011101001101101001000100010001100101111000
0011111001000100001111111110001100111001101000000100010010010110
0000011101001001011000111110101110010101110001111010100001100100
0100100111101010110101101010110110101010110110111011011001111111
0011100100101101101001000001000111110101011101110101101001101001
0110100100111001111001001111110111111001110100100110010100011110
0010101100101000011110101110111011001110101111100001011010101100

could also be represented as this decimal fingerprint (assuming your machine is big-endian):

18410675377121896208
11001478244984832468
6064987026359504526
10469186440276053767
12332281598675737362
4246559787872197297
14849515287603909997
2592647731284516477
6277980392575817311
7528256967824972152
4486781373924787350
525060695046727780
5326305550703244927
4120129631153511017
7582343227124114718
3109870708788696748

We can easily store this set of 16 numbers in a relational database table. For example, if we had a MySQL database called "compounds", we could create a "fingerprints" table:

mysql> create database compounds;
Query OK, 1 row affected (0.02 sec)

mysql> use compounds;

Database changed
mysql> create table fingerprints(id int not null auto_increment, primary key(id), fp0 bigint(64), fp1 bigint(64), fp2 bigint(64), fp3 bigint(64), fp4 bigint(64), fp5 bigint(64), fp6 bigint(64), fp7 bigint(64), fp8 bigint(64), fp9 bigint(64), fp10 bigint(64), fp11 bigint(64), fp12 bigint(64), fp13 bigint(64), fp14 bigint(64), fp15 bigint(64));
Query OK, 0 rows affected (0.01 sec)

mysql> describe fingerprints;
+-------+------------+------+-----+---------+----------------+
| Field | Type       | Null | Key | Default | Extra          |
+-------+------------+------+-----+---------+----------------+
| id    | int(11)    | NO   | PRI | NULL    | auto_increment | 
| fp0   | bigint(64) | YES  |     | NULL    |                | 
| fp1   | bigint(64) | YES  |     | NULL    |                | 
| fp2   | bigint(64) | YES  |     | NULL    |                | 
| fp3   | bigint(64) | YES  |     | NULL    |                | 
| fp4   | bigint(64) | YES  |     | NULL    |                | 
| fp5   | bigint(64) | YES  |     | NULL    |                | 
| fp6   | bigint(64) | YES  |     | NULL    |                | 
| fp7   | bigint(64) | YES  |     | NULL    |                | 
| fp8   | bigint(64) | YES  |     | NULL    |                | 
| fp9   | bigint(64) | YES  |     | NULL    |                | 
| fp10  | bigint(64) | YES  |     | NULL    |                | 
| fp11  | bigint(64) | YES  |     | NULL    |                | 
| fp12  | bigint(64) | YES  |     | NULL    |                | 
| fp13  | bigint(64) | YES  |     | NULL    |                | 
| fp14  | bigint(64) | YES  |     | NULL    |                | 
| fp15  | bigint(64) | YES  |     | NULL    |                | 
+-------+------------+------+-----+---------+----------------+
17 rows in set (0.01 sec)

Conclusions

Although we have neither a substructure search engine nor a database, we've laid a solid foundation for those things. The next article in this series will show how to use this humble beginning to model some simple substructure queries in a way that lets MySQL do most of the heavy-lifting.

Image Credit: Mr. Jaded