MX-1.0 Beta 2

Posted by Rich Apodaca Thu, 02 Jul 2009 01:18:00 GMT

The second beta release of MX is now available for download. MX is a library of essential cheminformatics models and routines. It was created with the goal of providing a clean, well-tested platform for chemistry applications.

This beta release includes a new binary fingerprinter package that was not ready for release in the first beta. Of particular note is the modular system MX uses for creating fingerprints based on interchangeable BloomFilter, Walker, RingFilter, RingFinder, and PathWriter classes.
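To give a feel for how interchangeable parts like these might fit together, here is a minimal sketch in Java. The names below (SketchBloomFilter, PathSource, SketchFingerprinter) are assumptions made for illustration and are not MX's actual classes. The "paths" are plain strings rather than the output of a real Walker and PathWriter, and the filter sets a single bit per path.

```java
import java.util.BitSet;
import java.util.List;

// Hypothetical stand-in for a bloom filter: hashes each path string to one bit.
class SketchBloomFilter {
    private final BitSet bits;
    private final int size;

    SketchBloomFilter(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    void add(String path) {
        // floorMod keeps the index non-negative even for negative hash codes
        bits.set(Math.floorMod(path.hashCode(), size));
    }

    BitSet toBitSet() {
        return (BitSet) bits.clone();
    }
}

// Hypothetical source of path strings; a real one would walk a molecule graph.
interface PathSource {
    List<String> paths();
}

// Combines a path source with a bloom filter to produce a binary fingerprint.
class SketchFingerprinter {
    private final int size;

    SketchFingerprinter(int size) {
        this.size = size;
    }

    BitSet fingerprint(PathSource source) {
        SketchBloomFilter filter = new SketchBloomFilter(size);
        for (String path : source.paths()) {
            filter.add(path);
        }
        return filter.toBitSet();
    }

    // Screening test: every bit set in the query must also be set in the target.
    static boolean mayContain(BitSet target, BitSet query) {
        BitSet missing = (BitSet) query.clone();
        missing.andNot(target);
        return missing.isEmpty();
    }
}
```

Because each piece sits behind a small interface or class, swapping in a different path source or a different filter would not touch the rest of the sketch.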

Unit tests are now augmented by Mockito for stubbing and mocking. The MX test suite now includes over 300 individual tests, all of which pass.

The full list of MX features includes:

  • Substructure search and atom mapping based on the VF algorithm

  • Exhaustive ring perception based on the Hanser algorithm

  • Path-based binary fingerprints

  • Flexible query atom support

  • Depth-first traversal

  • Implicit hydrogen detection

  • Complete system of atomic masses and isotopes based on the IUPAC Technical Report

  • Model objects (Molecule, Atom, Bond, Superatom) based on the MDL CTfile specification

  • Molfile reader and writer

  • SD File reader and writer

MX is released under the highly permissive MIT License.

If you find a bug or other issue, please consider filing a bug report.

Always Be Testing: Using Mockito in MX

Posted by Rich Apodaca Mon, 29 Jun 2009 15:45:00 GMT

Test-driven development (TDD) is an iterative technique for software development in which a failing test is written before a single line of production code is written. Like any technique, it has its limitations and isn't applicable in every situation. Nevertheless, TDD is a powerful method to create more robust low-level code and more effective high-level designs.

The early days of TDD were not much fun. The reason: the whole idea behind object-oriented software is to divide responsibilities among cooperating objects, which greatly complicates independent testing of isolated object behavior. A test that started out verifying the behavior of a single object could quickly become a test verifying the behaviors of large numbers of dependent objects (and their dependencies) as well.

An approach to this problem that has gained traction over the last few years is mocking and stubbing. Although the terminology has become a bit tortured, the basic idea is to use stand-in objects whose behaviors can be pre-defined and later verified when testing an object with dependencies.

If all of this sounds a bit abstract, consider this example from MX, the lightweight cheminformatics toolkit.

Current activity in MX is focused on a Walker class that 'walks' the graph structure of a Molecule representation in depth-first order, reporting what it finds to a Reporter. To direct its progress, a Walker uses a Step. So, to write tests for Walker, we're dealing with at least two dependencies: Step and Reporter.

Given that we'd like to use test-driven development, how do we write the first Walker test?

The approach taken in MX is to use mock objects created with the Mockito library. Although there are many mocking solutions for Java, I've found Mockito to be the most intuitive and easy to use.

For example, one MX test verifies that a Walker doesn't continue walking past a maximum depth. In that test, a doStep helper simply calls walker.step.

In other words, rather than concocting a real Molecule that would set up the path behavior we need to test, we set up the states of all our dependencies directly such that they exhibit the testable behavior.

This buys us a couple of things. First, when our test fails, we can be sure it's failing because of either the way we set it up or the Walker implementation we're testing - not the dependencies. Second, we can specify the test environment and the actions taken on it with far greater precision.
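As a rough illustration of the same idea, the sketch below uses hand-written stand-ins instead of Mockito so that it runs without any library. The Step and Reporter interfaces, the DepthLimitedWalker class, and the depth convention are all assumptions made for this example. They are much simpler than MX's real classes and are not meant to mirror the MX test.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified collaborator interfaces (not MX's real API).
interface Step {
    String label();
    int depth();
    List<Step> nextSteps();
}

interface Reporter {
    void visit(String label);
}

// Simplified walker: reports steps depth-first and stops beyond maximumDepth.
class DepthLimitedWalker {
    private final int maximumDepth;

    DepthLimitedWalker(int maximumDepth) {
        this.maximumDepth = maximumDepth;
    }

    void step(Step current, Reporter reporter) {
        if (current.depth() > maximumDepth) {
            return; // too deep: report nothing and do not descend
        }
        reporter.visit(current.label());
        for (Step next : current.nextSteps()) {
            step(next, reporter);
        }
    }
}

// Stub: a Step whose answers are fixed up front, with no Molecule behind it.
class StubStep implements Step {
    private final String label;
    private final int depth;
    private final List<Step> next;

    StubStep(String label, int depth, Step... next) {
        this.label = label;
        this.depth = depth;
        this.next = Arrays.asList(next);
    }

    public String label() { return label; }
    public int depth() { return depth; }
    public List<Step> nextSteps() { return next; }
}

// Recording stand-in: remembers every call so a test can verify it afterwards.
class RecordingReporter implements Reporter {
    final List<String> visited = new ArrayList<>();

    public void visit(String label) {
        visited.add(label);
    }
}
```

A test would chain three StubStep instances at depths 0, 1, and 2, run a DepthLimitedWalker with a maximum depth of 1, and check that the RecordingReporter saw only the first two labels. A mocking library generates the stub and the recorder for you, but the shape of the test stays the same.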

There are many approaches to solving the testing dependency problem, and this is but one. Finding an approach that works for you can be a powerful way to increase your individual productivity and that of your team.

Update June 29, 2009: The original test didn't work as intended and has since been replaced with a simpler approach that does.

Porting MX: CDK-Compatible VF Implementation

Posted by Rich Apodaca Fri, 26 Jun 2009 14:24:00 GMT

Substructure search is a fundamental cheminformatics operation, and an especially important component in chemical structure databases. Although a few algorithms for atom-by-atom comparison of two structures are available, one of the fastest is VF, which is implemented in MX, a lightweight cheminformatics toolkit.

A recent post discussed the limitations of directly porting the C++ implementation of VF into Java and why a Java-centric, de novo implementation was created for MX instead.

I'm now happy to report that Syed Asad Rahman of the European Bioinformatics Institute has created a preliminary implementation of VF for the Chemistry Development Kit (CDK) by porting the MX mapping package.

Looking through Asad's work, one of the most striking things is the isolation of CDK-specific code into a few key areas, a trait shared by the original MX implementation. Another is the readability of the code. Both features should greatly simplify further optimization work.

If you've been looking for a fast substructure search engine for your cheminformatics work, I recommend checking out both MX and the CDK port.

Quick MX Update: Extensible Fingerprints and Hydrogen-Blocked Substructure Queries

Posted by Rich Apodaca Thu, 18 Jun 2009 13:37:00 GMT

The master branch of MX now features support for two very important cheminformatics capabilities:

  1. Extensible Fingerprints. Lets you fine-tune the performance of your binary path-based fingerprints. Change path depth, number of bits, and the characters used to generate the fingerprints. The default implementation differentiates cycle paths from non-cycle paths, as well as unsaturated from saturated atom types.

  2. Hydrogen-Blocked Substructure Queries. Chemists are accustomed to using a substructure search idiom in which explicit hydrogen means a blocked position. MX now has full support for this way of searching chemical databases.
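The hydrogen-blocking idiom can be sketched at the level of a single atom comparison. This is only an illustration under simplifying assumptions: SketchAtom and HydrogenBlockedMatcher are hypothetical names, not MX classes, and a real matcher would apply a check like this inside a full substructure search.

```java
// Hypothetical, simplified atom: an element symbol plus a hydrogen count.
// For a query atom, the count is the number of explicit hydrogens drawn on it.
// For a target atom, it is the total number of attached hydrogens.
final class SketchAtom {
    final String symbol;
    final int hydrogens;

    SketchAtom(String symbol, int hydrogens) {
        this.symbol = symbol;
        this.hydrogens = hydrogens;
    }
}

final class HydrogenBlockedMatcher {
    // Each explicit hydrogen on the query blocks one position, so the target
    // atom must carry at least that many hydrogens of its own.
    static boolean matches(SketchAtom query, SketchAtom target) {
        return query.symbol.equals(target.symbol)
                && target.hydrogens >= query.hydrogens;
    }
}
```

Under these assumptions, a query carbon drawn with one explicit hydrogen matches a target carbon that still bears a hydrogen, but not one where that position carries a substituent instead.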

Although it's too late to include these changes in the upcoming 1.0 release of MX, watch for them to appear in 1.1.

MX is a toolkit of essential cheminformatics models and algorithms that emphasizes efficiency, modularity, and readability.

If the Wheel Doesn't Work, Reinvent it

Posted by Rich Apodaca Tue, 16 Jun 2009 17:04:00 GMT

Chris Steinbeck has an interesting post on the CDK code review process that discusses a new VF implementation. In it, he notes:

I checked it out and looked at the code, which looked horrible because it was a 1:1 translation of a horrible looking C code. Clearly, a decent naming of the variables would greatly improve the code but I remember a statement that the translator himself could not make sense out of this, so the original author is to blame :-) . I do not get the impression that this problem can be rectified quickly. In fact, it took Mark a few days to debug this code by adding a rich collection of debug messages. I’m not sure that this is how it should be. The code is essentially unreadable.

For the unfamiliar, VF is a subgraph matching algorithm that has been shown to perform better than Ullmann for small graphs, and much better than Ullmann for large graphs.

Faced with essentially the same problem of implementing VF in Java for MX, I abandoned my early efforts to port the VFlib C++ implementation. The C++ implementation may make sense to a C++ programmer, but porting it directly to Java did not look like a good long-term move.

The problem was maintenance.

Although opinions on the subject vary, maintainable Java code to me has a few easily-identifiable characteristics. Among them are:

  1. Descriptive variable and method names.
  2. Limited use of deep nesting (> 3 levels) within methods.
  3. Stateful objects.
  4. Use of collections over primitives.
  5. Few methods over ten lines long.

The VF Java port that I created failed on just about every count - and failed consistently.

It turned out that a short description of the VF algorithm was remarkably clear, lending itself well to a Java-centric, object-oriented implementation that was successfully integrated into MX.
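To show what a state-centered, object-oriented matcher can look like, here is a deliberately simplified backtracking sketch. It is not the MX implementation and it is not full VF: it keeps the partial mapping in a state object, as VF does, but omits VF's look-ahead feasibility rules and candidate ordering. SketchGraph and MatchState are hypothetical names chosen for this example.

```java
import java.util.Arrays;

// Hypothetical minimal labeled graph with undirected edges.
class SketchGraph {
    final String[] labels;
    final boolean[][] adjacent;

    SketchGraph(String... labels) {
        this.labels = labels;
        this.adjacent = new boolean[labels.length][labels.length];
    }

    SketchGraph connect(int a, int b) {
        adjacent[a][b] = true;
        adjacent[b][a] = true;
        return this;
    }
}

// Holds a partial mapping from query nodes to target nodes and extends it
// one pair at a time, backtracking when no feasible pair remains.
class MatchState {
    private final SketchGraph query;
    private final SketchGraph target;
    private final int[] mapping;        // mapping[q] = target node, or -1
    private final boolean[] usedTarget;
    private int mapped = 0;             // query nodes are mapped in index order

    MatchState(SketchGraph query, SketchGraph target) {
        this.query = query;
        this.target = target;
        this.mapping = new int[query.labels.length];
        this.usedTarget = new boolean[target.labels.length];
        Arrays.fill(mapping, -1);
    }

    // A pair is feasible if the labels agree, the target node is unused, and
    // every query edge to an already-mapped node has a matching target edge.
    private boolean isFeasible(int q, int t) {
        if (usedTarget[t] || !query.labels[q].equals(target.labels[t])) {
            return false;
        }
        for (int other = 0; other < mapping.length; other++) {
            if (mapping[other] >= 0 && query.adjacent[q][other]
                    && !target.adjacent[t][mapping[other]]) {
                return false;
            }
        }
        return true;
    }

    boolean match() {
        if (mapped == mapping.length) {
            return true; // every query node has a partner
        }
        int q = mapped;
        for (int t = 0; t < target.labels.length; t++) {
            if (isFeasible(q, t)) {
                mapping[q] = t;
                usedTarget[t] = true;
                mapped++;
                if (match()) {
                    return true;
                }
                mapped--;               // backtrack
                usedTarget[t] = false;
                mapping[q] = -1;
            }
        }
        return false;
    }

    int[] mapping() {
        return mapping.clone();
    }
}
```

Matching a C-O query against a C-C-O target, for example, first tries the terminal carbon, fails to find a bonded oxygen, backtracks, and then succeeds with the middle carbon.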

As a bonus, because test-driven development was used from the start for the MX implementation of VF, not only is the code maintainable, but it can be refactored and recast with higher confidence thanks to the tests that are now present. Those tests proved their worth during a recent large-scale refactoring of the MX code to support arbitrary Query Atoms.

Would you consider bolting a bicycle wheel onto your new Kawasaki? Of course not. Why do the same with your software?

Reuse whenever it's consistent with your goals. When it's not, then reinvent.
