Dear Lazyweb: Does Jmol Support Output for Use with 3D Glasses?
Having recently spent some time playing around with Jmol, I wonder: is there a setting that enables 3D-effect viewing with the colored glasses used for 3D movies?
For that matter, which of the free 3D molecule viewers currently available support it?
Image Credit: Dicky
MX-1.0 Beta 2
The second beta release of MX is now available for download. MX is a library of essential cheminformatics models and routines. It was created with the goal of providing a clean, well-tested platform for chemistry applications.
This beta release includes a new binary fingerprinter package that was not ready for release in the first beta. Of particular note is the modular system MX uses for creating fingerprints based on interchangeable BloomFilter, Walker, RingFilter, RingFinder, and PathWriter classes.
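As a rough illustration of the general idea behind a path-based fingerprint backed by a Bloom filter, the following sketch hashes path strings into a fixed-size bit set. It is not the MX implementation: the class name PathBloomFilter, its methods, the hash mixing, and the string encoding of paths (such as "C-C-O") are all assumptions made for this example.

```java
import java.util.BitSet;

// Hypothetical sketch; not the MX BloomFilter class.
final class PathBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    PathBloomFilter(int size, int hashCount) {
        if (size <= 0 || hashCount <= 0) {
            throw new IllegalArgumentException("size and hashCount must be positive");
        }
        this.size = size;
        this.hashCount = hashCount;
        this.bits = new BitSet(size);
    }

    // Record a path (assumed here to be encoded as a string) in the fingerprint.
    void add(String path) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(path, i));
        }
    }

    // True if the path may have been added; false positives are possible,
    // false negatives are not.
    boolean mightContain(String path) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(path, i))) {
                return false;
            }
        }
        return true;
    }

    // Screening test: every bit set in this fingerprint is also set in the other.
    boolean isSubsetOf(PathBloomFilter other) {
        BitSet leftover = (BitSet) bits.clone();
        leftover.andNot(other.bits);
        return leftover.isEmpty();
    }

    // Derive the i-th bit position for a path; the mixing constant is arbitrary.
    private int index(String path, int seed) {
        int h = path.hashCode() * 31 + seed * 0x9E3779B9;
        h ^= (h >>> 16);
        return Math.floorMod(h, size);
    }
}
```

A fingerprint built from a query's paths can then be checked with isSubsetOf against a target's fingerprint to rule out non-matching targets before a full substructure search. A passing check is only a hint, because Bloom filters allow false positives.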
Unit tests are now augmented by Mockito for stubbing and mocking. The MX test suite now includes over 300 individual tests, all of which pass.
The full list of MX features includes:
Substructure search and atom mapping based on the VF algorithm
Exhaustive ring perception based on the Hanser algorithm
Path-based binary fingerprints
Flexible query atom support
Depth-First traversal
Implicit hydrogen detection
Complete system of atomic masses and isotopes based on the IUPAC Technical Report
Model objects (Molecule, Atom, Bond, Superatom) based on MDL CTfile specification
Molfile reader and writer
SD File reader and writer
MX is released under the highly permissive MIT License.
If you find a bug or other issue, please consider filing a bug report.
Hashing and the Universal Molecular Identifier
The impossibility of separating the nomenclature of a science from the science itself, is owing to this, that every branch of physical science must consist of three things; the series of facts which are the objects of the science, the ideas which represent these facts, and the words by which these ideas are expressed. ... -Antoine Lavoisier Traité Elémentaire de Chimie
The idea of creating a system in which a short sequence of characters can be used in place of a molecular structure traces its roots to the beginnings of modern chemistry. As the Web continues to become the information platform of choice, even in chemistry, the need to communicate molecular information in a Web-ready format increases.
InChI
Recently, InChI has emerged as a candidate for this position, or more precisely, InChIKey has. For the unfamiliar, InChI is a program that converts a variety of molecular structures into a sequence of characters; the same structure always gives the same sequence.
InChI is but one of a very large class of known line notations in chemistry. Unlike most other line notations, it combines the following ideas into one system:
1. Normalization. Many bonding arrangements, such as the nitro group and tautomers, can be represented in multiple equivalent forms. InChI is capable of eliminating alternative representations.
2. Symmetrization. InChI views molecules as networks of typed atoms connected by equivalent links. This eliminates the artificial distinction between single and double bonds in aromatic molecules such as benzene.
3. Canonicalization. Each atom in a molecule is assigned a reproducibly unique index, enabling equivalent inputs with different atom numberings to produce identical output.
4. Representation. InChIs are represented as a single line of text through the use of canonical numbering and rules for representing atom types.
5. Fixed Length. Unique to InChI, InChIKey enables very long InChI output to be compressed into a fixed-length string. The process is for all intents and purposes one-way.
This was no easy task, and is a remarkable accomplishment in itself.
InChI Limitations
But like any technology, InChI has limitations. Most importantly, InChI can't uniquely represent several important classes of molecular species, including those possessing axial and planar chirality, organometallics containing multicentered bonds, and polymers.
Extending InChI to encompass these important classes will be no easy task. Unlike the situation when InChI was being developed, many research groups around the world now use InChI. Any change that breaks existing InChIs or their Keys would be disruptive, and possibly even ignored by large segments of the community.
The difficulty of the task is compounded by the technical limitations in dealing with line notations. The qualities that tend to make them compact (e.g., unidimensionality) tend to be the same qualities that make them difficult to extend.
Taking a Step Back
What if we could go back to points 1-5 above and create a system from scratch that addressed them all - and in a way in which all forms of chirality, bonding, and polymers were included? How would we do it?
No More Line Notations
Our most important decision might be to avoid the use of a line notation altogether. Given (5), we know that anything we come up with, no matter how verbose, can be reduced to a fixed-length string that can be readily used in URLs and presented to end users for copy/paste operations and visual comparison.
Combining a full-featured file format with hashing offers a way to reap the benefits of a rich language for describing molecules while retaining a convenient method for using them on the Web. (For an example of a site that makes extensive use of hashes in URLs and on Web pages, see GitHub.)
What About Molfile?
Instead of inventing our own 'standard', what if we based our identifier on a file format already in widespread use? What about MDL V2000 molfiles?
At a minimum, we would need to specify:
- An algorithm that could convert any molfile into canonical form.
- A hashing algorithm. There are many good ones to choose from, such as plain old SHA-1.
Provided we could find solutions to the problems above, this would be an attractive option. We'd simply be layering additional constraints on a file format with a massive installed base. In fact, we would end up with something possessing nearly equivalent qualities to InChIKey, but with the advantage that we're extending an existing, widely-used standard.
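The hashing half of this proposal is the easy part. As a minimal sketch, assuming a canonical molfile string is already in hand (producing one is the hard, unsolved half), Java's standard MessageDigest class can reduce it to a 40-character hexadecimal SHA-1 digest. The class name MolfileHash and the method sha1Hex are hypothetical, as is the choice of UTF-8 for encoding the text.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper; assumes the input has already been canonicalized.
final class MolfileHash {
    private MolfileHash() {
    }

    // Returns the SHA-1 digest of the given text as 40 lowercase hex characters.
    static String sha1Hex(String canonicalMolfile) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            byte[] hash = digest.digest(canonicalMolfile.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is required of every Java platform, so this should not happen.
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }
}
```

Whatever the length of the input, the result is always 40 characters. For example, the string "abc" hashes to a9993e364706816aba3e25717850c26c9cd0d89d.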
Although we still would have the problem of not being able to deal with multicentered bonding and axial chirality, at least we'd have the possibility to include polymers.
But we'd still be leaving out a large portion of chemical space.
What About Another File Format?
No publicly-described file format that I'm aware of has all of the qualities needed to serve as a basis for a universal molecular identifier. Although some come close, close is not enough.
Every file format has something to teach. It may be possible to borrow the best elements of each, but this would be neither simple nor risk-free.
Conclusions
Hashing offers an attractive method for converting detailed, machine-readable descriptions of molecular structure into a fixed-length string suitable for use on the Web. The problem then boils down to identifying, or inventing, the means to create these descriptions.
Whether the tradeoffs in using an identifier that can't be decoded are worth the benefits of one that can be readily encoded and shared is another question altogether.
Always Be Testing: Using Mockito in MX
Test-driven development (TDD) is an iterative technique for software development in which a failing test is written before a single line of production code is written. Like any technique, it has its limitations and isn't applicable in every situation. Nevertheless, TDD is a powerful method to create more robust low-level code and more effective high-level designs.
The early days of TDD were not much fun. The reason: the whole idea behind object-oriented software is to divide responsibilities among cooperating objects, which greatly complicates independent testing of isolated object behavior. A test that started out verifying the behavior of a single object could quickly become a test verifying the behaviors of large numbers of dependent objects (and their dependencies) as well.
An approach to this problem that has gained traction over the last few years is mocking and stubbing. Although the terminology has become a bit tortured, the basic idea is to use stand-in objects whose behaviors can be pre-defined and later verified when testing an object with dependencies.
If all of this sounds a bit abstract, consider this example from MX, the lightweight cheminformatics toolkit.
Current activity in MX is focused on a Walker class that 'walks' the graph structure of a Molecule representation in depth-first order, reporting what it finds to a Reporter. To direct its progress, a Walker uses a Step. So, to write tests for Walker, we're dealing with at least two dependencies: Step and Reporter.
Given that we'd like to use test-driven development, how do we write the first Walker test?
The approach taken in MX is to use mock objects created with the Mockito library. Although there are many mocking solutions for Java, I've found Mockito to be the most intuitive and easy to use.
For example, consider this test, which verifies that a Walker doesn't continue walking past a maximum depth (click here to see the whole test):
where doStep simply calls walker.step:
In other words, rather than concocting a real Molecule that would set up the path behavior we need to test, we set up the states of all our dependencies directly such that they exhibit the testable behavior.
This buys us a couple of things. First, when our test fails, we can be sure it's failing because of either the way we set it up or the Walker implementation we're testing - not the dependencies. Second, we can specify the test environment and the actions taken on it with far greater precision.
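To make the idea concrete without depending on any particular library, here is a minimal sketch that uses hand-written stand-ins rather than Mockito. The Step, Reporter, and DepthLimitedWalker types below are hypothetical simplifications invented for this example. They are not the MX classes, and the walk follows a single linear path rather than a full depth-first traversal.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical collaborator interfaces; NOT the MX API.
interface Step {
    String label();

    Step next(); // returns null when the path ends
}

interface Reporter {
    void visit(String label);
}

// Hypothetical object under test: stops after maxDepth steps.
final class DepthLimitedWalker {
    private final int maxDepth;

    DepthLimitedWalker(int maxDepth) {
        this.maxDepth = maxDepth;
    }

    void walk(Step start, Reporter reporter) {
        Step current = start;
        int depth = 0;
        while (current != null && depth < maxDepth) {
            reporter.visit(current.label());
            current = current.next();
            depth++;
        }
    }
}

// Stub: a path that never ends, so only the walker's limit can stop the walk.
final class EndlessStep implements Step {
    private final int index;

    EndlessStep(int index) {
        this.index = index;
    }

    public String label() {
        return "atom" + index;
    }

    public Step next() {
        return new EndlessStep(index + 1);
    }
}

// Hand-written mock: records every call so the test can verify it afterwards.
final class RecordingReporter implements Reporter {
    final List<String> visited = new ArrayList<>();

    public void visit(String label) {
        visited.add(label);
    }
}
```

Walking an EndlessStep with a maximum depth of 3 leaves exactly atom0, atom1, and atom2 in the RecordingReporter's visited list. No real molecule is needed, and a failure can only come from the test setup or from the walker itself.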
There are many approaches to solving the testing dependency problem, and this is but one. Finding an approach that works for you can be a powerful way to increase your individual productivity and that of your team.
Update June 29, 2009: The original test didn't work as intended and has since been replaced with a simpler approach that does.
Porting MX: CDK-Compatible VF Implementation
Substructure search is a fundamental cheminformatics operation, and an especially important component in chemical structure databases. Although a few algorithms for atom-by-atom comparison of two structures are available, one of the fastest is VF, which is implemented in MX, a lightweight cheminformatics toolkit.
A recent post discussed the limitations of directly porting the C++ implementation of VF into Java and why a Java-centric, de novo implementation was created for MX instead.
I'm now happy to report that Syed Asad Rahman of the European Bioinformatics Institute has created a preliminary implementation of VF for the Chemistry Development Kit (CDK) by porting the MX mapping package.
Looking through Asad's work, one of the most striking things is the isolation of CDK-specific code into a few key areas, a trait shared by the original MX implementation. Another is the readability of the code. Both features should greatly simplify further optimization work.
If you've been looking for a fast substructure search engine for your cheminformatics work, I recommend checking out both MX and the CDK port.

