Cheminformatics in Rust: Implementing a Minimal Molecule API

A cheminformatics toolkit, like most software, is made of layers. The bottom layer deals with graphs and their manipulation. Above that sits a tightly coupled layer dedicated to the representation of molecular objects. All of a toolkit's strengths and limitations ultimately evolve from these two layers.

Previous articles have described minimal APIs for Graph and Molecule. This article moves the discussion one step further — into the realm of implementation. Tying together yet another thread, the language I'll be using is Rust.

The Code

A repository containing the code discussed in this post can be found on GitHub. Using it will require a Rust installation.

Clone the repository and run its test suite with the following commands:

git clone https://github.com/rapodaca/molecule.rs
cd molecule.rs
cargo test

The remainder of this article describes the design and implementation of the molecule.rs repository.

Molecule Trait

The focal point of molecule.rs is the Molecule trait. This trait extends the Graph trait from Gamma, a Rust graph library. Eight new methods are contained within the Molecule trait, and excerpt of which appears below.

// imports

pub trait Molecule<'a, N: 'a>: Graph<'a, usize> {
    fn element(&self, id: &usize) -> Result<Element, Error>;

    fn isotope(&self, id: &usize) -> Result<Option<u16>, Error>;

    fn electrons(&self, id: &usize) -> Result<u8, Error>;

    fn hydrogens(&self, id: &usize) -> Result<u8, Error>;

    fn charge(&self, id: &usize) -> Result<i8, Error>;

    fn atom_parity(&self, id: &usize) -> Result<Option<Parity>, Error>;

    fn bond_order(&self, sid: &usize, tid: &usize) -> Result<u8, Error>;

    fn bond_parity(
        &self, sid: &usize, tid: &usize
    ) -> Result<Option<Parity>, Error>;
}

These methods trace their origins to a previously-presented minimal molecule API. The main idea is that the smaller this API can be made, the simpler it will be to learn, implement, and use. Accordingly, the methods found in Molecule represent an irreducible core for a cheminformatics toolkit. All of them expose metadata associated with either the nodes or edges of an underlying graph.

Having already addressed the most difficult questions about design, implementation becomes a matter of satisfying unit tests and the Rust compiler.

The Invisible Atom Pattern

There will be many ways to implement the Molecule trait. The approach used in molecule.rs is something I call the invisible atom pattern.

The invisible atom pattern relies on objects of type Atom to perform most of the heavy lifting behind the Molecule trait. For example, Atom has a method called #electrons that returns a valence electron count. A Molecule implementation delegates method calls to Atom#electrons. The remaining methods in the Molecule interface are implemented similarly.

// imports

pub struct Atom {
    element: Element,
    nonbonding_electrons: u8,
    bonding_electrons: u8,
    hydrogens: u8,
    isotope: Option<u16>,
    parity: Option<Parity>,
    bonds: Vec<Bond>,
    neighbors: Vec<usize>
}

impl<'a> Atom {
    pub fn build(spec: spec::Atom) -> Result<Self, Error> {
      // Use the specification object spec to construct the initial internal
      // state of the Atom
    }

    pub fn element(&self) -> Element {
        self.element
    }

    // other instance methods providing Graph and Molecule services
}

Atom is invisible in the sense that it lives in a private package invisible to the outside world. Why this approach? The main reason is that Atom is an implementation detail. Recall that the Molecule API requires no handles to an atom-like object, nor does it expose any. Instead, atomic attributes and behaviors are accessed through opaque numerical indexes (type usize). This approach happens to work well given Rust's ownership model. It also translates to other languages.

The build function associated with Atom uses a parameter called spec. I'll have more to say about this technique later.

Molecule Implementation

The Molecule trait is implemented by DefaultMolecule (found in default_molecule.rs). Only three pieces of data are needed: an array of atoms; an array of atom indexes; and an array of edges. From these data structures, all of the methods of both Graph and Molecule are implemented.

// imports

pub struct DefaultMolecule {
    atoms: Vec<Atom>,
    indices: Vec<usize>,
    edges: Vec<(usize, usize)>
}

impl DefaultMolecule {
    pub fn build(spec: spec::Molecule) -> Result<Self, Error> {
      // Use the specification object spec to initialize the internal state
      // of a DefaultMolecule
    }
}

// ...

impl<'a> Graph<'a, usize> for DefaultMolecule {
    // ... required iterator definitions
    // implement Graph methods in terms of the DefaultMolecule struct
    fn is_empty(&self) -> bool {
        self.atoms.is_empty()
    }

    // remaining Graph methods
}

impl<'a> Molecule<'a, usize> for DefaultMolecule {
  // implement Molecule methods in terms of the DefaultMolecule struct
    fn element(&self, id: &usize) -> Result<Element, GraphError> {
        match self.atoms.get(*id) {
            Some(node) => Ok(node.element()),
            None => Err(GraphError::UnknownNode)
        }
    }

    // remaining Molecule methods
}

Like the Graph trait, the Molecule trait has no mutator methods. This fits well with Rust, where mutators can quickly lead to unwanted complications.

Given there are no mutators, how does a Molecule ever get built? The answer lies with the build function associated with DefaultMolecule. It accepts a specification object (spec) as an argument, using its attributes to build an internal Molecule state. build returns an error value when inconsistent state is detected.

Specification Objects

Atom and DefaultMolecule are constructed with the help of specification objects. A specification object is a dumb data structure encoding the state of some object to be constructed. By "dumb" I mean that no methods are defined - just attributes. As such, specification objects are lightweight precursors to more complex object. They can be created, copied, stored, and passed efficiently. Think of a specification object as something that can be directly translated to JSON.

Specification objects are versatile and inexpensive. Using them helps keep method signatures small. Coupled with Rust's static typing system, specification objects make the construction of new Molecule objects reliable and unsurprising.

Rust Iterators are Tricky

The most difficult part of implementing Molecule was getting its Iterators to work correctly. Because Molecule implements the Graph trait, DefaultMolecule must define three Iterator types: one over all nodes; another for a node's neighbors; and a third for all edges. Rust's famously strict type system and ownership model joined forces here to make what should have been straightforward job difficult.

My original plan for DefaultMolecule#nodes was to return an Iterator over the indices of the underlying Vec of atoms. Surprisingly, however, this does not seem to be possible. To solve this problem, a separate array, indices is maintained. DefaultMolecule#nodes returns an iterator over this vector's elements.

DefaultMolecule#neighbors followed a similar pattern. The method returns an Iterator over the indexes of an atom's neighbors. As with other atom-specific tasks, this one is delegated to Atom. Atom#neighbors in turn returns an Iterator over the elements of its neighbors vector. An alternative approach in which neighbor indices were dynamically iterated via a call to map hit a problem that apparently has no solution.

In contrast, DefaultMolecule#edges uses a custom iterator (EdgeIterator) to iterate edges. On the surface, it appeared as if returning an iterator over the edges vector should have sufficed. However, this iterator generates tuples of indices, not tuples of index references. EdgeIterator provides the necessary translation.

pub struct EdgeIterator<'a> {
    cursor: usize,
    edges: &'a Vec<(usize, usize)>
}

impl<'a> EdgeIterator<'a> {
    pub fn new(edges: &'a Vec<(usize, usize)>) -> Self {
        EdgeIterator { cursor: 0, edges }
    }
}

impl<'a> Iterator for EdgeIterator<'a> {
    type Item = (&'a usize, &'a usize);

    fn next(&mut self) -> Option<(&'a usize, &'a usize)> {
        if let Some((sid, tid)) = self.edges.get(self.cursor) {
            self.cursor += 1;

            Some((&sid, &tid))
        } else {
            None
        }
    }
}

Attempts to replace DefaultMolecule's edges vector with one using the correct form failed. This failure resulted from the return of borrowed references to data (indices) owned by the build function. I suspect that a simpler approach eliminating the need for EdgeIterator is possible.

What Next?

DefaultMolecule implements all services defined by the minimum molecule API. As such, it should be feasible to build a cheminformatics toolkit of arbitrary complexity based on DefaultMolecule.

A useful first step on that path would be to implement a SMILES reader and writer. The use of specification objects is simple but verbose. A function that could return Molecule implementations given valid SMILES strings would be a big step forward, as would a function that could write SMILES given a Molecule.

Conclusion

Previous posts introduced the concept of the minimum Molecule API. This post presents an implementation in Rust. No insurmountable restrictions were found, although Rust's ownership model did present some challenges not likely to found in other languages. A repository with all source and a complete suite of unit tests is available from GitHub.