Making the Case: Similarity by Compression
...The structures were converted to SMILES format and canonicalized using a program written with the open-source Java cheminformatics library JOELib2. ... To conclude, we have demonstrated that SMILES strings and compression programs are a simple, yet powerful method for similarity searching, competitive with state-of-the-art-techniques. The Ruby scripts used to carry out the experiments described in this paper are available for download from http://comp.chem.nottingham.ac.uk/download/zippity/.
James Melville, Jenna Riley, and Jonathan Hirst, J. Chem. Inf. Model.
Yet another appearance of Open Source software in the literature comes by way of a paper from Melville, Riley, and Hirst. This work takes advantage of the alphabet-like nature of SMILES strings and widely-available compression algorithms to perform molecular similarity analyses. Not only does this work use the Open Source JOELib library but the authors have made the Ruby scripts that perform the similarity analysis freely available under the same terms as Ruby (Ruby's license or the GPL).
The times they are a-changin'.
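To make the compression idea concrete, here is a minimal sketch of one common formulation, the normalized compression distance, applied to SMILES strings with Ruby's standard zlib bindings. This is not a reproduction of the authors' scripts: the helper names `compressed_size` and `ncd`, the choice of zlib, and the three example molecules are all assumptions made for illustration.

```ruby
require 'zlib'

# Size in bytes of the zlib-compressed form of a string.
def compressed_size(text)
  Zlib::Deflate.deflate(text, Zlib::BEST_COMPRESSION).bytesize
end

# Normalized compression distance (hypothetical helper, not from the paper's
# scripts): small when compressing the two strings together costs little more
# than compressing one of them alone.
def ncd(a, b)
  ca  = compressed_size(a)
  cb  = compressed_size(b)
  cab = compressed_size(a + b)
  (cab - [ca, cb].min).to_f / [ca, cb].max
end

# Example SMILES chosen for illustration only.
loratadine    = 'CCOC(=O)N1CCC(=C2C3=C(CCC4=C2N=CC=C4)C=C(C=C3)Cl)CC1'
desloratadine = 'C1CC2=C(C=CC(=C2)Cl)C(=C3CCNCC3)C4=C1C=CC=N4'
ibuprofen     = 'CC(C)CC1=CC=C(C=C1)C(C)C(=O)O'

puts "loratadine vs itself:        #{ncd(loratadine, loratadine)}"
puts "loratadine vs desloratadine: #{ncd(loratadine, desloratadine)}"
puts "loratadine vs ibuprofen:     #{ncd(loratadine, ibuprofen)}"
```

On strings this short the compressor's fixed overhead matters, so the absolute numbers should be read as rough indications only.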
Scripting Molecular Fingerprints with Ruby CDK
A molecular fingerprint represents a molecule as a series of bits. There are many situations in which this reduced form of molecular representation is useful. For example, fingerprints are frequently used as a fast prescreen for database substructure searches. They can also be used for "fuzzy" comparisons involving molecular similarity, a nice complement to binary queries such as substructure search.
Fingerprints have their limitations. Being a form of hashing, they are imprecise in that two different molecules can have exactly the same fingerprint. The converse is also true: many molecular fingerprints exaggerate small differences between two molecules that most chemists would say are similar - for example between oxygen and sulfur analogs of the same structure.
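The hashing limitation can be seen in a toy example. The sketch below is an assumption-laden illustration, not CDK's algorithm: it slices the SMILES text into short substrings and hashes each one into a 16-bit fingerprint, whereas CDK enumerates paths through the molecular graph. The names `fragments`, `toy_fingerprint`, and `popcount` are hypothetical.

```ruby
require 'zlib'

# All distinct substrings of 1..max_len characters (toy stand-in for the
# graph paths that a real fingerprinter would enumerate).
def fragments(smiles, max_len = 3)
  (1..max_len).flat_map do |len|
    (0..smiles.length - len).map { |i| smiles[i, len] }
  end.uniq
end

# Hash each fragment to one of `bits` positions and set that bit.
# CRC32 is used only because it is deterministic across Ruby processes.
def toy_fingerprint(smiles, bits = 16)
  fragments(smiles).reduce(0) do |fp, frag|
    fp | (1 << (Zlib.crc32(frag) % bits))
  end
end

# Number of bits set to 1 in a non-negative Integer.
def popcount(n)
  n.to_s(2).count('1')
end

smiles = 'CCOC(=O)N1CCC(=C2C3=C(CCC4=C2N=CC=C4)C=C(C=C3)Cl)CC1'
frags  = fragments(smiles)
fp     = toy_fingerprint(smiles)

puts "distinct fragments: #{frags.size}"
puts "bits set:           #{popcount(fp)} of 16"
```

With more distinct fragments than bit positions, several fragments must share a bit. That loss of information is why two different molecules can end up with the same fingerprint.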
Despite their limitations, the advantages of fingerprints make them useful in many situations. As a result, numerous fingerprinting systems have become popular. This tutorial will focus on creating and manipulating molecular fingerprints from Ruby using the Ruby Chemistry Development Kit (RCDK).
Prerequisites
For this tutorial, you'll need Ruby CDK (RCDK). A recent article described the small amount of system configuration required for RCDK on Linux. Another article showed how to install RCDK on Windows.
A Small Fingerprint Library
Let's build a small Ruby library for working with fingerprints. Place the following code into a file called fingerprint.rb in your working directory:
require 'rubygems'
require_gem 'rcdk'
require 'rcdk/util'
jrequire 'org.openscience.cdk.fingerprint.Fingerprinter'
jrequire 'org.openscience.cdk.similarity.Tanimoto'
# Molecule fingerprinting
class Fingerprinter
  def initialize
    @fingerprinter = Org::Openscience::Cdk::Fingerprint::Fingerprinter.new
  end

  def fingerprint(smiles)
    mol = RCDK::Util::Lang.read_smiles smiles
    fp = @fingerprinter.getFingerprint mol
    # Metaprogramming! Mix the Fingerprint module into this one BitSet instance.
    fp.extend(Fingerprint)
  end
end

# BitSet comparison
module Fingerprint
  # Returns true if all of the bits set to true in this fingerprint are also set to true in the specified fingerprint
  def subset?(fingerprint)
    Org::Openscience::Cdk::Fingerprint::Fingerprinter.isSubset(fingerprint, self)
  end

  # Tanimoto similarity of this fingerprint and the specified fingerprint
  def tanimoto(fingerprint)
    Org::Openscience::Cdk::Similarity::Tanimoto.calculate(self, fingerprint)
  end
end
Of particular note is the use of Ruby's Object#extend method. This method allows a single instance of an object to be extended at runtime, a form of metaprogramming. In this case, we add the subset? and tanimoto methods for determining whether all of the bits in one fingerprint are present in another, and for determining similarity, respectively. We use this technique here because RJB currently doesn't provide the complete interface into Java classes that would be required to create a Ruby class that directly inherits from Java's BitSet class.
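For readers without RCDK installed, the two comparisons can be sketched in pure Ruby. This is an illustration of the underlying bit arithmetic, assuming fingerprints are modeled as plain Integers; it is not how CDK implements these methods, and the standalone functions below are hypothetical stand-ins for the module methods above.

```ruby
# Number of bits set to 1 in a non-negative Integer.
def popcount(n)
  n.to_s(2).count('1')
end

# True when every bit set in `query` is also set in `target`.
def subset?(query, target)
  (query & target) == query
end

# Tanimoto coefficient: bits shared by both fingerprints divided by
# bits set in either one.
def tanimoto(a, b)
  union = popcount(a | b)
  return 1.0 if union.zero? # convention chosen here for two empty fingerprints
  popcount(a & b).to_f / union
end

small = 0b0011_0100 # 3 bits set
large = 0b1011_0110 # 5 bits set, including all 3 bits of `small`

puts subset?(small, large)  # => true
puts subset?(large, small)  # => false
puts tanimoto(small, large) # => 0.6
```

The subset test is what makes fingerprints useful as a substructure prescreen: if a query's bits are not all present in a candidate, the candidate can be skipped without a full graph match.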
Testing the Library


Claritin (loratadine, left) and Clarinex (desloratadine, right) are two structurally related antihistamines. Can we quantify the degree of similarity between these two structures? Fingerprints provide one way. The following code creates fingerprints for the two structures, determines whether one is a subset of the other, and calculates a Tanimoto similarity value:
require 'fingerprint'
f = Fingerprinter.new
loratadine = f.fingerprint 'CCOC(=O)N1CCC(=C2C3=C(CCC4=C2N=CC=C4)C=C(C=C3)Cl)CC1'
desloratadine = f.fingerprint 'C1CC2=C(C=CC(=C2)Cl)C(=C3CCNCC3)C4=C1C=CC=N4'
puts "Loratadine is a subset of desloratadine: #{loratadine.subset? desloratadine}" # => false
puts "Desloratadine is a subset of loratadine: #{desloratadine.subset? loratadine}" # => true
puts "Tanimoto similarity of desloratadine and loratadine: #{loratadine.tanimoto desloratadine}" # => 0.895683467388153
Variations
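The subset result and the Tanimoto value are linked. When every bit of one fingerprint is present in the other, the shared bits are just the smaller fingerprint and the union is the larger one, so the Tanimoto coefficient reduces to a ratio of set-bit counts. The sketch below works backward from the reported similarity and the loratadine cardinality of 278 given later in this section. The desloratadine count it produces is an inference, not a value reported in this article.

```ruby
tanimoto        = 0.895683467388153 # similarity reported for the pair
loratadine_bits = 278               # loratadine.cardinality, as reported in this article

# With desloratadine's bits a subset of loratadine's:
#   tanimoto = desloratadine_bits / loratadine_bits
implied_desloratadine_bits = (tanimoto * loratadine_bits).round

puts implied_desloratadine_bits                        # => 249
puts implied_desloratadine_bits.to_f / loratadine_bits # close to the reported value
```

The small discrepancy in the trailing digits is consistent with the similarity having been computed in single precision, though that is an assumption about the underlying Java call.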
CDK's Fingerprinter class returns an instance of the Java class BitSet. This BitSet can be further manipulated in Ruby. For example, to find the size (the total number of bits) of the BitSet, we could use:
loratadine.size # => 1024
Similarly, to find the number of bits set to true, we would use:
loratadine.cardinality # => 278
To print out a list of all bits set to true, we could use the toString method:
loratadine.toString # => "{2, 8, 11, 16, 18, 22, 32, 37, 38, 41, 42, 46, 47, 51, 57, 64, 65, 66, 69 ... }"
Conclusions
Fingerprints enable many useful and fast comparisons between molecules. The form of fingerprint we've used here is but one of the possibilities offered by CDK. The next article in this series will discuss fingerprints in Open Babel using both Ruby and Python.


