<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="/stylesheets/rss.css"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/">
  <channel>
    <title>Depth-First: Scripting Molecular Fingerprints with Ruby CDK</title>
    <link>http://depth-first.com/articles/2006/11/22/scripting-molecular-fingerprints-with-ruby-cdk</link>
    <language>en-us</language>
    <ttl>40</ttl>
    <description>Walking the Web of Chemical Informatics</description>
    <item>
      <title>Scripting Molecular Fingerprints with Ruby CDK</title>
      <description>&lt;p&gt;&lt;img src="http://depth-first.com/files/ruby_logo_new.gif" align="right"&gt;&lt;/img&gt;A &lt;a href="http://www.daylight.com/dayhtml/doc/theory/theory.finger.html"&gt;molecular fingerprint&lt;/a&gt; represents a molecule as series of bits. There are many situations in which this reduced form of molecular representation is useful. For example, fingerprints are frequently used as a fast prescreen for database substructure searches. They can also be used for "fuzzy" comparisons involving molecular similarity, a nice complement to binary queries such as substructure search.&lt;/p&gt;

&lt;p&gt;Fingerprints have their limitations. Being a form of hashing, they are imprecise in that two different molecules can have exactly the same fingerprint. The converse is also true: many molecular fingerprints exaggerate small differences between two molecules that most chemists would say are similar - for example between oxygen and sulfur analogs of the same structure.&lt;/p&gt;

&lt;p&gt;Despite their limitations, the advantages of fingerprints make them useful in many situations. As a result, numerous fingerprinting systems have become popular. This tutorial will focus on creating and manipulating molecular fingerprints from Ruby using the Ruby Chemistry Development Kit (RCDK).&lt;/p&gt;

&lt;h4&gt;Prerequisites&lt;/h4&gt;

&lt;p&gt;For this tutorial, you'll need &lt;a href="http://depth-first.com/articles/2006/10/30/agile-chemical-informatics-development-with-cdk-and-ruby-rcdk-0-3-0"&gt;Ruby CDK&lt;/a&gt; (RCDK). A recent article described the small amount of system configuration required for &lt;a href="http://depth-first.com/articles/2006/09/25/cdk-the-ruby-way-rcdk-0-2-0"&gt;RCDK on Linux&lt;/a&gt;. Another article showed how to install &lt;a href="http://depth-first.com/articles/2006/10/12/running-ruby-java-bridge-on-windows"&gt;RCDK on Windows&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;A Small Fingerprint Library&lt;/h4&gt;

&lt;p&gt;Let's build a small Ruby library for working with fingerprints. Place the following code into a file called &lt;strong&gt;fingerprint.rb&lt;/strong&gt; in your working directory:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_ruby "&gt;&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;rubygems&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;
&lt;span class="ident"&gt;require_gem&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;rcdk&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;
&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;rcdk/util&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;

&lt;span class="ident"&gt;jrequire&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;org.openscience.cdk.fingerprint.Fingerprinter&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;
&lt;span class="ident"&gt;jrequire&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;org.openscience.cdk.similarity.Tanimoto&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;

&lt;span class="comment"&gt;# Molecule fingerprinting&lt;/span&gt;
&lt;span class="keyword"&gt;class &lt;/span&gt;&lt;span class="class"&gt;Fingerprinter&lt;/span&gt;
  &lt;span class="keyword"&gt;def &lt;/span&gt;&lt;span class="method"&gt;initialize&lt;/span&gt;
    &lt;span class="attribute"&gt;@fingerprinter&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;Org&lt;/span&gt;&lt;span class="punct"&gt;::&lt;/span&gt;&lt;span class="constant"&gt;Openscience&lt;/span&gt;&lt;span class="punct"&gt;::&lt;/span&gt;&lt;span class="constant"&gt;Cdk&lt;/span&gt;&lt;span class="punct"&gt;::&lt;/span&gt;&lt;span class="constant"&gt;Fingerprint&lt;/span&gt;&lt;span class="punct"&gt;::&lt;/span&gt;&lt;span class="constant"&gt;Fingerprinter&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;new&lt;/span&gt;
  &lt;span class="keyword"&gt;end&lt;/span&gt;

  &lt;span class="keyword"&gt;def &lt;/span&gt;&lt;span class="method"&gt;fingerprint&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;smiles&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;
    &lt;span class="ident"&gt;mol&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;RCDK&lt;/span&gt;&lt;span class="punct"&gt;::&lt;/span&gt;&lt;span class="constant"&gt;Util&lt;/span&gt;&lt;span class="punct"&gt;::&lt;/span&gt;&lt;span class="constant"&gt;Lang&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;read_smiles&lt;/span&gt; &lt;span class="ident"&gt;smiles&lt;/span&gt;

    &lt;span class="ident"&gt;fp&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="attribute"&gt;@fingerprinter&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;getFingerprint&lt;/span&gt; &lt;span class="ident"&gt;mol&lt;/span&gt;

    &lt;span class="comment"&gt;# Metaprogramming!&lt;/span&gt;
    &lt;span class="ident"&gt;fp&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;extend&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="constant"&gt;Fingerprint&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;
  &lt;span class="keyword"&gt;end&lt;/span&gt;
&lt;span class="keyword"&gt;end&lt;/span&gt;

&lt;span class="comment"&gt;# BitSet comparison&lt;/span&gt;
&lt;span class="keyword"&gt;module &lt;/span&gt;&lt;span class="module"&gt;Fingerprint&lt;/span&gt;
  &lt;span class="comment"&gt;# Returns true of all of the bits set to true in this fingerprint are also set to true in the specified fingerprint&lt;/span&gt;
  &lt;span class="keyword"&gt;def &lt;/span&gt;&lt;span class="method"&gt;subset?&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;fingerprint&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;
    &lt;span class="constant"&gt;Org&lt;/span&gt;&lt;span class="punct"&gt;::&lt;/span&gt;&lt;span class="constant"&gt;Openscience&lt;/span&gt;&lt;span class="punct"&gt;::&lt;/span&gt;&lt;span class="constant"&gt;Cdk&lt;/span&gt;&lt;span class="punct"&gt;::&lt;/span&gt;&lt;span class="constant"&gt;Fingerprint&lt;/span&gt;&lt;span class="punct"&gt;::&lt;/span&gt;&lt;span class="constant"&gt;Fingerprinter&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;isSubset&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;fingerprint&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="constant"&gt;self&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;
  &lt;span class="keyword"&gt;end&lt;/span&gt;

  &lt;span class="comment"&gt;# Tanimoto similarity of this fingerprint and the specified fingerprint&lt;/span&gt;
  &lt;span class="keyword"&gt;def &lt;/span&gt;&lt;span class="method"&gt;tanimoto&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;fingerprint&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;
    &lt;span class="constant"&gt;Org&lt;/span&gt;&lt;span class="punct"&gt;::&lt;/span&gt;&lt;span class="constant"&gt;Openscience&lt;/span&gt;&lt;span class="punct"&gt;::&lt;/span&gt;&lt;span class="constant"&gt;Cdk&lt;/span&gt;&lt;span class="punct"&gt;::&lt;/span&gt;&lt;span class="constant"&gt;Similarity&lt;/span&gt;&lt;span class="punct"&gt;::&lt;/span&gt;&lt;span class="constant"&gt;Tanimoto&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;calculate&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="constant"&gt;self&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="ident"&gt;fingerprint&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;
  &lt;span class="keyword"&gt;end&lt;/span&gt;
&lt;span class="keyword"&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Of particular note is the use of Ruby's &lt;tt&gt;Object.extend&lt;/tt&gt; method. This method allows a single instance of an object to be extended at runtime - a form of &lt;a href="http://depth-first.com/articles/2006/10/24/metaprogramming-with-ruby-mapping-java-packages-onto-ruby-modules"&gt;metaprogramming&lt;/a&gt;. In this case, we add the &lt;tt&gt;subset?&lt;/tt&gt; and &lt;tt&gt;tanimoto&lt;/tt&gt; methods for determining whether all of the bits in one fingerprint are present in another, and for determining similarity, respectively. We use this technique here because currently RJB doesn't provide the complete interface into Java classes that would be required to create a Ruby class that directly inherits from Java's BitSet class.&lt;/p&gt;

&lt;h4&gt;Testing the Library&lt;/h4&gt;

&lt;p&gt;&lt;center&gt;&lt;img src="http://depth-first.com/demo/20061122/loratadine.png"&gt;&lt;/img&gt;&lt;img src="http://depth-first.com/demo/20061122/desloratadine.png"&gt;&lt;/img&gt;&lt;/center&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=3957"&gt;Claritin&lt;/a&gt; (loratadine, left) and &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=124087"&gt;Clarinex&lt;/a&gt; (desloratadine, right) are two structurally-related antihistamines. Can we quantitate the degree of similarity between these two structures? Fingerprints provide one way. The following code creates fingerprints for the two structures, determines if one is the subset of another, and assigns a Tanimoto similarity value:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_ruby "&gt;&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;fingerprint&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;

&lt;span class="ident"&gt;f&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;Fingerprinter&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;new&lt;/span&gt;

&lt;span class="ident"&gt;loratadine&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;f&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;fingerprint&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;CCOC(=O)N1CCC(=C2C3=C(CCC4=C2N=CC=C4)C=C(C=C3)Cl)CC1&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;
&lt;span class="ident"&gt;desloratadine&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;f&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;fingerprint&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;C1CC2=C(C=CC(=C2)Cl)C(=C3CCNCC3)C4=C1C=CC=N4&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;

&lt;span class="ident"&gt;puts&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;Loratadine is a subset of desloratadine: &lt;span class="expr"&gt;#{loratadine.subset? desloratadine}&lt;/span&gt;&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt; &lt;span class="comment"&gt;# =&amp;gt; false&lt;/span&gt;
&lt;span class="ident"&gt;puts&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;Desloratadine is a subset of loratadine: &lt;span class="expr"&gt;#{desloratadine.subset? loratadine}&lt;/span&gt;&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt; &lt;span class="comment"&gt;# =&amp;gt; true&lt;/span&gt;
&lt;span class="ident"&gt;puts&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;Tanimoto similarity of desloratadine and loratadine: &lt;span class="expr"&gt;#{loratadine.tanimoto desloratadine}&lt;/span&gt;&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt; &lt;span class="comment"&gt;# =&amp;gt; 0.895683467388153&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h4&gt;Variations&lt;/h4&gt;

&lt;p&gt;CDK's &lt;tt&gt;&lt;a href="http://cdk.sourceforge.net/api/org/openscience/cdk/fingerprint/Fingerprinter.html"&gt;Fingerprinter&lt;/a&gt;&lt;/tt&gt; class returns an instance of the Java class &lt;tt&gt;&lt;a href="http://java.sun.com/j2se/1.5.0/docs/api/java/util/BitSet.html"&gt;BitSet&lt;/a&gt;&lt;/tt&gt;. This &lt;tt&gt;BitSet&lt;/tt&gt; can be further manipulated in Ruby. For example, to find the size (the total number of bits) of the &lt;tt&gt;BitSet&lt;/tt&gt;, we could use:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_ruby "&gt;&lt;span class="ident"&gt;loratadine&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;size&lt;/span&gt; &lt;span class="comment"&gt;# =&amp;gt; 1024&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Similarly, to find the number of bits set to true, we would use:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_ruby "&gt;&lt;span class="ident"&gt;loratadine&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;cardinality&lt;/span&gt; &lt;span class="comment"&gt;# =&amp;gt; 278&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;To print out a list of all bits set to true, we could use the &lt;tt&gt;toString&lt;/tt&gt; method:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_ruby "&gt;&lt;span class="ident"&gt;loratadine&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;toString&lt;/span&gt; &lt;span class="comment"&gt;# =&amp;gt; &amp;quot;{2, 8, 11, 16, 18, 22, 32, 37, 38, 41, 42, 46, 47, 51, 57, 64, 65, 66, 69 ... }&amp;quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h4&gt;Conclusions&lt;/h4&gt;

&lt;p&gt;Fingerprints enable many useful and fast comparisons between molecules. The form of fingerprint we've used here is but one of possibilities offered by CDK. The next article in this series will discuss fingerprints in &lt;a href="http://openbabel.sourceforge.net/wiki/Fingerprint"&gt;Open Babel&lt;/a&gt; using both Ruby and Python.&lt;/p&gt;</description>
      <pubDate>Wed, 22 Nov 2006 15:44:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:b5807052-d051-4121-b89f-1d8cc908ef4f</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2006/11/22/scripting-molecular-fingerprints-with-ruby-cdk</link>
      <category>Tools</category>
      <category>fingerprint</category>
      <category>bitset</category>
      <category>similarity</category>
      <category>rcdk</category>
      <category>ruby</category>
    </item>
  </channel>
</rss>
