Fast Substructure Search Using Open Source Tools Part 6: Modelling a One-To-Many Relationship Between Fingerprints and Compounds in Ruby

Posted by Rich Apodaca Wed, 29 Oct 2008 17:15:00 GMT

We can think of a fingerprint as a bucket into which every molecule in the universe can be reproducibly placed. Each molecule will belong to a single bucket, but each bucket may contain any number of molecules. In other words, there exists a one-to-many relationship between a fingerprint and its associated molecules. The previous article in this series discussed how to model this relationship using SQL. This article will take the idea one step further by describing one way to model this relationship in Ruby.

All Articles in this Series:

SQL Recap

So far, we've set up a fingerprints database:

mysql> describe fingerprints;
+--------+---------------------+------+-----+---------+----------------+
| Field  | Type                | Null | Key | Default | Extra          |
+--------+---------------------+------+-----+---------+----------------+
| id     | int(11)             | NO   | PRI | NULL    | auto_increment | 
| byte0  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte1  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte2  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte3  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte4  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte5  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte6  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte7  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte8  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte9  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte10 | bigint(64) unsigned | YES  |     | 0       |                | 
| byte11 | bigint(64) unsigned | YES  |     | 0       |                | 
| byte12 | bigint(64) unsigned | YES  |     | 0       |                | 
| byte13 | bigint(64) unsigned | YES  |     | 0       |                | 
| byte14 | bigint(64) unsigned | YES  |     | 0       |                | 
| byte15 | bigint(64) unsigned | YES  |     | 0       |                | 
+--------+---------------------+------+-----+---------+----------------+
17 rows in set (0.00 sec)

This database contains a single (empty) fingerprint:

mysql> select * from fingerprints;
+----+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+--------+--------+--------+--------+
| id | byte0 | byte1 | byte2 | byte3 | byte4 | byte5 | byte6 | byte7 | byte8 | byte9 | byte10 | byte11 | byte12 | byte13 | byte14 | byte15 |
+----+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+--------+--------+--------+--------+
|  1 |     0 |     0 |     0 |     0 |     0 |     0 |     0 |     0 |     0 |     0 |      0 |      0 |      0 |      0 |      0 |      0 | 
+----+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+--------+--------+--------+--------+
1 row in set (0.00 sec)

We've also set up a compounds database containing a foreign key (fingerprint_id) into the fingerprints table:

mysql> describe compounds;
+----------------+---------+------+-----+---------+----------------+
| Field          | Type    | Null | Key | Default | Extra          |
+----------------+---------+------+-----+---------+----------------+
| id             | int(11) | NO   | PRI | NULL    | auto_increment | 
| fingerprint_id | int(11) | YES  |     | NULL    |                | 
| smiles         | text    | YES  |     | NULL    |                | 
+----------------+---------+------+-----+---------+----------------+
3 rows in set (0.00 sec)

In this hypothetical example, the compounds database is populated by two molecules, benzene and bromobenzene, both of which share the same fingerprint:

mysql> select * from compounds;
+----+----------------+------------+
| id | fingerprint_id | smiles     |
+----+----------------+------------+
|  1 |              1 | c1ccccc1   | 
|  2 |              1 | c1ccccc1Br | 
+----+----------------+------------+
2 rows in set (0.00 sec)

Adding the Ruby Layer

In Part 3, we created a CRUD API for fingerprints in Ruby. We now need to modify the class we created there, Fingerprint, to make it aware of the Compounds it will be associated with.

For brevity, you can view the updated Fingerprint class here. The main change has been to add a single line of code that tells Fingerprint that it's now associated with a class called Compound:

  has_many :compounds
All that remains is to bring the Compound class into being:
require 'rubygems'
require 'active_record'
require 'fingerprint'

ActiveRecord::Base.establish_connection(
  :adapter    => 'mysql',
  :host       => 'localhost',
  :username   =>  'root',
  :password   =>  '',
  :database   =>  'compounds'
)

class Compound < ActiveRecord::Base
  belongs_to :fingerprint
end
The belongs_to line is the counterpart to Fingerprint's has_many line. Together, both Fingerprint and Compound create a system in which each Fingerprint can reference multiple Compounds and each Compound references one Fingerprint. Let's test this with interactive Ruby:
$ irb
irb(main):001:0> require 'fingerprint'
=> true
irb(main):002:0> f=Fingerprint.find 1
=> #<Fingerprint id: 1, byte0: 0, byte1: 0, byte2: 0, byte3: 0, byte4: 0, byte5: 0, byte6: 0, byte7: 0, byte8: 0, byte9: 0, byte10: 0, byte11: 0, byte12: 0, byte13: 0, byte14: 0, byte15: 0>
irb(main):003:0> f.compounds
=> [#<Compound id: 1, fingerprint_id: 1, smiles: "c1ccccc1">, #<Compound id: 2, fingerprint_id: 1, smiles: "c1ccccc1Br">]

Looks good. Our code has made the correct association between a Fingerprint and its Compounds. What about the other way around?

$ irb
irb(main):001:0> require 'compound'
=> true
irb(main):002:0> c=Compound.find 1
=> #<Compound id: 1, fingerprint_id: 1, smiles: "c1ccccc1">
irb(main):003:0> c.fingerprint
=> #<Fingerprint id: 1, byte0: 0, byte1: 0, byte2: 0, byte3: 0, byte4: 0, byte5: 0, byte6: 0, byte7: 0, byte8: 0, byte9: 0, byte10: 0, byte11: 0, byte12: 0, byte13: 0, byte14: 0, byte15: 0>

As expected, the first Compound became associated with the correct Fingerprint.

Conclusions

Our system can now store and query molecular fingerprints in a relational database. It also associates multiple compounds with each fingerprint.

We have a complete fingerprint screening system, but not a substructure search system.

What's missing? For one thing, we'd need a way to perform atom-by-atom searches (ABAS) of all candidate structures after the fingerprint screening process is complete. Recall that just because a query fingerprint matches a candidate fingerprint doesn't necessarily mean that a substructure match has been found.

We'd also need a way to conveniently get real compounds with real fingerprints into our database. Only then would we be able to test the chemical validity of substructure queries.

The remaining articles in this series will discuss approaches to each of these requirements.

Image Credit: leeechy

Fast Substructure Search Using Open Source Tools Part 5: Relating Molecules to Fingerprints with SQL 4

Posted by Rich Apodaca Tue, 21 Oct 2008 01:28:00 GMT

A molecular fingerprint is a special kind of hash function that can reproducibly place any molecule, known or unknown, into one of a large but finite set of groups. Each molecule will be associated with exactly one fingerprint, but each fingerprint can be associated with multiple molecules. In other words, there exists a one-to-many relationship between fingerprints and molecules. This article outlines one of the final steps in creating a substructure-searchable relational chemical database by describing a simple method for associating fingerprints and molecules.

All Articles in this Series:

Modelling a One-To-Many Relationship

The one-to-many relationship is one of the most fundamental concepts in relational databases. In our case, we'd like to create a new table called compounds. We'd furthermore like to link each row in the compounds table with a row in the fingerprints table. This can be accomplished by adding a column to the compounds table that's capable of holding an id from the "fingerprints" table (foreign key).

This would then give us the ability to gather all of the rows in the compounds table that match a particular fingerprint (or group of fingerprints).

Creating the compounds Table

The compounds table we'll create will store three pieces of information:

  1. A unique id (something that all of our tables will have).
  2. An integer column called "fingerprint_id" that will store the unique id of a fingerprint described by a row in the fingerprints table.
  3. A string column called "smiles" that will hold the SMILES string of each compound in compact form.

We can create the table with the following:

mysql> create table compounds (id int not null auto_increment, primary key(id), fingerprint_id int, smiles text);
Query OK, 0 rows affected (0.01 sec)

mysql> describe compounds;
+----------------+---------+------+-----+---------+----------------+
| Field          | Type    | Null | Key | Default | Extra          |
+----------------+---------+------+-----+---------+----------------+
| id             | int(11) | NO   | PRI | NULL    | auto_increment | 
| fingerprint_id | int(11) | YES  |     | NULL    |                | 
| smiles         | text    | YES  |     | NULL    |                | 
+----------------+---------+------+-----+---------+----------------+
3 rows in set (0.00 sec)

Using compounds and fingerprints Together

Now let's populate our database with some simple, fake data. If you haven't done so already, delete all rows from your existing fingerprints table:

mysql> delete from fingerprints;
Query OK, 0 rows affected (0.00 sec)

mysql> select * from fingerprints;
Empty set (0.00 sec)

mysql> describe fingerprints;
+--------+---------------------+------+-----+---------+----------------+
| Field  | Type                | Null | Key | Default | Extra          |
+--------+---------------------+------+-----+---------+----------------+
| id     | int(11)             | NO   | PRI | NULL    | auto_increment | 
| byte0  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte1  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte2  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte3  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte4  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte5  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte6  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte7  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte8  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte9  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte10 | bigint(64) unsigned | YES  |     | 0       |                | 
| byte11 | bigint(64) unsigned | YES  |     | 0       |                | 
| byte12 | bigint(64) unsigned | YES  |     | 0       |                | 
| byte13 | bigint(64) unsigned | YES  |     | 0       |                | 
| byte14 | bigint(64) unsigned | YES  |     | 0       |                | 
| byte15 | bigint(64) unsigned | YES  |     | 0       |                | 
+--------+---------------------+------+-----+---------+----------------+
17 rows in set (0.01 sec)

Now let's create a dummy fingerprint for the sake of simplicity:

mysql> insert into fingerprints () values();
Query OK, 1 row affected (0.00 sec)

mysql> select * from fingerprints;
+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+--------+--------+--------+--------+
| id    | byte0 | byte1 | byte2 | byte3 | byte4 | byte5 | byte6 | byte7 | byte8 | byte9 | byte10 | byte11 | byte12 | byte13 | byte14 | byte15 |
+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+--------+--------+--------+--------+
| 16806 |     0 |     0 |     0 |     0 |     0 |     0 |     0 |     0 |     0 |     0 |      0 |      0 |      0 |      0 |      0 |      0 | 
+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+--------+--------+--------+--------+
1 row in set (0.00 sec)

Let's associate two compounds with this fingerprint:

mysql> insert into compounds (fingerprint_id,smiles) values(16806,'c1ccccc1');
Query OK, 1 row affected (0.01 sec)

mysql> insert into compounds (fingerprint_id,smiles) values(16806,'c1ccccc1Br');
Query OK, 1 row affected (0.00 sec)

mysql> select * from compounds;
+----+----------------+------------+
| id | fingerprint_id | smiles     |
+----+----------------+------------+
| 20 |          16806 | c1ccccc1   | 
| 21 |          16806 | c1ccccc1Br | 
+----+----------------+------------+
2 rows in set (0.00 sec)

We can now find all compounds with fingerprints containing no bits set:

mysql> select compounds.* from compounds inner join fingerprints on compounds.fingerprint_id=fingerprints.id where fingerprints.byte0=0 and fingerprints.byte1=0 and fingerprints.byte2=0 and fingerprints.byte3=0 and fingerprints.byte4=0 and fingerprints.byte5=0 and fingerprints.byte6=0 and fingerprints.byte7=0 and fingerprints.byte8=0 and fingerprints.byte8=0 and fingerprints.byte9=0 and fingerprints.byte10=0 and fingerprints.byte11=0 and fingerprints.byte12=0 and fingerprints.byte13=0 and fingerprints.byte14=0 and fingerprints.byte15=0;
+----+----------------+------------+
| id | fingerprint_id | smiles     |
+----+----------------+------------+
| 20 |          16806 | c1ccccc1   | 
| 21 |          16806 | c1ccccc1Br | 
+----+----------------+------------+
2 rows in set (0.00 sec)

We could just as easily replace the "=" operator with the "&" operator to perform substructure fingerprint screens. Although the data we're using is hardly realistic, the same concepts apply regardless of how the fingerprints are constructed.

Conclusions

We now have a way to associate fingerprints with compounds stored in our database. Although we could continue to populate and query our database using hand-coded SQL statements, what we'd really like to use is an API written in a high-level programming language. The next article in this series will demonstrate how this can be done in Ruby.

Image Credit: Roberto F.

Fast Substructure Search Using Open Source Tools Part 3: A CRUD API for Fingerprints in Ruby

Posted by Rich Apodaca Mon, 06 Oct 2008 22:13:00 GMT

The previous article in this series showed how to perform fingerprint screens for substructure searches using nothing more than SQL. Although this is significant progress, working at the level of SQL queries to perform create, read, update, and delete operations (CRUD) on our fingerprint table is more work than it needs to be. We'd really prefer to use an API written in a high-level programming language. This article describes a simple Ruby API for managing and querying a database of molecular fingerprints.

All Articles in this Series:

Some Changes to the Database Schema

Before we move forward, we must deal with one minor detail. By default MySQL uses signed 64-bit integers. This gives a range for integers of -9223372036854775808 to 9223372036854775807. Ruby, on the other hand, can work with integers of any size through the Bignum class - and we'll be taking full advantage of this feature.

If we want to avoid the headache of constantly accounting for the difference, we need to tell our database to use unsigned integers in the fingerprints table. This can be done by first dropping the old table:

mysql> drop table fingerprints;
Query OK, 0 rows affected (0.00 sec)

Now let's create a new table in which the fpn columns store unsigned integers only. While we're at it, let's change the naming of these columns from fpn to the more descriptive byten and set a default value of zero.

The new table can be created with with:

mysql> create table fingerprints(id int not null auto_increment, primary key(id), byte0 bigint(64) unsigned default 0, byte1 bigint(64) unsigned default 0, byte2 bigint(64) unsigned default 0, byte3 bigint(64) unsigned default 0, byte4 bigint(64) unsigned default 0, byte5 bigint(64) unsigned default 0, byte6 bigint(64) unsigned default 0, byte7 bigint(64) unsigned default 0, byte8 bigint(64) unsigned default 0, byte9 bigint(64) unsigned default 0, byte10 bigint(64) unsigned default 0, byte11 bigint(64) unsigned default 0, byte12 bigint(64) unsigned default 0, byte13 bigint(64) unsigned default 0, byte14 bigint(64) unsigned default 0, byte15 bigint(64) unsigned default 0);
Query OK, 0 rows affected (0.00 sec)

mysql> describe fingerprints;
+--------+---------------------+------+-----+---------+----------------+
| Field  | Type                | Null | Key | Default | Extra          |
+--------+---------------------+------+-----+---------+----------------+
| id     | int(11)             | NO   | PRI | NULL    | auto_increment | 
| byte0  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte1  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte2  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte3  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte4  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte5  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte6  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte7  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte8  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte9  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte10 | bigint(64) unsigned | YES  |     | 0       |                | 
| byte11 | bigint(64) unsigned | YES  |     | 0       |                | 
| byte12 | bigint(64) unsigned | YES  |     | 0       |                | 
| byte13 | bigint(64) unsigned | YES  |     | 0       |                | 
| byte14 | bigint(64) unsigned | YES  |     | 0       |                | 
| byte15 | bigint(64) unsigned | YES  |     | 0       |                | 
+--------+---------------------+------+-----+---------+----------------+
17 rows in set (0.01 sec)

We're now ready to create the Ruby API.

The API

The code below is all we need to begin querying and managing our fingerprint database in Ruby:

require 'rubygems'
require 'active_record'

ActiveRecord::Base.establish_connection(
  :adapter    => 'mysql',
  :host       => 'localhost',
  :username   =>  'root',
  :password   =>  '',
  :database   =>  'compounds'
)

class Fingerprint < ActiveRecord::Base
  @@bytes_prefix = "byte"

  def each_byte
    0.upto(byte_count - 1) {|i| yield send("#{@@bytes_prefix}#{i}")  }
  end

  def each_byte_with_index
    0.upto(byte_count - 1) {|i| yield send("#{@@bytes_prefix}#{i}"), i  }
  end

  def fill_bytes
    0.upto(byte_count - 1) {|i| send("#{@@bytes_prefix}#{i}=", yield(i))}

    self
  end

  def to_byte_array
    Array.new(16).fill{|i| send("#{@@bytes_prefix}#{i}")}
  end

  def byte_count
    result = 0

    while respond_to? "#{@@bytes_prefix}#{result}"
      result += 1
    end

    result
  end

  def bitstring
    result = ""

    each_byte {|byte| result +=  sprintf("%064b", byte)}

    result
  end

  def cardinality
    bitstring.count("1")
  end

  def eql?(other)
    to_byte_array.eql?(other.to_byte_array)
  end

  def save
    return false unless Fingerprint.find_by_fingerprint(self).empty?

    super
  end

  def self.find_by_fingerprint fingerprint
    Fingerprint.find_by_sql sql_for_find_by_fingerprint(fingerprint)
  end

  def self.find_children_by_fingerprint fingerprint
    Fingerprint.find_by_sql sql_for_find_children_by_fingerprint(fingerprint)
  end

  def self.sql_for_find_by_fingerprint fingerprint
    result = "select fingerprints.* from fingerprints where "
    last = fingerprint.byte_count - 1

    fingerprint.each_byte_with_index do |byte, i|
      result += "#{@@bytes_prefix}#{i}=#{byte}" + ((i ==last) ? "" : " and ")
    end

    result
  end

  def self.sql_for_find_children_by_fingerprint fingerprint
    result = "select fingerprints.* from fingerprints where "
    last = fingerprint.byte_count - 1

    fingerprint.each_byte_with_index do |byte, i|
      result += "#{@@bytes_prefix}#{i}&#{byte}=#{byte}" + ((i ==last) ? "" : " and ")
    end

    result
  end
end

Testing the API

We can test this library from interactive ruby (irb). Let's add two fingerprints - the first consisting of all bits set to "1" and the second consisting of alternating "1" and "0" bits:

$ irb
irb(main):001:0> require 'fingerprint'
=> true
irb(main):002:0> f1=Fingerprint.new.fill_bytes{"ffffffffffffffff".hex}
=> #<Fingerprint id: nil, byte0: 18446744073709551615, byte1: 18446744073709551615, byte2: 18446744073709551615, byte3: 18446744073709551615, byte4: 18446744073709551615, byte5: 18446744073709551615, byte6: 18446744073709551615, byte7: 18446744073709551615, byte8: 18446744073709551615, byte9: 18446744073709551615, byte10: 18446744073709551615, byte11: 18446744073709551615, byte12: 18446744073709551615, byte13: 18446744073709551615, byte14: 18446744073709551615, byte15: 18446744073709551615>
irb(main):003:0> f1.save
=> true
irb(main):004:0> f2=Fingerprint.new.fill_bytes{"aaaaaaaaaaaaaaaa".hex}
=> #<Fingerprint id: nil, byte0: 12297829382473034410, byte1: 12297829382473034410, byte2: 12297829382473034410, byte3: 12297829382473034410, byte4: 12297829382473034410, byte5: 12297829382473034410, byte6: 12297829382473034410, byte7: 12297829382473034410, byte8: 12297829382473034410, byte9: 12297829382473034410, byte10: 12297829382473034410, byte11: 12297829382473034410, byte12: 12297829382473034410, byte13: 12297829382473034410, byte14: 12297829382473034410, byte15: 12297829382473034410>
irb(main):005:0> f2.save
=> true

Let's find the fingerprint in which all bits are turned on:

$ irb
irb(main):001:0> require 'fingerprint'
=> true
irb(main):002:0> query=Fingerprint.new.fill_bytes{"ffffffffffffffff".hex}
=> #<Fingerprint id: nil, byte0: 18446744073709551615, byte1: 18446744073709551615, byte2: 18446744073709551615, byte3: 18446744073709551615, byte4: 18446744073709551615, byte5: 18446744073709551615, byte6: 18446744073709551615, byte7: 18446744073709551615, byte8: 18446744073709551615, byte9: 18446744073709551615, byte10: 18446744073709551615, byte11: 18446744073709551615, byte12: 18446744073709551615, byte13: 18446744073709551615, byte14: 18446744073709551615, byte15: 18446744073709551615>
irb(main):003:0> Fingerprint.find_by_fingerprint query
=> [#<Fingerprint id: 111, byte0: 18446744073709551615, byte1: 18446744073709551615, byte2: 18446744073709551615, byte3: 18446744073709551615, byte4: 18446744073709551615, byte5: 18446744073709551615, byte6: 18446744073709551615, byte7: 18446744073709551615, byte8: 18446744073709551615, byte9: 18446744073709551615, byte10: 18446744073709551615, byte11: 18446744073709551615, byte12: 18446744073709551615, byte13: 18446744073709551615, byte14: 18446744073709551615, byte15: 18446744073709551615>]

Our query has found an exact match for the query fingerprint in the database at row 111. (This id is not 1 because previous automated tests that I wrote and executed have added and removed rows, advancing the id counter).

We can also search the database for the children of an arbitrary fingerprint query. A test fingerprint A is a "child" of query Q if all of the set bits in Q are also set in A. Notice that this leaves open the possibility that A has more bits set than Q. For example:

$ irb
irb(main):001:0> require 'fingerprint'
=> true
irb(main):002:0> query=Fingerprint.new.fill_bytes{"aaaaaaaaaaaaaaaa".hex}
=> #<Fingerprint id: nil, byte0: 12297829382473034410, byte1: 12297829382473034410, byte2: 12297829382473034410, byte3: 12297829382473034410, byte4: 12297829382473034410, byte5: 12297829382473034410, byte6: 12297829382473034410, byte7: 12297829382473034410, byte8: 12297829382473034410, byte9: 12297829382473034410, byte10: 12297829382473034410, byte11: 12297829382473034410, byte12: 12297829382473034410, byte13: 12297829382473034410, byte14: 12297829382473034410, byte15: 12297829382473034410>
irb(main):003:0> results = Fingerprint.find_children_by_fingerprint query
=> [#<Fingerprint id: 112, byte0: 12297829382473034410, byte1: 12297829382473034410, byte2: 12297829382473034410, byte3: 12297829382473034410, byte4: 12297829382473034410, byte5: 12297829382473034410, byte6: 12297829382473034410, byte7: 12297829382473034410, byte8: 12297829382473034410, byte9: 12297829382473034410, byte10: 12297829382473034410, byte11: 12297829382473034410, byte12: 12297829382473034410, byte13: 12297829382473034410, byte14: 12297829382473034410, byte15: 12297829382473034410>, #<Fingerprint id: 111, byte0: 18446744073709551615, byte1: 18446744073709551615, byte2: 18446744073709551615, byte3: 18446744073709551615, byte4: 18446744073709551615, byte5: 18446744073709551615, byte6: 18446744073709551615, byte7: 18446744073709551615, byte8: 18446744073709551615, byte9: 18446744073709551615, byte10: 18446744073709551615, byte11: 18446744073709551615, byte12: 18446744073709551615, byte13: 18446744073709551615, byte14: 18446744073709551615, byte15: 18446744073709551615>]

It worked - both fingerprints stored in the database were found.

We can delete a Fingerprint like this:

$ irb
irb(main):001:0> require 'fingerprint'
=> true
irb(main):002:0> f=Fingerprint.find 112
=> #<Fingerprint id: 112, byte0: 12297829382473034410, byte1: 12297829382473034410, byte2: 12297829382473034410, byte3: 12297829382473034410, byte4: 12297829382473034410, byte5: 12297829382473034410, byte6: 12297829382473034410, byte7: 12297829382473034410, byte8: 12297829382473034410, byte9: 12297829382473034410, byte10: 12297829382473034410, byte11: 12297829382473034410, byte12: 12297829382473034410, byte13: 12297829382473034410, byte14: 12297829382473034410, byte15: 12297829382473034410>
irb(main):003:0> f.destroy
=> #<Fingerprint id: 112, byte0: 12297829382473034410, byte1: 12297829382473034410, byte2: 12297829382473034410, byte3: 12297829382473034410, byte4: 12297829382473034410, byte5: 12297829382473034410, byte6: 12297829382473034410, byte7: 12297829382473034410, byte8: 12297829382473034410, byte9: 12297829382473034410, byte10: 12297829382473034410, byte11: 12297829382473034410, byte12: 12297829382473034410, byte13: 12297829382473034410, byte14: 12297829382473034410, byte15: 12297829382473034410>
irb(main):004:0> Fingerprint.count
=> 1

Active Record and the Fingerprint API

The Fingerprint class is so concise because it takes advantage of the Ruby library called ActiveRecord. ActiveRecord is the object-relational mapping system used in Ruby on Rails. ActiveRecord can be used outside of Rails, as was done for this library, by including the code at the top of the file beginning with "ActiveRecord::Base.establish_connection...", where you'd use the parameters specific to your database.

We gain three key advantages with this approach: (1) we have very little SQL to code; (2) we have access to all of ActiveRecord's built-in CRUD operations such as counting records through Fingerprint.count and deleting Fingerprints with destroy without writing anything ourselves; and (3) we can easily integrate the Fingerprint class into any Ruby on Rails application.

Variations

At least two other Object-Ralational Mapping systems could be used from Ruby, DataMapper, and Sequel. The approach described here could be adapted to these other ORMS with minimal effort.

Conclusions

We now have a working fingerprint screening system built solely from open source components. MySQL houses the data and provides for highly-optimized queries. A concise Ruby API created with ActiveRecord now allows us to deal with our fingerprint database as a collection of objects in a high-level language. We can perform all CRUD operations without writing a line of SQL.

We've come a long way, but we're still not dealing with molecules. We previously saw how Open Babel can generate fingerprints with which we could, in principle, populate and query our database. The next article in this series will use this capability in creating a more chemically-aware system.

Image Credit: peerlingguy

Fast Substructure Search Using Open Source Tools Part 2: Fingerprint Screen With SQL 23

Posted by Rich Apodaca Fri, 03 Oct 2008 14:47:00 GMT

The previous article in this series discussed the configuration of a MySQL database for fast substructure search with binary fingerprints. This article first shows how to populate this database with real fingerprint data for two molecules. Then it shows how to formulate standard SQL queries to screen the database for substructures.

All articles in this series:

Creating the Fingerprints with Open Babel

The babel command line utility will, among it many conversions, return a fingerprint when given a valid SMILES string. For example, we can create the fingerprint for benzene like this:

$ babel -ismi -ofpt
c1ccccc1
>   6 bits set 
00000000 00000000 00000000 00000200 00000000 00000000 
00000000 00000000 00000000 00000840 00000000 00008000 
00000000 00000000 00000000 00000000 00000000 00000000 
00000000 00000000 00000000 08000000 00000000 00000000 
00000000 00000000 00000000 00000000 00000000 00020000 
00000000 00000000 
1 molecule converted
12 audit log messages

Similarly, we create the fingerprint for phenol like this:

$ babel -ismi -ofpt
c1ccccc1O 
>   12 bits set 
00000000 00000008 20000000 00000200 00000000 00000000 
02000000 00000000 00000000 00000840 00000000 00008000 
00000002 00000000 00000000 00000008 00000000 00000000 
00000000 00020000 00000000 08000000 00000000 00000000 
00000000 00000000 00000000 00000000 00000000 00020000 
00000000 00000000 
1 molecule converted
19 audit log messages

The exact meaning of these fingerprints is interesting, but not relevant. Without getting into the details of the Open Babel fingerprint formats, which are discussed in detail elsewhere, the output contains the binary fingerprint of each molecule as an array of 32-bit hexadecimal numbers.

Adding Fingerprints to the Database

To use Open Babel's fingerprints with out database, we need to convert the 32-bit hexadecimal numerical output to 64-bit decimal format. This is not difficult and most programming environments make this very simple. For example, the following Ruby code will convert the third and fourth 32-bit hexadecimal numbers in the benzene fingerprint into a 64-bit decimal number:

$ irb
irb(main):001:0> "0000000000000200".hex
=> 512

Performing this conversion for every pair of 32-bit hex numbers in each fingerprint gives a set of numbers we can place directly into our database:

mysql> # Add benzene decimal fingerprint.
mysql> insert into fingerprints
(fp0,fp1,fp2,fp3,fp4,fp5,fp6,fp7,fp8,fp9,fp10,fp11,fp12,fp13,fp14,fp15)
values
(0, 512, 0, 0, 2112, 32768, 0, 0, 0, 0, 134217728, 0, 0, 0, 131072, 0);

Similarly,

mysql> # Add phenol decimal fingerprint.
mysql> insert into fingerprints
(fp0,fp1,fp2,fp3,fp4,fp5,fp6,fp7,fp8,fp9,fp10,fp11,fp12,fp13,fp14,fp15)
values
(8, 2305843009213694464, 0, 144115188075855872, 2112, 32768, 8589934592, 8, 0, 131072, 134217728, 0, 0, 0, 131072, 0);

Our table is now ready to be queried:

mysql> select * from fingerprints;
+----+------+---------------------+------+--------------------+------+-------+------------+------+------+--------+-----------+------+------+------+--------+------+
| id | fp0  | fp1                 | fp2  | fp3                | fp4  | fp5   | fp6        | fp7  | fp8  | fp9    | fp10      | fp11 | fp12 | fp13 | fp14   | fp15 |


+----+------+---------------------+------+--------------------+------+-------+------------+------+------+--------+-----------+------+------+------+--------+------+
|  1 |    0 |                 512 |    0 |                  0 | 2112 | 32768 |          0 |    0 |    0 |      0 | 134217728 |    0 |    0 |    0 | 131072 |    0 | 
|  2 |    8 | 2305843009213694464 |    0 | 144115188075855872 | 2112 | 32768 | 8589934592 |    8 |    0 | 131072 | 134217728 |    0 |    0 |    0 | 131072 |    0 | 
+----+------+---------------------+------+--------------------+------+-------+------------+------+------+--------+-----------+------+------+------+--------+------+
2 rows in set (0.00 sec)

Querying the Database

With a table of fingerprints in hand, we can begin formulating queries. To do so, we'll use MySQL's built-in support for binary arithmetic.

A molecule with fingerprint A can represent a substructure of another molecule with fingerprint B if all of the bits in B are also present in A. Mathematically, we'd say that:

Bi & Ai = Bi

for all bits i in A and B.

Let's say we have a two-bit fingerprint consisting of 01 and 11 (binary) in our database. We can use MySQL to test whether the molecule from which the second fingerprint was derived could be a substructure of the molecule from which the first fingerprint was derived with this syntax:

mysql> select 1&3;
+-----+
| 1&3 |
+-----+
|   1 | 
+-----+
1 row in set (0.00 sec)

The answer is yes, there could be a substructure match because 1&3 = 1.

We're now ready to perform our first substructure screen using SQL. This consists of selecting all rows for which each of the 16 fingerprint components, when anded together with a query fingerprint component, gives back the original component.

To see if phenol is a substructure of benzene, we could use the following:

mysql> select id from fingerprints where fp0&0=0 and fp1&512=512 and fp2&0=0 and fp3&0=0 and fp4&2112=2112 and fp5&32768=32768 and fp6&0=0 and fp7&0=0 and fp8&0=0 and fp9&0=0 and fp10&134217728=134217728 and fp11&0=0 and fp12&0=0 and fp13&0=0 and fp14&131072=131072 and fp15&0=0;
+----+
| id |
+----+
|  1 | 
|  2 | 
+----+
2 rows in set (0.00 sec)

Our query results are telling us that phenol is both a substructure of benzene and itself, as expected.

Conclusions

We now have a database populated with two molecules represented as fingerprints. We can even scan the database for possible substructure matches using nothing more than standard SQL queries. Nevertheless, we've had to use a lot of manual coding to convert hex into decimal and create SQL. We need a library to do this mundane work for us. The next article in this series will discuss a better approach using Ruby.

Image Credit: dopeylok

Fast Substructure Search Using Open Source Tools Part 1: Fingerprints and Databases 6

Posted by Rich Apodaca Thu, 02 Oct 2008 23:27:00 GMT

For anyone working in a chemistry-related job, chemical databases are ubiquitous. A printed list of IUPAC names, a spreadsheet containing CAS numbers, and a set of hand-drawn structures on index cards are all primitive chemical databases. They aren't nearly as useful as they could be to either the creator or his/her collaborators, but they are databases nevertheless. Anyone who has spent time in industry or academics knows that these low-tech chemical databases are everywhere. And they become more of a problem as more information is moved into electronic format.

All articles in this series:

The Problem: Structure Search is Hard

Many of the low-tech chemical databases that professional chemists routinely share and work with would become orders of magnitude more useful if they were converted into substructure-searchable databases and published to the Web. Although there has been a great deal of effort toward this end in the last few years, there's still much, much more that could be done.

One of the main problems in creating a substructure-searchable chemical database is implementing the substructure search capability itself. This one requirement has done more to stifle the free flow of chemical information than perhaps any other. Solving the problem appears very difficult on first or second glance, and it is very difficult if you don't have the right tools. Many companies offer solutions - but at a price, both in terms of money and time, that is simply out of reach.

What can you do if you're just getting started with modest requirements and budget?

About This Series

This article, the first in a series, will describe the creation of a chemical substructure search engine using exclusively well-maintained and robust open source tools: Open Babel for generating fingerprints and peforming atom-by-atom searches; MySQL as a relational database; and Ruby as a scripting language.

Each of these three components is a commodity that can be replaced with any one of a number of open-source or proprietary substitutes, maximizing flexibility and minimizing vendor lock-in.

Other Resources

Norbert Haider of the University of Vienna has written a very useful tutorial on creating a structure-searchable database using free tools, which is part of a larger series. That series differs from this one in the technology stack used and the level of detail to be provided. The series of articles to appear here will spell out the low-level series of steps needed to create a working substructure search system. It's hoped that taking this perspective makes clear the steps needed to apply the approach to alternative technology platforms.

Binary Fingerprints and Relational Databases

At the heart of the system we'll build is the chemical fingerprint which is a (usually) lossy binary representation of a chemical structure. Creating a binary fingerprint is like putting every chemical structure, known or unknown into just one bin out of a very large, but finite set of bins. Although the same molecule is guaranteed to always go into the same bin, more than one molecule can be placed into each bin. This is a general feature of all hashing schemes.

Andrew Dalke has written an excellent series of articles on fingerprints and what can be done with them. Another good overview is available from Daylight. This article will assume you know what fingerprints are and how they can be used to compare chemical structures.

The problem with binary fingerprints is that they are generally several hundred bits long - too long to be represented in a form that allows direct and rapid query by a relational database system. They need to be broken up - but how?

A widely-used approach (and the one that will be taken here) involves breaking up the fingerprint into a series of integers that are stored in the database.

For example, let's say we have a 1024-bit fingerprint. We could represent this as a number from 0 to 2^1024, which of course is way to big for most computers to handle today. We could, however, represent this fingerprint as a series of sixteen 64-bit integers (which are available on most systems).

So, the binary fingerprint:

1111111101111111110110111011011000101000011000011010011100010000
1001100010101101000110100010110011101100100000100100000111010100
0101010000101011001010011001000100011001100000101100111010001110
1001000101001010000001011001100101101011111111011000111100000111
1010101100100101000100001100011001010111001001110101101100010010
0011101011101110110011111010000010111001100101001001101010110001
1100111000010100000100110111101001011100010111010001010101101101
0010001111111010111011110110000000001010111011111001111001111101
0101011100011111110111011110011110100110010110010101011001011111
0110100001111001101111011101001101101001000100010001100101111000
0011111001000100001111111110001100111001101000000100010010010110
0000011101001001011000111110101110010101110001111010100001100100
0100100111101010110101101010110110101010110110111011011001111111
0011100100101101101001000001000111110101011101110101101001101001
0110100100111001111001001111110111111001110100100110010100011110
0010101100101000011110101110111011001110101111100001011010101100

could also be represented as this decimal fingerprint (assuming your machine is big-endian):

18410675377121896208
11001478244984832468
6064987026359504526
10469186440276053767
12332281598675737362
4246559787872197297
14849515287603909997
2592647731284516477
6277980392575817311
7528256967824972152
4486781373924787350
525060695046727780
5326305550703244927
4120129631153511017
7582343227124114718
3109870708788696748

We can easily store this set of 16 numbers in a relational database table. For example, if we had a MySQL database called "compounds", we could create a "fingerprints" table:

mysql> create database compounds;
Query OK, 1 row affected (0.02 sec)

mysql> use compounds;

Database changed
mysql> create table fingerprints(id int not null auto_increment, primary key(id), fp0 bigint(64), fp1 bigint(64), fp2 bigint(64), fp3 bigint(64), fp4 bigint(64), fp5 bigint(64), fp6 bigint(64), fp7 bigint(64), fp8 bigint(64), fp9 bigint(64), fp10 bigint(64), fp11 bigint(64), fp12 bigint(64), fp13 bigint(64), fp14 bigint(64), fp15 bigint(64));
Query OK, 0 rows affected (0.01 sec)

mysql> describe fingerprints;
+-------+------------+------+-----+---------+----------------+
| Field | Type       | Null | Key | Default | Extra          |
+-------+------------+------+-----+---------+----------------+
| id    | int(11)    | NO   | PRI | NULL    | auto_increment | 
| fp0   | bigint(64) | YES  |     | NULL    |                | 
| fp1   | bigint(64) | YES  |     | NULL    |                | 
| fp2   | bigint(64) | YES  |     | NULL    |                | 
| fp3   | bigint(64) | YES  |     | NULL    |                | 
| fp4   | bigint(64) | YES  |     | NULL    |                | 
| fp5   | bigint(64) | YES  |     | NULL    |                | 
| fp6   | bigint(64) | YES  |     | NULL    |                | 
| fp7   | bigint(64) | YES  |     | NULL    |                | 
| fp8   | bigint(64) | YES  |     | NULL    |                | 
| fp9   | bigint(64) | YES  |     | NULL    |                | 
| fp10  | bigint(64) | YES  |     | NULL    |                | 
| fp11  | bigint(64) | YES  |     | NULL    |                | 
| fp12  | bigint(64) | YES  |     | NULL    |                | 
| fp13  | bigint(64) | YES  |     | NULL    |                | 
| fp14  | bigint(64) | YES  |     | NULL    |                | 
| fp15  | bigint(64) | YES  |     | NULL    |                | 
+-------+------------+------+-----+---------+----------------+
17 rows in set (0.01 sec)

Conclusions

Although we have neither a substructure search engine nor a database, we've laid a solid foundation for those things. The next article in this series will show how to use this humble beginning to model some simple substructure queries in a way that lets MySQL do most of the heavy-lifting.

Image Credit: Mr. Jaded