Fast Substructure Search Using Open Source Tools Part 6 - Modelling a One-To-Many Relationship Between Fingerprints and Compounds in Ruby

October 29, 2008

We can think of a fingerprint as a bucket into which every molecule in the universe can be reproducibly placed. Each molecule will belong to a single bucket, but each bucket may contain any number of molecules. In other words, there exists a one-to-many relationship between a fingerprint and its associated molecules. The previous article in this series discussed how to model this relationship using SQL. This article will take the idea one step further by describing one way to model this relationship in Ruby.

All Articles in this Series:

SQL Recap

So far, we've set up a fingerprints database:

mysql> describe fingerprints;
+--------+---------------------+------+-----+---------+----------------+
| Field  | Type                | Null | Key | Default | Extra          |
+--------+---------------------+------+-----+---------+----------------+
| id     | int(11)             | NO   | PRI | NULL    | auto_increment | 
| byte0  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte1  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte2  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte3  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte4  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte5  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte6  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte7  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte8  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte9  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte10 | bigint(64) unsigned | YES  |     | 0       |                | 
| byte11 | bigint(64) unsigned | YES  |     | 0       |                | 
| byte12 | bigint(64) unsigned | YES  |     | 0       |                | 
| byte13 | bigint(64) unsigned | YES  |     | 0       |                | 
| byte14 | bigint(64) unsigned | YES  |     | 0       |                | 
| byte15 | bigint(64) unsigned | YES  |     | 0       |                | 
+--------+---------------------+------+-----+---------+----------------+
17 rows in set (0.00 sec)

This database contains a single (empty) fingerprint:

mysql> select * from fingerprints;
+----+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+--------+--------+--------+--------+
| id | byte0 | byte1 | byte2 | byte3 | byte4 | byte5 | byte6 | byte7 | byte8 | byte9 | byte10 | byte11 | byte12 | byte13 | byte14 | byte15 |
+----+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+--------+--------+--------+--------+
|  1 |     0 |     0 |     0 |     0 |     0 |     0 |     0 |     0 |     0 |     0 |      0 |      0 |      0 |      0 |      0 |      0 | 
+----+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+--------+--------+--------+--------+
1 row in set (0.00 sec)

We've also set up a compounds database containing a foreign key (fingerprint_id) into the fingerprints table:

mysql> describe compounds;
+----------------+---------+------+-----+---------+----------------+
| Field          | Type    | Null | Key | Default | Extra          |
+----------------+---------+------+-----+---------+----------------+
| id             | int(11) | NO   | PRI | NULL    | auto_increment | 
| fingerprint_id | int(11) | YES  |     | NULL    |                | 
| smiles         | text    | YES  |     | NULL    |                | 
+----------------+---------+------+-----+---------+----------------+
3 rows in set (0.00 sec)

In this hypothetical example, the compounds database is populated by two molecules, benzene and bromobenzene, both of which share the same fingerprint:

mysql> select * from compounds;
+----+----------------+------------+
| id | fingerprint_id | smiles     |
+----+----------------+------------+
|  1 |              1 | c1ccccc1   | 
|  2 |              1 | c1ccccc1Br | 
+----+----------------+------------+
2 rows in set (0.00 sec)

Adding the Ruby Layer

In Part 3, we created a CRUD API for fingerprints in Ruby. We now need to modify the class we created there, Fingerprint, to make it aware of the Compounds it will be associated with.

For brevity, you can view the updated Fingerprint class here. The main change has been to add a single line of code that tells Fingerprint that it's now associated with a class called Compound:

  has_many :compounds

All that remains is to bring the Compound class into being:

require 'rubygems'
require 'active_record'
require 'fingerprint'

ActiveRecord::Base.establish_connection(
  :adapter    => 'mysql',
  :host       => 'localhost',
  :username   =>  'root',
  :password   =>  '',
  :database   =>  'compounds'
)

class Compound < ActiveRecord::Base
  belongs_to :fingerprint
end

The belongs_to line is the counterpart to Fingerprint's has_many line. Together, both Fingerprint and Compound create a system in which each Fingerprint can reference multiple Compounds and each Compound references one Fingerprint.

Let's test this with interactive Ruby:

$ irb
irb(main):001:0> require 'fingerprint'
=> true
irb(main):002:0> f=Fingerprint.find 1
=> #<Fingerprint id: 1, byte0: 0, byte1: 0, byte2: 0, byte3: 0, byte4: 0, byte5: 0, byte6: 0, byte7: 0, byte8: 0, byte9: 0, byte10: 0, byte11: 0, byte12: 0, byte13: 0, byte14: 0, byte15: 0>
irb(main):003:0> f.compounds
=> [#<Compound id: 1, fingerprint_id: 1, smiles: "c1ccccc1">, #<Compound id: 2, fingerprint_id: 1, smiles: "c1ccccc1Br">]

Looks good. Our code has made the correct association between a Fingerprint and its Compounds. What about the other way around?

$ irb
irb(main):001:0> require 'compound'
=> true
irb(main):002:0> c=Compound.find 1
=> #<Compound id: 1, fingerprint_id: 1, smiles: "c1ccccc1">
irb(main):003:0> c.fingerprint
=> #<Fingerprint id: 1, byte0: 0, byte1: 0, byte2: 0, byte3: 0, byte4: 0, byte5: 0, byte6: 0, byte7: 0, byte8: 0, byte9: 0, byte10: 0, byte11: 0, byte12: 0, byte13: 0, byte14: 0, byte15: 0>

As expected, the first Compound became associated with the correct Fingerprint.

Conclusions

Our system can now store and query molecular fingerprints in a relational database. It also associates multiple compounds with each fingerprint.

We have a complete fingerprint screening system, but not a substructure search system.

What's missing? For one thing, we'd need a way to perform atom-by-atom searches (ABAS) of all candidate structures after the fingerprint screening process is complete. Recall that just because a query fingerprint matches a candidate fingerprint doesn't necessarily mean that a substructure match has been found.

We'd also need a way to conveniently get real compounds with real fingerprints into our database. Only then would we be able to test the chemical validity of substructure queries.

The remaining articles in this series will discuss approaches to each of these requirements.

Image Credit: leeechy