Substructure Search From Scratch in Java Part 1: The Atom Mapping Problem
One of the most important capabilities in cheminformatics is mapping the atoms of a query structure onto the atoms of a target structure. Although useful in itself, the main value of atom mapping comes from the software that gets built on top of it: exact structure comparators, substructure search systems, and query atom/bond search systems such as SMARTS. The fundamental nature of atom mapping means that correctness, efficiency and adaptability are essential features of a good mapping implementation. Recently, a D-F article made the case that atom mapping software written in Java needs to be Java-centric to achieve these goals. This article, the first in a series that describes a complete substructure search system written in Java, takes the first step by offering some simple interface definitions and code for the atom mapping problem.
The Problem
Given a query molecule (query) and a target molecule (target), our atom mapping software needs to find ways to match the atoms of query onto target such that the mapping describes a substructure embedded in target. The software might stop at one mapping, continue on to find all of them, or stop at some point in the middle. It all depends on the specific cheminformatics problem we're trying to solve.
The Recursive Function
Our implementation will gradually build up an atom mapping by traversing the atoms of query in depth-first order and trying to map each found atom onto an atom in target. At each step in the process, we will have a partial atom map that maps some of the atoms in query onto target. That map, and any other information needed to complete the analysis will be kept in an instance of a class implementing the State interface.
A State will be manipulated by a recursive method, mapFirst that returns when the first atom map is found:
// create a list to hold atom maps
List<Map<Atom, Atom>> maps = new ArrayList<Map<Atom, Atom>>();
// create initial state
State state = ...;
boolean mapFirst(State state)
{
if (state.isDead())
{
return false;
}
if (state.isGoal())
{
maps.add(state.getMap());
return true;
}
boolean found = false;
while (!found && state.hasNextCandidate())
{
Match candidate = state.nextCandidate();
if (state.isMatchFeasible(candidate))
{
State nextState = state.nextState(candidate);
found = mapFirst(nextState);
nextState.backTrack();
}
}
return found;
}Comparison of the mapFirst method to the pseudocode VF algorithm Match procedure given in the previous article shows some similarities. In fact, something similar to the mapFirst method forms the basis of many atom mappers in use today.
Although it may be clear from the code, it's worth re-iterating that each time mapFirst is recursively called, an attempt is made to branch off a new State that maps an additional pair of atoms from query to target. If that branch leads to a possible solution, it's followed. Otherwise the next possible mapping is explored.
The State Interface
The recursive mapFirst method determines all of the methods the State interface needs to define:
public interface State
{
/**
* Returns the current mapping of query atoms onto target atoms.
* This map is shared among all states obtained through nextState.
*/
public Map<Atom, Atom> getMap();
/**
* Returns true if another candidate match can be found or
* false otherwise.
*/
public boolean hasNextCandidate();
/**
* Returns the next candidate match.
*/
public Match nextCandidate();
/**
* Returns true if the given match will work with the current
* map, or false otherwise.
*/
public boolean isMatchFeasible(Match match);
/**
* Returns true if all atoms in the query molecule have been
* mapped.
*/
public boolean isGoal();
/**
* Returns true if no match will come from this State.
*/
public boolean isDead();
/**
* Returns a state in which the atoms in match have been
* added to the current mapping.
*/
public State nextState(Match match);
/**
* Returns this State's atom map to its original condition.
*/
public void backTrack();
}Finally, State uses an instance of the Match class:
public class Match
{
private Atom query;
private Atom target;
public Match(Atom query, Atom target)
{
this.query = query;
this.target = target;
}
public Atom getQueryAtom()
{
return query;
}
public Atom getTargetAtom()
{
return target;
}
}Conclusions
With just a few lines of Java, we've managed to reduce the fundamental cheminformatics problem of atom mapping to the far simpler problem of implementing the State interface.
How many ways are there to implement the State interface? Probably as many as there are subgraph isomorphism algorithms. Notice that the way we've set up the problem lets us use the same recursive method to test all State implementations, an essential prerequisite for benchmarking and optimization.
Future articles in this series will describe one way to implement the State interface.
Image Credit: Dollar Bin
Fast Substructure Search Using Open Source Tools Part 6: Modelling a One-To-Many Relationship Between Fingerprints and Compounds in Ruby
We can think of a fingerprint as a bucket into which every molecule in the universe can be reproducibly placed. Each molecule will belong to a single bucket, but each bucket may contain any number of molecules. In other words, there exists a one-to-many relationship between a fingerprint and its associated molecules. The previous article in this series discussed how to model this relationship using SQL. This article will take the idea one step further by describing one way to model this relationship in Ruby.
All Articles in this Series:
- Part 1: Fingerprints and Databases
- Part 2: Fingerprint Screen With SQL
- Part 3: A CRUD API for Fingerprints in Ruby
- Part 4: Creating Fingerprints from Chemical Structures
- Part 5: Relating Molecules to Fingerprints with SQL
- Part 6: Modelling a One-To-Many Relationship Between Fingerprints and Compounds in Ruby
SQL Recap
So far, we've set up a fingerprints database:
mysql> describe fingerprints; +--------+---------------------+------+-----+---------+----------------+ | Field | Type | Null | Key | Default | Extra | +--------+---------------------+------+-----+---------+----------------+ | id | int(11) | NO | PRI | NULL | auto_increment | | byte0 | bigint(64) unsigned | YES | | 0 | | | byte1 | bigint(64) unsigned | YES | | 0 | | | byte2 | bigint(64) unsigned | YES | | 0 | | | byte3 | bigint(64) unsigned | YES | | 0 | | | byte4 | bigint(64) unsigned | YES | | 0 | | | byte5 | bigint(64) unsigned | YES | | 0 | | | byte6 | bigint(64) unsigned | YES | | 0 | | | byte7 | bigint(64) unsigned | YES | | 0 | | | byte8 | bigint(64) unsigned | YES | | 0 | | | byte9 | bigint(64) unsigned | YES | | 0 | | | byte10 | bigint(64) unsigned | YES | | 0 | | | byte11 | bigint(64) unsigned | YES | | 0 | | | byte12 | bigint(64) unsigned | YES | | 0 | | | byte13 | bigint(64) unsigned | YES | | 0 | | | byte14 | bigint(64) unsigned | YES | | 0 | | | byte15 | bigint(64) unsigned | YES | | 0 | | +--------+---------------------+------+-----+---------+----------------+ 17 rows in set (0.00 sec)
This database contains a single (empty) fingerprint:
mysql> select * from fingerprints; +----+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+--------+--------+--------+--------+ | id | byte0 | byte1 | byte2 | byte3 | byte4 | byte5 | byte6 | byte7 | byte8 | byte9 | byte10 | byte11 | byte12 | byte13 | byte14 | byte15 | +----+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+--------+--------+--------+--------+ | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | +----+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+--------+--------+--------+--------+ 1 row in set (0.00 sec)
We've also set up a compounds database containing a foreign key (fingerprint_id) into the fingerprints table:
mysql> describe compounds; +----------------+---------+------+-----+---------+----------------+ | Field | Type | Null | Key | Default | Extra | +----------------+---------+------+-----+---------+----------------+ | id | int(11) | NO | PRI | NULL | auto_increment | | fingerprint_id | int(11) | YES | | NULL | | | smiles | text | YES | | NULL | | +----------------+---------+------+-----+---------+----------------+ 3 rows in set (0.00 sec)
In this hypothetical example, the compounds database is populated by two molecules, benzene and bromobenzene, both of which share the same fingerprint:
mysql> select * from compounds; +----+----------------+------------+ | id | fingerprint_id | smiles | +----+----------------+------------+ | 1 | 1 | c1ccccc1 | | 2 | 1 | c1ccccc1Br | +----+----------------+------------+ 2 rows in set (0.00 sec)
Adding the Ruby Layer
In Part 3, we created a CRUD API for fingerprints in Ruby. We now need to modify the class we created there, Fingerprint, to make it aware of the Compounds it will be associated with.
For brevity, you can view the updated Fingerprint class here. The main change has been to add a single line of code that tells Fingerprint that it's now associated with a class called Compound:
has_many :compoundsrequire 'rubygems'
require 'active_record'
require 'fingerprint'
ActiveRecord::Base.establish_connection(
:adapter => 'mysql',
:host => 'localhost',
:username => 'root',
:password => '',
:database => 'compounds'
)
class Compound < ActiveRecord::Base
belongs_to :fingerprint
end$ irb irb(main):001:0> require 'fingerprint' => true irb(main):002:0> f=Fingerprint.find 1 => #<Fingerprint id: 1, byte0: 0, byte1: 0, byte2: 0, byte3: 0, byte4: 0, byte5: 0, byte6: 0, byte7: 0, byte8: 0, byte9: 0, byte10: 0, byte11: 0, byte12: 0, byte13: 0, byte14: 0, byte15: 0> irb(main):003:0> f.compounds => [#<Compound id: 1, fingerprint_id: 1, smiles: "c1ccccc1">, #<Compound id: 2, fingerprint_id: 1, smiles: "c1ccccc1Br">]
Looks good. Our code has made the correct association between a Fingerprint and its Compounds. What about the other way around?
$ irb irb(main):001:0> require 'compound' => true irb(main):002:0> c=Compound.find 1 => #<Compound id: 1, fingerprint_id: 1, smiles: "c1ccccc1"> irb(main):003:0> c.fingerprint => #<Fingerprint id: 1, byte0: 0, byte1: 0, byte2: 0, byte3: 0, byte4: 0, byte5: 0, byte6: 0, byte7: 0, byte8: 0, byte9: 0, byte10: 0, byte11: 0, byte12: 0, byte13: 0, byte14: 0, byte15: 0>
As expected, the first Compound became associated with the correct Fingerprint.
Conclusions
Our system can now store and query molecular fingerprints in a relational database. It also associates multiple compounds with each fingerprint.
We have a complete fingerprint screening system, but not a substructure search system.
What's missing? For one thing, we'd need a way to perform atom-by-atom searches (ABAS) of all candidate structures after the fingerprint screening process is complete. Recall that just because a query fingerprint matches a candidate fingerprint doesn't necessarily mean that a substructure match has been found.
We'd also need a way to conveniently get real compounds with real fingerprints into our database. Only then would we be able to test the chemical validity of substructure queries.
The remaining articles in this series will discuss approaches to each of these requirements.
Image Credit: leeechy
Fast Substructure Search Using Open Source Tools Part 4: Creating Fingerprints from Chemical Structures
The previous articles in this series have detailed the steps needed to build a working fingerprint screening system using nothing more than the open source tools MySQL, Ruby, and ActiveRecord. With this system we can create, read, update, and destroy fingerprints in persistent storage. Although the system meets all of the requirements of a fingerprint screening system, it isn't a substructure search system - yet. For that, we need a way to convert chemical structure representations into fingerprints. This article describes a very simple method for doing so.
All Articles in this Series:
- Part 1: Fingerprints and Databases
- Part 2: Fingerprint Screen With SQL
- Part 3: A CRUD API for Fingerprints in Ruby
- Part 4: Creating Fingerprints from Chemical Structures
- Part 5: Relating Molecules to Fingerprints with SQL
- Part 6: Modelling a One-To-Many Relationship Between Fingerprints and Compounds in Ruby
A Ruby Fingerprinter in Eight Lines
Let's create a Fingerprinter class that's capable of converting a SMILES string into a Fingerprint that can be stored and queried. The Ruby code below makes use of Open Babel's babel command-line utility:
require 'fingerprint'
class Fingerprinter
def fingerprint_smiles smiles
raw = %x[echo '#{smiles}' | babel -ismi -ofpt 2>/dev/null]
bytes = raw.gsub(/>.*?\n/, '').gsub(/\n/, '').split
Fingerprint.new.fill_bytes{|i| "#{bytes[2*i]}#{bytes[2*i+1]}".hex}
end
endThis class takes advantage of Ruby's ability to interface directly with the command line through the %x operator in a way similar to that previously described for the cInChI command line tool. The babel output is then converted into a form suitable for use with our previously-defined Fingerprint class.
Although quite easy to implement, this approach may not work in every situation. For example, the fingerprint_smiles method opens the possibility that a malicious user could attempt to execute arbitrary shell commands by creating a mis-formed SMILES string. Windows users may need to adapt the code. But for trusted SMILES on Unix machines, this implementation works well and can be used in many different programming environments.
Testing the Fingerprinter
We can test the Fingerprinter through interactive Ruby (irb):$ irb irb(main):001:0> require 'lib/fingerprinter' => true irb(main):002:0> fp=Fingerprinter.new => #<Fingerprinter:0xb7498038> irb(main):003:0> f=fp.fingerprint_smiles 'c1ccccc1' => #<Fingerprint id: nil, byte0: 0, byte1: 512, byte2: 0, byte3: 0, byte4: 2112, byte5: 32768, byte6: 0, byte7: 0, byte8: 0, byte9: 0, byte10: 134217728, byte11: 0, byte12: 0, byte13: 0, byte14: 131072, byte15: 0, hex: nil> irb(main):004:0> f.cardinality => 6 irb(main):005:0> f.bitstring => "0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010000100000000000000000000000000000000000000000000000000000010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000000000000000000"
As we previously saw, any Fingerprint we create can be stored and later retrieved from a MySQL database. If we've already stored the fingerprint for benzene it can be found with the following:
$ irb irb(main):001:0> require 'lib/fingerprinter' => true irb(main):002:0> fp=Fingerprinter.new => #<Fingerprinter:0xb74ae284> irb(main):003:0> f=fp.fingerprint_smiles 'c1ccccc1' => #<Fingerprint id: nil, byte0: 0, byte1: 512, byte2: 0, byte3: 0, byte4: 2112, byte5: 32768, byte6: 0, byte7: 0, byte8: 0, byte9: 0, byte10: 134217728, byte11: 0, byte12: 0, byte13: 0, byte14: 131072, byte15: 0, hex: nil> irb(main):004:0> Fingerprint.find_by_fingerprint f => #<Fingerprint id: 12687, byte0: 0, byte1: 512, byte2: 0, byte3: 0, byte4: 2112, byte5: 32768, byte6: 0, byte7: 0, byte8: 0, byte9: 0, byte10: 134217728, byte11: 0, byte12: 0, byte13: 0, byte14: 131072, byte15: 0, hex: "000000000000000000000000000002000000000000000000000...">
Conclusions
We now have the ability to create, store, and query fingerprints created from arbitrary SMILES strings. If there were a 1:1 relationship between molecules and fingerprints, we'd be nearly done. But things are not quite that simple. The next article in this series will show how to relate molecules to fingerprints.
Image Credit: adrenalin
Fast Substructure Search Using Open Source Tools Part 3: A CRUD API for Fingerprints in Ruby
The previous article in this series showed how to perform fingerprint screens for substructure searches using nothing more than SQL. Although this is significant progress, working at the level of SQL queries to perform create, read, update, and delete operations (CRUD) on our fingerprint table is more work than it needs to be. We'd really prefer to use an API written in a high-level programming language. This article describes a simple Ruby API for managing and querying a database of molecular fingerprints.
All Articles in this Series:
- Part 1: Fingerprints and Databases
- Part 2: Fingerprint Screen With SQL
- Part 3: A CRUD API for Fingerprints in Ruby
- Part 4: Creating Fingerprints from Chemical Structures
- Part 5: Relating Molecules to Fingerprints with SQL
- Part 6: Modelling a One-To-Many Relationship Between Fingerprints and Compounds in Ruby
Some Changes to the Database Schema
Before we move forward, we must deal with one minor detail. By default MySQL uses signed 64-bit integers. This gives a range for integers of -9223372036854775808 to 9223372036854775807. Ruby, on the other hand, can work with integers of any size through the Bignum class - and we'll be taking full advantage of this feature.
If we want to avoid the headache of constantly accounting for the difference, we need to tell our database to use unsigned integers in the fingerprints table. This can be done by first dropping the old table:
mysql> drop table fingerprints; Query OK, 0 rows affected (0.00 sec)
Now let's create a new table in which the fpn columns store unsigned integers only. While we're at it, let's change the naming of these columns from fpn to the more descriptive byten and set a default value of zero.
The new table can be created with with:
mysql> create table fingerprints(id int not null auto_increment, primary key(id), byte0 bigint(64) unsigned default 0, byte1 bigint(64) unsigned default 0, byte2 bigint(64) unsigned default 0, byte3 bigint(64) unsigned default 0, byte4 bigint(64) unsigned default 0, byte5 bigint(64) unsigned default 0, byte6 bigint(64) unsigned default 0, byte7 bigint(64) unsigned default 0, byte8 bigint(64) unsigned default 0, byte9 bigint(64) unsigned default 0, byte10 bigint(64) unsigned default 0, byte11 bigint(64) unsigned default 0, byte12 bigint(64) unsigned default 0, byte13 bigint(64) unsigned default 0, byte14 bigint(64) unsigned default 0, byte15 bigint(64) unsigned default 0); Query OK, 0 rows affected (0.00 sec) mysql> describe fingerprints; +--------+---------------------+------+-----+---------+----------------+ | Field | Type | Null | Key | Default | Extra | +--------+---------------------+------+-----+---------+----------------+ | id | int(11) | NO | PRI | NULL | auto_increment | | byte0 | bigint(64) unsigned | YES | | 0 | | | byte1 | bigint(64) unsigned | YES | | 0 | | | byte2 | bigint(64) unsigned | YES | | 0 | | | byte3 | bigint(64) unsigned | YES | | 0 | | | byte4 | bigint(64) unsigned | YES | | 0 | | | byte5 | bigint(64) unsigned | YES | | 0 | | | byte6 | bigint(64) unsigned | YES | | 0 | | | byte7 | bigint(64) unsigned | YES | | 0 | | | byte8 | bigint(64) unsigned | YES | | 0 | | | byte9 | bigint(64) unsigned | YES | | 0 | | | byte10 | bigint(64) unsigned | YES | | 0 | | | byte11 | bigint(64) unsigned | YES | | 0 | | | byte12 | bigint(64) unsigned | YES | | 0 | | | byte13 | bigint(64) unsigned | YES | | 0 | | | byte14 | bigint(64) unsigned | YES | | 0 | | | byte15 | bigint(64) unsigned | YES | | 0 | | +--------+---------------------+------+-----+---------+----------------+ 17 rows in set (0.01 sec)
We're now ready to create the Ruby API.
The API
The code below is all we need to begin querying and managing our fingerprint database in Ruby:
require 'rubygems'
require 'active_record'
ActiveRecord::Base.establish_connection(
:adapter => 'mysql',
:host => 'localhost',
:username => 'root',
:password => '',
:database => 'compounds'
)
class Fingerprint < ActiveRecord::Base
@@bytes_prefix = "byte"
def each_byte
0.upto(byte_count - 1) {|i| yield send("#{@@bytes_prefix}#{i}") }
end
def each_byte_with_index
0.upto(byte_count - 1) {|i| yield send("#{@@bytes_prefix}#{i}"), i }
end
def fill_bytes
0.upto(byte_count - 1) {|i| send("#{@@bytes_prefix}#{i}=", yield(i))}
self
end
def to_byte_array
Array.new(16).fill{|i| send("#{@@bytes_prefix}#{i}")}
end
def byte_count
result = 0
while respond_to? "#{@@bytes_prefix}#{result}"
result += 1
end
result
end
def bitstring
result = ""
each_byte {|byte| result += sprintf("%064b", byte)}
result
end
def cardinality
bitstring.count("1")
end
def eql?(other)
to_byte_array.eql?(other.to_byte_array)
end
def save
return false unless Fingerprint.find_by_fingerprint(self).empty?
super
end
def self.find_by_fingerprint fingerprint
Fingerprint.find_by_sql sql_for_find_by_fingerprint(fingerprint)
end
def self.find_children_by_fingerprint fingerprint
Fingerprint.find_by_sql sql_for_find_children_by_fingerprint(fingerprint)
end
def self.sql_for_find_by_fingerprint fingerprint
result = "select fingerprints.* from fingerprints where "
last = fingerprint.byte_count - 1
fingerprint.each_byte_with_index do |byte, i|
result += "#{@@bytes_prefix}#{i}=#{byte}" + ((i ==last) ? "" : " and ")
end
result
end
def self.sql_for_find_children_by_fingerprint fingerprint
result = "select fingerprints.* from fingerprints where "
last = fingerprint.byte_count - 1
fingerprint.each_byte_with_index do |byte, i|
result += "#{@@bytes_prefix}#{i}&#{byte}=#{byte}" + ((i ==last) ? "" : " and ")
end
result
end
end
Testing the API
We can test this library from interactive ruby (irb). Let's add two fingerprints - the first consisting of all bits set to "1" and the second consisting of alternating "1" and "0" bits:
$ irb
irb(main):001:0> require 'fingerprint'
=> true
irb(main):002:0> f1=Fingerprint.new.fill_bytes{"ffffffffffffffff".hex}
=> #<Fingerprint id: nil, byte0: 18446744073709551615, byte1: 18446744073709551615, byte2: 18446744073709551615, byte3: 18446744073709551615, byte4: 18446744073709551615, byte5: 18446744073709551615, byte6: 18446744073709551615, byte7: 18446744073709551615, byte8: 18446744073709551615, byte9: 18446744073709551615, byte10: 18446744073709551615, byte11: 18446744073709551615, byte12: 18446744073709551615, byte13: 18446744073709551615, byte14: 18446744073709551615, byte15: 18446744073709551615>
irb(main):003:0> f1.save
=> true
irb(main):004:0> f2=Fingerprint.new.fill_bytes{"aaaaaaaaaaaaaaaa".hex}
=> #<Fingerprint id: nil, byte0: 12297829382473034410, byte1: 12297829382473034410, byte2: 12297829382473034410, byte3: 12297829382473034410, byte4: 12297829382473034410, byte5: 12297829382473034410, byte6: 12297829382473034410, byte7: 12297829382473034410, byte8: 12297829382473034410, byte9: 12297829382473034410, byte10: 12297829382473034410, byte11: 12297829382473034410, byte12: 12297829382473034410, byte13: 12297829382473034410, byte14: 12297829382473034410, byte15: 12297829382473034410>
irb(main):005:0> f2.save
=> true
Let's find the fingerprint in which all bits are turned on:
$ irb
irb(main):001:0> require 'fingerprint'
=> true
irb(main):002:0> query=Fingerprint.new.fill_bytes{"ffffffffffffffff".hex}
=> #<Fingerprint id: nil, byte0: 18446744073709551615, byte1: 18446744073709551615, byte2: 18446744073709551615, byte3: 18446744073709551615, byte4: 18446744073709551615, byte5: 18446744073709551615, byte6: 18446744073709551615, byte7: 18446744073709551615, byte8: 18446744073709551615, byte9: 18446744073709551615, byte10: 18446744073709551615, byte11: 18446744073709551615, byte12: 18446744073709551615, byte13: 18446744073709551615, byte14: 18446744073709551615, byte15: 18446744073709551615>
irb(main):003:0> Fingerprint.find_by_fingerprint query
=> [#<Fingerprint id: 111, byte0: 18446744073709551615, byte1: 18446744073709551615, byte2: 18446744073709551615, byte3: 18446744073709551615, byte4: 18446744073709551615, byte5: 18446744073709551615, byte6: 18446744073709551615, byte7: 18446744073709551615, byte8: 18446744073709551615, byte9: 18446744073709551615, byte10: 18446744073709551615, byte11: 18446744073709551615, byte12: 18446744073709551615, byte13: 18446744073709551615, byte14: 18446744073709551615, byte15: 18446744073709551615>]
Our query has found an exact match for the query fingerprint in the database at row 111. (This id is not 1 because previous automated tests that I wrote and executed have added and removed rows, advancing the id counter).
We can also search the database for the children of an arbitrary fingerprint query. A test fingerprint A is a "child" of query Q if all of the set bits in Q are also set in A. Notice that this leaves open the possibility that A has more bits set than Q. For example:
$ irb
irb(main):001:0> require 'fingerprint'
=> true
irb(main):002:0> query=Fingerprint.new.fill_bytes{"aaaaaaaaaaaaaaaa".hex}
=> #<Fingerprint id: nil, byte0: 12297829382473034410, byte1: 12297829382473034410, byte2: 12297829382473034410, byte3: 12297829382473034410, byte4: 12297829382473034410, byte5: 12297829382473034410, byte6: 12297829382473034410, byte7: 12297829382473034410, byte8: 12297829382473034410, byte9: 12297829382473034410, byte10: 12297829382473034410, byte11: 12297829382473034410, byte12: 12297829382473034410, byte13: 12297829382473034410, byte14: 12297829382473034410, byte15: 12297829382473034410>
irb(main):003:0> results = Fingerprint.find_children_by_fingerprint query
=> [#<Fingerprint id: 112, byte0: 12297829382473034410, byte1: 12297829382473034410, byte2: 12297829382473034410, byte3: 12297829382473034410, byte4: 12297829382473034410, byte5: 12297829382473034410, byte6: 12297829382473034410, byte7: 12297829382473034410, byte8: 12297829382473034410, byte9: 12297829382473034410, byte10: 12297829382473034410, byte11: 12297829382473034410, byte12: 12297829382473034410, byte13: 12297829382473034410, byte14: 12297829382473034410, byte15: 12297829382473034410>, #<Fingerprint id: 111, byte0: 18446744073709551615, byte1: 18446744073709551615, byte2: 18446744073709551615, byte3: 18446744073709551615, byte4: 18446744073709551615, byte5: 18446744073709551615, byte6: 18446744073709551615, byte7: 18446744073709551615, byte8: 18446744073709551615, byte9: 18446744073709551615, byte10: 18446744073709551615, byte11: 18446744073709551615, byte12: 18446744073709551615, byte13: 18446744073709551615, byte14: 18446744073709551615, byte15: 18446744073709551615>]
It worked - both fingerprints stored in the database were found.
We can delete a Fingerprint like this:
$ irb irb(main):001:0> require 'fingerprint' => true irb(main):002:0> f=Fingerprint.find 112 => #<Fingerprint id: 112, byte0: 12297829382473034410, byte1: 12297829382473034410, byte2: 12297829382473034410, byte3: 12297829382473034410, byte4: 12297829382473034410, byte5: 12297829382473034410, byte6: 12297829382473034410, byte7: 12297829382473034410, byte8: 12297829382473034410, byte9: 12297829382473034410, byte10: 12297829382473034410, byte11: 12297829382473034410, byte12: 12297829382473034410, byte13: 12297829382473034410, byte14: 12297829382473034410, byte15: 12297829382473034410> irb(main):003:0> f.destroy => #<Fingerprint id: 112, byte0: 12297829382473034410, byte1: 12297829382473034410, byte2: 12297829382473034410, byte3: 12297829382473034410, byte4: 12297829382473034410, byte5: 12297829382473034410, byte6: 12297829382473034410, byte7: 12297829382473034410, byte8: 12297829382473034410, byte9: 12297829382473034410, byte10: 12297829382473034410, byte11: 12297829382473034410, byte12: 12297829382473034410, byte13: 12297829382473034410, byte14: 12297829382473034410, byte15: 12297829382473034410> irb(main):004:0> Fingerprint.count => 1
Active Record and the Fingerprint API
The Fingerprint class is so concise because it takes advantage of the Ruby library called ActiveRecord. ActiveRecord is the object-relational mapping system used in Ruby on Rails. ActiveRecord can be used outside of Rails, as was done for this library, by including the code at the top of the file beginning with "ActiveRecord::Base.establish_connection...", where you'd use the parameters specific to your database.
We gain three key advantages with this approach: (1) we have very little SQL to code; (2) we have access to all of ActiveRecord's built-in CRUD operations such as counting records through Fingerprint.count and deleting Fingerprints with destroy without writing anything ourselves; and (3) we can easily integrate the Fingerprint class into any Ruby on Rails application.
Variations
At least two other Object-Ralational Mapping systems could be used from Ruby, DataMapper, and Sequel. The approach described here could be adapted to these other ORMS with minimal effort.
Conclusions
We now have a working fingerprint screening system built solely from open source components. MySQL houses the data and provides for highly-optimized queries. A concise Ruby API created with ActiveRecord now allows us to deal with our fingerprint database as a collection of objects in a high-level language. We can perform all CRUD operations without writing a line of SQL.
We've come a long way, but we're still not dealing with molecules. We previously saw how Open Babel can generate fingerprints with which we could, in principle, populate and query our database. The next article in this series will use this capability in creating a more chemically-aware system.
Image Credit: peerlingguy
Fast Substructure Search Using Open Source Tools Part 2: Fingerprint Screen With SQL 23
The previous article in this series discussed the configuration of a MySQL database for fast substructure search with binary fingerprints. This article first shows how to populate this database with real fingerprint data for two molecules. Then it shows how to formulate standard SQL queries to screen the database for substructures.
All articles in this series:
- Part 1: Fingerprints and Databases
- Part 2: Fingerprint Screen With SQL
- Part 3: A CRUD API for Fingerprints in Ruby
- Part 4: Creating Fingerprints from Chemical Structures
- Part 5: Relating Molecules to Fingerprints with SQL
- Part 6: Modelling a One-To-Many Relationship Between Fingerprints and Compounds in Ruby
Creating the Fingerprints with Open Babel
The babel command line utility will, among it many conversions, return a fingerprint when given a valid SMILES string. For example, we can create the fingerprint for benzene like this:
$ babel -ismi -ofpt c1ccccc1 > 6 bits set 00000000 00000000 00000000 00000200 00000000 00000000 00000000 00000000 00000000 00000840 00000000 00008000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 08000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00020000 00000000 00000000 1 molecule converted 12 audit log messages
Similarly, we create the fingerprint for phenol like this:
$ babel -ismi -ofpt c1ccccc1O > 12 bits set 00000000 00000008 20000000 00000200 00000000 00000000 02000000 00000000 00000000 00000840 00000000 00008000 00000002 00000000 00000000 00000008 00000000 00000000 00000000 00020000 00000000 08000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00020000 00000000 00000000 1 molecule converted 19 audit log messages
The exact meaning of these fingerprints is interesting, but not relevant. Without getting into the details of the Open Babel fingerprint formats, which are discussed in detail elsewhere, the output contains the binary fingerprint of each molecule as an array of 32-bit hexadecimal numbers.
Adding Fingerprints to the Database
To use Open Babel's fingerprints with out database, we need to convert the 32-bit hexadecimal numerical output to 64-bit decimal format. This is not difficult and most programming environments make this very simple. For example, the following Ruby code will convert the third and fourth 32-bit hexadecimal numbers in the benzene fingerprint into a 64-bit decimal number:
$ irb irb(main):001:0> "0000000000000200".hex => 512
Performing this conversion for every pair of 32-bit hex numbers in each fingerprint gives a set of numbers we can place directly into our database:
mysql> # Add benzene decimal fingerprint. mysql> insert into fingerprints (fp0,fp1,fp2,fp3,fp4,fp5,fp6,fp7,fp8,fp9,fp10,fp11,fp12,fp13,fp14,fp15) values (0, 512, 0, 0, 2112, 32768, 0, 0, 0, 0, 134217728, 0, 0, 0, 131072, 0);
Similarly,
mysql> # Add phenol decimal fingerprint. mysql> insert into fingerprints (fp0,fp1,fp2,fp3,fp4,fp5,fp6,fp7,fp8,fp9,fp10,fp11,fp12,fp13,fp14,fp15) values (8, 2305843009213694464, 0, 144115188075855872, 2112, 32768, 8589934592, 8, 0, 131072, 134217728, 0, 0, 0, 131072, 0);
Our table is now ready to be queried:
mysql> select * from fingerprints; +----+------+---------------------+------+--------------------+------+-------+------------+------+------+--------+-----------+------+------+------+--------+------+ | id | fp0 | fp1 | fp2 | fp3 | fp4 | fp5 | fp6 | fp7 | fp8 | fp9 | fp10 | fp11 | fp12 | fp13 | fp14 | fp15 | +----+------+---------------------+------+--------------------+------+-------+------------+------+------+--------+-----------+------+------+------+--------+------+ | 1 | 0 | 512 | 0 | 0 | 2112 | 32768 | 0 | 0 | 0 | 0 | 134217728 | 0 | 0 | 0 | 131072 | 0 | | 2 | 8 | 2305843009213694464 | 0 | 144115188075855872 | 2112 | 32768 | 8589934592 | 8 | 0 | 131072 | 134217728 | 0 | 0 | 0 | 131072 | 0 | +----+------+---------------------+------+--------------------+------+-------+------------+------+------+--------+-----------+------+------+------+--------+------+ 2 rows in set (0.00 sec)
Querying the Database
With a table of fingerprints in hand, we can begin formulating queries. To do so, we'll use MySQL's built-in support for binary arithmetic.
A molecule with fingerprint A can represent a substructure of another molecule with fingerprint B if all of the bits in B are also present in A. Mathematically, we'd say that:
Bi & Ai = Bi
for all bits i in A and B.
Let's say we have a two-bit fingerprint consisting of 01 and 11 (binary) in our database. We can use MySQL to test whether the molecule from which the second fingerprint was derived could be a substructure of the molecule from which the first fingerprint was derived with this syntax:
mysql> select 1&3; +-----+ | 1&3 | +-----+ | 1 | +-----+ 1 row in set (0.00 sec)
The answer is yes, there could be a substructure match because 1&3 = 1.
We're now ready to perform our first substructure screen using SQL. This consists of selecting all rows for which each of the 16 fingerprint components, when anded together with a query fingerprint component, gives back the original component.
To see if phenol is a substructure of benzene, we could use the following:
mysql> select id from fingerprints where fp0&0=0 and fp1&512=512 and fp2&0=0 and fp3&0=0 and fp4&2112=2112 and fp5&32768=32768 and fp6&0=0 and fp7&0=0 and fp8&0=0 and fp9&0=0 and fp10&134217728=134217728 and fp11&0=0 and fp12&0=0 and fp13&0=0 and fp14&131072=131072 and fp15&0=0; +----+ | id | +----+ | 1 | | 2 | +----+ 2 rows in set (0.00 sec)
Our query results are telling us that phenol is both a substructure of benzene and itself, as expected.
Conclusions
We now have a database populated with two molecules represented as fingerprints. We can even scan the database for possible substructure matches using nothing more than standard SQL queries. Nevertheless, we've had to use a lot of manual coding to convert hex into decimal and create SQL. We need a library to do this mundane work for us. The next article in this series will discuss a better approach using Ruby.
Image Credit: dopeylok
Older posts: 1 2

