Fast Substructure Search Using Open Source Tools Part 2: Fingerprint Screen With SQL 23

Posted by Rich Apodaca Fri, 03 Oct 2008 14:47:00 GMT

The previous article in this series discussed the configuration of a MySQL database for fast substructure search with binary fingerprints. This article first shows how to populate this database with real fingerprint data for two molecules. Then it shows how to formulate standard SQL queries to screen the database for substructures.

All articles in this series:

Creating the Fingerprints with Open Babel

The babel command line utility will, among it many conversions, return a fingerprint when given a valid SMILES string. For example, we can create the fingerprint for benzene like this:

$ babel -ismi -ofpt
c1ccccc1
>   6 bits set 
00000000 00000000 00000000 00000200 00000000 00000000 
00000000 00000000 00000000 00000840 00000000 00008000 
00000000 00000000 00000000 00000000 00000000 00000000 
00000000 00000000 00000000 08000000 00000000 00000000 
00000000 00000000 00000000 00000000 00000000 00020000 
00000000 00000000 
1 molecule converted
12 audit log messages

Similarly, we create the fingerprint for phenol like this:

$ babel -ismi -ofpt
c1ccccc1O 
>   12 bits set 
00000000 00000008 20000000 00000200 00000000 00000000 
02000000 00000000 00000000 00000840 00000000 00008000 
00000002 00000000 00000000 00000008 00000000 00000000 
00000000 00020000 00000000 08000000 00000000 00000000 
00000000 00000000 00000000 00000000 00000000 00020000 
00000000 00000000 
1 molecule converted
19 audit log messages

The exact meaning of these fingerprints is interesting, but not relevant. Without getting into the details of the Open Babel fingerprint formats, which are discussed in detail elsewhere, the output contains the binary fingerprint of each molecule as an array of 32-bit hexadecimal numbers.

Adding Fingerprints to the Database

To use Open Babel's fingerprints with out database, we need to convert the 32-bit hexadecimal numerical output to 64-bit decimal format. This is not difficult and most programming environments make this very simple. For example, the following Ruby code will convert the third and fourth 32-bit hexadecimal numbers in the benzene fingerprint into a 64-bit decimal number:

$ irb
irb(main):001:0> "0000000000000200".hex
=> 512

Performing this conversion for every pair of 32-bit hex numbers in each fingerprint gives a set of numbers we can place directly into our database:

mysql> # Add benzene decimal fingerprint.
mysql> insert into fingerprints
(fp0,fp1,fp2,fp3,fp4,fp5,fp6,fp7,fp8,fp9,fp10,fp11,fp12,fp13,fp14,fp15)
values
(0, 512, 0, 0, 2112, 32768, 0, 0, 0, 0, 134217728, 0, 0, 0, 131072, 0);

Similarly,

mysql> # Add phenol decimal fingerprint.
mysql> insert into fingerprints
(fp0,fp1,fp2,fp3,fp4,fp5,fp6,fp7,fp8,fp9,fp10,fp11,fp12,fp13,fp14,fp15)
values
(8, 2305843009213694464, 0, 144115188075855872, 2112, 32768, 8589934592, 8, 0, 131072, 134217728, 0, 0, 0, 131072, 0);

Our table is now ready to be queried:

mysql> select * from fingerprints;
+----+------+---------------------+------+--------------------+------+-------+------------+------+------+--------+-----------+------+------+------+--------+------+
| id | fp0  | fp1                 | fp2  | fp3                | fp4  | fp5   | fp6        | fp7  | fp8  | fp9    | fp10      | fp11 | fp12 | fp13 | fp14   | fp15 |


+----+------+---------------------+------+--------------------+------+-------+------------+------+------+--------+-----------+------+------+------+--------+------+
|  1 |    0 |                 512 |    0 |                  0 | 2112 | 32768 |          0 |    0 |    0 |      0 | 134217728 |    0 |    0 |    0 | 131072 |    0 | 
|  2 |    8 | 2305843009213694464 |    0 | 144115188075855872 | 2112 | 32768 | 8589934592 |    8 |    0 | 131072 | 134217728 |    0 |    0 |    0 | 131072 |    0 | 
+----+------+---------------------+------+--------------------+------+-------+------------+------+------+--------+-----------+------+------+------+--------+------+
2 rows in set (0.00 sec)

Querying the Database

With a table of fingerprints in hand, we can begin formulating queries. To do so, we'll use MySQL's built-in support for binary arithmetic.

A molecule with fingerprint A can represent a substructure of another molecule with fingerprint B if all of the bits in B are also present in A. Mathematically, we'd say that:

Bi & Ai = Bi

for all bits i in A and B.

Let's say we have a two-bit fingerprint consisting of 01 and 11 (binary) in our database. We can use MySQL to test whether the molecule from which the second fingerprint was derived could be a substructure of the molecule from which the first fingerprint was derived with this syntax:

mysql> select 1&3;
+-----+
| 1&3 |
+-----+
|   1 | 
+-----+
1 row in set (0.00 sec)

The answer is yes, there could be a substructure match because 1&3 = 1.

We're now ready to perform our first substructure screen using SQL. This consists of selecting all rows for which each of the 16 fingerprint components, when anded together with a query fingerprint component, gives back the original component.

To see if phenol is a substructure of benzene, we could use the following:

mysql> select id from fingerprints where fp0&0=0 and fp1&512=512 and fp2&0=0 and fp3&0=0 and fp4&2112=2112 and fp5&32768=32768 and fp6&0=0 and fp7&0=0 and fp8&0=0 and fp9&0=0 and fp10&134217728=134217728 and fp11&0=0 and fp12&0=0 and fp13&0=0 and fp14&131072=131072 and fp15&0=0;
+----+
| id |
+----+
|  1 | 
|  2 | 
+----+
2 rows in set (0.00 sec)

Our query results are telling us that phenol is both a substructure of benzene and itself, as expected.

Conclusions

We now have a database populated with two molecules represented as fingerprints. We can even scan the database for possible substructure matches using nothing more than standard SQL queries. Nevertheless, we've had to use a lot of manual coding to convert hex into decimal and create SQL. We need a library to do this mundane work for us. The next article in this series will discuss a better approach using Ruby.

Image Credit: dopeylok

Comments

Leave a response

  1. Oleg (OUrsu@salud.unm.edu) Fri, 03 Oct 2008 17:26:09 GMT

    One important performance aspect regarding fingerprints, you will have to check that bits in the fingerprints you use have a balanced overlap over large molecular libraries. I assume than you will scan those fingerprints with tanimoto to retrieve the structures from databases which pass a certain threshold and do a more accurate substructure search? If this is the scenario than having a large overlap between fingerprints will reduce greatly the performance of search. It is not a trivial task to come up with good quality fingerprints.

    Oleg.

  2. Rajarshi Sat, 04 Oct 2008 14:11:52 GMT

    It is also useful to see the work of Baldi, which describes some nice heuristics to define upper bounds on the similarities based on the number of bits set. Of course, it's only useful if the desired similarity is quite high (say > 0.8).

    http://pubs.acs.org/cgi-bin/abstract.cgi/jcisd8/2007/47/i02/abs/ci600358f.html

    We've implemented this on our local version of PubChem with the MACCS keys - and it works quite nicely for queries with a high similarity cutoff, greatly reducing the number of rows to look at.

    Another nice thing is that when loading a FP, you can set a trigger to eval the number of 1's and put that in an int field. Which can be indexed.

    As a result the heuristic is simply a range query on the int field, which is very fast.

  3. Rajarshi Sat, 04 Oct 2008 14:21:00 GMT

    Also, why are you splitting the FP into int's? (I probably should have asked this on your previous post).

    For example, Postgres supports bit fields of arbitrary length. Have you considered looking at the performance hit/gain by going from a bit field to a set of integers? Of course the fact that you can index the integer fields, migh offset the complexity of the queries using those fields.

  4. Rich Apodaca Sat, 04 Oct 2008 16:38:47 GMT

    Rajarshi, I was going to add the Baldi optimization near the end of the series. I agree - it has a lot of potential and is a breeze to implement.

    You raise a very good question about why fingerprints are split into ints. I originally thought it would be simpler to see what what happening that way. But on second thought, it would be worth describing how to to get a fingerprint to be stored and queried from a single field in a table, even if the syntax might not be as portable.

  5. Christoph Steinbeck Sat, 04 Oct 2008 16:42:23 GMT

    Hi Rich, very useful and timely series :-) We are currently working on making a CDK-based Oracle cartridge for substructure searching and optimizing the fingerprints will be a first task. Second will be getting more sophisticated queries like R-Groups and variables x=[Br,Cl, but not F] to work.

  6. Rich Apodaca Sat, 04 Oct 2008 19:23:13 GMT

    Rajarshi, it sounds like you're suggesting the use of a BLOB or VARBINARY field in MySQL.

    AFAIK, bit functions in MySQL only take 64-bit integers and don't work on binary data. The advantage of the approach outlined here is that the rdbms does all of the binary fingerprint comparison.

    So... how would you get MySQL (or Posgress, etc.) to do binary arithmetic on binary fields of lengths of up to 1024?

  7. Rich Apodaca Sat, 04 Oct 2008 19:28:26 GMT

    Chris, sounds interesting and similar in intent to MyChem. What's a good link summarizing CDK's current fingerprint support?

  8. Rajarshi Sat, 04 Oct 2008 20:10:33 GMT

    Rich, regarding bit strings > 1024 bits, Postgres has a fields of type bit and bit varying, which takes arbitrary length bit strings. The first fixes the length of the bit string, the second allows variable length bit string.

    Once you have that you can use &, | etc to do direct bitwise operations on the field. Example

    CREATE TABLE test (a BIT(3) ); insert into test values (B'101'); insert into test values (B'000'); insert into test values (B'111'); select B'000' & a from test; select B'000' | a from test;

    But you're right, this won't port directly to MySQL. But the code is much easier to write and read :)

  9. Rich Apodaca Sun, 05 Oct 2008 01:03:46 GMT

    Rajarshi, thanks for the useful info. I wonder which other open source dbs have that feature...

  10. Rajarshi Sun, 05 Oct 2008 05:10:17 GMT

    Well Derby does support it (but not as simply as Postgres, IMO). Sqlite does not and I think Firebird does

  11. Wolf Ihlenfeldt Sun, 05 Oct 2008 09:58:06 GMT

    Some of you may be interested to learn that the algorithms described in the work of Baldi et al. were in nearly identical form developed independently by us at PubChem and are being used there.

    The original PubChem fingerprints can be downloaded from the PubChem ftp site, or generated by the free academic version of the Cactvs toolkit for arbitrary molecules.

  12. Kris Sun, 05 Oct 2008 16:45:43 GMT

    Great set of posts. I would love to contribute and use a database based backend to the open babel fastsearch.

  13. Rich Apodaca Mon, 06 Oct 2008 01:59:09 GMT

    Wolf, when might we see a publication of some kind describing the design, implementation, and optimizations behind PubChem?

    I remember seeing the documentation on the PubChem fingerprint system, which is an alternative to the Open Babel "FP2" all-path based systems used here.

    Kris, what kind of substructure search system does ChemSink currently use?

  14. Kris Mon, 06 Oct 2008 14:20:00 GMT

    Rich, right now I am using open babel to generate a fast search index file from a list of smiles strings which are associated with the id's in the database. Then I can just use openbabel's similarity screen and substructure search to perform queries on the fly and grab the associated data from the db.

    Just to be clear, the file which I use to generate the index looks like this:

    smiles_string id CCC 1 CC=O 2 CO 3 etc

  15. Ernst-Georg Schmid Tue, 07 Oct 2008 09:38:16 GMT

    Hi,

    the next Version of pgchem::tigress now uses binary fingerprints also. It is not released yet, because it was kind of a major rewrite and I'm ironing out the last glitches, but if somebody wants to take a look at the code how a domain index based on OB FP2 and FP3 fingerprints can be built, or dares to try it out :-) please contact me.

    And many thanks for the pointer to the Baldi paper - there's always something new to learn...

  16. Andrew Dalke Wed, 08 Oct 2008 09:24:57 GMT

    I'm working on a new set of essays related to fast fingerprint searches. I implemented the Baldi optimizations (though like Wolf said, Baldi isn't the first one to implement those, at least not for the single compound search stage) and found that except for near similarities (Rajarshi's suggestion of >0.8 sounds right - my test case was 0.72), it wasn't faster than doing a modified linear search.

    I think a problem is the caching behavior. Jumping around according to the Baldo algorithm means the chip and OS can't easily guess what to prefetch.

    But my test set was only 250,000 compounds and I haven't checked how it would scale for 20M-compound sized systems. Hence why I haven't written those essays.

    By "modified linear search" I mean start from the bit 0 and scan forward until reaching a bit count that's too high for the given best hit.

    Also, the Baldi algorithm requires a sort using intermediate storage. Since the graph is monotonic decreasing going away from the center point, a better algorithm when thinking about this from an API standpoint is to do a merge sort of two already sorted inputs. This doesn't require a malloc and doesn't have the potential O(nbits log nbits) overhead - though as Baldi says, that's small and a one-time cost.

    Anyway, will be writing more about this in a bit.

  17. Rich Apodaca Sat, 11 Oct 2008 15:12:58 GMT

    Ernst-Georg, sounds great. Do you know of any public-facing databases using pgchem::tigress?

  18. Rich Apodaca Sat, 11 Oct 2008 15:30:56 GMT

    Andrew, I look forward to your next installment.

    The fingerprint screening stage is, in many cases, the fast part of a search. The subsequent atom-by-atom search (ABAS) is the rate-limiting step for fingerprint screens returning thousands of candidates.

    Optimizing (or eliminating) the ABAS should result in the biggest increases in performance for these kinds of queries (such as benzene or methanol as a SSS).

    Will your next set of essays touch on this?

  19. Andrew Dalke Mon, 13 Oct 2008 07:32:03 GMT

    The fingerprint work I'm doing is for similarity searches, so there is no next step.

    I hadn't heard the term "ABAS" before, and I don't think I agree with it. If a query is exactly the same as a substructure key then the search code should figure out that there's no need for additional subgraph matching. Of course, doing that test elegantly probably requires some canonical SMARTS form. :)

    Since you're using the fingerprints for screening, there are a couple of other ways to speed things up, but they aren't I think possible in your MySQL solution. Some of the words in the fingerprint are more selective than others, and a simple reordering so that that word is checked first gives about a 10% performance boost. Here's my essay about it.

    Another option is to do the inverted index. If ids[i] contains a set of identifiers with bit i set, etc. then intersection(ids[i] where i is set in the query fp) gives the list of compounds which pass the screen.

    The downside is that storing the fingerprints this way makes for very slow similarity searches, so either there's two data sets to manage or just go ahead and search the fingerprints directly. As you say, it is fast.

    The hard part is coming up with good substructure screens. I can ask "how does OpenBabel's -fp2 option compare to PubChem's CACTVS substructure fingerprints" but without a good idea of the types of queries people do, I can't tell you which is better.

    I would like to see a list of the top, say, 1,000 queries of pubchem and something showing the distribution of query vs. number of times it's used as a query. Or perhaps a random selection of 10,000 queries that PubChem gets in a year. Taking care to minimize privacy concerns, of course.

  20. Ernst-Georg Schmid Mon, 13 Oct 2008 10:10:10 GMT

    Rich,

    currently http://www.chemcollect.de -> catalog -> search by structure and -> search by substance class uses it, but the released version with checkmol/matchmol generated integer fingerprints.

  21. Wiraj Bibile Wed, 15 Oct 2008 10:50:07 GMT

    Hi I am currently working a substructure search algorithm using finger prints (I am using the CDK package). My algorithm uses precompiled finger prints placed in a database, which is matched against a query finger print. If the fingerprint query succeeds then a more complete test is done to make sure that the query chemical is a substructure of the matched chemical (from the database). this second test is low in performance reason why a finger print test is done initially.

    However when using common chemicals in my query the finger print match succeeds more often and fails at the second stage. Therefore taking longer time to run the query, do you know of any way I can improve fingerprint matching for common structures?

  22. Rich Apodaca Wed, 15 Oct 2008 16:15:39 GMT

    Wiraj, good question. I've got some ideas of my own, but I'll be putting those together in a later post.

    In the meantime, it might be worth starting a thread dedicated to this topic and see what turns up from readers around the Web...

  23. Rajarshi Thu, 16 Oct 2008 02:22:38 GMT

    Wiraj, maybe you need a finer grained fingerprint? But in the end if the query is small (and common) enough, then there's a high chance that it's FP bits will be a subset of a large fraction of the database.

Comments