<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="/stylesheets/rss.css"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/">
  <channel>
    <title>Depth-First: Fast Substructure Search Using Open Source Tools Part 1: Fingerprints and Databases</title>
    <link>http://depth-first.com/articles/2008/10/02/fast-substructure-search-using-open-source-tools-part-1-fingerprints-and-databases</link>
    <language>en-us</language>
    <ttl>40</ttl>
    <description>Walking the Web of Chemical Informatics</description>
    <item>
      <title>Fast Substructure Search Using Open Source Tools Part 1: Fingerprints and Databases</title>
      <description>&lt;p&gt;&lt;a href="http://flickr.com/photos/jaded/89717778/"&gt;&lt;img src="http://depth-first.com/demo/20081002/fingerprint.jpg" align="right"&gt;&lt;/img&gt;&lt;/a&gt;For anyone working in a chemistry-related job, chemical databases are ubiquitous. A printed list of IUPAC names, a spreadsheet containing &lt;a href="http://depth-first.com/articles/2008/05/26/simple-cas-number-lookup-and-more-with-chempedia"&gt;CAS numbers&lt;/a&gt;, and a set of hand-drawn structures on index cards are all primitive chemical databases. They aren't nearly as useful as they could be to either their creators or their collaborators, but they are databases nevertheless. Anyone who has spent time in industry or academia knows that these low-tech chemical databases are everywhere. And they become more of a problem as more information moves into electronic form.&lt;/p&gt;

&lt;p&gt;All articles in this series:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Part 1: Fingerprints and Databases&lt;/li&gt;
&lt;li&gt;&lt;a href="http://depth-first.com/articles/2008/10/03/fast-substructure-search-using-open-source-tools-part-2-fingerprint-screen-with-sql"&gt;Part 2: Fingerprint Screen With SQL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://depth-first.com/articles/2008/10/06/fast-substructure-search-using-open-source-tools-part-3-a-crud-api-for-fingerprints-in-ruby"&gt;Part 3: A CRUD API for Fingerprints in Ruby&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://depth-first.com/articles/2008/10/15/fast-substructure-search-using-open-source-tools-part-4-creating-fingerprints-from-chemical-structures"&gt;Part 4: Creating Fingerprints from Chemical Structures&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://depth-first.com/articles/2008/10/21/fast-substructure-search-using-open-source-tools-part-5-relating-molecules-to-fingerprints-with-sql"&gt;Part 5: Relating Molecules to Fingerprints with SQL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://depth-first.com/articles/2008/10/29/fast-substructure-search-using-open-source-tools-part-6-modelling-a-one-to-many-relationship-between-fingerprints-and-compounds-in-ruby"&gt;Part 6: Modelling a One-To-Many Relationship Between Fingerprints and Compounds in Ruby&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;The Problem: Structure Search is Hard&lt;/h4&gt;

&lt;p&gt;Many of the low-tech chemical databases that professional chemists routinely share and work with would become orders of magnitude more useful if they were converted into substructure-searchable databases and published to the Web. Although there has been a &lt;a href="http://depth-first.com/articles/2007/01/24/thirty-two-free-chemistry-databases"&gt;great deal of effort toward this end&lt;/a&gt; in the last few years, there's still much, much more that could be done.&lt;/p&gt;

&lt;p&gt;One of the main problems in creating a substructure-searchable chemical database is implementing the substructure search capability itself. This one requirement has done more to stifle the free flow of chemical information than perhaps any other. Solving the problem appears very difficult on first or second glance, and it is very difficult if you don't have the right tools. Many companies offer solutions - but at a price, both in terms of money and time, that is simply out of reach.&lt;/p&gt;

&lt;p&gt;What can you do if you're just getting started with modest requirements and budget?&lt;/p&gt;

&lt;h4&gt;About This Series&lt;/h4&gt;

&lt;p&gt;This article, the first in a series, will describe the creation of a chemical substructure search engine using exclusively well-maintained and robust open source tools: &lt;a href="http://openbabel.org"&gt;Open Babel&lt;/a&gt; for generating fingerprints and performing atom-by-atom searches; &lt;a href="http://mysql.com"&gt;MySQL&lt;/a&gt; as a relational database; and &lt;a href="http://ruby-lang.org"&gt;Ruby&lt;/a&gt; as a scripting language.&lt;/p&gt;

&lt;p&gt;Each of these three components is a commodity that can be replaced with any one of a number of open-source or proprietary substitutes, maximizing flexibility and minimizing vendor lock-in.&lt;/p&gt;

&lt;h4&gt;Other Resources&lt;/h4&gt;

&lt;p&gt;&lt;a href="http://merian.pch.univie.ac.at/pch/nh_info.html"&gt;Norbert Haider&lt;/a&gt; of the University of Vienna has written a very useful tutorial on &lt;a href="http://merian.pch.univie.ac.at/~nhaider/cheminf/moldb.html"&gt;creating a structure-searchable database using free tools&lt;/a&gt;, which is part of a &lt;a href="http://depth-first.com/articles/2007/04/13/roll-your-own-chemical-database-with-free-components"&gt;larger series&lt;/a&gt;. That series differs from this one in the technology stack used and the level of detail to be provided. The series of articles to appear here will spell out the low-level series of steps needed to create a working substructure search system. It's hoped that taking this perspective makes clear the steps needed to apply the approach to alternative technology platforms.&lt;/p&gt;

&lt;h4&gt;Binary Fingerprints and Relational Databases&lt;/h4&gt;

&lt;p&gt;At the heart of the system we'll build is the chemical fingerprint, which is a (usually) lossy binary representation of a chemical structure. Creating a binary fingerprint is like putting every chemical structure, known or unknown, into just one bin out of a very large but finite set of bins. Although the same molecule is guaranteed to always go into the same bin, more than one molecule can be placed into each bin. This is a general feature of all &lt;a href="http://en.wikipedia.org/wiki/Hash_function"&gt;hashing&lt;/a&gt; schemes.&lt;/p&gt;
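
&lt;p&gt;The bin analogy can be made concrete with a toy sketch in Ruby. Everything in it is an assumption made for illustration: the feature labels are hypothetical, the fingerprint is only 16 bits wide, and the byte-sum hash is not how Open Babel or any real toolkit computes fingerprints. It only shows that the same input always gives the same fingerprint, while two different inputs can share one.&lt;/p&gt;

&lt;div class="console"&gt;
&lt;pre&gt;
```ruby
# Toy illustration only: real fingerprints hash structural features of a
# molecule, and this byte-sum hash is an assumption made for the example.
FP_BITS = 16

# Map a feature label to one of FP_BITS bit positions.
def toy_bit(feature)
  feature.bytes.sum % FP_BITS
end

# Set one bit per feature and return the fingerprint as an integer.
def toy_fingerprint(features)
  features.reduce(0) { |fp, feature| fp | 2**toy_bit(feature) }
end

ether    = ["C-C", "C-O"]   # hypothetical feature labels
carbonyl = ["C-C", "C=O"]   # a different hypothetical feature set

# Print the fingerprint as a zero-padded bit string.
puts toy_fingerprint(ether).to_s(2).rjust(FP_BITS, "0")

# Two different feature sets can land in the same bin.
puts toy_fingerprint(ether) == toy_fingerprint(carbonyl)
```
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;With this particular toy hash, the labels "C-O" and "C=O" happen to set the same bit, so both feature sets end up in the same bin.&lt;/p&gt;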

&lt;p&gt;&lt;a href="http://www.dalkescientific.com/index.html"&gt;Andrew Dalke&lt;/a&gt; has written &lt;a href="http://www.dalkescientific.com/writings/diary/archive/2008/06/26/fingerprint_background.html"&gt;an excellent series of articles&lt;/a&gt; on fingerprints and what can be done with them. Another good overview is &lt;a href="http://www.daylight.com/dayhtml/doc/theory/theory.finger.html"&gt;available from Daylight&lt;/a&gt;. This article will assume you know what fingerprints are and how they can be used to compare chemical structures.&lt;/p&gt;

&lt;p&gt;The problem with binary fingerprints is that they are generally several hundred bits long - too long to be represented in a form that allows direct and rapid query by a relational database system. They need to be broken up - but how?&lt;/p&gt;

&lt;p&gt;A widely-used approach (and the one that will be taken here) involves breaking up the fingerprint into a series of integers that are stored in the database.&lt;/p&gt;

&lt;p&gt;For example, let's say we have a 1024-bit fingerprint. We could represent this as a single number from 0 to 2^1024 - 1, which of course is way too big for most computers to handle natively today. We could, however, represent this fingerprint as a series of sixteen 64-bit integers (which are available on most systems).&lt;/p&gt;

&lt;p&gt;So, the binary fingerprint:&lt;/p&gt;

&lt;div class="console"&gt;
&lt;pre&gt;
1111111101111111110110111011011000101000011000011010011100010000
1001100010101101000110100010110011101100100000100100000111010100
0101010000101011001010011001000100011001100000101100111010001110
1001000101001010000001011001100101101011111111011000111100000111
1010101100100101000100001100011001010111001001110101101100010010
0011101011101110110011111010000010111001100101001001101010110001
1100111000010100000100110111101001011100010111010001010101101101
0010001111111010111011110110000000001010111011111001111001111101
0101011100011111110111011110011110100110010110010101011001011111
0110100001111001101111011101001101101001000100010001100101111000
0011111001000100001111111110001100111001101000000100010010010110
0000011101001001011000111110101110010101110001111010100001100100
0100100111101010110101101010110110101010110110111011011001111111
0011100100101101101001000001000111110101011101110101101001101001
0110100100111001111001001111110111111001110100100110010100011110
0010101100101000011110101110111011001110101111100001011010101100
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;could also be represented as this decimal fingerprint (assuming your machine is &lt;a href="http://en.wikipedia.org/wiki/Endianness"&gt;big-endian&lt;/a&gt;):&lt;/p&gt;

&lt;div class="console"&gt;
&lt;pre&gt;
18410675377121896208
11001478244984832468
6064987026359504526
10469186440276053767
12332281598675737362
4246559787872197297
14849515287603909997
2592647731284516477
6277980392575817311
7528256967824972152
4486781373924787350
525060695046727780
5326305550703244927
4120129631153511017
7582343227124114718
3109870708788696748
&lt;/pre&gt;
&lt;/div&gt;
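
&lt;p&gt;The conversion between the two forms can be sketched in Ruby as follows. This is a minimal sketch that assumes the fingerprint arrives as a string of "0" and "1" characters and that each 64-bit chunk is read most-significant bit first; the method names are hypothetical and not part of Open Babel or any other library.&lt;/p&gt;

&lt;div class="console"&gt;
&lt;pre&gt;
```ruby
WORD_BITS = 64

# Split a string of "0" and "1" characters into integers, one per 64-bit
# chunk, reading each chunk most-significant bit first.
def fingerprint_words(bits)
  raise ArgumentError, "binary digits only" unless bits.match?(/\A[01]*\z/)
  raise ArgumentError, "length must be a multiple of #{WORD_BITS}" unless (bits.length % WORD_BITS).zero?
  bits.scan(/[01]{64}/).map { |chunk| chunk.to_i(2) }
end

# Reverse the conversion: turn the integers back into one long bit string,
# left-padding each word with zeros to a full 64 digits.
def words_to_bits(words)
  words.map { |word| word.to_s(2).rjust(WORD_BITS, "0") }.join
end

# First row of the binary fingerprint shown above.
first_row = "1111111101111111110110111011011000101000011000011010011100010000"

words = fingerprint_words(first_row)
puts words.first                        # first decimal value in the list above
puts words_to_bits(words) == first_row  # the round trip loses nothing
```
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Ruby integers grow as needed, so values at the top of the unsigned 64-bit range need no special handling on the scripting side.&lt;/p&gt;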

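&lt;p&gt;We can easily store this set of 16 numbers in a relational database table. One caveat: MySQL's &lt;code&gt;bigint&lt;/code&gt; is signed by default and tops out at 9223372036854775807, so the values above that exceed this limit would need columns declared &lt;code&gt;bigint unsigned&lt;/code&gt; rather than the plain &lt;code&gt;bigint(64)&lt;/code&gt; used in the transcript below. With that in mind, if we had a MySQL database called "compounds", we could create a "fingerprints" table:&lt;/p&gt;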

&lt;div class="console"&gt;
&lt;pre&gt;
mysql&gt; create database compounds;
Query OK, 1 row affected (0.02 sec)

mysql&gt; use compounds;
Database changed

mysql&gt; create table fingerprints(id int not null auto_increment, primary key(id), fp0 bigint(64), fp1 bigint(64), fp2 bigint(64), fp3 bigint(64), fp4 bigint(64), fp5 bigint(64), fp6 bigint(64), fp7 bigint(64), fp8 bigint(64), fp9 bigint(64), fp10 bigint(64), fp11 bigint(64), fp12 bigint(64), fp13 bigint(64), fp14 bigint(64), fp15 bigint(64));
Query OK, 0 rows affected (0.01 sec)

mysql&gt; describe fingerprints;
+-------+------------+------+-----+---------+----------------+
| Field | Type       | Null | Key | Default | Extra          |
+-------+------------+------+-----+---------+----------------+
| id    | int(11)    | NO   | PRI | NULL    | auto_increment | 
| fp0   | bigint(64) | YES  |     | NULL    |                | 
| fp1   | bigint(64) | YES  |     | NULL    |                | 
| fp2   | bigint(64) | YES  |     | NULL    |                | 
| fp3   | bigint(64) | YES  |     | NULL    |                | 
| fp4   | bigint(64) | YES  |     | NULL    |                | 
| fp5   | bigint(64) | YES  |     | NULL    |                | 
| fp6   | bigint(64) | YES  |     | NULL    |                | 
| fp7   | bigint(64) | YES  |     | NULL    |                | 
| fp8   | bigint(64) | YES  |     | NULL    |                | 
| fp9   | bigint(64) | YES  |     | NULL    |                | 
| fp10  | bigint(64) | YES  |     | NULL    |                | 
| fp11  | bigint(64) | YES  |     | NULL    |                | 
| fp12  | bigint(64) | YES  |     | NULL    |                | 
| fp13  | bigint(64) | YES  |     | NULL    |                | 
| fp14  | bigint(64) | YES  |     | NULL    |                | 
| fp15  | bigint(64) | YES  |     | NULL    |                | 
+-------+------------+------+-----+---------+----------------+
17 rows in set (0.01 sec)

&lt;/pre&gt;
&lt;/div&gt;
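
&lt;p&gt;As a rough sketch of how a Ruby script might feed this table, the snippet below assembles an INSERT statement from sixteen integers. It only builds the SQL text and does not talk to MySQL; the method name &lt;code&gt;insert_sql&lt;/code&gt; is hypothetical, and a real application would more likely use a database driver with bound parameters.&lt;/p&gt;

&lt;div class="console"&gt;
&lt;pre&gt;
```ruby
# Column names fp0 .. fp15, matching the table definition above.
COLUMNS = (0..15).map { |i| "fp#{i}" }

# Build an INSERT statement for one fingerprint given its sixteen integers.
# Integer() rejects anything that is not a whole number, so only integer
# literals ever reach the SQL text.
def insert_sql(words)
  raise ArgumentError, "expected #{COLUMNS.length} integers" unless words.length == COLUMNS.length
  values = words.map { |word| Integer(word) }
  "insert into fingerprints (#{COLUMNS.join(', ')}) values (#{values.join(', ')});"
end

# An all-zero fingerprint, used here only as placeholder input.
puts insert_sql(Array.new(16, 0))
```
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;id&lt;/code&gt; column is left out of the statement because it is declared &lt;code&gt;auto_increment&lt;/code&gt;.&lt;/p&gt;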

&lt;h4&gt;Conclusions&lt;/h4&gt;

&lt;p&gt;Although we have neither a substructure search engine nor a database, we've laid a solid foundation for those things. The next article in this series will show how to use this humble beginning to model some simple substructure queries in a way that lets MySQL do most of the heavy lifting.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Image Credit: &lt;a href="http://flickr.com/photos/jaded/"&gt;Mr. Jaded&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;</description>
      <pubDate>Thu, 02 Oct 2008 23:27:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:1d0f8fdc-4664-46ac-89cc-7c1de0608edd</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2008/10/02/fast-substructure-search-using-open-source-tools-part-1-fingerprints-and-databases</link>
      <category>Tools</category>
      <category>mysql</category>
      <category>ruby</category>
      <category>openbabel</category>
      <category>database</category>
      <category>fingerprint</category>
      <category>substructuresearch</category>
      <category>substructure</category>
    </item>
    <item>
      <title>"Fast Substructure Search Using Open Source Tools Part 1: Fingerprints and Databases" by Pansanel</title>
      <description>&lt;p&gt;Dear Rich,&lt;/p&gt;

&lt;p&gt;You can also use the Mychem software for generating fingerprints. All Open Babel fingerprint types are supported.
You can also combine several fingerprint types. Read the
Mychem documentation for further details or send me a message.
We recently set up a new project called ChemiSQL. It intends to federate several open source chemical cartridge projects (Mychem, Orchem and Pgchem).&lt;/p&gt;</description>
      <pubDate>Mon, 06 Oct 2008 13:01:09 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:29745432-7b4e-46b1-9583-ccf0e9ff74e8</guid>
      <link>http://depth-first.com/articles/2008/10/02/fast-substructure-search-using-open-source-tools-part-1-fingerprints-and-databases#comment-802</link>
    </item>
    <item>
      <title>"Fast Substructure Search Using Open Source Tools Part 1: Fingerprints and Databases" by Rajarshi</title>
      <description>&lt;p&gt;Rich, yes, you're right that a linear scan of a 100K-row table is probably fine - especially if it's just bit ops. Right now, we're using an R-tree index based on the GiST implementation to support spatial indexing for shape descriptors.&lt;/p&gt;

&lt;p&gt;For structure search, I'm waiting for Ernst to release his bit field GiST index. But even a GiST index does not help with a fully general SMARTS substructure search as far as I can see.&lt;/p&gt;</description>
      <pubDate>Sat, 04 Oct 2008 14:06:32 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:29d275c4-b41e-49b7-b3ff-3cd24b79f3a4</guid>
      <link>http://depth-first.com/articles/2008/10/02/fast-substructure-search-using-open-source-tools-part-1-fingerprints-and-databases#comment-786</link>
    </item>
    <item>
      <title>"Fast Substructure Search Using Open Source Tools Part 1: Fingerprints and Databases" by Craig Knox</title>
      <description>&lt;p&gt;Rich, our lab is planning on starting a blog soon - I think it will be one of the first posts.&lt;/p&gt;</description>
      <pubDate>Fri, 03 Oct 2008 19:45:04 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:33f2328f-5f51-4b44-9641-553e64d15a94</guid>
      <link>http://depth-first.com/articles/2008/10/02/fast-substructure-search-using-open-source-tools-part-1-fingerprints-and-databases#comment-784</link>
    </item>
    <item>
      <title>"Fast Substructure Search Using Open Source Tools Part 1: Fingerprints and Databases" by Rich Apodaca</title>
      <description>&lt;p&gt;Craig, any plans to write up how you did it and maybe discuss performance on a large database like &lt;a href="http://www.drugbank.ca/" rel="nofollow"&gt;DrugBank&lt;/a&gt;?&lt;/p&gt;

&lt;p&gt;Rajarshi, I hadn't seen &lt;a href="http://www.sai.msu.su/~megera/postgres/gist/doc/intro.shtml" rel="nofollow"&gt;GiST&lt;/a&gt; before, but it looks very interesting. You're right to a certain extent (more later) about the need for linear scans in the system I'm describing. For databases of ~100,000 structures or fewer, the speed of this system is plenty fast. For larger databases, though, GiST might be worth looking at.&lt;/p&gt;

&lt;p&gt;Have you implemented a structure search system using GiST indexes?&lt;/p&gt;</description>
      <pubDate>Fri, 03 Oct 2008 14:30:38 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:dcb6d781-bba5-4d7c-bbd5-068601e1f8bb</guid>
      <link>http://depth-first.com/articles/2008/10/02/fast-substructure-search-using-open-source-tools-part-1-fingerprints-and-databases#comment-782</link>
    </item>
    <item>
      <title>"Fast Substructure Search Using Open Source Tools Part 1: Fingerprints and Databases" by Rajarshi</title>
      <description>&lt;p&gt;Another possibility is to have an indexing scheme within the database that operates on binary strings. PostgreSQL supports GiST indexes, which are a generalization of search trees. Ernst Georg Schmidt (of Tigress fame) is working on this - the advantage is that you don't have to pre-process fingerprints.&lt;/p&gt;

&lt;p&gt;The main reason for using an index is that you'd avoid a linear scan over the database - for a generalized substructure search (i.e., SMARTS based) I see no other way than to do linear scans - which kills performance on a 17M row table! (Roger Sayle has written about some impressive optimizations applicable to SMARTS based searches - but they're definitely non-trivial to implement)&lt;/p&gt;

&lt;p&gt;While fingerprint ops are fast, even comparing fingerprints (which can be used as a filter if the query is a SMILES) requires linear scans unless there is some index on the bit field. Of course, in your approach, you could have a multi-column index on the integer fields - I'd be interested to see how that works out.&lt;/p&gt;</description>
      <pubDate>Fri, 03 Oct 2008 12:08:46 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:63ec2fcb-5985-4b0e-bca1-0f19c0d7c7c8</guid>
      <link>http://depth-first.com/articles/2008/10/02/fast-substructure-search-using-open-source-tools-part-1-fingerprints-and-databases#comment-781</link>
    </item>
    <item>
      <title>"Fast Substructure Search Using Open Source Tools Part 1: Fingerprints and Databases" by Craig Knox</title>
      <description>&lt;p&gt;I just finished implementing substructure/similarity search with DrugBank.  If your database is free for public use and has no login, you can use ChemAxon software for free.  Perhaps it isn't as fast or as powerful, but it was extremely easy to implement (I actually wrapped their software in a JRuby Rails project with GlassFish so I can update and search the database via REST - it works surprisingly well)... anyways, just my 2 cents.&lt;/p&gt;

&lt;p&gt;Interesting to see the details of how one would (start) to do it with open source software - I may have to look into it.&lt;/p&gt;</description>
      <pubDate>Fri, 03 Oct 2008 06:24:34 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:8ab8fad9-9cba-4672-855b-d03969cb18e7</guid>
      <link>http://depth-first.com/articles/2008/10/02/fast-substructure-search-using-open-source-tools-part-1-fingerprints-and-databases#comment-780</link>
    </item>
  </channel>
</rss>
