<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="http://feeds.feedburner.com/~d/styles/rss2full.xsl" type="text/xsl" media="screen"?><?xml-stylesheet href="http://feeds.feedburner.com/~d/styles/itemcontent.css" type="text/css" media="screen"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" version="2.0">
  <channel>
    <title>Depth-First</title>
    <link>http://depth-first.com</link>
    <language>en-us</language>
    <ttl>40</ttl>
    <description>Walking the Web of Chemical Informatics</description>
    <atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" href="http://feeds.feedburner.com/Depth-first" type="application/rss+xml" /><feedburner:emailServiceId xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">633918</feedburner:emailServiceId><feedburner:feedburnerHostname xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">http://www.feedburner.com</feedburner:feedburnerHostname><item>
      <title>Introducing MX: Lightweight and Free Cheminformatics Tools for Java</title>
      <description>&lt;p&gt;&lt;a href="http://code.google.com/p/mx-java/"&gt;&lt;img src="http://depth-first.com/demo/20081121/mx.png" align="right"&gt;&lt;/img&gt;&lt;/a&gt;If you want to build cheminformatics software of any kind, you'll need a basic toolkit. Ideally, this toolkit contains all of the low-level functionality used over and over in your projects. Tools for building an in-memory molecular representation, exact- and substructure comparison, and reading/writing molfiles all fall into this category. Also ideally, this toolkit should be free. Not free in the sense of free to use if you work at a university, free to try, or even free to use provided that you make your changes public when you redistribute the toolkit. But free in the sense of "do whatever you want with it and all you have to do is include a copyright notice."&lt;/p&gt;

&lt;p&gt;This article introduces MX, a suite of lightweight and free cheminformatics tools for Java designed to fill these needs.&lt;/p&gt;

&lt;h4&gt;Download&lt;/h4&gt;

&lt;p&gt;A &lt;a href="http://code.google.com/p/mx-java/"&gt;Google Code page&lt;/a&gt; has been set up for MX. Both a &lt;a href="http://mx-java.googlecode.com/files/mx-0.103.0-src.tar.gz"&gt;source distribution&lt;/a&gt; and &lt;a href="http://mx-java.googlecode.com/files/mx-0.103.0.jar"&gt;compiled jarfile&lt;/a&gt; representing MX in its current state can be downloaded.&lt;/p&gt;

&lt;p&gt;A subsequent article will show how to get started with MX.&lt;/p&gt;

&lt;h4&gt;Origins: A Chemical Structure Editor for Web Applications&lt;/h4&gt;

&lt;p&gt;In 2007 my company, &lt;a href="http://metamolecular.com"&gt;Metamolecular&lt;/a&gt;, set out to build a lightweight and easy-to-use chemical structure editor for Web applications. Given the increasingly important role the Web would play as a chemical communication medium over the next decade, a truly Web-based, platform-independent alternative to ChemDraw and ISIS/Draw seemed a good direction to pursue. The resulting product became known as &lt;a href="http://metamolecular.com/chemwriter"&gt;ChemWriter&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Minimizing deployment footprint was a key consideration with ChemWriter; the last thing a chemist using a Web site wants to spend his or her time doing is waiting around for a large applet to download. With many chemical structure editor applets available today, download times on the order of one minute or longer are not uncommon. This is simply unacceptable.&lt;/p&gt;

&lt;p&gt;To create ChemWriter, an ultra-lightweight cheminformatics toolkit was needed. How lightweight? We were targeting 100 KB for the complete editor. A good chemical structure editor is a fairly complex piece of UI software involving multiple drawing tools with state-dependent behavior, not to mention some fairly sophisticated &lt;a href="http://depth-first.com/articles/2008/10/31/a-simple-vector-graphics-api-for-chemical-structure-output-part-1-in-search-of-a-simplifying-approach-for-chemphoto"&gt;vector graphics rendering&lt;/a&gt; and molfile input/output. The only way we could reach our 100 KB target for ChemWriter was if the basic cheminformatics toolkit were 20 KB or smaller.&lt;/p&gt;

&lt;p&gt;At the time, there was no cheminformatics toolkit, free or otherwise, that could fill the need.&lt;/p&gt;

&lt;p&gt;So it was created from scratch.&lt;/p&gt;

&lt;h4&gt;High Performance in ChemPhoto&lt;/h4&gt;

&lt;p&gt;Eventually, the same cheminformatics toolkit used in ChemWriter was adapted for &lt;a href="http://metamolecular.com/chemphoto"&gt;ChemPhoto&lt;/a&gt;, the &lt;a href="http://depth-first.com/articles/2008/09/08/smarter-cheminformatics-from-sd-file-to-image-collection-with-chemphoto"&gt;chemical structure imaging application&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;ChemPhoto was designed to dynamically display 100,000 or more 2D chemical structures in a grid-like GUI using minimal memory. Rather than pre-loading all 100,000 molecule objects into memory, which would not be feasible on most systems, ChemPhoto uses a lazy approach in which an in-memory index of the target SD file is built. Every time a new structure needs to be displayed to the user during a scrolling event, it's created from scratch: the molfile text is loaded from disk, a molecule object is created, the molecule is rendered, and then the entire construct is thrown away.&lt;/p&gt;
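
&lt;p&gt;The lazy approach described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not ChemPhoto's actual code: the class name &lt;tt&gt;SdFileIndex&lt;/tt&gt; and its methods are hypothetical, records are assumed to end with a line containing only four dollar signs, and the unbuffered &lt;tt&gt;RandomAccessFile.readLine&lt;/tt&gt; call would probably be replaced with buffered reads in a real application.&lt;/p&gt;

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.stream.LongStream;

// Hypothetical helper for illustration; not part of MX or ChemPhoto.
final class SdFileIndex {
  private final RandomAccessFile file;
  private final long[] offsets; // byte offset at which each record starts

  SdFileIndex(String path) throws IOException {
    file = new RandomAccessFile(path, "r");
    LongStream.Builder starts = LongStream.builder();
    long recordStart = 0;
    boolean recordHasContent = false;
    String line;

    // One pass over the file: remember only where each record begins.
    while ((line = file.readLine()) != null) {
      if (line.trim().equals("$$$$")) {
        starts.add(recordStart);
        recordStart = file.getFilePointer();
        recordHasContent = false;
      } else if (!line.trim().isEmpty()) {
        recordHasContent = true;
      }
    }

    // Tolerate a final record that lacks the closing delimiter.
    if (recordHasContent) {
      starts.add(recordStart);
    }

    offsets = starts.build().toArray();
  }

  int size() {
    return offsets.length;
  }

  // Reads the text of one record from disk on demand; nothing is cached.
  String recordText(int index) throws IOException {
    file.seek(offsets[index]);
    StringBuilder text = new StringBuilder();
    String line;

    while ((line = file.readLine()) != null) {
      if (line.trim().equals("$$$$")) {
        break;
      }
      text.append(line).append('\n');
    }

    return text.toString();
  }

  void close() throws IOException {
    file.close();
  }
}
```

&lt;p&gt;With an index like this, memory use grows by one &lt;tt&gt;long&lt;/tt&gt; per record rather than one molecule object per record. A viewer would call &lt;tt&gt;recordText&lt;/tt&gt; for each structure that scrolls into view, parse and render it, and then let the result be garbage collected.&lt;/p&gt;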

&lt;p&gt;The performance of ChemPhoto was so good, even though everything was being created on demand and immediately thrown away, that it appeared the cheminformatics toolkit being used had potential in high-performance situations as well.&lt;/p&gt;

&lt;h4&gt;Substructure Search and Mapping&lt;/h4&gt;

&lt;p&gt;Recently, &lt;a href="http://rguha.wordpress.com/"&gt;Rajarshi Guha&lt;/a&gt; reported &lt;a href="http://rguha.wordpress.com/2008/09/19/faster-substructure-search-in-the-cdk"&gt;his port of the VF library to Java&lt;/a&gt; for use with the Chemistry Development Kit (CDK). This began a thought process starting with "how can it be improved" and leading to the conclusion that the creation of &lt;a href="http://depth-first.com/articles/2008/11/13/one-of-these-things-is-not-like-the-other"&gt;flexible, Java-centric substructure search utilities&lt;/a&gt; would offer the most bang for the buck. &lt;a href="http://depth-first.com/articles/2008/11/17/substructure-search-from-scratch-in-java-part-1-the-atom-mapping-problem"&gt;A subsequent article&lt;/a&gt; described a simple strategy that could be used to get there.&lt;/p&gt;

&lt;p&gt;To implement this idea, a cheminformatics toolkit was needed. The one used successfully in ChemWriter and ChemPhoto was an ideal candidate.&lt;/p&gt;

&lt;p&gt;The result, a complete substructure search and mapping utility built from scratch, is available in MX under the package &lt;tt&gt;com.metamolecular.mx.map&lt;/tt&gt;.&lt;/p&gt;

&lt;h4&gt;Free to Use Anytime, Anyplace - No Strings Attached&lt;/h4&gt;

&lt;p&gt;Licenses can be a problem with nearly all open source cheminformatics toolkits. If your work is mostly done in an academic environment for free, you're likely to experience no problem at all. However, if you run a company that sells licenses to software containing code you'd rather not reveal to the world, the &lt;a href="http://depth-first.com/articles/2006/12/29/dispelling-open-source-confusion-an-introduction-to-licenses"&gt;reciprocity provisions&lt;/a&gt; in licenses such as those in the GPL, Mozilla (MPL), and IBM (CPL) families lead to major problems.&lt;/p&gt;

&lt;p&gt;The problem isn't so much the open source license itself - it's the fact that the original copyright owners either won't give their permission to dual-license their contributions, or in many cases, can't even be tracked down to ask.&lt;/p&gt;

&lt;p&gt;This is an unacceptable position for a software distributor wanting to use open source as a cost-effective means to boost their developer productivity.&lt;/p&gt;

&lt;p&gt;To address these issues, MX is being distributed under the extremely permissive &lt;a href="http://www.opensource.org/licenses/mit-license.php"&gt;MIT License&lt;/a&gt;. In a nutshell it says you are free to modify and incorporate MX into any software you distribute without any obligation to release a line of your own source code. It also says if MX doesn't do the job, you're on your own. And that's about all it says. Your only obligation is to include the original copyright notice on all copies or substantial portions of the software.&lt;/p&gt;

&lt;p&gt;To my knowledge, only one major cheminformatics toolkit is licensed under an academic-style open source license - &lt;a href="http://www.rdkit.org/"&gt;RDKit&lt;/a&gt;, which is licensed under the &lt;a href="http://www.opensource.org/licenses/bsd-license.php"&gt;New BSD License&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;Conclusions&lt;/h4&gt;

&lt;p&gt;A basic cheminformatics toolkit is a vital component of most chemistry-related software. For maximal cost-effectiveness as a software distributor, a free toolkit licensed under a permissive open source license is ideal. MX is a free and lightweight cheminformatics toolkit written in Java that has been used successfully in two commercial products.&lt;/p&gt;

&lt;p&gt;Future articles will describe the many ways MX can be used and extended.&lt;/p&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~f/Depth-first?a=G9udN"&gt;&lt;img src="http://feeds.feedburner.com/~f/Depth-first?i=G9udN" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~f/Depth-first?a=0ITqn"&gt;&lt;img src="http://feeds.feedburner.com/~f/Depth-first?i=0ITqn" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~f/Depth-first?a=rRt3n"&gt;&lt;img src="http://feeds.feedburner.com/~f/Depth-first?i=rRt3n" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~f/Depth-first?a=hToIn"&gt;&lt;img src="http://feeds.feedburner.com/~f/Depth-first?i=hToIn" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;</description>
      <pubDate>Fri, 21 Nov 2008 18:28:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:c54d3fa0-de47-4040-a48d-ba23553469dd</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2008/11/21/introducing-mx-lightweight-and-free-cheminformatics-tools-for-java</link>
      <category>Tools</category>
      <category>mx</category>
      <category>java</category>
      <category>toolkit</category>
      <category>mitlicense</category>
      <category>opensource</category>
      <category>substructure</category>
      <category>chemwriter</category>
      <category>chemphoto</category>
      <category>metamolecular</category>
    </item>
    <item>
      <title>ChemPhoto Beta-2</title>
      <description>&lt;p&gt;&lt;a href="http://metamolecular.com/chemphoto"&gt;&lt;img src="http://depth-first.com/demo/20080908/chemphoto.png" align="right"&gt;&lt;/img&gt;&lt;/a&gt;The second beta version of &lt;a href="http://metamolecular.com/chemphoto"&gt;ChemPhoto&lt;/a&gt;, the &lt;a href="http://depth-first.com/articles/2008/09/08/smarter-cheminformatics-from-sd-file-to-image-collection-with-chemphoto"&gt;chemical structure imaging application&lt;/a&gt; developed by &lt;a href="http://metamolecular.com"&gt;Metamolecular&lt;/a&gt;, is now available for testing.&lt;/p&gt;

&lt;p&gt;Thanks to detailed reports from several &lt;a href="http://depth-first.com/articles/2008/10/24/chemphoto-beta-1-now-available"&gt;ChemPhoto Beta-1 testers&lt;/a&gt;, a number of issues were discovered and have been addressed by this newest release.&lt;/p&gt;

&lt;p&gt;If you're interested in seeing how easy it can be to create high-quality 2D chemical structure images from your compound collection, please &lt;a href="http://mailhide.recaptcha.net/d?k=01R9bxyP6XNdc0duoUCzBBHA==&amp;amp;c=vZ7R0VDctRzIRzbSs1-LZwDzjTjAnfCS4KONqGHxY9I=" onclick="window.open('http://mailhide.recaptcha.net/d?k=01R9bxyP6XNdc0duoUCzBBHA==&amp;amp;c=vZ7R0VDctRzIRzbSs1-LZwDzjTjAnfCS4KONqGHxY9I=', '', 'toolbar=0,scrollbars=0,location=0,statusbar=0,menubar=0,resizable=0,width=500,height=300'); return false;" title="Reveal this e-mail address"&gt;drop me a line.&lt;/a&gt;&lt;/p&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~f/Depth-first?a=0H1MN"&gt;&lt;img src="http://feeds.feedburner.com/~f/Depth-first?i=0H1MN" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~f/Depth-first?a=DIq2n"&gt;&lt;img src="http://feeds.feedburner.com/~f/Depth-first?i=DIq2n" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~f/Depth-first?a=HaPBn"&gt;&lt;img src="http://feeds.feedburner.com/~f/Depth-first?i=HaPBn" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~f/Depth-first?a=1g16n"&gt;&lt;img src="http://feeds.feedburner.com/~f/Depth-first?i=1g16n" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;</description>
      <pubDate>Thu, 20 Nov 2008 23:59:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:6f85cfb7-23b1-44f1-8358-542c0726e355</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2008/11/20/chemphoto-beta-2</link>
      <category>Tools</category>
      <category>chemphoto</category>
      <category>beta</category>
      <category>imaging</category>
      <category>2d</category>
      <category>java</category>
    </item>
    <item>
      <title>SciFinder Web, Greasemonkey, and REST: Embracing Divergence in Chemical Information Systems</title>
      <description>&lt;p&gt;&lt;a href="http://flickr.com/photos/trashd/754113710/"&gt;&lt;img src="http://depth-first.com/demo/20081119/diverge.jpg" align="right"&gt;&lt;/img&gt;&lt;/a&gt;Recently, &lt;a href="http://personnes.epfl.ch/alain.borel"&gt;Alain Borel&lt;/a&gt; posted a message to the &lt;a href="https://listserv.indiana.edu/cgi-bin/wa-iub.exe?A0=CHMINF-L"&gt;CHMINF-L list&lt;/a&gt; describing his successful attempt to get links to external datasources to show up in SciFinder Web:&lt;/p&gt;

&lt;blockquote&gt;
    &lt;p&gt;I'm dreaming of a world where chemical data is hyperlinked as thoroughly
    as text data is today (or even more)... here's my small contribution to
    this goal.&lt;/p&gt;
    
    &lt;p&gt;...&lt;/p&gt;
    
    &lt;p&gt;Basically, Greasemonkey allows you to rewrite the HTML content of a web
    page before it is rendered in the browser window. Thanks to this, I've
    been able to write a script that adds links to external databases
    through registry numbers. Currently, &lt;a href="http://chemspider.com"&gt;ChemSpider&lt;/a&gt;, &lt;a href="http://chempedia.com"&gt;Chempedia&lt;/a&gt; and &lt;a href="http://chemexper.com"&gt;Chemexper&lt;/a&gt; [links added]
    are supported - I also have a private version that links to our Intranet
    chemical stocks application. The pleasant side for those who worry about
    intellectual property is that neither side of the link needs to know
    what's on the other side, and even the plugin doesn't know what's inside
    the database records.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="http://personnes.epfl.ch/alain.borel"&gt;Greasemonkey&lt;/a&gt; is the client-side script plugin engine that enables users to change the appearance and content of any site. Although originally designed as a Firefox extension, as least some Greasemonkey scripts can be run in &lt;a href="http://www.simplehelp.net/2007/11/14/how-to-run-greasemonkey-scripts-in-safari/"&gt;Safari&lt;/a&gt; and &lt;a href="http://www.ghacks.net/2008/10/18/google-chrome-adds-greasemonkey-support/"&gt;Chrome&lt;/a&gt;. A &lt;a href="http://dx.doi.org/10.1186/1471-2105-8-487"&gt;recent paper&lt;/a&gt; describes some of the potential for this form of scripting in the life sciences.&lt;/p&gt;

&lt;p&gt;Alain's script, which can be &lt;a href="http://biscom.epfl.ch/scifinder_links.user.js"&gt;freely downloaded&lt;/a&gt;, uses &lt;a href="http://depth-first.com/articles/2008/05/26/simple-cas-number-lookup-and-more-with-chempedia"&gt;CAS numbers&lt;/a&gt; to link SciFinder records to the external databases. Although I don't have access to SciFinder Web, Alain's description makes it sound like each entry for a specific substance in SciFinder Web is given an additional set of links out to external datasources.&lt;/p&gt;

&lt;h4&gt;What's REST Got To Do With It?&lt;/h4&gt;

&lt;p&gt;One of Alain's external datasources is &lt;a href="http://chempedia.com"&gt;Chempedia&lt;/a&gt;. A unique feature of Chempedia is the way it exposes &lt;a href="http://depth-first.com/articles/2008/05/26/simple-cas-number-lookup-and-more-with-chempedia"&gt;an electronic paper trail for CAS numbers&lt;/a&gt;. Rather than just reporting a CAS registry number, it fully discloses which organization is asserting that a particular CAS number belongs with a structure.&lt;/p&gt;

&lt;p&gt;For example, see &lt;a href="http://chempedia.com/registry_numbers/525-66-6"&gt;this entry on [525-66-6]&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Chempedia was designed from the start to apply the principles of &lt;a href="http://depth-first.com/articles/2007/05/30/restful-cheminformatics"&gt;REST&lt;/a&gt;. The big idea behind REST is that every resource on the Web, such as a CAS number, can be manipulated by exactly four methods: GET, PUT, POST, and DELETE.&lt;/p&gt;

&lt;p&gt;The highly-desirable side-effect of designing Web sites around the concept of resources being acted on by exactly four methods is that sites applying this philosophy become orders of magnitude easier to &lt;a href="http://depth-first.com/articles/2006/09/23/mashups-for-fun-and-profit"&gt;mash up&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Alain's Greasemonkey script is an example of a client-side mashup. But how did he do it?&lt;/p&gt;

&lt;h4&gt;CAS Numbers are First-Class Citizens&lt;/h4&gt;

&lt;p&gt;Each CAS number on Chempedia is a resource that can be accessed by a URL taking the form:&lt;/p&gt;

&lt;p&gt;&lt;tt&gt;http://chempedia.com/registry_numbers/REGISTRY_NUMBER&lt;/tt&gt;&lt;/p&gt;

&lt;p&gt;where &lt;tt&gt;REGISTRY_NUMBER&lt;/tt&gt; is the CAS number of interest. For example, acetaminophen has the registry number [103-90-2] and it can be accessed with this URL:&lt;/p&gt;

&lt;p&gt;&lt;a href="http://chempedia.com/registry_numbers/103-90-2"&gt;http://chempedia.com/registry_numbers/103-90-2&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you request a CAS number that doesn't exist, you should receive a 404 error, although a bug in Chempedia is currently preventing that from happening.&lt;/p&gt;

&lt;p&gt;To link SciFinder Web to Chempedia, Alain's user script simply looks for which CAS number the SciFinder page is talking about and constructs the RESTful URL. It doesn't get much simpler than that.&lt;/p&gt;
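
&lt;p&gt;The idea is simple enough to sketch. The Java below is not Alain's script, which is a JavaScript user script; it only illustrates the two steps of finding a CAS number in page text and building the Chempedia URL from it. The class and method names are hypothetical, and the pattern assumes the usual CAS format of two to seven digits, two digits, and a single check digit.&lt;/p&gt;

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only; names are hypothetical.
final class RegistryLinks {
  // Assumed CAS format: 2 to 7 digits, 2 digits, 1 check digit.
  private static final Pattern CAS =
      Pattern.compile("\\b(\\d{2,7})-(\\d{2})-(\\d)\\b");
  private static final String BASE = "http://chempedia.com/registry_numbers/";

  // Returns the first plausible CAS number in the text, or null if none is found.
  static String findRegistryNumber(String text) {
    Matcher matcher = CAS.matcher(text);
    while (matcher.find()) {
      if (hasValidCheckDigit(matcher.group())) {
        return matcher.group();
      }
    }
    return null;
  }

  // Weights the digits 1, 2, 3, ... from the right, skipping the check digit.
  // The weighted sum modulo 10 must equal the check digit.
  static boolean hasValidCheckDigit(String cas) {
    String digits = cas.replace("-", "");
    int check = digits.charAt(digits.length() - 1) - '0';
    int sum = 0;
    int weight = 1;
    for (int i = digits.length() - 2; i != -1; i--) {
      sum += weight * (digits.charAt(i) - '0');
      weight++;
    }
    return sum % 10 == check;
  }

  // Builds the resource URL by appending the CAS number to the base path.
  static String chempediaUrl(String cas) {
    return BASE + cas;
  }

  public static void main(String[] args) {
    String cas = findRegistryNumber("Acetaminophen, CAS Registry Number 103-90-2");
    System.out.println(chempediaUrl(cas));
    // prints http://chempedia.com/registry_numbers/103-90-2
  }
}
```

&lt;p&gt;Because the URL is nothing more than a fixed prefix plus the identifier, the client needs no knowledge of how Chempedia stores or serves the record.&lt;/p&gt;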

&lt;h4&gt;Conclusions&lt;/h4&gt;

&lt;p&gt;Chemistry is a large, established field. Not surprisingly, &lt;a href="http://depth-first.com/articles/2008/05/07/1908-and-all-that-the-long-tail-and-chemistry"&gt;specialization is an essential part of being a chemist&lt;/a&gt;. It's therefore to be expected that chemical databases will diverge into a variety of specialized forms. One size will almost certainly not fit all.&lt;/p&gt;

&lt;p&gt;We can deny this simple fact and build ever more complex and unusable chemical information systems. Or we can accept it and custom-build our services for the job at hand.&lt;/p&gt;

&lt;p&gt;RESTful server architectures and mashups offer a powerful way to accomplish this goal.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Image Credit: &lt;a href="http://flickr.com/photos/trashd/"&gt;betenoir&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~f/Depth-first?a=kGkaN"&gt;&lt;img src="http://feeds.feedburner.com/~f/Depth-first?i=kGkaN" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~f/Depth-first?a=2VsUn"&gt;&lt;img src="http://feeds.feedburner.com/~f/Depth-first?i=2VsUn" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~f/Depth-first?a=2Wzcn"&gt;&lt;img src="http://feeds.feedburner.com/~f/Depth-first?i=2Wzcn" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~f/Depth-first?a=6Hi4n"&gt;&lt;img src="http://feeds.feedburner.com/~f/Depth-first?i=6Hi4n" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;</description>
      <pubDate>Wed, 19 Nov 2008 17:41:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:8381b2d7-1d61-4266-ab87-522da8b3e178</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2008/11/19/scifinder-web-greasemonkey-and-rest-embracing-divergence-in-chemical-information-systems</link>
      <category>Meta</category>
      <category>chempedia</category>
      <category>rest</category>
      <category>mashup</category>
      <category>greasemonkey</category>
      <category>userscript</category>
      <category>scifinder</category>
      <category>scifinderweb</category>
      <category>divergence</category>
    </item>
    <item>
      <title>Substructure Search From Scratch in Java Part 1: The Atom Mapping Problem</title>
      <description>&lt;p&gt;&lt;a href="http://flickr.com/photos/dollar_bin/2086969723/"&gt;&lt;img src="http://depth-first.com/demo/20081117/map.jpg" align="right"&gt;&lt;/img&gt;&lt;/a&gt;One of the most important capabilities in cheminformatics is mapping the atoms of a &lt;em&gt;query structure&lt;/em&gt; onto the atoms of a &lt;em&gt;target structure&lt;/em&gt;. Although useful in itself, the main value of atom mapping comes from the software that gets built on top of it: exact structure comparators, &lt;a href="http://depth-first.com/articles/2008/10/02/fast-substructure-search-using-open-source-tools-part-1-fingerprints-and-databases"&gt;substructure search systems&lt;/a&gt;, and query atom/bond search systems such as &lt;a href="http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html"&gt;SMARTS&lt;/a&gt;. The fundamental nature of atom mapping means that correctness, efficiency and adaptability are essential features of a good mapping implementation. Recently, a D-F article made the case that atom mapping software written in Java &lt;a href="http://depth-first.com/articles/2008/11/13/one-of-these-things-is-not-like-the-other"&gt;needs to be Java-centric&lt;/a&gt; to achieve these goals. This article, the first in a series that describes a complete substructure search system written in Java, takes the first step by offering some simple interface definitions and code for the atom mapping problem.&lt;/p&gt;

&lt;h4&gt;The Problem&lt;/h4&gt;

&lt;p&gt;Given a query molecule (&lt;tt&gt;query&lt;/tt&gt;) and a target molecule (&lt;tt&gt;target&lt;/tt&gt;), our atom mapping software needs to find ways to match the atoms of &lt;tt&gt;query&lt;/tt&gt; onto &lt;tt&gt;target&lt;/tt&gt; such that the mapping describes a substructure embedded in &lt;tt&gt;target&lt;/tt&gt;. The software might stop at one mapping, continue on to find all of them, or stop at some point in the middle. It all depends on the specific cheminformatics problem we're trying to solve.&lt;/p&gt;

&lt;h4&gt;The Recursive Function&lt;/h4&gt;

&lt;p&gt;Our implementation will gradually build up an atom mapping by traversing the atoms of &lt;tt&gt;query&lt;/tt&gt; in depth-first order and trying to map each found atom onto an atom in &lt;tt&gt;target&lt;/tt&gt;. At each step in the process, we will have a partial atom map that maps some of the atoms in &lt;tt&gt;query&lt;/tt&gt; onto &lt;tt&gt;target&lt;/tt&gt;. That map, and any other information needed to complete the analysis, will be kept in an instance of a class implementing the &lt;tt&gt;State&lt;/tt&gt; interface.&lt;/p&gt;

&lt;p&gt;A &lt;tt&gt;State&lt;/tt&gt; will be manipulated by a recursive method, &lt;tt&gt;mapFirst&lt;/tt&gt;, that returns when the first atom map is found:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_java "&gt;// create a list to hold atom maps
List&amp;lt;Map&amp;lt;Atom, Atom&amp;gt;&amp;gt; maps = new ArrayList&amp;lt;Map&amp;lt;Atom, Atom&amp;gt;&amp;gt;();

// create initial state
State state = ...; 

boolean mapFirst(State state)
{
  if (state.isDead())
  {
    return false;
  }

  if (state.isGoal())
  {
    maps.add(state.getMap());

    return true;
  }

  boolean found = false;

  while (!found &amp;amp;&amp;amp; state.hasNextCandidate())
  {
    Match candidate = state.nextCandidate();

    if (state.isMatchFeasible(candidate))
    {
      State nextState = state.nextState(candidate);
      found = mapFirst(nextState);

      nextState.backTrack();
    }
  }

  return found;
}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Comparison of the &lt;tt&gt;mapFirst&lt;/tt&gt; method to the pseudocode &lt;a href="http://depth-first.com/articles/2008/11/13/one-of-these-things-is-not-like-the-other"&gt;VF algorithm &lt;tt&gt;Match&lt;/tt&gt; procedure given in the previous article&lt;/a&gt; shows some similarities. In fact, something similar to the &lt;tt&gt;mapFirst&lt;/tt&gt; method forms the basis of many atom mappers in use today.&lt;/p&gt;

&lt;p&gt;Although it may be clear from the code, it's worth re-iterating that each time &lt;tt&gt;mapFirst&lt;/tt&gt; is recursively called, an attempt is made to branch off a new &lt;tt&gt;State&lt;/tt&gt; that maps an additional pair of atoms from &lt;tt&gt;query&lt;/tt&gt; to &lt;tt&gt;target&lt;/tt&gt;. If that branch leads to a possible solution, it's followed. Otherwise the next possible mapping is explored.&lt;/p&gt;
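
&lt;p&gt;To make the branch-and-backtrack behavior concrete, here is a deliberately simplified, self-contained sketch. It is not the MX implementation and does not use the &lt;tt&gt;State&lt;/tt&gt; interface: the hypothetical &lt;tt&gt;ToyMapper&lt;/tt&gt; class reduces each molecule to an adjacency matrix, ignores atom and bond types, and visits query atoms in index order rather than by walking bonds. Its feasibility rule is also an assumption: every bond between mapped query atoms must exist between the corresponding target atoms.&lt;/p&gt;

```java
import java.util.Arrays;

// Toy stand-in for illustration only; names are hypothetical.
final class ToyMapper {
  private final boolean[][] query;
  private final boolean[][] target;
  private final int[] map;       // map[q] is the target atom for query atom q, or -1
  private final boolean[] used;  // used[t] is true while target atom t is taken

  ToyMapper(boolean[][] query, boolean[][] target) {
    this.query = query;
    this.target = target;
    this.map = new int[query.length];
    this.used = new boolean[target.length];
    Arrays.fill(map, -1);
  }

  // Returns a copy of the first complete mapping, or null if none exists.
  int[] mapFirst() {
    return extend(0) ? map.clone() : null;
  }

  // Tries every unused target atom for query atom q, recursing on success.
  private boolean extend(int q) {
    if (q == query.length) {
      return true; // goal: every query atom is mapped
    }
    for (int t = 0; t != target.length; t++) {
      if (used[t] || !feasible(q, t)) {
        continue;
      }
      map[q] = t;
      used[t] = true;
      if (extend(q + 1)) {
        return true;
      }
      map[q] = -1; // backtrack and try the next candidate
      used[t] = false;
    }
    return false; // dead end: no candidate works from here
  }

  // Each bond from q to an already-mapped query atom must also exist
  // between the corresponding target atoms.
  private boolean feasible(int q, int t) {
    for (int other = 0; other != q; other++) {
      if (query[other][q]) {
        if (!target[map[other]][t]) {
          return false;
        }
      }
    }
    return true;
  }

  // Builds a symmetric adjacency matrix from a list of atom index pairs.
  static boolean[][] graph(int atoms, int[][] bonds) {
    boolean[][] adjacent = new boolean[atoms][atoms];
    for (int[] bond : bonds) {
      adjacent[bond[0]][bond[1]] = true;
      adjacent[bond[1]][bond[0]] = true;
    }
    return adjacent;
  }

  public static void main(String[] args) {
    boolean[][] propane = graph(3, new int[][] {{0, 1}, {1, 2}});
    boolean[][] butane = graph(4, new int[][] {{0, 1}, {1, 2}, {2, 3}});
    System.out.println(Arrays.toString(new ToyMapper(propane, butane).mapFirst()));
    // prints [0, 1, 2]
  }
}
```

&lt;p&gt;The three exits from &lt;tt&gt;extend&lt;/tt&gt; correspond loosely to the goal test, the feasibility test, and the dead-end case in &lt;tt&gt;mapFirst&lt;/tt&gt;. A three-membered ring used as the query against the same butane target returns &lt;tt&gt;null&lt;/tt&gt;, because every branch eventually runs out of candidates.&lt;/p&gt;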

&lt;h4&gt;The &lt;tt&gt;State&lt;/tt&gt; Interface&lt;/h4&gt;

&lt;p&gt;The recursive &lt;tt&gt;mapFirst&lt;/tt&gt; method determines all of the methods the &lt;tt&gt;State&lt;/tt&gt; interface needs to define:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_java "&gt;public interface State
{
  /**
   * Returns the current mapping of query atoms onto target atoms.
   * This map is shared among all states obtained through nextState.
   */
  public Map&amp;lt;Atom, Atom&amp;gt; getMap();

  /**
   * Returns true if another candidate match can be found or
   * false otherwise.
   */
  public boolean hasNextCandidate();

  /**
   * Returns the next candidate match.
   */
  public Match nextCandidate();

  /**
   * Returns true if the given match will work with the current
   * map, or false otherwise.
   */
  public boolean isMatchFeasible(Match match);

  /**
   * Returns true if all atoms in the query molecule have been
   * mapped.
   */
  public boolean isGoal();

  /**
   * Returns true if no match will come from this State.
   */
  public boolean isDead();

  /**
   * Returns a state in which the atoms in match have been
   * added to the current mapping.
   */
  public State nextState(Match match);

  /**
   * Returns this State's atom map to its original condition.
   */
  public void backTrack();
}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Finally, &lt;tt&gt;State&lt;/tt&gt; uses an instance of the &lt;tt&gt;Match&lt;/tt&gt; class:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_java "&gt;public class Match
{
  private Atom query;
  private Atom target;

  public Match(Atom query, Atom target)
  {
    this.query = query;
    this.target = target;
  }

  public Atom getQueryAtom()
  {
    return query;
  }

  public Atom getTargetAtom()
  {
    return target;
  }
}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h4&gt;Conclusions&lt;/h4&gt;

&lt;p&gt;With just a few lines of Java, we've managed to reduce the fundamental cheminformatics problem of atom mapping to the far simpler problem of implementing the &lt;tt&gt;State&lt;/tt&gt; interface.&lt;/p&gt;

&lt;p&gt;How many ways are there to implement the &lt;tt&gt;State&lt;/tt&gt; interface? Probably as many as there are subgraph isomorphism algorithms. Notice that the way we've set up the problem lets us use the same recursive method to test all &lt;tt&gt;State&lt;/tt&gt; implementations, an essential prerequisite for benchmarking and optimization.&lt;/p&gt;

&lt;p&gt;Future articles in this series will describe one way to implement the &lt;tt&gt;State&lt;/tt&gt; interface.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Image Credit: &lt;a href="http://flickr.com/photos/dollar_bin/"&gt;Dollar Bin&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~f/Depth-first?a=aSUqN"&gt;&lt;img src="http://feeds.feedburner.com/~f/Depth-first?i=aSUqN" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~f/Depth-first?a=528bn"&gt;&lt;img src="http://feeds.feedburner.com/~f/Depth-first?i=528bn" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~f/Depth-first?a=nPjUn"&gt;&lt;img src="http://feeds.feedburner.com/~f/Depth-first?i=nPjUn" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~f/Depth-first?a=lJfRn"&gt;&lt;img src="http://feeds.feedburner.com/~f/Depth-first?i=lJfRn" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;</description>
      <pubDate>Mon, 17 Nov 2008 19:17:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:a8551af2-31ff-48b0-8b1b-dbd0382d21a9</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2008/11/17/substructure-search-from-scratch-in-java-part-1-the-atom-mapping-problem</link>
      <category>Tools</category>
      <category>vf</category>
      <category>cheminformatics</category>
      <category>java</category>
      <category>mapping</category>
      <category>substructuresearch</category>
      <category>substructure</category>
    </item>
    <item>
      <title>One of These Things is Not Like The Others</title>
      <description>&lt;p&gt;&lt;a href="http://flickr.com/photos/santarosa/261923723/"&gt;&lt;img src="http://depth-first.com/demo/20081112/fractal.jpg" align="right"&gt;&lt;/img&gt;&lt;/a&gt;You can't get very far in cheminformatics without the ability to compare one molecule to another to find either an exact structure or substructure match. For example, if you want to build &lt;a href="http://depth-first.com/articles/2007/01/24/thirty-two-free-chemistry-databases"&gt;chemical databases&lt;/a&gt;, a good substructure matcher &lt;a href="http://depth-first.com/articles/2008/10/02/fast-substructure-search-using-open-source-tools-part-1-fingerprints-and-databases"&gt;comes in very handy&lt;/a&gt;. As luck would have it, the substructure match problem (a variant of the &lt;a href="http://en.wikipedia.org/wiki/Subgraph_isomorphism_problem"&gt;subgraph isomorphism problem&lt;/a&gt;) is both &lt;a href="http://en.wikipedia.org/wiki/NP-complete"&gt;computationally expensive&lt;/a&gt; and difficult implement. This article discusses one approach to the problem.&lt;/p&gt;

&lt;h4&gt;Background&lt;/h4&gt;

&lt;p&gt;Recently, &lt;a href="http://rguha.wordpress.com/"&gt;Rajarshi Guha&lt;/a&gt; described some &lt;a href="http://rguha.wordpress.com/2008/09/19/faster-substructure-search-in-the-cdk"&gt;benchmarking studies&lt;/a&gt; suggesting that it was possible to greatly improve the speed of the &lt;a href="http:/cdk.sf.net"&gt;Chemistry Development Kit&lt;/a&gt; (CDK) substructure matching code. His code employed the widely-used &lt;a href="http://portal.acm.org/citation.cfm?id=321925"&gt;Ullmann algorithm&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There's just one problem: the Ullmann algorithm detects edge-induced isomorphisms. This means, for example, that if your query molecule is propane and your test molecule is cyclopropane, you won't find a match with an Ullmann-backed tool. I'm still not sure if it's possible to modify an Ullmann implementation to make its matches node-induced. Based on the implementations I've seen, the answer appears to be "no."&lt;/p&gt;

&lt;p&gt;For substructure matching, we need an atom-induced isomorphism algorithm.&lt;/p&gt;
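
&lt;p&gt;To make the node-induced versus edge-induced distinction concrete, here's a minimal sketch that checks one atom mapping under both definitions. The class and method names are hypothetical (they come from neither the CDK nor VFLib), atom and bond types are ignored, and molecules are reduced to plain adjacency matrices.&lt;/p&gt;

```java
// Hypothetical helper for illustration only; not part of the CDK or VFLib.
final class InducedCheck {
    // Adjacency matrices: propane is a three-atom chain, cyclopropane a three-atom ring.
    static final boolean[][] PROPANE = {
        {false, true, false},
        {true, false, true},
        {false, true, false}
    };
    static final boolean[][] CYCLOPROPANE = {
        {false, true, true},
        {true, false, true},
        {true, true, false}
    };

    /** Edge-induced: every query bond must land on a target bond; extra target bonds are allowed. */
    static boolean isEdgeInducedMatch(boolean[][] query, boolean[][] target, int[] map) {
        for (int i = 0; i != query.length; i++) {
            for (int j = 0; j != query.length; j++) {
                if (query[i][j]) {
                    if (!target[map[i]][map[j]]) return false;
                }
            }
        }
        return true;
    }

    /** Node-induced: bonds and non-bonds must both be preserved between the mapped atoms. */
    static boolean isNodeInducedMatch(boolean[][] query, boolean[][] target, int[] map) {
        for (int i = 0; i != query.length; i++) {
            for (int j = 0; j != query.length; j++) {
                if (query[i][j] != target[map[i]][map[j]]) return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] map = {0, 1, 2}; // query atom i is mapped onto target atom map[i]
        System.out.println("edge-induced: " + isEdgeInducedMatch(PROPANE, CYCLOPROPANE, map));
        System.out.println("node-induced: " + isNodeInducedMatch(PROPANE, CYCLOPROPANE, map));
    }
}
```

&lt;p&gt;For the propane-onto-cyclopropane mapping, the edge-induced check passes and the node-induced check fails. Because all three ring atoms are equivalent, any other mapping gives the same answer.&lt;/p&gt;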

&lt;h4&gt;What's Wrong with Existing Implementations?&lt;/h4&gt;

&lt;p&gt;To begin with, it must be pointed out that working isomorphism code is valuable and hard-won.&lt;/p&gt;

&lt;p&gt;Having said that, many Java implementations are written in a way that makes optimization difficult at best. Some start out as C code that then gets ported, mostly verbatim. Others are written with an understandable emphasis on speed over readability. For developers used to working with classes, objects, shallow loops, and short methods with expressive names, the impedance mismatch can be jarring to say the least.&lt;/p&gt;

&lt;p&gt;Here's an example, taken from the CDK, that, while functional, would take a great deal of time to understand well enough to change:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_java "&gt;public static List makeAtomsMapOfBondsMap(List l, IAtomContainer g1, IAtomContainer g2) {
  if(l==null)
    return(l);
  List result = new ArrayList();
  for (int i = 0; i &amp;lt; l.size(); i++) {
    IBond bond1 = g1.getBond(((RMap) l.get(i)).getId1());
    IBond bond2 = g2.getBond(((RMap) l.get(i)).getId2());
    IAtom[] atom1 = BondManipulator.getAtomArray(bond1);
    IAtom[] atom2 = BondManipulator.getAtomArray(bond2);
    for (int j = 0; j &amp;lt; 2; j++) {
      List bondsConnectedToAtom1j = g1.getConnectedBondsList(atom1[j]);
      for (int k = 0; k &amp;lt; bondsConnectedToAtom1j.size(); k++) {
        if (bondsConnectedToAtom1j.get(k) != bond1) {
          IBond testBond = (IBond)bondsConnectedToAtom1j.get(k);
            for (int m = 0; m &amp;lt; l.size(); m++) {
              IBond testBond2;
              if (((RMap) l.get(m)).getId1() == g1.getBondNumber(testBond)) {
                testBond2 = g2.getBond(((RMap) l.get(m)).getId2());
                for (int n = 0; n &amp;lt; 2; n++) {
                  List bondsToTest = g2.getConnectedBondsList(atom2[n]);
                  if (bondsToTest.contains(testBond2)) {
                    RMap map;
                    if (j == n) {
                      map = new RMap(g1.getAtomNumber(atom1[0]), g2.getAtomNumber(atom2[0]));
                    } else {
                      map = new RMap(g1.getAtomNumber(atom1[1]), g2.getAtomNumber(atom2[0]));
                    }
                    if (!result.contains(map)) {
                      result.add(map);
                    }
                    RMap map2;
                    if (j == n) {
                      map2 = new RMap(g1.getAtomNumber(atom1[1]), g2.getAtomNumber(atom2[1]));
                    } else {
                      map2 = new RMap(g1.getAtomNumber(atom1[0]), g2.getAtomNumber(atom2[1]));
                    }
                    if (!result.contains(map2)) {
                      result.add(map2);
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  return (result);
}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt; 

&lt;h4&gt;VFLib&lt;/h4&gt;

&lt;p&gt;Rajarshi's implementation of substructure search was based on a Java port of the &lt;a href="http://amalfi.dis.unina.it/graph/db/vflib-2.0/doc/vflib.html"&gt;VFLib&lt;/a&gt; C++ library. VFLib was developed by an Italian group to compare the performance of the VF algorithm with that of Ullmann.&lt;/p&gt;

&lt;p&gt;VFLib defines a single interface (&lt;a href="http://amalfi.dis.unina.it/graph/db/vflib-2.0/doc/vflib-7.html"&gt;State&lt;/a&gt;) that a variety of subgraph isomorphism matchers can implement in order to work interchangeably.&lt;/p&gt;

&lt;p&gt;What makes this so interesting is that when you can boil a software problem down to implementing an interface, it can become orders of magnitude simpler. But more on that later.&lt;/p&gt;

&lt;p&gt;Another interesting aspect of VFLib is that the code can be easily converted from a node-induced implementation to an edge-induced implementation. In other words, if we had a Java port of the VFLib2 code, we could begin to build families of Java-based substructure matchers that could be easily compared and optimized.&lt;/p&gt;

&lt;h4&gt;The View from 10,000 Feet&lt;/h4&gt;

&lt;p&gt;One of the difficult aspects of implementing subgraph isomorphism algorithms is dividing the process up into understandable chunks. One way forward might be to look for commonalities among all of the approaches currently used. What might those be? Here are some possibilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Recursion.&lt;/strong&gt; At the heart of any implementation typically lives a method that repeatedly calls itself (without creating a stack overflow).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Gradual accumulation of state.&lt;/strong&gt; What's that recursive method doing? Building up a map of the atoms from a query structure to a target structure, one pair of atoms at a time. Sometimes it fails and needs to go back to the last successful match. Sometimes it succeeds and needs to report that information to its caller. At every stage, the accumulated state must be sufficient to finish the mapping attempt.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Mapping comes for free.&lt;/strong&gt; The implementation typically uses an internal map to keep track of what it's done, so getting one mapping (or more) of the query structure onto the target tends to be as easy as simply detecting that a match exists.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Optimization heuristics.&lt;/strong&gt; Where to begin, what order to compare structural features, and what features should be compared anyway? The possibilities for taking advantage of simple optimization rules are significant. It should, therefore, be easy to run many implementations side-by-side in performance tests.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A &lt;a href="http://amalfi.dis.unina.it/graph/db/papers/vf-algorithm.pdf"&gt;paper&lt;/a&gt; describing the VF algorithm and the way VFLib implements it is freely available.&lt;/p&gt;

&lt;p&gt;In it, a high-level overview of the VF algorithm is presented:&lt;/p&gt;

&lt;pre&gt;
PROCEDURE Match(s)
  INPUT: an intermediate state s; the initial state s0 has M(s0)=∅
  OUTPUT: the mappings between the two graphs
  IF M(s) covers all the nodes of G2 THEN
    OUTPUT M(s)
  ELSE
    Compute the set P(s) of the pairs candidate for inclusion in M(s)
    FOREACH (n, m)∈ P(s)
      IF F(s, n, m) THEN
        Compute the state s' obtained by adding (n, m) to M(s)
        CALL Match(s')
      END IF
    END FOREACH
    Restore data structures
  END IF
END PROCEDURE
&lt;/pre&gt;

&lt;p&gt;The &lt;tt&gt;Match(s)&lt;/tt&gt; procedure plays the role of the recursive function, while &lt;tt&gt;s&lt;/tt&gt; and &lt;tt&gt;s'&lt;/tt&gt; play the dual roles of state accumulators and feature comparators.&lt;/p&gt;
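
&lt;p&gt;As a rough illustration of how the pseudocode might translate to Java, here's a minimal sketch. It is not a port of VFLib: the class and method names are made up, G2 is taken to be the query, the candidate set P(s) is simply "the next unmapped query atom paired with every unused target atom," and the feasibility test F is a bare edge-induced bond check with none of VF's look-ahead pruning. Atom and bond types are ignored.&lt;/p&gt;

```java
import java.util.Arrays;

// Hypothetical sketch of the Match(s) procedure; names are illustrative, not VFLib's.
final class MatchSketch {
    private final boolean[][] query;   // adjacency matrix of the query (G2 in the pseudocode)
    private final boolean[][] target;  // adjacency matrix of the target
    private final int[] mapping;       // M(s): mapping[q] is the target atom paired with query atom q, or -1
    private final boolean[] used;      // target atoms already present in M(s)

    MatchSketch(boolean[][] query, boolean[][] target) {
        this.query = query;
        this.target = target;
        this.mapping = new int[query.length];
        this.used = new boolean[target.length];
        Arrays.fill(mapping, -1);
    }

    /** Match(s), where depth is the number of pairs already in M(s). Returns the first full mapping, or null. */
    int[] match(int depth) {
        if (depth == query.length) {
            return mapping.clone();                 // M(s) covers every query atom
        }
        for (int t = 0; t != target.length; t++) {  // P(s): candidate pairs (depth, t)
            if (used[t] || !feasible(depth, t)) continue;
            mapping[depth] = t;                     // build s' by adding the pair to M(s)
            used[t] = true;
            int[] result = match(depth + 1);
            mapping[depth] = -1;                    // restore the data structures
            used[t] = false;
            if (result != null) return result;
        }
        return null;                                // dead end: the caller tries its next candidate
    }

    /** F(s, n, m): each bond from query atom q to an already-mapped query atom needs a target counterpart. */
    private boolean feasible(int q, int t) {
        for (int p = 0; p != q; p++) {
            if (query[q][p]) {
                if (!target[t][mapping[p]]) return false;
            }
        }
        return true;
    }
}
```

&lt;p&gt;Calling &lt;tt&gt;new MatchSketch(query, target).match(0)&lt;/tt&gt; starts from the empty state s0. Swapping in a different &lt;tt&gt;feasible&lt;/tt&gt; method or a different candidate ordering is where the optimization heuristics mentioned above would go.&lt;/p&gt;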

&lt;p&gt;VFLib, together with the paper describing it, does a good job of breaking the process up into manageable chunks from which unit tests, interface definitions, and ultimately working code can be created in a variety of languages.&lt;/p&gt;

&lt;h4&gt;Conclusions&lt;/h4&gt;

&lt;p&gt;Substructure matching is one of the most difficult and most useful cheminformatics tasks. Although many Java cheminformatics toolkits support substructure search, their implementations can be difficult to understand, modify, and optimize. VFLib has some interesting features that could help to change that.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Image Credit: &lt;a href="http://flickr.com/photos/santarosa/"&gt;Santa Rosa OLD SKOOL&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;</description>
      <pubDate>Thu, 13 Nov 2008 01:49:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:36046dac-9227-4f69-bc26-1c96d5ed520e</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2008/11/13/one-of-these-things-is-not-like-the-other</link>
      <category>Tools</category>
      <category>vf</category>
      <category>vf2</category>
      <category>vflib</category>
      <category>isomorphism</category>
      <category>subgraph</category>
      <category>substructure</category>
      <category>java</category>
      <category>objectoriented</category>
    </item>
    <item>
      <title>Casual Saturdays: Business Plan</title>
      <description>&lt;p&gt;&lt;a href="http://dilbert.com/strips/comic/2008-10-25/" title="Dilbert.com"&gt;&lt;img src="http://dilbert.com/dyn/str_strip/000000000/00000000/0000000/000000/20000/9000/000/29066/29066.strip.gif" border="0" alt="Dilbert.com" width="500"/&gt;&lt;/a&gt;&lt;/p&gt;</description>
      <pubDate>Sat, 08 Nov 2008 23:13:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:ee1520f4-738d-4809-8339-8be665b69a46</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2008/11/08/casual-saturdays-business-plan</link>
      <category>Meta</category>
      <category>casualsaturdays</category>
    </item>
    <item>
      <title>Building ChemWriter: What to Do When Requesting Applet Keyboard Focus Leads to Disappearing Popup Windows</title>
      <description>&lt;p&gt;&lt;a href="http://metamolecular.com/chemwriter"&gt;&lt;img src="http://metamolecular.com/images/global/chemwriter_small.png" align="right"&gt;&lt;/img&gt;&lt;/a&gt;Recently a customer reported a problem in which mousing over an instance of the &lt;a href="http://metamolecular.com/chemwriter"&gt;ChemWriter&lt;/a&gt; editor applet caused browser popup windows to disappear behind the parent window. Although many view browser popup windows as bad UI design, there are situations in which no alternative exists. This article describes the window focus problem in detail and outlines one solution.&lt;/p&gt;

&lt;h4&gt;Background&lt;/h4&gt;

&lt;p&gt;One of the ways ChemWriter makes chemists more efficient at drawing chemical structures is through the use of &lt;a href="http://metamolecular.com/articles/chemwriter-keyboard"&gt;keyboard shortcuts&lt;/a&gt;. Rather than having to mouse back and forth between a tool palette and drawing canvas to put in atom labels, simply hover the mouse over the atom to label, and press a key on the keyboard. In addition to atom labels, there are keyboard shortcuts for chains (keys 1-9 while hovered over atom), for benzene rings ("a" key), and to edit bond order (1-3 while hovered over bond). &lt;a href="http://depth-first.com/articles/2008/06/18/screencast-drawing-structures-quickly-with-chemwriter"&gt;A short movie&lt;/a&gt; explains the feature in more detail.&lt;/p&gt;

&lt;p&gt;Although quite helpful to users, this feature requires some behind-the-scenes magic. Keyboard focus is one of those topics at the boundary between applet and browser for which very little documentation exists and, not surprisingly, one sees the most variation in platform and browser behavior.&lt;/p&gt;

&lt;p&gt;The approach ChemWriter takes is to wait for a signal that the mouse cursor has entered the canvas area. When this happens, keyboard focus is requested through &lt;tt&gt;Component.requestFocus()&lt;/tt&gt;.&lt;/p&gt;

&lt;h4&gt;Scope of the Problem&lt;/h4&gt;

&lt;p&gt;It turns out that on Windows, &lt;tt&gt;Component.requestFocus()&lt;/tt&gt; also causes the hosting window to be pulled to the top of the windows stack, explaining the behavior described above. On Linux and OS X, this doesn't happen, which is the behavior you'd expect.&lt;/p&gt;

&lt;p&gt;All Windows browsers, except the much maligned Internet Explorer 6, show this behavior. This includes Internet Explorer 7, Firefox 3, and Google Chrome. Internet Explorer 8 beta 2 also shows the behavior, but only once per popup window.&lt;/p&gt;

&lt;p&gt;Minimizing, then maximizing the popup window eliminated the problem some of the time. But a new popup window would still show the behavior.&lt;/p&gt;

&lt;h4&gt;The Solution&lt;/h4&gt;

&lt;p&gt;The root of the problem is that on Windows, keyboard focus is granted to an object regardless of whether the object's hosting browser window is focused. What's needed, therefore, is a way for the applet to implement a window focus check.&lt;/p&gt;

&lt;p&gt;Apparently, nothing in the Applet API itself can solve this problem. The &lt;tt&gt;Applet&lt;/tt&gt;, &lt;tt&gt;JApplet&lt;/tt&gt;, and &lt;tt&gt;AppletContext&lt;/tt&gt; classes only deal with much higher-level considerations. &lt;/p&gt;

&lt;p&gt;However, it's possible to take advantage of support for &lt;a href="https://developer.mozilla.org/En/Core_JavaScript_1.5_Guide:LiveConnect_Overview"&gt;LiveConnect&lt;/a&gt; technology on Windows, which is actually quite good. Using LiveConnect in combination with JavaScript's &lt;tt&gt;document.hasFocus()&lt;/tt&gt; method offers the makings of a solution. For example, the following code can be used as a starting point within a Java applet to determine if the containing browser window is focused:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_java "&gt;netscape.javascript.JSObject js = netscape.javascript.JSObject.getWindow(this);
Object result = js.eval(&amp;quot;document.hasFocus();&amp;quot;);

if (&amp;quot;true&amp;quot;.equals(result.toString()))
{
  requestFocusMethod();
}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Of course, &lt;tt&gt;requestFocusMethod()&lt;/tt&gt; needs to be defined, we need to check for &lt;tt&gt;null&lt;/tt&gt;, and we need to handle exceptions that could arise from a missing &lt;tt&gt;netscape.javascript&lt;/tt&gt; package. You'll also need to ensure that non-Windows browsers such as Safari on OS X never even try to execute the LiveConnect code, because their LiveConnect implementations can be very buggy.&lt;/p&gt;
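
&lt;p&gt;One way to cover those caveats is sketched below. This is not ChemWriter's actual code. The class name, the &lt;tt&gt;fallback&lt;/tt&gt; parameter, and the decision to look up &lt;tt&gt;netscape.javascript.JSObject&lt;/tt&gt; by reflection (so the class still loads when the package is missing) are all assumptions made for illustration. The only LiveConnect calls used are the &lt;tt&gt;getWindow&lt;/tt&gt; and &lt;tt&gt;eval&lt;/tt&gt; methods shown above.&lt;/p&gt;

```java
import java.lang.reflect.Method;
import java.util.Locale;

// Hypothetical helper; a sketch under the assumptions stated above, not ChemWriter's code.
final class WindowFocusCheck {
    private WindowFocusCheck() {}

    /** True when the os.name value looks like a Windows platform. */
    static boolean isWindows(String osName) {
        if (osName == null) return false;
        return osName.toLowerCase(Locale.ROOT).startsWith("windows");
    }

    /** Turns the value returned by eval into a boolean, guarding against null. */
    static boolean interpret(Object result, boolean fallback) {
        if (result == null) return fallback;
        return "true".equals(result.toString());
    }

    /**
     * Reports whether the browser window hosting the applet has focus.
     * Off Windows the LiveConnect code is never touched and true is returned.
     * If LiveConnect is missing or fails, the caller-supplied fallback is returned.
     */
    static boolean browserWindowHasFocus(Object applet, String osName, boolean fallback) {
        if (!isWindows(osName)) return true;
        try {
            Method getWindow = null;
            Method eval = null;
            // Reflection avoids a hard dependency on the netscape.javascript package.
            for (Method m : Class.forName("netscape.javascript.JSObject").getMethods()) {
                if (m.getParameterCount() != 1) continue;
                if (m.getName().equals("getWindow")) getWindow = m;
                if (m.getName().equals("eval")) eval = m;
            }
            if (getWindow == null || eval == null) return fallback;
            Object window = getWindow.invoke(null, applet);
            if (window == null) return fallback;
            return interpret(eval.invoke(window, "document.hasFocus();"), fallback);
        } catch (ReflectiveOperationException | RuntimeException | LinkageError e) {
            return fallback; // missing package, security restriction, or a JavaScript error
        }
    }
}
```

&lt;p&gt;Inside the applet's mouse-entered handler, the call might look like &lt;tt&gt;if (WindowFocusCheck.browserWindowHasFocus(this, System.getProperty("os.name"), false)) requestFocus();&lt;/tt&gt;. Passing &lt;tt&gt;false&lt;/tt&gt; as the fallback means a broken LiveConnect costs you keyboard shortcuts rather than popup windows; whether that's the right trade-off depends on the application.&lt;/p&gt;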

&lt;h4&gt;Issues&lt;/h4&gt;

&lt;p&gt;I've seen mixed signals about the status of LiveConnect in the next major release of the Java plugin. Regardless of the specific way it's implemented, it seems safe to say that Java-JavaScript communication is far too valuable to abandon. The only question is what form support for this feature will take going forward.&lt;/p&gt;

&lt;h4&gt;Conclusions&lt;/h4&gt;

&lt;p&gt;ChemWriter's ability to accept keyboard input is a helpful user interface feature, but one that resulted in window focus issues on Windows. Using LiveConnect in combination with some simple JavaScript in the focus-management code offered an effective solution.&lt;/p&gt;</description>
      <pubDate>Thu, 06 Nov 2008 19:35:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:2c99c2e5-7eeb-4a8f-b578-f892877770d3</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2008/11/06/building-chemwriter-what-to-do-when-requesting-applet-keyboard-focus-leads-to-disappearing-popup-windows</link>
      <category>Tools</category>
      <category>chemwriter</category>
      <category>buildingchemwriter</category>
      <category>windowfocus</category>
      <category>windows</category>
      <category>liveconnect</category>
      <category>applet</category>
      <category>javascript</category>
      <category>java</category>
    </item>
    <item>
      <title>Billions and Billions</title>
      <description>&lt;p&gt;&lt;a href="http://flickr.com/photos/spaceritual/41205851/"&gt;&lt;img src="http://depth-first.com/demo/20081103/stars.jpg" align="right"&gt;&lt;/img&gt;&lt;/a&gt;One of the things that makes organic chemistry such a fascinating and useful subject (not to mention profitable) is the way mind-boggling levels of diversity arise from the application of very simple rules. &lt;/p&gt;

&lt;p&gt;In 1924, the American chemist Eugene Markush was awarded a new kind of patent (&lt;a href="http://www.google.com/patents?id=bE5UAAAAEBAJ&amp;amp;dq=1,506,316"&gt;US 1,506,316&lt;/a&gt;). Rather than claiming manufacturing processes listing specific input materials, Markush claimed processes listing families of compounds as inputs - and pretty large families at that. For example:&lt;/p&gt;

&lt;blockquote&gt;
    &lt;p&gt;The process for the manufacture of dyes which comprises coupling with a halogen-substituted pyrazolone, a diazotized unsulphonated material selected from the group consisting of aniline, homologues of aniline and halogen substituted products of aniline.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The phrase "selected from the group" has since become the signature of a special kind of patent claim called a &lt;a href="http://en.wikipedia.org/wiki/Claim_(patent)#Markush_claim"&gt;Markush claim&lt;/a&gt;. The approach significantly influenced the way chemical intellectual property was created and protected.&lt;/p&gt;

&lt;p&gt;A Markush structure is a special kind of chemical structure in which variables are used in place of specific substituents. It turns out that even the most basic Markush structures can specify very large numbers of specific chemical structures. For example, consider a Markush structure specifying all halogenated naphthalenes:&lt;/p&gt;

&lt;p&gt;&lt;center&gt;&lt;img src="http://depth-first.com/demo/20081103/napthalene.png"&gt;&lt;/img&gt;&lt;/center&gt;&lt;/p&gt;

&lt;p&gt;&lt;center&gt;where X = H, F, Cl, Br, I&lt;/center&gt;&lt;/p&gt;

&lt;p&gt;To a first approximation, we can calculate the number of compounds specified by this structure with the formula:&lt;/p&gt;

&lt;p&gt;5 x 5 x 5 x 5 x 5 x 5 x 5 x 5 = 5&lt;sup&gt;8&lt;/sup&gt; = 390,625&lt;/p&gt;

&lt;p&gt;To get the real number of halogenated naphthalenes, which would be less than 390,625, we'd need to account for symmetry of the enumerated structures.&lt;/p&gt;
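
&lt;p&gt;As a sketch of that symmetry correction, Burnside's lemma gives the count directly: average, over the symmetry operations, the number of substitution patterns each operation leaves unchanged. The code below assumes that only the four operations of the flat naphthalene skeleton matter (the identity, the two in-plane flips, and the 180-degree rotation). The class and method names are made up for illustration.&lt;/p&gt;

```java
// Hypothetical sketch: counts substitution patterns on naphthalene up to symmetry.
final class NaphthaleneCount {
    // Positions 1-8 are stored at indices 0-7. Each row is one symmetry operation,
    // read as "the substituent at index i moves to index row[i]".
    static final int[][] SYMMETRY = {
        {0, 1, 2, 3, 4, 5, 6, 7},  // identity
        {3, 2, 1, 0, 7, 6, 5, 4},  // swaps 1/4, 2/3, 5/8, 6/7
        {7, 6, 5, 4, 3, 2, 1, 0},  // swaps 1/8, 2/7, 3/6, 4/5
        {4, 5, 6, 7, 0, 1, 2, 3}   // swaps 1/5, 2/6, 3/7, 4/8
    };

    /** Burnside's lemma: the average over all operations of choices^(number of cycles). */
    static long countDistinct(int choices) {
        long total = 0;
        for (int[] op : SYMMETRY) {
            total += power(choices, cycles(op));
        }
        return total / SYMMETRY.length;
    }

    /** Number of cycles in a permutation; positions in one cycle must carry the same substituent. */
    static int cycles(int[] op) {
        boolean[] seen = new boolean[op.length];
        int count = 0;
        for (int start = 0; start != op.length; start++) {
            if (seen[start]) continue;
            count++;
            for (int i = start; !seen[i]; i = op[i]) seen[i] = true;
        }
        return count;
    }

    static long power(int base, int exponent) {
        long result = 1;
        for (int i = 0; i != exponent; i++) result *= base;
        return result;
    }

    public static void main(String[] args) {
        // Five choices per position: H, F, Cl, Br, I
        System.out.println(countDistinct(5));
    }
}
```

&lt;p&gt;With five choices per position this works out to (5&lt;sup&gt;8&lt;/sup&gt; + 3 x 5&lt;sup&gt;4&lt;/sup&gt;) / 4 = 98,125 distinct structures, a figure that still includes unsubstituted naphthalene itself.&lt;/p&gt;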

&lt;p&gt;Applying a Markush perspective to cheminformatics leads to some interesting product ideas. For example, a few years ago &lt;a href="http://www.coalesix.com/"&gt;Coalesix&lt;/a&gt; released a computational tool called &lt;a href="http://www.coalesix.com/Our_Vision.html"&gt;Mobius&lt;/a&gt; (&lt;a href="http://www.filamentgroup.com/portfolio/coalesix/"&gt;screenshots&lt;/a&gt;) designed specifically to work with very large families of compounds encoded as Markush structures. And for years, patent databases such as &lt;a href="http://www.cas.org/expertise/cascontent/marpat.html"&gt;MARPAT&lt;/a&gt; have been capable of searching the Markush claims of patents.&lt;/p&gt;

&lt;p&gt;Given the centrality of Markush structures to the theory, practice, and law of modern industrial chemistry, it may be worthwhile to consider ways of incorporating the Markush perspective into existing, or new, cheminformatics products and services.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Image Credit: &lt;a href="http://flickr.com/photos/spaceritual/"&gt;Space Ritual&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;</description>
      <pubDate>Mon, 03 Nov 2008 18:00:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:56ce0ac8-52b2-4be8-bdff-30fc803ad19e</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2008/11/03/billions-and-billions</link>
      <category>Meta</category>
      <category>markush</category>
      <category>patent</category>
      <category>claim</category>
      <category>chemicalstructure</category>
    </item>
    <item>
      <title>A Simple Vector Graphics API for Chemical Structure Output Part 1: In Search of a Simplifying Approach for ChemPhoto</title>
      <description>&lt;p&gt;&lt;a href="http://flickr.com/photos/estherase/2584941208/"&gt;&lt;img src="http://depth-first.com/demo/20081031/layers.jpg" align="right"&gt;&lt;/img&gt;&lt;/a&gt;One of the main design goals of &lt;a href="http://metamolecular.com/chemphoto"&gt;ChemPhoto&lt;/a&gt;, the &lt;a href="http://depth-first.com/articles/2008/09/08/smarter-cheminformatics-from-sd-file-to-image-collection-with-chemphoto"&gt;chemical structure imaging application&lt;/a&gt;, was to support all Web-relevant image output formats, both vector-based and pixel-based. Like most things in software development, there are far more approaches that add complexity to this problem than there are approaches that remove it. And for some reason, the complexity-reducing methods tend to be the last to be considered. This article, the first in a series, will discuss how ChemPhoto simplifies the problem of supporting multiple chemical structure image output formats from a common representation.&lt;/p&gt;

&lt;h4&gt;The Problem in a Nutshell&lt;/h4&gt;

&lt;p&gt;ChemPhoto uses an internal representation of molecular structure based closely on the industry-standard &lt;a href="http://www.mdl.com/downloads/public/ctfile/ctfile.pdf"&gt;MDL molfile format&lt;/a&gt;. Given this representation, ChemPhoto needs to be able to write a variety of vector- and raster-based image formats. Raster formats are fortunately limited to PNG and JPG, which are supported directly by the standard Java library.&lt;/p&gt;

&lt;p&gt;Vector formats, on the other hand, are more diverse and less accessible. Currently, ChemPhoto supports Scalable Vector Graphics (SVG) and Encapsulated PostScript (EPS). Complete support for Adobe Flash (SWF) output is expected soon. Proof of concept for Microsoft's Vector Markup Language (VML) &lt;a href="http://depth-first.com/articles/2008/07/22/vector-markup-language-for-cheminformatics"&gt;has already been demonstrated&lt;/a&gt;. Support for Adobe Acrobat format, through the &lt;a href="http://www.lowagie.com/iText/"&gt;iText library&lt;/a&gt;, is anticipated. Last but not least is Java2D itself for use in Swing components such as &lt;a href="http://metamolecular.com/chemwriter"&gt;renderers and editors&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Clearly, supporting all of these formats requires rendering code that is minimally coupled to the underlying display system. But how to do this in practice?&lt;/p&gt;

&lt;h4&gt;The Batik Approach: Extend Graphics2D&lt;/h4&gt;

&lt;p&gt;&lt;a href="http://xmlgraphics.apache.org/batik/"&gt;Batik&lt;/a&gt; is a widely-used library for creating and processing SVG in Java. At its core is the &lt;a href="http://xmlgraphics.apache.org/batik/using/svg-generator.html"&gt;SVGGraphics2D class&lt;/a&gt;, which extends &lt;a href="http://java.sun.com/j2se/1.4.2/docs/api/java/awt/Graphics2D.html"&gt;Graphics2D&lt;/a&gt;, overriding many of its methods in the process. The idea seems simple enough: create your drawing code using the Java2D API like you normally would. When you want to generate SVG, just pass an instance of &lt;tt&gt;SVGGraphics2D&lt;/tt&gt; and then read out the SVG document using its &lt;tt&gt;stream&lt;/tt&gt; method.&lt;/p&gt;
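
&lt;p&gt;The pattern is easy to see with nothing but the standard library. In the sketch below, which is illustrative and not taken from ChemPhoto, &lt;tt&gt;drawBond&lt;/tt&gt; knows only about &lt;tt&gt;Graphics2D&lt;/tt&gt;. Here the &lt;tt&gt;Graphics2D&lt;/tt&gt; comes from a &lt;tt&gt;BufferedImage&lt;/tt&gt; and the result is encoded as PNG; under the Batik approach, the same method would be handed an &lt;tt&gt;SVGGraphics2D&lt;/tt&gt; instead. The class and method names are hypothetical.&lt;/p&gt;

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.geom.Line2D;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

// Hypothetical sketch: drawing code written once against Graphics2D.
final class BondRenderer {
    private BondRenderer() {}

    /** Drawing code that depends only on the Graphics2D API, not on the output format. */
    static void drawBond(Graphics2D g, double x1, double y1, double x2, double y2) {
        g.setColor(Color.BLACK);
        g.draw(new Line2D.Double(x1, y1, x2, y2));
    }

    /** Raster path: a BufferedImage supplies the Graphics2D, and ImageIO encodes the pixels as PNG. */
    static byte[] toPng(int width, int height) {
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = image.createGraphics();
        try {
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, width, height);                // white background
            drawBond(g, 10, 10, width - 10, height - 10);   // one diagonal "bond"
        } finally {
            g.dispose();
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            ImageIO.write(image, "png", out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

&lt;p&gt;The appeal is that &lt;tt&gt;drawBond&lt;/tt&gt; never changes; only the object passed in as &lt;tt&gt;g&lt;/tt&gt; does.&lt;/p&gt;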

&lt;p&gt;The problem with this approach is that every new image output format to be supported needs to extend Graphics2D and essentially re-implement most of its methods. Graphics2D is a large and complex class with many associated helper classes. Just knowing when you've completely covered the API is a major challenge, aside from the even bigger challenge of implementing the overridden methods.&lt;/p&gt;

&lt;p&gt;Fine, you might say, given that so many SVG interconverters exist, why not just use SVG (created by Batik) as the universal interconversion format and get a third-party library to convert SVG into other vector formats?&lt;/p&gt;

&lt;p&gt;This approach is appealing in principle, but fails in practice. Many SVG implementations are partial at best - and many lack the documentation that would warn that a problem might exist with the exact form of SVG you're using. For example, in an early iteration of ChemPhoto, Batik was used to create SVG from some representative chemical structures. Unfortunately, the way Batik represented path data was not fully interpreted by any of the SVG-&gt;SWF converters that were examined. The result was bumpy instead of smooth curves for atom labels, and other unacceptable abnormalities.&lt;/p&gt;

&lt;p&gt;Finally, after spending some time reading J. David Eisenberg's &lt;a href="http://oreilly.com/catalog/9780596002237/toc.html"&gt;excellent book about SVG&lt;/a&gt;, it became clear that for drawing 2D chemical structures and even reactions and reaction schemes, only a fraction of the SVG specification was relevant.&lt;/p&gt;

&lt;p&gt;In this case, Batik, and its approach of extending Graphics2D, was simply overkill that made the problem more complex than it needed to be.&lt;/p&gt;

&lt;h4&gt;A Better Approach: Create a Custom Vector Graphics Interface&lt;/h4&gt;

&lt;p&gt;Batik has the right idea: isolate drawing code from the specific format being generated. The problem is that the Graphics2D class wasn't really designed for this purpose. For one thing, it's an abstract class that inherits from another abstract class (Graphics) rather than a small interface. And as mentioned before, Graphics2D is a very complex class with many dependencies.&lt;/p&gt;

&lt;p&gt;How can we create a simple vector graphics API tailored to chemical structure image creation, which is easily re-implemented, and which works with the existing Java2D API?&lt;/p&gt;

&lt;p&gt;Part 2 of this series will describe one approach.&lt;/p&gt;

&lt;h4&gt;Conclusions&lt;/h4&gt;

&lt;p&gt;Creating the ChemPhoto rendering engine has been an evolutionary process. It started with the idea of directly using the Graphics2D class in rendering code, but has since moved on to the definition of a vector graphics abstraction layer to simplify the addition of new image formats.&lt;/p&gt;

&lt;p&gt;I'd like to thank those beta testers who have already offered valuable feedback on ChemPhoto. If you'd like an unlimited 30-day trial for yourself, please &lt;a href="http://mailhide.recaptcha.net/d?k=01R9bxyP6XNdc0duoUCzBBHA==&amp;amp;c=vZ7R0VDctRzIRzbSs1-LZwDzjTjAnfCS4KONqGHxY9I=" onclick="window.open('http://mailhide.recaptcha.net/d?k=01R9bxyP6XNdc0duoUCzBBHA==&amp;amp;c=vZ7R0VDctRzIRzbSs1-LZwDzjTjAnfCS4KONqGHxY9I=', '', 'toolbar=0,scrollbars=0,location=0,statusbar=0,menubar=0,resizable=0,width=500,height=300'); return false;" title="Reveal this e-mail address"&gt;drop me a line.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Image Credit: &lt;a href="http://flickr.com/photos/estherase/"&gt;estherase&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;</description>
      <pubDate>Fri, 31 Oct 2008 18:25:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:ca61daeb-1523-4f0e-a4f6-f95a9d69f14e</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2008/10/31/a-simple-vector-graphics-api-for-chemical-structure-output-part-1-in-search-of-a-simplifying-approach-for-chemphoto</link>
      <category>Tools</category>
      <category>chemphoto</category>
      <category>chemicalstructureimaging</category>
      <category>vectorgraphics</category>
      <category>pdf</category>
      <category>svg</category>
      <category>vml</category>
      <category>eps</category>
      <category>png</category>
      <category>jpg</category>
    </item>
    <item>
      <title>Fast Substructure Search Using Open Source Tools Part 6: Modelling a One-To-Many Relationship Between Fingerprints and Compounds in Ruby</title>
      <description>&lt;p&gt;&lt;a href="http://flickr.com/photos/leechypics/505513640/"&gt;&lt;img src="http://depth-first.com/demo/20081029/fingerprint.jpg" align="right"&gt;&lt;/img&gt;&lt;/a&gt;We can think of a fingerprint as a bucket into which every molecule in the universe can be reproducibly placed. Each molecule will belong to a single bucket, but each bucket may contain any number of molecules. In other words, there exists a one-to-many relationship between a fingerprint and its associated molecules. The &lt;a href="http://depth-first.com/articles/2008/10/21/fast-substructure-search-using-open-source-tools-part-5-relating-molecules-to-fingerprints-with-sql"&gt;previous article in this series&lt;/a&gt; discussed how to model this relationship using SQL. This article will take the idea one step further by describing one way to model this relationship in Ruby.&lt;/p&gt;

&lt;p&gt;All Articles in this Series:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://depth-first.com/articles/2008/10/02/fast-substructure-search-using-open-source-tools-part-1-fingerprints-and-databases"&gt;Part 1: Fingerprints and Databases&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://depth-first.com/articles/2008/10/03/fast-substructure-search-using-open-source-tools-part-2-fingerprint-screen-with-sql"&gt;Part 2: Fingerprint Screen With SQL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://depth-first.com/articles/2008/10/06/fast-substructure-search-using-open-source-tools-part-3-a-crud-api-for-fingerprints-in-ruby"&gt;Part 3: A CRUD API for Fingerprints in Ruby&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://depth-first.com/articles/2008/10/15/fast-substructure-search-using-open-source-tools-part-4-creating-fingerprints-from-chemical-structures"&gt;Part 4: Creating Fingerprints from Chemical Structures&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://depth-first.com/articles/2008/10/21/fast-substructure-search-using-open-source-tools-part-5-relating-molecules-to-fingerprints-with-sql"&gt;Part 5: Relating Molecules to Fingerprints with SQL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Part 6: Modelling a One-To-Many Relationship Between Fingerprints and Compounds in Ruby&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;SQL Recap&lt;/h4&gt;

&lt;p&gt;So far, we've set up a &lt;tt&gt;fingerprints&lt;/tt&gt; table:&lt;/p&gt;

&lt;div class="console"&gt;
&lt;pre&gt;
mysql&gt; describe fingerprints;
+--------+---------------------+------+-----+---------+----------------+
| Field  | Type                | Null | Key | Default | Extra          |
+--------+---------------------+------+-----+---------+----------------+
| id     | int(11)             | NO   | PRI | NULL    | auto_increment | 
| byte0  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte1  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte2  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte3  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte4  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte5  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte6  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte7  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte8  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte9  | bigint(64) unsigned | YES  |     | 0       |                | 
| byte10 | bigint(64) unsigned | YES  |     | 0       |                | 
| byte11 | bigint(64) unsigned | YES  |     | 0       |                | 
| byte12 | bigint(64) unsigned | YES  |     | 0       |                | 
| byte13 | bigint(64) unsigned | YES  |     | 0       |                | 
| byte14 | bigint(64) unsigned | YES  |     | 0       |                | 
| byte15 | bigint(64) unsigned | YES  |     | 0       |                | 
+--------+---------------------+------+-----+---------+----------------+
17 rows in set (0.00 sec)
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;This table contains a single (empty) fingerprint:&lt;/p&gt;

&lt;div class="console"&gt;
&lt;pre&gt;
mysql&gt; select * from fingerprints;
+----+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+--------+--------+--------+--------+
| id | byte0 | byte1 | byte2 | byte3 | byte4 | byte5 | byte6 | byte7 | byte8 | byte9 | byte10 | byte11 | byte12 | byte13 | byte14 | byte15 |
+----+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+--------+--------+--------+--------+
|  1 |     0 |     0 |     0 |     0 |     0 |     0 |     0 |     0 |     0 |     0 |      0 |      0 |      0 |      0 |      0 |      0 | 
+----+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+--------+--------+--------+--------+
1 row in set (0.00 sec)
&lt;/pre&gt;
&lt;/div&gt;
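
&lt;p&gt;Although the columns are named &lt;tt&gt;byte0&lt;/tt&gt; through &lt;tt&gt;byte15&lt;/tt&gt;, each one is an unsigned 64-bit integer, so a row has room for 16 x 64 = 1024 fingerprint bits. As a rough sketch of how a fingerprint held as one large Ruby integer might be split across those columns and put back together, consider the following. The method names &lt;tt&gt;split_words&lt;/tt&gt; and &lt;tt&gt;join_words&lt;/tt&gt;, and the choice to store the lowest bits in &lt;tt&gt;byte0&lt;/tt&gt;, are assumptions made for illustration and are not taken from the series code.&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;
```ruby
WORD_BITS  = 64             # width of one bigint unsigned column
WORD_COUNT = 16             # columns byte0 .. byte15
WORD_SIZE  = 2**WORD_BITS   # number of distinct values in one column

# Split a fingerprint (a non-negative Integer) into 16 unsigned 64-bit words.
# Assumption: word 0 (column byte0) holds the lowest 64 bits.
def split_words(fingerprint)
  (0...WORD_COUNT).map { |i| (fingerprint >> (WORD_BITS * i)) % WORD_SIZE }
end

# Reassemble the original Integer from its 16 words.
def join_words(words)
  words.each_with_index.inject(0) { |sum, (word, i)| sum + word * (WORD_SIZE**i) }
end

fingerprint = 2**0 + 2**70   # bits 0 and 70 set
words = split_words(fingerprint)
puts words[0]                            # 1
puts words[1]                            # 64 (bit 70 is bit 6 of the second word)
puts join_words(words) == fingerprint    # true
```
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The empty fingerprint shown above corresponds to the integer 0, for which every word is 0.&lt;/p&gt;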

&lt;p&gt;We've also set up a &lt;tt&gt;compounds&lt;/tt&gt; table containing a foreign key (&lt;tt&gt;fingerprint_id&lt;/tt&gt;) into the &lt;tt&gt;fingerprints&lt;/tt&gt; table:&lt;/p&gt;

&lt;div class="console"&gt;
&lt;pre&gt;
mysql&gt; describe compounds;
+----------------+---------+------+-----+---------+----------------+
| Field          | Type    | Null | Key | Default | Extra          |
+----------------+---------+------+-----+---------+----------------+
| id             | int(11) | NO   | PRI | NULL    | auto_increment | 
| fingerprint_id | int(11) | YES  |     | NULL    |                | 
| smiles         | text    | YES  |     | NULL    |                | 
+----------------+---------+------+-----+---------+----------------+
3 rows in set (0.00 sec)
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;In this hypothetical example, the &lt;tt&gt;compounds&lt;/tt&gt; table is populated with two molecules, benzene and bromobenzene, which share the same fingerprint:&lt;/p&gt;

&lt;div class="console"&gt;
&lt;pre&gt;
mysql&gt; select * from compounds;
+----+----------------+------------+
| id | fingerprint_id | smiles     |
+----+----------------+------------+
|  1 |              1 | c1ccccc1   | 
|  2 |              1 | c1ccccc1Br | 
+----+----------------+------------+
2 rows in set (0.00 sec)
&lt;/pre&gt;
&lt;/div&gt;

&lt;h4&gt;Adding the Ruby Layer&lt;/h4&gt;

&lt;p&gt;In &lt;a href="http://depth-first.com/articles/2008/10/06/fast-substructure-search-using-open-source-tools-part-3-a-crud-api-for-fingerprints-in-ruby"&gt;Part 3&lt;/a&gt;, we created a CRUD API for fingerprints in Ruby. We now need to modify the class we created there, Fingerprint, to make it aware of the Compounds it will be associated with.&lt;/p&gt;

&lt;p&gt;The full listing is omitted here for brevity, but you can &lt;a href="http://depth-first.com/demo/20081029/fingerprint.rb"&gt;view the updated Fingerprint class here&lt;/a&gt;. The main change is a single added line that tells &lt;tt&gt;Fingerprint&lt;/tt&gt; it's now associated with a class called &lt;tt&gt;Compound&lt;/tt&gt;:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_ruby "&gt;  &lt;span class="ident"&gt;has_many&lt;/span&gt; &lt;span class="symbol"&gt;:compounds&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;All that remains is to bring the &lt;tt&gt;Compound&lt;/tt&gt; class into being:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_ruby "&gt;&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;rubygems&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;
&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;active_record&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;
&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;fingerprint&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;

&lt;span class="constant"&gt;ActiveRecord&lt;/span&gt;&lt;span class="punct"&gt;::&lt;/span&gt;&lt;span class="constant"&gt;Base&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;establish_connection&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;
  &lt;span class="symbol"&gt;:adapter&lt;/span&gt;    &lt;span class="punct"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;mysql&lt;/span&gt;&lt;span class="punct"&gt;',&lt;/span&gt;
  &lt;span class="symbol"&gt;:host&lt;/span&gt;       &lt;span class="punct"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;localhost&lt;/span&gt;&lt;span class="punct"&gt;',&lt;/span&gt;
  &lt;span class="symbol"&gt;:username&lt;/span&gt;   &lt;span class="punct"&gt;=&amp;gt;&lt;/span&gt;  &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;root&lt;/span&gt;&lt;span class="punct"&gt;',&lt;/span&gt;
  &lt;span class="symbol"&gt;:password&lt;/span&gt;   &lt;span class="punct"&gt;=&amp;gt;&lt;/span&gt;  &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;&lt;/span&gt;&lt;span class="punct"&gt;',&lt;/span&gt;
  &lt;span class="symbol"&gt;:database&lt;/span&gt;   &lt;span class="punct"&gt;=&amp;gt;&lt;/span&gt;  &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;compounds&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;
&lt;span class="punct"&gt;)&lt;/span&gt;

&lt;span class="keyword"&gt;class &lt;/span&gt;&lt;span class="class"&gt;Compound&lt;/span&gt; &lt;span class="punct"&gt;&amp;lt;&lt;/span&gt; &lt;span class="constant"&gt;ActiveRecord&lt;/span&gt;&lt;span class="punct"&gt;::&lt;/span&gt;&lt;span class="constant"&gt;Base&lt;/span&gt;
  &lt;span class="ident"&gt;belongs_to&lt;/span&gt; &lt;span class="symbol"&gt;:fingerprint&lt;/span&gt;
&lt;span class="keyword"&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;tt&gt;belongs_to&lt;/tt&gt; line is the counterpart to the &lt;tt&gt;has_many&lt;/tt&gt; line in &lt;tt&gt;Fingerprint&lt;/tt&gt;. Together, the two declarations create a system in which each &lt;tt&gt;Fingerprint&lt;/tt&gt; can reference multiple &lt;tt&gt;Compounds&lt;/tt&gt; and each &lt;tt&gt;Compound&lt;/tt&gt; references one &lt;tt&gt;Fingerprint&lt;/tt&gt;.&lt;/p&gt;
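
&lt;p&gt;To make the direction of each lookup concrete, here is a plain-Ruby sketch that mimics the two associations with in-memory rows instead of ActiveRecord. It is only an analogy: the constants and the methods &lt;tt&gt;compounds_for&lt;/tt&gt; and &lt;tt&gt;fingerprint_for&lt;/tt&gt; are hypothetical names, and ActiveRecord itself issues SQL against the &lt;tt&gt;fingerprint_id&lt;/tt&gt; column rather than scanning arrays.&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;
```ruby
# In-memory stand-ins for the two tables shown in the SQL recap.
FINGERPRINTS = [{ :id => 1 }]
COMPOUNDS = [
  { :id => 1, :fingerprint_id => 1, :smiles => 'c1ccccc1' },
  { :id => 2, :fingerprint_id => 1, :smiles => 'c1ccccc1Br' }
]

# Analogue of has_many :compounds: every row whose foreign key
# points at the given fingerprint.
def compounds_for(fingerprint)
  COMPOUNDS.select { |row| row[:fingerprint_id] == fingerprint[:id] }
end

# Analogue of belongs_to :fingerprint: the single row whose id
# equals the compound's foreign key.
def fingerprint_for(compound)
  FINGERPRINTS.find { |row| row[:id] == compound[:fingerprint_id] }
end

fingerprint = FINGERPRINTS.first
puts compounds_for(fingerprint).map { |row| row[:smiles] }.inspect
# ["c1ccccc1", "c1ccccc1Br"]
puts fingerprint_for(COMPOUNDS.last)[:id]
# 1
```
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Note that only the "many" side carries the foreign key. The &lt;tt&gt;fingerprints&lt;/tt&gt; table needs no extra column to support the relationship.&lt;/p&gt;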

&lt;p&gt;Let's test this with interactive Ruby:&lt;/p&gt;

&lt;div class="console"&gt;
&lt;pre&gt;
$ irb
irb(main):001:0&amp;gt; require 'fingerprint'
=&amp;gt; true
irb(main):002:0&amp;gt; f=Fingerprint.find 1
=&amp;gt; #&amp;lt;Fingerprint id: 1, byte0: 0, byte1: 0, byte2: 0, byte3: 0, byte4: 0, byte5: 0, byte6: 0, byte7: 0, byte8: 0, byte9: 0, byte10: 0, byte11: 0, byte12: 0, byte13: 0, byte14: 0, byte15: 0&amp;gt;
irb(main):003:0&amp;gt; f.compounds
=&amp;gt; [#&amp;lt;Compound id: 1, fingerprint_id: 1, smiles: "c1ccccc1"&amp;gt;, #&amp;lt;Compound id: 2, fingerprint_id: 1, smiles: "c1ccccc1Br"&amp;gt;]
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Looks good. Our code has made the correct association between a &lt;tt&gt;Fingerprint&lt;/tt&gt; and its &lt;tt&gt;Compounds&lt;/tt&gt;. What about the other way around?&lt;/p&gt;

&lt;div class="console"&gt;
&lt;pre&gt;
$ irb
irb(main):001:0&amp;gt; require 'compound'
=&amp;gt; true
irb(main):002:0&amp;gt; c=Compound.find 1
=&amp;gt; #&amp;lt;Compound id: 1, fingerprint_id: 1, smiles: "c1ccccc1"&amp;gt;
irb(main):003:0&amp;gt; c.fingerprint
=&amp;gt; #&amp;lt;Fingerprint id: 1, byte0: 0, byte1: 0, byte2: 0, byte3: 0, byte4: 0, byte5: 0, byte6: 0, byte7: 0, byte8: 0, byte9: 0, byte10: 0, byte11: 0, byte12: 0, byte13: 0, byte14: 0, byte15: 0&amp;gt;
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;As expected, the first &lt;tt&gt;Compound&lt;/tt&gt; is associated with the correct &lt;tt&gt;Fingerprint&lt;/tt&gt;.&lt;/p&gt;

&lt;h4&gt;Conclusions&lt;/h4&gt;

&lt;p&gt;Our system can now store and query molecular fingerprints in a relational database. It also associates multiple compounds with each fingerprint.&lt;/p&gt;

&lt;p&gt;We have a complete fingerprint screening system, but not a substructure search system.&lt;/p&gt;

&lt;p&gt;What's missing? For one thing, we'd need a way to perform atom-by-atom searches (ABAS) of all candidate structures after the fingerprint screening process is complete. Recall that just because a query fingerprint matches a candidate fingerprint doesn't necessarily mean that a substructure match has been found.&lt;/p&gt;
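
&lt;p&gt;The two-stage flow described above can be sketched in a few lines of plain Ruby. The screen keeps every candidate whose fingerprint contains all of the query's bits, and a second step then confirms each survivor. Everything here is illustrative: the method names and the four-bit fingerprints are made up, and the matcher is passed in as a block because no ABAS implementation has been chosen yet in this series.&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;
```ruby
# True when every bit set in the query is also set in the candidate.
# Written with | so it reads as "adding the query bits changes nothing".
def passes_screen?(query_bits, candidate_bits)
  (query_bits | candidate_bits) == candidate_bits
end

# Stage 1: fingerprint screen. Stage 2: confirm each survivor with the
# supplied block, which stands in for a real atom-by-atom search.
def substructure_search(query_bits, candidates)
  screened = candidates.select { |c| passes_screen?(query_bits, c[:bits]) }
  screened.select { |c| yield c }
end

# Made-up fingerprints, chosen only to exercise both stages.
candidates = [
  { :smiles => 'c1ccccc1',   :bits => 0b0011 },
  { :smiles => 'c1ccccc1Br', :bits => 0b0111 },
  { :smiles => 'CCO',        :bits => 0b1000 }
]

# Toy matcher: a plain string test, NOT a chemically valid substructure match.
hits = substructure_search(0b0011, candidates) { |c| c[:smiles].include?('Br') }
puts hits.map { |c| c[:smiles] }.inspect
# ["c1ccccc1Br"]
```
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In this toy run, benzene passes the screen but is rejected by the second stage, which is exactly the kind of false positive that ABAS exists to remove.&lt;/p&gt;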

&lt;p&gt;We'd also need a way to conveniently get real compounds with real fingerprints into our database. Only then would we be able to test the chemical validity of substructure queries.&lt;/p&gt;

&lt;p&gt;The remaining articles in this series will discuss approaches to each of these requirements.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Image Credit: &lt;a href="http://flickr.com/photos/leechypics/"&gt;leeechy&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;</description>
      <pubDate>Wed, 29 Oct 2008 17:15:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:442a864c-12bc-4ba5-a6f6-f9ca95180215</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2008/10/29/fast-substructure-search-using-open-source-tools-part-6-modelling-a-one-to-many-relationship-between-fingerprints-and-compounds-in-ruby</link>
      <category>Tools</category>
      <category>ruby</category>
      <category>substructuresearch</category>
      <category>fingerprint</category>
      <category>chemicaldatabase</category>
      <category>sql</category>
      <category>onetomany</category>
      <category>mysql</category>
    </item>
  </channel>
</rss>
