<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="/stylesheets/rss.css"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/">
  <channel>
    <title>Depth-First: Tag mashup</title>
    <link>http://depth-first.com/articles/tag/mashup</link>
    <language>en-us</language>
    <ttl>40</ttl>
    <description>Walking the Web of Chemical Informatics</description>
    <item>
      <title>User-Created Compound Monographs on Chempedia.net: Open Sourcing the Collation and Indexing of Chemical Information</title>
      <description>&lt;p&gt;&lt;a href="http://chempedia.com"&gt;&lt;img src="http://chempedia.net/images/global/logo.png" align="right"&gt;&lt;/img&gt;&lt;/a&gt;Printed encyclopedias of chemical information like the &lt;a href="http://www.merckbooks.com/mindex/"&gt;Merck Index&lt;/a&gt; suffer from the problem of becoming obsolete on publication. When new compounds are discovered, or when the information about a compound changes, those changes can take many months or years to appear in print form due to the high cost of publication. It doesn't have to be that way. This article introduces a new feature to the free online chemical encyclopedia &lt;a href="http://chempedia.com"&gt;Chempedia&lt;/a&gt; that lets working scientists update its contents via &lt;a href="http://wikipedia.org"&gt;Wikipedia&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;About Chempedia.net&lt;/h4&gt;

&lt;p&gt;A &lt;a href="http://depth-first.com/articles/2008/04/04/chempedia-net-mashing-up-pubchem-and-wikipedia"&gt;recent article&lt;/a&gt; introduced &lt;a href="http://chempedia.com"&gt;Chempedia&lt;/a&gt;, the free online chemical encyclopedia. This service is built on two of the largest &lt;a href="http://depth-first.com/articles/2007/01/24/thirty-two-free-chemistry-databases"&gt;free and open repositories of chemical information&lt;/a&gt; in existence: &lt;a href="http://wikipedia.org"&gt;Wikipedia&lt;/a&gt; and &lt;a href="http://pubchem.ncbi.nlm.nih.gov/"&gt;PubChem&lt;/a&gt;. PubChem supplies low-level chemical information such as connection tables, and Wikipedia supplies free-text descriptions of the properties and uses of certain molecules.&lt;/p&gt;

&lt;h4&gt;Which Molecules?&lt;/h4&gt;

&lt;p&gt;Currently, Chempedia.net only includes &lt;a href="http://depth-first.com/articles/2008/04/02/wikipedia-for-cheminformatics-a-simple-web-api-for-finding-cas-numbers-in-compound-monographs"&gt;compound monographs&lt;/a&gt; for about 1,000 of its over 300,000 molecules. These monographs were located by a manual process in which the titles for all Wikipedia articles were downloaded in alphabetized form; this process clustered titles that represented IUPAC nomenclature due to its use of leading numbers and symbols. IUPAC nomenclature titles were extracted, and then a script was written to extract the chemical information from these titles and combine it with that from PubChem.&lt;/p&gt;
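
&lt;p&gt;Why alphabetical sorting clusters IUPAC-style titles can be shown with a short Ruby sketch. It is only an illustration, not the process actually used (which was manual); the sample titles and the "first character is not a letter" test are assumptions made for this example.&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;
```ruby
# Hypothetical sample of article titles; a real run would start from
# the full list of downloaded Wikipedia titles.
titles = ["Benzene", "1,2-Dichloroethane", "(E)-Stilbene", "Caffeine", "2-Butanol"]

# Sorting by character code places titles that begin with digits or
# symbols ahead of titles that begin with letters.
sorted = titles.sort

# Keep only the titles whose first character is not a letter.
candidates = sorted.select { |title| title =~ /\A[^A-Za-z]/ }

puts candidates
```
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;A filter this crude skips compounds filed under trivial names (such as caffeine) and can pick up non-chemical titles that happen to start with a number, so its output would still need review by hand.&lt;/p&gt;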

&lt;p&gt;This method, although useful for getting a service running, is clearly flawed. The biggest problem is in how to discover new compound monographs.&lt;/p&gt;

&lt;h4&gt;Why Not Put Users in Control?&lt;/h4&gt;

&lt;p&gt;Chempedia users themselves are in the best position to know when an existing Wikipedia compound monograph should appear in Chempedia but doesn't, when an existing monograph needs to be updated, or when a new monograph is written and needs to be linked.&lt;/p&gt;

&lt;p&gt;How can the process be &lt;a href="http://depth-first.com/articles/2006/08/19/history-of-abstracting-at-chemical-abstracts-service"&gt;automated&lt;/a&gt;?&lt;/p&gt;

&lt;p&gt;As a partial answer to this question, users &lt;a href="http://chempedia.net/articles/new"&gt;now have the ability to notify Chempedia of any changes to a Wikipedia compound monograph&lt;/a&gt;, and to have those changes immediately reflected in the next viewing of a Chempedia compound monograph.&lt;/p&gt;

&lt;h4&gt;An Example&lt;/h4&gt;

&lt;p&gt;As an example, let's take &lt;a href="http://en.wikipedia.org/wiki/anandamide"&gt;anandamide&lt;/a&gt;, a compound I've had some experience with during my time as a medicinal chemist. Although the &lt;a href="http://chempedia.net/compounds/6030"&gt;Chempedia entry for anandamide&lt;/a&gt; exists, there is (or, as of today, was) no link to the Wikipedia compound monograph. Let's create one.&lt;/p&gt;

&lt;p&gt;At the top of &lt;a href="http://chempedia.com/"&gt;Chempedia's main menu&lt;/a&gt;, you'll see a link titled '&lt;a href="http://chempedia.net/articles/new"&gt;Update&lt;/a&gt;'. Choosing this link leads to a form that will ask for two pieces of information: (1) the title of the Wikipedia article to which you want Chempedia to link - in this case '&lt;a href="http://en.wikipedia.org/wiki/anandamide"&gt;anandamide&lt;/a&gt;'; and (2) &lt;a href="http://depth-first.com/articles/2007/09/18/six-reasons-i-like-recaptcha-or-how-to-build-a-web-service-worth-talking-about"&gt;reCaptcha&lt;/a&gt; text to keep robots from making mischief.&lt;/p&gt;

&lt;p&gt;Submitting this information is all that's needed to create a new or updated link from Chempedia to Wikipedia. Chempedia handles the rest.&lt;/p&gt;

&lt;h4&gt;Conclusions&lt;/h4&gt;

&lt;p&gt;Wikipedia is a vast source of free, high-quality, semi-structured chemical information just waiting to have good chemically-aware interfaces applied to it. Chempedia.net is an attempt to do just that, but it's a bit more as well. Although it may appear that Chempedia is the major beneficiary in this relationship, Wikipedia also benefits. When chemists have a tool that allows them to query and visualize Wikipedia using their native language (the chemical structure) they're in a better position to both use and contribute to Wikipedia itself - something I've started to do.&lt;/p&gt;

&lt;p&gt;This positive feedback effect is the real value of exposing Web services. The question is: who in cheminformatics is willing and able to take the risk to discover this simple principle and its benefits?&lt;/p&gt;</description>
      <pubDate>Thu, 17 Apr 2008 17:50:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:9db0f83e-ebaf-49cc-af9d-03d44250c05d</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2008/04/17/user-created-compound-monographs-on-chempedia-net-open-sourcing-the-collation-and-indexing-of-chemical-information</link>
      <category>Tools</category>
      <category>chempedia</category>
      <category>wikipedia</category>
      <category>webservice</category>
      <category>mashup</category>
      <category>compoundmonograph</category>
      <category>merckindex</category>
    </item>
    <item>
      <title>Hacking ChemSpider: Query by SMILES and InChI with Ruby</title>
      <description>&lt;p&gt;&lt;a href="http://chemspider.com"&gt;&lt;img src="http://depth-first.com/demo/20070917/chemspider.jpg" align="right"&gt;&lt;/img&gt;&lt;/a&gt;Slowly but surely, cheminformatics Web APIs are starting to appear. What's the big deal, you may ask? By exposing Web APIs, service providers enable third parties to develop new applications that &lt;a href="http://depth-first.com/articles/2006/09/23/mashups-for-fun-and-profit"&gt;"mash up"&lt;/a&gt; functionality from two or more sites, or which take the original service in directions its founders never considered.&lt;/p&gt;

&lt;p&gt;By way of &lt;a href="http://www.chemspider.com/blog"&gt;Antony Williams' blog&lt;/a&gt;, I came across &lt;a href="http://www.chemspider.com/blog/?p=135"&gt;the announcement&lt;/a&gt; for the &lt;a href="http://www.chemspider.com/inchi.asmx"&gt;ChemSpider Web API&lt;/a&gt;. What can this API do for Web developers? To find out, let's write a small Ruby library.&lt;/p&gt;

&lt;h4&gt;The Library&lt;/h4&gt;

&lt;p&gt;Our library accepts a SMILES string or InChI identifier and returns a URL pointing to the corresponding ChemSpider compound summary page. Like &lt;a href="http://depth-first.com/articles/2007/09/13/hacking-pubchem-convert-cas-numbers-into-pubchem-cids-with-ruby"&gt;previous Web API demos&lt;/a&gt;, this one uses the powerful Ruby library &lt;a href="http://mechanize.rubyforge.org/"&gt;Mechanize&lt;/a&gt;, leading to very concise code:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_ruby "&gt;&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;rubygems&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;
&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;mechanize&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;

&lt;span class="keyword"&gt;module &lt;/span&gt;&lt;span class="module"&gt;ChemSpider&lt;/span&gt;
  &lt;span class="keyword"&gt;def &lt;/span&gt;&lt;span class="method"&gt;url_for_inchi&lt;/span&gt; &lt;span class="ident"&gt;inchi&lt;/span&gt;
    &lt;span class="ident"&gt;agent&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;WWW&lt;/span&gt;&lt;span class="punct"&gt;::&lt;/span&gt;&lt;span class="constant"&gt;Mechanize&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;new&lt;/span&gt;
    &lt;span class="ident"&gt;page&lt;/span&gt;&lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;agent&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;get&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;http://www.chemspider.com/inchi.asmx/InChIToCSID?inchi=&lt;span class="expr"&gt;#{inchi}&lt;/span&gt;&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;
    &lt;span class="ident"&gt;csid&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="constant"&gt;Hpricot&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;page&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;body&lt;/span&gt;&lt;span class="punct"&gt;)/&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;string&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;).&lt;/span&gt;&lt;span class="ident"&gt;innerHTML&lt;/span&gt;

    &lt;span class="ident"&gt;csid&lt;/span&gt; &lt;span class="punct"&gt;==&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt; &lt;span class="punct"&gt;?&lt;/span&gt; &lt;span class="constant"&gt;nil&lt;/span&gt; &lt;span class="punct"&gt;:&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;http://www.chemspider.com/RecordView.aspx?id=&lt;span class="expr"&gt;#{csid}&lt;/span&gt;&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;
  &lt;span class="keyword"&gt;end&lt;/span&gt;

  &lt;span class="keyword"&gt;def &lt;/span&gt;&lt;span class="method"&gt;url_for_smiles&lt;/span&gt; &lt;span class="ident"&gt;smiles&lt;/span&gt;
    &lt;span class="ident"&gt;agent&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;WWW&lt;/span&gt;&lt;span class="punct"&gt;::&lt;/span&gt;&lt;span class="constant"&gt;Mechanize&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;new&lt;/span&gt;
    &lt;span class="ident"&gt;page&lt;/span&gt;&lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;agent&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;get&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;http://www.chemspider.com/inchi.asmx/SMILESToInChI?smiles=&lt;span class="expr"&gt;#{smiles}&lt;/span&gt;&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;
    &lt;span class="ident"&gt;inchi&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="constant"&gt;Hpricot&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;page&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;body&lt;/span&gt;&lt;span class="punct"&gt;)/&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;string&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;).&lt;/span&gt;&lt;span class="ident"&gt;innerHTML&lt;/span&gt;

    &lt;span class="keyword"&gt;raise&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;Invalid SMILES: &lt;span class="expr"&gt;#{smiles}&lt;/span&gt;&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt; &lt;span class="keyword"&gt;if&lt;/span&gt; &lt;span class="ident"&gt;inchi&lt;/span&gt; &lt;span class="punct"&gt;==&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;

    &lt;span class="ident"&gt;url_for_inchi&lt;/span&gt; &lt;span class="ident"&gt;inchi&lt;/span&gt;
  &lt;span class="keyword"&gt;end&lt;/span&gt;
&lt;span class="keyword"&gt;end&lt;/span&gt; &lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;tt&gt;url_for_inchi&lt;/tt&gt; method directly uses the ChemSpider API to query by InChI. The &lt;tt&gt;url_for_smiles&lt;/tt&gt; method first uses the ChemSpider API to convert a SMILES string to an InChI identifier, and then calls the &lt;tt&gt;url_for_inchi&lt;/tt&gt; method.&lt;/p&gt;

&lt;p&gt;Two points are worth noting. First, although for convenience the InChI identifier isn't &lt;a href="http://www.aptana.com/docs/index.php/URL_Escape_Codes"&gt;escaped&lt;/a&gt; before being appended to the API URL, strictly speaking it should be. Second, both methods invoke &lt;a href="http://code.whytheluckystiff.net/hpricot/"&gt;Hpricot&lt;/a&gt;, the parsing library that underlies Mechanize, to parse the raw XML returned by ChemSpider.&lt;/p&gt;
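
&lt;p&gt;For the first point, a minimal sketch of the escaping step might look like the following. It uses &lt;tt&gt;URI.encode_www_form_component&lt;/tt&gt; from Ruby's standard library; the helper name &lt;tt&gt;escaped_inchi_query_url&lt;/tt&gt; is hypothetical and not part of the library above, and the sketch only builds the URL without contacting ChemSpider.&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;
```ruby
require 'uri'

# Hypothetical helper (not part of the ChemSpider module above):
# builds the InChIToCSID query URL with the InChI percent-encoded,
# so that characters such as "=" and "/" are safe in a query string.
def escaped_inchi_query_url(inchi)
  "http://www.chemspider.com/inchi.asmx/InChIToCSID?inchi=#{URI.encode_www_form_component(inchi)}"
end

puts escaped_inchi_query_url("InChI=1/C6H6/c1-2-4-6-5-3-1/h1-6H")
```
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The same treatment would apply to the SMILES string passed to &lt;tt&gt;SMILESToInChI&lt;/tt&gt;, since SMILES routinely contains characters such as "#" and "=" that have special meaning in URLs.&lt;/p&gt;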

&lt;h4&gt;Testing&lt;/h4&gt;

&lt;p&gt;Saving the above code to a file called &lt;strong&gt;chemspider.rb&lt;/strong&gt;, we can get the URL to ChemSpider's benzene page from its InChI identifier via interactive Ruby (irb):&lt;/p&gt;

&lt;div class="console"&gt;
&lt;pre&gt;
$ irb
irb(main):001:0&gt; require 'chemspider'
=&gt; true
irb(main):002:0&gt; include ChemSpider
=&gt; Object
irb(main):003:0&gt; url_for_inchi "InChI=1/C6H6/c1-2-4-6-5-3-1/h1-6H"
=&gt; "http://www.chemspider.com/RecordView.aspx?id=236"
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;We can work with SMILES strings just as easily as with InChIs:&lt;/p&gt;

&lt;div class="console"&gt;
&lt;pre&gt;
$ irb
irb(main):001:0&gt; require 'chemspider'
=&gt; true
irb(main):002:0&gt; include ChemSpider
=&gt; Object
irb(main):003:0&gt; url_for_smiles 'c1ccccc1'
=&gt; "http://www.chemspider.com/RecordView.aspx?id=236"
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Both the InChI and the SMILES string yield a URL pointing to the &lt;a href="http://www.chemspider.com/RecordView.aspx?id=236"&gt;same ChemSpider page for benzene&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;Conclusions&lt;/h4&gt;

&lt;p&gt;Like most &lt;a href="http://depth-first.com/articles/2007/01/24/thirty-two-free-chemistry-databases"&gt;chemical databases&lt;/a&gt;, ChemSpider uses a compound summary page as a way of organizing the available resources for a given molecule. With a method in hand for accessing these pages based on arbitrary SMILES or InChIs, we can begin to think of manipulating ChemSpider independently of its current user interface. But that's a story for another time.&lt;/p&gt;</description>
      <pubDate>Mon, 17 Sep 2007 08:19:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:9e6b90f7-590d-47d4-b2a8-bbac5a014c74</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2007/09/17/hacking-chemspider-query-by-smiles-and-inchi-with-ruby</link>
      <category>Tools</category>
      <category>chemspider</category>
      <category>hackingchemspider</category>
      <category>ruby</category>
      <category>webapi</category>
      <category>mashup</category>
      <category>mechanize</category>
      <category>hpricot</category>
    </item>
    <item>
      <title>From IUPAC Nomenclature to 2-D Structures With OPSIN</title>
      <description>&lt;p&gt;A &lt;a href="http://depth-first.com/articles/2006/10/14/decoding-iupac-names-with-opsin"&gt;previous article&lt;/a&gt; introduced OPSIN, an Open Source Java library for decoding IUPAC chemical nomenclature. In this tutorial, you'll see how OPSIN can, when interfaced with freely-available chemical informatics software, generate 2-D structure diagrams from IUPAC names.&lt;/p&gt;

&lt;h4&gt;Prerequisites&lt;/h4&gt;

&lt;p&gt;This tutorial requires &lt;a href="http://depth-first.com/articles/2006/09/25/cdk-the-ruby-way-rcdk-0-2-0"&gt;Ruby CDK&lt;/a&gt; (RCDK), which in turn requires Ruby, Java, and the &lt;a href="http://rjb.rubyforge.org"&gt;Ruby Java Bridge&lt;/a&gt;. Tutorials detailing the installation of RCDK on both &lt;a href="http://depth-first.com/articles/2006/10/12/running-ruby-java-bridge-on-windows"&gt;Windows&lt;/a&gt; and &lt;a href="http://depth-first.com/articles/2006/09/25/cdk-the-ruby-way-rcdk-0-2-"&gt;Linux&lt;/a&gt; platforms are available.&lt;/p&gt;

&lt;p&gt;In addition, you'll need a copy of the standalone jarfile &lt;a href="http://prdownloads.sourceforge.net/oscar3-chem/opsin-big-0.1.0.jar?download"&gt;opsin-big-0.1.0.jar&lt;/a&gt;. Future versions of RCDK will integrate the OPSIN jarfile, making this step unnecessary.&lt;/p&gt;

&lt;h4&gt;Outlining the Problem and a Solution&lt;/h4&gt;

&lt;p&gt;We'd like to create a simple Ruby class with a method that accepts an IUPAC chemical name as input and produces a PNG image of the corresponding molecule as output. OPSIN accepts IUPAC names as input, but it only produces &lt;a href="http://www.xml-cml.org/"&gt;Chemical Markup Language&lt;/a&gt; (CML) as output. The CML output lacks 2-D coordinates, and OPSIN itself has no 2-D rendering capabilities.&lt;/p&gt;

&lt;p&gt;We'll use RCDK to augment OPSIN's capabilities. Thanks to CDK's built-in CML support, RCDK can read CML and generate an &lt;tt&gt;AtomContainer&lt;/tt&gt; representation. RCDK also supports the assignment of 2-D coordinates to an &lt;tt&gt;AtomContainer&lt;/tt&gt; via CDK's &lt;tt&gt;StructureDiagramGenerator&lt;/tt&gt;. To produce the PNG image, we'll use the 2-D rendering capability made possible through &lt;a href="http://depth-first.com/articles/2006/08/28/drawing-2-d-structures-with-structure-cdk"&gt;Structure-CDK&lt;/a&gt;, which is a built-in component of RCDK.&lt;/p&gt;

&lt;h4&gt;A Simple Ruby Library&lt;/h4&gt;

&lt;p&gt;Create a working directory and copy &lt;a href="http://prdownloads.sourceforge.net/oscar3-chem/opsin-big-0.1.0.jar?download"&gt;opsin-big-0.1.0.jar&lt;/a&gt; into it. Next, create a file called &lt;strong&gt;depictor.rb&lt;/strong&gt; containing the following Ruby code:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_ruby "&gt;&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;rubygems&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;
&lt;span class="ident"&gt;require_gem&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;rcdk&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;
&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;rcdk&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;

&lt;span class="constant"&gt;Java&lt;/span&gt;&lt;span class="punct"&gt;::&lt;/span&gt;&lt;span class="constant"&gt;Classpath&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;add&lt;/span&gt;&lt;span class="punct"&gt;('&lt;/span&gt;&lt;span class="string"&gt;opsin-big-0.1.0.jar&lt;/span&gt;&lt;span class="punct"&gt;')&lt;/span&gt;

&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;util&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;

&lt;span class="comment"&gt;# A simple IUPAC-&amp;gt;2-D structure converter.&lt;/span&gt;
&lt;span class="keyword"&gt;class &lt;/span&gt;&lt;span class="class"&gt;Depictor&lt;/span&gt;
  &lt;span class="attribute"&gt;@@StringReader&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;import&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;java.io.StringReader&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;
  &lt;span class="attribute"&gt;@@NameToStructure&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;import&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;uk.ac.cam.ch.wwmm.opsin.NameToStructure&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;
  &lt;span class="attribute"&gt;@@CMLReader&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;import&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;org.openscience.cdk.io.CMLReader&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;
  &lt;span class="attribute"&gt;@@ChemFile&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;import&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;org.openscience.cdk.ChemFile&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;

  &lt;span class="keyword"&gt;def &lt;/span&gt;&lt;span class="method"&gt;initialize&lt;/span&gt;
    &lt;span class="attribute"&gt;@nts&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="attribute"&gt;@@NameToStructure&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;new&lt;/span&gt;
    &lt;span class="attribute"&gt;@cml_reader&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="attribute"&gt;@@CMLReader&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;new&lt;/span&gt;
  &lt;span class="keyword"&gt;end&lt;/span&gt;

  &lt;span class="comment"&gt;# Writes a &amp;lt;tt&amp;gt;width&amp;lt;/tt&amp;gt; by &amp;lt;tt&amp;gt;height&amp;lt;/tt&amp;gt; PNG to&lt;/span&gt;
  &lt;span class="comment"&gt;# &amp;lt;tt&amp;gt;filename&amp;lt;/tt&amp;gt; for the molecule described by&lt;/span&gt;
  &lt;span class="comment"&gt;# &amp;lt;tt&amp;gt;iupac_name&amp;lt;/tt&amp;gt;.&lt;/span&gt;
  &lt;span class="keyword"&gt;def &lt;/span&gt;&lt;span class="method"&gt;depict_png&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;iupac_name&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="ident"&gt;filename&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="ident"&gt;width&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="ident"&gt;height&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;
    &lt;span class="ident"&gt;cml&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="attribute"&gt;@nts&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;parseToCML&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;iupac_name&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;

    &lt;span class="keyword"&gt;raise&lt;/span&gt;&lt;span class="punct"&gt;(&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;Can't parse name: &lt;span class="expr"&gt;#{iupac_name}&lt;/span&gt;&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;)&lt;/span&gt; &lt;span class="keyword"&gt;unless&lt;/span&gt; &lt;span class="ident"&gt;cml&lt;/span&gt;

    &lt;span class="ident"&gt;molfile&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;cml_to_molfile&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;cml&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;

    &lt;span class="constant"&gt;RCDK&lt;/span&gt;&lt;span class="punct"&gt;::&lt;/span&gt;&lt;span class="constant"&gt;Util&lt;/span&gt;&lt;span class="punct"&gt;::&lt;/span&gt;&lt;span class="constant"&gt;Image&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;molfile_to_png&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;molfile&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="ident"&gt;filename&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="ident"&gt;width&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="ident"&gt;height&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;
  &lt;span class="keyword"&gt;end&lt;/span&gt;

  &lt;span class="ident"&gt;private&lt;/span&gt;

  &lt;span class="keyword"&gt;def &lt;/span&gt;&lt;span class="method"&gt;cml_to_molfile&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;cml&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;
    &lt;span class="ident"&gt;string_reader&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="attribute"&gt;@@StringReader&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;new&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;cml&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;toXML&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;

    &lt;span class="attribute"&gt;@cml_reader&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;setReader&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;string_reader&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;

    &lt;span class="ident"&gt;chem_file&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="attribute"&gt;@cml_reader&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;read&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="attribute"&gt;@@ChemFile&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;new&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;
    &lt;span class="ident"&gt;molecule&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;chem_file&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;getChemSequence&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="number"&gt;0&lt;/span&gt;&lt;span class="punct"&gt;).&lt;/span&gt;&lt;span class="ident"&gt;getChemModel&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="number"&gt;0&lt;/span&gt;&lt;span class="punct"&gt;).&lt;/span&gt;&lt;span class="ident"&gt;getSetOfMolecules&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;getMolecule&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="number"&gt;0&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;

    &lt;span class="ident"&gt;molecule&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;RCDK&lt;/span&gt;&lt;span class="punct"&gt;::&lt;/span&gt;&lt;span class="constant"&gt;Util&lt;/span&gt;&lt;span class="punct"&gt;::&lt;/span&gt;&lt;span class="constant"&gt;XY&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;coordinate_molecule&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;molecule&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;

    &lt;span class="constant"&gt;RCDK&lt;/span&gt;&lt;span class="punct"&gt;::&lt;/span&gt;&lt;span class="constant"&gt;Util&lt;/span&gt;&lt;span class="punct"&gt;::&lt;/span&gt;&lt;span class="constant"&gt;Lang&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;get_molfile&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;molecule&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;
  &lt;span class="keyword"&gt;end&lt;/span&gt;
&lt;span class="keyword"&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h4&gt;Testing, Testing&lt;/h4&gt;

&lt;p&gt;A short test will demonstrate the capabilities of the &lt;tt&gt;Depictor&lt;/tt&gt; library. Add the following to a file called &lt;strong&gt;test.rb&lt;/strong&gt; in your working directory (or enter it interactively with irb):&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_ruby "&gt;&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;depictor&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;

&lt;span class="ident"&gt;depictor&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;Depictor&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;new&lt;/span&gt;
&lt;span class="ident"&gt;name&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;3,3-dimethyl-7-oxo-6-[(2-phenylacetyl)amino]-4-thia-1-azabicyclo[3.2.0]heptane-2-carboxylic acid&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt; &lt;span class="comment"&gt;#Penicillin G&lt;/span&gt;

&lt;span class="ident"&gt;depictor&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;depict_png&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;name&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;out.png&lt;/span&gt;&lt;span class="punct"&gt;',&lt;/span&gt; &lt;span class="number"&gt;300&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="number"&gt;300&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Running this test produces a 300x300 PNG image of Penicillin G, named &lt;strong&gt;out.png&lt;/strong&gt;, in your working directory:&lt;/p&gt;

&lt;p&gt;&lt;center&gt;&lt;img src="http://depth-first.com/demo/20061017/out.png"&gt;&lt;/img&gt;&lt;/center&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;As you can see, this simple library and test code have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;correctly parsed the rather complex IUPAC name (3,3-dimethyl-7-oxo-6-[(2-phenylacetyl)amino]-4-thia-1-azabicyclo[3.2.0]heptane-2-carboxylic acid) to a valid CML representation&lt;/li&gt;
&lt;li&gt;converted this representation to a CDK &lt;tt&gt;AtomContainer&lt;/tt&gt;&lt;/li&gt;
&lt;li&gt;assigned 2-D coordinates&lt;/li&gt;
&lt;li&gt;rendered a PNG image in color&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Notice how the thiaazabicyclo[3.2.0] system, complete with properly placed substituents, was flawlessly identified and parsed.&lt;/p&gt;

&lt;p&gt;If you entered the above test code interactively via IRB, you may have noticed a multi-second delay in instantiating &lt;tt&gt;Depictor&lt;/tt&gt;. This latency results from a sluggish &lt;tt&gt;NameToStructure&lt;/tt&gt; constructor in OPSIN. A similar delay also occurs in OPSIN's pure-Java unit tests. Once &lt;tt&gt;Depictor&lt;/tt&gt; is instantiated, however, image generation occurs relatively quickly.&lt;/p&gt;

&lt;p&gt;The unusual orientation of the beta-lactam carbonyl group is determined by CDK's &lt;tt&gt;StructureDiagramGenerator&lt;/tt&gt;. The source of this behavior will be explored in a future article.&lt;/p&gt;

&lt;h4&gt;More Examples&lt;/h4&gt;

&lt;p&gt;To illustrate some of the capabilities of the OPSIN-RCDK combination, a few more examples are provided below.&lt;/p&gt;

&lt;p&gt;One of OPSIN's more surprising features is how well it handles heterocycles. For example, the IUPAC name for caffeine (&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=2519"&gt;1,3,7-trimethylpurine-2,6-dione&lt;/a&gt;) is translated to:&lt;/p&gt;

&lt;p&gt;&lt;center&gt;
&lt;img src="http://depth-first.com/demo/20061017/caffeine.png"&gt;&lt;/img&gt;
&lt;/center&gt;
&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;As another example, consider the tetrazole (&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=180603"&gt;1-[2-hydroxy-3-propyl-4-[3-(2H-tetrazol-5-yl)propoxy]phenyl]ethanone&lt;/a&gt;):&lt;/p&gt;

&lt;p&gt;&lt;center&gt;
&lt;img src="http://depth-first.com/demo/20061017/180603.png"&gt;&lt;/img&gt;
&lt;/center&gt;
&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;Highly substituted benzene rings and carboxylic acids are also translated accurately, as in &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=2528"&gt;3-acetamido-5-(acetyl-methyl-amino)-2,4,6-triiodo-benzoic acid&lt;/a&gt; (Metrizoate):&lt;/p&gt;

&lt;p&gt;&lt;center&gt;
&lt;img src="http://depth-first.com/demo/20061017/metrizoate.png"&gt;&lt;/img&gt;
&lt;/center&gt;
&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;How about a hairy-looking macrocycle name with multiple levels of morpheme nesting (&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=2547"&gt;3,6-diamino-N-[[15-amino-11-(2-amino-3,4,5,6-tetrahydropyrimidin-4-yl)-8- [(carbamoylamino)methylidene]-2-(hydroxymethyl)-3,6,9,12,16-pentaoxo- 1,4,7,10,13-pentazacyclohexadec-5-yl]methyl]hexanamide&lt;/a&gt;)? Not a problem:&lt;/p&gt;

&lt;p&gt;&lt;center&gt;
&lt;img src="http://depth-first.com/demo/20061017/2547.png"&gt;&lt;/img&gt;
&lt;/center&gt;
&lt;br /&gt;&lt;/p&gt;

&lt;h4&gt;Limitations&lt;/h4&gt;

&lt;p&gt;In my tests of the OPSIN library, one structure appeared to be incorrectly parsed - &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=180591"&gt;N-(5-chloro-2-methyl-phenyl)-2-methoxy-N-(2-oxooxazolidin-3-yl)acetamide&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;center&gt;
&lt;img src="http://depth-first.com/demo/20061017/180591.png"&gt;&lt;/img&gt;
&lt;/center&gt;
&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;There are actually two problems with the output. First, an oxygen atom and a methyl group overlap near the top of the diagram. This cosmetic issue is related to CDK's &lt;tt&gt;StructureDiagramGenerator&lt;/tt&gt;. Second, the oxazolidine nitrogen atom is misplaced by OPSIN. The correct 2-D image of this molecule, obtained from PubChem, is shown below:&lt;/p&gt;

&lt;p&gt;&lt;center&gt;
&lt;img src="http://depth-first.com/demo/20061017/180591_pc.png"&gt;&lt;/img&gt;
&lt;/center&gt;
&lt;br /&gt;&lt;/p&gt;

&lt;h4&gt;Conclusions&lt;/h4&gt;

&lt;p&gt;It's not common to find an early-development Open Source project with the sophistication of OPSIN. The smooth handling of nested morphemes, aromatic heterocycles, macrocycles, and a good fraction of what I threw at it leads me to believe that a well-designed and extensible nomenclature parsing engine lies at OPSIN's core. More on that later, though.&lt;/p&gt;

&lt;p&gt;What could you do with a powerful Open Source IUPAC nomenclature parser? The answer to that one question could fill a three-volume series. Suffice it to say that OPSIN, in combination with other Open Source software, offers virtually limitless potential for indexing, collecting, repackaging, reprocessing, and mashing up vast amounts of chemical information. Because of its Open Source license, OPSIN can be extended and otherwise modified to fit your particular needs. Future articles will highlight some of the possibilities.&lt;/p&gt;</description>
      <pubDate>Tue, 17 Oct 2006 13:57:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:fd6de2ae-23c8-4e50-9765-344e9a7a9545</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2006/10/17/from-iupac-nomenclature-to-2-d-structures-with-opsin</link>
      <category>Graphics</category>
      <category>opsin</category>
      <category>nametostruct</category>
      <category>iupac</category>
      <category>rcdk</category>
      <category>structure</category>
      <category>cdk</category>
      <category>integration</category>
      <category>mashup</category>
    </item>
    <item>
      <title>Mashups for Fun and Profit</title>
      <description>&lt;p&gt;&lt;a href="http://www.programmableweb.com/"&gt;ProgrammableWeb&lt;/a&gt; offers one-stop shopping for all things mashup-related. If you've ever wanted to try your hand at Web programming, this site makes an excellent first stop. Be sure to check out the listing of over 1,000 mashup sites indexed by category and API.&lt;/p&gt;

&lt;p&gt;The move toward open, Web-based chemical information resources is &lt;a href="http://depth-first.com/articles/2006/09/22/hacking-pubchem-why-the-open-access-fight-is-just-the-beginning"&gt;fully underway&lt;/a&gt;. The genie has been let out of the bottle, and there's no putting him back. This is bad news for large, established chemical information players. Their business models based on restricting information flow will be irreversibly &lt;a href="http://en.wikipedia.org/wiki/Disruptive_technology"&gt;disrupted&lt;/a&gt;. It's good news for tens of thousands of researchers who will be able to exploit chemical information in ways unimaginable today. Leading the way will be mashups that creatively tie diverse Web resources together, and dynamic programming languages like &lt;a href="http://www.ruby-lang.org/en/"&gt;Ruby&lt;/a&gt; that make doing so easy.&lt;/p&gt;

&lt;p&gt;Are you ready for the future?&lt;/p&gt;</description>
      <pubDate>Sat, 23 Sep 2006 16:27:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:71720038-660f-4a04-9b83-2a341cc80241</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2006/09/23/mashups-for-fun-and-profit</link>
      <category>Web</category>
      <category>disruption</category>
      <category>mashup</category>
      <category>webapi</category>
    </item>
    <item>
      <title>Hacking PubChem: Entrez Programming Utilities</title>
      <description>&lt;p&gt;&lt;img src="http://depth-first.com/files/pubchemlogo.gif" align="right"&gt;&lt;/img&gt;A &lt;a href="http://depth-first.com/articles/2006/09/22/hacking-pubchem-why-the-open-access-fight-is-just-the-beginning"&gt;recent article&lt;/a&gt; poses the question of how to balance the rights of owners of open chemical information resources against those of their users, while promoting an innovative environment for third-party developers. Although &lt;a href="http://pubchem.ncbi.nlm.nih.gov/"&gt;PubChem&lt;/a&gt; was the focus, the discussion could apply to any other chemical information resource. A reasonable approach would be to provide two separate entry points: one for Web browsers and another for various types of semi-autonomous software used in hacking and mashups.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://wwmm.ch.cam.ac.uk/blogs/corbett/"&gt;Peter Corbett&lt;/a&gt; writes to point out that the &lt;a href="http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html"&gt;Entrez Programming Utilities&lt;/a&gt; can be used to query PubChem and other databases under the NIH/NCI/NCBI umbrella. A separate developer server processes requests, and the terms of its use are fairly well stated. Future articles will explore the possibility of building some simple Ruby APIs for this developer PubChem entry point.&lt;/p&gt;</description>
      <pubDate>Sat, 23 Sep 2006 01:22:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:23e87247-8058-42bc-882c-4ccd40d4f695</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2006/09/23/hacking-pubchem-entrez-programming-utilities</link>
      <category>Web</category>
      <category>pubchem</category>
      <category>api</category>
      <category>mashup</category>
      <category>entrez</category>
    </item>
    <item>
      <title>Hacking PubChem: Why The Open Access Fight is Just the Beginning</title>
      <description>&lt;p&gt;&lt;img src="http://depth-first.com/files/pubchemlogo.gif" align="right"&gt;&lt;/img&gt;Like no other medium, the Internet tests our basic beliefs about the rights of resource owners and resource users. As the Internet increasingly becomes home to scientific publication mechanisms that have no counterpart in the physical world, a larger question looms: what separates fair use of these services from abuse?&lt;/p&gt;

&lt;p&gt;Depth-First hosts a series of articles, with possibly many more to follow, on programmatically accessing open chemical information databases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="http://depth-first.com/articles/2006/09/21/hacking-pubchem-query-by-smiles"&gt;Hacking PubChem: Query by SMILES&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="http://depth-first.com/articles/2006/08/30/hacking-pubchem-with-ruby"&gt;Hacking PubChem with Ruby&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="http://depth-first.com/articles/2006/09/04/hacking-nmrshiftdb"&gt;Hacking NMRShiftDB&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The availability of open chemical information resources like &lt;a href="http://pubchem.ncbi.nlm.nih.gov/"&gt;PubChem&lt;/a&gt; and &lt;a href="http://nmrshiftdb.org/"&gt;NMRShiftDB&lt;/a&gt; is a very recent phenomenon, and desperately overdue. One premise of this blog is that chemical informatics is at the start of a renaissance; the chemical information revolution that started in the 1950s is now set to continue after a long period of stagnation. Large, open data sources, and open software that mines them, will fuel this transformation, just as they have in bioinformatics.&lt;/p&gt;

&lt;p&gt;The interaction of non-browser software with public databases, although rich in potential payoffs, can also lead to a great deal of damage. PubChem contains millions of structure-searchable compounds. Setting the wrong kinds of programs loose on this site could cause service interruptions ranging from the annoying to the severe.&lt;/p&gt;

&lt;p&gt;There is no standard mechanism for website owners to spell out acceptable use policies to non-browser software. The closest thing we have to a standard is the &lt;a href="http://www.robotstxt.org/wc/exclusion.html"&gt;Robots Exclusion Protocol&lt;/a&gt;. This protocol defines acceptable behaviors for a robot, which according to &lt;a href="http://www.robotstxt.org/wc/faq.html#what"&gt;one definition&lt;/a&gt; is: "... a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced." Other definitions are in use. The one thing these definitions seem to have in common is the concept of scale: the more comprehensive and indiscriminate the program is in its interactions with a website, the more like a robot, and less like a browser, it becomes.&lt;/p&gt;

&lt;p&gt;Site owners specify their robots policy in a file called &lt;strong&gt;robots.txt&lt;/strong&gt; hosted on their servers. The &lt;a href="http://pubchem.ncbi.nlm.nih.gov/robots.txt"&gt;PubChem robots.txt file&lt;/a&gt; currently includes the following policies:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_default "&gt;User-agent: *
Disallow: /substance/PcsSrv.cgi
Disallow: /summary/summary.cgi
Disallow: /assay/assay.cgi
Disallow: /image/imgsrv.fcgi
Disallow: /image/smi2gif.fcgi
Disallow: /image/smi2gif.cgi
Disallow: /image/structurefly.cgi
Disallow: /search/NbrQsrv.cgi
Disallow: /search/PreQSrv.cgi&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Here, &lt;tt&gt;User-agent&lt;/tt&gt; refers to the name of the robot, which is set as a wildcard, meaning any robot. The &lt;tt&gt;Disallow&lt;/tt&gt; lines refer to resources off-limits to robots.&lt;/p&gt;
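
&lt;p&gt;As a rough illustration of how a program might honor these rules, the Ruby sketch below collects the &lt;tt&gt;Disallow&lt;/tt&gt; prefixes listed under the wildcard &lt;tt&gt;User-agent&lt;/tt&gt; and tests a path against them. The method names &lt;tt&gt;disallowed_prefixes&lt;/tt&gt; and &lt;tt&gt;disallowed?&lt;/tt&gt; are invented for this example, and the parsing is deliberately simplified: it ignores comments, &lt;tt&gt;Allow&lt;/tt&gt; lines, and grouped user agents, so treat it as a starting point rather than a complete implementation of the protocol.&lt;/p&gt;

```ruby
# Hypothetical helpers for illustration; not part of any robots.txt library.

# Collects the Disallow prefixes that apply to every robot ("User-agent: *").
def disallowed_prefixes(robots_txt)
  prefixes = []
  applies = false
  robots_txt.each_line do |line|
    field, value = line.strip.split(/:\s*/, 2)
    next if value.nil?
    case field.downcase
    when 'user-agent'
      applies = (value == '*')
    when 'disallow'
      # An empty Disallow value means "nothing is off-limits", so skip it.
      prefixes.push(value) if applies and !value.empty?
    end
  end
  prefixes
end

# True when the path begins with one of the disallowed prefixes.
def disallowed?(path, prefixes)
  prefixes.any? { |prefix| path.start_with?(prefix) }
end

# A shortened copy of the PubChem rules quoted above.
ROBOTS_TXT = [
  'User-agent: *',
  'Disallow: /summary/summary.cgi',
  'Disallow: /image/imgsrv.fcgi',
  'Disallow: /search/PreQSrv.cgi'
].join("\n")

prefixes = disallowed_prefixes(ROBOTS_TXT)
puts disallowed?('/search/PreQSrv.cgi', prefixes) # prints true
puts disallowed?('/docs/index.html', prefixes)    # prints false (hypothetical path)
```

&lt;p&gt;A well-behaved robot would run a check like this before each request and skip any path for which it returns true.&lt;/p&gt;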

&lt;p&gt;One of these disallowed resources, &lt;tt&gt;/search/PreQSrv.cgi&lt;/tt&gt;, is explicitly used in the &lt;a href="http://depth-first.com/articles/2006/09/21/hacking-pubchem-query-by-smiles"&gt;PubChem SMILES query article&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Is a person who runs software of the type I describe in these articles violating PubChem's use policy? The best answer I can give is, "it depends." I think it would be hard for reasonable people to suggest that using the software as described in the tutorials, with their deliberately limited scope, for research purposes, and with no intent to do damage, represents abuse.&lt;/p&gt;

&lt;p&gt;On the other hand, I can see how reasonable people could argue that a website operating as a comprehensive front-end to PubChem using the techniques described in these articles could be considered abuse. I know I might consider it abuse if I ran PubChem, depending on why I was running the service.&lt;/p&gt;

&lt;p&gt;If I wanted to stimulate innovation in the area of open database mining, I might actually encourage front ends and similar third-party PubChem services. I might set aside servers specifically dedicated to this kind of activity. I might even develop an Open Source PubChem Web-API to help developers get started. Unfortunately, NIH's intentions are not exactly clear on this point.&lt;/p&gt;

&lt;p&gt;The &lt;a href="http://www.ncbi.nlm.nih.gov/About/disclaimer.html"&gt;NCBI's Copyright and Disclaimers page&lt;/a&gt;, the only document that to my knowledge states any kind of use policy, is not especially illuminating:&lt;/p&gt;

&lt;blockquote&gt;
    &lt;p&gt;&lt;strong&gt;Conditions of Use&lt;/strong&gt;&lt;/p&gt;

    &lt;p&gt;This site is maintained by the U.S. Government and is protected by various provisions of Title 18 of the U.S. Code. Violations of Title 18 are subject to criminal prosecution in a federal court. For site security purposes, as well as to ensure that this service remains available to all users, we use software programs to monitor traffic and to identify unauthorized attempts to upload or change information or otherwise cause damage. In the event of authorized law enforcement investigations and pursuant to any required legal process, information from these sources may be used to help identify an individual.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We are left with the critical but unanswered question: "What represents an unauthorized use of PubChem?"&lt;/p&gt;

&lt;p&gt;The document cited above also raises the truly bizarre possibility of PubChem not actually being capable of granting rights to redistribute what is contained on its servers:&lt;/p&gt;

&lt;blockquote&gt;
    &lt;p&gt;This site also contains resources such as PubMed Central, Bookshelf, OMIM, and PubChem which incorporate material contributed or licensed by individuals, companies, or organizations that may be protected by U.S. and foreign copyright laws. ...&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But this is a subject for another day.&lt;/p&gt;

&lt;p&gt;Getting back to accessing PubChem data, one very far-sighted thing the NIH has done is to make the entire dataset &lt;a href="ftp://ftp.ncbi.nlm.nih.gov/pubchem/"&gt;freely downloadable&lt;/a&gt; in three different file formats. Rather than mine the PubChem website itself, you could download the data to your machine, letting the software you write access it locally. The sheer size of this dataset creates problems of its own. Future articles will describe some approaches to solving them.&lt;/p&gt;
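
&lt;p&gt;To give a flavor of local access, here is a minimal Ruby sketch that assumes you downloaded the data as SD files, in which each record ends with a line containing only &lt;tt&gt;$$$$&lt;/tt&gt;. It reads one record at a time, so a very large file never has to fit in memory. The method name &lt;tt&gt;each_sd_record&lt;/tt&gt; is invented for this example, and a &lt;tt&gt;StringIO&lt;/tt&gt; stands in for a real file.&lt;/p&gt;

```ruby
require 'stringio'

# Hypothetical helper: yields each SD record (without its $$$$ delimiter)
# as a String. Text after the last delimiter is ignored.
def each_sd_record(io)
  record = ''
  io.each_line do |line|
    if line.strip == '$$$$'
      yield record
      record = ''
    else
      record += line
    end
  end
end

# Two toy records; a real SD file would hold full molfiles and data fields.
sample = StringIO.new("molecule A\n$$$$\nmolecule B\n$$$$\n")

# The first line of each record is its title line.
each_sd_record(sample) { |record| puts record.lines.first }
```

&lt;p&gt;Because the method only calls &lt;tt&gt;each_line&lt;/tt&gt;, the same code should work unchanged with a &lt;tt&gt;File&lt;/tt&gt; object opened on a downloaded file.&lt;/p&gt;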

&lt;p&gt;Regardless of your views on the use and abuse of chemical information resources like PubChem, it's clear that getting open resources on the Web is only the first in a long series of controversial steps that will ultimately transform both the practice and culture of research.&lt;/p&gt;</description>
      <pubDate>Fri, 22 Sep 2006 13:58:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:cc1896b1-0927-4254-8dce-d5e3816816aa</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2006/09/22/hacking-pubchem-why-the-open-access-fight-is-just-the-beginning</link>
      <category>Open X</category>
      <category>pubchem</category>
      <category>openaccess</category>
      <category>fairuse</category>
      <category>api</category>
      <category>mashup</category>
    </item>
    <item>
      <title>Hacking PubChem with Ruby</title>
      <description>&lt;p&gt;&lt;img src="http://depth-first.com/files/pubchemlogo.gif" align="right"&gt;&lt;/img&gt;&lt;a href="http://pubchem.ncbi.nlm.nih.gov/"&gt;PubChem&lt;/a&gt; is an increasingly popular, free-access, online molecular database operated by the National Institutes of Health. Web services are a hot topic, with sites such as &lt;a href="http://www.flickr.com/services/api/"&gt;Flickr&lt;/a&gt;, &lt;a href="http://www.google.com/apis/"&gt;Google&lt;/a&gt;, and &lt;a href="http://developer.ebay.com/common/api/"&gt;eBay&lt;/a&gt; offering developers the tools to build rich content through "mashups" of several web APIs. Although there is no formal PubChem API, it's possible to roll your own. As a demonstration, this article will show how structural information can be retrieved from PubChem using some simple Ruby code. The inspiration for this article came from the &lt;tt&gt;PubChem&lt;/tt&gt; module that is part of &lt;a href="http://chemruby.org"&gt;Chemruby&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The only thing you'll need for this tutorial is Ruby, preferably version 1.8.2 or higher. Create a directory called &lt;strong&gt;pubchem&lt;/strong&gt; and make it your working directory. Then create a file called &lt;strong&gt;pubchem.rb&lt;/strong&gt; containing the following code:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_ruby "&gt;&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;net/http&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;

&lt;span class="comment"&gt;# A very simple PubChem Web API.&lt;/span&gt;
&lt;span class="keyword"&gt;class &lt;/span&gt;&lt;span class="class"&gt;PubChem&lt;/span&gt;

  &lt;span class="comment"&gt;# Returns a molfile (as a String) for the molecule with PubChem&lt;/span&gt;
  &lt;span class="comment"&gt;# CID matching compound_id.&lt;/span&gt;
  &lt;span class="keyword"&gt;def &lt;/span&gt;&lt;span class="method"&gt;self.get_molfile&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;compound_id&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;
    &lt;span class="ident"&gt;molfile&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;nil&lt;/span&gt;
    &lt;span class="ident"&gt;path&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;/summary/summary.cgi?cid=&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt; &lt;span class="punct"&gt;+&lt;/span&gt; &lt;span class="ident"&gt;compound_id&lt;/span&gt; &lt;span class="punct"&gt;+&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;&amp;amp;disopt=DisplaySDF&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;

    &lt;span class="constant"&gt;Net&lt;/span&gt;&lt;span class="punct"&gt;::&lt;/span&gt;&lt;span class="constant"&gt;HTTP&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;start&lt;/span&gt;&lt;span class="punct"&gt;('&lt;/span&gt;&lt;span class="string"&gt;pubchem.ncbi.nlm.nih.gov&lt;/span&gt;&lt;span class="punct"&gt;')&lt;/span&gt; &lt;span class="keyword"&gt;do&lt;/span&gt; &lt;span class="punct"&gt;|&lt;/span&gt;&lt;span class="ident"&gt;http&lt;/span&gt;&lt;span class="punct"&gt;|&lt;/span&gt;
      &lt;span class="ident"&gt;response&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;http&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;get&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;path&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;
      &lt;span class="ident"&gt;molfile&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;response&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;body&lt;/span&gt;
    &lt;span class="keyword"&gt;end&lt;/span&gt;

    &lt;span class="ident"&gt;molfile&lt;/span&gt;
  &lt;span class="keyword"&gt;end&lt;/span&gt;

  &lt;span class="comment"&gt;# Writes a PNG image, for the molecule with PubChem&lt;/span&gt;
  &lt;span class="comment"&gt;# CID matching compound_id, to the file specified by filename.&lt;/span&gt;
  &lt;span class="keyword"&gt;def &lt;/span&gt;&lt;span class="method"&gt;self.write_image&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;compound_id&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="ident"&gt;filename&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;
    &lt;span class="ident"&gt;path&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;/image/imgsrv.fcgi?t=l&amp;amp;cid=&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt; &lt;span class="punct"&gt;+&lt;/span&gt; &lt;span class="ident"&gt;compound_id&lt;/span&gt;

    &lt;span class="constant"&gt;Net&lt;/span&gt;&lt;span class="punct"&gt;::&lt;/span&gt;&lt;span class="constant"&gt;HTTP&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;start&lt;/span&gt;&lt;span class="punct"&gt;('&lt;/span&gt;&lt;span class="string"&gt;pubchem.ncbi.nlm.nih.gov&lt;/span&gt;&lt;span class="punct"&gt;')&lt;/span&gt; &lt;span class="keyword"&gt;do&lt;/span&gt; &lt;span class="punct"&gt;|&lt;/span&gt;&lt;span class="ident"&gt;http&lt;/span&gt;&lt;span class="punct"&gt;|&lt;/span&gt;
      &lt;span class="ident"&gt;response&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;http&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;get&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;path&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;
      &lt;span class="ident"&gt;image&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;response&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;body&lt;/span&gt;

      &lt;span class="constant"&gt;File&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;open&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;filename&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;wb&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;)&lt;/span&gt; &lt;span class="keyword"&gt;do&lt;/span&gt; &lt;span class="punct"&gt;|&lt;/span&gt;&lt;span class="ident"&gt;file&lt;/span&gt;&lt;span class="punct"&gt;|&lt;/span&gt;
        &lt;span class="ident"&gt;file&lt;/span&gt; &lt;span class="punct"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="ident"&gt;image&lt;/span&gt;
      &lt;span class="keyword"&gt;end&lt;/span&gt;
    &lt;span class="keyword"&gt;end&lt;/span&gt;
  &lt;span class="keyword"&gt;end&lt;/span&gt;
&lt;span class="keyword"&gt;end&lt;/span&gt; &lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;PubChem references each of its compounds by a unique integer identifier, the PubChem CID. This is very handy because retrieving PubChem resources is as simple as encoding a URL containing the CID of interest. The class above illustrates how this system can be used to get a molfile and write a PNG image using just a few lines of Ruby.&lt;/p&gt;
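
&lt;p&gt;The URL encoding itself can be made explicit with a small sketch like the one below. It assumes the host, paths, and query parameters are exactly those used in &lt;strong&gt;pubchem.rb&lt;/strong&gt;; the helper names &lt;tt&gt;pubchem_url&lt;/tt&gt;, &lt;tt&gt;sdf_url&lt;/tt&gt;, and &lt;tt&gt;image_url&lt;/tt&gt; are invented for illustration. It only builds URLs and makes no network requests.&lt;/p&gt;

```ruby
require 'uri'

# Host taken from pubchem.rb above.
PUBCHEM_HOST = 'pubchem.ncbi.nlm.nih.gov'

# Hypothetical helper: joins a path and a Hash of query parameters into a URI.
def pubchem_url(path, params)
  URI::HTTP.build(host: PUBCHEM_HOST, path: path, query: URI.encode_www_form(params))
end

# URL of the SD file for a CID. Integer() raises ArgumentError for
# anything that is not a whole number, before any request is made.
def sdf_url(cid)
  pubchem_url('/summary/summary.cgi', cid: Integer(cid), disopt: 'DisplaySDF')
end

# URL of the structure image for a CID.
def image_url(cid)
  pubchem_url('/image/imgsrv.fcgi', t: 'l', cid: Integer(cid))
end

puts sdf_url(13109)
puts image_url('13109')
```

&lt;p&gt;Accepting either an integer or a numeric string keeps the helpers compatible with the string CIDs used elsewhere in this article.&lt;/p&gt;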

&lt;p&gt;Using the &lt;tt&gt;PubChem&lt;/tt&gt; class is simplicity itself. To get the molfile for Levonorgestrel (&lt;a href="http://www.go2planb.com/ForConsumers/Index.aspx"&gt;Plan B&lt;/a&gt;), which has the CID 13109:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_ruby "&gt;&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;pubchem&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;

&lt;span class="ident"&gt;molfile&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;PubChem&lt;/span&gt;&lt;span class="punct"&gt;::&lt;/span&gt;&lt;span class="ident"&gt;get_molfile&lt;/span&gt;&lt;span class="punct"&gt;('&lt;/span&gt;&lt;span class="string"&gt;13109&lt;/span&gt;&lt;span class="punct"&gt;')&lt;/span&gt; &lt;span class="comment"&gt;#=&amp;gt; returns the molfile for Levonorgestrel as a String&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;To write the 2-D structure diagram of Levonorgestrel as a PNG:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_ruby "&gt;&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;pubchem&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;

&lt;span class="constant"&gt;PubChem&lt;/span&gt;&lt;span class="punct"&gt;::&lt;/span&gt;&lt;span class="ident"&gt;write_image&lt;/span&gt;&lt;span class="punct"&gt;('&lt;/span&gt;&lt;span class="string"&gt;13109&lt;/span&gt;&lt;span class="punct"&gt;',&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;image.png&lt;/span&gt;&lt;span class="punct"&gt;')&lt;/span&gt; &lt;span class="comment"&gt;#=&amp;gt; writes a PNG image of Levonorgestrel&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This code saves the image below to your working directory as &lt;strong&gt;image.png&lt;/strong&gt;.&lt;/p&gt;

&lt;center&gt;&lt;img src="http://depth-first.com/files/levonorgestrel.png"&gt;&lt;/img&gt;&lt;/center&gt;

&lt;p&gt;The above two code fragments can either be saved as a file and executed by the Ruby interpreter:&lt;/p&gt;

&lt;div class="console"&gt;
&lt;pre&gt;
$ ruby filename.rb
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;or they can be entered interactively in your console with &lt;a href="http://tryruby.hobix.com/"&gt;irb&lt;/a&gt;:&lt;/p&gt;

&lt;div class="console"&gt;
&lt;pre&gt;
$ irb
irb(main):001:0&gt;  
&lt;/pre&gt;   
&lt;/div&gt;

&lt;p&gt;As you can see, there's not much to building a PubChem API in Ruby. The same principles discussed here should apply in any programming language. Future articles in this series will show how to build more complex PubChem APIs and integrate them with other software packages and web services.&lt;/p&gt;</description>
      <pubDate>Wed, 30 Aug 2006 02:29:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:6cc9c1f4-db5b-4a86-96f1-9c9081a71b5d</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2006/08/30/hacking-pubchem-with-ruby</link>
      <category>Databases</category>
      <category>pubchem</category>
      <category>ruby</category>
      <category>mashup</category>
      <category>api</category>
    </item>
  </channel>
</rss>
