<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="/stylesheets/rss.css"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/">
  <channel>
    <title>Depth-First: Tag casnumber</title>
    <link>http://depth-first.com/articles/tag/casnumber</link>
    <language>en-us</language>
    <ttl>40</ttl>
    <description>Walking the Web of Chemical Informatics</description>
    <item>
      <title>Validating CAS Numbers</title>
      <description>&lt;p&gt;The Chemical Abstracts Service (CAS) &lt;a href="http://en.wikipedia.org/wiki/CAS_registry_number"&gt;registry number system&lt;/a&gt; was designed to be fault-tolerant. Built into every CAS number is a &lt;a href="http://www.cas.org/expertise/cascontent/registry/checkdig.html"&gt;check digit&lt;/a&gt; that makes it possible to detect mistyped numbers. Validation is a repetitive arithmetic procedure, which makes it well suited to software.&lt;/p&gt;

&lt;p&gt;The Ruby program below validates arbitrary CAS numbers:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_ruby "&gt;&lt;span class="keyword"&gt;module &lt;/span&gt;&lt;span class="module"&gt;CAS&lt;/span&gt;
  &lt;span class="keyword"&gt;def &lt;/span&gt;&lt;span class="method"&gt;validate&lt;/span&gt; &lt;span class="ident"&gt;cas_number&lt;/span&gt;
    &lt;span class="keyword"&gt;return&lt;/span&gt; &lt;span class="constant"&gt;false&lt;/span&gt; &lt;span class="keyword"&gt;unless&lt;/span&gt; &lt;span class="ident"&gt;cas_number&lt;/span&gt; &lt;span class="punct"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="ident"&gt;cas_number&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;match&lt;/span&gt;&lt;span class="punct"&gt;(/&lt;/span&gt;&lt;span class="regex"&gt;[0-9]{2,7}-[0-9]{2}-[0-9]&lt;/span&gt;&lt;span class="punct"&gt;/)&lt;/span&gt;

    &lt;span class="ident"&gt;check_digit&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;cas_number&lt;/span&gt;&lt;span class="punct"&gt;[-&lt;/span&gt;&lt;span class="number"&gt;1&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt;&lt;span class="number"&gt;1&lt;/span&gt;&lt;span class="punct"&gt;].&lt;/span&gt;&lt;span class="ident"&gt;to_i&lt;/span&gt;
    &lt;span class="ident"&gt;sum&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="number"&gt;0&lt;/span&gt;

    &lt;span class="ident"&gt;cas_number&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;reverse&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;scan&lt;/span&gt;&lt;span class="punct"&gt;(/&lt;/span&gt;&lt;span class="regex"&gt;[0-9]&lt;/span&gt;&lt;span class="punct"&gt;/).&lt;/span&gt;&lt;span class="ident"&gt;each_with_index&lt;/span&gt; &lt;span class="keyword"&gt;do&lt;/span&gt; &lt;span class="punct"&gt;|&lt;/span&gt;&lt;span class="ident"&gt;digit&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="ident"&gt;i&lt;/span&gt;&lt;span class="punct"&gt;|&lt;/span&gt;
      &lt;span class="ident"&gt;sum&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;sum&lt;/span&gt; &lt;span class="punct"&gt;+&lt;/span&gt; &lt;span class="ident"&gt;digit&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;to_i&lt;/span&gt; &lt;span class="punct"&gt;*&lt;/span&gt; &lt;span class="ident"&gt;i&lt;/span&gt;
    &lt;span class="keyword"&gt;end&lt;/span&gt;

    &lt;span class="ident"&gt;check_digit&lt;/span&gt; &lt;span class="punct"&gt;==&lt;/span&gt; &lt;span class="ident"&gt;sum&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;remainder&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="number"&gt;10&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;
  &lt;span class="keyword"&gt;end&lt;/span&gt;
&lt;span class="keyword"&gt;end&lt;/span&gt;

&lt;span class="ident"&gt;include&lt;/span&gt; &lt;span class="constant"&gt;CAS&lt;/span&gt;

&lt;span class="keyword"&gt;while&lt;/span&gt; &lt;span class="constant"&gt;true&lt;/span&gt; &lt;span class="keyword"&gt;do&lt;/span&gt;
  &lt;span class="ident"&gt;print&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;CAS Number: &lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;

  &lt;span class="ident"&gt;cas_number&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;gets&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;strip&lt;/span&gt;

  &lt;span class="keyword"&gt;break&lt;/span&gt; &lt;span class="keyword"&gt;if&lt;/span&gt; &lt;span class="ident"&gt;cas_number&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;empty?&lt;/span&gt;

  &lt;span class="ident"&gt;puts&lt;/span&gt; &lt;span class="constant"&gt;CAS&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;validate&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;cas_number&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt; &lt;span class="punct"&gt;?&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;valid&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt; &lt;span class="punct"&gt;:&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;invalid&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;
&lt;span class="keyword"&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The program can be tested from the command line:&lt;/p&gt;

&lt;div class="console"&gt;
&lt;pre&gt;
$ ruby cas.rb
CAS Number: 107-07-3
valid
CAS Number: 107-87-3
invalid
CAS Number:
&lt;/pre&gt;
&lt;/div&gt;
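
&lt;p&gt;The arithmetic behind those two results can be sketched on its own. The helper below is a hypothetical illustration rather than part of the program above: it drops the final digit, weights the remaining digits 1, 2, 3 and so on from the right, and returns the sum modulo 10. The name expected_check_digit and the anchored pattern are assumptions made for this example.&lt;/p&gt;

```ruby
# Hypothetical helper, not part of the original program: returns the check
# digit implied by the leading digits of a CAS number, or nil if the input
# does not have the expected shape.
def expected_check_digit(cas_number)
  # Anchor the pattern so that surrounding text cannot slip through.
  return nil unless cas_number =~ /\A\d{2,7}-\d{2}-\d\z/

  # Drop the hyphens and the trailing check digit, then weight the
  # remaining digits 1, 2, 3, ... counting from the right.
  body = cas_number.delete("-").chop
  sum = 0
  body.reverse.each_char.with_index(1) do |digit, weight|
    sum += digit.to_i * weight
  end
  sum % 10
end

puts expected_check_digit("107-07-3")  # 7*1 + 0*2 + 7*3 + 0*4 + 1*5 = 33, so 3
puts expected_check_digit("107-87-3")  # 7*1 + 8*2 + 7*3 + 0*4 + 1*5 = 49, so 9
```

&lt;p&gt;For 107-07-3 the computed digit agrees with the final digit, so the number validates. For 107-87-3 the computed digit is 9 while the final digit is 3, so it is rejected.&lt;/p&gt;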

&lt;p&gt;Note that a validated CAS number can still be absent from the CAS database; validation only shows that a number is well formed and that its check digit is consistent, so it &lt;em&gt;could&lt;/em&gt; be a real registry number.&lt;/p&gt;</description>
      <pubDate>Wed, 23 Jul 2008 12:30:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:37a66468-b6ed-4237-8fdc-81ac981466c8</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2008/07/23/validating-cas-numbers</link>
      <category>Tools</category>
      <category>cas</category>
      <category>casnumber</category>
      <category>validate</category>
      <category>ruby</category>
    </item>
    <item>
      <title>Wikipedia for Cheminformatics: A Simple Web API for Finding CAS Numbers in Compound Monographs</title>
      <description>&lt;p&gt;&lt;a href="http://wikipedia.org"&gt;&lt;img src="http://depth-first.com/demo/20070123/wikipedia.jpg" align="right"&gt;&lt;/img&gt;&lt;/a&gt;Good news for cheminformatics: Chemical Abstracts Service (CAS) &lt;a href="http://en.wikipedia.org/wiki/Wikipedia_talk:WikiProject_Chemistry/CAS_validation"&gt;has agreed&lt;/a&gt; to help Wikipedia users curate its collection of CAS numbers. As a result of the diligence of some hard-working volunteers, chemistry's most universal system for referring to chemicals can now be used far more effectively by the worlds biggest open repository of knowledge.&lt;/p&gt;

&lt;p&gt;Wouldn't it be great to be able to pull these CAS numbers from Wikipedia programmatically?&lt;/p&gt;

&lt;h4&gt;Perspective&lt;/h4&gt;

&lt;p&gt;Estimates place the number of Wikipedia pages dealing with individual &lt;a href="http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Chemicals/Inorganics"&gt;inorganic&lt;/a&gt; and &lt;a href="http://en.wikipedia.org/wiki/List_of_organic_compounds"&gt;organic&lt;/a&gt; substances in the thousands. (I'll use the term "compound monographs" to describe them.) One factor acting to keep this number low is poor visibility of these entries. Unlike most &lt;a href="http://depth-first.com/articles/2007/01/24/thirty-two-free-chemistry-databases"&gt;chemical databases&lt;/a&gt;, Wikipedia can't, by itself, be easily searched by structure. As chemically-aware tools for indexing Wikipedia begin to emerge, look for six things to happen:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The number of Wikipedia compound monographs will increase significantly.&lt;/li&gt;
&lt;li&gt;The quality of monographs for intermediate- to well-known compounds will increase substantially.&lt;/li&gt;
&lt;li&gt;Demand for user-friendly interfaces to Wikipedia's chemical content will increase.&lt;/li&gt;
&lt;li&gt;Wikipedia users will become interested in storing and finding ever more diverse kinds of information about each compound.&lt;/li&gt;
&lt;li&gt;Bench chemists will start to include Wikipedia as one of their preferred literature search techniques, leading to...&lt;/li&gt;
&lt;li&gt;More creative tools for using the chemical content of Wikipedia.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As noted previously, it wasn't too long ago that indexing of the chemical literature &lt;a href="http://depth-first.com/articles/2006/08/19/history-of-abstracting-at-chemical-abstracts-service"&gt;was done solely by volunteers&lt;/a&gt;. Wikipedia offers an intriguing way to channel the innate drive of chemists to combine their own work and experience with that of others into useful information tools for the community.&lt;/p&gt;

&lt;p&gt;But for now we are left with the question of how to index the chemical content of Wikipedia. Although a few systems have been proposed, the only practical method is through the use of CAS numbers. Which brings us to the subject of today's tutorial.&lt;/p&gt;

&lt;h4&gt;A Quick CAS Number API for Wikipedia&lt;/h4&gt;

&lt;p&gt;The Ruby program below accepts the title of any Wikipedia compound monograph and returns the CAS number for the compound being discussed, or an error message if none is found:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_ruby "&gt;&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;rubygems&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;
&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;hpricot&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;
&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;open-uri&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;
&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;cgi&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;

&lt;span class="keyword"&gt;class &lt;/span&gt;&lt;span class="class"&gt;Wikikemi&lt;/span&gt;
  &lt;span class="attribute"&gt;@cas&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;nil&lt;/span&gt;

  &lt;span class="ident"&gt;attr_reader&lt;/span&gt; &lt;span class="symbol"&gt;:cas&lt;/span&gt;

  &lt;span class="keyword"&gt;def &lt;/span&gt;&lt;span class="method"&gt;initialize&lt;/span&gt; &lt;span class="ident"&gt;title&lt;/span&gt;
    &lt;span class="ident"&gt;uri&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;URI&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;escape&lt;/span&gt;&lt;span class="punct"&gt;(&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;http://en.wikipedia.org/wiki/&lt;span class="expr"&gt;#{title}&lt;/span&gt;&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;)&lt;/span&gt;
    &lt;span class="ident"&gt;puts&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;loading... &lt;span class="expr"&gt;#{uri}&lt;/span&gt;&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;
    &lt;span class="ident"&gt;doc&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;Hpricot&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;open&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;uri&lt;/span&gt;&lt;span class="punct"&gt;))&lt;/span&gt;
    &lt;span class="ident"&gt;table&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;doc&lt;/span&gt;&lt;span class="punct"&gt;/&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;table&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;)[&lt;/span&gt;&lt;span class="number"&gt;0&lt;/span&gt;&lt;span class="punct"&gt;]&lt;/span&gt;

    &lt;span class="ident"&gt;table&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;inner_html&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;match&lt;/span&gt;&lt;span class="punct"&gt;(/&lt;/span&gt;&lt;span class="regex"&gt;([0-9]{2,7}?&lt;span class="escape"&gt;\-&lt;/span&gt;[0-9]{2}&lt;span class="escape"&gt;\-&lt;/span&gt;[0-9])&lt;/span&gt;&lt;span class="punct"&gt;/)&lt;/span&gt; &lt;span class="keyword"&gt;if&lt;/span&gt; &lt;span class="ident"&gt;table&lt;/span&gt;

    &lt;span class="attribute"&gt;@cas&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="global"&gt;$1&lt;/span&gt;
  &lt;span class="keyword"&gt;end&lt;/span&gt;
&lt;span class="keyword"&gt;end&lt;/span&gt;

&lt;span class="comment"&gt;# Returns the CAS number present in the Wikipedia monograph with&lt;/span&gt;
&lt;span class="comment"&gt;# the indicated title, or an error message if none is found. Try, for example,&lt;/span&gt;
&lt;span class="comment"&gt;# &amp;quot;benzene.&amp;quot;.&lt;/span&gt;
&lt;span class="keyword"&gt;while&lt;/span&gt; &lt;span class="constant"&gt;true&lt;/span&gt;
  &lt;span class="ident"&gt;puts&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;Enter the title of the Wikipedia page, for example: 'benzene'&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;
  &lt;span class="ident"&gt;monograph_title&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;gets&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;chomp&lt;/span&gt;
  &lt;span class="ident"&gt;w&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;Wikikemi&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;new&lt;/span&gt; &lt;span class="ident"&gt;monograph_title&lt;/span&gt;
  &lt;span class="ident"&gt;puts&lt;/span&gt; &lt;span class="ident"&gt;w&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;cas&lt;/span&gt; &lt;span class="punct"&gt;?&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;[&lt;span class="expr"&gt;#{w.cas}&lt;/span&gt;]&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt; &lt;span class="punct"&gt;:&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;CAS number not found&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;
&lt;span class="keyword"&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This program makes use of the excellent Ruby HTML parser, &lt;a href="http://code.whytheluckystiff.net/hpricot/"&gt;Hpricot&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Saving the above code to a file called &lt;strong&gt;wikikemi.rb&lt;/strong&gt;, we can run it with:&lt;/p&gt;

&lt;div class="console"&gt;
&lt;pre&gt;
$ ruby wikikemi.rb
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;For example, we can look up the CAS numbers for Ferrocene, Lipitor, or 1,2,3,4,4a,5,6,7,8,8a-Decahydronaphthalene:&lt;/p&gt;

&lt;div class="console"&gt;
&lt;pre&gt;
$ ruby wikikemi.rb
Enter the title of the Wikipedia page, for example: 'benzene'
ferrocene
loading... http://en.wikipedia.org/wiki/ferrocene
[102-54-5]
Enter the title of the Wikipedia page, for example: 'benzene'
lipitor
loading... http://en.wikipedia.org/wiki/lipitor
[134523-00-5]
Enter the title of the Wikipedia page, for example: 'benzene'
1,2,3,4,4a,5,6,7,8,8a-Decahydronaphthalene
loading... http://en.wikipedia.org/wiki/1,2,3,4,4a,5,6,7,8,8a-Decahydronaphthalene
[91-17-8]
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;All this method requires is that the Wikipedia page lists the correct CAS number in its &lt;a href="http://en.wikipedia.org/wiki/Template:Drugbox"&gt;Drugbox&lt;/a&gt; or &lt;a href="http://en.wikipedia.org/wiki/Template:Chembox_new"&gt;Chembox&lt;/a&gt; template. Fortunately, CAS has agreed to help make this happen.&lt;/p&gt;
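
&lt;p&gt;The matching step can be tried on its own, without a network connection or an HTML parser. The sketch below is an illustration under stated assumptions: the sample text is a made-up stand-in for the contents of a page's first table, and the names SAMPLE_TABLE_TEXT, CAS_PATTERN and first_cas_number are hypothetical. The pattern has the same shape as the one in the program above, with word boundaries added.&lt;/p&gt;

```ruby
# Hypothetical stand-in for the text of a page's first table; the program
# above obtains this from Hpricot instead.
SAMPLE_TABLE_TEXT = "Benzene Identifiers CAS number 71-43-2 Molar mass 78.11 g/mol"

# Same pattern shape as in the program above, with word boundaries added so
# that a longer run of digits is not cut down to a CAS-like fragment.
CAS_PATTERN = /\b(\d{2,7}-\d{2}-\d)\b/

# Returns the first CAS-like string in the text, or nil if there is none.
def first_cas_number(text)
  match = CAS_PATTERN.match(text)
  match ? match[1] : nil
end

puts first_cas_number(SAMPLE_TABLE_TEXT).inspect
puts first_cas_number("no registry number here").inspect
```

&lt;p&gt;A match only means the string has the shape of a CAS number, so passing the result through a check-digit test before trusting it is a sensible extra step.&lt;/p&gt;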

&lt;h4&gt;Conclusions&lt;/h4&gt;

&lt;p&gt;A little Ruby code is all it takes to build a working CAS number lookup system using Wikipedia. Although this may be useful as a standalone tool, it becomes much more powerful when made part of &lt;a href="http://depth-first.com/articles/2007/05/21/simple-cas-number-lookup-with-pubchem"&gt;a larger cheminformatics system&lt;/a&gt;. But that's a story for another time.&lt;/p&gt;

&lt;p&gt;See also &lt;a href="http://www.chemspider.com/blog/a-message-of-support-and-public-service-from-the-chemical-abstracts-service.html"&gt;Antony Williams' announcement on CAS and Wikipedia&lt;/a&gt;.&lt;/p&gt;</description>
      <pubDate>Wed, 02 Apr 2008 17:29:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:c11402b2-406a-4ec9-8b65-fc34da179c1a</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2008/04/02/wikipedia-for-cheminformatics-a-simple-web-api-for-finding-cas-numbers-in-compound-monographs</link>
      <category>Tools</category>
      <category>cas</category>
      <category>acs</category>
      <category>casnumber</category>
      <category>lookup</category>
      <category>wikipedia</category>
      <category>ruby</category>
    </item>
    <item>
      <title>Hacking PubChem: Visually Inspect Results for CAS Number and Keyword Searches</title>
      <description>&lt;p&gt;&lt;a href="http://pubchem.ncbi.nlm.nih.gov/"&gt;&lt;img src="http://depth-first.com/files/pubchemlogo.gif" align="right"&gt;&lt;/img&gt;&lt;/a&gt;A recent article described how PubChem could be used to &lt;a href="http://depth-first.com/articles/2007/09/13/hacking-pubchem-convert-cas-numbers-into-pubchem-cids-with-ruby"&gt;quickly search for CAS numbers&lt;/a&gt;. Although useful, the approach is limited in that only an array of PubChem CIDs was returned. What would be really useful would be a simple way to create a report with entries hyperlinked into the PubChem site itself to aid in visual inspection. In this tutorial, we'll see how an HTML template and a few extra lines of code can do just that.&lt;/p&gt;

&lt;h4&gt;The Template&lt;/h4&gt;

&lt;p&gt;Ruby supports a number of HTML templating mechanisms. In this example, we'll use an ERB template resurrected from the &lt;a href="http://depth-first.com/articles/2006/12/11/hacking-molbank-creating-a-graphical-table-of-contents"&gt;Molbank graphical table of contents&lt;/a&gt; tutorial:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_xml "&gt;&lt;span class="punct"&gt;&amp;lt;&lt;/span&gt;&lt;span class="tag"&gt;html&lt;/span&gt;&lt;span class="punct"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="punct"&gt;&amp;lt;&lt;/span&gt;&lt;span class="tag"&gt;head&lt;/span&gt;&lt;span class="punct"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="punct"&gt;&amp;lt;&lt;/span&gt;&lt;span class="tag"&gt;title&lt;/span&gt;&lt;span class="punct"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="punct"&gt;&amp;lt;%=&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;PubChem Search for #{term}&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt; %&lt;span class="punct"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="punct"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="tag"&gt;title&lt;/span&gt;&lt;span class="punct"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="punct"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="tag"&gt;head&lt;/span&gt;&lt;span class="punct"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="punct"&gt;&amp;lt;&lt;/span&gt;&lt;span class="tag"&gt;body&lt;/span&gt;&lt;span class="punct"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="punct"&gt;&amp;lt;&lt;/span&gt;&lt;span class="tag"&gt;h1&lt;/span&gt;&lt;span class="punct"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="punct"&gt;&amp;lt;%=&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;Search: #{term}&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt; %&lt;span class="punct"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="punct"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="tag"&gt;h1&lt;/span&gt;&lt;span class="punct"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="punct"&gt;&amp;lt;&lt;/span&gt;&lt;span class="tag"&gt;table&lt;/span&gt;&lt;span class="punct"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="punct"&gt;&amp;lt;&lt;/span&gt;&lt;span class="tag"&gt;tr&lt;/span&gt;&lt;span class="punct"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="punct"&gt;&amp;lt;%&lt;/span&gt; &lt;span class="attribute"&gt;col&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="number"&gt;0&lt;/span&gt; %&lt;span class="punct"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="punct"&gt;&amp;lt;%&lt;/span&gt; &lt;span class="attribute"&gt;cids.each&lt;/span&gt; &lt;span class="attribute"&gt;do&lt;/span&gt; |&lt;span class="attribute"&gt;cid|&lt;/span&gt; %&lt;span class="punct"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="punct"&gt;&amp;lt;&lt;/span&gt;&lt;span class="tag"&gt;td&lt;/span&gt;&lt;span class="punct"&gt;&amp;gt;&lt;/span&gt;
          &lt;span class="punct"&gt;&amp;lt;%&lt;/span&gt; &lt;span class="attribute"&gt;image&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;http://pubchem.ncbi.nlm.nih.gov/image/imgsrv.fcgi?cid=#{cid}&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt; %&lt;span class="punct"&gt;&amp;gt;&lt;/span&gt;
          &lt;span class="punct"&gt;&amp;lt;%&lt;/span&gt; &lt;span class="attribute"&gt;summary&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=#{cid}&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt; %&lt;span class="punct"&gt;&amp;gt;&lt;/span&gt;
          &lt;span class="punct"&gt;&amp;lt;&lt;/span&gt;&lt;span class="tag"&gt;a&lt;/span&gt; &lt;span class="attribute"&gt;href&lt;/span&gt;&lt;span class="punct"&gt;=&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;&amp;lt;%= summary %&amp;gt;&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&amp;gt;&lt;/span&gt;
            &lt;span class="punct"&gt;&amp;lt;&lt;/span&gt;&lt;span class="tag"&gt;img&lt;/span&gt; &lt;span class="attribute"&gt;src&lt;/span&gt;&lt;span class="punct"&gt;=&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;&amp;lt;%= image %&amp;gt;&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt; &lt;span class="attribute"&gt;border&lt;/span&gt;&lt;span class="punct"&gt;=&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;2&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&amp;gt;&amp;lt;/&lt;/span&gt;&lt;span class="tag"&gt;img&lt;/span&gt;&lt;span class="punct"&gt;&amp;gt;&lt;/span&gt;
          &lt;span class="punct"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="tag"&gt;a&lt;/span&gt;&lt;span class="punct"&gt;&amp;gt;&lt;/span&gt;
          &lt;span class="punct"&gt;&amp;lt;&lt;/span&gt;&lt;span class="tag"&gt;center&lt;/span&gt;&lt;span class="punct"&gt;&amp;gt;&lt;/span&gt;
            &lt;span class="punct"&gt;&amp;lt;&lt;/span&gt;&lt;span class="tag"&gt;span&lt;/span&gt; &lt;span class="attribute"&gt;style&lt;/span&gt;&lt;span class="punct"&gt;=&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;font-size: 8px&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&amp;gt;&lt;/span&gt;
              &lt;span class="punct"&gt;&amp;lt;&lt;/span&gt;&lt;span class="tag"&gt;a&lt;/span&gt; &lt;span class="attribute"&gt;href&lt;/span&gt;&lt;span class="punct"&gt;=&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;&amp;lt;%= summary %&amp;gt;&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&amp;gt;&amp;lt;%=&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;CID-#{cid}&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt; %&lt;span class="punct"&gt;&amp;gt;&amp;lt;/&lt;/span&gt;&lt;span class="tag"&gt;a&lt;/span&gt;&lt;span class="punct"&gt;&amp;gt;&lt;/span&gt;
            &lt;span class="punct"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="tag"&gt;span&lt;/span&gt;&lt;span class="punct"&gt;&amp;gt;&lt;/span&gt;
          &lt;span class="punct"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="tag"&gt;center&lt;/span&gt;&lt;span class="punct"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="punct"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="tag"&gt;td&lt;/span&gt;&lt;span class="punct"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="punct"&gt;&amp;lt;%&lt;/span&gt; &lt;span class="attribute"&gt;col&lt;/span&gt; +&lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="number"&gt;1&lt;/span&gt; %&lt;span class="punct"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="punct"&gt;&amp;lt;%&lt;/span&gt; &lt;span class="attribute"&gt;if&lt;/span&gt; &lt;span class="attribute"&gt;col&lt;/span&gt; &lt;span class="punct"&gt;&amp;gt;&lt;/span&gt; 5 %&amp;gt;
          &lt;span class="punct"&gt;&amp;lt;%&lt;/span&gt; &lt;span class="attribute"&gt;col&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="number"&gt;0&lt;/span&gt; %&lt;span class="punct"&gt;&amp;gt;&lt;/span&gt;
          &lt;span class="punct"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="tag"&gt;tr&lt;/span&gt;&lt;span class="punct"&gt;&amp;gt;&lt;/span&gt;
          &lt;span class="punct"&gt;&amp;lt;&lt;/span&gt;&lt;span class="tag"&gt;tr&lt;/span&gt;&lt;span class="punct"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="punct"&gt;&amp;lt;%&lt;/span&gt; &lt;span class="attribute"&gt;end&lt;/span&gt; %&lt;span class="punct"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="punct"&gt;&amp;lt;%&lt;/span&gt;&lt;span class="attribute"&gt;end&lt;/span&gt; %&lt;span class="punct"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="punct"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="tag"&gt;tr&lt;/span&gt;&lt;span class="punct"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="punct"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="tag"&gt;table&lt;/span&gt;&lt;span class="punct"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="punct"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="tag"&gt;body&lt;/span&gt;&lt;span class="punct"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="punct"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="tag"&gt;html&lt;/span&gt;&lt;span class="punct"&gt;&amp;gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The above template uses a search term and an array of CIDs to build a table of results. Each cell in the table contains a color 2D image and the CID, both hyperlinked into PubChem itself.&lt;/p&gt;
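
&lt;p&gt;The template's col counter starts a new table row after every sixth cell. The same grouping can be sketched in plain Ruby with each_slice, which may be easier to reason about than a counter. The CIDs below are invented placeholders for illustration, not real search results.&lt;/p&gt;

```ruby
# Hypothetical CIDs, used only to show the grouping.
cids = (1..14).map { |n| (1000 + n).to_s }

# Six cells per row, matching the "col > 5" test in the template.
rows = cids.each_slice(6).to_a

rows.each_with_index do |row, index|
  puts "row #{index + 1}: #{row.join(' ')}"
end
```

&lt;p&gt;One difference is worth noting: when the number of CIDs is an exact multiple of six, the counter in the template leaves an empty trailing row, whereas each_slice does not.&lt;/p&gt;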

&lt;p&gt;Saving this template to a file called &lt;strong&gt;template.rhtml&lt;/strong&gt; is all we need to do.&lt;/p&gt;

&lt;h4&gt;The Library&lt;/h4&gt;

&lt;p&gt;The library is a modification of the one shown in &lt;a href="http://depth-first.com/articles/2007/09/13/hacking-pubchem-convert-cas-numbers-into-pubchem-cids-with-ruby"&gt;the previous article&lt;/a&gt; in this series:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_ruby "&gt;&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;rubygems&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;
&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;mechanize&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;
&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;erb&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;

&lt;span class="keyword"&gt;module &lt;/span&gt;&lt;span class="module"&gt;PubChemTerms&lt;/span&gt;
  &lt;span class="keyword"&gt;def &lt;/span&gt;&lt;span class="method"&gt;report&lt;/span&gt; &lt;span class="ident"&gt;term&lt;/span&gt;
    &lt;span class="ident"&gt;cids&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;get_cids&lt;/span&gt; &lt;span class="ident"&gt;term&lt;/span&gt;
    &lt;span class="ident"&gt;erb&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;ERB&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;new&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="constant"&gt;IO&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;read&lt;/span&gt;&lt;span class="punct"&gt;(&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;template.rhtml&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;))&lt;/span&gt;

    &lt;span class="constant"&gt;File&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;open&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;output.html&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;,&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;w+&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt; &lt;span class="keyword"&gt;do&lt;/span&gt; &lt;span class="punct"&gt;|&lt;/span&gt;&lt;span class="ident"&gt;file&lt;/span&gt;&lt;span class="punct"&gt;|&lt;/span&gt;
      &lt;span class="ident"&gt;file&lt;/span&gt; &lt;span class="punct"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="ident"&gt;erb&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;result&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;binding&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;
    &lt;span class="keyword"&gt;end&lt;/span&gt;
  &lt;span class="keyword"&gt;end&lt;/span&gt;

  &lt;span class="keyword"&gt;def &lt;/span&gt;&lt;span class="method"&gt;get_cids&lt;/span&gt; &lt;span class="ident"&gt;term&lt;/span&gt;
    &lt;span class="ident"&gt;agent&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;WWW&lt;/span&gt;&lt;span class="punct"&gt;::&lt;/span&gt;&lt;span class="constant"&gt;Mechanize&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;new&lt;/span&gt;
    &lt;span class="ident"&gt;page&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;agent&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;get&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pccompound&amp;amp;retmax=100&amp;amp;term=&lt;span class="expr"&gt;#{term}&lt;/span&gt;&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;

    &lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;page&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;parser&lt;/span&gt;&lt;span class="punct"&gt;/&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;id&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;).&lt;/span&gt;&lt;span class="ident"&gt;collect&lt;/span&gt; &lt;span class="punct"&gt;{|&lt;/span&gt;&lt;span class="ident"&gt;id&lt;/span&gt;&lt;span class="punct"&gt;|&lt;/span&gt; &lt;span class="ident"&gt;id&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;innerHTML&lt;/span&gt;&lt;span class="punct"&gt;}&lt;/span&gt;
  &lt;span class="keyword"&gt;end&lt;/span&gt;
&lt;span class="keyword"&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The method &lt;tt&gt;report&lt;/tt&gt; accepts a search term and uses our template to render a report.&lt;/p&gt;
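
&lt;p&gt;Because get_cids interpolates the search term directly into the query string, a term containing spaces or other reserved characters may not be sent as intended. One possible refinement, sketched below, builds the query with URI.encode_www_form from Ruby's standard library. The helper name esearch_url is hypothetical; the endpoint and parameters are the ones used in the library above.&lt;/p&gt;

```ruby
require "uri"

# Hypothetical helper: builds the E-utilities query URL used by get_cids,
# form-encoding the search term along with the other parameters.
def esearch_url(term, retmax = 100)
  query = URI.encode_www_form(db: "pccompound", retmax: retmax, term: term)
  "http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?#{query}"
end

puts esearch_url("119141-88-7")
puts esearch_url("acetic acid")
```

&lt;p&gt;A CAS number passes through unchanged, while the space in a multi-word term is encoded as a plus sign.&lt;/p&gt;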

&lt;h4&gt;Testing&lt;/h4&gt;

&lt;p&gt;By saving the above library in a file called &lt;strong&gt;pubchem.rb&lt;/strong&gt;, we can search by keyword via Interactive Ruby (irb):&lt;/p&gt;

&lt;div class="console"&gt;
&lt;pre&gt;
$ irb
irb(main):001:0&gt; require 'pubchem'
=&gt; true
irb(main):002:0&gt; include PubChemTerms
=&gt; Object
irb(main):003:0&gt; report 'esomeprazole'
=&gt; #&lt;File:output.html (closed)&gt;
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;This produces a file called &lt;strong&gt;output.html&lt;/strong&gt; that can be viewed with any browser:&lt;/p&gt;

&lt;p&gt;&lt;center&gt;&lt;img src="http://depth-first.com/demo/20070925/screenshot.png"&gt;&lt;/img&gt;&lt;/center&gt;&lt;/p&gt;

&lt;p&gt;As in the original version of the library, we can also query by CAS number:&lt;/p&gt;

&lt;div class="console"&gt;
&lt;pre&gt;
$ irb
irb(main):001:0&gt; require 'pubchem'
=&gt; true
irb(main):002:0&gt; include PubChemTerms
=&gt; Object
irb(main):003:0&gt; report '119141-88-7'
=&gt; #&lt;File:output.html (closed)&gt;
&lt;/pre&gt;
&lt;/div&gt;

&lt;h4&gt;Conclusions&lt;/h4&gt;

&lt;p&gt;The simple approach outlined here could be extended in many ways. For example, we could easily retrieve molfiles based on a keyword or CAS number search. We could pipe queries together or work with query lists. We could &lt;a href="http://depth-first.com/articles/2007/09/17/hacking-chemspider-query-by-smiles-and-inchi-with-ruby"&gt;blend in ChemSpider data&lt;/a&gt;. We could even build a simple Web application (with &lt;a href="http://rubyonrails.org"&gt;Rails&lt;/a&gt;) that returns customized reports. Mixing in &lt;a href="http://depth-first.com/articles/tag/rcdk"&gt;Ruby CDK&lt;/a&gt; or &lt;a href="http://depth-first.com/articles/tag/rubyopenbabel"&gt;Ruby Open Babel&lt;/a&gt; offers still more possibilities.&lt;/p&gt;

&lt;p&gt;Increasingly, the most important question in cheminformatics is not "What can we build?", but rather "What should we build?" Success in this new world requires a much deeper understanding of how cheminformatics software is being used by real chemists and where it's not.&lt;/p&gt;</description>
      <pubDate>Tue, 25 Sep 2007 10:55:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:5b1ef92b-4ed3-443e-a683-dc37d23c4352</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2007/09/25/hacking-pubchem-visually-inspect-results-for-cas-number-and-keyword-searches</link>
      <category>Tools</category>
      <category>pubchem</category>
      <category>casnumber</category>
      <category>cas</category>
      <category>ruby</category>
      <category>keyword</category>
      <category>erb</category>
      <category>html</category>
      <category>entrez</category>
    </item>
    <item>
      <title>Hacking PubChem: Convert CAS Numbers into PubChem CIDs with Ruby</title>
      <description>&lt;p&gt;&lt;a href="http://pubchem.ncbi.nlm.nih.gov/"&gt;&lt;img src="http://depth-first.com/files/pubchemlogo.gif" align="right" border="0"&gt;&lt;/img&gt;&lt;/a&gt;Although the PubChem system has been discussed in &lt;a href="http://depth-first.com/articles/tag/pubchem"&gt;numerous recent D-F articles&lt;/a&gt; and elsewhere, there's much more to the story that hasn't been told. One of the more intriguing things PubChem can do is &lt;a href="http://depth-first.com/articles/2007/05/21/simple-cas-number-lookup-with-pubchem"&gt;look up CAS Numbers for free&lt;/a&gt;. In this tutorial, we'll see how a simple Ruby script can be used to automate the conversion of CAS numbers into PubChem Compound IDs (CIDs).&lt;/p&gt;

&lt;h4&gt;The Library&lt;/h4&gt;

&lt;p&gt;Our library needs to accept a CAS number and return an array of PubChem CIDs in response. To do this, we'll make use of the &lt;a href="http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html"&gt;Entrez eUtils system&lt;/a&gt;. Although Entrez is incredibly complex, the only two things that matter now are that the NIH requires automated scripts to access most of its databases through Entrez, and that Entrez can be used to perform PubChem keyword queries.&lt;/p&gt;

&lt;p&gt;The library is simplicity itself:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_ruby "&gt;&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;rubygems&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;
&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;mechanize&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;

&lt;span class="keyword"&gt;module &lt;/span&gt;&lt;span class="module"&gt;PubChemTerms&lt;/span&gt;
  &lt;span class="keyword"&gt;def &lt;/span&gt;&lt;span class="method"&gt;get_cids&lt;/span&gt; &lt;span class="ident"&gt;term&lt;/span&gt;
    &lt;span class="ident"&gt;agent&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;WWW&lt;/span&gt;&lt;span class="punct"&gt;::&lt;/span&gt;&lt;span class="constant"&gt;Mechanize&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;new&lt;/span&gt;
    &lt;span class="ident"&gt;page&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;agent&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;get&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pccompound&amp;amp;retmax=100&amp;amp;term=&lt;span class="expr"&gt;#{term}&lt;/span&gt;&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;

    &lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;page&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;parser&lt;/span&gt;&lt;span class="punct"&gt;/&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;id&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;).&lt;/span&gt;&lt;span class="ident"&gt;collect&lt;/span&gt; &lt;span class="punct"&gt;{|&lt;/span&gt;&lt;span class="ident"&gt;id&lt;/span&gt;&lt;span class="punct"&gt;|&lt;/span&gt; &lt;span class="ident"&gt;id&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;innerHTML&lt;/span&gt;&lt;span class="punct"&gt;}&lt;/span&gt;
  &lt;span class="keyword"&gt;end&lt;/span&gt;
&lt;span class="keyword"&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The excellent Ruby library &lt;a href="http://mechanize.rubyforge.org/"&gt;Mechanize&lt;/a&gt; is used for submitting queries and processing the results. (This is the same library that was used to &lt;a href="http://depth-first.com/articles/2007/06/27/easily-convert-publisher-urls-and-dois-to-bibliographical-citations-synthesis-synlett-ruby-and-mechanize"&gt;extract full bibliographical information&lt;/a&gt; from nothing more than a DOI.) The only remarkable thing about the library above is how unremarkable it is.&lt;/p&gt;
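
&lt;p&gt;One detail worth noting: &lt;tt&gt;get_cids&lt;/tt&gt; places the search term into the URL as-is, which works for CAS numbers but may not for keywords containing spaces or other reserved characters. A minimal sketch of one way to build the same query string with escaping, using Ruby's standard &lt;tt&gt;uri&lt;/tt&gt; library, follows. The module name &lt;tt&gt;EsearchUrl&lt;/tt&gt; and its &lt;tt&gt;build&lt;/tt&gt; method are hypothetical and not part of the library above.&lt;/p&gt;

```ruby
require 'uri'

# Hypothetical helper, not part of the original library: builds the
# ESearch URL used by get_cids, escaping the search term on the way.
module EsearchUrl
  BASE = 'http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'

  # term:   a CAS number or keyword
  # retmax: maximum number of IDs to request (100 in the original library)
  def self.build(term, retmax = 100)
    # encode_www_form joins the pairs and percent-encodes each value;
    # spaces become "+" characters.
    query = URI.encode_www_form([['db', 'pccompound'], ['retmax', retmax], ['term', term]])
    "#{BASE}?#{query}"
  end
end

puts EsearchUrl.build('64318-79-2')
puts EsearchUrl.build('acetic acid')
```

&lt;p&gt;The resulting string could be passed to &lt;tt&gt;agent.get&lt;/tt&gt; in place of the interpolated URL.&lt;/p&gt;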

&lt;h4&gt;A Test&lt;/h4&gt;

&lt;p&gt;We can test the library by saving it in a file called &lt;strong&gt;entrez.rb&lt;/strong&gt; and starting an interactive Ruby (irb) session. Opening up my copy of the Merck index to a random page and selecting an entry gives a CAS number to try (64318-79-2 - gemeprost). Plugging this CAS number into our irb session gives:&lt;/p&gt;

&lt;div class="console"&gt;
&lt;pre&gt;
$ irb
irb(main):001:0&gt; require 'entrez'
=&gt; true
irb(main):002:0&gt; include PubChemTerms
=&gt; Object
irb(main):003:0&gt; get_cids '64318-79-2'
=&gt; ["5282237", "6434870"]
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Our library has returned a Ruby array containing two compound identifiers. We can use PubChem to view their records &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=5282237"&gt;here&lt;/a&gt; and &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=6434870"&gt;here&lt;/a&gt;. Visual inspection reveals these two compounds to be isomers of each other, with the first member of the array containing the direct hit.&lt;/p&gt;

&lt;p&gt;Let's try another CAS number selected from another random Merck index entry:&lt;/p&gt;

&lt;div class="console"&gt;
&lt;pre&gt;
irb(main):004:0&gt; get_cids '66981-73-5'
=&gt; ["68870", "169125"]
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Again we've obtained two CIDs, with the &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=68870"&gt;first one&lt;/a&gt; being the neutral form and the second one being the &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=169125"&gt;sodium salt&lt;/a&gt; of the antidepressant tianeptine.&lt;/p&gt;

&lt;h4&gt;Applications&lt;/h4&gt;

&lt;p&gt;Now, instead of converting one or two CAS numbers, imagine we've got a few thousand. Our library could be easily adapted to this purpose. The only caveat is that we'd need to &lt;a href="http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html#UserSystemRequirements"&gt;observe the Entrez use policy&lt;/a&gt; and not overload the server with too many requests. We could build in a delay with Ruby's &lt;tt&gt;sleep&lt;/tt&gt; method.&lt;/p&gt;
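
&lt;p&gt;A rough sketch of how such a delay might look is given below. The module &lt;tt&gt;PoliteBatch&lt;/tt&gt;, its method &lt;tt&gt;lookup_all&lt;/tt&gt;, and the three-second default are all assumptions made for illustration; the appropriate delay should be taken from the Entrez use policy itself. The actual lookup is supplied as a block, so the helper does not depend on any particular version of &lt;tt&gt;get_cids&lt;/tt&gt;.&lt;/p&gt;

```ruby
# Hypothetical helper, not part of the original library: looks up many
# terms one at a time, pausing between consecutive requests.
module PoliteBatch
  # terms: array of CAS numbers or keywords
  # delay: seconds to wait between requests (an assumed default; consult
  #        the Entrez use policy for the real limit)
  # The block performs one lookup and returns its result.
  def self.lookup_all(terms, delay = 3)
    results = {}
    terms.each_with_index do |term, index|
      # No pause is needed before the very first request.
      sleep(delay) unless index.zero?
      results[term] = yield(term)
    end
    results
  end
end
```

&lt;p&gt;With the library loaded, a call such as &lt;tt&gt;PoliteBatch.lookup_all(['64318-79-2', '66981-73-5']) { |cas| get_cids(cas) }&lt;/tt&gt; would return a hash that maps each CAS number to its array of CIDs.&lt;/p&gt;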

&lt;p&gt;Notice that the library can be used to search for any keyword - not just CAS numbers. For example:&lt;/p&gt;

&lt;div class="console"&gt;
&lt;pre&gt;
$ irb
irb(main):001:0&gt; require 'entrez'
=&gt; true
irb(main):002:0&gt; include PubChemTerms
=&gt; Object
irb(main):003:0&gt; get_cids 'anandamide'
=&gt; ["5281969", "5283455", "5283388", "4671", "5353407", "5283452", "5283456", "5283451", "5283450", "5283449", "5283448", "5283447", "5283445", "5283444"]
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;As with our previous queries, this search returns multiple CIDs associated with the term 'anandamide', with the first one being the direct hit.&lt;/p&gt;

&lt;h4&gt;Conclusions&lt;/h4&gt;

&lt;p&gt;Our little library isn't perfect, but it performs a very difficult task cheaply and conveniently in the majority of cases. By mashing up this functionality with other Ruby cheminformatics libraries (for example &lt;a href="http://depth-first.com/articles/2007/04/09/painless-installation-of-ruby-open-babel"&gt;Ruby Open Babel&lt;/a&gt; and &lt;a href="http://depth-first.com/articles/tag/rubycdk"&gt;Ruby CDK&lt;/a&gt;), a variety of tough and highly practical cheminformatics problems can be solved elegantly. Look to further installments of the Hacking PubChem series to find out how.&lt;/p&gt;</description>
      <pubDate>Thu, 13 Sep 2007 09:36:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:12e717a5-01e2-4b4c-871b-0a86c69f55e4</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2007/09/13/hacking-pubchem-convert-cas-numbers-into-pubchem-cids-with-ruby</link>
      <category>Tools</category>
      <category>pubchem</category>
      <category>hackingpubchem</category>
      <category>ruby</category>
      <category>entrez</category>
      <category>casnumber</category>
      <category>keyword</category>
    </item>
    <item>
      <title>Simple CAS Number Lookup with PubChem</title>
      <description>&lt;p&gt;&lt;a href="http://pubchem.ncbi.nlm.nih.gov/"&gt;&lt;img src="http://depth-first.com/files/pubchemlogo.gif" align="right" border="none"&gt;&lt;/img&gt;&lt;/a&gt;&lt;a href="http://www.cas.org/expertise/cascontent/registry/regsys.html"&gt;CAS Registry Numbers&lt;/a&gt; simplify the thorny problem of referring to chemical substances. These short numerical sequences are arguably the most widely-used form of molecular identifier, appearing on reagent bottles, in publications, in patents and patent applications, and MSDS sheets.&lt;/p&gt;

&lt;p&gt;During my time as a synthetic organic chemist, I would sometimes run into the problem of finding the structure of a molecule represented by a CAS number. A common case was when an ambiguous, incomprehensible, or blurred IUPAC name was printed on a reagent bottle along with a CAS number. By looking up the CAS number, I could confirm the bottle's contents.&lt;/p&gt;

&lt;p&gt;Your first impulse when looking up a CAS number might be to fire up &lt;a href="http://www.cas.org/SCIFINDER/"&gt;SciFinder&lt;/a&gt;. For years this was the only option. Those days are quickly starting to seem as quaint as when people actually wrote on pieces of paper and dropped them in mailboxes (&lt;a href="http://netflix.com"&gt;dropping DVDs in a mailbox&lt;/a&gt; is a different matter).&lt;/p&gt;

&lt;p&gt;A little-publicized feature of PubChem makes it an ideal way to quickly find the structure associated with a CAS Number. To use it, you need nothing more than a computer, a browser, and an internet connection.&lt;/p&gt;

&lt;p&gt;Browse over to the &lt;a href="http://pubchem.ncbi.nlm.nih.gov/"&gt;PubChem&lt;/a&gt; welcome page. At the top you'll find a search box. Enter your CAS number and press "Go." For this example, I'm using the CAS number for 2,5-Pyrazinedicarboxylic acid dihydrate:&lt;/p&gt;

&lt;p&gt;&lt;center&gt;&lt;img src="http://depth-first.com/demo/20070521/screenshot.png"&gt;&lt;/img&gt;&lt;/center&gt;&lt;/p&gt;

&lt;p&gt;If all goes well, you should see a results screen containing the structure of your compound and a link to its summary page:&lt;/p&gt;

&lt;p&gt;&lt;center&gt;&lt;img src="http://depth-first.com/demo/20070521/screenshot2.png"&gt;&lt;/img&gt;&lt;/center&gt;&lt;/p&gt;

&lt;p&gt;Does this seem a little too good to be true? Try it for yourself. Pick up a copy of the Aldrich catalog, Merck index, or anything else that lists lots of CAS numbers. Choose several structures at random and see how PubChem performs.&lt;/p&gt;

&lt;p&gt;There are limitations to this method. PubChem generally doesn't index large molecules such as polymers and peptides, so they won't be found by this method. Similarly, if a CAS number doesn't point to a distinct molecular entity (e.g. "mineral oil"), PubChem won't find it either. But these are hardly limitations in the vast majority of cases.&lt;/p&gt;

&lt;p&gt;With the &lt;a href="http://www.corporate-ir.net/ireye/ir_site.zhtml?ticker=SIAL&amp;amp;script=410&amp;amp;layout=-6&amp;amp;item_id=984368"&gt;recent addition of Sigma-Aldrich&lt;/a&gt; as a PubChem compound supplier, it won't be long before smaller companies begin following suit. What we're seeing with PubChem is a classic example of a &lt;a href="http://en.wikipedia.org/wiki/Network_effect"&gt;network effect&lt;/a&gt;. The end result should come as a surprise to nobody.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Update: &lt;a href="http://chempedia.com"&gt;Chempedia&lt;/a&gt; offers a more detailed &lt;a href="http://depth-first.com/articles/2008/05/26/simple-cas-number-lookup-and-more-with-chempedia"&gt;CAS Number Lookup&lt;/a&gt; service.&lt;/em&gt;&lt;/p&gt;</description>
      <pubDate>Mon, 21 May 2007 11:46:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:e20e2fc2-e99e-4171-8055-1493bcb31d65</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2007/05/21/simple-cas-number-lookup-with-pubchem</link>
      <category>Databases</category>
      <category>cas</category>
      <category>pubchem</category>
      <category>casnumber</category>
      <category>lookup</category>
      <category>networkeffect</category>
    </item>
    <item>
      <title>Hashing InChIs</title>
      <description>&lt;p&gt;The InChI team has announced &lt;a href="http://chemdata.nist.gov/InChI/inchi-hash.pdf"&gt;a proposal&lt;/a&gt; for a standardized InChI hashing mechanism. This would create a free, fixed-length, alphanumeric molecular identifier.&lt;/p&gt;

&lt;p&gt;This is an excellent proposal. One of the biggest problems in working with InChIs (and other line notations such as SMILES) is that even medium-sized molecules produce very long identifiers. Another problem is the use of characters that must be escaped in URLs. The hashing proposal addresses both of these issues, getting very close to creating &lt;a href="http://depth-first.com/articles/2007/03/14/eleven-qualities-of-the-perfect-line-notation-for-the-web"&gt;the optimal molecular identifier&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For example, imagine the convenience of being able to refer to a molecule by a universally-recognized, machine-generated string like the one shown below:&lt;/p&gt;

&lt;p&gt;AAAAAAAAAAA-BBBBBBB-XYZ&lt;/p&gt;

&lt;p&gt;This is something that actually stands a chance of getting printed on reagent bottles, in catalogs, in patent applications, or anywhere else chemists are using chemical information. Aside from its length, it's not too different from that &lt;a href="http://www.cas.org/expertise/cascontent/registry/regsys.html"&gt;other molecular identifier system&lt;/a&gt;, but without the perpetual use tax.&lt;/p&gt;

&lt;p&gt;There are at least three downsides to this approach:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;For most purposes, hashing is a one-way process. It would become virtually impossible to computationally convert this hashed identifier back into its InChI or molecular representation. On the other hand, this could create a market for cryptography experts in cheminformatics. A hashed-InChI lookup service would start to look very useful.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Because of the one-way nature of hashing, the authenticity of a hashed InChI couldn't be directly verified. Checksums will help, but the fundamental problem remains. InChI itself can be &lt;a href="http://depth-first.com/articles/2006/09/19/decoding-inchis-with-rino"&gt;decoded&lt;/a&gt;, and therefore authenticated.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It's possible, although extremely unlikely, that two different molecules will end up having the same hashed InChI. Reducing the collision probability means increasing the length of the identifier.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
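
&lt;p&gt;To make the trade-off between identifier length and collision risk concrete, here is a small sketch. It is not the scheme described in the InChI team's proposal: it simply reduces a SHA-256 digest to a fixed number of uppercase letters, and estimates collision probability with the standard birthday approximation. The module &lt;tt&gt;InchiHashSketch&lt;/tt&gt;, its method names, and the default length of 14 letters are all assumptions chosen for illustration.&lt;/p&gt;

```ruby
require 'digest'

# Hypothetical illustration only: NOT the InChI team's proposed scheme,
# just a generic fixed-length hash of an InChI string.
module InchiHashSketch
  LETTERS = ('A'..'Z').to_a

  # Returns `length` uppercase letters derived from the SHA-256 digest
  # of the given InChI. The same input always gives the same output.
  def self.identifier(inchi, length = 14)
    number = Digest::SHA256.hexdigest(inchi).to_i(16)
    Array.new(length) do
      # Peel off one base-26 digit per letter.
      number, remainder = number.divmod(26)
      LETTERS[remainder]
    end.join
  end

  # Birthday approximation: the chance that at least two of `count`
  # distinct molecules share an identifier of `length` letters.
  def self.collision_probability(count, length = 14)
    space = 26.0**length
    1.0 - Math.exp(-(count.to_f**2) / (2.0 * space))
  end
end

puts InchiHashSketch.identifier('InChI=1/CH4/h1H4')
puts InchiHashSketch.collision_probability(100_000_000)
puts InchiHashSketch.collision_probability(100_000_000, 20)
```

&lt;p&gt;Comparing the last two values shows the effect described in the third point: adding letters shrinks the estimated collision probability, at the cost of a longer identifier.&lt;/p&gt;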

&lt;p&gt;As in any design decision, the question is whether the benefits outweigh the disadvantages.&lt;/p&gt;

&lt;p&gt;Anyone is free to develop their own InChI hash system. Several people, including me, already have. But by introducing a standard mechanism, the InChI team has the potential to create a molecular identifier that is both &lt;em&gt;free&lt;/em&gt; and easy to use.&lt;/p&gt;</description>
      <pubDate>Wed, 09 May 2007 14:01:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:57c15c90-1d32-4c6d-a46c-46a765320b6b</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2007/05/09/hashing-inchis</link>
      <category>Meta</category>
      <category>inchi</category>
      <category>hash</category>
      <category>casnumber</category>
    </item>
  </channel>
</rss>
