<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="/stylesheets/rss.css"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/">
  <channel>
    <title>Depth-First: Wikipedia for Cheminformatics: A Simple Web API for Finding CAS Numbers in Compound Monographs</title>
    <link>http://depth-first.com/articles/2008/04/02/wikipedia-for-cheminformatics-a-simple-web-api-for-finding-cas-numbers-in-compound-monographs</link>
    <language>en-us</language>
    <ttl>40</ttl>
    <description>Walking the Web of Chemical Informatics</description>
    <item>
      <title>Wikipedia for Cheminformatics: A Simple Web API for Finding CAS Numbers in Compound Monographs</title>
      <description>&lt;p&gt;&lt;a href="http://wikipedia.org"&gt;&lt;img src="http://depth-first.com/demo/20070123/wikipedia.jpg" align="right"&gt;&lt;/img&gt;&lt;/a&gt;Good news for cheminformatics: Chemical Abstracts Service (CAS) &lt;a href="http://en.wikipedia.org/wiki/Wikipedia_talk:WikiProject_Chemistry/CAS_validation"&gt;has agreed&lt;/a&gt; to help Wikipedia users curate the site's collection of CAS numbers. As a result of the diligence of some hard-working volunteers, chemistry's most universal system for referring to chemicals can now be used far more effectively by the world's biggest open repository of knowledge.&lt;/p&gt;

&lt;p&gt;Wouldn't it be great to be able to pull these CAS numbers from Wikipedia programmatically?&lt;/p&gt;

&lt;h4&gt;Perspective&lt;/h4&gt;

&lt;p&gt;Estimates place the number of Wikipedia pages dealing with individual &lt;a href="http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Chemicals/Inorganics"&gt;inorganic&lt;/a&gt; and &lt;a href="http://en.wikipedia.org/wiki/List_of_organic_compounds"&gt;organic&lt;/a&gt; substances in the thousands. (I'll use the term "compound monographs" to describe them.) One factor acting to keep this number low is poor visibility of these entries. Unlike most &lt;a href="http://depth-first.com/articles/2007/01/24/thirty-two-free-chemistry-databases"&gt;chemical databases&lt;/a&gt;, Wikipedia can't, by itself, be easily searched by structure. As chemically-aware tools for indexing Wikipedia begin to emerge, look for six things to happen:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The number of Wikipedia compound monographs will increase significantly.&lt;/li&gt;
&lt;li&gt;The quality of monographs for intermediate- to well-known compounds will increase substantially.&lt;/li&gt;
&lt;li&gt;Demand for user-friendly interfaces to Wikipedia's chemical content will increase.&lt;/li&gt;
&lt;li&gt;Wikipedia users will become interested in storing and finding ever more diverse kinds of information about each compound.&lt;/li&gt;
&lt;li&gt;Bench chemists will start to include Wikipedia as one of their preferred literature search techniques, leading to...&lt;/li&gt;
&lt;li&gt;More creative tools for using the chemical content of Wikipedia.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As noted previously, it wasn't too long ago that indexing of the chemical literature &lt;a href="http://depth-first.com/articles/2006/08/19/history-of-abstracting-at-chemical-abstracts-service"&gt;was done solely by volunteers&lt;/a&gt;. Wikipedia offers an intriguing way to channel the innate drive for chemists to combine their own work and experience with that of others to build useful information tools for the community.&lt;/p&gt;

&lt;p&gt;But for now we are left with the question of how to index the chemical content of Wikipedia. Although a few systems have been proposed, the only practical method is through the use of CAS numbers. Which brings us to the subject of today's tutorial.&lt;/p&gt;

&lt;h4&gt;A Quick CAS Number API for Wikipedia&lt;/h4&gt;

&lt;p&gt;The Ruby program below accepts the title of any Wikipedia compound monograph and prints the CAS number of the compound being discussed, or an error message if none is found:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_ruby "&gt;&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;rubygems&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;
&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;hpricot&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;
&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;open-uri&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;
&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;cgi&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;

&lt;span class="keyword"&gt;class &lt;/span&gt;&lt;span class="class"&gt;Wikikemi&lt;/span&gt;
  &lt;span class="attribute"&gt;@cas&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;nil&lt;/span&gt;

  &lt;span class="ident"&gt;attr_reader&lt;/span&gt; &lt;span class="symbol"&gt;:cas&lt;/span&gt;

  &lt;span class="keyword"&gt;def &lt;/span&gt;&lt;span class="method"&gt;initialize&lt;/span&gt; &lt;span class="ident"&gt;title&lt;/span&gt;
    &lt;span class="ident"&gt;uri&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;URI&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;escape&lt;/span&gt;&lt;span class="punct"&gt;(&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;http://en.wikipedia.org/wiki/&lt;span class="expr"&gt;#{title}&lt;/span&gt;&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;)&lt;/span&gt;
    &lt;span class="ident"&gt;puts&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;loading... &lt;span class="expr"&gt;#{uri}&lt;/span&gt;&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;
    &lt;span class="ident"&gt;doc&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;Hpricot&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;open&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;uri&lt;/span&gt;&lt;span class="punct"&gt;))&lt;/span&gt;
    &lt;span class="ident"&gt;table&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;doc&lt;/span&gt;&lt;span class="punct"&gt;/&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;table&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;)[&lt;/span&gt;&lt;span class="number"&gt;0&lt;/span&gt;&lt;span class="punct"&gt;]&lt;/span&gt;

    &lt;span class="ident"&gt;table&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;inner_html&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;match&lt;/span&gt;&lt;span class="punct"&gt;(/&lt;/span&gt;&lt;span class="regex"&gt;([0-9]{2,7}?&lt;span class="escape"&gt;\-&lt;/span&gt;[0-9]{2}&lt;span class="escape"&gt;\-&lt;/span&gt;[0-9])&lt;/span&gt;&lt;span class="punct"&gt;/)&lt;/span&gt; &lt;span class="keyword"&gt;if&lt;/span&gt; &lt;span class="ident"&gt;table&lt;/span&gt;

    &lt;span class="attribute"&gt;@cas&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="global"&gt;$1&lt;/span&gt;
  &lt;span class="keyword"&gt;end&lt;/span&gt;
&lt;span class="keyword"&gt;end&lt;/span&gt;

&lt;span class="comment"&gt;# Returns the CAS number present in the Wikipedia monograph with&lt;/span&gt;
&lt;span class="comment"&gt;# the indicated title, or an error message if none is found. Try, for example,&lt;/span&gt;
&lt;span class="comment"&gt;# &amp;quot;benzene.&amp;quot;.&lt;/span&gt;
&lt;span class="keyword"&gt;while&lt;/span&gt; &lt;span class="constant"&gt;true&lt;/span&gt;
  &lt;span class="ident"&gt;puts&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;Enter the title of the Wikipedia page, for example: 'benzene'&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;
  &lt;span class="ident"&gt;monograph_title&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;gets&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;chomp&lt;/span&gt;
  &lt;span class="ident"&gt;w&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;Wikikemi&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;new&lt;/span&gt; &lt;span class="ident"&gt;monograph_title&lt;/span&gt;
  &lt;span class="ident"&gt;puts&lt;/span&gt; &lt;span class="ident"&gt;w&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;cas&lt;/span&gt; &lt;span class="punct"&gt;?&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;[&lt;span class="expr"&gt;#{w.cas}&lt;/span&gt;]&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt; &lt;span class="punct"&gt;:&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;CAS number not found&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;
&lt;span class="keyword"&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This program makes use of the excellent Ruby HTML parser, &lt;a href="http://code.whytheluckystiff.net/hpricot/"&gt;Hpricot&lt;/a&gt;.&lt;/p&gt;
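
&lt;p&gt;The pattern-matching step can also be tried on its own, without fetching a page. What follows is a minimal sketch rather than a piece of the program above: extract_cas and CAS_PATTERN are hypothetical names, and the \b word boundaries are an added assumption meant to keep the pattern from matching inside a longer run of digits.&lt;/p&gt;

```ruby
# Hypothetical helper, not part of wikikemi.rb: returns the first
# CAS-like string found in a piece of text, or nil if there is none.
# The \b anchors are an assumption added for this sketch.
CAS_PATTERN = /\b[0-9]{2,7}-[0-9]{2}-[0-9]\b/

def extract_cas(text)
  # String#[] with a Regexp returns the matched substring or nil.
  text[CAS_PATTERN]
end

puts extract_cas("CAS number 71-43-2") || "CAS number not found"
puts extract_cas("no identifier here") || "CAS number not found"
```

&lt;p&gt;Keeping the match separate from the download makes it possible to check the pattern against saved text before pointing it at live pages.&lt;/p&gt;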

&lt;p&gt;Saving the above code to a file called &lt;strong&gt;wikikemi.rb&lt;/strong&gt;, we can run it with:&lt;/p&gt;

&lt;div class="console"&gt;
&lt;pre&gt;
$ ruby wikikemi.rb
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;For example, we can look up the CAS numbers for Ferrocene, Lipitor, or 1,2,3,4,4a,5,6,7,8,8a-Decahydronaphthalene:&lt;/p&gt;

&lt;div class="console"&gt;
&lt;pre&gt;
$ ruby wikikemi.rb
Enter the title of the Wikipedia page, for example: 'benzene'
ferrocene
loading... http://en.wikipedia.org/wiki/ferrocene
[102-54-5]
Enter the title of the Wikipedia page, for example: 'benzene'
lipitor
loading... http://en.wikipedia.org/wiki/lipitor
[134523-00-5]
Enter the title of the Wikipedia page, for example: 'benzene'
1,2,3,4,4a,5,6,7,8,8a-Decahydronaphthalene
loading... http://en.wikipedia.org/wiki/1,2,3,4,4a,5,6,7,8,8a-Decahydronaphthalene
[91-17-8]
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;All this method requires is that the Wikipedia page lists the correct CAS number in its &lt;a href="http://en.wikipedia.org/wiki/Template:Drugbox"&gt;Drugbox&lt;/a&gt; or &lt;a href="http://en.wikipedia.org/wiki/Template:Chembox_new"&gt;Chembox&lt;/a&gt; template. Fortunately, CAS has agreed to help make this happen.&lt;/p&gt;
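
&lt;p&gt;A pattern match alone accepts any string shaped like a CAS number, so it may be worth screening candidates before trusting them. CAS numbers end in a check digit: the digits before it are weighted 1, 2, 3, ... from right to left, summed, and reduced modulo 10. The sketch below applies that rule. The name valid_cas? is hypothetical and not part of the original program, and passing the check only shows that a string is well formed, not that it belongs to the compound on the page.&lt;/p&gt;

```ruby
# Hypothetical helper, not part of wikikemi.rb: verifies the check
# digit of a string in CAS number format.
def valid_cas?(candidate)
  match = candidate.match(/\A([0-9]{2,7})-([0-9]{2})-([0-9])\z/)
  return false unless match

  # All digits except the final check digit.
  digits = (match[1] + match[2]).chars.map { |c| c.to_i }
  # Weight the digits 1, 2, 3, ... starting from the rightmost one.
  sum = digits.reverse.each_with_index.map { |d, i| d * (i + 1) }.sum
  sum % 10 == match[3].to_i
end

puts valid_cas?("102-54-5")    # ferrocene, from the session above: true
puts valid_cas?("134523-00-5") # lipitor, from the session above: true
puts valid_cas?("102-54-6")    # altered check digit: false
```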

&lt;h4&gt;Conclusions&lt;/h4&gt;

&lt;p&gt;A little Ruby code is all it takes to build a working CAS number lookup system using Wikipedia. Although this may be useful as a standalone tool, it becomes much more powerful when made part of &lt;a href="http://depth-first.com/articles/2007/05/21/simple-cas-number-lookup-with-pubchem"&gt;a larger cheminformatics system&lt;/a&gt;. But that's a story for another time.&lt;/p&gt;

&lt;p&gt;See also &lt;a href="http://www.chemspider.com/blog/a-message-of-support-and-public-service-from-the-chemical-abstracts-service.html"&gt;Antony Williams' announcement on CAS and Wikipedia&lt;/a&gt;.&lt;/p&gt;</description>
      <pubDate>Wed, 02 Apr 2008 17:29:00 -0400</pubDate>
      <guid isPermaLink="false">urn:uuid:c11402b2-406a-4ec9-8b65-fc34da179c1a</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2008/04/02/wikipedia-for-cheminformatics-a-simple-web-api-for-finding-cas-numbers-in-compound-monographs</link>
      <category>Tools</category>
      <category>cas</category>
      <category>acs</category>
      <category>casnumber</category>
      <category>lookup</category>
      <category>wikipedia</category>
      <category>ruby</category>
    </item>
    <item>
      <title>"Wikipedia for Cheminformatics: A Simple Web API for Finding CAS Numbers in Compound Monographs" by Rich Apodaca</title>
      <description>&lt;p&gt;Maria, great link. &lt;a href="http://wikixmldb.dyndns.org/" rel="nofollow"&gt;WikiXMLDB&lt;/a&gt; looks like an excellent approach to creating structured data from Wikipedia.&lt;/p&gt;</description>
      <pubDate>Wed, 16 Apr 2008 12:20:57 -0400</pubDate>
      <guid isPermaLink="false">urn:uuid:d69a6df2-4335-4c35-b850-3a4fed057282</guid>
      <link>http://depth-first.com/articles/2008/04/02/wikipedia-for-cheminformatics-a-simple-web-api-for-finding-cas-numbers-in-compound-monographs#comment-484</link>
    </item>
    <item>
      <title>"Wikipedia for Cheminformatics: A Simple Web API for Finding CAS Numbers in Compound Monographs" by Maria Grineva</title>
      <description>&lt;p&gt;Take a look at WikiXMLDB&lt;/p&gt;

&lt;p&gt;&lt;a href="http://wikixmldb.dyndns.org/" rel="nofollow"&gt;http://wikixmldb.dyndns.org/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It provides a way to query Wikipedia in XQuery.&lt;/p&gt;

&lt;p&gt;A Wikipedia dump was parsed into XML and loaded into the Sedna XML database. Now you have the flexibility and power of XQuery applied to rich Wikipedia content!&lt;/p&gt;</description>
      <pubDate>Wed, 16 Apr 2008 05:47:19 -0400</pubDate>
      <guid isPermaLink="false">urn:uuid:2c8928ae-8afe-4d23-8012-aca8bb86eca5</guid>
      <link>http://depth-first.com/articles/2008/04/02/wikipedia-for-cheminformatics-a-simple-web-api-for-finding-cas-numbers-in-compound-monographs#comment-483</link>
    </item>
    <item>
      <title>"Wikipedia for Cheminformatics: A Simple Web API for Finding CAS Numbers in Compound Monographs" by Richard Apodaca</title>
      <description>&lt;p&gt;Hanjo, thanks for the heads-up. Links are now fixed.&lt;/p&gt;</description>
      <pubDate>Sat, 05 Apr 2008 21:57:57 -0400</pubDate>
      <guid isPermaLink="false">urn:uuid:2e41248d-9c1f-4f19-9fa4-7ae24fe49da1</guid>
      <link>http://depth-first.com/articles/2008/04/02/wikipedia-for-cheminformatics-a-simple-web-api-for-finding-cas-numbers-in-compound-monographs#comment-468</link>
    </item>
    <item>
      <title>"Wikipedia for Cheminformatics: A Simple Web API for Finding CAS Numbers in Compound Monographs" by Hanjo Kim</title>
      <description>&lt;p&gt;Links of Wikipedia's organic and inorganic substances seem to be wrongly assigned. Two links should be interchanged.&lt;/p&gt;

&lt;p&gt;And, as always, thank you for the insightful post!&lt;/p&gt;</description>
      <pubDate>Sat, 05 Apr 2008 04:19:16 -0400</pubDate>
      <guid isPermaLink="false">urn:uuid:95713496-fc3e-4480-a6fa-e89ae617e6be</guid>
      <link>http://depth-first.com/articles/2008/04/02/wikipedia-for-cheminformatics-a-simple-web-api-for-finding-cas-numbers-in-compound-monographs#comment-466</link>
    </item>
  </channel>
</rss>
