<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="/stylesheets/rss.css"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/">
  <channel>
    <title>Depth-First: Tag lookup</title>
    <link>http://depth-first.com/articles/tag/lookup</link>
    <language>en-us</language>
    <ttl>40</ttl>
    <description>Walking the Web of Chemical Informatics</description>
    <item>
      <title>Wikipedia for Cheminformatics: A Simple Web API for Finding CAS Numbers in Compound Monographs</title>
      <description>&lt;p&gt;&lt;a href="http://wikipedia.org"&gt;&lt;img src="http://depth-first.com/demo/20070123/wikipedia.jpg" align="right"&gt;&lt;/img&gt;&lt;/a&gt;Good news for cheminformatics: Chemical Abstracts Service (CAS) &lt;a href="http://en.wikipedia.org/wiki/Wikipedia_talk:WikiProject_Chemistry/CAS_validation"&gt;has agreed&lt;/a&gt; to help Wikipedia users curate its collection of CAS numbers. As a result of the diligence of some hard-working volunteers, chemistry's most universal system for referring to chemicals can now be used far more effectively by the world's biggest open repository of knowledge.&lt;/p&gt;

&lt;p&gt;Wouldn't it be great to be able to pull these CAS numbers from Wikipedia programmatically?&lt;/p&gt;

&lt;h4&gt;Perspective&lt;/h4&gt;

&lt;p&gt;Estimates place the number of Wikipedia pages dealing with individual &lt;a href="http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Chemicals/Inorganics"&gt;inorganic&lt;/a&gt; and &lt;a href="http://en.wikipedia.org/wiki/List_of_organic_compounds"&gt;organic&lt;/a&gt; substances in the thousands. (I'll use the term "compound monographs" to describe them.) One factor acting to keep this number low is poor visibility of these entries. Unlike most &lt;a href="http://depth-first.com/articles/2007/01/24/thirty-two-free-chemistry-databases"&gt;chemical databases&lt;/a&gt;, Wikipedia can't, by itself, be easily searched by structure. As chemically-aware tools for indexing Wikipedia begin to emerge, look for six things to happen:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The number of Wikipedia compound monographs will increase significantly.&lt;/li&gt;
&lt;li&gt;The quality of monographs for intermediate- to well-known compounds will increase substantially.&lt;/li&gt;
&lt;li&gt;Demand for user-friendly interfaces to Wikipedia's chemical content will increase.&lt;/li&gt;
&lt;li&gt;Wikipedia users will become interested in storing and finding ever more diverse kinds of information about each compound.&lt;/li&gt;
&lt;li&gt;Bench chemists will start to include Wikipedia as one of their preferred literature search techniques, leading to...&lt;/li&gt;
&lt;li&gt;More creative tools for using the chemical content of Wikipedia.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As noted previously, it wasn't too long ago that indexing of the chemical literature &lt;a href="http://depth-first.com/articles/2006/08/19/history-of-abstracting-at-chemical-abstracts-service"&gt;was done solely by volunteers&lt;/a&gt;. Wikipedia offers an intriguing way to channel the innate drive of chemists to combine their own work and experience with that of others to build useful information tools for the community.&lt;/p&gt;

&lt;p&gt;But for now we are left with the question of how to index the chemical content of Wikipedia. Although a few systems have been proposed, the only practical method is through the use of CAS numbers. Which brings us to the subject of today's tutorial.&lt;/p&gt;

&lt;h4&gt;A Quick CAS Number API for Wikipedia&lt;/h4&gt;

&lt;p&gt;The Ruby program below accepts the title of any Wikipedia compound monograph and prints the CAS number for the compound being discussed, or an error message if none is found:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_ruby "&gt;&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;rubygems&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;
&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;hpricot&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;
&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;open-uri&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;
&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;cgi&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;

&lt;span class="keyword"&gt;class &lt;/span&gt;&lt;span class="class"&gt;Wikikemi&lt;/span&gt;
  &lt;span class="attribute"&gt;@cas&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;nil&lt;/span&gt;

  &lt;span class="ident"&gt;attr_reader&lt;/span&gt; &lt;span class="symbol"&gt;:cas&lt;/span&gt;

  &lt;span class="keyword"&gt;def &lt;/span&gt;&lt;span class="method"&gt;initialize&lt;/span&gt; &lt;span class="ident"&gt;title&lt;/span&gt;
    &lt;span class="ident"&gt;uri&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;URI&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;escape&lt;/span&gt;&lt;span class="punct"&gt;(&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;http://en.wikipedia.org/wiki/&lt;span class="expr"&gt;#{title}&lt;/span&gt;&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;)&lt;/span&gt;
    &lt;span class="ident"&gt;puts&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;loading... &lt;span class="expr"&gt;#{uri}&lt;/span&gt;&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;
    &lt;span class="ident"&gt;doc&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;Hpricot&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;open&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;uri&lt;/span&gt;&lt;span class="punct"&gt;))&lt;/span&gt;
    &lt;span class="ident"&gt;table&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;doc&lt;/span&gt;&lt;span class="punct"&gt;/&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;table&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;)[&lt;/span&gt;&lt;span class="number"&gt;0&lt;/span&gt;&lt;span class="punct"&gt;]&lt;/span&gt;

    &lt;span class="ident"&gt;table&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;inner_html&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;match&lt;/span&gt;&lt;span class="punct"&gt;(/&lt;/span&gt;&lt;span class="regex"&gt;([0-9]{2,7}?&lt;span class="escape"&gt;\-&lt;/span&gt;[0-9]{2}&lt;span class="escape"&gt;\-&lt;/span&gt;[0-9])&lt;/span&gt;&lt;span class="punct"&gt;/)&lt;/span&gt; &lt;span class="keyword"&gt;if&lt;/span&gt; &lt;span class="ident"&gt;table&lt;/span&gt;

    &lt;span class="attribute"&gt;@cas&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="global"&gt;$1&lt;/span&gt;
  &lt;span class="keyword"&gt;end&lt;/span&gt;
&lt;span class="keyword"&gt;end&lt;/span&gt;

&lt;span class="comment"&gt;# Returns the CAS number present in the Wikipedia monograph with&lt;/span&gt;
&lt;span class="comment"&gt;# the indicated title, or an error message if none is found. Try, for example,&lt;/span&gt;
&lt;span class="comment"&gt;# &amp;quot;benzene&amp;quot;.&lt;/span&gt;
&lt;span class="keyword"&gt;while&lt;/span&gt; &lt;span class="constant"&gt;true&lt;/span&gt;
  &lt;span class="ident"&gt;puts&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;Enter the title of the Wikipedia page, for example: 'benzene'&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;
  &lt;span class="ident"&gt;monograph_title&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;gets&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;chomp&lt;/span&gt;
  &lt;span class="ident"&gt;w&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;Wikikemi&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;new&lt;/span&gt; &lt;span class="ident"&gt;monograph_title&lt;/span&gt;
  &lt;span class="ident"&gt;puts&lt;/span&gt; &lt;span class="ident"&gt;w&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;cas&lt;/span&gt; &lt;span class="punct"&gt;?&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;[&lt;span class="expr"&gt;#{w.cas}&lt;/span&gt;]&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt; &lt;span class="punct"&gt;:&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;CAS number not found&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;
&lt;span class="keyword"&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This program makes use of the excellent Ruby HTML parser, &lt;a href="http://code.whytheluckystiff.net/hpricot/"&gt;Hpricot&lt;/a&gt;.&lt;/p&gt;
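
&lt;p&gt;The pattern-matching step can be tried on its own, without fetching a page. The following is a minimal sketch using only the Ruby standard library; the &lt;code&gt;extract_cas&lt;/code&gt; helper, the &lt;code&gt;CAS_PATTERN&lt;/code&gt; constant, and the sample text are hypothetical names introduced here for illustration and are not part of the program above. The pattern is a slightly stricter variant of the one in the listing, with word boundaries added as an assumption about what counts as a match:&lt;/p&gt;

```ruby
# Hypothetical helper, not part of the original program: returns the first
# CAS-number-shaped token found in a string, or nil if there is none.
# A CAS number has 2 to 7 digits, then 2 digits, then 1 check digit.
CAS_PATTERN = /\b(\d{2,7}-\d{2}-\d)\b/

def extract_cas(text)
  match = CAS_PATTERN.match(text)
  match ? match[1] : nil
end

# Sample text is made up for illustration; a real page would be HTML.
sample = 'CAS number 102-54-5, molar mass 186.04 g/mol'
puts extract_cas(sample).inspect
puts extract_cas('no identifier here').inspect
```

&lt;p&gt;Like the original, this returns only the first match, so it relies on the CAS number appearing before any other text of the same shape.&lt;/p&gt;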

&lt;p&gt;Saving the above code to a file called &lt;strong&gt;wikikemi.rb&lt;/strong&gt;, we can run it with:&lt;/p&gt;

&lt;div class="console"&gt;
&lt;pre&gt;
$ ruby wikikemi.rb
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;For example, we can look up the CAS numbers for Ferrocene, Lipitor, or 1,2,3,4,4a,5,6,7,8,8a-Decahydronaphthalene:&lt;/p&gt;

&lt;div class="console"&gt;
&lt;pre&gt;
$ ruby wikikemi.rb
Enter the title of the Wikipedia page, for example: 'benzene'
ferrocene
loading... http://en.wikipedia.org/wiki/ferrocene
[102-54-5]
Enter the title of the Wikipedia page, for example: 'benzene'
lipitor
loading... http://en.wikipedia.org/wiki/lipitor
[134523-00-5]
Enter the title of the Wikipedia page, for example: 'benzene'
1,2,3,4,4a,5,6,7,8,8a-Decahydronaphthalene
loading... http://en.wikipedia.org/wiki/1,2,3,4,4a,5,6,7,8,8a-Decahydronaphthalene
[91-17-8]
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;All this method requires is that the Wikipedia page lists the correct CAS number in its &lt;a href="http://en.wikipedia.org/wiki/Template:Drugbox"&gt;Drugbox&lt;/a&gt; or &lt;a href="http://en.wikipedia.org/wiki/Template:Chembox_new"&gt;Chembox&lt;/a&gt; template. Fortunately, CAS has agreed to help make this happen.&lt;/p&gt;
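
&lt;p&gt;Because the method depends on the page listing a correct number, it can help to sanity-check whatever comes back. CAS numbers end in a check digit: weight the other digits 1, 2, 3, and so on from right to left, add them up, and take the remainder modulo 10. The sketch below assumes that rule; &lt;code&gt;valid_cas?&lt;/code&gt; is a hypothetical helper, not part of the program above, and passing the check only shows that a number is well formed, not that it belongs to the compound in question:&lt;/p&gt;

```ruby
# Hypothetical helper: checks the format and check digit of a CAS number.
def valid_cas?(cas)
  return false unless cas.match?(/\A\d{2,7}-\d{2}-\d\z/)

  digits = cas.delete('-').chars.map { |c| c.to_i }
  check = digits.pop
  # Weight the remaining digits 1, 2, 3, ... starting from the right.
  sum = digits.reverse.each_with_index.sum { |d, i| d * (i + 1) }
  sum % 10 == check
end

puts valid_cas?('102-54-5')    # ferrocene, from the session above
puts valid_cas?('134523-00-5') # from the session above
puts valid_cas?('102-54-6')    # last digit altered, so the check fails
```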

&lt;h4&gt;Conclusions&lt;/h4&gt;

&lt;p&gt;A little Ruby code is all it takes to build a working CAS number lookup system using Wikipedia. Although this may be useful as a standalone tool, it becomes much more powerful when made part of &lt;a href="http://depth-first.com/articles/2007/05/21/simple-cas-number-lookup-with-pubchem"&gt;a larger cheminformatics system&lt;/a&gt;. But that's a story for another time.&lt;/p&gt;

&lt;p&gt;See also &lt;a href="http://www.chemspider.com/blog/a-message-of-support-and-public-service-from-the-chemical-abstracts-service.html"&gt;Antony Williams' announcement on CAS and Wikipedia&lt;/a&gt;.&lt;/p&gt;</description>
      <pubDate>Wed, 02 Apr 2008 17:29:00 -0400</pubDate>
      <guid isPermaLink="false">urn:uuid:c11402b2-406a-4ec9-8b65-fc34da179c1a</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2008/04/02/wikipedia-for-cheminformatics-a-simple-web-api-for-finding-cas-numbers-in-compound-monographs</link>
      <category>Tools</category>
      <category>cas</category>
      <category>acs</category>
      <category>casnumber</category>
      <category>lookup</category>
      <category>wikipedia</category>
      <category>ruby</category>
    </item>
    <item>
      <title>Simple CAS Number Lookup with PubChem</title>
      <description>&lt;p&gt;&lt;a href="http://pubchem.ncbi.nlm.nih.gov/"&gt;&lt;img src="http://depth-first.com/files/pubchemlogo.gif" align="right" border="none"&gt;&lt;/img&gt;&lt;/a&gt;&lt;a href="http://www.cas.org/expertise/cascontent/registry/regsys.html"&gt;CAS Registry Numbers&lt;/a&gt; simplify the thorny problem of referring to chemical substances. These short numerical sequences are arguably the most widely-used form of molecular identifier, appearing on reagent bottles, in publications, in patents and patent applications, and on MSDS sheets.&lt;/p&gt;

&lt;p&gt;During my time as a synthetic organic chemist, I would sometimes run into the problem of finding the structure of a molecule represented by a CAS number. A common case was when an ambiguous, incomprehensible, or blurred IUPAC name was printed on a reagent bottle along with a CAS number. By looking up the CAS number, I could confirm the bottle's contents.&lt;/p&gt;

&lt;p&gt;Your first impulse when looking up a CAS number might be to fire up &lt;a href="http://www.cas.org/SCIFINDER/"&gt;SciFinder&lt;/a&gt;. For years this was the only option. Those days are quickly starting to seem as quaint as when people actually wrote on pieces of paper and dropped them in mailboxes (&lt;a href="http://netflix.com"&gt;dropping DVDs in a mailbox&lt;/a&gt; is a different matter).&lt;/p&gt;

&lt;p&gt;A little-publicized feature of PubChem makes it an ideal way to quickly find the structure associated with a CAS Number. To use it, you need nothing more than a computer, a browser, and an internet connection.&lt;/p&gt;

&lt;p&gt;Browse over to the &lt;a href="http://pubchem.ncbi.nlm.nih.gov/"&gt;PubChem&lt;/a&gt; welcome page. At the top you'll find a search box. Enter your CAS number and press "Go." For this example, I'm using the CAS number for 2,5-Pyrazinedicarboxylic acid dihydrate:&lt;/p&gt;

&lt;p&gt;&lt;center&gt;&lt;img src="http://depth-first.com/demo/20070521/screenshot.png"&gt;&lt;/img&gt;&lt;/center&gt;&lt;/p&gt;

&lt;p&gt;If all goes well, you should see a results screen containing the structure of your compound and a link to its summary page:&lt;/p&gt;

&lt;p&gt;&lt;center&gt;&lt;img src="http://depth-first.com/demo/20070521/screenshot2.png"&gt;&lt;/img&gt;&lt;/center&gt;&lt;/p&gt;

&lt;p&gt;Does this seem a little too good to be true? Try it for yourself. Pick up a copy of the Aldrich catalog, the Merck Index, or anything else that lists lots of CAS numbers. Choose several structures at random and see how PubChem performs.&lt;/p&gt;

&lt;p&gt;There are limitations to this method. PubChem generally doesn't index large molecules such as polymers and peptides, so they won't be found by this method. Similarly, if a CAS number doesn't point to a distinct molecular entity (e.g. "mineral oil"), PubChem won't find it either. But these are hardly limitations in the vast majority of cases.&lt;/p&gt;

&lt;p&gt;With the &lt;a href="http://www.corporate-ir.net/ireye/ir_site.zhtml?ticker=SIAL&amp;amp;script=410&amp;amp;layout=-6&amp;amp;item_id=984368"&gt;recent addition of Sigma-Aldrich&lt;/a&gt; as a PubChem compound supplier, it won't be long before smaller companies begin following suit. What we're seeing with PubChem is a classic example of a &lt;a href="http://en.wikipedia.org/wiki/Network_effect"&gt;network effect&lt;/a&gt;. The end result should come as a surprise to nobody.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Update: &lt;a href="http://chempedia.com"&gt;Chempedia&lt;/a&gt; offers a more detailed &lt;a href="http://depth-first.com/articles/2008/05/26/simple-cas-number-lookup-and-more-with-chempedia"&gt;CAS Number Lookup&lt;/a&gt; service.&lt;/em&gt;&lt;/p&gt;</description>
      <pubDate>Mon, 21 May 2007 11:46:00 -0400</pubDate>
      <guid isPermaLink="false">urn:uuid:e20e2fc2-e99e-4171-8055-1493bcb31d65</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2007/05/21/simple-cas-number-lookup-with-pubchem</link>
      <category>Databases</category>
      <category>cas</category>
      <category>pubchem</category>
      <category>casnumber</category>
      <category>lookup</category>
      <category>networkeffect</category>
    </item>
  </channel>
</rss>
