Wikipedia for Cheminformatics: A Simple Web API for Finding CAS Numbers in Compound Monographs
Good news for cheminformatics: Chemical Abstracts Service (CAS) has agreed to help Wikipedia users curate its collection of CAS numbers. As a result of the diligence of some hard-working volunteers, chemistry's most universal system for referring to chemicals can now be used far more effectively by the world's biggest open repository of knowledge.
Wouldn't it be great to be able to pull these CAS numbers from Wikipedia programmatically?
Perspective
Estimates place the number of Wikipedia pages dealing with individual inorganic and organic substances in the thousands. (I'll use the term "compound monographs" to describe them.) One factor acting to keep this number low is poor visibility of these entries. Unlike most chemical databases, Wikipedia can't, by itself, be easily searched by structure. As chemically-aware tools for indexing Wikipedia begin to emerge, look for six things to happen:
- The number of Wikipedia compound monographs will increase significantly.
- The quality of monographs for intermediate- to well-known compounds will increase substantially.
- Demand for user-friendly interfaces to Wikipedia's chemical content will increase.
- Wikipedia users will become interested in storing and finding ever more diverse kinds of information about each compound.
- Bench chemists will start to include Wikipedia as one of their preferred literature search techniques, leading to...
- More creative tools for using the chemical content of Wikipedia.
As noted previously, it wasn't too long ago that indexing of the chemical literature was done solely by volunteers. Wikipedia offers an intriguing way to channel chemists' innate drive to combine their own work and experience with that of others, and so build useful information tools for the community.
But for now we are left with the question of how to index the chemical content of Wikipedia. Although a few systems have been proposed, the only practical method is through the use of CAS numbers. Which brings us to the subject of today's tutorial.
A Quick CAS Number API for Wikipedia
The Ruby program below accepts the title of any Wikipedia compound monograph and returns the CAS number for the compound being discussed, or a "not found" message if there is none:
require 'rubygems'
require 'hpricot'
require 'open-uri'

# Looks up the CAS number on a Wikipedia compound monograph page.
# Note: URI.escape and open-uri's Kernel#open were removed in Ruby 3.0,
# so this listing needs an older Ruby as written.
class Wikikemi
  attr_reader :cas

  def initialize(title)
    uri = URI.escape("http://en.wikipedia.org/wiki/#{title}")
    puts "loading... #{uri}"
    doc = Hpricot(open(uri))
    # Assume the Chembox or Drugbox is the first table on the page.
    table = (doc/"table")[0]
    # Keep the first string shaped like a CAS number; nil if none.
    @cas = table && table.inner_html[/[0-9]{2,7}-[0-9]{2}-[0-9]/]
  end
end

# Prompts for page titles in a loop and prints the CAS number found in
# each monograph, or a "not found" message. Try, for example, "benzene".
while true
  puts "Enter the title of the Wikipedia page, for example: 'benzene'"
  monograph_title = gets.chomp
  w = Wikikemi.new monograph_title
  puts w.cas ? "[#{w.cas}]" : "CAS number not found"
end
This program makes use of the excellent Ruby HTML parser, Hpricot.
Saving the above code to a file called wikikemi.rb, we can run it with:
$ ruby wikikemi.rb
For example, we can look up the CAS numbers for Ferrocene, Lipitor, or 1,2,3,4,4a,5,6,7,8,8a-Decahydronaphthalene:
$ ruby wikikemi.rb
Enter the title of the Wikipedia page, for example: 'benzene'
ferrocene
loading... http://en.wikipedia.org/wiki/ferrocene
[102-54-5]
Enter the title of the Wikipedia page, for example: 'benzene'
lipitor
loading... http://en.wikipedia.org/wiki/lipitor
[134523-00-5]
Enter the title of the Wikipedia page, for example: 'benzene'
1,2,3,4,4a,5,6,7,8,8a-Decahydronaphthalene
loading... http://en.wikipedia.org/wiki/1,2,3,4,4a,5,6,7,8,8a-Decahydronaphthalene
[91-17-8]
All this method requires is that the Wikipedia page lists the correct CAS number in its Drugbox or Chembox template. Fortunately, CAS has agreed to help make this happen.
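Because the program above accepts any string shaped like a CAS number, it can help to confirm the check digit before trusting a match. The sketch below is one way to do that, not part of the original program: it applies the standard CAS checksum (weight the digits before the check digit 1, 2, 3, ... from right to left, sum them, and take the result modulo 10). The names CAS_PATTERN, valid_cas? and first_valid_cas are hypothetical, chosen for illustration.

```ruby
# Pattern for a CAS Registry Number: 2-7 digits, 2 digits, 1 check digit.
CAS_PATTERN = /\b(\d{2,7})-(\d{2})-(\d)\b/

# True when the final digit matches the CAS checksum: weight the other
# digits 1, 2, 3, ... from right to left, sum, and take the sum mod 10.
def valid_cas?(candidate)
  match = /\A(\d{2,7})-(\d{2})-(\d)\z/.match(candidate)
  return false unless match

  digits = (match[1] + match[2]).chars.map(&:to_i).reverse
  sum = digits.each_with_index.sum { |digit, i| digit * (i + 1) }
  sum % 10 == match[3].to_i
end

# Returns the first checksum-valid CAS number found in text, or nil.
def first_valid_cas(text)
  candidates = text.scan(CAS_PATTERN).map { |parts| parts.join('-') }
  candidates.find { |candidate| valid_cas?(candidate) }
end
```

Passing the infobox HTML to first_valid_cas instead of taking the first regular expression match would skip look-alike strings whose check digit does not add up. A passing checksum only shows that the number is well formed, not that it belongs to the compound on the page.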
Conclusions
A little Ruby code is all it takes to build a working CAS number lookup system using Wikipedia. Although this may be useful as a standalone tool, it becomes much more powerful when made part of a larger cheminformatics system. But that's a story for another time.
See also Antony Williams' announcement on CAS and Wikipedia.
Simple CAS Number Lookup with PubChem
CAS Registry Numbers simplify the thorny problem of referring to chemical substances. These short numerical sequences are arguably the most widely used form of molecular identifier, appearing on reagent bottles, in publications, in patents and patent applications, and on material safety data sheets (MSDS).
During my time as a synthetic organic chemist, I would sometimes run into the problem of finding the structure of a molecule represented by a CAS number. A common case was when an ambiguous, incomprehensible, or blurred IUPAC name was printed on a reagent bottle along with a CAS number. By looking up the CAS number, I could confirm the bottle's contents.
Your first impulse when looking up a CAS number might be to fire up SciFinder. For years this was the only option. Those days are quickly starting to seem as quaint as when people actually wrote on pieces of paper and dropped them in mailboxes (dropping DVDs in a mailbox is a different matter).
A little-publicized feature of PubChem makes it an ideal way to quickly find the structure associated with a CAS Number. To use it, you need nothing more than a computer, a browser, and an internet connection.
Browse over to the PubChem welcome page. At the top you'll find a search box. Enter your CAS number and press "Go." For this example, I'm using the CAS number for 2,5-Pyrazinedicarboxylic acid dihydrate:

If all goes well, you should see a results screen containing the structure of your compound and a link to its summary page:

Does this seem a little too good to be true? Try it for yourself. Pick up a copy of the Aldrich catalog, Merck index, or anything else that lists lots of CAS numbers. Choose several structures at random and see how PubChem performs.
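For checking more than a handful of numbers, the same lookup can be scripted. The sketch below is an assumption on my part rather than the method described in this post: it uses PubChem's PUG REST interface, searching by name (CAS numbers are generally stored in PubChem as depositor-supplied synonyms) and asking for the matching compound IDs (CIDs) as JSON. The helper names pubchem_cid_url, parse_cids and lookup_cids are hypothetical.

```ruby
require 'json'
require 'net/http'
require 'uri'

PUG_REST_BASE = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug'

# Builds the PUG REST URL that searches compound names and synonyms
# for the given CAS number and requests the matching CIDs as JSON.
def pubchem_cid_url(cas)
  unless cas =~ /\A\d{2,7}-\d{2}-\d\z/
    raise ArgumentError, "not a CAS number: #{cas}"
  end

  "#{PUG_REST_BASE}/compound/name/#{cas}/cids/JSON"
end

# Extracts the list of CIDs from a PUG REST response body. Returns []
# when the response carries no identifier list (for example, a fault).
def parse_cids(body)
  JSON.parse(body).dig('IdentifierList', 'CID') || []
end

# Performs the lookup over the network and returns an array of CIDs.
def lookup_cids(cas)
  parse_cids(Net::HTTP.get(URI(pubchem_cid_url(cas))))
end
```

Calling lookup_cids with a CAS number from a catalog should return an array of CIDs, each of which corresponds to a compound summary page. An empty array means PubChem found nothing under that number, which is the scripted equivalent of an empty results screen.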
There are limitations to this method. PubChem generally doesn't index large molecules such as polymers and peptides, so they won't be found by this method. Similarly, if a CAS number doesn't point to a distinct molecular entity (e.g. "mineral oil"), PubChem won't find it either. But these are hardly limitations in the vast majority of cases.
With the recent addition of Sigma-Aldrich as a PubChem compound supplier, it won't be long before smaller companies begin following suit. What we're seeing with PubChem is a classic example of a network effect. The end result should come as a surprise to nobody.

