# Wikipedia for Cheminformatics - A Simple Web API for Finding CAS Numbers in Compound Monographs

Good news for cheminformatics: Chemical Abstracts Service (CAS) has agreed to help Wikipedia users curate Wikipedia's collection of CAS numbers. Thanks to the diligence of some hard-working volunteers, chemistry's most universal system for referring to chemicals can now be used far more effectively by the world's biggest open repository of knowledge.

Wouldn't it be great to be able to pull these CAS numbers from Wikipedia programmatically?

## Perspective

Estimates place the number of Wikipedia pages dealing with individual inorganic and organic substances in the thousands. (I'll use the term "compound monographs" to describe them.) One factor acting to keep this number low is poor visibility of these entries. Unlike most chemical databases, Wikipedia can't, by itself, be easily searched by structure. As chemically-aware tools for indexing Wikipedia begin to emerge, look for six things to happen:

1. The number of Wikipedia compound monographs will increase significantly.
2. The quality of monographs for moderately known to well-known compounds will increase substantially.
3. Demand for user-friendly interfaces to Wikipedia's chemical content will increase.
4. Wikipedia users will become interested in storing and finding ever more diverse kinds of information about each compound.
5. Bench chemists will start to include Wikipedia as one of their preferred literature search techniques, leading to…
6. More creative tools for using the chemical content of Wikipedia.

As noted previously, it wasn't too long ago that indexing of the chemical literature was done solely by volunteers. Wikipedia offers an intriguing way to channel chemists' innate drive to combine their own work and experience with that of others and build useful information tools for the community.

But for now we are left with the question of how to index the chemical content of Wikipedia. Although a few systems have been proposed, the only practical method is through the use of CAS numbers. Which brings us to the subject of today's tutorial.

## A Quick CAS Number API for Wikipedia

The Ruby program below accepts the title of any Wikipedia compound monograph and returns the CAS number for the compound being discussed, or an error message if none was found:

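```ruby
require 'rubygems'
require 'hpricot'
require 'open-uri'

class Wikikemi
  # CAS number found in the monograph, or nil if none was found.
  attr_reader :cas

  def initialize(title)
    # Wikipedia page titles use underscores in place of spaces.
    page = URI.encode_www_form_component(title.strip.tr(' ', '_'))
    # URI.open needs Ruby 2.5 or later; older versions used Kernel#open.
    doc = Hpricot(URI.open("http://en.wikipedia.org/wiki/#{page}"))
    table = (doc/"table")[0]

    # Look for a CAS-number-shaped string in the first table on the page.
    match = table && table.inner_html.match(/\b([0-9]{2,7}-[0-9]{2}-[0-9])\b/)
    @cas = match && match[1]
  rescue OpenURI::HTTPError
    # No page with this title (or the request failed), so no CAS number.
    @cas = nil
  end
end

# Reads monograph titles from standard input, one per line, and prints the
# CAS number present in each monograph, or an error message if none is
# found. Try, for example, "benzene".
while (line = gets)
  w = Wikikemi.new(line)
  puts w.cas ? "[#{w.cas}]" : "No CAS number found for #{line.chomp}"
end
```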

This program makes use of the excellent Ruby HTML parser, Hpricot.
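To see the extraction step in isolation, without a network request or Hpricot, a CAS-number-shaped pattern similar to the one in the program can be applied to a plain string. This is only a minimal sketch: the `extract_cas` helper, the `CAS_PATTERN` constant, and the sample HTML fragment are invented for illustration and are not part of the program.

```ruby
# Hypothetical pattern: 2-7 digits, a hyphen, 2 digits, a hyphen, 1 digit.
CAS_PATTERN = /\b([0-9]{2,7}-[0-9]{2}-[0-9])\b/

# Hypothetical helper: returns the first CAS-number-shaped string in +html+,
# or nil if there is none.
def extract_cas(html)
  match = html.match(CAS_PATTERN)
  match && match[1]
end

# Invented fragment resembling one row of an infobox table.
sample = '<tr><td>CAS number</td><td>102-54-5</td></tr>'

puts extract_cas(sample).inspect                # the CAS-shaped match
puts extract_cas('<td>no number</td>').inspect  # nil, nothing matched
```

Returning `nil` rather than raising lets the caller decide how to report a monograph without a CAS number.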

Saving the Wikikemi program to a file called wikikemi.rb, we can run it with:

```
ruby wikikemi.rb
```

For example, we can look up the CAS numbers for Ferrocene, Lipitor, or 1,2,3,4,4a,5,6,7,8,8a-Decahydronaphthalene:

```
ruby wikikemi.rb
ferrocene
[102-54-5]
1,2,3,4,4a,5,6,7,8,8a-Decahydronaphthalene
[91-17-8]
```
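A pattern match only shows that a string is shaped like a CAS number. CAS numbers also carry a check digit: the last digit equals the sum of the other digits, each multiplied by its position counting from the right, modulo 10. A minimal sketch of that check follows, as one possible way to screen out false positives; the `valid_cas?` helper is a hypothetical name and is not part of wikikemi.rb.

```ruby
# Hypothetical helper: true if +cas+ is shaped like a CAS number and its
# check digit is consistent with the preceding digits.
def valid_cas?(cas)
  return false unless cas =~ /\A[0-9]{2,7}-[0-9]{2}-[0-9]\z/

  *body, check = cas.delete('-').chars.map(&:to_i)
  # Weight the digits 1, 2, 3, ... starting from the one next to the check digit.
  weighted = body.reverse.each_with_index.map { |digit, i| digit * (i + 1) }
  weighted.sum % 10 == check
end

# Ferrocene: 4*1 + 5*2 + 2*3 + 0*4 + 1*5 = 25, and 25 mod 10 = 5.
puts valid_cas?('102-54-5')
# Same digits with a wrong check digit.
puts valid_cas?('102-54-6')
```

A check like this could be applied to the value returned by the program before it is stored or indexed.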