Hacking DOI: Interconvert Bibliographic References and DOIs with CrossRef and OpenURL

Posted by Rich Apodaca Tue, 06 May 2008 19:50:00 GMT

Science is in the middle of a transition from print to the internet as the primary medium of communication. This transition, although a boon for many scientists, creates a host of problems for those dealing with scientific information. For example, how would you interconvert a DOI and its corresponding bibliographic reference?

A previous Depth-First article discussed a screen-scraping method as one solution. Unfortunately, this system relies on an intimate understanding of how individual publishers' Websites work, requires a different implementation for each publisher, and can break at any time without warning.

This article discusses a far more robust solution to the problem of interconverting bibliographic references and DOIs.

Background: OpenURL and CrossRef

CrossRef is the official DOI link registration agency for scholarly and professional publications. One of the less well-known services offered by CrossRef is a free, Web-based bidirectional DOI/bibliographic reference converter based on OpenURL.

A Simple Ruby Library

The following Ruby library is all we need to begin using CrossRef and OpenURL:

require 'rubygems'
require 'hpricot'
require 'open-uri'

module DOI
  # Convert a doi into a bibliographic reference.
  def biblio_for doi
    doc = Hpricot(open("http://www.crossref.org/openurl/?id=doi:#{doi}&noredirect=true&pid=ourl_sample:sample&format=unixref"))

    journal = (doc/"abbrev_title").inner_html
    year = (doc/"journal_issue/publication_date/year").inner_html
    volume = (doc/"journal_issue/journal_volume/volume").inner_html
    number = (doc/"journal_issue/issue").inner_html
    first_page = (doc/"pages/first_page").inner_html
    last_page = (doc/"pages/last_page").inner_html

    "#{journal} #{year}, #{volume}(#{number}) #{first_page}-#{last_page}"
  end

  # Convert a bibliographic reference into a DOI.
  def doi_for journal, year, volume, issue, page
    doc = Hpricot(open("http://www.crossref.org/openurl/?title=#{journal.gsub(/ /, '%20')}&volume=#{volume}&issue=#{issue}&spage=#{page}&date=#{year}&pid=ourl_sample:sample&redirect=false&format=unixref"))

    (doc/"doi").inner_html
  end
end

This code makes use of the excellent Ruby HTML parser library Hpricot.
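For readers who want to experiment with the parsing step without Hpricot or a network connection, the same element lookups can be sketched with REXML, which ships with Ruby. The XML below is a hand-written fragment built from the values in this article, not a captured CrossRef response, and format_citation is a hypothetical name rather than part of the library above.

```ruby
require 'rexml/document'

# Hand-written fragment for illustration only. The element names follow
# the paths used in biblio_for; a real unixref response contains much more.
SAMPLE_UNIXREF = <<~XML
  <journal>
    <journal_metadata>
      <abbrev_title>Chem. Rev.</abbrev_title>
    </journal_metadata>
    <journal_issue>
      <publication_date><year>1994</year></publication_date>
      <journal_volume><volume>94</volume></journal_volume>
      <issue>8</issue>
    </journal_issue>
    <journal_article>
      <pages><first_page>2483</first_page><last_page>2547</last_page></pages>
    </journal_article>
  </journal>
XML

# Formats a citation string from unixref-style XML text.
def format_citation(xml)
  doc = REXML::Document.new(xml)

  # Return the text of the first element matching an XPath expression,
  # or an empty string when the element is missing.
  text_at = lambda do |path|
    node = REXML::XPath.first(doc, path)
    node ? node.text : ""
  end

  journal = text_at.call("//abbrev_title")
  year = text_at.call("//journal_issue/publication_date/year")
  volume = text_at.call("//journal_issue/journal_volume/volume")
  number = text_at.call("//journal_issue/issue")
  first_page = text_at.call("//pages/first_page")
  last_page = text_at.call("//pages/last_page")

  "#{journal} #{year}, #{volume}(#{number}) #{first_page}-#{last_page}"
end
```

Calling format_citation(SAMPLE_UNIXREF) builds the citation string in the same layout that biblio_for uses.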

Testing the Library

Saving the Ruby code to a file named doi.rb, we can test it using the interactive Ruby shell:

$ irb
irb(main):001:0> require 'doi'
=> true
irb(main):002:0> include DOI
=> Object
irb(main):003:0> biblio_for "10.1021/cr00032a009"
=> "Chem. Rev. 1994, 94(8) 2483-2547"
irb(main):004:0> doi_for "Chem. Rev.", 1994, 94, 8, 2483
=> "10.1021/cr00032a009"

Notice how the journal abbreviation Chem. Rev. was used; we'd get the same result if we used Chemical Reviews.

Of course, the implementation described here could be refined a lot. With a DOI, it's trivial to construct a URL to the example paper. But we could take it further than that. With some carefully crafted regular expressions, our doi_for method could accept a freeform bibliographical citation rather than separately identified fragments. From there we might start to think about creating living HTML and/or Wikis from old PDFs and Word documents.
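As a rough sketch of those two ideas, the code below parses a citation written in the same layout that biblio_for produces and builds a resolver link from a DOI. The names parse_citation and doi_url are hypothetical, the pattern only handles this one citation style, and the doi.org resolver host is an assumption that does not appear in the library above.

```ruby
# Hypothetical helpers, not part of doi.rb.
# Matches citations shaped like "Chem. Rev. 1994, 94(8) 2483-2547".
CITATION_PATTERN = /
  \A(?<journal>.+?)\s+        # journal title or abbreviation
  (?<year>\d{4}),\s*          # four-digit year followed by a comma
  (?<volume>\d+)              # volume
  \((?<issue>\d+)\)\s*        # issue in parentheses
  (?<first_page>\d+)          # first page
  (?:-(?<last_page>\d+))?\z   # optional last page
/x

# Returns a hash of citation fields, or nil if the text does not match.
def parse_citation(text)
  match = CITATION_PATTERN.match(text.strip)
  return nil unless match

  {
    journal: match[:journal],
    year: match[:year].to_i,
    volume: match[:volume].to_i,
    issue: match[:issue].to_i,
    first_page: match[:first_page].to_i,
    last_page: match[:last_page] && match[:last_page].to_i
  }
end

# Builds a resolver link for a DOI (assumes the doi.org resolver).
def doi_url(doi)
  "https://doi.org/#{doi}"
end
```

The fields returned by parse_citation line up with the arguments doi_for expects (journal, year, volume, issue, first page), so the two could be chained. A real freeform parser would need many more patterns than this one.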

With a little creative thought, other possibilities are well within reach.

Caveat

Before extensively experimenting with CrossRef's OpenURL system, you might want to sign up for a free account. CrossRef is understandably interested in tracking usage and this is their way to do it.
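One way to prepare for that is to keep the identifier out of the hard-coded URL. The sketch below assumes that the pid query parameter, set to ourl_sample:sample in the library above, is where your own account identifier would go. The helper name crossref_query_url is hypothetical, and the other parameter names simply mirror the URL used in doi_for.

```ruby
require 'uri'

# Hypothetical helper: builds a CrossRef OpenURL query for a citation.
# Treating pid as the slot for your own account identifier is an assumption.
CROSSREF_OPENURL = "http://www.crossref.org/openurl/"

def crossref_query_url(journal, year, volume, issue, page, pid = "ourl_sample:sample")
  params = {
    "title" => journal,
    "volume" => volume,
    "issue" => issue,
    "spage" => page,
    "date" => year,
    "pid" => pid,
    "redirect" => "false",
    "format" => "unixref"
  }

  # encode_www_form escapes every value, not only the spaces in the title.
  "#{CROSSREF_OPENURL}?#{URI.encode_www_form(params)}"
end
```

Escaping all values this way also covers journal titles that contain characters other than spaces, which the gsub in doi_for does not handle.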

Conclusions

DOIs and traditional bibliographical citations now coexist in a variety of settings, from literature citation managers to journals themselves. Using CrossRef, OpenURL and a little bit of code, it's now possible to make a great deal more sense of it all.

Harvesting bibliographical citations must be one of the least sexy topics in cheminformatics. But as Google demonstrated (building on the approach taken by Science Citation Index), cataloging citation behavior leads to a unique and highly productive way to view many tough problems. Future articles will discuss how this might apply to cheminformatics.

Image Credit: ecstaticist

CampDepict: Building a Simple SMILES Depict Web Application With JRuby, Structure CDK, and Camping

Posted by Rich Apodaca Wed, 23 Apr 2008 15:16:00 GMT

Today's tribute to the power of simplicity comes by way of John Jaeger, who has built one of the simplest cheminformatics Web applications ever written. His creation, CampDepict, interactively produces a raster image of a 2D chemical structure given a SMILES string, not unlike Daylight's Depict application.

CampDepict uses the Ruby Web microframework Camping. From the README:

Camping is a web framework which consistently stays at less than 4kb of code. You can probably view the complete source code on a single page. But, you know, it's so small that, if you think about it, what can it really do?

The idea here is to store a complete fledgling web application in a single file like many small CGIs. But to organize it as a Model-View-Controller application like Rails does. You can then easily move it to Rails once you've got it going.

John's application is loosely based on the Rails Depict application first described in 2006 here on Depth-First. His code makes use of CDK and Structure CDK, and it runs on JRuby.

If you've ever been curious about what Ruby has to offer cheminformatics, CampDepict could be just the application to get your feet wet.

Chempedia.net: Mashing Up PubChem and Wikipedia

Posted by Rich Apodaca Fri, 04 Apr 2008 14:06:00 GMT

PubChem and Wikipedia represent two of the largest open repositories of chemical information in the world. And they complement each other very nicely. PubChem contains mainly low-level chemical structure information whereas Wikipedia contains free-text descriptions of chemical compounds in the form of compound monographs.

Both services offer permission and access to copy and reuse their contents. But neither service is, by itself, nearly as useful as it could be.

Why not mash them up?

To explore that question, my company, Metamolecular, LLC, has launched Chempedia.

To my knowledge, Chempedia represents the first public-facing database of compounds to incorporate Wikipedia's collection of organic compound monographs. And it's one of the few cheminformatics services to make use of free-text descriptions generated by individual chemists.

Chempedia has been somewhat selective about the compounds it includes. To date, it has spidered over 2,500 monographs, combining them with over 300,000 of the most interesting compounds from PubChem. Not every Chempedia.net molecule has a monograph, but now there's a tool that can actually make that absence apparent.

Chempedia is both an experiment and a service. It's immediately useful for anyone in the business of making or doing things with organic molecules. It's created several unexpected moments of "Oh, that's actually a useful molecule!" It also will serve as a platform to test some of the ideas discussed in Depth-First over the last year or so on the advantages of the Web for collaboration in chemistry.

Stay tuned for more details about how Chempedia was created and some of its applications in chemistry.

Wikipedia for Cheminformatics: A Simple Web API for Finding CAS Numbers in Compound Monographs

Posted by Rich Apodaca Wed, 02 Apr 2008 21:29:00 GMT

Good news for cheminformatics: Chemical Abstracts Service (CAS) has agreed to help Wikipedia users curate its collection of CAS numbers. As a result of the diligence of some hard-working volunteers, chemistry's most universal system for referring to chemicals can now be used far more effectively by the world's biggest open repository of knowledge.

Wouldn't it be great to be able to pull these CAS numbers from Wikipedia programmatically?

Perspective

Estimates place the number of Wikipedia pages dealing with individual inorganic and organic substances in the thousands. (I'll use the term "compound monographs" to describe them.) One factor keeping this number low is the poor visibility of these entries. Unlike most chemical databases, Wikipedia can't, by itself, be easily searched by structure. As chemically aware tools for indexing Wikipedia begin to emerge, look for six things to happen:

  1. The number of Wikipedia compound monographs will increase significantly.
  2. The quality of monographs for intermediate- to well-known compounds will increase substantially.
  3. Demand for user-friendly interfaces to Wikipedia's chemical content will increase.
  4. Wikipedia users will become interested in storing and finding ever more diverse kinds of information about each compound.
  5. Bench chemists will start to include Wikipedia as one of their preferred literature search techniques, leading to...
  6. More creative tools for using the chemical content of Wikipedia.

As noted previously, it wasn't too long ago that indexing of the chemical literature was done solely by volunteers. Wikipedia offers an intriguing way to channel chemists' innate drive to combine their own work and experience with that of others to build useful information tools for the community.

But for now we are left with the question of how to index the chemical content of Wikipedia. Although a few systems have been proposed, the only practical method is through the use of CAS numbers. Which brings us to the subject of today's tutorial.

A Quick CAS Number API for Wikipedia

The Ruby program below accepts the title of any Wikipedia compound monograph and returns the CAS number for the compound being discussed, or an error message if none was found:

require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'cgi'

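class Wikikemi
  attr_reader :cas

  def initialize title
    @cas = nil
    uri = URI.escape("http://en.wikipedia.org/wiki/#{title}")
    puts "loading... #{uri}"
    doc = Hpricot(open(uri))
    # Only the first table on the page is searched.
    table = (doc/"table")[0]

    # A CAS number has 2-7 digits, then 2 digits, then a single check digit.
    match = table.inner_html.match(/([0-9]{2,7}?\-[0-9]{2}\-[0-9])/) if table

    @cas = match[1] if match
  end
end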

# Prints the CAS number present in the Wikipedia monograph with
# the indicated title, or an error message if none is found. Try, for example,
# "benzene".
while true
  puts "Enter the title of the Wikipedia page, for example: 'benzene'"
  monograph_title = gets.chomp
  w = Wikikemi.new monograph_title
  puts w.cas ? "[#{w.cas}]" : "CAS number not found"
end

This program makes use of the excellent Ruby HTML parser, Hpricot.

Saving the above code to a file called wikikemi.rb, we can run it with:

$ ruby wikikemi.rb

For example, we can look up the CAS numbers for Ferrocene, Lipitor, or 1,2,3,4,4a,5,6,7,8,8a-Decahydronaphthalene:

$ ruby wikikemi.rb
Enter the title of the Wikipedia page, for example: 'benzene'
ferrocene
loading... http://en.wikipedia.org/wiki/ferrocene
[102-54-5]
Enter the title of the Wikipedia page, for example: 'benzene'
lipitor
loading... http://en.wikipedia.org/wiki/lipitor
[134523-00-5]
Enter the title of the Wikipedia page, for example: 'benzene'
1,2,3,4,4a,5,6,7,8,8a-Decahydronaphthalene
loading... http://en.wikipedia.org/wiki/1,2,3,4,4a,5,6,7,8,8a-Decahydronaphthalene
[91-17-8]

All this method requires is that the Wikipedia page lists the correct CAS number in its Drugbox or Chembox template. Fortunately, CAS has agreed to help make this happen.
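Because the regular expression accepts any run of digits with the right shape, a check-digit test is one way to filter out accidental matches before trusting a result. The sketch below assumes the standard CAS check-digit rule: the digits before the check digit, read right to left, are weighted 1, 2, 3, and so on, and the weighted sum modulo 10 must equal the check digit. The method name valid_cas? is hypothetical and is not part of wikikemi.rb.

```ruby
# Hypothetical helper, not part of wikikemi.rb.
# Returns true when the last digit of a CAS number matches its checksum.
def valid_cas?(cas)
  return false unless cas =~ /\A\d{2,7}-\d{2}-\d\z/

  digits = cas.delete("-").chars.map(&:to_i)
  check_digit = digits.pop

  # Weight the remaining digits 1, 2, 3, ... starting from the right.
  weighted = digits.reverse.each_with_index.map { |digit, index| digit * (index + 1) }
  weighted.sum % 10 == check_digit
end
```

For ferrocene, [102-54-5], the weighted sum is 4*1 + 5*2 + 2*3 + 0*4 + 1*5 = 25, and 25 modulo 10 gives the check digit 5.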

Conclusions

A little Ruby code is all it takes to build a working CAS number lookup system using Wikipedia. Although this may be useful as a standalone tool, it becomes much more powerful when made part of a larger cheminformatics system. But that's a story for another time.

See also Antony Williams' announcement on CAS and Wikipedia.

NetBeans 6, Ruby, and Rails: A Surprisingly Effective Combination

Posted by Rich Apodaca Thu, 27 Mar 2008 17:46:00 GMT

For far too long, Ruby has lacked a development environment that supports important features developers in other languages now take for granted: code completion, refactoring, platform independence, and speed. Although NetBeans may not spring to mind when thinking of Rails IDEs, it should be at the top of the list for anyone interested in the subject.

Getting started with Ruby, Rails and NetBeans is as easy as downloading the installer and running it. If you later decide to add Java support to your installation (which is also excellent), that can be done by downloading and running the Java installer. You'll end up with a single IDE that supports both languages.

Code Completion

Although other IDEs support some form of Ruby code completion, NetBeans takes it to another level. Can't remember the exact name of the method you're looking for? Type the period and let NetBeans look up both the name and documentation for you.

Hitting return enters the method and creates a template for parameters and any needed blocks.

Refactoring

One of the things that makes Java such a powerful language for large projects is the refactoring support offered by most IDEs. NetBeans brings this power to Ruby. Need to rename a class, method, or variable? Let NetBeans do it for you.

Conclusions

There's much more to NetBeans 6 and Ruby/Rails than what's been shown here, including formatting/highlighting for JavaScript and CSS, user-definable Ruby/JRuby interpreter, and menu-based script execution. Whether you're looking for a way to get started with using Ruby and Rails or a way to become more efficient at it, NetBeans 6 is well worth the time.
