Hacking DOI: Interconvert Bibliographic References and DOIs with CrossRef and OpenURL
Science is in the middle of a transition from print to the internet as the primary medium of communication. This transition, although a boon for many scientists, creates a host of problems for those dealing with scientific information. For example, how would you interconvert a DOI and its corresponding bibliographic reference?
A previous Depth-First article discussed a screen-scraping method as one solution. Unfortunately, this system relies on an intimate understanding of how individual publishers' Websites work, requires a different implementation for each publisher, and can break at any time without warning.
This article discusses a far more robust solution to the problem of interconverting bibliographic references and DOIs.
Background: OpenURL and CrossRef
CrossRef is the official DOI link registration agency for scholarly and professional publications. One of the less well-known services offered by CrossRef is a free, Web-based bidirectional DOI/bibliographic reference converter based on OpenURL.
A Simple Ruby Library
The following Ruby library is all we need to begin using CrossRef and OpenURL:
require 'rubygems'
require 'hpricot'
require 'open-uri'

module DOI
  # Convert a doi into a bibliographic reference.
  def biblio_for doi
    doc = Hpricot(open("http://www.crossref.org/openurl/?id=doi:#{doi}&noredirect=true&pid=ourl_sample:sample&format=unixref"))
    journal = (doc/"abbrev_title").inner_html
    year = (doc/"journal_issue/publication_date/year").inner_html
    volume = (doc/"journal_issue/journal_volume/volume").inner_html
    number = (doc/"journal_issue/issue").inner_html
    first_page = (doc/"pages/first_page").inner_html
    last_page = (doc/"pages/last_page").inner_html
    "#{journal} #{year}, #{volume}(#{number}) #{first_page}-#{last_page}"
  end

  # Convert a bibliographic reference into a DOI.
  def doi_for journal, year, volume, issue, page
    doc = Hpricot(open("http://www.crossref.org/openurl/?title=#{journal.gsub(/ /, '%20')}&volume=#{volume}&issue=#{issue}&spage=#{page}&date=#{year}&pid=ourl_sample:sample&redirect=false&format=unixref"))
    (doc/"doi").inner_html
  end
end

This code makes use of the excellent Ruby HTML parser library Hpricot.
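The library above builds its query strings by hand and escapes only spaces in the journal title. As a minimal sketch, the same two URLs could be built with Ruby's standard URI.encode_www_form, which escapes every parameter. The helper names biblio_query_url and doi_query_url are hypothetical. The parameter names and the sample pid are copied from the library. This sketch assumes the CrossRef server decodes standard form encoding ("+" for spaces, "%3A" for colons), which has not been checked here.

```ruby
require 'uri'

# Endpoint copied from the library above.
CROSSREF_OPENURL = "http://www.crossref.org/openurl/"

# Hypothetical helper: URL for a DOI -> bibliographic reference lookup.
def biblio_query_url(doi, pid: "ourl_sample:sample")
  query = URI.encode_www_form(
    "id"         => "doi:#{doi}",
    "noredirect" => "true",
    "pid"        => pid,
    "format"     => "unixref"
  )
  "#{CROSSREF_OPENURL}?#{query}"
end

# Hypothetical helper: URL for a bibliographic reference -> DOI lookup.
def doi_query_url(journal, year, volume, issue, page, pid: "ourl_sample:sample")
  query = URI.encode_www_form(
    "title"    => journal,
    "volume"   => volume,
    "issue"    => issue,
    "spage"    => page,
    "date"     => year,
    "pid"      => pid,
    "redirect" => "false",
    "format"   => "unixref"
  )
  "#{CROSSREF_OPENURL}?#{query}"
end

puts biblio_query_url("10.1021/cr00032a009")
puts doi_query_url("Chem. Rev.", 1994, 94, 8, 2483)
```

Either URL can then be passed to open and Hpricot exactly as in the library. Only the way the query string is assembled changes.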
Testing the Library
Saving the Ruby code to a file named doi.rb, we can test it using the interactive Ruby shell:
$ irb
irb(main):001:0> require 'doi'
=> true
irb(main):002:0> include DOI
=> Object
irb(main):003:0> biblio_for "10.1021/cr00032a009"
=> "Chem. Rev. 1994, 94(8) 2483-2547"
irb(main):004:0> doi_for "Chem. Rev.", 1994, 94, 8, 2483
=> "10.1021/cr00032a009"
Notice how the journal abbreviation Chem. Rev. was used; we'd get the same result if we used Chemical Reviews.
Of course, the implementation described here could be refined a lot. With a DOI, it's trivial to construct a URL to the example paper. But we could take it further than that. With some carefully crafted regular expressions, our doi_for method could accept a freeform bibliographical citation rather than separately identified fragments. From there we might start to think about creating living HTML and/or Wikis from old PDFs and Word documents.
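Two of these refinements can be sketched briefly. The helper names url_for and parse_citation are hypothetical. The regular expression handles only citations in the exact shape that biblio_for produces ("Journal Year, Volume(Issue) First-Last"). Truly freeform citations would need a far more tolerant pattern, so treat this as a starting point rather than a general parser.

```ruby
# Hypothetical helper: link to a paper through the public DOI resolver.
def url_for(doi)
  "https://doi.org/#{doi}"
end

# Matches citations shaped like "Chem. Rev. 1994, 94(8) 2483-2547".
CITATION_PATTERN = /
  \A
  (?<journal>.+?)\s+        # journal title or abbreviation
  (?<year>\d{4}),\s*        # four-digit year followed by a comma
  (?<volume>\d+)            # volume
  \((?<issue>\d+)\)\s*      # issue in parentheses
  (?<first_page>\d+)        # first page
  (?:-(?<last_page>\d+))?   # optional last page
  \z
/x

# Hypothetical helper: split a citation string into its fields.
# Returns nil when the text does not match the expected shape.
def parse_citation(text)
  m = CITATION_PATTERN.match(text.strip)
  return nil unless m

  {
    journal:    m[:journal],
    year:       m[:year].to_i,
    volume:     m[:volume].to_i,
    issue:      m[:issue].to_i,
    first_page: m[:first_page].to_i,
    last_page:  m[:last_page]&.to_i
  }
end

fields = parse_citation("Chem. Rev. 1994, 94(8) 2483-2547")
p fields
puts url_for("10.1021/cr00032a009")
```

The resulting hash holds the same fragments doi_for expects, so a wrapper could call doi_for(fields[:journal], fields[:year], fields[:volume], fields[:issue], fields[:first_page]).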
With a little creative thought, other possibilities are well within reach.
Caveat
Before extensively experimenting with CrossRef's OpenURL system, you might want to sign up for a free account. CrossRef is understandably interested in tracking usage and this is their way to do it.
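Assuming the pid query parameter in the library's URLs is where account details go (the article does not spell this out), one way to avoid hard-coding the sample value is to read it from the environment. The variable name CROSSREF_PID and the helper crossref_pid are hypothetical.

```ruby
# Hypothetical helper: read the CrossRef pid from the environment,
# falling back to the sample value used in the library's URLs.
def crossref_pid
  ENV.fetch("CROSSREF_PID", "ourl_sample:sample")
end

puts crossref_pid
```

The library's URLs could then interpolate crossref_pid in place of the literal ourl_sample:sample.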
Conclusions
DOIs and traditional bibliographical citations now coexist in a variety of settings, from literature citation managers to journals themselves. Using CrossRef, OpenURL and a little bit of code, it's now possible to make a great deal more sense of it all.
Harvesting bibliographical citations must be one of the least sexy topics in cheminformatics. But as Google demonstrated (building on the approach taken by Science Citation Index), cataloging citation behavior leads to a unique and highly productive way to view many tough problems. Future articles will discuss how this might apply to cheminformatics.
Image Credit: ecstaticist


This post and its comments are also worth reading. Many of Noel O'Boyle's concerns have been addressed by the CrossRef team.
This is really nice. I have used CrossRef to try to retrieve metadata by DOI and there is actually a limit on what you are allowed, but for most people's purposes it should be fine as they are not generally as greedy as me.
For libraries, who do need bulk metadata, it is a different matter of course and CrossRef have indicated that they are prepared to charge some libraries (for metadata, not access) which is of questionable value (depending on the amount charged) since it is almost free to retrieve with screen scraping tools these days (plus the income from journals).
As CrossRef themselves have said (see the link in Rich's comment above) and/or I have learned, the quality of the metadata supplied to them by the publishers is often quite poor, e.g. it may or may not include the title, the end page, or any but the first author; the name of the journal might be abbreviated, with or without periods, or it might be written in full. Nice.
It's difficult to use such metadata for validating references in papers (where it could save a lot of mindless human labor). The best I could do is the ACS citation checker (webpage may currently be off-line due to maintenance).
Will, any idea of how many uses CrossRef allows before being throttled? I didn't see anything about that in their documentation.
Noel, could you provide some examples of where the CrossRef metadata is of poor quality?
There are plenty of examples among the 25 or so references in the example data on my ACS citation checker. Am waiting for that website to come back online (my hoster got hacked).
We have no automatic limits set right now on the openurl interface. We monitor overall use and if it is being hit unusually hard we try to track down the cause and work with the folks behind it. Generally our belief is, as long as the servers are handling the load then all is fine. But, we do need to watch for commercial outfits unfairly benefiting from the service.
Chuck, thanks - that's good to know. Can you give an idea of roughly how much might be too much use - as a guideline?
I hit the limit while using the metadata retrieval service with only Internet Explorer (i.e. not programmatically). This server load explanation looks like a red herring.
I've also never used CrossRef metadata (I prefer to screen scrape) so they don't have to worry about unfair usage (which is left conveniently undefined).