Hacking CiteULike: Metascripting with Ruby and Session

Posted by Rich Apodaca Fri, 22 Jun 2007 10:08:00 GMT

CiteULike lets users easily manage their bibliographies of scholarly works, and in the process discover other users' papers on related subjects. One of the most powerful features of CiteULike is its ability to convert arbitrary URLs into fully-formatted bibliographical citations. CiteULike manages to do this while largely avoiding the Buggotea Problem in which multiple URLs pointing to the same work are saved. Wouldn't it be useful if this aspect of CiteULike could be independently scripted, tested, and re-integrated? This article describes how to do this using the powerful scripting language Ruby.

A Simple Test

The core of CiteULike's bibliography lookup system is contained in its Filters. Filters accept a URL they're interested in and return a bibliographical citation. Each filter generally works with a specific publisher's URLs and may be written in just about any scripting language.

CiteULike has released nearly all of its filters and the driver as an Open Source package distributed under a BSD-style license. Complete documentation on using and writing filters is available here, and the package can be obtained through Subversion:

$ svn co http://svn.citeulike.org/svn/ citeulike

After changing into the citeulike/drivers directory, you'll see a file called driver.tcl. This script coordinates the activities of the various filters contained under their respective language subdirectories. Let's say you want to parse the following URL:

http://pubs.acs.org/cgi-bin/abstract.cgi/jcisd8/2006/46/i03/abs/ci050400b.html

The command to do so would be:

./driver.tcl parse http://pubs.acs.org/cgi-bin/abstract.cgi/jcisd8/2006/46/i03/abs/ci050400b.html

If you get an error starting with:

couldn't execute "./acs.py": no such file or directory
    while executing
"open "|./[file tail $exe]" "r+""
    (procedure "parse_url" line 31)
    invoked from within

then the problem lies with the shebang line of the drivers/python/acs.py script. For example, on my system I need to change the shebang to:

#!/usr/bin/python2.5

Making this change and re-running the driver script gives the output I was expecting:

parsing http://pubs.acs.org/cgi-bin/abstract.cgi/jcisd8/2006/46/i03/abs/ci050400b.html

serial -> 1549-9596
volume -> 46
linkouts -> {DOI {} 10.1021/ci050400b {} {}}
year -> 2006
type -> JOUR
start_page -> 991
url -> http://pubs3.acs.org/acs/journals/doilookup?in_doi=10.1021/ci050400b
end_page -> 998
plugin_version -> 1
doi -> 10.1021/ci050400b
day -> 22
issue -> 3
title -> The Blue Obelisk-Interoperability in Chemical Informatics
journal -> J. Chem. Inf. Model.
abstract -> Abstract: The Blue Obelisk Movement (http://www.blueobelisk.org/) is the name used by a diverse Internet group promoting reusable chemistry via open source software development, consistent and complimentary chemoinformatics research, open data, and open standards. We outline recent examples of cooperation in the Blue Obelisk group: a shared dictionary of algorithms and implementations in chemoinformatics algorithms drawing from our various software projects; a shared repository of chemoinformatics data including elemental properties, atomic radii, isotopes, atom typing rules, and so forth; and Web services for the platform-independent use of chemoinformatics programs.
status -> ok
month -> 5
authors -> {Guha {} R {Guha, R.}} {Howard {} MT {Howard, M.T.}} {Hutchison {} GR {Hutchison, G.R.}} {Murray-Rust {} P {Murray-Rust, P.}} {Rzepa {} H {Rzepa, H.}} {Steinbeck {} C {Steinbeck, C.}} {Wegner {} J {Wegner, J.}} {Willighagen {} EL {Willighagen, E.L.}}
address -> Pennsylvania State University, University Park, Pennsylvania 16804-3000, Jmol Project, U. S. A., Cornell University, Ithaca, New York 14853, Cambridge University, Cambridge CB2 1TN, Great Britain, Imperial College, London SW7 2AZ, Great Britain, Cologne University Bioinformatics Center (CUBIC), Zülpicher Str. 47, D-50674 Köln, Germany, University of Tübingen, Tübingen, Germany, and Jmol project, The Netherlands
plugin -> acs

Metascripting with Ruby and Session

The CiteULike driver is written in Tcl, a language I've heard about and been interested in, but which I just don't have the time to learn. Wouldn't it be great if we could direct the activities of the CiteULike driver from the comfort and power of Ruby?

It turns out that a handy little Ruby library exists which is perfect for the metascripting we'll need to do - Session. The Session library can be installed with:

# gem install session

Once it's installed, we can fire up interactive Ruby (irb) and tell driver.tcl what to do:

$ irb
irb(main):001:0> require 'rubygems'
=> true
irb(main):002:0> require 'session'
=> true
irb(main):003:0> url = 'http://pubs.acs.org/cgi-bin/abstract.cgi/jcisd8/2006/46/i03/abs/ci050400b.html'
=> "http://pubs.acs.org/cgi-bin/abstract.cgi/jcisd8/2006/46/i03/abs/ci050400b.html"
irb(main):004:0> session = Session.new
=> #, @threads=[], @history=nil, @stdin=#, @use_open3=nil, @opts={}, @errproc=nil, @use_spawn=nil, @debug=nil, @stderr=#, @outproc=nil, @track_history=nil, @prog="sh">
irb(main):005:0> result=session.execute "./driver.tcl parse #{url}"
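
The same call can also be made from a standalone script. The sketch below is not the Session library: it swaps in Open3 from Ruby's standard library, which avoids the extra gem at the cost of starting a new process for every command. The method name run_driver and the default driver path are my own assumptions, and the script presumes it is run from the citeulike/drivers directory.

```ruby
require 'open3'

# Run the CiteULike driver on one URL and return [stdout, stderr, success].
# The default driver path is an assumption: it presumes the current
# directory is citeulike/drivers, as in the walkthrough above.
def run_driver(url, driver: './driver.tcl')
  # Passing the arguments separately keeps the shell from interpreting
  # any special characters in the URL.
  stdout, stderr, status = Open3.capture3(driver, 'parse', url)
  [stdout, stderr, status.success?]
end

# Only attempt the real call when the driver script is actually present.
if File.executable?('./driver.tcl')
  url = 'http://pubs.acs.org/cgi-bin/abstract.cgi/jcisd8/2006/46/i03/abs/ci050400b.html'
  out, err, ok = run_driver(url)
  puts(ok ? out : "driver failed: #{err}")
end
```

Returning the error stream and exit status alongside the output makes problems such as the shebang error described earlier easier to spot from Ruby.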

Reprocessing the Bibliography

The last command of our interactive Ruby session returns an Array, which we stored in "result". Its first element is our article's bibliographical information. We can extract the title with the following commands:

irb(main):011:0> result[0].match /title -> (.*)/
=> #
irb(main):012:0> $1
=> "The Blue Obelisk-Interoperability in Chemical Informatics"

Using a series of similar regular expressions, we can re-construct the full bibliographical citation for the paper.
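
Rather than writing one regular expression per field, the "key -> value" lines can be split into a Hash in a single pass. The following is a minimal sketch under a few assumptions: the method names parse_driver_output and format_citation are hypothetical, the author extraction is a regular-expression shortcut rather than a real Tcl list parser, and the citation layout is just one plausible style.

```ruby
# Parse the driver's "key -> value" lines into a Hash of strings.
# Lines without " -> " (the "parsing ..." banner, blank lines) are skipped.
def parse_driver_output(text)
  text.each_line.with_object({}) do |line, fields|
    key, value = line.chomp.split(' -> ', 2)
    fields[key] = value if value
  end
end

# Build a one-line citation from the parsed fields.
def format_citation(fields)
  # Each author entry ends with a braced display name such as {Guha, R.};
  # grab every innermost brace group that contains a comma.
  authors = fields.fetch('authors', '').scan(/\{([^{}]*,[^{}]*)\}/).flatten
  pages = [fields['start_page'], fields['end_page']].compact.join('-')
  "#{authors.join('; ')} #{fields['title']}. #{fields['journal']} " \
    "#{fields['year']}, #{fields['volume']}, #{pages}."
end

# A shortened copy of the driver output shown earlier in this article.
sample = <<~OUTPUT
  parsing http://pubs.acs.org/cgi-bin/abstract.cgi/jcisd8/2006/46/i03/abs/ci050400b.html

  volume -> 46
  year -> 2006
  start_page -> 991
  end_page -> 998
  title -> The Blue Obelisk-Interoperability in Chemical Informatics
  journal -> J. Chem. Inf. Model.
  authors -> {Guha {} R {Guha, R.}} {Howard {} MT {Howard, M.T.}}
OUTPUT

puts format_citation(parse_driver_output(sample))
```

In the irb session above, the equivalent call would be format_citation(parse_driver_output(result[0])).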

Conclusions

The availability of the CiteULike filters and driver opens up many possibilities for building collaborative bibliographical management applications. By using some simple metascripting techniques, this can be done in any scripting language. Our little example here is but a glimpse of what might be possible.

Buggotea: The Problem with Abundance

Posted by Rich Apodaca Fri, 15 Jun 2007 09:09:00 GMT

Although I don't use it yet, Connotea is a very useful service for many scientists. Combining aspects of social networking and bibliography management, Connotea offers a glimpse at some of the vast potential for Web 2.0 in the sciences. But the service is not without its thorny technical problems, one of which is discussed in this article.

For those unfamiliar with the service, Connotea lets you organize and share hyperlinks. This, in itself, is nothing remarkable. Many services such as Digg, del.icio.us, and Reddit offer similar capabilities.

What's unique about Connotea is its emphasis on bookmarking scientific and scholarly content. By taking advantage of the CrossRef service built on top of the DOI system, Connotea makes creating a bibliographical reference to a paper as easy as entering a short alphanumeric sequence found on the document itself.

As long as all Connotea users work with DOIs, there is no problem. The DOI organization ensures that every document with a DOI can be accessed via a single, immutable URL. For example, if a paper has a DOI of "10.1021/ol015948s", then the document can be accessed through this link.

But what happens if a Connotea user either doesn't know about DOI or for some reason prefers not to use it? Instead, they'd rather work with a publisher's URL directly. This is not as unlikely as it may seem at first. For example, Connotea fails to recognize the title of many ACS papers when they are entered as DOIs, but does recognize them as direct abstract links.

PubMed offers still more ways to refer to the same document. To name a few:

Without really trying, we've found no fewer than five different URLs that all refer to the same scientific work. If you look under my user profile, you'll see that Connotea is happy to add all of these references as separate entities. This means that each will receive its own set of tags and its own summary page. If my collection of links grows to a few hundred, I may not realize that I actually have two or three links to the same paper in my collection. And other Connotea users may fail to see my papers because they're using a URL that differs from mine.
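
To make the duplication concrete, here is a rough sketch of how a collection of links could be grouped by a canonical key. It is only an illustration, not how Connotea works: the DOI pattern is a simplification, the method names are hypothetical, and the example.org address is a made-up stand-in for a link that carries a DOI in its query string. A URL with no embedded DOI, such as a publisher's abstract page, falls through unchanged, which is exactly the case that needs a real lookup.

```ruby
# Simplified pattern for a DOI embedded somewhere in a URL (an assumption;
# real DOIs may contain characters this pattern cuts short).
DOI_PATTERN = %r{10\.\d{4,9}/[^\s?#&]+}

# Map a URL to a canonical key: the lower-cased DOI when one is visible
# in the URL, otherwise the URL itself.
def canonical_key(url)
  doi = url[DOI_PATTERN]
  doi ? "doi:#{doi.downcase}" : url
end

# Return only those keys that more than one saved URL maps to.
def group_duplicates(urls)
  urls.group_by { |url| canonical_key(url) }
      .select { |_key, group| group.size > 1 }
end

bookmarks = [
  'http://dx.doi.org/10.1021/ol015948s',
  'http://example.org/lookup?doi=10.1021/ol015948s', # hypothetical link
  'http://pubs.acs.org/cgi-bin/abstract.cgi/jcisd8/2006/46/i03/abs/ci050400b.html'
]

group_duplicates(bookmarks).each do |key, group|
  puts "#{key} is saved #{group.size} times"
end
```

The pattern-matching step only catches the easy cases; it cannot tell that a publisher's abstract page and a DOI link point to the same paper.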

After researching this problem a bit, I found that although it doesn't seem to have an immediate solution, at least it has a name: Buggotea. It bears a remarkable similarity to the "unique" SMILES problem, which was a major motivation for the development of InChI.

It wasn't long ago that the ability to access the scientific literature online seemed far-fetched. Today, the Internet has become the only scientific publication medium that matters. This has created a variety of new problems - and opportunities to solve them.

Image Credit: gottcha78