Manage Your Bibliography with Firefox and Zotero

Posted by Rich Apodaca Tue, 03 Jul 2007 19:08:00 GMT

Compiling and managing a bibliography is still a pain. The problem is that too many of the available tools require more busywork than necessary. Copying, pasting, and especially transcribing all fall into this category. And most tools fail to take advantage of the fact that the Web browser has become the de facto standard tool for scientific information gathering.

A piece of software called Zotero may change this. Zotero is a Firefox extension that helps you compile and manage your bibliography, taking care of the most tedious and error-prone steps in the process.

Let's say Jane comes across a J. Am. Chem. Soc. paper that she wants to add to her bibliography. With Zotero, this is as simple as finding the document icon in the browser address bar and clicking on it.

Doing so saves the paper to Jane's Zotero bibliography.

And that's it. (Be aware that ACS ASAP articles aren't yet recognized by Zotero.) Later on, Jane can go back and organize the articles she's collected by creating folders and adding tags. If you're interested in the details, the Zotero movie is helpful.

Zotero has a lot going for it. Besides the compelling user experience, here are some other things to consider:

  • Zotero is released under the Educational Community License, a BSD-style academic license recognized by the Open Source Initiative. Do what you like with the code; just don't hold its creators liable.

  • Zotero uses a system of filters analogous to those used by CiteULike. Not every journal/publisher has a filter. For example, neither Synthesis nor Synlett papers are recognized (yet).

  • Zotero offers an IDE called Scaffold that takes most of the drudgery out of writing filters. This enables developers with minimal JavaScript knowledge to write software that will import their favorite journals' articles.

  • Plans are in the works for a Zotero server for collaborative bibliography creation.

  • For those in industry, Zotero does patents.

  • If you use Microsoft Word, Zotero apparently works with it. Support for OpenOffice is in progress.

In my view, Zotero's biggest limitation is that all data are stored on the local hard drive. Unlike CiteULike, there's no way to extract collective wisdom from the citation process. The ultimate application would combine Zotero's ease of use with CiteULike's collaborative features. Given everything Zotero has going for it, the wait may not be that long.

Easily Convert Publisher URLs and DOIs to Bibliographical Citations: Synthesis, Synlett, Ruby, and Mechanize

Posted by Rich Apodaca Wed, 27 Jun 2007 12:45:00 GMT

Just ten years ago, the thought of accessing all of the world's scientific literature online struck many as optimistic at best. Today, an increasing number of scientists use the Web as their only means of reading the literature.

This shift has brought with it a significant, but rarely discussed problem: converting a publisher URL or DOI to a bibliographical citation (title, authors, journal, page, volume, etc.). This is a problem because bookmarking and linking URLs are the way we reference Web documents, but the bibliographical citation is still how we reference paper documents. We may well see the day when the need for bibliographical citations disappears, but until that happens there's a need for user-friendly tools that manage the conversion.

This article discusses a remarkably simple and flexible solution to this problem using Ruby and the outstanding Mechanize library. As test subjects, I'll use two of my favorite journals: Synthesis and Synlett.

What is Mechanize?

From the Mechanize documentation:

The Mechanize library is used for automating interaction with websites. Mechanize automatically stores and sends cookies, follows redirects, can follow links, and submit forms. Form fields can be populated and submitted. Mechanize also keeps track of the sites that you have visited as a history.

Think of Mechanize as a programmable Web browser controlled by Ruby. This powerful idea offers possibilities that go far beyond the relatively simple example I'll describe here.

A Simple Library

Our library consists of the following code:

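require 'rubygems'
require 'mechanize'

module Thieme
  # Returns the RIS citation text for a Thieme article URL (or DOI URL).
  def get_ris url
    agent = WWW::Mechanize.new
    # Fetch the article page; Mechanize follows any redirects.
    page = agent.get url
    # Find the link whose text matches "Biblio"/"biblio".
    ris_link = page.links.text /[Bb]iblio/
    # The href is treated as site-relative, so prepend the host we landed on.
    ris_url = "http://" + page.uri.host + ris_link.href

    # Download the RIS file and return its contents.
    agent.get_file ris_url
  end
end

After saving this code in a file called thieme.rb, we can test it on this Synthesis article with interactive ruby (irb):

$ irb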
irb(main):001:0> require 'thieme'
=> true
irb(main):002:0> include Thieme
=> Object
irb(main):003:0> ris=get_ris 'http://www.thieme-connect.com/ejournals/abstract/synthesis/doi/10.1055/s-2007-966071'
=> "\nTY  - JOUR\nID  - 101055S2007966071\nAU  - Gil,Mar\355a Victoria\nAU  - Ar\351valo,Mar\355a Jos\351\nAU  - L\363pez,\323scar\nT1  - Click Chemistry - What?s in a Name? Triazole Synthesis and Beyond\nJO  - Synthesis\nPY  - 2007///\nIS  - 11\nSP  - 1589\nEP  - 1620\nER  - \n\n"
irb(main):004:0> ris.match /T1  - (.*)/
=> #
irb(main):005:0> title = $1
=> "Click Chemistry - What?s in a Name? Triazole Synthesis and Beyond"

Let's say that instead of a deep link to an article in the Thieme site we have a DOI. Can we still get the bibliographical citation?

irb(main):006:0> ris=get_ris 'http://dx.doi.org/10.1055/s-2007-966071'
=> "\nTY  - JOUR\nID  - 101055S2007966071\nAU  - Gil,Mar\355a Victoria\nAU  - Ar\351valo,Mar\355a Jos\351\nAU  - L\363pez,\323scar\nT1  - Click Chemistry - What?s in a Name? Triazole Synthesis and Beyond\nJO  - Synthesis\nPY  - 2007///\nIS  - 11\nSP  - 1589\nEP  - 1620\nER  - \n\n"
irb(main):007:0> ris.match /T1  - (.*)/
=> #
irb(main):008:0> title = $1
=> "Click Chemistry - What?s in a Name? Triazole Synthesis and Beyond"

It worked! Mechanize had no problem following the redirect from dx.doi.org. Similar results would be obtained with a Synlett article or DOI.
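
The single regular expression used above for the title generalizes to the whole record. What follows is only a minimal sketch, assuming each RIS line has the shape "TAG  - value" seen in the output above; the method name parse_ris and the shortened sample record are made up for illustration and are not part of thieme.rb.

```ruby
# Hypothetical helper: turn RIS text into a Hash of tag => [values].
# Repeated tags (such as AU for authors) accumulate in order.
def parse_ris(ris)
  record = {}
  ris.each_line do |line|
    # Two-character tag, two spaces, a hyphen, then the value.
    match = line.match(/\A([A-Z][A-Z0-9])  - ?(.*)/)
    next unless match
    (record[match[1]] ||= []) << match[2].strip
  end
  record
end

# Shortened, ASCII-only sample modeled on the output shown above.
sample = "\nTY  - JOUR\nAU  - Gil,M\nAU  - Lopez,O\n" \
         "T1  - Click Chemistry\nJO  - Synthesis\nPY  - 2007///\n" \
         "IS  - 11\nSP  - 1589\nEP  - 1620\nER  - \n\n"

record = parse_ris(sample)
puts record['T1'].first
puts record['AU'].join('; ')
```

Keeping every value in an Array avoids special-casing tags that can repeat, at the cost of calling .first on single-valued fields.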

For this approach to be truly useful, our software would need to gracefully handle character encoding to avoid garbled strings such as "What?s".
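
One way this might be handled is sketched below. It assumes the RIS bytes arrive in a single-byte Western encoding such as Windows-1252, which is a guess based on escapes like \355 in the output above, not something the publisher documents. It also relies on Ruby's String#encode, available in Ruby 1.9 and later. The method name ris_to_utf8 is hypothetical.

```ruby
# Hypothetical helper: reinterpret raw RIS bytes as Windows-1252
# (an assumption) and transcode them to UTF-8.
def ris_to_utf8(raw, source_encoding = Encoding::Windows_1252)
  raw.dup.force_encoding(source_encoding)
     .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end

# Bytes modeled on the output above: \xED is an accented i, and \x92 is
# a curly apostrophe in Windows-1252.
raw = "AU  - Gil,Mar\xEDa Victoria\nT1  - What\x92s in a Name?\n".b
puts ris_to_utf8(raw)
```

If the server declares a charset in its response headers, that value would be a better choice than the hard-coded default used here.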

How it Works

Our library relies on two important things being provided by the publisher: (1) a downloadable version of the RIS file for every article; and (2) a consistent way to access it across journals. By simply telling Mechanize to follow a link labeled as "Download bibliographical data", we can easily retrieve the full citation. Fortunately, nearly every scientific publisher follows this practice.

Conclusions

Just a few lines of Ruby code have solved a significant scientific information management problem, at least for one journal. A complete solution to the problem would require code for every scientific journal, a task well underway at CiteULike. While nothing here can pretend to be an end-user application, it's not difficult to imagine how to build one (or a few) using these basic concepts. But that's a story for another time.

Hacking CiteULike: Metascripting with Ruby and Session

Posted by Rich Apodaca Fri, 22 Jun 2007 14:08:00 GMT

CiteULike lets users easily manage their bibliographies of scholarly works, and in the process discover other users' papers on related subjects. One of the most powerful features of CiteULike is its ability to convert arbitrary URLs into fully-formatted bibliographical citations. CiteULike manages to do this while largely avoiding the Buggotea Problem, in which multiple URLs pointing to the same work are saved. Wouldn't it be useful if this aspect of CiteULike could be independently scripted, tested, and re-integrated? This article describes how to do this using the powerful scripting language Ruby.

A Simple Test

The core of CiteULike's bibliography lookup system is contained in its Filters. Filters accept a URL they're interested in and return a bibliographical citation. Each filter generally works with a specific publisher's URLs and may be written in just about any scripting language.

CiteULike has released nearly all of its filters and the driver as an Open Source package distributed under a BSD-style license. Complete documentation on using and writing filters is available here, and the package can be obtained through Subversion:

$ svn co http://svn.citeulike.org/svn/ citeulike

After changing into the citeulike/drivers directory, you'll see a file called driver.tcl. This script coordinates the activities of the various filters contained under their respective language subdirectories. Let's say you want to parse the following URL:

http://pubs.acs.org/cgi-bin/abstract.cgi/jcisd8/2006/46/i03/abs/ci050400b.html

The command to do so would be:

./driver.tcl parse http://pubs.acs.org/cgi-bin/abstract.cgi/jcisd8/2006/46/i03/abs/ci050400b.html

If you get an error starting with:

couldn't execute "./acs.py": no such file or directory
    while executing
"open "|./[file tail $exe]" "r+""
    (procedure "parse_url" line 31)
    invoked from within

then the problem lies with the shebang line of the drivers/python/acs.py script. For example, on my system I need to change the shebang to:

#!/usr/bin/python2.5

Making this change and re-running the driver script gives the output I was expecting:

parsing http://pubs.acs.org/cgi-bin/abstract.cgi/jcisd8/2006/46/i03/abs/ci050400b.html

serial -> 1549-9596
volume -> 46
linkouts -> {DOI {} 10.1021/ci050400b {} {}}
year -> 2006
type -> JOUR
start_page -> 991
url -> http://pubs3.acs.org/acs/journals/doilookup?in_doi=10.1021/ci050400b
end_page -> 998
plugin_version -> 1
doi -> 10.1021/ci050400b
day -> 22
issue -> 3
title -> The Blue Obelisk-Interoperability in Chemical Informatics
journal -> J. Chem. Inf. Model.
abstract -> Abstract: The Blue Obelisk Movement (http://www.blueobelisk.org/) is the name used by a diverse Internet group promoting reusable chemistry via open source software development, consistent and complimentary chemoinformatics research, open data, and open standards. We outline recent examples of cooperation in the Blue Obelisk group: a shared dictionary of algorithms and implementations in chemoinformatics algorithms drawing from our various software projects; a shared repository of chemoinformatics data including elemental properties, atomic radii, isotopes, atom typing rules, and so forth; and Web services for the platform-independent use of chemoinformatics programs.
status -> ok
month -> 5
authors -> {Guha {} R {Guha, R.}} {Howard {} MT {Howard, M.T.}} {Hutchison {} GR {Hutchison, G.R.}} {Murray-Rust {} P {Murray-Rust, P.}} {Rzepa {} H {Rzepa, H.}} {Steinbeck {} C {Steinbeck, C.}} {Wegner {} J {Wegner, J.}} {Willighagen {} EL {Willighagen, E.L.}}
address -> Pennsylvania State University, University Park, Pennsylvania 16804-3000, Jmol Project, U. S. A., Cornell University, Ithaca, New York 14853, Cambridge University, Cambridge CB2 1TN, Great Britain, Imperial College, London SW7 2AZ, Great Britain, Cologne University Bioinformatics Center (CUBIC), Zülpicher Str. 47, D-50674 Köln, Germany, University of Tübingen, Tübingen, Germany, and Jmol project, The Netherlands
plugin -> acs

Metascripting with Ruby and Session

The CiteULike driver is written in Tcl, a language I've heard about and been interested in, but just don't have the time to learn. Wouldn't it be great if we could direct the activities of the CiteULike driver from the comfort and power of Ruby?

It turns out that a handy little Ruby library called Session is perfect for the metascripting we'll need to do. It can be installed with:

# gem install session

Once installed, we can fire up interactive ruby (irb), and tell driver.tcl what to do:

$ irb
irb(main):001:0> require 'rubygems'
=> true
irb(main):002:0> require 'session'
=> true
irb(main):003:0> url = 'http://pubs.acs.org/cgi-bin/abstract.cgi/jcisd8/2006/46/i03/abs/ci050400b.html'
=> "http://pubs.acs.org/cgi-bin/abstract.cgi/jcisd8/2006/46/i03/abs/ci050400b.html"
irb(main):004:0> session = Session.new
=> #, @threads=[], @history=nil, @stdin=#, @use_open3=nil, @opts={}, @errproc=nil, @use_spawn=nil, @debug=nil, @stderr=#, @outproc=nil, @track_history=nil, @prog="sh">
irb(main):005:0> result=session.execute "./driver.tcl parse #{url}"
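
For readers who would rather not install an extra gem, roughly the same call can be sketched with Open3 from Ruby's standard library. This is a swapped-in alternative, not the Session approach used in this article. The helper name run_driver and its driver: keyword are made up for illustration. As with the irb session above, it assumes the current directory is the one containing driver.tcl.

```ruby
require 'open3'

# Hypothetical helper: run the CiteULike driver on a URL and return
# [stdout, stderr, status]. The first two elements play the same role
# as result[0] and result[1] in the Session example.
def run_driver(url, driver: './driver.tcl')
  # Passing the arguments separately keeps the shell from interpreting
  # any special characters in the URL.
  Open3.capture3(driver, 'parse', url)
end

# Usage, from the directory containing driver.tcl:
#   stdout, stderr, status = run_driver(url)
#   warn stderr unless status.success?
```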

Reprocessing the Bibliography

The last command of our interactive ruby session returns an Array called "result", the first element of which is our article's bibliographical information. We can extract its title with the following commands:

irb(main):011:0> result[0].match /title -> (.*)/
=> #
irb(main):012:0> $1
=> "The Blue Obelisk-Interoperability in Chemical Informatics"

Using a series of similar regular expressions, we can re-construct the full bibliographical citation for the paper.
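
Rather than one regular expression per field, a single pattern can collect every "key -> value" line at once. The following is a rough sketch under that assumption; the method names parse_driver_output and format_citation, and the citation layout itself, are invented here for illustration. The sample text is a shortened copy of the driver output shown earlier.

```ruby
# Hypothetical helper: collect "key -> value" lines into a Hash.
# Lines without that shape (such as the leading "parsing ..." line)
# are skipped.
def parse_driver_output(text)
  fields = {}
  text.each_line do |line|
    match = line.match(/\A(\w+) -> (.*)/)
    fields[match[1]] = match[2].strip if match
  end
  fields
end

# Hypothetical formatter: one of many possible citation layouts.
def format_citation(fields)
  "#{fields['title']}. #{fields['journal']} #{fields['year']}, " \
    "#{fields['volume']}, #{fields['start_page']}-#{fields['end_page']}."
end

sample = <<~OUTPUT
  parsing http://pubs.acs.org/cgi-bin/abstract.cgi/jcisd8/2006/46/i03/abs/ci050400b.html

  volume -> 46
  year -> 2006
  start_page -> 991
  end_page -> 998
  title -> The Blue Obelisk-Interoperability in Chemical Informatics
  journal -> J. Chem. Inf. Model.
OUTPUT

puts format_citation(parse_driver_output(sample))
```

With the Session example above, the input would be result[0]. Fields such as authors and linkouts use Tcl list syntax and would need further parsing, which this sketch does not attempt.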

Conclusions

The availability of the CiteULike filters and driver opens up many possibilities for building collaborative bibliographical management applications. By using some simple metascripting techniques, this can be done in any scripting language. Our little example here is but a glimpse of what might be possible.

Why I Still Don't Use Connotea

Posted by Rich Apodaca Thu, 22 Mar 2007 16:17:00 GMT

Like most scientists, I have a collection of hardcopy journal articles. After they sit on my desk for a while, I sort them into folders. Each folder has a label such as "dihydroxylation", "olefin metathesis", or "InChI". This system is nothing more than a small ontology. It does the job of building a top-level index of my papers, but it's not nearly as effective as it could be.

There are many problems with ontology. For example, the world changes; I decide to add just one aminohydroxylation paper to the "dihydroxylation" folder and before I know it there are five others in there. Most papers require multiple categories; should I file that metathesis paper under "ring closing", "ruthenium", or "nobel"?

Some time ago, Nature Publishing Group launched Connotea, a service designed to do for scientific papers what del.icio.us does for hyperlinks. CiteULike is a similar service. Both services abandon hierarchical classification in favor of tags: short text descriptions that can be applied to one or more articles. The possibilities of harnessing the collective intelligence of your fellow scientists through these services are tantalizing. And the ability to finally do away with hardcopy journal articles seems liberating.

I think both Connotea and CiteULike are great services, but I still continue to use my horrible system of physical papers and physical folders. And I know I'm not alone. Maybe the thought of transcribing my massive collection of paper into a system like Connotea gives me just the excuse I need to avoid doing it. Maybe I just like being able to browse the articles in these folders while looking for new ideas. Increasingly, I've been turning directly to services like SciFinder and Google to track down a paper, even if I know it's in my collection. So maybe my collection of hardcopy articles just isn't as useful as it once was.

Successful information systems demonstrate a concrete payoff that is much higher than the price of admission. As anyone who uses Linux or Mac OS X can tell you, technical superiority alone is not enough to make people switch. Although Connotea could no doubt make the management of my personal collection of articles easier, the price is simply too high to justify the effort.

Image Credit: Jean Ruaud