Starting, Quitting, and Finishing

Posted by Rich Apodaca Fri, 29 Jun 2007 09:19:00 GMT

As work on the 2D chemical structure editor I've codenamed Firefly enters its finishing iteration, I'm reminded of the many things in life that seem to get exponentially more difficult the closer they get to completion. There should be a name for this effect. The closest thing I've seen to a discussion is Seth Godin's The Dip.

Although every project has its own unique qualities, they all tend to go through the same three stages.

At the beginning of a project, every bit of work you put in results in something new and exciting. Nothing is more motivating (and addictive) than new and exciting stuff. The only constraints you face in this phase are those imposed by the problem itself. None of your previous decisions play a role, nor does a deep understanding of the problem. The project is wide open and anything is possible.

As the project matures, cruft starts to build up - lots of cruft. You realize that some of your approaches were way too general and others were way too specific. You begin to rewrite your review article, redesign your experiment, or refactor your code. Things that had once worked very nicely start to break. You patch them up, only to realize that your patch makes it impossible to move forward on other fronts. You refactor and redesign again ... and again ... and again. Through all of this churning, you're not seeing or creating anything new. Instead, you're retracing old ground. Few things are more de-motivating than retracing old ground. The urge to quit and move on to something new at this point is almost irresistible. As difficult as it can be to endure this phase, it's the only path to deep understanding of your problem.

Finally, you reach a point when it's clear you can move forward again. Now begins the hardest phase of the entire project because it's where you finally have to confront all of the niggly little details you put off during the first two phases. What makes this phase so difficult is not that the problems are intellectually challenging. No, what makes this phase so tough is that the problems that you now must solve are: (a) mind-numbingly boring and tedious; (b) unbelievably numerous; and (c) the only thing that stands between you and finishing - you've already nailed all the fun, juicy problems.

Having been through this cycle a few times forever changes your perspective on starting, finishing and quitting. You might even come to believe that knowing when each of these three actions is appropriate is the essential element of success.

Easily Convert Publisher URLs and DOIs to Bibliographical Citations: Synthesis, Synlett, Ruby, and Mechanize

Posted by Rich Apodaca Wed, 27 Jun 2007 08:45:00 GMT

Just ten years ago, the thought of accessing all of the world's scientific literature online struck many as optimistic at best. Today, an increasing number of scientists use the Web as their only means of reading the literature.

This shift has brought with it a significant, but rarely discussed problem: converting a publisher URL or DOI to a bibliographical citation (title, authors, journal, page, volume, etc.). This is a problem because bookmarking and linking URLs are the way we reference Web documents, but the bibliographical citation is still how we reference paper documents. We may well see the day when the need for bibliographical citations disappears, but until that happens there's a need for user-friendly tools that manage the conversion.

This article discusses a remarkably simple and flexible solution to this problem using Ruby and the outstanding Mechanize library. As test subjects, I'll use two of my favorite journals: Synthesis and Synlett.

What is Mechanize?

From the Mechanize documentation:

The Mechanize library is used for automating interaction with websites. Mechanize automatically stores and sends cookies, follows redirects, can follow links, and submit forms. Form fields can be populated and submitted. Mechanize also keeps track of the sites that you have visited as a history.

Think of Mechanize as a programmable Web browser controlled by Ruby. This powerful idea offers possibilities that go far beyond the relatively simple example I'll describe here.

A Simple Library

Our library consists of the following code:

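require 'rubygems'
require 'mechanize'

module Thieme
  # Returns the RIS record for the article at url as a String.
  def get_ris url
    agent = WWW::Mechanize.new
    page = agent.get url

    # Find the link whose text mentions "biblio" (the RIS download link).
    ris_link = page.links.text /[Bb]iblio/

    # This assumes the link's href is a path on the same host as the page.
    ris_url = "http://" + page.uri.host + ris_link.href

    agent.get_file ris_url
  end
end

After saving this code in a file called thieme.rb, we can test it on this Synthesis article with interactive ruby (irb):

$ irb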
irb(main):001:0> require 'thieme'
=> true
irb(main):002:0> include Thieme
=> Object
irb(main):003:0> ris=get_ris 'http://www.thieme-connect.com/ejournals/abstract/synthesis/doi/10.1055/s-2007-966071'
=> "\nTY  - JOUR\nID  - 101055S2007966071\nAU  - Gil,Mar\355a Victoria\nAU  - Ar\351valo,Mar\355a Jos\351\nAU  - L\363pez,\323scar\nT1  - Click Chemistry - What?s in a Name? Triazole Synthesis and Beyond\nJO  - Synthesis\nPY  - 2007///\nIS  - 11\nSP  - 1589\nEP  - 1620\nER  - \n\n"
irb(main):004:0> ris.match /T1  - (.*)/
=> #
irb(main):005:0> title = $1
=> "Click Chemistry - What?s in a Name? Triazole Synthesis and Beyond"

Let's say that instead of a deep link to an article in the Thieme site we have a DOI. Can we still get the bibliographical citation?

irb(main):006:0> ris=get_ris 'http://dx.doi.org/10.1055/s-2007-966071'
=> "\nTY  - JOUR\nID  - 101055S2007966071\nAU  - Gil,Mar\355a Victoria\nAU  - Ar\351valo,Mar\355a Jos\351\nAU  - L\363pez,\323scar\nT1  - Click Chemistry - What?s in a Name? Triazole Synthesis and Beyond\nJO  - Synthesis\nPY  - 2007///\nIS  - 11\nSP  - 1589\nEP  - 1620\nER  - \n\n"
irb(main):007:0> ris.match /T1  - (.*)/
=> #
irb(main):008:0> title = $1
=> "Click Chemistry - What?s in a Name? Triazole Synthesis and Beyond"

It worked! Mechanize had no problem following the redirect from dx.doi.org. Similar results would be obtained with a Synlett article or DOI.
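The regular-expression match on T1 used above generalizes to the whole record. What follows is only a minimal sketch, not part of the library: parse_ris is a name made up for illustration, and the sample record is abridged from the irb session, with the author names written as UTF-8 escapes for readability.

```ruby
# Parse an RIS record into a Hash that maps each two-character tag
# to an Array of values (tags such as AU can repeat).
# parse_ris is a hypothetical helper, not part of thieme.rb.
def parse_ris(ris)
  fields = Hash.new { |hash, tag| hash[tag] = [] }
  ris.each_line do |line|
    match = line.match(/\A([A-Z][A-Z0-9])  - ?(.*)/)
    next unless match
    tag = match[1]
    value = match[2].strip
    next if tag == 'ER' || value.empty? # ER only marks the end of the record
    fields[tag] << value
  end
  fields
end

# Sample abridged from the irb output above; accents are an assumed decoding.
sample = [
  'TY  - JOUR',
  "AU  - Gil,Mar\u00EDa Victoria",
  "AU  - Ar\u00E9valo,Mar\u00EDa Jos\u00E9",
  "AU  - L\u00F3pez,\u00D3scar",
  'T1  - Click Chemistry - What?s in a Name? Triazole Synthesis and Beyond',
  'JO  - Synthesis',
  'PY  - 2007///',
  'SP  - 1589',
  'EP  - 1620',
  'ER  - '
].join("\n")

record = parse_ris(sample)
puts record['T1'].first
puts record['AU'].join('; ')
puts "#{record['JO'].first} #{record['PY'].first[0, 4]}, #{record['SP'].first}-#{record['EP'].first}"
```

Keeping every value in an Array means repeated tags such as AU need no special handling, at the cost of calling first on single-valued tags.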

For this approach to be truly useful, our software would need to gracefully handle character encoding to avoid garbled strings such as "What?s".
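One possible way to handle the encoding is sketched below. It rests on two assumptions that the session above does not confirm: that the RIS bytes are ISO-8859-1 (consistent with escapes such as \355 for "í"), and that the code runs on Ruby 1.9 or later, where String#encode is available (the irb session above uses Ruby 1.8.6, which lacks it). ris_to_utf8 is a hypothetical helper name.

```ruby
# Reinterpret raw bytes as the assumed source encoding, then transcode to UTF-8.
# The ISO-8859-1 default is an assumption, not something the publisher documents here.
def ris_to_utf8(raw, source_encoding = 'ISO-8859-1')
  raw.dup.force_encoding(source_encoding).encode('UTF-8')
end

# "\xED" is the single Latin-1 byte shown as \355 in the irb output above.
raw = "AU  - Gil,Mar\xEDa Victoria".b
puts ris_to_utf8(raw)
```

Note that a "?" which was already substituted before the record left the server, as may be the case for "What?s", cannot be recovered by transcoding.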

How it Works

Our library relies on two important things being provided by the publisher: (1) a downloadable version of the RIS file for every article; and (2) a consistent way to access it across journals. By simply telling Mechanize to follow a link labeled as "Download bibliographical data", we can easily retrieve the full citation. Fortunately, nearly every scientific publisher follows this practice.
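The library builds the RIS URL by gluing "http://", the host, and the link's href together, which only works when the href is a path on the same host. A slightly more general way to resolve a link against the page it came from is the standard library's URI.join, sketched here. The href values are hypothetical placeholders, not actual Thieme paths.

```ruby
require 'uri'

# Resolve a link's href against the URL of the page that contains it.
# Handles site-relative, page-relative, and absolute hrefs alike.
def resolve_href(page_url, href)
  URI.join(page_url, href).to_s
end

page_url = 'http://www.thieme-connect.com/ejournals/abstract/synthesis/doi/10.1055/s-2007-966071'

# Hypothetical hrefs, for illustration only.
puts resolve_href(page_url, '/hypothetical/ris/download')
puts resolve_href(page_url, 'http://example.org/elsewhere.ris')
```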

Conclusions

Just a few lines of Ruby code have solved a significant scientific information management problem, at least for one journal. A complete solution to the problem would require code for every scientific journal, a task well underway at CiteULike. While nothing here can pretend to be an end-user application, it's not difficult to imagine how to build one (or a few) using these basic concepts. But that's a story for another time.

Interconvert (Almost) Any SMILES and InChI with Ruby Open Babel

Posted by Rich Apodaca Mon, 25 Jun 2007 08:45:00 GMT

SMILES and InChI are the two most widely used line notations in cheminformatics. Not surprisingly, there are many situations in which it's useful to interconvert the two. This article shows a simple method for doing so using Ruby Open Babel.

Parsing InChIs

Version 1.01 of the IUPAC/NIST C InChI toolkit introduced the ability to parse InChIs. This capability has subsequently been incorporated into Open Babel, and by extension, Ruby Open Babel. It's this capability that we'll take advantage of.

A Simple Library

The following library provides everything we need to convert between SMILES and InChI via Ruby:

require 'openbabel'

module InChI
  @@to_smiles = OpenBabel::OBConversion.new
  @@to_inchi = OpenBabel::OBConversion.new
  @@to_smiles.set_in_and_out_formats 'inchi', 'smi'
  @@to_inchi.set_in_and_out_formats 'smi', 'inchi'

  def inchi_to_smiles inchi
    mol = OpenBabel::OBMol.new

    @@to_smiles.read_string(mol, inchi) or raise "Can't parse InChI: #{inchi}."
    @@to_smiles.write_string(mol).strip
  end

  def smiles_to_inchi smiles
    mol = OpenBabel::OBMol.new

    @@to_inchi.read_string(mol, smiles) or raise "Can't parse SMILES: #{smiles}."
    @@to_inchi.write_string(mol).strip
  end
end

Testing the Library

After saving the above code to a file named inchi.rb, we can interactively convert SMILES and InChIs:

$ irb
irb(main):001:0> require 'inchi'
=> true
irb(main):002:0> include InChI
=> Object
irb(main):003:0> smiles = inchi_to_smiles "InChI=1/C14H12/c1-3-7-13(8-4-1)11-12-14-9-5-2-6-10-14/h1-12H/b12-11-"
=> "c1ccc(cc1)C(/[H])=C(/[H])c1ccccc1"
irb(main):004:0> inchi = smiles_to_inchi smiles
=> "InChI=1/C14H12/c1-3-7-13(8-4-1)11-12-14-9-5-2-6-10-14/h1-12H/b12-11-"

In the above test, the InChI for cis-stilbene is converted into a SMILES string which is then converted back to InChI form with complete fidelity, including alkene geometry. Note that this would not have been possible using the approach that was previously discussed in which molfiles were used as intermediate data structures.

What about chiral centers? Here the results are mixed. For example, when the round-trip conversion is applied to propranolol (PubChem, Video), the configuration of the stereocenter is inverted.

$ irb
irb(main):001:0> require 'inchi'
=> true
irb(main):002:0> include InChI
=> Object
irb(main):003:0> smiles = inchi_to_smiles "InChI=1/C16H21NO2/c1-12(2)17-10-14(18)11-19-16-9-5-7-13-6-3-4-8-15(13)16/h3-9,12,14,17-18H,10-11H2,1-2H3/t14-/m1/s1"
=> "CC(C)NC[C@@H](COc1cccc2ccccc12)O"
irb(main):004:0> inchi = smiles_to_inchi smiles
=> "InChI=1/C16H21NO2/c1-12(2)17-10-14(18)11-19-16-9-5-7-13-6-3-4-8-15(13)16/h3-9,12,14,17-18H,10-11H2,1-2H3/t14-/m0/s1"

However, the same round-trip conversion of 1-phenylethanol works without inversion of stereochemistry:

$ irb
irb(main):001:0> require 'inchi'
=> true
irb(main):002:0> include InChI
=> Object
irb(main):003:0> smiles = inchi_to_smiles " InChI=1/C8H10O/c1-7(9)8-5-3-2-4-6-8/h2-7,9H,1H3/t7-/m0/s1"
=> "C[C@@H](c1ccccc1)O"
irb(main):004:0> inchi = smiles_to_inchi smiles
=> "InChI=1/C8H10O/c1-7(9)8-5-3-2-4-6-8/h2-7,9H,1H3/t7-/m0/s1"

The most likely explanation is that under certain conditions, Open Babel incorrectly interprets and/or writes stereo parities.
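Spotting an inverted stereocenter by eye in a long InChI is error-prone. The sketch below compares two InChIs layer by layer and reports which layers differ. It uses plain string handling and does not touch Open Babel. The helper names are made up for illustration, and the code assumes each layer prefix (c, h, t, m, s, ...) appears at most once, which holds for the examples above but not for every InChI.

```ruby
# Split an InChI into a Hash keyed by layer prefix.
# Simplifying assumption: each prefix occurs at most once.
def inchi_layers(inchi)
  version, formula, *rest = inchi.strip.sub(/\AInChI=/, '').split('/')
  layers = { 'version' => version, 'formula' => formula }
  rest.each { |layer| layers[layer[0, 1]] = layer[1..-1] }
  layers
end

# Return the prefixes of the layers that differ between two InChIs.
def differing_layers(first, second)
  a = inchi_layers(first)
  b = inchi_layers(second)
  (a.keys | b.keys).reject { |key| a[key] == b[key] }
end

# The propranolol InChIs from the session above, before and after the round trip.
before = 'InChI=1/C16H21NO2/c1-12(2)17-10-14(18)11-19-16-9-5-7-13-6-3-4-8-15(13)16/h3-9,12,14,17-18H,10-11H2,1-2H3/t14-/m1/s1'
after  = 'InChI=1/C16H21NO2/c1-12(2)17-10-14(18)11-19-16-9-5-7-13-6-3-4-8-15(13)16/h3-9,12,14,17-18H,10-11H2,1-2H3/t14-/m0/s1'

p differing_layers(before, after)
```

For the propranolol pair, only the /m layer differs, which is the inversion described above.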

One More Gotcha

On my system (Linux Mandriva 2007.1), attempting to perform the round-trip test on glucose resulted (reproducibly) in a segfault:

$ irb
irb(main):001:0> require 'inchi'
=> true
irb(main):002:0> include InChI
=> Object
irb(main):003:0> smiles = inchi_to_smiles "InChI=1/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2/t2-,3-,4+,5-,6?/m1/s1"
=> "C([C@H]1[C@H]([C@@H]([C@H](C(O)O1)O)O)O)O"
irb(main):004:0> inchi = smiles_to_inchi smiles
./inchi.rb:20: [BUG] Segmentation fault
ruby 1.8.6 (2007-03-13) [i686-linux]

Aborted

The same segfault was obtained when using the babel command-line utility:

$ babel -ismi -oinchi
C([C@H]1[C@H]([C@@H]([C@H](C(O)O1)O)O)O)O
[Return]
Segmentation fault
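Because the command-line utility crashes in the same way, one way to keep a segfault from taking down the calling Ruby process is to run the conversion in a child process and inspect its exit status. The following is a rough sketch under assumptions: convert_in_subprocess is a hypothetical helper, and the default command simply mirrors the babel invocation shown above, reading the SMILES from standard input.

```ruby
require 'open3'

# Run a converter as a separate process so a crash there cannot kill the caller.
# The default command mirrors the babel invocation above; pass another to override.
def convert_in_subprocess(input, command = %w[babel -ismi -oinchi])
  stdout, stderr, status = Open3.capture3(*command, stdin_data: "#{input}\n")
  return stdout.strip if status.success?

  reason = status.signaled? ? "signal #{status.termsig}" : "exit status #{status.exitstatus}"
  raise "#{command.first} failed (#{reason}): #{stderr.strip}"
end

begin
  puts convert_in_subprocess('C([C@H]1[C@H]([C@@H]([C@H](C(O)O1)O)O)O)O')
rescue RuntimeError, SystemCallError => e
  # Reached when the converter crashes, exits non-zero, or is not installed.
  warn "conversion failed: #{e.message}"
end
```

The trade-off is the cost of starting a process per conversion, which may matter for large batches.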

Conclusions

As you can see, Ruby Open Babel makes short work of interconverting SMILES and InChIs. Despite problems with stereochemical configuration and segfaults on reading certain SMILES strings, the approach outlined here offers a quick and economical way to interconvert a variety of SMILES and InChIs.

Hacking CiteULike: Metascripting with Ruby and Session

Posted by Rich Apodaca Fri, 22 Jun 2007 10:08:00 GMT

CiteULike lets users easily manage their bibliographies of scholarly works, and in the process discover other users' papers on related subjects. One of the most powerful features of CiteULike is its ability to convert arbitrary URLs into fully-formatted bibliographical citations. CiteULike manages to do this while largely avoiding the Buggotea Problem in which multiple URLs pointing to the same work are saved. Wouldn't it be useful if this aspect of CiteULike could be independently scripted, tested, and re-integrated? This article describes how to do this using the powerful scripting language Ruby.

A Simple Test

The core of CiteULike's bibliography lookup system is contained in its Filters. Filters accept a URL they're interested in and return a bibliographical citation. Each filter generally works with a specific publisher's URLs and may be written in just about any scripting language.

CiteULike has released nearly all of its filters and the driver as an Open Source package distributed under a BSD-style license. Complete documentation on using and writing filters is available here, and the package can be obtained through Subversion:

$ svn co http://svn.citeulike.org/svn/ citeulike

After changing into the citeulike/drivers directory, you'll see a file called driver.tcl. This script coordinates the activities of the various filters contained under their respective language subdirectories. Let's say you want to parse the following URL:

http://pubs.acs.org/cgi-bin/abstract.cgi/jcisd8/2006/46/i03/abs/ci050400b.html

The command to do so would be:

./driver.tcl parse http://pubs.acs.org/cgi-bin/abstract.cgi/jcisd8/2006/46/i03/abs/ci050400b.html

If you get an error starting with:

couldn't execute "./acs.py": no such file or directory
    while executing
"open "|./[file tail $exe]" "r+""
    (procedure "parse_url" line 31)
    invoked from within

then the problem lies with the shebang line of the drivers/python/acs.py script. For example, on my system I need to change the shebang to:

#!/usr/bin/python2.5

Making this change and re-running the driver script gives the output I was expecting:

parsing http://pubs.acs.org/cgi-bin/abstract.cgi/jcisd8/2006/46/i03/abs/ci050400b.html

serial -> 1549-9596
volume -> 46
linkouts -> {DOI {} 10.1021/ci050400b {} {}}
year -> 2006
type -> JOUR
start_page -> 991
url -> http://pubs3.acs.org/acs/journals/doilookup?in_doi=10.1021/ci050400b
end_page -> 998
plugin_version -> 1
doi -> 10.1021/ci050400b
day -> 22
issue -> 3
title -> The Blue Obelisk-Interoperability in Chemical Informatics
journal -> J. Chem. Inf. Model.
abstract -> Abstract: The Blue Obelisk Movement (http://www.blueobelisk.org/) is the name used by a diverse Internet group promoting reusable chemistry via open source software development, consistent and complimentary chemoinformatics research, open data, and open standards. We outline recent examples of cooperation in the Blue Obelisk group: a shared dictionary of algorithms and implementations in chemoinformatics algorithms drawing from our various software projects; a shared repository of chemoinformatics data including elemental properties, atomic radii, isotopes, atom typing rules, and so forth; and Web services for the platform-independent use of chemoinformatics programs.
status -> ok
month -> 5
authors -> {Guha {} R {Guha, R.}} {Howard {} MT {Howard, M.T.}} {Hutchison {} GR {Hutchison, G.R.}} {Murray-Rust {} P {Murray-Rust, P.}} {Rzepa {} H {Rzepa, H.}} {Steinbeck {} C {Steinbeck, C.}} {Wegner {} J {Wegner, J.}} {Willighagen {} EL {Willighagen, E.L.}}
address -> Pennsylvania State University, University Park, Pennsylvania 16804-3000, Jmol Project, U. S. A., Cornell University, Ithaca, New York 14853, Cambridge University, Cambridge CB2 1TN, Great Britain, Imperial College, London SW7 2AZ, Great Britain, Cologne University Bioinformatics Center (CUBIC), Zülpicher Str. 47, D-50674 Köln, Germany, University of Tübingen, Tübingen, Germany, and Jmol project, The Netherlands
plugin -> acs

Metascripting with Ruby and Session

The CiteULike driver is written in Tcl, a language I've heard about and been interested in, but which I just don't have the time to learn. Wouldn't it be great if we could direct the activities of the CiteULike driver from the comfort and power of Ruby?

It turns out that a handy little Ruby library exists which is perfect for the metascripting we'll need to do - Session. The Session library can be installed with:

# gem install session

Once installed, we can fire up interactive ruby (irb), and tell driver.tcl what to do:

$ irb
irb(main):001:0> require 'rubygems'
=> true
irb(main):002:0> require 'session'
=> true
irb(main):003:0> url = 'http://pubs.acs.org/cgi-bin/abstract.cgi/jcisd8/2006/46/i03/abs/ci050400b.html'
=> "http://pubs.acs.org/cgi-bin/abstract.cgi/jcisd8/2006/46/i03/abs/ci050400b.html"
irb(main):004:0> session = Session.new
=> #, @threads=[], @history=nil, @stdin=#, @use_open3=nil, @opts={}, @errproc=nil, @use_spawn=nil, @debug=nil, @stderr=#, @outproc=nil, @track_history=nil, @prog="sh">
irb(main):005:0> result=session.execute "./driver.tcl parse #{url}"

Reprocessing the Bibliography

The last command of our interactive ruby session returns an Array called "result", the first element of which is our article's bibliographical information. We can extract its title with the following commands:

irb(main):011:0> result[0].match /title -> (.*)/
=> #
irb(main):012:0> $1
=> "The Blue Obelisk-Interoperability in Chemical Informatics"

Using a series of similar regular expressions, we can re-construct the full bibliographical citation for the paper.
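Rather than writing one regular expression per field, the "key -> value" lines can be collected into a Hash in a single pass. This is a minimal sketch: the helper names and the citation layout are illustrative choices, and the sample text is abridged from the driver output shown earlier.

```ruby
# Collect every "key -> value" line of the driver output into a Hash.
# Lines without that shape (such as the leading "parsing ..." line) are skipped.
def parse_driver_output(text)
  text.each_line.each_with_object({}) do |line, fields|
    match = line.match(/\A(\w+) -> (.*)/)
    fields[match[1]] = match[2].strip if match
  end
end

# One possible citation layout, chosen only for illustration.
def format_citation(fields)
  pages = "#{fields['start_page']}-#{fields['end_page']}"
  "#{fields['title']}. #{fields['journal']} #{fields['year']}, #{fields['volume']}, #{pages}."
end

# Abridged from the driver output above.
driver_output = <<OUTPUT
parsing http://pubs.acs.org/cgi-bin/abstract.cgi/jcisd8/2006/46/i03/abs/ci050400b.html

volume -> 46
year -> 2006
start_page -> 991
end_page -> 998
title -> The Blue Obelisk-Interoperability in Chemical Informatics
journal -> J. Chem. Inf. Model.
OUTPUT

puts format_citation(parse_driver_output(driver_output))
```

In the irb session above, the same text would come from result[0].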

Conclusions

The availability of the CiteULike filters and driver opens up many possibilities to build collaborative bibliographical management applications. By using some simple metascripting techniques, this can be done in any scripting language. Our little example here is but a glimpse of what might be possible.

Open Notebook Science Using InChIMatic

Posted by Rich Apodaca Thu, 21 Jun 2007 10:27:00 GMT

Have you ever wanted to find a molecule on the Web using your favorite search engine in combination with a 2-D structure editor? InChIMatic is a service that lets you do just that. In this article, I'll show how InChIMatic can be used to look up molecules in the UsefulChem-Molecules blog.

For those who aren't familiar with it, UsefulChem-Molecules is a blog operated by Jean-Claude Bradley's research group at Drexel University that publicly archives molecules of interest. Each entry is a single molecule that may be linked to other Web resources.

Let's say you wanted to look up dithranol. This can be done by simply pointing your browser to inchimatic.com and drawing the structure:

When you're finished, select your search engine of choice (we'll use Google here) and press "Search". You'll be taken to the familiar results page. The second result links to the UsefulChem-Molecules entry for dithranol.

In performing this simple workflow, I noticed areas for improvement in both UsefulChem and InChIMatic:

  • UsefulChem: If you look at the entry for dithranol, you'll notice there are no linkouts. In essence, the entry is a bookmark without context. Although it's useful to know that the Bradley group is interested in this molecule, it would be even more interesting to know in what context. Each entry should contain at least one link giving the molecule a context.

  • InChIMatic: Using the back button on the Google results page takes you back to InChIMatic, but your molecule is gone. If you wanted to look for a series of related molecules, you couldn't edit your existing structure. As Firefly 1.0 nears completion, a top priority will be to incorporate it into InChIMatic and fix the back-button problem.

As you can see, between InChIMatic and UsefulChem-Molecules, we have the makings of a crude laboratory information management system. The problem is we're trying to use existing tools (search engines and blogs) for purposes they are ill-suited for. It can work, but it could also work much better.

What chemistry really needs is open, user-friendly systems specifically designed to archive and search chemical information of the type maintained by the Bradley group. But that's a story for another time.
