The Best API May Be No API At All - PubChem and PDB

Both PubChem and the Protein Data Bank (PDB) maintain vast collections of molecular data. Individual users are free to view and search these collections via standard Web browsers. But what are the options if you're developing software to interact with these databases?

Various application programming interfaces (APIs) are available for accessing PubChem and PDB records. For example, PubChem recently introduced its Power User Gateway (PUG), an XML-based query language. But writing APIs is extremely difficult; reconciling the need for simplicity with the need for rich functionality is a tough balancing act. Where do you draw the line?

Recently, Bosco described a remarkably short method to retrieve PDB records using nothing more than standard Python. Given the similarities between Python and Ruby, it seemed reasonable that his method could be adapted to Ruby.

The following Ruby library accepts a PDB identifier and returns the corresponding PDB record:

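require 'net/http'

module PDB
  # Returns the PDB record for the given id as a string.
  # NOTE: the host argument is empty in this listing; supply the PDB server's host name.
  def self.get_record(id)
    Net::HTTP.get_response('', "/pdb/files/#{id}.pdb").body
  end
end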

Notice how the business end of this library is nothing more than a single line of Ruby code.
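
The one-liner hands back whatever body the server sends, error pages included. As a minimal sketch of a slightly more defensive variant, the code below checks the response status before returning the body. The module name PDBSafe, the injectable http parameter, and the placeholder host are assumptions made for illustration; they are not part of the library above.

```ruby
require 'net/http'

module PDBSafe
  # Placeholder host (assumption): the listing above leaves the host empty.
  HOST = 'pdb.example.org'

  # Builds the same request path the one-line library uses.
  def self.record_path(id)
    "/pdb/files/#{id}.pdb"
  end

  # Returns the record body, or raises if the server did not answer 2xx.
  # `http` defaults to Net::HTTP but can be swapped for a stand-in,
  # which lets the method be exercised without a network connection.
  def self.get_record(id, http = Net::HTTP)
    response = http.get_response(HOST, record_path(id))
    unless response.is_a?(Net::HTTPSuccess)
      raise "PDB request for #{id} failed with status #{response.code}"
    end
    response.body
  end
end
```

The extra lines buy an explicit failure instead of a silently returned error page, at the cost of the one-line charm.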

The library can be tested by saving it in a file called pdb.rb and starting interactive Ruby (irb) from the same directory:

irb(main):001:0> require './pdb'
=> true
irb(main):002:0> puts PDB::get_record('1hpn')
HEADER    GLYCOSAMINOGLYCAN                       17-JAN-95   1HPN


Several months ago, a D-F article described a related but somewhat lengthier approach to retrieving PubChem molfiles. Using the same approach we used for PDB, we can create the world's shortest PubChem library:

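require 'net/http'

module PubChem
  # Returns a molfile (as a string) for the given PubChem CID.
  # NOTE: the host argument is empty in this listing; supply the PubChem server's host name.
  def self.get_molfile(cid)
    Net::HTTP.get_response('', "/summary/summary.cgi?cid=#{cid}&disopt=DisplaySDF").body
  end
end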

This library can be tested by saving it in a file called pubchem.rb and starting irb from the same directory:

irb(main):001:0> require './pubchem'
=> true
irb(main):002:0> puts PubChem::get_molfile('969472') #eszopiclone (Lunesta)

 44 47  0     1  0  0  0  0  0999 V2000
    9.2619   -2.2732    0.0000 Cl  0  0  0  0  0  0  0  0  0  0  0  0


Both of these Ruby libraries leverage one of the most versatile and robust protocols ever developed: plain old http. The last few years have witnessed a renaissance in using bare http as a platform for building simplified yet powerful Web APIs with less software. Referred to as REST, the approach has gained traction partly in response to the wasteful complexities introduced by various XML-based approaches. Although slow to catch on in cheminformatics, REST has enormous potential in unifying a diverse array of isolated database systems.
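
Seen through a REST lens, each library is little more than a mapping from an identifier to a resource URL. The sketch below makes that mapping explicit using Ruby's standard uri library. The Resources module and both host names are placeholders assumed for illustration, since the listings above leave the host empty; only the paths and query parameters come from the libraries themselves.

```ruby
require 'uri'

module Resources
  # Placeholder hosts (assumptions): the real host names are not given above.
  PDB_HOST     = 'pdb.example.org'
  PUBCHEM_HOST = 'pubchem.example.org'

  # URL of the PDB record for the given id.
  def self.pdb_record(id)
    URI::HTTP.build(host: PDB_HOST, path: "/pdb/files/#{id}.pdb")
  end

  # URL of the molfile for the given PubChem CID.
  def self.pubchem_molfile(cid)
    query = URI.encode_www_form(cid: cid, disopt: 'DisplaySDF')
    URI::HTTP.build(host: PUBCHEM_HOST, path: '/summary/summary.cgi', query: query)
  end
end

puts Resources.pdb_record('1hpn')
puts Resources.pubchem_molfile('969472')
```

Because Net::HTTP.get_response also accepts a URI object, either URL could be passed to it directly once a real host is filled in.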

One limitation of the approach described here is that the PubChem (or PDB) folks may get upset if you use it a lot. For example, if you examine the PubChem robots.txt file, you'll notice that access to the summary.cgi resource, which our library makes use of, is prohibited to robots:


User-agent: *
Disallow: /summary/summary.cgi

What makes a "robot", and does your software qualify for exclusion? The answer is not entirely clear-cut, especially in the era of browser-side scripting.
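
Whether or not your software counts as a robot, it can at least find out what the rules say. The following is a minimal sketch of such a check, assuming a hypothetical helper named disallowed? and a deliberately simplified reading of robots.txt: prefix matching on Disallow lines only, with no support for Allow rules or wildcards in paths.

```ruby
# Returns true if `path` falls under a Disallow rule that applies to
# `agent` in the given robots.txt text. Simplified on purpose: prefix
# matching only, Allow rules and path wildcards are ignored.
def disallowed?(robots_txt, path, agent = '*')
  applies   = false  # does the current group apply to this agent?
  in_agents = false  # was the previous rule line a User-agent line?

  robots_txt.each_line do |line|
    line = line.sub(/#.*/, '').strip          # drop comments and whitespace
    next if line.empty?
    field, value = line.split(':', 2).map(&:strip)
    next if value.nil?                        # not a "field: value" line

    if field.downcase == 'user-agent'
      applies = false unless in_agents        # a new group starts here
      applies ||= (value == '*' || value.downcase == agent.downcase)
      in_agents = true
    else
      in_agents = false
      if field.downcase == 'disallow' && applies && !value.empty?
        return true if path.start_with?(value)
      end
    end
  end
  false
end

robots = "User-agent: *\n\nDisallow: /summary/summary.cgi\n"
puts disallowed?(robots, '/summary/summary.cgi?cid=969472&disopt=DisplaySDF')
puts disallowed?(robots, '/pdb/files/1hpn.pdb')
```

Applied to the two rules quoted above, the check reports the summary.cgi path used by the PubChem library as disallowed and the PDB path as not covered.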

Regardless, it looks like PubChem's policy was put in place in 2004, long before PubChem had experience with usage patterns for its service. It may be that this restriction could be relaxed without adversely affecting PubChem's ability to operate efficiently. It may even be possible to offer a low-level http retrieval method alongside PubChem's PUG interface on a machine dedicated to automated queries (i.e., Entrez eUtils).
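
Until the policy is clarified, one way to avoid using the service "a lot" is to space requests out on the client side. What follows is a minimal sketch, assuming a hypothetical Throttle class; the interval is whatever you choose, as no specific limit is stated here for either PubChem or PDB.

```ruby
# Runs blocks no more often than once every `min_interval` seconds.
# The clock and sleeper are injectable (an assumption made so the class
# can be exercised without real waiting); by default they use the
# monotonic system clock and Kernel#sleep.
class Throttle
  def initialize(min_interval,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) },
                 sleeper: ->(seconds) { sleep(seconds) })
    @min_interval = min_interval
    @clock = clock
    @sleeper = sleeper
    @last = nil
  end

  # Waits out whatever remains of the interval since the previous call,
  # then runs the block and returns its value.
  def call
    if @last
      wait = @min_interval - (@clock.call - @last)
      @sleeper.call(wait) if wait > 0
    end
    @last = @clock.call
    yield
  end
end
```

A client would wrap each retrieval, for example throttle.call { PubChem.get_molfile(cid) }, so that a loop over many identifiers never issues requests faster than the chosen interval.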

As developers, our mission is to deliver functionality, not to write software. We should extract every possible ounce of value from established protocols and APIs before writing a single line of additional code. REST, and the creative use of good old http, are powerful tools to do so.