The Best API May Be No API At All: PubChem and PDB

Posted by Rich Apodaca Mon, 13 Aug 2007 11:55:00 GMT

Both PubChem and the Protein Data Bank (PDB) maintain vast collections of molecular data. Individual users are free to view and search these collections via standard Web browsers. But what are the options if you're developing software to interact with these databases?

Various application programming interfaces (APIs) are available for accessing PubChem and PDB records. For example, PubChem recently introduced its Power User Gateway (PUG), an XML-based query language. But writing APIs is extremely difficult; reconciling the need for simplicity with the need for rich functionality is a tough balancing act. Where do you draw the line?

Recently, Bosco described a remarkably short method to retrieve PDB records using nothing more than standard Python. Given the similarities between Python and Ruby, it seemed reasonable that his method could be adapted to Ruby.

The following Ruby library accepts a PDB identifier and returns the corresponding PDB record:

require 'net/http'

module PDB
  # Returns a PDB record for the given id
  def self.get_record id
    Net::HTTP.get_response('www.rcsb.org', "/pdb/files/#{id}.pdb").body
  end
end

Notice how the business end of this library is nothing more than a single line of Ruby code. The library can be tested by saving it in a file called pdb.rb and invoking interactive Ruby (irb):

$ irb
irb(main):001:0> require 'pdb'
=> true
irb(main):002:0> puts PDB::get_record('1hpn')
HEADER    GLYCOSAMINOGLYCAN                       17-JAN-95   1HPN
TITLE     N.M.R. AND MOLECULAR-MODELLING STUDIES OF THE SOLUTION
TITLE    2 CONFORMATION OF HEPARIN

[truncated]
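
Once the record is in hand, the same "less is more" attitude works for reading it. The following is only a sketch: it assumes the HEADER column layout given in the PDB file format description (classification in columns 11-50, deposition date in 51-59, ID code in 63-66), and the PDBHeader module name is made up for this example.

```ruby
# Sketch: pull the fields out of a PDB HEADER line by column position.
# The column ranges are an assumption based on the PDB format description.
module PDBHeader
  # Returns a hash with the classification, deposition date, and ID code,
  # or nil if the record contains no HEADER line.
  def self.parse(record)
    line = record.each_line.find { |l| l.start_with?('HEADER') }
    return nil unless line

    line = line.chomp.ljust(66) # pad short lines so the slices never return nil
    {
      classification: line[10, 40].strip, # columns 11-50
      deposited:      line[50, 9].strip,  # columns 51-59
      id:             line[62, 4].strip   # columns 63-66
    }
  end
end

# A HEADER line laid out like the one returned for 1HPN
sample = 'HEADER    ' + 'GLYCOSAMINOGLYCAN'.ljust(40) + '17-JAN-95' + '   ' + '1HPN'
p PDBHeader.parse(sample)
```

In practice you would pass the string returned by PDB.get_record straight to PDBHeader.parse.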

Several months ago, a D-F article described a related, but somewhat lengthier approach to retrieving PubChem molfiles. Using the same approach we used for PDB, we can create the world's shortest PubChem library:

require 'net/http'

module PubChem
  # Returns a molfile for the given PubChem CID
  def self.get_molfile cid
    Net::HTTP.get_response('pubchem.ncbi.nlm.nih.gov', "/summary/summary.cgi?cid=#{cid}&disopt=DisplaySDF").body
  end
end

This library can be tested by saving it in a file called pubchem.rb and then running irb:

$ irb
irb(main):001:0> require 'pubchem'
=> true
irb(main):002:0> puts PubChem::get_molfile('969472') #eszopiclone (Lunesta)
969472
  -OEChem-08130700422D

 44 47  0     1  0  0  0  0  0999 V2000
    9.2619   -2.2732    0.0000 Cl  0  0  0  0  0  0  0  0  0  0  0  0

[truncated]
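
As a quick sanity check on what came back, we can read the atom and bond counts from the molfile. This sketch assumes a V2000 molfile, where the counts line is the fourth line and the first two three-character fields hold the atom and bond counts. The Molfile module name is invented for the example.

```ruby
# Sketch: read the atom and bond counts from a V2000 molfile.
# Assumes the counts line is line 4 and uses fixed three-character fields.
module Molfile
  # Returns [atom_count, bond_count].
  def self.counts(molfile)
    counts_line = molfile.lines[3]
    raise ArgumentError, 'molfile has no counts line' unless counts_line

    [counts_line[0, 3].to_i, counts_line[3, 3].to_i]
  end
end

# The first four lines of the eszopiclone molfile shown above
sample = [
  '969472',
  '  -OEChem-08130700422D',
  '',
  ' 44 47  0     1  0  0  0  0  0999 V2000'
].join("\n")

p Molfile.counts(sample) # => [44, 47]
```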

Both of these Ruby libraries leverage one of the most versatile and robust protocols ever developed: plain old http. The last few years have witnessed a renaissance in using bare http as a platform for building simplified yet powerful Web APIs with less software. Referred to as REST, the approach has gained traction partly in response to the wasteful complexities introduced by various XML-based approaches. Although slow to catch on in cheminformatics, REST has enormous potential in unifying a diverse array of isolated database systems.

One limitation of the approach described here is that the PubChem (or PDB) folks may get upset if you use it a lot. For example, if you examine the PubChem robots.txt file, you'll notice that access to the summary.cgi resource, which our library makes use of, is prohibited to robots:

...

User-agent: *

...
Disallow: /summary/summary.cgi
...
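
A client that wants to be polite can check a rule like this before making a request. The sketch below is deliberately simplified: it understands only User-agent and Disallow lines and treats each Disallow value as a path prefix, so it is not a full implementation of the robots exclusion rules. The Robots module and its method names are hypothetical.

```ruby
# Sketch: a minimal robots.txt check (User-agent and Disallow lines only).
module Robots
  # Returns the Disallow path prefixes that apply to the given user agent.
  def self.disallowed(robots_txt, agent = '*')
    rules = []
    applies = false
    in_agents = false # true while reading consecutive User-agent lines
    robots_txt.each_line do |raw|
      line = raw.sub(/#.*/, '').strip
      next if line.empty?

      field, value = line.split(':', 2).map(&:strip)
      next if value.nil? # not a "field: value" line

      case field.downcase
      when 'user-agent'
        applies = false unless in_agents # a new group starts here
        applies ||= (value == agent)
        in_agents = true
      when 'disallow'
        in_agents = false
        rules << value if applies && !value.empty?
      else
        in_agents = false
      end
    end
    rules
  end

  # True if no Disallow prefix for the agent matches the path.
  def self.allowed?(robots_txt, path, agent = '*')
    disallowed(robots_txt, agent).none? { |prefix| path.start_with?(prefix) }
  end
end

rules = "User-agent: *\nDisallow: /summary/summary.cgi\n"
puts Robots.allowed?(rules, '/summary/summary.cgi?cid=969472&disopt=DisplaySDF') # => false
```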

What makes a "robot" and does your software qualify for exclusion? The answer is not entirely clear-cut, especially in the era of browser-side scripting.

Regardless, it looks like PubChem's policy was put in place in 2004, long before PubChem had experience with usage patterns for its service. It may be that this restriction could be relaxed without adversely affecting PubChem's ability to operate efficiently. It may even be possible to offer a low-level http retrieval method alongside PubChem's PUG interface on a machine dedicated to automated queries (i.e., Entrez eUtils).

As developers, our mission is to deliver functionality, not to write software. We should extract every possible ounce of value from established protocols and APIs before writing a single line of additional code. REST, and the creative use of good old http, are powerful tools to do so.

Image Credit: Dru!

Hacking PubChem: Learning to Speak PUG

Posted by Rich Apodaca Mon, 11 Jun 2007 13:04:00 GMT

A previous article introduced PubChem's Power User Gateway (PUG), an XML-based communication channel. Although NIH kindly supplies a commented schema for PUG queries and responses, there's nothing like seeing real examples when learning a new language. This article will describe one method for conveniently generating PUG XML queries.

Let PubChem Build Your Query

One of the options on the PubChem search page is "Save Query." As it turns out, PubChem saves queries in PUG XML (I'll just call it PUGML). In other words, preparing a query using the PubChem search page and saving it gives a simple method for creating PUGML queries. Let's try it.

Using the "Sketch" button, draw the structure of benzimidazole. Under "Search Type", select "Substructure." Now click "Save Query", and you'll download a substructure query for benzimidazole in PUGML:

<?xml version="1.0"?>
<!DOCTYPE PCT-Data PUBLIC "-//NCBI//NCBI PCTools/EN" "http://pubchem.ncbi.nlm.nih.gov/pug/pug.dtd">
<PCT-Data>
  <PCT-Data_input>
    <PCT-InputData>
      <PCT-InputData_query>
        <PCT-Query>
          <PCT-Query_type>
            <PCT-QueryType>
              <PCT-QueryType_css>
                <PCT-QueryCompoundCS>
                  <PCT-QueryCompoundCS_query>
                    <PCT-QueryCompoundCS_query_data>C1=CC=CC2=C1N=C[N]2</PCT-QueryCompoundCS_query_data>
                  </PCT-QueryCompoundCS_query>
                  <PCT-QueryCompoundCS_type>
                    <PCT-QueryCompoundCS_type_subss>
                      <PCT-CSStructure>
                        <PCT-CSStructure_bonds value="true"/>
                      </PCT-CSStructure>
                    </PCT-QueryCompoundCS_type_subss>
                  </PCT-QueryCompoundCS_type>
                  <PCT-QueryCompoundCS_results>2000000</PCT-QueryCompoundCS_results>
                </PCT-QueryCompoundCS>
              </PCT-QueryType_css>
            </PCT-QueryType>
          </PCT-Query_type>
        </PCT-Query>
      </PCT-InputData_query>
    </PCT-InputData>
  </PCT-Data_input>
</PCT-Data>

The PCT-QueryCompoundCS_type_subss element will tell PUG to look for substructures.
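
Since only the SMILES string and the result limit change from one substructure query to the next, the saved query can serve as a template. The sketch below regenerates the same element nesting from a SMILES string. The element names are copied from the saved query; the PugQuery module and its method are invented here, and the output omits the XML declaration and DOCTYPE on the assumption that PUG does not require them.

```ruby
# Sketch: rebuild the saved substructure query for an arbitrary SMILES string.
module PugQuery
  # Wrapper elements from the outermost down to the compound query,
  # copied from the saved query.
  WRAPPERS = %w[
    PCT-Data PCT-Data_input PCT-InputData PCT-InputData_query PCT-Query
    PCT-Query_type PCT-QueryType PCT-QueryType_css PCT-QueryCompoundCS
  ].freeze

  # Escapes the characters that are not allowed in XML text.
  def self.escape(text)
    text.gsub('&', '&amp;').gsub('<', '&lt;').gsub('>', '&gt;')
  end

  # Returns the query XML as a single-line string.
  def self.substructure(smiles, max_results = 2_000_000)
    inner = [
      '<PCT-QueryCompoundCS_query>',
      "<PCT-QueryCompoundCS_query_data>#{escape(smiles)}</PCT-QueryCompoundCS_query_data>",
      '</PCT-QueryCompoundCS_query>',
      '<PCT-QueryCompoundCS_type><PCT-QueryCompoundCS_type_subss><PCT-CSStructure>',
      '<PCT-CSStructure_bonds value="true"/>',
      '</PCT-CSStructure></PCT-QueryCompoundCS_type_subss></PCT-QueryCompoundCS_type>',
      "<PCT-QueryCompoundCS_results>#{max_results}</PCT-QueryCompoundCS_results>"
    ].join
    # Wrap from the innermost element outward
    WRAPPERS.reverse.inject(inner) { |xml, name| "<#{name}>#{xml}</#{name}>" }
  end
end

puts PugQuery.substructure('C1=CC=CC2=C1N=C[N]2')
```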

Using the Saved Query with PUG

Saving this file as benzimidazole_sss.xml lets us feed it to PUG:

$ curl -d @benzimidazole_sss.xml "http://pubchem.ncbi.nlm.nih.gov/pug/pug.cgi"

and get the following PUGML response:

<?xml version="1.0"?>
<!DOCTYPE PCT-Data PUBLIC "-//NCBI//NCBI PCTools/EN" "http://pubchem.ncbi.nlm.nih.gov/pug/pug.dtd">
<PCT-Data>
  <PCT-Data_output>
    <PCT-OutputData>
      <PCT-OutputData_status>
        <PCT-Status-Message>
          <PCT-Status-Message_status>
            <PCT-Status value="queued"/>
          </PCT-Status-Message_status>
        </PCT-Status-Message>
      </PCT-OutputData_status>
      <PCT-OutputData_output>
        <PCT-OutputData_output_waiting>
          <PCT-Waiting>
            <PCT-Waiting_reqid>62668946396085905</PCT-Waiting_reqid>
            <PCT-Waiting_message>Structure search job was submitted</PCT-Waiting_message>
          </PCT-Waiting>
        </PCT-OutputData_output_waiting>
      </PCT-OutputData_output>
    </PCT-OutputData>
  </PCT-Data_output>
</PCT-Data>
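
The two pieces of this response that matter are the status value and the request ID. Here is one way to pull them out in Ruby, using REXML, which ships with Ruby (as a bundled gem in recent versions). The PugResponse module is a name made up for this sketch, and the sample is a trimmed copy of the response above.

```ruby
require 'rexml/document'

# Sketch: extract the status and request ID from a PUG response.
module PugResponse
  # Returns { status: ..., reqid: ... }; either value is nil if the
  # corresponding element is absent.
  def self.parse(xml)
    doc = REXML::Document.new(xml)
    status = REXML::XPath.first(doc, '//PCT-Status')
    reqid  = REXML::XPath.first(doc, '//PCT-Waiting_reqid')
    {
      status: status && status.attributes['value'],
      reqid:  reqid && reqid.text
    }
  end
end

sample = <<~XML
  <PCT-Data><PCT-Data_output><PCT-OutputData>
    <PCT-OutputData_status><PCT-Status-Message><PCT-Status-Message_status>
      <PCT-Status value="queued"/>
    </PCT-Status-Message_status></PCT-Status-Message></PCT-OutputData_status>
    <PCT-OutputData_output><PCT-OutputData_output_waiting><PCT-Waiting>
      <PCT-Waiting_reqid>62668946396085905</PCT-Waiting_reqid>
    </PCT-Waiting></PCT-OutputData_output_waiting></PCT-OutputData_output>
  </PCT-OutputData></PCT-Data_output></PCT-Data>
XML

p PugResponse.parse(sample)
```
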

We can then check on the status of our query by saving the following as status.xml:

<PCT-Data>
  <PCT-Data_input>
    <PCT-InputData>
      <PCT-InputData_request>
        <PCT-Request>
          <PCT-Request_reqid>62668946396085905</PCT-Request_reqid>
          <PCT-Request_type value="status"/>
        </PCT-Request>
      </PCT-InputData_request>
    </PCT-InputData>
  </PCT-Data_input>
</PCT-Data>

POSTing this to PUG:

$ curl -d @status.xml "http://pubchem.ncbi.nlm.nih.gov/pug/pug.cgi"

gives us the following PUGML:

<?xml version="1.0"?>
<!DOCTYPE PCT-Data PUBLIC "-//NCBI//NCBI PCTools/EN" "http://pubchem.ncbi.nlm.nih.gov/pug/pug.dtd">
<PCT-Data>
  <PCT-Data_output>
    <PCT-OutputData>
      <PCT-OutputData_status>
        <PCT-Status-Message>
          <PCT-Status-Message_status>
            <PCT-Status value="success"/>
          </PCT-Status-Message_status>
          <PCT-Status-Message_message>Your search has already been completed successfully!.</PCT-Status-Message_message>
        </PCT-Status-Message>
      </PCT-OutputData_status>
      <PCT-OutputData_output>
        <PCT-OutputData_output_entrez>
          <PCT-Entrez>
            <PCT-Entrez_db>pccompound</PCT-Entrez_db>
            <PCT-Entrez_query-key>1</PCT-Entrez_query-key>
            <PCT-Entrez_webenv>0CPrI_peUmUtWDooyjxpJ1XAXPcOl-ESZZxj8sJV9ZDR8musMjh1oBTib@1EDD43FA66AE1BE0_0001SID</PCT-Entrez_webenv>
          </PCT-Entrez>
        </PCT-OutputData_output_entrez>
      </PCT-OutputData_output>
    </PCT-OutputData>
  </PCT-Data_output>
</PCT-Data>

Last time, we got a URL to download a gzipped SD File. This time, our query specified results to be returned as an Entrez Key through the PCT-Entrez_webenv element. We can construct a URL that will let us view these results:

http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=HistorySearch&WebEnvRq=1&db=pccompound&query_key=1&WebEnv=0CPrI_peUmUtWDooyjxpJ1XAXPcOl-ESZZxj8sJV9ZDR8musMjh1oBTib%401EDD43FA66AE1BE0_0001SID
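
Note that the "@" in the WebEnv value has to be escaped as %40. Rather than assembling the URL by hand, we can let Ruby's URI library do the encoding. This is a sketch: the entrez_history_url method name is invented, and the parameter names are simply copied from the URL above.

```ruby
require 'uri'

# Sketch: build the Entrez history URL from the values in the PUG response.
def entrez_history_url(webenv, query_key, db = 'pccompound')
  query = URI.encode_www_form([
    ['cmd', 'HistorySearch'],
    ['WebEnvRq', '1'],
    ['db', db],
    ['query_key', query_key.to_s],
    ['WebEnv', webenv] # "@" is percent-encoded here
  ])
  "http://www.ncbi.nlm.nih.gov/sites/entrez?#{query}"
end

webenv = '0CPrI_peUmUtWDooyjxpJ1XAXPcOl-ESZZxj8sJV9ZDR8musMjh1oBTib@1EDD43FA66AE1BE0_0001SID'
puts entrez_history_url(webenv, 1)
```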

Where to Next?

If we wanted to get a gzipped SD File instead, we'd need to edit our original query. But manually editing XML is a lot like mowing a lawn with scissors. What we'd really like is a simple API in a language like Ruby that will let us build sophisticated PUG queries, process the results, and pipe them into other queries with little effort. But that's a story for another time.

Image Credit: sutterbabe68

Hacking PubChem: Power User Gateway

Posted by Rich Apodaca Mon, 04 Jun 2007 11:06:00 GMT

If you've been waiting for a simple way to programmatically query PubChem without screen scraping, the wait is over. An (apparently) new service called the Power User Gateway (PUG) now offers a direct, XML-based PubChem data channel.

See PUG

Previous articles have discussed various methods for hacking PubChem: screen scraping, using the Entrez Utilities, and simply replicating the database. PUG is different in that it is both very simple and apparently quite powerful.

From the PUG documentation:

... There is a single CGI (pug.cgi, referred to hereafter as simply PUG) that is the central gateway to multiple PubChem functions. PUG takes no URL arguments; all communication with PUG is done by XML. To perform any request, you will formulate your input in XML and then HTTP POST it to PUG. The CGI interprets your incoming request, initiates the appropriate action, then returns results (also) in XML format. ...

See PUG Run

Let's perform a simple query using PUG. As the documentation states, all communication with PUG is done through HTTP POST. In contrast to other approaches to interfacing with PubChem, parameters and results are encoded in raw XML, the schema for which is available here. To use PUG, your first step is to locate software capable of sending this form of HTTP request.

cURL is such a utility. Among many capabilities, cURL offers a quick and easy way to POST XML to a server and view the response. For example, to POST the file called foo.xml to PUG, the command would be:

$ curl -d @foo.xml "http://pubchem.ncbi.nlm.nih.gov/pug/pug.cgi"
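
The same POST can be made from Ruby with nothing but the standard library. The following is a sketch rather than a tested client: the method names are invented, the URL is the one used with cURL above, and the text/xml content type is an assumption (curl -d sends a form-encoded content type by default).

```ruby
require 'net/http'
require 'uri'

# The PUG endpoint used in the cURL command
PUG_URI = URI('http://pubchem.ncbi.nlm.nih.gov/pug/pug.cgi')

# Builds the POST request without sending it.
def build_pug_request(xml)
  request = Net::HTTP::Post.new(PUG_URI.path)
  request.body = xml
  request.content_type = 'text/xml' # assumption, see the note above
  request
end

# Sends the XML to PUG and returns the response body.
def post_to_pug(xml)
  Net::HTTP.start(PUG_URI.host, PUG_URI.port) do |http|
    http.request(build_pug_request(xml)).body
  end
end

# Usage (requires network access):
#   puts post_to_pug(File.read('foo.xml'))
```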

Our query will request PubChem's first fifty Compounds in sdf.gz format.

<PCT-Data>
  <PCT-Data_input>
    <PCT-InputData>
      <PCT-InputData_download>
        <PCT-Download>
          <PCT-Download_uids>
            <PCT-QueryUids>
              <PCT-QueryUids_ids>
                <PCT-ID-List>
                  <PCT-ID-List_db>pccompound</PCT-ID-List_db>
                  <PCT-ID-List_uids>
                    <PCT-ID-List_uids_E>1</PCT-ID-List_uids_E>
                    <PCT-ID-List_uids_E>50</PCT-ID-List_uids_E>
                  </PCT-ID-List_uids>
                </PCT-ID-List>
              </PCT-QueryUids_ids>
            </PCT-QueryUids>
          </PCT-Download_uids>
          <PCT-Download_format value="sdf"/>
          <PCT-Download_compression value="gzip"/>
        </PCT-Download>
      </PCT-InputData_download>
    </PCT-InputData>
  </PCT-Data_input>
</PCT-Data>

After saving this file as pugtest.xml, we can POST it to PUG using cURL:

$ curl -d @pugtest.xml "http://pubchem.ncbi.nlm.nih.gov/pug/pug.cgi"
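
Typing out one PCT-ID-List_uids_E element per compound gets old quickly, so it's tempting to generate the download query instead. This sketch assumes each CID goes in its own PCT-ID-List_uids_E element, as in the query above. The PugDownload module name is made up for the example.

```ruby
# Sketch: generate a PUG download query for a list of CIDs.
module PugDownload
  # cids may be any enumerable of integers or numeric strings.
  def self.query(cids, format = 'sdf', compression = 'gzip')
    # Integer() raises on anything that is not a whole number
    uids = cids.map { |cid| "<PCT-ID-List_uids_E>#{Integer(cid)}</PCT-ID-List_uids_E>" }.join
    <<~XML
      <PCT-Data><PCT-Data_input><PCT-InputData><PCT-InputData_download><PCT-Download>
        <PCT-Download_uids><PCT-QueryUids><PCT-QueryUids_ids><PCT-ID-List>
          <PCT-ID-List_db>pccompound</PCT-ID-List_db>
          <PCT-ID-List_uids>#{uids}</PCT-ID-List_uids>
        </PCT-ID-List></PCT-QueryUids_ids></PCT-QueryUids></PCT-Download_uids>
        <PCT-Download_format value="#{format}"/>
        <PCT-Download_compression value="#{compression}"/>
      </PCT-Download></PCT-InputData_download></PCT-InputData></PCT-Data_input></PCT-Data>
    XML
  end
end

puts PugDownload.query([1, 50])
```

Passing a range such as (1..50) produces fifty ID elements, one per compound.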

Run PUG, Run!

After POSTing our query, PUG gives one of two possible responses: we're informed of the status of our query, or we're given a URL to download our results.

Here's an example of a status result:

<?xml version="1.0"?>
<!DOCTYPE PCT-Data PUBLIC "-//NCBI//NCBI PCTools/EN" "http://pubchem.ncbi.nlm.nih.gov/pug/pug.dtd">
<PCT-Data>
  <PCT-Data_output>
    <PCT-OutputData>
      <PCT-OutputData_status>
        <PCT-Status-Message>
          <PCT-Status-Message_status>
            <PCT-Status value="success"/>
          </PCT-Status-Message_status>
        </PCT-Status-Message>
      </PCT-OutputData_status>
      <PCT-OutputData_output>
        <PCT-OutputData_output_waiting>
          <PCT-Waiting>
            <PCT-Waiting_reqid>638302818484957496</PCT-Waiting_reqid>
          </PCT-Waiting>
        </PCT-OutputData_output_waiting>
      </PCT-OutputData_output>
    </PCT-OutputData>
  </PCT-Data_output>
</PCT-Data>

The PCT-Waiting_reqid informs us of our query's ID. We could then prepare and POST another query to monitor its status:

<PCT-Data>
  <PCT-Data_input>
    <PCT-InputData>
      <PCT-InputData_request>
        <PCT-Request>
          <PCT-Request_reqid>638302818484957496</PCT-Request_reqid>
          <PCT-Request_type value="status"/>
        </PCT-Request>
      </PCT-InputData_request>
    </PCT-InputData>
  </PCT-Data_input>
</PCT-Data>

Eventually, we'll get a response containing a PCT-Download-URL_url element. Inside this element is the URL through which we can download our results:

<?xml version="1.0"?>
<!DOCTYPE PCT-Data PUBLIC "-//NCBI//NCBI PCTools/EN" "http://pubchem.ncbi.nlm.nih.gov/pug/pug.dtd">
<PCT-Data>
  <PCT-Data_output>
    <PCT-OutputData>
      <PCT-OutputData_status>
        <PCT-Status-Message>
          <PCT-Status-Message_status>
            <PCT-Status value="success"/>
          </PCT-Status-Message_status>
        </PCT-Status-Message>
      </PCT-OutputData_status>
      <PCT-OutputData_output>
        <PCT-OutputData_output_download-url>
          <PCT-Download-URL>
            <PCT-Download-URL_url>ftp://ftp-private.ncbi.nlm.nih.gov/pubchem/.fetch/766964770894289974.sdf.gz</PCT-Download-URL_url>
          </PCT-Download-URL>
        </PCT-OutputData_output_download-url>
      </PCT-OutputData_output>
    </PCT-OutputData>
  </PCT-Data_output>
</PCT-Data>
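
Put together, the exchange is a small polling loop: look for a download URL, and if there isn't one yet, send a status request for the request ID and try again. The sketch below shows one way to write that loop. The PugPoll module, the attempt limit, and the delay are all choices made for this example, not anything PUG prescribes. The network call is left to a block so the loop itself can be tried without touching PubChem.

```ruby
require 'rexml/document'

# Sketch: poll PUG until a download URL appears.
module PugPoll
  # Builds a status request for the given request ID.
  def self.status_request(reqid)
    '<PCT-Data><PCT-Data_input><PCT-InputData><PCT-InputData_request>' \
      "<PCT-Request><PCT-Request_reqid>#{reqid}</PCT-Request_reqid>" \
      '<PCT-Request_type value="status"/></PCT-Request>' \
      '</PCT-InputData_request></PCT-InputData></PCT-Data_input></PCT-Data>'
  end

  # Returns the text of the first element with the given name, or nil.
  def self.text_of(xml, element)
    node = REXML::XPath.first(REXML::Document.new(xml), "//#{element}")
    node && node.text
  end

  # Starts from the first response and yields each status request to the
  # block, which must POST it and return the response XML. Returns the
  # download URL, or nil if none appeared within the allowed attempts.
  def self.wait_for_url(response, attempts: 10, delay: 5)
    attempts.times do
      url = text_of(response, 'PCT-Download-URL_url')
      return url if url

      reqid = text_of(response, 'PCT-Waiting_reqid')
      raise ArgumentError, 'no download URL or request ID in response' unless reqid

      sleep delay
      response = yield status_request(reqid)
    end
    text_of(response, 'PCT-Download-URL_url')
  end
end

# Usage, where send_xml is a hypothetical method that POSTs XML to PUG:
#   url = PugPoll.wait_for_url(first_response) { |xml| send_xml(xml) }
```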

Conclusions

PUG offers the basic foundation for building a variety of innovative and useful cheminformatics Web services. But before that can happen, high-level APIs will be needed in languages like Ruby, Python, and Java. With these APIs in hand, what kinds of applications will result? Fortunately, imagination is now the only barrier.

Image Credit: shutterbabe68

Octet Fundamentals: A Documented System of Atomic Masses

Posted by Rich Apodaca Fri, 02 Feb 2007 20:10:00 GMT

The way that atoms, and particularly their masses, are modeled sets the stage for the kinds of problems a cheminformatics environment can solve. Many systems are currently in use, a reflection of the many different ways there are to think about this problem. This article will introduce the atomic mass system used by Octet, which provides atomic mass values and uncertainties cross-referenced to the primary literature.

A Documented System of Atomic Masses

Mass and isotopic composition are fundamental atomic properties. In addition to the mass values themselves, the errors of these determinations are also important. Because these quantities are sometimes in dispute, it is essential that they be cross-referenced to the primary literature. Fortunately, a landmark work titled "Atomic weights of the elements" (AWOTE) accomplishing exactly this objective was published in 2000 by a team led by J. K. Böhlke from the U.S. Geological Survey.

Octet uses an XML representation of the data contained in AWOTE. To view the entire document, click here. To illustrate the kind of data included in this document, consider this entry for the element carbon:

<entry symbol="C" atomic-number="6">
  <natural-abundance>
    <mass value="12.0107" error="0.0008" />
    <isotope mass-number="12">
      <mass value="12" error="0" />
      <abundance value="0.9893" error="0.0008" />
    </isotope>
    <isotope mass-number="13">
      <mass value="13.003354838" error="0.000000005" />
      <abundance value="0.0107" error="0.0008" />
    </isotope>
  </natural-abundance>
</entry>

Carbon has two naturally-occurring stable isotopes, 12C and 13C. They have relative abundances of 98.93% and 1.07%, and masses of 12 (exactly) and 13.003354838±0.000000005 unified atomic mass units (u), respectively. Every element from hydrogen to uranium is included, excluding technetium. By reference to AWOTE, the determination of every value in the XML file can be found in the primary literature.
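
The natural-abundance mass in this entry is consistent with the isotope data: it is the abundance-weighted average of the isotope masses. The sketch below checks that arithmetic for carbon using REXML, which ships with Ruby (as a bundled gem in recent versions). It reads the XML directly and does not use Octet; the natural_abundance_mass method is a name invented for this example.

```ruby
require 'rexml/document'

# The carbon entry shown above
CARBON = <<~XML
  <entry symbol="C" atomic-number="6">
    <natural-abundance>
      <mass value="12.0107" error="0.0008" />
      <isotope mass-number="12">
        <mass value="12" error="0" />
        <abundance value="0.9893" error="0.0008" />
      </isotope>
      <isotope mass-number="13">
        <mass value="13.003354838" error="0.000000005" />
        <abundance value="0.0107" error="0.0008" />
      </isotope>
    </natural-abundance>
  </entry>
XML

# Sketch: sum (isotope mass x abundance) over every isotope in an entry.
def natural_abundance_mass(entry_xml)
  doc = REXML::Document.new(entry_xml)
  sum = 0.0
  REXML::XPath.each(doc, '//natural-abundance/isotope') do |isotope|
    mass = isotope.elements['mass'].attributes['value'].to_f
    abundance = isotope.elements['abundance'].attributes['value'].to_f
    sum += mass * abundance
  end
  sum
end

# 12 * 0.9893 + 13.003354838 * 0.0107
puts format('%.4f', natural_abundance_mass(CARBON)) # prints 12.0107
```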

Using the Atomic Mass System

As a demonstration of Octet's system of atomic masses, consider the following Ruby code:

require 'rubygems'
require 'rjb' # require_gem is no longer available in current RubyGems

atomic_system=Rjb::import('net.sf.octet.model.BasicAtomicSystem').getInstance
carbon_distribution=atomic_system.getNaturalAbundance(atomic_system.getAtomicSymbol("C"))

puts carbon_distribution.countNuclei # =>2
puts carbon_distribution.getNucleus(0).getMassNumber # =>12
puts carbon_distribution.getNucleus(1).getMassNumber # =>13
puts atomic_system.getAtomicMass(carbon_distribution.getNucleus(0)).getValue.toString # => 12.0
puts atomic_system.getAtomicMass(carbon_distribution.getNucleus(1)).getValue.toString # => 13.003354838
puts atomic_system.getAtomicMass(carbon_distribution.getNucleus(1)).getUncertainty.toString # => 5.0E-9

The previous article in this series described the small number of steps needed to execute Ruby code such as that shown above on Windows and Linux systems. For more information on the AtomicSystem API, consult the Octet Javadoc.

Conclusions

Octet provides a comprehensive system of atomic masses containing both measurements and uncertainties. This system is furthermore cross-referenced to the primary literature. As a result, the mass of every Octet Molecule can be determined to high precision and with error analysis. Not every application will require this level of detail and documentation, but for those that do the capability exists.


Creative Commons License
This work is licensed under a Creative Commons Attribution 2.5 License.

A Molecular Language for Modern Chemistry: Reading FlexMol Documents with Octet

Posted by Rich Apodaca Wed, 31 Jan 2007 19:56:00 GMT

An XML language is only as useful as the software tools that take advantage of it. Previous articles have discussed how the XML language FlexMol can solve a variety of molecular representation problems ranging from the multiatom bonding of metallocenes to the axial chirality of biaryls. Octet is a framework written in Java that speaks FlexMol natively. In this article, I'll show how Octet can be used to read a sample FlexMol document.

Prerequisites

For this tutorial, you'll need Ruby Java Bridge (RJB). Previous articles have discussed the installation and use of RJB on Windows and Linux.

A Sample Molecule

A recent article discussed a FlexMol representation of the chiral natural product monolaterol. Using a slightly modified numbering system for this molecule (shown above), we can construct a complete FlexMol representation. In this case, we simply start numbering at index zero, subtracting one from every index in the previous example to match the zero-based indices used in Octet.

A Demonstration Package

To illustrate the process of reading a FlexMol document, I've prepared a small package (demo-20070131.tar.gz) that can be downloaded from SourceForge. In it, you'll find an Octet jarfile (octet-0.8.2.jar), a FlexMol representation of monolaterol (s_monolaterol.xml), a Ruby library (reader.rb), and some Ruby test code (test.rb). Inflate this archive and make it your working directory.

A Simple Test

The following sequence of commands will run the test included with the demonstration package:

$ export CLASSPATH=./octet-0.8.2.jar
$ ruby test.rb

You should see several lines of output terminated with the line:

The exact mass of monolaterol is 276.115029755.

You can get more hands-on experience with loading and processing the monolaterol FlexMol document using interactive Ruby (irb). For example:

$ irb
irb(main):001:0> require 'reader'
=> true
irb(main):002:0> r=Reader.new
=> #:0x2b9ab1736690>, @handler=#<#:0x2b9ab1736e10>, @builder=#<#:0x2b9ab1736b90>>
irb(main):003:0> mol=r.read_file 's_monolaterol.xml'
=> #<#:0x2b9ab172cd48>
irb(main):004:0> mol.countAtoms
=> 21
irb(main):005:0> mol.countBondingSystems
=> 24

Of course, this is just scratching the surface of what can be done once a FlexMol document has been loaded by Octet.

Conclusions

Octet makes it possible to convert FlexMol documents into Java object representations that can be accessed through Ruby. With an object representation, the possibilities are limitless. Some simple examples have been provided here. Future articles will illustrate more advanced uses.
