Hacking PubChem: Learning to Speak PUG

Posted by Rich Apodaca Mon, 11 Jun 2007 13:04:00 GMT

A previous article introduced PubChem's Power User Gateway (PUG), an XML-based communication channel. Although NIH kindly supplies a commented schema for PUG queries and responses, there's nothing like seeing real examples when learning a new language. This article will describe one method for conveniently generating PUG XML queries.

Let PubChem Build Your Query

One of the options on the PubChem search page is "Save Query." As it turns out, PubChem saves queries in PUG XML (I'll just call it PUGML). In other words, preparing a query using the PubChem search page and saving it gives a simple method for creating PUGML queries. Let's try it.

Using the "Sketch" button, draw the structure of benzimidazole. Under "Search Type", select "Substructure." Now click "Save Query", and you'll download a substructure query for benzimidazole in PUGML:

<?xml version="1.0"?>
<!DOCTYPE PCT-Data PUBLIC "-//NCBI//NCBI PCTools/EN" "http://pubchem.ncbi.nlm.nih.gov/pug/pug.dtd">
<PCT-Data>
  <PCT-Data_input>
    <PCT-InputData>
      <PCT-InputData_query>
        <PCT-Query>
          <PCT-Query_type>
            <PCT-QueryType>
              <PCT-QueryType_css>
                <PCT-QueryCompoundCS>
                  <PCT-QueryCompoundCS_query>
                    <PCT-QueryCompoundCS_query_data>C1=CC=CC2=C1N=C[N]2</PCT-QueryCompoundCS_query_data>
                  </PCT-QueryCompoundCS_query>
                  <PCT-QueryCompoundCS_type>
                    <PCT-QueryCompoundCS_type_subss>
                      <PCT-CSStructure>
                        <PCT-CSStructure_bonds value="true"/>
                      </PCT-CSStructure>
                    </PCT-QueryCompoundCS_type_subss>
                  </PCT-QueryCompoundCS_type>
                  <PCT-QueryCompoundCS_results>2000000</PCT-QueryCompoundCS_results>
                </PCT-QueryCompoundCS>
              </PCT-QueryType_css>
            </PCT-QueryType>
          </PCT-Query_type>
        </PCT-Query>
      </PCT-InputData_query>
    </PCT-InputData>
  </PCT-Data_input>
</PCT-Data>

The PCT-QueryCompoundCS_type_subss element tells PUG to run a substructure search, and PCT-QueryCompoundCS_query_data holds the query structure as a SMILES string.
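To check what a saved query actually asks for, something like the following Ruby sketch could pull the SMILES string and the search type back out of the PUGML. It uses REXML from Ruby's standard distribution; the method name describe_pug_query is made up for illustration, and the sample is a trimmed copy of the query above.

```ruby
require 'rexml/document'

# Returns the SMILES string and the search type found in a saved PUG query.
# (describe_pug_query is a hypothetical helper, not part of any PubChem API.)
def describe_pug_query(xml)
  doc = REXML::Document.new(xml)
  smiles = REXML::XPath.first(doc, '//PCT-QueryCompoundCS_query_data')
  type_node = REXML::XPath.first(doc, '//PCT-QueryCompoundCS_type/*')
  {
    smiles: smiles && smiles.text,
    # "PCT-QueryCompoundCS_type_subss" becomes "subss"
    search_type: type_node && type_node.name.split('_').last
  }
end

# Trimmed copy of the saved query; in practice use File.read on the saved file.
sample = <<~XML
  <PCT-QueryCompoundCS>
    <PCT-QueryCompoundCS_query>
      <PCT-QueryCompoundCS_query_data>C1=CC=CC2=C1N=C[N]2</PCT-QueryCompoundCS_query_data>
    </PCT-QueryCompoundCS_query>
    <PCT-QueryCompoundCS_type>
      <PCT-QueryCompoundCS_type_subss>
        <PCT-CSStructure><PCT-CSStructure_bonds value="true"/></PCT-CSStructure>
      </PCT-QueryCompoundCS_type_subss>
    </PCT-QueryCompoundCS_type>
  </PCT-QueryCompoundCS>
XML

info = describe_pug_query(sample)
puts "#{info[:search_type]} search for #{info[:smiles]}"
```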

Using the Saved Query with PUG

Saving this file as benzimidazole_sss.xml lets us feed it to PUG:

$ curl -d @benzimidazole_sss.xml "http://pubchem.ncbi.nlm.nih.gov/pug/pug.cgi"

and get the following PUGML response:

<?xml version="1.0"?>
<!DOCTYPE PCT-Data PUBLIC "-//NCBI//NCBI PCTools/EN" "http://pubchem.ncbi.nlm.nih.gov/pug/pug.dtd">
<PCT-Data>
  <PCT-Data_output>
    <PCT-OutputData>
      <PCT-OutputData_status>
        <PCT-Status-Message>
          <PCT-Status-Message_status>
            <PCT-Status value="queued"/>
          </PCT-Status-Message_status>
        </PCT-Status-Message>
      </PCT-OutputData_status>
      <PCT-OutputData_output>
        <PCT-OutputData_output_waiting>
          <PCT-Waiting>
            <PCT-Waiting_reqid>62668946396085905</PCT-Waiting_reqid>
            <PCT-Waiting_message>Structure search job was submitted</PCT-Waiting_message>
          </PCT-Waiting>
        </PCT-OutputData_output_waiting>
      </PCT-OutputData_output>
    </PCT-OutputData>
  </PCT-Data_output>
</PCT-Data>
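The two pieces of this response that matter for the next step are the status value and the request ID. A minimal Ruby sketch for extracting them, assuming REXML is available, might look like this (pug_status and pug_request_id are illustrative names, and the sample is a trimmed copy of the response above):

```ruby
require 'rexml/document'

# Returns the value attribute of the PCT-Status element, or nil if absent.
def pug_status(response_xml)
  doc = REXML::Document.new(response_xml)
  node = REXML::XPath.first(doc, '//PCT-Status')
  node && node.attributes['value']
end

# Returns the request ID from a "waiting" response, or nil if the response
# contains no PCT-Waiting_reqid element.
def pug_request_id(response_xml)
  doc = REXML::Document.new(response_xml)
  node = REXML::XPath.first(doc, '//PCT-Waiting_reqid')
  node && node.text
end

# Trimmed copy of the response shown above.
response = <<~XML
  <PCT-Data>
    <PCT-Status-Message_status>
      <PCT-Status value="queued"/>
    </PCT-Status-Message_status>
    <PCT-Waiting>
      <PCT-Waiting_reqid>62668946396085905</PCT-Waiting_reqid>
      <PCT-Waiting_message>Structure search job was submitted</PCT-Waiting_message>
    </PCT-Waiting>
  </PCT-Data>
XML

puts "status: #{pug_status(response)}"
puts "reqid:  #{pug_request_id(response)}"
```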
We can then check on the status of our query by saving the following as status.xml:

<PCT-Data>
  <PCT-Data_input>
    <PCT-InputData>
      <PCT-InputData_request>
        <PCT-Request>
          <PCT-Request_reqid>62668946396085905</PCT-Request_reqid>
          <PCT-Request_type value="status"/>
        </PCT-Request>
      </PCT-InputData_request>
    </PCT-InputData>
  </PCT-Data_input>
</PCT-Data>

POSTing this to PUG:

$ curl -d @status.xml "http://pubchem.ncbi.nlm.nih.gov/pug/pug.cgi"

gives us the following PUGML:

<?xml version="1.0"?>
<!DOCTYPE PCT-Data PUBLIC "-//NCBI//NCBI PCTools/EN" "http://pubchem.ncbi.nlm.nih.gov/pug/pug.dtd">
<PCT-Data>
  <PCT-Data_output>
    <PCT-OutputData>
      <PCT-OutputData_status>
        <PCT-Status-Message>
          <PCT-Status-Message_status>
            <PCT-Status value="success"/>
          </PCT-Status-Message_status>
          <PCT-Status-Message_message>Your search has already been completed successfully!.</PCT-Status-Message_message>
        </PCT-Status-Message>
      </PCT-OutputData_status>
      <PCT-OutputData_output>
        <PCT-OutputData_output_entrez>
          <PCT-Entrez>
            <PCT-Entrez_db>pccompound</PCT-Entrez_db>
            <PCT-Entrez_query-key>1</PCT-Entrez_query-key>
            <PCT-Entrez_webenv>0CPrI_peUmUtWDooyjxpJ1XAXPcOl-ESZZxj8sJV9ZDR8musMjh1oBTib@1EDD43FA66AE1BE0_0001SID</PCT-Entrez_webenv>
          </PCT-Entrez>
        </PCT-OutputData_output_entrez>
      </PCT-OutputData_output>
    </PCT-OutputData>
  </PCT-Data_output>
</PCT-Data>

Last time, we got a URL to download a gzipped SD File. This time, our query specified results to be returned as an Entrez Key through the PCT-Entrez_webenv element. We can construct a URL that will let us view these results:

http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=HistorySearch&WebEnvRq=1&db=pccompound&query_key=1&WebEnv=0CPrI_peUmUtWDooyjxpJ1XAXPcOl-ESZZxj8sJV9ZDR8musMjh1oBTib%401EDD43FA66AE1BE0_0001SID
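Building that URL by hand is error-prone because the WebEnv value has to be URL-encoded (note the "@" becoming "%40"). Here is a small Ruby sketch that assembles it from the values in the response; the method name entrez_history_url is an invention for this example, and the parameter names are simply those seen in the URL above.

```ruby
require 'uri'

# Builds the Entrez history URL from the values PUG returned.
# (entrez_history_url is a hypothetical helper name.)
def entrez_history_url(webenv, query_key, db = 'pccompound')
  params = [
    ['cmd', 'HistorySearch'],
    ['WebEnvRq', '1'],
    ['db', db],
    ['query_key', query_key.to_s],
    ['WebEnv', webenv]
  ]
  # encode_www_form percent-encodes reserved characters such as "@".
  'http://www.ncbi.nlm.nih.gov/sites/entrez?' + URI.encode_www_form(params)
end

webenv = '0CPrI_peUmUtWDooyjxpJ1XAXPcOl-ESZZxj8sJV9ZDR8musMjh1oBTib@1EDD43FA66AE1BE0_0001SID'
puts entrez_history_url(webenv, 1)
```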

Where to Next?

If we wanted to get a gzipped SD File instead, we'd need to edit our original query. But manually editing XML is a lot like mowing a lawn with scissors. What we'd really like is a simple API in a language like Ruby that will let us build sophisticated PUG queries, process the results, and pipe them into other queries with little effort. But that's a story for another time.

Image Credit: sutterbabe68

Hacking PubChem: Power User Gateway

Posted by Rich Apodaca Mon, 04 Jun 2007 11:06:00 GMT

If you've been waiting for a simple way to programmatically query PubChem without screen scraping, the wait is over. An (apparently) new service called the Power User Gateway (PUG) now offers a direct, XML-based PubChem data channel.

See PUG

Previous articles have discussed various methods for hacking PubChem: screen scraping, using the Entrez Utilities, and simply replicating the database. PUG is different in that it is both very simple and apparently quite powerful.

From the PUG documentation:

... There is a single CGI (pug.cgi, referred to hereafter as simply PUG) that is the central gateway to multiple PubChem functions. PUG takes no URL arguments; all communication with PUG is done by XML. To perform any request, you will formulate your input in XML and then HTTP POST it to PUG. The CGI interprets your incoming request, initiates the appropriate action, then returns results (also) in XML format. ...

See PUG Run

Let's perform a simple query using PUG. As the documentation states, all communication with PUG is done through HTTP POST. In contrast to other approaches to interfacing with PubChem, parameters and results are encoded in raw XML, the schema for which is available here. To use PUG, your first step is to locate software capable of sending this kind of HTTP request.

cURL is such a utility. Among many capabilities, cURL offers a quick and easy way to POST XML to a server and view the response. For example, to POST the file called foo.xml to PUG, the command would be:

$ curl -d @foo.xml "http://pubchem.ncbi.nlm.nih.gov/pug/pug.cgi"
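The same POST could be made from Ruby with Net::HTTP. What follows is only a sketch: the endpoint is the one quoted in this article and may have changed since, and the text/xml content type is my assumption (cURL's -d option sends its default form content type instead). The names build_pug_request and post_to_pug are made up for the example.

```ruby
require 'net/http'
require 'uri'

# Endpoint as quoted in this article; it may no longer be current.
PUG_URI = URI.parse('http://pubchem.ncbi.nlm.nih.gov/pug/pug.cgi')

# Builds (but does not send) a POST request carrying the given XML.
def build_pug_request(xml)
  request = Net::HTTP::Post.new(PUG_URI.request_uri)
  request['Content-Type'] = 'text/xml' # assumption, not taken from the PUG docs
  request.body = xml
  request
end

# Sends the XML to PUG and returns the response body as a String.
def post_to_pug(xml)
  Net::HTTP.start(PUG_URI.host, PUG_URI.port) do |http|
    http.request(build_pug_request(xml)).body
  end
end

# Usage (makes a real network request, so it is left commented out):
# puts post_to_pug(File.read('foo.xml'))
```

Keeping request construction separate from sending makes it possible to inspect the request without touching the network.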

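Our query will request two PubChem Compounds, CIDs 1 and 50, in sdf.gz format. Note that PCT-ID-List_uids holds individual IDs, one per PCT-ID-List_uids_E element, rather than a range.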

<PCT-Data>
  <PCT-Data_input>
    <PCT-InputData>
      <PCT-InputData_download>
        <PCT-Download>
          <PCT-Download_uids>
            <PCT-QueryUids>
              <PCT-QueryUids_ids>
                <PCT-ID-List>
                  <PCT-ID-List_db>pccompound</PCT-ID-List_db>
                  <PCT-ID-List_uids>
                    <PCT-ID-List_uids_E>1</PCT-ID-List_uids_E>
                    <PCT-ID-List_uids_E>50</PCT-ID-List_uids_E>
                  </PCT-ID-List_uids>
                </PCT-ID-List>
              </PCT-QueryUids_ids>
            </PCT-QueryUids>
          </PCT-Download_uids>
          <PCT-Download_format value="sdf"/>
          <PCT-Download_compression value="gzip"/>
        </PCT-Download>
      </PCT-InputData_download>
    </PCT-InputData>
  </PCT-Data_input>
</PCT-Data>

After saving this file as pugtest.xml, we can POST it to PUG using cURL:

$ curl -d @pugtest.xml "http://pubchem.ncbi.nlm.nih.gov/pug/pug.cgi"
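Rather than typing the download request by hand, a few lines of Ruby could generate it for any list of CIDs. This is a sketch only: pug_download_xml is a hypothetical helper, and the default format and compression values are the only ones confirmed by the example above.

```ruby
# Generates a PUG download request for the given compound IDs.
# Integer() rejects anything that is not a whole number, which keeps
# stray text out of the generated XML.
def pug_download_xml(cids, fmt = 'sdf', compression = 'gzip', db = 'pccompound')
  uids = cids.map { |cid| "<PCT-ID-List_uids_E>#{Integer(cid)}</PCT-ID-List_uids_E>" }.join
  <<~XML
    <PCT-Data>
      <PCT-Data_input>
        <PCT-InputData>
          <PCT-InputData_download>
            <PCT-Download>
              <PCT-Download_uids>
                <PCT-QueryUids>
                  <PCT-QueryUids_ids>
                    <PCT-ID-List>
                      <PCT-ID-List_db>#{db}</PCT-ID-List_db>
                      <PCT-ID-List_uids>#{uids}</PCT-ID-List_uids>
                    </PCT-ID-List>
                  </PCT-QueryUids_ids>
                </PCT-QueryUids>
              </PCT-Download_uids>
              <PCT-Download_format value="#{fmt}"/>
              <PCT-Download_compression value="#{compression}"/>
            </PCT-Download>
          </PCT-InputData_download>
        </PCT-InputData>
      </PCT-Data_input>
    </PCT-Data>
  XML
end

# Writes the same request as pugtest.xml above.
File.write('pugtest.xml', pug_download_xml([1, 50]))
```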

Run PUG, Run!

After POSTing our query, PUG gives one of two possible responses: we're informed of the status of our query, or we're given a URL to download our results.

Here's an example of a status result:

<?xml version="1.0"?>
<!DOCTYPE PCT-Data PUBLIC "-//NCBI//NCBI PCTools/EN" "http://pubchem.ncbi.nlm.nih.gov/pug/pug.dtd">
<PCT-Data>
  <PCT-Data_output>
    <PCT-OutputData>
      <PCT-OutputData_status>
        <PCT-Status-Message>
          <PCT-Status-Message_status>
            <PCT-Status value="success"/>
          </PCT-Status-Message_status>
        </PCT-Status-Message>
      </PCT-OutputData_status>
      <PCT-OutputData_output>
        <PCT-OutputData_output_waiting>
          <PCT-Waiting>
            <PCT-Waiting_reqid>638302818484957496</PCT-Waiting_reqid>
          </PCT-Waiting>
        </PCT-OutputData_output_waiting>
      </PCT-OutputData_output>
    </PCT-OutputData>
  </PCT-Data_output>
</PCT-Data>

The PCT-Waiting_reqid informs us of our query's ID. We could then prepare and POST another query to monitor its status:

<PCT-Data>
  <PCT-Data_input>
    <PCT-InputData>
      <PCT-InputData_request>
        <PCT-Request>
          <PCT-Request_reqid>638302818484957496</PCT-Request_reqid>
          <PCT-Request_type value="status"/>
        </PCT-Request>
      </PCT-InputData_request>
    </PCT-InputData>
  </PCT-Data_input>
</PCT-Data>
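Since the status request differs only in its request ID, it is easy to template. A minimal Ruby sketch, with pug_status_xml as an illustrative name:

```ruby
# Generates a PUG status request for the given request ID.
def pug_status_xml(reqid)
  # Request IDs seen in this article are plain digit strings.
  raise ArgumentError, 'reqid must be numeric' unless reqid.to_s =~ /\A\d+\z/
  <<~XML
    <PCT-Data>
      <PCT-Data_input>
        <PCT-InputData>
          <PCT-InputData_request>
            <PCT-Request>
              <PCT-Request_reqid>#{reqid}</PCT-Request_reqid>
              <PCT-Request_type value="status"/>
            </PCT-Request>
          </PCT-InputData_request>
        </PCT-InputData>
      </PCT-Data_input>
    </PCT-Data>
  XML
end

puts pug_status_xml('638302818484957496')
```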

Eventually, we'll get a response containing a PCT-Download-URL_url element. Inside this element is the URL through which we can download our results:

<?xml version="1.0"?>
<!DOCTYPE PCT-Data PUBLIC "-//NCBI//NCBI PCTools/EN" "http://pubchem.ncbi.nlm.nih.gov/pug/pug.dtd">
<PCT-Data>
  <PCT-Data_output>
    <PCT-OutputData>
      <PCT-OutputData_status>
        <PCT-Status-Message>
          <PCT-Status-Message_status>
            <PCT-Status value="success"/>
          </PCT-Status-Message_status>
        </PCT-Status-Message>
      </PCT-OutputData_status>
      <PCT-OutputData_output>
        <PCT-OutputData_output_download-url>
          <PCT-Download-URL>
            <PCT-Download-URL_url>ftp://ftp-private.ncbi.nlm.nih.gov/pubchem/.fetch/766964770894289974.sdf.gz</PCT-Download-URL_url>
          </PCT-Download-URL>
        </PCT-OutputData_output_download-url>
      </PCT-OutputData_output>
    </PCT-OutputData>
  </PCT-Data_output>
</PCT-Data>
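Putting the pieces together, the submit-then-poll cycle could be sketched in Ruby roughly as follows. The poster argument is any callable that takes request XML and returns response XML, so the loop can be tried offline with canned responses as shown. All names here (wait_for_download_url, pug_status_xml), the polling interval, and the example.invalid URL are assumptions for illustration, not part of PUG itself.

```ruby
require 'rexml/document'

# Minimal status request for a request ID (whitespace is not significant).
def pug_status_xml(reqid)
  '<PCT-Data><PCT-Data_input><PCT-InputData><PCT-InputData_request><PCT-Request>' \
  "<PCT-Request_reqid>#{reqid}</PCT-Request_reqid>" \
  '<PCT-Request_type value="status"/>' \
  '</PCT-Request></PCT-InputData_request></PCT-InputData></PCT-Data_input></PCT-Data>'
end

# Starting from the first response, keeps asking for status until a
# download URL appears. Returns the URL, or nil if the response is neither
# "waiting" nor finished, or if max_polls is exhausted.
def wait_for_download_url(response_xml, poster, interval: 2, max_polls: 30)
  max_polls.times do
    doc = REXML::Document.new(response_xml)
    url = REXML::XPath.first(doc, '//PCT-Download-URL_url')
    return url.text if url

    reqid = REXML::XPath.first(doc, '//PCT-Waiting_reqid')
    return nil unless reqid

    sleep(interval) # be polite: do not hammer the server
    response_xml = poster.call(pug_status_xml(reqid.text))
  end
  nil
end

# Offline demonstration with canned responses instead of a real server.
waiting = '<PCT-Data><PCT-Waiting_reqid>42</PCT-Waiting_reqid></PCT-Data>'
done = '<PCT-Data><PCT-Download-URL_url>ftp://example.invalid/42.sdf.gz</PCT-Download-URL_url></PCT-Data>'
replies = [waiting, done]
fake_poster = ->(_xml) { replies.shift }

puts wait_for_download_url(waiting, fake_poster, interval: 0)
```

In real use, poster would wrap an HTTP POST to PUG, and the interval would be long enough to avoid burdening the service.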

Conclusions

PUG offers the basic foundation for building a variety of innovative and useful cheminformatics Web services. But before that can happen, high-level APIs will be needed in languages like Ruby, Python, and Java. With these APIs in hand, what kinds of applications will result? Fortunately, imagination is now the only barrier.

Image Credit: shutterbabe68

Hacking PubChem: Entrez Programming Utilities

Posted by Rich Apodaca Sat, 23 Sep 2006 05:22:00 GMT

A recent article poses the question of how to balance the rights of owners of open chemical information resources against those of their users, while promoting an innovative environment for third-party developers. Although PubChem was the focus, the discussion could apply to any other chemical information resource. A reasonable approach would be to provide two separate entry points: one for Web browsers and another for various types of semi-autonomous software used in hacking and mashups.

Peter Corbett writes to point out that the Entrez Programming Utilities can be used to query PubChem and other databases under the NIH/NLM/NCBI umbrella. A separate developer server processes requests, and the terms of its use are fairly well stated. Future articles will explore the possibility of building some simple Ruby APIs for this developer PubChem entry point.

Hacking PubChem: Why The Open Access Fight is Just the Beginning

Posted by Rich Apodaca Fri, 22 Sep 2006 17:58:00 GMT

Like no other medium, the Internet tests our basic beliefs about the rights of resource owners and resource users. As the Internet increasingly becomes home to scientific publication mechanisms that have no counterpart in the physical world, a larger question looms: what separates fair use of these services from abuse?

Depth-First hosts a series of articles, with possibly many more to follow, on programmatically accessing open chemical information databases.

The availability of open chemical information resources like PubChem and NMRShiftDB is a very recent phenomenon, and desperately overdue. One premise of this blog is that chemical informatics is at the start of a renaissance; the chemical information revolution that started in the 1950s is now set to continue after a long period of stagnation. Large, open data sources, and open software that mines them, will fuel this transformation, just as they have in bioinformatics.

The interaction of non-browser software with public databases, although rich in potential payoffs, can also lead to a great deal of damage. PubChem contains millions of structure-searchable compounds. Setting the wrong kinds of programs loose on this site could cause service interruptions ranging from the annoying to the severe.

There is no standard mechanism for website owners to spell out acceptable use policies to non-browser software. The closest thing we have to a standard is the Robots Exclusion Protocol. This protocol defines acceptable behaviors for a robot, which according to one definition is: "... a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced." Other definitions are in use. The one thing these definitions seem to have in common is the concept of scale: the more comprehensive and indiscriminate the program is in its interactions with a website, the more like a robot, and less like a browser, it becomes.

Site owners specify their robots policy in a file called robots.txt hosted on their servers. The PubChem robots.txt file currently includes the following policies:

User-agent: *
Disallow: /substance/PcsSrv.cgi
Disallow: /summary/summary.cgi
Disallow: /assay/assay.cgi
Disallow: /image/imgsrv.fcgi
Disallow: /image/smi2gif.fcgi
Disallow: /image/smi2gif.cgi
Disallow: /image/structurefly.cgi
Disallow: /search/NbrQsrv.cgi
Disallow: /search/PreQSrv.cgi

Here, User-agent refers to the name of the robot, which is set as a wildcard, meaning any robot. The Disallow lines refer to resources off-limits to robots.

One of these disallowed resources, /search/PreQSrv.cgi, is explicitly used in the PubChem SMILES query article.
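A program that wants to respect these rules can check a path against them before making a request. The Ruby sketch below is deliberately simplified: it handles only User-agent and Disallow lines with plain prefix matching, and ignores Allow lines, wildcards, and grouped User-agent lines. The method name disallowed? is my own.

```ruby
# The policy lines quoted above.
PUBCHEM_ROBOTS = <<~TXT
  User-agent: *
  Disallow: /substance/PcsSrv.cgi
  Disallow: /summary/summary.cgi
  Disallow: /assay/assay.cgi
  Disallow: /image/imgsrv.fcgi
  Disallow: /image/smi2gif.fcgi
  Disallow: /image/smi2gif.cgi
  Disallow: /image/structurefly.cgi
  Disallow: /search/NbrQsrv.cgi
  Disallow: /search/PreQSrv.cgi
TXT

# Returns true if path falls under a Disallow rule that applies to agent.
def disallowed?(robots_txt, path, agent = '*')
  applies = false
  robots_txt.each_line do |line|
    # Drop comments, then split "Field: value" into its two parts.
    field, value = line.sub(/#.*/, '').strip.split(/:\s*/, 2)
    next unless field && value

    case field.downcase
    when 'user-agent'
      applies = (value == '*' || value.casecmp(agent).zero?)
    when 'disallow'
      # An empty Disallow value means "nothing is disallowed".
      return true if applies && !value.empty? && path.start_with?(value)
    end
  end
  false
end

puts disallowed?(PUBCHEM_ROBOTS, '/search/PreQSrv.cgi')
puts disallowed?(PUBCHEM_ROBOTS, '/pug/pug.cgi')
```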

Is a person who runs software of the type I describe in these articles violating PubChem's use policy? The best answer I can give is, "it depends." I think it would be hard for reasonable people to suggest that using the software as described in the tutorials, with their deliberately limited scope, for research purposes, and with no intent to do damage, represents abuse.

On the other hand, I can see how reasonable people could argue that a website operating as a comprehensive front-end to PubChem using the techniques described in these articles could be considered abuse. I know I might consider it abuse if I ran PubChem, depending on why I was running the service.

If I wanted to stimulate innovation in the area of open database mining, I might actually encourage front ends and similar third-party PubChem services. I might set aside servers specifically dedicated to this kind of activity. I might even develop an Open Source PubChem Web-API to help developers get started. Unfortunately, NIH's intentions are not exactly clear on this point.

The NCBI's Copyright and Disclaimers page, the only document that to my knowledge states any kind of use policy, is not especially illuminating:

Conditions of Use

This site is maintained by the U.S. Government and is protected by various provisions of Title 18 of the U.S. Code. Violations of Title 18 are subject to criminal prosecution in a federal court. For site security purposes, as well as to ensure that this service remains available to all users, we use software programs to monitor traffic and to identify unauthorized attempts to upload or change information or otherwise cause damage. In the event of authorized law enforcement investigations and pursuant to any required legal process, information from these sources may be used to help identify an individual.

We are left with the critical, but unanswered question: "What represents an unauthorized use of PubChem?"

The document cited above also raises the truly bizarre possibility of PubChem not actually being capable of granting rights to redistribute what is contained on its servers:

This site also contains resources such as PubMed Central, Bookshelf, OMIM, and PubChem which incorporate material contributed or licensed by individuals, companies, or organizations that may be protected by U.S. and foreign copyright laws. ...

But this is a subject for another day.

Getting back to accessing PubChem data, one very far-sighted thing the NIH has done is to make the entire dataset freely downloadable in three different file formats. Rather than mine the PubChem website itself, you could download the data to your machine, letting the software you write access it locally. The sheer size of this dataset creates problems of its own. Future articles will describe some approaches to solving them.

Regardless of your views on the use and abuse of chemical information resources like PubChem, it's clear that getting open resources on the Web is only the first in a long series of controversial steps that will ultimately transform both the practice and culture of research.

Hacking PubChem: Query by SMILES

Posted by Rich Apodaca Thu, 21 Sep 2006 19:12:00 GMT

Recently, I showed how a simple PubChem API could be built from a few lines of Ruby code. The API we created could retrieve a molfile and a 2-D molecular rendering given a PubChem compound ID (CID). In this tutorial, we'll see how a SMILES query mechanism can be added to the API, enabling CIDs to be retrieved from any valid SMILES string. We'll also see how to extend this capability to retrieving a 2-D image from PubChem by submitting a SMILES string.

Credits

The API that follows is based on the pubchem.rb file found in Chemruby by Tadashi Kadowaki and Nobuya Tanaka.

Defining the Problem

We want to create a PubChem API that returns an Array of CIDs given any valid SMILES string. The API will communicate with the publicly available molecular database PubChem using HTTP.

In some cases, PubChem associates more than one CID with a given molecular structure. For example, querying the SMILES string c1ccccc1 (benzene) finds both benzene and C-14 benzene. The software needs to handle these cases as well.

Prerequisites

The only thing you'll need for this tutorial is Ruby, preferably v1.8 or better.

Code

Create a file called query.rb in your working directory containing the following code:

require 'uri'
require 'net/http'

# A simple SMILES query for PubChem based on the file <tt>pubchem.rb</tt>,
# and originally part of Chemruby (http://rubyforge.org/project/chemruby).
# Distributed under Ruby's License.
#
# Copyright (C) 2005, 2006 KADOWAKI Tadashi <kado@kuicr.kyoto-u.ac.jp>
#                          TANAKA   Nobuya  <tanaka@kuicr.kyoto-u.ac.jp>
#                          APODACA  Richard <r_apodaca@users.sf.net>
class PubChemQuery
  @@host="pubchem.ncbi.nlm.nih.gov"
  @@searchpath="/search/"
  @@query="PreQSrv.cgi"
  @@boundary="-----boundary-----"

  # Synthetic form data. Lifted from Chemruby <tt>pubchem.rb</tt>
  @@data = [
    @@boundary, "Content-Disposition: form-data; name=\"mode\"", "", "simplequery",
    @@boundary, "Content-Disposition: form-data; name=\"queue\"", "", "ssquery",
    @@boundary, "Content-Disposition: form-data; name=\"simple_searchdata\"", "", '%s',
    @@boundary, "Content-Disposition: form-data; name=\"simple_searchtype\"", "", "fs",
    @@boundary, "Content-Disposition: form-data; name=\"maxhits\"", "", '%s',
    @@boundary].join("\x0d\x0a")

  # Returns an <tt>Array</tt> of CIDs matching <tt>smiles</tt>. If no matches are found,
  # <tt>nil</tt> is returned.
  def self.query_by_smiles(smiles, maxhits = 100)
    form_response = post_form(smiles, maxhits)
    wait_response = process_wait_page(form_response)
    url = get_report_url(wait_response)

    url ? process_report(url) : nil
  end

private

  # Returns the response to posting the initial search form.
  def self.post_form(smiles, maxhits)
    response = ''

    Net::HTTP.start(@@host, 80) do |http|
      response = http.post(@@searchpath + @@query, @@data % [smiles, maxhits],
      {
        'Content-Type' => "multipart/form-data; boundary=#{@@boundary}",
        'Referer' => "http://pubchem.ncbi.nlm.nih.gov/search/"
      }).body
    end

    response
  end

  # Processes the wait page displayed after submission of the search form.
  def self.process_wait_page(body)
    response = ''

    if m = /url="([^"]+)"/.match(body)
      Net::HTTP.start(@@host, 80) do |http|
        response = http.get(@@searchpath + m[1]).body
      end
    end

    response
  end

  # Returns the URL, as a <tt>String</tt>, to the search report, given the specified
  # body of the wait page.
  def self.get_report_url(body)
    url = nil

    Net::HTTP.start(@@host, 80) do |http|
      while /setTimeout\('document.location.replace\("([^"]+)"\);', (\d+)\)/ =~ body do
        sleep($2.to_f/100)

        response = http.get(URI.parse($1).to_s)
        body = response.body
        url = response['location']
      end
    end

    url
  end

  # Extracts CIDs from the search report contained at <tt>url</tt>.
  def self.process_report(url)
    cid = Array.new

    Net::HTTP.start(@@host, 80) do |http|
      # text format
      url.sub!(/cmd=Select\+from\+History/, 'cmd=Text&dopt=Brief')
      http.get(url).body.scan(/\d+: CID: (\d+)/).each do |id|
        cid.push(id[0])
      end
    end

    cid
  end
end

You might want to manually submit a SMILES query to PubChem as a refresher on how this webapp works. Briefly, the contents of the SMILES search field are read, and a wait screen appears, typically for three seconds. You are then redirected to a results report page containing thumbnail images of the hits and their CIDs.

The PubChemQuery class contains a single public class method, query_by_smiles. This method builds a form to submit, based on the supplied SMILES string and optional maxhits argument. It then waits until PubChem indicates that the query is about to finish processing. The URL for the results report page is then parsed. If a nonempty URL was found, then its page is loaded, and CIDs are scraped. Otherwise, the method returns nil.
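The final scraping step is the easiest to try in isolation. The sketch below applies the same regular expression used in process_report to a made-up text report; the sample lines and CIDs are invented to match the pattern and are not real PubChem output, and extract_cids is a name chosen for this example.

```ruby
# Pulls CIDs out of a text-format report using the pattern from
# process_report. CIDs are returned as Strings, as in the class above.
def extract_cids(report_text)
  # scan returns one single-element Array per match, so flatten the result.
  report_text.scan(/\d+: CID: (\d+)/).flatten
end

# Invented sample report, shaped to match the pattern only.
sample = <<~TXT
  1: CID: 111
     first hit description
  2: CID: 222
     second hit description
TXT

p extract_cids(sample)
p extract_cids('no hits here')
```

Note that this helper returns an empty Array when nothing matches, whereas query_by_smiles returns nil when no report URL is found; callers need to handle both cases.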

Usage

Using PubChemQuery consists of invoking its class method query_by_smiles. You can do so either via the Ruby interpreter (ruby), or preferably through Interactive Ruby (irb).

require 'query'

smiles = "c1cccc(Cl)c1(Cl)" # 1,2-dichlorobenzene
puts "Searching CID(s) for SMILES, #{smiles} ..."
cid = PubChemQuery.query_by_smiles(smiles)
puts cid # => 7239

Layering Complexity

We can combine the SMILES query API discussed here with the molfile and image retrieval discussed in the earlier Hacking PubChem article.

Let's say you'd like to download PubChem's 2-D image of imatinib (Gleevec) by submitting its SMILES string. Copy the file named pubchem.rb, provided in the original PubChem tutorial, into your working directory. Now you can programmatically download imatinib's 2-D image from PubChem based only on a SMILES string, for example:

require 'pubchem'
require 'query'

smiles="Cc3ccc(NC(=O)c2ccc(CN1CCN(C)CC1)cc2)cc3Nc5nccc(c4cccnc4)n5" #imatinib
puts "Searching CID(s) for SMILES, #{smiles} ..."
cid = PubChemQuery.query_by_smiles(smiles)

if cid
  puts "CID found: #{cid[0]}"

  filename = cid[0] + ".png"
  puts "Writing image to #{filename} ..."
  PubChem.write_image(cid[0], filename)
else
  puts "No CID for #{smiles} was found."
end

This produces an image of imatinib called 5291.png in your working directory.

Wrapping Up

As you can see, we're just scratching the surface. The approach outlined here offers nearly unlimited possibilities for repackaging PubChem's own content, and mashing this content up with that of other sites. Happy hacking!
