PubChem is a Platform
Two recent J. Chem. Inf. Model. articles support the idea that PubChem is rapidly evolving into a Chemical Informatics platform:
Large-Scale Annotation of Small-Molecule Libraries Using Public Databases. Using PubChem and other databases, the authors categorize the level of annotation (data, metadata, and links) of free chemical databases, with PubChem as the centerpiece. The work is part of a larger effort designed to integrate this free resource into the workflow of the Genomics Institute of the Novartis Research Foundation (GNF).
Web Service Infrastructure for Chemoinformatics. Among other interesting initiatives, the article describes a desktop application front-end for PubChem. (As a bonus, the authors also make the case).
Platforms are essential because they focus the attention and effort of self-interested third-parties around a common goal. They become so integrated into society that they eventually become invisible. There is outrage when they stop working. Think of highways, sewers, phone lines, communications satellites, the patent system, and the Internet, among others. We don't just use these services, we build on top of them.
Chemical Abstracts Service (CAS) is an important tool for many, but it is not a platform. By placing high costs on access to its service and severely restricting its use, the ACS has effectively shut out anyone wanting to build another service on top of CAS. Clearly this was part of the plan. Small and large third-party players alike are shut out, with the inevitable chilling effect on innovation.
Contrast this situation with PubChem. The public is free to download and re-use the entire database of molecules and associated data. PubChem has recently unveiled a new Web API called PUG that will make it even easier to layer on additional functionality. These kinds of capabilities create an entirely different dynamic: witness both eMolecules and ChemSpider, two services that unashamedly exploit the PubChem resource. Expect to see more of this in the months ahead.
Remember the Apple II? This product became so successful that it played a major role in undermining dozens of highly profitable and well-established businesses. Why was it so successful? One of the key reasons was its open architecture, compared to what had preceded it. Within a very short time, third parties had developed a large number of innovative products that exploited the underlying platform - both with and without Apple's encouragement. One of those products, VisiCalc, was so successful that at one point many buyers of Apple's machine bought it for no other purpose than to run VisiCalc.
Whether PubChem itself ends up becoming the standard cheminformatics platform is hard to say. Perhaps this role will be filled by a system not yet built, or which evolves from PubChem. Whatever the outcome, PubChem has unmasked a deep need (and opportunity) for an open cheminformatics platform. As Apple's experience demonstrates, often you get more in the end by giving something up.
Hacking PubChem: Free Speech or Free Beer?
Government information available from this site is within the public domain. Public domain information on the National Library of Medicine (NLM) Web pages may be freely distributed and copied. However, it is requested that in any subsequent use of this work, NLM be given appropriate acknowledgment.
This site also contains resources such as PubMed Central, Bookshelf, OMIM, and PubChem which incorporate material contributed or licensed by individuals, companies, or organizations that may be protected by U.S. and foreign copyright laws. All persons reproducing, redistributing, or making commercial use of this information are expected to adhere to the terms and conditions asserted by the copyright holder. Transmission or reproduction of protected items beyond that allowed by fair use (PDF) as defined in the copyright laws requires the written permission of the copyright owners.
Open Source licensing is nothing short of revolutionary. Of all of the things an Open Source license makes possible, perhaps the most far-reaching is the right of licensees to create and distribute derivative works. This is what separates "software that's free" ("free as in beer") from "Free Software" ("free as in speech"). A licensee that is not free to create and distribute derivative works has virtually no incentive to build on what the original creator has given away. Would you contribute your valuable time to improving something that you knew you could never use as you saw fit? This may sound like semantic hair-splitting, but it's far from it. None of the phenomenal progress made in Open Source software would have been possible without the basic rights to create and distribute derivative works.
PubChem's Copyright Disclaimer should give anyone familiar with Open Source licensing grounds to ponder. Apparently, NIH is telling its users that it doesn't have the authority to grant them the right to copy all PubChem content or distribute derivative works. But what parts of PubChem can these rights be granted for, if any? What parts of PubChem are copyrighted, and therefore owned, by contributors? How can a user find out which parts of PubChem are subject to copyright claims by contributors?
It isn't too difficult to imagine a scenario in which PubChem requires those depositing data to agree to a copyright waiver. This waiver would simply grant PubChem users the sublicensable right to copy a depositor's content verbatim and to distribute derivative works based on it, royalty-free. The depositor would still retain any copyright they might want to assert outside of PubChem. If the depositor doesn't own these rights, or isn't willing to part with them, then that content would be rejected. This has been done for years in Open Source software projects and is being done increasingly with Creative Commons licenses for non-software intellectual property. Both approaches have strengths and weaknesses, and my aim is not to advocate either one. The point is simply that the idea is not new.
Maybe a copyright waiver isn't feasible. Regardless, PubChem could create a mechanism whereby content for which a contributor is asserting copyright claims can be identified as such and optionally avoided by its users.
While I'd never turn down free beer, and I'd always thank those offering, in the long run free speech is far more sustaining.
Hacking PubChem: Query by SMILES
Recently, I showed how a simple PubChem API could be built from a few lines of Ruby code. The API we created could retrieve a molfile and a 2-D molecular rendering given a PubChem compound ID (CID). In this tutorial, we'll see how a SMILES query mechanism can be added to the API, enabling CIDs to be retrieved from any valid SMILES string. We'll also see how to extend this capability to retrieving a 2-D image from PubChem by submitting a SMILES string.
Credits
The API that follows is based on the pubchem.rb file found in Chemruby by Tadashi Kadowaki and Nobuya Tanaka.
Defining the Problem
We want to create a PubChem API that returns an Array of CIDs given any valid SMILES string. The API will communicate with the publicly available molecular database PubChem using HTTP.
In some cases, PubChem associates more than one CID with a given molecular structure. For example, querying the SMILES string c1ccccc1 (benzene) finds both benzene and C-14 benzene. The software needs to handle these cases as well.
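One way a caller might cope with zero, one, or several hits is sketched below. The helper name summarize_hits is hypothetical and not part of the API built in this tutorial, and the CID strings in the usage lines are placeholders rather than real query results. The sketch only assumes that a query yields either nil or an Array of CID strings.

```ruby
# Hypothetical helper (not part of PubChemQuery): turns a query result,
# which may be nil, empty, or hold several CIDs, into a readable summary.
def summarize_hits(cids)
  return "no match" if cids.nil? || cids.empty?
  return "one match: CID #{cids.first}" if cids.size == 1
  "#{cids.size} matches: CIDs #{cids.join(', ')}"
end

# Placeholder CID strings, not real PubChem results.
puts summarize_hits(nil)
puts summarize_hits(["111"])
puts summarize_hits(["111", "222"])
```

Treating nil and an empty Array the same way keeps calling code from having to care which of the two "nothing found" signals it received.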
Prerequisites
The only thing you'll need for this tutorial is Ruby, preferably v1.8 or better.
Code
Create a file called query.rb in your working directory containing the following code:
require 'uri'
require 'net/http'

# A simple SMILES query for PubChem based on the file <tt>pubchem.rb</tt>,
# and originally part of Chemruby (http://rubyforge.org/project/chemruby).
# Distributed under Ruby's License.
#
# Copyright (C) 2005, 2006 KADOWAKI Tadashi <kado@kuicr.kyoto-u.ac.jp>
#                          TANAKA Nobuya <tanaka@kuicr.kyoto-u.ac.jp>
#                          APODACA Richard <r_apodaca@users.sf.net>
class PubChemQuery
  @@host = "pubchem.ncbi.nlm.nih.gov"
  @@searchpath = "/search/"
  @@query = "PreQSrv.cgi"
  @@boundary = "-----boundary-----"

  # Synthetic form data. Lifted from Chemruby <tt>pubchem.rb</tt>
  @@data = [
    @@boundary, "Content-Disposition: form-data; name=\"mode\"", "", "simplequery",
    @@boundary, "Content-Disposition: form-data; name=\"queue\"", "", "ssquery",
    @@boundary, "Content-Disposition: form-data; name=\"simple_searchdata\"", "", '%s',
    @@boundary, "Content-Disposition: form-data; name=\"simple_searchtype\"", "", "fs",
    @@boundary, "Content-Disposition: form-data; name=\"maxhits\"", "", '%s',
    @@boundary].join("\x0d\x0a")

  # Returns an <tt>Array</tt> of CIDs matching <tt>smiles</tt>. If no matches are found,
  # <tt>nil</tt> is returned.
  def self.query_by_smiles(smiles, maxhits = 100)
    form_response = post_form(smiles, maxhits)
    wait_response = process_wait_page(form_response)
    url = get_report_url(wait_response)
    url ? process_report(url) : nil
  end

  # Returns the response to posting the initial search form.
  def self.post_form(smiles, maxhits)
    response = ''
    Net::HTTP.start(@@host, 80) do |http|
      response = http.post(@@searchpath + @@query, @@data % [smiles, maxhits],
        {
          'Content-Type' => "multipart/form-data; boundary=#{@@boundary}",
          'Referer' => "http://pubchem.ncbi.nlm.nih.gov/search/"
        }).body
    end
    response
  end

  # Processes the wait page displayed after submission of the search form.
  def self.process_wait_page(body)
    response = ''
    if m = /url="([^"]+)"/.match(body)
      Net::HTTP.start(@@host, 80) do |http|
        response = http.get(@@searchpath + m[1]).body
      end
    end
    response
  end

  # Returns the URL, as a <tt>String</tt>, to the search report, given the specified
  # body of the wait page.
  def self.get_report_url(body)
    url = nil
    Net::HTTP.start(@@host, 80) do |http|
      while /setTimeout\('document.location.replace\("([^"]+)"\);', (\d+)\)/ =~ body do
        # setTimeout delays are given in milliseconds; sleep expects seconds.
        sleep($2.to_f / 1000)
        response = http.get(URI.parse($1).to_s)
        body = response.body
        url = response['location']
      end
    end
    url
  end

  # Extracts CIDs from the search report contained at <tt>url</tt>.
  def self.process_report(url)
    cid = Array.new
    Net::HTTP.start(@@host, 80) do |http|
      # text format
      url.sub!(/cmd=Select\+from\+History/, 'cmd=Text&dopt=Brief')
      http.get(url).body.scan(/\d+: CID: (\d+)/).each do |id|
        cid.push(id[0])
      end
    end
    cid
  end

  # A bare <tt>private</tt> has no effect on methods defined with <tt>def self.</tt>,
  # so the helpers are hidden explicitly.
  private_class_method :post_form, :process_wait_page, :get_report_url, :process_report
end
You might want to manually submit a SMILES query to PubChem as a refresher on how this webapp works. Briefly, the contents of the SMILES search field are read, and a wait screen appears, typically for three seconds. You are then redirected to a results report page containing thumbnail images of the hits and their CIDs.
The PubChemQuery class contains a single public class method, query_by_smiles. This method builds a form to submit, based on the supplied SMILES string and optional maxhits argument. It then waits until PubChem indicates that the query is about to finish processing. The URL for the results report page is then parsed. If a nonempty URL was found, then its page is loaded, and CIDs are scraped. Otherwise, the method returns nil.
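The scraping steps described above can be tried offline with a minimal sketch like the following. The module name ScrapeSketch, its method names, and all three page fragments are invented for illustration: the fragments were written to fit regular expressions modelled on those in PubChemQuery, and the real PubChem pages may well look different.

```ruby
# Offline sketch of the three scraping steps. The page fragments are
# invented to fit the regular expressions; real PubChem pages may differ.
module ScrapeSketch
  WAIT_PAGE   = 'url="PreQSrv.cgi?reqid=123"'
  POLL_PAGE   = %q{setTimeout('document.location.replace("/search/poll.cgi?reqid=123");', 3000)}
  REPORT_TEXT = "1: CID: 111\n2: CID: 222\n"

  # Step 1: pull the follow-up path out of the wait page, or nil if absent.
  def self.wait_path(body)
    m = /url="([^"]+)"/.match(body)
    m && m[1]
  end

  # Step 2: pull the polling URL and its delay (in milliseconds) out of
  # the JavaScript redirect, or nil if the page contains no redirect.
  def self.poll_target(body)
    m = /setTimeout\('document\.location\.replace\("([^"]+)"\);', (\d+)\)/.match(body)
    m && [m[1], m[2].to_i]
  end

  # Step 3: collect every CID listed in the text report.
  def self.report_cids(text)
    text.scan(/\d+: CID: (\d+)/).flatten
  end
end

puts ScrapeSketch.wait_path(ScrapeSketch::WAIT_PAGE)
p ScrapeSketch.poll_target(ScrapeSketch::POLL_PAGE)
p ScrapeSketch.report_cids(ScrapeSketch::REPORT_TEXT)
```

Keeping each pattern in its own small method makes it easier to see which step breaks if PubChem changes the markup of one of its pages.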
Usage
Using PubChemQuery consists of invoking its class method query_by_smiles. You can do so either via the Ruby interpreter (ruby), or preferably through Interactive Ruby (irb).
require 'query'
smiles = "c1cccc(Cl)c1(Cl)" # 1,2-dichlorobenzene
puts "Searching CID(s) for SMILES, #{smiles} ..."
cid = PubChemQuery.query_by_smiles(smiles)
puts cid # => 7239
Layering Complexity
We can combine the SMILES query API discussed here with the molfile and image retrieval discussed in the earlier Hacking PubChem article.
Let's say you'd like to download PubChem's 2-D image of imatinib (Gleevec) by submitting its SMILES string. Copy the file named pubchem.rb, provided in the original PubChem tutorial, into your working directory. Now you can programmatically download imatinib's 2-D image from PubChem based only on a SMILES string, for example:
require 'pubchem'
require 'query'

smiles = "Cc3ccc(NC(=O)c2ccc(CN1CCN(C)CC1)cc2)cc3Nc5nccc(c4cccnc4)n5" # imatinib
puts "Searching CID(s) for SMILES, #{smiles} ..."
cid = PubChemQuery.query_by_smiles(smiles)

# Guard against both nil and an empty Array before using the first hit.
if cid && !cid.empty?
  puts "CID found: #{cid[0]}"
  filename = cid[0] + ".png"
  puts "Writing image to #{filename} ..."
  PubChem.write_image(cid[0], filename)
else
  puts "No CID for #{smiles} was found."
end
This produces an image of imatinib called 5291.png in your working directory.
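When a query returns several CIDs, the same idea extends to saving one image per hit. The sketch below is only an illustration: write_images is a hypothetical helper, the CID strings are placeholders, and the writer is a stand-in that records calls so the example runs without network access or pubchem.rb.

```ruby
# Hypothetical helper: writes one image per CID and returns the filenames.
# The writer argument is any object responding to call(cid, filename).
def write_images(cids, writer)
  (cids || []).map do |cid|
    filename = "#{cid}.png"
    writer.call(cid, filename)
    filename
  end
end

# Stand-in writer that only records what it was asked to do.
calls = []
recorder = lambda { |cid, filename| calls << [cid, filename] }

# Placeholder CID strings, not real PubChem results.
p write_images(["111", "222"], recorder)
p calls
```

With pubchem.rb loaded, the writer could instead be lambda { |cid, filename| PubChem.write_image(cid, filename) }, which matches the call used in the script above.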

Wrapping Up
As you can see, we're just scratching the surface. The approach outlined here offers nearly unlimited possibilities for repackaging PubChem's own content, and mashing this content up with that of other sites. Happy hacking!

