From Famine to Feast: A Bumper Crop of Free Chemistry Databases
"Until PubChem came on the scene, the state of chemoinformatics compared to bioinformatics was 20 years behind," says Christopher Lipinski, who formulated the eponymous rule-of-five criteria for drug bioavailability.
-Monya Baker, Nature Reviews Drug Discovery
The number of free chemistry databases on the Web just keeps growing. A recent Depth-First article discussed twelve of them. It turns out that Chembiogrid from Indiana University maintains a list of forty free chemistry databases, most of which are alive and well.
As this trend continues, the need for database standards will become painfully obvious. Not only will interoperable infrastructure technologies and user interface standards need to be developed, but thorny intellectual property issues including access, chain of title, and digital rights will need to be resolved. However, the most immediate need is much more down-to-earth: to involve chemists with the growing number of free alternatives to the chemical information monopoly they've come to rely on.
Twelve Free Chemistry Databases
Just two years ago, trying to find free online chemistry databases was an exercise in futility. Now, they're sprouting up all over the Web like wildflowers after a wet Spring. What follows is a far-from-complete roundup of some of the more interesting places to start your chemical search.
PubChem- The granddaddy of all free chemistry databases. Search over 8 million compounds by a variety of criteria. Although some PubChem records are linked into the primary literature through MeSH, most are not. But this doesn't seem to be PubChem's true calling. Instead, PubChem may well evolve into the world's largest online collection of molecular data sheets. Increasingly, the other databases in this list are cross-referencing their entries into PubChem. PubChem's entire database can be downloaded by FTP. CAS Registry are correct to see PubChem as the first real competition they've had in decades.
ZINC- A free database of commercially-available compounds for virtual screening. Search over 4.6 million compounds by structure, IUPAC name, InChI, and a host of calculated properties.
eMolecules- Google for molecules. With a simple interface and super fast search engine, eMolecules augments PubChem with other information sources, including specialty chemical catalogs. Although eMolecules' emphasis seems to be on commercially-available compounds, it's only possible to get a link directly into a supplier's online catalog for a limited number of molecules. Most of the links are to PubChem records. For this reason, I don't find eMolecules very useful in its current form. If you remember something called "Chmoogle", this is the same service (moral: don't mess with Google).
CHEBI- "A freely available dictionary of molecular entities focused on ‘small’ chemical compounds." CHEBI draws its information from two main sources: Integrated Relational Enzyme Database of the EBI and the Kyoto Encyclopedia of Genes and Genomes. Find out what proteins a molecule has been associated with and in what context. Provides cross-links to CAS registry numbers, Beilstein registry numbers, and Gmelin registry numbers.
NIST Chemistry WebBook- Physical data (thermochemical, thermophysical, and ion energetics) for mostly organic compounds. Search by formula, structure, CAS number, and IUPAC name.
BioCyc- A collection of about 3,500 compounds involved as enzyme substrates, products, inhibitors, and activators. On accepting a license agreement, the entire database can be freely downloaded in Chemical Markup Language format.
ChemExper- Find a supplier for your specialty chemical needs. Search by structure, name, molecular formula, and CAS number. After finding your compound, get an offer from one or more suppliers. I can't vouch for how this works in practice, but it sounds like a good idea.
Compendium of Pesticide Common Names- More than 1,100 commonly-used pesticides. Compounds are located by browsing indexed lists (IUPAC name, CAS number, and trade name) rather than searching. Each entry lists, among other pieces of information, a chemical structure and sub-classifications (repellents, antifeedants, synergists, etc.).
NMRShiftDB- Organic structures and their nuclear magnetic resonance (nmr) chemical shifts. NMRShiftDB contains chemical shift data for over 22,000 organic compounds and 19,000 spectra. Records can be searched by structure, chemical shift and nucleus. NMRShiftDB is truly open; it can be accessed programmatically and the source code for the software that runs the online database can be freely downloaded. Individual users can submit their own spectral shifts for peer review and subsequent inclusion into the database.
Chemical Structure Lookup Service (CSLS)- An address book for chemical structures. If you've ever used Metacrawler, then you'll recognize the idea behind CSLS, which is to aggregate several free chemistry databases. Search over 27 million molecules by IUPAC name, InChI, structure, SMILES, and a variety of molecular identifiers. Your results set will contain links into specific databases that host the molecules you find. The user interface isn't just unfriendly - it's downright antisocial. But if you can get past this, CSLS may well be one of the most useful services in this list.
DrugBank- Combines detailed drug data with comprehensive drug target information. Search over 4,300 drugs by trade name, SMILES, and InChI. Each record contains information on target of action, therapeutic indication, medications the drug is an ingredient of, and trade names. Searches can be limited to only approved drugs or experimental drugs. Both the concept and interface to this service are well thought-out.
Wikipedia- Wikipedia? Yes, Wikipedia. Wikipedia offers several kinds of chemical information produced by a knowledgeable, all-volunteer army. Looking for information on organic compounds? Consider this datasheet on morphine as an example. For those interested in synthesis, Wikipedia is increasingly being used to collaboratively author short reviews on the topic. Search capabilities are currently limited to text and don't appear to work with IUPAC names or CAS numbers. Where this quintessential disruptive technology and its offspring end up taking chemical publishing is unclear, but the ride will be spectacular.
Hacking PubChem: Direct Access with FTP
A previous article in the Hacking PubChem series pointed out that the entire PubChem database can be downloaded via FTP. This article shows how simple tools written in Ruby can be used to efficiently process the massive amount of data on PubChem's FTP-server.
Prerequisites
The only software you'll need for this tutorial is Ruby.
Organization of PubChem's FTP-Server
PubChem is a big database. To deal with its size, the FTP-server spreads its contents over about 950 files. Each file covers a contiguous block of Compound Identification Numbers (CIDs), and the block size appears to be set at 10,000 [Now 25,000, see below]. In some of the files I've examined, the actual number of compounds in a given block was less than 10,000. The root directory containing the files can be accessed here.
Compression Saves the Day
For storage and transmission efficiency, PubChem's SDF files are compressed using the GZip algorithm, giving files that typically range in size from five to seven megabytes. Compression ratios for the files I've examined are about 10:1. I'm calling these files "SDFGZ" files, and they have the extension *.sdf.gz.
A back of the envelope calculation, based on 950 files with an average size of 6 MB and a compression ratio of 10:1, gives an approximate storage requirement of 57 GB for the uncompressed PubChem database. Although storing this much data is feasible with today's hardware, there are many better uses for storage space. This is especially true if only a few fields of the PubChem database are of interest.
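The estimate above can be reproduced in a few lines of Ruby. This is only a sketch of the arithmetic: the constant and method names are illustrative, and the three inputs are the rough figures quoted in the text, not measured values.

```ruby
# Rough figures quoted in the text; none of them is a measured value.
FILE_COUNT = 950            # approximate number of SDFGZ files
AVERAGE_COMPRESSED_MB = 6   # average size of one compressed file
COMPRESSION_RATIO = 10      # uncompressed size : compressed size

# Estimated uncompressed size in GB (using 1 GB = 1000 MB).
def uncompressed_size_gb(file_count, average_mb, ratio)
  file_count * average_mb * ratio / 1000.0
end

puts uncompressed_size_gb(FILE_COUNT, AVERAGE_COMPRESSED_MB, COMPRESSION_RATIO)
```

Changing any one input shows how sensitive the estimate is; for example, an average file size of 7 MB pushes the total past 66 GB.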
Setting Up
You'll need to download some SDFGZ data. This tutorial uses the file containing CIDs 9540001-9550000. [Note: PubChem recently increased the number of compounds in each sdfgz file to 25,000. This means that the link to the file no longer works. Instead, choose a file from here.] Put this file in your working directory.
A Short Library
Create a file called sdfgz.rb containing the following code:
require 'zlib'

# A simple splitter for *.sdf.gz files available
# from PubChem's FTP-server.
class SDFGZSplitter
  @@stop = "$$$$\n"
  @@blank = ""

  # Configures this SDFGZSplitter using the <tt>IO</tt>
  # object <tt>io</tt>.
  def initialize(io)
    @gzip = Zlib::GzipReader.new(io)
  end

  # Yield a sequence of SDFile records.
  def each_record
    record = get_record
    while record != @@blank
      yield record
      record = get_record
    end
  end

  # Gets the next record, or an empty string if
  # none is available.
  def get_record
    line = read_line
    record = [line]
    while !(@@stop.eql?(line) || nil == line)
      line = read_line
      record << line
    end
    record.join
  end

  private

  # Reads the next line in the SDFGZ file, or returns
  # nil at the end of the file.
  def read_line
    begin
      line = @gzip.readline
    rescue EOFError
      return nil
    end
    line
  end
end

# Utility class for getting data out of a SDFile record.
class Extractor
  # Gets the data from <tt>record</tt> associated with
  # <tt>key</tt>.
  def self.extract_data(record, key)
    record.match(/> <#{key}>\n(.+)\n/)
    $1
  end

  # Gets the molfile for <tt>record</tt>. Note that the molfile
  # terminator is "M  END", with two spaces after the "M".
  def self.extract_molfile(record)
    record.match(/M  END$/).pre_match + "M  END\n"
  end
end

The SDFGZSplitter class uses Ruby's built-in zlib library to read SDFGZ files as a stream, without first inflating them to disk. The method each_record is a Ruby iterator, one of the strangely cool things that makes Ruby the language it is. The iterator's job is to allow retrieval of each SDFGZ record individually, until all records have been retrieved.
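The streaming idea can be tried without downloading anything. The sketch below is a separate, minimal illustration rather than the library itself: it gzips two made-up records in memory, then walks the compressed stream one record at a time, splitting on the "$$$$" delimiter line. The names SAMPLE_SDF, each_sdf_record, and demo_cids are hypothetical, the sample records are not real PubChem data, and Zlib.gzip assumes Ruby 2.4 or later.

```ruby
require 'zlib'
require 'stringio'

# Two tiny, made-up SDFile records (not real PubChem data).
SAMPLE_SDF = <<~SDF
  fake molfile 1
  M  END
  > <PUBCHEM_COMPOUND_CID>
  1

  $$$$
  fake molfile 2
  M  END
  > <PUBCHEM_COMPOUND_CID>
  2

  $$$$
SDF

# Yields each record in a gzipped SDFile stream, one at a time.
# Only the current record is held in memory.
def each_sdf_record(io)
  Zlib::GzipReader.wrap(io) do |gz|
    record = String.new
    gz.each_line do |line|
      record << line
      if line == "$$$$\n"
        yield record
        record = String.new
      end
    end
  end
end

# Collects the CID field of every record in the sample.
def demo_cids
  cids = []
  compressed = StringIO.new(Zlib.gzip(SAMPLE_SDF))
  each_sdf_record(compressed) do |record|
    cids << record[/> <PUBCHEM_COMPOUND_CID>\n(.+)\n/, 1]
  end
  cids
end

puts demo_cids.inspect
```

Because the reader only needs an IO-like object, the same method works on a File opened in binary mode in place of the StringIO.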
Using the Library
As a test for the sdfgz library, let's scrape all PubChem CIDs and InChI identifiers from an SDFGZ file, and place the result into a new CSV file. Create the following code, either in a file to be run by ruby or in a terminal session using irb:
# On Ruby 1.9.2 and later, use: require './sdfgz'
require 'sdfgz'

# Open the compressed file in binary mode.
file = File.new('Compound_09540001_09550000.sdf.gz', 'rb')
splitter = SDFGZSplitter.new(file)
puts "parsing..."
File.open('dictionary.csv', 'w+') do |out|
  splitter.each_record do |record|
    cid = Extractor.extract_data(record, 'PUBCHEM_COMPOUND_CID')
    inchi = Extractor.extract_data(record, 'PUBCHEM_NIST_INCHI')
    out << "#{cid},\"#{inchi}\"\n"
  end
end

Running this test creates a (rather large) file called dictionary.csv in your working directory. Its contents consist of the following truncated output:
9540001,"InChI=1/C20H22N2O4/c1-13-7-5-10-16(14(13)2)22-20(26)15-8-3-4-9-17(15)21-18(23)11-6-12-19(24)25/h3-5,7-10H,6,11-12H2,1-2H3,(H,21,23)(H,22,26)(H,24,25)/p-1/fC20H21N2O4/h21-22H/q-1"
9540002,"InChI=1/C20H22N2O4/c1-13-7-5-10-16(14(13)2)22-20(26)15-8-3-4-9-17(15)21-18(23)11-6-12-19(24)25/h3-5,7-10H,6,11-12H2,1-2H3,(H,21,23)(H,22,26)(H,24,25)/f/h21-22,24H"
9540003,"InChI=1/C19H20N2O5/c1-26-16-8-3-7-15(12-16)21-19(25)13-5-2-6-14(11-13)20-17(22)9-4-10-18(23)24/h2-3,5-8,11-12H,4,9-10H2,1H3,(H,20,22)(H,21,25)(H,23,24)/p-1/fC19H19N2O5/h20-21H/q-1"
9540004,"InChI=1/C19H20N2O5/c1-26-16-8-3-7-15(12-16)21-19(25)13-5-2-6-14(11-13)20-17(22)9-4-10-18(23)24/h2-3,5-8,11-12H,4,9-10H2,1H3,(H,20,22)(H,21,25)(H,23,24)/f/h20-21,23H"
...

Many customizations of the above code are possible. For example, it would not be difficult to programmatically log into the PubChem FTP-server, download a file, and process it as shown. By parsing the SDFGZ filename, a program could even know which file contained a given CID. Because the SDFGZSplitter constructor takes a Ruby IO object, it's also feasible to process PubChem's SDFGZ files directly from the FTP-server, without downloading them beforehand. But that's a subject for another day.
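As a rough sketch of the filename idea, the method below computes which SDFGZ file should hold a given CID. It assumes the naming pattern seen in the single filename used in this tutorial (Compound_, then the first and last CID of the block, each zero-padded to eight digits) and a fixed block size; the method name sdfgz_filename is hypothetical, and the pattern should be checked against the current FTP listing before relying on it.

```ruby
# Returns the name of the SDFGZ file expected to contain +cid+.
# Assumption: blocks start at CID 1 and every file covers exactly
# +block_size+ consecutive CIDs, named as in this tutorial's example.
def sdfgz_filename(cid, block_size = 10_000)
  first = ((cid - 1) / block_size) * block_size + 1
  last = first + block_size - 1
  format('Compound_%08d_%08d.sdf.gz', first, last)
end

puts sdfgz_filename(9_540_001)
puts sdfgz_filename(9_550_000)
puts sdfgz_filename(1, 25_000)
```

The first two calls name the file used in this tutorial, since both CIDs fall in the same 10,000-compound block. Passing 25,000 as the block size matches the larger files PubChem switched to later.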
Summing Up
The PubChem FTP-server is a treasure trove of useful data that's available free of charge. Using simple tools like those discussed here, it's possible to generate a virtually infinite variety of customized views of this valuable resource. Many creative, and novel, applications are possible by combining the capabilities shown here with those of Open Source chemical informatics software, such as RCDK, and other Open data sources, such as NMRShiftDB.
Hacking PubChem: Free Speech or Free Beer?
Government information available from this site is within the public domain. Public domain information on the National Library of Medicine (NLM) Web pages may be freely distributed and copied. However, it is requested that in any subsequent use of this work, NLM be given appropriate acknowledgment.
This site also contains resources such as PubMed Central, Bookshelf, OMIM, and PubChem which incorporate material contributed or licensed by individuals, companies, or organizations that may be protected by U.S. and foreign copyright laws. All persons reproducing, redistributing, or making commercial use of this information are expected to adhere to the terms and conditions asserted by the copyright holder. Transmission or reproduction of protected items beyond that allowed by fair use (PDF) as defined in the copyright laws requires the written permission of the copyright owners.
Open Source licensing is nothing short of revolutionary. Of all of the things an Open Source license makes possible, perhaps the most far-reaching is the right of licensees to create and distribute derivative works. This is what separates "software that's free" ("free as in beer") from "Free Software" ("free as in speech"). A licensee that is not free to create and distribute derivative works has virtually no incentive to build on what the original creator has given away. Would you contribute your valuable time to improving something that you knew you could never use as you saw fit? This may sound like semantic hair-splitting, but it's far from it. None of the phenomenal progress made in Open Source software would have been possible without the basic rights to create and distribute derivative works.
PubChem's Copyright Disclaimer should give anyone familiar with Open Source licensing grounds to ponder. Apparently, NIH is telling its users that it doesn't have the authority to grant them the right to copy all PubChem content or distribute derivative works. But what parts of PubChem can these rights be granted for, if any? What parts of Pubchem are copyrighted, and therefore owned, by contributors? How can a user find out which parts of PubChem are subject to copyright claims by contributors?
It isn't too difficult to imagine a scenario in which PubChem requires those depositing data to agree to a copyright waiver. This waiver would simply grant PubChem users the sublicensable right to copy a depositor's content verbatim and to distribute derivative works based on it, royalty-free. The depositor would still retain any copyright they might want to assert outside of PubChem. If the depositor doesn't own these rights, or isn't willing to part with them, then that content would be rejected. This has been done for years in Open Source software projects and is being done increasingly with Creative Commons licenses for non-software intellectual property. Both approaches have strengths and weaknesses, and my aim is not to advocate either one. The point is simply that the idea is not new.
Maybe a copyright waiver isn't feasible. Regardless, PubChem could create a mechanism whereby content for which a contributor is asserting copyright claims can be identified as such and optionally avoided by its users.
While I'd never turn down free beer, and I'd always thank those offering, in the long run free speech is far more sustaining.
Hacking PubChem: Query by SMILES
Recently, I showed how a simple PubChem API could be built from a few lines of Ruby code. The API we created could retrieve a molfile and a 2-D molecular rendering given a PubChem compound ID (CID). In this tutorial, we'll see how a SMILES query mechanism can be added to the API, enabling CIDs to be retrieved from any valid SMILES string. We'll also see how to extend this capability to retrieving a 2-D image from PubChem by submitting a SMILES string.
Credits
The API that follows is based on the pubchem.rb file found in Chemruby by Tadashi Kadowaki and Nobuya Tanaka.
Defining the Problem
We want to create a PubChem API that returns an Array of CIDs given any valid SMILES string. The API will communicate with the publicly available molecular database PubChem using HTTP.
In some cases, PubChem associates more than one CID with a given molecular structure. For example, querying the SMILES string c1ccccc1 (benzene) finds both benzene and C-14 benzene. The software needs to handle these cases as well.
Prerequisites
The only thing you'll need for this tutorial is Ruby, preferably v1.8 or better.
Code
Create a file called query.rb in your working directory containing the following code:
require 'uri'
require 'net/http'

# A simple SMILES query for PubChem based on the file <tt>pubchem.rb</tt>,
# and originally part of Chemruby (http://rubyforge.org/project/chemruby).
# Distributed under Ruby's License.
#
# Copyright (C) 2005, 2006 KADOWAKI Tadashi <kado@kuicr.kyoto-u.ac.jp>
# TANAKA Nobuya <tanaka@kuicr.kyoto-u.ac.jp>
# APODACA Richard <r_apodaca@users.sf.net>
class PubChemQuery
  @@host="pubchem.ncbi.nlm.nih.gov"
  @@searchpath="/search/"
  @@query="PreQSrv.cgi"
  @@boundary="-----boundary-----"

  # Synthetic form data. Lifted from Chemruby <tt>pubchem.rb</tt>
  @@data = [
    @@boundary, "Content-Disposition: form-data; name=\"mode\"", "", "simplequery",
    @@boundary, "Content-Disposition: form-data; name=\"queue\"", "", "ssquery",
    @@boundary, "Content-Disposition: form-data; name=\"simple_searchdata\"", "", '%s',
    @@boundary, "Content-Disposition: form-data; name=\"simple_searchtype\"", "", "fs",
    @@boundary, "Content-Disposition: form-data; name=\"maxhits\"", "", '%s',
    @@boundary].join("\x0d\x0a")

  # Returns an <tt>Array</tt> of CIDs matching <tt>smiles</tt>. If no matches are found,
  # <tt>nil</tt> is returned.
  def self.query_by_smiles(smiles, maxhits = 100)
    form_response = post_form(smiles, maxhits)
    wait_response = process_wait_page(form_response)
    url = get_report_url(wait_response)
    url ? process_report(url) : nil
  end

  private

  # Returns the response to posting the initial search form.
  def self.post_form(smiles, maxhits)
    response = ''
    Net::HTTP.start(@@host, 80) do |http|
      response = http.post(@@searchpath + @@query, @@data % [smiles, maxhits],
        {
          'Content-Type' => "multipart/form-data; boundary=#{@@boundary}",
          'Referer' => "http://pubchem.ncbi.nlm.nih.gov/search/"
        }).body
    end
    response
  end

  # Processes the wait page displayed after submission of the search form.
  def self.process_wait_page(body)
    response = ''
    if m = /url="([^"]+)"/.match(body)
      Net::HTTP.start(@@host, 80) do |http|
        response = http.get(@@searchpath + m[1]).body
      end
    end
    response
  end

  # Returns the URL, as a <tt>String</tt>, to the search report, given the specified
  # body of the wait page.
  def self.get_report_url(body)
    url = nil
    Net::HTTP.start(@@host, 80) do |http|
      while /setTimeout\('document.location.replace\("([^"]+)"\);', (\d+)\)/ =~ body do
        sleep($2.to_f/100)
        response = http.get(URI.parse($1).to_s)
        body = response.body
        url = response['location']
      end
    end
    url
  end

  # Extracts CIDs from the search report contained at <tt>url</tt>.
  def self.process_report(url)
    cid = Array.new
    Net::HTTP.start(@@host, 80) do |http|
      # text format
      url.sub!(/cmd=Select\+from\+History/, 'cmd=Text&dopt=Brief')
      http.get(url).body.scan(/\d+: CID: (\d+)/).each do |id|
        cid.push(id[0])
      end
    end
    cid
  end
end

You might want to manually submit a SMILES query to PubChem as a refresher on how this webapp works. Briefly, the contents of the SMILES search field are read, and a wait screen appears, typically for three seconds. You are then redirected to a results report page containing thumbnail images of the hits and their CIDs.
The PubChemQuery class contains a single public class method, query_by_smiles. This method builds a form to submit, based on the supplied SMILES string and optional maxhits argument. It then waits until PubChem indicates that the query is about to finish processing. The URL for the results report page is then parsed. If a nonempty URL was found, then its page is loaded, and CIDs are scraped. Otherwise, the method returns nil.
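The two scraping steps can be exercised offline, which makes the flow easier to follow. The sketch below applies the same kinds of patterns to hand-written sample strings. WAIT_PAGE and REPORT are made up to fit the patterns and are not captured PubChem responses, and the helper names next_poll and scrape_cids are hypothetical.

```ruby
# Hypothetical wait-page fragment, written to fit the polling pattern.
WAIT_PAGE = %q{setTimeout('document.location.replace("http://example.org/report?id=1");', 3000)}

# Hypothetical text report, written to fit the CID pattern.
REPORT = "1: CID: 241\n2: CID: 123456\n"

POLL_PATTERN = /setTimeout\('document\.location\.replace\("([^"]+)"\);', (\d+)\)/
CID_PATTERN = /\d+: CID: (\d+)/

# Returns the next URL to fetch and the delay requested by the page,
# or nil once the page no longer asks the browser to poll.
def next_poll(body)
  match = POLL_PATTERN.match(body)
  match && { url: match[1], delay_ms: match[2].to_i }
end

# Returns every CID found in a text report, as an Array of Strings.
def scrape_cids(report)
  report.scan(CID_PATTERN).flatten
end

poll = next_poll(WAIT_PAGE)
puts "poll #{poll[:url]} after #{poll[:delay_ms]} ms"
puts scrape_cids(REPORT).inspect
```

Keeping the pattern matching in small methods like these means the scraping logic can be checked against saved pages whenever the site's markup changes, without sending a query each time.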
Usage
Using PubChemQuery consists of invoking its class method query_by_smiles. You can do so either via the Ruby interpreter (ruby), or preferably through Interactive Ruby (irb).
# On Ruby 1.9.2 and later, use: require './query'
require 'query'

smiles = "c1cccc(Cl)c1(Cl)" # 1,2-dichlorobenzene
puts "Searching CID(s) for SMILES, #{smiles} ..."
cid = PubChemQuery.query_by_smiles(smiles)
puts cid # => 7239

Layering Complexity
We can combine the SMILES query API discussed here with the molfile and image retrieval discussed in the earlier Hacking PubChem article.
Let's say you'd like to download PubChem's 2-D image of imatinib (Gleevec) by submitting its SMILES string. Copy the file named pubchem.rb, provided in the original PubChem tutorial, into your working directory. Now you can programmatically download imatinib's 2-D image from PubChem based only on a SMILES string, for example:
require 'pubchem'
require 'query'

smiles="Cc3ccc(NC(=O)c2ccc(CN1CCN(C)CC1)cc2)cc3Nc5nccc(c4cccnc4)n5" #imatinib
puts "Searching CID(s) for SMILES, #{smiles} ..."
cid = PubChemQuery.query_by_smiles(smiles)
if cid
  puts "CID found: #{cid[0]}"
  filename = cid[0] + ".png"
  puts "Writing image to #{filename} ..."
  PubChem.write_image(cid[0], filename)
else
  puts "No CID for #{smiles} was found."
end

This produces an image of imatinib called 5291.png in your working directory:

Wrapping Up
As you can see, we're just scratching the surface. The approach outlined here offers nearly unlimited possibilities for repackaging PubChem's own content, and mashing this content up with that of other sites. Happy hacking!

