Create Your Own PubChem Datasets: Exporting Results As SD Files

Posted by Rich Apodaca Tue, 13 Nov 2007 21:43:00 GMT

Recently, I needed to create a subset of the PubChem database in Structure Data File (SD File) format. Although it's far from obvious how to do this, the capability does exist. In this article, I'll give a step-by-step procedure for creating custom datasets in SD File format from arbitrary PubChem structure queries.

Create and Execute the Query

Let's say we want to create a dataset in SD File format containing all N-Boc-protected piperidines registered in PubChem.

From the main PubChem site, choose the Structure Search link. Then click the "Sketch" button.

Next, draw your query structure, here an N-Boc-protected piperidine, in the 2D structure editor.

Then click the "Done" button.

Before starting the query (by clicking the "Search" button), be sure to select the "Substructure" option under "Search Type."

Exporting the Results

You should now be looking at a screen containing the first few hits of a 7700+ hitset. But how do we export these results in SD File format?

Next to a field labeled "Display", you'll see a drop-down box containing several different options. Choose the one labeled "PubChem Download."

You'll be redirected to a download page from which you can select output formats, including SDF, or SD File. You can also select a compression type (datasets of even 2000 records can be quite large uncompressed). For this example, we'll select SDF format with GZip compression.

Clicking on the "Download" button takes us to a status page that eventually informs us when our download has been processed. You should then get a "Save File" dialog or something similar. If not, you should see a link to the compressed SD file.

Downloading the results file completes the process.
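
As a quick sanity check on the download, you can count the records in the compressed file and compare the total with the hit count PubChem reported. Here's a minimal sketch, assuming only that each SD record ends with a "$$$$" line; count_sd_records is a hypothetical helper name, not something PubChem or Rubidium provides.

```ruby
require 'zlib'

# Hypothetical helper: counts the records in a gzipped SD file by
# counting the "$$$$" lines that terminate each record.
def count_sd_records(path)
  count = 0
  Zlib::GzipReader.open(path) do |gzip|
    gzip.each_line { |line| count += 1 if line.chomp == '$$$$' }
  end
  count
end
```

Call it with the path to whatever file your browser saved. The file is read as a stream, so nothing needs to be unzipped first.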

Parsing SD Files with Ruby and Rubidium

Posted by Rich Apodaca Mon, 12 Nov 2007 16:27:00 GMT

Reading SD files is a bread-and-butter cheminformatics operation. At a minimum, a cheminformatics toolkit needs to parse the individual entries of an SD file, and provide access to the embedded molfile and data hash for each.
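
To make those two pieces concrete, here's a rough pure-Ruby sketch of splitting a single SD entry into its molfile and data hash. It is not Rubidium's implementation or API: split_sd_entry is a hypothetical helper, the toy entry is made up, and the sketch assumes data headers of the form "> <NAME>" followed by value lines and a blank line.

```ruby
# Hypothetical helper (not part of Rubidium): splits one SD entry into
# its molfile and a hash of data items.
def split_sd_entry(entry)
  # Everything up to and including the "M  END" line is the molfile.
  head, terminator, tail = entry.partition(/^M  END\r?\n/)

  data = {}
  key = nil
  tail.each_line do |line|
    line = line.chomp
    if line =~ /\A>.*<([^>]+)>/
      key = $1                 # start of a new data item
      data[key] = []
    elsif line == '$$$$' || line.empty?
      key = nil                # a blank line or terminator ends the item
    elsif key
      data[key] << line        # value line for the current item
    end
  end

  { molfile: head + terminator,
    data: data.transform_values { |lines| lines.join("\n") } }
end

# Toy entry for illustration only.
entry = "demo\n  sketch\n\n  0  0  0  0  0  0  0  0  0  0999 V2000\n" \
        "M  END\n> <PUBCHEM_COMPOUND_CID>\n1\n\n> <NOTE>\nline one\nline two\n\n$$$$\n"
parsed = split_sd_entry(entry)
puts parsed[:data]['PUBCHEM_COMPOUND_CID']
```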

Recent articles have introduced Rubidium, a Ruby cheminformatics scripting environment. The Rubidium team now announces the release of Rubidium-0.1.1, which, among other features, introduces the ability to parse SD files.

Prerequisites

Rubidium is designed to run on JRuby. Installing JRuby is straightforward on Unix-like systems. First, download the JRuby-1.1b1 binary release. Then unpack the archive to your directory of choice. Set the $JRUBY_HOME and $JAVA_HOME environment variables. Finally, add $JRUBY_HOME/bin to your path.

Installing Rubidium-0.1.1

Generally speaking, it should be possible to install Rubidium with a one-line command to RubyGems:

$ jruby -S gem install rbtk

Unfortunately, at the time of this writing, I was getting the mysterious RubyGems 404 error from the RubyForge remote repository:

$ jruby -S gem install rbtk
Select which gem to install for your platform (java)
 1. rbtk 0.1.1 (java)
 2. rbtk 0.1.0 (java)
 3. Skip this gem
 4. Cancel installation
> 1
ERROR:  While executing gem ... (OpenURI::HTTPError)
    404 Not Found

This appears to affect only certain RubyGems on RubyForge - possibly only those with multiple versions. It seems to be an error on the RubyForge server that occasionally appears and then disappears.

As a workaround, you can download the Rubidium gem and install it manually:

$ jruby -S gem install tmp/rbtk-0.1.1-jruby.gem

Because Rubidium-0.1.1 introduces an Active Support dependency, you will need to install that library before installing Rubidium:

$ jruby -S gem install tmp/rbtk-0.1.1-jruby.gem
ERROR:  While executing gem ... (RuntimeError)
    Error instaling tmp/rbtk-0.1.1-jruby.gem:
        rbtk requires activesupport >= 1.4.2
$ jruby -S gem install activesupport
Successfully installed activesupport-1.4.4
Installing ri documentation for activesupport-1.4.4...
Installing RDoc documentation for activesupport-1.4.4...
$ jruby -S gem install tmp/rbtk-0.1.1-jruby.gem
Successfully installed rbtk, version 0.1.1
Installing ri documentation for rbtk-0.1.1-jruby...
Installing RDoc documentation for rbtk-0.1.1-jruby...

The RubyForge 404 issue may well be resolved by the time you read this article, so try jruby -S gem install rbtk first.

Parsing an SD File

Let's say we'd like to extract all InChIs from a PubChem dataset. If you don't have one handy, a compilation of about 2000 PubChem benzodiazepines has been deposited on RubyForge.

With our unzipped datafile in our working directory, we can now test the SD File parser by saving the following library to a file called parse.rb:

require 'rubygems'
gem 'rbtk'
require 'rubidium/sdf'

def parse_sd filename
  p = Rubidium::SDF::Parser.new File.new(filename)

  p.each do |entry|
    puts "InChI: #{entry['PUBCHEM_NIST_INCHI']}"
  end
end

The library can then be tested with jirb:

$ jirb
irb(main):001:0> require 'parse'
=> true
irb(main):002:0> parse_sd 'pubchem_benzodiazepine_20071110.sdf'
InChI: InChI=1/C16H12Cl2N2O/c1-20-14-7-6-12(18)8-13(14)16(19-9-15(20)21)10-2-4-11(17)5-3-10/h2-8H,9H2,1H3

[truncated]

RSpec and Behavior-Driven Development

If you check out the Rubidium source distribution, you'll notice that the SD parser library is tested with RSpec, the BDD framework for Ruby. Ultimately, all components of Rubidium will be tested and documented this way.

Acknowledgments

Rubidium's new SD file parser was written by Moses Hohman. It was kindly donated by Collaborative Drug Discovery, who have built their drug discovery application using Ruby on Rails.

Future Directions

One problem in working with SD files is pinpointing encoding errors. A parser should not only raise an exception, but point to a line number and identify offending text to aid debugging. Rubidium's SD parser will eventually incorporate these enhancements.
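
As a rough illustration of what that kind of error reporting could look like (this is not Rubidium's code), here's a sketch that checks a single thing: that the fourth line of each record, the counts line of a V2000 molfile, starts with two three-character integer fields. SDParseError and check_counts_lines are hypothetical names invented for this example.

```ruby
require 'stringio'

# Hypothetical error class that carries the offending line number and text.
class SDParseError < StandardError
  attr_reader :line_number, :text

  def initialize(message, line_number, text)
    super("#{message} at line #{line_number}: #{text.inspect}")
    @line_number = line_number
    @text = text
  end
end

# Hypothetical check: the 4th line of every record must begin with the
# atom and bond counts, each a right-justified 3-character integer.
def check_counts_lines(io)
  line_in_record = 0
  io.each_line.with_index(1) do |line, line_number|
    line_in_record += 1
    if line.chomp == '$$$$'
      line_in_record = 0       # next line starts a new record
    elsif line_in_record == 4 && line !~ /\A[ \d]{2}\d[ \d]{2}\d/
      raise SDParseError.new('Malformed counts line', line_number, line.chomp)
    end
  end
  true
end

# Toy data for illustration only; the second record is deliberately broken.
good = "demo\n  sketch\n\n  1  0  0  0  0  0  0  0  0  0999 V2000\n" \
       "    0.0000    0.0000    0.0000 C   0  0\nM  END\n$$$$\n"
bad = good + "demo2\n  sketch\n\nxx garbage\nM  END\n$$$$\n"

begin
  check_counts_lines(StringIO.new(bad))
rescue SDParseError => e
  puts e.message
end
```

A real parser would validate far more than the counts line, but the idea is the same: track the line number while reading and attach it, along with the offending text, to the exception.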

Because Rubidium runs on JRuby, performance gains may be achievable by re-writing select portions in Java.

Parsing SD files is only the beginning of the story. Many cheminformatics applications need a convenient, fast, and robust method for writing molfiles. This is also something Rubidium will attempt to provide.

If your company or organization is curious about Ruby and cheminformatics, give Rubidium a try. Rubidium is licensed under the permissive MIT License to make collaboration as simple as possible.

Hacking PubChem: Direct Access with FTP

Posted by Rich Apodaca Fri, 29 Sep 2006 05:59:00 GMT

A previous article in the Hacking PubChem series pointed out that the entire PubChem database can be downloaded via FTP. This article shows how simple tools written in Ruby can be used to efficiently process the massive amount of data on PubChem's FTP-server.

Prerequisites

The only software you'll need for this tutorial is Ruby.

Organization of PubChem's FTP-Server

PubChem is a big database. To deal with its size, the FTP-server spreads its contents over about 950 files. Each file covers a contiguous range of Compound Identification Numbers (CIDs); the size of that range appears to be set at 10,000 [Now 25,000, see below]. In some of the files I've examined, the actual number of compounds in a given block was less than 10,000. The root directory containing the files can be accessed here.

Compression Saves the Day

For storage and transmission efficiency, PubChem's SDF files are compressed using the GZip algorithm, giving files that typically range in size from five to seven megabytes. Compression ratios for the files I've examined are about 10:1. I'm calling these files "SDFGZ" files, and they have the extension *.sdf.gz.

A back-of-the-envelope calculation, based on 950 files with an average compressed size of 6 MB and a compression ratio of 10:1, gives an approximate storage requirement of 57 GB for the uncompressed PubChem database. Although storing this much data is feasible with today's hardware, there are many better uses for storage space. This is especially true if only a few fields of the PubChem database are of interest.
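
For the record, here's that estimate spelled out in Ruby, using the rounded figures above and taking 1 GB as 1000 MB. The method name is just for illustration.

```ruby
# Rough estimate of the uncompressed size in GB (1 GB taken as 1000 MB).
def uncompressed_size_gb(file_count, avg_compressed_mb, compression_ratio)
  file_count * avg_compressed_mb * compression_ratio / 1000.0
end

puts uncompressed_size_gb(950, 6, 10)  # prints 57.0
```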

Setting Up

You'll need to download some SDFGZ data. This tutorial uses the file containing CIDs 9540001-9550000. [Note: PubChem recently increased the number of compounds in each sdfgz file to 25,000. This means that the link to the file no longer works. Instead, choose a file from here.] Put this file in your working directory.

A Short Library

Create a file called sdfgz.rb containing the following code:

require 'zlib'

# A simple splitter for *.sdf.gz files available
# from PubChem's FTP-server.
class SDFGZSplitter
  @@stop = "$$$$\n"
  @@blank = ""

  # Configures this SDFGZSplitter using the <tt>IO</tt>
  # object <tt>io</tt>.
  def initialize(io)
    @gzip = Zlib::GzipReader.new(io)
  end

  # Yield a sequence of SDFile records.
  def each_record
    record = get_record

    while record != @@blank
      yield record
      record = get_record
    end
  end

  # Gets the next record, or an empty string if
  # none is available.
  def get_record
    line = read_line
    record = [line]

    while !(@@stop.eql?(line) || nil == line)
      line = read_line
      record << line
    end

    record.join
  end

  private

  # Reads the next line in the SDFGZ file.
  def read_line
    begin
      line = @gzip.readline
    rescue EOFError
      return nil
    end

    line
  end
end

# Utility class for getting data out of a SDFile record.
class Extractor
  # Gets the data from <tt>record</tt> associated with
  # <tt>key</tt>.
  def self.extract_data(record, key)
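    # Escape the key so any regex metacharacters in it match literally.
    match = record.match(/> <#{Regexp.escape(key)}>\n(.+)\n/)
    match && match[1]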
  end

  # Gets the molfile for <tt>record</tt>.
  def self.extract_molfile(record)
    record.match(/M  END$/).pre_match + "M  END\n"
  end
end

The SDFGZSplitter class uses Ruby's built-in zlib library to read SDFGZ files as a stream, without first inflating them to disk. The method each_record is a Ruby iterator, one of the strangely cool things that makes Ruby the language it is. The iterator's job is to allow retrieval of each SDFGZ record individually, until all records have been retrieved.
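
If the iterator idea is new to you, here's a stripped-down, self-contained sketch of the same pattern, run against gzipped data held in memory rather than a file from PubChem. It is not a replacement for SDFGZSplitter: each_sd_record is a hypothetical helper, the two toy records are made up, and Zlib.gzip requires Ruby 2.4 or later.

```ruby
require 'zlib'
require 'stringio'

# Hypothetical helper: yields each SD record from a gzipped IO object,
# one record at a time, as a single string.
def each_sd_record(io)
  gzip = Zlib::GzipReader.new(io)
  record = String.new
  gzip.each_line do |line|
    record << line
    if line == "$$$$\n"
      yield record             # hand the finished record to the block
      record = String.new
    end
  end
  # Yield any trailing record that lacks a "$$$$" terminator.
  yield record unless record.strip.empty?
ensure
  gzip.close if gzip
end

# Toy data: two fake records, gzipped in memory.
sdf = "mol A\nM  END\n$$$$\nmol B\nM  END\n$$$$\n"
records = []
each_sd_record(StringIO.new(Zlib.gzip(sdf))) { |record| records << record }
puts records.length
```

The caller never sees the loop or the gzip stream. It only supplies a block, and the method decides when to call it.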

Using the Library

As a test for the sdfgz library, let's scrape all PubChem CIDs and InChI identifiers from an SDFGZ file, and place the result into a new CSV file. Create the following code, either in a file to be run by ruby or in a terminal session using irb:

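# Load sdfgz.rb from the working directory. (Ruby 1.9.2 and later no
# longer search the current directory for a bare require 'sdfgz'.)
require './sdfgz'

input = File.new('Compound_09540001_09550000.sdf.gz', 'rb')
splitter = SDFGZSplitter.new(input)

puts "parsing..."

# The block variable gets its own name so it doesn't clash with the
# input file above.
File.open('dictionary.csv', 'w') do |csv|
  splitter.each_record do |record|
    cid = Extractor.extract_data(record, 'PUBCHEM_COMPOUND_CID')
    inchi = Extractor.extract_data(record, 'PUBCHEM_NIST_INCHI')

    # Quote the InChI because it contains commas.
    csv << "#{cid},\"#{inchi}\"\n"
  end
end

input.close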

Running this test creates a (rather large) file called dictionary.csv in your working directory. Its contents consist of the following truncated output:

9540001,"InChI=1/C20H22N2O4/c1-13-7-5-10-16(14(13)2)22-20(26)15-8-3-4-9-17(15)21-18(23)11-6-12-19(24)25/h3-5,7-10H,6,11-12H2,1-2H3,(H,21,23)(H,22,26)(H,24,25)/p-1/fC20H21N2O4/h21-22H/q-1"
9540002,"InChI=1/C20H22N2O4/c1-13-7-5-10-16(14(13)2)22-20(26)15-8-3-4-9-17(15)21-18(23)11-6-12-19(24)25/h3-5,7-10H,6,11-12H2,1-2H3,(H,21,23)(H,22,26)(H,24,25)/f/h21-22,24H"
9540003,"InChI=1/C19H20N2O5/c1-26-16-8-3-7-15(12-16)21-19(25)13-5-2-6-14(11-13)20-17(22)9-4-10-18(23)24/h2-3,5-8,11-12H,4,9-10H2,1H3,(H,20,22)(H,21,25)(H,23,24)/p-1/fC19H19N2O5/h20-21H/q-1"
9540004,"InChI=1/C19H20N2O5/c1-26-16-8-3-7-15(12-16)21-19(25)13-5-2-6-14(11-13)20-17(22)9-4-10-18(23)24/h2-3,5-8,11-12H,4,9-10H2,1H3,(H,20,22)(H,21,25)(H,23,24)/f/h20-21,23H"

...

Many customizations of the above code are possible. For example, it would not be difficult to programmatically log into the PubChem FTP-server, download a file, and process it as shown. By parsing the SDFGZ filename, a program could even know which file contained a given CID. Because the SDFGZSplitter constructor takes a Ruby IO object, it's also feasible to process PubChem's SDFGZ files directly from the FTP-server, without downloading them beforehand. But that's a subject for another day.
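
The filename idea is easy to sketch. The code below assumes the naming pattern seen in this tutorial's file, Compound_<first CID>_<last CID>.sdf.gz. Both helper names are hypothetical, and the second filename in the example list is made up to show the lookup.

```ruby
# Hypothetical helper: returns the CID range encoded in an SDFGZ
# filename, or nil if the name doesn't follow the expected pattern.
def cid_range(filename)
  match = File.basename(filename).match(/\ACompound_(\d+)_(\d+)\.sdf\.gz\z/)
  return nil unless match

  match[1].to_i..match[2].to_i
end

# Hypothetical helper: finds the file whose range covers the given CID.
def file_for_cid(filenames, cid)
  filenames.find do |name|
    range = cid_range(name)
    range && range.cover?(cid)
  end
end

files = ['Compound_09540001_09550000.sdf.gz',
         'Compound_09550001_09560000.sdf.gz']  # second name is illustrative
puts file_for_cid(files, 9540003)
```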

Summing Up

The PubChem FTP-server is a treasure trove of useful data that's available free of charge. Using simple tools like those discussed here, it's possible to generate a virtually infinite variety of customized views of this valuable resource. Many creative and novel applications are possible by combining the capabilities shown here with those of Open Source chemical informatics software, such as RCDK, and other open data sources, such as NMRShiftDB.