Small Molecule 3D Coordinates From PubChem

Posted by Rich Apodaca Fri, 23 May 2008 14:53:00 GMT

The PubChem team has quietly introduced a new feature: 3D coordinates for many of the small molecules in its compound collection. To my knowledge, these coordinates are currently available only via FTP. From the README:

The data contained here consists of a theoretical 3D description of PubChem Compound records computed using the MMFF94s force field without coulombic terms, including MMFF charges. Each provided theoretical 3D conformer is not a stationary point on the hyper-potential surface (i.e., is not at a minimum energy). Rather, the theoretical 3D description is a low energy conformer selected from a conformer model (a theoretical description of the conformational flexibility of a chemical structure consisting of multiple 3D representations or poses sampled using an RMSD {root mean squared distance} threshold) describing energetically-accessible and (potentially) biologically relevant conformations of a chemical structure.

Not every PubChem Compound record will have a theoretical 3D description. Structures considered too large (containing more than 50 non-hydrogen atoms) or too flexible (containing more than 15 rotatable bonds) are excluded. Furthermore, chemical structures containing elements other than H, C, N, O, F, P, S, Cl, Br, and I are also excluded.

Generation of theoretical 3D descriptions of small molecules is computationally intensive. As such, some PubChem Compound records may be added at a later time.
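The exclusion rules quoted above are simple enough to write down as a predicate. What follows is only a sketch of those three rules as the README states them; the method name eligible_for_3d?, the constants, and the input shape (a list of element symbols plus a precomputed rotatable bond count) are hypothetical, and PubChem's actual pipeline may apply further criteria.

```ruby
# Hypothetical helper illustrating the README's exclusion rules.
# None of these names come from PubChem.
ALLOWED_ELEMENTS    = %w[H C N O F P S Cl Br I].freeze
MAX_HEAVY_ATOMS     = 50
MAX_ROTATABLE_BONDS = 15

# elements:        element symbols, one per atom (hydrogens included)
# rotatable_bonds: rotatable bond count, computed elsewhere
def eligible_for_3d?(elements, rotatable_bonds)
  heavy_atoms = elements.count { |symbol| symbol != 'H' }

  return false if heavy_atoms > MAX_HEAVY_ATOMS
  return false if rotatable_bonds > MAX_ROTATABLE_BONDS

  # Every element must be on the allowed list.
  (elements - ALLOWED_ELEMENTS).empty?
end

# Ethanol: 3 heavy atoms, 6 hydrogens, no disallowed elements.
puts eligible_for_3d?(%w[C C O H H H H H H], 0)
```

Counting rotatable bonds requires a cheminformatics toolkit, which is why the sketch takes that number as an argument rather than trying to compute it.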

(A few open source packages for generating 3D conformers are also available.)

Recently, Geoff Hutchison wrote in to suggest that a potentially useful new feature of Chempedia could be the ability to directly obtain 3D coordinates for a molecule of interest.

One very economical way to do that would be to use PubChem's 3D dataset. It would also be trivial to display these coordinates as a resizable Jmol applet, in analogy to Chempedia's recently-added 2D molecule resizing feature.

Of course, there are many other potential uses for the PubChem conformer dataset, especially when applied to Web applications.

Hacking PubChem: Direct Access with FTP

Posted by Rich Apodaca Fri, 29 Sep 2006 05:59:00 GMT

A previous article in the Hacking PubChem series pointed out that the entire PubChem database can be downloaded via FTP. This article shows how simple tools written in Ruby can be used to efficiently process the massive amount of data on PubChem's FTP-server.

Prerequisites

The only software you'll need for this tutorial is Ruby.

Organization of PubChem's FTP-Server

PubChem is a big database. To deal with its size, the FTP-server spreads its contents over about 950 files. Each file covers a contiguous range of Compound Identification Numbers (CIDs); the size of each range appears to be set at 10,000 [Now 25,000, see below]. In some of the files I've examined, the actual number of compounds in a given block was less than 10,000. The root directory containing the files can be accessed here.

Compression Saves the Day

For storage and transmission efficiency, PubChem's SDF files are compressed using the GZip algorithm, giving files that typically range in size from five to seven megabytes. Compression ratios for the files I've examined are about 10:1. I'm calling these files "SDFGZ" files, and they have the extension *.sdf.gz.

A back of the envelope calculation, based on 950 files with an average size of 6 MB and a compression ratio of 10:1, gives an approximate storage requirement of 57 GB for the uncompressed PubChem database. Although storing this much data is feasible with today's hardware, there are many better uses for storage space. This is especially true if only a few fields of the PubChem database are of interest.
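The estimate can be checked in a couple of lines of Ruby. This is just the arithmetic from the paragraph above; the method name is made up for illustration, and the inputs are the same rough figures (950 files, 6 MB average, 10:1 ratio), taking 1 GB as 1000 MB.

```ruby
# Hypothetical helper: estimated uncompressed size in GB,
# treating 1 GB as 1000 MB.
def uncompressed_size_gb(file_count, avg_compressed_mb, compression_ratio)
  file_count * avg_compressed_mb * compression_ratio / 1000.0
end

puts "about #{uncompressed_size_gb(950, 6, 10)} GB"
# prints: about 57.0 GB
```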

Setting Up

You'll need to download some SDFGZ data. This tutorial uses the file containing CIDs 9540001-9550000. [Note: PubChem recently increased the number of compounds in each SDFGZ file to 25,000. This means that the link to the file no longer works. Instead, choose a file from here.] Put this file in your working directory.

A Short Library

Create a file called sdfgz.rb containing the following code:

require 'zlib'

# A simple splitter for *.sdf.gz files available
# from PubChem's FTP-server.
class SDFGZSplitter
  @@stop = "$$$$\n"
  @@blank = ""

  # Configures this SDFGZSplitter using the <tt>IO</tt>
  # object <tt>io</tt>.
  def initialize(io)
    @gzip = Zlib::GzipReader.new(io)
  end

  # Yield a sequence of SDFile records.
  def each_record
    record = get_record

    while record != @@blank
      yield record
      record = get_record
    end
  end

  # Gets the next record, or an empty string if
  # none is available.
  def get_record
    line = read_line
    record = [line]

    while !(@@stop.eql?(line) || nil == line)
      line = read_line
      record << line
    end

    record.join
  end

  private

  # Reads the next line in the SDFGZ file, or returns
  # nil at end of file.
  def read_line
    @gzip.gets
  end
end

# Utility class for getting data out of a SDFile record.
class Extractor
  # Gets the data from <tt>record</tt> associated with
  # <tt>key</tt>, or nil if the key is not present.
  def self.extract_data(record, key)
    record[/> <#{Regexp.escape(key)}>\n(.+)\n/, 1]
  end

  # Gets the molfile for <tt>record</tt>, or nil if the
  # record has no "M  END" line.
  def self.extract_molfile(record)
    match = record.match(/^M  END$/)
    match && match.pre_match + "M  END\n"
  end
end

The SDFGZSplitter class uses Zlib::GzipReader from Ruby's standard library to read SDFGZ files as a stream, without first inflating them to disk. The method each_record is a Ruby iterator, one of the strangely cool things that make Ruby the language it is. The iterator's job is to hand back each SDFGZ record individually, until all records have been retrieved.
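To see the iterator idea in isolation, here is a minimal standalone sketch that streams records out of gzipped data held in memory. It is not the library above: each_sdf_record is a hypothetical method written for this illustration, and SAMPLE_SDF contains two made-up records rather than real PubChem data.

```ruby
require 'zlib'
require 'stringio'

# Yields each "$$$$"-terminated record from a gzipped SD file,
# one at a time. Returns an Enumerator when no block is given.
def each_sdf_record(io)
  return enum_for(:each_sdf_record, io) unless block_given?

  record = +''
  Zlib::GzipReader.new(io).each_line do |line|
    record << line
    next unless line == "$$$$\n"

    yield record
    record = +''
  end
end

# Two tiny made-up records standing in for real PubChem data.
SAMPLE_SDF = <<~SDF
  first molecule
  M  END
  > <PUBCHEM_COMPOUND_CID>
  1

  $$$$
  second molecule
  M  END
  > <PUBCHEM_COMPOUND_CID>
  2

  $$$$
SDF

gzipped = StringIO.new(Zlib.gzip(SAMPLE_SDF))

each_sdf_record(gzipped) do |record|
  cid = record[/> <PUBCHEM_COMPOUND_CID>\n(.+)\n/, 1]
  puts "CID #{cid}: #{record.lines.first.chomp}"
end
```

Only one record is held in memory at a time, which is the property that matters when the input is a multi-megabyte file rather than a two-record string.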

Using the Library

As a test for the sdfgz library, let's scrape all PubChem CIDs and InChI identifiers from an SDFGZ file, and place the result into a new CSV file. Create the following code, either in a file to be run by ruby or in a terminal session using irb:

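# Newer Rubies no longer search the current directory,
# so the path to sdfgz.rb is given explicitly.
require './sdfgz'

# Open in binary mode: the file is gzip-compressed.
input = File.new('Compound_09540001_09550000.sdf.gz', 'rb')
splitter = SDFGZSplitter.new(input)

puts "parsing..."

File.open('dictionary.csv', 'w') do |csv|
  splitter.each_record do |record|
    cid = Extractor.extract_data(record, 'PUBCHEM_COMPOUND_CID')
    inchi = Extractor.extract_data(record, 'PUBCHEM_NIST_INCHI')

    # InChI strings contain commas, so the field is quoted.
    csv << "#{cid},\"#{inchi}\"\n"
  end
end

input.close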

Running this test creates a (rather large) file called dictionary.csv in your working directory. Its first few lines look like this:

9540001,"InChI=1/C20H22N2O4/c1-13-7-5-10-16(14(13)2)22-20(26)15-8-3-4-9-17(15)21-18(23)11-6-12-19(24)25/h3-5,7-10H,6,11-12H2,1-2H3,(H,21,23)(H,22,26)(H,24,25)/p-1/fC20H21N2O4/h21-22H/q-1"
9540002,"InChI=1/C20H22N2O4/c1-13-7-5-10-16(14(13)2)22-20(26)15-8-3-4-9-17(15)21-18(23)11-6-12-19(24)25/h3-5,7-10H,6,11-12H2,1-2H3,(H,21,23)(H,22,26)(H,24,25)/f/h21-22,24H"
9540003,"InChI=1/C19H20N2O5/c1-26-16-8-3-7-15(12-16)21-19(25)13-5-2-6-14(11-13)20-17(22)9-4-10-18(23)24/h2-3,5-8,11-12H,4,9-10H2,1H3,(H,20,22)(H,21,25)(H,23,24)/p-1/fC19H19N2O5/h20-21H/q-1"
9540004,"InChI=1/C19H20N2O5/c1-26-16-8-3-7-15(12-16)21-19(25)13-5-2-6-14(11-13)20-17(22)9-4-10-18(23)24/h2-3,5-8,11-12H,4,9-10H2,1H3,(H,20,22)(H,21,25)(H,23,24)/f/h20-21,23H"

...

Many customizations of the above code are possible. For example, it would not be difficult to programmatically log into the PubChem FTP-server, download a file, and process it as shown. By parsing the SDFGZ filename, a program could even know which file contains a given CID. Because the SDFGZSplitter constructor takes a Ruby IO object, it's also feasible to process PubChem's SDFGZ files directly from the FTP-server, without downloading them beforehand. But that's a subject for another day.
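As a taste of the filename idea, here is a rough sketch. It assumes every file is named like the one used in this tutorial (Compound_<first CID>_<last CID>.sdf.gz); the constant and both method names are hypothetical, and the second file name in the example is made up purely to have something to search through.

```ruby
# Assumed naming pattern, based on the single file name
# used in this tutorial.
SDFGZ_NAME = /\ACompound_(\d+)_(\d+)\.sdf\.gz\z/

# Returns the CID range encoded in an SDFGZ file name,
# or nil if the name does not match the assumed pattern.
def cid_range(filename)
  match = SDFGZ_NAME.match(File.basename(filename))
  return nil unless match

  (match[1].to_i..match[2].to_i)
end

# Returns the first file name whose range covers +cid+, or nil.
def file_for_cid(filenames, cid)
  filenames.find do |name|
    range = cid_range(name)
    range && range.cover?(cid)
  end
end

names = [
  'Compound_09540001_09550000.sdf.gz',
  'Compound_09550001_09560000.sdf.gz' # made-up neighbor for the example
]

puts file_for_cid(names, 9540003)
# prints: Compound_09540001_09550000.sdf.gz
```

In practice the list of names would come from a directory listing of the FTP-server rather than a hard-coded array.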

Summing Up

The PubChem FTP-server is a treasure trove of useful data that's available free of charge. Using simple tools like those discussed here, it's possible to generate a virtually infinite variety of customized views of this valuable resource. Many creative and novel applications are possible by combining the capabilities shown here with those of open source chemical informatics software, such as RCDK, and other open data sources, such as NMRShiftDB.