<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="/stylesheets/rss.css"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/">
  <channel>
    <title>Depth-First: Tag sdfgz</title>
    <link>http://depth-first.com/articles/tag/sdfgz</link>
    <language>en-us</language>
    <ttl>40</ttl>
    <description>Walking the Web of Chemical Informatics</description>
    <item>
      <title>Recombining Compressed PubChem SD Files with Open Babel</title>
      <description>&lt;p&gt;&lt;a href="http://openbabel.org"&gt;&lt;img src="http://depth-first.com/files/Babel256.png" align="right"&gt;&lt;/img&gt;&lt;/a&gt;While testing &lt;a href="http://metamolecular.com/chemphoto"&gt;ChemPhoto&lt;/a&gt;, it became necessary to test the &lt;a href="http://depth-first.com/articles/2008/09/08/smarter-cheminformatics-from-sd-file-to-image-collection-with-chemphoto"&gt;chemical structure imaging application&lt;/a&gt; with SD Files containing several hundred thousand records. Although it's tempting to meet this need by constructing "dummy" files with the same record or small set of records repeated, tests are always far more illuminating when real data is used.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://pubchem.ncbi.nlm.nih.gov/"&gt;PubChem&lt;/a&gt; is an excellent source of large molecular datasets, and the entire database can be &lt;a href="http://depth-first.com/articles/2006/09/29/hacking-pubchem-direct-access-with-ftp"&gt;downloaded by FTP&lt;/a&gt;. Because of PubChem's massive size, the download is split into gzipped SD Files (*.sdf.gz), each containing about 25,000 records. Although this is an excellent resource, it creates a problem: how can you conveniently recombine this set of compressed SD Files into a single SD File?&lt;/p&gt;

&lt;p&gt;You might think about writing some "quick" code in your language of choice. Fortunately, &lt;a href="http://openbabel.org"&gt;Open Babel&lt;/a&gt; gets the job done - without any of the coding or debugging.&lt;/p&gt;

&lt;p&gt;The following command will create a single SD File from all of the compressed SD Files in a given directory, while also stripping explicit hydrogens and removing all fields except PUBCHEM_COMPOUND_CID.&lt;/p&gt;

&lt;div class="console"&gt;
&lt;pre&gt;
babel *.sdf.gz pubchem.sdf -d --delete PUBCHEM_COMPOUND_CANONICALIZED,PUBCHEM_CACTVS_COMPLEXITY,PUBCHEM_CACTVS_HBOND_ACCEPTOR,PUBCHEM_CACTVS_HBOND_DONOR,PUBCHEM_CACTVS_ROTATABLE_BOND,PUBCHEM_CACTVS_SUBSKEYS,PUBCHEM_IUPAC_OPENEYE_NAME,PUBCHEM_IUPAC_CAS_NAME,PUBCHEM_IUPAC_NAME,PUBCHEM_IUPAC_SYSTEMATIC_NAME,PUBCHEM_IUPAC_TRADITIONAL_NAME,PUBCHEM_NIST_INCHI,PUBCHEM_EXACT_MASS,PUBCHEM_MOLECULAR_FORMULA,PUBCHEM_MOLECULAR_WEIGHT,PUBCHEM_OPENEYE_CAN_SMILES,PUBCHEM_OPENEYE_ISO_SMILES,PUBCHEM_CACTVS_TPSA,PUBCHEM_MONOISOTOPIC_WEIGHT,PUBCHEM_TOTAL_CHARGE,PUBCHEM_HEAVY_ATOM_COUNT,PUBCHEM_ATOM_DEF_STEREO_COUNT,PUBCHEM_ATOM_UDEF_STEREO_COUNT,PUBCHEM_BOND_DEF_STEREO_COUNT,PUBCHEM_BOND_UDEF_STEREO_COUNT,PUBCHEM_ISOTOPIC_ATOM_COUNT,PUBCHEM_COMPONENT_COUNT,PUBCHEM_CACTVS_TAUTO_COUNT,PUBCHEM_BONDANNOTATIONS,PUBCHEM_CACTVS_XLOGP

865543 molecules converted
7 info messages 15372962 audit log messages 
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Apparently, there is no way to tell babel to &lt;em&gt;keep&lt;/em&gt; just a particular field in an SD File; each unwanted field has to be listed for deletion individually.&lt;/p&gt;
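
&lt;p&gt;Because unwanted fields have to be named one by one, it can help to generate the &lt;tt&gt;--delete&lt;/tt&gt; argument instead of typing it. The Ruby sketch below is only an illustration: the shortened field list and the variable names are assumptions, and the script prints the command rather than running it.&lt;/p&gt;

```ruby
# Hypothetical, shortened list of the fields present in the SD Files.
all_fields = %w[
  PUBCHEM_COMPOUND_CID
  PUBCHEM_IUPAC_NAME
  PUBCHEM_EXACT_MASS
  PUBCHEM_MOLECULAR_FORMULA
]

# The fields to preserve in the output.
keep = %w[PUBCHEM_COMPOUND_CID]

# Everything that is not kept goes into the comma-separated --delete list.
delete_arg = (all_fields - keep).join(',')

# Build the same babel invocation shown above; print it rather than run it.
command = "babel *.sdf.gz pubchem.sdf -d --delete #{delete_arg}"
puts command
```

&lt;p&gt;Changing the &lt;tt&gt;keep&lt;/tt&gt; array is then enough to preserve a different set of fields.&lt;/p&gt;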

&lt;p&gt;Still, not bad for a few seconds on the command line.&lt;/p&gt;</description>
      <pubDate>Wed, 01 Oct 2008 01:25:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:725a5f70-77e1-4aee-a79d-e7fb9f7c3401</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2008/10/01/recombining-compressed-pubchem-sd-files-with-open-babel</link>
      <category>Tools</category>
      <category>openbabel</category>
      <category>sdfile</category>
      <category>pubchem</category>
      <category>sdfgz</category>
      <category>commandline</category>
    </item>
    <item>
      <title>Cheminformatics for the Web: Convert SD Files to HTML with Ruby CDK</title>
      <description>&lt;p&gt;&lt;img src="http://depth-first.com/files/ruby_logo_new.gif" align="right"&gt;&lt;/img&gt;The Structure Data File (SDF) format is the de facto standard for cheminformatics data exchange.  One of the problems that arises when working with SD Files, especially large ones like those distributed by &lt;a href="http://pubchem.ncbi.nlm.nih.gov/"&gt;PubChem&lt;/a&gt;, is "seeing" the structures they contain. Although commercial software packages are available for doing so, they are generally closed, unreasonably expensive, or overly complex. This article describes a simple solution to the SDF visualization problem that uses Open Source tools controlled from the elegant and agile &lt;a href="http://www.ruby-lang.org"&gt;Ruby&lt;/a&gt; programming language.&lt;/p&gt;

&lt;h4&gt;Cut to the Chase&lt;/h4&gt;

&lt;p&gt;&lt;a href="http://depth-first.com/demo/20061113/index.html"&gt;This page&lt;/a&gt; shows the output produced by the software. You'll see a neatly arranged grid of colorful 2-D chemical structures in a Web page that was generated directly from a PubChem &lt;a href="http://depth-first.com/articles/2006/09/29/hacking-pubchem-direct-access-with-ftp"&gt;SDFGZ file&lt;/a&gt;. Each structure has a number below it, the PubChem Compound ID (CID). Both the structure and CID are hyperlinked to the Compound Summary page on PubChem. A partial screenshot is provided below.&lt;/p&gt;

&lt;p&gt;&lt;center&gt;&lt;a href="http://depth-first.com/demo/20061113/index.html"&gt;&lt;img src="http://depth-first.com/demo/20061113/screenshot.png" border="0"&gt;&lt;/img&gt;&lt;/a&gt;&lt;/center&gt;&lt;/p&gt;

&lt;h4&gt;Prerequisites&lt;/h4&gt;

&lt;p&gt;For this tutorial, you'll need &lt;a href="http://depth-first.com/articles/2006/10/30/agile-chemical-informatics-development-with-cdk-and-ruby-rcdk-0-3-0"&gt;Ruby CDK&lt;/a&gt; (RCDK). A &lt;a href="http://depth-first.com/articles/2006/09/25/cdk-the-ruby-way-rcdk-0-2-0"&gt;recent article&lt;/a&gt; described the small amount of system configuration required for RCDK on Linux. Another article showed how to &lt;a href="http://depth-first.com/articles/2006/10/12/running-ruby-java-bridge-on-windows"&gt;install RCDK on Windows&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;Download the Software&lt;/h4&gt;

&lt;p&gt;The software described in this article can be &lt;a href="http://rubyforge.org/frs/download.php/14636/sdf-ripper-0.0.1.tar.gz"&gt;downloaded here&lt;/a&gt;. Unpack the archive and make the resulting directory your working directory. You should see a 14 MB SDFGZ file, an RHTML template, and three Ruby files.&lt;/p&gt;

&lt;h4&gt;Ripping PubChem SD Files&lt;/h4&gt;

&lt;p&gt;The software is designed to work with PubChem &lt;a href="http://depth-first.com/articles/2006/09/29/hacking-pubchem-direct-access-with-ftp"&gt;SDFGZ files&lt;/a&gt;. The SDFGZ format simply results from the application of the gzip compression algorithm to an ordinary SD file.&lt;/p&gt;

&lt;p&gt;Ripping the example SDFGZ file is just a matter of running &lt;strong&gt;test.rb&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="console"&gt;
&lt;pre&gt;
$ ruby test.rb
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;You'll see some output indicating that various CIDs are being processed. When it finishes, the software will have created a directory called &lt;strong&gt;rip&lt;/strong&gt; containing an HTML file and an images directory.&lt;/p&gt;

&lt;h4&gt;The Little Engine That Could: CDK's StructureDiagramGenerator&lt;/h4&gt;

&lt;p&gt;If you've ever worked with PubChem's SD Files, you'll no doubt have noticed that the molfile section encodes all hydrogen atoms, which is not general practice. Rendering these hydrogens results in a very cluttered image.&lt;/p&gt;

&lt;p&gt;To solve this problem, the software creates its graphics from the &lt;tt&gt;PUBCHEM_OPENEYE_CAN_SMILES&lt;/tt&gt; field encoded by the SDFGZ file. This SMILES string is converted into a molecular representation and coordinates are assigned by CDK's &lt;tt&gt;StructureDiagramGenerator&lt;/tt&gt;.&lt;/p&gt;

&lt;p&gt;When an image can't be rendered in this way, it is left out. This happened for CIDs 18, 115, 147, 148, 222, and 223, for example. There are three common themes in these missing structures: metals, phosphorus, and molecules with a single heavy atom. The problem may, in fact, lie in the underlying Structure-CDK software, rather than with CDK. Stay tuned for more on this.&lt;/p&gt;

&lt;h4&gt;PubChem for Debugging&lt;/h4&gt;

&lt;p&gt;In developing this SD File Ripper program, I realized that it could be used as a powerful debugging tool. Notice how the missing structures (and their SMILES strings) can easily be examined via PubChem by clicking the empty cell. The alternative would have been for the program to spit out a list of SMILES that didn't process properly and to then try to construct a mental image of what this string represents. With PubChem, we do away with this tedium altogether.&lt;/p&gt;

&lt;p&gt;I doubt the creators of PubChem envisioned this application of their work. Surely it's but one of many still to be discovered.&lt;/p&gt;

&lt;h4&gt;Another Cool Thing About Ruby: eRuby Templates&lt;/h4&gt;

&lt;p&gt;Our SDF Ripper program creates HTML output, something for which Ruby is well suited thanks to ERB, its eRuby implementation. Among other uses, ERB enables Ruby code to be embedded within HTML. This inside-out scripting capability resembles that of other templating languages such as PHP, ASP, and JSP (ERB is used extensively by the &lt;a href="http://www.rubyonrails.org/"&gt;Ruby on Rails&lt;/a&gt; web application framework). The file &lt;strong&gt;template.rhtml&lt;/strong&gt; contains the ripper's ERB template. The separation of program logic from presentation makes it very simple to customize the appearance of the output.&lt;/p&gt;

&lt;h4&gt;Room to Grow&lt;/h4&gt;

&lt;p&gt;Our SDF Ripper only works with SDFGZ files from PubChem. The program is short enough that it should be simple to adapt it for your specific needs. It would not be much work at all, for example, to create an HTML table containing all fields encoded by the SDFGZ file. Similarly, adding support for non-compressed SD files is straightforward. If JavaScript is your medium, the possibilities become even more interesting. How about a pop-up menu showing an enlarged structure and data summary, &lt;a href="http://netflix.com"&gt;a la Netflix&lt;/a&gt; when the user mouses over an image?&lt;/p&gt;

&lt;p&gt;Paging is a technique that divides large Web pages into smaller pages linked to one another. For example, Google's search results are divided into groups of ten by default. Adding paging support to the software described here would also not be difficult, and would enable the convenient browsing of much larger datasets.&lt;/p&gt;
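
&lt;p&gt;As a rough sketch of the paging idea, Ruby's &lt;tt&gt;each_slice&lt;/tt&gt; can divide a list of records into fixed-size groups. The CIDs, the page size, and the output file naming below are assumptions made for illustration; none of this is part of the ripper as distributed.&lt;/p&gt;

```ruby
# Hypothetical CIDs standing in for the records ripped from an SDFGZ file.
cids = (1..25).to_a

# Number of structures per page; ten mirrors the Google example.
page_size = 10

# each_slice yields consecutive groups of at most page_size items.
pages = cids.each_slice(page_size).to_a

pages.each_with_index do |page, index|
  # Hypothetical naming scheme for the generated HTML files.
  filename = "page_#{index + 1}.html"
  puts "#{filename}: CIDs #{page.first} to #{page.last}"
end
```

&lt;p&gt;Each group could then be rendered through the existing template, with links to the previous and next page.&lt;/p&gt;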

&lt;h4&gt;Other Software That Does This&lt;/h4&gt;

&lt;p&gt;I am aware of no product, commercial or otherwise, that performs the SDF to HTML conversion in the way shown here. &lt;a href="http://scitegic.com"&gt;SciTegic&lt;/a&gt; does offer an &lt;a href="http://www.scitegic.com/products/reporting/"&gt;HTML table component&lt;/a&gt; as part of its &lt;a href="http://www.scitegic.com/products/overview/"&gt;Pipeline Pilot&lt;/a&gt; framework, but as far as I know, no standalone version is available.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://cheminfo.informatics.indiana.edu/~rguha/"&gt;Rajarshi Guha&lt;/a&gt;, among his many other interesting projects, has written a &lt;a href="http://cheminfo.informatics.indiana.edu/~rguha/code/java/#draw2d"&gt;Java SDF to PDF converter&lt;/a&gt; that uses CDK.&lt;/p&gt;

&lt;h4&gt;Conclusions&lt;/h4&gt;

&lt;p&gt;This article has demonstrated how the combination of RCDK and Ruby makes short work of converting the contents of an SD file into a Web-ready format. As usual, we've only scratched the surface of what's easily within reach. Watch for future articles to build on the concepts outlined here.&lt;/p&gt;</description>
      <pubDate>Mon, 13 Nov 2006 15:23:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:f7c9a381-b492-40f5-8a95-d37b8a269080</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2006/11/13/cheminformatics-for-the-web-convert-sd-files-to-html-with-ruby-cdk</link>
      <category>Tools</category>
      <category>sdfgz</category>
      <category>sdf</category>
      <category>pubchem</category>
      <category>html</category>
      <category>ruby</category>
      <category>rip</category>
    </item>
    <item>
      <title>Hacking PubChem: Direct Access with FTP</title>
      <description>&lt;p&gt;&lt;img src="http://depth-first.com/files/pubchemlogo.gif" align="right"&gt;&lt;/img&gt;A &lt;a href="http://depth-first.com/articles/2006/09/22/hacking-pubchem-why-the-open-access-fight-is-just-the-beginning"&gt;previous article&lt;/a&gt; in the &lt;em&gt;Hacking PubChem&lt;/em&gt; series pointed out that the entire PubChem database can be &lt;a href="ftp://ftp.ncbi.nlm.nih.gov/pubchem/"&gt;downloaded via FTP&lt;/a&gt;. This article shows how simple tools written in Ruby can be used to efficiently process the massive amount of data on PubChem's FTP-server.&lt;/p&gt;

&lt;h4&gt;Prerequisites&lt;/h4&gt;

&lt;p&gt;The only software you'll need for this tutorial is &lt;a href="http://www.ruby-lang.org/en/"&gt;Ruby&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;Organization of PubChem's FTP-Server&lt;/h4&gt;

&lt;p&gt;PubChem is a big database. To deal with its size, the FTP-server spreads its contents over about 950 files. Each file covers a contiguous range of Compound Identification Numbers (CIDs), and the size of each range appears to be set at 10,000 [&lt;em&gt;Now 25,000, see below&lt;/em&gt;]. In some of the files I've examined, the actual number of compounds in a given block was fewer than 10,000. The root directory containing the files can be accessed &lt;a href="ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;Compression Saves the Day&lt;/h4&gt;

&lt;p&gt;For storage and transmission efficiency, PubChem's SDF files are compressed using the GZip algorithm, giving files that typically range in size from five to seven megabytes. Compression ratios for the files I've examined are about 10:1. I'm calling these files "SDFGZ" files, and they have the extension &lt;tt&gt;*.sdf.gz&lt;/tt&gt;.&lt;/p&gt;

&lt;p&gt;A back-of-the-envelope calculation, based on 950 files with an average size of 6 MB and a compression ratio of 10:1 (950 x 6 MB x 10 = 57,000 MB), gives an approximate storage requirement of 57 GB for the uncompressed PubChem database. Although storing this much data is feasible with today's hardware, there are many better uses for storage space. This is especially true if only a few fields of the PubChem database are of interest.&lt;/p&gt;
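
&lt;p&gt;The same estimate can be reproduced in a few lines of Ruby, which makes it easy to plug in different numbers. The variable names are mine, and the sketch assumes 1 GB = 1,000 MB.&lt;/p&gt;

```ruby
file_count = 950          # approximate number of SDFGZ files
average_size_mb = 6       # average compressed size of one file, in MB
compression_ratio = 10    # uncompressed size relative to compressed size

# Work in whole megabytes first, then convert to gigabytes.
compressed_mb = file_count * average_size_mb
uncompressed_mb = compressed_mb * compression_ratio

compressed_gb = compressed_mb / 1000.0
uncompressed_gb = uncompressed_mb / 1000.0

puts "compressed:   #{compressed_gb} GB"
puts "uncompressed: #{uncompressed_gb} GB"
```

&lt;p&gt;With the figures above, the compressed download comes to roughly 5.7 GB and the uncompressed database to roughly 57 GB.&lt;/p&gt;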

&lt;h4&gt;Setting Up&lt;/h4&gt;

&lt;p&gt;You'll need to download some SDFGZ data. This tutorial uses the &lt;a href="ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/Compound_09540001_09550000.sdf.gz"&gt;file containing CIDs 9540001-9550000&lt;/a&gt;. [&lt;em&gt;Note: PubChem recently increased the number of compounds in each sdfgz file to 25,000. This means that the link to the file no longer works. Instead, choose a file from &lt;a href="ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/"&gt;here&lt;/a&gt;&lt;/em&gt;.] Put this file in your working directory.&lt;/p&gt;

&lt;h4&gt;A Short Library&lt;/h4&gt;

&lt;p&gt;Create a file called &lt;strong&gt;sdfgz.rb&lt;/strong&gt; containing the following code:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_ruby "&gt;&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;zlib&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;

&lt;span class="comment"&gt;# A simple splitter for *.sdf.gz files available&lt;/span&gt;
&lt;span class="comment"&gt;# from PubChem's FTP-server.&lt;/span&gt;
&lt;span class="keyword"&gt;class &lt;/span&gt;&lt;span class="class"&gt;SDFGZSplitter&lt;/span&gt;
  &lt;span class="attribute"&gt;@@stop&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;$$$$&lt;span class="escape"&gt;\n&lt;/span&gt;&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;
  &lt;span class="attribute"&gt;@@blank&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;

  &lt;span class="comment"&gt;# Configures this SDFGZSplitter using the &amp;lt;tt&amp;gt;IO&amp;lt;/tt&amp;gt;&lt;/span&gt;
  &lt;span class="comment"&gt;# object &amp;lt;tt&amp;gt;io&amp;lt;/tt&amp;gt;.&lt;/span&gt;
  &lt;span class="keyword"&gt;def &lt;/span&gt;&lt;span class="method"&gt;initialize&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;io&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;
    &lt;span class="attribute"&gt;@gzip&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;Zlib&lt;/span&gt;&lt;span class="punct"&gt;::&lt;/span&gt;&lt;span class="constant"&gt;GzipReader&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;new&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;io&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;
  &lt;span class="keyword"&gt;end&lt;/span&gt;

  &lt;span class="comment"&gt;# Yield a sequence of SDFile records.&lt;/span&gt;
  &lt;span class="keyword"&gt;def &lt;/span&gt;&lt;span class="method"&gt;each_record&lt;/span&gt;
    &lt;span class="ident"&gt;record&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;get_record&lt;/span&gt;

    &lt;span class="keyword"&gt;while&lt;/span&gt; &lt;span class="ident"&gt;record&lt;/span&gt; &lt;span class="punct"&gt;!=&lt;/span&gt; &lt;span class="attribute"&gt;@@blank&lt;/span&gt;
      &lt;span class="keyword"&gt;yield&lt;/span&gt; &lt;span class="ident"&gt;record&lt;/span&gt;
      &lt;span class="ident"&gt;record&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;get_record&lt;/span&gt;
    &lt;span class="keyword"&gt;end&lt;/span&gt;
  &lt;span class="keyword"&gt;end&lt;/span&gt;

  &lt;span class="comment"&gt;# Gets the next record, or an empty string if&lt;/span&gt;
  &lt;span class="comment"&gt;# none is available.&lt;/span&gt;
  &lt;span class="keyword"&gt;def &lt;/span&gt;&lt;span class="method"&gt;get_record&lt;/span&gt;
    &lt;span class="ident"&gt;line&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;read_line&lt;/span&gt;
    &lt;span class="ident"&gt;record&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="punct"&gt;[&lt;/span&gt;&lt;span class="ident"&gt;line&lt;/span&gt;&lt;span class="punct"&gt;]&lt;/span&gt;

    &lt;span class="keyword"&gt;while&lt;/span&gt; &lt;span class="punct"&gt;!(&lt;/span&gt;&lt;span class="attribute"&gt;@@stop&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;eql?&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;line&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt; &lt;span class="punct"&gt;||&lt;/span&gt; &lt;span class="constant"&gt;nil&lt;/span&gt; &lt;span class="punct"&gt;==&lt;/span&gt; &lt;span class="ident"&gt;line&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;
      &lt;span class="ident"&gt;line&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;read_line&lt;/span&gt;
      &lt;span class="ident"&gt;record&lt;/span&gt; &lt;span class="punct"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="ident"&gt;line&lt;/span&gt;
    &lt;span class="keyword"&gt;end&lt;/span&gt;

    &lt;span class="ident"&gt;record&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;join&lt;/span&gt;
  &lt;span class="keyword"&gt;end&lt;/span&gt;

  &lt;span class="ident"&gt;private&lt;/span&gt;

  &lt;span class="comment"&gt;# Reads the next line in the SDFGZ file.&lt;/span&gt;
  &lt;span class="keyword"&gt;def &lt;/span&gt;&lt;span class="method"&gt;read_line&lt;/span&gt;
    &lt;span class="keyword"&gt;begin&lt;/span&gt;
      &lt;span class="ident"&gt;line&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="attribute"&gt;@gzip&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;readline&lt;/span&gt;
    &lt;span class="keyword"&gt;rescue&lt;/span&gt; &lt;span class="constant"&gt;EOFError&lt;/span&gt;
      &lt;span class="keyword"&gt;return&lt;/span&gt; &lt;span class="constant"&gt;nil&lt;/span&gt;
    &lt;span class="keyword"&gt;end&lt;/span&gt;

    &lt;span class="ident"&gt;line&lt;/span&gt;
  &lt;span class="keyword"&gt;end&lt;/span&gt;
&lt;span class="keyword"&gt;end&lt;/span&gt;

&lt;span class="comment"&gt;# Utility class for getting data out of a SDFile record.&lt;/span&gt;
&lt;span class="keyword"&gt;class &lt;/span&gt;&lt;span class="class"&gt;Extractor&lt;/span&gt;
  &lt;span class="comment"&gt;# Gets the data from &amp;lt;tt&amp;gt;record&amp;lt;/tt&amp;gt; associated with&lt;/span&gt;
  &lt;span class="comment"&gt;# &amp;lt;tt&amp;gt;key&amp;lt;/tt&amp;gt;.&lt;/span&gt;
  &lt;span class="keyword"&gt;def &lt;/span&gt;&lt;span class="method"&gt;self.extract_data&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;record&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="ident"&gt;key&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;
    &lt;span class="ident"&gt;record&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;match&lt;/span&gt;&lt;span class="punct"&gt;(/&lt;/span&gt;&lt;span class="regex"&gt;&amp;gt; &amp;lt;&lt;span class="expr"&gt;#{key}&lt;/span&gt;&amp;gt;&lt;span class="escape"&gt;\n&lt;/span&gt;(.+)&lt;span class="escape"&gt;\n&lt;/span&gt;&lt;/span&gt;&lt;span class="punct"&gt;/)&lt;/span&gt;
    &lt;span class="global"&gt;$1&lt;/span&gt;
  &lt;span class="keyword"&gt;end&lt;/span&gt;

  &lt;span class="comment"&gt;# Gets the molfile for &amp;lt;tt&amp;gt;record&amp;lt;/tt&amp;gt;.&lt;/span&gt;
  &lt;span class="keyword"&gt;def &lt;/span&gt;&lt;span class="method"&gt;self.extract_molfile&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;record&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;
    &lt;span class="ident"&gt;record&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;match&lt;/span&gt;&lt;span class="punct"&gt;(/&lt;/span&gt;&lt;span class="regex"&gt;M  END$&lt;/span&gt;&lt;span class="punct"&gt;/).&lt;/span&gt;&lt;span class="ident"&gt;pre_match&lt;/span&gt; &lt;span class="punct"&gt;+&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;M  END&lt;span class="escape"&gt;\n&lt;/span&gt;&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;
  &lt;span class="keyword"&gt;end&lt;/span&gt;
&lt;span class="keyword"&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;tt&gt;SDFGZSplitter&lt;/tt&gt; class uses &lt;tt&gt;Zlib::GzipReader&lt;/tt&gt;, from Ruby's standard zlib library, to read SDFGZ files as a stream, without first inflating them to disk. The method &lt;tt&gt;each_record&lt;/tt&gt; is a &lt;a href="http://www.rubycentral.com/book/tut_containers.html"&gt;Ruby iterator&lt;/a&gt;, one of the strangely cool things that make Ruby the language it is. The iterator's job is to hand back each SDFGZ record individually until all records have been retrieved.&lt;/p&gt;
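
&lt;p&gt;The streaming behavior can be tried without downloading anything. The sketch below gzips two made-up stub records in memory (they contain no real structure data) and reads them back one record at a time by passing the SD File delimiter to &lt;tt&gt;each_line&lt;/tt&gt; as the line separator. It is an alternative to the line-by-line loop in &lt;tt&gt;SDFGZSplitter&lt;/tt&gt;, shown under those assumptions, not a replacement for it.&lt;/p&gt;

```ruby
require 'zlib'
require 'stringio'

# Two made-up stub records, each terminated by the SD File delimiter.
delimiter = "$$$$\n"
sample = "first\nM  END\n#{delimiter}second\nM  END\n#{delimiter}"

# Compress the sample into an in-memory binary buffer instead of a file.
buffer = StringIO.new(String.new)
writer = Zlib::GzipWriter.new(buffer)
writer.write(sample)
writer.finish   # write the gzip trailer but leave the buffer open
buffer.rewind

# Read the compressed data back, one record per iteration.
reader = Zlib::GzipReader.new(buffer)
records = []
reader.each_line(delimiter) { |record| records.push(record) }
reader.close

puts "read #{records.length} records"
```

&lt;p&gt;Only one record needs to be held in memory at a time, which is what makes this approach practical for files holding thousands of compounds.&lt;/p&gt;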

&lt;h4&gt;Using the Library&lt;/h4&gt;

&lt;p&gt;As a test for the &lt;tt&gt;sdfgz&lt;/tt&gt; library, let's scrape all PubChem CIDs and InChI identifiers from an SDFGZ file, and place the result into a new CSV file. Create the following code, either in a file to be run by &lt;tt&gt;ruby&lt;/tt&gt; or in a terminal session using &lt;tt&gt;irb&lt;/tt&gt;:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_ruby "&gt;&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;./sdfgz&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;

&lt;span class="ident"&gt;file&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;File&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;new&lt;/span&gt;&lt;span class="punct"&gt;('&lt;/span&gt;&lt;span class="string"&gt;Compound_09540001_09550000.sdf.gz&lt;/span&gt;&lt;span class="punct"&gt;',&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;rb&lt;/span&gt;&lt;span class="punct"&gt;')&lt;/span&gt;
&lt;span class="ident"&gt;splitter&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;SDFGZSplitter&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;new&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;file&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;

&lt;span class="ident"&gt;puts&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;parsing...&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;

&lt;span class="constant"&gt;File&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;open&lt;/span&gt;&lt;span class="punct"&gt;('&lt;/span&gt;&lt;span class="string"&gt;dictionary.csv&lt;/span&gt;&lt;span class="punct"&gt;',&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;w+&lt;/span&gt;&lt;span class="punct"&gt;')&lt;/span&gt; &lt;span class="keyword"&gt;do&lt;/span&gt; &lt;span class="punct"&gt;|&lt;/span&gt;&lt;span class="ident"&gt;csv&lt;/span&gt;&lt;span class="punct"&gt;|&lt;/span&gt;
  &lt;span class="ident"&gt;splitter&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;each_record&lt;/span&gt; &lt;span class="keyword"&gt;do&lt;/span&gt; &lt;span class="punct"&gt;|&lt;/span&gt;&lt;span class="ident"&gt;record&lt;/span&gt;&lt;span class="punct"&gt;|&lt;/span&gt;
    &lt;span class="ident"&gt;cid&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;Extractor&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;extract_data&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;record&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;PUBCHEM_COMPOUND_CID&lt;/span&gt;&lt;span class="punct"&gt;')&lt;/span&gt;
    &lt;span class="ident"&gt;inchi&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;Extractor&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;extract_data&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;record&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;PUBCHEM_NIST_INCHI&lt;/span&gt;&lt;span class="punct"&gt;')&lt;/span&gt;

    &lt;span class="ident"&gt;csv&lt;/span&gt; &lt;span class="punct"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;&lt;span class="expr"&gt;#{cid}&lt;/span&gt;,&lt;span class="escape"&gt;\&amp;quot;&lt;/span&gt;&lt;span class="expr"&gt;#{inchi}&lt;/span&gt;&lt;span class="escape"&gt;\&amp;quot;\n&lt;/span&gt;&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;
  &lt;span class="keyword"&gt;end&lt;/span&gt; 
&lt;span class="keyword"&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Running this test creates a (rather large) file called &lt;strong&gt;dictionary.csv&lt;/strong&gt; in your working directory. Its first few lines look like this:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_default "&gt;9540001,&amp;quot;InChI=1/C20H22N2O4/c1-13-7-5-10-16(14(13)2)22-20(26)15-8-3-4-9-17(15)21-18(23)11-6-12-19(24)25/h3-5,7-10H,6,11-12H2,1-2H3,(H,21,23)(H,22,26)(H,24,25)/p-1/fC20H21N2O4/h21-22H/q-1&amp;quot;
9540002,&amp;quot;InChI=1/C20H22N2O4/c1-13-7-5-10-16(14(13)2)22-20(26)15-8-3-4-9-17(15)21-18(23)11-6-12-19(24)25/h3-5,7-10H,6,11-12H2,1-2H3,(H,21,23)(H,22,26)(H,24,25)/f/h21-22,24H&amp;quot;
9540003,&amp;quot;InChI=1/C19H20N2O5/c1-26-16-8-3-7-15(12-16)21-19(25)13-5-2-6-14(11-13)20-17(22)9-4-10-18(23)24/h2-3,5-8,11-12H,4,9-10H2,1H3,(H,20,22)(H,21,25)(H,23,24)/p-1/fC19H19N2O5/h20-21H/q-1&amp;quot;
9540004,&amp;quot;InChI=1/C19H20N2O5/c1-26-16-8-3-7-15(12-16)21-19(25)13-5-2-6-14(11-13)20-17(22)9-4-10-18(23)24/h2-3,5-8,11-12H,4,9-10H2,1H3,(H,20,22)(H,21,25)(H,23,24)/f/h20-21,23H&amp;quot;

...&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Many customizations of the above code are possible. For example, it would not be difficult to programmatically log into the PubChem FTP-server, download a file, and process it as shown. By parsing the SDFGZ filename, a program could even know which file contained a given CID. Because the &lt;tt&gt;SDFGZSplitter&lt;/tt&gt; constructor takes a Ruby &lt;tt&gt;IO&lt;/tt&gt; object, it's also feasible to process PubChem's SDFGZ files directly from the FTP-server, without downloading them beforehand. But that's a subject for another day.&lt;/p&gt;
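
&lt;p&gt;As a sketch of the filename idea, the CID range can be pulled out of a name such as &lt;tt&gt;Compound_09540001_09550000.sdf.gz&lt;/tt&gt; with a regular expression. The helper names and the second filename in the list are assumptions made for illustration, following the naming pattern of the file used in this tutorial.&lt;/p&gt;

```ruby
# Parse the CID range encoded in a PubChem SDFGZ filename.
# Returns a Range of CIDs, or nil if the name does not match the pattern.
def cid_range(filename)
  match = filename.match(/Compound_(\d+)_(\d+)\.sdf\.gz\z/)
  return nil if match.nil?

  (match[1].to_i..match[2].to_i)
end

# Hypothetical helper: pick the file whose range covers the given CID.
def file_for_cid(filenames, cid)
  filenames.find do |name|
    range = cid_range(name)
    range ? range.cover?(cid) : false
  end
end

# The second name is hypothetical and only follows the same pattern.
files = [
  'Compound_09540001_09550000.sdf.gz',
  'Compound_09550001_09560000.sdf.gz'
]

puts file_for_cid(files, 9_540_004)
```

&lt;p&gt;Because the lookup relies only on the names, it can be run against a directory listing before any file is downloaded.&lt;/p&gt;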

&lt;h4&gt;Summing Up&lt;/h4&gt;

&lt;p&gt;The PubChem FTP-server is a treasure trove of useful data that's available free of charge. Using simple tools like those discussed here, it's possible to generate a virtually infinite variety of customized views of this valuable resource. Many creative, and novel, applications are possible by combining the capabilities shown here with those of Open Source chemical informatics software, such as &lt;a href="http://depth-first.com/articles/2006/09/26/looking-at-inchis"&gt;RCDK&lt;/a&gt;, and other Open data sources, such as &lt;a href="http://depth-first.com/articles/2006/09/04/hacking-nmrshiftdb"&gt;NMRShiftDB&lt;/a&gt;.&lt;/p&gt;</description>
      <pubDate>Fri, 29 Sep 2006 01:59:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:698cbe0d-c45a-4d91-95c0-682e0c7d6a6f</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2006/09/29/hacking-pubchem-direct-access-with-ftp</link>
      <category>Databases</category>
      <category>pubchem</category>
      <category>ftp</category>
      <category>ruby</category>
      <category>sdfgz</category>
      <category>sdfile</category>
      <category>gzip</category>
    </item>
  </channel>
</rss>
