<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="/stylesheets/rss.css"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/">
  <channel>
    <title>Depth-First: Tag sdfile</title>
    <link>http://depth-first.com/articles/tag/sdfile</link>
    <language>en-us</language>
    <ttl>40</ttl>
    <description>Walking the Web of Chemical Informatics</description>
    <item>
      <title>PubChem WTF #1</title>
      <description>&lt;p&gt;&lt;center&gt;&lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=21022536"&gt;&lt;img src="http://depth-first.com/demo/20081010/21022536.png"&gt;&lt;/img&gt;&lt;/a&gt;&lt;/center&gt;&lt;/p&gt;

&lt;p&gt;In preparation for the first beta release of &lt;a href="http://metamolecular.com/chemphoto"&gt;ChemPhoto&lt;/a&gt;, the &lt;a href="http://depth-first.com/articles/2008/09/08/smarter-cheminformatics-from-sd-file-to-image-collection-with-chemphoto"&gt;chemical structure imaging application&lt;/a&gt;, I've been performing a lot of tests with &lt;a href="http://depth-first.com/articles/2006/09/29/hacking-pubchem-direct-access-with-ftp"&gt;PubChem SD files&lt;/a&gt;. It turns out that having a tool that can be used to quickly browse through tens of thousands of PubChem molecules turns up some very strange beasts, including &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=21022536"&gt;the one depicted above&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you're still curious as to what this PubChem record is actually referring to, &lt;a href="http://www.daylight.com/daycgi/depict?433143324333433443354343364335374334384333394332324331433143323243393343383443373543364336433535433434433333433232433143314332324333334334344335354336433643353543343443333343323243314331433232433333433434433535433643364335354334344333334332324331433143323243333343343443353543364336433535433434433333433232433143314332324333334334344335354336433643353543343443333343323243314343324333433443354336"&gt;this tool&lt;/a&gt; is quite useful.&lt;/p&gt;</description>
      <pubDate>Sat, 11 Oct 2008 03:35:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:97c12357-2edc-4353-867f-509714551267</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2008/10/11/pubchem-wtf-1</link>
      <category>Meta</category>
      <category>pubchem</category>
      <category>pubchemwtf</category>
      <category>chemphoto</category>
      <category>sdfile</category>
    </item>
    <item>
      <title>Recombining Compressed PubChem SD Files with Open Babel</title>
      <description>&lt;p&gt;&lt;a href="http://openbabel.org"&gt;&lt;img src="http://depth-first.com/files/Babel256.png" align="right"&gt;&lt;/img&gt;&lt;/a&gt;While testing &lt;a href="http://metamolecular.com/chemphoto"&gt;ChemPhoto&lt;/a&gt;, it became necessary to test the &lt;a href="http://depth-first.com/articles/2008/09/08/smarter-cheminformatics-from-sd-file-to-image-collection-with-chemphoto"&gt;chemical structure imaging application&lt;/a&gt; with SD Files containing several hundred thousand records. Although it's tempting to meet this need by constructing "dummy" files with the same record or small set of records repeated, tests are always far more illuminating when real data is used.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://pubchem.ncbi.nlm.nih.gov/"&gt;PubChem&lt;/a&gt; is an excellent source of large molecular datasets, and the entire database can be &lt;a href="http://depth-first.com/articles/2006/09/29/hacking-pubchem-direct-access-with-ftp"&gt;downloaded by FTP&lt;/a&gt;. Because of PubChem's massive size, the download is split into files of about 25,000 records each, in gzipped SD File format (*.sdf.gz). Although this is an excellent resource, it creates a problem: how can you conveniently recombine this set of compressed SD Files into a single SD File?&lt;/p&gt;

&lt;p&gt;You might think about writing some "quick" code in your language of choice. Fortunately, &lt;a href="http://openbabel.org"&gt;Open Babel&lt;/a&gt; gets the job done - without any of the coding or debugging.&lt;/p&gt;

&lt;p&gt;The following command will create a single SD File from all of the compressed SD Files in a given directory, while also stripping explicit hydrogens and removing all fields except PUBCHEM_COMPOUND_CID.&lt;/p&gt;

&lt;div class="console"&gt;
&lt;pre&gt;
babel *.sdf.gz pubchem.sdf -d --delete PUBCHEM_COMPOUND_CANONICALIZED,PUBCHEM_CACTVS_COMPLEXITY,PUBCHEM_CACTVS_HBOND_ACCEPTOR,PUBCHEM_CACTVS_HBOND_DONOR,PUBCHEM_CACTVS_ROTATABLE_BOND,PUBCHEM_CACTVS_SUBSKEYS,PUBCHEM_IUPAC_OPENEYE_NAME,PUBCHEM_IUPAC_CAS_NAME,PUBCHEM_IUPAC_NAME,PUBCHEM_IUPAC_SYSTEMATIC_NAME,PUBCHEM_IUPAC_TRADITIONAL_NAME,PUBCHEM_NIST_INCHI,PUBCHEM_EXACT_MASS,PUBCHEM_MOLECULAR_FORMULA,PUBCHEM_MOLECULAR_WEIGHT,PUBCHEM_OPENEYE_CAN_SMILES,PUBCHEM_OPENEYE_ISO_SMILES,PUBCHEM_CACTVS_TPSA,PUBCHEM_MONOISOTOPIC_WEIGHT,PUBCHEM_TOTAL_CHARGE,PUBCHEM_HEAVY_ATOM_COUNT,PUBCHEM_ATOM_DEF_STEREO_COUNT,PUBCHEM_ATOM_UDEF_STEREO_COUNT,PUBCHEM_BOND_DEF_STEREO_COUNT,PUBCHEM_BOND_UDEF_STEREO_COUNT,PUBCHEM_ISOTOPIC_ATOM_COUNT,PUBCHEM_COMPONENT_COUNT,PUBCHEM_CACTVS_TAUTO_COUNT,PUBCHEM_BONDANNOTATIONS,PUBCHEM_CACTVS_XLOGP

865543 molecules converted
7 info messages 15372962 audit log messages 
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Apparently, there is no way to tell babel to &lt;em&gt;keep&lt;/em&gt; just a particular field in an SD File - the unwanted fields need to be removed individually.&lt;/p&gt;
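
&lt;p&gt;If keeping a single field turns out to be worth a little code after all, the filtering step can be sketched in Ruby. This is only a rough sketch and not an Open Babel feature: the method name &lt;tt&gt;keep_fields&lt;/tt&gt; is made up for illustration, and it assumes PubChem-style data items whose header line starts with a greater-than sign and carries the field name in angle brackets.&lt;/p&gt;

```ruby
# Returns a copy of one SD File record (a String ending in "$$$$\n")
# that keeps the molfile block and only the data items named in +keep+.
# Assumption: each data item starts with a header line beginning with '>'
# whose first run of word characters is the field name (true for PubChem
# tags such as PUBCHEM_COMPOUND_CID) and ends with a blank line.
# keep_fields is a hypothetical helper, not part of any library.
def keep_fields(record, keep)
  output = []
  in_molfile = true
  keeping = false

  record.each_line do |line|
    if line.start_with?("$$$$")
      output.push(line)                 # the record delimiter is always kept
    elsif in_molfile
      output.push(line)                 # molfile lines up to and including M  END
      in_molfile = false if line.start_with?("M  END")
    elsif line.start_with?(">")
      keeping = keep.include?(line[/\w+/])
      output.push(line) if keeping      # header of a data item we want
    elsif keeping
      output.push(line)                 # value lines and the closing blank line
    end
  end

  output.join
end
```

&lt;p&gt;Calling &lt;tt&gt;keep_fields(record, ['PUBCHEM_COMPOUND_CID'])&lt;/tt&gt; on each record would drop every other data item without listing them by name. Unlike the babel command above, the sketch leaves explicit hydrogens untouched.&lt;/p&gt;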

&lt;p&gt;Still, not bad for a few seconds on the command line.&lt;/p&gt;</description>
      <pubDate>Wed, 01 Oct 2008 01:25:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:725a5f70-77e1-4aee-a79d-e7fb9f7c3401</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2008/10/01/recombining-compressed-pubchem-sd-files-with-open-babel</link>
      <category>Tools</category>
      <category>openbabel</category>
      <category>sdfile</category>
      <category>pubchem</category>
      <category>sdfgz</category>
      <category>commandline</category>
    </item>
    <item>
      <title>Smarter Cheminformatics: From SD File to Image Collection with ChemPhoto</title>
      <description>&lt;p&gt;&lt;a href="http://metamolecular.com/chemphoto"&gt;&lt;img src="http://depth-first.com/demo/20080908/chemphoto.png" align="right"&gt;&lt;/img&gt;&lt;/a&gt;The old adage says time is money. Unfortunately, working chemists are often forced to spend a remarkable amount of valuable time and mental effort on menial chemical information processing tasks. These are things that could be done faster and with better quality by the right software, if it were available. Most importantly, these tasks take resources away from much more valuable work that &lt;em&gt;can't&lt;/em&gt; be automated.&lt;/p&gt;

&lt;h4&gt;The Problem in a Nutshell&lt;/h4&gt;

&lt;p&gt;As a case in point, consider the creation of 2D chemical structure images. If you maintain a compound collection of any kind, sooner or later you may end up asking yourself how you can create a set of images depicting the chemical structures in your collection.&lt;/p&gt;

&lt;h4&gt;A Specific Example: Chemical Suppliers&lt;/h4&gt;

&lt;p&gt;For example, you might work for a chemical supplier that maintains a Web-based eCommerce site, one or more PDF catalogs, or printed brochures. Your customers are chemists and they expect to see chemical structures in your product listings. How can you make this happen?&lt;/p&gt;

&lt;p&gt;If you look around for software that automates this job, you'll more likely than not come up empty-handed. The software that solves this problem well simply doesn't exist yet.&lt;/p&gt;

&lt;h4&gt;Doing it the Hard Way&lt;/h4&gt;

&lt;p&gt;In the absence of software to solve the problem, the only way to get the job done is to buckle down and do it manually. Most chemical structure editors allow you to save output as a raster image. Provided that this output matches your requirements, your system might consist of the following steps:&lt;/p&gt;

&lt;p&gt;(1) For every product in your catalog, create a single molfile or its machine-readable equivalent.&lt;/p&gt;

&lt;p&gt;(2) Load one file into your editor.&lt;/p&gt;

&lt;p&gt;(3) Save the file as a raster image, being careful to make sure all of the drawing settings and image size parameters are identical to the rest of your images.&lt;/p&gt;

&lt;p&gt;(4) Repeat Steps (2)-(3) until you have all of your images.&lt;/p&gt;

&lt;p&gt;There are many problems with this approach. For example, if your images ever need to be made larger (or smaller), you'll have to create all of your images over again (which can easily number in the thousands). Similarly, if for some reason you want to change the appearance of the images such as background, atom label coloring, or line thicknesses, you'll be forced into a lot of manual work. Finally, this system requires you to keep track of structures that have been imaged and those that haven't, which can in itself be nontrivial and error-prone, especially for thousands of products.&lt;/p&gt;

&lt;p&gt;With the right software, this problem would disappear.&lt;/p&gt;

&lt;h4&gt;One Solution: Customized Imaging Service&lt;/h4&gt;

&lt;p&gt;My company, &lt;a href="http://metamolecular.com"&gt;Metamolecular&lt;/a&gt;, has recently provided custom imaging services to a few chemical suppliers wanting thousands of good-looking structure images rendered automatically. The service made use of the versatile &lt;a href="http://metamolecular.com/chemwriter"&gt;ChemWriter&lt;/a&gt; rendering engine together with some custom code written in &lt;a href="http://depth-first.com/articles/tag/ruby"&gt;Ruby&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Although the imaging service works very well as a one-off solution, it's less than optimal in the longer term. Any changes to the image collection must be processed by Metamolecular, and then sent back to the client. A cheaper and faster solution would be to offer software that implements the functionality of the service.&lt;/p&gt;

&lt;h4&gt;A Better Solution: Chemical Structure Imaging Software&lt;/h4&gt;

&lt;p&gt;Wouldn't it be great if easy-to-use software existed that could automatically generate thousands of chemical structure images with the press of a button?&lt;/p&gt;

&lt;p&gt;In particular, the software should:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Run on any modern platform (Windows, Mac OS X, Linux).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Read industry-standard Structure Data Files (SD Files).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Be capable of working with tens of thousands of chemical structures at a time even on older machines.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Store fully-customizable drawing settings in a format that could be used over and over again for a consistent and professional look.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Allow the output to be previewed exactly as it will appear in the generated images ("what you see is what you get").&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Output to a variety of image formats, including: Portable Network Graphics (PNG image); JPG image; &lt;a href="http://depth-first.com/articles/2008/06/10/adobe-flash-for-cheminformatics-fast-scalable-and-attractive-2d-depiction-of-chemical-structures-with-vector-graphics"&gt;Flash&lt;/a&gt; (SWF file); &lt;a href="http://depth-first.com/articles/2006/09/09/generating-and-serving-2-d-molecular-svgs"&gt;Scalable Vector Graphics&lt;/a&gt; (SVG); and &lt;a href="http://depth-first.com/articles/2008/08/07/encapsulated-postscript-for-cheminformatics"&gt;Encapsulated PostScript&lt;/a&gt; (EPS file).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;Introducing ChemPhoto&lt;/h4&gt;

&lt;p&gt;ChemPhoto is designed to solve the problem of consistently creating large numbers of high-quality 2D chemical structure images. Currently in development, the first versions of ChemPhoto will be available for review within the next two weeks.&lt;/p&gt;

&lt;p&gt;ChemPhoto consists of a lightweight and intuitive user interface layer built on top of the ChemWriter rendering engine. ChemPhoto focuses on doing one thing very well, so it wouldn't be useful for creating or editing SD Files (a task for which many tools already exist). The software is specifically designed to work well with large SD Files, such as the 25,000-structure sets that can be obtained from &lt;a href="http://pubchem.ncbi.nlm.nih.gov/"&gt;PubChem&lt;/a&gt;. Written in Java, ChemPhoto runs on Windows, Mac OS X, and Linux. Future articles will discuss ChemPhoto's design and implementation.&lt;/p&gt;

&lt;p&gt;If you're interested in evaluating ChemPhoto, feel free to &lt;a href="http://mailhide.recaptcha.net/d?k=01R9bxyP6XNdc0duoUCzBBHA==&amp;amp;c=vZ7R0VDctRzIRzbSs1-LZwDzjTjAnfCS4KONqGHxY9I=" onclick="window.open('http://mailhide.recaptcha.net/d?k=01R9bxyP6XNdc0duoUCzBBHA==&amp;amp;c=vZ7R0VDctRzIRzbSs1-LZwDzjTjAnfCS4KONqGHxY9I=', '', 'toolbar=0,scrollbars=0,location=0,statusbar=0,menubar=0,resizable=0,width=500,height=300'); return false;" title="Reveal this e-mail address"&gt;drop me a line&lt;/a&gt;.&lt;/p&gt;</description>
      <pubDate>Mon, 08 Sep 2008 19:04:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:e69d3005-c83f-463b-9826-e08b4339f91a</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2008/09/08/smarter-cheminformatics-from-sd-file-to-image-collection-with-chemphoto</link>
      <category>Tools</category>
      <category>chemphoto</category>
      <category>image</category>
      <category>png</category>
      <category>jpg</category>
      <category>swf</category>
      <category>svg</category>
      <category>eps</category>
      <category>sdfile</category>
      <category>automation</category>
    </item>
    <item>
      <title>Create Your Own PubChem Datasets: Exporting Results As SD Files</title>
      <description>&lt;p&gt;&lt;a href="http://pubchem.ncbi.nlm.nih.gov/"&gt;&lt;img src="http://depth-first.com/files/pubchemlogo.gif"  align="right"&gt;&lt;/img&gt;&lt;/a&gt;Recently, &lt;a href="http://depth-first.com/articles/2007/11/12/parsing-sd-files-with-ruby-and-rubidium"&gt;I needed to create a subset&lt;/a&gt; of the PubChem database in Structure Data File (SD File) format. Although it's far from obvious how to do this, the capability does exist. In this article, I'll give a step-by-step procedure for creating custom datasets in SD File format from arbitrary PubChem structure queries.&lt;/p&gt;

&lt;h4&gt;Create and Execute the Query&lt;/h4&gt;

&lt;p&gt;Let's say we want to create a dataset in SD File format containing all N-Boc-protected piperidines registered in PubChem.&lt;/p&gt;

&lt;p&gt;From the main &lt;a href="http://pubchem.ncbi.nlm.nih.gov/"&gt;PubChem site&lt;/a&gt;, choose the &lt;a href="http://pubchem.ncbi.nlm.nih.gov/search/"&gt;Structure Search&lt;/a&gt; link. Then click the "Sketch" button.&lt;/p&gt;

&lt;p&gt;Next, draw your molecule in the 2D structure editor:&lt;/p&gt;

&lt;p&gt;&lt;center&gt;&lt;img src="http://depth-first.com/demo/20071113/draw.png"&gt;&lt;/img&gt;&lt;/center&gt;&lt;/p&gt;

&lt;p&gt;Then click the "Done" button.&lt;/p&gt;

&lt;p&gt;Before starting the query (by clicking the "Search" button), be sure to select the "Substructure" option under "Search Type."&lt;/p&gt;

&lt;h4&gt;Exporting the Results&lt;/h4&gt;

&lt;p&gt;You should now be looking at a screen containing the first few hits of a 7700+ hitset. But how do we export these results in SD Format?&lt;/p&gt;

&lt;p&gt;Next to a field labeled "Display", you'll see a drop-down box containing several different options. Choose the one labeled "PubChem Download."&lt;/p&gt;

&lt;p&gt;&lt;center&gt;&lt;img src="http://depth-first.com/demo/20071113/export.png"&gt;&lt;/img&gt;&lt;/center&gt;&lt;/p&gt;

&lt;p&gt;You'll be redirected to a download page from which you can select output formats, including SDF, or SD File. You can also select a compression type (datasets of even 2000 records can be quite large uncompressed). For this example, we'll select SDF format with GZip compression.&lt;/p&gt;

&lt;p&gt;Clicking on the "Download" button takes us to a status page that eventually informs us when our download has been processed. You should then get a "Save File" dialog or something similar. If not, you should see a link to the compressed SD file.&lt;/p&gt;

&lt;p&gt;Downloading the results file completes the process.&lt;/p&gt;</description>
      <pubDate>Tue, 13 Nov 2007 16:43:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:978ad5ab-d385-4905-abc6-2d9025a601d0</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2007/11/13/create-your-own-pubchem-datasets-exporting-results-as-sd-files</link>
      <category>Tools</category>
      <category>pubchem</category>
      <category>sdfile</category>
      <category>dataset</category>
    </item>
    <item>
      <title>Parsing SD Files with Ruby and Rubidium</title>
      <description>&lt;p&gt;&lt;a href="http://rbtk.rubyforge.org"&gt;&lt;img src="http://depth-first.com/demo/20071015/rubidium.png" align="right"&gt;&lt;/img&gt;&lt;/a&gt;Reading SD files is a bread-and-butter cheminformatics operation. At a minimum, a cheminformatics toolkit needs to parse the individual entries of an SD file, and provide access to the embedded molfile and data hash for each.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://depth-first.com/articles/tag/rubidium"&gt;Recent articles&lt;/a&gt; have introduced &lt;a href="http://rbtk.rubyforge.org"&gt;Rubidium&lt;/a&gt;, a Ruby cheminformatics scripting environment. The Rubidium team now announces the release of &lt;a href="http://rubyforge.org/frs/?group_id=4671"&gt;Rubidium-0.1.1&lt;/a&gt;, which, among other features, introduces the ability to parse SD files.&lt;/p&gt;

&lt;h4&gt;Prerequisites&lt;/h4&gt;

&lt;p&gt;Rubidium is designed to run on &lt;a href="http://jruby.codehaus.org/"&gt;JRuby&lt;/a&gt;. Installing JRuby is straightforward on Unix-like systems. First, download the &lt;a href="http://dist.codehaus.org/jruby/jruby-bin-1.1b1.tar.gz"&gt;JRuby-1.1b1 binary release&lt;/a&gt;. Then, unpack the archive to your directory of choice. Set &lt;tt&gt;$JRUBY_HOME&lt;/tt&gt; and &lt;tt&gt;$JAVA_HOME&lt;/tt&gt;. Finally, add &lt;tt&gt;$JRUBY_HOME/bin&lt;/tt&gt; to your path.&lt;/p&gt;

&lt;h4&gt;Installing Rubidium-0.1.1&lt;/h4&gt;

&lt;p&gt;Generally speaking, it should be possible to install Rubidium with a one-line command to RubyGems:&lt;/p&gt;

&lt;div class="console"&gt;
&lt;pre&gt;
$ jruby -S gem install rbtk
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Unfortunately, at the time of this writing, I was receiving the mysterious &lt;a href="http://www.google.com/search?q=rubygems+%22ERROR:++While+executing+gem+...+OpenURI::HTTPError%22&amp;amp;hl=en&amp;amp;pwst=1&amp;amp;start=0&amp;amp;sa=N"&gt;RubyGems 404 error&lt;/a&gt; from the RubyForge remote repository:&lt;/p&gt;

&lt;div class="console"&gt;
&lt;pre&gt;
$ jruby -S gem install rbtk
Select which gem to install for your platform (java)
 1. rbtk 0.1.1 (java)
 2. rbtk 0.1.0 (java)
 3. Skip this gem
 4. Cancel installation
&gt; 1
ERROR:  While executing gem ... (OpenURI::HTTPError)
    404 Not Found
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;This appears to affect only certain RubyGems on RubyForge - possibly only those with multiple versions. It seems to be an error on the RubyForge server that occasionally appears and then disappears.&lt;/p&gt;

&lt;p&gt;As a workaround, you can &lt;a href="http://rubyforge.org/frs/download.php/27819/rbtk-0.1.1-jruby.gem"&gt;download the Rubidium gem&lt;/a&gt; and install it manually:&lt;/p&gt;

&lt;div class="console"&gt;
&lt;pre&gt;
$ jruby -S gem install tmp/rbtk-0.1.1-jruby.gem
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Because Rubidium-0.1.1 introduces an &lt;a href="http://rubyforge.org/projects/activesupport/"&gt;Active Support&lt;/a&gt; dependency, you will need to install that library before installing Rubidium:&lt;/p&gt;

&lt;div class="console"&gt;
&lt;pre&gt;
$ jruby -S gem install tmp/rbtk-0.1.1-jruby.gem
ERROR:  While executing gem ... (RuntimeError)
    Error instaling tmp/rbtk-0.1.1-jruby.gem:
        rbtk requires activesupport &gt;= 1.4.2
$ jruby -S gem install activesupport
Successfully installed activesupport-1.4.4
Installing ri documentation for activesupport-1.4.4...
Installing RDoc documentation for activesupport-1.4.4...
$ jruby -S gem install tmp/rbtk-0.1.1-jruby.gem
Successfully installed rbtk, version 0.1.1
Installing ri documentation for rbtk-0.1.1-jruby...
Installing RDoc documentation for rbtk-0.1.1-jruby...
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;It's possible that the RubyForge 404 issue will be resolved by the time you read this article, so &lt;tt&gt;jruby -S gem install rbtk&lt;/tt&gt; should be tried first.&lt;/p&gt;

&lt;h4&gt;Parsing an SD File&lt;/h4&gt;

&lt;p&gt;Let's say we'd like to extract all InChIs from a PubChem dataset. If you don't have one handy, a compilation of about 2000 PubChem benzodiazepines has been &lt;a href="http://rubyforge.org/frs/download.php/27768/pubchem_benzodiazepine_20071110.sdf.gz"&gt;deposited on RubyForge&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;With our unzipped datafile in our working directory, we can now test the SD File parser by saving the following library to a file called &lt;strong&gt;parse.rb&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_ruby "&gt;&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;rubygems&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;
&lt;span class="ident"&gt;gem&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;rbtk&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;
&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;rubidium/sdf&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;

&lt;span class="keyword"&gt;def &lt;/span&gt;&lt;span class="method"&gt;parse_sd&lt;/span&gt; &lt;span class="ident"&gt;filename&lt;/span&gt;
  &lt;span class="ident"&gt;p&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;Rubidium&lt;/span&gt;&lt;span class="punct"&gt;::&lt;/span&gt;&lt;span class="constant"&gt;SDF&lt;/span&gt;&lt;span class="punct"&gt;::&lt;/span&gt;&lt;span class="constant"&gt;Parser&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;new&lt;/span&gt; &lt;span class="constant"&gt;File&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;new&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;filename&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;

  &lt;span class="ident"&gt;p&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;each&lt;/span&gt; &lt;span class="keyword"&gt;do&lt;/span&gt; &lt;span class="punct"&gt;|&lt;/span&gt;&lt;span class="ident"&gt;entry&lt;/span&gt;&lt;span class="punct"&gt;|&lt;/span&gt;
    &lt;span class="ident"&gt;puts&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;InChI: &lt;span class="expr"&gt;#{entry['PUBCHEM_NIST_INCHI']}&lt;/span&gt;&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;
  &lt;span class="keyword"&gt;end&lt;/span&gt;
&lt;span class="keyword"&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The library can then be tested with &lt;tt&gt;jirb&lt;/tt&gt;:&lt;/p&gt;

&lt;div class="console"&gt;
&lt;pre&gt;
$ jirb
irb(main):001:0&gt; require 'parse'
=&gt; true
irb(main):002:0&gt; parse_sd 'pubchem_benzodiazepine_20071110.sdf'
InChI: InChI=1/C16H12Cl2N2O/c1-20-14-7-6-12(18)8-13(14)16(19-9-15(20)21)10-2-4-11(17)5-3-10/h2-8H,9H2,1H3

[truncated]
&lt;/pre&gt;
&lt;/div&gt;

&lt;h4&gt;RSpec and Behavior-Driven Development&lt;/h4&gt;

&lt;p&gt;If you &lt;a href="http://rubyforge.org/frs/download.php/27820/rbtk-0.1.1.tar.gz"&gt;check out the Rubidium source distribution&lt;/a&gt;, you'll notice that the SD parser library is tested with &lt;a href="http://rspec.rubyforge.org/"&gt;RSpec&lt;/a&gt;, the &lt;a href="http://en.wikipedia.org/wiki/Behavior_driven_development"&gt;BDD&lt;/a&gt; framework for Ruby. Ultimately, all components of Rubidium will be tested and documented this way.&lt;/p&gt;

&lt;h4&gt;Acknowledgments&lt;/h4&gt;

&lt;p&gt;Rubidium's new SD file parser was written by &lt;a href="http://www.moseshohman.com/"&gt;Moses Hohman&lt;/a&gt;. It was kindly donated by &lt;a href="http://www.collaborativedrug.com/"&gt;Collaborative Drug Discovery&lt;/a&gt;, who have built their drug discovery application using &lt;a href="http://rubyonrails.com"&gt;Ruby on Rails&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;Future Directions&lt;/h4&gt;

&lt;p&gt;One problem in working with SD files is pinpointing encoding errors. A parser should not only raise an exception, but point to a line number and identify offending text to aid debugging. Rubidium's SD parser will eventually incorporate these enhancements.&lt;/p&gt;
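
&lt;p&gt;To make the idea concrete, here is a minimal sketch of a record reader that reports where a problem starts. It is not Rubidium's parser: the names &lt;tt&gt;SDParseError&lt;/tt&gt; and &lt;tt&gt;each_checked_record&lt;/tt&gt; are hypothetical, and the only checks assumed are that every record contains an "M  END" line and is closed by "$$$$".&lt;/p&gt;

```ruby
# Hypothetical error type (not part of Rubidium) that records the line
# number at which the faulty record starts and the text found there.
SDParseError = Class.new(StandardError) do
  attr_reader :line_number, :text

  def initialize(line_number, text, reason)
    @line_number = line_number
    @text = text
    super("line #{line_number}: #{reason}: #{text.inspect}")
  end
end

# Yields each record of an SD File read from +io+. Raises SDParseError if
# a record lacks an "M  END" line or is not closed by "$$$$".
def each_checked_record(io)
  lines = []
  start = 1          # 1-based line number of the current record's first line
  saw_end = false

  io.each_line.with_index(1) do |line, number|
    lines.push(line)
    saw_end = true if line.start_with?("M  END")
    next unless line.start_with?("$$$$")

    unless saw_end
      raise SDParseError.new(start, lines.first, "record has no M  END line")
    end

    yield lines.join
    lines = []
    start = number + 1
    saw_end = false
  end

  unless lines.empty?
    raise SDParseError.new(start, lines.first, "record is not closed by $$$$")
  end
end
```

&lt;p&gt;Rescuing &lt;tt&gt;SDParseError&lt;/tt&gt; gives the caller both &lt;tt&gt;line_number&lt;/tt&gt; and &lt;tt&gt;text&lt;/tt&gt;, which is usually enough to find the damaged record in a large file.&lt;/p&gt;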

&lt;p&gt;Because Rubidium runs on JRuby, performance gains may be achievable by re-writing select portions in Java.&lt;/p&gt;

&lt;p&gt;Parsing SD files is only the beginning of the story. Many cheminformatics applications need a convenient, fast, and robust method for &lt;em&gt;writing&lt;/em&gt; molfiles. This is also something Rubidium will attempt to provide.&lt;/p&gt;

&lt;p&gt;If your company or organization is curious about Ruby and cheminformatics, give Rubidium a try. Rubidium is licensed under the permissive &lt;a href="http://www.opensource.org/licenses/mit-license.php"&gt;MIT License&lt;/a&gt; to make collaboration as simple as possible.&lt;/p&gt;</description>
      <pubDate>Mon, 12 Nov 2007 11:27:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:8e195fb8-22d0-4ea3-a2bd-40f44281fc8f</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2007/11/12/parsing-sd-files-with-ruby-and-rubidium</link>
      <category>Tools</category>
      <category>rubidium</category>
      <category>ruby</category>
      <category>cdd</category>
      <category>sdfile</category>
      <category>sdf</category>
      <category>bdd</category>
      <category>rspec</category>
      <category>jruby</category>
    </item>
    <item>
      <title>Hacking PubChem: Direct Access with FTP</title>
      <description>&lt;p&gt;&lt;img src="http://depth-first.com/files/pubchemlogo.gif" align="right"&gt;&lt;/img&gt;A &lt;a href="http://depth-first.com/articles/2006/09/22/hacking-pubchem-why-the-open-access-fight-is-just-the-beginning"&gt;previous article&lt;/a&gt; in the &lt;em&gt;Hacking PubChem&lt;/em&gt; series pointed out that the entire PubChem database can be &lt;a href="ftp://ftp.ncbi.nlm.nih.gov/pubchem/"&gt;downloaded via FTP&lt;/a&gt;. This article shows how simple tools written in Ruby can be used to efficiently process the massive amount of data on PubChem's FTP-server.&lt;/p&gt;

&lt;h4&gt;Prerequisites&lt;/h4&gt;

&lt;p&gt;The only software you'll need for this tutorial is &lt;a href="http://www.ruby-lang.org/en/"&gt;Ruby&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;Organization of PubChem's FTP-Server&lt;/h4&gt;

&lt;p&gt;PubChem is a big database. To deal with its size, the FTP-server spreads its contents over about 950 files. Each file covers a contiguous range of Compound Identification Numbers (CIDs); the size of each range appears to be set at 10,000 [&lt;em&gt;Now 25,000, see below&lt;/em&gt;]. In some of the files I've examined, the actual number of compounds in a given block was fewer than 10,000. The root directory containing the files can be accessed &lt;a href="ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;Compression Saves the Day&lt;/h4&gt;

&lt;p&gt;For storage and transmission efficiency, PubChem's SDF files are compressed using the GZip algorithm, giving files that typically range in size from five to seven megabytes. Compression ratios for the files I've examined are about 10:1. I'm calling these files "SDFGZ" files, and they have the extension &lt;tt&gt;*.sdf.gz&lt;/tt&gt;.&lt;/p&gt;

&lt;p&gt;A back-of-the-envelope calculation, based on 950 files with an average size of 6 MB and a compression ratio of 10:1, gives an approximate storage requirement of 57 GB for the uncompressed PubChem database. Although storing this much data is feasible with today's hardware, there are many better uses for storage space. This is especially true if only a few fields of the PubChem database are of interest.&lt;/p&gt;
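
&lt;p&gt;The estimate can be reproduced in a few lines of Ruby. The figures are the rough ones quoted above, and the conversion assumes 1 GB = 1000 MB:&lt;/p&gt;

```ruby
# Rough estimate of PubChem's uncompressed size from the quoted figures.
file_count = 950          # approximate number of *.sdf.gz files
average_size_mb = 6       # average compressed size per file, in MB
compression_ratio = 10    # uncompressed size : compressed size

compressed_mb = file_count * average_size_mb
# Assumption: decimal units, 1 GB = 1000 MB.
uncompressed_gb = compressed_mb * compression_ratio / 1000.0

puts "Compressed:   #{compressed_mb / 1000.0} GB"
puts "Uncompressed: #{uncompressed_gb} GB"
```

&lt;p&gt;Changing any of the three inputs shows how sensitive the total is to the assumed average file size and compression ratio.&lt;/p&gt;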

&lt;h4&gt;Setting Up&lt;/h4&gt;

&lt;p&gt;You'll need to download some SDFGZ data. This tutorial uses the &lt;a href="ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/Compound_09540001_09550000.sdf.gz"&gt;file containing CIDs 9540001-9550000&lt;/a&gt;. [&lt;em&gt;Note: PubChem recently increased the number of compounds in each sdfgz file to 25,000. This means that the link to the file no longer works. Instead, choose a file from &lt;a href="ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/"&gt;here&lt;/a&gt;&lt;/em&gt;.] Put this file in your working directory.&lt;/p&gt;

&lt;h4&gt;A Short Library&lt;/h4&gt;

&lt;p&gt;Create a file called &lt;strong&gt;sdfgz.rb&lt;/strong&gt; containing the following code:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_ruby "&gt;&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;zlib&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;

&lt;span class="comment"&gt;# A simple splitter for *.sdf.gz files available&lt;/span&gt;
&lt;span class="comment"&gt;# from PubChem's FTP-server.&lt;/span&gt;
&lt;span class="keyword"&gt;class &lt;/span&gt;&lt;span class="class"&gt;SDFGZSplitter&lt;/span&gt;
  &lt;span class="attribute"&gt;@@stop&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;$$$$&lt;span class="escape"&gt;\n&lt;/span&gt;&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;
  &lt;span class="attribute"&gt;@@blank&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;

  &lt;span class="comment"&gt;# Configures this SDFGZSplitter using the &amp;lt;tt&amp;gt;IO&amp;lt;/tt&amp;gt;&lt;/span&gt;
  &lt;span class="comment"&gt;# object &amp;lt;tt&amp;gt;io&amp;lt;/tt&amp;gt;.&lt;/span&gt;
  &lt;span class="keyword"&gt;def &lt;/span&gt;&lt;span class="method"&gt;initialize&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;io&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;
    &lt;span class="attribute"&gt;@gzip&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;Zlib&lt;/span&gt;&lt;span class="punct"&gt;::&lt;/span&gt;&lt;span class="constant"&gt;GzipReader&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;new&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;io&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;
  &lt;span class="keyword"&gt;end&lt;/span&gt;

  &lt;span class="comment"&gt;# Yield a sequence of SDFile records.&lt;/span&gt;
  &lt;span class="keyword"&gt;def &lt;/span&gt;&lt;span class="method"&gt;each_record&lt;/span&gt;
    &lt;span class="ident"&gt;record&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;get_record&lt;/span&gt;

    &lt;span class="keyword"&gt;while&lt;/span&gt; &lt;span class="ident"&gt;record&lt;/span&gt; &lt;span class="punct"&gt;!=&lt;/span&gt; &lt;span class="attribute"&gt;@@blank&lt;/span&gt;
      &lt;span class="keyword"&gt;yield&lt;/span&gt; &lt;span class="ident"&gt;record&lt;/span&gt;
      &lt;span class="ident"&gt;record&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;get_record&lt;/span&gt;
    &lt;span class="keyword"&gt;end&lt;/span&gt;
  &lt;span class="keyword"&gt;end&lt;/span&gt;

  &lt;span class="comment"&gt;# Gets the next record, or an empty string if&lt;/span&gt;
  &lt;span class="comment"&gt;# none is available.&lt;/span&gt;
  &lt;span class="keyword"&gt;def &lt;/span&gt;&lt;span class="method"&gt;get_record&lt;/span&gt;
    &lt;span class="ident"&gt;line&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;read_line&lt;/span&gt;
    &lt;span class="ident"&gt;record&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="punct"&gt;[&lt;/span&gt;&lt;span class="ident"&gt;line&lt;/span&gt;&lt;span class="punct"&gt;]&lt;/span&gt;

    &lt;span class="keyword"&gt;while&lt;/span&gt; &lt;span class="punct"&gt;!(&lt;/span&gt;&lt;span class="attribute"&gt;@@stop&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;eql?&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;line&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt; &lt;span class="punct"&gt;||&lt;/span&gt; &lt;span class="constant"&gt;nil&lt;/span&gt; &lt;span class="punct"&gt;==&lt;/span&gt; &lt;span class="ident"&gt;line&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;
      &lt;span class="ident"&gt;line&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;read_line&lt;/span&gt;
      &lt;span class="ident"&gt;record&lt;/span&gt; &lt;span class="punct"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="ident"&gt;line&lt;/span&gt;
    &lt;span class="keyword"&gt;end&lt;/span&gt;

    &lt;span class="ident"&gt;record&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;join&lt;/span&gt;
  &lt;span class="keyword"&gt;end&lt;/span&gt;

  &lt;span class="ident"&gt;private&lt;/span&gt;

  &lt;span class="comment"&gt;# Reads the next line in the SDFGZ file.&lt;/span&gt;
  &lt;span class="keyword"&gt;def &lt;/span&gt;&lt;span class="method"&gt;read_line&lt;/span&gt;
    &lt;span class="keyword"&gt;begin&lt;/span&gt;
      &lt;span class="ident"&gt;line&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="attribute"&gt;@gzip&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;readline&lt;/span&gt;
    &lt;span class="keyword"&gt;rescue&lt;/span&gt; &lt;span class="constant"&gt;EOFError&lt;/span&gt;
      &lt;span class="keyword"&gt;return&lt;/span&gt; &lt;span class="constant"&gt;nil&lt;/span&gt;
    &lt;span class="keyword"&gt;end&lt;/span&gt;

    &lt;span class="ident"&gt;line&lt;/span&gt;
  &lt;span class="keyword"&gt;end&lt;/span&gt;
&lt;span class="keyword"&gt;end&lt;/span&gt;

&lt;span class="comment"&gt;# Utility class for getting data out of a SDFile record.&lt;/span&gt;
&lt;span class="keyword"&gt;class &lt;/span&gt;&lt;span class="class"&gt;Extractor&lt;/span&gt;
  &lt;span class="comment"&gt;# Gets the data from &amp;lt;tt&amp;gt;record&amp;lt;/tt&amp;gt; associated with&lt;/span&gt;
  &lt;span class="comment"&gt;# &amp;lt;tt&amp;gt;key&amp;lt;/tt&amp;gt;.&lt;/span&gt;
  &lt;span class="keyword"&gt;def &lt;/span&gt;&lt;span class="method"&gt;self.extract_data&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;record&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="ident"&gt;key&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;
    &lt;span class="ident"&gt;record&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;match&lt;/span&gt;&lt;span class="punct"&gt;(/&lt;/span&gt;&lt;span class="regex"&gt;&amp;gt; &amp;lt;&lt;span class="expr"&gt;#{key}&lt;/span&gt;&amp;gt;&lt;span class="escape"&gt;\n&lt;/span&gt;(.+)&lt;span class="escape"&gt;\n&lt;/span&gt;&lt;/span&gt;&lt;span class="punct"&gt;/)&lt;/span&gt;
    &lt;span class="global"&gt;$1&lt;/span&gt;
  &lt;span class="keyword"&gt;end&lt;/span&gt;

  &lt;span class="comment"&gt;# Gets the molfile for &amp;lt;tt&amp;gt;record&amp;lt;/tt&amp;gt;.&lt;/span&gt;
  &lt;span class="keyword"&gt;def &lt;/span&gt;&lt;span class="method"&gt;self.extract_molfile&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;record&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;
    &lt;span class="ident"&gt;record&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;match&lt;/span&gt;&lt;span class="punct"&gt;(/&lt;/span&gt;&lt;span class="regex"&gt;M  END$&lt;/span&gt;&lt;span class="punct"&gt;/).&lt;/span&gt;&lt;span class="ident"&gt;pre_match&lt;/span&gt; &lt;span class="punct"&gt;+&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;M  END&lt;span class="escape"&gt;\n&lt;/span&gt;&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;
  &lt;span class="keyword"&gt;end&lt;/span&gt;
&lt;span class="keyword"&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;tt&gt;SDFGZSplitter&lt;/tt&gt; class uses &lt;tt&gt;Zlib::GzipReader&lt;/tt&gt;, from Ruby's built-in &lt;tt&gt;zlib&lt;/tt&gt; library, to read SDFGZ files as a stream, without first inflating them on disk. The method &lt;tt&gt;each_record&lt;/tt&gt; is a &lt;a href="http://www.rubycentral.com/book/tut_containers.html"&gt;Ruby iterator&lt;/a&gt;, one of the strangely cool things that make Ruby the language it is. The iterator's job is to hand each record in the SDFGZ file to the calling block, one at a time, until all records have been retrieved.&lt;/p&gt;
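
&lt;p&gt;To see the iterator idea in isolation, here is a minimal sketch that builds a tiny gzipped SD file in memory and walks it record by record. The &lt;tt&gt;each_sd_record&lt;/tt&gt; helper and the two placeholder records are made up for illustration and are not part of the &lt;tt&gt;sdfgz&lt;/tt&gt; library; the sketch assumes only &lt;tt&gt;zlib&lt;/tt&gt; and &lt;tt&gt;stringio&lt;/tt&gt; from the standard library, and a Ruby recent enough to provide &lt;tt&gt;Zlib.gzip&lt;/tt&gt;.&lt;/p&gt;

```ruby
require 'zlib'
require 'stringio'

# Hypothetical helper (not part of sdfgz): yields each SD file record
# read from a gzip-compressed IO, treating a "$$$$" line as the
# record delimiter.
def each_sd_record(io)
  gzip = Zlib::GzipReader.new(io)
  lines = []

  gzip.each_line do |line|
    lines.push(line)

    # A delimiter line closes the current record.
    if line.chomp == '$$$$'
      yield lines.join
      lines = []
    end
  end
end

# Two placeholder records; real records hold a molfile and data items.
sample = "first record\nM  END\n$$$$\nsecond record\nM  END\n$$$$\n"
compressed = StringIO.new(Zlib.gzip(sample))

records = []
each_sd_record(compressed) { |record| records.push(record) }

puts records.length
puts records.first
```

&lt;p&gt;Because the helper only asks for something that behaves like an &lt;tt&gt;IO&lt;/tt&gt;, the same loop works whether the compressed bytes come from a file or from memory.&lt;/p&gt;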

&lt;h4&gt;Using the Library&lt;/h4&gt;

&lt;p&gt;As a test for the &lt;tt&gt;sdfgz&lt;/tt&gt; library, let's scrape all PubChem CIDs and InChI identifiers from an SDFGZ file and write the result to a new CSV file. Enter the following code, either in a file to be run by &lt;tt&gt;ruby&lt;/tt&gt; or in a terminal session using &lt;tt&gt;irb&lt;/tt&gt;:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_ruby "&gt;&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;sdfgz&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;

&lt;span class="ident"&gt;file&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;File&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;new&lt;/span&gt;&lt;span class="punct"&gt;('&lt;/span&gt;&lt;span class="string"&gt;Compound_09540001_09550000.sdf.gz&lt;/span&gt;&lt;span class="punct"&gt;')&lt;/span&gt;
&lt;span class="ident"&gt;splitter&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;SDFGZSplitter&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;new&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;file&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;

&lt;span class="ident"&gt;puts&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;parsing...&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;

&lt;span class="constant"&gt;File&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;open&lt;/span&gt;&lt;span class="punct"&gt;('&lt;/span&gt;&lt;span class="string"&gt;dictionary.csv&lt;/span&gt;&lt;span class="punct"&gt;',&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;w+&lt;/span&gt;&lt;span class="punct"&gt;')&lt;/span&gt; &lt;span class="keyword"&gt;do&lt;/span&gt; &lt;span class="punct"&gt;|&lt;/span&gt;&lt;span class="ident"&gt;csv&lt;/span&gt;&lt;span class="punct"&gt;|&lt;/span&gt;
  &lt;span class="ident"&gt;splitter&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;each_record&lt;/span&gt; &lt;span class="keyword"&gt;do&lt;/span&gt; &lt;span class="punct"&gt;|&lt;/span&gt;&lt;span class="ident"&gt;record&lt;/span&gt;&lt;span class="punct"&gt;|&lt;/span&gt;
    &lt;span class="ident"&gt;cid&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;Extractor&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;extract_data&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;record&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;PUBCHEM_COMPOUND_CID&lt;/span&gt;&lt;span class="punct"&gt;')&lt;/span&gt;
    &lt;span class="ident"&gt;inchi&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;Extractor&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;extract_data&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;record&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;PUBCHEM_NIST_INCHI&lt;/span&gt;&lt;span class="punct"&gt;')&lt;/span&gt;

    &lt;span class="ident"&gt;csv&lt;/span&gt; &lt;span class="punct"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;&lt;span class="expr"&gt;#{cid}&lt;/span&gt;,&lt;span class="escape"&gt;\&amp;quot;&lt;/span&gt;&lt;span class="expr"&gt;#{inchi}&lt;/span&gt;&lt;span class="escape"&gt;\&amp;quot;\n&lt;/span&gt;&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;
  &lt;span class="keyword"&gt;end&lt;/span&gt;
&lt;span class="keyword"&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Running this test creates a (rather large) file called &lt;strong&gt;dictionary.csv&lt;/strong&gt; in your working directory. Its first few lines look like this:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_default "&gt;9540001,&amp;quot;InChI=1/C20H22N2O4/c1-13-7-5-10-16(14(13)2)22-20(26)15-8-3-4-9-17(15)21-18(23)11-6-12-19(24)25/h3-5,7-10H,6,11-12H2,1-2H3,(H,21,23)(H,22,26)(H,24,25)/p-1/fC20H21N2O4/h21-22H/q-1&amp;quot;
9540002,&amp;quot;InChI=1/C20H22N2O4/c1-13-7-5-10-16(14(13)2)22-20(26)15-8-3-4-9-17(15)21-18(23)11-6-12-19(24)25/h3-5,7-10H,6,11-12H2,1-2H3,(H,21,23)(H,22,26)(H,24,25)/f/h21-22,24H&amp;quot;
9540003,&amp;quot;InChI=1/C19H20N2O5/c1-26-16-8-3-7-15(12-16)21-19(25)13-5-2-6-14(11-13)20-17(22)9-4-10-18(23)24/h2-3,5-8,11-12H,4,9-10H2,1H3,(H,20,22)(H,21,25)(H,23,24)/p-1/fC19H19N2O5/h20-21H/q-1&amp;quot;
9540004,&amp;quot;InChI=1/C19H20N2O5/c1-26-16-8-3-7-15(12-16)21-19(25)13-5-2-6-14(11-13)20-17(22)9-4-10-18(23)24/h2-3,5-8,11-12H,4,9-10H2,1H3,(H,20,22)(H,21,25)(H,23,24)/f/h20-21,23H&amp;quot;

...&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
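
&lt;p&gt;Because each InChI contains commas of its own, anything that reads this file back needs to split each line on the first comma only. The following is a rough sketch of one way to do that; &lt;tt&gt;parse_dictionary_line&lt;/tt&gt; is a hypothetical helper, not part of the &lt;tt&gt;sdfgz&lt;/tt&gt; library, and the two sample lines are copied from the output above.&lt;/p&gt;

```ruby
# Hypothetical helper: splits one line of dictionary.csv into a CID and
# an InChI. Only the first comma separates the two fields; later commas
# belong to the InChI itself.
def parse_dictionary_line(line)
  cid, quoted_inchi = line.chomp.split(',', 2)

  # Strip the surrounding double quotes written by the export script.
  inchi = quoted_inchi.delete_prefix('"').delete_suffix('"')

  [cid.to_i, inchi]
end

lines = [
  '9540001,"InChI=1/C20H22N2O4/c1-13-7-5-10-16(14(13)2)22-20(26)15-8-3-4-9-17(15)21-18(23)11-6-12-19(24)25/h3-5,7-10H,6,11-12H2,1-2H3,(H,21,23)(H,22,26)(H,24,25)/p-1/fC20H21N2O4/h21-22H/q-1"',
  '9540002,"InChI=1/C20H22N2O4/c1-13-7-5-10-16(14(13)2)22-20(26)15-8-3-4-9-17(15)21-18(23)11-6-12-19(24)25/h3-5,7-10H,6,11-12H2,1-2H3,(H,21,23)(H,22,26)(H,24,25)/f/h21-22,24H"'
]

# Build a CID-to-InChI lookup table from the parsed pairs.
dictionary = lines.map { |line| parse_dictionary_line(line) }.to_h

puts dictionary[9540002]
```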

&lt;p&gt;Many customizations of the above code are possible. For example, it would not be difficult to programmatically log into the PubChem FTP server, download a file, and process it as shown. By parsing the SDFGZ filename, a program could even determine which file contains a given CID. Because the &lt;tt&gt;SDFGZSplitter&lt;/tt&gt; constructor takes a Ruby &lt;tt&gt;IO&lt;/tt&gt; object, it's also feasible to process PubChem's SDFGZ files directly from the FTP server, without downloading them beforehand. But that's a subject for another day.&lt;/p&gt;
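
&lt;p&gt;As a small illustration of the filename idea, the sketch below parses the CID range out of an SDFGZ filename and checks whether a given CID falls inside it. The &lt;tt&gt;cid_range&lt;/tt&gt; helper is hypothetical, and the code assumes that every filename follows the same pattern as the file used above, with the first and last CID embedded in the name.&lt;/p&gt;

```ruby
# Hypothetical helper: returns the range of CIDs named in an SDFGZ
# filename, or nil if the name does not follow the assumed pattern.
def cid_range(filename)
  match = filename.match(/\ACompound_(\d+)_(\d+)\.sdf\.gz\z/)
  return nil if match.nil?

  # to_i reads the zero-padded digits as plain decimal numbers.
  (match[1].to_i..match[2].to_i)
end

range = cid_range('Compound_09540001_09550000.sdf.gz')

puts range.cover?(9540003)
puts range.cover?(9550001)
```

&lt;p&gt;Given a list of filenames, a program could run each one through a helper like this and keep the first whose range covers the CID of interest.&lt;/p&gt;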

&lt;h4&gt;Summing Up&lt;/h4&gt;

&lt;p&gt;The PubChem FTP server is a treasure trove of useful data that's available free of charge. Using simple tools like those discussed here, it's possible to generate a virtually infinite variety of customized views of this valuable resource. Many creative and novel applications are possible by combining the capabilities shown here with those of Open Source chemical informatics software, such as &lt;a href="http://depth-first.com/articles/2006/09/26/looking-at-inchis"&gt;RCDK&lt;/a&gt;, and other open data sources, such as &lt;a href="http://depth-first.com/articles/2006/09/04/hacking-nmrshiftdb"&gt;NMRShiftDB&lt;/a&gt;.&lt;/p&gt;</description>
      <pubDate>Fri, 29 Sep 2006 01:59:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:698cbe0d-c45a-4d91-95c0-682e0c7d6a6f</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2006/09/29/hacking-pubchem-direct-access-with-ftp</link>
      <category>Databases</category>
      <category>pubchem</category>
      <category>ftp</category>
      <category>ruby</category>
      <category>sdfgz</category>
      <category>sdfile</category>
      <category>gzip</category>
    </item>
  </channel>
</rss>
