Parsing SD Files with Ruby and Rubidium
Reading SD files is a bread-and-butter cheminformatics operation. At a minimum, a cheminformatics toolkit needs to parse the individual entries of an SD file, and provide access to the embedded molfile and data hash for each.
Recent articles have introduced Rubidium, a Ruby cheminformatics scripting environment. The Rubidium team now announces the release of Rubidium-0.1.1, which, among other features, introduces the ability to parse SD files.
Prerequisites
Rubidium is designed to run on JRuby. Installing JRuby is straightforward on unix-like systems. First, download the JRuby-1.1b1 binary release. Then, unpack the archive to your directory of choice. Set $JRUBY_HOME and $JAVA_HOME. Finally, add $JRUBY_HOME/bin to your path.
Installing Rubidium-0.1.1
Generally speaking, it should be possible to install Rubidium with a one-line command to RubyGems:
$ jruby -S gem install rbtk
Unfortunately, at the time of this writing, I was receiving the mysterious RubyGems 404 error from the RubyForge remote repository:
$ jruby -S gem install rbtk
Select which gem to install for your platform (java)
1. rbtk 0.1.1 (java)
2. rbtk 0.1.0 (java)
3. Skip this gem
4. Cancel installation
> 1
ERROR: While executing gem ... (OpenURI::HTTPError)
404 Not Found
This appears to affect only certain RubyGems on RubyForge - possibly only those with multiple versions. It seems to be an error on the RubyForge server that occasionally appears and then disappears.
As a workaround, you can download the Rubidium gem and install it manually:
$ jruby -S gem install tmp/rbtk-0.1.1-jruby.gem
Because Rubidium-0.1.1 introduces an Active Support dependency, you will need to install that library before installing Rubidium:
$ jruby -S gem install tmp/rbtk-0.1.1-jruby.gem
ERROR: While executing gem ... (RuntimeError)
Error instaling tmp/rbtk-0.1.1-jruby.gem:
rbtk requires activesupport >= 1.4.2
$ jruby -S gem install activesupport
Successfully installed activesupport-1.4.4
Installing ri documentation for activesupport-1.4.4...
Installing RDoc documentation for activesupport-1.4.4...
$ jruby -S gem install tmp/rbtk-0.1.1-jruby.gem
Successfully installed rbtk, version 0.1.1
Installing ri documentation for rbtk-0.1.1-jruby...
Installing RDoc documentation for rbtk-0.1.1-jruby...
It's possible that the RubyForge 404 issue will be resolved by the time you read this article, so try jruby -S gem install rbtk first.
Parsing an SD File
Let's say we'd like to extract all InChIs from a PubChem dataset. If you don't have one handy, a compilation of about 2000 PubChem benzodiazepines has been deposited on RubyForge.
With our unzipped datafile in our working directory, we can now test the SD File parser by saving the following library to a file called parse.rb:
require 'rubygems'
gem 'rbtk'
require 'rubidium/sdf'
def parse_sd(filename)
  parser = Rubidium::SDF::Parser.new(File.new(filename))
  parser.each do |entry|
    puts "InChI: #{entry['PUBCHEM_NIST_INCHI']}"
  end
end
Then load the library from jirb and call parse_sd on the datafile:
$ jirb
irb(main):001:0> require 'parse'
=> true
irb(main):002:0> parse_sd 'pubchem_benzodiazepine_20071110.sdf'
InChI: InChI=1/C16H12Cl2N2O/c1-20-14-7-6-12(18)8-13(14)16(19-9-15(20)21)10-2-4-11(17)5-3-10/h2-8H,9H2,1H3
[truncated]
RSpec and Behavior-Driven Development
If you check out the Rubidium source distribution, you'll notice that the SD parser library is tested with RSpec, the BDD framework for Ruby. Ultimately, all components of Rubidium will be tested and documented this way.
Acknowledgments
Rubidium's new SD file parser was written by Moses Hohman. It was kindly donated by Collaborative Drug Discovery, who have built their drug discovery application using Ruby on Rails.
Future Directions
One problem in working with SD files is pinpointing encoding errors. A parser should not only raise an exception, but point to a line number and identify offending text to aid debugging. Rubidium's SD parser will eventually incorporate these enhancements.
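To make the idea of line-aware error reporting concrete, here is a minimal plain-Ruby sketch. It is not Rubidium's parser and does not use its API: the names SDParseError, SDEntry, and each_sd_entry are hypothetical, and the sketch assumes a simplified view of the format (a molfile ending in "M  END", data items introduced by "> <NAME>" lines, and records closed by "$$$$").

```ruby
require 'stringio'

# Hypothetical error class: carries the line number and the offending text.
class SDParseError < StandardError
  attr_reader :line_number, :text

  def initialize(message, line_number, text)
    @line_number = line_number
    @text = text
    super("#{message} at line #{line_number}: #{text.inspect}")
  end
end

# Hypothetical entry type: the embedded molfile plus a hash of data items.
SDEntry = Struct.new(:molfile, :data)

# Yields one SDEntry per record read from io, raising SDParseError with
# the current line number when the input does not fit the assumed layout.
def each_sd_entry(io)
  molfile = []
  data = {}
  field = nil        # name of the data item currently being read
  in_molfile = true  # true until the "M  END" line has been seen
  line_number = 0

  io.each_line do |raw|
    line_number += 1
    line = raw.chomp
    if line == '$$$$'
      raise SDParseError.new('Record ended before M  END', line_number, line) if in_molfile
      yield SDEntry.new(molfile.join("\n"), data)
      molfile = []
      data = {}
      field = nil
      in_molfile = true
    elsif in_molfile
      molfile << line
      in_molfile = false if line.start_with?('M  END')
    elsif line.start_with?('>')
      name = line[/<([^>]+)>/, 1]
      raise SDParseError.new('Data header without a <NAME>', line_number, line) unless name
      field = name
      data[field] = ''
    elsif line.empty?
      field = nil    # a blank line closes the current data item
    elsif field
      data[field] = data[field].empty? ? line : "#{data[field]}\n#{line}"
    else
      raise SDParseError.new('Text outside any data item', line_number, line)
    end
  end

  # Anything other than trailing blank lines means the last record never closed.
  leftover = molfile.any? { |l| !l.strip.empty? } || !data.empty?
  raise SDParseError.new('Unterminated record', line_number, '') if leftover
end
```

A caller can rescue SDParseError and report e.line_number and e.text directly, which is the kind of feedback that makes a malformed record in a large file quick to find.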
Because Rubidium runs on JRuby, performance gains may be achievable by re-writing select portions in Java.
Parsing SD files is only the beginning of the story. Many cheminformatics applications need a convenient, fast, and robust method for writing molfiles. This is also something Rubidium will attempt to provide.
If your company or organization is curious about Ruby and cheminformatics, give Rubidium a try. Rubidium is licensed under the permissive MIT License to make collaboration as simple as possible.
Cheminformatics for the Web: Convert SD Files to HTML with Ruby CDK
The Structure Data File (SDF) format is the de facto standard for cheminformatics data exchange. One of the problems that arises when working with SD Files, especially large ones like those distributed by PubChem, is "seeing" the structures they contain. Although commercial software packages are available for doing so, they are generally closed, unreasonably expensive, or overly complex. This article describes a simple solution to the SDF visualization problem that uses Open Source tools controlled from the elegant and agile Ruby programming language.
Cut to the Chase
This page shows the output produced by the software. You'll see a neatly arranged grid of colorful 2-D chemical structures in a Web page that was generated directly from a PubChem SDFGZ file. Each structure has a number below it, the PubChem Compound ID (CID). Both the structure and CID are hyperlinked to the Compound Summary page on PubChem. A partial screenshot is provided below.

Prerequisites
For this tutorial, you'll need Ruby CDK (RCDK). A recent article described the small amount of system configuration required for RCDK on Linux. Another article showed how to install RCDK on Windows.
Download the Software
The software described in this article can be downloaded here. Inflate this file and make it your working directory. You should see a 14 MB SDFGZ file, an RHTML template, and three Ruby files.
Ripping PubChem SD Files
The software is designed to work with PubChem SDFGZ files. The SDFGZ format simply results from the application of the gzip compression algorithm to an ordinary SD file.
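Because an SDFGZ file is just a gzipped SD file, Ruby's standard zlib library is enough to stream its records without inflating the file to disk first. The following is a minimal sketch rather than the code shipped with the download; the method name each_sdfgz_record is hypothetical.

```ruby
require 'zlib'

# Yields each SD record in a gzipped SD file as a string. A record is all
# text up to and including its "$$$$" delimiter line.
def each_sdfgz_record(path)
  Zlib::GzipReader.open(path) do |gz|
    record = []
    gz.each_line do |line|
      record << line
      if line.chomp == '$$$$'
        yield record.join
        record = []
      end
    end
  end
end
```

Reading line by line keeps memory use proportional to a single record, which matters for the larger PubChem files.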
Ripping the example SDFGZ file is just a matter of running test.rb:
$ ruby test.rb
You'll see some output indicating that various CIDs are being processed. On completion, the software will have created a directory called rip containing an HTML file and an images directory.
The Little Engine That Could: CDK's StructureDiagramGenerator
If you've ever worked with PubChem's SD Files, you'll no doubt have noticed that the molfile section encodes all hydrogen atoms, which is not general practice. Rendering these hydrogens results in a very cluttered image.
To solve this problem, the software creates its graphics from the PUBCHEM_OPENEYE_CAN_SMILES field encoded by the SDFGZ file. This SMILES string is converted into a molecular representation and coordinates are assigned by CDK's StructureDiagramGenerator.
When an image can't be rendered in this way, it is left out. This was done for CIDs 18, 115, 147, 148, 222, and 223, for example. There are three common themes in these missing structures: metals, phosphorus, and molecules with a single heavy atom. The problem may, in fact, lie in the underlying Structure-CDK software, rather than with CDK. Stay tuned for more on this.
PubChem for Debugging
In developing this SD File Ripper program, I realized that it could be used as a powerful debugging tool. Notice how the missing structures (and their SMILES strings) can easily be examined via PubChem by clicking the empty cell. The alternative would have been for the program to spit out a list of SMILES that didn't process properly and to then try to construct a mental image of what this string represents. With PubChem, we do away with this tedium altogether.
I doubt the creators of PubChem envisioned this application of their work. Surely it's but one of many still to be discovered.
Another Cool Thing About Ruby: eRuby Templates
Our SDF Ripper program creates HTML output, something for which Ruby is well-suited through its eRuby ERB library. Among other uses, ERB enables Ruby code to be embedded within HTML. This inside-out scripting capability resembles that of other templating languages such as PHP, ASP, and JSP (ERB is used extensively by the Ruby on Rails web application framework). The file template.rhtml contains the ripper's ERB template. The separation of program logic from presentation makes it very simple to customize the appearance of the output.
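For readers who haven't used ERB, here is a small self-contained sketch of the idea. The template text, the render_grid method, and the example.org link prefix are all placeholders invented for illustration; the ripper's actual template lives in template.rhtml and will differ.

```ruby
require 'erb'

# Placeholder template: one linked image and CID per table row.
TEMPLATE = <<~HTML
  <table>
  <% cids.each do |cid| %>
    <tr><td><a href="<%= base_url %><%= cid %>"><img src="images/<%= cid %>.png" /><br /><%= cid %></a></td></tr>
  <% end %>
  </table>
HTML

# Renders the template with the method's local variables (cids, base_url)
# made visible to the embedded Ruby through binding.
def render_grid(template, cids, base_url)
  ERB.new(template).result(binding)
end

puts render_grid(TEMPLATE, [18, 115], 'http://example.org/compound/')
```

Changing the look of the page then means editing only the template text, while the Ruby code that gathers the CIDs stays untouched.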
Room to Grow
Our SDF Ripper only works with SDFGZ files from PubChem. The program is short enough that it should be simple to adapt it for your specific needs. It would not be much work at all, for example, to create an HTML table containing all fields encoded by the SDFGZ file. Similarly, adding support for non-compressed SD files is straightforward. If JavaScript is your medium, the possibilities become even more interesting. How about a pop-up showing an enlarged structure and data summary when the user mouses over an image, a la Netflix?
Paging is a technique that divides large Web pages into smaller pages linked to one another. For example, Google's search results are divided into groups of ten by default. Adding paging support to the software described here would also not be difficult, and would enable the convenient browsing of much larger datasets.
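One way paging might be approached is to split the list of CIDs into fixed-size groups and record, for each group, the file names of its neighbors. This sketch is not part of the downloadable software; the paginate method and the pageN.html naming scheme are assumptions made for illustration.

```ruby
# Splits cids into pages of at most per_page items. Each page is a hash
# holding its own file name, its CIDs, and the neighboring file names
# (nil at either end) so that a template can emit previous/next links.
def paginate(cids, per_page)
  pages = cids.each_slice(per_page).to_a
  pages.each_with_index.map do |page_cids, i|
    {
      file: "page#{i + 1}.html",
      cids: page_cids,
      prev_file: i.zero? ? nil : "page#{i}.html",
      next_file: i == pages.size - 1 ? nil : "page#{i + 2}.html"
    }
  end
end
```

Each resulting hash can be handed to the same ERB template used for the single-page output, with the two link fields rendered only when they are not nil.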
Other Software That Does This
I am aware of no product, commercial or otherwise, that performs the SDF to HTML conversion in the way shown here. SciTegic does offer an HTML table component as part of its Pipeline Pilot framework, but as far as I know, no standalone version is available.
Rajarshi Guha, among his many other interesting projects, has written a Java SDF to PDF converter that uses CDK.
Conclusions
This article has demonstrated how the combination of RCDK and Ruby makes short work of converting the contents of an SD file into a Web-ready format. As usual, we've only scratched the surface of what's easily within reach. Watch for future articles to build on the concepts outlined here.

