Recombining Compressed PubChem SD Files with Open Babel

October 01, 2008

Testing ChemPhoto, a chemical structure imaging application, required SD Files containing several hundred thousand records. Although it's tempting to meet this need by constructing "dummy" files with the same record or small set of records repeated, tests are far more illuminating when real data is used.

PubChem is an excellent source of large molecular datasets, and the entire database can be downloaded by FTP. Because of PubChem's massive size, the download is split into gzipped SD Files (*.sdf.gz), each holding about 25,000 records. Although this is an excellent resource, it creates a problem: how can you conveniently recombine this set of compressed SD Files into a single SD File?

You might think about writing some "quick" code in your language of choice. Fortunately, Open Babel gets the job done without any of the coding or debugging.

The following command will create a single SD File from all of the compressed SD Files in a given directory, while also stripping explicit hydrogens and removing all fields except PUBCHEM_COMPOUND_CID.


865543 molecules converted
7 info messages 15372962 audit log messages 

Apparently, there is no way to tell babel to keep just one particular field in an SD File; each unwanted field has to be removed individually.

Still, not bad for a few seconds on the command line.
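
If keeping a single field matters more than avoiding code, a short script is one alternative. What follows is a minimal sketch in Python using only the standard library, not the Open Babel approach described above. It assumes ordinary V2000-style records, in which the structure block ends with an "M  END" line and each record ends with "$$$$". The function names filter_record and combine are made up for illustration, and the sketch does not strip explicit hydrogens.

```python
import glob
import gzip


def filter_record(lines, keep):
    """Return one record's lines with every data field except `keep` dropped.

    `lines` holds a single SD record without its "$$$$" terminator.
    """
    out = []
    in_data = False   # becomes True once the "M  END" line has been seen
    copying = False   # True while inside the data field we want to keep
    for line in lines:
        if not in_data:
            # Structure block: always copied through unchanged.
            out.append(line)
            if line.startswith("M  END"):
                in_data = True
            continue
        if line.startswith(">"):
            # Data header line, for example "> <PUBCHEM_COMPOUND_CID>".
            copying = "<" + keep + ">" in line
        if copying:
            out.append(line)
    return out


def combine(pattern, out_path, keep="PUBCHEM_COMPOUND_CID"):
    """Concatenate the gzipped SD Files matching `pattern` into `out_path`.

    Returns the number of records written.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8", newline="\n") as out:
        for path in sorted(glob.glob(pattern)):
            with gzip.open(path, "rt", encoding="utf-8") as src:
                record = []
                for raw in src:
                    line = raw.rstrip("\r\n")
                    if line == "$$$$":
                        # End of record: write the filtered copy.
                        for kept in filter_record(record, keep):
                            out.write(kept + "\n")
                        out.write("$$$$\n")
                        count += 1
                        record = []
                    else:
                        record.append(line)
    return count
```

Calling combine("*.sdf.gz", "combined.sdf") in the download directory would write one uncompressed SD File and return the record count. Files are read one line at a time, so memory use stays small even for several hundred thousand records.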