Mirroring PubChem the Easy Way with PubChem Fu

July 20, 2010

If you want to work with the PubChem dataset, two of your biggest problems will likely be creating a local working copy and keeping it synchronized. Although a few articles on doing just this have appeared here on Depth-First, the information is a bit scattered. Getting a robust mirroring system in place and working as expected also takes some time and effort. Enter PubChem Fu, a simple tool designed to help you maintain a complete, up-to-date copy of the PubChem dataset.

What it Does

PubChem Fu lets you create and automatically update a local working copy of all PubChem Compound and Substance data.

This tool uses Ruby and a clever little Ruby utility called Whenever to configure cron for you. Give it a time of day for pulling daily updates, and you'll always have access to the complete, latest PubChem dataset.
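For a sense of what Whenever works from, here is a minimal sketch of a schedule file. The location config/schedule.rb is Whenever's default, and the 2:15 pm time and the "daily" task name are taken from the crontab output and rake commands shown in this article. The contents are an assumption for illustration; the file that ships with PubChem Fu may well differ.

```ruby
# config/schedule.rb (Whenever's default location; the contents here are
# an assumption, not a copy of PubChem Fu's actual file)

# Run the "daily" rake task once a day at 2:15 pm local time.
every 1.day, :at => '2:15 pm' do
  rake "daily"
end
```

Whenever translates a block like this into a crontab entry, so changing the update time should only require editing the :at value and re-running whenever.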


Create a full copy of PubChem (in the same directory as PubChem Fu):

$ rake full

Pull available daily updates (in the same directory as PubChem Fu):

$ rake daily

Automate pulling daily updates:

$ whenever --update-crontab pubchem

Make sure your cron task is set:

$ crontab -l


# Begin Whenever generated tasks for: pubchem
PATH= ...

15 14 * * * cd ~/local/pubchem && RAILS_ENV=production /usr/bin/env rake daily

# End Whenever generated tasks for: pubchem
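If you'd rather check this from a script than by eye, the following is a rough sketch of a helper that pulls out the lines between the Begin and End markers shown above. The method name whenever_block is made up for this example and is not part of PubChem Fu or Whenever. It matches the markers by prefix, in case a Whenever version adds extra text to those comment lines.

```ruby
# Return the non-empty lines Whenever wrote for one identifier, or nil
# if the Begin/End comment markers are not both present.
# (Hypothetical helper; not part of PubChem Fu or Whenever.)
def whenever_block(crontab_text, identifier)
  start_marker = "# Begin Whenever generated tasks for: #{identifier}"
  end_marker   = "# End Whenever generated tasks for: #{identifier}"

  lines = crontab_text.lines.map(&:chomp)
  first = lines.index { |line| line.start_with?(start_marker) }
  last  = lines.index { |line| line.start_with?(end_marker) }
  return nil if first.nil? || last.nil? || last < first

  # Keep only the non-empty lines strictly between the two markers.
  lines[(first + 1)...last].reject { |line| line.strip.empty? }
end

# Example use against the live crontab:
#   block = whenever_block(`crontab -l`, "pubchem")
#   puts(block.nil? ? "No Whenever block found" : block)
```

A nil result means the markers were not found, which suggests the whenever command has not yet been run for the pubchem identifier.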

That part about RAILS_ENV is a by-product of Whenever being used mainly in Ruby on Rails environments. I'm sure there's a way to suppress it if needed. If you know how, please drop me a line.
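One possibility, offered as a sketch rather than a tested fix: Whenever's schedule DSL includes a job_type method for defining custom job types, so a rake job without the RAILS_ENV prefix could be declared along these lines. The name plain_rake is made up for illustration, and whether job_type is available depends on the Whenever version installed.

```ruby
# config/schedule.rb (illustrative; not PubChem Fu's actual file)

# Define a job type whose command template omits the RAILS_ENV prefix.
# :path and :task are placeholders that Whenever fills in.
job_type :plain_rake, "cd :path && /usr/bin/env rake :task"

# Same schedule as the crontab entry above: once a day at 2:15 pm.
every 1.day, :at => '2:15 pm' do
  plain_rake "daily"
end
```

After editing the schedule, re-run the whenever command and check crontab -l again to confirm what was actually written.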

More Information

PubChem Fu is a simple utility built on a powerful foundation. For more information, see this video on automating cron with Ruby.

PubChem's large size presents many challenges, some of which may have solutions that could be included with PubChem Fu, and some of which could be applied to other large datasets. Stay tuned.