Mirroring PubChem the Easy Way with PubChem Fu
If you want to work with the PubChem dataset, two of your biggest problems will likely be creating a local working copy and keeping it synchronized. Although a few articles on doing just this have appeared here on Depth-First, the information is a bit scattered. Getting a robust mirroring system in place and working as expected also takes time and effort. Enter PubChem Fu, a simple tool designed to help you maintain a complete, up-to-date copy of the PubChem dataset.
Previous Articles in this Series:
- Big Data in Chemistry: Incrementally Mirror PubChem the Easy Way
- Big Data in Chemistry: Mirroring PubChem the Easy Way Part 2
- Big Data in Chemistry: Mirroring PubChem the Easy Way
What it Does
PubChem Fu lets you create and automatically update a local working copy of all PubChem Compound and Substance data.
This tool uses Ruby and a clever little Ruby utility called Whenever to configure cron for you. Simply tell it what time of day to pull the daily updates, and you will always have access to the complete, latest PubChem dataset.
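For readers who haven't used Whenever: the schedule is written in a small Ruby file (conventionally config/schedule.rb) that Whenever translates into crontab entries. The following is a minimal sketch of what such a file might look like. It is not PubChem Fu's actual schedule file, and the 2:15 pm time is an assumption chosen to line up with the cron entry shown under Usage.

```ruby
# config/schedule.rb: illustrative sketch only, not PubChem Fu's actual file.
# Whenever evaluates this file and writes the matching crontab entries.

# Run the `daily` rake task once a day at 2:15 pm (14:15) local time.
every 1.day, :at => '2:15 pm' do
  rake "daily"
end
```

Changing the time is then a matter of editing the :at value and re-running the whenever command.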
Usage
Create a full copy of PubChem (in the same directory as PubChem Fu):
rake full
Pull available daily updates (in the same directory as PubChem Fu):
rake daily
Automate pulling daily updates:
whenever --update-crontab pubchem
Make sure your cron task is set:
crontab -l
...
# Begin Whenever generated tasks for: pubchem
PATH= ...
15 14 * * * cd ~/local/pubchem && RAILS_ENV=production /usr/bin/env rake daily
# End Whenever generated tasks for: pubchem
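If you would rather check the crontab from a script than by eye, something like the sketch below could work. It relies only on the Begin/End marker comments visible in the output above; whenever_block is a hypothetical helper written for this example and is not part of PubChem Fu or Whenever.

```ruby
# Return the crontab lines Whenever wrote for an identifier, or nil if the
# Begin/End marker pair is missing. Markers are matched by prefix, on the
# assumption that some Whenever versions may append extra text to them.
def whenever_block(crontab_text, identifier)
  begin_marker = "# Begin Whenever generated tasks for: #{identifier}"
  end_marker   = "# End Whenever generated tasks for: #{identifier}"

  lines = crontab_text.lines.map(&:chomp)
  first = lines.index { |line| line.start_with?(begin_marker) }
  last  = lines.index { |line| line.start_with?(end_marker) }
  return nil if first.nil? || last.nil? || last < first

  # Everything strictly between the two markers.
  lines[(first + 1)...last]
end

# Sample crontab text, modeled on the output shown above.
SAMPLE_CRONTAB = <<~CRON
  # Begin Whenever generated tasks for: pubchem
  15 14 * * * cd ~/local/pubchem && RAILS_ENV=production /usr/bin/env rake daily
  # End Whenever generated tasks for: pubchem
CRON

block = whenever_block(SAMPLE_CRONTAB, "pubchem")
scheduled = !block.nil? && block.any? { |line| line.include?("rake daily") }
puts(scheduled ? "daily update is scheduled" : "no pubchem task found")
```

To check the live crontab instead of the sample, pass in the text returned by running crontab -l.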
The RAILS_ENV=production prefix in the cron entry is a by-product of Whenever being used mainly in Ruby on Rails environments. I'm sure there's a way to suppress it if needed. If you know, please drop me a line.
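One candidate, untested against PubChem Fu itself, is Whenever's job_type method, which lets a schedule file define its own command template. The sketch below assumes a Whenever version that provides job_type and fills in the :path and :task placeholders; the name plain_rake and the 2:15 pm time are made up for this example.

```ruby
# config/schedule.rb: sketch, assuming Whenever's job_type is available.
# :plain_rake is a hypothetical job name chosen for this example.
# :path and :task are placeholders that Whenever substitutes when it
# builds the cron command, so no RAILS_ENV prefix appears in the template.
job_type :plain_rake, "cd :path && /usr/bin/env rake :task"

every 1.day, :at => '2:15 pm' do
  plain_rake "daily"
end
```

After changing the schedule file, re-run the whenever command and check crontab -l again to see whether the prefix is gone.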
More Information
PubChem Fu is a simple utility built on a powerful foundation. For more information, see this video on automating cron with Ruby.
PubChem's large size presents many challenges, some of which may have solutions that could be included with PubChem Fu, and some of which could be applied to other large datasets. Stay tuned.