PubChem's massive size presents special challenges when working with this chemical dataset. Synchronization in particular requires special care. Although it's very easy to use a tool such as wget to perform a complete, one-time download of PubChem's archive files, this approach scales poorly if our goal is to maintain a copy that's always up-to-date. The PubChem dataset's substantial size makes it impractical to download frequently, and especially problematic when an up-to-date local copy is needed quickly.

This article describes a simple way to create a low-maintenance, low-bandwidth, up-to-date local mirror of PubChem using two Unix tools: curlftpfs and rsync.

What It Does

The method described here will create two directories on your filesystem that will exactly mirror the contents of the PubChem Compound and Substance archives, respectively. A simple command, which can either be run as a nightly cron job or on demand, will efficiently bring these local files up-to-date with PubChem whenever it's run.

Step #1: Create A Workspace and Mount PubChem FTP Site

We're going to need a workspace. In this workspace, we'll first create a mountpoint for the PubChem FTP site archives, then we'll mount the archives:

$ mkdir workspace
$ cd workspace
$ mkdir -p ftp.ncbi.nlm.nih.gov/pubchem
$ curlftpfs ftp.ncbi.nlm.nih.gov/pubchem/ ftp.ncbi.nlm.nih.gov/pubchem/

My Linux distribution (Ubuntu Karmic) gives me the error message:

fusermount: failed to open /etc/fuse.conf: Permission denied

which doesn't seem to matter. The FTP site is mounted, as I can see by listing the top-level entries:

$ ls ftp.ncbi.nlm.nih.gov/pubchem
Bioassay  Compound     data_spec     README      Substance
CACTVS    Compound_3D  publications  specifications

We can unmount the PubChem FTP site with fusermount:

$ fusermount -u ftp.ncbi.nlm.nih.gov/pubchem/

Step #2: Create Synchronization Directories and Transfer Files

Next, let's create two directories to hold the PubChem files - one for Compounds and one for Substances:

$ mkdir substances
$ mkdir compounds

Now comes the magic. We'll use rsync to copy the contents of the mounted FTP archive into each of our local directories. First, we synchronize the Compounds:

$ rsync -r -t -v --progress --bwlimit=500  --include='*/' --include='*.sdf.gz' --exclude='*' ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/ compounds

This is going to take nearly 24 hours.

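The option --bwlimit caps the transfer rate. rsync reads a bare number as kilobytes per second, so 500 here means roughly 500 KB/s, not megabits. The --include and --exclude options say that we're only interested in gzipped SD Files, while still descending into subdirectories.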

We synchronize Substance records analogously:

$ rsync -r -t -v --progress --bwlimit=500  --include='*/' --include='*.sdf.gz' --exclude='*' ftp.ncbi.nlm.nih.gov/pubchem/Substance/CURRENT-Full/SDF/ substances

This command will take even longer to run.

Step #3: There Is No Step #3

That's really all there is to it. Every time we run the rsync command, we'll synchronize our local copy of the PubChem archive with the one on the PubChem FTP server. PubChem ensures that these archives are always current, so every time we synchronize, we'll have up-to-date files.

Why RSync?

RSync ensures that our synchronizations will be as efficient as possible by only downloading the archive files that change. From time to time, old records are updated in PubChem, and these changes appear as a new archive file that replaces an old archive file. The new file gets an updated timestamp. If you check out the Compounds FTP directory you'll notice several different timestamps reflecting the various updates of existing records. New records appear as new archive files.

The genius of RSync is that it performs an incremental backup; files that haven't changed since our last update are never downloaded.
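
To make the "never downloaded" part concrete: by default, rsync skips any file whose size and modification time already match the destination, which is why the -t option (preserve timestamps) matters in the commands above. The following Java sketch imitates that decision for a single pair of files. It is only an illustration of the idea, not rsync's actual code; the class name QuickCheck and the method needsTransfer are made up for this example, and the comparison is simplified to one-second resolution.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class QuickCheck {

    /**
     * Returns true when dst is missing, or differs from src in size or in
     * modification time (compared at one-second resolution).
     */
    public static boolean needsTransfer(Path src, Path dst) throws IOException {
        if (!Files.exists(dst)) {
            return true; // a new archive file is always copied
        }
        if (Files.size(src) != Files.size(dst)) {
            return true;
        }
        long srcSeconds = Files.getLastModifiedTime(src).toMillis() / 1000;
        long dstSeconds = Files.getLastModifiedTime(dst).toMillis() / 1000;
        return srcSeconds != dstSeconds;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("quickcheck");
        Path src = dir.resolve("remote.sdf.gz"); // stands in for a file on the mounted FTP site
        Path dst = dir.resolve("local.sdf.gz");  // stands in for our local mirror copy
        Files.write(src, new byte[] {1, 2, 3});

        // No local copy exists yet.
        System.out.println(needsTransfer(src, dst));

        // Imitate a transfer with -t: copy the bytes, then preserve the timestamp.
        Files.copy(src, dst);
        Files.setLastModifiedTime(dst, Files.getLastModifiedTime(src));
        System.out.println(needsTransfer(src, dst));

        // Imitate PubChem replacing the archive with an updated one.
        Files.write(src, new byte[] {1, 2, 3, 4});
        System.out.println(needsTransfer(src, dst));
    }
}
```

Without preserved timestamps, every file would look changed on the next run and the whole archive would be copied again.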

We can even take this incremental backup idea one step further. Although I don't yet know if PubChem supports it, it's possible to create GZip archives optimized for rsync. This uses a variant of the GZip compression algorithm that makes it possible to transmit only the section of a gzip file that's actually changed, keeping network traffic to an absolute minimum.

This rsyncable archive capability is built into most gzip binary distributions.

Conclusions

Creating and maintaining your own up-to-date, verbatim copy of PubChem is both simple and inexpensive. The trick is to first mount the FTP archive using curlftpfs and then use rsync to perform an incremental backup of the mounted archive. The method described here works equally well as a cron job or an ad hoc command.

Credits: Mirror an FTP Directory with RSync and Curlftps; Rsyncable gzip


Nature News is running a story on Matthew Todd and his initiative to develop a more practical treatment for Schistosomiasis by thinking different:

"My funded project is intended to be the kernel, to which anyone can add," Todd says. He hopes that the project will become a successful example of open-source science, and open-source 'wet lab' chemistry in particular, a concept that has been slow to take off.

Call me an optimist, but the problems with getting something like this to work will have less to do with a scarcity of volunteer-minded chemists and more to do with finding them around the world and connecting them to each other.

Chempedia Lab is a service that might have a role to play. It's a question and answer site dedicated to experimental chemistry. Ask a question and get a peer-reviewed answer. No inflated bureaucracy, no lengthy review process, no unaffordable subscriptions, no conflicts of interest, no nagging questions about re-use, no counterproductive rewards system. Just you, your peers, and the information - the way science is supposed to work.

Maybe you're thinking that something like this can't possibly work. If so, I'll leave you with a simple challenge - do the experiment yourself. Ask the toughest question you can think of and see how long it takes to get either exactly the answer you were looking for, or an answer that puts you on the right track. Then ask yourself how you would have answered the same question without Chempedia Lab.

Although Chempedia Lab may not be the best platform, one thing is clear - open science has no chance in the context of traditional scientific communication. That system is simply too cost-ineffective, both in terms of money and time.

My guess is that for every Matthew Todd there are at least a hundred others who would like to start the same kind of initiative, but who feel they lack the funding, the lab space, the staff, or some other critical resource. Thinking different about everything in the way we do chemistry - from who does the research, to where it gets done, to the medium of collaboration - is the key.


If you've ever worked with the PubChem dataset, you may have found yourself wanting to create a custom subset that filters out certain records. This article, the fourth in a continuing series, shows a very simple way to create a custom PubChem dataset using PubCouch.

The Problem

I really like PubChem. It's the world's largest collection of freely-downloadable chemical structures and an excellent use of taxpayer dollars.

But PubChem has faced some tough tradeoffs over the years, one of the foremost being how inclusive it should be: in other words, when to say 'no' to a substance depositor. I won't rehash the details here, but suffice it to say that the technologies on which PubChem is based are limited in important ways (for example: organometallics).

As part of ongoing work to expand Chempedia, the free chemical substance registry, I became interested in the possibility of building a subset of the PubChem Compound registry that only contained structures that could be safely encoded by the MDL Molfile specification. Call it "PubChem: The Good Parts."

This database was likely to be huge and pretty non-relational. It looked like a perfect job for PubCouch.

A Solution

The software to solve this problem has been built into PubCouch. There are a couple of ways to run it, but I find one of the simplest is to use JRuby:

$ git clone git@github.com:metamolecular/pubcouch.git
$ cd pubcouch
$ ant jar
$ jruby -S rake compounds:snapshot

To get that last part working, you'll need to install JRuby. This is optional; you could also create an Ant task or use some other script. The point is that we're running a pre-packaged PubCouch task called "Compounds".

There's one more thing - you'll obviously need CouchDB installed, and you'll need an empty database called "compounds". The database name can be changed to fit your preferences.

Finally, the way this works is likely to change in the future. To be sure you'll be able to access the code described here, please use this commit.

Filtering

After running the snapshot task, you'll see some output indicating Compound IDs being checked and written.

Not every compound is being written. Only those passing a specific set of requirements will end up in CouchDB:

  1. No bond annotations other than 'aromatic'.
  2. No multicomponent (disconnected) Compounds.
  3. No undefined stereochemistry.
  4. No charged species.

These happen to be my requirements - yours will probably differ somewhat. To change the applied filter, simply change the method Compounds.StrictFilter.pass. It's that simple.
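
As a rough illustration of what such a filter amounts to, here is a self-contained Java sketch of the four rules above. Everything in it is hypothetical: the class StrictFilterSketch, the CompoundSummary type, and its fields are invented for this example and are not PubCouch's real classes or the real signature of Compounds.StrictFilter.pass. It assumes the relevant properties of a record have already been worked out.

```java
public class StrictFilterSketch {

    /** Hypothetical, simplified summary of one Compound record (not a PubCouch type). */
    public static final class CompoundSummary {
        final boolean hasNonAromaticBondAnnotations;
        final int componentCount;
        final boolean hasUndefinedStereo;
        final boolean hasChargedAtoms;

        public CompoundSummary(boolean hasNonAromaticBondAnnotations, int componentCount,
                               boolean hasUndefinedStereo, boolean hasChargedAtoms) {
            this.hasNonAromaticBondAnnotations = hasNonAromaticBondAnnotations;
            this.componentCount = componentCount;
            this.hasUndefinedStereo = hasUndefinedStereo;
            this.hasChargedAtoms = hasChargedAtoms;
        }
    }

    /** Returns true only when a record satisfies all four requirements. */
    public static boolean pass(CompoundSummary c) {
        return !c.hasNonAromaticBondAnnotations // 1. only 'aromatic' bond annotations
            && c.componentCount == 1            // 2. single connected component
            && !c.hasUndefinedStereo            // 3. all stereochemistry defined
            && !c.hasChargedAtoms;              // 4. no charged species
    }

    public static void main(String[] args) {
        CompoundSummary plain = new CompoundSummary(false, 1, false, false);
        CompoundSummary salt = new CompoundSummary(false, 2, false, true);
        System.out.println(pass(plain)); // true
        System.out.println(pass(salt));  // false
    }
}
```

Loosening a requirement is then a matter of dropping or changing one clause in pass.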

Fine-Tuning

This is all pretty rough at this point. There are many opportunities to refine the code for flexibility and performance. For example, I initially experimented with CouchDB's bulk update capability, which compresses multiple writes into a single HTTP request. But this actually resulted in more memory/processor usage. My guess is that this was probably less due to CouchDB than it was to the JSON overhead in the JCouchDB library I'm using to talk to CouchDB. Your results may vary.
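
For readers who want to repeat the bulk update experiment, the sketch below shows the general shape of a bulk write: several documents are wrapped in a single JSON body and sent with one POST to the database's _bulk_docs endpoint. This is not the PubCouch code and does not use JCouchDB; it uses the JDK's HttpURLConnection instead, and the class name, method names, document IDs, and the localhost URL are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class BulkWriteSketch {

    /** Wraps already-serialized JSON documents in the body that _bulk_docs expects. */
    public static String bulkBody(List<String> jsonDocs) {
        return "{\"docs\":[" + String.join(",", jsonDocs) + "]}";
    }

    /** Sends the whole batch in one HTTP request and returns the status code. */
    public static int postBulk(String databaseUrl, List<String> jsonDocs) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) URI.create(databaseUrl + "/_bulk_docs").toURL().openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(bulkBody(jsonDocs).getBytes(StandardCharsets.UTF_8));
        }
        return conn.getResponseCode();
    }

    public static void main(String[] args) {
        // Hypothetical documents; real ones would carry the structure data.
        List<String> docs = Arrays.asList("{\"_id\":\"cid-1\"}", "{\"_id\":\"cid-2\"}");
        System.out.println(bulkBody(docs));
        // To actually send the batch, assuming CouchDB on localhost with a
        // database named "compounds":
        // postBulk("http://localhost:5984/compounds", docs);
    }
}
```

The batch size is the obvious knob to turn when measuring memory and processor usage against one-document-per-request writes.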

Conclusions

PubChem is an excellent free resource for raw chemical structures, if filtered correctly. This article showed how to create your own personal subset of PubChem using PubCouch.


If you've been following along with the development of PubCouch, the CouchDB interface for PubChem, you've probably noticed that only a fraction of the code relates to CouchDB itself. What's the rest of it doing?

This article, the third in a series on using CouchDB for PubChem data, describes how PubCouch transforms PubChem's collection of gzipped archive files into a stream of structure-data records that can be processed as if it were one big SD File.

The Problem

If you want to work with the PubChem dataset, one of the first problems you'll face is how to import the data into your database management system (DBMS). The PubChem FTP server contains a rather large collection of archive "bundles", which are simply gzipped SD Files of records within a certain ID range.

In most cases, importing the PubChem database will consist of sequentially reading every Compound and Substance record, applying the appropriate intermediate processing, and storing the result.

So, we have a mismatch between the way PubChem stores its data (multiple gzipped archives) and the way we want to process it (as one big SD File). And by the way, how about not storing a bunch of temporary files, but rather transferring data directly from the archive to our database?

A Solution

One of the reasons Java was chosen as PubCouch's development language is its built-in support for high-performance IO operations. InputStream, the foundation of this support, turns out to be a very versatile class enabling a variety of filtering and reprocessing operations on raw data streams.

Our FTP Client can return at most one raw byte stream from each archive file. By applying a set of filters to this stream, such as a GZIPInputStream to decompress it, we can get pretty close to where we need to be.

These filters alone won't do the job - remember, we want to treat the entire FTP archive as one big SD File.

SequenceInputStream is just what we need. This nifty little class can make a series of InputStreams (i.e., the individual PubChem files) appear as one big InputStream.

Putting this all together, we end up with a chained series of inputs:

[InputStream -> GZIPInputStream] ->
SequenceInputStream ->
InputStreamReader ->
BufferedReader

We now have a BufferedReader that will for all intents and purposes look like we've just opened a massive SD File. Handing this Reader to an SD File processor will let us capture all Substance or Compound records using a simple conceptual model.
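
The chain can be tried out without an FTP connection. The Java sketch below is not the PubCouch implementation: it substitutes small in-memory gzipped byte arrays for the per-file FTP streams, and the class and method names (ChainedArchives, openAsOne, countRecords) are invented for this example. It uses the JDK's InputStreamReader as the bridge from bytes to characters and counts records by looking for the "$$$$" line that ends each SD File record.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.SequenceInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.Iterator;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class ChainedArchives {

    /** Gzips a string in memory; stands in for one downloaded archive file. */
    public static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    /** Presents several gzipped archives as one continuous character stream. */
    public static BufferedReader openAsOne(List<byte[]> archives) {
        final Iterator<byte[]> it = archives.iterator();
        // Archives after the first are opened only once the previous one is exhausted.
        Enumeration<InputStream> streams = new Enumeration<InputStream>() {
            public boolean hasMoreElements() {
                return it.hasNext();
            }
            public InputStream nextElement() {
                try {
                    return new GZIPInputStream(new ByteArrayInputStream(it.next()));
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
        };
        return new BufferedReader(
            new InputStreamReader(new SequenceInputStream(streams), StandardCharsets.UTF_8));
    }

    /** Counts records by counting the "$$$$" delimiter lines. */
    public static int countRecords(BufferedReader reader) throws IOException {
        int count = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.equals("$$$$")) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        List<byte[]> archives = new ArrayList<>();
        archives.add(gzip("record 1\n$$$$\nrecord 2\n$$$$\n"));
        archives.add(gzip("record 3\n$$$$\n"));
        try (BufferedReader reader = openAsOne(archives)) {
            System.out.println(countRecords(reader)); // prints 3
        }
    }
}
```

Supplying the streams through an Enumeration rather than a pre-built list means an archive is not opened until the reader actually reaches it, which matters when each stream is a live FTP download.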

Conclusions

By using Java's support for stream chaining and transformation, PubCouch makes it possible to work with the PubChem FTP archive as if it were one big SD File. This turns out to be useful regardless of how you decide to ultimately represent and store the resulting records. There are still some rough edges in the implementation and possibilities for extending the concept (e.g., random access), but the idea can be used on many other data sources, and in many other contexts.


Although I find CouchDB's slogan "Relax" comforting, after having worked with Couch both on OS X and Linux, I can say with confidence that the only time to relax is after CouchDB is installed and tested. The process of getting to that point is about as relaxing as a root canal.

This article, the second in a series on using CouchDB for PubChem data, describes in detail how to install CouchDB from source on a freshly-built Ubuntu Karmic server.

The Problem

At Metamolecular, we've installed the latest Ubuntu release, Karmic Koala, on production and development boxes. Getting CouchDB installed and running was the next task.

Unfortunately, the Karmic CouchDB binary distribution is broken - badly, and I can confirm all of the problems previously reported. It would appear that for now, CouchDB is one of those Ubuntu packages best compiled from source. There's another reason to install from source: CouchDB is still developing quite rapidly. Building from source increases the chances we'll be able to stay up-to-date at all times.

I was disappointed at the lack of documentation for installing CouchDB from source on Karmic, so I've decided to share what worked for me here. This procedure worked on a newly-built system with no previous CouchDB installation. If you've already installed the binary packages, or other versions of Erlang or SpiderMonkey, you may have more work to do. What follows is a little out of order, and I may clean it up when time permits.

Download, Compile, and Install CouchDB

$ sudo aptitude install build-essential erlang libicu-dev libmozjs-dev libcurl4-openssl-dev
$ mkdir src
$ cd src
$ wget http://apache.cs.utah.edu/couchdb/0.10.1/apache-couchdb-0.10.1.tar.gz
$ tar xvf apache-couchdb-0.10.1.tar.gz
$ cd apache-couchdb-0.10.1/
$ ./configure
$ make
$ sudo make install

Configure CouchDB

The CouchDB Book indicates these permissions need to be set, so I used:

$ sudo chmod -R 0770 /usr/local/etc/couchdb
$ sudo chmod -R 0770 /usr/local/var/lib/couchdb
$ sudo chmod -R 0770 /usr/local/var/log/couchdb
$ sudo chmod -R 0770 /usr/local/var/run/couchdb

$ sudo chown -R couchdb:couchdb /usr/local/etc/couchdb
$ sudo chown -R couchdb:couchdb /usr/local/var/lib/couchdb
$ sudo chown -R couchdb:couchdb /usr/local/var/log/couchdb
$ sudo chown -R couchdb:couchdb /usr/local/var/run/couchdb

Unfortunately, CouchDB won't run at this point:

$ sudo -i -u couchdb couchdb -b
sudo: unable to change directory to /var/lib/couchdb: No such file or directory
Apache CouchDB needs write permission on the STDOUT file: couchdb.stdout

We need to create the missing directory and assign its permissions:

$ sudo mkdir /var/lib/couchdb
$ sudo chown -R couchdb:couchdb /var/lib/couchdb

I like to be able to run servers using init.d, so I used:

$ sudo cp -v /usr/local/etc/init.d/couchdb /etc/init.d/couchdb

Now we can start CouchDB and confirm it's working:

$ sudo /etc/init.d/couchdb start
$ netstat -an | grep 5984
tcp        0      0 127.0.0.1:5984          0.0.0.0:*               LISTEN

The problem with this configuration is that we won't be able to use the Futon admin console from any location other than localhost. Because I'm setting up a server that will be handling long-running tasks, I want to know I can pop into Futon to check things out.

To do that, we'll need to make a slight change. In the file /usr/local/etc/couchdb/default.ini, change the line that reads:

bind_address = 127.0.0.1

to:

bind_address = 0.0.0.0

One Last Wrinkle

Although you'll be able to create and read documents with what we've done so far, a cryptic error is displayed when we try to compile views: "{exit_status,127}".

Starting with this thread, I was able to piece together the answer. Running the following command shows us that Couch can't find one of the SpiderMonkey libraries:

$ /usr/local/bin/couchjs /usr/local/share/couchdb/server/main.js
/usr/local/lib/couchdb/bin/couchjs: error while loading shared libraries: libmozjs.so.0d: cannot open shared object file: No such file or directory

We can find out which package might fit the bill with:

$ aptitude search libmozjs
p   libmozjs-dev                                                   - Development files for the Mozilla SpiderMonkey JavaScript library       
i   libmozjs0d                                                     - The Mozilla SpiderMonkey JavaScript library                             
p   libmozjs0d-dbg                                                 - Development files for the Mozilla SpiderMonkey JavaScript library

We installed the first of these libraries at the beginning of the process; it turns out we need to install the other one, too:

$ sudo aptitude install libmozjs0d

We can now restart CouchDB and bask in all of its Map/Reduce glory:

$ sudo /etc/init.d/couchdb stop
$ sudo /etc/init.d/couchdb start

Conclusions

Installing CouchDB from source on the newest Ubuntu release is not hard with some basic documentation. We now have a completely up-to-date CouchDB system that we can (hopefully) upgrade as new releases are made. If you want to use CouchDB in production, there are a few more security-related steps you'll want to take, but as a development system, the setup described here should work nicely.

If the procedure outlined here worked for you, or if you find problems with it, I'd really appreciate your comments.

