The Best API May Be No API At All: PubChem and PDB
Both PubChem and the Protein Data Bank (PDB) maintain vast collections of molecular data. Individual users are free to view and search these collections via standard Web browsers. But what are the options if you're developing software to interact with these databases?
Various application programming interfaces (APIs) are available for accessing PubChem and PDB records. For example, PubChem recently introduced its Power User Gateway (PUG), an XML-based query language. But writing APIs is extremely difficult; reconciling the need for simplicity with the need for rich functionality is a tough balancing act. Where do you draw the line?
Recently, Bosco described a remarkably short method to retrieve PDB records using nothing more than standard Python. Given the similarities between Python and Ruby, it seemed reasonable that his method could be adapted to Ruby.
The following Ruby library accepts a PDB identifier and returns the corresponding PDB record:
require 'net/http'

module PDB
  # Returns the PDB record for the given id as a string
  def self.get_record(id)
    Net::HTTP.get_response('www.rcsb.org', "/pdb/files/#{id}.pdb").body
  end
end
$ irb
irb(main):001:0> require 'pdb'
=> true
irb(main):002:0> puts PDB::get_record('1hpn')
HEADER GLYCOSAMINOGLYCAN 17-JAN-95 1HPN
TITLE N.M.R. AND MOLECULAR-MODELLING STUDIES OF THE SOLUTION
TITLE 2 CONFORMATION OF HEPARIN
[truncated]
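The one-liner above trusts both its input and the server. As a minimal sketch of a slightly more defensive variant, the module below validates the identifier and checks the HTTP status before returning the body. The names `SafePDB`, `ID_PATTERN`, and `record_uri` are hypothetical, and the pattern rests on the assumption that PDB ids are four characters long, a digit followed by three alphanumerics.

```ruby
require 'net/http'
require 'uri'

module SafePDB
  HOST = 'www.rcsb.org' # same host as the library above

  # Assumption: a PDB id is a digit followed by three alphanumeric characters.
  ID_PATTERN = /\A[0-9][A-Za-z0-9]{3}\z/

  # Builds the record URI without touching the network.
  def self.record_uri(id)
    raise ArgumentError, "invalid PDB id: #{id.inspect}" unless id.to_s.match?(ID_PATTERN)

    URI::HTTP.build(host: HOST, path: "/pdb/files/#{id}.pdb")
  end

  # Fetches the record and raises unless the server answers with a 2xx status.
  def self.get_record(id)
    response = Net::HTTP.get_response(record_uri(id))
    raise "HTTP #{response.code} for #{id}" unless response.is_a?(Net::HTTPSuccess)

    response.body
  end
end
```

Keeping the URI construction separate from the request means the validation can be exercised without a network connection.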
Several months ago, a D-F article described a related but somewhat lengthier approach to retrieving PubChem molfiles. Using the same approach we used for PDB, we can create the world's shortest PubChem library:
require 'net/http'

module PubChem
  # Returns a molfile for the given PubChem CID as a string
  def self.get_molfile(cid)
    Net::HTTP.get_response('pubchem.ncbi.nlm.nih.gov', "/summary/summary.cgi?cid=#{cid}&disopt=DisplaySDF").body
  end
end
$ irb
irb(main):001:0> require 'pubchem'
=> true
irb(main):002:0> puts PubChem::get_molfile('969472') #eszopiclone (Lunesta)
969472
-OEChem-08130700422D
44 47 0 1 0 0 0 0 0999 V2000
9.2619 -2.2732 0.0000 Cl 0 0 0 0 0 0 0 0 0 0 0 0
[truncated]
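The string that comes back is an ordinary V2000 molfile, so it can be inspected with a few lines of Ruby. As a rough sketch, the hypothetical `MolfileCounts` module below reads the atom and bond counts from the counts line (the "44 47 ..." line in the output above). It assumes the standard V2000 layout: three header lines, then a counts line whose first two three-character fields hold the atom and bond counts.

```ruby
module MolfileCounts
  # Returns [atom_count, bond_count] read from the counts line of a V2000 molfile.
  def self.parse(molfile)
    counts_line = molfile.lines[3] # three header lines precede the counts line
    unless counts_line && counts_line.include?('V2000')
      raise ArgumentError, 'no V2000 counts line found'
    end

    # Fixed-width fields: columns 1-3 are atoms, columns 4-6 are bonds.
    [counts_line[0, 3].to_i, counts_line[3, 3].to_i]
  end
end
```

A check like this is a cheap way to confirm that a response really is a molfile and not an HTML error page.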
Both of these Ruby libraries leverage one of the most versatile and robust protocols ever developed: plain old http. The last few years have witnessed a renaissance in using bare http as a platform for building simplified yet powerful Web APIs with less software. Referred to as REST, the approach has gained traction partly in response to the wasteful complexities introduced by various XML-based approaches. Although slow to catch on in cheminformatics, REST has enormous potential for unifying a diverse array of isolated database systems.
One limitation of the approach described here is that the PubChem (or PDB) folks may get upset if you use it a lot. For example, if you examine the PubChem robots.txt file, you'll notice that access to the summary.cgi resource, which our library makes use of, is prohibited to robots:
...
User-agent: *
...
Disallow: /summary/summary.cgi
...
What makes a "robot", and does your software qualify for exclusion? The answer is not entirely clear-cut, especially in the era of browser-side scripting.
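Whatever the answer, a client can at least look at the rules before it starts making requests. The sketch below is a deliberately simplified reading of robots.txt, not a full implementation of the exclusion protocol: it only understands `User-agent` and `Disallow` lines, matches paths by prefix, and treats any applicable group as binding. The module name `RobotsCheck` and its method are hypothetical.

```ruby
module RobotsCheck
  # Returns true when `path` falls under a Disallow rule in a group
  # that applies to `agent` (or to the wildcard agent "*").
  def self.disallowed?(robots_txt, path, agent = '*')
    applies = false      # does the current group apply to our agent?
    in_agent_run = false # are we inside a run of consecutive User-agent lines?

    robots_txt.each_line do |line|
      field, value = line.sub(/#.*/, '').strip.split(':', 2)
      next if value.nil? # blank lines, comments, anything without a colon

      value = value.strip
      case field.strip.downcase
      when 'user-agent'
        applies = false unless in_agent_run # a new group starts here
        applies ||= (value == '*' || value.casecmp?(agent))
        in_agent_run = true
      when 'disallow'
        in_agent_run = false
        return true if applies && !value.empty? && path.start_with?(value)
      else
        in_agent_run = false
      end
    end

    false
  end
end
```

Run against the excerpt above, a check like this flags the `summary.cgi` path used by the PubChem library, which is a reasonable cue to throttle requests or ask before automating them.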
Regardless, it looks like PubChem's policy was put in place in 2004, long before PubChem had experience with usage patterns for its service. It may be that this restriction could be relaxed without adversely affecting PubChem's ability to operate efficiently. It may even be possible to offer a low-level http retrieval method alongside PubChem's PUG interface on a machine dedicated to automated queries (i.e., Entrez eUtils).
As developers, our mission is to deliver functionality, not to write software. We should extract every possible ounce of value from established protocols and APIs before writing a single line of additional code. REST, and the creative use of good old http, are powerful tools to do so.
Image Credit: Dru!
James Gosling Unplugged
Earlier this week, James Gosling fielded a variety of questions from the San Diego Java Users' Group. Subjects ranged from the deeply technical to the ridiculously political, and the entire event was simply great. Here, in no particular order, were some of the highlights:
Question: "What's your least favorite word?" Answer: "Action Item." My kind of guy.
Virtually all banking transactions on the planet are processed by Java code.
The NASDAQ is a single computer running a Java program.
Backward compatibility in Java is a huge problem; everything in rt.jar is used by someone somewhere so nothing can be taken out. The Calendar class is the overengineered monstrosity that it is because there are people, such as those building historical databases, who absolutely need it.
The decision whether to include a library in core Java is deeply political. Having your organization's library in rt.jar apparently carries a great deal of prestige, a fact that has bedeviled efforts to pare down core Java.
The reason we have JDK 1.4, 1.5, and 1.6 rather than 4, 5, and 6 is that the contracts Sun negotiated 10+ years ago with its partners required renegotiation if the major revision were ever incremented. It was easier to just never increment the major revision.
SWT was a major sore spot. Looks like Gosling's going to have the last laugh after all, though.
São Paulo has two Java users' groups of about 6,000 members each. Their meetings regularly draw 1,000+ attendees. According to Gosling, Brazil is one of the world's software hotspots.
Some of these things certainly seem plausible, but others less so. If you've got any information one way or the other, I'd like to hear from you.
Ten Things That Surprised Me About Blogging
Depth-First began with a single post on August 12th, 2006. One year and 193 posts later, I thought it would be interesting to reflect back on how my expectations about blogging diverged from reality. Many of my observations come from statistics compiled by Google Analytics. An investment in time to install and use analytics software more than pays for itself within a few weeks.
With these thoughts in mind, here are ten things that surprised me:
The Undiscovered Continent of "Gray Literature." Like television networks, scientific journals serve subscribers with often wildly differing interests from yours. Some content will simply never make it into a print journal, no matter how useful or newsworthy. When you publish a blog, the only OK you need is your own. This means you can tackle subjects that are literally impossible for scientific journals to cover. In fact, your blog can contain valuable content not available anywhere else. After you wrap your brain around this simple concept, the most interesting things can start to happen.
It takes about six months of constant writing to see signs of being noticed. This is the phase during which I suspect many bloggers throw in the towel. Unless you already enjoy world renown in your field (in which case you're probably not blogging), expect to pay your six-month dues.
Syndication matters. A lot. When I started D-F, I couldn't understand why people seemed so excited about "syndication." But I did notice a steady build-up of traffic on my server logs dedicated solely to accessing my RSS and Atom feeds. I decided to track this traffic using FeedBurner and discovered, to my amazement, that it accounted for about a third of all non-robot traffic on my site.
Readers take the weekend off. Traffic drops off rather significantly (50-60%) on weekends. Conversely, traffic peaks quite consistently on Wednesdays. Although the former is understandable, the latter continues to puzzle me.
For niche subjects, AdSense sucks. Chemistry is a subject dominated by niches. Any recent ACS program will show 40+ divisions; chemistry is all about the long tail. In the early days of D-F, I experimented with AdSense and got nada. This made sense (pun intended) because the ads being shown had almost nothing to do with what I was writing about. AdWords can work very well for broadly appealing subjects. But if you're writing about your area of scientific expertise, which by definition is about as niche as you can get, AdSense is more likely to be an eyesore than a source of income.
Your most valuable asset: the archive. On a typical day, only about 15% of D-F's pageviews come from a user landing on the homepage itself. The rest come from pages previously found and linked to from other sources, bookmarks, or search engines. When seen from this perspective (which is hard to grasp when your archive contains ten articles), it becomes very important to maintain access to the archives and ensure that links can always be followed. On two occasions I've nearly hosed my database, but my backups saved the day. Never assume your archive of articles and comments will be there tomorrow.
You never know who's reading. The vast majority of readers never post a comment or write an email. I should have expected this, since the vast majority of blogs I read never get a comment or email from me. On the other hand, I've met a few very intelligent and friendly people online as a direct result of one or more D-F articles, which is the ultimate reward.
Writing begets more writing. Forcing yourself to write regularly about a subject you know about is very therapeutic. More importantly, being forced to back up your arguments in writing on a regular basis makes you examine your own assumptions more carefully. Most important of all, writing regularly brings new ideas into view that you would have missed otherwise.
Forget about getting Dugg; try to get StumbledUpon instead. Being a regular reader of Digg, I was well aware of its massive audience and the surge in traffic that follows a site being featured on the Digg home page. The effect usually lasts a couple of days. But the traffic surge produced by a listing in StumbleUpon continues unabated for weeks and results in permanently higher overall page views.
Robots everywhere. Google Analytics doesn't record activity from robots and web spiders. But my server log reveals a staggering amount of traffic due to non-human visitors. The software interacting with my site is doing everything from indexing it for Google to trying to post annoying ringtone spam comments. Still, the benefit of automated user agents more than outweighs the inconvenience. Make sure your robots.txt file allows any and all user agents.
Image Credit: Seether_Alpha
Never Draw the Same Molecule Twice: Viewing Image Metadata
Chemists are accustomed to embedding live molecular objects in their documents with Microsoft Word/ChemDraw. These objects can then be reprocessed and embedded into other documents, such as PowerPoint presentations, saving enormous amounts of time. What if the same feature were available with Web documents?
A recent D-F article proposed a method to encode molecular structure data within commonly used Web image formats such as PNG. That article contained an embedded image of GlaxoSmithKline's diabetes treatment rosiglitazone (Avandia) encoded by a rendering toolkit built with Firefly. I claimed that this image contained the complete connection table and atom coordinates as embedded metadata. In this article, I'll show a simple method to read this metadata.
Metadata is a standard part of the PNG specification; to read it requires nothing more than software capable of recognizing it. I recently found a Web-based, cross-platform method for doing so. The Image Metadata Viewer by FileFormat.info accepts an uploaded image file and returns that image's metadata. Let's try it with the image of rosiglitazone.
After saving the image to my hard drive, uploading it to FileFormat.info and pressing start, I can see that the image contains metadata:

The metadata can be viewed either as XML or as plain text. Choosing plain text (second option) gives me the complete molfile, stored as a key/value hash (molfile=[molfile]).
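The same key/value pairs can also be read without a Web service. The sketch below walks the chunks of a PNG file and collects any textual metadata it finds. It assumes the molfile is stored in an uncompressed `tEXt` chunk (the PNG specification also defines `zTXt` and `iTXt`, which this sketch ignores), and it skips CRC verification. The module name `PNGText` is hypothetical.

```ruby
require 'stringio'

module PNGText
  SIGNATURE = "\x89PNG\r\n\x1A\n".b # the eight bytes that start every PNG file

  # Returns a hash of keyword => text for every tEXt chunk in the PNG data.
  def self.read(data)
    io = StringIO.new(data.b)
    raise ArgumentError, 'not a PNG file' unless io.read(8) == SIGNATURE

    text = {}
    # Each chunk is: 4-byte big-endian length, 4-byte type, data, 4-byte CRC.
    while (header = io.read(8)) && header.bytesize == 8
      length, type = header.unpack('Na4')
      body = io.read(length).to_s
      io.read(4) # skip the CRC; this sketch does not verify it
      break if type == 'IEND'
      next unless type == 'tEXt'

      # A tEXt chunk holds a keyword, a null separator, then the text.
      keyword, value = body.split("\0".b, 2)
      text[latin1(keyword)] = latin1(value) if keyword
    end
    text
  end

  # tEXt keywords and values are Latin-1; convert them to UTF-8 strings.
  def self.latin1(bytes)
    bytes.to_s.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
  end
end
```

Given the saved image, something like `PNGText.read(File.binread('rosiglitazone.png'))['molfile']` would return the embedded molfile, provided the encoder used a `tEXt` chunk with that keyword; the filename here is only an example.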
Clearly, reading metadata is not a problem given the right software. But this leaves the question of how metadata is encoded in the first place - especially in a programming language such as Java. Like everything else, it's not difficult when you know how. Stay tuned for the answer.


