The Paper Laboratory Notebook: Chemistry's Most Ancient Data Tomb
Derek Lowe's In the Pipeline hosts an interesting discussion on Electronic Laboratory Notebooks (ELNs). The wasteful process of entombing valuable scientific data often begins with the paper lab notebook, so the subject of ELNs should be of great interest to anyone involved in creating, using, or reprocessing chemical information.
Why do paper notebooks persist in chemistry?
The issue is complex, but in my view stems from the lack of a truly usable and affordable tool. Although the term "tool" may suggest software, it actually involves a much more complex beast consisting of hardware, software, an ergonomic hardware/software user interface, and a computer network. In chemistry, the problem is compounded by the centrality of chemical structures and the inability of most generic ELN products to capture or use them.
Given these constraints, and the costs associated with creating and marketing general-purpose products designed to work within them, it's not surprising that many organizations decide to roll their own ELN. And it's even less surprising that many others decide sticking with paper is a better option - at least for now.
Image Credit: John Thurm
Raiding Chemistry's Data Tombs

Duncan Hull offers an interesting commentary on the rapid increase in the number of biologically-oriented databases. He asks whether all of this abundance is leading to nothing more than a bad case of data indigestion, in which data is dumped into write-only "data tombs," never to be seen again.
A data tomb is created whenever the ability to generate data outstrips the ability to do useful things with it. Like the burial tombs of ancient civilizations, data tombs are created for many reasons and take many forms.
Where are chemistry's data tombs and what do they look like? Given that the number of free chemistry databases pales in comparison to the number of free biological databases, the question may seem irrelevant.
Nevertheless, data tombs in chemistry are ubiquitous. The most obvious examples are the supplementary data sections of major chemical journals. These write-only databases suffer from dual afflictions of copyright restriction and electronic degradation.
The collective experimental sections of the world's chemical literature are, in effect, a vast catacomb of jealously guarded but poorly catalogued treasures.
Data silos are an especially prevalent kind of data tomb. They result when data is created for a single use and, for either technical or political reasons, never placed in a real database. SD files containing SAR data, PowerPoint slides containing tables of synthetic yields, and Word documents containing experimental procedures are some of the forms these chemical data silos take.
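As one small illustration of raiding a silo, here is a minimal sketch that pulls the data fields out of the text of an SD file using only the Python standard library. It assumes the usual SD file layout: records end with a "$$$$" line, and each data item is a "> <FIELD_NAME>" header followed by value lines and a blank line. The function name and the field names COMPOUND_ID and IC50_nM are made up for the example, and the structure blocks are empty placeholders; a real project would more likely lean on a cheminformatics toolkit.

```python
from typing import Dict, List


def parse_sd_data_fields(sd_text: str) -> List[Dict[str, str]]:
    """Return one dict of data fields per record found in SD file text.

    Simplification: records are split on the "$$$$" delimiter wherever it
    appears, and the structure (molfile) block of each record is ignored.
    """
    records = []
    for block in sd_text.split("$$$$"):
        if not block.strip():
            continue  # skip the empty remainder after the final delimiter
        fields: Dict[str, str] = {}
        lines = block.splitlines()
        i = 0
        while i < len(lines):
            line = lines[i]
            # Data headers look like "> <FIELD_NAME>".
            if line.startswith(">") and "<" in line and ">" in line[1:]:
                name = line[line.index("<") + 1 : line.rindex(">")]
                i += 1
                value_lines = []
                # The value runs until the next blank line.
                while i < len(lines) and lines[i].strip():
                    value_lines.append(lines[i])
                    i += 1
                fields[name] = "\n".join(value_lines)
            i += 1
        records.append(fields)
    return records


# Hypothetical two-record SD file with placeholder (empty) structures.
EXAMPLE_SD = """\
compound-1
  sketch

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <COMPOUND_ID>
CMPD-001

> <IC50_nM>
42

$$$$
compound-2
  sketch

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <COMPOUND_ID>
CMPD-002

> <IC50_nM>
310

$$$$
"""

sar_rows = parse_sd_data_fields(EXAMPLE_SD)
for row in sar_rows:
    print(row["COMPOUND_ID"], row["IC50_nM"])
```

Once the fields are in plain dictionaries, loading them into a real database is a routine step rather than an archaeology project.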
What chemical data tombs have you run into, and what methods did you use to raid them?
Image Credit: Duncan Hull
If You Want to Change the World, Build the Tool First - Part 2
Let's face it - real change is painful for most people. Think back, for example, to your last big change at work, and chances are pretty good that the experience was not entirely enjoyable - especially if the change was imposed on you.
As designers of tools, it's easy to forget just how unpleasant change is for your users. Being closely involved and invested in the development of your tool only makes it harder to empathize with the people whose routines you'll be interrupting.
When innovations fail to catch on, it may be tempting to explain the situation in terms of users not "getting it," or through the intervention of outside forces with their own agenda. But more often than not, the real problem results from the innovation failing to offer a reasonable promise of compensation for the inconvenience that change brings.
The previous article in this series suggested that the same dynamic applies to the compilation, management, and sharing of spectral data by chemists. More to the point:
... cheminformatics has failed to deliver an inexpensive, robust, and truly usable solution to the problem of compiling, managing, and sharing spectral data for scientists of average computer skills. ...
To be sure, there are tools that address parts of the problem. But no solution addresses them all and that's why scientists and publishers resort to using obviously inferior solutions like PDFs. Let's take each of the requirements one at a time:
Inexpensive. One of the chronic problems in vertical markets like chemistry software is the lack of ubiquitous tools. Lack of ubiquity is a recipe for balkanization. Because chemistry software tends to be highly specialized and expensive to develop, suppliers must and do pass these costs on to customers. Change linked to money is especially hard to accept. The key, therefore, to developing the ideal tool is to focus relentlessly on keeping development costs low so as to deliver a low-cost (or free) tool. It's all but guaranteed that the ideal tool will take advantage of multiple pieces of Open Source software.
Robust. Few things are more difficult than trying to convince a skeptic to try a new, unreliable technology. Getting the last 20% in reliability is orders of magnitude more difficult than getting the first 80%. Part-way simply won't cut it.
Usable. A steep learning curve is a surefire deterrent to adoption. Chemistry has a long history of software with poor usability. Who could blame jaded users for turning away from "yet another piece of software"? Make it obvious or don't make it at all. Tying the tool to a specific operating system or browser is an especially bad idea; "usable" means usable by everyone.
The ideal solution must also address the three key needs chemists have with respect to using their spectra:
Compile Spectra. Contrary to an apparently popular belief among non-experimental chemists, most experimental chemists create their own spectra. There may be a "spectroscopist" who handles unusual cases, but the vast majority of spectra are created and interpreted by the chemist. They need a tool that requires no thought or planning to get a spectrum from the instrument into a database and ultimately onto their desktop.
Manage Spectra. During any given year, an organic chemist of average productivity can generate hundreds of spectra. It's a safe assumption today that these will be in digital format. The volume of data creates its own set of problems: where to store the spectra, how to store them, how to find them again, and how to manipulate them once they are found. Tagging the spectra in such a way that the sample history can be reconstructed is critical.
Share Spectra. One of the primary channels for sharing spectral data is through scientific publication. The tool must offer an obvious solution for scientists to compile their data into packages that publishers can work with and readers can do something with.
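To make the three needs a little more concrete, here is a minimal sketch of what a tagged, shareable spectrum record might look like. Everything in it is an assumption made for illustration: the SpectrumRecord fields, the sample IDs, the parent-link scheme for sample history, and JSON as the package format. It is not a description of any existing tool or standard.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class SpectrumRecord:
    """One digitized spectrum plus the tags needed to find it again."""
    sample_id: str
    technique: str            # e.g. "IR", "Raman", "1H NMR"
    acquired_on: str          # ISO 8601 date string
    x_units: str
    y_units: str
    x: List[float]
    y: List[float]
    parent_sample_ids: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)


def sample_history(records: List[SpectrumRecord], sample_id: str) -> List[str]:
    """List a sample and its ancestors by following parent links."""
    parents: Dict[str, List[str]] = {}
    for record in records:
        known = parents.setdefault(record.sample_id, [])
        for parent in record.parent_sample_ids:
            if parent not in known:
                known.append(parent)
    history: List[str] = []
    seen = set()
    stack = [sample_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against accidental cycles in the tags
        seen.add(current)
        history.append(current)
        stack.extend(reversed(parents.get(current, [])))
    return history


def to_package(records: List[SpectrumRecord]) -> str:
    """Serialize records as JSON so the numbers survive, not just a picture."""
    return json.dumps([asdict(record) for record in records], indent=2)


def from_package(package: str) -> List[SpectrumRecord]:
    """Rebuild records from a JSON package produced by to_package."""
    return [SpectrumRecord(**item) for item in json.loads(package)]


# Hypothetical samples: a crude product, its purified form, and a final compound.
records = [
    SpectrumRecord("crude-01", "IR", "2007-03-01", "cm-1", "absorbance",
                   [1650.0, 1700.0], [0.12, 0.48], tags=["step 1"]),
    SpectrumRecord("purified-01", "IR", "2007-03-02", "cm-1", "absorbance",
                   [1650.0, 1700.0], [0.05, 0.61], ["crude-01"], ["step 1"]),
    SpectrumRecord("final-01", "IR", "2007-03-05", "cm-1", "absorbance",
                   [1650.0, 1700.0], [0.40, 0.02], ["purified-01"], ["step 2"]),
]

print(sample_history(records, "final-01"))
package = to_package(records)
restored = from_package(package)
```

The point of the sketch is only that the x and y values travel with their tags. A reader who receives the package can replot, compare, or model the data, which is exactly what a flat image prevents.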
The analogy that springs to mind is blogging. As early as 1994, blogging was technically possible - all the pieces were in place and the demand for online content was mushrooming. So why didn't it happen? There was no tool that actually made it cheap and easy to blog. Starting in 2000-2001, those tools began to appear. Today, we take it for granted that anyone who wants to publish their own writing can do so almost immediately.
The availability of the tool did what years of discussion failed to do; it changed behavior. It succeeded by offering a reward that more than compensated for the pain of change.
The development of a ubiquitous tool for spectral data compilation, management, and sharing is an opportunity with a potentially big reward for the group that gets it right. It's one of those uninteresting, widespread problems that create a natural scarcity of good solutions and of people willing to develop them. Most players in the field have concluded (prematurely) that a solution already exists, and so are reluctant to get involved.
What more could you ask for as a developer?
Image Credit: Daniel Morris
If You Want to Change the World, Build the Tool First - Part 1
Breakthroughs in technologies for managing and exchanging information always precede explosions in information exchange. From a safe distance, this principle seems completely obvious. Yet, like most obvious things, it's all too easy to forget in the heat of battle.
Recently, Peter Murray-Rust discussed the appalling state of data capture, dissemination, preservation, and curation. His comments were prompted by an article written by Nico Adams. In it, Nico describes his initial excitement at the publication of a large spectroscopic dataset, followed by his frustration on finding that the "data" really consisted of nothing more than flat images stored in PDF format.
The article in question is titled Preparation and Infrared/Raman Classification of 630 Spectroscopically Encoded Styrene Copolymers. Not having a subscription to the ASAP contents of this particular journal, I can only go by what appears in the abstract. From the abstract and title, it's clear that the dataset is the centerpiece of this article:
The barcoded resins (BCRs) were introduced recently as a platform for encoded combinatorial chemistry. One of the main challenges yet to be overcome is the demonstration that a large number of BCRs could be generated and classified with high confidence. Here, we describe the synthesis and classification of 630 polystyrene-based copolymers prepared from the combinatorial association of 15 spectroscopically active styrene monomers. Each of the 630 copolymers displayed a unique vibrational fingerprint (infrared and Raman), which was converted into a spectral vector. ...
Apparently, the technique enables polymer beads to be encoded with a spectroscopically-readable tag for use in identifying attached compounds at the end of a split-pool synthesis. Yet the supplementary material for the article consists of nothing more than static images like the one below:

For researchers hoping to build on the experiments described in the paper, and for those hoping to model or compile the results, static images like the one shown above are practically useless.
Why did this happen and why do incidents like it play out with bewildering regularity in chemistry?
Nico looks to scientists and publishers, whereas Peter focuses on the publishers as the root cause.
I understand the reasoning and share their concern about the problem, but I disagree about the cause.
The cause of this problem is neither the policies of publishers nor the lack of understanding of the problem by scientists - those are just symptoms. The root cause is a failure of cheminformatics itself. Simply put, cheminformatics has failed to deliver an inexpensive, robust, and truly usable solution to the problem of compiling, managing, and sharing spectral data for scientists of average computer skills.
The tool hasn't been built yet. No tool means that both scientists and publishers will continue to use the only tools they have any faith in, despite their obvious flaws. No tool leads to more of the same, from both scientists and publishers. No tool also means an enormous opportunity for the group that develops it.
Read Part 2 to find out why.
Image Credit: Neil T

