How to Find Chemical Information on the Internet: Why Open Source, Open Access, and Open Data Matter
The Web may be the most effective information-delivery platform ever created. Unfortunately, a variety of barriers, both technical and cultural, restrict the use of the Web for chemistry. In the last few years, three powerful forces for change have emerged: Open Source, Open Access, and Open Data. Most of what's written on these subjects takes a theoretical angle that makes it difficult to visualize real benefits. In this article, I'll discuss these ideas from a much more practical perspective.
A Thought Experiment
Try this simple thought experiment: using only a browser and the free Internet, find all Web pages that have anything scientifically relevant to say about your favorite molecule. How would you do it?
It's Trivial
If you were lucky enough to have chosen a molecule with a trivial name such as 'caffeine', you could just try Google. Google's first result would link you to the Caffeine Wikipedia article. Wikipedia is an evolving phenomenon that, according to some critics, will never have a place in scientific research. It may not be ready now, but reading the meticulously annotated and cross-referenced entry for caffeine should make anyone who would say "never" at least a little nervous. Many of the citations in Wikipedia's caffeine article point to the primary scientific literature through PubMed.
The remainder of Google's top-50 results are general audience items unlikely to interest a scientist keeping his or her nose to the grindstone: companies that sell caffeinated products; a variety of FAQs; self-help medical articles; and of course, this one. We shouldn't be surprised. In the eyes of a massive search engine like Google, chemistry is just one of many niche markets.
Adding terms to our Google search might produce more targeted results. For example, what if we wanted to find a proton NMR spectrum of caffeine? We could type "caffeine proton nmr" into Google. The first result links, indirectly, to an article in the subscription-only Russian Journal of Physical Chemistry A. This does us no good because we have no subscription, limited funds, and no access to the journal at a library. The second link is a direct hit: the proton NMR spectrum of caffeine in water-formic acid. Significantly, the information is contained in a peer-reviewed article (DOI) published by the Japanese Open Access journal, Analytical Sciences. The fact that Analytical Sciences is an Open Access journal has made a world of difference in our search.
Although this might seem like the perfect solution, recall that the goal of the experiment was to locate all scientifically-relevant online content relating to the molecule. The technique we just used is most likely to succeed when we want specific information about molecules with a single trivial name. Even then, many resources may not cite a trivial name at all.
The Real World
Our options are even more limited when we attempt a comprehensive text search for even the simplest molecules that lack a widely-used trivial name. For example, consider the molecule represented by the systematic name '3-phenyl-2-methylpropene.'

If we were using a proprietary system such as those offered by Chemical Abstracts Service (CAS), we could simply enter the structure into a client and read off our results. This works because CAS isn't matching text when a query is submitted. Instead, it's matching molecular structures that have previously been encoded by both humans and machines.
The minute we step out of the orderly system created by CAS and into the chaos of the Internet, we confront a thorny problem. In practice, there are only two widely-used methods to convert a molecular structure diagram into a form that can be text-searched, and each has major limitations:
IUPAC Nomenclature: This method has the advantage of being Open. It suffers from the complexity of its encoding rules, resulting in a variety of nonstandard implementations. As a result, it's possible to find multiple phrasings of the same IUPAC name, reducing its usefulness as a unique identifier.
CAS Numbers: This method replaces a standard encoding system with a central registration authority. The advantage is that the representation of the identifier itself is unambiguous. The disadvantage is that the meaning of a CAS Number can only be known by referring to the registration authority. Unfortunately, the current business model for CAS is based on restricting information flow, rather than promoting it.
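The first limitation, multiple phrasings of the same IUPAC name, means that a single text search can miss pages that use a different phrasing. As a minimal sketch, assuming we have already collected a few variants by hand, the following Python combines them into one exact-phrase query. The function names are illustrative, the second variant is simply the same name with its substituents in alphabetical order, and the search URL layout is an assumption about how a search engine accepts a query string.

```python
from urllib.parse import urlencode


def exact_phrase_query(names):
    """Join name variants into one query of quoted phrases separated by OR."""
    seen = set()
    phrases = []
    for name in names:
        cleaned = name.strip()
        key = cleaned.lower()
        # Skip blanks and case-insensitive duplicates.
        if cleaned and key not in seen:
            seen.add(key)
            phrases.append(f'"{cleaned}"')
    return " OR ".join(phrases)


def search_url(query, base="https://www.google.com/search"):
    """Build a search URL; the base and the 'q' parameter are assumptions."""
    return f"{base}?{urlencode({'q': query})}"


# Two phrasings of the same systematic name (illustrative, not exhaustive).
variants = ["3-phenyl-2-methylpropene", "2-methyl-3-phenylpropene"]
query = exact_phrase_query(variants)
print(query)
print(search_url(query))
```

This only covers the phrasings we think to list, which is exactly the weakness of names as identifiers.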
Search by IUPAC Name
Let's try searching Google for IUPAC nomenclature. Entering '3-phenyl-2-methylpropene' produces a results page containing three unique entries. One of them links to a database run from the University of Hamburg. I gather the purpose of this database, which I was unaware of before doing this search, is to link to Landolt-Börnstein Online, a collection of numerical data. Interestingly, a search for data on a single compound turned into the discovery of yet another free chemistry database.
The remaining hits from the Google search linked to two pages (pdf, pdf) from ACS journals. The ACS routinely makes the first page of its articles a free download. It's interesting to note that these were the only ACS hits returned by Google.
We've exhausted the possibilities with our chosen IUPAC name and Google. But before moving on to searching by CAS Number, we need to solve a problem. How do we get the CAS number for our compound?
Finding a CAS Number: A PubChem Detour
Concisely summarizing what PubChem does can be difficult because different users will emphasize different aspects of its design. For our purposes, PubChem probably contains a Web page describing the compound we've been researching, and on that page may be a CAS number.
Submitting our molecule to PubChem's search page produces one result. Fortunately, this page lists our compound's CAS number: 3290-53-7.
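Before pasting a CAS number into a search engine, it can be worth confirming that it was transcribed correctly. CAS Registry Numbers end in a check digit: the remaining digits, read from right to left, are multiplied by 1, 2, 3, and so on, and the sum modulo 10 must equal the final digit. The Python below is a minimal sketch of that check; the function name is my own, and passing the check only shows that a number is well formed, not that it is assigned to any particular compound.

```python
import re

# Two to seven digits, two digits, then a single check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(cas):
    """Return True if the string is a well-formed CAS number with a correct check digit."""
    match = CAS_PATTERN.match(cas.strip())
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(digits), start=1)
    )
    return total % 10 == check_digit


print(is_valid_cas("3290-53-7"))  # the number listed on the PubChem page
print(is_valid_cas("3290-53-8"))  # same number with a mistyped final digit
```

For 3290-53-7 the weighted sum is 3 + 10 + 0 + 36 + 10 + 18 = 77, and 77 modulo 10 is 7, matching the check digit.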
Search by CAS Number
Submitting our CAS number to Google produces sixteen results. The first two link to the Landolt-Börnstein pages. The next result links to a product listing page for a chemical supplier.
The fourth result links to a far more interesting page - the Organic Syntheses website. Fortunately for us, Organic Syntheses makes its contents freely available online. Following the link takes us to a preparation in which one of the reagents can be substituted with our molecule of interest. Further down in this page, we can see that this molecule has been further cross-referenced. Two procedures are listed, one of which is new to us. Following the link, we find a complete synthetic procedure with full characterization and primary literature citations. Jackpot.
Organic Syntheses permits free public access, but is it Open Access? Many would say not, due to the fact that it retains full copyright to its works and doesn't permit free redistribution. The distinction mainly matters to those seeking to create Open Data repositories based on the contents of periodicals such as Organic Syntheses. To an end user, however, the distinction matters little in the short run.
The remaining results from our Google search are interesting, but mainly consist of chemical supplier catalogs. It should, however, be noted that all of the results returned by a Google search of our CAS number were relevant to our molecule of interest.
InChI
In an effort to overcome the limitations of CAS Registry Numbers and IUPAC systematic nomenclature as unique molecular identifiers, a new system has recently been introduced, the IUPAC International Chemical Identifier (InChI). In contrast to a CAS number, an InChI can be assigned independently of a central authority. Like systematic nomenclature, an InChI can be decoded to a molecular representation. Unlike IUPAC systematic nomenclature, an InChI is generated by a computer algorithm far too complicated for human use. The developers of the InChI software have released their work under an Open Source license, promoting its widespread use by ensuring that services like PubChem will have no difficulties integrating InChI with their software infrastructures. Unlike either CAS Numbers or IUPAC names, InChIs are not yet in widespread use, a point which currently limits their utility.
The PubChem page for our search molecule listed an InChI, as do all PubChem Compound Summary pages. As shown by Peter Murray-Rust and others, it is perfectly feasible to use Google to search for InChIs. Let's try.
Submitting our InChI query to Google gives no results. Leaving off the leading 'InChI=' text also results in no hits. This tells us that Google has found no instances of our InChI, and that Google still does not crawl PubChem Compound Summary pages.
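The two searches tried here, one with the leading 'InChI=' text and one without, can be generated mechanically. The Python below is a minimal sketch; the function names are my own, and the example string is the InChI for methane, used only as a placeholder rather than the identifier of the compound discussed in this article.

```python
from urllib.parse import quote_plus

INCHI_PREFIX = "InChI="


def inchi_query_variants(inchi):
    """Return exact-phrase queries with and without the 'InChI=' prefix."""
    inchi = inchi.strip()
    if inchi.startswith(INCHI_PREFIX):
        body = inchi[len(INCHI_PREFIX):]
    else:
        body = inchi
    # Quoting each query asks the search engine for an exact match.
    return [f'"{INCHI_PREFIX}{body}"', f'"{body}"']


def as_query_string(query):
    """Percent-encode a query for use in a URL; slashes and quotes are escaped."""
    return quote_plus(query)


# Placeholder InChI (methane), not the article's compound.
for query in inchi_query_variants("InChI=1S/CH4/h1H4"):
    print(query, "->", as_query_string(query))
```

Whether a search engine indexes the punctuation inside an InChI as a single token is outside our control, which is one reason to try both forms.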
Use a Free Database
Numerous free chemistry databases are now running on the Internet. For example, a recent article highlighted thirty-two of them. Would one of them be useful to our search? We need to ask ourselves if we really want to perform more than thirty individual searches. What if we were looking for data on several molecules? Nothing would prevent us from doing this in theory, but in practice, this is too much work.
What we'd really like is to submit a structure query to a single service that will query all of these free databases for us. While such services do exist in name, their breadth is restricted. A more comprehensive solution would be very helpful indeed.
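To make the idea concrete, here is a minimal Python sketch of such a fan-out service. Everything specific in it is an assumption: the database names and URL templates are hypothetical placeholders on example.org, and the `fetch` argument stands in for whatever HTTP client a real service would use. No existing database is known to expose these endpoints.

```python
from urllib.parse import quote

# Hypothetical URL templates; real databases each have their own query interface.
DATABASES = {
    "database-a": "https://db-a.example.org/search?inchi={query}",
    "database-b": "https://db-b.example.org/lookup/{query}",
}


def build_requests(identifier, templates=DATABASES):
    """Fill each template with the percent-encoded structure identifier."""
    encoded = quote(identifier, safe="")
    return {name: template.format(query=encoded) for name, template in templates.items()}


def federated_search(identifier, fetch, templates=DATABASES):
    """Send one identifier to every database and collect results per source.

    `fetch` is any callable that takes a URL and returns a result; a failure
    in one source is recorded without stopping the others.
    """
    results = {}
    for name, url in build_requests(identifier, templates).items():
        try:
            results[name] = fetch(url)
        except Exception as error:
            results[name] = f"error: {error}"
    return results


# Dry run: "fetch" just echoes the URL that would be requested.
print(federated_search("InChI=1S/CH4/h1H4", fetch=lambda url: url))
```

The sketch assumes every database accepts the same identifier, which is the hard part in practice; a shared, openly computable identifier such as InChI is what would make this kind of service feasible.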
Conclusions
The Web's convenience and ubiquity have prompted many calls for greater Web accessibility to public chemical information. As hinted at by the examples in this article, Open Source, Open Data, and Open Access are three interrelated forces that can make this vision a reality. Open Access journals lower the economic barriers to compiling Open Data sources. Making these Open Data sources useful to scientists in a cost-effective way requires Open Source software. The availability of good Open Source software stimulates the creative combination of Open Data sources. And so on.
A lot needs to be done before this positive feedback loop can replace the status quo. But even with the chaotic, balkanized system that now exists, the benefits are plain to see. With even a small amount of coordination among Open Source software developers, Open Data providers, and scientific publishers, the most amazing things could happen.
Making the Case: Topological Maximum Cross Correlation
... For the Gasteiger partial charges, we took maximum values for positive and negative charges from the “fragmentlike” subset of the ZINC database, consisting of 49 134[sic] molecules, carrying out the calculation with Open Babel 2.0.0.
...
... All structure handling, atom typing, and descriptor calculation was carried out using the open source Java library JOELib.
...
Source code (in Java) to generate the TMACC descriptors is freely available from our Web site under the GNU General Public License at http://comp.chem.nottingham.ac.uk/download/tmacc/index.html.
-James Melville and Jonathan Hirst, J. Chem. Inf. Model.
Science happens when the experiments and conclusions of your fellow scientists can be freely questioned and independently verified. For example, readers of the cited paper may have questions about the assumptions in the TMACC method, or how to implement it. Questions may be raised about the suitability of the data set used and how others would perform. Readers may even have questions about how to extend TMACC to areas not considered by the authors.
By basing their work on open source software and open data, and by releasing their reference implementation as open source, Melville and Hirst raise their work to the level of science. The questions that any reasonable scientist would have about the work described in the paper can be answered at any desired level of detail because all source code and all test data are freely available.
Why don't all authors adopt the same approach? Why doesn't a flagship journal such as J. Chem. Inf. Model. require it of all manuscript submissions? As far back as 1984, John Figueras was making this case. Thankfully, Melville and Hirst are taking the message seriously.
Making the Case: Personal Chemistry Client

Good software designed with chemists in mind is still quite rare, and when that software is Open Source it's rarer still. Two very popular titles are Jmol and PyMOL. A third, Bioclipse, is gaining in popularity. So it was with great interest that I came upon a company called AKos Consulting & Solutions, and their Open Source application Personal Chemistry Client (PCC).
PCC, as best as I can tell, is designed to be a personal chemical database. The good news is that PCC is licensed under the GNU General Public License. The bad news is that PCC also requires two important things from your computer:
A system capable of running the Microsoft .NET 2.0 Framework. The framework itself is included with the download. Unfortunately, this requirement rules out running PCC on Linux or Mac OS X.
The free (as in beer) ActiveX plugin ChemViewX from Hyleos. This plugin is also included with the download.
I was able to download, install, and use PCC on my system (Windows XP Home) without a problem. Aside from some confusing behaviors of its template components, the application seems to work as described.
Behind the scenes, PCC uses the Chemistry Development Kit (CDK) for structure searching. It's not clear what rendering engine is used by the ChemViewX plugin. Just looking at the output, though, it may well also be CDK.
The emergence of PCC, an Open Source program developed by a for-profit vendor, is an exciting development. Business models may take some time to solidify, but chemistry clearly offers numerous peculiarities to take advantage of. And the folks behind PCC are well ahead of the curve.
Open Source and Open Data: Why We Should Eat Our Own Dogfood
The National Institutes of Health (NIH) has decided to use OpenEye Scientific Software's cheminformatics toolkits to provide key infrastructure for PubChem, a database of small organic molecules containing chemical structure and biological activities information. PubChem is being developed by the National Center for Biotechnology Information (NCBI) as part of the Molecular Libraries and Imaging component of the NIH Roadmap Initiative. "I am excited to see our software built into PubChem," says Roger Sayle OpenEye's Vice President of Bioinformatics. "It's gratifying that our software will be part of such a useful public resource."
...
Along with the recent decision by the Research Collaboratory for Structural Bioinformatics (RCSB) to use OpenEye's cheminformatics toolkits to curate and depict the Protein Data Bank (PDB) ligand dictionary, the NIH's decision is a clear indication of the speed and robustness of OpenEye's technology for large and diverse sets of chemical structures and data. "It has been beneficial working with the NCBI for their project," says Sayle. "Their data includes enough unusual chemistry to make it a nice validation of the software beyond our regular test sets."
-OpenEye Scientific Software Press Release - October 12, 2004
Why did PubChem, the granddaddy of all open chemistry databases, choose a closed, proprietary toolkit for its software infrastructure? A recent Depth-First article highlighted twelve free chemistry databases. Of those for which information is available, many have chosen the same path as PubChem. Why is this?
A huge opportunity is being wasted every time this happens. We, the authors of Open Source software packages, could be working with the architects of Open Data systems to solve their problems in ways that vendors of closed systems can't. We could be using these Open Data systems as real-world proving grounds for our software, fixing bugs that would never have been detected otherwise, and pushing our systems to the limit. We could be identifying new and exciting uses for our software as the organizations we work with repeatedly ask "what if." Sadly, none of this is happening. A great deal more needs to be done by the Open Source community to persuade the Open Data community to at least try their software. The worst that can happen is that we begin to understand the appeal of closed, proprietary products.
One bright spot is NMRShiftDB, which uses the Open Source Chemistry Development Kit for its infrastructure. This is a fine example of Open Source software powering an Open Data source in chemistry. More examples of this kind of Open Source/Open Data symbiosis would go a long way toward making the case.
Eating your own dogfood is an effective way to break into new markets and develop truly competitive products. After all, if those with closely-aligned goals won't use what you have to offer, who else will?
Dispelling Open Source Confusion: An Introduction to Licenses
Selecting an open-source license is a minefield for which few are prepared when they need to be. There are a plethora of licenses under which open-source software can be released. Selecting a license at the initiation of a FOSS [Free and Open Source Software] project is likely to be a low priority, as there is no initial value to the project. Without a line of source code written, wading through the legalese and nuances of distribution licenses seems unimportant. In reality, the irrevocable nature of the license makes this the most critical time if authors wish to eventually exercise control over derivative works. ... Unfortunately, even the most carefully selected and restrictive license may not afford complete protection from unanticipated and undesired uses.
-Matthew T. Stahl, Drug Discovery Today
Few subjects cause as much confusion and as many heated debates as Open Source licensing. The Open Source Initiative has approved over 50 licenses compatible with their ten-point definition of "Open Source". Whenever that many solutions to a problem exist, it's a sure sign that one size does not fit all. In this article, I'll introduce some of the key concepts in Open Source licensing.
Disclaimer
There is a phrase used so often in discussing the legal aspects of Open Source software that it has its own acronym: I Am Not A Lawyer (IANAL). Clearly IANAL, and chances are that you are not one either. Yet the very acts of writing and using Open Source software require basic familiarity with licensing terms and concepts. My aim in this article is not to provide legal advice, but rather to relate what I've found useful in trying to understand Open Source licensing for my own work. When in doubt, hire a lawyer.
One Good Book

The best writing on the subject of Open Source licensing I've read can be found in the book Open Source Licensing by Lawrence Rosen. An intellectual property attorney, Rosen also served as general counsel and secretary of the Open Source Initiative. His book is remarkably clear and easy to read. If you'd rather not pay for a hardcopy, it can be viewed in its entirety online.
The Good News
Fortunately, all Open Source licenses share some common features, if you know what to look for. For example, most licenses can be divided into one of two major categories:
Academic Licenses: These licenses, named for their original use in universities, allow unlimited freedom to distribute binaries based on altered source code without making these changes public. Examples of widely-used academic licenses include the Apache License, the BSD License, and the MIT License.
Reciprocal Licenses: These licenses require, to varying degrees, the developer of a derivative work to release his or her modifications to the public if their work is distributed. The question of what constitutes a "derivative work" varies from license to license, but most generally involves the modification of the files of a software package. Examples of widely-used reciprocal licenses include the GNU General Public License (GPL), the GNU Lesser General Public License (LGPL), the Mozilla Public License (MPL), and the Common Public License (CPL).
The Importance of Copyright
A frequently-encountered misconception equates Open Source licensing with release into the "public domain." Nothing could be further from the truth. The difference is in the ownership of copyright.
Software in the public domain has no owner. All enjoy unrestricted freedom to copy and otherwise use public domain software. A well-known example is David Megginson's SAX XML toolkit. Megginson, by placing his software in the public domain, has forfeited all rights to control how his work is used. Sun Microsystems incorporated SAX into their Java Development Kit without any obligation to Megginson whatsoever. SAX is not Open Source software; it is public domain software.
In contrast, software distributed under an Open Source license remains the intellectual property of the copyright owner. The license is simply a mechanism for the software's creator to give some (or all) of their rights to a licensee, usually in exchange for conditions that must be met. Ownership remains with the creator, who is free to distribute his or her work simultaneously under commercial and Open Source licenses if they so desire.
As you can see, copyright gives a license its legal legitimacy. Far from placing software in the public domain, Open Source licenses use copyright law in the same ways as commercial licenses. This is why understanding Open Source licenses is so important for developers and users alike.
Reciprocity: Share and Share Alike?
Critics of the GPL frequently cite its "viral" nature. The debate essentially boils down to the following paragraph:
You must cause any work that you distribute or publish, that in whole or in part contains or is derived from the Program or any part thereof, to be licensed as a whole at no charge to all third parties under the terms of this License.
Like a virus that spreads through replication, the GPL spreads by forcing licensees to release their modifications under the GPL. There are at least two other terms that describe this concept. The Free Software Foundation (FSF) uses the term "copyleft." Lawrence Rosen prefers the term "reciprocity" because of its neutral tone and greater descriptive ability. It's the term I'll also use. Reciprocity is such a fundamental concept in the GPL and other licenses that Rosen's book dedicates an entire chapter to the subject.
Developers distribute their software under reciprocal licenses for a variety of reasons. Two of the most common are:
To limit "freeloading", or the use of the software by those (typically companies) who contribute nothing back to the developer community.
To prevent "forking", or the establishment of a competing software package based on the original package.
In reality, Open Source licenses are limited in their ability to prevent either freeloading or forking. For example, provided that a company distributes no modifications to a GPLed package, they are under no obligation to release any of their own source code. Forking happens whenever one or more developers feel strongly enough about a subject to go in a different direction; an Open Source license does nothing to change this.
Given the limitations (and complexities) of reciprocity provisions, one might ask "why bother?". This is an excellent question, the answer to which will depend on your specific goals for your software. And as Stahl points out, the time to make this choice is before a line of code has been written.
Conclusions
Although Open Source licensing may appear to be a minefield, there is nothing mysterious about it. A lot of good writing is available on the subject, with Lawrence Rosen's book being a prime example. If you plan on creating or using Open Source software, learning the basic ideas behind Open Source licensing is a wise investment.


