Forty-Eight Free QSAR Datasets (and More)
Whether you're a medicinal chemist or an informatician, QSAR datasets can be very helpful in understanding complex biological phenomena. These datasets typically consist of a hundred or fewer compounds associated with a specific parameter such as intestinal absorption, volume of distribution, blood-brain barrier penetration, or activity at one or more biological targets. Most of them are published as part of a paper appearing in a peer-reviewed journal.
Unlike chemistry databases, which typically combine a search engine and a user interface with a dataset of thousands or millions of compounds, a QSAR dataset is much more focused and raw. You need to supply your own data viewer, report generator, and query tool.
The Internet hosts a bewildering assortment of QSAR datasets tucked into various nooks and crannies. The problem is finding them. One useful resource is cheminformatics.org, which hosts a page linking to forty-four datasets.
Recently, Shaillay Kumar Dogra, Scientific Editor of QSARWorld, wrote in to let me know about the site's offering of forty-eight free QSAR datasets. Each dataset is linked to the primary literature and is available in four formats, including SD File. In contrast to many datasets, those at QSARWorld are manually curated. QSARWorld is also actively seeking new datasets to convert into machine-readable form; if you find one, write to them to have it added to the collection.
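Because a dataset like this arrives as a raw file rather than a searchable database, a small reader is usually the first tool you end up writing. The sketch below is one way to pull the title and data fields out of an SD File using only the Python standard library. It follows the general SD File layout (a molfile ending in "M  END", data items headed by "> <name>", records terminated by "$$$$"), but the function names, the field name "activity", and the two sample records are hypothetical and are not taken from QSARWorld.

```python
from typing import Dict, Iterator, List

# Hypothetical two-record SD File; the structures and values are made up.
SAMPLE_SDF = "\n".join([
    "methane",
    "  hypothetical-sample",
    "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
    "> <activity>",
    "1.23",
    "",
    "$$$$",
    "ethane",
    "  hypothetical-sample",
    "",
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.5000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "M  END",
    "> <activity>",
    "2.34",
    "",
    "$$$$",
])


def parse_record(lines: List[str]) -> Dict[str, object]:
    """Split one record into its title, molfile block, and data fields."""
    # The molfile block runs up to and including the "M  END" line.
    end = next((i for i, l in enumerate(lines) if l.startswith("M  END")),
               len(lines) - 1)
    fields: Dict[str, List[str]] = {}
    name = None
    for line in lines[end + 1:]:
        if line.startswith(">"):
            # Data header such as "> <activity>": the name sits in <...>.
            start, stop = line.find("<", 1), line.rfind(">")
            name = line[start + 1:stop] if 0 < start < stop else None
            if name is not None:
                fields[name] = []
        elif not line.strip():
            name = None  # a blank line closes the current data item
        elif name is not None:
            fields[name].append(line)
    return {
        "title": lines[0].strip() if lines else "",
        "molfile": "\n".join(lines[:end + 1]),
        "fields": {k: "\n".join(v) for k, v in fields.items()},
    }


def read_sdf_records(text: str) -> Iterator[Dict[str, object]]:
    """Yield one parsed record for every "$$$$"-terminated block."""
    record: List[str] = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            if record:
                yield parse_record(record)
            record = []
        else:
            record.append(line)
    if any(l.strip() for l in record):  # tolerate a missing final "$$$$"
        yield parse_record(record)


if __name__ == "__main__":
    for rec in read_sdf_records(SAMPLE_SDF):
        print(rec["title"], rec["fields"])
```

A reader like this does nothing with the structures themselves; for that you would hand the molfile block to a cheminformatics toolkit of your choice.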
Systematic efforts to collect, curate, and distribute raw data from the primary literature are long overdue. QSARWorld offers an intriguing model for doing so. Although some non-scientific issues, such as intellectual property rights, don't appear to have been addressed yet by QSARWorld, the site's offering of machine-readable raw data offers plenty of food for thought to anyone working with QSAR.
What's your favorite dataset resource?
Image Credit: B.G. Lewandowski
How to Find Chemical Information on the Internet: Why Open Source, Open Access, and Open Data Matter
The Web may be the most effective information-delivery platform ever created. Unfortunately, a variety of barriers, both technical and cultural, restrict the use of the Web for chemistry. In the last few years, three powerful forces for change have emerged: Open Source; Open Access; and Open Data. Most of what's written on these subjects takes a theoretical angle that makes it difficult to visualize real benefits. In this article, I'll discuss these ideas from a much more practical perspective.
A Thought Experiment
Try this simple thought experiment: using only a browser and the free Internet, find all Web pages that have anything scientifically-relevant to say about your favorite molecule. How would you do it?
It's Trivial
If you were lucky enough to have chosen a molecule with a trivial name such as 'caffeine', you could just try Google. Google's first result would link you to the Caffeine Wikipedia article. Wikipedia is an evolving phenomenon that, according to some critics, will never have a place in scientific research. It may not be ready now, but reading the meticulously annotated and cross-referenced entry for caffeine should make anyone who would say "never" at least a little nervous. Many of the citations in Wikipedia's caffeine article point to the primary scientific literature through PubMed.
The remainder of Google's top-50 results are general audience items unlikely to interest a scientist keeping his or her nose to the grindstone: companies that sell caffeinated products; a variety of FAQs; self-help medical articles; and of course, this one. We shouldn't be surprised. In the eyes of a massive search engine like Google, chemistry is just one of many niche markets.
Adding terms to our Google search might produce more targeted results. For example, what if we wanted to find a proton NMR spectrum of caffeine? We could type "caffeine proton nmr" into Google. The first result links, indirectly, to an article in the subscription-only Russian Journal of Physical Chemistry A. This does us no good because we have no subscription, limited funds, and no access to the journal at a library. The second link is a direct hit: the proton NMR spectrum of caffeine in water-formic acid. Significantly, the information is contained in a peer-reviewed article (DOI) published by the Japanese Open Access journal, Analytical Sciences. The fact that Analytical Sciences is an Open Access journal has made a world of difference in our search.
Although this might seem like the perfect solution, recall that the goal of the experiment was to locate all scientifically-relevant online content relating to the molecule. The technique we just used is most likely to succeed when we want specific information about molecules with a single trivial name. Even then, many resources may not cite a trivial name at all.
The Real World
Our options are even more limited when it comes to comprehensively text-searching even the simplest molecules lacking a widely-used trivial name. For example, consider the molecule represented by the systematic name '3-phenyl-2-methylpropene.'

If we were using a proprietary system such as those offered by Chemical Abstracts Service (CAS), we could simply enter the structure into a client and read off our results. This works because CAS isn't matching text when a query is submitted. Instead, it's matching molecular structures that have previously been encoded by both humans and machines.
The minute we step out of the orderly system created by CAS and into the chaos of the Internet, we confront a thorny problem. In practice, there are only two widely-used methods to convert a molecular structure diagram into a form that can be text-searched, and each has major limitations:
IUPAC Nomenclature: This method has the advantage of being Open. It suffers from the complexity of its encoding rules, resulting in a variety of nonstandard implementations. As a result, it's possible to find multiple phrasings of the same IUPAC name, reducing its use as a unique identifier.
CAS Numbers: This method replaces a standard encoding system with a central registration authority. The advantage is that the representation of the identifier itself is unambiguous. The disadvantage is that the meaning of a CAS Number can only be known by referring to a registration authority. Unfortunately, the current business model for CAS is based on restricting information flow, rather than promoting it.
Search by IUPAC Name
Let's try searching Google for IUPAC nomenclature. Entering '3-phenyl-2-methylpropene' produces a results page containing three unique entries. One of them links to a database run from the University of Hamburg. I gather the purpose of this database, which I was unaware of before doing this search, is to link to Landolt-Börnstein Online, a collection of numerical data. Interestingly, a search for data on a single compound turned into the discovery of yet another free chemistry database.
The remaining hits from the Google search linked to two pages (pdf, pdf) from ACS journals. The ACS routinely makes the first page of its articles a free download. It's interesting to note that these were the only ACS hits returned by Google.
We've exhausted the possibilities with our chosen IUPAC name and Google. But before moving on to searching by CAS Number, we need to solve a problem. How do we get the CAS number for our compound?
Finding a CAS Number: A PubChem Detour
Concisely summarizing what PubChem does can be difficult because different users will emphasize different aspects of its design. For our purposes, PubChem probably contains a Web page describing the compound we've been researching, and on that page may be a CAS number.
Submitting our molecule to PubChem's search page produces one result. Fortunately, this page lists our compound's CAS number: 3290-53-7.
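Before pasting a CAS number into a search engine, it's worth confirming that it was copied correctly. The last digit of a CAS Registry Number is a check digit: weight the remaining digits 1, 2, 3, ... from right to left, add up the products, and take the result modulo 10. Here is a minimal sketch of that check; the function name is my own. Note that it only catches transcription errors. It says nothing about which structure the number stands for, which still requires the registry.

```python
import re


def cas_check_digit_ok(cas: str) -> bool:
    """Return True if the string looks like a CAS number with a valid check digit."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas.strip())
    if not match:
        return False
    body = match.group(1) + match.group(2)
    check = int(match.group(3))
    # Rightmost digit of the body gets weight 1, the next gets 2, and so on.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(body), start=1))
    return total % 10 == check


if __name__ == "__main__":
    # 3290-53-7: 3*1 + 5*2 + 0*3 + 9*4 + 2*5 + 3*6 = 77, and 77 mod 10 = 7
    print(cas_check_digit_ok("3290-53-7"))
```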
Search by CAS Number
Submitting our CAS number to Google produces sixteen results. The first two link to the Landolt-Börnstein pages. The next result links to a product listing page for a chemical supplier.
The fourth result links to a far more interesting page - the Organic Syntheses website. Fortunately for us, Organic Syntheses makes its contents freely available online. Following the link takes us to a preparation in which one of the reagents can be substituted with our molecule of interest. Further down in this page, we can see that this molecule has been further cross-referenced. Two procedures are listed, one of which is new to us. Following the link, we find a complete synthetic procedure with full characterization and primary literature citations. Jackpot.
Organic Syntheses permits free public access, but is it Open Access? Many would say not, due to the fact that it retains full copyright to its works and doesn't permit free redistribution. The distinction mainly matters to those seeking to create Open Data repositories based on the contents of periodicals such as Organic Syntheses. To an end user, however, the distinction matters little in the short run.
The remaining results from our Google search are interesting, but mainly consist of chemical supplier catalogs. It should, however, be noted that all of the results returned by a Google search of our CAS number were relevant to our molecule of interest.
InChI
In an effort to overcome the limitations of CAS Registry Numbers and IUPAC systematic nomenclature as unique molecular identifiers, a new system has recently been introduced, the IUPAC International Chemical Identifier (InChI). In contrast to a CAS number, an InChI can be assigned independently of a central authority. Like systematic nomenclature, an InChI can be decoded to a molecular representation. Unlike IUPAC systematic nomenclature, an InChI is generated by a computer algorithm far too complicated for human use. The developers of the InChI software have released their work under an Open Source license, promoting its widespread use by ensuring that services like PubChem will have no difficulties integrating InChI with their software infrastructures. Unlike either CAS Numbers or IUPAC names, InChIs are not yet in widespread use, a point which currently limits their utility.
The PubChem page for our search molecule listed an InChI, as do all PubChem Compound Summary pages. As shown by Peter Murray-Rust and others, it is perfectly feasible to use Google to search for InChIs. Let's try.
Submitting our InChI query to Google gives no results. Leaving off the leading 'InChI=' text, as briefly mentioned here, also results in no hits. This tells us that Google has found no instances of our InChI, and that Google still does not crawl PubChem Compound Summary pages.
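For readers who haven't looked closely at one, an InChI is a single line of text built from slash-separated layers, which is what makes it possible to paste into a search box at all. The sketch below splits an InChI into its layers and builds the two query strings tried above, with and without the leading 'InChI=' text. It uses ethanol as a stand-in, written in the current standard form (the "1S" version prefix), rather than our search molecule. The function names are my own, and wrapping the query in double quotes for an exact match is an assumption about search engine syntax.

```python
from typing import Dict, List, Tuple

PREFIX = "InChI="

# Ethanol, used here purely as an illustration.
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"


def strip_prefix(inchi: str) -> str:
    """Remove the leading 'InChI=' text if it is present."""
    return inchi[len(PREFIX):] if inchi.startswith(PREFIX) else inchi


def split_inchi(inchi: str) -> Dict[str, object]:
    """Break an InChI into its version, formula, and prefixed layers."""
    parts = strip_prefix(inchi).split("/")
    version = parts[0]
    formula = parts[1] if len(parts) > 1 else ""
    # Remaining layers start with a one-letter prefix, e.g. 'c' for connectivity
    # and 'h' for hydrogens. A list keeps their order and tolerates repeats.
    layers: List[Tuple[str, str]] = [(p[0], p[1:]) for p in parts[2:] if p]
    return {"version": version, "formula": formula, "layers": layers}


def search_variants(inchi: str) -> List[str]:
    """Exact-phrase queries with and without the 'InChI=' prefix."""
    bare = strip_prefix(inchi)
    return ['"' + PREFIX + bare + '"', '"' + bare + '"']


if __name__ == "__main__":
    print(split_inchi(ETHANOL))
    for query in search_variants(ETHANOL):
        print(query)
```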
Use a Free Database
Numerous free chemistry databases are now running on the Internet. For example, a recent article highlighted thirty-two of them. Would one of them be useful to our search? We need to ask ourselves if we really want to perform more than thirty individual searches. What if we were looking for data on several molecules? Nothing would prevent us from doing this in theory, but in practice, this is too much work.
What we'd really like is to submit a structure query to a single service that will query all of these free databases for us. While such services do exist in name, their breadth is restricted. A more comprehensive solution would be very helpful indeed.
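To make the idea concrete, here is a rough sketch of what the core of such a service might look like: one query fanned out to several databases at once, with the answers collected under each database's name. Everything here is hypothetical. The stub functions stand in for real database clients, the names and the example.org URL are placeholders, and a real service would also need structure-aware queries rather than plain strings.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# A "source" is any function that takes a query and returns a list of hits.
SearchFn = Callable[[str], List[str]]


def federated_search(query: str, sources: Dict[str, SearchFn]) -> Dict[str, List[str]]:
    """Send one query to every source in parallel and gather the results."""
    results: Dict[str, List[str]] = {}
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in sources.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception:
                # One unreachable database shouldn't sink the whole search.
                results[name] = []
    return results


def stub_database_a(query: str) -> List[str]:
    """Hypothetical source that knows about a single CAS number."""
    known = {"3290-53-7": ["https://example.org/a/3290-53-7"]}
    return known.get(query, [])


def stub_database_b(query: str) -> List[str]:
    """Hypothetical source that is currently offline."""
    raise RuntimeError("source offline")


if __name__ == "__main__":
    sources = {"a": stub_database_a, "b": stub_database_b}
    print(federated_search("3290-53-7", sources))
```

The hard part isn't the fan-out, which is a few lines. It's getting thirty-odd independently run databases to accept the same kind of query and return comparable results.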
Conclusions
The Web's convenience and ubiquity have prompted many calls for greater Web accessibility to public chemical information. As hinted at by the examples in this article, Open Source, Open Data, and Open Access are three interrelated forces that can make this vision a reality. Open Access journals lower the economic barriers to compiling Open Data sources. Making these Open Data sources useful to scientists in a cost-effective way requires Open Source software. The availability of good Open Source software stimulates the creative combination of Open Data sources. And so on.
A lot needs to be done before this positive feedback loop can replace the status quo. But even with the chaotic, balkanized system that now exists, the benefits are plain to see. With even a small amount of coordination among Open Source software developers, Open Data providers, and scientific publishers, the most amazing things could happen.
Making the Case: Topological Maximum Cross Correlation
... For the Gasteiger partial charges, we took maximum values for positive and negative charges from the “fragmentlike” subset of the ZINC database, consisting of 49 134[sic] molecules, carrying out the calculation with Open Babel 2.0.0.
...
... All structure handling, atom typing, and descriptor calculation was carried out using the open source Java library JOELib.
...
Source code (in Java) to generate the TMACC descriptors is freely available from our Web site under the GNU General Public License at http://comp.chem.nottingham.ac.uk/download/tmacc/index.html.
-James Melville and Jonathan Hirst, J. Chem. Inf. Model.
Science happens when the experiments and conclusions of your fellow scientists can be freely questioned and independently verified. For example, readers of the cited paper may have questions about the assumptions in the TMACC method, or how to implement it. Questions may be raised about the suitability of the data set used and how others would perform. Readers may even have questions about how to extend TMACC to areas not considered by the authors.
By basing their work on open source software and open data, and by releasing their reference implementation as open source, Melville and Hirst raise their work to the level of science. The questions that any reasonable scientist would have about the work described in the paper can be answered at any desired level of detail because all source code and all test data are freely available.
Why don't all authors adopt the same approach? Why doesn't a flagship journal such as J. Chem. Inf. Model. require it of all manuscript submissions? As far back as 1984, John Figueras was making this case. Thankfully, Melville and Hirst are taking the message seriously.
From Famine to Feast: A Bumper Crop of Free Chemistry Databases
"Until PubChem came on the scene, the state of chemoinformatics compared to bioinformatics was 20 years behind," says Christopher Lipinski, who formulated the eponymous rule-of-five criteria for drug bioavailability.
-Monya Baker, Nature Reviews Drug Discovery
The number of free chemistry databases on the Web just keeps growing. A recent Depth-First article discussed twelve of them. It turns out that Chembiogrid from Indiana University maintains a list of forty free chemistry databases, most of which are alive and well.
As this trend continues, the need for database standards will become painfully obvious. Not only will interoperable infrastructure technologies and user interface standards need to be developed, but thorny intellectual property issues including access, chain of title, and digital rights will need to be resolved. However, the most immediate need is much more down-to-earth: to involve chemists with the growing number of free alternatives to the chemical information monopoly they've come to rely on.
Open Source and Open Data: Why We Should Eat Our Own Dogfood
The National Institutes of Health (NIH) has decided to use OpenEye Scientific Software's cheminformatics toolkits to provide key infrastructure for PubChem, a database of small organic molecules containing chemical structure and biological activities information. PubChem is being developed by the National Center for Biotechnology Information (NCBI) as part of the Molecular Libraries and Imaging component of the NIH Roadmap Initiative. "I am excited to see our software built into PubChem," says Roger Sayle, OpenEye's Vice President of Bioinformatics. "It's gratifying that our software will be part of such a useful public resource."
...
Along with the recent decision by the Research Collaboratory for Structural Bioinformatics (RCSB) to use OpenEye's cheminformatics toolkits to curate and depict the Protein Data Bank (PDB) ligand dictionary, the NIH's decision is a clear indication of the speed and robustness of OpenEye's technology for large and diverse sets of chemical structures and data. "It has been beneficial working with the NCBI for their project," says Sayle. "Their data includes enough unusual chemistry to make it a nice validation of the software beyond our regular test sets."
-OpenEye Scientific Software Press Release - October 12, 2004
Why did PubChem, the granddaddy of all open chemistry databases, choose a closed, proprietary toolkit for its software infrastructure? A recent Depth-First article highlighted twelve free chemistry databases. Of those for which information is available, many have chosen the same path as PubChem. Why is this?
A huge opportunity is being wasted every time this happens. We, the authors of Open Source software packages, could be working with the architects of Open Data systems to solve their problems in ways that vendors of closed systems can't. We could be using these Open Data systems as real-world proving grounds for our software, fixing bugs that would have never been detected otherwise, and pushing our systems to the limit. We could be identifying new and exciting uses for our software as the organizations we work with repeatedly ask "what if." Sadly, none of this is happening. A great deal more needs to be done by the Open Source community to persuade the Open Data community to at least try their software. The worst that can happen is that we begin to understand the appeal of closed, proprietary products.
One bright spot is NMRShiftDB, which uses the Open Source Chemistry Development Kit for its infrastructure. This is a fine example of Open Source software powering an Open Data source in chemistry. More examples of this kind of Open Source/Open Data symbiosis would go a long way toward making the case.
Eating your own dogfood is an effective way to break into new markets and develop truly competitive products. After all, if those with closely-aligned goals won't use what you have to offer, who else will?


