Forty-Eight Free QSAR Datasets (and More)
Whether you're a medicinal chemist or an informatician, QSAR datasets can be very helpful in understanding complex biological phenomena. These datasets typically consist of a hundred or fewer compounds associated with a specific parameter such as intestinal absorption, volume of distribution, blood-brain barrier penetration, or activity at one or more biological targets. Most of them are published as part of a paper appearing in a peer-reviewed journal.
Unlike chemistry databases, which typically combine a dataset of thousands or millions of compounds with a search engine and a user interface, a QSAR dataset is much more focused and raw. You need to supply your own data viewer, report generator, and query tool.
The Internet hosts a bewildering assortment of QSAR datasets tucked into various nooks and crannies. The problem is finding them. One useful resource is cheminformatics.org, which hosts a page linking to forty-four datasets.
Recently, Shaillay Kumar Dogra, Scientific Editor of QSARWorld, wrote in to let me know about the site's offering of forty-eight free QSAR datasets. Each dataset is linked to the primary literature and is available in four formats, including SD File. In contrast to many datasets, those at QSARWorld are manually curated. QSARWorld is also actively seeking new datasets to convert into machine-readable form; if you find one, write to them to have it added to the collection.
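Because a raw dataset arrives without a viewer or query tool, a first look at an SD File can be had with standard shell utilities. The sketch below relies only on two features of the SD File format: each record ends with a line beginning with `$$$$`, and each data item is introduced by a header line of the form `> <FieldName>`. The function names `sdf_count` and `sdf_fields` are made up for illustration and are not part of any QSARWorld tooling.

```shell
# Count the records in an SD File.
# Each record is terminated by a line beginning with "$$$$".
sdf_count() {
  grep -c '^\$\$\$\$' "$1"
}

# List the distinct data field names in an SD File.
# Data headers start with ">" and carry the name in angle brackets.
sdf_fields() {
  sed -n 's/^>.*<\([^>]*\)>.*$/\1/p' "$1" | sort -u
}
```

Running `sdf_count dataset.sdf` followed by `sdf_fields dataset.sdf` (with `dataset.sdf` standing in for whatever file you downloaded) gives the number of compounds and the names of the measured parameters, which is often enough to decide whether a dataset is worth loading into a real tool.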
Systematic efforts to collect, curate, and distribute raw data from the primary literature are long overdue. QSARWorld offers an intriguing model for doing so. Although QSARWorld doesn't yet appear to have addressed some non-scientific issues, such as intellectual property rights, the site's machine-readable raw data offers plenty of food for thought to anyone working with QSAR.
What's your favorite dataset resource?
Image Credit: B.G. Lewandowski
How Would Your Cheminformatics Tool Do This?

Reference: Zapata, Caballero, Espinosa, Tarraga, and Molina - Org. Lett.
Signal to Noise and the Chemistry Blog
Chemistry World is running an article in its December issue titled Surfing Web2O that briefly touches on the subject of chemistry blogs. From analysis to commentary to news gathering, blogging is changing the way large numbers of people relate to each other and the world around them. Why should chemistry be immune to this phenomenon?
One thing that is clear is that scientific blogging, in contrast to traditional scientific publication, is a much more fluid and engaging medium. Roald Hoffmann, in his recent Boston ACS talk, used the term "ossified" to describe the current state of chemistry publication. Although he went on to talk about how Angewandte Chemie was different, for a split second I thought he might start talking about chemistry blogs.
Every new medium has its problems - and chemistry blogging is no exception. First, there's the credibility problem - the perception that the information content of chemistry blogs is somehow innately lower than print journals (a problem that every new medium faces). But beyond this are the much larger problems of understanding how this new medium works, what it can offer you as a participant, and what you might be giving up by participating.
Recent Depth-First articles have touched on some of these subjects:
Self Referential - One of the least obvious side-effects of blogging is that you make it onto Google's radar - big time. How valuable would it be to 'own' the top search terms in your field?
Advice to Job Seekers From C&E News: Blog Thyself - Getting a job: the killer app for scientific blogging?
Thinking of Starting an Anonymous Science Blog? Five Reasons to Think Again - It's a small, small world.
Ten Things That Surprised Me About Blogging - Title says it all.
Go West Young Man: Does Open Access Really Matter in the Long Run? - Why the future of scientific publication may look a lot more like Google, Digg, and Feedburner and a lot less like the ACS.
You may be curious about starting a chemistry blog of your own, but what makes a good one? There are dozens of styles that seem to work, but for me the key qualities come down to a clear purpose (high signal-to-noise ratio), consistency, and attention to detail. Here are some (but by no means all) that I think work especially well:
Chemical Blogspace - This isn't actually a chemistry blog, but rather a chemistry blog aggregator run by Egon Willighagen. Whether you're new to chemistry blogs or not, this is essential reading.
Molecule of the Day - To the point and always on topic.
In the Pipeline - One of the first, and best, in the field. Proof that blogging and working in industry are not incompatible.
Kinase Pro - Analyzing the Kinase patent literature one day at a time.
Computational Organic Chemistry - Companion to a book on the same subject.
Drugs and Poisons - Entertaining, informative, and always on topic.
The Half Decent Pharmaceutical Chemistry Blog - Three words: Saturday Night Synthesis.
Sigma-Aldrich's ChemBlogs - Proof that scientific product marketing can be much more than it currently is. Also see this article.
A Synthetic Environment - Top five lists galore, the history of chemistry, and always something unexpected.
University of Ottawa NMR Facility Blog - Short writeups on NMR.
Totally Synthetic - Chemistry blogs can continue the scientific discussion in real time after a paper has been published. Totally Synthetic offers an excellent model for doing this.
Carbon-Based Curiosities - Chemistry isn't supposed to be that much fun, is it?
New media never succeed by trying to imitate the content or form of established media; they succeed by doing what established media can't. The same is true for chemistry blogging. The established peer-review, publisher-controlled system of scientific communication does many things poorly. Look to blog-like online chemical resources to exploit these weaknesses and thrive.
Image Credit: altemark
From C Source Code to Platform-Independent Executable Jarfile: Using NestedVM to Build JInChI
A recent series of articles discussed in some detail the process of compiling source code written in C and C++ to pure Java bytecode with NestedVM. But the full conversion process, starting with source and finishing with an executable jarfile, has to my knowledge never been documented. This article uses the InChI toolkit to illustrate the complete process for converting a real-world C source distribution into a platform-independent, executable jarfile that can be run with any modern Java Virtual Machine (JVM).
About InChI
The previous article in this series introduced JInChI, the first and only pure Java implementation of the IUPAC/NIST InChI toolkit. This toolkit is used to convert molecular connection tables encoded in MDL's SD File format into ASCII character strings called 'InChIs' that have a variety of applications in the field of cheminformatics. Although an excellent JNI-InChI interface is available, JNI won't be a viable option in every situation. Our pure Java implementation nicely complements the JNI-InChI library.
In this tutorial, we'll build version 1.0.2b of the InChI toolkit. This version, among other features, supports the generation of InChI Keys.
Prerequisites
This article assumes you've already installed NestedVM on your system. Building NestedVM required the installation of many dependencies and was a fairly lengthy, but straightforward, process on my Linux system.
Step 1: Prepare Your Environment
Before building anything, we'll need to set up our environment. NestedVM makes this simple:
$ cd /your/path/to/nestedvm/
$ source env.sh
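Before going further, it can save time to confirm that the tools used later in this article are actually reachable from the current shell. The sketch below assumes that sourcing env.sh puts the cross-compiler on your PATH; the helper name `check_tools` is hypothetical, and the only tool names taken from this article are mips-unknown-elf-gcc and java.

```shell
# Report whether each named tool can be found on PATH.
# Returns non-zero if at least one tool is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}
```

Calling `check_tools mips-unknown-elf-gcc java` right after sourcing env.sh should list both as found; if the cross-compiler is reported missing, the make step below will fail, so it's worth fixing the environment first.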
Next, let's create a directory to hold the various components we'll need during the build process:
$ cd /your/projects/directory
$ mkdir jinchi
$ cd jinchi
Next, we'll download and unpack the InChI source distribution:
$ wget http://www.iupac.org/inchi/download/inchi102b.zip
$ unzip inchi102b.zip
Step 2: Cross-Compile InChI
We now have everything we need to begin cross-compiling. NestedVM uses a two-part process in which source code is first cross-compiled to a MIPS binary. That MIPS binary is then translated to Java bytecode. We start by changing into the gcc_makefile directory of the unpacked distribution and invoking make with the appropriate cross-compiler flags (which I found by looking through the InChI Makefile):
$ cd InChI-1-software-1-02-beta/cInChI/gcc_makefile
$ make C_COMPILER=mips-unknown-elf-gcc LINKER=mips-unknown-elf-gcc
This creates a MIPS binary (cInChI-1). Unless you're running on a MIPS machine, this binary won't be executable.
$ ./cInChI-1
bash: ./cInChI-1: cannot execute binary file
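If you'd like more evidence than a failed launch, the binary's ELF header records its target architecture. The sketch below reads the first 20 bytes of a file and prints the e_machine field, which is 8 for MIPS in the ELF specification. The function name `elf_machine` is made up for this example; the standard `file` utility reports the same information in friendlier form.

```shell
# Print the ELF e_machine value (decimal) for a file.
# Returns non-zero if the file is too short or lacks the ELF magic number.
elf_machine() {
  # Read the first 20 bytes as unsigned decimal values, one per positional parameter.
  set -- $(od -An -v -tu1 -N20 "$1" 2>/dev/null)
  [ "$#" -eq 20 ] || return 1
  # Bytes 0-3 hold the magic number: 0x7f 'E' 'L' 'F'.
  [ "$1" = 127 ] && [ "$2" = 69 ] && [ "$3" = 76 ] && [ "$4" = 70 ] || return 1
  # Byte 5 gives the byte order (2 = big-endian); bytes 18-19 hold e_machine.
  if [ "$6" = 2 ]; then
    echo $(( ${19} * 256 + ${20} ))
  else
    echo $(( ${20} * 256 + ${19} ))
  fi
}
```

Running `elf_machine cInChI-1` on the freshly built binary should print 8 if the cross-compiler did its job, while a native x86-64 Linux executable would report 62.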
We can now translate the MIPS binary into pure Java bytecode:
$ java org.ibex.nestedvm.Compiler -outfile JInChI.class JInChI cInChI-1
This produces a Java class file:
$ ll JInChI.class
-rw-r--r-- 1 rich rich 4372362 Nov 30 08:27 JInChI.class
We can verify that the classfile has been compiled correctly by running it:
$ java JInChI
InChI ver 1, Software version 1.02-beta August 2007.
Usage: cInChI-1 inputFile [outputFile [logFile [problemFile]]] [-option[ -option...]]
Options:
SNon Exclude stereo (Default: Include Absolute stereo)
SRel Relative stereo
-- truncated --
We have now done something truly remarkable: we've taken a standard C source code distribution and converted it into an executable Java class file. It runs, but only because the NestedVM runtime is on our classpath (thanks to the source command we used at the beginning of the process).
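To see which classpath entry is supplying the runtime, you can walk the CLASSPATH variable and look for the NestedVM package directory. This is a minimal sketch: the function name `classpath_dirs_with` is hypothetical, it inspects directory entries only (not jarfiles), and the package path org/ibex/nestedvm is inferred from the compiler class name used above.

```shell
# Print every CLASSPATH entry that is a directory containing the given package path.
classpath_dirs_with() {
  pkg_path=$1
  old_ifs=$IFS
  IFS=:                       # CLASSPATH entries are colon-separated on Unix-like systems
  for entry in $CLASSPATH; do
    if [ -d "$entry/$pkg_path" ]; then
      echo "$entry"
    fi
  done
  IFS=$old_ifs
}
```

With the NestedVM environment loaded, `classpath_dirs_with org/ibex/nestedvm` should print the NestedVM build directory. In a fresh shell it prints nothing, which is exactly why the bare class file can't be handed to someone else as-is.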
What we really want is a self-contained, executable jarfile that can be run, unmodified, on any system with Java installed.
Step 3: Build the JInChI Jarfile
We begin by moving up to the root directory of our jinchi project, creating a new directory to hold our Java-specific files (the JInChI.class file and the NestedVM runtime), and copying them into it:
$ cd ../../..
$ mkdir jinchi-1.0.2b.1
$ mv InChI-1-software-1-02-beta/cInChI/gcc_makefile/JInChI.class jinchi-1.0.2b.1/
$ cp -r /your/path/to/nestedvm/build/org/ jinchi-1.0.2b.1
An executable jarfile generally needs a manifest to point to the main execution class. One way to do that is to first create a manifest:
$ vi jinchi-1.0.2b.1/MANIFEST.MF
It's essential that this file end with a newline.
$ cat jinchi-1.0.2b.1/MANIFEST.MF
Main-Class: JInChI
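If you'd rather not depend on an editor's behavior, the manifest can also be written non-interactively. The sketch below uses printf, which emits the trailing newline explicitly, and adds a small check that a file's last byte really is a newline. Both function names are invented for this example.

```shell
# Write a minimal manifest to file $1 naming $2 as the main class.
# The explicit \n guarantees that the final line is newline-terminated.
write_manifest() {
  printf 'Main-Class: %s\n' "$2" > "$1"
}

# Succeed only if the file is non-empty and its last byte is a newline.
# Command substitution strips a trailing newline, leaving an empty string.
ends_with_newline() {
  [ -s "$1" ] && [ -z "$(tail -c 1 "$1")" ]
}
```

For this project, that would be `write_manifest jinchi-1.0.2b.1/MANIFEST.MF JInChI`, followed by `ends_with_newline jinchi-1.0.2b.1/MANIFEST.MF && echo ok` as a sanity check before building the jarfile.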
With everything in place, we can create the jarfile:
$ cd jinchi-1.0.2b.1/
$ ls
JInChI.class MANIFEST.MF org/
$ jar -cfm jinchi-1.0.2b.1.jar MANIFEST.MF *
$ ls
jinchi-1.0.2b.1.jar JInChI.class MANIFEST.MF org/
We've successfully converted standard C source code into a platform-independent executable jarfile. But does it work?
Step 4: Test JInChI
We can confirm that the process has worked by running the jarfile (you should do this in a new shell session to verify that the jarfile is indeed independent of your NestedVM installation).
$ java -jar jinchi-1.0.2b.1.jar
InChI ver 1, Software version 1.02-beta August 2007.
Usage: cInChI-1 inputFile [outputFile [logFile [problemFile]]] [-option[ -option...]]
Options:
SNon Exclude stereo (Default: Include Absolute stereo)
SRel Relative stereo
That's all there is to it! Your shiny new jarfile can be run on any system with a JVM installed. The one created here has been successfully tested on Mac OS X, Linux, and Windows.
If you'd prefer to download the JInChI jarfile, it can be obtained from SourceForge.
Conclusions
This article has illustrated in detail the process of converting a standard C source distribution into a platform-independent executable jarfile. Given the appropriate MIPS cross-compiler (many of which come with the NestedVM distribution), the same process can be repeated with code written in a variety of other languages.
You may be wondering what kind of performance hit you can expect with the approach outlined here. After all, we'd be comparing a native binary to something running on top of two abstraction layers: the NestedVM runtime and a JVM. It's not as bad as you might think, but that's a story for another time.
Image Credit: smithco


