Forty-Eight Free QSAR Datasets (and More)

Posted by Rich Apodaca Thu, 06 Dec 2007 15:20:00 GMT

Whether you're a medicinal chemist or an informatician, QSAR datasets can be very helpful in understanding complex biological phenomena. These datasets typically consist of a hundred or fewer compounds associated with a specific parameter such as intestinal absorption, volume of distribution, blood-brain barrier penetration, or activity at one or more biological targets. Most of them are published as part of a paper appearing in a peer-reviewed journal.

Unlike chemistry databases, which typically combine a dataset of thousands or millions of compounds with a search engine and a user interface, a QSAR dataset is much more focused and raw. You need to supply your own data viewer, report generator, and query tool.

The Internet hosts a bewildering assortment of QSAR datasets tucked into various nooks and crannies. The problem is finding them. One useful resource is cheminformatics.org, which hosts a page linking to forty-four datasets.

Recently, Shaillay Kumar Dogra, Scientific Editor of QSARWorld, wrote in to let me know about the site's offering of forty-eight free QSAR datasets. Each dataset is linked to the primary literature and is available in four formats, including SD File. In contrast to many datasets, those at QSARWorld are manually curated. QSARWorld is also actively seeking new datasets to convert into machine-readable form; if you find one, write to them to have it added to the collection.
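
Because these datasets are distributed as plain SD Files, even standard Unix tools can serve as a rough-and-ready query tool. The following is a minimal sketch rather than anything QSARWorld provides: the function names and the filename dataset.sdf are hypothetical. It relies only on two features of the SD File format, namely that each record ends with a line of $$$$ and that data-field headers look like > <NAME>.

```shell
#!/bin/sh
# Hypothetical helpers for a quick look at an SD File.

# Count the records (compounds): each one ends with a "$$$$" line.
sdf_count() {
    grep -c '^\$\$\$\$' "$1"
}

# List the distinct data-field names found on "> <NAME>" header lines.
sdf_fields() {
    sed -n 's/^>.*<\([^<>]*\)>.*$/\1/p' "$1" | sort -u
}

# dataset.sdf is a placeholder for whichever file you downloaded.
if [ -f dataset.sdf ]; then
    echo "compounds: $(sdf_count dataset.sdf)"
    echo "fields:"
    sdf_fields dataset.sdf
fi
```

Anything beyond counting and listing fields, such as substructure search, calls for a real cheminformatics toolkit.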

Systematic efforts to collect, curate, and distribute raw data from the primary literature are long overdue. QSARWorld offers an intriguing model for doing so. Although QSARWorld doesn't yet appear to have addressed some non-scientific issues, such as intellectual property rights, the site's machine-readable raw data provides plenty of food for thought to anyone working with QSAR.

What's your favorite dataset resource?

Image Credit: B.G. Lewandowski

How Would Your Cheminformatics Tool Do This?

Posted by Rich Apodaca Wed, 05 Dec 2007 13:44:00 GMT

Reference: Zapata, Caballero, Espinosa, Tarraga, and Molina - Org. Lett.

Signal to Noise and the Chemistry Blog

Posted by Rich Apodaca Tue, 04 Dec 2007 16:53:00 GMT

Chemistry World is running an article in its December issue titled Surfing Web2O that briefly touches on the subject of chemistry blogs. From analysis to commentary to news gathering, blogging is changing the way large numbers of people relate to each other and the world around them. Why should chemistry be immune to this phenomenon?

One thing that is clear is that scientific blogging, in contrast to traditional scientific publication, is a much more fluid and engaging medium. Roald Hoffmann, in his recent Boston ACS talk, used the term "ossified" to describe the current state of chemistry publication. Although he went on to talk about how Angewandte Chemie was different, for a split second I thought he might start talking about chemistry blogs.

Every new medium has its problems - and chemistry blogging is no exception. First, there's the credibility problem - the perception that the information content of chemistry blogs is somehow innately lower than that of print journals (a problem that every new medium faces). But beyond this are the much larger problems of understanding how this new medium works, what it can offer you as a participant, and what you might be giving up by participating.

Recent Depth-First articles have touched on some of these subjects:

You may be curious about starting a chemistry blog of your own, but what makes a good one? There are dozens of styles that seem to work, but for me the key qualities come down to a clear purpose (high signal-to-noise ratio), consistency, and attention to detail. Here are some (but by no means all) that I think work especially well:

New media never succeed by trying to imitate the content or form of established media; they succeed by doing what established media can't. The same is true for chemistry blogging. The established peer-review, publisher-controlled system of scientific communication does many things poorly. Look to blog-like online chemical resources to exploit these weaknesses and thrive.

Image Credit: altemark

From C Source Code to Platform-Independent Executable Jarfile: Using NestedVM to Build JInChI

Posted by Rich Apodaca Mon, 03 Dec 2007 13:42:00 GMT

A recent series of articles discussed in some detail the process of compiling source code written in C and C++ to pure Java bytecode with NestedVM. But the full conversion process, starting with source and finishing with an executable jarfile, has to my knowledge never been documented. This article uses the InChI toolkit to illustrate the complete process for converting a real-world C source distribution into a platform-independent, executable jarfile that can be run with any modern Java Virtual Machine (JVM).

About InChI

The previous article in this series introduced JInChI, the first and only pure Java implementation of the IUPAC/NIST InChI toolkit. This toolkit is used to convert molecular connection tables encoded in MDL's SD File format into ASCII character strings called 'InChIs' that have a variety of applications in the field of cheminformatics. Although an excellent JNI-InChI interface is available, JNI won't be a viable option in every situation. Our pure Java implementation nicely complements the JNI-InChI library.

In this tutorial, we'll build version 1.0.2b of the InChI toolkit. This version, among other features, supports the generation of InChI Keys.

Prerequisites

This article assumes you've already installed NestedVM on your system. Building NestedVM required the installation of many dependencies and was a fairly lengthy, but straightforward, process on my Linux system.

Step 1: Prepare Your Environment

Before building anything, we'll need to set up our environment. NestedVM makes this simple:

$ cd /your/path/to/nestedvm/
$ source env.sh

Next, let's create a directory to hold the various components we'll need during the build process:

$ cd /your/projects/directory
$ mkdir jinchi
$ cd jinchi

Next, we'll download and unpack the InChI source distribution:

$ wget http://www.iupac.org/inchi/download/inchi102b.zip
$ unzip inchi102b.zip

Step 2: Cross-Compile InChI

We now have everything we need to begin cross-compiling. NestedVM uses a two-part process in which source code is first cross-compiled to a MIPS binary. That MIPS binary is then translated to Java bytecode. We start by changing into the directory that holds the gcc makefile and invoking make with the appropriate cross-compiler flags (which I found by looking through the InChI Makefile):

$ cd InChI-1-software-1-02-beta/cInChI/gcc_makefile
$ make C_COMPILER=mips-unknown-elf-gcc LINKER=mips-unknown-elf-gcc

This creates a MIPS binary (cInChI-1). Unless you're running on a MIPS machine, this binary won't be executable.

$ ./cInChI-1
bash: ./cInChI-1: cannot execute binary file
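
The error message doesn't say why the binary won't run. One way to confirm that make really produced a MIPS executable is to read the machine field from the file's ELF header. This is a minimal sketch, not part of the NestedVM workflow: the function name elf_machine is my own, and it assumes a POSIX od. It relies on the ELF layout: a four-byte magic number, a byte-order flag at offset 5, and a two-byte machine code at offset 18, where the value 8 denotes MIPS.

```shell
#!/bin/sh
# Hypothetical helper: print the ELF e_machine code of the file named in $1.
# Returns non-zero if the file does not start with the ELF magic number.
elf_machine() {
    elf_file=$1
    # Bytes 0-3 must be 0x7f 'E' 'L' 'F'.
    magic=$(od -An -tx1 -N4 "$elf_file" | tr -d ' \n')
    [ "$magic" = "7f454c46" ] || return 1
    # Byte 5 gives the byte order: 1 = little-endian, 2 = big-endian.
    order=$(od -An -tu1 -j5 -N1 "$elf_file" | tr -d ' \n')
    # Bytes 18-19 hold e_machine; split them into $1 and $2.
    set -- $(od -An -tu1 -j18 -N2 "$elf_file")
    if [ "$order" = "2" ]; then
        echo $(( $1 * 256 + $2 ))
    else
        echo $(( $2 * 256 + $1 ))
    fi
}

# cInChI-1 is the binary built above; 8 is the ELF code for MIPS.
if [ -f cInChI-1 ]; then
    if [ "$(elf_machine cInChI-1)" = "8" ]; then
        echo "cInChI-1 is a MIPS binary"
    else
        echo "cInChI-1 is not a MIPS binary"
    fi
fi
```

If the file utility is installed, running it on cInChI-1 reports the same information in a more readable form.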

We can now translate the MIPS binary into pure Java bytecode:

$ java org.ibex.nestedvm.Compiler -outfile JInChI.class JInChI cInChI-1

This produces a Java class file:

$ ls -l JInChI.class
-rw-r--r-- 1 rich rich 4372362 Nov 30 08:27 JInChI.class

We can verify that the classfile has been compiled correctly by running it:

$ java JInChI
InChI ver 1, Software version 1.02-beta August 2007.

Usage:
cInChI-1 inputFile [outputFile [logFile [problemFile]]] [-option[ -option...]]

Options:
  SNon        Exclude stereo (Default: Include Absolute stereo)
  SRel        Relative stereo

-- truncated --

We have now done something truly remarkable: we've taken a standard C source code distribution and converted it into an executable Java class file. It runs, but only because the NestedVM runtime is on our classpath (thanks to the source command we used at the beginning of the process).

What we really want is a self-contained, executable jarfile that can be run, unmodified, on any system with Java installed.

Step 3: Build the JInChI Jarfile

We begin by moving up to the root directory of our jinchi project, creating a new directory to hold our Java-specific files (the JInChI.class file and the NestedVM runtime), and copying them into it:

$ cd ../../..
$ mkdir jinchi-1.0.2b.1
$ mv InChI-1-software-1-02-beta/cInChI/gcc_makefile/JInChI.class jinchi-1.0.2b.1/
$ cp -r /your/path/to/nestedvm/build/org/ jinchi-1.0.2b.1

An executable jarfile generally needs a manifest to point to the main execution class. One way to do that is to first create a manifest:

$ vi jinchi-1.0.2b.1/MANIFEST.MF

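It's essential that this file end with a newline; otherwise the jar tool may not parse the last line, and the Main-Class entry will be missing from the jarfile's manifest.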

$ cat jinchi-1.0.2b.1/MANIFEST.MF
Main-Class: JInChI
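
An editor isn't strictly necessary here. As a minimal sketch (the function names write_manifest and ends_with_newline are mine, not part of any tool), printf can write the manifest with an explicit trailing newline, and tail can confirm the last byte before the jarfile is built:

```shell
#!/bin/sh
# Hypothetical helper: write a one-line manifest naming the main class.
# $1 = output path, $2 = main class name.
# The explicit \n guarantees that the file ends with a newline.
write_manifest() {
    printf 'Main-Class: %s\n' "$2" > "$1"
}

# Hypothetical check: succeed only if the last byte of file $1 is a newline.
# Command substitution strips a trailing newline, so an empty result from
# tail -c 1 on a non-empty file means the last byte was a newline.
ends_with_newline() {
    [ -s "$1" ] && [ -z "$(tail -c 1 "$1")" ]
}

# Write and check the manifest if the project directory from Step 3 exists.
if [ -d jinchi-1.0.2b.1 ]; then
    write_manifest jinchi-1.0.2b.1/MANIFEST.MF JInChI
    if ends_with_newline jinchi-1.0.2b.1/MANIFEST.MF; then
        echo "manifest OK"
    fi
fi
```

Run this from the root of the jinchi project, the same directory from which the vi command above was issued.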

With everything in place, we can create the jarfile:

$ cd jinchi-1.0.2b.1/
$ ls
JInChI.class  MANIFEST.MF  org/
$ jar -cfm jinchi-1.0.2b.1.jar MANIFEST.MF *
$ ls
jinchi-1.0.2b.1.jar  JInChI.class  MANIFEST.MF  org/

We've successfully converted standard C source code into a platform-independent executable jarfile. But does it work?

Step 4: Test JInChI

We can confirm that the process has worked by running the jarfile (you should do this in a new shell session to verify that the jarfile is indeed independent of your NestedVM installation).

$ java -jar jinchi-1.0.2b.1.jar
InChI ver 1, Software version 1.02-beta August 2007.

Usage:
cInChI-1 inputFile [outputFile [logFile [problemFile]]] [-option[ -option...]]

Options:
  SNon        Exclude stereo (Default: Include Absolute stereo)
  SRel        Relative stereo
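
Opening a new shell is one way to make sure the NestedVM settings aren't helping. As an alternative sketch, the command can be run in a subshell with CLASSPATH unset. The function name without_classpath is hypothetical, and I'm assuming CLASSPATH is the only Java-related variable that matters here. (The java launcher documents that -jar ignores other class path settings anyway, so this is mostly a belt-and-braces check.)

```shell
#!/bin/sh
# Hypothetical helper: run a command in a subshell with CLASSPATH unset,
# leaving the current shell's environment untouched.
without_classpath() {
    (
        unset CLASSPATH
        "$@"
    )
}

# Run the jarfile built above, if it is in the current directory.
if [ -f jinchi-1.0.2b.1.jar ]; then
    without_classpath java -jar jinchi-1.0.2b.1.jar
fi
```

The same wrapper works for any other command you want to try without the NestedVM classpath, and your original CLASSPATH is still set once it returns.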

That's all there is to it! Your shiny new jarfile can be run on any system with a JVM installed. The one created here has been successfully tested on Mac OS X, Linux, and Windows.

If you'd prefer to download the JInChI jarfile, it can be obtained from SourceForge.

Conclusions

This article has illustrated in detail the process of converting a standard C source distribution into a platform-independent executable jarfile. Given the appropriate MIPS cross-compiler (many of which come with the NestedVM distribution), the same process can be repeated with code written in a variety of other languages.

You may be wondering what kind of performance hit you can expect with the approach outlined here. After all, we'd be comparing a native binary to something running on top of two abstraction layers: the NestedVM runtime and a JVM. It's not as bad as you might think, but that's a story for another time.

Image Credit: smithco

Casual Saturdays: Perspective

Posted by Rich Apodaca Sat, 01 Dec 2007 18:24:00 GMT
