Googling for Molecules: New and Improved InChIMatic
InChIMatic, as described previously, is a new service that lets you perform exact structure searches on the Web using Google. A new version offers searching via several other search engines and features a streamlined interface. The screenshot below shows the current search engine options with 1-bromonaphthalene in the editor window.

There are noticeable differences in the abilities of search engines other than Google to find InChIs. Google seems to offer the most complete coverage. For example, all search engines I've tried have returned either a subset or recapitulation of Google's results.
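To make the idea concrete, here's a minimal sketch of how an exact-match InChI query URL might be built for a few search engines. The engine list, the URL prefixes, and the function name inchiSearchUrl are assumptions for illustration, not InChIMatic's actual code; ethanol's InChI simply stands in for whatever has been drawn in the editor.

```javascript
// Hypothetical table of search engines; the query-string prefixes are
// assumptions for illustration, not InChIMatic's actual configuration.
const SEARCH_ENGINES = {
  google: "https://www.google.com/search?q=",
  bing: "https://www.bing.com/search?q=",
  duckduckgo: "https://duckduckgo.com/?q="
};

// Build an exact-match query URL for an InChI string.
function inchiSearchUrl(engine, inchi) {
  const base = SEARCH_ENGINES[engine];
  if (!base) {
    throw new Error("Unknown search engine: " + engine);
  }
  // Double quotes ask the engine to treat the identifier as one phrase
  // instead of splitting it at the slashes and commas.
  return base + encodeURIComponent('"' + inchi + '"');
}

// Ethanol, used here only as a simple placeholder structure.
const ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3";
console.log(inchiSearchUrl("google", ethanol));
```

Keeping the engines in a lookup table means adding another one is a one-line change, which matters when coverage differs as much as it does here.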
One of the most striking things about InChIMatic is how specific the search results are. Every molecule that has produced results for me has been a direct hit. Also notable is how few InChIs are currently indexed by Google and other search engines. Tackling that problem will require a convenient and unobtrusive way to get InChIs into Web pages and to get those pages indexed by search engines. But more on that later.
InChI Spam
Do you remember when getting email - any email - was exciting? For me, that time was 1995 and I had just found the Internet. Of course, I remember looking forward to messages from people I knew. But I also remember being blown away by the idea that I could write to anyone with an email account, anywhere in the world for essentially free - and that they could do the same. Back then, it was fun to get email, no matter what the source.
Today, spam is something that I, like millions of others, deal with on a daily basis. And it's not limited to email. Anyone who runs a blog knows about comment spam and how difficult it can be to eradicate it. Even trackback is being used as a medium for blog spam. Of course, keyword spam on the Web has been a constant problem for search engines - eliminating it has in part led to more than a few fortunes earned at companies like Google.
Recently, I introduced a small Web application called InChIMatic. It lets you conveniently do exact-structure molecular queries through popular search engines like Google. Draw your structure, click "Search" and find your matches.
There aren't a lot of InChIs visible to search engines now, as an InChIMatic query for even the most trivial molecule will reveal. Regardless of your views on InChI as a technology for bringing chemistry to the Web, it seems very likely that the number of InChIs visible to search engines will increase significantly over the next few years. And with this increase may come sites dedicated to nothing other than publishing a lot of irrelevant InChIs in the hope of attracting accidental advertising click-throughs.
Right now, searching the Web by InChIs offers a very high signal-to-noise ratio experience - not unlike email in 1995. The shysters haven't yet discovered it and nobody is counting on the technology for mission-critical work. But if and when the idea of indexing chemical content on the Web through InChIs begins to catch on, filtering tools will become essential. If this scenario seems implausible, think back to your first experience with email and how concerned you were about spam then.
Photo Credit: cobalt123
Anatomy of a Cheminformatics Web Application: Structure Cleanup in Java Molecular Editor
A very useful feature of many 2-D structure editors is a "clean" function that tidies up bond lengths and angles. Java Molecular Editor (JME) is a lightweight 2-D editor that lacks this functionality. In this article, I'll describe a small Web application called "Cleanup" that adds a "clean" function to JME through Ajax and server-side programming, rather than directly extending JME itself. The technique described here differs somewhat from that described in a previous article on adding InChI support to JME with Ajax.
Cleanup in Action
Let's say Bob needs to draw the structure of the H1 antagonist chlorpheniramine with JME. He mistakenly creates irregular bond angles at several points, but continues drawing anyway. His finished molecule looks like that shown below:

Rather than starting over to beautify his molecule, Bob simply presses the Clean Molecule button. This produces a structure with much more aesthetically pleasing atom coordinates:

If Bob needs to continue drawing at this point he can. In fact, he can press Clean Molecule as many times as he wants to clean his structure at any time. Each time he presses the button, his structure is retained within the JME window.
Download and Prerequisites
Cleanup requires Ruby on Rails and Ruby CDK. Both of these libraries can be installed using the RubyGems packaging system.
A recent article described the small amount of system configuration required for Ruby CDK on Linux. Another article showed how to install Ruby CDK on Windows.
The complete Cleanup source package can be downloaded from RubyForge. For convenience, a copy of JME is included with the distribution. The author, Peter Ertl, has kindly given permission for the bundled JME applet to be used with Cleanup. For other uses, consult the JME homepage.
Running Cleanup
After inflating the Cleanup archive, the following commands will start the server:
$ cd jme-cleanup-0.0.1
$ ruby script/server
AMD64 Linux users will need to prepend a LD_PRELOAD assignment to the script/server invocation. On my system, which uses Sun's JDK, this looks like:
$ cd jme-cleanup-0.0.1
$ LD_PRELOAD=/usr/java/jdk1.5.0_09/jre/lib/amd64/libzip.so ruby script/server
After starting the Cleanup server, pointing your browser to http://localhost:3000/editor/cleanup will run the application.
How It Works: A Web Application in Two Parts
Cleanup is a Web application consisting of two main parts - one written for a Web server, and one written for a Browser client. These two components work together to achieve an effect that, to a user, is indistinguishable from extending the JME applet with Java.
The first component consists of a small Rails application. Its single Rails action, clean_structure, accepts a Molfile encoded as form data and produces a response Molfile with re-assigned coordinates.
The second component of the Cleanup application is written in JavaScript and executed from within the Browser. Let's take a look:
<script language="JavaScript">
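/*
 * Returns the client-specific version of XMLHttpRequest,
 * or false if the browser offers none.
 */
function createXHR()
{
  var xhr;
  try
  {
    xhr = new ActiveXObject("Msxml2.XMLHTTP"); // IE 5.0+
  }
  catch (e)
  {
    try
    {
      xhr = new ActiveXObject("Microsoft.XMLHTTP"); // IE 5.0-
    }
    catch (E)
    {
      xhr = false;
    }
  }
  if (!xhr && typeof XMLHttpRequest != 'undefined')
  {
    xhr = new XMLHttpRequest(); // every other browser
  }
  return xhr;
}

/*
 * Sends JME's current molfile to the clean_structure action and
 * loads the cleaned molfile back into the editor when it arrives.
 */
function cleanStructure()
{
  var molfile = document.jme.molFile();
  var xhr = createXHR();
  if (!xhr) return; // no Ajax support in this browser
  xhr.open("GET", "clean_structure?molfile=" + encodeURIComponent(molfile));
  xhr.onreadystatechange = function()
  {
    if (xhr.readyState != 4) return; // request not yet complete
    if (xhr.status != 200) return;   // on error, leave the drawing untouched
    var cleanMolfile = xhr.responseText; // local, not an implicit global
    document.jme.readMolFile(cleanMolfile);
  };
  xhr.send(null);
}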
</script>
As you can see, the client side of Cleanup consists of two JavaScript functions, createXHR and cleanStructure.
The purpose of createXHR is to return a valid instance of the central Ajax JavaScript object, XMLHttpRequest. This function is a standard idiom in Ajax programming, and many JavaScript toolkits eliminate the need to write it explicitly. The function is included here mainly for the purpose of illustration. Microsoft browsers define two different flavors of XMLHttpRequest, and both differ from the flavor supported by every other browser. To take this browser-specific behavior into account, a series of try/catch blocks is used.
The second function, cleanStructure, does all of the JME-specific work. After obtaining an instance of XMLHttpRequest, an HTTP GET request is built from JME's molfile. Of course, the magic of this request is that it is asynchronous; it will not block the browser while it is being processed. When the request is complete, the cleaned Molfile is read by JME.
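The "when the request is complete" step is worth a closer look, because a finished request is not necessarily a successful one. The sketch below pulls that decision into a small function that can be exercised without a browser. The name applyCleanResponse, the returned status strings, and the "M  END" sanity check are assumptions for illustration and are not part of the Cleanup distribution.

```javascript
// Decide what to do with a response from the clean_structure action.
// `xhr` is anything with readyState, status and responseText;
// `editor` is anything with a readMolFile method (document.jme in the page).
function applyCleanResponse(xhr, editor) {
  if (xhr.readyState !== 4) return "pending";    // request still in flight
  if (xhr.status !== 200) return "http-error";   // server-side failure
  const molfile = xhr.responseText;
  // Assumption: a usable V2000 molfile ends its properties block with "M  END".
  if (molfile.indexOf("M  END") === -1) return "not-a-molfile";
  editor.readMolFile(molfile);
  return "loaded";
}

// Stand-ins for the real objects, so the sketch runs outside a browser.
const loaded = [];
const fakeEditor = { readMolFile: function (m) { loaded.push(m); } };

console.log(applyCleanResponse({ readyState: 3, status: 0, responseText: "" }, fakeEditor));
console.log(applyCleanResponse({ readyState: 4, status: 500, responseText: "error page" }, fakeEditor));
console.log(applyCleanResponse({ readyState: 4, status: 200, responseText: "...\nM  END\n" }, fakeEditor));
```

In the page itself, the onreadystatechange callback would pass in the real request and document.jme. Only the last case touches the editor, so a server error never replaces Bob's drawing with an error page.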
Through the coordinated action of both of Cleanup's components, the application gives the appearance that JME has cleaned its own structure.
So What?
Well-designed, rich functionality makes software interesting and useful. At the same time, users demand software that loads and responds quickly. Using the technique presented here, it's possible to satisfy both of these contradictory requirements. Delegating key tasks to a server obviates the need to transmit large Java libraries to clients. Instead, small Java libraries can be transmitted, and several small asynchronous requests will be processed along the way.
Viewed from this perspective, the capabilities of a good Java applet take on a very different character from what many have become accustomed to. In particular, extensibility and a robust, text-based communication protocol become much more important than built-in features.
For example, we could provide a much more consistent user experience if the Clean Molecule button were contained inside the JME editor itself, instead of on the Web page. In a more general sense, we'd like JME to offer the option of defining custom buttons that can be assigned arbitrary actions. Because Java/JavaScript integration is very well-supported, these custom actions could actually be written in JavaScript.
Conclusions
Java applets have been much maligned of late, partly due to the realization that in many situations they can be replaced with Ajax. However, well-designed, small, and extensible Java applets can play a key role in certain kinds of Ajax applications such as the one described here. Future articles in this series will explore some more of the many possibilities.
Anatomy of a Cheminformatics Web Application: InChIMatic
InChI is an open molecular identifier system. Although InChIs obviate the need for a central registration authority, they are complex enough that they must be generated by computer. Currently, a few desktop molecular editors can generate InChI identifiers. But wouldn't it be more convenient if this capability existed in a simple Web application that could be used from any computer - anywhere? This article describes a Web application called "InChIMatic", which does just that.
In this article, I'll show how Java Molecular Editor (JME), a lightweight 2-D structure editor, can be extended to produce InChI identifiers through server-side software written in Ruby, rather than by extending the applet with Java code.
Downloads and Prerequisites
InChIMatic requires Ruby on Rails and the Rino InChI toolkit. Both of these libraries can be installed using the RubyGems packaging system.
The complete InChIMatic source package can be downloaded from RubyForge. For convenience, a copy of JME is included with the distribution. The author, Peter Ertl, has kindly given permission for the bundled JME applet to be used with InChIMatic. For other uses, consult the JME homepage.
Running InChIMatic
$ cd inchimatic-0.0.2
$ ruby script/server
Pointing your browser to http://localhost:3000/inchi/input, drawing a structure in the JME window, and pressing the "InChI!" button will produce the corresponding InChI in the area below.

Behind the Scenes
The JME applet itself provides no capabilities for generating InChI identifiers. This functionality is instead provided by the Rails server via the Rino InChI library.
Let's say Susan wants to get the InChI for 3,4-dichlorophenol. After entering the structure into the JME window, she presses the "InChI!" button. This sets in motion the following sequence of events:
1. The JavaScript function writeMolfile() is called. This retrieves a molfile representation of 3,4-dichlorophenol from JME, which is then written to the hidden field molfile.
2. A Rails listener notices that the hidden text field has been updated and so invokes the InChIMatic ajax_inchi action. This is a Rails Ajax action that will update only a portion of the InChIMatic window. For more detail on this Rails Ajax technique, see the previous Anatomy of a Cheminformatics Web Application article.
3. The ajax_inchi action retrieves the contents of the hidden molfile field. This molfile is then used to generate an InChI using Rino. This InChI is then saved to the instance variable inchi.
4. The contents of the InChIMatic area partitioned by the results div are then updated with the InChI obtained in Step 3. The JME applet itself is unaffected by this operation, allowing Susan to further elaborate her molecule, if she'd like.
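The client-side half of this sequence is tiny. Here's a minimal sketch of what the first step might look like, assuming the hidden field has the id "molfile" and the applet is reachable as document.jme, as it is in the Cleanup listing. The doc parameter is an addition for illustration so the function can be run outside a browser; it is not necessarily how InChIMatic's own writeMolfile() is written.

```javascript
// Copy the editor's current structure into the hidden form field.
// `doc` is passed in so the function can be exercised outside a browser;
// in the page it would be the global `document`.
function writeMolfile(doc) {
  const molfile = doc.jme.molFile();              // JME's molfile export
  doc.getElementById("molfile").value = molfile;  // field watched by the Rails listener
  return molfile;
}

// A stand-in document with a fake applet and a fake hidden field.
const hiddenField = { value: "" };
const fakeDocument = {
  jme: { molFile: function () { return "placeholder molfile text"; } },
  getElementById: function (id) { return id === "molfile" ? hiddenField : null; }
};

writeMolfile(fakeDocument);
console.log(hiddenField.value);
```

Everything after this point happens on the server, which is why the applet never needs to know that InChI exists.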
So What? Re-Thinking the Role of Applets
JME is, by itself, incapable of generating InChIs. Yet InChIMatic provides this capability as if it existed natively. In other words, a lightweight, fast-loading, and responsive 2-D editor can be extended on the server side, rather than on the client side. The difference is imperceptible to the user, but ripe with potential for the developer.
One of the most common, and completely valid, complaints about Java applets is that they take too long to load. Offloading some of the functionality currently being bundled in applets onto a Web server offers one way to combat the problem. Furthermore, combining Java applets with Ajax and powerful Web application frameworks like Ruby on Rails offers virtually limitless opportunities to re-think the role of Java applets in Web application development.
Conclusions
JME's strength comes, perhaps ironically, from its limited functionality. By using some simple Web programming techniques, JME can be extended with server-side programming. The advantages, compared to extending the JME applet itself with Java on the client side, are significant. Future articles in this series will explore some of the possibilities.
Hacking Molbank: Creating a Graphical Table of Contents
Molbank is an Open Access collection of single-compound articles on synthetic chemistry. Previous articles on Depth-First have highlighted Molbank's practice of including machine-readable molecular representations of its content, and its very liberal policy on mirroring and robots. In this article, we'll take advantage of both of these features to build something that was left out of Molbank: a graphical table of contents.
The Graphical Table of Contents (GTOC)
The Molbank Graphical Table of Contents (Molbank GTOC) is available online. It consists of a single Web page containing a grid of color 2-D chemical structures representing the contents of Molbank. Each structure is hyperlinked into the Molbank site itself. Clicking on the structure takes you to the complete synthetic procedure and characterization data.

Prerequisites, Downloading, and Running
To run this project, you'll need Ruby CDK. A recent article described the small amount of system configuration required for Ruby CDK on Linux. Another article showed how to install Ruby CDK on Windows.
The complete source code for this project can be downloaded from RubyForge. A subdirectory called demo contains the pre-built final result.
After unpacking the molbank-0.1.0 archive, the demo application can be run:
$ cd molbank-0.0.1
$ ruby test.rb
Problems, We've Got Problems
Several problems were uncovered while building the Molbank GTOC. This is to be expected with any data produced "in the wild" rather than within the safety of an Ivory Tower. Here are the main categories:
Blank Images: The entry for M52 is blank. Checking the underlying molfile reveals four instances of bond stereo flags set to 6, a problem common to many of the blank images in the GTOC. According to the Molfile specification, a stereo value of 6 marks a single bond as "Down"; double-bond geometry is instead read from the 2-D coordinates. Given that the molecules shown in M52 have only one possible stereo bond, an encoding inconsistency or incorrect stereo interpretation may be the cause.
Images Containing an "R" Atom Label: Entry M53 shows an "R" group at what should be the carbonyl carbon. The underlying molfile contains several less-common entries in the properties block, a common feature of images containing "R" in the GTOC.
Molfile Not Found: Entry M95 has no associated Molfile because it simply reports errata for other articles. M253-M259, on the other hand, lack molfiles because the articles were "withdrawn before publication." M347 describes a cyclodextrin for which, understandably, no molfile was provided. There are also a couple of cases in which a link to a molfile is provided, but is not available, such as M352.
Broken Molfiles: The Molfile for M162 encodes its line endings as two carriage returns and a newline, giving rise to the appearance of blank lines after data lines. This is something the Molfile specification strictly forbids. Apparently, the underlying CDK molfile reader can only handle one carriage return and a newline. Perhaps the extra return was introduced as the file was copied into and out of text editors on various operating systems in preparation for uploading it to Molbank. Another common problem was binary files being used for molfiles, such as with M402. These files don't appear to be compressed with either Zip or GZip and their nature is currently unknown.
Bogus Molfiles: For reasons I still can't understand, the Molfile for M407 encodes ethylene. So do several other Molbank molfiles. Other common dummy molfiles include toluene, benzene, and ethane.
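Several of these categories can be detected mechanically before a molfile ever reaches a renderer. The following is a rough sketch, assuming V2000 molfiles with their fixed-column counts, atom, and bond lines. The function names, the problem labels, and the placeholder list are all hypothetical, and the checks are heuristics rather than a validator.

```javascript
// Collapse "\r\r\n", "\r\n" and lone "\r" line endings to "\n".
function normalizeLineEndings(text) {
  return text.replace(/\r+\n|\r/g, "\n");
}

// Control characters other than tab, newline and carriage return
// suggest the file is not a text molfile at all.
function looksBinary(text) {
  return /[\x00-\x08\x0B\x0C\x0E-\x1F]/.test(text);
}

// Read element symbols and bond stereo flags from a V2000 molfile,
// using the fixed column positions of the counts, atom and bond lines.
function parseV2000(molfile) {
  const lines = normalizeLineEndings(molfile).split("\n");
  const counts = lines[3] || "";
  const atomCount = parseInt(counts.substring(0, 3), 10);
  const bondCount = parseInt(counts.substring(3, 6), 10);
  if (isNaN(atomCount) || isNaN(bondCount)) {
    throw new Error("Unreadable counts line");
  }
  const symbols = [];
  for (let i = 0; i < atomCount; i++) {
    symbols.push((lines[4 + i] || "").substring(31, 34).trim());
  }
  const stereo = [];
  for (let i = 0; i < bondCount; i++) {
    const flag = parseInt((lines[4 + atomCount + i] || "").substring(9, 12), 10);
    stereo.push(isNaN(flag) ? 0 : flag);
  }
  return { symbols: symbols, stereo: stereo };
}

// Hypothetical placeholder list: heavy-atom compositions of ethylene/ethane,
// benzene and toluene (hydrogens are implicit, so they are not counted).
const PLACEHOLDER_COMPOSITIONS = ["C2", "C6", "C7"];

// Return a list of human-readable problem labels for one molfile.
function diagnose(molfile) {
  if (looksBinary(molfile)) return ["binary"];
  const problems = [];
  if (/\r\r\n/.test(molfile)) problems.push("doubled carriage returns");
  const mol = parseV2000(molfile);
  if (mol.stereo.indexOf(6) !== -1) problems.push("stereo flag 6");
  const allCarbon = mol.symbols.every(function (s) { return s === "C"; });
  if (allCarbon && PLACEHOLDER_COMPOSITIONS.indexOf("C" + mol.symbols.length) !== -1) {
    problems.push("possible placeholder structure");
  }
  return problems;
}

// Made-up example: ethylene saved with doubled carriage returns.
const sample = [
  "",
  "  hypothetical example",
  "",
  "  2  1  0  0  0  0  0  0  0  0999 V2000",
  "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
  "    1.3000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
  "  1  2  2  0",
  "M  END"
].join("\r\r\n");
console.log(diagnose(sample));
```

The placeholder check is deliberately crude: it only looks at the heavy-atom count, so it can't tell ethylene from ethane and would flag a genuine article about benzene. For a collection of synthetic procedures that seems like an acceptable trade, but each flagged entry still needs a human look.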
After cataloging the problems that exist with the Molbank dataset and the software used to mine it, two interesting questions come into focus:
What can be done to help Molbank fix the most obvious problems in their molfiles and would they accept these improvements?
How can "real" datasets like Molbank help developers build better cheminformatics software? (a graphical Molfile Debugger Utility would come in handy...)
Clearly, the connection between Open Access, Open Source, and Open Data is very strong and runs very deep.
Behind the Scenes
The Ruby Molbank GTOC generator works by connecting to the www.mdpi.net server to get its data in real time. Internally, the software creates a map of the Molbank website so that the molfile (and URL) for any article can be retrieved on demand. Each readable molfile is used to create a 2-D image using Ruby CDK. As a final step, the index.html page is generated, linking each 2-D image to the URL of its Molbank article. This file is produced with eRuby using a previously described technique.
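The final templating step is simple enough to sketch. The real generator uses eRuby; what follows is a JavaScript analogue of the same idea, not the project's code. The entry fields (id, articleUrl, image), the example.org URLs, and the function names are all hypothetical.

```javascript
// Escape text for safe use inside HTML attributes.
function escapeHtml(s) {
  return String(s)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Lay the entries out as a grid, `columns` structures per row,
// with each image linked to its article.
function renderGtoc(entries, columns) {
  const rows = [];
  for (let i = 0; i < entries.length; i += columns) {
    const cells = entries.slice(i, i + columns).map(function (e) {
      return '<td><a href="' + escapeHtml(e.articleUrl) + '">' +
             '<img src="' + escapeHtml(e.image) + '" alt="' + escapeHtml(e.id) + '"></a></td>';
    });
    rows.push("<tr>" + cells.join("") + "</tr>");
  }
  return "<table>\n" + rows.join("\n") + "\n</table>";
}

// Placeholder entries; the URLs and file names are invented for the example.
const entries = [
  { id: "M1", articleUrl: "https://example.org/molbank/m1", image: "images/m1.png" },
  { id: "M2", articleUrl: "https://example.org/molbank/m2", image: "images/m2.png" },
  { id: "M3", articleUrl: "https://example.org/molbank/m3", image: "images/m3.png" }
];
console.log(renderGtoc(entries, 2));
```

Entries whose molfiles couldn't be read would need a placeholder image or would be left out of the list before rendering; that choice is left open here.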
Conclusions
Building a Graphical Table of Contents for Molbank is not that difficult given the power of Ruby, and Molbank's forward-thinking attitude toward mirroring and robots. In working on this project, several problems were uncovered, both with Molbank's data, and the software used to mine it.
In some ways, the software described here and its output are less interesting than the larger questions they raise:
How do scientific journals best serve not only their readers, but developers who want to provide new ways to use the journal?
How far does copyright extend in scientific publications? For example, are molfiles copyrightable? If so, at what level of detail are they not? If atom coordinates or some other kind of non-essential information is left out, does that change anything?
In what other practical ways could the connection between Open Source, Open Data, and Open Access be explored?
These and many related questions are waiting just around the corner. As Open Access becomes more viable, both technically and commercially, look to Open Source and Open Data to provide the synergies that will unlock its true potential.

