Self-Referential 1

Posted by Rich Apodaca Fri, 20 Apr 2007 14:19:00 GMT

One of the strange things about writing a blog is stumbling on your own work in Google's search results. In doing some research for a talk, I noticed that Depth-First articles appear at the top of many Google searches. I'm not sure what to make of this, but for what it's worth, here are some unquoted Google search terms that currently return a Depth-First article in the top-three results:

Image Credit: rabinal

Why the Web Isn't Ready for Chemistry

Posted by Rich Apodaca Mon, 05 Mar 2007 14:55:00 GMT

Wouldn't it be wonderful if chemical structure searching were as easy as using Google? Draw your molecule, press a button and get the good stuff first. That day may well arrive, but without the creation of some key technologies, the wait will be very long. This article describes an unsuccessful attempt to bring the chemically-aware Web closer to reality.

Background

Recently, I introduced a small Web application called InChIMatic. It lets you draw a structure and search for it through one of a number of popular search engines.

InChIMatic turns a molecular query into text, which is then searched. This magic is made possible through the IUPAC International Chemical Identifier (InChI). InChI has enormous potential for enabling chemical Web searches, but several barriers must be overcome first.

For example, if you run even the most trivial of queries with InChIMatic, you'll quickly see that search engines have only indexed a small number of InChIs. One reason is that InChIs are not yet widely used by Web authors. But the deeper problem is that many pages containing InChIs are not indexed by search engines. PubChem's vast collection of InChIs, for instance, is apparently invisible to Google.

Compounding the problems of using InChIs to index chemical content on the Web is the lack of a standard, unobtrusive method for embedding the identifier into Web pages. Understandably, no author wants to invest valuable time and effort on an indexing system that doesn't work with their content and page layout. This problem is the subject of the current article.

Materials and Methods

The InChIMatic article contained a test for how well Google and "invisible" InChIs might work together. If you mouse over the word "1-bromonaphthalene" in the first paragraph of that article, you'll see a small popup window containing the InChI. I accomplished this effect with the following HTML:

<span title="InChI=1/C10H7Br/c11-10-7-3-5-8-4-1-2-6-9(8)10/h1-7H">
  1-bromonaphthalene
</span>

My goal wasn't the popup effect. Instead, I wanted to test the title attribute as an unobtrusive vector for getting InChIs indexed by Google. This excellent idea was a suggestion made by Oliver Koepler in response to Egon Willighagen's article on invisible InChIs.
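To make the test concrete, here is a minimal sketch of what an indexer would have to do to pick up an InChI hidden this way. It uses Python's standard html.parser module; the class name TitleInChIParser and the surrounding sample sentence are hypothetical, and nothing here reflects how Google or any other search engine actually processes pages.

```python
from html.parser import HTMLParser


class TitleInChIParser(HTMLParser):
    """Collect InChI strings found in title attributes (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.inchis = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        for name, value in attrs:
            if name == "title" and value and value.startswith("InChI="):
                self.inchis.append(value)


# Sample markup: the span is the one from the article, the sentence is made up.
html = (
    '<p>The sample was <span title="InChI=1/C10H7Br/'
    'c11-10-7-3-5-8-4-1-2-6-9(8)10/h1-7H">1-bromonaphthalene</span>.</p>'
)

parser = TitleInChIParser()
parser.feed(html)
print(parser.inchis)
```

Extracting the identifier is the easy part. Whether a search engine chooses to index attribute text at all is a separate question.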

The idea is simple: InChIs are to be read by machines, not humans. InChIs consist of long strings of text that contain no widely-recognized wrappable characters. As a result, displaying InChIs in Web pages can break page layouts. Even if a wrapping mechanism is used, such as the CSS overflow property, I find InChIs unpleasant to look at and just plain distracting. There's no good reason why any chemist should have to look at them.

Chemists themselves are, understandably, reluctant to invest in ad hoc methods to index their molecular content - they need a real solution. It needs to be simple, it needs to be robust, it needs to be easy to apply retroactively, and it needs to be ready today.

Results

After about two days, Google had indexed the article containing the hidden InChI for 1-bromonaphthalene. Using InChIMatic, I searched Google for the InChI, but only found the same NMRShiftDB item returned in previous queries.

A few days later, a new Depth-First link appeared in Google. It pointed to the main XML Atom feed for Depth-First. This is a step in the right direction, but a far cry from the solution chemists need.

None of the other major search engines supported by InChIMatic returned a link to the Depth-First article containing the hidden InChI. The only new result was retrieved by Search.com. Like Google's result, this new link pointed to Depth-First's main XML feed.

Conclusions

Google doesn't index the contents of the title attribute and may never do so. This should not be surprising. Google has made a fortune in part by staying one step ahead of Search Engine Optimization (SEO) tricksters. By ignoring the contents of the title attribute, Google and other search engines eliminate a real threat that could corrupt the search results that drive their business.

What about other methods for concealing InChIs? One study suggests that none of them will work, either. A two-year-old experiment on SEO techniques compared ten different methods to conceal a text string from human viewers. Methods ranged from applying the display:none CSS declaration, to using matched font and background color, to concealing the text in a hidden frame. Although some of these methods may have initially been successful in getting content into Google, none of them work now.

KinasePro recently described a failed attempt to get Google to index a SMILES string hidden in the alt attribute of the img element. Although Technorati did index this content, a Technorati search for the 1-bromonaphthalene InChI returned no hits. A Technorati search for the article containing the hidden InChI did work, suggesting that Technorati also ignores the title attribute.

Why it Matters

Google and other search engines are in a perpetual state of war with SEO tricksters, and rightly so. At stake are search results that make up some of the most valuable intellectual property in the world. Any attempt to make InChIs appear invisible to humans is likely to be interpreted by major search engines as spam and treated accordingly. It seems very unlikely that this stance will ever change, regardless of how legitimate the motivation might be.

This leaves us with the fundamental problem of how to build a workable, Web-based chemical indexing system. The CAS registry system has served chemistry as the de facto standard for decades, but for a variety of reasons it is unworkable as an open technology for the Web. The more modern approach of combining InChI and standard search engines has major limitations, as outlined in this article.

If anything in cheminformatics is broken, it's the indexing and retrieval of molecular information on the Web. For those interested in solving a tough problem that really matters, this is a golden opportunity.

Googling for Molecules: New and Improved InChIMatic

Posted by Rich Apodaca Wed, 28 Feb 2007 14:59:00 GMT

InChIMatic, as described previously, is a new service that lets you perform exact structure searches on the Web using Google. A new version offers searching via several other search engines and features a streamlined interface. The screenshot below shows the current search engine options with 1-bromonaphthalene in the editor window.

There are noticeable differences in the abilities of search engines other than Google to find InChIs. Google seems to offer the most complete coverage. For example, all search engines I've tried have returned either a subset or recapitulation of Google's results.

One of the most striking things about InChIMatic is how specific the search results are. Every molecule that has produced results for me has been a direct hit. Also notable is how few InChIs are currently indexed by Google and other search engines. Tackling that problem will require a convenient and unobtrusive way to get InChIs into Web pages and to get those pages indexed by search engines. But more on that later.

Google for Molecules with InChIMatic

Posted by Rich Apodaca Mon, 19 Feb 2007 15:18:00 GMT

InChIMatic is a simple Web application that uses Google to perform exact structure searches on the Web. After drawing your structure in the editor window, click the "InChI!" button to get a link. This link takes you to a Google query that displays matches for your molecule. You'll need both Java and JavaScript enabled in your browser to use InChIMatic.

The Technical Details

The technology at the heart of InChIMatic is the IUPAC International Chemical Identifier (InChI). An InChI is an alphanumeric string that uniquely identifies a molecular structure. By converting molecular structures to text, InChI makes it easy to use standard Internet tools to do exact structure searches.
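Because an InChI is plain text, ordinary string handling is enough to take one apart. The sketch below splits an InChI into its version, molecular formula, and the single-letter-prefixed layers that follow (for example, c for connectivity and h for hydrogens). The function name split_inchi is hypothetical, and the code only handles the simple case in which a formula is present and no layer prefix repeats; it is not a substitute for the official InChI toolkit.

```python
def split_inchi(inchi):
    """Split a simple InChI string into version, formula, and layers."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    # Assumes at least a version and a formula are present.
    version, formula, *rest = inchi[len(prefix):].split("/")
    # Each remaining layer begins with a one-letter prefix such as 'c' or 'h'.
    # A repeated prefix would overwrite the earlier entry in this sketch.
    layers = {layer[0]: layer[1:] for layer in rest if layer}
    return {"version": version, "formula": formula, "layers": layers}


parts = split_inchi("InChI=1/C10H7Br/c11-10-7-3-5-8-4-1-2-6-9(8)10/h1-7H")
print(parts["formula"])
print(parts["layers"])
```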

The earliest reference in the peer-reviewed literature to using Google for searching InChIs is contained in a 2005 paper. More recently, a service called QueryChem has taken this idea one step further by using the Google API to perform substructure searches based on InChI.

InChIMatic works differently. Unlike a raw Google search, InChIMatic builds a Google query link for you. Unlike QueryChem, InChIMatic doesn't use the Google API and so has none of its restrictions. This does result in a limitation: InChIMatic can currently be used only for exact structure queries.
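Building a query link from an InChI amounts to URL-encoding the identifier and appending it to a search URL. Here is a minimal sketch of that idea, not InChIMatic's actual code: the function name build_query_link is hypothetical, and wrapping the InChI in double quotes to request an exact-phrase match is an assumption on my part.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_query_link(inchi, base="https://www.google.com/search"):
    """Return a search URL whose query is the quoted InChI (assumed format)."""
    # urlencode percent-encodes '=', '/', parentheses, and the quotes.
    return base + "?" + urlencode({"q": '"' + inchi + '"'})


inchi = "InChI=1/C10H7Br/c11-10-7-3-5-8-4-1-2-6-9(8)10/h1-7H"
link = build_query_link(inchi)
print(link)

# Decoding the link again recovers the original query text.
decoded = parse_qs(urlsplit(link).query)["q"][0]
print(decoded)
```

Pointing the base parameter at a different search engine's URL is all it would take to support other engines, provided they accept the query under the same parameter name.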

The InChIMatic Web application has been discussed in greater technical detail in a previous article. The rapid Web application development framework Ruby on Rails made building InChIMatic a snap. InChIMatic is served by the Ruby application container Mongrel, which is hosted on a Linux server running Apache. Rino provided the Ruby interface to the IUPAC/NIST InChI toolkit. The 2-D structure editor is Java Molecular Editor (JME) by Peter Ertl, which is used with his kind permission.

Aside from JME, all components of InChIMatic, from the operating system it runs on to the InChI system itself, are Open Source software.

Using InChI to Raise the Visibility of Your Content

InChIMatic returns many Google results for common molecules. But less common, known molecules return no hits at all. Three factors are responsible: (1) Google doesn't index all InChIs on the Internet; (2) few content providers currently use InChI; and (3) there is no standard and convenient mechanism to embed InChIs into Web pages for indexing by Google.

For these reasons, I consider InChI to be bleeding-edge technology. Some will find it useful, most will not. Unfortunately, this state of affairs will persist until problems (1) and (3) are solved.

Nevertheless, if you're technically adventurous, InChIMatic offers a relatively painless way to begin incorporating InChIs into your content and verifying that they get indexed. There's no software to download, install, or upgrade. Forget about operating system incompatibilities (hopefully!). Just point your Java-enabled browser to inchimatic.com.

Although there's no standard method to encode InChIs in Web pages, some interesting ideas have been put forward. Egon Willighagen has proposed a system based on RDFa. Future iterations of InChIMatic may include support for generating scripts and/or markup for including InChIs into blogs and other online content.

Conclusions

InChI is a complex new technology in need of easy-to-use tools. InChIMatic is one such tool that makes it possible to perform exact structure queries using Google.

One of the exciting things about Web applications is how quickly they can evolve. If in trying out InChIMatic you find something you'd like changed or added, please feel free to write me.

The Chemically-Aware Web: Are We There Yet?

Posted by Rich Apodaca Wed, 13 Sep 2006 17:25:00 GMT

Recently, I wrote a tutorial on embedding 2-D molecular renderings into webpages as Scalable Vector Graphics (SVG). This tutorial also contained a small experiment on the current chemical informatics capabilities of the Web.

Here is a scenario from the near future: Joe is writing a review on Cephalosporin C that he wants to publish the modern way - directly to the Web. An entirely new concept in scientific publishing has started to take hold. Rather than submitting scientific articles to publishers, who then make hamburger out of them and strip authors of their rights to reproduce their own work, scientists publish directly to the Web, and journals simply aggregate content that is already there. Some journals specialize in including only the very best scientific Web content available, and so enjoy a prestige factor. It's still a peer review system, but with inversion of control. The trick for scientists is getting their work indexed, and so noticed, in the first place.

Joe just downloaded a new 2-D structure editor, FooChemPaint, that he heard can make the structure drawings in his review structure-searchable. Every chemist he knows is talking about a new free search engine called Haystac (Haystac Ain't Chmoogle) that lets them substructure-search the web. For some reason, you need to create your structures using FooChemPaint if you want your own documents to be included in the search results.

After Joe finishes drawing Cephalosporin C with FooChemPaint, he chooses the File->Save As... menu item. Instead of saving as a JPG or PNG like he's done with other software, he saves the image as SVG. He then embeds the SVG into his review using a procedure similar to the one I outlined previously.

From Joe's perspective, he hasn't done anything very new. But unknown to Joe, FooChemPaint has automatically inserted the InChI identifier of Cephalosporin C as metadata into his SVG document. This enables ordinary search engines such as Google to associate the InChI with his SVG. The best part is that the entire process is essentially invisible to Joe.
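What FooChemPaint does behind the scenes could be as simple as the following sketch, which inserts an InChI into an SVG document's metadata element using Python's standard xml.etree.ElementTree module. FooChemPaint is fictional, so everything here is illustrative: the function name add_inchi_metadata is hypothetical, and storing the identifier in a Dublin Core dc:identifier element is an assumption, not an established convention. The Alprazolam InChI stands in for the structure being saved.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Keep readable prefixes when the document is serialized again.
ET.register_namespace("", SVG_NS)
ET.register_namespace("dc", DC_NS)


def add_inchi_metadata(svg_text, inchi):
    """Return svg_text with a <metadata> element carrying the InChI."""
    root = ET.fromstring(svg_text)
    metadata = ET.Element(f"{{{SVG_NS}}}metadata")
    identifier = ET.SubElement(metadata, f"{{{DC_NS}}}identifier")
    identifier.text = inchi
    # Place the metadata first so it precedes the drawing elements.
    root.insert(0, metadata)
    return ET.tostring(root, encoding="unicode")


svg = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="10" height="10">'
    '<rect width="10" height="10"/></svg>'
)
inchi = (
    "InChI=1/C17H13ClN4/c1-11-20-21-16-10-19-17(12-5-3-2-4-6-12)"
    "14-9-13(18)7-8-15(14)22(11)16/h2-9H,10H2,1H3"
)
annotated = add_inchi_metadata(svg, inchi)
print(annotated)
```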

Haystac is a web application that presents users with an online structure editor for preparing molecular queries. When a structure query is submitted, Haystac searches its molecular database for matches. This database, in turn, was built by a web spider specifically designed to look for InChI identifiers, maybe with the help of Google's Web API. One of Haystac's records for the structure of Cephalosporin C points to Joe's review article.
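The core of a spider like Haystac's could be sketched as a scan for InChI-like strings that builds an index from identifier to page. The example below is a toy under stated assumptions: the regular expression is a deliberately loose pattern rather than the official InChI grammar, the function name build_index is hypothetical, and the example.org URLs and page text are made up. A real spider would also need fetching, politeness rules, and validation of each candidate identifier.

```python
import re
from collections import defaultdict

# Loose pattern (an assumption): "InChI=" followed by characters that are
# not whitespace, quotes, or angle brackets.
INCHI_PATTERN = re.compile(r"InChI=[^\s\"'<>]+")


def build_index(pages):
    """Map each InChI found to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for inchi in INCHI_PATTERN.findall(text):
            index[inchi].add(url)
    return index


# Hypothetical crawl results: URL -> page source.
pages = {
    "http://example.org/review": (
        '<span title="InChI=1/C10H7Br/c11-10-7-3-5-8-4-1-2-6-9(8)10/h1-7H">'
        "1-bromonaphthalene</span>"
    ),
    "http://example.org/notes": (
        "Ordered InChI=1/C10H7Br/c11-10-7-3-5-8-4-1-2-6-9(8)10/h1-7H today"
    ),
    "http://example.org/other": "No identifiers on this page",
}

index = build_index(pages)
query = "InChI=1/C10H7Br/c11-10-7-3-5-8-4-1-2-6-9(8)10/h1-7H"
print(sorted(index[query]))
```

An exact structure search then reduces to a dictionary lookup on the query's InChI.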

Science fiction? Maybe. This is where the experiment comes in. Before I submitted the article on SVG, I manually annotated the SVG of Alprazolam with the corresponding InChI. The XML source can be viewed in Firefox by right-clicking on the SVG image and choosing This Frame->View Frame Source, or alternatively here. Below is a fragment of the XML:

<svg ...>
  <rdf:RDF
    xmlns:rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs = "http://www.w3.org/2000/01/rdf-schema#"
    xmlns:dc = "http://purl.org/dc/elements/1.1/" >
    <rdf:Description about="http://depth-first.com"
      dc:title="InChI=1/C17H13ClN4/c1-11-20-21-16-10-19-17(12-5-3-2-4-6-12)14-9-13(18)7-8-15(14)22(11)16/h2-9H,10H2,1H3"
      dc:format="image/svg+xml"
      dc:language="en" >
      <dc:creator>
        <rdf:Bag>
          <rdf:li>Richard L. Apodaca</rdf:li>
        </rdf:Bag>
      </dc:creator>
    </rdf:Description>
  </rdf:RDF>

  <!-- etc. -->
</svg>
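For what it's worth, the InChI in a fragment like this is easy for a program to recover, which is what a chemistry-aware spider would need. The sketch below parses a trimmed, self-contained version of the markup with Python's standard xml.etree.ElementTree module and reads the dc:title attribute. The function name inchis_from_svg is hypothetical, and the svg element's attributes, which are elided above, are replaced here with only a namespace declaration.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Trimmed version of the fragment; the svg attributes are placeholders.
svg_text = """<svg xmlns="http://www.w3.org/2000/svg">
  <rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
    <rdf:Description about="http://depth-first.com"
      dc:title="InChI=1/C17H13ClN4/c1-11-20-21-16-10-19-17(12-5-3-2-4-6-12)14-9-13(18)7-8-15(14)22(11)16/h2-9H,10H2,1H3"
      dc:format="image/svg+xml" />
  </rdf:RDF>
</svg>"""


def inchis_from_svg(svg_text):
    """Return InChI strings stored in dc:title attributes of rdf:Description."""
    root = ET.fromstring(svg_text)
    found = []
    for description in root.iter(f"{{{RDF_NS}}}Description"):
        title = description.get(f"{{{DC_NS}}}title", "")
        if title.startswith("InChI="):
            found.append(title)
    return found


svg_inchis = inchis_from_svg(svg_text)
print(svg_inchis)
```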

Today I searched for the title of my article in Google and found it. I then searched for the InChI in the SVG metadata and did not find it. Currently, a search of this InChI shows only one hit from the DrugBank database.

The experiment failed in its stated goal of getting the InChI of Alprazolam indexed by Google via the metadata in its SVG rendering. Was it the formatting of my RDF tags? Is metadata just indexed more slowly than other content? Does Google just ignore metadata to avoid keyword stuffing by Search Engine Optimization tricksters? Are embedded SVG documents ignored by Google altogether? Whatever the reason, the technical barriers to a system like this working today are very low and dropping rapidly.