Why the Web Isn't Ready for Chemistry
Wouldn't it be wonderful if chemical structure searching were as easy as using Google? Draw your molecule, press a button and get the good stuff first. That day may well arrive, but without the creation of some key technologies, the wait will be very long. This article describes an unsuccessful attempt to bring the chemically-aware Web closer to reality.
Background
Recently, I introduced a small Web application called InChIMatic. It lets you draw a structure and search for it through one of a number of popular search engines.
InChIMatic turns a molecular query into text, which is then searched. This magic is made possible through the IUPAC International Chemical Identifier (InChI). InChI has enormous potential for enabling chemical Web searches, but several barriers must be overcome first.
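The structure-to-text-to-search step can be sketched in a few lines of Python. This is only an illustration, not InChIMatic's actual code: the function name `inchi_search_url` and the Google URL template are assumptions, and the InChI is taken as already generated from the drawn structure.

```python
from urllib.parse import quote_plus

# Assumed search endpoint; the URL templates InChIMatic really uses are not
# described in this article.
SEARCH_URL = "https://www.google.com/search?q={query}"


def inchi_search_url(inchi, template=SEARCH_URL):
    """Return a search URL that looks for the InChI as an exact phrase."""
    # Wrap the identifier in double quotes so the engine treats it as a single
    # phrase, then percent-encode characters such as '=', '/', '(' and ')'.
    phrase = '"{}"'.format(inchi.strip())
    return template.format(query=quote_plus(phrase))


if __name__ == "__main__":
    print(inchi_search_url("InChI=1/C10H7Br/c11-10-7-3-5-8-4-1-2-6-9(8)10/h1-7H"))
```

Quoting the whole identifier matters because an InChI is full of punctuation that a search engine might otherwise treat as word separators.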
For example, if you run even the most trivial of queries with InChIMatic, you'll quickly see that search engines have indexed only a small number of InChIs. One reason is that InChIs are not yet widely used by Web authors. But the deeper problem is that many pages containing InChIs are not indexed by search engines at all. PubChem's vast collection of InChIs, for instance, is apparently invisible to Google.
Compounding the problems of using InChIs to index chemical content on the Web is the lack of a standard, unobtrusive method for embedding the identifier into Web pages. Understandably, no author wants to invest valuable time and effort on an indexing system that doesn't work with their content and page layout. This problem is the subject of the current article.
Materials and Methods
The InChIMatic article contained a test for how well Google and "invisible" InChIs might work together. If you mouse over the word "1-bromonaphthalene" in the first paragraph of that article, you'll see a small popup window containing the InChI. I accomplished this effect with the following HTML:
<span title="InChI=1/C10H7Br/c11-10-7-3-5-8-4-1-2-6-9(8)10/h1-7H">
1-bromonaphthalene
</span>
My goal wasn't the popup effect. Instead, I wanted to test the title attribute as an unobtrusive vector for getting InChIs indexed by Google. This excellent idea was a suggestion made by Oliver Koepler in response to Egon Willighagen's article on invisible InChIs.
The idea is simple: InChIs are to be read by machines, not humans. InChIs consist of long strings of text that contain no widely-recognized wrappable characters. As a result, displaying InChIs in Web pages can break page layouts. Even if a wrapping mechanism is used, such as the CSS overflow property, I find InChIs unpleasant to look at and just plain distracting. There's no good reason why any chemist should have to look at them.
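To make the machine-readable side concrete, here is a minimal sketch of how an indexer could pull InChIs out of title attributes, using Python's standard html.parser module. The class and function names are hypothetical, and nothing here describes how any real search engine crawls pages; it only shows that the hidden identifier is trivial to recover for software that chooses to look.

```python
from html.parser import HTMLParser


class TitleInChIExtractor(HTMLParser):
    """Collect InChI strings stored in the title attribute of any element."""

    def __init__(self):
        super().__init__()
        self.inchis = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # value is None for attributes written without a value.
            if name == "title" and value and value.startswith("InChI="):
                self.inchis.append(value)


def extract_inchis(html):
    """Return every title-attribute InChI found in an HTML string."""
    parser = TitleInChIExtractor()
    parser.feed(html)
    parser.close()
    return parser.inchis


# The markup used in the test described in this article.
SAMPLE = """<span title="InChI=1/C10H7Br/c11-10-7-3-5-8-4-1-2-6-9(8)10/h1-7H">
1-bromonaphthalene
</span>"""

if __name__ == "__main__":
    print(extract_inchis(SAMPLE))
```

The check on the "InChI=" prefix is deliberately naive: it separates identifiers from ordinary tooltips but does not validate the InChI itself.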
Chemists themselves are, understandably, reluctant to invest in ad hoc methods to index their molecular content - they need a real solution. It needs to be simple, it needs to be robust, it needs to be easy to apply retroactively, and it needs to be ready today.
Results
After about two days, Google had indexed the article containing the hidden InChI for 1-bromonaphthalene. Using InChIMatic, I searched Google for the InChI, but found only the same NMRShiftDB item returned in previous queries.
A few days later, a new Depth-First link appeared in Google. It pointed to the main XML Atom feed for Depth-First. This is a step in the right direction, but a far cry from the solution chemists need.
None of the other major search engines supported by InChIMatic returned a link to the Depth-First article containing the hidden InChI. The only new result was retrieved by Search.com. Like Google's result, this new link pointed to Depth-First's main XML feed.
Conclusions
Google doesn't index the contents of the title attribute and may never do so. This should not be surprising. Google has made a fortune in part by staying one step ahead of Search Engine Optimization (SEO) tricksters. By ignoring the contents of the title attribute, Google and other search engines eliminate a real threat that could corrupt the search results that drive their business.
What about other methods for concealing InChIs? One study suggests that none of them will work, either. A two-year-old experiment on SEO techniques compared ten different methods of concealing a text string from human viewers. Methods ranged from applying the CSS display:none declaration, to using matched font and background colors, to concealing the text in a hidden frame. Although some of these methods may have initially been successful in getting content into Google, none of them work now.
KinasePro recently described a failed attempt to get Google to index a SMILES string hidden in the alt attribute of the img element. Although Technorati did index this content, a Technorati search for the 1-bromonaphthalene InChI returned no hits. A Technorati search for the article containing the hidden InChI did work, suggesting that Technorati also ignores the title attribute.
Why it Matters
Google and other search engines are in a perpetual state of war with SEO tricksters, and rightly so. At stake are search results that make up some of the most valuable intellectual property in the world. Any attempt to make InChIs invisible to humans is likely to be interpreted by major search engines as spam and treated accordingly. It seems very unlikely that this stance will ever change, regardless of how legitimate the motivation might be.
This leaves us with the fundamental problem of how to build a workable, Web-based chemical indexing system. The CAS registry system has served chemistry as the de facto standard for decades, but for a variety of reasons it is unworkable as an open technology for the Web. The more modern approach of combining InChI and standard search engines has major limitations, as outlined in this article.
If anything in cheminformatics is broken, it's the indexing and retrieval of molecular information on the Web. For those interested in solving a tough problem that really matters, this is a golden opportunity.
InChI Spam
Do you remember when getting email - any email - was exciting? For me, that time was 1995 and I had just found the Internet. Of course, I remember looking forward to messages from people I knew. But I also remember being blown away by the idea that I could write to anyone with an email account, anywhere in the world for essentially free - and that they could do the same. Back then, it was fun to get email, no matter what the source.
Today, spam is something that I, like millions of others, deal with on a daily basis. And it's not limited to email. Anyone who runs a blog knows about comment spam and how difficult it can be to eradicate. Even trackback is being used as a medium for blog spam. Of course, keyword spam on the Web has been a constant problem for search engines - eliminating it has in part led to more than a few fortunes earned at companies like Google.
Recently, I introduced a small Web application called InChIMatic. It lets you conveniently do exact-structure molecular queries through popular search engines like Google. Draw your structure, click "Search" and find your matches.
There aren't a lot of InChIs visible to search engines now, as an InChIMatic query for even the most trivial molecule will reveal. Regardless of your views on InChI as a technology for bringing chemistry to the Web, it seems very likely that the number of InChIs visible to search engines will increase significantly over the next few years. And with this increase may come sites dedicated to nothing other than publishing a lot of irrelevant InChIs in the hope of attracting accidental advertising click-throughs.
Right now, searching the Web by InChIs offers a very high signal-to-noise ratio experience - not unlike email in 1995. The shysters haven't yet discovered it and nobody is counting on the technology for mission-critical work. But if and when the idea of indexing chemical content on the Web through InChIs begins to catch on, filtering tools will become essential. If this scenario seems implausible, think back to your first experience with email and how concerned you were about spam then.

