Why Web Development is Hard

Posted by Rich Apodaca Fri, 16 Nov 2007 09:00:00 GMT

The very thing you'd like most to do as a developer is the thing your users can't stand.

Name That Graph Revealed: Oligarchy 2.0

Posted by Rich Apodaca Wed, 05 Sep 2007 08:28:00 GMT

Web 2.0 may be all about participation, but the numbers reported by The McKinsey Quarterly suggest a self-selecting oligarchy rather than a democracy. Success may well depend more on engaging the top 2-10% of users than on appealing to all of them. Food for thought when forming your next community, be it electronic or otherwise.

image credit: The McKinsey Quarterly

Can Your Cheminformatics Tool Do This?

Posted by Rich Apodaca Wed, 13 Jun 2007 08:36:00 GMT

Eleven Qualities of The Perfect Line Notation for the Web

Posted by Rich Apodaca Wed, 14 Mar 2007 10:18:00 GMT

If you had to design the perfect line notation for the Web, what would it look like? This is hardly an academic exercise given the central role played by line notations in information systems. For a variety of reasons, existing line notations may not be the right match for the Web. This article explores this question and outlines the main qualities needed by a Web-friendly line notation.

A Few Lines About Line Notations

A line notation is any system that converts a molecular structure into a single line of text. Chemists have been using line notations for over 140 years - long before the advent of computers. Because of their versatility, line notations are frequently used in situations they were not designed for. When this happens, limitations become apparent, resulting in renewed efforts to build a better system.

As noted previously, the invention of new line notations is a field whose popularity ebbs and flows over time. Currently, the three most important line notations are:

  • IUPAC Nomenclature
  • Simplified Molecular Input Line Entry System (SMILES)
  • IUPAC International Chemical Identifier (InChI)

Each of these systems has its own unique characteristics. IUPAC nomenclature is the oldest and most widely used line notation. It appears in numerous contexts, including Web pages, peer-reviewed journals, reports, patents, MSDS sheets, catalogs, and reagent bottles. By comparison, SMILES is a distant second in popularity. Its main role has been to facilitate machine entry of structural information by humans. InChI is the newest of the bunch. It serves both as a line notation and as a unique identifier requiring no central authority.

The Perfect Line Notation for the Web

The emergence of the Web as a standard information delivery platform has refocused the attention of many developers on the line notation problem. With this idea in mind, here are some guesses about the qualities of the ideal Web-friendly line notation.

  1. Readily Encodable and Decodable by Humans. There's something unnerving about a line notation that can't easily be deciphered by humans. Is this really the right string? Did I copy it completely? This problem surfaces with every line notation, but some fare better than others. IUPAC nomenclature, for example, is one of the first things taught in many beginning organic chemistry classes. It's complicated, but still understandable by non-experts.

  2. Readily Encodable and Decodable by Machines. It may be relatively simple for humans to read and write IUPAC nomenclature, but not so for machines. Software that reads and writes SMILES, on the other hand, is by comparison easy to write. This explains the abundance of software packages that handle SMILES and the scarcity of those that handle IUPAC nomenclature.

  3. Uses URI-Safe Characters Only. A URI uniquely identifies every document on the Internet. Why can't a line notation be used in combination with a URI to uniquely identify every molecule? One reason is that every line notation currently in use contains characters unsafe for use in URIs. Any line notation designed for use on the Web needs to avoid these characters in its syntax. Update: InChI doesn't use unsafe characters, but it does use the reserved characters "=", "?", and "/". These characters may therefore need to be escaped, depending on the context.

  4. Encodes All Molecules. Buried within every line notation is an opinion on what chemistry is really about. To operate on the Web, these opinions need to be as closely aligned as possible with those of chemists themselves. Several Depth-First articles have discussed the limitations of existing line notations as molecular languages.

  5. Compact. Nobody wants to look at or manipulate a line of text that's longer than it needs to be. Of course, the more expressive a line notation is, the more verbose it will be. In other words, qualities 4 and 5 will always be in conflict.

  6. Canonicalizable. A line notation supports canonicalization when it specifies rules that can be guaranteed to always generate the same line notation for a given molecule. This feature enables many labor-saving assumptions. For example, a canonical representation makes a great identifier in a database, reducing the cost of storing and retrieving structural information.

  7. Explicit Hydrogen Atom Encoding. SMILES makes few requirements regarding hydrogen atom encoding. As a result, each software implementation is left to its own devices. The resulting confusion is the price paid for the convenience (Quality 1) of a compact notation (Quality 5).

  8. Hierarchical Structure. One of InChI's innovations was the introduction of a hierarchical encoding system. This system, also referred to as InChI "layers", enables a molecule to be viewed at several levels of resolution: as a molecular formula; as a network of atoms; as a network of atoms containing hydrogen atoms; as an atomic network with stereochemistry; and so on. I'm unaware of any reports in which this feature has been exploited in a practical way, although such uses aren't difficult to imagine.

  9. Flat Structure. By grouping structural features into layers (Quality 8), InChI introduces a lot of complexity that is absent in SMILES and even IUPAC nomenclature. This complexity, in part, makes it difficult for both humans and machines to properly encode InChIs (Qualities 1 and 2). Given this complexity, and the fact that the utility of hierarchical encoding has yet to be conclusively demonstrated, it may be better to avoid it.

  10. Open Source Software Implementation. No encoding standard in today's world stands a chance of gaining acceptance without an open source reference implementation. InChI broke new ground in this area and should serve as a model for any system that follows.

  11. Unencumbered by Patents. The success of molfile and SMILES as de facto standards derives partly from the decision made by their authors to refrain from patenting their languages. As a result, developers are motivated to build their own implementations, rather than invent yet another language.
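Qualities 3 and 8 are easy to see in a few lines of Python. The sketch below uses only the standard library. The helper names uri_safe and split_layers are illustrative assumptions, not part of any InChI toolkit, and the layer splitter handles only the simple slash-delimited form shown here. It percent-encodes an InChI so that it can sit inside a single URI path segment, then splits the same string into its layers.

```python
from urllib.parse import quote, unquote

# InChI for 1-bromonaphthalene, used here purely as sample input.
INCHI = "InChI=1/C10H7Br/c11-10-7-3-5-8-4-1-2-6-9(8)10/h1-7H"


def uri_safe(inchi: str) -> str:
    """Percent-encode every reserved character, including '/' and '='."""
    return quote(inchi, safe="")


def split_layers(inchi: str) -> dict:
    """Split an InChI into version, formula, and one-letter-prefixed layers."""
    prefix, _, body = inchi.partition("=")
    if prefix != "InChI" or not body:
        raise ValueError("not an InChI string")
    # The first two fields carry no prefix letter: version, then formula.
    version, formula, *rest = body.split("/")
    layers = {"version": version, "formula": formula}
    for layer in rest:
        if layer:
            # Remaining layers begin with a prefix such as 'c' or 'h'.
            layers[layer[0]] = layer[1:]
    return layers


if __name__ == "__main__":
    encoded = uri_safe(INCHI)
    print(encoded)
    print(unquote(encoded) == INCHI)  # the encoding is reversible
    print(split_layers(INCHI))
```

The escaped form is noticeably longer and harder to read than the original, which illustrates how Quality 3 pulls against Qualities 1 and 5 whenever a notation relies on reserved characters.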

Conclusions

A robust and modern line notation system is a key technology for chemically enabling the Web. Existing line notations, although useful in many contexts, were not designed with this particular role in mind. The time has come to consider whether a new line notation system, designed specifically with the Web and modern chemistry in mind, might offer a better solution.

Photo credit: Wenwen - Flickr

Why the Web Isn't Ready for Chemistry

Posted by Rich Apodaca Mon, 05 Mar 2007 09:55:00 GMT

Wouldn't it be wonderful if chemical structure searching were as easy as using Google? Draw your molecule, press a button and get the good stuff first. That day may well arrive, but without the creation of some key technologies, the wait will be very long. This article describes an unsuccessful attempt to bring the chemically-aware Web closer to reality.

Background

Recently, I introduced a small Web application called InChIMatic. It lets you draw a structure and search for it through one of a number of popular search engines.

InChIMatic turns a molecular query into text, which is then searched. This magic is made possible through the IUPAC International Chemical Identifier (InChI). InChI has enormous potential for enabling chemical Web searches, but several barriers must be overcome first.

For example, if you run even the most trivial of queries with InChIMatic, you'll quickly see that search engines have only indexed a small number of InChIs. One reason is that InChIs are not yet widely used by Web authors. But the deeper problem is that many pages containing InChIs are not indexed by search engines. For example, PubChem's vast collection of InChIs is apparently invisible to Google.

Compounding the problems of using InChIs to index chemical content on the Web is the lack of a standard, unobtrusive method for embedding the identifier into Web pages. Understandably, no author wants to invest valuable time and effort on an indexing system that doesn't work with their content and page layout. This problem is the subject of the current article.

Materials and Methods

The InChIMatic article contained a test for how well Google and "invisible" InChIs might work together. If you mouse over the word "1-bromonaphthalene" in the first paragraph of that article, you'll see a small popup window containing the InChI. I accomplished this effect with the following HTML:

<span title="InChI=1/C10H7Br/c11-10-7-3-5-8-4-1-2-6-9(8)10/h1-7H">
  1-bromonaphthalene
</span>

My goal wasn't the popup effect. Instead, I wanted to test the title attribute as an unobtrusive vector for getting InChIs indexed by Google. This excellent idea was a suggestion made by Oliver Koepler in response to Egon Willighagen's article on invisible InChIs.
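On a page with many compounds, markup like the span above could be generated rather than typed by hand, and any program that parses HTML can read the identifier back out. The following is a minimal sketch using only the Python standard library. The names inchi_span, TitleInChIFinder, and find_inchis are hypothetical, and nothing here describes how any real search engine treats the title attribute.

```python
from html import escape
from html.parser import HTMLParser

INCHI = "InChI=1/C10H7Br/c11-10-7-3-5-8-4-1-2-6-9(8)10/h1-7H"


def inchi_span(name: str, inchi: str) -> str:
    """Wrap a compound name in a span whose title attribute holds the InChI."""
    return f'<span title="{escape(inchi, quote=True)}">{escape(name)}</span>'


class TitleInChIFinder(HTMLParser):
    """Collect title attribute values that look like InChIs."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        for key, value in attrs:
            if key == "title" and value and value.startswith("InChI="):
                self.found.append(value)


def find_inchis(html_text: str) -> list:
    """Return every InChI found in a title attribute of the given HTML."""
    finder = TitleInChIFinder()
    finder.feed(html_text)
    finder.close()
    return finder.found


if __name__ == "__main__":
    markup = inchi_span("1-bromonaphthalene", INCHI)
    print(markup)
    print(find_inchis("<p>" + markup + "</p>"))
```

Reading the attribute is technically trivial, so whether a crawler indexes it is a matter of policy rather than capability.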

The idea is simple: InChIs are to be read by machines, not humans. InChIs consist of long strings of text that contain no widely-recognized wrappable characters. As a result, displaying InChIs in Web pages can break page layouts. Even if a wrapping mechanism is used, such as the CSS overflow property, I find InChIs unpleasant to look at and just plain distracting. There's no good reason why any chemist should have to look at them.

Chemists themselves are, understandably, reluctant to invest in ad hoc methods to index their molecular content - they need a real solution. It needs to be simple, it needs to be robust, it needs to be easy to apply retroactively, and it needs to be ready today.

Results

After about two days, Google had indexed the article containing the hidden InChI for 1-bromonaphthalene. Using InChIMatic, I searched Google for the InChI, but only found the same NMRShiftDB item returned in previous queries.

A few days later, a new Depth-First link appeared in Google. It pointed to the main XML Atom feed for Depth-First. This is a step in the right direction, but a far cry from the solution chemists need.

None of the other major search engines supported by InChIMatic returned a link to the Depth-First article containing the hidden InChI. The only new result was retrieved by Search.com. Like Google's result, this new link pointed to Depth-First's main XML feed.

Conclusions

Google doesn't index the contents of the title attribute and may never do so. This should not be surprising. Google has made a fortune in part by staying one step ahead of Search Engine Optimization (SEO) tricksters. By ignoring the contents of the title attribute, Google and other search engines eliminate a real threat that could corrupt the search results that drive their business.

What about other methods for concealing InChIs? One study suggests that none of them will work, either. A two-year-old experiment on SEO techniques compared ten different methods to conceal a text string from human viewers. Methods ranged from applying the display:none CSS declaration, to using matched font and background color, to concealing the text in a hidden frame. Although some of these methods may have initially been successful in getting content into Google, none of them work now.

KinasePro recently described a failed attempt to get Google to index a SMILES string hidden in the alt attribute of the img element. Although Technorati did index this content, a Technorati search for the 1-bromonaphthalene InChI returned no hits. A Technorati search for the article containing the hidden InChI did work, suggesting that Technorati also ignores the title attribute.

Why it Matters

Google and other search engines are in a perpetual state of war with SEO tricksters, and rightly so. At stake are search results that make up some of the most valuable intellectual property in the world. Any attempt to make InChIs appear invisible to humans is likely to be interpreted by major search engines as spam and treated accordingly. It seems very unlikely that this stance will ever change, regardless of how legitimate the motivation might be.

This leaves us with the fundamental problem of how to build a workable, Web-based chemical indexing system. The CAS registry system has served chemistry as the de facto standard for decades, but for a variety of reasons it is unworkable as an open technology for the Web. The more modern approach of combining InChI and standard search engines has major limitations, as outlined in this article.

If anything in cheminformatics is broken, it's the indexing and retrieval of molecular information on the Web. For those interested in solving a tough problem that really matters, this is a golden opportunity.
