Everything Old is New Again: Wiswesser Line Notation (WLN)

Posted by Rich Apodaca Fri, 20 Jul 2007 08:46:00 GMT

Sometimes, searching through the attic of scientific ideas turns up unexpected treasures. Like old clothing styles that suddenly become fashionable again, the passage of time has a way of making old ideas relevant by supplying new context. Those ideas that once enjoyed widespread popularity followed by complete obscurity are especially interesting. This article talks about one of them and why it may matter again.

Some History

Wiswesser Line-Formula Chemical Notation (WLN) was the most popular of perhaps a dozen actively used line notation systems during the 1960s and 1970s. Developed by William J. Wiswesser over a period of many years starting in the 1940s, WLN contains a surprising number of modern ideas about chemistry and information. At one point a serious contender for the position now held by IUPAC nomenclature, WLN has become so obscure that few chemists have even heard of it and no modern software can manipulate it. Even finding information on the basic grammar of WLN is difficult: almost all of this documentation is contained in out-of-print books.

A Guide

To my surprise, WLN is both easy to understand and easy to use. As far as canonicalized line notations go, WLN is far easier to comprehend than either InChI or Canonical SMILES. Even more surprisingly, WLN actually meets more than a few of the requirements for the ideal line notation for the Web. I was always struck by claims that high school graduates with little chemistry background could be trained to encode WLN in a few weeks; this now seems very plausible.

My guide is Elbert Smith's short 1968 book The Wiswesser Line-Formula Chemical Notation. I was able to pick up a used copy in excellent condition for under $30.00 from Amazon.

Some Examples

Functional groups, carbon chains, and rings play central roles in WLN. Unlike modern line notations that emphasize atoms, WLN is designed to mirror the way that chemists actually think about chemistry.

Consider acetone:

1V1

The two "1"s stand for saturated one-carbon chains, i.e. methyl groups. The "V" stands for a carbon doubly-bonded to oxygen.

Given nothing more than the above example, the encoding of diethyl ether should be completely clear:

2O2

"O" simply stands for a divalent oxygen atom.

The benzene ring is one of the most ubiquitous functional groups in organic chemistry. Wiswesser knew this and wanted to make it easy to encode aromatic compounds. His solution is simplicity itself. Consider acetophenone:

1VR

The "R" stands for a benzene ring. WLN canonicalization gives it the lowest priority and this is why it appears last.

What about disubstituted aromatics? Consider 4-chloroacetophenone:

GR DV1

The "G" symbol stands for chlorine. The " DV1" stands for the 4-acyl substituent. Here, the "D" denotes the 4-position. The 3-position would result in " CV1", and the 2-position would give " BV1". The space character means that the character following it should be interpreted as a ring locant.
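The locant rule is simple enough to sketch in a few lines of Python. This is only an illustration of the single-benzene-ring pattern shown above, not a WLN parser; the function names are my own, and the A=1, B=2, C=3, D=4 mapping is extrapolated from the three locants given in the text.

```python
# Illustrative sketch only: handles strings shaped like "GR DV1".
# Real WLN has many more rules than this covers.

def locant_to_position(letter: str) -> int:
    """Map a ring locant letter to a ring position (A=1, B=2, C=3, D=4)."""
    return ord(letter.upper()) - ord("A") + 1


def split_benzene_wln(wln: str):
    """Split a string like 'GR DV1' into its head and located substituents."""
    head, *rest = wln.split(" ")
    # Each space-separated piece starts with a locant letter,
    # followed by the symbols of the substituent itself.
    substituents = [(locant_to_position(piece[0]), piece[1:]) for piece in rest]
    return head, substituents


print(split_benzene_wln("GR DV1"))  # ('GR', [(4, 'V1')])
print(split_benzene_wln("GR BV1"))  # ('GR', [(2, 'V1')])
```

The head "GR" is the chlorine attached to the ring, and each remaining piece pairs a ring position with the group found there.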

WLN uses a very simple system of canonicalization based on alphanumeric order. Priority increases in the direction: (1) symbols; (2) numbers in numerical order; and (3) letters in alphabetical order (with the exception of R which has lower priority than symbols). Coding generally begins at the substituent assigned the highest priority. This explains why 4-chloroacetophenone is not coded as "1VR DG".
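As a rough sketch, the ordering rule can be written as a sort key. The code below encodes only what is stated above (R lowest, then symbols, then numbers, then letters). How symbols rank among themselves is not covered here, so ordering them by character code is purely my assumption, as are the function names.

```python
def symbol_rank(ch: str):
    """Return a sortable rank for one WLN character; larger means higher priority."""
    if ch == "R":
        return (0, 0)        # benzene ring: the stated exception, lowest of all
    if not ch.isalnum():
        return (1, ord(ch))  # symbols (their internal order is an assumption)
    if ch.isdigit():
        return (2, int(ch))  # numbers, in numerical order
    return (3, ord(ch))      # letters, in alphabetical order


def pick_start(candidates):
    """Choose the starting character with the highest priority."""
    return max(candidates, key=symbol_rank)


# 4-Chloroacetophenone could start at the methyl end ("1") or the chlorine end ("G").
print(pick_start(["1", "G"]))  # G
```

Because "G" is a letter and "1" is a number, the chlorine end wins, which is why the string begins "GR" rather than "1VR".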

Advantages of WLN

WLN is remarkably compact, especially when compared to SMILES and InChI. For example, consider the InChI for 4-chloroacetophenone, which is roughly eight times as long as the corresponding six-character WLN:

InChI=1/C8H7ClO/c1-6(10)7-2-4-8(9)5-3-7/h2-5H,1H3
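The size difference is easy to check with a couple of lines of Python, using the two strings exactly as they appear in this article:

```python
wln = "GR DV1"
inchi = "InChI=1/C8H7ClO/c1-6(10)7-2-4-8(9)5-3-7/h2-5H,1H3"

# Ratio of string lengths: how many times longer the InChI is than the WLN.
ratio = len(inchi) / len(wln)

print(len(wln), len(inchi), round(ratio, 1))
```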

Additionally, it's readily apparent to a human observer when a WLN is not properly coded - after all, the language was designed to be both read and written by humans rather than machines. Anyone can look at "GR DV1" and deduce almost instantly that it contains a carbonyl group (V), a phenyl group (R), a chloro group (G), and a methyl group (1).

And if this functional group recognition is easy for humans, it's orders of magnitude easier for machines. It's not difficult at all to imagine very sophisticated and fast molecular query systems that do nothing more than simple processing of the ASCII text contained within WLN strings.
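To make the idea concrete, here is a minimal sketch of that kind of text processing. It knows only the handful of symbols introduced in this article, treats each digit as its own carbon chain, and silently ignores anything else, so it is a toy rather than a WLN reader. All names in it are my own.

```python
# Only the symbols discussed in this article; real WLN defines many more.
WLN_SYMBOLS = {
    "V": "carbonyl group",
    "R": "benzene ring",
    "G": "chlorine",
    "O": "divalent oxygen",
}


def wln_tokens(wln: str):
    """Yield the characters of a WLN string, skipping spaces and ring locants."""
    skip_next = False
    for ch in wln:
        if ch == " ":
            skip_next = True   # the character after a space is a ring locant
        elif skip_next:
            skip_next = False
        else:
            yield ch


def describe(wln: str):
    """List the groups recognized in a WLN string, in the order they appear."""
    found = []
    for ch in wln_tokens(wln):
        if ch.isdigit():
            found.append(f"{ch}-carbon chain")
        elif ch in WLN_SYMBOLS:
            found.append(WLN_SYMBOLS[ch])
    return found


def has_symbol(wln: str, symbol: str) -> bool:
    """A crude structure query: does this WLN contain the given symbol?"""
    return symbol in wln_tokens(wln)


print(describe("GR DV1"))
# ['chlorine', 'benzene ring', 'carbonyl group', '1-carbon chain']
print([w for w in ["1V1", "2O2", "1VR", "GR DV1"] if has_symbol(w, "V")])
# ['1V1', '1VR', 'GR DV1']
```

Skipping the locant matters: without it, a locant such as "G" in some other string would be mistaken for chlorine.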

Conclusions

It's very unlikely that WLN will ever be resurrected for the purpose of replacing existing line notations. On the other hand, WLN offers many potentially useful concepts for those creating new line notations. As they say, history doesn't repeat itself, but it frequently rhymes.

My InChI Runneth Over

Posted by Rich Apodaca Thu, 17 May 2007 08:59:00 GMT

The only solution to this problem I've found is to set the CSS overflow property to "scroll":

InChI=1/C50H70O14/c1-25(24-51)14-28-17-37(52)50(8)41(54-28)19-33-34(61-50)18-32-29(55-33)10-9-12-46(4)42(58-32)23-49(7)40(62-46)21-39-47(5,64-49)13-11-30-44(60-39)26(2)15-31-36(56-30)22-48(6)38(57-31)20-35-45(63-48)27(3)16-43(53)59-35/h9-10,16,24,26,28-42,44-45,52H,1,11-15,17-23H2,2-8H3/b10-9-

Strings and Things

Posted by Rich Apodaca Wed, 25 Apr 2007 09:28:00 GMT

I ran across John Bradshaw's excellent presentation Strings and Things. Part historical overview, part explanation of the SMILES/SMARTS line notation systems, Bradshaw's slides are chock full of interesting tidbits.

My favorite: slide 29 - "Line notations are dead." It's a wonderful illustration of why predicting the future of technology is so tricky. The light pen became the mouse, the computer display became color, and Digital fell off a cliff. SMILES and SMARTS are the only things to have survived.

Rethinking the Command Line for Chemistry

Posted by Rich Apodaca Tue, 27 Mar 2007 12:30:00 GMT



A recent article discussed the renaissance of the command line. Particularly on the Web, command line interfaces have become so advanced that most of us don't even realize we're using them. Consider the Google search box, which is nothing more than one of the most powerful command line interfaces ever developed.

A service called YubNub takes this idea one step further. YubNub is a meta command line interface for the Web. The following YubNub command will do a Flickr search for benzene.

If this were all YubNub did, it would be merely interesting. What makes YubNub remarkable is that you can create your own commands that other people can use. I recently added the "ginchi" command to query Google for an InChI. Now you can try it out:

By itself this isn't particularly useful because you can just go to Google and query the InChI directly. However, it's not too hard to imagine several commands like ginchi that could be added. Some would use Google, others would use other services. How about something that searches Mitch Garcia's chemistry journal Yahoo pipe? It would be very convenient to have all of those commands accessible from the same Web page.
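The idea of a shared table of commands can be sketched in a few lines of Python. This is not how YubNub is implemented; it is a hypothetical dispatcher in which each command name maps to a URL template. The URL templates, the "flickr" command name, and the function name are assumptions made for illustration.

```python
from urllib.parse import quote_plus

# Hypothetical command table: command name -> URL template (assumed URLs).
COMMANDS = {
    "ginchi": "https://www.google.com/search?q={query}",
    "flickr": "https://www.flickr.com/search/?text={query}",
}


def expand(command_line: str) -> str:
    """Turn 'name argument...' into the URL that the command stands for."""
    name, _, argument = command_line.strip().partition(" ")
    if name not in COMMANDS:
        raise ValueError(f"unknown command: {name}")
    # Percent-encode the argument so characters like "=" and "/" survive in a URL.
    return COMMANDS[name].format(query=quote_plus(argument))


print(expand("flickr benzene"))
# https://www.flickr.com/search/?text=benzene
print(expand("ginchi InChI=1/C8H7ClO/c1-6(10)7-2-4-8(9)5-3-7/h2-5H,1H3"))
```

Adding a new chemistry command would then be a matter of adding one more entry to the table.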

Command line interfaces can be phenomenally useful for both beginning and advanced users. The hardest part to get right is not what the user sees as they type, but what happens after they hit the enter key.

Line notations are the perfect match for command line interfaces. The widespread use of SMILES and the precision of InChI offer many possibilities for innovative chemistry Web services.

Eleven Qualities of The Perfect Line Notation for the Web

Posted by Rich Apodaca Wed, 14 Mar 2007 10:18:00 GMT

If you had to design the perfect line notation for the Web, what would it look like? This is hardly an academic exercise given the central role played by line notations in information systems. For a variety of reasons, existing line notations may not be the right match for the Web. This article explores this question and outlines the main qualities needed by a Web-friendly line notation.

A Few Lines About Line Notations

A line notation is any system that converts a molecular structure into a single line of text. Chemists have been using line notations for over 140 years - long before the advent of computers. Because of their versatility, line notations are frequently used in situations they were not designed for. When this happens, limitations become apparent, resulting in renewed efforts to build a better system.

As noted previously, the invention of new line notations is a field whose popularity ebbs and flows over time. Currently, the three most important line notations are:

  • IUPAC Nomenclature
  • Simplified Molecular Input Line Entry System (SMILES)
  • IUPAC International Chemical Identifier (InChI)

Each of these systems has its own unique characteristics. IUPAC nomenclature is the oldest and most widely used line notation. It appears in numerous contexts, including Web pages, peer-reviewed journals, reports, patents, MSDSs, catalogs, and reagent bottles. By comparison, SMILES is a distant second in popularity. Its main role has been to facilitate machine entry of structural information by humans. InChI is the newest of the bunch. It serves both as a line notation and as a unique identifier requiring no central authority.

The Perfect Line Notation for the Web

The emergence of the Web as a standard information delivery platform has refocused the attention of many developers on the line notation problem. With this idea in mind, here are some guesses about the qualities of the ideal Web-friendly line notation.

  1. Readily Encodable and Decodable by Humans. There's something unnerving about a line notation that can't easily be deciphered by humans. Is this really the right string? Did I copy it completely? This problem surfaces with every line notation, but some fare better than others. IUPAC nomenclature, for example, is one of the first things taught in many beginning organic chemistry classes. It's complicated, but still understandable by non-experts.

  2. Readily Encodable and Decodable by Machines. It may be relatively simple for humans to read and write IUPAC nomenclature, but not so for machines. Software that reads and writes SMILES, on the other hand, is by comparison easy to write. This explains the abundance of software packages that handle SMILES and the scarcity of those that handle IUPAC nomenclature.

  3. Uses URI-Safe Characters Only. A URI uniquely identifies every document on the Internet. Why can't a line notation be used in combination with a URI to uniquely identify every molecule? One reason is that every line notation currently in use contains characters unsafe for use in URIs. Any line notation designed for use on the Web needs to avoid these characters in its syntax. Update: InChI doesn't use unsafe characters, but it does use the reserved characters "=", "?", and "/". These characters may therefore need to be escaped, depending on the context.

  4. Encodes All Molecules. Buried within every line notation is an opinion on what chemistry is really about. To operate on the Web, these opinions need to be as closely aligned as possible with those of chemists themselves. Several Depth-First articles have discussed the limitations of existing line notations as molecular languages.

  5. Compact. Nobody wants to look at or manipulate a line of text that's longer than it needs to be. Of course, the more expressive a line notation is, the more verbose it will be. In other words, qualities 4 and 5 will always be in conflict.

  6. Canonicalizable. A line notation supports canonicalization when it specifies rules that can be guaranteed to always generate the same line notation for a given molecule. This feature enables many labor-saving assumptions. For example, a canonical representation makes a great identifier in a database, reducing the cost of storing and retrieving structural information.

  7. Explicit Hydrogen Atom Encoding. SMILES makes few requirements regarding hydrogen atom encoding. As a result, each software implementation is left to its own devices. The resulting confusion is the price paid for the convenience (Quality 1) of a compact notation (Quality 5).

  8. Hierarchical Structure. One of InChI's innovations was the introduction of a hierarchical encoding system. This system, also referred to as InChI "layers", enables a molecule to be viewed at several levels of resolution: as a molecular formula; as a network of atoms; as a network of atoms containing hydrogen atoms; as an atomic network with stereochemistry; and so on. I'm unaware of any reports in which this feature has been exploited in a practical way, although they aren't difficult to imagine.

  9. Flat Structure. By grouping structural features into layers (Quality 8), InChI introduces a lot of complexity that is absent in SMILES and even IUPAC nomenclature. This complexity, in part, makes it difficult for both humans and machines to properly encode InChIs (Qualities 1 and 2). Given this complexity, and the fact that the utility of hierarchical encoding has yet to be conclusively demonstrated, it may be better to avoid it.

  10. Open Source Software Implementation. No encoding standard in today's world stands a chance of gaining acceptance without an open source reference implementation. InChI broke new ground in this area and should serve as a model for any system that follows.

  11. Unencumbered by Patents. The success of molfile and SMILES as de facto standards derives partly from the decision made by their authors to refrain from patenting their languages. As a result, developers are motivated to build their own implementations, rather than invent yet another language.
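Two of these qualities are easy to illustrate with the InChI for 4-chloroacetophenone. The sketch below uses Python's standard urllib.parse module to percent-encode the string for use inside a URI (Quality 3) and then splits it into its slash-separated layers (Quality 8). The function names and the dictionary keys "version" and "formula" are my own choices, and the layer splitter assumes a simple InChI in which no layer prefix letter occurs twice.

```python
from urllib.parse import quote, unquote

INCHI = "InChI=1/C8H7ClO/c1-6(10)7-2-4-8(9)5-3-7/h2-5H,1H3"


def to_uri_component(notation: str) -> str:
    """Percent-encode a line notation so it can sit inside a URI path or query."""
    # safe="" makes quote() escape "/" as well, which it leaves alone by default.
    return quote(notation, safe="")


def inchi_layers(inchi: str) -> dict:
    """Split a simple InChI into layers keyed by their one-letter prefix."""
    version, formula, *rest = inchi.split("/")
    layers = {"version": version, "formula": formula}
    for layer in rest:
        layers[layer[0]] = layer[1:]  # e.g. "c..." is connectivity, "h..." hydrogens
    return layers


escaped = to_uri_component(INCHI)
print(escaped)
print(unquote(escaped) == INCHI)  # True: the escaping is reversible
print(inchi_layers(INCHI))
```

The escaped form is noticeably longer and harder to read than the original, which shows how Qualities 1, 3, and 5 pull against one another.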

Conclusions

A robust and modern line notation system is a key technology for chemically enabling the Web. Existing line notations, although useful in many contexts, were not designed with this particular role in mind. The time has come to consider whether a new line notation system, designed specifically with the Web and modern chemistry in mind, might offer a better solution.

Photo credit: Wenwen - Flickr
