Beyond SMILES
Since its first public description in 1988, SMILES has become one of chemistry's most widely used information exchange formats. All major toolkits support it. A lot of user-facing software reads and writes it. Many public databases include it. More recently it's become a popular input/output format for machine learning.
But despite this widespread adoption, SMILES is remarkably under-specified. For example, Weininger's original 1988 paper left many topics as an exercise for the reader. The word "stereochemistry" appears not once. The syntax for this "language" was only described anecdotally. Some semantic considerations such as hydrogen counting and aromaticity were only partially addressed.
Since then, various sources have filled in some gaps. The Daylight Theory Manual introduced stereochemistry and a more explicit valence model. A subsequent publication by Weininger added a partial formal grammar. OpenSMILES produced a more complete, but still limited, grammar, vastly expanded on non-tetrahedral stereochemistry, and addressed a number of other issues. SMILES+, an IUPAC initiative, looked as if it were positioning itself to pick up where OpenSMILES left off, but has so far not done so. These efforts have certainly improved the situation to a degree, but I'm not sure that's enough.
I started to become acutely aware of the limitations of SMILES documentation as I was writing Purr. Purr is a low-level toolkit for reading and writing SMILES. Think of it as a library that a cheminformatics toolkit would use to deal with SMILES. Some of the many problems I encountered made their way into blog posts, including:
- Writing Aromatic SMILES
- Fast Hydrogen Counting in SMILES
- SMILES Formal Grammar Revisited
- Abstract Syntax Trees for SMILES
- Running a SMILES Validation Benchmark
- Hydrogen Suppression in SMILES
- Stereochemistry and Atom Parity in SMILES
- A Comprehensive Treatment of Aromaticity in the SMILES Language
A programmer implementing a specification should not need to make things up. Unfortunately, that's exactly what I found myself doing when writing Purr. All the while I couldn't shake the thought that somewhere, at some time, another programmer had done the same thing, but guessed differently. And in the future, another programmer will make yet a different set of guesses.
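To make the "guessed differently" problem concrete, here is a minimal sketch of implicit hydrogen counting for unbracketed, non-aromatic atoms. The first valence table follows the organic-subset valences listed in the OpenSMILES specification. The second table is a hypothetical alternative guess (nitrogen and phosphorus treated as trivalent only), included purely for illustration and not attributed to any real toolkit. The function and table names are likewise made up for this example and are not part of Purr.

```python
# Normal valences for the organic subset, as tabulated by OpenSMILES.
# Each tuple is sorted in ascending order.
OPENSMILES_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,),
    "P": (3, 5), "S": (2, 4, 6),
    "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}

# Hypothetical alternative guess: N and P are trivalent only.
TRIVALENT_GUESS = {**OPENSMILES_VALENCES, "N": (3,), "P": (3,)}


def implicit_hydrogens(symbol, bond_order_sum, valences):
    """Return the implicit hydrogen count for an unbracketed, non-aromatic atom.

    The atom is filled up to the smallest listed valence that is at least
    the sum of its explicit bond orders. If the bond order sum exceeds
    every listed valence, no hydrogens are added.
    """
    for valence in valences[symbol]:
        if valence >= bond_order_sum:
            return valence - bond_order_sum
    return 0


# The central nitrogen in CN(C)(C)C has four single bonds.
h_opensmiles = implicit_hydrogens("N", 4, OPENSMILES_VALENCES)
h_guess = implicit_hydrogens("N", 4, TRIVALENT_GUESS)
print(h_opensmiles, h_guess)
```

Under these assumptions, the two tables assign different hydrogen counts to the same nitrogen atom, so two readers built on them would disagree about which molecule the string CN(C)(C)C represents.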
This is no way to build an industry standard for data sharing. It's clearly possible to do better, but how to proceed? I can think of three possibilities.
Option One would be to create a new language. Now, I've been down this road before and it's a bumpy ride. A new language may address all kinds of problems, but without users it won't amount to much. Chemistry is already fragmented with respect to software, so throwing yet another "standard" onto the pile hardly seems productive at this point.
Option Two would be to fix the problems with existing SMILES documentation. OpenSMILES presents itself as a community effort focused on documenting SMILES. Unfortunately, involvement has remained low over the years, major points (like this one) have remained unresolved, and the mechanism for adopting nontrivial changes is far from clear. I'm unaware of any software implementation that both claims and demonstrates full OpenSMILES compatibility. But the larger problem is that many of the issues I see are cross-cutting. These aren't issues that can be fixed with a few pull requests.
Option Three combines elements of Options One and Two. Define a language based on SMILES, then document it to the degree needed to build fully compliant readers and writers without guesswork. As a separate language, it can be specified without lugging around any of the baggage that came before (I'm looking at you, "aromaticity" and Hückel). But as a backward-compatible refinement to SMILES, the language would be readable and writable by existing software.
The main problem with Option Three is that there is no "SMILES language" on which to base a backward-compatible effort. There are only a dozen or so SMILES dialects, some of which are better documented and/or used than others. The Daylight implementation itself falls into this category, mainly because of the waning accessibility of the Daylight Toolkit. So backward compatibility would be a tough goal, not due to any technical challenge, but rather to incomplete documentation. One way to address the issue would be to narrow the feature set of the new language to a lowest common denominator. This is a game of subtraction, not addition. Subtracting features can be far more difficult than adding them because each one has at least one user on the other end who can't bear to lose it.
Despite the challenges involved, Option Three appears to offer the most bang for the buck.
So with that introduction, I present Dialect, a work-in-progress molecular language designed to be backward-compatible with SMILES-as-practiced. Dialect will be introduced with a comprehensive paper detailing its design, syntax, semantics, and implementation. That paper is being written in the open on GitHub. As you can see, the manuscript is not yet very far along. Purr will be an open-source reference implementation for Dialect, but hopefully not the only one. If any of this sounds interesting, I invite you to check out what's there, file issues, ask questions, and of course make pull requests.