Toward a Standard InChI Formal Grammar
InChI is widely viewed as merely an identifier, not a serialization format, but a recent article challenged that notion. Although the idea may be discouraged in some quarters, it is possible to use InChI to read and write molecular structure. Tools for reliably doing so could lead to new applications. One missing component is a formal description of the InChI's syntax. This article takes the first few steps toward its creation.
Grammar Review
A useful tool for expressing language syntax is formal grammar (aka "grammar"). According to Wikipedia, a grammar "describes how to form strings from a language's alphabet that are valid according to the language's syntax." A language's formal grammar description can take many forms, with one of the most popular being extended Backus–Naur form (EBNF). One of EBNF's advantages is human and machine readability, making it a suitable starting point for hand-crafted and automatically-generated parsers. For more on EBNF, and its use in parser construction, see A Beginner's Guide to Parsing in Rust (Rust experience not needed).
As an aside, you may wonder why regular expressions aren't used. The simple answer is that they lack the capability to precisely and efficiently capture the grammar of a language like the one used by InChI. For a discussion, see What is the correct regular expression for InChI?.
As with many things related to language, the view becomes murky past a point. In the case of EBNF, one force of chaos is the proliferation of "EBNF-like" grammar languages. It seems like every parser generator defines its own language for expressing grammars, forcing the user to abandon the idea of portability and commit to that particular generator. Fortunately, one variant stands out for its expressiveness and wide adoption: W3C EBNF Notation. This form will be used here as well.
Constructing a Formal Grammar
To my knowledge, no InChI formal grammar has ever been published in any form, although I was encouraged to create one many years ago. The task is aided by the availability of detailed documentation and the free distribution of the InChI source code. These sources were distilled into a working grammar using the following approach:
- Read the InChI documentation.
- Formulate a hypothesis about grammar.
- Reduce the hypothesis to one or more production rules.
- Test the production rules against valid strings, which can conveniently be generated by InChI Wasm.
Authoritative sources of InChI documentation include:
- The InChI 1.06 Technical manual, available as a documentation download. It provides many details around InChI syntax and semantics.
- InChI, the IUPAC International Chemical Identifier. A paper that discusses the design of InChI.
A test harness is crucial for work like this because it's easy to introduce regressions when updating a grammar. This problem can be addressed through the use of a test harness, run every time the grammar is edited. Although several candidates might be deployed for this purpose, the one I found best-suited was Ruby EBNF. Unlike other options, it accepts W3C EBNF directly, with no reformatting required. It also accepts long strings, which failed to parse with another tool I've previously used.
InChI syntax is not trivial. To limit the scope of this effort in its first iteration, I set the goal of parsing just the first four standard InChI layers: prefix; formula; connections; and h_atoms. The resulting grammar should be suitable for parsers that recognize strings representing many neutral, natural isotopic abundance, non-metallic, stereocenter- and conformer-free molecules.
The Grammar
Using the approach outlined above, a partial InChI grammar was constructed. If you're interested in following this work in progress, consider setting up a notification on the GitHub repository. Portions of the grammar will be discussed below to give a feel for how it works.
String ::= prefix formula? connections? h_atoms?
The topmost production rule, String
sets the stage by defining four layers — the first mandatory and the remaining three optional. The name of the production rule appears to the left of the assignment operator (::=
), with the rule itself appearing to the right. The right hand side is expressed in terms of terminals (character literals) and non-terminals (the name of a production rule). This first rule for String
uses only non-terminals. Their definitions follow.
formula ::= "/" elemental ( "." elemental )*
connections ::= "/c" graph? ( ";" graph? )*
h_atoms ::= "/h" hydrogens? ( ";" hydrogens? )*
These rules are defined in terms of terminals and non-terminals. All three support disconnection via dot or semicolon characters (.
or ;
, respectively). The h_atoms
production rule supports both virtual hydrogens and InChI's concept of "mobile hydrogens." All of these nonterminals are defined later.
Perhaps surprisingly, most of this grammar's rules are dedicated to the definition of the elemental
production rule.
elemental ::= count? ( c h? b? br? cl? f? rest_elements )
| ( b? br? cl? f? h? rest_elements )
rest_elements ::= i? mg? n? o? p? s?
b ::= "B" count?
c ::= "C" count?
h ::= "H" count?
br ::= "Br" count?
cl ::= "Cl" count?
f ::= "F" count?
i ::= "I" count?
mg ::= "Mg" count?
n ::= "N" count?
o ::= "O" count?
p ::= "P" count?
s ::= "S" count?
/* and so on... */
This elaborate setup is unfortunately the only way to both ensure that the elemental
non-terminal contains only chemical element symbols and that these elements and their counts appear in Hill order.
The remainder of the grammar follows a similar pattern. Any non-terminals introduced by one rule are defined in subsequent rules. By the last line of the grammar, all production rules have been defined in terms of terminals. The result is suitable for use when reading and writing InChI strings.
Testing the Grammar
The full grammar can be tested by cloning the project's source repository and following the instructions on the README page. Given a Ruby installation and the cloned repository, the tests can be run like so:
cd test
bundle install --path vendor/bundle
ruby test.rb
The harness emits the text "PASS" if all strings are parsed without error, or "FAIL" if one or more strings fail to parse. The test suite contains every InChI included in the Technical Manual (truncated where appropriate), as well as some more complex examples found through this Twitter thread and other sources. A case could be made for adding invalid InChI strings to the test suite to confirm they are not parsed. After all, a grammar that accepted all strings would make the test suite pass just as well.
Conclusion
A formal grammar provides a solid foundation on which to build readers and writers for the strings of a language such as InChI. Although no InChI formal grammar has ever been published, a proof-of-concept was developed from existing documentation and reverse engineering the output of the InChI software. A full grammar will require more work, but the process described here should be suitable.