Virtual Hydrogens

What's the difference between computational chemistry and cheminformatics? Computational chemists encode all the hydrogens. This tongue-in-cheek statement, whose origins I can't track, isn't far off the mark.

Implicit Hydrogen

The practice of ignoring hydrogens to emphasize the carbon framework has a long tradition in organic chemistry. As far back as 1872, Kekulé himself used the convention when discussing alternative structures of benzene:

[Figure: Kekulé and hydrogen suppression. Including the hydrogens in these diagrams would have distracted from the task at hand: understanding the carbon framework.]

Over the next 100 years or so, authors experimented with hydrogen suppression on and off. By the 1970s, most chemists had standardized on the hydrogen-suppressed form now in common use. In particular, hydrogens are only added when the count is ambiguous or in certain special cases such as carbonyl carbons.

As the field of cheminformatics gathered steam in the 1960s, it naturally gravitated toward hydrogen-suppressed representations. Given the severe resource constraints at the time, however, the earliest systems were based not on individual atoms, but on collections of them. The best-known example from this period is Wiswesser Line Notation (WLN), which is based on functional groups.

Eventually it became feasible to represent each atom individually. But rather than elevating hydrogens to first-class status, the new systems being developed after the 1970s left hydrogens in the economy section, just like the structure drawings being published by organic chemists.

Somewhere along the line, the term "implicit hydrogen" was adopted. The Merriam-Webster dictionary defines "implicit" as: "capable of being understood from something else though unexpressed."

In other words, the presence of an implicit hydrogen is deduced, somehow. And this is where the trouble starts. The slightest ambiguity or inconsistency in the rules for implicit hydrogen assignment can lead to discrepancies in fundamental properties such as molecular weight and formula, not to mention higher-level properties including hybridization, isomorphism testing, and clustering behavior. Such discrepancies are not uncommon. Given how hard they are to detect, I suspect the problem is, if anything, under-reported.

Implicit Hydrogen in Practice

Language influences how we think. From this perspective, the overuse of the term "implicit hydrogen" has led to many problems. An implicit property can never be explicitly encoded or it would be… explicit. Those wanting to avoid ambiguity by explicitly encoding a hydrogen count onto each atom, whether in a file format or a software representation, are faced with the migraine-inducing task of explaining "explicit implicit hydrogens."

As another example, consider Open Babel, a workhorse open source cheminformatics toolkit and file conversion utility. Prior to the 3.0 release in late 2019, Open Babel did not allow for the explicit encoding of a hydrogen count:

With OB 3.0, the number of implicit hydrogens is stored as a property of the atom. This value can be interrogated and set independently of any other property of the atom. This is how other mature cheminformatics toolkits handle implicit hydrogens. In contrast, in OB 2.x this was a derived property worked out from valence rules and some additional flags set on an atom to indicate non-standard valency.

The same phenomenon shows up in encoding formats. Consider Molfile and SMILES, de facto industry standards.

The molfile format is incapable of explicitly capturing a hydrogen count. Instead, a set of valence rules is used to deduce it. These rules are not codified within the CTfile specification as might be expected. Instead, they appear in a separate document, the BIOVIA Chemical Representation Guide.

Although not well-known, a weak form of explicit implicit hydrogen support is available in V2000 molfiles. Setting the "valence" atom property overrides the default valence. From the Representation Guide:

You can use explicit valence to assign a non-default valence to an atom, and thereby control the number of implicit hydrogens at an atom. For example:

Use explicit valence to specify implicit hydrogens on-metal hydrides. By default, the valence for a metal is set to the number of bonds that are attached to non-hydrogen atoms. If you assign an explicit valence, however, the atom has implicit hydrogens at unfilled valences. ..

In other words, the implicit hydrogen count of an atom can be set indirectly with valence. This property interacts in subtle ways with other properties such as charge and radical, complicating implementation.

SMILES offers users the option of either setting hydrogen counts explicitly (inside brackets) or allowing them to be computed through a set of rules (everywhere else). A sometimes overlooked detail is that using a bracket around an atom requires the hydrogen count property to be set. Failure to do so implies a count of zero.
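The bracket rule can be illustrated with a minimal sketch. It assumes a simplified bracket-atom grammar (isotope, symbol, `@`/`@@` chirality, hydrogen count, charge, atom class) and ignores extended chirality tags; the function name `bracket_hydrogen_counts` is made up for this example and does not come from any toolkit.

```python
import re

# Simplified bracket-atom pattern (an assumption of this sketch):
# isotope, symbol, @/@@ chirality, hydrogen count, charge, atom class.
BRACKET_ATOM = re.compile(
    r"\[(?P<isotope>\d+)?"
    r"(?P<symbol>[A-Z][a-z]?|se|as|[bcnops]|\*)"
    r"(?P<chiral>@{1,2})?"
    r"(?P<hcount>H\d*)?"
    r"(?P<charge>\+\d+|-\d+|\++|-+)?"
    r"(?::(?P<cls>\d+))?\]"
)


def bracket_hydrogen_counts(smiles):
    """Return (symbol, hydrogen count) for each bracket atom in a SMILES string."""
    result = []
    for match in BRACKET_ATOM.finditer(smiles):
        h = match.group("hcount")
        if h is None:
            count = 0          # no H field inside brackets means zero hydrogens
        elif h == "H":
            count = 1          # a bare "H" means exactly one
        else:
            count = int(h[1:])  # "H3" means three
        result.append((match.group("symbol"), count))
    return result
```

Under these assumptions, `bracket_hydrogen_counts("[NH4+]")` yields a count of four for nitrogen, while `bracket_hydrogen_counts("[N]")` yields zero: inside brackets nothing is deduced from valence.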

To make things fun, SMILES and Molfile have different valence rules. For example, SMILES supports a valence of 5 for nitrogen but Molfile does not. This means that SMILES and Molfile encodings of what appears to be the same molecule with 5-valent nitrogen may lead to different hydrogen counts. However, it's also possible that the software you're using incorrectly applies a single valence model to both formats.
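A small sketch shows how two valence models can disagree. The first table lists the normal valences of the SMILES organic subset as given in the OpenSMILES specification. The second, `NO_PENTAVALENT_N`, is a hypothetical stand-in for a model without 5-valent nitrogen; it is not the actual BIOVIA table, and the fallback of zero hydrogens when no valence fits is likewise an assumption.

```python
# Normal valences for the SMILES organic subset (per the OpenSMILES specification).
SMILES_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,), "P": (3, 5),
    "S": (2, 4, 6), "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}

# Hypothetical alternative model in which nitrogen has no valence of 5.
# Illustrative only: this is NOT the real Molfile valence table.
NO_PENTAVALENT_N = dict(SMILES_VALENCES, N=(3,))


def implicit_hydrogens(element, bond_order_sum, valences):
    """Deduce a hydrogen count: fill up to the smallest valence that fits."""
    for valence in valences[element]:
        if valence >= bond_order_sum:
            return valence - bond_order_sum
    return 0  # bond order sum exceeds every listed valence (assumed fallback)
```

For a neutral nitrogen with a bond order sum of 4, `implicit_hydrogens("N", 4, SMILES_VALENCES)` deduces one hydrogen while `implicit_hydrogens("N", 4, NO_PENTAVALENT_N)` deduces none. Same connectivity, different molecular formula.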

Rules-based implicit hydrogen assignment allows for more compact representation, but this comes at a price. Developers must both interpret and implement the rules with complete fidelity. For examples of the difficulties these requirements can lead to, see NextMove Software's SMILES Benchmark.

Virtual Hydrogen

Can we do better? I think so. I'd like to propose the term "virtual hydrogen."

A virtual hydrogen is a non-negative integer property associated with an atom. This number represents the virtual hydrogen count. An atom can mix virtual hydrogen and atomic hydrogen representations. To convert an atomic hydrogen, delete it and increment the virtual hydrogen count of its neighbor by one. To convert a virtual hydrogen, decrement the count and add an atomic hydrogen bonded to the hosting atom. Multivalent hydrogens, as in boranes for example, are not eligible for conversion. Also ineligible are atomic hydrogens whose removal would destroy necessary stereochemical information.
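The two conversions can be sketched on a toy molecular graph. Everything here is hypothetical scaffolding for illustration: the `Atom` and `Molecule` classes, the `keep_for_stereo` flag (a stand-in for real stereochemistry perception), and the function names. Bond orders and isotopes are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Atom:
    element: str
    virtual_h: int = 0             # virtual hydrogen count, never negative
    keep_for_stereo: bool = False  # hypothetical flag: removal would lose stereo info


@dataclass
class Molecule:
    atoms: dict = field(default_factory=dict)  # atom id -> Atom
    bonds: set = field(default_factory=set)    # frozenset({id_a, id_b})
    next_id: int = 0

    def add_atom(self, element, virtual_h=0):
        atom_id = self.next_id
        self.atoms[atom_id] = Atom(element, virtual_h)
        self.next_id += 1
        return atom_id

    def add_bond(self, a, b):
        self.bonds.add(frozenset((a, b)))

    def neighbors(self, atom_id):
        return sorted(next(iter(b - {atom_id})) for b in self.bonds if atom_id in b)


def to_virtual(mol, h_id):
    """Delete an eligible atomic hydrogen and increment its neighbor's count."""
    atom = mol.atoms[h_id]
    nbrs = mol.neighbors(h_id)
    if atom.element != "H" or atom.virtual_h != 0:
        raise ValueError("not a plain atomic hydrogen")
    if len(nbrs) != 1:
        raise ValueError("multivalent or isolated hydrogen is not eligible")
    if atom.keep_for_stereo:
        raise ValueError("removal would destroy stereochemical information")
    host = nbrs[0]
    mol.bonds.discard(frozenset((h_id, host)))
    del mol.atoms[h_id]
    mol.atoms[host].virtual_h += 1
    return host


def to_atomic(mol, host_id):
    """Decrement the host's virtual hydrogen count and attach an atomic hydrogen."""
    host = mol.atoms[host_id]
    if host.virtual_h == 0:
        raise ValueError("no virtual hydrogen to convert")
    host.virtual_h -= 1
    h_id = mol.add_atom("H")
    mol.add_bond(host_id, h_id)
    return h_id
```

Applied to methane built as a carbon with a virtual hydrogen count of 3 plus one atomic hydrogen, `to_virtual` leaves a single carbon with a count of 4, and `to_atomic` reverses the step. A bridging hydrogen with two neighbors is rejected.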

Merriam-Webster defines "virtual" as: "being such in essence or effect though not formally recognized or admitted." The word "virtual" is also used extensively in computer science, as reflected by such terms as "virtual keyboard," "virtual reality," and "virtual memory." In these contexts, the word assumes a meaning similar to the Macmillan Dictionary definition: "almost the same as the thing that is mentioned." A virtual hydrogen is almost the same thing as an atomic hydrogen.

There's no linguistic or logical barrier to encoding "virtual hydrogens" explicitly. Indeed, they should be explicitly encoded as an integer field closely associated with every hosting atom.

The concept of virtual hydrogens allows us to talk about hydrogen suppression more precisely. For example, the mandatory hydrogen in an aromatic SMILES for pyrrole (i.e., [nH]1cccc1) is virtual. Those hydrogens deduced to be present in propane (CCC) are implicit. Encoding a hydrogen as a standalone atom, as in [CH3][H], makes it atomic (note the corresponding deduction from the hosting carbon's virtual hydrogen count). Molfile supports no form of virtual hydrogen, but it does support implicit hydrogens and a method to override the default valence rules. InChI abandons implicit hydrogens altogether. Virtual hydrogens are instead encoded for every atom within a dedicated layer. The 3.0 release of Open Babel brings virtual hydrogen support to a system that previously lacked it.

Implicit hydrogen counts are deduced from a set of published valence rules whereas virtual hydrogens are explicitly represented as an integer count. Atomic hydrogen is represented as a proper atom like carbon itself. We can now speak about hydrogen suppression with high precision.

Conclusion

Hydrogen is ubiquitous in organic molecules. Organic chemists discovered long ago that drawing every hydrogen leads to noise. Cheminformatics inherited both hydrogen-suppressed drawings and representation formats born in relative computational poverty. Somehow we've muddled along, using the term "implicit hydrogens" in the most inappropriate places. The result has been garbled communication and inconsistency.

Virtual hydrogens supply the missing piece. They are easy to spot as an explicit integer hydrogen count associated with a specific atom. Virtual and atomic hydrogens can be mixed freely, provided that desirable information isn't destroyed in the conversion. Unlike "implicit hydrogens," no mental gymnastics are required to talk about explicitly encoded virtual hydrogens. As such I believe the term "virtual hydrogen" can clear out several cobwebs in cheminformatics.