How Would Your Cheminformatics Tool Do This?

Posted by Rich Apodaca Fri, 30 Nov 2007 16:24:00 GMT

Reference: Fukuzawa, Oki, Hosaka, Sugasawa, and Kikuchi, Org. Lett.

How Would Your Cheminformatics Tool Do This?

Posted by Rich Apodaca Thu, 29 Nov 2007 14:32:00 GMT

Reference: Grasa and Colacot, Org. Lett.

SMILES and Aromaticity: Broken?

Posted by Rich Apodaca Wed, 28 Nov 2007 14:43:00 GMT

Since its introduction in 1988, the Simplified Molecular Input Line Entry System (SMILES) has become one of the most widely used molecular encoding systems in cheminformatics. But all technologies, no matter how widely used, can be improved, and SMILES is no exception. This article, the first in a series, discusses a particularly thorny problem in the SMILES language.

A Little About SMILES

From the beginning, SMILES was a creative response to the complexity of the then-dominant Wiswesser Line Notation. This can be seen perhaps nowhere more clearly than in the introduction to Weininger's seminal paper on SMILES:

SMILES is a chemical notation language specifically designed for computer use by chemists. ... Among several approaches to computerized chemical notation, line notation is popular because it represents molecular structure by a linear string of symbols, similar to natural language. The Wiswesser Line Notation is the most widely used representative of this method. It meets the essential requirements for a deterministic chemical notation, but it is difficult to use because many rules must be followed to generate the correct notation of a complex structure. To overcome this and other difficulties, the SMILES system was designed to be truly computer interactive.

What started out as a way for humans to more easily encode molecular structures has since evolved into a way for computers to encode molecular structures. Several factors are responsible for this shift, the biggest being the emergence of the Graphical User Interface, and with it, the chemical structure editor.

Today, few chemists know how to write SMILES by hand, and understandably, few want to.

But rather than dying out, SMILES found a new niche. Computers in the late '80s were mere toys; storage space was measured in kilobytes, and bandwidth was practically nonexistent. But with a few ASCII characters, the complete connection table of most organic molecules could be encoded by SMILES. Not only this, but the algorithms needed to encode and decode SMILES were easy to reduce to practice in software. Daylight's original implementation of SMILES was soon joined by many others.

A de facto standard was born.

If It Ain't Broke, Don't Fix It

For the last twenty years, SMILES has been used with great success to encode and store molecular structures. In an industry with few standards, SMILES is a rare example that shows what might be possible.

If SMILES has been so successful, then what's broken that needs fixing?

Over the years, a growing list of missing, inconsistent, or confusing aspects of the SMILES language has come to light. One vendor of a SMILES implementation has even cataloged some of them. In most cases, the various implementers of SMILES systems have done the only thing they could do under the circumstances: apply their own judgment and best guesses.

The result has been the gradual introduction of subtle incompatibilities among the SMILES implementations currently in use. This is the problem that the OpenSMILES group aims to address.

This status quo works in an environment of information silos, proprietary code, and closed data. But as cheminformatics moves in the direction of open data and interoperability, the problems become painfully apparent.

Of all the topics that have been discussed so far by the OpenSMILES group, one stands out for its level of interest, number of contributors, strong opinions, and detailed discussion: lower-case atom symbols and aromaticity.

Aromaticity in SMILES

SMILES allows two kinds of atoms to be specified: upper-case and lower-case. Lower case atoms, according to existing documentation, signify 'aromatic' atoms.

Weininger made clear that the reason for introducing lower case atom symbols was to facilitate canonicalization and substructure recognition. From the original paper:

Aromaticity must be detected in a system that generates an unambiguous chemical nomenclature. As will be discussed in following papers, this is needed both for the generation of a unique nomenclature and for effective substructure recognition. There can be no definition of 'aromaticity' that is both rigorous and all-encompassing: the word implies something about 'reactivity' to a synthetic chemist, 'ring current' to a NMR spectroscopist, 'symmetry' to a crystallographer, and presumably 'odor' to the original user of the word. Our objective in defining aromaticity is to provide an automatic and rigorous definition for the purposes of generating an unambiguous chemical nomenclature. Although the SMILES algorithm produces results that most chemists find natural, nothing is implied by this definition about physical properties.

Kekule structures, in which double bonds and single bonds alternate, make it difficult for computers to implement certain kinds of algorithms. Defining lower case atom symbols to remove artificial asymmetry eliminated these problems.

Weininger's original paper then goes on to describe the criteria for aromaticity in the SMILES language. At its core, aromaticity boils down to the following definition:

... To qualify as aromatic, all atoms in the ring must be sp2 hybridized and the number of available 'excess' π electrons must satisfy Hückel's 4n+2 criterion. ...

Seems simple enough, but even in 1988 things were not so clear. For just a few sentences later, Weininger continues:

... Entries of c1ccc1 and c1ccccccc1 will produce the correct antiaromatic structures for cyclobutadiene and cyclooctatetraene, C1=CC=C1 and C1=CC=CC=CC=C1, respectively. ... [emphasis added]

How are we to interpret this? Apparently, c1ccc1 and c1ccccccc1, neither of which obey the 4n+2 rule, are nevertheless valid SMILES. We can even use Daylight's Depict application to verify for ourselves that both c1ccc1 and c1ccccccc1 are read and depicted.

Perhaps the concept of "antiaromaticity" (in contrast to "non-aromaticity") holds a special place in the SMILES language. If so, this distinction has never been clarified.

While puzzling over the apparent contradiction, we later read that:

... For example, quinone is nonaromatic, with only four excess electrons.

Weininger goes on to imply that the only correct way to represent quinone in SMILES is without lower case atom symbols, for example:

O=C1C=CC(=O)C=C1

And still later:

... For example, if one of the benzene ring's electrons is removed to form c1ccc[cH+]1, this ion is not aromatic because there are only five π electrons. ...

Ambiguity makes it impossible to write standardized software: either 4n+2 is the rule for triggering the aromatic flag, and therefore lower case atom symbols, or it is not. If exceptions to this rule are needed, they must be specified in enough detail to be reduced to practice. To my knowledge, no documentation written in 1988 or since then has provided the necessary guidance.
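To make the inconsistency concrete, here is a minimal Python sketch that applies the 4n+2 test literally to the examples quoted above. The function name `satisfies_huckel` is hypothetical. The counts for quinone (4) and the benzene cation (5) come from the quoted passages; the counts for benzene, cyclobutadiene, and cyclooctatetraene are my own assumption of one π electron per sp2 carbon.

```python
def satisfies_huckel(pi_electrons: int) -> bool:
    """Return True when the count equals 4n + 2 for some integer n >= 0."""
    return pi_electrons >= 2 and (pi_electrons - 2) % 4 == 0


# Pi electron counts. Quinone (4) and the benzene cation (5) are the figures
# quoted in the text; the remaining counts are assumptions (one electron per
# sp2 carbon atom in the ring).
EXAMPLES = {
    "benzene": 6,
    "cyclobutadiene": 4,
    "cyclooctatetraene": 8,
    "quinone": 4,
    "benzene cation": 5,
}

verdicts = {name: satisfies_huckel(count) for name, count in EXAMPLES.items()}

for name, passes in verdicts.items():
    print(f"{name}: {EXAMPLES[name]} pi electrons, 4n+2 satisfied: {passes}")
```

Under these assumptions only benzene passes the test, yet the lower-case forms of cyclobutadiene and cyclooctatetraene are described as acceptable input while quinone and the cation are not.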

We can't have it both ways.

More Brokenness

Next, consider some of the examples left out of the original SMILES description. What about oligocyclic aromatics?

Fluorenone, according to the SMILES electron counting rules, has twelve π electrons and is therefore not aromatic. Strictly speaking, a SMILES like this:

O=c2c1ccccc1c3ccccc23

in which the carbonyl carbon is represented with a lower case atom symbol, should be considered invalid. Not just undesirable, but verboten.

Yet Daylight's own Depict program, and other SMILES implementations, treat it as valid.

Despite the lack of an aromatic tricyclic ring system, we may nevertheless want (or need) to represent fluorenone using lower case atom symbols. After all, canonicalization and substructure searches are very difficult otherwise.

So any software we write needs to peel back layers of the tricyclic ring system in a quest for isolated aromatic rings. This exercise is clearly chemically meaningless as all atoms are coplanar and sp2 hybridized, and therefore interact. The counterargument is that the SMILES aromaticity model has no basis in reality - it's just a convention. So we press on.

We eventually end up with a SMILES like this:

O=C2c1ccccc1c3ccccc23

The larger problem is making it clear when a reader or writer is and isn't allowed to perform this peeling back operation in search of aromaticity. Does the above SMILES match the SMILES definition of aromaticity or does it not? Are we allowed to peel back ring systems looking for imaginary 'embedded' aromatic ring systems or are we not?

The answer may exist somewhere, just not in the documentation I have access to.

The pragmatic approach, and the one taken by some implementations, is to simply ignore the whole question, forget about 4n+2, and call everything that 'looks' aromatic, like the fluorenone carbonyl carbon, 'aromatic.'

As another example, consider acenaphthylene:

c1cc2cccc3ccc(c1)c23

Based on the published 4n+2 rules for SMILES aromaticity detection, acenaphthylene's twelve π electrons mean that it can't be represented in the aromatic form. It's not just discouraged - it's not allowed. Yet the Daylight Depict program, and a few other SMILES implementations, will accept this input as valid.

The only way we can take advantage of the symmetrization afforded by lower case atom labels is to go hunting for isolated benzene rings. Upon doing so, we arrive at the following SMILES:

c1cc2C=Cc3cccc(c1)c23

Once again, we've more or less made an arbitrary distinction, assigning one set of carbons as aromatic and the other, fully coplanar, conjugated, and sp2-hybridized set as non-aromatic. Does the SMILES language allow us to do this? Again, the answer may exist somewhere, but not in any material I've been able to find.
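The arbitrariness can be illustrated with a short Python sketch that runs the 4n+2 test on different subsets of the same ring system. Everything here is an assumption made for illustration: atoms are numbered 1 to 12 in the order they appear in c1cc2cccc3ccc(c1)c23, each sp2 carbon is taken to contribute one π electron, and the ring membership was worked out by hand from the ring-closure digits rather than by any SMILES toolkit.

```python
def satisfies_huckel(pi_electrons: int) -> bool:
    """Return True when the count equals 4n + 2 for some integer n >= 0."""
    return pi_electrons >= 2 and (pi_electrons - 2) % 4 == 0


# Assumption: every one of the twelve sp2 carbons contributes one pi electron.
PI_ELECTRONS = {atom: 1 for atom in range(1, 13)}

# Ring membership derived by hand from c1cc2cccc3ccc(c1)c23 (atoms 1..12).
SIX_A = {1, 2, 3, 12, 10, 11}
SIX_B = {3, 4, 5, 6, 7, 12}
FIVE = {7, 8, 9, 10, 12}

CANDIDATES = {
    "whole ring system": SIX_A | SIX_B | FIVE,
    "six-membered ring A": SIX_A,
    "six-membered ring B": SIX_B,
    "five-membered ring": FIVE,
    "rings A + B only": SIX_A | SIX_B,
}


def count_pi(atoms: set) -> int:
    """Sum the pi electron contributions of the given atoms."""
    return sum(PI_ELECTRONS[atom] for atom in atoms)


results = {label: (count_pi(atoms), satisfies_huckel(count_pi(atoms)))
           for label, atoms in CANDIDATES.items()}

for label, (electrons, passes) in results.items():
    print(f"{label}: {electrons} pi electrons, 4n+2 satisfied: {passes}")
```

With these assumptions the whole system (12) and the five-membered ring (5) fail, while each six-membered ring (6) and the two six-membered rings together (10) pass. The verdict depends entirely on which subset of atoms is tested, which is exactly the choice the documentation leaves open.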

To put it simply, where in the SMILES documentation are we informed of which atoms in a coplanar, fully conjugated and sp2 hybridized ring system can be ignored from the 4n+2 test?

For that matter, how do we know that oligocyclic aromatic ring systems are supported at all? Maybe only isolated five- and six-membered rings should be evaluated.

Consider pyrrolopyridine:

c2ccn1cccc1c2

Now let's assume that the SMILES 4n+2 rule can only be applied to individual rings, not ring systems. This prevents us from writing a SMILES like the one shown above because the left-hand pyridine ring has a formal π electron count of 7 - two from each endocyclic double bond, two from the nitrogen atom, and one from the exocyclic double bond.

The best we could do is to write a SMILES like this:

c2cc1C=CC=Cn1c2

The only way we can create an 'aromatic' SMILES for the 4n+2 pyrrolopyridine ring system is to combine the electron counts for both rings.
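The per-ring and whole-system counts for c2ccn1cccc1c2 can be tallied in a few lines of Python. This is only a sketch under stated assumptions: atoms are numbered 1 to 9 in SMILES order (atom 4 is the nitrogen), each carbon contributes one π electron and the nitrogen contributes two, and the ring membership was assigned by hand.

```python
def satisfies_huckel(pi_electrons: int) -> bool:
    """Return True when the count equals 4n + 2 for some integer n >= 0."""
    return pi_electrons >= 2 and (pi_electrons - 2) % 4 == 0


# Assumed contributions: one pi electron per carbon, two for the nitrogen
# (atom 4), which is shared by both rings.
PI_ELECTRONS = {1: 1, 2: 1, 3: 1, 4: 2, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1}

# Ring membership derived by hand from c2ccn1cccc1c2.
SIX_RING = {1, 2, 3, 4, 8, 9}
FIVE_RING = {4, 5, 6, 7, 8}

six_count = sum(PI_ELECTRONS[a] for a in SIX_RING)
five_count = sum(PI_ELECTRONS[a] for a in FIVE_RING)
system_count = sum(PI_ELECTRONS[a] for a in SIX_RING | FIVE_RING)

print("six-membered ring:", six_count, satisfies_huckel(six_count))
print("five-membered ring:", five_count, satisfies_huckel(five_count))
print("whole ring system:", system_count, satisfies_huckel(system_count))
```

Tested ring by ring, the six-membered ring has 7 electrons and fails while the five-membered ring has 6 and passes; only the combined count of 10 lets every atom be written in lower case.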

But Daylight's own Depict system, and I suspect many others, imply that the fully aromatic version of the pyrrolopyridine SMILES is valid.

Once again, we can't have it both ways. If full ring systems need to be perceived and tested for 4n+2 π electrons, then consistency requires it also be done for acenaphthylene, fluorenone, and countless others for which space and time prevent discussion. If particular ring systems are exempt, then the SMILES language documentation should specify in detail how to tell the difference.

Conclusions

Given the problems in combining SMILES' symmetrization capability and lower-case atom symbols with the overloaded concept of aromaticity, one has to wonder - is it worth the trouble? Given the disregard for these rules by working third-party code, by Daylight, and by the original SMILES documentation, how reasonable is it to continue to use 4n+2 as the rule? What does the resulting confusion really buy?

There is a simple way to resolve the issue, but you're probably not going to like it - at least not at first. But that's a story for another time.

ChemWriter, Chemical Structures, and the Web

Posted by Rich Apodaca Tue, 27 Nov 2007 17:09:00 GMT

Of all the components that make up today's cheminformatics systems, the 2D structure editor may be the most widely-used. A 2D structure editor is often a chemist's first and most enduring exposure to cheminformatics, and can be encountered as early as Junior High or High School.

Over time, a good 2D structure editor becomes every bit as important to a chemist as a text editor is to a writer or software developer. At any given ACS organic division symposium, you're likely to find several bench chemists who only casually, if ever, use a 3D molecular modelling program; finding any who don't regularly use a 2D structure editor would be much more challenging.

2D structure editors are ubiquitous. They can be found in one form or another in most cheminformatics systems, ranging from databases, to standalone applications, to property calculators, and even 3D molecular modelling programs.

Despite the importance of structure editors, they don't get much attention among cheminformatics developers. For example, if your bibliography is anything like mine, it contains dozens of papers on molecular descriptors. Yet the number of cheminformatics papers describing the design of ergonomic chemical structure editors is, well, one or maybe two.

About ChemWriter

ChemWriter™ is a new product aimed at making 2D chemical structure editors a lot more interesting, easy to use, and versatile than they have been in the past. Designed specifically as a lightweight, extendable component, ChemWriter is ideal for use in chemically-enabled Web applications.

The second beta version of ChemWriter has recently been released by my company, Metamolecular, LLC. A recent article on the Metamolecular company blog discusses ChemWriter in more detail.

The Structure Editor In-Depth

Because the design and use of 2D chemical structure editors is an unusual subject in cheminformatics, a compilation of articles on the topic from Depth-First and the Metamolecular Web site is provided below. Many of these articles refer to "Firefly", which was ChemWriter's name during early development.

Why the Structure Editor Matters

Creating ChemWriter

Using ChemWriter

Extending ChemWriter

Compiling Open Babel to Pure Java Bytecode with NestedVM: Building A Runnable Classfile that Almost Works

Posted by Rich Apodaca Mon, 26 Nov 2007 15:10:00 GMT

Previously, I described an unsuccessful first attempt to compile the popular cheminformatics C/C++ library Open Babel to pure Java bytecode using NestedVM. This article follows that topic one step further, and shows how to obtain a runnable Java classfile. Although major functionality is missing, the principle of compiling arbitrary C/C++ code to both Java source code and Java bytecode is illustrated.

Getting Started

This article assumes that you've installed NestedVM and downloaded Open Babel on your system. You'll then need to set up your environment (from the nestedvm installation directory):

$ source env.sh

Run the Configure Script

The configure script we used last time didn't attempt to statically compile the binary utilities in the tools directory. This time, we'll add flags to allow this:

$ ./configure --disable-dynamic-modules --enable-static=yes --enable-shared=no --enable-inchi --host=mips-unknown-elf
$ make

Note: leaving out the static compile directives does not produce a fully-functioning classfile either.

Next, we'll attempt to directly create the babel binary in Java classfile format, as we did last time:

$ cd tools
$ java org.ibex.nestedvm.Compiler -outfile Babel.class Babel babel
Exception in thread "main" java.lang.IllegalStateException: unresolved phantom target
        at org.ibex.classgen.MethodGen.resolveTarget(MethodGen.java:555)
        at org.ibex.classgen.MethodGen._generateCode(MethodGen.java:664)
        at org.ibex.classgen.MethodGen.generateCode(MethodGen.java:618)
        at org.ibex.classgen.MethodGen.dump(MethodGen.java:888)
        at org.ibex.classgen.ClassFile._dump(ClassFile.java:193)
        at org.ibex.classgen.ClassFile.dump(ClassFile.java:160)
        at org.ibex.nestedvm.ClassFileCompiler.__go(ClassFileCompiler.java:380)
        at org.ibex.nestedvm.ClassFileCompiler._go(ClassFileCompiler.java:72)
        at org.ibex.nestedvm.Compiler.go(Compiler.java:259)
        at org.ibex.nestedvm.Compiler.main(Compiler.java:183)

We're getting the same error as before. Although an announcement of a bugfix was posted to the NestedVM list, in my hands the new version of NestedVM caused the same error.

As a workaround, we can compile to Java sourcecode first:

$ java org.ibex.nestedvm.Compiler -outformat java -outfile Babel.java Babel babel

We now have a Java source file encoding the babel program. Does it compile?

$ javac Babel.java
The system is out of resources.
Consult the following stack trace for details.
java.lang.OutOfMemoryError: Java heap space
        at com.sun.tools.javac.util.Position$LineMapImpl.build(Position.java:139)
        at com.sun.tools.javac.util.Position.makeLineMap(Position.java:63)
        at com.sun.tools.javac.parser.Scanner.getLineMap(Scanner.java:1105)
        at com.sun.tools.javac.main.JavaCompiler.parse(JavaCompiler.java:512)
        at com.sun.tools.javac.main.JavaCompiler.parse(JavaCompiler.java:550)
        at com.sun.tools.javac.main.JavaCompiler.parseFiles(JavaCompiler.java:801)
        at com.sun.tools.javac.main.JavaCompiler.compile(JavaCompiler.java:727)
        at com.sun.tools.javac.main.Main.compile(Main.java:353)
        at com.sun.tools.javac.main.Main.compile(Main.java:279)
        at com.sun.tools.javac.main.Main.compile(Main.java:270)
        at com.sun.tools.javac.Main.compile(Main.java:69)
        at com.sun.tools.javac.Main.main(Main.java:54)

Not exactly. But this is a massive source file, so we'll need to increase the Java compiler's memory allowance:

$ javac Babel.java -J-Xms256m -J-Xmx256m
Note: Babel.java uses unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.

This seems to have worked. Can we run the classfile?

$ java Babel -H
Open Babel converts chemical structures from one file format to another

Usage: Babel <input spe> <output spec> [Options]

Each spec can be a file whose extension decides the format.
Optionally the format can be specified by preceding the file by
-i<format-type> e.g. -icml, for input and -o for output

--truncated--

Success! But before we get too excited, let's make sure Open Babel's file formats are recognized by testing for "SMILES":

$ java Babel -Hsmi
Format type: smi was not recognized

As you can see, we have successfully converted the babel program to an executable classfile, but this classfile is missing most of the features of the native binary.

This may seem hopeless, but consider that natively compiling Open Babel using the above configure flags also produces a binary that doesn't know about SMILES or any other format.

So, it's very likely that if we can produce a native, statically compiled, self contained babel executable, then we will have solved the problem of running Open Babel entirely on a JVM.

This doesn't seem like a difficult problem, but apparently it is.
