Balancing Chemical Equations in ReactionMate Part 1: More Than Meets the Eye

May 17th, 2012

Balancing chemical equations is one of the first things taught in most introductory chemistry classes. As such, it might seem as if everything there is to know about the topic was published 75 years ago or more by chemists long since gone.

To my surprise, I found this was simply not so.

This post, the first in a series, discusses why a cheminformatics fan like myself would even consider the problem of balancing chemical equations as a serious topic, and what a deeper understanding of the subject might mean for you.

Origins: ReactionMate

ReactionMate (App Store) at the moment is a little iOS app with big ambitions. It aims to become a useful companion to anyone studying, performing, or using chemical reactions.

One of the fundamental requirements for processing chemical reactions computationally is the automatic balancing of chemical equations, a function ReactionMate performs relatively well.

Given this foundation, there are many interesting directions the app could be taken, both in education and research. Hopefully, I’ll be discussing some of them later.

When I started working on ReactionMate, my main motivation was to better learn iOS development with Objective-C. A reaction balancer, so I thought, would be an easy project.

Although I did reach the first milestone and now have an app for sale to show for the effort, the path was much longer and darker than I imagined.

Balancing By Inspection

You may remember back to learning how to balance chemical equations in General Chemistry. A common method still taught is known as balancing by inspection. This explanation from Khan Academy is typical:

Although a fine approach for balancing simple equations by hand, balancing by inspection does not lend itself well to use in a software algorithm. I suspect that what makes balancing by inspection so difficult for students to master is also what makes it unsuitable as an algorithm: it’s hit or miss and then try again.

Balancing With Matrices/Linear Algebra

A more general and deterministic method for balancing chemical equations uses linear algebra. Given the constraint that the number of atoms of a given type on the left of the equation must equal the number of that type of atom on the right, it’s possible to set up a system of linear equations and use matrix operations to solve for coefficients. The video below shows how this works conceptually:
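As a rough illustration of the idea (a minimal sketch in Python, not the Objective-C code used in ReactionMate), each element contributes one row to a matrix, each species one column, and the coefficients are a null-space vector of that matrix. The function name `balance` and the dict-based species format are assumptions made for this example, and the sketch only handles equations with exactly one independent solution.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def balance(reactants, products):
    """Return the smallest positive integer coefficients for a reaction.

    Each species is a dict mapping element symbol -> atom count.
    Assumes the equation has exactly one independent solution.
    """
    species = reactants + products
    elements = sorted({el for sp in species for el in sp})
    n = len(species)
    # One row per element; product columns are negated so that A x = 0.
    rows = [
        [Fraction(sp.get(el, 0) * (1 if j < len(reactants) else -1))
         for j, sp in enumerate(species)]
        for el in elements
    ]

    # Gauss-Jordan elimination with exact fractions.
    pivots = []
    r = 0
    for c in range(n):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        rows[r] = [v / rows[r][c] for v in rows[r]]
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                factor = rows[i][c]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        pivots.append(c)
        r += 1
        if r == len(rows):
            break

    free = [c for c in range(n) if c not in pivots]
    if len(free) != 1:
        raise ValueError("expected exactly one independent solution")

    # Set the free variable to 1 and read off the pivot variables.
    x = [Fraction(0)] * n
    x[free[0]] = Fraction(1)
    for row_index, c in enumerate(pivots):
        x[c] = -rows[row_index][free[0]]

    # Clear denominators, then divide out any common factor.
    lcm = reduce(lambda a, b: a * b // gcd(a, b), (v.denominator for v in x), 1)
    ints = [int(v * lcm) for v in x]
    g = reduce(gcd, (abs(v) for v in ints))
    ints = [v // g for v in ints]
    if any(v <= 0 for v in ints):
        raise ValueError("no all-positive solution")
    return ints


# Combustion of propane: C3H8 + O2 -> CO2 + H2O
coefficients = balance(
    [{"C": 3, "H": 8}, {"O": 2}],
    [{"C": 1, "O": 2}, {"H": 2, "O": 1}],
)
```

Exact fractions are used instead of floating point so that the final integer coefficients are not affected by rounding.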

Issues

Implementing the necessary matrix algebra in Objective-C was not terribly difficult in a first pass. Coefficients for the first few chemical equations were found without problems.

However, as the number and complexity of tested equations grew, problems began to surface - each demanding a solution. The next installment will discuss some of the issues and how they were addressed.

Conclusions

Balancing chemical equations is a deep topic at the intersection between mathematics and chemistry. Balanced chemical equations also lie at the heart of many important areas in chemistry. With academic papers appearing as recently as a couple of years ago, this topic is full of surprises.

Comments and Reactions

Why ACS Must Come Clean on Journal Publication Costs

March 14th, 2012

The now-fading discussion of Elsevier and the Research Works Act got me thinking about the American Chemical Society, which runs a very large scientific publication business of its own. I started wondering why the ACS exists in the first place and what its long-term vision might be.

In 1938, a U.S. Congressional Charter was granted to the American Chemical Society. Although essentially honorific, this document makes for interesting reading. Front and center is Section 2, containing these inspiring words:

That the objects of the incorporation shall be to encourage in the broadest and most liberal manner the advancement of chemistry in all its branches; the promotion of research in chemical science and industry; the improvement of the qualifications and usefulness of chemists through high standards of professional ethics, education, and attainments; the increase and diffusion of chemical knowledge; and by its meetings, professional contacts, reports, papers, discussions, and publications, to promote scientific interests and inquiry, thereby fostering public welfare and education, aiding the development of our country’s industries, and adding to the material prosperity and happiness of our people.

As a longstanding member of the ACS, I question the policies ACS has pursued in light of the above statement. ACS has consistently maintained that the only way it can continue to publish journal content is to either compel authors to pay anywhere from $1,000 to $3,000 for its opt-in Author Choice Option, or to compel them to transfer all rights to the content they’re publishing so that ACS can keep it behind a variety of paywall mechanisms.

Under either option, this is a pay to play system. There are justifiable costs that everyone must pay, and then there’s profiteering. The latter is diametrically opposed to encouraging “in the broadest and most liberal manner the advancement of chemistry”.

Having been in business for a number of years now, I fully appreciate the need to cover costs and turn a profit. But ACS is a non-profit organization. In theory, it just needs to cover costs.

What are the costs to produce the various ACS journals? Nobody asked such a question back when journals were printed on dead trees and filled library shelves. Desktop publishing, Moore’s law, and the Web have changed all of that.

Conspicuously absent from the various ACS policy statements on open access is any form of financial transparency. Must we take at face value claims that ACS publications all need to continue with business as usual lest they cease to be financially viable?

As a previous Depth-First article pointed out, the best we can do is speculate. An annual financial statement is released by ACS and it is possible to connect some dots, but this makes for poor decision-making.

Release of a detailed breakdown of the costs to produce each ACS journal would go a long way to elevating the open access debate, and could turn out to support the ACS case.

Sadly, I fear no releases of this kind of financial data will be forthcoming. Less than 10% of the total ACS budget comes from dues and meeting registration fees. ACS is clearly using the proceeds of its publication business (and Chemical Abstracts) to fund scholarships, outreach programs, webinars, social networking experiments, job fairs, lobbying efforts, scientific awards, employment surveys, and executive compensation packages, among other things.

Some of these programs offer real value. Others are counterproductive, to put it kindly. Supporting all of these activities is not the issue. The issue is to what extent ACS is paying for all of this stuff by creating and perpetuating a dysfunctional publication system that hurts chemistry in the long run.

These decisions are not unilaterally made by Madeleine Jacobs or any of the other folks whose smiling images regularly grace the pages of C&ENews. The ACS is run by its members - at least one can hope so.

If you’re an ACS member concerned with the direction ACS is headed, you have the obligation and right to ask for financial transparency around ACS publications.


Education of a Scientist

February 7th, 2012

Creating this video was fun, and there’s an obvious business model behind the company making it possible. Guess what? I just published it for free to a worldwide audience.

Why doesn’t science work this way?


George Whitesides: The Concept of the Scientific Paper is Eroding Before Our Very Eyes

January 28th, 2012

George Whitesides is well known as an innovator and one of chemistry’s most visible representatives. He has the distinction of being credited with the highest Hirsch index for any living chemist. You could say he knows something about scientific publication.

The above video excerpt was taken from an extended interview with Whitesides on the topic of publication. Although the full interview series is worth watching, most striking are Whitesides’ views on the changes happening in scientific publication and what the future is likely to hold:

One of the troubles with universities is there’s a tendency to do terrific research, embed it in prose that is impenetrable even to experts, bury it in papers, and have to everyone’s surprise nothing come out of it. … It may well be that what we have in the future is some combination of very short snips in one or another kind of extended abstract leading to, through links - through something else, more and more levels of detail.

The scientific paper evolved under an environment in which scarcity ruled. This scarcity enforced a uniform format on scientific discourse. Due to physical constraints on printing and distribution, only certain kinds of research could be published. A great deal of valuable information and effort have been wasted in the process.

The Web and desktop publishing have abolished these constraints, but most scientists continue to act as if nothing has changed. They continue to huddle around the dying embers of imprimatur left over from a handful of pre-digital journals. Or they try and try again to push these journals into distribution models (Open Access) that are simply incompatible with the top-heavy organizations that have grown around outdated publication models.

Those scientists and nimble information service providers who understand that the old rules no longer apply will enjoy significant advantages that will amaze (and possibly leave behind) the ones who can’t (or won’t) adapt.


Five Things to Do Instead of Protesting the Research Works Act (HR 3699)

January 23rd, 2012

The Research Works Act (‘RWA’, or HR 3699) would reverse the US federal government policy of requiring recipients of National Institutes of Health (NIH) funds to deposit copies of their research papers into PubMed Central. It would also prevent the adoption of similar policies by other federal funding agencies going forward.

As a scientist who has participated in the authoring and review of a few scientific papers in closed, for-profit journals, I believe the Research Works Act should be allowed to pass, and that opposition to it focusses on the wrong problem, despite good intentions.

Fighting The Research Works Act Is Not Worthy of Your Time or Intellect

For those still willing to entertain an alternative perspective on this charged issue, consider:

  1. The NIH Public Access Policy in no way changes the copyright status of the works appearing on PubMed Central. Redistribution, duplication, or repurposing of works from PubMed Central could still make you liable to the usual copyright infringement penalties - even if you are the author.

  2. The NIH Public Access Policy is assembling an incomplete corpus of scientific works. Authors not supported by NIH, which includes most in the chemical and pharmaceutical industries, and foreign authors - among others, are not subject to the NIH policy. Their works will generally never appear in PubMed Central. Although in certain situations having access to an incomplete corpus of scientific papers can be helpful, for the most part the practical utility is about on par with an Internet in which 4 out of 5 sites are blacked out.

  3. The Public Access Policy attempts to do through legislation what scientists should be doing for themselves - namely, completing the transition from the pre-digital era of publication scarcity to the post-digital era of limitless, cheap publication capacity.

Fighting the Research Works Act is one of the least effective things a scientist could do to fix a deeply dysfunctional system of scientific publication. If you must place a “Stop RWA” banner on your Twitter avatar, or write your representatives about the “Evil Scientific Publisher Lobby” to feel better, by all means do it.

You’ll accomplish little for science, however.

If on the other hand you want to pass onto the next generation a system of scientific communication that accelerates science rather than holding it back, you’ll have to work much harder and take some rather unpleasant risks.

Government can’t do it for us. It’s our mess and we need to clean it up.

Five Ways to Effect Real Change

  1. Identify all journals in your field that: (1) require authors to transfer copyright as a condition of publication; and (2) publish any research works under terms that don’t allow free redistribution and commercial repurposing of content.

  2. Submit no further manuscripts to the journals you identified in Step (1). Refuse all requests to review papers for these journals. Write letters to the editors of these journals explaining your actions. Publicize your actions through some public medium (e.g., blog or letter to the editor of a trade magazine).

  3. You’ll still need to publish, of course. Find at least one journal in your field that allows you to retain copyright to your work and which makes its content available for redistribution and commercial repurposing. From now on, only submit articles to that journal and accept all reasonable requests to review papers from them.

  4. Identify the ten most influential scientists in your field. Find out which of them publish mainly or exclusively in the journals you identified in Step (1). Write a letter to each leader explaining the crisis in scientific publication, the harm they’re doing to science and their group in continuing to publish in these journals, and the steps you’ve taken to solve the problem. Ask for their commitment to do the same as you have.

  5. You’re very likely to get either no response from the leaders in your field or a negative one. If it should happen that you get a favorable response, ask this leader to publish an open letter to the editor of the appropriate journals explaining why their policies are detrimental to scientific progress.

Conclusions

The resolution to the scientific publishing crisis will not come through a government bailout in the form of public access policies. It will come from starving entrenched old-guard publishers of the only value they’re currently adding to the scientific publication system - imprimatur. Regardless of whether you’re a leader in your field or just a concerned scientist, imprimatur comes from the combined perceptions of you and your peers. Fortunately, perceptions can change.

The way to create a scientific publication system that advances the cause of science is to make it repugnant, ridiculous, and lonely to participate in one that doesn’t.


Digital Destruction in Scientific Publishing: Why This Scientist Supports the Research Works Act (HR 3699)

January 18th, 2012

Steve Jobs described the textbook business as an $8 billion a year industry “ripe for digital destruction.” The scientific publication business resembles the textbook industry in many ways, including the lack of value creation Jobs was referring to. The Research Works Act (HR 3699) is the scientific publication industry’s surest path to digital destruction. Every scientist who cares about the future of scientific communication should support it.

Yes, I’m encouraging support for this bill, and no, I’m not being ironic. Read on to find out why.

The Slow Decline of the Scientific Publication Business

The advent of desktop publishing and the Web has utterly disrupted the business model of scientific publishers. Before these technologies were widely available, publishers added tremendous value to the scientific publication process through editing, typesetting, peer-review, aggregation of content, printing onto paper, distribution, archival, and imprimatur.

With the exception of imprimatur, a key concept discussed below, nothing scientific publishers are currently doing today adds any value to the process that can’t be cheaply (and increasingly - freely) obtained elsewhere.

At a time in which scientific publishers are adding less and less value to the process, you might think that journal prices would be decreasing as well. Instead, the opposite has happened with journal prices rising faster than the rate of inflation for many years running. Simultaneously, library budgets are being repeatedly cut due to lackluster public funding for science and a multi-year economic slump.

Those who have been paying attention know that a process of culling the weakest of the traditional scientific journals has been underway for the last ten years as research libraries are squeezed between the irresistible forces of shrinking funds and rising costs.

NIH’s Public Access Policy

Partly in reaction to the declining availability of scientific research papers brought on by declining library budgets and unending journal price increases, NIH was authorized to take action. HR 2764, a large spending bill containing a tiny section that became the NIH Public Access Policy, was signed into law in 2007 by then-president George W. Bush. The Director of NIH was granted new powers to make scientific publications more widely accessible:

SEC. 218. The Director of the National Institutes of Health shall require that all investigators funded by the NIH submit or have submitted for them to the National Library of Medicine’s PubMed Central an electronic version of their final, peer-reviewed manuscripts upon acceptance for publication, to be made publicly available no later than 12 months after the official date of publication: Provided, That the NIH shall implement the public access policy in a manner consistent with copyright law.

Publishers faced a dilemma: forbid deposition of manuscripts or risk losing content? For the most part, they relented and allowed deposition. For example, the American Chemical Society conformed to the new policy by granting its authors a special exception allowing them to deposit manuscripts into PubMed Central. According to the current version of that policy, an ACS author can comply with the federal regulation in three ways:

  • The ACS deposits the final, published article on the author’s behalf, for immediate open availability via the ACS AuthorChoice fee-based option. [cost: $1,000 - $3,000]
  • The author deposits the peer-reviewed manuscript, accepted for publication but prior to ACS’ copy editing and production, with NIH, for open availability 12 months after publication.
  • ACS deposits on behalf of the author the peer-reviewed manuscript, accepted for publication but prior to ACS’ copy editing and production, with NIH, for open availability 12 months after publication.

The Research Works Act

HR 3699 is a very short bill aimed at reversing the NIH Public Access Policy (full text). Its key provisions are contained in Section 2:

No Federal agency may adopt, implement, maintain, continue, or otherwise engage in any policy, program, or other activity that– (1) causes, permits, or authorizes network dissemination of any private-sector research work without the prior consent of the publisher of such work; or (2) requires that any actual or prospective author, or the employer of such an actual or prospective author, assent to network dissemination of a private-sector research work.

According to the definitions included in the bill, ‘private-sector work’ means any work created by a non-employee of the federal government. A scientist at a public research university receiving NIH funding is more likely than not considered to be producing a ‘private-sector research work.’

The Research Works Act is supported by the Association of American Publishers, whose membership includes the American Chemical Society.

Clearly, the intent of this bill is to dismantle the NIH Public Access Policy and to close off a source of content distribution publishers find unfair.

Imprimatur: Still Worth Big Bucks

Any scientist who has been an active participant in scientific publication as an author, reviewer, and consumer recognizes that the only remaining value added by scientific publishers today is imprimatur. Imprimatur is the implied endorsement received by authors who publish in certain scientific journals, particularly in those that earned a high level of prestige during the pre-digital period of publication scarcity.

Ironically, imprimatur remains so valuable in science that it has kept numerous publishers afloat despite wave upon wave of digital destruction being visited on sister industries such as book publication and newspapers.

But imprimatur can lose its luster, particularly in an environment in which fewer and fewer scientists can actually read the publications appearing in ‘high-impact’ journals. Prestige counts for nothing in science if your peers can’t read your papers. Nevertheless, that’s where scientific publication is heading.

This is why you must support HR 3699 if you care about the future of scientific publication.

Embrace Digital Destruction

Currently, NIH’s Public Access Policy is the only means available for many scientists seeking to access key scientific works. Removing this government subsidy will permit the natural process of digital destruction to run its course.

As scientists continue to lose access to prestigious, old-guard scientific publications through unending cost increases and library budget cuts, they will be forced to seek alternatives. Initially, they will be looking for alternatives as consumers of scientific papers, a point supported by the widespread enthusiasm for the NIH Public Access Policy among the scientific community.

But as more and more publishers are bankrupted by their own inability to profitably innovate, scientists will be forced to consider alternatives as authors as well. At this point, the decaying scientific publication model we know today will be finally dead and a new era will have begun.

Attempts to artificially prop up the status quo through legislation and litigation, no matter how well-intended, will only delay the digital destruction of the old guard in scientific publishing.

Conclusions

We as scientists have nobody to blame but ourselves for the mess that scientific publication has become. If we lack the courage to risk career setbacks by publishing in ‘third-tier’ open access journals, experimenting with open science using the many free tools the Web offers, or boycotting old-guard publishers, then we must wait patiently for digital destruction to break this ridiculous cycle for us.


On the (F)utility of Extending the Molfile Format

January 11th, 2012

MDL V2000 molfile format is the closest thing cheminformatics has to a universally-adopted standard. First publicly described in depth in 1991 and developed over the previous 13 years, the molfile format is read and written by nearly all software doing anything significant with organic chemical structures today.

Think about this for a moment - the lingua franca of cheminformatics predates Windows 3.0 and the World Wide Web. And it continues to be used as first described, with a few changes along the way.

Reliance on an old standard is not necessarily a bad thing. But when that standard’s limitations hinder progress, it’s time to look for alternatives. Has the time come to say ‘goodbye’ to the trusty V2000 molfile format, or does it still have plenty of life left?

V3000

A major limitation of V2000 is its hard cap on the number of atoms and bonds - 999. V2000 is a file format rooted in a bad idea long since abandoned, and made painfully obvious in the wake of Y2K - fixed-width fields. If your molecule contains more than 999 atoms or bonds you’ll overflow the ‘atom counts’ field, and you’ll be unable to specify a bond to the higher-numbered atoms.
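The overflow can be sketched in a few lines of Python. This is a simplified illustration that models only the first two fields of the V2000 counts line (atom count and bond count, three characters each); the helper names `write_counts` and `read_counts` are made up for the example and do not come from any toolkit.

```python
def write_counts(n_atoms, n_bonds):
    """Write the atom and bond counts as two 3-character, right-justified fields."""
    return f"{n_atoms:3d}{n_bonds:3d}"


def read_counts(line):
    """Read the two counts back by column position, as a fixed-width parser would."""
    return int(line[0:3]), int(line[3:6])


# A small molecule round-trips cleanly.
small = read_counts(write_counts(12, 11))

# With 1000 atoms the first field needs four characters, every later column
# shifts by one, and a positional reader silently returns the wrong counts.
large = read_counts(write_counts(1000, 999))
```

Here `small` comes back as `(12, 11)`, while `large` comes back as `(100, 99)` with no error raised, which is arguably worse than an outright failure.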

Noting this and other problems, MDL (the originator of the V2000 spec) published a next-generation format called V3000 in 1995.

Over the course of the next fifteen years, V3000 went from merely being an option for large structures to the preferred format for encoding any structure, at least according to the molfile spec’s authors. From the most recent June 2010 update:

Current Symyx products support reading and writing of both V2000 and V3000 formats. These products continue to default to writing V2000 molfiles to maximize interoperability with third party applications. Future product versions might default to output of the preferred V3000 format.

(Other changes were afoot as well: MDL was bought by Symyx, which was itself bought by Accelrys.)

Hard figures on the adoption of V3000 are difficult to come by. My sense in talking to others and reading the literature is that V3000 is being used internally by some large pharmaceutical companies - primarily for its ability to represent partially-defined stereochemistry.

However, when it comes to exchanging information between organizations, V2000 (in particular the V2000 Structure Data File) is the only game in town (SMILES files are a fallback, but that’s a story for another time). For example, none of the Sixty-Four Free chemistry databases offering bulk downloads provide a V3000 option. They all use V2000.

Looking at the advantages stated by Symyx itself for V3000, it’s not hard to see why most in cheminformatics see little to be gained by switching:

  • “Provides better support for new chemical properties or objects, and supports enhanced stereochemistry” [Needed by a minority in cheminformatics, and practical only under well-designed validation protocols.]
  • “Removes fixed field widths to support large structures. …” [Cheminformatics has historically concerned itself with small molecules, so this has not been a major concern.]
  • “Supports the use of templates in a template block, which is useful for representing large structures, such as biological molecules. …” [This is where things get interesting - see below].
  • “Uses free format and tagging of information for easier parsing.” [Unfortunately, V3000 was introduced before XML, so parsing these tags actually requires a lot of work by comparison.]
  • “Provides better backward compatibility through BEGIN/END blocks.” [V3000 attempts to offer a nested file structure, but again without the convenience of tooling available from a modern file format. This means more work for developers, not less.]

V3000 has so far failed to replace V2000 as an inter-organization exchange format. Furthermore, few software packages have fully implemented the V3000 spec (although the same could be argued for V2000 as well).

Is it possible that V3000, although solving some problems, didn’t address enough of the right problems?

Self-Contained Sequence Representation

The pharmaceutical industry’s innovation problems are no secret. In looking for solutions, one avenue gaining some traction is biopolymer-based drugs such as proteins, carbohydrates, and polynucleotides.

This has led to a new problem in cheminformatics - the need to effectively deal with co-mingled small molecule and biopolymer structures in corporate registration systems.

The problem is that cheminformatics has traditionally been concerned with small molecules - those typically having fewer than 100 heavy atoms. But it’s not uncommon for biopolymers to consist of structures of thousands of heavy atoms or more. The optimizations that work so well for encoding, displaying, editing, and algorithmically processing large collections of small molecules begin to break down when applied to these larger molecules.

Very recently, Accelrys published their solution to this problem, Self-Contained Sequence Representation (SCSR). Based on the V3000 molfile format, SCSR offers a common framework for encoding both small molecules and biopolymers. In a whitepaper on this technology, Keith Taylor explains:

The biochemistry industry needs a method of representation for large structures that is meaningful to biologists and chemists, reduces redundant information and enables structural features to be searched using a computer system.

Remember those “templates” and “template blocks” touted as advantages for V3000 in the specification? This is where those data structures prove themselves.

In SCSR, atoms can define literal atomic species such as carbon or nitrogen. But atoms can also contain entire substructures, such as amino acid residues. Rather than repetitively encoding every atom in an alanine polypeptide residue, it’s only necessary to give a pointer to a single instance of alanine encoded in the file - once per occurrence. This system saves considerable storage space and processing time, and leads to more natural-looking depictions.
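The template idea can be illustrated with a purely conceptual Python sketch. It does not use SCSR or V3000 syntax; the `TEMPLATES` dictionary, the atom names, and the `expand` and `compress` helpers are all hypothetical, and bonds between residues are ignored for brevity.

```python
# Hypothetical residue templates: each is stored once, as a list of heavy-atom names.
TEMPLATES = {
    "Ala": ["N", "CA", "C", "O", "CB"],
    "Gly": ["N", "CA", "C", "O"],
}


def expand(sequence, templates):
    """Replace each template reference with its own copy of the template's atoms."""
    atoms = []
    for position, name in enumerate(sequence, start=1):
        atoms.extend((position, name, atom) for atom in templates[name])
    return atoms


def compress(atoms):
    """Recover the sequence of template references from an expanded atom list."""
    sequence = []
    last_position = None
    for position, name, _atom in atoms:
        if position != last_position:
            sequence.append(name)
            last_position = position
    return sequence


sequence = ["Ala", "Gly", "Ala"]
expanded = expand(sequence, TEMPLATES)                   # 14 heavy atoms in total
stored = sum(len(atoms) for atoms in TEMPLATES.values())  # 9 atoms stored once
round_trip = compress(expanded)                           # back to the sequence
```

For a three-residue toy peptide the saving is small, but the stored template atoms stay constant while the expanded atom list grows with every additional residue.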

Chemically-modified residues and even polyvalent residues can both be handled using a consistent (if somewhat perplexing) notation system. Although focussed on biopolymers, this system could be adapted for use with any large molecule containing repeating subunits.

Points to note include:

  • Self-Contained Sequence Representation is a V3000 extension only.
  • Although fully implemented in Accelrys products, adoption of SCSR by major software vendors remains an open question at this point.
  • SCSR is mainly a compression technology - it’s possible to convert any V3000 SCSR file with complete fidelity into an expanded non-SCSR representation and vice versa.

It will be interesting to see to what extent the drive to unify chemical and biological registration systems moves more organizations away from V2000 molfiles. Supporting two file formats does require overhead that most groups would rather avoid. On the other hand, the need to support V2000 as an interchange format (e.g., chemical suppliers sending SD files to big pharmas) will likely hinder this transition.

V2000 Issues

Anyone who has worked with V2000 molfiles in even a moderate-sized structure database has been hit at least once with one of the technology’s most serious shortcomings - V2000 molfiles can’t faithfully encode a large amount of interesting small molecule chemistry.

Over the course of a few years, I detailed some of these cases on Depth-First. Many of the structures that can’t be represented in sufficient detail involve multicenter bonding and axial chirality. Molecules like ferrocene and binaphthalene are good representatives, but there are many others.

To be clear, these problems can be managed by applying ad-hoc fixes internally, but when it comes time for V2000 to play its star role as information exchange format, things can get ugly.

Zero-Order Bonds and Explicit Hydrogen Count Extension to V2000

Even simpler structures pose problems for V2000. Developing a solution by extending V2000 was the focus of a recent paper by Alex Clark.

V2000 provides two conveniences that have no doubt contributed to the widespread adoption of the format: (1) all bonds must be single, double, or triple; and (2) assignment of hydrogens to heavy atoms is optional. However, these conveniences also lead directly to some troubling cases of mistaken molecular identity.

Consider dimethyl tin and tin dichloride. The paper argues that file formats rooted in implicit hydrogen conventions make it impossible for software to determine how many hydrogens to assign to either species. It’s implied that V2000 is such a format, and a proposed extension is detailed.

However, one of the lesser known features of the V2000 format is the atom valence field. In its typically terse way, the June 2010 release of CTfile Formats has this to say on the subject of the valence field:

Shows number of bonds to this atom, including bonds to implied H’s.

Although sometimes considered a query-only feature, nothing in the V2000 specification supports this conclusion. On the contrary, V2000 defines another atom property called “hydrogen count” marked as “[Query]” (like bond topology and query bond types). The “valence” property is marked as “[Generic]” (like atom charge). Furthermore, V2000 supports an atom property in the properties block called “Substitution Count” that can be used with “[Query]” only and which does not override the atom “valence” property in the atom block (as other equivalent atom properties do). The terms “[Query]” and “[Generic]” are only ever defined in terms of the kinds of files that support them, so differences of opinion on this point are to be expected.

Regardless, the V2000 format apparently already provides a mechanism to distinguish tin(II) and tin(IV) oxidation states in the two examples above. The solution would therefore be for software encoders to specify a valence of “2” for tin(II) species and a valence of “4” for tin(IV) species, and for software decoders to respect this convention.
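To make the convention concrete, here is a minimal sketch of how a decoder might read the valence field and turn it into a hydrogen count. The column positions and the meaning of the values 0 and 15 follow the atom block layout described in CTfile Formats; the function names and the hand-built atom lines are hypothetical, and no existing toolkit is claimed to work this way.

```python
# Sketch only: helper names are illustrative, not taken from any toolkit.

def atom_line(symbol, valence):
    """Build a minimal fixed-width V2000 atom line with the given valence field."""
    # x, y, z, space, symbol (aaa), dd, ccc, sss, hhh, bbb, vvv
    return "%10.4f%10.4f%10.4f %-3s%2d%3d%3d%3d%3d%3d" % (
        0, 0, 0, symbol, 0, 0, 0, 0, 0, valence)

def parse_atom_line(line):
    """Return (symbol, valence_field) from a V2000 atom line."""
    symbol = line[31:34].strip()
    field = line[48:51].strip()      # the "vvv" columns
    return symbol, int(field) if field else 0

def implicit_hydrogens(valence_field, bond_order_sum):
    """Hydrogens implied by an explicit valence; None when valence is unmarked."""
    if valence_field == 0:           # 0 means "no marking"
        return None
    valence = 0 if valence_field == 15 else valence_field  # 15 encodes zero valence
    return max(valence - bond_order_sum, 0)

# Tin with two single bonds (to methyl or to chlorine) in both cases.
tin_ii = parse_atom_line(atom_line("Sn", 2))
tin_iv = parse_atom_line(atom_line("Sn", 4))
print(tin_ii, implicit_hydrogens(tin_ii[1], 2))
print(tin_iv, implicit_hydrogens(tin_iv[1], 2))
```

With a valence of 2 the decoder adds no hydrogens to tin, and with a valence of 4 it adds two, which is exactly the distinction the two tin species need.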

The extent to which software vendors support the atom valence attribute is an interesting question. A quick look at the CDK GitHub repository shows that MDLV2000Reader does in fact read the valence attribute, but this attribute is only assigned to instances of IPseudoAtom. The Open Babel GitHub repository shows no indication that the atom valence property is either read or written at all (mdlformat.cpp).

This is a little odd. The two most widely-used open source cheminformatics toolkits both ignore a built-in solution to the problem of implicit hydrogen counts of di- and tetravalent tin in V2000 molfiles. (Full disclosure - my company’s structure editor ChemWriter also ignores the valence atom attribute). It seems likely that more than a few commercial packages take the same approach. Now a recent paper proposes an extension that would at least in part replicate the functionality of the atom valence attribute.

Another problem described in Clark’s paper is the limitation of bond orders to one, two or three. There is no zero bond order. Take for example, the cobalt complex below:

Cobalt is bonded to six amino groups, but in V2000 we’re forced to give each bond a minimum order of 1. This leads to some strange conclusions. First, cobalt would have an implied formal charge of +9. Second, counting implicit hydrogens on each nitrogen atom (assuming we’re not using the atom valence property) would give 2 rather than 3. With one hydrogen missing from each of six nitrogens, we’d mis-calculate the molecular mass by about 6 AMU. The paper considers some alternative representations, but each suffers from limitations of its own.
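The hydrogen miscount can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: a decoder that assigns neutral nitrogen a default valence of 3, ligands that each really carry three hydrogens, and an approximate hydrogen mass of 1.008. The names are illustrative only.

```python
# Sketch of the implicit-hydrogen miscount for six nitrogen ligands that are
# each drawn with a single bond to cobalt.
DEFAULT_N_VALENCE = 3   # assumed default valence for neutral nitrogen
H_MASS = 1.008          # approximate atomic mass of hydrogen

def implicit_h(default_valence, bond_order_sum):
    """Hydrogens a decoder adds to fill the default valence."""
    return max(default_valence - bond_order_sum, 0)

ligands = 6
true_h_per_n = 3                                      # hydrogens actually on each nitrogen
decoded_h_per_n = implicit_h(DEFAULT_N_VALENCE, 1)    # one order-1 bond to Co

missing_h = ligands * (true_h_per_n - decoded_h_per_n)
mass_error = missing_h * H_MASS

print(decoded_h_per_n, missing_h, round(mass_error, 3))
```

If the N-Co bonds could be given an order of zero, the same function would return three hydrogens per nitrogen and the mass error would disappear.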

The paper proposes an extension to V2000 to allow for zero-order bonds, enabling the following improved representation to be displayed and used for calculations:

Although this solves the problem for certain classes of structure, it leaves others unsolved. For example, the zero-order bond representation of ferrocene below gives reasonably good calculations for formal charge, but it fails to capture the symmetry of the molecule. It assigns significance to the two C-Fe bonds labeled as single where no significance is warranted, and it implies different bond types in each cyclopentadienyl ring, among other issues.

Finally, it’s ironic that the proposed extensions apply to the much older V2000 format, which its originator now says will not be extended further. No mention is even made of the newer V3000 format that Accelrys says will be the basis of all future extensions.

Multiplication of Errors

V2000 is based on a very limited chemistry model, and this is the root cause of most of the problems cited by critics.

One of the most egregious errors is in the elevation of side-effects - atomic charge and bond order - to the status of fundamental atomic and bonding properties.

Ask any knowledgeable chemist how to calculate atomic formal charge and they’ll do so by counting valence electrons. Valence electron count is the property any robust chemical interchange format should be capturing, not formal charge. Formal charge is the side-effect of a well-defined electron count. In case you missed that, let me re-state it:

Any chemical file format (or in-memory representation) that relies on charge as a fundamental atomic property is doomed to the same problems as V2000.
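As a reminder of how the derivation runs, here is a small sketch of the textbook formal charge calculation: the valence electrons of the free atom, minus the nonbonding electrons on the atom, minus half of the electrons it shares in bonds. The lookup table and function name are illustrative and cover only a few elements.

```python
# Sketch: formal charge derived from electron counts, not stored as a property.
VALENCE_ELECTRONS = {"H": 1, "C": 4, "N": 5, "O": 6}

def formal_charge(symbol, nonbonding_electrons, bonding_electrons):
    """Valence electrons minus lone-pair electrons minus half the shared electrons."""
    return VALENCE_ELECTRONS[symbol] - nonbonding_electrons - bonding_electrons / 2

# Nitrogen in ammonium: no lone pairs, four bonds (8 shared electrons).
print(formal_charge("N", 0, 8))
# Oxygen in hydroxide: three lone pairs, one bond (2 shared electrons).
print(formal_charge("O", 6, 2))
# Oxygen in water: two lone pairs, two bonds (4 shared electrons).
print(formal_charge("O", 4, 4))
```

The electron counts are the inputs and the charge is the output. A format that stores only the charge has thrown away the information needed to go the other way.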

Bond order calculations, like formal charge calculations, are made by counting electrons. As a result, bond order is a very limiting property to choose as a basis for a chemical representation system. Bond order is derived from electron count, not the other way around. In other words:

Any chemical file format (or in-memory representation) that relies on bond order as a fundamental bond property is doomed to the same problems as V2000.
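The same point can be sketched for bonds. Assuming a simple two-center bond, the order is half the number of shared electrons, and only some of the resulting values map onto the single, double, and triple types that V2000 offers for ordinary structures. The function names are hypothetical.

```python
# Sketch: bond order derived from the electrons shared between two atoms.

def bond_order(shared_electrons):
    """Order of a two-center bond as half the shared electron count."""
    return shared_electrons / 2

def v2000_bond_type(shared_electrons):
    """Return 1, 2, or 3 when the bond fits a V2000 type, otherwise None."""
    order = bond_order(shared_electrons)
    return int(order) if order in (1, 2, 3) else None

for electrons in (0, 1, 2, 4, 6):
    print(electrons, bond_order(electrons), v2000_bond_type(electrons))
```

A zero-electron contact and a one-electron bond (as in the hydrogen molecular ion) both come back as None. Adding a zero-order type covers the first case but not the second, which is why starting from electron counts is the more general choice.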

My concern is that attempts to extend the V2000 format by adding new templates (for example, a zero-order bond; these are not to be confused with the “templates” discussed above) will alleviate only part of the problem and leave the rest festering. The problem with V2000 is not the lack of templates (e.g., no zero-order bond), but the use of templates like bond types in the first place.

Templates can also be found in stereochemical definitions, both in V2000 and elsewhere. All of the problems encountered with using V2000 to represent, for example axial chirality, can be traced to the reliance on templates.

Clark offers the following tongue-in-cheek witticism:

cheminformaticians do not know what molecule file format they will be using in 20 years, but they know it will be called MDL Molfile.

I would counter with:

cheminformaticians do not know what molecule file format will overcome the limitations of MDL Molfile, but they know it won’t be based on templates.

Conclusions

No amount of retrofitting will solve the fundamental flaws of V2000 as a chemical representation and information exchange format. They are:

  1. Optional specification of implicit hydrogen counts.
  2. Encoding the side effect properties of charge and bond order, rather than the fundamental property of electron count.
  3. Elevation of bonding and stereochemistry templates to the status of fundamental properties.
  4. Deeply-ingrained and restrictive notions of bonding that preclude multi-center bonds or bonds with odd electron counts.
  5. Decades of inertia on the part of software providers and the cheminformatics community.

V3000 offers some true innovations in the areas of partial stereochemistry definition and biopolymer encoding that are worth understanding. However, at its core V3000 relies on the same basic (flawed) model of chemistry as V2000.

It’s tempting to continue ignoring molecular corner cases as they accumulate. After all, if the only structures a cheminformatics system sees are small organic molecules, V2000 works nearly flawlessly. And if this is the only kind of molecule chemists continue to make, there really is no problem to solve.

But I’m optimistic about the future of chemical research and the creativity of chemists. It’s time to retire both V2000 and V3000 molfile formats so that we in cheminformatics can stop making excuses and start keeping pace with advances in chemistry.

IBM Donates Large Collection of Patent Chemical Structures to NIH/PubChem

December 15th, 2011

IBM recently announced the donation to PubChem of more than 2.4 million chemical structures extracted from the patent literature and biomedical journals. According to Marc Nicklaus of NIH:

… Non-U.S. patents are included as the source of structures in this data donation. This information is not directly part of the donated file itself, though. There is a link for each record that points back to an IBM web page that provides some additional information (apparently for free) of the type, “PMIDs and patent numbers found for documents containing IBM Structure ID=0015AFBF08D8F183C1F8E32A430CFFEB.” What one finds there in this case is simply: EP0244956A1 …presumably the European patent in which this compound appeared.

BTW, these data were donated to both PubChem and us (NCI CADD Group). We’re currently processing the file and will incorporate the structures into our services on http://cactus.nci.nih.gov.

The donation resulted from research performed using IBM’s Strategic IP Insight Platform (SIIP). Last year, Stephen Boyer discussed technical aspects of the patent mining work as it applies to cheminformatics (below).

IBM’s donation should be viewed in the context of related recent events including the release of screening data for over 300,000 structures against malaria by GlaxoSmithKline and Novartis.

Are data releases like these by large companies merely a fad or the start of something big? Only time will tell. But given the ongoing pain and renewed drive to innovate in the pharmaceutical industry, I wouldn’t be surprised to see multiple announcements along the same lines in the coming year.

Understanding the PyMOL User Interface

November 3rd, 2011

PyMOL is designed a bit differently from other applications, which can take some getting used to. The tutorial gives a high-level overview of the user interface.

Install PyMOL on Mac OS X Snow Leopard

November 2nd, 2011

Installing free PyMOL on Mac isn’t as easy as installing on Windows or Linux. For one thing, there is no precompiled binary (although you can buy one from Schrodinger). The tutorial shows how to install the free version of PyMOL using MacPorts (background).

Another problem is that compiling from source leads to compile errors, and these errors depend on the version of Python you’re running. With Python 2.7 and PyMOL 1.4.1, this is what I saw:

$ python setup.py build
[snip]
layer0/ShaderMgr.c: In function ‘ShaderMgrConfig’:
layer0/ShaderMgr.c:173: error: ‘GLEW_OK’ undeclared (first use in this function)
layer0/ShaderMgr.c:173: error: (Each undeclared identifier is reported only once
layer0/ShaderMgr.c:173: error: for each function it appears in.)
layer0/ShaderMgr.c:174: error: ‘GLEW_VERSION_2_0’ undeclared (first use in this function)
layer0/ShaderMgr.c:185: warning: format ‘%s’ expects type ‘char *’, but argument 3 has type ‘int’
layer0/ShaderMgr.c: In function ‘ShaderMgrConfig’:
layer0/ShaderMgr.c:173: error: ‘GLEW_OK’ undeclared (first use in this function)
layer0/ShaderMgr.c:173: error: (Each undeclared identifier is reported only once
layer0/ShaderMgr.c:173: error: for each function it appears in.)
layer0/ShaderMgr.c:174: error: ‘GLEW_VERSION_2_0’ undeclared (first use in this function)
layer0/ShaderMgr.c:185: warning: format ‘%s’ expects type ‘char *’, but argument 3 has type ‘int’
lipo: can't figure out the architecture type of: /var/folders/9+/9+lCZbKTGeSvLa-KoByjCE+++TI/-Tmp-//cc6lq8a7.out
error: command 'gcc-4.2' failed with exit status 1

Here’s what I saw using Python 2.6 (which comes with Snow Leopard).

I did manage to find a solution, but it’s a bit ugly and I’d like to see what I get from the PyMOL Users List before posting it.
