Molfile and SD File Formats: Broken But Irreplaceable?

June 07, 2012

Compared to many areas of computer science, cheminformatics is a backwater. Consider our "standard" file formats, Molfile and SD File. These formats, collectively referred to as "CTfile formats," were first described in the peer-reviewed literature in 1992 and had been in use for years before that. CTfile is a line-based text format with lines of at most 80 characters, and it bears little resemblance to XML and JSON, the standard information interchange formats in wide use everywhere else.

But make no mistake. This quaint but broken method of moving data around is just as relevant today as it was in 1992 - likely more so.

Why do we in cheminformatics continue to use CTfile formats to the exclusion of all others? This was the question posed to me by Ian Daniher (@itdaniher) of Nonolith Labs. Ian has developed PyChEBI - "a Python script to convert the quasi-obsolete SDF file format into a sane (Pythonic) datastructure."

I've come up with these reasons why CTfile remains the go-to data format in chemistry:

  1. CTfile is good enough. The CTfile format offers enough functionality to solve most problems involving data exchange in chemistry.
  2. CTfile is relatively well-documented. A single PDF contains everything you'll need to implement a CTfile reader and writer. The documentation has been updated regularly for many years now.
  3. CTfile is easy to understand. Although the specification has grown a number of odd whiskers over time, the core concepts remain very easy to understand and implement.
  4. Lack of compelling alternatives. Chemical Markup Language (CML), an XML-based format, has been in development for many years. Yet it still fails to attract attention outside of a limited audience in cheminformatics. Poor documentation and a constantly evolving schema and toolset are two reasons, but this list is evidence of many others.
  5. Lack of a competing standard used by must-have software. CTfile formats were developed by MDL Information Systems for use with its suite of user-focused chemistry tools. MDL recognized that CTfile was key to making its business work - and so dedicated the resources necessary to develop and document it. Similar statements apply to the ChemDraw CDX format.
  6. Open Standard. Software that reads and writes CTfiles can be created without paying license fees to MDL or its successor companies. Combined with freely-available documentation, this is about as close as we get in cheminformatics to an open standard.
  7. Databases rule. Demand for chemical information interchange formats is driven in large part by chemical databases, free and otherwise. When these databases offer the ability to perform mass downloads, they generally use a CTfile format. CTfile is the only format these services can rely on their users being able to open.
  8. Software. Molfile and SD File are the only file formats for which stable, well-tested readers and writers have been universally implemented. No other file format enjoys this privilege, and that is unlikely to change anytime soon.
  9. Worse is better. Enough said.
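
To make point 3 concrete, here is a rough Python sketch of how little code a bare-bones reader needs. It assumes the V2000 fixed-width layout (three header lines, a counts line, then atom and bond blocks) and ignores charges, stereo flags, property lines, and V3000 entirely. The function names and the methanol-like sample are made up for illustration; they are not taken from PyChEBI or any other library.

```python
def parse_molfile(text):
    """Read the name, atoms, and bonds from a V2000 molfile (sketch only)."""
    lines = text.splitlines()
    counts = lines[3]                      # line 4 is the counts line
    n_atoms = int(counts[0:3])             # columns 1-3: number of atoms
    n_bonds = int(counts[3:6])             # columns 4-6: number of bonds
    atoms = []
    for line in lines[4:4 + n_atoms]:
        x, y, z = float(line[0:10]), float(line[10:20]), float(line[20:30])
        symbol = line[31:34].strip()       # element symbol follows one space
        atoms.append((symbol, x, y, z))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        # first atom index, second atom index, bond type (atoms count from 1)
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))
    return {"name": lines[0].strip(), "atoms": atoms, "bonds": bonds}


def iter_sd_records(text):
    """Yield each record of an SD file; records end with a '$$$$' line."""
    block = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            yield "\n".join(block)
            block = []
        else:
            block.append(line)


def parse_sd_record(record):
    """Split one SD record into its molfile part and its data items."""
    lines = record.splitlines()
    end = next(i for i, l in enumerate(lines) if l.split() == ["M", "END"])
    mol = parse_molfile("\n".join(lines[:end + 1]))
    data, key = {}, None
    for line in lines[end + 1:]:
        if line.startswith(">"):           # data header, e.g. "> <NAME>"
            key = line[line.index("<") + 1:line.rindex(">")]
            data[key] = []
        elif key is not None and line.strip():
            data[key].append(line)         # value line for the current item
        else:
            key = None                     # a blank line closes the item
    return mol, {k: "\n".join(v) for k, v in data.items()}


# Hypothetical two-atom sample, built with format strings so the
# fixed-width columns are explicit.
SAMPLE = "\n".join([
    "methanol",
    "  sketch",
    "",
    "%3d%3d  0  0  0  0  0  0  0  0999 V2000" % (2, 1),
    "%10.4f%10.4f%10.4f %-3s 0  0" % (0.0, 0.0, 0.0, "C"),
    "%10.4f%10.4f%10.4f %-3s 0  0" % (1.43, 0.0, 0.0, "O"),
    "%3d%3d%3d  0" % (1, 2, 1),
    "M  END",
])
SD_SAMPLE = SAMPLE + "\n> <NAME>\nmethanol\n\n$$$$\n"

for record in iter_sd_records(SD_SAMPLE):
    mol, data = parse_sd_record(record)
    print(mol["name"], len(mol["atoms"]), len(mol["bonds"]), data)
```

Everything here is string slicing on fixed columns, which is a big part of why readers and writers exist in nearly every toolkit. It is also where the "odd whiskers" come in: a real reader has to deal with everything this sketch skips.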

Will CTfile ever be replaced by another standard? Of course. I suspect that the replacement will address one or more of the points above. The replacement for CTfile will also likely start out by gaining a foothold at the periphery of chemistry/cheminformatics and will be ignored by the mainstream for some time. Given the seeming insurmountability of the task, I further suspect that the standard that replaces CTfile will be developed by a group of relative outsiders - they would be the only ones who would think such a thing is even possible.

So, Ian, I salute your initiative. You may be onto something.