Five Reasons Why Chemical Societies Need Free Databases and Web Services
For those who may not have seen the news, the Royal Society of Chemistry (RSC) earlier this week announced the acquisition of ChemSpider, the free database of chemical structures and related data. The tone of the press releases and commentary around the Web has been congratulatory, which is to be expected given the dedication and hard work by ChemSpider's creators. And much of the discussion focuses on what the chemistry community gains by the move. But there's much more to the story.
What's in it for RSC?
What's lacking in the public discussion is a clear explanation of what one of chemistry's oldest institutions hopes to gain by acquiring one of its newest.
Times are tough all over, and the scientific publishing business is no exception. This year the American Chemical Society (ACS) announced cuts to its staff and employee benefits programs amid declining revenues and investment returns, a situation unlikely to reverse itself anytime soon.
Although a service like ChemSpider can be created very inexpensively, growth and maintenance will likely require a significant resource commitment. Neither the RSC nor ChemSpider offers any indication of how the service will break even, much less contribute to RSC's bottom line.
The Big Picture
Chemical societies around the world are likely to be quite interested in what happens from here.
In years past, paid database and journal subscriptions laid the foundation for many of the activities supported by the largest chemical societies. But the paid subscription model sits in the crossfire of several long-term trends, most notably price increases that habitually outpace the rate of inflation, severe budget cuts in both academia and industry, and the emergence of dozens of free chemistry databases, Web services, and other communication channels beyond ChemSpider.
What's in it for You?
If you work for or are otherwise involved with a chemical society, what does the creation or acquisition of a free Web service like ChemSpider do for you? Here, in no particular order, are some possibilities:
Consult your Mission Statement. The RSC is dedicated to the "advancement of chemistry as a science, the dissemination of chemical knowledge, and the development of chemical applications." Many societies share similar statements of purpose. Free Web services represent one of the most cost-effective ways to achieve this goal.
Nontraditional revenue sources. No, we're not talking about advertising, although that's a possibility. Just because a Web service is "free" doesn't mean that all of its services need to be. For but one example, consider that many in industry are concerned about the information revealed by company employees' queries on public Web services. There are many ways to address these concerns - and create revenue in the process. With even a small amount of creativity, many more opportunities like this can likely be found.
Increased visibility for your other products and services. Google does it. IBM does it. Hundreds of smaller companies you may never have heard of do it. They've all built permission assets as a way to more effectively communicate their message to people who matter to them. Chemists will routinely ignore (and even scorn) your advertisements. How likely are they to ignore a free Web service that solves their problem?
Increase the reach and cohesion of your community. ChemSpider is one of the few public-facing chemistry databases that accept community-created information. Users of a system who only consume information have little stake in it. Users who contribute tend to be much more involved in the process, and the organization behind it.
Winner takes all. Quick - what's the second most popular search engine? What's the second most popular online encyclopedia? What's the second most popular video sharing site? What's the second most popular microblogging service? What's the second most popular photo sharing site? You've heard of all of the front runners, even if you don't use them. Have you even heard of any of the also-rans? When it comes to free online resources, winner takes all. By avoiding the creation of free online resources, you run the real risk of rendering your chemical society irrelevant.
Conclusions
The Web is in the process of changing the operating rules for every organization, particularly in information-rich technical fields like chemistry. If your chemical society ignores the changes now underway, then what exactly is its plan for staying relevant?
Cheminformatics and Micropublication in Chemistry
Over at Zusammen, a post on open notebook science and the least publishable unit drew some interesting comments. Jean-Claude Bradley introduced the term "micropublication", which seems to describe the concept very well.
A follow-up article explores the background and requirements for a workable micropublication system in chemistry.
Many of the points apply to any experimental science. Where chemistry is unique is in its widespread use of chemical structures, and cheminformatics is the central discipline needed to make micropublication work with them.
We're starting to see early signs that micropublication could work in chemistry. Consider ChemSpider, which is to my knowledge one of the first public-facing chemical databases to include user-created content. While not a micropublication system, it does have some of the key elements. The success of the Wikipedia chemistry project is another indication of real support for the idea of chemistry micropublication. Finally, consider that Chemical Abstracts itself was created, up until the late 1960s, by mainly volunteer effort.
How much of a role will chemistry micropublication play in the future of cheminformatics? Perhaps none. What is clear is that a chemistry micropublication system that actually worked would initiate a major shift in the way chemists create - and consume - chemical information.
Web-Centric Science
From The Realm of Organic Synthesis comes a common feeling of frustration with the way scientific information is distributed, and an increasingly common proposal for a solution:
I envision a hybrid of Doug Taber’s Organic Chemistry Portal, Wikipedia and a condensed version of SciFinder. I’ll gladly contribute! How do we get the ball rolling?
Today we're witnessing a major re-evaluation of the scientific publication system. At issue is the fundamental inefficiency in the way things work - in terms of time, effort, and especially money.
To change any system, you must first understand its parts and how they work together. In chemistry, the workflow for scientific publication goes something like this (items in boldface are key components):
You perform several Experiments over the course of several months to several years.
You record your Observations in a Notebook, typically visible only to You and those You choose to share with.
You prepare a Manuscript summarizing your Observations using a Word Processor. This Manuscript contains machine-readable Tables, Chemical Structures, Characterization Data, and Cross-References.
After internal review within your Organization, You submit the Manuscript to a Publisher.
Publisher finds 2-3 (semi)qualified Reviewers for the Manuscript. The ease of finding (semi)qualified reviewers may be a function of the Prestige of the Publisher.
Reviewers, together with the Journal Editor, decide whether the Manuscript is publishable at all and, if so, what Revisions need to be made.
You make Revisions to your Manuscript and send the result back to the Publisher.
Publisher publishes a Paper from your revised Manuscript. The form this document takes varies, but generally consists of Physical Paper, PDF, or HTML. All of these formats prevent, to varying degrees, the machine-readability of the Tables, Chemical Structures, Characterization Data, and Cross-References in your Manuscript.
Readers of the Journal who find your Paper immediately useful may Bookmark it (either literally, or by printing/copying it) for future reference. In most cases, however, your Paper will not be immediately read or noted.
To place your Paper into a larger Context, an Abstractor attempts to once again make machine-readable those elements from your Paper of broadest interest: Tables, Chemical Structures, Characterization Data, and Cross-References.
To enable the efficient location of your Paper by Researchers looking for answers to scientific questions, Abstractor creates and maintains a Database Service. Finding your Paper relies on Abstractor re-generating as much machine-readable information as possible.
Lavishly Inefficient
To a non-scientist who has used the Web for their entire adult life, this system appears lavishly inefficient. Each step requires people-power and toll-gates. There's nothing wrong with employing people to do work, of course. The problem is in employing expensive people to do work that cheap machines do far more efficiently. The problem is when the expensive people you employ work to maintain unnecessary steps in the production process. The problem is when passionate volunteers (or cheap labor) can do more with less than your paid staff. The problem is when cheaper technologies make your product look less appealing.
Just ask the automakers. Or the corner booksellers. Or the newspapers. Or textile manufacturers. Or the recording industry.
Reinventing the Wheel?
If you were going to re-invent the scientific publication system using any modern technology, how would you do it?
While I sympathize with the desire expressed by "J" (author of The Realm of Organic Synthesis - who might want to reconsider anonymously blogging science), I believe the Web offers a much more compelling range of solutions to the problem. The bad news is that it requires change at every level in the scientific publication process - and change is painful.
Who Profits from Inefficiency?
There's a case to be made that the inefficiency of the current publication system is actually helpful in certain situations. For example, the de-digitization and re-digitization of scientific content creates a profitable market for Abstractors. Another example: limited peer review of Manuscripts may be seen as benefiting authors anxious about being scooped by their competitors. Still another example: the lack of scalability in how Publishers currently operate can lead to a Prestige factor for Journals that maintain high quality standards.
Nevertheless, all of these advantages (and more) could be built into systems with the Web as their organizing principle.
Stirrings
For a glimpse of what the future of chemistry publication might hold, consider Open Notebook Science, ChemSpider, and Collaborative Drug Discovery. Each of these services shares a Web-centric view of information management and collaboration. And each contains at its core a fundamentally unique view of the role publication plays in the daily workflow of scientists.
Finally, consider GitHub, a Web-centric developer tool that demonstrates more clearly than any other I'm aware of how tenuous the distinctions between individual work, collaboration, and publication can actually be.
Conclusions
There's no question that a Web-centric scientific publication system can work much more effectively than what we have today - for authors, readers, and abstractors. The question is - are we ready for it?
Image Credit: Robert Scoble
Five Questions About the InChI Resolver
Yesterday the Royal Society of Chemistry (RSC) and ChemZoo (of ChemSpider fame) announced a plan to collaborate on the creation of an InChI Resolver service. From the announcement:
Using the InChI - an IUPAC standard identifier for compounds - scientists can share and contribute their own molecular data and search millions of others from many web sources. The RSC/ChemSpider InChI Resolver will give researchers the tools to create standard InChI data for their own compounds, create and use search engine-friendly InChIKeys to search for compounds, and deposit their data for others to use in the future.
...
The InChI Resolver will be based on ChemSpider's existing database of over 21 million chemical compounds and will provide the first stable environment to promote the use and sharing of compound data. 'ChemSpider hosts the largest and most diverse online database of chemical structures sourced from over 150 different data sources' adds Antony Williams of ChemSpider, 'We have embraced the InChI identifier as a key component of our platform and the basis of our structure searches and integration path to a number of other resources. We have delivered a number of InChI-based web services and, with the introduction of the InChI Resolver, we hope to continue to expand the utility and value of both InChI and the ChemSpider service.'
It's encouraging to see a major scientific publisher lend its support to InChI in further evidence of the broad adoption of the identifier. And an InChI key resolver is something I've previously said might be a good idea.
Still, InChI and InChI Key represent a significant change in platform for the field of chemistry, in which CAS Registry Numbers are the gold standard for chemical identification.
If we've learned anything from the last 30 years of information technology, it's that once a platform (no matter how dysfunctional) becomes entrenched, nothing short of a game-changing strategy and herculean effort can replace it. The failure of Windows Vista offers a stark reminder of the power of an entrenched platform. Closer to home, the failure of V3000 molfiles to gain significant traction against V2000 offers another.
With these thoughts in mind, here are some questions about the new InChI Resolver service:
What problem is the service really trying to solve? Although it might be obvious to those close to the situation, it's not quite clear to me. Many, if not most, of the desktop cheminformatics packages sold today now have support for generating InChIs. It's also possible to embed InChI in text documents without using a Web service. Convenient it's not, which may be the point. But if that's the case then the focus of the service should be convenience, simplicity, and ease of use.
How hard would it be to crack an InChI hash? Before dismissing this as impossible, consider that an InChI key is a truncated hash rather than true encryption, and truncation weakens it. Breaking hashing and encryption schemes has a long history in computer science. Given the regularity of InChI syntax, how hard would it be to create software that can computationally provide the InChI that was used to generate an InChI key? What alternative hashing method might make it easier to do so? If there is one, it would become the standard, not the one currently being used.
How will the authenticity of a hashed InChI from an untrusted source be verified? An InChI key might take the form of 'AAAAAAAAAAA-BBBBBBB-XYZ'. Given an arbitrary InChI key provided by an untrusted third party, how would you independently verify that it actually represents a valid key? In the absence of software like that described in Question 2, it would be impossible.
What about BINOLs and Ferrocenes? InChI can't distinguish between stereoisomers arising from axial chirality such as that found in widely-used molecules such as BINOL. There are multiple ways to represent organometallics such as ferrocene using InChI, and each will give rise to a unique InChI key. This is a Bad Thing.
Why bother with an InChI key at all? Consider a hypothetical InChI key: 'AAAAAAAAAAA-BBBBBBB-XYZ'. To an end user uninterested in information technology, why does it matter how the key was generated? One selling point might be that given an arbitrary key, the chemical structure it represents can be decoded independently of any service. But that service is the core of the RSC/ChemSpider proposal - and it will apparently only be able to resolve previously-deposited InChI keys. Sound familiar? This is essentially how the CAS Registry system works, except the CAS system can differentiate BINOL stereoisomers, uniquely identify organometallics, and even handle polymers and complex mixtures.
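Questions 2, 3, and 5 turn on the same property of a hashed identifier: a key can be checked against a candidate structure, or looked up among structures already deposited, but it cannot simply be decoded. Here is a minimal Python sketch of that idea. Note the assumptions: `toy_key` is a made-up stand-in (a shortened SHA-256 digest) and is not the real InChI key algorithm, and `build_resolver`, `resolve`, and `verify` are hypothetical names used only for illustration.

```python
import hashlib
from typing import Dict, Iterable, Optional


def toy_key(inchi: str) -> str:
    """Toy stand-in for an InChI key (assumption: NOT the real algorithm)."""
    digest = hashlib.sha256(inchi.encode("utf-8")).hexdigest()
    return digest[:16].upper()


def build_resolver(known_inchis: Iterable[str]) -> Dict[str, str]:
    """Map each key back to its InChI, but only for deposited structures."""
    return {toy_key(inchi): inchi for inchi in known_inchis}


def resolve(key: str, resolver: Dict[str, str]) -> Optional[str]:
    """Return the InChI behind a key, or None if it was never deposited."""
    return resolver.get(key)


def verify(key: str, claimed_inchi: str) -> bool:
    """Check a key by recomputing it from the InChI it claims to represent."""
    return toy_key(claimed_inchi) == key


if __name__ == "__main__":
    methane = "InChI=1S/CH4/h1H4"
    water = "InChI=1S/H2O/h1H2"
    ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"

    resolver = build_resolver([methane, water])
    print(resolve(toy_key(methane), resolver))   # deposited, so it resolves
    print(resolve(toy_key(ethanol), resolver))   # never deposited, so None
    print(verify(toy_key(methane), water))       # wrong structure, so False
```

In this sketch the only ways to get from a key back to a structure are a lookup table or a search over candidate InChIs, which is why a resolver limited to previously deposited keys behaves much like a registry.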
Within the RSC/ChemZoo proposal is a gem of an idea. The CAS Registry system is closed and in all likelihood will remain forever so. Verifying the authenticity of CAS number/chemical structure assignments is a big problem made worse by the closed nature of the CAS Registry system. Chemists must have a reliable method to reference chemical structures. There are no doubt many solutions to this problem with big payoffs to the field of chemistry for the one that actually works.
Streamlining Cheminformatics on the Web: Let InChI Do the Heavy Lifting and Get Some REST
A recent Depth-First article discussed the advantages of minimal Web APIs in Cheminformatics. Recently, Antony Williams unveiled some simplified ChemSpider URL schemes, mainly from the perspective of enabling Google indexing. However, it's possible to take this scheme much, much further. Here I present a proposal for radically simplifying (and unifying) the development of cheminformatics Web APIs and the software that interacts with them.
The New ChemSpider URLs
ChemSpider now has several new kinds of URLs. For the purposes of this article, the most interesting of these are of the format:
These URLs may seem unremarkable, but there's much more than meets the eye. They let anonymous developers query ChemSpider about specific substances - without needing to know much at all about how ChemSpider itself works. Goodbye API. Goodbye API support. Goodbye API documentation. Goodbye angle brackets. Hello to getting stuff done. It's all very RESTful. Well, at least it could be that way with some minor modification.
Some Recommendations
ChemSpider hasn't quite reached that place where the API just disappears. The problem is that the ChemSpider URLs listed above point to query results pages, not compound summary pages. Were these URLs to redirect to a summary page, we could construct the following URLs to extract ChemSpider resources (I've replaced the '=' sign with a '/' for simplicity):
.../InChIKey/DEIYFTQMQPDXOT-RERXVCSDCZ Get all resources for the molecule identified by the given InChIKey - i.e., "Compound summary page"
.../InChIKey/DEIYFTQMQPDXOT-RERXVCSDCZ/molfile.mol Get the molfile for the molecule identified by the given InChIKey
.../InChIKey/DEIYFTQMQPDXOT-RERXVCSDCZ/small_image.png Get the small image for the molecule identified by the given InChIKey
.../InChIKey/DEIYFTQMQPDXOT-RERXVCSDCZ/large_image.png Get the large image for the molecule identified by the given InChIKey.
.../InChIKey/DEIYFTQMQPDXOT-RERXVCSDCZ/citations.xml Get the list of citations for the molecule identified by the given InChIKey, in XML format.
Jane, a developer building Web applications on top of this new ChemSpider API, would immediately notice that things just work. Let's say her online database stores IC50s at the dopamine D2 receptor. On the summary page for each molecule, she wants to link out to the ChemSpider compound summary page, if available. She would simply construct the InChIKey on her server, build the needed ChemSpider URL and GET it. An HTTP 404 would indicate no molecule with that Key exists on ChemSpider and so no link would be shown. An HTTP 200 would indicate ChemSpider has the molecule, and so the link would appear.
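To make Jane's workflow concrete, here is a rough Python sketch of how her server might build a URL under the proposed scheme and decide whether to show a link. Everything here is an assumption for illustration: `BASE_URL` is a placeholder (the real prefix is elided above), the function names are hypothetical, and ChemSpider does not necessarily respond this way today.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional

# Placeholder only: the real URL prefix is elided ("...") in the list above.
BASE_URL = "https://example.org"


def resource_url(inchikey: str, resource: Optional[str] = None) -> str:
    """Build a URL for a compound, or for one named resource under it."""
    url = f"{BASE_URL}/InChIKey/{inchikey}"
    return f"{url}/{resource}" if resource else url


def http_status(url: str) -> int:
    """GET the URL and return the HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the code is what we want.
        return err.code


def summary_link(
    inchikey: str, get_status: Callable[[str], int] = http_status
) -> Optional[str]:
    """Return the summary page URL on HTTP 200, otherwise None (no link)."""
    url = resource_url(inchikey)
    return url if get_status(url) == 200 else None


if __name__ == "__main__":
    key = "DEIYFTQMQPDXOT-RERXVCSDCZ"
    print(resource_url(key, "molfile.mol"))
    # A stubbed status function stands in for the network call here.
    print(summary_link(key, get_status=lambda url: 404))
```

Passing the status check in as a function keeps the link logic testable without a network connection. Note that all Jane needs to know is the URL convention and two status codes.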
Conclusions
It would be interesting enough if ChemSpider adopted a system like that described here. But the real power of this approach would emerge if multiple Web services were to adopt it. By following a simple set of conventions, these services would enable third party developers to elegantly mashup all manner of cheminformatics resources into applications unimaginable today.
Technically, there's nothing that prevents this system from being implemented on every free chemistry database in existence today. However, doing so would transfer a significant degree of control from service operators to third-party developers. Not all providers will be comfortable with that idea.
Cheminformatics Web service providers need to carefully consider whether they're trying to develop a platform or an integrated service. As history has shown, the strategies, and upside potential, for each approach can differ dramatically.

