Disruptive Innovation in Scientific Publishing: Directory of Open Access Journals
The Directory of Open Access Journals (DOAJ) currently lists 2420 Open Access scholarly journals, 52 of which fall under the category of chemistry. Although the organic chemistry subcategory lists only three journals, the general chemistry category contains several journals with organic chemistry content, such as the Bulletin of the Korean Chemical Society, Chemical and Pharmaceutical Bulletin, and Molbank.
Clearly, the chemistry journals included in DOAJ's listings would not be considered to be in "the mainstream" by experts in the field. And that's exactly the point. Innovation always happens at the margins.
As Clayton Christensen puts it in his landmark book, The Innovator's Dilemma:
As we shall see, the list of leading companies that failed when confronted with disruptive changes in technology and market structure is a long one. ... One theme common to all of these failures, however, is that the decisions that led to failure were made when the leaders in question were widely regarded as among the best companies in the world.
Replacing the word "company" with "scientific journal" leads to an important hypothesis about the future of scientific publishing.
And on the subject of disruptive innovation itself, Christensen writes:
Occasionally, however, disruptive technologies emerge: innovations that result in worse product performance, at least in the near-term. Ironically, in each of the instances studied in this book, it was disruptive technologies that precipitated the leading firms' failure.
It seems very unlikely that scientific publishing operates according to a different set of rules than any other technology-driven business. The coming wave of disruptive innovation will be dramatic, and the outcome completely predictable.
Making the Case
The SMIREP system is available from http://www.karwath.org/systems/smirep/ under the GNU General Public License. The Web page also contains the data files used in the Experimental Section. The system is provided in Python and C source code, including the required Python OpenBabel module OBGrep.
-Andreas Karwath and Luc De Raedt, J. Chem. Inf. Model., ASAP Articles
Karwath and De Raedt are onto something more than just an innovative use of SMILES strings. When the majority of chemical informatics papers provide instructions for downloading both complete source code and complete data sets, the game will have changed forever. Advocating this position in essays, presentations, emails, and letters is one way to make the case, and a very old one at that. For your next paper, why not make the case with a statement like the one above?
Hacking PubChem: Why The Open Access Fight is Just the Beginning
Like no other medium, the Internet tests our basic beliefs about the rights of resource owners and resource users. As the Internet increasingly becomes home to scientific publication mechanisms that have no counterpart in the physical world, a larger question looms: what separates fair use of these services from abuse?
Depth-First hosts a series of articles, with possibly many more to follow, on programmatically accessing open chemical information databases.
The availability of open chemical information resources like PubChem and NMRShiftDB is a very recent phenomenon, and long overdue. One premise of this blog is that chemical informatics is at the start of a renaissance; the chemical information revolution that started in the 1950s is now set to continue after a long period of stagnation. Large, open data sources, and open software that mines them, will fuel this transformation, just as they have in bioinformatics.
The interaction of non-browser software with public databases, although rich in potential payoffs, can also lead to a great deal of damage. PubChem contains millions of structure-searchable compounds. Setting the wrong kinds of programs loose on this site could cause service interruptions ranging from the annoying to the severe.
There is no standard mechanism for website owners to spell out acceptable use policies to non-browser software. The closest thing we have to a standard is the Robots Exclusion Protocol. This protocol defines acceptable behaviors for a robot, which according to one definition is "... a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced." Other definitions are in use. The one thing these definitions seem to have in common is the concept of scale: the more comprehensive and indiscriminate a program is in its interactions with a website, the more like a robot, and the less like a browser, it becomes.
Site owners specify their robots policy in a file called robots.txt hosted on their servers. The PubChem robots.txt file currently includes the following policies:
User-agent: *
Disallow: /substance/PcsSrv.cgi
Disallow: /summary/summary.cgi
Disallow: /assay/assay.cgi
Disallow: /image/imgsrv.fcgi
Disallow: /image/smi2gif.fcgi
Disallow: /image/smi2gif.cgi
Disallow: /image/structurefly.cgi
Disallow: /search/NbrQsrv.cgi
Disallow: /search/PreQSrv.cgi
Here, User-agent refers to the name of the robot, which is set as a wildcard, meaning any robot. The Disallow lines refer to resources off-limits to robots.
One of these disallowed resources, /search/PreQSrv.cgi, is explicitly used in the PubChem SMILES query article.
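To make the policy concrete, here is a minimal sketch of how a program could check a path against these rules before making a request. It uses the robots.txt parser in Python's standard library and the rules exactly as quoted above. The agent name is a made-up placeholder, and the second path is a hypothetical one chosen only because it is not on the list.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the robots.txt excerpt quoted above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /substance/PcsSrv.cgi
Disallow: /summary/summary.cgi
Disallow: /assay/assay.cgi
Disallow: /image/imgsrv.fcgi
Disallow: /image/smi2gif.fcgi
Disallow: /image/smi2gif.cgi
Disallow: /image/structurefly.cgi
Disallow: /search/NbrQsrv.cgi
Disallow: /search/PreQSrv.cgi
"""

# Hypothetical agent name; the wildcard rule applies to any name.
AGENT = "example-chem-client"


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The resource used in the SMILES query article is on the disallowed list.
print(parser.can_fetch(AGENT, "/search/PreQSrv.cgi"))  # False

# A path that does not appear in the rules (hypothetical) is allowed.
print(parser.can_fetch(AGENT, "/some/other/page.html"))  # True
```

A check like this only reports what the site owner has asked of robots. Whether a given program counts as a robot in the first place is the question the protocol leaves open.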
Is a person who runs software of the type I describe in these articles violating PubChem's use policy? The best answer I can give is, "it depends." I think it would be hard for reasonable people to suggest that using the software as described in the tutorials, with their deliberately limited scope, for research purposes, and with no intent to do damage, represents abuse.
On the other hand, I can see how reasonable people could argue that a website operating as a comprehensive front-end to PubChem using the techniques described in these articles could be considered abuse. I know I might consider it abuse if I ran PubChem, depending on why I was running the service.
If I wanted to stimulate innovation in the area of open database mining, I might actually encourage front ends and similar third-party PubChem services. I might set aside servers specifically dedicated to this kind of activity. I might even develop an Open Source PubChem Web-API to help developers get started. Unfortunately, NIH's intentions are not exactly clear on this point.
The NCBI's Copyright and Disclaimers page, the only document that to my knowledge states any kind of use policy, is not especially illuminating:
Conditions of Use
This site is maintained by the U.S. Government and is protected by various provisions of Title 18 of the U.S. Code. Violations of Title 18 are subject to criminal prosecution in a federal court. For site security purposes, as well as to ensure that this service remains available to all users, we use software programs to monitor traffic and to identify unauthorized attempts to upload or change information or otherwise cause damage. In the event of authorized law enforcement investigations and pursuant to any required legal process, information from these sources may be used to help identify an individual.
We are left with the critical, but unanswered question: "What represents an unauthorized use of PubChem?"
The document cited above also raises the truly bizarre possibility of PubChem not actually being capable of granting rights to redistribute what is contained on its servers:
This site also contains resources such as PubMed Central, Bookshelf, OMIM, and PubChem which incorporate material contributed or licensed by individuals, companies, or organizations that may be protected by U.S. and foreign copyright laws. ...
But this is a subject for another day.
Getting back to accessing PubChem data, one very far-sighted thing the NIH has done is to make the entire dataset freely downloadable in three different file formats. Rather than mine the PubChem website itself, you could download the data to your machine, letting the software you write access it locally. The sheer size of this dataset creates problems of its own. Future articles will describe some approaches to solving them.
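As one possible starting point for working with a local copy, the sketch below reads a large structure file one record at a time instead of loading it all into memory. It assumes the download is in SD file (SDF) format, where each record ends with a line containing `$$$$`, and that it may be gzip-compressed. The function names and the file name in the usage note are hypothetical.

```python
import gzip
import io
from typing import Iterator, TextIO


def iter_sdf_records(handle: TextIO) -> Iterator[str]:
    """Yield one record at a time from an SD file.

    Assumes each record is terminated by a line containing '$$$$'.
    """
    lines = []
    for line in handle:
        if line.strip() == "$$$$":
            yield "".join(lines)
            lines = []
        else:
            lines.append(line)
    # Tolerate a final record that lacks the terminator line.
    if any(text.strip() for text in lines):
        yield "".join(lines)


def count_records(path: str) -> int:
    """Count records in a local SD file, gzip-compressed or plain."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return sum(1 for _ in iter_sdf_records(handle))


# Tiny in-memory stand-in for a real file, for illustration only.
sample = io.StringIO("record one\n$$$$\nrecord two\n$$$$\n")
records = list(iter_sdf_records(sample))
print(len(records))  # 2
```

With a real download, a call such as `count_records("compounds.sdf.gz")` (a hypothetical file name) would stream through the file on your own machine without sending a single request to the PubChem servers.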
Regardless of your views on the use and abuse of chemical information resources like PubChem, it's clear that getting open resources on the Web is only the first in a long series of controversial steps that will ultimately transform both the practice and culture of research.
Toward an Open, Worldwide Chemical Information Network
...Whatever your views of the present situation may be, I think there is general agreement that more attention will be given in the next few years to the information network concept. The hardware capability for such a network is well assured; in fact, the capability exists today. The real question is when, and under what conditions, the chemical community will determine that an economic need exists for a network that will tie together a wide range of chemical information services.
-Walter M. Carlson J. Chem. Doc. 1965, 5, 1-3
Several online chemical information services, including PubChem, NMRShiftDB, and ZINC, have emerged in a relatively short period of time. As these systems go from being toys for hackers to essential components of scientific workflow, their true potential will be unlocked by developing innovative ways to tie these disparate systems together.
This is not unlike the situation Carlson was describing in his 1964 luncheon speech before the ACS Division of Chemical Literature. Technologies have changed radically, but the fundamental problem of integrating disparate chemical information systems remains unsolved and ripe with possibilities.
A future in which Chemical Abstracts Service no longer dominates the collection and distribution of chemical information is looking more possible than ever before. If recent history is any guide to this future, we can look to an array of semi-independent, open systems using open standards and operating on a global scale to become the new focal point. In fact, the capability exists today.
Readily Available, Without Infringements or Restrictions
...If we consider that one of the purposes of publication is to offer testable data, then it would seem that a minimum requirement would be that where computer programs and their results are presented, the author will make source code available on request. ACS could render good service by undertaking the distribution of such requested code. Furthermore, I would make it a condition for publication that such source code be provided. If the scientist is unwilling to disclose his code because he wishes to engage in a commercial venture, then I suggest that he be invited to take out a paid advertisement in the journal and be denied the privilege of publication to promote his product.
-John Figueras J. Chem. Inf. Comput. Sci. 1984, 24, 276
Science moves forward only insofar as observations can be validated and put to use by a third party. Chemical informatics is no different from any other field in this respect. Yet publications of the type Mr. Figueras opposed can still be found in 2006. Why is this?
At issue isn't just software. The ACS has recently spoken out on the necessity of open data sets. As a condition for publication, any data reported in a manuscript must now either appear in Supplementary Material or be “readily available, without infringements or restrictions.” Although this is a positive development, the wait continues for an equivalent statement on the availability of source code.
Open software systems and open data packages are most useful when they can be readily found by others and used together. In an effort to work on this problem, several individuals, including myself, formed The Blue Obelisk group. Through this group and others like it, like-minded researchers can begin to reap the benefits of openness enjoyed by other fields.

