Dispelling Open Source Confusion: An Introduction to Licenses
Selecting an open-source license is a minefield for which few are prepared when they need to be. There are a plethora of licenses under which open-source software can be released. Selecting a license at the initiation of a FOSS [Free and Open Source Software] project is likely to be a low priority, as there is no initial value to the project. Without a line of source code written, wading through the legalese and nuances of distribution licenses seems unimportant. In reality, the irrevocable nature of the license makes this the most critical time if authors wish to eventually exercise control over derivative works. ... Unfortunately, even the most carefully selected and restrictive license may not afford complete protection from unanticipated and undesired uses.
-Matthew T. Stahl, Drug Discovery Today
Few subjects cause as much confusion and as many heated debates as Open Source licensing. The Open Source Initiative has approved over 50 licenses compatible with its ten-point definition of "Open Source". Whenever that many solutions to a problem exist, it's a sure sign that one size does not fit all. In this article, I'll introduce some of the key concepts in Open Source licensing.
Disclaimer
There is a phrase used so often in discussing the legal aspects of Open Source software that it has its own acronym: I Am Not A Lawyer (IANAL). Clearly IANAL, and chances are that you are not one either. Yet the very acts of writing and using Open Source software require basic familiarity with licensing terms and concepts. My aim in this article is not to provide legal advice, but rather to relate what I've found useful in trying to understand Open Source licensing for my own work. When in doubt, hire a lawyer.
One Good Book

The best writing on the subject of Open Source licensing I've read can be found in the book Open Source Licensing by Lawrence Rosen. An intellectual property attorney, Rosen also served as general counsel and secretary of the Open Source Initiative. His book is remarkably clear and easy to read. If you'd rather not pay for a hardcopy, it can be viewed in its entirety online.
The Good News
Fortunately, all Open Source licenses share some common features, if you know what to look for. For example, most licenses fall into one of two major categories:
Academic Licenses: These licenses, named for their original use in universities, allow unlimited freedom to distribute binaries based on altered source code without making these changes public. Examples of widely-used academic licenses include the Apache License, the BSD License, and the MIT License.
Reciprocal Licenses: These licenses require, to varying degrees, the developer of a derivative work to release his or her modifications to the public if the work is distributed. The question of what constitutes a "derivative work" varies from license to license, but most generally involves the modification of the files of a software package. Examples of widely-used reciprocal licenses include the GNU General Public License (GPL), the GNU Lesser General Public License (LGPL), the Mozilla Public License (MPL), and the Common Public License (CPL).
The Importance of Copyright
A frequently-encountered misconception equates Open Source licensing with release into the "public domain." Nothing could be further from the truth. The difference is in the ownership of copyright.
Software in the public domain has no owner. All enjoy unrestricted freedom to copy and otherwise use public domain software. A well-known example is David Megginson's SAX XML toolkit. Megginson, by placing his software in the public domain, has forfeited all rights to control how his work is used. Sun Microsystems incorporated SAX into their Java Development Kit without any obligation to Megginson whatsoever. SAX is not Open Source software; it is public domain software.
In contrast, software distributed under an Open Source license remains the intellectual property of the copyright owner. The license is simply a mechanism for the software's creator to give some (or all) of their rights to a licensee, usually in exchange for conditions that must be met. Ownership remains with the creator, who is free to distribute his or her work simultaneously under commercial and Open Source licenses if they so desire.
As you can see, copyright gives a license its legal legitimacy. Far from placing software in the public domain, Open Source licenses use copyright law in the same ways as commercial licenses. This is why understanding Open Source licenses is so important for developers and users alike.
Reciprocity: Share and Share Alike?
Critics of the GPL frequently cite its "viral" nature. The debate essentially boils down to the following paragraph:
You must cause any work that you distribute or publish, that in whole or in part contains or is derived from the Program or any part thereof, to be licensed as a whole at no charge to all third parties under the terms of this License.
Like a virus that spreads through replication, the GPL spreads by forcing licensees to release their modifications under the GPL. There are at least two other terms that describe this concept. The Free Software Foundation (FSF) uses the term "copyleft." Lawrence Rosen prefers the term "reciprocity" because of its neutral tone and greater descriptive ability. It's the term I'll also use. Reciprocity is such a fundamental concept in the GPL and other licenses that Rosen's book dedicates an entire chapter to the subject.
Developers distribute their software under reciprocal licenses for a variety of reasons. Two of the most common are:
To limit "freeloading", or the use of the software by those (typically companies) who contribute nothing back to the developer community.
To prevent "forking", or the establishment of a competing software package based on the original package.
In reality, Open Source licenses are limited in their ability to prevent either freeloading or forking. For example, provided that a company distributes no modifications to a GPLed package, they are under no obligation to release any of their own source code. Forking happens whenever one or more developers feel strongly enough about a subject to go in a different direction; an Open Source license does nothing to change this.
Given the limitations (and complexities) of reciprocity provisions, one might ask "why bother?". This is an excellent question, the answer to which will depend on your specific goals for your software. And as Stahl points out, the time to make this choice is before a line of code has been written.
Conclusions
Although Open Source licensing may appear to be a minefield, there is nothing mysterious about it. A lot of good writing is available on the subject, with Lawrence Rosen's book being a prime example. If you plan on creating or using Open Source software, learning the basic ideas behind Open Source licensing is a wise investment.
Making the Case: In Silico Prediction of Ames Test Mutagenicity
The two models (SAm and AIm) and the RHC [robust hybrid classifier] were implemented in C++ using OpenBabel 1.100.2 libraries (http://openbabel.sourceforge.net/wiki/Main_Page).
The AI model (AIm) is based on the LAZAR system (http://www.predictive-toxicology.org/lazar/index.html) developed by C. Helma...
-Paolo Mazzatorta, Liên-Anh Tran, Benoît Schilter, and Martin Grigorov, J. Chem. Inf. Model.
Yet another appearance of Open Source software in the primary cheminformatics literature comes by way of a paper from Mazzatorta, Tran, Schilter, and Grigorov of the Nestlé Research Center. This work employs two Open Source libraries: lazar, a tool for the prediction of toxic properties of chemical structures; and Open Babel, a widely-used, low-level library for cheminformatics. lazar, in turn, is based on both Open Babel and the GNU Scientific Library (GSL), a numerical library. Unfortunately, the Nestlé authors don't indicate whether the source code for their system is publicly available. Nevertheless, their work gives a taste of the kinds of synergies that inevitably develop through the use of Open Source software.
Making the Case: Similarity by Compression
...The structures were converted to SMILES format and canonicalized using a program written with the open-source Java cheminformatics library JOELib2. ... To conclude, we have demonstrated that SMILES strings and compression programs are a simple, yet powerful method for similarity searching, competitive with state-of-the-art-techniques. The Ruby scripts used to carry out the experiments described in this paper are available for download from http://comp.chem.nottingham.ac.uk/download/zippity/.
-James Melville, Jenna Riley, and Jonathan Hirst, J. Chem. Inf. Model.
Yet another appearance of Open Source software in the literature comes by way of a paper from Melville, Riley, and Hirst. This work takes advantage of the alphabet-like nature of SMILES strings and widely-available compression algorithms to perform molecular similarity analyses. Not only does this work use the Open Source JOELib library but the authors have made the Ruby scripts that perform the similarity analysis freely available under the same terms as Ruby (Ruby's license or the GPL).
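To make the idea concrete, here is a minimal sketch of compression-based similarity in Python. It is not the authors' Ruby code, and the exact measure they used is not given in the excerpt above; the normalized compression distance (NCD) and the zlib compressor are my own assumptions, chosen because they are simple and widely available. The SMILES strings for aspirin and caffeine are illustrative and are assumed to be canonicalized already.

```python
import zlib


def compressed_size(text: str) -> int:
    """Return the size in bytes of the zlib-compressed text."""
    return len(zlib.compress(text.encode("ascii"), 9))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two strings.

    Smaller values mean the strings share more compressible structure.
    This measure is an assumption for illustration, not necessarily
    the one used in the paper.
    """
    size_a = compressed_size(a)
    size_b = compressed_size(b)
    size_ab = compressed_size(a + b)
    return (size_ab - min(size_a, size_b)) / max(size_a, size_b)


# Illustrative SMILES strings (assumed already canonical).
aspirin = "CC(=O)Oc1ccccc1C(=O)O"
caffeine = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"

if __name__ == "__main__":
    print("aspirin vs aspirin: ", ncd(aspirin, aspirin))
    print("aspirin vs caffeine:", ncd(aspirin, caffeine))
```

A string compared with itself should score lower than a string compared with an unrelated one, because the compressor can encode the second copy as a reference to the first. On strings this short the compressor's fixed overhead is significant, so the absolute values are only a rough guide.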
The times they are a-changin'.
Molbank and the Convergence of Open Access, Open Data, and Open Source in Chemistry
Molbank, published by Molecular Diversity Preservation International, is one of the oldest of a handful of Open Access journals in chemistry. Although its longevity is a remarkable accomplishment in itself, there is much more to Molbank than meets the eye. Just below the surface is a feature so revolutionary, yet simple, that chemistry publishers years from now will wonder why they didn't implement it sooner.
A Molbank article consists of a short monograph on a single compound, or possibly two. This may strike some scientists as a strange way to publish results, and it is unusual. On the other hand, this system offers vast potential to capture useful, but "unpublishable" findings that would otherwise be lost. Back when scientists actually read hardcopy journals, such a system would never have been feasible. Today, with hard drive space measured in terabytes, fiber optics cables crisscrossing the planet, Internet connectivity for almost everyone, and servers that can be had for virtually nothing, this system not only looks perfectly feasible, but preferable in many ways to the status quo.
Here's the revolutionary part: each article that Molbank publishes is accompanied by a publicly-available, machine-readable file encoding the structure of the article's subject molecule. That's it. There's nothing tricky or high-tech about it. In fact, the practice is about as low-tech as you could imagine. The file format in which structures are encoded, molfile, dates back at least fifteen years, and nearly every piece of chemistry software - both end-user and developer tools - can handle it. What makes Molbank's practice revolutionary is that no other chemistry journal, Open Access or subscription-based, currently does this.
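To give a sense of just how low-tech the format is, the following sketch reads the atoms and bonds out of a molfile using nothing but string slicing. It assumes the common V2000 fixed-column layout (a three-line header, a counts line, then atom and bond blocks). The sample molecule is a hypothetical methanol record built for this example, not a file taken from Molbank, and the function names are my own.

```python
def atom_line(x: float, y: float, z: float, symbol: str) -> str:
    """Build one V2000 atom line: three 10-column coordinates, then the symbol."""
    return f"{x:10.4f}{y:10.4f}{z:10.4f} {symbol:<3} 0" + "  0" * 11


# Hypothetical sample record (methanol, hydrogens implicit).
SAMPLE_MOLFILE = "\n".join([
    "methanol (hypothetical sample)",   # header line 1: title
    "  example",                        # header line 2: program/timestamp
    "",                                 # header line 3: comment
    "  2  1  0  0  0  0  0  0  0  0999 V2000",  # counts line
    atom_line(0.0, 0.0, 0.0, "C"),
    atom_line(1.4, 0.0, 0.0, "O"),
    "  1  2  1  0  0  0  0",            # bond: atom 1 to atom 2, single
    "M  END",
])


def parse_molfile(text: str):
    """Return (atoms, bonds) from a V2000 molfile string.

    atoms: list of (symbol, x, y, z)
    bonds: list of (first_atom, second_atom, bond_order), 1-based indices
    """
    lines = text.splitlines()
    counts = lines[3]
    n_atoms = int(counts[0:3])
    n_bonds = int(counts[3:6])

    atoms = []
    for line in lines[4:4 + n_atoms]:
        symbol = line[31:34].strip()
        atoms.append((symbol, float(line[0:10]), float(line[10:20]), float(line[20:30])))

    bonds = []
    first_bond = 4 + n_atoms
    for line in lines[first_bond:first_bond + n_bonds]:
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))

    return atoms, bonds


if __name__ == "__main__":
    atoms, bonds = parse_molfile(SAMPLE_MOLFILE)
    print(atoms)
    print(bonds)
```

A real application would more likely hand the file to an existing toolkit, but the point stands: once the structure is published in this form, a few dozen lines of code are enough to start working with it.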
Why does the simple inclusion of a publicly-available molfile encoding molecular structures in a paper matter so much? This is where the other two members of the trinity named in this article's title come into play: Open Source and Open Data. By providing a mechanism for a computer to decipher the chemistry in a paper, Molbank has opened the door to a host of highly-productive integration activities that nobody outside of Chemical Abstracts Service has even been able to contemplate, let alone prepare for.
This article is the first in a series aimed at exploring the wide-open space that Molbank has created. Rather than arguing my point with words, I'll actually build working demonstrations of what is now easily within reach. At the same time, I'll document my work on this blog. I'm not sure where all of this will end up, but I do hope to shine some light on a vital, although currently obscure, component of the Open Access debate.
We Have Met the Enemy and He Is Us
The basic problem of the primary literature is that the material to be published grows more rapidly than the number of people or institutions interested in buying and/or using it. A smaller, but still nagging, difficulty is that unit costs increase more rapidly than publishers are able to increase unit productivity.
... But in the last analysis, the primary literature would easily be able to continue basically unchanged, were it not for the fact that the demand has stabilized, while the supply of material has not yet done so.
-David E. Gushee, J. Chem. Doc. 1970, 10, 30-32
Gushee goes on to discuss the decline of ACS journal subscription rates and the simultaneous increase in total pages printed and journals published. One wonders to what extent these trends continued over the last 36 years and how this phenomenon may be driving the current escalation in journal costs.
About this "price squeeze" and a publisher's inability to escape it, Gushee writes:
A scientific society cannot, however, control cost as the typical business can. In journal publishing, the only real cost we can save is the page we don't print. And to restrict the number of pages printed is to interfere with the dissemination of knowledge, which is, after all, the basic reason the Society exists in the first place.
There are many interesting tidbits in this Back to the Future article, but perhaps none more so than the following:
Should the number of pages go over some critical number, then we get into a position of having to charge such a high price that individuals can no longer afford the journal. Chemical Abstracts, as an entity, reached that point some years ago and can no longer be considered a publication for individual subscriptions.
How expensive does a journal need to become before it can no longer be considered a publication for individual libraries? When that point is reached, who is responsible?



