<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="/stylesheets/rss.css"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/">
  <channel>
    <title>Depth-First: Tag opendata</title>
    <link>http://depth-first.com/articles/tag/opendata</link>
    <language>en-us</language>
    <ttl>40</ttl>
    <description>Walking the Web of Chemical Informatics</description>
    <item>
      <title>Forty-Eight Free QSAR Datasets (and More)</title>
      <description>&lt;p&gt;&lt;a href="http://flickr.com/photos/brianlewandowski/45385584/"&gt;&lt;img src="http://depth-first.com/demo/20071206/zakim.jpg" align="right"&gt;&lt;/img&gt;&lt;/a&gt;Whether you're a medicinal chemist or an informatician, &lt;acronym title="Quantitative Structure Activity Relationship"&gt;QSAR&lt;/acronym&gt; datasets can be very helpful in understanding complex biological phenomena. These datasets typically consist of a hundred or fewer compounds associated with a specific parameter such as intestinal absorption, volume of distribution, blood-brain barrier penetration, or activity at one or more biological targets. Most of them are published as part of a paper appearing in a peer-reviewed journal.&lt;/p&gt;

&lt;p&gt;Unlike &lt;a href="http://depth-first.com/articles/2007/01/24/thirty-two-free-chemistry-databases"&gt;chemistry databases&lt;/a&gt;, which typically combine a search engine and a user interface with a dataset of thousands or millions of compounds, a QSAR dataset is much more focused and raw. You need to supply your own data viewer, report generator, and query tool.&lt;/p&gt;

&lt;p&gt;The Internet hosts a bewildering assortment of QSAR datasets tucked into various nooks and crannies. The problem is finding them. One useful resource is &lt;a href="http://cheminformatics.org"&gt;cheminformatics.org&lt;/a&gt;, which hosts a page linking to &lt;a href="http://cheminformatics.org/datasets/index.shtml"&gt;forty-four datasets&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Recently, Shaillay Kumar Dogra, Scientific Editor of &lt;a href="http://www.qsarworld.com/index.php"&gt;QSARWorld&lt;/a&gt;, wrote in to let me know about the site's offering of &lt;a href="http://www.qsarworld.com/qsar-datasets.php"&gt;forty-eight free QSAR datasets&lt;/a&gt;. Each dataset is linked to the primary literature and is available in four formats, including SD File. In contrast to many datasets, those at QSARWorld are manually curated. QSARWorld is also actively seeking new datasets to convert into machine-readable form; if you find one, write to them to have it added to the collection.&lt;/p&gt;

&lt;p&gt;Systematic efforts to collect, curate, and distribute raw data from the primary literature are long overdue. QSARWorld offers an intriguing model for doing so. Although some non-scientific issues, such as intellectual property rights, don't appear to have been addressed yet by QSARWorld, the site's offering of machine-readable raw data offers plenty of food for thought to anyone working with QSAR.&lt;/p&gt;

&lt;p&gt;What's your favorite dataset resource?&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Image Credit: &lt;a href="http://flickr.com/photos/brianlewandowski/"&gt;B.G. Lewandowski&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;</description>
      <pubDate>Thu, 06 Dec 2007 10:20:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:cfd85703-dc1c-49d8-a2ac-578a1f1e196e</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2007/12/06/forty-eight-free-qsar-datasets-and-more</link>
      <category>Tools</category>
      <category>qsar</category>
      <category>qsarworld</category>
      <category>dataset</category>
      <category>opendata</category>
    </item>
    <item>
      <title>How to Find Chemical Information on the Internet: Why Open Source, Open Access, and Open Data Matter</title>
      <description>&lt;p&gt;The Web may be the most effective information-delivery platform ever created. Unfortunately, a variety of barriers, both technical and cultural, restrict the use of the Web for chemistry. In the last few years, three powerful forces for change have emerged: &lt;a href="http://opensource.org/"&gt;Open Source&lt;/a&gt;; &lt;a href="http://en.wikipedia.org/wiki/Open_access"&gt;Open Access&lt;/a&gt;; and &lt;a href="http://en.wikipedia.org/wiki/Open_Data"&gt;Open Data&lt;/a&gt;. Most of what's written on these subjects takes a theoretical angle that makes it difficult to visualize real benefits. In this article, I'll discuss these ideas from a much more practical perspective.&lt;/p&gt;

&lt;h4&gt;A Thought Experiment&lt;/h4&gt;

&lt;p&gt;Try this simple thought experiment: using only a browser and the free Internet, find all Web pages that have anything scientifically-relevant to say about your favorite molecule. How would you do it?&lt;/p&gt;

&lt;h4&gt;It's Trivial&lt;/h4&gt;

&lt;p&gt;&lt;img src="http://depth-first.com/demo/20070123/wikipedia.jpg" align="right"&gt;&lt;/img&gt;If you were lucky enough to have chosen a molecule with a trivial name such as 'caffeine', you could just try &lt;a href="http://www.google.com/search?hl=en&amp;amp;q=caffeine&amp;amp;btnG=Google+Search"&gt;Google&lt;/a&gt;. Google's first result would link you to &lt;a href="http://en.wikipedia.org/wiki/Caffeine"&gt;the Caffeine Wikipedia article&lt;/a&gt;. Wikipedia is an evolving phenomenon that, according to some critics, will never have a place in scientific research. It may not be ready now, but reading the meticulously annotated and cross-referenced entry for caffeine should make anyone who would say "never" at least a little nervous. Many of the citations in Wikipedia's caffeine article point to the primary scientific literature through &lt;a href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?DB=pubmed"&gt;PubMed&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The remainder of Google's top-50 results are general audience items unlikely to interest a scientist keeping his or her nose to the grindstone: companies that sell caffeinated products; a variety of FAQs; self-help medical articles; and of course, &lt;a href="http://www.energyfiend.com/death-by-caffeine/"&gt;this one&lt;/a&gt;. We shouldn't be surprised. In the eyes of a massive search engine like Google, chemistry is just one of many niche markets.&lt;/p&gt;

&lt;p&gt;Adding terms to our Google search might produce more targeted results. For example, what if we wanted to find a proton NMR spectrum of caffeine? We could type "caffeine proton nmr" into Google. The first result &lt;a href="http://cat.inist.fr/?aModele=afficheN&amp;amp;cpsidt=16701620"&gt;links&lt;/a&gt;, indirectly, to &lt;a href="http://www.maik.rssi.ru/cgi-bin/search.pl?type=abstract&amp;amp;name=physchem&amp;amp;number=4&amp;amp;year=5&amp;amp;page=573"&gt;an article&lt;/a&gt; in the subscription-only &lt;a href="http://www.maik.rssi.ru/journals/physchem.htm"&gt;Russian Journal of Physical Chemistry A&lt;/a&gt;. This does us no good because we have no subscription, limited funds, and no access to the journal at a library. The &lt;a href="http://joi.jlc.jst.go.jp/JST.JSTAGE/analsci/19.1079?from=Google"&gt;second link&lt;/a&gt; is a direct hit: the proton NMR spectrum of caffeine in water-formic acid. Significantly, the information is contained in a peer-reviewed article (&lt;a href="http://dx.doi.org/10.2116/analsci.19.1079"&gt;DOI&lt;/a&gt;) published by the Japanese &lt;a href="http://en.wikipedia.org/wiki/Open_access"&gt;Open Access&lt;/a&gt; journal, &lt;a href="http://www.jsac.or.jp/cgi-bin/analsci/toc/"&gt;Analytical Sciences&lt;/a&gt;. The fact that &lt;em&gt;Analytical Sciences&lt;/em&gt; is an Open Access journal has made a world of difference in our search.&lt;/p&gt;

&lt;p&gt;Although this might seem like the perfect solution, recall that the goal of the experiment was to locate &lt;em&gt;all&lt;/em&gt; scientifically-relevant online content relating to the molecule. The technique we just used is most likely to succeed when we want specific information about molecules with a single trivial name. Even then, many resources may not cite a trivial name at all.&lt;/p&gt;

&lt;h4&gt;The Real World&lt;/h4&gt;

&lt;p&gt;Our options are even more limited when it comes to comprehensively text-searching even the simplest molecules lacking a widely-used trivial name. For example, consider the molecule represented by the systematic name '3-phenyl-2-methylpropene.'&lt;/p&gt;

&lt;p&gt;&lt;center&gt;&lt;img src="http://depth-first.com/demo/20070126/example.png"&gt;&lt;/img&gt;&lt;/center&gt;&lt;/p&gt;

&lt;p&gt;If we were using a proprietary system such as those offered by &lt;a href="http://www.cas.org/"&gt;Chemical Abstracts Service&lt;/a&gt; (CAS), we could simply enter the structure into a client and read off our results. This works because CAS isn't matching text when a query is submitted. Instead, it's matching molecular structures that have previously been encoded by both humans and machines.&lt;/p&gt;

&lt;p&gt;The minute we step out of the orderly system created by CAS and into the chaos of the Internet, we confront a thorny problem. In practice, there are only two widely-used methods to convert a molecular structure diagram into a form that can be text-searched, and each has major limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;IUPAC Nomenclature&lt;/strong&gt; This method has the advantage of being Open. It suffers from the complexity of its encoding rules, resulting in a variety of nonstandard implementations. As a result, it's possible to find multiple phrasings of the same IUPAC name, reducing its use as a unique identifier.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CAS Numbers&lt;/strong&gt; This method replaces a standard encoding system with a central registration authority. The advantage is that the &lt;em&gt;representation&lt;/em&gt; of the identifier itself is unambiguous. Conversely, the &lt;em&gt;meaning&lt;/em&gt; of a CAS Number can only be known by referring to a registration authority. Unfortunately, the &lt;a href="http://info.cas.org/infopolicy.html"&gt;current business model&lt;/a&gt; for CAS is based on &lt;em&gt;restricting&lt;/em&gt; information flow, rather than promoting it.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;Search by IUPAC Name&lt;/h4&gt;

&lt;p&gt;&lt;a href=""&gt;&lt;img src="http://depth-first.com/demo/20070126/iupac_logo.png" align="right" border="0"&gt;&lt;/img&gt;&lt;/a&gt;Let's try searching Google for IUPAC nomenclature. Entering '&lt;a href="http://www.google.com/search?hl=en&amp;amp;q=3-phenyl-2-methylpropene&amp;amp;btnG=Google+Search"&gt;3-phenyl-2-methylpropene&lt;/a&gt;' produces a results page containing three unique entries. One of them &lt;a href="http://lb.chemie.uni-hamburg.de/static/data/81_rckntdu1.html"&gt;links&lt;/a&gt; to a &lt;a href="http://lb.chemie.uni-hamburg.de/static/"&gt;database&lt;/a&gt; run from the University of Hamburg. I gather the purpose of this database, which I was unaware of before doing this search, is to link to &lt;a href="http://www.springer.com/west/home/laboe?SGWID=4-10113-0-0-0"&gt;Landolt-B&amp;#246;rnstein Online&lt;/a&gt;, a collection of numerical data. Interestingly, a search for data on a single compound turned into the discovery of &lt;a href="http://depth-first.com/articles/2007/01/24/thirty-two-free-chemistry-databases"&gt;yet another free chemistry database&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The remaining hits from the Google search linked to two pages (&lt;a href="http://pubs.acs.org/cgi-bin/abstract.cgi/joceah/1982/47/i06/f-pdf/f_jo00345a032.pdf?sessid=6006l3"&gt;pdf&lt;/a&gt;, &lt;a href="http://pubs.acs.org/cgi-bin/abstract.cgi/jacsat/1962/84/i09/f-pdf/f_ja00868a019.pdf?sessid=6006l3"&gt;pdf&lt;/a&gt;) from ACS journals. The ACS routinely makes the first page of its articles a free download. It's interesting to note that these were the &lt;em&gt;only&lt;/em&gt; ACS hits returned by Google.&lt;/p&gt;

&lt;p&gt;We've exhausted the possibilities with our chosen IUPAC name and Google. But before moving on to searching by CAS Number, we need to solve a problem. How do we get the CAS number for our compound?&lt;/p&gt;

&lt;h4&gt;Finding a CAS Number: A PubChem Detour&lt;/h4&gt;

&lt;p&gt;&lt;a href="http://pubchem.ncbi.nlm.nih.gov"&gt;&lt;img src="http://depth-first.com/files/pubchemlogo.gif" align="right" border="0"&gt;&lt;/img&gt;&lt;/a&gt;Concisely summarizing what PubChem does can be difficult because different users will emphasize different aspects of its design. For our purposes, PubChem probably contains a Web page describing the compound we've been researching, and on that page may be a CAS number.&lt;/p&gt;

&lt;p&gt;Submitting our molecule to &lt;a href="http://pubchem.ncbi.nlm.nih.gov/search/"&gt;PubChem's search page&lt;/a&gt; produces &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=18687"&gt;one result&lt;/a&gt;. Fortunately, this page lists our compound's CAS number: 3290-53-7.&lt;/p&gt;
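
&lt;p&gt;Before pasting a CAS number into a search box, it can help to confirm that it was transcribed correctly. CAS Registry Numbers end in a check digit: the remaining digits are weighted 1, 2, 3, and so on from right to left, and the sum modulo 10 must equal the final digit. The following is only a minimal sketch of that check; the method name is hypothetical and not part of any library mentioned in this article.&lt;/p&gt;

```ruby
# Hypothetical helper (an illustration, not from the original article):
# verifies the check digit of a CAS Registry Number such as "3290-53-7".
def cas_check_digit_valid?(cas_number)
  # Drop the hyphens and turn each character into an integer digit.
  digits = cas_number.delete("-").chars.map { |c| c.to_i }
  # The last digit is the check digit; the rest carry the identifier.
  check = digits.pop
  # Weight the remaining digits 1, 2, 3, ... counting from the right.
  weighted_sum = digits.reverse.each_with_index.sum { |d, i| d * (i + 1) }
  weighted_sum % 10 == check
end

puts cas_check_digit_valid?("3290-53-7")
```

&lt;p&gt;For 3290-53-7 the weighted sum is 77, which gives the expected check digit of 7. A passing check only rules out simple typing errors; it says nothing about which compound the number refers to.&lt;/p&gt;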

&lt;h4&gt;Search by CAS Number&lt;/h4&gt;

&lt;p&gt;Submitting our CAS number to Google produces sixteen results. The first two link to the Landolt-B&amp;#246;rnstein pages. The next result &lt;a href="http://www.alfa.com/Alf/Product%20Indexes/Alfa_Complete/M20_idx.html"&gt;links to&lt;/a&gt; a product listing page for a chemical supplier.&lt;/p&gt;

&lt;p&gt;The fourth result &lt;a href="http://www.orgsyn.org/orgsyn/prep.asp?prep=cv5p0471"&gt;links to&lt;/a&gt; a far more interesting page - the &lt;a href="http://www.orgsyn.org/"&gt;&lt;em&gt;Organic Syntheses&lt;/em&gt;&lt;/a&gt; website. Fortunately for us, &lt;em&gt;Organic Syntheses&lt;/em&gt; makes its contents freely available online. Following the link takes us to a preparation in which one of the reagents can be substituted with our molecule of interest. Further down in this page, we can see that this molecule has been further &lt;a href="http://www.orgsyn.org/orgsyn/chemname.asp?nameID=41708"&gt;cross-referenced&lt;/a&gt;. Two procedures are listed, &lt;a href="http://www.orgsyn.org/orgsyn/prep.asp?prep=cv6p0722"&gt;one of which&lt;/a&gt; is new to us. Following the link, we find a complete synthetic procedure with full characterization and primary literature citations. Jackpot.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Organic Syntheses&lt;/em&gt; permits free public access, but is it &lt;a href="http://en.wikipedia.org/wiki/Open_access"&gt;Open Access&lt;/a&gt;? Many would say not, due to the fact that it retains full copyright to its works and doesn't permit free redistribution. The distinction mainly matters to those seeking to create &lt;a href="http://en.wikipedia.org/wiki/Open_Data"&gt;Open Data&lt;/a&gt; repositories based on the contents of periodicals such as &lt;em&gt;Organic Syntheses&lt;/em&gt;. To an end user, however, the distinction matters little in the short run.&lt;/p&gt;

&lt;p&gt;The remaining results from our Google search are interesting, but mainly consist of chemical supplier catalogs. It should, however, be noted that &lt;strong&gt;all of the results&lt;/strong&gt; returned by a Google search of our CAS number &lt;strong&gt;were relevant&lt;/strong&gt; to our molecule of interest.&lt;/p&gt;

&lt;h4&gt;InChI&lt;/h4&gt;

&lt;p&gt;&lt;a href="http://www.opensource.org/docs/definition.php"&gt; &lt;img src="http://www.opensource.org/trademarks/opensource/web/opensource-110x95.png" align="right" alt="Open Source (OSI) Logo" border="0" width="110" height="95"&gt;&lt;/a&gt;In an effort to overcome the limitations of CAS Registry Numbers and IUPAC systematic nomenclature as unique molecular identifiers, a new system has recently been introduced, the &lt;a href="http://www.iupac.org/inchi/"&gt;IUPAC International Chemical Identifier&lt;/a&gt; (InChI). In contrast to a CAS number, an InChI can be assigned independently of a central authority. Like systematic nomenclature, an InChI can be &lt;a href="http://depth-first.com/articles/2006/09/26/looking-at-inchis"&gt;decoded to a molecular representation&lt;/a&gt;. Unlike IUPAC systematic nomenclature, an InChI is generated by a &lt;a href="http://depth-first.com/articles/2006/08/12/inchi-canonicalization-algorithm"&gt;computer algorithm&lt;/a&gt; far too complicated for human use. The developers of the InChI software have released their work under an &lt;a href="http://www.opensource.org/"&gt;Open Source&lt;/a&gt; license, promoting its widespread use by ensuring that services like PubChem will have no difficulties integrating InChI with their software infrastructures. Unlike either CAS Numbers or IUPAC names, InChIs are not yet in widespread use, a point which currently limits their utility.&lt;/p&gt;

&lt;p&gt;The PubChem page for our search molecule listed an InChI, as do all PubChem Compound Summary pages. As shown by &lt;a href="http://video.google.com/videoplay?docid=-6653695245776470969"&gt;Peter Murray-Rust and others&lt;/a&gt;, it is perfectly feasible to use Google to search for InChIs. Let's try.&lt;/p&gt;

&lt;p&gt;Submitting our &lt;a href="http://www.google.com/search?hl=en&amp;amp;lr=&amp;amp;q=InChI%3D1%2FC10H12%2Fc1-9%282%298-10-6-4-3-5-7-10%2Fh3-7H%2C1%2C8H2%2C2H3&amp;amp;btnG=Search"&gt;InChI query to Google&lt;/a&gt; gives no results. Leaving off the leading 'InChI=' text, &lt;a href="http://video.google.com/videoplay?docid=-6653695245776470969"&gt;as briefly mentioned here&lt;/a&gt;, also results in no hits. This tells us that Google has found no instances of our InChI, and that Google still does not crawl PubChem Compound Summary pages.&lt;/p&gt;
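
&lt;p&gt;An InChI contains characters such as "=", "/", "(", and "," that must be percent-encoded before they can travel in a query string. The sketch below shows one way to build such a query with Ruby's standard CGI library. The InChI is the one embedded in the Google link above; the method name and the simplified URL (only the q parameter) are assumptions made for illustration.&lt;/p&gt;

```ruby
require "cgi"

# The InChI embedded in the article's Google query link.
inchi = "InChI=1/C10H12/c1-9(2)8-10-6-4-3-5-7-10/h3-7H,1,8H2,2H3"

# Hypothetical helper: percent-encodes a search term and appends it to a
# simplified Google search URL (the article's link carries extra parameters).
def google_query_url(term)
  "http://www.google.com/search?q=" + CGI.escape(term)
end

# Query with the full identifier.
puts google_query_url(inchi)
# Query with the leading "InChI=" prefix left off, the second variant tried.
puts google_query_url(inchi.sub("InChI=", ""))
```

&lt;p&gt;The encoded term produced this way should match the q value in the link above, so either form can be pasted straight into a browser.&lt;/p&gt;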

&lt;h4&gt;Use a Free Database&lt;/h4&gt;

&lt;p&gt;Numerous free chemistry databases are now running on the Internet. For example, a recent article highlighted &lt;a href="http://depth-first.com/articles/2007/01/24/thirty-two-free-chemistry-databases"&gt;thirty-two of them&lt;/a&gt;. Would one of them be useful to our search? We need to ask ourselves if we really want to perform more than thirty individual searches. What if we were looking for data on several molecules? Nothing would prevent us from doing this in theory, but in practice, this is too much work.&lt;/p&gt;

&lt;p&gt;What we'd really like is to submit a structure query to a single service that will query all of these free databases for us. While such services do exist in name, their breadth is restricted. A more comprehensive solution would be very helpful indeed.&lt;/p&gt;

&lt;h4&gt;Conclusions&lt;/h4&gt;

&lt;p&gt;The Web's convenience and ubiquity have prompted many calls for greater Web accessibility to public chemical information. As hinted at by the examples in this article, Open Source, Open Data, and Open Access are three interrelated forces that can make this vision a reality. Open Access journals lower the economic barriers to compiling Open Data sources. Making these Open Data sources useful to scientists in a cost-effective way requires Open Source software. The availability of good Open Source software stimulates the creative combination of Open Data sources. And so on.&lt;/p&gt;

&lt;p&gt;A lot needs to be done before this positive feedback loop can replace the status quo. But even with the chaotic, balkanized system that now exists, the benefits are plain to see. With even a small amount of coordination among Open Source software developers, Open Data providers, and scientific publishers, the most amazing things could happen.&lt;/p&gt;</description>
      <pubDate>Fri, 26 Jan 2007 16:21:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:0684cce8-a084-43e4-b454-d0ff9da40c6c</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2007/01/26/how-to-find-chemical-information-on-the-internet-why-open-source-open-access-and-open-data-matter</link>
      <category>Open X</category>
      <category>opensource</category>
      <category>opendata</category>
      <category>openaccess</category>
    </item>
    <item>
      <title>Making the Case: Topological Maximum Cross Correlation</title>
      <description>&lt;blockquote&gt;
    &lt;p&gt;... For the Gasteiger partial charges, we took maximum values for positive and negative charges from the &#8220;fragmentlike&#8221; subset of the ZINC database, consisting of 49 134[sic] molecules, carrying out the calculation with Open Babel 2.0.0.&lt;/p&gt;

    &lt;p&gt;...&lt;/p&gt;

    &lt;p&gt;... All structure handling, atom typing, and descriptor calculation was carried out using the open source Java library JOELib.&lt;/p&gt;

    &lt;p&gt;...&lt;/p&gt;

    &lt;p&gt;Source code (in Java) to generate the TMACC descriptors is freely available from our Web site under the GNU General Public License at &lt;a href="http://comp.chem.nottingham.ac.uk/download/tmacc/index.html"&gt;http://comp.chem.nottingham.ac.uk/download/tmacc/index.html&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;-&lt;cite&gt;James Melville and Jonathan Hirst, &lt;a href="http://dx.doi.org/10.1021/ci6004178"&gt;J. Chem. Inf. Model.&lt;/a&gt;&lt;/cite&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Science happens when the experiments and conclusions of your fellow scientists can be freely questioned and independently verified. For example, readers of the cited paper may have questions about the assumptions in the TMACC method, or how to implement it. Questions may be raised about the suitability of the data set used and how others would perform. Readers may even have questions about how to extend TMACC to areas not considered by the authors.&lt;/p&gt;

&lt;p&gt;By basing their work on open source software and open data, and by releasing their reference implementation as open source, Melville and Hirst raise their work to the level of science. The questions that any reasonable scientist would have about the work described in the paper can be answered at any desired level of detail because all source code and all test data are freely available.&lt;/p&gt;

&lt;p&gt;Why don't all authors adopt the same approach? Why doesn't a flagship journal such as &lt;em&gt;J. Chem. Inf. Model.&lt;/em&gt; require it of all manuscript submissions? As far back as 1984, John Figueras was &lt;a href="http://depth-first.com/articles/2006/08/23/readily-available-without-infringements-or-restrictions"&gt;making this case&lt;/a&gt;. Thankfully, Melville and Hirst are taking the message seriously.&lt;/p&gt;</description>
      <pubDate>Tue, 23 Jan 2007 16:01:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:f88fc27f-13d8-431a-8800-d1007ee72c82</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2007/01/23/making-the-case-topological-maximum-cross-correlation</link>
      <category>Open X</category>
      <category>opensource</category>
      <category>opendata</category>
      <category>tmacc</category>
    </item>
    <item>
      <title>From Famine to Feast: A Bumper Crop of Free Chemistry Databases</title>
      <description>&lt;p&gt;&lt;a href="http://pubchem.ncbi.nlm.nih.gov/"&gt;&lt;img src="http://depth-first.com/files/pubchemlogo.gif" align="right" border="0"&gt;&lt;/img&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
    &lt;p&gt;"Until PubChem came on the scene, the state of chemoinformatics compared to bioinformatics was 20 years behind," says Christopher Lipinski, who formulated the eponymous rule-of-five criteria for drug bioavailability.&lt;/p&gt;

    &lt;p&gt;-&lt;cite&gt;Monya Baker, &lt;a href="http://dx.doi.org/10.1038/nrd2148"&gt;Nature Reviews Drug Discovery&lt;/a&gt;&lt;/cite&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The number of free chemistry databases on the Web just keeps growing. A recent Depth-First article discussed &lt;a href="http://depth-first.com/articles/2006/11/07/twelve-free-chemistry-databases"&gt;twelve of them&lt;/a&gt;. It turns out that &lt;a href="http://www.indiana.edu/~cheminfo/cicc/"&gt;Chembiogrid&lt;/a&gt; from Indiana University maintains a &lt;a href="http://www.indiana.edu/~cheminfo/cicc/databases.html#free"&gt;list of forty free chemistry databases&lt;/a&gt;, most of which are alive and well.&lt;/p&gt;

&lt;p&gt;As this trend continues, the need for database standards will become painfully obvious. Not only will interoperable infrastructure technologies and user interface standards need to be developed, but thorny intellectual property issues including &lt;a href="http://depth-first.com/articles/2006/09/27/hacking-pubchem-free-speech-or-free-beer"&gt;access, chain of title&lt;/a&gt;, and &lt;a href="http://depth-first.com/articles/2006/09/22/hacking-pubchem-why-the-open-access-fight-is-just-the-beginning"&gt;digital rights&lt;/a&gt; will need to be resolved. However, the most immediate need is much more down-to-earth: to involve chemists with the growing number of free alternatives to the &lt;a href="http://www.cas.org/"&gt;chemical information monopoly&lt;/a&gt; they've come to rely on.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://www.numly.com/numly/verify.asp?id=98219-070105-899330-68"&gt;&lt;img alt="numly esn" src="http://numly.com/numly/icon.asp?id=9821907010589933068" border="0"&gt; 98219-070105-899330-68&lt;/a&gt; &lt;/p&gt;</description>
      <pubDate>Fri, 05 Jan 2007 14:53:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:701a5e92-94f4-4c6f-af18-f39018caec88</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2007/01/05/from-famine-to-feast-a-bumper-crop-of-free-chemistry-databases</link>
      <category>Databases</category>
      <category>pubchem</category>
      <category>opendata</category>
      <category>openaccess</category>
      <category>chembiogrid</category>
      <category>cas</category>
    </item>
    <item>
      <title>Open Source and Open Data: Why We Should Eat Our Own Dogfood</title>
      <description>&lt;p&gt;&lt;a href="http://flickr.com/photos/toyohara/78832630/"&gt;&lt;img src="http://depth-first.com/demo/20070103/will_he_eat.jpg" border="0" align="right"&gt;&lt;/img&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
    &lt;p&gt;The National Institutes of Health (NIH) has decided to use OpenEye Scientific Software's cheminformatics toolkits to provide key infrastructure for PubChem, a database of small organic molecules containing chemical structure and biological activities information. PubChem is being developed by the National Center for Biotechnology Information (NCBI) as part of the Molecular Libraries and Imaging component of the NIH Roadmap Initiative. "I am excited to see our software built into PubChem," says Roger Sayle OpenEye's Vice President of Bioinformatics. "It's gratifying that our software will be part of such a useful public resource."&lt;/p&gt;

    &lt;p&gt;...&lt;/p&gt;

    &lt;p&gt;Along with the recent decision by the &lt;a href="http://www.rcsb.org/"&gt;Research Collaboratory for Structural Bioinformatics&lt;/a&gt; (RCSB) to use OpenEye's cheminformatics toolkits to curate and depict the &lt;a href="http://www.rcsb.org/pdb/"&gt;Protein Data Bank (PDB) ligand dictionary&lt;/a&gt;, the NIH's decision is a clear indication of the speed and robustness of OpenEye's technology for large and diverse sets of chemical structures and data. "It has been beneficial working with the NCBI for their project," says Sayle. "Their data includes enough unusual chemistry to make it a nice validation of the software beyond our regular test sets."&lt;/p&gt;

    &lt;p&gt;-&lt;cite&gt;&lt;a href="http://www.eyesopen.com/about/news/press_releases/2004/PubChem.html"&gt;OpenEye Scientific Software Press Release - October 12, 2004&lt;/a&gt;&lt;/cite&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Why did &lt;a href="http://pubchem.ncbi.nlm.nih.gov/"&gt;PubChem&lt;/a&gt;, the granddaddy of all open chemistry databases, choose a closed, proprietary toolkit for its software infrastructure? A recent Depth-First article highlighted &lt;a href="http://depth-first.com/articles/2006/11/07/twelve-free-chemistry-databases"&gt;twelve free chemistry databases&lt;/a&gt;. Of those for which information is available, many have chosen the same path as PubChem. Why is this?&lt;/p&gt;

&lt;p&gt;A huge opportunity is being wasted every time this happens. We, the authors of Open Source software packages, could be working with the architects of Open Data systems to solve their problems in ways that vendors of closed systems can't. We could be using these Open Data systems as real-world proving grounds for our software, fixing bugs that would have never been detected otherwise, and pushing our systems to the limit. We could be identifying new and exciting uses for our software as the organizations we work with repeatedly ask "what if." Sadly, none of this is happening.  A great deal more needs to be done by the Open Source community to persuade the Open Data community to at least try their software. The worst that can happen is that we begin to understand the appeal of closed, proprietary products.&lt;/p&gt;

&lt;p&gt;One bright spot is &lt;a href="http://nmrshiftdb.ice.mpg.de/"&gt;NMRShiftDB&lt;/a&gt;, which uses the Open Source &lt;a href="http://cdk.sf.net"&gt;Chemistry Development Kit&lt;/a&gt; for its infrastructure. This is a fine example of Open Source software powering an Open Data source in chemistry. More examples of this kind of Open Source/Open Data symbiosis would go a long way toward making the case.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://en.wikipedia.org/wiki/Eat_one's_own_dog_food"&gt;Eating your own dogfood&lt;/a&gt; is an effective way to break into new markets and develop truly competitive products. After all, if those with closely-aligned goals won't use what you have to offer, who else will?&lt;/p&gt;</description>
      <pubDate>Wed, 03 Jan 2007 15:57:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:48445f83-d946-4b72-ad5e-d982bb95708b</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2007/01/03/open-source-and-open-data-why-we-should-eat-our-own-dogfood</link>
      <category>Meta</category>
      <category>opensource</category>
      <category>opendata</category>
      <category>openeye</category>
      <category>pubchem</category>
      <category>dogfood</category>
    </item>
    <item>
      <title>Hacking Molbank: Creating a Graphical Table of Contents</title>
      <description>&lt;p&gt;&lt;a href="http://www.mdpi.org/"&gt;&lt;img src="http://depth-first.com/files/mdpi-small.gif" border="0" align="right"&gt;&lt;/img&gt;&lt;/a&gt;&lt;a href="http://www.mdpi.org/"&gt;Molbank&lt;/a&gt; is an Open Access collection of single-compound articles on synthetic chemistry. Previous articles on Depth-First have highlighted Molbank's practice of including &lt;a href="http://depth-first.com/articles/2006/11/30/molbank-and-the-convergence-of-open-access-open-data-and-open-source-in-chemistry"&gt;machine-readable molecular representations of its content&lt;/a&gt;, and its very &lt;a href="http://depth-first.com/articles/2006/12/01/hacking-molbank-downloading-a-complete-chemistry-journal"&gt;liberal policy on mirroring and robots&lt;/a&gt;. In this article, we'll take advantage of both of these features to build something that was left out of Molbank: a graphical table of contents.&lt;/p&gt;

&lt;h4&gt;The Graphical Table of Contents (GTOC)&lt;/h4&gt;

&lt;p&gt;&lt;a href="http://depth-first.com/demo/20061211/molbank/index.html"&gt;The Molbank Graphical Table of Contents&lt;/a&gt; (Molbank GTOC) is available online. It consists of a single Web page containing a grid of color 2-D chemical structures representing the contents of Molbank. Each structure is hyperlinked into the Molbank site itself. Clicking on the structure takes you to the complete synthetic procedure and characterization data.&lt;/p&gt;

&lt;p&gt;&lt;center&gt;&lt;a href="http://depth-first.com/demo/20061211/molbank/index.html"&gt;&lt;img src="http://depth-first.com/demo/20061211/screenshot_1.png" border="0"&gt;&lt;/img&gt;&lt;/a&gt;&lt;/center&gt;&lt;/p&gt;

&lt;h4&gt;Prerequisites, Downloading, and Running&lt;/h4&gt;

&lt;p&gt;To run this project, you'll need &lt;a href="http://depth-first.com/articles/2006/10/30/agile-chemical-informatics-development-with-cdk-and-ruby-rcdk-0-3-0"&gt;Ruby CDK&lt;/a&gt;. A recent article described the small amount of system configuration required for &lt;a href="http://depth-first.com/articles/2006/09/25/cdk-the-ruby-way-rcdk-0-2-0"&gt;Ruby CDK on Linux&lt;/a&gt;. Another article showed how to install &lt;a href="http://depth-first.com/articles/2006/10/12/running-ruby-java-bridge-on-windows"&gt;Ruby CDK on Windows&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The complete source code for this project can be &lt;a href="http://rubyforge.org/frs/download.php/15500/molbank-0.0.1.tar.gz"&gt;downloaded from RubyForge&lt;/a&gt;. A subdirectory called &lt;strong&gt;demo&lt;/strong&gt; contains the pre-built final result.&lt;/p&gt;

&lt;p&gt;After unpacking the &lt;strong&gt;molbank-0.0.1&lt;/strong&gt; archive, the demo application can be run:&lt;/p&gt;

&lt;div class="console"&gt;
&lt;pre&gt;
$ cd molbank-0.0.1
$ ruby test.rb
&lt;/pre&gt;
&lt;/div&gt;

&lt;h4&gt;Problems, We've Got Problems&lt;/h4&gt;

&lt;p&gt;Several problems were uncovered while building the Molbank GTOC. This is to be expected with any data produced "in the wild" rather than within the safety of an Ivory Tower. Here are the main categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Blank Images&lt;/strong&gt; The entry for M52 is blank. Checking the &lt;a href="http://www.mdpi.net/molbank/m0052.mol"&gt;underlying molfile&lt;/a&gt; reveals four instances of bond stereo flags set to "6," a problem common to many of the blank images in the GTOC. According to the Molfile specification, a value of 6 marks a single bond as "Down"; the specification defines no such value for double bonds. Given that the &lt;a href="http://www.mdpi.net/molbank/m0052.htm"&gt;molecules shown in M52&lt;/a&gt; have only one possible stereo bond, and that the Molfile specification relies on 2-D coordinates to encode double-bond geometry, an encoding inconsistency or incorrect stereo interpretation may be the cause.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Images Containing an "R" Atom Label&lt;/strong&gt; Entry M53 shows an "R" group at what should be the carbonyl carbon. &lt;a href="http://www.mdpi.net/molbank/m0053.mol"&gt;The underlying molfile&lt;/a&gt; contains several less-common entries in the properties block, a common feature of images containing "R" in the GTOC.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Molfile not Found&lt;/strong&gt; Entry M95 has no associated Molfile because it simply reports errata for other articles. M253-M259, on the other hand, lack molfiles because the articles were "withdrawn before publication." M347 describes a cyclodextrin for which, understandably, no molfile was provided. There are also a couple of cases, such as M352, in which a link to a molfile is provided but the file itself is not available.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Broken Molfiles&lt;/strong&gt; &lt;a href="http://www.mdpi.net/molbank/m0162.mol"&gt;The Molfile for M162&lt;/a&gt; encodes its line endings as two carriage returns and a newline, giving rise to the appearance of blank lines after data lines. This is something the Molfile specification strictly forbids. Apparently, the underlying CDK molfile reader can only handle one carriage return and a newline. Perhaps the extra return was introduced as the file was copied into and out of text editors on various operating systems in preparation for uploading it to Molbank. Another common problem was binary files being used for molfiles, such as with &lt;a href="http://www.mdpi.net/molbank/molbank2005/m402.mol"&gt;M402&lt;/a&gt;. These files don't appear to be compressed with either Zip or GZip and their nature is currently unknown.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bogus Molfiles&lt;/strong&gt; For reasons I still can't understand, &lt;a href="http://www.mdpi.net/molbank/molbank2005/m407.mol"&gt;the Molfile for M407&lt;/a&gt; encodes ethylene. So do several other Molbank molfiles. Other common dummy molfiles include toluene, benzene, and ethane.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
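
&lt;p&gt;As a rough illustration of how some of these problems could be detected automatically, here is a minimal Ruby sketch. It is not part of the molbank package: the method names are hypothetical, the NUL-byte test for binary files is only a heuristic, and the bond parsing assumes the fixed-width V2000 layout, in which the counts line is the fourth line and the bond stereo flag occupies columns 10-12 of each bond line.&lt;/p&gt;

&lt;pre&gt;
```ruby
# Hypothetical helpers for screening molfiles; names are illustrative only.

# Heuristic (an assumption): a NUL byte suggests a binary file rather
# than a text molfile.
def binary_molfile?(data)
  data.bytes.include?(0)
end

# Collapse any run of carriage returns before a newline, so both
# "\r\r\n" (as described for M162) and "\r\n" become "\n".
def normalize_line_endings(text)
  text.gsub(/\r+\n/, "\n")
end

# Count the bonds whose stereo field equals the given flag.
# Assumes a V2000 molfile: three header lines, then the counts line
# (atom count in columns 1-3, bond count in columns 4-6), then the
# atom block, then the bond block.
def count_bond_stereo(text, flag)
  lines = normalize_line_endings(text).split("\n")
  counts = lines[3].to_s
  atom_count = counts[0, 3].to_i
  bond_count = counts[3, 3].to_i
  bond_lines = lines[4 + atom_count, bond_count] || []
  bond_lines.count { |line| line[9, 3].to_i == flag }
end
```
&lt;/pre&gt;

&lt;p&gt;Calling &lt;strong&gt;count_bond_stereo(text, 6)&lt;/strong&gt; on a downloaded molfile would report how many bonds carry the flag discussed for M52, and normalizing line endings before parsing is one possible workaround for files like M162.&lt;/p&gt;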

&lt;p&gt;After cataloging the problems that exist with the Molbank dataset and the software used to mine it, two interesting questions come into focus:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;What can be done to help Molbank fix the most obvious problems in their molfiles and would they accept these improvements?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How can "real" datasets like Molbank help developers build better cheminformatics software? (a graphical Molfile Debugger Utility would come in handy...)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Clearly, the connection between Open Access, Open Source, and Open Data is very strong and runs very deep.&lt;/p&gt;

&lt;h4&gt;Behind the Scenes&lt;/h4&gt;

&lt;p&gt;The Ruby Molbank GTOC generator works by connecting to the &lt;a href="http://www.mdpi.net"&gt;www.mdpi.net&lt;/a&gt; server to get its data in real time. Internally, the software creates a map of the Molbank website so that the molfile (and URL) for any article can be retrieved on demand. Each readable molfile is used to create a 2-D image using &lt;a href="http://rubyforge.org/projects/rcdk"&gt;Ruby CDK&lt;/a&gt;. As a final step, the &lt;strong&gt;index.html&lt;/strong&gt; page is generated, linking the 2-D images to a specific URL for a Molbank article. This file is &lt;a href="http://depth-first.com/articles/2006/11/13/cheminformatics-for-the-web-convert-sd-files-to-html-with-ruby-cdk"&gt;produced with eRuby&lt;/a&gt; using a previously-described technique.&lt;/p&gt;
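
&lt;p&gt;The idea of a site map with on-demand retrieval might be sketched as follows. This is not the code shipped in the molbank package: the &lt;strong&gt;MolbankMap&lt;/strong&gt; class and its methods are hypothetical, entries are registered by hand rather than discovered by crawling, and the fetcher is injectable so the class can be exercised without a network connection.&lt;/p&gt;

&lt;pre&gt;
```ruby
require "net/http"
require "uri"

# Hypothetical sketch of a map from article ids to their URLs.
class MolbankMap
  # fetcher: any object responding to call(url) and returning the
  # response body. Defaults to a plain HTTP GET.
  def initialize(fetcher = nil)
    @fetcher = fetcher || lambda { |url| Net::HTTP.get(URI.parse(url)) }
    @entries = {}
    @cache = {}
  end

  # Register an article id with its article page and molfile URLs.
  def add(id, article_url, molfile_url)
    @entries[id] = { article: article_url, molfile: molfile_url }
  end

  # URL of the article page, used as the link target in the GTOC.
  def article_url(id)
    @entries.fetch(id)[:article]
  end

  # Retrieve the molfile on demand; each URL is fetched at most once.
  def molfile(id)
    url = @entries.fetch(id)[:molfile]
    @cache[url] ||= @fetcher.call(url)
  end
end
```
&lt;/pre&gt;

&lt;p&gt;An entry such as M52 could be registered with &lt;strong&gt;add("M52", "http://www.mdpi.net/molbank/m0052.htm", "http://www.mdpi.net/molbank/m0052.mol")&lt;/strong&gt;, after which &lt;strong&gt;molfile("M52")&lt;/strong&gt; downloads the file the first time it is requested and reuses the cached copy afterward.&lt;/p&gt;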

&lt;h4&gt;Conclusions&lt;/h4&gt;

&lt;p&gt;Building a Graphical Table of Contents for Molbank is not that difficult, given the power of Ruby and Molbank's forward-thinking attitude toward mirroring and robots. In working on this project, several problems were uncovered, both with Molbank's data and with the software used to mine it.&lt;/p&gt;

&lt;p&gt;In some ways, the software described here and its output are less interesting than the larger questions they raise:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;How do scientific journals best serve not only their readers, but developers who want to provide new ways to use the journal?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How far does copyright extend in scientific publications? For example, are molfiles copyrightable? If so, at what level of detail are they not? If atom coordinates or some other kind of non-essential information is left out, does that change anything?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In what other practical ways could the connection between Open Source, Open Data, and Open Access be explored?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These and many related questions are waiting just around the corner. As Open Access becomes more viable, both &lt;a href="http://depth-first.com/articles/2006/10/19/disruptive-innovation-in-scientific-publishing-free-journal-management-systems"&gt;technically&lt;/a&gt; and &lt;a href="http://depth-first.com/articles/2006/10/26/more-open-access-in-the-sciences-metal-based-drugs-and-hindawi-publishing"&gt;commercially&lt;/a&gt;, look to Open Source and Open Data to provide the synergies that will unlock its true potential.&lt;/p&gt;</description>
      <pubDate>Mon, 11 Dec 2006 15:00:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:6c2f002b-3d8d-40fc-a4a5-8008c473e7d7</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2006/12/11/hacking-molbank-creating-a-graphical-table-of-contents</link>
      <category>Web</category>
      <category>molbank</category>
      <category>gtoc</category>
      <category>2d</category>
      <category>rcdk</category>
      <category>ruby</category>
      <category>mdpi</category>
      <category>opensource</category>
      <category>openaccess</category>
      <category>opendata</category>
    </item>
    <item>
      <title>Molbank and the Convergence of Open Access, Open Data, and Open Source in Chemistry</title>
      <description>&lt;p&gt;&lt;a href="http://www.mdpi.org/"&gt;&lt;img src="http://depth-first.com/files/mdpi-small.gif" border="0" align="right"&gt;&lt;/img&gt;&lt;/a&gt;&lt;a href="http://www.mdpi.org/molbank/"&gt;Molbank&lt;/a&gt;, published by &lt;a href="http://www.mdpi.org/"&gt;Molecular Diversity Preservation International&lt;/a&gt;, is one of the oldest of a handful of &lt;a href="http://depth-first.com/articles/2006/10/18/disruptive-innovation-in-scientific-publishing-directory-of-open-access-journals"&gt;Open Access journals in chemistry&lt;/a&gt;. Although its longevity is a remarkable accomplishment in itself, there is much more to Molbank than meets the eye. Just below the surface is a feature so revolutionary, yet simple, that chemistry publishers years from now will wonder why &lt;em&gt;they&lt;/em&gt; didn't implement it sooner.&lt;/p&gt;

&lt;p&gt;A Molbank article consists of a short monograph on a single compound, or possibly two. This may strike some scientists as a strange way to publish results, and it is unusual. On the other hand, this system offers vast potential to capture useful, but "unpublishable" findings that would otherwise be lost. Back when scientists actually read hardcopy journals, such a system would never have been feasible. Today, with hard drive space measured in terabytes, fiber optics cables crisscrossing the planet, Internet connectivity for almost everyone, and servers that can be had for virtually nothing, this system not only looks perfectly feasible, but preferable in many ways to the status quo.&lt;/p&gt;

&lt;p&gt;Here's the revolutionary part: each article that Molbank publishes is accompanied by a publicly-available, machine-readable file encoding the structure of the article's subject molecule. That's it. There's nothing tricky or high-tech about it. In fact, the practice is about as low-tech as you could imagine. The file format in which structures are encoded, molfile, dates back at least fifteen years, and nearly every piece of chemistry software - both end-user and developer tools - can handle it. What makes Molbank's practice revolutionary is that no other chemistry journal, Open Access or subscription-based, currently does this.&lt;/p&gt;

&lt;p&gt;Why does the simple inclusion of a publicly-available molfile encoding molecular structures in a paper matter so much? This is where the other two members of the trinity named in this article's title come into play: Open Source and Open Data. By providing a mechanism for a computer to decipher the chemistry in a paper, Molbank has opened the door to a host of highly-productive integration activities that nobody outside of &lt;a href="http://www.cas.org/"&gt;Chemical Abstracts Service&lt;/a&gt; has ever been able to contemplate, let alone prepare for.&lt;/p&gt;

&lt;p&gt;This article is the first in a series aimed at exploring the wide-open space that Molbank has created. Rather than arguing my point with words, I'll actually build working demonstrations of what is now easily within reach. At the same time, I'll document my work on this blog. I'm not sure where all of this will end up, but I do hope to shine some light on a vital, although currently obscure, component of the Open Access debate.&lt;/p&gt;</description>
      <pubDate>Thu, 30 Nov 2006 15:01:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:0ec69fe1-07ac-46d0-9112-95afd038e81f</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2006/11/30/molbank-and-the-convergence-of-open-access-open-data-and-open-source-in-chemistry</link>
      <category>Open X</category>
      <category>opensource</category>
      <category>opendata</category>
      <category>openaccess</category>
      <category>mdpi</category>
      <category>molbank</category>
      <category>integration</category>
      <category>molfile</category>
    </item>
    <item>
      <title>Readily Available, Without Infringements or Restrictions</title>
      <description>&lt;blockquote&gt;
    &lt;p&gt;...If we consider that one of the purposes of publication is to offer &lt;em&gt;testable&lt;/em&gt; data, then it would seem that a minimum requirement would be that where computer programs and their results are presented, the author will make source code available on request. ACS could render good service by undertaking the distribution of such requested code. Furthermore, I would make it a condition for publication that such source code be provided. If the scientist is unwilling to disclose his code because he wishes to engage in a commercial venture, then I suggest that he be invited to take out a paid advertisement in the journal and be denied the privilege of publication to promote his product.&lt;/p&gt;

    &lt;p&gt;&lt;cite&gt;-John Figueras &lt;a href="http://dx.doi.org/10.1021/ci00044a601"&gt;J. Chem. Inf. Comput. Sci. 1984, 24, 276&lt;/a&gt;&lt;/cite&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Science moves forward only insofar as observations can be validated and put to use by a third party. Chemical informatics is no different from any other field in this respect. Yet publications of the type Mr. Figueras opposed can still be found in 2006. Why is this?&lt;/p&gt;

&lt;p&gt;At issue isn't just software. The ACS has recently spoken out on the necessity of &lt;a href="http://dx.doi.org/10.1021/ci0680079"&gt;open data sets&lt;/a&gt;. As a condition for publication, any data reported in a manuscript must now either appear in Supplementary Material or be &#8220;readily available, without infringements or restrictions.&#8221; Although this is a positive development, the wait continues for an equivalent statement on the availability of source code.&lt;/p&gt;

&lt;p&gt;Open software systems and open data packages are most useful when they can be readily found by others and used together. In an effort to work on this problem, several individuals, including myself, formed &lt;a href="http://blueobelisk.org"&gt;The Blue Obelisk&lt;/a&gt; group. Through this group and others like it, like-minded researchers can begin to reap the benefits of openness &lt;a href="http://bioinformatics.org/"&gt;enjoyed by other fields&lt;/a&gt;.&lt;/p&gt;</description>
      <pubDate>Wed, 23 Aug 2006 05:47:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:97e279ea-8589-4f72-aff1-9c877bad3f69</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2006/08/23/readily-available-without-infringements-or-restrictions</link>
      <category>Open X</category>
      <category>opensource</category>
      <category>blueobelisk</category>
      <category>opendata</category>
    </item>
  </channel>
</rss>
