<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="/stylesheets/rss.css"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/">
  <channel>
    <title>Depth-First: Tag molbank</title>
    <link>http://depth-first.com/articles/tag/molbank</link>
    <language>en-us</language>
    <ttl>40</ttl>
    <description>Walking the Web of Chemical Informatics</description>
    <item>
      <title>Hacking Molbank: Creating a Graphical Table of Contents</title>
      <description>&lt;p&gt;&lt;a href="http://www.mdpi.org/"&gt;&lt;img src="http://depth-first.com/files/mdpi-small.gif" border="0" align="right"&gt;&lt;/img&gt;&lt;/a&gt;&lt;a href="http://www.mdpi.org/"&gt;Molbank&lt;/a&gt; is an Open Access collection of single-compound articles on synthetic chemistry. Previous articles on Depth-First have highlighted Molbank's practice of including &lt;a href="http://depth-first.com/articles/2006/11/30/molbank-and-the-convergence-of-open-access-open-data-and-open-source-in-chemistry"&gt;machine-readable molecular representations of its content&lt;/a&gt;, and its very &lt;a href="http://depth-first.com/articles/2006/12/01/hacking-molbank-downloading-a-complete-chemistry-journal"&gt;liberal policy on mirroring and robots&lt;/a&gt;. In this article, we'll take advantage of both of these features to build something that was left out of Molbank: a graphical table of contents.&lt;/p&gt;

&lt;h4&gt;The Graphical Table of Contents (GTOC)&lt;/h4&gt;

&lt;p&gt;&lt;a href="http://depth-first.com/demo/20061211/molbank/index.html"&gt;The Molbank Graphical Table of Contents&lt;/a&gt; (Molbank GTOC) is available online. It consists of a single Web page containing a grid of color 2-D chemical structures representing the contents of Molbank. Each structure is hyperlinked into the Molbank site itself. Clicking on the structure takes you to the complete synthetic procedure and characterization data.&lt;/p&gt;

&lt;p&gt;&lt;center&gt;&lt;a href="http://depth-first.com/demo/20061211/molbank/index.html"&gt;&lt;img src="http://depth-first.com/demo/20061211/screenshot_1.png" border="0"&gt;&lt;/img&gt;&lt;/a&gt;&lt;/center&gt;&lt;/p&gt;

&lt;h4&gt;Prerequisites, Downloading, and Running&lt;/h4&gt;

&lt;p&gt;To run this project, you'll need &lt;a href="http://depth-first.com/articles/2006/10/30/agile-chemical-informatics-development-with-cdk-and-ruby-rcdk-0-3-0"&gt;Ruby CDK&lt;/a&gt;. A recent article described the small amount of system configuration required for &lt;a href="http://depth-first.com/articles/2006/09/25/cdk-the-ruby-way-rcdk-0-2-0"&gt;Ruby CDK on Linux&lt;/a&gt;. Another article showed how to install &lt;a href="http://depth-first.com/articles/2006/10/12/running-ruby-java-bridge-on-windows"&gt;Ruby CDK on Windows&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The complete source code for this project can be &lt;a href="http://rubyforge.org/frs/download.php/15500/molbank-0.0.1.tar.gz"&gt;downloaded from RubyForge&lt;/a&gt;. A subdirectory called &lt;strong&gt;demo&lt;/strong&gt; contains the pre-built final result.&lt;/p&gt;

&lt;p&gt;After unpacking the &lt;strong&gt;molbank-0.0.1&lt;/strong&gt; archive, the demo application can be run:&lt;/p&gt;

&lt;div class="console"&gt;
&lt;pre&gt;
$ cd molbank-0.0.1
$ ruby test.rb
&lt;/pre&gt;
&lt;/div&gt;

&lt;h4&gt;Problems, We've Got Problems&lt;/h4&gt;

&lt;p&gt;Several problems were uncovered while building the Molbank GTOC. This is to be expected with any data produced "in the wild" rather than within the safety of an Ivory Tower. Here are the main categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Blank Images&lt;/strong&gt; The entry for M52 is blank. Checking the &lt;a href="http://www.mdpi.net/molbank/m0052.mol"&gt;underlying molfile&lt;/a&gt; reveals four instances of bond stereo flags set to "6," a problem common to many of the blank images in the GTOC. According to the Molfile specification, a value of 6 indicates "Down, double bonds," whatever that means. Given that the &lt;a href="http://www.mdpi.net/molbank/m0052.htm"&gt;molecules shown in M52&lt;/a&gt; only have one possible stereo bond, and that the Molfile specification relies on 2-D coordinates to encode double-bond geometry, an encoding inconsistency or incorrect stereo interpretation may be the cause.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Images Containing an "R" Atom Label&lt;/strong&gt; Entry M53 shows an "R" group at what should be the carbonyl carbon. &lt;a href="http://www.mdpi.net/molbank/m0053.mol"&gt;The underlying molfile&lt;/a&gt; contains several less-common entries in the properties block, a common feature of images containing "R" in the GTOC.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Molfile not Found&lt;/strong&gt; Entry M95 has no associated Molfile because it simply reports errata for other articles. M253-M259, on the other hand, lack molfiles because the articles were "withdrawn before publication." M347 describes a cyclodextrin for which, understandably, no molfile was provided. There are also a couple of cases, such as M352, in which a link to a molfile is provided but the file itself is not available.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Broken Molfiles&lt;/strong&gt; &lt;a href="http://www.mdpi.net/molbank/m0162.mol"&gt;The Molfile for M162&lt;/a&gt; encodes its line endings as two carriage returns and a newline, giving rise to the appearance of blank lines after data lines. This is something the Molfile specification strictly forbids. Apparently, the underlying CDK molfile reader can only handle one carriage return and a newline. Perhaps the extra return was introduced as the file was copied into and out of text editors on various operating systems in preparation for uploading it to Molbank. Another common problem was binary files being used for molfiles, such as with &lt;a href="http://www.mdpi.net/molbank/molbank2005/m402.mol"&gt;M402&lt;/a&gt;. These files don't appear to be compressed with either Zip or GZip and their nature is currently unknown.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bogus Molfiles&lt;/strong&gt; For reasons I still can't understand, &lt;a href="http://www.mdpi.net/molbank/molbank2005/m407.mol"&gt;the Molfile for M407&lt;/a&gt; encodes ethylene. So do several other Molbank molfiles. Other common dummy molfiles include toluene, benzene, and ethane.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
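
&lt;p&gt;As a rough illustration of how two of these problems might be detected before a molfile ever reaches the reader, here is a minimal Ruby sketch. It is not part of the molbank-0.0.1 source: the method names &lt;tt&gt;normalize_line_endings&lt;/tt&gt; and &lt;tt&gt;bonds_with_stereo&lt;/tt&gt; and the sample data are hypothetical, and the code assumes the fixed-column V2000 layout (counts line on line four, bond stereo in columns 10-12 of each bond line).&lt;/p&gt;

&lt;pre&gt;
```ruby
# Collapse any run of carriage returns followed by a newline ("\r\r\n" or
# "\r\n") into a single "\n", so M162-style files lose their phantom
# blank lines.
def normalize_line_endings(text)
  text.gsub(/\r+\n/, "\n")
end

# Return the bond-block lines whose stereo field equals the given value.
# Assumption: V2000 layout, where line 4 is the counts line and its first
# two three-character fields hold the atom and bond counts.
def bonds_with_stereo(molfile, value)
  lines = normalize_line_endings(molfile).split("\n")
  counts_line = lines[3].to_s
  atom_count = counts_line[0, 3].to_i
  bond_count = counts_line[3, 3].to_i
  bond_lines = lines[4 + atom_count, bond_count] || []
  bond_lines.select { |line| line[9, 3].to_i == value }
end

# Hypothetical two-atom molfile using the doubled carriage returns
# described for M162 and one bond carrying stereo flag 6.
sample = [
  "hypothetical sample",
  "  sketch",
  "",
  "  2  1  0  0  0  0  0  0  0  0999 V2000",
  "    0.0000    0.0000    0.0000 C   0  0",
  "    1.0000    0.0000    0.0000 C   0  0",
  "  1  2  1  6",
  "M  END"
].join("\r\r\n")

p bonds_with_stereo(sample, 6)
```
&lt;/pre&gt;

&lt;p&gt;Collapsing the line endings first means the column arithmetic only ever sees one data line per array element. Whether a flagged bond is actually wrong still has to be judged by looking at the structure.&lt;/p&gt;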

&lt;p&gt;After cataloging the problems that exist with the Molbank dataset and the software used to mine it, two interesting questions come into focus:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;What can be done to help Molbank fix the most obvious problems in their molfiles and would they accept these improvements?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How can "real" datasets like Molbank help developers build better cheminformatics software? (a graphical Molfile Debugger Utility would come in handy...)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Clearly, the connection between Open Access, Open Source, and Open Data is very strong and runs very deep.&lt;/p&gt;

&lt;h4&gt;Behind the Scenes&lt;/h4&gt;

&lt;p&gt;The Ruby Molbank GTOC generator works by connecting to the &lt;a href="http://www.mdpi.net"&gt;www.mdpi.net&lt;/a&gt; server to get its data in real time. Internally, the software creates a map of the Molbank website so that the molfile (and URL) for any article can be retrieved on demand. Each readable molfile is used to create a 2-D image using &lt;a href="http://rubyforge.org/projects/rcdk"&gt;Ruby CDK&lt;/a&gt;. As a final step, the &lt;strong&gt;index.html&lt;/strong&gt; page is generated, linking the 2-D images to a specific URL for a Molbank article. This file is &lt;a href="http://depth-first.com/articles/2006/11/13/cheminformatics-for-the-web-convert-sd-files-to-html-with-ruby-cdk"&gt;produced with eRuby&lt;/a&gt; using a previously-described technique.&lt;/p&gt;

&lt;h4&gt;Conclusions&lt;/h4&gt;

&lt;p&gt;Building a Graphical Table of Contents for Molbank is not that difficult given the power of Ruby, and Molbank's forward-thinking attitude toward mirroring and robots. In working on this project, several problems were uncovered, both with Molbank's data, and the software used to mine it.&lt;/p&gt;

&lt;p&gt;In some ways, the software described here and its output are less interesting than the larger questions they raise:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;How do scientific journals best serve not only their readers, but developers who want to provide new ways to use the journal?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How far does copyright extend in scientific publications? For example, are molfiles copyrightable? If so, at what level of detail are they not? If atom coordinates or some other kind of non-essential information is left out, does that change anything?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In what other practical ways could the connection between Open Source, Open Data, and Open Access be explored?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These and many related questions are waiting just around the corner. As Open Access becomes more viable, both &lt;a href="http://depth-first.com/articles/2006/10/19/disruptive-innovation-in-scientific-publishing-free-journal-management-systems"&gt;technically&lt;/a&gt; and &lt;a href="http://depth-first.com/articles/2006/10/26/more-open-access-in-the-sciences-metal-based-drugs-and-hindawi-publishing"&gt;commercially&lt;/a&gt;, look to Open Source and Open Data to provide the synergies that will unlock its true potential.&lt;/p&gt;</description>
      <pubDate>Mon, 11 Dec 2006 15:00:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:6c2f002b-3d8d-40fc-a4a5-8008c473e7d7</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2006/12/11/hacking-molbank-creating-a-graphical-table-of-contents</link>
      <category>Web</category>
      <category>molbank</category>
      <category>gtoc</category>
      <category>2d</category>
      <category>rcdk</category>
      <category>ruby</category>
      <category>mdpi</category>
      <category>opensource</category>
      <category>openaccess</category>
      <category>opendata</category>
    </item>
    <item>
      <title>Hacking Molbank: Downloading a Complete Chemistry Journal</title>
      <description>&lt;p&gt;&lt;a href="http://www.mdpi.org/"&gt;&lt;img src="http://depth-first.com/files/mdpi-small.gif" border="0" align="right"&gt;&lt;/img&gt;&lt;/a&gt;The previous article in this series highlighted Molbank as a tool for studying the &lt;a href="http://depth-first.com/articles/2006/11/30/molbank-and-the-convergence-of-open-access-open-data-and-open-source-in-chemistry"&gt;convergence of Open Access, Open Data, and Open Source in chemistry&lt;/a&gt;. This article will outline some of the technical and legal aspects of downloading and using Molbank content.&lt;/p&gt;

&lt;h4&gt;Mirror, Mirror&lt;/h4&gt;

&lt;p&gt;MDPI themselves &lt;a href="http://mdpi.net/MIRRORING/mirroring.html"&gt;actively encourage&lt;/a&gt; the copying of their journal content by a process known as mirroring:&lt;/p&gt;

&lt;blockquote&gt;
    &lt;p&gt;We encourage two types of mirroring :&lt;/p&gt;

    &lt;ul&gt;
    &lt;li&gt;Institutional Mirroring : Institutions may help not only their own members, but neighbouring scientists, to have a faster and reliable access to MDPI journals. For institutions, this is a tradeoff : they save bandwidth on outgoing traffic, while having more inbound traffic. One positive aspect is that sites supporting mirrors become more visited and better known. We are going to maintain a list of supporting institutional mirror sites which is going to be presented in an extremely visible fashion, on the welcome pages of each journal, so that all MDPI readers can access the nearest site.&lt;/li&gt;
    &lt;li&gt;Personnal Mirroring : With hard disks becoming larger and cheaper, it becomes not unreasonnable to set up his/her own personnal mirror, with all the information at your fingertips !. An automated procedure, running at night, keeps your personnal mirror always updated. This is extremely convenient. You may keep this mirror to yourself, or openned to your colleagues, you may do what you wish !&lt;/li&gt;
    &lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;The text then goes on to give explicit instructions on how to create a mirror of the entire MDPI site and all of its journal content using Linux. So not only does MDPI explicitly allow the non-commercial copying of their content, but that copy can then be hosted on the Web, transmitted through other media, or simply used locally. It's the last of these uses that this article will address.&lt;/p&gt;

&lt;h4&gt;Create a Molbank Archive&lt;/h4&gt;

&lt;p&gt;The Unix command &lt;tt&gt;wget&lt;/tt&gt; can be used to copy the content of any website. Before using &lt;tt&gt;wget&lt;/tt&gt;, or any similar tool, you should &lt;a href="http://depth-first.com/articles/2006/09/22/hacking-pubchem-why-the-open-access-fight-is-just-the-beginning"&gt;check the &lt;tt&gt;robots.txt&lt;/tt&gt; file&lt;/a&gt; for the site of interest. I have so far been unable to find a &lt;tt&gt;robots.txt&lt;/tt&gt; file on the MDPI site, so I assume there is no problem with running either &lt;tt&gt;wget&lt;/tt&gt; or other robotic agents against it. For the purposes of this tutorial, though, it is more convenient to work from a local copy than to query the live site repeatedly.&lt;/p&gt;

&lt;p&gt;To create a local copy of all 2005 articles in Molbank, for example, use &lt;tt&gt;wget&lt;/tt&gt; with the appropriate arguments:&lt;/p&gt;

&lt;div class="console"&gt;
&lt;pre&gt;
$ wget -r -l2 http://www.mdpi.net/molbank/molbank2005.htm
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;The &lt;tt&gt;-r&lt;/tt&gt; flag turns on recursive directory retrieval, and the &lt;tt&gt;-l2&lt;/tt&gt; flag sets the retrieval depth to two.&lt;/p&gt;

&lt;p&gt;When the process is complete, you should have a directory called &lt;strong&gt;www.mdpi.net&lt;/strong&gt; in your working directory. This directory will contain a subdirectory called &lt;strong&gt;molbank&lt;/strong&gt; which in turn contains two directories: &lt;strong&gt;2005&lt;/strong&gt; and &lt;strong&gt;2006&lt;/strong&gt;. Under the &lt;strong&gt;2005&lt;/strong&gt; directory, you'll find all of Molbank's articles in HTML format, all images, and all molfiles. It's not clear to me yet why the &lt;strong&gt;2006&lt;/strong&gt; directory is created and why it only contains one article.&lt;/p&gt;

&lt;h4&gt;Checking the Archive&lt;/h4&gt;

&lt;p&gt;A large number of Molbank's molfiles appear to be corrupted. This isn't related to &lt;tt&gt;wget&lt;/tt&gt;, because these files are also corrupted when viewed through a browser directly from &lt;a href="http://www.mdpi.org"&gt;http://www.mdpi.org&lt;/a&gt;. For example, the molfile for Molbank article #393 appears corrupted (as do all of the other molfiles for July 2005):&lt;/p&gt;

&lt;p&gt;&lt;a href="http://www.mdpi.org/molbank/molbank2005/m393.mol"&gt;http://www.mdpi.org/molbank/molbank2005/m393.mol&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You'll also find several instances of bogus molfiles containing only one or two atoms, such as for Molbank article #431:&lt;/p&gt;

&lt;p&gt;&lt;a href="http://www.mdpi.org/molbank/molbank2005/m431.mol"&gt;http://www.mdpi.org/molbank/molbank2005/m431.mol&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Some molfiles are missing altogether, such as the one for Molbank article #405:&lt;/p&gt;

&lt;p&gt;&lt;a href="http://www.mdpi.org/molbank/molbank2005/m405.mol"&gt;http://www.mdpi.org/molbank/molbank2005/m405.mol&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Clearly, the integrity of Molbank's molfiles cannot be assumed. Software designed to work with this dataset will therefore need to be capable of gracefully handling corrupted, nonexistent, and bogus molfiles.&lt;/p&gt;
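
&lt;p&gt;One way such defensive handling might look is sketched below in Ruby. This is only an illustration under stated assumptions, not code from any Molbank tool: the method name &lt;tt&gt;classify_molfile&lt;/tt&gt;, the three-atom threshold, and the rule that any non-printable byte marks a file as corrupted are all hypothetical choices. The counts-line columns follow the V2000 molfile layout.&lt;/p&gt;

&lt;pre&gt;
```ruby
# Classify raw molfile content as :missing, :corrupted, :bogus, or :ok.
# Assumptions (illustrative only): a usable molfile contains nothing but
# printable ASCII plus tab, CR, and LF, and a structure with fewer than
# min_atoms atoms is treated as a placeholder.
def classify_molfile(content, min_atoms: 3)
  return :missing if content.nil? || content.empty?

  printable = [9, 10, 13] + (32..126).to_a
  return :corrupted unless content.bytes.all? { |b| printable.include?(b) }

  # Line 4 is the counts line; its first three characters hold the atom
  # count and the next three the bond count.
  counts_line = content.split(/\r*\n/)[3].to_s
  return :corrupted unless counts_line =~ /\A[ \d]{6}/

  atom_count = counts_line[0, 3].to_i
  return :bogus if min_atoms > atom_count

  :ok
end

# Hypothetical inputs: nothing downloaded, binary junk, and a two-atom
# placeholder structure.
placeholder = [
  "hypothetical placeholder",
  "  sketch",
  "",
  "  2  1  0  0  0  0  0  0  0  0999 V2000",
  "    0.0000    0.0000    0.0000 C   0  0",
  "    1.3000    0.0000    0.0000 C   0  0",
  "  1  2  2  0",
  "M  END"
].join("\n")

p classify_molfile(nil)
p classify_molfile([0, 255, 137].pack("C*"))
p classify_molfile(placeholder)
```
&lt;/pre&gt;

&lt;p&gt;An atom-count threshold only catches the one- and two-atom placeholders. Larger dummy structures would pass as &lt;tt&gt;:ok&lt;/tt&gt; and would need some other check, and a molfile with non-ASCII characters in its header would be reported as corrupted under this rule.&lt;/p&gt;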

&lt;h4&gt;Conclusions&lt;/h4&gt;

&lt;p&gt;Molbank permits the non-profit copying of its entire article collection. With some simple command-line tools, it's possible to quickly and easily create your own personal Molbank mirror. A cursory examination of the molfiles contained in Molbank showed several problems that need to be taken into consideration. The remaining articles in this series will describe some ways that Molbank's content can be put to use with Open Source software, and mashed up with Open Data.&lt;/p&gt;</description>
      <pubDate>Fri, 01 Dec 2006 15:13:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:c3ddc1b1-2497-414b-89d4-afbfc6fa38e6</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2006/12/01/hacking-molbank-downloading-a-complete-chemistry-journal</link>
      <category>Tools</category>
      <category>molbank</category>
      <category>wget</category>
      <category>molfile</category>
      <category>mirror</category>
    </item>
    <item>
      <title>Molbank and the Convergence of Open Access, Open Data, and Open Source in Chemistry</title>
      <description>&lt;p&gt;&lt;a href="http://www.mdpi.org/"&gt;&lt;img src="http://depth-first.com/files/mdpi-small.gif" border="0" align="right"&gt;&lt;/img&gt;&lt;/a&gt;&lt;a href="http://www.mdpi.org/molbank/"&gt;Molbank&lt;/a&gt;, published by &lt;a href="http://www.mdpi.org/"&gt;Molecular Diversity Preservation International&lt;/a&gt;, is one of the oldest of a handful of &lt;a href="http://depth-first.com/articles/2006/10/18/disruptive-innovation-in-scientific-publishing-directory-of-open-access-journals"&gt;Open Access journals in chemistry&lt;/a&gt;. Although its longevity is a remarkable accomplishment in itself, there is much more to Molbank than meets the eye. Just below the surface is a feature so revolutionary, yet simple, that chemistry publishers years from now will wonder why &lt;em&gt;they&lt;/em&gt; didn't implement it sooner.&lt;/p&gt;

&lt;p&gt;A Molbank article consists of a short monograph on a single compound, or possibly two. This may strike some scientists as a strange way to publish results, and it is unusual. On the other hand, this system offers vast potential to capture useful, but "unpublishable" findings that would otherwise be lost. Back when scientists actually read hardcopy journals, such a system would never have been feasible. Today, with hard drive space measured in terabytes, fiber optics cables crisscrossing the planet, Internet connectivity for almost everyone, and servers that can be had for virtually nothing, this system not only looks perfectly feasible, but preferable in many ways to the status quo.&lt;/p&gt;

&lt;p&gt;Here's the revolutionary part: each article that Molbank publishes is accompanied by a publicly-available, machine-readable file encoding the structure of the article's subject molecule. That's it. There's nothing tricky or high-tech about it. In fact, the practice is about as low-tech as you could imagine. The file format in which structures are encoded, molfile, dates back at least fifteen years, and nearly every piece of chemistry software - both end-user and developer tools - can handle it. What makes Molbank's practice revolutionary is that not a single other chemistry journal, Open Access or subscription-based, currently does this.&lt;/p&gt;

&lt;p&gt;Why does the simple inclusion of a publicly-available molfile encoding molecular structures in a paper matter so much? This is where the other two entities of the trinity named in this article's title come into play: Open Source and Open Data. By providing a mechanism for a computer to decipher the chemistry in a paper, Molbank has opened the door to a host of highly-productive integration activities that nobody outside of &lt;a href="http://www.cas.org/"&gt;Chemical Abstracts Service&lt;/a&gt; has even been able to contemplate, let alone prepare for.&lt;/p&gt;

&lt;p&gt;This article is the first in a series aimed at exploring the wide-open space that Molbank has created. Rather than arguing my point with words, I'll actually build working demonstrations of what is now easily within reach. At the same time, I'll document my work on this blog. I'm not sure where all of this will end up, but I do hope to shine some light on a vital, although currently obscure, component of the Open Access debate.&lt;/p&gt;</description>
      <pubDate>Thu, 30 Nov 2006 15:01:00 +0000</pubDate>
      <guid isPermaLink="false">urn:uuid:0ec69fe1-07ac-46d0-9112-95afd038e81f</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2006/11/30/molbank-and-the-convergence-of-open-access-open-data-and-open-source-in-chemistry</link>
      <category>Open X</category>
      <category>opensource</category>
      <category>opendata</category>
      <category>openaccess</category>
      <category>mdpi</category>
      <category>molbank</category>
      <category>integration</category>
      <category>molfile</category>
    </item>
  </channel>
</rss>
