<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="/stylesheets/rss.css"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/">
  <channel>
    <title>Depth-First: Tag merckindex</title>
    <link>http://depth-first.com/articles/tag/merckindex</link>
    <language>en-us</language>
    <ttl>40</ttl>
    <description>Walking the Web of Chemical Informatics</description>
    <item>
      <title>Building Chempedia: Learning About Contributors</title>
      <description>&lt;p&gt;&lt;a href="http://chempedia.com"&gt;&lt;img src="http://depth-first.com/demo/20080513/chempedia.png" align="right"&gt;&lt;/img&gt;&lt;/a&gt;&lt;a href="http://chempedia.com/"&gt;Chempedia&lt;/a&gt; is a free online chemical encyclopedia similar in concept to the Merck Index, but &lt;a href="http://depth-first.com/articles/2008/04/28/building-chempedia-indexing-wikipedias-6-411-compound-monographs"&gt;radically different&lt;/a&gt; in implementation. One key difference: the Merck Index is compiled by a small number of paid professionals while Chempedia is compiled by thousands of unpaid volunteers. Although this distinction raises a host of intriguing questions, one of the most basic revolves around what can be said about these volunteers in the aggregate. This article, the first in a series, explores this issue with some statistics compiled from Chempedia.&lt;/p&gt;

&lt;h4&gt;Learning About Contributors&lt;/h4&gt;

&lt;p&gt;Chempedia works in part by aggregating content from Wikipedia dealing with single molecular entities, or "Compound Monographs." This content is created by the now &lt;a href="http://en.wikipedia.org/wiki/Wikipedia:Introduction"&gt;famous process&lt;/a&gt; of individuals taking upon themselves the responsibility of fixing what's broken in Wikipedia. (Some take it upon themselves to &lt;a href="http://en.wikipedia.org/wiki/Wikipedia:Vandalism"&gt;break what's working&lt;/a&gt;, but that's another topic.)&lt;/p&gt;

&lt;p&gt;Chempedia associates each of its Compound Monographs with the last Wikipedia user to edit it. The current interface to these relationships is available on the &lt;a href="http://chempedia.com/contributors"&gt;Chempedia contributors page&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The interface to this page is currently limited. The analyses reported here were made for the most part by querying the Chempedia database directly.&lt;/p&gt;

&lt;p&gt;Each contributor is linked to a contributor summary page containing links to that user's Wikipedia homepage and talk page, as well as a complete listing of all active contributions. For example, you can view the contributor page for one of Chempedia's most active contributors, &lt;a href="http://chempedia.com/contributors/40"&gt;Arcadian&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The data model is also limited. Because Chempedia records only the last Contributor to edit a Monograph, the link between the previous Contributor and that Monograph is lost whenever another Contributor edits it. As a result, many Contributors have no associated Monographs.&lt;/p&gt;
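
&lt;p&gt;The effect of recording only the last editor can be sketched in a few lines of Ruby. This is a minimal illustration rather than Chempedia's actual schema: the hash, the method name, and the contributor names 'alice' and 'bob' are all hypothetical.&lt;/p&gt;

```ruby
# Hypothetical in-memory stand-in for the association described above:
# each Monograph title maps to exactly one Contributor, its last editor.
last_editor = {}

# Recording an edit overwrites whichever Contributor was stored before.
def record_edit(last_editor, monograph, contributor)
  last_editor[monograph] = contributor
end

record_edit(last_editor, 'benzene', 'alice')
record_edit(last_editor, 'benzene', 'bob')

# Count Monographs per Contributor. Only the most recent editor is counted,
# so 'alice' is left with no associated Monographs.
counts = Hash.new(0)
last_editor.each_value { |contributor| counts[contributor] += 1 }
puts counts.inspect
```

&lt;p&gt;Keeping a full edit history per Monograph would avoid this loss, at the cost of a larger data model.&lt;/p&gt;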

&lt;h4&gt;How Many Monographs?&lt;/h4&gt;

&lt;p&gt;Chempedia currently hosts 6,308 Compound Monographs.&lt;/p&gt;

&lt;h4&gt;How Many Contributors?&lt;/h4&gt;

&lt;p&gt;Chempedia currently lists &lt;a href="http://chempedia.com/contributors"&gt;2,516 Contributors&lt;/a&gt;. Of these, 1,046, or 42%, are associated with one or more Monographs, meaning that they were the last to edit them. The remaining 1,470 are not currently the last editor of any Monograph.&lt;/p&gt;

&lt;p&gt;Here is a list of the top 20 Contributors and the number of Monographs they were the last to edit:&lt;/p&gt;

&lt;table&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href="http://chempedia.com/contributors/2"&gt;anonymous&lt;/a&gt;&lt;/td&gt;&lt;td&gt;1022&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href="http://chempedia.com/contributors/2"&gt;DOI bot&lt;/a&gt;&lt;/td&gt;&lt;td&gt;904&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href="http://chempedia.com/contributors/1"&gt;Edgar181&lt;/a&gt;&lt;/td&gt;&lt;td&gt;378&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href="http://chempedia.com/contributors/66"&gt;Fvasconcellos&lt;/a&gt;&lt;/td&gt;&lt;td&gt;170&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href="http://chempedia.com/contributors/31"&gt;Meodipt&lt;/a&gt;&lt;/td&gt;&lt;td&gt;151&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href="http://chempedia.com/contributors/40"&gt;Arcadian&lt;/a&gt;&lt;/td&gt;&lt;td&gt;144&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href="http://chempedia.com/contributors/59"&gt;Chem-awb&lt;/a&gt;&lt;/td&gt;&lt;td&gt;133&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href="http://chempedia.com/contributors/22"&gt;Chowbok&lt;/a&gt;&lt;/td&gt;&lt;td&gt;122&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href="http://chempedia.com/contributors/2"&gt;Rifleman 82&lt;/a&gt;&lt;/td&gt;&lt;td&gt;114&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href="http://chempedia.com/contributors/10"&gt;SmackBot&lt;/a&gt;&lt;/td&gt;&lt;td&gt;105&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href="http://chempedia.com/contributors/19"&gt;Thijs!bot&lt;/a&gt;&lt;/td&gt;&lt;td&gt;99&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href="http://chempedia.com/contributors/1236"&gt;ChemNerd&lt;/a&gt;&lt;/td&gt;&lt;td&gt;85&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href="http://chempedia.com/contributors/127"&gt;Puppy8800&lt;/a&gt;&lt;/td&gt;&lt;td&gt;80&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href="http://chempedia.com/contributors/48"&gt;DumZiBoT&lt;/a&gt;&lt;/td&gt;&lt;td&gt;78&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href="http://chempedia.com/contributors/182"&gt;Axiosaurus&lt;/a&gt;&lt;/td&gt;&lt;td&gt;63&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href="http://chempedia.com/contributors/6"&gt;Chempedia&lt;/a&gt;&lt;/td&gt;&lt;td&gt;63&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href="http://chempedia.com/contributors/174"&gt;Carlo Banez&lt;/a&gt;&lt;/td&gt;&lt;td&gt;55&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href="http://chempedia.com/contributors/13"&gt;Benjah-bmm27&lt;/a&gt;&lt;/td&gt;&lt;td&gt;52&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href="http://chempedia.com/contributors/93"&gt;OKBot&lt;/a&gt;&lt;/td&gt;&lt;td&gt;51&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href="http://chempedia.com/contributors/45"&gt;Cacycle&lt;/a&gt;&lt;/td&gt;&lt;td&gt;50&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;

&lt;p&gt;These Contributors represent 1.9% of all active Contributors and collectively are responsible for being the last to edit 62% of all Monographs. Although not performed here, a histogram plotting number of contributions would be expected to follow a &lt;a href="http://en.wikipedia.org/wiki/Power_law"&gt;power law&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;'Anonymous' is an aggregation of all users who edited a Monograph without a Wikipedia account. 16% of all Monographs were last edited by an anonymous user. Leaving out the aggregated 'anonymous' users indicates that roughly half of all Monographs were last edited by the top 19 Contributors.&lt;/p&gt;
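
&lt;p&gt;The percentages quoted above can be rechecked from the table. The following Ruby sketch uses only the counts reported in this article; the constant and method names are chosen here for illustration.&lt;/p&gt;

```ruby
# Counts reported in this article.
TOTAL_MONOGRAPHS = 6308
ACTIVE_CONTRIBUTORS = 1046

# Monographs last edited by each of the top 20 Contributors, in table order.
# The first entry is the aggregated 'anonymous' user.
TOP_20 = [1022, 904, 378, 170, 151, 144, 133, 122, 114, 105,
          99, 85, 80, 78, 63, 63, 55, 52, 51, 50]

# Percentage of a whole, rounded to one decimal place.
def percent(part, whole)
  (100.0 * part / whole).round(1)
end

top_20_total = TOP_20.sum
anonymous_total = TOP_20.first
top_19_total = top_20_total - anonymous_total

puts "Top 20 as a share of active Contributors: #{percent(TOP_20.size, ACTIVE_CONTRIBUTORS)}%"
puts "Monographs last edited by the top 20: #{percent(top_20_total, TOTAL_MONOGRAPHS)}%"
puts "Monographs last edited anonymously: #{percent(anonymous_total, TOTAL_MONOGRAPHS)}%"
puts "Monographs last edited by the top 19 accounts: #{percent(top_19_total, TOTAL_MONOGRAPHS)}%"
```

&lt;p&gt;The figures agree with the rounded values in the text. The share for the top 19 accounts is measured here against all Monographs; measuring it against only the Monographs with a non-anonymous last editor would give a somewhat higher number.&lt;/p&gt;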

&lt;h4&gt;What is a Contributor?&lt;/h4&gt;

&lt;p&gt;Although it's difficult to say a lot about individual Contributors, most appear to have some training in science, even if that training may not have involved chemistry or biology. Still others (for example, &lt;a href="http://chempedia.com/contributors/2404"&gt;SJP&lt;/a&gt;) appear to have been drawn to contribute to a Monograph based on their nonscientific experience with the title compound or in an effort to fight vandalism or otherwise improve the nonscientific content of the Monograph. The ability of services like Wikipedia (and by extension Chempedia) to provide a platform for those without formal training in a particular area to make useful contributions is without question one of their most useful (and controversial) features.&lt;/p&gt;

&lt;p&gt;Some Contributors are not even human, but rather robots designed to improve the quality of Wikipedia articles in general. For example, &lt;a href="http://chempedia.com/contributors/10"&gt;SmackBot&lt;/a&gt; performs an array of tedious quality control jobs such as fixing bad checksum ISBNs (&lt;a href="http://www.cas.org/expertise/cascontent/registry/checkdig.html"&gt;CAS Numbers, anyone?&lt;/a&gt;) and capitalization errors.&lt;/p&gt;

&lt;h4&gt;Conclusions&lt;/h4&gt;

&lt;p&gt;Wikipedia's collaboration model has made the creation of a free and continuously-updated chemical encyclopedia feasible. Applying chemistry-specific user interfaces and data models exposes this hidden treasure. Although it's tempting to think of this process as mainly being the work of a handful of trained scientists, the numbers suggest a much broader base of contributors. Future articles will explore this idea.&lt;/p&gt;

&lt;p&gt;Related Article: &lt;a href="http://depth-first.com/articles/2008/05/21/building-chempedia-social-networking-applied-to-chemistry"&gt;&lt;em&gt;Building Chempedia: Social Networking Applied to Chemistry&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;</description>
      <pubDate>Wed, 02 Jul 2008 11:50:00 -0400</pubDate>
      <guid isPermaLink="false">urn:uuid:cc2cc82d-b3d9-4bba-89de-69f685033389</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2008/07/02/building-chempedia-learning-about-contributors</link>
      <category>Tools</category>
      <category>chempedia</category>
      <category>wikipedia</category>
      <category>collectiveintelligence</category>
      <category>socialnetworking</category>
      <category>merckindex</category>
    </item>
    <item>
      <title>Building Chempedia: Start Simple, Then Iterate</title>
      <description>&lt;p&gt;&lt;a href="http://chempedia.com"&gt;&lt;img src="http://depth-first.com/demo/20080513/chempedia.png" align="right"&gt;&lt;/img&gt;&lt;/a&gt;As a medium for building software, the Web offers unparalleled adaptability. With nothing to download or install, users of Web applications automatically see the newest version - always. This may sound like a small thing, and technically it is. But it dramatically increases the effectiveness with which software can be created. &lt;a href="http://depth-first.com/articles/2008/04/28/building-chempedia-indexing-wikipedias-6-411-compound-monographs"&gt;The previous article in this series&lt;/a&gt; introduced &lt;a href="http://chempedia.com"&gt;Chempedia&lt;/a&gt;, the free Chemical encyclopedia and cheminformatics Web application. This article will discuss the process by which Chempedia will become a better service over time.&lt;/p&gt;

&lt;h4&gt;Iterative Web Application Development&lt;/h4&gt;

&lt;p&gt;Chempedia, like all actively-developed software, is a work in progress. It will be built in stages starting with the addition of new features, followed by a round of user feedback, bug fixing, and stabilization. This will then be followed by the next major iteration, and so on.&lt;/p&gt;

&lt;p&gt;This iterative design style is ideally suited for Web applications. Because the barrier to pushing out new versions is essentially non-existent, a Web application can evolve at a much more rapid rate than other kinds of software. Indeed, the first version of a Web application need only work well enough to prove a point.&lt;/p&gt;

&lt;p&gt;One of the keys to iterative Web development is a technology framework designed to facilitate it. Chempedia is being developed with &lt;a href="http://rubyonrails.com/"&gt;Ruby on Rails&lt;/a&gt;, a tool that enables Web developers to take full advantage of the iterative development style the Web makes possible.&lt;/p&gt;

&lt;p&gt;Another key element of iterative Web development is users willing to explore the system and offer criticism. Evolution succeeds only when the environment stresses an ecosystem; the same is true in Web application development.&lt;/p&gt;

&lt;p&gt;Chempedia will take full advantage of the evolutionary nature of Web application development. As features are added and (hopefully) use of the service grows, Chempedia will evolve in ways that are impossible to predict today.&lt;/p&gt;

&lt;h4&gt;What's Wrong With Chempedia?&lt;/h4&gt;

&lt;p&gt;If you happened to take a look at Chempedia last week (that version is now no longer visible), you probably noticed many, many things that needed improvement. Some concerns were in the areas of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Navigation. Navigation works best when the right granularity of options is achieved. Chempedia's navigation system grouped both closely-related and dissimilar actions at the same level.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Metaphor. The initial idea behind Chempedia was to see what happened when PubChem's chemical structures were mashed up with Wikipedia articles, using &lt;a href="http://depth-first.com/articles/2007/05/21/simple-cas-number-lookup-with-pubchem"&gt;CAS numbers&lt;/a&gt; as the common link. The site design reflected this, with no clear organizing principle other than mashup. However, after the initial demonstration of the success of this approach, it became clear that Chempedia was strikingly similar in both form and function to the &lt;a href="http://depth-first.com/articles/2008/04/28/building-chempedia-indexing-wikipedias-6-411-compound-monographs"&gt;Merck Index&lt;/a&gt;. Perhaps this should be used as a clue in deriving a better organizing principle.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Wikipedia integration. The old Chempedia site didn't make it nearly as convenient as it should be to create or edit compound monographs. Because Chempedia serves as a chemically-aware front-end for Wikipedia, the easier it is to get to Wikipedia from Chempedia, the better.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;What Changed?&lt;/h4&gt;

&lt;p&gt;During the process of trying to fix Chempedia's problems, it became clear that a major redesign was in order. This consisted of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Creating a landing page oriented toward search.&lt;/strong&gt; Using the Merck Index as a metaphor suggested that &lt;a href="http://chempedia.com"&gt;Chempedia's landing page&lt;/a&gt; should be designed around search rather than around browsing, as it originally was.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Emphasizing compound monographs, not compounds.&lt;/strong&gt; Chempedia's central organizing principle is now the Compound Monograph. One way this is seen is in the new URL structure, which makes it very easy to see where a Chempedia link is about to take you. For example, consider the URL for &lt;a href="http://chempedia.com/monographs/benzene"&gt;benzene&lt;/a&gt;. Another way this can be seen is in the inclusion of &lt;a href="http://chempedia.com/monographs/virginiamycin"&gt;Compound Monographs lacking a chemical structure&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Designing a streamlined menu system.&lt;/strong&gt; The main menu system has been broken down into just three main categories: &lt;a href="http://chempedia.com/"&gt;Search&lt;/a&gt;; &lt;a href="http://chempedia.com/monographs"&gt;Browse&lt;/a&gt;; and &lt;a href="http://chempedia.com/monographs/new"&gt;Create&lt;/a&gt;. These headings refer to actions on Compound Monographs, again in line with their importance as an organizing principle.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Promoting better integration with Wikipedia.&lt;/strong&gt; After some experimentation with implementation possibilities, it is now possible to edit Wikipedia articles directly from the Chempedia site, thanks to the use of an &lt;a href="http://en.wikipedia.org/wiki/IFrame"&gt;inline frame&lt;/a&gt;. Once again, this capability is tied to the Compound Monograph, from which editing and updating links are accessible.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Striving for comprehensive Wikipedia coverage.&lt;/strong&gt; Wikipedia had far more compound monographs than could be found on Chempedia, &lt;a href="http://depth-first.com/articles/2008/04/28/building-chempedia-indexing-wikipedias-6-411-compound-monographs"&gt;6,411 of them&lt;/a&gt;, to be precise. Chempedia now contains all of them, regardless of whether a chemical structure can be found based on a CAS number in PubChem. This includes inorganics, organometallics, polymers, mixtures, and polypeptides.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;Miles to Go Yet&lt;/h4&gt;

&lt;p&gt;Chempedia is far from being finished. For example, you'll notice many instances in which a Compound Monograph is &lt;a href="http://chempedia.com/monographs/parthenolide"&gt;truncated&lt;/a&gt;. This arises from difficulties in parsing Wikipedia's &lt;a href="http://en.wikipedia.org/wiki/Wikilink"&gt;Wikitext&lt;/a&gt; format (more on this later).&lt;/p&gt;

&lt;p&gt;Ultimately, the full text of each Wikipedia article will be present on Chempedia rather than just the first introductory paragraph. But it will take a significant amount of work to ensure that each article's Wikitext entry can be parsed faithfully.&lt;/p&gt;

&lt;p&gt;Chempedia allows search by CAS number, PubChem CID and exact title. Full-text searching is not yet implemented, nor is autocomplete search, both of which would greatly enhance the usability of the service.&lt;/p&gt;

&lt;p&gt;Exact structure searching is made possible by the &lt;a href="http://metamolecular.com/chemwriter"&gt;ChemWriter&lt;/a&gt; editor in combination with &lt;a href="http://en.wikipedia.org/wiki/SHA-1"&gt;SHA-1&lt;/a&gt; hashed &lt;a href="http://depth-first.com/articles/2007/09/27/inchi-for-newbies"&gt;InChIs&lt;/a&gt;. Substructure search and query atom search will ultimately be added, but for an encyclopedia containing relatively few molecules, most of which have trivial names, this isn't yet seen as being critical.&lt;/p&gt;
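
&lt;p&gt;As a rough illustration of how a hashed InChI can support exact structure search, the Ruby sketch below stores Monograph titles under the SHA-1 digest of an InChI string. It is not Chempedia's implementation: the index, the method names, and the sample InChIs are assumptions made for the example, and in practice the query InChI would be generated from the drawn structure.&lt;/p&gt;

```ruby
require 'digest/sha1'

# Hash an InChI string into a fixed-length, 40-character hexadecimal key.
def hashed_inchi(inchi)
  Digest::SHA1.hexdigest(inchi)
end

# Hypothetical index mapping hashed InChIs to Monograph titles.
BENZENE_INCHI = 'InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H'
index = { hashed_inchi(BENZENE_INCHI) => 'benzene' }

# An exact structure search becomes a single lookup on the hashed key.
def find_monograph(index, query_inchi)
  index[hashed_inchi(query_inchi)]
end

puts find_monograph(index, BENZENE_INCHI).inspect
puts find_monograph(index, 'InChI=1S/CH4/h1H4').inspect
```

&lt;p&gt;Because the digest has a fixed length, it fits in an ordinary indexed database column no matter how long the original InChI is. The tradeoff is that only identical InChIs match, which is why substructure search needs a different approach.&lt;/p&gt;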

&lt;p&gt;You'll notice many Monographs on Chempedia that have no structure information. Behind the scenes, Chempedia uses the 350,000+ CAS numbers now contained in the &lt;a href="http://pubchem.ncbi.nlm.nih.gov/"&gt;PubChem&lt;/a&gt; database to associate a chemical structure with a Wikipedia article. In the future, these associations will be made by Chempedia and Wikipedia users, which will allow every Chempedia small-molecule Monograph to have a structure associated with it. (It will also create a rather large, publicly-curated, open database of CAS numbers linked to chemical structures, but that's a story for another time).&lt;/p&gt;

&lt;h4&gt;Your Feedback is Essential&lt;/h4&gt;

&lt;p&gt;Finally, many of the changes made in this iteration were the result of conversations with chemists and developers. If you see something on Chempedia that just doesn't work for you, please don't be shy about &lt;a href="http://chempedia.com/messages/new"&gt;saying so&lt;/a&gt;. Feedback is an essential ingredient in making Chempedia the best service it can be.&lt;/p&gt;
      <pubDate>Tue, 13 May 2008 11:38:00 -0400</pubDate>
      <guid isPermaLink="false">urn:uuid:63df5614-92fb-4363-a060-212645be6315</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2008/05/13/building-chempedia-start-simple-then-iterate</link>
      <category>Meta</category>
      <category>chempedia</category>
      <category>evolution</category>
      <category>webapplication</category>
      <category>rails</category>
      <category>compoundmonograph</category>
      <category>merckindex</category>
      <category>iteration</category>
    </item>
    <item>
      <title>Building Chempedia: Indexing Wikipedia's 6,411 Compound Monographs</title>
      <description>&lt;p&gt;&lt;img src="http://depth-first.com/demo/20080428/merck.png" align="right"&gt;&lt;/img&gt;&lt;a href="http://www.merckbooks.com/mindex/"&gt;The Merck Index&lt;/a&gt; is one of chemistry's most useful reference works. Organized like an encyclopedia, each entry, or "Compound Monograph," describes a single compound complete with chemical structure, CAS Number, IUPAC name, trivial names, physical properties, and leading primary literature references describing uses. Unlike other chemistry databases, the Merck Index focuses on only those compounds with important industrial, biological, medical, or technical applications.&lt;/p&gt;

&lt;h4&gt;What's Wrong with the Merck Index?&lt;/h4&gt;

&lt;p&gt;Wonderful product though it may be, the Merck Index has some limitations. For starters, online versions are not free. The disadvantages of this access model go well beyond a simple price barrier; it prevents the very thing the Web was designed to promote: linking. Another limitation is the time it takes for new versions to appear, which is typically measured in years. Still another limitation is in the cost of adding entries for niche compounds that may not be suitable for a general audience, a major barrier to exposing &lt;a href="http://depth-first.com/articles/2007/08/27/the-long-tail-and-chemistry-why-so-many-acs-meeting-talks-are-uninteresting"&gt;chemistry's long tail&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;What's Chempedia?&lt;/h4&gt;

&lt;p&gt;If we wanted to create a free, online service that worked like the Merck Index but which took full advantage of today's powerful collaboration and information technology tools, how could we go about doing so?&lt;/p&gt;

&lt;p&gt;This article, the first in a series, discusses &lt;a href="http://chempedia.com"&gt;Chempedia&lt;/a&gt;, a free, structure-oriented online encyclopedia of useful chemical compounds designed to answer this question.&lt;/p&gt;

&lt;h4&gt;Background&lt;/h4&gt;

&lt;p&gt;The following articles may be useful in understanding Chempedia's approach and underlying technology:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="http://depth-first.com/articles/2008/04/17/user-created-compound-monographs-on-chempedia-net-open-sourcing-the-collation-and-indexing-of-chemical-information"&gt;User-Created Compound Monographs on Chempedia.net: Open Sourcing the Collation and Indexing of Chemical Information&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="http://depth-first.com/articles/2008/04/04/chempedia-net-mashing-up-pubchem-and-wikipedia"&gt;Chempedia.net: Mashing Up PubChem and Wikipedia&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="http://depth-first.com/articles/2008/04/02/wikipedia-for-cheminformatics-a-simple-web-api-for-finding-cas-numbers-in-compound-monographs"&gt;Wikipedia for Cheminformatics: A Simple Web API for Finding CAS Numbers in Compound Monographs&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="http://depth-first.com/articles/2007/01/24/thirty-two-free-chemistry-databases"&gt;Thirty-Two Free Chemistry Databases&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;Where to Begin?&lt;/h4&gt;

&lt;p&gt;One of the first problems we'd face in building a free Web-based version of the Merck Index is where to get the compound monographs.&lt;/p&gt;

&lt;p&gt;It turns out that &lt;a href="http://wikipedia.org"&gt;Wikipedia&lt;/a&gt; (yes, Wikipedia) hosts a growing collection of compound monographs that, when viewed together, bear a striking resemblance to the Merck Index. And the effort is becoming increasingly organized with respect to content and data provenance.&lt;/p&gt;

&lt;p&gt;Why not start here?&lt;/p&gt;

&lt;h4&gt;The Task at Hand&lt;/h4&gt;

&lt;p&gt;To get an idea of just how Wikipedia's collection of compound monographs compares to the Merck Index, it helps to know: (1) how to find Wikipedia compound monographs; and (2) the range of information available for each entry.&lt;/p&gt;

&lt;p&gt;This tutorial will describe a simple method to index Wikipedia's compound monographs using nothing but free tools and data. Subsequent articles will discuss qualitative aspects of Wikipedia's compound monographs and the challenges involved in organizing them into a chemically-aware service.&lt;/p&gt;

&lt;h4&gt;Indexing Wikipedia's Compound Monographs&lt;/h4&gt;

&lt;p&gt;We can index Wikipedia compound monographs via a simple procedure.&lt;/p&gt;

&lt;p&gt;Most compound monographs employ one of four precompiled Wikipedia templates: &lt;a href="http://en.wikipedia.org/wiki/Template:Chembox"&gt;Chembox&lt;/a&gt; (deprecated); &lt;a href="http://en.wikipedia.org/wiki/Template:Chembox_new"&gt;Chembox new&lt;/a&gt;; &lt;a href="http://en.wikipedia.org/wiki/Template:Drugbox"&gt;Drugbox&lt;/a&gt;; and &lt;a href="http://en.wikipedia.org/wiki/Template:Explosivebox"&gt;Explosivebox&lt;/a&gt;. As an example of what these templates look like, see the right-hand box on Wikipedia's entry on &lt;a href="http://en.wikipedia.org/wiki/Modafinil"&gt;modafinil&lt;/a&gt;. To index Wikipedia's compound monographs, all we need to do is find the titles of all articles using one of these four templates.&lt;/p&gt;

&lt;p&gt;To get started, we'll need a local copy of Wikipedia. The complete set of all Wikipedia articles, as of March 12, 2008, can be &lt;a href="http://download.wikimedia.org/enwiki/20080312/enwiki-20080312-pages-articles.xml.bz2"&gt;downloaded here&lt;/a&gt;. This data dump is updated periodically, so you may have access to a more recent version.&lt;/p&gt;

&lt;p&gt;The Wikipedia dump, which contains the full text of every article in Wikipedia, consists of a 3.5 GB file in &lt;a href="http://www.bzip.org/"&gt;BZip2&lt;/a&gt; format. Fortunately, we won't need to decompress it to disk to index its chemical content.&lt;/p&gt;

&lt;p&gt;The following code will scan the raw Wikipedia dump and produce a list of all compound monograph titles:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_ruby "&gt;&lt;span class="ident"&gt;title&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;
&lt;span class="ident"&gt;log&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;File&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;new&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;monographs.txt&lt;/span&gt;&lt;span class="punct"&gt;',&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;w&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;

&lt;span class="keyword"&gt;while&lt;/span&gt;&lt;span class="punct"&gt;((&lt;/span&gt;&lt;span class="ident"&gt;line&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;STDIN&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;gets&lt;/span&gt;&lt;span class="punct"&gt;))&lt;/span&gt;
  &lt;span class="ident"&gt;line&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;match&lt;/span&gt; &lt;span class="punct"&gt;/&amp;lt;&lt;/span&gt;&lt;span class="ident"&gt;title&lt;/span&gt;&lt;span class="punct"&gt;&amp;gt;(.*)&amp;lt;\/&lt;/span&gt;&lt;span class="regex"&gt;title&amp;gt;&lt;/span&gt;&lt;span class="punct"&gt;/&lt;/span&gt;

  &lt;span class="ident"&gt;if&lt;/span&gt; &lt;span class="global"&gt;$1&lt;/span&gt;
    &lt;span class="ident"&gt;title&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="global"&gt;$1&lt;/span&gt;

    &lt;span class="keyword"&gt;next&lt;/span&gt;
  &lt;span class="keyword"&gt;end&lt;/span&gt;

  &lt;span class="keyword"&gt;if&lt;/span&gt; &lt;span class="ident"&gt;line&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;match&lt;/span&gt; &lt;span class="punct"&gt;/\{\{(&lt;/span&gt;&lt;span class="ident"&gt;chembox&lt;/span&gt;&lt;span class="punct"&gt;|&lt;/span&gt;&lt;span class="ident"&gt;drugbox&lt;/span&gt;&lt;span class="punct"&gt;|&lt;/span&gt;&lt;span class="ident"&gt;explosivebox&lt;/span&gt;&lt;span class="punct"&gt;)/&lt;/span&gt;&lt;span class="ident"&gt;i&lt;/span&gt;
    &lt;span class="keyword"&gt;unless&lt;/span&gt; &lt;span class="ident"&gt;title&lt;/span&gt; &lt;span class="punct"&gt;==&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt; &lt;span class="punct"&gt;||&lt;/span&gt; &lt;span class="ident"&gt;title&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;match&lt;/span&gt;&lt;span class="punct"&gt;(/&lt;/span&gt;&lt;span class="regex"&gt;:&lt;/span&gt;&lt;span class="punct"&gt;/)&lt;/span&gt;
      &lt;span class="ident"&gt;puts&lt;/span&gt; &lt;span class="ident"&gt;title&lt;/span&gt;
      &lt;span class="ident"&gt;log&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;puts&lt;/span&gt; &lt;span class="ident"&gt;title&lt;/span&gt;
      &lt;span class="ident"&gt;log&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;flush&lt;/span&gt;

      &lt;span class="ident"&gt;title&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;
    &lt;span class="keyword"&gt;end&lt;/span&gt;
  &lt;span class="keyword"&gt;end&lt;/span&gt;
&lt;span class="keyword"&gt;end&lt;/span&gt;

&lt;span class="ident"&gt;log&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;close&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Saving this code into a file called &lt;strong&gt;filter.rb&lt;/strong&gt;, we can run it by piping the output of &lt;tt&gt;bzcat&lt;/tt&gt; on the raw dump file:&lt;/p&gt;

&lt;div class="console"&gt;
&lt;pre&gt;
$ bzcat &amp;lt;path_to_dump&amp;gt;/enwiki-20080312-pages-articles.xml.bz2 | ruby filter.rb
&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Alphabetizing the output file gives a complete listing of Wikipedia's compound monograph titles (all 6,411 of them), which for convenience can be &lt;a href="http://depth-first.com/demo/20080428/compound_monographs_20080315.txt"&gt;downloaded here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We can construct a URL to each Wikipedia compound monograph by prepending the title with &lt;strong&gt;http://wikipedia.org/wiki/&lt;/strong&gt;. In other words, our program's output can be used both as a list of chemical names and as a hash of chemical names to Wikipedia URLs. And with the URL in hand, &lt;a href="http://depth-first.com/articles/2008/04/02/wikipedia-for-cheminformatics-a-simple-web-api-for-finding-cas-numbers-in-compound-monographs"&gt;all kinds of interesting things can be done&lt;/a&gt;.&lt;/p&gt;
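
&lt;p&gt;The mapping from titles to URLs described above might be sketched as follows. The method names and sample titles are hypothetical, and replacing spaces with underscores is an assumption about how titles appear in the list rather than part of the procedure described here.&lt;/p&gt;

```ruby
BASE_URL = 'http://wikipedia.org/wiki/'

# Build a Wikipedia URL from an article title. Spaces are replaced with
# underscores, following Wikipedia's usual URL form (an assumption here).
def monograph_url(title)
  BASE_URL + title.strip.tr(' ', '_')
end

# Map each title to its URL, giving a hash of chemical names to Wikipedia URLs.
def url_index(titles)
  titles.each_with_object({}) { |title, urls| urls[title] = monograph_url(title) }
end

# Sample titles standing in for lines read from the output file.
titles = ['Benzene', 'Acetic acid']
url_index(titles).each { |title, url| puts "#{title}: #{url}" }
```

&lt;p&gt;To process the full list, the sample array could be replaced with the lines of &lt;strong&gt;monographs.txt&lt;/strong&gt;, for example via &lt;tt&gt;File.readlines&lt;/tt&gt;.&lt;/p&gt;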

&lt;h4&gt;Limitations&lt;/h4&gt;

&lt;p&gt;Although easy to carry out, the procedure described here has some limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monographs added after March 12, 2008 are not visible.&lt;/li&gt;
&lt;li&gt;Monographs that don't use the chembox, chembox new, drugbox, or explosivebox templates are not visible.&lt;/li&gt;
&lt;li&gt;A very small number of articles erroneously use the chembox template, for example &lt;a href="http://en.wikipedia.org/wiki/Iraq%27s_Chemical_Warfare"&gt;this one&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;Chempedia Redesign&lt;/h4&gt;

&lt;p&gt;Currently, Chempedia doesn't include all 6,411 monographs but rather a subset created by a much less comprehensive indexing method. As part of a major redesign of the site, all Wikipedia compound monographs will be available on Chempedia, which should result in a much more useful service.&lt;/p&gt;

&lt;h4&gt;Conclusions&lt;/h4&gt;

&lt;p&gt;Wikipedia is fast becoming a major storehouse of chemical information with tantalizing potential for creating powerful new services for chemists. More to the point for cheminformatics, the entire Wikipedia dataset can be downloaded and reprocessed free of charge; Wikipedia is one of those rare cheminformatics datasets that is &lt;a href="http://depth-first.com/articles/2006/09/27/hacking-pubchem-free-speech-or-free-beer"&gt;both free as in speech and free as in beer&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As this article has shown, some simple programming is all it takes to begin doing useful things with Wikipedia's chemical content. Future articles will discuss some of the possibilities.&lt;/p&gt;</description>
      <pubDate>Mon, 28 Apr 2008 18:22:00 -0400</pubDate>
      <guid isPermaLink="false">urn:uuid:6980ce0d-0482-48ba-9489-ca1235632f66</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2008/04/28/building-chempedia-indexing-wikipedias-6-411-compound-monographs</link>
      <category>Meta</category>
      <category>chempedia</category>
      <category>wikipedia</category>
      <category>compoundmonograph</category>
      <category>bzip2</category>
      <category>merckindex</category>
    </item>
    <item>
      <title>User-Created Compound Monographs on Chempedia.net: Open Sourcing the Collation and Indexing of Chemical Information</title>
      <description>&lt;p&gt;&lt;a href="http://chempedia.com"&gt;&lt;img src="http://chempedia.net/images/global/logo.png" align="right"&gt;&lt;/img&gt;&lt;/a&gt;Printed encyclopedias of chemical information like the &lt;a href="http://www.merckbooks.com/mindex/"&gt;Merck Index&lt;/a&gt; suffer from the problem of becoming obsolete on publication. When new compounds are discovered, or when the information about a compound changes, those changes can take many months or years to appear in print form due to the high cost of publication. It doesn't have to be that way. This article introduces a new feature to the free online chemical encyclopedia &lt;a href="http://chempedia.com"&gt;Chempedia&lt;/a&gt; that lets working scientists update its contents via &lt;a href="http://wikipedia.org"&gt;Wikipedia&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;About Chempedia.net&lt;/h4&gt;

&lt;p&gt;A &lt;a href="http://depth-first.com/articles/2008/04/04/chempedia-net-mashing-up-pubchem-and-wikipedia"&gt;recent article&lt;/a&gt; introduced &lt;a href="http://chempedia.com"&gt;Chempedia&lt;/a&gt;, the free online chemical encyclopedia. This service is built on two of the largest &lt;a href="http://depth-first.com/articles/2007/01/24/thirty-two-free-chemistry-databases"&gt;free and open repositories of chemical information&lt;/a&gt; in existence: &lt;a href="http://wikipedia.org"&gt;Wikipedia&lt;/a&gt; and &lt;a href="http://pubchem.ncbi.nlm.nih.gov/"&gt;PubChem&lt;/a&gt;. PubChem supplies low-level chemical information such as connection tables, and Wikipedia supplies free-text descriptions of the properties and uses of certain molecules.&lt;/p&gt;

&lt;h4&gt;Which Molecules?&lt;/h4&gt;

&lt;p&gt;Currently, Chempedia.net includes &lt;a href="http://depth-first.com/articles/2008/04/02/wikipedia-for-cheminformatics-a-simple-web-api-for-finding-cas-numbers-in-compound-monographs"&gt;compound monographs&lt;/a&gt; for only about 1,000 of its more than 300,000 molecules. These monographs were located by a manual process in which the titles for all Wikipedia articles were downloaded in alphabetized form; because IUPAC nomenclature uses leading numbers and symbols, alphabetizing clustered the titles that represented IUPAC names. These titles were extracted, and then a script was written to extract the chemical information from them and combine it with that from PubChem.&lt;/p&gt;
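
&lt;p&gt;The original selection was done by hand, but the clustering effect it relied on is easy to reproduce. The Python sketch below is an approximation under a stated assumption: a title is treated as a candidate if its first character is not a letter. The function name candidate_iupac_titles and the sample titles are made up for illustration, and the result would still need manual review, since non-chemical titles can also begin with digits.&lt;/p&gt;

```python
def candidate_iupac_titles(titles):
    """Return sorted titles whose first character is not a letter.

    Sorting by code point places digits and most punctuation ahead of
    letters, so these titles cluster at the start of the sorted list.
    """
    return [t for t in sorted(titles) if t and not t[0].isalpha()]


sample = ["Anandamide", "2-Aminoethanol", "(+)-Camphor", "1,2-Dichloroethane"]
print(candidate_iupac_titles(sample))
# ['(+)-Camphor', '1,2-Dichloroethane', '2-Aminoethanol']
```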

&lt;p&gt;This method, although useful for getting a service running, is clearly flawed. The biggest problem is in how to discover new compound monographs.&lt;/p&gt;

&lt;h4&gt;Why Not Put Users in Control?&lt;/h4&gt;

&lt;p&gt;Chempedia users themselves are in the best position to know when an existing Wikipedia compound monograph should appear in Chempedia but doesn't, when an existing monograph needs to be updated, or when a new monograph is written and needs to be linked.&lt;/p&gt;

&lt;p&gt;How can the process be &lt;a href="http://depth-first.com/articles/2006/08/19/history-of-abstracting-at-chemical-abstracts-service"&gt;automated&lt;/a&gt;?&lt;/p&gt;

&lt;p&gt;As a partial answer to this question, users &lt;a href="http://chempedia.net/articles/new"&gt;now have the ability to notify Chempedia of any changes to a Wikipedia compound monograph&lt;/a&gt;, and to have those changes immediately reflected in the next viewing of a Chempedia compound monograph.&lt;/p&gt;

&lt;h4&gt;An Example&lt;/h4&gt;

&lt;p&gt;As an example, let's take &lt;a href="http://en.wikipedia.org/wiki/anandamide"&gt;anandamide&lt;/a&gt;, a compound I've had some experience with during my time as a medicinal chemist. Although the &lt;a href="http://chempedia.net/compounds/6030"&gt;Chempedia entry for anandamide&lt;/a&gt; exists, there is (or, as of today, was) no link to the Wikipedia compound monograph. Let's create one.&lt;/p&gt;

&lt;p&gt;At the top of &lt;a href="http://chempedia.com/"&gt;Chempedia's main menu&lt;/a&gt;, you'll see a link titled '&lt;a href="http://chempedia.net/articles/new"&gt;Update&lt;/a&gt;'. Choosing this link leads to a form that will ask for two pieces of information: (1) the title of the Wikipedia article to which you want Chempedia to link - in this case '&lt;a href="http://en.wikipedia.org/wiki/anandamide"&gt;anandamide&lt;/a&gt;'; and (2) &lt;a href="http://depth-first.com/articles/2007/09/18/six-reasons-i-like-recaptcha-or-how-to-build-a-web-service-worth-talking-about"&gt;reCaptcha&lt;/a&gt; text to keep robots from making mischief.&lt;/p&gt;

&lt;p&gt;Submitting this information is all that's needed to create a new or updated link from Chempedia to Wikipedia. Chempedia handles the rest.&lt;/p&gt;

&lt;h4&gt;Conclusions&lt;/h4&gt;

&lt;p&gt;Wikipedia is a vast source of free, high-quality, semi-structured chemical information just waiting to have good chemically aware interfaces applied to it. Chempedia.net is an attempt to do just that, but it's a bit more as well. Although it may appear that Chempedia is the major beneficiary in this relationship, Wikipedia also benefits. When chemists have a tool that allows them to query and visualize Wikipedia using their native language (the chemical structure), they're in a better position to both use and contribute to Wikipedia itself, something I've started to do.&lt;/p&gt;

&lt;p&gt;This positive feedback effect is the real value of exposing Web services. The question is: who in cheminformatics is willing and able to take the risk to discover this simple principle and its benefits?&lt;/p&gt;</description>
      <pubDate>Thu, 17 Apr 2008 17:50:00 -0400</pubDate>
      <guid isPermaLink="false">urn:uuid:9db0f83e-ebaf-49cc-af9d-03d44250c05d</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2008/04/17/user-created-compound-monographs-on-chempedia-net-open-sourcing-the-collation-and-indexing-of-chemical-information</link>
      <category>Tools</category>
      <category>chempedia</category>
      <category>wikipedia</category>
      <category>webservice</category>
      <category>mashup</category>
      <category>compoundmonograph</category>
      <category>merckindex</category>
    </item>
  </channel>
</rss>
