Chempedia.net: Mashing Up PubChem and Wikipedia

Posted by Rich Apodaca Fri, 04 Apr 2008 14:06:00 GMT

PubChem and Wikipedia represent two of the largest open repositories of chemical information in the world. And they complement each other very nicely. PubChem contains mainly low-level chemical structure information whereas Wikipedia contains free-text descriptions of chemical compounds in the form of compound monographs.

Both services grant permission to copy and reuse their contents. But neither service is, by itself, nearly as useful as it could be.

Why not mash them up?

To explore that question, my company, Metamolecular, LLC, has launched Chempedia.

To my knowledge, Chempedia represents the first public-facing database of compounds to incorporate Wikipedia's collection of organic compound monographs. And it's one of the few cheminformatics services to make use of free-text descriptions generated by individual chemists.

Chempedia has been somewhat selective about the compounds it includes. To date, it has spidered over 2,500 monographs, combining them with over 300,000 of the most interesting compounds from PubChem. Not every Chempedia.net molecule has a monograph, but now there's a tool that can actually make that absence apparent.

Chempedia is both an experiment and a service. It's immediately useful for anyone in the business of making or doing things with organic molecules. It's created several unexpected moments of "Oh, that's actually a useful molecule!" It will also serve as a platform to test some of the ideas discussed in Depth-First over the last year or so on the advantages of the Web for collaboration in chemistry.

Stay tuned for more details about how Chempedia was created and some of its applications in chemistry.

Five Open Tools for 2D Structure Layout (aka Structure Diagram Generation)

Posted by Rich Apodaca Wed, 26 Mar 2008 13:11:00 GMT

Given a molecular representation without 2D coordinates, how would you display a human-readable view?

This problem can arise in many situations, one of the most common of which is the parsing of line notations such as IUPAC nomenclature, SMILES, or InChI.

And then there are the cases when you have 2D coordinates, but they're not very aesthetically pleasing. Maybe the coordinates were created by people either in a hurry or working with low-quality editors, or maybe they were generated as distorted 2D projections of 3D coordinates. Whatever the reason, simply having 2D coordinates may not be the same as having good 2D coordinates.

Last year, a Depth-First article discussed the Structure Diagram Generation (SDG) problem and how it can be solved with Open Source software. Given that nearly a year has passed, it seemed appropriate to revisit the topic.

The good news is that there are at least four independent Open Source implementations of SDG algorithms, and one potential open database approach. They are, in no particular order:

  • MCDL: Written in Java, the emphasis of this software appears to be facilitating the use of Modular Chemical Descriptor Language. Unfortunately, no new releases of this intriguing software package have been made in the last year.

  • Chemistry Development Kit (CDK): This useful package handles about 70-80% of a typical assortment of chemical structures well. The large amount of activity on the CDK project in general makes this a particularly good SDG system to contribute to, especially in the areas of refactoring and handling special cases. See also Christoph Steinbeck's overview of CDK's layout system.

  • BKChem: A 2D structure editor written in Python. Give it an InChI and it will display the structure, courtesy of SDG. The system worked remarkably well with the molecules I tested. BKChem has also been reported to work in batch mode.

  • RDKit: Written in Python and C++, this package is the newest of the bunch. Although I haven't had much luck compiling RDKit, it still looks quite promising. Any chance of switching to make as a build system?

  • PubChem: PubChem? Maybe. With a database of small molecules now numbering well over ten million, there's a good chance that the molecule for which you need to assign coordinates is already in PubChem. And if it's in PubChem, 2D coordinates have already been assigned. Use an InChI as a hash key, and voilà - instant SDG without much software. Given the novelty of large, publicly-available databases of small molecules such as PubChem, this approach may have a great deal of untapped potential.
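
The lookup idea in the last item can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions: the COORDINATES table, its single sample entry, and the layout_for helper are hypothetical stand-ins for a local index built from PubChem downloads, and the block passed to layout_for stands in for whichever SDG tool you would fall back to.

```ruby
# Hypothetical in-memory index: InChI string => array of [x, y] coordinates
# for the heavy atoms. In practice this would be built from PubChem
# downloads; the single entry below (methane) is sample data only.
COORDINATES = {
  'InChI=1S/CH4/h1H4' => [[0.0, 0.0]]
}

# Return stored 2D coordinates for an InChI. When the molecule is not in
# the index, hand the InChI to the caller's block, which stands in for a
# real SDG tool.
def layout_for(inchi)
  COORDINATES.fetch(inchi) { |missing| yield missing }
end

coords   = layout_for('InChI=1S/CH4/h1H4') { |inchi| raise "no layout for #{inchi}" }
fallback = layout_for('InChI=1S/C2H6/c1-2/h1-2H3') { |_inchi| :needs_sdg }
```

Here methane is found in the index, while ethane is not and falls through to the block. A real index would more likely live in a database keyed on the InChI string, but the control flow would be the same.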

SDG is one of those issues that can stay off the radar for some only to become an instant, nagging problem with no clear way out. The tools cited here offer an excellent place to begin working toward a comprehensive solution.

Create Your Own PubChem Datasets: Exporting Results As SD Files

Posted by Rich Apodaca Tue, 13 Nov 2007 21:43:00 GMT

Recently, I needed to create a subset of the PubChem database in Structure Data File (SD File) format. Although it's far from obvious how to do this, the capability does exist. In this article, I'll give a step-by-step procedure for creating custom datasets in SD File format from arbitrary PubChem structure queries.

Create and Execute the Query

Let's say we want to create a dataset in SD File format containing all N-Boc-protected piperidines registered in PubChem.

From the main PubChem site, choose the Structure Search link. Then click the "Sketch" button.

Next, draw your molecule in the 2D structure editor:

Then click the "Done" button.

Before starting the query (by clicking the "Search" button), be sure to select the "Substructure" option under "Search Type."

Exporting the Results

You should now be looking at a screen containing the first few hits of a 7700+ hitset. But how do we export these results in SD File format?

Next to a field labeled "Display", you'll see a drop-down box containing several different options. Choose the one labeled "PubChem Download."

You'll be redirected to a download page from which you can select output formats, including SDF, or SD File. You can also select a compression type (datasets of even 2000 records can be quite large uncompressed). For this example, we'll select SDF format with GZip compression.

Clicking on the "Download" button takes us to a status page that eventually informs us when our download has been processed. You should then get a "Save File" dialog or something similar. If not, you should see a link to the compressed SD file.

Downloading the results file completes the process.
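
Once the file is on disk, a few lines of Ruby are enough to check what was downloaded. The following is a minimal sketch, not part of the PubChem procedure itself: the helper names sdf_records and read_gzipped_sdf and the filename in the usage comment are assumptions, and the two-record sample is made-up data. It relies only on the fact that each SD File record ends with a line containing "$$$$".

```ruby
require 'zlib'
require 'stringio'

# Split the text of an SD File into records. Each record ends with a line
# containing only "$$$$", the SD File record delimiter.
def sdf_records(text)
  text.split(/^\$\$\$\$\r?\n?/).reject { |record| record.strip.empty? }
end

# Read a GZip-compressed SD File from any IO object and return its records.
def read_gzipped_sdf(io)
  gz = Zlib::GzipReader.new(io)
  sdf_records(gz.read)
ensure
  gz.finish if gz # release the reader without closing the underlying IO
end

# Usage with a real download (the filename is an assumption):
#   records = File.open('results.sdf.gz', 'rb') { |f| read_gzipped_sdf(f) }
#   puts records.size

# Self-contained demonstration using a tiny made-up two-record file.
sample = "mol-1\n  data\nM  END\n$$$$\nmol-2\n  data\nM  END\n$$$$\n"
demo = read_gzipped_sdf(StringIO.new(Zlib.gzip(sample)))
```

This version reads the whole file into memory, which is fine for a few thousand records. For much larger hitsets, reading the GzipReader line by line would be the safer choice.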

PubChem for Newbies

Posted by Rich Apodaca Wed, 26 Sep 2007 12:42:00 GMT

PubChem is arguably the most important free repository of information about small molecules on the planet. Although its size is staggering (over 10 million unique compounds), what makes PubChem important is its completely open approach to chemical information. Never before in the history of chemistry has so much information been made available, free to anyone who cares to use it.

Despite PubChem's pioneering approach, many factors make the service difficult to learn and navigate. Most notably, its practical yet bewildering integration into NIH's other far-flung database activities serves as highly effective camouflage for the treasures that lie beneath.

With this in mind, I thought it would be useful to collect all Depth-First articles on PubChem into one place. They have been broken down into categories, although many articles contain elements useful to anyone interested in PubChem.

For Chemists

For Hackers

For Everyone

If you've got a favorite PubChem resource you'd like to share, please feel free to leave a comment.

Image Service Credit: txt2pic.com

Hacking PubChem: Visually Inspect Results for CAS Number and Keyword Searches

Posted by Rich Apodaca Tue, 25 Sep 2007 14:55:00 GMT

A recent article described how PubChem could be used to quickly search for CAS numbers. Although useful, the approach is limited in that only an array of PubChem CIDs was returned. What would be really useful would be a simple way to create a report with entries hyperlinked into the PubChem site itself to aid in visual inspection. In this tutorial, we'll see how an HTML template and a few extra lines of code can do just that.

The Template

Ruby supports a number of HTML templating mechanisms. In this example, we'll use an ERB template resurrected from the Molbank graphical table of contents tutorial:

<html>
  <head>
    <title>
      <%= "PubChem Search for #{term}" %>
    </title>
  </head>
  <body>
    <h1>
      <%= "Search: #{term}" %>
    </h1>
    <table>
      <tr>
      <% col = 0 %>
      <% cids.each do |cid| %>
        <td>
          <% image = "http://pubchem.ncbi.nlm.nih.gov/image/imgsrv.fcgi?cid=#{cid}" %>
          <% summary = "http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=#{cid}" %>
          <a href="<%= summary %>">
            <img src="<%= image %>" border="2"></img>
          </a>
          <center>
            <span style="font-size: 8px">
              <a href="<%= summary %>"><%= "CID-#{cid}" %></a>
            </span>
          </center>
        </td>
        <% col += 1 %>
        <% if col > 5 %>
          <% col = 0 %>
          </tr>
          <tr>
        <% end %>
      <% end %>
      </tr>
    </table>
  </body>
</html>

The above template uses a search term and an array of CIDs to build a table of results. Each cell in the table contains a color 2D image and the CID, both hyperlinked into PubChem itself.

Saving this template to a file called template.rhtml is all we need to do.
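
As an aside, the row-breaking logic in the template can be written without a column counter. The sketch below is an alternative, not the template used in this tutorial: ROW_TEMPLATE and render_rows are illustrative names, the CIDs are placeholder values, and the cell contents are reduced to the CID label. It uses each_slice to hand the template one row of up to six CIDs at a time.

```ruby
require 'erb'

# Reduced version of the results table: each_slice(6) yields one row of up
# to six CIDs at a time, so no column counter is needed. The single-quoted
# heredoc keeps Ruby from interpolating #{cid} before ERB sees it.
ROW_TEMPLATE = <<~'HTML'
  <table>
  <% cids.each_slice(6) do |row| %>
    <tr>
    <% row.each do |cid| %>
      <td><%= "CID-#{cid}" %></td>
    <% end %>
    </tr>
  <% end %>
  </table>
HTML

# Render the table for an array of CIDs. The template reads the local
# variable cids through binding, just as the full template does.
def render_rows(cids)
  ERB.new(ROW_TEMPLATE).result(binding)
end

html = render_rows((1..8).to_a) # eight placeholder CIDs => two table rows
```

One side effect of this form is that it never emits an empty trailing row when the number of CIDs is an exact multiple of six.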

The Library

The library is a modification of the one shown in the previous article in this series:

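require 'rubygems'
require 'mechanize'
require 'erb'
require 'cgi'

module PubChemTerms
  # Search PubChem for term and write an HTML report to output.html.
  def report term
    # cids and term are read by the template through binding.
    cids = get_cids term
    erb = ERB.new(IO.read("template.rhtml"))

    File.open "output.html", 'w+' do |file|
      file << erb.result(binding)
    end
  end

  # Return up to 100 PubChem CIDs (as strings) matching term.
  def get_cids term
    agent = WWW::Mechanize.new
    # Escape the term so that spaces and punctuation survive in the URL.
    page = agent.get "http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pccompound&retmax=100&term=#{CGI.escape(term)}"

    (page.parser/"id").collect {|id| id.innerHTML}
  end
end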

The method report accepts a search term and uses our template to render a report.
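
For readers without Mechanize at hand, the two halves of get_cids (building the query URL and extracting the IDs) can be sketched with Ruby's standard libraries alone. This is an illustration rather than the library used in this tutorial: esearch_url and parse_cids are hypothetical helper names, and the XML fragment is a made-up response in the shape esearch returns, with placeholder IDs.

```ruby
require 'uri'
require 'rexml/document'

ESEARCH = 'http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'

# Build the esearch URL, form-encoding the parameters so that terms
# containing spaces or punctuation are passed through intact.
def esearch_url(term, retmax = 100)
  query = URI.encode_www_form('db' => 'pccompound', 'retmax' => retmax, 'term' => term)
  "#{ESEARCH}?#{query}"
end

# Extract the CIDs (as strings) from an esearch XML response body.
def parse_cids(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//IdList/Id').map { |id| id.text }
end

# Made-up response fragment with placeholder IDs.
sample = '<eSearchResult><Count>2</Count><IdList><Id>101</Id><Id>202</Id></IdList></eSearchResult>'

url  = esearch_url('acetic acid')
cids = parse_cids(sample)
```

Fetching the URL and passing the response body to parse_cids would give the same array of CIDs that report hands to the template.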

Testing

By saving the above library in a file called pubchem.rb, we can search by keyword via interactive ruby (irb):

$ irb
irb(main):001:0> require 'pubchem'
=> true
irb(main):002:0> include PubChemTerms
=> Object
irb(main):003:0> report 'esomeprazole'
=> #<File:output.html (closed)>

This produces a file called output.html that can be viewed with any browser:

As in the original version of the library, we can also query by CAS number:

$ irb
irb(main):001:0> require 'pubchem'
=> true
irb(main):002:0> include PubChemTerms
=> Object
irb(main):003:0> report '119141-88-7'
=> #<File:output.html (closed)>

Conclusions

The simple approach outlined here could be extended in many ways. For example, we could easily retrieve molfiles based on keyword or CAS number search. We could pipe queries together or work with query lists. We could blend in ChemSpider data. We could even build a simple Web application (with Rails) that returned customized reports. Mixing in Ruby CDK or Ruby Open Babel offers still more possibilities.

Increasingly, the most important question in cheminformatics is not "What can we build?", but rather "What should we build?" Success in this new world requires a much deeper understanding of how cheminformatics software is being used by real chemists and where it's not.
