Building Chempedia: The Human Element

Posted by Rich Apodaca Thu, 15 May 2008 18:50:00 GMT

The study of chemistry is an inherently social activity. From the papers we use and cite, to the conferences we attend, to the informal discussions we engage in daily, being a chemist means interacting with your fellow chemists. Yet strangely, most chemical information systems either totally ignore this central fact, or provide only the most meager of tools to harness it to its full potential. This article discusses how Chempedia currently integrates the social with the scientific, and what may be in store for the future.

Chempedia as a Tool for Scientific Collaboration

Like all chemical reference works, Chempedia is written by people with their own interests, skills, and ambitions. Unlike almost every other chemical reference work, Chempedia (through Wikipedia, on which it's based) offers intriguing possibilities to collaborate directly with its contributors and learn from them - or even to become one of them.

How can Chempedia better facilitate scientific collaboration?

A Simple But Possibly Useful Feature

Yesterday, a new feature was added to Chempedia that makes it easier to understand the recent history of a Compound Monograph. The new feature shows the date that a Compound Monograph was last edited, and the Wikipedia user who edited it:

Clicking on the link takes you to the Wikipedia user's page, in this case the one for Meodipt. (Wikipedia users frequently use handles rather than their given names.) From Meodipt's page, we can see that s/he received degrees in chemistry and pharmacology and is currently studying law. Meodipt's interests include pharmacology, chemistry, law, and science. We can also see that Meodipt is maintaining a good-sized list of CAS numbers for drugs, grouped by indication.

We might be curious about what Meodipt found worth changing, and how s/he changed it. We could find out by first clicking the Chempedia edit link. In the Wikipedia box (framed by the red dotted lines), we would then click on the 'history' tab. Clicking on the 'last' link for the top entry shows us exactly what Meodipt changed on Pravadoline's compound monograph (also visible through this link).
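The post doesn't say how Chempedia obtains the last-edited date and editor. One possible route, sketched here purely as an assumption, is the MediaWiki web API, which can return the timestamp and user of an article's most recent revision. The helper name last_edit_query_url is hypothetical, and the sketch only builds the query URL; it doesn't contact Wikipedia.

```ruby
require 'uri'

# Hypothetical helper (not from the original post): builds a MediaWiki API
# query URL asking for the timestamp and user of an article's latest revision.
def last_edit_query_url(title)
  params = {
    action:  'query',
    prop:    'revisions',
    titles:  title,
    rvprop:  'timestamp|user',  # only the fields the feature displays
    rvlimit: 1,                 # most recent revision only
    format:  'json'
  }

  URI::HTTPS.build(
    host:  'en.wikipedia.org',
    path:  '/w/api.php',
    query: URI.encode_www_form(params)
  ).to_s
end

puts last_edit_query_url('Pravadoline')
```

Fetching that URL and reading the first entry of the returned revisions list would give the two pieces of information the feature displays.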

Looking Ahead

Linking a real person to changes in a Compound Monograph could be enormously useful, if done properly. After all, bringing people with highly focussed interests together is the essence of scientific collaboration. The Chempedia/Wikipedia combination provides one way to do that.

As Chris Anderson puts it, "social networking should be a feature, not a destination." Scientists were social networking long before the Internet, the computer, and the telephone were invented; indeed, scientists who fail to connect with their fellow scientists have a difficult time prospering. When seen from this perspective, it's surprising that good 'social networking' features are not viewed as a top priority in chemical information systems.

The Chempedia author credit system in its current form is rather simplistic and may not actually promote scientific collaboration at all. But it's not hard to imagine ways to make it far more effective. Future articles will discuss some of the possibilities.

The Daily Molecule: The Wonders of Chemistry - One Molecule at a Time

Posted by Rich Apodaca Wed, 14 May 2008 15:58:00 GMT

Chemistry is a big field judged by any standard, including the proliferation of American Chemical Society (ACS) divisions. Each subdiscipline in chemistry is in turn so big that once a chemist becomes 'differentiated' it's easy to lose touch even with neighboring subdisciplines. It doesn't have to be that way. This article introduces a new service, The Daily Molecule, designed to make it just a little bit easier (and hopefully fun) to stay in the chemical loop.

What Is It?

The idea is simple: every weekday, a new molecule will be featured on The Daily Molecule with a short write-up and some leading references. Although molecules in the news will get first priority, any molecule is fair game.

The material for The Daily Molecule will be drawn from Chempedia, which in turn gets some of its content from Wikipedia. In other words, the entries on The Daily Molecule will be largely written by my fellow chemists.

The process of creating a Daily Molecule entry is not time-consuming, and much of what is being done manually now could be automated in the future. The technology platform lends itself well to many forms of chemistry-specific modification (see below).

I hesitate to use the term 'blog' to describe The Daily Molecule, but the description may be helpful to an extent.

The Daily Molecule is unlike a blog in that most content will be generated by others, selected by some criteria, reformatted for consistency, and published. In that sense, The Daily Molecule is something like a mini scientific journal, but it turns the process of acquiring content on its head.

If chemistry ever evolves beyond the current model of publication, which seems inevitable at this point, the journals of the future may resemble The Daily Molecule in one or more ways.

Technology

The software running The Daily Molecule is a modified version of SimpleLog, a Web application based on Ruby on Rails. Unlike most blogging engines, SimpleLog focuses on implementing only the most basic publication features, and doing them to perfection. If you know a little Ruby and can work with Rails, you can do a lot with SimpleLog.

One of the first items of business will be to implement reCAPTCHA support and activate comments on articles.

Some ideas for chemically-enabling The Daily Molecule include a graphical abstract sidebar and (sub)structure search. Currently, the 2D chemical structure images posted to The Daily Molecule have complete connection tables embedded as metadata, a feature with some interesting possibilities.

The Molecule of the Day/Week/Month

The basic idea behind The Daily Molecule is not new. Many other services have sprung up over the last ten years that operate, at least on the surface, similarly. Some examples:

Quite a few others don't appear on this list.

What sets The Daily Molecule apart is the idea that chemical content already exists on the Web in machine-readable format with licenses that permit its re-use; all that's needed is a way to aggregate, format, and package that information in a form suitable for once-daily scanning and cheminformatics manipulation.

Conclusions

Like no other medium, the Web blurs artificial distinctions: between work and play; between private and public; between on-topic and off-topic; between fame and obscurity; between mine and yours; between big and small; and between profit and non-profit. Chemistry may be late to the party, but is not immune to its call.

Building Chempedia: Indexing Wikipedia's 6,411 Compound Monographs

Posted by Rich Apodaca Mon, 28 Apr 2008 22:22:00 GMT

The Merck Index is one of chemistry's most useful reference works. Organized like an encyclopedia, each entry, or "Compound Monograph," describes a single compound complete with chemical structure, CAS Number, IUPAC name, trivial names, physical properties, and leading primary literature references describing uses. Unlike other chemistry databases, the Merck Index focuses on only those compounds with important industrial, biological, medical, or technical applications.

What's Wrong with the Merck Index?

Wonderful product though it may be, the Merck Index has some limitations. For starters, online versions are not free. The disadvantages of this access model go well beyond a simple price barrier; it prevents the very thing the Web was designed to promote: linking. Another limitation is the time it takes for new versions to appear, which is typically measured in years. Still another limitation is in the cost of adding entries for niche compounds that may not be suitable for a general audience, a major barrier to exposing chemistry's long tail.

What's Chempedia?

If we wanted to create a free, online service that worked like the Merck Index but which took full advantage of today's powerful collaboration and information technology tools, how could we go about doing so?

This article, the first in a series, discusses Chempedia, a free, structure-oriented online encyclopedia of useful chemical compounds designed to answer this question.

Background

The following articles may be useful in understanding Chempedia's approach and underlying technology:

Where to Begin?

One of the first problems we'd face in building a free Web-based version of the Merck Index is where to get the compound monographs.

It turns out that Wikipedia (yes, Wikipedia) hosts a growing collection of compound monographs that, when viewed together, bear a striking resemblance to the Merck Index. And the effort is becoming increasingly organized with respect to content and data provenance.

Why not start here?

The Task at Hand

To get an idea of just how Wikipedia's collection of compound monographs compares to the Merck Index, it helps to know: (1) how to find Wikipedia compound monographs; and (2) the range of information available for each entry.

This tutorial will describe a simple method to index Wikipedia's compound monographs using nothing but free tools and data. Subsequent articles will discuss qualitative aspects of Wikipedia's compound monographs and the challenges involved in organizing them into a chemically-aware service.

Indexing Wikipedia's Compound Monographs

We can index Wikipedia compound monographs via a simple procedure.

Most compound monographs employ one of four precompiled Wikipedia templates: Chembox (deprecated); Chembox new; Drugbox; and Explosivebox. As an example of what these templates look like, see the right-hand box on Wikipedia's entry on modafinil. To index Wikipedia's compound monographs, all we need to do is find the titles of all articles using one of these four templates.

To get started, we'll need a local copy of Wikipedia. The complete set of all Wikipedia articles, as of March 12, 2008, can be downloaded here. This data dump is updated periodically, so you may have access to a more recent version.

The Wikipedia dump, which contains the full text of every article in Wikipedia, consists of a 3.5 GB file in BZip2 format. Fortunately, we won't need to inflate it to index its chemical content.

The following code will scan the raw Wikipedia dump and produce a list of all compound monograph titles:

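# filter.rb: reads a Wikipedia XML dump on STDIN and prints the title of
# every article that uses a Chembox, Drugbox, or Explosivebox template.
title = ""

File.open('monographs.txt', 'w') do |log|
  while (line = STDIN.gets)
    # Remember the title of the page we are currently inside.
    if (match = line.match(/<title>(.*)<\/title>/))
      title = match[1]
      next
    end

    # "chembox" also matches the "Chembox new" template.
    next unless line.match(/\{\{(chembox|drugbox|explosivebox)/i)

    # Skip titles containing a colon (namespaced pages such as
    # "Template:Chembox") and pages that have already been logged.
    next if title.empty? || title.include?(':')

    puts title
    log.puts title
    log.flush

    # Clear the title so a page with several templates is listed only once.
    title = ""
  end
end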

Saving this code into a file called filter.rb, we can run it by piping the output of bzcat on the raw dump file:

$ bzcat <path_to_dump>/enwiki-20080312-pages-articles.xml.bz2 | ruby filter.rb

Alphabetizing the output file gives a complete listing of Wikipedia's compound monograph titles (all 6,411 of them), which for convenience can be downloaded here.
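The post doesn't say how the output was alphabetized; a standard sort utility would do. As a minimal Ruby sketch, assuming a case-insensitive sort with duplicates removed (both choices are assumptions), and with the hypothetical output file name monographs_sorted.txt:

```ruby
# Sorts monograph titles alphabetically, ignoring case, after trimming
# whitespace and dropping blank lines and duplicates.
def alphabetize(titles)
  titles.map(&:strip).reject(&:empty?).uniq.sort_by(&:downcase)
end

# Only runs when filter.rb has already produced monographs.txt.
if File.exist?('monographs.txt')
  sorted = alphabetize(File.readlines('monographs.txt'))
  File.write('monographs_sorted.txt', sorted.join("\n") + "\n")
end
```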

We can construct a URL to each Wikipedia compound monograph by prepending the title with http://wikipedia.org/wiki/. In other words, our program's output can be used both as a list of chemical names and as a hash of chemical names to Wikipedia URLs. And with the URL in hand, all kinds of interesting things can be done.
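The title-to-URL hash described above can be sketched in a few lines. Two details are assumptions added here rather than taken from the procedure: spaces are replaced with underscores, following Wikipedia's URL convention, and the remaining reserved characters are percent-encoded. The method names monograph_url and url_index are illustrative.

```ruby
require 'erb'

WIKIPEDIA_BASE = 'http://wikipedia.org/wiki/'

# Builds the URL for one monograph title. Spaces become underscores and
# any remaining reserved characters are percent-encoded.
def monograph_url(title)
  WIKIPEDIA_BASE + ERB::Util.url_encode(title.tr(' ', '_'))
end

# Maps each title to its Wikipedia URL.
def url_index(titles)
  titles.each_with_object({}) { |title, index| index[title] = monograph_url(title) }
end

# Sample titles for illustration only.
index = url_index(['Modafinil', 'Acetic acid'])
index.each { |name, url| puts "#{name} -> #{url}" }
```

Titles read straight from the dump may also contain XML entities such as &amp;, which would need to be unescaped before this step.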

Limitations

Although easy to carry out, the procedure described here has some limitations:

  • Monographs added after March 12, 2008 are not visible.
  • Monographs that don't use the chembox, chembox new, drugbox, or explosivebox templates are not visible.
  • A very small number of articles erroneously use the chembox template, for example this one.
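One way to make the last two limitations easier to audit might be to record which template each title matched, so that entries can be reviewed template by template. The following is a sketch only: the method name templates_by_title and the sample input are made up for illustration, and the matching rules mirror the filter above.

```ruby
require 'stringio'

# Longer alternative first, so "Chembox new" is not reported as "chembox".
TEMPLATE = /\{\{(chembox new|chembox|drugbox|explosivebox)/i

# Returns a hash mapping each article title to the first template it uses.
def templates_by_title(io)
  result = {}
  title = nil

  io.each_line do |line|
    if (match = line.match(/<title>(.*)<\/title>/))
      title = match[1]
    elsif title && !title.include?(':') && (match = line.match(TEMPLATE))
      result[title] ||= match[1].downcase
    end
  end

  result
end

# Made-up fragment standing in for the real dump.
sample = StringIO.new(<<~XML)
  <title>Example compound</title>
  {{Drugbox
  <title>Template:Chembox new</title>
  {{Chembox new
XML

p templates_by_title(sample)
```

Grouping the resulting hash by template name would then show, for instance, how many monographs still rely on the deprecated Chembox.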

Chempedia Redesign

Currently, Chempedia doesn't include all 6,411 monographs but rather a subset created by a much less comprehensive indexing method. As part of a major redesign of the site, all Wikipedia compound monographs will be available on Chempedia, which should result in a much more useful service.

Conclusions

Wikipedia is fast becoming a major storehouse of chemical information with tantalizing potential for creating powerful new services for chemists. More to the point for cheminformatics, the entire Wikipedia dataset can be downloaded and reprocessed free of charge; Wikipedia is one of those rare cheminformatics datasets that is both free as in speech and free as in beer.

As this article has shown, some simple programming is all it takes to begin doing useful things with Wikipedia's chemical content. Future articles will discuss some of the possibilities.

User-Created Compound Monographs on Chempedia.net: Open Sourcing the Collation and Indexing of Chemical Information

Posted by Rich Apodaca Thu, 17 Apr 2008 21:50:00 GMT

Printed encyclopedias of chemical information like the Merck Index suffer from the problem of becoming obsolete on publication. When new compounds are discovered, or when the information about a compound changes, those changes can take many months or years to appear in print form due to the high cost of publication. It doesn't have to be that way. This article introduces a new feature to the free online chemical encyclopedia Chempedia that lets working scientists update its contents via Wikipedia.

About Chempedia.net

A recent article introduced Chempedia, the free online chemical encyclopedia. This service is built on two of the largest free and open repositories of chemical information in existence: Wikipedia and PubChem. PubChem supplies low-level chemical information such as connection tables, and Wikipedia supplies free-text descriptions of the properties and uses of certain molecules.

Which Molecules?

Currently, Chempedia.net only includes compound monographs for about 1,000 of its over 300,000 molecules. These monographs were located by a manual process in which the titles for all Wikipedia articles were downloaded in alphabetized form; this process clustered titles that represented IUPAC nomenclature due to its use of leading numbers and symbols. IUPAC nomenclature titles were extracted, and then a script was written to extract the chemical information from these titles and combine it with that from PubChem.
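The extraction step was done by hand, and the original script is not shown. As a rough sketch of the underlying idea, assuming that "leading numbers and symbols" means a title beginning with a digit or an opening bracket, candidate IUPAC titles could be picked out like this (the pattern, method name, and sample titles are all illustrative):

```ruby
# Assumed heuristic: a title that starts with a digit or an opening
# bracket is a candidate IUPAC name.
LEADING_LOCANT = /\A[\d(\[]/

# Sorts the titles (which clusters the candidates at the top of the list)
# and keeps only those that match the heuristic.
def iupac_candidates(titles)
  titles.sort.select { |title| title.match?(LEADING_LOCANT) }
end

# Sample titles for illustration only.
titles = ['Modafinil', '2-Butanol', '1,2-Dichloroethane', '(E)-Stilbene', '2008 in science']
puts iupac_candidates(titles)
```

A heuristic like this misses monographs filed under trivial names and lets through non-chemical titles that happen to start with a number, so the results would still need manual review.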

This method, although useful for getting a service running, is clearly flawed. The biggest problem is in how to discover new compound monographs.

Why Not Put Users in Control?

Chempedia users themselves are in the best position to know when an existing Wikipedia compound monograph should appear in Chempedia but doesn't, when an existing monograph needs to be updated, or when a new monograph is written and needs to be linked.

How can the process be automated?

As a partial answer to this question, users now have the ability to notify Chempedia of any changes to a Wikipedia compound monograph, and to have those changes immediately reflected in the next viewing of a Chempedia compound monograph.

An Example

As an example, let's take anandamide, a compound I've had some experience with during my time as a medicinal chemist. Although the Chempedia entry for anandamide exists, there is (or, as of today, was) no link to the Wikipedia compound monograph. Let's create one.

At the top of Chempedia's main menu, you'll see a link titled 'Update'. Choosing this link leads to a form that will ask for two pieces of information: (1) the title of the Wikipedia article to which you want Chempedia to link - in this case 'anandamide'; and (2) reCaptcha text to keep robots from making mischief.

Submitting this information is all that's needed to create a new or updated link from Chempedia to Wikipedia. Chempedia handles the rest.

Conclusions

Wikipedia is a vast source of free, high-quality, semi-structured chemical information just waiting to have good chemically-aware interfaces applied to it. Chempedia.net is an attempt to do just that, but it's a bit more as well. Although it may appear that Chempedia is the major beneficiary in this relationship, Wikipedia also benefits. When chemists have a tool that allows them to query and visualize Wikipedia using their native language (the chemical structure) they're in a better position to both use and contribute to Wikipedia itself - something I've started to do.

This positive feedback effect is the real value of exposing Web services. The question is: who in cheminformatics is willing and able to take the risk to discover this simple principle and its benefits?

Chempedia.net: Mashing Up PubChem and Wikipedia

Posted by Rich Apodaca Fri, 04 Apr 2008 14:06:00 GMT

PubChem and Wikipedia represent two of the largest open repositories of chemical information in the world. And they complement each other very nicely. PubChem contains mainly low-level chemical structure information whereas Wikipedia contains free-text descriptions of chemical compounds in the form of compound monographs.

Both services offer permission and access to copy and reuse their contents. But neither service is, by itself, nearly as useful as it could be.

Why not mash them up?

To explore that question, my company, Metamolecular, LLC, has launched Chempedia.

To my knowledge, Chempedia represents the first publicly-facing database of compounds to incorporate Wikipedia's collection of organic compound monographs. And it's one of the few cheminformatics services to make use of free-text descriptions generated by individual chemists.

Chempedia has been somewhat selective about the compounds it includes. To date, it has spidered over 2,500 monographs, combining them with over 300,000 of the most interesting compounds from PubChem. Not every Chempedia.net molecule has a monograph, but now there's a tool that can actually make that absence apparent.

Chempedia is both an experiment and a service. It's immediately useful for anyone in the business of making or doing things with organic molecules. It's created several unexpected moments of "Oh, that's actually a useful molecule!" It also will serve as a platform to test some of the ideas discussed in Depth-First over the last year or so on the advantages of the Web for collaboration in chemistry.

Stay tuned for more details about how Chempedia was created and some of its applications in chemistry.
