Chempedia.net: Mashing Up PubChem and Wikipedia
PubChem and Wikipedia represent two of the largest open repositories of chemical information in the world. And they complement each other very nicely. PubChem contains mainly low-level chemical structure information whereas Wikipedia contains free-text descriptions of chemical compounds in the form of compound monographs.
Both services offer permission and access to copy and reuse their contents. But neither service is, by itself, nearly as useful as it could be.
Why not mash them up?
To explore that question, my company, Metamolecular, LLC, has launched Chempedia.
To my knowledge, Chempedia represents the first public-facing database of compounds to incorporate Wikipedia's collection of organic compound monographs. And it's one of the few cheminformatics services to make use of free-text descriptions generated by individual chemists.
Chempedia has been somewhat selective about the compounds it includes. To date, it has spidered over 2,500 monographs, combining them with over 300,000 of the most interesting compounds from PubChem. Not every Chempedia.net molecule has a monograph, but now there's a tool that can actually make that absence apparent.
Chempedia is both an experiment and a service. It's immediately useful for anyone in the business of making or doing things with organic molecules. It's created several unexpected moments of "Oh, that's actually a useful molecule!" It also will serve as a platform to test some of the ideas discussed in Depth-First over the last year or so on the advantages of the Web for collaboration in chemistry.
Stay tuned for more details about how Chempedia was created and some of its applications in chemistry.


Nice job!
I translated this post into Korean on my own blog. Nowadays, I can spend some time on my blog. ;-)
This could also have been done with a userscript. Do you plan to add added value? New info of some kind, or integrate more sources? I mean, ChemSpider could easily do this too, and integrates much more...
Hello Rich, chempedia looks good, any plans to integrate with ChEBI where possible?
Really nice work. One idea: Have you ever thought of making some of the structured Wikipedia information available for download? Let me give an example: I want to have known drugs from Wikipedia in an SD file. For some strange modelling I also want the bioavailability (see e.g. http://en.wikipedia.org/wiki/Gleevec). Or even making this available from chempedia: apply a structure search for a certain moiety PLUS a cutoff for bioavailable molecules? This would imho add significant value.
Cheers, Sepp
Nicely done Rich. Pretty and fast. Seems like we will both be delivering similar capabilities when we roll out WiChempedia on top of ChemSpider. Looking forward to discussing this with you at ACS.
I did one search only to play... searched Phenol and ended up here: http://chempedia.net/compounds/458 where it said "From Wikipedia: Cyclohexanol is a secondary alcohol, formula C6H11OH, consisting of a cyclohexane ring with one hydrogen substituent replaced by a hydroxyl group [1]." I think you need to do some double bond checking...
Antony, thanks for comments - looking forward to seeing WiChempedia. From famine to feast!
Yikes - looks like PubChem lists 108-93-0 as one of the CAS numbers for both phenol and cyclohexanol. Thanks for the catch!
Nice idea, Rich! I think many Wikipedia chemists want to see the content re-used in this kind of way. It provides a single search location that takes you to the appropriate information quickly and easily. I wish I could be in New Orleans to hear what your plans are. I also wish we could do structure searches like that on Wikipedia itself!
I also hit one or two bugs - for example, a correction to my structure (removing an unwanted CH3) looked OK on the screen when I did CTRL-Z, but it didn't get corrected in the search (which searched for a methylated derivative).
I'm curious to know how you identified which structures had an entry on WP - we at WP:Chem should have a list fairly soon that will be made public, and this may be useful to you.
Nice, are synthesis rules and more literature references possible?
And oxygen is a little odd with two hydrogens, which I would call just water ;-) http://chempedia.net/compounds/375950
Egon, the added value is in connecting two previously isolated bodies of knowledge to do things neither resource can do alone and in providing a user interface that makes it clear what's now possible. I'd be interested in hearing your suggestions for features that could bring out the synergy!
Duncan, sounds interesting... how would you like to see ChEBI integrated?
Josef, those are good suggestions. There's been some discussion about Wikipedia's Chembox/Drugbox and the possibilities for creating structured data resources from their content.
Martin, I've never seen that behavior before with ChemWriter. Please let me know if it happens again.
Joerg, those are definitely possibilities. For the time being, I'd like to rely on datasources to do management and curation of user input - I don't want to duplicate what already works. Provided that a datasource exists and its licensing and access policies permit, I'm open to including that content in chempedia.net.
Antony, the entries for phenol and others like it are now fixed. The solution involved creating the concept of both the canonical structure for a CAS number and a canonical CAS number for a structure, based on the number of CAS number citations. Sounds more complicated than it is - relational databases and join tables (via Rails has_many :through relationships) make it easy.
Notice how cyclohexanol is now listed as a compound related to phenol, and vice versa. I'm not sure this is the best solution in the long term, but for now it allows all CAS number references from PubChem to be used, and possibly reported as incorrect.
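For anyone curious, here is a minimal sketch of the citation-count idea, written in plain Python rather than Rails. It is only an illustration under stated assumptions: the citation records, their counts, and the function names (`canonical_mappings`, `related`) are hypothetical and are not the actual Chempedia schema or PubChem data. The most-cited structure for a CAS number is treated as canonical, the most-cited CAS number for a structure is treated as canonical, and structures that share any CAS number are reported as related.

```python
from collections import Counter, defaultdict

# Hypothetical (CAS number, structure name) citation records. The counts are
# invented for illustration; only the CAS numbers themselves are real.
citations = [
    ("108-95-2", "phenol"),
    ("108-95-2", "phenol"),
    ("108-95-2", "phenol"),
    ("108-93-0", "cyclohexanol"),
    ("108-93-0", "cyclohexanol"),
    ("108-93-0", "phenol"),  # the stray citation that caused the mix-up
]


def canonical_mappings(records):
    """Return (canonical structure per CAS, canonical CAS per structure).

    "Canonical" here simply means "most frequently cited" (an assumption
    about how the counts are used).
    """
    counts = Counter(records)
    by_cas = defaultdict(dict)
    by_structure = defaultdict(dict)
    for (cas, structure), n in counts.items():
        by_cas[cas][structure] = n
        by_structure[structure][cas] = n
    canonical_structure = {
        cas: max(options, key=options.get) for cas, options in by_cas.items()
    }
    canonical_cas = {
        s: max(options, key=options.get) for s, options in by_structure.items()
    }
    return canonical_structure, canonical_cas


def related(records, structure):
    """Other structures that share at least one CAS number with `structure`."""
    shared = {cas for cas, s in records if s == structure}
    return sorted({s for cas, s in records if cas in shared and s != structure})


canonical_structure, canonical_cas = canonical_mappings(citations)
print(canonical_structure["108-93-0"])  # cyclohexanol
print(canonical_cas["phenol"])          # 108-95-2
print(related(citations, "phenol"))     # ['cyclohexanol']
```

Keeping every citation, rather than deleting the minority ones, is what lets the questionable CAS number stay visible and be reported as possibly incorrect.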
Rich, this looks good.
I will probably be including a similar feature in my project. It would be great if you could bring together more of the 32 databases. If not the actual content, at least a link.
Kris, I'd be interested in seeing your project when it's ready to be shown. Bringing in content from the 32 databases (more like 90+ by now) is something I plan on doing - terms of use and content match permitting.