Just a Flesh Wound
SEMANTIC KNIGHT: None shall pass without formally defining the ontological meta-semantic thingies of their domain something-or-others!
HACKER: What?
SEMANTIC KNIGHT: None shall pass without using all sorts of semantic meta-meta-meta-stuff that we will invent Real Soon Now!
HACKER: I have no quarrel with you, good Sir Knight, but I must get my work done on the Web. Stand aside!
SEMANTIC KNIGHT: None shall find anything on the Internet without semantic metadata!
HACKER: So be it!
HACKER and SEMANTIC KNIGHT: Aaah!, hiyaah!, etc.
[HACKER chops the SEMANTIC KNIGHT's first argument off by building efficient statistical/heuristic search engines]
HACKER: Now stand aside, worthy adversary.
SEMANTIC KNIGHT: 'Tis but a scratch.
HACKER: A scratch? Your argument has been cut off!
SEMANTIC KNIGHT: No, it isn't.
HACKER: Well, what's that, then?
SEMANTIC KNIGHT: I've had worse. None shall have an effective syndication network without RDF Site Summaries!
[clang]
Hiyaah!
[clang]
Aaaaaaaah!
[HACKER chops the SEMANTIC KNIGHT's second argument off by building the blogs/RSS/Aggregators/Bloglines/etc. network]
HACKER: Victory is mine!
SEMANTIC KNIGHT: Have at you!
[kick]
HACKER: Eh. You are indeed brave, Sir Knight, but the fight is mine.
SEMANTIC KNIGHT: Oh, had enough, eh?
HACKER: Look, you stupid &^%$# You've got no arguments left.
SEMANTIC KNIGHT: Yes, I have.
HACKER: Look!
SEMANTIC KNIGHT: Just a flesh wound.
[kick]
HACKER: Look, stop that.
SEMANTIC KNIGHT: You won't be able to get machine-machine services without an ontology to formally describe all the relationships!
[kick]
HACKER: Right!
[whop]
[HACKER chops the SEMANTIC KNIGHT's third argument off by building SOAPy and RESTful services with only implicit semantic descriptions]
SEMANTIC KNIGHT: Right. I'll do you for that!
HACKER: You'll what?
SEMANTIC KNIGHT: Come here!
HACKER: What are you going to do, bleed on me?
SEMANTIC KNIGHT: I'm invincible!
HACKER: You're a looney.
SEMANTIC KNIGHT: The SEMANTIC Knight always triumphs! Have at you! Come on, then. I have a battalion of KR theorists on my side!
[whop]
[HACKER chops the SEMANTIC KNIGHT's last argument off with an army of actual code writers]
SEMANTIC KNIGHT: Oh? All right, we'll call it a draw.
HACKER: Come on, folks, let's go.
SEMANTIC KNIGHT: Oh. Oh, I see. Running away, eh? You yellow ^&^%$s! Come back here and take what's coming to you. I'll bite your legs off!
-Michael Champion, xml-dev list
Building Chempedia: Indexing Wikipedia's 6,411 Compound Monographs
The Merck Index is one of chemistry's most useful reference works. Organized like an encyclopedia, each entry, or "Compound Monograph," describes a single compound complete with chemical structure, CAS Number, IUPAC name, trivial names, physical properties, and leading primary literature references describing uses. Unlike other chemistry databases, the Merck Index focuses on only those compounds with important industrial, biological, medical, or technical applications.
What's Wrong with the Merck Index?
Wonderful product though it may be, the Merck Index has some limitations. For starters, online versions are not free. The disadvantages of this access model go well beyond a simple price barrier; it prevents the very thing the Web was designed to promote: linking. Another limitation is the time it takes for new versions to appear, which is typically measured in years. Still another limitation is in the cost of adding entries for niche compounds that may not be suitable for a general audience, a major barrier to exposing chemistry's long tail.
What's Chempedia?
If we wanted to create a free, online service that worked like the Merck Index but which took full advantage of today's powerful collaboration and information technology tools, how could we go about doing so?
This article, the first in a series, discusses Chempedia, a free, structure-oriented online encyclopedia of useful chemical compounds designed to answer this question.
Background
The following articles may be useful in understanding Chempedia's approach and underlying technology:
Where to Begin?
One of the first problems we'd face in building a free Web-based version of the Merck Index is where to get the compound monographs.
It turns out that Wikipedia (yes, Wikipedia) hosts a growing collection of compound monographs that, when viewed together, bear a striking resemblance to the Merck Index. And the effort is becoming increasingly organized with respect to content and data provenance.
Why not start here?
The Task at Hand
To get an idea of just how Wikipedia's collection of compound monographs compares to the Merck Index, it helps to know: (1) how to find Wikipedia compound monographs; and (2) the range of information available for each entry.
This tutorial describes a simple method for indexing Wikipedia's compound monographs using nothing but free tools and data. Subsequent articles will discuss qualitative aspects of Wikipedia's compound monographs and the challenges involved in organizing them into a chemically aware service.
Indexing Wikipedia's Compound Monographs
We can index Wikipedia compound monographs via a simple procedure.
Most compound monographs employ one of four precompiled Wikipedia templates: Chembox (deprecated); Chembox new; Drugbox; and Explosivebox. As an example of what these templates look like, see the right-hand box on Wikipedia's entry on modafinil. To index Wikipedia's compound monographs, all we need to do is find the titles of all articles using one of these four templates.
To get started, we'll need a local copy of Wikipedia. The complete set of all Wikipedia articles, as of March 12, 2008, can be downloaded here. This data dump is updated periodically, so you may have access to a more recent version.
The Wikipedia dump, which contains the full text of every article in Wikipedia, consists of a 3.5 GB file in BZip2 format. Fortunately, we won't need to inflate it to index its chemical content.
The following code will scan the raw Wikipedia dump and produce a list of all compound monograph titles:
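    # filter.rb: list the titles of articles that use a chemistry infobox template.
    title = ""
    log = File.new("monographs.txt", "w")

    while (line = STDIN.gets)
      # Remember the most recent article title seen in the dump.
      if line =~ /<title>(.*)<\/title>/
        title = $1
        next
      end

      # "chembox" also matches "chembox new"; the match is case-insensitive.
      if line =~ /\{\{(chembox|drugbox|explosivebox)/i
        # Skip titles containing a colon (these typically mark non-article
        # namespaces) and titles that have already been reported.
        unless title == "" || title.match(/:/)
          puts title
          log.puts title
          log.flush
          title = ""
        end
      end
    end

    log.close

Saving this code into a file called filter.rb, we can run it by piping the output of bzcat on the raw dump file:

    $ bzcat <path_to_dump>/enwiki-20080312-pages-articles.xml.bz2 | ruby filter.rb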
Alphabetizing the output file gives a complete listing of Wikipedia's compound monograph titles (all 6,411 of them), which for convenience can be downloaded here.
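The alphabetizing step can itself be done in a few lines of Ruby. What follows is only a minimal sketch: it assumes the monographs.txt file written by filter.rb is in the current directory, and the helper name alphabetize and the output filename monographs-sorted.txt are illustrative choices, not part of the original procedure.

```ruby
# Hypothetical helper: return unique, non-empty titles in
# case-insensitive alphabetical order.
def alphabetize(titles)
  titles.map(&:strip).reject(&:empty?).uniq.sort_by { |t| [t.downcase, t] }
end

# Assumed filenames; adjust to match your own setup.
input  = "monographs.txt"
output = "monographs-sorted.txt"

# Only touch the filesystem if filter.rb has already produced its output.
if File.exist?(input)
  titles = File.readlines(input, chomp: true)
  File.write(output, alphabetize(titles).join("\n") + "\n")
end
```

Sorting on the lowercased title keeps entries such as "alpha-Pinene" next to "Alpha-..." entries instead of pushing all lowercase titles to the end of the list.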
We can construct a URL for each Wikipedia compound monograph by prepending http://wikipedia.org/wiki/ to its title. In other words, our program's output can be used both as a list of chemical names and as a mapping from chemical names to Wikipedia URLs. And with the URL in hand, all kinds of interesting things can be done.
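As a rough illustration of the title-to-URL step, the sketch below builds a hash of names to URLs. The helper names wiki_url and url_index are hypothetical. It also makes two assumptions that go beyond the text above: spaces are replaced with underscores (Wikipedia's usual URL convention), and remaining special characters are percent-encoded with Ruby's standard ERB::Util.url_encode.

```ruby
require "erb"

WIKI_BASE = "http://wikipedia.org/wiki/"

# Hypothetical helper: build an article URL from a monograph title.
# Assumption: spaces become underscores, everything else unsafe is
# percent-encoded.
def wiki_url(title)
  WIKI_BASE + ERB::Util.url_encode(title.tr(" ", "_"))
end

# Hypothetical helper: map each title to its URL.
def url_index(titles)
  titles.each_with_object({}) { |title, index| index[title] = wiki_url(title) }
end

index = url_index(["Modafinil", "Acetic acid"])
puts index["Acetic acid"]  # prints http://wikipedia.org/wiki/Acetic_acid
```

Titles read directly from the XML dump may still contain XML entities such as &amp;amp;, so a real pipeline would probably need to unescape them before building URLs.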
Limitations
Although easy to carry out, the procedure described here has some limitations:
- Monographs added after March 12, 2008 are not visible.
- Monographs that don't use the chembox, chembox new, drugbox, or explosivebox templates are not visible.
- A very small number of articles erroneously use the chembox template, for example this one.
Chempedia Redesign
Currently, Chempedia doesn't include all 6,411 monographs but rather a subset created by a much less comprehensive indexing method. As part of a major redesign of the site, all Wikipedia compound monographs will be available on Chempedia, which should result in a much more useful service.
Conclusions
Wikipedia is fast becoming a major storehouse of chemical information with tantalizing potential for creating powerful new services for chemists. More to the point for cheminformatics, the entire Wikipedia dataset can be downloaded and reprocessed free of charge; Wikipedia is one of those rare cheminformatics datasets that is both free as in speech and free as in beer.
As this article has shown, some simple programming is all it takes to begin doing useful things with Wikipedia's chemical content. Future articles will discuss some of the possibilities.
CampDepict: Building a Simple SMILES Depict Web Application With JRuby, Structure CDK, and Camping
Today's tribute to the power of simplicity comes by way of John Jaeger, who has built one of the simplest cheminformatics Web applications ever written. His creation, CampDepict, interactively produces a raster image of a 2D chemical structure given a SMILES string, not unlike Daylight's Depict application.
CampDepict uses the Ruby Web microframework Camping. From the README:
Camping is a web framework which consistently stays at less than 4kb of code. You can probably view the complete source code on a single page. But, you know, it's so small that, if you think about it, what can it really do?
The idea here is to store a complete fledgling web application in a single file like many small CGIs. But to organize it as a Model-View-Controller application like Rails does. You can then easily move it to Rails once you've got it going.
John's application is loosely based on the Rails Depict application first described in 2006 here on Depth-First. His code makes use of CDK and Structure CDK, and it runs on JRuby.
If you've ever been curious about what Ruby has to offer cheminformatics, CampDepict could be just the application to get your feet wet.
Thinking of Founding a Science Startup? Look to What's Getting Cheaper
Deepak Singh recently started an interesting discussion (and follow-up) about the need for organizations that help early-stage bioscience startups in the same way that YCombinator does in the Web space. But having just attended my second YC Startup School, I'm left with a new-found appreciation of the role startup economics plays in shaping not just the startup landscape, but the culture of entrepreneurship that goes with it.
There's a world of difference between the kinds of startups YCombinator is interested in and the kind of startup most chemists and biologists would be in a position to found. As told by Paul Graham of YCombinator, founding a Web startup is cheap, and that changes everything:
There's something interesting happening right now. Startups are undergoing the same transformation that technology does when it becomes cheaper.
It's a pattern we see over and over in technology. Initially there's some device that's very expensive and made in small quantities. Then someone discovers how to make them cheaply; many more get built; and as a result they can be used in new ways.
Computers are a familiar example. When I was a kid, computers were big, expensive machines built one at a time. Now they're a commodity. Now we can stick computers in everything.
This pattern is very old. Most of the turning points in economic history are instances of it. It happened to steel in the 1850s, and to power in the 1780s. It happened to cloth manufacture in the thirteenth century, generating the wealth that later brought about the Renaissance. Agriculture itself was an instance of this pattern.
Now as well as being produced by startups, this pattern is happening to startups. It's so cheap to start web startups that orders of magnitudes more will be started. If the pattern holds true, that should cause dramatic changes.
Contrast the options available for a computer science student with those of a biology or chemistry student.
The computer science student enjoys access to state-of-the-art tools that have been commoditized to the point of being either completely free or very close to it: hardware; hosting; operating systems; programming languages; development frameworks; source code management tools; and, increasingly, Web services. More than one multimillion-dollar Web startup has been founded with nothing more than a laptop, a dorm room, some macaroni, a few friends, and a good idea or two.
The life- or physical-science student faces quite a different reality. Everything needed to get started costs money, and lots of it: lab space; instruments; consumables; a patent lawyer or two; and regulatory approval, both for day-to-day operations and possibly for the product to be sold.
Then there's the problem of time to market. A Web startup can go from nothing to finished product over a summer vacation. Depending on the product being sold, a science startup may take ten years or more to do the same.
This glacial product development cycle leaves the science startup with almost no room for error. In contrast, the Web startup is in a position to start offering a significantly flawed product early on and then iterate until it's perfect.
These contrasting situations go a long way toward explaining why bioscience startups tend to be founded by thirty- or fortysomethings, while Web startups can be and are founded by teenagers.
With ready access to cheap means of production, Web startups enjoy many advantages that science startups can only dream of. For one thing, a product can actually be developed before approaching outside investors even becomes necessary. It's even possible to build a profitable Web startup purely from the profits created by selling the finished product.
The biology or chemistry startup, on the other hand, will tend to depend to varying degrees on outside investors from the beginning. In some cases, the university hosting a science startup's early-phase research will play the role of outside investor, much to the founders' disadvantage.
What do you get when you combine a need for large sums of money up-front with a need for almost perfect execution? A recipe for failing in business more frequently than anybody else.
We might expect this situation to change if the cost of founding a startup in the life or physical sciences dropped significantly. It may take a little imagination to see this as a possibility right now. But the process of new markets forming when technology becomes radically cheaper is a fundamental feature of capitalist societies that has played out time and again over the last several hundred years.
If a transformation is in store for the economics of biotech and chemistry startups, what could trigger it?
Image Credit: mathoov



