The Economics of Free: Chris Anderson on Charlie Rose

Posted by Rich Apodaca Sat, 10 May 2008 17:35:00 GMT

Anderson's comments on the Long Tail and social networks are especially on-target, and relevant to the sciences.

Building a Unique Chemistry Journal: Responses to Questions from Nature Chemistry

Posted by Rich Apodaca Thu, 08 May 2008 18:48:00 GMT

Neil Withers of the soon-to-be-launched chemistry journal Nature Chemistry has asked for feedback on some questions about the best ways to display chemistry research papers on the Web. Here are some responses:

(1) HTML vs PDF: does anyone read the HTML articles? Do you read the PDF on-screen or print it out?

I've used PDFs both for offline archiving and sharing of especially important articles as well as one-off printing of a paper I'm interested in. I rarely read a paper on-screen if I can avoid it.

Typical workflow: (1) download PDF; (2) print it out; (3) let paper sit while I go do something in the lab that can't wait (or bring it with me); (4) put paper onto a rather large stack of papers just like it; (5) pull paper out of stack from time to time as needed; (6) (optional) file paper in an increasingly chaotic system of folders or recycle it.

This system is bad, and I cursed it weekly during my time as a research chemist. Most of my colleagues had similar experiences.

There are plenty of opportunities to address pain points with the Web. Some ideas:

  • Make it very easy to find papers on the Nature Chemistry site. If I know a paper is trivial to find, I'm less likely to print it out in the first place. Good search may not be enough (see question 3).

  • Make the online version as readable as it can be. Minimize fluff like menus, ads and general clutter. Maximize things that promote readability like reasonable column-widths, appropriate fonts, and attractive and readable images.

  • Add conveniences that make it easier to read the paper online such as hover-popups that display 2D chemical structures for trivial names and IUPAC nomenclature (see below).

Paper is portable but Web documents are alive. Both can be readable - for example, I never print out a blog posting to read it.

(2) Big vs little graphics: what does everyone else think about the tiny size of the graphics in ACS html articles?

Graphics should be sized appropriately. ACS HTML articles are a good example of failing to design the obvious. You'd never read a blog post that looked like those articles, so it's not surprising everyone prints out the PDF.

Another problem is over-wide columns. It's puzzling why journal publishers would ignore all of their hard-won design experience just because a document appears as a Web page. If the ACS used a narrower column width, the Web version would be more readable. For example, check out this article from the Beilstein Journal of Organic Chemistry. The only thing I'd change is to make the font larger.

Both problems are correctable using the right software and techniques.

(3) Tagging/‘semantic web’: what do you think about the toys on the RSC’s Project Prospect? What kind of things would you like to see tagged/linked to other content in Nature Chemistry? For instance, Steve would love to do something with named reactions.

If by tagging, you mean giving users the ability to tag articles like Flickr allows photos to be tagged, and for other users to make use of those tags while searching, I think it's long overdue and could be a game-changer. It would clearly play to the strength of the Web as a medium.

I must confess that I'm not a fan of the implementation of Project Prospect, although the idea has a lot going for it. There's too much bling and a lot of it fails on my Linux/Firefox 2 system.

The one Prospect feature well worth adapting would be the one that lets you get a 2D structure by clicking on a trivial name or IUPAC name. But there's a much better way to implement it:

  • Turn it on by default and get rid of the floating right-hand menu.

  • Make the structure appear, without clicking, by simply hovering the mouse over the trivial name or IUPAC nomenclature. Be sure the delay is set right so that it's not popping up unintentionally.

That's all there is to it. It needn't be complex, just usable.

Another possibility: harvest all of the 2D molecular structures appearing in articles over a given period of time to be displayed in a dense, hyperlinked graphical abstract format ideal for quick browsing.

(4) 3D molecular structures: do these help your understanding of a paper?

Rarely, and in many cases they just add clutter. For almost all small molecules, a properly laid-out and well-drawn 2D chemical structure is more useful. If a central point of discussion in a paper is a 3D structure, then that would be a good use of the technology.

(5) How useful to you are InChIs and SMILES?

Not very. Research chemists rarely care about this kind of technology. They'd much rather have a good-looking 2D chemical structure. InChIs and SMILES, if available, should be hidden away and only brought out when requested. A more basic problem is that neither system will be able to encode all of the molecules your journal's authors are likely to discuss.

(6) Forward linking: the RSC and Elsevier/Science Direct offer this – do you use it? Would you use an RSS feed that alerted you to new citations of a particular paper?

It could be useful provided that clutter could be kept to a minimum. It's essentially a form of linkback (see below).

An RSS feed that published linkback activity might be useful, but many of the chemists I know still don't know what RSS is. On the other hand, a page (or email service) that could keep an interested reader updated on linkback activity on all of their papers of interest simultaneously could be very useful.

(7) Would you actually comment on papers if there was a comments box at the end?

Like Egon Willighagen, I'd probably use my blog to do it.

However, most chemists don't maintain blogs or other websites and for them I can see how the ability to post comments would be useful.

Both kinds of users could be accommodated through a combination of comments and linkbacks. Provided that a good spam filtration system were used, this two-pronged approach might be very useful to readers.

Blogs are just the tip of the iceberg, though. Web publication technologies are creating all kinds of opportunities for creating highly focused, constantly evolving, collaborative mini-reviews on special topics. Linkbacks would create value for both readers and authors of these mini-reviews as well as forward-thinking scientific publications that embrace them.

(8) We really like the Biochemical Society’s HTML article style (sample one here) – do you?

No. Frames make that site very difficult to navigate.

It will be very interesting to see how Nature Publishing Group takes advantage of its opportunity to create something unique among chemistry publications. Asking the kinds of questions they're asking now, and doing so in the way they're doing it, shows they're at least on the right track.

1908 and All That: The Long Tail and Chemistry

Posted by Rich Apodaca Wed, 07 May 2008 14:37:00 GMT

Quite a few American Chemical Society (ACS) divisions are celebrating their 100th anniversaries this year. While this fact may at first glance seem like just a piece of nerdy trivia, Rudy Baum, Editor-in-chief of C&E News, decided to dig deeper. And what he found was the Long Tail of chemistry, alive and well - in 1908.

In his editorial, Baum describes how he looked for the causes of the sudden appearance of so many ACS divisions in 1908. At its core, he found a growing realization on the part of influential chemists at the time that the ACS membership was becoming too diverse in its interests and areas of specialization:

Specialization in subdisciplines of chemistry was also much on ACS members' minds in these years. Some members felt strongly that subdivisions of some sort should be created in the society to provide a venue for chemists from these areas to meet separate from the society as a whole. It was noted that chemists were going off and forming their own specialized organizations in areas like electrochemistry, biological chemistry, and agricultural chemistry.

As early as 1903, ACS established a committee of five distinguished members to look into this issue, with Massachusetts Institute of Technology's Arthur A. Noyes as the chairman. (Throughout its history, ACS has responded to challenges by creating committees!) The committee reported to the ACS Council at its June 1, 1903, meeting, and strongly recommended that "Divisions of the Society be established representing different important branches of chemistry."

For those familiar with the work of Chris Anderson, what's being described is nothing other than the Long Tail:

The theory of the Long Tail is that our culture and economy is increasingly shifting away from a focus on a relatively small number of "hits" (mainstream products and markets) at the head of the demand curve and toward a huge number of niches in the tail. As the costs of production and distribution fall, especially online, there is now less need to lump products and consumers into one-size-fits-all containers. In an era without the constraints of physical shelf space and other bottlenecks of distribution, narrowly-targeted goods and services can be as economically attractive as mainstream fare.

How much money does it cost to set up a new ACS division? Probably not that much. How big is the field of chemistry? Vast. Put the two together, and you have a recipe for today's ACS. A recent Depth-First article described this phenomenon. And C&E News itself maintains a (static?) blog on the Long Tail as it applies to chemical employment.

What does any of this have to do with chemical informatics? Although it may be tempting to think of chemists as a homogeneous group sharing a great deal of experience and knowledge, the proliferation of ACS divisions suggests otherwise. It seems reasonable to think that successful chemical information systems would do well to take this into account in their design and implementation.

Building Chempedia: Indexing Wikipedia's 6,411 Compound Monographs

Posted by Rich Apodaca Mon, 28 Apr 2008 22:22:00 GMT

The Merck Index is one of chemistry's most useful reference works. Organized like an encyclopedia, each entry, or "Compound Monograph," describes a single compound complete with chemical structure, CAS Number, IUPAC name, trivial names, physical properties, and leading primary literature references describing uses. Unlike other chemistry databases, the Merck Index focuses on only those compounds with important industrial, biological, medical, or technical applications.

What's Wrong with the Merck Index?

Wonderful product though it may be, the Merck Index has some limitations. For starters, online versions are not free. The disadvantages of this access model go well beyond a simple price barrier; it prevents the very thing the Web was designed to promote: linking. Another limitation is the time it takes for new versions to appear, which is typically measured in years. Still another limitation is in the cost of adding entries for niche compounds that may not be suitable for a general audience, a major barrier to exposing chemistry's long tail.

What's Chempedia?

If we wanted to create a free, online service that worked like the Merck Index but which took full advantage of today's powerful collaboration and information technology tools, how could we go about doing so?

This article, the first in a series, discusses Chempedia, a free, structure-oriented online encyclopedia of useful chemical compounds designed to answer this question.

Background

The following articles may be useful in understanding Chempedia's approach and underlying technology:

Where to Begin?

One of the first problems we'd face in building a free Web-based version of the Merck Index is where to get the compound monographs.

It turns out that Wikipedia (yes, Wikipedia) hosts a growing collection of compound monographs that, when viewed together, bear a striking resemblance to the Merck Index. And the effort is becoming increasingly organized with respect to content and data provenance.

Why not start here?

The Task at Hand

To get an idea of just how Wikipedia's collection of compound monographs compares to the Merck Index, it helps to know: (1) how to find Wikipedia compound monographs; and (2) the range of information available for each entry.

This tutorial will describe a simple method to index Wikipedia's compound monographs using nothing but free tools and data. Subsequent articles will discuss qualitative aspects of Wikipedia's compound monographs and the challenges involved in organizing them into a chemically-aware service.

Indexing Wikipedia's Compound Monographs

We can index Wikipedia compound monographs via a simple procedure.

Most compound monographs employ one of four precompiled Wikipedia templates: Chembox (deprecated); Chembox new; Drugbox; and Explosivebox. As an example of what these templates look like, see the right-hand box on Wikipedia's entry on modafinil. To index Wikipedia's compound monographs, all we need to do is find the titles of all articles using one of these four templates.

To get started, we'll need a local copy of Wikipedia. The complete set of all Wikipedia articles, as of March 12, 2008, can be downloaded here. This data dump is updated periodically, so you may have access to a more recent version.

The Wikipedia dump, which contains the full text of every article in Wikipedia, consists of a 3.5 GB file in BZip2 format. Fortunately, we won't need to decompress it to disk to index its chemical content; it can be streamed through the indexing script instead.

The following code will scan the raw Wikipedia dump and produce a list of all compound monograph titles:

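# Scan a Wikipedia XML dump on STDIN and record the title of every
# article that uses a Chembox, Drugbox, or Explosivebox template.
title = ""
log = File.new 'monographs.txt', "w"

while (line = STDIN.gets)
  # Remember the most recently seen article title.
  if line =~ /<title>(.*)<\/title>/
    title = $1
    next
  end

  # "{{chembox" also matches "{{Chembox new" (the match is case-insensitive).
  if line =~ /\{\{(chembox|drugbox|explosivebox)/i
    # Skip titles containing a colon (e.g. "Template:Chembox") and
    # titles that have already been recorded.
    unless title == "" || title.match(/:/)
      puts title
      log.puts title
      log.flush

      # Reset so the same article is not recorded twice.
      title = ""
    end
  end
end

log.close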

Saving this code into a file called filter.rb, we can run it by piping the output of bzcat on the raw dump file:

$ bzcat <path_to_dump>/enwiki-20080312-pages-articles.xml.bz2 | ruby filter.rb

Alphabetizing the output file gives a complete listing of Wikipedia's compound monograph titles (all 6,411 of them), which for convenience can be downloaded here.
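The alphabetizing step can be sketched in a few lines of Ruby. This is only an illustration: it assumes the monographs.txt file written by filter.rb, while the sort_titles helper and the monographs-sorted.txt output name are hypothetical. The uniq call is a precaution rather than something the dump is known to require.

```ruby
# Sketch: alphabetize the titles written by filter.rb.
# sort_titles is a hypothetical helper, not part of the original script.
def sort_titles(lines)
  lines.map(&:strip)          # drop trailing newlines
       .reject(&:empty?)      # ignore blank lines
       .uniq                  # precaution against repeated titles
       .sort_by(&:downcase)   # case-insensitive alphabetical order
end

# Only runs when this file is executed directly and the input exists.
if __FILE__ == $0 && File.exist?('monographs.txt')
  titles = sort_titles(File.readlines('monographs.txt'))
  File.write('monographs-sorted.txt', titles.join("\n") + "\n")
  puts "#{titles.size} titles"
end
```

The same result is available from the standard Unix sort utility; the Ruby version just keeps everything in one language.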

We can construct a URL to each Wikipedia compound monograph by prepending http://wikipedia.org/wiki/ to the title. In other words, our program's output can be used both as a list of chemical names and as a hash of chemical names to Wikipedia URLs. And with the URL in hand, all kinds of interesting things can be done.
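A minimal sketch of that name-to-URL hash is shown here. The helper names wikipedia_url and url_index are hypothetical, and the sketch assumes only that spaces in a title become underscores in the URL. Titles containing other special characters, or XML entities carried over from the dump, would likely need further escaping that is not handled here.

```ruby
# Sketch: map each monograph title to its Wikipedia URL.
# wikipedia_url and url_index are hypothetical helper names.
def wikipedia_url(title, base = 'http://wikipedia.org/wiki/')
  # Assumption: only spaces need replacing (Wikipedia uses underscores).
  base + title.strip.tr(' ', '_')
end

def url_index(titles)
  # Build a hash of chemical name => URL.
  titles.each_with_object({}) do |title, index|
    index[title] = wikipedia_url(title)
  end
end

# Example with two illustrative titles.
url_index(['Modafinil', 'Acetic acid']).each do |name, url|
  puts "#{name}\t#{url}"
end
```

In practice the titles would come from the monographs.txt file rather than a literal array.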

Limitations

Although easy to carry out, the procedure described here has some limitations:

  • Monographs added after March 12, 2008 are not visible.
  • Monographs that don't use the chembox, chembox new, drugbox, or explosivebox templates are not visible.
  • A very small number of articles erroneously use the chembox template, for example this one.

Chempedia Redesign

Currently, Chempedia doesn't include all 6,411 monographs but rather a subset created by a much less comprehensive indexing method. As part of a major redesign of the site, all Wikipedia compound monographs will be available on Chempedia, which should result in a much more useful service.

Conclusions

Wikipedia is fast becoming a major storehouse of chemical information with tantalizing potential for creating powerful new services for chemists. More to the point for cheminformatics, the entire Wikipedia dataset can be downloaded and reprocessed free of charge; Wikipedia is one of those rare cheminformatics datasets that is both free as in speech and free as in beer.

As this article has shown, some simple programming is all it takes to begin doing useful things with Wikipedia's chemical content. Future articles will discuss some of the possibilities.

Thinking of Founding a Science Startup? Look to What's Getting Cheaper

Posted by Rich Apodaca Tue, 22 Apr 2008 21:59:00 GMT

Deepak Singh recently started an interesting discussion (and follow-up) about the need for organizations that help early-stage bioscience startups in the same way that YCombinator does in the Web space. But having just attended my second YC Startup School, I'm left with a new-found appreciation of the role startup economics plays in shaping not just the startup landscape, but the culture of entrepreneurship that goes with it.

There's a world of difference between the kinds of startups YCombinator is interested in and the kind of startup most chemists and biologists would be in a position to found. As told by Paul Graham of YCombinator, founding a Web startup is cheap, and that changes everything:

There's something interesting happening right now. Startups are undergoing the same transformation that technology does when it becomes cheaper.

It's a pattern we see over and over in technology. Initially there's some device that's very expensive and made in small quantities. Then someone discovers how to make them cheaply; many more get built; and as a result they can be used in new ways.

Computers are a familiar example. When I was a kid, computers were big, expensive machines built one at a time. Now they're a commodity. Now we can stick computers in everything.

This pattern is very old. Most of the turning points in economic history are instances of it. It happened to steel in the 1850s, and to power in the 1780s. It happened to cloth manufacture in the thirteenth century, generating the wealth that later brought about the Renaissance. Agriculture itself was an instance of this pattern.

Now as well as being produced by startups, this pattern is happening to startups. It's so cheap to start web startups that orders of magnitudes more will be started. If the pattern holds true, that should cause dramatic changes.

Contrast the options available for a computer science student with those of a biology or chemistry student.

The computer science student enjoys access to state-of-the-art tools that have been commoditized to the point of being either completely free or very close to it: hardware; hosting; operating systems; programming languages; development frameworks; source code management tools; and, increasingly, Web services. More than one multimillion-dollar Web startup has been founded with nothing more than a laptop, a dorm room, some macaroni, a few friends, and a good idea or two.

The life- or physical-science student is faced with quite a different reality. Everything needed in getting started costs money - lots of money: lab space; instruments; consumables; a patent lawyer or two; and regulatory approval, both for day-to-day operations and possibly for the product to be sold.

Then there's the problem of time to market. A Web startup can go from nothing to finished product over a summer vacation. Depending on the product being sold, a science startup may take ten years or more to do the same.

This glacial product development cycle leaves the science startup with almost no room for error. In contrast, the Web startup is in a position to start offering a significantly flawed product early on and then iterate until it's perfect.

These contrasting situations go a long way to explaining why bioscience startups tend to be founded by thirty- or fortysomethings and Web startups can be and are founded by teenagers.

With ready access to cheap means of production, Web startups enjoy many advantages that science startups can only dream of. For one thing, a product can actually be developed before approaching outside investors even becomes necessary. It's even possible to build a profitable Web startup purely from the profits created by selling the finished product.

The bio or chemistry startup, on the other hand, will tend to be dependent to varying degrees on outside investors from the beginning. In some cases, the university hosting a science startup's early-phase research will play the role of outside investor, much to the founders' disadvantage.

What do you get when you combine a need for large sums of money up-front with a need for almost perfect execution? A recipe for failing in business more frequently than anybody else.

We might expect this situation to change if the cost of founding a startup in the life or physical sciences dropped significantly. It may take a little imagination to see this as a possibility right now. But the process of new markets forming when technology becomes radically cheaper is a fundamental feature of capitalist societies that has played out time and again over the last several hundred years.

If a transformation is in store for the economics of biotech and chemistry startups, what could trigger it?

Image Credit: mathoov
