Smaller, Cheaper, More Powerful

Friday March 12, 2010

By way of In the Pipeline, I ran across Rob Carlson's description of a garage screening lab in Silicon Valley:

I spent most of one Saturday hanging out at a garage biology lab in Silicon Valley. When I walked in the door, I was impressed by the sophistication of the set-up. The main project is screening for anti-cancer compounds (though it wasn't clear to me whether this meant small molecules or biologics), and the people involved have skillzzz and an accumulation of used/surplus equipment to accomplish whatever they want; two clean/cell-culture hoods, two biorobots (one of which is being reverse engineered), incubators, plate readers, and all the other doodads you might need. They aren't messing around. I didn't get into the details of the project, but the combination of equipment, pedigree, and short conversations with the participants told me all I needed to know. ...

Don't get me wrong. This is cool - very cool. The problem is that this approach doesn't scale and never will.

The year is 1975. The place: Cupertino, California. We're in a garage where a much younger Steve Wozniak has filled the place with surplus mainframe computer equipment he got for free from a friend. Is Steve going to have a whole bunch of fun and do insanely cool things? You bet. Is he going to change the world? Nope.

I'll let Woz speak for himself about what really changed everything:

Now, I still was in this mode where I had to build everything for free. Then I discovered that microprocessors had come out. I had sort of slipped out of the electronics world, out of the computer world, due to working in calculators at Hewlett-Packard. All of a sudden I discovered these microprocessors. What are they? I didn't quite understand it fully, so I took a datasheet home. ...

I was embarrassed because the world had somehow jumped ahead of me - they had come out with cheap microcomputers based around microprocessors and I hadn't heard of it and I hadn't been a part of it.

Smaller, cheaper, more powerful - in that order. These are the things that turned a bunch of hackers into millionaires. It's what put a computer on every desktop. It's also what toppled titans of industry and drove highly respected companies into the ground.

If anything will be capable of changing the way that chemical and biological research gets done, fundamentally changing the way drugs get discovered and new materials get developed, it will be three things: smaller, cheaper and more powerful.


Timo Boehme of OntoChem GmbH has recently reported a significant issue in the most recent InChI implementation (v2.02). Given two molfiles, both of which encode the same structure but use different atom numberings, two different InChIs are produced. The bug can be reproduced with the following molfiles:

0001
  OCTest  0310101

 11 10  0  0  0  0            999 V2000
    2.1434    2.0625    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.4289    2.4750    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.4289    3.3000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.7145    2.0625    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -0.0000    2.4750    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    2.8579    2.4750    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    2.8579    3.3000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.5724    3.7125    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.5724    4.5375    0.0000 N   0  3  0  0  0  0  0  0  0  0  0  0
    3.5724    5.3625    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    2.1434    3.7125    0.0000 O   0  5  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  2  3  2  0  0  0  0
  2  4  1  0  0  0  0
  4  5  1  0  0  0  0
  1  6  1  0  0  0  0
  6  7  1  0  0  0  0
  7  8  2  0  0  0  0
  8  9  1  0  0  0  0
  9 10  3  0  0  0  0
  7 11  1  0  0  0  0
M  CHG  2   9   1  11  -1
M  END

which yields the InChI:

InChI=1S/C4H7N5O2/c5-8-2-3(10)7-1-4(11)9-6/h2,5-7,10H,1H2/b3-2-

and:

0002
  OCTest  031010

 11 10  0  0  0  0            999 V2000
    2.1434    3.7125    0.0000 O   0  5  0  0  0  0  0  0  0  0  0  0
    2.1434    2.0625    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.4289    2.4750    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.4289    3.3000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.7145    2.0625    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -0.0000    2.4750    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    2.8579    2.4750    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    2.8579    3.3000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.5724    3.7125    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.5724    4.5375    0.0000 N   0  3  0  0  0  0  0  0  0  0  0  0
    3.5724    5.3625    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
  2  3  1  0  0  0  0
  3  4  2  0  0  0  0
  3  5  1  0  0  0  0
  5  6  1  0  0  0  0
  2  7  1  0  0  0  0
  7  8  1  0  0  0  0
  8  9  2  0  0  0  0
  9 10  1  0  0  0  0
 10 11  3  0  0  0  0
  8  1  1  0  0  0  0
M  CHG  2  10   1   1  -1
M  END

which yields the InChI:

InChI=1S/C4H6N5O2/c5-8-2-3(10)7-1-4(11)9-6/h2,6-7,10H,1H2/q-1/p+1/b3-2-

As you can see, two different InChIs are produced from molfiles encoding the same structure, an image of which is provided below:

In his description of the bug on the InChI Discussion List Boehme writes:

The problem is in the preprocessing phase where hydrogen atoms are (re)moved/added. Whether hydrogens are removed depends on the above mentioned order atoms are defined in the InChI input structure and thus will produce completely different InChI(keys).

The bug is independent of the molecule input format. The provided test case is given as SDF which can be used directly with the InChI command line tool while I tried the same with our Smiles parser which utilizes the JNI-INCHI library to directly call InChI API via GetStdINCHI. Thus the problem is 'in the heart' of the InChI generation algorithm.

You can view the original bug report here.

I have reproduced the bug myself on an OS X system running Snow Leopard with a binary compiled from the 2.02 source.

The extent of this bug remains unclear, although at a minimum I suspect that any structure in which the above sample is a substructure would display the behavior.
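The difference between the two InChIs reported above is easier to see layer by layer. The following is a minimal sketch that splits each string on "/" and prints the layers found in one but not the other. The function names inchi_layers and layers_only_in are made up for this illustration and are not part of the InChI software.

```shell
#!/bin/sh
# The two InChIs reported above, copied verbatim.
inchi_1='InChI=1S/C4H7N5O2/c5-8-2-3(10)7-1-4(11)9-6/h2,5-7,10H,1H2/b3-2-'
inchi_2='InChI=1S/C4H6N5O2/c5-8-2-3(10)7-1-4(11)9-6/h2,6-7,10H,1H2/q-1/p+1/b3-2-'

# Print one InChI layer per line by splitting on "/".
# (Hypothetical helper, not an InChI API call.)
inchi_layers() {
  printf '%s\n' "$1" | tr '/' '\n'
}

# Print the layers of the first InChI that do not appear verbatim in the second.
layers_only_in() {
  inchi_layers "$1" | while IFS= read -r layer; do
    if ! inchi_layers "$2" | grep -Fxq -e "$layer"; then
      printf '%s\n' "$layer"
    fi
  done
}

echo "Only in first InChI:"
layers_only_in "$inchi_1" "$inchi_2"
echo "Only in second InChI:"
layers_only_in "$inchi_2" "$inchi_1"
```

The connectivity (/c) and double bond (/b) layers are identical in both strings. What differs is the formula, the hydrogen (/h) layer, and the presence of the charge (/q) and proton (/p) layers, which fits Boehme's description of a problem in hydrogen preprocessing.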

Why Does It Matter?

This issue is important because of the central role InChI (and the derived InChI Key) has started to play as a unique molecular identifier, both for internal database table lookups for exact structure searches, and for inter-database communication. A great deal of work by a number of parties could be necessary should this issue be shown to be not limited to the example structure. The issue also underscores the importance of developing a written, English-language InChI specification and a comprehensive test suite.


Why is Chempedia Lab Failing?

Tuesday March 09, 2010

A little over three months since the launch of Chempedia Lab, I have some bad news: it's failing.

For the unfamiliar, Chempedia Lab is a question and answer site dedicated to experimental chemistry. The value proposition is simple: ask your toughest question and get a peer-reviewed answer quickly. At least that's the idea. There's a lot of not-so-secret sauce behind the technology platform, but you can read about that elsewhere. The most important points are that Chempedia Lab is an experiment in community building among professional chemists and advanced students, and that its approach is unique.

This article will discuss the original vision for Chempedia Lab, and offer some speculation on the causes of its failure so far.

A Vision for Collaboration

To say something is failing requires a vision of success. When my company launched Chempedia Lab, the vision was for a Web-based service that would become the premier source on the Web for up-to-date, peer-reviewed information on experimental chemistry. Chempedia Lab would do so by attracting a large number of practicing experimental chemists to ask, answer, and review technically-demanding questions.

Some Numbers to Date

After three months of continuous operation, what do the numbers say about our Chempedia Lab experiment?

Registered Users: 32

Users Asking At Least One Question: 8

Users Answering At Least One Question: 14

Questions Asked: 42

Unanswered Questions: 6

Given the vision for what Chempedia Lab could become, the metric of highest importance is the number of users asking at least one question (eight). Although probably obvious, this metric matters a lot because the entire Chempedia Lab system is driven by new questions. Questions posed from diverse sources lead to a highly engaged community. Fewer questions lead to the opposite outcome.

Eight people asking questions might make for lively dinner conversation, but for a question-driven online community, it's far too few to be self-sustaining.

Of secondary importance is the total number of questions asked. With fewer than 100 of these, it's clear that lack of users asking questions is leading to... well, an overall lack of questions.
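For a rough sense of proportion, the figures above can be turned into a few ratios. This is only back-of-the-envelope arithmetic on the numbers already listed. It assumes that everyone who asked or answered is also counted among the registered users, and the function name lab_ratios is made up for the example.

```shell
#!/bin/sh
# Figures copied from the list above.
registered=32
askers=8
answerers=14
questions=42
unanswered=6

# Print each ratio with two decimal places; awk does the floating-point math.
# Assumption: askers and answerers are subsets of the registered users.
lab_ratios() {
  awk -v r="$1" -v a="$2" -v n="$3" -v q="$4" -v u="$5" 'BEGIN {
    printf "registered users who asked: %.2f%%\n", 100 * a / r
    printf "registered users who answered: %.2f%%\n", 100 * n / r
    printf "questions still unanswered: %.2f%%\n", 100 * u / q
    printf "questions per asking user: %.2f\n", q / a
  }'
}

lab_ratios "$registered" "$askers" "$answerers" "$questions" "$unanswered"
```

Under that assumption, a quarter of registered users have asked anything at all, and each of them has asked about five questions on average.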

The Good News

Although Chempedia Lab has so far failed to live up to the expectation of becoming a hub for experimental chemistry, there are some reasons to be optimistic. For example, try a Google search for 'solubility alcohols ammonia', 'dry hydrogen chloride gas', or 'cold bath temperatures'. A Chempedia Lab question is returned in the first four results for each search. In other words, new users are constantly being funneled to Chempedia Lab in response to keyword searches.

Another cause for optimism: since its inception, about 80% of Chempedia Lab's visitors have never visited the site before. New users continue to find the site, although relatively few are returning.

Possible Interpretations

Although any number of reasons for Chempedia Lab's failure to date could be considered, most of them fall into one of four categories:

  1. Chempedia Lab is unknown to the vast majority of experimental chemists. Simple problem and an even simpler solution - publicity, patient explanation... and time. Easier said than done, but not impossible.

  2. Chempedia Lab is widely-known, but does not appeal to those for whom it's designed. Much more difficult problem to address because users who are turned off just leave, never come back, and don't offer feedback. Specific possibilities might include the notion that research chemists are uneasy about looking foolish in front of their peers by asking a 'too easy' question.

  3. Chempedia Lab is useful and reaching its target audience, but intellectual property issues prevent widespread participation in the asking and answering of questions. If this is the cause, there's little that Chempedia Lab itself can do to address the problem.

  4. The underlying system is flawed as a community-building mechanism in chemistry. In other words, one or more peculiarities in the world of experimental chemistry work against adoption. By comparison, a number of technologically similar sites on different topics have done quite well.

Of the four explanations, (1), (2), and (4) can be addressed by Chempedia Lab directly. At this point, I suspect (1) as the primary cause based on the relatively low traffic on Chempedia Lab to date. How low? Consider this graph, created with Google Analytics, and representing non-robotic visitors:

As you can see, on any given day fewer than 100 (non-unique) visits are made to Chempedia Lab. The traffic spike on January 27 was due to a flurry of referrals from StumbleUpon. The spike in late November came at the launch of Chempedia Lab, when a mass e-mailing to over 500 opt-in only addressees was sent. Traffic has been essentially low and flat from the beginning, marked by periodic spikes.

Reasons (2) and (4) are likely secondary causes. Chemists in industry may be barred from participating or otherwise discouraged, reducing the pool of potential users. Lack of participation begets lack of participation.

Where To?

Community building is hard. And the job is made particularly difficult within fragmented groups having strong pre-existing traditions (and taboos) around communication - such as chemistry. It's possible that an idea like Chempedia Lab will never work in such a community.

But assuming that this worst case scenario doesn't apply and taking point (1) above as the main issue to be addressed (few experimental chemists have seen the site), there are a few possibilities that might be tried:

  1. Reach Key Influencers Individually. You know the gal who always seems to know where to find things and always has the latest information? That's the one. The idea being that if she knows about Chempedia Lab and uses it, she'll tell everyone in her immediate range of contacts about it. Chemist-bloggers might be one group. Unfortunately, having sent individual messages to a number of them, only one has responded, and to my knowledge only one has written anything about Chempedia Lab. Reason (2) at work?

  2. Reach the Entire Community at Once. What's the single biggest hurdle in reaching experimental chemists? There are few 'places' they hang out in large numbers. The weekly magazine Chemical & Engineering News would be the top candidate for this position. Although the price for paid advertising is high (thousands of dollars), coverage in the form of a news story is relatively inexpensive. To make Chempedia Lab newsworthy in the absence of widespread use would require something that has so far been missing (e.g., a remarkable story), but may nevertheless be an achievable goal. Lesser venues will have a lower bar to granting coverage.

  3. Do Nothing. That's right - just sit back and observe. The more fragmented the community and the more entrenched its practices, the slower it will be to adopt new ideas. It may be that the pain level of connecting with other experimental chemists outside of one's core personal contacts/institution, and in getting tough questions answered quickly is not yet high enough to overcome the barrier to trying a new way of doing things and risking professional embarrassment. Next year, things might be different.

Conclusions

For Chempedia Lab to have a future as a self-sustaining community resource, it's going to need a lot more participation from people willing to ask questions. It will require key influencers to take a risk and try it. It will take some creative uses of new and traditional media. And it will in all likelihood take time.

Imagine a resource that every experimental chemist can turn to for fast peer-reviewed answers to tough questions. A resource that updates itself as the facts are updated. A resource that helps you both with the technical side of your work and the social side. A resource that costs nothing to use - ever.

Chempedia Lab has far to go before living up to its potential, and may never make it. The headwinds it faces are not technical, but social. Still, as long as practical options still exist, it makes sense to continue with the experiment.


The LinkedIn Electronic Laboratory Notebook Forum continues to be a good place to hang out for info and perspectives from those in the know. Paraphrasing a recent exchange:

Q: When does a LIMS become an ELN?

A: When it costs more than a million dollars.


The previous article in this series described a simple way to set up your own PubChem mirror. By using some simple Unix command line tools, I showed one way to maintain a fully up-to-date snapshot of PubChem.

But how do you continue to maintain a dataset based on PubChem days, weeks, and months after you import the initial snapshot? The PubChem dataset will be simply too large to process every time you refresh your snapshot. You'll need something more incremental.

Fortunately, there's an answer. In addition to all of the other cool things wget can do with the PubChem FTP site, it can also be used to maintain a set of incremental updates.

This command does it:

wget --mirror --accept "*.sdf.gz,killed-SIDs,killed-CIDs" --wait 1 ftp://ftp.ncbi.nlm.nih.gov/pubchem/{Compound,Substance}/Daily/

We tell wget to mirror the daily Compound and Substance directories on the PubChem FTP server. In addition to .sdf.gz files, we include the files listing obsolete Compound and Substance records (killed-CIDs and killed-SIDs). Because we'll be making a lot of requests, we play nice by adding a one-second delay between them.

Although the daily archives are rotated weekly, our local copy will contain all updates - as long as we don't delete them manually, and the script is run at least once per week.
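Once the local copy accumulates updates, something has to keep track of which files have already been imported. One way to do that, sketched below, is a plain text list of processed paths. The function names (refresh_mirror, pending_updates, mark_done) and the file name processed.txt are hypothetical, and the sketch assumes each update arrives under a path that is never reused. Only the .sdf.gz files are handled here; the killed-SIDs and killed-CIDs lists would need their own step.

```shell
#!/bin/sh
# Sketch only: refresh_mirror, pending_updates, mark_done and processed.txt
# are hypothetical names, not part of wget or PubChem.

# Same wget invocation as above, with both URLs spelled out so the script
# does not depend on the shell's brace expansion.
refresh_mirror() {
  wget --mirror --accept "*.sdf.gz,killed-SIDs,killed-CIDs" --wait 1 \
    ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Daily/ \
    ftp://ftp.ncbi.nlm.nih.gov/pubchem/Substance/Daily/
}

# Print mirrored .sdf.gz files whose paths are not yet in the processed list.
# Assumption: a path, once processed, never reappears with new content.
pending_updates() {
  mirror_dir=$1
  done_list=$2
  if [ -s "$done_list" ]; then
    # grep exits non-zero when nothing is pending; that is not an error here.
    find "$mirror_dir" -type f -name '*.sdf.gz' | sort |
      grep -Fxv -f "$done_list" || true
  else
    find "$mirror_dir" -type f -name '*.sdf.gz' | sort
  fi
}

# Record a file as imported so that later runs skip it.
mark_done() {
  printf '%s\n' "$1" >> "$2"
}

# Typical run (left as comments so that sourcing this file has no side effects):
#   refresh_mirror
#   pending_updates ftp.ncbi.nlm.nih.gov processed.txt |
#   while IFS= read -r f; do
#     # ... import "$f" into your own dataset here ...
#     mark_done "$f" processed.txt
#   done
```

Tracking paths rather than comparing modification times sidesteps the fact that wget's mirror mode stamps local files with the server's timestamps, which can make "newer than my last run" checks unreliable.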

