<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="/stylesheets/rss.css"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/">
  <channel>
    <title>Depth-First: Hashing InChIs</title>
    <link>http://depth-first.com/articles/2007/05/09/hashing-inchis</link>
    <language>en-us</language>
    <ttl>40</ttl>
    <description>Walking the Web of Chemical Informatics</description>
    <item>
      <title>Hashing InChIs</title>
      <description>&lt;p&gt;The InChI team has announced &lt;a href="http://chemdata.nist.gov/InChI/inchi-hash.pdf"&gt;a proposal&lt;/a&gt; for a standardized InChI hashing mechanism. This would create a free, fixed-length, alphanumeric molecular identifier.&lt;/p&gt;

&lt;p&gt;This is an excellent proposal. One of the biggest problems in working with InChIs (and other line notations such as SMILES) is that even medium-sized molecules produce very long identifiers. Another problem is the use of characters that must be escaped in URLs. The hashing proposal addresses both of these issues, getting very close to creating &lt;a href="http://depth-first.com/articles/2007/03/14/eleven-qualities-of-the-perfect-line-notation-for-the-web"&gt;the optimal molecular identifier&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For example, imagine the convenience of being able to refer to a molecule by a universally-recognized, machine-generated string like the one shown below:&lt;/p&gt;

&lt;p&gt;AAAAAAAAAAA-BBBBBBB-XYZ&lt;/p&gt;

&lt;p&gt;This is something that actually stands a chance of getting printed on reagent bottles, in catalogs, in patent applications, or anywhere else chemists are using chemical information. Aside from its length, it's not too different from that &lt;a href="http://www.cas.org/expertise/cascontent/registry/regsys.html"&gt;other molecular identifier system&lt;/a&gt;, but without the perpetual use tax.&lt;/p&gt;
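
&lt;p&gt;To make the idea concrete, here is a minimal Python sketch of how a fixed-length, letters-only identifier could be derived from an InChI. It is not the algorithm in the InChI team's proposal: the use of SHA-256, the 11-7-3 block layout, and the function names &lt;code&gt;letters_from_digest&lt;/code&gt;, &lt;code&gt;hashed_identifier&lt;/code&gt;, and &lt;code&gt;looks_intact&lt;/code&gt; are all assumptions made for illustration.&lt;/p&gt;

```python
import hashlib
import string

ALPHABET = string.ascii_uppercase  # 26 symbols, none of which need URL escaping


def letters_from_digest(data, length):
    """Map bytes to a fixed-length string of uppercase letters via SHA-256."""
    number = int.from_bytes(hashlib.sha256(data).digest(), "big")
    chars = []
    for _ in range(length):
        # Peel off one base-26 digit per output character.
        number, remainder = divmod(number, len(ALPHABET))
        chars.append(ALPHABET[remainder])
    return "".join(chars)


def hashed_identifier(inchi):
    """Hypothetical 11-7-3 layout: two hash blocks plus a short check block."""
    payload = inchi.encode("utf-8")
    # Distinct prefixes make the two blocks independent views of the same input.
    first = letters_from_digest(b"block1:" + payload, 11)
    second = letters_from_digest(b"block2:" + payload, 7)
    # The check block depends only on the other two blocks, not on the InChI.
    check = letters_from_digest((first + second).encode("ascii"), 3)
    return "-".join([first, second, check])


def looks_intact(identifier):
    """Recompute the check block to catch transcription errors."""
    parts = identifier.split("-")
    if len(parts) != 3:
        return False
    first, second, check = parts
    return check == letters_from_digest((first + second).encode("ascii"), 3)


# Ethanol, written as a version 1 InChI.
ethanol = "InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3"
identifier = hashed_identifier(ethanol)
print(identifier, looks_intact(identifier))
```

&lt;p&gt;Because the output uses only uppercase letters and hyphens, it needs no escaping in a URL. The check block in this sketch only detects typos; it cannot show that an identifier corresponds to a real molecule.&lt;/p&gt;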

&lt;p&gt;There are at least three downsides to this approach:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;For most purposes, hashing is a one-way process. It would become virtually impossible to computationally convert this hashed identifier back into its InChI or molecular representation. On the other hand, this could create a market for cryptography experts in cheminformatics. A hashed-InChI lookup service would start to look very useful.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Because of the one-way nature of hashing, the authenticity of a hashed InChI couldn't be directly verified. Checksums will help, but the fundamental problem remains. InChI itself can be &lt;a href="http://depth-first.com/articles/2006/09/19/decoding-inchis-with-rino"&gt;decoded&lt;/a&gt;, and therefore authenticated.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It's possible, although extremely unlikely, that two different molecules will end up having the same hashed InChI. Reducing the collision probability means increasing the length of the identifier.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
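
&lt;p&gt;The trade-off in the third point can be estimated with the standard birthday approximation. The sketch below assumes identifiers behave like uniformly random strings over a 26-letter alphabet; the function name &lt;code&gt;collision_probability&lt;/code&gt;, the collection size of 100 million molecules, and the lengths tried are illustrative assumptions, not figures from the proposal.&lt;/p&gt;

```python
import math


def collision_probability(num_molecules, num_symbols, alphabet_size=26):
    """Birthday-bound estimate of at least one collision among random identifiers."""
    space = alphabet_size ** num_symbols                   # distinct identifiers
    pairs = num_molecules * (num_molecules - 1) / 2        # pairs that could collide
    # 1 - exp(-pairs / space), computed with expm1 to stay accurate for tiny values.
    return -math.expm1(-pairs / space)


# Hypothetical collection of 100 million molecules at three identifier lengths.
for length in (10, 14, 18):
    print(length, collision_probability(10**8, length))
```

&lt;p&gt;Each additional letter divides the estimate by roughly 26 once the probability is small, which is why a lower collision risk costs a longer identifier.&lt;/p&gt;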

&lt;p&gt;As in any design decision, the question is whether the benefits outweigh the disadvantages.&lt;/p&gt;

&lt;p&gt;Anyone is free to develop their own InChI hash system. Several people, myself included, already have. But by introducing a standard mechanism, the InChI team has the potential to create a molecular identifier that is both &lt;em&gt;free&lt;/em&gt; and easy to use.&lt;/p&gt;</description>
      <pubDate>Wed, 09 May 2007 14:01:00 -0400</pubDate>
      <guid isPermaLink="false">urn:uuid:57c15c90-1d32-4c6d-a46c-46a765320b6b</guid>
      <author>Rich Apodaca</author>
      <link>http://depth-first.com/articles/2007/05/09/hashing-inchis</link>
      <category>Meta</category>
      <category>inchi</category>
      <category>hash</category>
      <category>casnumber</category>
    </item>
    <item>
      <title>"Hashing InChIs" by steve heller</title>
      <description>&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Yes, the hash is one way, but by using a reasonably long bit string one will reduce the likelihood of a collision to 10 to the -11.  Not perfect, and NEVER will be, but close. And we do have a checksum in the proposed hash.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A computer (or computers) set up as an InChI lookup system, with a "real" InChI string and the Index hash, analogous to the DNS lookup machines for the Internet IP address/domain name, will enable people to verify the hash for all practical purposes. Not perfect either, but darn close.  This idea of a lookup computer was proposed when we gave an InChI seminar at Google.
&lt;a href="http://video.google.com/videoplay?docid=-6653695245776470969" rel="nofollow"&gt;http://video.google.com/videoplay?docid=-6653695245776470969&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;</description>
      <pubDate>Fri, 18 May 2007 09:29:53 -0400</pubDate>
      <guid isPermaLink="false">urn:uuid:06c0b7b5-4149-4a7e-aa49-103138781b08</guid>
      <link>http://depth-first.com/articles/2007/05/09/hashing-inchis#comment-40</link>
    </item>
  </channel>
</rss>
