A Simple and Portable Ruby Interface to InChI

Posted by Rich Apodaca Thu, 29 May 2008 12:12:00 GMT

Although the InChI software itself is written in C, it can still be used via Ruby. Rino offers one implementation of a Ruby InChI interface that makes use of a C extension. This article describes a more concise and portable solution.

The Code

The following code will accept a String encoding a molfile and return either its InChI, or an empty String if no InChI could be found:

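module InChI
  # Returns the InChI for the given molfile String, or "" if none was produced.
  def inchi_for molfile
    # Pipe the molfile to the InChI command-line program through the shell.
    output = %x[echo "#{molfile}" | cInChI-1 -STDIO]

    # The InChI is taken from the second line of the program's standard output.
    output.eql?("") ? "" : output.split(/\n/)[1]
  end
end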

This code takes advantage of Ruby's built-in support for Command Expansion.
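
As a minimal sketch of how Command Expansion behaves, independent of InChI, the snippet below runs ordinary shell commands and captures their output. The helper name byte_count is hypothetical and exists only for this illustration; it assumes a UNIX-like system with echo and wc available.

```ruby
# Command Expansion: %x[...] (equivalent to backticks) hands the command
# line to the shell and returns the command's standard output as a String.
def byte_count(word)
  # #{word} is interpolated by Ruby *before* the shell parses the line.
  %x[echo "#{word}" | wc -c].strip.to_i
end

puts %x[echo "hello"].inspect   # captured stdout, trailing newline included
puts byte_count("benzene")      # bytes echoed, counting the newline
puts $?.success?                # $? is the status of the last command run
```

Because the interpolated text becomes part of the command line itself, whatever the String contains is seen by the shell, which is convenient for trusted input and risky for anything else.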

Testing the Code

The code below tests the library:

require 'inchi'
include InChI

molfile =
"http://chempedia.com/compounds/106.mol
  -OEChem-03010811072D

 12 12  0     0  0  0  0  0  0999 V2000
    2.8660    1.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.0000    0.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.7321    0.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.0000   -0.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.7321   -0.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.8660   -1.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.8660    1.6200    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
    1.4631    0.8100    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
    4.2690    0.8100    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
    1.4631   -0.8100    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
    4.2690   -0.8100    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
    2.8660   -1.6200    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0  0  0  0
  1  3  1  0  0  0  0
  1  7  1  0  0  0  0
  2  4  1  0  0  0  0
  2  8  1  0  0  0  0
  3  5  2  0  0  0  0
  3  9  1  0  0  0  0
  4  6  2  0  0  0  0
  4 10  1  0  0  0  0
  5  6  1  0  0  0  0
  5 11  1  0  0  0  0
  6 12  1  0  0  0  0
M  END"

puts "Found InChI: #{inchi_for(molfile)}"

We can run the test by saving it in a file called test.rb and executing it:

$ ruby test.rb
InChI version 1, Software version 1.02-beta August 2007
Log file not specified. Using standard error output.
Input file not specified. Using standard input.
Output file not specified. Using standard output.
Options: Mobile H Perception ON
Isotopic ON, Absolute Stereo ON
Omit undefined/unknown stereogenic centers and bonds
Full Aux. info
Input format: MOLfile
Output format: Plain text
Timeout per structure: 60.000 sec; Up to 1024 atoms per structure
End of file detected after structure #1.
Finished processing 1 structure: 0 errors, processing time 0:00:00.00
Found InChI: InChI=1/C6H6/c1-2-4-6-5-3-1/h1-6H

Prerequisites

The above approach requires only that it be run on a UNIX-like system, and that a copy of the InChI command-line program (cInChI-1) be present on your path.

Advantages

The approach described here offers some important advantages over Rino:

Disadvantages

This approach creates a lot of noisy log output to the console. There must be a way to suppress it, but so far I haven't found out how.
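
One candidate, mentioned in the comments below, is to redirect standard error to /dev/null inside the command string. The log shown earlier states that it is "Using standard error output", so this would presumably apply to cInChI-1 as well, though that has not been verified here. The sketch uses a stand-in command rather than cInChI-1; NOISY_COMMAND and quiet_output are hypothetical names.

```ruby
# A stand-in for a chatty program: one line on stdout, one on stderr.
# (Hypothetical; cInChI-1 itself is not invoked here.)
NOISY_COMMAND = %q(sh -c 'echo result; echo "log message" 1>&2')

# Run a command through the shell with its stderr discarded, so only
# stdout is captured and nothing is logged to the console.
def quiet_output(command)
  %x[#{command} 2>/dev/null]
end

puts quiet_output(NOISY_COMMAND).inspect
```

The trade-off is that genuine error messages are discarded along with the log chatter.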

Conclusions

Using Ruby's support for Command Expansion has enabled the creation of a concise and portable Ruby interface to the InChI toolkit. Similar principles would apply to any Unix command-line binary, including, for example, Open Babel.

Comments

  1. baoilleach Fri, 30 May 2008 05:24:28 GMT

    There's no need for files either if using the CDK [as Rino appears to] or indeed if using OpenBabel. If interested in an example, see the code behind cinfony.cdkjython.readstring() and cinfony.cdkjython.Molecule.write().

  2. baoilleach Fri, 30 May 2008 05:26:39 GMT

    Actually, I realise I don't currently support InChI with the CDK, so I don't know if what I said is actually correct.

  3. Rich Apodaca Fri, 30 May 2008 08:45:17 GMT

    Noel, last time I checked CDK support for InChI was limited to Windows and 32-bit Intel.

    Rino actually uses neither CDK nor Open Babel, but rather the InChI toolkit directly. But my limited time and knowledge of C led to the use of temporary files in the C-Extension. And I doubt Rino would compile on OS X.

  4. Andrew Dalke Sat, 31 May 2008 05:42:55 GMT

    You are missing some required command-line parameters (notice that I don't say "options", because they aren't optional).

    Igor Pletnev on the InChI mailing list (10 January 2008, titled "InChI standard generation options (request for comments)") suggested

    /FixedH /RecMet /SPXYZ /SAsXYZ /Newps /Fb /Fnud

    but ChemSpider doesn't use /FixedH, making it hard to do an InChIKey search on their site.

    At the very least, you should enable the /Fb ("Fix bug leading to missing or undefined sp3 parity") option.

    Also, I see you are passing the entire contents of the input file into the string to be exec'ed by Ruby. I don't know how Ruby works, but I would be worried about embedded quotes in the molfile string. What if some property contained a " character, causing the quoted string to be unquoted? If you used this as part of a web service to convert structures to InChI strings then this is a possible security hole.

    I'm also slightly concerned about the potential size of the string you create. On my Mac, the maximum string is set by the kernel variable "kern.argmax" = 262144, so that's the upper limit on the structure you could pass in. Doing a quick search I found that inulin (CID:24763) has 801 heavy atoms and is 96,669 bytes long. This means it's very unlikely that the input structure will exceed that limit.

    But other machines have different limits. See for example http://www.in-ulm.de/~mascheck/various/argmax/ . For those people using IRIX (which is where I first ran into this limit), the max size is only 20,480 bytes. "Linux -2.6.7" is listed at 131,072 bytes, so it's possible that some very large files found in the wild will break that limit.

    My usual solution for this case is using something like Open3.popen3, which uses pipes to communicate with the co-process's stdin, stdout, and stderr. This also solves the problem of keeping inchi's stderr from reaching the console.

    irb(main):001:0> require "open3"
    => true
    irb(main):002:0> stdin, stdout, stderr = Open3.popen3("wc -c")
    => [#, #, #]
    irb(main):003:0> stdin.write("Hello!")
    => 6
    irb(main):004:0> stdin.close()
    => nil
    irb(main):005:0> stdout.read()
    => " 6\n"
    irb(main):006:0>
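
    The same idea can be wrapped in a small helper. This is a minimal sketch using Open3.capture3 from Ruby's standard library (a convenience wrapper around popen3 available in newer Ruby versions); run_with_stdin is a hypothetical name, and wc stands in for the real program. For InChI, the call would presumably be run_with_stdin(molfile, "cInChI-1", "-STDIO"), which is untested here.

```ruby
require "open3"

# Feed +input+ to a command over a pipe instead of embedding it in the
# command line. The program and its arguments are passed separately, so
# no shell is involved and quotes, $ and backticks in +input+ are inert.
def run_with_stdin(input, *command)
  stdout, stderr, status = Open3.capture3(*command, stdin_data: input)
  raise "#{command.first} failed: #{stderr}" unless status.success?
  stdout
end

nasty = %q(He said "hi" and typed `ls` $$)
puts run_with_stdin(nasty, "wc", "-c").strip
```

    Since stderr is captured into its own String, it stays off the console but is still available when a version number or error message is needed.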

  5. Rich Apodaca Sat, 31 May 2008 10:31:29 GMT

    Andrew, thanks for the feedback. The email thread you refer to has some useful information for those who want to share their InChIs with others. I've never thought that exchanging InChI keys between organizations would work well precisely for this reason.

    Great suggestion about Open3. Unfortunately, it has issues on JRuby (1.1.1):

    $ jirb
    irb(main):001:0> require 'open3'
    => true
    irb(main):002:0> stdin, stdout, stderr = Open3.popen3("wc -c")
    NotImplementedError: fork is unsafe and disabled by default on JRuby
            from (irb):3:in `popen3'
            from (irb):3:in `load_history'
    irb(main):003:0>
    

    I did a test to check if an unclosed quote placed on the comments line would cause problems, and it does. An unclosed single quote was fine, though. A workaround would be:

    inchi_for m.gsub(/["]/, "")
    

    AFAIK, doing this doesn't change the resulting InChI in any way.

    A solution to the noisy output problem was also described here.

    So it looks like for now the method described above using Command Expansion is the most broadly-usable. For extremely large molfiles, this could be a problem, but for everything else it seems to work.

  6. Andrew Dalke Sat, 31 May 2008 12:16:26 GMT

    I've not been silent with my own complaints about InChI. :)

    "fork" is deemed unsafe? Strange. Are they worried about fork-bombs or something else? I'm more surprised that %x[] works because it has a lot of security problems. I pointed out the one with the quotes. You fixed it via removal of any double quotes, which should be fine for InChI. But you also need to remove "\" characters, which are interpreted as escape character in the context you're using. And because you are using double quotes, other characters, like $, and ` have meaning.

    Consider:

    irb(main):011:0> text = "Bad\\"
    => "Bad\\"
    irb(main):013:0> %x[echo "#{text}" | wc -c]
    sh: -c: line 1: unexpected EOF while looking for matching `"'
    sh: -c: line 2: syntax error: unexpected end of file
    => ""
    irb(main):014:0> text = "$$"
    => "$$"
    irb(main):015:0> %x[echo "#{text}"]        
    => "3749\n"
    irb(main):016:0> text = "`ls`"
    => "`ls`"
    irb(main):017:0> %x[echo #{text} | wc -c]
    => "     132\n"
    
    
    At the very least, use single quotes. And hope this code isn't run under Windows because the failure mode is going to be quite unexpected.

    It's really hard to be safe when using system(3C), which appears to be what Ruby is using here. I don't know Ruby well so don't know if a correct (and platform independent) function exists there. I did find the Escape module which has the right function.

    Doing research now, under Java/JRuby your best choice is to use Runtime.exec() (pre 1.5) or java.lang.ProcessBuilder .

    I usually use the "redirect stderr to /dev/null" trick when I do this sort of work, but sometimes I need the stderr output so I can get the program version number, or error message about why something failed.

    The moral of my story is, don't trust user input, especially when passed to the command line.
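
    For cases where a shell command line can't be avoided, Ruby's standard library also ships Shellwords.escape, which plays a similar role to the Escape module mentioned above. A minimal sketch, assuming a Bourne-style shell with printf available; echo_through_shell is a hypothetical helper:

```ruby
require "shellwords"

# Shellwords.escape quotes every character the Bourne shell treats
# specially, so the text reaches the command as literal data.
def echo_through_shell(text)
  %x[printf '%s' #{Shellwords.escape(text)}]
end

# The troublesome inputs from the irb session above, plus embedded quotes.
["Bad\\", "$$", "`ls`", %q(a "quoted" word)].each do |text|
  puts echo_through_shell(text) == text
end
```

    This addresses the quoting problem only; the argument-length limits discussed earlier still apply.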

  7. Rich Apodaca Sat, 31 May 2008 15:04:56 GMT

    Andrew, you raise some very important security issues. The risk is that Ruby code using Command Expansion could be coaxed into executing arbitrary system commands, rather than just generating an error.

    A regex cleanup like this might work:

    inchi_for m.gsub(/[^0-9a-z\.\-\n ]/i, '')
    

    Single-quoting alone would ensure that the only substitutions that could occur are \\ -> \ and \' -> ', the latter of which causes an error.

    Your point is well taken - any use of this approach on foreign data should carefully consider all of the security implications.

  8. Andrew Dalke Sat, 31 May 2008 16:03:40 GMT

    Yes, whitelisting like that should work for this case, in the manner you have there. There's still a worry in my head if someone submits a multi-structure SD file, since cInChI-1 will convert all structures you give it. Your filter removes the "> <" key/value data fields from the SD file, and I don't know if that will confuse the inchi reader.

    I don't like being worried like this, so I do my best to avoid passing user input in on the command line.
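
    The whitelist discussed in the last two comments can be tried out directly. A minimal sketch, with sanitize_molfile as a hypothetical wrapper around the same regex; the second call shows how an SD data-field header loses its angle brackets:

```ruby
# The whitelist from the comments above: keep digits, ASCII letters,
# '.', '-', newlines and spaces; drop everything else.
def sanitize_molfile(molfile)
  molfile.gsub(/[^0-9a-z\.\-\n ]/i, '')
end

# Shell metacharacters are stripped, leaving only harmless text.
puts sanitize_molfile(%q(benzene "$(rm -rf ~)" `ls`)).inspect

# An SD-file data header is altered too, which may matter to the reader.
puts sanitize_molfile("> <NAME>").inspect
```

    Note that tabs and carriage returns are removed as well, since neither is on the whitelist.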
