Casual Saturdays: Argument Clinic

Posted by Rich Apodaca Sat, 13 Oct 2007 09:31:00 GMT

JRuby for Cheminformatics: Parsing IUPAC Nomenclature with OPSIN

Posted by Rich Apodaca Fri, 12 Oct 2007 10:37:00 GMT

Recent articles have discussed the use of JRuby for cheminformatics. We've seen how to parse SMILES strings, and read or write InChIs. In this article, we'll see how easy it is to parse IUPAC nomenclature from JRuby using Peter Corbett's OPSIN library.

Installation

After installing JRuby, simply download the OPSIN jarfile and copy it to your JRuby lib directory. You're done.

A Simple Library

We can write a simple library to convert an IUPAC name into a CML document:

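require 'java'

import 'uk.ac.cam.ch.wwmm.opsin.NameToStructure'

module IUPAC
  # A single parser instance, created once and shared by every call.
  @@nts = NameToStructure.new

  # Converts an IUPAC name into a CML document, returned as a string.
  def read_name name
    cml = @@nts.parse_to_cml(name)

    # Fail loudly if no CML came back for this name.
    raise "Could not parse '#{name}'." unless cml

    cml.to_xml
  end
end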

The read_name method accepts an IUPAC name as a string and returns a CML document as a string. If the input can't be parsed, an exception is raised.

Testing the Library

We can test the library by saving it as a file called iupac.rb and invoking jirb:

$ jirb
irb(main):001:0> require 'iupac'
=> true
irb(main):002:0> include IUPAC
=> Object
irb(main):003:0> read_name('4-iodobenzoic acid')

This returns the XML shown below, which has been re-formatted for clarity:

<cml xmlns="http://www.xml-cml.org/schema">
  <molecule id="m1">
    <atomArray>
      <atom id="a1" elementType="C">
        <label value="1" />
      </atom>
      <atom id="a2" elementType="C">
        <label value="2" />
      </atom>
      <atom id="a3" elementType="C">
        <label value="3" />
      </atom>
      <atom id="a4" elementType="C">
        <label value="4" />
      </atom>
      <atom id="a5" elementType="C">
        <label value="5" />
      </atom>
      <atom id="a6" elementType="C">
        <label value="6" />
      </atom>
      <atom id="a7" elementType="C" />
      <atom id="a8" elementType="O" />
      <atom id="a9" elementType="O" />
      <atom id="a10" elementType="I">
        <label value="1" />
      </atom>
    </atomArray>
    <bondArray>
      <bond atomRefs2="a1 a2" order="2" />
      <bond atomRefs2="a2 a3" order="1" />
      <bond atomRefs2="a3 a4" order="2" />
      <bond atomRefs2="a4 a5" order="1" />
      <bond atomRefs2="a5 a6" order="2" />
      <bond atomRefs2="a6 a1" order="1" />
      <bond atomRefs2="a7 a1" order="1" />
      <bond atomRefs2="a7 a8" order="2" />
      <bond atomRefs2="a7 a9" order="1" />
      <bond atomRefs2="a10 a4" order="1" />
    </bondArray>
  </molecule>
</cml>

This simple Ruby library has parsed the name '4-iodobenzoic acid' and has returned a string containing the CML representation for the molecule. If we had wanted the read_name method to return a traversable XML object model, we could have enabled that as well.
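Even with a plain string in hand, the CML is easy to inspect from Ruby. The following is only a sketch, assuming REXML is available in your Ruby installation. The summarize_cml helper is hypothetical (it is not part of OPSIN), and the CML here is hand-abbreviated to the carboxylic acid atoms from the output above.

```ruby
require 'rexml/document'

# A shortened CML string of the kind read_name returns. It was abbreviated
# by hand for this sketch: only three atoms and two bonds are kept.
cml = <<~XML
  <cml xmlns="http://www.xml-cml.org/schema">
    <molecule id="m1">
      <atomArray>
        <atom id="a7" elementType="C" />
        <atom id="a8" elementType="O" />
        <atom id="a9" elementType="O" />
      </atomArray>
      <bondArray>
        <bond atomRefs2="a7 a8" order="2" />
        <bond atomRefs2="a7 a9" order="1" />
      </bondArray>
    </molecule>
  </cml>
XML

# Hypothetical helper: collects element symbols and bond orders from a
# CML string by walking every element in document order.
def summarize_cml(cml_string)
  root = REXML::Document.new(cml_string).root
  elements = []
  bond_orders = []

  root.each_recursive do |node|
    case node.name
    when 'atom' then elements << node.attributes['elementType']
    when 'bond' then bond_orders << node.attributes['order'].to_i
    end
  end

  { elements: elements, bond_orders: bond_orders }
end

summary = summarize_cml(cml)
puts summary[:elements].join(' ')    # C O O
puts summary[:bond_orders].inspect   # [2, 1]
```

Matching on the element name rather than an XPath expression sidesteps the default CML namespace, which keeps the sketch short.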

Conclusions

One of the objections raised whenever the issue of "new" programming languages comes up, regardless of their merit, is the age-old refrain "Yeah, but where's the software?" With JRuby, we bypass this question altogether. We can leverage the full scope of the massive Java development effort over the last ten years, which includes several excellent cheminformatics libraries. With virtually no effort, we have a working cheminformatics platform based on a widely-used, versatile and dynamic object-oriented scripting language. Future articles will discuss extensions to this platform and some applications.

Open Access Business Models That Can Actually Work: Sigma-Aldrich's ChemBlogs

Posted by Rich Apodaca Thu, 11 Oct 2007 12:49:00 GMT

A gem of a chemistry blog has been operating for some time - apparently without much notice. ChemBlogs is Sigma-Aldrich's Web answer to their Aldrichimica Acta print magazine, and it's packed with mini-reviews on synthetic chemistry with links to the primary literature. This approach to scientific marketing has so much potential, I can't imagine why others aren't doing it.

Nevertheless, there are some small things that could be done to make ChemBlogs a lot more effective. Here, in no particular order, are some suggestions:

  • Submit the RSS feed to Chemical Blogspace. Chemical Blogspace is perhaps the most widely-read aggregator of free chemistry content on the Web. And it's one of the best ways to get your chemistry blog noticed, bookmarked, and linked to.

  • Make it easier to discover and use a post's permalink. If I see an article I like in ChemBlogs, such as this one on gold catalysis, there's no obvious way for me to link to it in my own blog. Standard practice is that all titles on the front page are hyperlinked to the article's permalink. This article discusses the importance of permalinks.

  • Don't moderate comments - use reCAPTCHA instead. Nothing stifles online discussion like moderated comments. The Web is about immediacy. Make a change and see it live instantly. Everything else is so 1999. If spam is the concern, reCAPTCHA is a wonderful tool for the job.

  • Drop the company group when identifying authors. No reader cares whether Sharbil J. Firsan is part of the Marketing Group or not. In fact, it's a bit of a turn-off to have the word "Marketing" appear at all.

  • Each author should have an online bio that links to their name. Although titles and company divisions are not useful, other information about authors is. In a multi-author blog like ChemBlogs, the byline should hyperlink to a bio of the author, or a collection of their writing. This makes it easier for readers to follow authors they like.

  • Link to the primary literature via DOI. ChemBlogs cites many articles appearing in journals, which is a great thing. Unfortunately, there's no way for a search engine to know that this is happening. The simple fix is to hyperlink a literature citation to the DOI entry, like this one for Chem. Rev. 1994, 94, 2483-2547.

  • Include InChIs for all important structures. Free tools like InChIMatic can then be used to quickly find articles dealing with those molecules.

  • Post more frequently and/or regularly. More content means more eyeballs. When it's regularly posted, readers know when to expect it.

  • Invite some working scientists to write articles. If recent experience with Wikipedia and Chemistry is any guide, there are plenty of capable scientists more than willing to create free, high-quality compound monographs and other chemical content. Invite some of them to contribute very short articles for ChemBlogs in their area of expertise and see what happens.

  • Release all content under a Creative Commons License. Information wants to be free - why not make it free? Allowing ChemBlogs' content to spread far and wide just makes it that much more visible. For example, at last count, Depth-First content was reproduced on about a dozen other Web sites, including one in Korean. This matches my goals exactly, and it's all perfectly legal thanks to the way the content is licensed.
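The DOI suggestion above takes very little code to put into practice. Here is a rough sketch in Ruby, assuming the standard doi.org resolver. The doi_link helper is hypothetical, and the DOI shown is a made-up placeholder rather than the real identifier for the cited review.

```ruby
require 'cgi'

# Hypothetical helper: wraps a literature citation in a hyperlink to its
# DOI, so that both readers and search engines can follow it.
def doi_link(citation, doi)
  url = "https://doi.org/#{doi}"
  %(<a href="#{CGI.escapeHTML(url)}">#{CGI.escapeHTML(citation)}</a>)
end

# The DOI below is a placeholder for illustration only.
puts doi_link('Chem. Rev. 1994, 94, 2483-2547', '10.1000/example-doi')
# <a href="https://doi.org/10.1000/example-doi">Chem. Rev. 1994, 94, 2483-2547</a>
```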

With a little tweaking, Sigma-Aldrich's experiment in Permission Marketing could pay off - for everyone. Readers would conveniently get useful bits of information to make them more productive. The Internet would get new, high-quality chemical content - free to use and link to. Who knows - this might even become an Open Access business model that actually works.

And Sigma-Aldrich would have a far more effective marketing tool than anything else they currently use. With the possible exception of the Handbook, but even that could change.

Image Credit: angela7

How Would Your Cheminformatics Tool Do This?

Posted by Rich Apodaca Thu, 11 Oct 2007 08:47:00 GMT

JRuby for Cheminformatics: Reading and Writing InChIs Via the Java Native Interface

Posted by Rich Apodaca Wed, 10 Oct 2007 08:21:00 GMT

The increased use of the InChI identifier is making the reading and writing of InChIs a standard cheminformatics capability. Recent articles have discussed the advantages of JRuby for cheminformatics. One disadvantage of JRuby is that code written in C can't be directly used. This presents a potential problem for libraries, such as the InChI toolkit, that are written in C. Fortunately, the solution is simple. Today's tutorial will demonstrate how InChIs can be both read and written using the C-InChI toolkit via JRuby and the excellent JNI-InChI library.

About JNI-InChI

The JNI-InChI library, written by Jim Downing and Sam Adams, wraps the C InChI toolkit in a Java Native Interface. This low-level toolkit is suitable for building more complex software, but lacks many features present in the C InChI toolkit. For example, JNI-InChI doesn't directly interconvert SMILES or molfile with InChI. For that you'd need to build a support library. If you're building a toolkit from scratch, this lightweight approach can be a significant advantage.

The JNI-InChI binary distribution jarfile includes the compiled native InChI library. In this sense it's virtually indistinguishable from any other Java library. This simplified packaging makes it exceptionally easy to use JNI-InChI from JRuby, as we'll see below.

Installation

JRuby can be installed as described previously. To install the JNI-InChI library for JRuby, simply copy the current release jarfile into the lib directory of your JRuby installation. That's all there is to it.

A Simple Library

We can now write a simple library to read InChIs via JRuby:

require 'java'

include_class 'net.sf.jniinchi.JniInchiInput'
include_class 'net.sf.jniinchi.JniInchiInputInchi'
include_class 'net.sf.jniinchi.JniInchiWrapper'

module IUPAC
  def read_inchi inchi
    input = JniInchiInputInchi.new inchi

    JniInchiWrapper.getStructureFromInchi input
  end
end

Testing the Library

By saving the above library to a file called iupac.rb, we can parse InChIs via JRuby:

$ jirb
irb(main):001:0> require 'iupac'
=> true
irb(main):002:0> include IUPAC
=> Object
irb(main):003:0> output = read_inchi 'InChI=1/C14H10/c1-3-7-13-11(5-1)9-10-12-6-2-4-8-14(12)13/h1-10H'
=> #
irb(main):004:0> output.num_atoms
=> 14
irb(main):005:0> output.num_bonds
=> 16

Writing InChIs

Because JNI-InChI is a low-level toolkit, writing InChIs is feasible, but not trivial. We must first construct a representation, and then get the InChI for it. For example, we could get the InChI for methane as follows:

$ jirb
irb(main):001:0> require 'java'
=> true
irb(main):002:0> include_class 'net.sf.jniinchi.JniInchiInput'
=> ["net.sf.jniinchi.JniInchiInput"]
irb(main):003:0> include_class 'net.sf.jniinchi.JniInchiAtom'
=> ["net.sf.jniinchi.JniInchiAtom"]
irb(main):004:0> include_class 'net.sf.jniinchi.JniInchiWrapper'
=> ["net.sf.jniinchi.JniInchiWrapper"]
irb(main):005:0> input = JniInchiInput.new
=> #
irb(main):006:0> a1 = input.add_atom JniInchiAtom.new(0,0,0, "C")
=> #
irb(main):007:0> a1.set_implicit_h(4)
=> nil
irb(main):008:0> output = JniInchiWrapper.get_inchi input
=> #
irb(main):009:0> output.get_inchi
=> "InChI=1/CH4/h1H4"

Fortunately, we don't have to work that hard. The Chemistry Development Kit, through JNI-InChI, supports reading and writing of InChIs via a variety of molecular languages, including SMILES and molfile. More on that later, though.

Conclusions

Provided that a Java Native Interface exists for a C library, it can be used from JRuby. Future articles will discuss the use of other cheminformatics libraries written in either C or C++ from JRuby, and their integration with pure Java and Ruby libraries.
