Reading and Writing SD Files With MX

Posted by Rich Apodaca Mon, 15 Dec 2008 15:29:00 GMT

MDL Structure Data Files (SD Files) are the de facto standard for the exchange of chemical structures and associated data. As a result, methods for efficiently reading and writing these files play an important part in any cheminformatics toolkit.

The latest release of MX, the open source cheminformatics toolkit, adds support for reading and writing SD Files. Both source and platform-independent binary distributions are available.

The new release introduces SDFileReader. In interactive JRuby:

$ jirb
irb(main):001:0> require 'mx-0.107.0.jar'                       
=> true
irb(main):002:0> import com.metamolecular.mx.io.mdl.SDFileReader
=> Java::ComMetamolecularMxIoMdl::SDFileReader
irb(main):003:0> r=SDFileReader.new 'pubchem_sample_33.sdf'     
=> #<Java::ComMetamolecularMxIoMdl::SDFileReader:0x40b181 @java_object=com.metamolecular.mx.io.mdl.SDFileReader@145b02f>
irb(main):004:0> r.next_record                                  
=> nil
irb(main):005:0> m=r.get_molecule                               
=> #<Java::ComMetamolecularMxModel::DefaultMolecule:0xcb754f @java_object=com.metamolecular.mx.model.DefaultMolecule@60b407>
irb(main):006:0> m.count_atoms                                  
=> 31
irb(main):007:0> r.get_keys                                     
=> #<Java::JavaUtil::ArrayList:0x381d92 @java_object=[PUBCHEM_COMPOUND_CID, PUBCHEM_COMPOUND_CANONICALIZED, PUBCHEM_CACTVS_COMPLEXITY, PUBCHEM_CACTVS_HBOND_ACCEPTOR, PUBCHEM_CACTVS_HBOND_DONOR, PUBCHEM_CACTVS_ROTATABLE_BOND, PUBCHEM_CACTVS_SUBSKEYS, PUBCHEM_IUPAC_OPENEYE_NAME, PUBCHEM_IUPAC_CAS_NAME, PUBCHEM_IUPAC_NAME, PUBCHEM_IUPAC_SYSTEMATIC_NAME, PUBCHEM_IUPAC_TRADITIONAL_NAME, PUBCHEM_NIST_INCHI, PUBCHEM_EXACT_MASS, PUBCHEM_MOLECULAR_FORMULA, PUBCHEM_MOLECULAR_WEIGHT, PUBCHEM_OPENEYE_CAN_SMILES, PUBCHEM_OPENEYE_ISO_SMILES, PUBCHEM_CACTVS_TPSA, PUBCHEM_MONOISOTOPIC_WEIGHT, PUBCHEM_TOTAL_CHARGE, PUBCHEM_HEAVY_ATOM_COUNT, PUBCHEM_ATOM_DEF_STEREO_COUNT, PUBCHEM_ATOM_UDEF_STEREO_COUNT, PUBCHEM_BOND_DEF_STEREO_COUNT, PUBCHEM_BOND_UDEF_STEREO_COUNT, PUBCHEM_ISOTOPIC_ATOM_COUNT, PUBCHEM_COMPONENT_COUNT, PUBCHEM_CACTVS_TAUTO_COUNT, PUBCHEM_BONDANNOTATIONS]>
irb(main):008:0> r.get_data 'PUBCHEM_COMPOUND_CID'              
=> "1"

SDFileReader iterates lazily: Molecules and data are created only when they are requested.
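To make the idea of lazy iteration concrete, here is a minimal plain-Ruby sketch. It is not MX's implementation: the class name LazySDReader and the parse_count counter are hypothetical, and only the method names next_record, get_data and get_keys are borrowed from the session above. Advancing to a record only collects its raw lines; the data items are parsed the first time they are asked for.

```ruby
require 'stringio'

# Hypothetical illustration of lazy record handling; not MX's SDFileReader.
class LazySDReader
  attr_reader :parse_count # how many times data items were actually parsed

  def initialize(io)
    @io = io
    @raw = []
    @data = nil
    @parse_count = 0
  end

  # Collect raw lines up to the "$$$$" terminator without parsing them.
  # Returns true if a record was read, false at end of input.
  def next_record
    lines = []
    terminated = false
    while (line = @io.gets)
      line = line.chomp
      if line == '$$$$'
        terminated = true
        break
      end
      lines << line
    end
    @raw = lines
    @data = nil # forget anything parsed for the previous record
    terminated || !lines.empty?
  end

  def get_data(key)
    data[key]
  end

  def get_keys
    data.keys
  end

  private

  # Parse on first request, then reuse the result for this record.
  def data
    @data ||= parse_data
  end

  # Simplification: any line starting with ">" and containing <NAME>
  # is treated as a data header; a blank line ends the data item.
  def parse_data
    @parse_count += 1
    result = {}
    key = nil
    @raw.each do |line|
      if (m = line.match(/\A>.*?<([^>]+)>/))
        key = m[1]
        result[key] = []
      elsif line.empty?
        key = nil
      elsif key
        result[key] << line
      end
    end
    result.transform_values { |lines| lines.join("\n") }
  end
end

# Two abbreviated records (the molfile part is reduced to a placeholder).
sdf = <<~SDF
  benzene
  M  END
  >  <ID>
  1

  $$$$
  ethanol
  M  END
  >  <ID>
  2

  >  <NOTE>
  first line
  second line

  $$$$
SDF

reader = LazySDReader.new(StringIO.new(sdf))
reader.next_record          # first record is skipped without being parsed
reader.next_record          # second record: still only raw lines
puts reader.get_data('ID')  # parsing happens here, on demand
```

The benefit of this design is that scanning a large file for a few records of interest never pays the cost of parsing the records that are skipped.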

SD Files can be written with SDFileWriter. In interactive JRuby:

$ jirb
irb(main):001:0> require 'mx-0.107.0.jar'                       
=> true
irb(main):002:0> import com.metamolecular.mx.io.mdl.SDFileWriter
=> Java::ComMetamolecularMxIoMdl::SDFileWriter
irb(main):003:0> import com.metamolecular.mx.io.Molecules       
=> Java::ComMetamolecularMxIo::Molecules
irb(main):004:0> w=SDFileWriter.new 'output.sdf'                
=> #<Java::ComMetamolecularMxIoMdl::SDFileWriter:0x8a2023 @java_object=com.metamolecular.mx.io.mdl.SDFileWriter@43da1b>
irb(main):005:0> w.write_molecule Molecules.create_benzene      
=> nil
irb(main):006:0> w.write_data 'key', 'value'                    
=> nil
irb(main):007:0> w.close
=> nil
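For orientation, each record in an SD File consists of a molfile, followed by one data item per key, followed by a line containing `$$$$`. The sketch below shows that layout in plain Ruby. The helper names format_data_item and format_record are hypothetical and are not part of MX; the output of SDFileWriter itself may differ in whitespace details.

```ruby
# Hypothetical helpers showing the SD File record layout; not part of MX.

# One data item: a header line with the key in angle brackets,
# the value, and a blank line that ends the item.
def format_data_item(key, value)
  ">  <#{key}>\n#{value}\n\n"
end

# One record: molfile text, data items, then the "$$$$" terminator.
def format_record(molfile, data)
  body = molfile.end_with?("\n") ? molfile.dup : "#{molfile}\n"
  data.each { |key, value| body << format_data_item(key, value) }
  body << "$$$$\n"
end

# The molfile is reduced to a placeholder here.
puts format_record("benzene\nM  END", { 'key' => 'value' })
```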

For an up-to-date summary of MX's current capabilities, please check out the MX Homepage.

Comments

  1. Hanjo Kim Fri, 19 Dec 2008 03:53:59 GMT

    Nice work. But MX seems not able to manage multi-line data. It fetches only the first line. Do you have any plan to solve this problem?

  2. Rich Apodaca Fri, 19 Dec 2008 05:32:11 GMT

    Hanjo, thanks for bringing this up. It's a problem that needs to be fixed.

    If you're interested, I'd be happy for you to work on it. If not, could you file a bug report?

    If you'd like to contribute some code, one way would be to create your own fork, write some tests, and then write some code that makes the test pass. For some examples of how the tests might look, check out the test package.

    Once you send me a pull request, I'll pull your changes into my repository.

    Git and GitHub might seem to have a steep learning curve, but simple things are actually simple. GitHub has some excellent documentation and so does git.

  3. Hanjo Kim Sat, 20 Dec 2008 16:17:36 GMT

    Thank you for encouraging me to use GitHub, but I'm not a Java guy, so instead of digging into MX itself, I'll explain my quick (and dirty) solution based on your Ruby code here.

    The first step is to divide the record returned by the each_record method into key-value pairs using "\n\n" (like record.split("\n\n"); this will obviously produce errors if multi-line data contains a blank line). Then it is safe to use the multi-line mode of regular expressions, like this:

     pair.match(/^>\s+<(.+)>\n(.+)/m)
     key, data = $1, $2
    

    This method seems to work for me. I hope this helps.

  4. Hanjo Kim Sat, 20 Dec 2008 16:40:08 GMT

    You can see my code in this blog post.

  5. Bruno Bienfait Mon, 22 Dec 2008 08:34:18 GMT

    Hanjo,

    the regular expression shown in your Ruby code might fail if the header of the SD data is more complex, as in this example:

    >  <ROTATABLE_BONDS> (AA-173/40757587)
    4
    
    >  <LOGP> (AA-173/40757587)
    5.750000000000000e+000
    

    Bruno

  6. Hanjo Kim Tue, 23 Dec 2008 15:00:05 GMT

    Bruno,

    You are right. MX has a proper regular expression. My code above is not complete, just an example of my logic. It should be changed like this:

     pair.match(/^>\s+<([^>]+)>[^\n]*\n(.+)/m)
    

    Thanks for your interest.

  7. Rich Apodaca Tue, 06 Jan 2009 05:49:18 GMT

    Hanjo, an update and test supporting multi-line data in SD Files is now on GitHub (revision).
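The split-and-match approach discussed in this thread can be sketched in plain Ruby as follows. This is an illustration only, not the code that went into MX, and the helper name parse_data_items is hypothetical. It keeps the limitation noted above (a blank line inside a value ends that value early), but it tolerates extra text after the field name in the header and keeps every line of a multi-line value. Note that in Ruby the /m flag lets `.` match newlines, so the header part of the pattern uses `[^\n]` to stay on one line.

```ruby
# Hypothetical helper illustrating the split-and-match idea; not MX's code.
def parse_data_items(data_block)
  # Data items are separated by blank lines.
  data_block.split(/\n{2,}/).each_with_object({}) do |pair, items|
    # Header: ">" ... "<KEY>" ... (rest of header line ignored), then the value.
    m = pair.match(/\A>[^\n]*?<([^>\n]+)>[^\n]*\n(.+)\z/m)
    items[m[1]] = m[2].chomp if m
  end
end

block = <<~SDF
  >  <ROTATABLE_BONDS> (AA-173/40757587)
  4

  >  <LOGP> (AA-173/40757587)
  5.750000000000000e+000

  >  <NOTE>
  line one
  line two

SDF

parse_data_items(block).each do |key, value|
  puts "#{key}: #{value.inspect}"
end
```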
