Reading Large SDfiles in Rust
Chemical data analysis pipelines often start with reading SDfiles, the field's de facto standard for information exchange. Given the growing size of many chemistry data sets, efficient methods for reading SDfiles have become ever more important. As part of a continuing series on Rust for Cheminformatics, this article takes a hands-on first look at reading arbitrarily large SDfiles in Rust. A recent article by Noel O'Boyle on reading large SDfiles in Python offers a good comparison.
Data Set
SDfiles are not hard to find, but large ones might be. If you're drawing a blank on where to find SDfiles in the gigabyte range, consider PubChem. Specifically, the bulk download FTP server offers a wide selection of large SDfiles. This tutorial uses the first Compound file. For now, unzip it.
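For example, assuming the first file carries this name at the time of writing:
gunzip Compound_000000001_000500000.sdf.gz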
Hello, SDfile: Counting Lines
Before diving into record parsing, let's start with the simpler case of counting lines. Doing so highlights some important points.
If you haven't done so already, install Rust. Next, create a blank project using Cargo, Rust's build system:
cargo new sdread && cd sdread
Your project will contain a source file at src/main.rs. Replace its contents with the following:
// src/main.rs
use std::io::prelude::*;
use std::io::BufReader;
use std::fs::File;

fn main() -> std::io::Result<()> {
    // replace this with the full path to your SDfile
    let name = "/Users/rich/Downloads/Compound_000000001_000500000.sdf";
    let file = File::open(name)?;
    let reader = BufReader::new(file);
    let count = reader.lines().fold(0, |sum, _| sum + 1);

    println!("line count: {}", count);

    Ok(())
}
Rust programs start execution at the main function. Unlike its counterpart in some other languages, main requires neither arguments nor a return value. In this example, we've opted to return a value of type std::io::Result so that I/O errors will be propagated with the try operator (?).
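Conceptually, the try operator expands to an early return on error. The following sketch roughly shows what ? does with the File::open call in the listing (the real expansion also converts the error type via From::from):

let file = match File::open(name) {
    Ok(file) => file,
    Err(error) => return Err(error),
};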
main begins by opening the file and creating a buffered reader (BufReader). Calling the reader's lines method returns a value implementing Rust's Iterator trait, opening many possibilities. In this case, we use the Iterator#fold method to count lines. fold updates the previous (or initial) value by applying a closure to each element, then returns the final result. Here, we start with an initial count of zero and increment the count every time the reader yields another line. Analogous methods can be found in many functional languages.
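As a standalone illustration of fold, here's a sum over a small range. The closure receives the accumulator and the next element:

let sum = (1..=4).fold(0, |acc, n| acc + n);

assert_eq!(sum, 10); // 0 + 1 + 2 + 3 + 4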
BufReader is a good first choice for those situations in which it's unclear whether a file will fit in memory. Instead of loading all bytes into memory at once, they're streamed one line at a time.
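For contrast, here's a minimal sketch of the load-everything approach. It requires memory proportional to file size, which a multi-gigabyte SDfile may not allow:

use std::fs;
use std::io;

fn count_lines_in_memory(name: &str) -> io::Result<usize> {
    // reads the entire file into one String before counting
    let contents = fs::read_to_string(name)?;

    Ok(contents.lines().count())
}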
Next, build the project with Cargo:
cargo build
Execution time on my system clocked in at a rather alarming two minutes. What gives?
time ./target/debug/sdread
line count: 91736104
real 2m23.095s
user 2m19.330s
sys 0m1.364s
Compile a Release Build When Performance Counts
When a Rust program takes longer than expected to execute, a good first step is to check the compile target. The previous section used a debug build. Such a build includes overhead that isn't needed during normal execution. Removing it can lead to a substantial reduction in execution time.
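Cargo manages debug and release settings through profiles in Cargo.toml. The listing below spells out the relevant defaults for illustration; no changes are needed to benefit from them:

# Cargo.toml
[profile.dev]          # used by plain `cargo build`
opt-level = 0          # no optimization
debug-assertions = true

[profile.release]      # used by `cargo build --release`
opt-level = 3          # full optimization
debug-assertions = false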
Switch to a release build with:
cargo build --release
Running the release build yields a much more reasonable execution time:
time ./target/release/sdread
line count: 91736104
real 0m14.629s
user 0m14.017s
sys 0m0.593s
Iterate the First Record
Counting lines is instructive, but SDfiles are all about the records. Let's modify the example to return the first record of the SDfile as a vector of lines (Vec&lt;String&gt;). Here we'll introduce a custom Iterator implementation. For background, see Returning Rust Iterators. Replace the contents of src/main.rs with the following:
// src/main.rs
use std::io::Lines;
use std::io::BufRead;
use std::io::BufReader;
use std::fs::File;

struct Records {
    iter: Lines<BufReader<File>>
}

impl Iterator for Records {
    type Item = Vec<String>;

    fn next(&mut self) -> Option<Self::Item> {
        let mut lines = vec![ ];

        loop {
            match self.iter.next() {
                Some(result) => match result {
                    Ok(line) => {
                        if line == "$$$$" {
                            break Some(lines)
                        } else {
                            lines.push(line)
                        }
                    },
                    Err(error) => panic!("error reading line: {}", error)
                },
                None => break None
            }
        }
    }
}

fn main() -> std::io::Result<()> {
    let name = "/Users/rich/Downloads/Compound_000000001_000500000.sdf";
    let file = File::open(name)?;
    let reader = BufReader::new(file);
    let mut records = Records { iter: reader.lines() };

    if let Some(record) = records.next() {
        println!("record: {:?}", record);
    }

    Ok(())
}
The main function here resembles the previous version. But this time the reader is wrapped in a custom Records iterator. The first record, provided it exists, is then printed through a call to next.
The custom Records iterator is implemented as follows. Within the trait implementation (impl Iterator for Records), we declare the type of the item that will be iterated (in this case, a vector of String). Then, the next method is implemented. A functional Iterator requires just these two elements.
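To make the pattern concrete outside the SDfile context, here's a minimal, self-contained Iterator that counts down from a starting value:

struct Countdown(u32);

impl Iterator for Countdown {
    type Item = u32;

    fn next(&mut self) -> Option<u32> {
        if self.0 == 0 {
            None // signals the end of iteration
        } else {
            let current = self.0;
            self.0 -= 1;

            Some(current)
        }
    }
}

// Countdown(3) yields 3, 2, 1, then None.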
The next method adds lines to a vector buffer until finding the SDfile record terminator ($$$$). When that happens, the vector of lines is returned.
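With these definitions in place, standard iterator adapters compose with Records. For example, this variant of main (a sketch, not part of the original listing) inspects only the first three records:

fn main() -> std::io::Result<()> {
    let name = "/Users/rich/Downloads/Compound_000000001_000500000.sdf";
    let file = File::open(name)?;
    let reader = BufReader::new(file);
    let records = Records { iter: reader.lines() };

    // take limits iteration without reading the rest of the file
    for record in records.take(3) {
        println!("lines in record: {}", record.len());
    }

    Ok(())
}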
Count All Records
The Records type serves as a gateway into a vast range of powerful tools open to Rust iterators. But first we begin on a humble note: counting all of the records contained in the SDfile.
// src/main.rs
use std::io::Lines;
use std::io::BufRead;
use std::io::BufReader;
use std::fs::File;

struct Records {
    iter: Lines<BufReader<File>>
}

impl Iterator for Records {
    type Item = Vec<String>;

    fn next(&mut self) -> Option<Self::Item> {
        let mut lines = vec![ ];

        loop {
            match self.iter.next() {
                Some(result) => match result {
                    Ok(line) => {
                        if line == "$$$$" {
                            break Some(lines)
                        } else {
                            lines.push(line)
                        }
                    },
                    Err(error) => panic!("error reading line: {}", error)
                },
                None => break None
            }
        }
    }
}

fn main() -> std::io::Result<()> {
    let name = "/Users/rich/Downloads/Compound_000000001_000500000.sdf";
    let file = File::open(name)?;
    let reader = BufReader::new(file);
    let count = (Records { iter: reader.lines() })
        .fold(0, |count, _| count + 1);

    println!("records: {}", count);

    Ok(())
}
Here, Records hasn't changed, but main has. The fold method makes another appearance, but instead of counting lines, fold is counting records. Compile the release binary (cargo build --release), then execute:
time ./target/release/sdread
records: 442655
real 0m20.696s
user 0m19.598s
sys 0m0.781s
The result can be confirmed by counting the number of record terminators ($$$$) in the file:
grep -wc "\$\$\$\$" /Users/rich/Downloads/Compound_000000001_000500000.sdf
442655
This is a good time to point out that PubChem periodically culls Compound records and changes the size of its bulk download files. Future record counts may not match the totals presented here.
Compile Data Fields
The goal of most SDfile parsing is to extract structures and data. We can move one step closer to this goal by capturing Data Items in a HashMap. The listing below shows one way.
// src/main.rs
use std::io::Lines;
use std::io::BufRead;
use std::io::BufReader;
use std::fs::File;
use std::collections::HashMap;

struct Records {
    iter: Lines<BufReader<File>>
}

impl Iterator for Records {
    type Item = HashMap<String, String>;

    fn next(&mut self) -> Option<Self::Item> {
        let mut result = HashMap::new();

        // consume the record's first (title) line;
        // EOF here means there are no more records
        if self.iter.next().is_none() {
            return None
        }

        // skip the remainder of the molfile block
        while next_line(&mut self.iter) != "M  END" { }

        loop {
            let line = next_line(&mut self.iter);

            if line == "$$$$" {
                break Some(result);
            }

            let name = read_name(&line);
            let data = read_data(&mut self.iter);

            result.insert(name, data);
        }
    }
}

fn next_line(iter: &mut Lines<BufReader<File>>) -> String {
    match iter.next() {
        Some(line) => line.expect("reading line"),
        None => panic!("unexpected EOF")
    }
}

// extracts the Field Name from a Data Header such as "> <PUBCHEM_COMPOUND_CID>"
fn read_name(line: &str) -> String {
    let mut characters = line.chars();

    if let Some(first) = characters.next() {
        if first != '>' {
            panic!("no leading > on header line");
        }
    } else {
        panic!("unexpected blank line");
    }

    let mut result = String::new();
    let mut capture = false;

    for character in characters {
        if capture {
            if character == '>' {
                return result;
            } else {
                result.push(character);
            }
        } else if character == '<' {
            capture = true;
        }
    }

    if capture {
        panic!("no closing >");
    } else {
        panic!("no opening <");
    }
}

// collects data lines until the blank line that ends a Data Item
fn read_data(iter: &mut Lines<BufReader<File>>) -> String {
    let mut result = next_line(iter);

    loop {
        let line = next_line(iter);

        if line.is_empty() {
            break result;
        }

        result.push_str(&line);
    }
}

fn main() -> std::io::Result<()> {
    let name = "/Users/rich/Downloads/Compound_000000001_000500000.sdf";
    let file = File::open(name)?;
    let reader = BufReader::new(file);
    let records = Records { iter: reader.lines() };
    let stop = "5-(5-diazoimidazol-4-yl)-1H-1,2,4-triazole";

    for record in records {
        if let Some(name) = record.get("PUBCHEM_IUPAC_NAME") {
            if name == stop {
                let cid = record.get("PUBCHEM_COMPOUND_CID").expect("CID");

                println!("Found name at CID {}", cid);

                return Ok(());
            }
        }
    }

    println!("name not found");

    Ok(())
}
The main function uses a more complex Records iterator for selective capture of fields from an SDfile. It iterates until the target IUPAC name (stop) is found, then prints the Compound Identifier of the corresponding record. This record was chosen to be the last one in the file so that execution time can be compared to that of the previous examples.
HashMap is but one option for capturing Data Items. For example, nothing guarantees the uniqueness of a Field Name within the Data Items section. An alternative approach using a vector of tuples (Vec&lt;(String, String)&gt;) might be better suited for general use.
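Here's a minimal sketch of that alternative; insert_item and first_value are hypothetical helpers, not part of the listing above:

// ordered storage that tolerates duplicate Field Names
type DataItems = Vec<(String, String)>;

fn insert_item(items: &mut DataItems, name: String, data: String) {
    items.push((name, data)); // keeps duplicates and file order
}

fn first_value<'a>(items: &'a DataItems, name: &str) -> Option<&'a str> {
    items
        .iter()
        .find(|(item_name, _)| item_name.as_str() == name)
        .map(|(_, data)| data.as_str())
}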
Gotchas
When reading lines with the approach presented in this article, the question of newlines might come up. The spec says nothing on this topic. It turns out that BufReader will work with either LF or CRLF line endings, which covers the majority of situations. As noted in the Rust documentation:

The iterator returned from this function [lines] will yield instances of io::Result. Each string returned will not have a newline byte (the 0xA byte) or CRLF (0xD, 0xA bytes) at the end.
Speaking of newlines, reading individual lines may not be the most efficient approach. A reader that reads an entire record (up to $$$$), then works on the resulting full record string in memory, may be more efficient. Ultimately, benchmarking with real data sets would be needed.
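As a rough, untested sketch of that idea, here's a reader that accumulates one full record per call, reusing a single String buffer across records to limit allocation:

use std::fs::File;
use std::io::{BufRead, BufReader};

// appends one full record to buf, returning false at end of file
fn read_record(reader: &mut BufReader<File>, buf: &mut String) -> std::io::Result<bool> {
    buf.clear();
    let mut line = String::new();

    loop {
        line.clear();

        if reader.read_line(&mut line)? == 0 {
            return Ok(false); // EOF (a truncated final record is silently dropped)
        }

        if line.trim_end() == "$$$$" {
            return Ok(true);
        }

        buf.push_str(&line);
    }
}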
Future Directions
Although the reader described here is already quite functional, there's more to the story. Here are some possible jumping-off points:
- support *.sdf.gz files (see, for example, niffler); a sketch follows this list
- support collections of *.sdf.gz files
- concurrency, especially crates like Rayon that simplify the job
- better error handling, including validation of Field Names and line number reporting
- support for Data Headers lacking a field name, such as "DT13"
- refactor into modules
- unit/integration tests
- consider re-use vs re-allocation within a Records iterator
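As a starting point for the first item, here's a sketch using the flate2 crate (add flate2 = "1" to Cargo.toml); niffler, mentioned above, builds format auto-detection on top of this idea. Note that Records as written is hard-coded to Lines&lt;BufReader&lt;File&gt;&gt;, so making it generic over BufRead is part of the job:

use flate2::read::GzDecoder;
use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() -> std::io::Result<()> {
    let name = "/Users/rich/Downloads/Compound_000000001_000500000.sdf.gz";
    let file = File::open(name)?;

    // decompress on the fly; the file is never fully expanded in memory
    let reader = BufReader::new(GzDecoder::new(file));

    let count = reader.lines().count();
    println!("line count: {}", count);

    Ok(())
}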
Conclusion
This article offers a series of examples building up to a Rust Iterator for SDfile records. Iterators are quite powerful in Rust, so building on this foundation unlocks a lot of potential. Additionally, the platform presented here can be extended in several ways. Future articles will highlight some of them.