Reading CTfiles with CTcore
CTfile is a widely-used family of file formats in cheminformatics and computational chemistry. CTfiles are most commonly processed through a cheminformatics toolkit. But sometimes that kind of power is overkill. You might, for example, want to pull out just certain pieces of information from a file without the overhead of building high-level data structures. In other situations, a toolkit does too little. For example, you might be interested in precise error reporting down to the row, column and expected character. Maybe rigorous validation of CTfile output generated by another utility is what you're really after. In these and other cases, a low-level CTfile utility is a better fit than a general-purpose toolkit. This article describes the beginnings of such a utility.
About CTcore
CTcore (install) is a library for reading and eventually writing CTfiles. Written in Rust, CTcore emphasizes performance, pushing runtime errors to compile-time when possible, and flexible deployment. CTcore is based on the CTfile specification, interpreted as narrowly as possible to minimize data corruption and maximize output quality. CTcore is far from complete, but even in its primitive state it can do some useful things.
Parse CTfile Header
The best-known member of the CTfile family is molfile. Two versions are specified: "V2000" and "V3000." For compatibility, both versions use the same header format. This header can be parsed by CTcore as follows.
// FILE ./tests/data/v3k.mol
// Molecule Name
// ABTESTING111032100122D 11234.12341123456.12341123456
// A Comment
//   6  5  0     1               999 V2000
//
use std::fs;
use std::io::{self, Read};

// Reader, Sequence, FortranInt, FortranFloat, FixedCount, and BLACKLIST
// come from CTcore and are assumed to be in scope.
fn read_molfile_header() -> Result<(), io::Error> {
    // Propagate a failed open to the caller instead of panicking.
    let file = fs::File::open("./tests/data/v3k.mol")?;
    // Divert I/O errors into `err` so that Reader sees a plain u8 iterator.
    // https://stackoverflow.com/questions/26368288
    let mut err = Ok(());
    let mut buffer = io::BufReader::new(file)
        .bytes()
        .scan(&mut err, |err, res| match res {
            Ok(i) => Some(i),
            Err(e) => {
                **err = Err(e);
                None
            }
        });
    let mut reader = Reader::new(&mut buffer);

    // Line 1: molecule name
    assert_eq!(
        reader.line_with_blacklist::<80>(BLACKLIST),
        Ok(Sequence::from_str(
            format!("{: <80}", "Molecule Name").as_str()
        ))
    );
    assert_eq!(reader.newline(), Ok(()));

    // Line 2: parameters line
    assert_eq!(reader.sequence::<2>(), Ok(Sequence::from_str("AB")));
    assert_eq!(reader.sequence::<8>(), Ok(Sequence::from_str("TESTING1")));
    assert_eq!(
        reader.sequence::<10>(),
        Ok(Sequence::from_str("1103210012"))
    );
    assert_eq!(reader.sequence::<2>(), Ok(Sequence::from_str("2D")));
    assert_eq!(reader.fortran_int::<2>(), Ok(FortranInt::from_int(1)));
    assert_eq!(
        reader.fortran_float::<4, 5>(),
        Ok(FortranFloat::from_float(1234.12341))
    );
    assert_eq!(
        reader.fortran_float::<6, 5>(),
        Ok(FortranFloat::from_float(123456.12341))
    );
    assert_eq!(reader.fortran_int::<6>(), Ok(FortranInt::from_int(123456)));
    assert_eq!(reader.newline(), Ok(()));

    // Line 3: comment
    assert_eq!(reader.line::<80>(), Ok(Sequence::from_str("A Comment")));
    assert_eq!(reader.newline(), Ok(()));

    // Line 4: counts line (blank fields read as None)
    assert_eq!(reader.fixed_count::<3>(), Ok(FixedCount::from_int(6)));
    assert_eq!(reader.fixed_count::<3>(), Ok(FixedCount::from_int(5)));
    assert_eq!(reader.fixed_count::<3>(), Ok(FixedCount::from_int(0)));
    assert_eq!(reader.fixed_count::<3>(), Ok(None));
    assert_eq!(reader.fixed_count::<3>(), Ok(FixedCount::from_int(1)));
    assert_eq!(reader.fixed_count::<3>(), Ok(None));
    assert_eq!(reader.fixed_count::<3>(), Ok(None));
    assert_eq!(reader.fixed_count::<3>(), Ok(None));
    assert_eq!(reader.fixed_count::<3>(), Ok(None));
    assert_eq!(reader.fixed_count::<3>(), Ok(None));
    assert_eq!(reader.fixed_count::<3>(), Ok(FixedCount::from_int(999)));
    assert_eq!(reader.sequence::<6>(), Ok(Sequence::from_str(" V2000")));

    // Report any I/O error captured while iterating.
    err
}
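The scan adapter in the listing above is worth isolating, because it is what lets a fallible byte stream be handed to code that expects plain u8 values. The following is a minimal sketch using only the standard library. The helper name collect_bytes is hypothetical and is not part of CTcore.

```rust
use std::io::{self, Read};

/// Collects bytes from any reader, stopping at the first I/O error.
/// The error is stored in `err` rather than yielded, so code consuming
/// the iterator only ever sees plain `u8` values.
/// (Hypothetical helper for illustration; not part of CTcore.)
fn collect_bytes<R: Read>(source: R) -> Result<Vec<u8>, io::Error> {
    let mut err: Result<(), io::Error> = Ok(());
    let bytes: Vec<u8> = source
        .bytes()
        .scan(&mut err, |err, res| match res {
            Ok(byte) => Some(byte),
            Err(e) => {
                // Record the failure and end iteration.
                **err = Err(e);
                None
            }
        })
        .collect();
    // Surface the captured error, if any, once iteration is done.
    err.map(|()| bytes)
}
```

The caller must remember to inspect the captured error after the iterator is exhausted, which is why the header-reading function ends by returning err.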
The remainder of this article discusses how CTcore works, at both low and high levels of abstraction.
ASCII All the Way Down
As noted in the recent article CTfile Character Encoding, the CTfile documentation says nothing about character encoding. This presents a major problem for implementors. For one thing, CTfile is a line-oriented format in which some lines have a hard character count limit. Not even knowing what constitutes a character can lead to very bad outcomes. Second and more importantly, as Spolsky notes in his famous article on the topic of Unicode, a character encoding is required before any byte sequence can be transformed into text in the first place.
Although the documentation itself is silent on the issue of character encoding, the context surrounding the development of CTfile points squarely at ASCII. The lack of any guidance on this issue from a succession of CTfile corporate sponsors suggests that ASCII remains the character encoding today.
For these reasons, CTcore uses ASCII (US-ASCII) character encoding for all byte sequences. This means that files containing bytes outside the ASCII range, whether from Unicode text encoded as UTF-8, the various Latin encodings, UTF-16, and so on, will be rejected as errors. CTcore enforces this restriction at the level of data structures, meaning that it will be impossible to write a CTfile through CTcore using any encoding other than ASCII. The one exception, should CTcore eventually support it, would be XDfile, an XML-based alternative to SDfile that uses UTF-8 encoding.
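To make the rejection rule concrete, here is a minimal sketch of an ASCII check over a byte slice. The names NonAscii and check_ascii are hypothetical. CTcore itself enforces the rule through its data structures rather than through a standalone function like this one.

```rust
/// Position and value of the first byte outside the 7-bit ASCII range.
/// (Hypothetical type for illustration; not part of CTcore.)
#[derive(Debug, PartialEq)]
struct NonAscii {
    offset: usize,
    byte: u8,
}

/// Returns Ok if every byte is ASCII, otherwise reports the first offender.
fn check_ascii(bytes: &[u8]) -> Result<(), NonAscii> {
    match bytes.iter().position(|b| !b.is_ascii()) {
        Some(offset) => Err(NonAscii {
            offset,
            byte: bytes[offset],
        }),
        None => Ok(()),
    }
}
```

A name such as "café" encoded as UTF-8 fails at the first byte of the two-byte sequence for the accented letter.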
Data Structures
CTcore favors data structures that are guaranteed to be valid at compile time. This approach has been documented in the essay Parse, Don't Validate. The approach is based on a simple rule of thumb: "write functions on the data representation you wish you had, not the data representation you are given." An important tool toward that goal is data structures that make illegal states unrepresentable. A second essay from a different author, Can Types Replace Validation, adds some useful nuance to the idea.
CTcore applies these ideas throughout its design and implementation.
The base type, Character, represents an ASCII character. The standard Rust text types String and char are Unicode-based: String holds UTF-8 encoded text and char holds any Unicode scalar value. These are avoided by CTcore because supporting them in data structures would make it possible to encode illegal states unnecessarily. Character is an enumeration that represents the printable ASCII characters like so:
pub enum Character {
Space,
Exclamation,
DoubleQuote,
// ...
}
Character supports convenience conversion methods allowing interconversion with the u8 type for octet encoding.
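The article does not show the conversion methods themselves. One plausible shape, sketched here under the assumption that the standard TryFrom and From traits are used, covers only the three variants listed above and treats every other byte as an error.

```rust
use std::convert::TryFrom;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Character {
    Space,
    Exclamation,
    DoubleQuote,
    // ... remaining printable ASCII characters omitted in this sketch
}

// Assumption: fallible conversion from a raw byte. The rejected byte is
// returned as the error value.
impl TryFrom<u8> for Character {
    type Error = u8;

    fn try_from(byte: u8) -> Result<Self, Self::Error> {
        match byte {
            0x20 => Ok(Character::Space),
            0x21 => Ok(Character::Exclamation),
            0x22 => Ok(Character::DoubleQuote),
            other => Err(other),
        }
    }
}

// Converting back to a byte can never fail.
impl From<Character> for u8 {
    fn from(character: Character) -> u8 {
        match character {
            Character::Space => 0x20,
            Character::Exclamation => 0x21,
            Character::DoubleQuote => 0x22,
        }
    }
}
```

Because the only way to obtain a Character from a byte is the fallible conversion, a non-ASCII byte can never end up inside one.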
A Character sequence is captured by the Sequence type:
pub struct Sequence<const L: usize>(Vec<Character>);
The type parameter L is an example of one of Rust's newer features, const generics. The L type parameter represents the maximum allowable length of the Sequence. When used to represent, say, the molfile name field, it will be impossible to encode a name longer than 80 characters when L is set to 80. This keeps the molfile name field within spec at all times. Sequence supports methods allowing conversion to/from other types.
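The length limit can be illustrated with a simplified stand-in for Sequence. This sketch stores raw bytes instead of Character values, and the constructor name try_new and its Option return type are assumptions rather than CTcore's actual API.

```rust
/// Simplified stand-in for CTcore's Sequence: at most L printable ASCII bytes.
#[derive(Debug, PartialEq)]
struct Sequence<const L: usize>(Vec<u8>);

impl<const L: usize> Sequence<L> {
    /// Hypothetical constructor: returns None if the text is longer than L
    /// or contains anything other than printable ASCII.
    fn try_new(text: &str) -> Option<Self> {
        let printable = text.bytes().all(|b| (0x20..=0x7E).contains(&b));
        if printable && text.len() <= L {
            Some(Sequence(text.bytes().collect()))
        } else {
            None
        }
    }

    fn len(&self) -> usize {
        self.0.len()
    }
}
```

Sequence<80> and Sequence<2> are distinct types, so a function that accepts a molfile name cannot be handed a value built under a different length limit.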
One limitation of const generics affects CTcore: it's possible for the type parameter L to equal zero. This is a zero-length character sequence. It would be better if L were required to be positive, but the current implementation of const generics does not support this restriction. This could, however, change in future releases.
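A partial workaround that is sometimes used on stable Rust is an associated constant containing an assertion. This is not described as part of CTcore, and it is weaker than a true bound: the constraint does not appear in the type signature, and the error only surfaces when code that references the constant is compiled for a specific L. The type name NonEmpty below is hypothetical.

```rust
/// Hypothetical wrapper whose constructor refuses to compile when L is zero.
struct NonEmpty<const L: usize>(Vec<u8>);

impl<const L: usize> NonEmpty<L> {
    // Evaluated at compile time for each concrete L that reaches `new`.
    const POSITIVE: () = assert!(L > 0, "L must be greater than zero");

    fn new() -> Self {
        // Referencing the constant forces the check for this particular L.
        let () = Self::POSITIVE;
        NonEmpty(Vec::with_capacity(L))
    }

    fn max_len(&self) -> usize {
        L
    }

    fn is_empty(&self) -> bool {
        self.0.is_empty()
    }
}
```

With this in place, a call such as NonEmpty::<0>::new() is rejected by the compiler, while NonEmpty::<80>::new() builds normally.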
CTfile supports various numerical types, which appear as CTcore data structures.
FortranInt and FortranFloat are used on the second line of the molfile header, a line I refer to as "the parameters line." As the name suggests, the Fortran types embody rules inherited from the programming language Fortran's numerical formatting capabilities. These capabilities and the format itself are beyond the scope of today's article. Suffice it to say that many rules apply and that FortranInt and FortranFloat encapsulate them all.
FortranInt is used in the parameters line's "internal registry number" field, where I equals six.
pub struct FortranInt<const I: usize> {
kind: Kind<I>,
}
enum Kind<const I: usize> {
Positive(FixedNatural<I>),
Negative(FixedNatural<I>),
Zero,
}
Positive and negative variants of FortranInt contain a FixedNatural. This is a fixed-width natural number.
The FortranInt kind field is private, meaning that it is impossible to construct a FortranInt through a struct literal from outside its module. Instead, one of the public constructor methods must be used. The type parameter I is the integer's width in characters. The internal registry number is restricted to no more than six characters, including a possible minus sign. FortranInt's methods ensure that this constraint is met before a value can be used.
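The width rule, including the minus sign, can be sketched with a much simpler representation than the Kind enumeration shown above. In this sketch the value is held as an i64, and the method to_field is a hypothetical addition. The Option return type of from_int is also an assumption.

```rust
/// Simplified stand-in for CTcore's FortranInt, using an i64 internally.
#[derive(Debug, PartialEq)]
struct FortranInt<const I: usize> {
    value: i64, // private: callers outside the module must use `from_int`
}

impl<const I: usize> FortranInt<I> {
    /// Returns None when the decimal form, including any minus sign,
    /// does not fit in I characters.
    fn from_int(value: i64) -> Option<Self> {
        if value.to_string().len() <= I {
            Some(FortranInt { value })
        } else {
            None
        }
    }

    /// Hypothetical helper: renders the value right-justified in I columns.
    fn to_field(&self) -> String {
        format!("{:>width$}", self.value, width = I)
    }
}
```

Note that -123456 is rejected for a width of six even though 123456 is accepted, because the sign occupies one of the six columns.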
FortranFloat is used by the "energy" and "scaling factors" part of the parameters line. Extending FortranInt, FortranFloat uses the same safeguards to ensure that only valid values can be created.
pub struct FortranFloat<const I: usize, const F: usize> {
integer_part: FortranInt<I>,
fractional_part: Vec<Digit>,
}
As before, the two fields are private, meaning that FortranFloat values can only be created through a validating constructor method.
Rounding out the current collection of numerical types is FixedCount. FixedCount is a fixed-width integer greater than or equal to zero. It is used on the molfile "counts" line (Line 4). Unfortunately, the restrictions on such fields are not covered in any detail by the CTfile documentation. For this reason, the validation rules are quite liberal.
pub enum FixedCount<const I: usize> {
Positive(FixedNatural<I>),
Zero,
}
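The header example earlier shows blank counts fields being read as None. A minimal sketch of that behavior follows, assuming right-justified digits padded with leading spaces. That assumption is mine, since the article notes that the documentation leaves the rules open. The function parse_count is hypothetical, and the Positive variant holds a u32 instead of a FixedNatural.

```rust
/// Simplified stand-in for CTcore's FixedCount.
#[derive(Debug, PartialEq)]
enum FixedCount<const I: usize> {
    Positive(u32),
    Zero,
}

/// Hypothetical parser for one counts-line field of exactly I characters.
/// A field of spaces yields Ok(None); anything other than optional leading
/// spaces followed by digits is an error.
fn parse_count<const I: usize>(field: &str) -> Result<Option<FixedCount<I>>, String> {
    if field.len() != I {
        return Err(format!("expected {} characters, found {}", I, field.len()));
    }
    let digits = field.trim_start_matches(' ');
    if digits.is_empty() {
        return Ok(None);
    }
    if !digits.bytes().all(|b| b.is_ascii_digit()) {
        return Err(format!("unexpected character in {:?}", field));
    }
    match digits.parse::<u32>() {
        Ok(0) => Ok(Some(FixedCount::Zero)),
        Ok(n) => Ok(Some(FixedCount::Positive(n))),
        Err(e) => Err(e.to_string()),
    }
}
```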
Reader
CTfiles are read using a Reader instance. Reader wraps a Rust u8 iterator, allowing Reader to be used not only with CTfiles encoded as strings, but also with byte sequences obtained directly from storage and network devices.
Reader supports a variety of public methods useful for processing CTfiles. These methods are in turn supported by private low-level methods. Public methods include:
line. Read a line up to, but not including, the newline sequence.
line_with_blacklist. Read a line up to, but not including, the newline sequence, while rejecting blacklisted sequences. This is useful for reading the "name" field for molfiles.
sequence. Read a sequence of characters.
fortran_int. Read a FortranInt.
fortran_float. Read a FortranFloat.
fixed_count. Read a FixedCount.
newline. Read a newline, reset the current column to zero, and increment the current row.
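The row and column bookkeeping behind these methods can be sketched as follows. This is a simplified stand-in, not CTcore's Reader. The next_byte method, the two error variants, and the decision to accept both "\n" and "\r\n" as newline sequences are all assumptions made for illustration.

```rust
/// Errors carry the (row, column) at which reading failed.
#[derive(Debug, PartialEq)]
enum Error {
    Character(usize, usize),
    EndOfInput(usize, usize),
}

/// Simplified reader over any u8 iterator that tracks its position.
struct Reader<I: Iterator<Item = u8>> {
    bytes: I,
    row: usize,
    column: usize,
}

impl<I: Iterator<Item = u8>> Reader<I> {
    fn new(bytes: I) -> Self {
        Reader { bytes, row: 0, column: 0 }
    }

    /// Low-level step: consume one byte and advance the column.
    fn next_byte(&mut self) -> Result<u8, Error> {
        match self.bytes.next() {
            Some(byte) => {
                self.column += 1;
                Ok(byte)
            }
            None => Err(Error::EndOfInput(self.row, self.column)),
        }
    }

    /// Reads "\n" or "\r\n", then moves to column zero of the next row.
    fn newline(&mut self) -> Result<(), Error> {
        let (row, column) = (self.row, self.column);
        let mut byte = self.next_byte()?;
        if byte == b'\r' {
            byte = self.next_byte()?;
        }
        if byte != b'\n' {
            return Err(Error::Character(row, column));
        }
        self.row += 1;
        self.column = 0;
        Ok(())
    }
}
```

Because the reader is generic over the iterator, the same code works with "A\nB".bytes() in a test and with a buffered file in production.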
These methods have built-in support for sophisticated error handling. Both the row and column number can be reported. For some errors, an expected list of ASCII characters is given. This level of error reporting is designed to work well in environments requiring strict adherence to specifications. Nevertheless, the level of detail can also aid in pinpointing encoding errors in other contexts.
Consider the following mis-formatted Fortran F6.3 field:
42.12a
Executing Reader#fortran_float<2, 3>() yields the following error:
Error::Character(0, 5, vec![
Character::D0,
Character::D1,
Character::D2,
// ...
Character::D9
])
This error tells us that at Row 0, Column 5 a digit character was expected.
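The way such an error can be produced is sketched below for a deliberately reduced version of the format: exactly int_width digits, a decimal point, and frac_width digits, with no leading spaces or minus sign. The function check_fortran_float is hypothetical, and expected characters are represented as raw bytes, not as Character values.

```rust
/// (row, column, expected bytes), mirroring the shape of the error above.
#[derive(Debug, PartialEq)]
enum Error {
    Character(usize, usize, Vec<u8>),
}

/// Hypothetical check of a fixed-width float field on the given row.
/// Simplification: every position must be a digit except the decimal point.
fn check_fortran_float(
    field: &[u8],
    row: usize,
    int_width: usize,
    frac_width: usize,
) -> Result<(), Error> {
    let digits: Vec<u8> = (b'0'..=b'9').collect();
    for column in 0..(int_width + 1 + frac_width) {
        // The decimal point sits immediately after the integer part.
        let expected = if column == int_width {
            vec![b'.']
        } else {
            digits.clone()
        };
        match field.get(column) {
            Some(byte) if expected.contains(byte) => {}
            // Wrong byte or premature end of field: report what was expected.
            _ => return Err(Error::Character(row, column, expected)),
        }
    }
    Ok(())
}
```

Applied to the field 42.12a with an integer width of 2 and a fractional width of 3, this sketch reports Row 0, Column 5 with the ten digits as the expected set, matching the error shown above.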
Uses
Two main uses for CTcore are clear: (1) a starting point for libraries focused on reading and writing CTfile member formats (e.g., V2000 molfile, V3000 molfile, SDfile, etc.); and (2) an efficient, low-level utility for those situations requiring light processing of CTfile content.
The first major application for CTcore will probably be reading V3000 molfiles. A previous article introduced Trey, a Rust crate for working with V3000 molfiles. It currently doesn't support reading V3000 molfiles, but given some more work on CTcore, this should be possible.
Conclusion
This article presents the first steps toward a suite of precision tools for reading and writing the CTfile format. Its first applications are likely to be support for reading and writing member CTfile formats such as molfile and SDfile.