Reading CTfiles with CTcore

By Richard L. Apodaca

2022-11-09T22:00:00Z

CTfile is a widely used family of file formats in cheminformatics and computational chemistry. CTfiles are most commonly processed through a cheminformatics toolkit. But sometimes that kind of power is overkill. You might, for example, want to pull out just certain pieces of information from a file without the overhead of building high-level data structures. In other situations, a toolkit does too little. For example, you might be interested in precise error reporting down to the row, column, and expected character. Maybe rigorous validation of CTfile output generated by another utility is what you're really after. In these and other cases, a low-level CTfile utility is a better fit than a general-purpose toolkit. This article describes the beginnings of such a utility.

About CTcore

CTcore (install) is a library for reading and eventually writing CTfiles. Written in Rust, CTcore emphasizes performance, flexible deployment, and pushing runtime errors to compile time when possible. CTcore is based on the CTfile specification, interpreted as narrowly as possible to minimize data corruption and maximize output quality. CTcore is far from complete, but even in its primitive state it can do some useful things.

Parse CTfile Header

The best-known member of the CTfile family is molfile. Two versions are specified: "V2000" and "V3000." For compatibility, both versions use the same header format. This header can be parsed by CTcore as follows.

// FILE ./tests/data/v3k.mol
// Molecule Name                                                                   
// ABTESTING111032100122D 11234.12341123456.12341123456
// A Comment
//   6  5  0     1               999 V2000
//
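use std::fs;
use std::io::{self, Read};
// Reader, Sequence, FortranInt, FortranFloat, FixedCount, and BLACKLIST
// are assumed to be imported from CTcore.

fn read_molfile_header() -> Result<(), io::Error> {
    // Propagate a failed open instead of panicking.
    let file = fs::File::open("./tests/data/v3k.mol")?;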

    // https://stackoverflow.com/questions/26368288
    let mut err = Ok(());
    let mut buffer =
        io::BufReader::new(file)
            .bytes()
            .scan(&mut err, |err, res| match res {
                Ok(i) => Some(i),
                Err(e) => {
                    **err = Err(e);
                    None
                }
            });
    let mut reader = Reader::new(&mut buffer);

    assert_eq!(
        reader.line_with_blacklist::<80>(BLACKLIST),
        Ok(Sequence::from_str(
            format!("{: <80}", "Molecule Name").as_str()
        ))
    );
    assert_eq!(reader.newline(), Ok(()));
    assert_eq!(reader.sequence::<2>(), Ok(Sequence::from_str("AB")));
    assert_eq!(reader.sequence::<8>(), Ok(Sequence::from_str("TESTING1")));
    assert_eq!(
        reader.sequence::<10>(),
        Ok(Sequence::from_str("1103210012"))
    );
    assert_eq!(reader.sequence::<2>(), Ok(Sequence::from_str("2D")));
    assert_eq!(reader.fortran_int::<2>(), Ok(FortranInt::from_int(1)));
    assert_eq!(
        reader.fortran_float::<4, 5>(),
        Ok(FortranFloat::from_float(1234.12341))
    );
    assert_eq!(
        reader.fortran_float::<6, 5>(),
        Ok(FortranFloat::from_float(123456.12341))
    );
    assert_eq!(reader.fortran_int::<6>(), Ok(FortranInt::from_int(123456)));
    assert_eq!(reader.newline(), Ok(()));
    assert_eq!(reader.line::<80>(), Ok(Sequence::from_str("A Comment")));
    assert_eq!(reader.newline(), Ok(()));
    assert_eq!(reader.fixed_count::<3>(), Ok(FixedCount::from_int(6)));
    assert_eq!(reader.fixed_count::<3>(), Ok(FixedCount::from_int(5)));
    assert_eq!(reader.fixed_count::<3>(), Ok(FixedCount::from_int(0)));
    assert_eq!(reader.fixed_count::<3>(), Ok(None));
    assert_eq!(reader.fixed_count::<3>(), Ok(FixedCount::from_int(1)));
    assert_eq!(reader.fixed_count::<3>(), Ok(None));
    assert_eq!(reader.fixed_count::<3>(), Ok(None));
    assert_eq!(reader.fixed_count::<3>(), Ok(None));
    assert_eq!(reader.fixed_count::<3>(), Ok(None));
    assert_eq!(reader.fixed_count::<3>(), Ok(None));
    assert_eq!(reader.fixed_count::<3>(), Ok(FixedCount::from_int(999)));
    assert_eq!(reader.sequence::<6>(), Ok(Sequence::from_str(" V2000")));

    err
}

The remainder of this article discusses how CTcore works, at both low and high levels of abstraction.

ASCII All the Way Down

As noted in the recent article CTfile Character Encoding, the CTfile documentation says nothing about character encoding. This presents a major problem for implementors. For one thing, CTfile is a line-oriented format in which some lines have a hard character count limit. Not even knowing what constitutes a character can lead to very bad outcomes. Second and more importantly, as Spolsky notes in his famous article on the topic of Unicode, a character encoding is required before any byte sequence can be transformed into text in the first place.

Although the documentation itself is silent on the issue of character encoding, the context surrounding the development of CTfile points squarely at ASCII. The lack of any guidance on this issue from a succession of CTfile corporate sponsors suggests that ASCII remains the character encoding today.

For these reasons, CTcore uses ASCII (US-ASCII) character encoding for all byte sequences. This means that files containing, for example, Unicode, various Latin encodings, UTF-16, and so on will be rejected as errors. CTcore enforces this restriction at the level of data structures, meaning that it will be impossible to write a CTfile through CTcore using any encoding other than ASCII. The one exception, should CTcore eventually support it, would be XDfile, an XML-based alternative to SDfile that uses UTF-8 encoding.
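
To make the rejection rule concrete, here is a minimal sketch of an ASCII check over raw bytes. It is not CTcore's code: the function name first_non_ascii is hypothetical, and the sketch only relies on the standard library's u8::is_ascii.

```rust
/// Returns the index of the first byte outside the 7-bit ASCII range,
/// or None when every byte is ASCII.
/// Hypothetical helper for illustration; not part of CTcore.
fn first_non_ascii(bytes: &[u8]) -> Option<usize> {
    bytes.iter().position(|byte| !byte.is_ascii())
}
```

A UTF-16 byte order mark or a Latin-1 accented letter would be flagged at the offending index, which is the kind of position information a strict reader needs.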

Data Structures

CTcore favors data structures that are guaranteed to be valid at compile time. This approach has been documented in the essay Parse, Don't Validate. The approach is based on a simple rule of thumb: "write functions on the data representation you wish you had, not the data representation you are given." An important tool toward that goal is data structures that make illegal states unrepresentable. A second essay from a different author, Can Types Replace Validation, adds some useful nuance to the idea.

CTcore applies these ideas throughout its design and implementation.

The base type, Character, represents an ASCII character. Rust's built-in String type holds UTF-8 text, and char can hold any Unicode scalar value. CTcore avoids both because supporting them in data structures would needlessly make illegal states representable. Character is an enumeration that represents the printable ASCII characters like so:

pub enum Character {
    Space,
    Exclamation,
    DoubleQuote,
    // ...
}

Character supports convenience conversion methods allowing interconversion with the u8 type for octet encoding.
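
As a rough illustration of what such conversions could look like, the sketch below uses a three-variant stand-in for Character. The method names from_byte and to_byte are my assumptions for this example and may not match CTcore's actual API.

```rust
/// Three-variant stand-in for the full enumeration (illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Character {
    Space,
    Exclamation,
    DoubleQuote,
}

impl Character {
    /// Hypothetical conversion from an octet; None means "not representable".
    fn from_byte(byte: u8) -> Option<Character> {
        match byte {
            0x20 => Some(Character::Space),
            0x21 => Some(Character::Exclamation),
            0x22 => Some(Character::DoubleQuote),
            _ => None,
        }
    }

    /// Hypothetical conversion back to an octet; this direction cannot fail.
    fn to_byte(self) -> u8 {
        match self {
            Character::Space => 0x20,
            Character::Exclamation => 0x21,
            Character::DoubleQuote => 0x22,
        }
    }
}
```

The asymmetry is the point: every Character maps to a byte, but only some bytes map to a Character.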

A Character sequence is captured by the Sequence type:

pub struct Sequence<const L: usize>(Vec<Character>);

The parameter L is an example of one of Rust's newer features, const generics. It represents the maximum allowable length of the Sequence. When used to represent, say, the molfile name field with L set to 80, it is impossible to encode a name longer than 80 characters. This keeps the molfile name field within spec at all times. Sequence supports methods allowing conversion to/from other types.

One limitation of const generics affects CTcore: it's possible for the parameter L to equal zero, which yields a type that can only hold a zero-length character sequence. It would be better if L were required to be positive, but the current implementation of const generics does not support this restriction. This could, however, change in future releases.
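
The length bound can be sketched with a simplified type. Bounded and its new and len methods are hypothetical names chosen for this example, and the sketch stores raw bytes rather than Character values to stay short.

```rust
/// Simplified stand-in for a length-bounded sequence (illustration only).
/// The inner Vec is private, so values must come through `new`.
#[derive(Debug, PartialEq)]
struct Bounded<const L: usize>(Vec<u8>);

impl<const L: usize> Bounded<L> {
    /// Accepts ASCII text of at most L characters; returns None otherwise.
    fn new(text: &str) -> Option<Self> {
        if text.is_ascii() && text.len() <= L {
            Some(Bounded(text.as_bytes().to_vec()))
        } else {
            None
        }
    }

    /// Number of characters currently held.
    fn len(&self) -> usize {
        self.0.len()
    }
}
```

With this shape, Bounded::<80>::new rejects an 81-character name, while Bounded::<0> still compiles and can only ever hold the empty sequence, which is the limitation described above.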

CTfile supports various numerical types, which appear as CTcore data structures.

FortranInt and FortranFloat are used on the second line of the molfile header, a line I refer to as "the parameters line." As the name suggests, the Fortran types embody rules inherited from the programming language Fortran's numerical formatting capabilities. These capabilities and the format itself are beyond the scope of today's article. Suffice it to say that many rules apply and that FortranInt and FortranFloat encapsulate them all.

FortranInt is used in the parameters line's "internal registry number" field, where I equals six.

pub struct FortranInt<const I: usize> {
    kind: Kind<I>,
}

enum Kind<const I: usize> {
    Positive(FixedNatural<I>),
    Negative(FixedNatural<I>),
    Zero,
}

Positive and negative variants of FortranInt contain a FixedNatural. This is a fixed-width natural number.

The FortranInt kind attribute is private, meaning that it is impossible to construct a FortranInt through a struct literal. Instead, one of the public constructor methods must be used. The const parameter I is the integer's width in characters. The internal registry number is restricted to no more than six characters, including a possible minus sign. FortranInt's methods ensure that this constraint is met before a value can be used.
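
The private-field pattern can be sketched as follows. NarrowInt, its module, and its value accessor are hypothetical, and the only rule checked here is the character width including a minus sign. The real FortranInt encodes more rules than this.

```rust
mod narrow {
    /// Stand-in for a width-limited integer (illustration only).
    #[derive(Debug, PartialEq)]
    pub struct NarrowInt<const I: usize> {
        value: i64, // private: no struct literal outside this module
    }

    impl<const I: usize> NarrowInt<I> {
        /// Succeeds only if the decimal text, minus sign included,
        /// fits in I characters.
        pub fn from_int(value: i64) -> Option<Self> {
            if value.to_string().len() <= I {
                Some(NarrowInt { value })
            } else {
                None
            }
        }

        /// Read-only access to the validated value.
        pub fn value(&self) -> i64 {
            self.value
        }
    }
}

pub use narrow::NarrowInt;
```

Outside the narrow module, writing NarrowInt { value: 1234567 } does not compile, so every value in circulation has passed the width check.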

FortranFloat is used by the "energy" and "scaling factors" fields of the parameters line. Extending FortranInt, FortranFloat uses the same safeguards to ensure that only valid values can be created.

pub struct FortranFloat<const I: usize, const F: usize> {
    integer_part: FortranInt<I>,
    fractional_part: Vec<Digit>,
}

As before, the two attributes are private, meaning that FortranFloat values can only be created through a validating constructor method.

Rounding out the current collection of numerical types is FixedCount. FixedCount is a fixed-width integer greater than or equal to zero. It is used on the molfile "counts" line (Line 4). Unfortunately, the restrictions on such fields are not covered in any detail by the CTfile documentation. For this reason, the validation rules are quite liberal.

pub enum FixedCount<const I: usize> {
    Positive(FixedNatural<I>),
    Zero,
}
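
In the header example, a blank three-character field reads as None while a right-justified number reads as a count. Here is a minimal sketch of rules that would produce that behavior. The function parse_count, its return type, and the exact rules (leading blanks only, digits after) are my assumptions, not CTcore's documented behavior.

```rust
/// Parses a 3-character counts field (hypothetical rules for illustration).
/// Ok(None) is a blank field, Ok(Some(n)) is a count, and Err(i) is the
/// index of the first unacceptable byte.
fn parse_count(field: [u8; 3]) -> Result<Option<u32>, usize> {
    let mut value: Option<u32> = None;
    for (index, &byte) in field.iter().enumerate() {
        match byte {
            // Blanks are accepted only before the first digit.
            b' ' if value.is_none() => {}
            b'0'..=b'9' => {
                let digit = u32::from(byte - b'0');
                value = Some(value.unwrap_or(0) * 10 + digit);
            }
            _ => return Err(index),
        }
    }
    Ok(value)
}
```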

Reader

CTfiles are read using a Reader instance. Reader wraps a Rust u8 iterator, allowing Reader to be used not only with CTfiles encoded as strings, but also byte sequences obtained directly from storage and network devices.

Reader supports a variety of public methods useful for processing CTfiles. These methods are in turn supported by private low-level methods. Public methods include:

line. Read a line up to, but not including, the newline sequence.
line_with_blacklist. Read a line up to, but not including, the newline sequence, while rejecting blacklisted sequences. This is useful for reading the "name" field for molfiles.
sequence. Read a sequence of characters.
fortran_int. Read a FortranInt.
fortran_float. Read a FortranFloat.
fixed_count. Read a FixedCount.
newline. Read a newline, reset the current column to zero, and increment the current row.
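
To show how wrapping a u8 iterator and tracking position fit together, here is a small sketch. Cursor and its methods are hypothetical stand-ins for Reader, there is no error handling, and a bare line feed is assumed to be the newline sequence.

```rust
use std::iter::Peekable;

/// Minimal position-tracking wrapper around a byte iterator
/// (illustration only; not CTcore's Reader).
struct Cursor<I: Iterator<Item = u8>> {
    bytes: Peekable<I>,
    row: usize,
    column: usize,
}

impl<I: Iterator<Item = u8>> Cursor<I> {
    fn new(bytes: I) -> Self {
        Cursor { bytes: bytes.peekable(), row: 0, column: 0 }
    }

    /// Reads bytes up to, but not including, the next line feed.
    fn line(&mut self) -> Vec<u8> {
        let mut result = Vec::new();
        while let Some(&byte) = self.bytes.peek() {
            if byte == b'\n' {
                break;
            }
            result.push(byte);
            self.bytes.next();
            self.column += 1;
        }
        result
    }

    /// Consumes one line feed, resets the column, and increments the row.
    /// Returns false, consuming nothing, if the next byte is not a line feed.
    fn newline(&mut self) -> bool {
        if self.bytes.peek() == Some(&b'\n') {
            self.bytes.next();
            self.row += 1;
            self.column = 0;
            true
        } else {
            false
        }
    }

    /// Current (row, column), both zero-based.
    fn position(&self) -> (usize, usize) {
        (self.row, self.column)
    }
}
```

Because the wrapper only asks for an Iterator over u8, the same code works with "text".bytes(), a byte slice iterator, or bytes pulled from a buffered file.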

These methods have built-in support for sophisticated error handling. Both the row and column number can be reported. For some errors, an expected list of ASCII characters is given. This level of error reporting is designed to work well in environments requiring strict adherence to specifications. Nevertheless, the level of detail can also aid in pinpointing encoding errors in other contexts.

Consider the following mis-formatted Fortran F6.3 field:

42.12a

Executing Reader#fortran_float<2, 3>() yields the following error:

Error::Character(0, 5, vec![
    Character::D0,
    Character::D1,
    Character::D2,
    // ...
    Character::D9
])

This error tells us that at Row 0, Column 5 a digit character was expected.
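
The same kind of report can be reproduced with a simplified check. The Unexpected struct and the check_float function are hypothetical. The sketch requires exactly I digits, a period, and F digits, and it ignores the signs and blank padding that a real Fortran field allows.

```rust
/// Simplified error report: where the problem is and what was expected.
#[derive(Debug, PartialEq)]
struct Unexpected {
    row: usize,
    column: usize,
    expected: Vec<u8>,
}

/// Checks a single-line field against I digits, '.', then F digits.
/// Hypothetical helper for illustration; the row is always 0 here.
fn check_float<const I: usize, const F: usize>(field: &[u8]) -> Result<(), Unexpected> {
    let digits: Vec<u8> = (b'0'..=b'9').collect();
    for column in 0..I + 1 + F {
        // The period sits right after the integer part; digits elsewhere.
        let expected = if column == I { vec![b'.'] } else { digits.clone() };
        match field.get(column) {
            Some(byte) if expected.contains(byte) => {}
            _ => return Err(Unexpected { row: 0, column, expected }),
        }
    }
    Ok(())
}
```

Calling check_float::<2, 3>(b"42.12a") fails at row 0, column 5 with the ten digit characters as the expected set, mirroring the error above.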

Uses

Two main uses for CTcore are clear: (1) a starting point for libraries focused on reading and writing CTfile member formats (e.g., V2000 molfile, V3000 molfile, and SDfile); and (2) an efficient, low-level utility for those situations requiring light processing of CTfile content.

The first major application for CTcore will probably be reading V3000 molfiles. A previous article introduced Trey, a Rust crate for working with V3000 molfiles. It currently doesn't support reading V3000 molfiles, but given some more work on CTcore, this should be possible.

Conclusion

This article presents the first steps toward a suite of precision tools for reading and writing the CTfile format. CTcore's first applications are likely to be support for reading and writing member CTfile formats such as molfile and SDfile.