Using Rust typestates for BCF writing
If you've ever needed to produce a typed binary format where the header constrains what the body can contain, you've probably written validation code that runs at runtime and hoped your tests cover the edge cases. With Rust, you can turn these checks into compile errors (my favorite!) using typestates and phantom types. These patterns generalize well beyond any single format. I wrote about this in a post 10 years ago, and I still think the pattern is cool enough to deserve another one.
The format I'm working with is VCF (and its binary version BCF). It's the main output of Rastair, the variant caller I'm working on. I previously wrote about seqair, my reimplementation of core htslib functionality in Rust. Writing VCF/BCF records through rust-htslib was what originally motivated that work: the bindings worked but were (out of the box) inefficient[^forked], and I wanted to see if I could make record writing both correct and fast.
[^forked]: In Rastair, I used a fork that replaces some CString usage and it got much faster.
This post walks through the API choices I made, what alternatives exist, and what I learned about designing APIs like this.
The header builder
A VCF file starts with a header that declares every field the records are allowed to use: contigs (chromosomes), filters, INFO fields, FORMAT fields, and sample names. BCF, the binary version, adds a constraint: it encodes field names as integer indices into a "string dictionary," and that dictionary must be emitted in a fixed order[^bcf-dict]. The header builder walks through phases that mirror this dictionary order.
[^bcf-dict]: PASS filter at index 0, then other filters, then INFO definitions, then FORMAT definitions. This is not always obvious from reading the spec but it's what htslib expects.
Each phase is a zero-sized type parameter on VcfHeaderBuilder. Transition methods consume self and return the builder in the next phase. You can skip phases you don't need (go from Contigs straight to Infos if you have no filters), but you can never go backwards.
During each phase, only the matching register_* method is in scope. Each registration inserts the field name into the string dictionary at the next index and returns a typed key:
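A minimal sketch of the phases and a typed registration could look like the following. It is simplified and uses assumed names: the String-based dictionary, the filters()/infos() transition methods, and register_info taking a bare name are illustrative, not seqair's actual API.

```rust
use std::marker::PhantomData;

// Phase markers: zero-sized, used only as type parameters.
pub struct Contigs;
pub struct Filters;
pub struct Infos;

// Value-shape marker carried by the key (plain struct here for brevity).
pub struct Scalar<T>(PhantomData<T>);

/// Typed handle to a registered INFO field (hypothetical layout).
pub struct InfoKey<V> {
    pub dict_idx: u32,
    pub name: String,
    _value: PhantomData<V>,
}

pub struct VcfHeaderBuilder<Phase> {
    contigs: Vec<String>,
    dict: Vec<String>, // string dictionary; "PASS" sits at index 0
    _phase: PhantomData<Phase>,
}

impl<Phase> VcfHeaderBuilder<Phase> {
    // Private helpers shared by all phases.
    fn intern(&mut self, name: &str) -> u32 {
        self.dict.push(name.to_string());
        (self.dict.len() - 1) as u32
    }
    fn advance<Next>(self) -> VcfHeaderBuilder<Next> {
        VcfHeaderBuilder { contigs: self.contigs, dict: self.dict, _phase: PhantomData }
    }
}

impl VcfHeaderBuilder<Contigs> {
    pub fn new() -> Self {
        VcfHeaderBuilder {
            contigs: Vec::new(),
            dict: vec!["PASS".to_string()],
            _phase: PhantomData,
        }
    }
    pub fn register_contig(&mut self, name: &str) {
        self.contigs.push(name.to_string());
    }
    // Forward-only transitions; skipping Filters is allowed.
    pub fn filters(self) -> VcfHeaderBuilder<Filters> { self.advance() }
    pub fn infos(self) -> VcfHeaderBuilder<Infos> { self.advance() }
}

impl VcfHeaderBuilder<Filters> {
    pub fn register_filter(&mut self, name: &str) -> u32 { self.intern(name) }
    pub fn infos(self) -> VcfHeaderBuilder<Infos> { self.advance() }
}

impl VcfHeaderBuilder<Infos> {
    // The caller picks V; the returned key remembers it at the type level.
    pub fn register_info<V>(&mut self, name: &str) -> InfoKey<V> {
        let dict_idx = self.intern(name);
        InfoKey { dict_idx, name: name.to_string(), _value: PhantomData }
    }
}
```

Because register_filter only exists on VcfHeaderBuilder<Filters>, calling it after infos() is a compile error rather than a runtime check.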
The type parameter (Scalar, Arr, Gt) flows from the field definition to the key, and it's what makes the rest of the system type-safe. These are the format's actual invariants expressed in the type system. The key also carries the BCF dictionary index (a u32) and the VCF field name[^smolstr], both resolved once and reused for every record.
[^smolstr]: I like using SmolStr for immutable small strings like these. It inlines short strings (avoiding a heap allocation) and clones cheaply via reference counting in case we get longer ones.
Typed keys and phantom types
The type parameter on InfoKey and FormatKey is an uninhabited marker type:
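A minimal sketch of such markers, using the Scalar, Arr, and Gt names from above. The variant name _Never and the key's fields are hypothetical:

```rust
use std::convert::Infallible;
use std::marker::PhantomData;

// Uninhabited markers: the only variant holds an `Infallible`,
// so no value of these types can ever be constructed.
pub enum Scalar<T> {
    _Never(Infallible, PhantomData<T>),
}

pub enum Arr<T> {
    _Never(Infallible, PhantomData<T>),
}

// No type parameter to mention, so a variant-less enum is enough.
pub enum Gt {}

/// A key tagged with one of the markers above (hypothetical layout).
pub struct InfoKey<V> {
    dict_idx: u32,
    _value: PhantomData<V>,
}

impl<V> InfoKey<V> {
    pub fn new(dict_idx: u32) -> Self {
        InfoKey { dict_idx, _value: PhantomData }
    }
    pub fn dict_idx(&self) -> u32 {
        self.dict_idx
    }
}
```

PhantomData<V> is zero-sized regardless of V, so the marker adds nothing to the key's runtime footprint.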
These types are "uninhabited": no value of type Scalar can ever exist at runtime. The single variant contains Infallible, which itself has no values, so the variant can never be constructed. The types exist only to carry type information that the compiler uses to select the right encode method.
Why uninhabited enums rather than plain structs? A simpler option would be pub struct Scalar<T>(PhantomData<T>), which is inhabited but zero-sized. Nobody would construct one accidentally, and if they did, nothing bad would happen. The reason for the enum-with-Infallible dance is that a variant-less enum with a type parameter triggers E0392 ("type parameter T is never used"), so you need at least one variant that mentions T, and putting Infallible in it makes the type genuinely unconstructible[^inhabited]. In practice the difference is cosmetic. The struct version would work just as well. I went with the uninhabited version because it more precisely states the intent: these types are not values, they are labels.
[^inhabited]: You could also use a trait-based approach, where ScalarInt is a unit struct implementing trait InfoValueType { type Value; }. That avoids the weird-looking enum entirely but adds more boilerplate for each new marker.
Anyway. Here is how the markers select the right method:
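One way this selection could work is sketched below, with inherent impls on concrete key instantiations. The InfoEncoder method names (info_int, info_floats) and the TextEncoder are assumptions for illustration, and the markers are plain structs to keep the sketch short:

```rust
use std::marker::PhantomData;

pub struct Scalar<T>(PhantomData<T>);
pub struct Arr<T>(PhantomData<T>);

// Broad, object-safe interface; method names are hypothetical.
pub trait InfoEncoder {
    fn info_int(&mut self, dict_idx: u32, value: i32);
    fn info_floats(&mut self, dict_idx: u32, values: &[f32]);
}

pub struct InfoKey<V> {
    dict_idx: u32,
    _value: PhantomData<V>,
}

impl<V> InfoKey<V> {
    pub fn new(dict_idx: u32) -> Self {
        InfoKey { dict_idx, _value: PhantomData }
    }
}

// Each concrete marker gets exactly one `encode` with one value type.
impl InfoKey<Scalar<i32>> {
    pub fn encode(&self, enc: &mut dyn InfoEncoder, value: i32) {
        enc.info_int(self.dict_idx, value);
    }
}

impl InfoKey<Arr<f32>> {
    pub fn encode(&self, enc: &mut dyn InfoEncoder, values: &[f32]) {
        enc.info_floats(self.dict_idx, values);
    }
}

// Toy encoder that records calls as text.
#[derive(Default)]
pub struct TextEncoder {
    pub out: String,
}

impl InfoEncoder for TextEncoder {
    fn info_int(&mut self, dict_idx: u32, value: i32) {
        self.out.push_str(&format!("{}={};", dict_idx, value));
    }
    fn info_floats(&mut self, dict_idx: u32, values: &[f32]) {
        let parts: Vec<String> = values.iter().map(|v| v.to_string()).collect();
        self.out.push_str(&format!("{}={};", dict_idx, parts.join(",")));
    }
}
```

With these impls, dp.encode(&mut enc, 42) compiles for an InfoKey<Scalar<i32>>, while passing a float slice to the same key is rejected because no such method exists for that instantiation.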
There is no way to hand a &[f32] to a Scalar key, or use a FormatKey in an INFO context. This is great! (These are inherent methods, not a trait; I'll explain why below.)
The InfoEncoder and FormatEncoder traits underneath are object-safe and format-agnostic. A single RecordEncoder type contains an internal enum that dispatches to either a BCF or VCF text arm. The typed key narrows the broad trait interface down to exactly one method with exactly the right value type, and the format switch happens behind that. The traits being object-safe matters separately: the EncodeInfo trait (shown later) takes &mut dyn InfoEncoder, so domain types can encode themselves without being generic over the encoder.
The record encoder typestate
Writing a record walks another state machine. As a diagram, it looks like this:
The states are zero-sized marker structs, and the encoder is generic over them:
Each transition consumes self and returns the encoder in the next state:
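A minimal sketch of such a state machine follows. The state names (Begun, Filtered, WithSamples), the method names other than emit(), and the text output are all hypothetical, chosen only to show the shape:

```rust
use std::marker::PhantomData;

// Hypothetical states; zero-sized and used only as type parameters.
pub struct Begun;
pub struct Filtered;
pub struct WithSamples;

#[must_use = "a record is incomplete until emit() is called"]
pub struct RecordEncoder<'a, State> {
    out: &'a mut String,
    _state: PhantomData<State>,
}

impl<'a, State> RecordEncoder<'a, State> {
    // Private: append text and move to the next state. Callers can
    // only advance through the public transitions below.
    fn advance<Next>(self, text: &str) -> RecordEncoder<'a, Next> {
        let out = self.out;
        out.push_str(text);
        RecordEncoder { out, _state: PhantomData }
    }
    fn finish(self) {
        let out = self.out;
        out.push('\n');
    }
}

impl<'a> RecordEncoder<'a, Begun> {
    pub fn begin(out: &'a mut String, chrom: &str, pos: u64) -> Self {
        out.push_str(&format!("{}\t{}", chrom, pos));
        RecordEncoder { out, _state: PhantomData }
    }
    // Exactly one filter decision: this consumes the Begun encoder.
    pub fn filter_pass(self) -> RecordEncoder<'a, Filtered> {
        self.advance("\tPASS")
    }
}

impl<'a> RecordEncoder<'a, Filtered> {
    pub fn begin_samples(self, n_samples: usize) -> RecordEncoder<'a, WithSamples> {
        self.advance(&format!("\tn_samples={}", n_samples))
    }
    // Records without samples can be emitted right after filtering.
    pub fn emit(self) {
        self.finish();
    }
}

impl<'a> RecordEncoder<'a, WithSamples> {
    pub fn emit(self) {
        self.finish();
    }
}
```

Since filter_pass takes self by value, calling it twice or calling begin_samples on a Begun encoder does not type-check.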
You cannot write two filter decisions, skip the filter call entirely, or emit FORMAT fields before declaring a sample count.
The types are #[must_use], so the compiler warns you if you build a record and forget to call emit(). None of this catches actual logic bugs in practice. It's a safety net that makes the API hard to misuse, especially when someone else (or an LLM[^llm-help]) is writing the calling code.
[^llm-help]: As I wrote about in the previous post, a lot of seqair was written with Claude Code. Having the compiler enforce the protocol meant I didn't have to review every call site for ordering mistakes.
Underneath, the BCF arm writes typed values into shared and individual buffers, picks the smallest integer width that fits each value, and patches the record length prefix on emit(). The VCF arm writes tab-separated text with the percent-encoding and float-formatting rules the spec requires. Both arms reuse their internal buffers across records, so after the first record the encoder does zero allocations[^erased].
[^erased]: The writer is generic over W: Write, but the RecordEncoder itself uses &mut dyn Write internally so that the encoder type stays RecordEncoder<'a, State> with no extra type parameters leaking out.
Why key.encode() and not encoder.encode()
The calling code for writing a record looks like this:
The field name leads each line, a deliberate choice. The more conventional alternative would put the method on the encoder:
You could even make this generic with a trait:
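A sketch of that encoder-side, trait-based alternative could look like this. The InfoValue trait, TextEncoder, and write_fields are hypothetical names used only for comparison:

```rust
use std::marker::PhantomData;

pub struct Scalar<T>(PhantomData<T>);
pub struct Arr<T>(PhantomData<T>);

pub struct InfoKey<V> {
    name: &'static str,
    _value: PhantomData<V>,
}

impl<V> InfoKey<V> {
    pub fn new(name: &'static str) -> Self {
        InfoKey { name, _value: PhantomData }
    }
}

// Hypothetical trait mapping each marker to the value type it accepts.
pub trait InfoValue {
    type Value: ?Sized;
    fn render(value: &Self::Value) -> String;
}

impl InfoValue for Scalar<i32> {
    type Value = i32;
    fn render(value: &i32) -> String {
        value.to_string()
    }
}

impl InfoValue for Arr<f32> {
    type Value = [f32];
    fn render(values: &[f32]) -> String {
        let parts: Vec<String> = values.iter().map(|v| v.to_string()).collect();
        parts.join(",")
    }
}

pub struct TextEncoder {
    pub out: String,
}

impl TextEncoder {
    // One generic method; the key's marker decides the value type.
    pub fn encode<V: InfoValue>(&mut self, key: &InfoKey<V>, value: &V::Value) {
        self.out.push_str(key.name);
        self.out.push('=');
        self.out.push_str(&V::render(value));
        self.out.push(';');
    }
}

// Call sites in this style lead with the encoder, not the field.
pub fn write_fields(
    enc: &mut TextEncoder,
    dp: &InfoKey<Scalar<i32>>,
    af: &InfoKey<Arr<f32>>,
) {
    let freqs: [f32; 2] = [0.5, 0.25];
    enc.encode(dp, &42_i32);
    enc.encode(af, &freqs[..]);
}
```

The type safety is the same as with inherent methods on the key; what changes is which identifier leads each call.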
That's a nice design and arguably more idiomatic. But I preferred key.encode() for one main reason: when scanning a block of field-encoding calls, the field names are the varying part. In the key.encode(...) style the field name starts the line and the block is easy to scan. To me, enc.encode(field, ...) reads as noisier and harder to follow.
Domain types encoding themselves
The typed keys are nice for ad-hoc encoding, but in a real application like Rastair the values usually come from domain types that know how to serialize themselves. The EncodeInfo trait captures this:
A simple wrapper type might look like this:
The field definition is const, so it can live right next to the type. At header construction you write header.register_info(&Depth::DEF)? and the schema stays co-located with the code that produces the values.
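A minimal sketch of how the trait, a wrapper type, and its const definition might fit together. Only EncodeInfo, InfoEncoder, Depth::DEF, and the associated Key type come from the text; InfoFieldDef, the free register_info function, info_int, and Recorder are assumptions:

```rust
use std::marker::PhantomData;

pub struct Scalar<T>(PhantomData<T>);

/// Const-constructible field definition (hypothetical shape).
pub struct InfoFieldDef<V> {
    pub name: &'static str,
    pub description: &'static str,
    _value: PhantomData<V>,
}

impl<V> InfoFieldDef<V> {
    pub const fn new(name: &'static str, description: &'static str) -> Self {
        InfoFieldDef { name, description, _value: PhantomData }
    }
}

pub struct InfoKey<V> {
    pub dict_idx: u32,
    _value: PhantomData<V>,
}

// Stand-in for the header builder's registration method.
pub fn register_info<V>(dict: &mut Vec<&'static str>, def: &InfoFieldDef<V>) -> InfoKey<V> {
    dict.push(def.name);
    InfoKey { dict_idx: (dict.len() - 1) as u32, _value: PhantomData }
}

pub trait InfoEncoder {
    fn info_int(&mut self, dict_idx: u32, value: i32);
}

/// Domain types implement this to serialize themselves.
pub trait EncodeInfo {
    type Key;
    fn encode_info(&self, key: &Self::Key, enc: &mut dyn InfoEncoder);
}

pub struct Depth(pub i32);

impl Depth {
    // The schema lives next to the type that produces the values.
    pub const DEF: InfoFieldDef<Scalar<i32>> = InfoFieldDef::new("DP", "Total read depth");
}

impl EncodeInfo for Depth {
    type Key = InfoKey<Scalar<i32>>;
    fn encode_info(&self, key: &Self::Key, enc: &mut dyn InfoEncoder) {
        enc.info_int(key.dict_idx, self.0);
    }
}

// Toy encoder that records (dictionary index, value) pairs.
#[derive(Default)]
pub struct Recorder {
    pub calls: Vec<(u32, i32)>,
}

impl InfoEncoder for Recorder {
    fn info_int(&mut self, dict_idx: u32, value: i32) {
        self.calls.push((dict_idx, value));
    }
}
```

Because encode_info takes &mut dyn InfoEncoder, Depth does not need to be generic over the output format.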
The associated Key type also supports tuples, which handles the case where one domain type maps to multiple VCF fields:
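As a sketch of the tuple case, a hypothetical StrandDepth type could write two INFO fields from one value. The type, its field names, and the supporting definitions are assumptions for illustration:

```rust
use std::marker::PhantomData;

pub struct Scalar<T>(PhantomData<T>);

pub struct InfoKey<V> {
    pub dict_idx: u32,
    _value: PhantomData<V>,
}

impl<V> InfoKey<V> {
    pub fn new(dict_idx: u32) -> Self {
        InfoKey { dict_idx, _value: PhantomData }
    }
}

pub trait InfoEncoder {
    fn info_int(&mut self, dict_idx: u32, value: i32);
}

pub trait EncodeInfo {
    type Key;
    fn encode_info(&self, key: &Self::Key, enc: &mut dyn InfoEncoder);
}

/// Hypothetical domain type that maps to two INFO fields.
pub struct StrandDepth {
    pub forward: i32,
    pub reverse: i32,
}

impl EncodeInfo for StrandDepth {
    // One key per VCF field, bundled as a tuple.
    type Key = (InfoKey<Scalar<i32>>, InfoKey<Scalar<i32>>);

    fn encode_info(&self, key: &Self::Key, enc: &mut dyn InfoEncoder) {
        let (fwd_key, rev_key) = key;
        enc.info_int(fwd_key.dict_idx, self.forward);
        enc.info_int(rev_key.dict_idx, self.reverse);
    }
}

// Toy encoder that records (dictionary index, value) pairs.
#[derive(Default)]
pub struct Recorder {
    pub calls: Vec<(u32, i32)>,
}

impl InfoEncoder for Recorder {
    fn info_int(&mut self, dict_idx: u32, value: i32) {
        self.calls.push((dict_idx, value));
    }
}
```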
The key is passed in rather than owned by the type. This keeps things flexible (the same type could encode under different field names in different contexts) at the cost of threading keys through call sites. In practice, Rastair collects all keys into a setup struct that gets passed around, so this hasn't been a burden.
Where the type system ends
The phantom types and typestates catch a lot of mistakes at compile time. But they cannot catch everything! Here's an example of something not caught.
BCF encodes integer arrays using the smallest type that fits the values: INT8 if everything is in [-120, 127][^bcf-reserved], INT16 for larger ranges, INT32 otherwise. Each width has its own sentinel value for "missing": 0x80 for INT8, 0x8000 for INT16, 0x80000000 for INT32.
[^bcf-reserved]: Why -120 and not -128? The BCF 2.2 spec reserves the 8 most-negative values of each integer type for sentinels. Two are currently defined ("missing" and "end of vector") and six are reserved for future use. So for INT8, -128 through -121 are off-limits, leaving [-120, 127] as the usable range. (A great use-case for niches!)
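A minimal sketch of width selection with a width-specific sentinel, assuming missing entries are marked with i32::MIN in the input. The function names are hypothetical, and the BCF type-descriptor byte that precedes the values is left out:

```rust
/// Placeholder for "missing" in the i32 input (an assumption of this sketch).
pub const MISSING: i32 = i32::MIN;

// Variant order matters: the derived Ord makes Int8 < Int16 < Int32.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum IntWidth {
    Int8,
    Int16,
    Int32,
}

// Smallest width whose usable range (excluding the 8 reserved
// most-negative values) contains `value`.
fn width_for(value: i32) -> IntWidth {
    if (-120..=127).contains(&value) {
        IntWidth::Int8
    } else if (-32_760..=32_767).contains(&value) {
        IntWidth::Int16
    } else {
        IntWidth::Int32
    }
}

/// Pick the width from real values only; placeholders are skipped.
pub fn select_width(values: &[i32]) -> IntWidth {
    values
        .iter()
        .filter(|&&v| v != MISSING)
        .map(|&v| width_for(v))
        .max()
        .unwrap_or(IntWidth::Int8)
}

/// Encode little-endian, translating MISSING to the chosen width's sentinel.
pub fn encode_ints(values: &[i32]) -> (IntWidth, Vec<u8>) {
    let width = select_width(values);
    let mut out = Vec::new();
    for &v in values {
        match width {
            IntWidth::Int8 => {
                let x = if v == MISSING { i8::MIN } else { v as i8 };
                out.extend_from_slice(&x.to_le_bytes());
            }
            IntWidth::Int16 => {
                let x = if v == MISSING { i16::MIN } else { v as i16 };
                out.extend_from_slice(&x.to_le_bytes());
            }
            // MISSING is already the INT32 sentinel (0x80000000).
            IntWidth::Int32 => out.extend_from_slice(&v.to_le_bytes()),
        }
    }
    (width, out)
}
```

Both steps have to agree on the width: the scan ignores placeholders, and the write maps each placeholder to the sentinel of the width that was chosen.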
An early version of seqair had a bug where the type selection scanned all values including placeholders, and the missing-value sentinel was always set as i32::MIN. An array like [1, 2, MISSING] would correctly pick INT8 as the encoding (1 and 2 fit), but then always emit i32::MIN as the sentinel, which doesn't fit in a byte. The type system can't (currently) catch this. Both the declared type and the value type are i32. The bug is purely se