Raw Record Source

{
  "path": "/seqair-bcf.html",
  "site": "at://did:plc:x67qh7v3fd7znbdhauc45ng3/site.standard.publication/3mjcd2t6afe25",
  "$type": "site.standard.document",
  "title": "Using Rust typestates for BCF writing",
  "updatedAt": "2026-05-08T00:00:00.000Z",
  "bskyPostRef": {
    "cid": "bafyreiardrpukppphuf74zkvhl24qol37nfx6p33uqac2xg7xy6klfzzse",
    "uri": "at://did:plc:x67qh7v3fd7znbdhauc45ng3/app.bsky.feed.post/3mldk3q6ncb2g"
  },
  "publishedAt": "2026-05-08T00:00:00.000Z",
  "textContent": "If you've ever needed to produce a typed binary format\nwhere the header constrains what the body can contain,\nyou've probably written validation code that runs at runtime\nand hoped your tests cover the edge cases.\nWith Rust, you can turn this into compile errors (my favorite!)\nusing typestates and phantom types.\nThese patterns generalize well beyond any single format.\nI've previously [mentioned this][elegant-apis-in-rust] 10 years ago in a post,\nand I still think it's such a cool pattern\nthat it deserves another post.\n\n[elegant-apis-in-rust]: https://deterministic.space/elegant-apis-in-rust.html#session-types\n\nThe format I'm working with is VCF (and its binary version BCF).\nIt's the main output of [Rastair], the variant caller I'm working on.\nI [previously wrote about seqair][seqair-post],\nmy reimplementation of core htslib functionality in Rust.\nWriting VCF/BCF records through [rust-htslib] was\nwhat originally motivated that work:\nThe bindings worked but were (out of the box) inefficient[^forked],\nand I wanted to see if I could make it both correct and fast.\n\n[^forked]: In Rastair, I used a fork that replaces some CString usage and it got much faster.\n\n[Rastair]: https://deterministic.space/rastair.html \"Notes on Rastair, a variant and methylation caller\"\n[rust-htslib]: https://github.com/rust-bio/rust-htslib \"HTSlib bindings and a high level Rust API for reading and writing BAM files.\"\n[seqair-post]: https://deterministic.space/seqair.html \"Seqair, a custom htslib reimplementation\"\n\nThis post walks through the API choices I made,\nwhat alternatives exist,\nand what I learned about designing APIs like this.\n\nThe header builder\n\nA VCF file starts with a header\nthat declares every field the records are allowed to use:\ncontigs (chromosomes), filters, INFO fields, FORMAT fields, and sample names.\nBCF, the binary version, adds a constraint:\nit encodes field names as integer indices into a \"string 
dictionary,\"\nand that dictionary must be emitted in a fixed order[^bcf-dict].\nThe header builder walks through phases\nthat mirror this dictionary order.\n\n[^bcf-dict]:\n    PASS filter at index 0, then other filters, then INFO definitions, then FORMAT definitions.\n    This is not always obvious from reading the spec but it's what htslib expects.\n\nEach phase is a zero-sized type parameter on VcfHeaderBuilder<Phase>.\nTransition methods consume self and return the builder in the next phase.\nYou can skip phases you don't need\n(go from Contigs straight to Infos if you have no filters),\nbut you can never go backwards.\n\nDuring each phase, only the matching register_ method is in scope.\nEach registration inserts the field name\ninto the string dictionary at the next index\nand returns a _typed key_:\n\nThe type parameter (Scalar<i32>, Arr<f32>, Gt)\nflows from the field definition to the key,\nand it's what makes the rest of the system type-safe.\nThese are the format's actual invariants expressed in the type system.\nBut the key also carries the BCF dictionary index (a u32)\nand the VCF field name[^smolstr],\nresolved once and reused for every record.\n\n[^smolstr]:\n    I like using [SmolStr] for immutable small strings like these.\n    It inlines short strings (avoiding a heap allocation)\n    and clones cheaply via reference counting in case we get longer ones.\n\n[SmolStr]: https://docs.rs/smol_str/0.3.6/smol_str/\n\nTyped keys and phantom types\n\nThe type parameter on InfoKey<V> and FormatKey<V>\nis an uninhabited marker type:\n\nThese types are \"uninhabited\":\nno value of type Scalar<i32> can ever exist at runtime.\nThe single variant contains [Infallible],\nwhich itself has no values,\nso the variant can never be constructed.\nThe types exist only to carry type information\nthat the compiler uses to select the right encode method.\n\nWhy uninhabited enums rather than plain structs?\nA simpler option would be pub struct 
Scalar<T>(PhantomData<T>),\nwhich is inhabited but zero-sized.\nNobody would construct one accidentally,\nand if they did, nothing bad would happen.\nThe reason for the enum-with-Infallible dance\nis that a variant-less enum with a type parameter\ntriggers E0392 (\"type parameter T is never used\"),\nso you need at least one variant that mentions T,\nand putting Infallible in it\nmakes the type genuinely unconstructible[^inhabited].\nIn practice the difference is cosmetic.\nThe struct version would work just as well.\nI went with the uninhabited version\nbecause it more precisely states the intent:\nthese types are not values, they are labels.\n\n[^inhabited]:\n    You could also use a trait-based approach,\n    where ScalarInt is a unit struct implementing trait InfoValueType { type Value; }.\n    That avoids the weird-looking enum entirely\n    but adds more boilerplate for each new marker.\n\nAnyway.\nHere is how the markers select the right method:\n\nThere is no way to hand a &[f32] to a Scalar<i32> key,\nor use a FormatKey<Gt> in an INFO context.\nThis is great!\n(These are inherent methods, not a trait;\nI'll explain why below.)\n\nThe InfoEncoder and FormatEncoder traits underneath\nare object-safe and format-agnostic.\nA single RecordEncoder type contains an internal enum\nthat dispatches to either a BCF or VCF text arm.\nThe typed key narrows the broad trait interface down to exactly one method\nwith exactly the right value type,\nand the format switch happens behind that.\nThe traits being object-safe matters separately:\nthe EncodeInfo trait (shown later) takes &mut dyn InfoEncoder,\nso domain types can encode themselves\nwithout being generic over the encoder.\n\nThe record encoder typestate\n\nWriting a record walks another state machine.\nAs a diagram, it looks like this:\n\nThe states are zero-sized marker structs,\nand the encoder is generic over them:\n\nEach transition consumes self and returns the encoder in the next state:\n\nYou cannot write 
two filter decisions,\nskip the filter call entirely,\nor emit FORMAT fields before declaring a sample count.\n\nThe types are #[must_use],\nso the compiler warns you\nif you build a record and forget to call emit().\nNone of this catches actual logic bugs in practice.\nIt's a safety net that makes the API hard to misuse,\nespecially when someone else (or an LLM[^llm-help]) is writing the calling code.\n\n[^llm-help]:\n    As I wrote in the [previous post][seqair-post],\n    a lot of seqair was written with Claude Code.\n    Having the compiler enforce the protocol\n    meant I didn't have to review every call site for ordering mistakes.\n\nUnderneath, the BCF arm writes typed values into shared and individual buffers,\npicks the smallest integer width that fits each value,\nand patches the record length prefix on emit().\nThe VCF arm writes tab-separated text with the percent-encoding\nand float-formatting rules the spec requires.\nBoth arms reuse their internal buffers across records,\nso after the first record the encoder does zero allocations[^erased].\n\n[^erased]:\n    The writer is generic over W: Write,\n    but the RecordEncoder itself uses &mut dyn Write internally\n    so that the encoder type stays RecordEncoder<'a, State>\n    with no extra type parameters leaking out.\n\nWhy key.encode() and not encoder.encode()\n\nThe calling code for writing a record looks like this:\n\nThe field name leads each line, a deliberate choice.\nThe more conventional alternative\nwould put the method on the encoder:\n\nYou could even make this generic with a trait:\n\nThat's a nice design and arguably more idiomatic.\nBut I preferred key.encode() mainly for one reason:\nWhen scanning a block of field-encoding calls,\nthe field names are the varying part.\nIn the key.encode(...) style the field name starts the line\nand the block is easy to scan.\nReading enc.encode(field, ...) 
is noisier and harder for me to read.\n\nDomain types encoding themselves\n\nThe typed keys are nice for ad-hoc encoding,\nbut in a real application like [Rastair]\nthe values usually come from domain types\nthat know how to serialize themselves.\nThe EncodeInfo trait captures this:\n\nA simple wrapper type might look like this:\n\nThe field definition is const,\nso it can live right next to the type.\nAt header construction you write header.register_info(&Depth::DEF)?\nand the schema stays co-located with the code that produces the values.\n\nThe associated Key type also supports tuples,\nwhich handles the case\nwhere one domain type maps to multiple VCF fields:\n\nThe key is passed in rather than owned by the type.\nThis keeps things flexible\n(the same type could encode under different field names in different contexts)\nat the cost of threading keys through call sites.\nIn practice, Rastair collects all keys into a setup struct\nthat gets passed around,\nso this hasn't been a burden.\n\nWhere the type system ends\n\nThe phantom types and typestates catch a lot of mistakes at compile time.\nBut they cannot catch everything!\nHere's an example of something they don't catch.\n\nBCF encodes integer arrays\nusing the smallest type that fits the values:\nINT8 if everything is in [-120, 127][^bcf-reserved],\nINT16 for larger ranges, INT32 otherwise.\nEach width has its own sentinel value for \"missing\":\n0x80 for INT8, 0x8000 for INT16, 0x80000000 for INT32.\n\n[^bcf-reserved]:\n    Why -120 and not -128?\n    The BCF 2.2 spec reserves the 8 most-negative values of each integer type\n    for sentinels.\n    Two are currently defined (\"missing\" and \"end of vector\")\n    and six are reserved for future use.\n    So for INT8, -128 through -121 are off-limits,\n    leaving [-120, 127] as the usable range.\n    (A great use-case for [niches]!)\n\n[niches]: https://deterministic.space/niche-int-types-in-rust.html \"Niches for integer types in Rust\"\n\nAn early version of seqair 
had a bug\nwhere the type selection scanned all values including placeholders,\nand the missing-value sentinel was always set to i32::MIN.\nAn array like [1, 2, MISSING] would\ncorrectly pick INT8 as the encoding (1 and 2 fit),\nbut then always emit i32::MIN as the sentinel,\nwhich doesn't fit in a byte.\nThe type system can't (currently) catch this.\nBoth the declared type and the value type are i32.\nThe bug is purely se",
  "canonicalUrl": "https://deterministic.space/seqair-bcf.html"
}