{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreibmoiyrxjkeop3armrn4imrrb5miychoxv5lmt7rlv7wioqdklcwi",
"uri": "at://did:plc:x67qh7v3fd7znbdhauc45ng3/app.bsky.feed.post/3mkq5uetcp22i"
},
"canonicalUrl": "https://deterministic.space/seqair.html",
"path": "/seqair.html",
"publishedAt": "2026-04-30T00:00:00.000Z",
"site": "at://did:plc:x67qh7v3fd7znbdhauc45ng3/site.standard.publication/3mjcd2t6afe25",
"textContent": "This post introduces [seqair],\na new Rust library for bioinformatics formats I've been working on.\nWe'll look in detail at one of my main design choices here,\na columnar store for iterating BAM data,\nand talk about how I tried to create nice Rust APIs.\n\n[Rastair], the project I've been working on,\nreads and writes a lot of file formats.\nSAM/BAM/CRAM input, VCF/BCF output, reference FASTA,\nall through [rust-htslib], a wrapper around [htslib],\nthe C library behind the [samtools] suite.\nAfter a year of using (and tweaking) it,\nI wanted to see what a Rust-native approach could look like.\nSeqair is the result of that experiment.\n\nThe current state as of 2026-04-30:\nLots of features working!\nAPI refinement is in progress,\nthen I'll put it on _crates.io_.\n\n_Update 2026-05-08:_ I have now released Seqair 0.1!\nFind it on crates.io and docs.rs.\n{class=\"btw\"}\n\n[Rastair]: https://www.rastair.com/ \"Rastair website\"\n[samtools]: https://www.htslib.org/ \"Samtools\"\n[htslib]: https://github.com/samtools/htslib \"samtools/htslib: C library for high-throughput sequencing data formats\"\n[rust-htslib]: https://github.com/rust-bio/rust-htslib \"HTSlib bindings and a high level Rust API for reading and writing BAM files.\"\n[seqair]: https://github.com/Softleif/seqair \"seqair repository\"\n\nAn experiment\n\nThe week before Easter we released Rastair 2.1 with GPU support.\nThe next big task is local read realignment[^realignment]\nwhich requires changing the order and annotations of BAM records.\nI'm very intrigued by the challenge\nand also by trying to understand the formats and their uses on a deeper level.\nTrying to optimize BCF writing in rust-htslib was how I got started[^hts-fork],\nand I didn't want to stop there.\nFor realignment,\nI could either fiddle with the raw pointers to _htslib_ structs,\nor write intermediate BAM records only to then parse them again.\nNeither option is very satisfying.\n\nFor what Rastair needs, there are 
several possible implementations\nand _htslib_ and [noodles] each present only one of them.\nI had a vague idea of how _I_ would want to do it,\ngiven a clean slate.\n\n[noodles]: https://docs.rs/noodles/0.109.0/noodles/ \"Bioinformatics I/O libraries in Rust\"\n\n[^realignment]:\n With our knowledge of how TAPS methylation looks,\n we think we can fix some alignment issues in regions with a lot of insertions or deletions.\n\n[^hts-fork]:\n I forked rust-htslib and hts-sys to update both the bindings and add some features.\n My PRs aren't merged yet and there is more work I did that I could upstream...\n My time is limited.\n\nSo, I decided to do an experiment.\nCan I write my ideas as specifications using [tracey] (I've heard good things),\ncombine this with the extensive specifications for the formats\nand use Claude Code to get a prototype for this going?[^llm]\nThe goal is to find out whether my ideas are feasible,\nhave some fun trying out new tools,\nand see how they deal with some of my coding style choices.\n\n[tracey]: https://tracey.bearcove.eu/ \"Spec coverage for code\"\n\n[^llm]:\n I'll probably write about the LLM aspect of this separately,\n but I want to keep this post about the architecture and API.\n\nThis prototype is done when I have\na library that can replace rust-htslib for the pileup[^pileup] generation in Rastair\nand show an example of how realignment would be handled.\n\n[^pileup]:\n We have many reads that are short overlapping snippets of DNA.\n You can imagine this as a 2D grid with the position on the x axis\n and the reads as a stack going up (or down).\n A \"pileup\" is the column view, where each column shows all the bases for a position.\n In practice, there is also a lot of metadata.\n\nA Rustic API\n\nOne of my goals was to make the developer experience very Rust-like,\nbecause that is what I enjoy (and what I'm good at).\nThere are many aspects to this and I want to name only a few of them here.\n\nI wanted to have 
strong types for everything\nthat enforce logic constraints,\nand distinguish types with parameters.\nFor example:\nSome formats have zero-based positions\nand others (like VCF) count from 1.\nNaturally, I added a type Pos<T>\nthat exists as both Pos<Zero> and Pos<One>\n(aliased to Pos0 and Pos1 for convenience).\n\nAnother example is that there is one Reader entrypoint,\nwhich automatically detects the file format(s)\nand provides detailed errors.\nThis is not something I've seen _noodles_ provide,\nand the errors from _htslib_ were sadly not very helpful.\nAlso, _htslib_ has no way of plugging its logs into something other than stderr.\nSeqair uses [tracing] for logs and instrumentation,\nwhich is Rust-native and can be enabled or disabled by the end user.\n\n[tracing]: https://docs.rs/tracing/0.1.44/tracing/ \"A scoped, structured logging and diagnostics system.\"\n\nWhen writing VCF/BCF files[^also],\nwe first need to write a header,\nwhich defines all the fields.\nIn the binary version, we will refer to them by their ID.\nThis header needs to be written in a specific order\n(contigs, then filters, then info fields, then format fields, then samples).\nIn the same way,\nrecords (lines) need to be written in a specific order,\nso that we can stream them directly to an output buffer.\nAll of this is enforced using the type-state pattern[^type-state],\nwhere methods like VcfHeaderBuilder::formats or RecordEncoder::begin_samples\ntransition from one state to another,\nwhich then includes different methods to add different data.\nMore on this in a future post!\n\n[^also]: Yes, I also added this, after [complaining][rastair-post] about how inefficient the Rust wrapper was in my Rastair post.\n\n[rastair-post]: https://deterministic.space/rastair.html \"My post introducing Rastair\"\n\nSeqair leans harder into type-state builders than either _noodles_ or _rust-htslib_.\nI'm not sure this is necessarily better for the average user\nwho might not be enjoying these Rust 
features as much as me.\n\n[^type-state]:\n This is one of my favorite features in Rust that I don't get to use so often.\n [I wrote about it][elegant-apis-in-rust] all the way back in 2016!\n\n[elegant-apis-in-rust]: https://deterministic.space/elegant-apis-in-rust.html#session-types\n\nA columnar record store for BAM\n\nLet me pick some pieces that I think came out well,\nand talk about how the design comes together\nand how it differs from the other implementations (or not).\n\nWhen a reader decodes a segment[^segment],\nit produces hundreds to thousands of BAM records.\nEach record has a handful of fixed-size fields\n(position, flags, mapping quality)\nand several variable-length ones, namely\nthe read name, the CIGAR[^cigar], the sequence, the per-base qualities, and the aux tags.\n\n[^segment]:\n Rastair splits everything up into smaller chunks for parallel processing.\n Seqair assumes this is the use case as well.\n\n[^cigar]:\n \"Compact Idiosyncratic Gapped Alignment Report\",\n a compact string describing how a read aligns to the reference.\n Something like 76M2D24M means \"76 aligned bases (match or mismatch), a 2-base deletion, 24 aligned bases\".\n\nThe obvious design is one struct per record,\nwith the variable-length fields as Box<[u8]> or Vec<u8>.\nThat's six heap allocations per record[^alloc-count],\nand a typical segment has thousands of records,\nall of which we throw away before starting again on the next segment.\n\n[^alloc-count]:\n Five slices for the variable-length fields plus the Record itself,\n maybe even as Rc.\n\nSeqair replaces this with a [RecordStore][record-store]\nwith vectors that act like columns of a table.\nA compact SlimRecord holds the fixed fields and _offsets_ into these slabs,\none each for names, bases, CIGAR bytes, quality, and auxiliary tags.\n\n[record-store]: https://github.com/Softleif/seqair/blob/f93c275683cfce96d49c18ed4aba9d9257302a4d/crates/seqair/src/bam/record_store.rs#L226\n\nPeople call it \"columnar\" but I always draw it like rows in my head.\nKinda 
like this:\n\nDecoding a BAM record is now a handful of extend_from_slice calls\ninto slabs that were pre-sized from the compressed byte count of the segment.\nAfter warm-up there are no allocations in the hot loop at all:\nclear() resets lengths without releasing capacity,\nand the next region reuses the same memory[^arena].\n\n[^arena]:\n This looks and feels like arena allocation,\n all that's missing is me calling it that.\n Using just Vecs is quite simple and they do the job.\n\nSplitting this into slabs isn't just because I think it would be cool\nbut also because the access pattern matches[^seq-simd].\nFor example, read names are (right now) only used during overlapping-pair dedup,\nso they sit by themselves as a compact contiguous buffer\nthat cache-prefetches nicely during a linear scan.\n\n[^seq-simd]:\n _Bonus:_ Seqair decodes the 4-bit packed sequence\n to [Base] using SIMD at push time\n so that the pileup engine never has to unpack bytes again\n and all bases are of a known structure.\n\nCIGAR lives in its own slab\nbecause it is the one thing that changes during local realignment.\nstore.set_alignment(idx, new_pos, new_cigar) appends the new ops to the end of the CIGAR slab\nand rewrites the record's cigar_off, n_cigar_ops, pos, and end_pos.\nThe old bytes become dead data in the slab,\nbut since we're about to call store.clear() at the end of the region it doesn't matter.\nAppend-only mutation means the sequence, quality, and aux slabs\nnever have to be touched when realignment moves a read around.\n\nThe pileup engine sits on top of this.\nA PileupAlignment carries pre-extracted flat fields\n(the base at this position, its Phred score, the read's MAPQ, the flags)\nand a record index into the store.\nThe store is borrowed for the duration of the iteration,\nand Rust's lifetimes do the rest.\n\nCustomization\n\nThe same append-only shape pays off twice more.\nThe reader actually takes a \"customizer\"\n(a user-defined type that implements seqair's 
CustomizeRecordStore trait)\nthat lets the user control some of its behavior.\n\nFirst, it lets users decide whether to keep a record.\nThe _customizer_ gets a parsed record, can analyze it,\nand decide wh",
"title": "Seqair, my Rust-native take on htslib",
"updatedAt": "2026-04-30T00:00:00.000Z"
}