Seqair, my Rust-native take on htslib

Pascal’s scribbles blog April 30, 2026
Source

This post introduces seqair, a new Rust library for bioinformatics formats I've been working on. We'll look in detail at one of my main design choices here, a columnar store for iterating BAM data, and talk about how I tried to create nice Rust APIs.

Rastair, the project I've been working on, reads and writes a lot of file formats. SAM/BAM/CRAM input, VCF/BCF output, reference FASTA, all through rust-htslib, a wrapper around htslib, the C library behind the samtools suite. After a year of using (and tweaking) it, I wanted to see what a Rust-native approach could look like. Seqair is the result of that experiment.

The current state as of 2026-04-30: Lots of features working! API refinement in-progress, then I'll put it on crates.io.

Update 2026-05-08: I released Seqair 0.1 now! Find it on crates.io and docs.rs. {class="btw"}

An experiment

The week before Easter we released Rastair 2.1 with GPU support. The next big task is local read realignment[^realignment] which requires changing the order and annotations of BAM records. I'm very intrigued by the challenge and also by trying to understand the formats and their uses on a deeper level. Trying to optimize BCF writing in rust-htslib was how I got started[^hts-fork], and I didn't want to stop there. For realignment, I could either fiddle with the raw pointers to htslib structs, or write intermediate BAM records only to then parse them again. Both are not very satisfying options.

For what Rastair needs, there are several possible implementations and htslib and noodles each present only one of them. I had a vague idea of how I would want to do it, given a clean slate.

[^realignment]: With our knowledge of how TAPS methylation looks, we think we can fix some alignment issues in regions with a lot of insertions or deletions.

[^hts-fork]: I forked rust-htslib and hts-sys to update both the bindings and add some features. My PRs aren't merged yet and there is more work I did that I could upstream... My time is limited.

So, I decided to do an experiment. Can I write my ideas as specifications using tracey (I've heard good things), combine this with the extensive specifications for the formats and use Claude Code to get a prototype for this going?[^llm] The goal is to prove whether my ideas are feasible and also have some fun with trying out new tools and see how they deal with some of my coding style choices.

[^llm]: I'll probably write about the LLM aspect of this separately, but this post I want to keep about the architecture and API.

This prototype is done when I have a library that can replace rust-htslib for the pileup[^pileup] generation in Rastair and show an example of how realignment would be handled.

[^pileup]: We have many reads that are short overlapping snippets of DNA. You can imagine this as a 2D grid with the position on the x axis and the reads as a stack going up (or down). A "pileup" is the column view, where each column shows all the bases for a position. In practice, there is also a lot of metadata.

A Rustic API

One of my goals was to make the developer experience very Rust-like, because that is what I enjoy (and what I'm good at). There are many aspects to this and I want to name only a few of them here.

I wanted to have strong types for everything that enforce logic constraints, and distinguish types with parameters. For example: Some formats have zero-based positions and others (like VCF) count from 1. Naturally, I added a type Pos that exists as both Pos and Pos (aliased to Pos0 and Pos1 for convenience).

Another example is that there is one Reader entrypoint, which automatically detects the file format(s) and provides detailed errors. This is not something I've seen noodles provide, and the errors from htslib were sadly not very helpful. Also, htslib has no way of plugging its logs into something other than stderr. Seqair uses tracing for logs and instrumentation, which is Rust-native and can be enabled or disabled by the end user.

When writing VCF/BCF files[^also], we first need to write a header, which defines all the fields. In the binary version, we will refer to them by their ID. This header needs to be written in a specific order (contigs, then filters, then info fields, then format fields, then samples). In the same way, records (lines) need to be written in a specific order, so that we can stream them directly to an output buffer. All of this is enforced using the type-state pattern[^type-state], where methods like VcfHeaderBuilder::formats or RecordEncoder::begin_samples transition from one state to another, which then includes different methods to add different data. More on this in a future post!

[^also]: Yes, I also added this, after complaining about how inefficient the Rust wrapper was in my Rastair post.

Seqair leans harder into type-state builders than either noodles or rust-htslib. I'm not sure this is necessarily better for the average user who might not be enjoying these Rust features as much as me.

[^type-state]: This is one of my favorite features in Rust that I don't get to use so often. I wrote about it all the way back in 2016!

A columnar record store for BAM

Let me pick some pieces that I think came out well, and talk about how the design comes together and how it differs from the other implementations (or not).

When a reader decodes a segment[^segment], it produces hundreds to thousands of BAM records. Each record has a handful of fixed-size fields (position, flags, mapping quality) and several variable-length ones, namely the read name, the CIGAR[^cigar], the sequence, the per-base qualities, the aux tags.

[^segment]: Rastair splits everything up to smaller chunks for parallel processing. Seqair assumes this is the use case as well.

[^cigar]: "Compact Idiosyncratic Gapped Alignment Report", a compact string describing how a read aligns to the reference. Something like 76M2D24M means "76 matches, 2 deletions, 24 matches".

The obvious design is one struct per record, with the variable-length fields as Box<[u8]> or Vec. That's six heap allocations per record[^alloc-count], and a typical segment has thousands of records and we then throw them all away and start again for the next segment.

[^alloc-count]: Five slices for the variable-length fields plus the Record itself, maybe even as Rc.

Seqair replaces this with a RecordStore with vectors that act like columns of a table. A compact SlimRecord holds the fixed fields and offsets into these slabs, one each for names, bases, CIGAR bytes, quality, and auxiliary tags.

People call it "columnar" but I always draw it like rows in my head. Kinda like this:

Decoding a BAM record is now a handful of extend_from_slice calls into slabs that were pre-sized from the compressed byte count of the segment. After warm-up there are no allocations in the hot loop at all: clear() resets lengths without releasing capacity, and the next region reuses the same memory[^arena].

[^arena]: This looks and feels like arena allocation, all that's missing is me calling it that. Using just Vecs is quite simple and they do the job.

Splitting this into slabs isn't just because I think it would be cool but also because the access pattern matches[^seq-simd]. For example, read names are (right now) only used during overlapping-pair dedup, so they sit by themselves as a compact contiguous buffer that cache-prefetches nicely during a linear scan.

[^seq-simd]: Bonus: Seqair decodes the 4-bit packed sequence to [Base] using SIMD at push time so that the pileup engine never has to unpack bytes again and all bases are of a known structure.

CIGAR lives in its own slab because it is the one thing that changes during local realignment. store.set_alignment(idx, new_pos, new_cigar) appends the new ops to the end of the CIGAR slab and rewrites the record's cigar_off, n_cigar_ops, pos, and end_pos. The old bytes become dead data in the slab, but since we're about to call store.clear() at the end of the region it doesn't matter. Append-only mutation means the sequence, quality, and aux slabs never have to be touched when realignment moves a read around.

The pileup engine sits on top of this. A PileupAlignment carries pre-extracted flat fields (the base at this position, its Phred score, the read's MAPQ, the flags) and a record index into the store. The store is borrowed for the duration of the iteration, and Rust's lifetimes do the rest.

Customization

The same append-only shape pays off twice more. The reader actually takes a "customizer" (a user-defined type that implements seqair's CustomizeRecordStore trait) that lets the user control some of its behavior.

First, it lets users decide whether to keep a record. The customizer gets a parsed record, can analyze it, and decide wh

Discussion in the ATmosphere

Loading comments...