Raw Record Source

{
  "path": "/rastair.html",
  "site": "at://did:plc:x67qh7v3fd7znbdhauc45ng3/site.standard.publication/3mjcd2t6afe25",
  "$type": "site.standard.document",
  "title": "Notes on Rastair, a variant and methylation caller",
  "updatedAt": "2026-04-11T00:00:00.000Z",
  "bskyPostRef": {
    "cid": "bafyreic2wiq2isbfnsi3txv437nuqijod2poaegohert3zvkcg57wk6ray",
    "uri": "at://did:plc:x67qh7v3fd7znbdhauc45ng3/app.bsky.feed.post/3mjhv3i77za2b"
  },
  "publishedAt": "2026-04-11T00:00:00.000Z",
  "textContent": "In the last year,\nI've had the pleasure to work on [Rastair][rastair],\na bioinformatics project by [Benjamin Schuster-Böckler][benjamin]\nfrom the [Ludwig Institute for Cancer Research][ludwig]\nat the University of Oxford.\nIt is a command-line application written in Rust\nthat has excellent performance and a rich feature set.\nAs of now in April 2026, we just released versions 2.0 and 2.1,\nso I wanted to take some time to write down some of my thoughts.\n\n[rastair]: https://bitbucket.org/bsblabludwig/rastair/ \"Rastair Repository on Bitbucket\"\n[benjamin]: https://www.ludwig.ox.ac.uk/team/benjamin-schuster-bockler \"Benjamin Schuster-Böckler\"\n[ludwig]: https://www.ludwig.ox.ac.uk/ \"Ludwig Cancer Research\"\n\nIntro\n\nBenjamin was involved in the development of [TAPS], an alternative way of processing DNA samples.\nIt has different properties than the typical (bisulfite) method and thus needs different tooling.\nExtremely simplified,\nI wrote a new version of an existing tool which takes in TAPS genome sequencing data\nand finds positions with variants as well as methylation[^5].\nThe complexity of the implementation comes down to two factors:\nThe size of the data being processed[^4],\nand the fact that we need to deal with a lot of uncertainty and statistics to make the right calls.\n\n[TAPS]: https://www.nature.com/articles/s41587-019-0041-2 \"TAPS paper in Nature Biotechnology\"\n[^5]: Basically, switches for genes.\n[^4]: It's a lot! Performance optimizations can be measured in hours saved.\n\nEven a year in, I'm not claiming to be an expert in statistics or bioinformatics,\nso I won't take credit for the ideas there.\nI have however put a lot of thought and effort into making sure Rastair outperforms similar tools in runtime speed and hopefully also user experience.\nIn this article, I'd like to describe some of the challenges and some neat Rust implementation details.\nIf you want to read more about the science,\n[check out our paper][paper]!\n\n[paper]: https://www.biorxiv.org/content/10.64898/2026.03.19.712983v1 \"Rastair: an integrated variant and methylation caller\"\n\nWhat Rastair does\n\nRastair's main command is \"call\".\nIt basically reads a BAM file (that contains short snippets of DNA that have been aligned to a reference genome)\nand outputs a VCF file with our analysis results by their position in the genome,\ni.e., the \"calls\" we make with the evidence from the input files.\n\nWe are looking for two things:\n1. Variants: Does the sample have a different base than the reference at this position?\n2. Methylation: Is this position in a CG group and does it have an additional methyl group on it?\n\nThis allows people to further analyze this to find patterns and understand properties about the DNA sample.\n\nPutting all CPU cores to work\n\nRastair will run on very large input data:\nMy test file is 120 GB and I want to test this on my laptop.\nTo have this go anywhere, we need to make sure we're using all the hardware we have,\nstarting with CPU cores.\n\nOur main input format is a BAM file.\nThe DNA snippets it contains have been aligned in a previous step\nso we know where they start and end and what the position relative to the reference is.\nVariants[^snv] can then be found\nby looking at just one position and all the reads we have for it.\nMethylation looks at two positions.\n\n[^snv]: SNV, single nucleotide variants\n\nRastair expects this BAM file to be sorted and to have an index file next to it,\nwhich allows random access to fetch specific positions.\nBAM files also have a header which lists all the \"contigs\" (e.g. chromosomes)\nand their length.\nWith this, we can split up the work and parallelize!\n\nRastair first builds a list of \"segments\"[^3]\nthat overlap a little on the edges\nto capture enough context for methylation calls\nand some statistics on the surrounding context.\nThis segment list becomes our task list:\nUsing Rayon, we process it using one thread per CPU core (or a given number).\n\n[^3]: With a length of 10 000 bases by default\n\nEach \"worker\" thread is (almost) lock-free.\nOn start, it opens all files it needs and puts them in a thread-local static.\nFor each segment,\nit then fetches the actual data from the files,\ndoes its processing,\nand sends the results along with the segment index to a channel.\nIf there are no results, it sends an empty list alongside its index.\n\nRastair uses an \"ordered channel\" with an internal buffer\nso that results can arrive out of order[^1]\nbut the subscriber will always get them in order.\n\n[^1]: Some segments might have no interesting data or less coverage, so they are faster to process.\n\nA special \"writer\" thread is the only subscriber to this channel:\nIt takes the results,\nconverts them to the output format(s) the user requested\nand then writes to the target files (or stdout).\n\nReducing allocations\n\nMultitheading was not enough to get this to be really fast.\nNext cultprit: Allocating memory,\nwhich (as so often) is one of the main performance bottlenecks.\n\nWhenever we can, we try to use \"iterator chains\"\nthat combine our processing steps together\nand filter out items before we ever allocate/collect them.\nThis means we usually have a very clean and pure mapping function in each step,\nand are only looking at one position at a time.\nWhen we need more context,\nwe have a map_surrounding helper that buffers the positions before/after.\n\nAnother trick is to accumulate in place:\nSimilar to [.collect()][trait-iterator],\nwe have our own types that can be built from iterators.\nIn some places we go over the same list many times, however,\nand just use a loop.\nWhy make it complicated?\n\n[trait-iterator]: https://doc.rust-lang.org/stable/std/iter/trait.Iterator.html#method.collect \"Iterator::collect docs\"\n\nIn many places we know that there are\n_most likely_ only a few items in a list.\nIn this case, we use [SmallVec][smallvec]\nto not allocate a Vec.\nThis is heavily used when building structures to call into the VCF writer\nwhere we have lots of arrays with 0-4 numbers.\nAnother benefit is to have more cache locality\nsince structs can contain most of the content directly\nwithout having to chase a pointer.\nI guess we could have used a size-capped library as well\nto reduce the amount of checks\n(and add error handling)\nbut it wasn't an issue yet.\n\n[smallvec]: https://docs.rs/smallvec/2.0.0-alpha.12/smallvec/ \"smallvec docs\"\n\nAside from short lists,\nwe also have a lot of short strings that we never change.\nI often use [SmolStr][struct-smolstr]\nin my Rust projects,\nand this is no exception.\nIt's a pretty neat little library that also comes with free cloning.\n\n[struct-smolstr]: https://docs.rs/smol_str/0.3.6/smol_str/struct.SmolStr.html \"SmolStr docs\"\n\nIn general,\nI opted for pretty generic tools here\nto not have to think about too many edge cases.\nWhat if a user has a contig name that is very very long?\nIt should just work.\n\nIt should also be noted that I switched to the [mimalloc][mimalloc] allocator\nquite early in the project,\nwhich gave a lot of allocation performance benefits\n(at least on macOS).\nI'm also using it in \"override\" mode,\nwhich means it's also used for non-Rust code like htslib[^2].\n\n[mimalloc]: https://docs.rs/crate/mimalloc/0.1.48 \"mimalloc docs\"\n\n[^2]: I quickly tried jemalloc but saw no performance difference. Since this is Rust, switching allocators is pretty simple.\n\nThere's a big C library\n\nThe main library for dealing with many of the different file formats in this ecosystem is [htslib][htslib].\nIt's written for and used by the [samtools][htslib-2] suite,\nand we're using it through the [rust-htslib][rust-htslib] wrapper.\nThis made it quite simple to get started\nand have great support for all typical formats from the get-go.\n\n[htslib]: https://github.com/samtools/htslib \"samtools/htslib: C library for high-throughput sequencing data formats\"\n[htslib-2]: https://www.htslib.org/ \"Samtools\"\n[rust-htslib]: https://github.com/rust-bio/rust-htslib \"rust-bio/rust-htslib: This library provides HTSlib bindings and a high level Rust API for reading and writing BAM files.\"\n\nWhile I was building our VCF/BCF output,\nI noticed some performance issues however.\nFrom Rust, we pass a &[u8] byte slice for field names,\nbut the C function expects a null-terminated slice,\nso rust-htslib allocates a std::ffi::CString for every call.\nFurthermore, in the error case it always calls str::from_utf8 (and to_owned)\nto print the field name.\nI fixed this in my fork of the library by using CStr8\nfrom the [cstr8][cstr8] crate\nwhich is both null-terminated and valid UTF-8.\nThe caller can trivially uphold this.\n\n[cstr8]: https://docs.rs/cstr8/0.1.4/cstr8/ \"cstr8 docs\"\n\nIn a similar vein,\nI also later added a bam::RecordView<'a> type\nwhich is a borrowed version of the existing Record structure.\nThis further reduced the needed allocations\nin our hottest loop.\n\nSmall Bonus Challenge\n\nWhen going over the reads in a BAM file,\nwe sometimes come across the same read twice.\nThis is due to the sequencing:\nTwo reads are generated from opposite ends of the same DNA fragment\nbut when the fragment is too short,\nboth reads cover the same positions in their middle.\nIf we count both reads,\nwe basically count one piece of evidence twice.\n\nThis means we need to de-duplicate reads by their ID (qname).\nIn many cases, we have low coverage (<50 reads for a position),\nbut in some regions there is much higher coverage[^6].\nRastair needs to handle both cases in a reasonable fashion.\n\n[^6]: E.g., due to misaligning reads with low quality.\n\nFor this,\nI designed a simple but effective two-way lookup system.\nSince reads often have a common prefix (a sequencing ID),\nI first only look at the last 4 characters of a read.\nThis is very quick to compare and fits nicely in the CPU cache.\nI build this suffix list up while going over the reads.\nFor each read,\nI first check if we already saw this suffix (at least once),\nif yes, we compare the full name.\nIf it is a real duplicate, we add this read to a \"skip\" list,\notherwise we add it to the list and continue processing.\nAt the end,\nwe remove ",
  "canonicalUrl": "https://deterministic.space/rastair.html"
}