{
  "path": "/bufwriter-lz4.html",
  "site": "at://did:plc:x67qh7v3fd7znbdhauc45ng3/site.standard.publication/3mjcd2t6afe25",
  "$type": "site.standard.document",
  "title": "Rust BufWriter and LZ4 Compression",
  "updatedAt": "2025-07-09T00:00:00.000Z",
  "publishedAt": "2025-07-09T00:00:00.000Z",
  "textContent": "Recently, I've been working on a Rust project again.\nIt deals with bioinformatics data, which can be quite large,\nso I got to play with profiling and optimizing the code.\nI've done some of this in the past, but this time it was _actually_ useful.\nIn this post, I want to talk about a small optimization\nin working with LZ4 compression\nthat made a big difference in runtime performance.\n\nThis tool mainly reads in a BAM file\n(which contains aligned genome sequence data),\ndoes some processing on it,\nand outputs the results in various formats,\nchosen by the user.\nOne of the formats is the internal data structure used by the tool,\nwhich is convenient for debugging and testing.\nSince this is Rust, all I had to do was add some #[derive(Serialize, Deserialize)] annotations,\nchoose a good format (I picked [MessagePack][msgpack]),\nand thanks to [serde][serde],\nwe have a data format.\nConcretely, I made an enum with all the possible structures I want to output\n(which includes header fields)\nand serialize and write each structure separately,\nso that they are concatenated in the output file.\nTo read it back in,\nI wrote a little helper function[^rmp-stream] that\nkeeps deserializing these enum values until it reaches the end of the file.\nSo far, so good.\n\n[msgpack]: https://msgpack.org/ \"MessagePack: It's like JSON. but fast and small.\"\n[serde]: https://serde.rs/ \"Overview · Serde\"\n\n[^rmp-stream]: serde_json includes a [StreamDeserializer][struct-streamdeserializer] but rmp_serde does not, so I wrote one myself. 
It's not as feature-complete (I think), but you can find it [here][317].\n\nCompression with LZ4\n\nHowever, the output file was quite large\n-- it's pretty much everything I have in RAM.\nI wanted to compress it,\nbut I also knew that compression is expensive,\nand for my debug output I don't really need to squeeze every byte out of it.\nI chose [LZ4][lz4], via the lz4 crate.\nIts [Encoder][struct-encoder]\nimplements Write,\nso we can just wrap our writer in it and continue to use it as before:\n\n[lz4]: https://lz4.github.io/lz4/ \"LZ4 - Extremely fast compression\"\n[struct-encoder]: https://docs.rs/lz4/1.28.1/lz4/struct.Encoder.html \"Encoder in lz4\"\n\nPretty early in my Rust journey,\nI learned that file I/O is not buffered by default,\nso it's a good idea to wrap the file in a BufWriter:\n\nThis then creates a chain like this:\n\nProfiling\n\nWhen profiling the code (with [samply][samply]),\nI noticed that the overhead from LZ4 was quite high.\nEven after lowering the compression level to 0,\nI wasn't happy.\nThis was slower than the BGZIP compression I use for BCF files!\nAnd that is based on Deflate, which, while optimized heavily,\nis not an algorithm that should play in the same league as LZ4.\nWhat is going on here?\n\n[samply]: https://github.com/mstange/samply/ \"mstange/samply: Command-line sampling profiler for macOS, Linux, and Windows\"\n\nI saw that there were many stacks with calls to LZ4F_compressUpdateImpl.\nLooking at [the implementation][lz4frame]\nwith the samples per line,\nI see a lot of calls to LZ4F_selectCompression, LZ4F_compressBound_internal,\nmemcpy (if the temporary block buffer has space and LZ4 wants to buffer),\nLZ4F_makeBlock, which writes the block header and checksum,\nand finally XXH32_update, which computes the checksum for the block.\nWhy is this being called so much and why are there so many blocks being made?\n\n[lz4frame]: https://github.com/lz4/lz4/blob/v1.10.0/lib/lz4frame.c#L977 \"lz4/lib/lz4frame.c at v1.10.0 · 
lz4/lz4\"\n\nLZ4 is a block-based compression algorithm,\nwhich means that it compresses data in chunks.\nThe chunks we are giving it are the serialized MessagePack values,\nwhich are around 250 bytes each.\nThis means that for every 250 byte chunk,\nwe're calling into LZ4 and asking it to compress it.\nAnd for every 250 byte chunk,\nit does the entire round of checks, compression, and checksumming.\n\nSwap the buffer\n\nKnowing that LZ4 works with blocks internally,\nI had the idea that I could swap the way I use the buffer:\nInstead of buffering writes to the file system,\nI could buffer writes to the LZ4 encoder.\n\nAnd indeed, this works!\nIn my initial benchmark, this made this part of the code 1.83 times faster.\nAn amazing result for basically just swapping two lines of code.\n\n[struct-streamdeserializer]: https://docs.rs/serde_json/1.0.140/serde_json/struct.StreamDeserializer.html \"StreamDeserializer in serde_json\"\n[317]: https://github.com/3Hren/msgpack-rust/issues/317#issuecomment-3012814957 \"Can't deserialize entire file · Issue #317 · 3Hren/msgpack-rust · GitHub\"",
  "canonicalUrl": "https://deterministic.space/bufwriter-lz4.html"
}