{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreid5mfcdrqbw535hpndhgm4finjyyu3x76jfmczeeh25333vatv2iq",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mm7befyfa7k2"
},
"path": "/t/kvquant-attention-aware-extensions-to-kv-cache-vector-quantization-paper-code/176091#post_1",
"publishedAt": "2026-05-19T09:16:42.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"OSF",
"GitHub - syedMohib44/kvquant: Attention-aware KV cache quantization for LLM inference · GitHub"
],
"textContent": "Hi everyone,\n\nSharing my paper KVQuant five structure-aware extensions to KV cache\n\ncompression built on top of TurboQuant (Zandieh et al., 2025).\n\nTurboQuant rotates KV vectors then applies Lloyd-Max quantization\n\nnear-optimal MSE with provable bounds. But it treats every token the same\n\nand ignores that quantization error has exploitable structure.\n\n****Five extensions:****\n\n- Attention-weighted bit assignment 47–70% lower weighted distortion\n\n- Delta compression 1.1–2.2x lower MSE on correlated streams\n\n- Adaptive bit allocation EMA tracker, promotes/demotes during generation\n\n- Low-rank error correction rank-4 SVD recovers 96% of 2-bit PPL loss\n\n- Product quantization 2-bit storage matching 3-bit scalar quality\n\n****Key result:**** 2-bit + rank-4 correction on gpt2-medium drops dPPL from\n\n+173 to +5.95. PQ (M=16, b=8) produces coherent generation where 2-bit\n\nscalar completely collapses.\n\nPaper: OSF\n\nCode: GitHub - syedMohib44/kvquant: Attention-aware KV cache quantization for LLM inference · GitHub",
"title": "KVQuant attention-aware extensions to KV cache vector quantization (paper + code)"
}