Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreid5mfcdrqbw535hpndhgm4finjyyu3x76jfmczeeh25333vatv2iq",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mm7befyfa7k2"
  },
  "path": "/t/kvquant-attention-aware-extensions-to-kv-cache-vector-quantization-paper-code/176091#post_1",
  "publishedAt": "2026-05-19T09:16:42.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "OSF",
    "GitHub - syedMohib44/kvquant: Attention-aware KV cache quantization for LLM inference · GitHub"
  ],
  "textContent": "Hi everyone,\n\nSharing my paper, KVQuant: five structure-aware extensions to KV cache compression, built on top of TurboQuant (Zandieh et al., 2025).\n\nTurboQuant rotates KV vectors and then applies Lloyd-Max quantization, giving near-optimal MSE with provable bounds. But it treats every token the same and ignores that quantization error has exploitable structure.\n\n**Five extensions:**\n\n- Attention-weighted bit assignment: 47–70% lower weighted distortion\n- Delta compression: 1.1–2.2x lower MSE on correlated streams\n- Adaptive bit allocation: an EMA tracker that promotes/demotes during generation\n- Low-rank error correction: rank-4 SVD recovers 96% of the 2-bit PPL loss\n- Product quantization: 2-bit storage matching 3-bit scalar quality\n\n**Key result:** 2-bit + rank-4 correction on gpt2-medium drops dPPL from +173 to +5.95. PQ (M=16, b=8) produces coherent generation where 2-bit scalar completely collapses.\n\nPaper: OSF\n\nCode: GitHub - syedMohib44/kvquant: Attention-aware KV cache quantization for LLM inference · GitHub",
  "title": "KVQuant: attention-aware extensions to KV cache vector quantization (paper + code)"
}