KVQuant: attention-aware extensions to KV cache vector quantization (paper + code)
Hi everyone,
Sharing my paper, KVQuant: five structure-aware extensions to KV cache
compression, built on top of TurboQuant (Zandieh et al., 2025).
TurboQuant rotates KV vectors, then applies Lloyd-Max quantization, which
gives near-optimal MSE with provable bounds. But it treats every token the
same and ignores the fact that quantization error has exploitable structure.
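For readers new to the baseline, here is a minimal NumPy sketch of the rotate-then-quantize idea. It is not TurboQuant's actual implementation. Assumptions: the rotation is a random orthogonal matrix drawn via QR decomposition, the Lloyd-Max codebook is fitted empirically to the rotated coordinates, the data is synthetic Gaussian, and the function names (`random_rotation`, `lloyd_max`, `quantize`, `dequantize`) are made up for illustration.

```python
import numpy as np

def random_rotation(dim, seed=0):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    # Flip column signs so the draw is uniform over orthogonal matrices.
    return q * np.sign(np.diag(r))

def lloyd_max(samples, bits, iters=50):
    """Fit 2**bits scalar reconstruction levels by Lloyd-Max iteration (1-D k-means)."""
    n_levels = 2 ** bits
    levels = np.quantile(samples, (np.arange(n_levels) + 0.5) / n_levels)
    for _ in range(iters):
        edges = (levels[:-1] + levels[1:]) / 2   # decision boundaries: midpoints
        idx = np.searchsorted(edges, samples)    # nearest-level assignment
        for k in range(n_levels):
            members = samples[idx == k]
            if members.size:
                levels[k] = members.mean()       # centroid update
    return levels

def quantize(x, rotation, levels):
    """Rotate, then map each coordinate to the index of its nearest level."""
    edges = (levels[:-1] + levels[1:]) / 2
    return np.searchsorted(edges, x @ rotation)

def dequantize(codes, rotation, levels):
    """Look up levels and undo the rotation."""
    return levels[codes] @ rotation.T

rng = np.random.default_rng(1)
x = rng.standard_normal((512, 64))   # synthetic stand-in for KV vectors
rot = random_rotation(64)

mse = {}
for bits in (2, 3):
    levels = lloyd_max((x @ rot).ravel(), bits)
    x_hat = dequantize(quantize(x, rot, levels), rot, levels)
    mse[bits] = float(np.mean((x - x_hat) ** 2))
print(mse)
```

Every coordinate of every token gets the same codebook and the same bit width here, which is the "treats every token the same" behavior the extensions target.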
Five extensions:
Attention-weighted bit assignment: 47–70% lower weighted distortion
Delta compression: 1.1–2.2x lower MSE on correlated streams
Adaptive bit allocation: an EMA tracker promotes or demotes bit widths during generation
Low-rank error correction: rank-4 SVD recovers 96% of 2-bit PPL loss
Product quantization: 2-bit storage matching 3-bit scalar quality
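As a rough illustration of the first extension, the sketch below assigns more bits to tokens that receive more attention, under a fixed average bit budget. This is an assumed greedy scheme, not necessarily the one in the paper: it models per-token distortion as `4**(-bits)` (the usual high-resolution approximation, where each extra bit cuts MSE by about 4x) and minimizes the attention-weighted sum. The function names and the example weights are hypothetical.

```python
import heapq

def allocate_bits(weights, avg_bits, min_bits=1, max_bits=8):
    """Greedy allocation minimizing sum_i w_i * 4**(-b_i) under a total bit budget."""
    n = len(weights)
    bits = [min_bits] * n
    remaining = int(round(avg_bits * n)) - min_bits * n

    def gain(i):
        # Distortion saved by giving token i one more bit: w * (4^-b - 4^-(b+1)).
        return weights[i] * 3 * 4.0 ** -(bits[i] + 1)

    # Max-heap via negated gains; always spend the next bit where it helps most.
    heap = [(-gain(i), i) for i in range(n)]
    heapq.heapify(heap)
    while remaining > 0 and heap:
        _, i = heapq.heappop(heap)
        bits[i] += 1
        remaining -= 1
        if bits[i] < max_bits:
            heapq.heappush(heap, (-gain(i), i))
    return bits

def weighted_distortion(weights, bits):
    """Attention-weighted distortion under the assumed 4**(-bits) model."""
    return sum(w * 4.0 ** -b for w, b in zip(weights, bits))

# Hypothetical attention mass per cached token (sums to 1).
weights = [0.40, 0.25, 0.15, 0.08, 0.05, 0.04, 0.02, 0.01]
adaptive = allocate_bits(weights, avg_bits=3)
uniform = [3] * len(weights)

print(adaptive)
print(weighted_distortion(weights, adaptive), weighted_distortion(weights, uniform))
```

Both allocations use the same total number of bits; the adaptive one simply moves bits from rarely attended tokens to heavily attended ones.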
Key result: 2-bit + rank-4 correction on gpt2-medium drops the perplexity
increase (ΔPPL) from +173 to +5.95. PQ (M=16, b=8) produces coherent
generation where 2-bit scalar quantization completely collapses.
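The low-rank correction idea can be sketched in a few lines: quantize coarsely, take the residual, and keep only its top singular directions. This is a toy sketch under stated assumptions, not the repository's code: the quantizer is a plain uniform 2-bit stand-in, the "keys" are synthetic, and `uniform_quantize` and `low_rank_correction` are hypothetical names. It does not reproduce the perplexity numbers above.

```python
import numpy as np

def uniform_quantize(x, bits):
    """Stand-in quantizer: evenly spaced levels across the tensor's range."""
    lo, hi = x.min(), x.max()
    steps = 2 ** bits - 1
    codes = np.round((x - lo) / (hi - lo) * steps)
    return codes / steps * (hi - lo) + lo

def low_rank_correction(residual, rank):
    """Best rank-`rank` approximation of the residual, returned as two factors."""
    u, s, vt = np.linalg.svd(residual, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank]

rng = np.random.default_rng(0)
# Synthetic keys: 256 tokens x 64 dims with low-rank structure plus noise.
keys = (rng.standard_normal((256, 8)) @ rng.standard_normal((8, 64))
        + 0.1 * rng.standard_normal((256, 64)))

coarse = uniform_quantize(keys, bits=2)
left, right = low_rank_correction(keys - coarse, rank=4)
corrected = coarse + left @ right

err_before = float(np.mean((keys - coarse) ** 2))
err_after = float(np.mean((keys - corrected) ** 2))
print(err_before, err_after)
```

The correction costs `rank * (tokens + dims)` extra stored values, so it is cheap when the rank is small relative to the cache size. How much error it removes depends on how much of the residual lies in a few directions.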
Paper: OSF
Code: syedMohib44/kvquant on GitHub (attention-aware KV cache quantization for LLM inference)