KVQuant: attention-aware extensions to KV cache vector quantization (paper + code)

Hugging Face Forums [Unofficial] May 19, 2026

Hi everyone,

Sharing my paper, KVQuant: five structure-aware extensions to KV cache compression, built on top of TurboQuant (Zandieh et al., 2025).

TurboQuant rotates KV vectors, then applies Lloyd-Max quantization, giving near-optimal MSE with provable bounds. But it treats every token the same and ignores that quantization error has exploitable structure.
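
For readers new to the rotate-then-quantize idea, here is a minimal sketch in plain Python. It is not TurboQuant's actual pipeline: it swaps in a randomized Hadamard transform (random sign flips followed by a Walsh-Hadamard transform) as the rotation, and fits the Lloyd-Max levels empirically on the rotated coordinates. All function names, the toy data, and the outlier channel are assumptions made for illustration.

```python
import math
import random

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = list(x)
    n, h = len(x), 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in x]

def rotate(x, signs):
    # Random sign flips followed by Hadamard: an orthogonal "rotation" stand-in.
    return fwht([s * v for s, v in zip(signs, x)])

def unrotate(y, signs):
    # The normalized Hadamard matrix is its own inverse; then undo the sign flips.
    return [s * v for s, v in zip(signs, fwht(y))]

def nearest(levels, v):
    return min(range(len(levels)), key=lambda i: abs(levels[i] - v))

def lloyd_max(samples, bits, iters=20):
    """1-D Lloyd-Max: alternate nearest-level assignment and centroid update."""
    k = 2 ** bits
    s = sorted(samples)
    levels = [s[(2 * i + 1) * len(s) // (2 * k)] for i in range(k)]  # quantile init
    for _ in range(iters):
        sums, counts = [0.0] * k, [0] * k
        for v in samples:
            idx = nearest(levels, v)
            sums[idx] += v
            counts[idx] += 1
        levels = [sums[i] / counts[i] if counts[i] else levels[i] for i in range(k)]
    return levels

random.seed(0)
d = 64  # toy head dimension (assumption)
signs = [random.choice((-1.0, 1.0)) for _ in range(d)]
vectors = [[random.gauss(0, 1) for _ in range(d)] for _ in range(64)]
for v in vectors:
    v[0] *= 20  # one outlier channel, which the rotation spreads across coordinates

rotated = [rotate(v, signs) for v in vectors]
levels = lloyd_max([c for r in rotated for c in r], bits=2)
recon = [unrotate([levels[nearest(levels, c)] for c in r], signs) for r in rotated]

mse = sum((a - b) ** 2 for v, r in zip(vectors, recon) for a, b in zip(v, r)) / (len(vectors) * d)
power = sum(a * a for v in vectors for a in v) / (len(vectors) * d)
print(f"2-bit MSE {mse:.3f} vs. signal power {power:.3f}")
```

The point of the rotation is that a single shared scalar codebook works reasonably for every coordinate afterwards, even though the raw vectors have one channel that is far larger than the rest.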

Five extensions:

  • Attention-weighted bit assignment: 47–70% lower weighted distortion
  • Delta compression: 1.1–2.2x lower MSE on correlated streams
  • Adaptive bit allocation: EMA tracker, promotes/demotes during generation
  • Low-rank error correction: rank-4 SVD recovers 96% of 2-bit PPL loss
  • Product quantization: 2-bit storage matching 3-bit scalar quality
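
To give a feel for the first extension, here is a rough sketch of attention-weighted bit assignment. It is not the paper's algorithm. It assumes a textbook distortion model where each extra bit cuts per-token distortion by 4x (D(b) = 4^-b), and greedily hands out bits to whichever token gets the largest attention-weighted distortion reduction. `allocate_bits`, `weighted_distortion`, the bit limits, and the example weights are all hypothetical.

```python
import heapq

def allocate_bits(attn_weights, avg_bits, min_bits=1, max_bits=8):
    """Greedy allocation minimizing sum_i w_i * 4**(-b_i) under a total bit budget."""
    n = len(attn_weights)
    bits = [min_bits] * n
    budget = int(avg_bits * n) - min_bits * n
    # Gain of going from b to b+1 bits under D(b) = 4**-b is w * 0.75 * 4**-b.
    heap = [(-w * 0.75 * 4.0 ** (-min_bits), i) for i, w in enumerate(attn_weights)]
    heapq.heapify(heap)
    while budget > 0 and heap:
        _, i = heapq.heappop(heap)
        bits[i] += 1
        budget -= 1
        if bits[i] < max_bits:
            heapq.heappush(heap, (-attn_weights[i] * 0.75 * 4.0 ** (-bits[i]), i))
    return bits

def weighted_distortion(attn_weights, bits):
    return sum(w * 4.0 ** (-b) for w, b in zip(attn_weights, bits))

# Toy attention mass per cached token (made-up numbers, sorted for readability).
weights = [0.5, 0.3, 0.1, 0.05, 0.03, 0.02]
adaptive = allocate_bits(weights, avg_bits=3)
uniform = [3] * len(weights)

print("bits per token:", adaptive)
print("uniform  :", weighted_distortion(weights, uniform))
print("adaptive :", weighted_distortion(weights, adaptive))
```

Both allocations use the same total number of bits; the adaptive one just spends more of them on tokens that receive more attention. Because the assumed distortion curve is convex, the greedy choice is optimal for this toy model, but real quantizers will not follow 4^-b exactly.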

Key result: 2-bit + rank-4 correction on gpt2-medium drops dPPL (the perplexity increase) from +173 to +5.95. PQ (M=16, b=8) produces coherent generation where 2-bit scalar quantization completely collapses.
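
A quick sanity check on why PQ with M=16, b=8 counts as "2-bit storage". This sketch assumes M is the number of sub-vectors and b the bits per codebook index, and that PQ is applied per attention head. gpt2-medium has a hidden size of 1024 split over 16 heads, so the head dimension is 64. The function names and the fp16 codebook assumption are mine, not from the paper.

```python
def pq_bits_per_coord(head_dim, num_subvectors, bits_per_index):
    """Storage cost per coordinate when each sub-vector is replaced by one index."""
    if head_dim % num_subvectors:
        raise ValueError("head_dim must be divisible by num_subvectors")
    return num_subvectors * bits_per_index / head_dim

def pq_codebook_bytes(head_dim, num_subvectors, bits_per_index, bytes_per_value=2):
    """One-time codebook overhead: M codebooks of 2**b centroids each."""
    sub_dim = head_dim // num_subvectors
    return num_subvectors * (2 ** bits_per_index) * sub_dim * bytes_per_value

head_dim = 1024 // 16  # gpt2-medium: hidden size 1024, 16 heads

print(pq_bits_per_coord(head_dim, num_subvectors=16, bits_per_index=8))   # 2.0
print(pq_codebook_bytes(head_dim, num_subvectors=16, bits_per_index=8))   # 32768
```

So 16 one-byte indices replace 64 coordinates, which is 2 bits per coordinate, at the price of a shared codebook (32 KiB per codebook set under the fp16 assumption) that is amortized over all cached tokens.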

Paper: OSF

Code: GitHub, syedMohib44/kvquant (Attention-aware KV cache quantization for LLM inference)
