KVQuant: attention-aware extensions to KV cache vector quantization (paper + code)

Hugging Face Forums [Unofficial] May 19, 2026


Hi everyone,

Sharing my paper, KVQuant: five structure-aware extensions to KV cache compression, built on top of TurboQuant (Zandieh et al., 2025).

TurboQuant rotates KV vectors, then applies Lloyd-Max quantization, which gives near-optimal MSE with provable bounds. But it treats every token the same and ignores that quantization error has exploitable structure.
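For readers who have not seen the rotate-then-quantize idea, here is a minimal NumPy sketch of it. It is not taken from the TurboQuant or KVQuant code: the function names are hypothetical, the codebook is fitted to the data rather than derived analytically, and the "keys" are synthetic Gaussian vectors with a few inflated channels.

```python
import numpy as np

def random_rotation(dim, rng):
    """Draw a random orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    # Fix column signs so the result does not depend on the QR sign convention.
    return q * np.sign(np.diag(r))

def lloyd_max_codebook(samples, bits, iters=50):
    """Fit a 1-D Lloyd-Max codebook (scalar k-means) to the samples."""
    levels = 2 ** bits
    # Start from evenly spaced quantiles of the data.
    centroids = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        # Decision boundaries are midpoints between neighbouring centroids.
        edges = (centroids[:-1] + centroids[1:]) / 2
        idx = np.searchsorted(edges, samples)
        for k in range(levels):
            members = samples[idx == k]
            if members.size:
                centroids[k] = members.mean()
    return centroids

def quantize(x, rotation, centroids):
    """Rotate the rows of x, then map each coordinate to its nearest centroid index."""
    edges = (centroids[:-1] + centroids[1:]) / 2
    return np.searchsorted(edges, x @ rotation).astype(np.uint8)

def dequantize(codes, rotation, centroids):
    """Look up centroids and undo the rotation."""
    return centroids[codes] @ rotation.T

rng = np.random.default_rng(0)
keys = rng.standard_normal((1024, 64))  # synthetic stand-in for cached key vectors
keys[:, :4] *= 8                        # a few large-magnitude channels
rot = random_rotation(64, rng)
codebook = lloyd_max_codebook((keys @ rot).ravel(), bits=2)
codes = quantize(keys, rot, codebook)
recon = dequantize(codes, rot, codebook)
rel_mse = np.mean((keys - recon) ** 2) / np.mean(keys ** 2)
print(f"relative MSE at 2 bits: {rel_mse:.3f}")
```

The rotation spreads the energy of the large channels across all coordinates, so a single shared scalar codebook fits every coordinate reasonably well. Every token gets the same treatment here, which is the limitation the post is pointing at.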

Five extensions:

Key result: 2-bit + rank-4 correction on gpt2-medium drops dPPL (the perplexity increase from quantization) from +173 to +5.95. Product quantization (PQ; M=16, b=8) produces coherent generation where 2-bit scalar quantization completely collapses.
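To make the "2-bit + rank-4 correction" idea concrete, here is one plausible reading of it as a NumPy sketch: quantize coarsely, then store a rank-4 approximation of the residual alongside the codes. This is an assumption about the general technique, not the paper's actual method. The uniform quantizer, the function names, and the synthetic data are all hypothetical stand-ins.

```python
import numpy as np

def uniform_quantize(x, bits):
    """Per-tensor uniform scalar quantizer (a simple stand-in for the base quantizer)."""
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels
    codes = np.round((x - lo) / scale)
    return codes * scale + lo  # return the dequantized values directly

def low_rank_correction(residual, rank):
    """Best rank-`rank` approximation of the residual, returned as two thin factors."""
    u, s, vt = np.linalg.svd(residual, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank]

rng = np.random.default_rng(0)
tokens, dim = 512, 64
# Synthetic keys: a low-rank component plus noise (an assumption, not real cache data).
keys = rng.standard_normal((tokens, 4)) @ rng.standard_normal((4, dim))
keys += 0.1 * rng.standard_normal((tokens, dim))

base = uniform_quantize(keys, bits=2)
left, right = low_rank_correction(keys - base, rank=4)
corrected = base + left @ right

err_base = np.linalg.norm(keys - base)
err_corr = np.linalg.norm(keys - corrected)
print(f"Frobenius error: base={err_base:.2f}, with rank-4 correction={err_corr:.2f}")
```

The truncated SVD is the best rank-4 fit to the residual in Frobenius norm, so the corrected error can never exceed the base error. The price is rank × (tokens + dim) extra stored values for the two factors. How much this helps on a real model depends on how much structure the quantization error has, which is the question the paper's numbers address.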

Paper: OSF

Code: GitHub - syedMohib44/kvquant: Attention-aware KV cache quantization for LLM inference