AIOS — First Ground Truth Baseline (CPU DRAM Measurement)
Hugging Face Forums [Unofficial]
April 9, 2026
Update: TurboQuant (Google, ICLR 2026) is directly relevant to AIOS
This week Google released TurboQuant, a KV cache quantization algorithm that compresses attention key-value pairs from 16 bits to 3 bits with near-zero quality loss and no retraining required (TurboQuant: Redefining AI efficiency with extreme compression).
This is complementary to AIOS, not competing. They address the same bottleneck from different directions:
∙ TurboQuant reduces KV cache size — fewer bits per KV entry
∙ AIOS reduces KV cache DRAM reads — fewer times those bits are fetched per token
Both optimizations apply simultaneously. A model running TurboQuant under AIOS memory management addresses the KV bottleneck from two directions at once.
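To make the "two directions" point concrete, here is a minimal sketch of how the two levers multiply. It is a toy model, not AIOS or TurboQuant code: the function name, the cache size, and the 0.5 fetch fraction are all hypothetical placeholders chosen for illustration. Only the 16-bit and 3-bit widths come from the text above.

```python
def kv_bytes_fetched_per_token(entries, bits_per_entry, fetch_fraction):
    """Bytes of KV cache read from DRAM to generate one token (toy model).

    entries        -- number of scalar KV values in the cache
    bits_per_entry -- storage width of each value (the quantization lever)
    fetch_fraction -- share of the cache actually fetched from DRAM
                      (the residency lever; hypothetical parameter)
    """
    return entries * bits_per_entry / 8 * fetch_fraction


ENTRIES = 1_000_000  # arbitrary illustrative cache size, not a measurement

baseline = kv_bytes_fetched_per_token(ENTRIES, 16, 1.0)
quantized_only = kv_bytes_fetched_per_token(ENTRIES, 3, 1.0)
residency_only = kv_bytes_fetched_per_token(ENTRIES, 16, 0.5)  # 0.5 is a placeholder
both = kv_bytes_fetched_per_token(ENTRIES, 3, 0.5)

for label, value in [
    ("baseline", baseline),
    ("quantized only", quantized_only),
    ("residency only", residency_only),
    ("both", both),
]:
    print(f"{label:15s} {value / baseline:.3f}x of baseline traffic")
```

Because the bit width and the fetch fraction are separate factors in the product, improving one does not eat into the gain from the other. That is the sense in which the two approaches stack.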
Our first baseline (Intel Core Ultra 7 265K) measured 2,340 MB/token of DRAM traffic on stock llama.cpp. At 4K context, KV cache reads are a significant fraction of that. TurboQuant’s roughly 5x KV compression (16 bits to 3 bits is about 5.3x) would shrink that fraction before AIOS residency management is applied on top.
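For a rough sense of scale, the sketch below estimates KV cache size at 4K context and compares it with the measured 2,340 MB/token. The model geometry (32 layers, 8 KV heads, head dimension 128) is an assumption for illustration only, since the model behind the baseline is not stated here. It also assumes the whole KV cache is read once per generated token and that MB means 10^6 bytes.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bits):
    """Total KV cache size in bytes for one sequence."""
    # The factor of 2 covers keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bits / 8


# ASSUMED geometry, not taken from the baseline run.
ASSUMED_MODEL = dict(n_layers=32, n_kv_heads=8, head_dim=128, context_len=4096)
MEASURED_MB_PER_TOKEN = 2340  # from the baseline above

kv_16bit = kv_cache_bytes(**ASSUMED_MODEL, bits=16)
kv_3bit = kv_cache_bytes(**ASSUMED_MODEL, bits=3)

for label, size in [("16-bit", kv_16bit), ("3-bit", kv_3bit)]:
    mb = size / 1e6
    share = mb / MEASURED_MB_PER_TOKEN
    print(f"{label} KV cache: {mb:.0f} MB ({share:.1%} of measured traffic)")

print(f"compression ratio: {kv_16bit / kv_3bit:.2f}x")
```

Swapping in the real layer count, KV head count, and head dimension of the benchmarked model would turn this into an actual estimate. With the placeholder values it only shows the shape of the calculation.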
The broader pattern: BitNet (weight arithmetic), CALM (forward passes), TurboQuant (KV size), AIOS (DRAM access patterns) — four independent groups addressing four non-overlapping bottlenecks in the same inference stack. None of them are sufficient alone. All of them stack.