{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreiguwwzmmlhz335652kcaczrrcyaimbxhktozilfblwoeqkmc5r3ue",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mmgtr7m3nt32"
},
"path": "/t/kvboost-a-full-inference-engine-for-hf-causal-lms-kv-reuse-flashattention-2-awq-streaming-and-speculative-decoding/176159#post_1",
"publishedAt": "2026-05-22T11:18:31.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"GitHub - pythongiant/KVBoost: Make LLM inference faster with chunk-level KV cache reuse · GitHub"
],
"textContent": "Hey guys, I am building kvboost. GitHub - pythongiant/KVBoost: Make LLM inference faster with chunk-level KV cache reuse · GitHub\n\n## What’s in the engine\n\n### 1. Cross-request KV cache reuse\n\nPrompts are split into fixed-size chunks and content-addressed by hash. On a\ncache hit, stored K/V tensors are loaded instead of recomputed. **CacheBlend\nseam repair** selectively recomputes the ~15% most-deviated tokens at chunk\nboundaries, so stitched K/V produces output quality identical to a full prefill.\n\nOn a 500-conversation ShareGPT replay (Qwen2.5-3B, RTX 4060 8GB):\n\nTurn | Baseline TTFT | KVBoost TTFT | KV reuse\n---|---|---|---\n1 | 18.8 ms | 17.4 ms | 35.7%\n3 | 35.2 ms | 20.6 ms | 99.2%\n8 | 121.6 ms | 26.5 ms | 99.6%\n\nTTFT stays flat. No measurable accuracy loss (99.2% WARM = 99.2% COLD).\n\n* * *\n\n### 2. Custom FlashAttention-2 CUDA kernel\n\nA tiled-softmax kernel that reduces HBM memory traffic from O(N²) to O(N)\nduring KV encoding. Supports `float16`/`bfloat16`, head dims 64/96/128, any\nsequence length, and causal masking. Covers Volta through Hopper (sm_70–sm_90).\nFalls back gracefully to `torch.nn.functional.scaled_dot_product_attention`\nif not compiled.\n\n\n pip install 'kvboost[cuda]' # builds the kernel\n\n\n* * *\n\n### 3. AWQ layer streaming — run models bigger than VRAM\n\nStreams INT4 layer weights from pinned host RAM into two CUDA staging slots,\noverlapping PCIe transfer with compute. Embeddings, layernorms, and a\nconfigurable number of head/tail decoder layers stay resident; the rest are\nDMA’d on demand.\n\n**Qwen2.5-32B-Instruct-AWQ on an RTX 3060 12GB (~19GB packed weights):**\n\n * Peak VRAM: 9.58 GB\n * Steady-state: 1.40 tok/s\n * No OOM. 
Fully coherent output.\n\n\n\n\n from kvboost import KVBoost\n from kvboost.streaming import StreamingConfig\n\n engine = KVBoost.from_pretrained(\n \"Qwen/Qwen2.5-32B-Instruct-AWQ\",\n streaming_config=StreamingConfig(keep_first_k=9, keep_last_k=9),\n )\n\n\n* * *\n\n### 4. Speculative decoding stacked on streaming\n\nA small resident draft model proposes K tokens; the streamed target verifies\nthem in a single multi-token forward — the same DMA cycle, but yielding\nmultiple tokens per round.\n\n**Qwen2.5-32B target + 1.5B draft, RTX 3060 12GB, gamma=5:**\n\nMode | tok/s (decode)\n---|---\nStreaming only | 0.91\n+ Speculative (γ=5) | **2.79**\n\n3.07× decode speedup. Acceptance rate 40%, avg 3.0 committed tokens per round.\nGreedy mode is bit-for-bit identical to non-speculative greedy.\n\n* * *\n\n### 5. OpenAI-compatible server\n\nAsync prefix-grouped batching: requests sharing a prompt prefix are dispatched\nas a single batch, loading shared K/V once and broadcasting zero-copy. Drop-in\nfor the OpenAI SDK, LangChain, LlamaIndex, Instructor, and the Vercel AI SDK.\n\n\n kvboost-server --model Qwen/Qwen2.5-3B --port 8000 \\\n --recompute-strategy cacheblend \\\n --kv-cache-bits 8 \\\n --batch-window-ms 20\n\n\nAll four optimizations compose — AWQ streaming + speculative decoding + KV\nreuse + FlashAttention all work together through the same endpoint.\n\n* * *\n\n## Quick start\n\n\n from kvboost import KVBoost\n\n engine = KVBoost.from_pretrained(\"Qwen/Qwen2.5-3B-Instruct\")\n engine.warm(\"You are a helpful assistant.\")\n\n result = engine.generate(\"Your prompt here\", max_new_tokens=256)\n print(f\"TTFT: {result.ttft_ms:.1f} ms | KV reuse: {result.kv_reuse_ratio:.0%}\")\n\n\n\n pip install kvboost # CPU / MPS\n pip install 'kvboost[cuda]' # + FlashAttention-2 kernel\n pip install 'kvboost[server]' # + OpenAI-compatible server\n pip install 'kvboost[streaming]' # + AWQ layer streaming\n\n\n**Repo:** GitHub - pythongiant/KVBoost: Make LLM inference faster 
with chunk-level KV cache reuse\n\nWould love feedback from anyone running multi-turn agents, RAG pipelines, or\ntrying to squeeze large models onto consumer GPUs — those are the workloads\nthis was built for.",
"title": "KVBoost – a full inference engine for HF causal LMs: KV reuse, FlashAttention-2, AWQ streaming, and speculative decoding"
}