Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiguwwzmmlhz335652kcaczrrcyaimbxhktozilfblwoeqkmc5r3ue",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mmgtr7m3nt32"
  },
  "path": "/t/kvboost-a-full-inference-engine-for-hf-causal-lms-kv-reuse-flashattention-2-awq-streaming-and-speculative-decoding/176159#post_1",
  "publishedAt": "2026-05-22T11:18:31.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "GitHub - pythongiant/KVBoost: Make LLM inference faster with chunk-level KV cache reuse · GitHub"
  ],
  "textContent": "Hey guys, I am building kvboost. GitHub - pythongiant/KVBoost: Make LLM inference faster with chunk-level KV cache reuse · GitHub\n\n## What’s in the engine\n\n### 1. Cross-request KV cache reuse\n\nPrompts are split into fixed-size chunks and content-addressed by hash. On a\ncache hit, stored K/V tensors are loaded instead of recomputed. **CacheBlend\nseam repair** selectively recomputes the ~15% most-deviated tokens at chunk\nboundaries, so stitched K/V produces output quality identical to a full prefill.\n\nOn a 500-conversation ShareGPT replay (Qwen2.5-3B, RTX 4060 8GB):\n\nTurn | Baseline TTFT | KVBoost TTFT | KV reuse\n---|---|---|---\n1 | 18.8 ms | 17.4 ms | 35.7%\n3 | 35.2 ms | 20.6 ms | 99.2%\n8 | 121.6 ms | 26.5 ms | 99.6%\n\nTTFT stays flat. No measurable accuracy loss (99.2% WARM = 99.2% COLD).\n\n* * *\n\n### 2. Custom FlashAttention-2 CUDA kernel\n\nA tiled-softmax kernel that reduces HBM memory traffic from O(N²) to O(N)\nduring KV encoding. Supports `float16`/`bfloat16`, head dims 64/96/128, any\nsequence length, and causal masking. Covers Volta through Hopper (sm_70–sm_90).\nFalls back gracefully to `torch.nn.functional.scaled_dot_product_attention`\nif not compiled.\n\n\n    pip install 'kvboost[cuda]'   # builds the kernel\n\n\n* * *\n\n### 3. AWQ layer streaming — run models bigger than VRAM\n\nStreams INT4 layer weights from pinned host RAM into two CUDA staging slots,\noverlapping PCIe transfer with compute. Embeddings, layernorms, and a\nconfigurable number of head/tail decoder layers stay resident; the rest are\nDMA’d on demand.\n\n**Qwen2.5-32B-Instruct-AWQ on an RTX 3060 12GB (~19GB packed weights):**\n\n  * Peak VRAM: 9.58 GB\n  * Steady-state: 1.40 tok/s\n  * No OOM. 
Fully coherent output.\n\n\n\n\n    from kvboost import KVBoost\n    from kvboost.streaming import StreamingConfig\n\n    engine = KVBoost.from_pretrained(\n        \"Qwen/Qwen2.5-32B-Instruct-AWQ\",\n        streaming_config=StreamingConfig(keep_first_k=9, keep_last_k=9),\n    )\n\n\n* * *\n\n### 4. Speculative decoding stacked on streaming\n\nA small resident draft model proposes γ tokens; the streamed target verifies\nthem in a single multi-token forward pass, so each DMA cycle over the streamed\nlayers yields multiple tokens per round instead of one.\n\n**Qwen2.5-32B target + 1.5B draft, RTX 3060 12GB, γ=5:**\n\nMode | tok/s (decode)\n---|---\nStreaming only | 0.91\n+ Speculative (γ=5) | **2.79**\n\n3.07× decode speedup. Acceptance rate 40%, avg 3.0 committed tokens per round.\nGreedy mode is bit-for-bit identical to non-speculative greedy.\n\n* * *\n\n### 5. OpenAI-compatible server\n\nAsync prefix-grouped batching: requests sharing a prompt prefix are dispatched\nas a single batch, loading shared K/V once and broadcasting it zero-copy. 
Drop-in\nfor the OpenAI SDK, LangChain, LlamaIndex, Instructor, and the Vercel AI SDK.\n\n\n    kvboost-server --model Qwen/Qwen2.5-3B --port 8000 \\\n        --recompute-strategy cacheblend \\\n        --kv-cache-bits 8 \\\n        --batch-window-ms 20\n\n\nAll four optimizations compose: AWQ streaming, speculative decoding, KV\nreuse, and FlashAttention work together through the same endpoint.\n\n* * *\n\n## Quick start\n\n\n    from kvboost import KVBoost\n\n    engine = KVBoost.from_pretrained(\"Qwen/Qwen2.5-3B-Instruct\")\n    engine.warm(\"You are a helpful assistant.\")\n\n    result = engine.generate(\"Your prompt here\", max_new_tokens=256)\n    print(f\"TTFT: {result.ttft_ms:.1f} ms | KV reuse: {result.kv_reuse_ratio:.0%}\")\n\n\n\n    pip install kvboost              # CPU / MPS\n    pip install 'kvboost[cuda]'      # + FlashAttention-2 kernel\n    pip install 'kvboost[server]'    # + OpenAI-compatible server\n    pip install 'kvboost[streaming]' # + AWQ layer streaming\n\n\n**Repo:** pythongiant/KVBoost on GitHub\n\nWould love feedback from anyone running multi-turn agents, RAG pipelines, or\ntrying to squeeze large models onto consumer GPUs, since those are the\nworkloads this was built for.",
  "title": "KVBoost – a full inference engine for HF causal LMs: KV reuse, FlashAttention-2, AWQ streaming, and speculative decoding"
}