Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreigvkgm7ejctptnxyea4jydczl2t5z4h2664fimylaxtbn24sor6b4",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mp4a4v42f5g2"
  },
  "path": "/t/deepseek-qwen/176657#post_5",
  "publishedAt": "2026-06-25T10:01:19.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "When running vLLM on a single H200 with quantized models:\n\n  1. **Use FP8/AWQ natively:** vLLM ships optimized kernels for Hopper GPUs. Run your models with --quantization fp8 or --quantization awq to maximize throughput.\n  2. **KV Cache Tuning:** Set --gpu-memory-utilization 0.90 to leave room for CUDA overhead, and check the maximum context length the remaining KV cache can support.\n  3. **MoE Optimization:** vLLM has dedicated MoE optimizations. If you do run a MoE model like DeepSeek or Mixtral, make sure your vLLM version is up to date so you get the latest Hopper-optimized MoE kernels.\n\nStart with **Qwen 2.5 72B (FP8)** or a **4-bit quantized DeepSeek V4 Flash**, and you should see strong performance on this setup.",
  "title": "Deepseek? Qwen?"
}
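
The flags mentioned in the post can be put together into one launch command. Below is a minimal sketch, not a definitive setup: it only assembles the argument list for `vllm serve` and does not start a server. The helper name `build_vllm_command`, the model id `Qwen/Qwen2.5-72B-Instruct`, and the optional `--max-model-len` value are assumptions for illustration and are not taken from the post.

```python
import shlex
from typing import List, Optional

# Quantization modes named in the post; vLLM supports others as well.
SUPPORTED_QUANTIZATION = {"fp8", "awq"}


def build_vllm_command(
    model: str,
    quantization: str,
    gpu_memory_utilization: float = 0.90,
    max_model_len: Optional[int] = None,
) -> List[str]:
    """Assemble an argv list for `vllm serve` (hypothetical helper)."""
    if quantization not in SUPPORTED_QUANTIZATION:
        raise ValueError(f"unsupported quantization: {quantization!r}")
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")

    cmd = [
        "vllm", "serve", model,
        "--quantization", quantization,
        # Fraction of GPU memory vLLM may claim; the rest is left
        # as headroom for CUDA overhead.
        "--gpu-memory-utilization", f"{gpu_memory_utilization:.2f}",
    ]
    if max_model_len is not None:
        # Capping context length reduces the KV cache needed per sequence.
        cmd += ["--max-model-len", str(max_model_len)]
    return cmd


if __name__ == "__main__":
    # Model id is an assumption for illustration only.
    argv = build_vllm_command("Qwen/Qwen2.5-72B-Instruct", "fp8")
    print(shlex.join(argv))
```

Building the command as a list rather than a single string keeps each flag and value separate, so it can be passed straight to `subprocess.run` without shell quoting issues. Note that `--quantization awq` expects a checkpoint that was already quantized with AWQ.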