Deepseek? Qwen?
Hugging Face Forums [Unofficial]
June 25, 2026
When running vLLM on a single H200 with quantized models:
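- Use FP8 or AWQ quantization: Hopper GPUs such as the H200 support FP8 in hardware, and vLLM ships optimized kernels for both FP8 and AWQ. Run your models with --quantization fp8 or --quantization awq to cut weight memory and raise throughput (AWQ needs a checkpoint that was already quantized with AWQ).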
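- KV Cache Tuning: Set --gpu-memory-utilization 0.90 so vLLM leaves roughly 10% of GPU memory free for CUDA overhead. Whatever remains after the weights are loaded goes to the KV cache, which caps the context length and the number of concurrent sequences you can serve, so check how much context that budget actually supports and lower --max-model-len if it falls short.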
- MoE Optimization: vLLM has dedicated kernels for mixture-of-experts models. If you do run a MoE like DeepSeek or Mixtral, make sure you are on a recent vLLM release so you get the latest Hopper-optimized MoE kernels.
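To make the KV cache point above concrete, here is a rough back-of-the-envelope sketch in Python. The numbers are assumptions, not measurements: 141 GB for the H200, about 72 GB of FP8 weights for a 72B model, 4 GB of miscellaneous overhead, an FP16 KV cache, and a Qwen 2.5 72B-like shape (80 layers, 8 KV heads, head dimension 128) that you should verify against the model's config.json. The function names are hypothetical helpers, not part of vLLM.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token occupies across all layers.

    The leading factor of 2 accounts for storing both a key and a value.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_token_budget(gpu_mem_gb, utilization, weight_gb, overhead_gb, bytes_per_token):
    """Approximate number of tokens that fit in the memory left for the KV cache."""
    budget_bytes = (gpu_mem_gb * utilization - weight_gb - overhead_gb) * 1e9
    return max(0, int(budget_bytes // bytes_per_token))


# Assumed Qwen 2.5 72B-like shape; check config.json before relying on it.
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)

# Assumed H200 capacity, FP8 weight size, and overhead (all rough guesses).
tokens = kv_token_budget(
    gpu_mem_gb=141,
    utilization=0.90,
    weight_gb=72,
    overhead_gb=4,
    bytes_per_token=per_token,
)

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"Approximate token budget: {tokens:,}")
```

The token budget is shared across all in-flight requests, so the longer each request's context, the fewer requests fit at once. Real numbers will differ because activation memory and CUDA graph overhead vary with the vLLM version and settings.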
Start with Qwen 2.5 72B (FP8) or a 4-bit quantized DeepSeek V4 Flash, and you should see strong performance.
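As a starting point, the flags mentioned above can be combined into a single launch command. The sketch below only assembles and prints the command; it does not start a server. The model ID and the 32768-token context limit are placeholders I picked for illustration, and build_vllm_serve_cmd is a hypothetical helper, so swap in whichever checkpoint and limits you actually use.

```python
import shlex


def build_vllm_serve_cmd(model, quantization, gpu_memory_utilization=0.90,
                         max_model_len=None):
    """Assemble a `vllm serve` argument list from the settings discussed above."""
    cmd = [
        "vllm", "serve", model,
        "--quantization", quantization,
        "--gpu-memory-utilization", f"{gpu_memory_utilization:.2f}",
    ]
    # Only cap the context length when the caller asks for it.
    if max_model_len is not None:
        cmd += ["--max-model-len", str(max_model_len)]
    return cmd


# Placeholder model ID and context length; adjust to your own checkpoint.
cmd = build_vllm_serve_cmd(
    "Qwen/Qwen2.5-72B-Instruct",
    quantization="fp8",
    max_model_len=32768,
)

# Print a shell-safe command line you can copy into a terminal.
print(shlex.join(cmd))
```

Keeping the command as a list makes it easy to pass to subprocess.run later without worrying about shell quoting.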