
DeepSeek? Qwen?

Hugging Face Forums [Unofficial] June 25, 2026

When running vLLM on a single H200 with quantized models:

  1. Use FP8 or AWQ quantization: vLLM ships optimized kernels for Hopper GPUs, which support FP8 in hardware. Run your models with --quantization fp8 or --quantization awq to shrink the weights and improve throughput.
  2. KV cache tuning: Set --gpu-memory-utilization 0.90 to leave about 10% of GPU memory free for CUDA overhead. Whatever remains after the weights are loaded goes to the KV cache, so check the maximum context length that actually fits, and cap it with --max-model-len if needed.
  3. MoE optimization: vLLM has dedicated MoE kernels. If you run a MoE model such as DeepSeek or Mixtral, use an up-to-date vLLM release so you get the latest Hopper-optimized MoE kernels.
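
The flags above can be collected into a single launch command. Here is a minimal sketch that only assembles the argument list for `vllm serve`; it does not start a server. The helper name `build_vllm_serve_cmd`, the model ID `Qwen/Qwen2.5-72B-Instruct`, and the 32768-token context cap are illustrative assumptions, not something the post specifies.

```python
import shlex


def build_vllm_serve_cmd(model, quantization="fp8",
                         gpu_memory_utilization=0.90, max_model_len=None):
    """Assemble a `vllm serve` command line as a list of arguments.

    Hypothetical helper for illustration: it only builds the argument
    list, so it runs without vLLM or a GPU installed.
    """
    # This sketch only covers the two schemes mentioned in the post.
    if quantization not in ("fp8", "awq"):
        raise ValueError(f"unsupported quantization in this sketch: {quantization}")
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")

    cmd = [
        "vllm", "serve", model,
        "--quantization", quantization,
        "--gpu-memory-utilization", f"{gpu_memory_utilization:.2f}",
    ]
    # Optionally cap the context length so the KV cache fits.
    if max_model_len is not None:
        cmd += ["--max-model-len", str(max_model_len)]
    return cmd


# Assumed model ID and context cap, chosen only as an example.
cmd = build_vllm_serve_cmd("Qwen/Qwen2.5-72B-Instruct", max_model_len=32768)
print(shlex.join(cmd))
```

Printing the command instead of running it makes it easy to review the flags first; once it looks right, pass the list to `subprocess.run(cmd)` or paste the printed line into a shell.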

Start with Qwen 2.5 72B (FP8) or a 4-bit quantized DeepSeek V4 Flash, and you should see strong throughput on a single H200.
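
To get a feel for how much context the Qwen 2.5 72B (FP8) option leaves room for, here is a rough back-of-the-envelope sketch. Every number in it is an assumption on my part rather than something measured: 141 GB of H200 memory, about 73 GB of FP8 weights (roughly 1 byte per parameter), and a model shape of 80 layers, 8 KV heads, and head dimension 128 with a 16-bit KV cache. It ignores activations, CUDA graphs, and fragmentation, so treat the result as an upper bound and check the KV cache size vLLM reports at startup.

```python
def kv_cache_token_budget(gpu_mem_gb, utilization, weight_gb,
                          num_layers, num_kv_heads, head_dim,
                          kv_dtype_bytes=2):
    """Estimate how many tokens of KV cache fit after loading the weights.

    Rough upper bound only: activation memory and runtime overhead
    inside the utilization budget are not modeled.
    """
    # Each token stores one K and one V vector per layer per KV head.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * kv_dtype_bytes
    budget_bytes = (gpu_mem_gb * utilization - weight_gb) * 1e9
    if budget_bytes <= 0:
        return 0  # weights alone exceed the memory budget
    return int(budget_bytes // bytes_per_token)


# Assumed figures for Qwen 2.5 72B in FP8 on one H200 (see lead-in).
tokens = kv_cache_token_budget(
    gpu_mem_gb=141, utilization=0.90, weight_gb=73,
    num_layers=80, num_kv_heads=8, head_dim=128,
)
print(f"approx. KV cache capacity: {tokens:,} tokens")
```

Under these assumptions the budget comes out on the order of 160k tokens, shared across all concurrent requests. Lowering --max-model-len trades maximum context per request for more requests in flight.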
