CUDA Error 802 on every H200 multi-GPU HF Job, across three vLLM images
Seems like a platform-side issue? An LLM suggested:
This looks less like a pure vLLM bug and more like an H200 multi-GPU / NVSwitch / Fabric Manager issue on the HF side.
If I were debugging it, I’d probably try three things first:
- see whether single-GPU on H200 works (`CUDA_VISIBLE_DEVICES=0`, `tensor_parallel_size=1`);
- try `VLLM_WORKER_MULTIPROC_METHOD=spawn`, or use `vllm serve` / a normal script instead of `python -c`, since the vLLM multiprocessing docs explain that the startup path differs there;
- check `nvidia-smi -q | grep -A 2 Fabric`, since both NVIDIA’s CUDA 802 guidance and DigitalOcean’s note on this exact error point at fabric / Fabric Manager on NVSwitch systems.
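If it helps to capture the Fabric check as something you can paste into the job, here is a rough Python sketch that pulls the lines under each `Fabric` heading out of `nvidia-smi -q`. The helper names are made up for illustration, and the field names (`State`, `Status`) and the "healthy" values (`Completed`, `Success`) are my assumption about what the report looks like on NVSwitch systems, so compare them with the raw output before trusting the verdict.

```python
import shutil
import subprocess


def parse_fabric_blocks(report: str) -> list[dict[str, str]]:
    """Collect the 'key : value' lines nested under each 'Fabric' heading."""
    blocks: list[dict[str, str]] = []
    current = None
    fabric_indent = 0
    for line in report.splitlines():
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip())
        if line.strip() == "Fabric":
            # Start of a new Fabric section (one per GPU in the report).
            current = {}
            fabric_indent = indent
            blocks.append(current)
        elif current is not None and indent > fabric_indent and ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
        else:
            current = None  # indentation dropped back: left the Fabric section
    return blocks


def fabric_looks_healthy(block: dict[str, str]) -> bool:
    # Assumption: a registered GPU reports State=Completed and Status=Success.
    return block.get("State") == "Completed" and block.get("Status") == "Success"


def query_fabric() -> list[dict[str, str]]:
    """Run nvidia-smi -q if it is on PATH; return [] when it is not."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "-q"], capture_output=True, text=True, check=False
    )
    return parse_fabric_blocks(result.stdout)
```

Calling `query_fabric()` inside the job and printing each block next to `fabric_looks_healthy(block)` gives a per-GPU summary that can go straight into a bug report.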
If single-GPU works but multi-GPU fails, or Fabric state looks wrong, this probably isn’t something a user can really fix from inside the job. The NVIDIA AI Enterprise docs say Fabric Manager is required for HGX 1/2/4/8-GPU VMs, and on H100/H200 shared NVSwitch setups that management lives on the host / service-VM side. That makes it sound much more like an HF infra issue than an application issue.
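To keep the single-GPU vs multi-GPU comparison reproducible, something like the sketch below could build the command and environment for each attempt and then apply the reasoning above. The helper names and the wording of the verdicts are hypothetical, and it assumes the job launches vLLM through `vllm serve` with `--tensor-parallel-size`; adjust it if your job starts vLLM some other way.

```python
import os


def run_config(model: str, gpus: int, spawn: bool = False):
    """Return (command, environment) for one vllm serve attempt."""
    env = dict(os.environ)
    # Expose only the first `gpus` devices to the process.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in range(gpus))
    if spawn:
        env["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
    else:
        # Drop any inherited setting so vLLM picks its default method.
        env.pop("VLLM_WORKER_MULTIPROC_METHOD", None)
    cmd = ["vllm", "serve", model, "--tensor-parallel-size", str(gpus)]
    return cmd, env


def triage(single_gpu_ok: bool, multi_gpu_ok: bool, fabric_ok: bool) -> str:
    """Encode the rule of thumb: single-GPU fine but multi-GPU broken,
    or a bad Fabric state, points at the platform rather than the app."""
    if multi_gpu_ok:
        return "multi-GPU works, nothing to triage"
    if not fabric_ok or single_gpu_ok:
        return "likely platform-side (fabric / Fabric Manager)"
    return "inconclusive: single-GPU also fails, check driver and image first"
```

Each `(cmd, env)` pair can be passed to `subprocess.run(cmd, env=env)`; recording whether each run gets past CUDA initialization gives the three booleans `triage` expects.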
So I’d probably keep both the forum thread and the GitHub issue updated with:
- job ID
- image + flavor
- what works vs what fails
- result of the single-GPU test
- Fabric output
- whether `spawn` changes anything
There’s also a somewhat similar AWS EKS issue where vLLM hit the same CUDA 802 path and it ended up looking node / AMI-side rather than model-side.