CUDA Error 802 on every H200 multi-GPU HF Job, across three vLLM images
CUDA 802 on H200 multi-GPU Jobs: looks like NVSwitch Fabric Manager isn’t ready at job start
Posting here too in case anyone on Jobs / infra sees this first. The full repro, with 4 failing job IDs, 2 working controls (l4x4, and plain PyTorch on h200x4), images, flavors, and references, is in GitHub issue #4128 on huggingface/huggingface_hub: "CUDA Error 802 on all H200 multi-GPU HF Jobs with vLLM, across CUDA 12 and CUDA 13 images".
Short version of what’s new since I filed that issue:
Fabric Manager state at job start on h200x4 (cu130-nightly image, driver 580.126.09 / CUDA 13.0):
Fabric State    : In Progress
Status          : N/A
GPU Fabric GUID : N/A
This was identical on all 4 H200s, and the state never transitioned to Completed during the job.
vLLM at TP=1 on h200x4 also fails with CUDA 802 (CUDA_VISIBLE_DEVICES=0, tensor_parallel_size=1, Qwen/Qwen2.5-0.5B). So there is no tensor-parallel routing and no NVLink handoff in play: vLLM is just touching CUDA on a single visible device, and it hits the same error as in the multi-GPU case.
Combined with the plain-PyTorch control (which works fine on the same h200x4 flavor), it really does look like vLLM’s CUDA init path runs before Fabric Manager is ready on NVSwitch hosts, while PyTorch’s init tolerates it.
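For anyone triaging, a framework-independent check of what the driver reports could look like the sketch below. It calls `cuInit(0)` from the CUDA driver library through `ctypes` and returns the raw result code; in the driver API, 802 is `CUDA_ERROR_SYSTEM_NOT_READY`. The helper names (`probe_cuinit`, `describe_cuda_result`) are hypothetical, and the sketch assumes `libcuda.so.1` is loadable inside the job image. It has not been run on the failing flavor.

```python
import ctypes
from typing import Optional

# Subset of CUresult codes from the CUDA driver API (cuda.h).
CUDA_RESULT_NAMES = {
    0: "CUDA_SUCCESS",
    100: "CUDA_ERROR_NO_DEVICE",
    802: "CUDA_ERROR_SYSTEM_NOT_READY",
    803: "CUDA_ERROR_SYSTEM_DRIVER_MISMATCH",
}


def describe_cuda_result(code: int) -> str:
    """Map a raw CUresult code to its symbolic name, if it is in the subset above."""
    return CUDA_RESULT_NAMES.get(code, f"unrecognized CUresult {code}")


def probe_cuinit(lib_name: str = "libcuda.so.1") -> Optional[int]:
    """Call cuInit(0) directly and return the raw result code.

    Returns None when the driver library cannot be loaded at all
    (assumption: the library is named libcuda.so.1 on the host).
    """
    try:
        libcuda = ctypes.CDLL(lib_name)
    except OSError:
        return None
    libcuda.cuInit.argtypes = [ctypes.c_uint]
    libcuda.cuInit.restype = ctypes.c_int
    return int(libcuda.cuInit(0))
```

Printing `describe_cuda_result(code)` for the value returned by `probe_cuinit()` at job start, and again a minute later, would show whether the driver-level state changes over the lifetime of the job.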
Asks:
Can the Jobs entrypoint on NVSwitch flavors wait for Fabric State: Completed before starting user code? Even a 30-60s gate would prevent this.
Is there a currently-working H200-class flavor (or equivalent multi-GPU flavor) with enough VRAM for a base model in the 70-110B range? I was targeting GLM-4.5-Base (355B, fits on h200x4 / x8 only). If that’s not unblockable this week, are any other large base models on your infra currently running successfully – e.g. DeepSeek V3 Base, LLaMA 4 Scout Base, Qwen 2.5 72B Base – or are they all hitting the same CUDA 802 path? Any confirmed-working pairing of (flavor, base model) would help most.
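As a rough illustration of the gate in the first ask, here is a minimal sketch that polls `nvidia-smi -q` until every GPU reports a Fabric state of Completed, or a timeout expires. The function names are hypothetical, and the parser assumes a layout in which a `Fabric` section header is followed by a `State : <value>` line; that layout may differ between driver versions. It is not how the Jobs entrypoint works today, only one way such a wait could be written.

```python
import subprocess
import time
from typing import Callable, List


def parse_fabric_states(smi_text: str) -> List[str]:
    """Collect the State value from each 'Fabric' section of `nvidia-smi -q` text.

    Assumption: section headers are lines without a colon, and the Fabric
    section contains a 'State : <value>' line.
    """
    states = []
    in_fabric = False
    for raw in smi_text.splitlines():
        line = raw.strip()
        if ":" not in line:
            # A line without a colon starts (or ends) a section.
            in_fabric = line == "Fabric"
        elif in_fabric:
            key, value = line.split(":", 1)
            if key.strip() == "State":
                states.append(value.strip())
    return states


def query_nvidia_smi() -> str:
    """Return the full text of `nvidia-smi -q`."""
    result = subprocess.run(
        ["nvidia-smi", "-q"], capture_output=True, text=True, check=True
    )
    return result.stdout


def wait_for_fabric(
    timeout_s: float = 60.0,
    poll_s: float = 2.0,
    query: Callable[[], str] = query_nvidia_smi,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until all Fabric states read 'Completed'; False on timeout.

    Also returns False if no Fabric section is found, so a real gate
    would need to skip this check on non-NVSwitch flavors.
    """
    deadline = clock() + timeout_s
    while True:
        states = parse_fabric_states(query())
        if states and all(state == "Completed" for state in states):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

Until something like this exists in the entrypoint, a job script could call `wait_for_fabric(timeout_s=60)` before importing vLLM and exit early with a clear message if it returns False.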
This is time-sensitive on my end, so any pointer to a flavor that works today would help most.