CUDA Error 802 on every H200 multi-GPU HF Job, across three vLLM images
Every H200 multi-GPU job I launch fails at CUDA initialization, before any model weights load. The error is:
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized
The failure occurs in vLLM’s multiproc_executor.py at WorkerProc init. I’ve now tested three different vLLM image versions (CUDA 12.x runtime and CUDA 13 runtime) and the error is identical in all three. It is not model-specific, TP-size-specific, or CUDA-runtime-version-specific.
What I’ve confirmed:
| Setup | Result |
|---|---|
| pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel on h200x4, single process (nvidia-smi + torch.cuda.device_count()) | works, returns 4 |
| vllm/vllm-openai:v0.19.1 on l4x4 | works end-to-end |
| vllm/vllm-openai:v0.19.1 on h200x4, Qwen2.5-7B | fails with 802 (twice on retry) |
| vllm/vllm-openai:v0.19.1 on h200x8, GLM-4.5-Base | fails with 802 |
| vllm/vllm-openai:cu130-nightly on h200x4, Qwen2.5-7B | fails with 802 |
The fact that plain PyTorch single-process works on the same h200x4 node but every vLLM multi-process worker fails suggests the issue is specific to how CUDA context is initialized inside spawned worker subprocesses on H200 nodes. This pattern matches Fabric Manager / NVSwitch visibility regressions documented in:
- How do I fix a "system not initialized" error on multi-GPU Droplets? | DigitalOcean Documentation
- RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized · Issue #2554 · awslabs/amazon-eks-ami · GitHub
- CUDA initialization failure with error Error 802: system not yet initialized - GPU - Hardware - NVIDIA Developer Forums
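From inside an HF Job I can't restart Fabric Manager or check whether the Fabric Manager version matches the driver, so the fixes described in those threads aren't available to me.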
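A minimal sketch of a vLLM-free check for the subprocess hypothesis above: it starts a few fresh processes and forces CUDA initialization in each one, so a failure here would point at the node rather than at vLLM. The names `probe_cuda` and `run_probe` are hypothetical, and two things are assumptions rather than facts from the failing jobs: that the `spawn` start method is representative of how the vLLM workers are created, and that `torch.cuda.init()` surfaces the same Error 802 that the workers hit.

```python
import multiprocessing as mp
import queue
import time


def probe_cuda(rank, results):
    """Initialize CUDA in the current process and report what happened."""
    try:
        import torch  # imported here so CUDA is first touched in this process

        # Force real CUDA initialization instead of relying on a cached or
        # NVML-based device count; initialization failures should raise here.
        torch.cuda.init()
        results.put((rank, "ok", str(torch.cuda.device_count())))
    except Exception as exc:  # also covers a missing torch install
        results.put((rank, "error", f"{type(exc).__name__}: {exc}"))


def run_probe(num_workers=4, timeout=60.0):
    """Start num_workers fresh processes and collect one result from each."""
    # Assumption: "spawn" mirrors how the failing workers are started.
    ctx = mp.get_context("spawn")
    results = ctx.Queue()
    procs = [
        ctx.Process(target=probe_cuda, args=(rank, results))
        for rank in range(num_workers)
    ]
    for proc in procs:
        proc.start()

    collected = []
    deadline = time.monotonic() + timeout
    while len(collected) < num_workers and time.monotonic() < deadline:
        try:
            collected.append(results.get(timeout=0.5))
        except queue.Empty:
            if not any(proc.is_alive() for proc in procs):
                break  # every child has exited; stop waiting
    # Pick up results flushed between the last get() and the liveness check.
    while len(collected) < num_workers:
        try:
            collected.append(results.get(timeout=0.5))
        except queue.Empty:
            break
    for proc in procs:
        proc.join(timeout=5)
    return sorted(collected)


if __name__ == "__main__":
    for rank, status, detail in run_probe():
        print(f"worker {rank}: {status}: {detail}")
```

If this were run on h200x4 in the plain PyTorch image, an `error` line containing "Error 802" from every worker would reproduce the problem without vLLM. An `ok` line from every worker would instead suggest looking at what vLLM does before or during `WorkerProc` init.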
Details:
- Flavors: h200x8 and h200x4 (both fail)
- Host driver (confirmed via `nvidia-smi` inside h200x4 container): NVIDIA 580.126.09, CUDA 13.0, 4× H200 @ 143771 MiB
- Job IDs:
  - `elenaajayi/69e5aa28ac288e522d8f0179` (h200x8, GLM-4.5-Base, v0.19.1)
  - `elenaajayi/69e5ab1dac288e522d8f017d` (h200x4, Qwen2.5-7B, v0.19.1)
  - `elenaajayi/69e5ac7eac288e522d8f0181` (h200x4, Qwen2.5-7B, v0.19.1, retry)
  - `elenaajayi/69e61257ac288e522d8f0281` (h200x4, Qwen2.5-7B, cu130-nightly)
- Controls:
  - `elenaajayi/69e5a714ac288e522d8f0177` (l4x4, same image, runs clean)
  - `elenaajayi/69e5be88cd8c002f31dffddc` (h200x4, plain PyTorch, nvidia-smi + device_count() succeed)
- Docker images tested: `vllm/vllm-openai:v0.19.1`, `vllm/vllm-openai:cu130-nightly`, `pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel`
- huggingface_hub: 0.26.2
Is the HF infrastructure team aware of this? Is there a timeline for a fix, or an alternative H200 flavor I can try? This is blocking a NeurIPS paper run.