
CUDA Error 802 on every H200 multi-GPU HF Job, across three vLLM images

Hugging Face Forums [Unofficial] April 21, 2026

Every H200 multi-GPU job I launch fails at CUDA initialization, before any model weights load. The error is:


RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized

The failure occurs in vLLM’s multiproc_executor.py at WorkerProc init. I’ve now tested three different vLLM image versions (CUDA 12.x runtime and CUDA 13 runtime) and the error is identical in all three. It is not model-specific, TP-size-specific, or CUDA-runtime-version-specific.

What I’ve confirmed:

| Setup | Result |
|---|---|
| pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel on h200x4, single process (nvidia-smi + torch.cuda.device_count()) | works, returns 4 |
| vllm/vllm-openai:v0.19.1 on l4x4 | works end-to-end |
| vllm/vllm-openai:v0.19.1 on h200x4, Qwen2.5-7B | fails with 802 (twice on retry) |
| vllm/vllm-openai:v0.19.1 on h200x8, GLM-4.5-Base | fails with 802 |
| vllm/vllm-openai:cu130-nightly on h200x4, Qwen2.5-7B | fails with 802 |

Plain single-process PyTorch works on h200x4, and the same vLLM image runs end-to-end on l4x4, yet every vLLM multi-process worker on H200 fails. That suggests the issue is specific to how the CUDA context is initialized inside spawned worker subprocesses on H200 nodes, not to the image or the model. This pattern matches the Fabric Manager / NVSwitch visibility regressions documented in:

- How do I fix a "system not initialized" error on multi-GPU Droplets? | DigitalOcean Documentation
- RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized · Issue #2554 · awslabs/amazon-eks-ami · GitHub
- CUDA initialization failure with error Error 802: system not yet initialized - GPU - Hardware - NVIDIA Developer Forums

HF Jobs users can’t restart Fabric Manager or check whether the Fabric Manager version matches the driver version, so I can’t confirm or fix this from inside a job.
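For anyone who wants to check the single-process versus child-process contrast on their own job, here is a minimal sketch that initializes CUDA in a fresh Python interpreter and reports what happened. It does not replicate vLLM's worker startup; it only tests whether a newly started child process can initialize CUDA at all. The names `PROBE_SOURCE` and `probe_cuda_in_child` are hypothetical, and the call to `torch.cuda.init()` is there on the assumption that it raises on an initialization failure where `device_count()` alone might just return 0.

```python
import json
import os
import subprocess
import sys

# Source run by the child interpreter. It prints exactly one JSON line.
# Assumption: torch.cuda.init() raises if CUDA cannot be initialized,
# whereas device_count() alone might only report 0.
PROBE_SOURCE = r"""
import json
try:
    import torch
    torch.cuda.init()
    print(json.dumps({"ok": True, "count": torch.cuda.device_count()}))
except Exception as exc:
    print(json.dumps({"ok": False, "error": type(exc).__name__ + ": " + str(exc)}))
"""


def probe_cuda_in_child(env_overrides=None, timeout_s=300):
    """Initialize CUDA in a fresh Python process and report the outcome."""
    env = dict(os.environ)
    env.update(env_overrides or {})  # optional per-run environment changes
    try:
        proc = subprocess.run(
            [sys.executable, "-c", PROBE_SOURCE],
            env=env,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "error": f"child timed out after {timeout_s}s"}

    lines = [ln for ln in proc.stdout.splitlines() if ln.strip()]
    if not lines:
        # The child exited without printing, for example after a native crash.
        return {
            "ok": False,
            "error": f"no output, exit code {proc.returncode}",
            "stderr": proc.stderr[-2000:],
        }
    try:
        return json.loads(lines[-1])
    except json.JSONDecodeError:
        return {"ok": False, "error": "unparseable output: " + lines[-1]}
```

Calling `probe_cuda_in_child()` from the job entrypoint returns a dict with either a `count` or an `error` string. If it reports the 802 message on h200x4 while a direct `torch.cuda.device_count()` in the parent succeeds, that would support the subprocess theory; if it succeeds, the trigger is more likely something specific to how vLLM starts its workers.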

Details:

- Flavors: h200x8 and h200x4 (both fail)
- Host driver (confirmed via nvidia-smi inside h200x4 container): NVIDIA 580.126.09, CUDA 13.0, 4× H200 @ 143771 MiB
- Job IDs:
  - elenaajayi/69e5aa28ac288e522d8f0179 (h200x8, GLM-4.5-Base, v0.19.1)
  - elenaajayi/69e5ab1dac288e522d8f017d (h200x4, Qwen2.5-7B, v0.19.1)
  - elenaajayi/69e5ac7eac288e522d8f0181 (h200x4, Qwen2.5-7B, v0.19.1, retry)
  - elenaajayi/69e61257ac288e522d8f0281 (h200x4, Qwen2.5-7B, cu130-nightly)
- Controls:
  - elenaajayi/69e5a714ac288e522d8f0177 (l4x4, same image, runs clean)
  - elenaajayi/69e5be88cd8c002f31dffddc (h200x4, plain PyTorch, nvidia-smi + device_count() succeed)
- Docker images tested: vllm/vllm-openai:v0.19.1, vllm/vllm-openai:cu130-nightly, pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel
- huggingface_hub: 0.26.2
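The driver details above came from nvidia-smi inside the container. In case it helps others compare nodes, this is a rough sketch of how the same fields can be collected from a job script. The helper names (`run_nvidia_smi`, `gpu_summary`, `fabric_lines`) are hypothetical. The `fabric_lines` part assumes the driver prints a "Fabric" section in `nvidia-smi -q` on NVSwitch systems, which may not hold for every driver version, so treat an empty result as "unknown", not as "healthy".

```python
import shutil
import subprocess


def run_nvidia_smi(args, timeout_s=60):
    """Run nvidia-smi with the given arguments; return stdout, or None on failure."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None  # nvidia-smi is not on PATH in this container
    try:
        proc = subprocess.run(
            [exe, *args], capture_output=True, text=True, timeout=timeout_s
        )
    except (subprocess.TimeoutExpired, OSError):
        return None
    return proc.stdout if proc.returncode == 0 else None


def gpu_summary():
    """Return one 'name, driver_version, memory.total' string per visible GPU."""
    out = run_nvidia_smi(
        ["--query-gpu=name,driver_version,memory.total", "--format=csv,noheader"]
    )
    if out is None:
        return []
    return [ln.strip() for ln in out.splitlines() if ln.strip()]


def fabric_lines():
    """Return the 'Fabric' header lines from `nvidia-smi -q`, each with the two lines after it."""
    # Assumption: the section header starts with "Fabric" and the state and
    # status fields follow directly below it.
    out = run_nvidia_smi(["-q"])
    if out is None:
        return []
    lines = out.splitlines()
    picked = []
    for i, ln in enumerate(lines):
        if ln.strip().startswith("Fabric"):
            picked.extend(item.strip() for item in lines[i:i + 3])
    return picked


if __name__ == "__main__":
    for row in gpu_summary():
        print(row)
    for row in fabric_lines():
        print(row)
```

This only reads what the container is allowed to see. It cannot restart Fabric Manager or read its version, so it is a way to attach more evidence to a report, not a workaround.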

Is the HF infrastructure team aware of this? Is there a timeline for a fix, or an alternative H200 flavor I can try? This is blocking a NeurIPS paper run.
