{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreic573skzwuv32rikitb32ayrhxa5s3bkcvntciyfxbequbugwo4je",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mjxtzlkkz3n2"
},
"path": "/t/cuda-error-802-on-every-h200-multi-gpu-hf-job-across-three-vllm-images/175419#post_1",
"publishedAt": "2026-04-21T00:09:40.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"How do I fix a \"system not initialized\" error on multi-GPU Droplets? | DigitalOcean Documentation",
"RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized · Issue #2554 · awslabs/amazon-eks-ami · GitHub",
"CUDA initialization failure with error Error 802: system not yet initialized - GPU - Hardware - NVIDIA Developer Forums"
],
"textContent": "Every H200 multi-GPU job I launch fails at CUDA initialization, before any model weights load. The error is:\n\n```\nRuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized\n```\n\nThe failure occurs in vLLM’s `multiproc_executor.py` at `WorkerProc` init. I’ve now tested three different vLLM image versions (CUDA 12.x runtime and CUDA 13 runtime) and the error is identical in all three. It is not model-specific, TP-size-specific, or CUDA-runtime-version-specific.\n\nWhat I’ve confirmed:\n\n| Setup | Result |\n|---|---|\n| `pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel` on h200x4, single process (`nvidia-smi` + `torch.cuda.device_count()`) | works, returns 4 |\n| `vllm/vllm-openai:v0.19.1` on l4x4 | works end-to-end |\n| `vllm/vllm-openai:v0.19.1` on h200x4, Qwen2.5-7B | fails with 802 (twice, including one retry) |\n| `vllm/vllm-openai:v0.19.1` on h200x8, GLM-4.5-Base | fails with 802 |\n| `vllm/vllm-openai:cu130-nightly` on h200x4, Qwen2.5-7B | fails with 802 |\n\nPlain single-process PyTorch works on the same h200x4 flavor, but every vLLM multi-process worker fails. That suggests the issue is specific to how the CUDA context is initialized inside spawned worker subprocesses on H200 nodes. This pattern matches the Fabric Manager / NVSwitch visibility regressions documented in:\n\n- How do I fix a \"system not initialized\" error on multi-GPU Droplets? | DigitalOcean Documentation\n\n- RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized · Issue #2554 · awslabs/amazon-eks-ami · GitHub\n\n- CUDA initialization failure with error Error 802: system not yet initialized - GPU - Hardware - NVIDIA Developer Forums\n\nHF Jobs users can’t restart Fabric Manager or verify that the Fabric Manager and driver versions match.\n\n**Details:**\n\n- Flavors: h200x8 and h200x4 (both fail)\n\n- Host driver (confirmed via `nvidia-smi` inside h200x4 container): NVIDIA 580.126.09, CUDA 13.0, 4× H200 @ 143771 MiB\n\n- Job IDs:\n\n  - `elenaajayi/69e5aa28ac288e522d8f0179` (h200x8, GLM-4.5-Base, v0.19.1)\n\n  - `elenaajayi/69e5ab1dac288e522d8f017d` (h200x4, Qwen2.5-7B, v0.19.1)\n\n  - `elenaajayi/69e5ac7eac288e522d8f0181` (h200x4, Qwen2.5-7B, v0.19.1, retry)\n\n  - `elenaajayi/69e61257ac288e522d8f0281` (h200x4, Qwen2.5-7B, cu130-nightly)\n\n- Controls:\n\n  - `elenaajayi/69e5a714ac288e522d8f0177` (l4x4, same image, runs clean)\n\n  - `elenaajayi/69e5be88cd8c002f31dffddc` (h200x4, plain PyTorch, `nvidia-smi` + `device_count()` succeed)\n\n- Docker images tested: `vllm/vllm-openai:v0.19.1`, `vllm/vllm-openai:cu130-nightly`, `pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel`\n\n- `huggingface_hub`: 0.26.2\n\nIs the HF infrastructure team aware of this? Is there a timeline for a fix, or an alternative H200 flavor I can try? This is blocking a NeurIPS paper run.",
"title": "CUDA Error 802 on every H200 multi-GPU HF Job, across three vLLM images"
}