Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiceum2h3tjqdzukgcmqhbhvs4gy4yjrjmdf6k4trljuxgwj3rm3f4",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mk2ldqkynnw2"
  },
  "path": "/t/cuda-error-802-on-every-h200-multi-gpu-hf-job-across-three-vllm-images/175419#post_2",
  "publishedAt": "2026-04-22T02:55:02.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "vllm serve / a normal script instead of python -c",
    "vLLM multiprocessing docs",
    "NVIDIA’s CUDA 802 guidance",
    "DigitalOcean’s note on this exact error",
    "NVIDIA AI Enterprise docs",
    "forum thread",
    "GitHub issue",
    "AWS EKS issue"
  ],
  "textContent": "Seems like a platform-side issue? An LLM suggested:\n\n* * *\n\nThis looks less like a pure vLLM bug and more like an H200 multi-GPU / NVSwitch / Fabric Manager issue on the HF side.\n\nIf I were debugging it, I’d try three things first:\n\n  * check whether single-GPU on H200 works (`CUDA_VISIBLE_DEVICES=0`, `tensor_parallel_size=1`);\n  * try `VLLM_WORKER_MULTIPROC_METHOD=spawn`, or use `vllm serve` / a normal script instead of `python -c`, since the vLLM multiprocessing docs explain that the startup path differs there;\n  * check `nvidia-smi -q | grep -A 2 Fabric`, since both NVIDIA’s CUDA 802 guidance and DigitalOcean’s note on this exact error point at fabric / Fabric Manager on NVSwitch systems.\n\nIf single-GPU works but multi-GPU fails, or the Fabric state looks wrong, this probably isn’t something a user can fix from inside the job. The NVIDIA AI Enterprise docs say Fabric Manager is required for HGX 1/2/4/8-GPU VMs, and on H100/H200 shared NVSwitch setups that management lives on the host / service-VM side. That makes it sound much more like an HF infra issue than an application issue.\n\nSo I’d keep both the forum thread and the GitHub issue updated with:\n\n  * job ID\n  * image + flavor\n  * what works vs what fails\n  * result of the single-GPU test\n  * Fabric output\n  * whether `spawn` changes anything\n\nThere’s also a somewhat similar AWS EKS issue where vLLM hit the same CUDA 802 path, and it ended up looking node / AMI-side rather than model-side.",
  "title": "CUDA Error 802 on every H200 multi-GPU HF Job, across three vLLM images"
}
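
The Fabric check suggested in the post could be scripted so that its output is easy to paste into the forum thread or GitHub issue. The following is only a minimal sketch: `parse_fabric_section` and `fabric_report` are hypothetical helper names, and the parser assumes that `nvidia-smi -q` prints an indented `Fabric` heading followed by `Key : Value` lines, which may differ between driver versions.

```python
import shutil
import subprocess


def parse_fabric_section(report: str) -> dict:
    """Collect 'Key : Value' pairs found under every 'Fabric' heading.

    Assumes the indented layout of `nvidia-smi -q`; each key maps to a
    list with one entry per GPU that reported it.
    """
    fields = {}
    fabric_indent = None  # indentation of the current 'Fabric' heading, if any
    for line in report.splitlines():
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip())
        if line.strip() == "Fabric":
            fabric_indent = indent
            continue
        if fabric_indent is None:
            continue
        if indent <= fabric_indent:
            fabric_indent = None  # back at the heading's level: section ended
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields.setdefault(key.strip(), []).append(value.strip())
    return fields


def fabric_report() -> dict:
    """Run `nvidia-smi -q` if it is on PATH and parse its Fabric sections."""
    if shutil.which("nvidia-smi") is None:
        return {}
    result = subprocess.run(
        ["nvidia-smi", "-q"], capture_output=True, text=True, check=False
    )
    return parse_fabric_section(result.stdout)


if __name__ == "__main__":
    # An empty dict means nvidia-smi was missing or printed no Fabric section.
    print(fabric_report())
```

Running this once inside the failing multi-GPU job and once inside the single-GPU test would give two Fabric reports to attach alongside the job ID and image.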