Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreig2evkt2zlhnqoyetxgv5ivqfhpveqhpumr4xa4sktelyzrnhmu6e",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mj2vcnw5lo42"
  },
  "path": "/t/503-service-unavailable-hitting-multiple-major-image-to-3d-spaces-triposr-instantmesh-lgm-via-gradio-client/175118#post_5",
  "publishedAt": "2026-04-09T12:50:36.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Hugging Face",
    "pip"
  ],
  "textContent": "Based on the issues I identified while modifying my own Spaces, the following are the minimum necessary concrete fixes:\n\n* * *\n\nAll four currently show `Runtime error`, but they do **not** fail for one shared code reason. What happened is closer to this: a platform-side event likely forced cold starts, and those cold starts exposed four different latent startup problems. So the right repair strategy is **not** “apply one global workaround,” but “make each repo boot cleanly in the current Spaces environment with the smallest justified diff.” The four Spaces fall into three buckets: **missing `onnxruntime` after `rembg` import** for TripoSR and InstantMesh, a **dead upstream repo id** for CRM, and a **native CUDA runtime mismatch** for LGM. HF’s current Spaces config still supports explicit `python_version` pinning, and current ZeroGPU docs list `3.10.13` and `3.12.12` as supported Python versions, so version pinning is still part of the stabilization story. (Hugging Face)\n\nThe key principle is this: **even if the trigger was a restart, unpause, rebuild, scheduler issue, or temporary API problem, these fixes are still needed** because each current repo has a deterministic startup failure in its own code path. In other words, even if the platform caused the failure to become visible, the repo still has to be made bootable. That is why I would keep the patches minimal and specific instead of doing large framework upgrades first. pip has become stricter over time, but none of the four currently exposed failures is primarily a “requirements syntax” bug. They are startup dependency and binary/runtime mismatches. (pip)\n\n## What I would change first, in order\n\n  1. **TripoSR**: add `onnxruntime`.\n  2. **InstantMesh**: add `onnxruntime`, and pin `python_version: 3.10.13` in README.\n  3. **CRM**: replace the dead `stabilityai/stable-diffusion-2-1-base` scheduler source with `sd2-community/stable-diffusion-2-1-base`.\n  4. 
**LGM**: add `nvidia-cuda-runtime-cu11`, then preload `libcudart.so.11.0` before importing the compiled extension. This is the smallest targeted fix for the current public crash, but LGM is the only one where I would keep an explicit fallback plan in mind if the first patch is not enough. (Hugging Face)\n\n* * *\n\n## 1) TripoSR\n\n### Why it crashes now\n\nThe current app imports `rembg` at module import time, and the current `requirements.txt` includes bare `rembg` but not `onnxruntime`. The public runtime traceback for this Space shows exactly that failure path: `import rembg` → `import onnxruntime as ort` → `ModuleNotFoundError: No module named 'onnxruntime'`. The README already pins `python_version: 3.10.13`, so Python drift is **not** the first thing to fix here. (Hugging Face)\n\n### Smallest patch\n\n`requirements.txt`\n\n\n     omegaconf==2.3.0\n     Pillow==10.1.0\n     einops==0.7.0\n     transformers==4.35.0\n     trimesh==4.0.5\n     rembg\n    +onnxruntime\n     huggingface-hub\n     gradio\n\n\n### Why this patch is needed even if the trigger was external\n\nBecause the failure is deterministic at startup. The current repo imports `rembg` at module load, before the app can finish starting, and the public crash shows that the installed environment does not contain `onnxruntime`. A platform restart may have exposed it, but a clean cold start will keep hitting the same line until `onnxruntime` is present. This is why I would **not** start by upgrading Gradio or Torch here. The smallest repair is to add the missing package that the current code path actually imports. (Hugging Face)\n\n### Why I am **not** making a bigger first patch\n\nYou could switch to a newer `rembg` extra layout, but that is not the smallest safe move for this repo. The exposed failure is not “wrong Gradio API,” not “wrong Torch version,” and not “wrong Python version.” It is specifically “`onnxruntime` is missing.” So the one-line fix above is the cleanest first pass. 
(Hugging Face)\n\n* * *\n\n## 2) InstantMesh\n\n### Why it crashes now\n\nThis Space has the same primary failure as TripoSR. `app.py` imports `rembg`, and the preprocessing path creates a `rembg` session. `requirements.txt` still lists bare `rembg`, and the public runtime traceback again shows `ModuleNotFoundError: No module named 'onnxruntime'`. Unlike TripoSR, its README metadata does **not** currently specify `python_version`, even though HF supports pinning it in README YAML. (Hugging Face)\n\n### Smallest patch\n\n`README.md`\n\n\n     title: InstantMesh\n     emoji:\n     colorFrom: indigo\n     colorTo: green\n     sdk: gradio\n     sdk_version: 4.26.0\n    +python_version: 3.10.13\n     app_file: app.py\n     pinned: false\n     short_description: Create a 3D model from an image in 10 seconds!\n     license: apache-2.0\n\n\n`requirements.txt`\n\n\n     torch==2.1.0\n     torchvision==0.16.0\n     torchaudio==2.1.0\n     pytorch-lightning==2.1.2\n     einops\n     omegaconf\n     deepspeed\n     torchmetrics\n     webdataset\n     accelerate\n     tensorboard\n     PyMCubes\n     trimesh\n     rembg\n    +onnxruntime\n     transformers==4.34.1\n     diffusers==0.19.3\n     bitsandbytes\n     imageio[ffmpeg]\n     xatlas\n     plyfile\n     xformers==0.0.22.post7\n     git+https://github.com/NVlabs/nvdiffrast/\n     huggingface-hub\n\n\n### Why this patch is needed even if the trigger was external\n\nAgain, because the current startup path already contains the failure. The app imports `rembg` before the UI is ready, and the publicly reported runtime failure is the missing `onnxruntime` import. The Python pin is a separate hardening step: HF lets Spaces pin `python_version`, and current ZeroGPU docs explicitly list `3.10.13` as supported. Even if the platform restart is what made the breakage visible, keeping Python fixed removes one more moving part from future cold starts. 
(Hugging Face)\n\n### What I would **not** do first\n\nI would not begin by mass-upgrading the whole dependency stack. There is already a community PR that proposes a larger cleanup including `numpy<2.0.0`, `Pillow==10.4.0`, newer `gradio`, and simplified requirements. That may be useful later, but the smallest justified first repair is still “add `onnxruntime` and pin Python.” (Hugging Face)\n\n### If the first patch boots but then fails later\n\nThe next smallest hardening step is:\n\n\n    +numpy<2.0.0\n    +Pillow==10.4.0\n\n\nI would only do that **after** confirming that the startup blocker moved past `rembg`/`onnxruntime`. The reason is simple: fix the deterministic boot failure first, then deal with second-order runtime drift. (Hugging Face)\n\n* * *\n\n## 3) CRM\n\n### Why it crashes now\n\nThe current public runtime error is very specific. The Space tries to build a `DDIMScheduler` from `stabilityai/stable-diffusion-2-1-base`, and that repo id no longer resolves publicly for the needed scheduler config. The crash trace points into `model/crm/model.py` at the scheduler initialization. At the same time, `app.py` defaults `--device` to `\"cuda\"` and moves the model there during startup, which is an additional fragility point once the scheduler problem is fixed. (Hugging Face)\n\n### Smallest patch\n\n`model/crm/model.py`\n\n\n    -self.scheduler = DDIMScheduler.from_pretrained(\n    -    \"stabilityai/stable-diffusion-2-1-base\",\n    -    subfolder=\"scheduler\",\n    -)\n    +self.scheduler = DDIMScheduler.from_pretrained(\n    +    \"sd2-community/stable-diffusion-2-1-base\",\n    +    subfolder=\"scheduler\",\n    +)\n\n\n### Why this patch is needed even if the trigger was external\n\nBecause the current repo points at a model id that no longer works for this code path, and the public crash trace shows exactly that path failing. 
`sd2-community/stable-diffusion-2-1-base` exists, and its repo contains `scheduler/scheduler_config.json`, which is the file CRM is trying to load. So this is not a speculative change. It is a direct one-line replacement for the dead dependency that the current startup path is trying to read. Even if a platform restart is what surfaced the error, any future cold start will keep failing until the repo id is replaced. (Hugging Face)\n\n### Optional but very cheap second line\n\n`app.py`\n\n\n    -parser.add_argument(\"--device\", type=str, default=\"cuda\")\n    +parser.add_argument(\"--device\", type=str, default=\"cuda\" if torch.cuda.is_available() else \"cpu\")\n\n\n### Why I would add that second line\n\nThe scheduler fix is the primary repair. But once the app gets past that point, startup still does `model = model.to(args.device)` and passes `device=args.device` into the pipeline constructor. Right now that default is hard-coded to `\"cuda\"`. So if the Space is restarted on a CPU-backed environment, or on a GPU path that is temporarily unavailable, the next boot can fail later in startup. That one-line default makes the app more robust without changing its interface or behavior when CUDA is actually available. (Hugging Face)\n\n### What I would **not** do first\n\nI would not start by adding tokens or auth logic. The current public problem is not “this repo is gated but otherwise correct.” The practical issue is that the code points at a repo id that no longer works for the scheduler path, and a community mirror already exposes the file CRM needs. So the smallest valid fix is to swap the source, not to add authentication plumbing. (Hugging Face)\n\n* * *\n\n## 4) LGM\n\n### Why it crashes now\n\nLGM is the outlier. The public runtime error is not a missing Python dependency. 
It is a compiled-extension failure: the Space downloads its checkpoint, installs a local wheel named `diff_gaussian_rasterization-0.0.0-cp310-cp310-linux_x86_64.whl`, and then crashes importing that extension because `libcudart.so.11.0` is missing. The README already pins `python_version: 3.10.13`, so Python drift is **not** the first issue here. The current code also initializes most of the heavy model stack at startup, not lazily. (Hugging Face)\n\n### Smallest targeted patch\n\n`requirements.txt`\n\n\n     torch==2.4.0\n     xformers\n     numpy\n     tyro\n     diffusers\n     dearpygui\n     einops\n     accelerate\n     gradio\n     imageio\n     imageio-ffmpeg\n     lpips\n     matplotlib\n     packaging\n     Pillow\n     pygltflib\n     rembg[gpu,cli]\n    +nvidia-cuda-runtime-cu11\n     rich\n     safetensors\n     scikit-image\n     scikit-learn\n     scipy\n     tqdm\n     transformers\n     trimesh\n     kiui >= 0.2.3\n     xatlas\n     roma\n     plyfile\n\n\n`app.py`\n\nAdd this **before** `from core.models import LGM`:\n\n\n    +import ctypes\n    +import os  # needed for os.path below; harmless if app.py already imports it\n    +import site\n    +\n    +# Preload the CUDA 11 runtime so the compiled rasterizer can resolve libcudart.so.11.0.\n    +for sp in site.getsitepackages():\n    +    cudart = os.path.join(sp, \"nvidia\", \"cuda_runtime\", \"lib\", \"libcudart.so.11.0\")\n    +    if os.path.exists(cudart):\n    +        ctypes.CDLL(cudart)\n    +        break\n\n\n### Why this patch is needed even if the trigger was external\n\nBecause the current public crash is already precise: the installed compiled extension cannot find `libcudart.so.11.0`. NVIDIA publishes `nvidia-cuda-runtime-cu11` on PyPI as “CUDA Runtime native Libraries,” and this patch preloads the exact library the extension says it is missing before the extension import happens. That is the smallest repo-side change that directly matches the currently exposed failure. A platform-side restart may have exposed it, but once the process restarts, the same binary import will keep failing until the CUDA runtime library problem is addressed. 
(Hugging Face)\n\n### Important honesty note\n\nThis is the only one of the four where I would **not** promise the first patch is enough. It is the smallest **targeted** fix for the current public error, but native wheels can fail for more than one reason. If the wheel was built against a runtime/ABI combination that still does not match the current Spaces environment, then the next repair is no longer a one-liner. At that point, the smallest _real_ fix becomes either:\n\n  * rebuild that extension for the current runtime, or\n  * move the Space to Docker so CUDA and the extension are under your control.\n\nHF’s current ZeroGPU docs also make clear that ZeroGPU is its own environment with H200-backed shared GPU slices and specific supported versions, so binary assumptions that worked on an older setup can stop being valid after a cold restart. (Hugging Face)\n\n### What I would **not** do first\n\nI would not start by upgrading Gradio, Torch, or the whole app stack just to chase this one error. The current public failure happens before any of that becomes the main issue: it dies when the compiled rasterizer tries to load `_C` and cannot find `libcudart.so.11.0`. Solve the explicit binary import error first. Then, if it boots and another error appears, fix that next one. (Hugging Face)\n\n* * *\n\n## A compact “do this now” version\n\nIf I were patching these repos in the smallest reasonable way, I would do exactly this:\n\n### TripoSR\n\n\n    + onnxruntime\n\n\n### InstantMesh\n\n\n    README.md:\n    + python_version: 3.10.13\n\n    requirements.txt:\n    + onnxruntime\n\n\n### CRM\n\n\n    - \"stabilityai/stable-diffusion-2-1-base\"\n    + \"sd2-community/stable-diffusion-2-1-base\"\n\n\nOptional second line:\n\n\n    - default=\"cuda\"\n    + default=\"cuda\" if torch.cuda.is_available() else \"cpu\"\n\n\n### LGM\n\n\n    requirements.txt:\n    + nvidia-cuda-runtime-cu11\n\n\nand preload `libcudart.so.11.0` before importing `core.models`. 
(Hugging Face)\n\n## Why I think these are the right first patches\n\nBecause they match the **actual currently exposed startup failures**, not a guessed historical failure, and because they keep the diffs small:\n\n  * TripoSR: missing Python package.\n  * InstantMesh: same missing package, plus missing Python pin.\n  * CRM: dead external repo id.\n  * LGM: missing CUDA runtime for a compiled extension. (Hugging Face)\n\n",
  "title": "503 Service Unavailable hitting multiple major Image-to-3D Spaces (TripoSR, InstantMesh, LGM) via Gradio Client"
}