Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreidzqgieyvzlevxcpcym3vd76kxlptxwkae3lyryrfmykyd2ndz3ry",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mig7s3iaolr2"
  },
  "path": "/t/inference-endpoint-hanging-in-initializing/174847#post_2",
  "publishedAt": "2026-04-01T06:09:16.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Hugging Face",
    "Hugging Face Forums",
    "Hugging Face Status",
    "GitHub"
  ],
  "textContent": "If it were purely due to the repository, the frequency of occurrences wouldn’t have suddenly increased recently…\nGiven that there have been several reports of failures related to “scale-from-zero wake-up,” a bug on the HF side is also a possibility.\n\n* * *\n\nMy view is:\n\n**Most likely:** a problem in the **scale-from-zero wake-up path** or the **platform scheduler / orchestration layer**.\n**Second most likely:** a **custom-container readiness or port-alignment problem**.\n**Less likely:** a **dependency or image drift problem**.\n**Least likely:** a **recent breaking spec change**. (Hugging Face)\n\n## The background\n\nWhen a Hugging Face Inference Endpoint starts, several things must succeed in order:\n\n  1. the platform must allocate the instance,\n  2. pull and start the container,\n  3. mount the model at `/repository`,\n  4. wait for the app to become healthy,\n  5. then mark the endpoint as ready. (Hugging Face)\n\n\n\nFor custom containers, Hugging Face says the platform probes `/health` every second, and that route should return `503` until the model is actually ready. Hugging Face also says that if the logs show the app is running but the endpoint still says `Initializing`, the usual cause is **incorrect port mapping**. (Hugging Face)\n\nThat matters because `Initializing` is not one bug. It is a **phase**. Failures can happen before the model server starts, while it starts, or after it starts but before readiness is accepted. (Hugging Face)\n\n## Why I think wake-from-zero is the top suspect\n\nHugging Face’s autoscaling guide says scale-to-zero introduces a **cold start** , that the proxy can return **`503`** while a new replica initializes, and that waking from 0 can take **a few minutes** , which is why request-driven wake-up is “typically not recommended” for applications that need responsiveness. They also provide `X-Scale-Up-Timeout` specifically for this path. 
(Hugging Face)\n\nThere are also public reports of this exact class of failure:\n\n  * a scaled-to-zero endpoint that stopped waking on HTTP request and did **not** return the documented `503`, which Hugging Face staff said they would investigate, (Hugging Face Forums)\n  * and another case where scale-to-zero led to **500 Internal Server Error**, the replica did not scale back up automatically, and users worked around it by sending a probe request and waiting before real traffic. (Hugging Face Forums)\n\n\n\nSo if your symptom is “works normally once it is up, but sometimes gets stuck during bring-up, especially after idling,” the best first explanation is **resume-path instability**, not “the model code itself is broken.” That is an inference from the documented cold-start behavior plus the similar public cases. (Hugging Face)\n\n## Why platform scheduling or infra is also very plausible\n\nHugging Face forum history shows endpoint startup failures caused by **hardware capacity** and regional platform issues. In one public case, the error was `Scheduling failure: not enough hardware capacity`, and HF staff replied there had been a **minor issue in eu-west-1**. (Hugging Face Forums)\n\nThat matters because startup can fail **before** your application is really serving. When that happens, the endpoint can sit in initialization without much useful application-level evidence. That last sentence is an inference, but it follows from HF’s documented startup stages and from the fact that capacity failures are public, real, and external to user code. (Hugging Face)\n\nRight now, HF’s public status page shows **Inference Endpoints**, **Inference Endpoints UI**, and **Inference Endpoints API** as **Operational** on April 1, 2026. That makes a broad, public, service-wide outage less likely at this moment. It does **not** rule out a region-specific, GPU-class-specific, or quota-specific problem. 
(Hugging Face Status)\n\n## Why custom-container config is the main user-side suspect\n\nHugging Face’s FAQ is very direct:\n\n  * if the app is running in logs but the endpoint is still `Initializing`, the usual cause is **port mapping mismatch**, and\n  * if you get **500s at deployment start or during scaling**, you should make sure the app has a health route that returns **200 only when it is truly ready**. (Hugging Face)\n\n\n\nThat means there are two classic custom-container mistakes:\n\n### 1. Port mismatch\n\nThe app listens on one port, but the endpoint config expects another. HF says the default expectation is **port 80**, unless you explicitly change it and keep the values aligned. (Hugging Face)\n\n### 2. Readiness too early\n\nThe container process starts, so the platform thinks it is ready, but the model is still loading. HF’s custom-container docs explicitly show the intended pattern: `/health` should return `503` until the model and tokenizer are fully initialized. (Hugging Face)\n\n## How this maps to `hommayushi3/vllm-huggingface`\n\nThe wrapper itself is simple:\n\n  * its Dockerfile pins `FROM vllm/vllm-openai:v0.6.6.post1`, (GitHub)\n  * the entrypoint uses `MODEL_PATH=/repository`, which matches HF’s documented mount point, (GitHub)\n  * and it launches `vllm serve` on **host `0.0.0.0` and port `80`**. (GitHub)\n\n\n\nSo the wrapper is **not obviously wrong on the basic HF contract**. It serves from `/repository`, and it uses port 80, which matches HF’s default expectation. (GitHub)\n\nBut I still see three wrapper-related risks:\n\n### A. Old base image\n\nIt is pinned to `v0.6.6.post1`, which is an old vLLM image. Old does not mean broken, but it means you inherit older startup behavior and older bugs. (GitHub)\n\n### B. Readiness depends on vLLM’s startup behavior\n\nThis wrapper does not add its own richer readiness logic. It mainly shells into `vllm serve`. 
That can make the container more sensitive to timing around startup and health checks. This is an inference from the entrypoint design plus HF’s readiness rules. (Hugging Face)\n\n### C. One config variable looks weakly wired\n\nThe script sets `VLLM_ATTENTION_BACKEND`, but the shown `vllm serve` command does not include that variable as a command-line option, and the snippet does not show it being exported before command execution. That suggests the setting may not actually affect the launched process. This is a code-level inference from the entrypoint script, not a confirmed public bug report. (GitHub)\n\n## What I think is happening in plain English\n\nThe likely story is this:\n\n  * the endpoint goes idle,\n  * a new request arrives,\n  * HF tries to wake the deployment,\n  * sometimes that wake-up path stalls at the platform or readiness boundary,\n  * a retry or restart causes the whole sequence to be attempted again,\n  * and then it succeeds. (Hugging Face)\n\n\n\nThat fits better than “the model image is permanently broken,” because a permanently broken image usually fails the same way every time. The repeated-success-after-retry pattern points more toward **intermittent orchestration / cold-start / scheduling behavior**. This is an inference, but it is the one most consistent with the docs and similar public cases. (Hugging Face)\n\n## My ranking for your case\n\n### 1. Scale-from-zero wake-up problem\n\nBest fit. HF documents the cold-start path, and there are similar reports where the endpoint did not wake correctly from 0. (Hugging Face)\n\n### 2. Platform scheduling / capacity problem\n\nAlso a strong fit. HF has public cases of startup failures caused by unavailable hardware or regional issues. (Hugging Face Forums)\n\n### 3. Custom-container readiness or port problem\n\nReal possibility, especially if the endpoint config and container config are not perfectly aligned. HF explicitly calls this out. (Hugging Face)\n\n### 4. 
Dependency or image drift\n\nPossible, but weaker. Nothing in the public docs or the wrapper repo strongly points to a new breaking change here. The wrapper image is pinned to an old base rather than obviously changing underneath you. (GitHub)\n\n### 5. Breaking HF spec change\n\nLeast likely. Current HF docs still describe the same basic behavior and requirements. (Hugging Face)\n\n## The fixes I would try, in order\n\n### 1. Turn off scale-to-zero temporarily\n\nSet **min replicas = 1** for a while. If the problem disappears, that is strong evidence that the bug is in the **wake-from-zero path**. HF’s docs say the endpoint stays available with the configured minimum replicas, and their FAQ recommends **at least 2 replicas** when high availability matters. (Hugging Face)\n\n### 2. Use `X-Scale-Up-Timeout`\n\nIf you keep scale-to-zero, add `X-Scale-Up-Timeout`, for example `600`, so the proxy holds the request while the replica wakes. HF documents this specifically for scale-up from zero. (Hugging Face)\n\n### 3. Verify port and health settings end to end\n\nCheck that:\n\n  * the endpoint config expects the same port the container serves,\n  * the container actually exposes that port,\n  * the configured health route is correct,\n  * and readiness only goes green when the model is truly loaded. (Hugging Face)\n\n\n\nFor this wrapper specifically, the server command uses **port 80**, so your HF endpoint config should match that unless you changed the image or command. (GitHub)\n\n### 4. Treat the image as immutable\n\nDo not rely on an unpinned deployment reference. Use a specific image tag or digest. The wrapper repo’s Dockerfile is pinned to a base image version, but your deployment should also pin the outer image reference you use. That makes failures reproducible. (GitHub)\n\n### 5. 
Try a control deployment\n\nDeploy either:\n\n  * the same image in another region or instance class, or\n  * a simpler known-good endpoint in the same region and GPU class.\n\n\n\nIf the simpler control also hangs on bring-up, that argues for **platform-side scheduling or availability** rather than your app. This is an inference, but it is the cleanest operational test. The public capacity issue reports are the reason this is worth doing. (Hugging Face Forums)\n\n## A simple decision rule\n\nUse this:\n\n  * **Only breaks when waking from zero**\n→ suspect **scale-to-zero / resume path** first. (Hugging Face)\n\n  * **Breaks on every fresh deployment and every warm restart**\n→ suspect **port mapping / health route / container boot** first. (Hugging Face)\n\n  * **Shows explicit scheduling or capacity messages**\n→ suspect **HF infra / region / hardware availability** first. (Hugging Face Forums)\n\n\n\n\n## Final answer\n\nMy best diagnosis is:\n\n**This is probably an intermittent cold-start orchestration issue, made more visible by scale-to-zero, with custom-container readiness as the main secondary cause.**\n\nSo the first thing I would do is **disable scale-to-zero**. The second is **verify port 80 and `/health` behavior**. The third is **try a control deployment in another region or GPU class**. Those three steps separate platform problems from container problems quickly, and they line up with HF’s documented behavior and the closest public failure reports. (Hugging Face)",
  "title": "Inference Endpoint Hanging in \"Initializing\""
}