{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreigmra6kswuy5se3x5scyjbtotcfxg4nahwgtw6ghyo3jvf7r3erou",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3miedehq4jsx2"
  },
  "path": "/t/why-starting-all-the-time-and-get-kill-in-30min/174824#post_3",
  "publishedAt": "2026-03-31T10:06:15.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Hugging Face",
    "AIOHTTP"
  ],
  "textContent": "Just in case, here are some parts of the Space’s code that might be problematic:\n\n* * *\n\nThere are real code and deployment problems here.\n\nThe key point is this:\n\n**Your `404` log lines are not evidence that the Space itself is healthy.** In `gorgeous.py`, those `404`s come from your own outbound polling to the remote ModelScope endpoints `imageGet`, `firstLastGet`, and `videoGet`. Your code prints each response status, and on `404` it just sleeps 60 seconds and tries again. That means “the remote job is not ready yet,” not “Hugging Face accepted the Space as Running.” (Hugging Face)\n\nAlso, the exact **30 minute kill** matches Hugging Face’s default startup health timeout. For Docker Spaces, `app_port` defaults to `7860`, and `startup_duration_timeout` defaults to **30 minutes** unless you set it in the README metadata. (Hugging Face)\n\n## What is happening\n\nYour code starts an aiohttp server on port `7860`, then immediately enters a long remote-processing pipeline. On paper, that should be enough. But if anything after `await site.start()` fails, your top-level `except:` catches it, writes a traceback file, uploads it, and then goes into an infinite sleep. That can leave the container process alive while the actual web app is no longer healthy, which is a good fit for “keeps Starting, then gets killed after 30 minutes.” (Hugging Face)\n\n## Causes\n\n### 1. The broad `except:` can hide a real crash and leave the container half-dead\n\nAt the bottom of `gorgeous.py`, you run:\n\n\n    try:\n        uvloop.run(main())\n    except:\n        ...write traceback...\n        ...upload file...\n        time.sleep(math.inf)\n\n\nSo if `main()` fails at any point, the process does **not** fail fast. It goes into an infinite sleep instead. That is one of the strongest explanations for “logs look active, but the Space never becomes Running.” (Hugging Face)\n\n### 2. 
Secret handling is fragile\n\nYou build the authorization header like this:\n\n\n    'Bearer ' + os.getenv('modelscope')\n\n\nIf the `modelscope` secret is missing, `os.getenv` returns `None` and that expression raises a `TypeError` immediately, because Python cannot concatenate a string and `None`. Later, the exception path also tries to upload with `os.getenv('huggingface')`. Hugging Face’s Docker docs say runtime secrets are injected as environment variables, so this code path depends completely on both secrets being present and valid. (Hugging Face)\n\n### 3. Your Dockerfile does not follow Hugging Face’s recommended Docker permissions setup\n\nYour Dockerfile uses `FROM ubuntu`, sets `WORKDIR /home/ubuntu`, and copies files without `--chown`. Your code writes `output.mp4` and `gorgeous.txt` into that working directory. Hugging Face’s Docker docs say the container runs with **user ID 1000** and recommend creating that user, switching to it, setting the workdir there, and using `COPY --chown=user` to avoid permission issues. (Hugging Face)\n\n### 4. Your README metadata is incomplete\n\nYour README only sets:\n\n  * `title`\n  * `emoji`\n  * `colorFrom`\n  * `colorTo`\n  * `sdk: docker`\n  * `pinned: false`\n\n\n\nIt does **not** set `app_port` or `startup_duration_timeout`. Missing `app_port` is not automatically fatal here, because the documented Docker default is `7860` and your code also uses `7860`. But missing `startup_duration_timeout` is why the failure cuts off at **30 minutes**. (Hugging Face)\n\n### 5. The web route is fragile\n\nYou serve the root path with:\n\n\n    app.add_routes([aiohttp.web.static('/', ..., show_index=True)])\n\n\nThat means your root is a static directory handler, not a normal health endpoint. aiohttp’s own docs say `add_static()` is for **development only**, not production. This may still work, but it is a weak choice for a Space that needs a simple, reliable HTTP response as soon as it boots. (Hugging Face)\n\n### 6. 
`numpy` is imported directly but not installed directly\n\n`gorgeous.py` imports `numpy`, but the Dockerfile only installs `huggingface_hub` and `modelscope` with pip. That means you are relying on a transitive dependency to provide `numpy`. Since your current logs show the script starts, `numpy` is probably arriving indirectly right now. But it is still a packaging bug waiting to break on a future rebuild. (Hugging Face)\n\n## What is probably **not** the main problem\n\nThe bind host is probably not the issue. Your code uses `TCPSite(runner, port=7860)` without a host, and aiohttp documents that `host=None` means **all interfaces**. So this is likely fine. (Hugging Face)\n\n## Best explanation in plain terms\n\nThe most likely sequence is:\n\n  1. The container starts.\n  2. Your server begins listening on `7860`.\n  3. Your worker logic starts polling remote endpoints and prints `404`.\n  4. Somewhere after startup, an exception or unhealthy state occurs.\n  5. Your `except:` block prevents a clean crash and instead sleeps forever.\n  6. Hugging Face never sees the Space become healthy enough within the startup window.\n  7. At 30 minutes, the Space is marked unhealthy and killed. (Hugging Face)\n\n\n\nThat is why the logs can look “correct” and the Space can still stay in `Starting`.\n\n## Fixes\n\n### Fix 1. Add explicit README metadata\n\nUse this at the top of `README.md`:\n\n\n    ---\n    title: Gorgeous\n    sdk: docker\n    app_port: 7860\n    startup_duration_timeout: 1h\n    ---\n\n\nThis makes the port explicit and raises the startup ceiling above the default 30 minutes. (Hugging Face)\n\n### Fix 2. Replace the static root with a real health endpoint\n\nUse a simple route like:\n\n\n    from aiohttp import web\n\n    async def index(_):\n        return web.Response(text=\"ok\")\n\n    app = web.Application()\n    app.router.add_get(\"/\", index)\n\n\nThat is much safer than using `static('/')` as the root response. 
aiohttp’s docs explicitly warn against `add_static()` as a production serving strategy. (AIOHTTP)\n\n### Fix 3. Fail fast instead of sleeping forever after errors\n\nChange this:\n\n\n    except:\n        ...\n        time.sleep(math.inf)\n\n\nto this:\n\n\n    except Exception:\n        pathlib.Path(\"gorgeous.txt\").write_text(traceback.format_exc())\n        raise\n\n\nThat way, the container stops clearly and the logs show the real failure. Right now, your exception handler can mask the real bug. (Hugging Face)\n\n### Fix 4. Validate required secrets before doing any network work\n\nDo this near startup:\n\n\n    ms_token = os.environ.get(\"modelscope\")\n    hf_token = os.environ.get(\"huggingface\")\n\n    if not ms_token:\n        raise RuntimeError(\"Missing modelscope secret\")\n    if not hf_token:\n        raise RuntimeError(\"Missing huggingface secret\")\n\n\nThat turns a vague failure into an immediate, readable one. Hugging Face’s Docker docs confirm runtime secrets should be read from environment variables. (Hugging Face)\n\n### Fix 5. Follow Hugging Face’s Docker permission pattern\n\nA safer Dockerfile shape is:\n\n\n    FROM python:3.11-slim\n\n    RUN useradd -m -u 1000 user\n    USER user\n    ENV HOME=/home/user PATH=/home/user/.local/bin:$PATH\n    WORKDIR $HOME/app\n\n    COPY --chown=user . $HOME/app\n\n    RUN pip install --no-cache-dir aiohttp uvloop pillow av numpy huggingface_hub modelscope\n\n    ENTRYPOINT [\"python\", \"-u\", \"gorgeous.py\"]\n\n\nThis matches Hugging Face’s documented UID 1000 model and avoids the file ownership trap. (Hugging Face)\n\n### Fix 6. Separate “web server” from “long worker”\n\nBest pattern:\n\n  * start a tiny web app immediately\n  * return `200 OK` on `/`\n  * launch the long ModelScope polling in a background task\n\n\n\nThat makes the Space healthy first, then starts the heavy work. Right now, both responsibilities are mixed together in one startup path. 
(Hugging Face)\n\n## Priority order\n\nDo these first:\n\n  1. add `startup_duration_timeout: 1h`\n  2. add explicit `app_port: 7860`\n  3. replace `static('/')` with `GET / -> 200 OK`\n  4. remove the infinite sleep from the exception path\n  5. validate secrets at startup\n  6. fix Docker permissions and install `numpy` explicitly\n\n\n\n## Bottom line\n\nThe `404` lines in your screenshot are **not** the main bug. They are expected from your remote polling loop. The real problem is that your Space has several ways to become unhealthy while still looking alive, especially because of the broad `except:` plus infinite sleep, the fragile secret handling, the weak root route, and the missing startup timeout override. (Hugging Face)",
  "title": "Why starting all the time and get kill in 30min"
}