Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiaxs7tawztasu5jpvt7fq6gufnb3ahbz74asc4vyzxrhlw7ecphju",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mjk7hfgnvxo2"
  },
  "path": "/t/dedicated-cpu-inference-endpoint-returns-empty-http-500-after-80s-is-there-a-configurable-request-timeout/175278#post_1",
  "publishedAt": "2026-04-15T14:56:20.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "* * *\n\n**Environment**\n\n  * Product: Dedicated private Inference Endpoint (CPU, not serverless)\n  * Region: eu-west-1\n  * Framework: custom EndpointHandler (Python, SimpleITK)\n  * Client: httpx with a 600s timeout\n\n\n\n* * *\n\n**Problem**\n\nRequests to our dedicated CPU endpoint occasionally return an HTTP 500 with a completely empty response body. This only happens for computationally expensive requests; lighter requests on the same endpoint complete successfully.\n\nOur handler wraps all processing in a try/except that returns a structured JSON error on any Python exception. An empty 500 means the container process was killed before Python could write a response.\n\n* * *\n\n**What we investigated**\n\nStep 1 - ruled out OOM\nWe added per-step INFO logging inside the handler. The HF application logs show the container reaching the expensive computation step, then going silent. Monitoring the endpoint metrics shows CPU spiking to 100% and staying there, but memory remaining well under 500 MB (limit is 2 GB). OOM does not look like the cause.I\n\nStep 2 - ruled out client-side timeout\nOur httpx client timeout is set to 600s. The 500 was received at ~165s, so the client timeout never fired. The error came from the server side.\n\nStep 3 - SLEEP_TEST experiment\nTo isolate whether this is a wall-clock timeout imposed by HF infrastructure (rather than anything specific to our computation), we replaced the real processing with a simple sleep loop that logs a heartbeat every 10 seconds:\n\n\n    elapsed = 0\n    while elapsed < 240:\n    time.sleep(10)\n    elapsed += 10\n    logger.info(\"SLEEP_TEST: %ds / 240s elapsed\", elapsed)\n    return {\"shape\": \\[256, 256, 50\\]}\n\n\nThis was enabled via a SLEEP_TEST environment variable on the endpoint. Result:\nthe last log line we received was SLEEP_TEST: 80s / 240s elapsed. A 500 with\nempty body was returned immediately after. The endpoint never reached the 90s\nheartbeat.\n\nThis confirms a hard wall-clock timeout of approximately 80–90 seconds imposed\nby the infrastructure, unrelated to our code or the specific computation being\nperformed.\n\nStep 4 - verified the computation itself is not broken\nWe ran the same processing locally and it completed successfully in ~120s.\n\n* * *\n\nQuestion\n\nIs there a configurable request timeout on dedicated CPU Inference Endpoints?\nThe ~80–90s hard kill appears to be imposed by a gateway or proxy layer (the\ncontainer process receives no signal we can intercept: there is no Python\nexception, no SIGTERM handler triggered).\n\nIf this limit is fixed for the CPU tier, is there a higher-tier option or\nconfiguration that supports longer-running synchronous requests? Alternatively,\nis there a recommended pattern for CPU-bound tasks that exceed this duration\n(e.g. polling, async task queues)?",
  "title": "Dedicated CPU Inference Endpoint returns empty HTTP 500 after ~80s: is there a configurable request timeout?"
}