Dedicated CPU Inference Endpoint returns empty HTTP 500 after ~80s: is there a configurable request timeout?

Hugging Face Forums [Unofficial] April 15, 2026

Environment

  • Product: Dedicated private Inference Endpoint (CPU, not serverless)
  • Region: eu-west-1
  • Framework: custom EndpointHandler (Python, SimpleITK)
  • Client: httpx with a 600s timeout

Problem

Requests to our dedicated CPU endpoint occasionally return an HTTP 500 with a completely empty response body. This only happens for computationally expensive requests; lighter requests on the same endpoint complete successfully.

Our handler wraps all processing in a try/except that returns a structured JSON error on any Python exception. An empty 500 means the container process was killed before Python could write a response.
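The wrapper described above might look roughly like the following. This is a minimal sketch, assuming the usual custom-handler layout (an `EndpointHandler` class with `__init__` and `__call__`); the `_process` method and the error field names are hypothetical stand-ins for the real SimpleITK pipeline. The point is that any Python-level exception would yield a JSON body, so an empty body is unlikely to originate in the handler itself.

```python
import logging
import traceback
from typing import Any, Dict

logger = logging.getLogger(__name__)


class EndpointHandler:
    """Sketch of a custom handler that never lets a Python exception escape."""

    def __init__(self, path: str = "") -> None:
        self.path = path  # model directory passed in by the runtime; unused here

    def _process(self, data: Dict[str, Any]) -> Dict[str, Any]:
        # Hypothetical placeholder for the real (SimpleITK-based) processing.
        if "inputs" not in data:
            raise ValueError("missing 'inputs' field")
        return {"shape": [256, 256, 50]}

    def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        try:
            logger.info("processing started")
            result = self._process(data)
            logger.info("processing finished")
            return result
        except Exception as exc:  # deliberately broad: report everything as JSON
            logger.exception("processing failed")
            return {
                "error": type(exc).__name__,
                "message": str(exc),
                "traceback": traceback.format_exc(),
            }
```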


What we investigated

Step 1 - ruled out OOM. We added per-step INFO logging inside the handler. The HF application logs show the container reaching the expensive computation step, then going silent. The endpoint metrics show CPU spiking to 100% and staying there, while memory remains well under 500 MB (the limit is 2 GB). OOM does not look like the cause.

Step 2 - ruled out client-side timeout. Our httpx client timeout is set to 600s. The 500 was received at ~165s, so the client timeout never fired. The error came from the server side.
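The reasoning in Step 2 can be written down as a small check. This is an illustrative sketch only: `timed` and `client_timeout_fired` are hypothetical helper names, not part of httpx, and the request itself is passed in as a plain callable so the snippet does not depend on any particular HTTP client.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed(call: Callable[[], T]) -> Tuple[T, float]:
    """Run `call` and return its result with the elapsed wall-clock seconds."""
    start = time.monotonic()
    result = call()
    return result, time.monotonic() - start


def client_timeout_fired(elapsed_s: float, client_timeout_s: float, got_response: bool) -> bool:
    """True only if no response arrived and the client's own deadline had passed.

    If the server sent any HTTP response (even an empty 500), the client
    timeout cannot be what ended the request.
    """
    return not got_response and elapsed_s >= client_timeout_s
```

With the figures from this report (a 500 received at ~165s against a 600s client timeout), `client_timeout_fired(165.0, 600.0, got_response=True)` evaluates to `False`.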

Step 3 - SLEEP_TEST experiment. To isolate whether this is a wall-clock timeout imposed by HF infrastructure (rather than anything specific to our computation), we replaced the real processing with a simple sleep loop that logs a heartbeat every 10 seconds:

# Runs inside the handler's __call__; needs `import time` and a configured `logger`.
elapsed = 0
while elapsed < 240:
    time.sleep(10)
    elapsed += 10
    logger.info("SLEEP_TEST: %ds / 240s elapsed", elapsed)
return {"shape": [256, 256, 50]}

This was enabled via a SLEEP_TEST environment variable on the endpoint. Result: the last log line we received was SLEEP_TEST: 80s / 240s elapsed. A 500 with empty body was returned immediately after. The endpoint never reached the 90s heartbeat.
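For anyone wanting to reproduce the experiment, the toggle could be wired in along these lines. This is a sketch under assumptions: the post only states that a `SLEEP_TEST` environment variable exists, so treating the value `"1"` as "enabled" is a guess, and `sleep_test` / `sleep_test_enabled` are hypothetical names. The durations and the sleep function are parameters so the loop can be exercised without actually waiting.

```python
import logging
import os
import time
from typing import Any, Callable, Dict

logger = logging.getLogger(__name__)


def sleep_test(
    total_s: int = 240,
    step_s: int = 10,
    sleep: Callable[[float], Any] = time.sleep,
) -> Dict[str, Any]:
    """Sleep for total_s seconds, logging a heartbeat every step_s seconds."""
    elapsed = 0
    while elapsed < total_s:
        sleep(step_s)
        elapsed += step_s
        logger.info("SLEEP_TEST: %ds / %ds elapsed", elapsed, total_s)
    # Dummy payload with the same shape as a real response.
    return {"shape": [256, 256, 50]}


def sleep_test_enabled() -> bool:
    # Assumption: "1" enables the test; the real handler may check differently.
    return os.environ.get("SLEEP_TEST") == "1"
```

Inside the handler, the real processing would then be skipped with something like `if sleep_test_enabled(): return sleep_test()`.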

This confirms a hard wall-clock timeout of approximately 80–90 seconds imposed by the infrastructure, unrelated to our code or the specific computation being performed.
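The 80–90 second window follows directly from the heartbeats: the last one logged is a lower bound, and the next one that never appeared is an upper bound. A small sketch of that inference, assuming log lines are shipped promptly (if log delivery lags, a later heartbeat could have been emitted but lost, which would widen the window). The `timeout_window` helper is hypothetical.

```python
import re
from typing import Iterable, Optional, Tuple

HEARTBEAT = re.compile(r"SLEEP_TEST: (\d+)s / (\d+)s elapsed")


def timeout_window(log_lines: Iterable[str], step_s: int = 10) -> Optional[Tuple[int, int]]:
    """Return (lower, upper) bounds in seconds on when the request was cut off.

    The last heartbeat seen is the lower bound; the next heartbeat, which
    never arrived, is the upper bound. Returns None if no heartbeat was logged.
    """
    seen = [int(m.group(1)) for line in log_lines if (m := HEARTBEAT.search(line))]
    if not seen:
        return None
    last = max(seen)
    return last, last + step_s
```

Fed the eight heartbeats from our run (10s through 80s), this returns `(80, 90)`.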

Step 4 - verified the computation itself is not broken. We ran the same processing locally and it completed successfully in ~120s.


Question

Is there a configurable request timeout on dedicated CPU Inference Endpoints? The ~80–90s hard kill appears to be imposed by a gateway or proxy layer (the container process receives no signal we can intercept: there is no Python exception, no SIGTERM handler triggered).

If this limit is fixed for the CPU tier, is there a higher-tier option or configuration that supports longer-running synchronous requests? Alternatively, is there a recommended pattern for CPU-bound tasks that exceed this duration (e.g. polling, async task queues)?
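To make the polling option concrete, here is a minimal in-process sketch of a submit-then-poll pattern. It is not a confirmed or recommended Hugging Face approach, and `JobStore` is a hypothetical name. It also assumes that poll requests reach the same replica and that the process keeps background threads alive between requests; neither is verified for Inference Endpoints, and a real deployment would more likely need an external queue or store.

```python
import threading
import uuid
from typing import Any, Callable, Dict


class JobStore:
    """In-memory job registry: submit returns at once, callers poll for the result."""

    def __init__(self) -> None:
        self._jobs: Dict[str, Dict[str, Any]] = {}
        self._lock = threading.Lock()

    def submit(self, work: Callable[[], Any]) -> str:
        """Start `work` in a background thread and return a job id immediately."""
        job_id = uuid.uuid4().hex
        with self._lock:
            self._jobs[job_id] = {"status": "running"}
        threading.Thread(target=self._run, args=(job_id, work), daemon=True).start()
        return job_id

    def _run(self, job_id: str, work: Callable[[], Any]) -> None:
        try:
            outcome = {"status": "done", "result": work()}
        except Exception as exc:  # record the failure instead of losing it
            outcome = {"status": "failed", "error": str(exc)}
        with self._lock:
            self._jobs[job_id] = outcome

    def poll(self, job_id: str) -> Dict[str, Any]:
        """Return a copy of the job's current state, or 'unknown' for a bad id."""
        with self._lock:
            return dict(self._jobs.get(job_id, {"status": "unknown"}))
```

Each HTTP request would then stay short: one call to `submit` returns the job id, and the client calls `poll` every few seconds until the status is no longer `running`.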
