
HF ZeroGPU Space Hangs, No Output in the logs

Hugging Face Forums [Unofficial] April 24, 2026

In other words, is something going wrong only in the ZeroGPU environment when data is passed from one model to the next? Or is there a subtle difference between the per-model ZeroGPU Spaces and the all-in-one ZeroGPU Space in when libraries or models get loaded?

Since I don’t have your actual code, debugging basically means trying every plausible scenario. In any case, frequent print statements are your best bet. A logger is better, of course, but even plain print statements make a big difference.

Actually, wait: in your case the container logs were disappearing, weren’t they? Is some process in the pipeline causing the container logs to vanish? If so, keep in mind that code that manipulates CUDA directly, for example, is generally pretty risky in a ZeroGPU environment.


The key is to stop asking “why does this function work locally but hang on ZeroGPU?” and instead turn it into a small set of yes/no isolation experiments.

Your current evidence says:

  • the upstream TorchScript model works alone on ZeroGPU,
  • the other dependent models work alone on ZeroGPU,
  • the full function works in a normal GPU environment,
  • the full function hangs only when integrated inside ZeroGPU.

That means the fastest path is not more model-by-model testing. You already did that. The fastest path is to isolate the first failing boundary:

model A output → conversion code → model B input
model B output → postprocessing
GPU tensor → CPU object
internal result → Gradio return value
Python call returned → CUDA actually synchronized

ZeroGPU is not just a normal persistent GPU host. HF’s docs describe it as a special Gradio runtime where GPU work is mediated by @spaces.GPU; outside the decorated function, PyTorch uses a CUDA-emulation mode, and inside it, real CUDA is used. That means code can be valid in a normal CUDA process but still fail under ZeroGPU’s request-scoped lifecycle. HF also documents compatibility limits compared with standard GPU Spaces. (Hugging Face)


The efficient isolation strategy

Use this sequence:

  1. Prove whether compute finishes at all.
  2. Find the first pipeline stage that cannot return.
  3. Distinguish bad data from bad state.
  4. Force CUDA errors to show at the real operation.
  5. Use a watchdog to see where Python is stuck.
  6. Test output serialization separately.

Do these in order. Do not randomly comment out code.


1. First test: full pipeline, but return "OK"

This is the fastest split.

Temporarily do this:

@spaces.GPU(duration=180)
def infer(x):
    print("entered infer", flush=True)

    result = full_pipeline(x)

    print("full pipeline finished", flush=True)

    # Do not return the real model output yet.
    return "OK"

How to interpret it

Result | Meaning
"OK" returns | Your model computation probably finishes. The hang is likely output conversion / Gradio serialization / file return.
"OK" does not return | The hang is inside the compute pipeline or before it. Continue to stage isolation.
"full pipeline finished" prints but UI still spins | The return object or Gradio output path is suspect.
"entered infer" does not print | The callback/request path is suspect, not the model.

This test matters because many “inference hangs” are actually return-value hangs: returning a CUDA tensor, a giant nested object, a bad file path, a generator that never terminates, a malformed image/audio object, or a custom class.


2. Use return sentinels, not just logs

Because your logs are unreliable, use successful returns as proof. A returned string proves that:

  • the decorated function was entered,
  • the stage completed,
  • the return path worked,
  • and Gradio/ZeroGPU completed the request.

Suppose your function is:

input
→ preprocess
→ model A
→ convert A output
→ model B
→ convert B output
→ model C
→ postprocess
→ return

Temporarily write it like this:

@spaces.GPU(duration=180)
def infer(x):
    print("entered", flush=True)

    x = preprocess(x)
    # Only the first return below ever executes. Delete (or comment out)
    # one return per run so that execution reaches the next stage.
    return "stage 0: preprocess OK"

    a = model_a(x)
    return "stage 1: model A OK"

    b_input = convert_a_to_b(a)
    return "stage 2: A to B conversion OK"

    b = model_b(b_input)
    return "stage 3: model B OK"

    c_input = convert_b_to_c(b)
    return "stage 4: B to C conversion OK"

    c = model_c(c_input)
    return "stage 5: model C OK"

    out = postprocess(c)
    return "stage 6: postprocess OK"

Then move the return downward one stage at a time.

Yes, this is manual. It is also extremely fast.

What you are looking for

You want to find the first stage where this changes:

previous stage returns successfully
next stage spins until timeout

That failing stage is your first real target.


3. Build a debug dropdown so you do not rebuild constantly

A better version is to add a debug stop option to the UI.

def maybe_return(stage_name, stop_at):
    if stop_at == stage_name:
        return f"{stage_name} OK"
    return None

@spaces.GPU(duration=180)
def infer(x, stop_at):
    print("entered infer", flush=True)

    x = preprocess(x)
    r = maybe_return("preprocess", stop_at)
    if r:
        return r

    a = model_a(x)
    torch.cuda.synchronize()
    r = maybe_return("model_a", stop_at)
    if r:
        return r

    b_input = convert_a_to_b(a)
    r = maybe_return("convert_a_to_b", stop_at)
    if r:
        return r

    b = model_b(b_input)
    torch.cuda.synchronize()
    r = maybe_return("model_b", stop_at)
    if r:
        return r

    out = postprocess(b)
    r = maybe_return("postprocess", stop_at)
    if r:
        return r

    return out

In Gradio:

stop_at = gr.Dropdown(
    choices=[
        "preprocess",
        "model_a",
        "convert_a_to_b",
        "model_b",
        "postprocess",
        "full",
    ],
    value="full",
    label="Debug stop point",
)

This makes the Space itself a diagnostic tool.
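
If the chain of if-blocks gets tedious, the same stop-point idea can be written as a loop over a stage table. This is only a sketch: `run_until`, `STAGES`, and the toy string stages are hypothetical stand-ins for your own preprocess / model_a / postprocess callables, and in the Space the call would still live inside the `@spaces.GPU` function.

```python
from typing import Any, Callable, Sequence

# One entry per pipeline stage: (name shown in the dropdown, callable).
Stage = tuple[str, Callable[[Any], Any]]


def run_until(stages: Sequence[Stage], x: Any, stop_at: str = "full") -> Any:
    """Run stages in order; return a sentinel string once `stop_at` completes.

    With stop_at="full" (or any name that never matches), the real output of
    the last stage is returned instead.
    """
    names = [name for name, _ in stages]
    if stop_at != "full" and stop_at not in names:
        raise ValueError(f"unknown stop point: {stop_at!r}")

    for name, fn in stages:
        print(f"{name}: START", flush=True)
        x = fn(x)
        print(f"{name}: DONE", flush=True)
        if name == stop_at:
            return f"{name} OK"
    return x


# Toy stages (assumptions for illustration only).
STAGES: list[Stage] = [
    ("preprocess", lambda s: s.strip()),
    ("model_a", lambda s: s.upper()),
    ("postprocess", lambda s: f"<{s}>"),
]

print(run_until(STAGES, "  hello  ", stop_at="model_a"))
print(run_until(STAGES, "  hello  "))
```

For GPU stages you would probably also want a torch.cuda.synchronize() right after `fn(x)`, as in the hand-written version above, so that a returned sentinel really means the CUDA work finished.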


4. Once a boundary fails, separate “bad data” from “bad state”

Assume the failing boundary is:

model A → convert A output → model B

There are two different possibilities:

Possibility 1: bad data

Model A produced an output that model B cannot handle.

Examples:

  • wrong shape,
  • wrong dtype,
  • wrong device,
  • non-contiguous tensor,
  • unexpected tuple/list/dict,
  • NaNs/Infs,
  • invalid token IDs,
  • invalid image/audio shape,
  • wrong batch dimension.

Possibility 2: bad state

Model A leaves the runtime in a state that makes model B hang.

Examples:

  • retained GPU memory,
  • CUDA stream issue,
  • async CUDA error,
  • global library state,
  • thread pool state,
  • worker process state,
  • model/cache singleton state.

Use this three-test matrix.


Test A: model B with synthetic valid input

@spaces.GPU(duration=180)
def infer(x):
    b_input = make_synthetic_valid_b_input()
    b = model_b(b_input)
    torch.cuda.synchronize()
    return "model B synthetic input OK"

If this fails, your isolated model B test is not equivalent to the real call.


Test B: run model A, discard output, then run model B with synthetic input

@spaces.GPU(duration=180)
def infer(x):
    a = model_a(preprocess(x))
    torch.cuda.synchronize()

    del a
    torch.cuda.empty_cache()

    b_input = make_synthetic_valid_b_input()
    b = model_b(b_input)
    torch.cuda.synchronize()

    return "model A side effect + model B synthetic input OK"

If this hangs, model A leaves harmful state behind.


Test C: run model B with actual model A output

@spaces.GPU(duration=180)
def infer(x):
    a = model_a(preprocess(x))
    torch.cuda.synchronize()

    b_input = convert_a_to_b(a)
    b = model_b(b_input)
    torch.cuda.synchronize()

    return "model A real output + model B OK"

Interpret the matrix

Test result | Likely cause
Test A works, Test B works, Test C hangs | Bad A→B data conversion
Test A works, Test B hangs | Model A leaves bad runtime/GPU state
Test A hangs | Your “model B alone” test was not equivalent
Test C works, full app hangs | Later stage or output serialization

This is one of the most efficient ways to isolate multi-model hangs.
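
To keep yourself honest while running the three tests, you can write the matrix down as a tiny lookup. This is just the table above turned into code; `interpret_matrix` and its argument names are made up for this sketch.

```python
def interpret_matrix(test_a_ok: bool, test_b_ok: bool, test_c_ok: bool) -> str:
    """Map the outcomes of Test A / Test B / Test C to the likely cause.

    test_a_ok: model B with synthetic input returned
    test_b_ok: model A, discard output, then model B with synthetic input returned
    test_c_ok: model B with the real model A output returned
    """
    if not test_a_ok:
        return "isolated model B test was not equivalent to the real call"
    if not test_b_ok:
        return "model A leaves bad runtime/GPU state"
    if not test_c_ok:
        return "bad A->B data conversion"
    return "this boundary is fine; look at a later stage or output serialization"


print(interpret_matrix(True, True, False))
```

The checks are ordered the same way the tests build on each other: a failure in Test A makes the other two results meaningless, so it is reported first.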


5. Add a step wrapper that proves Python return vs CUDA completion

A common trap: a PyTorch call can “return” to Python before CUDA work is actually finished. CUDA operations are asynchronous, and PyTorch documents that errors can be reported at a later operation; it recommends CUDA_LAUNCH_BLOCKING=1 for debugging because otherwise stack traces may point to the wrong place. (PyTorch Docs)

Use a wrapper like this:

import time
import traceback
import torch

def mark(msg):
    print(f"[{time.strftime('%H:%M:%S')}] {msg}", flush=True)

def cuda_mem(label):
    if torch.cuda.is_available():
        torch.cuda.synchronize()
        mark(
            f"{label}: "
            f"allocated={torch.cuda.memory_allocated() / 1024**3:.2f}GB "
            f"reserved={torch.cuda.memory_reserved() / 1024**3:.2f}GB "
            f"max={torch.cuda.max_memory_allocated() / 1024**3:.2f}GB"
        )

def run_step(name, fn, *args, sync=True, **kwargs):
    mark(f"{name}: START")
    cuda_mem(f"{name}: before")

    t0 = time.perf_counter()

    try:
        out = fn(*args, **kwargs)
    except Exception:
        mark(f"{name}: EXCEPTION")
        print(traceback.format_exc(), flush=True)
        raise

    mark(f"{name}: PYTHON RETURNED")

    if sync and torch.cuda.is_available():
        mark(f"{name}: CUDA SYNC START")
        torch.cuda.synchronize()
        mark(f"{name}: CUDA SYNC DONE")

    mark(f"{name}: DONE in {time.perf_counter() - t0:.2f}s")
    cuda_mem(f"{name}: after")
    return out

Then use it everywhere:

@spaces.GPU(duration=180)
def infer(x):
    mark("infer entered")

    x = run_step("preprocess", preprocess, x, sync=False)
    a = run_step("model_a", model_a, x)
    b_input = run_step("convert_a_to_b", convert_a_to_b, a, sync=False)
    b = run_step("model_b", model_b, b_input)
    out = run_step("postprocess", postprocess, b, sync=False)

    mark("returning")
    return out

How to interpret the logs

Last log seen | Meaning
model_b: START | Python entered model B but did not return. Native call may be stuck.
model_b: PYTHON RETURNED but no CUDA SYNC DONE | CUDA work did not complete; earlier async operation may be the real cause.
postprocess: DONE but UI still spins | Output serialization / Gradio return path likely.
Memory jumps before timeout | Combined memory / retained tensors likely.

This distinguishes three things people often merge together:

Python function call returned
CUDA work completed
Gradio response returned

They are not the same.
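
When the container logs do survive, you can read the last marker mechanically instead of by eye. The sketch below assumes the exact line format that mark() and run_step() print above ("[HH:MM:SS] step: EVENT"); `last_marker` and `diagnose` are hypothetical helper names, and the wording of the diagnoses is taken from the table.

```python
import re

# Matches lines such as "[12:00:01] model_b: PYTHON RETURNED".
# Memory lines ("model_b: before: allocated=...") are skipped on purpose.
_MARK = re.compile(
    r"^\[\d{2}:\d{2}:\d{2}\] (?P<step>[\w.-]+): "
    r"(?P<event>START|PYTHON RETURNED|CUDA SYNC START|CUDA SYNC DONE|DONE)\b"
)


def last_marker(log_text: str):
    """Return (step, event) for the last run_step marker, or None."""
    last = None
    for line in log_text.splitlines():
        m = _MARK.match(line.strip())
        if m:
            last = (m["step"], m["event"])
    return last


def diagnose(log_text: str) -> str:
    marker = last_marker(log_text)
    if marker is None:
        return "no step markers: the request may never have reached infer()"
    step, event = marker
    if event == "START":
        return f"{step}: Python call did not return (native call may be stuck)"
    if event in ("PYTHON RETURNED", "CUDA SYNC START"):
        return f"{step}: CUDA work did not complete (an earlier async op may be the cause)"
    return f"{step}: finished; if the UI still spins, suspect a later stage or the return path"


sample = (
    "[12:00:00] model_b: START\n"
    "[12:00:00] model_b: before: allocated=1.20GB reserved=1.50GB max=1.20GB\n"
    "[12:00:03] model_b: PYTHON RETURNED\n"
    "[12:00:03] model_b: CUDA SYNC START\n"
)
print(diagnose(sample))
```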


6. Force CUDA errors to appear closer to the cause

For one debug build, set:

CUDA_LAUNCH_BLOCKING=1
PYTHONFAULTHANDLER=1
TOKENIZERS_PARALLELISM=false

CUDA_LAUNCH_BLOCKING=1 is the standard CUDA/PyTorch debug move for asynchronous CUDA problems. PyTorch forum answers repeatedly recommend it because CUDA kernel errors may be reported at a later API call, making the stack trace misleading. (PyTorch Forums)
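
If you would rather not touch the Space settings for a one-off debug build, one option is to set these at the very top of app.py. This is a sketch under two assumptions: that nothing has imported torch or initialized CUDA yet when it runs, and that overriding any existing values is acceptable for a debug build. `apply_debug_env` and `DEBUG_ENV` are hypothetical names.

```python
import os
from typing import MutableMapping

# The three variables from the list above.
DEBUG_ENV = {
    "CUDA_LAUNCH_BLOCKING": "1",
    # Read by the interpreter at startup, so setting it here only affects
    # child processes; call faulthandler.enable() for the current process.
    "PYTHONFAULTHANDLER": "1",
    "TOKENIZERS_PARALLELISM": "false",
}


def apply_debug_env(env: MutableMapping[str, str] = os.environ) -> dict:
    """Force the debug variables into `env` and return what was applied."""
    for key, value in DEBUG_ENV.items():
        env[key] = value
    return {key: env[key] for key in DEBUG_ENV}


applied = apply_debug_env()
print(f"debug env applied: {applied}", flush=True)

# import torch  # deliberately only after the environment is prepared
```

Remember to remove this again afterwards: CUDA_LAUNCH_BLOCKING=1 serializes kernel launches and will slow inference down.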

Then add:

torch.cuda.synchronize()

after every model call and after every GPU tensor conversion.

The goal is to turn this:

some earlier CUDA issue
later random hang
timeout

into this:

model A returned
model A sync failed/hung

That tells you where to look.


7. Add a watchdog stack dump for hangs

Because your failure is a hang, not an exception, add faulthandler.

import sys
import faulthandler

faulthandler.enable(file=sys.stderr, all_threads=True)

faulthandler.dump_traceback_later(
    60,
    repeat=True,
    file=sys.stderr,
    exit=False,
)

Python’s faulthandler module is designed to dump Python tracebacks on a fault, after a timeout, or on a user signal. (bugs.python.org)

What it tells you

If the repeated traceback shows Python waiting here:

future.result()
queue.get()
thread.join()
client.predict()
requests.post()
for item in generator

you probably have a Python-level deadlock or blocking call.

If it repeatedly points to:

out = model_b(...)

then Python entered a native PyTorch/TorchScript/CUDA call and did not return.

Those are different fixes.
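
If you want to see what such a dump looks like before deploying it, here is a small self-contained dry run. It is a sketch, not ZeroGPU-specific: time.sleep() stands in for the hung call, the dump goes to a temporary file instead of stderr so it can be read back, and `capture_hang_traceback` is a made-up name.

```python
import faulthandler
import tempfile
import time


def capture_hang_traceback(hang_s: float = 1.0, timeout_s: float = 0.2) -> str:
    """Simulate a hang and return what the watchdog wrote about it."""
    with tempfile.TemporaryFile(mode="w+") as log:
        # Arm the watchdog: after timeout_s it dumps every thread's stack to `log`.
        faulthandler.dump_traceback_later(timeout_s, repeat=False, file=log)
        try:
            time.sleep(hang_s)  # stand-in for a call that never returns
        finally:
            # Disarm it so no further dumps are scheduled.
            faulthandler.cancel_dump_traceback_later()
        log.seek(0)
        return log.read()


dump = capture_hang_traceback()
print(dump)
```

The printed traceback should name the function and line where the main thread was sitting when the timer fired, which is exactly the information you need from the real Space.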


8. Use a CPU firebreak between models

Since every model works individually, test whether the GPU-to-GPU handoff is the problem.

Temporarily replace:

a = model_a(x)
b = model_b(convert_a_to_b(a))

with:

a = model_a(x)
torch.cuda.synchronize()

a_cpu = detach_to_cpu(a)
del a
torch.cuda.empty_cache()

b_input = convert_a_to_b(a_cpu)
b_input = move_to_cuda(b_input)

b = model_b(b_input)
torch.cuda.synchronize()

Helpers:

def detach_to_cpu(obj):
    if isinstance(obj, torch.Tensor):
        return obj.detach().cpu()
    if isinstance(obj, dict):
        return {k: detach_to_cpu(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [detach_to_cpu(v) for v in obj]
    if isinstance(obj, tuple):
        return tuple(detach_to_cpu(v) for v in obj)
    return obj

def move_to_cuda(obj):
    if isinstance(obj, torch.Tensor):
        return obj.to("cuda", non_blocking=False)
    if isinstance(obj, dict):
        return {k: move_to_cuda(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [move_to_cuda(v) for v in obj]
    if isinstance(obj, tuple):
        return tuple(move_to_cuda(v) for v in obj)
    return obj

Interpret it

Result | Meaning
CPU firebreak fixes the hang | GPU tensor lifetime, memory pressure, stream state, or async CUDA issue
CPU firebreak does not fix it | Bad data conversion, threading, native call, or later output path
CPU firebreak works but is slower | Good diagnostic result; optimize later

This is not meant as the final production design. It is a diagnostic cut.


9. Describe every tensor at every boundary

Add this:

def describe(name, obj, depth=0):
    if depth > 2:
        return

    print(f"[DESCRIBE] {name}: type={type(obj)}", flush=True)

    if isinstance(obj, torch.Tensor):
        print(
            f"[DESCRIBE] {name}: "
            f"shape={tuple(obj.shape)} "
            f"dtype={obj.dtype} "
            f"device={obj.device} "
            f"requires_grad={obj.requires_grad} "
            f"contiguous={obj.is_contiguous()}",
            flush=True,
        )

        if obj.numel() > 0 and obj.is_floating_point():
            x = obj.detach()
            print(
                f"[DESCRIBE] {name}: "
                f"finite={torch.isfinite(x).all().item()} "
                f"nan={torch.isnan(x).any().item()} "
                f"inf={torch.isinf(x).any().item()}",
                flush=True,
            )
        return

    if isinstance(obj, dict):
        print(f"[DESCRIBE] {name}: keys={list(obj.keys())}", flush=True)
        for k, v in list(obj.items())[:10]:
            describe(f"{name}.{k}", v, depth + 1)
        return

    if isinstance(obj, (list, tuple)):
        print(f"[DESCRIBE] {name}: len={len(obj)}", flush=True)
        for i, v in enumerate(obj[:10]):
            describe(f"{name}[{i}]", v, depth + 1)
        return

    print(f"[DESCRIBE] {name}: repr={repr(obj)[:500]}", flush=True)

Call it here:

describe("model_a_output", a)
describe("model_b_input", b_input)
describe("model_b_output", b)

You are looking for:

CUDA tensor where CPU tensor expected
CPU tensor where CUDA tensor expected
float32 vs float16 vs bfloat16 mismatch
non-contiguous tensor
wrong batch dimension
unexpected tuple/list/dict structure
NaN or Inf values
very large shape
invalid token IDs

The individual model tests may not use the same real intermediate values as the integrated pipeline.


10. Temporarily make everything single-threaded / no-worker

Integrated pipelines often trigger hidden thread/process behavior that individual model tests do not.

Search your code for:

multiprocessing
ProcessPoolExecutor
ThreadPoolExecutor
DataLoader
num_workers
joblib
subprocess
queue
future.result
thread.join
asyncio
gradio_client.Client
requests.post
httpx

For a debug run:

import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import torch
torch.set_num_threads(1)
torch.set_num_interop_threads(1)

For DataLoader:

num_workers=0
pin_memory=False
persistent_workers=False

PyTorch’s multiprocessing docs warn about “poison fork” with accelerators: if the accelerator runtime is initialized before forking, child processes can fail because the runtime is not fork-safe; the docs recommend avoiding accelerator initialization before forking and using spawn or forkserver when CUDA subprocesses are needed. (PyTorch Docs)

If single-thread/no-worker mode fixes it, you are debugging a worker/fork/thread issue, not a model issue.
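
Searching by hand is fine, but a short script makes it repeatable across app.py and any helper modules. This is a plain substring scan, nothing smarter: `scan_sources` is a hypothetical helper, the pattern list is a subset of the search terms above, and a pattern like "future.result" will only match if your variable really is called `future`.

```python
import pathlib

# Subset of the search terms listed above; extend as needed.
PATTERNS = (
    "multiprocessing", "ProcessPoolExecutor", "ThreadPoolExecutor",
    "DataLoader", "num_workers", "joblib", "subprocess",
    "future.result", "thread.join", "asyncio",
    "gradio_client", "client.predict", "requests.post", "httpx",
)


def scan_sources(root, patterns=PATTERNS):
    """Return (path, line number, pattern, line) for every substring hit in *.py files."""
    hits = []
    for path in sorted(pathlib.Path(root).rglob("*.py")):
        try:
            text = path.read_text(encoding="utf-8", errors="replace")
        except OSError:
            continue  # unreadable file: skip rather than abort the scan
        for lineno, line in enumerate(text.splitlines(), start=1):
            for pattern in patterns:
                if pattern in line:
                    hits.append((str(path), lineno, pattern, line.strip()))
    return hits


if __name__ == "__main__":
    for path, lineno, pattern, line in scan_sources("."):
        print(f"{path}:{lineno}: [{pattern}] {line}", flush=True)
```

Run it from the repository root; it only covers your own code, so worker or thread usage hidden inside installed libraries still needs the single-thread debug run described above.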


11. Check for nested Space/API/self-calls

Search for:

gradio_client.Client
client.predict
requests.post
httpx.post
/queue/join
/gradio_api/call
localhost
SPACE_HOST

A common integration deadlock is:

ZeroGPU request enters infer()
infer() calls another endpoint or same Space
that call waits on queue/GPU/quota
original request waits forever
duration expires

This can work locally because local execution does not use the same HF queue/GPU allocation path.

For one test, stub all external calls:

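def call_external_model(*args, **kwargs):
    # Debug stub: skip the network/queue entirely and return a canned,
    # known-good response (synthetic_valid_response is a placeholder).
    return synthetic_valid_response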

If the hang disappears, your issue is orchestration, not inference.
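
If you cannot stub a call out, a timeout guard at least turns a silent hang into a visible exception. A minimal sketch, assuming the external call is ordinary blocking Python (HTTP request, client.predict(), queue wait) and not the GPU work itself; `call_with_timeout` and `slow_external_call` are hypothetical names.

```python
import concurrent.futures
import time


def call_with_timeout(fn, *args, timeout_s: float = 30.0, **kwargs):
    """Run fn in a worker thread; raise TimeoutError if it does not finish in time."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        name = getattr(fn, "__name__", repr(fn))
        raise TimeoutError(f"{name} still blocked after {timeout_s}s") from None
    finally:
        # Do not wait for the worker: a stuck call would block us again here.
        pool.shutdown(wait=False)


def slow_external_call(delay_s: float) -> str:
    """Stand-in for client.predict() / requests.post()."""
    time.sleep(delay_s)
    return "response"


print(call_with_timeout(slow_external_call, 0.0, timeout_s=1.0))
```

Note the limitation: Python cannot kill the worker thread, so the blocked call keeps running in the background. That is acceptable for diagnosis, because all you need is the error message telling you which call was stuck.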


12. Check output serialization separately

After the full compute finishes, do not return the real result:

@spaces.GPU(duration=180)
def infer(x):
    result = full_pipeline(x)
    print("full pipeline computed", flush=True)
    return "OK"

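If that works, replace the return with each of these in turn, one per run:

return str(type(result))      # 1. type name only
return repr(result)[:1000]    # 2. truncated repr
return simplified_result      # 3. reduced version of the real output
return final_result           # 4. the real output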

This catches cases like:

  • CUDA tensor returned directly,
  • huge nested dict/list,
  • custom object,
  • generator,
  • invalid file path,
  • image/audio/video in a format Gradio does not expect,
  • JSON with non-serializable values,
  • numpy array with unexpected dtype/shape.
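
For the “simplified result” step, something along these lines can help. It is a best-effort sketch, not what Gradio does internally: `simplify_for_return` is a hypothetical helper that duck-types tensor-like objects (anything with detach/cpu/tolist, which covers torch tensors and numpy arrays) and falls back to a truncated repr for everything it does not recognize, including generators, which it deliberately does not consume.

```python
import json


def simplify_for_return(obj, max_items: int = 1000, _depth: int = 0):
    """Best-effort conversion to plain, JSON-friendly Python values."""
    if _depth > 8:
        return repr(obj)[:200]
    if obj is None or isinstance(obj, (bool, int, float, str)):
        return obj
    # Tensor-like: drop autograd history and copy to host memory first.
    if hasattr(obj, "detach") and hasattr(obj, "cpu"):
        obj = obj.detach().cpu()
    # torch.Tensor and numpy.ndarray both expose tolist().
    if hasattr(obj, "tolist"):
        return simplify_for_return(obj.tolist(), max_items, _depth + 1)
    if isinstance(obj, dict):
        items = list(obj.items())[:max_items]
        return {str(k): simplify_for_return(v, max_items, _depth + 1) for k, v in items}
    if isinstance(obj, (list, tuple, set)):
        return [simplify_for_return(v, max_items, _depth + 1) for v in list(obj)[:max_items]]
    # Custom objects, generators, file handles, ...: describe, do not return.
    return repr(obj)[:200]


example = {"label": "cat", "scores": (0.9, 0.1), "stream": (i for i in range(3))}
simplified = simplify_for_return(example)
print(json.dumps(simplified))
```

If returning the simplified version works and the real one hangs, compare the two field by field to find the value Gradio chokes on. For very large tensors, slice before converting; tolist() on a huge tensor is slow in its own right.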

13. Make local more like ZeroGPU

Since it works locally, try to make local fail by reducing differences.

Run locally with:

CUDA_LAUNCH_BLOCKING=1
PYTHONFAULTHANDLER=1
TOKENIZERS_PARALLELISM=false
OMP_NUM_THREADS=1
MKL_NUM_THREADS=1

Use the same:

  • Python version,
  • torch version,
  • Gradio version,
  • input,
  • model load order,
  • dtype,
  • precision mode,
  • cache state,
  • batch size,
  • output conversion path.

Also run local in a fresh process with a cold cache if possible. Warm local state can hide the problem.
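
To compare the two environments rather than guess, print the same small report in both places and diff the output. A sketch using only the standard library; `env_report` is a hypothetical helper, and the default package and variable lists are simply the ones mentioned in this post.

```python
import importlib.metadata
import json
import os
import platform


def env_report(
    packages=("torch", "gradio", "spaces"),
    env_keys=(
        "CUDA_LAUNCH_BLOCKING", "PYTHONFAULTHANDLER", "TOKENIZERS_PARALLELISM",
        "OMP_NUM_THREADS", "MKL_NUM_THREADS",
    ),
) -> dict:
    """Collect interpreter, package, and environment details for diffing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = importlib.metadata.version(name)
        except importlib.metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
        "env": {key: os.environ.get(key, "<unset>") for key in env_keys},
    }


print(json.dumps(env_report(), indent=2), flush=True)
```

Call it once at startup in the Space and once locally. Any line that differs is a candidate explanation worth eliminating before you dig deeper.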


The exact experiment order I recommend

Run these in this order:

Experiment 1: return before compute

@spaces.GPU(duration=180)
def infer(x):
    return "entered"

If this fails, the callback/wrapper path is the problem.

Experiment 2: full compute, simple return

result = full_pipeline(x)
return "OK"

If this works, debug output serialization.

Experiment 3: stage returns

Move a return after each pipeline stage until one fails.

Experiment 4: failing boundary matrix

For the first failing boundary A→B:

B synthetic input only
A then B synthetic input
A real output then B

Experiment 5: CPU firebreak

Move outputs to CPU between models, delete GPU tensors, clear cache, then move only next input back.

Experiment 6: CUDA debug

Set CUDA_LAUNCH_BLOCKING=1; add torch.cuda.synchronize() after each stage.

Experiment 7: hang watchdog

Add faulthandler.dump_traceback_later(60, repeat=True).

Experiment 8: no-worker mode

Disable multiprocessing, dataloader workers, tokenizer parallelism, and reduce thread counts.

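This order is efficient because each experiment eliminates a whole class of causes (callback path, output serialization, a specific stage, data vs. state) before you move on to the next one.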


What I would bet on in your case

Given all the evidence, I would bet on one of these:

Most likely: bad inter-model boundary

One model’s actual output is not exactly what the next stage expects under ZeroGPU.

Typical culprit:

dtype, device, shape, layout, contiguity, NaN/Inf, tuple/dict/list structure

Second most likely: GPU state or memory lifetime

Each model fits alone, but the integrated function retains too much GPU state or carries a bad async CUDA state forward.

Typical culprit:

intermediate tensors kept alive, no detach, no del, no CPU boundary, async CUDA error

Third most likely: hidden blocking call

Your integration code may contain a queue, future, thread, worker, subprocess, HTTP call, or Gradio client call that only appears in the full pipeline.

Typical culprit:

future.result(), queue.get(), thread.join(), client.predict(), requests.post()

Fourth most likely: output serialization

The actual full pipeline finishes, but the return value cannot be serialized cleanly by Gradio.

Typical culprit:

CUDA tensor, custom object, huge nested dict, invalid file path, generator, malformed media output

Short summary

  • Since each model works alone, isolate boundaries, not models.
  • Use return sentinels because logs are unreliable.
  • Use the A→B matrix: B alone, A then B synthetic, A real output then B.
  • Add torch.cuda.synchronize() after every stage.
  • Use CUDA_LAUNCH_BLOCKING=1 to reveal async CUDA issues.
  • Add faulthandler.dump_traceback_later() to catch hangs.
  • Test a CPU firebreak between models.
  • Test full_pipeline(); return "OK" to rule out output serialization.
