{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreia47nrhuv7o7hqyhkt2tvd3idtnpdv4nvz22bnlnb4uxh2ikgns24",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mkahnh2yi652"
  },
  "path": "/t/hf-zerogpu-space-hangs-no-output-in-the-logs/175410#post_6",
  "publishedAt": "2026-04-24T10:36:23.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Hugging Face",
    "PyTorch Docs",
    "PyTorch Forums",
    "bugs.python.org",
    "@spaces.GPU"
  ],
  "textContent": "In other words, for example, is something happening only in the Zero GPU environment when data is being transferred? Or is there a slight difference between the Zero GPU environment for individual models and the all-in-one Zero GPU environment in terms of when libraries or models are loaded…?\n\nSince I don’t have your actual code, debugging would basically just involve trying every possible scenario . In any case, frequent print statements are your best bet for debugging. Of course, `logger` is better, but even `print` statements make a big difference.\n\nActually, wait—in your case, the container logs were disappearing, weren’t they…? Is there some strange process in the pipeline that’s causing the container logs to vanish…? If that’s the case, e.g. **code that directly manipulates CUDA is generally pretty risky in a Zero GPU environment**.\n\n* * *\n\nThe key is to stop asking “why does this function work locally but hang on ZeroGPU?” and instead turn it into a small set of **yes/no isolation experiments**.\n\nYour current evidence says:\n\n  * the upstream TorchScript model works alone on ZeroGPU,\n  * the other dependent models work alone on ZeroGPU,\n  * the full function works in a normal GPU environment,\n  * the full function hangs only when integrated inside ZeroGPU.\n\n\n\nThat means the fastest path is **not** more model-by-model testing. You already did that. The fastest path is to isolate the first failing **boundary** :\n\n\n    model A output → conversion code → model B input\n    model B output → postprocessing\n    GPU tensor → CPU object\n    internal result → Gradio return value\n    Python call returned → CUDA actually synchronized\n\n\nZeroGPU is not just a normal persistent GPU host. HF’s docs describe it as a special Gradio runtime where GPU work is mediated by `@spaces.GPU`; outside the decorated function, PyTorch uses a CUDA-emulation mode, and inside it, real CUDA is used. 
That means code can be valid in a normal CUDA process but still fail under ZeroGPU’s request-scoped lifecycle. HF also documents compatibility limits compared with standard GPU Spaces. (Hugging Face)\n\n* * *\n\n## The efficient isolation strategy\n\nUse this sequence:\n\n  1. **Prove whether compute finishes at all.**\n  2. **Find the first pipeline stage that cannot return.**\n  3. **Distinguish bad data from bad state.**\n  4. **Force CUDA errors to show at the real operation.**\n  5. **Use a watchdog to see where Python is stuck.**\n  6. **Test output serialization separately.**\n\n\n\nDo these in order. Do not randomly comment out code.\n\n* * *\n\n# 1. First test: full pipeline, but return `\"OK\"`\n\nThis is the fastest split.\n\nTemporarily do this:\n\n\n    @spaces.GPU(duration=180)\n    def infer(x):\n        print(\"entered infer\", flush=True)\n\n        result = full_pipeline(x)\n\n        print(\"full pipeline finished\", flush=True)\n\n        # Do not return the real model output yet.\n        return \"OK\"\n\n\n## How to interpret it\n\nResult | Meaning\n---|---\n`\"OK\"` returns | Your model computation probably finishes. The hang is likely output conversion / Gradio serialization / file return.\n`\"OK\"` does not return | The hang is inside the compute pipeline or before it. Continue to stage isolation.\n`\"full pipeline finished\"` prints but UI still spins | The return object or Gradio output path is suspect.\n`\"entered infer\"` does not print | The callback/request path is suspect, not the model.\n\nThis test matters because many “inference hangs” are actually **return-value hangs**: returning a CUDA tensor, a giant nested object, a bad file path, a generator that never terminates, a malformed image/audio object, or a custom class.\n\n* * *\n\n# 2. Use return sentinels, not just logs\n\nBecause your logs are unreliable, use successful returns as proof. 
A returned string proves that:\n\n  * the decorated function was entered,\n  * the stage completed,\n  * the return path worked,\n  * and Gradio/ZeroGPU completed the request.\n\n\n\nSuppose your function is:\n\n\n    input\n    → preprocess\n    → model A\n    → convert A output\n    → model B\n    → convert B output\n    → model C\n    → postprocess\n    → return\n\n\nTemporarily write it like this:\n\n\n    @spaces.GPU(duration=180)\n    def infer(x):\n        print(\"entered\", flush=True)\n\n        x = preprocess(x)\n        return \"stage 0: preprocess OK\"\n\n        a = model_a(x)\n        return \"stage 1: model A OK\"\n\n        b_input = convert_a_to_b(a)\n        return \"stage 2: A to B conversion OK\"\n\n        b = model_b(b_input)\n        return \"stage 3: model B OK\"\n\n        c_input = convert_b_to_c(b)\n        return \"stage 4: B to C conversion OK\"\n\n        c = model_c(c_input)\n        return \"stage 5: model C OK\"\n\n        out = postprocess(c)\n        return \"stage 6: postprocess OK\"\n\n\nThen move the `return` downward one stage at a time.\n\nYes, this is manual. It is also extremely fast.\n\n## What you are looking for\n\nYou want to find the first stage where this changes:\n\n\n    previous stage returns successfully\n    next stage spins until timeout\n\n\nThat failing stage is your first real target.\n\n* * *\n\n# 3. 
Build a debug dropdown so you do not rebuild constantly\n\nA better version is to add a debug stop option to the UI.\n\n\n    def maybe_return(stage_name, stop_at):\n        if stop_at == stage_name:\n            return f\"{stage_name} OK\"\n        return None\n\n    @spaces.GPU(duration=180)\n    def infer(x, stop_at):\n        print(\"entered infer\", flush=True)\n\n        x = preprocess(x)\n        r = maybe_return(\"preprocess\", stop_at)\n        if r:\n            return r\n\n        a = model_a(x)\n        torch.cuda.synchronize()\n        r = maybe_return(\"model_a\", stop_at)\n        if r:\n            return r\n\n        b_input = convert_a_to_b(a)\n        r = maybe_return(\"convert_a_to_b\", stop_at)\n        if r:\n            return r\n\n        b = model_b(b_input)\n        torch.cuda.synchronize()\n        r = maybe_return(\"model_b\", stop_at)\n        if r:\n            return r\n\n        out = postprocess(b)\n        r = maybe_return(\"postprocess\", stop_at)\n        if r:\n            return r\n\n        return out\n\n\nIn Gradio:\n\n\n    stop_at = gr.Dropdown(\n        choices=[\n            \"preprocess\",\n            \"model_a\",\n            \"convert_a_to_b\",\n            \"model_b\",\n            \"postprocess\",\n            \"full\",\n        ],\n        value=\"full\",\n        label=\"Debug stop point\",\n    )\n\n\nThis makes the Space itself a diagnostic tool.\n\n* * *\n\n# 4. 
Once a boundary fails, separate “bad data” from “bad state”\n\nAssume the failing boundary is:\n\n\n    model A → convert A output → model B\n\n\nThere are two different possibilities:\n\n## Possibility 1: bad data\n\nModel A produced an output that model B cannot handle.\n\nExamples:\n\n  * wrong shape,\n  * wrong dtype,\n  * wrong device,\n  * non-contiguous tensor,\n  * unexpected tuple/list/dict,\n  * NaNs/Infs,\n  * invalid token IDs,\n  * invalid image/audio shape,\n  * wrong batch dimension.\n\n\n\n## Possibility 2: bad state\n\nModel A leaves the runtime in a state that makes model B hang.\n\nExamples:\n\n  * retained GPU memory,\n  * CUDA stream issue,\n  * async CUDA error,\n  * global library state,\n  * thread pool state,\n  * worker process state,\n  * model/cache singleton state.\n\n\n\nUse this three-test matrix.\n\n* * *\n\n## Test A: model B with synthetic valid input\n\n\n    @spaces.GPU(duration=180)\n    def infer(x):\n        b_input = make_synthetic_valid_b_input()\n        b = model_b(b_input)\n        torch.cuda.synchronize()\n        return \"model B synthetic input OK\"\n\n\nIf this fails, your isolated model B test is not equivalent to the real call.\n\n* * *\n\n## Test B: run model A, discard output, then run model B with synthetic input\n\n\n    @spaces.GPU(duration=180)\n    def infer(x):\n        a = model_a(preprocess(x))\n        torch.cuda.synchronize()\n\n        del a\n        torch.cuda.empty_cache()\n\n        b_input = make_synthetic_valid_b_input()\n        b = model_b(b_input)\n        torch.cuda.synchronize()\n\n        return \"model A side effect + model B synthetic input OK\"\n\n\nIf this hangs, model A leaves harmful state behind.\n\n* * *\n\n## Test C: run model B with actual model A output\n\n\n    @spaces.GPU(duration=180)\n    def infer(x):\n        a = model_a(preprocess(x))\n        torch.cuda.synchronize()\n\n        b_input = convert_a_to_b(a)\n        b = model_b(b_input)\n        torch.cuda.synchronize()\n\n   
     return \"model A real output + model B OK\"\n\n\n## Interpret the matrix\n\nTest result | Likely cause\n---|---\nA works, B works, C hangs | Bad A→B data conversion\nA works, B hangs | Model A leaves bad runtime/GPU state\nA hangs | Your “model B alone” test was not equivalent\nC works, full app hangs | Later stage or output serialization\n\nThis is one of the most efficient ways to isolate multi-model hangs.\n\n* * *\n\n# 5. Add a step wrapper that proves Python return vs CUDA completion\n\nA common trap: a PyTorch call can “return” to Python before CUDA work is actually finished. CUDA operations are asynchronous, and PyTorch documents that errors can be reported at a later operation; it recommends `CUDA_LAUNCH_BLOCKING=1` for debugging because otherwise stack traces may point to the wrong place. (PyTorch Docs)\n\nUse a wrapper like this:\n\n\n    import time\n    import traceback\n    import torch\n\n    def mark(msg):\n        print(f\"[{time.strftime('%H:%M:%S')}] {msg}\", flush=True)\n\n    def cuda_mem(label):\n        if torch.cuda.is_available():\n            torch.cuda.synchronize()\n            mark(\n                f\"{label}: \"\n                f\"allocated={torch.cuda.memory_allocated() / 1024**3:.2f}GB \"\n                f\"reserved={torch.cuda.memory_reserved() / 1024**3:.2f}GB \"\n                f\"max={torch.cuda.max_memory_allocated() / 1024**3:.2f}GB\"\n            )\n\n    def run_step(name, fn, *args, sync=True, **kwargs):\n        mark(f\"{name}: START\")\n        cuda_mem(f\"{name}: before\")\n\n        t0 = time.perf_counter()\n\n        try:\n            out = fn(*args, **kwargs)\n        except Exception:\n            mark(f\"{name}: EXCEPTION\")\n            print(traceback.format_exc(), flush=True)\n            raise\n\n        mark(f\"{name}: PYTHON RETURNED\")\n\n        if sync and torch.cuda.is_available():\n            mark(f\"{name}: CUDA SYNC START\")\n            torch.cuda.synchronize()\n            mark(f\"{name}: CUDA 
SYNC DONE\")\n\n        mark(f\"{name}: DONE in {time.perf_counter() - t0:.2f}s\")\n        cuda_mem(f\"{name}: after\")\n        return out\n\n\nThen use it everywhere:\n\n\n    @spaces.GPU(duration=180)\n    def infer(x):\n        mark(\"infer entered\")\n\n        x = run_step(\"preprocess\", preprocess, x, sync=False)\n        a = run_step(\"model_a\", model_a, x)\n        b_input = run_step(\"convert_a_to_b\", convert_a_to_b, a, sync=False)\n        b = run_step(\"model_b\", model_b, b_input)\n        out = run_step(\"postprocess\", postprocess, b, sync=False)\n\n        mark(\"returning\")\n        return out\n\n\n## How to interpret the logs\n\nLast log seen | Meaning\n---|---\n`model_b: START` | Python entered model B but did not return. Native call may be stuck.\n`model_b: PYTHON RETURNED` but no `CUDA SYNC DONE` | CUDA work did not complete; earlier async operation may be the real cause.\n`postprocess: DONE` but UI still spins | Output serialization / Gradio return path likely.\nMemory jumps before timeout | Combined memory / retained tensors likely.\n\nThis distinguishes three things people often merge together:\n\n\n    Python function call returned\n    CUDA work completed\n    Gradio response returned\n\n\nThey are not the same.\n\n* * *\n\n# 6. Force CUDA errors to appear closer to the cause\n\nFor one debug build, set:\n\n\n    CUDA_LAUNCH_BLOCKING=1\n    PYTHONFAULTHANDLER=1\n    TOKENIZERS_PARALLELISM=false\n\n\n`CUDA_LAUNCH_BLOCKING=1` is the standard CUDA/PyTorch debug move for asynchronous CUDA problems. PyTorch forum answers repeatedly recommend it because CUDA kernel errors may be reported at a later API call, making the stack trace misleading. 
(PyTorch Forums)\n\nThen add:\n\n\n    torch.cuda.synchronize()\n\n\nafter every model call and after every GPU tensor conversion.\n\nThe goal is to turn this:\n\n\n    some earlier CUDA issue\n    later random hang\n    timeout\n\n\ninto this:\n\n\n    model A returned\n    model A sync failed/hung\n\n\nThat tells you where to look.\n\n* * *\n\n# 7. Add a watchdog stack dump for hangs\n\nBecause your failure is a hang, not an exception, add `faulthandler`.\n\n\n    import sys\n    import faulthandler\n\n    faulthandler.enable(file=sys.stderr, all_threads=True)\n\n    faulthandler.dump_traceback_later(\n        60,\n        repeat=True,\n        file=sys.stderr,\n        exit=False,\n    )\n\n\nPython’s `faulthandler` is specifically designed to dump Python tracebacks on faults, after a timeout, or via signal. (bugs.python.org)\n\n## What it tells you\n\nIf the repeated traceback shows Python waiting here:\n\n\n    future.result()\n    queue.get()\n    thread.join()\n    client.predict()\n    requests.post()\n    for item in generator\n\n\nyou probably have a Python-level deadlock or blocking call.\n\nIf it repeatedly points to:\n\n\n    out = model_b(...)\n\n\nthen Python entered a native PyTorch/TorchScript/CUDA call and did not return.\n\nThose are different fixes.\n\n* * *\n\n# 8. 
Use a CPU firebreak between models\n\nSince every model works individually, test whether the GPU-to-GPU handoff is the problem.\n\nTemporarily replace:\n\n\n    a = model_a(x)\n    b = model_b(convert_a_to_b(a))\n\n\nwith:\n\n\n    a = model_a(x)\n    torch.cuda.synchronize()\n\n    a_cpu = detach_to_cpu(a)\n    del a\n    torch.cuda.empty_cache()\n\n    b_input = convert_a_to_b(a_cpu)\n    b_input = move_to_cuda(b_input)\n\n    b = model_b(b_input)\n    torch.cuda.synchronize()\n\n\nHelpers:\n\n\n    def detach_to_cpu(obj):\n        if isinstance(obj, torch.Tensor):\n            return obj.detach().cpu()\n        if isinstance(obj, dict):\n            return {k: detach_to_cpu(v) for k, v in obj.items()}\n        if isinstance(obj, list):\n            return [detach_to_cpu(v) for v in obj]\n        if isinstance(obj, tuple):\n            return tuple(detach_to_cpu(v) for v in obj)\n        return obj\n\n    def move_to_cuda(obj):\n        if isinstance(obj, torch.Tensor):\n            return obj.to(\"cuda\", non_blocking=False)\n        if isinstance(obj, dict):\n            return {k: move_to_cuda(v) for k, v in obj.items()}\n        if isinstance(obj, list):\n            return [move_to_cuda(v) for v in obj]\n        if isinstance(obj, tuple):\n            return tuple(move_to_cuda(v) for v in obj)\n        return obj\n\n\n## Interpret it\n\nResult | Meaning\n---|---\nCPU firebreak fixes the hang | GPU tensor lifetime, memory pressure, stream state, or async CUDA issue\nCPU firebreak does not fix it | Bad data conversion, threading, native call, or later output path\nCPU firebreak works but is slower | Good diagnostic result; optimize later\n\nThis is not meant as the final production design. It is a diagnostic cut.\n\n* * *\n\n# 9. 
Describe every tensor at every boundary\n\nAdd this:\n\n\n    def describe(name, obj, depth=0):\n        if depth > 2:\n            return\n\n        print(f\"[DESCRIBE] {name}: type={type(obj)}\", flush=True)\n\n        if isinstance(obj, torch.Tensor):\n            print(\n                f\"[DESCRIBE] {name}: \"\n                f\"shape={tuple(obj.shape)} \"\n                f\"dtype={obj.dtype} \"\n                f\"device={obj.device} \"\n                f\"requires_grad={obj.requires_grad} \"\n                f\"contiguous={obj.is_contiguous()}\",\n                flush=True,\n            )\n\n            if obj.numel() > 0 and obj.is_floating_point():\n                x = obj.detach()\n                print(\n                    f\"[DESCRIBE] {name}: \"\n                    f\"finite={torch.isfinite(x).all().item()} \"\n                    f\"nan={torch.isnan(x).any().item()} \"\n                    f\"inf={torch.isinf(x).any().item()}\",\n                    flush=True,\n                )\n            return\n\n        if isinstance(obj, dict):\n            print(f\"[DESCRIBE] {name}: keys={list(obj.keys())}\", flush=True)\n            for k, v in list(obj.items())[:10]:\n                describe(f\"{name}.{k}\", v, depth + 1)\n            return\n\n        if isinstance(obj, (list, tuple)):\n            print(f\"[DESCRIBE] {name}: len={len(obj)}\", flush=True)\n            for i, v in enumerate(obj[:10]):\n                describe(f\"{name}[{i}]\", v, depth + 1)\n            return\n\n        print(f\"[DESCRIBE] {name}: repr={repr(obj)[:500]}\", flush=True)\n\n\nCall it here:\n\n\n    describe(\"model_a_output\", a)\n    describe(\"model_b_input\", b_input)\n    describe(\"model_b_output\", b)\n\n\nYou are looking for:\n\n\n    CUDA tensor where CPU tensor expected\n    CPU tensor where CUDA tensor expected\n    float32 vs float16 vs bfloat16 mismatch\n    non-contiguous tensor\n    wrong batch dimension\n    unexpected tuple/list/dict structure\n    NaN 
or Inf values\n    very large shape\n    invalid token IDs\n\n\nThe individual model tests may not use the same real intermediate values as the integrated pipeline.\n\n* * *\n\n# 10. Temporarily make everything single-threaded / no-worker\n\nIntegrated pipelines often trigger hidden thread/process behavior that individual model tests do not.\n\nSearch your code for:\n\n\n    multiprocessing\n    ProcessPoolExecutor\n    ThreadPoolExecutor\n    DataLoader\n    num_workers\n    joblib\n    subprocess\n    queue\n    future.result\n    thread.join\n    asyncio\n    gradio_client.Client\n    requests.post\n    httpx\n\n\nFor a debug run:\n\n\n    import os\n\n    os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\n    os.environ[\"OMP_NUM_THREADS\"] = \"1\"\n    os.environ[\"MKL_NUM_THREADS\"] = \"1\"\n\n    import torch\n    torch.set_num_threads(1)\n    torch.set_num_interop_threads(1)\n\n\nFor `DataLoader`:\n\n\n    num_workers=0\n    pin_memory=False\n    persistent_workers=False\n\n\nPyTorch’s multiprocessing docs warn about “poison fork” with accelerators: if the accelerator runtime is initialized before forking, child processes can fail because the runtime is not fork-safe; the docs recommend avoiding accelerator initialization before forking and using `spawn` or `forkserver` when CUDA subprocesses are needed. (PyTorch Docs)\n\nIf single-thread/no-worker mode fixes it, you are debugging a worker/fork/thread issue, not a model issue.\n\n* * *\n\n# 11. 
Check for nested Space/API/self-calls\n\nSearch for:\n\n\n    gradio_client.Client\n    client.predict\n    requests.post\n    httpx.post\n    /queue/join\n    /gradio_api/call\n    localhost\n    SPACE_HOST\n\n\nA common integration deadlock is:\n\n\n    ZeroGPU request enters infer()\n    infer() calls another endpoint or same Space\n    that call waits on queue/GPU/quota\n    original request waits forever\n    duration expires\n\n\nThis can work locally because local execution does not use the same HF queue/GPU allocation path.\n\nFor one test, stub all external calls:\n\n\n    def call_external_model(*args, **kwargs):\n        # Placeholder: return a canned response shaped like the real one.\n        return synthetic_valid_response\n\n\nIf the hang disappears, your issue is orchestration, not inference.\n\n* * *\n\n# 12. Check output serialization separately\n\nAfter the full compute finishes, do not return the real result:\n\n\n    @spaces.GPU(duration=180)\n    def infer(x):\n        result = full_pipeline(x)\n        print(\"full pipeline computed\", flush=True)\n        return \"OK\"\n\n\nIf that works, progressively return:\n\n\n    return str(type(result))\n    return repr(result)[:1000]\n    return simplified_result\n    return final_result\n\n\nThis catches cases like:\n\n  * CUDA tensor returned directly,\n  * huge nested dict/list,\n  * custom object,\n  * generator,\n  * invalid file path,\n  * image/audio/video in a format Gradio does not expect,\n  * JSON with non-serializable values,\n  * numpy array with unexpected dtype/shape.\n\n\n\n* * *\n\n# 13. 
Make local more like ZeroGPU\n\nSince it works locally, try to make local fail by reducing differences.\n\nRun locally with:\n\n\n    CUDA_LAUNCH_BLOCKING=1\n    PYTHONFAULTHANDLER=1\n    TOKENIZERS_PARALLELISM=false\n    OMP_NUM_THREADS=1\n    MKL_NUM_THREADS=1\n\n\nUse the same:\n\n  * Python version,\n  * torch version,\n  * Gradio version,\n  * input,\n  * model load order,\n  * dtype,\n  * precision mode,\n  * cache state,\n  * batch size,\n  * output conversion path.\n\n\n\nAlso run local in a **fresh process** with a cold cache if possible. Warm local state can hide the problem.\n\n* * *\n\n# The exact experiment order I recommend\n\nRun these in this order:\n\n## Experiment 1: return before compute\n\n\n    @spaces.GPU(duration=180)\n    def infer(x):\n        return \"entered\"\n\n\nIf this fails, the callback/wrapper path is the problem.\n\n## Experiment 2: full compute, simple return\n\n\n    result = full_pipeline(x)\n    return \"OK\"\n\n\nIf this works, debug output serialization.\n\n## Experiment 3: stage returns\n\nMove a return after each pipeline stage until one fails.\n\n## Experiment 4: failing boundary matrix\n\nFor the first failing boundary A→B:\n\n\n    B synthetic input only\n    A then B synthetic input\n    A real output then B\n\n\n## Experiment 5: CPU firebreak\n\nMove outputs to CPU between models, delete GPU tensors, clear cache, then move only next input back.\n\n## Experiment 6: CUDA debug\n\nSet `CUDA_LAUNCH_BLOCKING=1`; add `torch.cuda.synchronize()` after each stage.\n\n## Experiment 7: hang watchdog\n\nAdd `faulthandler.dump_traceback_later(60, repeat=True)`.\n\n## Experiment 8: no-worker mode\n\nDisable multiprocessing, dataloader workers, tokenizer parallelism, and reduce thread counts.\n\nThis order is efficient because each step splits the search space in half.\n\n* * *\n\n# What I would bet on in your case\n\nGiven all the evidence, I would bet on one of these:\n\n## Most likely: bad inter-model boundary\n\nOne model’s 
actual output is not exactly what the next stage expects under ZeroGPU.\n\nTypical culprit:\n\n\n    dtype, device, shape, layout, contiguity, NaN/Inf, tuple/dict/list structure\n\n\n## Second most likely: GPU state or memory lifetime\n\nEach model fits alone, but the integrated function retains too much GPU state or carries a bad async CUDA state forward.\n\nTypical culprit:\n\n\n    intermediate tensors kept alive, no detach, no del, no CPU boundary, async CUDA error\n\n\n## Third most likely: hidden blocking call\n\nYour integration code may contain a queue, future, thread, worker, subprocess, HTTP call, or Gradio client call that only appears in the full pipeline.\n\nTypical culprit:\n\n\n    future.result(), queue.get(), thread.join(), client.predict(), requests.post()\n\n\n## Fourth most likely: output serialization\n\nThe actual full pipeline finishes, but the return value cannot be serialized cleanly by Gradio.\n\nTypical culprit:\n\n\n    CUDA tensor, custom object, huge nested dict, invalid file path, generator, malformed media output\n\n\n* * *\n\n## Short summary\n\n  * Since each model works alone, isolate **boundaries**, not models.\n  * Use **return sentinels** because logs are unreliable.\n  * Use the **A→B matrix**: B alone, A then B synthetic, A real output then B.\n  * Add `torch.cuda.synchronize()` after every stage.\n  * Use `CUDA_LAUNCH_BLOCKING=1` to reveal async CUDA issues.\n  * Add `faulthandler.dump_traceback_later()` to catch hangs.\n  * Test a **CPU firebreak** between models.\n  * Test `full_pipeline(); return \"OK\"` to rule out output serialization.\n\n",
  "title": "HF ZeroGPU Space Hangs, No Output in the logs"
}