External Publication

HF ZeroGPU Space Hangs, No Output in the logs

Hugging Face Forums [Unofficial] April 23, 2026

I’ve also done a bit of experimenting here with ZeroGPU and TorchScript. Before I knew it, the default PyTorch version for ZeroGPU Spaces had been updated to 2.11.0 (in cases like this, the documentation is often written after the fact, so it doesn’t always match the actual behavior). I suspect that compatibility between this version and TorchScript is quite questionable in practice, meaning how real weights actually behave rather than what should work in theory. It seems to work in some cases, but if the model can’t use the GPU and falls back to the CPU, it will be too slow to finish within the allotted duration and will likely time out.

Since I don’t have your actual code, this is purely speculation, but at this point the likely causes and the practical solutions are much clearer.

The short answer

The most likely cause is not a generic ZeroGPU failure.

It is more likely one of these:

something specific about your real TorchScript artifact,
how that artifact is placed on CUDA before callback entry,
an interaction between that artifact and your older runtime stack,
or a combination of those three.

That is the cleanest reading of the pattern now.

Why this is the right frame

The pattern you described points away from ordinary inference slowness and toward a worker-boundary problem:

the UI responds,
the click is registered,
the request enters the ZeroGPU path,
but control never reaches the first line of the decorated function.
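One way to pin that boundary down, assuming you can add a few lines to your app, is to write a durable marker just before the handler hands off and another on the first line inside the decorated function. The helper names and the marker file below are hypothetical; the point is that `flush()` plus `os.fsync()` forces each marker to disk before the next step runs, so a missing second marker is evidence rather than a buffering artifact.

```python
import os
import tempfile
import time

# Hypothetical marker file; any writable location inside the Space works.
MARKER_LOG = os.path.join(tempfile.gettempdir(), "boundary_markers.log")


def mark(label: str, path: str = MARKER_LOG) -> None:
    """Append a timestamped marker and force it to disk before returning."""
    line = f"{time.time():.3f} pid={os.getpid()} {label}\n"
    with open(path, "a", encoding="utf-8") as f:
        f.write(line)
        f.flush()             # push Python's buffer down to the OS
        os.fsync(f.fileno())  # ask the OS to commit it to storage
    print(line, end="", flush=True)  # also surface it in the container log


def read_markers(path: str = MARKER_LOG) -> list:
    """Return the labels written so far, in order (empty if no file yet)."""
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as f:
        return [ln.rstrip("\n").split(" ", 2)[2] for ln in f if ln.strip()]


# Usage sketch:
#   mark("before-gpu-call")       # in the Gradio handler, just before the call
#   mark("gpu-callback-entered")  # first line inside the @spaces.GPU function
```

In the failing pattern, the first marker should appear and the second should not. The PID is recorded because, on ZeroGPU, the decorated function may run in a different worker process from the handler.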

HF’s ZeroGPU docs still matter for the semantics here: @spaces.GPU is the hosted ZeroGPU entry mechanism, the decorator is effect-free outside ZeroGPU, and HF explicitly says ZeroGPU can have limited compatibility compared with standard GPU Spaces. That means a container can behave correctly locally while still failing at the hosted ZeroGPU worker transition. (huggingface.co)

So the real question is no longer “why is inference slow?” It is:

what is making the hosted ZeroGPU worker unhappy before the callback body starts?

That broader runtime-contract lens is still the right one here.
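Because `@spaces.GPU` is effect-free outside ZeroGPU, the same app file can in principle run locally and on the hosted worker. If the `spaces` package is not installed locally, a small shim like this keeps the file importable. The shim is a hypothetical helper, not part of the `spaces` API; it only mimics the bare `@spaces.GPU` and `@spaces.GPU(duration=...)` forms and otherwise does nothing.

```python
import functools
from typing import Any, Callable, Optional

try:
    import spaces  # installed on Hugging Face Spaces

    gpu = spaces.GPU
except ImportError:
    def gpu(func: Optional[Callable] = None, **_kwargs: Any):
        """Hypothetical no-op stand-in for spaces.GPU, with or without arguments."""
        def wrap(f: Callable) -> Callable:
            @functools.wraps(f)
            def inner(*args: Any, **kwargs: Any) -> Any:
                return f(*args, **kwargs)

            return inner

        return wrap(func) if func is not None else wrap


@gpu(duration=60)  # duration is an example value, in seconds
def infer(x: int) -> int:
    # First line of the callback body: the natural place for an entry marker.
    return x * 2
```

A local run that passes through this shim tells you nothing about the hosted worker transition, which is exactly why a "works locally" result does not rule out the boundary problem.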

Most likely causes

1. Your real TorchScript artifact is the top suspect

This is now the strongest explanation.

At this point, the observed evidence points away from a blanket “TorchScript never works on ZeroGPU” interpretation and much more toward something specific about your actual .pt file.

That “something specific” could be:

graph complexity,
operator set,
custom classes or custom ops,
serialization-time assumptions,
export-time environment differences,
or behavior that appears only on the real forward path.

I can’t say which one it is without seeing the artifact. But the evidence now supports “artifact-specific” much more strongly than “platform-wide.”
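If the artifact at least loads on CPU, a quick inventory of its operators can narrow the list above. This is a sketch assuming a PyTorch build where `torch.jit` is still available: it walks the inlined TorchScript graph, including nested blocks such as `if` branches and loop bodies, and flags any operator namespace other than `aten` and `prim`, which is a cheap first signal of custom ops. `inventory_ops` and `TinyNet` are hypothetical names, and a clean report does not prove the artifact is fine.

```python
from collections import Counter

import torch


def _walk(block):
    """Yield every node in a graph or block, recursing into nested blocks."""
    for node in block.nodes():
        yield node
        for sub in node.blocks():  # e.g. the branches of prim::If, bodies of prim::Loop
            yield from _walk(sub)


def inventory_ops(path: str):
    """Load a TorchScript file on CPU and count the operator kinds it uses."""
    model = torch.jit.load(path, map_location="cpu")  # no CUDA involved
    kinds = Counter(node.kind() for node in _walk(model.inlined_graph))
    # Anything outside the built-in aten/prim namespaces deserves a closer look.
    unusual = sorted(k for k in kinds if k.split("::")[0] not in {"aten", "prim"})
    return kinds, unusual


class TinyNet(torch.nn.Module):
    """Toy stand-in for the real artifact, only used to exercise the helper."""

    def __init__(self) -> None:
        super().__init__()
        self.linear = torch.nn.Linear(4, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.linear(x))
```

On the real file this would be `kinds, unusual = inventory_ops("model.pt")` (placeholder path), then printing `kinds.most_common()` and `unusual`.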

Why this matters

This shifts the debugging target from the platform to the model artifact itself.

That is a big difference. It means the likely fix is not “tune the queue” or “increase duration.” It is more likely:

change how the artifact is loaded,
change where it is placed,
re-export it,
or run it on a newer baseline.

2. Module-level CUDA placement may be the real trigger

This is the second most likely cause.

There is an important difference between:

loading a TorchScript model at startup, and
placing that model on CUDA at startup before callback entry.

Those are not the same thing operationally.

The symptom you described — failure before the first line inside @spaces.GPU — is very consistent with a problem that happens before the model’s useful forward path starts. One very plausible way to get that is:

model exists at module scope,
model is moved to CUDA too early for the hosted ZeroGPU path,
then worker entry breaks before user code inside the callback begins.

So I would now treat this as a central hypothesis:

the real problem may be startup-time CUDA placement of the real TorchScript model, not TorchScript loading by itself.

That would explain why a normal local process can work and the hosted worker still fails.

3. Your older stack may be amplifying the issue

This is still a serious suspect.

Your failing stack is older than the current template-style ZeroGPU baseline. That matters because public issue history shows @spaces.GPU behavior can be sensitive to Gradio/runtime version changes. There is a Gradio issue where the decorator path itself appears to be involved in model-loading failure behavior on ZeroGPU. (github.com)

So the older stack is probably not just a neutral background detail. It may be:

making the artifact problem easier to trigger,
or exposing a boundary condition that the newer baseline handles better.

I would now think of your runtime versions as part of the problem surface, not just as static facts.

4. Basic `torch.jit.load(...)` is probably not the main problem anymore

I would move this lower on the list.

PyTorch’s docs describe torch.jit.load as the standard way to load a saved ScriptModule, with normal file-based behavior and map_location support. That basic API path is not, by itself, the most suspicious thing now.

So I would separate these two ideas clearly:

basic TorchScript file loading → probably not the central issue
your real artifact’s behavior after or around load → still highly suspicious

That distinction matters because it changes the solution strategy.

5. `torch.jit.optimized_execution(False)` is probably not the main fix

This flag is real and useful, but I no longer think it is central.

There is real PyTorch issue history around first-call TorchScript optimization overhead, and torch.jit.optimized_execution(False) is relevant to that class of problem.

But your failure pattern is earlier than that:

it happens before the useful callback body begins.

So my updated read is:

this flag may help with runtime cost or first-pass overhead,
but it is probably not the reason the hosted callback boundary fails.

It is a secondary control, not the main solution.
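If you still want to test the flag, it is a context manager, so it can be scoped to just the calls you are measuring. A minimal sketch, assuming `model` is any loaded ScriptModule and `example` a matching input tensor; `time_calls` is a hypothetical helper that only separates the first call from later ones, which is the overhead this flag is about.

```python
import time

import torch


def time_calls(model, example: torch.Tensor, optimize: bool, n: int = 3) -> list:
    """Wall-clock seconds for n consecutive calls with JIT optimization on/off."""
    timings = []
    with torch.jit.optimized_execution(optimize), torch.no_grad():
        for _ in range(n):
            start = time.perf_counter()
            model(example)
            if example.is_cuda:
                torch.cuda.synchronize()  # wait for queued GPU work before timing
            timings.append(time.perf_counter() - start)
    return timings
```

TorchScript may cache an optimized plan after the first calls, so a fairer comparison loads the artifact fresh for each setting. A large first entry that shrinks afterwards is the kind of cost this flag targets; none of it explains a callback that is never entered.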

6. TorchScript’s current ecosystem status raises the risk of edge cases

This is background rather than the root cause, but it matters.

PyTorch’s current docs mark TorchScript as deprecated and recommend torch.export going forward. That does not mean your artifact should fail today, but it does mean TorchScript is no longer the most future-facing or highest-priority path in the ecosystem. (docs.pytorch.org)

So if you are hitting a complex edge case involving:

hosted runtime behavior,
worker-boundary timing,
and a real scripted artifact,

that is no longer surprising in the way it would have been when TorchScript was the clear primary path.

What is probably not the cause anymore

These are now lower-probability primary causes:

Not the main cause: UI complexity

You already reduced the UI enough that this should not be first on the list.

Not the main cause: logging blind spots

You used flushes and durable writes, and both point to the same boundary.

Not the main cause: hidden setup inside the callback body

The callback body never gets control in the failing pattern.

Not the main cause: generic ZeroGPU cannot run callbacks

The broader evidence now points away from that.

Not the main cause: generic TorchScript incompatibility

The broader evidence now points away from that too.

Solutions

Now that the likely causes are narrower, the solutions are much more concrete.

Solution 1: Use the newer baseline as your reference environment

This is the most important practical move.

Do not treat the older failing Space as the only source of truth anymore.

Instead, treat the newer/current-style ZeroGPU baseline as the control environment and compare your real model against that.

Why this is the right move

Because it removes a whole class of ambiguity:

if the real model fails there too, the artifact becomes the prime suspect;
if the real model works there, the older stack becomes the stronger suspect.

That is much more informative than continuing to debug in the older environment alone.

Solution 2: Separate CPU-load from CUDA-placement

This is probably the single most important diagnostic and architectural split now.

You should think in two stages:

Stage A: can the real TorchScript artifact be loaded and kept on CPU at startup?

If not, the artifact itself is the main suspect.

Stage B: what changes when it is placed on CUDA before callback entry?

If CPU-load is fine but startup CUDA placement breaks hosted callback entry, then the fix is likely to be about when and where you move the model to CUDA.
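The two stages can be sketched roughly like this, assuming you pass in the real artifact path and an example input of the right shape (both placeholders; the function names are hypothetical). Stage A never touches CUDA; Stage B adds only the device move and one forward pass, so if A passes and B misbehaves on the hosted worker, placement is implicated. Printing with `flush=True` keeps each line visible in the Space logs even if the process dies right after.

```python
import time

import torch


def stage_a(path: str, example: torch.Tensor):
    """Stage A: load on CPU and run one forward pass on CPU."""
    t0 = time.perf_counter()
    model = torch.jit.load(path, map_location="cpu")
    model.eval()
    with torch.no_grad():
        out = model(example)
    print(f"[stage A] load+forward on CPU: {time.perf_counter() - t0:.2f}s", flush=True)
    return model, out


def stage_b(model, example: torch.Tensor, device: str = "cuda"):
    """Stage B: move the already-loaded model to `device` and run it there."""
    if device == "cuda" and not torch.cuda.is_available():
        print("[stage B] CUDA not visible in this process; skipping", flush=True)
        return None
    t0 = time.perf_counter()
    model = model.to(device)
    with torch.no_grad():
        out = model(example.to(device))
    print(f"[stage B] move+forward on {device}: {time.perf_counter() - t0:.2f}s", flush=True)
    return out
```

Running `stage_a` at module scope, then `stage_b` once at module scope and once inside the decorated function, is one way to see whether placement timing, rather than the artifact itself, is what breaks.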

Practical consequence

If startup CUDA placement is the trigger, the likely short-term fix is:

keep the model on CPU at startup,
only move it to CUDA and run it inside the @spaces.GPU-decorated function, whether for debugging or as part of a revised serving design.

That is not a performance claim. It is a stability-first debugging move.
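A sketch of that split, assuming the `spaces` package on the hosted side and a placeholder `model.pt`; `get_cpu_model`, `predict`, and the `duration=60` value are illustrative, not taken from your app. As far as I know, HF’s ZeroGPU examples usually call `.to('cuda')` at module scope for ordinary models, so treat this as a deliberate debugging variant rather than the standard pattern.

```python
import functools

import torch

try:
    import spaces  # available on Hugging Face Spaces
    gpu = spaces.GPU
except ImportError:  # local run without the `spaces` package: no-op decorator
    def gpu(func=None, **_kwargs):
        return func if func is not None else (lambda f: f)

MODEL_PATH = "model.pt"  # placeholder for the real artifact


@functools.lru_cache(maxsize=1)
def get_cpu_model(path: str = MODEL_PATH):
    """Load the TorchScript artifact once, on CPU only, and cache it."""
    return torch.jit.load(path, map_location="cpu").eval()


@gpu(duration=60)
def predict(x: torch.Tensor, path: str = MODEL_PATH) -> torch.Tensor:
    # CUDA placement happens only here, inside the ZeroGPU-managed call.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = get_cpu_model(path)
    try:
        model.to(device)  # Module.to moves parameters in place
        with torch.no_grad():
            return model(x.to(device)).cpu()
    finally:
        model.to("cpu")  # keep the cached copy on CPU between calls
```

Moving the model on every call costs latency, which fits the stability-first goal. The sketch also does not guard against concurrent calls sharing the cached model; with concurrent requests you would want a lock or a per-call copy.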

Solution 3: If the real artifact works on the newer baseline, migrate the original app toward that baseline

If the real model behaves on the newer setup, then the root problem is probably not the artifact alone.

At that point, the best solution is:

move the original app toward the newer runtime path instead of continuing to preserve the older one.

That means:

newer Gradio path,
newer spaces behavior,
current template-style structure,
and then reintroducing your app logic carefully.

In that scenario, trying to preserve the older runtime as the “real” environment just slows you down.

Solution 4: If the real artifact still fails on the newer baseline, treat it as an export/serialization problem

If the real .pt artifact fails even in the newer reference setup, then the likely solutions shift toward the artifact itself:

re-save or re-export it under a newer PyTorch stack,
simplify or isolate the problematic graph,
identify unusual/custom components,
or consider moving away from TorchScript if it is becoming a long-term maintenance liability.
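For the re-save option, a minimal sketch: load the existing artifact on CPU under the newer stack, save it again, reload the new file, and check that outputs match on a fixed input before swapping it in. The paths, example input, and tolerance are assumptions, and the helper assumes the model returns a single tensor. Re-saving only refreshes serialization; re-scripting or re-exporting from the original Python source is the stronger fix if you still have it.

```python
import torch


def resave_and_verify(old_path: str, new_path: str, example: torch.Tensor,
                      atol: float = 1e-5) -> bool:
    """Re-serialize a TorchScript file under the current PyTorch and compare outputs."""
    old = torch.jit.load(old_path, map_location="cpu").eval()
    torch.jit.save(old, new_path)  # written with the current serializer
    new = torch.jit.load(new_path, map_location="cpu").eval()
    with torch.no_grad():
        ref, got = old(example), new(example)  # assumes a single-tensor output
    return torch.allclose(ref, got, atol=atol)
```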

Long-term direction

Because TorchScript is deprecated, torch.export becomes the most natural long-term direction if the scripted artifact turns out to be the recurring source of hosted-runtime pain. (docs.pytorch.org)

That is not a recommendation to rewrite everything immediately. It is the likely strategic path if the artifact itself turns out to be the core issue.

Solution 5: Trust live build/runtime evidence for exact defaults, and docs for semantics

This is a practical rule rather than a code change, but it matters.

For exact current defaults, live build/runtime behavior is often more reliable than prose docs that may lag. For high-level semantics — like what @spaces.GPU does, how ZeroGPU differs from standard GPU Spaces, and what config keys matter — the docs still matter. (huggingface.co)

That is the right way to combine the two sources of truth.
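A cheap way to collect that live evidence is to print, at startup, the versions the running container actually resolved. A minimal sketch using only the standard library; the package list is an assumption, so extend it with whatever your requirements pin.

```python
import platform
from importlib import metadata

# Packages whose resolved versions matter here (adjust to your requirements).
PACKAGES = ("torch", "gradio", "spaces", "huggingface_hub")


def runtime_snapshot(packages=PACKAGES) -> dict:
    """Return the Python version and installed package versions (None if absent)."""
    snap = {"python": platform.python_version()}
    for name in packages:
        try:
            snap[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            snap[name] = None
    return snap


if __name__ == "__main__":
    for key, value in runtime_snapshot().items():
        print(f"{key}={value}", flush=True)
```

Printing the same snapshot from the older failing Space and from the newer baseline gives you a concrete version diff to reason about, instead of relying on what the docs say the defaults are.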

The clearest overall diagnosis

If I had to state the diagnosis as plainly as possible:

Your failure is most likely caused by your real TorchScript artifact or its startup/device-placement behavior, with your older runtime stack acting as a likely amplifier.

That is where I would put my confidence now.

The clearest overall solution path

If I had to state the solution path as plainly as possible:

use the newer baseline as the reference
test the real model there
split CPU-load from startup CUDA placement
if needed, migrate the old app toward the newer baseline
if needed, re-export / modernize the artifact path

That is the highest-leverage path now.

Final ranking

Most likely causes

real TorchScript artifact
module-level CUDA placement of that artifact
older runtime stack interaction
artifact-specific runtime/operator behavior
generic ZeroGPU problem as a distant possibility

Most likely solutions

move the real model into the newer baseline
separate CPU-load from CUDA-placement
migrate toward the newer stack if the real model works there
re-export or modernize the artifact if it still fails there
treat optimized_execution(False) as secondary, not primary

Final takeaway

The best framing now is not:

“Why does ZeroGPU hang?”

The better framing is:

“Why does my real TorchScript artifact, or its startup/device placement, fail on my older hosted path when a simple TorchScript artifact can work on the current baseline?”

That is the actual problem now.

And that is a much more solvable problem.

The high-level rule remains: solve this as a runtime-contract and artifact-boundary issue, not as a generic inference-speed issue.
