HF ZeroGPU Space Hangs, No Output in the logs
I’ve also done a bit of experimenting here with ZeroGPU and TorchScript. Before I knew it, the default PyTorch version for ZeroGPU Spaces had been updated to 2.11.0 (in cases like this, the documentation is often written after the fact, so it doesn’t always match the actual behavior). I suspect that compatibility between this version and TorchScript is quite questionable in practice, in terms of how the weights actually behave, even if it should work in theory. It seems to work in some cases, but if it can’t use the GPU and falls back to the CPU, it will be too slow to fit within the allotted duration and will likely time out.
Since I don’t have the actual code, this is purely speculation, but:
At this point, the likely causes and the practical solutions are much clearer.
The short answer
The most likely cause is not a generic ZeroGPU failure.
It is more likely one of these:
- something specific about your real TorchScript artifact,
- how that artifact is placed on CUDA before callback entry,
- an interaction between that artifact and your older runtime stack,
- or a combination of those three.
That is the cleanest reading of the pattern now.
Why this is the right frame
The pattern you described points away from ordinary inference slowness and toward a worker-boundary problem:
- the UI responds,
- the click is registered,
- the request enters the ZeroGPU path,
- but control never reaches the first line of the decorated function.
HF’s ZeroGPU docs still matter for the semantics here: @spaces.GPU is the hosted ZeroGPU entry mechanism, the decorator is effect-free outside ZeroGPU, and HF explicitly says ZeroGPU can have limited compatibility compared with standard GPU Spaces. That means a container can behave correctly locally while still failing at the hosted ZeroGPU worker transition. (huggingface.co)
So the real question is no longer “why is inference slow?” It is:
what is making the hosted ZeroGPU worker unhappy before the callback body starts?
That broader runtime-contract lens is still the right one here.
Most likely causes
1. Your real TorchScript artifact is the top suspect
This is now the strongest explanation.
At this point, the observed evidence points away from a blanket “TorchScript never works on ZeroGPU” interpretation and much more toward something specific about your actual .pt file.
That “something specific” could be:
- graph complexity,
- operator set,
- custom classes or custom ops,
- serialization-time assumptions,
- export-time environment differences,
- or behavior that appears only on the real forward path.
I am not claiming which one without seeing the artifact. But the evidence now supports “artifact-specific” much more strongly than “platform-wide.”
Why this matters
This shifts the debugging target from the platform to the model artifact itself.
That is a big difference. It means the likely fix is not “tune the queue” or “increase duration.” It is more likely:
- change how the artifact is loaded,
- change where it is placed,
- re-export it,
- or run it on a newer baseline.
2. Module-level CUDA placement may be the real trigger
This is the second most likely cause.
There is an important difference between:
- loading a TorchScript model at startup, and
- placing that model on CUDA at startup before callback entry.
Those are not the same thing operationally.
The symptom you described — failure before the first line inside @spaces.GPU — is very consistent with a problem that happens before the model’s useful forward path starts. One very plausible way to get that is:
- model exists at module scope,
- model is moved to CUDA too early for the hosted ZeroGPU path,
- then worker entry breaks before user code inside the callback begins.
So I would now treat this as a central hypothesis:
the real problem may be startup-time CUDA placement of the real TorchScript model, not TorchScript loading by itself.
That would explain why a normal local process can work and the hosted worker still fails.
3. Your older stack may be amplifying the issue
This is still a serious suspect.
Your failing stack is older than the current template-style ZeroGPU baseline. That matters because public issue history shows @spaces.GPU behavior can be sensitive to Gradio/runtime version changes. There is a Gradio issue where the decorator path itself appears to be involved in model-loading failure behavior on ZeroGPU. (github.com)
So the older stack is probably not just a neutral background detail. It may be:
- making the artifact problem easier to trigger,
- or exposing a boundary condition that the newer baseline handles better.
I would now think of your runtime versions as part of the problem surface, not just as static facts.
4. Basic torch.jit.load(...) is probably not the main problem anymore
I would move this lower on the list.
PyTorch’s docs describe torch.jit.load as the standard way to load a saved ScriptModule, with normal file-based behavior and map_location support. That basic API path is not, by itself, the most suspicious thing now.
So I would separate these two ideas clearly:
- basic TorchScript file loading → probably not the central issue
- your real artifact’s behavior after or around load → still highly suspicious
That distinction matters because it changes the solution strategy.
5. torch.jit.optimized_execution(False) is probably not the main fix
This flag is real and useful, but I no longer think it is central.
There is real PyTorch issue history around first-call TorchScript optimization overhead, and torch.jit.optimized_execution(False) is relevant to that class of problem.
But your failure pattern is earlier than that:
- it happens before the useful callback body begins.
So my updated read is:
- this flag may help with runtime cost or first-pass overhead,
- but it is probably not the reason the hosted callback boundary fails.
It is a secondary control, not the main solution.
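If you still want to rule this flag in or out, a minimal sketch of how it is usually applied is below. The helper name forward_without_jit_optimization is my own, not something from your app, and it assumes your scripted model is an ordinary callable.

```python
def forward_without_jit_optimization(model, *inputs):
    """Call a scripted model with the TorchScript executor's optimizations off."""
    import torch  # imported lazily so this file can be imported without torch

    # torch.jit.optimized_execution is a context manager; passing False turns off
    # the executor's optimization passes for calls made inside the block.
    with torch.jit.optimized_execution(False):
        return model(*inputs)
```

This only affects calls made inside the with block, so it can change first-call cost but cannot influence anything that goes wrong before the callback body runs.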
6. TorchScript’s current ecosystem status raises the risk of edge cases
This is background rather than the root cause, but it matters.
PyTorch’s current docs mark TorchScript as deprecated and recommend torch.export going forward. That does not mean your artifact should fail today, but it does mean TorchScript is no longer the most future-facing or highest-priority path in the ecosystem. (docs.pytorch.org)
So if you are hitting a complex edge case involving:
- hosted runtime behavior,
- worker-boundary timing,
- and a real scripted artifact,
that is no longer surprising in the way it would have been when TorchScript was the clear primary path.
What is probably not the cause anymore
These are now lower-probability primary causes:
Not the main cause: UI complexity
You already reduced the UI enough that this should not be first on the list.
Not the main cause: logging blind spots
You used flushes and durable writes, and both point to the same boundary.
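For anyone reading along, the flush-plus-durable-write logging mentioned here can be sketched roughly as follows. This is an illustration, not your code: the helper name durable_log and the file name boundary_debug.log are hypothetical.

```python
import os
import time


def durable_log(message, path="boundary_debug.log"):
    """Write a timestamped marker to stdout and to a file, forcing both out at once.

    `path` is a hypothetical location; use any directory the Space can write to.
    """
    line = f"{time.strftime('%Y-%m-%dT%H:%M:%S')} pid={os.getpid()} {message}"
    # flush=True pushes the line past Python's stdout buffer immediately.
    print(line, flush=True)
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(line + "\n")
        fh.flush()             # Python's buffer -> operating system
        os.fsync(fh.fileno())  # ask the OS to commit the data to storage
    return line
```

Call it once right before the decorated function is invoked and once as the first statement inside it. If only the first marker ever shows up, in both stdout and the file, the stall is at the boundary rather than in the logging.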
Not the main cause: hidden setup inside the callback body
The callback body never gets control in the failing pattern.
Not the main cause: a generic “ZeroGPU cannot run callbacks” failure
The broader evidence now points away from that.
Not the main cause: generic TorchScript incompatibility
The broader evidence now points away from that too.
Solutions
Now that the likely causes are narrower, the solutions are much more concrete.
Solution 1: Use the newer baseline as your reference environment
This is the most important practical move.
Do not treat the older failing Space as the only truth source anymore.
Instead, treat the newer/current-style ZeroGPU baseline as the control environment and compare your real model against that.
Why this is the right move
Because it removes a whole class of ambiguity:
- if the real model fails there too, the artifact becomes the prime suspect;
- if the real model works there, the older stack becomes the stronger suspect.
That is much more informative than continuing to debug in the older environment alone.
Solution 2: Separate CPU-load from CUDA-placement
This is probably the single most important diagnostic and architectural split now.
You should think in two stages:
Stage A: can the real TorchScript artifact be loaded and kept on CPU at startup?
If not, the artifact itself is the main suspect.
Stage B: what changes when it is placed on CUDA before callback entry?
If CPU-load is fine but startup CUDA placement breaks hosted callback entry, then the fix is likely to be about when and where you move the model to CUDA.
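One way to run Stage A and Stage B as separate, individually logged steps is a small stage runner like the sketch below. The function run_stages is hypothetical, and the wiring in the trailing comment uses model.pt as a placeholder path, since I don't have your artifact.

```python
import time
import traceback


def run_stages(stages, log=print):
    """Run named stages in order and stop at the first one that raises.

    `stages` is a list of (name, callable) pairs; each callable receives the
    previous stage's result (None for the first stage).
    Returns (timings, failed_name), where failed_name is None on full success.
    """
    timings = {}
    value = None
    for name, step in stages:
        log(f"[stage:{name}] start")
        started = time.perf_counter()
        try:
            value = step(value)
        except Exception:
            log(f"[stage:{name}] FAILED\n{traceback.format_exc()}")
            return timings, name
        timings[name] = time.perf_counter() - started
        log(f"[stage:{name}] ok in {timings[name]:.3f}s")
    return timings, None


# Hypothetical wiring (requires torch; "model.pt" is a placeholder path):
#   import torch
#   run_stages([
#       ("cpu_load", lambda _: torch.jit.load("model.pt", map_location="cpu")),
#       ("cuda_move", lambda m: m.to("cuda")),
#   ])
```

A hang does not raise, so in that case the last "start" line without a matching "ok" line tells you which stage never returned.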
Practical consequence
If startup CUDA placement is the trigger, the likely short-term fix is:
- keep the model on CPU earlier,
- only move/use it inside the ZeroGPU-managed path as needed for debugging or a revised serving design.
That is not a performance claim. It is a stability-first debugging move.
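A minimal sketch of that split, assuming the model can be loaded with map_location set to CPU and exposes the usual .to(device) method. DeferredDeviceModel and load_torchscript_on_cpu are names I made up for illustration, and model.pt is a placeholder path.

```python
class DeferredDeviceModel:
    """Hold a model loaded on CPU and move it to another device only on request."""

    def __init__(self, load_fn):
        # load_fn must return a model that already lives on CPU.
        self._model = load_fn()
        self._device = "cpu"

    def on(self, device):
        """Return the model on `device`, moving it only if it is elsewhere."""
        if device != self._device:
            self._model = self._model.to(device)
            self._device = device
        return self._model


def load_torchscript_on_cpu(path):
    """Load a TorchScript file with all tensors mapped to CPU."""
    import torch  # imported lazily so this file can be imported without torch

    return torch.jit.load(path, map_location="cpu")


# Hypothetical wiring in app.py:
#   holder = DeferredDeviceModel(lambda: load_torchscript_on_cpu("model.pt"))
#
#   @spaces.GPU
#   def predict(x):
#       model = holder.on("cuda")  # first CUDA touch happens inside the callback
#       return model(x.to("cuda"))
```

Whether the moved model persists between calls depends on how the hosted worker is managed, so assume the move may repeat on every request. That costs time, but the point here is to see whether callback entry starts working once nothing touches CUDA at module scope.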
Solution 3: If the real artifact works on the newer baseline, migrate the original app toward that baseline
If the real model behaves on the newer setup, then the root problem is probably not the artifact alone.
At that point, the best solution is:
move the original app toward the newer runtime path instead of continuing to preserve the older one.
That means:
- newer Gradio path,
- newer spaces behavior,
- current template-style structure,
- and then reintroducing your app logic carefully.
In that scenario, trying to preserve the older runtime as the “real” environment just slows you down.
Solution 4: If the real artifact still fails on the newer baseline, treat it as an export/serialization problem
If the real .pt artifact fails even in the newer reference setup, then the likely solutions shift toward the artifact itself:
- re-save or re-export it under a newer PyTorch stack,
- simplify or isolate the problematic graph,
- identify unusual/custom components,
- or consider moving away from TorchScript if it is becoming a long-term maintenance liability.
Long-term direction
Because TorchScript is deprecated, torch.export becomes the most natural long-term direction if the scripted artifact turns out to be the recurring source of hosted-runtime pain. (docs.pytorch.org)
That is not a recommendation to rewrite everything immediately. It is the likely strategic path if the artifact itself turns out to be the core issue.
Solution 5: Trust live build/runtime evidence for exact defaults, and docs for semantics
This is a practical rule rather than a code change, but it matters.
For exact current defaults, live build/runtime behavior is often more reliable than prose docs that may lag. For high-level semantics — like what @spaces.GPU does, how ZeroGPU differs from standard GPU Spaces, and what config keys matter — the docs still matter. (huggingface.co)
That is the right way to combine the two sources of truth.
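To get that live evidence, a small startup snippet can print what is actually installed in the running container. This is a sketch using only the standard library; the function name runtime_versions is hypothetical, and the default package list is just my guess at what matters for this case.

```python
import platform
from importlib import metadata


def runtime_versions(packages=("torch", "gradio", "spaces")):
    """Return installed versions for the given distributions, or None if absent."""
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    for name, version in runtime_versions().items():
        print(f"{name}=={version}", flush=True)
```

Running the same snippet in the older Space and in the newer baseline gives two reports you can compare line by line, instead of relying on what the docs say the defaults should be.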
The clearest overall diagnosis
If I had to state the diagnosis as plainly as possible:
Your failure is most likely caused by your real TorchScript artifact or its startup/device-placement behavior, with your older runtime stack acting as a likely amplifier.
That is where I would put my confidence now.
The clearest overall solution path
If I had to state the solution path as plainly as possible:
- use the newer baseline as the reference
- test the real model there
- split CPU-load from startup CUDA placement
- if needed, migrate the old app toward the newer baseline
- if needed, re-export / modernize the artifact path
That is the highest-leverage path now.
Final ranking
Most likely causes
- real TorchScript artifact
- module-level CUDA placement of that artifact
- older runtime stack interaction
- artifact-specific runtime/operator behavior
- generic ZeroGPU problem as a distant possibility
Most likely solutions
- move the real model into the newer baseline
- separate CPU-load from CUDA-placement
- migrate toward the newer stack if the real model works there
- re-export or modernize the artifact if it still fails there
- treat optimized_execution(False) as secondary, not primary
Final takeaway
The best framing now is not:
“Why does ZeroGPU hang?”
The better framing is:
“Why does my real TorchScript artifact, or its startup/device placement, fail on my older hosted path when a simple TorchScript artifact can work on the current baseline?”
That is the actual problem now.
And that is a much more solvable problem.
The high-level rule remains: solve this as a runtime-contract and artifact-boundary issue, not as a generic inference-speed issue.