External Publication

Stable Diffusion XL much slower with candle than with diffusers

Hugging Face Forums [Unofficial] March 28, 2026

Hmm… It’s surprising that changing the scheduler doesn’t change the results… I’d think that in Rust, dependencies are less likely to break over time compared to Python…

If you’re set on a native Rust implementation, you might want to try tch-rs first. If PyTorch’s optimizations are the main contributor to the speed, this approach is more direct.

Based on what you already tested, the best next step is not another small Candle-side tweak.

The short answer

For SDXL today, the best practical move is:

Use Python Diffusers with torch.compile for image generation now.
Keep Rust around it, not inside the diffusion core yet.
If Rust for the core is mandatory, evaluate tch-rs before diffusers-rs for SDXL.
Only keep digging into Candle if your goal is debugging or contributing upstream, not just getting a fast, correct SDXL pipeline. (Hugging Face)

Why this is the right move

Your recent tests already ruled out the two easiest explanations:

changing float type barely helped
changing scheduler did not fix speed or image correctness

That means the problem is probably not “one wrong config value.” It is more likely a combination of a runtime-stack gap and an implementation-parity gap for SDXL in Candle. Diffusers on PyTorch 2 gets optimized attention by default through SDPA, and Hugging Face documents torch.compile as an extra performance layer on top. PyTorch’s own SDXL acceleration post says those native optimizations can push diffusion inference up to 3× faster. Candle, by contrast, has an active RFC proposing a future tracing/compilation layer with JIT, kernel fusion, and static buffer planning, which strongly suggests that this kind of optimization layer is still emerging there rather than already mature in the current path. (Hugging Face)

What I would do next, in order

Option A. Best overall: keep Diffusers for generation

If the goal is good SDXL images plus speed now, stay with Diffusers + PyTorch 2 + torch.compile. That path is already working for you, and it matches the official optimization guidance: Diffusers supports SDPA by default on PyTorch 2, and recommends torch.compile for extra speed. PyTorch’s SDXL acceleration writeup is basically the background explanation for the speedup you just saw. (Hugging Face)
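For reference, a minimal sketch of that setup looks roughly like this. The model ID, dtype, step count, and guidance scale are my assumptions, not values from your benchmark, and the helper names `load_compiled_sdxl` and `generate` are only illustrative. The `torch.compile` arguments follow the pattern shown in the Diffusers optimization docs, so check them against the version you have installed.

```python
def load_compiled_sdxl(model_id="stabilityai/stable-diffusion-xl-base-1.0",
                       compile_unet=True):
    """Load SDXL in fp16 on CUDA and optionally compile the UNet.

    Imports are local so this file can be imported on a machine
    that does not have torch/diffusers installed.
    """
    import torch
    from diffusers import StableDiffusionXLPipeline

    pipe = StableDiffusionXLPipeline.from_pretrained(
        model_id,                      # assumption: the public SDXL base checkpoint
        torch_dtype=torch.float16,
        variant="fp16",
        use_safetensors=True,
    ).to("cuda")

    if compile_unet:
        # The UNet dominates per-step cost, so it is the part worth compiling.
        pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
    return pipe


def generate(pipe, prompt, steps=30, guidance_scale=7.0, generator=None):
    """Run one generation and return the first image.

    Pass a seeded torch.Generator as `generator` for reproducible output.
    """
    result = pipe(
        prompt=prompt,
        num_inference_steps=steps,
        guidance_scale=guidance_scale,
        generator=generator,
    )
    return result.images[0]


# Typical use (needs a CUDA GPU and the model weights):
# pipe = load_compiled_sdxl()
# image = generate(pipe, "a photo of an astronaut riding a horse")
# image.save("out.png")
```

Keep in mind that the first call after compiling is slow because compilation happens there, so benchmark the second and later calls.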

The clean architecture is:

Rust app for orchestration, UI, job queue, prompt handling, file management
Python worker or microservice for SDXL generation
communicate over CLI, HTTP, gRPC, or a local queue

That is not elegant in the “single-language purity” sense, but it is the fastest route to a system that is fast, correct, and maintainable. This recommendation is an inference from the facts above and from your own benchmark outcome. (Hugging Face)
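To make the Python worker side of that split concrete, here is a rough sketch of a line-oriented worker that a Rust process could spawn and talk to over stdin/stdout. The JSON field names (`id`, `prompt`, `seed`, `output_path`) and the function names are hypothetical, and `fake_generate` is a stand-in for the real SDXL call. It is one possible protocol, not a recommendation of a specific one.

```python
import json
import sys


def fake_generate(prompt, seed, output_path):
    """Placeholder for the real SDXL call; a real version would save an image."""
    return {"output_path": output_path}


def handle_line(line, generate=fake_generate):
    """Turn one JSON job line into one JSON reply line. Never raises."""
    job_id = None
    try:
        job = json.loads(line)
        job_id = job.get("id")
        result = generate(job["prompt"], job.get("seed", 0), job["output_path"])
        reply = {"id": job_id, "ok": True, **result}
    except Exception as exc:
        # Report the failure to the caller instead of crashing the worker,
        # so one bad job does not force a model reload.
        reply = {"id": job_id, "ok": False, "error": f"{type(exc).__name__}: {exc}"}
    return json.dumps(reply)


def serve(in_stream=sys.stdin, out_stream=sys.stdout, generate=fake_generate):
    """Read one job per line, write one reply per line, flush after each."""
    for line in in_stream:
        if line.strip():
            out_stream.write(handle_line(line, generate) + "\n")
            out_stream.flush()


# Entry point when run as a worker process: serve()
```

The point of keeping the process alive is that the model is loaded (and compiled) once, and every later job only pays for inference. On the Rust side you would write one JSON line per job to the child's stdin and read one line back.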

Option B. If Rust is mandatory: try `tch-rs` first

If you truly need the generation core in Rust, I would look at tch-rs directly before betting on diffusers-rs for SDXL.

Why:

tch-rs is the Rust binding to libtorch, so you stay closer to the PyTorch execution stack. (GitHub)
tch-rs explicitly mentions a Stable Diffusion implementation following Diffusers-style code. (GitHub)
diffusers-rs is Torch-backed too, but its README currently says it supports Stable Diffusion v1.5 and v2.1. I could not verify a clearly documented, ready-to-use SDXL path from its public docs. (GitHub)

So the practical reading is:

diffusers-rs is promising
but for SDXL specifically, the README does not yet present it as the obvious drop-in answer (GitHub)

Option C. Only if you want to debug Candle itself

If your goal is to help Candle improve, the next useful step is not more top-level flags. It is a numerical comparison.

Do this:

fix one prompt, seed, size, guidance, steps
capture intermediate tensors at the same timestep in Diffusers and Candle
compare latents, scheduler inputs/outputs, and UNet outputs step by step

At this point, that is the only route likely to reveal why images still look wrong after matching the obvious settings. This is also consistent with Candle’s current state: there is ongoing interest in more advanced compilation and execution infrastructure, so parity bugs or performance gaps at the model-runtime boundary are plausible areas for real investigation. (GitHub)
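The comparison itself can be very small. Below is a sketch that assumes you have already dumped the intermediate tensors from both sides as flat lists of floats, in the same order and with matching labels, however you choose to export them. The label scheme, the function names, and the `atol=1e-2` threshold are my assumptions; pick a tolerance that fits the dtype you run in.

```python
import math


def compare(a, b):
    """Return (max absolute difference, cosine similarity) of two flat float lists."""
    if len(a) != len(b):
        raise ValueError(f"size mismatch: {len(a)} vs {len(b)}")
    max_abs = max((abs(x - y) for x, y in zip(a, b)), default=0.0)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    cosine = dot / norm if norm else float("nan")
    return max_abs, cosine


def first_divergence(reference, candidate, atol=1e-2):
    """Find the first captured tensor where the two runs disagree.

    Both arguments are lists of (label, values) pairs captured in the same
    order, e.g. ("step0/latents", [...]). Returns (label, max_abs_diff) for
    the first pair whose max abs diff exceeds atol, or None if all match.
    """
    for (label_r, vals_r), (label_c, vals_c) in zip(reference, candidate):
        if label_r != label_c:
            raise ValueError(f"capture order differs: {label_r} vs {label_c}")
        max_abs, _ = compare(vals_r, vals_c)
        if max_abs > atol:
            return label_r, max_abs
    return None
```

Whatever label comes back first is where to look: if the latents already differ at step 0, the problem is in noise generation or scaling; if the scheduler inputs match but the UNet output does not, the problem is in the model path.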

What I would not do next

I would not spend more time on:

more scheduler swaps
more fp16/f32 toggles
minor decode-side changes

You already tested the high-probability knobs. The evidence now points away from “small local fix” and toward “different execution stack” or “deeper implementation difference.” (Hugging Face)

My recommendation

If I had to choose one path:

Use compiled Diffusers now, and wrap it from Rust if needed.

That gives you:

the speedup you already observed
the image quality you already trust
the least engineering risk

Then, in parallel:

evaluate tch-rs if Rust-only execution is a hard requirement
treat Candle SDXL as an experiment or contribution target, not the shortest path to a production-quality result today (Hugging Face)

Simple decision rule

Use this:

Need SDXL working now → Diffusers + torch.compile (Hugging Face)
Need Rust around the system → keep Rust at the orchestration layer, Python for generation (Hugging Face)
Need Rust in the generation core → investigate tch-rs first, then diffusers-rs only if you confirm SDXL support yourself (GitHub)
Need to understand Candle’s bug/perf gap → compare intermediate tensors and file an upstream issue with a minimal repro (GitHub)