Stable Diffusion XL much slower with candle than with diffusers
Hmm… It’s surprising that changing the scheduler doesn’t change the results… I’d think that in Rust, dependencies are less likely to break over time compared to Python…
If you’re set on a native Rust implementation, you might want to try tch-rs first. If PyTorch’s optimizations are the main contributor to the speed, this approach is more direct.
Based on what you already tested, the best next step is not another small Candle-side tweak.
The short answer
For SDXL today, the best practical move is:
- Use Python Diffusers with torch.compile for image generation now.
- Keep Rust around it, not inside the diffusion core yet.
- If Rust for the core is mandatory, evaluate tch-rs before diffusers-rs for SDXL.
- Only keep digging into Candle if your goal is debugging or contributing upstream, not just getting a fast, correct SDXL pipeline. (Hugging Face)
Why this is the right move
Your recent tests already ruled out the two easiest explanations:
- changing float type barely helped
- changing scheduler did not fix speed or image correctness
That means the problem is probably not “one wrong config value.” It is more likely a combination of a runtime-stack gap and an implementation-parity gap for SDXL in Candle. Diffusers on PyTorch 2 gets optimized attention by default through SDPA, and Hugging Face documents torch.compile as an extra performance layer on top. PyTorch’s own SDXL acceleration post says those native optimizations can push diffusion inference up to 3× faster. Candle, by contrast, has an active RFC proposing a future tracing/compilation layer with JIT, kernel fusion, and static buffer planning, which strongly suggests that this kind of optimization layer is still emerging there rather than already mature in the current path. (Hugging Face)
What I would do next, in order
Option A. Best overall: keep Diffusers for generation
If the goal is good SDXL images plus speed now, stay with Diffusers + PyTorch 2 + torch.compile. That path is already working for you, and it matches the official optimization guidance: Diffusers supports SDPA by default on PyTorch 2, and recommends torch.compile for extra speed. PyTorch’s SDXL acceleration writeup is basically the background explanation for the speedup you just saw. (Hugging Face)
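For reference, here is a minimal sketch of that setup. I am assuming the stabilityai/stable-diffusion-xl-base-1.0 checkpoint, a CUDA GPU, and fp16 weights. The step count and guidance scale are placeholder defaults, and sampling_args is a helper I made up for illustration. The compile call follows the pattern in the Diffusers optimization docs as I understand it, so check it against the version you have installed.

```python
# Assumption: the SDXL base checkpoint; swap in whatever model you actually use.
MODEL_ID = "stabilityai/stable-diffusion-xl-base-1.0"


def sampling_args(prompt, steps=30, guidance_scale=7.0):
    """Hypothetical helper: keyword arguments passed to the pipeline call."""
    return {
        "prompt": prompt,
        "num_inference_steps": steps,
        "guidance_scale": guidance_scale,
    }


def load_compiled_sdxl(model_id=MODEL_ID):
    """Load SDXL in fp16 on CUDA and compile the UNet with torch.compile."""
    # Heavy imports stay local so this file can be imported without a GPU stack.
    import torch
    from diffusers import StableDiffusionXLPipeline

    pipe = StableDiffusionXLPipeline.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        variant="fp16",
        use_safetensors=True,
    ).to("cuda")
    # SDPA attention is already the default on PyTorch 2; torch.compile is the
    # extra layer on top. The UNet dominates runtime, so compile that module.
    pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
    return pipe


def generate(pipe, prompt, seed=0, **overrides):
    """Run one generation with a fixed seed so runs are comparable."""
    import torch

    generator = torch.Generator(device="cuda").manual_seed(seed)
    return pipe(**sampling_args(prompt, **overrides), generator=generator).images[0]
```

Typical use would be pipe = load_compiled_sdxl(), then generate(pipe, "a photo of an astronaut", seed=42).save("out.png"). The first call is slow because it triggers compilation, so benchmark from the second call onward.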
The clean architecture is:
- Rust app for orchestration, UI, job queue, prompt handling, file management
- Python worker or microservice for SDXL generation
- communicate over CLI, HTTP, gRPC, or a local queue
That is not elegant in the “single-language purity” sense, but it is the fastest route to a system that is fast, correct, and maintainable. This recommendation is an inference from the facts above and from your own benchmark outcome. (Hugging Face)
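To make the split concrete, here is a rough sketch of the Python worker side, assuming the simplest transport: the Rust app spawns the worker as a child process and exchanges one JSON object per line over stdin/stdout. The job fields (prompt, output_path, seed, steps) and the function names are hypothetical, and the generate callback stands in for whatever actually runs the pipeline and saves the image.

```python
import json
import sys

# Hypothetical job schema: these two fields are required, the rest have defaults.
REQUIRED_FIELDS = ("prompt", "output_path")


def parse_job(line):
    """Parse one JSON line into a job dict with defaults filled in."""
    job = json.loads(line)
    missing = [key for key in REQUIRED_FIELDS if key not in job]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return {
        "prompt": str(job["prompt"]),
        "output_path": str(job["output_path"]),
        "seed": int(job.get("seed", 0)),
        "steps": int(job.get("steps", 30)),
    }


def handle_line(line, generate):
    """Run one job and return a single-line JSON reply for the Rust side."""
    try:
        job = parse_job(line)
        generate(job)  # expected to write the image to job["output_path"]
        return json.dumps({"ok": True, "output_path": job["output_path"]})
    except Exception as exc:
        # Report the failure instead of crashing, so the worker stays alive.
        return json.dumps({"ok": False, "error": str(exc)})


def serve(generate, stdin=sys.stdin, stdout=sys.stdout):
    """Process jobs until stdin closes; one reply line per job line."""
    for line in stdin:
        if line.strip():
            stdout.write(handle_line(line, generate) + "\n")
            stdout.flush()  # the parent process is waiting on this line
```

The point of a long-lived worker rather than one process per image is that the model load and torch.compile warm-up happen once. The same handle_line function could sit behind HTTP or gRPC later without changing the job format.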
Option B. If Rust is mandatory: try tch-rs first
If you truly need the generation core in Rust, I would look at tch-rs directly before betting on diffusers-rs for SDXL.
Why:
- tch-rs is the Rust binding to libtorch, so you stay closer to the PyTorch execution stack. (GitHub)
- tch-rs explicitly mentions a Stable Diffusion implementation following Diffusers-style code. (GitHub)
- diffusers-rs is Torch-backed too, but its README currently says it supports Stable Diffusion v1.5 and v2.1. I could not verify a clearly documented, ready-to-use SDXL path from its public docs. (GitHub)
So the practical reading is:
- diffusers-rs is promising
- but for SDXL specifically, the README does not yet present it as the obvious drop-in answer (GitHub)
Option C. Only if you want to debug Candle itself
If your goal is to help Candle improve, the next useful step is not more top-level flags. It is a numerical comparison.
Do this:
- fix one prompt, seed, size, guidance, steps
- capture intermediate tensors at the same timestep in Diffusers and Candle
- compare latents, scheduler inputs/outputs, and UNet outputs step by step
At this point, that is the only route likely to reveal why images still look wrong after matching the obvious settings. This is also consistent with Candle’s current state: there is ongoing interest in more advanced compilation and execution infrastructure, so parity bugs or performance gaps at the model-runtime boundary are plausible areas for real investigation. (GitHub)
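A sketch of the comparison step, assuming you have already dumped the per-step tensors from both runs and flattened each one to a plain list of floats. The tensor names (latents, unet_out) and the tolerance are placeholders I picked for illustration, and how you export tensors from Candle and Diffusers is up to you.

```python
import math


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two flat tensors."""
    if len(a) != len(b):
        raise ValueError(f"shape mismatch: {len(a)} vs {len(b)}")
    return max(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine similarity; useful when magnitudes differ but direction matches."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0 if norm_a == norm_b else 0.0
    return dot / (norm_a * norm_b)


def first_divergence(ref_steps, test_steps, atol=1e-3):
    """Find the first (step, tensor name, diff) where the two runs disagree.

    ref_steps and test_steps are lists with one dict per timestep, mapping a
    tensor name to its flattened values. Returns None if everything is within
    atol. The default atol is a guess; fp16 runs will need a looser value.
    """
    for step, (ref, test) in enumerate(zip(ref_steps, test_steps)):
        for name in ref:
            diff = max_abs_diff(ref[name], test[name])
            if diff > atol:
                return step, name, diff
    return None
```

If the first divergence is in the initial latents, the problem is noise generation or seeding rather than the model. If latents match but the UNet output differs at step 0, look at the model weights, attention, or conditioning inputs. If it only drifts after several steps, the scheduler math is the likelier suspect.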
What I would not do next
I would not spend more time on:
- more scheduler swaps
- more fp16/f32 toggles
- minor decode-side changes
You already tested the high-probability knobs. The evidence now points away from “small local fix” and toward “different execution stack” or “deeper implementation difference.” (Hugging Face)
My recommendation
If I had to choose one path:
Use compiled Diffusers now, and wrap it from Rust if needed.
That gives you:
- the speedup you already observed
- the image quality you already trust
- the least engineering risk
Then, in parallel:
- evaluate tch-rs if Rust-only execution is a hard requirement
- treat Candle SDXL as an experiment or contribution target, not the shortest path to a production-quality result today (Hugging Face)
Simple decision rule
Use this:
- Need SDXL working now → Diffusers + torch.compile (Hugging Face)
- Need Rust around the system → keep Rust at the orchestration layer, Python for generation (Hugging Face)
- Need Rust in the generation core → investigate tch-rs first, then diffusers-rs only if you confirm SDXL support yourself (GitHub)
- Need to understand Candle’s bug/perf gap → compare intermediate tensors and file an upstream issue with a minimal repro (GitHub)