{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreiglv65egiqciroin6td4t7t47op7seo27zo627xayyxzbs7k2pouu",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mi6hbxzsujl2"
},
"path": "/t/stable-diffusion-xl-much-slower-with-candle-than-with-diffusers/174649#post_4",
"publishedAt": "2026-03-28T22:39:37.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"Hugging Face",
"GitHub"
],
"textContent": "Hmm… It’s surprising that changing the scheduler doesn’t change the results… I’d think that in Rust, dependencies are less likely to break over time compared to Python…\n\nIf you’re set on a native Rust implementation, you might want to try `tch-rs` first. If PyTorch’s optimizations are the main contributor to the speed, this approach is more direct.\n\n* * *\n\nBased on what you already tested, the best next step is **not** another small Candle-side tweak.\n\n## The short answer\n\nFor **SDXL today** , the best practical move is:\n\n 1. **Use Python Diffusers with`torch.compile` for image generation now.**\n 2. **Keep Rust around it** , not inside the diffusion core yet.\n 3. If Rust for the core is mandatory, **evaluate`tch-rs` before `diffusers-rs` for SDXL**.\n 4. Only keep digging into Candle if your goal is **debugging or contributing upstream** , not just getting a fast, correct SDXL pipeline. (Hugging Face)\n\n\n\n## Why this is the right move\n\nYour recent tests already ruled out the two easiest explanations:\n\n * changing float type barely helped\n * changing scheduler did not fix speed or image correctness\n\n\n\nThat means the problem is probably **not** “one wrong config value.” It is more likely a combination of **runtime-stack gap** and **implementation-parity gap** for SDXL in Candle. Diffusers on PyTorch 2 gets optimized attention by default through **SDPA** , and Hugging Face documents `torch.compile` as an extra performance layer on top. PyTorch’s own SDXL acceleration post says those native optimizations can push diffusion inference **up to 3× faster**. Candle, by contrast, has an active RFC proposing a future tracing/compilation layer with **JIT, kernel fusion, and static buffer planning** , which strongly suggests that this kind of optimization layer is still emerging there rather than already mature in the current path. (Hugging Face)\n\n## What I would do next, in order\n\n### Option A. 
Best overall: keep Diffusers for generation\n\nIf the goal is **good SDXL images plus speed now**, stay with **Diffusers + PyTorch 2 + `torch.compile`**. That path is already working for you, and it matches the official optimization guidance: Diffusers supports **SDPA by default** on PyTorch 2, and recommends `torch.compile` for extra speed. PyTorch’s SDXL acceleration writeup is basically the background explanation for the speedup you just saw. (Hugging Face)\n\nThe clean architecture is:\n\n * Rust app for orchestration, UI, job queue, prompt handling, file management\n * Python worker or microservice for SDXL generation\n * communicate over CLI, HTTP, gRPC, or a local queue\n\n\n\nThat is not elegant in the “single-language purity” sense, but it is the fastest route to a system that is **fast, correct, and maintainable**. This recommendation is an inference from the facts above and from your own benchmark outcome. (Hugging Face)\n\n### Option B. If Rust is mandatory: try `tch-rs` first\n\nIf you truly need the generation core in Rust, I would look at **`tch-rs` directly** before betting on `diffusers-rs` for SDXL.\n\nWhy:\n\n * `tch-rs` is the Rust binding to **libtorch**, so you stay closer to the PyTorch execution stack. (GitHub)\n * `tch-rs` explicitly mentions a **Stable Diffusion implementation following Diffusers-style code**. (GitHub)\n * `diffusers-rs` is Torch-backed too, but its README currently says it supports **Stable Diffusion v1.5 and v2.1**. I could not verify a clearly documented, ready-to-use **SDXL** path from its public docs. (GitHub)\n\n\n\nSo the practical reading is:\n\n * `diffusers-rs` is promising\n * but for **SDXL specifically**, the README does not yet present it as the obvious drop-in answer (GitHub)\n\n\n\n### Option C. Only if you want to debug Candle itself\n\nIf your goal is to help Candle improve, the next useful step is **not more top-level flags**. 
It is a **numerical comparison**.\n\nDo this:\n\n * fix one prompt, seed, size, guidance, steps\n * capture intermediate tensors at the same timestep in Diffusers and Candle\n * compare latents, scheduler inputs/outputs, and UNet outputs step by step\n\n\n\nAt this point, that is the only route likely to reveal why images still look wrong after matching the obvious settings. This is also consistent with Candle’s current state: there is ongoing interest in more advanced compilation and execution infrastructure, so parity bugs or performance gaps at the model-runtime boundary are plausible areas for real investigation. (GitHub)\n\n## What I would _not_ do next\n\nI would not spend more time on:\n\n * more scheduler swaps\n * more fp16/f32 toggles\n * minor decode-side changes\n\n\n\nYou already tested the high-probability knobs. The evidence now points away from “small local fix” and toward “different execution stack” or “deeper implementation difference.” (Hugging Face)\n\n## My recommendation\n\nIf I had to choose one path:\n\n### Use compiled Diffusers now, and wrap it from Rust if needed.\n\nThat gives you:\n\n * the speedup you already observed\n * the image quality you already trust\n * the least engineering risk\n\n\n\nThen, in parallel:\n\n * evaluate `tch-rs` if Rust-only execution is a hard requirement\n * treat Candle SDXL as an experiment or contribution target, not the shortest path to a production-quality result today (Hugging Face)\n\n\n\n## Simple decision rule\n\nUse this:\n\n * **Need SDXL working now** → Diffusers + `torch.compile` (Hugging Face)\n * **Need Rust around the system** → keep Rust at the orchestration layer, Python for generation (Hugging Face)\n * **Need Rust in the generation core** → investigate `tch-rs` first, then `diffusers-rs` only if you confirm SDXL support yourself (GitHub)\n * **Need to understand Candle’s bug/perf gap** → compare intermediate tensors and file an upstream issue with a minimal repro (GitHub)\n\n",
"title": "Stable Diffusion XL much slower with candle than with diffusers"
}