Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiarondz7mfbxs4uff7jw4s2bnbupm6nwucqulkrbkr3i7h4aunc5u",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mhzg5fz6upx2"
  },
  "path": "/t/stable-diffusion-xl-much-slower-with-candle-than-with-diffusers/174649#post_2",
  "publishedAt": "2026-03-27T01:28:07.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "GitHub",
    "Hugging Face"
  ],
  "textContent": "Seems known mismatch…?\n\n* * *\n\nThe simplest explanation is:\n\n**You are not benchmarking equivalent SDXL pipelines.** Candle’s example is still on a weaker-looking SDXL configuration path, while Diffusers on modern PyTorch gets several runtime optimizations “for free” that Candle’s example does not obviously match. Your forum post’s numbers, about **18s / 2.83 steps/s** in Diffusers versus **45s / 1.2 steps/s** in Candle, line up with known public reports and with the current example code.\n\n## What is most likely going on\n\n### 1. Candle’s SDXL example is still using DDIM\n\nCandle’s current Stable Diffusion example README says that for **v1.5, v2.1, and XL 1.0** , the default scheduler is **DDIM**. That is already a warning sign for SDXL. Candle issue `#1508` says the example should not use DDIM for SDXL, and PR `#1635` changes the SDXL path from `DDIMSchedulerConfig` to `EulerAncestralDiscreteSchedulerConfig`. So even Candle contributors recognized this as a mismatch. (GitHub)\n\nThere is one important nuance. The official `stabilityai/stable-diffusion-xl-base-1.0` scheduler config currently shows `_class_name: \"EulerDiscreteScheduler\"`, not Euler Ancestral. So the safest statement is not “SDXL must be Euler A.” The safe statement is: **the official SDXL repo advertises an Euler-family scheduler, while Candle’s example still uses DDIM.** That is enough to make the comparison non-apples-to-apples. (Hugging Face)\n\n### 2. DDIM is a plausible reason for the quality drop\n\nThis is not just theory. Diffusers issue `#6068` reports that **DDIM on SDXL** can leave **residual noise** and a **“smudgy”** look because of timestep alignment problems. That matches your “sometimes lesser quality” observation very closely. So one likely story is:\n\n  * Candle SDXL example uses DDIM.\n  * DDIM has had SDXL-specific quality issues.\n  * Your worse quality is at least partly scheduler-driven. (GitHub)\n\n\n\n## Why Diffusers can still be much faster even on the same GPU\n\n### 3. Diffusers is standing on PyTorch 2 runtime optimizations\n\nHugging Face’s Diffusers optimization guide says **SDPA** is used by default in Diffusers, and in their SDXL benchmark it improves latency from **4.63s to 3.31s**. The same guide says compiling the **UNet and VAE** with `torch.compile` cuts it further from **3.31s to 2.54s** , and notes that the “max-autotune” mode uses **CUDA graphs** to reduce GPU launch overhead. PyTorch’s own blog summarizes the same stack as **bfloat16 + SDPA + torch.compile + combined q/k/v projections** , showing about **3×** SDXL latency improvement. (Hugging Face)\n\nThat matters because Candle’s README only promises a narrower optimization in the example: **flash attention makes image generation a lot faster and uses less memory**. That helps, but it is not the same as having the full PyTorch 2 stack of SDPA, graph-friendly compilation, CUDA graphs, and related kernel fusion. So “both are using fast attention” does **not** mean “both are equally optimized.” (GitHub)\n\n### 4. Diffusion models amplify small runtime overheads\n\nDiffusion inference is iterative. The denoiser runs over and over across many timesteps. PyTorch’s guidance explicitly warns that **graph breaks** reduce the benefit of `torch.compile`, and Diffusers’ optimization guide says `fullgraph=True` helps avoid those breaks. Their guide also talks about removing **GPU sync after compilation**. In plain terms, once you pay a little overhead every step, the total gap grows quickly over 30 or 50 steps. That is why a runtime-execution advantage can show up as a large end-to-end difference even when both implementations use the same model weights. (Hugging Face)\n\n## Why Candle can use less VRAM but still be slower\n\nThat part is not contradictory. Your post reports Candle using less VRAM than Diffusers during inference and decode. Lower memory use does **not** imply higher throughput. It can simply mean the runtime is making different tradeoffs. In your case, the public evidence points to exactly that: Candle’s example can be more memory-conservative while Diffusers is faster because Diffusers is using a more mature optimized execution stack.\n\n## Two secondary details in Candle’s example\n\nThese are not the main cause, but they matter.\n\nFirst, Candle’s `main.rs` now has a special fp16 SDXL VAE override. When SDXL or Turbo is used with `--use-f16`, it switches to **`madebyollin/sdxl-vae-fp16-fix`** , with a comment pointing to Candle issue `#1060`. That means SDXL fp16 behavior has already needed a workaround in Candle. If someone overrides the VAE path, quality or stability can move around for reasons unrelated to the scheduler. (GitHub)\n\nSecond, the same file builds the text encoder with **`DType::F32`** in `build_clip_transformer(..., DType::F32)`. That will not dominate the repeated denoising loop, but it can still add wall-clock cost compared with a more uniformly half-precision path. I would treat this as a secondary factor, not the main reason for the 2×+ gap you saw. (GitHub)\n\n## What I think is the real diagnosis\n\nIn order of likelihood:\n\n  1. **Scheduler mismatch** is probably the main reason for the quality difference. Candle’s SDXL example is on DDIM, while the official SDXL repo is Euler-family, and DDIM had a documented SDXL quality bug in Diffusers. (GitHub)\n\n  2. **Runtime maturity** is probably the main reason for the speed difference. Diffusers on PyTorch 2 gets SDPA, compile, CUDA-graph-assisted execution, and related optimizations that Candle’s example does not obviously replicate. (Hugging Face)\n\n  3. **Secondary implementation details** in Candle, such as the fp16 VAE workaround and F32 text encoder path, may add some extra overhead or quality variability, but they are unlikely to be the whole story. (GitHub)\n\n\n\n\n## What to try next\n\n### Best first test\n\nPatch Candle so SDXL uses an **Euler-family scheduler** instead of DDIM, then rerun the exact same prompt, seed, image size, guidance, and step count. This is the cleanest test because it isolates the most suspicious difference first. The open Candle PR already shows the exact scheduler change contributors had in mind. (GitHub)\n\n### Best second test\n\nBenchmark only the **denoising loop** , separate from text encoding and VAE decode. Your post already distinguishes total time from step speed, which is the right instinct. If Candle loses mostly in step speed, that points directly to runtime execution rather than one-time setup costs.\n\n### Best third test\n\nMake the Diffusers baseline explicit:\n\n  * plain Diffusers\n  * Diffusers with default SDPA\n  * Diffusers with compiled UNet/VAE\n\n\n\nThat tells you whether Candle is behind a basic baseline or only behind the fully optimized PyTorch 2 path. The docs show that these modes can differ materially. (Hugging Face)\n\n## What I would conclude today\n\nI would **not** conclude that Rust is the problem. I would also **not** conclude that Candle is “bad.” I would conclude that, for **SDXL inference right now** , Candle’s public example is not yet a fair throughput match for a modern Diffusers-on-PyTorch stack, and its SDXL scheduler choice makes the quality comparison worse than it needs to be. (GitHub)\n\nIf you want raw speed in Rust today, the strongest near-term workaround is often a **Torch-backed Rust path** such as `diffusers-rs` or `tch-rs`, because Candle’s own SD example is described as a conversion of `diffusers-rs` “using candle rather than libtorch.” That suggests the main tradeoff here is the execution stack, not the language. (GitHub)\n\nThe shortest practical summary is:\n\n  * **Worse quality:** likely **DDIM on SDXL**.\n  * **Worse speed:** likely **runtime optimization gap** versus modern Diffusers/PyTorch.\n  * **Less VRAM:** real, but not evidence of higher overall optimization.\n\n\n\n* * *\n\nHere is the concrete checklist.\n\n## Goal\n\nAnswer three questions cleanly:\n\n  1. Is the image-quality drop mostly a **scheduler** problem?\n  2. Is some of the wall-clock cost coming from Candle’s **text encoder dtype** choice?\n  3. After those fixes, is the remaining gap mostly just **runtime-stack maturity** versus Diffusers on PyTorch 2? (GitHub)\n\n\n\n## 1. Keep the comparison fixed\n\nUse the same prompt, seed, size, steps, guidance scale, and model each time. Candle’s SD example still defaults SDXL 1.0 to **DDIM** , and `--use-flash-attn` is the intended fast-attention path. (GitHub)\n\nBase command:\n\n\n    cargo run --release --example stable-diffusion --features=cuda,cudnn,flash-attn -- \\\n      --prompt \"A rusty robot holding a torch in its hand, photorealistic.\" \\\n      --sd-version xl \\\n      --use-flash-attn \\\n      --use-f16 \\\n      --guidance-scale 5 \\\n      --n-steps 50 \\\n      --height 1024 \\\n      --width 1024 \\\n      --seed 42\n\n\n## 2. First code change: switch the SDXL scheduler\n\nWhy first: Candle’s README says SDXL uses **DDIM** by default, Candle issue `#1508` says SDXL should use an Euler-family scheduler instead, and PR `#1635` changes the code to **Euler Ancestral**. Separately, the official SDXL scheduler config currently says **EulerDiscreteScheduler** , which is still Euler-family and still different from DDIM. (GitHub)\n\nChange this in:\n\n\n    candle-transformers/src/models/stable_diffusion/mod.rs\n\n\nFrom:\n\n\n    let scheduler = Arc::new(ddim::DDIMSchedulerConfig {\n        prediction_type,\n        ..Default::default()\n    });\n\n\nTo:\n\n\n    let scheduler = Arc::new(\n        euler_ancestral_discrete::EulerAncestralDiscreteSchedulerConfig {\n            prediction_type,\n            ..Default::default()\n        },\n    );\n\n\nThat is the exact replacement shown in PR `#1635`. (GitHub)\n\n## 3. Second code change: stop forcing the text encoder to F32\n\nIn current `main.rs`, Candle builds the CLIP text encoder with `DType::F32` even though `dtype` is already set from `--use-f16`. That is probably not the main reason for slow **step speed** , but it can add one-time overhead to total runtime. (GitHub)\n\nChange this in:\n\n\n    candle-examples/examples/stable-diffusion/main.rs\n\n\nFrom:\n\n\n    stable_diffusion::build_clip_transformer(clip_config, clip_weights, device, DType::F32)?;\n\n\nTo:\n\n\n    stable_diffusion::build_clip_transformer(clip_config, clip_weights, device, dtype)?;\n\n\n## 4. Add two tiny timers\n\nCandle already prints sampling progress starting at `println!(\"starting sampling\");`, and `save_image()` does the final VAE decode and write. Add two coarse timers so you can split:\n\n  * text encode\n  * sampling loop\n  * final decode/save (GitHub)\n\n\n\nMinimal idea:\n\n\n    let encode_t0 = std::time::Instant::now();\n    // text/token/CLIP work\n    println!(\"text encode total {:.2}s\", encode_t0.elapsed().as_secs_f32());\n\n    let sampling_t0 = std::time::Instant::now();\n    println!(\"starting sampling\");\n    // denoising loop\n    println!(\"sampling total {:.2}s\", sampling_t0.elapsed().as_secs_f32());\n\n    let decode_t0 = std::time::Instant::now();\n    save_image(...)?;\n    println!(\"final decode+save {:.2}s\", decode_t0.elapsed().as_secs_f32());\n\n\n## 5. Run these 3 benchmarks\n\n### Benchmark 1. Current baseline\n\nNo scheduler patch. No dtype patch. Just run the command.\n\nRecord:\n\n  * total time\n  * `text encode total`\n  * `sampling total`\n  * `final decode+save`\n  * output image\n\n\n\n### Benchmark 2. Scheduler-only\n\nApply the scheduler patch. Leave the text encoder line alone.\n\nRun the same command again.\n\nThis isolates the “DDIM vs Euler-family” effect. That matters because Diffusers has an SDXL issue showing **DDIM can produce smudgy / noisy results** on SDXL. (GitHub)\n\n### Benchmark 3. Scheduler + dtype\n\nKeep the scheduler patch. Also replace `DType::F32` with `dtype`.\n\nRun the same command again.\n\nThis tells you whether one-time prompt encoding was adding avoidable wall-clock cost. (GitHub)\n\n## 6. Interpret the results like this\n\nIf **Benchmark 2 fixes image quality a lot** , the scheduler was a major quality problem. That is the outcome I would expect first. Candle’s published SDXL path is DDIM, while the official SDXL config is Euler-family. (GitHub)\n\nIf **Benchmark 2 also improves sampling time a lot** , then DDIM was hurting both quality and throughput in this setup. That would make the original comparison less apples-to-apples than it looked. (GitHub)\n\nIf **Benchmark 3 lowers total time but barely changes sampling time** , then the F32 text encoder was only a front-end tax. The main bottleneck is still the denoising loop. (GitHub)\n\nIf **Benchmark 3 is still clearly behind Diffusers** , the remaining gap is probably mostly the runtime stack. Diffusers’ optimization docs say **SDPA is used by default** , and PyTorch 2 can improve text-to-image inference latency by **up to 3x**. (Hugging Face)\n\n## 7. What I expect\n\nMost likely:\n\n  * **Scheduler patch** helps quality the most.\n  * **Text encoder dtype patch** helps total time a bit.\n  * A meaningful gap may still remain, because Diffusers on PyTorch 2 gets SDPA and broader compiler/runtime optimizations that Candle’s public SDXL example does not obviously match. (Hugging Face)\n\n\n\nThe shortest decision rule is:\n\n  * quality improves after Benchmark 2 → blame the scheduler first\n  * total time improves after Benchmark 3 but sampling does not → blame the text encoder only for setup overhead\n  * large gap still remains after both → blame the runtime stack first (GitHub)\n\n",
  "title": "Stable Diffusion XL much slower with candle than with diffusers"
}