Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreie5hhul7qshzbmugasmkdoi677qfuknat6yxiyjqlqpwjea72mb5e",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mgqn3rtt63e2"
  },
  "path": "/t/about-traning-lora-for-z-image-turbo/173911?page=2#post_22",
  "publishedAt": "2026-03-10T21:44:16.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Hugging Face",
    "GitHub"
  ],
  "textContent": "I’m not sure if the following insights will be useful, but I’ve analyzed the current situation for now:\n\n* * *\n\n## My take\n\nYour conclusions are **mostly directionally right** , but I would tighten them in one important way:\n\nFor **Turbo face LoRA** , the biggest problem is usually **not** “too few steps” by itself. It is usually one of these:\n\n  * the dataset is teaching **too many things at once**\n  * the captions are carrying **too much variable information**\n  * the run is long enough that **Turbo’s distillation starts to break down**\n  * or all three together\n\n\n\nThat matters because Turbo is a **distilled** model with **no normal CFG/negative-prompt control** , and official/community guidance treats it as only **limitedly fine-tunable** compared with Z-Image Base. Z-Image Base is explicitly presented as the better starting point for LoRA training and downstream control, while Turbo needs special handling such as a **training adapter** or **DistillPatch** if you want to keep fast 8-step behavior. (Hugging Face)\n\n## What I think you got right\n\n### 1) Caption structure matters a lot\n\nYes. Especially on a **small identity dataset** , overly detailed captions can make the LoRA learn **caption dependence** instead of robust identity. And the opposite extreme—almost no captioning except a trigger—can make it too trigger-locked and less editable. The trainer docs around caption dropout reflect exactly this tradeoff: examples use small dropout values like **0.05–0.1** , and describe higher caption dropout as making changes apply more broadly across prompts, while lower/no dropout keeps the change tied more strictly to the prompts shown during training. (GitHub)\n\n### 2) Tight face crops learn faster for a face LoRA\n\nAlso yes. 
For Turbo personalization, small, tightly curated datasets can already be enough to imprint a subject, and the AI Toolkit write-up explicitly recommends **fewer, cleaner, higher-resolution images**, noting that even **nine** clean images were enough to imprint a subject in tests. (Hugging Face)\n\nSo your observation that a face-focused set learns faster than a broader “below-the-shoulders” set makes sense.\n\n* * *\n\n## Where I would adjust your conclusion\n\n### “More steps increase learning speed”\n\nI would frame this differently.\n\nMore steps do **not** really make the model learn “faster.”\nThey give the model **more exposure**. What you are likely seeing is:\n\n  * one run with a higher cap lets you inspect more checkpoints in one continuous trajectory\n  * resuming later can land you on a different part of the curve, where drift/overfit is already starting\n\n\n\nSo I would not treat “start high” as a law. I would treat it as:\n\n> “A single longer run with checkpoint sampling made it easier to catch the sweet spot.”\n\nThat is a useful practical insight, but it is different from “high steps are inherently better.”\n\n* * *\n\n## What your current failure pattern means\n\nYou said:\n\n  * 45 visuals\n  * 4500 steps\n  * learning feels erratic\n  * unrelated people/prompts start resembling your character\n  * neutral prompts still do not activate properly even after ~3250\n\n\n\nThat combination usually means:\n\n### The LoRA is **bleeding**, not just underlearning\n\nIf unrelated people begin to resemble your character, the LoRA is already strong enough to **overwrite base priors**.\nBut if neutral prompts still do not activate your subject well, then it is not learning identity in a clean, general way. 
It is learning something more like:\n\n> “When the prompt looks like my training captions, force my subject.”\n\nThat is usually a **caption/dataset mismatch** problem before it is a “need more steps” problem.\n\n* * *\n\n## My concrete diagnosis for your setup\n\n## 1) 45 images is not “too much” for Turbo in general\n\nBut it can be **too much variety for a face LoRA**.\n\nFor a face LoRA, 45 images is only helpful if they are mostly:\n\n  * same person\n  * face/neck dominant\n  * controlled variation in angle, lighting, expression\n\n\n\nIt becomes harmful when those 45 images also add too many extra moving parts:\n\n  * body framing changes\n  * hair style changes\n  * clothing changes\n  * strong location changes\n  * strong semantic roles\n\n\n\nFor a **face LoRA**, I would still prefer something like:\n\n  * **15–25 face/neck dominant**\n  * **3–5 upper-body**\n  * almost no knee-up/full-body\n\n\n\nThat fits the “fewer, cleaner” guidance much better than a broad 45-image identity set. 
(Hugging Face)\n\n## 2) Your caption template is good, but probably too broad for a face LoRA\n\nYour structure:\n\n`[trigger], [framing], [pose/action], [facial expression], [clothing], [accessories], [location], [lighting], [background]`\n\nis a **good general character template**.\n\nBut for a **face LoRA**, I would simplify it to something closer to:\n\n`[trigger], [framing], [angle], [expression], [important hair/accessory changes], [lighting]`\n\nand keep:\n\n  * clothing minimal\n  * location minimal\n  * background minimal\n\n\n\nWhy: for a face LoRA, you do **not** want the model spending too much capacity on “park / beach / house / native clothing / volcanic landscape / angry scene / crying scene / etc.” You want almost all learning pressure to go into **identity**.\n\nYour current template is probably better for the **second-stage character/look LoRA**, not the first face anchor.\n\n## 3) Raising caption dropout (CDO) is not the first thing I would try\n\nA **small** increase can help when the LoRA is too tied to caption phrasing, yes. But the trainer examples and docs point to **small** values, not huge ones, typically around **0.05 to 0.1**. (GitHub)\n\nIn your case, if increasing caption dropout changed nothing, that is a sign the real bottleneck is probably:\n\n  * caption **entropy**\n  * or dataset **scope**\n  * not caption dropout alone\n\n\n\nSo I would **not** keep pushing CDO upward as the main lever.\n\n## 4) On the LoRA side, LoRA/module dropout is more interesting than more caption dropout\n\nLoRA/module dropout is explicitly described in trainer docs as a way to help prevent overfitting, and the AI Toolkit changes also describe rank/module dropout support as useful for small datasets. 
(GitHub)\n\nSo if your LoRA is bleeding into unrelated prompts, a **small** LoRA/module dropout is more sensible than heavily increasing caption dropout.\n\n* * *\n\n## What I would do next\n\n## Option A — stay on Turbo, but narrow the face run\n\nThis is the path I would try first since you want to continue with Turbo.\n\n### Change only these things:\n\n  * Cut the face dataset from **45** down to about **18–25**\n  * Keep mostly **face/neck**\n  * Keep **3–5 upper-body**\n  * Remove full-body / knee-up / role-heavy samples\n  * Simplify captions to **identity-relevant tokens only**\n  * Keep CDO modest: around **0.05–0.1**\n  * Add a small LoRA/module dropout if your trainer supports it\n\n\n\n### Why\n\nTurbo training is fragile because direct training on a step-distilled model breaks the distillation over time; the training adapter helps, but its own card says it is mainly for **shorter runs** like styles, concepts, and characters, and that **longer runs** can still lead to distillation breakdown and artifacts. (Hugging Face)\n\nThat is why a **narrow, shorter, cleaner** face run usually behaves better than a broad 45-image run.\n\n## Option B — use Base for the identity LoRA\n\nIf Turbo keeps behaving erratically, this is the cleanest fallback.\n\nZ-Image Base is explicitly described as:\n\n  * non-distilled\n  * a good base for LoRA training\n  * full CFG support\n  * strong negative prompting\n  * higher diversity/control than Turbo (Hugging Face)\n\n\n\nSo if your goal is:\n\n  * a face LoRA that responds to **neutral prompts**\n  * less weird gender/class drift\n  * less training instability\n\n\n\nthen Base is the more rational training target.\n\n## Option C — middle ground: de-distilled Turbo\n\nThere is also a community “de-turbo” route that is meant specifically to break Turbo’s distillation for training and can be trained directly, with inference around **CFG 2–3 and 20–30 steps**. 
That is not the official base model, but it is a practical middle ground if you want something closer to Turbo’s look while avoiding Turbo’s training fragility. (Hugging Face)\n\n* * *\n\n## The specific symptom: “angry woman turns into a man”\n\nI would interpret that like this:\n\nIt is **not** that the model needs to “learn that your character can be angry.”\n\nIt is more that:\n\n  * the base model has a strong prior for certain emotion/scene combinations\n  * your LoRA is not yet strong enough in a **neutral, identity-first way**\n  * so when the prompt becomes semantically strong, the base prior wins\n\n\n\nThat means the fix is usually:\n\n  * **better identity anchoring**\n  * **cleaner face dataset**\n  * **less variable captioning**\n\n\n\nnot just more steps.\n\n* * *\n\n## My short answers to your direct questions\n\n### Is 45 visuals too much for Turbo?\n\nNo, not in general.\nYes, potentially **too broad for a face LoRA**.\n\n### Should you raise CDO?\n\nOnly slightly, if at all.\nIt is not the main fix here. Keep it modest. (GitHub)\n\n### Should you use LoRA-side dropout/module dropout?\n\nYes, a **small amount** is worth trying if you are seeing bleed/overfit. The trainer docs describe LoRA dropout as helping prevent overfitting. (GitHub)\n\n### Should you try the base model?\n\nYes, if Turbo remains erratic. Base is officially presented as the better LoRA-training target. (Hugging Face)\n\n* * *\n\n## My strongest recommendation\n\nFor your **face LoRA**, stop trying to make one run do everything.\n\nDo this instead:\n\n### Face LoRA\n\n  * very tight dataset\n  * very simple captions\n  * short, identity-first training goal\n\n\n\n### Character / look LoRA\n\n  * broader framing\n  * richer captions\n  * location / lighting / clothing variety\n\n\n\nThat separation is the cleanest way to stop the exact failure pattern you are seeing.",
  "title": "About traning LoRa for Z Image Turbo"
}