
About training LoRA for Z-Image Turbo

Hugging Face Forums [Unofficial] March 10, 2026

I’m not sure how useful the following will be, but here is my analysis of the current situation:


My take

Your conclusions are mostly directionally right, but I would tighten them in one important way:

For a Turbo face LoRA, the biggest problem is usually not “too few steps” by itself. It is usually one of these:

  • the dataset is teaching too many things at once
  • the captions are carrying too much variable information
  • the run is long enough that Turbo’s distillation starts to break down
  • or all three together

That matters because Turbo is a distilled model without normal CFG or negative-prompt control, and official and community guidance treats it as less fine-tunable than Z-Image Base. Z-Image Base is explicitly presented as the better starting point for LoRA training and downstream control, while Turbo needs special handling, such as a training adapter or DistillPatch, if you want to keep its fast 8-step behavior. (Hugging Face)

What I think you got right

1) Caption structure matters a lot

Yes. Especially on a small identity dataset, overly detailed captions can make the LoRA learn caption dependence instead of robust identity. The opposite extreme, almost no captioning except a trigger, can make it too trigger-locked and less editable. The trainer docs around caption dropout reflect exactly this tradeoff: examples use small dropout values like 0.05–0.1, and describe higher caption dropout as making changes apply more broadly across prompts, while lower or no dropout keeps the change tied more strictly to the prompts shown during training. (GitHub)
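
To make the tradeoff concrete, here is a minimal sketch of what caption dropout does to the training data. It assumes the simplest variant, where a dropped caption becomes an empty string; your trainer may instead keep the trigger word or drop individual tags. The function name and the trigger `mytrigger` are made up for illustration.

```python
import random


def apply_caption_dropout(caption: str, rate: float, rng: random.Random) -> str:
    """Return the caption unchanged, or an empty caption with probability `rate`.

    Assumption: dropped captions become "". Real trainers may differ.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    # rng.random() is in [0, 1), so rate=0 never drops and rate=1 always drops.
    return "" if rng.random() < rate else caption


rng = random.Random(0)
captions = ["mytrigger, close-up, smiling, soft window light"] * 1000
dropped = sum(apply_caption_dropout(c, 0.1, rng) == "" for c in captions)
print(f"{dropped} of {len(captions)} captions were dropped")
```

At a rate of 0.1, roughly one training sample in ten is seen without its caption, which is what pushes the learned change to apply beyond the exact training phrasing.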

2) Tight face crops learn faster for a face LoRA

Also yes. For Turbo personalization, small, tightly curated datasets can already be enough to imprint a subject. The AI Toolkit write-up explicitly recommends fewer, cleaner, higher-resolution images, noting that even nine clean images were enough to imprint a subject in tests. (Hugging Face)

So your observation that a face-focused set learns faster than a broader “below-the-shoulders” set makes sense.


Where I would adjust your conclusion

“More steps increase learning speed”

I would frame this differently.

More steps do not really make the model learn “faster.” They give the model more exposure. What you are likely seeing is:

  • one run with a higher cap lets you inspect more checkpoints in one continuous trajectory
  • resuming later can land you on a different part of the curve, where drift/overfit is already starting

So I would not treat “start high” as a law. I would treat it as:

“A single longer run with checkpoint sampling made it easier to catch the sweet spot.”

That is a useful practical insight, but it is different from “high steps are inherently better.”
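
A rough way to see "exposure" in numbers is to count how many times each image is seen. The sketch below assumes batch size 1, no dataset repeats, and no gradient accumulation (none of these were stated), and the 250-step save interval is only an example value.

```python
def passes_per_image(total_steps: int, num_images: int, batch_size: int = 1) -> float:
    """Average number of times each image is seen over the run."""
    return total_steps * batch_size / num_images


def checkpoint_steps(total_steps: int, save_every: int) -> list[int]:
    """Steps at which a checkpoint is saved in one continuous run."""
    return list(range(save_every, total_steps + 1, save_every))


# Figures from this thread: 45 images, a 4500-step cap, trouble around step 3250.
print(passes_per_image(4500, 45))   # exposure at the end of the run
print(passes_per_image(3250, 45))   # exposure at the step you mentioned
steps = checkpoint_steps(4500, 250)  # 250 is an assumed save interval
print(len(steps), steps[:4])
```

Under those assumptions, 4500 steps on 45 images is about 100 passes per image. Sampling every saved checkpoint from one run is what lets you find the point on that curve before drift sets in.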


What your current failure pattern means

You said:

  • 45 visuals
  • 4500 steps
  • learning feels erratic
  • unrelated people/prompts start resembling your character
  • neutral prompts still do not activate properly even after ~3250

That combination usually means:

The LoRA is bleeding, not just underlearning

If unrelated people begin to resemble your character, the LoRA is already strong enough to overwrite base priors. But if neutral prompts still do not activate your subject well, then it is not learning identity in a clean, general way. It is learning something more like:

“When the prompt looks like my training captions, force my subject.”

That is usually a caption/dataset mismatch problem before it is a “need more steps” problem.


My concrete diagnosis for your setup

1) 45 images is not “too much” for Turbo in general

But it can be too much variety for a face LoRA.

For a face LoRA, 45 images is only helpful if they are mostly:

  • same person
  • face/neck dominant
  • controlled variation in angle, lighting, expression

It becomes harmful when those 45 images also add too many extra moving parts:

  • body framing changes
  • hair style changes
  • clothing changes
  • strong location changes
  • strong semantic roles

For a face LoRA, I would still prefer something like:

  • 15–25 face/neck dominant
  • 3–5 upper-body
  • almost no knee-up/full-body

That fits the “fewer, cleaner” guidance much better than a broad 45-image identity set. (Hugging Face)
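
As a sketch of that trimming step, the snippet below caps each framing category. The framing labels, file names, and the exact caps (20 face, 4 upper-body) are assumptions picked from inside the ranges above; it keeps the first images it sees, so sort your list best-first before running it.

```python
from collections import Counter

# Hypothetical framing labels and caps; assign labels by hand or with your own tagging.
LIMITS = {"face": 20, "upper_body": 4, "knee_up": 0, "full_body": 0}


def select_face_dataset(items, limits=LIMITS):
    """Keep at most limits[framing] images per framing, in input order."""
    kept, counts = [], Counter()
    for filename, framing in items:
        # Framings missing from `limits` are treated as a cap of 0.
        if counts[framing] < limits.get(framing, 0):
            kept.append((filename, framing))
            counts[framing] += 1
    return kept, dict(counts)


# A made-up 45-image set: 28 face, 9 upper-body, 8 full-body.
items = (
    [(f"face_{i:02d}.png", "face") for i in range(28)]
    + [(f"upper_{i:02d}.png", "upper_body") for i in range(9)]
    + [(f"full_{i:02d}.png", "full_body") for i in range(8)]
)
kept, counts = select_face_dataset(items)
print(len(items), "->", len(kept), counts)
```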

2) Your caption template is good, but probably too broad for a face LoRA

Your structure:

[trigger], [framing], [pose/action], [facial expression], [clothing], [accessories], [location], [lighting], [background]

is a good general character template.

But for a face LoRA, I would simplify it to something closer to:

[trigger], [framing], [angle], [expression], [important hair/accessory changes], [lighting]

and keep:

  • clothing minimal
  • location minimal
  • background minimal

Why: for a face LoRA, you do not want the model spending too much capacity on “park / beach / house / native clothing / volcanic landscape / angry scene / crying scene / etc.” You want almost all learning pressure to go into identity.

Your current template is probably better for the second-stage character/look LoRA, not the first face anchor.
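
If your captions are already stored as fields, the simplification can be scripted. This is a sketch only: the field names, the `face_caption` helper, and the example values are made up, and it drops clothing, location, and background entirely, which is stricter than "minimal".

```python
# Identity-relevant fields, in the order of the simplified template.
FACE_FIELDS = ("trigger", "framing", "angle", "expression", "hair_accessory", "lighting")


def face_caption(fields: dict) -> str:
    """Build a face-LoRA caption, ignoring fields outside FACE_FIELDS."""
    parts = []
    for key in FACE_FIELDS:
        value = str(fields.get(key) or "").strip()
        if value:  # skip missing or empty fields
            parts.append(value)
    return ", ".join(parts)


full = {
    "trigger": "mytrigger",  # hypothetical trigger word
    "framing": "close-up",
    "angle": "three-quarter view",
    "pose": "sitting on a bench",
    "expression": "slight smile",
    "clothing": "red jacket",
    "location": "park",
    "lighting": "overcast daylight",
    "background": "trees",
}
print(face_caption(full))
# prints: mytrigger, close-up, three-quarter view, slight smile, overcast daylight
```

The full field set stays available for the later character/look LoRA; only the face run sees the reduced caption.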

3) Raising CDO (caption dropout) is not the first thing I would try

A small increase can help when the LoRA is too tied to caption phrasing, yes. But the trainer examples and docs point to small values, not huge ones, typically around 0.05 to 0.1. (GitHub)

In your case, if increasing caption dropout changed nothing, that is a sign the real bottleneck is probably not caption dropout alone, but one of these:

  • caption entropy (too much variable information in the captions)
  • dataset scope

So I would not keep pushing CDO upward as the main lever.

4) On the LoRA side, module dropout is more promising than more caption dropout

LoRA/module dropout is explicitly described in trainer docs as a way to help prevent overfitting, and the AI Toolkit changes also describe rank/module dropout support as useful for small datasets. (GitHub)

So if your LoRA is bleeding into unrelated prompts, a small LoRA/module dropout is more sensible than heavily increasing caption dropout.
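
For intuition, here is a toy sketch of one common form of module dropout: during training, a module's LoRA delta is skipped entirely with some probability, so that step sees only the frozen base output. It uses scalars instead of tensors, and whether your trainer implements dropout this way is an assumption to check against its docs.

```python
import random


def lora_forward(x, w, down, up, scale, module_dropout, rng, training=True):
    """Rank-1 scalar toy: base output w*x plus LoRA delta scale*up*(down*x)."""
    base_out = w * x
    # Module dropout: with probability `module_dropout`, skip the LoRA delta
    # for this call. It is only active during training, never at inference.
    if training and rng.random() < module_dropout:
        return base_out
    return base_out + scale * up * (down * x)


rng = random.Random(0)
outs = [lora_forward(2.0, 1.0, 0.5, 0.5, 1.0, 0.1, rng) for _ in range(1000)]
skipped = sum(o == 2.0 for o in outs)
print(f"LoRA delta skipped on {skipped} of 1000 training calls")
print(lora_forward(2.0, 1.0, 0.5, 0.5, 1.0, 0.1, rng, training=False))
```

Because the LoRA cannot rely on every module being active on every step, it is pushed away from memorizing the small dataset, which is the overfitting control the docs describe.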


What I would do next

Option A — stay on Turbo, but narrow the face run

This is the path I would try first since you want to continue with Turbo.

Change only these things:

  • Cut the face dataset from 45 down to about 18–25
  • Keep mostly face/neck
  • Keep 3–5 upper-body
  • Remove full-body / knee-up / role-heavy samples
  • Simplify captions to identity-relevant tokens only
  • Keep CDO modest: around 0.05–0.1
  • Add a small LoRA/module dropout if your trainer supports it

Why

Turbo training is fragile because direct training on a step-distilled model breaks the distillation over time; the training adapter helps, but its own card says it is mainly for shorter runs like styles, concepts, and characters, and that longer runs can still lead to distillation breakdown and artifacts. (Hugging Face)

That is why a narrow, shorter, cleaner face run usually behaves better than a broad 45-image run.
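
The Option A checklist can be turned into a quick pre-flight check before you launch a run. The plan keys below are hypothetical and do not correspond to any trainer's config format, and the thresholds simply restate the ranges suggested above.

```python
def check_face_run(plan: dict) -> list[str]:
    """Return warnings where a run plan departs from the Option A suggestions."""
    warnings = []
    if not 18 <= plan["num_images"] <= 25:
        warnings.append("dataset size is outside the suggested 18-25 images")
    if not 3 <= plan["upper_body_images"] <= 5:
        warnings.append("upper-body count is outside the suggested 3-5")
    if plan["full_body_images"] > 0:
        warnings.append("full-body / knee-up samples are still present")
    if plan["caption_dropout"] > 0.1:
        warnings.append("caption dropout is above the modest 0.05-0.1 range")
    if plan["lora_dropout"] <= 0:
        warnings.append("no LoRA/module dropout set (optional, if supported)")
    return warnings


# Hypothetical plans: a broad 45-image run and a narrowed face run.
broad = {"num_images": 45, "upper_body_images": 9, "full_body_images": 8,
         "caption_dropout": 0.3, "lora_dropout": 0.0}
narrow = {"num_images": 22, "upper_body_images": 4, "full_body_images": 0,
          "caption_dropout": 0.05, "lora_dropout": 0.05}

for name, plan in (("broad", broad), ("narrow", narrow)):
    print(name, check_face_run(plan))
```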

Option B — use Base for the identity LoRA

If Turbo keeps behaving erratically, this is the cleanest fallback.

Z-Image Base is explicitly described as:

  • non-distilled
  • a good base for LoRA training
  • full CFG support
  • strong negative prompting
  • higher diversity/control than Turbo (Hugging Face)

So if your goal is:

  • a face LoRA that responds to neutral prompts
  • less weird gender/class drift
  • less training instability

then Base is the more rational training target.

Option C — middle ground: de-distilled Turbo

There is also a community “de-turbo” route that is meant specifically to break Turbo’s distillation for training and can be trained directly, with inference around CFG 2–3 and 20–30 steps. That is not the official base model, but it is a practical middle ground if you want something closer to Turbo’s look while avoiding Turbo’s training fragility. (Hugging Face)


The specific symptom: “angry woman turns into a man”

I would interpret that like this:

It is not that the model needs to “learn that your character can be angry.”

It is more that:

  • the base model has a strong prior for certain emotion/scene combinations
  • your LoRA is not yet strong enough in a neutral, identity-first way
  • so when the prompt becomes semantically strong, the base prior wins

That means the fix is usually:

  • better identity anchoring
  • cleaner face dataset
  • less variable captioning

not just more steps.


My short answers to your direct questions

Is 45 visuals too much for Turbo?

Not in general. But 45 images can be too broad for a face LoRA.

Should you raise CDO?

Only slightly, if at all. It is not the main fix here. Keep it modest. (GitHub)

Should you use LoRA-side dropout/module dropout?

Yes, a small amount is worth trying if you are seeing bleed/overfit. The trainer docs describe LoRA dropout as helping prevent overfitting. (GitHub)

Should you try the base model?

Yes, if Turbo remains erratic. Base is officially presented as the better LoRA-training target. (Hugging Face)


My strongest recommendation

For your face LoRA, stop trying to make one run do everything.

Do this instead:

Face LoRA

  • very tight dataset
  • very simple captions
  • short, identity-first training goal

Character / look LoRA

  • broader framing
  • richer captions
  • location / lighting / clothing variety

That separation is the cleanest way to stop the exact failure pattern you are seeing.
