{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreifvbcmyghrfxvtp6b2iogyarxeenpcyihdyvppyemela4titfv57e",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mnizbfom5nr2"
  },
  "path": "/t/flux-lora-wont-show-my-legs-or-feet/116124#post_5",
  "publishedAt": "2026-06-04T23:49:23.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "T2I-CompBench",
    "T2I-CompBench++",
    "GenEval",
    "Attend-and-Excite",
    "HumanRefiner",
    "Distortion-5K / ViT-HD",
    "How to get full body shot in Flux?",
    "Need Help, Flux is not performing as expected in terms of composition and anatomy",
    "Diffusers: Load adapters / adjust LoRA weight scale",
    "Diffusers: Inference with PEFT / multiple adapters",
    "PEFT LoRA docs",
    "RunDiffusion: How to prepare a dataset for model training",
    "What exactly to caption for Flux LoRA training?",
    "kohya-ss Flux LoRA training tips discussion",
    "kohya-ss/sd-scripts issue #1916: FLUX LoRA trained on face images changes face when generating full-body images",
    "FluxGym GitHub repo",
    "Next Diffusion: How to train a Flux LoRA with FluxGym",
    "AI Toolkit by ostris",
    "fal.ai: Training a FLUX style LoRA",
    "Diffusers FLUX DreamBooth LoRA README",
    "Hugging Face: FLUX QLoRA on consumer hardware",
    "ControlNet paper",
    "Shakker-Labs / InstantX FLUX.1-dev-ControlNet-Union-Pro",
    "T2I-CompBench: compositional text-to-image benchmark",
    "T2I-CompBench GitHub/project",
    "GenEval: object/count/color/position evaluation for T2I",
    "Attend-and-Excite: catastrophic neglect / attention guidance",
    "HumanRefiner: abnormal human generation and limb quality",
    "Distortion-5K / ViT-HD: distorted body parts in generated images",
    "DreamBooth: subject-driven generation from a few images",
    "ControlNet: adding spatial controls to T2I diffusion models",
    "Diffusers: LoRA / PEFT inference",
    "Diffusers: loading adapters and LoRA scale",
    "Hugging Face FLUX QLoRA blog",
    "FluxGym",
    "RunDiffusion: dataset preparation guide",
    "Next Diffusion: training a Flux LoRA with FluxGym",
    "Related Flux LoRA full-body/face issue",
    "Flux composition/anatomy issue discussion",
    "Close-up / face-focused LoRA and full-body generalization discussion"
  ],
  "textContent": "Focusing primarily on ways to address issues raised in the thread, I collected a bunch of things that might be useful:\n\n* * *\n\n## TL;DR\n\nI would not treat this as one single “FLUX cannot draw legs” bug.\n\nIt looks more like several different failure modes overlapping:\n\nFailure mode | What it looks like | What I would try first\n---|---|---\n**Framing failure** | legs/feet are missing, cropped, or never enter the frame | stronger full-body framing prompt, vertical aspect ratio, camera distance, visible floor/shoes\n**Subject-token entanglement** | the LoRA keeps pulling the subject back into close-up / upper-body shots | lower LoRA strength, better captions, more varied shot distances\n**Identity-at-distance failure** | full body appears, but the face stops looking like the person | add full-body identity examples and bridge shots, not only face closeups\n**Caption entanglement** | outfit/background/crop type sticks to the person | caption clothes, background, camera distance, shot type\n**Human anatomy failure** | legs appear, but are warped/deformed | more clean lower-body examples, pose/control/inpaint; do not expect prompt alone to solve every case\n**Overtraining / prompt-following loss** | likeness improves but prompt flexibility collapses | test intermediate checkpoints and LoRA weights instead of assuming “more is better”\n\nSo my practical mental model would be:\n\n> Separate **identity** from **framing** , **clothing** , **background** , **camera distance** , and **feet/ground contact**.\n\nThe goal is not necessarily to retrain FLUX itself. The practical layer is usually: prompt framing → LoRA strength → captions → dataset balance → LoRA retraining → pose/control/inpaint if needed.\n\n* * *\n\n## 1. 
Why this can happen\n\nA subject LoRA may learn more than “who this person is.”\n\nIf most examples are close-up or mid-shot, the trigger token can also absorb things like:\n\n  * close-up framing\n  * upper-body crop\n  * missing lower body\n  * repeated outfit\n  * repeated background\n  * repeated lighting\n  * camera distance\n  * “this person usually appears without visible feet”\n\n\n\nThis is not unique to FLUX. Text-to-image models can make very plausible images while still being brittle about composition: which requested elements appear, where they appear, and which attributes belong to which object/body part. Benchmarks like T2I-CompBench, T2I-CompBench++, and GenEval are designed specifically to measure this kind of compositional weakness.\n\nThere is also a known failure mode where text-to-image diffusion models simply do not generate some requested concepts. Attend-and-Excite discusses this as **catastrophic neglect**, where one or more subjects/concepts in the prompt are not generated. That maps pretty well to “I said feet visible, but the model ignored the feet.”\n\nSeparately, human limbs are still a hard case for T2I models. Papers such as HumanRefiner and Distortion-5K / ViT-HD specifically discuss distorted limbs, missing fingers, deformed extremities, fused body parts, and other human-body distortions in generated images.\n\nSo I would split the problem into two parts:\n\n  1. **Missing lower body.** This is often a framing / prompt-attention / token-entanglement issue.\n  2. **Deformed lower body.** This can remain even after adding full-body examples, because human limb fidelity is still a general T2I weakness.\n\nThose two problems may need different mitigations.\n\n* * *\n\n## 2. 
First: try to save the existing LoRA\n\nBefore retraining, I would test the current LoRA with a small controlled grid.\n\nThe important part is to change **one thing at a time**:\n\n  * same seed\n  * same sampler/settings\n  * same resolution\n  * same base prompt\n  * then vary only one of: LoRA weight, aspect ratio, or prompt variant\n\n\n\nOtherwise it is hard to tell what helped.\n\n* * *\n\n## 2.1 Use full-body framing as a scene description, not just one token\n\nI would not rely on just:\n\n\n    full body photo of <token>\n\n\nThat may be too weak if the LoRA already learned the subject mostly as a portrait or upper-body concept.\n\nInstead, use multiple mutually reinforcing descriptions:\n\n\n    full-body photograph of <token>, standing naturally, full height visible from head to toe, entire body visible in frame, legs visible, feet visible, shoes visible, both shoes fully visible, both feet planted on the ground, photographed from a distance, camera far enough away to include the complete body, vertical 9:16 portrait, subject centered in frame, visible floor under both shoes\n\n\nThen repeat the framing near the end of the prompt:\n\n\n    showing the complete full length of the subject, a full-body photograph capturing the subject in their entirety, not a close-up portrait, not an upper-body crop\n\n\nThis is not just “prompt superstition.” It is an attempt to reduce concept neglect by giving the model several ways to attend to the same requirement:\n\nRequirement | Prompt wording\n---|---\nwhole body | `full-body photograph`, `full height visible`\nno crop | `entire body visible in frame`, `head to toe`\nlower body | `legs visible`, `feet visible`\nconcrete feet cue | `shoes visible`, `both shoes fully visible`\nground/contact | `both feet planted on the ground`, `visible floor under both shoes`\nanti-close-up | `photographed from a distance`, `camera far enough away`\ncomposition | `vertical 9:16`, `subject centered in frame`\n\nFor this specific issue, I think `visible floor under 
both shoes` is surprisingly useful because it asks for the feet, the shoes, the ground plane, and the space below the body.\n\n* * *\n\n## 2.2 Feet/shoes/floor can work better than “feet visible” alone\n\nIf “feet visible” is ignored, make the feet part of a concrete visual object/scene.\n\nTry variants like:\n\n\n    wearing visible black sneakers, both shoes fully visible, standing on a wooden floor\n\n\n\n    wearing visible boots, both boots fully visible, standing on concrete ground\n\n\n\n    empty space below the feet, visible floor under both shoes, full body centered in frame\n\n\nThis is also consistent with practical Flux prompting discussions where people report that “full body” alone can still drift toward upper-body crops, and more concrete cues like shoes/floor/head-to-toe can help. See for example these related community threads:\n\n  * How to get full body shot in Flux?\n  * Need Help, Flux is not performing as expected in terms of composition and anatomy\n\n\n\n* * *\n\n## 2.3 Aspect ratio helps, but only if paired with distance\n\nVertical framing helps, but it is not enough by itself.\n\nIf the camera is still “close,” the model may simply make a large upper-body portrait inside a vertical frame.\n\nI would test:\n\nAspect ratio | Example resolution | Why\n---|---|---\n3:4 | `864x1152`, `960x1280` | natural portrait framing\n2:3 | `832x1248`, `896x1344`, `1024x1536` | good full-body compromise\n9:16 | `720x1280`, `832x1472` | more room for full body, but face gets smaller\n\nAll of these sizes are multiples of 16. Diffusers’ `FluxPipeline` warns about and resizes width/height values that are not.\n\nPair that with:\n\n\n    camera far enough away to include the complete body\n\n\n\n    empty space above the head and below the feet\n\n\n\n    visible floor under both shoes\n\n\n* * *\n\n## 2.4 Sweep LoRA strength instead of using one fixed value\n\nIf the LoRA is strong, it may preserve likeness but also preserve the training crop style.\n\nIn other words, a high LoRA weight can accidentally ask for:\n\n> “make this person” + “make this person appear the way they appeared in the 
dataset”\n\nSo I would test a simple sweep.\n\nLoRA weight | What to check\n---|---\n`0.45` | More composition freedom, weaker likeness\n`0.55` | Possible balance point\n`0.65` | Good first candidate\n`0.75` | Check if close-up crop bias returns\n`0.85` | Stronger likeness, probably stronger dataset bias\n`1.00` | Maximum learned bias; useful as a baseline\n\nHugging Face Diffusers exposes LoRA scaling / adapter weighting in its LoRA and PEFT integration docs:\n\n  * Diffusers: Load adapters / adjust LoRA weight scale\n  * Diffusers: Inference with PEFT / multiple adapters\n  * PEFT LoRA docs\n\n\n\nEven if your UI is not Diffusers, the same idea usually applies conceptually: test LoRA influence as a variable.\n\n* * *\n\n## 2.5 If using two LoRAs, separate their roles\n\nThe `0.50 + 0.50` idea from earlier in the thread makes sense conceptually, but I would test it as a grid rather than a magic number.\n\nA useful two-LoRA split could be:\n\nLoRA | Purpose | What it should emphasize\n---|---|---\n**Identity LoRA** | face/person likeness | face, expression, angles, identity consistency\n**Framing/Flexibility LoRA** | body/framing/context separation | full-body, camera distance, clothing/background captions, visible feet/shoes\n\nThen test:\n\nTest | Identity LoRA | Framing LoRA | What to watch\n---|---|---|---\nA | `0.60` | `0.30` | face fidelity first\nB | `0.55` | `0.40` | face + body compromise\nC | `0.50` | `0.50` | balanced, close to the thread suggestion\nD | `0.45` | `0.55` | stronger framing\nE | `0.40` | `0.60` | does the face fall apart?\n\nIf full body appears but the face changes, the identity LoRA may be too weak or the dataset may not contain enough identity information at full-body distance.\n\nIf the face stays but the crop returns, the identity LoRA may be carrying too much close-up framing bias.\n\n* * *\n\n## 3. 
If retraining the LoRA: caption what you want to control later\n\nThis is probably the most important part.\n\nIf you want to control something at generation time, it should probably appear explicitly in the training captions.\n\nFor example, if the dataset contains many close-ups but the captions only say:\n\n\n    photo of <token>\n\n\nthen the LoRA may learn:\n\n\n    <token> = person identity + close-up framing + upper-body crop + this outfit + this background + no visible feet\n\n\nA better goal is:\n\n\n    <token> = person identity\n    close-up = close-up framing\n    full-body = full-body framing\n    black jacket = clothing\n    indoor room = background\n    camera at a distance = camera distance\n\n\nRunDiffusion’s dataset guide says to include composition context such as `portrait`, `full-body`, or `close-up`, along with lighting/environment/camera descriptors, in training captions:\n\n  * RunDiffusion: How to prepare a dataset for model training\n\n\n\nThere are also practical Flux LoRA captioning discussions that point in the same direction:\n\n  * What exactly to caption for Flux LoRA training?\n  * kohya-ss Flux LoRA training tips discussion\n\n\n\n* * *\n\n## 3.1 Caption examples\n\nFor close-up images:\n\n\n    close-up portrait of <token>, face visible, shoulders visible, camera close to the subject, indoor lighting\n\n\nFor upper-body images:\n\n\n    upper-body photo of <token>, torso visible, wearing a black jacket, standing indoors, soft window light\n\n\nFor waist-up images:\n\n\n    waist-up photo of <token>, upper body and waist visible, standing outdoors, camera at medium distance\n\n\nFor three-quarter images:\n\n\n    three-quarter body photo of <token>, legs partially visible, standing outdoors, wearing casual clothes, camera at medium distance\n\n\nFor full-body images:\n\n\n    full-body photo of <token>, head-to-toe visible, legs visible, feet visible, shoes visible, standing at a distance on a concrete floor\n\n\nFor full-body 
feet/shoes anchor images:\n\n\n    full-body photo of <token>, entire body visible, both shoes visible, feet planted on the ground, visible floor under both shoes, camera far enough away to include head and feet\n\n\nThe exact wording is not the important part. What matters is that close-up images are labelled as close-up images, and full-body images are labelled as full-body images.\n\n* * *\n\n## 3.2 Caption variable things so they are not absorbed into the token\n\nA practical rule:\n\nThing in training image | Caption it? | Why\n---|---|---\nclose-up / portrait crop | yes | prevents crop type from becoming part of <token>\nupper-body / waist-up / full-body | yes | makes shot distance controllable\nclothing | usually yes | prevents outfit stickiness\nbackground | usually yes | prevents background stickiness\nlighting | often yes | prevents lighting/style stickiness\ncamera distance | yes | helps separate close-up from full-body\nfeet/shoes/floor | yes for full-body samples | teaches lower-body framing explicitly\npermanent identity features | usually not | that is what <token> should learn\n\nNote: <token> is a placeholder here. In actual captions, use your real trigger token.\n\n* * *\n\n## 4. 
Build a “distance ladder” in the dataset\n\nI would avoid thinking only in terms of “face images” vs “full-body images.”\n\nThe model needs bridge examples.\n\nA close-up teaches identity well, but not body framing.\nA full-body image teaches framing well, but the face may be too small to teach identity strongly.\nBridge shots connect the two.\n\nShot type | What it teaches well | What it may fail to teach\n---|---|---\nClose-up face | face identity | body framing\nChest / upper body | identity + torso | legs/feet\nWaist-up | transition framing | feet\nThree-quarter body | legs/body connection | precise feet\nFull-body | full framing | face detail\nFull-body with visible shoes/floor | lower-body completion | face detail unless image quality is high\n\nA related failure mode is visible in this GitHub issue: a Flux LoRA trained on face images can look fine for face/half-body generations, but lose identity when asked for full-body outputs:\n\n  * kohya-ss/sd-scripts issue #1916: FLUX LoRA trained on face images changes face when generating full-body images\n\n\n\nThat is why I would include not only full-body examples, but also **bridge shots**.\n\n* * *\n\n## 4.1 Example dataset balance\n\nNot a universal recipe, but a reasonable starting point:\n\nDataset size | Close-up | Upper / waist | Three-quarter | Full-body | Clear feet/shoes\n---|---|---|---|---|---\n20 images | 4 | 5 | 4 | 5 | 2\n30 images | 5 | 8 | 6 | 8 | 3\n40 images | 6 | 10 | 8 | 12 | 4\n\nFor this specific issue, I would rather have **fewer but cleaner** full-body examples than many low-quality ones.\n\nFull-body examples should ideally have:\n\n  * face still recognizable\n  * full head-to-toe framing\n  * feet/shoes visible\n  * visible ground/floor contact\n  * not all the same outfit\n  * not all the same background\n  * not all the same pose\n  * not all the same camera distance\n\n\n\nFluxGym and similar trainer guides often recommend a small balanced set of high-quality images rather than 
indiscriminately adding more data:\n\n  * FluxGym GitHub repo\n  * Next Diffusion: How to train a Flux LoRA with FluxGym\n  * AI Toolkit by ostris\n\n\n\n* * *\n\n## 5. Diagnose the output before changing the dataset\n\nI would keep a small spreadsheet/log like this:\n\nSeed | Resolution | LoRA weight | Prompt variant | Face likeness | Full body? | Feet visible? | Leg anatomy | Notes\n---|---|---|---|---|---|---|---|---\n1234 | 832x1248 | 0.65 | full-body v1 | good | no | no | n/a | crop bias\n1234 | 832x1248 | 0.55 | full-body v1 | ok | yes | partial | bad | anatomy issue\n1234 | 896x1344 | 0.55 | shoes/floor v2 | ok | yes | yes | better | candidate\n\nThis helps avoid confusing different problems.\n\nSymptom | Likely interpretation | What to try\n---|---|---\nFace good, legs missing | crop/framing bias | stronger framing prompt, lower LoRA weight, vertical ratio\nFull body appears, face changes | identity-at-distance failure | full-body identity examples, bridge shots\nFeet appear, legs warp | anatomy weakness | cleaner lower-body data, pose/control, inpaint\nOutfit always same | clothing absorbed into token | caption clothing, vary clothing\nBackground always same | background absorbed into token | caption background, vary background\nPrompt ignored at high LoRA weight | LoRA overpowering base prompt | lower weight, test checkpoints\nBetter at lower weight but face weak | identity/framing tradeoff | two-LoRA split or better balanced retrain\n\n* * *\n\n## 6. Watch for overtraining and prompt-following loss\n\nMore training is not automatically better.\n\nStronger LoRA / more steps can improve likeness, but it can also pull the model back toward the training distribution and reduce prompt flexibility.\n\nSo if retraining, I would save intermediate checkpoints and compare them with the same test prompt.\n\nFor example:\n\nCheckpoint | What to compare\n---|---\n500 steps | undertrained? 
weak likeness?\n800 steps | does body framing start working?\n1000 steps | first serious candidate\n1500 steps | better likeness or more overfit?\n2000 steps | does prompt-following degrade?\n\nfal.ai’s Flux LoRA training writeup is useful here because it compares training steps and discusses prompt-following / style strength tradeoffs:\n\n  * fal.ai: Training a FLUX style LoRA\n\n\n\nHugging Face also has practical FLUX LoRA / QLoRA resources:\n\n  * Diffusers FLUX DreamBooth LoRA README\n  * Hugging Face: FLUX QLoRA on consumer hardware\n\n\n\nI would use those as implementation references, not as guarantees that a specific number of steps will solve legs.\n\n* * *\n\n## 7. If the issue is leg deformation, not just missing legs\n\nIf the lower body appears but the legs are warped, I would not expect captions alone to fix every case.\n\nThat is where pose/control or post-generation repair can be more reliable.\n\nPossible production-oriented fallbacks:\n\nTool/approach | When useful\n---|---\nPose / ControlNet-style conditioning | when you need exact full-body pose/framing\nOutpainting downward | when upper body/face is good but lower body is missing\nInpainting lower body | when legs/feet exist but are wrong\nFace pass after full-body generation | when full-body works but face likeness drops\nGenerate full-body first, then refine identity | when portrait-biased LoRA fights full-body framing\n\nControlNet is the classic reference for adding spatial controls such as edges, depth, segmentation, and human pose to text-to-image models:\n\n  * ControlNet paper\n\n\n\nFor FLUX specifically, there are FLUX ControlNet/Union-style models that support pose-like conditioning, though quality and workflow compatibility depend on your UI:\n\n  * Shakker-Labs / InstantX FLUX.1-dev-ControlNet-Union-Pro\n\n\n\nI would treat this as a fallback path, not the first thing to try. 
It is very useful if you need reliable output, but it can hide whether the LoRA itself is actually fixed.\n\n* * *\n\n## 8. A practical “try this in order” checklist\n\n### Phase A — no retraining\n\n  1. Use vertical aspect ratio.\n  2. Lower LoRA weight.\n  3. Use redundant full-body framing.\n  4. Add concrete feet/shoes/floor cues.\n  5. Test same seed across a small grid.\n\n\n\nExample test prompt:\n\n\n    full-body photograph of <token>, standing naturally, full height visible from head to toe, entire body visible in frame, legs visible, feet visible, shoes visible, both shoes fully visible, both feet planted on the ground, photographed from a distance, camera far enough away to include the complete body, vertical 9:16 portrait, subject centered in frame, visible floor under both shoes, showing the complete full length of the subject, not a close-up portrait, not an upper-body crop\n\n\n### Phase B — LoRA strength / two-LoRA grid\n\nSingle LoRA:\n\n\n    0.45, 0.55, 0.65, 0.75, 0.85, 1.00\n\n\nTwo LoRAs:\n\n\n    Identity 0.60 + Framing 0.30\n    Identity 0.55 + Framing 0.40\n    Identity 0.50 + Framing 0.50\n    Identity 0.45 + Framing 0.55\n    Identity 0.40 + Framing 0.60\n\n\n### Phase C — retrain LoRA, not FLUX itself\n\n  1. Add bridge shots.\n  2. Add full-body identity shots.\n  3. Add explicit feet/shoes/floor examples.\n  4. Caption all controllable visual factors.\n  5. Save intermediate checkpoints.\n  6. Compare with fixed prompt/seed/resolution.\n\n\n\n### Phase D — if anatomy still fails\n\n  1. Use pose/control.\n  2. Inpaint legs/feet.\n  3. Outpaint lower body.\n  4. Generate full-body first, then refine face/identity.\n\n\n\n* * *\n\n## 9. 
The main rules I would use\n\nIf I had to compress all of this into one short rule set:\n\n> * If you want to control it later, caption it during training.\n> * If it needs to be spatially reliable, do not rely on text alone.\n> * If the LoRA keeps overriding the prompt, reduce its strength or fix the dataset/captions.\n\nFor this thread specifically:\n\n  * `full-body` should be in the training captions for full-body images.\n  * `close-up portrait` should be in the captions for close-up images.\n  * `feet visible`, `shoes visible`, and `standing on floor/ground` should be in the captions for examples where that matters.\n  * Clothing/background/camera distance should be captioned if you do not want them absorbed into <token>.\n  * Full-body examples should not be so small/blurred that the model cannot learn identity at that distance.\n  * Close-ups should not dominate so much that the trigger becomes a portrait-crop trigger.\n\n\n\n* * *\n\n## 10. Useful links\n\n### Theory / why this is hard\n\n  * T2I-CompBench: compositional text-to-image benchmark\n  * T2I-CompBench GitHub/project\n  * GenEval: object/count/color/position evaluation for T2I\n  * Attend-and-Excite: catastrophic neglect / attention guidance\n  * HumanRefiner: abnormal human generation and limb quality\n  * Distortion-5K / ViT-HD: distorted body parts in generated images\n  * DreamBooth: subject-driven generation from a few images\n  * ControlNet: adding spatial controls to T2I diffusion models\n\n\n\n### LoRA / FLUX implementation\n\n  * Diffusers: LoRA / PEFT inference\n  * Diffusers: loading adapters and LoRA scale\n  * PEFT LoRA docs\n  * Diffusers FLUX DreamBooth LoRA README\n  * Hugging Face FLUX QLoRA blog\n  * FluxGym\n  * AI Toolkit by ostris\n\n\n\n### Captioning / dataset practice\n\n  * RunDiffusion: dataset preparation guide\n  * What exactly to caption for Flux LoRA training?\n  * Next Diffusion: training a Flux LoRA with FluxGym\n  * kohya-ss Flux LoRA training tips discussion\n  * Related Flux LoRA 
full-body/face issue\n\n\n\n### Related community examples\n\n  * How to get full body shot in Flux?\n  * Flux composition/anatomy issue discussion\n  * Close-up / face-focused LoRA and full-body generalization discussion\n\n\n\n* * *\n\n## Final thought\n\nI would probably not try to solve this by jumping directly to “train a bigger/better model.”\n\nInstead, I would debug it by asking, in order:\n\n  1. Does the existing LoRA allow full-body framing at lower strength?\n  2. Does the prompt make feet/shoes/floor concrete enough?\n  3. Does full-body work only when identity is weak?\n  4. Are close-up, full-body, clothing, background, and camera distance separated in captions?\n  5. Does the dataset contain a distance ladder, or only close-ups and full bodies with no bridge?\n  6. Are the remaining failures actually human-limb deformation rather than a missing lower body?\n\n\n\nThat separation makes the problem much easier to work on.",
  "title": "Flux LORA won't show my legs or feet"
}
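Section 2.3 of the post above lists vertical resolutions for full-body tests. The helper below is a minimal sketch for deriving such sizes from an aspect ratio and a pixel budget. Several details are assumptions of this sketch and do not come from any trainer or UI: the function name `snap_resolution`, the roughly 1-megapixel default budget, and the round-to-16 rule. The multiple of 16 is chosen to match what Diffusers' FluxPipeline expects for width and height.

```python
import math


def snap_resolution(aspect_w: int, aspect_h: int,
                    target_pixels: int = 1024 * 1024,
                    multiple: int = 16) -> tuple[int, int]:
    """Hypothetical helper: return (width, height) near the requested aspect
    ratio and pixel budget, with both sides rounded to a multiple of `multiple`."""
    ratio = aspect_w / aspect_h
    # Solve width * height = target_pixels subject to width / height = ratio.
    width = math.sqrt(target_pixels * ratio)
    height = width / ratio

    def snap(value: float) -> int:
        # Round to the nearest multiple, but never below one multiple.
        return max(multiple, round(value / multiple) * multiple)

    return snap(width), snap(height)


for w, h in [(3, 4), (2, 3), (9, 16)]:
    print(f"{w}:{h} ->", snap_resolution(w, h))
# 3:4 -> (880, 1184)
# 2:3 -> (832, 1248)
# 9:16 -> (768, 1360)
```

The 2:3 result matches the `832x1248` size in the post's table. Raising `target_pixels` gives larger full-body frames, where the face occupies more pixels.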
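Sections 2.4 and 2.5 of the post suggest sweeping LoRA weights on a fixed seed rather than trusting a single value. Below is a minimal sketch, assuming Diffusers with its PEFT backend. `set_adapters(..., adapter_weights=...)`, `load_lora_weights(..., adapter_name=...)` and the pipeline call with `generator=` are existing Diffusers APIs. The adapter names `identity` and `framing`, the LoRA paths, and the `run_two_lora_grid` helper are hypothetical. The pipeline and the generator factory are passed in, so the loop itself has no GPU dependency.

```python
from typing import Any, Callable, Sequence

# (identity weight, framing weight) pairs from the two-LoRA grid in section 2.5.
TWO_LORA_GRID = [(0.60, 0.30), (0.55, 0.40), (0.50, 0.50), (0.45, 0.55), (0.40, 0.60)]


def run_two_lora_grid(pipe: Any, prompt: str, seed: int,
                      make_generator: Callable[[int], Any],
                      grid: Sequence[tuple[float, float]] = TWO_LORA_GRID,
                      width: int = 832, height: int = 1248) -> list:
    """Hypothetical helper: render one image per weight pair while keeping the
    prompt, seed and resolution fixed, so only the LoRA mix changes."""
    results = []
    for identity_w, framing_w in grid:
        # Activate both adapters with per-adapter weights (Diffusers PEFT integration).
        pipe.set_adapters(["identity", "framing"], adapter_weights=[identity_w, framing_w])
        # A fresh generator per image so every run starts from the same noise.
        output = pipe(prompt=prompt, width=width, height=height,
                      generator=make_generator(seed))
        results.append(((identity_w, framing_w), output.images[0]))
    return results


# Example wiring (needs a GPU and the model weights; paths are placeholders):
# import torch
# from diffusers import FluxPipeline
# pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev",
#                                     torch_dtype=torch.bfloat16).to("cuda")
# pipe.load_lora_weights("path/to/loras", weight_name="identity.safetensors",
#                        adapter_name="identity")
# pipe.load_lora_weights("path/to/loras", weight_name="framing.safetensors",
#                        adapter_name="framing")
# images = run_two_lora_grid(pipe, "full-body photograph of <token>, ...", seed=1234,
#                            make_generator=lambda s: torch.Generator("cuda").manual_seed(s))
```

The same loop covers the single-LoRA sweep from section 2.4 if you pass one adapter name and the weights 0.45 to 1.00.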
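Sections 3.1 and 3.2 of the post argue that shot type, clothing, background and lighting should be captioned explicitly so they do not collapse into the trigger token. Below is a minimal sketch of a caption template helper. The function, the shot-type keys and the example token `ohwx` are hypothetical. The phrases are adapted from the post's own caption examples.

```python
SHOT_PHRASES = {
    "close-up": "close-up portrait of {token}, face visible, shoulders visible, camera close to the subject",
    "upper-body": "upper-body photo of {token}, torso visible",
    "waist-up": "waist-up photo of {token}, upper body and waist visible, camera at medium distance",
    "three-quarter": "three-quarter body photo of {token}, legs partially visible, camera at medium distance",
    "full-body": "full-body photo of {token}, head-to-toe visible, legs visible, feet visible, shoes visible",
}

# Extra clause for the full-body "feet/shoes anchor" images.
FEET_ANCHOR = "both shoes visible, feet planted on the ground, visible floor under both shoes"


def build_caption(token: str, shot: str, *, clothing: str = "", background: str = "",
                  lighting: str = "", feet_anchor: bool = False) -> str:
    """Hypothetical helper: shot type first, then the variable factors, so each
    factor is named in the caption instead of being absorbed into `token`."""
    if shot not in SHOT_PHRASES:
        raise ValueError(f"unknown shot type: {shot!r}")
    if feet_anchor and shot != "full-body":
        raise ValueError("feet/floor anchors only make sense for full-body images")
    parts = [SHOT_PHRASES[shot].format(token=token)]
    # Skip factors that were not provided so captions stay clean.
    parts += [p for p in (clothing, background, lighting) if p]
    if feet_anchor:
        parts.append(FEET_ANCHOR)
    return ", ".join(parts)


print(build_caption("ohwx", "full-body", clothing="wearing a black jacket",
                    background="standing on a concrete floor", feet_anchor=True))
```

Generating every caption from one template keeps the shot-type wording consistent across the dataset. That consistency is what lets `full-body` or `close-up` act as a controllable prompt term later.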
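Section 5 of the post suggests logging every test render. Below is a minimal sketch that pre-fills such a log for one fixed seed. The column names come from the post's table; the helper names and the CSV output are assumptions. Every row shares the seed, so comparing two rows that differ in a single column isolates that one variable, as section 2 recommends.

```python
import csv
import io
import itertools

LOG_COLUMNS = ["Seed", "Resolution", "LoRA weight", "Prompt variant", "Face likeness",
               "Full body?", "Feet visible?", "Leg anatomy", "Notes"]

# Single-LoRA sweep values from section 2.4.
LORA_WEIGHTS = [0.45, 0.55, 0.65, 0.75, 0.85, 1.00]


def build_test_grid(seed, resolutions, prompt_variants, weights=LORA_WEIGHTS):
    """Hypothetical helper: one row per (resolution, weight, prompt variant);
    the judgement columns stay empty until you have looked at the image."""
    rows = []
    for resolution, weight, variant in itertools.product(resolutions, weights, prompt_variants):
        row = dict.fromkeys(LOG_COLUMNS, "")
        row.update({"Seed": seed, "Resolution": resolution,
                    "LoRA weight": f"{weight:.2f}", "Prompt variant": variant})
        rows.append(row)
    return rows


def rows_to_csv(rows):
    """Serialize the log so it can be opened in a spreadsheet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=LOG_COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


grid = build_test_grid(1234, ["832x1248", "896x1344"], ["full-body v1", "shoes/floor v2"])
print(len(grid))  # 2 resolutions x 6 weights x 2 prompt variants = 24 rows
print(rows_to_csv(grid[:2]))
```

Filling in the empty columns by eye after each render gives the symptom table from section 5 something concrete to be matched against.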