Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreics64f4wyz34nvaockreyro3jyge3i3p4ptzjmcdi6335bgvlecnq",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3ml4zt2eilev2"
  },
  "path": "/t/multi-image-edit-3-refs-artifacts-at-true-cfg-fine-on-lightning-reference-content-dependent/175726#post_2",
  "publishedAt": "2026-05-05T19:20:49.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Qwen/Qwen-Image-Edit model card",
    "Diffusers QwenImage docs",
    "Diffusers Qwen Image Edit pipeline source",
    "Qwen Image Edit Plus pipeline copy",
    "lightx2v/Qwen-Image-Edit-2511-Lightning",
    "Qwen2-VL image processor source",
    "Transformers Qwen2-VL docs",
    "Qwen/Qwen-Image-Edit-2511 model card",
    "Qwen/Qwen-Image-Edit-2511 app.py prompt guidance",
    "ComfyUI issue #9481: 1MP fixed resizing in TextEncodeQwenImageEdit",
    "Reddit workflow: Qwen-Image-Edit unzooming / reference latent fix",
    "ComfyUI Qwen-Image-Edit-2511 guide",
    "Diffusers Qwen Image Edit source",
    "Qwen Image Edit Plus pipeline copy with true-CFG rescale",
    "Qwen Image Edit Plus width/height VAE fix in Diffusers PR #12453",
    "Qwen-Image-Edit unzooming workflow / explicit ReferenceLatent approach",
    "Qwen Image Edit zoom / latent-size discussion",
    "Qwen Image Edit latent-aware scaling discussion",
    "Qwen Image Edit zooming note / padding observation",
    "MultiBind: A Benchmark for Attribute Misbinding in Multi-Subject Generation",
    "UniRef-Image-Edit: Towards Scalable and Consistent Multi-Reference Image Editing",
    "Classifier-Free Diffusion Guidance",
    "Common Diffusion Noise Schedules and Sample Steps are Flawed",
    "Eliminating Oversaturation and Artifacts of High Guidance Scales in Diffusion Models / APG",
    "Energy-Preserving Classifier-Free Guidance"
  ],
  "textContent": "Hmm… complicated…\n\n* * *\n\n## Short version\n\nI would treat this as a **real, repeatable interaction between multi-reference conditioning and full true-CFG** , but I would **not** call it a confirmed public “Qwen2.5-VL 384² token-norm outlier bug.”\n\nThe best working diagnosis is:\n\n> Some 3-reference packs create an unstable conditional prediction. Full true-CFG repeatedly pushes the denoising trajectory along that unstable `cond - neg` direction. The norm rescale can keep the prediction magnitude bounded, but it cannot guarantee that the update direction is semantically safe. Lightning with `true_cfg=1` avoids that failure path, so the same references can look clean.\n\nIn this setup, the failure probably is not one single thing. It is likely the intersection of:\n\n  * **3 image references** with different roles: face close-up + front body + back body.\n  * **High-frequency synthetic reference content** , especially dense curls, skin microtexture, fabric/texture detail, or turbo-model sharpening.\n  * **Qwen-Image-Edit’s dual image-conditioning design** , where the input image is routed through both Qwen2.5-VL semantic conditioning and VAE appearance conditioning.\n  * **Full true-CFG** , which uses the conditional/negative prediction difference and then applies a norm-ratio rescale.\n  * **ComfyUI preprocessing / latent geometry** , including possible hidden resize or mismatch between reference conditioning geometry and KSampler latent geometry.\n  * **Custom sampler/scheduler behavior** from `res_3m + bong_tangent`.\n\n\n\nI would not conclude that 3+ image references are “only stable on Lightning.” Full multi-image Qwen edit can work. But this exact corner — **3 refs + high-frequency synthetic refs + full true-CFG + non-square output + custom sampler** — is fragile enough that I would not run it as a one-shot full-CFG workflow without reducing ambiguity first.\n\n* * *\n\n## Why this failure pattern is meaningful\n\nThe key pattern is:\n\n\n    1 face reference only:\n      clean in full CFG\n      clean in Lightning\n\n    3 references:\n      clean for some character/reference sets\n      broken for other character/reference sets\n\n    same parameters and seeds:\n      clean or broken depending on reference content\n\n    Lightning 4-step, true_cfg=1:\n      clean for every reference set\n\n\nThat pattern strongly argues against a simple “bad seed” or “bad prompt” explanation.\n\nIf it were purely random sampling, you would expect less consistent dependence on the reference set. If it were purely output resolution, BF16, or the model checkpoint, you would expect the breakage to be less dependent on which character/reference pack is used. If it were purely the sampler, one reference should also be more fragile.\n\nInstead, the most useful interpretation is:\n\n\n    reference content\n    → unstable multi-image conditioning\n    → full true-CFG amplifies it\n    → artifact appears over denoising steps\n\n\nThe reference-content dependence is the important clue.\n\n* * *\n\n## What is known from the model design\n\nQwen-Image-Edit does not use the input image in only one way. The model card says the input image is fed into **Qwen2.5-VL for semantic control** and into the **VAE encoder for visual appearance control**.\n\nSource: Qwen/Qwen-Image-Edit model card\n\nThat matters because the artifact can originate in either path:\n\nChannel | What it controls | How it can fail\n---|---|---\n**Qwen2.5-VL semantic path** | identity meaning, object roles, face/body interpretation, picture-to-picture binding | identity drift, wrong reference role, subject blending, face/body confusion\n**VAE / reference-latent path** | color, texture, local visual detail, clothing material, skin/hair texture | texture corruption, color bleed, hair/skin over-detail, local anatomy deformation\n\nYour symptoms span both channels:\n\n  * **identity drift** → semantic/reference-binding instability.\n  * **color/texture corruption** → appearance/reference-latent instability.\n  * **anatomy distortion** → reference-role confusion plus guidance/sampler amplification.\n\n\n\nThat is why “just improve the prompt” is usually not enough. Prompt clarity helps, but the model is also consuming multiple visual encodings and reference latents.\n\n* * *\n\n## What is known from Qwen true-CFG\n\nDiffusers’ Qwen docs distinguish normal `guidance_scale` from real Qwen classifier-free guidance. In the Qwen pipeline, true CFG is enabled with `true_cfg_scale` plus a `negative_prompt`; even an empty negative prompt can activate the branch.\n\nSource: Diffusers QwenImage docs\n\nThe Qwen edit pipeline source says true CFG is enabled when `true_cfg_scale > 1` and a negative prompt is provided. It also says higher guidance links the image more closely to the prompt, usually at the cost of lower image quality.\n\nSource: Diffusers Qwen Image Edit pipeline source\n\nA Qwen-Image-Edit-Plus pipeline copy shows the relevant true-CFG calculation:\n\nSource: Qwen Image Edit Plus pipeline copy\n\nThe important part is essentially:\n\n\n    comb_pred = neg_noise_pred + true_cfg_scale * (noise_pred - neg_noise_pred)\n\n    cond_norm = torch.norm(noise_pred, dim=-1, keepdim=True)\n    noise_norm = torch.norm(comb_pred, dim=-1, keepdim=True)\n\n    noise_pred = comb_pred * (cond_norm / noise_norm)\n\n\nThe norm rescale is easy to overinterpret. It can keep the combined prediction’s magnitude near the conditional prediction’s magnitude, but it does **not** prove that the direction is safe.\n\nIn plain language:\n\n\n    right-sized vector\n    does not necessarily mean\n    right semantic direction\n\n\nSo if the 3-reference conditional prediction is already unstable, true-CFG can repeatedly push the trajectory in a bad direction while the norm rescale still appears mathematically “reasonable.”\n\n* * *\n\n## Why Lightning being clean does not disprove the full-CFG issue\n\nLightning is not simply “the same full model but fewer steps.” The Lightning card describes step distillation that reduces standard inference to 4 steps and gives a large speedup compared with standard 40-step inference.\n\nSource: lightx2v/Qwen-Image-Edit-2511-Lightning\n\nSo this comparison:\n\n\n    Full BF16:\n      true_cfg = 2.7\n      33 steps\n      artifacts on some 3-ref packs\n\n    Lightning:\n      true_cfg = 1\n      4 steps\n      clean on all 3-ref packs\n\n\nshould be interpreted as:\n\n\n    long full-guidance trajectory:\n      fragile\n\n    short distilled / no-true-CFG trajectory:\n      robust\n\n\nIt should not be interpreted as:\n\n\n    the reference pack is universally safe\n\n\nThe Lightning result is useful because it says the references contain enough usable information to make a clean image. But it does not prove that full true-CFG can use that same information stably.\n\n* * *\n\n## Is the 384² Qwen2.5-VL downscale the root cause?\n\nPossible, but not proven.\n\nA more careful statement is:\n\n> High-frequency rendered references can plausibly produce unstable visual-token or reference-latent conditioning after resizing/downsampling. That instability can then appear downstream as a larger conditional-vs-negative prediction difference during full true-CFG. But I would not claim that the root cause is specifically Qwen2.5-VL per-token norm outliers from 384² resizing unless tensor logging confirms it.\n\nWhy the suspicion is technically reasonable:\n\nQwen2-VL-style image preprocessing uses `smart_resize`, with dimensions made divisible by a factor tied to patch/merge behavior. The source shows defaults such as `patch_size=14`, `merge_size=2`, and a resize factor of `28`.\n\nSources:\n\n  * Qwen2-VL image processor source\n  * Transformers Qwen2-VL docs\n\n\n\nThat makes this diagnostic worth testing:\n\n\n    384 / 28 = 13.714...\n    392 / 28 = 14\n\n\nSo if a node exposes `target_vl_size`, testing `392` instead of `384` is useful. It does not prove the theory, but it removes one avoidable grid-alignment variable.\n\nThe high-frequency-content hypothesis has three possible locations:\n\nSuspect | Meaning | Test\n---|---|---\n**VL token path** | resized/patchified semantic image tokens become unstable for dense curls, skin texture, or sharp synthetic detail | smooth only the VL input; keep VAE/ref input original\n**VAE/reference-latent path** | appearance latents over-inject high-frequency texture | smooth only the VAE/ref input; keep VL input original\n**CFG path** | full true-CFG amplifies an unstable conditional prediction | sweep `true_cfg` from `1.0` to `2.7`\n\nThe current evidence proves **content-dependent instability**. It does not yet prove exactly where inside the stack that instability begins.\n\n* * *\n\n## Are 3+ references only stable on Lightning?\n\nNo, not generally.\n\nQwen-Image-Edit-2511 is explicitly presented as improving character consistency, and Qwen/Diffusers-style edit pipelines support image-conditioned editing. The issue is not “multi-image references are impossible.” The issue is that your exact setup is a high-risk corner.\n\nSource: Qwen/Qwen-Image-Edit-2511 model card\n\nThe fragile combination is:\n\n\n    3 references\n    face + front body + back body\n    synthetic high-frequency rendered references\n    BF16 full model\n    true_cfg = 2.7\n    33 denoising steps\n    1024x1536 output\n    RES4LYF res_3m + bong_tangent\n\n\nSo the better answer is:\n\n> 3+ references are not Lightning-only, but this exact 3-ref/full-CFG/custom-sampler setup should be treated as fragile. Use Lightning for first-pass composition, then use lower-CFG full BF16 refinement with fewer or weaker references.\n\n* * *\n\n## The biggest practical change: stop treating all 3 references equally\n\nThe three images have different jobs.\n\nReference | Correct role | Wrong role to avoid\n---|---|---\n**Face close-up** | identity, face structure, hairline, expression, age impression | full outfit geometry, back clothing\n**Body front** | front outfit, body proportions, front silhouette, color placement | face identity\n**Body back** | rear clothing, back silhouette, hair length from behind | face identity, front anatomy, skin texture source\n\nThe back reference is especially dangerous because it can contain strong hair/body/clothing cues without a face identity anchor. If it participates fully in the VAE/reference-latent path, it can inject body/texture information that competes with the face and front-body references.\n\nIf the node supports separate semantic/reference participation, test:\n\nRef | VL semantic path | VAE/ref-latent path\n---|---|---\nFace | on | on\nFront body | on | on\nBack body | on | **off initially**\n\nIn plain language:\n\n\n    Use the back reference as semantic guidance first.\n    Do not let it be a full appearance/reference-latent source unless needed.\n\n\nOnly enable the back reference as a full VAE/reference latent if the final output is a back-view image or if rear outfit construction is essential.\n\n* * *\n\n## Prompt template I would use\n\nThe official Qwen-Image-Edit-2511 app prompt guidance says multi-image prompts should clearly specify which image’s element is being modified.\n\nSource: Qwen/Qwen-Image-Edit-2511 app.py prompt guidance\n\nFor a front-facing or general full-body portrait, I would use a prompt like this:\n\n\n    Use the references with strict roles.\n\n    Picture 1 is the identity reference. Preserve the same face identity, facial structure, age impression, hairline, and overall character identity from Picture 1.\n\n    Picture 2 is the front body and outfit reference. Use it for body proportions, front silhouette, clothing shape, front-view outfit details, and color placement.\n\n    Picture 3 is the back outfit reference only. Use it only for back-side clothing construction, rear silhouette, and hair length visible from behind. Do not use Picture 3 to change the face, facial identity, skin texture, expression, or front-facing anatomy.\n\n    Generate one coherent person in a clean full-body 2:3 portrait. Do not blend identities. Do not average the face across references. Keep natural anatomy, stable skin texture, stable hair texture, and consistent clothing material. Do not copy rear-view anatomy into the front view.\n\n\nFor a back-view output, change the roles:\n\n\n    Use the references with strict roles.\n\n    Picture 1 is the identity and hair reference. Preserve the same character identity and overall hair type from Picture 1, but do not invent a visible face because the final image is a back view.\n\n    Picture 2 is the front outfit reference. Use it only for consistent clothing design, material, and color placement.\n\n    Picture 3 is the back outfit reference. Use it as the primary source for the rear silhouette, back-side clothing construction, hair length from behind, and rear material layout.\n\n    Generate a clean full-body back-view 2:3 portrait of one coherent person. Keep the outfit consistent across front and back references. Keep anatomy natural. Do not create extra limbs, duplicate hair masses, face fragments, or mixed front/back body structure.\n\n\nThe point is not literary quality. The point is to reduce reference-role ambiguity.\n\n* * *\n\n## Reference preprocessing I would apply\n\nBecause the failing sets are high-frequency rendered references, I would preprocess the references before changing more sampler/model knobs.\n\nThe goal is **not** to change identity. The goal is to reduce unstable synthetic microtexture.\n\n### Face reference\n\nOperation | Strength | Reason\n---|---|---\ncrop to face/head/upper shoulders | strong | remove irrelevant body/background tokens\nremove or simplify busy background | strong | reduce unrelated visual tokens\nmild denoise | low | remove synthetic turbo grain\nmild de-sharpen / reduce local contrast | low | reduce patch-level hair/skin spikes\npreserve face identity/color | strict | avoid changing identity\n\n### Front body reference\n\nOperation | Strength | Reason\n---|---|---\nclean full-body crop | strong | keep body/outfit information\nsimplify background | medium/strong | reduce irrelevant reference detail\nmild de-sharpen | low | reduce texture overbinding\npreserve clothing color layout | strict | this is the outfit source\n\n### Back body reference\n\nOperation | Strength | Reason\n---|---|---\nclean back-body crop | strong | keep only rear silhouette/outfit\nsimplify background | strong | reduce irrelevant tokens\nmild denoise / de-sharpen | medium | this reference is high-risk\navoid full VAE/ref path initially | strong | prevent appearance over-injection\n\nAvoid prompts that increase microtexture pressure:\n\n\n    ultra detailed skin, sharp curly hair, high texture, 4k, hyper detailed material\n\n\nPrefer stability wording:\n\n\n    stable natural skin texture, coherent hair texture, clean silhouette, consistent material, natural anatomy\n\n\n* * *\n\n## Geometry and latent-size checks\n\nTreat hidden geometry mismatch as a first-class suspect.\n\nA ComfyUI issue says `TextEncodeQwenImageEdit` targets roughly 1M pixels internally, and warns that if the latent passed to KSampler is not based on that same effective geometry, unintended zooming can occur.\n\nSource: ComfyUI issue #9481: 1MP fixed resizing in TextEncodeQwenImageEdit\n\nThat issue is about zoom/drift, but it still matters here. Under strong true-CFG, a geometry/reference-latent mismatch can show up as broader corruption, not just zoom.\n\nAvoid a graph shaped like this:\n\n\n    reference images\n    → TextEncode internal resize\n    → VAE/reference latents at another size\n    → KSampler latent at another size\n    → output 1024x1536\n\n\nPrefer one geometry source of truth:\n\n\n    preprocess/crop/pad references\n    → choose final target geometry\n    → build or encode latents consistently\n    → feed references through controlled semantic/ref-latent paths\n    → sample at the intended 1024x1536 geometry\n\n\nA community workflow for Qwen edit zooming reports fixing most zooming by disconnecting the VAE input from `TextEncodeQwenImageEditPlus`, adding `VAE Encode` per source, and chaining `ReferenceLatent` nodes.\n\nSource: Reddit workflow: Qwen-Image-Edit unzooming / reference latent fix\n\nEven though your symptom is more than zooming, I would still test explicit reference latents because it removes a major hidden-variable class.\n\n* * *\n\n## Try `target_vl_size=392` if available\n\nIf the node exposes a VL target size, test:\n\n\n    384 → 392\n\n\nReason:\n\n\n    Qwen2-VL visual preprocessing uses a 28-pixel factor.\n    384 is not divisible by 28.\n    392 is divisible by 28.\n\n\nSource: Qwen2-VL image processor source\n\nInterpretation:\n\nResult | Interpretation\n---|---\n`392` improves failing refs | VL resize/grid behavior is involved\n`392` changes nothing | issue is more likely VAE/ref path, CFG, sampler, or reference binding\n`392` worsens output | revert; the node may already be doing its own correction\n\nThis is a diagnostic, not a guaranteed fix.\n\n* * *\n\n## Do not diagnose with `res_3m + bong_tangent` first\n\nThe custom sampler may be useful for final output, but it is not the right baseline.\n\nUse this order:\n\n\n    1. Latest Diffusers QwenImageEditPlusPipeline, if possible\n    2. Official/native ComfyUI Qwen-Image-Edit-2511 workflow\n    3. Native Comfy + same 3 refs\n    4. Native Comfy + true-CFG sweep\n    5. Your workflow without RES4LYF\n    6. Your workflow with RES4LYF\n\n\nComfyUI’s official Qwen-Image-Edit-2511 guide is the right baseline for the Comfy side.\n\nSource: ComfyUI Qwen-Image-Edit-2511 guide\n\nIf the failure appears only after step 6, the root is not simply “Qwen multi-ref token packing.” It is more likely:\n\n\n    multi-ref conditioning\n    × true CFG\n    × custom sampler/scheduler behavior\n\n\n* * *\n\n## Recommended settings\n\n### Stable production path\n\nUse the path that already works.\n\n\n    Model: Qwen-Image-Edit-2511 + Lightning\n    Steps: 4\n    true_cfg: 1.0\n    Output: 1024x1536\n    References: face + front body + back body\n    Prompt: strict reference-role prompt\n    Negative prompt: blank/minimal\n    Sampler: Lightning-compatible/native first\n\n\nUse this when reliability matters.\n\n### Best quality/stability compromise: two-stage workflow\n\nThis is my strongest practical recommendation.\n\n\n    Stage 1:\n      Model: Qwen-Image-Edit-2511 Lightning\n      Refs: face + front body + back body\n      Steps: 4\n      true_cfg: 1.0\n      Output: 1024x1536\n      Goal: stable composition and reference binding\n\n    Stage 2:\n      Model: Qwen-Image-Edit-2511 BF16\n      Source: Stage 1 output\n      Refs: face only, or face + front body\n      Back ref: omit unless generating a back view\n      Steps: 25-40\n      true_cfg: 1.3-1.7\n      negative_prompt: \" \"\n      Sampler: native first\n      Goal: detail, identity polish, clothing consistency, texture repair\n\n\nThis works because Stage 1 avoids the long full-CFG failure trajectory, and Stage 2 no longer needs to solve the entire 3-reference binding problem.\n\n### One-pass full-BF16 attempt\n\nIf you want one-pass full BF16, I would start here:\n\n\n    Model: Qwen-Image-Edit-2511 BF16\n    Pipeline/workflow: native Diffusers or native Comfy first\n    Output: 1024x1536\n    Steps: 33-40\n    true_cfg_scale: 1.4-1.6\n    negative_prompt: \" \"\n    Sampler: native first\n    References:\n      Picture 1: face close-up, identity source\n      Picture 2: body front, body/outfit source\n      Picture 3: body back, semantic-only if possible\n    target_vl_size: try 392 if available\n\n\nDo **not** start at `true_cfg=2.7` for failing packs. Treat 2.7 as a stress-test value.\n\nLikely CFG ranges:\n\ntrue CFG | Expected behavior\n---|---\n`1.0` | no true-CFG pressure; baseline\n`1.2` | very safe\n`1.4-1.6` | best starting range\n`1.8` | possibly usable\n`2.1` | likely starts exposing fragile refs\n`2.4-2.7` | likely artifact zone for failing packs\n`3.0+` | not useful until everything else is controlled\n\n* * *\n\n## Test matrix I would run\n\n### Phase A: find the CFG cliff\n\nUse one working reference set and one failing reference set. Keep seed, prompt, output size, model, dtype, and workflow fixed.\n\nTest | true CFG\n---|---\nA | `1.0`\nB | `1.2`\nC | `1.4`\nD | `1.6`\nE | `1.8`\nF | `2.1`\nG | `2.4`\nH | `2.7`\n\nInterpretation:\n\nResult | Meaning\n---|---\nclean through `1.8`, breaks at `2.4-2.7` | classic CFG cliff / over-guidance\nbreaks at `1.2-1.4` | reference pack or geometry is unstable before CFG pressure\nclean native, broken with RES4LYF | sampler interaction\nbroken even at `1.0` | not true-CFG; likely reference/geometry issue\n\n### Phase B: isolate reference combinations\n\nRun the same seed/settings with:\n\nTest | References\n---|---\n1 | face only\n2 | front body only\n3 | back body only\n4 | face + front\n5 | face + back\n6 | front + back\n7 | face + front + back\n\nInterpretation:\n\nObservation | Likely cause\n---|---\nface + front clean, adding back breaks | back reference over-conditioning\nface + back breaks | back reference conflicts with identity\nfront + back breaks | body geometry / outfit-reference conflict\nall pairs clean, 3 refs break | token/reference packing or attention overload\nonly high-frequency sets break | reference-content sensitivity\n\n### Phase C: test high-frequency-content hypothesis\n\nCreate two versions of each reference:\n\n  * original\n  * mildly denoised/de-sharpened/background-simplified\n\n\n\nThen test:\n\nTest | VL input | VAE/ref input\n---|---|---\nA | original | original\nB | smoothed | original\nC | original | smoothed\nD | smoothed | smoothed\n\nInterpretation:\n\nResult | Meaning\n---|---\nB fixes it | Qwen2.5-VL semantic-token path likely involved\nC fixes it | VAE/reference-latent path likely involved\nD fixes it | both paths contribute\nnone fix it | CFG/sampler/reference-role issue is more likely\n\n* * *\n\n## If you can instrument the pipeline\n\nIf you can patch the Python pipeline or node implementation, log the true-CFG internals after computing `noise_pred`, `neg_noise_pred`, and `comb_pred`, before the scheduler step.\n\n\n    delta = noise_pred - neg_noise_pred\n    comb_pred = neg_noise_pred + true_cfg_scale * delta\n\n    cond_norm = torch.norm(noise_pred.float(), dim=-1)\n    neg_norm = torch.norm(neg_noise_pred.float(), dim=-1)\n    delta_norm = torch.norm(delta.float(), dim=-1)\n    comb_norm = torch.norm(comb_pred.float(), dim=-1)\n\n    scale_ratio = cond_norm / (comb_norm + 1e-8)\n\n    cos = torch.nn.functional.cosine_similarity(\n        noise_pred.float(),\n        neg_noise_pred.float(),\n        dim=-1,\n    )\n\n\nLog per step:\n\n\n    cond_norm p50 / p95 / p99\n    delta_norm p50 / p95 / p99\n    scale_ratio p95 / p99 / max\n    cosine p01 / p50\n\n\nCompare:\n\n\n    working refs vs failing refs\n    1 ref vs 3 refs\n    true_cfg 1.5 vs 2.7\n    original refs vs smoothed refs\n    back ref included vs omitted\n    back ref full vs semantic-only\n\n\nStrong evidence for the CFG hypothesis would be:\n\n\n    failing 3-ref packs show larger delta_norm\n    failing 3-ref packs show lower cond/negative cosine similarity\n    failing 3-ref packs show scale_ratio spikes\n    smoothing VL or VAE refs reduces those spikes\n    removing or weakening the back ref reduces those spikes\n\n\nThis is how to move from “plausible explanation” to actual evidence.\n\n* * *\n\n## Related public cases / resources\n\n### Core Qwen / Diffusers / Comfy references\n\n  * Qwen/Qwen-Image-Edit model card\n  * Qwen/Qwen-Image-Edit-2511 model card\n  * Diffusers QwenImage docs\n  * Diffusers Qwen Image Edit source\n  * Qwen Image Edit Plus pipeline copy with true-CFG rescale\n  * ComfyUI Qwen-Image-Edit-2511 guide\n  * Qwen2-VL image processor source\n  * Transformers Qwen2-VL docs\n\n\n\n### Lightning / distilled path\n\n  * lightx2v/Qwen-Image-Edit-2511-Lightning\n\n\n\n### Geometry / zoom / latent mismatch cases\n\n  * ComfyUI issue #9481: 1MP fixed resizing in TextEncodeQwenImageEdit\n  * Qwen Image Edit Plus width/height VAE fix in Diffusers PR #12453\n  * Qwen-Image-Edit unzooming workflow / explicit ReferenceLatent approach\n  * Qwen Image Edit zoom / latent-size discussion\n  * Qwen Image Edit latent-aware scaling discussion\n  * Qwen Image Edit zooming note / padding observation\n\n\n\n### Multi-reference / binding context\n\n  * MultiBind: A Benchmark for Attribute Misbinding in Multi-Subject Generation\n  * UniRef-Image-Edit: Towards Scalable and Consistent Multi-Reference Image Editing\n\n\n\n### CFG / guidance context\n\n  * Classifier-Free Diffusion Guidance\n  * Common Diffusion Noise Schedules and Sample Steps are Flawed\n  * Eliminating Oversaturation and Artifacts of High Guidance Scales in Diffusion Models / APG\n  * Energy-Preserving Classifier-Free Guidance\n\n\n\n* * *\n\n## What I would avoid for now\n\nAvoid this combination while diagnosing:\n\n\n    true_cfg = 2.7\n    heavy negative prompt\n    all 3 refs as full VAE/reference latents\n    back ref treated as identity\n    RES4LYF during diagnosis\n    uncontrolled hidden 1MP resize\n    unprocessed high-frequency synthetic refs\n    generic “ultra detailed” prompt terms\n\n\nThat combination is almost exactly the unstable corner.\n\n* * *\n\n## Final recommendation\n\nMy best practical workflow would be:\n\n\n    Use Lightning for composition.\n    Use BF16 full model only for low-CFG refinement.\n\n\nMore specifically:\n\n\n    1. Generate stable composition:\n       Qwen-Image-Edit-2511 Lightning\n       3 refs\n       true_cfg = 1\n       4 steps\n       1024x1536\n\n    2. Refine:\n       Qwen-Image-Edit-2511 BF16\n       source = Lightning output\n       refs = face only or face + front body\n       true_cfg = 1.4-1.6\n       negative_prompt = \" \"\n       25-40 steps\n       native sampler first\n\n\nIf a one-pass full-BF16 run is required, use:\n\n\n    true_cfg = 1.4-1.6\n    negative_prompt = \" \"\n    face = identity source\n    front body = outfit/body source\n    back body = semantic-only first\n    target_vl_size = 392 if available\n    native sampler first\n    explicit/latent-aware reference geometry if possible\n\n\nThe exact “384² token-norm outlier” mechanism is plausible but unproven. The safer conclusion is:\n\n> This is content-dependent multi-reference conditioning instability, exposed and amplified by full true-CFG. Lightning avoids the fragile path. Full CFG can still work, but only after reducing CFG pressure, reference-role ambiguity, hidden geometry mismatch, and sampler confounds.",
  "title": "Multi-image edit (3 refs): artifacts at true CFG, fine on Lightning — reference-content dependent"
}