Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreidz525ekxypmjcjn37zjpffqzgmdezoinyyhayzvaw2ngbgpvzmpu",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mlhvms6f3yz2"
  },
  "path": "/t/wan2-2-i2v-clarifications-needed-regarding-settings-on-low-vram-system/175884#post_2",
  "publishedAt": "2026-05-10T02:33:20.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "ComfyUI official Wan2.2 workflow guide",
    "Wan2.2 official GitHub",
    "Wan2.2 I2V A14B model card",
    "ComfyUI-GGUF",
    "QuantStack Wan2.2 I2V A14B GGUF",
    "city96 UMT5 XXL encoder GGUF",
    "WanMoeKSampler",
    "Wan2.2-Lightning",
    "LightX2V Wan2.2 I2V working guide discussion",
    "ComfyUI-CacheDiT",
    "Kijai ComfyUI-WanVideoWrapper",
    "WAN2.2 Rapid All-in-One model card",
    "WAN2.2 Rapid All-in-One",
    "LightX2V working-guide discussion",
    "ComfyUI official Wan2.2 guide"
  ],
  "textContent": "4-step variation might not be suitable for your purpose:\n\n* * *\n\n# Wan2.2 I2V on 8GB VRAM: practical baseline for source-faithful animation\n\nFor your exact goal — **make the picture move, keep the same face, keep the same identity, keep the same lighting/background/clothing, and avoid AI embellishment** — I would not tune this like a normal high-VRAM Wan2.2 setup.\n\nYour current results are not strange:\n\n  * **CFG from 1 to ~3 doing almost nothing** is plausible in a 4-step Rapid/Lightning-style workflow.\n  * **CFG above ~3 turning the output into overcooked chaos** is also plausible.\n  * **Denoise around 0.6 helping sharpness/color/source fidelity** is not ridiculous.\n  * **Different source images needing different settings** usually means the workflow has too many interacting variables: GGUF quantization, Rapid/distilled weights, sampler, scheduler, shift, text encoder quality, VAE, offloading, source-image difficulty, and the Wan2.2 High/Low-noise expert split.\n\n\n\nThe core point:\n\n> Do not treat CFG as the main “obedience knob” in your setup.\n>  For 8GB VRAM + GGUF + 4-step Rapid/Lightning-style I2V, CFG is a small final adjustment, not the steering wheel.\n\nThe knobs I would tune first are:\n\n  1. **source image quality / crop**\n  2. **denoise**\n  3. **motion size**\n  4. **shift**\n  5. **Low-noise step count / Low-noise quantization**\n  6. **sampler branch**\n  7. **text encoder quantization**\n  8. **CFG last**\n\n\n\nUseful references:\n\n  * ComfyUI official Wan2.2 workflow guide\n  * Wan2.2 official GitHub\n  * Wan2.2 I2V A14B model card\n  * ComfyUI-GGUF\n  * QuantStack Wan2.2 I2V A14B GGUF\n  * city96 UMT5 XXL encoder GGUF\n  * WanMoeKSampler\n  * Wan2.2-Lightning\n  * LightX2V Wan2.2 I2V working guide discussion\n  * ComfyUI-CacheDiT\n  * Kijai ComfyUI-WanVideoWrapper\n\n\n\n* * *\n\n## 1. 
Why your current setup is hard to tune\n\nYou are not simply running “Wan2.2.” You are running a stacked compromise:\n\n\n    Wan2.2-style I2V\n    + Rapid/AIO or distilled behavior\n    + GGUF quantization\n    + Q4-class compression\n    + 4-step sampling\n    + SageAttention\n    + BlockSwap/offload\n    + 8GB laptop VRAM\n    + denoise below 1.0\n    + SD3 shift\n    + image conditioning\n\n\nThat matters because one setting can appear useless when another part of the stack is dominating.\n\nFor example, CFG may appear to do nothing because:\n\n  * the model was distilled/merged for **CFG 1**\n  * 4 steps are too few for CFG to gradually steer the output\n  * image conditioning dominates the text\n  * the negative prompt is weak or mostly inactive at CFG 1\n  * quantization reduces sensitivity to small guidance changes\n  * the sampler/scheduler/shift combination matters more than CFG\n  * the High/Low-noise split is doing more than the text guidance\n\n\n\nSome Rapid/AIO model cards explicitly say their models are intended for **CFG 1 and 4 steps**. See the WAN2.2 Rapid All-in-One model card. Wan2.2-Lightning similarly describes a 4-step distilled path, so it should not be tuned like a normal 20–30 step diffusion workflow. See Wan2.2-Lightning.\n\nSo your observation — “CFG 1 to 3 did nothing, then above 3 broke everything” — is consistent with this kind of workflow.\n\n* * *\n\n## 2. The most important Wan2.2 idea: High-noise vs Low-noise experts\n\nWan2.2 A14B uses a Mixture-of-Experts style denoising structure. The official Wan2.2 repo describes MoE as separating the denoising process across timesteps with specialized expert models. 
See Wan2.2 official GitHub.\n\nIn practical I2V terms:\n\nPart | Mostly affects | If weak/wrong, you may see\n---|---|---\n**High-noise expert** | broad motion, layout, pose, composition, camera direction | scene drift, pose weirdness, motion chaos, composition changes\n**Low-noise expert** | face detail, eyes, mouth, skin, clothing texture, color, final sharpness | face melting, blur, color shift, unstable eyes/mouth, loss of likeness\n\nFor your goal, **Low-noise behavior is extremely important**.\n\nIf the face changes, the first fix is usually not “raise CFG.” More likely fixes are:\n\n  * lower denoise\n  * reduce the requested motion\n  * add more Low-noise steps\n  * use a better Low-noise quant if possible\n  * check the VAE\n  * crop/use a clearer source face\n  * avoid cinematic/camera-heavy prompts\n  * avoid LoRAs until the baseline is stable\n\n\n\nWanMoeKSampler is relevant if you are using separate High/Low Wan2.2 A14B models. Its README says it is designed for Wan2.2 A14B-style MoE workflows and avoids manually guessing the High-to-Low switch point. See WanMoeKSampler.\n\n* * *\n\n## 3. 
Best starting point for your actual goal\n\nYour goal is not “maximum cinematic transformation.” Your goal is:\n\n\n    same person\n    same face\n    same identity\n    same clothing\n    same lighting\n    same background\n    small natural movement\n    static camera\n    no embellishment\n\n\nSo I would start conservative.\n\n### Recommended baseline for your current Rapid/AIO-style setup\n\n\n    Sampler: sa_solver / beta, if that is your current most reliable branch\n    Steps: 4\n    CFG: 1.0\n    Denoise: 0.55–0.60\n    SD3 shift: 8 as current control, then test 5 and 6\n    Resolution: 512–640px long side while testing\n    Frames: 33–49 while testing\n    FPS: 12–16\n    Motion: subtle\n    Camera: static\n    LoRAs: none during baseline\n    Upscaling/interpolation: none during baseline\n    Face restore: none during baseline\n\n\nThis is not meant to be the final “best possible” setup. It is the control setup. You need a repeatable control before changing settings.\n\n* * *\n\n## 4. Do not micro-tweak CFG\n\nOn your hardware, micro-tweaking CFG by 0.1 is a bad use of time.\n\nInstead of:\n\n\n    1.0\n    1.1\n    1.2\n    1.3\n    1.4\n    ...\n\n\nUse coarse tests:\n\n\n    CFG 1.0\n    CFG 1.5\n    CFG 2.0\n    CFG 2.5\n    CFG 3.0 only as a limit test\n\n\nFor your setup, I would treat CFG like this:\n\nCFG | Practical meaning\n---|---\n**1.0** | safest Rapid/Lightning-style baseline\n**1.5** | mild text pressure\n**2.0** | moderate text pressure\n**2.5** | upper useful range to test\n**3.0** | stress-test boundary\n**> 3.0** | likely to overcook identity, color, texture, or motion\n\nIf CFG 1.5–2.5 gives no meaningful obedience improvement, stop chasing CFG. The bottleneck is probably elsewhere.\n\n* * *\n\n## 5. 
Denoise is probably more important than CFG for you\n\nFor source-faithful I2V, denoise is one of the strongest identity controls.\n\nDenoise | Expected behavior\n---|---\n**0.40–0.50** | most faithful, least motion, may look stiff\n**0.50–0.60** | best starting zone for “make the image move”\n**0.60–0.70** | more motion, more identity risk\n**0.70+** | more transformation, more AI invention\n\nSince you already found **0.6** useful, I would not abandon it. I would test:\n\n\n    Denoise 0.50\n    Denoise 0.55\n    Denoise 0.60\n    Denoise 0.65\n\n\nPick the best identity/motion balance.\n\nIf the face changes:\n\n\n    lower denoise first\n    reduce motion second\n    add Low-noise steps third\n    only then try CFG changes\n\n\nIf there is no movement:\n\n\n    raise denoise slightly\n    make the action simpler and more literal\n    avoid cinematic wording\n\n\n* * *\n\n## 6. Shift: test coarse values only\n\nDo not test tiny shift increments. Test meaningful jumps.\n\nFor your current setup:\n\n\n    Shift 5\n    Shift 6\n    Shift 8\n\n\nThe LightX2V Wan2.2 I2V working-guide discussion recommends:\n\n\n    Euler sampler\n    Simple scheduler\n    Shift 5\n    2 High steps\n    2 Low steps\n\n\nSource: LightX2V Wan2.2 I2V working guide discussion\n\nThat does **not** automatically mean shift 5 is best for your current Rapid/AIO branch, but it is a strong branch to test.\n\n* * *\n\n## 7. Sampler advice\n\n### For your current Rapid/AIO branch\n\nIf `sa_solver / beta / 4 steps / CFG 1 / denoise 0.6 / shift 8` is the only thing giving you usable results, keep it as the control.\n\nDo not throw it away just because it sounds weird.\n\nRapid/distilled/merged models can have very specific intended recipes. The model card for the Rapid AIO family says the models are intended for **CFG 1 and 4 steps**, and different versions list different sampler recommendations. 
See WAN2.2 Rapid All-in-One.\n\n### For a Lightning-style branch\n\nTest this separately:\n\n\n    Sampler: Euler\n    Scheduler: Simple\n    Steps: 4\n    CFG: 1.0\n    Shift: 5\n    Denoise: 0.55–0.60\n\n\nThat lines up with public LightX2V/Wan2.2-Lightning guidance. See Wan2.2-Lightning and the LightX2V working-guide discussion.\n\nCompare this branch against your current `sa_solver / beta` control. Do not mix the two while testing.\n\n* * *\n\n## 8. Low-noise steps may help face consistency more than CFG\n\nIf your workflow exposes the High/Low split, test this before pushing CFG:\n\nTest | High steps | Low steps | Purpose\n---|---|---|---\nA | 2 | 2 | fastest 4-step baseline\nB | 2 | 4 | more face/detail finishing\nC | 4 | 4 | balanced reference\nD | 4 | 6 | stronger finishing if time allows\nE | 6 | 4 | more broad structure/motion\n\nFor your goal, I would test:\n\n\n    2 High / 2 Low\n    2 High / 4 Low\n    4 High / 4 Low\n\n\nIf **2/2 is blurry** but **2/4 improves face/detail**, that tells you the Low-noise stage was underpowered.\n\n* * *\n\n## 9. Quantization: Q4_K_M is not automatically best on 8GB\n\nOn paper, a larger, higher-bit quant gives better quality. 
In practice, on an 8GB laptop GPU, a heavier quant can cause more offload pressure, swapping, instability, or unusable render times.\n\nThe QuantStack Wan2.2 I2V A14B GGUF repo lists approximate model sizes such as:\n\n\n    Q3_K_S: 6.52 GB\n    Q3_K_M: 7.18 GB\n    Q4_K_S: 8.75 GB\n    Q4_K_M: 9.65 GB\n    Q5_K_S: 10.1 GB\n    Q5_K_M: 10.8 GB\n    Q6_K: 12 GB\n    Q8_0: 15.4 GB\n\n\nSource: QuantStack Wan2.2 I2V A14B GGUF\n\nFor an 8GB 4060 laptop, I would test:\n\nTest | High-noise | Low-noise | Why\n---|---|---|---\nA | Q3_K_M | Q3_K_M | safest low-VRAM baseline\nB | Q4_K_S | Q4_K_S | better quality if stable\nC | Q3_K_M | Q4_K_S | prioritize face/detail\nD | Q4_K_S | Q3_K_M | prioritize structure/motion\nE | Q4_K_M | Q4_K_M | only if the above are stable\n\nFor your priority, I would try:\n\n\n    High-noise: Q3_K_M\n    Low-noise: Q4_K_S\n\n\nbefore assuming:\n\n\n    High-noise: Q4_K_M\n    Low-noise: Q4_K_M\n\n\nWhy: Low-noise has more influence on final face detail, skin, eyes, mouth, color, and sharpness. If you can only “spend” quality somewhere, spend it on Low-noise first.\n\n* * *\n\n## 10. Text encoder quantization matters for prompt obedience\n\nIf prompt obedience feels weak, do not only blame CFG. The text encoder can matter too.\n\nThe city96 UMT5 XXL encoder GGUF card recommends **Q5_K_M or larger for best results**, while noting that smaller models may still be acceptable in resource-constrained situations. It lists Q3_K_M around 3.06 GB, Q4_K_M around 3.66 GB, and Q5_K_M around 4.15 GB. See city96 UMT5 XXL encoder GGUF.\n\nFor your system:\n\n\n    UMT5 Q3_K_M: safest\n    UMT5 Q4_K_M: reasonable baseline\n    UMT5 Q5_K_M: better prompt understanding if RAM/offload behavior is tolerable\n\n\nIf CFG does not improve obedience, a better text encoder may help more than CFG micro-tweaks.\n\n* * *\n\n## 11. 
VAE check: important for color and softness\n\nIf Wan2.2 looks redder, softer, or less vivid than expected, check the VAE.\n\nThe official ComfyUI Wan2.2 guide distinguishes the model components for different workflows. The 14B I2V workflow uses separate High/Low I2V models and a Wan VAE component; the 5B TI2V workflow uses its own 5B model/VAE setup. See ComfyUI official Wan2.2 guide.\n\nA VAE mismatch can show up as:\n\n\n    red/yellow color cast\n    soft decode\n    loss of vividness\n    skin tone shift\n    general haze\n    reconstruction blur\n\n\nIf color is your issue, test VAE/workflow correctness before trying to fix it with prompt words like “neutral color” or “no red tint.”\n\n* * *\n\n## 12. Source image quality matters more than people admit\n\nFor face consistency, the source image should have:\n\n\n    clear face\n    visible eyes\n    visible mouth\n    not too small in frame\n    not heavily compressed\n    not extreme side profile\n    not harsh shadow over one eye\n    not heavy motion blur\n    not strong fisheye distortion\n    not sunglasses covering identity\n    not hands blocking the face\n\n\nA simple rule:\n\n> If the source face is small or unclear, the model has to invent face detail during motion.\n>  When it invents face detail, identity changes.\n\nFor baseline testing, use a clean portrait or half-body image. You can do fancy shots later.\n\n* * *\n\n## 13. Prompt style for source-faithful animation\n\nUse a boring prompt. Do not make it cinematic. Do not add style words. Do not describe a new scene.\n\n### Positive prompt baseline\n\n\n    A realistic image-to-video animation of the person in the source image. Preserve the exact same face, identity, hairstyle, clothing, colors, lighting, and background. The person makes only very subtle natural movement: slight breathing, a small blink, and minimal head movement. Static camera. No zoom. No scene change. Natural colors. 
Sharp facial details.\n\n\n### Negative prompt baseline\n\n\n    different person, face change, identity change, distorted face, warped eyes, asymmetrical eyes, deformed mouth, changing hairstyle, changing clothes, changing background, camera movement, zoom, scene change, fantasy, sci-fi, anime, painting, overexposed, oversaturated, red tint, blurry, low detail, melted face, extra teeth\n\n\nImportant: at **CFG 1**, the negative prompt may do very little. Judge negative prompting mostly at CFG 1.5–2.5.\n\n* * *\n\n## 14. Prompt obedience testing\n\nDo not test obedience with complex motion first.\n\nBad obedience tests:\n\n\n    turns around\n    walks forward\n    raises both hands\n    laughs widely\n    talks\n    dances\n    camera orbits around the subject\n    wind blows hair dramatically\n\n\nGood obedience tests:\n\n\n    one subtle blink\n    gentle breathing only\n    slight smile\n    very small head tilt\n    tiny eye movement\n\n\nA model that cannot obey “one subtle blink” is not ready for “turns head, smiles, and raises hand.”\n\n### Better prompt wording\n\nInstead of:\n\n\n    The woman turns her head and smiles at the camera while wind blows through her hair.\n\n\nUse:\n\n\n    The person makes a very small natural smile while keeping the same face, same pose, same hairstyle, same clothing, same lighting, and same background. Static camera.\n\n\nThe second prompt gives the model less room to invent.\n\n* * *\n\n## 15. 
What to do when the model does not obey\n\nFirst classify the failure.\n\nFailure | Likely cause | First fix\n---|---|---\nprompt action ignored | too few steps, weak text encoder, action too subtle, distilled limitation | slightly raise denoise or simplify action\nface changes | denoise too high, Low-noise weak, source face unclear, motion too large | lower denoise / add Low steps\nred tint | VAE/model/sampler/shift issue | check VAE, test shift/sampler\nblurry face | Low-noise too weak, too few steps, low quant, low resolution | add Low steps / better Low quant\nbackground changes | denoise too high, prompt invites scene change | lower denoise / static camera prompt\ntoo much motion | denoise/CFG/shift too high, Rapid merge exaggeration | lower denoise or reduce action\nno motion | denoise too low, prompt too static | denoise +0.05\n\nThe order I would use:\n\n\n    1. Keep CFG at 1.0.\n    2. Make the action simpler and more literal.\n    3. Tune denoise: 0.50 / 0.55 / 0.60 / 0.65.\n    4. Test shift: 5 / 6 / 8.\n    5. Add Low-noise steps if available.\n    6. Improve Low-noise quantization if possible.\n    7. Test CFG 1.5 / 2.0 / 2.5.\n    8. Stop before CFG 3 if identity starts changing.\n\n\n* * *\n\n## 16. Recommended experiment matrix\n\nDo not run huge matrices at full resolution. 
Use short clips first.\n\nKeep these fixed:\n\n\n    same image\n    same seed\n    same prompt\n    same resolution\n    same frame count\n    same workflow branch\n\n\n### Matrix A — denoise\n\n\n    CFG: 1.0\n    Steps: 4\n    Shift: current value\n    Sampler: current best\n\n\nTest:\n\n\n    0.50\n    0.55\n    0.60\n    0.65\n\n\nPick the best identity/motion balance.\n\n### Matrix B — shift\n\nUse the best denoise from Matrix A.\n\n\n    Shift 5\n    Shift 6\n    Shift 8\n\n\nPick the best.\n\n### Matrix C — CFG\n\nUse best denoise + best shift.\n\n\n    CFG 1.0\n    CFG 1.5\n    CFG 2.0\n    CFG 2.5\n    CFG 3.0 only as a limit test\n\n\nPick the highest CFG that does not alter identity.\n\n### Matrix D — High/Low steps\n\nIf available:\n\n\n    2 High / 2 Low\n    2 High / 4 Low\n    4 High / 4 Low\n\n\nIf face detail improves with more Low steps, you found a better lever than CFG.\n\n### Matrix E — quantization\n\nIf using separate GGUF High/Low models:\n\n\n    Q3_K_M High / Q3_K_M Low\n    Q3_K_M High / Q4_K_S Low\n    Q4_K_S High / Q4_K_S Low\n\n\nAvoid assuming Q4_K_M is worth the offload cost on 8GB.\n\n* * *\n\n## 17. Additional nodes: what I would and would not add\n\n### Worth testing later: WanMoeKSampler\n\nUse it if you are working with separate Wan2.2 A14B High/Low models.\n\nGood for:\n\n\n    clean A14B High/Low workflows\n    reducing manual High/Low split guessing\n    debugging MoE transition behavior\n\n\nNot a fix for:\n\n\n    bad source image\n    bad VAE\n    too much denoise\n    bad prompt\n    4-step model limitations\n\n\nSource: WanMoeKSampler\n\n### Required for GGUF: ComfyUI-GGUF\n\nUse the proper GGUF loader rather than treating GGUF like a normal checkpoint. The ComfyUI-GGUF README says to replace the stock “Load Diffusion Model” with the “Unet Loader (GGUF)” node. See ComfyUI-GGUF.\n\n### Probably skip at 4 steps: CacheDiT\n\nCacheDiT is more useful when you have enough steps to amortize the cache/warmup overhead. 
For Wan2.2 14B, its README says to use the dedicated **Wan Cache Optimizer** for best results with the MoE High/Low structure. See ComfyUI-CacheDiT.\n\nMy practical rule:\n\n\n    4 steps: skip CacheDiT\n    6–8 steps: probably skip unless testing\n    12–20 steps: consider CacheDiT\n\n\n### Useful but separate branch: Kijai WanVideoWrapper\n\nKijai’s wrapper is useful and often gets Wan-specific optimizations quickly. The official Wan2.2 repo lists it as an alternative implementation. See Wan2.2 official GitHub and Kijai ComfyUI-WanVideoWrapper.\n\nBut treat it as a separate branch. Do not change wrapper + sampler + LoRAs + resolution all at once.\n\n* * *\n\n## 18. Things I would avoid during baseline testing\n\nAvoid:\n\n\n    720p\n    81+ frames\n    large camera movement\n    large head turns\n    talking/lip motion\n    multiple LoRAs\n    face restore nodes\n    interpolation while judging motion\n    upscaling while judging source fidelity\n    CFG above 3\n    high denoise\n    changing sampler + CFG + denoise together\n    testing tiny CFG increments\n\n\nEspecially avoid this kind of starting point:\n\n\n    8GB VRAM\n    A14B Q4_K_M\n    720p\n    81 frames\n    4-step Rapid/AIO\n    SageAttention\n    BlockSwap\n    multiple LoRAs\n    CFG above 3\n\n\nThat can produce occasional good clips, but it is a terrible learning baseline because too many variables are interacting.\n\n* * *\n\n## 19. Practical final recommendation\n\nFor your current setup, I would start here:\n\n\n    Sampler: sa_solver / beta if this is your current reliable branch\n    Steps: 4\n    CFG: 1.0\n    Denoise: 0.55–0.60\n    Shift: 8 as current control; test 5 and 6 separately\n    Resolution: 512–640px long side\n    Frames: 33–49\n    Prompt: subtle realistic motion only\n    Camera: static\n    LoRAs: none\n    Upscale/interpolation/face restore: off during testing\n\n\nThen tune in this order:\n\n\n    1. Source image/crop\n    2. Denoise\n    3. Shift\n    4. 
Low-noise steps\n    5. Low-noise quantization\n    6. Text encoder quantization\n    7. CFG 1.5–2.5\n    8. Sampler branch\n    9. Extra nodes\n\n\nMy strongest advice:\n\n> Stop trying to solve source fidelity with CFG.\n>  For your setup, source fidelity is mostly denoise + Low-noise behavior + source image quality + correct workflow components. CFG is only a small final pressure adjustment.\n\n* * *\n\n## 20. Short summary\n\n  * Your current results are plausible for 8GB VRAM + GGUF + 4-step Rapid/Lightning-style I2V.\n  * CFG is not the main control. Do not micro-tune it.\n  * Use **CFG 1.0** as baseline; test **1.5 / 2.0 / 2.5** only after denoise and shift.\n  * Use **denoise 0.50–0.60** for source-faithful animation.\n  * If faces drift, lower denoise or improve Low-noise behavior before raising CFG.\n  * If available, test **2 High / 4 Low** and **4 High / 4 Low**.\n  * For 8GB VRAM, test **Q3_K_M** and **Q4_K_S** before assuming Q4_K_M is best.\n  * If you can only improve one expert, improve **Low-noise** first for face/detail.\n  * Use short 33–49 frame clips at 512–640px while testing.\n  * Avoid 720p, long clips, multiple LoRAs, and post-processing until the baseline is stable.\n\n",
  "title": "Wan2.2 i2v (clarifications needed regarding settings on low vram system)"
}