Wan2.2 i2v (clarifications needed regarding settings on low vram system)
A 4-step variant might not be suitable for your purpose:
Wan2.2 I2V on 8GB VRAM: practical baseline for source-faithful animation
For your exact goal — make the picture move, keep the same face, keep the same identity, keep the same lighting/background/clothing, and avoid AI embellishment — I would not tune this like a normal high-VRAM Wan2.2 setup.
Your current results are not strange:
- CFG from 1 to ~3 doing almost nothing is plausible in a 4-step Rapid/Lightning-style workflow.
- CFG above ~3 turning the output into overcooked chaos is also plausible.
- Denoise around 0.6 helping sharpness/color/source fidelity is not ridiculous.
- Different source images needing different settings usually means the workflow has too many interacting variables: GGUF quantization, Rapid/distilled weights, sampler, scheduler, shift, text encoder quality, VAE, offloading, source-image difficulty, and the Wan2.2 High/Low-noise expert split.
The core point:
Do not treat CFG as the main “obedience knob” in your setup. For 8GB VRAM + GGUF + 4-step Rapid/Lightning-style I2V, CFG is a small final adjustment, not the steering wheel.
The knobs I would tune first are:
- source image quality / crop
- denoise
- motion size
- shift
- Low-noise step count / Low-noise quantization
- sampler branch
- text encoder quantization
- CFG last
Useful references:
- ComfyUI official Wan2.2 workflow guide
- Wan2.2 official GitHub
- Wan2.2 I2V A14B model card
- ComfyUI-GGUF
- QuantStack Wan2.2 I2V A14B GGUF
- city96 UMT5 XXL encoder GGUF
- WanMoeKSampler
- Wan2.2-Lightning
- LightX2V Wan2.2 I2V working guide discussion
- ComfyUI-CacheDiT
- Kijai ComfyUI-WanVideoWrapper
1. Why your current setup is hard to tune
You are not simply running “Wan2.2.” You are running a stacked compromise:
Wan2.2-style I2V
+ Rapid/AIO or distilled behavior
+ GGUF quantization
+ Q4-class compression
+ 4-step sampling
+ SageAttention
+ BlockSwap/offload
+ 8GB laptop VRAM
+ denoise below 1.0
+ SD3 shift
+ image conditioning
That matters because one setting can appear useless when another part of the stack is dominating.
For example, CFG may appear to do nothing because:
- the model was distilled/merged for CFG 1
- 4 steps are too few for CFG to gradually steer the output
- image conditioning dominates the text
- the negative prompt is weak or mostly inactive at CFG 1
- quantization reduces sensitivity to small guidance changes
- the sampler/scheduler/shift combination matters more than CFG
- the High/Low-noise split is doing more than the text guidance
Some Rapid/AIO model cards explicitly say their models are intended for CFG 1 and 4 steps. See the WAN2.2 Rapid All-in-One model card. Wan2.2-Lightning similarly describes a 4-step distilled path, so it should not be tuned like a normal 20–30 step diffusion workflow. See Wan2.2-Lightning.
So your observation — “CFG 1 to 3 did nothing, then above 3 broke everything” — is consistent with this kind of workflow.
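To see why CFG 1 leaves the negative prompt mostly inactive, here is a toy sketch of the textbook classifier-free guidance mix. The numbers are made up and `cfg_combine` is a name I picked for illustration; this is not the code of any ComfyUI node, and distilled models can add their own quirks on top.

```python
def cfg_combine(cond, uncond, cfg):
    """Textbook classifier-free guidance: uncond + cfg * (cond - uncond)."""
    return [u + cfg * (c - u) for c, u in zip(cond, uncond)]


# Toy values standing in for model predictions (hypothetical, not real outputs).
cond = [0.5, -1.0, 2.0]        # prediction with the positive prompt
uncond_a = [0.25, 0.0, 1.0]    # prediction with one negative prompt
uncond_b = [-3.0, 4.0, 0.5]    # prediction with a very different negative prompt

at_cfg_1_a = cfg_combine(cond, uncond_a, 1.0)
at_cfg_1_b = cfg_combine(cond, uncond_b, 1.0)
at_cfg_2_a = cfg_combine(cond, uncond_a, 2.0)
at_cfg_2_b = cfg_combine(cond, uncond_b, 2.0)

print("CFG 1:", at_cfg_1_a, at_cfg_1_b)
print("CFG 2:", at_cfg_2_a, at_cfg_2_b)
```

At CFG 1.0 the negative prediction cancels out, so two very different negative prompts give the same result. Only above 1.0 does the negative side start to pull on the output.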
2. The most important Wan2.2 idea: High-noise vs Low-noise experts
Wan2.2 A14B uses a Mixture-of-Experts style denoising structure. The official Wan2.2 repo describes MoE as separating the denoising process across timesteps with specialized expert models. See Wan2.2 official GitHub.
In practical I2V terms:
| Part | Mostly affects | If weak/wrong, you may see |
|---|---|---|
| High-noise expert | broad motion, layout, pose, composition, camera direction | scene drift, pose weirdness, motion chaos, composition changes |
| Low-noise expert | face detail, eyes, mouth, skin, clothing texture, color, final sharpness | face melting, blur, color shift, unstable eyes/mouth, loss of likeness |
For your goal, Low-noise behavior is extremely important.
If the face changes, the first fix is usually not “raise CFG.” More likely fixes are:
- lower denoise
- reduce the requested motion
- add more Low-noise steps
- use a better Low-noise quant if possible
- check the VAE
- crop/use a clearer source face
- avoid cinematic/camera-heavy prompts
- avoid LoRAs until the baseline is stable
WanMoeKSampler is relevant if you are using separate High/Low Wan2.2 A14B models. Its README says it is designed for Wan2.2 A14B-style MoE workflows and avoids manually guessing the High-to-Low switch point. See WanMoeKSampler.
3. Best starting point for your actual goal
Your goal is not “maximum cinematic transformation.” Your goal is:
same person
same face
same identity
same clothing
same lighting
same background
small natural movement
static camera
no embellishment
So I would start conservative.
Recommended baseline for your current Rapid/AIO-style setup
Sampler: sa_solver / beta, if that is your current most reliable branch
Steps: 4
CFG: 1.0
Denoise: 0.55–0.60
SD3 shift: 8 as current control, then test 5 and 6
Resolution: 512–640px long side while testing
Frames: 33–49 while testing
FPS: 12–16
Motion: subtle
Camera: static
LoRAs: none during baseline
Upscaling/interpolation: none during baseline
Face restore: none during baseline
This is not meant to be the final “best possible” setup. It is the control setup. You need a repeatable control before changing settings.
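If it helps to keep the control honest, you can write it down as data and derive each test from it. This is only a bookkeeping sketch: `ControlSettings`, its field names, and `changed_fields` are hypothetical, and the defaults are picked from the low end of the ranges above.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class ControlSettings:
    """Hypothetical record of the control run; field names are illustrative."""
    sampler: str = "sa_solver"
    scheduler: str = "beta"
    steps: int = 4
    cfg: float = 1.0
    denoise: float = 0.55
    shift: float = 8.0
    long_side_px: int = 512
    frames: int = 33
    fps: int = 16

    def clip_seconds(self) -> float:
        """Clip length implied by frame count and playback rate."""
        return self.frames / self.fps


def changed_fields(a, b):
    """Names of the fields that differ between two settings records."""
    return {f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)}


CONTROL = ControlSettings()

# Derive each test from the control and change exactly one field.
shift_test = replace(CONTROL, shift=5.0)

print(CONTROL.clip_seconds())
print(changed_fields(CONTROL, shift_test))
```

If `changed_fields` ever reports more than one name, the comparison is no longer a clean test against the control.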
4. Do not micro-tweak CFG
On your hardware, micro-tweaking CFG by 0.1 is a bad use of time.
Instead of:
1.0
1.1
1.2
1.3
1.4
...
Use coarse tests:
CFG 1.0
CFG 1.5
CFG 2.0
CFG 2.5
CFG 3.0 only as a limit test
For your setup, I would treat CFG like this:
| CFG | Practical meaning |
|---|---|
| 1.0 | safest Rapid/Lightning-style baseline |
| 1.5 | mild text pressure |
| 2.0 | moderate text pressure |
| 2.5 | upper useful range to test |
| 3.0 | stress-test boundary |
| > 3.0 | likely to overcook identity, color, texture, or motion |
If CFG 1.5–2.5 gives no meaningful obedience improvement, stop chasing CFG. The bottleneck is probably elsewhere.
5. Denoise is probably more important than CFG for you
For source-faithful I2V, denoise is one of the strongest identity controls.
| Denoise | Expected behavior |
|---|---|
| 0.40–0.50 | most faithful, least motion, may look stiff |
| 0.50–0.60 | best starting zone for “make the image move” |
| 0.60–0.70 | more motion, more identity risk |
| 0.70+ | more transformation, more AI invention |
Since you already found 0.6 useful, I would not abandon it. I would test:
Denoise 0.50
Denoise 0.55
Denoise 0.60
Denoise 0.65
Pick the best identity/motion balance.
If the face changes:
lower denoise first
reduce motion second
add Low-noise steps third
only then try CFG changes
If there is no movement:
raise denoise slightly
make the action simpler and more literal
avoid cinematic wording
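A rough way to picture why lower denoise protects the source: partial denoise is commonly implemented by building a longer schedule and keeping only its tail, so sampling starts below full noise. The sketch below assumes that approach and a plain linear schedule; both function names are hypothetical, and your actual sampler and scheduler will produce different numbers.

```python
def linear_sigmas(n_steps):
    """Toy noise levels from 1.0 down to 0.0 in n_steps equal moves."""
    return [1.0 - i / n_steps for i in range(n_steps + 1)]


def partial_denoise_sigmas(steps, denoise):
    """Keep only the tail of a longer schedule (assumed partial-denoise approach)."""
    if not 0.0 < denoise <= 1.0:
        raise ValueError("denoise must be in (0, 1]")
    if denoise == 1.0:
        return linear_sigmas(steps)
    total = int(steps / denoise)          # longer schedule, truncated to an int
    return linear_sigmas(total)[-(steps + 1):]


for d in (0.50, 0.55, 0.60, 0.65, 1.0):
    sigmas = partial_denoise_sigmas(4, d)
    print(d, "starts at noise level", round(sigmas[0], 3))
```

In this toy version the starting noise level moves in coarse jumps when there are only 4 steps, which is one more reason to test denoise in steps of 0.05 or larger rather than 0.01.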
6. Shift: test coarse values only
Do not test tiny shift increments. Test meaningful jumps.
For your current setup:
Shift 5
Shift 6
Shift 8
The LightX2V Wan2.2 I2V working-guide discussion recommends:
Euler sampler
Simple scheduler
Shift 5
2 High steps
2 Low steps
Source: LightX2V Wan2.2 I2V working guide discussion
That does not automatically mean shift 5 is best for your current Rapid/AIO branch, but it is a strong branch to test.
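For intuition about what shift changes, here is a small sketch using the time-shift formula commonly used with flow-matching models, which I am assuming is what the SD3 shift setting applies. The linear base schedule and the function names are illustrative only.

```python
def apply_shift(sigma, shift):
    """Time-shift formula: pushes noise levels toward 1.0 when shift > 1."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


def shifted_schedule(steps, shift):
    """Apply the shift to a toy linear schedule from 1.0 down to 0.0."""
    return [apply_shift(1.0 - i / steps, shift) for i in range(steps + 1)]


for shift in (5.0, 6.0, 8.0):
    print(shift, [round(s, 3) for s in shifted_schedule(4, shift)])
```

A higher shift keeps the intermediate noise levels higher, so more of the 4 steps are spent in the high-noise range. That is why jumps like 5, 6, and 8 are worth comparing while tiny increments are not.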
7. Sampler advice
For your current Rapid/AIO branch
If sa_solver / beta / 4 steps / CFG 1 / denoise 0.6 / shift 8 is the only thing giving you usable results, keep it as the control.
Do not throw it away just because it sounds weird.
Rapid/distilled/merged models can have very specific intended recipes. The model card for the Rapid AIO family says the models are intended for CFG 1 and 4 steps, and different versions list different sampler recommendations. See WAN2.2 Rapid All-in-One.
For a Lightning-style branch
Test this separately:
Sampler: Euler
Scheduler: Simple
Steps: 4
CFG: 1.0
Shift: 5
Denoise: 0.55–0.60
That lines up with public LightX2V/Wan2.2-Lightning guidance. See Wan2.2-Lightning and the LightX2V working-guide discussion.
Compare this branch against your current sa_solver / beta control. Do not mix the two while testing.
8. Low-noise steps may help face consistency more than CFG
If your workflow exposes the High/Low split, test this before pushing CFG:
| Test | High steps | Low steps | Purpose |
|---|---|---|---|
| A | 2 | 2 | fastest 4-step baseline |
| B | 2 | 4 | more face/detail finishing |
| C | 4 | 4 | balanced reference |
| D | 4 | 6 | stronger finishing if time allows |
| E | 6 | 4 | more broad structure/motion |
For your goal, I would test:
2 High / 2 Low
2 High / 4 Low
4 High / 4 Low
If 2/2 is blurry but 2/4 improves face/detail, that tells you the Low-noise stage was underpowered.
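If your workflow does the split with two advanced sampler nodes, the step ranges for each test can be worked out like this. The keys mirror the inputs of ComfyUI's KSamplerAdvanced node as I understand them; `two_stage_plan` itself is a hypothetical helper, so check the values against your own graph.

```python
def two_stage_plan(high_steps, low_steps):
    """Step ranges for a High-noise stage followed by a Low-noise stage."""
    total = high_steps + low_steps
    high = {
        "steps": total,
        "start_at_step": 0,
        "end_at_step": high_steps,
        "add_noise": "enable",                    # first stage adds the noise
        "return_with_leftover_noise": "enable",   # hand off a still-noisy latent
    }
    low = {
        "steps": total,
        "start_at_step": high_steps,
        "end_at_step": total,
        "add_noise": "disable",                   # continue from the handed-off latent
        "return_with_leftover_noise": "disable",  # finish denoising completely
    }
    return high, low


for high_n, low_n in [(2, 2), (2, 4), (4, 4)]:
    high, low = two_stage_plan(high_n, low_n)
    print(f"{high_n} High / {low_n} Low ->", high["end_at_step"], low["end_at_step"])
```

Note that both stages share the same total step count, so going from 2/2 to 2/4 changes the total from 4 to 6 and moves the switch point earlier in the schedule.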
9. Quantization: Q4_K_M is not automatically best on 8GB
On paper, a higher-bit quant preserves more quality. In practice, on an 8GB laptop GPU, a heavier quant can cause more offload pressure, swapping, instability, or unusable render times.
The QuantStack Wan2.2 I2V A14B GGUF repo lists approximate model sizes such as:
Q3_K_S: 6.52 GB
Q3_K_M: 7.18 GB
Q4_K_S: 8.75 GB
Q4_K_M: 9.65 GB
Q5_K_S: 10.1 GB
Q5_K_M: 10.8 GB
Q6_K: 12 GB
Q8_0: 15.4 GB
Source: QuantStack Wan2.2 I2V A14B GGUF
For an 8GB 4060 laptop, I would test:
| Test | High-noise | Low-noise | Why |
|---|---|---|---|
| A | Q3_K_M | Q3_K_M | safest low-VRAM baseline |
| B | Q4_K_S | Q4_K_S | better quality if stable |
| C | Q3_K_M | Q4_K_S | prioritize face/detail |
| D | Q4_K_S | Q3_K_M | prioritize structure/motion |
| E | Q4_K_M | Q4_K_M | only if the above are stable |
For your priority, I would try:
High-noise: Q3_K_M
Low-noise: Q4_K_S
before assuming:
High-noise: Q4_K_M
Low-noise: Q4_K_M
Why: Low-noise has more influence on final face detail, skin, eyes, mouth, color, and sharpness. If you can only “spend” quality somewhere, spend it on Low-noise first.
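A back-of-the-envelope way to compare quants is to see how much of each file would not fit in VRAM. The sizes are the ones listed above; the 1.5 GB reserve for activations and everything else is a number I made up for illustration, and file size is only a rough proxy for real memory use.

```python
# Approximate file sizes in GB, as listed above for the QuantStack repo.
QUANT_SIZE_GB = {
    "Q3_K_S": 6.52, "Q3_K_M": 7.18, "Q4_K_S": 8.75, "Q4_K_M": 9.65,
    "Q5_K_S": 10.1, "Q5_K_M": 10.8, "Q6_K": 12.0, "Q8_0": 15.4,
}


def offload_estimate_gb(quant, vram_gb=8.0, reserved_gb=1.5):
    """Rough GB of weights that would spill out of VRAM (hypothetical reserve)."""
    budget = vram_gb - reserved_gb
    return max(0.0, QUANT_SIZE_GB[quant] - budget)


for name in ("Q3_K_M", "Q4_K_S", "Q4_K_M"):
    print(name, "spills about", round(offload_estimate_gb(name), 2), "GB")
```

The exact numbers matter less than the trend: each step up in quant adds spill that has to be swapped, which is the cost you are weighing against the quality gain.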
10. Text encoder quantization matters for prompt obedience
If prompt obedience feels weak, do not only blame CFG. The text encoder can matter too.
The city96 UMT5 XXL encoder GGUF card recommends Q5_K_M or larger for best results, while noting that smaller models may still be acceptable in resource-constrained situations. It lists Q3_K_M around 3.06GB, Q4_K_M around 3.66GB, and Q5_K_M around 4.15GB. See city96 UMT5 XXL encoder GGUF.
For your system:
UMT5 Q3_K_M: safest
UMT5 Q4_K_M: reasonable baseline
UMT5 Q5_K_M: better prompt understanding if RAM/offload behavior is tolerable
If CFG does not improve obedience, a better text encoder may help more than CFG micro-tweaks.
11. VAE check: important for color and softness
If Wan2.2 looks redder, softer, or less vivid than expected, check the VAE.
The official ComfyUI Wan2.2 guide distinguishes the model components for different workflows. The 14B I2V workflow uses separate High/Low I2V models and a Wan VAE component; the 5B TI2V workflow uses its own 5B model/VAE setup. See ComfyUI official Wan2.2 guide.
A VAE mismatch can show up as:
red/yellow color cast
soft decode
loss of vividness
skin tone shift
general haze
reconstruction blur
If color is your issue, test VAE/workflow correctness before trying to fix it with prompt words like “neutral color” or “no red tint.”
12. Source image quality matters more than people admit
For face consistency, the source image should have:
clear face
visible eyes
visible mouth
not too small in frame
not heavily compressed
not extreme side profile
not harsh shadow over one eye
not heavy motion blur
not strong fisheye distortion
not sunglasses covering identity
not hands blocking the face
A simple rule:
If the source face is small or unclear, the model has to invent face detail during motion. When it invents face detail, identity changes.
For baseline testing, use a clean portrait or half-body image. You can do fancy shots later.
13. Prompt style for source-faithful animation
Use a boring prompt. Do not make it cinematic. Do not add style words. Do not describe a new scene.
Positive prompt baseline
A realistic image-to-video animation of the person in the source image. Preserve the exact same face, identity, hairstyle, clothing, colors, lighting, and background. The person makes only very subtle natural movement: slight breathing, a small blink, and minimal head movement. Static camera. No zoom. No scene change. Natural colors. Sharp facial details.
Negative prompt baseline
different person, face change, identity change, distorted face, warped eyes, asymmetrical eyes, deformed mouth, changing hairstyle, changing clothes, changing background, camera movement, zoom, scene change, fantasy, sci-fi, anime, painting, overexposed, oversaturated, red tint, blurry, low detail, melted face, extra teeth
Important: at CFG 1, the negative prompt may do very little. Judge negative prompting mostly at CFG 1.5–2.5.
14. Prompt obedience testing
Do not test obedience with complex motion first.
Bad obedience tests:
turns around
walks forward
raises both hands
laughs widely
talks
dances
camera orbits around the subject
wind blows hair dramatically
Good obedience tests:
one subtle blink
gentle breathing only
slight smile
very small head tilt
tiny eye movement
A model that cannot obey “one subtle blink” is not ready for “turns head, smiles, and raises hand.”
Better prompt wording
Instead of:
The woman turns her head and smiles at the camera while wind blows through her hair.
Use:
The person makes a very small natural smile while keeping the same face, same pose, same hairstyle, same clothing, same lighting, and same background. Static camera.
The second prompt gives the model less room to invent.
15. What to do when the model does not obey
First classify the failure.
| Failure | Likely cause | First fix |
|---|---|---|
| prompt action ignored | too few steps, weak text encoder, action too subtle, distilled limitation | slightly raise denoise or simplify action |
| face changes | denoise too high, Low-noise weak, source face unclear, motion too large | lower denoise / add Low steps |
| red tint | VAE/model/sampler/shift issue | check VAE, test shift/sampler |
| blurry face | Low-noise too weak, too few steps, low quant, low resolution | add Low steps / better Low quant |
| background changes | denoise too high, prompt invites scene change | lower denoise / static camera prompt |
| too much motion | denoise/CFG/shift too high, Rapid merge exaggeration | lower denoise or reduce action |
| no motion | denoise too low, prompt too static | raise denoise by 0.05 |
The order I would use:
1. Keep CFG at 1.0.
2. Make the action simpler and more literal.
3. Tune denoise: 0.50 / 0.55 / 0.60 / 0.65.
4. Test shift: 5 / 6 / 8.
5. Add Low-noise steps if available.
6. Improve Low-noise quantization if possible.
7. Test CFG 1.5 / 2.0 / 2.5.
8. Stop before CFG 3 if identity starts changing.
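The failure table can also be kept as a small lookup next to your test notes, so every bad clip gets classified before anything is changed. The entries are copied from the table above; the `first_fix` helper and its fallback message are my own additions.

```python
# Failure -> first fix, copied from the table above.
FIRST_FIX = {
    "prompt action ignored": "slightly raise denoise or simplify action",
    "face changes": "lower denoise / add Low steps",
    "red tint": "check VAE, test shift/sampler",
    "blurry face": "add Low steps / better Low quant",
    "background changes": "lower denoise / static camera prompt",
    "too much motion": "lower denoise or reduce action",
    "no motion": "raise denoise by 0.05",
}


def first_fix(failure):
    """Return the first thing to try for a classified failure."""
    key = failure.strip().lower()
    # Fallback text is illustrative, not part of the table.
    return FIRST_FIX.get(key, "classify the failure first, then change one setting")


print(first_fix("Face changes"))
print(first_fix("red tint"))
```

None of the first fixes is "raise CFG", which matches the order above: CFG only comes in at step 7.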
16. Recommended experiment matrix
Do not run huge matrices at full resolution. Use short clips first.
Keep these fixed:
same image
same seed
same prompt
same resolution
same frame count
same workflow branch
Matrix A — denoise
CFG: 1.0
Steps: 4
Shift: current value
Sampler: current best
Test:
0.50
0.55
0.60
0.65
Pick the best identity/motion balance.
Matrix B — shift
Use the best denoise from Matrix A.
Shift 5
Shift 6
Shift 8
Pick the best.
Matrix C — CFG
Use best denoise + best shift.
CFG 1.0
CFG 1.5
CFG 2.0
CFG 2.5
CFG 3.0 only as a limit test
Pick the highest CFG that does not alter identity.
Matrix D — High/Low steps
If available:
2 High / 2 Low
2 High / 4 Low
4 High / 4 Low
If face detail improves with more Low steps, you found a better lever than CFG.
Matrix E — quantization
If using separate GGUF High/Low models:
Q3_K_M High / Q3_K_M Low
Q3_K_M High / Q4_K_S Low
Q4_K_S High / Q4_K_S Low
Avoid assuming Q4_K_M is worth the offload cost on 8GB.
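The matrices above are a one-factor-at-a-time plan, and a short sketch shows how much rendering that saves over a full grid. The settings dict, the seed value, and the "pretend" winners are placeholders for illustration only.

```python
def one_factor_runs(control, key, values):
    """One settings dict per value, changing only `key` from the control."""
    return [{**control, key: value} for value in values]


# Placeholder control; the seed and starting values are illustrative.
control = {"cfg": 1.0, "steps": 4, "shift": 8, "denoise": 0.60, "seed": 12345}

denoise_values = [0.50, 0.55, 0.60, 0.65]
shift_values = [5, 6, 8]
cfg_values = [1.0, 1.5, 2.0, 2.5, 3.0]

matrix_a = one_factor_runs(control, "denoise", denoise_values)
best_a = {**control, "denoise": 0.55}      # pretend Matrix A picked 0.55
matrix_b = one_factor_runs(best_a, "shift", shift_values)
best_b = {**best_a, "shift": 5}            # pretend Matrix B picked 5
matrix_c = one_factor_runs(best_b, "cfg", cfg_values)

staged_runs = len(matrix_a) + len(matrix_b) + len(matrix_c)
full_grid_runs = len(denoise_values) * len(shift_values) * len(cfg_values)
print(staged_runs, "staged runs vs", full_grid_runs, "for the full grid")
```

The staged plan will miss some interactions between settings, but on an 8GB laptop that is usually an acceptable trade for a fifth of the render time.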
17. Additional nodes: what I would and would not add
Worth testing later: WanMoeKSampler
Use it if you are working with separate Wan2.2 A14B High/Low models.
Good for:
clean A14B High/Low workflows
reducing manual High/Low split guessing
debugging MoE transition behavior
Not a fix for:
bad source image
bad VAE
too much denoise
bad prompt
4-step model limitations
Source: WanMoeKSampler
Required for GGUF: ComfyUI-GGUF
Use the proper GGUF loader rather than treating GGUF like a normal checkpoint. The ComfyUI-GGUF README says to replace the stock “Load Diffusion Model” with the “Unet Loader (GGUF)” node. See ComfyUI-GGUF.
Probably skip at 4 steps: CacheDiT
CacheDiT is more useful when you have enough steps to amortize the cache/warmup overhead. For Wan2.2 14B, its README says to use the dedicated Wan Cache Optimizer for best results with the MoE High/Low structure. See ComfyUI-CacheDiT.
My practical rule:
4 steps: skip CacheDiT
6–8 steps: probably skip unless testing
12–20 steps: consider CacheDiT
Useful but separate branch: Kijai WanVideoWrapper
Kijai’s wrapper is useful and often gets Wan-specific optimizations quickly. The official Wan2.2 repo lists it as an alternative implementation. See Wan2.2 official GitHub and Kijai ComfyUI-WanVideoWrapper.
But treat it as a separate branch. Do not change wrapper + sampler + LoRAs + resolution all at once.
18. Things I would avoid during baseline testing
Avoid:
720p
81+ frames
large camera movement
large head turns
talking/lip motion
multiple LoRAs
face restore nodes
interpolation while judging motion
upscaling while judging source fidelity
CFG above 3
high denoise
changing sampler + CFG + denoise together
testing tiny CFG increments
Especially avoid this kind of starting point:
8GB VRAM
A14B Q4_K_M
720p
81 frames
4-step Rapid/AIO
SageAttention
BlockSwap
multiple LoRAs
CFG above 3
That can produce occasional good clips, but it is a terrible learning baseline because too many variables are interacting.
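To put a rough number on why 720p at 81 frames is such a heavy starting point, you can count how many tokens the model has to attend over. The strides below (4x temporal, 8x spatial, 2x2 patches) are my assumptions about how Wan compresses and patches video, and 640x384 is just an example low-resolution test size.

```python
def latent_tokens(width, height, frames,
                  spatial_stride=8, temporal_stride=4, patch=2):
    """Approximate token count under the assumed VAE strides and patch size."""
    latent_frames = (frames - 1) // temporal_stride + 1
    tokens_per_frame = (height // (spatial_stride * patch)) * (
        width // (spatial_stride * patch))
    return latent_frames * tokens_per_frame


heavy = latent_tokens(1280, 720, 81)   # 720p, 81 frames
light = latent_tokens(640, 384, 33)    # example test size, 33 frames

print(heavy, light, heavy / light)
```

Under these assumptions the heavy clip has several times as many tokens as the test clip, and attention cost grows faster than linearly with token count, so the gap in render time and memory pressure is even larger than the token ratio suggests.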
19. Practical final recommendation
For your current setup, I would start here:
Sampler: sa_solver / beta if this is your current reliable branch
Steps: 4
CFG: 1.0
Denoise: 0.55–0.60
Shift: 8 as current control; test 5 and 6 separately
Resolution: 512–640px long side
Frames: 33–49
Prompt: subtle realistic motion only
Camera: static
LoRAs: none
Upscale/interpolation/face restore: off during testing
Then tune in this order:
1. Source image/crop
2. Denoise
3. Shift
4. Low-noise steps
5. Low-noise quantization
6. Text encoder quantization
7. CFG 1.5–2.5
8. Sampler branch
9. Extra nodes
My strongest advice:
Stop trying to solve source fidelity with CFG. For your setup, source fidelity is mostly denoise + Low-noise behavior + source image quality + correct workflow components. CFG is only a small final pressure adjustment.
20. Short summary
- Your current results are plausible for 8GB VRAM + GGUF + 4-step Rapid/Lightning-style I2V.
- CFG is not the main control. Do not micro-tune it.
- Use CFG 1.0 as baseline; test 1.5 / 2.0 / 2.5 only after denoise and shift.
- Use denoise 0.50–0.60 for source-faithful animation.
- If faces drift, lower denoise or improve Low-noise behavior before raising CFG.
- If available, test 2 High / 4 Low and 4 High / 4 Low.
- For 8GB VRAM, test Q3_K_M and Q4_K_S before assuming Q4_K_M is best.
- If you can only improve one expert, improve Low-noise first for face/detail.
- Use short 33–49 frame clips at 512–640px while testing.
- Avoid 720p, long clips, multiple LoRAs, and post-processing until the baseline is stable.