1st movie clip!
I think the challenge is just too hard… It’s on hard mode from the very start, after all.
This is how I would think about your setup now
First: switching from the direct Desktop install to ComfyUI Portable and suddenly having nodes/Manager behave properly is a real clue, not a coincidence. It strongly suggests the earlier problems were environmental rather than “you not understanding ComfyUI.” That is common with custom-node ecosystems: the install is only truly healthy when the node location and Python environment line up properly.
The good news is that you are now past the hardest beginner wall.
Your current Wan 2.2 setup already does something valuable:
- it generates clips reliably
- you understand the main nodes
- you can use positive/negative conditioning
- you can apply one or more LoRAs
- you can do first-frame workflows
- you can do first-frame → last-frame workflows
That means the main questions are no longer:
- “How do I make anything at all?”
- “Why won’t the nodes load?”
Your real questions now are more advanced and more interesting:
- Why does a classic checkpoint node not seem to fit the Wan graph?
- Why does lower FPS make drift look worse, and what should you do about it?
- Why do ordinary inpainting tutorials not solve “take this bad frame and fix the face using another face image”?
Those three are connected.
The short answer
If I had to compress the whole answer into one paragraph, it would be this:
Keep your Wan 2.2 workflow as your main shot generator. Do not force a classic SD-style checkpoint loader into the native Wan graph. Treat FPS as a quality/time tradeoff, not as a magic identity fix. Use FLF for the sit-down transition. And for face repair, stop thinking “text-only inpaint” and start thinking “separate still-frame repair workflow using either plain masked face inpaint, ReActor face swap, or mask-local face repair/detailing with a reference-guided method.”
That is the cleanest mental model.
1) About the checkpoint node
Short version
In a native Wan 2.2 workflow, you normally do not insert a classic SD/SDXL-style checkpoint node.
Why
The official Wan 2.2 ComfyUI workflow is not structured like a classic Stable Diffusion workflow where one checkpoint node loads most of the system in one go.
Instead, the official Wan-native flow is built from separate components, typically:
- diffusion model loader
- CLIP loader
- VAE loader
- the Wan video node itself
- LoRA loader(s)
- conditioning nodes
See:
- Wan2.2 Video Generation ComfyUI Official Native Workflow Example
What that means for your graph
If your current graph already looks something like:
- Load Diffusion Model
- Load CLIP
- Load VAE
- one or more LoRA nodes
- positive / negative conditioning
- Wan image-to-video or first/last-frame node
- decode / save
then you are already using the correct native loading pattern.
So the reason you “can’t figure out how to include a checkpoint node” is probably not that you are missing something. It is more likely that there is no natural slot for a classic checkpoint node in the native Wan graph.
Where a checkpoint loader does make sense
A classic checkpoint loader can make sense in a separate still-image repair workflow.
For example, if you later build a dedicated face-repair graph using:
- a still-image inpaint model,
- a checkpoint-based image model,
- or an SDXL/Flux-style repair branch,
then that separate graph may use a checkpoint node.
But that would be its own repair workflow, not something you must squeeze into the Wan graph itself.
About your LoRA chain
Your current LoRA logic sounds fine.
Relevant docs:
- LoRA Loader
- LoraLoaderModelOnly
Important points from those docs:
- LoRAs are discovered from ComfyUI/models/loras
- multiple LoRA nodes can be chained directly
- LoraLoaderModelOnly is specifically for applying LoRAs to the model branch only, without needing a CLIP model input on that node
That is why LoRA chaining feels natural in your current setup, while a classic checkpoint node does not.
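To make the chaining idea concrete, here is a minimal sketch of a model-only LoRA chain written as a ComfyUI API-format prompt fragment in Python. The class names (UNETLoader for "Load Diffusion Model", LoraLoaderModelOnly), the input names, the node ids, and every file name are assumptions for illustration, so check them against a workflow exported from your own install.

```python
# Hypothetical fragment of a ComfyUI API-format prompt, held as a Python dict.
# Node ids, file names, and strengths are placeholders, not real files.
prompt = {
    "1": {"class_type": "UNETLoader",
          "inputs": {"unet_name": "wan2.2_example.safetensors",
                     "weight_dtype": "default"}},
    "2": {"class_type": "LoraLoaderModelOnly",
          "inputs": {"model": ["1", 0],  # link: output 0 of node "1"
                     "lora_name": "style_example.safetensors",
                     "strength_model": 0.8}},
    "3": {"class_type": "LoraLoaderModelOnly",
          "inputs": {"model": ["2", 0],  # link: output 0 of node "2"
                     "lora_name": "motion_example.safetensors",
                     "strength_model": 0.6}},
}


def model_chain(prompt, last_id):
    """Walk the 'model' links backwards from last_id; return class types, loader first."""
    chain = []
    node_id = last_id
    while node_id is not None:
        node = prompt[node_id]
        chain.append(node["class_type"])
        link = node["inputs"].get("model")
        node_id = link[0] if link else None
    return list(reversed(chain))


chain = model_chain(prompt, "3")
print(" -> ".join(chain))
```

Each LoRA node takes only a model input and hands a model on, so the chain never needs a CLIP input and never passes through a classic checkpoint loader.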
My practical recommendation
For your Wan graph:
- do not force a classic checkpoint loader into it
- keep the native Wan structure
- only use checkpoint-based loading in a separate repair graph if you later choose a checkpoint-based still-image repair method
2) About FPS, drift, and render time
You noticed:
- lower FPS = more visible drift
- higher FPS = drift feels less noticeable
- but higher FPS = much longer generation time
That observation is useful, and it makes sense.
Why higher FPS often looks better
Higher FPS does not necessarily mean the model suddenly understands identity better.
What it often means is:
- each frame is closer to the next in time
- motion is split into smaller steps
- the changes between frames feel less abrupt
- the drift becomes less obvious because the motion is smoother
So the model may still be drifting, but the drift is hidden better by finer temporal spacing.
Why this becomes expensive quickly
The cost scales with frame count.
The official ComfyUI docs for Wan/Fun Inp make this very explicit: video length is the total number of frames, and the example calculation is basically:
seconds × fps = frame count
So if you double FPS while keeping the duration the same, you roughly double the number of frames the system has to generate.
See:
- WanFunInpaintToVideo node docs
- Wan2.2 Video Generation ComfyUI Official Native Workflow Example
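The seconds × fps rule above is easy to sanity-check with a few lines of Python. This is only a sketch: the 5-second duration and the FPS values are illustrative, and treating render cost as proportional to frame count is a simplification rather than a measured result.

```python
def frame_count(seconds: float, fps: int) -> int:
    """Total frames for a clip: seconds x fps, rounded to a whole frame."""
    return round(seconds * fps)


def relative_frames(seconds: float, fps: int, baseline_fps: int) -> float:
    """How many times more frames this fps needs than the baseline fps."""
    return frame_count(seconds, fps) / frame_count(seconds, baseline_fps)


duration_s = 5      # illustrative clip length
baseline_fps = 12   # illustrative "moderate" native fps

for fps in (8, 12, 16, 24):
    frames = frame_count(duration_s, fps)
    ratio = relative_frames(duration_s, fps, baseline_fps)
    print(f"{fps:>2} fps -> {frames:>3} frames, {ratio:.2f}x the frames of {baseline_fps} fps")
```

Your node may restrict which frame counts are valid, so treat the result as a starting point and adjust it to whatever the length input accepts.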
The important production lesson
On 8 GB VRAM, I would not make native 24 FPS your default unless you truly need it.
That is because your real bottleneck is not whether a video gets produced at all. It is:
- quality per minute of render time
- how many iterations you can afford
- whether you can keep enough control over continuity
A better 8 GB strategy
Instead of brute-forcing everything at native 24 FPS, I would bias toward:
- shorter clips
- moderate native FPS
- frame interpolation later, when needed
The official ComfyUI frame interpolation workflow exists for exactly this reason.
See:
- ComfyUI frame interpolation workflow
That page is very relevant because it explicitly says frame interpolation:
- generates intermediate frames
- smooths motion
- improves temporal consistency
- is useful for increasing frame rate in short clips
- is useful for fixing low-FPS generations without regenerating the source frames
My practical recommendation
For your current setup I would test this order:
- keep clips short
- use a sensible native frame count
- use stronger control (first frame, first→last frame)
- only then use interpolation for smoother output
That is usually a better quality/time tradeoff than forcing 24 FPS generation everywhere.
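A rough way to compare "generate fewer frames, interpolate later" with native 24 FPS is sketched below. It assumes an interpolator that inserts (multiplier - 1) new frames between each neighbouring pair; real interpolation nodes may count frames slightly differently, and the clip length and FPS values are only examples.

```python
def interpolated_frames(native_frames: int, multiplier: int) -> int:
    """Frames after inserting (multiplier - 1) new frames between each neighbouring pair."""
    if native_frames < 2:
        return native_frames  # nothing to interpolate between
    return (native_frames - 1) * multiplier + 1


seconds = 5
native_fps = 12
native = seconds * native_fps             # frames the video model must generate
smooth = interpolated_frames(native, 2)   # frames after 2x interpolation
brute_force = seconds * 24                # frames if generated natively at 24 fps

print(f"generated by the video model: {native}")
print(f"after 2x interpolation:       {smooth}")
print(f"native 24 fps would generate: {brute_force}")
```

In this example the video model generates half as many frames as native 24 FPS would need, and the interpolation pass fills in the rest.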
3) Why the inpainting tutorials feel like they stop one step too early
This is the part causing the most confusion, and for good reason.
What those tutorials are really teaching
The standard inpainting tutorials teach:
- load an image
- draw a mask
- use text conditioning
- regenerate only the masked region
That is generic inpainting.
And yes, that is why:
- the teapot example works
- the cloud/hair example works
- but your actual problem still feels unsolved
Because your actual problem is not:
replace this masked region with any plausible thing described by text
Your actual problem is:
keep this bad frame as the base image, keep the pose/lighting/composition, and make the masked face look like the correct person from another image
That is a different task.
The missing concept
You are not supposed to put the second face image “onto the canvas” like another background layer.
Instead:
- the broken frame remains the base image
- the mask defines the region to repair
- the second face image enters the graph as a reference / swap source / identity guide
- a repair node uses that second image to influence what happens inside the mask
That is the key mental shift.
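The roles of the three images can be sketched in plain Python. This is a toy illustration with tiny grayscale "images" as nested lists, not how any particular ComfyUI node is implemented: the reference face only influences how the repaired pixels are produced, and the mask decides where those pixels replace the base frame.

```python
def composite(base, repaired, mask):
    """Blend per pixel: mask 1.0 takes the repaired pixel, mask 0.0 keeps the base pixel."""
    return [
        [m * r + (1.0 - m) * b for b, r, m in zip(base_row, rep_row, mask_row)]
        for base_row, rep_row, mask_row in zip(base, repaired, mask)
    ]


# Toy 2x3 grayscale images. "repaired" stands in for whatever a swap or
# reference-guided repair node produced; the reference face itself is never
# pasted onto the canvas.
base = [[10.0, 10.0, 10.0],
        [10.0, 10.0, 10.0]]
repaired = [[99.0, 99.0, 99.0],
            [99.0, 99.0, 99.0]]
mask = [[0.0, 1.0, 0.0],   # 1.0 = fully inside the face mask
        [0.0, 0.5, 0.0]]   # 0.5 = feathered mask edge

result = composite(base, repaired, mask)
print(result)
```

Everything outside the mask is untouched, which is why pose, lighting, and composition survive the repair.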
4) So what are the actual ways to use a second face image?
There are three practical families.
A. Face swap: the direct route
This is the ReActor route.
Use it when:
- the frame is already good
- the face became the wrong person
- the pose, lighting, clothes, and framing are acceptable
Relevant repo:
- ComfyUI-ReActor
Why it is relevant:
- it is explicitly a face-swap extension for ComfyUI
- it supports reusable face models
- it is designed for image inputs and is very naturally suited to “fix this bad frame”
In plain language, the workflow is:
- input_image = broken frame
- source_image or face_model = the correct identity
- output = repaired frame
That is probably the closest direct answer to your actual question.
B. Local face repair/detailing: the practical fallback
This is the Impact Pack route.
Relevant repo:
- ComfyUI Impact Pack
Important nodes:
- MaskPainter: draw the mask
- FaceDetailer: detect faces and improve them
- MaskDetailer: inpaint only the masked area with a detailer pass
Why it is relevant:
- it matches the “keep the frame, only fix the face” logic very well
- it is a great fallback if ReActor is awkward or not the right fit
- it is especially useful if the face is not just the wrong person but also a bit damaged, blurry, or structurally off
C. Reference-guided identity repair: the most conceptually accurate route
This is the IPAdapter FaceID-style idea.
Relevant repo:
- ComfyUI IPAdapter Plus
Why it is relevant:
- this is the clearest answer to “how do I use a second image to guide the face repair?”
- the second face image becomes an identity reference, not just a prompt substitute
- the docs emphasize that regional use is most effective through an inpainting workflow
This route is powerful, but it is more setup-heavy than the other two.
5) My actual recommendation for your case
If this were my setup, I would not try to solve everything inside one giant graph.
I would deliberately split the work into two workflows.
Workflow A — the main Wan video workflow
This is your existing graph.
Keep it for:
- image/text/video generation
- positive / negative prompt control
- LoRAs
- first-frame workflows
- first-frame → last-frame workflows
This is your shot generator.
Relevant docs:
- Wan2.2 Video Generation ComfyUI Official Native Workflow Example
- ComfyUI Wan FLF workflow
Workflow B — the separate still-frame repair workflow
This is the graph you use when a shot finishes and the last frame is almost right, but the face is not.
Use it for:
- loading the broken frame
- masking only the face
- repairing that face with one of:
- plain inpaint
- ReActor
- Impact Pack
- reference-guided identity repair
Then save the repaired frame and feed it back into the next Wan shot.
This is your continuity repair tool.
That split is extremely important.
Why I recommend two workflows
Because it gives each graph one clear job:
- Workflow A creates shots
- Workflow B repairs bridge frames
That is much easier to understand and much easier to debug than an all-in-one “do everything” workflow.
6) Repair vs recreate: the rule that will save you the most time
This is the rule I would use.
Repair when:
- the frame is already mostly good
- the body pose is right
- the lighting is right
- the composition is right
- the background / bench is right
- only the face or a tiny area drifted
Recreate when:
- the pose is wrong
- the camera is wrong
- the sit-down motion is wrong
- multiple frames in a row are bad
- fixing the face would still leave the shot unusable
For your project, that usually means:
- walk: repair the last frame if only the face drifted
- approach bench: same
- sit-down transition: usually recreate with FLF, not patch frame-by-frame
- seated shot: repair isolated face drift, recreate bad staging
This is the production logic I would trust.
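The repair-versus-recreate rule can be written down as a small checklist function. This is a sketch only: the field names are hypothetical, and treating more than one bad frame in a row as "recreate" is my reading of the rule, not a fixed threshold.

```python
from dataclasses import dataclass


@dataclass
class FrameCheck:
    """Hypothetical checklist filled in by eye after reviewing a shot."""
    pose_ok: bool
    camera_ok: bool
    lighting_ok: bool
    composition_ok: bool
    bad_frames_in_a_row: int  # how many consecutive frames look wrong


def decide(check: FrameCheck) -> str:
    """Return 'repair' when only a small local area drifted, otherwise 'recreate'."""
    staging_ok = all([check.pose_ok, check.camera_ok,
                      check.lighting_ok, check.composition_ok])
    if staging_ok and check.bad_frames_in_a_row <= 1:
        return "repair"
    return "recreate"


walk_last_frame = FrameCheck(True, True, True, True, bad_frames_in_a_row=1)
bad_sit_down = FrameCheck(False, True, True, True, bad_frames_in_a_row=6)

print(decide(walk_last_frame))
print(decide(bad_sit_down))
```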
7) The exact answer to “what am I doing wrong?”
I do not think you are doing the wrong operation.
I think you are trying to solve a reference-guided identity repair task with a text-only generic inpainting tutorial.
That is the mismatch.
You are not failing because you do not understand masking.
You are failing because your actual task needs one more ingredient:
- a second-image identity mechanism
That mechanism is usually:
- ReActor as a swap source
- or IPAdapter FaceID as a reference-conditioning source
- or Impact Pack as a local face repair route when the face is damaged and needs controlled correction
That is the real missing piece.
8) The most practical next steps I would recommend
Step 1
Keep your current Wan graph as your main shot workflow.
Step 2
Stop worrying about forcing a classic checkpoint node into that graph.
Step 3
Create a separate frame repair workflow.
Start with the official inpaint tutorial just to make sure the mask/local-rebuild logic is completely clear:
- ComfyUI Inpainting Workflow
Then decide:
- slight face drift → plain masked face inpaint
- wrong person → ReActor
- local face repair / fallback → Impact Pack
- strong second-image identity guidance → IPAdapter FaceID-style route
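The decision list above is small enough to keep as a lookup table next to your notes. In this sketch the symptom keys are hypothetical labels of my own, and the Impact Pack fallback mirrors the "local face repair / fallback" line in the list.

```python
# Symptom keys are made-up labels; the routes are the four options listed above.
REPAIR_ROUTES = {
    "slight_face_drift": "plain masked face inpaint",
    "wrong_person": "ReActor",
    "damaged_face": "Impact Pack",
    "needs_identity_reference": "IPAdapter FaceID-style route",
}


def pick_route(symptom: str) -> str:
    """Map an observed symptom to a repair route, falling back to Impact Pack."""
    return REPAIR_ROUTES.get(symptom, "Impact Pack")


for symptom in ("wrong_person", "slight_face_drift", "something_else"):
    print(f"{symptom} -> {pick_route(symptom)}")
```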
Step 4
Use FLF for your sit-down transition.
That is a cleaner solution than trying to patch a bad sit-down clip.
Step 5
Do not default to native 24 FPS everywhere. Test:
- short clips
- moderate native FPS
- interpolation later
See:
- ComfyUI frame interpolation workflow
9) The one-sentence summary
Your Wan workflow should generate shots; your separate repair workflow should fix bad bridge frames using a mask plus either a swap source, a face-detail pass, or a second-image identity guide.
That is the cleanest way to make the whole thing click.
10) The exact links I would keep open
Main generation:
- Wan2.2 Video Generation ComfyUI Official Native Workflow Example
- ComfyUI Wan FLF workflow
Repair / continuity:
- ComfyUI Inpainting Workflow
- ComfyUI-ReActor
- ComfyUI Impact Pack
- ComfyUI IPAdapter Plus
Later / heavier options:
- WanFunInpaintToVideo node docs
- Wan VACE To Video node docs
Utilities:
- ComfyUI frame interpolation workflow
Final recommendation
If this were my setup, I would do this:
- keep the current Wan 2.2 graph
- do not add a classic checkpoint loader to it
- build one separate still-frame repair graph
- use that graph to fix bad bridge frames
- use FLF for the sit-down transition
- use frame interpolation instead of brute-forcing native 24 FPS everywhere
- only later consider heavier clip-editing or training workflows
That is the simplest, cleanest, least frustrating path from where you are now.