Nvidia driver update - reactor node
Hmm… if you are trying to do audio too on 8GB VRAM, maybe something like this:
I would separate the “audio problem” into three different problems:
| Layer | What it means | 8GB VRAM practicality |
|---|---|---|
| 1. Attach audio to a video | Add an existing .wav / .mp3 track to the generated video | Very realistic |
| 2. Generate speech | Create the dialogue audio from text or a recording | Realistic, especially if done outside the video workflow |
| 3. Lip-sync / audio-driven motion | Make the mouth, face, head, or body follow the audio | Possible, but should be treated as a separate later workflow |
So I would not try to solve all of this in one giant ComfyUI graph at first.
For 8GB VRAM, the practical order is probably:
1. Generate or record the voice separately
2. Generate the silent video with your current working workflow
3. Mux the audio into the video
4. If the mismatch is too obvious, try a simple lip-sync post-process
5. Only then look at heavier audio-driven video systems
The most important point is this:
An audio input on a video node usually means “I can attach an existing audio stream,” not “I can generate speech from the prompt.”
So if the save/combine video node has an audio socket or audio icon, that does not necessarily mean Wan/ReActor is generating audio. It usually means you can pass in an existing audio object and have it combined into the final file.
Recommended path for 8GB VRAM
I would start with the boring, common, well-documented path:
TTS or recorded voice
↓
silent generated video
↓
mux audio into the video
↓
optional Wav2Lip / LatentSync pass if lip-sync is needed
This is less magical than a full audio-driven video model, but it is much more realistic on 8GB VRAM.
Why I would not start with Wan2.2-S2V locally
Wan2.2-S2V is closer to the ideal solution: image/video + audio → speech-driven video.
But I would not start there on 8GB VRAM.
Wan2.2-S2V-14B exists and is the more “native” speech-to-video direction:
- Wan2.2 GitHub
- Wan-AI/Wan2.2-S2V-14B on Hugging Face
However, the examples in the official model card / README assume far more VRAM than an 8GB local setup has. The S2V route is more like:
high-VRAM GPU / cloud / hosted workflow
not:
easy local 8GB ComfyUI workflow
So I would treat Wan2.2-S2V as the “ideal future path,” not the first recovery path.
Step 1: audio attachment / muxing
For simply putting audio into a generated video, the usual ComfyUI path is something like:
LoadAudio
+
Video Combine
Useful links:
- ComfyUI LoadAudio node
- ComfyUI SaveAudio node
- ComfyUI-VideoHelperSuite
- RunComfy Video Combine node guide
VideoHelperSuite’s Video Combine node is useful because it combines image frames into a video, and if an optional audio input is provided, it can combine that audio into the output video.
So the first test should be very simple:
short generated silent video
+
short audio file
↓
Video Combine
↓
video with audio
Do not start with a full long clip. Start with 5–10 seconds.
Good first test settings
Duration: 5–10 seconds
One speaker
One face
Front-facing if possible
No scene cuts
No camera chaos
Audio length roughly equals video length
This first step answers only one question:
Can I attach audio to the video at all?
It does not solve lip-sync yet.
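Before wiring anything up, it can help to confirm that the two lengths really are close. Here is a minimal sketch, assuming the audio is an uncompressed PCM .wav (Python's standard `wave` module does not read .mp3) and that you know the frame count and fps of your generated clip. The function names and the 0.25-second tolerance are my own illustrative choices, not part of any ComfyUI node:

```python
import wave


def video_seconds(frame_count: int, fps: float) -> float:
    """Duration of a clip rendered as frame_count frames at fps."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return frame_count / fps


def wav_seconds(path: str) -> float:
    """Duration of a PCM .wav file, computed from its header."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def lengths_match(video_s: float, audio_s: float, tolerance_s: float = 0.25) -> bool:
    """True when audio and video differ by at most tolerance_s seconds."""
    return abs(video_s - audio_s) <= tolerance_s
```

For example, `lengths_match(video_seconds(80, 16), wav_seconds("voice.wav"))` checks an 80-frame clip at an assumed 16 fps (5 seconds) against a hypothetical `voice.wav`.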
Step 2: generate the speech separately
For speech, I would initially use a separate TTS or recorded voice path.
Possible options:
| Option | Why use it | Caveat |
|---|---|---|
| Recorded voice | Simplest and most predictable | Requires recording |
| External TTS | Often easiest | May require API/account |
| ComfyUI TTS node | Keeps more inside ComfyUI | Adds more dependencies |
| Voice cloning TTS | Better character voice control | More setup and ethical/legal care |
ComfyUI TTS/audio nodes exist, but I would keep them separate from the main video workflow at first.
Some useful entry points:
- TTS-Audio-Suite for ComfyUI
- TTS-Audio-Suite releases
- F5-TTS paper
- ComfyUI ElevenLabs integration announcement
For the first working version, I would not care too much where the voice comes from. The important thing is to get a clean .wav or .mp3 that you can attach to the video.
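If the voice source hands you an .mp3 or an unusual sample rate, one way to normalize it is a small ffmpeg call. This is only a sketch, assuming ffmpeg is installed and on your PATH. The 16 kHz mono default is an assumption on my part, so check what your lip-sync or TTS nodes actually expect. `build_wav_command` and `convert_to_wav` are hypothetical helper names:

```python
import subprocess


def build_wav_command(src: str, dst: str, sample_rate: int = 16000, channels: int = 1) -> list[str]:
    """Build an ffmpeg command that converts src to 16-bit PCM WAV."""
    return [
        "ffmpeg",
        "-y",                      # overwrite dst if it already exists
        "-i", src,                 # input audio (mp3, wav, ...)
        "-vn",                     # ignore any video / cover-art stream
        "-ac", str(channels),      # number of output channels
        "-ar", str(sample_rate),   # output sample rate in Hz (assumed default)
        "-c:a", "pcm_s16le",       # 16-bit little-endian PCM
        dst,
    ]


def convert_to_wav(src: str, dst: str) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_wav_command(src, dst), check=True)
```

Keeping the command builder separate from the `subprocess.run` call makes it easy to print and inspect the exact command before running it.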
Step 3: if lip-sync is needed, start with common tools
You probably will eventually want lip-sync. But I would not start with the newest full audio-driven video system.
For beginner-friendly debugging, I would try the older/common lip-sync route first:
generated video
+
speech audio
↓
Wav2Lip or LatentSync
↓
lip-synced video
This is a post-processing step. It is different from asking Wan to generate the whole video from the audio.
Beginner-friendly first lip-sync option: Wav2Lip
Wav2Lip is older, but that is actually a benefit for debugging. There are many examples, tutorials, and failure reports around it.
Useful links:
- ComfyUI_wav2lip
- Wav2Lip paper
- Wav2Lip original GitHub repo
- Example YouTube tutorial: LipSync in ComfyUI with ReActor and Wav2Lip
Why Wav2Lip first?
older
common
more tutorials
more known failure cases
simpler mental model
The mental model is straightforward:
input video + input audio → output video with adjusted mouth movement
It may not be the best quality, but it is often a good first proof of concept.
Expected Wav2Lip problems
Wav2Lip can struggle with:
small faces
side views
covered mouths
fast head movement
multiple faces
low-resolution faces
strong stylization
large camera motion
long clips
So the first test should be intentionally easy:
one person
face visible
mouth visible
short clip
audio length close to video length
Better-quality next option: LatentSync
If Wav2Lip works but the quality is not good enough, I would try LatentSync next.
Useful links:
- ComfyUI-LatentSyncWrapper
- LatentSync paper
- LatentSync GitHub issue #99: video/audio length mismatch behavior
- ThinkDiffusion LatentSync guide
LatentSync is newer and likely to give better results in some cases, but it also has more moving parts.
The main practical issues to expect are:
video length vs audio length mismatch
fps mismatch
audio sample-rate expectations
face detection failure
small/side faces
long clip instability
VRAM pressure
dependency issues
A very common beginner mistake is trying:
5-second video + 30-second audio
and expecting a full 30-second lip-synced output. Depending on the tool, the output length may follow the video, follow the audio, or be set by internal chunking, so you cannot assume the longer input wins. Keep the first test very short and matched:
5-second video
+
5-second audio
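If the audio you have is longer than the clip, trimming it first removes one variable. This is a minimal sketch using only Python's standard `wave` module, assuming a PCM .wav input. `trim_wav` is a hypothetical helper, and it cuts from the start of the file without any fade-out:

```python
import wave


def trim_wav(src: str, dst: str, max_seconds: float) -> float:
    """Copy src to dst, keeping at most max_seconds of audio.

    Returns the duration of the written file in seconds.
    """
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        keep = min(reader.getnframes(), int(max_seconds * reader.getframerate()))
        frames = reader.readframes(keep)

    with wave.open(dst, "wb") as writer:
        writer.setparams(params)    # same channels, sample width, and rate
        writer.writeframes(frames)  # header frame count is corrected on close
    return keep / params.framerate
```

For the mismatch above, `trim_wav("voice_30s.wav", "voice_5s.wav", 5.0)` would write a 5-second file (both file names are placeholders).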
What I would not do first
I would not start with these on 8GB VRAM:
| Tool / direction | Why not first |
|---|---|
| Wan2.2-S2V-14B | Much closer to ideal, but too heavy for a first local attempt on 8GB |
| InfiniteTalk | More powerful audio-driven video/dubbing direction, but more complex |
| FantasyTalking / WanVideo adapter workflows | Potentially strong, but heavier and more fragile |
| HunyuanVideo-Avatar | High-end audio-driven human animation; not a simple 8GB beginner route |
| Long multi-scene lip-sync | Too many failure points at once |
These are worth knowing about, but I would keep them as later options.
Useful links for later exploration:
- InfiniteTalk GitHub
- InfiniteTalk on Hugging Face
- Comfy workflow: Wan2.1 InfiniteTalk audio-driven character lip sync
- HunyuanVideo-Avatar paper
- MMAudio GitHub
- ComfyUI-MMAudio wrapper
InfiniteTalk is interesting because it does not modify only the lips. It aims to align lip sync, head movement, body posture, and facial expression from an input video and audio track. That is more ambitious than Wav2Lip-style mouth replacement, which is also why it is not where I would start on a small local setup.
Recommended practical workflow
I would use this staged approach.
Phase 1 — prove audio muxing
Goal:
Can I attach audio to my generated video?
Test:
1. Generate a 5-second silent video
2. Create or record a 5-second audio file
3. Load the audio
4. Combine video + audio
5. Export
Use:
- LoadAudio
- VideoHelperSuite Video Combine
Do not care about lip-sync yet.
Phase 2 — create better speech
Goal:
Can I make the voice track I actually want?
Options:
recorded voice
external TTS
ComfyUI TTS node
voice-cloning TTS
Output:
clean wav/mp3
same approximate duration as the video
Phase 3 — basic lip-sync attempt
Goal:
Can I make the mouth roughly match the audio?
First try:
Wav2Lip
Then, if needed:
LatentSync
Test conditions:
5–10 seconds
single person
front-facing
no cuts
mouth visible
audio length ~= video length
Phase 4 — scale up carefully
Only after the short test works:
longer clip
higher resolution
more motion
more camera movement
more stylized faces
multiple scenes
If it breaks, go back to a shorter clip.
Phase 5 — advanced audio-driven video
Only later consider:
InfiniteTalk
Wan2.2-S2V
HunyuanVideo-Avatar
FantasyTalking
MMAudio for sound effects/background audio
This is where you explore more modern full audio-driven motion systems, but it is not the first 8GB route.
Suggested decision table
| Goal | First thing to try | If not enough | Avoid at first |
|---|---|---|---|
| Just add sound | VideoHelperSuite Video Combine | ffmpeg/editor mux | S2V |
| Generate dialogue | external TTS or simple ComfyUI TTS | better TTS / voice cloning | full audio-driven video |
| Basic lip-sync | Wav2Lip | LatentSync | InfiniteTalk first |
| Better lip-sync quality | LatentSync | short-clip advanced tools | long full-scene test |
| Body/head/audio-driven performance | InfiniteTalk-like workflows | cloud/high-VRAM workflows | 8GB local full setup |
| Sound effects/background audio | MMAudio-like V2A tools | manual SFX editing | treating it as dialogue TTS |
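The "ffmpeg/editor mux" fallback in the first row can be sketched as follows. This assumes ffmpeg is installed and on your PATH, that the silent video is already an .mp4 whose video stream can be copied without re-encoding, and that AAC audio is acceptable for the output. `build_mux_command` and `mux` are hypothetical helper names:

```python
import subprocess


def build_mux_command(video: str, audio: str, output: str) -> list[str]:
    """Build an ffmpeg command that attaches audio to a silent video."""
    return [
        "ffmpeg",
        "-y",                # overwrite output if it already exists
        "-i", video,         # input 0: the silent generated video
        "-i", audio,         # input 1: the voice track
        "-map", "0:v:0",     # take the first video stream from input 0
        "-map", "1:a:0",     # take the first audio stream from input 1
        "-c:v", "copy",      # do not re-encode the video
        "-c:a", "aac",       # encode the audio as AAC
        "-shortest",         # stop at the end of the shorter input
        output,
    ]


def mux(video: str, audio: str, output: str) -> None:
    """Run the mux; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_mux_command(video, audio, output), check=True)
```

Because the video stream is copied rather than re-encoded, this runs on the CPU in seconds and does not touch VRAM at all, which is the point of keeping muxing as its own stage.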
Practical advice for 8GB VRAM
For 8GB VRAM, I would think in terms of short, separate stages:
video generation
audio generation
audio muxing
lip-sync pass
final edit
not one huge all-in-one workflow.
A good first target:
5-second talking clip
one face
one voice
audio attached
rough lip-sync
A bad first target:
81-frame or longer full workflow
multiple cuts
stylized face
moving camera
ReActor
Wan video
TTS
lip-sync
audio mux
all in one graph
The second version has too many things that can fail at once.
The simple recommendation
If I had to pick a practical beginner route, I would do:
1. Keep your current working video workflow.
2. Generate or record the voice separately.
3. Use VideoHelperSuite / Video Combine to attach the audio.
4. If lip-sync is necessary, try Wav2Lip on a 5–10 second clip.
5. If Wav2Lip is too low-quality, try LatentSync.
6. Treat Wan2.2-S2V / InfiniteTalk / HunyuanVideo-Avatar as later high-end options.
That path is not the fanciest, but it is probably the most debuggable.
And with 8GB VRAM, “debuggable” matters more than “most advanced.”