Nvidia driver update - reactor node

Hugging Face Forums [Unofficial] June 8, 2026

Hmm… if you are trying to do audio too on 8GB VRAM, maybe something like this:

I would separate the “audio problem” into three different problems:

| Layer | What it means | 8GB VRAM practicality |
|---|---|---|
| 1. Attach audio to a video | Add an existing `.wav` / `.mp3` track to the generated video | Very realistic |
| 2. Generate speech | Create the dialogue audio from text or a recording | Realistic, especially if done outside the video workflow |
| 3. Lip-sync / audio-driven motion | Make the mouth, face, head, or body follow the audio | Possible, but should be treated as a separate later workflow |

So I would not try to solve all of this in one giant ComfyUI graph at first.

For 8GB VRAM, the practical order is probably:

1. Generate or record the voice separately
2. Generate the silent video with your current working workflow
3. Mux the audio into the video
4. If the mismatch is too obvious, try a simple lip-sync post-process
5. Only then look at heavier audio-driven video systems

The most important point is this:

An audio input on a video node usually means “I can attach an existing audio stream,” not “I can generate speech from the prompt.”

So if the save/combine video node has an audio socket or audio icon, that does not necessarily mean Wan/ReActor is generating audio. It usually means you can pass in an existing audio object and have it combined into the final file.

Recommended path for 8GB VRAM

I would start with the boring, common, well-documented path:

TTS or recorded voice
↓
silent generated video
↓
mux audio into the video
↓
optional Wav2Lip / LatentSync pass if lip-sync is needed

This is less magical than a full audio-driven video model, but it is much more realistic on 8GB VRAM.

Why I would not start with Wan2.2-S2V locally

Wan2.2-S2V is closer to the ideal solution: image/video + audio → speech-driven video.

But I would not start there on 8GB VRAM.

Wan2.2-S2V-14B exists and is the more “native” speech-to-video direction:

Wan2.2 GitHub
Wan-AI/Wan2.2-S2V-14B on Hugging Face

However, the official model card / README examples assume far more VRAM than an 8GB local setup has. The S2V route is more like:

high-VRAM GPU / cloud / hosted workflow

not:

easy local 8GB ComfyUI workflow

So I would treat Wan2.2-S2V as the “ideal future path,” not the first recovery path.

Step 1: audio attachment / muxing

For simply putting audio into a generated video, the common ComfyUI path is usually something like:

LoadAudio
+
Video Combine

Useful links:

ComfyUI LoadAudio node
ComfyUI SaveAudio node
ComfyUI-VideoHelperSuite
RunComfy Video Combine node guide

VideoHelperSuite’s Video Combine node is useful because it combines image frames into a video and, if you connect its optional audio input, writes that audio into the output video as well.

So the first test should be very simple:

short generated silent video
+
short audio file
↓
Video Combine
↓
video with audio

Do not start with a full long clip. Start with 5–10 seconds.

Good first test settings

Duration: 5–10 seconds
One speaker
One face
Front-facing if possible
No scene cuts
No camera chaos
Audio length roughly equals video length

This first step answers only one question:

Can I attach audio to the video at all?

It does not solve lip-sync yet.
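If the Video Combine route gives you trouble, the same test can be done outside ComfyUI with ffmpeg. The following is only a minimal sketch: the helper name `build_mux_command` and the file names are made up for illustration, and it assumes ffmpeg is installed and that the output is an `.mp4` (hence AAC audio). It only builds the command so you can inspect it before running anything.

```python
import shlex
from pathlib import Path


def build_mux_command(video_path, audio_path, output_path):
    """Return an ffmpeg argument list that adds audio_path to video_path.

    Hypothetical helper for illustration; paths are placeholders.
    """
    return [
        "ffmpeg",
        "-y",                    # overwrite the output file if it exists
        "-i", str(video_path),   # input 0: the silent generated video
        "-i", str(audio_path),   # input 1: the speech audio
        "-map", "0:v:0",         # first video stream from input 0
        "-map", "1:a:0",         # first audio stream from input 1
        "-c:v", "copy",          # keep the video stream as-is (no re-encode)
        "-c:a", "aac",           # encode audio to AAC (assumes an .mp4 output)
        "-shortest",             # stop at the end of the shorter stream
        str(output_path),
    ]


if __name__ == "__main__":
    cmd = build_mux_command(
        Path("silent.mp4"), Path("voice.wav"), Path("with_audio.mp4")
    )
    # Print the command for inspection; to run it, pass `cmd` to
    # subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Because the video stream is copied rather than re-encoded, this step needs no GPU at all, which is part of why muxing is the cheapest thing to prove first.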

Step 2: generate the speech separately

For speech, I would initially use a separate TTS or recorded voice path.

Possible options:

| Option | Why use it | Caveat |
|---|---|---|
| Recorded voice | Simplest and most predictable | Requires recording |
| External TTS | Often easiest | May require API/account |
| ComfyUI TTS node | Keeps more inside ComfyUI | Adds more dependencies |
| Voice cloning TTS | Better character voice control | More setup and ethical/legal care |

ComfyUI TTS/audio nodes exist, but I would keep them separate from the main video workflow at first.

Some useful entry points:

TTS-Audio-Suite for ComfyUI
TTS-Audio-Suite releases
F5-TTS paper
ComfyUI ElevenLabs integration announcement

For the first working version, I would not care too much where the voice comes from. The important thing is to get a clean .wav or .mp3 that you can attach to the video.

Step 3: if lip-sync is needed, start with common tools

You probably will eventually want lip-sync. But I would not start with the newest full audio-driven video system.

For beginner-friendly debugging, I would try the older/common lip-sync route first:

generated video
+
speech audio
↓
Wav2Lip or LatentSync
↓
lip-synced video

This is a post-processing step. It is different from asking Wan to generate the whole video from the audio.

Beginner-friendly first lip-sync option: Wav2Lip

Wav2Lip is older, but that is actually a benefit for debugging. There are many examples, tutorials, and failure reports around it.

Useful links:

ComfyUI_wav2lip
Wav2Lip paper
Wav2Lip original GitHub repo
Example YouTube tutorial: LipSync in ComfyUI with ReActor and Wav2Lip

Why Wav2Lip first?

older
common
more tutorials
more known failure cases
simpler mental model

The mental model is straightforward:

input video + input audio → output video with adjusted mouth movement

It may not be the best quality, but it is often a good first proof of concept.

Expected Wav2Lip problems

Wav2Lip can struggle with:

small faces
side views
covered mouths
fast head movement
multiple faces
low-resolution faces
strong stylization
large camera motion
long clips

So the first test should be intentionally easy:

one person
face visible
mouth visible
short clip
audio length close to video length

Better-quality next option: LatentSync

If Wav2Lip works but the quality is not good enough, I would try LatentSync next.

Useful links:

ComfyUI-LatentSyncWrapper
LatentSync paper
LatentSync GitHub issue #99: video/audio length mismatch behavior
ThinkDiffusion LatentSync guide

LatentSync is newer and can give better results, but it also has more moving parts.

The main practical issues to expect are:

video length vs audio length mismatch
fps mismatch
audio sample-rate expectations
face detection failure
small/side faces
long clip instability
VRAM pressure
dependency issues

A very common beginner mistake is trying:

5-second video + 30-second audio

and expecting a full 30-second lip-synced output. Depending on the tool, the output length may follow the video length, the audio length, or an internal chunk size, so a mismatch can give a truncated or otherwise unexpected result. Keep the first test very short and matched:

5-second video
+
5-second audio
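To make "matched" concrete, here is a small sketch of the arithmetic. Video length is just frame count divided by fps, so you can compare it with the audio length before running anything. The 16 fps figure and the 0.25-second tolerance are assumptions for the example, not values from any tool, and the function names are made up.

```python
import math


def video_duration_s(frame_count, fps):
    """Video length in seconds from frame count and frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return frame_count / fps


def lengths_match(frame_count, fps, audio_seconds, tolerance_s=0.25):
    """True if video and audio lengths differ by at most tolerance_s.

    The default tolerance is an arbitrary illustrative value.
    """
    return abs(video_duration_s(frame_count, fps) - audio_seconds) <= tolerance_s


def frames_for_audio(audio_seconds, fps):
    """Smallest frame count whose duration covers the whole audio clip."""
    return math.ceil(audio_seconds * fps)


if __name__ == "__main__":
    # Assumed example: an 81-frame clip at 16 fps is 81 / 16 = 5.0625 seconds.
    print(video_duration_s(81, 16))
    print(lengths_match(81, 16, audio_seconds=5.0))   # close enough
    print(lengths_match(81, 16, audio_seconds=30.0))  # the mismatch case above
    print(frames_for_audio(5.0, 16))
```

If the check fails, the simplest fix for a first test is usually to trim the audio rather than to generate a longer video.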

What I would not do first

I would not start with these on 8GB VRAM:

| Tool / direction | Why not first |
|---|---|
| Wan2.2-S2V-14B | Much closer to ideal, but too heavy for 8GB local first attempt |
| InfiniteTalk | More powerful audio-driven video/dubbing direction, but more complex |
| FantasyTalking / WanVideo adapter workflows | Potentially strong, but heavier and more fragile |
| HunyuanVideo-Avatar | High-end audio-driven human animation; not a simple 8GB beginner route |
| Long multi-scene lip-sync | Too many failure points at once |

These are worth knowing about, but I would keep them as later options.

Useful links for later exploration:

InfiniteTalk GitHub
InfiniteTalk on Hugging Face
Comfy workflow: Wan2.1 InfiniteTalk audio-driven character lip sync
HunyuanVideo-Avatar paper
MMAudio GitHub
ComfyUI-MMAudio wrapper

InfiniteTalk is interesting because it does not only try to modify the lips. It aims to align lip sync, head movement, body posture, and facial expression from an input video and audio track. That is more ambitious than Wav2Lip-style mouth replacement. But that also means it is not where I would start on a small local setup.

Recommended practical workflow

I would use this staged approach.

Phase 1 — prove audio muxing

Goal:

Can I attach audio to my generated video?

Test:

1. Generate a 5-second silent video
2. Create or record a 5-second audio file
3. Load the audio
4. Combine video + audio
5. Export

Use:

LoadAudio
VideoHelperSuite Video Combine

Do not worry about lip-sync yet.

Phase 2 — create better speech

Goal:

Can I make the voice track I actually want?

Options:

recorded voice
external TTS
ComfyUI TTS node
voice-cloning TTS

Output:

clean wav/mp3
same approximate duration as the video

Phase 3 — basic lip-sync attempt

Goal:

Can I make the mouth roughly match the audio?

First try:

Wav2Lip

Then, if needed:

LatentSync

Test conditions:

5–10 seconds
single person
front-facing
no cuts
mouth visible
audio length ~= video length

Phase 4 — scale up carefully

Only after the short test works:

longer clip
higher resolution
more motion
more camera movement
more stylized faces
multiple scenes

If it breaks, go back to a shorter clip.
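The "scale up, and step back when it breaks" idea can be sketched as a simple loop. This is only an illustration: `run_test` is a hypothetical callable standing in for whatever you actually run (for example, queueing the workflow on a clip of that length and checking that it finishes), and nothing here talks to ComfyUI.

```python
def longest_working_duration(run_test, durations_s):
    """Try clip lengths in ascending order and stop at the first failure.

    run_test(seconds) is a hypothetical callable that returns True when the
    workflow succeeds on a clip of that length. Any exception it raises
    (for example an out-of-memory error) is treated as a failure.
    Returns the longest duration that worked, or None if none did.
    """
    last_ok = None
    for seconds in sorted(durations_s):
        try:
            ok = bool(run_test(seconds))
        except Exception:
            ok = False
        if not ok:
            break  # go back to the last length that worked
        last_ok = seconds
    return last_ok


if __name__ == "__main__":
    # Stand-in test: pretend anything longer than 10 seconds fails on 8GB VRAM.
    def fake_run(seconds):
        if seconds > 10:
            raise MemoryError("simulated out-of-memory")
        return True

    print(longest_working_duration(fake_run, [5, 10, 20, 40]))
```

The same pattern applies to resolution or motion strength: change one variable at a time, so that when something breaks you know which change caused it.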

Phase 5 — advanced audio-driven video

Only later consider:

InfiniteTalk
Wan2.2-S2V
HunyuanVideo-Avatar
FantasyTalking
MMAudio for sound effects/background audio

This is where you explore more modern full audio-driven motion systems, but it is not the first 8GB route.

Suggested decision table

| Goal | First thing to try | If not enough | Avoid at first |
|---|---|---|---|
| Just add sound | VideoHelperSuite Video Combine | ffmpeg/editor mux | S2V |
| Generate dialogue | external TTS or simple ComfyUI TTS | better TTS / voice cloning | full audio-driven video |
| Basic lip-sync | Wav2Lip | LatentSync | InfiniteTalk first |
| Better lip-sync quality | LatentSync | short-clip advanced tools | long full-scene test |
| Body/head/audio-driven performance | InfiniteTalk-like workflows | cloud/high-VRAM workflows | 8GB local full setup |
| Sound effects/background audio | MMAudio-like V2A tools | manual SFX editing | treating it as dialogue TTS |

Practical advice for 8GB VRAM

For 8GB VRAM, I would think in terms of short, separate stages:

video generation
audio generation
audio muxing
lip-sync pass
final edit

not one huge all-in-one workflow.

A good first target:

5-second talking clip
one face
one voice
audio attached
rough lip-sync

A bad first target:

81-frame or longer full workflow
multiple cuts
stylized face
moving camera
ReActor
Wan video
TTS
lip-sync
audio mux
all in one graph

The second version has too many things that can fail at once.

The simple recommendation

If I had to pick a practical beginner route, I would do:

1. Keep your current working video workflow.
2. Generate or record the voice separately.
3. Use VideoHelperSuite / Video Combine to attach the audio.
4. If lip-sync is necessary, try Wav2Lip on a 5–10 second clip.
5. If Wav2Lip is too low-quality, try LatentSync.
6. Treat Wan2.2-S2V / InfiniteTalk / HunyuanVideo-Avatar as later high-end options.

That path is not the fanciest, but it is probably the most debuggable.

And with 8GB VRAM, “debuggable” matters more than “most advanced.”
