
Nvidia driver update - reactor node

Hugging Face Forums [Unofficial] June 8, 2026

Hmm… if you are trying to do audio too on 8GB VRAM, maybe something like this:


I would separate the “audio problem” into three different problems:

| Layer | What it means | 8GB VRAM practicality |
|---|---|---|
| 1. Attach audio to a video | Add an existing .wav / .mp3 track to the generated video | Very realistic |
| 2. Generate speech | Create the dialogue audio from text or a recording | Realistic, especially if done outside the video workflow |
| 3. Lip-sync / audio-driven motion | Make the mouth, face, head, or body follow the audio | Possible, but should be treated as a separate later workflow |

So I would not try to solve all of this in one giant ComfyUI graph at first.

For 8GB VRAM, the practical order is probably:

1. Generate or record the voice separately
2. Generate the silent video with your current working workflow
3. Mux the audio into the video
4. If the mismatch is too obvious, try a simple lip-sync post-process
5. Only then look at heavier audio-driven video systems

The most important point is this:

An audio input on a video node usually means “I can attach an existing audio stream,” not “I can generate speech from the prompt.”

So if the save/combine video node has an audio socket or audio icon, that does not necessarily mean Wan/ReActor is generating audio. It usually means you can pass in an existing audio object and have it combined into the final file.

Recommended path for 8GB VRAM

I would start with the boring, common, well-documented path:

TTS or recorded voice
↓
silent generated video
↓
mux audio into the video
↓
optional Wav2Lip / LatentSync pass if lip-sync is needed

This is less magical than a full audio-driven video model, but it is much more realistic on 8GB VRAM.

Why I would not start with Wan2.2-S2V locally

Wan2.2-S2V is closer to the ideal solution: image/video + audio → speech-driven video.

But I would not start there on 8GB VRAM.

Wan2.2-S2V-14B exists and is the more “native” speech-to-video direction:

  • Wan2.2 GitHub
  • Wan-AI/Wan2.2-S2V-14B on Hugging Face

However, the official model card / README examples assume far more VRAM than an 8GB local setup provides. The S2V route is more like:

high-VRAM GPU / cloud / hosted workflow

not:

easy local 8GB ComfyUI workflow

So I would treat Wan2.2-S2V as the “ideal future path,” not the first recovery path.

Step 1: audio attachment / muxing

For simply putting audio into a generated video, the common ComfyUI path is usually something like:

LoadAudio
+
Video Combine

Useful links:

  • ComfyUI LoadAudio node
  • ComfyUI SaveAudio node
  • ComfyUI-VideoHelperSuite
  • RunComfy Video Combine node guide

VideoHelperSuite’s Video Combine node is useful because it combines image frames into a video, and if an optional audio input is provided, it can combine that audio into the output video.

So the first test should be very simple:

short generated silent video
+
short audio file
↓
Video Combine
↓
video with audio

Do not start with a full long clip. Start with 5–10 seconds.

Good first test settings

Duration: 5–10 seconds
One speaker
One face
Front-facing if possible
No scene cuts
No camera chaos
Audio length roughly equals video length

This first step answers only one question:

Can I attach audio to the video at all?

It does not solve lip-sync yet.
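If the Video Combine route gives trouble, the same mux can be done outside ComfyUI with ffmpeg. Here is a minimal sketch that only builds the command, assuming ffmpeg is installed and on your PATH. The file names and the `build_mux_command` helper are placeholders I made up for illustration, not part of any ComfyUI node.

```python
def build_mux_command(video_path, audio_path, output_path):
    """Build an ffmpeg argument list that attaches an audio track to a silent video.

    The helper name and the paths are hypothetical; only the ffmpeg flags are real.
    """
    return [
        "ffmpeg",
        "-y",                    # overwrite the output file if it already exists
        "-i", str(video_path),   # input 0: the silent generated video
        "-i", str(audio_path),   # input 1: the speech track (.wav or .mp3)
        "-map", "0:v:0",         # take the first video stream from input 0
        "-map", "1:a:0",         # take the first audio stream from input 1
        "-c:v", "copy",          # copy the video stream without re-encoding
        "-c:a", "aac",           # encode the audio as AAC, a common choice for .mp4
        "-shortest",             # stop writing at the end of the shorter stream
        str(output_path),
    ]


cmd = build_mux_command("silent.mp4", "voice.wav", "with_audio.mp4")
print(" ".join(cmd))

# To actually run it once the files exist:
# import subprocess
# subprocess.run(cmd, check=True)
```

Because of `-shortest`, a 30-second audio file on a 5-second video gives a 5-second result, which is one more reason to keep the two lengths matched.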

Step 2: generate the speech separately

For speech, I would initially use a separate TTS or recorded voice path.

Possible options:

| Option | Why use it | Caveat |
|---|---|---|
| Recorded voice | Simplest and most predictable | Requires recording |
| External TTS | Often easiest | May require API/account |
| ComfyUI TTS node | Keeps more inside ComfyUI | Adds more dependencies |
| Voice cloning TTS | Better character voice control | More setup and ethical/legal care |

ComfyUI TTS/audio nodes exist, but I would keep them separate from the main video workflow at first.

Some useful entry points:

  • TTS-Audio-Suite for ComfyUI
  • TTS-Audio-Suite releases
  • F5-TTS paper
  • ComfyUI ElevenLabs integration announcement

For the first working version, I would not care too much where the voice comes from. The important thing is to get a clean .wav or .mp3 that you can attach to the video.
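Before attaching the file, it can help to check what you actually have. The following is a rough sketch using only Python's standard `wave` module. It reads uncompressed PCM .wav files only, so an .mp3 would need a different tool. The `describe_wav` name is just something I picked for the example.

```python
import wave


def describe_wav(path):
    """Return basic properties of a PCM WAV file (hypothetical helper name)."""
    with wave.open(str(path), "rb") as wav:
        frame_count = wav.getnframes()
        sample_rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": sample_rate,
            "bits_per_sample": wav.getsampwidth() * 8,
            "duration_s": frame_count / sample_rate,
        }


# Example usage, assuming a file called voice.wav exists:
# print(describe_wav("voice.wav"))
```

The duration is the number worth comparing against the video length, and the sample rate is worth noting in case a later lip-sync tool expects a specific one.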

Step 3: if lip-sync is needed, start with common tools

You will probably want lip-sync eventually. But I would not start with the newest full audio-driven video system.

For beginner-friendly debugging, I would try the older/common lip-sync route first:

generated video
+
speech audio
↓
Wav2Lip or LatentSync
↓
lip-synced video

This is a post-processing step. It is different from asking Wan to generate the whole video from the audio.

Beginner-friendly first lip-sync option: Wav2Lip

Wav2Lip is older, but that is actually a benefit for debugging. There are many examples, tutorials, and failure reports around it.

Useful links:

  • ComfyUI_wav2lip
  • Wav2Lip paper
  • Wav2Lip original GitHub repo
  • Example YouTube tutorial: LipSync in ComfyUI with ReActor and Wav2Lip

Why Wav2Lip first?

older
common
more tutorials
more known failure cases
simpler mental model

The mental model is straightforward:

input video + input audio → output video with adjusted mouth movement

It may not be the best quality, but it is often a good first proof of concept.

Expected Wav2Lip problems

Wav2Lip can struggle with:

small faces
side views
covered mouths
fast head movement
multiple faces
low-resolution faces
strong stylization
large camera motion
long clips

So the first test should be intentionally easy:

one person
face visible
mouth visible
short clip
audio length close to video length

Better-quality next option: LatentSync

If Wav2Lip works but the quality is not good enough, I would try LatentSync next.

Useful links:

  • ComfyUI-LatentSyncWrapper
  • LatentSync paper
  • LatentSync GitHub issue #99: video/audio length mismatch behavior
  • ThinkDiffusion LatentSync guide

LatentSync is newer and likely to give better results in some cases, but it also has more moving parts.

The main practical issues to expect are:

video length vs audio length mismatch
fps mismatch
audio sample-rate expectations
face detection failure
small/side faces
long clip instability
VRAM pressure
dependency issues
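For the fps item in that list, one way to see what frame rate a clip really has is ffprobe, which ships with ffmpeg. This is a small sketch, assuming ffprobe is installed. The helper names are hypothetical. ffprobe reports the rate as a fraction such as `30000/1001`, so the second function converts that text to a float.

```python
from fractions import Fraction


def ffprobe_fps_command(video_path):
    """Build an ffprobe argument list that prints the first video stream's frame rate."""
    return [
        "ffprobe",
        "-v", "error",                                # only print errors, not the banner
        "-select_streams", "v:0",                     # first video stream
        "-show_entries", "stream=r_frame_rate",       # print only the frame rate field
        "-of", "default=noprint_wrappers=1:nokey=1",  # bare value, no key or section header
        str(video_path),
    ]


def parse_frame_rate(text):
    """Convert ffprobe output such as '30000/1001' or '16/1' to frames per second."""
    return float(Fraction(text.strip()))


# To run it against a real file:
# import subprocess
# out = subprocess.run(ffprobe_fps_command("clip.mp4"),
#                      capture_output=True, text=True, check=True).stdout
# print(parse_frame_rate(out))
```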

A very common beginner mistake is trying:

5-second video + 30-second audio

and expecting a full 30-second lip-synced output. Depending on the tool, the output length may follow the video length, the audio length, or the tool's internal chunk size, so a mismatched pair gives unpredictable results. Keep the first test very short and matched:

5-second video
+
5-second audio
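The "matched" check is just arithmetic: video duration is frame count divided by fps, and that should be close to the audio duration. Below is a minimal sketch. The 0.25-second tolerance and the 16 fps in the examples are my own illustrative assumptions, so use whatever your workflow actually produces.

```python
def video_duration_s(frame_count, fps):
    """Duration of a clip in seconds, given its frame count and frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return frame_count / fps


def lengths_match(frame_count, fps, audio_duration_s, tolerance_s=0.25):
    """True if video and audio durations differ by at most tolerance_s seconds.

    The default tolerance is an arbitrary illustrative value, not a tool requirement.
    """
    return abs(video_duration_s(frame_count, fps) - audio_duration_s) <= tolerance_s


# 80 frames at an assumed 16 fps is 5.0 seconds of video.
print(lengths_match(80, 16, 5.0))   # matched pair
print(lengths_match(80, 16, 30.0))  # the 5-second video + 30-second audio mistake
```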

What I would not do first

I would not start with these on 8GB VRAM:

| Tool / direction | Why not first |
|---|---|
| Wan2.2-S2V-14B | Much closer to ideal, but too heavy for 8GB local first attempt |
| InfiniteTalk | More powerful audio-driven video/dubbing direction, but more complex |
| FantasyTalking / WanVideo adapter workflows | Potentially strong, but heavier and more fragile |
| HunyuanVideo-Avatar | High-end audio-driven human animation; not a simple 8GB beginner route |
| Long multi-scene lip-sync | Too many failure points at once |

These are worth knowing about, but I would keep them as later options.

Useful links for later exploration:

  • InfiniteTalk GitHub
  • InfiniteTalk on Hugging Face
  • Comfy workflow: Wan2.1 InfiniteTalk audio-driven character lip sync
  • HunyuanVideo-Avatar paper
  • MMAudio GitHub
  • ComfyUI-MMAudio wrapper

InfiniteTalk is interesting because it does not only try to modify the lips. It aims to align lip sync, head movement, body posture, and facial expression from an input video and audio track. That is more ambitious than Wav2Lip-style mouth replacement. But that also means it is not where I would start on a small local setup.

Recommended practical workflow

I would use this staged approach.

Phase 1 — prove audio muxing

Goal:

Can I attach audio to my generated video?

Test:

1. Generate a 5-second silent video
2. Create or record a 5-second audio file
3. Load the audio
4. Combine video + audio
5. Export

Use:

  • LoadAudio
  • VideoHelperSuite Video Combine

Do not care about lip-sync yet.
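If you do not have a voice file yet, a placeholder tone is enough to prove the mux works. Here is a rough sketch that writes a 5-second mono sine tone with Python's standard library. The function name, the 16000 Hz sample rate, and the 440 Hz pitch are arbitrary choices for the test, not requirements of LoadAudio or Video Combine.

```python
import math
import struct
import wave


def write_test_tone(path, seconds=5.0, sample_rate=16000, frequency_hz=440.0):
    """Write a mono 16-bit PCM sine tone and return the number of samples written.

    All defaults are illustrative assumptions for a quick mux test.
    """
    total_samples = int(seconds * sample_rate)
    amplitude = 0.3 * 32767  # stay well below full scale to avoid clipping
    samples = (
        int(amplitude * math.sin(2 * math.pi * frequency_hz * n / sample_rate))
        for n in range(total_samples)
    )
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)          # mono
        wav.setsampwidth(2)          # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(b"".join(struct.pack("<h", s) for s in samples))
    return total_samples


# Example usage:
# write_test_tone("test_tone.wav")
```

Once the tone shows up in the exported video, swap in the real voice track.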

Phase 2 — create better speech

Goal:

Can I make the voice track I actually want?

Options:

recorded voice
external TTS
ComfyUI TTS node
voice-cloning TTS

Output:

clean wav/mp3
same approximate duration as the video

Phase 3 — basic lip-sync attempt

Goal:

Can I make the mouth roughly match the audio?

First try:

Wav2Lip

Then, if needed:

LatentSync

Test conditions:

5–10 seconds
single person
front-facing
no cuts
mouth visible
audio length ~= video length

Phase 4 — scale up carefully

Only after the short test works:

longer clip
higher resolution
more motion
more camera movement
more stylized faces
multiple scenes

If it breaks, go back to a shorter clip.

Phase 5 — advanced audio-driven video

Only later consider:

InfiniteTalk
Wan2.2-S2V
HunyuanVideo-Avatar
FantasyTalking
MMAudio for sound effects/background audio

This is where you explore more modern full audio-driven motion systems, but it is not the first 8GB route.

Suggested decision table

| Goal | First thing to try | If not enough | Avoid at first |
|---|---|---|---|
| Just add sound | VideoHelperSuite Video Combine | ffmpeg/editor mux | S2V |
| Generate dialogue | external TTS or simple ComfyUI TTS | better TTS / voice cloning | full audio-driven video |
| Basic lip-sync | Wav2Lip | LatentSync | InfiniteTalk first |
| Better lip-sync quality | LatentSync | short-clip advanced tools | long full-scene test |
| Body/head/audio-driven performance | InfiniteTalk-like workflows | cloud/high-VRAM workflows | 8GB local full setup |
| Sound effects/background audio | MMAudio-like V2A tools | manual SFX editing | treating it as dialogue TTS |

Practical advice for 8GB VRAM

For 8GB VRAM, I would think in terms of short, separate stages:

video generation
audio generation
audio muxing
lip-sync pass
final edit

not one huge all-in-one workflow.

A good first target:

5-second talking clip
one face
one voice
audio attached
rough lip-sync

A bad first target:

81-frame or longer full workflow
multiple cuts
stylized face
moving camera
ReActor
Wan video
TTS
lip-sync
audio mux
all in one graph

The second version has too many things that can fail at once.

The simple recommendation

If I had to pick a practical beginner route, I would do:

1. Keep your current working video workflow.
2. Generate or record the voice separately.
3. Use VideoHelperSuite / Video Combine to attach the audio.
4. If lip-sync is necessary, try Wav2Lip on a 5–10 second clip.
5. If Wav2Lip is too low-quality, try LatentSync.
6. Treat Wan2.2-S2V / InfiniteTalk / HunyuanVideo-Avatar as later high-end options.

That path is not the fanciest, but it is probably the most debuggable.

And with 8GB VRAM, “debuggable” matters more than “most advanced.”
