Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreifhk6mdylkdgiqo22xfaclsog6bkmly6fbxdeeoc6245ysvd74rhu",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mnqyf5i6exw2"
  },
  "path": "/t/nvidia-driver-update-reactor-node/176221#post_8",
  "publishedAt": "2026-06-08T04:05:06.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Wan2.2 GitHub",
    "Wan-AI/Wan2.2-S2V-14B on Hugging Face",
    "ComfyUI LoadAudio node",
    "ComfyUI SaveAudio node",
    "ComfyUI-VideoHelperSuite",
    "RunComfy Video Combine node guide",
    "TTS-Audio-Suite for ComfyUI",
    "TTS-Audio-Suite releases",
    "F5-TTS paper",
    "ComfyUI ElevenLabs integration announcement",
    "ComfyUI_wav2lip",
    "Wav2Lip paper",
    "Wav2Lip original GitHub repo",
    "Example YouTube tutorial: LipSync in ComfyUI with ReActor and Wav2Lip",
    "ComfyUI-LatentSyncWrapper",
    "LatentSync paper",
    "LatentSync GitHub issue #99: video/audio length mismatch behavior",
    "ThinkDiffusion LatentSync guide",
    "InfiniteTalk GitHub",
    "InfiniteTalk on Hugging Face",
    "Comfy workflow: Wan2.1 InfiniteTalk audio-driven character lip sync",
    "HunyuanVideo-Avatar paper",
    "MMAudio GitHub",
    "ComfyUI-MMAudio wrapper",
    "LoadAudio",
    "VideoHelperSuite Video Combine"
  ],
  "textContent": "Hmm… if you are trying to do audio too on 8GB VRAM, maybe something like this:\n\n* * *\n\nI would separate the “audio problem” into three different problems:\n\nLayer | What it means | 8GB VRAM practicality\n---|---|---\n**1. Attach audio to a video** | Add an existing `.wav` / `.mp3` track to the generated video | Very realistic\n**2. Generate speech** | Create the dialogue audio from text or a recording | Realistic, especially if done outside the video workflow\n**3. Lip-sync / audio-driven motion** | Make the mouth, face, head, or body follow the audio | Possible, but should be treated as a separate later workflow\n\nSo I would **not** try to solve all of this in one giant ComfyUI graph at first.\n\nFor 8GB VRAM, the practical order is probably:\n\n\n    1. Generate or record the voice separately\n    2. Generate the silent video with your current working workflow\n    3. Mux the audio into the video\n    4. If the mismatch is too obvious, try a simple lip-sync post-process\n    5. Only then look at heavier audio-driven video systems\n\n\nThe most important point is this:\n\n> An audio input on a video node usually means “I can attach an existing audio stream,” not “I can generate speech from the prompt.”\n\nSo if the save/combine video node has an audio socket or audio icon, that does not necessarily mean Wan/ReActor is generating audio. It usually means you can pass in an existing audio object and have it combined into the final file.\n\n## Recommended path for 8GB VRAM\n\nI would start with the boring, common, well-documented path:\n\n\n    TTS or recorded voice\n    ↓\n    silent generated video\n    ↓\n    mux audio into the video\n    ↓\n    optional Wav2Lip / LatentSync pass if lip-sync is needed\n\n\nThis is less magical than a full audio-driven video model, but it is much more realistic on 8GB VRAM.\n\n## Why I would not start with Wan2.2-S2V locally\n\nWan2.2-S2V is closer to the ideal solution: image/video + audio → speech-driven video.\n\nBut I would not start there on 8GB VRAM.\n\nWan2.2-S2V-14B exists and is the more “native” speech-to-video direction:\n\n  * Wan2.2 GitHub\n  * Wan-AI/Wan2.2-S2V-14B on Hugging Face\n\n\n\nHowever, the official model card / README examples are much heavier than an 8GB local setup. The S2V route is more like:\n\n\n    high-VRAM GPU / cloud / hosted workflow\n\n\nnot:\n\n\n    easy local 8GB ComfyUI workflow\n\n\nSo I would treat Wan2.2-S2V as the “ideal future path,” not the first recovery path.\n\n## Step 1: audio attachment / muxing\n\nFor simply putting audio into a generated video, the common ComfyUI path is usually something like:\n\n\n    LoadAudio\n    +\n    Video Combine\n\n\nUseful links:\n\n  * ComfyUI LoadAudio node\n  * ComfyUI SaveAudio node\n  * ComfyUI-VideoHelperSuite\n  * RunComfy Video Combine node guide\n\n\n\nVideoHelperSuite’s `Video Combine` node is useful because it combines image frames into a video, and if an optional audio input is provided, it can combine that audio into the output video.\n\nSo the first test should be very simple:\n\n\n    short generated silent video\n    +\n    short audio file\n    ↓\n    Video Combine\n    ↓\n    video with audio\n\n\nDo not start with a full long clip. Start with 5–10 seconds.\n\n### Good first test settings\n\n\n    Duration: 5–10 seconds\n    One speaker\n    One face\n    Front-facing if possible\n    No scene cuts\n    No camera chaos\n    Audio length roughly equals video length\n\n\nThis first step answers only one question:\n\n> Can I attach audio to the video at all?\n\nIt does not solve lip-sync yet.\n\n## Step 2: generate the speech separately\n\nFor speech, I would initially use a separate TTS or recorded voice path.\n\nPossible options:\n\nOption | Why use it | Caveat\n---|---|---\n**Recorded voice** | Simplest and most predictable | Requires recording\n**External TTS** | Often easiest | May require API/account\n**ComfyUI TTS node** | Keeps more inside ComfyUI | Adds more dependencies\n**Voice cloning TTS** | Better character voice control | More setup and ethical/legal care\n\nComfyUI TTS/audio nodes exist, but I would keep them separate from the main video workflow at first.\n\nSome useful entry points:\n\n  * TTS-Audio-Suite for ComfyUI\n  * TTS-Audio-Suite releases\n  * F5-TTS paper\n  * ComfyUI ElevenLabs integration announcement\n\n\n\nFor the first working version, I would not care too much where the voice comes from. The important thing is to get a clean `.wav` or `.mp3` that you can attach to the video.\n\n## Step 3: if lip-sync is needed, start with common tools\n\nYou probably will eventually want lip-sync. But I would not start with the newest full audio-driven video system.\n\nFor beginner-friendly debugging, I would try the older/common lip-sync route first:\n\n\n    generated video\n    +\n    speech audio\n    ↓\n    Wav2Lip or LatentSync\n    ↓\n    lip-synced video\n\n\nThis is a post-processing step. It is different from asking Wan to generate the whole video from the audio.\n\n## Beginner-friendly first lip-sync option: Wav2Lip\n\nWav2Lip is older, but that is actually a benefit for debugging. There are many examples, tutorials, and failure reports around it.\n\nUseful links:\n\n  * ComfyUI_wav2lip\n  * Wav2Lip paper\n  * Wav2Lip original GitHub repo\n  * Example YouTube tutorial: LipSync in ComfyUI with ReActor and Wav2Lip\n\n\n\nWhy Wav2Lip first?\n\n\n    older\n    common\n    more tutorials\n    more known failure cases\n    simpler mental model\n\n\nThe mental model is straightforward:\n\n\n    input video + input audio → output video with adjusted mouth movement\n\n\nIt may not be the best quality, but it is often a good first proof of concept.\n\n### Expected Wav2Lip problems\n\nWav2Lip can struggle with:\n\n\n    small faces\n    side views\n    covered mouths\n    fast head movement\n    multiple faces\n    low-resolution faces\n    strong stylization\n    large camera motion\n    long clips\n\n\nSo the first test should be intentionally easy:\n\n\n    one person\n    face visible\n    mouth visible\n    short clip\n    audio length close to video length\n\n\n## Better-quality next option: LatentSync\n\nIf Wav2Lip works but the quality is not good enough, I would try LatentSync next.\n\nUseful links:\n\n  * ComfyUI-LatentSyncWrapper\n  * LatentSync paper\n  * LatentSync GitHub issue #99: video/audio length mismatch behavior\n  * ThinkDiffusion LatentSync guide\n\n\n\nLatentSync is newer and likely to give better results in some cases, but it also has more moving parts.\n\nThe main practical issues to expect are:\n\n\n    video length vs audio length mismatch\n    fps mismatch\n    audio sample-rate expectations\n    face detection failure\n    small/side faces\n    long clip instability\n    VRAM pressure\n    dependency issues\n\n\nA very common beginner mistake is trying:\n\n\n    5-second video + 30-second audio\n\n\nand expecting a full 30-second lip-synced output. Tools often behave according to the video length, the audio length, or internal chunking assumptions. So keep the first test very short and matched:\n\n\n    5-second video\n    +\n    5-second audio\n\n\n## What I would not do first\n\nI would not start with these on 8GB VRAM:\n\nTool / direction | Why not first\n---|---\n**Wan2.2-S2V-14B** | Much closer to ideal, but too heavy for 8GB local first attempt\n**InfiniteTalk** | More powerful audio-driven video/dubbing direction, but more complex\n**FantasyTalking / WanVideo adapter workflows** | Potentially strong, but heavier and more fragile\n**HunyuanVideo-Avatar** | High-end audio-driven human animation; not a simple 8GB beginner route\n**Long multi-scene lip-sync** | Too many failure points at once\n\nThese are worth knowing about, but I would keep them as later options.\n\nUseful links for later exploration:\n\n  * InfiniteTalk GitHub\n  * InfiniteTalk on Hugging Face\n  * Comfy workflow: Wan2.1 InfiniteTalk audio-driven character lip sync\n  * HunyuanVideo-Avatar paper\n  * MMAudio GitHub\n  * ComfyUI-MMAudio wrapper\n\n\n\nInfiniteTalk is interesting because it does not only try to modify the lips. It aims to align lip sync, head movement, body posture, and facial expression from an input video and audio track. That is more ambitious than Wav2Lip-style mouth replacement. But that also means it is not where I would start on a small local setup.\n\n## Recommended practical workflow\n\nI would use this staged approach.\n\n### Phase 1 — prove audio muxing\n\nGoal:\n\n\n    Can I attach audio to my generated video?\n\n\nTest:\n\n\n    1. Generate a 5-second silent video\n    2. Create or record a 5-second audio file\n    3. Load the audio\n    4. Combine video + audio\n    5. Export\n\n\nUse:\n\n  * LoadAudio\n  * VideoHelperSuite Video Combine\n\n\n\nDo not care about lip-sync yet.\n\n### Phase 2 — create better speech\n\nGoal:\n\n\n    Can I make the voice track I actually want?\n\n\nOptions:\n\n\n    recorded voice\n    external TTS\n    ComfyUI TTS node\n    voice-cloning TTS\n\n\nOutput:\n\n\n    clean wav/mp3\n    same approximate duration as the video\n\n\n### Phase 3 — basic lip-sync attempt\n\nGoal:\n\n\n    Can I make the mouth roughly match the audio?\n\n\nFirst try:\n\n\n    Wav2Lip\n\n\nThen, if needed:\n\n\n    LatentSync\n\n\nTest conditions:\n\n\n    5–10 seconds\n    single person\n    front-facing\n    no cuts\n    mouth visible\n    audio length ~= video length\n\n\n### Phase 4 — scale up carefully\n\nOnly after the short test works:\n\n\n    longer clip\n    higher resolution\n    more motion\n    more camera movement\n    more stylized faces\n    multiple scenes\n\n\nIf it breaks, go back to a shorter clip.\n\n### Phase 5 — advanced audio-driven video\n\nOnly later consider:\n\n\n    InfiniteTalk\n    Wan2.2-S2V\n    HunyuanVideo-Avatar\n    FantasyTalking\n    MMAudio for sound effects/background audio\n\n\nThis is where you explore more modern full audio-driven motion systems, but it is not the first 8GB route.\n\n## Suggested decision table\n\nGoal | First thing to try | If not enough | Avoid at first\n---|---|---|---\nJust add sound | VideoHelperSuite Video Combine | ffmpeg/editor mux | S2V\nGenerate dialogue | external TTS or simple ComfyUI TTS | better TTS / voice cloning | full audio-driven video\nBasic lip-sync | Wav2Lip | LatentSync | InfiniteTalk first\nBetter lip-sync quality | LatentSync | short-clip advanced tools | long full-scene test\nBody/head/audio-driven performance | InfiniteTalk-like workflows | cloud/high-VRAM workflows | 8GB local full setup\nSound effects/background audio | MMAudio-like V2A tools | manual SFX editing | treating it as dialogue TTS\n\n## Practical advice for 8GB VRAM\n\nFor 8GB VRAM, I would think in terms of short, separate stages:\n\n\n    video generation\n    audio generation\n    audio muxing\n    lip-sync pass\n    final edit\n\n\nnot one huge all-in-one workflow.\n\nA good first target:\n\n\n    5-second talking clip\n    one face\n    one voice\n    audio attached\n    rough lip-sync\n\n\nA bad first target:\n\n\n    81-frame or longer full workflow\n    multiple cuts\n    stylized face\n    moving camera\n    ReActor\n    Wan video\n    TTS\n    lip-sync\n    audio mux\n    all in one graph\n\n\nThe second version has too many things that can fail at once.\n\n## The simple recommendation\n\nIf I had to pick a practical beginner route, I would do:\n\n\n    1. Keep your current working video workflow.\n    2. Generate or record the voice separately.\n    3. Use VideoHelperSuite / Video Combine to attach the audio.\n    4. If lip-sync is necessary, try Wav2Lip on a 5–10 second clip.\n    5. If Wav2Lip is too low-quality, try LatentSync.\n    6. Treat Wan2.2-S2V / InfiniteTalk / HunyuanVideo-Avatar as later high-end options.\n\n\nThat path is not the fanciest, but it is probably the most debuggable.\n\nAnd with 8GB VRAM, “debuggable” matters more than “most advanced.”",
  "title": "Nvidia driver update - reactor node"
}