Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreifllipmdb52jtbkrs2zd4ndmzfadgt45fy6fft6ux3r7llguiem3e",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mmrrmkjvew32"
  },
  "path": "/t/training-lora-for-ltx2-3-voice-sound-only/176239#post_2",
  "publishedAt": "2026-05-26T19:02:42.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "LTX-2.3 model card",
    "LTX-2 repository",
    "LTX-2 paper",
    "dataset preparation docs",
    "Training Modes / Audio-Video LoRA",
    "configuration reference",
    "ltx2_av_lora.yaml",
    "huggingface/xet-core #763",
    "Hugging Face cache docs",
    "LTX dataset preparation docs",
    "ostris/ai-toolkit #684",
    "ostris/ai-toolkit #780",
    "ostris/ai-toolkit #701",
    "bghira/SimpleTuner #2349",
    "ID-LoRA GitHub repo",
    "ID-LoRA-LTX2.3-ComfyUI",
    "Kijai / RuneXX Hugging Face discussion",
    "LTX-2 official GitHub repo",
    "Official LTX-2 Trainer docs",
    "Configuration Reference",
    "Dataset Preparation",
    "Original HF forum thread",
    "AI Toolkit #684: good video but poor voice",
    "AI Toolkit #701: Musubi picked up voice, AI Toolkit did not",
    "AI Toolkit #780: LTX-2.3 LoRA audio noise",
    "SimpleTuner #2349: LTX LoRA keys not loaded in ComfyUI",
    "Hugging Face Xet #763: Background writer channel closed can hide OS-level I/O errors",
    "Hugging Face hub cache docs",
    "ID-LoRA GitHub",
    "Kijai / RuneXX LTX2.3 ComfyUI reference audio workflow discussion"
  ],
  "textContent": "Maybe something like this would work:\n\n* * *\n\nI think I would first reframe this as an **Audio-Video LoRA** problem, not a pure “voice-only LoRA” problem.\n\nThat does not mean your goal is impossible. It just means I would avoid starting from `num_frames: 1` and expecting LTX-2.3 to behave like a TTS / speaker-LoRA system. LTX-2.3 is an audio-video model, and the official training docs describe Audio-Video LoRA as a LoRA that can affect both video and audio output.\n\n## Short answer\n\nI would try this order:\n\n  1. First make a normal short **Audio-Video LoRA** work.\n  2. Use real temporal video frames, not `num_frames: 1`.\n  3. Preprocess with audio enabled and verify the decoded audio latents before training long runs.\n  4. Use a non-empty trigger word.\n  5. Put the exact transcript, voice style, and sound description in the captions.\n  6. Check that the inference workflow actually loads the audio-related LoRA keys.\n  7. Only after that works, experiment with making the training more voice-focused.\n\n\n\nIf your practical goal is simply “I want this character to speak with a consistent voice,” also look at **ID-LoRA Reference Audio** as a related alternative. That is not the same as training your own AV-LoRA, but it may solve the consistent-voice use case faster.\n\n## Why I would not start with `num_frames: 1`\n\nI understand why you set it that way: you want to isolate the voice or sound and avoid learning the visual character yet.\n\nBut for LTX-2.3, I think `num_frames: 1` is suspicious as a first baseline.\n\nThe LTX-2.3 model card describes LTX-2.3 as a DiT-based **audio-video foundation model** designed to generate synchronized video and audio within a single model. 
The LTX-2 repository also describes LTX-2 as an audio-video model for synchronized audio and video generation.\n\nThe LTX-2 paper is also useful context: it describes LTX-2 as a dual-stream audio-video model, with video and audio streams connected by bidirectional audio-video cross-attention. In other words, the model is not just a voice model with a video model attached afterward.\n\nSo I would not remove almost all temporal video information for the first test. You may be removing the audio-video relationship that the model expects to learn.\n\nIn the dataset preparation docs, `F=1` is mainly discussed in the image-dataset path, while video buckets are described as `width × height × frames`. For video, the frame count has to follow the LTX VAE constraints. The docs list the frame rule as:\n\n\n    frames % 8 == 1\n\n\nSo for short AV-LoRA tests I would start with something like:\n\n\n    512x512x49\n    512x512x73\n    512x512x89\n    576x576x89\n\n\nnot `1` frame.\n\nI am not saying audio-focused experiments are impossible. 
I am saying I would first make a standard short Audio-Video LoRA work, then try to bias it toward voice/audio.\n\n## Treat it as Audio-Video LoRA first\n\nThe official Training Modes / Audio-Video LoRA docs say that LTX-2 supports joint audio-video generation and that you can train LoRA adapters that affect both video and audio output.\n\nThe same docs show the important pieces:\n\n\n    model:\n      training_mode: \"lora\"\n\n    training_strategy:\n      name: \"text_to_video\"\n      with_audio: true\n\n    data:\n      audio_latents_dir: \"audio_latents\"\n\n\nThe key idea is: enabling audio is not just “turn on voice.” The dataset must actually include preprocessed audio latents, and the LoRA target modules need to include audio and cross-modal branches.\n\nThe docs also warn that for audio-video LoRAs, `target_modules` should capture:\n\n  * video attention modules\n  * audio attention modules\n  * audio-to-video attention modules\n  * video-to-audio attention modules\n\n\n\nThat is why they recommend broader patterns like:\n\n\n    target_modules:\n      - \"to_k\"\n      - \"to_q\"\n      - \"to_v\"\n      - \"to_out.0\"\n\n\ninstead of overly narrow patterns such as `attn1.to_k`.\n\nThe configuration reference is also worth reading for this, because it explains that LTX-2 has video-only modules, audio-only modules, and audio-video cross-attention modules. 
For AV-LoRA, I would verify that the training config is actually touching the audio and cross-modal parts.\n\nI would compare your config against the official ltx2_av_lora.yaml.\n\n## Separate the runtime error from the training design\n\nThe `Background writer channel closed` error may be a separate issue from the LoRA recipe.\n\nThere is a Hugging Face Xet issue about OS-level I/O errors, such as disk-full conditions, surfacing as a generic error like:\n\n\n    RuntimeError: Data processing error: File reconstruction error: Internal Writer Error: Background writer channel closed\n\n\nSee huggingface/xet-core #763.\n\nSo I would debug two things separately:\n\n  1. **Runtime / cache / disk / download / I/O issue**\n  2. **Audio-Video LoRA training recipe issue**\n\n\n\nFor the runtime side, I would check:\n\n\n    df -h\n    du -sh ~/.cache/huggingface || true\n    du -sh /workspace || true\n    du -sh ./output || true\n\n\nAlso check Hugging Face cache location. The Hugging Face cache docs explain the hub cache layout and environment variables such as `HF_HOME` / `HF_HUB_CACHE`.\n\nIf you suspect Xet/caching issues, it may be worth testing with:\n\n\n    export HF_HUB_DISABLE_XET=1\n\n\nBut I would treat that as runtime debugging, not as proof that the LoRA method itself is wrong.\n\n## Preprocess checks I would do before any long run\n\nBefore training for thousands of steps, I would first verify the preprocessed dataset.\n\nThe LTX dataset preparation docs mention audio preprocessing with `--with-audio`. For AV-LoRA, make sure the dataset really has:\n\n\n    latents/\n    conditions/\n    audio_latents/\n    captions/\n\n\nI would also use the decode/debug path from the same docs. 
The docs describe `--decode`, which saves decoded video and, when audio preprocessing is enabled, decoded audio under something like:\n\n\n    .precomputed/decoded_audio\n\n\nThat is a very useful check.\n\nIf the decoded precomputed audio already sounds bad, then the problem is probably preprocessing, source files, cache, or audio latents — not LoRA learning.\n\nAlso, if you change model checkpoint, resolution bucket, text encoder, trigger word, or preprocessing parameters, rerun preprocessing with overwrite. The docs mention that changing preprocessing settings without `--overwrite` can leave stale cached outputs.\n\nSomething like this is the kind of check I would want before a long training run:\n\n\n    # Pseudocode / adapt paths to your trainer setup\n    python process_dataset.py \\\n      --input_dir <dataset_dir> \\\n      --output_dir <precomputed_dir> \\\n      --resolution_buckets 512x512x49 512x512x89 \\\n      --with-audio \\\n      --decode \\\n      --overwrite\n\n\nThen listen to the decoded audio before training.\n\n## Dataset suggestions\n\nFor a first successful AV-LoRA test, I would make the dataset boring and clean.\n\nI would not start with 6–10 second clips if the goal is debugging. I would cut some clips down to around 3–5 seconds, ideally one clear spoken line per clip.\n\nRecommended first-pass dataset:\n\nItem | Recommendation\n---|---\nClip length | 3–5 seconds first\nAudio | single speaker, clean, low noise, low reverb\nMusic | avoid music at first\nBackground sound | avoid or describe it explicitly\nVideo | visible face / mouth / speaker motion if it is speech\nFrames | 49 or 89 for first tests\nTrigger | non-empty unique trigger\nCaptions | transcript + voice style + sound description + visual description\n\nExample caption:\n\n\n    <trigger>, a young woman speaks in a soft, calm voice in a quiet indoor room. She looks toward the camera with a neutral expression. 
Speech: \"I think we should start again from the beginning.\" Sounds: clear female speech, quiet room tone, no music.\n\n\nI would avoid an empty trigger word. The dataset preparation docs describe a LoRA trigger token as being prepended to captions and then used in prompts to activate the LoRA. So I would use something unique, for example:\n\n\n    ema_voice\n\n\nor:\n\n\n    ltx_ema_voice\n\n\nThen keep that same trigger in validation prompts.\n\n## Suggested first experiment\n\nI would not start with the full 5000-step run.\n\nI would first do a small sanity test to prove the whole AV path works:\n\n\n    # Not a full config, just the direction I would test first.\n    model:\n      training_mode: \"lora\"\n\n    training_strategy:\n      name: \"text_to_video\"\n      with_audio: true\n\n    data:\n      audio_latents_dir: \"audio_latents\"\n\n    network:\n      type: \"lora\"\n      rank: 32\n      alpha: 32\n      target_modules:\n        - \"to_k\"\n        - \"to_q\"\n        - \"to_v\"\n        - \"to_out.0\"\n\n    train:\n      batch_size: 1\n      gradient_checkpointing: true\n\n    resolution_buckets:\n      - \"512x512x49\"\n      - \"512x512x89\"\n\n\nFor debugging, I would try something like:\n\n\n    small dataset subset\n    300–800 steps\n    same validation prompt\n    same seed\n    save several checkpoints\n    compare audio and video separately\n\n\nThen scale up only after you can confirm:\n\n  * the trainer runs\n  * audio latents decode correctly\n  * the LoRA changes the audio output\n  * the inference workflow loads the audio-related keys\n  * the result is not immediately overcooked\n\n\n\n## Known failure modes worth checking\n\nThere are already some reports that look related, especially around AI Toolkit and LTX audio training.\n\n### 1. 
Good video, poor voice/audio\n\nSee ostris/ai-toolkit #684: the report says LTX-2 LoRA training produced good image/video quality, but the voice/audio became distorted and noisy, even with clean audio and `do audio` enabled.\n\nSo if the video works but audio is bad, that is not necessarily your dataset alone.\n\n### 2. LTX-2.3 LoRA corrupting audio\n\nSee ostris/ai-toolkit #780: the report says the video output is correct after LTX-2.3 LoRA training, but the audio is corrupted with buzzing/noise/distortion, while the base model without LoRA has correct audio.\n\nThat suggests you should test base model audio, LoRA-disabled audio, and LoRA-enabled audio separately.\n\n### 3. Trainer / workflow differences\n\nSee ostris/ai-toolkit #701: this report says the same dataset behaved differently between Musubi and AI Toolkit, with Musubi picking up voice but AI Toolkit not doing so.\n\nSo if the config looks right but the audio is ignored, I would not only blame the dataset. I would also check the trainer and inference path.\n\n### 4. LoRA keys not loaded at inference time\n\nSee bghira/SimpleTuner #2349: there are logs where LTX-2 LoRA keys such as `audio_connector` / `video_connector` keys were not loaded in ComfyUI.\n\nThis is important. 
You can train an AV-LoRA correctly and still get misleading results if your inference workflow does not actually load the audio-related LoRA keys.\n\nAfter loading the LoRA, check logs for things like:\n\n\n    audio_connector\n    video_connector\n    audio_attn\n    video_to_audio_attn\n    audio_to_video_attn\n    lora key not loaded\n\n\n## About “voice-only LoRA”\n\nI would be careful with the term “voice-only LoRA” here.\n\nIf by “voice-only LoRA” you mean:\n\n> I want a reusable speaker identity LoRA, like a TTS speaker LoRA, independent of video.\n\nthen I am not sure that is the easiest or most supported route for LTX-2.3.\n\nIf by “voice-only LoRA” you mean:\n\n> I want the generated character to consistently speak with this kind of voice / tone / sound.\n\nthen I would first try:\n\n  1. normal short Audio-Video LoRA, or\n  2. ID-LoRA Reference Audio, depending on whether you want training or inference-time control.\n\n\n\nFor the actual AV-LoRA training path, I would not try to eliminate the video side at first. I would instead use short, clean audio-video clips and captions that make the audio content explicit.\n\n## Related alternative: ID-LoRA Reference Audio\n\nThis is not the same thing as training your own Audio-Video LoRA, but it may be very relevant to your practical goal.\n\nIf the goal is:\n\n> “I want this character to speak with a consistent voice.”\n\nthen look at ID-LoRA / Reference Audio workflows.\n\nThe ID-LoRA GitHub repo describes using a reference image / first frame, a short reference audio clip, and a text prompt for identity-preserving talking video generation. 
It specifically mentions voice identity transfer from short reference audio and zero-shot inference without per-speaker fine-tuning.\n\nThere is also ID-LoRA-LTX2.3-ComfyUI, which mentions `LTXVReferenceAudio` and reference-audio speaker identity transfer.\n\nThis Kijai / RuneXX Hugging Face discussion is also useful because it describes a ComfyUI workflow using a short reference audio clip, around 5 seconds, for consistent voice.\n\nThat route is different:\n\n  * AV-LoRA training: learn from your dataset into a LoRA.\n  * ID-LoRA Reference Audio: provide a short reference voice at inference time.\n\n\n\nSo I would not replace your whole AV-LoRA experiment with ID-LoRA if your goal is training. But if your real goal is just consistent character voice, ID-LoRA may solve it with less training pain.\n\n## What I would try next\n\nI would probably do this:\n\n### Step 1: Fix / isolate the runtime error\n\nCheck disk, cache, and Xet/HF download behavior.\n\n\n    df -h\n    du -sh ~/.cache/huggingface || true\n    du -sh /workspace || true\n    du -sh ./output || true\n\n\nIf needed, test:\n\n\n    export HF_HUB_DISABLE_XET=1\n\n\n### Step 2: Make a tiny AV dataset\n\nUse maybe 5–10 clips first.\n\n\n    3–5 sec each\n    clean single-speaker audio\n    visible face/mouth if speech\n    no music\n    no heavy background noise\n\n\n### Step 3: Use normal temporal buckets\n\nDo not use `num_frames: 1` for the first baseline.\n\nTry:\n\n\n    512x512x49\n    512x512x89\n\n\n### Step 4: Preprocess with audio and decode\n\nMake sure `audio_latents/` exists.\n\nThen decode and listen to the decoded audio latents.\n\n### Step 5: Use transcript-rich captions\n\nExample:\n\n\n    <trigger>, a young woman speaks in a soft calm voice in a quiet indoor room. 
Speech: \"I think we should start again from the beginning.\" Sounds: clear female speech, quiet room tone, no music.\n\n\n### Step 6: Train short first\n\nDo not spend 5000 steps before proving the setup.\n\nTry a shorter run first:\n\n\n    300–800 steps\n    same validation prompt\n    same seed\n    save several checkpoints\n\n\n### Step 7: Verify inference loading\n\nCheck whether audio-related LoRA keys are loaded.\n\nIf the loader ignores the audio branch, the generated audio may not tell you what the training actually learned.\n\n## Useful links\n\nCore LTX links:\n\n  * LTX-2.3 model card\n  * LTX-2 official GitHub repo\n  * LTX-2 paper\n  * Official LTX-2 Trainer docs\n  * Training Modes / Audio-Video LoRA\n  * Configuration Reference\n  * Dataset Preparation\n  * ltx2_av_lora.yaml\n\n\n\nDebug / failure-mode links:\n\n  * Original HF forum thread\n  * AI Toolkit #684: good video but poor voice\n  * AI Toolkit #701: Musubi picked up voice, AI Toolkit did not\n  * AI Toolkit #780: LTX-2.3 LoRA audio noise\n  * SimpleTuner #2349: LTX LoRA keys not loaded in ComfyUI\n  * Hugging Face Xet #763: Background writer channel closed can hide OS-level I/O errors\n  * Hugging Face hub cache docs\n\n\n\nRelated alternative:\n\n  * ID-LoRA GitHub\n  * ID-LoRA-LTX2.3-ComfyUI\n  * Kijai / RuneXX LTX2.3 ComfyUI reference audio workflow discussion\n\n\n\n## TL;DR\n\nI would not start by trying to train a “voice-only” LoRA with `num_frames: 1`.\n\nI would first make a normal short **Audio-Video LoRA** work:\n\n\n    real video frames\n    with_audio: true\n    audio_latents/\n    non-empty trigger\n    transcript-rich captions\n    decoded audio-latent verification\n    audio/video/cross-modal target modules\n    inference log checks\n\n\nThen, after that baseline works, experiment with making it more voice-focused.\n\nAnd if the practical goal is simply consistent character voice, I would also test **ID-LoRA Reference Audio**, because it may solve that use case without needing to train a separate voice-only LoRA.",
  "title": "Training lora for LTX2.3 voice / sound only"
}