Training lora for LTX2.3 voice / sound only

Hugging Face Forums [Unofficial] May 26, 2026

Maybe something like this would work:


I think I would first reframe this as an Audio-Video LoRA problem, not a pure “voice-only LoRA” problem.

That does not mean your goal is impossible. It just means I would avoid starting from num_frames: 1 and expecting LTX-2.3 to behave like a TTS / speaker-LoRA system. LTX-2.3 is an audio-video model, and the official training docs describe Audio-Video LoRA as a LoRA that can affect both video and audio output.

Short answer

I would try this order:

  1. First make a normal short Audio-Video LoRA work.
  2. Use real temporal video frames, not num_frames: 1.
  3. Preprocess with audio enabled and verify the decoded audio latents before training long runs.
  4. Use a non-empty trigger word.
  5. Put the exact transcript, voice style, and sound description in the captions.
  6. Check that the inference workflow actually loads the audio-related LoRA keys.
  7. Only after that works, experiment with making the training more voice-focused.

If your practical goal is simply “I want this character to speak with a consistent voice,” also look at ID-LoRA Reference Audio as a related alternative. That is not the same as training your own AV-LoRA, but it may solve the consistent-voice use case faster.

Why I would not start with num_frames: 1

I understand why you set it that way: you want to isolate the voice or sound and avoid learning the visual character yet.

But for LTX-2.3, I think num_frames: 1 is suspicious as a first baseline.

The LTX-2.3 model card describes LTX-2.3 as a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model. The LTX-2 repository also describes LTX-2 as an audio-video model for synchronized audio and video generation.

The LTX-2 paper is also useful context: it describes LTX-2 as a dual-stream audio-video model, with video and audio streams connected by bidirectional audio-video cross-attention. In other words, the model is not just a voice model with a video model attached afterward.

So I would not remove almost all temporal video information for the first test. You may be removing the audio-video relationship that the model expects to learn.

In the dataset preparation docs, F=1 is mainly discussed in the image-dataset path, while video buckets are described as width × height × frames. For video, the frame count has to follow the LTX VAE constraints. The docs list the frame rule as:

frames % 8 == 1

So for short AV-LoRA tests I would start with something like:

512x512x49
512x512x73
512x512x89
576x576x89

not 1 frame.
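
The frame rule is easy to check mechanically before preprocessing. Here is a minimal Python sketch that assumes only the frames % 8 == 1 rule quoted above; the helper names are mine and are not part of the LTX trainer:

```python
def is_valid_frame_count(frames: int) -> bool:
    """True if `frames` satisfies the documented rule frames % 8 == 1."""
    return frames >= 1 and frames % 8 == 1


def snap_frame_count(frames: int) -> int:
    """Largest valid frame count that does not exceed `frames`."""
    if frames < 1:
        raise ValueError("frames must be >= 1")
    return frames - ((frames - 1) % 8)


def parse_bucket(bucket: str) -> tuple[int, int, int]:
    """Split a 'WIDTHxHEIGHTxFRAMES' string and validate the frame count."""
    width, height, frames = (int(part) for part in bucket.lower().split("x"))
    if not is_valid_frame_count(frames):
        raise ValueError(f"{bucket}: frames % 8 must equal 1")
    return width, height, frames
```

Note that 1 frame also passes this rule (1 % 8 == 1), so the rule by itself does not reject num_frames: 1. The reason to avoid it for a first baseline is the training design, not the VAE constraint.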

I am not saying audio-focused experiments are impossible. I am saying I would first make a standard short Audio-Video LoRA work, then try to bias it toward voice/audio.

Treat it as Audio-Video LoRA first

The official Training Modes / Audio-Video LoRA docs say that LTX-2 supports joint audio-video generation and that you can train LoRA adapters that affect both video and audio output.

The same docs show the important pieces:

model:
  training_mode: "lora"

training_strategy:
  name: "text_to_video"
  with_audio: true

data:
  audio_latents_dir: "audio_latents"

The key idea is: enabling audio is not just “turn on voice.” The dataset must actually include preprocessed audio latents, and the LoRA target modules need to include audio and cross-modal branches.

The docs also warn that for audio-video LoRAs, target_modules should capture:

  • video attention modules
  • audio attention modules
  • audio-to-video attention modules
  • video-to-audio attention modules

That is why they recommend broader patterns like:

target_modules:
  - "to_k"
  - "to_q"
  - "to_v"
  - "to_out.0"

instead of overly narrow patterns such as attn1.to_k.

The configuration reference is also worth reading for this, because it explains that LTX-2 has video-only modules, audio-only modules, and audio-video cross-attention modules. For AV-LoRA, I would verify that the training config is actually touching the audio and cross-modal parts.
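
To see why a narrow pattern can silently skip the audio branches, here is a small Python sketch. It assumes suffix-style matching on dotted module names, similar to what PEFT does for a list of target_modules; the trainer's actual matching may differ, and the module names below are hypothetical placeholders rather than real LTX-2 module names:

```python
def matches(module_name: str, pattern: str) -> bool:
    """Suffix match on dotted names: the whole name, or a '.<pattern>' suffix."""
    return module_name == pattern or module_name.endswith("." + pattern)


def select_modules(module_names, patterns):
    """Return the module names matched by at least one pattern."""
    return [n for n in module_names if any(matches(n, p) for p in patterns)]


# Hypothetical module names, for illustration only.
MODULES = [
    "blocks.0.attn1.to_k",
    "blocks.0.audio_attn1.to_k",
    "blocks.0.audio_to_video_attn.to_k",
    "blocks.0.video_to_audio_attn.to_k",
]

narrow = select_modules(MODULES, ["attn1.to_k"])
broad = select_modules(MODULES, ["to_k"])
print(len(narrow), len(broad))  # prints: 1 4
```

Under these assumptions the narrow pattern only reaches the video self-attention projection, while the broad pattern reaches all four branches. Running the same kind of check against the real module names of your checkpoint would tell you what your config actually touches.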

I would compare your config against the official ltx2_av_lora.yaml.

Separate the runtime error from the training design

The Background writer channel closed error may be a separate issue from the LoRA recipe.

There is a Hugging Face Xet issue about OS-level I/O errors, such as disk-full conditions, surfacing as a generic error like:

RuntimeError: Data processing error: File reconstruction error: Internal Writer Error: Background writer channel closed

See huggingface/xet-core #763.

So I would debug two things separately:

  1. Runtime / cache / disk / download / I/O issue
  2. Audio-Video LoRA training recipe issue

For the runtime side, I would check:

df -h
du -sh ~/.cache/huggingface || true
du -sh /workspace || true
du -sh ./output || true

Also check the Hugging Face cache location. The Hugging Face cache docs explain the hub cache layout and environment variables such as HF_HOME / HF_HUB_CACHE.

If you suspect Xet/caching issues, it may be worth testing with:

export HF_HUB_DISABLE_XET=1

But I would treat that as runtime debugging, not as proof that the LoRA method itself is wrong.
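
If you prefer to run the disk check from Python, for example at the top of a training script, a rough sketch could look like this. It resolves the hub cache from HF_HUB_CACHE, then HF_HOME, then the default location, and deliberately ignores other variables such as XDG_CACHE_HOME; the function names are my own:

```python
import os
import shutil
from pathlib import Path


def resolve_hub_cache(env=None) -> Path:
    """Resolve the hub cache dir: HF_HUB_CACHE, then HF_HOME/hub, then the default."""
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]).expanduser() / "hub"
    return Path("~/.cache/huggingface/hub").expanduser()


def free_gib(path) -> float:
    """Free GiB on the filesystem holding `path` (or its nearest existing parent)."""
    probe = Path(path)
    while not probe.exists() and probe != probe.parent:
        probe = probe.parent
    return shutil.disk_usage(probe).free / 2**30


if __name__ == "__main__":
    cache = resolve_hub_cache()
    print(f"hub cache: {cache}  free: {free_gib(cache):.1f} GiB")
    print(f"output dir free: {free_gib('./output'):.1f} GiB")
```

If the cache and the output directory sit on different filesystems, check both, because either one filling up could plausibly surface as the generic writer error.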

Preprocess checks I would do before any long run

Before training for thousands of steps, I would first verify the preprocessed dataset.

The LTX dataset preparation docs mention audio preprocessing with --with-audio. For AV-LoRA, make sure the dataset really has:

latents/
conditions/
audio_latents/
captions/
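
A quick way to confirm this layout is a small Python check like the one below. It is only a sketch: it assumes the four subdirectory names listed above and that each sample uses the same relative file stem under latents/ and audio_latents/, which you should verify against your own precomputed folder:

```python
from pathlib import Path

REQUIRED_DIRS = ("latents", "conditions", "audio_latents", "captions")


def _stems(folder: Path) -> set:
    """Relative file paths under `folder`, without the file extension."""
    if not folder.is_dir():
        return set()
    return {
        p.relative_to(folder).with_suffix("").as_posix()
        for p in folder.rglob("*")
        if p.is_file()
    }


def check_precomputed(root) -> dict:
    """Report missing subdirectories and video latents without an audio latent."""
    root = Path(root)
    missing = [name for name in REQUIRED_DIRS if not (root / name).is_dir()]
    without_audio = sorted(_stems(root / "latents") - _stems(root / "audio_latents"))
    return {"missing_dirs": missing, "samples_without_audio": without_audio}
```

An empty samples_without_audio list does not prove the audio latents are good, only that they exist for every video latent.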

I would also use the decode/debug path from the same docs. The docs describe --decode, which saves decoded video and, when audio preprocessing is enabled, decoded audio under something like:

.precomputed/decoded_audio

That is a very useful check.

If the decoded precomputed audio already sounds bad, then the problem is probably preprocessing, source files, cache, or audio latents, not LoRA learning.

Also, if you change the model checkpoint, resolution bucket, text encoder, trigger word, or preprocessing parameters, rerun preprocessing with --overwrite. The docs mention that changing preprocessing settings without --overwrite can leave stale cached outputs.

Something like this is the kind of check I would want before a long training run:

# Pseudocode: adapt the script location, argument names, and paths to your trainer setup.
python process_dataset.py \
  --input_dir <dataset_dir> \
  --output_dir <precomputed_dir> \
  --resolution_buckets 512x512x49 512x512x89 \
  --with-audio \
  --decode \
  --overwrite

Then listen to the decoded audio before training.
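
Listening is the real test, but a rough automated screen can catch silent or clipped files early. The Python sketch below assumes the decoded audio is stored as 16-bit PCM WAV, which I have not verified for the trainer's output; the thresholds are arbitrary starting points:

```python
import math
import struct
import wave


def wav_stats(path) -> dict:
    """Duration in seconds plus peak and RMS (both 0..1) of a 16-bit PCM WAV."""
    with wave.open(str(path), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        rate = wav.getframerate()
        frames = wav.getnframes()
        raw = wav.readframes(frames)
    samples = struct.unpack(f"<{len(raw) // 2}h", raw)
    if not samples:
        return {"seconds": 0.0, "peak": 0.0, "rms": 0.0}
    peak = max(abs(s) for s in samples) / 32768
    rms = math.sqrt(sum(s * s for s in samples) / len(samples)) / 32768
    return {"seconds": frames / rate, "peak": peak, "rms": rms}


def looks_suspicious(stats, min_rms=0.005, max_peak=0.999) -> bool:
    """Flag near-silent or clipped audio; thresholds are arbitrary assumptions."""
    return stats["rms"] < min_rms or stats["peak"] >= max_peak
```

A file that passes this screen can still sound wrong, so treat it as a filter for obvious failures before the listening pass.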

Dataset suggestions

For a first successful AV-LoRA test, I would make the dataset boring and clean.

I would not start with 6–10 second clips if the goal is debugging. I would cut some clips down to around 3–5 seconds, ideally one clear spoken line per clip.

Recommended first-pass dataset:

Item               Recommendation
Clip length        3–5 seconds first
Audio              single speaker, clean, low noise, low reverb
Music              avoid music at first
Background sound   avoid or describe it explicitly
Video              visible face / mouth / speaker motion if it is speech
Frames             49 or 89 for first tests
Trigger            non-empty unique trigger
Captions           transcript + voice style + sound description + visual description
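
Clip length and frame count are linked through the frame rate, so it is worth checking that each clip can actually fill the bucket you pick. A small Python sketch, where the 24 fps in the usage note is only an assumed example; use the frame rate of your own clips:

```python
from typing import Optional


def bucket_seconds(frames: int, fps: float) -> float:
    """Seconds of video covered by `frames` frames at `fps`."""
    return frames / fps


def largest_bucket_that_fits(
    clip_seconds: float, fps: float, candidates=(49, 73, 89)
) -> Optional[int]:
    """Largest candidate frame count the clip can fill, or None if none fits."""
    available = int(clip_seconds * fps)
    fitting = [frames for frames in candidates if frames <= available]
    return max(fitting) if fitting else None
```

With an assumed 24 fps, largest_bucket_that_fits(3.0, 24.0) gives 49 and largest_bucket_that_fits(4.0, 24.0) gives 89, so a 3-second clip would not fill an 89-frame bucket at that rate.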

Example caption:

<trigger>, a young woman speaks in a soft, calm voice in a quiet indoor room. She looks toward the camera with a neutral expression. Speech: "I think we should start again from the beginning." Sounds: clear female speech, quiet room tone, no music.

I would avoid an empty trigger word. The dataset preparation docs describe a LoRA trigger token as being prepended to captions and then used in prompts to activate the LoRA. So I would use something unique, for example:

ema_voice

or:

ltx_ema_voice

Then keep that same trigger in validation prompts.
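
To keep captions and validation prompts consistent, a tiny helper can assemble them in one fixed layout. This is just a sketch that follows the example caption above; the function names are mine. If your preprocessing step already prepends the trigger to captions, use it only for validation prompts so the trigger is not doubled:

```python
def build_caption(trigger: str, visual: str, transcript: str, sounds: str) -> str:
    """Assemble '<trigger>, <visual>. Speech: "<transcript>" Sounds: <sounds>.'"""
    trigger = trigger.strip()
    if not trigger:
        raise ValueError("trigger must not be empty")
    visual = visual.strip().rstrip(".")
    sounds = sounds.strip().rstrip(".")
    return f'{trigger}, {visual}. Speech: "{transcript.strip()}" Sounds: {sounds}.'


def missing_trigger(prompts, trigger: str):
    """Return the prompts that do not start with the trigger."""
    return [p for p in prompts if not p.startswith(trigger)]
```

Running missing_trigger over your validation prompts before a run is a cheap way to catch a prompt that would never activate the LoRA.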

Suggested first experiment

I would not start with the full 5000-step run.

I would first do a small sanity test to prove the whole AV path works:

# Not a full config, just the direction I would test first.
model:
  training_mode: "lora"

training_strategy:
  name: "text_to_video"
  with_audio: true

data:
  audio_latents_dir: "audio_latents"

network:
  type: "lora"
  rank: 32
  alpha: 32
  target_modules:
    - "to_k"
    - "to_q"
    - "to_v"
    - "to_out.0"

train:
  batch_size: 1
  gradient_checkpointing: true

resolution_buckets:
  - "512x512x49"
  - "512x512x89"
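
For context on rank and alpha: in the standard LoRA formulation the low-rank update is scaled by alpha / rank, so rank 32 with alpha 32 gives a scale of 1.0. The Python sketch below assumes that standard formulation (whether the trainer scales the same way is an assumption), and the 4096 feature size in the usage note is a hypothetical layer size, not a number from LTX-2:

```python
def lora_scale(rank: int, alpha: float) -> float:
    """Multiplier on the low-rank update in standard LoRA: alpha / rank."""
    return alpha / rank


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Parameters added to one linear layer: A is rank x in, B is out x rank."""
    return rank * (in_features + out_features)
```

For example, lora_scale(32, 32) is 1.0, and lora_params(4096, 4096, 32) is 262144 per adapted projection. Broad target_modules patterns multiply that count across every matched module, which is one reason a short sanity run is cheaper than it sounds.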

For debugging, I would try something like:

small dataset subset
300–800 steps
same validation prompt
same seed
save several checkpoints
compare audio and video separately

Then scale up only after you can confirm:

  • the trainer runs
  • audio latents decode correctly
  • the LoRA changes the audio output
  • the inference workflow loads the audio-related keys
  • the result is not immediately overcooked
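
The "same prompt, same seed, several checkpoints" idea can be written down as a small plan so that nothing varies by accident. A Python sketch with made-up helper names; it only builds the list of comparison jobs and does not call any trainer or inference API:

```python
def checkpoint_steps(total_steps: int, every: int):
    """Steps at which a checkpoint is saved, e.g. every 200 up to 800."""
    return list(range(every, total_steps + 1, every))


def validation_plan(steps, prompt: str, seed: int):
    """One no-LoRA baseline job plus one job per checkpoint, same prompt and seed."""
    jobs = [{"lora_step": None, "prompt": prompt, "seed": seed}]
    jobs += [{"lora_step": step, "prompt": prompt, "seed": seed} for step in steps]
    return jobs
```

The baseline job with lora_step set to None matters: without it you cannot tell whether a noisy result comes from the LoRA or was already present in the base model output.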

Known failure modes worth checking

There are already some reports that look related, especially around AI Toolkit and LTX audio training.

1. Good video, poor voice/audio

See ostris/ai-toolkit #684: the report says LTX-2 LoRA training produced good image/video quality, but the voice/audio became distorted and noisy, even with clean source audio and the "do audio" option enabled.

So if the video works but audio is bad, that is not necessarily your dataset alone.

2. LTX-2.3 LoRA corrupting audio

See ostris/ai-toolkit #780: the report says the video output is correct after LTX-2.3 LoRA training, but the audio is corrupted with buzzing/noise/distortion, while the base model without LoRA has correct audio.

That suggests you should test base model audio, LoRA-disabled audio, and LoRA-enabled audio separately.

3. Trainer / workflow differences

See ostris/ai-toolkit #701: this report says the same dataset behaved differently between Musubi and AI Toolkit, with Musubi picking up voice but AI Toolkit not doing so.

So if the config looks right but the audio is ignored, I would not only blame the dataset. I would also check the trainer and inference path.

4. LoRA keys not loaded at inference time

See bghira/SimpleTuner #2349: there are logs where LTX-2 LoRA keys such as audio_connector / video_connector keys were not loaded in ComfyUI.

This is important. You can train an AV-LoRA correctly and still get misleading results if your inference workflow does not actually load the audio-related LoRA keys.

After loading the LoRA, check logs for things like:

audio_connector
video_connector
audio_attn
video_to_audio_attn
audio_to_video_attn
lora key not loaded
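
It can also help to confirm that the LoRA file itself contains audio and cross-modal weights, independent of what the loader logs. The Python sketch below reads tensor names straight from a .safetensors header (an 8-byte little-endian length followed by a JSON header) using only the standard library. The substring-based grouping is an assumption about key naming and may need adjusting for your trainer's output:

```python
import json
import struct
from collections import Counter


def read_safetensors_keys(path):
    """Tensor names from a .safetensors header (8-byte LE length, then JSON)."""
    with open(path, "rb") as handle:
        (header_len,) = struct.unpack("<Q", handle.read(8))
        header = json.loads(handle.read(header_len))
    return [name for name in header if name != "__metadata__"]


def classify(key: str) -> str:
    """Group a key by branch using substring checks (naming is an assumption)."""
    if "audio_to_video" in key or "video_to_audio" in key:
        return "cross_modal"
    if "audio" in key:
        return "audio"
    return "video_or_other"


def branch_counts(keys) -> Counter:
    """Count keys per branch."""
    return Counter(classify(key) for key in keys)
```

If branch_counts reports zero audio and cross_modal keys, the problem is on the training side (target modules or export), and no inference workflow will recover an audio effect from that file.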

About “voice-only LoRA”

I would be careful with the term “voice-only LoRA” here.

If by “voice-only LoRA” you mean:

I want a reusable speaker identity LoRA, like a TTS speaker LoRA, independent of video.

then I am not sure that is the easiest or most supported route for LTX-2.3.

If by “voice-only LoRA” you mean:

I want the generated character to consistently speak with this kind of voice / tone / sound.

then I would first try:

  1. normal short Audio-Video LoRA, or
  2. ID-LoRA Reference Audio, depending on whether you want training or inference-time control.

For the actual AV-LoRA training path, I would not try to eliminate the video side at first. I would instead use short, clean audio-video clips and captions that make the audio content explicit.

Related alternative: ID-LoRA Reference Audio

This is not the same thing as training your own Audio-Video LoRA, but it may be very relevant to your practical goal.

If the goal is:

“I want this character to speak with a consistent voice.”

then look at ID-LoRA / Reference Audio workflows.

The ID-LoRA GitHub repo describes using a reference image / first frame, a short reference audio clip, and a text prompt for identity-preserving talking video generation. It specifically mentions voice identity transfer from short reference audio and zero-shot inference without per-speaker fine-tuning.

There is also ID-LoRA-LTX2.3-ComfyUI, which mentions LTXVReferenceAudio and reference-audio speaker identity transfer.

This Kijai / RuneXX Hugging Face discussion is also useful because it describes a ComfyUI workflow using a short reference audio clip, around 5 seconds, for consistent voice.

That route is different:

  • AV-LoRA training: learn from your dataset into a LoRA.
  • ID-LoRA Reference Audio: provide a short reference voice at inference time.

So I would not replace your whole AV-LoRA experiment with ID-LoRA if your goal is training. But if your real goal is just consistent character voice, ID-LoRA may solve it with less training pain.

What I would try next

I would probably do this:

Step 1: Fix / isolate the runtime error

Check disk, cache, and Xet/HF download behavior.

df -h
du -sh ~/.cache/huggingface || true
du -sh /workspace || true
du -sh ./output || true

If needed, test:

export HF_HUB_DISABLE_XET=1

Step 2: Make a tiny AV dataset

Use maybe 5–10 clips first.

3–5 sec each
clean single-speaker audio
visible face/mouth if speech
no music
no heavy background noise

Step 3: Use normal temporal buckets

Do not use num_frames: 1 for the first baseline.

Try:

512x512x49
512x512x89

Step 4: Preprocess with audio and decode

Make sure audio_latents/ exists.

Then decode and listen to the decoded audio latents.

Step 5: Use transcript-rich captions

Example:

<trigger>, a young woman speaks in a soft calm voice in a quiet indoor room. Speech: "I think we should start again from the beginning." Sounds: clear female speech, quiet room tone, no music.

Step 6: Train short first

Do not spend 5000 steps before proving the setup.

Try a shorter run first:

300–800 steps
same validation prompt
same seed
save several checkpoints

Step 7: Verify inference loading

Check whether audio-related LoRA keys are loaded.

If the loader ignores the audio branch, the generated audio may not tell you what the training actually learned.

Useful links

Core LTX links:

  • LTX-2.3 model card
  • LTX-2 official GitHub repo
  • LTX-2 paper
  • Official LTX-2 Trainer docs
  • Training Modes / Audio-Video LoRA
  • Configuration Reference
  • Dataset Preparation
  • ltx2_av_lora.yaml

Debug / failure-mode links:

  • Original HF forum thread
  • AI Toolkit #684: good video but poor voice
  • AI Toolkit #701: Musubi picked up voice, AI Toolkit did not
  • AI Toolkit #780: LTX-2.3 LoRA audio noise
  • SimpleTuner #2349: LTX LoRA keys not loaded in ComfyUI
  • Hugging Face Xet #763: Background writer channel closed can hide OS-level I/O errors
  • Hugging Face hub cache docs

Related alternative:

  • ID-LoRA GitHub
  • ID-LoRA-LTX2.3-ComfyUI
  • Kijai / RuneXX LTX2.3 ComfyUI reference audio workflow discussion

TL;DR

I would not start by trying to train a “voice-only” LoRA with num_frames: 1.

I would first make a normal short Audio-Video LoRA work:

real video frames
with_audio: true
audio_latents/
non-empty trigger
transcript-rich captions
decoded audio-latent verification
audio/video/cross-modal target modules
inference log checks

Then, after that baseline works, experiment with making it more voice-focused.

And if the practical goal is simply consistent character voice, I would also test ID-LoRA Reference Audio, because it may solve that use case without needing to train a separate voice-only LoRA.
