Ltx 2.3 problem with dialogs

Hugging Face Forums [Unofficial] June 11, 2026

Ah, so that is the kind of “accent” you mean. That clarification helps. Whether this is controllable depends quite a lot on LTX itself, but I think the practical answer is roughly as follows. Also, the fact that Suno, LTX, and other models have different “dialects” for this kind of control is actually a pretty fundamental issue:


Short answer

If the goal is correct Russian word stress, I would treat this less as a normal “negative prompt” problem and more as a pronunciation-control problem.

For LTX, I would split the answer into two levels, text-only routes and audio-first routes:

| Route | My rough expectation |
| --- | --- |
| Text prompt only | Worth testing, but probably unreliable for exact Russian stress |
| Text prompt with stress mark + syllable hint + explicit pronunciation instruction | Better, but still not guaranteed |
| Generate/record correct Russian audio first, then use LTX audio-to-video / image-audio-to-video / custom-audio I2V | Much more promising |
| Final mux/post-production with verified audio | Most reliable for pronunciation correctness |

So the practical summary is:

Do not treat capital letters as a universal pronunciation API. For LTX text-only prompting, try stress marks, syllable hints, and explicit delivery instructions. But if Russian stress must be correct, make the correct audio first and use LTX as an audio-driven video generator.

That audio-first route is not just theoretical. LTX has official audio-to-video API support, and the LTX/ComfyUI community already has workflows around custom audio, image-audio-to-video, Qwen/Fish-style TTS, and voice-driven talking video.

Why Suno and LTX can behave differently

This is the core issue.

Different generative models learn different informal “control languages.”

Suno is strongly connected to music, lyrics, song structure, vocal emphasis, line breaks, and lyric formatting. In that world, things like:

ALL CAPS
line breaks
repeated syllables
elongated words
[spoken]
[whispered]
[chorus]

can become useful signals because the model has likely seen many examples where typography and lyric formatting correlate with vocal delivery.

LTX is different. LTX-2.3 is described as a diffusion-based audio-video foundation model that generates synchronized video and audio in a single model. It is not simply a lyrics-to-song model or a normal standalone TTS engine. Its speech is entangled with:

  • the person on screen
  • mouth movement
  • facial expression
  • camera motion
  • scene timing
  • environment
  • emotion
  • ambient sound
  • video/audio synchronization

So the same trick can work differently:

| Model family | Likely stronger signal |
| --- | --- |
| Song / lyric model | Typography, line breaks, lyric structure, repeated syllables |
| Ordinary TTS | Text normalization, lexicon, SSML, phonemes, voice settings |
| LTX-style audio-video model | Scene description, quoted dialogue, audio prompt, timing, character action, reference image/audio |
| Lip-sync model | Input audio waveform, face crop/identity, mouth motion constraints |

That is why this is not just a “bad prompt” issue. The model may simply not have learned the same convention that Suno learned.

Russian stress is especially difficult

Russian word stress is not a simple typographic effect.

In normal Russian writing, stress marks are usually omitted. The model often sees a plain word and must infer where the stress should be. That can depend on lexical knowledge, word form, context, and training data coverage.

For example, learning materials may write stress like this:

предста́вь
вообрази́

But ordinary text usually does not include those marks. So if the model sees:

представь

it has to know the pronunciation from memory. If it does not know it reliably, capitalization may not fix the problem.

This is very different from simply saying:

Say this word louder.

Russian lexical stress is closer to:

Pronounce this specific word with the correct stressed vowel.

That is why a text-only video model may be inconsistent: sometimes it knows the word, sometimes it guesses, sometimes it follows visual/timing constraints more strongly than the intended stress hint.
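To make the stress mark concrete: it is just the Unicode combining acute accent (U+0301) placed after the stressed vowel. Here is a minimal Python sketch for adding and stripping it. The function names `mark_stress` and `strip_stress` are my own for illustration, and you still have to know which vowel is stressed; the code does not look that up.

```python
COMBINING_ACUTE = "\u0301"  # combining accent used as a stress mark in learning materials
RUSSIAN_VOWELS = set("аеёиоуыэюяАЕЁИОУЫЭЮЯ")


def mark_stress(word: str, vowel_number: int) -> str:
    """Return `word` with a stress mark on its n-th vowel (counting from 1)."""
    seen = 0
    for i, ch in enumerate(word):
        if ch in RUSSIAN_VOWELS:
            seen += 1
            if seen == vowel_number:
                # The combining accent goes directly after the vowel it marks.
                return word[: i + 1] + COMBINING_ACUTE + word[i + 1:]
    raise ValueError(f"{word!r} has fewer than {vowel_number} vowels")


def strip_stress(text: str) -> str:
    """Remove stress marks, giving the spelling found in ordinary text."""
    return text.replace(COMBINING_ACUTE, "")


marked = mark_stress("представь", 2)  # stress on the second vowel, "а"
print(marked)
print(strip_stress(marked) == "представь")
```

Stripping the mark gives back exactly the plain spelling, which is the form a model sees most of the time in ordinary text.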

Text-only LTX prompt: what is still worth trying

I would still test text-only prompting, but I would treat it as experimental.

Instead of only doing:

предстАвь

I would try redundant hints:

A close-up of one man in a quiet room, speaking directly to the camera. He speaks Russian slowly and clearly. He says only one word: "предста́вь". The stress is on the vowel "а", pronounced пред-СТАВЬ. The stressed syllable is slightly longer and louder than the others. The audio is crisp close-mic Russian speech with quiet room tone and no music.

This gives the model several different signals:

| Signal type | Example |
| --- | --- |
| Normal Russian spelling | представь |
| Stress mark | предста́вь |
| Syllable/stress hint | пред-СТАВЬ |
| Natural-language explanation | “The stress is on the vowel а” |
| Acoustic explanation | “slightly longer and louder” |
| Delivery instruction | “slowly and clearly” |
| Scene simplification | one speaker, quiet room, close-up |

This may still fail, but it is a more LTX-like prompt than just capitalizing one letter.
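If you want to test several words this way, the redundant hints can be assembled by a small helper. This is only a sketch: `build_stress_prompt` is a hypothetical name, the syllable split is supplied by hand, and the template simply reuses the wording of the prompt above. Nothing here is an LTX API.

```python
def build_stress_prompt(syllables: list[str], stressed: int,
                        marked_word: str, stressed_vowel: str) -> str:
    """Assemble a prompt that repeats the stress hint in several forms.

    syllables      -- the word split by hand, e.g. ["пред", "ставь"]
    stressed       -- index of the stressed syllable (counting from 0)
    marked_word    -- the word written with a stress mark
    stressed_vowel -- the stressed vowel, used in the plain-language hint
    """
    # Upper-case only the stressed syllable: пред-СТАВЬ
    hint = "-".join(
        s.upper() if i == stressed else s for i, s in enumerate(syllables)
    )
    return (
        "A close-up of one man in a quiet room, speaking directly to the camera. "
        "He speaks Russian slowly and clearly. "
        f'He says only one word: "{marked_word}". '
        f'The stress is on the vowel "{stressed_vowel}", pronounced {hint}. '
        "The stressed syllable is slightly longer and louder than the others. "
        "The audio is crisp close-mic Russian speech with quiet room tone and no music."
    )


print(build_stress_prompt(["пред", "ставь"], 1, "предста\u0301вь", "а"))
```

Keeping the template fixed and changing only the word makes it easier to see whether a failure comes from the word itself or from the rest of the prompt.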

Why negative prompt probably will not solve it

I would not start with:

negative prompt: wrong stress, bad Russian pronunciation, incorrect accent

That does not tell the model what the correct pronunciation is.

For this kind of problem, a positive target is more useful:

The stress is on the vowel "а": пред-СТАВЬ.
The stressed syllable is slightly longer and louder.
He says the word slowly and clearly in Russian.

A negative prompt can suppress broad unwanted things. But Russian word stress is not a broad unwanted artifact. It is a specific pronunciation target.

The more promising route: generate the Russian audio first

For exact Russian stress, I would probably move the pronunciation problem out of the LTX text prompt.

A better production-style route is:

Russian-capable TTS / voice clone / human recording
        ↓
verify stress and pronunciation
        ↓
LTX audio-to-video / image-audio-to-video / custom-audio I2V
        ↓
optional mux/post-production with verified audio

This is not just a workaround. It matches LTX’s strengths better.

LTX’s audio-to-video API is explicitly designed to generate video driven by an audio track. The documentation says you can supply dialogue, music, or ambient sound, and the model produces visuals synchronized to the audio. LTX’s Audio-to-Video capability page also describes audio as the primary conditioning signal, where voice, music, and sound drive motion, pacing, and scene structure.

That is exactly the kind of control surface you want when pronunciation matters.

Instead of asking LTX:

Please infer the correct Russian stress from text.

you give it:

Here is the already-correct Russian audio. Generate the visual performance around it.

That is a much stronger signal.

Practical precedent: people are already doing this

There are already LTX community workflows that look very close to this route.

For example:

  • In a Kijai/LTX2.3_comfy discussion, RuneXX shared an I2V & T2V with Custom Audio workflow described as “Use your own audio files with lip sync, and synced motion.”
  • In the same ecosystem, users discuss workflows where they first generate or clone voice with Qwen TTS, then use LTX I2V with custom audio.
  • Comfy has an LTX-2.3 Image Audio to Video workflow where a portrait image and audio file are used to create a lip-synced talking video.
  • There are also LTX/Comfy workflows combining Qwen TTS, Fish Audio, or other TTS/voice-cloning tools with LTX image/audio video generation.

So the route is not only theoretical:

TTS / voice clone / verified recording
        ↓
LTX custom audio / IA2V / A2V
        ↓
talking video

is already a practical pattern in the LTX 2.3 ComfyUI ecosystem.

Important caveat: custom audio is not always automatic lip-sync

I would still be careful.

Providing custom audio does not always guarantee perfect lip-sync. Some users report cases where the audio is present in the output but does not properly drive the mouth, or behaves more like voice-over narration.

A useful practical trick from the LTX ComfyUI ecosystem is:

Provide both the audio file and a transcript/description of the spoken line in the prompt.

For example, if your audio says:

Предста́вь, что это правда.

then the prompt should not only say:

A man speaks Russian.

It should say something like:

A close-up of a man speaking directly to the camera in Russian. He says: "Предста́вь, что это правда." His mouth movements are synchronized to the provided audio. The scene is quiet, with clear close-mic speech and no music.

This helps the model understand that the audio is meant to be character dialogue, not just background narration or ambience.

Recommended audio-first workflow

If I were trying to get correct Russian stress in LTX, I would test this pipeline.

Step 1 — Make the Russian audio outside LTX

Use one of:

  • a Russian-capable TTS
  • a voice-cloning TTS
  • a human recording
  • a manually edited recording
  • a TTS system with SSML/phoneme/lexicon controls, if available

The important point is: verify the pronunciation before giving it to LTX.
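One practical detail: TTS engines do not all accept the same stress notation. As far as I know, some Russian TTS engines expect a plus sign before the stressed vowel instead of an accent mark after it, but that is an assumption you should check against your engine's documentation. A minimal Python sketch of the conversion, with `acute_to_plus` as a hypothetical helper name:

```python
COMBINING_ACUTE = "\u0301"  # the stress mark used in "предста́вь"


def acute_to_plus(text: str) -> str:
    """Convert accent-after-vowel stress marks to plus-before-vowel notation."""
    out: list[str] = []
    for ch in text:
        if ch == COMBINING_ACUTE and out:
            # The accent follows the stressed vowel, so put "+" in front of it.
            out.insert(len(out) - 1, "+")
        else:
            out.append(ch)
    return "".join(out)


print(acute_to_plus("Предста\u0301вь, что э\u0301то пра\u0301вда."))
```

Whatever notation the engine takes, the check that matters is still listening to the result before it goes anywhere near LTX.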

Step 2 — Keep the audio simple

For the first test:

  • one speaker
  • short phrase
  • no background music
  • no echo
  • no heavy reverb
  • clean volume
  • clear Russian speech

Do not start with a long dramatic scene.
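A quick sanity check on the file can catch some of these problems before a generation run. The sketch below uses only the Python standard library and assumes a 16-bit PCM WAV. The thresholds (5 seconds, peak levels) are arbitrary values I picked for illustration, not LTX requirements, and it cannot detect music, echo, or reverb; you still have to listen.

```python
import array
import sys
import wave


def inspect_wav(path: str) -> dict:
    """Report basic properties of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        width = wav.getsampwidth()
        frames = wav.getnframes()
        raw = wav.readframes(frames)

    peak = None
    if width == 2:  # 16-bit PCM; other sample widths are skipped in this sketch
        samples = array.array("h")
        samples.frombytes(raw)
        if sys.byteorder == "big":
            samples.byteswap()  # WAV sample data is little-endian
        peak = max((abs(s) for s in samples), default=0) / 32768
    return {"channels": channels, "sample_rate": rate,
            "duration_s": frames / rate, "peak": peak}


def first_test_warnings(info: dict, max_seconds: float = 5.0) -> list[str]:
    """Flag obvious problems for a first test clip (illustrative thresholds)."""
    warnings = []
    if info["duration_s"] > max_seconds:
        warnings.append("longer than the first-test limit")
    if info["peak"] is not None and info["peak"] >= 0.99:
        warnings.append("peak near full scale, possible clipping")
    if info["peak"] is not None and info["peak"] < 0.05:
        warnings.append("very quiet")
    return warnings
```

Usage would be something like `first_test_warnings(inspect_wav("verified_russian_audio.wav"))`, where an empty list means none of these basic checks fired.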

Step 3 — Use LTX audio-to-video or image-audio-to-video

Use the verified audio as the main conditioning signal.

If using an image:

  • visible face
  • visible mouth
  • not too stylized
  • not too side-profile
  • stable lighting
  • one speaker only

Step 4 — Put the transcript in the prompt

Example:

A close-up portrait of one man speaking Russian directly to the camera. He says: "Предста́вь, что это правда." His mouth movements are synchronized to the provided audio. The delivery is calm and clear. The audio is close-mic Russian speech with quiet room tone.

Step 5 — Check whether LTX preserves or changes the audio

Depending on the workflow, the generated output audio may not be exactly the same as your verified source audio.

If the pronunciation is correct in the source audio but degraded in the output, then simply mux the verified audio back into the final video.

Conceptually:

generated_video.mp4 + verified_russian_audio.wav -> final_video.mp4
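One way to do that mux is with ffmpeg, called here from Python. This is a sketch under assumptions: ffmpeg is installed and on your PATH, the file names are the placeholders from the line above, and AAC is just a common audio codec choice for MP4 output. The video stream is copied without re-encoding.

```python
import subprocess


def build_mux_command(video: str, audio: str, output: str) -> list[str]:
    """Build an ffmpeg command that keeps the video and swaps in the given audio."""
    return [
        "ffmpeg", "-y",
        "-i", video,        # input 0: generated video (its own audio is ignored)
        "-i", audio,        # input 1: verified audio
        "-map", "0:v:0",    # take the first video stream from input 0
        "-map", "1:a:0",    # take the first audio stream from input 1
        "-c:v", "copy",     # do not re-encode the video
        "-c:a", "aac",      # encode the audio for the MP4 container
        "-shortest",        # stop at the end of the shorter stream
        output,
    ]


def mux(video: str, audio: str, output: str) -> None:
    """Run the mux; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_mux_command(video, audio, output), check=True)


print(" ".join(build_mux_command(
    "generated_video.mp4", "verified_russian_audio.wav", "final_video.mp4")))
```

This only makes sense if the generated video was driven by the same audio, so that the mouth timing already matches it. If the durations differ noticeably, check the sync by eye before trusting the result.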

Step 6 — If lip-sync is weak, simplify before changing models

Try:

  • shorter clip
  • clearer face
  • more frontal portrait
  • less camera motion
  • no second speaker
  • no music
  • transcript in prompt
  • different seed
  • different workflow version

Only after that would I move to a dedicated lip-sync model.

When to use dedicated lip-sync tools

If LTX gives a good video but poor mouth movement, a dedicated lip-sync step may be better.

Tools in this category include:

  • Wav2Lip
  • MuseTalk
  • VideoReTalking

These tools solve a narrower problem:

given video face + given audio -> lip-synced face video

That is narrower than LTX’s job:

scene + character + motion + audio + camera + style -> full audiovisual generation

So if the visual scene is already good and only the mouth timing is wrong, a dedicated lip-sync tool may be the better final step.

Audio-to-audio / voice conversion

Audio-to-audio or voice conversion may also be useful, but I would separate it from pronunciation correction.

Voice conversion is useful when the issue is:

  • voice identity
  • timbre
  • speaker style
  • accent color
  • emotional tone
  • making one generated voice sound more like another voice

But for Russian lexical stress, I would not rely on voice conversion as the main fix.

If the source audio has the wrong stress, a voice converter may preserve the wrong stress. It may change the voice color while keeping the same pronunciation error.

So for this problem, I would prioritize:

correct Russian audio first

then use voice conversion only if needed:

correct Russian audio
        ↓
optional voice conversion / voice cloning
        ↓
LTX audio-to-video / lip-sync

Why this is better than forcing text prompting

The reason is simple:

| Task | Best control signal |
| --- | --- |
| Correct Russian stress | verified Russian audio |
| Speaker voice identity | reference audio / voice clone |
| Face and video generation | LTX image/video prompt |
| Mouth timing | audio-driven generation or lip-sync |
| Final audio correctness | mux verified audio back in |

A text prompt alone asks LTX to solve too many tasks at once:

Read Russian correctly
infer word stress
generate voice
generate face
generate mouth motion
generate scene
align audio and video
follow camera/action prompt

Audio-first separates the tasks:

TTS/recording handles pronunciation.
LTX handles audiovisual performance.
Post-production handles final audio correctness.

That is usually more controllable.

Minimal LTX text-only fallback

If you still want to test text-only LTX first, I would use a minimal diagnostic prompt:

A close-up of one man in a quiet room, speaking directly to the camera. He speaks Russian slowly and clearly. He says only one word: "предста́вь". The stress is on the vowel "а", pronounced пред-СТАВЬ. The stressed syllable is slightly longer and louder than the others. The audio is crisp close-mic Russian speech with quiet room tone and no music.

If that works, gradually add complexity.

If that fails, I would not spend too much time on capitalization tricks. I would switch to the audio-first route.

Suggested production route

For your case, I would rank the routes like this:

| Rank | Route | Why |
| --- | --- | --- |
| 1 | Russian TTS / recording → verify stress → LTX audio-to-video or IA2V | Best match for pronunciation-sensitive generation |
| 2 | Russian TTS / recording → LTX video → mux verified audio back | Best if LTX changes/degrades audio |
| 3 | Russian TTS / recording → LTX video → dedicated lip-sync cleanup | Best if mouth movement is weak |
| 4 | Text-only prompt with stress mark + syllable hints | Worth trying, but not robust |
| 5 | Negative prompt | Probably least useful for this exact problem |

My current guess

My guess is:

  • Suno may respond to capitalization because it has learned lyric/music formatting conventions.
  • LTX text-only prompting may not treat capitalization inside Russian words as a reliable pronunciation-control marker.
  • Russian stress is hard because it is usually not written in ordinary spelling.
  • LTX is especially complicated because speech is generated together with video, face motion, timing, and scene context.
  • The best practical route is to stop making LTX infer the pronunciation from text.
  • Create the correct Russian audio first, then use LTX’s audio-driven workflow.

So I would say:

If exact Russian stress matters, do not make typography carry the whole burden. Use text-only LTX prompting as a quick experiment, but use verified audio as the real control signal.
