
Ltx 2.3 problem with dialogs

Hugging Face Forums [Unofficial] June 11, 2026

Ah, so that is the kind of “accent” you mean. That clarification helps. Whether this is controllable depends heavily on LTX itself, but my practical answer is below. The fact that Suno, LTX, and other models each have a different “dialect” for this kind of control is a fundamental issue in its own right:

Short answer

If the goal is correct Russian word stress, I would treat this less as a normal “negative prompt” problem and more as a pronunciation-control problem.

For LTX, I would split the answer into two levels:

Route	My rough expectation
Text prompt only	Worth testing, but probably unreliable for exact Russian stress
Text prompt with stress mark + syllable hint + explicit pronunciation instruction	Better, but still not guaranteed
Generate/record correct Russian audio first, then use LTX audio-to-video / image-audio-to-video / custom-audio I2V	Much more promising
Final mux/post-production with verified audio	Most reliable for pronunciation correctness

So the practical summary is:

Do not treat capital letters as a universal pronunciation API. For LTX text-only prompting, try stress marks, syllable hints, and explicit delivery instructions. But if Russian stress must be correct, make the correct audio first and use LTX as an audio-driven video generator.

That audio-first route is not just theoretical. LTX has official audio-to-video API support, and the LTX/ComfyUI community already has workflows around custom audio, image-audio-to-video, Qwen/Fish-style TTS, and voice-driven talking video.

Why Suno and LTX can behave differently

This is the core issue.

Different generative models learn different informal “control languages.”

Suno is strongly connected to music, lyrics, song structure, vocal emphasis, line breaks, and lyric formatting. In that world, things like:

ALL CAPS
line breaks
repeated syllables
elongated words
[spoken]
[whispered]
[chorus]

can become useful signals because the model has likely seen many examples where typography and lyric formatting correlate with vocal delivery.

LTX is different. LTX-2.3 is described as a diffusion-based audio-video foundation model that generates synchronized video and audio in a single model. It is not simply a lyrics-to-song model or a normal standalone TTS engine. Its speech is entangled with:

the person on screen
mouth movement
facial expression
camera motion
scene timing
environment
emotion
ambient sound
video/audio synchronization

So the same trick can work differently:

Model family	Likely stronger signal
Song / lyric model	Typography, line breaks, lyric structure, repeated syllables
Ordinary TTS	Text normalization, lexicon, SSML, phonemes, voice settings
LTX-style audio-video model	Scene description, quoted dialogue, audio prompt, timing, character action, reference image/audio
Lip-sync model	Input audio waveform, face crop/identity, mouth motion constraints

That is why this is not just a “bad prompt” issue. The model may simply not have learned the same convention that Suno learned.

Russian stress is especially difficult

Russian word stress is not a simple typographic effect.

In normal Russian writing, stress marks are usually omitted. The model often sees a plain word and must infer where the stress should be. That can depend on lexical knowledge, word form, context, and training data coverage.

For example, learning materials may write stress like this:

предста́вь
вообрази́

But ordinary text usually does not include those marks. So if the model sees:

представь

it has to know the pronunciation from memory. If it does not know it reliably, capitalization may not fix the problem.
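To make the point concrete, here is a minimal Python sketch of how the stress mark is encoded. It assumes the mark is the Unicode combining acute accent (U+0301) placed after the stressed vowel, which is how learning materials usually write it. The function names are hypothetical and only for illustration.

```python
from typing import Optional

COMBINING_ACUTE = "\u0301"  # assumption: stress is written as a combining acute
RUSSIAN_VOWELS = set("аеёиоуыэюяАЕЁИОУЫЭЮЯ")


def strip_stress(word: str) -> str:
    """Return the ordinary spelling, with stress marks removed."""
    return word.replace(COMBINING_ACUTE, "")


def stressed_vowel_position(word: str) -> Optional[int]:
    """Return which vowel (1-based) carries the stress mark, or None if unmarked."""
    vowel_count = 0
    # Pair every character with the one that follows it.
    for ch, nxt in zip(word, word[1:] + " "):
        if ch in RUSSIAN_VOWELS:
            vowel_count += 1
            if nxt == COMBINING_ACUTE:
                return vowel_count
    return None


marked = "предста\u0301вь"
print(strip_stress(marked))                  # ordinary spelling
print(stressed_vowel_position(marked))       # stress on the second vowel
print(stressed_vowel_position("представь"))  # None: plain text has no stress information
```

The last call is the important one: once the mark is gone, nothing in the string says where the stress belongs, so the model has to supply it from what it has learned.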

This is very different from simply saying:

Say this word louder.

Russian lexical stress is closer to:

Pronounce this specific word with the correct stressed vowel.

That is why a text-only video model may be inconsistent: sometimes it knows the word, sometimes it guesses, sometimes it follows visual/timing constraints more strongly than the intended stress hint.

Text-only LTX prompt: what is still worth trying

I would still test text-only prompting, but I would treat it as experimental.

Instead of only doing:

предстАвь

I would try redundant hints:

A close-up of one man in a quiet room, speaking directly to the camera. He speaks Russian slowly and clearly. He says only one word: "предста́вь". The stress is on the vowel "а", pronounced пред-СТАВЬ. The stressed syllable is slightly longer and louder than the others. The audio is crisp close-mic Russian speech with quiet room tone and no music.

This gives the model several different signals:

Signal type	Example
Normal Russian spelling	`представь`
Stress mark	`предста́вь`
Syllable/stress hint	`пред-СТАВЬ`
Natural-language explanation	“The stress is on the vowel `а`”
Acoustic explanation	“slightly longer and louder”
Delivery instruction	“slowly and clearly”
Scene simplification	one speaker, quiet room, close-up

This may still fail, but it is a more LTX-like prompt than just capitalizing one letter.
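If you want to test several words, the redundant spellings in the table can be generated rather than typed by hand. This is only a sketch: it assumes you split the word into syllables yourself and say which one is stressed, and the helper names (`stress_hints`, `stress_sentence`) are made up for this example. It does not know Russian stress; it only formats what you give it.

```python
COMBINING_ACUTE = "\u0301"  # assumption: stress is written as a combining acute
RUSSIAN_VOWELS = "аеёиоуыэюя"


def stress_hints(syllables, stressed):
    """Build plain, stress-marked, and syllable-hint spellings for one word.

    syllables: the word split by hand, e.g. ["пред", "ставь"]
    stressed:  0-based index of the stressed syllable
    """
    if not 0 <= stressed < len(syllables):
        raise ValueError("stressed index is out of range")
    syl = syllables[stressed]
    # Position of the vowel inside the stressed syllable.
    pos = next((k for k, ch in enumerate(syl) if ch.lower() in RUSSIAN_VOWELS), None)
    if pos is None:
        raise ValueError("stressed syllable has no vowel")
    marked_syl = syl[: pos + 1] + COMBINING_ACUTE + syl[pos + 1:]
    marked = syllables[:stressed] + [marked_syl] + syllables[stressed + 1:]
    hint = syllables[:stressed] + [syl.upper()] + syllables[stressed + 1:]
    return {
        "plain": "".join(syllables),
        "marked": "".join(marked),
        "hint": "-".join(hint),
        "vowel": syl[pos],
    }


def stress_sentence(h):
    """Format the hints as the pronunciation part of a prompt."""
    return (
        f'He says only one word: "{h["marked"]}". '
        f'The stress is on the vowel "{h["vowel"]}", pronounced {h["hint"]}.'
    )


print(stress_sentence(stress_hints(["пред", "ставь"], 1)))
```

The output is the pronunciation part of the prompt above; the scene and delivery instructions still need to be written around it.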

Why negative prompt probably will not solve it

I would not start with:

negative prompt: wrong stress, bad Russian pronunciation, incorrect accent

That does not tell the model what the correct pronunciation is.

For this kind of problem, a positive target is more useful:

The stress is on the vowel "а": пред-СТАВЬ.
The stressed syllable is slightly longer and louder.
He says the word slowly and clearly in Russian.

A negative prompt can suppress broad unwanted things. But Russian word stress is not a broad unwanted artifact. It is a specific pronunciation target.

The more promising route: generate the Russian audio first

For exact Russian stress, I would probably move the pronunciation problem out of the LTX text prompt.

A better production-style route is:

Russian-capable TTS / voice clone / human recording
        ↓
verify stress and pronunciation
        ↓
LTX audio-to-video / image-audio-to-video / custom-audio I2V
        ↓
optional mux/post-production with verified audio

This is not just a workaround. It matches LTX’s strengths better.

LTX’s audio-to-video API is explicitly designed to generate video driven by an audio track. The documentation says you can supply dialogue, music, or ambient sound, and the model produces visuals synchronized to the audio. LTX’s Audio-to-Video capability page also describes audio as the primary conditioning signal, where voice, music, and sound drive motion, pacing, and scene structure.

That is exactly the kind of control surface you want when pronunciation matters.

Instead of asking LTX:

Please infer the correct Russian stress from text.

you give it:

Here is the already-correct Russian audio. Generate the visual performance around it.

That is a much stronger signal.

Practical precedent: people are already doing this

There are already LTX community workflows that look very close to this route.

For example:

In a Kijai/LTX2.3_comfy discussion, RuneXX shared an I2V & T2V with Custom Audio workflow described as “Use your own audio files with lip sync, and synced motion.”
In the same ecosystem, users discuss workflows where they first generate or clone voice with Qwen TTS, then use LTX I2V with custom audio.
Comfy has an LTX-2.3 Image Audio to Video workflow where a portrait image and audio file are used to create a lip-synced talking video.
There are also LTX/Comfy workflows combining Qwen TTS, Fish Audio, or other TTS/voice-cloning tools with LTX image/audio video generation.

So the route is not only theoretical:

TTS / voice clone / verified recording
        ↓
LTX custom audio / IA2V / A2V
        ↓
talking video

is already a practical pattern in the LTX 2.3 ComfyUI ecosystem.

Important caveat: custom audio is not always automatic lip-sync

I would still be careful.

Providing custom audio does not always guarantee perfect lip-sync. Some users report cases where the audio is present in the output but does not properly drive the mouth, or behaves more like voice-over narration.

A useful practical trick from the LTX ComfyUI ecosystem is:

Provide both the audio file and a transcript/description of the spoken line in the prompt.

For example, if your audio says:

Предста́вь, что это правда.

then the prompt should not only say:

A man speaks Russian.

It should say something like:

A close-up of a man speaking directly to the camera in Russian. He says: "Предста́вь, что это правда." His mouth movements are synchronized to the provided audio. The scene is quiet, with clear close-mic speech and no music.

This helps the model understand that the audio is meant to be character dialogue, not just background narration or ambience.

Recommended audio-first workflow

If I were trying to get correct Russian stress in LTX, I would test this pipeline.

Step 1 — Make the Russian audio outside LTX

Use one of:

a Russian-capable TTS
a voice-cloning TTS
a human recording
a manually edited recording
a TTS system with SSML/phoneme/lexicon controls, if available

The important point is: verify the pronunciation before giving it to LTX.
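If your TTS accepts SSML, the standard `<phoneme>` element is one way to pin the pronunciation. Below is a minimal sketch that only builds the markup. Whether a given engine honors `<phoneme>` for Russian is engine-specific, and the IPA string in the usage line is my approximate transcription, so treat both as assumptions to verify by listening. The function name is hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr


def ssml_with_phoneme(word, ipa, lang="ru-RU"):
    """Wrap one word in an SSML <phoneme> element with an IPA pronunciation."""
    return (
        '<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(lang)}>"
        f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(word)}</phoneme>'
        "</speak>"
    )


# Approximate IPA for "представь"; check it against a dictionary before relying on it.
print(ssml_with_phoneme("представь", "prʲɪtˈstafʲ"))
```

Even with markup like this, I would still listen to the result before passing it to LTX.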

Step 2 — Keep the audio simple

For the first test:

one speaker
short phrase
no background music
no echo
no heavy reverb
clean volume
clear Russian speech

Do not start with a long dramatic scene.
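Some of these points can be checked mechanically before the file goes anywhere near LTX. The sketch below uses only Python's standard `wave` module and assumes a 16-bit PCM WAV file. The thresholds (10 seconds, peak between 0.1 and 0.99) are arbitrary values I picked for illustration, not LTX requirements, and the check cannot detect music, echo, or wrong stress; those still need your ears.

```python
import array
import wave


def inspect_wav(path):
    """Read basic properties of a 16-bit PCM WAV file."""
    with wave.open(path, "rb") as wf:
        channels = wf.getnchannels()
        rate = wf.getframerate()
        width = wf.getsampwidth()
        frames = wf.getnframes()
        raw = wf.readframes(frames)
    if width != 2:
        raise ValueError("this sketch only handles 16-bit PCM audio")
    samples = array.array("h", raw)  # signed 16-bit samples
    peak = max((abs(s) for s in samples), default=0) / 32768
    return {
        "seconds": frames / rate,
        "channels": channels,
        "sample_rate": rate,
        "peak": peak,
    }


def preflight_warnings(info, max_seconds=10.0, min_peak=0.1, max_peak=0.99):
    """Return human-readable warnings; an empty list means nothing obvious is wrong."""
    warnings = []
    if info["seconds"] > max_seconds:
        warnings.append("clip is long; start with a short phrase")
    if info["channels"] != 1:
        warnings.append("audio is not mono; one clean speaker track is simpler")
    if info["peak"] < min_peak:
        warnings.append("audio is very quiet")
    if info["peak"] >= max_peak:
        warnings.append("audio may be clipping")
    return warnings
```

Usage would be something like `preflight_warnings(inspect_wav("verified_russian_audio.wav"))`, where the file name is just a placeholder.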

Step 3 — Use LTX audio-to-video or image-audio-to-video

Use the verified audio as the main conditioning signal.

If using an image:

visible face
visible mouth
not too stylized
not too side-profile
stable lighting
one speaker only

Step 4 — Put the transcript in the prompt

Example:

A close-up portrait of one man speaking Russian directly to the camera. He says: "Предста́вь, что это правда." His mouth movements are synchronized to the provided audio. The delivery is calm and clear. The audio is close-mic Russian speech with quiet room tone.

Step 5 — Check whether LTX preserves or changes the audio

Depending on the workflow, the generated output audio may not be exactly the same as your verified source audio.

If the pronunciation is correct in the source audio but degraded in the output, then simply mux the verified audio back into the final video.

Conceptually:

generated_video.mp4 + verified_russian_audio.wav -> final_video.mp4
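One common way to do that mux is with ffmpeg. The sketch below builds the command from Python and assumes ffmpeg is installed and on your PATH. The file names are the placeholders from the line above. It copies the video stream unchanged, takes the audio only from the verified file, and encodes it as AAC for the MP4 container.

```python
import subprocess


def mux_command(video, audio, out):
    """Build an ffmpeg command that replaces the video's audio track."""
    return [
        "ffmpeg", "-y",
        "-i", video,        # input 0: generated video (its own audio is ignored)
        "-i", audio,        # input 1: verified Russian audio
        "-map", "0:v:0",    # take the video stream from input 0
        "-map", "1:a:0",    # take the audio stream from input 1
        "-c:v", "copy",     # do not re-encode the video
        "-c:a", "aac",      # encode the audio for the MP4 container
        "-shortest",        # stop at the end of the shorter stream
        out,
    ]


def mux(video, audio, out):
    """Run the mux; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(mux_command(video, audio, out), check=True)


print(" ".join(mux_command("generated_video.mp4", "verified_russian_audio.wav", "final_video.mp4")))
```

This only works cleanly if the video was generated from that same audio, so the timing already matches. If the durations differ noticeably, `-shortest` will cut the longer stream, and you should check sync by eye.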

Step 6 — If lip-sync is weak, simplify before changing models

Try:

shorter clip
clearer face
more frontal portrait
less camera motion
no second speaker
no music
transcript in prompt
different seed
different workflow version

Only after that would I move to a dedicated lip-sync model.

When to use dedicated lip-sync tools

If LTX gives a good video but poor mouth movement, a dedicated lip-sync step may be better.

Tools in this category include:

Wav2Lip
MuseTalk
VideoReTalking

These tools solve a narrower problem:

given video face + given audio -> lip-synced face video

That is narrower than LTX’s job:

scene + character + motion + audio + camera + style -> full audiovisual generation

So if the visual scene is already good and only the mouth timing is wrong, a dedicated lip-sync tool may be the better final step.

Audio-to-audio / voice conversion

Audio-to-audio or voice conversion may also be useful, but I would separate it from pronunciation correction.

Voice conversion is useful when the issue is:

voice identity
timbre
speaker style
accent color
emotional tone
making one generated voice sound more like another voice

But for Russian lexical stress, I would not rely on voice conversion as the main fix.

If the source audio has the wrong stress, a voice converter may preserve the wrong stress. It may change the voice color while keeping the same pronunciation error.

So for this problem, I would prioritize:

correct Russian audio first

then use voice conversion only if needed:

correct Russian audio
        ↓
optional voice conversion / voice cloning
        ↓
LTX audio-to-video / lip-sync

Why this is better than forcing text prompting

The reason is simple:

Task	Best control signal
Correct Russian stress	Verified Russian audio
Speaker voice identity	Reference audio / voice clone
Face and video generation	LTX image/video prompt
Mouth timing	Audio-driven generation or lip-sync
Final audio correctness	Mux verified audio back in

A text prompt alone asks LTX to solve too many tasks at once:

read Russian correctly
infer word stress
generate voice
generate face
generate mouth motion
generate scene
align audio and video
follow camera/action prompt

Audio-first separates the tasks:

TTS/recording handles pronunciation.
LTX handles audiovisual performance.
Post-production handles final audio correctness.

That is usually more controllable.

Minimal LTX text-only fallback

If you still want to test text-only LTX first, I would use a minimal diagnostic prompt:

A close-up of one man in a quiet room, speaking directly to the camera. He speaks Russian slowly and clearly. He says only one word: "предста́вь". The stress is on the vowel "а", pronounced пред-СТАВЬ. The stressed syllable is slightly longer and louder than the others. The audio is crisp close-mic Russian speech with quiet room tone and no music.

If that works, gradually add complexity.

If that fails, I would not spend too much time on capitalization tricks. I would switch to the audio-first route.

Suggested production route

For your case, I would rank the routes like this:

Rank	Route	Why
1	Russian TTS / recording → verify stress → LTX audio-to-video or IA2V	Best match for pronunciation-sensitive generation
2	Russian TTS / recording → LTX video → mux verified audio back	Best if LTX changes/degrades audio
3	Russian TTS / recording → LTX video → dedicated lip-sync cleanup	Best if mouth movement is weak
4	Text-only prompt with stress mark + syllable hints	Worth trying, but not robust
5	Negative prompt	Probably least useful for this exact problem

My current guess

My guess is:

Suno may respond to capitalization because it has learned lyric/music formatting conventions.
LTX text-only prompting may not treat capitalization inside Russian words as a reliable pronunciation-control marker.
Russian stress is hard because it is usually not written in ordinary spelling.
LTX is especially complicated because speech is generated together with video, face motion, timing, and scene context.
The best practical route is to stop making LTX infer the pronunciation from text.
Create the correct Russian audio first, then use LTX’s audio-driven workflow.

So I would say:

If exact Russian stress matters, do not make typography carry the whole burden. Use text-only LTX prompting as a quick experiment, but use verified audio as the real control signal.