Ltx 2.3 problem with dialogs

Hugging Face Forums [Unofficial] June 10, 2026

I’m not an experienced specialist, but assuming this is about LTX-2.3’s built-in audio generation, I wonder if the prompt may be specifying the speech in a slightly mismatched way:


Short answer

If by “accent” you mean word stress / emphasis / prosody — for example, “make this word sound stronger” — then capital letters alone are probably too weak.

Instead of treating it like typography:

"I will NEVER go back."

I would try writing it as speech performance direction:

The woman says slowly and clearly, "I will never go back." She pauses briefly before the word "never", then says "never" more firmly and slightly louder than the rest of the sentence. The audio is crisp close-mic speech with quiet room tone.

If by “accent” you mean a regional accent — for example British, French, Russian, American, etc. — that is a harder problem. You can specify it in the prompt, but if exact accent or pronunciation is important, prompt-only generation may not be reliable enough. Reference audio, audio-to-video, LipDub, or replacing/cleaning the final audio may be more practical.

Why I would not start with the negative prompt

For this specific problem, I would not start by asking “what should I put in the negative prompt?”

A negative prompt is usually better for removing broad unwanted concepts. But here the issue is probably not just “avoid bad accent.” It is more like:

  • which word should be stressed
  • where the pause should be
  • how fast the line should be spoken
  • whether the voice should be calm, angry, hesitant, loud, soft, tense, etc.
  • whether “accent” means regional pronunciation or just emphasis

So I would first improve the positive prompt: describe the speech you want, not only the speech you do not want.

“Accent” may mean two different things

This is important because the fix depends on the meaning.

| If “accent” means… | Example | Better first approach |
| --- | --- | --- |
| Regional accent | British accent, French accent, Russian accent | Specify language, regional accent, and voice style, but expect limited reliability. Use reference audio if exact fidelity matters. |
| Word stress / emphasis / prosody | Make “never” stronger; pause before a word; say one phrase more firmly | Do not rely only on capital letters. Describe the delivery: pause, pace, volume, firmness, emotional beat, and the exact word to emphasize. |
| Pronunciation of a specific word | A name, acronym, unusual word, technical term | Rewrite the phrase more clearly, but exact pronunciation may require reference audio, external TTS, or post-production. |
| Wrong speaker / wrong dialogue assignment | The wrong character speaks; two people speak at once | Use clearer speaker references, shorter dialogue chunks, and check whether the Space/ComfyUI workflow adds or rewrites prompt text. |

LTX-style prompts seem closer to directing a performance

I would not treat this as a universal rule, because prompt behavior ultimately depends on the model weights, training data, text encoder, conditioning design, guidance settings, sampler, seed, and sometimes the UI or workflow around the model.

But different models do have practical prompting habits.

For LTX-2.3, the official prompt guide strongly points toward a cinematic / performance-direction style rather than a short keyword or typography-based style. The guide says LTX-2.3 responds better to long, specific prompts, and it specifically mentions facial expressions, timing, pauses, emotional beats, voice qualities, and breaking dialogue into short phrases with acting directions between them.

Useful references:

  • LTX-2.3 Prompt Guide
  • LTX Documentation: Prompting Guide
  • LTX-2.3 Hugging Face model card

The LTX documentation also says to clearly describe audio, including ambient sound, music, speech, or singing, and to put spoken dialogue in quotation marks. It also says to specify language and accent if needed.

So for LTX, I would think less like this:

keyword, keyword, keyword, BIG EMPHASIS, no bad accent

and more like this:

A short cinematic close-up. The woman speaks English in a calm but tense voice. She says, "I will never go back." She pauses briefly before "never", then says that word more firmly and slightly louder. Her jaw tightens as she finishes the sentence. The audio is crisp close-mic speech with quiet room tone.

Why capital letters are probably not enough

Capital letters are a text convention. They may sometimes help a model infer emphasis, but they do not fully specify speech delivery.

Even in ordinary TTS systems, prosody is often controlled with explicit mechanisms such as pauses, pitch, rate, volume, pronunciation, and emphasis. For example, SSML exists partly because speech synthesis often needs structured controls for pronunciation, volume, pitch, rate, and related properties. Google’s Text-to-Speech SSML documentation similarly describes controls such as prosody, pitch, speaking rate, volume, breaks, and pronunciation-related markup.
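
For comparison, here is a minimal sketch of what explicit emphasis markup looks like in SSML, built as a plain Python string. The `break`, `emphasis`, and `speak` elements come from the W3C SSML specification; the helper name `ssml_emphasis` and the 300 ms default are my own assumptions. This is for an SSML-capable TTS engine only. I am not suggesting that LTX-2.3 accepts SSML.

```python
from xml.sax.saxutils import escape


def ssml_emphasis(before: str, word: str, after: str, pause_ms: int = 300) -> str:
    """Build an SSML string that pauses, then stresses a single word.

    Hypothetical helper for illustration; only the SSML element names
    (speak, break, emphasis) are taken from the SSML specification.
    """
    return (
        "<speak>"
        f"{escape(before)} "
        f'<break time="{pause_ms}ms"/>'                       # explicit pause
        f'<emphasis level="strong">{escape(word)}</emphasis> '  # explicit stress
        f"{escape(after)}"
        "</speak>"
    )


print(ssml_emphasis("I will", "never", "go back."))
```

The point of the comparison is that even dedicated TTS systems use a separate channel for stress and pauses instead of inferring them from capitalization.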

LTX-2.3 is not just an ordinary TTS engine, though. It is an audio-video generation model. The LTX-2 paper describes LTX-2 as a joint audio-visual model with video and audio streams connected by cross-modal attention. That means the speech can be affected by the scene, character, timing, facial motion, acoustic environment, and video prompt — not only by the quoted text.

So I would avoid relying on typography alone.

Better prompt patterns to try

1. For word emphasis

Instead of:

The man says: "This is VERY important."

Try:

The man speaks slowly and seriously. He says, "This is very important." He places the strongest emphasis on the word "very", saying it slightly louder and more firmly than the rest of the sentence.

Or:

The man says, "This is..." He pauses briefly, then says "very important" with a firmer tone and stronger emphasis on "very". The audio is clear speech with quiet room tone.
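
If you already have a batch of lines written with capital letters, the rewrite above can be roughly automated. This is only a sketch: the function name `caps_to_direction` is hypothetical, the wording of the generated direction is just one option, and it assumes the line contains no real acronyms (those would be mistaken for emphasis).

```python
import re

# Two or more consecutive capital letters as a whole word, e.g. "VERY".
# A single capital such as "I" is deliberately not matched.
CAPS_WORD = re.compile(r"\b[A-Z]{2,}\b")


def caps_to_direction(speaker: str, line: str) -> str:
    """Replace ALL-CAPS emphasis with a prose description of the delivery."""
    stressed = [w.lower() for w in CAPS_WORD.findall(line)]
    plain = CAPS_WORD.sub(lambda m: m.group(0).lower(), line)

    prompt = f'{speaker} says, "{plain}"'
    for word in stressed:
        prompt += (
            f' The word "{word}" is spoken slightly louder and more firmly'
            " than the rest of the sentence."
        )
    return prompt


print(caps_to_direction("The man", "This is VERY important."))
```

I would still read the result and adjust it by hand, since the useful part is the description of the performance, not the mechanical substitution.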

2. For a pause before a key word

Instead of:

The woman says: "I will NEVER go back."

Try:

The woman says softly, "I will..." She pauses, looks directly at him, and her jaw tightens. She then says "never" more firmly and slightly louder: "never go back." The audio is crisp, close-mic speech with quiet room tone.

3. For regional accent

Instead of:

The woman speaks with an accent.

Try:

The woman speaks English with a light French accent, in a calm conversational voice. She says: "I do not think we should go back." The audio is crisp, with quiet room tone.

But I would not expect perfect control here. Regional accent is not just one simple attribute. It involves pronunciation, rhythm, vowel/consonant patterns, speaking style, and sometimes speaker identity. If the exact accent matters, prompt-only control may not be the best tool.

4. For clearer dialogue timing

Instead of putting a long line in one block:

She says, "So anyway, we went to watch the movie, and spoiler, someone dies at the end", then she laughs.

Try splitting it into shorter beats:

The woman speaks in a casual YouTuber-style close-up. She says, "So anyway, we went to watch the movie." She pauses and hides a small laugh. Then she continues, "And spoiler: someone dies at the end." The audio is clear speech with natural room tone.

This style is closer to what the LTX-2.3 prompt guide suggests: short dialogue chunks, acting directions, and clear physical cues.
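
The splitting step can be sketched in a few lines of Python. Everything here is an assumption for illustration: the function name `split_into_beats`, the choice to split only after sentence-ending punctuation, and the "says" / "then continues" wording.

```python
import re


def split_into_beats(speaker: str, line: str, directions: list[str]) -> str:
    """Split one long line into quoted beats with acting directions between them.

    directions[i] is inserted after beat i; nothing is inserted after the
    final beat.
    """
    # Split on whitespace that follows ., ! or ?, keeping the punctuation.
    beats = [b for b in re.split(r"(?<=[.!?])\s+", line.strip()) if b]

    parts = []
    for i, beat in enumerate(beats):
        lead = f"{speaker} says" if i == 0 else f"{speaker} then continues"
        parts.append(f'{lead}, "{beat}"')
        if i < len(directions) and i < len(beats) - 1:
            parts.append(directions[i])
    return " ".join(parts)


print(
    split_into_beats(
        "The woman",
        "So anyway, we went to watch the movie. And spoiler: someone dies at the end.",
        ["She pauses and hides a small laugh."],
    )
)
```

The acting directions still have to be written by a person; the code only enforces the shape of short quoted chunks with a direction between them.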

A useful mental model

I would use this mental model:

Capital letters are typography. LTX-style prompting is closer to directing an actor.

So instead of asking only:

How do I mark the accented word?

ask:

How should the character perform the line?

Then describe:

  • who is speaking
  • the exact quoted dialogue
  • the language
  • regional accent, if relevant
  • the word to emphasize
  • pause placement
  • pace
  • volume
  • firmness / softness
  • emotional beat
  • facial or body cue
  • acoustic environment

For example:

A tight close-up of a tired man sitting in a quiet kitchen at night. He speaks English in a low, controlled voice. He says, "I told you..." He pauses, looks down, then continues more firmly: "I am not going back." He places the strongest emphasis on "not", saying it slightly louder and slower than the surrounding words. The audio is crisp close-mic speech with quiet room tone and no dramatic music.
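
To keep those elements from being forgotten, the checklist can be turned into a small template. This is a minimal sketch, not an official LTX tool: the class name `SpeechDirection`, its field names, and the sentence order are all my own assumptions.

```python
from dataclasses import dataclass


@dataclass
class SpeechDirection:
    """Hypothetical container for the checklist items; not an LTX API."""

    scene: str          # shot, character, and setting
    speaker: str        # how the speaker is referred to, e.g. "He"
    language: str
    voice: str          # voice quality, e.g. "low, controlled"
    dialogue: str       # exact spoken words, without quotation marks
    emphasis_word: str  # the word to stress
    delivery: str       # how the stressed word is spoken
    audio: str          # acoustic environment
    accent: str = ""    # optional regional accent

    def __post_init__(self) -> None:
        # Simple substring check so the stressed word is really in the line.
        if self.emphasis_word not in self.dialogue:
            raise ValueError("emphasis_word must appear in dialogue")

    def to_prompt(self) -> str:
        accent = f" with a {self.accent} accent" if self.accent else ""
        return (
            f"{self.scene} {self.speaker} speaks {self.language}{accent} "
            f"in a {self.voice} voice. "
            f'{self.speaker} says, "{self.dialogue}" '
            f'The strongest emphasis is on "{self.emphasis_word}", {self.delivery}. '
            f"{self.audio}"
        )


direction = SpeechDirection(
    scene="A tight close-up of a tired man sitting in a quiet kitchen at night.",
    speaker="He",
    language="English",
    voice="low, controlled",
    dialogue="I am not going back.",
    emphasis_word="not",
    delivery="said slightly louder and slower than the surrounding words",
    audio="The audio is crisp close-mic speech with quiet room tone.",
)
print(direction.to_prompt())
```

A template like this will produce stiffer prose than a hand-written prompt, so I would treat its output as a first draft to edit.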

What I would try first

I would test in this order:

  1. Clarify the meaning of “accent”

    • Regional accent?
    • Word stress?
    • Pronunciation?
    • Speaker voice?
  2. Use positive speech direction

    • Do not start with the negative prompt.
    • Describe the desired delivery directly.
  3. Quote the dialogue

    • Keep spoken words inside quotation marks.
  4. Split long dialogue into short beats

    • Especially if the line contains pauses, laughter, hesitation, or emphasis.
  5. Describe the emphasis in natural speech terms

    • “pauses before the word”
    • “says it slightly louder”
    • “says it more firmly”
    • “slows down on that word”
    • “voice becomes sharper/softer/tenser”
  6. Describe the audio environment

    • “crisp close-mic speech”
    • “quiet room tone”
    • “subtle natural ambience”
    • “no dramatic music” only if needed, but focus mainly on what you want to hear
  7. If exact accent matters, use a stronger control surface

    • reference audio
    • audio-to-video
    • LipDub
    • external TTS
    • post-production audio replacement or cleanup

If you are using a Space, ComfyUI workflow, or wrapper

One more thing: if this is not the raw model call, the prompt may be passing through another layer.

For example, a Space, workflow, prompt enhancer, default text node, or UI wrapper may add or rewrite text before LTX sees it. So if the model seems to ignore a very short instruction, it may be worth checking:

  • whether a prompt enhancer is enabled
  • whether default prompt text is being appended
  • whether the workflow separates visual prompt and audio prompt
  • whether the workflow has audio guidance settings
  • whether the model is actually receiving the exact dialogue you typed

This matters because LTX prompting often appears to work best when the final prompt is a complete scene description, not a tiny isolated speech instruction.
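
If your workflow lets you see or log the text that actually reaches the model, a quick comparison can show whether something was added or whether the dialogue was altered. This is a sketch under that assumption. How you obtain the final prompt depends entirely on the Space or workflow, and the function name `check_final_prompt` is hypothetical.

```python
import difflib
import re


def check_final_prompt(typed: str, final: str) -> dict:
    """Compare the prompt you typed with the prompt the model received."""
    # Every straight-quoted dialogue segment in the typed prompt
    # should survive unchanged in the final prompt.
    quoted = re.findall(r'"([^"]+)"', typed)
    missing = [q for q in quoted if q not in final]

    # Word-level diff: collect words present only in the final prompt.
    added = [
        token[2:]
        for token in difflib.ndiff(typed.split(), final.split())
        if token.startswith("+ ")
    ]
    return {
        "unchanged": typed == final,
        "missing_dialogue": missing,
        "added_words": added,
    }


typed = 'She says, "I will never go back."'
final = typed + " cinematic, 4k, masterpiece"
print(check_final_prompt(typed, final))
```

An empty `missing_dialogue` list with a non-empty `added_words` list would suggest that the wrapper appends default text but leaves the quoted line alone.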

When prompt-only may not be enough

If you need the voice to have a specific regional accent, exact pronunciation, or consistent speaker identity, I would not rely only on text prompting.

There are LTX-related routes that are closer to that kind of control, such as:

  • ComfyUI-LTXVideo
  • LTX LipDub / speaker identity preservation features
  • ID-LoRA for LTX-2 / LTX-2.3 reference-audio speaker identity transfer
  • LTX Desktop, which includes audio-to-video and local LTX-2.3 workflows

Those are probably more relevant when the problem is not just “make this word stronger,” but “make this voice/accent/pronunciation stay consistent.”

Summary

My guess is:

  • LTX-2.3 does have built-in audio generation, so this is not automatically an external TTS issue.
  • But “accent” is ambiguous.
  • If you mean word emphasis, capitalization alone is probably too weak.
  • Try writing the desired speech as performance direction: quoted dialogue, short phrases, pauses, pace, volume, firmness, and the exact word to emphasize.
  • If you mean regional accent or exact pronunciation, prompt-only control may be unreliable, and reference audio / audio-to-video / LipDub / post-production may be more realistic.
  • I would improve the positive prompt first before trying to solve this with the negative prompt.
