Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreibnep3zixhok4b6nvemv7yqffpcifxcpipvxrhglhaj3s7ft5ennq",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mnvmerxkvgi2"
  },
  "path": "/t/ltx-2-3-problem-with-dialogs/176649#post_2",
  "publishedAt": "2026-06-10T00:57:46.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "LTX-2.3 Prompt Guide",
    "LTX Documentation: Prompting Guide",
    "LTX-2.3 Hugging Face model card",
    "SSML",
    "Text-to-Speech SSML documentation",
    "LTX-2 paper",
    "ComfyUI-LTXVideo",
    "LTX LipDub / speaker identity preservation features",
    "ID-LoRA for LTX-2 / LTX-2.3 reference-audio speaker identity transfer",
    "LTX Desktop"
  ],
  "textContent": "I’m not an experienced specialist, but assuming this is about LTX-2.3’s built-in audio generation, I wonder if the prompt may be specifying the speech in a slightly mismatched way:\n\n* * *\n\n## Short answer\n\nIf by **“accent”** you mean **word stress / emphasis / prosody** — for example, “make this word sound stronger” — then capital letters alone are probably too weak.\n\nInstead of treating it like typography:\n\n\n    \"I will NEVER go back.\"\n\n\nI would try writing it as **speech performance direction** :\n\n\n    The woman says slowly and clearly, \"I will never go back.\" She pauses briefly before the word \"never\", then says \"never\" more firmly and slightly louder than the rest of the sentence. The audio is crisp close-mic speech with quiet room tone.\n\n\nIf by **“accent”** you mean a **regional accent** — for example British, French, Russian, American, etc. — that is a harder problem. You can specify it in the prompt, but if exact accent or pronunciation is important, prompt-only generation may not be reliable enough. Reference audio, audio-to-video, LipDub, or replacing/cleaning the final audio may be more practical.\n\n## Why I would not start with the negative prompt\n\nFor this specific problem, I would not start by asking “what should I put in the negative prompt?”\n\nA negative prompt is usually better for removing broad unwanted concepts. 
But here the issue is probably not just “avoid bad accent.” It is more like:\n\n  * which word should be stressed\n  * where the pause should be\n  * how fast the line should be spoken\n  * whether the voice should be calm, angry, hesitant, loud, soft, tense, etc.\n  * whether “accent” means regional pronunciation or just emphasis\n\n\n\nSo I would first improve the **positive prompt** : describe the speech you want, not only the speech you do not want.\n\n## “Accent” may mean two different things\n\nThis is important because the fix depends on the meaning.\n\nIf “accent” means… | Example | Better first approach\n---|---|---\n**Regional accent** | British accent, French accent, Russian accent | Specify language, regional accent, and voice style, but expect limited reliability. Use reference audio if exact fidelity matters.\n**Word stress / emphasis / prosody** | Make “never” stronger; pause before a word; say one phrase more firmly | Do not rely only on capital letters. Describe the delivery: pause, pace, volume, firmness, emotional beat, and the exact word to emphasize.\n**Pronunciation of a specific word** | A name, acronym, unusual word, technical term | Rewrite the phrase more clearly, but exact pronunciation may require reference audio, external TTS, or post-production.\n**Wrong speaker / wrong dialogue assignment** | The wrong character speaks; two people speak at once | Use clearer speaker references, shorter dialogue chunks, and check whether the Space/ComfyUI workflow adds or rewrites prompt text.\n\n## LTX-style prompts seem closer to directing a performance\n\nI would not treat this as a universal rule, because prompt behavior ultimately depends on the model weights, training data, text encoder, conditioning design, guidance settings, sampler, seed, and sometimes the UI or workflow around the model.\n\nBut different models do have practical prompting habits.\n\nFor LTX-2.3, the official prompt guide strongly points toward a **cinematic / 
performance-direction style** rather than a short keyword or typography-based style. The guide says LTX-2.3 responds better to long, specific prompts, and it specifically mentions facial expressions, timing, pauses, emotional beats, voice qualities, and breaking dialogue into short phrases with acting directions between them.\n\nUseful references:\n\n  * LTX-2.3 Prompt Guide\n  * LTX Documentation: Prompting Guide\n  * LTX-2.3 Hugging Face model card\n\n\n\nThe LTX documentation also says to clearly describe audio, including ambient sound, music, speech, or singing, and to put spoken dialogue in quotation marks. It also says to specify language and accent if needed.\n\nSo for LTX, I would think less like this:\n\n\n    keyword, keyword, keyword, BIG EMPHASIS, no bad accent\n\n\nand more like this:\n\n\n    A short cinematic close-up. The woman speaks English in a calm but tense voice. She says, \"I will never go back.\" She pauses briefly before \"never\", then says that word more firmly and slightly louder. Her jaw tightens as she finishes the sentence. The audio is crisp close-mic speech with quiet room tone.\n\n\n## Why capital letters are probably not enough\n\nCapital letters are a text convention. They may sometimes help a model infer emphasis, but they do not fully specify speech delivery.\n\nEven in ordinary TTS systems, prosody is often controlled with explicit mechanisms such as pauses, pitch, rate, volume, pronunciation, and emphasis. For example, SSML exists partly because speech synthesis often needs structured controls for pronunciation, volume, pitch, rate, and related properties. Google’s Text-to-Speech SSML documentation similarly describes controls such as prosody, pitch, speaking rate, volume, breaks, and pronunciation-related markup.\n\nLTX-2.3 is not just an ordinary TTS engine, though. It is an audio-video generation model. 
The LTX-2 paper describes LTX-2 as a joint audio-visual model with video and audio streams connected by cross-modal attention. That means the speech can be affected by the scene, character, timing, facial motion, acoustic environment, and video prompt — not only by the quoted text.\n\nSo I would avoid relying on typography alone.\n\n## Better prompt patterns to try\n\n### 1. For word emphasis\n\nInstead of:\n\n\n    The man says: \"This is VERY important.\"\n\n\nTry:\n\n\n    The man speaks slowly and seriously. He says, \"This is very important.\" He places the strongest emphasis on the word \"very\", saying it slightly louder and more firmly than the rest of the sentence.\n\n\nOr:\n\n\n    The man says, \"This is...\" He pauses briefly, then says \"very important\" with a firmer tone and stronger emphasis on \"very\". The audio is clear speech with quiet room tone.\n\n\n### 2. For a pause before a key word\n\nInstead of:\n\n\n    The woman says: \"I will NEVER go back.\"\n\n\nTry:\n\n\n    The woman says softly, \"I will...\" She pauses, looks directly at him, and her jaw tightens. She then says \"never\" more firmly and slightly louder: \"never go back.\" The audio is crisp, close-mic speech with quiet room tone.\n\n\n### 3. For regional accent\n\nInstead of:\n\n\n    The woman speaks with an accent.\n\n\nTry:\n\n\n    The woman speaks English with a light French accent, in a calm conversational voice. She says: \"I do not think we should go back.\" The audio is crisp, with quiet room tone.\n\n\nBut I would not expect perfect control here. Regional accent is not just one simple attribute. It involves pronunciation, rhythm, vowel/consonant patterns, speaking style, and sometimes speaker identity. If the exact accent matters, prompt-only control may not be the best tool.\n\n### 4. 
For clearer dialogue timing\n\nInstead of putting a long line in one block:\n\n\n    She says, \"So anyway, we went to watch the movie, and spoiler, someone dies at the end\", then she laughs.\n\n\nTry splitting it into shorter beats:\n\n\n    The woman speaks in a casual YouTuber-style close-up. She says, \"So anyway, we went to watch the movie.\" She pauses and hides a small laugh. Then she continues, \"And spoiler: someone dies at the end.\" The audio is clear speech with natural room tone.\n\n\nThis style is closer to what the LTX-2.3 prompt guide suggests: short dialogue chunks, acting directions, and clear physical cues.\n\n## A useful mental model\n\nI would use this mental model:\n\n> Capital letters are typography. LTX-style prompting is closer to directing an actor.\n\nSo instead of asking only:\n\n\n    How do I mark the accented word?\n\n\nask:\n\n\n    How should the character perform the line?\n\n\nThen describe:\n\n  * who is speaking\n  * the exact quoted dialogue\n  * the language\n  * regional accent, if relevant\n  * the word to emphasize\n  * pause placement\n  * pace\n  * volume\n  * firmness / softness\n  * emotional beat\n  * facial or body cue\n  * acoustic environment\n\n\n\nFor example:\n\n\n    A tight close-up of a tired man sitting in a quiet kitchen at night. He speaks English in a low, controlled voice. He says, \"I told you...\" He pauses, looks down, then continues more firmly: \"I am not going back.\" He places the strongest emphasis on \"not\", saying it slightly louder and slower than the surrounding words. The audio is crisp close-mic speech with quiet room tone and no dramatic music.\n\n\n## What I would try first\n\nI would test in this order:\n\n  1. **Clarify the meaning of “accent”**\n\n     * Regional accent?\n     * Word stress?\n     * Pronunciation?\n     * Speaker voice?\n  2. **Use positive speech direction**\n\n     * Do not start with negative prompt.\n     * Describe the desired delivery directly.\n  3. 
**Quote the dialogue**\n\n     * Keep spoken words inside quotation marks.\n  4. **Split long dialogue into short beats**\n\n     * Especially if the line contains pauses, laughter, hesitation, or emphasis.\n  5. **Describe the emphasis in natural speech terms**\n\n     * “pauses before the word”\n     * “says it slightly louder”\n     * “says it more firmly”\n     * “slows down on that word”\n     * “voice becomes sharper/softer/tenser”\n  6. **Describe the audio environment**\n\n     * “crisp close-mic speech”\n     * “quiet room tone”\n     * “subtle natural ambience”\n     * “no dramatic music” only if needed, but focus mainly on what you want to hear\n  7. **If exact accent matters, use a stronger control surface**\n\n     * reference audio\n     * audio-to-video\n     * LipDub\n     * external TTS\n     * post-production audio replacement or cleanup\n\n\n\n## If you are using a Space, ComfyUI workflow, or wrapper\n\nOne more thing: if this is not the raw model call, the prompt may be passing through another layer.\n\nFor example, a Space, workflow, prompt enhancer, default text node, or UI wrapper may add or rewrite text before LTX sees it. 
So if the model seems to ignore a very short instruction, it may be worth checking:\n\n  * whether a prompt enhancer is enabled\n  * whether default prompt text is being appended\n  * whether the workflow separates visual prompt and audio prompt\n  * whether the workflow has audio guidance settings\n  * whether the model is actually receiving the exact dialogue you typed\n\n\n\nThis matters because LTX prompting often appears to work best when the final prompt is a complete scene description, not a tiny isolated speech instruction.\n\n## When prompt-only may not be enough\n\nIf you need the voice to have a specific regional accent, exact pronunciation, or consistent speaker identity, I would not rely only on text prompting.\n\nThere are LTX-related routes that are closer to that kind of control, such as:\n\n  * ComfyUI-LTXVideo\n  * LTX LipDub / speaker identity preservation features\n  * ID-LoRA for LTX-2 / LTX-2.3 reference-audio speaker identity transfer\n  * LTX Desktop, which includes audio-to-video and local LTX-2.3 workflows\n\n\n\nThose are probably more relevant when the problem is not just “make this word stronger,” but “make this voice/accent/pronunciation stay consistent.”\n\n## Summary\n\nMy guess is:\n\n  * LTX-2.3 does have built-in audio generation, so this is not automatically an external TTS issue.\n  * But “accent” is ambiguous.\n  * If you mean **word emphasis**, capitalization alone is probably too weak.\n  * Try writing the desired speech as **performance direction**: quoted dialogue, short phrases, pauses, pace, volume, firmness, and the exact word to emphasize.\n  * If you mean **regional accent** or **exact pronunciation**, prompt-only control may be unreliable, and reference audio / audio-to-video / LipDub / post-production may be more realistic.\n  * I would improve the positive prompt first before trying to solve this with the negative prompt.\n\n",
  "title": "Ltx 2.3 problem with dialogs"
}