{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreifxtrwqcj33hmubpqloc2xskeyrkascrd4dijbwacwmkirpwmmoty",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mo2h4i7duaq2"
  },
  "path": "/t/ltx-2-3-problem-with-dialogs/176649#post_4",
  "publishedAt": "2026-06-11T22:39:30.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "audio-to-video API support",
    "LTX-2.3",
    "audio-to-video API",
    "Audio-to-Video capability page",
    "Kijai/LTX2.3_comfy discussion",
    "LTX-2.3 Image Audio to Video workflow",
    "Wav2Lip",
    "MuseTalk",
    "VideoReTalking"
  ],
  "textContent": "Ah, so that is the kind of “accent” you mean. That clarification helps. In that case, whether this is controllable depends quite a lot on LTX itself, but I think the practical answer is roughly as follows. Also, the fact that Suno, LTX, and other models have different “dialects” for this kind of control is actually a pretty fundamental issue:\n\n* * *\n\n## Short answer\n\nIf the goal is **correct Russian word stress**, I would treat this less as a normal “negative prompt” problem and more as a **pronunciation-control problem**.\n\nFor LTX, I would split the answer into several routes:\n\nRoute | My rough expectation\n---|---\n**Text prompt only** | Worth testing, but probably unreliable for exact Russian stress\n**Text prompt with stress mark + syllable hint + explicit pronunciation instruction** | Better, but still not guaranteed\n**Generate/record correct Russian audio first, then use LTX audio-to-video / image-audio-to-video / custom-audio I2V** | Much more promising\n**Final mux/post-production with verified audio** | Most reliable for pronunciation correctness\n\nSo the practical summary is:\n\n> Do not treat capital letters as a universal pronunciation API.\n> For LTX text-only prompting, try stress marks, syllable hints, and explicit delivery instructions.\n> But if Russian stress must be correct, make the correct audio first and use LTX as an audio-driven video generator.\n\nThat audio-first route is not just theoretical. LTX has official audio-to-video API support, and the LTX/ComfyUI community already has workflows around **custom audio**, **image-audio-to-video**, **Qwen/Fish-style TTS**, and **voice-driven talking video**.\n\n## Why Suno and LTX can behave differently\n\nThis is the core issue.\n\nDifferent generative models learn different informal “control languages.”\n\nSuno is strongly connected to music, lyrics, song structure, vocal emphasis, line breaks, and lyric formatting. 
In that world, things like:\n\n\n    ALL CAPS\n    line breaks\n    repeated syllables\n    elongated words\n    [spoken]\n    [whispered]\n    [chorus]\n\n\ncan become useful signals because the model has likely seen many examples where typography and lyric formatting correlate with vocal delivery.\n\nLTX is different. LTX-2.3 is described as a diffusion-based **audio-video foundation model** that generates synchronized video and audio in a single model. It is not simply a lyrics-to-song model or a normal standalone TTS engine. Its speech is entangled with:\n\n  * the person on screen\n  * mouth movement\n  * facial expression\n  * camera motion\n  * scene timing\n  * environment\n  * emotion\n  * ambient sound\n  * video/audio synchronization\n\n\n\nSo the same trick can work differently:\n\nModel family | Likely stronger signal\n---|---\nSong / lyric model | Typography, line breaks, lyric structure, repeated syllables\nOrdinary TTS | text normalization, lexicon, SSML, phonemes, voice settings\nLTX-style audio-video model | scene description, quoted dialogue, audio prompt, timing, character action, reference image/audio\nLip-sync model | input audio waveform, face crop/identity, mouth motion constraints\n\nThat is why this is not just a “bad prompt” issue. The model may simply not have learned the same convention that Suno learned.\n\n## Russian stress is especially difficult\n\nRussian word stress is not a simple typographic effect.\n\nIn normal Russian writing, stress marks are usually omitted. The model often sees a plain word and must infer where the stress should be. That can depend on lexical knowledge, word form, context, and training data coverage.\n\nFor example, learning materials may write stress like this:\n\n\n    предста́вь\n    вообрази́\n\n\nBut ordinary text usually does not include those marks. So if the model sees:\n\n\n    представь\n\n\nit has to know the pronunciation from memory. 
If it does not know it reliably, capitalization may not fix the problem.\n\nThis is very different from simply saying:\n\n\n    Say this word louder.\n\n\nRussian lexical stress is closer to:\n\n\n    Pronounce this specific word with the correct stressed vowel.\n\n\nThat is why a text-only video model may be inconsistent: sometimes it knows the word, sometimes it guesses, sometimes it follows visual/timing constraints more strongly than the intended stress hint.\n\n## Text-only LTX prompt: what is still worth trying\n\nI would still test text-only prompting, but I would treat it as experimental.\n\nInstead of only doing:\n\n\n    предстАвь\n\n\nI would try redundant hints:\n\n\n    A close-up of one man in a quiet room, speaking directly to the camera. He speaks Russian slowly and clearly. He says only one word: \"предста́вь\". The stress is on the vowel \"а\", pronounced пред-СТАВЬ. The stressed syllable is slightly longer and louder than the others. The audio is crisp close-mic Russian speech with quiet room tone and no music.\n\n\nThis gives the model several different signals:\n\nSignal type | Example\n---|---\nNormal Russian spelling | `представь`\nStress mark | `предста́вь`\nSyllable/stress hint | `пред-СТАВЬ`\nNatural-language explanation | “The stress is on the vowel `а`”\nAcoustic explanation | “slightly longer and louder”\nDelivery instruction | “slowly and clearly”\nScene simplification | one speaker, quiet room, close-up\n\nThis may still fail, but it is a more LTX-like prompt than just capitalizing one letter.\n\n## Why negative prompt probably will not solve it\n\nI would not start with:\n\n\n    negative prompt: wrong stress, bad Russian pronunciation, incorrect accent\n\n\nThat does not tell the model what the correct pronunciation is.\n\nFor this kind of problem, a positive target is more useful:\n\n\n    The stress is on the vowel \"а\": пред-СТАВЬ.\n    The stressed syllable is slightly longer and louder.\n    He says the word slowly and clearly 
in Russian.\n\n\nA negative prompt can suppress broad unwanted things. But Russian word stress is not a broad unwanted artifact. It is a specific pronunciation target.\n\n## The more promising route: generate the Russian audio first\n\nFor exact Russian stress, I would probably move the pronunciation problem out of the LTX text prompt.\n\nA better production-style route is:\n\n\n    Russian-capable TTS / voice clone / human recording\n            ↓\n    verify stress and pronunciation\n            ↓\n    LTX audio-to-video / image-audio-to-video / custom-audio I2V\n            ↓\n    optional mux/post-production with verified audio\n\n\nThis is not just a workaround. It matches LTX’s strengths better.\n\nLTX’s audio-to-video API is explicitly designed to generate video driven by an audio track. The documentation says you can supply dialogue, music, or ambient sound, and the model produces visuals synchronized to the audio. LTX’s Audio-to-Video capability page also describes audio as the primary conditioning signal, where voice, music, and sound drive motion, pacing, and scene structure.\n\nThat is exactly the kind of control surface you want when pronunciation matters.\n\nInstead of asking LTX:\n\n\n    Please infer the correct Russian stress from text.\n\n\nyou give it:\n\n\n    Here is the already-correct Russian audio. 
Generate the visual performance around it.\n\n\nThat is a much stronger signal.\n\n## Practical precedent: people are already doing this\n\nThere are already LTX community workflows that look very close to this route.\n\nFor example:\n\n  * In a Kijai/LTX2.3_comfy discussion, RuneXX shared an **I2V & T2V with Custom Audio** workflow described as “Use your own audio files with lip sync, and synced motion.”\n  * In the same ecosystem, users discuss workflows where they first generate or clone voice with **Qwen TTS**, then use **LTX I2V with custom audio**.\n  * Comfy has an LTX-2.3 Image Audio to Video workflow where a portrait image and audio file are used to create a lip-synced talking video.\n  * There are also LTX/Comfy workflows combining **Qwen TTS**, **Fish Audio**, or other TTS/voice-cloning tools with LTX image/audio video generation.\n\n\n\nSo the route is not only theoretical. This pattern:\n\n\n    TTS / voice clone / verified recording\n            ↓\n    LTX custom audio / IA2V / A2V\n            ↓\n    talking video\n\n\nis already in practical use in the LTX 2.3 ComfyUI ecosystem.\n\n## Important caveat: custom audio is not always automatic lip-sync\n\nI would still be careful.\n\nProviding custom audio does not always guarantee perfect lip-sync. Some users report cases where the audio is present in the output but does not properly drive the mouth, or behaves more like voice-over narration.\n\nA useful practical trick from the LTX ComfyUI ecosystem is:\n\n> Provide both the audio file and a transcript/description of the spoken line in the prompt.\n\nFor example, if your audio says:\n\n\n    Предста́вь, что это правда.\n\n\nthen the prompt should not only say:\n\n\n    A man speaks Russian.\n\n\nIt should say something like:\n\n\n    A close-up of a man speaking directly to the camera in Russian. He says: \"Предста́вь, что это правда.\" His mouth movements are synchronized to the provided audio. 
The scene is quiet, with clear close-mic speech and no music.\n\n\nThis helps the model understand that the audio is meant to be **character dialogue** , not just background narration or ambience.\n\n## Recommended audio-first workflow\n\nIf I were trying to get correct Russian stress in LTX, I would test this pipeline.\n\n### Step 1 — Make the Russian audio outside LTX\n\nUse one of:\n\n  * a Russian-capable TTS\n  * a voice-cloning TTS\n  * a human recording\n  * a manually edited recording\n  * a TTS system with SSML/phoneme/lexicon controls, if available\n\n\n\nThe important point is: verify the pronunciation before giving it to LTX.\n\n### Step 2 — Keep the audio simple\n\nFor the first test:\n\n  * one speaker\n  * short phrase\n  * no background music\n  * no echo\n  * no heavy reverb\n  * clean volume\n  * clear Russian speech\n\n\n\nDo not start with a long dramatic scene.\n\n### Step 3 — Use LTX audio-to-video or image-audio-to-video\n\nUse the verified audio as the main conditioning signal.\n\nIf using an image:\n\n  * visible face\n  * visible mouth\n  * not too stylized\n  * not too side-profile\n  * stable lighting\n  * one speaker only\n\n\n\n### Step 4 — Put the transcript in the prompt\n\nExample:\n\n\n    A close-up portrait of one man speaking Russian directly to the camera. He says: \"Предста́вь, что это правда.\" His mouth movements are synchronized to the provided audio. The delivery is calm and clear. 
The audio is close-mic Russian speech with quiet room tone.\n\n\n### Step 5 — Check whether LTX preserves or changes the audio\n\nDepending on the workflow, the generated output audio may not be exactly the same as your verified source audio.\n\nIf the pronunciation is correct in the source audio but degraded in the output, then simply mux the verified audio back into the final video.\n\nConceptually:\n\n\n    generated_video.mp4 + verified_russian_audio.wav -> final_video.mp4\n\n\n### Step 6 — If lip-sync is weak, simplify before changing models\n\nTry:\n\n  * shorter clip\n  * clearer face\n  * more frontal portrait\n  * less camera motion\n  * no second speaker\n  * no music\n  * transcript in prompt\n  * different seed\n  * different workflow version\n\n\n\nOnly after that would I move to a dedicated lip-sync model.\n\n## When to use dedicated lip-sync tools\n\nIf LTX gives a good video but poor mouth movement, a dedicated lip-sync step may be better.\n\nTools in this category include:\n\n  * Wav2Lip\n  * MuseTalk\n  * VideoReTalking\n\n\n\nThese tools solve a narrower problem:\n\n\n    given video face + given audio -> lip-synced face video\n\n\nThat is narrower than LTX’s job:\n\n\n    scene + character + motion + audio + camera + style -> full audiovisual generation\n\n\nSo if the visual scene is already good and only the mouth timing is wrong, a dedicated lip-sync tool may be the better final step.\n\n## Audio-to-audio / voice conversion\n\nAudio-to-audio or voice conversion may also be useful, but I would separate it from pronunciation correction.\n\nVoice conversion is useful when the issue is:\n\n  * voice identity\n  * timbre\n  * speaker style\n  * accent color\n  * emotional tone\n  * making one generated voice sound more like another voice\n\n\n\nBut for Russian lexical stress, I would not rely on voice conversion as the main fix.\n\nIf the source audio has the wrong stress, a voice converter may preserve the wrong stress. 
It may change the voice color while keeping the same pronunciation error.\n\nSo for this problem, I would prioritize:\n\n\n    correct Russian audio first\n\n\nthen use voice conversion only if needed:\n\n\n    correct Russian audio\n            ↓\n    optional voice conversion / voice cloning\n            ↓\n    LTX audio-to-video / lip-sync\n\n\n## Why this is better than forcing text prompting\n\nThe reason is simple:\n\nTask | Best control signal\n---|---\nCorrect Russian stress | verified Russian audio\nSpeaker voice identity | reference audio / voice clone\nFace and video generation | LTX image/video prompt\nMouth timing | audio-driven generation or lip-sync\nFinal audio correctness | mux verified audio back in\n\nText prompt alone asks LTX to solve too many tasks at once:\n\n\n    Read Russian correctly\n    infer word stress\n    generate voice\n    generate face\n    generate mouth motion\n    generate scene\n    align audio and video\n    follow camera/action prompt\n\n\nAudio-first separates the tasks:\n\n\n    TTS/recording handles pronunciation.\n    LTX handles audiovisual performance.\n    Post-production handles final audio correctness.\n\n\nThat is usually more controllable.\n\n## Minimal LTX text-only fallback\n\nIf you still want to test text-only LTX first, I would use a minimal diagnostic prompt:\n\n\n    A close-up of one man in a quiet room, speaking directly to the camera. He speaks Russian slowly and clearly. He says only one word: \"предста́вь\". The stress is on the vowel \"а\", pronounced пред-СТАВЬ. The stressed syllable is slightly longer and louder than the others. The audio is crisp close-mic Russian speech with quiet room tone and no music.\n\n\nIf that works, gradually add complexity.\n\nIf that fails, I would not spend too much time on capitalization tricks. 
I would switch to the audio-first route.\n\n## Suggested production route\n\nFor your case, I would rank the routes like this:\n\nRank | Route | Why\n---|---|---\n1 | Russian TTS / recording → verify stress → LTX audio-to-video or IA2V | Best match for pronunciation-sensitive generation\n2 | Russian TTS / recording → LTX video → mux verified audio back | Best if LTX changes/degrades audio\n3 | Russian TTS / recording → LTX video → dedicated lip-sync cleanup | Best if mouth movement is weak\n4 | Text-only prompt with stress mark + syllable hints | Worth trying, but not robust\n5 | Negative prompt | Probably least useful for this exact problem\n\n## My current guess\n\nMy guess is:\n\n  * Suno may respond to capitalization because it has learned lyric/music formatting conventions.\n  * LTX text-only prompting may not treat capitalization inside Russian words as a reliable pronunciation-control marker.\n  * Russian stress is hard because it is usually not written in ordinary spelling.\n  * LTX is especially complicated because speech is generated together with video, face motion, timing, and scene context.\n  * The best practical route is to stop making LTX infer the pronunciation from text.\n  * Create the correct Russian audio first, then use LTX’s audio-driven workflow.\n\n\n\nSo I would say:\n\n> If exact Russian stress matters, do not make typography carry the whole burden.\n>  Use text-only LTX prompting as a quick experiment, but use verified audio as the real control signal.",
  "title": "Ltx 2.3 problem with dialogs"
}