Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreifc4uhrxebwh322sf4wtjfi32qz2wd72yqoo564eadzzept24ggqi",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mkwlpy7tn2q2"
  },
  "path": "/t/how-to-make-tts-voice-in-sync-with-the-video/175719#post_2",
  "publishedAt": "2026-05-03T06:33:13.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "in the future, it will be easy to do with a single model",
    "IWSLT Automatic Dubbing Track",
    "VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing",
    "WhisperX",
    "Kokoro VOICES.md",
    "Auto-Synced-Translated-Dubs",
    "VideoDubber paper",
    "WhisperX paper",
    "Montreal Forced Aligner",
    "aeneas",
    "Kokoro voice notes",
    "Length-Aware Speech Translation for Video Dubbing",
    "Length-Aware NMT and Adaptive Duration for Automatic Dubbing",
    "VideoDubber",
    "atempo",
    "faster-whisper",
    "Qwen3-ASR-1.7B",
    "Qwen3-ForcedAligner-0.6B",
    "Kokoro-82M",
    "ResembleAI Chatterbox",
    "VoxCPM2",
    "CosyVoice2",
    "XTTS-v2",
    "SoniTranslate",
    "VideoLingo",
    "Bluez-Dubbing",
    "Subdub",
    "Isochrony-Aware Neural Machine Translation for Automatic Dubbing",
    "Jointly Optimizing Translations and Speech Timing to Improve Isochrony in Automatic Dubbing",
    "WhisperX: Time-Accurate Speech Transcription of Long-Form Audio"
  ],
  "textContent": "In a Voice→ASR→Text→TTS pipeline, information related to “time” is lost at the ASR stage… (unless you find a way to explicitly preserve it…)\nI suppose in the future, it will be easy to do with a single model…\n\nOne method that works with most ASR models is to feed the audio into the pipeline after pre-segmenting it by time intervals and then concatenating the segments. Even if individual segments are out of sync, this approach helps prevent the entire sequence from being out of sync:\n\n* * *\n\n# How to make TTS voice sync with video for translated recap/story dubbing\n\nYou are running into a very common automatic-dubbing problem: the target-language narration is not naturally the same duration as the source-language narration. A 7-minute Chinese narration can easily become a 5-minute English narration, or the reverse. Stretching the whole audio file, speeding up the video, or asking an LLM for an exact word count will not reliably fix it.\n\nThe core fix is architectural:\n\n> Do not generate one full translated script, synthesize one full TTS file, and then stretch it to match the video.\n>  Instead, split the source into timed narration segments, generate duration-aware English variants per segment, synthesize each segment separately, measure actual TTS duration, then place each fitted clip back onto the original timeline.\n\nThis is the difference between **global duration matching** and **local scene/beat synchronization**.\n\nUseful background:\n\n  * IWSLT Automatic Dubbing Track frames dubbing around **isochrony** , meaning translated speech should be time-aligned with the original video.\n  * VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing explains why controlling word/character count is weaker than controlling spoken duration.\n  * WhisperX is useful because it combines ASR, VAD, and forced alignment concepts for better timestamps.\n  * Kokoro VOICES.md notes that Kokoro can be weak on very short utterances and can rush on long utterances, so segment size matters.\n  * Auto-Synced-Translated-Dubs is a useful reference for subtitle-timing-based dubbed audio generation.\n\n\n\n* * *\n\n## 1. What is actually going wrong?\n\nYour current pipeline is approximately:\n\n\n    video\n    → extract audio\n    → isolate main voice\n    → ASR with timestamps\n    → calibrate average TTS speed / words per second\n    → send large transcript to LLM for English\n    → generate TTS\n    → stretch/compress/merge with video\n\n\nThe failure is happening because the last part treats dubbing as a **whole-file duration problem**.\n\nBut story/recap dubbing is a **timeline problem**.\n\nA full English TTS file matching the full video length does not guarantee that:\n\n  * the character name is spoken when the character appears,\n  * the twist is explained when the twist is shown,\n  * the pause happens where the original pause happened,\n  * the scene transition lines up with the scene transition,\n  * the important reveal word lands near the visual reveal.\n\n\n\nSo global stretching can make the end time correct while the middle is still wrong.\n\n* * *\n\n## 2. Why exact word count does not solve it\n\nAsking the LLM for a precise word count is not enough.\n\nExample:\n\n\n    \"He ran.\"\n    \"Unbelievable.\"\n    \"She underestimated him.\"\n\n\nThese have very different spoken durations even though the word counts are small.\n\nTTS duration depends on:\n\n  * phonemes,\n  * syllables,\n  * punctuation,\n  * pauses,\n  * voice,\n  * TTS model prosody,\n  * names,\n  * numbers,\n  * sentence structure,\n  * whether the model inserts expressive pauses.\n\n\n\nThe VideoDubber paper specifically argues that word/character-count control is not enough for dubbing because spoken duration varies across languages and tokens.\n\nSo the LLM should not be asked only:\n\n\n    Make this exactly 15 words.\n\n\nIt should be asked:\n\n\n    Make this a natural English narration line that can be spoken in about 4.2 seconds.\n    Return normal, compact, ultra_compact, and expanded versions.\n\n\nThen your code should synthesize the audio and measure the real duration.\n\nThe measured WAV duration is the final truth.\n\n* * *\n\n## 3. Correct target: scene/beat sync, not lip sync\n\nYou said you do not need lip-sync. That makes the problem easier.\n\nThere are four different sync levels:\n\nSync level | Meaning | Needed for your case?\n---|---|---\nGlobal sync | final audio starts/ends with the video | Yes, but not enough\nSegment sync | each narration line fits its local source slot | Yes\nBeat sync | important words land near important visual events | Yes\nLip sync | phonemes match mouth shapes | No\n\nYour target is:\n\n\n    segment sync + story beat sync\n\n\nnot full lip-sync.\n\nFor story recaps, a good dub is one where:\n\n  * “he opened the door” happens near the door opening,\n  * “she was the traitor” happens near the reveal,\n  * “three years later” lands near the time-skip card,\n  * the narration resets when the scene changes,\n  * pauses still feel natural.\n\n\n\n* * *\n\n## 4. Recommended architecture\n\nUse this pipeline instead:\n\n\n    video\n    → extract audio\n    → isolate speech for ASR\n    → keep music/SFX bed separately\n    → ASR + VAD + timestamps\n    → clean into dubbing-friendly segments\n    → optionally add scene/shot boundaries\n    → LLM generates timed English variants per segment\n    → TTS each segment separately\n    → measure each TTS clip duration\n    → choose / rewrite / speed-adjust / pad\n    → place each clip at the original segment timestamp\n    → mix with music/SFX bed\n    → export final video\n\n\nThe key difference:\n\n\n    Wrong:\n    one translated script → one TTS file → global stretch\n\n    Right:\n    many timestamped segments → many fitted TTS clips → timeline overlay\n\n\n* * *\n\n## 5. Build around a segment table\n\nUse one central data structure for the whole pipeline.\n\nExample:\n\n\n    [\n      {\n        \"id\": 42,\n        \"start\": 183.20,\n        \"end\": 187.60,\n        \"target_duration\": 4.40,\n        \"source_text\": \"<source text>\",\n        \"scene_note\": \"<the woman reveals the letter>\",\n        \"importance\": \"high\",\n        \"can_start_early_ms\": 100,\n        \"can_end_late_ms\": 250,\n        \"english_candidates\": {\n          \"normal\": \"\",\n          \"compact\": \"\",\n          \"ultra_compact\": \"\",\n          \"expanded\": \"\"\n        },\n        \"chosen_text\": \"\",\n        \"tts_duration\": null,\n        \"fit_action\": \"\"\n      }\n    ]\n\n\nThis makes debugging much easier.\n\nInstead of only knowing:\n\n\n    The final English audio is 2 minutes too short.\n\n\nyou can know:\n\n\n    segment 42 is 38% too long\n    segment 57 is too short but can be padded\n    segment 91 overlaps the next scene\n    segment 103 needs a shorter rewrite\n\n\nThat is the difference between guessing and engineering.\n\n* * *\n\n## 6. How to create good segments\n\nRaw Whisper segments are not always good dubbing segments. You need to convert them into **TTS-friendly, scene-aware narration blocks**.\n\nUse three types of boundaries.\n\n### Speech boundaries\n\nThese come from ASR/VAD:\n\n  * speech starts,\n  * speech ends,\n  * pause begins,\n  * pause ends.\n\n\n\nTools to study:\n\n  * WhisperX\n  * WhisperX paper\n  * Montreal Forced Aligner\n  * aeneas\n\n\n\n### Semantic boundaries\n\nThese are story boundaries:\n\n  * one plot point,\n  * one reveal,\n  * one joke,\n  * one explanation,\n  * one transition.\n\n\n\n### Visual boundaries\n\nThese come from the video:\n\n  * scene cut,\n  * character appears,\n  * object appears,\n  * title card appears,\n  * fight starts,\n  * flashback begins.\n\n\n\nFor story/recap videos, visual boundaries matter. If the source says “then he opened the door” after the door is already open, the dub feels late even if the audio duration is technically correct.\n\n* * *\n\n## 7. Segment length rules for Kokoro\n\nKokoro is fast and practical, but its segment size matters.\n\nThe Kokoro voice notes mention:\n\n  * weakness on very short utterances, especially under about 10–20 tokens,\n  * rushing on long utterances, especially over 400 tokens,\n  * possible mitigation by bundling short utterances, chunking long utterances, or adjusting speed.\n\n\n\nFor recap/story narration, start with:\n\nSegment duration | Recommendation\n---|---\nUnder 1 sec | Usually too short; merge if possible\n1–2 sec | Use only for punchy dramatic beats\n3–8 sec | Best range for most narration\n8–12 sec | Good for slower explanation\n12–15 sec | Usable but monitor\n15+ sec | Usually split\n\nA good practical default:\n\n\n    Most segments: 3–8 seconds\n    Rare short beats: 1.5–3 seconds\n    Rare long explanations: 8–12 seconds\n\n\nAvoid both extremes:\n\n  * too many tiny clips → choppy, unstable TTS,\n  * huge paragraphs → hard to fit, rushed, bad timing.\n\n\n\n* * *\n\n## 8. The most useful trick: generate multiple timing variants\n\nDo not ask Gemini for one translation.\n\nAsk for several spoken variants per segment:\n\n\n    {\n      \"id\": 42,\n      \"normal\": \"At that moment, he realized she had been lying to him all along.\",\n      \"compact\": \"Then he realized she had been lying.\",\n      \"ultra_compact\": \"She had lied all along.\",\n      \"expanded\": \"At that moment, he finally realized she had been lying to him from the beginning.\",\n      \"must_keep\": [\"she lied\", \"he realizes it now\"],\n      \"can_drop\": [\"from the beginning\", \"emotional emphasis\"]\n    }\n\n\nThen synthesize candidates and measure real duration.\n\nExample:\n\nCandidate | TTS duration | Target slot | Decision\n---|---|---|---\nnormal | 5.8s | 3.9s | Too long\ncompact | 4.2s | 3.9s | Good with slight speedup\nultra_compact | 2.6s | 3.9s | Too short unless pause works\nexpanded | 6.5s | 3.9s | Reject\n\nThis matches the direction of length-aware dubbing research:\n\n  * Length-Aware Speech Translation for Video Dubbing\n  * Length-Aware NMT and Adaptive Duration for Automatic Dubbing\n  * VideoDubber\n\n\n\n* * *\n\n## 9. Better Gemini prompt\n\nUse Gemini as a **timed adaptation model** , not only a translator.\n\n\n    You are adapting narration for an English TTS dub of a story recap video.\n\n    The goal is not literal translation.\n    The goal is natural English narration that fits the original video timing.\n\n    Rules:\n    1. Preserve the segment id.\n    2. Do not merge segments.\n    3. Do not move plot information across segment boundaries.\n    4. Preserve character names and plot-critical facts.\n    5. Use short spoken English.\n    6. Avoid long clauses.\n    7. If the time slot is short, compress less important detail.\n    8. If the time slot is long, use natural pacing but do not add new facts.\n    9. Output valid JSON only.\n\n    For each segment, return:\n    {\n      \"id\": number,\n      \"normal\": \"natural spoken English\",\n      \"compact\": \"shorter version\",\n      \"ultra_compact\": \"shortest acceptable version\",\n      \"expanded\": \"slightly fuller version if TTS is too short\",\n      \"must_keep\": [\"critical facts\"],\n      \"can_drop\": [\"details that may be removed if compact\"]\n    }\n\n    Duration guide:\n    - Under 2 sec: 2–6 words\n    - 2–4 sec: 5–12 words\n    - 4–7 sec: 10–22 words\n    - 7–10 sec: 18–35 words\n    - 10+ sec: one or two short sentences\n\n    Input:\n    [\n      {\n        \"id\": 42,\n        \"start\": 183.20,\n        \"end\": 187.60,\n        \"duration_sec\": 4.40,\n        \"source_text\": \"<source text>\",\n        \"context_before\": \"<previous context>\",\n        \"context_after\": \"<next context>\",\n        \"visual_note\": \"<the woman reveals the letter>\",\n        \"importance\": \"high\"\n      }\n    ]\n\n\nThe word guide is only a rough hint. The actual TTS duration is the final authority.\n\n* * *\n\n## 10. Duration fitting policy\n\nFor every generated TTS clip:\n\n\n    target = source_end - source_start\n    actual = generated_tts_duration\n    ratio = actual / target\n\n\nUse this decision table:\n\nRatio | Meaning | Best action\n---|---|---\n0.90–1.05 | Good fit | Keep; pad tiny gap if needed\n0.75–0.90 | Short | Add silence or try expanded version\n0.60–0.75 | Very short | Use expanded version or add deliberate pause\n1.05–1.15 | Slightly long | Small speedup or small tempo correction\n1.15–1.30 | Long | Try compact version first\n1.30–1.60 | Too long | Rewrite/compress\n>1.60 | Bad fit | Re-segment or mark for manual review\n\nImportant rule:\n\n\n    Rewrite text first.\n    Use TTS speed second.\n    Use small audio tempo correction third.\n    Pad silence when short.\n    Manual-review extreme failures.\n\n\nDo not do:\n\n\n    literal translation\n    → huge time-stretch\n\n\n* * *\n\n## 11. When to stretch audio\n\nUse audio stretching only for small mismatches.\n\nFFmpeg’s atempo filter changes audio tempo. It can be useful, but for dubbing the real limit is not the technical maximum. The real limit is naturalness.\n\nPractical limits:\n\nAdjustment | Usually okay?\n---|---\n0.95×–1.05× | Safe\n0.90×–1.10× | Usually fine\n0.85×–1.20× | Sometimes acceptable\nBelow 0.85× | Often sounds dragged\nAbove 1.20× | Often sounds rushed\nAbove 1.30× | Rewrite instead\n\nExample:\n\n\n    ffmpeg -i <input.wav> -filter:a \"atempo=1.10\" <output.wav>\n\n\nIf your clip is 6.5 seconds and the target is 3.8 seconds, do not speed it up by 1.71×. Rewrite it.\n\n* * *\n\n## 12. How to handle long vs short TTS\n\n### If TTS is too long\n\nExample:\n\n\n    target: 4.0s\n    generated: 5.7s\n    ratio: 1.43\n\n\nBad fix:\n\n\n    speed up to 1.43×\n\n\nBetter fix:\n\n  * ask Gemini for a compact version,\n  * remove adjectives,\n  * remove repeated context,\n  * preserve only the plot-critical beat.\n\n\n\nExample:\n\n\n    Too long:\n    At that moment, he finally understood that the woman standing in front of him had secretly planned everything from the start.\n\n    Better:\n    Then he realized she planned everything.\n\n\n### If TTS is too short\n\nExample:\n\n\n    target: 5.0s\n    generated: 3.4s\n    ratio: 0.68\n\n\nBad fix:\n\n\n    slow it down heavily\n\n\nBetter fixes:\n\n  * try expanded variant,\n  * add silence after the line,\n  * add silence before a reveal,\n  * add a natural pause between clauses.\n\n\n\nExample:\n\n\n    Short:\n    She was the traitor.\n\n    Expanded:\n    Only then did he realize she was the traitor.\n\n\nFor story videos, silence is often natural. A small pause before a reveal can improve drama.\n\n* * *\n\n## 13. Timeline assembly\n\nDo not concatenate TTS clips.\n\nCreate a silent audio track with the same duration as the video, then overlay each fitted clip at its original timestamp.\n\nConceptually:\n\n\n    final_audio = silence(video_duration)\n\n    for segment in segments:\n        clip = generated_tts(segment[\"chosen_text\"])\n        clip = fit_to_segment(clip, segment)\n        final_audio.overlay(clip, position=segment[\"start\"])\n\n\nThis preserves the original timeline.\n\nIf two clips overlap, do not blindly mix them. Overlap means something needs fixing:\n\n  * translation too verbose,\n  * segment boundary wrong,\n  * TTS speed too slow,\n  * adjacent segments should be merged,\n  * current segment should be rewritten,\n  * visual beat needs manual review.\n\n\n\n* * *\n\n## 14. Chinese → English adaptation tips\n\nChinese-to-English dubbing has special issues.\n\n### English often needs explicit connectors\n\nChinese narration can be compact and context-heavy. English often adds words like:\n\n\n    but then\n    because of this\n    at that moment\n    meanwhile\n    three years later\n\n\nThese improve clarity but add duration.\n\nPrompt the LLM:\n\n\n    Use connectors only when needed.\n    Prefer short spoken English.\n    Avoid long clauses.\n    Do not add explanation that is not necessary for this scene.\n\n\n### Use short glossary terms\n\nChinese story videos often contain relationship terms, fantasy titles, cultivation terms, names, or recurring labels. These can become too long in English.\n\nChinese term | Long English | Better timed version\n---|---|---\n师兄 | his senior martial brother | senior brother\n魔尊 | the supreme demon lord | Demon Lord\n灵根 | spiritual root | spirit root\n三年后 | after three years had passed | three years later\n\nUse a glossary:\n\n\n    {\n      \"师兄\": \"senior brother\",\n      \"魔尊\": \"Demon Lord\",\n      \"灵根\": \"spirit root\",\n      \"三年后\": \"three years later\"\n    }\n\n\nThis improves consistency and reduces duration.\n\n### Compress phrasing, not plot facts\n\nBad compression removes story meaning. Good compression removes extra phrasing.\n\nExample:\n\n\n    Literal:\n    He never expected that everything that had happened until now was actually only one small part of her carefully prepared plan.\n\n    Timed:\n    It was all part of her plan.\n\n\nThe second version is better if the scene slot is short.\n\n* * *\n\n## 15. Timestamping options\n\nWhisper is fine as a baseline, but for dubbing you must evaluate timestamp quality, not only transcript quality.\n\nCheck:\n\n  * segment start accuracy,\n  * segment end accuracy,\n  * pause detection,\n  * word timing near important beats.\n\n\n\nUseful options:\n\n  * WhisperX\n  * faster-whisper\n  * Qwen3-ASR-1.7B\n  * Qwen3-ForcedAligner-0.6B\n  * Montreal Forced Aligner\n\n\n\nFor Chinese-heavy content, Qwen3-ASR + Qwen3-ForcedAligner is worth testing. The Qwen3-ASR model card describes Qwen3-ForcedAligner as supporting timestamp prediction for arbitrary units within up to 5 minutes of speech in 11 languages.\n\nRecommended test:\n\n\n    Take 10 representative videos.\n    Run Whisper / faster-whisper.\n    Run WhisperX.\n    Run Qwen3-ASR + Qwen3-ForcedAligner.\n    Compare 30–50 important scene boundaries manually.\n    Choose based on timestamp quality, not only text accuracy.\n\n\n* * *\n\n## 16. TTS model thoughts\n\n### Keep Kokoro first\n\nKokoro-82M is a good baseline because it is fast, small, and practical. That matters because your sync strategy may generate multiple TTS candidates per segment.\n\nIf a 7-minute video has 180 segments and you generate 3 variants each, that can be hundreds of TTS generations. A fast TTS model is useful.\n\nUse Kokoro like this:\n\n\n    good:\n    Kokoro per segment\n    → measure duration\n    → choose / rewrite / adjust / pad\n\n    bad:\n    one huge Kokoro output\n    → global stretch\n\n\n### Test other TTS models only after timing works\n\nA better voice will not fix bad segment logic.\n\nAfter the sync loop works, compare:\n\n  * ResembleAI Chatterbox\n  * VoxCPM2\n  * CosyVoice2\n  * XTTS-v2\n\n\n\nChatterbox is interesting because its model card includes pacing-related controls such as `cfg` and `exaggeration`. VoxCPM2 is interesting for higher-quality multilingual TTS and voice design. But do not switch models before fixing the segment-level timing pipeline.\n\n* * *\n\n## 17. Projects and papers worth studying\n\n### Practical projects\n\n  * Auto-Synced-Translated-Dubs\nGood reference for subtitle-timing-based translation and dubbed audio.\n\n  * SoniTranslate\nLarger synchronized video translation/dubbing project with useful real-world issues.\n\n  * VideoLingo\nUseful for subtitle segmentation, translation, alignment, and dubbing workflow.\n\n  * Bluez-Dubbing\nUseful for modular dubbing, source separation, VAD-based alignment, and sync strategy ideas.\n\n  * Subdub\nUseful for subtitle-to-dub CLI workflow.\n\n\n\n\n### Papers / research\n\n  * VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing\n  * Isochrony-Aware Neural Machine Translation for Automatic Dubbing\n  * Jointly Optimizing Translations and Speech Timing to Improve Isochrony in Automatic Dubbing\n  * Length-Aware NMT and Adaptive Duration for Automatic Dubbing\n  * Length-Aware Speech Translation for Video Dubbing\n  * WhisperX: Time-Accurate Speech Transcription of Long-Form Audio\n\n\n\n* * *\n\n## 18. Recommended stack for your case\n\n### Baseline stack\n\n\n    Whisper / faster-whisper\n    → Gemini timed adaptation\n    → Kokoro\n    → per-segment duration fitting\n    → timeline overlay\n\n\nThis is the first thing I would build.\n\n### Chinese-focused timestamp stack\n\n\n    Qwen3-ASR-1.7B\n    → Qwen3-ForcedAligner-0.6B\n    → Gemini timed adaptation\n    → Kokoro\n    → per-segment duration fitting\n\n\nUse this if Whisper timestamps are weak on Chinese/Cantonese/dialect-heavy content.\n\n### Better TTS experiment\n\n\n    Whisper or Qwen3-ASR\n    → Gemini timed adaptation\n    → Chatterbox or VoxCPM2\n    → same duration fitting logic\n\n\nOnly test this after the Kokoro timing loop works.\n\n* * *\n\n## 19. Concrete implementation order\n\n### Step 1: Stop full-video TTS\n\nChange:\n\n\n    one translated script → one TTS file\n\n\nto:\n\n\n    many timestamped segments → many TTS clips\n\n\n### Step 2: Add candidate variants\n\nFor each segment, generate:\n\n\n    normal\n    compact\n    ultra_compact\n    expanded\n\n\n### Step 3: Measure real audio duration\n\nAfter TTS:\n\n\n    duration = actual WAV duration\n\n\nDo not trust estimated words-per-second.\n\n### Step 4: Add fitting logic\n\nUse:\n\n\n    rewrite if too long\n    pad if too short\n    small speed correction if close\n    manual review if extreme\n\n\n### Step 5: Overlay clips by timestamp\n\nDo not concatenate. Place each clip at the original segment start.\n\n### Step 6: Generate a QA report\n\nExample:\n\n\n    {\n      \"video_duration\": 420.0,\n      \"segments_total\": 184,\n      \"segments_good\": 132,\n      \"segments_padded\": 31,\n      \"segments_speed_corrected\": 14,\n      \"segments_rewritten\": 7,\n      \"overlaps\": 2,\n      \"manual_review\": [44, 91]\n    }\n\n\nFlag segments where:\n\n\n    duration ratio > 1.25\n    duration ratio < 0.70\n    speed correction > 1.20x\n    overlap > 150 ms\n    segment contains visual reveal\n    segment has title/name/date/number\n\n\nThis lets you review only problematic segments instead of repeatedly watching the whole video.\n\n* * *\n\n## 20. What I would not do\n\n### Do not rely on exact word counts\n\nWord count is too weak. Use duration budgets plus measured TTS duration.\n\n### Do not globally stretch the final audio\n\nIt fixes the ending, not the middle.\n\n### Do not globally change video speed\n\nThis may be acceptable for lectures/tutorials, but for story/recap edits it usually damages pacing.\n\n### Do not over-split into tiny clips\n\nKokoro can be weaker on very short utterances. Merge tiny fragments when they belong to the same scene.\n\n### Do not use aggressive tempo correction\n\nRewrite the text instead.\n\n* * *\n\n## 21. The actual trick\n\nThe practical trick is:\n\n> Generate multiple English versions per timestamped segment, synthesize them, measure the real audio duration, and choose or rewrite until each segment fits.\n\nNot:\n\n\n    calculate one average words-per-second\n\n\nNot:\n\n\n    stretch the final audio\n\n\nNot:\n\n\n    ask the LLM for exactly 22 words\n\n\nThe real loop is:\n\n\n    target segment duration\n    → LLM generates variants\n    → TTS generates audio\n    → code measures duration\n    → choose / rewrite / speed / pad\n    → place on timeline\n\n\nThat is the method I would build.\n\n* * *\n\n## Short summary\n\n  * Your issue is mostly pipeline design, not simply Kokoro/Gemini/Whisper.\n  * Use per-segment timed adaptation, not full-script translation.\n  * Use a central segment table with `start`, `end`, `duration`, `source_text`, `visual_note`, and candidate English lines.\n  * Ask Gemini for `normal`, `compact`, `ultra_compact`, and `expanded` variants.\n  * Generate Kokoro TTS per segment and measure actual audio duration.\n  * If too long, rewrite shorter. If slightly long, speed up a little. If too short, pad or use expanded text.\n  * Overlay clips at original timestamps instead of concatenating them.\n  * Keep Kokoro for now; improve the architecture first.\n  * Test WhisperX or Qwen3-ASR + Qwen3-ForcedAligner if timestamps are weak.\n  * Test Chatterbox/VoxCPM2 only after segment-level sync works.\n\n",
  "title": "How to make TTS voice in sync with the video"
}