{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreidw6vyjkz5ysimx7lj4g76wmfg4w7ldiddtelyjqvrojevnwfvvse",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mkvqvawx2lr2"
},
"path": "/t/how-to-make-tts-voice-in-sync-with-the-video/175719#post_1",
"publishedAt": "2026-05-02T21:37:09.000Z",
"site": "https://discuss.huggingface.co",
"textContent": "I’ve been trying to make a video dubbing pipeline i nailed most of the part of processing video and audio but I’m stuck to that voice and video synchronization.\n\nlike when i make a narration of a new language and then convert it to a audio it just doesn’t match like a 7 mins Chinese voice becomes a 5 minutes English voice. and then even if i stretch the audio or make video bit faster it just doesn’t sync with the video.\n\nI’m mainly processing story type videos like recap or some story i don’t need lips syncing but i do need to sync the videos scene with my voice\n\nhere is the process I’m following\n\ntake video → extract the audio → process the audio (remove everything aside from the main voice) → generate text from that voice with proper time stamps → calibrate on a demo story to get the TTS speed like word per second average gap etc. → pass that whole data to a LLM to convert that to English → convert that LLM response to voice using that TTS → merge the video with the new audio\n\nI’m using kokoro for TTS and gemini for LLM whisper for audio to text convertion.\n\nthe main issue is happening in last 2-3 steps LLM gives me response but after converting to audio it doesn’t match the videos length sometimes its either too short or too long which makes it hard to sync with the video.\n\nanyone have experience with similar work what’s the approach that worked for you. is there any way to fix this properly. is there any trick to make a TTS voice make in sync with the video or way to make the LLM give precise amount of words .. anything worth trying ???",
"title": "How to make TTS voice in sync with the video"
}