How to sync dual-channel transcripts via OpenAI Whisper (VAD silence stripping destroys absolute timestamps)
Honestly, your split-channel approach is already pretty solid; the timestamp and speaker reconstruction logic looks mostly correct to me.
The bigger issue feels more like Whisper struggling with:
- narrowband 8kHz telephony audio
- overlapping speech/interruption handling
- context reconstruction across independently transcribed channels
A few things stand out from your output though:
- “Policy Test” instead of “caller side test”
- numbers normalized weirdly
- sentence continuation split oddly across timestamps
- interruption timing drift around the “blue” overlap
So that usually looks more like ASR inference limitations than AGI/Asterisk sync problems.
One thing I’d seriously test: upsample the audio to 16kHz before transcription (sox or ffmpeg). Even though no new information is created, Whisper tends to behave noticeably better on resampled telephony audio.
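A rough sketch of the upsampling step, driving ffmpeg from Python. The function names and file names are made up for illustration, and I'm assuming each leg is already its own mono WAV file:

```python
import subprocess
from typing import List


def build_upsample_cmd(src: str, dst: str, rate: int = 16000) -> List[str]:
    """Build an ffmpeg argument list that resamples one call leg to `rate` Hz."""
    return [
        "ffmpeg",
        "-y",               # overwrite dst if it already exists
        "-i", src,          # input leg, e.g. an 8 kHz telephony WAV
        "-ar", str(rate),   # output sample rate
        "-ac", "1",         # assumption: each leg is a single mono channel
        dst,
    ]


def upsample(src: str, dst: str, rate: int = 16000) -> None:
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(build_upsample_cmd(src, dst, rate), check=True)


if __name__ == "__main__":
    # Hypothetical file names, printed only so the sketch runs without ffmpeg.
    print(" ".join(build_upsample_cmd("caller_8k.wav", "caller_16k.wav")))
```

With sox the equivalent should be roughly `sox caller_8k.wav -r 16000 caller_16k.wav`. Resampling does not change durations, so the segment timestamps stay comparable between legs.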
Also maybe try:
- adding small silence padding at the start of both legs before transcription
- forcing shorter segments/VAD chunking
- aligning merged segments by midpoint timestamps instead of raw start times only
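To make the padding and midpoint ideas concrete, here is a minimal sketch. The `Segment` type, the function names, and the sample data are all hypothetical, and I'm assuming both legs got the same amount of leading silence (`pad_s`) and that segment times are in seconds:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    speaker: str
    start: float  # seconds
    end: float    # seconds
    text: str

    @property
    def midpoint(self) -> float:
        return (self.start + self.end) / 2.0


def remove_padding(segments: List[Segment], pad_s: float) -> List[Segment]:
    """Shift timestamps back by the silence that was prepended before transcription."""
    return [
        Segment(s.speaker, max(0.0, s.start - pad_s), max(0.0, s.end - pad_s), s.text)
        for s in segments
    ]


def merge_by_midpoint(caller: List[Segment], callee: List[Segment],
                      pad_s: float = 0.0) -> List[Segment]:
    """Interleave both legs ordered by segment midpoint, start time as tie-breaker."""
    merged = remove_padding(caller, pad_s) + remove_padding(callee, pad_s)
    return sorted(merged, key=lambda s: (s.midpoint, s.start))


if __name__ == "__main__":
    # Made-up segments: a long caller turn with a short callee interruption inside it.
    caller = [Segment("caller", 0.5, 6.5, "long sentence"), Segment("caller", 7.5, 8.5, "ok")]
    callee = [Segment("callee", 1.5, 2.5, "blue")]
    for seg in merge_by_midpoint(caller, callee, pad_s=0.5):
        print(f"{seg.start:5.2f}-{seg.end:5.2f} {seg.speaker}: {seg.text}")
```

Sorting by raw start time would put the long caller turn first here; sorting by midpoint puts the short interruption ahead of it. Whether that reads better depends on the call, so it's worth comparing both orderings on your "blue" overlap.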
Your actual synchronization pipeline honestly seems cleaner than most PBX transcription setups I’ve ever seen.