How to sync dual-channel transcripts via OpenAI Whisper (VAD silence stripping destroys absolute timestamps)
I am building an automated call transcription pipeline for a PBX system. The goal is to generate a perfectly chronological, multi-speaker transcript (Caller vs. Callee) from standard 8kHz telephony audio.

My Attempted Solution

Because the OpenAI API downmixes stereo files to mono (which destroys speaker separation and causes heavy hallucination on 8kHz audio), I built a split-channel architecture:

1. Asterisk: I use MixMonitor with the b, r(), t() flags to record the call legs into two separate, mathematically synchronized files (_caller.wav and _callee.wav).
2. PHP Worker: A background script converts the files and fires two separate cURL requests to the Whisper API, requesting verbose_json to get exact timestamps.
3. The Merge: The PHP script parses both JSON arrays, tags the speakers, merges the arrays, and sorts them chronologically by their start times to reconstruct the conversation.
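For readers who want to reproduce the merge step, here is a minimal sketch of the logic described above, written in Python rather than PHP purely for illustration. It assumes each verbose_json response carries a "segments" list whose items have "start", "end", and "text" fields. The function names merge_channels and format_line and the sample values are hypothetical, not taken from my worker.

```python
from typing import Dict, List


def merge_channels(caller: Dict, callee: Dict) -> List[Dict]:
    """Tag every segment with its speaker, then sort all segments by start time."""
    tagged = []
    for speaker, payload in (("Caller", caller), ("Callee", callee)):
        for seg in payload.get("segments", []):
            tagged.append({
                "speaker": speaker,
                "start": float(seg["start"]),
                "end": float(seg["end"]),
                "text": seg["text"].strip(),
            })
    # sorted() is stable, so segments with equal start times keep Caller first.
    return sorted(tagged, key=lambda s: s["start"])


def format_line(seg: Dict) -> str:
    """Render one merged segment as '[MM:SS] Speaker: text'."""
    minutes, seconds = divmod(int(seg["start"]), 60)
    return f"[{minutes:02d}:{seconds:02d}] {seg['speaker']}: {seg['text']}"


# Illustrative input only; these are not real API responses.
caller_json = {"segments": [
    {"start": 0.0, "end": 15.0, "text": " Hello, this is the caller side test."},
    {"start": 35.0, "end": 43.0, "text": " I will now test timestamps."},
]}
callee_json = {"segments": [
    {"start": 24.0, "end": 26.0, "text": " Yes, I can hear you clearly."},
]}

for line in map(format_line, merge_channels(caller_json, callee_json)):
    print(line)
```

Note that a sort like this can only reorder whole segments. If one segment on a channel contains speech from both before and after the other party's turn, sorting by start time cannot interleave it correctly.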
The Specific Issue I am Facing

I am getting a jumbled transcription. This is the transcription I am getting:

[00:00] Caller: Hello, this is a Policy Test, my name is John Miller, today is Wednesday, May 27th, the
[00:00] Callee: Hi, if you record your name and reason for calling, I’ll see if this person is available.
[00:15] Caller: reference number is 473169, can you hear me clearly?
[00:24] Callee: Yes, I can hear you clearly.
[00:26] Callee: This is the Kohli site test.
[00:28] Callee: My name is Sarah Johnson.
[00:30] Callee: The audio quality sounds good from my side.
[00:33] Callee: Please continue with the verification.
[00:35] Caller: I will now test timestamps and speaker changes, the amount is $125, the meeting is scheduled
[00:43] Caller: for 10.30am, please confirm the details.
[00:48] Callee: Confirmed.
[00:49] Callee: $125.
[00:51] Callee: Meeting at 10.30 AM.
[00:53] Callee: I am also testing punctuation, pauses, and pronunciation.
[00:58] Caller: Now testing short interruptions, can you just say the color blue while I continue speaking?
[01:05] Callee: Blue.
[01:07] Caller: Thank you, now testing phone numbers 9876543210, final verification test, this call recording
[01:17] Callee: Received.
[01:18] Callee: Now testing email pronunciation.
[01:20] Callee: john.miller at example dot com
[01:26] Caller: should contain timestamps, speaker labels and accurate English transcriptions, ending
[01:32] Caller: test now.

The actual script of the test call I made:

Caller
Hello, this is the caller side test.
My name is John Miller.
Today is Wednesday, May twenty seventh.
The reference number is four seven three one six nine.
Can you hear me clearly?
Callee
Yes, I can hear you clearly.
This is the callee side test.
My name is Sarah Johnson.
The audio quality sounds good from my side.
Please continue with the verification.
Caller
I will now test timestamps and speaker changes.
The amount is one hundred twenty five dollars.
The meeting is scheduled for ten thirty AM.
Callee
Confirmed.
One hundred twenty five dollars.
Meeting at ten thirty AM.
I am also testing punctuation, pauses, and pronunciation.
Caller
Now testing short interruptions.
Can you say the color blue while I continue speaking?
Callee (interrupt slightly)
Blue.
Caller
Thank you.
Now testing phone numbers.
Nine eight seven six five four three two one zero.
Callee
Received.
Now testing email pronunciation.
john dot miller at example dot com.
Caller
Final verification test.
This call recording should contain timestamps,
speaker labels, and accurate English transcription.
Ending test now.

AGI and Asterisk experts, please help: is there any solution to this problem from the AGI side?