Realtime SIP caller hangup still drops final input transcription events
We can reproduce this fairly easily.
Reproduction conditions:
- Start a Realtime SIP call through Twilio to OpenAI SIP.
- Let the assistant finish its prompt.
- The caller starts speaking a relatively long utterance, around 15-30 seconds, without leaving enough silence for VAD to finalize the turn.
- The caller hangs up immediately after finishing the utterance, or while the utterance is still being finalized.
- The Twilio recording contains the caller’s final speech, but the Realtime event stream does not emit the final input transcription events for that speech.
In our observed case:
- OpenAI Realtime Call_ID: rtc_u2_DeAOG3pcpBFvJ4Mv1fkY7
- OpenAI webhook event id: evt_6a013a9835dc8190a117e2426dad6903
- OpenAI webhook id: wh_6a013a9841248190b12513da3ccd7788
- Twilio parent CallSid: CAa17b0544dd4ed2584c9b8f2a5803e5c8
- Twilio SIP child CallSid: CAc79d8fb428da42ad1b9234742fa5f053
- SIP Call-ID: 0f62ec63e2269af3709ce2b92685312e@0.0.0.0
- Internal request id in our app: 58cd683f-b34e-490f-bbbe-74fe8ca48af6
We do not have the OpenAI API x-request-id for this historical call because we were not logging the response headers from the realtime.calls.accept request at the time.
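For future calls, the request id could be captured along these lines. This is only a minimal sketch: it assumes the HTTP client exposes the accept response's headers as a plain mapping, and extract_request_id is a hypothetical helper of ours, not part of any SDK. The header values in the example are made up.

```python
from typing import Mapping, Optional


def extract_request_id(headers: Mapping[str, str]) -> Optional[str]:
    """Return the x-request-id header value, matching the name case-insensitively."""
    for name, value in headers.items():
        if name.lower() == "x-request-id":
            return value
    # Header not present on this response.
    return None


# Example with made-up header values (not taken from the call above).
example_headers = {"Content-Type": "application/json", "X-Request-Id": "req_example"}
print(extract_request_id(example_headers))
```

Storing this value next to the Call_ID and our internal request id would let a later report include all three.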
Timeline:
- The assistant finished speaking at around 2026-05-11 02:11:03 UTC.
- The SIP leg completed at around 2026-05-11 02:11:31 UTC.
- The Realtime session disconnected at around 2026-05-11 02:11:32 UTC.
- The final caller speech exists in the Twilio recording.
- We did not receive conversation.item.input_audio_transcription.delta or conversation.item.input_audio_transcription.completed for that final speech.
Our current setup listens for:
- conversation.item.input_audio_transcription.delta
- conversation.item.input_audio_transcription.completed
We currently do not log:
- input_audio_buffer.committed
- input_audio_buffer.speech_stopped
So we cannot yet confirm whether input_audio_buffer.committed was emitted before the disconnect for this historical call. We are planning to add logging for those events.
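The extra logging we have in mind looks roughly like this. It is a sketch under assumptions: each server event is assumed to reach our handler as a JSON string with a type field, the item_id field is read defensively because we have not verified it on every event, and log_if_relevant is a hypothetical name from our side.

```python
import json
import logging
from typing import Optional

logger = logging.getLogger("realtime_events")

# Event types named in this report.
LOGGED_EVENT_TYPES = {
    "input_audio_buffer.speech_stopped",
    "input_audio_buffer.committed",
    "conversation.item.input_audio_transcription.delta",
    "conversation.item.input_audio_transcription.completed",
}


def log_if_relevant(raw_message: str) -> Optional[str]:
    """Parse one JSON event and log it if its type is of interest.

    Returns the event type when the event was logged, otherwise None.
    """
    event = json.loads(raw_message)
    event_type = event.get("type")
    if event_type in LOGGED_EVENT_TYPES:
        # item_id may be absent on some events; .get avoids a KeyError.
        logger.info(
            "realtime event type=%s item_id=%s", event_type, event.get("item_id")
        )
        return event_type
    return None
```

With this in place, a future reproduction should show whether input_audio_buffer.committed arrives before the disconnect.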
The important pattern seems to be: caller speaks for a long enough time, then immediately disconnects before VAD has finalized/committed the input audio buffer.
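Until this is clarified, a check like the following could flag calls that match the pattern so we know to fall back to the recording. It is a rough sketch, not a confirmed approach: it assumes we keep the ordered list of event types seen during the session, it uses input_audio_buffer.speech_started (an event we do not log yet and have not mentioned above), and it counts turns rather than matching item ids. needs_recording_fallback is a hypothetical helper name.

```python
from typing import Iterable


def needs_recording_fallback(event_types: Iterable[str]) -> bool:
    """Return True if the session ended with speech that never got a completed transcript."""
    speech_open = False      # speech started but the buffer was not committed yet
    awaiting_transcript = 0  # committed turns still missing a completed transcript
    for event_type in event_types:
        if event_type == "input_audio_buffer.speech_started":
            speech_open = True
        elif event_type == "input_audio_buffer.committed":
            speech_open = False
            awaiting_transcript += 1
        elif event_type == "conversation.item.input_audio_transcription.completed":
            awaiting_transcript = max(0, awaiting_transcript - 1)
    return speech_open or awaiting_transcript > 0


# A session that ends right after speech started, as in the hangup case described above.
print(needs_recording_fallback(["input_audio_buffer.speech_started"]))
```

The function would be called once the session disconnects, with the event types collected for that call.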
Could you confirm whether Realtime SIP is expected to commit and emit a final transcription for any pending input audio when the SIP caller sends BYE / hangs up? Or is this currently a known limitation where applications should always use the recording as the fallback source of truth for the final utterance?