Firefox + Realtime API (WebRTC): voice sessions drop deterministically on the user's second speech turn
Summary
Realtime API sessions via WebRTC drop deterministically on the user’s second speech turn (~30–90s elapsed depending on the duration of the AI’s first response), only on Firefox. Same code, same OpenAI account, same model (gpt-realtime-1.5) works end-to-end on Chrome and Edge through 5+ minute sessions including multiple session.update events.
The drop fires on input_audio_buffer.speech_started, manifests as iceConnectionState: disconnected → connectionState: disconnected → data channel close. No application-level error event ever fires from the server before the drop — only rate_limits.updated, then the closing connection-state events.
The Realtime Playground works in Firefox, but I confirmed via DevTools and about:webrtc that the Playground uses WebSocket transport. So the Playground sidesteps the bug rather than disproving it — this isolates the issue to the WebRTC path specifically, not to anything Firefox-vs-OpenAI more broadly.
Environment
Firefox version: [150.0.1]
OS: [Windows]
Model: gpt-realtime-1.5
Reproduction rate: 100% across 10+ sessions, multiple networks
Symptom (console trace)
[rt-event] response.done ← AI finishes turn 1
[rt-event] output_audio_buffer.stopped
[rt-event] input_audio_buffer.speech_started ← user starts turn 2
[rt-ice-state] disconnected ← drop, 5–10s later
[rt-conn-state] disconnected
[rt-dc-close] readyState: closed
about:webrtc shows several successful ICE consent-refresh exchanges followed by STUN-CLIENT(consent): Timed out. Connection terminates before any application-layer error is signalled.
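For anyone trying to capture the same trace, here is a minimal sketch of listeners that would produce the `[rt-*]` lines above. The function names `attachDiagnostics` and `dropDelayMs` are illustrative, not taken from the app; the event names and state properties are the standard `RTCPeerConnection` and `RTCDataChannel` ones. It assumes server events arrive as JSON strings on the data channel.

```javascript
// Sketch: attach loggers that emit the [rt-*] lines shown in the trace.
// `pc` is an RTCPeerConnection and `dc` its RTCDataChannel; any object with
// addEventListener and the same state properties works too.
function attachDiagnostics(pc, dc, now = () => Date.now()) {
  const entries = []; // each entry: { t, tag, detail }
  const record = (tag, detail) => {
    entries.push({ t: now(), tag, detail });
    console.log(`[${tag}] ${detail}`);
  };

  pc.addEventListener("iceconnectionstatechange", () =>
    record("rt-ice-state", pc.iceConnectionState));
  pc.addEventListener("connectionstatechange", () =>
    record("rt-conn-state", pc.connectionState));
  dc.addEventListener("close", () =>
    record("rt-dc-close", `readyState: ${dc.readyState}`));
  dc.addEventListener("message", (e) => {
    // Assumption: each server event is a JSON string with a `type` field.
    let type = "unparseable";
    try { type = JSON.parse(e.data).type; } catch { /* keep placeholder */ }
    record("rt-event", type);
  });
  return entries;
}

// Milliseconds from the most recent speech_started event to the first ICE
// "disconnected" that follows it, or null if that sequence never occurred.
function dropDelayMs(entries) {
  let speechAt = null;
  for (const e of entries) {
    if (e.tag === "rt-event" && e.detail === "input_audio_buffer.speech_started") {
      speechAt = e.t;
    } else if (e.tag === "rt-ice-state" && e.detail === "disconnected" && speechAt !== null) {
      return e.t - speechAt;
    }
  }
  return null;
}
```

Calling `dropDelayMs(entries)` after a failed session gives the speech-to-drop gap directly, which makes it easier to compare runs against the 5–10s window reported here.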
Reproduction
Browser-side WebRTC connection to Realtime API on Firefox
Standard session.update (full payload below)
AI greets, user speaks turn 1 (works), AI responds (works)
User starts speaking turn 2 → drop within 5–10 seconds
Reproduces across two networks (WiFi + mobile hotspot), Firefox normal mode + private mode, and Firefox on mobile.
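The connection step can be sketched roughly as follows. This is not the app's actual code: `connectRealtime` and its parameters are hypothetical, the SDP endpoint URL and ephemeral key are passed in rather than hard-coded because the post does not show them, and the `"oai-events"` channel label is an assumption. Dependencies are injected so the flow can be exercised outside a browser; in a page you would pass `() => new RTCPeerConnection()`, `(c) => navigator.mediaDevices.getUserMedia(c)` and `fetch`. Remote audio playback is omitted.

```javascript
// Sketch of a browser-side WebRTC connection with a standard SDP offer/answer
// exchange over HTTP. All names and the channel label are assumptions.
async function connectRealtime({
  createPeerConnection, getUserMedia, fetchFn, sdpUrl, ephemeralKey, sessionUpdate,
}) {
  const pc = createPeerConnection();

  // Send microphone audio to the server.
  const mic = await getUserMedia({ audio: true });
  for (const track of mic.getTracks()) pc.addTrack(track, mic);

  // Events such as session.update travel over a data channel, which must be
  // created before the offer so it is included in the SDP.
  const dc = pc.createDataChannel("oai-events");
  dc.addEventListener("open", () => dc.send(JSON.stringify(sessionUpdate)));

  // POST the SDP offer, then apply the SDP answer from the response body.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  const res = await fetchFn(sdpUrl, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${ephemeralKey}`,
      "Content-Type": "application/sdp",
    },
    body: offer.sdp,
  });
  if (!res.ok) throw new Error(`SDP exchange failed: HTTP ${res.status}`);
  await pc.setRemoteDescription({ type: "answer", sdp: await res.text() });
  return { pc, dc };
}
```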
Reference session.update payload
json
{
  "type": "session.update",
  "session": {
    "type": "realtime",
    "instructions": "<character role-play prompt>",
    "audio": {
      "input": {
        "turn_detection": { "type": "semantic_vad" },
        "transcription": { "model": "gpt-4o-mini-transcribe", "language": "en" }
      },
      "output": { "voice": "ash" }
    }
  }
}
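To show how the payload varies under the URL-param diagnostic toggles described in the hypotheses section, here is a minimal sketch. The function `buildSessionUpdate` and the parameter names `vad`, `slim` and `nolang` are hypothetical; the post does not give the actual toggle names. With no parameters set it reproduces the reference payload above.

```javascript
// Sketch: derive the session.update payload from URL query parameters.
// Parameter names (?vad=server&slim=1&nolang=1) are illustrative assumptions.
function buildSessionUpdate(search, instructions, slimInstructions) {
  const params = new URLSearchParams(search);

  // Toggle: drop the transcription language hint entirely.
  const transcription = { model: "gpt-4o-mini-transcribe" };
  if (params.get("nolang") !== "1") transcription.language = "en";

  return {
    type: "session.update",
    session: {
      type: "realtime",
      // Toggle: swap the full prompt for the slim one.
      instructions: params.get("slim") === "1" ? slimInstructions : instructions,
      audio: {
        input: {
          // Toggle: semantic_vad (default) vs server_vad with default thresholds.
          turn_detection: {
            type: params.get("vad") === "server" ? "server_vad" : "semantic_vad",
          },
          transcription,
        },
        output: { voice: "ash" },
      },
    },
  };
}
```

For example, `buildSessionUpdate(location.search, fullPrompt, slimPrompt)` on a page loaded with `?vad=server&slim=1&nolang=1` yields the combined configuration tested below.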
Hypotheses tested and ruled out
I built URL-param diagnostic toggles for each hypothesis and tested them in isolation. All ruled out:
| Hypothesis | Test | Result |
|---|---|---|
| VAD type | Swap semantic_vad → server_vad with default thresholds | Still drops |
| System prompt size (~35K chars) | Replace with ~500-char slim prompt | Still drops |
| language hint on transcription | Drop "language": "en" entirely | Still drops |
| Combined Playground-likely config | server_vad + slim prompt + no language hint | Still drops |
| NAT / STUN traversal | Add explicit STUN + public TURN; UDP-direct candidate pair confirmed via pc.getStats() | Still drops |
| Trust-update mid-session | Disable mid-session session.update entirely | Still drops |
| Client-side noise gate (AudioContext + GainNode) | Bypass, feed raw mic stream into pc.addTrack | Still drops |
| Browser audio processing | getUserMedia({ audio: { echoCancellation: false, noiseSuppression: false, autoGainControl: false } }) | Still drops |
| Output GainNode (AudioContext on incoming track) | Bypass, native <audio> playback | Still drops |
| Whisper-1 transcription model | Swap to gpt-4o-mini-transcribe | Still drops |
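For the NAT / STUN row, one way to confirm a UDP-direct candidate pair from `pc.getStats()` is sketched below. The helper name `selectedPairSummary` is illustrative. It first follows the standard `transport.selectedCandidatePairId` link and, if that is missing, falls back to the non-standard `selected` flag that Firefox sets on `candidate-pair` stats. Treat the fallback as an assumption to check against your Firefox version.

```javascript
// Sketch: summarise the active ICE candidate pair from a getStats() report.
// `report` is an RTCStatsReport, which is Map-like (get, values, forEach).
function selectedPairSummary(report) {
  const stats = [...report.values()];

  // Standard path: the transport stat names the selected pair.
  const transport = stats.find(
    (s) => s.type === "transport" && s.selectedCandidatePairId);
  let pair = transport ? report.get(transport.selectedCandidatePairId) : undefined;

  // Fallback: Firefox marks the active pair with a non-standard `selected` flag.
  if (!pair) pair = stats.find((s) => s.type === "candidate-pair" && s.selected);
  if (!pair) return null;

  const local = report.get(pair.localCandidateId);
  const remote = report.get(pair.remoteCandidateId);
  if (!local || !remote) return null;

  return {
    protocol: local.protocol,
    localType: local.candidateType,
    remoteType: remote.candidateType,
    // "Direct" here means UDP with no TURN relay on either side.
    direct: local.protocol === "udp" &&
      local.candidateType !== "relay" &&
      remote.candidateType !== "relay",
  };
}
```

In a page this would be called as `selectedPairSummary(await pc.getStats())` once the connection reaches the connected state.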
What works
Chrome (latest): full sessions through 5+ minutes, multiple session.update events, no drops
Edge (latest): validated end-to-end through full app pipeline (voice → debrief → scoring)
Realtime Playground in Firefox: works end-to-end (uses WebSocket transport, not WebRTC)
What I’d love OpenAI to investigate
Anything specific to Firefox’s mtransport WebRTC stack vs. Chrome’s libwebrtc interacting with the Realtime media server
Whether the server-side STUN consent-refresh handling assumes Chromium-style timing (Firefox’s consent-refresh is stricter, with a 5s timeout)
Whether the GA endpoint applies an implicit default config (e.g., input_audio_noise_reduction) that interacts poorly with Firefox’s WebRTC stack. Prior forum threads document similar “Firefox WebRTC suddenly fails, Chrome works, no app-level error” patterns; the surface symptoms differ, but they are plausibly related