Firefox + Realtime API (WebRTC): voice sessions drop deterministically on the user's second speech turn

OpenAI Developer Community May 3, 2026

Summary

Realtime API sessions over WebRTC drop deterministically on the user’s second speech turn (~30–90s into the session, depending on the duration of the AI’s first response), and only on Firefox. The same code, OpenAI account, and model (gpt-realtime-1.5) work end-to-end on Chrome and Edge through 5+ minute sessions, including multiple session.update events.

The drop follows input_audio_buffer.speech_started by 5–10 seconds and manifests as iceConnectionState: disconnected → connectionState: disconnected → data channel close. No application-level error event ever arrives from the server before the drop; the only event that fires is rate_limits.updated, followed by the closing connection-state events.

The Realtime Playground works in Firefox, but I confirmed via DevTools and about:webrtc that the Playground uses WebSocket transport. So the Playground sidesteps the bug rather than disproving it — this isolates the issue to the WebRTC path specifically, not to anything Firefox-vs-OpenAI more broadly.

Environment

Firefox version: 150.0.1
OS: Windows
Model: gpt-realtime-1.5
Reproduction rate: 100% across 10+ sessions, multiple networks

Symptom (console trace)

[rt-event] response.done                       ← AI finishes turn 1
[rt-event] output_audio_buffer.stopped
[rt-event] input_audio_buffer.speech_started   ← user starts turn 2
[rt-ice-state] disconnected                    ← drop, 5–10s later
[rt-conn-state] disconnected
[rt-dc-close] readyState: closed
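
For anyone who wants to capture a comparable trace, here is a minimal sketch of hooks that would emit lines in this format. The function name `attachTrace` and the `PeerLike` / `ChannelLike` interfaces are hypothetical stand-ins (simplified versions of `RTCPeerConnection` and `RTCDataChannel`), and it assumes server events arrive on the data channel as JSON text with a `type` field. It is not the exact logging code behind the trace above.

```typescript
// Minimal structural stand-ins for the browser objects, so the sketch
// can also be exercised outside a browser with fake objects.
interface EventSourceLike {
  addEventListener(type: string, listener: (ev: any) => void): void;
}
interface PeerLike extends EventSourceLike {
  iceConnectionState: string;
  connectionState: string;
}
interface ChannelLike extends EventSourceLike {
  readyState: string;
}

type Sink = (line: string) => void;

// Attach listeners that emit one tagged line per event.
// The tags mirror the trace format; everything else is an assumption.
function attachTrace(pc: PeerLike, dc: ChannelLike, sink: Sink): void {
  pc.addEventListener("iceconnectionstatechange", () =>
    sink(`[rt-ice-state] ${pc.iceConnectionState}`));
  pc.addEventListener("connectionstatechange", () =>
    sink(`[rt-conn-state] ${pc.connectionState}`));
  dc.addEventListener("message", (ev) => {
    // Assumed: each message is JSON text carrying a `type` field.
    try {
      const parsed = JSON.parse(String(ev.data));
      sink(`[rt-event] ${parsed.type}`);
    } catch {
      sink("[rt-event] <unparseable message>");
    }
  });
  dc.addEventListener("close", () =>
    sink(`[rt-dc-close] readyState: ${dc.readyState}`));
}
```

In a browser this would be called as `attachTrace(pc, dc, console.log)` right after creating the peer connection and data channel, so that the state changes around the drop are logged in order.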

about:webrtc shows several successful ICE consent-refresh exchanges followed by STUN-CLIENT(consent): Timed out. Connection terminates before any application-layer error is signalled.
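
The consent timeout is only visible in about:webrtc, but a rough script-side view of the same failure can be had by sampling `pc.getStats()` and watching the selected candidate pair. The sketch below is an illustration under assumptions, not the check used for this report: the function names are hypothetical, it relies on `bytesReceived` being present on the candidate pair, and it falls back to Firefox's non-standard `selected` flag when no `transport` entry names the pair. Field availability differs between browsers, so treat missing values as "unknown" rather than as evidence.

```typescript
// Loose shape for entries from pc.getStats(); only the fields read below matter.
interface StatEntry {
  id: string;
  type: string;
  [key: string]: unknown;
}

interface PairSummary {
  localType?: string;   // e.g. "host", "srflx", "prflx", "relay"
  remoteType?: string;
  protocol?: string;    // e.g. "udp" or "tcp"
  bytesReceived: number;
}

function summarizeSelectedPair(entries: StatEntry[]): PairSummary | undefined {
  const byId = new Map(entries.map((e) => [e.id, e] as [string, StatEntry]));
  // Preferred: a transport entry that names the selected pair.
  const transport = entries.find(
    (e) => e.type === "transport" && typeof e.selectedCandidatePairId === "string");
  let pair = transport
    ? byId.get(transport.selectedCandidatePairId as string)
    : undefined;
  // Fallbacks: Firefox's non-standard `selected` flag, then a nominated succeeded pair.
  if (!pair) {
    pair =
      entries.find((e) => e.type === "candidate-pair" && e.selected === true) ??
      entries.find((e) => e.type === "candidate-pair" &&
        e.nominated === true && e.state === "succeeded");
  }
  if (!pair) return undefined;
  const local = byId.get(String(pair.localCandidateId));
  const remote = byId.get(String(pair.remoteCandidateId));
  return {
    localType: local?.candidateType as string | undefined,
    remoteType: remote?.candidateType as string | undefined,
    protocol: local?.protocol as string | undefined,
    bytesReceived: typeof pair.bytesReceived === "number" ? pair.bytesReceived : 0,
  };
}

// True when no new bytes arrived on the selected pair between two samples.
function inboundStalled(prev: PairSummary, curr: PairSummary): boolean {
  return curr.bytesReceived <= prev.bytesReceived;
}
```

In the page, the entries would come from `Array.from((await pc.getStats()).values())`, sampled once a second or so. A `localType` of "host" or "srflx" with `protocol` "udp" corresponds to a direct UDP path, and a run of stalled samples just before `iceConnectionState` flips to `disconnected` would line up with the consent timeout shown in about:webrtc.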

Reproduction

Browser-side WebRTC connection to Realtime API on Firefox
Standard session.update (full payload below)
AI greets, user speaks turn 1 (works), AI responds (works)
User starts speaking turn 2 → drop within 5–10 seconds

Reproduces across two networks (WiFi + mobile hotspot), Firefox normal mode + private mode, and Firefox on mobile.

Reference `session.update` payload

json

{
  "type": "session.update",
  "session": {
    "type": "realtime",
    "instructions": "<character role-play prompt>",
    "audio": {
      "input": {
        "turn_detection": { "type": "semantic_vad" },
        "transcription": { "model": "gpt-4o-mini-transcribe", "language": "en" }
      },
      "output": { "voice": "ash" }
    }
  }
}
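
For context, here is a sketch of how a payload like this might be built and sent over the WebRTC data channel. `buildSessionUpdate`, `sendEvent`, `SessionOptions`, and `SendChannel` are hypothetical names introduced for illustration; only the payload shape is taken from the reference above.

```typescript
// Stand-in for the part of RTCDataChannel used here.
interface SendChannel {
  readyState: string;
  send(data: string): void;
}

// Hypothetical options object; field names are illustrative only.
interface SessionOptions {
  instructions: string;
  vad: "semantic_vad" | "server_vad";
  transcriptionModel: string;
  language?: string; // omitted from the payload when undefined
  voice: string;
}

// Build a session.update event with the same shape as the reference payload.
function buildSessionUpdate(o: SessionOptions) {
  const transcription: { model: string; language?: string } = {
    model: o.transcriptionModel,
  };
  if (o.language !== undefined) transcription.language = o.language;
  return {
    type: "session.update",
    session: {
      type: "realtime",
      instructions: o.instructions,
      audio: {
        input: { turn_detection: { type: o.vad }, transcription },
        output: { voice: o.voice },
      },
    },
  };
}

// Send an event as JSON text, but only once the channel is open.
// Returns false (and sends nothing) if the channel is not open.
function sendEvent(dc: SendChannel, event: object): boolean {
  if (dc.readyState !== "open") return false;
  dc.send(JSON.stringify(event));
  return true;
}
```

Guarding on `readyState` matters for this bug report in particular: after the drop the channel reports `closed`, so any late `session.update` is skipped instead of throwing and muddying the trace.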

Hypotheses tested and ruled out

I built URL-param diagnostic toggles for each hypothesis and tested them in isolation. All ruled out:

| Hypothesis | Test | Result |
| --- | --- | --- |
| VAD type | Swap `semantic_vad` → `server_vad` with default thresholds | Still drops |
| System prompt size (~35K chars) | Replace with ~500-char slim prompt | Still drops |
| `language` hint on transcription | Drop `"language": "en"` entirely | Still drops |
| Combined Playground-likely config | `server_vad` + slim prompt + no language hint | Still drops |
| NAT / STUN traversal | Add explicit STUN + public TURN; UDP-direct candidate pair confirmed via `pc.getStats()` | Still drops |
| Trust-update mid-session | Disable mid-session `session.update` entirely | Still drops |
| Client-side noise gate (AudioContext + GainNode) | Bypass, feed raw mic stream into `pc.addTrack` | Still drops |
| Browser audio processing | `getUserMedia({ audio: { echoCancellation: false, noiseSuppression: false, autoGainControl: false } })` | Still drops |
| Output GainNode (AudioContext on incoming track) | Bypass, native `<audio>` playback | Still drops |
| Whisper-1 transcription model | Swap to `gpt-4o-mini-transcribe` | Still drops |
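
To make the toggle approach concrete, here is a minimal sketch of URL-param diagnostics covering a few of the rows above. The parameter names (`vad`, `slim`, `lang`, `rawmic`, `dsp`) and the function names are hypothetical, chosen for illustration; the real toggles in my app are not reproduced here.

```typescript
// Flags for a subset of the hypotheses; all names are illustrative.
interface DiagToggles {
  vad: "semantic_vad" | "server_vad";
  slimPrompt: boolean;      // use the ~500-char prompt
  languageHint: boolean;    // include "language": "en" on transcription
  rawMic: boolean;          // bypass the client-side noise gate
  browserAudioDsp: boolean; // echoCancellation / noiseSuppression / autoGainControl
}

// Parse a query string such as "?vad=server_vad&slim=1&lang=0".
// Absent params fall back to the normal (non-diagnostic) behavior.
function parseToggles(search: string): DiagToggles {
  const p = new URLSearchParams(search);
  const on = (name: string, fallback: boolean): boolean => {
    const v = p.get(name);
    return v === null ? fallback : v === "1" || v === "true";
  };
  return {
    vad: p.get("vad") === "server_vad" ? "server_vad" : "semantic_vad",
    slimPrompt: on("slim", false),
    languageHint: on("lang", true),
    rawMic: on("rawmic", false),
    browserAudioDsp: on("dsp", true),
  };
}

// Map the DSP toggle onto getUserMedia-style audio constraints.
function audioConstraints(t: DiagToggles) {
  return {
    audio: {
      echoCancellation: t.browserAudioDsp,
      noiseSuppression: t.browserAudioDsp,
      autoGainControl: t.browserAudioDsp,
    },
  };
}
```

In the page this would be driven by `parseToggles(window.location.search)`, with the result of `audioConstraints(...)` passed to `navigator.mediaDevices.getUserMedia`. Keeping each toggle independent is what allows one hypothesis to be flipped per session while everything else stays at its default.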

What works

Chrome (latest): full sessions through 5+ minutes, multiple session.update events, no drops
Edge (latest): validated end-to-end through full app pipeline (voice → debrief → scoring)
Realtime Playground in Firefox: works end-to-end (uses WebSocket transport, not WebRTC)

What I’d love OpenAI to investigate

Anything specific to Firefox’s mtransport WebRTC stack vs. Chrome’s libwebrtc interacting with the Realtime media server
Whether the server-side STUN consent-refresh handling assumes Chromium-style timing (Firefox’s consent-refresh is stricter, with a 5s timeout)
Whether the GA endpoint applies an implicit default config (e.g., input_audio_noise_reduction) that interacts poorly with Firefox’s WebRTC stack. Prior forum threads document similar “Firefox WebRTC suddenly fails, Chrome works, no app-level error” patterns, with different surface symptoms that are plausibly related