
Firefox + Realtime API (WebRTC): voice sessions drop deterministically on the user's second speech turn

OpenAI Developer Community May 3, 2026

Summary

Realtime API sessions via WebRTC drop deterministically on the user’s second speech turn (~30–90s elapsed depending on the duration of the AI’s first response), only on Firefox. Same code, same OpenAI account, same model (gpt-realtime-1.5) works end-to-end on Chrome and Edge through 5+ minute sessions including multiple session.update events.

The drop fires on input_audio_buffer.speech_started and manifests as iceConnectionState: disconnected → connectionState: disconnected → data channel close. No application-level error event ever fires from the server before the drop — only rate_limits.updated, then the closing connection-state events.

The Realtime Playground works in Firefox, but I confirmed via DevTools and about:webrtc that the Playground uses WebSocket transport. So the Playground sidesteps the bug rather than disproving it — this isolates the issue to the WebRTC path specifically, not to anything Firefox-vs-OpenAI more broadly.

Environment

  • Firefox version: [150.0.1]

  • OS: [Windows]

  • Model: gpt-realtime-1.5

  • Reproduction rate: 100% across 10+ sessions, multiple networks

Symptom (console trace)

[rt-event] response.done                       ← AI finishes turn 1
[rt-event] output_audio_buffer.stopped
[rt-event] input_audio_buffer.speech_started   ← user starts turn 2
[rt-ice-state] disconnected                    ← drop, 5–10s later
[rt-conn-state] disconnected
[rt-dc-close] readyState: closed

about:webrtc shows several successful ICE consent-refresh exchanges followed by STUN-CLIENT(consent): Timed out. Connection terminates before any application-layer error is signalled.
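
A trace like the one above can be captured with a handful of listeners. The following is a minimal sketch, not the app's actual logging code: `attachTrace`, `PeerLike`, and `ChannelLike` are hypothetical names, and the two interfaces only mirror the small subset of `RTCPeerConnection` and `RTCDataChannel` that the tracer touches, so the snippet compiles without DOM typings. It assumes server events arrive as JSON text on the data channel.

```typescript
// Hypothetical structural types covering only what the tracer uses.
type Listener = (ev: any) => void;

interface PeerLike {
  iceConnectionState: string;
  connectionState: string;
  addEventListener(type: string, listener: Listener): void;
}

interface ChannelLike {
  readyState: string;
  addEventListener(type: string, listener: Listener): void;
}

// Wires up listeners that emit one tagged line per event to `log`.
function attachTrace(
  pc: PeerLike,
  dc: ChannelLike,
  log: (line: string) => void,
): void {
  pc.addEventListener("iceconnectionstatechange", () =>
    log(`[rt-ice-state] ${pc.iceConnectionState}`));
  pc.addEventListener("connectionstatechange", () =>
    log(`[rt-conn-state] ${pc.connectionState}`));
  dc.addEventListener("message", (ev) => {
    // Assumption: each message is a JSON-encoded server event with a `type`.
    try {
      const parsed = JSON.parse(String(ev.data));
      log(`[rt-event] ${parsed.type}`);
    } catch {
      log("[rt-event] <unparseable message>");
    }
  });
  dc.addEventListener("close", () =>
    log(`[rt-dc-close] readyState: ${dc.readyState}`));
}
```

Passing a sink that appends a timestamp to each line makes the 5–10s gap between speech_started and the ICE disconnect visible in the log itself.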

Reproduction

  1. Browser-side WebRTC connection to Realtime API on Firefox

  2. Standard session.update (full payload below)

  3. AI greets, user speaks turn 1 (works), AI responds (works)

  4. User starts speaking turn 2 → drop within 5–10 seconds

Reproduces across two networks (WiFi + mobile hotspot), Firefox normal mode + private mode, and Firefox on mobile.

Reference session.update payload

json

{
  "type": "session.update",
  "session": {
    "type": "realtime",
    "instructions": "<character role-play prompt>",
    "audio": {
      "input": {
        "turn_detection": { "type": "semantic_vad" },
        "transcription": { "model": "gpt-4o-mini-transcribe", "language": "en" }
      },
      "output": { "voice": "ash" }
    }
  }
}
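
For anyone trying to reproduce with variations, the payload above can be produced by a small typed builder. This is a sketch under assumptions: `buildSessionUpdate` and `SessionOptions` are hypothetical names, and the defaults simply mirror the reference payload rather than anything the API documents.

```typescript
type VadType = "semantic_vad" | "server_vad";

interface SessionOptions {
  instructions: string;
  vad?: VadType;               // defaults to semantic_vad, as in the payload above
  transcriptionModel?: string; // defaults to gpt-4o-mini-transcribe
  language?: string;           // leave unset to send no language hint
  voice?: string;              // defaults to ash
}

// Builds a session.update event shaped like the reference payload.
function buildSessionUpdate(opts: SessionOptions) {
  const transcription: { model: string; language?: string } = {
    model: opts.transcriptionModel ?? "gpt-4o-mini-transcribe",
  };
  if (opts.language !== undefined) {
    transcription.language = opts.language;
  }
  return {
    type: "session.update",
    session: {
      type: "realtime",
      instructions: opts.instructions,
      audio: {
        input: {
          turn_detection: { type: opts.vad ?? "semantic_vad" },
          transcription,
        },
        output: { voice: opts.voice ?? "ash" },
      },
    },
  };
}
```

Calling it with `{ instructions: "<character role-play prompt>", language: "en" }` yields the reference payload; the result would then be serialized with `JSON.stringify` and sent over the data channel.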

Hypotheses tested and ruled out

I built URL-param diagnostic toggles for each hypothesis and tested them in isolation. All ruled out:

| Hypothesis | Test | Result |
| --- | --- | --- |
| VAD type | Swap semantic_vad → server_vad with default thresholds | Still drops |
| System prompt size (~35K chars) | Replace with ~500-char slim prompt | Still drops |
| language hint on transcription | Drop "language": "en" entirely | Still drops |
| Combined Playground-likely config | server_vad + slim prompt + no language hint | Still drops |
| NAT / STUN traversal | Add explicit STUN + public TURN; UDP-direct candidate pair confirmed via pc.getStats() | Still drops |
| Trust-update mid-session | Disable mid-session session.update entirely | Still drops |
| Client-side noise gate (AudioContext + GainNode) | Bypass, feed raw mic stream into pc.addTrack | Still drops |
| Browser audio processing | getUserMedia({ audio: { echoCancellation: false, noiseSuppression: false, autoGainControl: false } }) | Still drops |
| Output GainNode (AudioContext on incoming track) | Bypass, native <audio> playback | Still drops |
| Whisper-1 transcription model | Swap to gpt-4o-mini-transcribe | Still drops |
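
For reference, URL-param toggles of the kind described above could look roughly like this. It is an illustrative sketch only: the parameter names (`vad`, `slim`, `nolang`, `rawmic`, `nodsp`) and the helpers `parseToggles` and `micConstraints` are made up for this example and are not the names used in the app.

```typescript
interface DiagToggles {
  vad: "semantic_vad" | "server_vad";
  slimPrompt: boolean;      // use the ~500-char prompt
  noLanguageHint: boolean;  // drop "language" from transcription
  rawMic: boolean;          // bypass the client-side noise gate
  noBrowserDsp: boolean;    // disable echo cancellation, noise suppression, AGC
}

// Reads toggles from a query string such as location.search.
function parseToggles(search: string): DiagToggles {
  const params = new URLSearchParams(search);
  const on = (name: string) => params.get(name) === "1";
  return {
    vad: params.get("vad") === "server" ? "server_vad" : "semantic_vad",
    slimPrompt: on("slim"),
    noLanguageHint: on("nolang"),
    rawMic: on("rawmic"),
    noBrowserDsp: on("nodsp"),
  };
}

// Maps the toggles to the constraints object passed to getUserMedia.
function micConstraints(t: DiagToggles) {
  if (!t.noBrowserDsp) {
    return { audio: true as const };
  }
  return {
    audio: {
      echoCancellation: false,
      noiseSuppression: false,
      autoGainControl: false,
    },
  };
}
```

With this shape, loading the page with `?vad=server&slim=1&nolang=1` would correspond to the combined row in the table, and each other row maps to a single parameter so it can be tested in isolation.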

What works

  • Chrome (latest): full sessions through 5+ minutes, multiple session.update events, no drops

  • Edge (latest): validated end-to-end through full app pipeline (voice → debrief → scoring)

  • Realtime Playground in Firefox: works end-to-end (uses WebSocket transport, not WebRTC)

What I’d love OpenAI to investigate

  • Anything specific to Firefox’s mtransport WebRTC stack vs. Chrome’s libwebrtc interacting with the Realtime media server

  • Whether the server-side STUN consent-refresh handling assumes Chromium-style timing (Firefox’s consent-refresh is stricter — 5s timeout)

  • Whether the GA endpoint applies an implicit default config (e.g., input_audio_noise_reduction) that interacts poorly with Firefox’s WebRTC stack. Prior forum threads document similar “Firefox WebRTC suddenly fails, Chrome works, no app-level error” patterns — different surface symptoms but plausibly related
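
In case it helps anyone comparing browsers on the points above, here is a rough sketch of summarizing the selected candidate pair from `pc.getStats()` entries. The names `StatLike`, `PairSummary`, and `summarizeSelectedPair` are hypothetical. The lookup tries the transport stat's `selectedCandidatePairId` first, then Firefox's non-standard `selected` flag on candidate pairs, then a nominated pair in the `succeeded` state; which fields are populated varies by browser, so everything is treated as optional.

```typescript
// Subset of stats fields used below; all optional because browsers differ.
interface StatLike {
  id: string;
  type: string;
  state?: string;
  nominated?: boolean;
  selected?: boolean;               // non-standard, Firefox candidate-pair
  selectedCandidatePairId?: string; // on "transport" stats
  localCandidateId?: string;
  candidateType?: string;           // on candidate stats: host, srflx, prflx, relay
  protocol?: string;                // on candidate stats: udp or tcp
  bytesReceived?: number;
  lastPacketReceivedTimestamp?: number;
}

interface PairSummary {
  pairId: string;
  state: string | undefined;
  localType: string | undefined;
  protocol: string | undefined;
  bytesReceived: number | undefined;
  lastPacketReceivedTimestamp: number | undefined;
}

function summarizeSelectedPair(stats: StatLike[]): PairSummary | null {
  const byId = new Map<string, StatLike>();
  for (const s of stats) byId.set(s.id, s);

  const transport = stats.find(
    (s) => s.type === "transport" && !!s.selectedCandidatePairId);
  const pairs = stats.filter((s) => s.type === "candidate-pair");
  const viaTransport = transport?.selectedCandidatePairId
    ? byId.get(transport.selectedCandidatePairId)
    : undefined;
  const pair =
    viaTransport ??
    pairs.find((s) => s.selected === true) ??
    pairs.find((s) => s.nominated === true && s.state === "succeeded");
  if (!pair) return null;

  const local = pair.localCandidateId
    ? byId.get(pair.localCandidateId)
    : undefined;
  return {
    pairId: pair.id,
    state: pair.state,
    localType: local?.candidateType,
    protocol: local?.protocol,
    bytesReceived: pair.bytesReceived,
    lastPacketReceivedTimestamp: pair.lastPacketReceivedTimestamp,
  };
}
```

In the browser, the input array can be filled by iterating the report returned from `pc.getStats()` with its `forEach`. Sampling this once a second around the second speech turn would show whether `bytesReceived` and `lastPacketReceivedTimestamp` stop advancing before the ICE state flips to disconnected.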
