Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreig6xk6c6kpfkxkjgeg4c6h56m4axsbhve6pd7mbvgxcyrrtvruh2m",
    "uri": "at://did:plc:lk3jfj3zq4k4wxnk474axylu/app.bsky.feed.post/3mkxtm5znahc2"
  },
  "path": "/t/irefox-realtime-api-webrtc-voice-sessions-drop-deterministically-on-the-users-second-speech-turn/1380257#post_1",
  "publishedAt": "2026-05-03T18:55:20.000Z",
  "site": "https://community.openai.com",
  "textContent": "### Summary\n\nRealtime API sessions via WebRTC drop deterministically on the user’s second speech turn (~30–90s elapsed depending on the duration of the AI’s first response), only on Firefox. Same code, same OpenAI account, same model (`gpt-realtime-1.5`) works end-to-end on Chrome and Edge through 5+ minute sessions including multiple `session.update` events.\n\nThe drop fires on `input_audio_buffer.speech_started`, manifests as `iceConnectionState: disconnected` → `connectionState: disconnected` → data channel close. **No application-level error event ever fires from the server before the drop** — only `rate_limits.updated`, then the closing connection-state events.\n\nThe Realtime Playground works in Firefox, but I confirmed via DevTools and `about:webrtc` that the Playground uses WebSocket transport. So the Playground sidesteps the bug rather than disproving it — this isolates the issue to the WebRTC path specifically, not to anything Firefox-vs-OpenAI more broadly.\n\n### Environment\n\n  * **Firefox version:** [150.0.1]\n\n  * **OS:** [Windows]\n\n  * **Model:** `gpt-realtime-1.5`\n\n  * **Reproduction rate:** 100% across 10+ sessions, multiple networks\n\n\n\n\n### Symptom (console trace)\n\n\n    [rt-event] response.done                       ← AI finishes turn 1\n    [rt-event] output_audio_buffer.stopped\n    [rt-event] input_audio_buffer.speech_started   ← user starts turn 2\n    [rt-ice-state] disconnected                    ← drop, 5–10s later\n    [rt-conn-state] disconnected\n    [rt-dc-close] readyState: closed\n\n\n`about:webrtc` shows several successful ICE consent-refresh exchanges followed by `STUN-CLIENT(consent): Timed out`. Connection terminates before any application-layer error is signalled.\n\n### Reproduction\n\n  1. Browser-side WebRTC connection to Realtime API on Firefox\n\n  2. Standard `session.update` (full payload below)\n\n  3. AI greets, user speaks turn 1 (works), AI responds (works)\n\n  4. 
User starts speaking turn 2 → drop within 5–10 seconds\n\n\n\n\nReproduces across two networks (WiFi + mobile hotspot), Firefox normal mode + private mode, and Firefox on mobile.\n\n### Reference `session.update` payload\n\njson\n\n\n    {\n      \"type\": \"session.update\",\n      \"session\": {\n        \"type\": \"realtime\",\n        \"instructions\": \"<character role-play prompt>\",\n        \"audio\": {\n          \"input\": {\n            \"turn_detection\": { \"type\": \"semantic_vad\" },\n            \"transcription\": { \"model\": \"gpt-4o-mini-transcribe\", \"language\": \"en\" }\n          },\n          \"output\": { \"voice\": \"ash\" }\n        }\n      }\n    }\n\n\n### Hypotheses tested and ruled out\n\nI built URL-param diagnostic toggles for each hypothesis and tested them in isolation. All ruled out:\n\nHypothesis | Test | Result\n---|---|---\nVAD type | Swap `semantic_vad` → `server_vad` with default thresholds | Still drops\nSystem prompt size (~35K chars) | Replace with ~500-char slim prompt | Still drops\n`language` hint on transcription | Drop `\"language\": \"en\"` entirely | Still drops\nCombined Playground-likely config | `server_vad` + slim prompt + no language hint | Still drops\nNAT / STUN traversal | Add explicit STUN + public TURN; UDP-direct candidate pair confirmed via `pc.getStats()` | Still drops\nTrust-update mid-session | Disable mid-session `session.update` entirely | Still drops\nClient-side noise gate (AudioContext + GainNode) | Bypass, feed raw mic stream into `pc.addTrack` | Still drops\nBrowser audio processing | `getUserMedia({ audio: { echoCancellation: false, noiseSuppression: false, autoGainControl: false } })` | Still drops\nOutput GainNode (AudioContext on incoming track) | Bypass, native `<audio>` playback | Still drops\nWhisper-1 transcription model | Swap to `gpt-4o-mini-transcribe` | Still drops\n\n### What works\n\n  * **Chrome (latest):** full sessions through 5+ minutes, multiple `session.update` events, no 
drops\n\n  * **Edge (latest):** validated end-to-end through full app pipeline (voice → debrief → scoring)\n\n  * **Realtime Playground in Firefox:** works end-to-end (uses WebSocket transport, not WebRTC)\n\n\n\n\n### What I’d love OpenAI to investigate\n\n  * Anything specific to Firefox’s `mtransport` WebRTC stack vs. Chrome’s libwebrtc interacting with the Realtime media server\n\n  * Whether the server-side STUN consent-refresh handling assumes Chromium-style timing (Firefox’s consent-refresh is stricter — 5s timeout)\n\n  * Whether the GA endpoint applies an implicit default config (e.g., `input_audio_noise_reduction`) that interacts poorly with Firefox’s WebRTC stack. Prior forum threads document similar “Firefox WebRTC suddenly fails, Chrome works, no app-level error” patterns — different surface symptoms but plausibly related\n\n\n",
  "title": "Firefox + Realtime API (WebRTC): voice sessions drop deterministically on the user's second speech turn"
}