{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreig6xk6c6kpfkxkjgeg4c6h56m4axsbhve6pd7mbvgxcyrrtvruh2m",
"uri": "at://did:plc:lk3jfj3zq4k4wxnk474axylu/app.bsky.feed.post/3mkxtm5znahc2"
},
"path": "/t/irefox-realtime-api-webrtc-voice-sessions-drop-deterministically-on-the-users-second-speech-turn/1380257#post_1",
"publishedAt": "2026-05-03T18:55:20.000Z",
"site": "https://community.openai.com",
"textContent": "### Summary\n\nRealtime API sessions via WebRTC drop deterministically on the user’s second speech turn (~30–90s elapsed depending on the duration of the AI’s first response), only on Firefox. Same code, same OpenAI account, same model (`gpt-realtime-1.5`) works end-to-end on Chrome and Edge through 5+ minute sessions including multiple `session.update` events.\n\nThe drop fires on `input_audio_buffer.speech_started`, manifests as `iceConnectionState: disconnected` → `connectionState: disconnected` → data channel close. **No application-level error event ever fires from the server before the drop** — only `rate_limits.updated`, then the closing connection-state events.\n\nThe Realtime Playground works in Firefox, but I confirmed via DevTools and `about:webrtc` that the Playground uses WebSocket transport. So the Playground sidesteps the bug rather than disproving it — this isolates the issue to the WebRTC path specifically, not to anything Firefox-vs-OpenAI more broadly.\n\n### Environment\n\n * **Firefox version:** [150.0.1]\n\n * **OS:** [Windows]\n\n * **Model:** `gpt-realtime-1.5`\n\n * **Reproduction rate:** 100% across 10+ sessions, multiple networks\n\n\n\n\n### Symptom (console trace)\n\n\n [rt-event] response.done ← AI finishes turn 1\n [rt-event] output_audio_buffer.stopped\n [rt-event] input_audio_buffer.speech_started ← user starts turn 2\n [rt-ice-state] disconnected ← drop, 5–10s later\n [rt-conn-state] disconnected\n [rt-dc-close] readyState: closed\n\n\n`about:webrtc` shows several successful ICE consent-refresh exchanges followed by `STUN-CLIENT(consent): Timed out`. Connection terminates before any application-layer error is signalled.\n\n### Reproduction\n\n 1. Browser-side WebRTC connection to Realtime API on Firefox\n\n 2. Standard `session.update` (full payload below)\n\n 3. AI greets, user speaks turn 1 (works), AI responds (works)\n\n 4. 
User starts speaking turn 2 → drop within 5–10 seconds\n\n\n\n\nReproduces across two networks (WiFi + mobile hotspot), Firefox normal mode + private mode, and Firefox on mobile.\n\n### Reference `session.update` payload\n\n\n {\n \"type\": \"session.update\",\n \"session\": {\n \"type\": \"realtime\",\n \"instructions\": \"<character role-play prompt>\",\n \"audio\": {\n \"input\": {\n \"turn_detection\": { \"type\": \"semantic_vad\" },\n \"transcription\": { \"model\": \"gpt-4o-mini-transcribe\", \"language\": \"en\" }\n },\n \"output\": { \"voice\": \"ash\" }\n }\n }\n }\n\n\n### Hypotheses tested and ruled out\n\nI built URL-param diagnostic toggles for each hypothesis and tested them in isolation. All ruled out:\n\nHypothesis | Test | Result\n---|---|---\nVAD type | Swap `semantic_vad` → `server_vad` with default thresholds | Still drops\nSystem prompt size (~35K chars) | Replace with ~500-char slim prompt | Still drops\n`language` hint on transcription | Drop `\"language\": \"en\"` entirely | Still drops\nCombined Playground-like config | `server_vad` + slim prompt + no language hint | Still drops\nNAT / STUN traversal | Add explicit STUN + public TURN; UDP-direct candidate pair confirmed via `pc.getStats()` | Still drops\nTrust-update mid-session | Disable mid-session `session.update` entirely | Still drops\nClient-side noise gate (AudioContext + GainNode) | Bypass, feed raw mic stream into `pc.addTrack` | Still drops\nBrowser audio processing | `getUserMedia({ audio: { echoCancellation: false, noiseSuppression: false, autoGainControl: false } })` | Still drops\nOutput GainNode (AudioContext on incoming track) | Bypass, native `<audio>` playback | Still drops\nWhisper-1 transcription model | Swap to `gpt-4o-mini-transcribe` | Still drops\n\n### What works\n\n * **Chrome (latest):** full sessions through 5+ minutes, multiple `session.update` events, no drops\n\n * **Edge (latest):** validated end-to-end through full app pipeline (voice → debrief → 
scoring)\n\n * **Realtime Playground in Firefox:** works end-to-end (uses WebSocket transport, not WebRTC)\n\n\n\n\n### What I’d love OpenAI to investigate\n\n * Anything specific to Firefox’s `mtransport` WebRTC stack vs. Chrome’s libwebrtc interacting with the Realtime media server\n\n * Whether the server-side STUN consent-refresh handling assumes Chromium-style timing (Firefox’s consent-refresh is stricter — 5s timeout)\n\n * Whether the GA endpoint applies an implicit default config (e.g., `input_audio_noise_reduction`) that interacts poorly with Firefox’s WebRTC stack. Prior forum threads document similar “Firefox WebRTC suddenly fails, Chrome works, no app-level error” patterns — different surface symptoms but plausibly related\n\n\n",
"title": "Firefox + Realtime API (WebRTC): voice sessions drop deterministically on the user's second speech turn"
}