{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreihe4io6c7ejecgax4qpt7ehx543oqngi4rwnlb6y5hmuzd4zdlayy",
    "uri": "at://did:plc:lk3jfj3zq4k4wxnk474axylu/app.bsky.feed.post/3mjrdgpzogib2"
  },
  "path": "/t/gpt-realtime-1-5-leaks-audio-control-tokens-audio-text-caption-quality-n-into-text-stream-when-run-with-modalities-text/1379235#post_1",
  "publishedAt": "2026-04-18T10:37:09.000Z",
  "site": "https://community.openai.com",
  "textContent": "**Affected:** `gpt-realtime-1.5` (OpenAI direct API and Azure OpenAI deployments). `gpt-realtime` is not affected.\n\n**Reproduction (minimal idea):**\n\n  1. Open a Realtime API session with `modalities: [\"text\"]` (no audio output requested).\n\n  2. Send a normal user message via `input_audio_buffer` (audio in) or `conversation.item.create` (text in).\n\n  3. Observe the assistant’s `response.text.delta` / `response.output_text.delta` events.\n\n**Expected:** The text stream contains only the spoken transcript.\n\n**Actual:** The text stream is interleaved with audio-side control tokens, e.g.:\n\n<|audio_text|><|caption_quality_9|>Hello, how can I help you today?\n\nThese tokens never appear with `gpt-realtime`. They appear consistently with `gpt-realtime-1.5` on the very first response of every session, regardless of system prompt.\n\n**Why this matters in production:** When the Realtime LLM is paired with an **external TTS** (e.g. ElevenLabs or Cartesia), which is the standard “realtime LLM + third-party voice” architecture, the raw text stream is fed to the TTS engine. The engine speaks the tokens **literally**, so users hear “audio text caption quality nine …” prefixed to every assistant reply. With OpenAI’s native voice (`modalities=[\"text\",\"audio\"]`), the tokens stay inside OpenAI’s TTS path and are never spoken, which is why the bug is invisible if you test only with OpenAI’s voice.\n\n**Sample log line from a real call** (LiveKit Agents transcript):\n\n[AGENT]: <|audio_text|><|caption_quality_9|>",
  "title": "gpt-realtime-1.5 leaks audio control tokens (<|audio_text|>, <|caption_quality_N|>) into text stream when run with modalities=[\"text\"]"
}