gpt-realtime-1.5 leaks audio control tokens (<|audio_text|>, <|caption_quality_N|>) into text stream when run with modalities=["text"]

OpenAI Developer Community April 18, 2026

Source

Affected: gpt-realtime-1.5 (OpenAI direct API and Azure OpenAI deployments). gpt-realtime is not affected.

Reproduction (minimal idea):

Open a Realtime API session with modalities: ["text"] (no audio output requested).
Send a normal user message via input_audio_buffer (audio in) or conversation.item.create (text in).
Observe the assistant’s response.text.delta / response.output_text.delta events.

Expected: Text stream contains only the spoken transcript.

Actual: The text stream is interleaved with audio-side control tokens, e.g.:

<|audio_text|><|caption_quality_9|>Hello, how can I help you today?

These tokens never appear with gpt-realtime. They appear consistently with gpt-realtime-1.5 on the very first response of every session, regardless of system prompt.

Why this matters in production: When the Realtime LLM is paired with an external TTS (e.g. ElevenLabs, Cartesia, etc.) — which is the standard “realtime LLM + 3rd-party voice” architecture — the raw text stream is fed to the TTS engine. The engine speaks the tokens literally , so users hear “audio text caption quality nine …” prefixed to every assistant reply. With OpenAI’s native voice (modalities=["text","audio"]), the tokens stay inside OpenAI’s TTS path and are never spoken, which is why the bug is invisible if you only test with OpenAI voice.

Sample log line from a real call (LiveKit agents transcript):

Discussion in the ATmosphere