gpt-realtime-1.5 leaks audio control tokens (<|audio_text|>, <|caption_quality_N|>) into text stream when run with modalities=["text"]
Affected: gpt-realtime-1.5 (OpenAI direct API and Azure OpenAI deployments). gpt-realtime is not affected.
Reproduction (minimal idea):
1. Open a Realtime API session with modalities: ["text"] (no audio output requested).
2. Send a normal user message via input_audio_buffer (audio in) or conversation.item.create (text in).
3. Observe the assistant’s response.text.delta / response.output_text.delta events.
Expected: Text stream contains only the spoken transcript.
Actual: The text stream is interleaved with audio-side control tokens, e.g.:
<|audio_text|><|caption_quality_9|>Hello, how can I help you today?
These tokens never appear with gpt-realtime. They appear consistently with gpt-realtime-1.5 on the very first response of every session, regardless of system prompt.
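To confirm the leak in captured logs, the text deltas of a response can be concatenated and scanned for control tokens. The sketch below is a minimal illustration, not an official tool: `find_leaked_tokens` is a hypothetical helper, and the `<|name|>` pattern is an assumption inferred from the two tokens shown in this report, so other token shapes may exist.

```python
import re

# Assumption: leaked control tokens all look like <|name|>, as in the
# two examples from this report (<|audio_text|>, <|caption_quality_N|>).
CONTROL_TOKEN = re.compile(r"<\|[A-Za-z0-9_]+\|>")


def find_leaked_tokens(deltas):
    """Return every control token found in the joined text deltas.

    Joining first matters: a token may be split across two delta events.
    """
    text = "".join(deltas)
    return CONTROL_TOKEN.findall(text)


# Deltas as they might arrive from response.text.delta events
# (illustrative data based on the example in this report).
deltas = ["<|audio_", "text|><|caption_quality_9|>Hello, ", "how can I help you today?"]
print(find_leaked_tokens(deltas))
```

An empty list for gpt-realtime and a non-empty list for gpt-realtime-1.5 on the same input would match the behavior described above.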
Why this matters in production: When the Realtime LLM is paired with an external TTS (e.g. ElevenLabs or Cartesia), which is the standard “realtime LLM + 3rd-party voice” architecture, the raw text stream is fed to the TTS engine. The engine speaks the tokens literally, so users hear “audio text caption quality nine …” prefixed to every assistant reply. With OpenAI’s native voice (modalities=["text","audio"]), the tokens stay inside OpenAI’s TTS path and are never spoken, which is why the bug is invisible if you only test with OpenAI voice.
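Until this is fixed upstream, one possible client-side workaround is to strip the tokens from the text stream before it reaches the TTS engine. The following is a minimal sketch under stated assumptions: `ControlTokenFilter` is a hypothetical class (not part of the OpenAI or LiveKit SDKs), and it assumes every control token has the `<|name|>` shape seen in this report. It buffers a trailing fragment such as `<|audio_` so that a token split across two deltas is still removed.

```python
import re

# Assumption: control tokens look like <|name|> (inferred from this report).
CONTROL_TOKEN = re.compile(r"<\|[A-Za-z0-9_]+\|>")
# A trailing fragment that could still grow into a control token:
# "<", "<|", "<|name" or "<|name|" at the very end of the text.
PARTIAL_TOKEN = re.compile(r"<(\|[A-Za-z0-9_]*\|?)?\Z")


class ControlTokenFilter:
    """Hypothetical streaming filter that removes control tokens from deltas."""

    def __init__(self):
        self._pending = ""  # possible start of a token, held back for now

    def feed(self, delta: str) -> str:
        """Accept one text delta and return the text that is safe to speak."""
        text = CONTROL_TOKEN.sub("", self._pending + delta)
        match = PARTIAL_TOKEN.search(text)
        if match:
            # Hold back the ambiguous tail until the next delta arrives.
            self._pending = text[match.start():]
            return text[:match.start()]
        self._pending = ""
        return text

    def flush(self) -> str:
        """Return any held-back text once the response is complete."""
        rest, self._pending = self._pending, ""
        return rest


# Usage: tokens split across delta boundaries are still removed.
token_filter = ControlTokenFilter()
chunks = ["<|audio_", "text|><|capt", "ion_quality_9|>Hello, ", "how can I help you today?"]
spoken = "".join(token_filter.feed(c) for c in chunks) + token_filter.flush()
print(spoken)
```

Holding back only the ambiguous tail keeps added latency small: ordinary text passes through immediately, and a lone `<` that turns out not to start a token is released with the next delta or on `flush()`.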
Sample log line from a real call (LiveKit agents transcript):