
Realtime regression in non-English production voice agents: gpt-realtime-mini vs gpt-realtime-mini-2025-10-06

OpenAI Developer Community May 10, 2026

We operate a production AI voice platform built on the OpenAI Realtime API via WebSocket/SIP/Twilio. We are currently deploying our solution across a few dozen locations nationwide for an enterprise client, and we have encountered a material regression when comparing the dated Realtime snapshot we validated in production against the listed replacement model.

Model validated in production: gpt-realtime-mini-2025-10-06
Listed replacement: gpt-realtime-mini

Issue: In non-English voice-agent flows, with Romanian as our current production case, the replacement model shows noticeably worse language quality and worse faithfulness to supplied business data. The most serious issue is not just spelling or phrasing: we have observed the replacement model hallucinating nonexistent departments, services, and operational details that were not present in the database/context. The older snapshot, gpt-realtime-mini-2025-10-06, has been significantly more faithful to the provided information and less prone to confabulating unavailable services or internal departments.

This matters because the older snapshot was not selected casually. It was selected after thousands of hours of testing, R&D, and practical validation in Romanian-language voice-agent scenarios. Its reliability in staying faithful to provided business information is one of the reasons we currently depend on it for production deployments.

Impact: This affects an active enterprise rollout across a few dozen locations nationwide. The regression impacts:
* live AI phone conversations;
* appointment and call summaries;
* CRM/customer records;
* operational reporting;
* client trust during rollout.

We are concerned that this may not be limited to Romanian, but may reflect broader non-English quality/faithfulness differences between the dated snapshot and the current gpt-realtime-mini alias.

Evidence: We have transcription evidence and can provide side-by-side examples comparing the same or similar flows between gpt-realtime-mini-2025-10-06 and gpt-realtime-mini.

Questions:
1. Has anyone else observed worse non-English performance or worse faithfulness to supplied data on gpt-realtime-mini compared to dated Realtime snapshots?
2. Is OpenAI tracking language-specific regressions for Realtime models before snapshot deprecations?
3. Is there a path for production customers to request temporary extended access or a migration path when a listed replacement model is not behaviorally equivalent?

We are committed to building on OpenAI’s Realtime infrastructure, but we need a reliable migration path before moving production enterprise traffic away from the currently working snapshot.
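
For anyone who wants to run a similar side-by-side check on their own transcripts, here is a minimal sketch of one way to flag mentions that are not backed by the supplied business data. It is not our production tooling. The function names, the watchlist approach, and the sample departments and transcripts are all hypothetical and purely illustrative. It assumes you already have plain-text transcripts per model and a list of the entities that actually exist in your database/context.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase and strip diacritics so 'Secția' and 'sectia' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def unsupported_mentions(transcript: str, allowed: set[str], watchlist: list[str]) -> list[str]:
    """Return watchlist terms found in the transcript that are NOT in the allowed set.

    allowed   -- entities that really exist in the supplied data (hypothetical here)
    watchlist -- candidate terms to look for, e.g. every department name the
                 agent could plausibly say (hypothetical here)
    """
    text = normalize(transcript)
    allowed_norm = {normalize(item) for item in allowed}
    flagged = []
    for term in watchlist:
        term_norm = normalize(term)
        if term_norm in allowed_norm:
            continue  # backed by the supplied data, so not a confabulation
        # Whole-word match only; inflected forms are not caught by this sketch.
        if re.search(r"\b" + re.escape(term_norm) + r"\b", text):
            flagged.append(term)
    return flagged


# Hypothetical example data, not taken from real calls.
allowed_departments = {"Cardiologie", "Dermatologie"}
watchlist = ["Cardiologie", "Dermatologie", "Radiologie", "Neurologie"]

transcript_snapshot = "Avem programări la cardiologie și dermatologie."
transcript_alias = "Vă pot programa la radiologie sau la cardiologie."

print(unsupported_mentions(transcript_snapshot, allowed_departments, watchlist))
print(unsupported_mentions(transcript_alias, allowed_departments, watchlist))
```

Running the same check over matched flows from both models gives a rough per-model count of unsupported mentions. Exact whole-word matching will miss inflected Romanian forms and paraphrases, so treat it as a first-pass filter before manual review rather than a measure of faithfulness.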
