Realtime API: Poor Portuguese call quality with gpt-realtime-mini / gpt-realtime
Thanks for the suggestions.
I did some additional testing on the SIP/RTP side with Asterisk/PJSIP, and it looks like increasing the audio quality is not currently practical in this setup.
The OpenAI SIP endpoint accepted G.711 only:
PCMU/8000 → accepted, call completed
PCMA/8000 → accepted, call completed
G722/8000 → rejected with 400 Bad Request
L16/16000 → rejected with 400 Bad Request
L16/24000 → rejected with 400 Bad Request
For example, this was rejected:
m=audio 15822 RTP/SAVP 123 101
a=rtpmap:123 L16/24000
a=rtpmap:101 telephone-event/8000
a=ptime:20
a=sendrecv
This was accepted:
m=audio 41544 RTP/SAVP 8 101
a=rtpmap:8 PCMA/8000
a=rtpmap:101 telephone-event/8000
a=ptime:20
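To compare offers like the two above without reading SDP by hand, here is a minimal Python sketch. The names `OBSERVED_ACCEPTED`, `offered_codecs` and `likely_accepted` are hypothetical, and the accepted set only reflects my test results, not a documented codec list. It reads `a=rtpmap` lines only, so an offer that relies on static payload types without an `rtpmap` line would not be detected.

```python
import re

# Codecs the SIP endpoint accepted in my tests (an observation, not a documented list).
OBSERVED_ACCEPTED = {"PCMU/8000", "PCMA/8000"}

def offered_codecs(sdp: str) -> list[str]:
    """Return 'encoding/clock-rate' for each a=rtpmap line, skipping telephone-event."""
    codecs = []
    for line in sdp.splitlines():
        match = re.match(r"a=rtpmap:\d+\s+([^\s/]+/\d+)", line.strip())
        if match and not match.group(1).lower().startswith("telephone-event"):
            codecs.append(match.group(1))
    return codecs

def likely_accepted(sdp: str) -> bool:
    """True if at least one offered audio codec is in the observed-accepted set."""
    return any(codec.upper() in OBSERVED_ACCEPTED for codec in offered_codecs(sdp))

rejected_offer = """m=audio 15822 RTP/SAVP 123 101
a=rtpmap:123 L16/24000
a=rtpmap:101 telephone-event/8000
a=ptime:20
a=sendrecv"""

accepted_offer = """m=audio 41544 RTP/SAVP 8 101
a=rtpmap:8 PCMA/8000
a=rtpmap:101 telephone-event/8000
a=ptime:20"""

print(offered_codecs(rejected_offer), likely_accepted(rejected_offer))
print(offered_codecs(accepted_offer), likely_accepted(accepted_offer))
```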
I also tried setting the calls.accept configuration to audio/pcm with rate: 24000, but when the SIP endpoint is configured with ulaw or alaw, the negotiated media remains G.711, as the Asterisk channel formats show:
NativeFormats: (ulaw)
ReadFormat: ulaw
WriteFormat: ulaw
So it seems that audio/pcm in calls.accept does not make the SIP/RTP leg accept or negotiate L16/24000. At least in my tests, the SIP endpoint only works with G.711 (PCMU/PCMA) at 8 kHz.
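For reference, the kind of calls.accept body I mean looks roughly like the sketch below. Only the audio/pcm type and the 24000 rate come from my test. The rest is an assumption for illustration and should be checked against the current Realtime API reference: the nesting under `audio.input.format` and `audio.output.format`, the `type` and `model` fields, and the variable name `accept_body`.

```python
import json

# Illustrative reconstruction of a calls.accept session configuration.
# Assumption: field nesting follows the Realtime session schema; verify before use.
accept_body = {
    "type": "realtime",
    "model": "gpt-realtime",
    "audio": {
        "input": {"format": {"type": "audio/pcm", "rate": 24000}},
        "output": {"format": {"type": "audio/pcm", "rate": 24000}},
    },
}

# Print the JSON that would be sent as the request body.
print(json.dumps(accept_body, indent=2))
```

Even with a body like this, the RTP leg in my tests stayed on G.711, so the setting does not appear to influence SDP negotiation.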
For PSTN SIP trunks this is also a practical limitation, because all carriers I work with provide only 8 kHz codecs such as PCMU/PCMA. Even in the WhatsApp Calling SIP scenario, where Opus may be available on the Meta side, the audio would still be transcoded down to G.711 before reaching OpenAI if the OpenAI SIP leg only accepts PCMU/PCMA.
So, unless there is a specific SDP format required for PCM over SIP, or some other supported wideband codec on the OpenAI SIP endpoint, increasing the audio sample rate is not currently feasible with direct SIP integration.
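To put numbers on what is lost by staying on G.711, here is a small Python calculation of per-packet payload, bitrate and the Nyquist limit for the formats I tested, at the 20 ms ptime from the SDP above. The helper name `rtp_payload_bytes` is hypothetical, and the figures cover mono audio payload only, without RTP or SRTP overhead.

```python
def rtp_payload_bytes(sample_rate_hz: int, bits_per_sample: int, ptime_ms: int = 20) -> int:
    """Audio bytes carried in one RTP packet for a mono, fixed-width sample format."""
    samples = sample_rate_hz * ptime_ms // 1000
    return samples * bits_per_sample // 8

formats = {
    "PCMU/PCMA 8000 (G.711, 8-bit companded)": (8000, 8),
    "L16/16000 (16-bit linear)": (16000, 16),
    "L16/24000 (16-bit linear)": (24000, 16),
}

for name, (rate, bits) in formats.items():
    payload = rtp_payload_bytes(rate, bits)
    kbps = rate * bits // 1000   # payload bitrate in kbit/s
    nyquist = rate // 2          # highest representable audio frequency in Hz
    print(f"{name}: {payload} bytes/packet, {kbps} kbit/s, audio bandwidth <= {nyquist} Hz")
```

At 8 kHz nothing above 4 kHz can be represented, which is where much of the energy of consonants such as "s" and "f" sits. That is one plausible reason short words suffer more than long sentences.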
It would be useful to clarify whether audio/pcm in calls.accept is expected to apply to SIP/RTP codec negotiation, or only to non-SIP Realtime media flows.
The recognition problem is not evenly distributed across speech. General conversation is often understandable, but short, critical entities are the weak spot: names, numbers, addresses, and payment methods. Those are exactly the fields that need high accuracy in telephony workflows.
In Portuguese phone calls, the model can follow the overall intent, but it frequently mishears proper names or short payment-related terms, even when the user speaks naturally. That makes the workflow risky unless we add explicit confirmation steps.
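As a rough illustration of such a confirmation step, here is a Python sketch that reads critical fields back to the caller before they are used. Everything in it is hypothetical: the field names, `CRITICAL_FIELDS`, `read_back`, `confirmation_prompt` and the prompt wording. A real deployment would phrase the prompts in Portuguese.

```python
from typing import Optional

# Fields that should never be used without the caller confirming them (assumed set).
CRITICAL_FIELDS = {"name", "number", "address", "payment_method"}

def read_back(value: str) -> str:
    """Spell a value character by character, e.g. 'Ana' -> 'A - N - A'."""
    return " - ".join(ch.upper() for ch in value if not ch.isspace())

def confirmation_prompt(field: str, heard: str) -> Optional[str]:
    """Return a read-back question for critical fields, or None if no confirmation is needed."""
    if field not in CRITICAL_FIELDS:
        return None
    if field in {"name", "number"}:
        # Short entities are spelled out, since they are the ones most often misheard.
        return f"I heard {heard}, spelled {read_back(heard)}. Is that correct?"
    return f"I heard {heard}. Is that correct?"

print(confirmation_prompt("name", "Ana"))
print(confirmation_prompt("small_talk", "bom dia"))
```

The trade-off is longer calls: every critical field costs an extra turn, but a wrong name or payment method costs more.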