
Realtime API: Poor Portuguese call quality with gpt-realtime-mini / gpt-realtime

OpenAI Developer Community May 20, 2026

Hi everyone,

I’d like to share an issue I’m experiencing in a real production scenario using the OpenAI Realtime API for phone calls in Portuguese.

I have tested different Realtime models, including gpt-realtime-mini and gpt-realtime, but the problem is very similar across them.

In Brazil, many users answer calls on speakerphone. The microphone then picks up a lot of background noise from the environment, which interferes with the call. As a result, the AI frequently starts talking on its own or responds before the user has clearly finished speaking.

Another frequent issue is speech recognition accuracy when collecting data. When the user provides an address, a name, or other important information, the AI often hears completely different words, sometimes ones that do not even sound similar. For example, the person says “Apple”, but the AI understands “Oscar”.

Currently, I’m using Realtime directly via SIP, with transcription enabled, and I accept the call using these parameters:

    'audio' => [
        'input' => [
            'format' => ['type' => 'audio/pcmu'],

            'transcription' => [
                'model' => 'gpt-4o-transcribe',
                'language' => 'pt',
                'prompt' => 'Transcribe with maximum fidelity. Proper names are critical. Do not correct names. If you’re unsure, keep exactly what you heard.'
            ],

            'turn_detection' => [
                'type' => 'server_vad',
                'threshold' => 0.7,
                'prefix_padding_ms' => 300,
                'silence_duration_ms' => 1100,
                'idle_timeout_ms' => 12000,

                'create_response' => true,
                'interrupt_response' => true,
            ],
        ],

        'output' => [
            'format' => ['type' => 'audio/pcmu'],
            'voice' => 'marin',
        ],
    ]
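
For anyone who wants to experiment, a variation of the input block might look like the sketch below. It assumes that the `noise_reduction` option (with the `far_field` type) and the `semantic_vad` turn detection mode described in the Realtime session documentation are also honored on SIP calls. The `$audioInput` variable name is only illustrative. This has not been tested here and is not a confirmed fix.

```php
<?php
// Hypothetical variation of the 'audio' => 'input' block above.
// Untested on SIP calls; every value here is a starting point, not a recommendation.
$audioInput = [
    'format' => ['type' => 'audio/pcmu'],

    // Assumption: far_field noise reduction is the mode aimed at
    // speakerphone-style audio, where the speaker is far from the microphone.
    'noise_reduction' => ['type' => 'far_field'],

    // Same transcription settings as in the original configuration.
    'transcription' => [
        'model' => 'gpt-4o-transcribe',
        'language' => 'pt',
        'prompt' => 'Transcribe with maximum fidelity. Proper names are critical. Do not correct names. If you’re unsure, keep exactly what you heard.',
    ],

    // Assumption: semantic_vad with low eagerness waits longer before deciding
    // that the user has finished. The server_vad-specific fields (threshold,
    // prefix_padding_ms, silence_duration_ms, idle_timeout_ms) are left out
    // because, as far as I understand, they only apply to server_vad.
    'turn_detection' => [
        'type' => 'semantic_vad',
        'eagerness' => 'low',
        'create_response' => true,
        'interrupt_response' => true,
    ],
];

// Print the resulting structure to check it before sending it with the accept request.
echo json_encode(
    ['audio' => ['input' => $audioInput]],
    JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE
), PHP_EOL;
```

The only differences from the configuration above are the added `noise_reduction` entry and the switch from `server_vad` to `semantic_vad`, so the two setups can be compared on the same calls.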

Has anyone had a better experience with calls in noisy environments, especially with non-English-speaking users on speakerphone?

Any recommendations for improving transcription accuracy, turn detection, VAD configuration, or reducing cases where the AI starts speaking by itself would be very welcome.

I have already tested gpt-realtime-1.5 and gpt-realtime-2, but they are too expensive for my use case right now. In my tests, the same issues also occurred with them.
