
Why do gpt-5.1 and gpt-5.4-mini behave so differently in production chatbot use cases?

OpenAI Developer Community May 14, 2026

I have been testing a model change from gpt-5.1 to gpt-5.4-mini in the Responses API, but after many tests, I feel that gpt-5.4-mini is not reliable enough for my current production chatbot use cases.

In my tests, gpt-5.4-mini loses context more easily, does not always respect the prompt rules, and applies instructions inconsistently, sometimes following a rule and sometimes ignoring it. Compared to gpt-5.1, the difference is very noticeable.

To be clear, I understand that gpt-5.4-mini may be a good option for many workloads, especially considering the lower price. However, in my customer service chatbot scenarios, where prompt-following, context retention, business rules, and tool execution decisions are very important, the results have been significantly worse than with gpt-5.1.

So far, gpt-5.1 has been the best model I have found for chatbot customer service scenarios using the Responses API. I was also reading another post in the community and saw other people reporting similar behavior.

On paper, the models look somewhat similar in terms of general capabilities and context size, but in real production chatbot usage I am seeing a big difference in reliability.

Has anyone else experienced the same difference between these two models?

Also, is there another model you would recommend for chatbot/customer service use cases with the Responses API, especially when instruction-following, context retention, and tool usage reliability are very important?

Below is a comparison of the two models. On paper they look quite similar, but in real usage I am seeing a big quality difference.
