Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiazsu7pptae42nbq2yo5onn4iz6bo4o773lwrpirno6pbqr7guirq",
    "uri": "at://did:plc:lk3jfj3zq4k4wxnk474axylu/app.bsky.feed.post/3mls4qtu3pgz2"
  },
  "path": "/t/why-do-gpt-5-1-and-gpt-5-4-mini-behave-so-differently-in-production-chatbot-use-cases/1380891#post_1",
  "publishedAt": "2026-05-14T04:12:03.000Z",
  "site": "https://community.openai.com",
  "textContent": "I have been testing a model change from **gpt-5.1** to **gpt-5.4-mini** in the **Responses API** , but after many tests, I feel that gpt-5.4-mini is not reliable enough for my current production chatbot use cases.\n\nIn my tests, **gpt-5.4-mini loses context more easily** , does not always respect the prompt rules, and sometimes applies or ignores instructions inconsistently. Compared to **gpt-5.1** , the difference is very noticeable.\n\nTo be clear, I understand that **gpt-5.4-mini may be a good option for many workloads** , especially considering the lower price. However, in my customer service chatbot scenarios, where prompt-following, context retention, business rules, and tool execution decisions are very important, the results have been significantly worse than with gpt-5.1.\n\nSo far, **gpt-5.1 has been the best model I have found for chatbot customer service scenarios** using the Responses API. I was also reading another post in the community and saw other people reporting similar behavior.\n\nOn paper, the models look somewhat similar in terms of general capabilities and context size, but in real production chatbot usage I am seeing a big difference in reliability.\n\nHas anyone else experienced the same difference between these two models?\n\nAlso, is there another model you would recommend for chatbot/customer service use cases with the Responses API, especially when instruction-following, context retention, and tool usage reliability are very important?\n\nBelow is a comparison of the two models. On paper they look quite similar, but in real usage I am seeing a big quality difference.",
  "title": "Why do gpt-5.1 and gpt-5.4-mini behave so differently in production chatbot use cases?"
}