External Publication
Visit Post

Regression: Double responses in apps

OpenAI Developer Community May 21, 2026
Source

Adding corroborating evidence — we see the same thing on gpt-5.5 (thinking) at the raw API level (Responses API), and it’s consistent enough that we could A/B it.

It’s the model generating the second copy, not a client/render bug. On our first (“opening”) turn, gpt-5.5 returns the assistant’s reply concatenated with a near-verbatim second copy in a single output. Token billing confirms it — clean openings bill ~77–93 output tokens, doubled ones ~130–258.

It’s first-turn specific. Almost always the initiating turn (one short user trigger, no prior history); mid-conversation turns rarely double. ~12% of openings in production, up to ~87% for certain prompts.

It’s gpt-5.5-specific. Identical prompt + code path, 30 runs each:

Model Baseline doubled With mitigation
gpt-5.5 (thinking) 20/30 exact + 6 partial (~87%) 0 exact, 3 partial
gpt-4.1 (non-thinking) 0/30 0/30
claude-sonnet-4-6 (other vendor) 0/30 0/30

Prompt-side mitigations — gpt-5.5, 15 runs each. A system-prompt “don’t repeat” rule does nothing (and was actually worse); a trigger-message “say it once” instruction is the effective lever, but leaves residual partial repeats a dedupe guard can’t safely catch:

Variant Exact-doubled Partial repeat Clean
baseline 8/15 (53%) 2 5
trigger: “write opening once, then stop” 1/15 5 9
system rule: “output reply exactly once” 10/15 1 4
system rule + trigger 0/15 3 12

Reasoning effort — gpt-5.5, 30 runs each. Lowering effort tracks the doubling rate but never cures it, and minimal returns no output at all (AI_NoOutputGeneratedError) through the Responses path:

Effort Doubled Partial Avg reasoning tokens
default 18/30 (60%) 5 39
medium 20/30 (67%) 5
low 9/30 (30%) 5 3
minimal all 30 errored (no output)

Happy to share a sanitized repro payload. +1 on prioritizing this — the partial-repeat variant in particular can’t be safely caught by a dedupe guard.

Discussion in the ATmosphere

Loading comments...