Regression: Double responses in apps
Adding corroborating evidence — we see the same thing on gpt-5.5 (thinking) at the raw API level (Responses API), and it’s consistent enough that we could A/B it.
It’s the model generating the second copy, not a client/render bug. On our first (“opening”) turn, gpt-5.5 returns the assistant’s reply concatenated with a near-verbatim second copy in a single output. Token billing confirms it — clean openings bill ~77–93 output tokens, doubled ones ~130–258.
It’s first-turn specific. Almost always the initiating turn (one short user trigger, no prior history); mid-conversation turns rarely double. ~12% of openings in production, up to ~87% for certain prompts.
It’s gpt-5.5-specific. Identical prompt + code path, 30 runs each:
| Model | Baseline doubled | With mitigation |
|---|---|---|
| gpt-5.5 (thinking) | 20/30 exact + 6 partial (~87%) | 0 exact, 3 partial |
| gpt-4.1 (non-thinking) | 0/30 | 0/30 |
| claude-sonnet-4-6 (other vendor) | 0/30 | 0/30 |
Prompt-side mitigations — gpt-5.5, 15 runs each. A system-prompt “don’t repeat” rule does nothing (and was actually worse); a trigger-message “say it once” instruction is the effective lever, but leaves residual partial repeats a dedupe guard can’t safely catch:
| Variant | Exact-doubled | Partial repeat | Clean |
|---|---|---|---|
| baseline | 8/15 (53%) | 2 | 5 |
| trigger: “write opening once, then stop” | 1/15 | 5 | 9 |
| system rule: “output reply exactly once” | 10/15 | 1 | 4 |
| system rule + trigger | 0/15 | 3 | 12 |
Reasoning effort — gpt-5.5, 30 runs each. Lowering effort tracks the doubling rate but never cures it, and minimal returns no output at all (AI_NoOutputGeneratedError) through the Responses path:
| Effort | Doubled | Partial | Avg reasoning tokens |
|---|---|---|---|
| default | 18/30 (60%) | 5 | 39 |
| medium | 20/30 (67%) | 5 | — |
| low | 9/30 (30%) | 5 | 3 |
| minimal | all 30 errored (no output) | — | — |
Happy to share a sanitized repro payload. +1 on prioritizing this — the partial-repeat variant in particular can’t be safely caught by a dedupe guard.
Discussion in the ATmosphere