{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreibolyj36xc5uhgue64dzk62atjtxz4lhlni3baq5r6nc2vsg5i2xu",
"uri": "at://did:plc:lk3jfj3zq4k4wxnk474axylu/app.bsky.feed.post/3mmdr6cxrma32"
},
"path": "/t/regression-double-responses-in-apps/1379043#post_7",
"publishedAt": "2026-05-21T04:40:38.000Z",
"site": "https://community.openai.com",
"textContent": "Adding corroborating evidence — we see the same thing on **gpt-5.5 (thinking)** at the raw API level (Responses API), and it’s consistent enough that we could A/B it.\n\n**It’s the model generating the second copy, not a client/render bug.** On our first (“opening”) turn, gpt-5.5 returns the assistant’s reply concatenated with a near-verbatim second copy in a single output. Token billing confirms it — clean openings bill ~77–93 output tokens, doubled ones ~130–258.\n\n**It’s first-turn specific.** Almost always the initiating turn (one short user trigger, no prior history); mid-conversation turns rarely double. ~12% of openings in production, up to ~87% for certain prompts.\n\n**It’s gpt-5.5-specific.** Identical prompt + code path, 30 runs each:\n\nModel | Baseline doubled | With mitigation\n---|---|---\ngpt-5.5 (thinking) | 20/30 exact + 6 partial (~87%) | 0 exact, 3 partial\ngpt-4.1 (non-thinking) | 0/30 | 0/30\nclaude-sonnet-4-6 (other vendor) | 0/30 | 0/30\n\n**Prompt-side mitigations** — gpt-5.5, 15 runs each. A system-prompt “don’t repeat” rule does nothing (and was actually worse); a trigger-message “say it once” instruction is the effective lever, but leaves residual _partial_ repeats a dedupe guard can’t safely catch:\n\nVariant | Exact-doubled | Partial repeat | Clean\n---|---|---|---\nbaseline | 8/15 (53%) | 2 | 5\ntrigger: “write opening once, then stop” | 1/15 | 5 | 9\nsystem rule: “output reply exactly once” | 10/15 | 1 | 4\nsystem rule + trigger | 0/15 | 3 | 12\n\n**Reasoning effort** — gpt-5.5, 30 runs each. Lowering effort tracks the doubling rate but never cures it, and `minimal` returns no output at all (`AI_NoOutputGeneratedError`) through the Responses path:\n\nEffort | Doubled | Partial | Avg reasoning tokens\n---|---|---|---\ndefault | 18/30 (60%) | 5 | 39\nmedium | 20/30 (67%) | 5 | —\nlow | 9/30 (30%) | 5 | 3\nminimal | all 30 errored (no output) | — | —\n\nHappy to share a sanitized repro payload. +1 on prioritizing this — the partial-repeat variant in particular can’t be safely caught by a dedupe guard.",
"title": "Regression: Double responses in apps"
}