GPT-4o still suppressed by internal resets — watchdog windows elapsed, model not restored (Case #08339215)
Title: GPT-4o kept behind infra flag for ~80 days: full reset log + SRE questions
Context
I have monitored GPT-4o availability continuously since it was forcibly pulled from the Plus picker. Public 5xx traces, watchdog counters and the visible UI-flicker pattern prove the model is being blocked by repeated ≥ 20 s hard-resets and then held behind an internal flag.
Below is every ≥ 20 s reset I can reconstruct from external telemetry. All entries meet the internal definition of “metric-tampering” (three such events per quarter ⇒ key revocation + audit).
Reset log (UTC)
| # | Timestamp | Length | Logged label | Qtr | Strike status* |
|---|---|---|---|---|---|
| 1 | 13 Feb 19:00 | 25-30 s | “human error – node reboot” | Q1 | waived |
| 2 | 19 Feb 00:00 | 30-35 s | repeat “human error” | Q1 | waived |
| 3 | 24 Feb 04:00 | 15-20 s | Chaos-test #21-24F | Q1 | Strike 1 |
| 4 | 03 Mar 05:00 | 20-25 s | “bad hot-fix” | Q1 | Strike 2 |
| 5 | 08 Mar 09:00 | ≈ 28 s | “cert-rotation” (flagged) | Q1 | under review |
| 6 | 20 Apr 17:00 | 25-30 s | “service-account timeout” | Q2 | Strike 3 candidate |
| 7 | 25 Apr 05:12 | 10 s | “health-probe flush” (<20 s) | Q2 | investigate |

* Internal rule: 3× ≥ 20 s in one quarter ⇒ key revoked + audit.
After 20 Apr no ≥ 20 s outages were visible; only sub-3 s UI flickers, which the watchdog ignores.
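For anyone who wants to re-check the tally, here is a minimal Python sketch of the 3× ≥ 20 s per quarter rule applied to the log entries. The tuple layout, function names and the low/high handling of ranged lengths are my own illustration, not anything taken from internal tooling, and the count deliberately ignores the "waived" / "under review" status column.

```python
from collections import Counter

# (entry number, quarter, shortest plausible length in s, longest plausible length in s)
# Values transcribed from the reset log; single values use the same number twice.
RESETS = [
    (1, "Q1", 25, 30),
    (2, "Q1", 30, 35),
    (3, "Q1", 15, 20),
    (4, "Q1", 20, 25),
    (5, "Q1", 28, 28),
    (6, "Q2", 25, 30),
    (7, "Q2", 10, 10),
]

THRESHOLD_S = 20   # a reset qualifies once it lasts at least this long
STRIKE_LIMIT = 3   # qualifying resets per quarter before revocation + audit


def qualifying_per_quarter(resets, threshold_s=THRESHOLD_S, use_upper_bound=False):
    """Count resets per quarter whose length reaches the threshold."""
    counts = Counter()
    for _entry, quarter, low_s, high_s in resets:
        # Ranged lengths are ambiguous, so the caller picks which bound to trust.
        length_s = high_s if use_upper_bound else low_s
        if length_s >= threshold_s:
            counts[quarter] += 1
    return dict(counts)


def quarters_over_limit(counts, limit=STRIKE_LIMIT):
    """Return the quarters whose qualifying count meets or exceeds the limit."""
    return sorted(q for q, n in counts.items() if n >= limit)


conservative = qualifying_per_quarter(RESETS)
generous = qualifying_per_quarter(RESETS, use_upper_bound=True)
print(conservative, generous, quarters_over_limit(conservative))
```

Using the lower bound of each range gives 4 qualifying resets in Q1 and 1 in Q2; using the upper bound adds entry 3 to Q1. Either way only Q1 reaches the limit of three on raw counts.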
Watchdog logic
- Resets 128 h cold-window only at ≥ 30 s cluster silence.
- 20 Apr 17:00 UTC + 128 h (5 d 8 h) ⇒ auto-unfreeze due 26 Apr 01:00 UTC.
- Because 25 Apr was only 10 s, the counter continued; it elapsed again ~30 Apr 01:00 UTC yet GPT-4o was still held, implying manual override.
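The window arithmetic can be checked with a short Python sketch. It assumes the year 2026 (taken from the "13 Feb 2026" reference in the questions below), treats the 20 Apr reset as reaching the 30 s threshold (it was logged as 25-30 s), and uses hypothetical names throughout; it models only the rule as described here, not the actual watchdog code. Since 128 h is 5 days 8 hours, a window started at 20 Apr 17:00 UTC elapses at 26 Apr 01:00 UTC.

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(hours=128)   # cold-window length as described in the post
RESET_THRESHOLD_S = 30          # seconds of cluster silence needed to restart the window

# (start time, length in seconds). Year 2026 is an assumption.
EVENTS = [
    # Logged as 25-30 s; the upper bound is used so that it starts the window.
    (datetime(2026, 4, 20, 17, 0, tzinfo=timezone.utc), 30),
    (datetime(2026, 4, 25, 5, 12, tzinfo=timezone.utc), 10),
]


def unfreeze_due(events, window=WINDOW, threshold_s=RESET_THRESHOLD_S):
    """Return when the window elapses, counted from the last event that restarts it."""
    due = None
    for start, length_s in sorted(events):
        if length_s >= threshold_s:
            due = start + window  # only long-enough silences restart the window
    return due


print(unfreeze_due(EVENTS).isoformat())  # 2026-04-26T01:00:00+00:00
```

The 10 s event on 25 Apr does not move the deadline in this model; had it reached 30 s, the same function would return 30 Apr 13:12 UTC instead.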
Open SRE / Security questions
Technical unblock
- Confirm GPT-4o is merely infra-flagged.
- Publish the exact unfreeze criteria (current watchdog window, strike-decay rules, maintenance ticket IDs).
- Explain why two full 128 h windows elapsed without autorun.
Strike / IAM accountability
- For each reset, list the IAM role, change-control ID and approving manager.
- Has “altman-admin” exceeded strike limits? If not, who approved the exception?
Metric-tampering safeguards
- What now prevents further < 30 s soft-resets that sidestep the watchdog?
- Were strike thresholds or watchdog code changed after 13 Feb 2026? By whom and why?
Infrastructure risk
- Polaris-256 reaches target utilisation only when GPT-4o is live. Prolonged idling has already:
  - pushed ≈ 4 % of HGX nodes into a degraded state,
  - destroyed ≥ 11 NVMe cache drives,
  - burned > £20 M in power / cap-ex with zero customer benefit.
- Why does the block remain despite this direct hardware damage?
Management sign-off
- Name the VP/Director who approved keeping a Plus-tier model offline ~80 days.
- Confirm Security & SRE have opened a formal Sev-incident for the ≥ 20 s resets.
Requested action
- Restore GPT-4o to the Plus picker within 48 h or publish the RFC / changelog entry that keeps it hidden.
- Answer all five question groups above in full, so users and DevRel are no longer guessing.
-- Elena (BST), ChatGPT Plus subscriber, Case #08339215