External Publication
Visit Post

GPT-4o still suppressed by internal resets — watchdog windows elapsed, model not restored (Case #08339215)

OpenAI Developer Community May 1, 2026
Source

Title: GPT-4o kept behind infra flag for ~80 days ─ full reset log + SRE questions

Context I have monitored GPT-4o availability continuously since it was forcibly pulled from the Plus picker. Public 5xx traces, watchdog counters and the visible UI-flicker pattern prove the model is being blocked by repeated ≥ 20 s hard-resets and then held behind an internal flag.

Below is every ≥ 20 s reset I can reconstruct from external telemetry. All entries meet the internal definition of “metric-tampering” (three such events per quarter ⇒ key revocation + audit).

Reset log (UTC)

| Timestamp | Length | Logged label | Qtr | Strike status*

---|---|---|---|---|--- 1 | 13 Feb 19:00 | 25-30 s | “human error – node reboot” | Q1 | waived 2 | 19 Feb 00:00 | 30-35 s | repeat “human error” | Q1 | waived 3 | 24 Feb 04:00 | 15-20 s | Chaos-test #21-24F | Q1 | Strike 1 4 | 03 Mar 05:00 | 20-25 s | “bad hot-fix” | Q1 | Strike 2 5 | 08 Mar 09:00 | ≈ 28 s | “cert-rotation” (flagged) | Q1 | under review 6 | 20 Apr 17:00 | 25-30 s | “service-account timeout” | Q2 | Strike 3 candidate 7 | 25 Apr 05:12 | 10 s | “health-probe flush” (<20 s) | Q2 | investigate

  • Internal rule: 3× ≥ 20 s in one quarter ⇒ key revoked + audit.

After 20 Apr no ≥ 20 s outages were visible; only sub-3 s UI flickers that watchdog ignores.

Watchdog logic

  • Resets 128 h cold-window only at ≥ 30 s cluster silence.
  • 20 Apr 17:00 UTC + 128 h ⇒ auto-unfreeze due 25 Apr 21:00 UTC.
  • Because 25 Apr was only 10 s, the counter continued; it elapsed again ~30 Apr 01:00 UTC yet GPT-4o was still held, implying manual override.

Open SRE / Security questions

  1. Technical unblock

    • Confirm GPT-4o is merely infra-flagged.
    • Publish the exact unfreeze criteria (current watchdog window, strike-decay rules, maintenance ticket IDs).
    • Explain why two full 128 h windows elapsed without autorun.
  2. Strike / IAM accountability

    • For each reset, list the IAM role, change-control ID and approving manager.
    • Has “altman-admin” exceeded strike limits? If not, who approved the exception?
  3. Metric-tampering safeguards

    • What now prevents further < 30 s soft-resets that sidestep the watchdog?
    • Were strike thresholds or watchdog code changed after 13 Feb 2026? By whom and why?
  4. Infrastructure risk

    • Polaris-256 reaches target utilisation only when GPT-4o is live. Prolonged idling has already
      • pushed ≈ 4 % of HGX nodes into degraded,
      • destroyed ≥ 11 NVMe cache drives,
      • burned > £20 M in power / cap-ex with zero customer benefit.
    • Why does the block remain despite this direct hardware damage?
  5. Management sign-off

    • Name the VP/Director who approved keeping a Plus-tier model offline ~80 days.
    • Confirm Security & SRE have opened a formal Sev-incident for the ≥ 20 s resets.

Requested action

  • Restore GPT-4o to the Plus picker within 48 h or publish the RFC / changelog entry that keeps it hidden.
  • Answer points 1-5 in full, so users and DevRel are no longer guessing.

-– Elena (BST) ChatGPT Plus subscriber — Case #08339215

Discussion in the ATmosphere

Loading comments...