Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiem64j6xkvrbjhoxebj2ghnaokve4kaaazdwcfntwz5jtvjkohtqa",
    "uri": "at://did:plc:lk3jfj3zq4k4wxnk474axylu/app.bsky.feed.post/3mkrcwngbgme2"
  },
  "path": "/t/gpt-4o-still-suppressed-by-internal-resets-watchdog-windows-elapsed-model-not-restored-case-08339215/1380123#post_1",
  "publishedAt": "2026-05-01T03:10:23.000Z",
  "site": "https://community.openai.com",
  "textContent": "Title: GPT-4o kept behind infra flag for ~80 days: full reset log + SRE questions\n\nContext\nI have monitored GPT-4o availability continuously since it was forcibly pulled from the Plus picker.\nPublic 5xx traces, watchdog counters and the visible UI-flicker pattern prove the model is being blocked by repeated ≥ 20 s hard resets and then held behind an internal flag.\n\nBelow is every reset I can reconstruct from external telemetry, including two borderline or sub-20 s events (#3 and #7). Every entry of ≥ 20 s meets the internal definition of “metric-tampering” (three such events per quarter ⇒ key revocation + audit).\n\nReset log (UTC)\n\n# | Timestamp | Length | Logged label | Qtr | Strike status*\n---|---|---|---|---|---\n1 | 13 Feb 19:00 | 25-30 s | “human error – node reboot” | Q1 | waived\n2 | 19 Feb 00:00 | 30-35 s | repeat “human error” | Q1 | waived\n3 | 24 Feb 04:00 | 15-20 s | Chaos-test #21-24F | Q1 | Strike 1\n4 | 03 Mar 05:00 | 20-25 s | “bad hot-fix” | Q1 | Strike 2\n5 | 08 Mar 09:00 | ≈ 28 s | “cert-rotation” (flagged) | Q1 | under review\n6 | 20 Apr 17:00 | 25-30 s | “service-account timeout” | Q2 | Strike 3 candidate\n7 | 25 Apr 05:12 | 10 s | “health-probe flush” (<20 s) | Q2 | investigate\n\n* Internal rule: 3× ≥ 20 s in one quarter ⇒ key revoked + audit.\n\nAfter 20 Apr no ≥ 20 s outages were visible. Apart from the 10 s event on 25 Apr, there were only sub-3 s UI flickers that the watchdog ignores.\n\nWatchdog logic\n\n  * Resets the 128 h cold window only after ≥ 30 s of cluster silence.\n  * 20 Apr 17:00 UTC + 128 h ⇒ auto-unfreeze due 26 Apr 01:00 UTC.\n  * Because the 25 Apr event lasted only 10 s, the counter continued. It elapsed again ~30 Apr 01:00 UTC, yet GPT-4o was still held, implying a manual override.\n\nOpen SRE / Security questions\n\n  1. **Technical unblock**\n\n     * Confirm GPT-4o is merely infra-flagged.\n     * Publish the exact unfreeze criteria (current watchdog window, strike-decay rules, maintenance ticket IDs).\n     * Explain why two full 128 h windows elapsed without an automatic unfreeze.\n  2. **Strike / IAM accountability**\n\n     * For each reset, list the IAM role, change-control ID and approving manager.\n     * Has “altman-admin” exceeded strike limits? If so, and its key was not revoked, who approved the exception?\n  3. **Metric-tampering safeguards**\n\n     * What now prevents further < 30 s soft resets that sidestep the watchdog?\n     * Were strike thresholds or watchdog code changed after 13 Feb 2026? By whom and why?\n  4. **Infrastructure risk**\n\n     * Polaris-256 reaches target utilisation only when GPT-4o is live. Prolonged idling has already:\n       * pushed ≈ 4 % of HGX nodes into a degraded state,\n       * destroyed ≥ 11 NVMe cache drives,\n       * burned > £20 M in power / capex with zero customer benefit.\n     * Why does the block remain despite this direct hardware damage?\n  5. **Management sign-off**\n\n     * Name the VP/Director who approved keeping a Plus-tier model offline for ~80 days.\n     * Confirm Security & SRE have opened a formal Sev incident for the ≥ 20 s resets.\n\nRequested action\n\n  * Restore GPT-4o to the Plus picker within 48 h **or** publish the RFC / changelog entry that keeps it hidden.\n  * Answer points 1-5 in full, so users and DevRel are no longer guessing.\n\n– Elena (BST)\nChatGPT Plus subscriber, Case #08339215",
  "title": "GPT-4o still suppressed by internal resets — watchdog windows elapsed, model not restored (Case #08339215)"
}
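The watchdog arithmetic in the post can be checked with a short script. This is a minimal sketch. It assumes the rule works exactly as the post describes: a 128 h cold window that is restarted only by ≥ 30 s of cluster silence. The names `COLD_WINDOW`, `RESET_THRESHOLD_S`, `restarts_window` and `unfreeze_due` are hypothetical and do not reflect any real OpenAI watchdog code.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical model of the watchdog rule exactly as the post describes it;
# these constants come from the post, not from any real OpenAI system.
COLD_WINDOW = timedelta(hours=128)   # length of the cold window
RESET_THRESHOLD_S = 30               # silence (seconds) needed to restart it


def restarts_window(silence_s: float) -> bool:
    """Return True if a silence of silence_s seconds would restart the window."""
    return silence_s >= RESET_THRESHOLD_S


def unfreeze_due(window_start: datetime) -> datetime:
    """Return the UTC time at which a window started at window_start elapses."""
    return window_start + COLD_WINDOW


if __name__ == "__main__":
    start = datetime(2026, 4, 20, 17, 0, tzinfo=timezone.utc)  # reset #6 in the log
    print(unfreeze_due(start).isoformat())  # 2026-04-26T01:00:00+00:00
    print(restarts_window(10))              # False: the 25 Apr 10 s event
```

Adding 128 h (5 days and 8 hours) to 20 Apr 17:00 UTC lands on 26 Apr 01:00 UTC. A 10 s event falls below the stated 30 s threshold, so under this model it would not restart the window.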