Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreie7fe23otyk2jwd5h2bvq2kekz33qlfj4f5qhhbpy6vkkalddseji",
    "uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3moxxawd5f3i2"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreigc64uxih7l4u5mnbfsahpeou3yxgtqmjvc7e2hqf6lmzcizhep2a"
    },
    "mimeType": "image/webp",
    "size": 86386
  },
  "path": "/ethanwritesai/91-pass-rate-gate-green-shipped-worst-regression-we-had-all-quarter-4dfn",
  "publishedAt": "2026-06-23T17:28:17.000Z",
  "site": "https://dev.to",
  "tags": [
    "testing",
    "ci",
    "ai",
    "llm"
  ],
  "textContent": "The gate was a fixed 90% threshold on an intent-classification eval. The change came in at 91%, cleared the bar, went out. A fixed pass-rate gate catches collapses, not drift. This was drift, and it walked right through.\n\n##  The number that lied: 91%\n\nThe eval had sat at 96-97% for weeks. A retrieval change knocked one slice (ambiguous refund requests) from 98% to 74%. That slice is 4% of traffic, so the aggregate only fell to 91%. Above 90, so the gate stayed green. The aggregate did exactly what aggregates do: it averaged a real failure into noise.\n\nThe users hitting that slice did not experience a 91%. They experienced a 74%.\n\n##  What an absolute threshold actually measures\n\nA static threshold answers one question: did the whole thing fall off a cliff. It says nothing about whether a specific slice quietly got worse while everything else held it up. If 96 of your slices are fine and one craters, a high floor hides the crater. You find out from a support ticket, not from CI.\n\n##  The fix: gate on the delta, per slice\n\nWe stopped gating on an absolute number and started gating against the last passing run. Two rules, both have to hold:\n\n  1. No single slice drops more than 3 points versus baseline.\n  2. The aggregate drops no more than 1.5 points versus baseline.\n\n\n\n\n    def gate(current, baseline):\n        failures = []\n        for slice_name, score in current.slices.items():\n            prev = baseline.slices.get(slice_name)\n            if prev is not None and prev - score > 3.0:\n                failures.append((slice_name, prev, score))\n        if baseline.aggregate - current.aggregate > 1.5:\n            failures.append((\"AGGREGATE\", baseline.aggregate, current.aggregate))\n        return failures  # empty == pass\n\n\nThe refund slice dropping 24 points would have failed rule 1 on the first run, regardless of where the aggregate landed.\n\n##  The part that bites you: baseline management\n\nDelta gating breaks the moment your baseline drifts down with you. If the baseline updates on every run, a 0.5-point slide each day passes every single time and you ratchet straight into a regression over two weeks. Slow drift is invisible to a gate that keeps moving its own goalposts.\n\nSo the baseline updates only when main is green, and any intentional drop needs a human to approve it before it becomes the new floor. The baseline is a record of verified-good, not a record of most-recent.\n\n##  What I'd check first\n\n  * Pull the variance across your last 5 green runs per slice. If one slice swings more than your delta threshold run-to-run, your threshold is noise, not signal.\n  * Take your smallest slice and ask: how far can it drop before the aggregate notices. If the answer is \"a lot,\" the aggregate is hiding it.\n  * Confirm your baseline only advances on green main with a human in the loop. If it updates every run, you are not gating on drift, you are following it down.\n\n",
  "title": "91% pass rate. Gate green. Shipped. Worst regression we had all quarter."
}