{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreie7fe23otyk2jwd5h2bvq2kekz33qlfj4f5qhhbpy6vkkalddseji",
"uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3moxxawd5f3i2"
},
"coverImage": {
"$type": "blob",
"ref": {
"$link": "bafkreigc64uxih7l4u5mnbfsahpeou3yxgtqmjvc7e2hqf6lmzcizhep2a"
},
"mimeType": "image/webp",
"size": 86386
},
"path": "/ethanwritesai/91-pass-rate-gate-green-shipped-worst-regression-we-had-all-quarter-4dfn",
"publishedAt": "2026-06-23T17:28:17.000Z",
"site": "https://dev.to",
"tags": [
"testing",
"ci",
"ai",
"llm"
],
"textContent": "The gate was a fixed 90% threshold on an intent-classification eval. The change came in at 91%, cleared the bar, went out. A fixed pass-rate gate catches collapses, not drift. This was drift, and it walked right through.\n\n## The number that lied: 91%\n\nThe eval had sat at 96-97% for weeks. A retrieval change knocked one slice (ambiguous refund requests) from 98% to 74%. That slice is 4% of traffic, so the aggregate only fell to 91%. Above 90, so the gate stayed green. The aggregate did exactly what aggregates do: it averaged a real failure into noise.\n\nThe users hitting that slice did not experience a 91%. They experienced a 74%.\n\n## What an absolute threshold actually measures\n\nA static threshold answers one question: did the whole thing fall off a cliff. It says nothing about whether a specific slice quietly got worse while everything else held it up. If 96 of your slices are fine and one craters, a high floor hides the crater. You find out from a support ticket, not from CI.\n\n## The fix: gate on the delta, per slice\n\nWe stopped gating on an absolute number and started gating against the last passing run. Two rules, both have to hold:\n\n 1. No single slice drops more than 3 points versus baseline.\n 2. The aggregate drops no more than 1.5 points versus baseline.\n\n\n\n\n def gate(current, baseline):\n failures = []\n for slice_name, score in current.slices.items():\n prev = baseline.slices.get(slice_name)\n if prev is not None and prev - score > 3.0:\n failures.append((slice_name, prev, score))\n if baseline.aggregate - current.aggregate > 1.5:\n failures.append((\"AGGREGATE\", baseline.aggregate, current.aggregate))\n return failures # empty == pass\n\n\nThe refund slice dropping 24 points would have failed rule 1 on the first run, regardless of where the aggregate landed.\n\n## The part that bites you: baseline management\n\nDelta gating breaks the moment your baseline drifts down with you. If the baseline updates on every run, a 0.5-point slide each day passes every single time and you ratchet straight into a regression over two weeks. Slow drift is invisible to a gate that keeps moving its own goalposts.\n\nSo the baseline updates only when main is green, and any intentional drop needs a human to approve it before it becomes the new floor. The baseline is a record of verified-good, not a record of most-recent.\n\n## What I'd check first\n\n * Pull the variance across your last 5 green runs per slice. If one slice swings more than your delta threshold run-to-run, your threshold is noise, not signal.\n * Take your smallest slice and ask: how far can it drop before the aggregate notices. If the answer is \"a lot,\" the aggregate is hiding it.\n * Confirm your baseline only advances on green main with a human in the loop. If it updates every run, you are not gating on drift, you are following it down.\n\n",
"title": "91% pass rate. Gate green. Shipped. Worst regression we had all quarter."
}