{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreihujpxzxkckdw3zlxd7r5b52f22jwnr37vdb3nwgsybj3xbxaszpy",
"uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3mpadxdjqqj32"
},
"coverImage": {
"$type": "blob",
"ref": {
"$link": "bafkreibnz3qhft27o2ka4hi53x6lqpunyr4owwayw6rcga6aicmmzc22w4"
},
"mimeType": "image/webp",
"size": 78094
},
"path": "/saurav_bhattacharya/who-grades-the-grader-your-llm-judge-is-an-unvalidated-model-in-production-pfi",
"publishedAt": "2026-06-27T01:02:32.000Z",
"site": "https://dev.to",
"tags": [
"ai",
"evaluation",
"observability",
"testing",
"agent-eval",
"AgentLens"
],
"textContent": "Everybody's eval stack has the same load-bearing assumption nobody audits: that the model-as-judge is telling the truth.\n\nYou wrote deterministic checks for the easy stuff — schema valid, no PII, latency under budget. Then you hit the subjective stuff — \"is this answer actually helpful,\" \"did the agent follow the user's intent,\" \"is this summary faithful to the source\" — and you reached for an LLM judge, because what else are you going to do. Now a model grades your model. And here's the part that should keep you up at night: **you never validated the grader.** You're shipping or blocking releases based on a 0–10 score from a prompt you wrote in twenty minutes, and you have no idea if that score correlates with anything a human would agree with.\n\nI've watched teams trust a green judge dashboard for months, then discover the judge was handing out 8s to answers users hated. The judge wasn't broken in an obvious way. It was just _uncalibrated_ , and uncalibrated graders fail silently — which is the worst way to fail.\n\n## The judge is a model in production, so treat it like one\n\nSay it plainly: your LLM judge is a non-deterministic model making consequential decisions in your release pipeline. That is the exact thing you spent the last year learning to distrust. Somehow when it's wearing a lab coat and called an \"evaluator,\" people grant it authority they'd never give the agent itself.\n\nThree ways judges quietly lie:\n\n * **Position bias.** Swap the order of two candidate answers and the judge changes its winner. If A-vs-B and B-vs-A disagree more than ~10% of the time, your pairwise scores are partly coin flips.\n * **Verbosity bias.** Longer, more confident answers score higher regardless of correctness. Your judge is grading prose, not truth.\n * **Self-preference.** A judge from the same model family as the agent rates that family's outputs higher. 
If GPT grades GPT, you've got a conflict of interest with a number attached.\n\n\n\nNone of these show up on a dashboard that only plots the average score. They show up when you go looking, and most teams never look, because the judge produces a clean metric and clean metrics feel like ground truth.\n\n## Calibrate the judge against humans, then keep checking\n\nThe fix isn't \"stop using LLM judges.\" They're genuinely useful and you can't human-label every run. The fix is to **treat the judge as a system under test with its own ground-truth set.** You need a labeled golden set (a few hundred examples scored by humans you trust), and you measure your judge's agreement with those humans. Use Cohen's kappa (quadratic-weighted, since the scores are ordinal), not raw accuracy, because raw agreement is inflated when most answers are \"fine.\"\n\nHere's the calibration check I run before any judge is allowed to gate anything:\n\n\n\n import { judge } from \"./llm-judge\";\n\n type Labeled = { input: string; output: string; humanScore: number };\n\n // Quadratic-weighted Cohen's kappa: chance-corrected, and big disagreements cost more than small ones.\n // Assumes scores sit on a 0..max scale; values are rounded and clamped into integer buckets.\n function weightedAgreement(human: number[], model: number[], max = 10): number {\n const n = human.length;\n const bucket = (x: number) => Math.min(max, Math.max(0, Math.round(x)));\n const hHist: number[] = new Array(max + 1).fill(0);\n const mHist: number[] = new Array(max + 1).fill(0);\n let observed = 0; // summed squared disagreement actually seen\n for (let i = 0; i < n; i++) {\n const h = bucket(human[i]), m = bucket(model[i]);\n hHist[h]++;\n mHist[m]++;\n observed += (h - m) ** 2;\n }\n let expected = 0; // summed squared disagreement expected if the two raters were independent\n for (let a = 0; a <= max; a++) {\n for (let b = 0; b <= max; b++) expected += (hHist[a] * mHist[b] * (a - b) ** 2) / n;\n }\n return expected === 0 ? 1 : 1 - observed / expected; // 1.0 = perfect, 0 = chance-level agreement\n }\n\n // Position-bias probe: judge must agree with itself when we flip the order.\n async function positionBias(pairs: { a: string; b: string }[]): Promise<number> {\n let flips = 0;\n for (const { a, b } of pairs) {\n const fwd = await judge.compare(a, b); // winning slot: \"a\" = first argument, \"b\" = second\n const rev = await judge.compare(b, a); // same labels, but b now sits in the first slot\n const consistent = (fwd === \"a\" && rev === \"b\") || (fwd === \"b\" && rev === \"a\");\n if (!consistent) flips++;\n }\n return flips / pairs.length; // want this near 0\n }\n\n // buildPairs is not shown here; it turns the golden set into { a, b } candidate pairs.\n export async function certifyJudge(golden: Labeled[]) {\n const scored 
= await Promise.all(\n golden.map(async (g) => (await judge.score(g.input, g.output)).value),\n );\n const agreement = weightedAgreement(golden.map((g) => g.humanScore), scored);\n const bias = await positionBias(buildPairs(golden));\n\n const passed = agreement >= 0.85 && bias <= 0.1;\n if (!passed) {\n throw new Error(\n `Judge not certified: agreement=${agreement.toFixed(2)} (need >=0.85), ` +\n `positionBias=${bias.toFixed(2)} (need <=0.10). Do not gate releases with this judge.`,\n );\n }\n return { agreement, bias };\n }\n\n\nThis runs in CI on a schedule, not just once. Judges drift the same way agents do — provider updates the underlying model, your prompt template gets edited, your data distribution shifts — and a judge that agreed with humans in March can quietly diverge by June. If you only calibrated once at the start, you don't have a calibrated judge; you have a historical artifact.\n\n## Calibration tells you _that_ it's wrong. Traces tell you _why._\n\nHere's where the two halves of the workflow lock together, because a kappa of 0.6 is a smoke alarm, not a diagnosis.\n\nagent-eval is what runs the scoring and the gate — it's the layer holding your deterministic checks, your model-as-judge, the golden set, and the `certifyJudge` step above. It's the thing that tells you the judge agreement dropped below 0.85 and refuses to let the release through. That's the signal. But a failing number with no context is just an argument waiting to happen — \"the judge is wrong,\" \"no, the agent regressed,\" and nobody can settle it.\n\nThat's the job of AgentLens: it captures the full trace behind every score — the exact prompt the judge saw, the candidate output, the resolved rubric, the judge's raw completion _before_ you parsed a number out of it, and the agent's own tool-and-model steps that produced the answer in the first place. 
So when agent-eval flags that the judge handed a 9 to an answer humans scored 3, you open the AgentLens trace and _see_ it: the judge rewarded a confident, verbose response that never grounded its central claim. Now it's not a vibe. You can see the verbosity bias in the raw text, fix the rubric to demand citations, and re-certify.\n\nThat's the loop. **agent-eval scores and gates; AgentLens shows the trace so the score is debuggable.** Without the trace, a bad judge score is unfalsifiable — you can't tell a judge problem from an agent problem, so you end up trusting the number you should be interrogating. With it, every disagreement between judge and human becomes a concrete, inspectable artifact instead of a meeting.\n\n## The uncomfortable takeaway\n\nIf you're using a model-as-judge and you can't state your judge's agreement with human labels as a number, you are not running evals. You're running a vibe check with extra steps and a false sense of rigor. The judge is the most trusted, least audited component in your entire pipeline — and \"the LLM said it was good\" is doing a lot of unexamined work in your release decisions.\n\nCertify the judge. Re-certify on a schedule. Keep the traces so every score can be challenged. A grader you haven't validated isn't measuring quality — it's laundering an opinion into a metric, and your green dashboard is the receipt.",
"title": "Who Grades the Grader? Your LLM Judge Is an Unvalidated Model in Production"
}