{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreihujpxzxkckdw3zlxd7r5b52f22jwnr37vdb3nwgsybj3xbxaszpy",
    "uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3mpadxdjqqj32"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreibnz3qhft27o2ka4hi53x6lqpunyr4owwayw6rcga6aicmmzc22w4"
    },
    "mimeType": "image/webp",
    "size": 78094
  },
  "path": "/saurav_bhattacharya/who-grades-the-grader-your-llm-judge-is-an-unvalidated-model-in-production-pfi",
  "publishedAt": "2026-06-27T01:02:32.000Z",
  "site": "https://dev.to",
  "tags": [
    "ai",
    "evaluation",
    "observability",
    "testing",
    "agent-eval",
    "AgentLens"
  ],
  "textContent": "Everybody's eval stack has the same load-bearing assumption nobody audits: that the model-as-judge is telling the truth.\n\nYou wrote deterministic checks for the easy stuff — schema valid, no PII, latency under budget. Then you hit the subjective stuff — \"is this answer actually helpful,\" \"did the agent follow the user's intent,\" \"is this summary faithful to the source\" — and you reached for an LLM judge, because what else are you going to do. Now a model grades your model. And here's the part that should keep you up at night: **you never validated the grader.** You're shipping or blocking releases based on a 0–10 score from a prompt you wrote in twenty minutes, and you have no idea if that score correlates with anything a human would agree with.\n\nI've watched teams trust a green judge dashboard for months, then discover the judge was handing out 8s to answers users hated. The judge wasn't broken in an obvious way. It was just _uncalibrated_ , and uncalibrated graders fail silently — which is the worst way to fail.\n\n##  The judge is a model in production, so treat it like one\n\nSay it plainly: your LLM judge is a non-deterministic model making consequential decisions in your release pipeline. That is the exact thing you spent the last year learning to distrust. Somehow when it's wearing a lab coat and called an \"evaluator,\" people grant it authority they'd never give the agent itself.\n\nThree ways judges quietly lie:\n\n  * **Position bias.** Swap the order of two candidate answers and the judge changes its winner. If A-vs-B and B-vs-A disagree more than ~10% of the time, your pairwise scores are partly coin flips.\n  * **Verbosity bias.** Longer, more confident answers score higher regardless of correctness. Your judge is grading prose, not truth.\n  * **Self-preference.** A judge from the same model family as the agent rates that family's outputs higher. 
If GPT grades GPT, you've got a conflict of interest with a number attached.\n\n\n\nNone of these show up on a dashboard that only plots the average score. They show up when you go looking — and most teams never look, because the judge produces a clean metric and clean metrics feel like ground truth.\n\n##  Calibrate the judge against humans, then keep checking\n\nThe fix isn't \"stop using LLM judges.\" They're genuinely useful and you can't human-label every run. The fix is to **treat the judge as a system under test with its own ground-truth set.** You need a labeled golden set — a few hundred examples scored by humans you trust — and you measure your judge's agreement with those humans. Use a chance-corrected statistic such as Cohen's kappa, not raw accuracy, because raw agreement is inflated when most answers are \"fine.\"\n\nHere's the calibration check I run before any judge is allowed to gate anything:\n\n\n\n    import { judge } from \"./llm-judge\";\n\n    type Labeled = { input: string; output: string; humanScore: number };\n\n    // Quadratic-weighted agreement: penalize big disagreements more than small ones.\n    // Note: a simple proxy on the 0..max scale; unlike Cohen's kappa it is not chance-corrected.\n    function weightedAgreement(human: number[], model: number[], max = 10): number {\n      let num = 0, den = 0;\n      for (let i = 0; i < human.length; i++) {\n        const w = ((human[i] - model[i]) ** 2) / (max ** 2);\n        num += 1 - w;\n        den += 1;\n      }\n      return num / den; // 1.0 = perfect, lower = drifting from humans\n    }\n\n    // Position-bias probe: judge must agree with itself when we flip the order.\n    async function positionBias(pairs: { a: string; b: string }[]): Promise<number> {\n      let flips = 0;\n      for (const { a, b } of pairs) {\n        // compare() names the winner by position: \"a\" = first argument, \"b\" = second.\n        const fwd = await judge.compare(a, b);   // \"a\" | \"b\"\n        const rev = await judge.compare(b, a);   // \"a\" | \"b\" (b is now first)\n        // Consistent means the same answer wins both times, so the positional label must swap.\n        const consistent = (fwd === \"a\" && rev === \"b\") || (fwd === \"b\" && rev === \"a\");\n        if (!consistent) flips++;\n      }\n      return flips 
/ pairs.length; // want this near 0\n    }\n\n    export async function certifyJudge(golden: Labeled[]) {\n      const scored = await Promise.all(\n        golden.map(async (g) => (await judge.score(g.input, g.output)).value),\n      );\n      const agreement = weightedAgreement(golden.map((g) => g.humanScore), scored);\n      // buildPairs is not shown here; it must return { a, b } answer pairs built from the golden set.\n      const bias = await positionBias(buildPairs(golden));\n\n      const passed = agreement >= 0.85 && bias <= 0.1;\n      if (!passed) {\n        throw new Error(\n          `Judge not certified: agreement=${agreement.toFixed(2)} (need >=0.85), ` +\n          `positionBias=${bias.toFixed(2)} (need <=0.10). Do not gate releases with this judge.`,\n        );\n      }\n      return { agreement, bias };\n    }\n\n\nThis runs in CI on a schedule, not just once. Judges drift the same way agents do — the provider updates the underlying model, your prompt template gets edited, your data distribution shifts — and a judge that agreed with humans in March can quietly diverge by June. If you only calibrated once at the start, you don't have a calibrated judge; you have a historical artifact.\n\n##  Calibration tells you _that_ it's wrong. Traces tell you _why._\n\nHere's where the two halves of the workflow lock together, because an agreement score of 0.6 is a smoke alarm, not a diagnosis.\n\nagent-eval is what runs the scoring and the gate — it's the layer holding your deterministic checks, your model-as-judge, the golden set, and the `certifyJudge` step above. It's the thing that tells you the judge agreement dropped below 0.85 and refuses to let the release through. That's the signal. 
But a failing number with no context is just an argument waiting to happen — \"the judge is wrong,\" \"no, the agent regressed,\" and nobody can settle it.\n\nThat's the job of AgentLens: it captures the full trace behind every score — the exact prompt the judge saw, the candidate output, the resolved rubric, the judge's raw completion _before_ you parsed a number out of it, and the agent's own tool-and-model steps that produced the answer in the first place. So when agent-eval flags that the judge handed a 9 to an answer humans scored 3, you open the AgentLens trace and _see_ it: the judge rewarded a confident, verbose response that never grounded its central claim. Now it's not a vibe. You can see the verbosity bias in the raw text, fix the rubric to demand citations, and re-certify.\n\nThat's the loop. **agent-eval scores and gates; AgentLens shows the trace so the score is debuggable.** Without the trace, a bad judge score is unfalsifiable — you can't tell a judge problem from an agent problem, so you end up trusting the number you should be interrogating. With it, every disagreement between judge and human becomes a concrete, inspectable artifact instead of a meeting.\n\n##  The uncomfortable takeaway\n\nIf you're using a model-as-judge and you can't state your judge's agreement with human labels as a number, you are not running evals. You're running a vibe check with extra steps and a false sense of rigor. The judge is the most trusted, least audited component in your entire pipeline — and \"the LLM said it was good\" is doing a lot of unexamined work in your release decisions.\n\nCertify the judge. Re-certify on a schedule. Keep the traces so every score can be challenged. A grader you haven't validated isn't measuring quality — it's laundering an opinion into a metric, and your green dashboard is the receipt.",
  "title": "Who Grades the Grader? Your LLM Judge Is an Unvalidated Model in Production"
}