
{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreigw6zatzjzgbnbbsmsrame5a7zmyjlkvwq3c2lks7mwww53calsiy",
    "uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3moiu7ofzlyo2"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreiatd4r25ffxsqeucmbztrmqkfslpuojleqk2owzjznqsvaf6u2np4"
    },
    "mimeType": "image/webp",
    "size": 143130
  },
  "path": "/mrviduus/ai-evals-part-4-llm-as-judge-done-right-31eg",
  "publishedAt": "2026-06-17T17:28:22.000Z",
  "site": "https://dev.to",
  "tags": [
    "ai",
    "evals",
    "llm",
    "dotnet",
    "what evals are",
    "error analysis",
    "golden datasets",
    "Microsoft.Extensions.AI.Evaluation",
    "textstack.app",
    "github.com/mrviduus/textstack"
  ],
  "textContent": "_Part 4 of a series on building production AI on .NET. We've covered what evals are, error analysis, and golden datasets. Now: how do you turn a paragraph into a number you can trust?_\n\nYou have a golden dataset and your feature's real output for each case. Now you need a score. But you can't `assert ==` two paragraphs — there's no single right answer, and exact-match comparison is meaningless for prose. String-similarity metrics (BLEU, ROUGE) don't help either; they reward overlapping words, not correct meaning.\n\nThe pragmatic answer the field has converged on is **LLM-as-judge**: use a second, capable model to read the reference and the actual output and score it against a rubric. It's powerful, it scales, and — handled carelessly — it will hand you confident, biased numbers that feel rigorous and aren't. This post is about doing it right.\n\n##  The basic shape\n\nA judge takes the rubric and an _evidence_ block (the inputs, the reference answer, and the model's actual output), and returns a structured verdict. In TextStack the judge is one feature-agnostic component built on Microsoft.Extensions.AI.Evaluation — Microsoft's official .NET evaluation library — implemented as a custom `IEvaluator`. The core is a single judge call asking for strict JSON:\n\n\n\n    var system =\n        \"You are a strict, fair evaluator of an AI feature's output. \" +\n        \"Score each of three dimensions on an integer scale 1-5 (5 = excellent, 1 = poor):\\n\" +\n        $\"- d1 = {rubric.Dim1}\\n- d2 = {rubric.Dim2}\\n- d3 = {rubric.Dim3}\\n\" +\n        \"Return ONLY strict JSON: {\\\"d1\\\": int, \\\"d2\\\": int, \\\"d3\\\": int, \\\"rationale\\\": \\\"...\\\"}\";\n\n\nThe rubric is a **parameter, not a hardcode** — three named axes passed in per feature. 
That's what lets one judge score Explain, Translate, distractors, and book metadata, each on the dimensions its own error analysis surfaced (Explain → accuracy / conciseness / usefulness; Translate → accuracy / fluency / register; and so on). One judge, many rubrics.\n\n##  Three things that separate a toy judge from a production one\n\n**Parse defensively.** Judges wrap their JSON in prose or code fences no matter how firmly you forbid it. Don't trust the whole string. Take the span from the first `{` to the last `}`:\n\n\n\n    var start = raw.IndexOf('{');\n    var end = raw.LastIndexOf('}');\n    if (start < 0 || end <= start)\n        return new JudgeScore(0, 0, 0, \"unparseable: no JSON object\");\n\n\n**Fail to a number, not an exception.** An unparseable or failed judge call returns a zero score with the reason attached, which drags the run's mean _down_ instead of crashing it. A judge that silently throws is worse than one that scores zero — the zero is a visible signal you can investigate.\n\n**Use a dedicated, stronger judge — and route it like everything else.** The model that _judges_ should be more capable than the models that _generate_. TextStack generates features on small, cheap models but judges with a `gpt-4.1`-class model. And the judge call carries its own `eval.judge` feature tag and flows through the same gateway as production traffic, so it's traced and cost-accounted like any other call. Evaluating is itself an AI feature; treat it like one.\n\n##  The biases that quietly wreck your judge\n\nThis is the part that separates people who _use_ an LLM judge from people who can _trust_ one. A judge is a language model, and it brings model-shaped biases to grading. Ignore them and your scores are precise and wrong.\n\n**Position bias.** In pairwise comparisons (\"is A or B better?\"), judges favour whichever answer appears first (sometimes second) regardless of content. 
_Mitigation:_ run each comparison both ways and average, or randomise order and watch the swap rate.\n\n**Verbosity bias.** Judges reliably prefer longer, more elaborate answers even when the extra words add nothing — actively harmful for a feature like Explain whose rubric _demands_ conciseness. _Mitigation:_ name length explicitly in the rubric and watch for score creeping up with token count.\n\n**Self-preference bias.** A judge scores text from its own model family higher. I'll be concrete about where TextStack sits here: features generated on a local model (distractors, book metadata) are judged cross-family by OpenAI — good, that's independent. But Explain and Translate are generated _and_ judged within the OpenAI family (different sizes — `gpt-4.1-nano` to generate, `gpt-4.1` to judge — but the same lineage), so some self-preference is still in play. The honest read: the absolute number is treated as soft; the _deltas between runs_ are what we trust. A fully independent second judge is on the roadmap.\n\n**Sycophancy and scale compression.** Judges drift toward agreeable, middling scores, clustering around 3–4 on a 1–5 scale and flattening your signal. _Mitigation:_ anchor each dimension with a concrete description (not just a one-word label), always give the judge the reference answer as a yardstick, and consider a coarser scale if the judge can't use the full range reliably.\n\n##  Your judge needs its own eval\n\nHere's the step almost everyone skips: **validate the judge against humans.** You wouldn't ship a feature on an unvalidated model, and a judge _is_ a model — so prove it agrees with human judgement before you trust its scores.\n\nHand-label a sample of outputs yourself, then measure agreement between you and the judge. The right metric is **inter-rater agreement** — Cohen's κ (kappa), which corrects for the agreement you'd get by chance — not raw percent-agreement, which flatters you when scores cluster. 
A judge around κ ≥ 0.6 against human labels is usable; near zero means it's rolling dice and your whole pipeline is theatre. Re-check it whenever you change the judge model or the rubric.\n\nThere's a design subtlety worth applying here: treat the _judge prompt itself_ as something you iterate on against a labelled split. Tune the judge prompt on one slice of human-labelled cases, validate κ on a held-out slice — exactly the train/test discipline from the last post, applied one level up. The judge is software; it deserves the same rigour as the feature it grades.\n\nThis closes a loop people miss. The golden set evaluates the feature; a human-labelled slice evaluates the judge. Skip the second and you've just moved your trust problem one level up and hidden it from yourself.\n\n##  The pitfalls\n\n  * **Trusting an unvalidated judge** — measure κ against human labels or it's theatre.\n  * **Same model generating and judging** — self-preference inflates the score; prefer a different (ideally cross-family) judge.\n  * **A weak judge model** — the judge should be _more_ capable than the generator, not the same one.\n  * **Ignoring position/verbosity bias** — randomise order, penalise padding, anchor the rubric.\n  * **One-word rubric axes** — \"accuracy\" alone means different things to the model each run; describe it concretely.\n  * **Throwing on a bad verdict** — score it zero and surface it; don't let one parse failure kill the run.\n\n\n\n##  The takeaway\n\nLLM-as-judge is the only practical way to score prose at scale, but a judge is a model with a model's biases — so build it like production code (defensive parsing, a dedicated stronger model, routed and traced) and validate it like a model (human labels, Cohen's κ, a tuned-and-tested judge prompt). Do that and your scores mean something. 
Skip it and you've automated the production of confident nonsense.\n\nNext, and last in the series: **from a number to a gate** — wiring evals into CI and online monitoring so quality regressions turn the build red, on Microsoft.Extensions.AI.Evaluation, without bankrupting your pipeline.\n\n_TextStack is a reader that helps you finish the dense technical book you keep quitting — it builds every modern AI primitive (observability, evals, RAG, agents) as a real production feature on .NET. Try it at textstack.app, or read the code at github.com/mrviduus/textstack._",
  "title": "AI Evals, Part 4: LLM-as-Judge, Done Right"
}