AI Evals, Part 4: LLM-as-Judge, Done Right

DEV Community [Unofficial] June 17, 2026

Part 4 of a series on building production AI on .NET. We've covered what evals are, error analysis, and golden datasets. Now: how do you turn a paragraph into a number you can trust?

You have a golden dataset and your feature's real output for each case. Now you need a score. But you can't assert that two paragraphs are equal: there's no single right answer, and exact-match comparison is meaningless for prose. String-similarity metrics (BLEU, ROUGE) don't help either; they reward overlapping words, not correct meaning.

The pragmatic answer the field has converged on is LLM-as-judge: use a second, capable model to read the reference and the actual output and score it against a rubric. It's powerful, it scales, and, handled carelessly, it will hand you confident, biased numbers that feel rigorous and aren't. This post is about doing it right.

The basic shape

A judge takes the rubric and an evidence block (the inputs, the reference answer, and the model's actual output), and returns a structured verdict. In TextStack the judge is one feature-agnostic component built on Microsoft.Extensions.AI.Evaluation — Microsoft's official .NET evaluation library — implemented as a custom IEvaluator. The core is a single judge call asking for strict JSON:

var system =
    "You are a strict, fair evaluator of an AI feature's output. " +
    "Score each of three dimensions on an integer scale 1-5 (5 = excellent, 1 = poor):\n" +
    $"- d1 = {rubric.Dim1}\n- d2 = {rubric.Dim2}\n- d3 = {rubric.Dim3}\n" +
    "Return ONLY strict JSON: {\"d1\": int, \"d2\": int, \"d3\": int, \"rationale\": \"...\"}";

The rubric is a parameter, not a hardcode — three named axes passed in per feature. That's what lets one judge score Explain, Translate, distractors, and book metadata, each on the dimensions its own error analysis surfaced (Explain → accuracy / conciseness / usefulness; Translate → accuracy / fluency / register; and so on). One judge, many rubrics.
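To make "the rubric is a parameter" concrete, here is a minimal sketch. The `Rubric` record, the `JudgePrompt` and `Rubrics` classes, and the axis descriptions are assumptions for illustration, not TextStack's actual types or wording; only the prompt text and the axis names come from this post.

```csharp
using System;

// Hypothetical rubric shape: three named axes, each carrying a short description.
public sealed record Rubric(string Dim1, string Dim2, string Dim3);

public static class JudgePrompt
{
    // Builds the judge's system prompt from whichever rubric the feature passes in.
    public static string BuildSystem(Rubric rubric) =>
        "You are a strict, fair evaluator of an AI feature's output. " +
        "Score each of three dimensions on an integer scale 1-5 (5 = excellent, 1 = poor):\n" +
        $"- d1 = {rubric.Dim1}\n- d2 = {rubric.Dim2}\n- d3 = {rubric.Dim3}\n" +
        "Return ONLY strict JSON: {\"d1\": int, \"d2\": int, \"d3\": int, \"rationale\": \"...\"}";
}

public static class Rubrics
{
    // Axis names follow the post; the descriptions after each colon are illustrative.
    public static readonly Rubric Explain = new(
        "accuracy: the explanation states nothing the source passage contradicts",
        "conciseness: no sentence could be removed without losing meaning",
        "usefulness: a reader stuck on the passage could continue after reading it");

    public static readonly Rubric Translate = new(
        "accuracy: the meaning of the source is preserved",
        "fluency: the output reads as natural target-language prose",
        "register: the tone and formality match the source");
}
```

The same `BuildSystem` call serves every feature; only the `Rubric` argument changes, e.g. `JudgePrompt.BuildSystem(Rubrics.Translate)`.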

Three things that separate a toy judge from a production one

Parse defensively. Judges wrap their JSON in prose or code fences no matter how firmly you forbid it. Don't trust the whole string. Take the span from the first { to the last }, which is the outermost object, and parse only that:

var start = raw.IndexOf('{');
var end = raw.LastIndexOf('}');
if (start < 0 || end <= start)
    return new JudgeScore(0, 0, 0, "unparseable: no JSON object");

Fail to a number, not an exception. An unparseable or failed judge call returns a zero score with the reason attached, which drags the run's mean down instead of crashing it. A judge that silently throws is worse than one that scores zero — the zero is a visible signal you can investigate.
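The two points above, defensive extraction and failing to a number, can be combined in one parsing step. This is a sketch under assumptions: the `JudgeScore` field names and the `JudgeParser` class are hypothetical (the post only shows the constructor call), and the range check on 1-5 is an extra guard added here. The JSON calls are from System.Text.Json.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical verdict type; field names are assumed for this sketch.
public sealed record JudgeScore(int D1, int D2, int D3, string Rationale);

public static class JudgeParser
{
    private static JudgeScore Fail(string reason) =>
        new JudgeScore(0, 0, 0, "unparseable: " + reason);

    private static bool InRange(int v) => v >= 1 && v <= 5;

    // Never throws for bad judge output: every failure becomes a zero score with a reason.
    public static JudgeScore Parse(string raw)
    {
        if (string.IsNullOrWhiteSpace(raw)) return Fail("empty response");

        // Outermost object: first '{' through last '}', ignoring prose or fences around it.
        var start = raw.IndexOf('{');
        var end = raw.LastIndexOf('}');
        if (start < 0 || end <= start) return Fail("no JSON object");

        try
        {
            using var doc = JsonDocument.Parse(raw.Substring(start, end - start + 1));
            var root = doc.RootElement;
            int d1 = root.GetProperty("d1").GetInt32();
            int d2 = root.GetProperty("d2").GetInt32();
            int d3 = root.GetProperty("d3").GetInt32();
            if (!InRange(d1) || !InRange(d2) || !InRange(d3))
                return Fail("score outside 1-5");

            // The rationale is optional here; a missing one does not fail the verdict.
            var rationale =
                root.TryGetProperty("rationale", out var r) && r.ValueKind == JsonValueKind.String
                    ? r.GetString() ?? ""
                    : "";
            return new JudgeScore(d1, d2, d3, rationale);
        }
        catch (Exception ex) when (ex is JsonException or KeyNotFoundException
                                       or InvalidOperationException or FormatException)
        {
            // Malformed JSON, missing keys, or non-integer scores all land here.
            return Fail(ex.Message);
        }
    }
}
```

Because a failed parse scores zero on a scale whose real minimum is 1, failures remain distinguishable from genuinely poor outputs when you inspect a run.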

Use a dedicated, stronger judge, and route it like everything else. The model that judges should be more capable than the models that generate. TextStack generates features on small, cheap models but judges with a gpt-4.1-class model. The judge call carries its own feature tag, eval.judge, and flows through the same gateway as production traffic, so it's traced and cost-accounted like any other call. Evaluating is itself an AI feature; treat it like one.

The biases that quietly wreck your judge

This is the part that separates people who use an LLM judge from people who can trust one. A judge is a language model, and it brings model-shaped biases to grading. Ignore them and your scores are precise and wrong.

Position bias. In pairwise comparisons ("is A or B better?"), judges favour whichever answer appears first (sometimes second) regardless of content. Mitigation: run each comparison both ways and average, or randomise order and watch the swap rate, meaning how often the verdict flips when the two answers trade places.
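A minimal sketch of the both-ways check, assuming the pairwise judge is available as a delegate that returns which slot it preferred. The `Slot` and `Verdict` enums and the `PairwiseJudge` class are hypothetical names; the delegate stands in for the real judge call. A verdict only counts when it survives the swap; anything else is reported as inconsistent, and the share of inconsistent pairs is one way to measure position bias.

```csharp
using System;
using System.Collections.Generic;

public enum Slot { First, Second }
public enum Verdict { A, B, Inconsistent }

public static class PairwiseJudge
{
    // judge(first, second) is a stand-in for the real judge call (an assumption of this sketch).
    public static Verdict CompareBothWays(string a, string b, Func<string, string, Slot> judge)
    {
        bool aWinsForward = judge(a, b) == Slot.First;   // A shown first
        bool aWinsSwapped = judge(b, a) == Slot.Second;  // A shown second

        if (aWinsForward && aWinsSwapped) return Verdict.A;
        if (!aWinsForward && !aWinsSwapped) return Verdict.B;
        return Verdict.Inconsistent; // the winner changed with position, not content
    }

    // Share of pairs whose verdict flipped when the order was swapped.
    public static double FlipRate(
        IEnumerable<(string A, string B)> pairs, Func<string, string, Slot> judge)
    {
        int total = 0, flipped = 0;
        foreach (var (a, b) in pairs)
        {
            total++;
            if (CompareBothWays(a, b, judge) == Verdict.Inconsistent) flipped++;
        }
        return total == 0 ? 0.0 : (double)flipped / total;
    }
}
```

If you need a single number per pair rather than a verdict, counting an inconsistent result as a tie is one way to "average" the two orderings.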

Verbosity bias. Judges reliably prefer longer, more elaborate answers even when the extra words add nothing — actively harmful for a feature like Explain whose rubric demands conciseness. Mitigation: name length explicitly in the rubric and watch for score creeping up with token count.
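One way to "watch for score creeping up with token count" is to correlate output length with judge score across a run. The sketch below computes a plain Pearson correlation; the `VerbosityCheck` class is a hypothetical name, and how you measure length (tokens, words, characters) is up to the caller. A strongly positive value is a prompt to investigate, not proof of bias, since longer answers can also be genuinely better.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class VerbosityCheck
{
    // Pearson correlation between two equally long series, e.g. lengths and judge scores.
    public static double Pearson(IReadOnlyList<double> x, IReadOnlyList<double> y)
    {
        if (x.Count != y.Count || x.Count < 2)
            throw new ArgumentException("Need two series of equal length with at least 2 points.");

        double meanX = x.Average(), meanY = y.Average();
        double cov = 0, varX = 0, varY = 0;
        for (int i = 0; i < x.Count; i++)
        {
            double dx = x[i] - meanX, dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }

        // If either series is constant the correlation is undefined; report 0 in this sketch.
        if (varX == 0 || varY == 0) return 0.0;
        return cov / Math.Sqrt(varX * varY);
    }
}
```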

Self-preference bias. A judge scores text from its own model family higher. I'll be concrete about where TextStack sits here: features generated on a local model (distractors, book metadata) are judged cross-family by OpenAI — good, that's independent. But Explain and Translate are generated and judged within the OpenAI family (different sizes — gpt-4.1-nano to generate, gpt-4.1 to judge — but the same lineage), so some self-preference is still in play. The honest read: the absolute number is treated as soft; the deltas between runs are what we trust. A fully independent second judge is on the roadmap.

Sycophancy and scale compression. Judges drift toward agreeable, middling scores, clustering around 3–4 on a 1–5 scale and flattening your signal. Mitigation: anchor each dimension with a concrete description (not just a one-word label), always give the judge the reference answer as a yardstick, and consider a coarser scale if the judge can't use the full range reliably.
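Scale compression is easy to check on your own runs before changing the rubric. A small sketch, assuming you have the judge's integer scores for one dimension in a list; the `ScaleUsage` class and its default 3-4 band are illustrative choices taken from the range named above.

```csharp
using System;
using System.Collections.Generic;

public static class ScaleUsage
{
    // Fraction of scores that fall inside the middle band (3-4 by default).
    public static double MiddleShare(IEnumerable<int> scores, int low = 3, int high = 4)
    {
        int total = 0, middle = 0;
        foreach (var s in scores)
        {
            total++;
            if (s >= low && s <= high) middle++;
        }
        if (total == 0) throw new ArgumentException("No scores to inspect.");
        return (double)middle / total;
    }

    // How many distinct values the judge actually used on this dimension.
    public static int DistinctValuesUsed(IEnumerable<int> scores) =>
        new HashSet<int>(scores).Count;
}
```

A middle share close to 1 together with only two distinct values in use suggests the judge is effectively working on a two-point scale, which is an argument for the coarser scale mentioned above.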

Your judge needs its own eval

Here's the step almost everyone skips: validate the judge against humans. You wouldn't ship a feature on an unvalidated model, and a judge is a model — so prove it agrees with human judgement before you trust its scores.

Hand-label a sample of outputs yourself, then measure agreement between you and the judge. The right metric is inter-rater agreement: Cohen's κ (kappa), which corrects for the agreement you'd get by chance, rather than raw percent-agreement, which flatters you when scores cluster. As a rule of thumb, a judge at κ ≥ 0.6 against human labels is usable; near zero means it's rolling dice and your whole pipeline is theatre. Re-check it whenever you change the judge model or the rubric.
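For reference, Cohen's κ is (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance given how often each rater uses each score. A minimal unweighted sketch follows; the `Agreement` class is a hypothetical name, and the two lists are assumed to hold the human's and the judge's scores for the same cases in the same order.

```csharp
using System;
using System.Collections.Generic;

public static class Agreement
{
    // Unweighted Cohen's kappa for two raters over the same items.
    public static double CohensKappa(IReadOnlyList<int> human, IReadOnlyList<int> judge)
    {
        if (human.Count != judge.Count || human.Count == 0)
            throw new ArgumentException("Need two non-empty label lists of equal length.");

        int n = human.Count;
        int agreements = 0;
        var humanCounts = new Dictionary<int, int>();
        var judgeCounts = new Dictionary<int, int>();

        for (int i = 0; i < n; i++)
        {
            if (human[i] == judge[i]) agreements++;
            humanCounts[human[i]] = humanCounts.GetValueOrDefault(human[i]) + 1;
            judgeCounts[judge[i]] = judgeCounts.GetValueOrDefault(judge[i]) + 1;
        }

        double observed = (double)agreements / n;

        // Chance agreement: sum over categories of P(human = c) * P(judge = c).
        double expected = 0;
        foreach (var (category, count) in humanCounts)
            expected += ((double)count / n) * ((double)judgeCounts.GetValueOrDefault(category) / n);

        // Both raters used one identical category throughout: treat as full agreement here.
        if (expected >= 1.0) return 1.0;
        return (observed - expected) / (1 - expected);
    }
}
```

This shows why percent-agreement flatters: if a human gives nine 4s and one 3 while the judge gives ten 4s, raw agreement is 90%, but chance agreement is also 90%, so κ is 0. The judge has demonstrated nothing beyond always answering 4.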

There's a design subtlety worth applying here: treat the judge prompt itself as something you iterate on against a labelled split. Tune the judge prompt on one slice of human-labelled cases, validate κ on a held-out slice — exactly the train/test discipline from the last post, applied one level up. The judge is software; it deserves the same rigour as the feature it grades.
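The tune/held-out discipline can be sketched as a seeded split of the human-labelled cases. The `LabelledSplit` class, the 50/50 default, and the seed are assumptions for illustration; the one property that matters is that no case used to tune the judge prompt also appears in the slice κ is reported on.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LabelledSplit
{
    // Shuffles with a fixed seed, then cuts the list into a tuning slice and a held-out slice.
    public static (List<T> Tune, List<T> HeldOut) Split<T>(
        IEnumerable<T> cases, double tuneFraction = 0.5, int seed = 42)
    {
        if (tuneFraction <= 0 || tuneFraction >= 1)
            throw new ArgumentOutOfRangeException(nameof(tuneFraction));

        var items = cases.ToList();
        var rng = new Random(seed);

        // Fisher-Yates shuffle.
        for (int i = items.Count - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1);
            (items[i], items[j]) = (items[j], items[i]);
        }

        int cut = (int)Math.Round(items.Count * tuneFraction);
        return (items.Take(cut).ToList(), items.Skip(cut).ToList());
    }
}
```

Persisting the resulting case IDs alongside the labels is safer than relying on the seed alone, since it keeps the held-out slice stable even if the labelled set grows or the runtime changes.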

This closes a loop people miss. The golden set evaluates the feature; a human-labelled slice evaluates the judge. Skip the second and you've just moved your trust problem one level up and hidden it from yourself.

The pitfalls

  • Trusting an unvalidated judge — measure κ against human labels or it's theatre.
  • Same model generating and judging — self-preference inflates the score; prefer a different (ideally cross-family) judge.
  • A weak judge model — the judge should be more capable than the generator, not the same one.
  • Ignoring position/verbosity bias — randomise order, penalise padding, anchor the rubric.
  • One-word rubric axes — "accuracy" alone means different things to the model each run; describe it concretely.
  • Throwing on a bad verdict — score it zero and surface it; don't let one parse failure kill the run.

The takeaway

LLM-as-judge is the only practical way to score prose at scale, but a judge is a model with a model's biases — so build it like production code (defensive parsing, a dedicated stronger model, routed and traced) and validate it like a model (human labels, Cohen's κ, a tuned-and-tested judge prompt). Do that and your scores mean something. Skip it and you've automated the production of confident nonsense.

Next, and last in the series: from a number to a gate — wiring evals into CI and online monitoring so quality regressions turn the build red, on Microsoft.Extensions.AI.Evaluation, without bankrupting your pipeline.

TextStack is a reader that helps you finish the dense technical book you keep quitting — it builds every modern AI primitive (observability, evals, RAG, agents) as a real production feature on .NET. Try it at textstack.app, or read the code at github.com/mrviduus/textstack.
