
What's your method for benchmarking?

Hugging Face Forums [Unofficial] June 25, 2026

Uh… there are a lot of ways to do this, so I’d split it up a bit. My first suggestion would be: measure the capability you actually care about first.


Short version:

Keep a held-out test set that was not used for training, run both the base model and your fine-tuned model on it under the same prompts/settings, and compare them with metrics that match your actual task.

Public benchmarks and leaderboards are useful, but I would treat them as a second layer. They are good for orientation, not always proof that the fine-tune worked for your use case.

A practical first workflow

  1. Define the target behavior
    • What was the fine-tune supposed to improve?
  2. Create a held-out eval set
    • Use examples that were not in the training data.
  3. Run the base model
    • Save the outputs.
  4. Run the fine-tuned model
    • Same examples, same prompt template, same decoding settings.
  5. Compare with task-appropriate metrics
    • Accuracy/F1 for classification, schema checks for structured output, rubric/pairwise eval for open-ended chat, unit tests for code, etc.
  6. Inspect individual failures
    • Do not only look at the average score.
  7. Add public benchmarks if they match the goal
    • Tools like Lighteval or lm-evaluation-harness can help, but only after you know what you want to measure.
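Steps 2 to 5 can be sketched in a few lines of plain Python. This is only a minimal sketch: `base_generate` and `tuned_generate` are hypothetical stand-ins for however you call your two models, and the prompt template, toy sentiment task, and example data are made up for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins: replace these with real calls to your base and fine-tuned models.
def base_generate(prompt: str) -> str:
    return "negative"

def tuned_generate(prompt: str) -> str:
    return "positive" if "love" in prompt else "negative"

# One shared template so both models see exactly the same prompt.
PROMPT_TEMPLATE = "Classify the sentiment as positive or negative.\nText: {text}\nLabel:"

def run_eval(generate: Callable[[str], str], examples: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Run one model over the held-out examples and keep per-example results."""
    rows = []
    for ex in examples:
        prompt = PROMPT_TEMPLATE.format(text=ex["text"])
        output = generate(prompt).strip().lower()
        rows.append({"id": ex["id"], "output": output, "correct": output == ex["label"]})
    return rows

def accuracy(rows: List[Dict[str, object]]) -> float:
    return sum(bool(r["correct"]) for r in rows) / len(rows)

# Toy held-out set (illustrative only; must not overlap with training data).
held_out = [
    {"id": "1", "text": "I love this product", "label": "positive"},
    {"id": "2", "text": "It broke after a day", "label": "negative"},
    {"id": "3", "text": "I love the new update", "label": "positive"},
]

base_rows = run_eval(base_generate, held_out)
tuned_rows = run_eval(tuned_generate, held_out)
print(f"base accuracy:  {accuracy(base_rows):.2f}")
print(f"tuned accuracy: {accuracy(tuned_rows):.2f}")
```

Keeping the per-example rows, not just the two accuracy numbers, is what makes step 6 possible later.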

1. Start with your actual task, not a leaderboard

Before choosing a benchmark, I would write the goal in plain language.

| If the fine-tune is meant to improve… | A better first eval is usually… |
|---|---|
| Support-ticket classification | Accuracy / F1 on held-out tickets |
| Domain QA | Held-out question/answer examples |
| JSON or structured output | JSON validity, schema validity, field accuracy |
| Chat helpfulness | Pairwise comparison, rubric, human spot-checks |
| Summarization | Coverage, factuality, maybe ROUGE/BERTScore as supporting metrics |
| Code generation | Unit tests, hidden tests, code-specific benchmark tasks |
| General reasoning | Public reasoning benchmarks may be relevant |

The key question is:

Does this benchmark measure the thing I actually fine-tuned for?

A general benchmark can be useful, but it might not detect the thing you changed. A small domain fine-tune might help your actual use case without moving a public leaderboard score much. Conversely, a model can do well on a public benchmark and still be poor at your format, domain, or workflow.
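As an example of a task-specific check that no public leaderboard would run for you, here is a rough sketch for a structured-output fine-tune. It scores JSON validity, a very simple schema check, and field accuracy. The schema (`category`, `priority`) and the sample outputs are hypothetical, and a real setup might prefer a proper JSON Schema validator.

```python
import json
from typing import Any, Dict, Optional

# Hypothetical schema for illustration: required field name -> expected Python type.
REQUIRED_FIELDS = {"category": str, "priority": int}

def parse_json(text: str) -> Optional[Dict[str, Any]]:
    """Return the parsed object, or None if the output is not a JSON object."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None

def schema_ok(obj: Dict[str, Any]) -> bool:
    """Check that every required field is present with the expected type."""
    return all(isinstance(obj.get(k), t) for k, t in REQUIRED_FIELDS.items())

def score_output(text: str, expected: Dict[str, Any]) -> Dict[str, float]:
    obj = parse_json(text)
    if obj is None:
        return {"valid_json": 0.0, "valid_schema": 0.0, "field_accuracy": 0.0}
    matches = sum(obj.get(k) == v for k, v in expected.items())
    return {
        "valid_json": 1.0,
        "valid_schema": float(schema_ok(obj)),
        "field_accuracy": matches / len(expected),
    }

expected = {"category": "billing", "priority": 2}
outputs = [
    '{"category": "billing", "priority": 2}',           # fully correct
    '{"category": "billing", "priority": "high"}',      # valid JSON, wrong type
    'Sure! Here is the JSON: {"category": "billing"}',  # chatty wrapper, not parseable
]
scores = [score_output(o, expected) for o in outputs]
for s in scores:
    print(s)
```

The third output is the kind of failure a general benchmark tends to miss: the content is roughly right, but the format breaks the downstream workflow.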



2. Minimal checklist

If you only do one thing, I would run through this checklist:

| Step | Check |
|---|---|
| Goal | What should improve? |
| Data | Are eval examples excluded from training? |
| Baseline | Did you run the base model too? |
| Fairness | Are prompt/settings the same? |
| Metric | Does the metric match the task? |
| Samples | Did you inspect actual outputs? |
| Regression | Did anything get worse? |
| Notes | Can someone else reproduce the setup? |

That is already a useful benchmark for many fine-tunes.
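The "Data" row is the easiest one to get wrong silently. A rough way to check it is to fingerprint normalized text and look for eval examples that also appear in the training data. This sketch only catches exact duplicates after lowercasing and whitespace normalization, not paraphrases, and the example texts are made up.

```python
import hashlib
import re
from typing import Iterable, List

def fingerprint(text: str) -> str:
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def find_overlap(train_texts: Iterable[str], eval_texts: Iterable[str]) -> List[str]:
    """Return eval examples whose normalized text also occurs in the training set."""
    train_hashes = {fingerprint(t) for t in train_texts}
    return [t for t in eval_texts if fingerprint(t) in train_hashes]

# Illustrative data only.
train = ["How do I reset my password?", "Where is my invoice?"]
eval_set = ["how do I  reset my password? ", "Can I change my plan?"]

leaked = find_overlap(train, eval_set)
print(f"{len(leaked)} of {len(eval_set)} eval examples also appear in training data")
```

If anything shows up here, drop those examples from the eval set before trusting any score.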



3. Pick metrics by task

There is no single metric that works for all fine-tuned LLMs.

A useful reference is Hugging Face Evaluate’s metric guide.

Very roughly:

| Task type | Possible metrics / checks |
|---|---|
| Classification | accuracy, precision, recall, F1 |
| Extraction | exact match, field-level accuracy, schema validity |
| QA with known answers | exact match, F1, semantic correctness, human spot-check |
| Summarization | ROUGE/BERTScore can help, but inspect factuality and coverage |
| Translation | BLEU / chrF / COMET, depending on setup |
| Open-ended chat | rubric scoring, pairwise comparison, human or LLM judge |
| Instruction following | constraint pass rate, format adherence |
| Coding | unit tests, pass rate, hidden tests, code benchmark tasks |
| RAG/document QA | answer correctness, context recall, faithfulness, citation usefulness |
| Deployment | latency, throughput, VRAM, cost per request |
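For the classification row, the arithmetic is simple enough to write out by hand, which helps when deciding whether accuracy or F1 is the number to report. This is a plain-Python sketch for a single positive class with made-up labels. In practice a tested library such as scikit-learn or Hugging Face Evaluate is the safer choice.

```python
from typing import Dict, Sequence

def binary_metrics(y_true: Sequence[str], y_pred: Sequence[str], positive: str) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Illustrative labels only.
y_true = ["spam", "spam", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "spam", "spam"]

metrics = binary_metrics(y_true, y_pred, positive="spam")
print({k: round(v, 3) for k, v in metrics.items()})
```

Run the same function on the base model's predictions and the fine-tuned model's predictions, so the two numbers are computed identically.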

I would avoid relying on only one weak signal, such as:

  • “the answer looks good to me”
  • training loss
  • validation loss
  • one public leaderboard score
  • one cherry-picked demo

Those can be useful, but they are not the whole evaluation.



4. Inspect failures, not only scores

Aggregate scores are useful, but a lot of the value comes from looking at individual examples.

For each output, I would save something like:

| Field | Why |
|---|---|
| Input prompt | Reproduces the case |
| Expected answer / rubric | Defines what “good” means |
| Base model output | Baseline |
| Fine-tuned output | Comparison |
| Score / pass-fail | Aggregate metric |
| Short error label | Helps find patterns |

Useful error labels might be:

  • wrong answer
  • incomplete answer
  • hallucinated detail
  • ignored instruction
  • wrong format
  • invalid JSON
  • too verbose
  • too short
  • unnecessary refusal (refused a request it should have answered)
  • should have refused but did not
  • correct answer but bad explanation
  • regression from base model

This is one reason sample-level logging is useful. Lighteval, for example, emphasizes sample-by-sample results for deeper inspection.
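If you are not using a tool that logs samples for you, a hand-rolled version is not much work. The sketch below stores one record per example with the fields suggested above, serializes them as JSON Lines, counts error labels, and lists regressions (base passed, fine-tuned failed). The field names and records are hypothetical.

```python
import json
from collections import Counter
from typing import Dict, List

# Illustrative records; in practice these come from your eval run.
records: List[Dict[str, object]] = [
    {"id": "1", "prompt": "Extract the order id.", "expected": '{"order_id": 17}',
     "base_output": '{"order_id": 17}', "tuned_output": '{"order_id": 17}',
     "base_pass": True, "tuned_pass": True, "error_label": None},
    {"id": "2", "prompt": "Extract the order id.", "expected": '{"order_id": 42}',
     "base_output": "The order id is 42.", "tuned_output": '{"order_id": 42}',
     "base_pass": False, "tuned_pass": True, "error_label": None},
    {"id": "3", "prompt": "Extract the order id.", "expected": '{"order_id": 8}',
     "base_output": '{"order_id": 8}', "tuned_output": '{"order_id": 8',
     "base_pass": True, "tuned_pass": False, "error_label": "invalid JSON"},
    {"id": "4", "prompt": "Extract the order id.", "expected": '{"order_id": 5}',
     "base_output": "5", "tuned_output": '{"order_id": 6}',
     "base_pass": False, "tuned_pass": False, "error_label": "wrong answer"},
    {"id": "5", "prompt": "Extract the order id.", "expected": '{"order_id": 9}',
     "base_output": "nine", "tuned_output": "{order_id: 9}",
     "base_pass": False, "tuned_pass": False, "error_label": "invalid JSON"},
]

def to_jsonl(rows: List[Dict[str, object]]) -> str:
    """One JSON object per line, easy to diff and grep."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in rows)

def error_counts(rows: List[Dict[str, object]]) -> Counter:
    """How often each error label occurs among the fine-tuned model's failures."""
    return Counter(r["error_label"] for r in rows if r["error_label"])

def regressions(rows: List[Dict[str, object]]) -> List[object]:
    """Ids of examples the base model passed but the fine-tuned model failed."""
    return [r["id"] for r in rows if r["base_pass"] and not r["tuned_pass"]]

jsonl = to_jsonl(records)
print(error_counts(records).most_common())
print("regressions:", regressions(records))
```

Sorting by the most common label usually tells you what to fix in the training data first.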


5. Public benchmarks are a second layer

Once your own task-specific eval is in place, public benchmarks can be useful.

But I would choose them by target capability.

| If you care about… | Look at… | Main caution |
|---|---|---|
| General reasoning / knowledge | MMLU-Pro, GPQA, LiveBench, HELM-like suites | May not match your domain |
| Instruction following | IFEval-style tests | Measures verifiable constraints, not all helpfulness |
| Open-ended chat quality | MT-Bench / Arena-style pairwise evals / AlpacaEval-like setups | Judge and preference biases matter |
| Coding | LiveCodeBench, BigCodeBench, SWE-bench depending on task | Code completion and repo-level issue fixing are different |
| RAG / document QA | RAGAS-style component metrics | Retrieval and generation should be separated |
| Deployment | latency, throughput, VRAM, cost | Not the same as quality |

Common benchmark-running tools include Lighteval and lm-evaluation-harness.

I would not start with:

Which leaderboard should I optimize for?

I would start with:

Which capability did I fine-tune for, and does this benchmark actually measure it?



6. Special cases

The right evaluation changes a lot depending on what kind of fine-tune this is.

  • If the task is open-ended chat
  • If the task is formatting or instruction following
  • If this is RAG or document QA
  • If this is a coding fine-tune
  • If you plan to deploy the model
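For the deployment case, quality metrics are not enough and you also need timing numbers. Here is a rough sketch of a latency and throughput measurement. `fake_generate` is a hypothetical placeholder that just sleeps. Swap in your real inference call, and report the hardware and backend alongside the numbers.

```python
import statistics
import time
from typing import Callable, Dict, List

def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    time.sleep(0.01)
    return "ok"

def measure_latency(generate: Callable[[str], str], prompts: List[str], warmup: int = 1) -> Dict[str, float]:
    """Time sequential requests after a short warmup."""
    # Warmup calls are not timed (first calls often include one-off setup cost).
    for p in prompts[:warmup]:
        generate(p)

    latencies = []
    start = time.perf_counter()
    for p in prompts:
        t0 = time.perf_counter()
        generate(p)
        latencies.append(time.perf_counter() - t0)
    total = time.perf_counter() - start

    return {
        "mean_s": statistics.mean(latencies),
        "p50_s": statistics.median(latencies),
        "max_s": max(latencies),
        "requests_per_s": len(prompts) / total,
    }

stats = measure_latency(fake_generate, ["prompt"] * 5)
print({k: round(v, 4) for k, v in stats.items()})
```

This measures sequential requests only. Batched or concurrent serving behaves differently, so measure in the mode you actually plan to run.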


7. Document the benchmark setup

If you publish the model or share results, make the benchmark reproducible enough for someone else to understand what happened.

Record:

  • base model name and revision
  • fine-tuned model name and revision
  • whether it is LoRA/PEFT, QLoRA, or a merged/full fine-tune
  • dataset name and split
  • whether eval examples were excluded from training
  • prompt template
  • chat template
  • decoding parameters
  • metric
  • benchmark tool and version
  • hardware/backend, if reporting latency or throughput
  • sample outputs or failure examples
  • known limitations
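One low-effort way to keep this list honest is to write it down as a small structured record next to the results. The sketch below uses a dataclass dumped to JSON. Every value is a placeholder, and the field names are just one possible layout, not a Hub or model-card format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List

@dataclass
class EvalRecord:
    base_model: str
    base_revision: str
    tuned_model: str
    tuned_revision: str
    finetune_method: str          # e.g. LoRA/PEFT, QLoRA, or merged/full fine-tune
    dataset: str
    split: str
    eval_excluded_from_training: bool
    prompt_template: str
    decoding: Dict[str, object]
    metric: str
    tool: str
    tool_version: str
    known_limitations: List[str] = field(default_factory=list)

# All values below are placeholders for illustration.
record = EvalRecord(
    base_model="example-org/base-model",
    base_revision="<commit hash>",
    tuned_model="example-org/tuned-model",
    tuned_revision="<commit hash>",
    finetune_method="LoRA",
    dataset="example-org/support-tickets",
    split="test",
    eval_excluded_from_training=True,
    prompt_template="Classify the ticket: {text}",
    decoding={"temperature": 0.0, "max_new_tokens": 16},
    metric="macro F1",
    tool="custom script",
    tool_version="0.1",
    known_limitations=["English tickets only"],
)

print(json.dumps(asdict(record), indent=2))
```

Committing this file together with the sample outputs makes it much easier for someone else to rerun the comparison.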

Hugging Face model cards are a good place to document intended use, limitations, training details, and evaluation results.

There is also Hub support for structured evaluation results, although I would still document the plain-English setup clearly.


8. What information would help people recommend a concrete benchmark?

If you want more specific suggestions, I would add:

  • What base model did you fine-tune?
  • Is it LoRA/PEFT, QLoRA, or full fine-tuning?
  • What task did you fine-tune for?
  • What dataset did you train on?
  • Do you have a held-out test set?
  • Is the goal accuracy, instruction following, chat quality, formatting, coding, RAG, latency, or something else?
  • Are you trying to publish a model card/leaderboard result, or just check whether the fine-tune helped?

With those details, people can suggest a much more specific eval setup.
