What's your method for benchmarking?
There are a lot of ways to do this, so I’d split it up a bit. My first suggestion: measure the capability you actually care about before anything else.
Short version:
Keep a held-out test set that was not used for training, run both the base model and your fine-tuned model on it under the same prompts/settings, and compare them with metrics that match your actual task.
Public benchmarks and leaderboards are useful, but I would treat them as a second layer. They are good for orientation, not always proof that the fine-tune worked for your use case.
A practical first workflow
1. Define the target behavior
   - What was the fine-tune supposed to improve?
2. Create a held-out eval set
   - Use examples that were not in the training data.
3. Run the base model
   - Save the outputs.
4. Run the fine-tuned model
   - Same examples, same prompt template, same decoding settings.
5. Compare with task-appropriate metrics
   - Accuracy/F1 for classification, schema checks for structured output, rubric/pairwise eval for open-ended chat, unit tests for code, etc.
6. Inspect individual failures
   - Do not only look at the average score.
7. Add public benchmarks if they match the goal
   - Tools like Lighteval or lm-evaluation-harness can help, but only after you know what you want to measure.
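To make the workflow above concrete, here is a minimal sketch of the base-vs-fine-tuned comparison. It assumes you can wrap each model in a plain function that takes a prompt and returns text; `Example`, `compare_models`, and `exact_match` are hypothetical names made up for illustration, not part of any library. The point is only that both models see the same examples and the same scoring function.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Example:
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Case- and whitespace-insensitive exact match (a deliberately simple metric)."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def compare_models(
    examples: Iterable[Example],
    base_generate: Callable[[str], str],
    tuned_generate: Callable[[str], str],
    score: Callable[[str, str], float] = exact_match,
):
    """Run both models on the same examples and score them the same way."""
    rows = []
    for ex in examples:
        base_out = base_generate(ex.prompt)
        tuned_out = tuned_generate(ex.prompt)
        rows.append({
            "prompt": ex.prompt,
            "expected": ex.expected,
            "base_output": base_out,
            "tuned_output": tuned_out,
            "base_score": score(base_out, ex.expected),
            "tuned_score": score(tuned_out, ex.expected),
        })
    n = len(rows)
    summary = {
        "n": n,
        "base_mean": sum(r["base_score"] for r in rows) / n if n else 0.0,
        "tuned_mean": sum(r["tuned_score"] for r in rows) / n if n else 0.0,
        # Per-example counts catch regressions that an average can hide.
        "improved": sum(r["tuned_score"] > r["base_score"] for r in rows),
        "regressed": sum(r["tuned_score"] < r["base_score"] for r in rows),
    }
    return rows, summary
```

In a real run, `base_generate` and `tuned_generate` would call your actual models with an identical prompt template and identical decoding settings. The returned `rows` are what you would read through when inspecting failures.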
1. Start with your actual task, not a leaderboard
Before choosing a benchmark, I would write the goal in plain language.
| If the fine-tune is meant to improve… | A better first eval is usually… |
|---|---|
| Support-ticket classification | Accuracy / F1 on held-out tickets |
| Domain QA | Held-out question/answer examples |
| JSON or structured output | JSON validity, schema validity, field accuracy |
| Chat helpfulness | Pairwise comparison, rubric, human spot-checks |
| Summarization | Coverage, factuality, maybe ROUGE/BERTScore as supporting metrics |
| Code generation | Unit tests, hidden tests, code-specific benchmark tasks |
| General reasoning | Public reasoning benchmarks may be relevant |
The key question is:
Does this benchmark measure the thing I actually fine-tuned for?
A general benchmark can be useful, but it might not see the thing you changed. A small domain fine-tune might help your actual use case without moving a public leaderboard score much. Also, a model can do well on a public benchmark and still be poor at your format, domain, or workflow.
Useful starting reference:
- Hugging Face LLM Course, evaluation section: https://huggingface.co/learn/llm-course/chapter11/5
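For the structured-output row in the table above, a check could look roughly like the sketch below. It uses only the standard library; `check_json_output` and the field names in the usage note are hypothetical, and the "schema" here is just a mapping of required field names to Python types, which is much weaker than a real JSON Schema validator.

```python
import json
from typing import Optional


def check_json_output(raw: str, required_fields: dict, expected: Optional[dict] = None) -> dict:
    """Score one model output on JSON validity, a simple schema check, and field accuracy."""
    result = {"valid_json": False, "schema_ok": False, "field_accuracy": None}
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return result  # not parseable at all
    result["valid_json"] = True
    if not isinstance(parsed, dict):
        return result  # valid JSON, but not an object
    # Every required field must be present with the expected Python type.
    result["schema_ok"] = all(
        name in parsed and isinstance(parsed[name], typ)
        for name, typ in required_fields.items()
    )
    if expected:
        # Fraction of reference fields the model reproduced exactly.
        hits = sum(parsed.get(key) == value for key, value in expected.items())
        result["field_accuracy"] = hits / len(expected)
    return result
```

For example, `check_json_output('{"category": "billing", "priority": 2}', {"category": str, "priority": int}, {"category": "billing", "priority": 1})` reports valid JSON, a passing schema check, and a field accuracy of 0.5. Keeping the three numbers separate shows whether a fine-tune fixed the format, the content, or both.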
2. Minimal checklist
If you only do the minimum, I would check this:
| Step | Check |
|---|---|
| Goal | What should improve? |
| Data | Are eval examples excluded from training? |
| Baseline | Did you run the base model too? |
| Fairness | Are prompt/settings the same? |
| Metric | Does the metric match the task? |
| Samples | Did you inspect actual outputs? |
| Regression | Did anything get worse? |
| Notes | Can someone else reproduce the setup? |
That is already a useful benchmark for many fine-tunes.
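The "Data" row is the one I see skipped most often, so here is a rough sketch of a leakage check. It is an assumption-heavy simplification: it only catches eval examples that are identical to a training example after lowercasing and whitespace normalization, so paraphrases and near-duplicates will slip through. The function names are hypothetical.

```python
import hashlib
import re
from typing import Iterable, List


def normalize(text: str) -> str:
    """Lowercase, trim, and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.strip().lower())


def fingerprint(text: str) -> str:
    """Stable hash of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def find_leaked(train_texts: Iterable[str], eval_texts: Iterable[str]) -> List[int]:
    """Return the indices of eval examples that also appear in the training data."""
    train_hashes = {fingerprint(t) for t in train_texts}
    return [i for i, t in enumerate(eval_texts) if fingerprint(t) in train_hashes]
```

If `find_leaked` returns anything, I would drop those examples from the eval set before comparing models, since a fine-tune will look better than it is on data it has already seen.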
3. Pick metrics by task
There is no single metric that works for all fine-tuned LLMs.
A useful reference is the metric guide in the Hugging Face Evaluate documentation.
Very roughly:
| Task type | Possible metrics / checks |
|---|---|
| Classification | accuracy, precision, recall, F1 |
| Extraction | exact match, field-level accuracy, schema validity |
| QA with known answers | exact match, F1, semantic correctness, human spot-check |
| Summarization | ROUGE/BERTScore can help, but inspect factuality and coverage |
| Translation | BLEU / chrF / COMET, depending on setup |
| Open-ended chat | rubric scoring, pairwise comparison, human or LLM judge |
| Instruction following | constraint pass rate, format adherence |
| Coding | unit tests, pass rate, hidden tests, code benchmark tasks |
| RAG/document QA | answer correctness, context recall, faithfulness, citation usefulness |
| Deployment | latency, throughput, VRAM, cost per request |
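As an illustration of the classification row, here is a small pure-Python sketch of accuracy, precision, recall, and F1 for one label of interest. In practice you would probably use a library such as scikit-learn or Evaluate; `classification_report_for` is a hypothetical helper written out only to show what the numbers mean.

```python
from typing import Sequence


def classification_report_for(y_true: Sequence[str], y_pred: Sequence[str], positive: str) -> dict:
    """Accuracy over all examples, plus precision/recall/F1 for one positive label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Running this on the base model's predictions and then on the fine-tuned model's predictions, over the same held-out labels, gives a like-for-like comparison. Accuracy and F1 can disagree when classes are imbalanced, which is one reason to report more than one number.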
I would avoid relying on only one weak signal, such as:
- “the answer looks good to me”
- training loss
- validation loss
- one public leaderboard score
- one cherry-picked demo
Those can be useful, but they are not the whole evaluation.
4. Inspect failures, not only scores
Aggregate scores are useful, but a lot of the value comes from looking at individual examples.
For each output, I would save something like:
| Field | Why |
|---|---|
| Input prompt | Reproduces the case |
| Expected answer / rubric | Defines what “good” means |
| Base model output | Baseline |
| Fine-tuned output | Comparison |
| Score / pass-fail | Aggregate metric |
| Short error label | Helps find patterns |
Useful error labels might be:
- wrong answer
- incomplete answer
- hallucinated detail
- ignored instruction
- wrong format
- invalid JSON
- too verbose
- too short
- unnecessary refusal
- should have refused but did not
- correct answer but bad explanation
- regression from base model
This is one reason sample-level logging is useful. Lighteval, for example, emphasizes sample-by-sample results for deeper inspection.
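A sample-level log with the fields and error labels above could be as simple as the following sketch. `EvalRecord`, `to_jsonl`, and `error_counts` are hypothetical names, and the labels are assumed to be assigned by hand or by a separate check; this is not how Lighteval stores its results.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass
from typing import Iterable, Optional


@dataclass
class EvalRecord:
    prompt: str
    expected: str
    base_output: str
    tuned_output: str
    passed: bool
    error_label: Optional[str] = None  # e.g. "invalid JSON", "wrong answer"


def to_jsonl(records: Iterable[EvalRecord]) -> str:
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)


def error_counts(records: Iterable[EvalRecord]) -> Counter:
    """Count error labels among failed records to surface recurring patterns."""
    return Counter(r.error_label for r in records if not r.passed and r.error_label)
```

Calling `error_counts(records).most_common()` then gives a quick view of which failure type dominates, which is often more actionable than the aggregate score.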
5. Public benchmarks are a second layer
Once your own task-specific eval is in place, public benchmarks can be useful.
But I would choose them by target capability.
| If you care about… | Look at… | Main caution |
|---|---|---|
| General reasoning / knowledge | MMLU-Pro, GPQA, LiveBench, HELM-like suites | May not match your domain |
| Instruction following | IFEval-style tests | Measures verifiable constraints, not all helpfulness |
| Open-ended chat quality | MT-Bench / Arena-style pairwise evals / AlpacaEval-like setups | Judge and preference biases matter |
| Coding | LiveCodeBench, BigCodeBench, SWE-bench depending on task | Code completion and repo-level issue fixing are different |
| RAG / document QA | RAGAS-style component metrics | Retrieval and generation should be separated |
| Deployment | latency, throughput, VRAM, cost | Not the same as quality |
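On the "judge and preference biases matter" caution: one common mitigation for position bias in pairwise evals is to ask the judge twice with the answers swapped and only count a win when both orders agree. The sketch below assumes a hypothetical `judge` callable that returns "first", "second", or "tie"; it could be backed by a human rater or an LLM, and none of these names come from MT-Bench or AlpacaEval.

```python
from typing import Callable, Iterable, Tuple

# (prompt, first_answer, second_answer) -> "first" | "second" | "tie"
Judge = Callable[[str, str, str], str]


def pairwise_verdict(prompt: str, base_out: str, tuned_out: str, judge: Judge) -> str:
    """Judge both orderings; a model wins only if it is preferred in both."""
    forward = judge(prompt, base_out, tuned_out)   # base shown first
    backward = judge(prompt, tuned_out, base_out)  # fine-tuned shown first
    if forward == "second" and backward == "first":
        return "tuned"
    if forward == "first" and backward == "second":
        return "base"
    return "tie"  # inconsistent or tied verdicts are not counted as wins


def win_rates(items: Iterable[Tuple[str, str, str]], judge: Judge) -> dict:
    """Aggregate verdicts over (prompt, base_output, tuned_output) triples."""
    tally = {"tuned": 0, "base": 0, "tie": 0}
    for prompt, base_out, tuned_out in items:
        tally[pairwise_verdict(prompt, base_out, tuned_out, judge)] += 1
    total = sum(tally.values())
    return {k: v / total for k, v in tally.items()} if total else tally
```

A judge that always prefers whichever answer it sees first ends up with 100% ties under this scheme, which is the desired behavior. It does not address other biases, such as a preference for longer answers.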
Common benchmark-running tools:
- Lighteval: https://huggingface.co/docs/lighteval/en/index
- EleutherAI lm-evaluation-harness: https://github.com/EleutherAI/lm-evaluation-harness
I would not start with:
Which leaderboard should I optimize for?
I would start with:
Which capability did I fine-tune for, and does this benchmark actually measure it?
6. Special cases
The right evaluation changes a lot depending on what kind of fine-tune this is.
The cases I would treat separately:
- open-ended chat
- formatting or instruction following
- RAG or document QA
- a coding fine-tune
- a model you plan to deploy
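For the deployment case, quality metrics are not enough, so here is a rough sketch of a latency and throughput measurement. It assumes a hypothetical `generate` function that wraps your serving stack and handles one prompt at a time; real deployments with batching or concurrent requests need a more careful load test.

```python
import math
import statistics
import time
from typing import Callable, Sequence


def measure_latency(generate: Callable[[str], str], prompts: Sequence[str], warmup: int = 1) -> dict:
    """Time sequential requests and report median latency, p95 latency, and requests/s."""
    if not prompts:
        raise ValueError("need at least one prompt")
    for prompt in prompts[:warmup]:
        generate(prompt)  # warm-up calls are not timed
    latencies = []
    started = time.perf_counter()
    for prompt in prompts:
        t0 = time.perf_counter()
        generate(prompt)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - started
    latencies.sort()
    p95_index = math.ceil(0.95 * len(latencies)) - 1  # nearest-rank percentile
    return {
        "n": len(latencies),
        "p50_s": statistics.median(latencies),
        "p95_s": latencies[p95_index],
        "requests_per_s": len(latencies) / elapsed,
    }
```

I would run this for the base and fine-tuned model on the same hardware and backend, with the same prompts and decoding settings, and report it separately from the quality scores.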
7. Document the benchmark setup
If you publish the model or share results, make the benchmark reproducible enough for someone else to understand what happened.
Record:
- base model name and revision
- fine-tuned model name and revision
- whether it is LoRA/PEFT, QLoRA, or a merged/full fine-tune
- dataset name and split
- whether eval examples were excluded from training
- prompt template
- chat template
- decoding parameters
- metric
- benchmark tool and version
- hardware/backend, if reporting latency or throughput
- sample outputs or failure examples
- known limitations
Hugging Face model cards are a good place to document intended use, limitations, training details, and evaluation results.
There is also Hub support for structured evaluation results, although I would still document the plain-English setup clearly.
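One lightweight way to keep the setup reproducible is to store the items listed above next to the results as a small JSON file. The sketch below is only one possible layout: `EvalSetup` is a hypothetical name, and every value shown (model names, revisions, dataset, tool version) is a placeholder, not a real artifact or a Hub format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class EvalSetup:
    base_model: str
    base_revision: str
    tuned_model: str
    tuned_revision: str
    tuning_method: str            # e.g. "LoRA", "QLoRA", "full fine-tune"
    dataset: str
    split: str
    eval_excluded_from_training: bool
    prompt_template: str
    chat_template: str
    decoding: dict
    metric: str
    tool: str
    tool_version: str
    known_limitations: list = field(default_factory=list)


# All values below are placeholders for illustration.
setup = EvalSetup(
    base_model="example-org/base-model",
    base_revision="<commit-hash>",
    tuned_model="example-org/base-model-support-ft",
    tuned_revision="<commit-hash>",
    tuning_method="LoRA",
    dataset="example-org/support-tickets",
    split="test",
    eval_excluded_from_training=True,
    prompt_template="Classify the ticket: {ticket}",
    chat_template="model default",
    decoding={"temperature": 0.0, "max_new_tokens": 64},
    metric="macro F1",
    tool="custom script",
    tool_version="x.y.z",
    known_limitations=["English-only eval set"],
)

setup_json = json.dumps(asdict(setup), indent=2)
print(setup_json)
```

The same information can then be copied into the model card in plain English.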
8. What information would help people recommend a concrete benchmark?
If you want more specific suggestions, I would add:
- What base model did you fine-tune?
- Is it LoRA/PEFT, QLoRA, or full fine-tuning?
- What task did you fine-tune for?
- What dataset did you train on?
- Do you have a held-out test set?
- Is the goal accuracy, instruction following, chat quality, formatting, coding, RAG, latency, or something else?
- Are you trying to publish a model card/leaderboard result, or just check whether the fine-tune helped?
With those details, people can suggest a much more specific eval setup.