External Publication

What's your method for benchmarking?

Hugging Face Forums [Unofficial] June 25, 2026

Uh… there are a lot of ways to do this, so I’d split it up a bit. My first suggestion would be: measure the capability you actually care about before anything else.

Short version:

Keep a held-out test set that was not used for training, run both the base model and your fine-tuned model on it under the same prompts/settings, and compare them with metrics that match your actual task.

Public benchmarks and leaderboards are useful, but I would treat them as a second layer. They are good for orientation, not always proof that the fine-tune worked for your use case.

A practical first workflow

1. Define the target behavior
   - What was the fine-tune supposed to improve?
2. Create a held-out eval set
   - Use examples that were not in the training data.
3. Run the base model
   - Save the outputs.
4. Run the fine-tuned model
   - Same examples, same prompt template, same decoding settings.
5. Compare with task-appropriate metrics
   - Accuracy/F1 for classification, schema checks for structured output, rubric/pairwise eval for open-ended chat, unit tests for code, etc.
6. Inspect individual failures
   - Do not only look at the average score.
7. Add public benchmarks if they match the goal
   - Tools like Lighteval or lm-evaluation-harness can help, but only after you know what you want to measure.
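
The workflow above can be sketched in a few lines. This is only a rough sketch under some assumptions: `base_generate` and `tuned_generate` are hypothetical stand-ins for however you actually call your two models, the ticket examples are made up, and exact match is used just because it is the simplest metric to show.

```python
from typing import Callable, Dict, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count as errors."""
    return " ".join(text.lower().split())


def run_eval(generate: Callable[[str], str], examples: List[Dict[str, str]]) -> List[Dict]:
    """Run one model over the held-out examples and keep every output for later inspection."""
    records = []
    for ex in examples:
        output = generate(ex["prompt"])
        records.append({
            "prompt": ex["prompt"],
            "expected": ex["expected"],
            "output": output,
            "correct": normalize(output) == normalize(ex["expected"]),
        })
    return records


def accuracy(records: List[Dict]) -> float:
    return sum(r["correct"] for r in records) / len(records) if records else 0.0


def compare(base_records: List[Dict], tuned_records: List[Dict]) -> Dict:
    """Summarize both runs and list regressions (base was right, fine-tuned is wrong)."""
    regressions = [
        b["prompt"]
        for b, t in zip(base_records, tuned_records)
        if b["correct"] and not t["correct"]
    ]
    return {
        "base_accuracy": accuracy(base_records),
        "tuned_accuracy": accuracy(tuned_records),
        "regressions": regressions,
    }


# Hypothetical held-out examples; replace with your own data.
held_out = [
    {"prompt": "Ticket: I was charged twice.", "expected": "billing"},
    {"prompt": "Ticket: The app crashes on login.", "expected": "bug"},
    {"prompt": "Ticket: How do I export my data?", "expected": "how-to"},
]


def base_generate(prompt: str) -> str:
    # Stand-in for the base model (same prompt template and decoding settings as below).
    return "bug" if "crashes" in prompt else "billing"


def tuned_generate(prompt: str) -> str:
    # Stand-in for the fine-tuned model.
    if "crashes" in prompt:
        return "bug"
    if "export" in prompt:
        return "how-to"
    return "Billing"


summary = compare(run_eval(base_generate, held_out), run_eval(tuned_generate, held_out))
print(summary)
```

The point of keeping the per-example records, rather than only the two accuracy numbers, is that the regression list and the individual outputs are what you will actually read when something looks off.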

1. Start with your actual task, not a leaderboard

Before choosing a benchmark, I would write the goal in plain language.

If the fine-tune is meant to improve…	A better first eval is usually…
Support-ticket classification	Accuracy / F1 on held-out tickets
Domain QA	Held-out question/answer examples
JSON or structured output	JSON validity, schema validity, field accuracy
Chat helpfulness	Pairwise comparison, rubric, human spot-checks
Summarization	Coverage, factuality, maybe ROUGE/BERTScore as supporting metrics
Code generation	Unit tests, hidden tests, code-specific benchmark tasks
General reasoning	Public reasoning benchmarks may be relevant
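
For the structured-output row, the checks can be scripted directly. A minimal sketch using only the standard library, with some assumptions: `REQUIRED_FIELDS` and the sample outputs are hypothetical, and the simple type check stands in for a real schema validator.

```python
import json
from typing import Any, Dict, List, Optional

# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"category": str, "priority": int}


def parse_json(text: str) -> Optional[Dict[str, Any]]:
    """Return the parsed object, or None if the output is not a JSON object."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None


def schema_ok(obj: Dict[str, Any]) -> bool:
    """Every required field is present with exactly the expected type."""
    # Exact type comparison so that JSON true/false does not pass as an integer.
    return all(
        name in obj and type(obj[name]) is expected_type
        for name, expected_type in REQUIRED_FIELDS.items()
    )


def score_outputs(outputs: List[str], expected: List[Dict[str, Any]]) -> Dict[str, float]:
    """Compute JSON validity, schema validity, and field accuracy (assumes a non-empty list)."""
    valid = schema_valid = fields_correct = 0
    for text, gold in zip(outputs, expected):
        obj = parse_json(text)
        if obj is None:
            continue  # invalid JSON scores zero on every field
        valid += 1
        if schema_ok(obj):
            schema_valid += 1
        fields_correct += sum(obj.get(name) == gold[name] for name in REQUIRED_FIELDS)
    n = len(outputs)
    return {
        "json_validity": valid / n,
        "schema_validity": schema_valid / n,
        "field_accuracy": fields_correct / (n * len(REQUIRED_FIELDS)),
    }


outputs = [
    '{"category": "billing", "priority": 2}',
    '{"category": "bug", "priority": "high"}',
    'Sure! Here is the JSON: {"category": "bug"}',
]
expected = [
    {"category": "billing", "priority": 2},
    {"category": "bug", "priority": 1},
    {"category": "bug", "priority": 3},
]
scores = score_outputs(outputs, expected)
print(scores)
```

Reporting the three numbers separately is useful because they fail for different reasons: chatty preambles break JSON validity, wrong types break schema validity, and wrong values only show up in field accuracy.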

The key question is:

Does this benchmark measure the thing I actually fine-tuned for?

A general benchmark can be useful, but it might not be sensitive to the thing you changed. A small domain fine-tune might help your actual use case without moving a public leaderboard score much. The reverse also holds: a model can do well on a public benchmark and still be poor at your format, domain, or workflow.

Useful starting reference:

Hugging Face LLM Course, evaluation section: https://huggingface.co/learn/llm-course/chapter11/5

2. Minimal checklist

If you only do one thing, I would run through this checklist:

Step	Check
Goal	What should improve?
Data	Are eval examples excluded from training?
Baseline	Did you run the base model too?
Fairness	Are prompt/settings the same?
Metric	Does the metric match the task?
Samples	Did you inspect actual outputs?
Regression	Did anything get worse?
Notes	Can someone else reproduce the setup?

That is already a useful benchmark for many fine-tunes.
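
The "Data" row is the one that is easiest to get wrong silently, so it can help to check it with a script. A minimal sketch, assuming your train and eval examples are available as plain strings; the function names and the normalization rule are my own choices, not a standard.

```python
import hashlib
import re
from typing import Iterable, List, Set


def fingerprint(text: str) -> str:
    """Hash a normalized version of the text so trivially different copies collide."""
    normalized = re.sub(r"\W+", " ", text.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_leaked(train_texts: Iterable[str], eval_texts: List[str]) -> List[str]:
    """Return eval examples whose normalized text also appears in the training data."""
    train_hashes: Set[str] = {fingerprint(t) for t in train_texts}
    return [t for t in eval_texts if fingerprint(t) in train_hashes]


# Hypothetical data; replace with your own splits.
train = ["How do I reset my password?", "Where is my invoice?"]
held_out = ["how do I reset my password", "Can I change my plan?"]

leaked = find_leaked(train, held_out)
print(f"{len(leaked)} of {len(held_out)} eval examples also appear in the training data")
```

This only catches copies that are identical after lowercasing and stripping punctuation. Paraphrases and near-duplicates will pass, so treat a clean result as a lower bar, not a guarantee.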

3. Pick metrics by task

There is no single metric that works for all fine-tuned LLMs.

A useful reference is Hugging Face Evaluate’s metric guide:

https://huggingface.co/docs/evaluate/en/choosing_a_metric

Very roughly:

Task type	Possible metrics / checks
Classification	accuracy, precision, recall, F1
Extraction	exact match, field-level accuracy, schema validity
QA with known answers	exact match, F1, semantic correctness, human spot-check
Summarization	ROUGE/BERTScore can help, but inspect factuality and coverage
Translation	BLEU / chrF / COMET, depending on setup
Open-ended chat	rubric scoring, pairwise comparison, human or LLM judge
Instruction following	constraint pass rate, format adherence
Coding	unit tests, pass rate, hidden tests, code benchmark tasks
RAG/document QA	answer correctness, context recall, faithfulness, citation usefulness
Deployment	latency, throughput, VRAM, cost per request
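
As one concrete case from the table, a constraint pass rate for instruction following can be computed with plain functions. This is a sketch with assumptions: the three checkers (`max_words`, `contains`, `bullet_count`) and the sample cases are hypothetical, and real suites use many more constraint types.

```python
import re
from typing import Callable, Dict, List

Check = Callable[[str], bool]


def max_words(limit: int) -> Check:
    return lambda text: len(text.split()) <= limit


def contains(keyword: str) -> Check:
    return lambda text: keyword.lower() in text.lower()


def bullet_count(n: int) -> Check:
    # Counts lines that start with "- " or "* " (optionally indented).
    return lambda text: len(re.findall(r"^[ \t]*[-*] ", text, flags=re.MULTILINE)) == n


def constraint_pass_rate(cases: List[Dict]) -> Dict[str, float]:
    """Strict rate: every constraint in a case passes. Loose rate: share of individual constraints that pass."""
    total_checks = passed_checks = fully_passed = 0
    for case in cases:
        results = [check(case["output"]) for check in case["checks"]]
        total_checks += len(results)
        passed_checks += sum(results)
        fully_passed += all(results)
    return {
        "case_pass_rate": fully_passed / len(cases),
        "constraint_pass_rate": passed_checks / total_checks,
    }


# Hypothetical model outputs paired with the constraints their prompts asked for.
cases = [
    {
        "output": "- Refund issued\n- Email sent",
        "checks": [bullet_count(2), max_words(10)],
    },
    {
        "output": "We are sorry for the delay and the trouble it caused.",
        "checks": [contains("refund"), max_words(20)],
    },
]
rates = constraint_pass_rate(cases)
print(rates)
```

Checks like these only tell you whether verifiable constraints were met. They say nothing about whether the answer was actually helpful, so I would pair them with one of the other rows.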

I would avoid relying on only one weak signal, such as:

“the answer looks good to me”
training loss
validation loss
one public leaderboard score
one cherry-picked demo

Those can be useful, but they are not the whole evaluation.

4. Inspect failures, not only scores

Aggregate scores are useful, but a lot of the value comes from looking at individual examples.

For each output, I would save something like:

Field	Why
Input prompt	Reproduces the case
Expected answer / rubric	Defines what “good” means
Base model output	Baseline
Fine-tuned output	Comparison
Score / pass-fail	Aggregate metric
Short error label	Helps find patterns

Useful error labels might be:

wrong answer
incomplete answer
hallucinated detail
ignored instruction
wrong format
invalid JSON
too verbose
too short
unnecessary refusal
should have refused but did not
correct answer but bad explanation
regression from base model

This is one reason sample-level logging is useful. Lighteval, for example, emphasizes sample-by-sample results for deeper inspection:

https://huggingface.co/docs/lighteval/en/index
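
If you are not using a tool that logs samples for you, a JSONL file with the fields from the table goes a long way. A minimal sketch, assuming hand-assigned error labels; the field names, the sample records, and the temp-file path are all made up for illustration.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path
from typing import Dict, List


def write_samples(records: List[Dict], path: Path) -> None:
    """Write one JSON object per line so individual cases are easy to grep and diff."""
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def error_label_counts(path: Path) -> Counter:
    """Count error labels among failed samples to surface recurring patterns."""
    counts: Counter = Counter()
    with path.open(encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if not record["passed"]:
                counts[record["error_label"]] += 1
    return counts


# Hypothetical records using the fields from the table above.
records = [
    {"prompt": "p1", "expected": "e1", "base_output": "b1", "tuned_output": "t1",
     "passed": True, "error_label": ""},
    {"prompt": "p2", "expected": "e2", "base_output": "b2", "tuned_output": "t2",
     "passed": False, "error_label": "invalid JSON"},
    {"prompt": "p3", "expected": "e3", "base_output": "b3", "tuned_output": "t3",
     "passed": False, "error_label": "invalid JSON"},
    {"prompt": "p4", "expected": "e4", "base_output": "b4", "tuned_output": "t4",
     "passed": False, "error_label": "regression from base model"},
]

path = Path(tempfile.gettempdir()) / "eval_samples.jsonl"
write_samples(records, path)
label_counts = error_label_counts(path)
print(label_counts.most_common())
```

Once the labels are counted, the most common one usually tells you where to look next, for example a prompt-template problem if "invalid JSON" dominates.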

5. Public benchmarks are a second layer

Once your own task-specific eval is in place, public benchmarks can be useful.

But I would choose them by target capability.

If you care about…	Look at…	Main caution
General reasoning / knowledge	MMLU-Pro, GPQA, LiveBench, HELM-like suites	May not match your domain
Instruction following	IFEval-style tests	Measures verifiable constraints, not all helpfulness
Open-ended chat quality	MT-Bench / Arena-style pairwise evals / AlpacaEval-like setups	Judge and preference biases matter
Coding	LiveCodeBench, BigCodeBench, SWE-bench depending on task	Code completion and repo-level issue fixing are different
RAG / document QA	RAGAS-style component metrics	Retrieval and generation should be separated
Deployment	latency, throughput, VRAM, cost	Not the same as quality
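
For the deployment row, a first measurement does not need a benchmark suite. Here is a rough sketch that times sequential requests; `fake_generate` is a hypothetical stand-in for your real inference call, and whitespace-split words are only a crude proxy for tokens.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_latency(
    generate: Callable[[str], str], prompts: List[str], warmup: int = 1
) -> Dict[str, float]:
    """Time each request sequentially; warm-up calls are excluded from the stats."""
    for prompt in prompts[:warmup]:
        generate(prompt)

    latencies = []
    total_words = 0
    start = time.perf_counter()
    for prompt in prompts:
        t0 = time.perf_counter()
        output = generate(prompt)
        latencies.append(time.perf_counter() - t0)
        total_words += len(output.split())  # crude proxy; use your tokenizer for token counts
    elapsed = time.perf_counter() - start

    return {
        "requests": len(prompts),
        "median_latency_s": statistics.median(latencies),
        "max_latency_s": max(latencies),
        "requests_per_s": len(prompts) / elapsed,
        "words_per_s": total_words / elapsed,
    }


def fake_generate(prompt: str) -> str:
    time.sleep(0.01)  # stand-in for real inference time
    return "placeholder answer " * 5


stats = measure_latency(fake_generate, ["a", "b", "c"])
print(stats)
```

Numbers like these only mean something alongside the hardware, backend, batch size, and output length, and they measure speed, not quality.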

Common benchmark-running tools:

Lighteval: https://huggingface.co/docs/lighteval/en/index
EleutherAI lm-evaluation-harness: https://github.com/EleutherAI/lm-evaluation-harness

I would not start with:

Which leaderboard should I optimize for?

I would start with:

Which capability did I fine-tune for, and does this benchmark actually measure it?

6. Special cases

The right evaluation changes a lot depending on what kind of fine-tune this is. The cases I would think through separately:

open-ended chat
formatting or instruction following
RAG or document QA
a coding fine-tune
a model you plan to deploy

7. Document the benchmark setup

If you publish the model or share results, make the benchmark reproducible enough for someone else to understand what happened.

Record:

base model name and revision
fine-tuned model name and revision
whether it is LoRA/PEFT, QLoRA, or a merged/full fine-tune
dataset name and split
whether eval examples were excluded from training
prompt template
chat template
decoding parameters
metric
benchmark tool and version
hardware/backend, if reporting latency or throughput
sample outputs or failure examples
known limitations
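
One way to keep this list honest is to save it as a small machine-readable record next to the results. A sketch under assumptions: the `EvalSetup` class, its field names, and every value below are hypothetical placeholders, not a Hub or model-card format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class EvalSetup:
    base_model: str
    base_revision: str
    tuned_model: str
    tuned_revision: str
    finetune_method: str  # e.g. "LoRA", "QLoRA", "full"
    dataset: str
    split: str
    eval_excluded_from_training: bool
    prompt_template: str
    chat_template: str
    decoding: Dict[str, float]
    metric: str
    tool: str
    tool_version: str
    hardware: str = ""  # only needed when reporting latency or throughput
    known_limitations: List[str] = field(default_factory=list)


# All values are placeholders for illustration.
setup = EvalSetup(
    base_model="example-org/base-model",
    base_revision="<commit hash>",
    tuned_model="example-org/base-model-support-ft",
    tuned_revision="<commit hash>",
    finetune_method="LoRA",
    dataset="example-org/support-tickets",
    split="test",
    eval_excluded_from_training=True,
    prompt_template="Classify the ticket: {ticket}",
    chat_template="model default",
    decoding={"temperature": 0.0, "max_new_tokens": 32},
    metric="macro F1",
    tool="custom script",
    tool_version="0.1.0",
    known_limitations=["English tickets only"],
)

setup_json = json.dumps(asdict(setup), indent=2)
print(setup_json)
```

Writing this file at eval time, rather than reconstructing it later for the model card, makes it much harder to forget which revision or decoding settings produced a given number.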

Hugging Face model cards are a good place to document intended use, limitations, training details, and evaluation results:

https://huggingface.co/docs/hub/en/model-cards

There is also Hub support for structured evaluation results, although I would still describe the setup clearly in plain English:

https://huggingface.co/docs/hub/en/eval-results

8. What information would help people recommend a concrete benchmark?

If you want more specific suggestions, I would add:

What base model did you fine-tune?
Is it LoRA/PEFT, QLoRA, or full fine-tuning?
What task did you fine-tune for?
What dataset did you train on?
Do you have a held-out test set?
Is the goal accuracy, instruction following, chat quality, formatting, coding, RAG, latency, or something else?
Are you trying to publish a model card/leaderboard result, or just check whether the fine-tune helped?

With those details, people can suggest a much more specific eval setup.