LLM as a Judge - Evaluate ASR

Hugging Face Forums [Unofficial] May 18, 2026

Umm. for now:

I would treat this as a layered ASR evaluation problem, not as “replace WER/CER with an LLM score”.

Classic ASR metrics and LLM-as-a-Judge answer different questions:

WER/CER/MER/WIL/WIP ask: How different is the predicted transcript from the reference at the word/character level?
Semantic / intent / entity evaluation asks: Would a human or downstream system still understand the same thing?
Operational evaluation asks: Were the important things preserved: numbers, names, places, dates, times, destinations, commands, negations, and complete utterances?

That distinction is important in your case because you have 15 models over 17,900+ audio/transcript pairs, and your errors are not all equally meaningful.

Examples:

Reference	Model output	Raw WER view	Semantic view
`I am thirty years old`	`I am 30 years old`	error	harmless formatting difference
`I am thirty years old`	`I am 13 years old`	error	serious number-value error
`Book a train to Birmingham`	`Book a train to Burnley`	error	critical place-entity error
`Call Sarah tomorrow`	`Call Zara tomorrow`	error	person-entity error
`Here's what I need you to do next...`	`Here's`	deletion-heavy	truncation / incomplete-output failure
`Please cancel the booking`	`Please confirm the booking`	one-word substitution	critical intent reversal

So I would not frame the work as:

LLM-as-a-Judge vs WER

I would frame it as:

Raw WER/CER + normalized WER/CER + semantic severity + intent preservation + entity preservation + truncation/hallucination diagnostics

This gives you a much stronger benchmark and much better EDA.

1. Useful background and similar work

A few directly relevant examples:

Pipecat STT Benchmark / Semantic WER: uses Semantic WER for STT benchmarking, where only transcription errors that affect how an LLM agent understands or responds are counted. The implementation prompt is especially useful: semantic_wer.py.
Sarvam ASR evaluation beyond WER: a useful layered framing, with classic WER/CER plus LLM-WER, LLM-CER, Intent Score, and Entity Preservation Score. The entity-preservation idea is very relevant to your names, places, dates, times, and numbers.
Sarvam LLM-WER repo and LLM intent/entity repo: good practical references for separating literal transcript similarity from meaning and entity preservation.
"WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding" and its repo: strong framing. WER measures textual fidelity, but downstream impact may be different. Their domain is clinical dialogue; your domain would be semantic/intent/entity impact over accented speech.
"Evaluation of Automatic Speech Recognition Using Generative Large Language Models": directly relevant to using LLMs for ASR evaluation (hypothesis selection, semantic distance, and qualitative error classification). Especially useful for pairwise validation subsets.
JiWER and JiWER transforms: good practical tooling for WER, CER, MER, WIL, WIP, plus explicit normalization pipelines.
Hugging Face Audio Course, ASR evaluation: a clear explanation of WER as the de facto ASR metric, based on word-level substitutions, insertions, and deletions.
"What is lost in Normalization?": an important warning. Normalization can reduce harmless formatting penalties, but it can also hide meaningful errors.
"A Survey on LLM-as-a-Judge": useful for the reliability, consistency, calibration, and bias discussion.
"Judging the Judges: Position Bias in LLM-as-a-Judge": important if you use pairwise judging. Always randomize or reverse A/B order on a subset.
Prometheus 2 and Prometheus-Eval: relevant if you want an open evaluator model. It still needs ASR-specific validation.

2. Recommended evaluation design

I would use four layers.

Layer 1: raw classic metrics

Compute these on the original reference and model output:

wer_raw
cer_raw
mer
wil
wip

These remain useful because they show literal transcription fidelity and allow comparison with standard ASR work.

Example with JiWER:

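import jiwer

# Placeholder strings; in practice these come from one audio_id x model_name row.
reference_raw = "I am thirty years old"
hypothesis_raw = "I am 30 years old"

# Word- and character-level error rates on the untouched text.
wer_raw = jiwer.wer(reference_raw, hypothesis_raw)
cer_raw = jiwer.cer(reference_raw, hypothesis_raw)
# Match error rate, word information lost, word information preserved.
mer = jiwer.mer(reference_raw, hypothesis_raw)
wil = jiwer.wil(reference_raw, hypothesis_raw)
wip = jiwer.wip(reference_raw, hypothesis_raw)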

Layer 2: normalized classic metrics

Also compute:

wer_normalized
cer_normalized

Normalization should reduce harmless differences such as:

Thirty -> 30
forty five -> 45
twenty thirteen -> 2013, when context-safe
Hello, John! -> hello john

But normalization can hide important failures, so keep both raw and normalized versions:

reference_raw
hypothesis_raw
reference_normalized
hypothesis_normalized
normalization_version

Do not overwrite the raw text.
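
A minimal sketch of keeping both versions side by side. The normalizer here is my own assumption (lowercasing, punctuation removal except apostrophes, whitespace collapsing) and the version tag is hypothetical. It deliberately does not touch numbers, so `thirty` vs `30` still needs separate handling.

```python
import re
import string

NORMALIZATION_VERSION = "v1-basic"  # hypothetical tag; bump it whenever the rules change

# Remove all ASCII punctuation except the apostrophe, so "here's" survives.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation.replace("'", ""))


def normalize_text(text: str) -> str:
    """Lowercase, drop punctuation (except apostrophes), collapse whitespace."""
    text = text.lower().translate(_PUNCT_TABLE)
    return re.sub(r"\s+", " ", text).strip()


def build_row(reference_raw: str, hypothesis_raw: str) -> dict:
    """Store raw and normalized text together; the raw text is never overwritten."""
    return {
        "reference_raw": reference_raw,
        "hypothesis_raw": hypothesis_raw,
        "reference_normalized": normalize_text(reference_raw),
        "hypothesis_normalized": normalize_text(hypothesis_raw),
        "normalization_version": NORMALIZATION_VERSION,
    }


row = build_row("Hello, John!", "hello john")
print(row["reference_normalized"])  # hello john
```

Note that removing punctuation also turns `9:13` into `913`, which is one reason to keep the raw columns and to version the rules.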

Layer 3: deterministic diagnostics

Before calling a small language model (SLM) judge, compute cheap rule-based features:

reference_word_count
hypothesis_word_count
length_ratio
possible_truncation
possible_hallucination
reference_numbers
hypothesis_numbers
reference_numbers_canonical
hypothesis_numbers_canonical
number_format_only
number_value_mismatch
reference_entities
hypothesis_entities
possible_entity_mismatch

This reduces cost and improves consistency. Many number-format cases do not need an LLM judge.

Layer 4: SLM semantic judge

Use the SLM for the cases where meaning, intent, entity preservation, truncation, or hallucination needs judgment.

The judge should output labels, not a direct numeric score.

3. Do not ask the SLM for a free numeric score

I would avoid this:

{
  "score": 8.2
}

or:

{
  "semantic_similarity": 0.87
}

Small judge models often have unstable numeric calibration. A score of 7/10 can change with prompt wording, few-shot examples, model version, or decoding settings.

Instead, ask for structured labels:

{
  "severity": "SEMANTIC_EQUIVALENT",
  "error_types": ["NUMBER_FORMAT"],
  "meaning_preserved": true,
  "main_intent_preserved": true,
  "entities_preserved": true,
  "truncated": false,
  "hallucinated": false,
  "judge_uncertain": false,
  "short_reason": "The number is written in digits but has the same value."
}

Then derive scores offline.

This is better for:

heatmaps
Spearman correlation
confusion matrices
debugging
reproducibility
prompt iteration
model comparison

4. Suggested label schema

Severity labels

Use one severity label per example:

EXACT_MATCH
ORTHOGRAPHIC_ONLY
SEMANTIC_EQUIVALENT
MINOR_SEMANTIC_SHIFT
MAJOR_SEMANTIC_SHIFT
CRITICAL_MEANING_ERROR
UNCERTAIN

Definitions:

Label	Meaning	Example
`EXACT_MATCH`	Same transcript	`hello there` vs `hello there`
`ORTHOGRAPHIC_ONLY`	Only punctuation/casing/spacing/harmless spelling differs	`Hello, John.` vs `hello john`
`SEMANTIC_EQUIVALENT`	Surface form differs, meaning is the same	`thirty` vs `30`
`MINOR_SEMANTIC_SHIFT`	Small meaning change, probably not task-breaking	missing filler or minor modifier
`MAJOR_SEMANTIC_SHIFT`	Important content changed	wrong object, missing key phrase
`CRITICAL_MEANING_ERROR`	Downstream interpretation/action likely wrong	wrong number, wrong person, wrong place, `cancel` vs `confirm`
`UNCERTAIN`	Judge cannot decide confidently	ambiguous or context-dependent case

Error-type labels

Use multi-label error types:

NUMBER_FORMAT
NUMBER_VALUE_ERROR
PERSON_NAME_ERROR
PLACE_NAME_ERROR
OTHER_ENTITY_ERROR
OMISSION_OR_TRUNCATION
HALLUCINATION_OR_INSERTION
WORD_SUBSTITUTION
DIALECT_OR_ACCENT_WORD
ORTHOGRAPHIC_OR_PUNCTUATION
NO_ERROR
UNCERTAIN

Why multi-label?

Because one transcript can have several problems.

Example:

Reference: Here's the address: 45 King Street in Birmingham.
Hypothesis: Here's the address.

Expected output:

{
  "severity": "CRITICAL_MEANING_ERROR",
  "error_types": [
    "OMISSION_OR_TRUNCATION",
    "NUMBER_VALUE_ERROR",
    "PLACE_NAME_ERROR"
  ],
  "meaning_preserved": false,
  "main_intent_preserved": false,
  "entities_preserved": false,
  "truncated": true,
  "hallucinated": false,
  "judge_uncertain": false,
  "short_reason": "The hypothesis stops early and omits the address number and place."
}
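
Because the judge returns free text that is only supposed to be JSON, I would validate every response against this schema before using it. A rough sketch, assuming the label sets above; the function name `parse_judge_output` and the `judge_parse_error` flag handling are my own illustration, not part of any library.

```python
import json

SEVERITIES = {
    "EXACT_MATCH", "ORTHOGRAPHIC_ONLY", "SEMANTIC_EQUIVALENT",
    "MINOR_SEMANTIC_SHIFT", "MAJOR_SEMANTIC_SHIFT",
    "CRITICAL_MEANING_ERROR", "UNCERTAIN",
}
ERROR_TYPES = {
    "NUMBER_FORMAT", "NUMBER_VALUE_ERROR", "PERSON_NAME_ERROR",
    "PLACE_NAME_ERROR", "OTHER_ENTITY_ERROR", "OMISSION_OR_TRUNCATION",
    "HALLUCINATION_OR_INSERTION", "WORD_SUBSTITUTION",
    "DIALECT_OR_ACCENT_WORD", "ORTHOGRAPHIC_OR_PUNCTUATION",
    "NO_ERROR", "UNCERTAIN",
}
BOOL_FIELDS = (
    "meaning_preserved", "main_intent_preserved", "entities_preserved",
    "truncated", "hallucinated", "judge_uncertain",
)


def parse_judge_output(raw: str) -> dict:
    """Return validated labels, or {"judge_parse_error": True} if anything is off."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"judge_parse_error": True}
    if not isinstance(data, dict):
        return {"judge_parse_error": True}

    severity = data.get("severity")
    error_types = data.get("error_types")
    valid = (
        isinstance(severity, str)
        and severity in SEVERITIES
        and isinstance(error_types, list)
        and all(isinstance(e, str) and e in ERROR_TYPES for e in error_types)
        and all(isinstance(data.get(field), bool) for field in BOOL_FIELDS)
    )
    if not valid:
        return {"judge_parse_error": True}

    data["judge_parse_error"] = False
    return data
```

Rows that fail validation can be counted toward `parse_error_rate` and retried or routed to manual review rather than silently dropped.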

5. Convert labels into scores offline

A simple first version:

Severity	Semantic penalty
`EXACT_MATCH`	`0.00`
`ORTHOGRAPHIC_ONLY`	`0.00`
`SEMANTIC_EQUIVALENT`	`0.00`
`MINOR_SEMANTIC_SHIFT`	`0.25`
`MAJOR_SEMANTIC_SHIFT`	`0.75`
`CRITICAL_MEANING_ERROR`	`1.00`
`UNCERTAIN`	separate bucket

Then:

semantic_score = 1 - semantic_penalty

This gives:

Severity	Semantic score
`EXACT_MATCH`	`1.00`
`ORTHOGRAPHIC_ONLY`	`1.00`
`SEMANTIC_EQUIVALENT`	`1.00`
`MINOR_SEMANTIC_SHIFT`	`0.75`
`MAJOR_SEMANTIC_SHIFT`	`0.25`
`CRITICAL_MEANING_ERROR`	`0.00`

Start simple. Later, if needed, add an entity-aware penalty:

final_penalty =
    severity_penalty
    + 0.15 * number_value_error
    + 0.15 * person_name_error
    + 0.15 * place_name_error
    + 0.20 * truncated

Then clip to 1.0.

But I would not start with a complex formula. First validate the labels.
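
For reference, the mapping and the optional entity-aware penalty could look like this. It is only a sketch using the weights suggested above; the function name and the choice to return `None` for `UNCERTAIN` (so it stays in its own bucket) are my assumptions.

```python
SEVERITY_PENALTY = {
    "EXACT_MATCH": 0.00,
    "ORTHOGRAPHIC_ONLY": 0.00,
    "SEMANTIC_EQUIVALENT": 0.00,
    "MINOR_SEMANTIC_SHIFT": 0.25,
    "MAJOR_SEMANTIC_SHIFT": 0.75,
    "CRITICAL_MEANING_ERROR": 1.00,
    # UNCERTAIN is intentionally absent: it is reported as a separate bucket.
}
ENTITY_WEIGHTS = {
    "NUMBER_VALUE_ERROR": 0.15,
    "PERSON_NAME_ERROR": 0.15,
    "PLACE_NAME_ERROR": 0.15,
}
TRUNCATION_WEIGHT = 0.20


def final_penalty(severity, error_types, truncated):
    """Severity penalty plus entity/truncation add-ons, clipped to 1.0."""
    base = SEVERITY_PENALTY.get(severity)
    if base is None:  # UNCERTAIN or unknown label: no numeric score
        return None
    extra = sum(w for label, w in ENTITY_WEIGHTS.items() if label in error_types)
    if truncated:
        extra += TRUNCATION_WEIGHT
    return min(1.0, base + extra)


penalty = final_penalty("MINOR_SEMANTIC_SHIFT", ["PERSON_NAME_ERROR"], False)
semantic_score = 1 - penalty
```

Keeping the weights in one dictionary makes it easy to re-score the whole dataset offline without calling the judge again.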

6. Handling your common error types

6.1 Numbers as words vs digits

Separate:

NUMBER_FORMAT
NUMBER_VALUE_ERROR

Examples:

Reference	Hypothesis	Label	Severity
`thirty`	`30`	`NUMBER_FORMAT`	`SEMANTIC_EQUIVALENT`
`forty five`	`45`	`NUMBER_FORMAT`	`SEMANTIC_EQUIVALENT`
`twenty thirteen`	`2013`	`NUMBER_FORMAT`	usually `SEMANTIC_EQUIVALENT`
`nine thirteen`	`9:13`	`NUMBER_FORMAT`, if same intended time	usually `SEMANTIC_EQUIVALENT`
`thirty`	`13`	`NUMBER_VALUE_ERROR`	`CRITICAL_MEANING_ERROR`
`9:13`	`9:30`	`NUMBER_VALUE_ERROR`	`CRITICAL_MEANING_ERROR`

Principle:

Same value, different format = low or zero semantic penalty. Different value = high semantic penalty.

I would implement deterministic number canonicalization before the judge.

Store:

reference_numbers
hypothesis_numbers
reference_numbers_canonical
hypothesis_numbers_canonical
number_format_only
number_value_mismatch
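
A small sketch of what that canonicalization might look like. It is deliberately limited and the limits are my assumptions: only digit runs and spelled-out cardinals from 0 to 99 are handled, so years such as `twenty thirteen` vs `2013`, thousands separators, and `one` used as a pronoun will be misread. I would therefore treat `number_value_mismatch` from this code as a "possible mismatch" hint, not a final label.

```python
import re

UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11,
    "twelve": 12, "thirteen": 13, "fourteen": 14, "fifteen": 15,
    "sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
_TOKEN = re.compile(r"\d+|[a-z']+")


def canonical_tokens(text: str) -> list:
    """Tokenize and replace digit runs / simple number words with ints."""
    tokens = _TOKEN.findall(text.lower())
    out, i = [], 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.isdigit():
            out.append(int(tok))
        elif tok in TENS:
            value = TENS[tok]
            # "forty five" -> 45: a tens word absorbs a following 1-9 unit.
            if i + 1 < len(tokens) and 1 <= UNITS.get(tokens[i + 1], 0) <= 9:
                value += UNITS[tokens[i + 1]]
                i += 1
            out.append(value)
        elif tok in UNITS:
            out.append(UNITS[tok])
        else:
            out.append(tok)
        i += 1
    return out


def number_flags(reference: str, hypothesis: str) -> dict:
    ref, hyp = canonical_tokens(reference), canonical_tokens(hypothesis)
    ref_nums = [t for t in ref if isinstance(t, int)]
    hyp_nums = [t for t in hyp if isinstance(t, int)]
    surface_differs = _TOKEN.findall(reference.lower()) != _TOKEN.findall(hypothesis.lower())
    return {
        "reference_numbers_canonical": ref_nums,
        "hypothesis_numbers_canonical": hyp_nums,
        # Tokens differ on the surface but are identical once numbers are canonical.
        "number_format_only": ref == hyp and surface_differs,
        "number_value_mismatch": ref_nums != hyp_nums,
    }


flags = number_flags("I am thirty years old", "I am 30 years old")
```

Because non-number tokens pass through unchanged, `number_format_only` is true only when number formatting is the sole token-level difference between the two transcripts.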

6.2 Human and place names

Names and places should be strict.

A wrong name or wrong place can be more important than several ordinary word substitutions.

Examples:

Reference	Hypothesis	Suggested label
`Birmingham`	`burning them`	`PLACE_NAME_ERROR`
`Leeds`	`leads`	context-dependent; often `PLACE_NAME_ERROR`
`Sarah`	`Zara`	`PERSON_NAME_ERROR`
`John Smith`	`John's myth`	`PERSON_NAME_ERROR`
`Edinburgh`	`Edinburg`	maybe spelling-only, depending on the task
`Newcastle`	`new castle`	context-dependent

Create entity-specific fields:

person_name_error
place_name_error
other_entity_error
number_value_error
entities_preserved
entity_score

A simple entity score:

entity_score = preserved_key_entities / total_key_entities

Example:

Reference: Call Sarah in Birmingham at 9:13.
Hypothesis: Call Zara in Birmingham at 9:13.

Entity score:

Sarah: wrong
Birmingham: correct
9:13: correct

entity_score = 2 / 3 = 0.67

This matters because the rough intent can survive while the useful information fails.
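
As a rough illustration of the entity score, assuming the key entities of the reference are already available (from annotation or an NER step, which is not shown) and that a case-insensitive substring match is good enough as a first pass:

```python
def entity_score(reference_entities, hypothesis_text):
    """Share of key reference entities found verbatim in the hypothesis.

    Returns None when the reference has no key entities, so those rows
    can be excluded from averages instead of counting as perfect.
    """
    if not reference_entities:
        return None
    hyp = hypothesis_text.lower()
    preserved = [e for e in reference_entities if e.lower() in hyp]
    return len(preserved) / len(reference_entities)


score = entity_score(
    ["Sarah", "Birmingham", "9:13"],
    "Call Zara in Birmingham at 9:13.",
)
print(round(score, 2))  # 0.67
```

Substring matching is crude: it misses harmless spelling variants and can match inside longer words, so the strictness rules above still need the judge or a better matcher.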

6.3 Incomplete transcriptions

For incomplete outputs, do not rely only on WER.

Use:

OMISSION_OR_TRUNCATION
truncated = true

Example:

Reference: Here's what I need you to do next. Please call the office before five.
Hypothesis: Here's

Expected:

{
  "severity": "CRITICAL_MEANING_ERROR",
  "error_types": ["OMISSION_OR_TRUNCATION"],
  "meaning_preserved": false,
  "main_intent_preserved": false,
  "entities_preserved": false,
  "truncated": true,
  "hallucinated": false,
  "judge_uncertain": false,
  "short_reason": "The hypothesis stops after the first word and omits the main content."
}

Possible causes:

Cause	Explanation
Max output tokens too low	Audio LLM generation stops before full transcript
Stop sequence triggered	The model hits an unintended stop token
VAD/chunking issue	Input audio was cut or segmented incorrectly
Prompt ambiguity	The model summarizes or answers instead of transcribing
Long audio degradation	Later parts of the clip are lost
Decoding settings	Generation settings prefer short outputs
Model uncertainty	The model stops after becoming unsure

Track:

reference_word_count
hypothesis_word_count
length_ratio
audio_duration_sec
starts_with_reference_prefix

Useful heuristic:

possible_truncation =
    length_ratio < 0.5
    and hypothesis matches the beginning of the reference
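
The heuristic can be sketched in a few lines. The 0.5 threshold comes from the rule above; stripping punctuation from word edges and the function name are my own assumptions. An empty hypothesis is not flagged here and should be handled as its own case.

```python
import string


def _words(text: str) -> list[str]:
    # Lowercase and strip punctuation from word edges so "next." matches "next".
    stripped = (w.strip(string.punctuation) for w in text.lower().split())
    return [w for w in stripped if w]


def truncation_features(reference: str, hypothesis: str, max_ratio: float = 0.5) -> dict:
    ref_words, hyp_words = _words(reference), _words(hypothesis)
    length_ratio = len(hyp_words) / len(ref_words) if ref_words else 0.0
    # True when the hypothesis is a non-empty word-level prefix of the reference.
    starts_with_prefix = bool(hyp_words) and ref_words[: len(hyp_words)] == hyp_words
    return {
        "reference_word_count": len(ref_words),
        "hypothesis_word_count": len(hyp_words),
        "length_ratio": length_ratio,
        "starts_with_reference_prefix": starts_with_prefix,
        "possible_truncation": length_ratio < max_ratio and starts_with_prefix,
    }


features = truncation_features(
    "Here's what I need you to do next. Please call the office before five.",
    "Here's",
)
```

An exact prefix match is strict on purpose. If a model truncates and also gets an early word wrong, this flag stays false and the row falls through to the judge.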

7. Efficient pipeline for 15 x 17,900 outputs

You likely have about:

15 * 17,900 = 268,500 model-output rows

So avoid unnecessary judge calls.

Stage 1: compute raw metrics

For every row:

wer_raw
cer_raw
mer
wil
wip

Stage 2: compute normalized metrics

For every row:

wer_normalized
cer_normalized

Stage 3: deterministic shortcuts

Skip or reduce judge calls where possible:

Condition	Direct label/action
raw exact match	`EXACT_MATCH`
normalized exact match	`ORTHOGRAPHIC_ONLY` or `SEMANTIC_EQUIVALENT`
only number format differs	`SEMANTIC_EQUIVALENT`, `NUMBER_FORMAT`
empty hypothesis	`CRITICAL_MEANING_ERROR`, `OMISSION_OR_TRUNCATION`
very short prefix-only hypothesis	likely `OMISSION_OR_TRUNCATION`
hypothesis much longer than reference	possible `HALLUCINATION_OR_INSERTION`
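
One way to wire these shortcuts together, as a sketch only. It assumes each row is a dict that already carries the raw and normalized text and a precomputed `number_format_only` flag, and that the normalizer only changes casing and punctuation (so a normalized match is labelled `ORTHOGRAPHIC_ONLY`). Prefix-only and over-long hypotheses are not decided here; I would pass them to the judge with the diagnostic flags attached.

```python
def shortcut_label(row: dict):
    """Return (severity, error_types) if a rule settles the row, else None."""
    reference = row["reference_raw"]
    hypothesis = row["hypothesis_raw"]

    if hypothesis.strip() == "":
        return "CRITICAL_MEANING_ERROR", ["OMISSION_OR_TRUNCATION"]
    if hypothesis == reference:
        return "EXACT_MATCH", ["NO_ERROR"]
    if row["hypothesis_normalized"] == row["reference_normalized"]:
        return "ORTHOGRAPHIC_ONLY", ["ORTHOGRAPHIC_OR_PUNCTUATION"]
    if row.get("number_format_only"):
        return "SEMANTIC_EQUIVALENT", ["NUMBER_FORMAT"]
    return None  # non-trivial: send to the SLM judge


rows = [
    {"reference_raw": "Hello, John!", "hypothesis_raw": "hello john",
     "reference_normalized": "hello john", "hypothesis_normalized": "hello john"},
    {"reference_raw": "Call Sarah tomorrow", "hypothesis_raw": "Call Zara tomorrow",
     "reference_normalized": "call sarah tomorrow",
     "hypothesis_normalized": "call zara tomorrow"},
]
needs_judge = [r for r in rows if shortcut_label(r) is None]
```

The order of the checks matters: the empty-hypothesis rule comes first so that an empty reference/hypothesis pair is not mistaken for an exact match.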

Stage 4: SLM judge for non-trivial cases

Use the SLM for:

possible semantic shift
possible entity error
possible truncation
possible hallucination
low WER but possible entity/number mismatch
high WER but maybe same meaning
accent/dialect-word cases

Stage 5: stronger judge or manual review for high-risk cases

Route these to a stronger judge or manual review:

judge_uncertain = true
CRITICAL_MEANING_ERROR
low WER + critical error
high WER + semantic equivalent
entity errors
number value errors
truncations

8. Suggested SLM judge prompt

Use a strict JSON-only prompt.

You are evaluating an automatic speech recognition transcript.

Compare the reference transcript and the model transcript.

Evaluate fidelity to the reference, not fluency. Do not reward the model transcript for being more grammatical, more complete-sounding, or more fluent than the reference.

Return only valid JSON. Do not include markdown or text outside the JSON.

Reference transcript:
{reference}

Model transcript:
{hypothesis}

Return this JSON:
{
  "severity": one of [
    "EXACT_MATCH",
    "ORTHOGRAPHIC_ONLY",
    "SEMANTIC_EQUIVALENT",
    "MINOR_SEMANTIC_SHIFT",
    "MAJOR_SEMANTIC_SHIFT",
    "CRITICAL_MEANING_ERROR",
    "UNCERTAIN"
  ],
  "error_types": list of labels from [
    "NUMBER_FORMAT",
    "NUMBER_VALUE_ERROR",
    "PERSON_NAME_ERROR",
    "PLACE_NAME_ERROR",
    "OTHER_ENTITY_ERROR",
    "OMISSION_OR_TRUNCATION",
    "HALLUCINATION_OR_INSERTION",
    "WORD_SUBSTITUTION",
    "DIALECT_OR_ACCENT_WORD",
    "ORTHOGRAPHIC_OR_PUNCTUATION",
    "NO_ERROR",
    "UNCERTAIN"
  ],
  "meaning_preserved": true or false,
  "main_intent_preserved": true or false,
  "entities_preserved": true or false,
  "truncated": true or false,
  "hallucinated": true or false,
  "judge_uncertain": true or false,
  "short_reason": "one short sentence"
}

Rules:
- Punctuation, casing, spacing, and harmless formatting differences are not semantic errors.
- Digit-vs-word differences are not semantic errors if the numeric value is identical.
- Wrong numeric values, dates, times, prices, ages, addresses, quantities, or phone numbers are important errors.
- Treat person names and place names strictly.
- If the wrong person, place, station, city, region, organization, date, time, or number would change interpretation, mark an entity or number error.
- If the model transcript stops early or only contains the beginning of the reference, mark OMISSION_OR_TRUNCATION and set truncated=true.
- If the model transcript adds information not present in the reference, mark HALLUCINATION_OR_INSERTION and set hallucinated=true.
- If the main action, request, destination, object, number, entity, or intent changes, use MAJOR_SEMANTIC_SHIFT or CRITICAL_MEANING_ERROR.
- If unsure, use UNCERTAIN and set judge_uncertain=true.

For SLMs, keep the prompt stable and short. Add only a few examples.

9. Few-shot examples

Example 1: number format only

Reference:
I am thirty years old.

Model transcript:
I am 30 years old.

Expected JSON:
{
  "severity": "SEMANTIC_EQUIVALENT",
  "error_types": ["NUMBER_FORMAT"],
  "meaning_preserved": true,
  "main_intent_preserved": true,
  "entities_preserved": true,
  "truncated": false,
  "hallucinated": false,
  "judge_uncertain": false,
  "short_reason": "The numeric value is the same but formatted differently."
}

Example 2: number value error

Reference:
The appointment is at nine thirteen.

Model transcript:
The appointment is at 9:30.

Expected JSON:
{
  "severity": "CRITICAL_MEANING_ERROR",
  "error_types": ["NUMBER_VALUE_ERROR"],
  "meaning_preserved": false,
  "main_intent_preserved": false,
  "entities_preserved": false,
  "truncated": false,
  "hallucinated": false,
  "judge_uncertain": false,
  "short_reason": "The appointment time is wrong."
}

Example 3: place-name error

Reference:
I need a ticket to Birmingham.

Model transcript:
I need a ticket to Burnley.

Expected JSON:
{
  "severity": "CRITICAL_MEANING_ERROR",
  "error_types": ["PLACE_NAME_ERROR"],
  "meaning_preserved": false,
  "main_intent_preserved": true,
  "entities_preserved": false,
  "truncated": false,
  "hallucinated": false,
  "judge_uncertain": false,
  "short_reason": "The destination place is different."
}

Example 4: truncation

Reference:
Here's what I need you to do next. Please call the office before five.

Model transcript:
Here's

Expected JSON:
{
  "severity": "CRITICAL_MEANING_ERROR",
  "error_types": ["OMISSION_OR_TRUNCATION"],
  "meaning_preserved": false,
  "main_intent_preserved": false,
  "entities_preserved": false,
  "truncated": true,
  "hallucinated": false,
  "judge_uncertain": false,
  "short_reason": "The transcript stops after the first word and omits the main content."
}

10. Validation plan

Do not run the judge over the full dataset without validation.

Create a human-labeled subset.

Minimum:

300 examples

Better:

500 to 1,000 examples

Use stratified sampling. Include:

low WER
high WER
low WER + possible entity error
low WER + possible number error
high WER + likely same meaning
truncation cases
hallucination/insertion cases
all accent groups
all model families
short audio
long audio
different speakers
different genders

Measure:

accuracy
macro-F1
per-label F1
Cohen's kappa
confusion matrix
judge_uncertain_rate
parse_error_rate
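
If you want these agreement numbers without extra dependencies, a plain-Python sketch is below (scikit-learn offers equivalent functions if it is already in your stack). It assumes one label per example from the human and from the judge, for example the severity label; the toy data is invented for illustration.

```python
from collections import Counter


def cohen_kappa(human, judge):
    """Chance-corrected agreement between two label sequences."""
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


def per_label_f1(human, judge):
    """F1 per label, treating the human labels as ground truth."""
    scores = {}
    for label in sorted(set(human) | set(judge)):
        tp = sum(h == label and j == label for h, j in zip(human, judge))
        fp = sum(h != label and j == label for h, j in zip(human, judge))
        fn = sum(h == label and j != label for h, j in zip(human, judge))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(human, judge):
    scores = per_label_f1(human, judge)
    return sum(scores.values()) / len(scores)


# Toy example with two labels (hypothetical data).
human = ["NUMBER_FORMAT", "NUMBER_FORMAT", "NUMBER_VALUE_ERROR", "NUMBER_VALUE_ERROR"]
judge = ["NUMBER_FORMAT", "NUMBER_VALUE_ERROR", "NUMBER_VALUE_ERROR", "NUMBER_VALUE_ERROR"]
kappa = cohen_kappa(human, judge)
f1 = per_label_f1(human, judge)
```

Macro-F1 weights rare labels equally with common ones, which is what you want here because the critical labels are usually the rare ones.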

Important confusions to inspect:

Confusion	Why it matters
`NUMBER_FORMAT` vs `NUMBER_VALUE_ERROR`	harmless vs critical
`ORTHOGRAPHIC_ONLY` vs `PLACE_NAME_ERROR`	spelling vs wrong place
`SEMANTIC_EQUIVALENT` vs `MINOR_SEMANTIC_SHIFT`	score calibration
`MAJOR_SEMANTIC_SHIFT` vs `CRITICAL_MEANING_ERROR`	severity calibration
`WORD_SUBSTITUTION` vs `PERSON_NAME_ERROR`	entity strictness
`OMISSION_OR_TRUNCATION` vs ordinary deletion	audio-LLM failure-mode detection

Freeze these before the final run:

judge_model
judge_model_revision
judge_prompt_version
temperature
max_tokens
output_parser_version
normalization_version

11. Pairwise judging

Pairwise judging can be useful, but I would not use it for the whole dataset.

With 15 models:

15 choose 2 = 105 pairs per audio

For 17,900 audios:

17,900 * 105 = 1,879,500 pairwise judgments

That is probably too expensive.

Use pairwise judging for:

validation subset
top 3 to 5 models
low-WER/high-severity outliers
high-WER/semantic-equivalent outliers
cases where model rankings are unclear

If you use pairwise judging, reverse A/B order on a subset because LLM judges can show position bias.

Store:

pairwise_winner
pairwise_reversed_winner
position_stable

If the winner changes after reversing A/B order, mark the comparison unstable.
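
A sketch of the bookkeeping, assuming the judge answers "A" or "B" by position and that the second call shows the same two transcripts in swapped order. The function name and argument names are hypothetical.

```python
import math


def resolve_pairwise(model_a, model_b, verdict_ab, verdict_ba):
    """Map positional verdicts to model names and check order stability.

    verdict_ab: "A" or "B" when shown as (model_a, model_b).
    verdict_ba: "A" or "B" when shown as (model_b, model_a).
    """
    winner = model_a if verdict_ab == "A" else model_b
    reversed_winner = model_b if verdict_ba == "A" else model_a
    return {
        "pairwise_winner": winner,
        "pairwise_reversed_winner": reversed_winner,
        "position_stable": winner == reversed_winner,
    }


# A judge that always answers "A" is flagged as unstable.
biased = resolve_pairwise("model_1", "model_2", "A", "A")
consistent = resolve_pairwise("model_1", "model_2", "A", "B")

# Cost check from the numbers above.
pairs_per_audio = math.comb(15, 2)
total_judgments = 17_900 * pairs_per_audio
```

Storing winners as model names rather than as "A"/"B" avoids mixing up positions when the two runs are joined later.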

12. How DeepEval fits

DeepEval can be useful as infrastructure, especially if you want G-Eval-like custom criteria or decision-tree/DAG-style evaluation.

Useful links:

DeepEval LLM-as-a-Judge guide
DeepEval G-Eval docs
DeepEval GitHub

But I would not use a generic “correctness” or “semantic similarity” metric directly.

Your metric should be ASR-specific:

number format
number value
person names
place names
truncation
hallucination
intent preservation
entity preservation

So DeepEval is useful as an execution framework, not as the final rubric.

13. EDA and plots

Main heatmaps

Heatmap	What it shows
`model x severity`	Which models produce more serious semantic errors
`model x error_type`	Model-specific weaknesses: numbers, names, places, truncation, hallucination
`accent_group x severity`	Whether certain accents cause more meaning degradation
`accent_group x error_type`	Accent-specific error patterns
`model x truncation_rate`	Which audio LLMs stop early
`audio_duration_bucket x truncation_rate`	Whether long audio causes incomplete output
`model x entity_preservation_rate`	Which models preserve names/places/numbers
`WER bucket x severity`	Where WER agrees or disagrees with semantic labels
`normalized WER bucket x severity`	Whether normalization improves semantic alignment
`model x judge_uncertain_rate`	Which model outputs are hardest to judge
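
Most of these heatmaps are row-normalized cross-tabulations of two columns in the row-level table. A dependency-free sketch (with pandas, `crosstab` with row normalization does the same job); the rows and model names below are made up:

```python
from collections import Counter, defaultdict


def rate_table(rows, row_key, col_key):
    """Return {row_value: {col_value: share of that row's examples}}."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[row_key]][row[col_key]] += 1
    return {
        row_value: {col: n / sum(cols.values()) for col, n in cols.items()}
        for row_value, cols in counts.items()
    }


rows = [
    {"model_name": "model_a", "severity": "EXACT_MATCH"},
    {"model_name": "model_a", "severity": "CRITICAL_MEANING_ERROR"},
    {"model_name": "model_b", "severity": "EXACT_MATCH"},
    {"model_name": "model_b", "severity": "EXACT_MATCH"},
]
table = rate_table(rows, "model_name", "severity")
```

Plotting rates instead of counts keeps models comparable when some of them have fewer judged rows, for example after removing parse errors.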

Spearman correlations

Use Spearman because many variables are ordinal or non-normal.

Compute:

wer_raw vs semantic_penalty
wer_normalized vs semantic_penalty
cer_raw vs entity_score
cer_normalized vs entity_score
wer_raw vs intent_score
audio_duration_sec vs truncated
reference_word_count vs truncated
wer_raw vs entities_preserved

Do not report only one global correlation. Also compute by:

model
accent_group
gender
duration_bucket
reference_length_bucket
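
A self-contained sketch of per-group Spearman correlation, computed as Pearson correlation on average ranks (scipy or pandas will give the same number if you prefer them). It assumes rows with a missing value, such as `UNCERTAIN` rows without a penalty, have been filtered out beforehand. Groups that are too small or constant return `None`.

```python
import math
from collections import defaultdict


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:  # constant column or single row: undefined
        return None
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


def spearman_by_group(rows, group_key, x_key, y_key):
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)
    return {
        group: spearman([r[x_key] for r in rs], [r[y_key] for r in rs])
        for group, rs in groups.items()
    }
```

For example, `spearman_by_group(rows, "accent_group", "wer_raw", "semantic_penalty")` would show whether WER tracks semantic severity equally well for every accent group.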

Most important outlier buckets

These are probably the most interesting examples:

Bucket	Why important
low WER + critical semantic error	WER missed a dangerous error
high WER + semantic equivalent	WER over-penalized harmless differences
low WER + entity error	one key name/place/number broke the meaning
low normalized WER + critical error	normalization hid something important
high truncation rate for one model	audio LLM generation failure
high hallucination rate for one model	audio LLM over-generation
high `UNCERTAIN` rate	judge prompt/model not robust enough

These disagreement cases will be more valuable than a simple leaderboard.

14. Suggested row-level schema

Use one row per:

audio_id x model_name

Suggested fields:

audio_id
speaker_id
accent_group
gender
audio_duration_sec

reference_raw
reference_normalized

model_name
model_revision
model_family
quantization_mode
prompt_version
decoding_params

hypothesis_raw
hypothesis_normalized

wer_raw
cer_raw
mer
wil
wip
wer_normalized
cer_normalized

reference_word_count
hypothesis_word_count
length_ratio
possible_truncation
possible_hallucination

reference_numbers
hypothesis_numbers
reference_numbers_canonical
hypothesis_numbers_canonical
number_format_only
number_value_mismatch

reference_entities
hypothesis_entities
missing_entities
changed_entities
extra_entities

judge_model
judge_model_revision
judge_prompt_version
judge_temperature
judge_parse_error

severity
error_types
meaning_preserved
main_intent_preserved
entities_preserved
truncated
hallucinated
judge_uncertain
short_reason

semantic_penalty
semantic_score
intent_score
entity_score

normalization_version
run_timestamp

This looks verbose, but it makes later EDA much easier.

15. What I would report

Table 1: classic ASR metrics

Model	WER raw	CER raw	WER normalized	CER normalized	MER	WIL

Table 2: semantic metrics

Model	Semantic score	Intent preserved	Entity preserved	Critical error rate	Judge uncertain

Table 3: error-type rates

Model	Number format	Number value error	Person name error	Place name error	Truncation	Hallucination

Table 4: accent slicing

Accent group	WER norm	Semantic score	Entity preserved	Truncation rate	Critical error rate

Table 5: metric-disagreement examples

Pattern	Example type
low WER + critical error	wrong number/name/place
high WER + semantic equivalent	formatting/paraphrase
low normalized WER + critical error	normalization artifact
high truncation	audio LLM stopped early

16. Practical final recommendation

I would implement this pipeline:

1. Save every raw model output.
2. Compute raw WER/CER/MER/WIL/WIP.
3. Build a versioned normalization pipeline.
4. Compute normalized WER/CER.
5. Add deterministic number, entity, length, truncation, and hallucination features.
6. Use SLM judge only for non-trivial semantic cases.
7. Make the SLM return JSON labels, not numeric scores.
8. Convert severity labels to numeric semantic penalties offline.
9. Validate the judge on 300 to 1,000 human-labeled examples.
10. Use pairwise judging only for validation subsets and metric-disagreement cases.
11. Plot model x severity, model x error type, accent x error type, WER bucket x severity, and duration x truncation.
12. Focus the discussion on where WER and semantic quality disagree.

17. Short summary

Keep WER/CER, but do not rely on them alone.
Add normalized WER/CER to reduce harmless formatting penalties.
Add SLM judge labels for semantic severity, intent, entities, truncation, and hallucination.
Do not ask the SLM for a direct numeric score.
Use labels first, then map them to scores offline.
Treat thirty vs 30 as NUMBER_FORMAT, not a semantic error.
Treat 30 vs 13 as NUMBER_VALUE_ERROR, usually critical.
Treat names and places strictly with PERSON_NAME_ERROR and PLACE_NAME_ERROR.
Treat incomplete outputs as OMISSION_OR_TRUNCATION, not just high WER.
Validate your judge with a human-labeled subset before scaling.
The strongest analysis will be the disagreement cases: low WER + critical semantic error and high WER + semantic equivalent.

References

Pipecat STT Benchmark: pipecat-ai/stt-benchmark on GitHub (Benchmarking STT service TTFB and semantic WER for real-time AI applications)
Pipecat Semantic WER implementation: src/stt_benchmark/evaluation/semantic_wer.py in pipecat-ai/stt-benchmark on GitHub
Sarvam ASR evaluation beyond WER: "Indic ASR evaluation: beyond WER to LLM & semantic metrics", Sarvam AI
Sarvam LLM-WER: sarvamai/llm_wer on GitHub
Sarvam intent/entity evaluation: sarvamai/llm_intent_entity on GitHub (LLM-Eval framework for evaluating performance of ASR models)
WER is Unaware paper: arXiv:2511.16544, "WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue"
WER is Unaware repo: Ufonia/wer-is-unaware on GitHub (A benchmark, alignment pipeline, and LLM-as-a-Judge for evaluating the clinical impact of ASR errors)
Generative LLMs for ASR evaluation: arXiv:2604.21928, "Evaluation of Automatic Speech Recognition Using Generative Large Language Models"
JiWER: jitsi/jiwer on GitHub (Evaluate your speech-to-text system with similarity measures such as word error rate (WER))
JiWER transforms: "transforms", JiWER documentation
Hugging Face Audio Course, ASR evaluation: "Evaluation metrics for ASR", Hugging Face
What is lost in Normalization?: arXiv:2409.02449, "What is lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations"
Whisper English normalizer: whisper/normalizers/english.py in openai/whisper on GitHub
Whisper Basic normalizer: whisper/normalizers/basic.py in openai/whisper on GitHub
DeepEval LLM-as-a-Judge guide: "LLM-as-a-Judge Evaluation with DeepEval", DeepEval by Confident AI
DeepEval G-Eval docs: "G-Eval", DeepEval by Confident AI
DeepEval GitHub: confident-ai/deepeval on GitHub (The LLM Evaluation Framework)
G-Eval paper: arXiv:2303.16634, "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment"
Prometheus 2 paper: arXiv:2405.01535, "Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models"
Prometheus-Eval: prometheus-eval/prometheus-eval on GitHub (Evaluate your LLM's response with Prometheus and GPT4 💯)
LLM-as-a-Judge survey: arXiv:2411.15594, "A Survey on LLM-as-a-Judge"
Position bias in LLM-as-a-Judge: arXiv:2406.07791, "Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge"
Earnings-22 accented ASR benchmark: arXiv:2203.15591, "Earnings-22: A Practical Benchmark for Accents in the Wild"
EdAcc accented English ASR corpus: arXiv:2303.18110, "The Edinburgh International Accents of English Corpus: Towards the Democratization of English ASR"
EdAcc dataset card: edinburghcstr/edacc, Datasets at Hugging Face
