LLM as a Judge - Evaluate ASR
Umm. for now:
---
I would treat this as a layered ASR evaluation problem, not as “replace WER/CER with an LLM score”.
Classic ASR metrics and LLM-as-a-Judge answer different questions:
- WER/CER/MER/WIL/WIP ask: How different is the predicted transcript from the reference at the word/character level?
- Semantic / intent / entity evaluation asks: Would a human or downstream system still understand the same thing?
- Operational evaluation asks: Were the important things preserved: numbers, names, places, dates, times, destinations, commands, negations, and complete utterances?
That distinction is important in your case because you have 15 models over 17,900+ audio/transcript pairs, and your errors are not all equally meaningful.
Examples:
| Reference | Model output | Raw WER view | Semantic view |
|---|---|---|---|
| I am thirty years old | I am 30 years old | error | harmless formatting difference |
| I am thirty years old | I am 13 years old | error | serious number-value error |
| Book a train to Birmingham | Book a train to Burnley | error | critical place-entity error |
| Call Sarah tomorrow | Call Zara tomorrow | error | person-entity error |
| Here's what I need you to do next... | Here's | deletion-heavy | truncation / incomplete-output failure |
| Please cancel the booking | Please confirm the booking | one-word substitution | critical intent reversal |
So I would not frame the work as:
LLM-as-a-Judge vs WER
I would frame it as:
Raw WER/CER + normalized WER/CER + semantic severity + intent preservation + entity preservation + truncation/hallucination diagnostics
This gives you a much stronger benchmark and much better EDA.
1. Useful background and similar work
A few directly relevant examples:
- Pipecat STT Benchmark / Semantic WER. Uses Semantic WER for STT benchmarking, where only transcription errors that affect how an LLM agent understands/responds are counted. The implementation prompt is especially useful: semantic_wer.py.
- Sarvam ASR evaluation beyond WER. Useful layered framing: classic WER/CER plus LLM-WER, LLM-CER, Intent Score, and Entity Preservation Score. The entity-preservation idea is very relevant to your names, places, dates, times, and numbers.
- Sarvam LLM-WER repo and LLM intent/entity repo. Good practical references for separating literal transcript similarity from meaning and entity preservation.
- WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding (paper and repo). Strong framing: WER measures textual fidelity, but downstream impact may be different. Their domain is clinical dialogue; your domain would be semantic/intent/entity impact over accented speech.
- Evaluation of Automatic Speech Recognition Using Generative Large Language Models. Directly relevant to using LLMs for ASR evaluation: hypothesis selection, semantic distance, and qualitative error classification. Especially useful for pairwise validation subsets.
- JiWER and JiWER transforms. Good practical tooling for WER, CER, MER, WIL, WIP, plus explicit normalization pipelines.
- Hugging Face Audio Course: ASR evaluation. Clear explanation of WER as the de facto ASR metric, based on word-level substitutions, insertions, and deletions.
- What is lost in Normalization? Important warning: normalization can reduce harmless formatting penalties, but can also hide meaningful errors.
- A Survey on LLM-as-a-Judge. Useful for reliability, consistency, calibration, and bias discussion.
- Judging the Judges: Position Bias in LLM-as-a-Judge. Important if you use pairwise judging. Always randomize or reverse A/B order on a subset.
- Prometheus 2 and Prometheus-Eval. Relevant if you want an open evaluator model. Still needs ASR-specific validation.
2. Recommended evaluation design
I would use four layers.
Layer 1: raw classic metrics
Compute these on the original reference and model output:
wer_raw
cer_raw
mer
wil
wip
These remain useful because they show literal transcription fidelity and allow comparison with standard ASR work.
Example with JiWER:
import jiwer
wer_raw = jiwer.wer(reference_raw, hypothesis_raw)
cer_raw = jiwer.cer(reference_raw, hypothesis_raw)
mer = jiwer.mer(reference_raw, hypothesis_raw)
wil = jiwer.wil(reference_raw, hypothesis_raw)
wip = jiwer.wip(reference_raw, hypothesis_raw)
Layer 2: normalized classic metrics
Also compute:
wer_normalized
cer_normalized
Normalization should reduce harmless differences such as:
Thirty -> 30
forty five -> 45
twenty thirteen -> 2013, when context-safe
Hello, John! -> hello john
But normalization can hide important failures, so keep both raw and normalized versions:
reference_raw
hypothesis_raw
reference_normalized
hypothesis_normalized
normalization_version
Do not overwrite the raw text.
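
A minimal sketch of what a versioned normalizer could look like, assuming only casing, common punctuation, and whitespace are normalized. The names `normalize_text`, `make_row`, and the version tag are hypothetical; number-word conversion is deliberately left out here because it is the step most likely to hide real errors.

```python
import re

NORMALIZATION_VERSION = "norm-sketch-v0"  # hypothetical version tag

# Punctuation is removed unless it sits between two digits (keeps "9.5" intact).
_PUNCT = r'[.,!?;"]'
_PUNCT_NOT_IN_NUMBER = re.compile(rf"(?<!\d){_PUNCT}|{_PUNCT}(?!\d)")


def normalize_text(text: str) -> str:
    """Lowercase, strip common punctuation, and collapse whitespace."""
    lowered = text.lower()
    no_punct = _PUNCT_NOT_IN_NUMBER.sub("", lowered)
    return re.sub(r"\s+", " ", no_punct).strip()


def make_row(reference_raw: str, hypothesis_raw: str) -> dict:
    """Keep raw and normalized text side by side; never overwrite the raw text."""
    return {
        "reference_raw": reference_raw,
        "hypothesis_raw": hypothesis_raw,
        "reference_normalized": normalize_text(reference_raw),
        "hypothesis_normalized": normalize_text(hypothesis_raw),
        "normalization_version": NORMALIZATION_VERSION,
    }
```

Colons and apostrophes are left alone in this sketch, so `9:13` and `Here's` survive normalization unchanged. Whether that is the right choice depends on your data.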
Layer 3: deterministic diagnostics
Before calling an SLM judge, compute cheap rule-based features:
reference_word_count
hypothesis_word_count
length_ratio
possible_truncation
possible_hallucination
reference_numbers
hypothesis_numbers
reference_numbers_canonical
hypothesis_numbers_canonical
number_format_only
number_value_mismatch
reference_entities
hypothesis_entities
possible_entity_mismatch
This reduces cost and improves consistency. Many number-format cases do not need an LLM judge.
Layer 4: SLM semantic judge
Use the SLM for the cases where meaning, intent, entity preservation, truncation, or hallucination needs judgment.
The judge should output labels, not a direct numeric score.
3. Do not ask the SLM for a free numeric score
I would avoid this:
{
"score": 8.2
}
or:
{
"semantic_similarity": 0.87
}
Small judge models often have unstable numeric calibration. A score of 7/10 can change with prompt wording, few-shot examples, model version, or decoding settings.
Instead, ask for structured labels:
{
"severity": "SEMANTIC_EQUIVALENT",
"error_types": ["NUMBER_FORMAT"],
"meaning_preserved": true,
"main_intent_preserved": true,
"entities_preserved": true,
"truncated": false,
"hallucinated": false,
"judge_uncertain": false,
"short_reason": "The number is written in digits but has the same value."
}
Then derive scores offline.
This is better for:
- heatmaps
- Spearman correlation
- confusion matrices
- debugging
- reproducibility
- prompt iteration
- model comparison
4. Suggested label schema
Severity labels
Use one severity label per example:
EXACT_MATCH
ORTHOGRAPHIC_ONLY
SEMANTIC_EQUIVALENT
MINOR_SEMANTIC_SHIFT
MAJOR_SEMANTIC_SHIFT
CRITICAL_MEANING_ERROR
UNCERTAIN
Definitions:
| Label | Meaning | Example |
|---|---|---|
| EXACT_MATCH | Same transcript | hello there vs hello there |
| ORTHOGRAPHIC_ONLY | Only punctuation/casing/spacing/harmless spelling differs | Hello, John. vs hello john |
| SEMANTIC_EQUIVALENT | Surface form differs, meaning is the same | thirty vs 30 |
| MINOR_SEMANTIC_SHIFT | Small meaning change, probably not task-breaking | missing filler or minor modifier |
| MAJOR_SEMANTIC_SHIFT | Important content changed | wrong object, missing key phrase |
| CRITICAL_MEANING_ERROR | Downstream interpretation/action likely wrong | wrong number, wrong person, wrong place, cancel vs confirm |
| UNCERTAIN | Judge cannot decide confidently | ambiguous or context-dependent case |
Error-type labels
Use multi-label error types:
NUMBER_FORMAT
NUMBER_VALUE_ERROR
PERSON_NAME_ERROR
PLACE_NAME_ERROR
OTHER_ENTITY_ERROR
OMISSION_OR_TRUNCATION
HALLUCINATION_OR_INSERTION
WORD_SUBSTITUTION
DIALECT_OR_ACCENT_WORD
ORTHOGRAPHIC_OR_PUNCTUATION
NO_ERROR
UNCERTAIN
Why multi-label?
Because one transcript can have several problems.
Example:
Reference: Here's the address: 45 King Street in Birmingham.
Hypothesis: Here's the address.
Expected output:
{
"severity": "CRITICAL_MEANING_ERROR",
"error_types": [
"OMISSION_OR_TRUNCATION",
"NUMBER_VALUE_ERROR",
"PLACE_NAME_ERROR"
],
"meaning_preserved": false,
"main_intent_preserved": false,
"entities_preserved": false,
"truncated": true,
"hallucinated": false,
"judge_uncertain": false,
"short_reason": "The hypothesis stops early and omits the address number and place."
}
5. Convert labels into scores offline
A simple first version:
| Severity | Semantic penalty |
|---|---|
| EXACT_MATCH | 0.00 |
| ORTHOGRAPHIC_ONLY | 0.00 |
| SEMANTIC_EQUIVALENT | 0.00 |
| MINOR_SEMANTIC_SHIFT | 0.25 |
| MAJOR_SEMANTIC_SHIFT | 0.75 |
| CRITICAL_MEANING_ERROR | 1.00 |
| UNCERTAIN | separate bucket |
Then:
semantic_score = 1 - semantic_penalty
This gives:
| Severity | Semantic score |
|---|---|
| EXACT_MATCH | 1.00 |
| ORTHOGRAPHIC_ONLY | 1.00 |
| SEMANTIC_EQUIVALENT | 1.00 |
| MINOR_SEMANTIC_SHIFT | 0.75 |
| MAJOR_SEMANTIC_SHIFT | 0.25 |
| CRITICAL_MEANING_ERROR | 0.00 |
Start simple. Later, if needed, add an entity-aware penalty:
final_penalty =
severity_penalty
+ 0.15 * number_value_error
+ 0.15 * person_name_error
+ 0.15 * place_name_error
+ 0.20 * truncated
Then clip to 1.0.
But I would not start with a complex formula. First validate the labels.
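
The mapping and the optional entity-aware penalty above could be sketched like this. The function names are hypothetical, and the weights are simply the ones listed in this section, not tuned values. `UNCERTAIN` returns `None` so it stays in its own bucket instead of being averaged in.

```python
from typing import Optional

# Penalties copied from the table in this section; UNCERTAIN is intentionally absent.
SEVERITY_PENALTY = {
    "EXACT_MATCH": 0.00,
    "ORTHOGRAPHIC_ONLY": 0.00,
    "SEMANTIC_EQUIVALENT": 0.00,
    "MINOR_SEMANTIC_SHIFT": 0.25,
    "MAJOR_SEMANTIC_SHIFT": 0.75,
    "CRITICAL_MEANING_ERROR": 1.00,
}


def semantic_score(severity: str) -> Optional[float]:
    """Return 1 - penalty, or None for UNCERTAIN / unknown labels."""
    penalty = SEVERITY_PENALTY.get(severity)
    return None if penalty is None else 1.0 - penalty


def final_penalty(
    severity: str,
    *,
    number_value_error: bool = False,
    person_name_error: bool = False,
    place_name_error: bool = False,
    truncated: bool = False,
) -> Optional[float]:
    """Severity penalty plus the entity-aware add-ons, clipped to 1.0."""
    base = SEVERITY_PENALTY.get(severity)
    if base is None:
        return None
    extra = (
        0.15 * number_value_error
        + 0.15 * person_name_error
        + 0.15 * place_name_error
        + 0.20 * truncated
    )
    return min(1.0, base + extra)
```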
6. Handling your common error types
6.1 Numbers as words vs digits
Separate:
NUMBER_FORMAT
NUMBER_VALUE_ERROR
Examples:
| Reference | Hypothesis | Label | Severity |
|---|---|---|---|
| thirty | 30 | NUMBER_FORMAT | SEMANTIC_EQUIVALENT |
| forty five | 45 | NUMBER_FORMAT | SEMANTIC_EQUIVALENT |
| twenty thirteen | 2013 | NUMBER_FORMAT | usually SEMANTIC_EQUIVALENT |
| nine thirteen | 9:13 | NUMBER_FORMAT, if same intended time | usually SEMANTIC_EQUIVALENT |
| thirty | 13 | NUMBER_VALUE_ERROR | CRITICAL_MEANING_ERROR |
| 9:13 | 9:30 | NUMBER_VALUE_ERROR | CRITICAL_MEANING_ERROR |
Principle:
Same value, different format = low or zero semantic penalty. Different value = high semantic penalty.
I would implement deterministic number canonicalization before the judge.
Store:
reference_numbers
hypothesis_numbers
reference_numbers_canonical
hypothesis_numbers_canonical
number_format_only
number_value_mismatch
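
As a rough illustration of deterministic number canonicalization, here is a sketch that only handles English number words from 0 to 99 plus digit strings. The helper names are hypothetical. It does not understand years such as "twenty thirteen" (which it reads as 20 and 13), ordinals, or "hundred"/"thousand", so those cases would still need a fuller library or the judge.

```python
import re

_UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
_TENS = {w: 10 * (i + 2) for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split())}


def canonical_numbers(text: str) -> list:
    """Extract numbers as ints, from digit strings or simple number words (0-99)."""
    tokens = re.findall(r"[a-z]+|\d+", text.lower())
    values = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.isdigit():
            values.append(int(tok))
        elif tok in _TENS:
            value = _TENS[tok]
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            # "forty five" -> 45; "twenty thirteen" stays as two numbers.
            if nxt in _UNITS and 1 <= _UNITS[nxt] <= 9:
                value += _UNITS[nxt]
                i += 1
            values.append(value)
        elif tok in _UNITS:
            values.append(_UNITS[tok])
        i += 1
    return values


def compare_numbers(reference: str, hypothesis: str) -> dict:
    """Derive number_format_only and number_value_mismatch for one row."""
    ref_values = canonical_numbers(reference)
    hyp_values = canonical_numbers(hypothesis)
    same_values = ref_values == hyp_values
    digits_differ = re.findall(r"\d+", reference) != re.findall(r"\d+", hypothesis)
    return {
        "reference_numbers_canonical": ref_values,
        "hypothesis_numbers_canonical": hyp_values,
        "number_format_only": same_values and bool(ref_values) and digits_differ,
        "number_value_mismatch": not same_values,
    }
```

Note that words like "one" are treated as numbers even when used as pronouns, so this flag is a hint for routing, not a final label.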
6.2 Human and place names
Names and places should be strict.
A wrong name or wrong place can be more important than several ordinary word substitutions.
Examples:
| Reference | Hypothesis | Suggested label |
|---|---|---|
| Birmingham | burning them | PLACE_NAME_ERROR |
| Leeds | leads | context-dependent; often PLACE_NAME_ERROR |
| Sarah | Zara | PERSON_NAME_ERROR |
| John Smith | John's myth | PERSON_NAME_ERROR |
| Edinburgh | Edinburg | maybe spelling-only, depending on the task |
| Newcastle | new castle | context-dependent |
Create entity-specific fields:
person_name_error
place_name_error
other_entity_error
number_value_error
entities_preserved
entity_score
A simple entity score:
entity_score = preserved_key_entities / total_key_entities
Example:
Reference: Call Sarah in Birmingham at 9:13.
Hypothesis: Call Zara in Birmingham at 9:13.
Entity score:
Sarah: wrong
Birmingham: correct
9:13: correct
entity_score = 2 / 3 = 0.67
This matters because the rough intent can survive while the useful information fails.
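
A sketch of the entity score, assuming the key entities of the reference have already been extracted upstream (by an NER model, a gazetteer, or manual annotation; that step is not shown). The function name is hypothetical, and matching is a case-insensitive whole-token search, which is cruder than what a real pipeline would want for spelling variants such as Edinburgh/Edinburg.

```python
import re
from typing import Optional


def entity_score(key_entities: list, hypothesis: str) -> Optional[float]:
    """Fraction of reference key entities that appear verbatim in the hypothesis."""
    if not key_entities:
        return None  # no entities to preserve; keep these rows out of the average
    hyp = hypothesis.lower()
    preserved = 0
    for entity in key_entities:
        # Whole-token match so "Leeds" does not match inside a longer word.
        pattern = r"(?<!\w)" + re.escape(entity.lower()) + r"(?!\w)"
        if re.search(pattern, hyp):
            preserved += 1
    return preserved / len(key_entities)


score = entity_score(
    ["Sarah", "Birmingham", "9:13"],
    "Call Zara in Birmingham at 9:13.",
)
print(round(score, 2))
```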
6.3 Incomplete transcriptions
For incomplete outputs, do not rely only on WER.
Use:
OMISSION_OR_TRUNCATION
truncated = true
Example:
Reference: Here's what I need you to do next. Please call the office before five.
Hypothesis: Here's
Expected:
{
"severity": "CRITICAL_MEANING_ERROR",
"error_types": ["OMISSION_OR_TRUNCATION"],
"meaning_preserved": false,
"main_intent_preserved": false,
"entities_preserved": false,
"truncated": true,
"hallucinated": false,
"judge_uncertain": false,
"short_reason": "The hypothesis stops after the first word and omits the main content."
}
Possible causes:
| Cause | Explanation |
|---|---|
| Max output tokens too low | Audio LLM generation stops before full transcript |
| Stop sequence triggered | The model hits an unintended stop token |
| VAD/chunking issue | Input audio was cut or segmented incorrectly |
| Prompt ambiguity | The model summarizes or answers instead of transcribing |
| Long audio degradation | Later parts of the clip are lost |
| Decoding settings | Generation settings prefer short outputs |
| Model uncertainty | The model stops after becoming unsure |
Track:
reference_word_count
hypothesis_word_count
length_ratio
audio_duration_sec
starts_with_reference_prefix
Useful heuristic:
possible_truncation =
length_ratio < 0.5
and hypothesis matches the beginning of the reference
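
The heuristic above could be written roughly as follows. The function name and the punctuation stripping are assumptions of this sketch; the 0.5 threshold is the one from the heuristic and should be tuned on your data. Empty hypotheses are not flagged here because they are better handled by a separate rule.

```python
def truncation_features(reference: str, hypothesis: str,
                        ratio_threshold: float = 0.5) -> dict:
    """Length ratio plus a prefix check on lowercased, punctuation-stripped words."""
    def words(text: str) -> list:
        cleaned = (w.strip('.,!?;:"') for w in text.lower().split())
        return [w for w in cleaned if w]

    ref_words = words(reference)
    hyp_words = words(hypothesis)
    length_ratio = len(hyp_words) / len(ref_words) if ref_words else 0.0
    # True only when the hypothesis is a non-empty prefix of the reference.
    starts_with_prefix = bool(hyp_words) and ref_words[:len(hyp_words)] == hyp_words
    return {
        "reference_word_count": len(ref_words),
        "hypothesis_word_count": len(hyp_words),
        "length_ratio": length_ratio,
        "starts_with_reference_prefix": starts_with_prefix,
        "possible_truncation": length_ratio < ratio_threshold and starts_with_prefix,
    }
```

A short hypothesis that does not match the start of the reference is not flagged, since that looks more like a substitution or hallucination than an early stop.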
7. Efficient pipeline for 15 x 17,900 outputs
You likely have about:
15 * 17,900 = 268,500 model-output rows
So avoid unnecessary judge calls.
Stage 1: compute raw metrics
For every row:
wer_raw
cer_raw
mer
wil
wip
Stage 2: compute normalized metrics
For every row:
wer_normalized
cer_normalized
Stage 3: deterministic shortcuts
Skip or reduce judge calls where possible:
| Condition | Direct label/action |
|---|---|
| raw exact match | EXACT_MATCH |
| normalized exact match | ORTHOGRAPHIC_ONLY or SEMANTIC_EQUIVALENT |
| only number format differs | SEMANTIC_EQUIVALENT, NUMBER_FORMAT |
| empty hypothesis | CRITICAL_MEANING_ERROR, OMISSION_OR_TRUNCATION |
| very short prefix-only hypothesis | likely OMISSION_OR_TRUNCATION |
| hypothesis much longer than reference | possible HALLUCINATION_OR_INSERTION |
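
A sketch of how the shortcut table might be applied to one row before any judge call. It assumes the row already carries the field names used in this post, and that `number_format_only` means the two transcripts differ in nothing except number formatting. The function name is hypothetical. Returning `None` means "send to the judge".

```python
from typing import Optional


def shortcut_label(row: dict) -> Optional[dict]:
    """Return a direct label for trivial rows, or None if the judge is needed."""
    hypothesis = row["hypothesis_raw"].strip()
    if not hypothesis:
        return {"severity": "CRITICAL_MEANING_ERROR",
                "error_types": ["OMISSION_OR_TRUNCATION"]}
    if row["reference_raw"].strip() == hypothesis:
        return {"severity": "EXACT_MATCH", "error_types": ["NO_ERROR"]}
    if row["reference_normalized"] == row["hypothesis_normalized"]:
        # Assumption: the normalizer only touches casing/punctuation/spacing.
        return {"severity": "ORTHOGRAPHIC_ONLY",
                "error_types": ["ORTHOGRAPHIC_OR_PUNCTUATION"]}
    if row.get("number_format_only"):
        return {"severity": "SEMANTIC_EQUIVALENT",
                "error_types": ["NUMBER_FORMAT"]}
    return None  # non-trivial: route to the SLM judge
```

The "likely truncation" and "possible hallucination" rows of the table are left to the judge in this sketch, with the deterministic flags passed along as hints.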
Stage 4: SLM judge for non-trivial cases
Use the SLM for:
possible semantic shift
possible entity error
possible truncation
possible hallucination
low WER but possible entity/number mismatch
high WER but maybe same meaning
accent/dialect-word cases
Stage 5: stronger judge or manual review for high-risk cases
Route these to a stronger judge or manual review:
judge_uncertain = true
CRITICAL_MEANING_ERROR
low WER + critical error
high WER + semantic equivalent
entity errors
number value errors
truncations
8. Suggested SLM judge prompt
Use a strict JSON-only prompt.
You are evaluating an automatic speech recognition transcript.
Compare the reference transcript and the model transcript.
Evaluate fidelity to the reference, not fluency. Do not reward the model transcript for being more grammatical, more complete-sounding, or more fluent than the reference.
Return only valid JSON. Do not include markdown or text outside the JSON.
Reference transcript:
{reference}
Model transcript:
{hypothesis}
Return this JSON:
{
"severity": one of [
"EXACT_MATCH",
"ORTHOGRAPHIC_ONLY",
"SEMANTIC_EQUIVALENT",
"MINOR_SEMANTIC_SHIFT",
"MAJOR_SEMANTIC_SHIFT",
"CRITICAL_MEANING_ERROR",
"UNCERTAIN"
],
"error_types": list of labels from [
"NUMBER_FORMAT",
"NUMBER_VALUE_ERROR",
"PERSON_NAME_ERROR",
"PLACE_NAME_ERROR",
"OTHER_ENTITY_ERROR",
"OMISSION_OR_TRUNCATION",
"HALLUCINATION_OR_INSERTION",
"WORD_SUBSTITUTION",
"DIALECT_OR_ACCENT_WORD",
"ORTHOGRAPHIC_OR_PUNCTUATION",
"NO_ERROR",
"UNCERTAIN"
],
"meaning_preserved": true or false,
"main_intent_preserved": true or false,
"entities_preserved": true or false,
"truncated": true or false,
"hallucinated": true or false,
"judge_uncertain": true or false,
"short_reason": "one short sentence"
}
Rules:
- Punctuation, casing, spacing, and harmless formatting differences are not semantic errors.
- Digit-vs-word differences are not semantic errors if the numeric value is identical.
- Wrong numeric values, dates, times, prices, ages, addresses, quantities, or phone numbers are important errors.
- Treat person names and place names strictly.
- If the wrong person, place, station, city, region, organization, date, time, or number would change interpretation, mark an entity or number error.
- If the model transcript stops early or only contains the beginning of the reference, mark OMISSION_OR_TRUNCATION and set truncated=true.
- If the model transcript adds information not present in the reference, mark HALLUCINATION_OR_INSERTION and set hallucinated=true.
- If the main action, request, destination, object, number, entity, or intent changes, use MAJOR_SEMANTIC_SHIFT or CRITICAL_MEANING_ERROR.
- If unsure, use UNCERTAIN and set judge_uncertain=true.
For SLMs, keep the prompt stable and short. Add only a few examples.
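
Because small models do not always respect a JSON-only instruction, it helps to validate every response against the schema and record failures instead of guessing. A minimal sketch, with a hypothetical function name and the label sets copied from the prompt above:

```python
import json

SEVERITIES = {
    "EXACT_MATCH", "ORTHOGRAPHIC_ONLY", "SEMANTIC_EQUIVALENT",
    "MINOR_SEMANTIC_SHIFT", "MAJOR_SEMANTIC_SHIFT",
    "CRITICAL_MEANING_ERROR", "UNCERTAIN",
}
ERROR_TYPES = {
    "NUMBER_FORMAT", "NUMBER_VALUE_ERROR", "PERSON_NAME_ERROR",
    "PLACE_NAME_ERROR", "OTHER_ENTITY_ERROR", "OMISSION_OR_TRUNCATION",
    "HALLUCINATION_OR_INSERTION", "WORD_SUBSTITUTION",
    "DIALECT_OR_ACCENT_WORD", "ORTHOGRAPHIC_OR_PUNCTUATION",
    "NO_ERROR", "UNCERTAIN",
}
BOOL_FIELDS = (
    "meaning_preserved", "main_intent_preserved", "entities_preserved",
    "truncated", "hallucinated", "judge_uncertain",
)


def parse_judge_output(raw: str) -> dict:
    """Validate one judge response; flag anything off-schema as a parse error."""
    failure = {"judge_parse_error": True}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return failure
    if not isinstance(data, dict):
        return failure
    severity = data.get("severity")
    if not isinstance(severity, str) or severity not in SEVERITIES:
        return failure
    error_types = data.get("error_types")
    if not isinstance(error_types, list) or not all(
        isinstance(e, str) and e in ERROR_TYPES for e in error_types
    ):
        return failure
    if any(not isinstance(data.get(field), bool) for field in BOOL_FIELDS):
        return failure
    return {**data, "judge_parse_error": False}
```

Counting the rows where `judge_parse_error` is true gives the parse error rate directly, and those rows can be retried or routed to review.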
9. Few-shot examples
Example 1: number format only
Reference:
I am thirty years old.
Model transcript:
I am 30 years old.
Expected JSON:
{
"severity": "SEMANTIC_EQUIVALENT",
"error_types": ["NUMBER_FORMAT"],
"meaning_preserved": true,
"main_intent_preserved": true,
"entities_preserved": true,
"truncated": false,
"hallucinated": false,
"judge_uncertain": false,
"short_reason": "The numeric value is the same but formatted differently."
}
Example 2: number value error
Reference:
The appointment is at nine thirteen.
Model transcript:
The appointment is at 9:30.
Expected JSON:
{
"severity": "CRITICAL_MEANING_ERROR",
"error_types": ["NUMBER_VALUE_ERROR"],
"meaning_preserved": false,
"main_intent_preserved": false,
"entities_preserved": false,
"truncated": false,
"hallucinated": false,
"judge_uncertain": false,
"short_reason": "The appointment time is wrong."
}
Example 3: place-name error
Reference:
I need a ticket to Birmingham.
Model transcript:
I need a ticket to Burnley.
Expected JSON:
{
"severity": "CRITICAL_MEANING_ERROR",
"error_types": ["PLACE_NAME_ERROR"],
"meaning_preserved": false,
"main_intent_preserved": true,
"entities_preserved": false,
"truncated": false,
"hallucinated": false,
"judge_uncertain": false,
"short_reason": "The destination place is different."
}
Example 4: truncation
Reference:
Here's what I need you to do next. Please call the office before five.
Model transcript:
Here's
Expected JSON:
{
"severity": "CRITICAL_MEANING_ERROR",
"error_types": ["OMISSION_OR_TRUNCATION"],
"meaning_preserved": false,
"main_intent_preserved": false,
"entities_preserved": false,
"truncated": true,
"hallucinated": false,
"judge_uncertain": false,
"short_reason": "The transcript stops after the first word and omits the main content."
}
10. Validation plan
Do not run the judge over the full dataset without validation.
Create a human-labeled subset.
Minimum:
300 examples
Better:
500 to 1,000 examples
Use stratified sampling. Include:
low WER
high WER
low WER + possible entity error
low WER + possible number error
high WER + likely same meaning
truncation cases
hallucination/insertion cases
all accent groups
all model families
short audio
long audio
different speakers
different genders
Measure:
accuracy
macro-F1
per-label F1
Cohen's kappa
confusion matrix
judge_uncertain_rate
parse_error_rate
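
For reference, Cohen's kappa between the human labels and the judge labels can be computed in a few lines. This is a plain-Python sketch with a hypothetical function name; scikit-learn's `cohen_kappa_score` is the more usual choice if that library is already in your stack.

```python
from collections import Counter


def cohens_kappa(human: list, judge: list) -> float:
    """Agreement between two label lists, corrected for chance agreement."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    labels = set(human_counts) | set(judge_counts)
    # Chance agreement: product of the two marginal label frequencies.
    expected = sum(human_counts[l] * judge_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

Raw accuracy can look high simply because most rows are `EXACT_MATCH` or `SEMANTIC_EQUIVALENT`; kappa and per-label F1 are less flattered by that imbalance.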
Important confusions to inspect:
| Confusion | Why it matters |
|---|---|
| NUMBER_FORMAT vs NUMBER_VALUE_ERROR | harmless vs critical |
| ORTHOGRAPHIC_ONLY vs PLACE_NAME_ERROR | spelling vs wrong place |
| SEMANTIC_EQUIVALENT vs MINOR_SEMANTIC_SHIFT | score calibration |
| MAJOR_SEMANTIC_SHIFT vs CRITICAL_MEANING_ERROR | severity calibration |
| WORD_SUBSTITUTION vs PERSON_NAME_ERROR | entity strictness |
| OMISSION_OR_TRUNCATION vs ordinary deletion | audio-LLM failure-mode detection |
Freeze these before the final run:
judge_model
judge_model_revision
judge_prompt_version
temperature
max_tokens
output_parser_version
normalization_version
11. Pairwise judging
Pairwise judging can be useful, but I would not use it for the whole dataset.
With 15 models:
15 choose 2 = 105 pairs per audio
For 17,900 audios:
17,900 * 105 = 1,879,500 pairwise judgments
That is probably too expensive.
Use pairwise judging for:
validation subset
top 3 to 5 models
low-WER/high-severity outliers
high-WER/semantic-equivalent outliers
cases where model rankings are unclear
If you use pairwise judging, reverse A/B order on a subset because LLM judges can show position bias.
Store:
pairwise_winner
pairwise_reversed_winner
position_stable
If the winner changes after reversing A/B order, mark the comparison unstable.
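
One way to record the reversal check, assuming the judge answers "A" or "B" for whichever transcript it was shown in that slot. The function and argument names are hypothetical. The important detail is mapping the slot back to the model identity before comparing the two runs.

```python
def resolve_pairwise(model_x: str, model_y: str,
                     verdict_xy: str, verdict_yx: str) -> dict:
    """Combine two judge runs of the same pair with A/B order swapped.

    verdict_xy: "A" or "B" when model_x was shown as A and model_y as B.
    verdict_yx: "A" or "B" when model_y was shown as A and model_x as B.
    """
    for verdict in (verdict_xy, verdict_yx):
        if verdict not in ("A", "B"):
            raise ValueError(f"unexpected verdict: {verdict!r}")
    winner = model_x if verdict_xy == "A" else model_y
    reversed_winner = model_y if verdict_yx == "A" else model_x
    return {
        "pairwise_winner": winner,
        "pairwise_reversed_winner": reversed_winner,
        "position_stable": winner == reversed_winner,
    }
```

A judge that always answers "A" will show up here as `position_stable = False` on every pair, which is exactly the bias you want to surface.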
12. How DeepEval fits
DeepEval can be useful as infrastructure, especially if you want G-Eval-like custom criteria or decision-tree/DAG-style evaluation.
Useful links:
- DeepEval LLM-as-a-Judge guide
- DeepEval G-Eval docs
- DeepEval GitHub
But I would not use a generic “correctness” or “semantic similarity” metric directly.
Your metric should be ASR-specific:
number format
number value
person names
place names
truncation
hallucination
intent preservation
entity preservation
So DeepEval is useful as an execution framework, not as the final rubric.
13. EDA and plots
Main heatmaps
| Heatmap | What it shows |
|---|---|
| model x severity | Which models produce more serious semantic errors |
| model x error_type | Model-specific weaknesses: numbers, names, places, truncation, hallucination |
| accent_group x severity | Whether certain accents cause more meaning degradation |
| accent_group x error_type | Accent-specific error patterns |
| model x truncation_rate | Which audio LLMs stop early |
| audio_duration_bucket x truncation_rate | Whether long audio causes incomplete output |
| model x entity_preservation_rate | Which models preserve names/places/numbers |
| WER bucket x severity | Where WER agrees or disagrees with semantic labels |
| normalized WER bucket x severity | Whether normalization improves semantic alignment |
| model x judge_uncertain_rate | Which model outputs are hardest to judge |
Spearman correlations
Use Spearman because many variables are ordinal or non-normal.
Compute:
wer_raw vs semantic_penalty
wer_normalized vs semantic_penalty
cer_raw vs entity_score
cer_normalized vs entity_score
wer_raw vs intent_score
audio_duration_sec vs truncated
reference_word_count vs truncated
wer_raw vs entities_preserved
Do not report only one global correlation. Also compute by:
model
accent_group
gender
duration_bucket
reference_length_bucket
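
To make the per-group idea concrete, here is a dependency-free sketch of Spearman correlation (Pearson correlation on average ranks, so ties are handled) plus a grouping helper. The function names are hypothetical. In practice `scipy.stats.spearmanr` or pandas would be the normal tools; this only shows the shape of the computation.

```python
from collections import defaultdict
from typing import Optional


def _average_ranks(values: list) -> list:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def spearman(x: list, y: list) -> Optional[float]:
    """Spearman rho; None when either variable is constant or empty."""
    if len(x) != len(y) or not x:
        return None
    rx, ry = _average_ranks(x), _average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return None
    return cov / (vx * vy) ** 0.5


def spearman_by_group(rows: list, group_key: str, x_key: str, y_key: str) -> dict:
    """Compute rho separately per group, skipping rows with missing values."""
    groups = defaultdict(list)
    for row in rows:
        if row.get(x_key) is None or row.get(y_key) is None:
            continue  # e.g. UNCERTAIN rows with no semantic_penalty
        groups[row[group_key]].append((row[x_key], row[y_key]))
    return {
        group: spearman([p[0] for p in pairs], [p[1] for p in pairs])
        for group, pairs in groups.items()
    }
```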
Most important outlier buckets
These are probably the most interesting examples:
| Bucket | Why important |
|---|---|
| low WER + critical semantic error | WER missed a dangerous error |
| high WER + semantic equivalent | WER over-penalized harmless differences |
| low WER + entity error | one key name/place/number broke the meaning |
| low normalized WER + critical error | normalization hid something important |
| high truncation rate for one model | audio LLM generation failure |
| high hallucination rate for one model | audio LLM over-generation |
| high UNCERTAIN rate | judge prompt/model not robust enough |
These disagreement cases will be more valuable than a simple leaderboard.
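
The row-level buckets from the table could be tagged with a small function like the one below. The thresholds for "low" and "high" WER (0.10 and 0.50) are placeholders chosen for this sketch, not recommendations, and the function and bucket names are hypothetical. The model-level buckets (truncation, hallucination, and uncertainty rates) are aggregates and are not covered here.

```python
from typing import Optional

EQUIVALENT_SEVERITIES = {"EXACT_MATCH", "ORTHOGRAPHIC_ONLY", "SEMANTIC_EQUIVALENT"}


def disagreement_bucket(row: dict, low_wer: float = 0.10,
                        high_wer: float = 0.50) -> Optional[str]:
    """Tag a row where WER and the semantic label disagree; None otherwise."""
    severity = row["severity"]
    critical = severity == "CRITICAL_MEANING_ERROR"
    if row["wer_raw"] <= low_wer and critical:
        return "low_wer_critical_error"
    if row["wer_raw"] >= high_wer and severity in EQUIVALENT_SEVERITIES:
        return "high_wer_semantic_equivalent"
    if row["wer_raw"] <= low_wer and row.get("entities_preserved") is False:
        return "low_wer_entity_error"
    if row["wer_normalized"] <= low_wer and critical:
        return "low_normalized_wer_critical_error"
    return None
```

Pulling a few dozen rows from each bucket for manual reading is usually the fastest way to find out whether the judge or the metric is the one that is wrong.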
14. Suggested row-level schema
Use one row per:
audio_id x model_name
Suggested fields:
audio_id
speaker_id
accent_group
gender
audio_duration_sec
reference_raw
reference_normalized
model_name
model_revision
model_family
quantization_mode
prompt_version
decoding_params
hypothesis_raw
hypothesis_normalized
wer_raw
cer_raw
mer
wil
wip
wer_normalized
cer_normalized
reference_word_count
hypothesis_word_count
length_ratio
possible_truncation
possible_hallucination
reference_numbers
hypothesis_numbers
reference_numbers_canonical
hypothesis_numbers_canonical
number_format_only
number_value_mismatch
reference_entities
hypothesis_entities
missing_entities
changed_entities
extra_entities
judge_model
judge_model_revision
judge_prompt_version
judge_temperature
judge_parse_error
severity
error_types
meaning_preserved
main_intent_preserved
entities_preserved
truncated
hallucinated
judge_uncertain
short_reason
semantic_penalty
semantic_score
intent_score
entity_score
normalization_version
run_timestamp
This looks verbose, but it makes later EDA much easier.
15. What I would report
Table 1: classic ASR metrics
| Model | WER raw | CER raw | WER normalized | CER normalized | MER | WIL |
|---|---|---|---|---|---|---|
Table 2: semantic metrics
| Model | Semantic score | Intent preserved | Entity preserved | Critical error rate | Judge uncertain |
|---|---|---|---|---|---|
Table 3: error-type rates
| Model | Number format | Number value error | Person name error | Place name error | Truncation | Hallucination |
|---|---|---|---|---|---|---|
Table 4: accent slicing
| Accent group | WER norm | Semantic score | Entity preserved | Truncation rate | Critical error rate |
|---|---|---|---|---|---|
Table 5: metric-disagreement examples
| Pattern | Example type |
|---|---|
| low WER + critical error | wrong number/name/place |
| high WER + semantic equivalent | formatting/paraphrase |
| low normalized WER + critical error | normalization artifact |
| high truncation | audio LLM stopped early |
16. Practical final recommendation
I would implement this pipeline:
1. Save every raw model output.
2. Compute raw WER/CER/MER/WIL/WIP.
3. Build a versioned normalization pipeline.
4. Compute normalized WER/CER.
5. Add deterministic number, entity, length, truncation, and hallucination features.
6. Use SLM judge only for non-trivial semantic cases.
7. Make the SLM return JSON labels, not numeric scores.
8. Convert severity labels to numeric semantic penalties offline.
9. Validate the judge on 300 to 1,000 human-labeled examples.
10. Use pairwise judging only for validation subsets and metric-disagreement cases.
11. Plot model x severity, model x error type, accent x error type, WER bucket x severity, and duration x truncation.
12. Focus the discussion on where WER and semantic quality disagree.
17. Short summary
- Keep WER/CER, but do not rely on them alone.
- Add normalized WER/CER to reduce harmless formatting penalties.
- Add SLM judge labels for semantic severity, intent, entities, truncation, and hallucination.
- Do not ask the SLM for a direct numeric score.
- Use labels first, then map them to scores offline.
- Treat thirty vs 30 as NUMBER_FORMAT, not a semantic error.
- Treat 30 vs 13 as NUMBER_VALUE_ERROR, usually critical.
- Treat names and places strictly with PERSON_NAME_ERROR and PLACE_NAME_ERROR.
- Treat incomplete outputs as OMISSION_OR_TRUNCATION, not just high WER.
- Validate your judge with a human-labeled subset before scaling.
- The strongest analysis will be the disagreement cases: low WER + critical semantic error and high WER + semantic equivalent.
References
- Pipecat STT Benchmark: GitHub - pipecat-ai/stt-benchmark: Benchmarking STT service TTFB and semantic WER for real-time AI applications · GitHub
- Pipecat Semantic WER implementation: stt-benchmark/src/stt_benchmark/evaluation/semantic_wer.py at main · pipecat-ai/stt-benchmark · GitHub
- Sarvam ASR evaluation beyond WER: Indic ASR evaluation: beyond WER to LLM & semantic metrics | Sarvam AI
- Sarvam LLM-WER: GitHub - sarvamai/llm_wer · GitHub
- Sarvam intent/entity evaluation: GitHub - sarvamai/llm_intent_entity: LLM-Eval framework for evaluating performance of ASR models · GitHub
- WER is Unaware paper: [2511.16544] WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue
- WER is Unaware repo: GitHub - Ufonia/wer-is-unaware: A benchmark, alignment pipeline, and LLM-as-a-Judge for evaluating the clinical impact of ASR errors. · GitHub
- Generative LLMs for ASR evaluation: [2604.21928] Evaluation of Automatic Speech Recognition Using Generative Large Language Models
- JiWER: GitHub - jitsi/jiwer: Evaluate your speech-to-text system with similarity measures such as word error rate (WER) · GitHub
- JiWER transforms: transforms - jiwer
- Hugging Face Audio Course, ASR evaluation: Evaluation metrics for ASR · Hugging Face
- What is lost in Normalization?: [2409.02449] What is lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations
- Whisper English normalizer: whisper/whisper/normalizers/english.py at main · openai/whisper · GitHub
- Whisper Basic normalizer: whisper/whisper/normalizers/basic.py at main · openai/whisper · GitHub
- DeepEval LLM-as-a-Judge guide: LLM-as-a-Judge Evaluation with DeepEval | DeepEval by Confident AI - The LLM Evaluation Framework
- DeepEval G-Eval docs: G-Eval | DeepEval by Confident AI - The LLM Evaluation Framework
- DeepEval GitHub: GitHub - confident-ai/deepeval: The LLM Evaluation Framework · GitHub
- G-Eval paper: [2303.16634] G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
- Prometheus 2 paper: [2405.01535] Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
- Prometheus-Eval: GitHub - prometheus-eval/prometheus-eval: Evaluate your LLM's response with Prometheus and GPT4 💯 · GitHub
- LLM-as-a-Judge survey: [2411.15594] A Survey on LLM-as-a-Judge
- Position bias in LLM-as-a-Judge: [2406.07791] Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge
- Earnings-22 accented ASR benchmark: [2203.15591] Earnings-22: A Practical Benchmark for Accents in the Wild
- EdAcc accented English ASR corpus: [2303.18110] The Edinburgh International Accents of English Corpus: Towards the Democratization of English ASR
- EdAcc dataset card: edinburghcstr/edacc · Datasets at Hugging Face