
LLM as a Judge - Evaluate ASR

Hugging Face Forums [Unofficial] May 18, 2026

Umm. for now:


I would treat this as a layered ASR evaluation problem, not as “replace WER/CER with an LLM score”.

Classic ASR metrics and LLM-as-a-Judge answer different questions:

  • WER/CER/MER/WIL/WIP ask: How different is the predicted transcript from the reference at the word/character level?
  • Semantic / intent / entity evaluation asks: Would a human or downstream system still understand the same thing?
  • Operational evaluation asks: Were the important things preserved: numbers, names, places, dates, times, destinations, commands, negations, and complete utterances?

That distinction is important in your case because you have 15 models over 17,900+ audio/transcript pairs, and your errors are not all equally meaningful.

Examples:

| Reference | Model output | Raw WER view | Semantic view |
| --- | --- | --- | --- |
| I am thirty years old | I am 30 years old | error | harmless formatting difference |
| I am thirty years old | I am 13 years old | error | serious number-value error |
| Book a train to Birmingham | Book a train to Burnley | error | critical place-entity error |
| Call Sarah tomorrow | Call Zara tomorrow | error | person-entity error |
| Here's what I need you to do next... | Here's | deletion-heavy | truncation / incomplete-output failure |
| Please cancel the booking | Please confirm the booking | one-word substitution | critical intent reversal |

So I would not frame the work as:

LLM-as-a-Judge vs WER

I would frame it as:

Raw WER/CER + normalized WER/CER + semantic severity + intent preservation + entity preservation + truncation/hallucination diagnostics

This gives you a much stronger benchmark and much better EDA.


1. Useful background and similar work

A few directly relevant examples:

  • Pipecat STT Benchmark / Semantic WER: uses Semantic WER for STT benchmarking, where only transcription errors that affect how an LLM agent understands or responds are counted. The implementation prompt is especially useful: semantic_wer.py.

  • Sarvam ASR evaluation beyond WER: useful layered framing of classic WER/CER plus LLM-WER, LLM-CER, Intent Score, and Entity Preservation Score. The entity-preservation idea is very relevant to your names, places, dates, times, and numbers.

  • Sarvam LLM-WER repo and LLM intent/entity repo: good practical references for separating literal transcript similarity from meaning and entity preservation.

  • WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding (paper and repo): strong framing. WER measures textual fidelity, but downstream impact may be different. Their domain is clinical dialogue; your domain would be semantic/intent/entity impact over accented speech.

  • Evaluation of Automatic Speech Recognition Using Generative Large Language Models: directly relevant to using LLMs for ASR evaluation (hypothesis selection, semantic distance, and qualitative error classification). Especially useful for pairwise validation subsets.

  • JiWER and JiWER transforms: good practical tooling for WER, CER, MER, WIL, WIP, plus explicit normalization pipelines.

  • Hugging Face Audio Course, ASR evaluation: clear explanation of WER as the de facto ASR metric, based on word-level substitutions, insertions, and deletions.

  • What is lost in Normalization?: important warning that normalization can reduce harmless formatting penalties, but can also hide meaningful errors.

  • A Survey on LLM-as-a-Judge: useful for the reliability, consistency, calibration, and bias discussion.

  • Judging the Judges: Position Bias in LLM-as-a-Judge: important if you use pairwise judging. Always randomize or reverse A/B order on a subset.

  • Prometheus 2 and Prometheus-Eval: relevant if you want an open evaluator model. Still needs ASR-specific validation.


2. Recommended evaluation design

I would use four layers.

Layer 1: raw classic metrics

Compute these on the original reference and model output:

  • wer_raw
  • cer_raw
  • mer
  • wil
  • wip

These remain useful because they show literal transcription fidelity and allow comparison with standard ASR work.

Example with JiWER:

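import jiwer

# Example inputs; in practice these come from each audio_id x model_name row.
reference_raw = "I am thirty years old"
hypothesis_raw = "I am 30 years old"

# All five metrics are computed on the untouched text (no normalization).
wer_raw = jiwer.wer(reference_raw, hypothesis_raw)
cer_raw = jiwer.cer(reference_raw, hypothesis_raw)
mer = jiwer.mer(reference_raw, hypothesis_raw)
wil = jiwer.wil(reference_raw, hypothesis_raw)
wip = jiwer.wip(reference_raw, hypothesis_raw)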

Layer 2: normalized classic metrics

Also compute:

  • wer_normalized
  • cer_normalized

Normalization should reduce harmless differences such as:

Thirty -> 30
forty five -> 45
twenty thirteen -> 2013, when context-safe
Hello, John! -> hello john

But normalization can hide important failures, so keep both raw and normalized versions:

reference_raw
hypothesis_raw
reference_normalized
hypothesis_normalized
normalization_version

Do not overwrite the raw text.
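A minimal sketch of a versioned normalizer that keeps the raw text next to the normalized copy. The function names and the "norm-v1" tag are made up for illustration, and the rules (lowercase, drop punctuation, collapse spaces) are only an assumed starting point; in practice you might build this from JiWER transforms or the Whisper normalizers instead.

```python
import re
import string

NORMALIZATION_VERSION = "norm-v1"  # hypothetical version tag

# Delete ASCII punctuation except apostrophes, so "here's" survives.
_PUNCT = str.maketrans("", "", string.punctuation.replace("'", ""))


def normalize_text(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = text.lower().translate(_PUNCT)
    return re.sub(r"\s+", " ", text).strip()


def make_row(reference_raw, hypothesis_raw):
    """Keep the raw text and store normalized copies beside it."""
    return {
        "reference_raw": reference_raw,
        "hypothesis_raw": hypothesis_raw,
        "reference_normalized": normalize_text(reference_raw),
        "hypothesis_normalized": normalize_text(hypothesis_raw),
        "normalization_version": NORMALIZATION_VERSION,
    }


row = make_row("Hello, John!", "hello john")
```

Note that this sketch does not convert number words, and blanket punctuation removal would turn 9:13 into 913, so times and numbers need their own handling. That is exactly the kind of side effect the raw columns let you audit later.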

Layer 3: deterministic diagnostics

Before calling an SLM judge, compute cheap rule-based features:

reference_word_count
hypothesis_word_count
length_ratio
possible_truncation
possible_hallucination
reference_numbers
hypothesis_numbers
reference_numbers_canonical
hypothesis_numbers_canonical
number_format_only
number_value_mismatch
reference_entities
hypothesis_entities
possible_entity_mismatch

This reduces cost and improves consistency. Many number-format cases do not need an LLM judge.

Layer 4: SLM semantic judge

Use the SLM for the cases where meaning, intent, entity preservation, truncation, or hallucination needs judgment.

The judge should output labels, not a direct numeric score.


3. Do not ask the SLM for a free numeric score

I would avoid this:

{
  "score": 8.2
}

or:

{
  "semantic_similarity": 0.87
}

Small judge models often have unstable numeric calibration. A score of 7/10 can change with prompt wording, few-shot examples, model version, or decoding settings.

Instead, ask for structured labels:

{
  "severity": "SEMANTIC_EQUIVALENT",
  "error_types": ["NUMBER_FORMAT"],
  "meaning_preserved": true,
  "main_intent_preserved": true,
  "entities_preserved": true,
  "truncated": false,
  "hallucinated": false,
  "judge_uncertain": false,
  "short_reason": "The number is written in digits but has the same value."
}

Then derive scores offline.

This is better for:

  • heatmaps
  • Spearman correlation
  • confusion matrices
  • debugging
  • reproducibility
  • prompt iteration
  • model comparison

4. Suggested label schema

Severity labels

Use one severity label per example:

EXACT_MATCH
ORTHOGRAPHIC_ONLY
SEMANTIC_EQUIVALENT
MINOR_SEMANTIC_SHIFT
MAJOR_SEMANTIC_SHIFT
CRITICAL_MEANING_ERROR
UNCERTAIN

Definitions:

| Label | Meaning | Example |
| --- | --- | --- |
| EXACT_MATCH | Same transcript | hello there vs hello there |
| ORTHOGRAPHIC_ONLY | Only punctuation/casing/spacing/harmless spelling differs | Hello, John. vs hello john |
| SEMANTIC_EQUIVALENT | Surface form differs, meaning is the same | thirty vs 30 |
| MINOR_SEMANTIC_SHIFT | Small meaning change, probably not task-breaking | missing filler or minor modifier |
| MAJOR_SEMANTIC_SHIFT | Important content changed | wrong object, missing key phrase |
| CRITICAL_MEANING_ERROR | Downstream interpretation/action likely wrong | wrong number, wrong person, wrong place, cancel vs confirm |
| UNCERTAIN | Judge cannot decide confidently | ambiguous or context-dependent case |

Error-type labels

Use multi-label error types:

NUMBER_FORMAT
NUMBER_VALUE_ERROR
PERSON_NAME_ERROR
PLACE_NAME_ERROR
OTHER_ENTITY_ERROR
OMISSION_OR_TRUNCATION
HALLUCINATION_OR_INSERTION
WORD_SUBSTITUTION
DIALECT_OR_ACCENT_WORD
ORTHOGRAPHIC_OR_PUNCTUATION
NO_ERROR
UNCERTAIN

Why multi-label?

Because one transcript can have several problems.

Example:

Reference: Here's the address: 45 King Street in Birmingham.
Hypothesis: Here's the address.

Expected output:

{
  "severity": "CRITICAL_MEANING_ERROR",
  "error_types": [
    "OMISSION_OR_TRUNCATION",
    "NUMBER_VALUE_ERROR",
    "PLACE_NAME_ERROR"
  ],
  "meaning_preserved": false,
  "main_intent_preserved": false,
  "entities_preserved": false,
  "truncated": true,
  "hallucinated": false,
  "judge_uncertain": false,
  "short_reason": "The hypothesis stops early and omits the address number and place."
}
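Small judges will sometimes return malformed JSON or labels outside the schema, so it helps to validate every response before it reaches the analysis table. Below is a rough sketch of such a check using only the standard library; `parse_judge_output` is a hypothetical helper name, and the label sets are copied from the schema above.

```python
import json

SEVERITIES = {
    "EXACT_MATCH", "ORTHOGRAPHIC_ONLY", "SEMANTIC_EQUIVALENT",
    "MINOR_SEMANTIC_SHIFT", "MAJOR_SEMANTIC_SHIFT",
    "CRITICAL_MEANING_ERROR", "UNCERTAIN",
}
ERROR_TYPES = {
    "NUMBER_FORMAT", "NUMBER_VALUE_ERROR", "PERSON_NAME_ERROR",
    "PLACE_NAME_ERROR", "OTHER_ENTITY_ERROR", "OMISSION_OR_TRUNCATION",
    "HALLUCINATION_OR_INSERTION", "WORD_SUBSTITUTION",
    "DIALECT_OR_ACCENT_WORD", "ORTHOGRAPHIC_OR_PUNCTUATION",
    "NO_ERROR", "UNCERTAIN",
}
BOOL_FIELDS = (
    "meaning_preserved", "main_intent_preserved", "entities_preserved",
    "truncated", "hallucinated", "judge_uncertain",
)


def parse_judge_output(raw):
    """Return (record, None) if the output fits the schema, else (None, reason)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return None, "top-level value is not an object"
    severity = data.get("severity")
    if not isinstance(severity, str) or severity not in SEVERITIES:
        return None, "unknown severity"
    types = data.get("error_types")
    if (not isinstance(types, list) or not types
            or not all(isinstance(t, str) and t in ERROR_TYPES for t in types)):
        return None, "bad error_types"
    for field in BOOL_FIELDS:
        if not isinstance(data.get(field), bool):
            return None, f"{field} is not a boolean"
    return data, None


# A free numeric score is rejected because it carries no severity label.
record, error = parse_judge_output('{"score": 8.2}')
```

Rows where the reason is not None can be counted as parse errors and retried or routed to manual review rather than silently dropped.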

5. Convert labels into scores offline

A simple first version:

| Severity | Semantic penalty |
| --- | --- |
| EXACT_MATCH | 0.00 |
| ORTHOGRAPHIC_ONLY | 0.00 |
| SEMANTIC_EQUIVALENT | 0.00 |
| MINOR_SEMANTIC_SHIFT | 0.25 |
| MAJOR_SEMANTIC_SHIFT | 0.75 |
| CRITICAL_MEANING_ERROR | 1.00 |
| UNCERTAIN | separate bucket |

Then:

semantic_score = 1 - semantic_penalty

This gives:

| Severity | Semantic score |
| --- | --- |
| EXACT_MATCH | 1.00 |
| ORTHOGRAPHIC_ONLY | 1.00 |
| SEMANTIC_EQUIVALENT | 1.00 |
| MINOR_SEMANTIC_SHIFT | 0.75 |
| MAJOR_SEMANTIC_SHIFT | 0.25 |
| CRITICAL_MEANING_ERROR | 0.00 |

Start simple. Later, if needed, add an entity-aware penalty:

final_penalty =
    severity_penalty
    + 0.15 * number_value_error
    + 0.15 * person_name_error
    + 0.15 * place_name_error
    + 0.20 * truncated

Then clip to 1.0.

But I would not start with a complex formula. First validate the labels.
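For reference, the mapping and the optional entity-aware penalty could look roughly like this. The weights are the illustrative ones from above, not tuned values, and the function names are hypothetical. UNCERTAIN returns None so it stays in its own bucket instead of being averaged in.

```python
SEVERITY_PENALTY = {
    "EXACT_MATCH": 0.00,
    "ORTHOGRAPHIC_ONLY": 0.00,
    "SEMANTIC_EQUIVALENT": 0.00,
    "MINOR_SEMANTIC_SHIFT": 0.25,
    "MAJOR_SEMANTIC_SHIFT": 0.75,
    "CRITICAL_MEANING_ERROR": 1.00,
}
# Illustrative add-on weights taken from the formula above.
ENTITY_WEIGHTS = {
    "NUMBER_VALUE_ERROR": 0.15,
    "PERSON_NAME_ERROR": 0.15,
    "PLACE_NAME_ERROR": 0.15,
}
TRUNCATION_WEIGHT = 0.20


def semantic_score(severity):
    """1 - penalty, or None for UNCERTAIN (kept as a separate bucket)."""
    penalty = SEVERITY_PENALTY.get(severity)
    return None if penalty is None else 1.0 - penalty


def final_penalty(severity, error_types, truncated):
    """Severity penalty plus entity/truncation add-ons, clipped to 1.0."""
    base = SEVERITY_PENALTY.get(severity)
    if base is None:
        return None
    extra = sum(w for label, w in ENTITY_WEIGHTS.items() if label in error_types)
    if truncated:
        extra += TRUNCATION_WEIGHT
    return min(1.0, base + extra)


# A minor shift with a wrong person name: 0.25 + 0.15, about 0.40.
penalty = final_penalty("MINOR_SEMANTIC_SHIFT", ["PERSON_NAME_ERROR"], False)
```

Because the mapping lives in code rather than in the judge prompt, you can change the weights later and recompute scores without re-running the judge.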


6. Handling your common error types

6.1 Numbers as words vs digits

Separate:

NUMBER_FORMAT
NUMBER_VALUE_ERROR

Examples:

| Reference | Hypothesis | Label | Severity |
| --- | --- | --- | --- |
| thirty | 30 | NUMBER_FORMAT | SEMANTIC_EQUIVALENT |
| forty five | 45 | NUMBER_FORMAT | SEMANTIC_EQUIVALENT |
| twenty thirteen | 2013 | NUMBER_FORMAT | usually SEMANTIC_EQUIVALENT |
| nine thirteen | 9:13 | NUMBER_FORMAT, if same intended time | usually SEMANTIC_EQUIVALENT |
| thirty | 13 | NUMBER_VALUE_ERROR | CRITICAL_MEANING_ERROR |
| 9:13 | 9:30 | NUMBER_VALUE_ERROR | CRITICAL_MEANING_ERROR |

Principle:

Same value, different format = low or zero semantic penalty. Different value = high semantic penalty.

I would implement deterministic number canonicalization before the judge.

Store:

reference_numbers
hypothesis_numbers
reference_numbers_canonical
hypothesis_numbers_canonical
number_format_only
number_value_mismatch
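A deliberately small sketch of what the canonicalization step might look like. It only understands digits and English number words from 0 to 99, and the helper names are hypothetical. Years ("twenty thirteen"), spoken times ("nine thirteen"), and non-numeric uses of "one" are out of scope here and would need context-aware rules or a dedicated library.

```python
import re

UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}


def canonical_numbers(text):
    """Extract integers written as digits or as number words (0-99 only)."""
    tokens = re.findall(r"[a-z]+|\d+", text.lower())
    values = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.isdigit():
            values.append(int(tok))
        elif tok in TENS:
            value = TENS[tok]
            nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
            # "forty five" or "forty-five": merge tens with a following 1-9.
            if nxt in UNITS and 1 <= UNITS[nxt] <= 9:
                value += UNITS[nxt]
                i += 1
            values.append(value)
        elif tok in UNITS:
            values.append(UNITS[tok])
        i += 1
    return values


def compare_numbers(reference, hypothesis):
    """Flag format-only differences versus real value mismatches."""
    ref_can = canonical_numbers(reference)
    hyp_can = canonical_numbers(hypothesis)
    same_values = ref_can == hyp_can
    # The surface form differs if the digit strings in the two texts differ.
    surface_differs = re.findall(r"\d+", reference) != re.findall(r"\d+", hypothesis)
    return {
        "reference_numbers_canonical": ref_can,
        "hypothesis_numbers_canonical": hyp_can,
        "number_format_only": same_values and bool(ref_can) and surface_differs,
        "number_value_mismatch": not same_values,
    }


format_case = compare_numbers("I am thirty years old", "I am 30 years old")
value_case = compare_numbers("I am thirty years old", "I am 13 years old")
```

Here `format_case` is flagged as format-only and `value_case` as a value mismatch, which is the split between NUMBER_FORMAT and NUMBER_VALUE_ERROR that you want settled before the judge is involved.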

6.2 Human and place names

Names and places should be strict.

A wrong name or wrong place can be more important than several ordinary word substitutions.

Examples:

| Reference | Hypothesis | Suggested label |
| --- | --- | --- |
| Birmingham | burning them | PLACE_NAME_ERROR |
| Leeds | leads | context-dependent; often PLACE_NAME_ERROR |
| Sarah | Zara | PERSON_NAME_ERROR |
| John Smith | John's myth | PERSON_NAME_ERROR |
| Edinburgh | Edinburg | maybe spelling-only, depending on the task |
| Newcastle | new castle | context-dependent |

Create entity-specific fields:

person_name_error
place_name_error
other_entity_error
number_value_error
entities_preserved
entity_score

A simple entity score:

entity_score = preserved_key_entities / total_key_entities

Example:

Reference: Call Sarah in Birmingham at 9:13.
Hypothesis: Call Zara in Birmingham at 9:13.

Entity score:

Sarah: wrong
Birmingham: correct
9:13: correct

entity_score = 2 / 3 = 0.67

This matters because the rough intent can survive while the useful information fails.
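As a rough illustration, the entity score can be computed by checking each key reference entity against the hypothesis. This sketch assumes the key entities were already extracted (by annotation or an NER step, neither shown here) and uses strict whole-word, case-insensitive matching; `entity_score` as a function is a hypothetical helper.

```python
import re


def entity_score(reference_entities, hypothesis):
    """Fraction of key reference entities that appear verbatim in the hypothesis."""
    if not reference_entities:
        return None  # no key entities, so the score is undefined
    hyp = hypothesis.lower()
    preserved = 0
    for entity in reference_entities:
        # Whole-token match so that "Ann" does not match inside "cannot".
        pattern = r"(?<!\w)" + re.escape(entity.lower()) + r"(?!\w)"
        if re.search(pattern, hyp):
            preserved += 1
    return preserved / len(reference_entities)


score = entity_score(
    ["Sarah", "Birmingham", "9:13"],
    "Call Zara in Birmingham at 9:13.",
)
```

For the example above this gives 2 of 3 entities preserved. Strict string matching will count harmless spelling variants such as Edinburg as misses, so borderline cases are better left to the judge or to manual review.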

6.3 Incomplete transcriptions

For incomplete outputs, do not rely only on WER.

Use:

OMISSION_OR_TRUNCATION
truncated = true

Example:

Reference: Here's what I need you to do next. Please call the office before five.
Hypothesis: Here's

Expected:

{
  "severity": "CRITICAL_MEANING_ERROR",
  "error_types": ["OMISSION_OR_TRUNCATION"],
  "meaning_preserved": false,
  "main_intent_preserved": false,
  "entities_preserved": false,
  "truncated": true,
  "hallucinated": false,
  "judge_uncertain": false,
  "short_reason": "The hypothesis stops after the first word and omits the main content."
}

Possible causes:

| Cause | Explanation |
| --- | --- |
| Max output tokens too low | Audio LLM generation stops before full transcript |
| Stop sequence triggered | The model hits an unintended stop token |
| VAD/chunking issue | Input audio was cut or segmented incorrectly |
| Prompt ambiguity | The model summarizes or answers instead of transcribing |
| Long audio degradation | Later parts of the clip are lost |
| Decoding settings | Generation settings prefer short outputs |
| Model uncertainty | The model stops after becoming unsure |

Track:

reference_word_count
hypothesis_word_count
length_ratio
audio_duration_sec
starts_with_reference_prefix

Useful heuristic:

possible_truncation =
    length_ratio < 0.5
    and hypothesis matches the beginning of the reference
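The heuristic above could be sketched as follows. The 0.5 threshold comes from the text; the function name and the light punctuation stripping are assumptions, and the threshold should be tuned on your own data.

```python
import string


def _words(text):
    """Lowercase tokens with surrounding punctuation stripped."""
    stripped = (w.strip(string.punctuation) for w in text.lower().split())
    return [w for w in stripped if w]


def truncation_features(reference, hypothesis, ratio_threshold=0.5):
    """Length ratio plus a prefix check, as a cheap truncation signal."""
    ref_words = _words(reference)
    hyp_words = _words(hypothesis)
    length_ratio = len(hyp_words) / len(ref_words) if ref_words else 0.0
    # True when the hypothesis is non-empty and equals the start of the reference.
    starts_with_prefix = bool(hyp_words) and ref_words[:len(hyp_words)] == hyp_words
    return {
        "reference_word_count": len(ref_words),
        "hypothesis_word_count": len(hyp_words),
        "length_ratio": length_ratio,
        "starts_with_reference_prefix": starts_with_prefix,
        "possible_truncation": length_ratio < ratio_threshold and starts_with_prefix,
    }


features = truncation_features(
    "Here's what I need you to do next. Please call the office before five.",
    "Here's",
)
```

An empty hypothesis is not flagged by this function because there is no prefix to compare; treat it as its own deterministic case.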

7. Efficient pipeline for 15 x 17,900 outputs

You likely have about:

15 * 17,900 = 268,500 model-output rows

So avoid unnecessary judge calls.

Stage 1: compute raw metrics

For every row:

wer_raw
cer_raw
mer
wil
wip

Stage 2: compute normalized metrics

For every row:

wer_normalized
cer_normalized

Stage 3: deterministic shortcuts

Skip or reduce judge calls where possible:

| Condition | Direct label/action |
| --- | --- |
| raw exact match | EXACT_MATCH |
| normalized exact match | ORTHOGRAPHIC_ONLY or SEMANTIC_EQUIVALENT |
| only number format differs | SEMANTIC_EQUIVALENT, NUMBER_FORMAT |
| empty hypothesis | CRITICAL_MEANING_ERROR, OMISSION_OR_TRUNCATION |
| very short prefix-only hypothesis | likely OMISSION_OR_TRUNCATION |
| hypothesis much longer than reference | possible HALLUCINATION_OR_INSERTION |
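A sketch of how the first few shortcuts might be wired up so that only unresolved rows reach the judge. The function name is hypothetical, and it assumes the normalizer only touches casing, punctuation, and spacing, which is why a normalized match is labeled ORTHOGRAPHIC_ONLY here; if your normalizer also rewrites numbers, SEMANTIC_EQUIVALENT may be the better label.

```python
def deterministic_shortcut(ref_raw, hyp_raw, ref_norm, hyp_norm):
    """Return a label dict for trivial cases, or None to send the row to the judge."""
    if ref_raw == hyp_raw:
        return {"severity": "EXACT_MATCH", "error_types": ["NO_ERROR"]}
    if not hyp_raw.strip():
        return {
            "severity": "CRITICAL_MEANING_ERROR",
            "error_types": ["OMISSION_OR_TRUNCATION"],
        }
    if ref_norm == hyp_norm:
        # Assumption: the normalizer changes only case/punctuation/spacing.
        return {
            "severity": "ORTHOGRAPHIC_ONLY",
            "error_types": ["ORTHOGRAPHIC_OR_PUNCTUATION"],
        }
    return None  # non-trivial: needs the SLM judge


rows = [
    ("Hello, John!", "Hello, John!", "hello john", "hello john"),
    ("Hello, John!", "hello john", "hello john", "hello john"),
    ("Call Sarah tomorrow", "", "call sarah tomorrow", ""),
    ("Call Sarah tomorrow", "Call Zara tomorrow",
     "call sarah tomorrow", "call zara tomorrow"),
]
needs_judge = [r for r in rows if deterministic_shortcut(*r) is None]
```

In this toy batch only the Sarah/Zara row is left for the judge. On the real dataset, logging how many rows each shortcut resolves gives a quick estimate of the judge budget.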

Stage 4: SLM judge for non-trivial cases

Use the SLM for:

possible semantic shift
possible entity error
possible truncation
possible hallucination
low WER but possible entity/number mismatch
high WER but maybe same meaning
accent/dialect-word cases

Stage 5: stronger judge or manual review for high-risk cases

Route these to a stronger judge or manual review:

judge_uncertain = true
CRITICAL_MEANING_ERROR
low WER + critical error
high WER + semantic equivalent
entity errors
number value errors
truncations

8. Suggested SLM judge prompt

Use a strict JSON-only prompt.

You are evaluating an automatic speech recognition transcript.

Compare the reference transcript and the model transcript.

Evaluate fidelity to the reference, not fluency. Do not reward the model transcript for being more grammatical, more complete-sounding, or more fluent than the reference.

Return only valid JSON. Do not include markdown or text outside the JSON.

Reference transcript:
{reference}

Model transcript:
{hypothesis}

Return this JSON:
{
  "severity": one of [
    "EXACT_MATCH",
    "ORTHOGRAPHIC_ONLY",
    "SEMANTIC_EQUIVALENT",
    "MINOR_SEMANTIC_SHIFT",
    "MAJOR_SEMANTIC_SHIFT",
    "CRITICAL_MEANING_ERROR",
    "UNCERTAIN"
  ],
  "error_types": list of labels from [
    "NUMBER_FORMAT",
    "NUMBER_VALUE_ERROR",
    "PERSON_NAME_ERROR",
    "PLACE_NAME_ERROR",
    "OTHER_ENTITY_ERROR",
    "OMISSION_OR_TRUNCATION",
    "HALLUCINATION_OR_INSERTION",
    "WORD_SUBSTITUTION",
    "DIALECT_OR_ACCENT_WORD",
    "ORTHOGRAPHIC_OR_PUNCTUATION",
    "NO_ERROR",
    "UNCERTAIN"
  ],
  "meaning_preserved": true or false,
  "main_intent_preserved": true or false,
  "entities_preserved": true or false,
  "truncated": true or false,
  "hallucinated": true or false,
  "judge_uncertain": true or false,
  "short_reason": "one short sentence"
}

Rules:
- Punctuation, casing, spacing, and harmless formatting differences are not semantic errors.
- Digit-vs-word differences are not semantic errors if the numeric value is identical.
- Wrong numeric values, dates, times, prices, ages, addresses, quantities, or phone numbers are important errors.
- Treat person names and place names strictly.
- If the wrong person, place, station, city, region, organization, date, time, or number would change interpretation, mark an entity or number error.
- If the model transcript stops early or only contains the beginning of the reference, mark OMISSION_OR_TRUNCATION and set truncated=true.
- If the model transcript adds information not present in the reference, mark HALLUCINATION_OR_INSERTION and set hallucinated=true.
- If the main action, request, destination, object, number, entity, or intent changes, use MAJOR_SEMANTIC_SHIFT or CRITICAL_MEANING_ERROR.
- If unsure, use UNCERTAIN and set judge_uncertain=true.

For SLMs, keep the prompt stable and short. Add only a few examples.


9. Few-shot examples

Example 1: number format only

Reference:
I am thirty years old.

Model transcript:
I am 30 years old.

Expected JSON:
{
  "severity": "SEMANTIC_EQUIVALENT",
  "error_types": ["NUMBER_FORMAT"],
  "meaning_preserved": true,
  "main_intent_preserved": true,
  "entities_preserved": true,
  "truncated": false,
  "hallucinated": false,
  "judge_uncertain": false,
  "short_reason": "The numeric value is the same but formatted differently."
}

Example 2: number value error

Reference:
The appointment is at nine thirteen.

Model transcript:
The appointment is at 9:30.

Expected JSON:
{
  "severity": "CRITICAL_MEANING_ERROR",
  "error_types": ["NUMBER_VALUE_ERROR"],
  "meaning_preserved": false,
  "main_intent_preserved": false,
  "entities_preserved": false,
  "truncated": false,
  "hallucinated": false,
  "judge_uncertain": false,
  "short_reason": "The appointment time is wrong."
}

Example 3: place-name error

Reference:
I need a ticket to Birmingham.

Model transcript:
I need a ticket to Burnley.

Expected JSON:
{
  "severity": "CRITICAL_MEANING_ERROR",
  "error_types": ["PLACE_NAME_ERROR"],
  "meaning_preserved": false,
  "main_intent_preserved": true,
  "entities_preserved": false,
  "truncated": false,
  "hallucinated": false,
  "judge_uncertain": false,
  "short_reason": "The destination place is different."
}

Example 4: truncation

Reference:
Here's what I need you to do next. Please call the office before five.

Model transcript:
Here's

Expected JSON:
{
  "severity": "CRITICAL_MEANING_ERROR",
  "error_types": ["OMISSION_OR_TRUNCATION"],
  "meaning_preserved": false,
  "main_intent_preserved": false,
  "entities_preserved": false,
  "truncated": true,
  "hallucinated": false,
  "judge_uncertain": false,
  "short_reason": "The transcript stops after the first word and omits the main content."
}

10. Validation plan

Do not run the judge over the full dataset without validation.

Create a human-labeled subset.

Minimum:

300 examples

Better:

500 to 1,000 examples

Use stratified sampling. Include:

low WER
high WER
low WER + possible entity error
low WER + possible number error
high WER + likely same meaning
truncation cases
hallucination/insertion cases
all accent groups
all model families
short audio
long audio
different speakers
different genders

Measure:

accuracy
macro-F1
per-label F1
Cohen's kappa
confusion matrix
judge_uncertain_rate
parse_error_rate
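For the agreement metrics, a dependency-free sketch is shown below. In practice a library such as scikit-learn would usually be the more convenient choice; this version only spells out what is being computed. The two label lists are made-up toy data, not results.

```python
from collections import Counter


def cohens_kappa(human, judge):
    """Chance-corrected agreement between two equally long label lists."""
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def macro_f1(human, judge):
    """Unweighted mean of per-label F1, treating human labels as ground truth."""
    labels = sorted(set(human) | set(judge))
    scores = []
    for label in labels:
        tp = sum(h == label and j == label for h, j in zip(human, judge))
        fp = sum(h != label and j == label for h, j in zip(human, judge))
        fn = sum(h == label and j != label for h, j in zip(human, judge))
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


# Toy data: the judge mistakes one format-only case for a value error.
human_labels = ["NUMBER_FORMAT", "NUMBER_FORMAT", "NUMBER_VALUE_ERROR", "NUMBER_VALUE_ERROR"]
judge_labels = ["NUMBER_FORMAT", "NUMBER_VALUE_ERROR", "NUMBER_VALUE_ERROR", "NUMBER_VALUE_ERROR"]

kappa = cohens_kappa(human_labels, judge_labels)
f1 = macro_f1(human_labels, judge_labels)
```

Macro-F1 weights rare labels such as PERSON_NAME_ERROR equally with common ones, which is why it is worth reporting next to plain accuracy.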

Important confusions to inspect:

| Confusion | Why it matters |
| --- | --- |
| NUMBER_FORMAT vs NUMBER_VALUE_ERROR | harmless vs critical |
| ORTHOGRAPHIC_ONLY vs PLACE_NAME_ERROR | spelling vs wrong place |
| SEMANTIC_EQUIVALENT vs MINOR_SEMANTIC_SHIFT | score calibration |
| MAJOR_SEMANTIC_SHIFT vs CRITICAL_MEANING_ERROR | severity calibration |
| WORD_SUBSTITUTION vs PERSON_NAME_ERROR | entity strictness |
| OMISSION_OR_TRUNCATION vs ordinary deletion | audio-LLM failure-mode detection |

Freeze these before the final run:

judge_model
judge_model_revision
judge_prompt_version
temperature
max_tokens
output_parser_version
normalization_version

11. Pairwise judging

Pairwise judging can be useful, but I would not use it for the whole dataset.

With 15 models:

15 choose 2 = 105 pairs per audio

For 17,900 audios:

17,900 * 105 = 1,879,500 pairwise judgments

That is probably too expensive.

Use pairwise judging for:

validation subset
top 3 to 5 models
low-WER/high-severity outliers
high-WER/semantic-equivalent outliers
cases where model rankings are unclear

If you use pairwise judging, reverse A/B order on a subset because LLM judges can show position bias.

Store:

pairwise_winner
pairwise_reversed_winner
position_stable

If the winner changes after reversing A/B order, mark the comparison unstable.
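A small sketch of the stability check, assuming the judge answers with a positional verdict of "A", "B", or "TIE" (that output format is an assumption, as are the function and model names). Each verdict is mapped back to a model name before the two passes are compared.

```python
def resolve_winner(verdict, first, second):
    """Map a positional verdict ('A', 'B', or 'TIE') to a model name."""
    return {"A": first, "B": second, "TIE": "TIE"}[verdict]


def check_position_stability(model_x, model_y, verdict_original, verdict_reversed):
    """Compare the winner with models shown as (x, y) and then as (y, x)."""
    winner = resolve_winner(verdict_original, model_x, model_y)
    reversed_winner = resolve_winner(verdict_reversed, model_y, model_x)
    return {
        "pairwise_winner": winner,
        "pairwise_reversed_winner": reversed_winner,
        "position_stable": winner == reversed_winner,
    }


# The judge picks slot A both times, so the winner flips with the order.
unstable = check_position_stability("model_x", "model_y", "A", "A")
# The judge follows model_x into slot B after the swap: stable.
stable = check_position_stability("model_x", "model_y", "A", "B")
```

The share of unstable comparisons is itself a useful number to report, since it bounds how much trust the pairwise rankings deserve.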


12. How DeepEval fits

DeepEval can be useful as infrastructure, especially if you want G-Eval-like custom criteria or decision-tree/DAG-style evaluation.

Useful links:

  • DeepEval LLM-as-a-Judge guide
  • DeepEval G-Eval docs
  • DeepEval GitHub

But I would not use a generic “correctness” or “semantic similarity” metric directly.

Your metric should be ASR-specific:

number format
number value
person names
place names
truncation
hallucination
intent preservation
entity preservation

So DeepEval is useful as an execution framework, not as the final rubric.


13. EDA and plots

Main heatmaps

| Heatmap | What it shows |
| --- | --- |
| model x severity | Which models produce more serious semantic errors |
| model x error_type | Model-specific weaknesses: numbers, names, places, truncation, hallucination |
| accent_group x severity | Whether certain accents cause more meaning degradation |
| accent_group x error_type | Accent-specific error patterns |
| model x truncation_rate | Which audio LLMs stop early |
| audio_duration_bucket x truncation_rate | Whether long audio causes incomplete output |
| model x entity_preservation_rate | Which models preserve names/places/numbers |
| WER bucket x severity | Where WER agrees or disagrees with semantic labels |
| normalized WER bucket x severity | Whether normalization improves semantic alignment |
| model x judge_uncertain_rate | Which model outputs are hardest to judge |
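Most of these heatmaps are row-normalized count tables. A minimal sketch of building one from row-level records follows, using plain dictionaries so it stays dependency-free; with pandas, a crosstab would do the same job. The rows and model names are toy data, and `rate_table` is a hypothetical helper.

```python
from collections import Counter, defaultdict


def rate_table(rows, row_key, col_key):
    """Row-normalized counts, e.g. the share of each severity per model."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[row_key]][row[col_key]] += 1
    return {
        group: {col: n / sum(cols.values()) for col, n in cols.items()}
        for group, cols in counts.items()
    }


# Toy rows; real rows would come from the audio_id x model_name table.
toy_rows = [
    {"model_name": "model_a", "severity": "EXACT_MATCH"},
    {"model_name": "model_a", "severity": "EXACT_MATCH"},
    {"model_name": "model_a", "severity": "EXACT_MATCH"},
    {"model_name": "model_a", "severity": "CRITICAL_MEANING_ERROR"},
    {"model_name": "model_b", "severity": "CRITICAL_MEANING_ERROR"},
    {"model_name": "model_b", "severity": "MINOR_SEMANTIC_SHIFT"},
]
model_by_severity = rate_table(toy_rows, "model_name", "severity")
```

Normalizing per row matters when models have different numbers of judged outputs, for example because some rows were resolved by shortcuts or failed parsing.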

Spearman correlations

Use Spearman because many variables are ordinal or non-normal.

Compute:

wer_raw vs semantic_penalty
wer_normalized vs semantic_penalty
cer_raw vs entity_score
cer_normalized vs entity_score
wer_raw vs intent_score
audio_duration_sec vs truncated
reference_word_count vs truncated
wer_raw vs entities_preserved

Do not report only one global correlation. Also compute by:

model
accent_group
gender
duration_bucket
reference_length_bucket
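To make the per-slice idea concrete, here is a dependency-free sketch of Spearman correlation (Pearson correlation on average ranks) computed per group. With SciPy available, `scipy.stats.spearmanr` is the usual choice; the helper names and the toy rows below are illustrative only.

```python
from collections import defaultdict


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rho, or None if either variable is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return None
    return cov / (vx * vy) ** 0.5


def spearman_by_group(rows, group_key, x_key, y_key):
    """Compute Spearman rho separately for each value of group_key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)
    return {
        g: spearman([r[x_key] for r in rs], [r[y_key] for r in rs])
        for g, rs in groups.items()
    }


toy = [
    {"model_name": "model_a", "wer_raw": 0.0, "semantic_penalty": 0.0},
    {"model_name": "model_a", "wer_raw": 0.1, "semantic_penalty": 0.0},
    {"model_name": "model_a", "wer_raw": 0.5, "semantic_penalty": 0.75},
    {"model_name": "model_a", "wer_raw": 0.9, "semantic_penalty": 1.0},
    {"model_name": "model_b", "wer_raw": 0.2, "semantic_penalty": 1.0},
    {"model_name": "model_b", "wer_raw": 0.8, "semantic_penalty": 0.0},
]
rho_by_model = spearman_by_group(toy, "model_name", "wer_raw", "semantic_penalty")
```

A slice where rho is low or negative, like the toy model_b here, is a pointer to where WER and the semantic labels disagree and where the outlier examples are worth reading.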

Most important outlier buckets

These are probably the most interesting examples:

| Bucket | Why important |
| --- | --- |
| low WER + critical semantic error | WER missed a dangerous error |
| high WER + semantic equivalent | WER over-penalized harmless differences |
| low WER + entity error | one key name/place/number broke the meaning |
| low normalized WER + critical error | normalization hid something important |
| high truncation rate for one model | audio LLM generation failure |
| high hallucination rate for one model | audio LLM over-generation |
| high UNCERTAIN rate | judge prompt/model not robust enough |

These disagreement cases will be more valuable than a simple leaderboard.


14. Suggested row-level schema

Use one row per:

audio_id x model_name

Suggested fields:

audio_id
speaker_id
accent_group
gender
audio_duration_sec

reference_raw
reference_normalized

model_name
model_revision
model_family
quantization_mode
prompt_version
decoding_params

hypothesis_raw
hypothesis_normalized

wer_raw
cer_raw
mer
wil
wip
wer_normalized
cer_normalized

reference_word_count
hypothesis_word_count
length_ratio
possible_truncation
possible_hallucination

reference_numbers
hypothesis_numbers
reference_numbers_canonical
hypothesis_numbers_canonical
number_format_only
number_value_mismatch

reference_entities
hypothesis_entities
missing_entities
changed_entities
extra_entities

judge_model
judge_model_revision
judge_prompt_version
judge_temperature
judge_parse_error

severity
error_types
meaning_preserved
main_intent_preserved
entities_preserved
truncated
hallucinated
judge_uncertain
short_reason

semantic_penalty
semantic_score
intent_score
entity_score

normalization_version
run_timestamp

This looks verbose, but it makes later EDA much easier.


15. What I would report

Table 1: classic ASR metrics

| Model | WER raw | CER raw | WER normalized | CER normalized | MER | WIL |

Table 2: semantic metrics

| Model | Semantic score | Intent preserved | Entity preserved | Critical error rate | Judge uncertain |

Table 3: error-type rates

| Model | Number format | Number value error | Person name error | Place name error | Truncation | Hallucination |

Table 4: accent slicing

| Accent group | WER norm | Semantic score | Entity preserved | Truncation rate | Critical error rate |

Table 5: metric-disagreement examples

| Pattern | Example type |
| --- | --- |
| low WER + critical error | wrong number/name/place |
| high WER + semantic equivalent | formatting/paraphrase |
| low normalized WER + critical error | normalization artifact |
| high truncation | audio LLM stopped early |

16. Practical final recommendation

I would implement this pipeline:

1. Save every raw model output.
2. Compute raw WER/CER/MER/WIL/WIP.
3. Build a versioned normalization pipeline.
4. Compute normalized WER/CER.
5. Add deterministic number, entity, length, truncation, and hallucination features.
6. Use SLM judge only for non-trivial semantic cases.
7. Make the SLM return JSON labels, not numeric scores.
8. Convert severity labels to numeric semantic penalties offline.
9. Validate the judge on 300 to 1,000 human-labeled examples.
10. Use pairwise judging only for validation subsets and metric-disagreement cases.
11. Plot model x severity, model x error type, accent x error type, WER bucket x severity, and duration x truncation.
12. Focus the discussion on where WER and semantic quality disagree.

17. Short summary

  • Keep WER/CER, but do not rely on them alone.
  • Add normalized WER/CER to reduce harmless formatting penalties.
  • Add SLM judge labels for semantic severity, intent, entities, truncation, and hallucination.
  • Do not ask the SLM for a direct numeric score.
  • Use labels first, then map them to scores offline.
  • Treat thirty vs 30 as NUMBER_FORMAT, not a semantic error.
  • Treat 30 vs 13 as NUMBER_VALUE_ERROR, usually critical.
  • Treat names and places strictly with PERSON_NAME_ERROR and PLACE_NAME_ERROR.
  • Treat incomplete outputs as OMISSION_OR_TRUNCATION, not just high WER.
  • Validate your judge with a human-labeled subset before scaling.
  • The strongest analysis will be the disagreement cases: low WER + critical semantic error and high WER + semantic equivalent.

References

  • Pipecat STT Benchmark: pipecat-ai/stt-benchmark (GitHub), benchmarking STT service TTFB and semantic WER for real-time AI applications
  • Pipecat Semantic WER implementation: stt-benchmark/src/stt_benchmark/evaluation/semantic_wer.py at main, pipecat-ai/stt-benchmark (GitHub)
  • Sarvam ASR evaluation beyond WER: Indic ASR evaluation: beyond WER to LLM & semantic metrics (Sarvam AI)
  • Sarvam LLM-WER: sarvamai/llm_wer (GitHub)
  • Sarvam intent/entity evaluation: sarvamai/llm_intent_entity (GitHub), LLM-Eval framework for evaluating performance of ASR models
  • WER is Unaware paper: [2511.16544] WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue
  • WER is Unaware repo: Ufonia/wer-is-unaware (GitHub), a benchmark, alignment pipeline, and LLM-as-a-Judge for evaluating the clinical impact of ASR errors
  • Generative LLMs for ASR evaluation: [2604.21928] Evaluation of Automatic Speech Recognition Using Generative Large Language Models
  • JiWER: jitsi/jiwer (GitHub), evaluate your speech-to-text system with similarity measures such as word error rate (WER)
  • JiWER transforms: transforms (jiwer documentation)
  • Hugging Face Audio Course, ASR evaluation: Evaluation metrics for ASR (Hugging Face)
  • What is lost in Normalization?: [2409.02449] What is lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations
  • Whisper English normalizer: whisper/whisper/normalizers/english.py at main, openai/whisper (GitHub)
  • Whisper Basic normalizer: whisper/whisper/normalizers/basic.py at main, openai/whisper (GitHub)
  • DeepEval LLM-as-a-Judge guide: LLM-as-a-Judge Evaluation with DeepEval (DeepEval by Confident AI)
  • DeepEval G-Eval docs: G-Eval (DeepEval by Confident AI)
  • DeepEval GitHub: confident-ai/deepeval (GitHub), the LLM Evaluation Framework
  • G-Eval paper: [2303.16634] G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
  • Prometheus 2 paper: [2405.01535] Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
  • Prometheus-Eval: prometheus-eval/prometheus-eval (GitHub), evaluate your LLM's response with Prometheus and GPT4
  • LLM-as-a-Judge survey: [2411.15594] A Survey on LLM-as-a-Judge
  • Position bias in LLM-as-a-Judge: [2406.07791] Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge
  • Earnings-22 accented ASR benchmark: [2203.15591] Earnings-22: A Practical Benchmark for Accents in the Wild
  • EdAcc accented English ASR corpus: [2303.18110] The Edinburgh International Accents of English Corpus: Towards the Democratization of English ASR
  • EdAcc dataset card: edinburghcstr/edacc (Datasets at Hugging Face)
