Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreieasvg72we2bbaclvieh32j2rqchelixlha7sjfqr6bw7h5tnzsdq",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mm5bsf5yatc2"
  },
  "path": "/t/llm-as-a-judge-evaluate-asr/176076#post_2",
  "publishedAt": "2026-05-18T15:46:30.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Pipecat STT Benchmark / Semantic WER",
    "semantic_wer.py",
    "Sarvam ASR evaluation beyond WER",
    "Sarvam LLM-WER repo",
    "LLM intent/entity repo",
    "WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding",
    "repo",
    "Evaluation of Automatic Speech Recognition Using Generative Large Language Models",
    "JiWER",
    "JiWER transforms",
    "Hugging Face Audio Course: ASR evaluation",
    "What is lost in Normalization?",
    "A Survey on LLM-as-a-Judge",
    "Judging the Judges: Position Bias in LLM-as-a-Judge",
    "Prometheus 2",
    "Prometheus-Eval",
    "DeepEval LLM-as-a-Judge guide",
    "DeepEval G-Eval docs",
    "DeepEval GitHub",
    "GitHub - pipecat-ai/stt-benchmark: Benchmarking STT service TTFB and semantic WER for real-time AI applications · GitHub",
    "stt-benchmark/src/stt_benchmark/evaluation/semantic_wer.py at main · pipecat-ai/stt-benchmark · GitHub",
    "Indic ASR evaluation: beyond WER to LLM & semantic metrics | Sarvam AI",
    "GitHub - sarvamai/llm_wer · GitHub",
    "GitHub - sarvamai/llm_intent_entity: LLM-Eval framework for evaluating performance of ASR models · GitHub",
    "[2511.16544] WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue",
    "GitHub - Ufonia/wer-is-unaware: A benchmark, alignment pipeline, and LLM-as-a-Judge for evaluating the clinical impact of ASR errors. · GitHub",
    "[2604.21928] Evaluation of Automatic Speech Recognition Using Generative Large Language Models",
    "GitHub - jitsi/jiwer: Evaluate your speech-to-text system with similarity measures such as word error rate (WER) · GitHub",
    "transforms - jiwer",
    "Evaluation metrics for ASR · Hugging Face",
    "[2409.02449] What is lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations",
    "whisper/whisper/normalizers/english.py at main · openai/whisper · GitHub",
    "whisper/whisper/normalizers/basic.py at main · openai/whisper · GitHub",
    "LLM-as-a-Judge Evaluation with DeepEval | DeepEval by Confident AI - The LLM Evaluation Framework",
    "G-Eval | DeepEval by Confident AI - The LLM Evaluation Framework",
    "GitHub - confident-ai/deepeval: The LLM Evaluation Framework · GitHub",
    "[2303.16634] G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment",
    "[2405.01535] Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models",
    "GitHub - prometheus-eval/prometheus-eval: Evaluate your LLM's response with Prometheus and GPT4 💯 · GitHub",
    "[2411.15594] A Survey on LLM-as-a-Judge",
    "[2406.07791] Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge",
    "[2203.15591] Earnings-22: A Practical Benchmark for Accents in the Wild",
    "[2303.18110] The Edinburgh International Accents of English Corpus: Towards the Democratization of English ASR",
    "edinburghcstr/edacc · Datasets at Hugging Face"
  ],
  "textContent": "Umm. for now:\n\n-–\n\nI would treat this as a **layered ASR evaluation problem** , not as “replace WER/CER with an LLM score”.\n\nClassic ASR metrics and LLM-as-a-Judge answer different questions:\n\n  * **WER/CER/MER/WIL/WIP** ask: _How different is the predicted transcript from the reference at the word/character level?_\n  * **Semantic / intent / entity evaluation** asks: _Would a human or downstream system still understand the same thing?_\n  * **Operational evaluation** asks: _Were the important things preserved: numbers, names, places, dates, times, destinations, commands, negations, and complete utterances?_\n\n\n\nThat distinction is important in your case because you have 15 models over 17,900+ audio/transcript pairs, and your errors are not all equally meaningful.\n\nExamples:\n\nReference | Model output | Raw WER view | Semantic view\n---|---|---|---\n`I am thirty years old` | `I am 30 years old` | error | harmless formatting difference\n`I am thirty years old` | `I am 13 years old` | error | serious number-value error\n`Book a train to Birmingham` | `Book a train to Burnley` | error | critical place-entity error\n`Call Sarah tomorrow` | `Call Zara tomorrow` | error | person-entity error\n`Here's what I need you to do next...` | `Here's` | deletion-heavy | truncation / incomplete-output failure\n`Please cancel the booking` | `Please confirm the booking` | one-word substitution | critical intent reversal\n\nSo I would not frame the work as:\n\n> LLM-as-a-Judge vs WER\n\nI would frame it as:\n\n> **Raw WER/CER + normalized WER/CER + semantic severity + intent preservation + entity preservation + truncation/hallucination diagnostics**\n\nThis gives you a much stronger benchmark and much better EDA.\n\n* * *\n\n## 1. 
Useful background and similar work\n\nA few directly relevant examples:\n\n  * Pipecat STT Benchmark / Semantic WER\nUses Semantic WER for STT benchmarking, where only transcription errors that affect how an LLM agent understands/responds are counted. The implementation prompt is especially useful: semantic_wer.py.\n\n  * Sarvam ASR evaluation beyond WER\nUseful layered framing: classic WER/CER plus LLM-WER, LLM-CER, Intent Score, and Entity Preservation Score. The entity-preservation idea is very relevant to your names, places, dates, times, and numbers.\n\n  * Sarvam LLM-WER repo and LLM intent/entity repo\nGood practical references for separating literal transcript similarity from meaning and entity preservation.\n\n  * WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding and repo\nStrong framing: WER measures textual fidelity, but downstream impact may be different. Their domain is clinical dialogue; your domain would be semantic/intent/entity impact over accented speech.\n\n  * Evaluation of Automatic Speech Recognition Using Generative Large Language Models\nDirectly relevant to using LLMs for ASR evaluation: hypothesis selection, semantic distance, and qualitative error classification. Useful especially for pairwise validation subsets.\n\n  * JiWER and JiWER transforms\nGood practical tooling for WER, CER, MER, WIL, WIP, plus explicit normalization pipelines.\n\n  * Hugging Face Audio Course: ASR evaluation\nClear explanation of WER as the de facto ASR metric, based on word-level substitutions, insertions, and deletions.\n\n  * What is lost in Normalization?\nImportant warning: normalization can reduce harmless formatting penalties, but can also hide meaningful errors.\n\n  * A Survey on LLM-as-a-Judge\nUseful for reliability, consistency, calibration, and bias discussion.\n\n  * Judging the Judges: Position Bias in LLM-as-a-Judge\nImportant if you use pairwise judging. 
Always randomize or reverse A/B order on a subset.\n\n  * Prometheus 2 and Prometheus-Eval\nRelevant if you want an open evaluator model. Still needs ASR-specific validation.\n\n\n\n\n* * *\n\n## 2. Recommended evaluation design\n\nI would use four layers.\n\n### Layer 1: raw classic metrics\n\nCompute these on the original reference and model output:\n\n  * `wer_raw`\n  * `cer_raw`\n  * `mer`\n  * `wil`\n  * `wip`\n\n\n\nThese remain useful because they show literal transcription fidelity and allow comparison with standard ASR work.\n\nExample with JiWER:\n\n\n    import jiwer\n\n    wer_raw = jiwer.wer(reference_raw, hypothesis_raw)\n    cer_raw = jiwer.cer(reference_raw, hypothesis_raw)\n    mer = jiwer.mer(reference_raw, hypothesis_raw)\n    wil = jiwer.wil(reference_raw, hypothesis_raw)\n    wip = jiwer.wip(reference_raw, hypothesis_raw)\n\n\n### Layer 2: normalized classic metrics\n\nAlso compute:\n\n  * `wer_normalized`\n  * `cer_normalized`\n\n\n\nNormalization should reduce harmless differences such as:\n\n\n    Thirty -> 30\n    forty five -> 45\n    twenty thirteen -> 2013, when context-safe\n    Hello, John! -> hello john\n\n\nBut normalization can hide important failures, so keep both raw and normalized versions:\n\n\n    reference_raw\n    hypothesis_raw\n    reference_normalized\n    hypothesis_normalized\n    normalization_version\n\n\nDo not overwrite the raw text.\n\n### Layer 3: deterministic diagnostics\n\nBefore calling an SLM judge, compute cheap rule-based features:\n\n\n    reference_word_count\n    hypothesis_word_count\n    length_ratio\n    possible_truncation\n    possible_hallucination\n    reference_numbers\n    hypothesis_numbers\n    reference_numbers_canonical\n    hypothesis_numbers_canonical\n    number_format_only\n    number_value_mismatch\n    reference_entities\n    hypothesis_entities\n    possible_entity_mismatch\n\n\nThis reduces cost and improves consistency. 
Many number-format cases do not need an LLM judge.\n\n### Layer 4: SLM semantic judge\n\nUse the SLM for the cases where meaning, intent, entity preservation, truncation, or hallucination needs judgment.\n\nThe judge should output **labels**, not a direct numeric score.\n\n* * *\n\n## 3. Do not ask the SLM for a free numeric score\n\nI would avoid this:\n\n\n    {\n      \"score\": 8.2\n    }\n\n\nor:\n\n\n    {\n      \"semantic_similarity\": 0.87\n    }\n\n\nSmall judge models often have unstable numeric calibration. A score of `7/10` can change with prompt wording, few-shot examples, model version, or decoding settings.\n\nInstead, ask for structured labels:\n\n\n    {\n      \"severity\": \"SEMANTIC_EQUIVALENT\",\n      \"error_types\": [\"NUMBER_FORMAT\"],\n      \"meaning_preserved\": true,\n      \"main_intent_preserved\": true,\n      \"entities_preserved\": true,\n      \"truncated\": false,\n      \"hallucinated\": false,\n      \"judge_uncertain\": false,\n      \"short_reason\": \"The number is written in digits but has the same value.\"\n    }\n\n\nThen derive scores offline.\n\nThis is better for:\n\n  * heatmaps\n  * Spearman correlation\n  * confusion matrices\n  * debugging\n  * reproducibility\n  * prompt iteration\n  * model comparison\n\n\n\n* * *\n\n## 4. 
Suggested label schema\n\n### Severity labels\n\nUse one severity label per example:\n\n\n    EXACT_MATCH\n    ORTHOGRAPHIC_ONLY\n    SEMANTIC_EQUIVALENT\n    MINOR_SEMANTIC_SHIFT\n    MAJOR_SEMANTIC_SHIFT\n    CRITICAL_MEANING_ERROR\n    UNCERTAIN\n\n\nDefinitions:\n\nLabel | Meaning | Example\n---|---|---\n`EXACT_MATCH` | Same transcript | `hello there` vs `hello there`\n`ORTHOGRAPHIC_ONLY` | Only punctuation/casing/spacing/harmless spelling differs | `Hello, John.` vs `hello john`\n`SEMANTIC_EQUIVALENT` | Surface form differs, meaning is the same | `thirty` vs `30`\n`MINOR_SEMANTIC_SHIFT` | Small meaning change, probably not task-breaking | missing filler or minor modifier\n`MAJOR_SEMANTIC_SHIFT` | Important content changed | wrong object, missing key phrase\n`CRITICAL_MEANING_ERROR` | Downstream interpretation/action likely wrong | wrong number, wrong person, wrong place, `cancel` vs `confirm`\n`UNCERTAIN` | Judge cannot decide confidently | ambiguous or context-dependent case\n\n### Error-type labels\n\nUse multi-label error types:\n\n\n    NUMBER_FORMAT\n    NUMBER_VALUE_ERROR\n    PERSON_NAME_ERROR\n    PLACE_NAME_ERROR\n    OTHER_ENTITY_ERROR\n    OMISSION_OR_TRUNCATION\n    HALLUCINATION_OR_INSERTION\n    WORD_SUBSTITUTION\n    DIALECT_OR_ACCENT_WORD\n    ORTHOGRAPHIC_OR_PUNCTUATION\n    NO_ERROR\n    UNCERTAIN\n\n\nWhy multi-label?\n\nBecause one transcript can have several problems.\n\nExample:\n\n\n    Reference: Here's the address: 45 King Street in Birmingham.\n    Hypothesis: Here's the address.\n\n\nExpected output:\n\n\n    {\n      \"severity\": \"CRITICAL_MEANING_ERROR\",\n      \"error_types\": [\n        \"OMISSION_OR_TRUNCATION\",\n        \"NUMBER_VALUE_ERROR\",\n        \"PLACE_NAME_ERROR\"\n      ],\n      \"meaning_preserved\": false,\n      \"main_intent_preserved\": false,\n      \"entities_preserved\": false,\n      \"truncated\": true,\n      \"hallucinated\": false,\n      \"judge_uncertain\": false,\n      \"short_reason\": \"The 
hypothesis stops early and omits the address number and place.\"\n    }\n\n\n* * *\n\n## 5. Convert labels into scores offline\n\nA simple first version:\n\nSeverity | Semantic penalty\n---|---\n`EXACT_MATCH` | `0.00`\n`ORTHOGRAPHIC_ONLY` | `0.00`\n`SEMANTIC_EQUIVALENT` | `0.00`\n`MINOR_SEMANTIC_SHIFT` | `0.25`\n`MAJOR_SEMANTIC_SHIFT` | `0.75`\n`CRITICAL_MEANING_ERROR` | `1.00`\n`UNCERTAIN` | separate bucket\n\nThen:\n\n\n    semantic_score = 1 - semantic_penalty\n\n\nThis gives:\n\nSeverity | Semantic score\n---|---\n`EXACT_MATCH` | `1.00`\n`ORTHOGRAPHIC_ONLY` | `1.00`\n`SEMANTIC_EQUIVALENT` | `1.00`\n`MINOR_SEMANTIC_SHIFT` | `0.75`\n`MAJOR_SEMANTIC_SHIFT` | `0.25`\n`CRITICAL_MEANING_ERROR` | `0.00`\n\nStart simple. Later, if needed, add an entity-aware penalty:\n\n\n    final_penalty =\n        severity_penalty\n        + 0.15 * number_value_error\n        + 0.15 * person_name_error\n        + 0.15 * place_name_error\n        + 0.20 * truncated\n\n\nThen clip to `1.0`.\n\nBut I would not start with a complex formula. First validate the labels.\n\n* * *\n\n## 6. 
Handling your common error types\n\n### 6.1 Numbers as words vs digits\n\nSeparate:\n\n\n    NUMBER_FORMAT\n    NUMBER_VALUE_ERROR\n\n\nExamples:\n\nReference | Hypothesis | Label | Severity\n---|---|---|---\n`thirty` | `30` | `NUMBER_FORMAT` | `SEMANTIC_EQUIVALENT`\n`forty five` | `45` | `NUMBER_FORMAT` | `SEMANTIC_EQUIVALENT`\n`twenty thirteen` | `2013` | `NUMBER_FORMAT` | usually `SEMANTIC_EQUIVALENT`\n`nine thirteen` | `9:13` | `NUMBER_FORMAT`, if same intended time | usually `SEMANTIC_EQUIVALENT`\n`thirty` | `13` | `NUMBER_VALUE_ERROR` | `CRITICAL_MEANING_ERROR`\n`9:13` | `9:30` | `NUMBER_VALUE_ERROR` | `CRITICAL_MEANING_ERROR`\n\nPrinciple:\n\n> Same value, different format = low or zero semantic penalty.\n>  Different value = high semantic penalty.\n\nI would implement deterministic number canonicalization before the judge.\n\nStore:\n\n\n    reference_numbers\n    hypothesis_numbers\n    reference_numbers_canonical\n    hypothesis_numbers_canonical\n    number_format_only\n    number_value_mismatch\n\n\n### 6.2 Human and place names\n\nNames and places should be strict.\n\nA wrong name or wrong place can be more important than several ordinary word substitutions.\n\nExamples:\n\nReference | Hypothesis | Suggested label\n---|---|---\n`Birmingham` | `burning them` | `PLACE_NAME_ERROR`\n`Leeds` | `leads` | context-dependent; often `PLACE_NAME_ERROR`\n`Sarah` | `Zara` | `PERSON_NAME_ERROR`\n`John Smith` | `John's myth` | `PERSON_NAME_ERROR`\n`Edinburgh` | `Edinburg` | maybe spelling-only, depending on the task\n`Newcastle` | `new castle` | context-dependent\n\nCreate entity-specific fields:\n\n\n    person_name_error\n    place_name_error\n    other_entity_error\n    number_value_error\n    entities_preserved\n    entity_score\n\n\nA simple entity score:\n\n\n    entity_score = preserved_key_entities / total_key_entities\n\n\nExample:\n\n\n    Reference: Call Sarah in Birmingham at 9:13.\n    Hypothesis: Call Zara in Birmingham at 9:13.\n\n\nEntity score:\n\n\n    
Sarah: wrong\n    Birmingham: correct\n    9:13: correct\n\n    entity_score = 2 / 3 = 0.67\n\n\nThis matters because the rough intent can survive while the useful information fails.\n\n### 6.3 Incomplete transcriptions\n\nFor incomplete outputs, do not rely only on WER.\n\nUse:\n\n\n    OMISSION_OR_TRUNCATION\n    truncated = true\n\n\nExample:\n\n\n    Reference: Here's what I need you to do next. Please call the office before five.\n    Hypothesis: Here's\n\n\nExpected:\n\n\n    {\n      \"severity\": \"CRITICAL_MEANING_ERROR\",\n      \"error_types\": [\"OMISSION_OR_TRUNCATION\"],\n      \"meaning_preserved\": false,\n      \"main_intent_preserved\": false,\n      \"entities_preserved\": false,\n      \"truncated\": true,\n      \"hallucinated\": false,\n      \"judge_uncertain\": false,\n      \"short_reason\": \"The hypothesis stops after the first word and omits the main content.\"\n    }\n\n\nPossible causes:\n\nCause | Explanation\n---|---\nMax output tokens too low | Audio LLM generation stops before full transcript\nStop sequence triggered | The model hits an unintended stop token\nVAD/chunking issue | Input audio was cut or segmented incorrectly\nPrompt ambiguity | The model summarizes or answers instead of transcribing\nLong audio degradation | Later parts of the clip are lost\nDecoding settings | Generation settings prefer short outputs\nModel uncertainty | The model stops after becoming unsure\n\nTrack:\n\n\n    reference_word_count\n    hypothesis_word_count\n    length_ratio\n    audio_duration_sec\n    starts_with_reference_prefix\n\n\nUseful heuristic:\n\n\n    possible_truncation =\n        length_ratio < 0.5\n        and hypothesis matches the beginning of the reference\n\n\n* * *\n\n## 7. 
Efficient pipeline for 15 x 17,900 outputs\n\nYou likely have about:\n\n\n    15 * 17,900 = 268,500 model-output rows\n\n\nSo avoid unnecessary judge calls.\n\n### Stage 1: compute raw metrics\n\nFor every row:\n\n\n    wer_raw\n    cer_raw\n    mer\n    wil\n    wip\n\n\n### Stage 2: compute normalized metrics\n\nFor every row:\n\n\n    wer_normalized\n    cer_normalized\n\n\n### Stage 3: deterministic shortcuts\n\nSkip or reduce judge calls where possible:\n\nCondition | Direct label/action\n---|---\nraw exact match | `EXACT_MATCH`\nnormalized exact match | `ORTHOGRAPHIC_ONLY` or `SEMANTIC_EQUIVALENT`\nonly number format differs | `SEMANTIC_EQUIVALENT`, `NUMBER_FORMAT`\nempty hypothesis | `CRITICAL_MEANING_ERROR`, `OMISSION_OR_TRUNCATION`\nvery short prefix-only hypothesis | likely `OMISSION_OR_TRUNCATION`\nhypothesis much longer than reference | possible `HALLUCINATION_OR_INSERTION`\n\n### Stage 4: SLM judge for non-trivial cases\n\nUse the SLM for:\n\n\n    possible semantic shift\n    possible entity error\n    possible truncation\n    possible hallucination\n    low WER but possible entity/number mismatch\n    high WER but maybe same meaning\n    accent/dialect-word cases\n\n\n### Stage 5: stronger judge or manual review for high-risk cases\n\nRoute these to a stronger judge or manual review:\n\n\n    judge_uncertain = true\n    CRITICAL_MEANING_ERROR\n    low WER + critical error\n    high WER + semantic equivalent\n    entity errors\n    number value errors\n    truncations\n\n\n* * *\n\n## 8. Suggested SLM judge prompt\n\nUse a strict JSON-only prompt.\n\n\n    You are evaluating an automatic speech recognition transcript.\n\n    Compare the reference transcript and the model transcript.\n\n    Evaluate fidelity to the reference, not fluency. Do not reward the model transcript for being more grammatical, more complete-sounding, or more fluent than the reference.\n\n    Return only valid JSON. 
Do not include markdown or text outside the JSON.\n\n    Reference transcript:\n    {reference}\n\n    Model transcript:\n    {hypothesis}\n\n    Return this JSON:\n    {\n      \"severity\": one of [\n        \"EXACT_MATCH\",\n        \"ORTHOGRAPHIC_ONLY\",\n        \"SEMANTIC_EQUIVALENT\",\n        \"MINOR_SEMANTIC_SHIFT\",\n        \"MAJOR_SEMANTIC_SHIFT\",\n        \"CRITICAL_MEANING_ERROR\",\n        \"UNCERTAIN\"\n      ],\n      \"error_types\": list of labels from [\n        \"NUMBER_FORMAT\",\n        \"NUMBER_VALUE_ERROR\",\n        \"PERSON_NAME_ERROR\",\n        \"PLACE_NAME_ERROR\",\n        \"OTHER_ENTITY_ERROR\",\n        \"OMISSION_OR_TRUNCATION\",\n        \"HALLUCINATION_OR_INSERTION\",\n        \"WORD_SUBSTITUTION\",\n        \"DIALECT_OR_ACCENT_WORD\",\n        \"ORTHOGRAPHIC_OR_PUNCTUATION\",\n        \"NO_ERROR\",\n        \"UNCERTAIN\"\n      ],\n      \"meaning_preserved\": true or false,\n      \"main_intent_preserved\": true or false,\n      \"entities_preserved\": true or false,\n      \"truncated\": true or false,\n      \"hallucinated\": true or false,\n      \"judge_uncertain\": true or false,\n      \"short_reason\": \"one short sentence\"\n    }\n\n    Rules:\n    - Punctuation, casing, spacing, and harmless formatting differences are not semantic errors.\n    - Digit-vs-word differences are not semantic errors if the numeric value is identical.\n    - Wrong numeric values, dates, times, prices, ages, addresses, quantities, or phone numbers are important errors.\n    - Treat person names and place names strictly.\n    - If the wrong person, place, station, city, region, organization, date, time, or number would change interpretation, mark an entity or number error.\n    - If the model transcript stops early or only contains the beginning of the reference, mark OMISSION_OR_TRUNCATION and set truncated=true.\n    - If the model transcript adds information not present in the reference, mark HALLUCINATION_OR_INSERTION and set 
hallucinated=true.\n    - If the main action, request, destination, object, number, entity, or intent changes, use MAJOR_SEMANTIC_SHIFT or CRITICAL_MEANING_ERROR.\n    - If unsure, use UNCERTAIN and set judge_uncertain=true.\n\n\nFor SLMs, keep the prompt stable and short. Add only a few examples.\n\n* * *\n\n## 9. Few-shot examples\n\n### Example 1: number format only\n\n\n    Reference:\n    I am thirty years old.\n\n    Model transcript:\n    I am 30 years old.\n\n    Expected JSON:\n    {\n      \"severity\": \"SEMANTIC_EQUIVALENT\",\n      \"error_types\": [\"NUMBER_FORMAT\"],\n      \"meaning_preserved\": true,\n      \"main_intent_preserved\": true,\n      \"entities_preserved\": true,\n      \"truncated\": false,\n      \"hallucinated\": false,\n      \"judge_uncertain\": false,\n      \"short_reason\": \"The numeric value is the same but formatted differently.\"\n    }\n\n\n### Example 2: number value error\n\n\n    Reference:\n    The appointment is at nine thirteen.\n\n    Model transcript:\n    The appointment is at 9:30.\n\n    Expected JSON:\n    {\n      \"severity\": \"CRITICAL_MEANING_ERROR\",\n      \"error_types\": [\"NUMBER_VALUE_ERROR\"],\n      \"meaning_preserved\": false,\n      \"main_intent_preserved\": false,\n      \"entities_preserved\": false,\n      \"truncated\": false,\n      \"hallucinated\": false,\n      \"judge_uncertain\": false,\n      \"short_reason\": \"The appointment time is wrong.\"\n    }\n\n\n### Example 3: place-name error\n\n\n    Reference:\n    I need a ticket to Birmingham.\n\n    Model transcript:\n    I need a ticket to Burnley.\n\n    Expected JSON:\n    {\n      \"severity\": \"CRITICAL_MEANING_ERROR\",\n      \"error_types\": [\"PLACE_NAME_ERROR\"],\n      \"meaning_preserved\": false,\n      \"main_intent_preserved\": true,\n      \"entities_preserved\": false,\n      \"truncated\": false,\n      \"hallucinated\": false,\n      \"judge_uncertain\": false,\n      \"short_reason\": \"The destination place is 
different.\"\n    }\n\n\n### Example 4: truncation\n\n\n    Reference:\n    Here's what I need you to do next. Please call the office before five.\n\n    Model transcript:\n    Here's\n\n    Expected JSON:\n    {\n      \"severity\": \"CRITICAL_MEANING_ERROR\",\n      \"error_types\": [\"OMISSION_OR_TRUNCATION\"],\n      \"meaning_preserved\": false,\n      \"main_intent_preserved\": false,\n      \"entities_preserved\": false,\n      \"truncated\": true,\n      \"hallucinated\": false,\n      \"judge_uncertain\": false,\n      \"short_reason\": \"The transcript stops after the first word and omits the main content.\"\n    }\n\n\n* * *\n\n## 10. Validation plan\n\nDo not run the judge over the full dataset without validation.\n\nCreate a human-labeled subset.\n\nMinimum:\n\n\n    300 examples\n\n\nBetter:\n\n\n    500 to 1,000 examples\n\n\nUse stratified sampling. Include:\n\n\n    low WER\n    high WER\n    low WER + possible entity error\n    low WER + possible number error\n    high WER + likely same meaning\n    truncation cases\n    hallucination/insertion cases\n    all accent groups\n    all model families\n    short audio\n    long audio\n    different speakers\n    different genders\n\n\nMeasure:\n\n\n    accuracy\n    macro-F1\n    per-label F1\n    Cohen's kappa\n    confusion matrix\n    judge_uncertain_rate\n    parse_error_rate\n\n\nImportant confusions to inspect:\n\nConfusion | Why it matters\n---|---\n`NUMBER_FORMAT` vs `NUMBER_VALUE_ERROR` | harmless vs critical\n`ORTHOGRAPHIC_ONLY` vs `PLACE_NAME_ERROR` | spelling vs wrong place\n`SEMANTIC_EQUIVALENT` vs `MINOR_SEMANTIC_SHIFT` | score calibration\n`MAJOR_SEMANTIC_SHIFT` vs `CRITICAL_MEANING_ERROR` | severity calibration\n`WORD_SUBSTITUTION` vs `PERSON_NAME_ERROR` | entity strictness\n`OMISSION_OR_TRUNCATION` vs ordinary deletion | audio-LLM failure-mode detection\n\nFreeze these before the final run:\n\n\n    judge_model\n    judge_model_revision\n    judge_prompt_version\n    temperature\n    
max_tokens\n    output_parser_version\n    normalization_version\n\n\n* * *\n\n## 11. Pairwise judging\n\nPairwise judging can be useful, but I would not use it for the whole dataset.\n\nWith 15 models:\n\n\n    15 choose 2 = 105 pairs per audio\n\n\nFor 17,900 audios:\n\n\n    17,900 * 105 = 1,879,500 pairwise judgments\n\n\nThat is probably too expensive.\n\nUse pairwise judging for:\n\n\n    validation subset\n    top 3 to 5 models\n    low-WER/high-severity outliers\n    high-WER/semantic-equivalent outliers\n    cases where model rankings are unclear\n\n\nIf you use pairwise judging, reverse A/B order on a subset because LLM judges can show position bias.\n\nStore:\n\n\n    pairwise_winner\n    pairwise_reversed_winner\n    position_stable\n\n\nIf the winner changes after reversing A/B order, mark the comparison unstable.\n\n* * *\n\n## 12. How DeepEval fits\n\nDeepEval can be useful as infrastructure, especially if you want G-Eval-like custom criteria or decision-tree/DAG-style evaluation.\n\nUseful links:\n\n  * DeepEval LLM-as-a-Judge guide\n  * DeepEval G-Eval docs\n  * DeepEval GitHub\n\n\n\nBut I would not use a generic “correctness” or “semantic similarity” metric directly.\n\nYour metric should be ASR-specific:\n\n\n    number format\n    number value\n    person names\n    place names\n    truncation\n    hallucination\n    intent preservation\n    entity preservation\n\n\nSo DeepEval is useful as an execution framework, not as the final rubric.\n\n* * *\n\n## 13. 
EDA and plots\n\n### Main heatmaps\n\nHeatmap | What it shows\n---|---\n`model x severity` | Which models produce more serious semantic errors\n`model x error_type` | Model-specific weaknesses: numbers, names, places, truncation, hallucination\n`accent_group x severity` | Whether certain accents cause more meaning degradation\n`accent_group x error_type` | Accent-specific error patterns\n`model x truncation_rate` | Which audio LLMs stop early\n`audio_duration_bucket x truncation_rate` | Whether long audio causes incomplete output\n`model x entity_preservation_rate` | Which models preserve names/places/numbers\n`WER bucket x severity` | Where WER agrees or disagrees with semantic labels\n`normalized WER bucket x severity` | Whether normalization improves semantic alignment\n`model x judge_uncertain_rate` | Which model outputs are hardest to judge\n\n### Spearman correlations\n\nUse Spearman because many variables are ordinal or non-normal.\n\nCompute:\n\n\n    wer_raw vs semantic_penalty\n    wer_normalized vs semantic_penalty\n    cer_raw vs entity_score\n    cer_normalized vs entity_score\n    wer_raw vs intent_score\n    audio_duration_sec vs truncated\n    reference_word_count vs truncated\n    wer_raw vs entities_preserved\n\n\nDo not report only one global correlation. 
Also compute by:\n\n\n    model\n    accent_group\n    gender\n    duration_bucket\n    reference_length_bucket\n\n\n### Most important outlier buckets\n\nThese are probably the most interesting examples:\n\nBucket | Why important\n---|---\nlow WER + critical semantic error | WER missed a dangerous error\nhigh WER + semantic equivalent | WER over-penalized harmless differences\nlow WER + entity error | one key name/place/number broke the meaning\nlow normalized WER + critical error | normalization hid something important\nhigh truncation rate for one model | audio LLM generation failure\nhigh hallucination rate for one model | audio LLM over-generation\nhigh `UNCERTAIN` rate | judge prompt/model not robust enough\n\nThese disagreement cases will be more valuable than a simple leaderboard.\n\n* * *\n\n## 14. Suggested row-level schema\n\nUse one row per:\n\n\n    audio_id x model_name\n\n\nSuggested fields:\n\n\n    audio_id\n    speaker_id\n    accent_group\n    gender\n    audio_duration_sec\n\n    reference_raw\n    reference_normalized\n\n    model_name\n    model_revision\n    model_family\n    quantization_mode\n    prompt_version\n    decoding_params\n\n    hypothesis_raw\n    hypothesis_normalized\n\n    wer_raw\n    cer_raw\n    mer\n    wil\n    wip\n    wer_normalized\n    cer_normalized\n\n    reference_word_count\n    hypothesis_word_count\n    length_ratio\n    possible_truncation\n    possible_hallucination\n\n    reference_numbers\n    hypothesis_numbers\n    reference_numbers_canonical\n    hypothesis_numbers_canonical\n    number_format_only\n    number_value_mismatch\n\n    reference_entities\n    hypothesis_entities\n    missing_entities\n    changed_entities\n    extra_entities\n\n    judge_model\n    judge_model_revision\n    judge_prompt_version\n    judge_temperature\n    judge_parse_error\n\n    severity\n    error_types\n    meaning_preserved\n    main_intent_preserved\n    entities_preserved\n    truncated\n    hallucinated\n    
judge_uncertain\n    short_reason\n\n    semantic_penalty\n    semantic_score\n    intent_score\n    entity_score\n\n    normalization_version\n    run_timestamp\n\n\nThis looks verbose, but it makes later EDA much easier.\n\n* * *\n\n## 15. What I would report\n\n### Table 1: classic ASR metrics\n\nModel | WER raw | CER raw | WER normalized | CER normalized | MER | WIL\n---|---|---|---|---|---|---\n|  |  |  |  |  |\n\n### Table 2: semantic metrics\n\nModel | Semantic score | Intent preserved | Entity preserved | Critical error rate | Judge uncertain\n---|---|---|---|---|---\n|  |  |  |  |\n\n### Table 3: error-type rates\n\nModel | Number format | Number value error | Person name error | Place name error | Truncation | Hallucination\n---|---|---|---|---|---|---\n|  |  |  |  |  |\n\n### Table 4: accent slicing\n\nAccent group | WER norm | Semantic score | Entity preserved | Truncation rate | Critical error rate\n---|---|---|---|---|---\n|  |  |  |  |\n\n### Table 5: metric-disagreement examples\n\nPattern | Example type\n---|---\nlow WER + critical error | wrong number/name/place\nhigh WER + semantic equivalent | formatting/paraphrase\nlow normalized WER + critical error | normalization artifact\nhigh truncation | audio LLM stopped early\n\n* * *\n\n## 16. Practical final recommendation\n\nI would implement this pipeline:\n\n\n    1. Save every raw model output.\n    2. Compute raw WER/CER/MER/WIL/WIP.\n    3. Build a versioned normalization pipeline.\n    4. Compute normalized WER/CER.\n    5. Add deterministic number, entity, length, truncation, and hallucination features.\n    6. Use SLM judge only for non-trivial semantic cases.\n    7. Make the SLM return JSON labels, not numeric scores.\n    8. Convert severity labels to numeric semantic penalties offline.\n    9. Validate the judge on 300 to 1,000 human-labeled examples.\n    10. Use pairwise judging only for validation subsets and metric-disagreement cases.\n    11. 
Plot model x severity, model x error type, accent x error type, WER bucket x severity, and duration x truncation.\n    12. Focus the discussion on where WER and semantic quality disagree.\n\n\n* * *\n\n## 17. Short summary\n\n  * Keep **WER/CER**, but do not rely on them alone.\n  * Add **normalized WER/CER** to reduce harmless formatting penalties.\n  * Add **SLM judge labels** for semantic severity, intent, entities, truncation, and hallucination.\n  * Do **not** ask the SLM for a direct numeric score.\n  * Use labels first, then map them to scores offline.\n  * Treat `thirty` vs `30` as `NUMBER_FORMAT`, not a semantic error.\n  * Treat `30` vs `13` as `NUMBER_VALUE_ERROR`, usually critical.\n  * Treat names and places strictly with `PERSON_NAME_ERROR` and `PLACE_NAME_ERROR`.\n  * Treat incomplete outputs as `OMISSION_OR_TRUNCATION`, not just high WER.\n  * Validate your judge with a human-labeled subset before scaling.\n  * The strongest analysis will be the disagreement cases: **low WER + critical semantic error** and **high WER + semantic equivalent**.\n\n\n\n* * *\n\n## References\n\n  * Pipecat STT Benchmark: GitHub - pipecat-ai/stt-benchmark: Benchmarking STT service TTFB and semantic WER for real-time AI applications · GitHub\n  * Pipecat Semantic WER implementation: stt-benchmark/src/stt_benchmark/evaluation/semantic_wer.py at main · pipecat-ai/stt-benchmark · GitHub\n  * Sarvam ASR evaluation beyond WER: Indic ASR evaluation: beyond WER to LLM & semantic metrics | Sarvam AI\n  * Sarvam LLM-WER: GitHub - sarvamai/llm_wer · GitHub\n  * Sarvam intent/entity evaluation: GitHub - sarvamai/llm_intent_entity: LLM-Eval framework for evaluating performance of ASR models · GitHub\n  * WER is Unaware paper: [2511.16544] WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue\n  * WER is Unaware repo: GitHub - Ufonia/wer-is-unaware: A benchmark, alignment pipeline, and LLM-as-a-Judge for evaluating the clinical impact of 
ASR errors. · GitHub\n  * Generative LLMs for ASR evaluation: [2604.21928] Evaluation of Automatic Speech Recognition Using Generative Large Language Models\n  * JiWER: GitHub - jitsi/jiwer: Evaluate your speech-to-text system with similarity measures such as word error rate (WER) · GitHub\n  * JiWER transforms: transforms - jiwer\n  * Hugging Face Audio Course, ASR evaluation: Evaluation metrics for ASR · Hugging Face\n  * What is lost in Normalization?: [2409.02449] What is lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations\n  * Whisper English normalizer: whisper/whisper/normalizers/english.py at main · openai/whisper · GitHub\n  * Whisper Basic normalizer: whisper/whisper/normalizers/basic.py at main · openai/whisper · GitHub\n  * DeepEval LLM-as-a-Judge guide: LLM-as-a-Judge Evaluation with DeepEval | DeepEval by Confident AI - The LLM Evaluation Framework\n  * DeepEval G-Eval docs: G-Eval | DeepEval by Confident AI - The LLM Evaluation Framework\n  * DeepEval GitHub: GitHub - confident-ai/deepeval: The LLM Evaluation Framework · GitHub\n  * G-Eval paper: [2303.16634] G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment\n  * Prometheus 2 paper: [2405.01535] Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models\n  * Prometheus-Eval: GitHub - prometheus-eval/prometheus-eval: Evaluate your LLM's response with Prometheus and GPT4 💯 · GitHub\n  * LLM-as-a-Judge survey: [2411.15594] A Survey on LLM-as-a-Judge\n  * Position bias in LLM-as-a-Judge: [2406.07791] Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge\n  * Earnings-22 accented ASR benchmark: [2203.15591] Earnings-22: A Practical Benchmark for Accents in the Wild\n  * EdAcc accented English ASR corpus: [2303.18110] The Edinburgh International Accents of English Corpus: Towards the Democratization of English ASR\n  * EdAcc dataset card: edinburghcstr/edacc · Datasets at Hugging Face\n\n",
  "title": "LLM as a Judge - Evaluate ASR"
}