
Fine-tuning Whisper-large-v3 for child reading assessment with numerals and proper names

Hugging Face Forums [Unofficial] April 6, 2026

Hi everyone,

I’m working on a reading assessment product for children.

Current setup:

* a child reads a known passage for about 1 minute
* our system then counts how many words were read correctly
* right now we use whisper-1 as a baseline
* we now want to move to an open model and fine-tune Whisper-large-v3 on our own infrastructure

This is not a generic ASR task:

* we always know the reference text in advance
* our main metric is correct-word-count accuracy against the reference passage

The main cases we want to improve through fine-tuning are:

* numerals / spoken-written forms, for example “three” vs “3”
* proper names and other rare words
* child reading speech in general

I’d like advice specifically on the fine-tuning strategy for this type of task.

My questions:

1. For this use case, what training targets would you recommend for fine-tuning: verbatim spoken transcripts, normalized transcripts, or transcripts matching the reference text format?
2. How much data is usually needed to see meaningful improvement when fine-tuning Whisper-large-v3 for child reading speech?
3. What data mix would you recommend for training:
   * general child speech
   * child reading audio
   * oversampled examples with numerals
   * oversampled examples with proper names / rare words
4. Would you start with LoRA or full fine-tuning for this kind of adaptation?
5. If the main goal is to improve numerals and proper names, is it better to do one fine-tuning run on all data, or a staged approach:
   * first domain adaptation on child speech
   * then additional fine-tuning on hard cases like numerals and proper names
6. Has anyone here fine-tuned Whisper-large-v3 specifically for child speech or reading assessment? If so, what setup worked best for you?

Planned stack:

* Transformers
* PEFT / LoRA
* Accelerate
* base model: openai/whisper-large-v3

I’d really appreciate practical advice on data volume, dataset composition, and fine-tuning strategy for this specific use case.

Thanks!
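
For concreteness, here is a minimal sketch of what "correct-word-count accuracy against the reference passage" could look like when “three” and “3” are treated as the same word. It is an illustration under stated assumptions, not the actual scoring code: the `NUMERALS` map, `normalize`, and `correct_word_count` are hypothetical names, the numeral map only covers 0 to 10, and `difflib.SequenceMatcher` from the Python standard library stands in for a proper word-level alignment.

```python
import re
from difflib import SequenceMatcher

# Hypothetical, deliberately tiny digit-to-word map (assumption for this sketch).
# A real normalizer would also need larger numbers, ordinals, dates, etc.
NUMERALS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
    "10": "ten",
}


def normalize(text):
    """Lowercase, strip punctuation, and map digit tokens to their spoken form."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [NUMERALS.get(word, word) for word in words]


def correct_word_count(reference, hypothesis):
    """Count reference words that align with identical words in the transcript.

    SequenceMatcher returns in-order matching blocks between the two word
    lists; the sum of their sizes is used as the number of correctly read
    words. This is a heuristic alignment, not a minimum-edit-distance one.
    """
    ref_words = normalize(reference)
    hyp_words = normalize(hypothesis)
    matcher = SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    return sum(block.size for block in matcher.get_matching_blocks())


if __name__ == "__main__":
    reference = "Anna has 3 red apples."
    transcript = "anna has three red apple"
    print(correct_word_count(reference, transcript))
```

Scoring after normalization like this means the choice of training target format (question 1) mostly affects how much work the normalizer has to do, rather than the metric itself.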
