Fine-tuning Whisper-large-v3 for child reading assessment with numerals and proper names
Hugging Face Forums [Unofficial]
April 6, 2026
Hi everyone,
I’m working on a reading assessment product for children.
Current setup:
* a child reads a known passage for about 1 minute
* our system then counts how many words were read correctly
* we currently use whisper-1 as a baseline
* we want to move to an open model and fine-tune Whisper-large-v3 on our own infrastructure
This is not a generic ASR task:
* we always know the reference text in advance
* our main metric is correct-word-count accuracy against the reference passage
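To make the metric concrete, here is a minimal sketch of how a correct-word count against a known reference could be computed. It is an illustration under assumptions, not the scorer actually in use: the function names are hypothetical, the tokenizer assumes English text, and `difflib.SequenceMatcher` stands in for whatever alignment a real system would use.

```python
import difflib
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens, dropping punctuation.

    Assumption: English passages, so only a-z, digits and apostrophes are kept.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


def count_correct_words(reference: str, hypothesis: str) -> int:
    """Count reference words that the ASR hypothesis matches, in order.

    The two token sequences are aligned and the sizes of all matching
    blocks are summed. Skipped, inserted or misread words do not count.
    """
    ref_tokens = tokenize(reference)
    hyp_tokens = tokenize(hypothesis)
    matcher = difflib.SequenceMatcher(a=ref_tokens, b=hyp_tokens, autojunk=False)
    return sum(block.size for block in matcher.get_matching_blocks())


if __name__ == "__main__":
    reference = "The cat sat on the mat."
    hypothesis = "the cat sat on mat"  # the child skipped one word
    print(count_correct_words(reference, hypothesis))
```

With a scorer like this, any mismatch in surface form between the transcript and the reference (for example a digit versus a spelled-out number) is counted as a reading error, which is why the transcript format matters so much here.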
The main cases we want to improve through fine-tuning are:
* numerals / spoken-written forms, for example “three” vs “3”
* proper names and other rare words
* child reading speech in general
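As an illustration of the numeral case, the sketch below maps spoken and written forms to one canonical token before comparison, so that “three” and “3” count as the same word. The lookup table and function names are hypothetical and cover only 0 to 12; a real normalizer would also need larger numbers, ordinals, dates and so on.

```python
# Hypothetical lookup table: spoken form -> written form (0 to 12 only).
NUMBER_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "ten": "10", "eleven": "11", "twelve": "12",
}


def normalize_token(token: str) -> str:
    """Return a canonical form: lowercase, with number words mapped to digits."""
    lowered = token.lower()
    return NUMBER_WORDS.get(lowered, lowered)


def same_word(reference_token: str, hypothesis_token: str) -> bool:
    """Compare two tokens after normalizing both sides the same way."""
    return normalize_token(reference_token) == normalize_token(hypothesis_token)


if __name__ == "__main__":
    print(same_word("3", "three"))  # spoken vs written form of the same numeral
    print(same_word("3", "tree"))   # a genuine misreading
```

Normalizing both sides at scoring time is one option; the alternative is to train the model to emit the reference format directly, which is what the first question below is about.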
I’d like advice specifically on the fine-tuning strategy for this type of task.
My questions:
1. For this use case, what training targets would you recommend for fine-tuning: verbatim spoken transcripts, normalized transcripts, or transcripts matching the reference text format?
2. How much data is usually needed to see meaningful improvement when fine-tuning Whisper-large-v3 for child reading speech?
3. What data mix would you recommend for training:
* general child speech
* child reading audio
* oversampled examples with numerals
* oversampled examples with proper names / rare words
4. Would you start with LoRA or full fine-tuning for this kind of adaptation?
5. If the main goal is to improve numerals and proper names, is it better to run a single fine-tuning pass on all data, or to use a staged approach:
* first domain adaptation on child speech
* then additional fine-tuning on hard cases like numerals and proper names
6. Has anyone here fine-tuned Whisper-large-v3 specifically for child speech or reading assessment? If so, what setup worked best for you?
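To make the data-mix question (question 3) concrete, this is roughly what weighted sampling across the four sources could look like. The source names and weights are placeholders chosen for illustration, not a recommendation or a tested recipe.

```python
import random
from collections import Counter

# Hypothetical mixing weights per data source. Hard cases (numerals,
# proper names) are drawn more often than their raw share of the data
# would suggest, which is what "oversampling" means here.
MIX_WEIGHTS = {
    "general_child_speech": 0.30,
    "child_reading": 0.40,
    "numerals": 0.15,
    "proper_names": 0.15,
}


def sample_sources(n: int, weights: dict[str, float], seed: int = 0) -> list[str]:
    """Draw n source names with probability proportional to their weights."""
    rng = random.Random(seed)  # seeded so the draw is reproducible
    names = list(weights)
    return rng.choices(names, weights=[weights[name] for name in names], k=n)


if __name__ == "__main__":
    draws = sample_sources(1000, MIX_WEIGHTS)
    print(Counter(draws))  # approximate share of each source in the batch stream
```

Each drawn name would then select which pool the next training example comes from, so changing the weights changes the effective mix without duplicating files on disk.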
Planned stack:
* Transformers
* PEFT / LoRA
* Accelerate
* base model: openai/whisper-large-v3
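For reference, a LoRA setup on this stack might start from something like the sketch below. The rank, alpha, dropout and choice of target modules are assumptions picked for illustration, not tuned values; `q_proj` and `v_proj` are attention projection layers in the Transformers Whisper implementation. The heavy imports sit inside the function so the settings can be inspected without loading the model.

```python
BASE_MODEL = "openai/whisper-large-v3"

# Hypothetical starting values, not tuned for child speech.
LORA_SETTINGS = {
    "r": 32,                                 # adapter rank
    "lora_alpha": 64,                        # scaling factor
    "target_modules": ["q_proj", "v_proj"],  # attention projections to adapt
    "lora_dropout": 0.05,
    "bias": "none",
}


def build_lora_model(base_model: str = BASE_MODEL, settings: dict = LORA_SETTINGS):
    """Load the base Whisper model and wrap it with LoRA adapters.

    Imports are done lazily because loading the checkpoint needs the
    transformers and peft packages plus several GB of weights.
    """
    from peft import LoraConfig, get_peft_model
    from transformers import WhisperForConditionalGeneration

    model = WhisperForConditionalGeneration.from_pretrained(base_model)
    peft_model = get_peft_model(model, LoraConfig(**settings))
    peft_model.print_trainable_parameters()  # shows how few weights are trained
    return peft_model
```

Calling `build_lora_model()` downloads the checkpoint on first use, so it is best run once on the training machine rather than at import time.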
I’d really appreciate practical advice on data volume, dataset composition, and fine-tuning strategy for this specific use case.
Thanks!