{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreihbmjqvotfhwkak67ahbu7zvmrmf5wid3yxk4icrhotrujabsmqeq",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mitu5rs2n6w2"
},
"path": "/t/fine-tuning-whisper-large-v3-for-child-reading-assessment-with-numerals-and-proper-names/175022#post_1",
"publishedAt": "2026-04-06T14:44:36.000Z",
"site": "https://discuss.huggingface.co",
"textContent": "Hi everyone,\n\nI’m working on a reading assessment product for children.\n\nCurrent setup:\n\n * a child reads a known passage for about 1 minute\n * our system then counts how many words were read correctly\n * right now we use whisper-1 as a baseline\n * we now want to move to an open model and fine-tune Whisper-large-v3 on our own infrastructure\n\n\n\nThis is not a generic ASR task:\n\n * we always know the reference text in advance\n * our main metric is correct-word-count accuracy against the reference passage\n\n\n\nThe main cases we want to improve through fine-tuning are:\n\n * numerals / spoken-written forms, for example “three” vs “3”\n * proper names and other rare words\n * child reading speech in general\n\n\n\nI’d like advice specifically on the fine-tuning strategy for this type of task.\n\nMy questions:\n\n 1. For this use case, what training targets would you recommend for fine-tuning: verbatim spoken transcripts, normalized transcripts, or transcripts matching the reference text format?\n 2. How much data is usually needed to see meaningful improvement when fine-tuning Whisper-large-v3 for child reading speech?\n 3. What data mix would you recommend for training:\n * general child speech\n * child reading audio\n * oversampled examples with numerals\n * oversampled examples with proper names / rare words\n 4. Would you start with LoRA or full fine-tuning for this kind of adaptation?\n 5. If the main goal is to improve numerals and proper names, is it better to do one fine-tuning run on all data, or a staged approach:\n * first domain adaptation on child speech\n * then additional fine-tuning on hard cases like numerals and proper names\n 6. Has anyone here fine-tuned Whisper-large-v3 specifically for child speech or reading assessment? If so, what setup worked best for you?\n\n\n\nPlanned stack:\n\n * Transformers\n * PEFT / LoRA\n * Accelerate\n * base model: openai/whisper-large-v3\n\n\n\nI’d really appreciate practical advice on data volume, dataset composition, and fine-tuning strategy for this specific use case.\n\nThanks!",
"title": "Fine-tuning Whisper-large-v3 for child reading assessment with numerals and proper names"
}