{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreihbmjqvotfhwkak67ahbu7zvmrmf5wid3yxk4icrhotrujabsmqeq",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mitu5rs2n6w2"
  },
  "path": "/t/fine-tuning-whisper-large-v3-for-child-reading-assessment-with-numerals-and-proper-names/175022#post_1",
  "publishedAt": "2026-04-06T14:44:36.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "Hi everyone,\n\nI’m working on a reading assessment product for children.\n\nCurrent setup:\n\n  * a child reads a known passage for about 1 minute\n  * our system then counts how many words were read correctly\n  * right now we use whisper-1 as a baseline\n  * we now want to move to an open model and fine-tune Whisper-large-v3 on our own infrastructure\n\n\n\nThis is not a generic ASR task:\n\n  * we always know the reference text in advance\n  * our main metric is correct-word-count accuracy against the reference passage\n\n\n\nThe main cases we want to improve through fine-tuning are:\n\n  * numerals / spoken-written forms, for example “three” vs “3”\n  * proper names and other rare words\n  * child reading speech in general\n\n\n\nI’d like advice specifically on the fine-tuning strategy for this type of task.\n\nMy questions:\n\n  1. For this use case, what training targets would you recommend for fine-tuning: verbatim spoken transcripts, normalized transcripts, or transcripts matching the reference text format?\n  2. How much data is usually needed to see meaningful improvement when fine-tuning Whisper-large-v3 for child reading speech?\n  3. What data mix would you recommend for training:\n     * general child speech\n     * child reading audio\n     * oversampled examples with numerals\n     * oversampled examples with proper names / rare words\n  4. Would you start with LoRA or full fine-tuning for this kind of adaptation?\n  5. If the main goal is to improve numerals and proper names, is it better to do one fine-tuning run on all data, or a staged approach:\n     * first domain adaptation on child speech\n     * then additional fine-tuning on hard cases like numerals and proper names\n  6. Has anyone here fine-tuned Whisper-large-v3 specifically for child speech or reading assessment? If so, what setup worked best for you?\n\n\n\nPlanned stack:\n\n  * Transformers\n  * PEFT / LoRA\n  * Accelerate\n  * base model: openai/whisper-large-v3\n\n\n\nI’d really appreciate practical advice on data volume, dataset composition, and fine-tuning strategy for this specific use case.\n\nThanks!",
  "title": "Fine-tuning Whisper-large-v3 for child reading assessment with numerals and proper names"
}