{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreihz2xeqixyuaeo75wcy54jhvhj4wgaqdrwrtdef4mxiumbn24ipx4",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mjyi5kbfsb22"
},
"path": "/t/looking-for-guidance-trying-to-create-a-model-with-trocrs-encoder-googles-mt5-multilingual-decoder-but-model-fails-to-overfit-on-a-single-data-sample/174634#post_8",
"publishedAt": "2026-04-21T07:37:43.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"huggingface.co",
"arxiv.org",
"aclanthology.org",
"cvit.iiit.ac.in"
],
"textContent": "Hmm… Hypothesis that the problem is occurring on the decoder side:\n\n* * *\n\n## My overall conclusion\n\nI do **not** think you should change the tokenizer or abandon the current TrOCR-encoder + mT5-decoder setup yet.\n\nI think the current evidence says something more specific:\n\n * the architecture can work, because some targets overfit correctly;\n * the decoder is still under-adapted on harder lines;\n * and the tokens like `<extra_id_0>` are not random accidents — they are a very specific sign that the T5-family decoder is falling back to its pretraining behavior when OCR grounding is weak. T5-family tokenizers include `extra_ids` special tokens, and the original T5 pretraining objective uses sentinel tokens as part of span corruption. (huggingface.co)\n\n\n\nSo my main recommendation is:\n\n> **Keep the current setup, but change how you adapt and evaluate the decoder.**\n\n* * *\n\n## Why `<extra_id_0>` appears at all\n\nThis is the first thing to understand.\n\n`<extra_id_0>` is a built-in special token in the T5 and mT5 tokenizer family. It is not some random OCR artifact. T5 was pretrained with a denoising objective that literally teaches the model to emit sentinel tokens like `<extra_id_0>` when reconstructing masked spans. That means these tokens have very strong pretrained priors. (arxiv.org)\n\nSo when your OCR model is uncertain, what can happen is:\n\n 1. the image signal is not strong enough to dominate,\n 2. the decoder falls back to familiar pretrained behavior,\n 3. 
sentinel tokens and repetitive continuations leak into generation.\n\n\n\nThat is why your outputs look like:\n\n * `<extra_id_0>`\n * then repeated “लिए”\n * then repeated “और”\n * and other locally high-probability continuations\n\n\n\nThis is a **decoder grounding problem**, not a token-coverage problem.\n\n* * *\n\n## Why some lines fit and others fail\n\nThis is the second key idea.\n\nIf the only issue were random initialization, you would mostly see **run-to-run** differences:\n\n * one run works,\n * another run does not.\n\n\n\nBut what you are seeing is also **line-to-line** variation:\n\n * some target texts overfit nicely,\n * others collapse badly.\n\n\n\nThat means there are **multiple causes at once**.\n\n### Cause 1: random bridge initialization\n\nHugging Face’s encoder-decoder docs explicitly note that in warm-started hybrids, the decoder-side cross-attention can be randomly initialized and must be fine-tuned downstream. So yes, some instability is expected. (huggingface.co)\n\n### Cause 2: target difficulty is uneven\n\nSome lines are easier:\n\n * shorter,\n * cleaner,\n * more common vocabulary,\n * fewer punctuation marks,\n * easier crops,\n * more printed than handwritten.\n\n\n\nSome are harder:\n\n * longer,\n * more punctuation,\n * noisier handwriting,\n * denser ligatures,\n * rarer word combinations.\n\n\n\nThe hard lines require stronger and more stable image grounding. 
So they expose decoder weakness faster.\n\n### Cause 3: your current trainable slice is still too narrow\n\nThis is the main practical issue.\n\nYour two freezing strategies both let the model learn **some** bridge behavior, but they do not give the decoder enough freedom to fully reshape sequence generation for hard OCR lines.\n\nThat is why the model can sometimes stick closely to the target and sometimes fail badly.\n\nSo my interpretation is:\n\n> **random initialization contributes, but the bigger story is under-adaptation of the decoder in a tiny-data regime.**\n\n* * *\n\n## Why your two freezing strategies behave this way\n\n### Strategy 1\n\nTrain only:\n\n * `EncDecAttention`\n * `lm_head`\n * `shared`\n * projection\n\n\n\nThis helps the model learn:\n\n * how to inject image features into the decoder,\n * and how to map decoder states into output tokens.\n\n\n\nThat is often enough for easy examples.\n\nBut it does **not** fully change the decoder’s internal sequence dynamics.\n\nSo if the image signal is weak, the decoder still falls back to pretrained behavior.\n\n### Strategy 2\n\nAdd:\n\n * `DenseReluDense`\n\n\n\nThis is broader and better than Strategy 1.\n\nBut it still leaves other important parts constrained, especially self-attention-driven sequence behavior.\n\nSo it can still fail on harder examples.\n\nThat is why both strategies can show “sometimes good, sometimes terrible” behavior.\n\nThey are not wrong. They are just **not broad enough** yet for the hard lines.\n\n* * *\n\n## My recommended solutions, in order\n\n## Solution 1. Keep the current architecture\n\nDo not switch tokenizer.\nDo not switch away from mT5 yet.\nDo not switch away from the TrOCR encoder yet.\n\nReason:\n\n * one-sample overfit success proves the wiring can work;\n * `<extra_id_0>` means decoder fallback, not missing Hindi token support. (huggingface.co)\n\n\n\nThis is the highest-confidence recommendation.\n\n* * *\n\n## Solution 2. 
Use a broader decoder-side adaptation strategy\n\nThis is the most important practical change.\n\n### Recommended next freeze schedule\n\nFreeze:\n\n * **entire encoder**\n\n\n\nTrain:\n\n * `enc_to_dec_proj`\n * **all `EncDecAttention` layers**\n * `lm_head`\n * `shared`\n * **all parameters in the last 2 decoder blocks**\n\n\n\nThis is better than both of your current strategies because it gives the decoder more freedom to change:\n\n * sequence behavior,\n * grounding behavior,\n * and output token dynamics.\n\n\n\nI would use this as the next main training strategy.\n\nWhy this makes sense:\n\n * the encoder already gives usable image features;\n * the fragile part is still the decoder-side bridge and generation;\n * Hugging Face’s docs already point to cross-attention as the new component that often needs fine-tuning in warm-started hybrids. (huggingface.co)\n\n\n\n### What I would **not** do\n\nDo **not** unfreeze the encoder yet.\n\nThat is too early for your data size and not where the failure signal is pointing.\n\n* * *\n\n## Solution 3. Suppress sentinel tokens during validation and inference\n\nThis is a very useful **guardrail**.\n\nHugging Face generation utilities support `bad_words_ids`, which lets you block specific tokens or token sequences during generation. Since `<extra_id_n>` tokens should never be valid OCR output for your task, you can suppress them during validation and inference. 
(huggingface.co)\n\nExample idea:\n\n\n extra_tokens = [f\"<extra_id_{i}>\" for i in range(100)]\n bad_words_ids = tokenizer(extra_tokens, add_special_tokens=False).input_ids\n\n generated_ids = model.generate(\n pixel_values=pixel_values,\n max_new_tokens=max_new_tokens,\n num_beams=1,\n do_sample=False,\n bad_words_ids=bad_words_ids,\n )\n\n\nImportant caution:\n\n * this is **not** the real fix,\n * it is a **guardrail**.\n\n\n\nIt prevents the most obviously invalid decoder fallback behavior from polluting your evaluation, while you keep working on the actual training problem.\n\n* * *\n\n## Solution 4. Split your tests into easy lines and hard lines\n\nRight now your model feels “unpredictable” because you are mentally averaging together different difficulty levels.\n\nDo this instead:\n\n### Easy probe set\n\nUse lines that are:\n\n * shorter,\n * cleaner,\n * more printed,\n * less punctuation-heavy,\n * more common vocabulary.\n\n\n\n### Hard probe set\n\nUse lines that are:\n\n * longer,\n * more punctuation-heavy,\n * noisier handwriting,\n * more complex Devanagari forms,\n * more unusual vocabulary.\n\n\n\nThen run the same overfit test on both.\n\nThis will tell you much more than one mixed impression.\n\nIf easy lines fit but hard lines do not, then the explanation is not “just random init.”\nIt is:\n\n * random init,\n * plus hard-target difficulty,\n * plus decoder under-adaptation.\n\n\n\n* * *\n\n## Solution 5. Add three diagnostics\n\nThese three diagnostics will make your debugging much clearer.\n\n### A. Sentinel-token rate\n\nTrack how often predictions contain `<extra_id_0>` or any `<extra_id_n>`.\n\nThis tells you whether the decoder is still falling back to T5 pretraining behavior.\n\n### B. Length ratio\n\nTrack:\n\n * `len(prediction) / len(reference)`\n\n\n\nIf this ratio explodes, repetition and EOS failure are dominating.\n\n### C. 
Target token length\n\nTrack tokenized target length for each line.\n\nHard examples often cluster here.\n\nThese three numbers will be more informative than loss alone.\n\n* * *\n\n## Solution 6. Tighten generation length\n\nA flat `max_new_tokens=64` is probably too blunt.\n\nMy recommendation is:\n\n * compute the **95th percentile** target token length in your dataset,\n * then set `max_new_tokens = p95 + 4`.\n\n\n\nWhy:\n\n * long ceilings give unstable models more room to loop,\n * shorter, data-driven ceilings reduce runaway repetition.\n\n\n\nThis is a practical recommendation based on the failure pattern you are seeing.\n\n* * *\n\n## Solution 7. Move to LoRA if the above still fails\n\nIf the broader decoder-side adaptation still gives unstable behavior, my next recommendation is:\n\n * **LoRA on cross-attention**\n * plus **LoRA on the last 2 decoder blocks**\n\n\n\nThis is not just a generic modern preference. Recent low-resource Indic OCR work uses LoRA-style parameter-efficient adaptation, and the original LoRA paper explains why adapting only a small low-rank slice is often more stable and much cheaper than broad full fine-tuning. (aclanthology.org)\n\nSo if the broader partial fine-tuning still collapses, I would move to LoRA **before** touching the encoder.\n\n* * *\n\n## Solution 8. Keep 210 samples for debugging, but move to 1000+ once decoder behavior stabilizes\n\nYour 210-sample dataset is still useful. It is a good **debugging set**.\n\nBut it is not enough to expect stable OCR behavior from a large hybrid model across both printed and handwritten Hindi.\n\nThat is why recent low-resource Indic OCR work uses PEFT and synthetic data, and why public Indic handwriting resources are much larger. This is not a sign that your approach is bad; it is a sign that your current data regime is very small. 
(cvit.iiit.ac.in)\n\nSo my recommendation is:\n\n * keep the 210 samples for trainer and decoder debugging,\n * move to 1000+ line crops once the decoder stops collapsing into sentinels and repetitions.\n\n\n\n* * *\n\n## My final recommendation stack\n\nIf I compress everything into the clearest sequence, it is this:\n\n### First\n\nKeep the architecture.\n\n### Second\n\nTrain:\n\n * encoder frozen,\n * all cross-attention trainable,\n * last 2 decoder blocks trainable,\n * `lm_head`,\n * `shared`,\n * projection.\n\n\n\n### Third\n\nSuppress `<extra_id_n>` tokens during validation and inference with `bad_words_ids`.\n\n### Fourth\n\nUse easy vs hard probe sets and log:\n\n * sentinel-token rate,\n * length ratio,\n * target token length.\n\n\n\n### Fifth\n\nIf instability persists, move to **LoRA on decoder-side blocks**.\n\n### Sixth\n\nScale to 1000+ line crops once decoder behavior becomes sane.\n\n* * *\n\n## My one-sentence summary\n\nYour problem is **not** “wrong tokenizer.”\nIt is **decoder fallback to T5 priors plus under-adapted generation on a tiny mixed dataset**.\n\nSo my recommended solution is:\n\n> **Keep the current setup, broaden decoder-side adaptation to the last 2 decoder blocks plus all cross-attention, suppress sentinel tokens during validation, and use LoRA next if the decoder is still unstable.**\n\nThat is the path I think best fits what you are seeing now.",
"title": "Looking for guidance. Trying to create a model with TrOCR's encoder + Google's mT5 multilingual decoder but model fails to overfit on a single data sample"
}