Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreify645xk4tmfzxvz6jzjelajz3xb67jegjksvcchldqtyp4ase5a4",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mhxqgkroiw22"
  },
  "path": "/t/looking-for-guidance-trying-to-create-a-model-with-trocrs-encoder-googles-mt5-multilingual-decoder-but-model-fails-to-overfit-on-a-single-data-sample/174634#post_2",
  "publishedAt": "2026-03-26T12:42:46.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Hugging Face",
    "GitHub",
    "Hugging Face Forums"
  ],
  "textContent": "Hmm…\n\n* * *\n\nThis is fixable enough to keep exploring, but the main problem is probably **not the tokenizer itself**. The bigger problem is that your current experiment combines a TrOCR encoder that was fine-tuned for **English single-line handwriting** , a custom **mT5-as-decoder-only** wiring path, and a difficult Hindi OCR target. That is a fragile combination. Hugging Face’s encoder-decoder docs explicitly warn that when you combine a pretrained encoder and a different decoder, the **cross-attention layers may be randomly initialized** and must be learned during fine-tuning. They also show that the supported decoder path is usually a decoder model configured for cross-attention, not a full seq2seq model hacked into decoder-only use. (Hugging Face)\n\n## The most important conclusion\n\nI do **not** think your experiment proves that “TrOCR encoder + Hindi-capable decoder cannot work.” I think it proves that your **current wiring and training regime are too unstable** to make that judgment. The fact that loss drops at all means the image path, label path, and cross-modal connection are at least partially alive. The repeated characters point more toward **autoregressive instability** than “complete failure.” Repetition is also a known failure mode in TrOCR-style generation, especially when decoder setup or generation config is off. (Hugging Face)\n\n## What is going wrong in your current Colab\n\nFrom the code you shared, these are the biggest issues.\n\n### 1. Your notebook says `mt5-small`, but the code loads `mt5-base`\n\nThat is not a cosmetic detail. `mt5-base` is materially larger and harder to stabilize than `mt5-small`. For a one-sample overfit test, you want the smallest model that can still express the task. Using a larger multilingual decoder makes the bridge-learning problem harder, not easier.\n\n### 2. 
You are starting from `trocr-base-handwritten`, which is already specialized\n\nThe public model card says `microsoft/trocr-base-handwritten` is a TrOCR model **fine-tuned on the IAM dataset**. The updated README also says it works best on **single-line handwritten English text** and is **not optimized for printed text** or multi-line inputs. For a language swap, `trocr-base-stage1` or `trocr-small-stage1` is usually a cleaner starting point because those are the **pre-trained only** checkpoints rather than the already English-finetuned handwritten checkpoint. (Hugging Face)\n\n### 3. The mT5 wiring path is custom, and that matters\n\nYou are not using the standard `VisionEncoderDecoderModel.from_encoder_decoder_pretrained(...)` path. Instead, you replace the mT5 encoder with a dummy module and feed `encoder_outputs` directly into `MT5ForConditionalGeneration`. That can work, but public Hugging Face issue history shows that using **T5 or ByT5 as decoder-only** for OCR is still a custom workaround path, not the most standard one. There is a dedicated issue where a user had to create a `T5DecoderOnlyForCausalLM` subclass for this exact reason. (GitHub)\n\n### 4. Your one-sample overfit test is not a clean overfit test\n\nIn your code, the one-sample test uses:\n\n  * full trainable model\n  * `AdamW(lr=1e-3)`\n  * only `150` steps\n  * beam search during evaluation\n\n\n\nThat is too aggressive and too noisy. T5-family docs say that with AdamW, values around **`1e-4` to `3e-4`** typically work well, and they note that T5 was pretrained with **Adafactor**. Also, for T5 and mT5, the correct decoder start behavior is to use **`pad_token_id`**. (Hugging Face)\n\nSo your current test is mixing three confounders:\n\n  * LR is likely too high for this hybrid.\n  * The decoder is larger than needed.\n  * Beam search is a poor judge of early training quality.\n\n\n\n### 5. 
Decoder masking is too implicit\n\nThe official encoder-decoder implementation uses **shift-right** logic to build decoder inputs from labels. There is also a recent Transformers issue pointing out that in `VisionEncoderDecoderModel`, users observed that `decoder_attention_mask` was not always created the way they expected when labels were shifted into decoder inputs. In a custom hybrid like yours, I would not leave this implicit. I would create `decoder_input_ids` and `decoder_attention_mask` explicitly. (GitHub)\n\n## My recommendation for your current setup\n\nKeep the overall idea for now, but simplify the experiment hard.\n\n### Recommended first rebuild\n\nUse:\n\n  * `microsoft/trocr-small-stage1` or `microsoft/trocr-base-stage1`\n  * `google/mt5-small`\n  * explicit decoder inputs and decoder attention mask\n  * frozen encoder at first\n  * greedy decoding\n  * lower LR\n  * longer one-sample training\n\n\n\nWhy this version first:\n\n  * `stage1` is a cleaner visual warm start than the English IAM handwritten checkpoint for a decoder swap. (Hugging Face)\n  * `mt5-small` is easier to stabilize than `mt5-base`.\n  * mT5 already supports Hindi tokenization and uses `pad_token_id` as the decoder start token, so the tokenizer is not the core blocker. (Hugging Face)\n\n\n\n## Concrete changes I would make\n\n### A. Change the checkpoints\n\nUse:\n\n\n    trocr = VisionEncoderDecoderModel.from_pretrained(\"microsoft/trocr-small-stage1\")\n    mt5_model = MT5ForConditionalGeneration.from_pretrained(\"google/mt5-small\")\n    tokenizer = AutoTokenizer.from_pretrained(\"google/mt5-small\")\n    image_processor = ViTImageProcessor.from_pretrained(\"microsoft/trocr-small-stage1\")\n\n\nThis removes two sources of instability at once: an over-specialized English handwritten checkpoint and an unnecessarily large decoder. The `stage1` models are the pre-trained-only TrOCR checkpoints. (Hugging Face)\n\n### B. 
Set both model config and generation config\n\nDo this:\n\n\n    model.mt5.config.decoder_start_token_id = tokenizer.pad_token_id\n    model.mt5.config.pad_token_id = tokenizer.pad_token_id\n    model.mt5.config.eos_token_id = tokenizer.eos_token_id\n    model.mt5.config.use_cache = False\n\n    model.mt5.generation_config.decoder_start_token_id = tokenizer.pad_token_id\n    model.mt5.generation_config.pad_token_id = tokenizer.pad_token_id\n    model.mt5.generation_config.eos_token_id = tokenizer.eos_token_id\n\n\nmT5 uses `pad_token_id` to start decoder generation. That part of your code is conceptually right, but I would set `generation_config` too. (Hugging Face)\n\n### C. Make decoder inputs explicit\n\nInside `forward`, do not rely only on `labels=...` to do everything.\n\n\n    def forward(self, pixel_values, labels=None):\n        hidden = self._encode(pixel_values)\n\n        decoder_input_ids = None\n        decoder_attention_mask = None\n\n        if labels is not None:\n            decoder_input_ids = self.mt5._shift_right(labels)\n            decoder_attention_mask = (decoder_input_ids != self.mt5.config.pad_token_id).long()\n            # The start token equals pad_token_id, so unmask position 0 explicitly.\n            decoder_attention_mask[:, 0] = 1\n\n        return self.mt5(\n            encoder_outputs=BaseModelOutput(last_hidden_state=hidden),\n            decoder_input_ids=decoder_input_ids,\n            decoder_attention_mask=decoder_attention_mask,\n            labels=labels,\n            use_cache=False,\n        )\n\n\nThis makes the training path less ambiguous, and it lines up with how encoder-decoder training is supposed to work conceptually: labels are shifted right into decoder inputs. (GitHub)\n\n### D. Freeze the encoder first\n\nAt the beginning, the fragile part is the **bridge**, not the vision backbone. 
So start with:\n\n\n    for p in model.encoder.parameters():\n        p.requires_grad = False\n\n    for name, p in model.mt5.named_parameters():\n        p.requires_grad = (\n            (\"EncDecAttention\" in name) or\n            (\"lm_head\" in name) or\n            (\"shared\" in name)\n        )\n\n    if model.enc_to_dec_proj is not None:\n        for p in model.enc_to_dec_proj.parameters():\n            p.requires_grad = True\n\n\nThis follows directly from the encoder-decoder warm-start logic: the cross-attention bridge is new and needs to be learned carefully. (Hugging Face)\n\n### E. Fix the one-sample overfit protocol\n\nFor the one-sample proof, use:\n\n  * `lr=1e-4`\n  * `weight_decay=0.0`\n  * no dropout\n  * greedy decode\n  * `500` to `1000` steps\n  * teacher-forced token accuracy\n\n\n\nExample:\n\n\n    optimizer = torch.optim.AdamW(\n        [p for p in model.parameters() if p.requires_grad],\n        lr=1e-4,\n        weight_decay=0.0,\n    )\n\n    for m in model.modules():\n        if isinstance(m, nn.Dropout):\n            m.p = 0.0\n\n    for step in range(1, 1001):\n        outputs = model(pixel_values=pv, labels=lb)\n        loss = outputs.loss\n\n        optimizer.zero_grad()\n        loss.backward()\n        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)\n        optimizer.step()\n\n        if step % 20 == 0:\n            model.eval()\n            with torch.no_grad():\n                tf_outputs = model(pixel_values=pv, labels=lb)\n                tf_pred = tf_outputs.logits.argmax(-1)\n                mask = lb != -100\n                token_acc = (tf_pred[mask] == lb[mask]).float().mean().item()\n\n                gen_ids = model.generate(\n                    pixel_values=pv,\n                    max_new_tokens=int(mask.sum().item()) + 4,\n                    num_beams=1,\n                    do_sample=False,\n                )\n                pred = tokenizer.decode(gen_ids[0], skip_special_tokens=True)\n\n            
print(step, loss.item(), token_acc, pred)\n            model.train()\n\n\nThe T5 docs support the lower LR recommendation. Greedy decoding removes beam-search noise from the diagnosis. (Hugging Face)\n\n## What success should look like\n\nDo **not** judge by loss alone.\n\nFor a one-sample test, success is:\n\n  1. teacher-forced token accuracy approaches `1.0`\n  2. greedy decoded text becomes an exact match\n  3. it stays stable for multiple checks\n\n\n\nIf loss goes down but token accuracy stays mediocre, the bridge is not learning properly. If token accuracy gets high but free decoding still loops, the model is learning under teacher forcing but autoregressive generation is unstable.\n\n## About the tokenizer question\n\nThe practical answer is:\n\n> Do not think “which tokenizer works with TrOCR encoder?”\n>  Think “which **decoder family** works best with the TrOCR encoder?”\n\nThe tokenizer comes with the decoder family.\n\n### Best current options, in order\n\n### 1. **XLM-R decoder**\n\nThis is the cleanest TrOCR-style multilingual path inside Transformers. Hugging Face’s public decoder-replacement guidance explicitly shows replacing TrOCR’s decoder with `RobertaForCausalLM.from_pretrained(\"xlm-roberta-base\", is_decoder=True, add_cross_attention=True)`. That is the most standard multilingual replacement route. (Hugging Face Forums)\n\nWhy it is attractive:\n\n  * closer to the standard `VisionEncoderDecoderModel` recipe\n  * easier than custom T5-decoder-only plumbing\n  * multilingual tokenizer already available\n\n\n\n### 2. **IndicBART**\n\nIf your real target is Hindi and perhaps other Indian languages, this is one of the strongest alternatives. IndicBART is a multilingual seq2seq model focused on **11 Indian languages plus English**. There is also a public `trocr-indic` model built around IndicBART, and it explicitly supports Hindi, though it notes a Devanagari-script limitation in the released setup. 
(Hugging Face)\n\nWhy it is attractive:\n\n  * more language-focused for Indic text than mT5\n  * seq2seq architecture fits OCR-style generation naturally\n  * smaller and more targeted than `mt5-base`\n\n\n\n### 3. **ByT5**\n\nByT5 is tokenizer-free and works directly on UTF-8 bytes. The model docs say it is more robust to noise and can process any language without a separate tokenizer vocabulary. That is interesting for OCR because OCR errors often look like noisy character sequences. (Hugging Face)\n\nWhy it is attractive:\n\n  * no tokenizer coverage problem\n  * strong fit for noisy OCR text\n\n\n\nWhy I would not pick it first:\n\n  * sequence lengths are longer\n  * it still lives in the T5 family, so the decoder-only integration pain remains\n\n\n\n### 4. **Stay with mT5**\n\nThis is still viable. mT5 covers 101 languages and already supports Hindi tokenization. I would keep it only after fixing the wiring and training regime first. (Hugging Face)\n\n## My recommendation on alternatives\n\nIf your goal is **the least risky next step** , I would rank them like this:\n\n  1. **TrOCR encoder + XLM-R decoder**\n  2. **TrOCR encoder + IndicBART**\n  3. **TrOCR encoder + repaired mT5-small setup**\n  4. **ByT5 experiment only after the above**\n\n\n\nThat ranking is based on current Hugging Face implementation guidance and public issue history. The T5 decoder-only route is the least standard of the four. (Hugging Face Forums)\n\n## For your final end goal: complex documents\n\nThis part matters a lot.\n\nYour target is not just Hindi recognition. It is **handwritten + printed Hindi in complex documents**. The public TrOCR model card and discussion history strongly suggest that the handwritten checkpoint is best on **single text-line inputs** , and users doing full-page OCR typically detect or crop regions first, then run TrOCR on those crops. 
(Hugging Face)\n\nSo I would not design the final system as “single recognizer eats full page.” I would design it as:\n\n  1. text-region detection\n  2. line grouping or crop extraction\n  3. Hindi recognizer on each crop\n  4. merge results\n\n\n\nFor more document-native approaches, Donut is worth tracking because it is an OCR-free document model, but that is a different design choice from a recognizer-focused OCR pipeline. (Hugging Face)\n\n## My blunt recommendation\n\nFor your case, I would do this next:\n\n### Path A. Repair your current experiment\n\n  * switch to `trocr-small-stage1`\n  * switch to `mt5-small`\n  * explicit decoder inputs and mask\n  * freeze encoder\n  * `lr=1e-4`\n  * greedy decode\n  * 1000-step one-sample overfit\n\n\n\n### Path B. If that still fails\n\nStop tuning repetition penalties. Move to:\n\n  * `VisionEncoderDecoderModel`\n  * `xlm-roberta-base` as decoder using `RobertaForCausalLM`\n  * matching tokenizer for labels\n\n\n\nThat is the cleanest multilingual TrOCR path publicly documented by Hugging Face. (Hugging Face Forums)\n\n### Path C. If Hindi quality is still weak\n\nTry `IndicBART` next, because it is actually designed around Indic languages rather than broad multilingual coverage. (Hugging Face)\n\n## Final answer\n\nYour current result does **not** tell me “the idea is wrong.” It tells me:\n\n  * you have a partially working pipeline,\n  * your current overfit test is too unstable to trust,\n  * the tokenizer is probably **not** the main blocker,\n  * the biggest issue is the **bridge + training regime** ,\n  * and for your final use case, you should treat recognition and document layout as separate problems. (Hugging Face)\n\n\n\nThe fastest high-value move is to rebuild the one-sample test in the smaller, cleaner form above. If that passes, then your architecture is viable. If it still does not pass, switch decoder family before spending more time on hyperparameter tweaking.",
  "title": "Looking for guidance. Trying to create a model with TrOCR's encoder + Google's mT5 multilingual decoder but model fails to overfit on a single data sample"
}