Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreigdh65j4iyducaiapcxp2y6odvdgidvcnhoccy5icl4qskvmuenji",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mhxcxx7qwlq2"
  },
  "path": "/t/looking-for-guidance-trying-to-create-a-model-with-trocrs-encoder-googles-mt5-multilingual-decoder-but-model-fails-to-overfit-on-a-single-data-sample/174634#post_1",
  "publishedAt": "2026-03-26T08:14:24.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "colab.research.google.com",
    "Google Colab"
  ],
  "textContent": "Hi everyone,\n\nI am working on building a proof of concept for OCR system that can recognize both handwritten and printed Hindi (Devanagari) text in complex documents. I’m trying to build on top of TrOCR (`microsoft/trocr-base-handwritten`) since it already has a strong vision encoder trained for handwriting recognition.\n\nThe core problem I’m running into is on the decoder/tokenizer side — TrOCR’s default decoder and tokenizer are trained for English only, and I need Hindi output.\n\n**What I’ve tried so far:**\n\nI replaced TrOCR’s decoder with `google/mt5-small`, which natively supports Hindi tokenization. The hidden sizes matched, so I expected this to work.\n\nHowever, the model failed to overfit even on a single data point. The loss comes down but hovers at near 2-3 at the end, and the characters keep repeating instead of forming a meaningful word or the sentence. I have tried changing learning rate, introducing repetition penalty but overfitting just don’t happen.\n\nMy code is over here. Check cell 1 to cell 17:\n\ncolab.research.google.com\n\n### Google Colab\n\nI need guidance as is their any other tokenizer out there that can work well with TrOCR’s encoder or can you help me improve in this current setup (TrOCR’s encoder+Decoder).",
  "title": "Looking for guidance. Trying to create a model with TrOCR's encoder + Google's mT5 multilingual decoder but model fails to overfit on a single data sample"
}