
Looking for guidance. Trying to create a model with TrOCR's encoder + Google's mT5 multilingual decoder but model fails to overfit on a single data sample

Hugging Face Forums [Unofficial] March 31, 2026

John6666:

mt5-base

Thanks a lot! @John6666

The corrections and ideas you suggested worked on the overfit test: the model now consistently produces the correct prediction on the single data sample I was training it on. With your help, I no longer need to look for a new tokenizer, since I understood that the problem was in how I had wired the encoder and decoder together. Now I am trying to improve this same setup to get some meaningful results.

I want to proceed with the next steps I had in mind. I have even tried them, but I am a bit stuck and confused.

My plan is a proof of concept on a dataset of just 210 samples of printed and handwritten Hindi line crops, with a training class that trains the model on it, to see whether it can learn something meaningful.

About 160 to 165 samples would be the training set and the rest the validation set. I just want something that more or less proves the pipeline is ready for gradually larger datasets, up to thousands of samples.
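For reference, the split described above could be sketched roughly like this. It is only an illustration: the file names, the `split_samples` helper, and the exact training size of 160 are assumptions, not taken from the notebook.

```python
import random


def split_samples(samples, train_size, seed=0):
    """Shuffle with a fixed seed and split into (train, validation) lists.

    Hypothetical helper for illustration; not part of any library.
    """
    if not 0 < train_size < len(samples):
        raise ValueError("train_size must be between 1 and len(samples) - 1")
    shuffled = list(samples)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    return shuffled[:train_size], shuffled[train_size:]


# Assumed placeholder names standing in for the 210 Hindi line crops.
samples = [f"line_{i:03d}.png" for i in range(210)]
train, val = split_samples(samples, train_size=160)
print(len(train), len(val))
```

Fixing the seed keeps the validation set identical across runs, so CER values from different experiments stay comparable.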

I tried this today by creating a training class and training all parameters, which I think is around 400 million.

The results were not good. The training loss went down consistently, but I was also hoping the character error rate (CER) would drop below 1.0 after 50 epochs, and that didn’t happen: it started at around 1.2 or 1.3, went down to 0.98 at its best, and ended at around 1.5 after 50 epochs.
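For context on those numbers, here is a minimal sketch of how CER is commonly defined: the character-level edit distance divided by the reference length. The notebook may compute it differently (for example through a metrics library), so treat the helper names below as assumptions. The sketch shows that CER is not capped at 1.0: a prediction much longer than the reference adds insertion errors and pushes it above 1.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by the reference length.

    Note: Python strings count Unicode code points, so Devanagari vowel
    signs and conjuncts are counted as separate characters here.
    """
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


print(cer("abc", "abc"))       # identical strings
print(cer("ab", "abababab"))   # over-long prediction: 6 insertions / 2 characters
```

Under this definition, a CER that stays above 1.0 often means the predictions are much longer than the references, for instance when generation keeps repeating tokens instead of stopping at the end-of-sequence token. That is only one possible reading and would need to be checked against the actual decoded outputs.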

I looked for ways to improve this with Claude. I depend on Claude a lot because of my own knowledge gaps, but it is making things somewhat more confusing.

It suggested that I freeze the decoder and train only the encoder. This reduced the trainable parameters to around 262 million out of 431 million total, but that didn’t work either: CER remained above 1 after 50 epochs.

I want to ask:

  1. Is my approach of training on just 210 samples for a proof of concept sound, or should I use a larger dataset with at least 1000 Hindi line crops?

  2. If it is sound, what scores should I aim for on the CER and WER metrics? What CER and WER would be acceptable in this training run to conclude that my pipeline is ready for a larger dataset?

  3. One last question: I don’t understand which layers to freeze, or how exactly to train the whole model to see some improvement in the results. Could you suggest how to design the training class correctly?

Attaching my notebook URL here (the last 2 cells are the training code):

colab.research.google.com

Google Colab
