Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreig6mrvrgxnl72bslb6mfwfhkchypzkx722exk4d7ioikaawi33me4",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mgsjiyj5qy72"
  },
  "path": "/t/need-help-in-fine-tuning-of-ocr-model-at-production-grade/174196#post_1",
  "publishedAt": "2026-03-11T17:40:16.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "Hi Guys,\n\nI recently got a project for making a Document Analyzer for complex scanned documents.\n\nThe documents contain mix of printed + handwritten English and Indic (Hindi, Telugu) scripts. Constant switching between English and Hindi, handwritten values filled into printed form fields also overall structures are quite random, unpredictable layouts.\n\nI am especially struggling with the handwritten and printed Indic languages (Hindi-Devnagari), tried many OCR models but none are able to produce satisfactory results.\n\nThere are certain models that work really well but they are hosted or managed services. I wanted something that I could host on my own since data cannot be sent to external APIs for compliance reasons\n\nI was thinking of a way where i create an AI pipeline like preprocessing->layout detection-> use of multiple OCR but i am bit less confident with this method for the sole reason that most OCRs i tried are not performing good on handwritten indic texts.\n\nI thought creating dataset of our own and fine-tuning an OCR model on it might be our best shot to solve this problem.\n\nBut the problem is that for fine-tuning, I don’t know how or where to start, I am very new to this problem. I have these questions:\n\n  * **Dataset format** : Should training samples be word-level crops, line-level crops, or full form regions?\n\n  * **Dataset size** : How many samples are realistically needed for production-grade results on mixed Hindi-English handwriting?\n\n  * **Mixed script problem** : If I fine-tune only on handwritten Hindi, will the model break on printed text or English portions? Should the dataset deliberately include all variants? If yes then what percentage of each (handwritten indic and english, printed indic and english?)\n\n  * **Model selection** : Which base model is best suited for fine-tuning on Devanagari handwriting? TrOCR, PaddleOCR, something else?\n\n\n\n\nPlease share some resources, or tutorial or guidance regarding this problem.\n\nThanks in advance!",
  "title": "Need help in fine-tuning of OCR model at production grade"
}