
Creating a Cyrillic (Bulgarian) Handwritten OCR Dataset - Guidance needed

Hugging Face Forums [Unofficial] March 30, 2026

Hello! I am just starting to create an OCR dataset for handwritten Bulgarian text (Cyrillic script). After some back and forth with LLMs and searching for existing HTR dataset guidelines and best practices, I’ve settled on the following setup:

  • Label Studio running in a Docker container on my Windows laptop.

  • Text is labeled at the line level with a simple bounding box that should cover most of the current row. If a line is too curved, I split it into several boxes.

  • Illegible words are not included in a bounding box
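For anyone who wants to work with these line boxes outside the labeling tool, here is a minimal sketch of turning one exported rectangle into pixel coordinates for cropping. It assumes the standard Label Studio JSON export, where each rectangle result carries `original_width`, `original_height`, and a `value` with `x`, `y`, `width`, `height` given as percentages of the image size. The helper name `rect_to_pixels` and the sample values are made up for illustration, and rotation is ignored.

```python
def rect_to_pixels(result):
    """Convert one rectangle result to (left, top, right, bottom) in pixels.

    Assumes percentage-based coordinates, as in Label Studio's JSON export.
    Rotation is not handled in this sketch.
    """
    value = result["value"]
    img_w = result["original_width"]
    img_h = result["original_height"]

    # Percentages are relative to the full image width/height.
    left = value["x"] * img_w / 100.0
    top = value["y"] * img_h / 100.0
    right = left + value["width"] * img_w / 100.0
    bottom = top + value["height"] * img_h / 100.0
    return (round(left), round(top), round(right), round(bottom))


# Hypothetical result entry, shaped like one item of an exported "result" list.
example = {
    "type": "rectanglelabels",
    "original_width": 2000,
    "original_height": 3000,
    "value": {"x": 10.0, "y": 20.0, "width": 50.0, "height": 5.0, "rotation": 0},
}

box = rect_to_pixels(example)
print(box)
```

The resulting tuple uses the (left, top, right, bottom) order that common image libraries accept for cropping, so each line box can become its own training image.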

Sample image and transcription:

For now the dataset includes about 500 images of student notebooks, and I am in the process of acquiring more. It has about 15 writing styles (ranging from very pretty and legible to the thing above), and the images are of varying quality, mostly on the higher-quality side.

I have a few questions on which I need guidance:

  • Do I have to switch to a more fine-tuned polygon method?
  • Some of the images contain English text at times. I have just transcribed it inline as it appears: “normal transcription english normal transcription”. Are there issues with this?
  • Some of the images contain more advanced formatting, such as formulas like x^2. For now I have just skipped those.
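On the mixed-language question, one option is to keep the English inline but flag which lines mix scripts, so they can be filtered or analysed separately later. The sketch below is only an illustration of that idea: it classifies letters by their Unicode character names using the standard library, and the function names `script_counts` and `is_mixed` are made up here.

```python
import unicodedata


def script_counts(text):
    """Count Cyrillic, Latin, and other letters in a transcription."""
    counts = {"cyrillic": 0, "latin": 0, "other": 0}
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, spaces, punctuation, symbols like ^
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            counts["cyrillic"] += 1
        elif name.startswith("LATIN"):
            counts["latin"] += 1
        else:
            counts["other"] += 1
    return counts


def is_mixed(text):
    """True if the line contains both Cyrillic and Latin letters."""
    counts = script_counts(text)
    return counts["cyrillic"] > 0 and counts["latin"] > 0


# Hypothetical transcription line with an English word in the middle.
line = "Това е test на почерк"
print(script_counts(line), is_mixed(line))
```

A check like this also catches accidental Latin look-alike letters typed into otherwise Cyrillic transcriptions, which are easy to miss by eye.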

Considering the amount of effort it will take:

  • Is such a dataset even worth creating?

    • It is quite niche: Bulgarian, and handwriting at that.
    • LLMs with vision capabilities sometimes manage to transcribe accurately. But in less-than-perfect cases they tend to produce something that fits with the rest of the words, and it is incorrect most of the time.

I have tried using Russian HTR models, for example the ones provided by Transkribus, but their results are pretty much unusable. I also spent an evening setting up EasyOCR in the Label Studio Docker container, though the bounding boxes from it weren’t helpful at all. That one is most definitely a me issue, though.
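To put a number on "unusable", a common measure for HTR output is the character error rate (CER): the edit distance between the model output and the ground-truth transcription, divided by the length of the ground truth. Below is a minimal pure-Python sketch. The function names are my own, and this is the plain Levenshtein formulation with no text normalization, not the scoring used by any particular tool.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from a
                curr[j - 1] + 1,     # insert a character into a
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate of hypothesis against the reference text."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical pair: ground truth vs. a model output with one wrong letter.
print(cer("abcd", "abed"))
```

Running this over a small labeled sample would allow a direct comparison between the Russian models, a vision LLM, and any model later trained on the new dataset.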

Thanks in advance for your time and answers!
