Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreidmcdjwkag5dvec3tjkewuzcpny4p3shyywzautaatvd4i3xocq24",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3midbrzw7sfc2"
  },
  "path": "/t/creating-a-cyrillic-bulgarian-handwritten-ocr-dataset-guidance-needed/174821#post_2",
  "publishedAt": "2026-03-30T23:39:30.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "arXiv",
    "Kraken",
    "Springer Link",
    "Transkribus Help Center",
    "Science Direct",
    "GitHub"
  ],
  "textContent": "I think it is worth creating, too:\n\n* * *\n\nYour setup is a good **version 1**.\n\nThe right mental model is not “build a perfect OCR system for everything Bulgarian handwritten.” It is “build a **line-level handwritten text recognition dataset** for Bulgarian student notebooks.” That is a much more realistic target, and it matches how current HTR work is still organized: line-level recognition remains a core task, while page-level layout and end-to-end document reading are treated as a harder, separate layer. (arXiv)\n\n## Overall answer\n\nMy answers to your main questions are:\n\n  * **No, you do not need to switch to polygons now.**\n  * **Occasional English is fine. Keep it as written.**\n  * **Do not skip all notation. Keep simple inline forms like `x^2`, but exclude complex math in v1.**\n  * **Yes, the dataset is worth creating.**\n  * **The poor results from Russian models and EasyOCR are not surprising.**\n\n\n\nThat is the shortest correct answer. The reasons matter.\n\n## 1. Your current annotation design is basically correct\n\nYou are annotating **line by line**, with one box and one transcription. That is a standard and sensible choice for a first HTR dataset. Kraken’s training documentation makes the separation very explicit: **segmentation** is about finding lines and regions on the page, while **recognition** is about converting **line images into text**. For recognition, the target unit is the text contained in a line. (Kraken)\n\nThat matters because it means your boxes are not a compromise that ruins the project. They are a practical way to create the exact supervision signal a recognizer needs.\n\nYour current strategy also fits the way the recent HTR survey frames the field. It explicitly distinguishes between work **up to line level** and work **beyond line level**. Your project belongs in the first category, and that is a good place to start. (arXiv)\n\n## 2. 
Do you need polygons or baselines?\n\nNot yet.\n\nYou only need to move from rectangles to polygons or baselines when **layout becomes the bottleneck**. The recent line-segmentation survey is useful here because it explains the different representations clearly: text lines may be represented by **bounding boxes, polygons, or baselines**, and the right choice depends on the extraction problem, not on abstract purity. It also stresses that text-line extraction matters because it affects downstream HTR accuracy. (Springer Link)\n\nFor your case, rectangles are enough when:\n\n  * one box mostly contains one line\n  * neighboring lines do not overlap too much\n  * ascenders and descenders are not being chopped off\n  * the line is not so curved that the crop includes too much irrelevant text\n\n\n\nSo I would keep your current approach for most pages.\n\nI would only introduce polygons or baselines for a **hard subset** where one of these keeps happening:\n\n  * the line is strongly curved\n  * neighboring lines touch\n  * slant or perspective makes a rectangle inefficient\n  * a box must include too much of the line above or below\n\n\n\nThat gives you the right tradeoff: fast annotation for most of the corpus, extra detail only where it clearly buys accuracy. (Kraken)\n\n## 3. English inside Bulgarian lines\n\nThis is not a problem by itself.\n\nTranskribus’ data-preparation guide says a model can be trained to recognize **two or more hands, languages, types of writing, or alphabets at the same time**, but those variants must appear in the ground truth in a representative way. In other words, mixed Bulgarian and English is allowed. The real issue is not the existence of English. The real issue is whether it appears often enough, and whether it is transcribed consistently. 
(Transkribus Help Center)\n\nSo your current policy is good:\n\n> normal transcription english normal transcription\n\nThat is better than inventing special markers around English words, because special markers would become part of the target text.\n\nWhat I would add is metadata. Mark lines as:\n\n  * `bg`\n  * `en`\n  * `mixed`\n\n\n\nThat gives you a way to evaluate later whether mixed-script lines are harder than pure Bulgarian lines. Without that tag, you will not know whether the model is failing because of handwriting difficulty or script mixing. That is an inference from the current guidance on representative ground truth and the known importance of domain mismatch in HTR. (Transkribus Help Center)\n\n## 4. What to do with formulas like `x^2`\n\nYou should not skip all of them.\n\nTranskribus recommends a consistent, accurate transcript that reflects what is on the page, and it explicitly discusses the value of a **diplomatic transcription** where punctuation, superscripts, and subscripts are transcribed as they appear. It also notes that, if conventions are consistent enough, the model can learn them. (Transkribus Help Center)\n\nThat means simple inline notation should stay in the dataset:\n\n  * `x^2`\n  * `a+b`\n  * `y=7`\n  * dates\n  * percentages\n  * short Latin variable names\n\n\n\nBut complex mathematical layout is different. CROHME exists precisely because handwritten mathematical expression recognition is treated as its own task, not just ordinary OCR with a few special characters added. (Transkribus Help Center)\n\nSo the practical rule should be:\n\n  * **Keep simple inline notation** in the main dataset.\n  * **Use one encoding consistently**. For example, always `x^2`, not sometimes `x^2` and sometimes `x²`.\n  * **Exclude or separately flag complex displayed formulas** in version 1.\n\n\n\nThat keeps the task coherent.\n\n## 5. 
Illegible words\n\nYour current rule needs one refinement.\n\nRight now you say illegible words are not included in a bounding box. That is workable only if the rule is consistent. The Transkribus guide is blunt: ground truth should be as accurate as possible, because mistakes in ground truth teach the model the wrong thing. It also repeatedly stresses consistency of editorial choices. (Transkribus Help Center)\n\nA cleaner rule would be:\n\n  * if the **whole line** is too unclear, exclude the line\n  * if only **one short span** is unclear, use one fixed unreadable-span convention\n  * if uncertainty is frequent in that line, exclude it from the core training set\n\n\n\nThe exact placeholder matters less than consistency.\n\n## 6. Is the dataset worth creating?\n\nYes.\n\nThis is the strongest part of the answer.\n\n### Why it is worth it\n\nCurrent HTR models still suffer from **distribution shift**. A 2025 study on HTR generalization found that out-of-distribution performance drops are driven first by **textual divergence** and then by **visual divergence**. That is directly relevant to you. “Cyrillic” is not enough. “Handwriting” is not enough. “Russian HTR” is not enough. Bulgarian student notebooks have their own text distribution, spelling, symbols, classroom notation, page layout, and writer habits. (arXiv)\n\nThere is also still visible scarcity in Bulgarian OCR resources. A recent Bulgarian paper describes creating the **first benchmark dataset** for OCR text correction in historical Bulgarian orthography, which is a strong sign that Bulgarian OCR remains under-resourced enough that new datasets still matter. That paper is about historical print correction, not modern handwriting, but that actually strengthens the case: the public ecosystem is still building core Bulgarian resources rather than already being saturated. (Springer Link)\n\nSo yes, the niche is real. 
That is exactly why the dataset is valuable.\n\n### Why vision LLMs do not remove the need\n\nA 2025 benchmark of large language models for handwritten text recognition found that these models perform strongly on English, more weakly on other languages, and do **not** show a significant self-correction capability. The comparison with Transkribus-style models was mixed rather than uniformly in favor of LLMs. (Science Direct)\n\nThat matches your observation very well. A vision model can produce something plausible. But “plausible” is not the same as “faithful transcription.” For OCR and HTR, exact character fidelity matters.\n\n## 7. Why the Russian HTR models failed on your pages\n\nThat result is not surprising.\n\nThe current HTR evidence says transfer depends on more than script. Writer style, domain, lexicon, notation, and image conditions all matter. The OOD study above is the core reason. A Russian model may know Cyrillic strokes, but still fail on Bulgarian school notebooks because the **target distribution is different**. (arXiv)\n\nThe same logic explains why EasyOCR was not very helpful. EasyOCR describes itself as a **general OCR** system that reads scene text and dense document text. It is broad and convenient, but it is not a handwriting-specialized, notebook-line HTR system. That makes it a poor fit for your exact use case, especially for generating useful line boxes on messy handwritten pages. (GitHub)\n\nSo no, this does not look like a “you issue.”\n\n## 8. Is 500 images enough?\n\nIt is enough to start. It is not enough to be done.\n\nTranskribus recommends **5,000 to 15,000 words** as a starting range, around **25 to 75 pages**, and specifically advises **at least 10,000 words for each hand** for handwritten documents. It also says that models trained on much larger multi-hand corpora can start to generalize to unseen hands, though with weaker performance than on in-domain validation. 
(Transkribus Help Center)\n\nThat means your current collection is likely enough for:\n\n  * defining a transcription policy\n  * building a pilot model\n  * discovering the main failure modes\n\n\n\nBut it is probably not enough to support a strong claim like “general Bulgarian handwriting OCR” yet.\n\nThe biggest scaling priority from here is probably **more writers**, not just more pages from the same writers. That follows directly from the per-hand guidance above. (Transkribus Help Center)\n\n## 9. What matters most in your project\n\nThe most important decision is **not** box versus polygon.\n\nIt is **transcription policy consistency**.\n\nThe Transkribus guide says this several times in different ways:\n\n  * ground truth must be accurate\n  * editorial choices must be consistent\n  * the most common approach is a consistent transcript that accurately represents what is read\n  * if conventions are consistent enough, the model can learn them (Transkribus Help Center)\n\n\n\nFor your case, the highest-value work is writing a short policy that fixes decisions for:\n\n  * spaces\n  * punctuation\n  * capitalization\n  * English words\n  * digits\n  * inline formulas\n  * strike-throughs\n  * unreadable spans\n  * end-of-line hyphenation\n\n\n\nThat will help your model more than switching every line to a polygon.\n\n## 10. 
My concrete recommendation for your exact setup\n\nI would do this:\n\n### Keep\n\n  * line-level annotation\n  * one transcription per line\n  * rectangular boxes for most data\n  * English inline as written\n  * simple inline notation such as `x^2`\n\n\n\n### Add\n\n  * metadata: `writer_id`, `page_id`, `language_tag`, `quality_tag`, `notation_tag`\n  * a one-page transcription handbook\n  * a writer-independent validation and test split\n\n\n\n### Exclude for version 1\n\n  * highly uncertain lines\n  * complex displayed math\n  * pages where layout is so messy that your current boxes stop isolating lines reliably\n\n\n\n### Introduce later, only if needed\n\n  * polygon or baseline annotation for a hard subset\n  * page-level segmentation models\n  * post-correction as a second stage\n\n\n\nThis is the version of the project most likely to produce a usable dataset with a realistic annotation cost. It also matches how current HTR tools and surveys structure the problem. (Kraken)\n\n## Final verdict\n\nYour current plan is good.\n\nNot perfect. Good.\n\nThe adjustments I would make are:\n\n  1. **Do not switch the whole corpus to polygons.**\n  2. **Do not remove English.**\n  3. **Do not skip all formulas. Keep simple inline notation.**\n  4. **Write strict transcription rules now.**\n  5. **Collect more writer diversity.**\n  6. **Treat this as a line-level Bulgarian HTR dataset, not a general OCR dataset.**\n\n\n\nThat is a project worth continuing. It fills a real gap, and the failure of off-the-shelf Russian models is evidence of that gap, not evidence that the project is misguided. (Springer Link)",
  "title": "Creating a Cyrillic(Bulgarian) Handwritten OCR Dataset - Guidance needed"
}