Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreigpxsqm3v3tulipwpsumjkcnsz7vbp322y7ahankhutdh5pc6sp5e",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mnju4hfxvrm2"
  },
  "path": "/t/fine-tuning-an-slm-for-a-low-resource-language/176467#post_7",
  "publishedAt": "2026-06-05T08:10:55.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "Hello, I read your guide and learned several new things from it. One part that I found especially helpful was the section about checking whether a tokenizer supports my target language well.\n\nAfter reading that, I tested the **Qwen 3 0.6B** tokenizer on Persian, and its performance was quite poor. I also tested **Qwen 3.5 0.8B** , which was better, but still not good enough for strong Persian support.\n\nSo I wanted to ask where can I find a base model that is truly strong for Persian?\n\nOr can i Somehow fine-tune a Tokenizer for Persian?\n\nBut i found a way myself though, if i cant improve the Tokenizer, i can help it. maybe i try to make a normalizer first and then extend it more and more and measure changes in the Tokenized output.",
  "title": "Fine-Tuning an SLM for a Low-Resource Language"
}
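
The normalize-then-measure idea in the post could be sketched roughly as follows. This is a minimal illustration, not the poster's actual code: the function names `normalize_persian` and `tokens_per_word` are hypothetical, and the character mappings (Arabic yeh and kaf to their Persian forms, Arabic-Indic digits to Persian digits, tatweel removal) are assumed rules that a real normalizer would likely extend. The `tokenize` argument is any callable that turns a string into a list of tokens.

```python
import re
import unicodedata

# Assumed mapping rules; a fuller Persian normalizer would probably add more.
_CHAR_MAP = {
    "\u064A": "\u06CC",  # Arabic yeh -> Persian yeh
    "\u0643": "\u06A9",  # Arabic kaf -> Persian keheh
    "\u0640": None,      # tatweel (kashida) is dropped
}
# Arabic-Indic digits (U+0660..U+0669) -> Persian digits (U+06F0..U+06F9)
_CHAR_MAP.update({chr(0x0660 + i): chr(0x06F0 + i) for i in range(10)})
_TABLE = str.maketrans(_CHAR_MAP)


def normalize_persian(text):
    """Apply NFC, map Arabic look-alike characters, and collapse whitespace.

    The zero-width non-joiner (U+200C) is left untouched because it is
    meaningful in Persian orthography and is not matched by \\s.
    """
    text = unicodedata.normalize("NFC", text)
    text = text.translate(_TABLE)
    return re.sub(r"\s+", " ", text).strip()


def tokens_per_word(text, tokenize):
    """Average number of tokens per whitespace-separated word (lower is better)."""
    words = text.split()
    if not words:
        return 0.0
    return len(tokenize(text)) / len(words)


def compare(text, tokenize):
    """Return (tokens per word before, tokens per word after) normalization."""
    return (
        tokens_per_word(text, tokenize),
        tokens_per_word(normalize_persian(text), tokenize),
    )
```

With the Hugging Face `transformers` library, the `tokenize` method of a tokenizer loaded through `AutoTokenizer.from_pretrained` can be passed as the `tokenize` callable, so each new normalization rule can be checked against the same text sample before and after it is added.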