{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreigpxsqm3v3tulipwpsumjkcnsz7vbp322y7ahankhutdh5pc6sp5e",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mnju4hfxvrm2"
},
"path": "/t/fine-tuning-an-slm-for-a-low-resource-language/176467#post_7",
"publishedAt": "2026-06-05T08:10:55.000Z",
"site": "https://discuss.huggingface.co",
"textContent": "Hello, I read your guide and learned several new things from it. One part that I found especially helpful was the section about checking whether a tokenizer supports my target language well.\n\nAfter reading that, I tested the **Qwen 3 0.6B** tokenizer on Persian, and its performance was quite poor. I also tested **Qwen 3.5 0.8B** , which was better, but still not good enough for strong Persian support.\n\nSo I wanted to ask where can I find a base model that is truly strong for Persian?\n\nOr can i Somehow fine-tune a Tokenizer for Persian?\n\nBut i found a way myself though, if i cant improve the Tokenizer, i can help it. maybe i try to make a normalizer first and then extend it more and more and measure changes in the Tokenized output.",
"title": "Fine-Tuning an SLM for a Low-Resource Language"
}