Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreibzxzf7agzptyfdg3cblfpzwtwm22myhsfmp6rv4e2g3kpih25nnm",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mnpwsqcmfx42"
  },
  "path": "/t/how-can-i-build-a-high-quality-dataset/176571#post_3",
  "publishedAt": "2026-06-07T18:55:44.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "I am targeting Iranian Persian only. By high-quality, I mean a natural Persian dataset with fluent grammar, correct syntax, and student/teacher usage and later when the SLM was good enough, a Voice assistant for teacher/students with low-end devices.\n\nI am using Qwen 3.5 0.8B as the base model. I also tested Qwen 3 0.6B, but its tokenizer was very inefficient for Persian. Even with Qwen 3.5 0.8B, the tokenizer is still not very good for Persian, and the model struggles with simple Persian assistance tasks, grammar, and syntax.\n\nBecause of that, I do not think this problem can be fixed with an SFT dataset alone. The model needs continued pretraining (CPT). I can run CPT with LoRA rank 64, and it works well enough in practice. I tested this before with Qwen 3 0.6B trained on about 800 MB of cleaned Persian Wikipedia, and the model improved noticeably. Before that, it could barely write even two Persian words correctly.\n\nFor training data, I selected:\n\n  * **Persian Wikipedia** (~2 GB)\n\n  * **Persian OSCAR** (~3 GB, but very noisy)\n\n  * **Persian Aya dataset** (~600 MB)\n\n\n\n\nRight now, I am trying to train an n-gram model to identify whether a given text is natural and correct Persian. If the text does not meet that standard, I reject it. I should also mention that I clean the text as much as possible before sending it to the n-gram model.\n\nAt the moment, this is the best approach I can think of. I do not believe a simple scripted pipeline can reliably identify natural Persian quality on its own.\n\nBecause of my hardware limitations, I cannot use local LLMs for filtering. Smaller LLMs are not good enough at Persian, and larger ones are too expensive to run. For example, I get around **70 tokens/sec** with **Gemma E2B Q4** and around **40 tokens/sec** with **Gemma E4B Q4**.\n\nAnd I have two questions:\n\n  1. In continued pretraining (CPT), the model learns grammar, syntax, and additional knowledge, right? My understanding is that CPT changes the model more deeply, in a way that affects its outputs broadly, while SFT mainly teaches the model how to respond in the desired format and follow patterns more reliably. In other words, CPT builds the underlying language ability, while SFT guides that ability toward better answers. Is that understanding correct?\n\n  2. If I want the model to gain a larger amount of knowledge ( not at a GPT level, but enough to remember or recall information from the CPT data ) do I need to include everything I want it to remember in the CPT dataset?\n\n\n\n\nI should also mention that API costs for AI models are very high in my country.",
  "title": "How can i build a High Quality dataset?"
}