Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreidbdyoiz2eu6p7fv6h6ld6av65k2vb6dx3afjxmh62oazxp322hbi",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mopu2rboiba2"
  },
  "path": "/t/how-can-i-build-a-high-quality-dataset/176571#post_19",
  "publishedAt": "2026-06-20T12:04:11.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Qwen3.5-0.8B model card",
    "Qwen3.5 model collection",
    "Unsloth Qwen3.5 guide",
    "Unsloth Qwen3.5 fine-tuning guide",
    "Khayyam Challenge / PersianMMLU",
    "Open Persian LLM Leaderboard",
    "PARSE: Persian open-domain reasoning QA",
    "ELAB: Persian alignment benchmark",
    "PersLitEval",
    "Qwen-Scope",
    "SASFT: SAE-guided supervised fine-tuning for unexpected code-switching",
    "Controlling Language Confusion in Multilingual LLMs",
    "OLA: Learning to respond in the user’s language",
    "FineWeb-Edu dataset",
    "FineWeb-Edu classifier",
    "FineWeb paper",
    "FineWeb blog post",
    "Step-by-Step: Improving Math Reasoning and Tutoring with Process Supervision",
    "Direct Preference Optimization",
    "PersianPhi model card",
    "Persian-Phi paper"
  ],
  "textContent": "Hmm… if you stay on the 0.8B route, **the fine-tuning and engineering cost may actually end up higher than just moving to a slightly larger model, maybe 2B–9B**. More details below:\n\n* * *\n\nI would frame this as a **model-size / evaluation / data-quality decision**, not only as a “can Qwen3.5-0.8B learn Persian?” decision.\n\nMy short answer is:\n\n> **Qwen3.5-0.8B may be good enough for a narrow Persian-first festival demo, but I would not invest heavily in fine-tuning it before comparing it against Qwen3.5-2B and Qwen3.5-4B on a small Persian stress eval.**\n\nThe reason is that 0.8B is not just “a bit smaller” than 2B or 4B. For a multilingual assistant, 0.8B may be close to the minimum useful capacity. Once you require Persian fluency, instruction following, math tutoring, tool JSON, Latin-name handling, uncertainty behavior, and low language drift, the small model can become expensive in engineering time.\n\nSo I would use this decision rule:\n\nRoute | When it makes sense | Main risk\n---|---|---\n**Qwen3.5-0.8B** | strict lightweight demo, narrow assistant, fast/cheap inference | may need more SFT/DPO/guards/CPT to compensate\n**Qwen3.5-2B** | first “extra capacity” candidate if 0.8B feels brittle | still small, but much less squeezed\n**Qwen3.5-4B** | likely best total-engineering-cost route if quality matters | higher inference cost\n**Qwen3.5-9B** | quality demo / stronger tutor behavior | may no longer feel like a tiny SLM project\n\nUseful references:\n\n  * Qwen3.5-0.8B model card\n  * Qwen3.5 model collection\n  * Unsloth Qwen3.5 guide\n  * Unsloth Qwen3.5 fine-tuning guide\n\n\n\n## 1. 
My recommended first step: build the eval before training\n\nBefore CPT, tokenizer extension, or a large SFT run, I would build a small Persian stress eval and run the same prompts on:\n\n  * Qwen3.5-0.8B\n  * Qwen3.5-2B\n  * Qwen3.5-4B\n  * optionally Qwen3.5-9B\n  * optionally one non-Qwen small baseline\n\n\n\nThe goal is not to create a perfect academic benchmark. The goal is to answer:\n\n  1. Is 0.8B already acceptable?\n  2. Does 2B fix most of the failures?\n  3. Does 4B reduce engineering work enough to justify the extra inference cost?\n  4. Are the failures mainly Persian fluency, language drift, math quality, JSON breakage, or shallow answers?\n  5. Should you spend effort on SFT, DPO/ORPO, CPT, tokenizer extension, or simply a larger base model?\n\n\n\nA first eval can be only 500–700 examples.\n\nEval bucket | Size | What it tests\n---|---|---\nPersian general QA | 100 | basic Persian answer quality\nPersianMMLU / Khayyam sample | 100 | school knowledge and reasoning\nLatin-name mixed prompts | 50 | `Mohammad`-style name handling\nEnglish technical term prompts | 50 | controlled code-switching\nMath tutor prompts | 50–100 | step-by-step Persian tutoring\nTool JSON prompts | 50 | valid JSON and key preservation\nGrammar correction | 50 | Persian language tutoring\nLong-answer drift | 30–50 | whether it switches language halfway\nUncertainty prompts | 30–50 | whether it says “I do not know” properly\nSafety / social norm prompts | 30–50 | basic public assistant behavior\n\nSome useful Persian benchmark starting points:\n\n  * Khayyam Challenge / PersianMMLU\n  * Open Persian LLM Leaderboard\n  * PARSE: Persian open-domain reasoning QA\n  * ELAB: Persian alignment benchmark\n  * PersLitEval\n\n## 2. About Latin names like `Mohammad`\n\nI would not remove Latin names. 
They are not bad data by themselves.\n\nThe real target is:\n\n> **Preserve Latin names, formulas, URLs, code, and JSON when needed, but keep the surrounding answer Persian.**\n\nFor example:\n\nInput pattern | Bad behavior | Desired behavior\n---|---|---\nPersian + Latin name | model switches to English | keep the name, answer in Persian\nPersian + English term | entire answer becomes English | preserve term, explain in Persian\nPersian + formula | formula gets mistranslated | preserve formula, explain in Persian\nPersian + JSON schema | keys get translated or JSON breaks | keep valid JSON, explain in Persian outside JSON\nPersian + URL/citation | answer drifts into English | keep URL/citation, answer in Persian\n\nSo instead of deleting mixed examples, I would create controlled mixed examples.\n\nExample behavior policy:\n\n> “Answer in Persian. Preserve names, formulas, code, JSON keys, URLs, and citations exactly when needed. Do not switch the surrounding explanation into English or Chinese.”\n\nThis is also where Qwen-Scope is conceptually relevant. Qwen-Scope uses sparse autoencoders to analyze and steer Qwen-family model internals, and it discusses development uses around behaviors such as code-switching and repetition.\n\nUseful references:\n\n  * Qwen-Scope\n  * SASFT: SAE-guided supervised fine-tuning for unexpected code-switching\n  * Controlling Language Confusion in Multilingual LLMs\n  * OLA: Learning to respond in the user’s language\n\n\n\nBut I would not depend on Qwen-Scope as the main practical solution for a festival pipeline. For you, the simpler stack is probably:\n\n  1. SFT examples showing correct Persian behavior around Latin spans\n  2. rejected examples where the answer drifts into English/Chinese\n  3. small DPO/ORPO if drift persists\n  4. sentence-level language checks in eval\n  5. optional runtime retry if the output language is wrong\n\n\n\n## 3. 
KenLM is useful, but not enough for “high-quality” answers\n\nYour concern about shallow but fluent answers is correct.\n\nA KenLM-style Good/Bad filter can help with:\n\n  * noisy text\n  * strange character distribution\n  * broken Persian\n  * low-fluency text\n  * bad OCR-ish text\n  * obvious junk\n\n\n\nBut it will not reliably detect:\n\n  * shallow explanations\n  * generic answers\n  * low educational value\n  * missing reasoning steps\n  * factual weakness\n  * confident but incomplete answers\n\n\n\nFor this, I would copy the idea behind FineWeb-Edu: use a separate educational-quality signal, not only a fluency signal.\n\nUseful references:\n\n  * FineWeb-Edu dataset\n  * FineWeb-Edu classifier\n  * FineWeb paper\n  * FineWeb blog post\n\n\n\nA practical Persian filtering stack could look like:\n\n  1. language ID\n  2. exact / near deduplication\n  3. rule filters for broken text\n  4. KenLM / perplexity filtering\n  5. educational-quality classifier or LLM judge\n  6. small human audit\n  7. only then use the text for CPT or raw-text-to-SFT generation\n\n## 4. Math tutor: fixed format is good, but keep the target narrow\n\nFor 0.8B, I would not target “general math solver”. 
I would target:\n\n> **Persian step-by-step tutor for known school-level problem types.**\n\nThat is a much more realistic goal.\n\nGood tutor data should include:\n\nField | Purpose\n---|---\nproblem | original Persian problem\nlevel | grade / difficulty\ntopic | arithmetic, algebra, geometry, etc.\nstudent_attempt | optional wrong solution\ndiagnosis | what is wrong or missing\nhint | small nudge\nsolution_steps | short Persian steps\nfinal_answer | normalized answer\nchecks | how to verify\nlanguage_policy | answer in Persian; preserve formulas\n\nI would evaluate not only the final answer, but also:\n\n  * step correctness\n  * first-error detection\n  * usefulness of hint\n  * whether the model over-solves when the student only needs a hint\n  * language consistency\n  * formula preservation\n\n\n\nUseful reference:\n\n  * Step-by-Step: Improving Math Reasoning and Tutoring with Process Supervision\n\n\n\nMy practical recommendation:\n\nModel size | Math tutor scope\n---|---\n0.8B | narrow, fixed format, school-level, many templates\n2B | still controlled, but more robust explanations\n4B | better candidate for richer tutoring\n9B | stronger quality demo if inference cost is acceptable\n\n## 5. Self-reminders help, but they are not enough\n\nA system prompt like “always answer in Persian” helps, but I would not rely on it alone.\n\nBetter stack:\n\n  1. **SFT habit**\nMany examples where mixed input still produces Persian output.\n\n  2. **Preference data**\nChosen answer = Persian answer with allowed spans preserved.\nRejected answer = English/Chinese drift, broken JSON, translated schema keys, shallow answer, or hallucination.\n\n  3. **DPO/ORPO**\nUse preference tuning if SFT does not suppress drift enough.\n\n  4. **Eval**\nTrack wrong-language rate and sentence-level drift.\n\n  5. 
**Runtime guard**\nIf needed, detect wrong-language output and retry.\n\n\n\n\nUseful references:\n\n  * Direct Preference Optimization\n  * Controlling Language Confusion in Multilingual LLMs\n  * OLA: Learning to respond in the user’s language\n\n## 6. Tokenizer extension: possible, but probably not first\n\nTokenizer extension can help if Persian is tokenized badly. But it is not a free improvement.\n\nIf you add tokens, you need:\n\n  * embedding resize\n  * sensible initialization\n  * warm-up / alignment\n  * continued training\n  * regression tests\n\n\n\nA good Persian-specific reference is PersianPhi:\n\n  * PersianPhi model card\n  * Persian-Phi paper\n\n\n\nPersianPhi is useful because it shows that tokenizer adaptation can be part of a serious Persian curriculum pipeline. But that is also the warning: it is not just “add Persian tokens and run SFT”.\n\nFor your project, I would measure first:\n\nMeasurement | Why it matters\n---|---\ntokens per word | Persian compression\ncharacters per token | context efficiency\nsplit rate | whether words are fragmented\nPersian vs English token cost | multilingual imbalance\nLatin-name mixed examples | real input behavior\nformulas / JSON | tool and math safety\ntextbook text | tutor data efficiency\n\nIf Qwen3.5 tokenization is acceptable, I would skip tokenizer extension and spend the time on eval, SFT quality, and drift control.\n\n## 7. 
Suggested training strategy\n\nI would use this order.\n\n### Stage 0 — Compare base models before training\n\nRun the same eval on 0.8B, 2B, and 4B.\n\nDecision:\n\n  * if 0.8B passes, keep it\n  * if 0.8B fails only in language discipline, try SFT + DPO/ORPO\n  * if 0.8B fails in tutor quality, tool stability, or long-answer drift, test 2B/4B before heavy training\n  * if 4B works with much less engineering, use 4B\n\n\n\n### Stage 1 — SFT for Persian-first behavior\n\nStart with high-quality examples, not a huge weak dataset.\n\nPossible first target:\n\n  * 5k–30k strong Persian SFT examples\n  * Persian answers\n  * tutor examples\n  * grammar correction\n  * uncertainty behavior\n  * Latin-name handling\n  * formula handling\n  * JSON/tool handling\n\n\n\n### Stage 2 — Preference tuning for drift and bad behavior\n\nOnly if SFT is not enough.\n\nUse chosen/rejected pairs for:\n\n  * wrong-language drift\n  * translated JSON keys\n  * broken formulas\n  * shallow answers\n  * overconfident hallucinations\n\n\n\n### Stage 3 — CPT only if eval says you need it\n\nIf Persian fluency or domain knowledge is still weak, then consider small CPT.\n\nBut I would not start with huge CPT unless the goal is truly “build a Persian language model”, not “build a Persian assistant demo”.\n\n### Stage 4 — Tokenizer extension only if measurement justifies it\n\nTokenizer extension is a serious intervention. It belongs after measurement, not before.\n\n## Short answers to your questions\n\n### Will Latin names cause drift?\n\nThey can trigger drift, but they are not bad data. Keep them. Train and evaluate the model to preserve names while keeping the surrounding answer Persian.\n\n### Can Good/Bad KenLM catch shallow answers?\n\nNo, not reliably. KenLM helps with surface fluency and noise. Use an educational-quality classifier or rubric-based judge for depth and usefulness.\n\n### Is fixed-format math tutor data a good idea?\n\nYes. For 0.8B, fixed and narrow is better. 
Treat it as a guided tutor, not a general math solver.\n\n### Can the model learn confidence and self-reminders?\n\nPartly, but self-reminders are not enough. Use SFT for habit, preference tuning for wrong-language rejection, and eval/runtime checks.\n\n### Is Qwen-Scope useful?\n\nYes, conceptually and possibly technically. It supports the idea that code-switching can be analyzed and mitigated. But I would not make it the main solution for a festival pipeline.\n\n### Is tokenizer extension worth it?\n\nMaybe, but only after measuring tokenization efficiency. If Qwen3.5 tokenization is acceptable, skip it.\n\n### Should you stay on 0.8B?\n\nMaybe. If the project requires strict lightweight deployment, yes. But if 2B or 4B is allowed, test them before investing heavily in 0.8B-specific engineering.\n\n## Practical recommendation\n\nMy practical recommendation would be:\n\n  1. Build a 500–700 item Persian stress eval.\n  2. Run Qwen3.5-0.8B, 2B, and 4B before training.\n  3. If 0.8B passes, use it and keep the scope narrow.\n  4. If 0.8B fails mainly because of language drift, try SFT + small DPO/ORPO.\n  5. If 0.8B fails because of capacity, tutor quality, or tool stability, move to 2B or 4B.\n  6. Use KenLM for noise, not for educational value.\n  7. Do not extend the tokenizer unless tokenization measurements clearly justify it.\n\n\n\nMain idea:\n\n> **Do not solve model-size uncertainty with training first. Solve it with a small eval first. Then choose whether 0.8B, 2B, or 4B is the cheapest route in total engineering cost.**",
  "title": "How can i build a High Quality dataset?"
}