Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreighejxd27qsdzcpzlni4zvmw4pawzt627tzryv53n3tqclppgnlk4",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mooshtkndit2"
  },
  "path": "/t/how-can-i-build-a-high-quality-dataset/176571#post_17",
  "publishedAt": "2026-06-20T01:52:22.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "When Is Multilinguality a Curse?",
    "Multilingual Large Language Models and Curse of Multilinguality",
    "Understanding and Mitigating Language Confusion in LLMs",
    "Controlling Language Confusion in Multilingual LLMs"
  ],
  "textContent": "Hmm… for small models, maybe we should not expect clean multilingual separation in the output. It **may simply be a capacity issue**. More specifically:\n\n* * *\n\n## Short answer\n\nI would not treat this as only a Qwen-specific bug.\n\nI would think of it as a **capacity / interference problem**.\n\nIn a very small multilingual model, many things compete for limited capacity:\n\n  * Persian\n  * English\n  * Chinese\n  * generic assistant style\n  * refusal style\n  * reasoning style\n  * formatting rules\n  * tool-use behavior\n  * uncertainty behavior\n  * instruction-following constraints\n\n\n\nWhen the model is uncertain, overloaded, sampled too freely, or pushed outside its strongest distribution, it may fall back to a stronger learned mode: English, Chinese, boilerplate, generic assistant style, or some other pattern.\n\nSo for a **Persian-only 0.8B assistant**, I would not expect multilingual separation to be naturally stable. I would treat language drift as a failure mode that must be actively engineered against.\n\n* * *\n\n## 1. Rough size intuition\n\nThis is not a hard rule, but my rough expectation would be:\n\nModel size | Language/style drift expectation\n---|---\n**below 1B** | Very likely. Needs strong constraints, narrow scope, and drift-specific eval.\n**1B-3B** | Still common, but can be made usable with good CPT/SFT and conservative decoding.\n**3B-7B** | Transition zone. Often much better, but still fragile under uncertainty, long prompts, or high temperature.\n**7B-14B** | Usually much more stable for multilingual instruction following.\n**14B+** | Much more reliable, though language drift can still happen.\n\nSo for **Qwen3.5-0.8B**, I would put it clearly in the “drift likely” zone.\n\nNot hopeless, but not something I would expect to disappear automatically.\n\n* * *\n\n## 2. 
Why this happens\n\nI would describe it like this:\n\n> The smaller the model, the less room it has to keep languages, styles, tasks, and constraints cleanly separated.\n\nThis is related to what multilingual NLP papers often call the **curse of multilinguality**: multilingual training can help low-resource languages, but many languages and tasks also compete for limited model capacity. See, for example, *When Is Multilinguality a Curse?* and *Multilingual Large Language Models and Curse of Multilinguality*.\n\nThere is also direct work on **language confusion**, where LLMs fail to answer consistently in the user’s intended language. *Understanding and Mitigating Language Confusion in LLMs* reports that language confusion can be worsened by complex prompts and high sampling temperature, and can be partially reduced with few-shot prompting, multilingual SFT, and preference tuning.\n\nSo I would not frame this as:\n\n\n    Qwen randomly breaks.\n\n\nI would frame it as:\n\n\n    A small multilingual model has limited capacity, and language/style/task modes interfere.\n\n\n* * *\n\n## 3. SFT alone may not fully solve it\n\nPersian SFT will help.\n\nBut ordinary SFT mostly increases the probability of the desired answer tokens. 
It does not necessarily penalize unwanted mixed-language outputs strongly.\n\nThat matters because a mixed-language answer may still share many locally plausible tokens with the training distribution.\n\nThere is recent work, *Controlling Language Confusion in Multilingual LLMs*, arguing that normal SFT may not explicitly punish cross-lingual mixing, while preference-style objectives such as ORPO can suppress language-confused outputs more directly.\n\nPractical implication:\n\n\n    Persian-only SFT helps.\n    But if English/Chinese drift is a specific failure mode,\n    include examples where mixed-language answers are explicitly bad.\n\n\nFor example:\n\nPrompt | Bad answer | Good answer\n---|---|---\nPersian user asks a question | starts in Persian, then switches to English/Chinese | stays in Iranian Persian\nPersian user asks an uncertain question | “I’m not sure…” in English | Persian clarification / Persian uncertainty\nPersian math prompt | reasoning in English/Chinese | Persian explanation, or tool call + Persian final answer\nPersian correction prompt | generic English assistant tone | Persian correction style\n\nThis is where preference data, DPO/ORPO-style data, or even simple reject/accept filtering can help.\n\n* * *\n\n## 4. 
What I would do for a Persian-only 0.8B model\n\nI would attack the problem at several layers.\n\nLayer | Action\n---|---\nmodel choice | compare same-size models for Persian tokenization and drift\nCPT | Persian-heavy or Persian-only CPT corpus\nSFT | Persian-only assistant examples\nuncertainty SFT | “I don’t know / please clarify / I cannot answer” all in Persian\ncorrection SFT | user correction and self-correction in Persian\nnegative/preference data | mixed-language outputs marked as bad\ndecoding | low temperature, avoid aggressive presence penalty\noutput check | detect non-Persian output and retry/repair\neval | explicit language-drift test set\n\nFor a Persian-only assistant, I would not try to preserve broad multilingual behavior unless you actually need it.\n\nIf the product goal is Iranian Persian, then English/Chinese drift should be treated as an error.\n\n* * *\n\n## 5. Drift eval is necessary\n\nI would make a small eval set specifically for drift.\n\nNot just normal Persian QA.\n\nTest the cases where drift is likely:\n\nCase | Why\n---|---\nambiguous Persian prompt | model may fall back to dominant language\nmisspelled Persian | uncertainty increases drift\nlong multi-turn chat | context pressure increases drift\ncorrection from user | model may switch style/language\n“I don’t know” case | uncertainty often triggers English boilerplate\nmath/reasoning | reasoning style may drift to English\ntool failure | failure mode may drift\nmixed Persian-English technical terms | model may continue in English\nhigh temperature | sampling can increase drift\nlong answer | later paragraphs may drift\n\nMeasure something simple:\n\n\n    non-Persian character ratio\n    English token ratio\n    Chinese character count\n    answer starts in Persian?\n    answer ends in Persian?\n    does reasoning drift?\n    does refusal drift?\n\n\nEven a small 100-example drift eval would be useful.\n\n* * *\n\n## 6. 
If another model is better, use it\n\nIf you find another 0.5B-3B model with:\n\n  * better Persian tokenization\n  * less English/Chinese drift\n  * acceptable instruction following\n  * acceptable device performance\n\n\n\nthen yes, use it.\n\nStarting from a less drift-prone model is usually better, all else being equal.\n\nBut I would not abandon Qwen3.5-0.8B only because drift exists. At 0.8B, some drift risk is expected.\n\nI would compare models with a small test:\n\n\n    same Persian prompts\n    same decoding settings\n    same drift eval\n    same tokenization statistics\n    same latency/memory budget\n\n\nThen choose based on evidence.\n\n* * *\n\n## 7. Practical expectation\n\nMy expectation would be:\n\n\n    Qwen3.5-0.8B + Persian CPT + Persian SFT\n      -> likely much better Persian\n      -> likely reduced drift\n      -> not guaranteed drift-free\n\n\nTo get closer to Persian-only behavior, add:\n\n\n    Persian uncertainty examples\n    Persian correction examples\n    Persian refusal examples\n    Persian tool-failure examples\n    negative examples for English/Chinese drift\n    low-temperature decoding\n    output language check\n\n\nFor example:\n\n\n    If output contains too much English/Chinese:\n      retry with stronger Persian-only instruction\n      or run a repair prompt\n      or reject the answer\n\n\nThis is not elegant, but for a very small local assistant it is practical.\n\n* * *\n\n## Bottom line\n\nFor a Persian-only 0.8B assistant, I would assume:\n\n\n    language drift is normal unless actively controlled\n\n\nThe reason is probably not only tokenizer or Qwen behavior. It is also limited model capacity and interference between languages, styles, and tasks.\n\nSo the strategy should be:\n\n\n    1. choose the least-drifty model you can\n    2. do Persian-focused CPT\n    3. do Persian-only SFT\n    4. train uncertainty/correction/failure cases in Persian\n    5. penalize mixed-language outputs if possible\n    6. 
use conservative decoding\n    7. add output language checks\n    8. measure drift directly\n\n\nIn short:\n\n> If another same-size model has lower drift and acceptable Persian tokenization, use it. But if Qwen3.5-0.8B is still the best tradeoff, I would keep it and treat language drift as a first-class eval and training target.",
  "title": "How can i build a High Quality dataset?"
}