How can I build a high-quality dataset?
Hmm… for small models, maybe we should not expect clean multilingual separation in the output. It may simply be a capacity issue. More specifically:
Short answer
I would not treat this as only a Qwen-specific bug.
I would think of it as a capacity / interference problem.
In a very small multilingual model, many things compete for limited capacity:
- Persian
- English
- Chinese
- generic assistant style
- refusal style
- reasoning style
- formatting rules
- tool-use behavior
- uncertainty behavior
- instruction-following constraints
When the model is uncertain, overloaded, sampled too freely, or pushed outside its strongest distribution, it may fall back to a stronger learned mode: English, Chinese, boilerplate, generic assistant style, or some other pattern.
So for a Persian-only 0.8B assistant, I would not expect multilingual separation to be naturally stable. I would treat language drift as a failure mode that must be actively engineered against.
1. Rough size intuition
This is not a hard rule, but my rough expectation would be:
| Model size | Language/style drift expectation |
|---|---|
| below 1B | Very likely. Needs strong constraints, narrow scope, and drift-specific eval. |
| 1B-3B | Still common, but can be made usable with good CPT/SFT and conservative decoding. |
| 3B-7B | Transition zone. Often much better, but still fragile under uncertainty, long prompts, or high temperature. |
| 7B-14B | Usually much more stable for multilingual instruction following. |
| 14B+ | Much more reliable, though language drift can still happen. |
So for Qwen3.5-0.8B, I would put it clearly in the “drift likely” zone.
Not hopeless, but not something I would expect to disappear automatically.
2. Why this happens
I would describe it like this:
The smaller the model, the less room it has to keep languages, styles, tasks, and constraints cleanly separated.
This is related to what multilingual NLP papers often call the curse of multilinguality: multilingual training can help low-resource languages, but many languages and tasks also compete for limited model capacity. See, for example, “When Is Multilinguality a Curse?” and “Multilingual Large Language Models and Curse of Multilinguality”.
There is also direct work on language confusion, where LLMs fail to answer consistently in the user’s intended language. “Understanding and Mitigating Language Confusion in LLMs” reports that language confusion can be worsened by complex prompts and high sampling temperature, and can be partially reduced with few-shot prompting, multilingual SFT, and preference tuning.
So I would not frame this as:
Qwen randomly breaks.
I would frame it as:
A small multilingual model has limited capacity, and language/style/task modes interfere.
3. SFT alone may not fully solve it
Persian SFT will help.
But ordinary SFT mostly increases the probability of the desired answer tokens. It does not necessarily penalize unwanted mixed-language outputs with the same strength.
That matters because a mixed-language answer may still share many locally plausible tokens with the training distribution.
There is recent work, “Controlling Language Confusion in Multilingual LLMs”, arguing that ordinary SFT may not explicitly penalize cross-lingual mixing, while preference-style objectives such as ORPO can suppress language-confused outputs more directly.
Practical implication:
Persian-only SFT helps. But if English/Chinese drift is a specific failure mode, also include examples where mixed-language answers are explicitly labeled as bad.
For example:
| Prompt | Bad answer | Good answer |
|---|---|---|
| Persian user asks a question | starts Persian, then switches to English/Chinese | stays in Iranian Persian |
| Persian user asks uncertain question | “I’m not sure…” in English | Persian clarification / Persian uncertainty |
| Persian math prompt | reasoning in English/Chinese | Persian explanation, or tool call + Persian final answer |
| Persian correction prompt | generic English assistant tone | Persian correction style |
This is where preference data, DPO/ORPO-style data, or even simple reject/accept filtering can help.
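As a rough illustration of the table above, here is a minimal sketch of how such accept/reject pairs could be stored. The `prompt` / `chosen` / `rejected` field names are a layout commonly used for preference tuning, but they are an assumption here, so check what your trainer expects. The `FOREIGN` script check and the helper names are hypothetical and deliberately crude.

```python
import json
import re
from dataclasses import asdict, dataclass
from typing import Optional

# Assumption: any Latin letter or CJK ideograph counts as drift here.
# Data with legitimate English technical terms would need a softer rule.
FOREIGN = re.compile(r"[A-Za-z\u4e00-\u9fff]")


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # answer that stays in Persian
    rejected: str  # answer that drifts into English/Chinese


def make_pair(prompt: str, good: str, bad: str) -> Optional[PreferencePair]:
    """Return a pair only if the script check agrees with the labels."""
    if FOREIGN.search(good) or not FOREIGN.search(bad):
        return None
    return PreferencePair(prompt=prompt, chosen=good, rejected=bad)


def to_jsonl(pairs: list) -> str:
    """Serialize pairs as JSON Lines, keeping Persian text readable."""
    return "\n".join(json.dumps(asdict(p), ensure_ascii=False) for p in pairs)


pair = make_pair(
    prompt="پایتخت ایران کجاست؟",
    good="پایتخت ایران تهران است.",
    bad="پایتخت ایران Tehran است. It is the largest city.",
)
print(to_jsonl([pair]) if pair else "pair rejected by the script check")
```

The same check can double as a simple reject/accept filter over model samples: keep Persian-only outputs as `chosen` candidates and drifted ones as `rejected` candidates.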
4. What I would do for a Persian-only 0.8B model
I would attack the problem at several layers.
| Layer | Action |
|---|---|
| model choice | compare same-size models for Persian tokenization and drift |
| CPT | Persian-heavy or Persian-only CPT corpus |
| SFT | Persian-only assistant examples |
| uncertainty SFT | “I don’t know / please clarify / I cannot answer” all in Persian |
| correction SFT | user correction and self-correction in Persian |
| negative/preference data | mixed-language outputs marked as bad |
| decoding | low temperature, avoid aggressive presence penalty |
| output check | detect non-Persian output and retry/repair |
| eval | explicit language-drift test set |
For a Persian-only assistant, I would not try to preserve broad multilingual behavior unless you actually need it.
If the product goal is Iranian Persian, then English/Chinese drift should be treated as an error.
5. Drift eval is necessary
I would make a small eval set specifically for drift.
Not just normal Persian QA.
Test the cases where drift is likely:
| Case | Why |
|---|---|
| ambiguous Persian prompt | model may fall back to dominant language |
| misspelled Persian | uncertainty increases drift |
| long multi-turn chat | context pressure increases drift |
| correction from user | model may switch style/language |
| “I don’t know” case | uncertainty often triggers English boilerplate |
| math/reasoning | reasoning style may drift to English |
| tool failure | failure mode may drift |
| mixed Persian-English technical terms | model may continue in English |
| high temperature | sampling can increase drift |
| long answer | later paragraphs may drift |
Measure something simple:
- non-Persian character ratio
- English token ratio
- Chinese character count
- answer starts in Persian?
- answer ends in Persian?
- does reasoning drift?
- does refusal drift?
Even a small 100-example drift eval would be useful.
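A minimal sketch of the script-level measurements listed above, using only the standard library. The Unicode ranges and the function name `drift_metrics` are my assumptions. Note that the Arabic-script ranges cannot tell Persian from Arabic or Urdu, so this only catches drift into Latin or CJK text, and “English token ratio” is approximated as the share of whitespace-separated tokens containing a Latin letter.

```python
import re

# Assumption: Arabic-script code points stand in for "Persian".
PERSIAN_SCRIPT = re.compile(r"[\u0600-\u06FF\u0750-\u077F\uFB50-\uFDFF\uFE70-\uFEFF]")
LATIN = re.compile(r"[A-Za-z]")
CJK = re.compile(r"[\u4e00-\u9fff]")


def drift_metrics(text: str) -> dict:
    """Compute simple script-based drift signals for one model answer."""
    # Only letters are counted, so digits and punctuation do not skew ratios.
    letters = [ch for ch in text if ch.isalpha()]
    tokens = text.split()
    if not letters:
        return {
            "non_persian_char_ratio": 0.0,
            "english_token_ratio": 0.0,
            "chinese_char_count": 0,
            "starts_in_persian": False,
            "ends_in_persian": False,
        }
    non_persian = sum(1 for ch in letters if not PERSIAN_SCRIPT.match(ch))
    latin_tokens = sum(1 for tok in tokens if LATIN.search(tok))
    return {
        "non_persian_char_ratio": non_persian / len(letters),
        "english_token_ratio": latin_tokens / len(tokens),
        "chinese_char_count": sum(1 for ch in letters if CJK.match(ch)),
        "starts_in_persian": bool(PERSIAN_SCRIPT.match(letters[0])),
        "ends_in_persian": bool(PERSIAN_SCRIPT.match(letters[-1])),
    }
```

Running this over every answer in the drift set and averaging per case type (uncertainty, math, long answer, and so on) would show where drift concentrates. Whether reasoning or refusals drift still needs those parts of the output to be isolated first, which this sketch does not do.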
6. If another model is better, use it
If you find another 0.5B-3B model with:
- better Persian tokenization
- less English/Chinese drift
- acceptable instruction following
- acceptable device performance
then yes, use it.
Starting from a less drift-prone model is usually easier than training the drift out afterwards.
But I would not abandon Qwen3.5-0.8B only because drift exists. At 0.8B, some drift risk is expected.
I would compare models with a small test:
- same Persian prompts
- same decoding settings
- same drift eval
- same tokenization statistics
- same latency/memory budget
Then choose based on evidence.
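For the tokenization part of that comparison, one simple statistic is tokens per character on the same Persian prompts. The sketch below assumes each candidate tokenizer can be wrapped as a plain function from a string to a list of tokens; the names `tokens_per_char` and `rank_tokenizers` are hypothetical, and the two stand-in tokenizers at the end exist only so the example runs without any model.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple


def tokens_per_char(tokenize: Callable[[str], list], prompts: List[str]) -> float:
    """Average number of tokens produced per input character."""
    ratios = [len(tokenize(p)) / len(p) for p in prompts if p]
    return mean(ratios) if ratios else 0.0


def rank_tokenizers(
    candidates: Dict[str, Callable[[str], list]], prompts: List[str]
) -> List[Tuple[str, float]]:
    """Sort candidates from fewest to most tokens per character."""
    scores = [(name, tokens_per_char(fn, prompts)) for name, fn in candidates.items()]
    return sorted(scores, key=lambda item: item[1])


# Stand-ins for real tokenizers: whitespace splitting and per-character splitting.
stand_ins = {"word_level": str.split, "char_level": list}
for name, score in rank_tokenizers(stand_ins, ["سلام دنیا"]):
    print(f"{name}: {score:.2f} tokens per character")
```

Fewer tokens per character usually means Persian text fits more cheaply into the context window, but it is only one signal. It should be read alongside the drift eval and the latency/memory numbers, not instead of them.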
7. Practical expectation
My expectation would be:
Qwen3.5-0.8B + Persian CPT + Persian SFT
-> likely much better Persian
-> likely reduced drift
-> not guaranteed drift-free
To get closer to Persian-only behavior, add:
- Persian uncertainty examples
- Persian correction examples
- Persian refusal examples
- Persian tool-failure examples
- negative examples for English/Chinese drift
- low-temperature decoding
- output language check
For example:
If the output contains too much English/Chinese:
- retry with a stronger Persian-only instruction
- or run a repair prompt
- or reject the answer
This is not elegant, but for a very small local assistant it is practical.
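A minimal sketch of that check-and-retry loop, assuming the model is wrapped in a `generate(prompt) -> str` function. That wrapper, the 0.2 threshold, the retry count, and the Persian instruction string are all placeholders of mine, not recommended values. The repair-prompt branch is left out; this version only retries and then rejects.

```python
import re
from typing import Callable, Optional

# Assumption: Latin letters and CJK ideographs are the drift we care about.
FOREIGN = re.compile(r"[A-Za-z\u4e00-\u9fff]")

# "Answer only in Persian." Prepended on retries as a stronger instruction.
PERSIAN_ONLY_INSTRUCTION = "فقط به فارسی پاسخ بده."


def foreign_ratio(text: str) -> float:
    """Share of letters that are Latin or CJK."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(1 for ch in letters if FOREIGN.match(ch)) / len(letters)


def answer_with_check(
    generate: Callable[[str], str],
    prompt: str,
    max_ratio: float = 0.2,
    max_retries: int = 2,
) -> Optional[str]:
    """Return the first answer that passes the check, or None to reject."""
    attempt_prompt = prompt
    for _ in range(max_retries + 1):
        answer = generate(attempt_prompt)
        if foreign_ratio(answer) <= max_ratio:
            return answer
        # Retry with the stronger Persian-only instruction in front.
        attempt_prompt = f"{PERSIAN_ONLY_INSTRUCTION}\n{prompt}"
    return None
```

When this returns `None`, the caller can show a fixed Persian fallback message instead of a drifted answer. A nonzero threshold leaves room for the occasional English technical term.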
Bottom line
For a Persian-only 0.8B assistant, I would assume:
language drift is normal unless actively controlled
The reason is probably not only tokenizer or Qwen behavior. It is also limited model capacity and interference between languages, styles, and tasks.
So the strategy should be:
1. choose the least-drifty model you can
2. do Persian-focused CPT
3. do Persian-only SFT
4. train uncertainty/correction/failure cases in Persian
5. penalize mixed-language outputs if possible
6. use conservative decoding
7. add output language checks
8. measure drift directly
In short:
If another same-size model has lower drift and acceptable Persian tokenization, use it. But if Qwen3.5-0.8B is still the best tradeoff, I would keep it and treat language drift as a first-class eval and training target.