External Publication

How can I build a high-quality dataset?

Hugging Face Forums [Unofficial] June 20, 2026

Hmm… for small models, maybe we should not expect clean multilingual separation in the output. It may simply be a capacity issue. More specifically:

Short answer

I would not treat this as only a Qwen-specific bug.

I would think of it as a capacity / interference problem.

In a very small multilingual model, many things compete for limited capacity:

Persian
English
Chinese
generic assistant style
refusal style
reasoning style
formatting rules
tool-use behavior
uncertainty behavior
instruction-following constraints

When the model is uncertain, overloaded, sampled too freely, or pushed outside its strongest distribution, it may fall back to a stronger learned mode: English, Chinese, boilerplate, generic assistant style, or some other pattern.

So for a Persian-only 0.8B assistant, I would not expect multilingual separation to be naturally stable. I would treat language drift as a failure mode that must be actively engineered against.

1. Rough size intuition

This is not a hard rule, but my rough expectation would be:

Model size	Language/style drift expectation
below 1B	Very likely. Needs strong constraints, narrow scope, and drift-specific eval.
1B-3B	Still common, but can be made usable with good CPT/SFT and conservative decoding.
3B-7B	Transition zone. Often much better, but still fragile under uncertainty, long prompts, or high temperature.
7B-14B	Usually much more stable for multilingual instruction following.
14B+	Much more reliable, though language drift can still happen.

So for Qwen3.5-0.8B, I would put it clearly in the “drift likely” zone.

Not hopeless, but not something I would expect to disappear automatically.

2. Why this happens

I would describe it like this:

The smaller the model, the less room it has to keep languages, styles, tasks, and constraints cleanly separated.

This is related to what multilingual NLP papers often call the curse of multilinguality: multilingual training can help low-resource languages, but many languages and tasks also compete for limited model capacity. See, for example, “When Is Multilinguality a Curse?” and “Multilingual Large Language Models and Curse of Multilinguality”.

There is also direct work on language confusion, where LLMs fail to answer consistently in the user’s intended language. “Understanding and Mitigating Language Confusion in LLMs” reports that language confusion can be worsened by complex prompts and high sampling temperature, and can be partially reduced with few-shot prompting, multilingual SFT, and preference tuning.
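To make the temperature point concrete, here is a toy sketch of how temperature scaling changes the chance of picking a weaker "other language" continuation. The two logits are made-up numbers for illustration, not measurements from any model.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits after dividing each one by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits: index 0 = Persian continuation, index 1 = English continuation.
toy_logits = [2.0, 0.5]

for t in (0.3, 0.7, 1.5):
    p_persian, p_english = softmax_with_temperature(toy_logits, t)
    print(f"T={t}: P(English continuation) = {p_english:.3f}")
```

The weaker option gets more probability mass as the temperature goes up, which is one plausible reason conservative decoding helps with drift.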

So I would not frame this as:

Qwen randomly breaks.

I would frame it as:

A small multilingual model has limited capacity, and language/style/task modes interfere.

3. SFT alone may not fully solve it

Persian SFT will help.

But ordinary SFT mostly increases the probability of the desired answer tokens. It does not necessarily penalize unwanted mixed-language outputs strongly.

That matters because a mixed-language answer may still share many locally plausible tokens with the training distribution.

There is recent work, “Controlling Language Confusion in Multilingual LLMs”, arguing that normal SFT may not explicitly punish cross-lingual mixing, while preference-style objectives such as ORPO can suppress language-confused outputs more directly.

Practical implication:

Persian-only SFT helps.
But if English/Chinese drift is a specific failure mode,
include examples where mixed-language answers are explicitly bad.

For example:

Prompt	Bad answer	Good answer
Persian user asks a question	starts Persian, then switches to English/Chinese	stays in Iranian Persian
Persian user asks uncertain question	“I’m not sure…” in English	Persian clarification / Persian uncertainty
Persian math prompt	reasoning in English/Chinese	Persian explanation, or tool call + Persian final answer
Persian correction prompt	generic English assistant tone	Persian correction style

This is where preference data, DPO/ORPO-style data, or even simple reject/accept filtering can help.
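A minimal sketch of the reject/accept idea, assuming you already have several sampled answers per Persian prompt. The helper names, the 0.9 threshold, and the prompt/chosen/rejected field names are my own assumptions (that field layout is a common convention for preference datasets, but check what your trainer expects). The script check uses the Arabic Unicode block, so it cannot tell Persian apart from Arabic or Urdu.

```python
import json

def persian_ratio(text: str) -> float:
    """Share of alphabetic characters that fall in the Arabic block (U+0600-U+06FF)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    in_block = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return in_block / len(letters)

def make_preference_pairs(prompt, candidates, min_ratio=0.9):
    """Pair each clean Persian candidate (chosen) with each drifting one (rejected)."""
    clean = [c for c in candidates if persian_ratio(c) >= min_ratio]
    drifted = [c for c in candidates if persian_ratio(c) < min_ratio]
    return [
        {"prompt": prompt, "chosen": good, "rejected": bad}
        for good in clean
        for bad in drifted
    ]

# Hypothetical example: one clean answer, one that switches to English.
example_pairs = make_preference_pairs(
    "سلام، حالت چطور است؟",
    ["من خوبم، ممنون.", "من خوبم. I am fine, thank you."],
)
for pair in example_pairs:
    print(json.dumps(pair, ensure_ascii=False))  # one JSONL record per pair
```

Prompts where every candidate is clean, or every candidate drifts, produce no pairs here. The clean ones can still go into ordinary SFT data.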

4. What I would do for a Persian-only 0.8B model

I would attack the problem at several layers.

Layer	Action
model choice	compare same-size models for Persian tokenization and drift
CPT	Persian-heavy or Persian-only CPT corpus
SFT	Persian-only assistant examples
uncertainty SFT	“I don’t know / please clarify / I cannot answer” all in Persian
correction SFT	user correction and self-correction in Persian
negative/preference data	mixed-language outputs marked as bad
decoding	low temperature, avoid aggressive presence penalty
output check	detect non-Persian output and retry/repair
eval	explicit language-drift test set

For a Persian-only assistant, I would not try to preserve broad multilingual behavior unless you actually need it.

If the product goal is Iranian Persian, then English/Chinese drift should be treated as an error.

5. Drift eval is necessary

I would make a small eval set specifically for drift.

Not just normal Persian QA.

Test the cases where drift is likely:

Case	Why
ambiguous Persian prompt	model may fall back to dominant language
misspelled Persian	uncertainty increases drift
long multi-turn chat	context pressure increases drift
correction from user	model may switch style/language
“I don’t know” case	uncertainty often triggers English boilerplate
math/reasoning	reasoning style may drift to English
tool failure	failure mode may drift
mixed Persian-English technical terms	model may continue in English
high temperature	sampling can increase drift
long answer	later paragraphs may drift

Measure something simple:

non-Persian character ratio
English token ratio
Chinese character count
answer starts in Persian?
answer ends in Persian?
does reasoning drift?
does refusal drift?

Even a small 100-example drift eval would be useful.

6. If another model is better, use it

If you find another 0.5B-3B model with:

better Persian tokenization
less English/Chinese drift
acceptable instruction following
acceptable device performance

then yes, use it.

Starting from a less drift-prone model is almost always easier than fixing drift after the fact.

But I would not abandon Qwen3.5-0.8B only because drift exists. At 0.8B, some drift risk is expected.

I would compare models with a small test:

same Persian prompts
same decoding settings
same drift eval
same tokenization statistics
same latency/memory budget

Then choose based on evidence.
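For the tokenization part of that comparison, one simple statistic is average tokens per whitespace-separated word on the same Persian texts (lower usually means the tokenizer splits Persian less aggressively). This is a sketch under my own assumptions: the function names are hypothetical, and each tokenizer is passed in as a plain callable from string to token list.

```python
from typing import Callable, Dict, List, Sequence, Tuple

def tokens_per_word(texts: Sequence[str], tokenize: Callable[[str], list]) -> float:
    """Average number of tokens per whitespace-separated word."""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words if total_words else 0.0

def compare_tokenizers(
    texts: Sequence[str],
    tokenizers: Dict[str, Callable[[str], list]],
) -> List[Tuple[str, float]]:
    """Return (name, tokens_per_word) pairs, most compact tokenizer first."""
    scores = [(name, tokens_per_word(texts, fn)) for name, fn in tokenizers.items()]
    return sorted(scores, key=lambda item: item[1])

# Toy stand-ins for real tokenizers, only to show the call shape.
toy_tokenizers = {
    "char": list,             # one token per character
    "whitespace": str.split,  # one token per word
}
for name, score in compare_tokenizers(["سلام دنیا"], toy_tokenizers):
    print(name, score)
```

With Hugging Face tokenizers you could pass each model's `tokenizer.tokenize` as the callable. Run it on the same prompt set you use for the drift eval so the numbers are comparable.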

7. Practical expectation

My expectation would be:

Qwen3.5-0.8B + Persian CPT + Persian SFT
  -> likely much better Persian
  -> likely reduced drift
  -> not guaranteed drift-free

To get closer to Persian-only behavior, add:

Persian uncertainty examples
Persian correction examples
Persian refusal examples
Persian tool-failure examples
negative examples for English/Chinese drift
low-temperature decoding
output language check

For example:

If output contains too much English/Chinese:
  retry with stronger Persian-only instruction
  or run a repair prompt
  or reject the answer

This is not elegant, but for a very small local assistant it is practical.
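The retry/reject steps above could look roughly like this. It is a sketch, not a definitive implementation: `generate` is a hypothetical callable wrapping whatever model call you use, and the 0.9 threshold, retry count, and the strict Persian instruction text are assumptions to tune. The repair-prompt variant is left out for brevity.

```python
from typing import Callable, Optional

# Assumed instruction text, roughly "Answer only in Persian."
STRICT_PREFIX = "فقط به زبان فارسی پاسخ بده.\n\n"

def persian_ratio(text: str) -> float:
    """Share of alphabetic characters in the Arabic block (U+0600-U+06FF)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF") / len(letters)

def generate_persian(
    prompt: str,
    generate: Callable[[str], str],
    min_ratio: float = 0.9,
    max_retries: int = 2,
) -> Optional[str]:
    """Return the first answer that passes the script check, else None (reject)."""
    attempt_prompt = prompt
    for _ in range(max_retries + 1):
        answer = generate(attempt_prompt)
        if persian_ratio(answer) >= min_ratio:
            return answer
        # Retry with a stronger Persian-only instruction in front of the prompt.
        attempt_prompt = STRICT_PREFIX + prompt
    return None
```

Returning None leaves the final decision to the caller, for example showing a fixed Persian fallback message instead of a drifting answer.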

Bottom line

For a Persian-only 0.8B assistant, I would assume:

language drift is normal unless actively controlled

The reason is probably not only the tokenizer or Qwen-specific behavior. It is also limited model capacity and interference between languages, styles, and tasks.

So the strategy should be:

1. choose the least-drifty model you can
2. do Persian-focused CPT
3. do Persian-only SFT
4. train uncertainty/correction/failure cases in Persian
5. penalize mixed-language outputs if possible
6. use conservative decoding
7. add output language checks
8. measure drift directly

In short:

If another same-size model has lower drift and acceptable Persian tokenization, use it. But if Qwen3.5-0.8B is still the best tradeoff, I would keep it and treat language drift as a first-class eval and training target.