
How can I build a high-quality dataset?

Hugging Face Forums [Unofficial] June 20, 2026

Hmm… for small models, maybe we should not expect clean multilingual separation in the output. It may simply be a capacity issue. More specifically:


Short answer

I would not treat this as only a Qwen-specific bug.

I would think of it as a capacity / interference problem.

In a very small multilingual model, many things compete for limited capacity:

  • Persian
  • English
  • Chinese
  • generic assistant style
  • refusal style
  • reasoning style
  • formatting rules
  • tool-use behavior
  • uncertainty behavior
  • instruction-following constraints

When the model is uncertain, overloaded, sampled too freely, or pushed outside its strongest distribution, it may fall back to a stronger learned mode: English, Chinese, boilerplate, generic assistant style, or some other pattern.

So for a Persian-only 0.8B assistant, I would not expect multilingual separation to be naturally stable. I would treat language drift as a failure mode that must be actively engineered against.


1. Rough size intuition

This is not a hard rule, but my rough expectation would be:

| Model size | Language/style drift expectation |
|---|---|
| below 1B | Very likely. Needs strong constraints, narrow scope, and drift-specific eval. |
| 1B-3B | Still common, but can be made usable with good CPT/SFT and conservative decoding. |
| 3B-7B | Transition zone. Often much better, but still fragile under uncertainty, long prompts, or high temperature. |
| 7B-14B | Usually much more stable for multilingual instruction following. |
| 14B+ | Much more reliable, though language drift can still happen. |

So for Qwen3.5-0.8B, I would put it clearly in the “drift likely” zone.

Not hopeless, but not something I would expect to disappear automatically.


2. Why this happens

I would describe it like this:

The smaller the model, the less room it has to keep languages, styles, tasks, and constraints cleanly separated.

This is related to what multilingual NLP papers often call the curse of multilinguality: multilingual training can help low-resource languages, but many languages and tasks also compete for limited model capacity. See, for example, “When Is Multilinguality a Curse?” and “Multilingual Large Language Models and Curse of Multilinguality”.

There is also direct work on language confusion, where LLMs fail to answer consistently in the user’s intended language. “Understanding and Mitigating Language Confusion in LLMs” reports that language confusion can be worsened by complex prompts and high sampling temperature, and can be partially reduced with few-shot prompting, multilingual SFT, and preference tuning.

So I would not frame this as:

Qwen randomly breaks.

I would frame it as:

A small multilingual model has limited capacity, and language/style/task modes interfere.

3. SFT alone may not fully solve it

Persian SFT will help.

But ordinary SFT mostly increases the probability of the desired answer tokens. It does not necessarily strongly penalize unwanted mixed-language outputs.

That matters because a mixed-language answer may still share many locally plausible tokens with the training distribution.

There is recent work, “Controlling Language Confusion in Multilingual LLMs”, arguing that normal SFT may not explicitly punish cross-lingual mixing, while preference-style objectives such as ORPO can suppress language-confused outputs more directly.

Practical implication:

Persian-only SFT helps.
But if English/Chinese drift is a specific failure mode,
include examples where mixed-language answers are explicitly bad.

For example:

| Prompt | Bad answer | Good answer |
|---|---|---|
| Persian user asks a question | starts Persian, then switches to English/Chinese | stays in Iranian Persian |
| Persian user asks uncertain question | “I’m not sure…” in English | Persian clarification / Persian uncertainty |
| Persian math prompt | reasoning in English/Chinese | Persian explanation, or tool call + Persian final answer |
| Persian correction prompt | generic English assistant tone | Persian correction style |

This is where preference data, DPO/ORPO-style data, or even simple reject/accept filtering can help.
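As a rough illustration of the reject/accept idea, here is a minimal sketch that turns a good and a bad answer into one preference record. The `prompt` / `chosen` / `rejected` field names follow a layout commonly used by DPO-style trainers, but that layout, the helper names, and the thresholds are all assumptions to adapt to your own training stack.

```python
import json


def persian_ratio(text: str) -> float:
    """Share of alphabetic characters in the main Arabic Unicode block.

    Assumption: U+0600-U+06FF is a good enough proxy for Persian letters.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    persian = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return persian / len(letters)


def make_pair(prompt, good, bad, min_good=0.95, max_bad=0.8):
    """Return a preference record, or None if the language contrast is unclear.

    min_good and max_bad are hypothetical thresholds, not tuned values.
    """
    if not good.strip() or not bad.strip():
        return None
    if persian_ratio(good) < min_good or persian_ratio(bad) > max_bad:
        return None
    return {"prompt": prompt, "chosen": good, "rejected": bad}


def to_jsonl(pairs):
    """Serialize records one per line, keeping Persian text readable."""
    return "\n".join(json.dumps(p, ensure_ascii=False) for p in pairs)
```

Pairs where the rejected answer is still almost fully Persian are dropped here, because they would not teach the model anything about language drift specifically.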


4. What I would do for a Persian-only 0.8B model

I would attack the problem at several layers.

| Layer | Action |
|---|---|
| model choice | compare same-size models for Persian tokenization and drift |
| CPT | Persian-heavy or Persian-only CPT corpus |
| SFT | Persian-only assistant examples |
| uncertainty SFT | “I don’t know / please clarify / I cannot answer” all in Persian |
| correction SFT | user correction and self-correction in Persian |
| negative/preference data | mixed-language outputs marked as bad |
| decoding | low temperature, avoid aggressive presence penalty |
| output check | detect non-Persian output and retry/repair |
| eval | explicit language-drift test set |

For a Persian-only assistant, I would not try to preserve broad multilingual behavior unless you actually need it.

If the product goal is Iranian Persian, then English/Chinese drift should be treated as an error.


5. Drift eval is necessary

I would make a small eval set specifically for drift.

Not just normal Persian QA.

Test the cases where drift is likely:

| Case | Why |
|---|---|
| ambiguous Persian prompt | model may fall back to dominant language |
| misspelled Persian | uncertainty increases drift |
| long multi-turn chat | context pressure increases drift |
| correction from user | model may switch style/language |
| “I don’t know” case | uncertainty often triggers English boilerplate |
| math/reasoning | reasoning style may drift to English |
| tool failure | failure mode may drift |
| mixed Persian-English technical terms | model may continue in English |
| high temperature | sampling can increase drift |
| long answer | later paragraphs may drift |
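To see which of these cases actually hurts, it may help to report a drift rate per case rather than one global number. The sketch below assumes each eval example is a `(case, prompt)` tuple and that `generate` is whatever function calls your model. Both are placeholders, and the 10% threshold is an arbitrary starting point.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple


def has_drift(text: str, max_non_persian: float = 0.1) -> bool:
    """True if too many letters fall outside the main Arabic Unicode block.

    Assumption: U+0600-U+06FF approximates Persian; the threshold is hypothetical.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    foreign = sum(1 for ch in letters if not ("\u0600" <= ch <= "\u06FF"))
    return foreign / len(letters) > max_non_persian


def drift_rate_by_case(
    examples: Iterable[Tuple[str, str]],
    generate: Callable[[str], str],
) -> Dict[str, float]:
    """Run every prompt and return the fraction of drifted answers per case."""
    total = defaultdict(int)
    drifted = defaultdict(int)
    for case, prompt in examples:
        total[case] += 1
        if has_drift(generate(prompt)):
            drifted[case] += 1
    return {case: drifted[case] / total[case] for case in total}
```

Running this once per decoding setting (for example, low and high temperature) would cover the "high temperature" row without needing separate prompts.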

Measure something simple:

non-Persian character ratio
English token ratio
Chinese character count
answer starts in Persian?
answer ends in Persian?
does reasoning drift?
does refusal drift?

Even a small 100-example drift eval would be useful.


6. If another model is better, use it

If you find another 0.5B-3B model with:

  • better Persian tokenization
  • less English/Chinese drift
  • acceptable instruction following
  • acceptable device performance

then yes, use it.

Starting from a less drift-prone model is usually better, other things being equal.

But I would not abandon Qwen3.5-0.8B only because drift exists. At 0.8B, some drift risk is expected.

I would compare models with a small test:

same Persian prompts
same decoding settings
same drift eval
same tokenization statistics
same latency/memory budget
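For the tokenization part of that comparison, one simple statistic is tokens per whitespace-separated word on the same Persian texts: lower usually means the tokenizer splits Persian less aggressively. The sketch below takes any `tokenize` callable so it stays library-independent. With a Hugging Face tokenizer you would probably pass its `tokenize` method, but treat that wiring as an assumption to verify for your setup.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def tokens_per_word(tokenize: Callable[[str], list], texts: Sequence[str]) -> float:
    """Average number of tokens per whitespace-separated word."""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words if total_words else 0.0


def compare_tokenizers(
    tokenizers: Dict[str, Callable[[str], list]],
    texts: Sequence[str],
) -> List[Tuple[str, float]]:
    """Return (name, tokens_per_word) rows, most compact tokenizer first."""
    rows = [(name, tokens_per_word(fn, texts)) for name, fn in tokenizers.items()]
    return sorted(rows, key=lambda row: row[1])
```

This only covers tokenization cost. Drift and latency still need the same prompts and decoding settings run through each candidate model.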

Then choose based on evidence.


7. Practical expectation

My expectation would be:

Qwen3.5-0.8B + Persian CPT + Persian SFT
  -> likely much better Persian
  -> likely reduced drift
  -> not guaranteed drift-free

To get closer to Persian-only behavior, add:

Persian uncertainty examples
Persian correction examples
Persian refusal examples
Persian tool-failure examples
negative examples for English/Chinese drift
low-temperature decoding
output language check

For example:

If output contains too much English/Chinese:
  retry with stronger Persian-only instruction
  or run a repair prompt
  or reject the answer
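The retry branch of that check could look roughly like this. `generate` stands in for your own model call, the Persian reminder string ("Answer only in Persian.") is just an example, and the 0.9 threshold and retry count are untuned guesses.

```python
from typing import Callable, Optional

# Hypothetical reminder text: "Answer only in Persian."
PERSIAN_ONLY_REMINDER = "فقط به زبان فارسی پاسخ بده."


def persian_ratio(text: str) -> float:
    """Share of letters in the main Arabic Unicode block (assumed Persian proxy)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 1.0  # nothing alphabetic (e.g. digits only): do not flag as drift
    persian = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return persian / len(letters)


def generate_persian(
    generate: Callable[[str], str],
    prompt: str,
    min_ratio: float = 0.9,
    max_retries: int = 2,
) -> Optional[str]:
    """Return the first answer that is Persian enough, or None to reject."""
    current = prompt
    for _ in range(max_retries + 1):
        answer = generate(current)
        if persian_ratio(answer) >= min_ratio:
            return answer
        # Retry with a stronger Persian-only instruction in front of the prompt.
        current = f"{PERSIAN_ONLY_REMINDER}\n\n{prompt}"
    return None
```

Returning `None` leaves the final decision (repair prompt, fallback message, or rejection) to the caller, which keeps the check itself small.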

This is not elegant, but for a very small local assistant it is practical.


Bottom line

For a Persian-only 0.8B assistant, I would assume:

language drift is normal unless actively controlled

The reason is probably not only tokenizer or Qwen behavior. It is also limited model capacity and interference between languages, styles, and tasks.

So the strategy should be:

1. choose the least-drifty model you can
2. do Persian-focused CPT
3. do Persian-only SFT
4. train uncertainty/correction/failure cases in Persian
5. penalize mixed-language outputs if possible
6. use conservative decoding
7. add output language checks
8. measure drift directly

In short:

If another same-size model has lower drift and acceptable Persian tokenization, use it. But if Qwen3.5-0.8B is still the best tradeoff, I would keep it and treat language drift as a first-class eval and training target.
