How can I build a high-quality dataset?

Hugging Face Forums [Unofficial] June 20, 2026

Hmm… if you stay on the 0.8B route, the fine-tuning and engineering cost may actually end up higher than just moving to a slightly larger model, maybe 2B–9B. More details below:

I would frame this as a model-size / evaluation / data-quality decision, not only as a “can Qwen3.5-0.8B learn Persian?” decision.

My short answer is:

Qwen3.5-0.8B may be good enough for a narrow Persian-first festival demo, but I would not invest heavily in fine-tuning it before comparing it against Qwen3.5-2B and Qwen3.5-4B on a small Persian stress eval.

The reason is that 0.8B is not just “a bit smaller” than 2B or 4B. For a multilingual assistant, 0.8B may be close to the minimum useful capacity. Once you require Persian fluency, instruction following, math tutoring, tool JSON, Latin-name handling, uncertainty behavior, and low language drift, the small model can become expensive in engineering time.

So I would use this decision rule:

| Route | When it makes sense | Main risk |
|---|---|---|
| Qwen3.5-0.8B | strict lightweight demo, narrow assistant, fast/cheap inference | may need more SFT/DPO/guards/CPT to compensate |
| Qwen3.5-2B | first “extra capacity” candidate if 0.8B feels brittle | still small, but much less squeezed |
| Qwen3.5-4B | likely best total-engineering-cost route if quality matters | higher inference cost |
| Qwen3.5-9B | quality demo / stronger tutor behavior | may no longer feel like a tiny SLM project |

Useful references:

Qwen3.5-0.8B model card
Qwen3.5 model collection
Unsloth Qwen3.5 guide
Unsloth Qwen3.5 fine-tuning guide

1. My recommended first step: build the eval before training

Before CPT, tokenizer extension, or a large SFT run, I would build a small Persian stress eval and run the same prompts on:

Qwen3.5-0.8B
Qwen3.5-2B
Qwen3.5-4B
optionally Qwen3.5-9B
optionally one non-Qwen small baseline

The goal is not to create a perfect academic benchmark. The goal is to answer:

Is 0.8B already acceptable?
Does 2B fix most of the failures?
Does 4B reduce engineering work enough to justify the extra inference cost?
Are the failures mainly Persian fluency, language drift, math quality, JSON breakage, or shallow answers?
Should you spend effort on SFT, DPO/ORPO, CPT, tokenizer extension, or simply a larger base model?

A first eval can be only 500–700 examples.

| Eval bucket | Size | What it tests |
|---|---|---|
| Persian general QA | 100 | basic Persian answer quality |
| PersianMMLU / Khayyam sample | 100 | school knowledge and reasoning |
| Latin-name mixed prompts | 50 | `Mohammad`-style name handling |
| English technical term prompts | 50 | controlled code-switching |
| Math tutor prompts | 50–100 | step-by-step Persian tutoring |
| Tool JSON prompts | 50 | valid JSON and key preservation |
| Grammar correction | 50 | Persian language tutoring |
| Long-answer drift | 30–50 | whether it switches language halfway |
| Uncertainty prompts | 30–50 | whether it says “I do not know” properly |
| Safety / social norm prompts | 30–50 | basic public assistant behavior |
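
To compare models bucket by bucket, something like the following may be enough as a first pass. It is only a sketch: the bucket names, the `(model, bucket, passed)` tuple layout, and the sample results are made up for illustration, and how `passed` gets decided (rubric, judge, exact match) is left open.

```python
from collections import defaultdict


def pass_rates(results):
    """Turn (model, bucket, passed) tuples into {model: {bucket: pass_rate}}."""
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # [passed, total]
    for model, bucket, passed in results:
        cell = counts[model][bucket]
        cell[0] += int(passed)
        cell[1] += 1
    return {
        model: {bucket: p / t for bucket, (p, t) in buckets.items()}
        for model, buckets in counts.items()
    }


# Hypothetical graded results, not real measurements.
results = [
    ("0.8B", "tool_json", True),
    ("0.8B", "tool_json", False),
    ("0.8B", "long_answer_drift", False),
    ("4B", "tool_json", True),
    ("4B", "tool_json", True),
    ("4B", "long_answer_drift", True),
]

rates = pass_rates(results)
for model, buckets in rates.items():
    print(model, buckets)
```

Keeping the rates per bucket, rather than one overall score, is what lets you tell "fails on language drift" apart from "fails on tool JSON".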

Some useful Persian benchmark starting points:

Khayyam Challenge / PersianMMLU
Open Persian LLM Leaderboard
PARSE: Persian open-domain reasoning QA
ELAB: Persian alignment benchmark
PersLitEval

2. About Latin names like `Mohammad`

I would not remove Latin names. They are not bad data by themselves.

The real target is:

Preserve Latin names, formulas, URLs, code, and JSON when needed, but keep the surrounding answer Persian.

For example:

| Input pattern | Bad behavior | Desired behavior |
|---|---|---|
| Persian + Latin name | model switches to English | keep the name, answer in Persian |
| Persian + English term | entire answer becomes English | preserve term, explain in Persian |
| Persian + formula | formula gets mistranslated | preserve formula, explain in Persian |
| Persian + JSON schema | keys get translated or JSON breaks | keep valid JSON, explain in Persian outside JSON |
| Persian + URL/citation | answer drifts into English | keep URL/citation, answer in Persian |

So instead of deleting mixed examples, I would create controlled mixed examples.

Example behavior policy:

“Answer in Persian. Preserve names, formulas, code, JSON keys, URLs, and citations exactly when needed. Do not switch the surrounding explanation into English or Chinese.”
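
A policy like this can be checked mechanically in the eval, at least roughly. The sketch below is a heuristic and not a definitive check: the regexes, the function names, and the example strings are my own assumptions, and a Latin word missing from the answer is only a flag for review, since not every span has to be repeated.

```python
import json
import re

# Deliberately simple patterns (assumption): URLs and Latin-script words.
URL_RE = re.compile(r"https?://\S+")
LATIN_WORD_RE = re.compile(r"[A-Za-z][A-Za-z0-9_\-]*")


def protected_spans(prompt):
    """Collect URLs and Latin-script words the answer may need to keep verbatim."""
    urls = URL_RE.findall(prompt)
    rest = URL_RE.sub(" ", prompt)  # avoid re-matching pieces of a URL as words
    return urls + LATIN_WORD_RE.findall(rest)


def missing_spans(prompt, answer):
    """Return protected spans from the prompt that do not appear in the answer."""
    return [span for span in protected_spans(prompt) if span not in answer]


def json_keys_preserved(answer_json, expected_keys):
    """True if answer_json parses as an object with exactly the expected keys."""
    try:
        obj = json.loads(answer_json)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and set(obj) == set(expected_keys)


prompt = "Mohammad امروز به مدرسه رفت. درباره https://example.com توضیح بده."
good = "Mohammad امروز به مدرسه رفت و نشانی https://example.com یک سایت نمونه است."
bad = "محمد امروز به مدرسه رفت."

print(missing_spans(prompt, good))
print(missing_spans(prompt, bad))
print(json_keys_preserved('{"city": "تهران", "unit": "celsius"}', ["city", "unit"]))
print(json_keys_preserved('{"شهر": "تهران", "unit": "celsius"}', ["city", "unit"]))
```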

This is also where Qwen-Scope is conceptually relevant. Qwen-Scope uses sparse autoencoders to analyze and steer Qwen-family model internals, and it discusses development uses around behaviors such as code-switching and repetition.

Useful references:

Qwen-Scope
SASFT: SAE-guided supervised fine-tuning for unexpected code-switching
Controlling Language Confusion in Multilingual LLMs
OLA: Learning to respond in the user’s language

But I would not depend on Qwen-Scope as the main practical solution for a festival pipeline. For you, the simpler stack is probably:

SFT examples showing correct Persian behavior around Latin spans
rejected examples where the answer drifts into English/Chinese
small DPO/ORPO if drift persists
sentence-level language checks in eval
optional runtime retry if the output language is wrong
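
For the sentence-level language check, a script-ratio heuristic may be a reasonable starting point. This is a minimal sketch under my own assumptions: the 0.5 threshold is arbitrary, the naive sentence split will also cut URLs and decimals at dots, and counting characters by Unicode range is not real language identification.

```python
import re

PERSIAN_RE = re.compile(r"[\u0600-\u06FF]")      # Arabic block, which covers Persian letters
OTHER_RE = re.compile(r"[A-Za-z\u4e00-\u9fff]")  # Latin letters and common CJK ideographs
SENTENCE_SPLIT_RE = re.compile(r"[.!?؟\n]+")


def persian_ratio(text):
    """Share of Persian-script characters among Persian + Latin + CJK characters."""
    persian = len(PERSIAN_RE.findall(text))
    other = len(OTHER_RE.findall(text))
    total = persian + other
    return persian / total if total else 1.0  # nothing to judge, e.g. only digits


def drifted_sentences(answer, min_ratio=0.5):
    """Return sentences whose Persian share falls below min_ratio (assumed threshold)."""
    sentences = [s.strip() for s in SENTENCE_SPLIT_RE.split(answer) if s.strip()]
    return [s for s in sentences if persian_ratio(s) < min_ratio]


def wrong_language_rate(answers, min_ratio=0.5):
    """Fraction of answers that contain at least one drifted sentence."""
    flagged = sum(1 for a in answers if drifted_sentences(a, min_ratio))
    return flagged / len(answers) if answers else 0.0


answer = (
    "Mohammad باید معادله را حل کند. "
    "First we subtract 3 from both sides. "
    "سپس جواب را بررسی می‌کنیم."
)
print(drifted_sentences(answer))
print(wrong_language_rate([answer, "همه چیز فارسی است."]))
```

Note that the first sentence passes even though it contains `Mohammad`, which is the behavior you want: a preserved name should not count as drift.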

3. KenLM is useful, but not enough for “high-quality” answers

Your concern about shallow but fluent answers is correct.

A KenLM-style Good/Bad filter can help with:

noisy text
strange character distribution
broken Persian
low-fluency text
bad OCR-ish text
obvious junk

But it will not reliably detect:

shallow explanations
generic answers
low educational value
missing reasoning steps
factual weakness
confident but incomplete answers

For this, I would copy the idea behind FineWeb-Edu: use a separate educational-quality signal, not only a fluency signal.

Useful references:

FineWeb-Edu dataset
FineWeb-Edu classifier
FineWeb paper
FineWeb blog post

A practical Persian filtering stack could look like:

language ID
exact / near deduplication
rule filters for broken text
KenLM / perplexity filtering
educational-quality classifier or LLM judge
small human audit
only then use the text for CPT or raw-text-to-SFT generation
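
The order of that stack could be wired up roughly as below. This is a sketch with loud assumptions: `fluency_score` and `edu_score` are placeholder callables where a KenLM perplexity score and an educational-quality classifier or judge would plug in, the thresholds are invented, the language check is a crude script ratio, and near-deduplication and the human audit are left out.

```python
import hashlib
import re

PERSIAN_RE = re.compile(r"[\u0600-\u06FF]")


def normalize(text):
    """Collapse whitespace so trivially different copies hash the same."""
    return re.sub(r"\s+", " ", text).strip()


def is_mostly_persian(text, min_ratio=0.6):
    """Crude language ID: share of Persian-script letters among all letters."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    persian = sum(1 for c in letters if PERSIAN_RE.match(c))
    return persian / len(letters) >= min_ratio


def passes_rules(text, min_chars=20):
    """Rule filter: drop very short text and text with replacement characters."""
    return len(text) >= min_chars and "\ufffd" not in text


def filter_corpus(docs, fluency_score, edu_score, min_fluency=0.5, min_edu=0.5):
    """Apply language ID, exact dedup, rules, fluency, then educational quality."""
    seen, kept = set(), []
    for doc in docs:
        text = normalize(doc)
        if not is_mostly_persian(text):
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        if not passes_rules(text):
            continue
        if fluency_score(text) < min_fluency or edu_score(text) < min_edu:
            continue
        kept.append(text)
    return kept


doc = "آب در صد درجه می‌جوشد زیرا فشار بخار آن با فشار هوا برابر می‌شود."
docs = [doc, doc + "  \n", "This page is written in English and should be dropped.", "سلام"]

# Stand-in scorers for illustration only.
kept = filter_corpus(docs, fluency_score=lambda t: 1.0,
                     edu_score=lambda t: 1.0 if "زیرا" in t else 0.0)
print(len(kept))
```

The cheap filters run first so the expensive educational-quality scorer only sees text that already survived the noise checks.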

4. Math tutor: fixed format is good, but keep the target narrow

For 0.8B, I would not target “general math solver”. I would target:

Persian step-by-step tutor for known school-level problem types.

That is a much more realistic goal.

Good tutor data should include:

| Field | Purpose |
|---|---|
| problem | original Persian problem |
| level | grade / difficulty |
| topic | arithmetic, algebra, geometry, etc. |
| student_attempt | optional wrong solution |
| diagnosis | what is wrong or missing |
| hint | small nudge |
| solution_steps | short Persian steps |
| final_answer | normalized answer |
| checks | how to verify |
| language_policy | answer in Persian; preserve formulas |
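
As a record type, those fields might look like the following. The field names come from the table above; the types, the choice of which fields are optional beyond `student_attempt`, the `issues()` helper, and the sample problem are my own assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TutorExample:
    problem: str
    level: str
    topic: str
    hint: str
    solution_steps: List[str]
    final_answer: str
    checks: str
    student_attempt: Optional[str] = None   # optional wrong solution
    diagnosis: Optional[str] = None         # what is wrong or missing in the attempt
    language_policy: str = "answer in Persian; preserve formulas"

    def issues(self) -> List[str]:
        """Return a list of basic problems with this record (empty means usable)."""
        found = []
        if not self.problem.strip():
            found.append("empty problem")
        if not self.solution_steps:
            found.append("no solution steps")
        if not self.final_answer.strip():
            found.append("empty final answer")
        if self.student_attempt and not self.diagnosis:
            found.append("student attempt without diagnosis")
        return found


example = TutorExample(
    problem="معادله 2x + 3 = 11 را حل کنید.",
    level="grade 7",  # hypothetical label
    topic="algebra",
    hint="اول عدد ثابت را به طرف دیگر ببرید.",
    solution_steps=[
        "از دو طرف ۳ واحد کم می‌کنیم: 2x = 8",
        "دو طرف را بر ۲ تقسیم می‌کنیم: x = 4",
    ],
    final_answer="x = 4",
    checks="2(4) + 3 = 11",
)
print(example.issues())
```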

I would evaluate not only the final answer, but also:

step correctness
first-error detection
usefulness of hint
whether the model over-solves when the student only needs a hint
language consistency
formula preservation

Useful reference:

Step-by-Step: Improving Math Reasoning and Tutoring with Process Supervision

My practical recommendation:

| Model size | Math tutor scope |
|---|---|
| 0.8B | narrow, fixed format, school-level, many templates |
| 2B | still controlled, but more robust explanations |
| 4B | better candidate for richer tutoring |
| 9B | stronger quality demo if inference cost is acceptable |

5. Self-reminders help, but they are not enough

A system prompt like “always answer in Persian” helps, but I would not rely on it alone.

Better stack:

SFT habit: many examples where mixed input still produces Persian output.
Preference data: chosen answer = Persian answer with allowed spans preserved; rejected answer = English/Chinese drift, broken JSON, translated schema keys, shallow answer, or hallucination.
DPO/ORPO: use preference tuning if SFT does not suppress drift enough.
Eval: track wrong-language rate and sentence-level drift.
Runtime guard: if needed, detect wrong-language output and retry.
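
The runtime guard can be a thin wrapper around whatever generation call you already have. A minimal sketch, assuming a `generate(prompt) -> str` callable (hypothetical, standing in for your inference code), an arbitrary 0.5 script-ratio threshold, and a reminder string of my own wording:

```python
import re

PERSIAN_RE = re.compile(r"[\u0600-\u06FF]")
OTHER_RE = re.compile(r"[A-Za-z\u4e00-\u9fff]")
REMINDER = "فقط به فارسی پاسخ بده."  # "Answer only in Persian."


def looks_persian(text, min_ratio=0.5):
    """True if Persian-script characters dominate over Latin/CJK characters."""
    persian = len(PERSIAN_RE.findall(text))
    other = len(OTHER_RE.findall(text))
    total = persian + other
    return True if total == 0 else persian / total >= min_ratio


def generate_with_guard(generate, prompt, max_retries=2):
    """Call generate(); on wrong-language output, retry with a reminder appended.

    Returns (answer, ok) where ok is False if every attempt drifted.
    """
    current = prompt
    answer = ""
    for _ in range(max_retries + 1):
        answer = generate(current)
        if looks_persian(answer):
            return answer, True
        current = prompt + "\n\n" + REMINDER
    return answer, False


# Fake generator for illustration: drifts once, then answers in Persian.
canned = iter(["The answer is 4.", "پاسخ برابر با ۴ است."])
answer, ok = generate_with_guard(lambda p: next(canned), "دو به‌علاوه دو چند می‌شود؟")
print(ok, answer)
```

I would log how often the retry fires. If it fires a lot, that is an eval signal that SFT or preference tuning still has work to do, not something the guard should hide.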

Useful references:

Direct Preference Optimization
Controlling Language Confusion in Multilingual LLMs
OLA: Learning to respond in the user’s language

6. Tokenizer extension: possible, but probably not first

Tokenizer extension can help if Persian is tokenized badly. But it is not a free improvement.

If you add tokens, you need:

embedding resize
sensible initialization
warm-up / alignment
continued training
regression tests

A good Persian-specific reference is PersianPhi:

PersianPhi model card
Persian-Phi paper

PersianPhi is useful because it shows that tokenizer adaptation can be part of a serious Persian curriculum pipeline. But that is also the warning: it is not just “add Persian tokens and run SFT”.

For your project, I would measure first:

| Measurement | Why it matters |
|---|---|
| tokens per word | Persian compression |
| characters per token | context efficiency |
| split rate | whether words are fragmented |
| Persian vs English token cost | multilingual imbalance |
| Latin-name mixed examples | real input behavior |
| formulas / JSON | tool and math safety |
| textbook text | tutor data efficiency |
If Qwen3.5 tokenization is acceptable, I would skip tokenizer extension and spend the time on eval, SFT quality, and drift control.

7. Suggested training strategy

I would use this order.

Stage 0 — Compare base models before training

Run the same eval on 0.8B, 2B, and 4B.

Decision:

if 0.8B passes, keep it
if 0.8B fails only in language discipline, try SFT + DPO/ORPO
if 0.8B fails in tutor quality, tool stability, or long-answer drift, test 2B/4B before heavy training
if 4B works with much less engineering, use 4B
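
Written as code, the first three rules might look like this. It is only a sketch: the 0.8 pass threshold and the bucket names are assumptions, and the last rule (whether 4B saves enough engineering) is a judgment call that I would not try to encode.

```python
# Buckets treated as "language discipline" (assumed names).
LANGUAGE_BUCKETS = {"latin_names", "english_terms"}


def choose_route(small_model_rates, threshold=0.8):
    """Map 0.8B per-bucket pass rates to a next step, following the rules above."""
    failed = {b for b, rate in small_model_rates.items() if rate < threshold}
    if not failed:
        return "keep 0.8B"
    if failed <= LANGUAGE_BUCKETS:
        return "try SFT + DPO/ORPO on 0.8B"
    return "test 2B/4B before heavy training"


# Hypothetical pass rates, not real measurements.
print(choose_route({"latin_names": 0.6, "tool_json": 0.9}))
print(choose_route({"latin_names": 0.9, "tool_json": 0.5}))
```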

Stage 1 — SFT for Persian-first behavior

Start with high-quality examples, not a huge weak dataset.

Possible first target:

5k–30k strong Persian SFT examples
Persian answers
tutor examples
grammar correction
uncertainty behavior
Latin-name handling
formula handling
JSON/tool handling

Stage 2 — Preference tuning for drift and bad behavior

Only if SFT is not enough.

Use chosen/rejected pairs for:

wrong-language drift
translated JSON keys
broken formulas
shallow answers
overconfident hallucinations

Stage 3 — CPT only if eval says you need it

If Persian fluency or domain knowledge is still weak, then consider small CPT.

But I would not start with huge CPT unless the goal is truly “build a Persian language model”, not “build a Persian assistant demo”.

Stage 4 — Tokenizer extension only if measurement justifies it

Tokenizer extension is a serious intervention. It belongs after measurement, not before.

Short answers to your questions

Will Latin names cause drift?

They can trigger drift, but they are not bad data. Keep them. Train and evaluate the model to preserve names while keeping the surrounding answer Persian.

Can Good/Bad KenLM catch shallow answers?

No, not reliably. KenLM helps with surface fluency and noise. Use an educational-quality classifier or rubric-based judge for depth and usefulness.

Is fixed-format math tutor data a good idea?

Yes. For 0.8B, fixed and narrow is better. Treat it as a guided tutor, not a general math solver.

Can the model learn confidence and self-reminders?

Partly, but self-reminders are not enough. Use SFT for habit, preference tuning for wrong-language rejection, and eval/runtime checks.

Is Qwen-Scope useful?

Yes, conceptually and possibly technically. It supports the idea that code-switching can be analyzed and mitigated. But I would not make it the main solution for a festival pipeline.

Is tokenizer extension worth it?

Maybe, but only after measuring tokenization efficiency. If Qwen3.5 tokenization is acceptable, skip it.

Should you stay on 0.8B?

Maybe. If the project requires strict lightweight deployment, yes. But if 2B or 4B is allowed, test them before investing heavily in 0.8B-specific engineering.

Practical recommendation

My practical recommendation would be:

Build a 500–700 item Persian stress eval.
Run Qwen3.5-0.8B, 2B, and 4B before training.
If 0.8B passes, use it and keep the scope narrow.
If 0.8B fails mainly because of language drift, try SFT + small DPO/ORPO.
If 0.8B fails because of capacity, tutor quality, or tool stability, move to 2B or 4B.
Use KenLM for noise, not for educational value.
Do not extend the tokenizer unless tokenization measurements clearly justify it.

Main idea:

Do not solve model-size uncertainty with training first. Solve it with a small eval first. Then choose whether 0.8B, 2B, or 4B is the cheapest route in total engineering cost.
