How can I build a high-quality dataset?
Hmm… if you stay on the 0.8B route, the fine-tuning and engineering cost may actually end up higher than just moving to a slightly larger model, maybe 2B–9B. More details below:
I would frame this as a model-size / evaluation / data-quality decision, not only as a “can Qwen3.5-0.8B learn Persian?” decision.
My short answer is:
Qwen3.5-0.8B may be good enough for a narrow Persian-first festival demo, but I would not invest heavily in fine-tuning it before comparing it against Qwen3.5-2B and Qwen3.5-4B on a small Persian stress eval.
The reason is that 0.8B is not just “a bit smaller” than 2B or 4B. For a multilingual assistant, 0.8B may be close to the minimum useful capacity. Once you require Persian fluency, instruction following, math tutoring, tool JSON, Latin-name handling, uncertainty behavior, and low language drift, the small model can become expensive in engineering time.
So I would use this decision rule:
| Route | When it makes sense | Main risk |
|---|---|---|
| Qwen3.5-0.8B | strict lightweight demo, narrow assistant, fast/cheap inference | may need more SFT/DPO/guards/CPT to compensate |
| Qwen3.5-2B | first “extra capacity” candidate if 0.8B feels brittle | still small, but much less squeezed |
| Qwen3.5-4B | likely best total-engineering-cost route if quality matters | higher inference cost |
| Qwen3.5-9B | quality demo / stronger tutor behavior | may no longer feel like a tiny SLM project |
Useful references:
- Qwen3.5-0.8B model card
- Qwen3.5 model collection
- Unsloth Qwen3.5 guide
- Unsloth Qwen3.5 fine-tuning guide
1. My recommended first step: build the eval before training
Before CPT, tokenizer extension, or a large SFT run, I would build a small Persian stress eval and run the same prompts on:
- Qwen3.5-0.8B
- Qwen3.5-2B
- Qwen3.5-4B
- optionally Qwen3.5-9B
- optionally one non-Qwen small baseline
The goal is not to create a perfect academic benchmark. The goal is to answer:
- Is 0.8B already acceptable?
- Does 2B fix most of the failures?
- Does 4B reduce engineering work enough to justify the extra inference cost?
- Are the failures mainly Persian fluency, language drift, math quality, JSON breakage, or shallow answers?
- Should you spend effort on SFT, DPO/ORPO, CPT, tokenizer extension, or simply a larger base model?
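To answer those questions from raw eval output, a per-model, per-bucket pass rate is usually enough. A minimal sketch, assuming each eval result is a dict with `model`, `bucket`, and `passed` keys (these field names and the toy labels are my own, not a fixed format):

```python
from collections import defaultdict

def pass_rates(results):
    """results: iterable of dicts with keys 'model', 'bucket', 'passed' (bool).
    Returns {model: {bucket: pass_rate}}."""
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # [passed, total]
    for r in results:
        cell = counts[r["model"]][r["bucket"]]
        cell[0] += int(r["passed"])
        cell[1] += 1
    return {
        model: {bucket: p / t for bucket, (p, t) in buckets.items()}
        for model, buckets in counts.items()
    }

# Toy results; the model labels and bucket names are placeholders.
results = [
    {"model": "0.8B", "bucket": "tool_json", "passed": False},
    {"model": "0.8B", "bucket": "tool_json", "passed": True},
    {"model": "4B", "bucket": "tool_json", "passed": True},
    {"model": "4B", "bucket": "tool_json", "passed": True},
]
rates = pass_rates(results)
```

Reading the table by bucket rather than as one overall score is what tells you whether the gap is language discipline (cheap to fix) or capacity (expensive to fix).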
A first eval can be only 500–700 examples.
| Eval bucket | Size | What it tests |
|---|---|---|
| Persian general QA | 100 | basic Persian answer quality |
| PersianMMLU / Khayyam sample | 100 | school knowledge and reasoning |
| Latin-name mixed prompts | 50 | Mohammad-style name handling |
| English technical term prompts | 50 | controlled code-switching |
| Math tutor prompts | 50–100 | step-by-step Persian tutoring |
| Tool JSON prompts | 50 | valid JSON and key preservation |
| Grammar correction | 50 | Persian language tutoring |
| Long-answer drift | 30–50 | whether it switches language halfway |
| Uncertainty prompts | 30–50 | whether it says “I do not know” properly |
| Safety / social norm prompts | 30–50 | basic public assistant behavior |
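As a sanity check on the total size, the bucket sizes in the table can be summed directly. This sketch only restates the table; the snake_case bucket names are my own labels:

```python
# (min, max) item counts per bucket, copied from the table above.
EVAL_BUCKETS = {
    "persian_general_qa": (100, 100),
    "persianmmlu_khayyam_sample": (100, 100),
    "latin_name_mixed": (50, 50),
    "english_technical_terms": (50, 50),
    "math_tutor": (50, 100),
    "tool_json": (50, 50),
    "grammar_correction": (50, 50),
    "long_answer_drift": (30, 50),
    "uncertainty": (30, 50),
    "safety_social_norms": (30, 50),
}

def total_range(buckets):
    """Return the (smallest, largest) total eval size implied by the buckets."""
    lo = sum(low for low, _ in buckets.values())
    hi = sum(high for _, high in buckets.values())
    return lo, hi

print(total_range(EVAL_BUCKETS))  # (540, 650)
```

So the table implies roughly 540 to 650 items, which sits inside the 500–700 range.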
Some useful Persian benchmark starting points:
- Khayyam Challenge / PersianMMLU
- Open Persian LLM Leaderboard
- PARSE: Persian open-domain reasoning QA
- ELAB: Persian alignment benchmark
- PersLitEval
2. About Latin names like Mohammad
I would not remove Latin names. They are not bad data by themselves.
The real target is:
Preserve Latin names, formulas, URLs, code, and JSON when needed, but keep the surrounding answer Persian.
For example:
| Input pattern | Bad behavior | Desired behavior |
|---|---|---|
| Persian + Latin name | model switches to English | keep the name, answer in Persian |
| Persian + English term | entire answer becomes English | preserve term, explain in Persian |
| Persian + formula | formula gets mistranslated | preserve formula, explain in Persian |
| Persian + JSON schema | keys get translated or JSON breaks | keep valid JSON, explain in Persian outside JSON |
| Persian + URL/citation | answer drifts into English | keep URL/citation, answer in Persian |
So instead of deleting mixed examples, I would create controlled mixed examples.
Example behavior policy:
“Answer in Persian. Preserve names, formulas, code, JSON keys, URLs, and citations exactly when needed. Do not switch the surrounding explanation into English or Chinese.”
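The JSON row of that table is the easiest one to check mechanically. A minimal sketch, assuming the tool call is a flat JSON object and that you know the expected top-level keys (the `city`/`unit` schema below is a made-up example):

```python
import json

def check_tool_json(raw_output, expected_keys):
    """Return (ok, reason). ok is True only if raw_output parses as a JSON
    object whose top-level keys are exactly expected_keys."""
    try:
        obj = json.loads(raw_output)
    except json.JSONDecodeError:
        return False, "invalid JSON"
    if not isinstance(obj, dict):
        return False, "not a JSON object"
    diff = set(obj) ^ set(expected_keys)
    if diff:
        return False, f"key mismatch: {sorted(diff)}"
    return True, "ok"

# Hypothetical schema: Persian values are fine, translated keys are not.
good = check_tool_json('{"city": "تهران", "unit": "celsius"}', {"city", "unit"})
translated = check_tool_json('{"شهر": "تهران", "unit": "celsius"}', {"city", "unit"})
broken = check_tool_json('{"city": ', {"city", "unit"})
```

This only covers top-level keys; nested schemas would need a real schema validator.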
This is also where Qwen-Scope is conceptually relevant. Qwen-Scope uses sparse autoencoders to analyze and steer Qwen-family model internals, and it discusses development uses around behaviors such as code-switching and repetition.
Useful references:
- Qwen-Scope
- SASFT: SAE-guided supervised fine-tuning for unexpected code-switching
- Controlling Language Confusion in Multilingual LLMs
- OLA: Learning to respond in the user’s language
But I would not depend on Qwen-Scope as the main practical solution for a festival pipeline. For you, the simpler stack is probably:
- SFT examples showing correct Persian behavior around Latin spans
- rejected examples where the answer drifts into English/Chinese
- small DPO/ORPO if drift persists
- sentence-level language checks in eval
- optional runtime retry if the output language is wrong
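For the sentence-level language check, a script-counting heuristic is a reasonable starting point. This is a rough sketch, not a real language identifier: it assumes URLs, backtick code, and `$...$` formulas are the allowed Latin spans, and it treats any character in the Arabic-script block as Persian.

```python
import re

# Spans that may legitimately stay Latin: URLs, inline code, $...$ formulas.
ALLOWED_SPANS = re.compile(r"https?://\S+|`[^`]*`|\$[^$]*\$")
ARABIC_SCRIPT = re.compile(r"[\u0600-\u06FF]")
LATIN_LETTER = re.compile(r"[A-Za-z]")
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")

def sentence_language(sentence):
    """Label one sentence as 'fa', 'latin', 'cjk', or 'none' by majority script."""
    text = ALLOWED_SPANS.sub(" ", sentence)
    counts = {
        "fa": len(ARABIC_SCRIPT.findall(text)),
        "latin": len(LATIN_LETTER.findall(text)),
        "cjk": len(CJK_CHAR.findall(text)),
    }
    label = max(counts, key=counts.get)
    return label if counts[label] > 0 else "none"

def drift_rate(answer):
    """Fraction of scored sentences whose majority script is not Persian."""
    sentences = [s for s in re.split(r"(?<=[.!?؟。])\s+", answer.strip()) if s]
    labels = [sentence_language(s) for s in sentences]
    scored = [label for label in labels if label != "none"]
    if not scored:
        return 0.0
    return sum(label != "fa" for label in scored) / len(scored)

answer = ("محمد (Mohammad) در تهران زندگی می‌کند. "
          "He works as an engineer. "
          "فرمول $E = mc^2$ درست است.")
rate = drift_rate(answer)  # one drifting sentence out of three
```

A single Latin name does not flip a sentence, because the label is decided by majority script after the allowed spans are masked.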
3. KenLM is useful, but not enough for “high-quality” answers
Your concern about shallow but fluent answers is correct.
A KenLM-style Good/Bad filter can help with:
- noisy text
- strange character distribution
- broken Persian
- low-fluency text
- bad OCR-ish text
- obvious junk
But it will not reliably detect:
- shallow explanations
- generic answers
- low educational value
- missing reasoning steps
- factual weakness
- confident but incomplete answers
For this, I would copy the idea behind FineWeb-Edu: use a separate educational-quality signal, not only a fluency signal.
Useful references:
- FineWeb-Edu dataset
- FineWeb-Edu classifier
- FineWeb paper
- FineWeb blog post
A practical Persian filtering stack could look like:
- language ID
- exact / near deduplication
- rule filters for broken text
- KenLM / perplexity filtering
- educational-quality classifier or LLM judge
- small human audit
- only then use the text for CPT or raw-text-to-SFT generation
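The first stages of that stack can be sketched as a single generator that runs the cheapest checks first. Everything here is an assumption for illustration: the thresholds are arbitrary, `fluency_score` stands in for a KenLM-based score, `edu_score` stands in for a 0-5 educational classifier or judge, and near-duplicate detection and the human audit are left out.

```python
import hashlib
import re

ARABIC_SCRIPT = re.compile(r"[\u0600-\u06FF]")

def persian_ratio(text):
    """Share of non-space characters in the Arabic-script block (crude language ID)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(bool(ARABIC_SCRIPT.match(c)) for c in chars) / len(chars)

def filter_corpus(docs, fluency_score, edu_score,
                  min_persian=0.7, min_fluency=0.5, min_edu=3):
    """Yield docs that pass every stage, cheapest checks first.
    fluency_score and edu_score are caller-supplied callables."""
    seen = set()
    for doc in docs:
        norm = " ".join(doc.split())
        if len(norm) < 20:                      # rule filter: too short
            continue
        if persian_ratio(norm) < min_persian:   # language ID
            continue
        digest = hashlib.sha1(norm.encode("utf-8")).hexdigest()
        if digest in seen:                      # exact deduplication
            continue
        seen.add(digest)
        if fluency_score(norm) < min_fluency:   # stand-in for a KenLM score
            continue
        if edu_score(norm) < min_edu:           # stand-in for a 0-5 edu score
            continue
        yield norm

persian_doc = "این یک متن آموزشی فارسی درباره ریاضی است."
docs = [persian_doc, persian_doc + "  ",
        "This is an English document about math.", "کوتاه"]
kept = list(filter_corpus(docs, fluency_score=lambda t: 1.0, edu_score=lambda t: 4))
shallow = list(filter_corpus(docs, fluency_score=lambda t: 1.0, edu_score=lambda t: 1))
```

The last call shows the point of the separate signal: a fluent document is still dropped when the educational score is low.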
4. Math tutor: fixed format is good, but keep the target narrow
For 0.8B, I would not target “general math solver”. I would target:
Persian step-by-step tutor for known school-level problem types.
That is a much more realistic goal.
Good tutor data should include:
| Field | Purpose |
|---|---|
| problem | original Persian problem |
| level | grade / difficulty |
| topic | arithmetic, algebra, geometry, etc. |
| student_attempt | optional wrong solution |
| diagnosis | what is wrong or missing |
| hint | small nudge |
| solution_steps | short Persian steps |
| final_answer | normalized answer |
| checks | how to verify |
| language_policy | answer in Persian; preserve formulas |
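The fields above map naturally onto a small record type with a validity check. A minimal sketch, assuming `student_attempt` and `diagnosis` are the only optional fields and that an attempt without a diagnosis is an error (that rule is my own reading of the table):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TutorExample:
    problem: str
    level: str
    topic: str
    solution_steps: List[str]
    final_answer: str
    hint: str = ""
    checks: str = ""
    student_attempt: Optional[str] = None
    diagnosis: Optional[str] = None
    language_policy: str = "answer in Persian; preserve formulas"

    def validate(self):
        """Return a list of problems; an empty list means the record is usable."""
        errors = []
        if not self.problem.strip():
            errors.append("empty problem")
        if not self.solution_steps:
            errors.append("no solution steps")
        if not self.final_answer.strip():
            errors.append("empty final answer")
        if self.student_attempt and not self.diagnosis:
            errors.append("student_attempt given without diagnosis")
        return errors

ok_example = TutorExample(
    problem="۲ + ۳ چند می‌شود؟", level="grade 1", topic="arithmetic",
    solution_steps=["۲ را با ۳ جمع می‌کنیم."], final_answer="5",
)
bad_example = TutorExample(
    problem="۲ + ۳ چند می‌شود؟", level="grade 1", topic="arithmetic",
    solution_steps=["۲ را با ۳ جمع می‌کنیم."], final_answer="5",
    student_attempt="6",
)
```

Running `validate()` over the whole set before training is a cheap way to catch template bugs in generated data.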
I would evaluate not only the final answer, but also:
- step correctness
- first-error detection
- usefulness of hint
- whether the model over-solves when the student only needs a hint
- language consistency
- formula preservation
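Final-answer scoring in Persian needs digit normalization first, otherwise `۱۲` and `12` count as different answers. A minimal sketch that maps Persian and Arabic-Indic digits to ASCII; it deliberately ignores other conventions such as `/` used as a decimal separator:

```python
# Map Persian (U+06F0..U+06F9) and Arabic-Indic (U+0660..U+0669) digits to ASCII.
DIGIT_MAP = {0x06F0 + i: str(i) for i in range(10)}
DIGIT_MAP.update({0x0660 + i: str(i) for i in range(10)})
DIGIT_MAP[0x066B] = "."   # Arabic decimal separator
DIGIT_MAP[0x066C] = ""    # Arabic thousands separator

def normalize_answer(text):
    """Normalize digits and drop whitespace so equal answers compare equal."""
    return "".join(text.translate(DIGIT_MAP).split())

def final_answer_matches(predicted, gold):
    return normalize_answer(predicted) == normalize_answer(gold)

normalized = normalize_answer("\u06f1\u06f2\u066b\u06f5")  # Persian digits for 12.5
```

Step correctness and hint usefulness still need a rubric or a judge; this only makes the final-answer check fair.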
Useful reference:
- Step-by-Step: Improving Math Reasoning and Tutoring with Process Supervision
My practical recommendation:
| Model size | Math tutor scope |
|---|---|
| 0.8B | narrow, fixed format, school-level, many templates |
| 2B | still controlled, but more robust explanations |
| 4B | better candidate for richer tutoring |
| 9B | stronger quality demo if inference cost is acceptable |
5. Self-reminders help, but they are not enough
A system prompt like “always answer in Persian” helps, but I would not rely on it alone.
Better stack:
- SFT habit: many examples where mixed input still produces Persian output.
- Preference data: chosen answer = Persian answer with allowed spans preserved; rejected answer = English/Chinese drift, broken JSON, translated schema keys, shallow answer, or hallucination.
- DPO/ORPO: use preference tuning if SFT does not suppress drift enough.
- Eval: track wrong-language rate and sentence-level drift.
- Runtime guard: if needed, detect wrong-language output and retry.
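The runtime guard can be a thin wrapper around whatever generation call you already have. A minimal sketch, assuming `generate` and `is_persian` are callables you supply; the reminder text (roughly "answer only in Persian") and the retry count are placeholders:

```python
def generate_with_guard(generate, is_persian, prompt, max_retries=2,
                        reminder="\n\nفقط به فارسی پاسخ بده."):
    """Call generate(prompt); if is_persian(output) is False, retry with a
    Persian-only reminder appended. Returns (output, attempts_used)."""
    output = generate(prompt)
    attempts = 1
    while not is_persian(output) and attempts <= max_retries:
        output = generate(prompt + reminder)
        attempts += 1
    return output, attempts

# Fake model for the demo: drifts to English once, then answers in Persian.
calls = []
def fake_generate(p):
    calls.append(p)
    return "Hello" if len(calls) == 1 else "سلام"

def fake_is_persian(text):
    return any("\u0600" <= c <= "\u06ff" for c in text)

output, attempts = generate_with_guard(fake_generate, fake_is_persian, "سلام؟")
```

I would log every retry, because the retry rate is itself a drift metric worth tracking.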
Useful references:
- Direct Preference Optimization
- Controlling Language Confusion in Multilingual LLMs
- OLA: Learning to respond in the user’s language
6. Tokenizer extension: possible, but probably not first
Tokenizer extension can help if Persian is tokenized badly. But it is not a free improvement.
If you add tokens, you need:
- embedding resize
- sensible initialization
- warm-up / alignment
- continued training
- regression tests
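For "sensible initialization", one common heuristic (my suggestion, not something the Qwen or PersianPhi materials are confirmed to prescribe) is to start each new token's embedding at the mean of the embeddings of the subtokens it used to be split into. A toy sketch with plain lists; in a real Hugging Face pipeline this would follow `tokenizer.add_tokens(...)` and `model.resize_token_embeddings(len(tokenizer))` and operate on tensors:

```python
def mean_init(subtoken_ids, embedding_table):
    """Initialize a new token's vector as the mean of the vectors of the
    subtokens it used to be split into."""
    vectors = [embedding_table[i] for i in subtoken_ids]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

# Toy 2-dimensional table; real tables are (vocab_size, hidden_size) tensors.
table = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [2.0, 2.0]}
new_vector = mean_init([0, 2], table)  # [1.5, 1.0]
```

Even with a good starting point, the new rows still need the warm-up and continued training listed above.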
A good Persian-specific reference is PersianPhi:
- PersianPhi model card
- Persian-Phi paper
PersianPhi is useful because it shows that tokenizer adaptation can be part of a serious Persian curriculum pipeline. But that is also the warning: it is not just “add Persian tokens and run SFT”.
For your project, I would measure first:
| Measurement | Why it matters |
|---|---|
| tokens per word | Persian compression |
| characters per token | context efficiency |
| split rate | whether words are fragmented |
| Persian vs English token cost | multilingual imbalance |
| Latin-name mixed examples | real input behavior |
| formulas / JSON | tool and math safety |
| textbook text | tutor data efficiency |
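The first three measurements can be computed with any tokenizer that exposes a string-to-tokens function. A minimal sketch, assuming `tokenize` is such a callable (for example a Hugging Face tokenizer's `tokenize` method); tokenizing words in isolation only approximates how they split in context, and the toy tokenizer is just for the demo:

```python
def tokenization_stats(texts, tokenize):
    """tokenize: callable mapping a string to a list of tokens.
    Assumes texts contains at least one non-empty string."""
    n_words = n_tokens = n_chars = n_split = 0
    for text in texts:
        n_tokens += len(tokenize(text))
        n_chars += len(text)
        for word in text.split():
            n_words += 1
            if len(tokenize(word)) > 1:   # word was fragmented
                n_split += 1
    return {
        "tokens_per_word": n_tokens / n_words,
        "chars_per_token": n_chars / n_tokens,
        "split_rate": n_split / n_words,
    }

# Toy tokenizer for the demo: chunks of at most 3 characters per word.
def toy_tokenize(text):
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]

stats = tokenization_stats(["abc defgh"], toy_tokenize)
```

Running the same function on parallel Persian and English samples gives the Persian vs English token cost row directly.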
If Qwen3.5 tokenization is acceptable, I would skip tokenizer extension and spend the time on eval, SFT quality, and drift control.
7. Suggested training strategy
I would use this order.
Stage 0 — Compare base models before training
Run the same eval on 0.8B, 2B, and 4B.
Decision:
- if 0.8B passes, keep it
- if 0.8B fails only in language discipline, try SFT + DPO/ORPO
- if 0.8B fails in tutor quality, tool stability, or long-answer drift, test 2B/4B before heavy training
- if 4B works with much less engineering, use 4B
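Written as code, that decision rule looks roughly like this. It is a sketch of my rule above, not a formal policy: the bucket labels in `LANGUAGE_ONLY` are names I made up, and what counts as "failing" a bucket is up to your own threshold.

```python
# Hypothetical bucket labels for failures that are about language discipline only.
LANGUAGE_ONLY = {"wrong_language", "latin_name_mixed", "english_technical_terms"}

def choose_route(failed_buckets_08b, larger_model_passes):
    """failed_buckets_08b: set of eval buckets where 0.8B fell below the bar.
    larger_model_passes: True if 2B or 4B clears the same eval."""
    if not failed_buckets_08b:
        return "keep 0.8B"
    if failed_buckets_08b <= LANGUAGE_ONLY:
        return "0.8B + SFT, then DPO/ORPO if drift persists"
    if larger_model_passes:
        return "move to 2B/4B"
    return "test 2B/4B further before heavy training"

route_clean = choose_route(set(), False)
route_drift = choose_route({"wrong_language"}, False)
route_capacity = choose_route({"tool_json", "math_tutor"}, True)
```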
Stage 1 — SFT for Persian-first behavior
Start with high-quality examples, not a huge weak dataset.
Possible first target:
- 5k–30k strong Persian SFT examples
- Persian answers
- tutor examples
- grammar correction
- uncertainty behavior
- Latin-name handling
- formula handling
- JSON/tool handling
Stage 2 — Preference tuning for drift and bad behavior
Only if SFT is not enough.
Use chosen/rejected pairs for:
- wrong-language drift
- translated JSON keys
- broken formulas
- shallow answers
- overconfident hallucinations
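A preference record for these cases can be very small. This sketch uses the `prompt` / `chosen` / `rejected` layout that preference-tuning libraries such as TRL commonly accept; check the exact format your trainer expects. The `failure_type` field is my own addition for bookkeeping, and the example text is invented:

```python
import json

def make_pair(prompt, chosen, rejected, failure_type):
    """Build one preference record; failure_type tags why rejected is worse."""
    if chosen.strip() == rejected.strip():
        raise ValueError("chosen and rejected must differ")
    return {"prompt": prompt, "chosen": chosen,
            "rejected": rejected, "failure_type": failure_type}

pair = make_pair(
    prompt="قضیه فیثاغورس را توضیح بده.",
    chosen="در مثلث قائم‌الزاویه رابطه $a^2 + b^2 = c^2$ برقرار است.",
    rejected="In a right triangle, $a^2 + b^2 = c^2$ holds.",
    failure_type="wrong_language_drift",
)
line = json.dumps(pair, ensure_ascii=False)  # one JSONL line, Persian kept readable
```

Keeping the formula identical in both answers makes the language of the surrounding explanation the only difference the model can learn from.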
Stage 3 — CPT only if eval says you need it
If Persian fluency or domain knowledge is still weak, then consider small CPT.
But I would not start with huge CPT unless the goal is truly “build a Persian language model”, not “build a Persian assistant demo”.
Stage 4 — Tokenizer extension only if measurement justifies it
Tokenizer extension is a serious intervention. It belongs after measurement, not before.
Short answers to your questions
Will Latin names cause drift?
They can trigger drift, but they are not bad data. Keep them. Train and evaluate the model to preserve names while keeping the surrounding answer Persian.
Can Good/Bad KenLM catch shallow answers?
No, not reliably. KenLM helps with surface fluency and noise. Use an educational-quality classifier or rubric-based judge for depth and usefulness.
Is fixed-format math tutor data a good idea?
Yes. For 0.8B, fixed and narrow is better. Treat it as a guided tutor, not a general math solver.
Can the model learn confidence and self-reminders?
Partly, but self-reminders are not enough. Use SFT for habit, preference tuning for wrong-language rejection, and eval/runtime checks.
Is Qwen-Scope useful?
Yes, conceptually and possibly technically. It supports the idea that code-switching can be analyzed and mitigated. But I would not make it the main solution for a festival pipeline.
Is tokenizer extension worth it?
Maybe, but only after measuring tokenization efficiency. If Qwen3.5 tokenization is acceptable, skip it.
Should you stay on 0.8B?
Maybe. If the project requires strict lightweight deployment, yes. But if 2B or 4B is allowed, test them before investing heavily in 0.8B-specific engineering.
Practical recommendation
My practical recommendation would be:
- Build a 500–700 item Persian stress eval.
- Run Qwen3.5-0.8B, 2B, and 4B before training.
- If 0.8B passes, use it and keep the scope narrow.
- If 0.8B fails mainly because of language drift, try SFT + small DPO/ORPO.
- If 0.8B fails because of capacity, tutor quality, or tool stability, move to 2B or 4B.
- Use KenLM for noise, not for educational value.
- Do not extend the tokenizer unless tokenization measurements clearly justify it.
Main idea:
Do not solve model-size uncertainty with training first. Solve it with a small eval first. Then choose whether 0.8B, 2B, or 4B is the cheapest route in total engineering cost.