
How can I build a high-quality dataset?

Hugging Face Forums [Unofficial] June 20, 2026

Hmm… if you stay on the 0.8B route, the fine-tuning and engineering cost may actually end up higher than just moving to a slightly larger model, maybe 2B–9B. More details below:


I would frame this as a model-size / evaluation / data-quality decision, not only as a “can Qwen3.5-0.8B learn Persian?” decision.

My short answer is:

Qwen3.5-0.8B may be good enough for a narrow Persian-first festival demo, but I would not invest heavily in fine-tuning it before comparing it against Qwen3.5-2B and Qwen3.5-4B on a small Persian stress eval.

The reason is that 0.8B is not just “a bit smaller” than 2B or 4B. For a multilingual assistant, 0.8B may be close to the minimum useful capacity. Once you require Persian fluency, instruction following, math tutoring, tool JSON, Latin-name handling, uncertainty behavior, and low language drift, the small model can become expensive in engineering time.

So I would use this decision rule:

| Route | When it makes sense | Main risk |
|---|---|---|
| Qwen3.5-0.8B | strict lightweight demo, narrow assistant, fast/cheap inference | may need more SFT/DPO/guards/CPT to compensate |
| Qwen3.5-2B | first “extra capacity” candidate if 0.8B feels brittle | still small, but much less squeezed |
| Qwen3.5-4B | likely best total-engineering-cost route if quality matters | higher inference cost |
| Qwen3.5-9B | quality demo / stronger tutor behavior | may no longer feel like a tiny SLM project |

Useful references:

  • Qwen3.5-0.8B model card
  • Qwen3.5 model collection
  • Unsloth Qwen3.5 guide
  • Unsloth Qwen3.5 fine-tuning guide

1. My recommended first step: build the eval before training

Before CPT, tokenizer extension, or a large SFT run, I would build a small Persian stress eval and run the same prompts on:

  • Qwen3.5-0.8B
  • Qwen3.5-2B
  • Qwen3.5-4B
  • optionally Qwen3.5-9B
  • optionally one non-Qwen small baseline

The goal is not to create a perfect academic benchmark. The goal is to answer:

  1. Is 0.8B already acceptable?
  2. Does 2B fix most of the failures?
  3. Does 4B reduce engineering work enough to justify the extra inference cost?
  4. Are the failures mainly Persian fluency, language drift, math quality, JSON breakage, or shallow answers?
  5. Should you spend effort on SFT, DPO/ORPO, CPT, tokenizer extension, or simply a larger base model?

A first eval can be only 500–700 examples.

| Eval bucket | Size | What it tests |
|---|---|---|
| Persian general QA | 100 | basic Persian answer quality |
| PersianMMLU / Khayyam sample | 100 | school knowledge and reasoning |
| Latin-name mixed prompts | 50 | Mohammad-style name handling |
| English technical term prompts | 50 | controlled code-switching |
| Math tutor prompts | 50–100 | step-by-step Persian tutoring |
| Tool JSON prompts | 50 | valid JSON and key preservation |
| Grammar correction | 50 | Persian language tutoring |
| Long-answer drift | 30–50 | whether it switches language halfway |
| Uncertainty prompts | 30–50 | whether it says “I do not know” properly |
| Safety / social norm prompts | 30–50 | basic public assistant behavior |
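The bucket plan above can be written down as a small manifest, which also makes it easy to confirm that the sizes add up to the 500–700 range. This is only a sketch: the bucket keys and the `run_eval` helper are hypothetical names, and the `models` callables stand in for whatever inference stack you actually use.

```python
# Bucket sizes copied from the table above; each value is (min, max) item count.
EVAL_BUCKETS = {
    "persian_general_qa": (100, 100),
    "persianmmlu_khayyam_sample": (100, 100),
    "latin_name_mixed": (50, 50),
    "english_technical_terms": (50, 50),
    "math_tutor": (50, 100),
    "tool_json": (50, 50),
    "grammar_correction": (50, 50),
    "long_answer_drift": (30, 50),
    "uncertainty": (30, 50),
    "safety_social_norms": (30, 50),
}


def total_range(buckets):
    """Return the smallest and largest total eval size the plan allows."""
    low = sum(size[0] for size in buckets.values())
    high = sum(size[1] for size in buckets.values())
    return low, high


def run_eval(prompts_by_bucket, models):
    """Run every prompt through every model and collect the raw outputs.

    `models` maps a label (for example "0.8B") to a callable prompt -> str.
    The callables are hypothetical placeholders, not a real API.
    """
    rows = []
    for bucket, prompts in prompts_by_bucket.items():
        for prompt in prompts:
            for label, generate in models.items():
                rows.append({"bucket": bucket, "model": label,
                             "prompt": prompt, "output": generate(prompt)})
    return rows


print(total_range(EVAL_BUCKETS))  # -> (540, 650)
```

Keeping the raw outputs per bucket and per model is the important part: the same rows can later be scored for language drift, JSON validity, or tutor quality without re-running inference.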

Some useful Persian benchmark starting points:

  • Khayyam Challenge / PersianMMLU
  • Open Persian LLM Leaderboard
  • PARSE: Persian open-domain reasoning QA
  • ELAB: Persian alignment benchmark
  • PersLitEval


2. About Latin names like Mohammad

I would not remove Latin names. They are not bad data by themselves.

The real target is:

Preserve Latin names, formulas, URLs, code, and JSON when needed, but keep the surrounding answer Persian.

For example:

| Input pattern | Bad behavior | Desired behavior |
|---|---|---|
| Persian + Latin name | model switches to English | keep the name, answer in Persian |
| Persian + English term | entire answer becomes English | preserve term, explain in Persian |
| Persian + formula | formula gets mistranslated | preserve formula, explain in Persian |
| Persian + JSON schema | keys get translated or JSON breaks | keep valid JSON, explain in Persian outside JSON |
| Persian + URL/citation | answer drifts into English | keep URL/citation, answer in Persian |

So instead of deleting mixed examples, I would create controlled mixed examples.

Example behavior policy:

“Answer in Persian. Preserve names, formulas, code, JSON keys, URLs, and citations exactly when needed. Do not switch the surrounding explanation into English or Chinese.”
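One way to turn this policy into an eval check is to verify that protected spans survive verbatim and that JSON keys are not translated. The sketch below is an assumption-heavy illustration: the `PROTECTED` regex, `latin_spans`, `missing_spans`, and `json_keys_preserved` are hypothetical helpers, and the regex is deliberately crude (for example, it will keep trailing punctuation attached to a URL).

```python
import json
import re

# Hypothetical pattern for candidate protected spans: URLs and Latin-script words.
PROTECTED = re.compile(r"https?://\S+|[A-Za-z][A-Za-z0-9_.\-]*")


def latin_spans(text):
    """List URL and Latin-script spans found in the text, in order of appearance."""
    return PROTECTED.findall(text)


def missing_spans(required_spans, answer):
    """Return the required spans that do not appear verbatim in the answer."""
    return [span for span in required_spans if span not in answer]


def json_keys_preserved(expected_keys, answer_json):
    """True if the answer parses as a JSON object with exactly the expected keys."""
    try:
        obj = json.loads(answer_json)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and set(obj) == set(expected_keys)


print(missing_spans(["Mohammad"], "نام محمد رایج است"))  # -> ['Mohammad']
```

Not every Latin word in a prompt has to reappear in the answer, so `latin_spans` is better used to propose spans for a human to mark as required than as an automatic pass/fail rule.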

This is also where Qwen-Scope is conceptually relevant. Qwen-Scope uses sparse autoencoders to analyze and steer Qwen-family model internals, and it discusses development uses around behaviors such as code-switching and repetition.

Useful references:

  • Qwen-Scope
  • SASFT: SAE-guided supervised fine-tuning for unexpected code-switching
  • Controlling Language Confusion in Multilingual LLMs
  • OLA: Learning to respond in the user’s language

But I would not depend on Qwen-Scope as the main practical solution for a festival pipeline. For you, the simpler stack is probably:

  1. SFT examples showing correct Persian behavior around Latin spans
  2. rejected examples where the answer drifts into English/Chinese
  3. small DPO/ORPO if drift persists
  4. sentence-level language checks in eval
  5. optional runtime retry if the output language is wrong
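The sentence-level language check in step 4 can start as a simple script-ratio heuristic. This is a minimal sketch, not a real language identifier: it treats any character in the Arabic Unicode block as Persian (so it cannot tell Persian from Arabic or Urdu), the 0.5 threshold is an arbitrary assumption, and the function names are hypothetical.

```python
import re

ARABIC_BLOCK = re.compile(r"[\u0600-\u06FF]")   # letters, digits and punctuation of the block
LATIN = re.compile(r"[A-Za-z]")
CJK = re.compile(r"[\u4e00-\u9fff]")
SENTENCE_END = re.compile(r"[.!?؟。]+\s*|\n+")


def persian_ratio(sentence):
    """Share of counted characters (Arabic block, Latin, CJK) that are in the Arabic block."""
    fa = len(ARABIC_BLOCK.findall(sentence))
    other = len(LATIN.findall(sentence)) + len(CJK.findall(sentence))
    return fa / (fa + other) if fa + other else 1.0


def drifted_sentences(answer, threshold=0.5):
    """Return the sentences whose Persian ratio falls below the threshold."""
    sentences = [s.strip() for s in SENTENCE_END.split(answer) if s.strip()]
    return [s for s in sentences if persian_ratio(s) < threshold]


answer = "این جواب فارسی است. Mohammad is a common name. نام Mohammad رایج است."
print(drifted_sentences(answer))  # -> ['Mohammad is a common name']
```

Note that the third sentence passes even though it contains a Latin name, which is the behavior the policy asks for. URLs, code, and JSON should be masked before this check, because their dots would otherwise be treated as sentence boundaries.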

3. KenLM is useful, but not enough for “high-quality” answers

Your concern about shallow but fluent answers is correct.

A KenLM-style Good/Bad filter can help with:

  • noisy text
  • strange character distribution
  • broken Persian
  • low-fluency text
  • bad OCR-ish text
  • obvious junk

But it will not reliably detect:

  • shallow explanations
  • generic answers
  • low educational value
  • missing reasoning steps
  • factual weakness
  • confident but incomplete answers

For this, I would copy the idea behind FineWeb-Edu: use a separate educational-quality signal, not only a fluency signal.

Useful references:

  • FineWeb-Edu dataset
  • FineWeb-Edu classifier
  • FineWeb paper
  • FineWeb blog post

A practical Persian filtering stack could look like:

  1. language ID
  2. exact / near deduplication
  3. rule filters for broken text
  4. KenLM / perplexity filtering
  5. educational-quality classifier or LLM judge
  6. small human audit
  7. only then use the text for CPT or raw-text-to-SFT generation
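The first five steps of this stack can be sketched as a single pass over the documents. Everything here is a simplified stand-in: `looks_persian` is a script-ratio heuristic rather than a real language ID model, only exact deduplication is shown, and `fluency_ok` / `quality_ok` are hypothetical hooks where a KenLM perplexity threshold and an educational-quality classifier or LLM judge would plug in.

```python
import hashlib
import re

ARABIC_BLOCK = re.compile(r"[\u0600-\u06FF]")


def looks_persian(text, min_ratio=0.6):
    """Crude language ID: share of non-space characters in the Arabic block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    return sum(bool(ARABIC_BLOCK.match(c)) for c in chars) / len(chars) >= min_ratio


def passes_rules(text, min_chars=40):
    """Rule filter: drop very short text and runs of 10+ identical characters."""
    return len(text) >= min_chars and not re.search(r"(.)\1{9,}", text)


def filter_corpus(docs, fluency_ok=lambda t: True, quality_ok=lambda t: True):
    """Apply the filters in the order listed above and return the kept documents."""
    seen, kept = set(), []
    for doc in docs:
        if not looks_persian(doc):                      # 1. language ID
            continue
        key = hashlib.sha256(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if key in seen:                                 # 2. exact dedup (near-dedup not shown)
            continue
        seen.add(key)
        if not passes_rules(doc):                       # 3. rule filters
            continue
        if not fluency_ok(doc):                         # 4. KenLM / perplexity hook
            continue
        if not quality_ok(doc):                         # 5. educational-quality hook
            continue
        kept.append(doc)
    return kept
```

The human audit in step 6 is intentionally not automated: a sample of what `filter_corpus` keeps and drops at each stage is what tells you whether the thresholds are sensible.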


4. Math tutor: fixed format is good, but keep the target narrow

For 0.8B, I would not target “general math solver”. I would target:

Persian step-by-step tutor for known school-level problem types.

That is a much more realistic goal.

Good tutor data should include:

| Field | Purpose |
|---|---|
| problem | original Persian problem |
| level | grade / difficulty |
| topic | arithmetic, algebra, geometry, etc. |
| student_attempt | optional wrong solution |
| diagnosis | what is wrong or missing |
| hint | small nudge |
| solution_steps | short Persian steps |
| final_answer | normalized answer |
| checks | how to verify |
| language_policy | answer in Persian; preserve formulas |
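As an illustration, the fields above map naturally onto a small record type with a few sanity checks. The `TutorExample` class, the `validate` helper, and the sample problem are all hypothetical; the field names are taken from the table.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TutorExample:
    problem: str
    level: str
    topic: str
    diagnosis: str
    hint: str
    solution_steps: list
    final_answer: str
    checks: str
    student_attempt: str = ""  # optional wrong solution
    language_policy: str = "answer in Persian; preserve formulas"


def validate(example):
    """Return a list of problems found in the record (empty list means it looks usable)."""
    problems = []
    if not example.solution_steps:
        problems.append("no solution steps")
    if not example.final_answer.strip():
        problems.append("empty final answer")
    if example.student_attempt and not example.diagnosis.strip():
        problems.append("student attempt without diagnosis")
    return problems


example = TutorExample(
    problem="معادله 2x + 3 = 11 را حل کنید.",
    level="پایه هفتم",
    topic="algebra",
    student_attempt="2x = 14",
    diagnosis="عدد 3 باید از 11 کم شود، نه اینکه به آن اضافه شود.",
    hint="هر دو طرف معادله را منهای 3 کنید.",
    solution_steps=["2x = 11 - 3", "2x = 8", "x = 4"],
    final_answer="x = 4",
    checks="2(4) + 3 = 11",
)
print(validate(example))  # -> []
line = json.dumps(asdict(example), ensure_ascii=False)  # one JSONL line
```

Writing with `ensure_ascii=False` keeps the Persian text readable in the JSONL file, which makes the human audit much easier.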

I would evaluate not only the final answer, but also:

  • step correctness
  • first-error detection
  • usefulness of hint
  • whether the model over-solves when the student only needs a hint
  • language consistency
  • formula preservation
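Final-answer scoring is the easiest of these checks to automate, but Persian and ASCII digits need to be normalized first or correct answers will be marked wrong. A minimal sketch, assuming simple string-level answers; `normalize_answer` and `same_answer` are hypothetical helpers, and step correctness or hint quality would still need a rubric or a judge.

```python
# Build digit tables from code points to avoid look-alike typing mistakes.
PERSIAN_DIGITS = "".join(chr(0x06F0 + i) for i in range(10))  # Extended Arabic-Indic digits
ARABIC_DIGITS = "".join(chr(0x0660 + i) for i in range(10))   # Arabic-Indic digits
TO_ASCII = str.maketrans(
    PERSIAN_DIGITS + ARABIC_DIGITS + "\u066b",  # U+066B is the Arabic decimal separator
    "0123456789" * 2 + ".",
)


def normalize_answer(text):
    """Map Persian/Arabic digits to ASCII and remove all whitespace."""
    return "".join(text.translate(TO_ASCII).split())


def same_answer(predicted, reference):
    """Compare two final answers after normalization."""
    return normalize_answer(predicted) == normalize_answer(reference)


print(normalize_answer("x = \u06f4"))  # -> x=4
```

This only catches formatting differences. Equivalent forms such as `1/2` and `0.5` would still need a proper math comparison.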

Useful reference:

  • Step-by-Step: Improving Math Reasoning and Tutoring with Process Supervision

My practical recommendation:

| Model size | Math tutor scope |
|---|---|
| 0.8B | narrow, fixed format, school-level, many templates |
| 2B | still controlled, but more robust explanations |
| 4B | better candidate for richer tutoring |
| 9B | stronger quality demo if inference cost is acceptable |

5. Self-reminders help, but they are not enough

A system prompt like “always answer in Persian” helps, but I would not rely on it alone.

Better stack:

  1. SFT habit: many examples where mixed input still produces Persian output.

  2. Preference data: chosen answer = Persian answer with allowed spans preserved; rejected answer = English/Chinese drift, broken JSON, translated schema keys, shallow answer, or hallucination.

  3. DPO/ORPO: use preference tuning if SFT does not suppress drift enough.

  4. Eval: track wrong-language rate and sentence-level drift.

  5. Runtime guard: if needed, detect wrong-language output and retry.
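The runtime guard in the last step could look roughly like this. It is a sketch under stated assumptions: `generate` is a hypothetical callable (prompt -> str) wrapping your model, the script-ratio test is a crude stand-in for a real language detector, and the 0.6 threshold, retry count, and reminder text are arbitrary choices.

```python
import re

ARABIC_BLOCK = re.compile(r"[\u0600-\u06FF]")
COUNTED = re.compile(r"[A-Za-z\u0600-\u06FF\u4e00-\u9fff]")
REMINDER = "لطفاً فقط به فارسی پاسخ بده."  # "Please answer only in Persian."


def is_persian_enough(text, threshold=0.6):
    """True if enough of the counted characters are in the Arabic block."""
    counted = COUNTED.findall(text)
    if not counted:
        return False
    persian = sum(1 for c in counted if ARABIC_BLOCK.match(c))
    return persian / len(counted) >= threshold


def generate_with_guard(generate, prompt, max_retries=2):
    """Call the model, and retry with a Persian reminder if the output drifts.

    If every attempt drifts, the last answer is returned unchanged.
    """
    answer = generate(prompt)
    for _ in range(max_retries):
        if is_persian_enough(answer):
            break
        answer = generate(prompt + "\n\n" + REMINDER)
    return answer
```

A guard like this hides drift from users but does not fix the model, so the retry rate itself is worth logging as an eval metric.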

Useful references:

  • Direct Preference Optimization
  • Controlling Language Confusion in Multilingual LLMs
  • OLA: Learning to respond in the user’s language


6. Tokenizer extension: possible, but probably not first

Tokenizer extension can help if Persian is tokenized badly. But it is not a free improvement.

If you add tokens, you need:

  • embedding resize
  • sensible initialization
  • warm-up / alignment
  • continued training
  • regression tests
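To make "sensible initialization" concrete: one commonly used option, assumed here and not taken from the PersianPhi work, is to start each new token's embedding at the mean of the embeddings of the old sub-tokens it used to split into. The sketch uses plain Python lists and hypothetical helpers (`mean_init`, `extend_embeddings`, `old_tokenize`); in Hugging Face Transformers the resize itself is typically done with `model.resize_token_embeddings(len(tokenizer))`.

```python
def mean_init(embeddings, old_token_ids):
    """Average the rows of the old sub-tokens to seed a new token's embedding."""
    rows = [embeddings[i] for i in old_token_ids]
    dim = len(rows[0])
    return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]


def extend_embeddings(embeddings, new_tokens, old_tokenize):
    """Return a copy of the embedding table with one appended row per new token.

    `old_tokenize` is a hypothetical callable: token string -> list of old token ids.
    """
    extended = [row[:] for row in embeddings]
    new_ids = {}
    for token in new_tokens:
        new_ids[token] = len(extended)
        extended.append(mean_init(embeddings, old_tokenize(token)))
    return extended, new_ids


# Toy example: a 3-token vocabulary with 2-dimensional embeddings.
table = [[0.0, 2.0], [2.0, 4.0], [10.0, 10.0]]
extended, new_ids = extend_embeddings(table, ["کتاب"], lambda token: [0, 1])
print(extended[new_ids["کتاب"]])  # -> [1.0, 3.0]
```

Even with a reasonable starting point, the new rows still need the warm-up, continued training, and regression tests listed above before the model can be trusted.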

A good Persian-specific reference is PersianPhi:

  • PersianPhi model card
  • Persian-Phi paper

PersianPhi is useful because it shows that tokenizer adaptation can be part of a serious Persian curriculum pipeline. But that is also the warning: it is not just “add Persian tokens and run SFT”.

For your project, I would measure first:

| Measurement | Why it matters |
|---|---|
| tokens per word | Persian compression |
| characters per token | context efficiency |
| split rate | whether words are fragmented |
| Persian vs English token cost | multilingual imbalance |
| Latin-name mixed examples | real input behavior |
| formulas / JSON | tool and math safety |
| textbook text | tutor data efficiency |
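The first three measurements can be computed with a few lines once you have a tokenizer. In this sketch `tokenize` is a hypothetical callable (string -> list of tokens); with a Hugging Face tokenizer you would pass its `tokenize` method. Tokenizing word by word ignores how BPE tokenizers treat leading spaces, so treat the numbers as an approximation.

```python
def tokenization_stats(texts, tokenize):
    """Compute tokens per word, characters per token, and the share of split words."""
    words = tokens = chars = split_words = 0
    for text in texts:
        for word in text.split():
            n = len(tokenize(word))
            words += 1
            tokens += n
            chars += len(word)
            split_words += n > 1   # a word is "split" if it needs more than one token
    if words == 0 or tokens == 0:
        raise ValueError("no words or tokens to measure")
    return {
        "tokens_per_word": tokens / words,
        "chars_per_token": chars / tokens,
        "split_rate": split_words / words,
    }


def toy_tokenize(word):
    """Stand-in tokenizer for the example: fixed chunks of three characters."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]


stats = tokenization_stats(["abcdef gh"], toy_tokenize)
print(stats["tokens_per_word"], stats["split_rate"])  # -> 1.5 0.5
```

Running the same function on parallel Persian and English text gives the "Persian vs English token cost" row, and running it on Latin-name, formula/JSON, and textbook samples covers the rest of the table.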

If Qwen3.5 tokenization is acceptable, I would skip tokenizer extension and spend the time on eval, SFT quality, and drift control.

7. Suggested training strategy

I would use this order.

Stage 0 — Compare base models before training

Run the same eval on 0.8B, 2B, and 4B.

Decision:

  • if 0.8B passes, keep it
  • if 0.8B fails only in language discipline, try SFT + DPO/ORPO
  • if 0.8B fails in tutor quality, tool stability, or long-answer drift, test 2B/4B before heavy training
  • if 4B works with much less engineering, use 4B
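The first three decision bullets can be expressed as a small rule over per-bucket pass rates for the 0.8B model. This is only an illustration: the bucket names, the grouping into "language discipline" buckets, and the 0.8 threshold are assumptions, and the last bullet (whether 4B saves enough engineering) stays a human judgment.

```python
# Hypothetical grouping: buckets that measure language discipline only.
LANGUAGE_DISCIPLINE = {"latin_name_mixed", "english_technical_terms"}


def failing_buckets(pass_rates, threshold=0.8):
    """Return the set of buckets whose pass rate is below the threshold."""
    return {bucket for bucket, rate in pass_rates.items() if rate < threshold}


def next_step(failed):
    """Map the 0.8B model's failed buckets to the next action from the list above."""
    failed = set(failed)
    if not failed:
        return "keep 0.8B and keep the scope narrow"
    if failed <= LANGUAGE_DISCIPLINE:
        return "try SFT + DPO/ORPO on 0.8B"
    # Tutor quality, tool stability, long-answer drift, or anything else.
    return "test 2B/4B before heavy training"


rates = {"latin_name_mixed": 0.6, "tool_json": 0.9, "math_tutor": 0.85}
print(next_step(failing_buckets(rates)))  # -> try SFT + DPO/ORPO on 0.8B
```

Writing the rule down before looking at results helps avoid rationalizing whichever model you already prefer.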

Stage 1 — SFT for Persian-first behavior

Start with high-quality examples, not a huge weak dataset.

Possible first target:

  • 5k–30k strong Persian SFT examples
  • Persian answers
  • tutor examples
  • grammar correction
  • uncertainty behavior
  • Latin-name handling
  • formula handling
  • JSON/tool handling

Stage 2 — Preference tuning for drift and bad behavior

Only if SFT is not enough.

Use chosen/rejected pairs for:

  • wrong-language drift
  • translated JSON keys
  • broken formulas
  • shallow answers
  • overconfident hallucinations
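A preference pair for these failure types can be stored as one JSON object per line. The `prompt` / `chosen` / `rejected` field names follow a common convention for preference-tuning datasets; the extra `failure` tag, the `FAILURE_TYPES` labels, and the helper names are assumptions added here so that pairs can be counted per failure type.

```python
import json

# Labels derived from the list above (hypothetical naming).
FAILURE_TYPES = {
    "wrong_language_drift",
    "translated_json_keys",
    "broken_formula",
    "shallow_answer",
    "overconfident_hallucination",
}


def make_pair(prompt, chosen, rejected, failure):
    """Build one preference record and reject obviously unusable pairs."""
    if failure not in FAILURE_TYPES:
        raise ValueError(f"unknown failure type: {failure}")
    if chosen.strip() == rejected.strip():
        raise ValueError("chosen and rejected answers are identical")
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected, "failure": failure}


def to_jsonl(pairs):
    """Serialize pairs as JSON Lines, keeping Persian text readable."""
    return "\n".join(json.dumps(pair, ensure_ascii=False) for pair in pairs)


pair = make_pair(
    prompt="نام Mohammad به چه معناست؟",
    chosen="نام Mohammad به معنای «ستوده» است.",
    rejected='Mohammad means "praised".',
    failure="wrong_language_drift",
)
print(to_jsonl([pair]))
```

The rejected answer should fail in exactly one way where possible. Pairs that differ in language, depth, and correctness at once give the model a much noisier signal about what is being penalized.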

Stage 3 — CPT only if eval says you need it

If Persian fluency or domain knowledge is still weak, then consider small CPT.

But I would not start with huge CPT unless the goal is truly “build a Persian language model”, not “build a Persian assistant demo”.

Stage 4 — Tokenizer extension only if measurement justifies it

Tokenizer extension is a serious intervention. It belongs after measurement, not before.

Short answers to your questions

Will Latin names cause drift?

They can trigger drift, but they are not bad data. Keep them. Train and evaluate the model to preserve names while keeping the surrounding answer Persian.

Can Good/Bad KenLM catch shallow answers?

No, not reliably. KenLM helps with surface fluency and noise. Use an educational-quality classifier or rubric-based judge for depth and usefulness.

Is fixed-format math tutor data a good idea?

Yes. For 0.8B, fixed and narrow is better. Treat it as a guided tutor, not a general math solver.

Can the model learn confidence and self-reminders?

Partly, but self-reminders are not enough. Use SFT for habit, preference tuning for wrong-language rejection, and eval/runtime checks.

Is Qwen-Scope useful?

Yes, conceptually and possibly technically. It supports the idea that code-switching can be analyzed and mitigated. But I would not make it the main solution for a festival pipeline.

Is tokenizer extension worth it?

Maybe, but only after measuring tokenization efficiency. If Qwen3.5 tokenization is acceptable, skip it.

Should you stay on 0.8B?

Maybe. If the project requires strict lightweight deployment, yes. But if 2B or 4B is allowed, test them before investing heavily in 0.8B-specific engineering.

Practical recommendation

My practical recommendation would be:

  1. Build a 500–700 item Persian stress eval.
  2. Run Qwen3.5-0.8B, 2B, and 4B before training.
  3. If 0.8B passes, use it and keep the scope narrow.
  4. If 0.8B fails mainly because of language drift, try SFT + small DPO/ORPO.
  5. If 0.8B fails because of capacity, tutor quality, or tool stability, move to 2B or 4B.
  6. Use KenLM for noise, not for educational value.
  7. Do not extend the tokenizer unless tokenization measurements clearly justify it.

Main idea:

Do not solve model-size uncertainty with training first. Solve it with a small eval first. Then choose whether 0.8B, 2B, or 4B is the cheapest route in total engineering cost.
