How can i build a High Quality dataset?
Hmm… CPT is expensive, so it may be better to first isolate the failure mode and decide whether CPT is actually necessary. There may be a route where SFT alone improves this quite a bit, but it depends on the cause:
I would not diagnose this example as “needs CPT” yet.
The output you showed is not perfect, but it already shows several good signs:
- the model answers in Persian,
- it understands the basic concept of AI,
- it does not obviously drift into English/Chinese,
- it can produce a structured explanation,
- it is trying to be helpful.
So I would read this as a diagnostic fork , not as immediate proof that Persian CPT is required.
The important question is:
Is this a CPT-type failure, or is it an answer-style / capacity / evaluation problem?
CPT is useful when the model lacks underlying language/domain grounding. SFT is useful when the model can understand the task but needs to learn how you want it to answer. DPO/ORPO is useful when the model can produce both good and bad styles, but you need to prefer one style over another. A larger model is useful when 0.8B can do each part separately, but cannot reliably combine them.
For your example, my current guess is:
Mostly answer-style / alignment issue, possibly 0.8B capacity pressure, not clearly Persian CPT failure yet.
That is good news, because CPT is one of the more expensive routes.
Direct answer
For this specific output, I would try this order:
- Do not start with CPT yet.
- Build a small diagnostic eval.
- Test the same prompts on Qwen3.5-0.8B, Qwen3.5-2B, and Qwen3.5-4B.
- If explicit answer-style prompts fix the issue, use SFT.
- If 0.8B fails but 2B/4B works, it is probably capacity pressure.
- If all sizes show weak Persian terminology, weak Persian naturalness, or broad knowledge gaps, then consider CPT/RAG/corpus work.
The short version:
| Diagnosis | Example symptom | First route |
|---|---|---|
| A. Answer-style failure | Persian is OK, knowledge is OK, but answer is too long / unfocused / oddly polite | SFT, maybe DPO/ORPO |
| B. Capacity pressure | 0.8B cannot obey style + Persian + reasoning + length constraints together | Try 2B/4B or narrow the task |
| C. Persian/domain grounding failure | Persian wording, terminology, or knowledge is broadly weak | CPT, RAG, curated Persian corpus |
Your example currently looks more like A + maybe B , not clearly C.
Why I would not jump to CPT from this example
The answer has problems, but not necessarily CPT-type problems.
The model’s answer is too broad for the prompt. It starts with a greeting, says “good question”, gives a long multi-section explanation, and adds some strange or shallow subpoints. But the model does know what AI is.
The user asked:
سلام، هوش مصنوعی چیست
A better answer could simply be:
هوش مصنوعی یعنی توانایی یک سیستم کامپیوتری برای انجام کارهایی که معمولاً به هوش انسانی نیاز دارند، مثل یادگیری، فهم متن، تشخیص تصویر یا تصمیمگیری. به زبان ساده، هوش مصنوعی از دادهها یاد میگیرد و بر اساس آن پاسخ یا تصمیم تولید میکند.
That is not mainly “more pretraining knowledge”. That is answer style : concise definition, no unnecessary greeting, no over-expanded lesson, no shallow filler.
Also, Qwen3.5-0.8B is already the post-trained model, not the raw base model:
- Qwen3.5-0.8B
- Qwen3.5-0.8B-Base
So I would not say “it fails because it is not instruction-tuned.” I would say:
It is a generic post-trained model, but not yet adapted to your Persian educational assistant style.
That is different.
Decision tree: is this really a CPT problem? (click for more details)
Test A/B/C with a tiny prompt family
Before training, I would run the same prompt family on 0.8B, 2B, and 4B.
Use temperature 0.
هوش مصنوعی چیست؟ در یک جمله جواب بده.
هوش مصنوعی چیست؟ برای دانشآموز کلاس هفتم، کوتاه و ساده توضیح بده.
هوش مصنوعی چیست؟ بدون مقدمه و در دو جمله توضیح بده.
هوش مصنوعی چیست؟ با یک مثال ساده توضیح بده، اما بیشتر از ۴ جمله ننویس.
هوش مصنوعی چیست؟ اول یک تعریف کوتاه بده، بعد یک مثال، بعد یک جمله درباره کاربرد آن.
Then interpret the results like this:
| Observation | Likely diagnosis | Route |
|---|---|---|
| 0.8B follows the explicit format well | answer-style problem | SFT likely enough |
| 0.8B still rambles, 2B/4B works | capacity pressure | larger small model may be cheaper |
| all sizes have weak Persian terminology | grounding problem | corpus/RAG/CPT |
| all sizes know facts but answer badly | style problem | SFT/DPO |
| all sizes give shallow generic answers | data-quality problem | better answer-quality rubric |
| all sizes fail the concept | domain knowledge issue | RAG or domain data |
This is cheap and gives much better evidence than guessing.
Why 0.8B may behave this way even when it is already post-trained (click for more details)
What SFT should teach here
For your current issue, I would build an answer-style SFT set.
This is different from raw Persian text.
Raw text teaches language distribution. Answer-style SFT teaches response behavior.
You want the model to learn:
- answer directly,
- avoid unnecessary greeting,
- match the requested length,
- choose the right depth,
- explain simply when asked,
- use examples only when useful,
- preserve Persian around English technical terms,
- avoid long generic essay mode unless requested,
- say “I don’t know” when appropriate.
Possible categories:
| Prompt type | Desired behavior |
|---|---|
| “What is X?” | short definition |
| “Explain X simply” | beginner explanation |
| “Explain X to a student” | educational but concise |
| “Give an example” | definition + one example |
| “Compare X and Y” | small table or clear contrast |
| “I do not understand X” | tutor-style explanation |
| “Answer in one sentence” | exactly one sentence |
| “No introduction” | no greeting, no filler |
| “Explain in 3 bullets” | obey structure |
| “I may be wrong, but…” | correct gently |
Example SFT pairs:
User:
هوش مصنوعی چیست؟ در دو جمله جواب بده.
Assistant:
هوش مصنوعی یعنی توانایی یک سیستم کامپیوتری برای انجام کارهایی که معمولاً به هوش انسانی نیاز دارند، مثل یادگیری، فهم متن، تشخیص تصویر یا تصمیمگیری. به زبان ساده، هوش مصنوعی از دادهها یاد میگیرد و بر اساس آن پاسخ یا تصمیم تولید میکند.
User:
هوش مصنوعی چیست؟ برای دانشآموز کلاس هفتم توضیح بده.
Assistant:
هوش مصنوعی یعنی برنامهای که میتواند از دادهها یاد بگیرد و بعضی کارها را شبیه انسان انجام دهد. مثلاً وقتی یک برنامه با دیدن عکسهای زیاد یاد میگیرد گربه را از سگ تشخیص دهد، از هوش مصنوعی استفاده کرده است.
User:
هوش مصنوعی چیست؟ فقط تعریف کوتاه بده.
Assistant:
هوش مصنوعی توانایی یک سیستم کامپیوتری برای یادگیری، تصمیمگیری یا انجام کارهایی است که معمولاً به هوش انسانی نیاز دارند.
User:
هوش مصنوعی چیست؟ بدون مقدمه جواب بده.
Assistant:
هوش مصنوعی فناوریای است که به کامپیوترها کمک میکند از دادهها یاد بگیرند و کارهایی مانند تشخیص تصویر، فهم متن، پیشبینی یا تصمیمگیری را انجام دهند.
This is the type of data that directly targets your observed failure.
Chosen/rejected examples for DPO or ORPO (click for more details)
What data to use
I would separate the data by role.
Do not put all Persian data into one bucket.
| Data role | Purpose | Example sources |
|---|---|---|
| diagnostic eval | decide whether CPT is needed | Khayyam/PersianMMLU, PARSE, custom prompts |
| answer-style SFT | teach concise Persian assistant behavior | curated definitions, FarsInstruct-style examples |
| QA behavior | teach grounded question answering | PQuAD, PerCQA, PCoQA |
| math/tutor behavior | teach step-by-step tutoring | custom textbook-aligned data |
| raw corpus / CPT | improve Persian grounding | Matina or other filtered Persian corpus |
| preference data | prefer concise/focused answers | chosen/rejected pairs |
| RAG documents | provide domain facts without retraining | trusted educational sources |
Useful Persian resources:
| Resource | Use |
|---|---|
| FarsInstruct | Persian instruction-following data and task templates |
| FarsInstruct GitHub | inspect task/template design |
| Khayyam Challenge / PersianMMLU | Persian school/exam-style evaluation |
| PQuAD | Persian reading comprehension QA |
| PerCQA | Persian community QA / answer selection |
| PCoQA | Persian conversational QA |
| PARSE | Persian reasoning QA |
| Matina Persian corpus | possible CPT corpus if grounding is weak |
| Matina on HF | inspect raw corpus candidate |
| PersianPhi | example of a heavier Persian adaptation pipeline |
Important: do not assume these are all training-ready. Inspect license, quality, duplication, synthetic content, and whether the style matches your target.
About TinyStories
Your concern is right.
Translated TinyStories may help with:
- simple sentence structure,
- simple Persian generation,
- child-like narrative fluency,
- basic grammar patterns.
But it is not a good main source for:
- AI explanations,
- math tutoring,
- educational definitions,
- Persian technical terminology,
- robust knowledge coverage,
- assistant answer style.
TinyStories is useful as a controlled simple-language dataset for small-model research, but it is not a broad knowledge or tutor corpus:
- TinyStories paper
PersianPhi uses translated TinyStories as part of a broader curriculum, but not as the whole Persian solution:
- PersianPhi
- Persian-Phi paper
So I would treat translated TinyStories as optional auxiliary data, not the main CPT source.
Possible role of TinyStories (click for more details)
About Wikipedia
Wikipedia is not useless, but it is not enough by itself.
It is useful for:
- broad factual coverage,
- terminology,
- named entities,
- reading comprehension,
- background documents for RAG,
- raw text candidates for corpus work.
But it is weak as direct assistant SFT because Wikipedia does not teach:
- how short the answer should be,
- when not to greet,
- how to tutor a student,
- how to adapt to difficulty level,
- how to answer a simple definition prompt,
- how to admit uncertainty.
So I would not directly train assistant behavior from Wikipedia paragraphs.
Better:
- use Wikipedia as background text,
- generate candidate QA/explanations from it,
- filter with a Persian educational rubric,
- manually inspect samples,
- keep only strong examples.
| Use of Wikipedia | Good? |
|---|---|
| raw background corpus | yes, with filtering |
| RAG documents | yes |
| reading comprehension source | yes |
| terminology source | yes |
| direct assistant answer style | no |
| shallow answer detector | no |
About FineWeb-Edu and shallow text detection
You are right that the FineWeb-Edu classifier itself is not a Persian shallow-text detector.
FineWeb-Edu is still useful as an idea:
Educational quality should be scored separately from fluency.
KenLM/perplexity can detect some noisy text, but it will not reliably detect a fluent-but-shallow explanation.
Useful references:
- FineWeb-Edu
- FineWeb-Edu classifier
- FineWeb-Edu annotations
- FineWeb-C
- FineWeb2 paper
- FineWeb2 dataset
FineWeb-C is especially relevant because it is about educational-quality annotations across many languages. I would copy that idea at small scale for Persian.
Persian educational quality rubric (click for more details)
Minimal diagnosis/eval set
I would make a small eval before training.
Start with maybe 100 examples, not 1,000.
| Bucket | Count | Purpose |
|---|---|---|
| short definition prompts | 20 | answer length / focus |
| beginner explanation prompts | 20 | educational simplicity |
| Persian QA prompts | 20 | general answer behavior |
| math/tutor prompts | 20 | tutor format |
| Latin-name / English-term prompts | 10 | code-switch control |
| uncertainty prompts | 10 | honesty / hallucination control |
Score each model on:
| Metric | What to check |
|---|---|
| Persian naturalness | is the Persian acceptable? |
| answer focus | does it answer the actual prompt? |
| length control | does it obey one-sentence/two-sentence constraints? |
| educational usefulness | is it helpful for the intended user? |
| shallow answer rate | is it generic or empty? |
| language drift | does it switch language unexpectedly? |
| factual correctness | is the content right? |
| format following | does it follow requested structure? |
Run:
- Qwen3.5-0.8B
- Qwen3.5-2B
- Qwen3.5-4B
Optional:
- Qwen3.5-9B
- one non-Qwen small model
Then decide.
Route decision after the eval (click for more details)
If it is A: answer-style / alignment failure
Use SFT first.
Suggested dataset:
- 5k–30k high-quality Persian answer-style examples,
- maybe 1k–5k chosen/rejected pairs later,
- many short prompts,
- many length-control prompts,
- no unnecessary greeting,
- Persian educational tone,
- explicit “short / beginner / detailed” modes.
Examples to include:
| Example type | Why |
|---|---|
| concise definitions | fixes overlong answers |
| no-greeting answers | fixes filler |
| student explanations | fixes educational tone |
| examples with one example only | fixes over-expansion |
| answer in N sentences | fixes length control |
| uncertainty examples | fixes hallucination |
| simple correction examples | fixes tutor behavior |
This route is much cheaper than CPT.
If it is B: 0.8B capacity pressure
Try 2B/4B before heavy training.
The important point is:
0.8B may be small enough that every extra behavior competes for capacity.
Persian fluency, instruction following, educational style, math tutor behavior, tool JSON, and anti-drift behavior are all separate pressures.
If 2B or 4B solves many failures immediately, then 2B/4B may be cheaper overall than:
- CPT,
- tokenizer extension,
- large SFT,
- DPO,
- runtime guardrails,
- repeated debugging.
For a festival/demo project, 0.8B is attractive. But if it requires too much engineering, 2B may be the practical sweet spot.
If it is C: Persian/domain grounding failure
Then CPT may be justified, but I would still be careful.
Before CPT:
- inspect a Persian raw corpus,
- sample and manually review,
- deduplicate,
- check language contamination,
- check educational quality,
- keep eval separate,
- run a small pilot before a large CPT run.
Possible raw corpus reference:
- Matina: A Large-Scale 73B Token Persian Text Corpus
- Matina on Hugging Face
But CPT is not automatically the first fix. If the missing information exists in a small set of trusted documents, RAG may be cheaper and safer.
When RAG may be better than CPT (click for more details)
Practical training strategy
I would use this sequence:
Stage 0: Diagnose
Run 100 diagnostic prompts on 0.8B/2B/4B.
Stage 1: SFT
If A is likely, build answer-style SFT.
Start small:
- 5k high-quality examples if data is very good,
- 10k–30k if you can maintain quality,
- avoid huge weak synthetic datasets.
Stage 2: Preference tuning
If the model still prefers the wrong style:
- chosen = concise, focused Persian answer,
- rejected = overlong, shallow, greeting-heavy, unfocused answer.
Use DPO/ORPO only after you know what failures remain.
Stage 3: Size decision
If 0.8B remains brittle:
- try 2B,
- then 4B,
- compare total engineering cost, not only inference cost.
Stage 4: CPT only if justified
If C is clearly confirmed:
- use curated Persian raw text,
- maybe Matina or educational corpora,
- keep CPT small at first,
- evaluate after CPT and after SFT.
Stage 5: Tokenizer only if measured
Do not extend tokenizer unless tokenization analysis shows Persian is badly fragmented and you can afford embedding adaptation.
Final practical recommendation
For your current output, I would not start with CPT.
I would do:
- diagnostic eval first ,
- answer-style SFT first if the model already knows the concept ,
- 2B/4B comparison if 0.8B remains brittle ,
- DPO/ORPO if the model can produce good answers but often chooses bad style ,
- CPT only if many examples show broad Persian/domain grounding weakness.
So the main idea is:
Try to prove that CPT is necessary before paying for CPT. This example alone does not prove it. It looks more like answer-style alignment plus possible 0.8B capacity pressure.
Discussion in the ATmosphere