External Publication

How can i build a High Quality dataset?

Hugging Face Forums [Unofficial] June 20, 2026

Hmm… CPT is expensive, so it may be better to first isolate the failure mode and decide whether CPT is actually necessary. There may be a route where SFT alone improves this quite a bit, but it depends on the cause:

I would not diagnose this example as “needs CPT” yet.

The output you showed is not perfect, but it already shows several good signs:

the model answers in Persian,
it understands the basic concept of AI,
it does not obviously drift into English/Chinese,
it can produce a structured explanation,
it is trying to be helpful.

So I would read this as a diagnostic fork , not as immediate proof that Persian CPT is required.

The important question is:

Is this a CPT-type failure, or is it an answer-style / capacity / evaluation problem?

CPT is useful when the model lacks underlying language/domain grounding. SFT is useful when the model can understand the task but needs to learn how you want it to answer. DPO/ORPO is useful when the model can produce both good and bad styles, but you need to prefer one style over another. A larger model is useful when 0.8B can do each part separately, but cannot reliably combine them.

For your example, my current guess is:

Mostly answer-style / alignment issue, possibly 0.8B capacity pressure, not clearly Persian CPT failure yet.

That is good news, because CPT is one of the more expensive routes.

Direct answer

For this specific output, I would try this order:

Do not start with CPT yet.
Build a small diagnostic eval.
Test the same prompts on Qwen3.5-0.8B, Qwen3.5-2B, and Qwen3.5-4B.
If explicit answer-style prompts fix the issue, use SFT.
If 0.8B fails but 2B/4B works, it is probably capacity pressure.
If all sizes show weak Persian terminology, weak Persian naturalness, or broad knowledge gaps, then consider CPT/RAG/corpus work.

The short version:

Diagnosis	Example symptom	First route
A. Answer-style failure	Persian is OK, knowledge is OK, but answer is too long / unfocused / oddly polite	SFT, maybe DPO/ORPO
B. Capacity pressure	0.8B cannot obey style + Persian + reasoning + length constraints together	Try 2B/4B or narrow the task
C. Persian/domain grounding failure	Persian wording, terminology, or knowledge is broadly weak	CPT, RAG, curated Persian corpus

Your example currently looks more like A + maybe B , not clearly C.

Why I would not jump to CPT from this example

The answer has problems, but not necessarily CPT-type problems.

The model’s answer is too broad for the prompt. It starts with a greeting, says “good question”, gives a long multi-section explanation, and adds some strange or shallow subpoints. But the model does know what AI is.

The user asked:

سلام، هوش مصنوعی چیست

A better answer could simply be:

هوش مصنوعی یعنی توانایی یک سیستم کامپیوتری برای انجام کارهایی که معمولاً به هوش انسانی نیاز دارند، مثل یادگیری، فهم متن، تشخیص تصویر یا تصمیم‌گیری. به زبان ساده، هوش مصنوعی از داده‌ها یاد می‌گیرد و بر اساس آن پاسخ یا تصمیم تولید می‌کند.

That is not mainly “more pretraining knowledge”. That is answer style : concise definition, no unnecessary greeting, no over-expanded lesson, no shallow filler.

Also, Qwen3.5-0.8B is already the post-trained model, not the raw base model:

Qwen3.5-0.8B
Qwen3.5-0.8B-Base

So I would not say “it fails because it is not instruction-tuned.” I would say:

It is a generic post-trained model, but not yet adapted to your Persian educational assistant style.

That is different.

Decision tree: is this really a CPT problem? (click for more details)

Test A/B/C with a tiny prompt family

Before training, I would run the same prompt family on 0.8B, 2B, and 4B.

Use temperature 0.

هوش مصنوعی چیست؟ در یک جمله جواب بده.



هوش مصنوعی چیست؟ برای دانش‌آموز کلاس هفتم، کوتاه و ساده توضیح بده.



هوش مصنوعی چیست؟ بدون مقدمه و در دو جمله توضیح بده.



هوش مصنوعی چیست؟ با یک مثال ساده توضیح بده، اما بیشتر از ۴ جمله ننویس.



هوش مصنوعی چیست؟ اول یک تعریف کوتاه بده، بعد یک مثال، بعد یک جمله درباره کاربرد آن.

Then interpret the results like this:

Observation	Likely diagnosis	Route
0.8B follows the explicit format well	answer-style problem	SFT likely enough
0.8B still rambles, 2B/4B works	capacity pressure	larger small model may be cheaper
all sizes have weak Persian terminology	grounding problem	corpus/RAG/CPT
all sizes know facts but answer badly	style problem	SFT/DPO
all sizes give shallow generic answers	data-quality problem	better answer-quality rubric
all sizes fail the concept	domain knowledge issue	RAG or domain data

This is cheap and gives much better evidence than guessing.

Why 0.8B may behave this way even when it is already post-trained (click for more details)

What SFT should teach here

For your current issue, I would build an answer-style SFT set.

This is different from raw Persian text.

Raw text teaches language distribution. Answer-style SFT teaches response behavior.

You want the model to learn:

answer directly,
avoid unnecessary greeting,
match the requested length,
choose the right depth,
explain simply when asked,
use examples only when useful,
preserve Persian around English technical terms,
avoid long generic essay mode unless requested,
say “I don’t know” when appropriate.

Possible categories:

Prompt type	Desired behavior
“What is X?”	short definition
“Explain X simply”	beginner explanation
“Explain X to a student”	educational but concise
“Give an example”	definition + one example
“Compare X and Y”	small table or clear contrast
“I do not understand X”	tutor-style explanation
“Answer in one sentence”	exactly one sentence
“No introduction”	no greeting, no filler
“Explain in 3 bullets”	obey structure
“I may be wrong, but…”	correct gently

Example SFT pairs:

User:
هوش مصنوعی چیست؟ در دو جمله جواب بده.

Assistant:
هوش مصنوعی یعنی توانایی یک سیستم کامپیوتری برای انجام کارهایی که معمولاً به هوش انسانی نیاز دارند، مثل یادگیری، فهم متن، تشخیص تصویر یا تصمیم‌گیری. به زبان ساده، هوش مصنوعی از داده‌ها یاد می‌گیرد و بر اساس آن پاسخ یا تصمیم تولید می‌کند.



User:
هوش مصنوعی چیست؟ برای دانش‌آموز کلاس هفتم توضیح بده.

Assistant:
هوش مصنوعی یعنی برنامه‌ای که می‌تواند از داده‌ها یاد بگیرد و بعضی کارها را شبیه انسان انجام دهد. مثلاً وقتی یک برنامه با دیدن عکس‌های زیاد یاد می‌گیرد گربه را از سگ تشخیص دهد، از هوش مصنوعی استفاده کرده است.



User:
هوش مصنوعی چیست؟ فقط تعریف کوتاه بده.

Assistant:
هوش مصنوعی توانایی یک سیستم کامپیوتری برای یادگیری، تصمیم‌گیری یا انجام کارهایی است که معمولاً به هوش انسانی نیاز دارند.



User:
هوش مصنوعی چیست؟ بدون مقدمه جواب بده.

Assistant:
هوش مصنوعی فناوری‌ای است که به کامپیوترها کمک می‌کند از داده‌ها یاد بگیرند و کارهایی مانند تشخیص تصویر، فهم متن، پیش‌بینی یا تصمیم‌گیری را انجام دهند.

This is the type of data that directly targets your observed failure.

Chosen/rejected examples for DPO or ORPO (click for more details)

What data to use

I would separate the data by role.

Do not put all Persian data into one bucket.

Data role	Purpose	Example sources
diagnostic eval	decide whether CPT is needed	Khayyam/PersianMMLU, PARSE, custom prompts
answer-style SFT	teach concise Persian assistant behavior	curated definitions, FarsInstruct-style examples
QA behavior	teach grounded question answering	PQuAD, PerCQA, PCoQA
math/tutor behavior	teach step-by-step tutoring	custom textbook-aligned data
raw corpus / CPT	improve Persian grounding	Matina or other filtered Persian corpus
preference data	prefer concise/focused answers	chosen/rejected pairs
RAG documents	provide domain facts without retraining	trusted educational sources

Useful Persian resources:

Resource	Use
FarsInstruct	Persian instruction-following data and task templates
FarsInstruct GitHub	inspect task/template design
Khayyam Challenge / PersianMMLU	Persian school/exam-style evaluation
PQuAD	Persian reading comprehension QA
PerCQA	Persian community QA / answer selection
PCoQA	Persian conversational QA
PARSE	Persian reasoning QA
Matina Persian corpus	possible CPT corpus if grounding is weak
Matina on HF	inspect raw corpus candidate
PersianPhi	example of a heavier Persian adaptation pipeline

Important: do not assume these are all training-ready. Inspect license, quality, duplication, synthetic content, and whether the style matches your target.

About TinyStories

Your concern is right.

Translated TinyStories may help with:

simple sentence structure,
simple Persian generation,
child-like narrative fluency,
basic grammar patterns.

But it is not a good main source for:

AI explanations,
math tutoring,
educational definitions,
Persian technical terminology,
robust knowledge coverage,
assistant answer style.

TinyStories is useful as a controlled simple-language dataset for small-model research, but it is not a broad knowledge or tutor corpus:

TinyStories paper

PersianPhi uses translated TinyStories as part of a broader curriculum, but not as the whole Persian solution:

PersianPhi
Persian-Phi paper

So I would treat translated TinyStories as optional auxiliary data, not the main CPT source.

Possible role of TinyStories (click for more details)

About Wikipedia

Wikipedia is not useless, but it is not enough by itself.

It is useful for:

broad factual coverage,
terminology,
named entities,
reading comprehension,
background documents for RAG,
raw text candidates for corpus work.

But it is weak as direct assistant SFT because Wikipedia does not teach:

how short the answer should be,
when not to greet,
how to tutor a student,
how to adapt to difficulty level,
how to answer a simple definition prompt,
how to admit uncertainty.

So I would not directly train assistant behavior from Wikipedia paragraphs.

Better:

use Wikipedia as background text,
generate candidate QA/explanations from it,
filter with a Persian educational rubric,
manually inspect samples,
keep only strong examples.

Use of Wikipedia	Good?
raw background corpus	yes, with filtering
RAG documents	yes
reading comprehension source	yes
terminology source	yes
direct assistant answer style	no
shallow answer detector	no

About FineWeb-Edu and shallow text detection

You are right that the FineWeb-Edu classifier itself is not a Persian shallow-text detector.

FineWeb-Edu is still useful as an idea:

Educational quality should be scored separately from fluency.

KenLM/perplexity can detect some noisy text, but it will not reliably detect a fluent-but-shallow explanation.

Useful references:

FineWeb-Edu
FineWeb-Edu classifier
FineWeb-Edu annotations
FineWeb-C
FineWeb2 paper
FineWeb2 dataset

FineWeb-C is especially relevant because it is about educational-quality annotations across many languages. I would copy that idea at small scale for Persian.

Persian educational quality rubric (click for more details)

Minimal diagnosis/eval set

I would make a small eval before training.

Start with maybe 100 examples, not 1,000.

Bucket	Count	Purpose
short definition prompts	20	answer length / focus
beginner explanation prompts	20	educational simplicity
Persian QA prompts	20	general answer behavior
math/tutor prompts	20	tutor format
Latin-name / English-term prompts	10	code-switch control
uncertainty prompts	10	honesty / hallucination control

Score each model on:

Metric	What to check
Persian naturalness	is the Persian acceptable?
answer focus	does it answer the actual prompt?
length control	does it obey one-sentence/two-sentence constraints?
educational usefulness	is it helpful for the intended user?
shallow answer rate	is it generic or empty?
language drift	does it switch language unexpectedly?
factual correctness	is the content right?
format following	does it follow requested structure?

Run:

Qwen3.5-0.8B
Qwen3.5-2B
Qwen3.5-4B

Optional:

Qwen3.5-9B
one non-Qwen small model

Then decide.

Route decision after the eval (click for more details)

If it is A: answer-style / alignment failure

Use SFT first.

Suggested dataset:

5k–30k high-quality Persian answer-style examples,
maybe 1k–5k chosen/rejected pairs later,
many short prompts,
many length-control prompts,
no unnecessary greeting,
Persian educational tone,
explicit “short / beginner / detailed” modes.

Examples to include:

Example type	Why
concise definitions	fixes overlong answers
no-greeting answers	fixes filler
student explanations	fixes educational tone
examples with one example only	fixes over-expansion
answer in N sentences	fixes length control
uncertainty examples	fixes hallucination
simple correction examples	fixes tutor behavior

This route is much cheaper than CPT.

If it is B: 0.8B capacity pressure

Try 2B/4B before heavy training.

The important point is:

0.8B may be small enough that every extra behavior competes for capacity.

Persian fluency, instruction following, educational style, math tutor behavior, tool JSON, and anti-drift behavior are all separate pressures.

If 2B or 4B solves many failures immediately, then 2B/4B may be cheaper overall than:

CPT,
tokenizer extension,
large SFT,
DPO,
runtime guardrails,
repeated debugging.

For a festival/demo project, 0.8B is attractive. But if it requires too much engineering, 2B may be the practical sweet spot.

If it is C: Persian/domain grounding failure

Then CPT may be justified, but I would still be careful.

Before CPT:

inspect a Persian raw corpus,
sample and manually review,
deduplicate,
check language contamination,
check educational quality,
keep eval separate,
run a small pilot before a large CPT run.

Possible raw corpus reference:

Matina: A Large-Scale 73B Token Persian Text Corpus
Matina on Hugging Face

But CPT is not automatically the first fix. If the missing information exists in a small set of trusted documents, RAG may be cheaper and safer.

When RAG may be better than CPT (click for more details)

Practical training strategy

I would use this sequence:

Stage 0: Diagnose

Run 100 diagnostic prompts on 0.8B/2B/4B.

Stage 1: SFT

If A is likely, build answer-style SFT.

Start small:

5k high-quality examples if data is very good,
10k–30k if you can maintain quality,
avoid huge weak synthetic datasets.

Stage 2: Preference tuning

If the model still prefers the wrong style:

chosen = concise, focused Persian answer,
rejected = overlong, shallow, greeting-heavy, unfocused answer.

Use DPO/ORPO only after you know what failures remain.

Stage 3: Size decision

If 0.8B remains brittle:

try 2B,
then 4B,
compare total engineering cost, not only inference cost.

Stage 4: CPT only if justified

If C is clearly confirmed:

use curated Persian raw text,
maybe Matina or educational corpora,
keep CPT small at first,
evaluate after CPT and after SFT.

Stage 5: Tokenizer only if measured

Do not extend tokenizer unless tokenization analysis shows Persian is badly fragmented and you can afford embedding adaptation.

Final practical recommendation

For your current output, I would not start with CPT.

I would do:

diagnostic eval first ,
answer-style SFT first if the model already knows the concept ,
2B/4B comparison if 0.8B remains brittle ,
DPO/ORPO if the model can produce good answers but often chooses bad style ,
CPT only if many examples show broad Persian/domain grounding weakness.

So the main idea is:

Try to prove that CPT is necessary before paying for CPT. This example alone does not prove it. It looks more like answer-style alignment plus possible 0.8B capacity pressure.