External Publication
Visit Post

How can i build a High Quality dataset?

Hugging Face Forums [Unofficial] June 20, 2026
Source

Hmm… CPT is expensive, so it may be better to first isolate the failure mode and decide whether CPT is actually necessary. There may be a route where SFT alone improves this quite a bit, but it depends on the cause:


I would not diagnose this example as “needs CPT” yet.

The output you showed is not perfect, but it already shows several good signs:

  • the model answers in Persian,
  • it understands the basic concept of AI,
  • it does not obviously drift into English/Chinese,
  • it can produce a structured explanation,
  • it is trying to be helpful.

So I would read this as a diagnostic fork , not as immediate proof that Persian CPT is required.

The important question is:

Is this a CPT-type failure, or is it an answer-style / capacity / evaluation problem?

CPT is useful when the model lacks underlying language/domain grounding. SFT is useful when the model can understand the task but needs to learn how you want it to answer. DPO/ORPO is useful when the model can produce both good and bad styles, but you need to prefer one style over another. A larger model is useful when 0.8B can do each part separately, but cannot reliably combine them.

For your example, my current guess is:

Mostly answer-style / alignment issue, possibly 0.8B capacity pressure, not clearly Persian CPT failure yet.

That is good news, because CPT is one of the more expensive routes.

Direct answer

For this specific output, I would try this order:

  1. Do not start with CPT yet.
  2. Build a small diagnostic eval.
  3. Test the same prompts on Qwen3.5-0.8B, Qwen3.5-2B, and Qwen3.5-4B.
  4. If explicit answer-style prompts fix the issue, use SFT.
  5. If 0.8B fails but 2B/4B works, it is probably capacity pressure.
  6. If all sizes show weak Persian terminology, weak Persian naturalness, or broad knowledge gaps, then consider CPT/RAG/corpus work.

The short version:

Diagnosis Example symptom First route
A. Answer-style failure Persian is OK, knowledge is OK, but answer is too long / unfocused / oddly polite SFT, maybe DPO/ORPO
B. Capacity pressure 0.8B cannot obey style + Persian + reasoning + length constraints together Try 2B/4B or narrow the task
C. Persian/domain grounding failure Persian wording, terminology, or knowledge is broadly weak CPT, RAG, curated Persian corpus

Your example currently looks more like A + maybe B , not clearly C.

Why I would not jump to CPT from this example

The answer has problems, but not necessarily CPT-type problems.

The model’s answer is too broad for the prompt. It starts with a greeting, says “good question”, gives a long multi-section explanation, and adds some strange or shallow subpoints. But the model does know what AI is.

The user asked:

سلام، هوش مصنوعی چیست

A better answer could simply be:

هوش مصنوعی یعنی توانایی یک سیستم کامپیوتری برای انجام کارهایی که معمولاً به هوش انسانی نیاز دارند، مثل یادگیری، فهم متن، تشخیص تصویر یا تصمیم‌گیری. به زبان ساده، هوش مصنوعی از داده‌ها یاد می‌گیرد و بر اساس آن پاسخ یا تصمیم تولید می‌کند.

That is not mainly “more pretraining knowledge”. That is answer style : concise definition, no unnecessary greeting, no over-expanded lesson, no shallow filler.

Also, Qwen3.5-0.8B is already the post-trained model, not the raw base model:

  • Qwen3.5-0.8B
  • Qwen3.5-0.8B-Base

So I would not say “it fails because it is not instruction-tuned.” I would say:

It is a generic post-trained model, but not yet adapted to your Persian educational assistant style.

That is different.

Decision tree: is this really a CPT problem? (click for more details)

Test A/B/C with a tiny prompt family

Before training, I would run the same prompt family on 0.8B, 2B, and 4B.

Use temperature 0.

هوش مصنوعی چیست؟ در یک جمله جواب بده.



هوش مصنوعی چیست؟ برای دانش‌آموز کلاس هفتم، کوتاه و ساده توضیح بده.



هوش مصنوعی چیست؟ بدون مقدمه و در دو جمله توضیح بده.



هوش مصنوعی چیست؟ با یک مثال ساده توضیح بده، اما بیشتر از ۴ جمله ننویس.



هوش مصنوعی چیست؟ اول یک تعریف کوتاه بده، بعد یک مثال، بعد یک جمله درباره کاربرد آن.

Then interpret the results like this:

Observation Likely diagnosis Route
0.8B follows the explicit format well answer-style problem SFT likely enough
0.8B still rambles, 2B/4B works capacity pressure larger small model may be cheaper
all sizes have weak Persian terminology grounding problem corpus/RAG/CPT
all sizes know facts but answer badly style problem SFT/DPO
all sizes give shallow generic answers data-quality problem better answer-quality rubric
all sizes fail the concept domain knowledge issue RAG or domain data

This is cheap and gives much better evidence than guessing.

Why 0.8B may behave this way even when it is already post-trained (click for more details)

What SFT should teach here

For your current issue, I would build an answer-style SFT set.

This is different from raw Persian text.

Raw text teaches language distribution. Answer-style SFT teaches response behavior.

You want the model to learn:

  • answer directly,
  • avoid unnecessary greeting,
  • match the requested length,
  • choose the right depth,
  • explain simply when asked,
  • use examples only when useful,
  • preserve Persian around English technical terms,
  • avoid long generic essay mode unless requested,
  • say “I don’t know” when appropriate.

Possible categories:

Prompt type Desired behavior
“What is X?” short definition
“Explain X simply” beginner explanation
“Explain X to a student” educational but concise
“Give an example” definition + one example
“Compare X and Y” small table or clear contrast
“I do not understand X” tutor-style explanation
“Answer in one sentence” exactly one sentence
“No introduction” no greeting, no filler
“Explain in 3 bullets” obey structure
“I may be wrong, but…” correct gently

Example SFT pairs:

User:
هوش مصنوعی چیست؟ در دو جمله جواب بده.

Assistant:
هوش مصنوعی یعنی توانایی یک سیستم کامپیوتری برای انجام کارهایی که معمولاً به هوش انسانی نیاز دارند، مثل یادگیری، فهم متن، تشخیص تصویر یا تصمیم‌گیری. به زبان ساده، هوش مصنوعی از داده‌ها یاد می‌گیرد و بر اساس آن پاسخ یا تصمیم تولید می‌کند.



User:
هوش مصنوعی چیست؟ برای دانش‌آموز کلاس هفتم توضیح بده.

Assistant:
هوش مصنوعی یعنی برنامه‌ای که می‌تواند از داده‌ها یاد بگیرد و بعضی کارها را شبیه انسان انجام دهد. مثلاً وقتی یک برنامه با دیدن عکس‌های زیاد یاد می‌گیرد گربه را از سگ تشخیص دهد، از هوش مصنوعی استفاده کرده است.



User:
هوش مصنوعی چیست؟ فقط تعریف کوتاه بده.

Assistant:
هوش مصنوعی توانایی یک سیستم کامپیوتری برای یادگیری، تصمیم‌گیری یا انجام کارهایی است که معمولاً به هوش انسانی نیاز دارند.



User:
هوش مصنوعی چیست؟ بدون مقدمه جواب بده.

Assistant:
هوش مصنوعی فناوری‌ای است که به کامپیوترها کمک می‌کند از داده‌ها یاد بگیرند و کارهایی مانند تشخیص تصویر، فهم متن، پیش‌بینی یا تصمیم‌گیری را انجام دهند.

This is the type of data that directly targets your observed failure.

Chosen/rejected examples for DPO or ORPO (click for more details)

What data to use

I would separate the data by role.

Do not put all Persian data into one bucket.

Data role Purpose Example sources
diagnostic eval decide whether CPT is needed Khayyam/PersianMMLU, PARSE, custom prompts
answer-style SFT teach concise Persian assistant behavior curated definitions, FarsInstruct-style examples
QA behavior teach grounded question answering PQuAD, PerCQA, PCoQA
math/tutor behavior teach step-by-step tutoring custom textbook-aligned data
raw corpus / CPT improve Persian grounding Matina or other filtered Persian corpus
preference data prefer concise/focused answers chosen/rejected pairs
RAG documents provide domain facts without retraining trusted educational sources

Useful Persian resources:

Resource Use
FarsInstruct Persian instruction-following data and task templates
FarsInstruct GitHub inspect task/template design
Khayyam Challenge / PersianMMLU Persian school/exam-style evaluation
PQuAD Persian reading comprehension QA
PerCQA Persian community QA / answer selection
PCoQA Persian conversational QA
PARSE Persian reasoning QA
Matina Persian corpus possible CPT corpus if grounding is weak
Matina on HF inspect raw corpus candidate
PersianPhi example of a heavier Persian adaptation pipeline

Important: do not assume these are all training-ready. Inspect license, quality, duplication, synthetic content, and whether the style matches your target.

About TinyStories

Your concern is right.

Translated TinyStories may help with:

  • simple sentence structure,
  • simple Persian generation,
  • child-like narrative fluency,
  • basic grammar patterns.

But it is not a good main source for:

  • AI explanations,
  • math tutoring,
  • educational definitions,
  • Persian technical terminology,
  • robust knowledge coverage,
  • assistant answer style.

TinyStories is useful as a controlled simple-language dataset for small-model research, but it is not a broad knowledge or tutor corpus:

  • TinyStories paper

PersianPhi uses translated TinyStories as part of a broader curriculum, but not as the whole Persian solution:

  • PersianPhi
  • Persian-Phi paper

So I would treat translated TinyStories as optional auxiliary data, not the main CPT source.

Possible role of TinyStories (click for more details)

About Wikipedia

Wikipedia is not useless, but it is not enough by itself.

It is useful for:

  • broad factual coverage,
  • terminology,
  • named entities,
  • reading comprehension,
  • background documents for RAG,
  • raw text candidates for corpus work.

But it is weak as direct assistant SFT because Wikipedia does not teach:

  • how short the answer should be,
  • when not to greet,
  • how to tutor a student,
  • how to adapt to difficulty level,
  • how to answer a simple definition prompt,
  • how to admit uncertainty.

So I would not directly train assistant behavior from Wikipedia paragraphs.

Better:

  1. use Wikipedia as background text,
  2. generate candidate QA/explanations from it,
  3. filter with a Persian educational rubric,
  4. manually inspect samples,
  5. keep only strong examples.
Use of Wikipedia Good?
raw background corpus yes, with filtering
RAG documents yes
reading comprehension source yes
terminology source yes
direct assistant answer style no
shallow answer detector no

About FineWeb-Edu and shallow text detection

You are right that the FineWeb-Edu classifier itself is not a Persian shallow-text detector.

FineWeb-Edu is still useful as an idea:

Educational quality should be scored separately from fluency.

KenLM/perplexity can detect some noisy text, but it will not reliably detect a fluent-but-shallow explanation.

Useful references:

  • FineWeb-Edu
  • FineWeb-Edu classifier
  • FineWeb-Edu annotations
  • FineWeb-C
  • FineWeb2 paper
  • FineWeb2 dataset

FineWeb-C is especially relevant because it is about educational-quality annotations across many languages. I would copy that idea at small scale for Persian.

Persian educational quality rubric (click for more details)

Minimal diagnosis/eval set

I would make a small eval before training.

Start with maybe 100 examples, not 1,000.

Bucket Count Purpose
short definition prompts 20 answer length / focus
beginner explanation prompts 20 educational simplicity
Persian QA prompts 20 general answer behavior
math/tutor prompts 20 tutor format
Latin-name / English-term prompts 10 code-switch control
uncertainty prompts 10 honesty / hallucination control

Score each model on:

Metric What to check
Persian naturalness is the Persian acceptable?
answer focus does it answer the actual prompt?
length control does it obey one-sentence/two-sentence constraints?
educational usefulness is it helpful for the intended user?
shallow answer rate is it generic or empty?
language drift does it switch language unexpectedly?
factual correctness is the content right?
format following does it follow requested structure?

Run:

  • Qwen3.5-0.8B
  • Qwen3.5-2B
  • Qwen3.5-4B

Optional:

  • Qwen3.5-9B
  • one non-Qwen small model

Then decide.

Route decision after the eval (click for more details)

If it is A: answer-style / alignment failure

Use SFT first.

Suggested dataset:

  • 5k–30k high-quality Persian answer-style examples,
  • maybe 1k–5k chosen/rejected pairs later,
  • many short prompts,
  • many length-control prompts,
  • no unnecessary greeting,
  • Persian educational tone,
  • explicit “short / beginner / detailed” modes.

Examples to include:

Example type Why
concise definitions fixes overlong answers
no-greeting answers fixes filler
student explanations fixes educational tone
examples with one example only fixes over-expansion
answer in N sentences fixes length control
uncertainty examples fixes hallucination
simple correction examples fixes tutor behavior

This route is much cheaper than CPT.

If it is B: 0.8B capacity pressure

Try 2B/4B before heavy training.

The important point is:

0.8B may be small enough that every extra behavior competes for capacity.

Persian fluency, instruction following, educational style, math tutor behavior, tool JSON, and anti-drift behavior are all separate pressures.

If 2B or 4B solves many failures immediately, then 2B/4B may be cheaper overall than:

  • CPT,
  • tokenizer extension,
  • large SFT,
  • DPO,
  • runtime guardrails,
  • repeated debugging.

For a festival/demo project, 0.8B is attractive. But if it requires too much engineering, 2B may be the practical sweet spot.

If it is C: Persian/domain grounding failure

Then CPT may be justified, but I would still be careful.

Before CPT:

  1. inspect a Persian raw corpus,
  2. sample and manually review,
  3. deduplicate,
  4. check language contamination,
  5. check educational quality,
  6. keep eval separate,
  7. run a small pilot before a large CPT run.

Possible raw corpus reference:

  • Matina: A Large-Scale 73B Token Persian Text Corpus
  • Matina on Hugging Face

But CPT is not automatically the first fix. If the missing information exists in a small set of trusted documents, RAG may be cheaper and safer.

When RAG may be better than CPT (click for more details)

Practical training strategy

I would use this sequence:

Stage 0: Diagnose

Run 100 diagnostic prompts on 0.8B/2B/4B.

Stage 1: SFT

If A is likely, build answer-style SFT.

Start small:

  • 5k high-quality examples if data is very good,
  • 10k–30k if you can maintain quality,
  • avoid huge weak synthetic datasets.

Stage 2: Preference tuning

If the model still prefers the wrong style:

  • chosen = concise, focused Persian answer,
  • rejected = overlong, shallow, greeting-heavy, unfocused answer.

Use DPO/ORPO only after you know what failures remain.

Stage 3: Size decision

If 0.8B remains brittle:

  • try 2B,
  • then 4B,
  • compare total engineering cost, not only inference cost.

Stage 4: CPT only if justified

If C is clearly confirmed:

  • use curated Persian raw text,
  • maybe Matina or educational corpora,
  • keep CPT small at first,
  • evaluate after CPT and after SFT.

Stage 5: Tokenizer only if measured

Do not extend tokenizer unless tokenization analysis shows Persian is badly fragmented and you can afford embedding adaptation.

Final practical recommendation

For your current output, I would not start with CPT.

I would do:

  1. diagnostic eval first ,
  2. answer-style SFT first if the model already knows the concept ,
  3. 2B/4B comparison if 0.8B remains brittle ,
  4. DPO/ORPO if the model can produce good answers but often chooses bad style ,
  5. CPT only if many examples show broad Persian/domain grounding weakness.

So the main idea is:

Try to prove that CPT is necessary before paying for CPT. This example alone does not prove it. It looks more like answer-style alignment plus possible 0.8B capacity pressure.

Discussion in the ATmosphere

Loading comments...