
How can I build a high-quality dataset?

Hugging Face Forums [Unofficial] June 8, 2026

Since this is already quite concrete, I looked into it directly:

Short answer

Yes, your proposed high-level order is reasonable (CPT here means continued pretraining):

CPT -> SFT -> small final hand-written / polished SFT

But I would slightly modify it:

CPT
  -> small hand-written seed + private eval
  -> synthetic / generated SFT from cleaned raw text
  -> filtering + verification
  -> SFT
  -> small final hand-written / polished SFT

The important change is that the hand-written data should not appear only at the end. It should be used throughout the pipeline, starting at the beginning, as:

examples for the large teacher model
style guide
private evaluation set
final correction / polish set

The main idea is:

Cleaned raw text is not an SFT dataset. It is source material. You turn it into SFT by designing task families, generating instruction-answer pairs, verifying them, and mixing them deliberately.

So I would not try to generate one universal Persian SFT dataset. I would generate several smaller datasets, each targeting one capability.

For example:

Capability	Dataset family
Persian grammar help	explanation, correction, quiz, role identification
school-grade math	verified problem-solution pairs
student support	simplify, explain, summarize, give examples
tool calling	tool schemas + Persian user requests + tool-call traces
knowledge QA	passage-grounded QA from source texts
multi-turn assistant	document-to-dialogue or scenario-to-dialogue
natural Persian style	small native/polished final SFT set

1. Raw text is source material, not SFT

A cleaned Persian paragraph like this:

<clean Persian passage>

is good for CPT.

But SFT needs something like:

{"messages":[{"role":"user","content":"<instruction>"},{"role":"assistant","content":"<answer>"}]}

or:

{"prompt":"<instruction>","completion":"<answer>"}

That means you need a conversion step:

cleaned raw text
  -> passage selection
  -> task type selection
  -> instruction generation
  -> answer generation
  -> verification
  -> formatting
  -> SFT dataset

This is the same general idea behind work such as Bonito, which converts unannotated text into task-specific instruction-tuning datasets, and "Raw Text is All You Need", which studies generating knowledge-intensive multi-turn dialogues from raw documents.

You do not need to copy those exact methods. The useful idea is:

First decide what task the raw text should become.
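The last step of the conversion above, formatting, can be sketched in a few lines. This is only an illustration: the helper names `to_chat_record`, `to_prompt_completion`, and `to_jsonl` are hypothetical, and the two output shapes are the ones shown earlier in this section.

```python
import json


def to_chat_record(instruction: str, answer: str) -> dict:
    """Wrap one verified instruction-answer pair in the chat ("messages") format."""
    return {
        "messages": [
            {"role": "user", "content": instruction.strip()},
            {"role": "assistant", "content": answer.strip()},
        ]
    }


def to_prompt_completion(instruction: str, answer: str) -> dict:
    """Wrap the same pair in the prompt/completion format."""
    return {"prompt": instruction.strip(), "completion": answer.strip()}


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSONL; ensure_ascii=False keeps Persian text readable."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

Calling `to_jsonl([to_chat_record(q, a)])` gives one line per example, which is the shape the rest of the pipeline can append to a file.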

2. Use capability-specific datasets, not one giant mixed dataset

For your case, I would split SFT data by target capability.

Recommended split

Capability	Should it come from raw text?	Best construction method
Persian grammar help	partly	grammar passages + hand-written seed + generated correction/explanation tasks
student support	yes	educational passages → explanations, summaries, examples, exercises
school math reasoning	not mainly	generated/translated problems + verified solutions
tool calling	no, mostly separate	tool schema + user intent + tool call + tool result + final answer
general helpful Persian	partly	seed examples + self-instruct-style generation + filtering
factual Persian QA	yes	passage-grounded QA from cleaned text
multi-turn dialogue	yes	document/scenario → multi-turn conversation
final natural style	mostly hand-written/polished	small native final SFT set

The mistake to avoid is:

I have clean Persian text, so I will ask an LLM to make random Q&A pairs from all of it.

That usually creates a noisy, repetitive, shallow SFT dataset.

A better approach is:

I need 7 capability buckets.
For each bucket, I will design task templates.
Then I will generate and verify examples for that bucket.
Then I will mix the buckets intentionally.

3. Better training order

Your proposed order:

CPT -> SFT -> Hand-Written SFT

is basically good.

I would expand it like this:

Stage	Purpose
CPT	Teach Persian distribution: grammar, syntax, style, general domain familiarity
private eval	Keep a small set of Persian tasks that never enters training
hand-written seed	Show the teacher LLM what “good Persian assistant data” looks like
synthetic SFT generation	Scale from raw text and task templates
filtering/verification	Remove wrong, unnatural, repetitive, or badly formatted examples
first SFT	Teach assistant behavior
final hand-written/polished SFT	Fix style, grammar tutor behavior, safety, and final answer tone

The seed set and final polish set can be small.

They just need to be high quality.

4. A practical raw-text-to-SFT workflow

Here is a concrete workflow.

1. Split cleaned corpus into passages.
2. Classify passages by usefulness.
3. Assign each useful passage to one or more task families.
4. Generate instruction-answer pairs with a teacher LLM.
5. Verify automatically where possible.
6. Review samples manually.
7. Deduplicate.
8. Convert to TRL/HF chat format.
9. Train a small SFT.
10. Evaluate.
11. Add more data only where the model fails.
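Step 1 of this workflow could start as something like the sketch below. It assumes the cleaned corpus separates paragraphs with blank lines; the function name `split_passages` and both length thresholds are my own placeholders, not recommended values.

```python
import re


def split_passages(text: str, min_chars: int = 200, max_chars: int = 1200) -> list[str]:
    """Split cleaned text on blank lines, then pack paragraphs together until a
    passage would exceed max_chars. Passages shorter than min_chars are dropped.
    Both thresholds are illustrative assumptions."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    passages, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                passages.append(current)
            current = para  # an oversized single paragraph is kept whole here
    if current:
        passages.append(current)
    return [p for p in passages if len(p) >= min_chars]
```

Character counts are a rough stand-in for token counts; if you already have the tokenizer loaded, measuring in tokens is probably the better choice.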

Passage classification

Not every cleaned text is useful for SFT.

Classify each passage:

Passage type	Use
grammar explanation	grammar tutor data
educational explanation	student support data
factual article	grounded QA / summarization
math problem	math reasoning if answer is verifiable
tool documentation	tool calling data
noisy opinion/comment	maybe conversational style, but risky
list/table/boilerplate	usually reject or handle separately
very short text	often reject
very long text	split or summarize first

You can use simple tags:

{"id":"p_0001","text":"<passage>","tags":["grammar","education"],"quality":"high"}
{"id":"p_0002","text":"<passage>","tags":["factual","qa"],"quality":"medium"}
{"id":"p_0003","text":"<passage>","tags":["boilerplate"],"quality":"reject"}
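A first-pass tagger that emits records in this shape might look like the sketch below. The keyword lists, the `min_chars` threshold, and the rule that maps matches to a quality label are all assumptions for illustration; in practice I would expect a classifier or an LLM to do better than keyword matching.

```python
# Hypothetical keyword lists; a real project would tune these or use a classifier.
TAG_KEYWORDS = {
    "grammar": ["فاعل", "مفعول", "متمم", "دستور زبان"],
    "math": ["مسئله", "حاصل", "معادله"],
}


def tag_passage(passage_id: str, text: str, min_chars: int = 200) -> dict:
    """Assign coarse tags and a first-pass quality label from length and keywords."""
    tags = [tag for tag, words in TAG_KEYWORDS.items() if any(w in text for w in words)]
    if len(text) < min_chars:
        quality = "reject"   # very short text: often reject
    elif tags:
        quality = "high"     # matched at least one target task family
    else:
        quality = "medium"   # usable, but needs a closer look
    return {"id": passage_id, "text": text, "tags": tags or ["untagged"], "quality": quality}
```

The labels from a rule like this are only a starting point for manual sampling, not a final quality judgment.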

5. Use a teacher LLM with task templates

For each task family, write a generation prompt.

Do not simply say:

Make SFT data from this text.

That is too vague.

Instead, make the task explicit.

Passage-grounded QA

You are creating Persian SFT data for a small Iranian Persian assistant.

Source passage:
<passage>

Create 3 Persian user questions that can be answered only from the passage.
For each question, write a concise Persian answer.
Do not add facts that are not in the passage.
Use natural Iranian Persian.
Return JSONL with fields: id, task_type, messages, source_passage_id.

Student explanation

You are creating Persian SFT data for a student-support assistant.

Source passage:
<passage>

Create:
1. one student question,
2. one simple explanation,
3. one example,
4. one short follow-up question the student might ask,
5. one assistant follow-up answer.

The assistant should sound like a patient Persian teacher.
Avoid English unless the source requires it.
Return JSON.

Grammar tutor

You are creating Persian grammar tutor SFT data.

Grammar topic:
<topic>

Source passage:
<passage>

Create examples for:
- explaining the concept
- identifying the concept in a sentence
- correcting a student's wrong answer
- giving 3 new examples
- making a mini quiz

Use Iranian Persian.
Keep explanations short and clear.
Return JSONL.

Simplification

Convert the passage into a student-friendly explanation.

Source:
<passage>

Create:
- user instruction
- assistant answer
- difficulty level: elementary / middle_school / high_school
- key concepts
- possible student confusion

Use natural Persian.

This is much better than unrestricted generation.
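One simple way to keep these templates explicit is a small registry keyed by task family. The sketch below uses the standard-library `string.Template`; the dictionary keys and the `build_prompt` helper are hypothetical names, and the template texts are shortened versions of the prompts above.

```python
from string import Template

# Shortened versions of the templates above; the dictionary keys are hypothetical.
PROMPT_TEMPLATES = {
    "passage_qa": Template(
        "You are creating Persian SFT data for a small Iranian Persian assistant.\n\n"
        "Source passage:\n$passage\n\n"
        "Create $n Persian user questions that can be answered only from the passage.\n"
        "Return JSONL with fields: id, task_type, messages, source_passage_id."
    ),
    "simplification": Template(
        "Convert the passage into a student-friendly explanation.\n\n"
        "Source:\n$passage\n\nUse natural Persian."
    ),
}


def build_prompt(task_family: str, passage: str, n: int = 3) -> str:
    """Fill the template for one task family; unknown families fail loudly."""
    if task_family not in PROMPT_TEMPLATES:
        raise KeyError(f"no template for task family: {task_family}")
    # substitute() raises if a template field is missing; extra fields are ignored
    return PROMPT_TEMPLATES[task_family].substitute(passage=passage, n=n)
```

Failing on an unknown task family is deliberate: a silent fallback to a generic prompt would bring back the vague "make SFT data" behavior.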

6. Grammar help in Farsi: build it as a separate dataset family

For things like:

فاعل
مفعول
متمم
مسند
فعل
قید
صفت
مضاف و مضاف‌الیه
نقش‌های دستوری
جمله ساده / مرکب

I would not rely only on raw text.

You need targeted examples.

Suggested grammar task families:

Task type	Example
concept explanation	“فاعل چیست؟ با مثال توضیح بده.”
role identification	“در جمله زیر فاعل و مفعول را مشخص کن.”
error correction	“دانش‌آموز گفته X مفعول است. آیا درست است؟”
contrastive explanation	“فرق متمم و مفعول چیست؟”
example generation	“برای متمم سه مثال ساده بساز.”
mini quiz	“سه سوال کوتاه درباره فاعل و مفعول بساز.”
step-by-step analysis	“این جمله را از نظر نقش‌های دستوری تحلیل کن.”
student misconception	“چرا این پاسخ اشتباه است؟ ساده توضیح بده.”

A useful schema:

{
  "id": "grammar_0001",
  "task_type": "grammar_role_identification",
  "topic": "فاعل",
  "messages": [
    {"role": "system", "content": "You are a helpful Persian grammar tutor."},
    {"role": "user", "content": "در جمله «علی کتاب را خواند»، فاعل را مشخص کن و کوتاه توضیح بده."},
    {"role": "assistant", "content": "در این جمله «علی» فاعل است، چون انجام‌دهندهٔ عمل خواندن است. «کتاب» مفعول است، چون عمل خواندن روی آن انجام شده است."}
  ],
  "source": "manual_seed",
  "language": "fa",
  "variety": "iranian_persian"
}

For grammar tutoring, I would make at least a small hand-written seed set.

Something like:

100-300 excellent grammar tutor examples

Then use those as few-shot examples for the teacher LLM.
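Turning the seed set into few-shot demonstrations can be as small as the sketch below. The function name `build_few_shot_block`, the header text, and the default of three examples are assumptions; the only requirement is that each seed example carries a `messages` list like the schema above.

```python
import json
import random


def build_few_shot_block(seed_examples: list[dict], k: int = 3, seed: int = 0) -> str:
    """Pick k seed examples reproducibly and render them as few-shot demonstrations."""
    rng = random.Random(seed)  # fixed seed so a generation run can be repeated
    chosen = rng.sample(seed_examples, min(k, len(seed_examples)))
    rendered = [json.dumps(ex["messages"], ensure_ascii=False) for ex in chosen]
    return "Examples of good data:\n" + "\n".join(rendered)
```

Varying `seed` between generation batches rotates which hand-written examples the teacher sees, which may help reduce repetitive outputs.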

7. Student support data

For student support, raw educational text is very useful.

Convert passages into tasks like:

Task	Input	Output
explain simply	passage	simple explanation
summarize	passage	short Persian summary
give examples	concept	2-3 examples
make exercise	passage	student exercise
answer misconception	wrong student answer	correction
scaffold	hard question	step-by-step hint
quiz	lesson passage	short quiz
compare concepts	two concepts	explanation of difference

Example generation prompt:

You are building SFT data for an Iranian Persian student assistant.

Source passage:
<passage>

Create 5 examples:
1. simple explanation
2. summary
3. example-based explanation
4. student misconception correction
5. mini quiz

Rules:
- Use natural Iranian Persian.
- Do not add facts outside the passage unless clearly common knowledge.
- Keep answers short enough for a small model.
- Do not create very long chain-of-thought.
- Return JSONL.

The “short enough for a small model” rule matters. A 0.8B model may not benefit from very long, complex answers early in training.

8. Math reasoning: do not rely only on raw Persian text

Math is different.

For math SFT, correctness matters more than naturalness.

I would not generate math examples only from raw Persian text. Instead:

generate or translate school-grade math problems,
solve them with a reliable solver or teacher model,
verify the answer,
write a Persian explanation,
sample-review.

Relevant examples:

GSM8K is a classic school-grade math word-problem dataset.
OpenMathInstruct generated large math instruction data with code-interpreter-style solutions.
MAmmoTH / MathInstruct is another important math instruction tuning example.

For your use case, I would make Persian school-grade math as a separate dataset family.

Example schema:

{
  "id": "math_0001",
  "task_type": "school_math_reasoning",
  "grade_level": "middle_school",
  "messages": [
    {"role": "system", "content": "You are a helpful Persian math tutor."},
    {"role": "user", "content": "اگر علی ۳ دفتر بخرد و قیمت هر دفتر ۴۰ هزار تومان باشد، او در مجموع چقدر پرداخت می‌کند؟"},
    {"role": "assistant", "content": "قیمت هر دفتر ۴۰ هزار تومان است و علی ۳ دفتر می‌خرد. پس باید ۳ را در ۴۰ هزار ضرب کنیم: ۳ × ۴۰٬۰۰۰ = ۱۲۰٬۰۰۰. بنابراین علی ۱۲۰ هزار تومان پرداخت می‌کند."}
  ],
  "answer": "120000",
  "verified": true
}

For math, add automatic checks when possible.

For arithmetic, you can verify with Python.

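def check_answer(problem_id, predicted_answer, gold_answer):
    # Normalize both sides identically (drop ASCII/Persian thousands separators, trim); problem_id is unused here.
    normalize = lambda value: str(value).replace(",", "").replace("٬", "").strip()
    return normalize(predicted_answer) == normalize(gold_answer)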

If you generate 10,000 math examples but 20% are wrong, the model trains on 2,000 incorrect solutions, and that can damage it.

Better:

2,000 verified math examples > 20,000 unverified math examples
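For plain arithmetic, the check can recompute the answer instead of comparing strings. The sketch below is one possible approach, assuming each example stores the expression behind its answer (for instance `3 * 40000`); the names `safe_eval` and `verify_arithmetic` are hypothetical. It walks the expression with the standard `ast` module rather than calling `eval()` on generated text.

```python
import ast
import operator

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

# Map Persian digits to ASCII; delete thousands separators (ASCII and Persian) and spaces.
_NORMALIZE = str.maketrans("۰۱۲۳۴۵۶۷۸۹", "0123456789", ",٬ ")


def normalize_number(text: str) -> str:
    return text.translate(_NORMALIZE)


def safe_eval(expression: str) -> float:
    """Evaluate +, -, *, / on numeric literals only, without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval").body)


def verify_arithmetic(expression: str, claimed_answer: str) -> bool:
    """True when the claimed answer equals the recomputed value."""
    return abs(safe_eval(expression) - float(normalize_number(claimed_answer))) < 1e-9
```

Anything the evaluator does not recognize raises `ValueError`, so such examples can be routed to manual review instead of being accepted silently.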

9. Tool calling should be a separate dataset

Tool calling does not naturally come from Persian raw text.

You need a tool-use dataset with:

tool schema
Persian user request
assistant tool call
tool output
final Persian answer

The TRL SFTTrainer docs cover tool-calling SFT and describe how to include tool schemas and tool calls in the dataset.

Tool-use work such as ToolLLM / ToolBench is useful conceptually because it treats tool use as its own data construction problem.

A simple tool example:

{
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "calculator",
        "description": "Perform basic arithmetic.",
        "parameters": {
          "type": "object",
          "properties": {
            "expression": {"type": "string"}
          },
          "required": ["expression"]
        }
      }
    }
  ],
  "messages": [
    {"role": "user", "content": "حاصل ۲۳ ضربدر ۴۷ چقدر است؟"},
    {
      "role": "assistant",
      "tool_calls": [
        {
          "type": "function",
          "function": {
            "name": "calculator",
            "arguments": "{\"expression\":\"23*47\"}"
          }
        }
      ]
    },
    {"role": "tool", "content": "1081"},
    {"role": "assistant", "content": "حاصل ۲۳ ضربدر ۴۷ برابر با ۱۰۸۱ است."}
  ]
}

You also need negative and edge-case examples, not only successful tool calls.

Example type	Why
tool needed	model learns when to call
no tool needed	model does not call tools unnecessarily
ambiguous request	model asks clarification
unavailable tool	model refuses or explains limitation
invalid arguments	model learns valid JSON/schema
tool result summarization	model explains result in Persian
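A rule-based check for the "schema-valid" part could look like the sketch below. It assumes the example layout shown above (a `tools` list plus `messages` with `tool_calls`) and only checks tool names, JSON parsing, required arguments, and unexpected arguments; it is not a full JSON Schema validator, and `validate_tool_trace` is a hypothetical name.

```python
import json


def validate_tool_trace(example: dict) -> list[str]:
    """Return a list of problems found in one tool-calling example (empty = passed)."""
    problems = []
    schemas = {t["function"]["name"]: t["function"] for t in example.get("tools", [])}
    for message in example.get("messages", []):
        for call in message.get("tool_calls", []):
            fn = call.get("function", {})
            name = fn.get("name")
            if name not in schemas:
                problems.append(f"unknown tool: {name}")
                continue
            args = fn.get("arguments")
            if isinstance(args, str):  # arguments stored as a JSON string
                try:
                    args = json.loads(args)
                except json.JSONDecodeError:
                    problems.append(f"{name}: arguments are not valid JSON")
                    continue
            if not isinstance(args, dict):
                problems.append(f"{name}: arguments must be an object")
                continue
            params = schemas[name].get("parameters", {})
            for required in params.get("required", []):
                if required not in args:
                    problems.append(f"{name}: missing required argument '{required}'")
            for key in args:
                if key not in params.get("properties", {}):
                    problems.append(f"{name}: unexpected argument '{key}'")
    return problems
```

Returning the list of problems rather than a boolean makes it easier to see which failure type dominates in a generated batch.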

For Persian tool calling, I would start small:

200-500 high-quality tool examples

Then expand only after testing.

10. A large LLM workflow for generating SFT data

If you can use a larger LLM as a teacher, use it not only as a generator but also as a critic/verifier.

A simple pipeline:

passage
  -> generator LLM creates candidate SFT examples
  -> verifier LLM checks faithfulness, Persian naturalness, format
  -> rule-based filters check JSON/schema/length
  -> sample manual review
  -> accepted examples enter SFT dataset

Generator prompt

You are generating Persian SFT data for a small Iranian Persian assistant.

Target capability: <capability>
Source passage:
<passage>

Generate <n> examples.
Each example must include:
- task_type
- messages
- source_passage_id
- difficulty
- verification_notes

Rules:
- Use natural Iranian Persian.
- Keep answers concise.
- Do not invent facts outside the passage.
- Do not include unsafe content.
- Do not use English unless needed.
- Return valid JSONL only.

Verifier prompt

You are verifying Persian SFT examples.

Source passage:
<passage>

Candidate example:
<example>

Check:
1. Is the Persian natural?
2. Is the answer faithful to the passage?
3. Is the instruction clear?
4. Is the answer useful for the target capability?
5. Is the format valid?
6. Is there any hallucinated fact?
7. Should this be accepted, rejected, or edited?

Return:
{"decision":"accept|edit|reject","reason":"...","fixed_example":...}

Do not accept everything from the teacher LLM. The teacher LLM is a data generator, not a guarantee of quality.
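Acting on the verifier's output can be kept strict with a small router like the sketch below. It assumes the verifier returns the JSON shape from the verifier prompt (`decision`, `reason`, `fixed_example`); the function name `route_candidate` is hypothetical. Anything unparseable or ambiguous is treated as a rejection.

```python
import json


def route_candidate(candidate: dict, verifier_output: str):
    """Return the example to keep, or None if it should be rejected.

    Anything that cannot be parsed, or that is not an explicit accept/edit,
    is rejected, so verifier failures never let data through by default."""
    try:
        verdict = json.loads(verifier_output)
    except json.JSONDecodeError:
        return None
    if not isinstance(verdict, dict):
        return None
    decision = verdict.get("decision")
    if decision == "accept":
        return candidate
    if decision == "edit" and isinstance(verdict.get("fixed_example"), dict):
        return verdict["fixed_example"]  # keep the verifier's corrected version
    return None
```

Edited examples are worth sampling manually more often than accepted ones, since the verifier has effectively become a second generator for them.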

11. You can implement this with simple scripts first

You do not need a complex framework at the beginning.

A simple Python pipeline is enough:

input passages
  -> prompt templates
  -> teacher LLM calls
  -> JSON parser
  -> filters
  -> output JSONL

If the pipeline becomes large, tools like distilabel can help organize synthetic data generation, AI feedback, judging, filtering, and dataset export.

But I would start simple.
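A minimal version of that simple pipeline might be the sketch below. The teacher call is passed in as a plain function (`call_teacher`), because the actual API client depends on which model you use; that parameter, `make_prompt`, `keep`, and both function names are assumptions for illustration.

```python
import json
from typing import Callable, Iterable, Iterator


def parse_jsonl(raw: str) -> Iterator[dict]:
    """Yield each line that parses as a JSON object; skip anything else."""
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # teacher output is not guaranteed to be valid JSONL
        if isinstance(obj, dict):
            yield obj


def run_pipeline(passages: Iterable[dict], make_prompt: Callable[[str], str],
                 call_teacher: Callable[[str], str],
                 keep: Callable[[dict], bool]) -> list[dict]:
    """passages -> prompt -> teacher call -> JSON parser -> filter -> records."""
    accepted = []
    for passage in passages:
        raw = call_teacher(make_prompt(passage["text"]))
        for record in parse_jsonl(raw):
            record["source_passage_id"] = passage["id"]  # keep provenance for debugging
            if keep(record):
                accepted.append(record)
    return accepted
```

Because the teacher is just a callable, the whole loop can be tested with a fake function before spending any money on real generation.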

A minimal folder structure:

data/
  raw_clean/
  passages/
  generated_candidates/
  rejected/
  accepted/
  eval_private/
prompts/
  grammar_generation.txt
  student_support_generation.txt
  math_generation.txt
  tool_calling_generation.txt
  verifier.txt
scripts/
  split_passages.py
  generate_candidates.py
  verify_candidates.py
  filter_jsonl.py
  dedup.py
  build_train_mix.py

12. Suggested SFT dataset sizes

There is no universal number.

The needed size depends on:

model size
base model quality
CPT quality
task complexity
data diversity
data correctness
evaluation target

But for planning, I would use this rough scale:

Dataset part	Suggested starting size
private eval	100-500 examples
hand-written seed	200-1,000 examples
first SFT experiment	1,000-5,000 examples
grammar tutor data	1,000-5,000 examples
student support data	2,000-10,000 examples
math reasoning data	2,000-20,000 verified examples
tool calling data	200-2,000 high-quality examples
broad useful SFT	5,000-30,000 examples
large synthetic SFT	50,000-100,000+ examples, only after filtering is mature

Some relevant scale references:

LIMA: around 1,000 carefully curated examples showed that small, high-quality data can matter.
AlpaGasus: selected around 9k high-quality examples from Alpaca-style data.
SmolTalk: 1M synthetic SFT samples used for SmolLM2-Instruct.
SmolLM2: small-model training can still rely heavily on careful data design and staged training.

But I would not start with 1M examples.

For your project, a more realistic first target might be:

CPT first
private eval: 200-300
hand-written seed: 300-800
first generated SFT: 3k-10k
final polished SFT: 500-2k

Then expand based on evaluation.

13. Example final mixture for your case

A possible first real SFT mixture:

Component	Examples	Notes
Persian grammar tutor	2,000	include فاعل، مفعول، متمم, correction, quiz
student support	3,000	explanations, examples, simplification
passage-grounded QA	3,000	from cleaned Persian educational/factual text
summarization/simplification	1,500	useful for student assistant
school math	2,000	verified answers only
tool calling	500	high quality, schema-valid
safety/refusal	500	simple Persian safety cases
final hand-written polish	500	naturalness and style

Total:

~13k examples

That is already enough for a serious first SFT experiment.

If it works, expand the weak categories.
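Building the mixture from per-family pools can be sketched as follows. The counts are the ones from the table above; the `task_family` key names and the `build_train_mix` helper are hypothetical. The function also reports which families fall short of their target, which is a direct pointer to where more data is needed.

```python
import random
from collections import defaultdict

# Target counts from the table above, keyed by hypothetical task_family names.
TARGET_MIX = {
    "persian_grammar_tutor": 2000, "student_support": 3000,
    "passage_grounded_qa": 3000, "summarization": 1500,
    "school_math": 2000, "tool_calling": 500,
    "safety": 500, "final_polish": 500,
}


def build_train_mix(records, target_mix=TARGET_MIX, seed=0):
    """Sample up to the target count per family and report any shortfall."""
    by_family = defaultdict(list)
    for record in records:
        by_family[record["task_family"]].append(record)
    rng = random.Random(seed)
    mix, shortfall = [], {}
    for family, target in target_mix.items():
        pool = by_family.get(family, [])
        if len(pool) < target:
            shortfall[family] = target - len(pool)
        mix.extend(rng.sample(pool, min(target, len(pool))))
    rng.shuffle(mix)  # avoid training on one family at a time
    return mix, shortfall
```

The targets sum to 13,000, matching the total above.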

14. How to mix generated and hand-written data

I would not put hand-written data only at the end.

Use it in three places:

Hand-written data role	Purpose
seed examples	show the teacher LLM what good data looks like
private eval	measure model improvement
final polish	fix the model’s final behavior/style

Workflow:

write 300 excellent examples
  -> use 100 as private eval
  -> use 100 as few-shot generation examples
  -> keep 100 for final polish / style correction

Do not train on the private eval examples.
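The three-way split and the "never train on eval" rule can both be enforced in code. In the sketch below, the function names and the exact-match comparison on user text are my assumptions; exact matching only catches verbatim leaks, so near-duplicates would still need a fuzzier check.

```python
import random


def split_handwritten(examples: list[dict], seed: int = 0) -> dict:
    """Shuffle once, then cut into three roles (eval / few-shot / polish)."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    third = len(shuffled) // 3
    return {
        "private_eval": shuffled[:third],
        "few_shot": shuffled[third:2 * third],
        "final_polish": shuffled[2 * third:],
    }


def user_text(example: dict) -> str:
    """Concatenate the user turns, with whitespace collapsed, as a comparison key."""
    turns = [m["content"] for m in example["messages"] if m["role"] == "user"]
    return " ".join(" ".join(turns).split())


def drop_eval_leaks(train: list[dict], private_eval: list[dict]) -> list[dict]:
    """Remove training examples whose user text exactly matches an eval prompt."""
    eval_keys = {user_text(e) for e in private_eval}
    return [t for t in train if user_text(t) not in eval_keys]
```

Running `drop_eval_leaks` on every generated batch matters because the few-shot examples can cause the teacher to reproduce prompts that resemble the eval set.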

15. Quality filters for generated SFT

For each generated example, check:

Check	Method
valid JSON	parser
correct language	language/script ratio
not too long	token length filter
not too short	length filter
no duplicate	exact/near dedup
source faithfulness	verifier LLM or rules
natural Persian	sample review / classifier
no benchmark leakage	decontamination
correct math answer	calculator / solver
valid tool schema	JSON schema validation
useful task	category-specific review
safe answer	safety filter / manual review

The most important filter depends on the task.

Task	Most important filter
grammar tutor	correctness and natural explanation
math	answer verification
tool calling	schema validity
passage QA	source faithfulness
student support	helpfulness and natural Persian
safety	refusal quality
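Three of the cheaper checks (language/script ratio, length, exact dedup) can be combined in one pass, as in the sketch below. The thresholds are placeholders, lengths are counted in characters rather than tokens for simplicity, and the Unicode range is only a rough proxy for "Persian" since it covers the whole Arabic-script block; all function names are hypothetical.

```python
import hashlib
import re

PERSIAN_CHARS = re.compile(r"[\u0600-\u06FF\u200c]")  # Arabic-script block plus ZWNJ


def persian_ratio(text: str) -> float:
    """Share of non-whitespace characters that are Arabic-script characters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(bool(PERSIAN_CHARS.match(c)) for c in chars) / len(chars)


def answer_text(record: dict) -> str:
    """Join the assistant turns; messages without text content count as empty."""
    return " ".join((m.get("content") or "") for m in record["messages"]
                    if m["role"] == "assistant")


def dedup_key(record: dict) -> str:
    """Hash of all message contents with whitespace collapsed (exact dedup only)."""
    joined = " ".join(" ".join((m.get("content") or "").split())
                      for m in record["messages"])
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def filter_records(records, min_ratio=0.6, min_chars=20, max_chars=2000):
    """Apply length, script-ratio, and exact-duplicate filters in one pass."""
    seen, kept = set(), []
    for record in records:
        answer = answer_text(record)
        if not (min_chars <= len(answer) <= max_chars):
            continue
        if persian_ratio(answer) < min_ratio:
            continue
        key = dedup_key(record)
        if key in seen:
            continue
        seen.add(key)
        kept.append(record)
    return kept
```

Math and tool-calling examples legitimately contain more digits and Latin characters, so a lower `min_ratio` per task family is probably needed there.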

16. Example schema for mixed SFT

I would keep metadata. It will help you debug later.

{
  "id": "sft_grammar_000001",
  "task_family": "persian_grammar_tutor",
  "source_type": "manual_seed",
  "source_passage_id": null,
  "difficulty": "middle_school",
  "language": "fa",
  "variety": "iranian_persian",
  "messages": [
    {"role": "system", "content": "You are a helpful Persian tutor."},
    {"role": "user", "content": "در جمله «مریم به مدرسه رفت»، متمم را مشخص کن."},
    {"role": "assistant", "content": "در این جمله «مدرسه» متمم است، چون بعد از حرف اضافهٔ «به» آمده و معنی فعل «رفت» را کامل می‌کند."}
  ],
  "verified": true,
  "notes": "manual seed"
}

For generated passage-grounded data:

{
  "id": "sft_qa_000001",
  "task_family": "passage_grounded_qa",
  "source_type": "generated_from_clean_text",
  "source_passage_id": "passage_12345",
  "language": "fa",
  "variety": "iranian_persian",
  "messages": [
    {"role": "system", "content": "You are a helpful Persian assistant. Answer only from the provided passage."},
    {"role": "user", "content": "<question based on passage>"},
    {"role": "assistant", "content": "<answer grounded in passage>"}
  ],
  "verified": true,
  "verifier": "teacher_llm_plus_manual_sample"
}

For tool calling, keep it separate because the schema is different.

17. What not to do

I would avoid:

Bad approach	Why
Generate random Q&A from all raw text	Usually shallow and repetitive
Mix tool calling, grammar, math, and chat without labels	Hard to debug
Trust teacher LLM outputs without verification	Wrong answers enter training
Generate huge data before first eval	You will not know what helped
Use only passage-grounded QA	The model may not learn tutoring behavior
Use only final hand-written data	Too small and slow
Use only synthetic data	Persian naturalness may be weak
Ignore math verification	Bad math data is harmful
Treat tool calling like normal chat	Tool use needs schema-valid traces
Train on private eval examples	Contamination

18. Practical recommendation

For your exact case, I would do this:

1. Finish CPT on clean Persian text.
2. Build 200-300 private eval examples.
3. Write 300-800 high-quality seed SFT examples manually.
4. Split cleaned text into passages.
5. Classify passages into task families.
6. Generate SFT candidates with a larger teacher LLM.
7. Verify generated examples by task type.
8. Build a first 5k-15k SFT mixture.
9. Fine-tune.
10. Evaluate.
11. Add data only for the categories that fail.
12. Finish with 500-2k hand-written/polished examples.
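Step 11 depends on knowing which categories fail, which is where the `task_family` metadata pays off. The sketch below assumes each private-eval result is a dict with `task_family` and a boolean `passed`; the function names and the 0.7 threshold are placeholders, and how you decide `passed` (exact match, rubric, manual review) is up to you.

```python
from collections import defaultdict


def accuracy_by_family(results: list[dict]) -> dict:
    """results: [{"task_family": str, "passed": bool}, ...] -> accuracy per family."""
    totals, passed = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["task_family"]] += 1
        passed[r["task_family"]] += bool(r["passed"])
    return {family: passed[family] / totals[family] for family in totals}


def families_to_expand(results: list[dict], threshold: float = 0.7) -> list[str]:
    """Families whose accuracy is below the (assumed) threshold, worst first."""
    scores = accuracy_by_family(results)
    return sorted((f for f, s in scores.items() if s < threshold), key=scores.get)
```

With only 200-300 eval examples, per-family counts are small, so I would treat these numbers as a rough signal rather than a precise measurement.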

If I had to choose the first SFT categories, I would start with:

Persian grammar tutor
student support / explanation
passage-grounded QA
math with verified answers
small tool-calling set

This is more realistic than trying to build every possible assistant capability immediately.

Bottom line

Your plan is good, but I would refine it like this:

CPT
  -> private eval + hand-written seed
  -> capability-specific synthetic SFT generation
  -> verification/filtering
  -> SFT
  -> small final hand-written polish

The key point is:

Do not generate one universal SFT dataset from raw text. Generate several small datasets for separate capabilities, verify each one differently, and then mix them intentionally.

For your target model, I would rather have:

10k carefully categorized and verified Persian SFT examples

than:

100k random Persian Q&A pairs generated from raw text