How can I build a high-quality dataset?
Since this is already quite concrete, I looked into it directly:
Short answer
Yes, your proposed high-level order is reasonable:
CPT -> SFT -> small final hand-written / polished SFT
But I would slightly modify it:
CPT
-> small hand-written seed + private eval
-> synthetic / generated SFT from cleaned raw text
-> filtering + verification
-> SFT
-> small final hand-written / polished SFT
The important change is that the hand-written data should not only be at the end. It should also be used at the beginning as:
- examples for the large teacher model
- style guide
- private evaluation set
- final correction / polish set
The main idea is:
Cleaned raw text is not an SFT dataset. It is source material. You turn it into SFT by designing task families, generating instruction-answer pairs, verifying them, and mixing them deliberately.
So I would not try to generate one universal Persian SFT dataset. I would generate several smaller datasets, each targeting one capability.
For example:
| Capability | Dataset family |
|---|---|
| Persian grammar help | explanation, correction, quiz, role identification |
| school-grade math | verified problem-solution pairs |
| student support | simplify, explain, summarize, give examples |
| tool calling | tool schemas + Persian user requests + tool-call traces |
| knowledge QA | passage-grounded QA from source texts |
| multi-turn assistant | document-to-dialogue or scenario-to-dialogue |
| natural Persian style | small native/polished final SFT set |
1. Raw text is source material, not SFT
A cleaned Persian paragraph like this:
<clean Persian passage>
is good for CPT.
But SFT needs something like:
{"messages":[{"role":"user","content":"<instruction>"},{"role":"assistant","content":"<answer>"}]}
or:
{"prompt":"<instruction>","completion":"<answer>"}
That means you need a conversion step:
cleaned raw text
-> passage selection
-> task type selection
-> instruction generation
-> answer generation
-> verification
-> formatting
-> SFT dataset
This is the same general idea behind work such as Bonito, which converts unannotated text into task-specific instruction tuning datasets, and Raw Text is All You Need, which studies generating knowledge-intensive multi-turn dialogues from raw documents.
You do not need to copy those exact methods. The useful idea is:
First decide what task the raw text should become.
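The last step of the conversion above (formatting) can be sketched in a few lines. This is only a minimal sketch: the helper names `to_chat_record` and `to_jsonl` are made up for illustration, and the output follows the `messages` layout shown earlier.

```python
import json

def to_chat_record(instruction, answer, system=None):
    """Wrap one instruction-answer pair in the chat `messages` layout."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": instruction.strip()})
    messages.append({"role": "assistant", "content": answer.strip()})
    return {"messages": messages}

def to_jsonl(records):
    """Serialize records as JSONL; ensure_ascii=False keeps Persian readable."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

# Hypothetical pair, as it might come out of the generation step.
record = to_chat_record("فاعل چیست؟", "فاعل انجام‌دهندهٔ کار است.")
print(to_jsonl([record]))
```

Keeping `ensure_ascii=False` matters mostly for manual review: escaped `\uXXXX` sequences are valid JSON but very hard to read when you sample the data by eye.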
2. Use capability-specific datasets, not one giant mixed dataset
For your case, I would split SFT data by target capability.
Recommended split
| Capability | Should it come from raw text? | Best construction method |
|---|---|---|
| Persian grammar help | partly | grammar passages + hand-written seed + generated correction/explanation tasks |
| student support | yes | educational passages → explanations, summaries, examples, exercises |
| school math reasoning | not mainly | generated/translated problems + verified solutions |
| tool calling | no, mostly separate | tool schema + user intent + tool call + tool result + final answer |
| general helpful Persian | partly | seed examples + self-instruct-style generation + filtering |
| factual Persian QA | yes | passage-grounded QA from cleaned text |
| multi-turn dialogue | yes | document/scenario → multi-turn conversation |
| final natural style | mostly hand-written/polished | small native final SFT set |
The mistake to avoid is:
I have clean Persian text, so I will ask an LLM to make random Q&A pairs from all of it.
That usually creates a noisy, repetitive, shallow SFT dataset.
A better approach is:
I need 7 capability buckets.
For each bucket, I will design task templates.
Then I will generate and verify examples for that bucket.
Then I will mix the buckets intentionally.
3. Better training order
Your proposed order:
CPT -> SFT -> Hand-Written SFT
is basically good.
I would expand it like this:
| Stage | Purpose |
|---|---|
| CPT | Teach Persian distribution: grammar, syntax, style, general domain familiarity |
| private eval | Keep a small set of Persian tasks that never enters training |
| hand-written seed | Show the teacher LLM what “good Persian assistant data” looks like |
| synthetic SFT generation | Scale from raw text and task templates |
| filtering/verification | Remove wrong, unnatural, repetitive, or badly formatted examples |
| first SFT | Teach assistant behavior |
| final hand-written/polished SFT | Fix style, grammar tutor behavior, safety, and final answer tone |
The seed set and final polish set can be small.
They just need to be high quality.
4. A practical raw-text-to-SFT workflow
Here is a concrete workflow.
1. Split cleaned corpus into passages.
2. Classify passages by usefulness.
3. Assign each useful passage to one or more task families.
4. Generate instruction-answer pairs with a teacher LLM.
5. Verify automatically where possible.
6. Review samples manually.
7. Deduplicate.
8. Convert to TRL/HF chat format.
9. Train a small SFT.
10. Evaluate.
11. Add more data only where the model fails.
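Step 1 of this workflow could start as simply as the sketch below. It assumes paragraphs are separated by blank lines and measures size in characters; the function name `split_passages` and both thresholds are placeholders to tune on your own corpus.

```python
import re

def split_passages(text, min_chars=200, max_chars=1200):
    """Split on blank lines, then pack paragraphs into passages of bounded size.

    A single paragraph longer than max_chars is kept whole as its own passage.
    Passages shorter than min_chars are dropped.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    passages, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                passages.append(current)
            current = para
    if current:
        passages.append(current)
    return [p for p in passages if len(p) >= min_chars]
```

Packing whole paragraphs, rather than cutting at a fixed character count, keeps each passage self-contained, which makes grounded questions easier to generate and verify.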
Passage classification
Not every cleaned text is useful for SFT.
Classify each passage:
| Passage type | Use |
|---|---|
| grammar explanation | grammar tutor data |
| educational explanation | student support data |
| factual article | grounded QA / summarization |
| math problem | math reasoning if answer is verifiable |
| tool documentation | tool calling data |
| noisy opinion/comment | maybe conversational style, but risky |
| list/table/boilerplate | usually reject or handle separately |
| very short text | often reject |
| very long text | split or summarize first |
You can use simple tags:
{"id":"p_0001","text":"<passage>","tags":["grammar","education"],"quality":"high"}
{"id":"p_0002","text":"<passage>","tags":["factual","qa"],"quality":"medium"}
{"id":"p_0003","text":"<passage>","tags":["boilerplate"],"quality":"reject"}
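Before spending teacher-LLM calls on classification, the obvious rejects can be caught by rule. The sketch below is an assumption-heavy first pass: the thresholds, the `unreviewed` quality value, and the "mostly short lines means list or boilerplate" heuristic are all illustrative, and the topical tags (grammar, factual, and so on) would still come from a classifier or a person.

```python
def triage_passage(passage_id, text, min_chars=200, max_chars=4000):
    """Return a tagged record; only obvious rejects are decided by rule."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    tags, quality = [], "unreviewed"
    if len(text.strip()) < min_chars:
        tags, quality = ["very_short"], "reject"
    elif len(text) > max_chars:
        tags = ["very_long"]  # split or summarize before use
    elif lines and sum(len(ln) < 40 for ln in lines) / len(lines) > 0.8:
        # More than 80% short lines usually means a list, menu, or boilerplate.
        tags, quality = ["list_or_boilerplate"], "reject"
    return {"id": passage_id, "text": text, "tags": tags, "quality": quality}
```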
5. Use a teacher LLM with task templates
For each task family, write a generation prompt.
Do not simply say:
Make SFT data from this text.
That is too vague.
Instead, make the task explicit.
Passage-grounded QA
You are creating Persian SFT data for a small Iranian Persian assistant.
Source passage:
<passage>
Create 3 Persian user questions that can be answered only from the passage.
For each question, write a concise Persian answer.
Do not add facts that are not in the passage.
Use natural Iranian Persian.
Return JSONL with fields: id, task_type, messages, source_passage_id.
Student explanation
You are creating Persian SFT data for a student-support assistant.
Source passage:
<passage>
Create:
1. one student question,
2. one simple explanation,
3. one example,
4. one short follow-up question the student might ask,
5. one assistant follow-up answer.
The assistant should sound like a patient Persian teacher.
Avoid English unless the source requires it.
Return JSON.
Grammar tutor
You are creating Persian grammar tutor SFT data.
Grammar topic:
<topic>
Source passage:
<passage>
Create examples for:
- explaining the concept
- identifying the concept in a sentence
- correcting a student's wrong answer
- giving 3 new examples
- making a mini quiz
Use Iranian Persian.
Keep explanations short and clear.
Return JSONL.
Simplification
Convert the passage into a student-friendly explanation.
Source:
<passage>
Create:
- user instruction
- assistant answer
- difficulty level: elementary / middle_school / high_school
- key concepts
- possible student confusion
Use natural Persian.
This is much better than unrestricted generation.
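Templates like these can be filled with a small helper. The sketch below assumes placeholders are written as `<name>` in lowercase, as in the prompts above; `render_prompt` is a hypothetical name. It fails loudly when a placeholder is left unfilled, so a half-rendered prompt never reaches the teacher model.

```python
import re

PLACEHOLDER = re.compile(r"<([a-z_]+)>")

def render_prompt(template, **fields):
    """Fill <name> placeholders in one pass; raise if any has no value."""
    missing = set(PLACEHOLDER.findall(template)) - set(fields)
    if missing:
        raise ValueError(f"unfilled placeholders: {sorted(missing)}")
    # A single re.sub pass means inserted text is never re-scanned, so a
    # passage that happens to contain "<n>" is left exactly as it is.
    return PLACEHOLDER.sub(lambda m: str(fields[m.group(1)]), template)

prompt = render_prompt(
    "Source passage:\n<passage>\nCreate <n> Persian user questions.",
    passage="<clean Persian passage goes here>",
    n=3,
)
print(prompt)
```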
6. Grammar help in Farsi: build it as a separate dataset family
For things like:
- فاعل
- مفعول
- متمم
- مسند
- فعل
- قید
- صفت
- مضاف و مضاف‌الیه
- نقش‌های دستوری
- جمله ساده / مرکب
I would not rely only on raw text.
You need targeted examples.
Suggested grammar task families:
| Task type | Example |
|---|---|
| concept explanation | “فاعل چیست؟ با مثال توضیح بده.” |
| role identification | “در جمله زیر فاعل و مفعول را مشخص کن.” |
| error correction | “دانش‌آموز گفته X مفعول است. آیا درست است؟” |
| contrastive explanation | “فرق متمم و مفعول چیست؟” |
| example generation | “برای متمم سه مثال ساده بساز.” |
| mini quiz | “سه سوال کوتاه درباره فاعل و مفعول بساز.” |
| step-by-step analysis | “این جمله را از نظر نقش‌های دستوری تحلیل کن.” |
| student misconception | “چرا این پاسخ اشتباه است؟ ساده توضیح بده.” |
A useful schema:
{
  "id": "grammar_0001",
  "task_type": "grammar_role_identification",
  "topic": "فاعل",
  "messages": [
    {"role": "system", "content": "You are a helpful Persian grammar tutor."},
    {"role": "user", "content": "در جمله «علی کتاب را خواند»، فاعل را مشخص کن و کوتاه توضیح بده."},
    {"role": "assistant", "content": "در این جمله «علی» فاعل است، چون انجام‌دهندهٔ عمل خواندن است. «کتاب» مفعول است، چون عمل خواندن روی آن انجام شده است."}
  ],
  "source": "manual_seed",
  "language": "fa",
  "variety": "iranian_persian"
}
For grammar tutoring, I would make at least a small hand-written seed set.
Something like:
100-300 excellent grammar tutor examples
Then use those as few-shot examples for the teacher LLM.
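Turning the seed set into few-shot examples might look like the sketch below. It assumes each seed record carries `topic` and `messages` fields as in the schema above; the function name, the header text, and the choice of three examples per prompt are illustrative.

```python
import json
import random

def build_few_shot_block(seed_examples, topic, k=3, rng_seed=0):
    """Pick up to k seed examples for a topic and render them as prompt text."""
    pool = [ex for ex in seed_examples if ex.get("topic") == topic]
    rng = random.Random(rng_seed)  # fixed seed keeps the prompt reproducible
    chosen = rng.sample(pool, min(k, len(pool)))
    lines = [json.dumps({"messages": ex["messages"]}, ensure_ascii=False) for ex in chosen]
    return "Examples of the expected quality:\n" + "\n".join(lines)
```

Varying `rng_seed` between teacher calls rotates which seed examples the teacher sees, which tends to reduce near-copies of a single seed in the generated data.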
7. Student support data
For student support, raw educational text is very useful.
Convert passages into tasks like:
| Task | Input | Output |
|---|---|---|
| explain simply | passage | simple explanation |
| summarize | passage | short Persian summary |
| give examples | concept | 2-3 examples |
| make exercise | passage | student exercise |
| answer misconception | wrong student answer | correction |
| scaffold | hard question | step-by-step hint |
| quiz | lesson passage | short quiz |
| compare concepts | two concepts | explanation of difference |
Example generation prompt:
You are building SFT data for an Iranian Persian student assistant.
Source passage:
<passage>
Create 5 examples:
1. simple explanation
2. summary
3. example-based explanation
4. student misconception correction
5. mini quiz
Rules:
- Use natural Iranian Persian.
- Do not add facts outside the passage unless clearly common knowledge.
- Keep answers short enough for a small model.
- Do not create very long chain-of-thought.
- Return JSONL.
The “short enough for a small model” rule matters. A 0.8B model may not benefit from very long, complex answers early in training.
8. Math reasoning: do not rely only on raw Persian text
Math is different.
For math SFT, correctness matters more than naturalness.
I would not generate math examples only from raw Persian text. Instead:
- generate or translate school-grade math problems,
- solve them with a reliable solver or teacher model,
- verify the answer,
- write a Persian explanation,
- sample-review.
Relevant examples:
- GSM8K is a classic school-grade math word-problem dataset.
- OpenMathInstruct generated large math instruction data with code-interpreter-style solutions.
- MAmmoTH / MathInstruct is another important math instruction tuning example.
For your use case, I would make Persian school-grade math as a separate dataset family.
Example schema:
{
  "id": "math_0001",
  "task_type": "school_math_reasoning",
  "grade_level": "middle_school",
  "messages": [
    {"role": "system", "content": "You are a helpful Persian math tutor."},
    {"role": "user", "content": "اگر علی ۳ دفتر بخرد و قیمت هر دفتر ۴۰ هزار تومان باشد، او در مجموع چقدر پرداخت می‌کند؟"},
    {"role": "assistant", "content": "قیمت هر دفتر ۴۰ هزار تومان است و علی ۳ دفتر می‌خرد. پس باید ۳ را در ۴۰ هزار ضرب کنیم: ۳ × ۴۰٬۰۰۰ = ۱۲۰٬۰۰۰. بنابراین علی ۱۲۰ هزار تومان پرداخت می‌کند."}
  ],
  "answer": "120000",
  "verified": true
}
For math, add automatic checks when possible.
For arithmetic, you can verify with Python.
def check_answer(problem_id, predicted_answer, gold_answer):
    # Normalize both sides: drop thousands separators and surrounding whitespace.
    clean = lambda a: str(a).replace(",", "").strip()
    return clean(predicted_answer) == clean(gold_answer)
If you generate 10,000 math examples but 20% are wrong, that can damage the model.
Better:
2,000 verified math examples > 20,000 unverified math examples
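For arithmetic problems, verification can go one step further than comparing strings: recompute the answer. The sketch below assumes each math record also stores a plain `expression` field such as `"3*40000"` (a hypothetical addition to the schema above). It evaluates only `+ - * /` on numeric literals, so nothing from the teacher model is ever passed to `eval`, and it accepts answers written with Persian digits and separators.

```python
import ast
import operator

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def parse_number(text):
    """Parse an integer written with Latin or Persian digits and separators."""
    cleaned = str(text).strip()
    # Separators removed: ",", Arabic thousands separator, Arabic comma, space.
    for sep in (",", "\u066c", "\u060c", " "):
        cleaned = cleaned.replace(sep, "")
    return int(cleaned)  # int() accepts Unicode decimal digits

def safe_eval(expression):
    """Evaluate + - * / on numeric literals; reject anything else."""
    def walk(node):
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError(f"unsupported expression: {expression!r}")
    return walk(ast.parse(expression, mode="eval").body)

def is_verified(record):
    """True if the stored answer equals the recomputed expression value."""
    return parse_number(record["answer"]) == safe_eval(record["expression"])
```

Word problems with more structure need a real solver or a second independent solution, but even this narrow check removes a whole class of silent arithmetic errors.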
9. Tool calling should be a separate dataset
Tool calling does not naturally come from Persian raw text.
You need a tool-use dataset with:
- tool schema
- Persian user request
- assistant tool call
- tool output
- final Persian answer
The TRL SFTTrainer documentation covers tool-calling SFT and describes how to include tool schemas and tool calls in the dataset.
Tool-use work such as ToolLLM / ToolBench is useful conceptually because it treats tool use as its own data construction problem.
A simple tool example:
{
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "calculator",
        "description": "Perform basic arithmetic.",
        "parameters": {
          "type": "object",
          "properties": {
            "expression": {"type": "string"}
          },
          "required": ["expression"]
        }
      }
    }
  ],
  "messages": [
    {"role": "user", "content": "حاصل ۲۳ ضربدر ۴۷ چقدر است؟"},
    {
      "role": "assistant",
      "tool_calls": [
        {
          "type": "function",
          "function": {
            "name": "calculator",
            "arguments": "{\"expression\":\"23*47\"}"
          }
        }
      ]
    },
    {"role": "tool", "content": "1081"},
    {"role": "assistant", "content": "حاصل ۲۳ ضربدر ۴۷ برابر با ۱۰۸۱ است."}
  ]
}
You also need negative examples.
| Example type | Why |
|---|---|
| tool needed | model learns when to call |
| no tool needed | model does not call tools unnecessarily |
| ambiguous request | model asks clarification |
| unavailable tool | model refuses or explains limitation |
| invalid arguments | model learns valid JSON/schema |
| tool result summarization | model explains result in Persian |
For Persian tool calling, I would start small:
200-500 high-quality tool examples
Then expand only after testing.
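Every generated tool example should pass a structural check before it enters training. The sketch below is a deliberately small, dependency-free check against the tool layout shown above: it only looks at the tool name, whether `arguments` parses as a JSON object, and the `required` / `properties` keys. It is not a full JSON Schema validator (types are not checked), and `validate_tool_call` is a hypothetical helper name.

```python
import json

TOOLS = [{
    "type": "function",
    "function": {
        "name": "calculator",
        "description": "Perform basic arithmetic.",
        "parameters": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"],
        },
    },
}]

def validate_tool_call(call, tools):
    """Return a list of problems for one tool call; an empty list means OK."""
    schemas = {t["function"]["name"]: t["function"]["parameters"] for t in tools}
    fn = call.get("function", {})
    name = fn.get("name")
    if name not in schemas:
        return [f"unknown tool: {name}"]
    try:
        args = json.loads(fn.get("arguments", ""))
    except (json.JSONDecodeError, TypeError):
        return ["arguments are not valid JSON"]
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]
    schema = schemas[name]
    problems = [f"missing required argument: {k}"
                for k in schema.get("required", []) if k not in args]
    problems += [f"unexpected argument: {k}"
                 for k in args if k not in schema.get("properties", {})]
    return problems

good_call = {"type": "function",
             "function": {"name": "calculator",
                          "arguments": json.dumps({"expression": "23*47"})}}
print(validate_tool_call(good_call, TOOLS))
```

If you later need type and format checks, a proper JSON Schema validator is the natural replacement for the last few lines.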
10. A large LLM workflow for generating SFT data
If you can use a larger LLM as a teacher, use it not only as a generator but also as a critic/verifier.
A simple pipeline:
passage
-> generator LLM creates candidate SFT examples
-> verifier LLM checks faithfulness, Persian naturalness, format
-> rule-based filters check JSON/schema/length
-> sample manual review
-> accepted examples enter SFT dataset
Generator prompt
You are generating Persian SFT data for a small Iranian Persian assistant.
Target capability: <capability>
Source passage:
<passage>
Generate <n> examples.
Each example must include:
- task_type
- messages
- source_passage_id
- difficulty
- verification_notes
Rules:
- Use natural Iranian Persian.
- Keep answers concise.
- Do not invent facts outside the passage.
- Do not include unsafe content.
- Do not use English unless needed.
- Return valid JSONL only.
Verifier prompt
You are verifying Persian SFT examples.
Source passage:
<passage>
Candidate example:
<example>
Check:
1. Is the Persian natural?
2. Is the answer faithful to the passage?
3. Is the instruction clear?
4. Is the answer useful for the target capability?
5. Is the format valid?
6. Is there any hallucinated fact?
7. Should this be accepted, rejected, or edited?
Return:
{"decision":"accept|edit|reject","reason":"...","fixed_example":...}
Do not accept everything from the teacher LLM. The teacher LLM is a data generator, not a guarantee of quality.
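The verifier's reply then has to be routed by code, and the safe default is rejection. The sketch below assumes the verifier returns the JSON object described above; the bucket names and `route_verdict` are illustrative. Edited examples go to their own bucket rather than straight into the accepted set, so they can be sampled for manual review.

```python
import json

def route_verdict(candidate, verifier_output):
    """Return (bucket, example). Anything unparseable or unclear is rejected."""
    try:
        verdict = json.loads(verifier_output)
    except json.JSONDecodeError:
        return "rejected", candidate
    if not isinstance(verdict, dict):
        return "rejected", candidate
    decision = verdict.get("decision")
    if decision == "accept":
        return "accepted", candidate
    if decision == "edit" and isinstance(verdict.get("fixed_example"), dict):
        # Keep the verifier's rewrite, but in a separate bucket for spot checks.
        return "edited", verdict["fixed_example"]
    return "rejected", candidate
```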
11. You can implement this with simple scripts first
You do not need a complex framework at the beginning.
A simple Python pipeline is enough:
input passages
-> prompt templates
-> teacher LLM calls
-> JSON parser
-> filters
-> output JSONL
If the pipeline becomes large, tools like distilabel can help organize synthetic data generation, AI feedback, judging, filtering, and dataset export.
But I would start simple.
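As a rough illustration of how little code the first version needs, here is a sketch of that pipeline. The teacher call is injected as a plain function, so `fake_teacher` below is only a stand-in for whatever API client you use; all names here are hypothetical.

```python
import json

def parse_jsonl(raw):
    """Parse teacher output line by line, skipping lines that are not JSON objects."""
    records = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            records.append(obj)
    return records

def run_pipeline(passages, template, call_teacher, keep):
    """passages -> prompts -> teacher calls -> parsed records -> filtered records."""
    accepted = []
    for passage in passages:
        prompt = template.replace("<passage>", passage["text"])
        for record in parse_jsonl(call_teacher(prompt)):
            record["source_passage_id"] = passage["id"]  # keep provenance
            if keep(record):
                accepted.append(record)
    return accepted

def fake_teacher(prompt):
    # Stand-in for a real LLM call: one usable line and one broken line.
    good = {"task_type": "qa",
            "messages": [{"role": "user", "content": "q"},
                         {"role": "assistant", "content": "a"}]}
    return json.dumps(good) + "\nnot json"

out = run_pipeline([{"id": "p_0001", "text": "x"}], "Source:\n<passage>",
                   fake_teacher, keep=lambda r: "messages" in r)
print(len(out))
```

Passing the teacher and the filter in as functions keeps the loop testable without network access, and lets you swap models or filters per task family.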
A minimal folder structure:
data/
  raw_clean/
  passages/
  generated_candidates/
  rejected/
  accepted/
  eval_private/
prompts/
  grammar_generation.txt
  student_support_generation.txt
  math_generation.txt
  tool_calling_generation.txt
  verifier.txt
scripts/
  split_passages.py
  generate_candidates.py
  verify_candidates.py
  filter_jsonl.py
  dedup.py
  build_train_mix.py
12. Suggested SFT dataset sizes
There is no universal number.
The needed size depends on:
- model size
- base model quality
- CPT quality
- task complexity
- data diversity
- data correctness
- evaluation target
But for planning, I would use this rough scale:
| Dataset part | Suggested starting size |
|---|---|
| private eval | 100-500 examples |
| hand-written seed | 200-1,000 examples |
| first SFT experiment | 1,000-5,000 examples |
| grammar tutor data | 1,000-5,000 examples |
| student support data | 2,000-10,000 examples |
| math reasoning data | 2,000-20,000 verified examples |
| tool calling data | 200-2,000 high-quality examples |
| broad useful SFT | 5,000-30,000 examples |
| large synthetic SFT | 50,000-100,000+ examples, only after filtering is mature |
Some relevant scale references:
- LIMA: around 1,000 carefully curated examples showed that small, high-quality data can matter.
- AlpaGasus: selected around 9k high-quality examples from Alpaca-style data.
- SmolTalk: 1M synthetic SFT samples used for SmolLM2-Instruct.
- SmolLM2: small-model training can still rely heavily on careful data design and staged training.
But I would not start with 1M examples.
For your project, a more realistic first target might be:
CPT first
private eval: 200-300
hand-written seed: 300-800
first generated SFT: 3k-10k
final polished SFT: 500-2k
Then expand based on evaluation.
13. Example final mixture for your case
A possible first real SFT mixture:
| Component | Examples | Notes |
|---|---|---|
| Persian grammar tutor | 2,000 | include فاعل، مفعول، متمم, correction, quiz |
| student support | 3,000 | explanations, examples, simplification |
| passage-grounded QA | 3,000 | from cleaned Persian educational/factual text |
| summarization/simplification | 1,500 | useful for student assistant |
| school math | 2,000 | verified answers only |
| tool calling | 500 | high quality, schema-valid |
| safety/refusal | 500 | simple Persian safety cases |
| final hand-written polish | 500 | naturalness and style |
Total:
~13k examples
That is already enough for a serious first SFT experiment.
If it works, expand the weak categories.
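Building that mixture is mostly bookkeeping. The sketch below copies the quotas from the table above and samples each bucket up to its quota; the bucket keys and the `build_mix` helper are hypothetical names, and each sampled example is labeled with its bucket so failures can be traced later.

```python
import random

# Quotas copied from the table above (about 13k examples in total).
QUOTAS = {
    "persian_grammar_tutor": 2000,
    "student_support": 3000,
    "passage_grounded_qa": 3000,
    "summarization_simplification": 1500,
    "school_math": 2000,
    "tool_calling": 500,
    "safety_refusal": 500,
    "final_handwritten_polish": 500,
}

def build_mix(buckets, quotas, seed=0):
    """Sample up to quotas[name] examples per bucket, label them, and shuffle."""
    rng = random.Random(seed)
    mix = []
    for name, quota in quotas.items():
        pool = buckets.get(name, [])
        sampled = rng.sample(pool, min(quota, len(pool)))
        mix.extend({**ex, "task_family": name} for ex in sampled)
    rng.shuffle(mix)
    return mix

print(sum(QUOTAS.values()))
```

When a bucket holds fewer examples than its quota, this version simply takes what exists rather than repeating examples; whether to upsample small buckets is a separate decision worth testing against the private eval.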
14. How to mix generated and hand-written data
I would not put hand-written data only at the end.
Use it in three places:
| Hand-written data role | Purpose |
|---|---|
| seed examples | show the teacher LLM what good data looks like |
| private eval | measure model improvement |
| final polish | fix the model’s final behavior/style |
Workflow:
write 300 excellent examples
-> use 100 as private eval
-> use 100 as few-shot generation examples
-> keep 100 for final polish / style correction
Do not train on the private eval examples.
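The split and the "never train on eval" rule are both easy to enforce in code. The sketch below assumes each hand-written example has an `id` and a `messages` list; the function names are illustrative, and the leak check only catches prompts that match after whitespace normalization, not paraphrases.

```python
import random

def split_handwritten(examples, seed=0):
    """Shuffle once with a fixed seed, then cut into three disjoint parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    third = len(shuffled) // 3
    return {
        "private_eval": shuffled[:third],
        "few_shot": shuffled[third:2 * third],
        "final_polish": shuffled[2 * third:],
    }

def first_user_text(example):
    """First user message with whitespace collapsed, or '' if there is none."""
    for message in example["messages"]:
        if message["role"] == "user":
            return " ".join(message["content"].split())
    return ""

def find_leaks(train_examples, eval_examples):
    """Return ids of training examples whose user prompt also appears in eval."""
    eval_prompts = {first_user_text(ex) for ex in eval_examples}
    eval_prompts.discard("")
    return [ex["id"] for ex in train_examples if first_user_text(ex) in eval_prompts]
```

Running `find_leaks` on the generated data as well is worthwhile, because a teacher model that saw seed examples can reproduce prompts that also sit in the eval split.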
15. Quality filters for generated SFT
For each generated example, check:
| Check | Method |
|---|---|
| valid JSON | parser |
| correct language | language/script ratio |
| not too long | token length filter |
| not too short | length filter |
| no duplicate | exact/near dedup |
| source faithfulness | verifier LLM or rules |
| natural Persian | sample review / classifier |
| no benchmark leakage | decontamination |
| correct math answer | calculator / solver |
| valid tool schema | JSON schema validation |
| useful task | category-specific review |
| safe answer | safety filter / manual review |
The most important filter depends on the task.
| Task | Most important filter |
|---|---|
| grammar tutor | correctness and natural explanation |
| math | answer verification |
| tool calling | schema validity |
| passage QA | source faithfulness |
| student support | helpfulness and natural Persian |
| safety | refusal quality |
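Three of the cheaper checks (script ratio, length, exact duplicates) fit in a short script. This is a sketch under stated assumptions: length is counted in characters as a stand-in for a token-length filter, the Persian check is just the share of characters in the Arabic-script Unicode block, and every threshold is a placeholder to calibrate on reviewed samples.

```python
import hashlib
import re

# Arabic-script block plus the zero-width non-joiner used in Persian spelling.
PERSIAN_CHARS = re.compile(r"[\u0600-\u06FF\u200c]")

def persian_ratio(text):
    """Share of non-space characters that are Arabic-script characters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(bool(PERSIAN_CHARS.match(c)) for c in chars) / len(chars)

def answer_text(record):
    """Content of the last assistant message, or '' if there is none."""
    for message in reversed(record["messages"]):
        if message["role"] == "assistant":
            return message.get("content") or ""
    return ""

def passes_filters(record, min_chars=10, max_chars=2000, min_ratio=0.6):
    """Length and script-ratio checks on the final assistant answer."""
    answer = answer_text(record)
    return min_chars <= len(answer) <= max_chars and persian_ratio(answer) >= min_ratio

def dedup_exact(records):
    """Drop records whose whitespace-normalized message contents were seen before."""
    seen, unique = set(), []
    for record in records:
        text = " ".join(" ".join((m.get("content") or "").split())
                        for m in record["messages"])
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

A script-ratio check would wrongly reject math or tool-calling answers that are mostly digits or JSON, so thresholds like `min_ratio` should be set per task family rather than globally.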
16. Example schema for mixed SFT
I would keep metadata. It will help you debug later.
{
  "id": "sft_grammar_000001",
  "task_family": "persian_grammar_tutor",
  "source_type": "manual_seed",
  "source_passage_id": null,
  "difficulty": "middle_school",
  "language": "fa",
  "variety": "iranian_persian",
  "messages": [
    {"role": "system", "content": "You are a helpful Persian tutor."},
    {"role": "user", "content": "در جمله «مریم به مدرسه رفت»، متمم را مشخص کن."},
    {"role": "assistant", "content": "در این جمله «به مدرسه» متمم است، چون با حرف اضافهٔ «به» آمده و معنی فعل «رفت» را کامل می‌کند."}
  ],
  "verified": true,
  "notes": "manual seed"
}
For generated passage-grounded data:
{
  "id": "sft_qa_000001",
  "task_family": "passage_grounded_qa",
  "source_type": "generated_from_clean_text",
  "source_passage_id": "passage_12345",
  "language": "fa",
  "variety": "iranian_persian",
  "messages": [
    {"role": "system", "content": "You are a helpful Persian assistant. Answer only from the provided passage."},
    {"role": "user", "content": "<question based on passage>"},
    {"role": "assistant", "content": "<answer grounded in passage>"}
  ],
  "verified": true,
  "verifier": "teacher_llm_plus_manual_sample"
}
For tool calling, keep it separate because the schema is different.
17. What not to do
I would avoid:
| Bad approach | Why |
|---|---|
| Generate random Q&A from all raw text | Usually shallow and repetitive |
| Mix tool calling, grammar, math, and chat without labels | Hard to debug |
| Trust teacher LLM outputs without verification | Wrong answers enter training |
| Generate huge data before first eval | You will not know what helped |
| Use only passage-grounded QA | The model may not learn teacher behavior |
| Use only final hand-written data | Too small and slow |
| Use only synthetic data | Persian naturalness may be weak |
| Ignore math verification | Bad math data is harmful |
| Treat tool calling like normal chat | Tool use needs schema-valid traces |
| Train on private eval examples | Contamination |
18. Practical recommendation
For your exact case, I would do this:
1. Finish CPT on clean Persian text.
2. Build 200-300 private eval examples.
3. Write 300-800 high-quality seed SFT examples manually.
4. Split cleaned text into passages.
5. Classify passages into task families.
6. Generate SFT candidates with a larger teacher LLM.
7. Verify generated examples by task type.
8. Build a first 5k-15k SFT mixture.
9. Fine-tune.
10. Evaluate.
11. Add data only for the categories that fail.
12. Finish with 500-2k hand-written/polished examples.
If I had to choose the first SFT categories, I would start with:
Persian grammar tutor
student support / explanation
passage-grounded QA
math with verified answers
small tool-calling set
This is more realistic than trying to build every possible assistant capability immediately.
Bottom line
Your plan is good, but I would refine it like this:
CPT
-> private eval + hand-written seed
-> capability-specific synthetic SFT generation
-> verification/filtering
-> SFT
-> small final hand-written polish
The key point is:
Do not generate one universal SFT dataset from raw text. Generate several small datasets for separate capabilities, verify each one differently, and then mix them intentionally.
For your target model, I would rather have:
10k carefully categorized and verified Persian SFT examples
than:
100k random Persian Q&A pairs generated from raw text