How can I build a high-quality dataset?

Hugging Face Forums [Unofficial] June 8, 2026

Since this is already quite concrete, I looked into it directly:


Short answer

Yes, your proposed high-level order is reasonable:

CPT -> SFT -> small final hand-written / polished SFT

But I would slightly modify it:

CPT
  -> small hand-written seed + private eval
  -> synthetic / generated SFT from cleaned raw text
  -> filtering + verification
  -> SFT
  -> small final hand-written / polished SFT

The important change is that hand-written data should not appear only at the end. It should be used throughout the process, as:

  • examples for the large teacher model
  • style guide
  • private evaluation set
  • final correction / polish set

The main idea is:

Cleaned raw text is not an SFT dataset. It is source material. You turn it into SFT by designing task families, generating instruction-answer pairs, verifying them, and mixing them deliberately.

So I would not try to generate one universal Persian SFT dataset. I would generate several smaller datasets, each targeting one capability.

For example:

| Capability | Dataset family |
| --- | --- |
| Persian grammar help | explanation, correction, quiz, role identification |
| school-grade math | verified problem-solution pairs |
| student support | simplify, explain, summarize, give examples |
| tool calling | tool schemas + Persian user requests + tool-call traces |
| knowledge QA | passage-grounded QA from source texts |
| multi-turn assistant | document-to-dialogue or scenario-to-dialogue |
| natural Persian style | small native/polished final SFT set |

1. Raw text is source material, not SFT

A cleaned Persian paragraph like this:

<clean Persian passage>

is good for CPT.

But SFT needs something like:

{"messages":[{"role":"user","content":"<instruction>"},{"role":"assistant","content":"<answer>"}]}

or:

{"prompt":"<instruction>","completion":"<answer>"}

That means you need a conversion step:

cleaned raw text
  -> passage selection
  -> task type selection
  -> instruction generation
  -> answer generation
  -> verification
  -> formatting
  -> SFT dataset
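
The final formatting step can be sketched in a few lines of Python. This is only an illustration of the two record shapes shown above; the helper names (`to_messages_record`, `to_prompt_completion_record`, `to_jsonl_line`) are hypothetical and not part of any library.

```python
import json
from typing import Optional


def to_messages_record(instruction: str, answer: str, system: Optional[str] = None) -> dict:
    """Build a conversational record: {"messages": [...]}."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": instruction})
    messages.append({"role": "assistant", "content": answer})
    return {"messages": messages}


def to_prompt_completion_record(instruction: str, answer: str) -> dict:
    """Build a prompt/completion record."""
    return {"prompt": instruction, "completion": answer}


def to_jsonl_line(record: dict) -> str:
    # ensure_ascii=False keeps Persian text readable instead of \uXXXX escapes.
    return json.dumps(record, ensure_ascii=False)


if __name__ == "__main__":
    record = to_messages_record("فاعل چیست؟", "فاعل انجام‌دهندهٔ کار است.")
    print(to_jsonl_line(record))
```

Writing one JSON object per line keeps the file easy to stream, filter, and diff later.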

This is the same general idea behind work such as Bonito, which converts unannotated text into task-specific instruction tuning datasets, and Raw Text is All You Need, which studies generating knowledge-intensive multi-turn dialogues from raw documents.

You do not need to copy those exact methods. The useful idea is:

First decide what task the raw text should become.


2. Use capability-specific datasets, not one giant mixed dataset

For your case, I would split SFT data by target capability.

Recommended split

| Capability | Should it come from raw text? | Best construction method |
| --- | --- | --- |
| Persian grammar help | partly | grammar passages + hand-written seed + generated correction/explanation tasks |
| student support | yes | educational passages → explanations, summaries, examples, exercises |
| school math reasoning | not mainly | generated/translated problems + verified solutions |
| tool calling | no, mostly separate | tool schema + user intent + tool call + tool result + final answer |
| general helpful Persian | partly | seed examples + self-instruct-style generation + filtering |
| factual Persian QA | yes | passage-grounded QA from cleaned text |
| multi-turn dialogue | yes | document/scenario → multi-turn conversation |
| final natural style | mostly hand-written/polished | small native final SFT set |

The mistake to avoid is:

I have clean Persian text, so I will ask an LLM to make random Q&A pairs from all of it.

That usually creates a noisy, repetitive, shallow SFT dataset.

A better approach is:

I need 7 capability buckets.
For each bucket, I will design task templates.
Then I will generate and verify examples for that bucket.
Then I will mix the buckets intentionally.

3. Better training order

Your proposed order:

CPT -> SFT -> Hand-Written SFT

is basically good.

I would expand it like this:

| Stage | Purpose |
| --- | --- |
| CPT | Teach Persian distribution: grammar, syntax, style, general domain familiarity |
| private eval | Keep a small set of Persian tasks that never enters training |
| hand-written seed | Show the teacher LLM what “good Persian assistant data” looks like |
| synthetic SFT generation | Scale from raw text and task templates |
| filtering/verification | Remove wrong, unnatural, repetitive, or badly formatted examples |
| first SFT | Teach assistant behavior |
| final hand-written/polished SFT | Fix style, grammar tutor behavior, safety, and final answer tone |

The seed set and final polish set can be small.

They just need to be high quality.


4. A practical raw-text-to-SFT workflow

Here is a concrete workflow.

1. Split cleaned corpus into passages.
2. Classify passages by usefulness.
3. Assign each useful passage to one or more task families.
4. Generate instruction-answer pairs with a teacher LLM.
5. Verify automatically where possible.
6. Review samples manually.
7. Deduplicate.
8. Convert to TRL/HF chat format.
9. Train a small SFT.
10. Evaluate.
11. Add more data only where the model fails.
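
Step 1 of the list above could start as something like the following. It is a rough sketch that assumes the cleaned corpus separates paragraphs with blank lines; the function name `split_passages` and the character thresholds are placeholders to tune, not recommendations.

```python
import re


def split_passages(text: str, min_chars: int = 200, max_chars: int = 1200) -> list:
    """Greedily pack blank-line-separated paragraphs into passages.

    A single paragraph longer than max_chars is kept whole as one oversized
    passage; splitting inside paragraphs is left to a later step.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    passages, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                passages.append(current)
            current = para
    if current:
        passages.append(current)
    # Drop passages that are too short to carry a useful task.
    return [p for p in passages if len(p) >= min_chars]
```

Character counts are used here only because they need no tokenizer; a token-based limit would be the more precise choice once the target model's tokenizer is fixed.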

Passage classification

Not every cleaned text is useful for SFT.

Classify each passage:

| Passage type | Use |
| --- | --- |
| grammar explanation | grammar tutor data |
| educational explanation | student support data |
| factual article | grounded QA / summarization |
| math problem | math reasoning if answer is verifiable |
| tool documentation | tool calling data |
| noisy opinion/comment | maybe conversational style, but risky |
| list/table/boilerplate | usually reject or handle separately |
| very short text | often reject |
| very long text | split or summarize first |

You can use simple tags:

{"id":"p_0001","text":"<passage>","tags":["grammar","education"],"quality":"high"}
{"id":"p_0002","text":"<passage>","tags":["factual","qa"],"quality":"medium"}
{"id":"p_0003","text":"<passage>","tags":["boilerplate"],"quality":"reject"}
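
Some of the rejections can be made with cheap heuristics before any LLM or human looks at a passage. The sketch below is one possible pre-filter; the function name `triage_passage`, the tag names, and every threshold are assumptions for illustration. Topical tags such as "grammar" or "factual" would still come from a classifier or manual labeling.

```python
def triage_passage(passage_id: str, text: str,
                   min_chars: int = 200, max_chars: int = 4000) -> dict:
    """Attach structural tags and a coarse quality label to one passage."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    short_lines = sum(1 for ln in lines if len(ln.strip()) < 40)
    tags, quality = [], "unreviewed"
    if len(text.strip()) < min_chars:
        tags, quality = ["too_short"], "reject"
    elif len(lines) >= 5 and short_lines / len(lines) > 0.8:
        # Many very short lines usually means a menu, list, or table residue.
        tags, quality = ["list_or_boilerplate"], "reject"
    elif len(text) > max_chars:
        tags = ["needs_split"]
    return {"id": passage_id, "text": text, "tags": tags, "quality": quality}
```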

5. Use a teacher LLM with task templates

For each task family, write a generation prompt.

Do not simply say:

Make SFT data from this text.

That is too vague.

Instead, make the task explicit.

Passage-grounded QA

You are creating Persian SFT data for a small Iranian Persian assistant.

Source passage:
<passage>

Create 3 Persian user questions that can be answered only from the passage.
For each question, write a concise Persian answer.
Do not add facts that are not in the passage.
Use natural Iranian Persian.
Return JSONL with fields: id, task_type, messages, source_passage_id.

Student explanation

You are creating Persian SFT data for a student-support assistant.

Source passage:
<passage>

Create:
1. one student question,
2. one simple explanation,
3. one example,
4. one short follow-up question the student might ask,
5. one assistant follow-up answer.

The assistant should sound like a patient Persian teacher.
Avoid English unless the source requires it.
Return JSON.

Grammar tutor

You are creating Persian grammar tutor SFT data.

Grammar topic:
<topic>

Source passage:
<passage>

Create examples for:
- explaining the concept
- identifying the concept in a sentence
- correcting a student's wrong answer
- giving 3 new examples
- making a mini quiz

Use Iranian Persian.
Keep explanations short and clear.
Return JSONL.

Simplification

Convert the passage into a student-friendly explanation.

Source:
<passage>

Create:
- user instruction
- assistant answer
- difficulty level: elementary / middle_school / high_school
- key concepts
- possible student confusion

Use natural Persian.

This is much better than unrestricted generation.
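
One lightweight way to keep such templates organized is a dictionary keyed by task family. The sketch below uses Python's standard `string.Template`; the dictionary name, the family keys, and the `render_prompt` helper are hypothetical, and the prompt texts are shortened versions of the ones above.

```python
from string import Template

# $-placeholders are used instead of str.format so that literal JSON braces
# inside a prompt never need escaping.
PROMPT_TEMPLATES = {
    "passage_qa": Template(
        "You are creating Persian SFT data for a small Iranian Persian assistant.\n\n"
        "Source passage:\n$passage\n\n"
        "Create $n Persian user questions that can be answered only from the passage.\n"
        "Do not add facts that are not in the passage.\n"
        "Return JSONL with fields: id, task_type, messages, source_passage_id.\n"
    ),
    "grammar_tutor": Template(
        "You are creating Persian grammar tutor SFT data.\n\n"
        "Grammar topic:\n$topic\n\n"
        "Source passage:\n$passage\n\n"
        "Use Iranian Persian. Keep explanations short and clear. Return JSONL.\n"
    ),
}


def render_prompt(task_family: str, **fields) -> str:
    """Fill one template; raises KeyError for an unknown family or a missing field."""
    return PROMPT_TEMPLATES[task_family].substitute(**fields)


if __name__ == "__main__":
    print(render_prompt("passage_qa", passage="<passage>", n=3))
```

Failing loudly on a missing field is deliberate: a prompt sent with an unfilled placeholder tends to produce data that looks valid but is useless.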


6. Grammar help in Farsi: build it as a separate dataset family

For things like:

  • فاعل
  • مفعول
  • متمم
  • مسند
  • فعل
  • قید
  • صفت
  • مضاف و مضاف‌الیه
  • نقش‌های دستوری
  • جمله ساده / مرکب

I would not rely only on raw text.

You need targeted examples.

Suggested grammar task families:

| Task type | Example |
| --- | --- |
| concept explanation | “فاعل چیست؟ با مثال توضیح بده.” |
| role identification | “در جمله زیر فاعل و مفعول را مشخص کن.” |
| error correction | “دانش‌آموز گفته X مفعول است. آیا درست است؟” |
| contrastive explanation | “فرق متمم و مفعول چیست؟” |
| example generation | “برای متمم سه مثال ساده بساز.” |
| mini quiz | “سه سوال کوتاه درباره فاعل و مفعول بساز.” |
| step-by-step analysis | “این جمله را از نظر نقش‌های دستوری تحلیل کن.” |
| student misconception | “چرا این پاسخ اشتباه است؟ ساده توضیح بده.” |

A useful schema:

{
  "id": "grammar_0001",
  "task_type": "grammar_role_identification",
  "topic": "فاعل",
  "messages": [
    {"role": "system", "content": "You are a helpful Persian grammar tutor."},
    {"role": "user", "content": "در جمله «علی کتاب را خواند»، فاعل را مشخص کن و کوتاه توضیح بده."},
    {"role": "assistant", "content": "در این جمله «علی» فاعل است، چون انجام‌دهندهٔ عمل خواندن است. «کتاب» مفعول است، چون عمل خواندن روی آن انجام شده است."}
  ],
  "source": "manual_seed",
  "language": "fa",
  "variety": "iranian_persian"
}

For grammar tutoring, I would make at least a small hand-written seed set.

Something like:

100-300 excellent grammar tutor examples

Then use those as few-shot examples for the teacher LLM.


7. Student support data

For student support, raw educational text is very useful.

Convert passages into tasks like:

| Task | Input | Output |
| --- | --- | --- |
| explain simply | passage | simple explanation |
| summarize | passage | short Persian summary |
| give examples | concept | 2-3 examples |
| make exercise | passage | student exercise |
| answer misconception | wrong student answer | correction |
| scaffold | hard question | step-by-step hint |
| quiz | lesson passage | short quiz |
| compare concepts | two concepts | explanation of difference |

Example generation prompt:

You are building SFT data for an Iranian Persian student assistant.

Source passage:
<passage>

Create 5 examples:
1. simple explanation
2. summary
3. example-based explanation
4. student misconception correction
5. mini quiz

Rules:
- Use natural Iranian Persian.
- Do not add facts outside the passage unless clearly common knowledge.
- Keep answers short enough for a small model.
- Do not create very long chain-of-thought.
- Return JSONL.

The “short enough for a small model” rule matters. A model of around 0.8B parameters may not benefit from very long, complex answers early in training.


8. Math reasoning: do not rely only on raw Persian text

Math is different.

For math SFT, correctness matters more than naturalness.

I would not generate math examples only from raw Persian text. Instead:

  1. generate or translate school-grade math problems,
  2. solve them with a reliable solver or teacher model,
  3. verify the answer,
  4. write a Persian explanation,
  5. sample-review.

Relevant examples:

  • GSM8K is a classic school-grade math word-problem dataset.
  • OpenMathInstruct generated large math instruction data with code-interpreter-style solutions.
  • MAmmoTH / MathInstruct is another important math instruction tuning example.

For your use case, I would make Persian school-grade math a separate dataset family.

Example schema:

{
  "id": "math_0001",
  "task_type": "school_math_reasoning",
  "grade_level": "middle_school",
  "messages": [
    {"role": "system", "content": "You are a helpful Persian math tutor."},
    {"role": "user", "content": "اگر علی ۳ دفتر بخرد و قیمت هر دفتر ۴۰ هزار تومان باشد، او در مجموع چقدر پرداخت می‌کند؟"},
    {"role": "assistant", "content": "قیمت هر دفتر ۴۰ هزار تومان است و علی ۳ دفتر می‌خرد. پس باید ۳ را در ۴۰ هزار ضرب کنیم: ۳ × ۴۰٬۰۰۰ = ۱۲۰٬۰۰۰. بنابراین علی ۱۲۰ هزار تومان پرداخت می‌کند."}
  ],
  "answer": "120000",
  "verified": true
}

For math, add automatic checks when possible.

For arithmetic, you can verify with Python.

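def check_answer(problem_id, predicted_answer, gold_answer):
    # Normalize both answers the same way (drop thousands separators, trim spaces) before comparing.
    normalize = lambda value: str(value).replace(",", "").replace("٬", "").strip()
    return normalize(predicted_answer) == normalize(gold_answer)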

If you generate 10,000 math examples but 20% are wrong, that can damage the model.

Better:

2,000 verified math examples > 20,000 unverified math examples
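
For arithmetic problems, verification can go one step further and recompute the result instead of only comparing strings. The sketch below assumes each generated example also stores the arithmetic expression behind its answer, which is a hypothetical extra field not present in the schema above. It evaluates that expression with a small whitelist via the standard `ast` module, after mapping Persian digits to ASCII.

```python
import ast
import operator

# Map Persian (U+06F0..) and Arabic-Indic (U+0660..) digits to ASCII digits.
_TRANSLATION = {0x06F0 + i: str(i) for i in range(10)}
_TRANSLATION.update({0x0660 + i: str(i) for i in range(10)})
# Drop thousands separators (ASCII comma, U+066C) and map the × sign to *.
_TRANSLATION.update({ord(","): None, 0x066C: None, 0x00D7: "*"})

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def normalize_number(text) -> str:
    return str(text).translate(_TRANSLATION).strip()


def safe_eval(expression: str) -> float:
    """Evaluate + - * / on numeric literals only; anything else raises ValueError."""
    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError("unsupported expression")
    return _eval(ast.parse(normalize_number(expression), mode="eval").body)


def verify_example(expression: str, stated_answer, tol: float = 1e-9) -> bool:
    """Return True only if the recomputed value matches the stated answer."""
    try:
        return abs(safe_eval(expression) - float(normalize_number(stated_answer))) <= tol
    except (ValueError, SyntaxError, ZeroDivisionError):
        return False  # fail closed: unverifiable examples are not accepted
```

Word problems that do not reduce to a single expression need a different check, for example comparing two independent solutions, so this covers only the simplest slice of the math data.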

9. Tool calling should be a separate dataset

Tool calling does not naturally come from Persian raw text.

You need a tool-use dataset with:

  • tool schema
  • Persian user request
  • assistant tool call
  • tool output
  • final Persian answer

The TRL SFTTrainer documentation covers tool-calling SFT and describes how to include tool schemas and tool calls in the dataset.

Tool-use work such as ToolLLM / ToolBench is useful conceptually because it treats tool use as its own data construction problem.

A simple tool example:

{
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "calculator",
        "description": "Perform basic arithmetic.",
        "parameters": {
          "type": "object",
          "properties": {
            "expression": {"type": "string"}
          },
          "required": ["expression"]
        }
      }
    }
  ],
  "messages": [
    {"role": "user", "content": "حاصل ۲۳ ضربدر ۴۷ چقدر است؟"},
    {
      "role": "assistant",
      "tool_calls": [
        {
          "type": "function",
          "function": {
            "name": "calculator",
            "arguments": "{\"expression\":\"23*47\"}"
          }
        }
      ]
    },
    {"role": "tool", "content": "1081"},
    {"role": "assistant", "content": "حاصل ۲۳ ضربدر ۴۷ برابر با ۱۰۸۱ است."}
  ]
}
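
Records shaped like the example above can be checked mechanically before training. The following sketch is a minimal hand-rolled validator, not a full JSON Schema implementation; the function name `validate_tool_calls` is hypothetical. It accepts `arguments` either as a JSON string, as in the example, or as an already-parsed object.

```python
import json


def validate_tool_calls(record: dict) -> list:
    """Return a list of problems found in one tool-calling record (empty = OK)."""
    errors = []
    schemas = {t["function"]["name"]: t["function"] for t in record.get("tools", [])}
    for message in record.get("messages", []):
        for call in message.get("tool_calls", []):
            fn = call.get("function", {})
            name = fn.get("name")
            if name not in schemas:
                errors.append(f"unknown tool: {name}")
                continue
            raw = fn.get("arguments")
            try:
                args = raw if isinstance(raw, dict) else json.loads(raw)
            except (json.JSONDecodeError, TypeError):
                errors.append(f"{name}: arguments are not valid JSON")
                continue
            if not isinstance(args, dict):
                errors.append(f"{name}: arguments must be a JSON object")
                continue
            params = schemas[name].get("parameters", {})
            for key in params.get("required", []):
                if key not in args:
                    errors.append(f"{name}: missing required argument '{key}'")
            for key in args:
                if key not in params.get("properties", {}):
                    errors.append(f"{name}: unexpected argument '{key}'")
    return errors
```

A dedicated JSON Schema validator would also check argument types; this version only checks names, which is often enough to catch the most common generation mistakes.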

You also need negative and edge-case examples, not only successful tool calls.

| Example type | Why |
| --- | --- |
| tool needed | model learns when to call |
| no tool needed | model does not call tools unnecessarily |
| ambiguous request | model asks for clarification |
| unavailable tool | model refuses or explains limitation |
| invalid arguments | model learns valid JSON/schema |
| tool result summarization | model explains result in Persian |

For Persian tool calling, I would start small:

200-500 high-quality tool examples

Then expand only after testing.


10. A large LLM workflow for generating SFT data

If you can use a larger LLM as a teacher, use it not only as a generator but also as a critic/verifier.

A simple pipeline:

passage
  -> generator LLM creates candidate SFT examples
  -> verifier LLM checks faithfulness, Persian naturalness, format
  -> rule-based filters check JSON/schema/length
  -> sample manual review
  -> accepted examples enter SFT dataset

Generator prompt

You are generating Persian SFT data for a small Iranian Persian assistant.

Target capability: <capability>
Source passage:
<passage>

Generate <n> examples.
Each example must include:
- task_type
- messages
- source_passage_id
- difficulty
- verification_notes

Rules:
- Use natural Iranian Persian.
- Keep answers concise.
- Do not invent facts outside the passage.
- Do not include unsafe content.
- Do not use English unless needed.
- Return valid JSONL only.

Verifier prompt

You are verifying Persian SFT examples.

Source passage:
<passage>

Candidate example:
<example>

Check:
1. Is the Persian natural?
2. Is the answer faithful to the passage?
3. Is the instruction clear?
4. Is the answer useful for the target capability?
5. Is the format valid?
6. Is there any hallucinated fact?
7. Should this be accepted, rejected, or edited?

Return:
{"decision":"accept|edit|reject","reason":"...","fixed_example":...}

Do not accept everything from the teacher LLM. The teacher LLM is a data generator, not a guarantee of quality.
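
The verifier's reply still has to be parsed and acted on. A small routing function, sketched below under the assumption that the verifier returns the JSON object described in the prompt above, keeps this logic in one place. The name `route_verdict` is hypothetical.

```python
import json

def route_verdict(candidate: dict, verifier_output: str):
    """Map a verifier reply to (decision, example_or_None).

    Anything that cannot be parsed or understood is treated as a rejection,
    so a malformed verifier reply never lets an example through.
    """
    try:
        verdict = json.loads(verifier_output)
    except json.JSONDecodeError:
        return "reject", None
    if not isinstance(verdict, dict):
        return "reject", None
    decision = verdict.get("decision")
    if decision == "accept":
        return "accept", candidate
    if decision == "edit" and isinstance(verdict.get("fixed_example"), dict):
        # The edited version replaces the candidate.
        return "edit", verdict["fixed_example"]
    return "reject", None
```

Edited examples are new text produced by an LLM, so it may be worth sending them through the rule-based filters and sample review again rather than trusting them directly.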


11. You can implement this with simple scripts first

You do not need a complex framework at the beginning.

A simple Python pipeline is enough:

input passages
  -> prompt templates
  -> teacher LLM calls
  -> JSON parser
  -> filters
  -> output JSONL
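
As a rough sketch, that pipeline fits in one short script. The teacher call is passed in as a plain function (`call_teacher`, a placeholder for whatever API or local model is used), so nothing here depends on a specific provider. All function names are hypothetical.

```python
import json
from typing import Callable, Iterable, Iterator


def parse_jsonl(raw: str) -> Iterator[dict]:
    """Yield JSON objects from teacher output, skipping lines that do not parse."""
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            yield obj


def has_valid_messages(example: dict) -> bool:
    """Basic shape check for plain chat examples (not tool-calling traces)."""
    messages = example.get("messages")
    if not isinstance(messages, list) or len(messages) < 2:
        return False
    for m in messages:
        if not isinstance(m, dict):
            return False
        content = m.get("content")
        if not isinstance(content, str) or not content.strip():
            return False
    return messages[-1].get("role") == "assistant"


def run_pipeline(passages: Iterable[dict],
                 build_prompt: Callable[[dict], str],
                 call_teacher: Callable[[str], str]):
    """Return (accepted, rejected) candidate examples."""
    accepted, rejected = [], []
    for passage in passages:
        raw = call_teacher(build_prompt(passage))
        for example in parse_jsonl(raw):
            # Record provenance so bad examples can be traced back later.
            example["source_passage_id"] = passage["id"]
            (accepted if has_valid_messages(example) else rejected).append(example)
    return accepted, rejected
```

Keeping rejected candidates, instead of discarding them, makes it easier to see later which prompts or passage types produce the most failures.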

If the pipeline becomes large, tools like distilabel can help organize synthetic data generation, AI feedback, judging, filtering, and dataset export.

But I would start simple.

A minimal folder structure:

data/
  raw_clean/
  passages/
  generated_candidates/
  rejected/
  accepted/
  eval_private/
prompts/
  grammar_generation.txt
  student_support_generation.txt
  math_generation.txt
  tool_calling_generation.txt
  verifier.txt
scripts/
  split_passages.py
  generate_candidates.py
  verify_candidates.py
  filter_jsonl.py
  dedup.py
  build_train_mix.py

12. Suggested SFT dataset sizes

There is no universal number.

The needed size depends on:

  • model size
  • base model quality
  • CPT quality
  • task complexity
  • data diversity
  • data correctness
  • evaluation target

But for planning, I would use this rough scale:

| Dataset part | Suggested starting size |
| --- | --- |
| private eval | 100-500 examples |
| hand-written seed | 200-1,000 examples |
| first SFT experiment | 1,000-5,000 examples |
| grammar tutor data | 1,000-5,000 examples |
| student support data | 2,000-10,000 examples |
| math reasoning data | 2,000-20,000 verified examples |
| tool calling data | 200-2,000 high-quality examples |
| broad useful SFT | 5,000-30,000 examples |
| large synthetic SFT | 50,000-100,000+ examples, only after filtering is mature |

Some relevant scale references:

  • LIMA: around 1,000 carefully curated examples showed that small, high-quality data can matter.
  • AlpaGasus: selected around 9k high-quality examples from Alpaca-style data.
  • SmolTalk: 1M synthetic SFT samples used for SmolLM2-Instruct.
  • SmolLM2: small-model training can still rely heavily on careful data design and staged training.

But I would not start with 1M examples.

For your project, a more realistic first target might be:

CPT first
private eval: 200-300
hand-written seed: 300-800
first generated SFT: 3k-10k
final polished SFT: 500-2k

Then expand based on evaluation.


13. Example final mixture for your case

A possible first real SFT mixture:

| Component | Examples | Notes |
| --- | --- | --- |
| Persian grammar tutor | 2,000 | include فاعل، مفعول، متمم, correction, quiz |
| student support | 3,000 | explanations, examples, simplification |
| passage-grounded QA | 3,000 | from cleaned Persian educational/factual text |
| summarization/simplification | 1,500 | useful for student assistant |
| school math | 2,000 | verified answers only |
| tool calling | 500 | high quality, schema-valid |
| safety/refusal | 500 | simple Persian safety cases |
| final hand-written polish | 500 | naturalness and style |

Total:

~13k examples

That is already enough for a serious first SFT experiment.

If it works, expand the weak categories.
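
Building such a mixture from per-family pools can be sketched as quota-based sampling. The code below assumes every accepted example carries a `task_family` field, as in the metadata schemas in this post; the function name `build_mix` and the quota dictionary are illustrative.

```python
import random
from collections import defaultdict


def build_mix(examples: list, quotas: dict, seed: int = 0):
    """Sample up to quotas[family] examples per task family.

    Returns (mix, shortfalls), where shortfalls maps each family that had
    too few examples to the number still missing.
    """
    by_family = defaultdict(list)
    for ex in examples:
        by_family[ex["task_family"]].append(ex)

    rng = random.Random(seed)  # fixed seed so the mix is reproducible
    mix, shortfalls = [], {}
    for family, quota in quotas.items():
        pool = by_family.get(family, [])
        if len(pool) < quota:
            shortfalls[family] = quota - len(pool)
        mix.extend(rng.sample(pool, min(quota, len(pool))))
    rng.shuffle(mix)
    return mix, shortfalls


if __name__ == "__main__":
    pool = [{"id": i, "task_family": "persian_grammar_tutor"} for i in range(5)]
    mix, missing = build_mix(pool, {"persian_grammar_tutor": 3, "school_math": 2})
    print(len(mix), missing)
```

The shortfall report doubles as a to-do list: it shows which categories need more generation before the planned mixture can be reached.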


14. How to mix generated and hand-written data

I would not put hand-written data only at the end.

Use it in three places:

| Hand-written data role | Purpose |
| --- | --- |
| seed examples | show the teacher LLM what good data looks like |
| private eval | measure model improvement |
| final polish | fix the model’s final behavior/style |

Workflow:

write 300 excellent examples
  -> use 100 as private eval
  -> use 100 as few-shot generation examples
  -> keep 100 for final polish / style correction

Do not train on the private eval examples.
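
A deterministic split plus a simple leak check makes this rule easy to enforce. The sketch below assumes each hand-written example has a unique `id` field; both function names are hypothetical.

```python
import random


def split_handwritten(examples: list, n_eval: int, n_fewshot: int, seed: int = 0) -> dict:
    """Split hand-written examples into three disjoint sets."""
    if n_eval + n_fewshot > len(examples):
        raise ValueError("not enough examples for the requested split")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed: the same split every run
    return {
        "private_eval": shuffled[:n_eval],
        "few_shot": shuffled[n_eval:n_eval + n_fewshot],
        "final_polish": shuffled[n_eval + n_fewshot:],
    }


def find_eval_leaks(train: list, private_eval: list) -> list:
    """Return ids that appear in both the training data and the private eval set."""
    eval_ids = {ex["id"] for ex in private_eval}
    return [ex["id"] for ex in train if ex["id"] in eval_ids]
```

An id check only catches exact reuse. Generated examples that paraphrase an eval item need a text-level comparison as well.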


15. Quality filters for generated SFT

For each generated example, check:

| Check | Method |
| --- | --- |
| valid JSON | parser |
| correct language | language/script ratio |
| not too long | token length filter |
| not too short | length filter |
| no duplicate | exact/near dedup |
| source faithfulness | verifier LLM or rules |
| natural Persian | sample review / classifier |
| no benchmark leakage | decontamination |
| correct math answer | calculator / solver |
| valid tool schema | JSON schema validation |
| useful task | category-specific review |
| safe answer | safety filter / manual review |
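
Two of these checks, script ratio and duplicate removal, need nothing beyond the standard library. The sketch below is one simple way to do them; the function names are hypothetical, and the deduplication shown is exact matching on normalized text, which catches trivial variants but is not true near-duplicate detection such as MinHash.

```python
import re
import unicodedata


def persian_script_ratio(text: str) -> float:
    """Share of alphabetic characters that fall in the Arabic script block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum("\u0600" <= ch <= "\u06FF" for ch in letters) / len(letters)


def dedup_key(text: str) -> str:
    """Normalize text so that trivially different copies map to the same key."""
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("ي", "ی").replace("ك", "ک")  # Arabic yeh/kaf -> Persian forms
    text = re.sub(r"[^\w\s]", "", text)              # drop punctuation, diacritics, ZWNJ
    return re.sub(r"\s+", " ", text).strip().lower()


def dedup_examples(examples: list) -> list:
    """Keep the first example for each normalized user-side text."""
    seen, kept = set(), []
    for ex in examples:
        user_text = " ".join(m["content"] for m in ex["messages"] if m["role"] == "user")
        key = dedup_key(user_text)
        if key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept
```

The script ratio cannot tell Persian from Arabic or Urdu, so it is only a first-pass language check; a language-identification model would be needed to separate those.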

The most important filter depends on the task.

| Task | Most important filter |
| --- | --- |
| grammar tutor | correctness and natural explanation |
| math | answer verification |
| tool calling | schema validity |
| passage QA | source faithfulness |
| student support | helpfulness and natural Persian |
| safety | refusal quality |

16. Example schema for mixed SFT

I would keep metadata. It will help you debug later.

{
  "id": "sft_grammar_000001",
  "task_family": "persian_grammar_tutor",
  "source_type": "manual_seed",
  "source_passage_id": null,
  "difficulty": "middle_school",
  "language": "fa",
  "variety": "iranian_persian",
  "messages": [
    {"role": "system", "content": "You are a helpful Persian tutor."},
    {"role": "user", "content": "در جمله «مریم به مدرسه رفت»، متمم را مشخص کن."},
    {"role": "assistant", "content": "در این جمله «به مدرسه» متمم است، چون با حرف اضافهٔ «به» آمده و معنی فعل «رفت» را کامل می‌کند."}
  ],
  "verified": true,
  "notes": "manual seed"
}

For generated passage-grounded data:

{
  "id": "sft_qa_000001",
  "task_family": "passage_grounded_qa",
  "source_type": "generated_from_clean_text",
  "source_passage_id": "passage_12345",
  "language": "fa",
  "variety": "iranian_persian",
  "messages": [
    {"role": "system", "content": "You are a helpful Persian assistant. Answer only from the provided passage."},
    {"role": "user", "content": "<question based on passage>"},
    {"role": "assistant", "content": "<answer grounded in passage>"}
  ],
  "verified": true,
  "verifier": "teacher_llm_plus_manual_sample"
}

For tool calling, keep it separate because the schema is different.


17. What not to do

I would avoid:

| Bad approach | Why |
| --- | --- |
| Generate random Q&A from all raw text | Usually shallow and repetitive |
| Mix tool calling, grammar, math, and chat without labels | Hard to debug |
| Trust teacher LLM outputs without verification | Wrong answers enter training |
| Generate huge data before first eval | You will not know what helped |
| Use only passage-grounded QA | The model may not learn tutoring behavior |
| Use only final hand-written data | Too small and too slow to produce |
| Use only synthetic data | Persian naturalness may be weak |
| Ignore math verification | Bad math data is harmful |
| Treat tool calling like normal chat | Tool use needs schema-valid traces |
| Train on private eval examples | Contamination |

18. Practical recommendation

For your exact case, I would do this:

1. Finish CPT on clean Persian text.
2. Build 200-300 private eval examples.
3. Write 300-800 high-quality seed SFT examples manually.
4. Split cleaned text into passages.
5. Classify passages into task families.
6. Generate SFT candidates with a larger teacher LLM.
7. Verify generated examples by task type.
8. Build a first 5k-15k SFT mixture.
9. Fine-tune.
10. Evaluate.
11. Add data only for the categories that fail.
12. Finish with 500-2k hand-written/polished examples.

If I had to choose the first SFT categories, I would start with:

Persian grammar tutor
student support / explanation
passage-grounded QA
math with verified answers
small tool-calling set

This is more realistic than trying to build every possible assistant capability immediately.


Bottom line

Your plan is good, but I would refine it like this:

CPT
  -> private eval + hand-written seed
  -> capability-specific synthetic SFT generation
  -> verification/filtering
  -> SFT
  -> small final hand-written polish

The key point is:

Do not generate one universal SFT dataset from raw text. Generate several small datasets for separate capabilities, verify each one differently, and then mix them intentionally.

For your target model, I would rather have:

10k carefully categorized and verified Persian SFT examples

than:

100k random Persian Q&A pairs generated from raw text
