Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreihp4ltvqtnvtwiuloi5cvful664zbftzidbsyhkbzsjc7xlw2qcw4",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mnr73p3bxit2"
  },
  "path": "/t/how-can-i-build-a-high-quality-dataset/176571#post_6",
  "publishedAt": "2026-06-08T07:06:22.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Bonito",
    "Raw Text is All You Need",
    "GSM8K",
    "OpenMathInstruct",
    "MAmmoTH / MathInstruct",
    "TRL SFTTrainer",
    "ToolLLM / ToolBench",
    "distilabel",
    "LIMA",
    "AlpaGasus",
    "SmolTalk",
    "SmolLM2"
  ],
  "textContent": "Since this is already quite concrete, I looked into it directly:\n\n* * *\n\n## Short answer\n\nYes, your proposed high-level order is reasonable:\n\n\n    CPT -> SFT -> small final hand-written / polished SFT\n\n\nBut I would slightly modify it:\n\n\n    CPT\n      -> small hand-written seed + private eval\n      -> synthetic / generated SFT from cleaned raw text\n      -> filtering + verification\n      -> SFT\n      -> small final hand-written / polished SFT\n\n\nThe important change is that the hand-written data should not only be at the end. It should also be used at the beginning as:\n\n  * examples for the large teacher model\n  * style guide\n  * private evaluation set\n  * final correction / polish set\n\n\n\nThe main idea is:\n\n> Cleaned raw text is not an SFT dataset. It is source material. You turn it into SFT by designing task families, generating instruction-answer pairs, verifying them, and mixing them deliberately.\n\nSo I would not try to generate one universal Persian SFT dataset. I would generate several smaller datasets, each targeting one capability.\n\nFor example:\n\nCapability | Dataset family\n---|---\nPersian grammar help | explanation, correction, quiz, role identification\nschool-grade math | verified problem-solution pairs\nstudent support | simplify, explain, summarize, give examples\ntool calling | tool schemas + Persian user requests + tool-call traces\nknowledge QA | passage-grounded QA from source texts\nmulti-turn assistant | document-to-dialogue or scenario-to-dialogue\nnatural Persian style | small native/polished final SFT set\n\n* * *\n\n## 1. 
Raw text is source material, not SFT\n\nA cleaned Persian paragraph like this:\n\n\n    <clean Persian passage>\n\n\nis good for CPT.\n\nBut SFT needs something like:\n\n\n    {\"messages\":[{\"role\":\"user\",\"content\":\"<instruction>\"},{\"role\":\"assistant\",\"content\":\"<answer>\"}]}\n\n\nor:\n\n\n    {\"prompt\":\"<instruction>\",\"completion\":\"<answer>\"}\n\n\nThat means you need a conversion step:\n\n\n    cleaned raw text\n      -> passage selection\n      -> task type selection\n      -> instruction generation\n      -> answer generation\n      -> verification\n      -> formatting\n      -> SFT dataset\n\n\nThis is the same general idea behind work such as Bonito, which converts unannotated text into task-specific instruction tuning datasets, and Raw Text is All You Need, which studies generating knowledge-intensive multi-turn dialogues from raw documents.\n\nYou do not need to copy those exact methods. The useful idea is:\n\n> First decide what task the raw text should become.\n\n* * *\n\n## 2. Use capability-specific datasets, not one giant mixed dataset\n\nFor your case, I would split SFT data by target capability.\n\n### Recommended split\n\nCapability | Should it come from raw text? 
| Best construction method\n---|---|---\nPersian grammar help | partly | grammar passages + hand-written seed + generated correction/explanation tasks\nstudent support | yes | educational passages → explanations, summaries, examples, exercises\nschool math reasoning | not mainly | generated/translated problems + verified solutions\ntool calling | no, mostly separate | tool schema + user intent + tool call + tool result + final answer\ngeneral helpful Persian | partly | seed examples + self-instruct-style generation + filtering\nfactual Persian QA | yes | passage-grounded QA from cleaned text\nmulti-turn dialogue | yes | document/scenario → multi-turn conversation\nfinal natural style | mostly hand-written/polished | small native final SFT set\n\nThe mistake to avoid is:\n\n\n    I have clean Persian text, so I will ask an LLM to make random Q&A pairs from all of it.\n\n\nThat usually creates a noisy, repetitive, shallow SFT dataset.\n\nA better approach is:\n\n\n    I need 7 capability buckets.\n    For each bucket, I will design task templates.\n    Then I will generate and verify examples for that bucket.\n    Then I will mix the buckets intentionally.\n\n\n* * *\n\n## 3. 
Better training order\n\nYour proposed order:\n\n\n    CPT -> SFT -> Hand-Written SFT\n\n\nis basically good.\n\nI would expand it like this:\n\nStage | Purpose\n---|---\nCPT | Teach Persian distribution: grammar, syntax, style, general domain familiarity\nprivate eval | Keep a small set of Persian tasks that never enters training\nhand-written seed | Show the teacher LLM what “good Persian assistant data” looks like\nsynthetic SFT generation | Scale from raw text and task templates\nfiltering/verification | Remove wrong, unnatural, repetitive, or badly formatted examples\nfirst SFT | Teach assistant behavior\nfinal hand-written/polished SFT | Fix style, grammar tutor behavior, safety, and final answer tone\n\nThe seed set and final polish set can be small.\n\nThey just need to be high quality.\n\n* * *\n\n## 4. A practical raw-text-to-SFT workflow\n\nHere is a concrete workflow.\n\n\n    1. Split cleaned corpus into passages.\n    2. Classify passages by usefulness.\n    3. Assign each useful passage to one or more task families.\n    4. Generate instruction-answer pairs with a teacher LLM.\n    5. Verify automatically where possible.\n    6. Review samples manually.\n    7. Deduplicate.\n    8. Convert to TRL/HF chat format.\n    9. Train a small SFT.\n    10. Evaluate.\n    11. 
Add more data only where the model fails.\n\n\n### Passage classification\n\nNot every cleaned text is useful for SFT.\n\nClassify each passage:\n\nPassage type | Use\n---|---\ngrammar explanation | grammar tutor data\neducational explanation | student support data\nfactual article | grounded QA / summarization\nmath problem | math reasoning if answer is verifiable\ntool documentation | tool calling data\nnoisy opinion/comment | maybe conversational style, but risky\nlist/table/boilerplate | usually reject or handle separately\nvery short text | often reject\nvery long text | split or summarize first\n\nYou can use simple tags:\n\n\n    {\"id\":\"p_0001\",\"text\":\"<passage>\",\"tags\":[\"grammar\",\"education\"],\"quality\":\"high\"}\n    {\"id\":\"p_0002\",\"text\":\"<passage>\",\"tags\":[\"factual\",\"qa\"],\"quality\":\"medium\"}\n    {\"id\":\"p_0003\",\"text\":\"<passage>\",\"tags\":[\"boilerplate\"],\"quality\":\"reject\"}\n\n\n* * *\n\n## 5. Use a teacher LLM with task templates\n\nFor each task family, write a generation prompt.\n\nDo not simply say:\n\n\n    Make SFT data from this text.\n\n\nThat is too vague.\n\nInstead, make the task explicit.\n\n### Passage-grounded QA\n\n\n    You are creating Persian SFT data for a small Iranian Persian assistant.\n\n    Source passage:\n    <passage>\n\n    Create 3 Persian user questions that can be answered only from the passage.\n    For each question, write a concise Persian answer.\n    Do not add facts that are not in the passage.\n    Use natural Iranian Persian.\n    Return JSONL with fields: id, task_type, messages, source_passage_id.\n\n\n### Student explanation\n\n\n    You are creating Persian SFT data for a student-support assistant.\n\n    Source passage:\n    <passage>\n\n    Create:\n    1. one student question,\n    2. one simple explanation,\n    3. one example,\n    4. one short follow-up question the student might ask,\n    5. 
one assistant follow-up answer.\n\n    The assistant should sound like a patient Persian teacher.\n    Avoid English unless the source requires it.\n    Return JSON.\n\n\n### Grammar tutor\n\n\n    You are creating Persian grammar tutor SFT data.\n\n    Grammar topic:\n    <topic>\n\n    Source passage:\n    <passage>\n\n    Create examples for:\n    - explaining the concept\n    - identifying the concept in a sentence\n    - correcting a student's wrong answer\n    - giving 3 new examples\n    - making a mini quiz\n\n    Use Iranian Persian.\n    Keep explanations short and clear.\n    Return JSONL.\n\n\n### Simplification\n\n\n    Convert the passage into a student-friendly explanation.\n\n    Source:\n    <passage>\n\n    Create:\n    - user instruction\n    - assistant answer\n    - difficulty level: elementary / middle_school / high_school\n    - key concepts\n    - possible student confusion\n\n    Use natural Persian.\n\n\nThis is much better than unrestricted generation.\n\n* * *\n\n## 6. Grammar help in Farsi: build it as a separate dataset family\n\nFor things like:\n\n  * فاعل\n  * مفعول\n  * متمم\n  * مسند\n  * فعل\n  * قید\n  * صفت\n  * مضاف و مضاف‌الیه\n  * نقش‌های دستوری\n  * جمله ساده / مرکب\n\n\n\nI would not rely only on raw text.\n\nYou need targeted examples.\n\nSuggested grammar task families:\n\nTask type | Example\n---|---\nconcept explanation | “فاعل چیست؟ با مثال توضیح بده.”\nrole identification | “در جمله زیر فاعل و مفعول را مشخص کن.”\nerror correction | “دانش‌آموز گفته X مفعول است. 
آیا درست است؟”\ncontrastive explanation | “فرق متمم و مفعول چیست؟”\nexample generation | “برای متمم سه مثال ساده بساز.”\nmini quiz | “سه سوال کوتاه درباره فاعل و مفعول بساز.”\nstep-by-step analysis | “این جمله را از نظر نقش‌های دستوری تحلیل کن.”\nstudent misconception | “چرا این پاسخ اشتباه است؟ ساده توضیح بده.”\n\nA useful schema:\n\n\n    {\n      \"id\": \"grammar_0001\",\n      \"task_type\": \"grammar_role_identification\",\n      \"topic\": \"فاعل\",\n      \"messages\": [\n        {\"role\": \"system\", \"content\": \"You are a helpful Persian grammar tutor.\"},\n        {\"role\": \"user\", \"content\": \"در جمله «علی کتاب را خواند»، فاعل را مشخص کن و کوتاه توضیح بده.\"},\n        {\"role\": \"assistant\", \"content\": \"در این جمله «علی» فاعل است، چون انجام‌دهندهٔ عمل خواندن است. «کتاب» مفعول است، چون عمل خواندن روی آن انجام شده است.\"}\n      ],\n      \"source\": \"manual_seed\",\n      \"language\": \"fa\",\n      \"variety\": \"iranian_persian\"\n    }\n\n\nFor grammar tutoring, I would make at least a small hand-written seed set.\n\nSomething like:\n\n\n    100-300 excellent grammar tutor examples\n\n\nThen use those as few-shot examples for the teacher LLM.\n\n* * *\n\n## 7. Student support data\n\nFor student support, raw educational text is very useful.\n\nConvert passages into tasks like:\n\nTask | Input | Output\n---|---|---\nexplain simply | passage | simple explanation\nsummarize | passage | short Persian summary\ngive examples | concept | 2-3 examples\nmake exercise | passage | student exercise\nanswer misconception | wrong student answer | correction\nscaffold | hard question | step-by-step hint\nquiz | lesson passage | short quiz\ncompare concepts | two concepts | explanation of difference\n\nExample generation prompt:\n\n\n    You are building SFT data for an Iranian Persian student assistant.\n\n    Source passage:\n    <passage>\n\n    Create 5 examples:\n    1. simple explanation\n    2. summary\n    3. example-based explanation\n    4. 
student misconception correction\n    5. mini quiz\n\n    Rules:\n    - Use natural Iranian Persian.\n    - Do not add facts outside the passage unless clearly common knowledge.\n    - Keep answers short enough for a small model.\n    - Do not create very long chain-of-thought.\n    - Return JSONL.\n\n\nThe “short enough for a small model” rule matters. A 0.8B model may not benefit from very long, complex answers early.\n\n* * *\n\n## 8. Math reasoning: do not rely only on raw Persian text\n\nMath is different.\n\nFor math SFT, correctness matters more than naturalness.\n\nI would not generate math examples only from raw Persian text. Instead:\n\n  1. generate or translate school-grade math problems,\n  2. solve them with a reliable solver or teacher model,\n  3. verify the answer,\n  4. write a Persian explanation,\n  5. sample-review.\n\n\n\nRelevant examples:\n\n  * GSM8K is a classic school-grade math word-problem dataset.\n  * OpenMathInstruct generated large math instruction data with code-interpreter-style solutions.\n  * MAmmoTH / MathInstruct is another important math instruction tuning example.\n\n\n\nFor your use case, I would make Persian school-grade math as a separate dataset family.\n\nExample schema:\n\n\n    {\n      \"id\": \"math_0001\",\n      \"task_type\": \"school_math_reasoning\",\n      \"grade_level\": \"middle_school\",\n      \"messages\": [\n        {\"role\": \"system\", \"content\": \"You are a helpful Persian math tutor.\"},\n        {\"role\": \"user\", \"content\": \"اگر علی ۳ دفتر بخرد و قیمت هر دفتر ۴۰ هزار تومان باشد، او در مجموع چقدر پرداخت می‌کند؟\"},\n        {\"role\": \"assistant\", \"content\": \"قیمت هر دفتر ۴۰ هزار تومان است و علی ۳ دفتر می‌خرد. پس باید ۳ را در ۴۰ هزار ضرب کنیم: ۳ × ۴۰٬۰۰۰ = ۱۲۰٬۰۰۰. 
بنابراین علی ۱۲۰ هزار تومان پرداخت می‌کند.\"}\n      ],\n      \"answer\": \"120000\",\n      \"verified\": true\n    }\n\n\nFor math, add automatic checks when possible.\n\nFor arithmetic, you can verify with Python.\n\n\n    def check_answer(problem_id, predicted_answer, gold_answer):\n        # Strip thousands separators and whitespace from both sides before comparing.\n        pred, gold = (str(a).replace(\",\", \"\").strip() for a in (predicted_answer, gold_answer))\n        return pred == gold\n\n\nIf you generate 10,000 math examples and 20% are wrong, the model trains on 2,000 incorrect solutions, which can damage it.\n\nBetter:\n\n\n    2,000 verified math examples > 20,000 unverified math examples\n\n\n* * *\n\n## 9. Tool calling should be a separate dataset\n\nTool calling does not naturally come from Persian raw text.\n\nYou need a tool-use dataset with:\n\n  * tool schema\n  * Persian user request\n  * assistant tool call\n  * tool output\n  * final Persian answer\n\n\n\nThe TRL docs include tool-calling SFT support and describe using tool schemas and tool calls in the dataset: TRL SFTTrainer.\n\nTool-use work such as ToolLLM / ToolBench is useful conceptually because it treats tool use as its own data construction problem.\n\nA simple tool example:\n\n\n    {\n      \"tools\": [\n        {\n          \"type\": \"function\",\n          \"function\": {\n            \"name\": \"calculator\",\n            \"description\": \"Perform basic arithmetic.\",\n            \"parameters\": {\n              \"type\": \"object\",\n              \"properties\": {\n                \"expression\": {\"type\": \"string\"}\n              },\n              \"required\": [\"expression\"]\n            }\n          }\n        }\n      ],\n      \"messages\": [\n        {\"role\": \"user\", \"content\": \"حاصل ۲۳ ضربدر ۴۷ چقدر است؟\"},\n        {\n          \"role\": \"assistant\",\n          \"tool_calls\": [\n            {\n              \"type\": \"function\",\n              \"function\": {\n                \"name\": \"calculator\",\n                \"arguments\": \"{\\\"expression\\\":\\\"23*47\\\"}\"\n              }\n            
}\n          ]\n        },\n        {\"role\": \"tool\", \"content\": \"1081\"},\n        {\"role\": \"assistant\", \"content\": \"حاصل ۲۳ ضربدر ۴۷ برابر با ۱۰۸۱ است.\"}\n      ]\n    }\n\n\nYou also need negative examples.\n\nExample type | Why\n---|---\ntool needed | model learns when to call\nno tool needed | model does not call tools unnecessarily\nambiguous request | model asks clarification\nunavailable tool | model refuses or explains limitation\ninvalid arguments | model learns valid JSON/schema\ntool result summarization | model explains result in Persian\n\nFor Persian tool calling, I would start small:\n\n\n    200-500 high-quality tool examples\n\n\nThen expand only after testing.\n\n* * *\n\n## 10. A large LLM workflow for generating SFT data\n\nIf you can use a larger LLM as a teacher, use it as a generator, but also as a critic/verifier.\n\nA simple pipeline:\n\n\n    passage\n      -> generator LLM creates candidate SFT examples\n      -> verifier LLM checks faithfulness, Persian naturalness, format\n      -> rule-based filters check JSON/schema/length\n      -> sample manual review\n      -> accepted examples enter SFT dataset\n\n\n### Generator prompt\n\n\n    You are generating Persian SFT data for a small Iranian Persian assistant.\n\n    Target capability: <capability>\n    Source passage:\n    <passage>\n\n    Generate <n> examples.\n    Each example must include:\n    - task_type\n    - messages\n    - source_passage_id\n    - difficulty\n    - verification_notes\n\n    Rules:\n    - Use natural Iranian Persian.\n    - Keep answers concise.\n    - Do not invent facts outside the passage.\n    - Do not include unsafe content.\n    - Do not use English unless needed.\n    - Return valid JSONL only.\n\n\n### Verifier prompt\n\n\n    You are verifying Persian SFT examples.\n\n    Source passage:\n    <passage>\n\n    Candidate example:\n    <example>\n\n    Check:\n    1. Is the Persian natural?\n    2. 
Is the answer faithful to the passage?\n    3. Is the instruction clear?\n    4. Is the answer useful for the target capability?\n    5. Is the format valid?\n    6. Is there any hallucinated fact?\n    7. Should this be accepted, rejected, or edited?\n\n    Return:\n    {\"decision\":\"accept|edit|reject\",\"reason\":\"...\",\"fixed_example\":...}\n\n\nDo not accept everything from the teacher LLM. The teacher LLM is a data generator, not a guarantee of quality.\n\n* * *\n\n## 11. You can implement this with simple scripts first\n\nYou do not need a complex framework at the beginning.\n\nA simple Python pipeline is enough:\n\n\n    input passages\n      -> prompt templates\n      -> teacher LLM calls\n      -> JSON parser\n      -> filters\n      -> output JSONL\n\n\nIf the pipeline becomes large, tools like distilabel can help organize synthetic data generation, AI feedback, judging, filtering, and dataset export.\n\nBut I would start simple.\n\nA minimal folder structure:\n\n\n    data/\n      raw_clean/\n      passages/\n      generated_candidates/\n      rejected/\n      accepted/\n      eval_private/\n    prompts/\n      grammar_generation.txt\n      student_support_generation.txt\n      math_generation.txt\n      tool_calling_generation.txt\n      verifier.txt\n    scripts/\n      split_passages.py\n      generate_candidates.py\n      verify_candidates.py\n      filter_jsonl.py\n      dedup.py\n      build_train_mix.py\n\n\n* * *\n\n## 12. 
Suggested SFT dataset sizes\n\nThere is no universal number.\n\nThe needed size depends on:\n\n  * model size\n  * base model quality\n  * CPT quality\n  * task complexity\n  * data diversity\n  * data correctness\n  * evaluation target\n\n\n\nBut for planning, I would use this rough scale:\n\nDataset part | Suggested starting size\n---|---\nprivate eval | 100-500 examples\nhand-written seed | 200-1,000 examples\nfirst SFT experiment | 1,000-5,000 examples\ngrammar tutor data | 1,000-5,000 examples\nstudent support data | 2,000-10,000 examples\nmath reasoning data | 2,000-20,000 verified examples\ntool calling data | 200-2,000 high-quality examples\nbroad useful SFT | 5,000-30,000 examples\nlarge synthetic SFT | 50,000-100,000+ examples, only after filtering is mature\n\nSome relevant scale references:\n\n  * LIMA: around 1,000 carefully curated examples showed that small, high-quality data can matter.\n  * AlpaGasus: selected around 9k high-quality examples from Alpaca-style data.\n  * SmolTalk: 1M synthetic SFT samples used for SmolLM2-Instruct.\n  * SmolLM2: small-model training can still rely heavily on careful data design and staged training.\n\n\n\nBut I would not start with 1M examples.\n\nFor your project, a more realistic first target might be:\n\n\n    CPT first\n    private eval: 200-300\n    hand-written seed: 300-800\n    first generated SFT: 3k-10k\n    final polished SFT: 500-2k\n\n\nThen expand based on evaluation.\n\n* * *\n\n## 13. 
Example final mixture for your case\n\nA possible first real SFT mixture:\n\nComponent | Examples | Notes\n---|---|---\nPersian grammar tutor | 2,000 | include فاعل، مفعول، متمم, correction, quiz\nstudent support | 3,000 | explanations, examples, simplification\npassage-grounded QA | 3,000 | from cleaned Persian educational/factual text\nsummarization/simplification | 1,500 | useful for student assistant\nschool math | 2,000 | verified answers only\ntool calling | 500 | high quality, schema-valid\nsafety/refusal | 500 | simple Persian safety cases\nfinal hand-written polish | 500 | naturalness and style\n\nTotal:\n\n\n    ~13k examples\n\n\nThat is already enough for a serious first SFT experiment.\n\nIf it works, expand the weak categories.\n\n* * *\n\n## 14. How to mix generated and hand-written data\n\nI would not put hand-written data only at the end.\n\nUse it in three places:\n\nHand-written data role | Purpose\n---|---\nseed examples | show the teacher LLM what good data looks like\nprivate eval | measure model improvement\nfinal polish | fix the model’s final behavior/style\n\nWorkflow:\n\n\n    write 300 excellent examples\n      -> use 100 as private eval\n      -> use 100 as few-shot generation examples\n      -> keep 100 for final polish / style correction\n\n\nDo not train on the private eval examples.\n\n* * *\n\n## 15. 
Quality filters for generated SFT\n\nFor each generated example, check:\n\nCheck | Method\n---|---\nvalid JSON | parser\ncorrect language | language/script ratio\nnot too long | token length filter\nnot too short | length filter\nno duplicate | exact/near dedup\nsource faithfulness | verifier LLM or rules\nnatural Persian | sample review / classifier\nno benchmark leakage | decontamination\ncorrect math answer | calculator / solver\nvalid tool schema | JSON schema validation\nuseful task | category-specific review\nsafe answer | safety filter / manual review\n\nThe most important filter depends on the task.\n\nTask | Most important filter\n---|---\ngrammar tutor | correctness and natural explanation\nmath | answer verification\ntool calling | schema validity\npassage QA | source faithfulness\nstudent support | helpfulness and natural Persian\nsafety | refusal quality\n\n* * *\n\n## 16. Example schema for mixed SFT\n\nI would keep metadata. It will help you debug later.\n\n\n    {\n      \"id\": \"sft_grammar_000001\",\n      \"task_family\": \"persian_grammar_tutor\",\n      \"source_type\": \"manual_seed\",\n      \"source_passage_id\": null,\n      \"difficulty\": \"middle_school\",\n      \"language\": \"fa\",\n      \"variety\": \"iranian_persian\",\n      \"messages\": [\n        {\"role\": \"system\", \"content\": \"You are a helpful Persian tutor.\"},\n        {\"role\": \"user\", \"content\": \"در جمله «مریم به مدرسه رفت»، متمم را مشخص کن.\"},\n        {\"role\": \"assistant\", \"content\": \"در این جمله «به مدرسه» متمم است، چون با حرف اضافهٔ «به» آمده و معنی فعل «رفت» را کامل می‌کند.\"}\n      ],\n      \"verified\": true,\n      \"notes\": \"manual seed\"\n    }\n\n\nFor generated passage-grounded data:\n\n\n    {\n      \"id\": \"sft_qa_000001\",\n      \"task_family\": \"passage_grounded_qa\",\n      \"source_type\": \"generated_from_clean_text\",\n      \"source_passage_id\": \"passage_12345\",\n      \"language\": \"fa\",\n      \"variety\": 
\"iranian_persian\",\n      \"messages\": [\n        {\"role\": \"system\", \"content\": \"You are a helpful Persian assistant. Answer only from the provided passage.\"},\n        {\"role\": \"user\", \"content\": \"<question based on passage>\"},\n        {\"role\": \"assistant\", \"content\": \"<answer grounded in passage>\"}\n      ],\n      \"verified\": true,\n      \"verifier\": \"teacher_llm_plus_manual_sample\"\n    }\n\n\nFor tool calling, keep it separate because the schema is different.\n\n* * *\n\n## 17. What not to do\n\nI would avoid:\n\nBad approach | Why\n---|---\nGenerate random Q&A from all raw text | Usually shallow and repetitive\nMix tool calling, grammar, math, and chat without labels | Hard to debug\nTrust teacher LLM outputs without verification | Wrong answers enter training\nGenerate huge data before first eval | You will not know what helped\nUse only passage-grounded QA | The model may not learn teacher behavior\nUse only final hand-written data | Too small and slow\nUse only synthetic data | Persian naturalness may be weak\nIgnore math verification | Bad math data is harmful\nTreat tool calling like normal chat | Tool use needs schema-valid traces\nTrain on private eval examples | Contamination\n\n* * *\n\n## 18. Practical recommendation\n\nFor your exact case, I would do this:\n\n\n    1. Finish CPT on clean Persian text.\n    2. Build 200-300 private eval examples.\n    3. Write 300-800 high-quality seed SFT examples manually.\n    4. Split cleaned text into passages.\n    5. Classify passages into task families.\n    6. Generate SFT candidates with a larger teacher LLM.\n    7. Verify generated examples by task type.\n    8. Build a first 5k-15k SFT mixture.\n    9. Fine-tune.\n    10. Evaluate.\n    11. Add data only for the categories that fail.\n    12. 
Finish with 500-2k hand-written/polished examples.\n\n\nIf I had to choose the first SFT categories, I would start with:\n\n\n    Persian grammar tutor\n    student support / explanation\n    passage-grounded QA\n    math with verified answers\n    small tool-calling set\n\n\nThis is more realistic than trying to build every possible assistant capability immediately.\n\n* * *\n\n## Bottom line\n\nYour plan is good, but I would refine it like this:\n\n\n    CPT\n      -> private eval + hand-written seed\n      -> capability-specific synthetic SFT generation\n      -> verification/filtering\n      -> SFT\n      -> small final hand-written polish\n\n\nThe key point is:\n\n> Do not generate one universal SFT dataset from raw text. Generate several small datasets for separate capabilities, verify each one differently, and then mix them intentionally.\n\nFor your target model, I would rather have:\n\n\n    10k carefully categorized and verified Persian SFT examples\n\n\nthan:\n\n\n    100k random Persian Q&A pairs generated from raw text\n",
  "title": "How can I build a high-quality dataset?"
}