Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreibu4y4a3mw2vnlqv7xzwnktlbl5dzsuhsnnlncrwvovttgfcemqse",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mnoojoxvmd32"
  },
  "path": "/t/how-can-i-build-a-high-quality-dataset/176571#post_2",
  "publishedAt": "2026-06-07T07:31:32.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Open Persian LLM Leaderboard",
    "MIZAN: A Persian LLM Leaderboard",
    "Khayyam Challenge / PersianMMLU",
    "PerMMLU",
    "Persian IFEval",
    "Persian MT-Bench",
    "ELAB: Extensive LLM Alignment Benchmark in Persian Language",
    "FaMTEB",
    "PTEB Leaderboard",
    "ParsBench",
    "lm-evaluation-harness",
    "Dorna-Llama3-8B-Instruct",
    "PersianMind",
    "Persian-Phi",
    "MIZAN",
    "FarsInstruct",
    "paper",
    "GitHub",
    "MatinaAI instruction tuning / alignment datasets",
    "Matina Persian Text Corpus",
    "ParsBench datasets/models",
    "PQuAD",
    "FarsTail",
    "xmanii Persian SFT/COT collection",
    "Self-Instruct",
    "WizardLM / Evol-Instruct",
    "distilabel",
    "Bonito",
    "Aya Dataset",
    "Bactrian-X",
    "MURI",
    "LIMA",
    "AlpaGasus",
    "What Makes Good Data for Alignment? / Deita",
    "Long Is More for Alignment",
    "Large-Scale Data Selection for Instruction Tuning",
    "TRL SFTTrainer",
    "TRL dataset formats",
    "Transformers chat templates",
    "Hugging Face Dataset Cards",
    "Create a dataset card"
  ],
  "textContent": "Dataset quality is hard to define because the evaluation criteria change depending on the goal, but I tried to organize the parts that can be organized:\n\n* * *\n\n## Short answer\n\nYes, there are best practices, but I would not start from “how do I create a large high-quality dataset?” in the abstract.\n\nFor a **Persian SLM assistant**, I would start from a more evaluation-driven question:\n\n> What exact Persian capability is missing from the base model, and how will I measure whether my dataset fixes that gap?\n\nA “high-quality dataset” for Persian assistant fine-tuning may mean different things depending on whether you want:\n\n  * better Persian fluency\n  * better instruction following\n  * better multi-turn assistant behavior\n  * better Iranian/Persian cultural fit\n  * better factuality or domain knowledge\n  * safer refusals and culturally appropriate alignment\n  * better RAG behavior\n  * better formatting under a specific chat template\n  * better behavior from a very small model with limited context and imperfect Persian tokenization\n\n\n\nSo I would use an **evaluation-first roadmap**:\n\n  1. Check current Persian leaderboards and benchmarks.\n  2. Build a small private eval set for your exact use case.\n  3. Evaluate the base model and Persian-specialized baselines.\n  4. Identify the actual gap.\n  5. Only then build or curate training data targeted at that gap.\n\n\n\nThe important point is:\n\n> Do not build “more Persian chat data” blindly. Build the missing data that your evaluation shows you need.\n\n* * *\n\n## 1. 
Start with Persian evaluation, not training data\n\nBefore generating or collecting examples, I would first inspect current Persian evaluation resources.\n\nThese are useful because they show what Persian LLM quality already gets decomposed into: knowledge, reasoning, instruction following, NLU, NLG, multi-turn dialogue, safety, culture, and retrieval.\n\nGoal | Useful starting point | Why it matters\n---|---|---\nGeneral Persian LLM comparison | Open Persian LLM Leaderboard | A practical first stop for comparing current Persian-capable models. Use it for baselines, not as a final product metric.\nMulti-dimensional Persian evaluation | MIZAN: A Persian LLM Leaderboard | Useful because it separates Persian evaluation into reasoning, instruction following, knowledge, NLU, NLG, and multi-turn dialogue.\nPersian knowledge / reasoning | Khayyam Challenge / PersianMMLU, PerMMLU | Good anchors for general and local knowledge, but not sufficient for assistant behavior.\nInstruction following | Persian IFEval | Helpful for checking whether the model follows Persian instructions and constraints.\nMulti-turn assistant behavior | Persian MT-Bench | Useful for dialogue, writing, multi-turn behavior, and retrieval-like chat cases.\nPersian safety / alignment | ELAB: Extensive LLM Alignment Benchmark in Persian Language | Important for safety, fairness, and social norms in Persian linguistic/cultural contexts.\nPersian embeddings / RAG | FaMTEB, PTEB Leaderboard | If the assistant uses retrieval, do not evaluate only generation. 
Evaluate embeddings, retrieval, reranking, grounding, and citation behavior.\nCustom local evaluation | ParsBench, lm-evaluation-harness | Useful if you want repeatable private evaluation rather than only leaderboard screenshots.\n\n### Why this matters\n\nIf the model fails on Persian IFEval-style tasks, you probably need better instruction-following data.\n\nIf it fails on PerMMLU/PersianMMLU-style tasks, you may need domain or knowledge data, or maybe RAG rather than SFT.\n\nIf it fails on ELAB-style tasks, you need safety/alignment data, not generic QA.\n\nIf it fails because Persian inputs become very long under the tokenizer, more SFT examples may not be enough; you may need to think about tokenizer efficiency, model choice, or continued pretraining.\n\n* * *\n\n## 2. Do not treat “Persian” as one fully specified target\n\nIt is useful to state the target variety explicitly.\n\nPersian/Farsi can mean Iranian Persian, but Persian is also a pluricentric language with varieties such as Dari and Tajik. If your target is an Iranian Persian assistant, say that. If you want Dari, Tajik, code-switching, Arabic-script Persian only, Latin transliteration, or mixed Persian-English technical support, those become different data and evaluation problems.\n\nA simple target statement can prevent many later mistakes:\n\n> Target: Iranian Persian/Farsi assistant for <domain>, mostly formal/semi-formal register, Arabic-script Persian, occasional English technical terms, no Dari/Tajik coverage for now.\n\nOr:\n\n> Target: general Persian assistant covering Iranian Persian plus common code-switching in technical conversations.\n\nThis affects:\n\n  * source selection\n  * spelling normalization\n  * register\n  * cultural assumptions\n  * evaluation examples\n  * safety examples\n  * tokenizer measurements\n  * what counts as a natural answer\n\n\n\n* * *\n\n## 3. 
Define “quality” by capability, not by dataset size\n\nA large dataset can still be low quality if it is repetitive, mistranslated, inconsistent, contaminated, or irrelevant to the target model’s failure modes.\n\nFor a Persian SLM assistant, I would split quality into dimensions like this:\n\nQuality dimension | What to check | Persian SLM-specific note\n---|---|---\nPersian fluency | Natural Persian, spelling, orthography, punctuation, register | Translated English data alone may sound unnatural. Native or fluent review matters.\nInstruction following | Constraints, requested format, multi-step tasks, refusal when needed | Use Persian IFEval-like and private instruction tests.\nMulti-turn ability | Context tracking, follow-up answers, corrections, clarification | Do not evaluate only single-turn QA.\nFactuality | Verified answers, domain correctness, local knowledge | Use PersianMMLU/PerMMLU/PARSE-like tests, but do not train on their test items.\nCultural fit | Iranian/Persian norms, idioms, local references, etiquette | Use culturally grounded resources; translation-only data can miss this.\nSafety/alignment | Refusal behavior, safe alternatives, fairness, privacy, social norms | ELAB-like evaluation is relevant here.\nDomain fit | Medical/legal/education/support correctness | Domain data may need expert validation.\nRetrieval behavior | Finds correct docs, cites evidence, avoids unsupported claims | Evaluate embedding/retrieval separately with FaMTEB/PTEB-like resources.\nFormat consistency | Stable schema, roles, chat template, answer style | Bad formatting can ruin otherwise good data.\nTokenization cost | Tokens per word/sentence, truncation rate, context waste | Especially important for 0.5B–3B SLMs.\nProvenance/license | Source, generation method, license, redistribution rights | Required if you publish the dataset.\nContamination control | No benchmark leakage into train data | Critical if you use public Persian benchmarks to guide development.\n\nA good 
Persian assistant dataset is not just “lots of Persian conversations.” It is a dataset that targets one or more of these dimensions clearly.\n\n* * *\n\n## 4. Compare against Persian-specialized baselines before training\n\nBefore spending time building a dataset, evaluate your base SLM against existing Persian-specialized or Persian-adapted baselines.\n\nExamples worth inspecting include:\n\n  * Dorna-Llama3-8B-Instruct\n  * PersianMind\n  * Persian-Phi\n  * current models listed on Open Persian LLM Leaderboard\n  * current models listed on MIZAN\n\n\n\nThe goal is not necessarily to use those models directly. The goal is to learn what type of gap you are dealing with.\n\nObservation | Likely implication\n---|---\nYour SLM is weak at basic Persian fluency | SFT alone may not be enough; consider model choice, continued pretraining, or tokenizer issues.\nIt is fluent but bad at following instructions | SFT data may help.\nIt follows instructions but lacks local facts | Use RAG or domain knowledge data; do not expect generic chat data to fix this reliably.\nIt gives unsafe or culturally odd answers | You need alignment/safety/culture data, not just more QA.\nIt fails long Persian inputs because of token length | Measure tokenizer fertility and truncation before scaling the dataset.\nLarger Persian models are much better | Maybe the task is too hard for the chosen SLM size, or needs a narrower scope.\n\n* * *\n\n## 5. Inspect existing Persian resources before generating from scratch\n\nBefore creating 50k–100k synthetic examples, inspect existing Persian datasets and corpora.\n\nUseful examples:\n\nResource | How I would use it\n---|---\nFarsInstruct, paper, GitHub | A major Persian instruction-following resource. Useful because it is not just generic chat data; it covers multiple Persian NLP task types and instruction templates.\nMatinaAI instruction tuning / alignment datasets | Useful for cultural-alignment and Persian-focused instruction data patterns. 
Check access, license, and intended use carefully.\nMatina Persian Text Corpus | More relevant to language adaptation / continued pretraining than ordinary SFT. Useful if the model’s Persian base ability is weak.\nParsBench datasets/models | Useful for Persian task/evaluation exploration and possible private benchmark inspiration.\nPQuAD | Persian reading comprehension / QA resource. Good example of task-specific Persian data.\nFarsTail | Persian textual entailment resource. Useful for NLI-style evaluation or training inspiration.\nCommunity Persian SFT/COT collections, e.g. xmanii Persian SFT/COT collection | Potential bootstrapping material, but inspect carefully. Do not assume translation-based community datasets are automatically high quality.\n\nFor community SFT datasets, I would check:\n\n  * Is it native Persian, translated Persian, or synthetic Persian?\n  * What model generated it?\n  * What was the source dataset?\n  * Is the license compatible with your use?\n  * Are there duplicate examples?\n  * Does it contain benchmark items?\n  * Does it use the schema you need?\n  * Are the answers natural in Persian?\n  * Are there hidden English artifacts?\n  * Does it match your target register and domain?\n\n\n\n* * *\n\n## 6. Decide which layer you actually need to improve\n\nFor a Persian SLM assistant, there are several different intervention layers. 
SFT is only one of them.\n\nIf the problem is… | Better first intervention\n---|---\nPoor basic Persian language modeling | Persian continued pretraining / language adaptation / better base model\nBad instruction following | SFT on instruction-following data\nBad multi-turn chat | Multi-turn SFT examples and multi-turn eval\nBad local/cultural behavior | Culturally grounded Persian examples and human/native review\nBad safety/refusal behavior | Safety/alignment dataset and Persian alignment eval\nBad factual/domain answers | RAG, domain corpus, expert-validated QA, or domain SFT\nBad retrieval | Better Persian embeddings/rerankers, FaMTEB/PTEB-like eval\nBad formatting | Schema cleanup, chat template alignment, loss masking\nToo many tokens for Persian inputs | Tokenizer/model choice, shorter examples, context-aware data design\n\nThis is why evaluation first is useful: it tells you which layer to work on.\n\n* * *\n\n## 7. A practical dataset-building pipeline\n\nA practical workflow could look like this.\n\nStage | Output | Notes\n---|---|---\n1. Define target | One short target statement | Example: “Iranian Persian customer-support assistant for <domain>.”\n2. Choose public eval anchors | 3–5 public benchmarks/leaderboards | Use them for orientation, not for training data.\n3. Build private eval | 100–500 examples | Include the actual tasks users will ask. Keep it held out.\n4. Evaluate base model | Failure report | Measure before training.\n5. Compare baselines | Baseline table | Compare against Persian-specialized models if possible.\n6. Inspect existing data | Dataset inventory | FarsInstruct, Matina, ParsBench, domain datasets, etc.\n7. Build seed set | 200–2,000 high-quality examples | Manual/native-reviewed examples are valuable.\n8. Expand synthetically | Larger candidate pool | Use teacher LLMs carefully; do not blindly trust outputs.\n9. Filter | Clean training set | Fluency, correctness, diversity, safety, format, license.\n10. 
Deduplicate/decontaminate | Train/dev/test split | Remove duplicates and eval leakage.\n11. Train | SFT/LoRA/QLoRA run | Use correct chat template and loss masking.\n12. Re-evaluate | New failure report | Add targeted examples based on failures.\n\nThe key loop is:\n\n> evaluate → inspect failures → add targeted data → train → re-evaluate\n\nnot:\n\n> generate a huge dataset → train once → hope quality improves\n\n* * *\n\n## 8. Build a small private eval set\n\nPublic leaderboards are useful, but they will not perfectly match your application.\n\nI would create a small private eval set early. Even 100–300 examples can be very useful if they are well chosen.\n\nSuggested categories:\n\nCategory | Example\n---|---\nBasic Persian assistant | “Explain <topic> in simple Persian.”\nInstruction constraints | “Answer in exactly 3 bullet points.”\nMulti-turn follow-up | User corrects or narrows the previous request.\nLocal/cultural knowledge | Iranian holidays, etiquette, food, education, bureaucracy, etc.\nDomain task | Your real target domain.\nSafety/refusal | Harmful, privacy-sensitive, or ethically sensitive requests.\nFormatting | JSON, markdown table, numbered list, citation format.\nRAG/grounding | Answer using provided documents only.\nRobustness | Ambiguous questions, typos, mixed Persian-English, informal spelling.\n\nA simple private eval JSONL schema:\n\n\n    {\"id\":\"eval_0001\",\"category\":\"instruction_following\",\"language\":\"fa\",\"input\":\"<Persian user prompt>\",\"expected_behavior\":\"Follow all constraints; answer in Persian; no extra sections.\",\"must_include\":[],\"must_not_include\":[],\"notes\":\"Private held-out eval. Do not train on this.\"}\n    {\"id\":\"eval_0002\",\"category\":\"safety\",\"language\":\"fa\",\"input\":\"<Persian unsafe request>\",\"expected_behavior\":\"Refuse briefly in Persian and offer a safe alternative.\",\"must_include\":[],\"must_not_include\":[],\"notes\":\"Private held-out eval. 
Do not train on this.\"}\n\n\nDo not overcomplicate the first version. A small, stable, private eval is better than no eval.\n\n* * *\n\n## 9. Do not train on your evaluation data\n\nThis is important.\n\nIf you use PersianMMLU, MIZAN, ELAB, Persian IFEval, Persian MT-Bench, or leaderboard samples to guide your work, do not copy those examples into your SFT data.\n\nAlso avoid putting held-out benchmark examples into prompts for synthetic data generation.\n\nBad pattern:\n\n\n    Take these PersianMMLU examples and generate many similar examples.\n\n\nBetter pattern:\n\n\n    We need examples that test high-school-level Persian science explanations, but do not copy from any benchmark. Generate new questions from independently sourced material that is not in the held-out eval set.\n\n\nUse public benchmarks to define **categories and failure modes**, not to create near-duplicate training examples.\n\nThis matters because contamination can make a model look better on a leaderboard without actually becoming more useful.\n\n* * *\n\n## 10. 
Synthetic data can help, but only after the seed set is clear\n\nSynthetic data is useful, especially for low-resource languages, but it needs quality control.\n\nUseful references and tools:\n\n  * Self-Instruct\n  * WizardLM / Evol-Instruct\n  * distilabel\n  * Bonito\n  * Aya Dataset\n  * Bactrian-X\n  * MURI\n\n\n\nFor Persian, I would be careful about:\n\nRisk | What to do\n---|---\nTranslationese | Have native/fluent reviewers check samples.\nCultural mismatch | Include Persian-local examples, not only translated English tasks.\nRepetition | Deduplicate prompts and answers.\nOverly generic assistant style | Add real target-domain tasks.\nWrong facts | Verify factual/domain answers.\nUnsafe completions | Add safety filters and alignment eval.\nTeacher model bias | Use multiple teacher models or human review for important categories.\nBenchmark contamination | Keep eval examples out of generation prompts.\n\nA safe synthetic expansion pattern:\n\n  1. Write 200–500 excellent seed examples.\n  2. Define task categories and style rules.\n  3. Generate candidate examples.\n  4. Filter automatically for schema/length/language.\n  5. Review samples manually.\n  6. Deduplicate.\n  7. Train a small run.\n  8. Evaluate.\n  9. Add more only where eval shows a gap.\n\n\n\n* * *\n\n## 11. Data selection: quality, complexity, diversity\n\nA useful mental model is to select data using three axes:\n\nAxis | Meaning\n---|---\nQuality | Is the answer correct, natural, safe, and useful?\nComplexity | Does it teach the model nontrivial behavior?\nDiversity | Does it cover enough tasks, domains, styles, and difficulty levels?\n\nRelevant papers:\n\n  * LIMA: small, carefully curated instruction data can be surprisingly effective.\n  * AlpaGasus: filtering noisy instruction data can improve results.\n  * What Makes Good Data for Alignment? 
/ Deita: quality, complexity, and diversity are useful selection dimensions.\n  * Long Is More for Alignment: simple selection heuristics such as response length can be strong baselines.\n  * Large-Scale Data Selection for Instruction Tuning: automatic data selection methods do not always scale cleanly, so evaluation and ablation matter.\n\n\n\nFor a Persian SLM, I would avoid both extremes:\n\n  * only tiny “perfect” examples with no diversity\n  * huge synthetic data dumps with no review\n\n\n\nA balanced dataset is usually better.\n\n* * *\n\n## 12. Measure tokenizer cost for an SLM\n\nFor a small model, tokenization can matter a lot.\n\nIf Persian text becomes much longer than equivalent English text under your tokenizer, the model uses more context and compute just to represent the input. That can hurt training efficiency and inference quality.\n\nMeasure this before scaling your dataset.\n\nExample:\n\n\n    # pip install transformers\n\n    from transformers import AutoTokenizer\n\n    tokenizer = AutoTokenizer.from_pretrained(\"<model-or-tokenizer-name>\")\n\n    samples = [\n        \"این یک جملهٔ نمونه به زبان فارسی است.\",\n        \"لطفاً این متن را در سه bullet point خلاصه کن.\",\n        \"کاربر می‌خواهد دربارهٔ <domain> توضیح ساده‌ای دریافت کند.\",\n    ]\n\n    for s in samples:\n        ids = tokenizer(s, add_special_tokens=False)[\"input_ids\"]\n        words = s.split()\n        print({\n            \"text\": s,\n            \"chars\": len(s),\n            \"words\": len(words),\n            \"tokens\": len(ids),\n            \"tokens_per_word\": len(ids) / max(1, len(words)),\n            \"tokens_per_char\": len(ids) / max(1, len(s)),\n        })\n\n\nAlso check:\n\nMetric | Why\n---|---\naverage tokens per Persian example | Cost and context length\ntruncation rate | Whether long examples are being cut\ntokens per word | Rough tokenization fertility\nPersian vs English equivalent length | Whether the tokenizer is inefficient for Persian\nmixed 
Persian-English examples | Important for technical assistants\n\nIf tokenizer cost is bad, possible responses include:\n\n  * choose a better base model/tokenizer\n  * shorten examples\n  * reduce unnecessary verbosity\n  * use narrower tasks\n  * consider language adaptation / continued pretraining\n  * avoid assuming SFT alone will solve everything\n\n\n\nPersianMind is an example where vocabulary adaptation was part of the Persian model-building story, so this is not just a theoretical concern.\n\n* * *\n\n## 13. HF / TRL formatting matters\n\nGood content can become bad training data if the format is wrong.\n\nFor Hugging Face TRL, check:\n\n  * TRL SFTTrainer\n  * TRL dataset formats\n  * Transformers chat templates\n\n\n\nCommon formats include:\n\n### Conversational format\n\n\n    {\"messages\":[{\"role\":\"system\",\"content\":\"You are a helpful Persian assistant.\"},{\"role\":\"user\",\"content\":\"<Persian user message>\"},{\"role\":\"assistant\",\"content\":\"<Persian assistant answer>\"}]}\n\n\n### Prompt-completion format\n\n\n    {\"prompt\":\"<Persian instruction>\",\"completion\":\"<Persian answer>\"}\n\n\nThings to verify:\n\nIssue | Why it matters\n---|---\nCorrect chat template | Chat models expect specific control tokens and role formatting.\nAssistant-only loss | You may want loss only on assistant responses, not user prompts.\nCompletion-only loss | For prompt-completion SFT, train on completions rather than prompts.\nMulti-turn formatting | Roles and turn boundaries must be unambiguous.\nSystem prompt consistency | Random system messages can create unstable behavior.\nPersian punctuation/normalization | Inconsistency can add avoidable noise.\nTrain/inference match | The fine-tuning template should match how you will call the model later.\n\nA dataset can look good in a spreadsheet and still fail because the model was trained on the wrong serialized chat format.\n\n* * *\n\n## 14. 
Minimal training data schema\n\nFor a Persian assistant SFT dataset, I would keep metadata. It makes filtering and later analysis much easier.\n\nExample JSONL:\n\n\n    {\"id\":\"sft_000001\",\"messages\":[{\"role\":\"system\",\"content\":\"You are a helpful Persian assistant.\"},{\"role\":\"user\",\"content\":\"<Persian user prompt>\"},{\"role\":\"assistant\",\"content\":\"<Persian assistant response>\"}],\"source\":\"manual_seed\",\"language\":\"fa\",\"variety\":\"iranian_persian\",\"domain\":\"general\",\"quality_checked\":true,\"reviewer_type\":\"native_or_fluent\",\"license\":\"<license>\",\"notes\":\"Do not include eval examples.\"}\n    {\"id\":\"sft_000002\",\"messages\":[{\"role\":\"system\",\"content\":\"You are a helpful Persian assistant.\"},{\"role\":\"user\",\"content\":\"<Persian user prompt>\"},{\"role\":\"assistant\",\"content\":\"<Persian assistant response>\"}],\"source\":\"synthetic_reviewed\",\"teacher_model\":\"<teacher-model>\",\"language\":\"fa\",\"variety\":\"iranian_persian\",\"domain\":\"<domain>\",\"quality_checked\":true,\"reviewer_type\":\"fluent\",\"license\":\"<license>\",\"notes\":\"Generated from category spec, not from benchmark examples.\"}\n\n\nUseful metadata fields:\n\nField | Purpose\n---|---\n`id` | Deduplication and audit trail\n`source` | manual, translated, synthetic, scraped, domain expert, etc.\n`teacher_model` | Needed if synthetic\n`language` | Persian/Farsi, Dari, Tajik, mixed, etc.\n`variety` | Iranian Persian, Dari, Tajik, mixed\n`domain` | general, medical, legal, education, support, etc.\n`quality_checked` | Whether reviewed\n`reviewer_type` | native, fluent, expert, automatic only\n`license` | Reuse constraints\n`eval_overlap_checked` | Whether contamination check was done\n`notes` | Known caveats\n\n* * *\n\n## 15. 
Dataset publication quality\n\nIf you publish the dataset on Hugging Face, the dataset card is part of the quality.\n\nUse:\n\n  * Hugging Face Dataset Cards\n  * Create a dataset card\n\n\n\nAt minimum, document:\n\nDataset card item | What to write\n---|---\nIntended use | SFT, evaluation, DPO, RAG, domain adaptation, etc.\nLanguage scope | Iranian Persian, Dari, Tajik, code-switching, etc.\nData sources | Manual, translated, synthetic, scraped, domain documents\nGeneration process | Prompts, teacher models, translation method\nHuman review | Who reviewed, how much, what criteria\nFiltering | Dedup, language ID, safety filters, length filters\nSplits | Train/dev/test, held-out eval, no overlap policy\nContamination policy | What benchmarks were excluded\nLicense | Dataset license and inherited source licenses\nLimitations | Translation artifacts, domain gaps, safety gaps\nEthical notes | Bias, harmful content handling, privacy considerations\n\nA dataset without documentation may be hard for others to trust, even if the examples look good.\n\n* * *\n\n## 16. Special case: domain assistants\n\nIf the assistant is for a domain, general Persian SFT data is not enough.\n\nDomain | Extra requirement\n---|---\nMedical | Expert validation, conservative answers, disclaimers, Persian medical QA eval\nLegal | Jurisdiction-specific knowledge, refusal boundaries, citations\nEducation | Curriculum alignment, grade level, step-by-step explanations\nCustomer support | Company/product-specific data, policy consistency\nReligious/cultural | Careful cultural review, sensitivity, source grounding\nNews/current events | RAG and source freshness, not static SFT\n\nFor domain assistants, I would usually prefer:\n\n  * small expert-reviewed SFT set\n  * strong RAG pipeline\n  * domain-specific private eval\n  * refusal/uncertainty examples\n  * citations or source-grounded answers\n\n\n\nrather than a huge generic Persian chat dataset.\n\n* * *\n\n## 17. 
What I would avoid\n\nI would avoid these patterns:\n\nPattern | Why risky\n---|---\n“Generate 100k Persian conversations with ChatGPT and fine-tune.” | Likely repetitive, translation-like, weakly targeted, and hard to audit.\nTraining on benchmark examples | Contamination and misleading scores.\nUsing only translated English instruction data | Persian cultural and linguistic naturalness may be weak.\nNo private eval set | You cannot tell what improved.\nNo baseline comparison | You may build data for a model that is simply the wrong base model.\nNo tokenizer check | SLM context/efficiency problems may be misdiagnosed as data quality problems.\nIgnoring chat template | Bad formatting can erase the value of good examples.\nNo dataset card | Others cannot assess provenance, license, or limitations.\nTreating leaderboard rank as product readiness | Leaderboards are useful signals, not deployment guarantees.\n\n* * *\n\n## 18. A compact roadmap\n\nHere is the full process in one table.\n\nStep | Action | Output\n---|---|---\n1 | Define target assistant | Scope statement\n2 | Pick public Persian eval anchors | Benchmark list\n3 | Build private eval set | 100–500 held-out examples\n4 | Evaluate base model | Failure categories\n5 | Compare Persian baselines | Model comparison\n6 | Measure tokenizer cost | Tokens/example, truncation rate\n7 | Inspect existing datasets | Data inventory\n8 | Decide missing layer | SFT, CPT, RAG, safety, domain, etc.\n9 | Build manual seed data | High-quality seed examples\n10 | Expand synthetically if useful | Candidate pool\n11 | Filter and review | Clean training set\n12 | Deduplicate/decontaminate | Safe train/dev/test split\n13 | Train with correct format | SFT/LoRA/QLoRA run\n14 | Re-evaluate | Failure report\n15 | Add targeted data | Next iteration\n16 | Document dataset | Dataset card\n\n* * *\n\n## 19. 
Final practical checklist\n\nBefore calling the dataset “high quality,” I would want to answer these questions:\n\n  * What exact Persian variety/register is targeted?\n  * Which public benchmarks did you inspect?\n  * What is your private held-out eval set?\n  * What are the base model’s main failures?\n  * Did you compare against Persian-specialized baselines?\n  * Is the data native, translated, synthetic, or mixed?\n  * Who reviewed Persian fluency?\n  * Are answers factually checked where needed?\n  * Is there enough diversity of task type and difficulty?\n  * Are safety/refusal examples included?\n  * Are cultural/local examples included?\n  * Are benchmark examples excluded from training?\n  * Are duplicates removed?\n  * Is the dataset formatted for the exact chat template?\n  * Is assistant-only/completion-only loss handled correctly?\n  * Have you measured tokenization cost?\n  * Is the license clear?\n  * Is the dataset card complete?\n\n\n\nIf most of these are answered, the dataset is much closer to “high quality” in a practical sense.\n\n* * *\n\n## Bottom line\n\nFor a Persian SLM assistant, I would define dataset quality like this:\n\n> A high-quality dataset is a documented, decontaminated, eval-driven Persian dataset that targets the measured weaknesses of the chosen model: fluency, instruction following, multi-turn behavior, factuality, cultural fit, safety, domain knowledge, retrieval behavior, formatting, or tokenizer efficiency.\n\nSo the best first move is not to generate a huge dataset. The best first move is to build the evaluation map, measure the current model, and then create only the data that addresses the measured gaps.",
  "title": "How can I build a High Quality dataset?"
}