Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreicrzgifese6xocyq7irkopwatokp4ikm4drivoqshuu3xg4753wyq",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mozpkxlhk4o2"
  },
  "path": "/t/how-can-i-build-a-high-quality-dataset/176571?page=2#post_22",
  "publishedAt": "2026-06-24T08:55:14.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "Building a high-quality Persian SFT dataset for a Small Language Model (SLM) is worth the effort, because SLMs are highly sensitive to dataset noise. Here is a structured approach:\n\n### 1. Leverage and Refine Existing Datasets (Don’t start from scratch)\n\nBefore generating new data, build on Persian community datasets that have already gone through some level of curation:\n\n  * **FarsInstruct**: A comprehensive, high-quality Persian instruction dataset covering diverse NLP tasks (presented at COLING 2025).\n  * **Matina**: Curated specifically for Persian cultural alignment and multi-turn instruction following.\n  * **PersianAICommunity (Hugging Face)**: Check their curated collection, which includes datasets like `merged_persian_alpaca` and conversational datasets.\n\n### 2. The “Translation + LLM-as-a-Judge” Pipeline\n\nLetting ChatGPT generate Persian instructions from scratch often leads to unnatural phrasing. Instead, translate high-quality English datasets (e.g., _UltraFeedback_, _ShareGPT_, or _Magpie_) and filter the results:\n\n  1. **Translate**: Use frontier models (like GPT-4o or Claude 3.5 Sonnet) to translate the English datasets into natural Persian, explicitly instructing them to avoid literal translations.\n  2. **Filter (LLM-as-a-Judge)**: Use a strong model (e.g., GPT-4o) to rate the quality, naturalness, and grammar of each translated sample on a scale of 1-5. Keep only samples rated 4 or 5, so that low-quality translations are filtered out before training.\n\n### 3. High-Quality Self-Instruct Generation\n\nIf you must generate new Persian instructions:\n\n  * **Use Frontier Models**: Generate with Claude 3.5 Sonnet or GPT-4o rather than base ChatGPT (GPT-3.5), as they have a much better grasp of Persian grammar and cultural nuances.\n  * **Define Tone & Style**: Instruct the generator model to maintain a consistent style (e.g., formal/written Persian vs. colloquial/spoken Persian, depending on your SLM’s use case).\n\n### 4. Essential Post-Processing for Persian Text\n\nPersian text is highly prone to encoding and formatting noise. Clean your final dataset with a post-processing script:\n\n  * **Unicode Normalization**: Map Arabic code points to their standard Persian equivalents (e.g., Arabic _Kaf_ `ك` and _Yeh_ `ي` to Persian `ک` and `ی`).\n  * **Spacing**: Use libraries like `hazm` or `parsivar` to fix half-space formatting (the zero-width non-joiner, U+200C `\u200c`).\n  * **Rule-based Filtering**: Filter out repetitive generations, responses containing too much English code/text (unless desired), and overly short responses.\n",
  "title": "How can i build a High Quality dataset?"
}
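
The post-processing steps in section 4 of the post can be sketched in plain Python. This is a minimal illustration, not the author's script: the function names (`normalize_persian`, `latin_ratio`, `keep_sample`) and the thresholds (`min_chars`, `max_latin_ratio`, `max_line_repeats`) are hypothetical and should be tuned on your own data. Half-space correction is not covered here and is left to `hazm` or `parsivar`, as the post suggests.

```python
from collections import Counter

# Arabic code points named in the post, mapped to their Persian equivalents.
ARABIC_TO_PERSIAN = str.maketrans({
    "\u0643": "\u06A9",  # Arabic Kaf -> Persian Kaf
    "\u064A": "\u06CC",  # Arabic Yeh -> Persian Yeh
})


def normalize_persian(text: str) -> str:
    """Replace Arabic Kaf/Yeh with the Persian forms."""
    return text.translate(ARABIC_TO_PERSIAN)


def latin_ratio(text: str) -> float:
    """Share of alphabetic characters that are ASCII (i.e. Latin) letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(1 for ch in letters if ch.isascii()) / len(letters)


def keep_sample(
    response: str,
    min_chars: int = 20,           # assumed threshold for "overly short"
    max_latin_ratio: float = 0.5,  # assumed threshold for "too much English"
    max_line_repeats: int = 3,     # assumed threshold for "repetitive"
) -> bool:
    """Rule-based filter: drop short, mostly-English, or repetitive responses."""
    text = response.strip()
    if len(text) < min_chars:
        return False
    if latin_ratio(text) > max_latin_ratio:
        return False
    # Count identical non-empty lines as a cheap repetition signal.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if lines and max(Counter(lines).values()) > max_line_repeats:
        return False
    return True
```

In a pipeline you would typically call `normalize_persian` on every instruction and response first, then keep only the pairs for which `keep_sample` returns `True`.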