{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreicrzgifese6xocyq7irkopwatokp4ikm4drivoqshuu3xg4753wyq",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mozpkxlhk4o2"
},
"path": "/t/how-can-i-build-a-high-quality-dataset/176571?page=2#post_22",
"publishedAt": "2026-06-24T08:55:14.000Z",
"site": "https://discuss.huggingface.co",
"textContent": "Building a high-quality Persian SFT dataset for a Small Language Model (SLM) is a great initiative, and data quality matters all the more because SLMs are highly sensitive to dataset noise. Here is a structured approach:\n\n### 1. Leverage and Refine Existing Datasets (Don’t start from scratch)\n\nBefore generating new data, you can build upon high-quality Persian community datasets that have already gone through some level of curation:\n\n * **FarsInstruct**: A comprehensive, high-quality Persian instruction dataset covering diverse NLP tasks (presented at COLING 2025).\n * **Matina**: Curated specifically for Persian cultural alignment and multi-turn instruction following.\n * **PersianAICommunity (Hugging Face)**: Check their curated collection, which includes datasets like `merged_persian_alpaca` and conversational datasets.\n\n\n\n### 2. The “Translation + LLM-as-a-Judge” Pipeline\n\nInstead of letting ChatGPT generate Persian instructions from scratch (which often leads to unnatural phrasing), translate high-quality English datasets (e.g., _UltraFeedback_, _ShareGPT_, or _Magpie_) and filter them:\n\n 1. **Translate**: Use frontier models (like GPT-4o or Claude 3.5 Sonnet) to translate the English datasets into natural Persian, explicitly instructing them to avoid literal translations.\n 2. **Filter (LLM-as-a-Judge)**: Use a strong model (e.g., GPT-4o) to rate the quality, naturalness, and grammar of the translated Persian on a scale of 1 to 5. Keep only samples rated 4 or 5. This drastically reduces noise.\n\n\n\n### 3. High-Quality Self-Instruct Generation\n\nIf you must generate new Persian instructions:\n\n * **Use Frontier Models**: Generate using Claude 3.5 Sonnet or GPT-4o rather than base ChatGPT (GPT-3.5), as they have a much better grasp of Persian grammar and cultural nuances.\n * **Define Tone & Style**: Instruct the generator model to maintain a consistent style (e.g., formal/written Persian vs. 
colloquial/spoken Persian, depending on your SLM’s use case).\n\n\n\n### 4. Essential Post-Processing for Persian Text\n\nPersian text is highly prone to encoding and formatting noise. Clean your final dataset using a post-processing script:\n\n * **Unicode Normalization**: Map Arabic code points to their Persian equivalents (e.g., Arabic _Kaf_ `ك` and _Yeh_ `ي` to Persian `ک` and `ی`).\n * **Spacing**: Use libraries like `hazm` or `parsivar` to fix half-space (nim-fasele, the zero-width non-joiner `U+200C`) formatting.\n * **Rule-based Filtering**: Filter out repetitive generations, responses containing too much English code/text (unless desired), or overly short responses.\n\n",
"title": "How can i build a High Quality dataset?"
}