How can I build a high-quality dataset?
Hugging Face Forums [Unofficial]
June 24, 2026
Building a high-quality Persian SFT dataset for a Small Language Model (SLM) is a great initiative, as SLMs are highly sensitive to dataset noise. Here is a structured approach to achieve this:
1. Leverage and Refine Existing Datasets (Don’t start from scratch)
Before generating new data, you can build upon high-quality Persian community datasets that have already gone through some level of curation:
- FarsInstruct: A comprehensive, high-quality Persian instruction dataset covering diverse NLP tasks (presented at COLING 2025).
- Matina: Curated specifically for Persian cultural alignment and multi-turn instruction following.
- PersianAICommunity (Hugging Face): Check their curated collection, which includes datasets like merged_persian_alpaca and conversational datasets.
2. The “Translation + LLM-as-a-Judge” Pipeline
Instead of letting ChatGPT generate Persian instructions from scratch (which often leads to unnatural phrasing), translate high-quality English datasets (e.g., UltraFeedback, ShareGPT, or Magpie) and filter them:
- Translate: Use frontier models (like GPT-4o or Claude 3.5 Sonnet) to translate the English datasets into natural Persian, explicitly instructing them to avoid literal translations.
- Filter (LLM-as-a-Judge): Use a strong model (e.g., GPT-4o) to rate the quality, naturalness, and grammar of the translated Persian on a scale of 1-5. Keep only samples rated 4 or 5. This drastically reduces noise.
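The filtering step above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the prompt wording, the `translation` field name, and the `ask_judge` callable are all assumptions made for the example. `ask_judge` stands in for whatever client you use to call the judge model and is expected to take a prompt string and return the model's text reply.

```python
from typing import Callable, Iterable, Optional

# Hypothetical rubric prompt; the exact wording is an assumption.
JUDGE_PROMPT = (
    "Rate the following Persian text for quality, naturalness, and grammar "
    "on a scale of 1 to 5. Reply with a single digit.\n\n{text}"
)


def parse_score(reply: str) -> Optional[int]:
    """Return the first digit in the range 1-5 found in the reply, or None."""
    for ch in reply:
        if ch in "12345":
            return int(ch)
    return None


def filter_by_judge(
    samples: Iterable[dict],
    ask_judge: Callable[[str], str],
    min_score: int = 4,
) -> list:
    """Keep only samples whose judged score is at least min_score."""
    kept = []
    for sample in samples:
        # "translation" is an assumed field name for the translated text.
        reply = ask_judge(JUDGE_PROMPT.format(text=sample["translation"]))
        score = parse_score(reply)
        # Replies that cannot be parsed are dropped rather than guessed.
        if score is not None and score >= min_score:
            kept.append({**sample, "judge_score": score})
    return kept
```

Passing the judge in as a callable keeps the filter independent of any particular API, and makes it easy to test with a stub before spending money on real calls.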
3. High-Quality Self-Instruct Generation
If you must generate new Persian instructions:
- Use Frontier Models: Generate using Claude 3.5 Sonnet or GPT-4o rather than base ChatGPT (GPT-3.5), as they have a much better grasp of Persian grammar and cultural nuances.
- Define Tone & Style: Instruct the generator model to maintain a consistent style (e.g., formal/written Persian vs. colloquial/spoken Persian, depending on your SLM’s use case).
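One way to keep the style consistent is to build every generation prompt from a single template with an explicit register setting. The sketch below is only an illustration: the style descriptions, the `build_generation_prompt` helper, and the prompt wording are all hypothetical and should be adapted to your SLM's use case.

```python
# Hypothetical style instructions, one per register.
STYLE_GUIDES = {
    "formal": "Write in formal, written Persian.",
    "colloquial": "Write in colloquial, spoken Persian.",
}


def build_generation_prompt(topic: str, register: str = "formal") -> str:
    """Build a generation prompt that pins the Persian register."""
    if register not in STYLE_GUIDES:
        raise ValueError(f"unknown register: {register!r}")
    return (
        "Generate one instruction and its response in Persian about: "
        f"{topic}\n"
        f"{STYLE_GUIDES[register]} "
        "Use the same register in both the instruction and the response."
    )


if __name__ == "__main__":
    print(build_generation_prompt("cooking rice", register="colloquial"))
```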
4. Essential Post-Processing for Persian Text
Persian text is highly prone to encoding and formatting noise. Clean your final dataset using a post-processing script:
- Unicode Normalization: Ensure standard Persian characters (for example, convert Arabic Kaf ك and Yeh ي to Persian ک and ی).
- Spacing: Use libraries like hazm or parsivar to fix half-space (nim-fasele, the zero-width non-joiner) formatting.
- Rule-based Filtering: Filter out repetitive generations, responses containing too much English code/text (unless desired), or overly short responses.
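A minimal standard-library sketch of the character normalization and rule-based filtering steps might look like this. The thresholds (`min_chars`, `max_latin`, `max_repeat_share`) and the helper names are assumptions chosen for illustration and need tuning on your own data. Half-space correction is not attempted here; that part is better left to hazm or parsivar.

```python
# Arabic Kaf (U+0643) -> Persian Kaf (U+06A9),
# Arabic Yeh (U+064A) -> Persian Yeh (U+06CC).
ARABIC_TO_PERSIAN = str.maketrans({"\u0643": "\u06A9", "\u064A": "\u06CC"})


def normalize_chars(text: str) -> str:
    """Replace Arabic Kaf and Yeh with their Persian counterparts."""
    return text.translate(ARABIC_TO_PERSIAN)


def latin_ratio(text: str) -> float:
    """Share of alphabetic characters that are ASCII Latin letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    latin = sum(1 for ch in letters if ch.isascii())
    return latin / len(letters)


def is_repetitive(text: str, max_repeat_share: float = 0.5) -> bool:
    """Flag text where a single word makes up most of the words."""
    words = text.split()
    if len(words) < 4:
        return False
    most_common = max(words.count(w) for w in set(words))
    return most_common / len(words) > max_repeat_share


def keep_response(text: str, min_chars: int = 20, max_latin: float = 0.3) -> bool:
    """Apply the three rules: not too short, not mostly English, not repetitive."""
    text = normalize_chars(text).strip()
    if len(text) < min_chars:
        return False
    if latin_ratio(text) > max_latin:
        return False
    return not is_repetitive(text)
```

Lower `max_latin` if your SLM should answer only in Persian, or raise it if code snippets in responses are expected.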