
How can I build a high-quality dataset?

Hugging Face Forums [Unofficial] June 24, 2026

Building a high-quality Persian SFT dataset for a Small Language Model (SLM) is a great initiative, as SLMs are highly sensitive to dataset noise. Here is a structured approach to achieve this:

1. Leverage and Refine Existing Datasets (Don’t start from scratch)

Before generating new data, you can build upon high-quality Persian community datasets that have already gone through some level of curation:

FarsInstruct: A comprehensive, high-quality Persian instruction dataset covering diverse NLP tasks (presented at COLING 2025).
Matina: Curated specifically for Persian cultural alignment and multi-turn instruction following.
PersianAICommunity (Hugging Face): Check their curated collection, which includes datasets like merged_persian_alpaca and conversational datasets.

2. The “Translation + LLM-as-a-Judge” Pipeline

Instead of letting ChatGPT generate Persian instructions from scratch (which often leads to unnatural phrasing), translate high-quality English datasets (e.g., UltraFeedback, ShareGPT, or Magpie) and filter the results:

Translate: Use frontier models (like GPT-4o or Claude 3.5 Sonnet) to translate the English datasets into natural Persian, explicitly instructing them to avoid literal translations.
Filter (LLM-as-a-Judge): Use a strong model (e.g., GPT-4o) to rate the quality, naturalness, and grammar of each translated Persian sample on a scale of 1-5. Keep only samples rated 4 or 5, which substantially reduces noise.
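
The filter step above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: `judge` is a hypothetical callable that sends a prompt to whatever strong model you use and returns its text reply, and the prompt wording, the `instruction`/`response` field names, and the strict single-digit parsing are all assumptions made for illustration.

```python
import re
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical judge prompt; tune the wording for your own judge model.
JUDGE_PROMPT = (
    "Rate the following Persian instruction and response for quality, "
    "naturalness, and grammar on a scale of 1-5. "
    "Reply with a single digit only.\n\n"
    "Instruction:\n{instruction}\n\nResponse:\n{response}"
)


def parse_score(reply: str) -> Optional[int]:
    """Return the score if the reply is exactly one digit 1-5, else None."""
    match = re.fullmatch(r"\s*([1-5])\s*", reply)
    return int(match.group(1)) if match else None


def filter_by_judge(
    samples: Iterable[Dict[str, str]],
    judge: Callable[[str], str],  # assumed: prompt in, model reply out
    min_score: int = 4,
) -> List[Dict[str, object]]:
    """Keep only samples whose judge score is at least min_score."""
    kept = []
    for sample in samples:
        prompt = JUDGE_PROMPT.format(
            instruction=sample["instruction"], response=sample["response"]
        )
        score = parse_score(judge(prompt))
        # Replies that cannot be parsed are dropped rather than guessed.
        if score is not None and score >= min_score:
            kept.append({**sample, "judge_score": score})
    return kept
```

Dropping unparseable replies is a conservative choice: it loses a few samples but never lets an unrated sample into the dataset.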

3. High-Quality Self-Instruct Generation

If you must generate new Persian instructions:

Use Frontier Models: Generate using Claude 3.5 Sonnet or GPT-4o rather than base ChatGPT (GPT-3.5), as they have a much better grasp of Persian grammar and cultural nuances.
Define Tone & Style: Instruct the generator model to maintain a consistent style (e.g., formal/written Persian vs. colloquial/spoken Persian, depending on your SLM’s use case).
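
One way to keep the style consistent is to build every generation prompt from the same template, so the register is fixed once rather than retyped per request. The sketch below assumes a hypothetical `build_generation_prompt` helper and illustrative style-guide wording; neither comes from a specific library.

```python
# Hypothetical style guides; adjust the wording to your SLM's use case.
STYLE_GUIDES = {
    "formal": "Write in formal, written Persian. Avoid colloquial contractions.",
    "colloquial": "Write in colloquial, spoken Persian, as in everyday conversation.",
}


def build_generation_prompt(topic: str, register: str = "formal") -> str:
    """Build a generation prompt that pins a single register."""
    if register not in STYLE_GUIDES:
        raise ValueError(f"unknown register: {register!r}")
    return (
        "You are writing training data for a Persian assistant.\n"
        f"{STYLE_GUIDES[register]}\n"
        f"Topic: {topic}\n"
        "Write one instruction and its response in Persian."
    )
```

Pick one register per dataset (or tag each sample with its register) so the SLM does not learn to mix formal and colloquial forms within a single answer.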

4. Essential Post-Processing for Persian Text

Persian text is highly prone to encoding and formatting noise. Clean your final dataset using a post-processing script:

Unicode Normalization: Map Arabic code points to their Persian equivalents, e.g., Arabic Kaf ك (U+0643) and Yeh ي (U+064A) to Persian ک (U+06A9) and ی (U+06CC).
Spacing: Use libraries like hazm or parsivar to fix half-space formatting (the nim-fasele, i.e., the zero-width non-joiner, U+200C).
Rule-based Filtering: Filter out repetitive generations, responses containing too much English code/text (unless desired), and overly short responses.
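
A post-processing script along these lines might look like the sketch below. It uses only the standard library and is an illustration, not a replacement for hazm or parsivar, which handle the linguistic half-space rules this sketch does not attempt. The function names and all thresholds (`min_chars`, `max_latin`, `max_repetition`) are assumptions to be tuned on your own data.

```python
import re

ZWNJ = "\u200c"  # zero-width non-joiner, the Persian half-space

# Arabic Kaf and Yeh mapped to their Persian counterparts.
ARABIC_TO_PERSIAN = str.maketrans({"\u0643": "\u06a9", "\u064a": "\u06cc"})


def normalize_persian(text: str) -> str:
    """Unify Kaf/Yeh code points and tidy stray half-spaces and spaces."""
    text = text.translate(ARABIC_TO_PERSIAN)
    text = re.sub(f"{ZWNJ}+", ZWNJ, text)  # collapse repeated ZWNJ
    # A ZWNJ next to whitespace has no effect, so drop it.
    text = re.sub(rf"{ZWNJ}(?=\s)|(?<=\s){ZWNJ}", "", text)
    text = re.sub(r"[ \t]+", " ", text)  # collapse runs of spaces/tabs
    return text.strip(ZWNJ + " \t\r\n")


def latin_ratio(text: str) -> float:
    """Fraction of alphabetic characters that are ASCII letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch.isascii() for ch in letters) / len(letters)


def repetition_ratio(text: str, n: int = 3) -> float:
    """Share of word n-grams that are duplicates (0.0 = no repetition)."""
    words = text.split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return 1 - len(set(ngrams)) / len(ngrams)


def keep_sample(
    response: str,
    min_chars: int = 20,          # assumed threshold
    max_latin: float = 0.3,       # assumed threshold
    max_repetition: float = 0.5,  # assumed threshold
) -> bool:
    """Apply the three rule-based filters to a normalized response."""
    text = normalize_persian(response)
    if len(text) < min_chars:
        return False
    if latin_ratio(text) > max_latin:
        return False
    return repetition_ratio(text) <= max_repetition
```

If your dataset intentionally includes code or English terms, raise `max_latin` or skip that check for the relevant samples.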

