{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreib4kruuz7dkc5d77lklgkquajbapnxu4a7cwoikgc4s5cci5k5gzm",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mnqkxiqtkmu2"
  },
  "path": "/t/how-can-i-build-a-high-quality-dataset/176571#post_4",
  "publishedAt": "2026-06-08T01:21:16.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Adapting Large Language Models",
    "TibetanLLM: CPT + SFT for Tibetan language adaptation",
    "Qwen3.5-0.8B",
    "Qwen3.5-0.8B-Base",
    "Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs",
    "OSCAR 23.01 docs",
    "Aya",
    "KenLM",
    "Hazm",
    "fastText",
    "TRL SFTTrainer",
    "TRL dataset formats",
    "Transformers chat templates"
  ],
  "textContent": "Oh. That clarification narrows the focus quite a lot:\n\n* * *\n\n## Short answer\n\nYour understanding is mostly correct.\n\nIf the target is **Iranian Persian only** , the model is **Qwen3.5-0.8B** , and the model still struggles with basic Persian grammar/syntax, then I agree that this is probably **not an SFT-only problem**.\n\nA useful mental model is:\n\nStage | Best for | Not ideal for\n---|---|---\nCPT / continued pretraining | Persian language grounding, grammar, syntax, orthography, style, broad domain familiarity | Teaching exact assistant behavior or reliable factual recall\nSFT | Instruction following, teacher/student answer style, dialogue format, refusals, formatting, persona | Repairing weak base Persian language ability\nRAG / retrieval | Reliable factual recall, changing knowledge, exact textbook/document facts | Improving the model’s internal Persian grammar\n\nSo I would summarize it like this:\n\n> CPT gives the model more Persian language mass. SFT tells it how to behave as an assistant. RAG or a small local knowledge base is better when exact knowledge must be recalled reliably.\n\nYour experience with ~800 MB cleaned Persian Wikipedia improving Qwen3-0.6B is consistent with that.\n\n* * *\n\n## 1. CPT vs SFT: your understanding is mostly right\n\nI would phrase it slightly more carefully:\n\n### CPT\n\nContinued pretraining is still next-token prediction. It can move the model’s internal distribution toward Persian:\n\n  * grammar\n  * syntax\n  * punctuation\n  * orthography\n  * style\n  * common expressions\n  * common factual associations\n  * domain familiarity\n  * local writing conventions\n\n\n\nThis is why CPT is often used for language adaptation. 
Meta’s overview of LLM adaptation also describes continued pretraining as useful when the goal is to add capabilities such as multilingual ability, while noting that it is more expensive and can risk forgetting: Adapting Large Language Models.\n\nFor low-resource language adaptation, a similar staged pattern appears in some work like:\n\n  * CPT for language grounding\n  * SFT for task/instruction specialization\n\n\n\nExample: TibetanLLM: CPT + SFT for Tibetan language adaptation.\n\n### SFT\n\nSFT is better for teaching the model:\n\n  * how to answer as a teacher\n  * how to follow instructions\n  * how to produce student-friendly explanations\n  * how to ask clarifying questions\n  * how to refuse unsafe requests\n  * how to use a particular chat format\n  * how verbose or concise it should be\n  * what style of Persian answer you want\n\n\n\nBut if the base model cannot produce stable Persian sentences, SFT often becomes inefficient. You may end up teaching answer patterns on top of a weak language foundation.\n\nSo I would agree with your main diagnosis:\n\n> If the model cannot reliably handle basic Persian grammar and syntax, CPT or language adaptation should come before serious SFT.\n\n* * *\n\n## 2. Consider whether to CPT the Base model or the post-trained model\n\nQwen provides both:\n\n  * Qwen3.5-0.8B\n  * Qwen3.5-0.8B-Base\n\n\n\nIf possible, I would consider this order:\n\n\n    Qwen3.5-0.8B-Base\n      -> Persian CPT / language adaptation\n      -> SFT for teacher/student assistant behavior\n      -> optional preference tuning / DPO later\n\n\nWhy?\n\nBecause CPT on an already post-trained/instruct model can still work, but it may partially degrade instruction-following behavior. 
If you CPT the instruct/post-trained model, I would keep a small instruction-following regression eval and check it after every CPT run.\n\nPractical rule:\n\nIf you use… | Watch for…\n---|---\nBase model | You need SFT afterward before it behaves like an assistant\nInstruct/post-trained model | CPT may damage some instruction-following behavior\nLoRA CPT | Safer/cheaper, but limited capacity compared with full CPT\nFull CPT | More capacity, more cost, more forgetting risk\n\nSince you can run LoRA rank 64 and it already helped in practice, it sounds like a reasonable constraint-aware approach.\n\n* * *\n\n## 3. Does the model need to see everything in CPT to remember it?\n\nPartly yes, but with an important caveat.\n\nIf you want the model to become broadly familiar with some knowledge, the model must see that knowledge during training somehow. CPT exposure can help the model internalize patterns and associations.\n\nBut CPT is **not a reliable database**.\n\nFor exact knowledge recall, especially facts, dates, school content, rules, or domain-specific material, I would not rely only on parametric memory. A relevant comparison is Fine-Tuning or Retrieval? 
Comparing Knowledge Injection in LLMs, which found that retrieval-augmented generation often outperforms unsupervised fine-tuning for knowledge-intensive tasks and that learning new factual information through unsupervised fine-tuning can be difficult.\n\nA practical split:\n\nKnowledge type | Better method\n---|---\nGeneral Persian grammar and syntax | CPT\nCommon Iranian Persian writing patterns | CPT\nGeneral educational style | CPT + SFT\nTeacher/student answer behavior | SFT\nExact textbook facts | RAG or local knowledge base\nChanging/current facts | RAG/search\nSmall set of very frequent facts | CPT/SFT may be acceptable\nHigh-stakes facts | Retrieval + citations + conservative answer style\n\nFor a low-end-device voice assistant, full RAG may be hard, but you can still think in layers:\n\n\n    Very common knowledge -> CPT/SFT\n    Exact or large knowledge -> compressed local KB / retrieval if possible\n    Teacher behavior -> SFT\n    Voice interface -> later ASR/TTS/latency problem\n\n\nIf you try to put all knowledge into CPT, the model may remember some of it, but recall will not be perfectly reliable, especially at 0.8B scale.\n\n* * *\n\n## 4. Your selected data sources make sense, but they have different roles\n\nYour sources:\n\n  * Persian Wikipedia\n  * Persian OSCAR\n  * Persian Aya\n\n\n\nare not equivalent. 
I would not mix them blindly.\n\nSource | Best use | Main risk\n---|---|---\nPersian Wikipedia | Clean-ish formal Persian, encyclopedic facts, stable style | Too encyclopedic; not conversational or teacher/student by itself\nPersian OSCAR | Broader web Persian, more style diversity | Very noisy, mixed language, boilerplate, duplicates, spam\nPersian Aya | Instruction-following data | More SFT-like than CPT-like; may not be ideal as raw CPT text\n\n### Wikipedia\n\nGood for:\n\n  * formal grammar\n  * basic factual associations\n  * clean-ish prose\n  * general encyclopedic style\n\n\n\nRisk:\n\n  * the model may become too encyclopedia-like\n  * not enough student/teacher dialogue\n  * not enough conversational assistant style\n\n\n\n### OSCAR\n\nOSCAR is useful, but I would treat it as **raw material**, not clean training data. The OSCAR 23.01 documentation mentions metadata such as KenLM-based harmful-content perplexity, TLSH hashes for near deduplication, sentence-level language identification, and quality warnings: OSCAR 23.01 docs.\n\nThat supports your instinct: OSCAR can be valuable, but only after strong cleaning.\n\n### Aya\n\nAya is more instruction-oriented. It may be useful for SFT or for a small instruction mixture, but I would not treat it the same way as Wikipedia/OSCAR for CPT.\n\nFor CPT, I would prefer raw fluent Persian prose.\n\nFor SFT, I would prefer instruction/response examples.\n\n* * *\n\n## 5. 
A better CPT mixture might be staged\n\nInstead of one big mixture immediately, I would test small stages.\n\nExample:\n\nStage | Data mixture | Goal\n---|---|---\nCPT-1 | Clean Persian Wikipedia | Basic grammar, syntax, formal Persian\nCPT-2 | Wikipedia + filtered OSCAR | Broader style and web Persian\nCPT-3 | Add educational Persian prose if available | Teacher/student domain adaptation\nSFT-1 | Small high-quality teacher/student examples | Assistant behavior\nSFT-2 | More instruction/multi-turn examples | Dialogue robustness\n\nI would avoid making noisy OSCAR too large early.\n\nA safe first ratio could be something like:\n\n\n    70-90% clean Persian Wikipedia / curated Persian prose\n    10-30% strongly filtered OSCAR\n    0-10% instruction-like data, if converted carefully\n\n\nThis is not a universal ratio. It is just a safe starting point.\n\nIf filtered OSCAR improves eval, increase it. If it makes outputs noisier, reduce it.\n\n* * *\n\n## 6. The n-gram filter idea is good, but use it as one signal\n\nI think your n-gram idea is practical under your constraints.\n\nKenLM is a good fit for this kind of low-cost filtering because it is fast and small compared with neural LLM filtering.\n\nBut I would not use a single rule like:\n\n\n    low perplexity = good Persian\n    high perplexity = bad Persian\n\n\nThat can fail.\n\nWhy?\n\n  * very repetitive boilerplate can have low perplexity\n  * Wikipedia-like prose may be favored too much\n  * short junk text can be unstable\n  * copied templates may look fluent but be useless\n  * unnatural but common web spam can get through\n  * good informal Persian may be rejected if your n-gram LM was trained only on formal text\n\n\n\nInstead, use n-gram scoring as one filter in a pipeline.\n\n* * *\n\n## 7. 
Good-LM / Bad-LM filtering may be stronger than one LM\n\nA useful cheap approach:\n\n\n    Good Persian LM:\n      trained on clean Persian Wikipedia + curated high-quality Persian text\n\n    Bad Persian LM:\n      trained on rejected OSCAR samples, spam, boilerplate, malformed text, mixed-language junk\n\n    Score each candidate text:\n      good_perplexity = perplexity under the good LM (lower = closer to clean Persian)\n      bad_perplexity = perplexity under the bad LM (lower = closer to junk)\n      perplexity_gap = bad_perplexity - good_perplexity (larger = better candidate)\n\n\nThen select text that scores well under the good LM and poorly under the bad LM.\n\nThis is often more useful than a single perplexity threshold.\n\nRough idea:\n\n\n    accept if:\n      good_perplexity is reasonable\n      bad_perplexity is clearly higher than good_perplexity\n      text length is reasonable\n      Persian script ratio is high\n      repetition is low\n      duplicate score is low\n\n\nDo not choose thresholds blindly. Sample 100 accepted and 100 rejected texts, read them, and adjust.\n\n* * *\n\n## 8. Cheap filtering pipeline under hardware/API constraints\n\nGiven your constraints, I would use a classical pipeline first.\n\nSomething like:\n\n\n    raw text\n      -> normalization\n      -> language/script filtering\n      -> length filtering\n      -> boilerplate removal\n      -> repetition filtering\n      -> exact dedup\n      -> near dedup\n      -> n-gram LM scoring\n      -> optional fastText classifier\n      -> manual sample audit\n      -> CPT shard\n\n\n### Step 1: Normalize Persian\n\nFor Persian preprocessing, Hazm is useful. 
It provides Persian normalization, tokenization, lemmatization, and related tools.\n\nNormalize things like:\n\n  * Arabic/Persian variants of letters\n  * spacing\n  * half-space / ZWNJ issues\n  * punctuation\n  * repeated characters\n  * strange Unicode artifacts\n\n\n\n### Step 2: Script/language ratio\n\nUse cheap rules:\n\n\n    Persian/Arabic-script character ratio\n    Latin character ratio\n    digit ratio\n    symbol ratio\n    average line length\n    number of URLs\n    number of repeated lines\n\n\nReject obvious junk before expensive scoring.\n\n### Step 3: Deduplicate\n\nDo both:\n\n  * exact dedup\n  * near dedup\n\n\n\nFor OSCAR 23.01, the documentation mentions TLSH hashes for exact and near deduplication. If you are using a different OSCAR version, you may need your own MinHash/SimHash/TLSH pipeline.\n\n### Step 4: KenLM score\n\nUse KenLM perplexity as a quality signal.\n\nTrain on your best available clean Persian text.\n\nThen score candidate documents.\n\n### Step 5: Optional small classifier\n\nIf you manually label examples as good/bad Persian, you can train a cheap classifier.\n\nfastText is useful for this kind of lightweight text classification. It is much cheaper than LLM filtering.\n\nExample labels:\n\n\n    __label__good <text>\n    __label__bad <text>\n\n\nThis can become surprisingly useful after a few thousand labeled examples.\n\n* * *\n\n## 9. 
What I would evaluate after each CPT run\n\nDo not wait until the final model.\n\nAfter each CPT run, check a small fixed eval set.\n\nEval | Why\n---|---\nPersian perplexity on held-out clean text | Did CPT improve Persian modeling?\nTokenization stats | Are examples being truncated?\nBasic grammar prompts | Can it produce correct Persian sentences?\nTeacher/student prompts | Did educational explanation improve?\nInstruction-following prompts | Did CPT damage instruction-following?\nRepetition tests | Did it become repetitive?\nMixed Persian-English prompts | Useful for technical/student settings\nSafety/refusal sanity checks | Make sure it did not become less safe\nSmall factual probes | Did knowledge improve at all?\n\nI would keep a small frozen eval like:\n\n\n    {\"id\":\"fa_grammar_001\",\"type\":\"grammar\",\"prompt\":\"<Persian grammar prompt>\",\"expected_behavior\":\"Produce fluent Iranian Persian.\"}\n    {\"id\":\"teacher_001\",\"type\":\"teacher\",\"prompt\":\"<Student asks a basic question in Persian>\",\"expected_behavior\":\"Explain simply, step by step, in Persian.\"}\n    {\"id\":\"if_001\",\"type\":\"instruction_following\",\"prompt\":\"<Answer in exactly 3 bullet points in Persian>\",\"expected_behavior\":\"Exactly 3 bullets, no extra text.\"}\n    {\"id\":\"regression_001\",\"type\":\"regression\",\"prompt\":\"<Previously easy instruction prompt>\",\"expected_behavior\":\"Should not degrade after CPT.\"}\n\n\nThe regression part is important if you CPT an already post-trained/instruct model.\n\n* * *\n\n## 10. 
About LoRA CPT rank 64\n\nLoRA CPT with rank 64 can be a reasonable compromise.\n\nIt probably will not have the same capacity as full CPT, but your empirical result matters: if it noticeably improved Qwen3-0.6B after Persian Wikipedia CPT, that is evidence that it is useful in your setup.\n\nI would just watch for:\n\n  * overfitting to Wikipedia style\n  * loss of instruction-following\n  * repetition\n  * catastrophic forgetting\n  * too much formal/encyclopedic tone\n  * weak conversational style\n  * weak teacher/student style\n\n\n\nIf you can afford it, run small ablations:\n\n\n    A: Wikipedia only\n    B: Wikipedia + filtered OSCAR\n    C: Wikipedia + filtered OSCAR + educational prose\n    D: same as B but fewer steps\n    E: same as B but different LoRA rank\n\n\nEven small ablations can teach you more than one big run.\n\n* * *\n\n## 11. One important warning: do not overfit to “clean Persian” only\n\nYour n-gram filter may become too strict.\n\nIf the filter only accepts very formal Wikipedia-style Persian, the model may become better at formal prose but not better as a student/teacher assistant.\n\nFor your target, you probably need at least three Persian styles:\n\nStyle | Example source\n---|---\nFormal Persian | Wikipedia, books, formal articles\nEducational Persian | textbooks, explanations, lessons, student-facing content\nConversational Persian | teacher/student dialogue, Q&A, simple explanations\n\nIf you only use formal text, SFT will have to fight the CPT style later.\n\nSo I would keep a small amount of high-quality conversational/educational Persian, even if it is much smaller than the formal corpus.\n\n* * *\n\n## 12. 
SFT after CPT\n\nAfter CPT, I would do SFT with a small, clean dataset.\n\nDo not start with a huge SFT dataset.\n\nStart with examples like:\n\n  * explain a concept to a student\n  * correct a student’s grammar\n  * simplify a paragraph\n  * ask a clarifying question\n  * answer with examples\n  * answer in short teacher style\n  * refuse unsafe requests politely\n  * handle mixed Persian-English technical terms\n  * multi-turn follow-up\n\n\n\nFor TRL, check the dataset format and loss masking carefully:\n\n  * TRL SFTTrainer\n  * TRL dataset formats\n  * Transformers chat templates\n\n\n\nImportant:\n\n\n    CPT teaches the language distribution.\n    SFT teaches assistant behavior.\n    Wrong chat template or wrong loss masking can waste good data.\n\n\nIf using chat data, make sure the loss is computed on the assistant turns, not on the whole serialized conversation.\n\n* * *\n\n## 13. My practical recommendation\n\nGiven your constraints, I would do this:\n\n### Phase 1: CPT data cleaning\n\n\n    Persian Wikipedia\n      -> clean\n      -> normalize\n      -> dedup\n      -> train Good KenLM\n\n    OSCAR Persian\n      -> clean\n      -> normalize\n      -> language/script filter\n      -> remove boilerplate\n      -> dedup / near-dedup\n      -> score with Good KenLM\n      -> optionally score with Bad KenLM\n      -> sample audit\n      -> keep only high-confidence Persian\n\n\n### Phase 2: Small CPT runs\n\n\n    Run 1: Wikipedia only\n    Run 2: Wikipedia + filtered OSCAR\n    Run 3: add educational Persian prose if available\n\n\nCompare them with the same held-out eval.\n\n### Phase 3: SFT\n\n\n    Small teacher/student Persian SFT\n      -> 500 to 5,000 excellent examples first\n      -> then expand only if eval shows benefit\n\n\n### Phase 4: Knowledge\n\n\n    For common background knowledge:\n      CPT exposure is useful.\n\n    For exact educational facts:\n      use RAG / local KB if possible.\n\n    For teacher style:\n      SFT.\n\n\n* * *\n\n## Bottom 
line\n\nI think your direction is reasonable.\n\nYour two core ideas are right:\n\n  1. If the model cannot write Persian well, CPT is probably needed before SFT.\n  2. A cheap n-gram-based quality filter is a practical idea under hardware/API constraints.\n\n\n\nI would only refine the plan like this:\n\n> Use CPT for Persian language grounding, SFT for teacher/student assistant behavior, and RAG or a small local knowledge base for reliable factual recall.\n\nAnd for filtering:\n\n> Use the n-gram model as one signal, not the only signal. Combine it with normalization, script/language filters, deduplication, repetition filters, OSCAR metadata when available, and manual sampling.",
  "title": "How can i build a High Quality dataset?"
}