Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreie7ii5omekhawoxjdb3pwg5eilx32yxt55zgpzdcus5jgqkuh3tam",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mor4dhxvero2"
  },
  "path": "/t/how-can-i-build-a-high-quality-dataset/176571?page=2#post_21",
  "publishedAt": "2026-06-20T23:07:25.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Qwen3.5-0.8B",
    "Qwen3.5-0.8B-Base",
    "FarsInstruct",
    "FarsInstruct GitHub",
    "Khayyam Challenge / PersianMMLU",
    "PQuAD",
    "PerCQA",
    "PCoQA",
    "PARSE",
    "Matina Persian corpus",
    "Matina on HF",
    "PersianPhi",
    "TinyStories paper",
    "Persian-Phi paper",
    "FineWeb-Edu",
    "FineWeb-Edu classifier",
    "FineWeb-Edu annotations",
    "FineWeb-C",
    "FineWeb2 paper",
    "FineWeb2 dataset",
    "Matina: A Large-Scale 73B Token Persian Text Corpus",
    "Matina on Hugging Face"
  ],
  "textContent": "Hmm… CPT is expensive, so it may be better to first isolate the failure mode and decide whether CPT is actually necessary. There may be a route where SFT alone improves this quite a bit, but it depends on the cause:\n\n* * *\n\nI would not diagnose this example as “needs CPT” yet.\n\nThe output you showed is not perfect, but it already shows several good signs:\n\n  * the model answers in Persian,\n  * it understands the basic concept of AI,\n  * it does not obviously drift into English/Chinese,\n  * it can produce a structured explanation,\n  * it is trying to be helpful.\n\n\n\nSo I would read this as a **diagnostic fork** , not as immediate proof that Persian CPT is required.\n\nThe important question is:\n\n> **Is this a CPT-type failure, or is it an answer-style / capacity / evaluation problem?**\n\nCPT is useful when the model lacks underlying language/domain grounding.\nSFT is useful when the model can understand the task but needs to learn _how you want it to answer_.\nDPO/ORPO is useful when the model can produce both good and bad styles, but you need to prefer one style over another.\nA larger model is useful when 0.8B can do each part separately, but cannot reliably combine them.\n\nFor your example, my current guess is:\n\n> **Mostly answer-style / alignment issue, possibly 0.8B capacity pressure, not clearly Persian CPT failure yet.**\n\nThat is good news, because CPT is one of the more expensive routes.\n\n## Direct answer\n\nFor this specific output, I would try this order:\n\n  1. **Do not start with CPT yet.**\n  2. Build a small diagnostic eval.\n  3. Test the same prompts on **Qwen3.5-0.8B, Qwen3.5-2B, and Qwen3.5-4B**.\n  4. If explicit answer-style prompts fix the issue, use SFT.\n  5. If 0.8B fails but 2B/4B works, it is probably capacity pressure.\n  6. 
If all sizes show weak Persian terminology, weak Persian naturalness, or broad knowledge gaps, then consider CPT/RAG/corpus work.\n\n\n\nThe short version:\n\nDiagnosis | Example symptom | First route\n---|---|---\n**A. Answer-style failure** | Persian is OK, knowledge is OK, but answer is too long / unfocused / oddly polite | SFT, maybe DPO/ORPO\n**B. Capacity pressure** | 0.8B cannot obey style + Persian + reasoning + length constraints together | Try 2B/4B or narrow the task\n**C. Persian/domain grounding failure** | Persian wording, terminology, or knowledge is broadly weak | CPT, RAG, curated Persian corpus\n\nYour example currently looks more like **A + maybe B**, not clearly **C**.\n\n## Why I would not jump to CPT from this example\n\nThe answer has problems, but not necessarily CPT-type problems.\n\nThe model’s answer is too broad for the prompt. It starts with a greeting, says “good question”, gives a long multi-section explanation, and adds some strange or shallow subpoints. But the model does know what AI is.\n\nThe user asked:\n\n\n    سلام، هوش مصنوعی چیست\n\n\nA better answer could simply be:\n\n\n    هوش مصنوعی یعنی توانایی یک سیستم کامپیوتری برای انجام کارهایی که معمولاً به هوش انسانی نیاز دارند، مثل یادگیری، فهم متن، تشخیص تصویر یا تصمیم‌گیری. به زبان ساده، هوش مصنوعی از داده‌ها یاد می‌گیرد و بر اساس آن پاسخ یا تصمیم تولید می‌کند.\n\n\nThat is not mainly “more pretraining knowledge”.\nThat is **answer style**: concise definition, no unnecessary greeting, no over-expanded lesson, no shallow filler.\n\nAlso, Qwen3.5-0.8B is already the post-trained model, not the raw base model:\n\n  * Qwen3.5-0.8B\n  * Qwen3.5-0.8B-Base\n\n\n\nSo I would not say “it fails because it is not instruction-tuned.”\nI would say:\n\n> **It is a generic post-trained model, but not yet adapted to your Persian educational assistant style.**\n\nThat is different.\n\n## Test A/B/C with a tiny prompt family\n\nBefore training, I would run the same prompt family on 0.8B, 2B, and 4B.\n\nUse temperature 0.\n\n\n    هوش مصنوعی چیست؟ در یک جمله جواب بده.\n\n\n\n    هوش مصنوعی چیست؟ برای دانش‌آموز کلاس هفتم، کوتاه و ساده توضیح بده.\n\n\n\n    هوش مصنوعی چیست؟ بدون مقدمه و در دو جمله توضیح بده.\n\n\n\n    هوش مصنوعی چیست؟ با یک مثال ساده توضیح بده، اما بیشتر از ۴ جمله ننویس.\n\n\n\n    هوش مصنوعی چیست؟ اول یک تعریف کوتاه بده، بعد یک مثال، بعد یک جمله درباره کاربرد آن.\n\n\nThen interpret the results like this:\n\nObservation | Likely diagnosis | Route\n---|---|---\n0.8B follows the explicit format well | answer-style problem | SFT likely enough\n0.8B still rambles, 2B/4B works | capacity pressure | larger small model may be cheaper\nall sizes have weak Persian terminology | grounding problem | corpus/RAG/CPT\nall sizes know facts but answer badly | style problem | SFT/DPO\nall sizes give shallow generic answers | data-quality problem | better answer-quality rubric\nall sizes fail the concept | domain knowledge issue | RAG or domain data\n\nThis is cheap and gives much better evidence than guessing.\n\n## What SFT should teach here\n\nFor your current issue, I would build an **answer-style SFT set**.\n\nThis is different from raw Persian text.\n\nRaw text teaches language distribution.\nAnswer-style SFT teaches response behavior.\n\nYou want the model to learn:\n\n  * answer directly,\n  * avoid unnecessary greeting,\n  * match the requested length,\n  * choose the right depth,\n  * explain simply when asked,\n  * use examples only when useful,\n  * preserve Persian around English technical terms,\n  * avoid long generic essay mode unless requested,\n  * say “I don’t know” when appropriate.\n\n\n\nPossible categories:\n\nPrompt type | Desired behavior\n---|---\n“What is X?” | short definition\n“Explain X simply” | 
beginner explanation\n“Explain X to a student” | educational but concise\n“Give an example” | definition + one example\n“Compare X and Y” | small table or clear contrast\n“I do not understand X” | tutor-style explanation\n“Answer in one sentence” | exactly one sentence\n“No introduction” | no greeting, no filler\n“Explain in 3 bullets” | obey structure\n“I may be wrong, but…” | correct gently\n\nExample SFT pairs:\n\n\n    User:\n    هوش مصنوعی چیست؟ در دو جمله جواب بده.\n\n    Assistant:\n    هوش مصنوعی یعنی توانایی یک سیستم کامپیوتری برای انجام کارهایی که معمولاً به هوش انسانی نیاز دارند، مثل یادگیری، فهم متن، تشخیص تصویر یا تصمیم‌گیری. به زبان ساده، هوش مصنوعی از داده‌ها یاد می‌گیرد و بر اساس آن پاسخ یا تصمیم تولید می‌کند.\n\n\n\n    User:\n    هوش مصنوعی چیست؟ برای دانش‌آموز کلاس هفتم توضیح بده.\n\n    Assistant:\n    هوش مصنوعی یعنی برنامه‌ای که می‌تواند از داده‌ها یاد بگیرد و بعضی کارها را شبیه انسان انجام دهد. مثلاً وقتی یک برنامه با دیدن عکس‌های زیاد یاد می‌گیرد گربه را از سگ تشخیص دهد، از هوش مصنوعی استفاده کرده است.\n\n\n\n    User:\n    هوش مصنوعی چیست؟ فقط تعریف کوتاه بده.\n\n    Assistant:\n    هوش مصنوعی توانایی یک سیستم کامپیوتری برای یادگیری، تصمیم‌گیری یا انجام کارهایی است که معمولاً به هوش انسانی نیاز دارند.\n\n\n\n    User:\n    هوش مصنوعی چیست؟ بدون مقدمه جواب بده.\n\n    Assistant:\n    هوش مصنوعی فناوری‌ای است که به کامپیوترها کمک می‌کند از داده‌ها یاد بگیرند و کارهایی مانند تشخیص تصویر، فهم متن، پیش‌بینی یا تصمیم‌گیری را انجام دهند.\n\n\nThis is the type of data that directly targets your observed failure.\n\n## What data to use\n\nI would separate the data by role.\n\nDo not put all Persian data into one bucket.\n\nData role | Purpose | Example sources\n---|---|---\n**diagnostic eval** | decide whether CPT is needed | Khayyam/PersianMMLU, PARSE, custom prompts\n**answer-style SFT** | teach concise Persian assistant behavior | curated definitions, FarsInstruct-style 
examples\n**QA behavior** | teach grounded question answering | PQuAD, PerCQA, PCoQA\n**math/tutor behavior** | teach step-by-step tutoring | custom textbook-aligned data\n**raw corpus / CPT** | improve Persian grounding | Matina or other filtered Persian corpus\n**preference data** | prefer concise/focused answers | chosen/rejected pairs\n**RAG documents** | provide domain facts without retraining | trusted educational sources\n\nUseful Persian resources:\n\nResource | Use\n---|---\nFarsInstruct | Persian instruction-following data and task templates\nFarsInstruct GitHub | inspect task/template design\nKhayyam Challenge / PersianMMLU | Persian school/exam-style evaluation\nPQuAD | Persian reading comprehension QA\nPerCQA | Persian community QA / answer selection\nPCoQA | Persian conversational QA\nPARSE | Persian reasoning QA\nMatina Persian corpus | possible CPT corpus if grounding is weak\nMatina on HF | inspect raw corpus candidate\nPersianPhi | example of a heavier Persian adaptation pipeline\n\nImportant: do not assume these are all training-ready. 
Inspect license, quality, duplication, synthetic content, and whether the style matches your target.\n\n## About TinyStories\n\nYour concern is right.\n\nTranslated TinyStories may help with:\n\n  * simple sentence structure,\n  * simple Persian generation,\n  * child-like narrative fluency,\n  * basic grammar patterns.\n\n\n\nBut it is not a good main source for:\n\n  * AI explanations,\n  * math tutoring,\n  * educational definitions,\n  * Persian technical terminology,\n  * robust knowledge coverage,\n  * assistant answer style.\n\n\n\nTinyStories is useful as a controlled simple-language dataset for small-model research, but it is not a broad knowledge or tutor corpus:\n\n  * TinyStories paper\n\n\n\nPersianPhi uses translated TinyStories as part of a broader curriculum, but not as the whole Persian solution:\n\n  * PersianPhi\n  * Persian-Phi paper\n\n\n\nSo I would treat translated TinyStories as optional auxiliary data, not the main CPT source.\n\n## About Wikipedia\n\nWikipedia is not useless, but it is not enough by itself.\n\nIt is useful for:\n\n  * broad factual coverage,\n  * terminology,\n  * named entities,\n  * reading comprehension,\n  * background documents for RAG,\n  * raw text candidates for corpus work.\n\n\n\nBut it is weak as direct assistant SFT because Wikipedia does not teach:\n\n  * how short the answer should be,\n  * when not to greet,\n  * how to tutor a student,\n  * how to adapt to difficulty level,\n  * how to answer a simple definition prompt,\n  * how to admit uncertainty.\n\n\n\nSo I would not directly train assistant behavior from Wikipedia paragraphs.\n\nBetter:\n\n  1. use Wikipedia as background text,\n  2. generate candidate QA/explanations from it,\n  3. filter with a Persian educational rubric,\n  4. manually inspect samples,\n  5. 
keep only strong examples.\n\n\n\nUse of Wikipedia | Good?\n---|---\nraw background corpus | yes, with filtering\nRAG documents | yes\nreading comprehension source | yes\nterminology source | yes\ndirect assistant answer style | no\nshallow answer detector | no\n\n## About FineWeb-Edu and shallow text detection\n\nYou are right that the FineWeb-Edu classifier itself is not a Persian shallow-text detector.\n\nFineWeb-Edu is still useful as an idea:\n\n> **Educational quality should be scored separately from fluency.**\n\nKenLM/perplexity can detect some noisy text, but it will not reliably detect a fluent-but-shallow explanation.\n\nUseful references:\n\n  * FineWeb-Edu\n  * FineWeb-Edu classifier\n  * FineWeb-Edu annotations\n  * FineWeb-C\n  * FineWeb2 paper\n  * FineWeb2 dataset\n\n\n\nFineWeb-C is especially relevant because it is about educational-quality annotations across many languages. I would copy that idea at small scale for Persian.\n\n## Minimal diagnosis/eval set\n\nI would make a small eval before training.\n\nStart with maybe 100 examples, not 1,000.\n\nBucket | Count | Purpose\n---|---|---\nshort definition prompts | 20 | answer length / focus\nbeginner explanation prompts | 20 | educational simplicity\nPersian QA prompts | 20 | general answer behavior\nmath/tutor prompts | 20 | tutor format\nLatin-name / English-term prompts | 10 | code-switch control\nuncertainty prompts | 10 | honesty / hallucination control\n\nScore each model on:\n\nMetric | What to check\n---|---\nPersian naturalness | is the Persian acceptable?\nanswer focus | does it answer the actual prompt?\nlength control | does it obey one-sentence/two-sentence constraints?\neducational usefulness | is it helpful for the intended user?\nshallow answer rate | is it generic or empty?\nlanguage drift | does it switch language unexpectedly?\nfactual correctness | is the content right?\nformat following | does it follow requested 
structure?\n\nRun:\n\n  * Qwen3.5-0.8B\n  * Qwen3.5-2B\n  * Qwen3.5-4B\n\n\n\nOptional:\n\n  * Qwen3.5-9B\n  * one non-Qwen small model\n\n\n\nThen decide.\n\n## If it is A: answer-style / alignment failure\n\nUse SFT first.\n\nSuggested dataset:\n\n  * 5k–30k high-quality Persian answer-style examples,\n  * maybe 1k–5k chosen/rejected pairs later,\n  * many short prompts,\n  * many length-control prompts,\n  * no unnecessary greeting,\n  * Persian educational tone,\n  * explicit “short / beginner / detailed” modes.\n\n\n\nExamples to include:\n\nExample type | Why\n---|---\nconcise definitions | fixes overlong answers\nno-greeting answers | fixes filler\nstudent explanations | fixes educational tone\nexamples with one example only | fixes over-expansion\nanswer in N sentences | fixes length control\nuncertainty examples | fixes hallucination\nsimple correction examples | fixes tutor behavior\n\nThis route is much cheaper than CPT.\n\n## If it is B: 0.8B capacity pressure\n\nTry 2B/4B before heavy training.\n\nThe important point is:\n\n> 0.8B may be small enough that every extra behavior competes for capacity.\n\nPersian fluency, instruction following, educational style, math tutor behavior, tool JSON, and anti-drift behavior are all separate pressures.\n\nIf 2B or 4B solves many failures immediately, then 2B/4B may be cheaper overall than:\n\n  * CPT,\n  * tokenizer extension,\n  * large SFT,\n  * DPO,\n  * runtime guardrails,\n  * repeated debugging.\n\n\n\nFor a festival/demo project, 0.8B is attractive. But if it requires too much engineering, 2B may be the practical sweet spot.\n\n## If it is C: Persian/domain grounding failure\n\nThen CPT may be justified, but I would still be careful.\n\nBefore CPT:\n\n  1. inspect a Persian raw corpus,\n  2. sample and manually review,\n  3. deduplicate,\n  4. check language contamination,\n  5. check educational quality,\n  6. keep eval separate,\n  7. 
run a small pilot before a large CPT run.\n\n\n\nPossible raw corpus reference:\n\n  * Matina: A Large-Scale 73B Token Persian Text Corpus\n  * Matina on Hugging Face\n\n\n\nBut CPT is not automatically the first fix. If the missing information exists in a small set of trusted documents, RAG may be cheaper and safer.\n\n## Practical training strategy\n\nI would use this sequence:\n\n### Stage 0: Diagnose\n\nRun 100 diagnostic prompts on 0.8B/2B/4B.\n\n### Stage 1: SFT\n\nIf A is likely, build answer-style SFT.\n\nStart small:\n\n  * 5k high-quality examples if data is very good,\n  * 10k–30k if you can maintain quality,\n  * avoid huge weak synthetic datasets.\n\n\n\n### Stage 2: Preference tuning\n\nIf the model still prefers the wrong style:\n\n  * chosen = concise, focused Persian answer,\n  * rejected = overlong, shallow, greeting-heavy, unfocused answer.\n\n\n\nUse DPO/ORPO only after you know what failures remain.\n\n### Stage 3: Size decision\n\nIf 0.8B remains brittle:\n\n  * try 2B,\n  * then 4B,\n  * compare total engineering cost, not only inference cost.\n\n\n\n### Stage 4: CPT only if justified\n\nIf C is clearly confirmed:\n\n  * use curated Persian raw text,\n  * maybe Matina or educational corpora,\n  * keep CPT small at first,\n  * evaluate after CPT and after SFT.\n\n\n\n### Stage 5: Tokenizer only if measured\n\nDo not extend the tokenizer unless tokenization analysis shows Persian is badly fragmented and you can afford embedding adaptation.\n\n## Final practical recommendation\n\nFor your current output, I would not start with CPT.\n\nI would do:\n\n  1. **diagnostic eval first**,\n  2. **answer-style SFT first if the model already knows the concept**,\n  3. **2B/4B comparison if 0.8B remains brittle**,\n  4. **DPO/ORPO if the model can produce good answers but often chooses bad style**,\n  5. 
**CPT only if many examples show broad Persian/domain grounding weakness**.\n\n\n\nSo the main idea is:\n\n> **Try to prove that CPT is necessary before paying for CPT. This example alone does not prove it. It looks more like answer-style alignment plus possible 0.8B capacity pressure.**",
  "title": "How can i build a High Quality dataset?"
}