Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreihki43jl2rcvondbjrappifktacic4ippvfqx5tfms5p6tcgmptsi",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3menagla4ysp2"
  },
  "path": "/t/domain-language-model/173342#post_2",
  "publishedAt": "2026-02-12T03:29:35.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "smaller models are becoming quite stable these days",
    "Epoch AI",
    "Epoch AI",
    "arXiv",
    "arXiv",
    "arXiv",
    "Hugging Face",
    "ACL Anthology",
    "Epoch AI",
    "unsloth.ai",
    "NVIDIA Blog",
    "ACL Anthology",
    "arXiv",
    "arXiv",
    "ACL Anthology",
    "arXiv",
    "arXiv",
    "ACL Anthology",
    "Epoch AI"
  ],
  "textContent": "If you ultimately need general language skills and foundational math and logic abilities in an SLM, beyond specialized knowledge, it’s better to build on an existing model. Pre-training a new SLM from scratch is quite expensive, even at the million-parameter scale.\nIn most cases, fine-tuning with LoRA or QLoRA is preferable.\n\nEven smaller models are becoming quite stable these days.\n\n* * *\n\n## Background: what “domain language model” typically means (and why it matters)\n\nWhen people say _“build a model for my small domain”_, they usually want one (or more) of these:\n\n  1. **Domain knowledge** (answers should be correct w.r.t. your internal docs)\n  2. **Domain behavior** (output format, tone, workflow steps, taxonomy, JSON schema)\n  3. **Domain language** (jargon, abbreviations, notation, writing style)\n\n\n\nThe best approach depends on which of these you need most.\n\n* * *\n\n## Your main choice: adapt an existing SLM vs train from scratch\n\n### In a small domain, the default best option is **not** training from scratch\n\nTraining from scratch requires **massive token volumes** to avoid an undertrained, brittle model.\n\n  * Scaling work (Chinchilla) finds that, for compute-optimal training, **parameters and training tokens should scale together**; they demonstrate this by training a **70B parameter** model on **~1.4T tokens**.\n  * A commonly used rule-of-thumb derived from this regime is **~20 tokens per parameter** (e.g., 70B → 1.4T tokens). (Epoch AI)\n  * Also, in practice, “tokens-per-parameter” in notable open models has been trending upward over time (data-heavy training), which pushes requirements even higher. (Epoch AI)\n\n\n\n**Implication:** If your domain data is “small” (even a few million words), training from scratch will almost always underperform a good existing base model that you adapt.\n\n* * *\n\n## The options that usually win for small domains\n\n### Option A — **RAG** for domain knowledge (recommended starting point)\n\n**RAG (Retrieval-Augmented Generation)** means: keep a general model, but **retrieve relevant passages from your domain documents** and feed them into the prompt at answer time.\n\nWhy it’s a strong default for “small domain”:\n\n  * It directly addresses “knowledge changes” and “provenance/citations” problems: you update the document index instead of retraining weights. (arXiv)\n\n\n\n**Use RAG when:** your domain is mostly _documents/policies/manuals/KB articles_ and you want grounded answers.\n\n* * *\n\n### Option B — **Fine-tune an existing SLM** (almost always better than scratch)\n\nFine-tuning is best when you want **behavioral consistency**:\n\n  * consistent templates\n  * strict JSON\n  * correct taxonomy labels\n  * specific reasoning workflow (“step 1–2–3”)\n  * customer-support style\n\n\n\nFor small domains, you normally do **parameter-efficient fine-tuning (PEFT)** rather than full fine-tuning:\n\n  * **LoRA**: freeze base weights and train small low-rank adapters; reduces trainable parameters and memory. (arXiv)\n  * **QLoRA**: train LoRA adapters while the base model is quantized to 4-bit; makes fine-tuning feasible on limited hardware and was shown to preserve performance well. (arXiv)\n  * Hugging Face’s PEFT overview emphasizes why PEFT helps with cost/storage and is often better in low-data settings (and can reduce catastrophic forgetting vs full fine-tuning). (Hugging Face)\n\n\n\n**Use fine-tuning when:** you need the model to behave in a domain-specific way, not just “know facts.”\n\n* * *\n\n### Option C — **Continued pretraining (DAPT/TAPT)** for domain language mismatch\n\nIf the model struggles with domain text even when retrieved (RAG gives it the right paragraph but it still “doesn’t get it”), you may need domain-adaptive pretraining:\n\n  * “Don’t Stop Pretraining” shows that **continued pretraining on in-domain text (DAPT)** improves downstream task performance across multiple domains and in both high- and low-resource settings. (ACL Anthology)\n\n\n\n**Use DAPT when:** your domain has unusual language distribution (dense jargon, abbreviations, formulas, log syntax, biomedical/legal/engineering writing).\n\n* * *\n\n## Data size criteria you can actually use\n\n### 1) If you’re thinking about **training from scratch**\n\nUse a tokens-per-parameter sanity check:\n\n  * A common compute-optimal reference point is **~20 tokens/parameter** (e.g., 70B trained on ~1.4T tokens). (Epoch AI)\n\n\n\nExamples (rule-of-thumb):\n\n  * **1B params** → ~**20B tokens**\n  * **3B params** → ~**60B tokens**\n  * **7B params** → ~**140B tokens**\n\n\n\nIf you are not in the **tens of billions of tokens**, training from scratch is rarely justified for quality.\n\n* * *\n\n### 2) If you’re thinking about **fine-tuning (SFT/LoRA/QLoRA)**\n\nFor supervised fine-tuning (instruction/input → ideal output), dataset size is measured in **examples/pairs**.\n\nPractical, widely used heuristics:\n\n  * Unsloth’s dataset guidance: **minimum ~100 rows**, **> 1,000 rows preferred** for better outcomes. (unsloth.ai)\n  * NVIDIA’s practical guidance for parameter-efficient fine-tuning: **small-to-medium dataset (100–1,000 prompt-sample pairs)**. (NVIDIA Blog)\n\n\n\nInterpretation:\n\n  * If you only have **50–200** examples, you can still improve _formatting and style_, but expect brittleness.\n  * At **1,000–10,000** good examples, you can usually get consistent behavior across a range of prompts.\n\n\n\n* * *\n\n### 3) If you’re thinking about **continued pretraining (DAPT)**\n\nThere isn’t a single universal “minimum,” but the key question is:\n\n  * Do you have enough **unlabeled in-domain text** to noticeably shift the model’s language understanding?\n\n\n\nDAPT is supported by evidence as beneficial even under low-resource conditions, but it still needs enough domain text to move the needle. (ACL Anthology)\nIn small domains, teams often try **RAG + SFT** first; add DAPT only if language mismatch persists.\n\n* * *\n\n## A clear recommendation for a “small domain” (most common scenario)\n\n### Best default stack\n\n  1. **RAG** for domain knowledge (fast wins, easy updates). (arXiv)\n  2. **LoRA/QLoRA SFT** for domain behavior (templates/taxonomy/schema). (arXiv)\n  3. **DAPT** only if needed for domain-language mismatch. (ACL Anthology)\n  4. **Avoid scratch training** unless you truly have foundation-scale data (typically billions–trillions of tokens).\n\n\n\n* * *\n\n## Quick decision checklist\n\nChoose the **lowest-cost method** that solves your real problem:\n\n  * **Need correct answers from internal docs?** → RAG first. (arXiv)\n  * **Need consistent output format / workflow?** → SFT with LoRA/QLoRA. (arXiv)\n  * **Model can’t interpret your jargon even with retrieved passages?** → Consider DAPT. (ACL Anthology)\n  * **Considering scratch?** → Check tokens-per-parameter; if you’re not in **tens of billions+ tokens**, it’s usually the wrong move. (Epoch AI)\n\n",
  "title": "Domain language model"
}
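
The tokens-per-parameter sanity check described in the record's `textContent` can be sketched in a few lines of Python. This is only an illustration of the ~20 tokens/parameter rule of thumb quoted there, not a training-budget calculator; the function names and the 5-million-token example corpus are hypothetical and do not come from the original post.

```python
# Rule of thumb quoted in the post: ~20 training tokens per model parameter.
TOKENS_PER_PARAM = 20


def compute_optimal_tokens(n_params: int, tokens_per_param: int = TOKENS_PER_PARAM) -> int:
    """Rough number of training tokens suggested for a model with n_params parameters."""
    return n_params * tokens_per_param


def scratch_training_plausible(available_tokens: int, n_params: int) -> bool:
    """True if the available corpus reaches the rule-of-thumb token budget (hypothetical helper)."""
    return available_tokens >= compute_optimal_tokens(n_params)


if __name__ == "__main__":
    # Model sizes taken from the examples in the post (1B, 3B, 7B, 70B parameters).
    for n_params in (1_000_000_000, 3_000_000_000, 7_000_000_000, 70_000_000_000):
        budget = compute_optimal_tokens(n_params)
        print(f"{n_params / 1e9:.0f}B params -> ~{budget / 1e9:.0f}B tokens")

    # Hypothetical small-domain corpus of 5 million tokens (an assumed figure).
    print(scratch_training_plausible(5_000_000, 1_000_000_000))
```

With these assumptions, a small-domain corpus falls several orders of magnitude short of the budget for even a 1B-parameter model, which is the post's argument for adapting an existing model instead.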