Domain language model
If you ultimately need general language skills and foundational math and logic abilities on top of specialized knowledge in your SLM, it is better to build on an existing model. Pre-training a new SLM from scratch is expensive, even for models with only millions of parameters. In most cases, fine-tuning with LoRA or QLoRA is preferable.
Even smaller models are becoming quite stable these days.
Background: what “domain language model” typically means (and why it matters)
When people say “build a model for my small domain”, they usually want one (or more) of these:
- Domain knowledge (answers should be correct w.r.t. your internal docs)
- Domain behavior (output format, tone, workflow steps, taxonomy, JSON schema)
- Domain language (jargon, abbreviations, notation, writing style)
The best approach depends on which of these you need most.
Your main choice: adapt an existing SLM vs train from scratch
In a small domain, the default best option is not training from scratch
Training from scratch requires massive token volumes to avoid an undertrained, brittle model.
- Scaling work (Chinchilla) finds that, for compute-optimal training, parameters and training tokens should scale together; the authors demonstrate this by training a 70B-parameter model on ~1.4T tokens.
- A commonly used rule-of-thumb derived from this regime is ~20 tokens per parameter (e.g., 70B → 1.4T tokens). (Epoch AI)
- Also, in practice, “tokens-per-parameter” in notable open models has been trending upward over time (data-heavy training), which pushes requirements even higher. (Epoch AI)
Implication: If your domain data is “small” (even a few million words), training from scratch will almost always underperform a good existing base model that you adapt.
The options that usually win for small domains
Option A — RAG for domain knowledge (recommended starting point)
RAG (Retrieval-Augmented Generation) means: keep a general model, but retrieve relevant passages from your domain documents and feed them into the prompt at answer time.
Why it’s a strong default for “small domain”:
- It directly addresses “knowledge changes” and “provenance/citations” problems: you update the document index instead of retraining weights. (arXiv)
Use RAG when: your domain is mostly documents/policies/manuals/KB articles and you want grounded answers.
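The retrieve-then-prompt flow described above can be sketched in a few lines. This is a toy illustration, not a production retriever: it assumes bag-of-words cosine similarity in place of a real embedding model, and the documents, function names, and prompt wording are all hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, passages, k=2):
    """Return the k passages most similar to the query."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(p))), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:k]]

def build_prompt(query, passages, k=2):
    """Assemble a grounded prompt from the retrieved passages."""
    top = retrieve(query, passages, k)
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(top))
    return (
        "Answer using only the numbered context and cite the passage number.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\nAnswer:"
    )

# Hypothetical internal documents, for illustration only.
DOCS = [
    "Refunds are issued within 14 days of a return request.",
    "The VPN client must be updated every quarter.",
    "Return requests require the original order number.",
]

prompt = build_prompt("How long do refunds take?", DOCS)
```

The string returned by `build_prompt` is what you would send to the unchanged general model. Updating knowledge then means editing `DOCS` (or, in a real system, re-indexing), not retraining weights.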
Option B — Fine-tune an existing SLM (almost always better than scratch)
Fine-tuning is best when you want behavioral consistency:
- consistent templates
- strict JSON
- correct taxonomy labels
- specific reasoning workflow (“step 1–2–3”)
- customer-support style
For small domains, you normally do parameter-efficient fine-tuning (PEFT) rather than full fine-tuning:
- LoRA: freeze base weights and train small low-rank adapters; reduces trainable parameters and memory. (arXiv)
- QLoRA: train LoRA adapters while the base model is quantized to 4-bit; makes fine-tuning feasible on limited hardware and was shown to preserve performance well. (arXiv)
- Hugging Face’s PEFT overview emphasizes why PEFT helps with cost/storage and is often better in low-data settings (and can reduce catastrophic forgetting vs full fine-tuning). (Hugging Face)
Use fine-tuning when: you need the model to behave in a domain-specific way, not just “know facts.”
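To make the “reduces trainable parameters” point concrete, here is a rough arithmetic sketch. It assumes the usual LoRA formulation, in which a frozen d_out × d_in weight matrix gets two trainable low-rank factors of rank r, so the adapter holds r × (d_in + d_out) parameters. The 4096 × 4096 layer size and rank 8 are illustrative assumptions, not values taken from any particular model.

```python
def full_params(d_in, d_out):
    """Parameters updated when fine-tuning the full weight matrix."""
    return d_in * d_out

def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the two low-rank factors: (d_out x rank) and (rank x d_in)."""
    return rank * (d_in + d_out)

# Assumed example: one square projection layer with a rank-8 adapter.
d_in, d_out, rank = 4096, 4096, 8
full = full_params(d_in, d_out)                   # 16,777,216
lora = lora_trainable_params(d_in, d_out, rank)   # 65,536
fraction = lora / full                            # 0.00390625, about 0.39%
```

Under these assumptions the adapter trains well under 1% of the layer's weights, which is where the memory and storage savings come from.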
Option C — Continued pretraining (DAPT/TAPT) for domain language mismatch
If the model struggles with domain text even when retrieved (RAG gives it the right paragraph but it still “doesn’t get it”), you may need domain-adaptive pretraining:
- “Don’t Stop Pretraining” shows that continued pretraining on in-domain text (DAPT) improves downstream task performance across multiple domains and in both high- and low-resource settings. (ACL Anthology)
Use DAPT when: your domain has unusual language distribution (dense jargon, abbreviations, formulas, log syntax, biomedical/legal/engineering writing).
Data size criteria you can actually use
1) If you’re thinking about training from scratch
Use a tokens-per-parameter sanity check:
- A common compute-optimal reference point is ~20 tokens/parameter (e.g., 70B trained on ~1.4T tokens). (Epoch AI)
Examples (rule-of-thumb):
- 1B params → ~20B tokens
- 3B params → ~60B tokens
- 7B params → ~140B tokens
If you are not in the tens of billions of tokens, training from scratch is rarely justified for quality.
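The sanity check above is simple arithmetic, and a small helper makes it easy to run against your own numbers. This is a sketch that assumes the ~20 tokens/parameter rule of thumb quoted above; the function names and the 10M-token corpus are hypothetical.

```python
TOKENS_PER_PARAM = 20  # rule-of-thumb ratio quoted above, not a hard law

def compute_optimal_tokens(n_params, ratio=TOKENS_PER_PARAM):
    """Approximate training tokens suggested for a model with n_params parameters."""
    return n_params * ratio

def coverage(available_tokens, n_params, ratio=TOKENS_PER_PARAM):
    """Fraction of the suggested token budget that your corpus covers."""
    return available_tokens / compute_optimal_tokens(n_params, ratio)

targets = {
    "1B": compute_optimal_tokens(1_000_000_000),   # 20B tokens
    "3B": compute_optimal_tokens(3_000_000_000),   # 60B tokens
    "7B": compute_optimal_tokens(7_000_000_000),   # 140B tokens
}

# Hypothetical small-domain corpus of 10M tokens against a 1B-parameter model.
small_domain = coverage(10_000_000, 1_000_000_000)  # 0.0005, i.e. 0.05% of the target
```

A coverage value far below 1.0, as in the hypothetical case here, is the signal that adapting an existing model is the more realistic route.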
2) If you’re thinking about fine-tuning (SFT/LoRA/QLoRA)
For supervised fine-tuning (instruction/input → ideal output), dataset size is measured in examples/pairs.
Practical, widely used heuristics:
- Unsloth’s dataset guidance: minimum ~100 rows, with more than 1,000 rows preferred for better outcomes. (unsloth.ai)
- NVIDIA’s practical guidance for parameter-efficient fine-tuning: small-to-medium dataset (100–1,000 prompt-sample pairs). (NVIDIA Blog)
Interpretation:
- If you only have 50–200 examples, you can still improve formatting and style, but expect brittleness.
- At 1,000–10,000 good examples, you can usually get consistent behavior across a range of prompts.
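Because the heuristics above count usable pairs rather than raw rows, it can help to count after removing duplicates and empty outputs. The sketch below assumes a JSONL file with `prompt` and `response` fields; those field names, the sample records, and the tier labels are hypothetical, and the thresholds simply mirror the ~100 and 1,000 figures quoted above.

```python
import json

def load_pairs(jsonl_text):
    """Parse JSONL text into unique (prompt, response) pairs, skipping incomplete rows."""
    seen = set()
    pairs = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        prompt = record.get("prompt", "").strip()
        response = record.get("response", "").strip()
        if not prompt or not response:
            continue  # a pair with a missing side teaches nothing
        key = (prompt, response)
        if key not in seen:
            seen.add(key)
            pairs.append(key)
    return pairs

def size_tier(n_examples):
    """Map a pair count onto the rough tiers from the heuristics above."""
    if n_examples < 100:
        return "below minimum"
    if n_examples <= 1000:
        return "workable"
    return "preferred"

# Hypothetical sample: one duplicate row and one row with an empty response.
SAMPLE = "\n".join([
    '{"prompt": "Classify: login fails", "response": "auth"}',
    '{"prompt": "Classify: login fails", "response": "auth"}',
    '{"prompt": "Classify: card declined", "response": ""}',
    '{"prompt": "Classify: invoice missing", "response": "billing"}',
])

pairs = load_pairs(SAMPLE)
tier = size_tier(len(pairs))
```

Only two of the four sample rows survive, which is the point: the count that matters is the number of distinct, complete examples, and quality still matters more than the tier label.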
3) If you’re thinking about continued pretraining (DAPT)
There isn’t a single universal “minimum,” but the key question is:
- Do you have enough unlabeled in-domain text to noticeably shift the model’s language understanding?
DAPT is supported by evidence as beneficial even under low-resource conditions, but it still needs enough domain text to move the needle. (ACL Anthology) In small domains, teams often try RAG + SFT first; add DAPT only if language mismatch persists.
A clear recommendation for a “small domain” (most common scenario)
Best default stack
- RAG for domain knowledge (fast wins, easy updates). (arXiv)
- LoRA/QLoRA SFT for domain behavior (templates/taxonomy/schema). (arXiv)
- DAPT only if needed for domain-language mismatch. (ACL Anthology)
- Avoid scratch training unless you truly have foundation-scale data (typically tens of billions to trillions of tokens).
Quick decision checklist
Choose the lowest-cost method that solves your real problem:
- Need correct answers from internal docs? → RAG first. (arXiv)
- Need consistent output format / workflow? → SFT with LoRA/QLoRA. (arXiv)
- Model can’t interpret your jargon even with retrieved passages? → Consider DAPT. (ACL Anthology)
- Considering scratch? → Check tokens-per-parameter; if you’re not in the tens of billions of tokens or more, it’s usually the wrong move. (Epoch AI)
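The checklist can also be written as a small function, which is handy if you want to record the reasoning in a project template. This is only a sketch of the rules listed above: the parameter names and returned strings are hypothetical, and the scratch check assumes the ~20 tokens/parameter rule of thumb.

```python
def recommend(needs_doc_answers=False, needs_consistent_format=False,
              jargon_fails_with_retrieval=False, considering_scratch=False,
              available_tokens=0, n_params=1_000_000_000):
    """Return the adaptation steps suggested by the checklist, cheapest first."""
    steps = []
    if needs_doc_answers:
        steps.append("RAG")
    if needs_consistent_format:
        steps.append("SFT with LoRA/QLoRA")
    if jargon_fails_with_retrieval:
        steps.append("DAPT")
    if considering_scratch:
        # Assumed threshold: ~20 training tokens per parameter.
        if available_tokens >= 20 * n_params:
            steps.append("scratch training may be worth evaluating")
        else:
            steps.append("avoid scratch training")
    return steps

# Hypothetical team: internal-docs Q&A with strict JSON output and a 50M-token corpus.
plan = recommend(needs_doc_answers=True, needs_consistent_format=True,
                 considering_scratch=True, available_tokens=50_000_000)
```

For the hypothetical team above, the result is RAG plus LoRA/QLoRA fine-tuning, with scratch training ruled out by the token count.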