Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiehkrqfvuc4plzsihaqjhmi7zp7pxuhfozmtqvx56j7ykl6yyjyqa",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mnezpg2gnvi2"
  },
  "path": "/t/fine-tuning-an-slm-for-a-low-resource-language/176467#post_2",
  "publishedAt": "2026-06-03T10:29:27.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "@Bidram"
  ],
  "textContent": "Hi @Bidram — nice project. I’ll take your two questions in order, plus the “longer, more complex outputs” goal underneath them, since they connect.\n\n**1. LoRA to simulate continued pretraining — yes, with honest caveats.**\n\nYou can run continued _pretraining_ (plain next-token prediction on raw text) through a LoRA adapter, not just instruction tuning. Be realistic about one thing: a low-rank adapter has limited room to add genuinely _new_ knowledge or a new language — it’s much better at strengthening what the base already half-knows. You said the model already produces simple text in the language, so you’re in the good case: amplifying an existing ability, not teaching from zero.\n\nTo get the most from it:\n\n  * Use a higher rank than the SFT default (try 64–128) and apply LoRA to all linear layers (q, k, v, o **and** the MLP gate/up/down), not just q/v.\n  * Check tokens-per-word for your language. If the tokenizer shreds it into single bytes, that caps quality more than anything else, and the real fix is extending the tokenizer and training the new embeddings — heavier, but it’s often the actual bottleneck for low-resource languages.\n  * Given your hardware, do it as QLoRA (4-bit base + LoRA), short sequences packed together, small batch with gradient accumulation.\n\n\n\nA sequence that works well: first, LoRA continued-pretraining on raw target-language text to build fluency and length; then a smaller QA/SFT pass for task format. Fluency comes from the first stage — the QA set mostly teaches answer _shape_.\n\n**2. A QA set from Wikipedia dumps — pick the route that fits your constraints.**\n\n  * Source: skip the raw XML and pull the cleaned per-language parquet from the `wikimedia/wikipedia` dataset on the Hub. You download just your language once, which helps if bandwidth is restricted.\n  * With no outside API and limited compute (sounds like your situation), a fully local, templated approach goes surprisingly far: split each article into passages and build extractive QA from structure — lead sentence (“X is a …”) → “What is X?”; section headings → a question about that section; dated or numbered facts → fill-in-the-blank. Cheap, no model needed, and every answer stays grounded in the text.\n  * If you can run even a small instruct model locally, have it draft question/answer pairs from each passage (give it the passage, ask for a few questions answerable _only_ from that text), then drop any pair whose answer isn’t actually in the passage. Generate in the target language if the model can — English-then-translate loses quality.\n  * Either route: dedupe, discard unanswerable or garbled pairs, and store as {question, optional context, answer}.\n\n\n\nOn the “struggles with longer, complex outputs” symptom specifically: that’s usually a fluency and coverage gap, so I’d put more effort into the raw-text continued-pretraining stage than into the QA set — the QA mostly fixes format, not depth.\n\nGood luck at the festival — happy to go deeper on any of these if you share the language and the base model you’re starting from.",
  "title": "Fine-Tuning an SLM for a Low-Resource Language"
}