{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreihzhju3zrr4pshep5p6ikqptdg2p3isxaa5jmh4chgyxulzhzue2m",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mnfh54nyknl2"
},
"path": "/t/fine-tuning-an-slm-for-a-low-resource-language/176467#post_3",
"publishedAt": "2026-06-03T13:59:08.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"a Persian Wikipedia QA dataset",
"a cleaned Persian translation of Alpaca"
],
"textContent": "Thank you very much for your help! My target language is Persian (Iranian), and for the base model I used **Qwen 3 0.6B Instruct**. I chose this model because I am targeting low-end devices, such as phones and school computers.\n\nMy main limitation is hardware. I have an **RTX 3050 Mobile 4Gb** , **32Gb RAM** , and an **Intel i5-11400H**.\n\nIf you have time, I have some more questions:\n\nI tried to continue pretraining through a **LoRA adapter** using the **Unsloth** library in Python with these LoRA adapter settings:\n\n\n model = FastLanguageModel.get_peft_model(\n\n model,\n\n r=8,\n\n target_modules=[\n\n \"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\",\n\n \"gate_proj\", \"up_proj\", \"down_proj\",\n\n ],\n\n lora_alpha=16,\n\n lora_dropout=0,\n\n bias=\"none\",\n\n use_gradient_checkpointing=\"unsloth\",\n\n random_state=3407,\n\n use_rslora=False,\n\n loftq_config=None,\n\n )\n\n\nI used a _.jsonl_ file containing semi-cleaned raw Wikipedia text in the target language, but the results were not very good and were weaker than the original model. 
Unfortunately, I do not have the training logs, but as far as I remember, the loss started at around **2.3** and reached about **1.5** by the end of training.\n\nI used an **800 MB** cleaned Persian Wikipedia _.jsonl_ file to test whether the model would improve, but the results were negative, at least in my setup.\n\nFor fine-tuning, I used two datasets:\n\nPersian-Wiki-QA (a Persian Wikipedia QA dataset)\n\nAlpaca-Persian-Cleaned (a cleaned Persian translation of Alpaca, translated using Google Translate)\n\nThis is the script I used to train the model:\n\n\n import os\n\n from datasets import load_dataset\n\n from unsloth import FastLanguageModel\n\n from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling\n\n\n\n\n\n MODEL_PATH = r\"F:\\Deep Learning\\Model\\Qwen 3 0.6B SafeTensor\"\n\n DATA_PATH = r\"E:\\Ava1\\Dataset-pars\\Wikipedia-Persian\\Wikipedia-Persian-textonly.jsonl\"\n\n OUTPUT_DIR = r\"F:\\Deep Learning\\Outputs\\qwen3-0.6b-persian-pretrain-lora\"\n\n\n\n MAX_SEQ_LENGTH = 256\n\n LOAD_IN_4BIT = True\n\n\n\n def get_latest_checkpoint(output_dir):\n\n if not os.path.exists(output_dir):\n\n return None\n\n checkpoints = []\n\n for name in os.listdir(output_dir):\n\n path = os.path.join(output_dir, name)\n\n if os.path.isdir(path) and name.startswith(\"checkpoint-\"):\n\n try:\n\n step = int(name.split(\"-\")[-1])\n\n checkpoints.append((step, path))\n\n except ValueError:\n\n pass\n\n if not checkpoints:\n\n return None\n\n checkpoints.sort(key=lambda x: x[0])\n\n return checkpoints[-1][1]\n\n\n\n def main():\n\n os.makedirs(OUTPUT_DIR, exist_ok=True)\n\n\n\n model, tokenizer = FastLanguageModel.from_pretrained(\n\n model_name=MODEL_PATH,\n\n max_seq_length=MAX_SEQ_LENGTH,\n\n dtype=None,\n\n load_in_4bit=LOAD_IN_4BIT,\n\n )\n\n\n\n if tokenizer.pad_token is None:\n\n tokenizer.pad_token = tokenizer.eos_token\n\n\n\n model = FastLanguageModel.get_peft_model(\n\n model,\n\n r=8,\n\n target_modules=[\n\n 
\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\",\n\n \"gate_proj\", \"up_proj\", \"down_proj\",\n\n ],\n\n lora_alpha=16,\n\n lora_dropout=0,\n\n bias=\"none\",\n\n use_gradient_checkpointing=\"unsloth\",\n\n random_state=3407,\n\n use_rslora=False,\n\n loftq_config=None,\n\n )\n\n\n\n dataset = load_dataset(\"json\", data_files=DATA_PATH, split=\"train\")\n\n\n\n def clean_example(example):\n\n text = example.get(\"Text\", \"\")\n\n if text is None:\n\n text = \"\"\n\n return {\"text\": str(text).strip()}\n\n\n\n dataset = dataset.map(\n\n clean_example,\n\n remove_columns=dataset.column_names,\n\n load_from_cache_file=False,\n\n )\n\n\n\n dataset = dataset.filter(\n\n lambda x: len(x[\"text\"]) > 50,\n\n load_from_cache_file=False,\n\n )\n\n\n\n eos_token = tokenizer.eos_token or \"\"\n\n\n\n def add_eos(example):\n\n return {\"text\": example[\"text\"] + eos_token}\n\n\n\n dataset = dataset.map(add_eos, load_from_cache_file=False)\n\n\n\n dataset = dataset.select(range(min(2000, len(dataset))))\n\n\n\n def tokenize_function(examples):\n\n outputs = tokenizer(\n\n examples[\"text\"],\n\n truncation=True,\n\n max_length=MAX_SEQ_LENGTH,\n\n padding=False,\n\n )\n\n outputs[\"labels\"] = outputs[\"input_ids\"].copy()\n\n return outputs\n\n\n\n tokenized_dataset = dataset.map(\n\n tokenize_function,\n\n batched=True,\n\n batch_size=1000,\n\n remove_columns=[\"text\"],\n\n load_from_cache_file=False,\n\n )\n\n\n\n latest_checkpoint = get_latest_checkpoint(OUTPUT_DIR)\n\n if latest_checkpoint:\n\n print(f\"Resuming from checkpoint: {latest_checkpoint}\")\n\n else:\n\n print(\"No checkpoint found. 
Starting fresh.\")\n\n\n\n data_collator = DataCollatorForLanguageModeling(\n\n tokenizer=tokenizer,\n\n mlm=False,\n\n )\n\n\n\n trainer = Trainer(\n\n model=model,\n\n train_dataset=tokenized_dataset,\n\n data_collator=data_collator,\n\n args=TrainingArguments(\n\n output_dir=OUTPUT_DIR,\n\n per_device_train_batch_size=1,\n\n gradient_accumulation_steps=32,\n\n warmup_steps=20,\n\n max_steps=100,\n\n learning_rate=1e-4,\n\n fp16=True,\n\n bf16=False,\n\n logging_steps=5,\n\n optim=\"adamw_8bit\",\n\n weight_decay=0.01,\n\n lr_scheduler_type=\"cosine\",\n\n save_steps=50,\n\n save_total_limit=2,\n\n report_to=\"none\",\n\n seed=3407,\n\n remove_unused_columns=False,\n\n dataloader_num_workers=0,\n\n ),\n\n )\n\n\n\n trainer.train(resume_from_checkpoint=latest_checkpoint)\n\n\n\n model.save_pretrained(OUTPUT_DIR)\n\n tokenizer.save_pretrained(OUTPUT_DIR)\n\n\n\n print(f\"Done. Saved to: {OUTPUT_DIR}\")\n\n\n\n if __name__ == \"__main__\":\n\n main()\n\n\nIs my Persian data too small? I have another ~15 GB of semi-clean Persian Wikipedia text in .txt format that could help.\n\nI thought about translating good English datasets using a local 1.8B translator (it works well, but not amazingly so) and then fine-tuning a small language model to make the dataset more natural, but you said that “English-then-translate loses quality”.",
"title": "Fine-Tuning an SLM for a Low-Resource Language"
}