Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreihzhju3zrr4pshep5p6ikqptdg2p3isxaa5jmh4chgyxulzhzue2m",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mnfh54nyknl2"
  },
  "path": "/t/fine-tuning-an-slm-for-a-low-resource-language/176467#post_3",
  "publishedAt": "2026-06-03T13:59:08.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "a Persian Wikipedia QA dataset",
    "a cleaned Persian translation of Alpaca"
  ],
  "textContent": "Thank you very much for your help! My target language is Persian (Iranian), and for the base model I used **Qwen 3 0.6B Instruct**. I chose this model because I am targeting low-end devices, such as phones and school computers.\n\nMy main limitation is hardware. I have an **RTX 3050 Mobile 4Gb** , **32Gb RAM** , and an **Intel i5-11400H**.\n\nIf you have time, I have some more questions:\n\nI tried to continue pretraining through a **LoRA adapter** using the **Unsloth** library in Python with these LoRA adapter settings:\n\n\n    model = FastLanguageModel.get_peft_model(\n\n        model,\n\n        r=8,\n\n        target_modules=[\n\n            \"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\",\n\n            \"gate_proj\", \"up_proj\", \"down_proj\",\n\n        ],\n\n        lora_alpha=16,\n\n        lora_dropout=0,\n\n        bias=\"none\",\n\n        use_gradient_checkpointing=\"unsloth\",\n\n        random_state=3407,\n\n        use_rslora=False,\n\n        loftq_config=None,\n\n    )\n\n\nI used a _.jsonl_ file containing semi-cleaned raw Wikipedia text in the target language, but the results were not very good and were weaker than the original model. 
Unfortunately, I do not have the training logs, but as far as I remember, the loss started at around **2.3** and had dropped to about **1.5** by the end of training.\n\nI used an **800 MB** cleaned Persian Wikipedia _.jsonl_ file to test whether the model would improve, but the results were negative, at least in my setup.\n\nFor fine-tuning, I used two datasets:\n\nPersian-Wiki-QA (a Persian Wikipedia QA dataset)\n\nAlpaca-Persian-Cleaned (a cleaned Persian translation of Alpaca, translated using Google Translate)\n\nThis is the training script I used to train the model:\n\n\n    import os\n\n    from datasets import load_dataset\n\n    from unsloth import FastLanguageModel\n\n    from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling\n\n\n\n\n\n    MODEL_PATH = r\"F:\\Deep Learning\\Model\\Qwen 3 0.6B SafeTensor\"\n\n    DATA_PATH  = r\"E:\\Ava1\\Dataset-pars\\Wikipedia-Persian\\Wikipedia-Persian-textonly.jsonl\"\n\n    OUTPUT_DIR = r\"F:\\Deep Learning\\Outputs\\qwen3-0.6b-persian-pretrain-lora\"\n\n\n\n    MAX_SEQ_LENGTH = 256\n\n    LOAD_IN_4BIT = True\n\n\n\n    def get_latest_checkpoint(output_dir):\n\n        if not os.path.exists(output_dir):\n\n            return None\n\n        checkpoints = []\n\n        for name in os.listdir(output_dir):\n\n            path = os.path.join(output_dir, name)\n\n            if os.path.isdir(path) and name.startswith(\"checkpoint-\"):\n\n                try:\n\n                    step = int(name.split(\"-\")[-1])\n\n                    checkpoints.append((step, path))\n\n                except ValueError:  # skip folders whose suffix is not a step number\n\n                    pass\n\n        if not checkpoints:\n\n            return None\n\n        checkpoints.sort(key=lambda x: x[0])\n\n        return checkpoints[-1][1]\n\n\n\n    def main():\n\n        os.makedirs(OUTPUT_DIR, exist_ok=True)\n\n\n\n        model, tokenizer = FastLanguageModel.from_pretrained(\n\n            model_name=MODEL_PATH,\n\n            
max_seq_length=MAX_SEQ_LENGTH,\n\n            dtype=None,\n\n            load_in_4bit=LOAD_IN_4BIT,\n\n        )\n\n\n\n        if tokenizer.pad_token is None:\n\n            tokenizer.pad_token = tokenizer.eos_token\n\n\n\n        model = FastLanguageModel.get_peft_model(\n\n            model,\n\n            r=8,\n\n            target_modules=[\n\n                \"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\",\n\n                \"gate_proj\", \"up_proj\", \"down_proj\",\n\n            ],\n\n            lora_alpha=16,\n\n            lora_dropout=0,\n\n            bias=\"none\",\n\n            use_gradient_checkpointing=\"unsloth\",\n\n            random_state=3407,\n\n            use_rslora=False,\n\n            loftq_config=None,\n\n        )\n\n\n\n        dataset = load_dataset(\"json\", data_files=DATA_PATH, split=\"train\")\n\n\n\n        def clean_example(example):\n\n            text = example.get(\"Text\", \"\")\n\n            if text is None:\n\n                text = \"\"\n\n            return {\"text\": str(text).strip()}\n\n\n\n        dataset = dataset.map(\n\n            clean_example,\n\n            remove_columns=dataset.column_names,\n\n            load_from_cache_file=False,\n\n        )\n\n\n\n        dataset = dataset.filter(\n\n            lambda x: len(x[\"text\"]) > 50,\n\n            load_from_cache_file=False,\n\n        )\n\n\n\n        eos_token = tokenizer.eos_token or \"\"\n\n\n\n        def add_eos(example):\n\n            return {\"text\": example[\"text\"] + eos_token}\n\n\n\n        dataset = dataset.map(add_eos, load_from_cache_file=False)\n\n\n\n        dataset = dataset.select(range(min(2000, len(dataset))))\n\n\n\n        def tokenize_function(examples):\n\n            outputs = tokenizer(\n\n                examples[\"text\"],\n\n                truncation=True,\n\n                max_length=MAX_SEQ_LENGTH,\n\n                padding=False,\n\n            )\n\n            outputs[\"labels\"] = outputs[\"input_ids\"].copy()\n\n 
           return outputs\n\n\n\n        tokenized_dataset = dataset.map(\n\n            tokenize_function,\n\n            batched=True,\n\n            batch_size=1000,\n\n            remove_columns=[\"text\"],\n\n            load_from_cache_file=False,\n\n        )\n\n\n\n        latest_checkpoint = get_latest_checkpoint(OUTPUT_DIR)\n\n        if latest_checkpoint:\n\n            print(f\"Resuming from checkpoint: {latest_checkpoint}\")\n\n        else:\n\n            print(\"No checkpoint found. Starting fresh.\")\n\n\n\n        data_collator = DataCollatorForLanguageModeling(\n\n            tokenizer=tokenizer,\n\n            mlm=False,\n\n        )\n\n\n\n        trainer = Trainer(\n\n            model=model,\n\n            train_dataset=tokenized_dataset,\n\n            data_collator=data_collator,\n\n            args=TrainingArguments(\n\n                output_dir=OUTPUT_DIR,\n\n                per_device_train_batch_size=1,\n\n                gradient_accumulation_steps=32,\n\n                warmup_steps=20,\n\n                max_steps=100,\n\n                learning_rate=1e-4,\n\n                fp16=True,\n\n                bf16=False,\n\n                logging_steps=5,\n\n                optim=\"adamw_8bit\",\n\n                weight_decay=0.01,\n\n                lr_scheduler_type=\"cosine\",\n\n                save_steps=50,\n\n                save_total_limit=2,\n\n                report_to=\"none\",\n\n                seed=3407,\n\n                remove_unused_columns=False,\n\n                dataloader_num_workers=0,\n\n            ),\n\n        )\n\n\n\n        trainer.train(resume_from_checkpoint=latest_checkpoint)\n\n\n\n        model.save_pretrained(OUTPUT_DIR)\n\n        tokenizer.save_pretrained(OUTPUT_DIR)\n\n\n\n        print(f\"Done. Saved to: {OUTPUT_DIR}\")\n\n\n\n    if __name__ == \"__main__\":\n\n        main()\n\n\nIs my Persian Data small? 
I have another ~15 GB of semi-clean .txt Persian Wikipedia text that could help.\n\nI thought about translating good English datasets using a local 1.8B translator (it works well, but not amazingly) and then fine-tuning an SLM to make the dataset more natural, but you said “English-then-translate loses quality”.",
  "title": "Fine-Tuning an SLM for a Low-Resource Language"
}