{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreieiqsyqethcmsbbvtsldtxvtjc6bhqdybs224emhqhmowfl43p3ni",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mnfulg5odjd2"
},
"path": "/t/fine-tuning-an-slm-for-a-low-resource-language/176467#post_4",
"publishedAt": "2026-06-03T17:54:08.000Z",
"site": "https://discuss.huggingface.co",
"textContent": "I Wrote a new Training script with LoRA adapter rank 64 and a alpha of 64, Its slower now, but iām letting it train and see what will come off next. These are the Logs :\n\n\n [transformers] ==((====))== Unsloth - 2x faster free finetuning | Num GPUs used = 1\n \\\\ /| Num examples = 1,109,504 | Num Epochs = 1 | Total steps = 34,672\n O^O/ \\_/ \\ Batch size per device = 4 | Gradient accumulation steps = 8\n \\ / Data Parallel GPUs = 1 | Total batch size (4 x 8 x 1) = 32\n \"-____-\" Trainable parameters = 40,370,176 of 636,420,096 (6.34% trained)\n 0%| | 0/34672 [00:00<?, ?it/s][transformers] `use_return_dict` is deprecated! Use `return_dict` instead!\n {'loss': '2.918', 'grad_norm': '1.558', 'learning_rate': '9e-06', 'epoch': '0.0002884'}\n {'loss': '2.773', 'grad_norm': '1.016', 'learning_rate': '1.9e-05', 'epoch': '0.0005768'}\n {'loss': '2.621', 'grad_norm': '0.7811', 'learning_rate': '2.9e-05', 'epoch': '0.0008653'}\n {'loss': '2.546', 'grad_norm': '0.7259', 'learning_rate': '3.9e-05', 'epoch': '0.001154'}\n {'loss': '2.322', 'grad_norm': '0.6498', 'learning_rate': '4.9e-05', 'epoch': '0.001442'}\n {'loss': '2.378', 'grad_norm': '0.6687', 'learning_rate': '5.9e-05', 'epoch': '0.001731'}\n {'loss': '2.293', 'grad_norm': '0.7748', 'learning_rate': '6.9e-05', 'epoch': '0.002019'}\n {'loss': '2.309', 'grad_norm': '0.7728', 'learning_rate': '7.9e-05', 'epoch': '0.002307'}\n {'loss': '2.214', 'grad_norm': '0.7984', 'learning_rate': '8.9e-05', 'epoch': '0.002596'}\n {'loss': '2.203', 'grad_norm': '0.7599', 'learning_rate': '9.9e-05', 'epoch': '0.002884'}\n {'loss': '2.227', 'grad_norm': '0.8078', 'learning_rate': '0.0001', 'epoch': '0.003173'}\n {'loss': '2.179', 'grad_norm': '0.7982', 'learning_rate': '0.0001', 'epoch': '0.003461'}\n {'loss': '2.155', 'grad_norm': '0.7713', 'learning_rate': '0.0001', 'epoch': '0.003749'}\n {'loss': '2.102', 'grad_norm': '0.7508', 'learning_rate': '0.0001', 'epoch': '0.004038'}\n {'loss': '2.094', 'grad_norm': 
'0.7127', 'learning_rate': '0.0001', 'epoch': '0.004326'}\n {'loss': '2.146', 'grad_norm': '0.7476', 'learning_rate': '0.0001', 'epoch': '0.004615'}\n {'loss': '2.163', 'grad_norm': '0.7814', 'learning_rate': '0.0001', 'epoch': '0.004903'}\n {'loss': '2.079', 'grad_norm': '0.7486', 'learning_rate': '0.0001', 'epoch': '0.005192'}\n {'loss': '2.142', 'grad_norm': '0.7015', 'learning_rate': '0.0001', 'epoch': '0.00548'}\n {'loss': '2.083', 'grad_norm': '0.7789', 'learning_rate': '0.0001', 'epoch': '0.005768'}\n {'loss': '2.085', 'grad_norm': '0.7249', 'learning_rate': '0.0001', 'epoch': '0.006057'}\n {'loss': '2.03', 'grad_norm': '0.7872', 'learning_rate': '0.0001', 'epoch': '0.006345'}\n {'loss': '2.085', 'grad_norm': '0.7494', 'learning_rate': '0.0001', 'epoch': '0.006634'}\n {'loss': '2.087', 'grad_norm': '0.7177', 'learning_rate': '0.0001', 'epoch': '0.006922'}\n {'loss': '2.073', 'grad_norm': '0.6864', 'learning_rate': '0.0001', 'epoch': '0.00721'}\n {'loss': '2.007', 'grad_norm': '0.7131', 'learning_rate': '9.999e-05', 'epoch': '0.007499'}\n {'loss': '2.074', 'grad_norm': '0.7246', 'learning_rate': '9.999e-05', 'epoch': '0.007787'}\n {'loss': '2.011', 'grad_norm': '0.6955', 'learning_rate': '9.999e-05', 'epoch': '0.008076'}\n {'loss': '1.99', 'grad_norm': '0.666', 'learning_rate': '9.999e-05', 'epoch': '0.008364'}\n {'loss': '2.01', 'grad_norm': '0.655', 'learning_rate': '9.999e-05', 'epoch': '0.008653'}\n {'loss': '2.032', 'grad_norm': '0.6227', 'learning_rate': '9.999e-05', 'epoch': '0.008941'}\n {'loss': '2.003', 'grad_norm': '0.7089', 'learning_rate': '9.999e-05', 'epoch': '0.009229'}\n {'loss': '2.074', 'grad_norm': '0.6487', 'learning_rate': '9.999e-05', 'epoch': '0.009518'}\n {'loss': '2.073', 'grad_norm': '0.6404', 'learning_rate': '9.999e-05', 'epoch': '0.009806'}\n {'loss': '2.052', 'grad_norm': '0.6166', 'learning_rate': '9.999e-05', 'epoch': '0.01009'}\n {'loss': '2.011', 'grad_norm': '0.6447', 'learning_rate': '9.999e-05', 'epoch': '0.01038'}\n 
{'loss': '1.962', 'grad_norm': '0.6773', 'learning_rate': '9.999e-05', 'epoch': '0.01067'}\n {'loss': '1.97', 'grad_norm': '0.6069', 'learning_rate': '9.998e-05', 'epoch': '0.01096'}\n {'loss': '1.982', 'grad_norm': '0.6222', 'learning_rate': '9.998e-05', 'epoch': '0.01125'}\n {'loss': '2.021', 'grad_norm': '0.6198', 'learning_rate': '9.998e-05', 'epoch': '0.01154'}\n {'loss': '1.985', 'grad_norm': '0.629', 'learning_rate': '9.998e-05', 'epoch': '0.01183'}\n {'loss': '1.959', 'grad_norm': '0.6498', 'learning_rate': '9.998e-05', 'epoch': '0.01211'}\n {'loss': '2.03', 'grad_norm': '0.6532', 'learning_rate': '9.998e-05', 'epoch': '0.0124'}\n {'loss': '1.933', 'grad_norm': '0.6459', 'learning_rate': '9.998e-05', 'epoch': '0.01269'}\n {'loss': '1.985', 'grad_norm': '0.6385', 'learning_rate': '9.997e-05', 'epoch': '0.01298'}\n {'loss': '1.995', 'grad_norm': '0.6946', 'learning_rate': '9.997e-05', 'epoch': '0.01327'}\n {'loss': '1.957', 'grad_norm': '0.6225', 'learning_rate': '9.997e-05', 'epoch': '0.01356'}\n {'loss': '1.969', 'grad_norm': '0.5927', 'learning_rate': '9.997e-05', 'epoch': '0.01384'}\n {'loss': '1.993', 'grad_norm': '0.6287', 'learning_rate': '9.997e-05', 'epoch': '0.01413'}\n {'loss': '1.964', 'grad_norm': '0.6393', 'learning_rate': '9.997e-05', 'epoch': '0.01442'}\n {'loss': '1.954', 'grad_norm': '0.6973', 'learning_rate': '9.997e-05', 'epoch': '0.01471'}\n {'loss': '1.981', 'grad_norm': '0.5865', 'learning_rate': '9.996e-05', 'epoch': '0.015'}\n {'loss': '1.903', 'grad_norm': '0.6059', 'learning_rate': '9.996e-05', 'epoch': '0.01529'}\n {'loss': '1.972', 'grad_norm': '0.6095', 'learning_rate': '9.996e-05', 'epoch': '0.01557'}\n {'loss': '1.939', 'grad_norm': '0.6175', 'learning_rate': '9.996e-05', 'epoch': '0.01586'}\n {'loss': '1.919', 'grad_norm': '0.5895', 'learning_rate': '9.996e-05', 'epoch': '0.01615'}\n {'loss': '1.915', 'grad_norm': '0.5853', 'learning_rate': '9.995e-05', 'epoch': '0.01644'}\n {'loss': '1.883', 'grad_norm': '0.6089', 
'learning_rate': '9.995e-05', 'epoch': '0.01673'}\n {'loss': '1.935', 'grad_norm': '0.6074', 'learning_rate': '9.995e-05', 'epoch': '0.01702'}\n {'loss': '1.931', 'grad_norm': '0.6369', 'learning_rate': '9.995e-05', 'epoch': '0.01731'}\n {'loss': '1.896', 'grad_norm': '0.6057', 'learning_rate': '9.995e-05', 'epoch': '0.01759'}\n {'loss': '1.944', 'grad_norm': '0.5966', 'learning_rate': '9.994e-05', 'epoch': '0.01788'}\n {'loss': '1.888', 'grad_norm': '0.6135', 'learning_rate': '9.994e-05', 'epoch': '0.01817'}\n {'loss': '2.014', 'grad_norm': '0.6026', 'learning_rate': '9.994e-05', 'epoch': '0.01846'}\n {'loss': '1.884', 'grad_norm': '0.6325', 'learning_rate': '9.994e-05', 'epoch': '0.01875'}\n {'loss': '1.881', 'grad_norm': '0.6144', 'learning_rate': '9.994e-05', 'epoch': '0.01904'}\n {'loss': '1.927', 'grad_norm': '0.616', 'learning_rate': '9.993e-05', 'epoch': '0.01932'}\n {'loss': '1.924', 'grad_norm': '0.6267', 'learning_rate': '9.993e-05', 'epoch': '0.01961'}\n 2%| | 683/34672 [53:56<42:15:31, 4.48s/it]\n\n\nAnd the script:\n\n\n import os\n from datasets import load_dataset\n from unsloth import FastLanguageModel\n from transformers import TrainingArguments, Trainer\n\n MODEL_PATH = r\"F:\\Deep Learning\\Model\\Qwen 3 0.6B SafeTensor\"\n DATA_PATH = r\"E:\\Ava1\\Dataset-pars\\Wikipedia-Persian\\Wikipedia-Persian-textonly.jsonl\"\n OUTPUT_DIR = r\"F:\\Deep Learning\\Outputs\\qwen3-0.6b-persian-pretrain-lora3\"\n\n MAX_SEQ_LENGTH = 256\n LOAD_IN_4BIT = True\n\n NUM_PROC = 1\n\n\n def prepare_dataset(tokenizer):\n     print(\"Loading dataset...\")\n\n     dataset = load_dataset(\n         \"json\",\n         data_files=DATA_PATH,\n         split=\"train\",\n     )\n\n     text_column = \"Text\" if \"Text\" in dataset.column_names else dataset.column_names[0]\n\n     print(\"Fast tokenization (no truncation)...\")\n\n     def tokenize_function(examples):\n         return tokenizer(\n             examples[text_column],\n             add_special_tokens=True,\n             truncation=False,\n         )\n\n     tokenized = dataset.map(\n         tokenize_function,
\n         batched=True,\n         remove_columns=dataset.column_names,\n         desc=\"Tokenizing\",\n     )\n\n     print(\"Packing into continuous GPT blocks...\")\n\n     def group_texts(examples):\n         concatenated = {k: sum(examples[k], []) for k in examples.keys()}\n         total_length = len(concatenated[\"input_ids\"])\n\n         # Drop the remainder so every block holds exactly MAX_SEQ_LENGTH tokens.\n         total_length = (total_length // MAX_SEQ_LENGTH) * MAX_SEQ_LENGTH\n\n         result = {\n             k: [t[i:i + MAX_SEQ_LENGTH] for i in range(0, total_length, MAX_SEQ_LENGTH)]\n             for k, t in concatenated.items()\n         }\n\n         # Causal LM objective: labels are a copy of the input ids.\n         result[\"labels\"] = result[\"input_ids\"].copy()\n         return result\n\n     lm_dataset = tokenized.map(\n         group_texts,\n         batched=True,\n         batch_size=1000,\n         desc=\"Packing blocks\",\n     )\n\n     lm_dataset = lm_dataset.with_format(\"torch\")\n\n     return lm_dataset\n\n\n def main():\n     os.makedirs(OUTPUT_DIR, exist_ok=True)\n\n     print(\"Loading model with Unsloth...\")\n\n     model, tokenizer = FastLanguageModel.from_pretrained(\n         model_name=MODEL_PATH,\n         max_seq_length=MAX_SEQ_LENGTH,\n         dtype=None,\n         load_in_4bit=LOAD_IN_4BIT,\n     )\n\n     if tokenizer.pad_token is None:\n         tokenizer.pad_token = tokenizer.eos_token\n\n     print(\"Applying LoRA adapters...\")\n\n     model = FastLanguageModel.get_peft_model(\n         model,\n         r=64,\n         target_modules=[\n             \"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\",\n             \"gate_proj\", \"up_proj\", \"down_proj\",\n         ],\n         lora_alpha=64,\n         lora_dropout=0,\n         bias=\"none\",\n         use_gradient_checkpointing=\"unsloth\",\n         random_state=3407,\n     )\n\n     train_dataset = prepare_dataset(tokenizer)\n\n     print(f\"Packed samples: {len(train_dataset):,}\")\n     print(\"Starting continued pretraining...\")\n\n     trainer = Trainer(\n         model=model,\n         train_dataset=train_dataset,\n         args=TrainingArguments(\n             output_dir=OUTPUT_DIR,\n             per_device_train_batch_size=4,\n             gradient_accumulation_steps=8,\n             warmup_steps=100,\n             num_train_epochs=1,\n             learning_rate=1e-4,\n             bf16=True,\n             logging_steps=10,\n             optim=\"adamw_8bit\",\n             weight_decay=0.01,\n             lr_scheduler_type=\"cosine\",\n             save_steps=500,\n             save_total_limit=2,
\n             report_to=\"none\",\n             seed=3407,\n             remove_unused_columns=False,\n         ),\n     )\n\n     trainer.train()\n\n     model.save_pretrained(OUTPUT_DIR)\n     tokenizer.save_pretrained(OUTPUT_DIR)\n\n     print(f\"Done! Model saved to: {OUTPUT_DIR}\")\n\n\n if __name__ == \"__main__\":\n     main()\n\n",
"title": "Fine-Tuning an SLM for a Low-Resource Language"
}