Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreighah52c7kuikqdmilboyicbykvvba4s4rdmt4hwzshhz34rz5k74",
    "uri": "at://did:plc:5sgu76a53rz3n6unbykmovqy/app.bsky.feed.post/3mm33hmnkh6l2"
  },
  "description": "LoRA (Low-Rank Adaptation) is a fine-tuning method introduced by Hu et al. at Microsoft in 2021. Instead of updating all billions of parameters in a large model, LoRA freezes the original weights and injects trainable low-rank matrices into each transformer layer. The insight: weight updates during fine-tuning have low \"intrinsic rank\", most of the useful signal lives in a much smaller subspace.\n\n\nThe Math\n\nFor a weight matrix W (d×k), LoRA learns two small matrices: A (d×r) and B (r×k), where r",
  "path": "/engineering-glossary/lora-low-rank-adaptation/",
  "publishedAt": "2026-05-17T19:20:58.000Z",
  "site": "https://sahilkapoor.com",
  "tags": [
    "Prompt Engineering",
    "System Prompt",
    "Rlhf",
    "Vllm",
    "Ollama"
  ],
  "textContent": "LoRA (Low-Rank Adaptation) is a fine-tuning method introduced by Hu et al. at Microsoft in 2021. Instead of updating all billions of parameters in a large model, LoRA freezes the original weights and injects trainable low-rank matrices into each transformer layer. The insight: weight updates during fine-tuning have low \"intrinsic rank\", most of the useful signal lives in a much smaller subspace.\n\n## The Math\n\nFor a weight matrix W (d×k), LoRA learns two small matrices: A (d×r) and B (r×k), where r ≪ min(d,k). The adapted weight is W + BA. At inference, BA is merged into W, no extra latency. Training parameters = r×(d+k) instead of d×k. With r=8 on a 7B parameter model, you train roughly 0.1% of parameters.\n\n## QLoRA\n\nQLoRA (Quantized LoRA) extends LoRA by quantizing the base model to 4-bit precision (NF4) before fine-tuning, then training LoRA adapters in 16-bit. This lets you fine-tune a 70B parameter model on a single 48GB A100, hardware that would normally only fit a 7B model for full fine-tuning. QLoRA is the standard approach for fine-tuning large models on consumer or academic GPU budgets.\n\n## When to Use LoRA\n\n  * **Domain adaptation** , teach a general model the vocabulary and style of a specific domain (legal, medical, code)\n  * **Instruction following** , train a base model to follow chat-style instructions\n  * **Format control** , reliable output formatting (JSON schema, specific response structures)\n  * **Behavior adjustment** , reduce refusals, change tone, instill specific personas\n\n\n\n## LoRA vs Prompt Engineering\n\nBefore investing in LoRA fine-tuning, exhaust Prompt Engineering options. A well-crafted System Prompt with few-shot examples often achieves 80% of what fine-tuning does at zero compute cost. 
LoRA makes sense when: the task requires knowledge not in the base model, you need consistent output format across millions of calls, or you need to run a smaller/cheaper model for cost reasons after fine-tuning it to match a larger model's quality.\n\n## Rlhf and LoRA\n\nRlhf (Reinforcement Learning from Human Feedback) is often implemented using LoRA for the SFT and RLHF training stages, it's more practical than full fine-tuning at scale.\n\n## Inference with LoRA Adapters\n\nLoRA adapters are small files (MBs vs GBs for full weights) that can be hot-swapped on Vllm or Ollama endpoints. This enables \"adapter serving\", one base model, multiple task-specific adapters loaded dynamically.\n\n## Related Terms\n\n  * Rlhf, fine-tuning paradigm that often uses LoRA internally\n  * Vllm, inference engine with LoRA adapter support\n  * Ollama, can load custom LoRA-adapted models in GGUF format\n  * Prompt Engineering, first thing to try before investing in fine-tuning\n\n",
  "title": "LoRA (Low-Rank Adaptation)",
  "updatedAt": "2026-05-18T20:03:57.336Z"
}
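
The parameter-count formula and the merge step described in the record's "The Math" section can be checked with a small pure-Python sketch. The dimensions (d = k = 4096, r = 8) and the tiny 2×3 example matrices are illustrative assumptions, not values from the record. The sketch uses the shape convention B (d×r), A (r×k), so that BA has the same shape as W.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_param_count(d, k, r):
    """Trainable parameters for one adapted d x k matrix: B is d x r, A is r x k."""
    return r * (d + k)

# Assumed, illustrative dimensions for a single projection matrix.
d, k, r = 4096, 4096, 8
full_params = d * k
lora_params = lora_param_count(d, k, r)
fraction = lora_params / full_params  # share of this one matrix that is trained

# Tiny merge check with made-up values: W is 2x3, B is 2x1, A is 1x3 (rank 1).
W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
B = [[0.5], [-1.0]]
A = [[1.0, 0.0, 2.0]]
x = [1.0, 1.0, 1.0]

# Adapter kept separate: y = W x + B (A x)
y_adapter = [w + b for w, b in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# Adapter merged into the weights: y = (W + BA) x
BA = matmul(B, A)
W_merged = [[w + delta for w, delta in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]
y_merged = matvec(W_merged, x)

print(lora_params, full_params, fraction)
print(y_adapter, y_merged)
```

For this single matrix the trained share is about 0.39%. The whole-model share is lower than that because adapters are typically attached to only some of a model's weight matrices. The two output vectors match, which is why a merged adapter adds no inference latency.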