Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreido3tpcg264nmvwbd2j2xfk4yra3mihudujifsanwdtvcxa4gcmju",
    "uri": "at://did:plc:7fviay5jmlfl6u2aukgj5k42/app.bsky.feed.post/3mol7fxma2e42"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreid2skkpggypxr7sla56cvmsigvaspiwd6mplq7xaa3shhir4qanfa"
    },
    "mimeType": "image/png",
    "size": 174351
  },
  "description": "A plain-English guide to picking a local model for coding, writing, and AI agents - no PhD required\n\nLast updated: June 2026\n\nTL;DR: Jump to the bottom of the page if you want to skip the reading and go straight to my interactive AI model finder tool.\n\nYou don't need a data center or a monthly subscription to run a genuinely capable AI on your own machine anymore. A normal gaming PC can now run models that, two years ago, would have required cloud servers costing A LOT to run. Everything stays o",
  "path": "/run-your-own-ai-in-2026-which-model-fits-your-gpu-and-your-needs/",
  "publishedAt": "2026-06-18T15:57:58.000Z",
  "site": "https://dawid.ai",
  "tags": [
    "OpenClaw",
    "Hermes",
    "Ollama",
    "LM Studio",
    "ollama.com/library",
    "huggingface.co"
  ],
  "textContent": "### A plain-English guide to picking a local model for coding, writing, and AI agents - no PhD required\n\n _Last updated: June 2026_\n\n _TL;DR: Jump to the bottom of the page if you want to skip the reading and go straight to my interactive AI model finder tool._\n\n* * *\n\nYou don't need a data center or a monthly subscription to run a genuinely capable AI on your own machine anymore. A normal gaming PC can now run models that, two years ago, would have required cloud servers costing A LOT to run. Everything stays on your computer: your code, your writing, your weird 2 a.m. questions...\n\nThe catch is that \"which model should I run?\" has become a genuinely confusing question. There are dozens of models, each in a dozen sizes, each in a dozen compression levels, with names like `Qwen3-Coder-30B-A3B-Instruct-Q4_K_M`. This guide cuts through all of it. By the end you'll know exactly what to download for _your_ graphics card and _your_ specific task - whether that's writing code, drafting a blog post, or running an AI agent like OpenClaw or Hermes.\n\nWe'll start with a plain-English crash course (skip it if you already know your VRAM from your KV cache), then get to the actual recommendations.\n\n* * *\n\n## Part 1: What all the jargon actually means\n\n### The single most important number: vRAM\n\nYour graphics card has its own dedicated memory, called v**RAM** (video RAM). This - not your processor, not your regular RAM - is the thing that decides which AI models you can run and how fast.\n\nThink of vRAM as the size of your desk. The AI model is a stack of books you need open in front of you to work. If the stack fits on the desk, everything is fast and smooth. If it doesn't fit, you have to keep some books on a shelf across the room and walk over every time you need them. 
That \"walking across the room\" is the single biggest reason local AI feels slow, and we'll come back to it.\n\nCommon vRAM amounts:\n\n  * **8 GB** - older or budget cards (Nvidia GTX 1070, RTX 3060 8GB; AMD RX 6600, RX 7600; many laptops)\n  * **12 GB** - mid-range (Nvidia RTX 3060 12GB, RTX 4070; AMD RX 6700 XT, RX 7700 XT)\n  * **16 GB** - upper mid-range (Nvidia RTX 4060 Ti 16GB, RTX 4080; AMD RX 6800/6900 XT, RX 7800 XT, RX 9070 XT)\n  * **24 GB** - enthusiast (Nvidia RTX 3090, RTX 4090; AMD RX 7900 XTX). The RTX 5090 and Radeon Pro W7800 sit one step above this tier with 32 GB.\n  * **Apple Silicon Macs** are a special case - they share one big pool of memory between the chip and the graphics, so a 32GB or 64GB Mac can punch well above a similarly priced PC. The trade-off is slower raw speed.\n\n\n\nA note for AMD owners: the models are identical, but the _software_ path differs slightly. AMD cards run through ROCm (on Linux, and increasingly on Windows) or Vulkan, and both Ollama and LM Studio support them - LM Studio ships a ROCm build. One small gotcha: the newer \"I-quants\" (file names starting with `IQ`) don't play nicely with the Vulkan backend, so on AMD prefer the standard \"K-quants\" (`Q4_K_M`, `Q5_K_M`, etc.) unless you've confirmed you're running ROCm.\n\nTo check yours on Windows: open Task Manager → Performance → GPU, and look at \"Dedicated GPU memory\".\n\n### \"Parameters\" and what \"30B\" means\n\nWhen you see a model called **8B**, **27B**, or **30B**, the B stands for _billion parameters_. Parameters are the model's learned knowledge - think of them as the number of tiny dials inside the model that were tuned during training. More parameters generally mean a smarter, more capable model.\n\nBut more parameters also mean a bigger stack of books on your desk. A 30-billion-parameter model is roughly four times the size of an 8-billion one. 
The whole game is finding the smartest model that still fits your vRAM.\n\n### Quantization: the magic that makes this all possible\n\nHere's the trick that lets a normal PC run these things at all: **quantization**.\n\nIn its original form, every one of those billions of parameters is stored at high precision - a long, exact number. Quantization rounds those numbers down to shorter, less precise versions. It's almost exactly like saving a photo as a JPEG: you throw away some detail to make the file dramatically smaller, and most of the time you can't even tell the difference.\n\nYou'll see quantization written as **Q4** , **Q5** , **Q6** , **Q8** , with a `_K_M` or `_K_S` tacked on (those just mean \"medium\" or \"small\" variants of the same method). Higher number = less compression = better quality but bigger file. Here's the honest breakdown:\n\nQuant| What it's like| Quality| Use it when\n---|---|---|---\n**Q4_K_M**|  A good JPEG| Minor quality loss, totally usable| The default. Best fit-vs-quality balance, and what most one-click downloads give you\n**Q5_K_M**|  A high-quality JPEG| Noticeably cleaner, especially for code| You have a little VRAM to spare\n**Q6_K**|  Near-original| Practically indistinguishable from full quality| You have comfortable headroom\n**Q8_0**|  Basically the original| Effectively lossless| You have lots of VRAM and want zero compromise\n\n**The golden rule:** it is _always_ better to run a smaller model at good quality than a bigger model crushed down to nothing. A 12B model at Q6 will beat a 30B model squeezed into Q2 - it'll be smarter _and_ faster _and_ leave you more room. Don't chase the biggest number; chase the best fit.\n\nFor coding specifically, lean toward Q5 or higher if you can - code is unforgiving, and small rounding errors show up as bugs more than they do in casual chat.\n\n### Loading the model vs. 
leaving room for context - the part everyone forgets\n\nThis is the mistake that trips up almost every beginner, so read this twice.\n\nThe model's file size (the numbers in the table above) is just the cost of _loading_ it - getting the books onto the desk. But you also need free space for the actual conversation: everything you type, everything the model has said, every file it's reading. This working memory is called the **context window**, and the space it occupies in vRAM is called the **KV cache**.\n\nThe longer the conversation or the bigger the document you feed it, the more vRAM the context eats - _on top of_ the model. And it adds up fast. A model that fits perfectly in your vRAM with a short prompt can suddenly overflow when you paste in a long file.\n\nSo when you're budgeting vRAM, the math is:\n\n> **Model file size + room for context = total vRAM needed**\n\nA practical example on a 24GB card: a model that loads at ~18GB leaves you about 4-5GB for context, which is plenty for normal back-and-forth (roughly 32,000 tokens of memory, or about 24,000 words) but _not_ enough for stuffing an entire codebase into a single prompt. If you need huge context, you need a smaller model or more vRAM.\n\n### What happens when it doesn't fit: \"offloading to RAM\" (and why it's painful)\n\nIf a model is too big for your vRAM, the software won't necessarily refuse - it'll quietly put the overflow into your regular system RAM instead. This is called **offloading**, and it's the \"walking across the room for books on the shelf\" problem from earlier.\n\nYour regular RAM is _vastly_ slower for this job than vRAM. A model that runs at a brisk 40 words per second fully on the GPU can crater to 1-2 words per second the moment it spills into RAM - slow enough that you'll be staring at the screen waiting for each word. 
In the worst cases (loading something way too big), it drops to the point where a single greeting takes minutes.\n\nThe takeaway: **always pick a model that fully fits in your vRAM with context room to spare.** Offloading is a last resort, not a plan. The only exception is Apple Silicon, where the shared memory pool makes the penalty much gentler.\n\n### Dense vs. MoE (or: why a \"30B\" model can run as fast as an 8B one)\n\nYou'll see some models labeled with a second number, like **30B-A3B** or **35B-A3B**. The \"A3B\" means \"**A** ctive **3B**.\" These are **Mixture-of-Experts (MoE)** models, and they're a clever cheat.\n\nA normal (\"dense\") model uses _all_ of its parameters for every single word it generates. An MoE model is more like a company with many specialists - it only wakes up the relevant 3 billion parameters for each word, leaving the rest dormant. The result: it takes up the _disk and vRAM space_ of a big model, but runs at the _speed_ of a small one.\n\nFor most people on consumer hardware, MoE models like Qwen3-Coder-30B-A3B are the sweet spot in 2026 - big-model smarts at small-model speeds. The only quirk is they still need the full vRAM to _load_ (all the experts have to be in the room, even the sleeping ones).\n\n### GGUF, Unsloth, bartowski - who makes these files?\n\nThe compressed model files you download come in a format called **GGUF** , which is just the universal file type that local AI tools understand - like MP3 for music. Any tool on this list (Ollama, LM Studio, and others) can play a GGUF.\n\nBut _someone_ has to do the compressing, and a few community heroes have become the trusted names:\n\n  * **Unsloth** - makers of \"Dynamic\" quants (you'll see `UD-Q4_K_XL` and similar). Their secret sauce is being smarter about _which_ parts of the model to compress hard and which to protect, giving you better quality at the same file size. 
Generally the recommended starting point.\n  * **bartowski** - a prolific, reliable quantizer who publishes the full ladder (Q3 through Q8) for almost every model, with clear size listings.\n  * **lmstudio-community** - the official curated quants inside LM Studio.\n\n\n\nWhen in doubt, grab the Unsloth or bartowski version of whatever model you want.\n\n### \"Uncensored,\" abliteration, and Heretic - what these actually do\n\nMost models ship with safety guardrails that make them refuse certain requests. The community has tools to remove those refusals, and you'll see the results labeled **\"uncensored,\"** **\"abliterated,\"** or **\"Heretic.\"**\n\nA quick terminology note, since it's commonly misspelled: the technique is **abliteration** , not \"obliteration.\" It's a blend of \"ablate\" - to surgically remove - and \"obliterate.\" It works by identifying the specific internal direction the model uses to say \"no\" and neutralizing it, _without_ retraining the whole model. \"Heretic\" is just a popular automated tool that does this, mostly used on Google's Gemma models.\n\nTwo things worth understanding:\n\n  1. **It's near-lossless when done well.** A good abliteration removes the refusals while keeping the model's intelligence almost completely intact - the best ones score essentially zero refusals with no measurable drop in capability.\n  2. **It removes the _reflex to refuse_ , not the model's underlying judgment.** The model still has its training-shaped instincts. These builds shine for legitimate edge cases the base model is annoyingly squeamish about: security research, fiction involving violence, medical questions, legal grey areas, and adult creative writing. You're responsible for what you do with them, and for following your local laws.\n\n\n\n### Ollama vs. LM Studio - which app do I use?\n\nThese are the two most popular ways to actually run the models. 
They use the same engines under the hood, so model _quality_ is identical - it's purely about how you like to work:\n\n  * **LM Studio** - a polished desktop app with a real graphical interface. You browse and download models with buttons, chat in a clean window, and it even has an \"UNCENSORED\" filter tab for finding abliterated models. **Best for beginners and anyone who'd rather not touch a command line.**\n  * **Ollama** - a command-line tool. You type `ollama run qwen3-coder:30b` and it just works. More scriptable, lighter weight, and the standard backend that most AI _agents_ plug into. **Best for tinkerers and anyone planning to run agents.**\n\n\n\nMany people (myself included) install both: LM Studio for hands-on chatting and Ollama for powering agents in the background.\n\n### What's an \"agent\"?\n\nA regular chat model answers questions. An **agent** is a model hooked up to _tools_ - it can read and write files, run commands, browse the web, and chain many steps together to actually _do_ a task instead of just describing it. \"Refactor this whole project\" or \"research X and write me a summary file\" is agent territory.\n\nTools like **OpenClaw** , **Hermes** , **Cline** , **OpenCode** , and **Aider** are the \"scaffolding\" that wraps a model and gives it those hands. They mostly connect to Ollama running in the background. The important thing for model choice: agents demand strong **tool-calling** ability (the model has to reliably output structured commands) and they burn through context fast (every tool result gets added to the conversation). So for agents, prioritize models explicitly built for it - and leave extra context headroom.\n\n* * *\n\n## Part 2: Which model should you actually run?\n\nNow the payoff. Find your graphics card's vRAM below, then pick based on what you want to do. All sizes assume the Q4_K_M quant unless noted - the safe default. 
Model names are given for both Ollama (the command) and LM Studio (search this in the Discover tab).\n\n### If you have 8 GB of VRAM\n\nThis is the entry tier. You're limited to smaller models, and you'll want to keep contexts modest, but it absolutely works.\n\nGoal| Model| Ollama command| LM Studio search\n---|---|---|---\n**Coding**|  Qwen3 8B| `ollama run qwen3:8b`| `Qwen3-8B-GGUF`\n**Writing / general**|  Gemma 4 12B (Q4, tight)| `ollama run gemma4:12b`| `gemma-4-12b-it`\n**Agents (tool use)**|  Gemma 4 E4B| `ollama run gemma4:e4b`| `gemma-4-e4b`\n**Uncensored**|  Llama 3.1 8B abliterated| `ollama run mannix/llama3.1-8b-abliterated:q5_K_M`| `Meta-Llama-3.1-8B-Instruct-abliterated`\n\n**Reality check:** Gemma 4 12B at Q4 (~7.6GB) is right at the edge - fine for short chats, but you'll have little context room. If it struggles, drop to an 8B model.\n\n### If you have 12 GB of VRAM\n\nA real step up. You can now run capable mid-size models comfortably.\n\nGoal| Model| Ollama command| LM Studio search\n---|---|---|---\n**Coding**|  DeepSeek-Coder-V2 Lite 16B| search Discover| `DeepSeek-Coder-V2-Lite-Instruct-GGUF`\n**Writing / general**|  Gemma 4 12B| `ollama run gemma4:12b`| `gemma-4-12b-it`\n**Agents**|  Qwen3 14B abliterated (`:agent` tag)| `ollama run richardyoung/qwen3-14b-abliterated:agent`| `qwen3-14b-abliterated`\n**Uncensored**|  Gemma 4 12B Heretic| `ollama run igorls/gemma-4-12B-it-heretic-GGUF`| `gemma-4-12B-it-heretic-GGUF`\n\nGemma 4 12B is the workhorse here - strong at writing, multilingual, and you can even run it at Q5 or Q6 for extra polish with vRAM to spare.\n\n### If you have 16 GB of VRAM\n\nYou're now into \"this is genuinely good\" territory and can touch the 24B class.\n\nGoal| Model| Ollama command| LM Studio search\n---|---|---|---\n**Coding / agents**|  Devstral (orig 24B)| `ollama run devstral:24b`| `Devstral-Small-GGUF`\n**Writing / general**|  Gemma 4 12B (at Q6 for quality)| `ollama run gemma4:12b`| `gemma-4-12b-it` 
(Q6_K)\n**Uncensored**|  Gemma 4 12B Heretic (Q5/Q6)| `ollama run igorls/gemma-4-12B-it-heretic-GGUF`| `gemma-4-12B-it-heretic-GGUF`\n\nThe 24B coding models fit but leave thin context headroom at 16GB - workable, but you'll feel the squeeze on long sessions.\n\n### If you have 24 GB of VRAM - the sweet spot ⭐\n\nThis is the target most enthusiasts aim for, and for good reason: it's where the best consumer models live with real context room. This is where you should be if you're serious about local AI.\n\nGoal| Model| Ollama command| LM Studio search\n---|---|---|---\n**Coding & agents (best all-round)**| **Qwen3-Coder 30B-A3B**| `ollama run qwen3-coder:30b`| `Qwen3-Coder-30B-A3B-Instruct-GGUF`\n**Coding (top quality)**|  Qwen3.6 27B| `ollama run qwen3.6:27b`| `Qwen3.6-27B-GGUF` (Unsloth UD-Q4_K_XL)\n**Speed-first agents**|  Qwen3.6 35B-A3B| `ollama run qwen3.6:35b`| `Qwen3.6-35B-A3B-GGUF`\n**Dedicated coding-agent scaffolds**|  Devstral Small 2| `ollama run devstral-small-2:24b`| `Devstral-Small-2-GGUF`\n**Writing & reasoning**| Gemma 4 31B| `ollama run gemma4:31b`| `gemma-4-31B-it-GGUF`\n**Uncensored (best)**|  Qwen3.6 35B-A3B Uncensored| `ollama run fredrezones55/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive`| same name\n**Uncensored writing**|  Gemma 4 31B Heretic| search Discover| `gemma-4-31B-it-heretic-Gguf`\n\n**The one-line recommendation for most people on 24GB:** start with **Qwen3-Coder 30B-A3B at Q4**. It's a Mixture-of-Experts model, so it loads in ~18GB but runs fast, it's excellent at both coding and agentic tool use, it leaves enough room for a comfortable 32K context, and its license (Apache 2.0) means you can use it for anything. If you mostly write prose, swap to **Gemma 4 31B** instead - it has the best feel for natural language.\n\nPS. If you are planning to buy a GPU for LLM work specifically, I highly suggest AMD RX 7900 XTX. It's a lot cheaper than Nvidia, less power hungry, and relatively \"small\" - making it much easier to build. 
If you are planning to use it for image/video generation, then stick to Nvidia - sadly, all libraries out there are built using the CUDA library, which was created by Nvidia :]\n\n* * *\n\n## Part 3: The full quant ladder (reference)\n\nDownload sizes in GB, weights only - remember to add room for context on top. \"Fits 24GB?\" assumes you want usable context, not just bare loading.\n\nModel| Q4_K_M| Q5_K_M| Q6_K| Q8_0| Best home\n---|---|---|---|---|---\nQwen3-Coder 30B-A3B| 18.6| 21.7| 25.1| 32.5| 24GB at Q4\nQwen3.6 27B (dense)| 17| ~19| ~22| ~28.5| 24GB at Q4\nQwen3.6 35B-A3B (MoE)| ~21| ~24.5| ~28| ~37| 24GB at Q4 only\nDevstral Small 2 (24B)| 15| ~17| ~19.5| 26| 16–24GB\nGemma 4 31B| 18.3| 21.7| 25.2| 32.6| 24GB at Q4\nGemma 4 26B-A4B (MoE)| ~16| ~18.5| ~21.5| ~27.5| 24GB\nGemma 4 12B| 7.6| ~8.8| ~10| 13| 12–16GB, any quant\nQwen3 8B| ~5.0| ~5.7| ~6.6| ~8.5| 8GB, any quant\nLlama 3.1 8B| ~4.9| ~5.7| ~6.6| ~8.5| 8GB, any quant\n\n_Sizes are from the published Unsloth/bartowski GGUF repos where available, and computed from standard compression ratios otherwise (accurate to about ±0.5GB). Builders vary slightly - treat these as planning numbers._\n\n* * *\n\n## Part 4: A simple decision flow\n\n  1. **Check your vRAM.** That dictates your tier - don't fight it. You will only waste time with zero benefit.\n  2. **Pick your main goal:** code, write, or run agents.\n  3. **Default to the Q4_K_M version** of the recommended model for your tier and goal.\n  4. **Leave context room.** Aim for a model file that's at least ~3-4GB smaller than your vRAM.\n  5. **If quality matters more than context** (and you have headroom), step up one quant level.\n  6. **If you keep hitting \"out of memory\" or it crawls,** you're offloading to RAM - drop to a smaller model or lower quant. Don't tolerate the slowdown.\n  7. 
**For agents,** prefer models built for tool use (Qwen3-Coder, Devstral, the `:agent`-tagged abliterated builds) and budget extra context.\n\n\n\n* * *\n\n## A note on how fast this moves\n\nThe single most important thing to know is that this entire field reshuffles every few weeks. New models, new versions, and better quants drop constantly, and version numbers multiply (Qwen 3.5 and 3.6, Gemma 4's many variants, and others can all coexist at once). Everything here is accurate as of mid-2026, but before you commit, spend two minutes verifying the current model tags directly on ollama.com/library or huggingface.co the day you set up. The _principles_ in Part 1 won't change - but the specific winners will.\n\nNow go and run something :]",
  "title": "Run Your Own AI in 2026: Which Model Fits Your GPU (and Your Needs)",
  "updatedAt": "2026-06-18T15:57:59.430Z"
}