Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreifukl6izms4j3uxaio2d2ufak2pdaya2daqq4o6gyuk5x4ssxh7bi",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mjs6kbqqmmd2"
  },
  "path": "/t/building-local-my-2026-headless-ai-server-journey/175243#post_4",
  "publishedAt": "2026-04-18T18:27:45.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "\"You’re totally right to keep an eye on them—local models have come a long way! Even with a 3060 Ti, you can actually run some impressive stuff.\n\nBecause of how GGUF works now, you can ‘offload’ specific layers to your 8GB VRAM and let the rest spill over into your 32GB of system RAM. If you want to try it, I’d recommend starting with **Gemma-4-E4B-it-Q8_0.gguf**.\n\nAt that quantization, the model is about **7.5GB**. If you set your context to `PARAMETER num_ctx 32768`, it should fit comfortably across your VRAM and RAM. You’ll probably see speeds around **8–15 tokens/sec** —not blazing fast, but the reasoning quality is excellent for a local setup.\n\nIf you want to go even bigger, you could technically run the **Gemma-4 26B (A4B)**. By putting as many layers as possible on the GPU and the rest on your memory, you’d likely hit **6–8 tokens/sec**. Even if you went full system memory, you’d still get about **3–4 tokens/sec**. It’s definitely worth a shot if you want cloud-level smarts without the privacy concerns!\"\n\n* * *\n\n###",
  "title": "Building Local: My 2026 Headless AI Server Journey"
}