
I spent two weeks optimizing 96GB of VRAM for local LLMs. Paid APIs still won.

DEV Community [Unofficial] June 20, 2026

I run a homelab with four RTX 3090s — 96 GB of VRAM, 44 CPU cores. For two weeks I tried to make it my daily driver for local LLM inference instead of paying for cloud APIs. I got it working. Then I looked at the numbers and subscribed to a paid API anyway.

Here's the uncomfortable part, and the optimizations that still made it worth doing.

The setup

  • 4× RTX 3090 (Ampere: BF16 tensor cores, but no native FP8), 96 GB VRAM total, 44 cores
  • Models: Qwen3.6-35B-A3B (Q8_0, MoE) and Qwen3-Coder-Next (Q6_K, hybrid)
  • llama.cpp in router mode + OpenWebUI
  • Ceiling I hit: ~105 tokens/second

The 6% problem

The wall wasn't compute. GPU utilization sat at 6%. The bottleneck was CPU orchestration — llama.cpp dispatches across multiple GPUs sequentially, so the cards spent 94% of the time idle waiting on each other. Throwing more VRAM at it does nothing for this.
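To check a figure like that 6% on your own rig, a minimal sketch along these lines can average per-GPU utilization over a few samples. It assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` CSV output; the helper names (`parse_utilization`, `mean_utilization`, `sample_once`) are hypothetical and not part of llama.cpp or any library.

```python
import csv
import io
import subprocess
from statistics import mean

# nvidia-smi prints one "index, utilization" row per GPU with these flags.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_utilization(csv_text):
    """Collect utilization samples per GPU index from nvidia-smi CSV rows."""
    samples = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        try:
            index, util = int(row[0].strip()), float(row[1].strip())
        except ValueError:
            continue  # skip values such as "[N/A]"
        samples.setdefault(index, []).append(util)
    return samples


def mean_utilization(samples):
    """Average the collected samples for each GPU, ordered by index."""
    return {gpu: mean(values) for gpu, values in sorted(samples.items())}


def sample_once():
    """Run nvidia-smi once and return its raw CSV text (needs an NVIDIA driver)."""
    return subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout


# Usage while a generation is running (left commented out so the sketch
# imports cleanly on machines without a GPU):
#   text = "".join(sample_once() for _ in range(10))
#   print(mean_utilization(parse_utilization(text)))
```

Sampling during generation rather than at idle matters here, since the point is how busy the cards are while the model is actually decoding.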

What actually moved the needle

Change                        | Effect
--ubatch-size 512             | +40% throughput
KV cache quantization (Q4_0)  | 4× VRAM savings
Speculative decoding (n-gram) | 2.5× speedup on repetitive tasks
YaRN rope scaling             | context extended to 1M tokens
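Why the n-gram speedup only shows up on repetitive tasks is easier to see in code. The following is a simplified sketch of the general idea (look up the last few tokens earlier in the context and propose whatever followed them as a draft), not llama.cpp's actual implementation; the function names and default sizes are hypothetical.

```python
def ngram_draft(tokens, ngram_size=3, max_draft=8):
    """Propose draft tokens by reusing what followed the most recent
    earlier occurrence of the trailing n-gram. Returns [] if none exists."""
    if len(tokens) <= ngram_size:
        return []
    key = tokens[-ngram_size:]
    # Scan right to left, starting just before the trailing n-gram itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == key:
            begin = start + ngram_size
            return tokens[begin:begin + max_draft]
    return []


def accepted_prefix_length(draft, verified):
    """Count draft tokens that match the model's own choices,
    stopping at the first mismatch (later draft tokens are discarded)."""
    accepted = 0
    for drafted, actual in zip(draft, verified):
        if drafted != actual:
            break
        accepted += 1
    return accepted


# Repetitive input: the trailing "for i in" already appeared once.
context = "for i in range ( n ) : for i in".split()
draft = ngram_draft(context, ngram_size=3, max_draft=4)
print(draft)
```

The draft costs almost nothing to produce, and the model verifies all drafted tokens in one batched pass. On text with few repeated spans the lookup returns nothing, so there is nothing to accept and no speedup.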

Two things surprised me:

  • MoE models tolerate aggressive quantization far better than dense ones — inactive experts don't eat bandwidth, so the quant hit lands softer.
  • The 3B-active-parameter model was great at local decisions but fell apart on coherence past ~300–400 lines of code: fine for a function, not for cross-file consistency.
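The bandwidth point in the first bullet can be put into rough numbers. This is a back-of-envelope sketch, not a measurement: it assumes decoding is limited purely by reading the active weights once per token, takes 3B active and 35B total parameters from the model name, uses the RTX 3090's 936 GB/s memory bandwidth, and ignores KV cache reads, routing overhead, and multi-GPU effects.

```python
# Q8_0 stores 32 weights in 34 bytes (32 int8 values + one fp16 scale).
Q8_0_BITS_PER_WEIGHT = 34 * 8 / 32  # 8.5 bits

RTX_3090_BANDWIDTH = 936e9  # bytes/second, spec-sheet figure for one card

ACTIVE_PARAMS = 3e9   # ASSUMPTION: read off the "A3B" in the model name
TOTAL_PARAMS = 35e9   # ASSUMPTION: read off the "35B" in the model name


def weight_bytes_per_token(params_read, bits_per_weight):
    """Bytes of weights that must be streamed from VRAM for one token."""
    return params_read * bits_per_weight / 8


def decode_ceiling(params_read, bits_per_weight, bandwidth):
    """Upper bound on tokens/second if weight reads were the only cost."""
    return bandwidth / weight_bytes_per_token(params_read, bits_per_weight)


moe_ceiling = decode_ceiling(ACTIVE_PARAMS, Q8_0_BITS_PER_WEIGHT, RTX_3090_BANDWIDTH)
dense_ceiling = decode_ceiling(TOTAL_PARAMS, Q8_0_BITS_PER_WEIGHT, RTX_3090_BANDWIDTH)

print(f"MoE (3B active) ceiling:  {moe_ceiling:.0f} tok/s")
print(f"Dense 35B ceiling:        {dense_ceiling:.0f} tok/s")
```

Under these assumptions the MoE model touches roughly a twelfth of the bytes per token that a dense model of the same total size would, which is why a heavier format like Q8_0 stays affordable for it.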

The conclusion I didn't want

At ~11 kWh/day, plus hardware depreciation, against current API pricing, the math doesn't favor local for interactive work. The single biggest improvement to my daily AI workflow was paying for an API. Local still wins for privacy, high-volume batch jobs, or uncensored experimentation — but not as a general cloud replacement. It's an economics problem, not a capability one.
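A rough version of that arithmetic, as a sketch only: the 11 kWh/day and 105 tokens/second figures come from above, while the electricity price and the interactive token volume are placeholder assumptions to replace with your own numbers. It also assumes the rig draws 11 kWh/day regardless of load and leaves hardware depreciation out entirely.

```python
SECONDS_PER_DAY = 86_400

KWH_PER_DAY = 11          # from the post
TOKENS_PER_SECOND = 105   # throughput ceiling from the post

PRICE_PER_KWH = 0.30                  # ASSUMPTION: placeholder electricity price
INTERACTIVE_TOKENS_PER_DAY = 200_000  # ASSUMPTION: placeholder daily usage


def energy_cost_per_day(kwh_per_day, price_per_kwh):
    """Electricity cost for one day of running the rig."""
    return kwh_per_day * price_per_kwh


def max_tokens_per_day(tokens_per_second):
    """Tokens generated if the rig ran flat out for 24 hours."""
    return tokens_per_second * SECONDS_PER_DAY


def energy_cost_per_million_tokens(kwh_per_day, price_per_kwh, tokens_per_day):
    """Electricity cost attributed to each million generated tokens."""
    return energy_cost_per_day(kwh_per_day, price_per_kwh) / (tokens_per_day / 1e6)


saturated = energy_cost_per_million_tokens(
    KWH_PER_DAY, PRICE_PER_KWH, max_tokens_per_day(TOKENS_PER_SECOND)
)
interactive = energy_cost_per_million_tokens(
    KWH_PER_DAY, PRICE_PER_KWH, INTERACTIVE_TOKENS_PER_DAY
)

print(f"Energy cost per 1M tokens, saturated batch: {saturated:.2f}")
print(f"Energy cost per 1M tokens, interactive use: {interactive:.2f}")
```

The gap between the two results is the whole argument: a rig that is saturated around the clock spreads its power bill over millions of tokens, while interactive use pays the same bill for a small fraction of that output.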

I wrote up the full cost breakdown and the exact llama.cpp router configs on aipster.com. If you're weighing a local rig, I also benchmarked GLM 5.2's open weights — it changed my view on what's worth running at home.

What's your GPU utilization actually sitting at? Curious if anyone solved the sequential-dispatch problem.
