I ran GLM-5.1 (754B, 176GB GGUF) on a 16GB RAM APU machine — here's what happened

Hugging Face Forums [Unofficial] May 19, 2026

Hey everyone,

I wanted to see what happens when you push budget consumer hardware to its absolute physical limits, so I tried running GLM-5.1-IQ1_M (754 billion parameters, 176GB of split GGUF files) on my “potato” PC:

  • CPU: Ryzen 5 5600G (6 Cores / 12 Threads)
  • RAM: 16GB DDR4 (4GB allocated to integrated Vega 7 iGPU VRAM, leaving 12GB system RAM)
  • Storage: 512GB PCIe Gen3 NVMe SSD (model stored on an NTFS-formatted partition mounted under Linux)

The Memory Miracle (How it didn’t OOM)

Normally, trying to load a 176GB model on a 16GB RAM machine ends in an immediate kernel OOM kill. However, with llama.cpp compiled with AVX2 and memory mapping in use (--mmap), the OS lazily mapped the 176GB file into virtual memory, so pages were only read from disk when they were actually touched.

  • Peak System RAM Usage: Only 8.34 GB!
  • For any given token, the ~714 billion inactive parameters never had to occupy physical RAM, and the kernel evicted clean, read-only GGUF pages from the page cache under memory pressure.
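
The lazy-mapping behavior described above can be illustrated with a minimal Python sketch. This is not llama.cpp's loader; the helper names (`make_sparse_file`, `read_slice_via_mmap`) and the 64 MiB demo file are assumptions made up for illustration. It maps a whole file read-only and touches only one small slice, which is the same mechanism that lets a 176GB GGUF be mapped on a 16GB machine.

```python
import mmap
import os
import tempfile


def make_sparse_file(size_bytes: int) -> str:
    """Create a zero-filled temp file (hypothetical stand-in for a GGUF shard)."""
    fd, path = tempfile.mkstemp(suffix=".bin")
    with os.fdopen(fd, "wb") as f:
        f.truncate(size_bytes)  # extends the file without writing real data
    return path


def read_slice_via_mmap(path: str, offset: int, length: int) -> bytes:
    """Map the whole file read-only, then touch only one slice of it."""
    with open(path, "rb") as f:
        # length=0 maps the entire file. Mapping reserves address space only;
        # pages are faulted in from disk when the slice below is read.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length]


demo_path = make_sparse_file(64 * 1024 * 1024)
chunk = read_slice_via_mmap(demo_path, offset=32 * 1024 * 1024, length=4096)
os.remove(demo_path)
print(len(chunk))  # 4096
```

Because the mapping is read-only and file-backed, the kernel can drop those pages at any time and re-read them later, which is why memory pressure leads to eviction rather than an OOM kill.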

Scientific Proof of the SSD Bottleneck

My generation speed was about 0.044 t/s (22.69 seconds per token) and prompt processing was 0.21 t/s. Here is the math showing that storage throughput is the primary constraint:

  • GLM-5.1 routes to 8 active experts per token (~40B active parameters).
  • In IQ1_M (nominally ~1.6 bits/weight; the 176GB file averages closer to 1.9 bits/weight across 754B parameters), each token requires streaming roughly 10 GB of weights from disk.
  • Due to Linux NTFS/FUSE driver overhead on my mounted partition, sequential read was capped at ~650 MB/s.

$$\text{Theoretical Speed Limit} = \frac{10\text{ GB}}{0.65\text{ GB/s}} \approx 15.4\text{ seconds/token}$$

Once per-layer overhead and CPU compute time are added on top of that, the measured 22.69 s/token is consistent with a run dominated by physical disk read speed.
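
The estimate above can be reproduced with a few lines of Python. This is a back-of-the-envelope sketch, assuming a cold cache where every active weight is read from disk on every token; the function names are made up for illustration, and the inputs are the figures quoted in this post.

```python
GB = 1e9  # decimal gigabytes, matching the figures quoted in the post


def effective_bits_per_weight(file_bytes: float, total_params: float) -> float:
    """Average bits per weight implied by the file size."""
    return file_bytes * 8 / total_params


def bytes_per_token(active_params: float, bits_per_weight: float) -> float:
    """Bytes that must be read if all active weights come from disk."""
    return active_params * bits_per_weight / 8


def seconds_per_token(bytes_needed: float, read_bytes_per_s: float) -> float:
    """Lower bound on token latency from disk throughput alone."""
    return bytes_needed / read_bytes_per_s


bpw = effective_bits_per_weight(176 * GB, 754e9)   # about 1.87 bits/weight
per_token = bytes_per_token(40e9, bpw)             # about 9.3 GB, i.e. roughly 10 GB
floor_s = seconds_per_token(10 * GB, 0.65 * GB)    # about 15.4 s/token
disk_share = floor_s / 22.69                       # about 0.68 of the measured time

print(f"{bpw:.2f} bits/weight, {per_token / GB:.1f} GB/token")
print(f"disk floor {floor_s:.1f} s/token, {disk_share:.0%} of measured latency")
```

Under these assumptions, disk reads alone account for roughly two thirds of the measured 22.69 s/token, with the remainder left for compute and overhead.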

The Next Step: “Expert Cache / Pinning” Design

To solve this I/O bottleneck, I’m proposing an Expert Cache/Pinning Layer architecture for llama.cpp/GGML. Since Mixture-of-Experts activations follow a highly skewed Zipfian (power-law) distribution, with activation frequency proportional to rank^{-1.15}:

  • Pinning just 12 out of 64 routed experts (only ~19% of the routed-expert parameters) in GPU VRAM or locked RAM (mlock) yields a 73% cache hit rate.
  • This drops the average disk read per token from 10 GB to just 2.7 GB, projecting speeds up to 0.24 t/s (~4.2 seconds/token). That is roughly 5.4x faster than the measured 22.69 s/token on the same potato hardware, counting disk time only!
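
The cache-hit and speed projections above can be checked with a short Python sketch. It rests on simplifying assumptions that are not confirmed by the benchmark logs: every expert activation is treated as an independent draw from one global Zipf distribution, all experts are the same size, and the full 10 GB per token is routed-expert data. The function names are hypothetical.

```python
def zipf_hit_rate(num_experts: int, pinned: int, exponent: float) -> float:
    """Share of activations that land on the `pinned` most popular experts."""
    # Unnormalized Zipf weights: the expert at popularity rank r gets r^(-exponent).
    weights = [rank ** -exponent for rank in range(1, num_experts + 1)]
    return sum(weights[:pinned]) / sum(weights)


def projected_seconds_per_token(bytes_per_token: float,
                                hit_rate: float,
                                read_bytes_per_s: float) -> float:
    """Disk-only latency when cache hits cost nothing and misses hit the SSD."""
    return bytes_per_token * (1.0 - hit_rate) / read_bytes_per_s


hit = zipf_hit_rate(num_experts=64, pinned=12, exponent=1.15)   # about 0.73
secs = projected_seconds_per_token(10e9, hit, 0.65e9)           # roughly 4.2 s

print(f"hit rate {hit:.2f}, projected {secs:.1f} s/token ({1 / secs:.2f} t/s)")
```

In a real model, routing popularity differs per layer and 8 experts are selected jointly per token, so the measured hit rate could differ from this idealized figure. The projection also leaves out CPU compute time.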

I put together a GitHub repo containing the benchmarking logs, real-time CSV metrics, and the detailed C++ architectural design proposal.

GitHub Repository: snrj35-dev/754B-on-a-Potato

Let me know what you think about MoE disk streaming, and if anyone has tried caching expert tensors inside llama.cpp yet!