External Publication

I ran GLM-5.1 (754B, 176GB GGUF) on a 16GB RAM APU machine — here's what happened

Hugging Face Forums [Unofficial] May 19, 2026

Hey everyone,

I wanted to see what happens when you push budget consumer hardware to its absolute physical limits. So I tried running GLM-5.1-IQ1_M (754 billion parameters, 176GB of split GGUF files) on my “potato” PC:

CPU: Ryzen 5 5600G (6 Cores / 12 Threads)
RAM: 16GB DDR4 (4GB allocated to integrated Vega 7 iGPU VRAM, leaving 12GB system RAM)
Storage: 512GB PCIe Gen3 NVMe SSD (mounted partition formatted in NTFS under Linux)

The Memory Miracle (How it didn’t OOM)

Normally, trying to load a 176GB model on a 16GB RAM machine ends with the kernel's OOM killer terminating the process almost immediately. However, I compiled llama.cpp with AVX2 and used memory mapping (--mmap), so the OS lazily mapped the 176GB file into virtual memory instead of reading it all into RAM up front.

Peak System RAM Usage: Only 8.34 GB!
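The roughly 714 billion parameters that are inactive for a given token never had to sit in physical RAM at the same time. Because the mapped GGUF pages are read-only and file-backed, the OS page cache could simply evict them under memory pressure and re-read them from disk when needed.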
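For anyone who wants to see the lazy-mapping behaviour in isolation, here is a minimal Python sketch. It is not how llama.cpp implements its loader; the 64 MiB temporary file is a hypothetical stand-in for a GGUF shard, and `read_window` is a helper name made up for this example. The point is only that mapping a file costs almost nothing until specific pages are touched.

```python
import mmap
import os
import tempfile


def read_window(path: str, offset: int, length: int) -> bytes:
    """Map a file read-only and copy out one small window of it."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file. Pages are only faulted in from disk
        # when they are touched, and read-only file-backed pages can be
        # dropped by the OS under memory pressure.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length]


# Hypothetical stand-in for a GGUF shard: a 64 MiB file that is mostly zeros.
size = 64 * 1024 * 1024
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.truncate(size)      # extend the file to the full size
    tmp.seek(size - 4)
    tmp.write(b"GGUF")      # marker bytes at the very end of the file
    demo_path = tmp.name

# Only the pages backing the last 4 bytes need to be resident for this read.
tail = read_window(demo_path, size - 4, 4)
os.remove(demo_path)
print(tail)
```

The same idea scales to a 176GB file: the mapping reserves address space, not RAM, and the resident set is whatever the current token's weights happen to touch.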

Scientific Proof of the SSD Bottleneck

My generation speed was roughly 0.05 t/s (22.69 seconds per token) and prompt processing was 0.21 t/s. Here is the back-of-the-envelope math showing that storage throughput is the primary constraint:

GLM-5.1 routes to 8 active experts per token (~40B active parameters).
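In IQ1_M (nominally ~1.6 bits/weight, although the 176GB file averages closer to 1.9 bits/weight across all 754B parameters), each token requires streaming roughly **10 GB of weights** from the disk.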
Due to Linux NTFS/FUSE driver overhead on my mounted partition, sequential read was capped at ~650 MB/s.

$$\text{Theoretical Speed Limit} = \frac{10\text{ GB}}{0.65\text{ GB/s}} \approx 15.4\text{ seconds/token}$$

Once per-layer overhead and CPU compute time are added on top, my measured 22.69 s/token is consistent with a disk-bound workload: the I/O floor alone accounts for roughly two thirds of it.
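The arithmetic above can be reproduced in a few lines of Python. This is only a sketch using the figures quoted in this post (10 GB per token, 0.65 GB/s, 22.69 s/token measured). The file-wide average bits per weight is derived here from the 176GB file size and 754B parameter count as a sanity check; it is an assumption that the active experts are quantized at about that average.

```python
def gb_per_token(active_params: float, bits_per_weight: float) -> float:
    """Gigabytes of weights touched per token for a given quantization."""
    return active_params * bits_per_weight / 8 / 1e9


def min_seconds_per_token(gb_streamed: float, read_gb_per_s: float) -> float:
    """Lower bound on latency if every byte has to come from disk."""
    return gb_streamed / read_gb_per_s


# File-wide average bits per weight (assumption: active experts match it).
avg_bpw = 176e9 * 8 / 754e9

# Sanity check on the "10 GB per token" figure for ~40B active parameters.
streamed_gb = gb_per_token(40e9, avg_bpw)

# I/O floor using the post's rounded 10 GB and the ~650 MB/s read speed.
floor_s = min_seconds_per_token(10.0, 0.65)

# Share of the measured 22.69 s/token explained by disk reads alone.
measured_s = 22.69
io_share = floor_s / measured_s

print(f"avg bits/weight: {avg_bpw:.2f}")
print(f"streamed per token: {streamed_gb:.1f} GB")
print(f"I/O floor: {floor_s:.1f} s/token ({io_share:.0%} of measured)")
```

Swapping in a faster read speed (for example, a native ext4 partition instead of NTFS under FUSE) shows directly how much the floor would move.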

The Next Step: “Expert Cache / Pinning” Design

To attack this I/O bottleneck, I’m proposing an Expert Cache/Pinning Layer architecture for llama.cpp/GGML. It relies on Mixture-of-Experts activations following a highly skewed Zipfian (power-law) distribution, with activation frequency proportional to rank^{-1.15}:

Pinning just 12 out of 64 routed experts (only ~19% of the routed-expert parameters) in GPU VRAM or locked RAM (mlock) yields a 73% cache hit rate.
This drops the average disk read per token from 10 GB to just 2.7 GB, projecting speeds up to 0.24 t/s (~4.2 seconds/token), a 5.3x speedup on the same potato hardware!
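The hit-rate and speed projection can be checked with a short Python sketch. It assumes an idealized Zipf distribution over expert rank with exponent 1.15, identical across layers, and that cache misses are the only cost; `zipf_hit_rate` is a hypothetical helper name for this example, not anything in llama.cpp.

```python
def zipf_hit_rate(num_experts: int, pinned: int, exponent: float) -> float:
    """Fraction of expert activations served by the `pinned` hottest experts,
    assuming activation frequency is proportional to rank ** -exponent."""
    weights = [rank ** -exponent for rank in range(1, num_experts + 1)]
    return sum(weights[:pinned]) / sum(weights)


# Figures from the post: 64 routed experts, 12 pinned, exponent 1.15.
hit = zipf_hit_rate(64, 12, 1.15)

# Only cache misses go to disk (10 GB per token without any cache).
disk_gb = 10.0 * (1.0 - hit)

# Projected I/O floor at the same ~650 MB/s read speed.
projected_s = disk_gb / 0.65

print(f"hit rate: {hit:.1%}")
print(f"disk read per token: {disk_gb:.2f} GB")
print(f"projected floor: {projected_s:.2f} s/token")
```

Note that this projection is an I/O floor, so real-world numbers would sit somewhat above it, just as the measured 22.69 s/token sits above the uncached floor.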

I put together a GitHub repo containing the benchmarking logs, real-time CSV metrics, and the detailed C++ architectural design proposal.

GitHub Repository: snrj35-dev/754B-on-a-Potato

Let me know what you think about MoE disk streaming, and if anyone has tried caching expert tensors inside llama.cpp yet!
