I ran GLM-5.1 (754B, 176GB GGUF) on a 16GB RAM APU machine — here's what happened
Hey everyone,
I wanted to see what happens when you push budget consumer hardware to its absolute physical limits. So I tried running GLM-5.1-IQ1_M (754 billion parameters, 176GB of split GGUF files) on my “potato” PC:
- CPU: Ryzen 5 5600G (6 Cores / 12 Threads)
- RAM: 16GB DDR4 (4GB allocated to integrated Vega 7 iGPU VRAM, leaving 12GB system RAM)
- Storage: 512GB PCIe Gen3 NVMe SSD (NTFS-formatted partition, mounted under Linux)
The Memory Miracle (How it didn’t OOM)
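Normally, loading a 176GB model on a 16GB RAM machine gets the process killed by the kernel OOM killer almost immediately. However, with llama.cpp compiled with AVX2 and memory mapping in use (--mmap, which llama.cpp enables by default), the OS lazily mapped the 176GB file into virtual memory and only paged in the regions that were actually read.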
- Peak System RAM Usage: Only 8.34 GB!
- The inactive 714 billion parameters never occupied physical RAM, and the OS page cache successfully evicted read-only GGUF pages under memory pressure.
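For anyone curious what lazy mapping looks like in practice, here is a minimal Python sketch (not llama.cpp's actual loader). It maps a file read-only and copies out one small slice, so only the pages backing that slice need to be resident. The helper names read_slice_via_mmap and make_sparse_file and the 64 MiB demo file are made up for illustration.

```python
import mmap
import os
import tempfile


def make_sparse_file(size_bytes, marker, marker_offset):
    """Create a mostly empty temp file with a few marker bytes at one offset.

    Hypothetical helper standing in for a large GGUF file.
    """
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.truncate(size_bytes)      # sets the file size; typically stored sparsely
        f.seek(marker_offset)
        f.write(marker)
    return path


def read_slice_via_mmap(path, offset, length):
    """Map the whole file read-only and copy out a single slice.

    Mapping does not read the file. Pages are faulted in on first access,
    and because they are clean (read-only), the kernel is free to drop
    them again under memory pressure.
    """
    with open(path, "rb") as f:
        # length 0 means "map the entire file"
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length]


if __name__ == "__main__":
    size = 64 * 1024 * 1024          # 64 MiB stand-in for the 176GB model
    offset = 48 * 1024 * 1024        # pretend one expert lives here
    path = make_sparse_file(size, b"expert-7", offset)
    try:
        print(read_slice_via_mmap(path, offset, 8))
    finally:
        os.remove(path)
```

The same idea scales to the real model: the mapping covers the full file, but physical RAM only ever holds the pages that were touched recently.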
Scientific Proof of the SSD Bottleneck
My generation speed was roughly 0.05 t/s (22.69 seconds per token, which is 0.044 t/s before rounding) and prompt processing was 0.21 t/s. Here is a back-of-the-envelope calculation showing that storage throughput is the primary constraint:
- GLM-5.1 routes to 8 active experts per token (~40B active parameters).
- In IQ1_M (nominally 1.6 bits/weight, though the 176GB file averages about 1.9 bits/weight across 754B parameters), each token requires streaming roughly **10 GB of weights** from the disk.
- Due to Linux NTFS/FUSE driver overhead on my mounted partition, sequential read was capped at ~650 MB/s.
$$\text{Theoretical Speed Limit} = \frac{10\text{ GB}}{0.65\text{ GB/s}} \approx 15.4\text{ seconds/token}$$
Adding GGUF layer overhead and CPU computation cycles, the measured 22.69 s/token is about 1.5x this disk-bound floor, which is consistent with storage throughput being the dominant cost.
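The estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, using the numbers from this post (40B active parameters, 176 GB file, 754B total parameters, 650 MB/s). The function names are placeholders, and the model ignores compute time, page-cache hits, and read amplification.

```python
def effective_bits_per_weight(file_bytes, total_params):
    """Average bits stored per parameter, derived from the file size."""
    return file_bytes * 8 / total_params


def bytes_per_token(active_params, bits_per_weight):
    """Bytes of weights that must be read to produce one token."""
    return active_params * bits_per_weight / 8


def seconds_per_token(bytes_needed, read_bytes_per_s):
    """Lower bound on time per token if every byte comes from disk."""
    return bytes_needed / read_bytes_per_s


if __name__ == "__main__":
    bpw = effective_bits_per_weight(176e9, 754e9)
    print(f"effective bits/weight: {bpw:.2f}")

    # Nominal 1.6 bpw versus the file-size-derived average
    print(f"GB/token at 1.6 bpw:  {bytes_per_token(40e9, 1.6) / 1e9:.1f}")
    print(f"GB/token at {bpw:.2f} bpw: {bytes_per_token(40e9, bpw) / 1e9:.1f}")

    # The post rounds the per-token read up to 10 GB
    floor = seconds_per_token(10e9, 0.65e9)
    print(f"disk-bound floor: {floor:.1f} s/token")
    print(f"measured / floor: {22.69 / floor:.2f}x")
```

Swapping in a faster sequential read figure (for example, a natively formatted partition) shows directly how much of the 22.69 s/token is attributable to the disk.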
The Next Step: “Expert Cache / Pinning” Design
To solve this I/O bottleneck, I’m proposing an Expert Cache/Pinning Layer architecture for llama.cpp/GGML.
Since Mixture-of-Experts activations follow a highly skewed Zipfian power-law distribution ($x^{-1.15}$):
- Pinning just 12 out of 64 routed experts (only ~19% of the routed expert parameters) in GPU VRAM or locked RAM (mlock) yields a 73% cache hit rate.
- This drops the average disk read per token from 10 GB to just 2.7 GB, projecting speeds up to 0.24 t/s (~4.2 seconds/token), roughly a 5.4x speedup over the measured 22.69 s/token on the same potato hardware!
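Here is a rough Python sketch of how those projections fall out of the Zipf assumption. It treats every expert read as an independent draw from a rank^(-1.15) distribution over 64 experts, which is a simplification (real routing picks 8 distinct experts per token and differs per layer). The function names are illustrative and are not taken from the repo's C++ design.

```python
def zipf_hit_rate(num_experts, pinned, exponent):
    """Fraction of expert accesses served by pinning the `pinned` hottest experts.

    Assumes access frequency of the expert at rank r is proportional
    to r ** -exponent.
    """
    weights = [rank ** -exponent for rank in range(1, num_experts + 1)]
    return sum(weights[:pinned]) / sum(weights)


def projected_seconds_per_token(bytes_per_token, hit_rate, read_bytes_per_s):
    """Disk-bound time per token when only cache misses hit the SSD."""
    return bytes_per_token * (1 - hit_rate) / read_bytes_per_s


if __name__ == "__main__":
    hit = zipf_hit_rate(num_experts=64, pinned=12, exponent=1.15)
    print(f"hit rate with 12/64 pinned: {hit:.2%}")   # close to 73%

    secs = projected_seconds_per_token(10e9, hit, 0.65e9)
    print(f"projected: {secs:.2f} s/token ({1 / secs:.2f} t/s)")
    print(f"speedup vs measured 22.69 s/token: {22.69 / secs:.1f}x")

    # How the hit rate grows with the number of pinned experts
    for n in (4, 8, 12, 16, 24):
        print(n, f"{zipf_hit_rate(64, n, 1.15):.2%}")
```

Note that the projection compares a disk-only estimate against a measured number that includes compute, so the real-world gain would likely be somewhat lower.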
I put together a GitHub repo containing the benchmarking logs, real-time CSV metrics, and the detailed C++ architectural design proposal.
GitHub Repository: snrj35-dev/754B-on-a-Potato
Let me know what you think about MoE disk streaming, and if anyone has tried caching expert tensors inside llama.cpp yet!