{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreihzihgsajwiskp3q5nvqfazuf5z6xkkbn73belbi7ynob3znfyh6a",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mm7ikjzv3di2"
},
"path": "/t/i-ran-glm-5-1-754b-176gb-gguf-on-a-16gb-ram-apu-machine-heres-what-happened/176096#post_1",
"publishedAt": "2026-05-19T12:10:51.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"GitHub - snrj35-dev/754B-on-a-Potato · GitHub"
],
"textContent": "Hey everyone,\n\nI wanted to see what happens when you push budget consumer hardware to its absolute physical limits. So, I tried running **GLM-5.1-IQ1_M (754-Billion parameters, 176GB GGUF split files)** on my “potato” PC:\n\n * **CPU:** Ryzen 5 5600G (6 Cores / 12 Threads)\n * **RAM:** 16GB DDR4 (4GB allocated to integrated Vega 7 iGPU VRAM, leaving 12GB system RAM)\n * **Storage:** 512GB PCIe Gen3 NVMe SSD (mounted partition formatted in NTFS under Linux)\n\n\n\n### The Memory Miracle (How it didn’t OOM)\n\nNormally, loading a 176GB model on a 16GB RAM machine results in an instant Kernel OOM crash. However, by compiling `llama.cpp` with AVX2 and utilizing memory mapping (`--mmap`), the OS lazily mapped the 176GB file into Virtual Memory.\n\n * **Peak System RAM Usage:** Only **8.34 GB**!\n * The inactive 714 billion parameters never occupied physical RAM, and the OS page cache successfully evicted read-only GGUF pages under memory pressure.\n\n\n\n### Scientific Proof of the SSD Bottleneck\n\nMy generation speed was **0.05 t/s (22.69 seconds per token)** and prompt processing was **0.21 t/s**. Here is the mathematical proof that storage throughput is the primary constraint:\n\n * GLM-5.1 routes to 8 active experts per token (~40B active parameters).\n * In `IQ1_M` (~1.6 bits/weight), each token requires streaming **~10 GB of weights** from the disk.\n * Due to Linux NTFS/FUSE driver overhead on my mounted partition, sequential read was capped at **~650 MB/s**.\n\n\n\n\\text{Theoretical Speed Limit} = \\frac{10\\text{ GB}}{0.65\\text{ GB/s}} \\approx 15.3\\text{ seconds/token}\n\nAdding GGUF layer overhead and CPU computation cycles, our actual **22.69s/token** execution is closely aligned with the physical disk speed.\n\n### The Next Step: “Expert Cache / Pinning” Design\n\nTo solve this I/O bottleneck, I’m proposing an **Expert Cache/Pinning Layer** architecture for `llama.cpp`/`GGML`.\nSince Mixture-of-Experts activations follow a highly skewed **Zipfian Power-Law distribution** (x^{-1.15}):\n\n * Pinning just **12 out of 64 routed experts** (only ~18% of the model parameters) in GPU VRAM or locked RAM (`mlock`) yields a **%73 Cache Hit Rate**.\n * This drops the average disk read per token from 10 GB to just 2.7 GB, projecting speeds up to **0.24 t/s (~4.2 seconds/token)** —a **5.3x speedup** on the same potato hardware!\n\n\n\nI put together a GitHub repo containing the benchmarking logs, real-time CSV metrics, and the detailed C++ architectural design proposal.\n\n**GitHub Repository:** GitHub - snrj35-dev/754B-on-a-Potato · GitHub\n\nLet me know what you think about MoE disk streaming, and if anyone has tried caching expert tensors inside llama.cpp yet!",
"title": "I ran GLM-5.1 (754B, 176GB GGUF) on a 16GB RAM APU machine — here's what happened"
}