
Found the fix for memory not being freed when switching models on Linux (it's not Python or PyTorch)

Hugging Face Forums [Unofficial] March 29, 2026

Great question. The key thing is that PyTorch doesn’t make one big malloc per model. It makes hundreds of small ones, one per tensor. Most are a few hundred KB to a few MB, nowhere near 16MiB.

The problem is the dynamic threshold creeping upward. It starts at 128KB, and every time an mmap’d chunk bigger than the current threshold gets freed, glibc raises the threshold to that chunk’s size. After enough model switches, allocations that used to get mmap’d are now landing in arenas and fragmenting.
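To make the creep concrete, here is a toy model of that rule in Python. It is only a sketch: `ThresholdModel` and `run` are made-up names, chunk headers and free-list reuse are ignored, and the 32MiB ceiling assumes 64-bit Linux. It does not call glibc; it just replays the "raise the threshold on free" behavior described above.

```python
KIB = 1024
MIB = 1024 * KIB

DEFAULT_THRESHOLD = 128 * KIB   # glibc's starting mmap threshold
THRESHOLD_MAX = 32 * MIB        # assumed ceiling on 64-bit Linux


class ThresholdModel:
    """Toy model of the mmap threshold (hypothetical, not a glibc API)."""

    def __init__(self, threshold=DEFAULT_THRESHOLD, dynamic=True):
        self.threshold = threshold
        self.dynamic = dynamic

    def malloc(self, size):
        # Requests at or above the threshold are served by mmap,
        # everything else comes out of an arena.
        return "mmap" if size >= self.threshold else "arena"

    def free(self, size, origin):
        # Freeing an mmap'd chunk larger than the current threshold
        # (and within the ceiling) bumps the threshold to that size.
        if self.dynamic and origin == "mmap" and self.threshold < size <= THRESHOLD_MAX:
            self.threshold = size


def run(model, sizes):
    """Allocate and immediately free each size; return where each one landed."""
    placements = []
    for size in sizes:
        origin = model.malloc(size)
        placements.append(origin)
        model.free(size, origin)
    return placements


sizes = [1 * MIB, 4 * MIB, 1 * MIB]

# Default behavior: the second 1MiB request no longer qualifies for mmap.
print(run(ThresholdModel(), sizes))                                  # ['mmap', 'mmap', 'arena']

# Threshold pinned at 64KB with auto-tuning off: everything stays on mmap.
print(run(ThresholdModel(threshold=64 * KIB, dynamic=False), sizes))  # ['mmap', 'mmap', 'mmap']
```

Same 1MiB request, different destination, purely because a 4MiB chunk was freed in between. That is the whole mechanism.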

Setting MALLOC_MMAP_THRESHOLD_=65536 locks it at 64KB and kills the dynamic adjustment entirely. The docs even say that: setting the parameter disables auto-tuning. So everything from 64KB up goes through mmap and gets cleanly returned to the OS when it’s freed.
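One practical detail: glibc reads this variable when the process starts, so changing os.environ from inside an already running Python process won’t affect that process. A minimal sketch of a launcher that sets it for a child process instead (`pinned_env` and `run_pinned` are hypothetical helper names, and `serve.py` in the note below is a placeholder):

```python
import os
import subprocess


def pinned_env(threshold_bytes=65536, base=None):
    """Return a copy of the environment with the mmap threshold pinned."""
    env = dict(os.environ if base is None else base)
    # The trailing underscore is part of the variable name.
    env["MALLOC_MMAP_THRESHOLD_"] = str(threshold_bytes)
    return env


def run_pinned(argv, threshold_bytes=65536):
    """Start argv as a child process that sees the pinned threshold."""
    return subprocess.run(argv, env=pinned_env(threshold_bytes), check=True)


# Show what the child would see, without launching anything.
print(pinned_env(base={})["MALLOC_MMAP_THRESHOLD_"])  # 65536
```

Usage would look like `run_pinned(["python", "serve.py"])`. Exporting the variable in the shell or in a service unit file before launching does the same job.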

On mmap loading: safetensors supports it, but PyTorch still mallocs for dtype conversion, GPU staging buffers, optimizer state, etc. Those intermediate allocations are what fragment the arenas, not the weight file read.

Also, a small note: sizeof(long) is 8 on x86-64 Linux (LP64), so the ceiling (4*1024*1024*sizeof(long)) is 32MiB, not 16. Doesn’t change anything, since individual tensor allocs are way under either number.
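The arithmetic, as a quick check. This only evaluates the 64-bit formula quoted above; `mmap_threshold_ceiling` is an illustrative name, not something glibc exposes.

```python
import ctypes

MIB = 1024 * 1024


def mmap_threshold_ceiling(sizeof_long):
    """64-bit glibc formula for the dynamic threshold ceiling, in bytes."""
    return 4 * 1024 * 1024 * sizeof_long


# LP64 (x86-64 Linux): sizeof(long) == 8
print(mmap_threshold_ceiling(8) // MIB)  # 32

# Plugging in 4 is how you end up at 16
print(mmap_threshold_ceiling(4) // MIB)  # 16

# Whatever the current platform's C long happens to be
print(mmap_threshold_ceiling(ctypes.sizeof(ctypes.c_long)) // MIB)
```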
