Mellum2-12B-A2.5B-Instruct Q4_K_M on Jetson Orin Nano 8GB

Hugging Face Forums [Unofficial] June 3, 2026


Hi! Thanks for the suggestions. I worked through them in order on a Jetson Orin Nano 8GB running llama.cpp.

System

Upgraded from JetPack/L4T 36.4.7 to 36.5.0
Verified 25W mode (nvpmodel)
Verified jetson_clocks
Monitored with tegrastats
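
For anyone who wants to log memory during a run instead of watching tegrastats by eye, here is a minimal Python sketch that pulls the RAM and SWAP fields out of one tegrastats line. The function name and the sample line are my own illustrations, not output from this thread, and the field layout is assumed from typical tegrastats output on Orin, so check it against your own.

```python
import re

# Matches the RAM and SWAP fields of one tegrastats line, e.g.
# "RAM 5213/7620MB (lfb 4x4MB) SWAP 312/3810MB (cached 0MB) ..."
_FIELD = re.compile(r"\b(RAM|SWAP) (\d+)/(\d+)MB")


def parse_tegrastats(line):
    """Return {'RAM': (used_mb, total_mb), 'SWAP': (used_mb, total_mb)} for the fields found."""
    return {name: (int(used), int(total)) for name, used, total in _FIELD.findall(line)}


# Illustrative sample line (assumed format), not a measurement from this thread.
sample = "RAM 5213/7620MB (lfb 4x4MB) SWAP 312/3810MB (cached 0MB) CPU [12%@1510]"
usage = parse_tegrastats(sample)
for name, (used, total) in usage.items():
    print(f"{name}: {used}/{total} MB ({100 * used / total:.0f}%)")
```

Piping `tegrastats` into a loop that calls this on each line would give a simple time series of RAM and swap use alongside the llama.cpp run.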

Qwen2.5-Coder-7B-Instruct Q4_K_M

Loaded and ran normally
~11.2-11.3 tokens/sec generation
Performance appears consistent with reported Orin Nano results

Tests performed

-ngl 60, 80, 99: no meaningful difference
Context 1024 vs 2048: minimal difference
Flash Attention: no meaningful difference
Batch/microbatch tuning: no meaningful difference
KV cache q8_0: significantly slower
KV cache q4_0: degraded output quality
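
To put the context-size and KV-cache results in perspective, here is a rough back-of-the-envelope estimate of the KV cache size in Python. The layer count, KV head count, and head dimension are assumptions taken from the published Qwen2.5-7B configuration (28 layers, 4 KV heads, head dimension 128), and the estimate ignores any padding or extra buffers llama.cpp may allocate.

```python
# Bytes per element for each cache type: f16 is 2 bytes; q8_0 packs 32 values
# into 34 bytes and q4_0 packs 32 values into 18 bytes (ggml block formats).
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_mib(n_ctx, cache_type, n_layers=28, n_kv_heads=4, head_dim=128):
    """Approximate size of the K plus V caches in MiB.

    The defaults are assumed from the Qwen2.5-7B config, not measured here.
    """
    elems = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # factor 2: K and V
    return elems * BYTES_PER_ELEM[cache_type] / 2**20


for n_ctx in (1024, 2048):
    for cache_type in ("f16", "q8_0", "q4_0"):
        size = kv_cache_mib(n_ctx, cache_type)
        print(f"ctx={n_ctx:5d} {cache_type:5s} {size:6.1f} MiB")
```

Under these assumptions the f16 cache at 2048 context comes to about 112 MiB, which is small next to the model weights. That would be consistent with context size and cache quantization making little difference to memory here, though it does not explain the speed or quality changes.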

Memory observations

Qwen 7B used ~5.1 GB RAM under load
Modest swap usage
System remained stable

Granite/Mellum test results

Using ibm-granite/granite-4.0-h-small-GGUF:Q4_K_M:

-ngl 99: CUDA OOM, attempted ~18.6 GiB allocation
-ngl 20: CUDA OOM, attempted ~9.0 GiB allocation
-ngl 10: model loaded successfully and generated output

However, performance at -ngl 10 was extremely slow. Generation did start eventually, but first-token latency was very long, memory usage was near the system limit, swap was active, and even interrupting the process took a noticeable amount of time.
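
The manual step-down from -ngl 99 to 20 to 10 could be automated with a small search loop. This is only a sketch: `largest_loadable_ngl` and `try_load` are hypothetical names, and the stand-in loader below simply mimics the outcomes reported above rather than launching llama.cpp.

```python
def largest_loadable_ngl(try_load, candidates):
    """Return the largest candidate for which try_load(ngl) is truthy, else None."""
    for ngl in sorted(candidates, reverse=True):
        if try_load(ngl):
            return ngl
    return None


# Stand-in loader that mimics the reported results (99 and 20 fail, 10 loads).
# A real try_load would launch llama.cpp with the given -ngl value and report
# whether the model loaded without a CUDA out-of-memory error.
def fake_try_load(ngl):
    return ngl <= 10


print(largest_loadable_ngl(fake_try_load, [99, 20, 10, 0]))
```

Note that a successful load only says the model fits. As the -ngl 10 result shows, it says nothing about whether generation speed is usable once swap is involved.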
