Mellum2-12B-A2.5B-Instruct Q4_K_M on Jetson Orin Nano 8GB
Hi! Thanks for the suggestions. I worked through them in order on a Jetson Orin Nano 8GB running llama.cpp.
System
Upgraded from JetPack/L4T 36.4.7 to 36.5.0
Verified 25W mode (nvpmodel)
Verified jetson_clocks
Monitored with tegrastats
Qwen2.5-Coder-7B-Instruct Q4_K_M
Loaded and ran normally
~11.2-11.3 tokens/sec generation
Performance appears consistent with reported Orin Nano results
Tests performed
-ngl 60, 80, 99: no meaningful difference
Context 1024 vs 2048: minimal difference
Flash Attention: no meaningful difference
Batch/microbatch tuning: no meaningful difference
KV cache q8_0: significantly slower
KV cache q4_0: degraded output quality
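For anyone who wants to repeat the -ngl comparison, here is a minimal sketch of one way to script it. It assumes llama.cpp's llama-bench tool is on the PATH and that it accepts comma-separated values for -ngl; the runs above were not necessarily done this way. The model filename and the -p/-n token counts are placeholders, so check `llama-bench --help` on your build before relying on the flags.

```python
import shlex

def build_bench_command(model_path, ngl_values, prompt_tokens=512, gen_tokens=128):
    """Build an argv list for one llama-bench sweep over several -ngl values.

    Assumption: llama-bench takes a comma-separated list for -ngl and
    benchmarks each value in turn.
    """
    return [
        "llama-bench",
        "-m", model_path,
        "-ngl", ",".join(str(v) for v in ngl_values),
        "-p", str(prompt_tokens),   # prompt-processing test length (placeholder)
        "-n", str(gen_tokens),      # generation test length (placeholder)
    ]

# Hypothetical local filename; substitute the GGUF you actually downloaded.
cmd = build_bench_command("qwen2.5-coder-7b-instruct-q4_k_m.gguf", [60, 80, 99])
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd)` directly. Building it as a list rather than a string avoids quoting problems if the model path contains spaces.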
Memory observations
Qwen 7B used ~5.1 GB RAM under load
Modest swap usage
System remained stable
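RAM and swap figures like these can be pulled out of tegrastats output with a small parser. The sketch below assumes the usual `RAM used/totalMB` and `SWAP used/totalMB` fields. The sample line is made up for illustration and is not a captured log from these runs.

```python
import re

# Assumed tegrastats field layout: "RAM 5230/7620MB ... SWAP 312/3810MB ..."
RAM_RE = re.compile(r"RAM (\d+)/(\d+)MB")
SWAP_RE = re.compile(r"SWAP (\d+)/(\d+)MB")

def parse_memory(line):
    """Return RAM and swap usage in MB from one tegrastats line, or None."""
    ram = RAM_RE.search(line)
    if ram is None:
        return None
    swap = SWAP_RE.search(line)
    return {
        "ram_used_mb": int(ram.group(1)),
        "ram_total_mb": int(ram.group(2)),
        # Swap may be absent if no swap device is configured.
        "swap_used_mb": int(swap.group(1)) if swap else 0,
    }

# Illustrative line only (hypothetical values).
sample = "RAM 5230/7620MB (lfb 2x4MB) SWAP 312/3810MB (cached 5MB)"
print(parse_memory(sample))
```

Feeding each line of a saved tegrastats log through `parse_memory` gives a simple time series of RAM and swap usage during generation.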
Granite/Mellum test results
Using ibm-granite/granite-4.0-h-small-GGUF:Q4_K_M:
-ngl 99: CUDA OOM, attempted ~18.6 GiB allocation
-ngl 20: CUDA OOM, attempted ~9.0 GiB allocation
-ngl 10: model loaded successfully and generated output
However, at -ngl 10 performance was extremely slow. Generation started eventually, but first-token latency was very long, memory usage was near system limits, swap was active, and even interrupting the process took noticeable time.
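The OOM numbers allow a rough back-of-the-envelope check. This sketch assumes the attempted allocation scales linearly with the number of offloaded layers and that all layers are the same size. It ignores KV cache and compute buffers, and the 5.0 GiB budget is a hypothetical figure for what is left after the OS and CPU-side buffers, so treat the result as an estimate.

```python
def per_layer_gib(alloc_gib, layers_offloaded):
    """Estimate GPU memory per offloaded layer from one observed allocation."""
    return alloc_gib / layers_offloaded

def max_layers_that_fit(budget_gib, layer_gib):
    """Largest whole number of layers whose estimated size fits the budget."""
    return int(budget_gib // layer_gib)

# Observed above: -ngl 20 attempted ~9.0 GiB.
layer_gib = per_layer_gib(9.0, 20)

# Observed above: -ngl 99 attempted ~18.6 GiB. Under the linear assumption
# this corresponds to roughly this many layers actually being offloaded.
implied_layers = 18.6 / layer_gib

# Hypothetical budget on an 8 GB unified-memory board.
fit = max_layers_that_fit(5.0, layer_gib)

print(f"{layer_gib:.2f} GiB per layer, ~{implied_layers:.0f} layers implied, "
      f"about {fit} layers fit in 5.0 GiB")
```

Under these assumptions only around ten layers fit on the GPU, which matches -ngl 10 being the first setting that loaded. The remaining layers stay on the CPU side of the same 8 GB, so swap pressure and slow generation are expected.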