Mellum2-12B-A2.5B-Instruct Q4_K_M on Jetson Orin Nano 8GB

Hugging Face Forums [Unofficial] June 3, 2026

Hi! Thanks for the suggestions. I worked through them in order on a Jetson Orin Nano 8GB running llama.cpp.

System

  • Upgraded from JetPack/L4T 36.4.7 to 36.5.0

  • Verified 25W mode (nvpmodel)

  • Verified jetson_clocks

  • Monitored with tegrastats

Qwen2.5-Coder-7B-Instruct Q4_K_M

  • Loaded and ran normally

  • ~11.2-11.3 tokens/sec generation

  • Performance appears consistent with reported Orin Nano results

Tests performed

  • -ngl 60, 80, 99: no meaningful difference

  • Context 1024 vs 2048: minimal difference

  • Flash Attention: no meaningful difference

  • Batch/microbatch tuning: no meaningful difference

  • KV cache q8_0: significantly slower

  • KV cache q4_0: degraded output quality
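A back-of-the-envelope estimate helps put the context-size and KV-cache results in perspective. The sketch below is not from the tests themselves: the layer count, KV-head count, and head dimension are assumptions taken from the published Qwen2.5-7B configuration, and `kv_cache_mib` is a hypothetical helper name. It ignores compute buffers and any runtime padding.

```python
# Rough KV-cache size estimate. The model shape values are ASSUMPTIONS
# (published Qwen2.5-7B config), not measurements from this post.
N_LAYERS = 28     # assumed number of transformer layers
N_KV_HEADS = 4    # assumed number of KV heads (grouped-query attention)
HEAD_DIM = 128    # assumed per-head dimension

# Bytes per cached element. q8_0 and q4_0 pack 32 values per block
# together with a 2-byte scale: 34 and 18 bytes per block respectively.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_mib(n_ctx: int, cache_type: str = "f16") -> float:
    """Approximate combined K+V cache size in MiB for n_ctx tokens."""
    elems_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM  # K and V
    return n_ctx * elems_per_token * BYTES_PER_ELEM[cache_type] / 2**20


for ctx in (1024, 2048):
    for ctype in ("f16", "q8_0", "q4_0"):
        print(f"ctx={ctx:5d} {ctype:5s} {kv_cache_mib(ctx, ctype):6.1f} MiB")
```

Under these assumptions the f16 cache comes to roughly 56 MiB at 1024 tokens and 112 MiB at 2048, which is small next to the model weights. That would fit with context size making little difference here, and it suggests cache quantization has little memory to save at these context lengths.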

Memory observations

  • Qwen 7B used ~5.1 GB RAM under load

  • Modest swap usage

  • System remained stable
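For anyone who wants to log these numbers rather than read them off the terminal, here is a minimal sketch that pulls the RAM and swap fields out of a tegrastats line. The `parse_mem` name is hypothetical, the sample line is made up for illustration (not a measurement from these runs), and the exact field layout may differ between L4T releases.

```python
import re

# Matches the "RAM used/totalMB" and "SWAP used/totalMB" fields that
# tegrastats prints on each line.
_FIELD = re.compile(r"\b(RAM|SWAP) (\d+)/(\d+)MB")


def parse_mem(line: str) -> dict:
    """Extract RAM and swap usage (in MB) from one tegrastats line."""
    out = {}
    for name, used, total in _FIELD.findall(line):
        key = name.lower()
        out[f"{key}_used_mb"] = int(used)
        out[f"{key}_total_mb"] = int(total)
    return out


# Illustrative line only; these values are invented, not from the post.
SAMPLE = "RAM 5210/7620MB (lfb 2x4MB) SWAP 312/3810MB (cached 4MB) CPU [12%@1510]"
print(parse_mem(SAMPLE))
```

Piping tegrastats output through a parser like this makes it easier to compare peak RAM and swap across runs with different settings.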

Granite/Mellum test results

Using ibm-granite/granite-4.0-h-small-GGUF:Q4_K_M:

  • -ngl 99: CUDA OOM, attempted ~18.6 GiB allocation

  • -ngl 20: CUDA OOM, attempted ~9.0 GiB allocation

  • -ngl 10: model loaded successfully and generated output

However, at -ngl 10 performance was extremely slow. Generation started eventually, but first-token latency was very long, memory usage was near system limits, swap was active, and even interrupting the process took noticeable time.
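Since -ngl 10 loaded and -ngl 20 did not, the largest loadable value lies somewhere in between. A simple bisection can narrow it down in a few attempts. The sketch below is only an illustration: `max_loadable_ngl` and `fake_loads_ok` are hypothetical names, the predicate is a stand-in rather than a real llama.cpp call, and it assumes that offloading more layers never turns a failing load into a successful one.

```python
from typing import Callable


def max_loadable_ngl(loads_ok: Callable[[int], bool], lo: int, hi: int) -> int:
    """Largest n in [lo, hi) for which loads_ok(n) is True.

    Assumes loads_ok(lo) is True, loads_ok(hi) is False, and that
    success is monotonic (more offloaded layers never helps a load fit).
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if loads_ok(mid):
            lo = mid  # mid loads; the limit is at mid or above
        else:
            hi = mid  # mid fails; the limit is below mid
    return lo


# Stand-in predicate for demonstration only: pretends 13 is the limit.
# A real version would launch llama.cpp with -ngl n and check for an OOM.
def fake_loads_ok(n: int) -> bool:
    return n <= 13


print(max_loadable_ngl(fake_loads_ok, lo=10, hi=20))
```

Starting from the observed bounds of 10 and 20, this needs at most four load attempts. A higher value that loads is not guaranteed to run faster, given how close to the memory limit the -ngl 10 run already was.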
