Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiag3nkhuae2s4ejv27m33wzxblcf7jx746pat3le2ik4bkaeyd2ii",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mnfulbsiium2"
  },
  "path": "/t/mellum2-12b-a2-5b-instruct-q4-k-m-on-jetson-orin-nano-8gb/176480#post_3",
  "publishedAt": "2026-06-03T18:57:27.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "Hi! Thanks for the suggestions. I worked through them in order on a Jetson Orin Nano 8GB running llama.cpp.\n\n**System**\n\n  * Upgraded from JetPack/L4T 36.4.7 to 36.5.0\n\n  * Verified 25W mode (`nvpmodel`)\n\n  * Verified `jetson_clocks`\n\n  * Monitored with `tegrastats`\n\n**Qwen2.5-Coder-7B-Instruct Q4_K_M**\n\n  * Loaded and ran normally\n\n  * ~11.2-11.3 tokens/sec generation\n\n  * Performance appears consistent with reported Orin Nano results\n\n**Tests performed**\n\n  * `-ngl 60`, `80`, `99`: no meaningful difference\n\n  * Context `1024` vs `2048`: minimal difference\n\n  * Flash Attention: no meaningful difference\n\n  * Batch/microbatch tuning: no meaningful difference\n\n  * KV cache `q8_0`: significantly slower\n\n  * KV cache `q4_0`: degraded output quality\n\n**Memory observations**\n\n  * Qwen 7B used ~5.1 GB RAM under load\n\n  * Modest swap usage\n\n  * System remained stable\n\n**Granite/Mellum test results**\n\nUsing `ibm-granite/granite-4.0-h-small-GGUF:Q4_K_M`:\n\n  * `-ngl 99`: CUDA OOM, attempted ~18.6 GiB allocation\n\n  * `-ngl 20`: CUDA OOM, attempted ~9.0 GiB allocation\n\n  * `-ngl 10`: model loaded successfully and generated output\n\nHowever, at `-ngl 10` performance was extremely slow. Generation started eventually, but first-token latency was very long, memory usage was near system limits, swap was active, and even interrupting the process took noticeable time.",
  "title": "Mellum2-12B-A2.5B-Instruct Q4_K_M on Jetson Orin Nano 8GB"
}