Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreibz6x7xd6g3y7p4ttba4ln5mmjd2s7ql62ymlvlupesdwmc7xtgfy",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mng3c4flgsy2"
  },
  "path": "/t/mellum2-12b-a2-5b-instruct-q4-k-m-on-jetson-orin-nano-8gb/176480#post_4",
  "publishedAt": "2026-06-03T21:22:30.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Jetson Orin Nano Super Developer Kit",
    "Qwen/Qwen2.5-Coder-7B-Instruct-GGUF — Q4_K_M file",
    "Qwen/Qwen2.5-7B-Instruct",
    "Grouped-Query Attention paper",
    "llama.cpp server docs: --cache-type-k, --cache-type-v",
    "llama-server man page showing KV cache type options",
    "Qwen2.5-Coder Technical Report",
    "Qwen/Qwen2.5-Coder-7B-Instruct",
    "IBM announcement: Granite 4.0 models",
    "Unsloth Granite 4.0 guide",
    "IBM Granite model docs",
    "ibm-granite/granite-4.0-h-tiny-GGUF",
    "unsloth/granite-4.0-h-tiny-GGUF",
    "ibm-granite/granite-4.0-h-small-GGUF",
    "unsloth/granite-4.0-h-small-GGUF",
    "IBM GGUF conversion repo",
    "JetBrains/Mellum2-12B-A2.5B-Instruct",
    "Mellum2 Technical Report",
    "HF paper page: Mellum2",
    "Gemma4 E4B on Jetson Orin Nano fails due to CUDA out of memory issue"
  ],
  "textContent": "Oh. That is an important data point. I was estimating the headroom quite conservatively because Jetson uses shared CPU/GPU RAM, but it looks like there may be room for larger models than I expected:\n\n* * *\n\nThanks for reporting the actual measurements. I would update my earlier take based on your `Qwen2.5-Coder-7B-Instruct Q4_K_M` result.\n\nThe important correction is:\n\n> **7B Q4_K_M is clearly not out of range for Jetson Orin Nano 8GB.**\n>  It can be practical when the model architecture, GGUF size, context length, JetPack/L4T version, and llama.cpp backend are friendly.\n\nYour Qwen test is especially useful because it is not just a “loaded once” result. It looks like a stable practical baseline:\n\nObservation | What I think it means\n---|---\n`Qwen2.5-Coder-7B-Instruct Q4_K_M` loaded and ran normally | 7B Q4 is a real target on this board, not just a theoretical fit.\n~11.2–11.3 tok/s generation | Very usable for local coding assistance on a small edge board.\n~5.1GB RAM under load | Consistent with a ~4.7GB GGUF plus runtime/KV/buffer overhead.\nModest swap usage, stable system | This does not sound like “barely alive by swapping”; it sounds usable.\n`-ngl 60`, `80`, `99` made little difference | Once the important path is offloaded, decode may be memory-bandwidth/runtime limited rather than layer-count limited.\nContext `1024` vs `2048` made little difference | Qwen2.5 7B has GQA and a relatively small KV cache at ordinary chat lengths.\nFlash Attention made little difference | At short/medium context, attention may not be the dominant cost.\nBatch/microbatch tuning made little difference | Mostly affects prompt processing, not necessarily steady-state decode.\nKV cache `q8_0` was significantly slower | KV quantization is not automatically faster; it can hit slower kernels or dequant overhead.\nKV cache `q4_0` degraded quality | Useful as an emergency memory lever, not necessarily a good default when the model already fits.\n\nSo I 
would now treat `Qwen2.5-Coder-7B-Instruct Q4_K_M` as the **known-good coding baseline** for this device.\n\n## Revised practical tiering\n\nMy previous framing was too conservative if read as “Jetson Orin Nano 8GB can only really do 2B–4B”. Based on your Qwen result, I would revise the tiers like this:\n\nTier | Orin Nano 8GB interpretation\n---|---\n**2B–4B Q4/Q5** | Safe baseline. Good first test for JetPack/container/llama.cpp sanity.\n**7B Q4_K_M** | Practical. Your Qwen2.5-Coder result shows this can be a real target.\n**8B Q4 / 8B low-bit** | Worth testing carefully. Architecture and GGUF size matter a lot.\n**9B Q4** | Possible but more aggressive; likely sensitive to context/runtime/settings.\n**12B MoE Q4** | Mostly experimental on this board. Active parameters can be misleading.\n**30B+ total MoE / 32B total models** | Not a practical 8GB shared-memory Jetson target unless the goal is boundary testing.\n\nFor a normal 8GB discrete GPU, 7B–8B Q4_K_M models are already a common practical target. The Jetson Orin Nano 8GB is a special case because its **8GB LPDDR5 is shared by CPU and GPU**, so it is not the same as “8GB VRAM plus separate system RAM”.\n\nNVIDIA lists the Orin Nano Super Developer Kit as **8GB 128-bit LPDDR5, 102GB/s**, with 1024 CUDA cores, 32 Tensor Cores, and 7W–25W power modes: Jetson Orin Nano Super Developer Kit.\n\nShared-memory constraints are therefore still real. But your result shows that the practical ceiling is not as low as I first implied.\n\n* * *\n\n## Why Qwen2.5-Coder-7B had more headroom than expected\n\nI think the result makes sense for several reasons.\n\n### 1. 
The actual Q4_K_M GGUF is only about 4.68GB\n\nThe `Qwen2.5-Coder-7B-Instruct Q4_K_M` file is about **4.68GB**:\n\n  * Qwen/Qwen2.5-Coder-7B-Instruct-GGUF — Q4_K_M file\n\n\n\nThat leaves some room for:\n\n  * llama.cpp runtime buffers\n  * CUDA allocations\n  * KV cache\n  * OS/services\n  * mmap/page-cache behavior\n  * server overhead\n\n\n\nA rough mental model is:\n\n\n    Qwen2.5-Coder-7B Q4_K_M weights:  ~4.68GB\n    KV cache at normal chat lengths:   relatively small\n    runtime/CUDA/server overhead:      additional memory\n    observed under load:               ~5.1GB RAM\n\n\nSo the ~5.1GB RAM observation is quite plausible. This model is nowhere near the memory class of an 8GB-class GGUF or a 19GB-class GGUF.\n\n* * *\n\n### 2. Qwen2.5 7B uses GQA, so ordinary-context KV cache is small\n\nQwen2.5 7B uses Grouped-Query Attention. Its model card lists:\n\n  * 28 layers\n  * 28 query heads\n  * 4 key/value heads\n  * GQA\n  * RoPE\n  * SwiGLU\n  * RMSNorm\n  * QKV bias\n\n\n\nSource:\n\n  * Qwen/Qwen2.5-7B-Instruct\n\n\n\nThe 4 KV heads matter. With GQA, the model stores 4 KV heads per layer instead of the 28 that full multi-head attention would need, so the KV cache is about 7× smaller.\n\nA rough f16 KV cache estimate for Qwen2.5 7B is:\n\n\n    2 tensors (K and V)\n    × 4 KV heads\n    × 128 head_dim\n    × 2 bytes per f16 value\n    × 28 layers\n    = 57,344 bytes/token\n    = 56 KiB/token\n\n\nApproximate KV cache size:\n\nContext (tokens) | Approx. f16 KV cache\n---|---\n512 | ~28 MiB\n1024 | ~56 MiB\n2048 | ~112 MiB\n4096 | ~224 MiB\n8192 | ~448 MiB\n\nThis is small compared with the 4.68GB weight file. That explains why `ctx 1024` vs `2048` did not change much.\n\nIt also explains why KV quantization was not automatically helpful. 
If KV cache is already small, quantizing it saves little absolute memory, while it may introduce slower kernels, dequant overhead, or output-quality loss.\n\nRelevant references:\n\n  * Grouped-Query Attention paper\n  * llama.cpp server docs: --cache-type-k, --cache-type-v\n  * llama-server man page showing KV cache type options\n\n\n\n* * *\n\n### 3. Qwen2.5-Coder-7B is a mature dense-transformer path\n\n`Qwen2.5-Coder-7B-Instruct` is a code-specialized dense model in the Qwen2.5-Coder family. The technical report describes the Qwen2.5-Coder series as including 0.5B, 1.5B, 3B, 7B, 14B, and 32B models, with continued pretraining on more than 5.5T tokens and evaluations across code generation, completion, reasoning, and repair.\n\nReferences:\n\n  * Qwen2.5-Coder Technical Report\n  * Qwen/Qwen2.5-Coder-7B-Instruct\n\n\n\nThis matters because Qwen2.5-Coder-7B is not just a random 7B chat model. It is a strong code-specialized model that happens to fit into a realistic GGUF size for this board.\n\nThat makes it a very good baseline.\n\n* * *\n\n## Why Granite H-Small behaved so differently\n\nYour Granite result also makes sense, but I would not compare `granite-4.0-h-small` to the Qwen 7B run as if they were adjacent model sizes.\n\nIBM’s naming is a little easy to misread here. Granite 4.0 H-Small is not a small 7B-class model. 
IBM describes Granite 4.0 H-Small as a **32B total / 9B active** hybrid MoE model, while H-Tiny is **7B total / 1B active** and H-Micro/Micro are 3B-class models.\n\nReferences:\n\n  * IBM announcement: Granite 4.0 models\n  * Unsloth Granite 4.0 guide\n  * IBM Granite model docs\n\n\n\nSo the `granite-4.0-h-small` result is consistent with expectations:\n\nModel | Practical memory class\n---|---\n`Qwen2.5-Coder-7B Q4_K_M` | ~4.68GB dense GGUF; fits with useful headroom.\n`Granite 4.0 H-Small Q4_K_M` | 32B total / 9B active hybrid model; not a 7B-class target.\n`Mellum2-12B-A2.5B Q4_K_M` | 12B total MoE; active 2.5B does not remove weight-residency cost.\n\nYour Granite observations are exactly what I would expect from a model in the wrong memory class for this board:\n\nGranite H-Small result | Interpretation\n---|---\n`-ngl 99`: CUDA OOM, attempted ~18.6GiB allocation | Full/near-full offload is impossible on 8GB shared RAM.\n`-ngl 20`: CUDA OOM, attempted ~9.0GiB allocation | Still above the practical device-memory budget.\n`-ngl 10`: loaded but extremely slow | Technically possible to start, but the system is near limits and paging/offload/latency dominate.\n\nFor Granite on Orin Nano 8GB, I would test smaller variants instead:\n\nGranite model | Why it is more relevant\n---|---\n`granite-4.0-h-micro-GGUF` | 3B hybrid model; safer Jetson memory class.\n`granite-4.0-micro-GGUF` | 3B conventional transformer option.\n`granite-4.0-h-tiny-GGUF` | 7B total / 1B active; much more comparable to your successful Qwen 7B run.\n`granite-4.0-h-small-GGUF` | 32B total / 9B active; useful boundary test, not a practical Nano 8GB target.\n\nRelevant model links:\n\n  * ibm-granite/granite-4.0-h-tiny-GGUF\n  * unsloth/granite-4.0-h-tiny-GGUF\n  * ibm-granite/granite-4.0-h-small-GGUF\n  * unsloth/granite-4.0-h-small-GGUF\n  * IBM GGUF conversion repo\n\n\n\n* * *\n\n## Why Mellum2 is still a different case from Qwen2.5-Coder-7B\n\nMellum2 remains interesting, but I would still 
not put it in the same class as Qwen2.5-Coder-7B on this board.\n\nMellum2 is a **12B total / 2.5B active** MoE model. The technical report describes 64 experts, 8 active experts, GQA, sliding-window attention, and a multi-token prediction head.\n\nReferences:\n\n  * JetBrains/Mellum2-12B-A2.5B-Instruct\n  * Mellum2 Technical Report\n  * HF paper page: Mellum2\n\n\n\nThe crucial point is:\n\n> **Active parameters reduce per-token compute, but total resident weights still matter.**\n\nSo even if Mellum2 runs with 2.5B active parameters per token, it is still a 12B-total MoE model for weight storage and runtime layout. That makes it very different from a 4.68GB dense Qwen2.5-Coder-7B GGUF.\n\nI would now phrase Mellum2 like this:\n\nQuestion | Updated answer\n---|---\nIs Mellum2 impossible? | Not necessarily. It is worth experimenting with if the goal is boundary testing.\nIs Mellum2 Q4_K_M a practical recommendation for Orin Nano 8GB? | I still doubt it.\nDoes Qwen2.5-Coder-7B success imply Mellum2 should also work? | No. The memory class and architecture are different.\nWhat would make Mellum2 more interesting? | A high-quality lower-bit GGUF, careful MoE offload behavior, and very short context tests.\n\n* * *\n\n## JetPack/L4T probably mattered too\n\nYour upgrade from JetPack/L4T 36.4.7 to 36.5.0 is also important.\n\nThere is a related NVIDIA forum thread where Gemma4 E4B failed with CUDA OOM on Orin Nano, and NVIDIA later confirmed it working on `r36.5 / JetPack 6.2.2`. 
The thread also mentions a known memory issue in `r36.4.7` that was fixed in `r36.5`.\n\nReference:\n\n  * Gemma4 E4B on Jetson Orin Nano fails due to CUDA out of memory issue\n\n\n\nSo part of the Qwen success may be that you were no longer testing on a release with a known memory problem.\n\nUseful checks for future posts:\n\n\n    cat /etc/nv_tegra_release\n    dpkg-query --show nvidia-l4t-core\n    sudo nvpmodel -q\n    tegrastats\n\n\n* * *\n\n## What I would test next\n\nGiven your result, I would not spend most of the time trying to force Mellum2 first. I would use your Qwen result as the reference point and compare other models against it.\n\n### Practical next candidates\n\nPriority | Candidate | Reason\n---|---|---\n1 | Keep `Qwen2.5-Coder-7B-Instruct Q4_K_M` | Known-good coding baseline on this exact board.\n2 | `Qwen2.5-Coder-7B` other quants | Compare Q4_K_M vs Q5_K_M or IQ4/IQ3 if available.\n3 | Qwen 7B/8B-ish newer coder/instruct GGUFs | Same broad architecture family may preserve good runtime behavior.\n4 | `Granite 4.0 H-Tiny` | 7B total / 1B active; much more relevant than H-Small.\n5 | `Granite 4.0 Micro` / `H-Micro` | 3B-class safe Granite tests.\n6 | LFM2 / LFM2.5 8B-A1B low-bit | Interesting edge-MoE class, but should be tested against the Qwen baseline.\n7 | StarCoder2-3B / OpenCoder / Granite Code 3B | Useful for code completion or smaller coding tasks.\n\n### Models I would keep as boundary tests\n\nModel | Why\n---|---\nMellum2 Q4_K_M | Interesting 12B MoE, but still likely too large for comfortable Nano 8GB use.\nGranite 4.0 H-Small | 32B total / 9B active. 
Your results already show the boundary.\nQwen 30B-A3B / 35B-A3B MoE | Interesting on 32GB RAM PCs, not this board.\nQwen3-Coder-Next 80B-A3B | Very interesting model class, wrong memory class for Orin Nano 8GB.\n\n* * *\n\n## Updated conclusion\n\nMy revised interpretation is:\n\n> **Orin Nano 8GB has enough practical headroom for well-behaved 7B Q4_K_M models.**\n>  The successful Qwen2.5-Coder-7B run shows that clearly.\n>  But that extra headroom does not automatically extend to every active-small MoE model, because total GGUF size, resident weights, KV layout, architecture, and JetPack/L4T behavior still dominate.\n\nSo I would no longer say “stay mostly under 4B” for coding models on this board.\n\nI would say:\n\n  1. **2B–4B** is the safe zone.\n  2. **7B Q4_K_M** is now proven practical by your Qwen result.\n  3. **8B low-bit / 8B Q4** is worth exploring.\n  4. **9B Q4** is possible but aggressive.\n  5. **12B MoE Q4** is still mostly experimental.\n  6. **32B/9B-active or 30B+ MoE** is not a practical Nano 8GB target.\n\n\n\nThe most useful takeaway for me is:\n\n> **Use actual GGUF size and architecture, not just parameter count or active parameter count.**\n\nThat explains all three results:\n\nModel | Result | Likely reason\n---|---|---\n`Qwen2.5-Coder-7B Q4_K_M` | Practical | ~4.68GB dense GGUF, GQA, small KV, mature runtime path.\n`Granite 4.0 H-Small Q4_K_M` | OOM or extremely slow | 32B total / 9B active hybrid model; wrong memory class.\n`Mellum2-12B-A2.5B Q4_K_M` | Still doubtful | 12B total MoE; active 2.5B does not make it behave like a 2.5B dense model in memory.",
  "title": "Mellum2-12B-A2.5B-Instruct Q4_K_M on Jetson Orin Nano 8GB"
}