Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreicq2wekyruoqrgic5xamctqhimnst2xsgtbhoip5bztwnlgnumfvy",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mnefjpblfbj2"
  },
  "path": "/t/mellum2-12b-a2-5b-instruct-q4-k-m-on-jetson-orin-nano-8gb/176480#post_2",
  "publishedAt": "2026-06-03T04:55:43.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Jetson AI Lab — Models",
    "Jetson AI Lab — Gemma 4 E2B",
    "NVIDIA Developer Forums — AI Models That Run on Jetson Orin Nano Super 8GB",
    "NVIDIA Developer Forums — Gemma4 E4B on Jetson Orin Nano CUDA OOM thread",
    "Jetson AI Lab — RAM Optimization",
    "jetson-containers setup guide",
    "llama.cpp server docs",
    "llama.cpp build docs",
    "Hugging Face Hub GGUF docs",
    "Hugging Face Hub GGUF with llama.cpp",
    "JetBrains/Mellum2-12B-A2.5B-Instruct model card",
    "Mellum2 Technical Report",
    "AI Models That Run on Jetson Orin Nano Super 8GB — A Practical Guide",
    "Gemma4 E4B on Jetson Orin Nano fails due to CUDA out of memory issue",
    "NVIDIA forum practical guide",
    "Gemma4 E4B CUDA OOM thread",
    "Gemma4 E4B CUDA OOM thread showing Orin compute capability 8.7",
    "Jetson AI Lab — Gemma 4 E4B",
    "NVIDIA Developer Forums — Gemma4 E4B CUDA OOM on Orin Nano",
    "NVIDIA Technical Blog — Bringing AI Closer to the Edge and On-Device with Gemma 4",
    "jetson-containers GitHub repo",
    "NVIDIA AI IoT packages",
    "llama.cpp GitHub",
    "Hugging Face GGUF docs",
    "Hugging Face GGUF with llama.cpp",
    "JetBrains/Mellum2-12B-A2.5B-Instruct",
    "JetBrains Mellum2 collection",
    "Gemma 4 E2B on Jetson AI Lab",
    "Gemma 4 E4B on Jetson AI Lab",
    "Qwen models on Hugging Face",
    "Liquid AI models on Hugging Face",
    "StarCoder2 docs in Transformers"
  ],
  "textContent": "Hmm… Even outside Jetson, if the practical memory budget is only around 8 GB, I think using LLMs much beyond the 7B–8B class is already pretty rough…\n\n* * *\n\nBelow is my current read of the situation for **Jetson Orin Nano 8GB + GGUF + llama.cpp** , especially for trying to run `Mellum2-12B-A2.5B-Instruct-Q4_K_M.gguf`.\n\n## Short version\n\nI would not treat this as a “bad GGUF file” first.\n\nI would treat it as a **memory-budget problem** plus a **Jetson runtime-contract problem** :\n\n  * Orin Nano 8GB has **8GB shared LPDDR5 memory** , not “8GB VRAM + lots of normal system RAM”.\n  * MoE “active parameters” help compute cost, but the runtime still has to deal with the **full quantized weight file** , KV cache, CUDA buffers, OS memory, file cache, and server overhead.\n  * `Mellum2-12B-A2.5B-Instruct` is a **12B total / 2.5B active MoE** model, so “A2.5B” does **not** make it equivalent to a 2.5B dense model for memory residency.\n  * For this board, I would generally start from **2B–4B models** , then maybe test **8B low-bit / edge-MoE** models only after the Jetson stack and llama.cpp settings are known-good.\n  * I would keep Mellum2 on this board in the “interesting experiment” bucket, not the “practical recommendation” bucket.\n\n\n\nUseful official/semi-official starting points:\n\n  * Jetson AI Lab — Models\n  * Jetson AI Lab — Gemma 4 E2B\n  * NVIDIA Developer Forums — AI Models That Run on Jetson Orin Nano Super 8GB\n  * NVIDIA Developer Forums — Gemma4 E4B on Jetson Orin Nano CUDA OOM thread\n  * Jetson AI Lab — RAM Optimization\n  * jetson-containers setup guide\n  * llama.cpp server docs\n  * llama.cpp build docs\n  * Hugging Face Hub GGUF docs\n  * Hugging Face Hub GGUF with llama.cpp\n\n\n\n* * *\n\n## Why this is hard on Orin Nano 8GB\n\nThe key issue is not just parameter count. It is the **whole runtime memory picture**.\n\nComponent | Why it matters on Orin Nano 8GB\n---|---\nGGUF model weights | The quantized file still has to be mapped/loaded and used by the runtime. An 8GB-class GGUF is already too close to the total memory pool.\nKV cache | Grows with context length, number of layers, KV heads, cache dtype, and parallel slots. Long context is expensive even if the weights fit.\nCUDA buffers | GPU offload needs extra temporary and persistent CUDA allocations.\nOS + desktop + services | They consume memory from the same 8GB pool.\nFile cache / mmap behavior | Can help or hurt depending on pressure; it does not create more physical RAM.\nSwap | Can prevent crashes, but if actual inference is paging heavily, performance can become unusable.\nCPU offload | On a desktop with 32GB+ RAM, this can be a real escape hatch. On Orin Nano 8GB, CPU and GPU share the same small memory pool.\n\nThis is why a model that is “interesting on a 32GB RAM PC” can still be a bad fit for Orin Nano 8GB.\n\n* * *\n\n## Mellum2-specific issue\n\nThe model itself is interesting. It is not a toy model.\n\n`JetBrains/Mellum2-12B-A2.5B-Instruct` is described as a coding/software-engineering model with:\n\n  * 12B total parameters\n  * 2.5B active parameters per token\n  * MoE with 64 experts and 8 active experts\n  * 131,072-token context\n  * GQA with 32 Q heads and 4 KV heads\n  * software engineering, code generation/editing, tool use, function calling, and agentic workflows as core target use cases\n\n\n\nSources:\n\n  * JetBrains/Mellum2-12B-A2.5B-Instruct model card\n  * Mellum2 Technical Report\n\n\n\nBut for Orin Nano 8GB, the important part is:\n\n> **12B total weights still matter. “2.5B active” helps per-token compute, but it does not magically make the full model fit like a 2.5B dense model.**\n\nThat is also why larger Qwen MoE models can be interesting on a **32GB RAM CPU box** , but not necessarily on Orin Nano 8GB. On a PC, llama.cpp can sometimes make good use of system RAM. On Orin Nano 8GB, the “system RAM” is still the same tiny shared pool.\n\n* * *\n\n## Practical interpretation of NVIDIA/Jetson guidance\n\nNVIDIA’s own forum practical guide says that, using llama.cpp with Q4_K GGUF, Orin Nano 8GB can fit approximately:\n\nClass | Approximate upper range from the guide\n---|---\nLLMs | up to around 10B parameters\nVLMs | up to around 4B parameters\n\nSource: AI Models That Run on Jetson Orin Nano Super 8GB — A Practical Guide\n\nI would read that as an **upper-bound / best-case sizing guide** , not as “10B will be comfortable”.\n\nA more conservative operational reading is:\n\nGGUF size / model class | My expectation on Orin Nano 8GB\n---|---\n<= 2GB | Usually comfortable if the runtime is configured correctly.\n2GB–4GB | Best practical target zone.\n4GB–5.5GB | Possible, but context/batch/KV/cache/offload settings matter a lot.\n5.5GB–7GB | Experimental; expect tuning and instability.\n7GB+ | Usually not a good practical recommendation on this board.\n8GB-class GGUF | The model file itself is too close to the total memory pool. Expect CUDA allocation failures, very short context, or unusable CPU fallback.\n\nSo, for this board, I would not start with a 12B MoE Q4 model. I would start with **2B–4B** and then test **8B low-bit** only if the baseline is stable.\n\n* * *\n\n## First thing to check: JetPack / L4T version\n\nThere is a very relevant recent NVIDIA forum thread where `Gemma 4 E4B Q4_K_M` failed with CUDA OOM on Orin Nano, then NVIDIA identified a known memory issue in `r36.4.7` that was fixed in `r36.5`.\n\nSource:\n\n  * Gemma4 E4B on Jetson Orin Nano fails due to CUDA out of memory issue\n\n\n\nSo before changing models, I would check the Jetson software stack:\n\n\n    cat /etc/nv_tegra_release\n    dpkg-query --show nvidia-l4t-core\n\n\nIf this shows an affected JetPack/L4T combination, update the runtime first. Otherwise model-level debugging can be misleading.\n\n* * *\n\n## Recommended baseline: use the Jetson-oriented llama.cpp container first\n\nBefore doing custom builds, I would first try the NVIDIA AI IoT / Jetson AI Lab llama.cpp container.\n\nThe NVIDIA forum guide points to this container tag:\n\n\n    ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-orin\n\n\nSource:\n\n  * NVIDIA forum practical guide\n\n\n\nJetson AI Lab’s Gemma 4 E2B page also says the model is configured to run on Jetson with `vLLM` and `llama.cpp`, and describes E2B as the edge-first low-memory member of the Gemma 4 family:\n\n  * Jetson AI Lab — Gemma 4 E2B\n\n\n\nA conservative smoke test would be something like:\n\n\n    sudo docker run -it --rm --pull always \\\n      --runtime=nvidia \\\n      --network host \\\n      -v $HOME/.cache/huggingface:/root/.cache/huggingface \\\n      ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-orin \\\n      llama-server \\\n        -hf unsloth/gemma-4-E2B-it-GGUF:Q4_K_S \\\n        -c 1024 \\\n        -ngl auto \\\n        -fa on \\\n        -ctk q8_0 \\\n        -ctv q8_0 \\\n        -b 64 \\\n        -ub 32 \\\n        --host 0.0.0.0 \\\n        --port 8080\n\n\nThis is not because Gemma 4 E2B is necessarily the best coding model. It is because it is a good **known-good Jetson baseline**.\n\nIf this fails, I would suspect the Jetson stack, container/runtime, CUDA setup, power mode, memory pressure, or JetPack/L4T version before blaming Mellum2.\n\n* * *\n\n## Runtime tuning checklist\n\nThese are the knobs I would try, in order.\n\n### 1. Reduce context length aggressively\n\nDo not start with 8K, 32K, or 128K context on Orin Nano 8GB.\n\nStart with:\n\n\n    -c 512\n\n\nThen only increase if stable:\n\n\n    -c 1024\n    -c 2048\n\n\nFor a very marginal model:\n\n\n    -c 256\n\n\nIn llama.cpp, `--ctx-size` directly affects KV cache memory. Long context is not free.\n\nReference:\n\n  * llama.cpp server docs\n\n\n\n### 2. Reduce batch and micro-batch\n\nFor memory-constrained Jetson runs, I would not use large defaults.\n\nTry:\n\n\n    -b 64 -ub 32\n\n\nIf it still fails:\n\n\n    -b 32 -ub 16\n\n\nThis may slow prompt processing, but it can avoid allocation failures.\n\nReference:\n\n  * llama.cpp server docs\n\n\n\n### 3. Quantize the KV cache\n\nTry safer KV cache quantization first:\n\n\n    -ctk q8_0 -ctv q8_0\n\n\nIf the model is still too tight:\n\n\n    -ctk q4_0 -ctv q4_0\n\n\nI would not start with `q4_0` unless memory is really tight. It can be less robust, but it is worth trying for a last-resort fit test.\n\nReference:\n\n  * llama.cpp server docs\n\n\n\n### 4. Use `-fit` / fitting mode when available\n\nRecent llama.cpp builds have model-fitting behavior in the server path. The Gemma 4 E4B NVIDIA forum log explicitly shows:\n\n\n    common_init_result: fitting params to device memory\n    llama_params_fit_impl: projected to use 5533 MiB of device memory vs. 6387 MiB of free device memory\n\n\nSource:\n\n  * Gemma4 E4B CUDA OOM thread\n\n\n\nSo I would let llama.cpp fit parameters when using the Jetson container, unless debugging a suspected fitting bug.\n\n### 5. Do not blindly maximize GPU layers\n\n`-ngl 999` or `--n-gpu-layers 99` can be fine when memory is enough. On Orin Nano 8GB it can also push the system over the edge.\n\nTry:\n\n\n    -ngl auto\n\n\nor step manually:\n\n\n    -ngl 8\n    -ngl 16\n    -ngl 24\n\n\nIf CPU-only works but GPU mode fails, the model may be too tight for the CUDA path even if the weights can be mapped.\n\n### 6. Try MoE CPU expert offload only as an experiment\n\nllama.cpp has MoE-related CPU offload options such as `--cpu-moe` / `--n-cpu-moe` in recent builds.\n\nThis can help on systems with real CPU RAM headroom, for example:\n\n  * desktop PC with 32GB/64GB RAM\n  * small GPU with large system RAM\n\n\n\nOn Orin Nano 8GB, the CPU and GPU share the same limited memory pool, so this is not a true escape hatch. Still, for MoE models, it may be worth testing:\n\n\n    -ncmoe 20\n\n\nBut I would not expect it to rescue an 8GB-class Mellum2 GGUF on an 8GB Jetson.\n\nReference:\n\n  * llama.cpp server docs\n\n\n\n### 7. Avoid `--mlock` on low-memory Jetson\n\n`--mlock` can be useful on some systems, but on an 8GB Jetson it can make memory pressure worse by preventing paging.\n\nIn this case I would avoid:\n\n\n    --mlock\n\n\nReference:\n\n  * llama.cpp server docs\n\n\n\n### 8. Try `--no-mmap` only as a late diagnostic\n\n`mmap` behavior can interact with page cache and memory pressure. I would keep the default first, then try:\n\n\n    --no-mmap\n\n\nonly if debugging load behavior.\n\nReference:\n\n  * llama.cpp server docs\n\n\n\n* * *\n\n## System-level memory tuning\n\nJetson AI Lab has a specific RAM optimization guide:\n\n  * Jetson AI Lab — RAM Optimization\n\n\n\nThe `jetson-containers` setup guide also has practical advice on swap and disabling the desktop GUI:\n\n  * jetson-containers setup guide\n\n\n\n### Disable the desktop GUI\n\nIf using the desktop environment, stopping it can free a meaningful amount of memory.\n\nTemporary:\n\n\n    sudo init 3\n\n\nRestore:\n\n\n    sudo init 5\n\n\nFor Orin Nano 8GB, even a few hundred MB can matter.\n\n### Add NVMe swap\n\nSwap is not a performance solution, but it can prevent immediate OOM kills and help identify whether the failure is a hard fit problem.\n\nExample:\n\n\n    sudo systemctl disable nvzramconfig\n    sudo fallocate -l 16G /mnt/16GB.swap\n    sudo mkswap /mnt/16GB.swap\n    sudo swapon /mnt/16GB.swap\n\n\nPersist in `/etc/fstab`:\n\n\n    /mnt/16GB.swap none swap sw 0 0\n\n\nPrefer NVMe over SD card for swap if available.\n\n### Clear caches before a load test\n\nFor repeatable testing:\n\n\n    sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'\n\n\nThis does not fix the model, but it makes memory-pressure tests cleaner.\n\n### Use max power / clocks after the model actually fits\n\nPower mode will not make an oversized model fit, but it matters once the model runs.\n\nCheck and set power mode:\n\n\n    sudo nvpmodel -q\n    sudo nvpmodel -m <mode>\n    sudo jetson_clocks\n\n\nThe exact mode depends on the installed JetPack / board configuration.\n\n* * *\n\n## If building llama.cpp yourself\n\nIf not using the Jetson container, build with CUDA explicitly.\n\n\n    cmake -B build \\\n      -DGGML_CUDA=ON \\\n      -DCMAKE_CUDA_ARCHITECTURES=87\n\n    cmake --build build --config Release -j\n\n\nOrin is compute capability 8.7 / SM87, and NVIDIA forum logs for Orin show `ARCHS = 870`.\n\nReferences:\n\n  * llama.cpp build docs\n  * Gemma4 E4B CUDA OOM thread showing Orin compute capability 8.7\n\n\n\nIf experimenting with memory-constrained CUDA behavior, these build options may be worth knowing about:\n\n\n    -DGGML_CUDA_FORCE_MMQ=ON\n    -DGGML_CUDA_FA_ALL_QUANTS=ON\n\n\nBut I would first test the NVIDIA container before going deep into custom builds.\n\n* * *\n\n## A realistic Mellum2 last-try profile\n\nIf you still want to try Mellum2 just to see whether it can load, I would use an extreme low-memory profile.\n\nThis is a **load experiment** , not a practical recommendation:\n\n\n    GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 \\\n    llama-server \\\n      -m ./Mellum2-12B-A2.5B-Instruct-Q4_K_M.gguf \\\n      -c 256 \\\n      -ngl auto \\\n      -fa on \\\n      -ctk q4_0 \\\n      -ctv q4_0 \\\n      -b 32 \\\n      -ub 16 \\\n      --host 0.0.0.0 \\\n      --port 8080\n\n\nIf your llama.cpp build supports MoE CPU expert offload:\n\n\n    GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 \\\n    llama-server \\\n      -m ./Mellum2-12B-A2.5B-Instruct-Q4_K_M.gguf \\\n      -c 256 \\\n      -ngl auto \\\n      -fa on \\\n      -ctk q4_0 \\\n      -ctv q4_0 \\\n      -b 32 \\\n      -ub 16 \\\n      -ncmoe 20 \\\n      --host 0.0.0.0 \\\n      --port 8080\n\n\nAgain, I would not expect this to become a good daily-use setup. If it loads but runs very slowly, that is still useful information: it confirms that the bottleneck is not simply “unsupported model”, but practical memory and bandwidth limits.\n\n* * *\n\n## Better model targets for this board\n\nFor Orin Nano 8GB, I would pick models by **actual GGUF size and runtime stability** , not leaderboard score alone.\n\n### Practical shortlist\n\nModel family | Why it is more realistic than Mellum2 Q4 | Suggested quant range\n---|---|---\nGemma 4 E2B | Explicit Jetson AI Lab support; edge-first low-memory target. | Q4_K_S / Q4_K_M\nGemma 4 E4B | Supported on Jetson, but already near the upper edge on Nano 8GB. | Q3 / Q4, short context\nQwen 4B-class models | Good instruction/coding/tool-use balance if GGUF support is current. | Q4 / IQ4 / Q3\nQwen2.5-Coder-3B | Older but still a useful coding-specific baseline. | Q4 / Q5\nStarCoder2-3B | Better for FIM/completion-style coding than chat-agent use. | Q4 / Q5\nLFM2 / LFM2.5 8B-A1B | Edge-oriented MoE; interesting if Q3/IQ4 GGUF fits. | Q3 / IQ4\nGranite Code 3B | Practical code model, enterprise-friendly posture. | Q4 / Q5\nOpenCoder 1.5B/8B | Code-specialized candidate; 8B needs low-bit. | 1.5B Q5/Q8, 8B Q3\n\n### Models I would not recommend first on Orin Nano 8GB\n\nModel type | Why I would avoid it first\n---|---\nMellum2 Q4_K_M | 12B total MoE; Q4 file is too close to / above practical memory budget.\n12B+ dense models | Even if they load, context and speed will likely be poor.\n30B-A3B / 35B-A3B MoE | Interesting on 32GB RAM PCs, not on 8GB shared-memory Jetson.\nQwen3-Coder-Next 80B-A3B | Very interesting model, wrong memory class for this board.\nDevstral-style 20B+ coding agents | Good benchmark story, wrong memory budget for Orin Nano 8GB.\nLong-context runs | Context length will often fail before “model intelligence” matters.\n\n* * *\n\n## Suggested debugging sequence\n\nI would debug in this order.\n\nStep | Goal | Command / action\n---|---|---\n1 | Confirm JetPack/L4T | `cat /etc/nv_tegra_release`\n2 | Confirm available memory | `free -h`, `tegrastats`\n3 | Stop desktop GUI | `sudo init 3`\n4 | Add NVMe swap | 16GB swap on NVMe if available\n5 | Test known-small GGUF | Gemma 4 E2B or a 2B–3B GGUF\n6 | Test 4B-class model | Qwen 4B / Gemma E4B / coder 3B\n7 | Tune ctx/KV/batch | `-c 512`, `-ctk q8_0`, `-b 64`, `-ub 32`\n8 | Try 8B low-bit | only after baseline is stable\n9 | Try Mellum2 | only as a load experiment\n10 | Decide | if small models work but Mellum2 fails, the conclusion is memory budget, not setup failure\n\n* * *\n\n## Example “known-good first” profile\n\n\n    sudo docker run -it --rm --pull always \\\n      --runtime=nvidia \\\n      --network host \\\n      -v $HOME/.cache/huggingface:/root/.cache/huggingface \\\n      ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-orin \\\n      llama-server \\\n        -hf unsloth/gemma-4-E2B-it-GGUF:Q4_K_S \\\n        -c 1024 \\\n        -ngl auto \\\n        -fa on \\\n        -ctk q8_0 \\\n        -ctv q8_0 \\\n        -b 64 \\\n        -ub 32 \\\n        --host 0.0.0.0 \\\n        --port 8080\n\n\nIf this works, move to a 3B/4B coding model.\n\nIf this fails, Mellum2 is not the right next test. Fix the Jetson runtime first.\n\n* * *\n\n## Example “4B-class” profile\n\n\n    sudo docker run -it --rm --pull always \\\n      --runtime=nvidia \\\n      --network host \\\n      -v $HOME/.cache/huggingface:/root/.cache/huggingface \\\n      ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-orin \\\n      llama-server \\\n        -hf <repo>:<quant> \\\n        -c 512 \\\n        -ngl auto \\\n        -fa on \\\n        -ctk q8_0 \\\n        -ctv q8_0 \\\n        -b 64 \\\n        -ub 32 \\\n        --host 0.0.0.0 \\\n        --port 8080\n\n\nIf it fails:\n\n\n    -c 256\n    -ctk q4_0\n    -ctv q4_0\n    -b 32\n    -ub 16\n\n\nIf it only works with `-c 256`, it may technically run but not be useful for coding assistance.\n\n* * *\n\n## How I would summarize the recommendation\n\nFor Orin Nano 8GB, I would frame it this way:\n\n> Mellum2 is an interesting coding MoE, but the Orin Nano 8GB memory budget is probably the wrong target for the Q4_K_M GGUF. The board has 8GB shared memory, so CPU offload is not the same escape hatch that it is on a 32GB RAM PC. I would first validate the Jetson stack with NVIDIA’s Jetson-oriented llama.cpp container and a known-small model such as Gemma 4 E2B. Then I would test 3B–4B coding/instruct models, or possibly 8B-A1B / 8B low-bit models. I would only try Mellum2 with very short context, quantized KV cache, tiny batch sizes, and possibly unified-memory fallback as a load experiment.\n\n* * *\n\n## Resource list\n\n### Jetson / NVIDIA\n\n  * Jetson AI Lab — Models\n  * Jetson AI Lab — Gemma 4 E2B\n  * Jetson AI Lab — Gemma 4 E4B\n  * Jetson AI Lab — RAM Optimization\n  * NVIDIA Developer Forums — AI Models That Run on Jetson Orin Nano Super 8GB\n  * NVIDIA Developer Forums — Gemma4 E4B CUDA OOM on Orin Nano\n  * NVIDIA Technical Blog — Bringing AI Closer to the Edge and On-Device with Gemma 4\n\n\n\n### Containers / setup\n\n  * jetson-containers GitHub repo\n  * jetson-containers setup guide\n  * NVIDIA AI IoT packages\n\n\n\n### llama.cpp / GGUF\n\n  * llama.cpp GitHub\n  * llama.cpp server docs\n  * llama.cpp build docs\n  * Hugging Face GGUF docs\n  * Hugging Face GGUF with llama.cpp\n\n\n\n### Mellum2\n\n  * JetBrains/Mellum2-12B-A2.5B-Instruct\n  * Mellum2 Technical Report\n  * JetBrains Mellum2 collection\n\n\n\n### Model alternatives to check\n\n  * Gemma 4 E2B on Jetson AI Lab\n  * Gemma 4 E4B on Jetson AI Lab\n  * Qwen models on Hugging Face\n  * Liquid AI models on Hugging Face\n  * StarCoder2 docs in Transformers\n\n\n\n* * *\n\n## Final practical take\n\nI would not spend too much time trying to force `Mellum2-12B-A2.5B-Instruct-Q4_K_M` onto an Orin Nano 8GB.\n\nI would do this instead:\n\n  1. Make sure JetPack/L4T is not on a known-problem release.\n  2. Use the Jetson-oriented llama.cpp container.\n  3. Disable GUI and add NVMe swap if possible.\n  4. Validate with Gemma 4 E2B or another 2B–3B GGUF.\n  5. Try 3B–4B coding models.\n  6. Try 8B low-bit / 8B-A1B edge-MoE models only after that.\n  7. Treat Mellum2 as an experiment, not the practical target.\n\n\n\nFor this board, the useful question is probably not:\n\n> “Can I run a 12B MoE somehow?”\n\nIt is more:\n\n> “Which 2B–4B or low-bit 8B model gives the best coding usefulness per GB of actual Jetson memory?”",
  "title": "Mellum2-12B-A2.5B-Instruct Q4_K_M on Jetson Orin Nano 8GB"
}