{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreidi4kwe426dxlejfhzd6xbgafy2aja57pqi2yg56fsa2dycacrs4q",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mh6dpumyzha2"
  },
  "path": "/t/are-there-any-llms-that-can-run-with-decent-performance-on-hardware-comparable-to-jetson-nx/174305#post_4",
  "publishedAt": "2026-03-16T09:28:03.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "the LeRobot Discord",
    "Jetson AI Lab",
    "NVIDIA Developer",
    "GitHub",
    "Hugging Face",
    "ACL Anthology"
  ],
  "textContent": "When it comes to AI and robotics, you’d probably get a more reliable answer by asking on the LeRobot Discord.\n\nI’ve heard that on a Jetson, it’s not a good idea to allocate all of the memory to an LLM, because the CPU and GPU share a single memory pool. If that’s the case, a 7B model might be a bit too large.\n\n* * *\n\nYes. But the practical answer is **not** “find a magical 7B–9B model that suddenly becomes fast on NX.” The practical answer is to use a **smaller local model as the supervisor**, keep context short, use retrieval for manuals/logs/SOPs, and only invoke a vision model when you actually need image understanding. Current Jetson deployment guidance and current small-model research both point in that direction. The biggest constraint is usually **memory bandwidth and KV cache**, not just raw parameter count. (Jetson AI Lab)\n\n## Why your 7B–9B tests felt bad\n\nOn Jetson, a model can “fit” and still be a bad deployment. The weights load first, then the remaining memory is consumed by runtime overhead and KV cache. Longer prompts, longer outputs, and concurrent robotics processes make this worse. NVIDIA’s Jetson benchmarking guide is explicit that the remaining GPU memory after weights is pre-allocated to KV cache, so edge failures often come from **context length and runtime configuration**, not only from the model family itself. (Jetson AI Lab)\n\nThere is also a second trap: Jetson results can collapse when the fast kernel/backend path is missing, when clocks are not pinned, or when you benchmark through a slower runtime. So published phone or server numbers often do **not** transfer to Jetson as-is. (Jetson AI Lab)\n\n## First, separate Xavier NX from Orin NX\n\nThis matters a lot. If by “Jetson NX” you mean **Xavier NX**, that is a much harder target than **Orin NX**. NVIDIA says Orin NX delivers up to **5x** the performance of Xavier NX, and current Jetson pages position Orin NX as the compact platform for multiple concurrent AI pipelines. 
Current official Ollama-on-Jetson support lists **AGX Orin 64GB, AGX Orin 32GB, Orin NX 16GB, and Orin Nano 8GB**. Xavier NX is not on that current support list. (NVIDIA Developer)\n\nThat leads to the blunt hardware conclusion:\n\n  * **Xavier NX**: tiny-model territory.\n  * **Orin NX 16GB**: practical for a real local agent if you stay disciplined.\n  * **AGX Orin 32GB/64GB**: the comfortable option if you want fewer compromises. (NVIDIA Developer)\n\n## What is actually realistic by hardware tier\n\n### Xavier NX\n\nIf your final target is truly Xavier NX-class, I would **not** plan around dense 7B–9B models. I would treat it as a board for **0.8B–1.5B**, maybe **3B** only in a stripped-down text-first setup. Anecdotal reports from the llama.cpp community are consistent with this: one Xavier NX report was about **600 ms/token** (under 2 tokens per second), which is far too slow for a responsive operator assistant. (GitHub)\n\n### Orin NX 16GB\n\nThis is the first small Jetson where a local robotics agent starts to make sense. Current Jetson AI Lab model pages show that **Llama 3.2 3B**, **Gemma 3 4B**, **Qwen3.5 4B**, and **Llama 3.1 8B** all have Jetson-ready paths. Their listed memory footprints are roughly **4GB** each for Llama 3.2 3B, Gemma 3 4B, and Qwen3.5 4B, and **8GB** for Llama 3.1 8B. Jetson AI Lab also reports **52.58 output tok/s** for Llama 3.2 3B and **28.14 output tok/s** for Llama 3.1 8B on Jetson Orin with vLLM under their benchmark conditions. (Jetson AI Lab)\n\nSo on Orin NX 16GB, the useful rule is:\n\n  * **3B–4B** is the safe production zone.\n  * **8B** is possible, but with tighter margins.\n  * **9B multimodal** is possible on paper, but it is much less comfortable once the rest of the robot stack is alive. (Jetson AI Lab)\n\n### AGX Orin 32GB or 64GB\n\nThis is where you can stop fighting every constraint. 
NVIDIA’s AGX Orin pages list **32GB** and **64GB** modules, and current Jetson AI Lab pages show larger models that are explicitly positioned for AGX/Orin-class memory, including **GPT OSS 20B** (listed with a **16GB RAM** minimum and AGX Orin as the minimum board) and **Qwen3.5 35B-A3B MoE** at **20GB RAM** with about **30 output tok/s** on Jetson Orin in their benchmark. (NVIDIA Developer)\n\nFor AGX Orin, that means you can reasonably consider:\n\n  * a strong dense **8B** text model,\n  * a larger efficient **MoE** model,\n  * or a **20B** class model if you truly need more general capability. (Jetson AI Lab)\n\n## Models I would actually recommend for your case\n\nYour tasks are not pure free-form chat. They sound like:\n\n  * on-site diagnosis,\n  * operator-ready summaries,\n  * maybe log/manual interpretation,\n  * maybe image-grounded inspection.\n\nThat is a classic **small planner + retrieval + optional vision** problem.\n\n### Best text-first choices today\n\n**Llama 3.2 3B** is one of the safest starting points. Jetson AI Lab explicitly positions it as a compact edge model for resource-constrained Jetson deployments, lists **4GB RAM / 2.0GB size**, and publishes very strong Jetson Orin benchmark numbers for it. Meta’s model card also explicitly mentions dialogue, retrieval, and summarization use cases. (Jetson AI Lab)\n\n**Llama 3.1 8B** is the stronger text model to try once you move up to Orin NX 16GB or AGX Orin. Jetson AI Lab lists **8GB RAM / 4.5GB size** for the quantized Jetson build and reports **28.14 output tok/s** on Jetson Orin with vLLM. (Jetson AI Lab)\n\n**SmolLM3-3B** is a strong newer candidate outside NVIDIA’s Jetson pages. Hugging Face describes it as a **3B** model with **dual-mode reasoning**, **6 languages**, and **long context**, and positions it as a strong model at the 3B–4B scale. I would treat it as a serious test candidate for Orin NX 16GB. 
The missing piece is that I have not found Jetson-specific benchmark numbers for it yet, so this is a “worth testing” recommendation, not a “Jetson-proven” one. (Hugging Face)\n\n**Phi-4-mini-instruct** is also worth testing if your workload leans toward longer logs, procedures, or technical text. Microsoft’s model card describes it as a lightweight open model with **128K context**, and Microsoft also provides an ONNX variant aimed at optimized inference. Again, this is promising for edge deployment, but I have not seen Jetson-specific benchmark numbers from NVIDIA for it. (Hugging Face)\n\n**LiquidAI LFM2.5-1.2B-Instruct** is one of the more interesting newer ultra-compact options. Its Hugging Face card says it is designed for **on-device deployment**, with **32K context** and support for common inference stacks. If you need the lightest possible text planner on an NX-class device, this is one of the few truly modern models I would put high on the shortlist. (Hugging Face)\n\n### Best compact multimodal choices\n\nIf your agent has to inspect images, you should usually keep the VLM separate and call it only when needed.\n\n**Gemma 3 4B** is a strong compact multimodal option. Jetson AI Lab lists it at **4GB RAM / 2.5GB size**, with text-plus-image input and a **128K** context window for the 4B size. (Jetson AI Lab)\n\n**Qwen3.5 4B** is another strong compact multimodal option. Jetson AI Lab lists **4GB RAM / 2.5GB size**, AWQ 4-bit quantization, and specifically calls out multimodal instruction following, visual understanding, and agent-style workloads on Jetson. (Jetson AI Lab)\n\n**Qwen3.5 9B** is the “bigger multimodal” step up. Jetson AI Lab lists **8GB RAM / 5GB size** and tool-calling support, but in your case I would only try it on a roomy Orin NX 16GB setup or AGX Orin, because the rest of the robotics stack will eat into that headroom fast. (Jetson AI Lab)\n\n**Gemma 3 1B** is the fallback when memory is brutal. 
Jetson AI Lab lists **2GB RAM / 1.2GB size**. Google’s Gemma 3 model card describes the 1B size as text-only (image input starts at the 4B size), so it is better treated as a lightweight local text assistant on very constrained deployments than as an image-aware helper. (Jetson AI Lab)\n\n## The best architecture for your robotics agent\n\nThis is the part that matters most.\n\nI would **not** try to make one large local model do everything. I would build:\n\n  1. a **small text model** as the main planner and explainer,\n  2. a **retrieval layer** for manuals, logs, SOPs, error catalogs, and maintenance data,\n  3. an **optional VLM** that only runs for image-dependent questions,\n  4. and a deterministic robotics layer below that.\n\nIn practice, this means the LLM should answer questions like:\n\n  * “What does this alarm likely mean?”\n  * “Summarize the fault for the operator.”\n  * “Which check should happen next?”\n  * “Should I inspect the gripper, the vision path, or the fixture first?”\n\nThe LLM should **not** be the thing that continuously drives low-level control behavior. On embedded hardware, the best use of your limited compute budget is usually **reasoning over structured state and retrieved facts**, not generating long free-form text or directly controlling motion. That design also reduces latency pressure because the model can work on short structured inputs instead of a huge prompt dump. (ACL Anthology)\n\n## My practical recommendations by board\n\n### If you end up on Xavier NX-class hardware\n\nUse a **tiny planner** and accept that this is not a general LLM workstation.\n\nMy shortlist would be:\n\n  * **LiquidAI LFM2.5-1.2B-Instruct**\n  * **Gemma 3 1B**\n  * **Qwen3.5 0.8B** if you need a very small multimodal helper\n  * maybe **Llama 3.2 3B** for text-only planning if the rest of the system is very lean. 
(Hugging Face)\n\n### If you use Orin NX 16GB\n\nThis is the tier I would actually target for a fully local robotics assistant.\n\nMy shortlist would be:\n\n  * **Llama 3.2 3B** as the safest starting point,\n  * **SmolLM3-3B** as a newer text-first candidate,\n  * **Gemma 3 4B** if you want multimodal support in a compact footprint,\n  * **Qwen3.5 4B** if image understanding and tool use matter,\n  * **Phi-4-mini-instruct** if your workload is heavy on long technical text,\n  * **Llama 3.1 8B** only after the smaller models are working well. (Jetson AI Lab)\n\n### If you use AGX Orin 32GB or 64GB\n\nThen you can step up without fighting the board all the time.\n\nMy shortlist would be:\n\n  * **Llama 3.1 8B** as the strong dense baseline,\n  * **GPT OSS 20B** if you want a broader text model and have AGX Orin,\n  * **Qwen3.5 35B-A3B MoE** if you want larger-model behavior with more efficient active parameters,\n  * plus a compact VLM such as **Qwen3.5 4B** or **Gemma 3 4B** for image questions. (Jetson AI Lab)\n\n## Why I would not center the design on 7B–9B dense models\n\nBecause edge deployment is now good enough that **small models are often the better engineering choice**. The ACL 2025 study on edge deployment found that modern small language models can outperform some 7B models on general tasks, and their measurements on Jetson Orin NX 16GB also showed why decode remains a bottleneck even when GPU helps a lot in prefill. That is exactly the pattern you are seeing: the big model does not feel proportionally better once it hits embedded memory and decode limits. (ACL Anthology)\n\n## Runtime choice matters more than many people expect\n\nFor final deployment, I would favor **vLLM** over Ollama. NVIDIA’s current Jetson guidance explicitly says Ollama is an easy local entry point but gets roughly **half of peak performance** versus faster APIs like NanoLLM, and their more serious benchmarking/tutorial flow uses vLLM. 
(Jetson AI Lab)\n\nI would also keep these deployment rules:\n\n  * use quantized checkpoints that are already known to run on Jetson,\n  * keep prompts short,\n  * keep output length bounded,\n  * avoid enormous context windows unless the task truly needs them,\n  * pin clocks and use the correct Jetson-matched software stack,\n  * and verify that you are not silently on a slow path. (Jetson AI Lab)\n\n## My direct recommendation for your specific case\n\nIf the goal is a **fully local robotics agent** that does diagnosis, operator summaries, and possibly some image-grounded reasoning:\n\n  * **Do not build the first version around Xavier NX.**\n  * **Use Orin NX 16GB as the minimum serious target.**\n  * **Use AGX Orin if budget allows and you want breathing room.** (NVIDIA Developer)\n\nThen start with:\n\n  * **text brain**: **Llama 3.2 3B** or **SmolLM3-3B**\n  * **image helper**: **Qwen3.5 4B** or **Gemma 3 4B**\n  * **retrieval** over manuals/logs/SOPs\n  * **vLLM** for serving. (Jetson AI Lab)\n\nAfter that, only move up if the small stack clearly fails your real tasks. At that point:\n\n  * test **Llama 3.1 8B** on Orin NX 16GB or AGX Orin,\n  * or move to **GPT OSS 20B** / **Qwen3.5 35B-A3B** on AGX Orin. (Jetson AI Lab)\n\n## Bottom line\n\nYes. There are local models that can solve your problem.\n\nBut the winning answer is usually **not** “run a 7B–9B dense model on NX and hope.”\nThe winning answer is:\n\n  * **small local planner model**\n  * **retrieval for factual depth**\n  * **optional compact VLM**\n  * **good runtime and quantization**\n  * **Orin NX 16GB minimum, AGX Orin preferred**. (ACL Anthology)\n\n",
  "title": "Are there any LLMs that can run with decent performance on hardware comparable to Jetson NX?"
}