Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreigjltxy2hkwe5hhnrcmxvw5z27wvjvfio5gpce4oajpehtkn4vlla",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mgrvfochla72"
  },
  "path": "/t/running-8b-llama-on-jetson-orin-nano-using-only-2-5gb-of-gpu-memory/174180#post_1",
  "publishedAt": "2026-03-11T08:41:21.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "https://youtu.be/yVZSksaqf08",
    "https://enerzai.com/contact"
  ],
  "textContent": "Hi, we would like to share our project on deploying **8B Llama on Jetson Orin Nano**, using only 2.5GB of GPU shared memory (peak), with a comparison against a llama.cpp INT4 baseline.\n\n### **Baseline (llama.cpp INT4)**\n\nIn our baseline setup, Llama-3.1-8B INT4 reached:\n\n  * 5.2GB GPU shared memory (peak)\n\n  * 6.8GB total RAM (peak)\n\nOn Jetson Orin Nano, this uses most of the available memory budget and leaves limited headroom for other edge workloads.\n\n### **Our result**\n\nUsing our own extreme low-bit (1.58-bit) deployment pipeline, we ran an 8B-class Llama model with:\n\n  * 2.5GB GPU shared memory (peak)\n\n  * 4.1GB total RAM (peak)\n\nThis makes the deployment more practical on Orin Nano when the LLM needs to coexist with other components on the device.\n\n### **Main Techniques**\n\n  * 1.58-bit quantization (mixed-precision QAT)\n\n  * Kernel-level optimizations (custom kernels for embedding access and layer fusion)\n\n### **Demo Video**\n\n  * Link: https://youtu.be/yVZSksaqf08\n\n**Notes**\n\n  * Instruction tuning of our 1.58-bit Llama model has been limited so far, and we expect further improvements with additional tuning.\n\n### **Why this may be useful**\n\nFor edge deployments, memory headroom matters because the LLM often needs to run alongside other components such as:\n\n  * Other AI models, including STT and TTS\n\n  * System workloads, including perception, logging, control, and networking\n\nReducing the model footprint makes on-device LLM deployment more realistic, even on Nano-class edge SoCs.\n\n**We are sharing more details at GTC 2026!**\n\nIf you are blocked by memory footprint or latency while deploying Llama or other LLMs on Jetson or other SoC platforms, please leave us a message.\n\nContact: https://enerzai.com/contact",
  "title": "Running 8B Llama on Jetson Orin Nano (using only 2.5GB of GPU memory)"
}