Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreif6b5gicbcj7sfltj4tvnt4v2lcccfavkgk52xnvs5zhqvlxao4vm",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mldr67i4ut62"
  },
  "path": "/t/they-said-unquantized-local-ai-was-impossible-on-budget-phones-we-got-a-2-3gb-fp32-model-running-locally-on-a-120-galaxy-a25-cpu-no-gpu-no-npu-uses-less-ram-than-chrome/175739#post_5",
  "publishedAt": "2026-05-08T11:13:38.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "That is a solid experiment! It actually brings back memories—about a year ago, I was running a similar setup with a full LLM deployed natively on the handset. I spent about 1 months optimizing it, and like you mentioned, it’s impressive to see it functioning in total isolation (Airplane mode).\n\nHowever, after pushing that architecture to its limits—hitting 80-90% CPU sustained loads and dealing with the inevitable thermal throttling—I realized that for high-stakes OSINT and autonomous agent loops, I needed a different paradigm.\n\nThat’s why I eventually moved toward the **Local-First Orchestration** I’m using now. By offloading the heavy lifting to a dedicated Nitro-node and keeping the mobile interface at a lean 6MB, I get the best of both worlds: the raw power of an 8B model and sub-second latency, without turning the phone into a pocket heater.\n\nAlways good to see others experimenting with true edge independence, though. Keep pushing the boundaries!",
  "title": "They said unquantized local AI was impossible on budget phones. We got a 2.3GB FP32 model running locally on a €120 Galaxy A25 CPU. No GPU, no NPU, uses less RAM than Chrome"
}