What is the best architecture for integrating local LLM inference and RAG on mobile devices?

Hugging Face Forums [Unofficial] March 15, 2026

Hi everyone,

I’m currently exploring a mobile AI architecture and would love to hear technical opinions from others working in this area.

The goal is to support the following in a mobile app:

  • on-device LLM inference

  • local or hybrid RAG retrieval

  • low-latency interaction

  • integration with Flutter or another cross-platform frontend

The technical directions I’m considering include:

  • Flutter / cross-platform frontend

  • llama.cpp or another on-device LLM runtime

  • vector retrieval or a lightweight local knowledge base

  • Platform Channel, FFI, or another native bridging approach

My main questions are:

  1. What is currently the most reliable architecture for local LLM + RAG on mobile?

  2. If the frontend is Flutter, would you recommend Platform Channels or FFI?

  3. What are good approaches for local knowledge retrieval on mobile devices?

  4. How do you usually balance performance, memory usage, and model size in production or prototype setups?
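To make question 3 more concrete, here is a minimal sketch of the simplest kind of local retrieval I have in mind: a brute-force cosine-similarity search over chunk embeddings held in memory. It is written in Python only for readability. The names `cosine` and `top_k` and the toy 2-dimensional vectors are placeholders of mine, and a real app would get its embeddings from an on-device embedding model, which is not shown here.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    chunks: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[Tuple[float, str]]:
    """Score every (text, embedding) chunk against the query and return the k best."""
    scored = [(cosine(query, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy embeddings for illustration only; real ones would have hundreds of dimensions.
    chunks = [
        ("battery settings", [1.0, 0.0]),
        ("camera settings", [0.0, 1.0]),
        ("general settings", [1.0, 1.0]),
    ]
    for score, text in top_k([1.0, 0.1], chunks, k=2):
        print(f"{score:.3f}  {text}")
```

A linear scan like this costs time proportional to the number of chunks times the embedding dimension, so part of what I am asking is at what corpus size it stops being good enough on a phone and an approximate index or a hybrid setup becomes necessary.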

I’d be very interested in hearing any real-world experience or recommendations related to mobile AI / edge AI systems.
