What is the best architecture for integrating local LLM inference and RAG on mobile devices?
Hi everyone,
I’m currently exploring a mobile AI architecture and would love to hear technical opinions from others working in this area.
The goal is to support the following on a mobile app:
- on-device LLM inference
- local or hybrid RAG retrieval
- low-latency interaction
- integration with Flutter or another cross-platform frontend
The technical directions I’m considering include:
- Flutter / cross-platform frontend
- llama.cpp or another on-device LLM runtime
- vector retrieval or a lightweight local knowledge base
- Platform Channels, FFI, or another native bridging approach
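To make the "lightweight local knowledge base" option concrete, here is roughly what I have in mind, as a minimal Python sketch rather than a real implementation. It assumes the embeddings are already computed by some on-device model (not shown), and it uses a brute-force cosine scan instead of a proper vector index. The names `LocalVectorStore`, `top_k`, and `build_prompt` are made up for illustration and do not come from any library.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as unrelated to everything
    return dot / (norm_a * norm_b)


class LocalVectorStore:
    """Hypothetical in-memory store: a flat list scanned on every query."""

    def __init__(self):
        self._items = []  # list of (text, embedding) pairs

    def add(self, text, embedding):
        self._items.append((text, list(embedding)))

    def top_k(self, query_embedding, k=3):
        # Score every stored passage, then keep the k most similar.
        scored = [(cosine(query_embedding, emb), text) for text, emb in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]


def build_prompt(question, passages):
    """Assemble retrieved passages and the question into one prompt string."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```

The resulting prompt string would then be passed to the on-device runtime (llama.cpp or similar) through whichever bridge ends up being used. A linear scan like this is probably only reasonable for a small number of passages, which is part of why I'm asking about better options.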
My main questions are:
1. What is currently the most reliable architecture for local LLM + RAG on mobile?
2. If the frontend is Flutter, would you recommend Platform Channels or FFI?
3. What are good approaches for local knowledge retrieval on mobile devices?
4. How do you usually balance performance, memory usage, and model size in production or prototype setups?
I’d be very interested in hearing any real-world experience or recommendations related to mobile AI / edge AI systems.