{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreih7k7jsshhkb5hnsss3ys7cmctgxshyzb2bwe5yvtjfojw22nvmby",
    "uri": "at://did:plc:j4nmy4ymoeorm3j6hzbijapg/app.bsky.feed.post/3miw3teybi4a2"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreidkucsrjomepcwkfj3vpsibotmzwptsucqth65aeajf6b52arnnmi"
    },
    "mimeType": "image/jpeg",
    "size": 514222
  },
  "description": "Google's Gemma 4 runs offline on your iPhone. A follow-up to my local LLM experiment, now with a sharper app, a better model, and a clearer sense of what this category is becoming.",
  "path": "/running-gemma-4-on-your-iphone/",
  "publishedAt": "2026-04-07T15:29:20.000Z",
  "site": "https://hoeijmakers.net",
  "tags": [
    "a local language model on my iPhone",
    "Gemini Nano",
    "Neural Engine for LLM inference",
    "Edge Gallery",
    "Google AI Edge | Google AI for Developers",
    "AI Locally",
    "Locally AI - Run AI models locally on your iPhone, iPad, and Mac.",
    "Mac app",
    "The AI Continuity Problem",
    "Running a Local LLM on Your iPhone",
    "The Neural Engine Does Not Run Your LLM"
  ],
  "textContent": "Last summer I ran a local language model on my iPhone using Haplo AI. Gemma was the only model that actually produced a useful result. Now Google has shipped Gemma 4, and the story has changed enough to be worth revisiting.\n\n## What Gemma is\n\nGemma is Google's family of open-weight models, designed to be downloaded and run locally rather than accessed through a cloud API. The distinction matters. Where Gemini lives on Google's servers and requires a connection, Gemma's weights are published. You can download them, run them on your own hardware, quantize them for smaller devices, and fine-tune them for specific tasks.\n\nThis is also why Gemini Nano, which ships inside Chrome and on Pixel devices, is not the same thing. Nano is part of the closed Gemini family: Google controls it, distributes it as a black box, and you call it through an API without ever seeing or owning the model. Gemma is a separate release, designed from the start for local deployment. Same research lineage, completely different distribution model.\n\nWhy multiple sizes? A 2B model fits on a phone and responds in seconds. A 27B model needs a laptop with a capable GPU. Gemma 4 comes in four variants; the smallest runs on a current iPhone. The choice is always between what the hardware can carry and what the task actually needs.\n\n💡\n\nGemma 4 comes in E2B, E4B, 26B MoE and 31B Dense variants. The E2B (about 2.5 GB download) runs on an iPhone 15 Pro or newer. Multimodal support, meaning it can reason about images as well as text, is included in the smaller variants.\n\n## Under the hood\n\nTwo apps let you run Gemma on iPhone, and they are built differently in ways that show.\n\n**Locally AI** uses MLX, Apple's own machine learning framework, built around the unified memory architecture of Apple Silicon. The model runs on the GPU via Metal, tight and native.\n\nGoogle **AI Edge Gallery** uses LiteRT, which Google built for Android NPUs, then translated into Metal on iOS. 
It works, but it is an extra step that Apple's own framework does not need.\n\nEdge Gallery and Locally AI for iOS\n\nNeither uses the Neural Engine for LLM inference, despite what the marketing around \"on-device AI\" often implies. Language models need the GPU's flexibility. Face ID and camera processing run on the Neural Engine because they are fixed, predictable operations. When I ran the E4B model, the larger of the two downloadable Gemma 4 variants, my iPhone 16 Pro got noticeably warm. That is not a complaint. It is the GPU working: the phone is doing real compute, not calling a server.\n\n## Demo versus tool\n\nEdge Gallery is polished and honest about what it is. Conversations are ephemeral, nothing is saved between sessions, and the feature set reads like a capabilities showcase: image questions, audio transcription, a tool-calling demo. Google built it to show what Gemma 4 can do on a phone. For that purpose it works well.\n\nLocally AI is building toward something different. It has Shortcuts integration, a voice mode (English only for now), and a model browser that flags which models will run well on your specific device. I have been using it more than Edge Gallery, and the preference is clear after a few sessions.\n\nThe limitations are also clear. The context window is small: long documents hit the ceiling quickly, PDF reading crashed outright, and transcription is not a realistic use case at this scale. 
These are not edge cases; they are the main things you would reach for a cloud model to do.\n\n💡\n\nLocally AI also has a Mac app, built on the same MLX foundation. It runs local models on Apple Silicon, including the Apple Foundation Model, which is Apple's own on-device LLM and is accessible without a cloud connection. It is worth installing if you already use the iPhone app.\n\n## Progress, with caveats\n\nWhat has actually changed since my Haplo AI experiment is the baseline. The hardware has caught up: on a recent iPhone, a 3B to 4B model runs fast enough to feel responsive. Gemma 4 handles Dutch well, which matters more than it might seem for a model running entirely on device. And the apps have matured from rough experiments into something you could actually use.\n\nWhat has not changed is the gap with cloud models. Local AI on a phone is useful for bounded tasks where the input is sensitive: a financial document you have not decided to share, notes that should not leave the device, a question you would rather not route through a server. For that use case, the combination of privacy and reasonable quality is now genuinely good enough.\n\nFor everything else, the cloud model is still the right tool. The context window is larger, the reasoning is deeper, and nothing crashes when you open a PDF.\n\n### Further reading\n\n  * The AI Continuity Problem\n  * Running a Local LLM on Your iPhone\n  * The Neural Engine Does Not Run Your LLM\n\n",
  "title": "Running Gemma 4 on Your iPhone",
  "updatedAt": "2026-05-10T08:53:35.333Z"
}