External Publication
Visit Post

Local LLM on MacBook M5 Pro - Totally New to This!

Hugging Face Forums [Unofficial] July 1, 2026
Source

Hmm… It looks like you may already be past the “which model should I choose?” stage and into the next stage:


You have already done one of the hardest beginner steps: you have a local stack running at all.

Docker Model Runner + Open WebUI + local models + Knowledge collections is not a strange direction. Docker has official docs for using Open WebUI with Docker Model Runner, and Open WebUI is designed to sit in front of local or OpenAI-compatible backends.

So I would not make the next step “download ten more models.”

I would make the next step:

Build a small, repeatable local-AI test bench, then use that to decide which model/runtime/settings are actually working for your use cases.

That can be very simple at first: a note file or spreadsheet with fixed prompts and expected behavior.

My short take

I would split your setup into layers instead of judging everything as “the model.”

Layer Examples Why it matters
Model Gemma 4, Qwen3, Qwen3-Coder-style models General ability and instruction following
Artifact GGUF, MLX, Safetensors, quantization Embedded templates, conversion quality, memory/speed tradeoffs
Runtime / backend Docker Model Runner, llama.cpp, MLX-LM, Ollama, LM Studio, vLLM API behavior, speed, parser behavior, tool-call support
UI / platform Open WebUI RAG behavior, Knowledge handling, account/tool permissions
RAG / Knowledge embeddings, chunks, collections, Full Context Whether your docs are actually retrieved and used
Tools / agents function calling, Python tools, filesystem/network tools Reliability and safety boundary
Remote access Tailscale, LAN, local API ports What is reachable from where

The practical order I would use:

  1. Freeze one baseline.
  2. Make a tiny test set.
  3. Change one variable at a time.
  4. Debug RAG separately from model quality.
  5. Debug tool calling separately from chat quality.
  6. Only then add more models or agentic tools.

A small baseline and test bench (click for more details)


Model choice: pick by role, not by global ranking

I would not try to pick one global “best” local model.

A more useful split is:

Role What I would test
Daily assistant Speed, low friction, clear answers
RAG/manual reader Uses provided context faithfully and admits missing info
Long-context model Handles beginning/middle/end of long documents
Coding/tool model Structured tool calls, code repair, multi-step reliability
Lightweight fallback Fast enough for simple tasks
Experiment model Useful for learning even if not stable

For coding/tool-heavy use, Qwen3-Coder-style models are relevant candidates. The Qwen3-Coder-30B-A3B-Instruct GGUF card is useful because it shows several local runtime paths.

For Gemma 4, I would separate “daily chat” from “agent/tool use.” It may be useful as a daily model while still needing stricter backend/version checks for function calling.


GGUF vs MLX vs Transformers: runtime path first, quality judgment second

On Apple Silicon, MLX is worth knowing about. Hugging Face has a short overview of MLX on the Hub, and MLX can be attractive on Macs.

But I would not treat GGUF vs MLX vs Transformers as a simple intelligence ranking.

A safer way to think about it:

Path Why it is useful
GGUF / llama.cpp ecosystem Broad local compatibility, many quants, many tools
MLX Apple Silicon-native path, often attractive on Mac
Transformers / Safetensors Closest to many official examples and debugging paths
Docker Model Runner Docker-integrated serving path
Ollama / LM Studio Very convenient local workflows

The same base model can behave differently depending on:

  • quantization,
  • embedded chat template,
  • tokenizer config,
  • context length,
  • sampling settings,
  • runtime parser,
  • streaming behavior,
  • tool-call parser,
  • UI/client/proxy behavior.

So I would compare this:

model + artifact + quant + runtime + UI + settings

not only this:

model name

RAG / Knowledge: debug retrieval before blaming the model

For Knowledge collections, I would start with one tiny test document before judging large manuals.

A bad RAG answer can mean several different things:

Failure type Meaning
Retrieval failure The right chunk was not found
Context injection failure The chunk was found but not actually used
Generation failure The model saw the right text but ignored or misread it
Configuration mismatch UI/tool mode changed how Knowledge is exposed
Missing-answer failure The model guesses instead of saying “not in the docs”

Open WebUI’s Knowledge docs and Tools docs are worth reading because Native Function Calling can change how Knowledge is exposed. In Native mode, attached knowledge may need to be actively called through tools rather than being automatically injected in the older/simple RAG style.

Depending on your Open WebUI version/settings, also check the current docs around ENABLE_KB_EXEC=True. The Open WebUI Essentials page describes kb_exec, which gives models a filesystem-style interface over Knowledge Bases in newer Native-mode setups. I would not assume you always need it, but I would know that it exists.

Minimal RAG smoke test (click for more details)


Function calling: test the exact path, not just the model card

Tool calling is not just a model feature.

The model does not directly execute functions. It produces a tool-call request, then the surrounding app/server/client parses it, executes the tool, and sends the result back. LM Studio’s Tool Use docs explain that flow clearly. The llama.cpp function-calling docs also show why chat templates and parser support matter.

I would treat function calling as a contract between:

model
+ chat template
+ tokenizer/model repo files
+ converted artifact
+ backend parser
+ streaming parser
+ OpenAI-compatible adapter
+ UI/client
+ agent loop

If one layer is stale or mismatched, symptoms can look like:

raw JSON appears in the chat
raw native tool-call tokens appear in the chat
no tool_calls field
malformed arguments
repeated tool calls
tool loops
correct first call but broken second call
streaming-only failures
backend-direct works but proxy/client fails

That does not automatically mean “the model is bad.” It can mean the local stack is not handling that model’s tool-call protocol correctly.

Gemma 4-specific caution

Gemma 4 is a good example where I would be extra careful.

Gemma 4 has official function-calling documentation, but practical local reliability depends heavily on backend freshness and model-template freshness.

Recent Gemma 4 tool-calling fixes and reports suggest this is a multi-layer protocol-boundary problem, not just a model-weight question. This HF Forum post is useful background: Gemma 4 bug fixes and Research Request.

Other useful examples:

  • google/gemma-4-31B-it chat-template fix discussion
  • vLLM Gemma 4 streaming tool parser issue
  • MLX-LM Gemma 4 native tool-call parser issue
  • llama-cpp-python Gemma 4 raw tool-call issue

I would not read those as “do not use Gemma 4.” I would read them as:

For Gemma 4 tool use, update first, then smoke-test the exact path.

Tool-calling smoke test, especially for Gemma 4 (click for more details)

For real agentic use, I would test tool-calling models separately from daily-chat models. A model can be pleasant for chat and still not be the model I would trust first with filesystem, browser, email, or shell tools.


Remote access and agentic use: the boundary is what the system can reach

Tailscale is a reasonable direction for private remote access.

But I would separate:

  1. Remote access to Open WebUI
  2. Direct access to the model API
  3. Tool/agent access to files, network, shell, credentials, or devices

Containerizing the model is only one part of the safety story. For agentic AI, the bigger question is:

What can the agent reach?

Open WebUI’s Tools docs and Hardening guide are useful background because Open WebUI tools/functions can run server-side code. Docker’s Model Runner API docs are also worth reading so you know which endpoints are reachable from where.

Remote / agentic checklist (click for more details)


A low-friction roadmap

Phase 1 — keep your current stack and make it measurable

Do not rebuild everything yet. Record versions, model artifacts, quantization, context length, embeddings, and Open WebUI settings. Then run the same 10–15 tests.

Phase 2 — make RAG boring

Start with one tiny Knowledge file. Then one real manual. Then multiple collections. Avoid debugging PDF extraction, Markdown conversion, chunking, embeddings, retrieval, and generation all at once.

Phase 3 — test tools separately

Use one harmless function. Confirm the full loop:

user prompt
-> model requests tool
-> app executes tool
-> model receives result
-> model gives final answer

For Gemma 4, be stricter about backend/artifact/template freshness.

Phase 4 — compare models by role

Do not require one model to win every category.

Example:

Model family Test separately
Gemma 4 daily chat, summarization, RAG, tool calling
Qwen3 30B-A3B deeper reading, long context, RAG
Qwen3-Coder-style coding, structured tool use, agent workflow
Smaller models fast fallback tasks

Phase 5 — remote and agentic later

Only after the baseline is boring:

Tailscale access
read-only tools
toy function tools
limited real tools
human approval for risky actions

Reference links (click for more details)

Overall: your current direction does not look unreasonable. I would just shift the next step from “more models” to “small repeatable tests.” Once you have that, model choice becomes much less mysterious.

Discussion in the ATmosphere

Loading comments...