Local LLM on MacBook M5 Pro - Totally New to This!
Hmm… It looks like you may already be past the “which model should I choose?” stage and into the next stage:
You have already done one of the hardest beginner steps: you have a local stack running at all.
Docker Model Runner + Open WebUI + local models + Knowledge collections is not a strange direction. Docker has official docs for using Open WebUI with Docker Model Runner, and Open WebUI is designed to sit in front of local or OpenAI-compatible backends.
So I would not make the next step “download ten more models.”
I would make the next step:
Build a small, repeatable local-AI test bench, then use that to decide which model/runtime/settings are actually working for your use cases.
That can be very simple at first: a note file or spreadsheet with fixed prompts and expected behavior.
My short take
I would split your setup into layers instead of judging everything as “the model.”
| Layer | Examples | Why it matters |
|---|---|---|
| Model | Gemma 4, Qwen3, Qwen3-Coder-style models | General ability and instruction following |
| Artifact | GGUF, MLX, Safetensors, quantization | Embedded templates, conversion quality, memory/speed tradeoffs |
| Runtime / backend | Docker Model Runner, llama.cpp, MLX-LM, Ollama, LM Studio, vLLM | API behavior, speed, parser behavior, tool-call support |
| UI / platform | Open WebUI | RAG behavior, Knowledge handling, account/tool permissions |
| RAG / Knowledge | embeddings, chunks, collections, Full Context | Whether your docs are actually retrieved and used |
| Tools / agents | function calling, Python tools, filesystem/network tools | Reliability and safety boundary |
| Remote access | Tailscale, LAN, local API ports | What is reachable from where |
The practical order I would use:
- Freeze one baseline.
- Make a tiny test set.
- Change one variable at a time.
- Debug RAG separately from model quality.
- Debug tool calling separately from chat quality.
- Only then add more models or agentic tools.
A small baseline and test bench (click for more details)
Model choice: pick by role, not by global ranking
I would not try to pick one global “best” local model.
A more useful split is:
| Role | What I would test |
|---|---|
| Daily assistant | Speed, low friction, clear answers |
| RAG/manual reader | Uses provided context faithfully and admits missing info |
| Long-context model | Handles beginning/middle/end of long documents |
| Coding/tool model | Structured tool calls, code repair, multi-step reliability |
| Lightweight fallback | Fast enough for simple tasks |
| Experiment model | Useful for learning even if not stable |
For coding/tool-heavy use, Qwen3-Coder-style models are relevant candidates. The Qwen3-Coder-30B-A3B-Instruct GGUF card is useful because it shows several local runtime paths.
For Gemma 4, I would separate “daily chat” from “agent/tool use.” It may be useful as a daily model while still needing stricter backend/version checks for function calling.
GGUF vs MLX vs Transformers: runtime path first, quality judgment second
On Apple Silicon, MLX is worth knowing about. Hugging Face has a short overview of MLX on the Hub, and MLX can be attractive on Macs.
But I would not treat GGUF vs MLX vs Transformers as a simple intelligence ranking.
A safer way to think about it:
| Path | Why it is useful |
|---|---|
| GGUF / llama.cpp ecosystem | Broad local compatibility, many quants, many tools |
| MLX | Apple Silicon-native path, often attractive on Mac |
| Transformers / Safetensors | Closest to many official examples and debugging paths |
| Docker Model Runner | Docker-integrated serving path |
| Ollama / LM Studio | Very convenient local workflows |
The same base model can behave differently depending on:
- quantization,
- embedded chat template,
- tokenizer config,
- context length,
- sampling settings,
- runtime parser,
- streaming behavior,
- tool-call parser,
- UI/client/proxy behavior.
So I would compare this:
model + artifact + quant + runtime + UI + settings
not only this:
model name
RAG / Knowledge: debug retrieval before blaming the model
For Knowledge collections, I would start with one tiny test document before judging large manuals.
A bad RAG answer can mean several different things:
| Failure type | Meaning |
|---|---|
| Retrieval failure | The right chunk was not found |
| Context injection failure | The chunk was found but not actually used |
| Generation failure | The model saw the right text but ignored or misread it |
| Configuration mismatch | UI/tool mode changed how Knowledge is exposed |
| Missing-answer failure | The model guesses instead of saying “not in the docs” |
Open WebUI’s Knowledge docs and Tools docs are worth reading because Native Function Calling can change how Knowledge is exposed. In Native mode, attached knowledge may need to be actively called through tools rather than being automatically injected in the older/simple RAG style.
Depending on your Open WebUI version/settings, also check the current docs around ENABLE_KB_EXEC=True. The Open WebUI Essentials page describes kb_exec, which gives models a filesystem-style interface over Knowledge Bases in newer Native-mode setups. I would not assume you always need it, but I would know that it exists.
Minimal RAG smoke test (click for more details)
Function calling: test the exact path, not just the model card
Tool calling is not just a model feature.
The model does not directly execute functions. It produces a tool-call request, then the surrounding app/server/client parses it, executes the tool, and sends the result back. LM Studio’s Tool Use docs explain that flow clearly. The llama.cpp function-calling docs also show why chat templates and parser support matter.
I would treat function calling as a contract between:
model
+ chat template
+ tokenizer/model repo files
+ converted artifact
+ backend parser
+ streaming parser
+ OpenAI-compatible adapter
+ UI/client
+ agent loop
If one layer is stale or mismatched, symptoms can look like:
raw JSON appears in the chat
raw native tool-call tokens appear in the chat
no tool_calls field
malformed arguments
repeated tool calls
tool loops
correct first call but broken second call
streaming-only failures
backend-direct works but proxy/client fails
That does not automatically mean “the model is bad.” It can mean the local stack is not handling that model’s tool-call protocol correctly.
Gemma 4-specific caution
Gemma 4 is a good example where I would be extra careful.
Gemma 4 has official function-calling documentation, but practical local reliability depends heavily on backend freshness and model-template freshness.
Recent Gemma 4 tool-calling fixes and reports suggest this is a multi-layer protocol-boundary problem, not just a model-weight question. This HF Forum post is useful background: Gemma 4 bug fixes and Research Request.
Other useful examples:
- google/gemma-4-31B-it chat-template fix discussion
- vLLM Gemma 4 streaming tool parser issue
- MLX-LM Gemma 4 native tool-call parser issue
- llama-cpp-python Gemma 4 raw tool-call issue
I would not read those as “do not use Gemma 4.” I would read them as:
For Gemma 4 tool use, update first, then smoke-test the exact path.
Tool-calling smoke test, especially for Gemma 4 (click for more details)
For real agentic use, I would test tool-calling models separately from daily-chat models. A model can be pleasant for chat and still not be the model I would trust first with filesystem, browser, email, or shell tools.
Remote access and agentic use: the boundary is what the system can reach
Tailscale is a reasonable direction for private remote access.
But I would separate:
- Remote access to Open WebUI
- Direct access to the model API
- Tool/agent access to files, network, shell, credentials, or devices
Containerizing the model is only one part of the safety story. For agentic AI, the bigger question is:
What can the agent reach?
Open WebUI’s Tools docs and Hardening guide are useful background because Open WebUI tools/functions can run server-side code. Docker’s Model Runner API docs are also worth reading so you know which endpoints are reachable from where.
Remote / agentic checklist (click for more details)
A low-friction roadmap
Phase 1 — keep your current stack and make it measurable
Do not rebuild everything yet. Record versions, model artifacts, quantization, context length, embeddings, and Open WebUI settings. Then run the same 10–15 tests.
Phase 2 — make RAG boring
Start with one tiny Knowledge file. Then one real manual. Then multiple collections. Avoid debugging PDF extraction, Markdown conversion, chunking, embeddings, retrieval, and generation all at once.
Phase 3 — test tools separately
Use one harmless function. Confirm the full loop:
user prompt
-> model requests tool
-> app executes tool
-> model receives result
-> model gives final answer
For Gemma 4, be stricter about backend/artifact/template freshness.
Phase 4 — compare models by role
Do not require one model to win every category.
Example:
| Model family | Test separately |
|---|---|
| Gemma 4 | daily chat, summarization, RAG, tool calling |
| Qwen3 30B-A3B | deeper reading, long context, RAG |
| Qwen3-Coder-style | coding, structured tool use, agent workflow |
| Smaller models | fast fallback tasks |
Phase 5 — remote and agentic later
Only after the baseline is boring:
Tailscale access
read-only tools
toy function tools
limited real tools
human approval for risky actions
Reference links (click for more details)
Overall: your current direction does not look unreasonable. I would just shift the next step from “more models” to “small repeatable tests.” Once you have that, model choice becomes much less mysterious.
Discussion in the ATmosphere