External Publication

Local LLM on MacBook M5 Pro - Totally New to This!

Hugging Face Forums [Unofficial] July 1, 2026

Hmm… It looks like you may already be past the “which model should I choose?” stage and into the next stage:

You have already done one of the hardest beginner steps: you have a local stack running at all.

Docker Model Runner + Open WebUI + local models + Knowledge collections is not a strange direction. Docker has official docs for using Open WebUI with Docker Model Runner, and Open WebUI is designed to sit in front of local or OpenAI-compatible backends.

So I would not make the next step “download ten more models.”

I would make the next step:

Build a small, repeatable local-AI test bench, then use that to decide which model/runtime/settings are actually working for your use cases.

That can be very simple at first: a note file or spreadsheet with fixed prompts and expected behavior.

My short take

I would split your setup into layers instead of judging everything as “the model.”

Layer	Examples	Why it matters
Model	Gemma 4, Qwen3, Qwen3-Coder-style models	General ability and instruction following
Artifact	GGUF, MLX, Safetensors, quantization	Embedded templates, conversion quality, memory/speed tradeoffs
Runtime / backend	Docker Model Runner, llama.cpp, MLX-LM, Ollama, LM Studio, vLLM	API behavior, speed, parser behavior, tool-call support
UI / platform	Open WebUI	RAG behavior, Knowledge handling, account/tool permissions
RAG / Knowledge	embeddings, chunks, collections, Full Context	Whether your docs are actually retrieved and used
Tools / agents	function calling, Python tools, filesystem/network tools	Reliability and safety boundary
Remote access	Tailscale, LAN, local API ports	What is reachable from where

The practical order I would use:

Freeze one baseline.
Make a tiny test set.
Change one variable at a time.
Debug RAG separately from model quality.
Debug tool calling separately from chat quality.
Only then add more models or agentic tools.

A small baseline and test bench (click for more details)

Model choice: pick by role, not by global ranking

I would not try to pick one global “best” local model.

A more useful split is:

Role	What I would test
Daily assistant	Speed, low friction, clear answers
RAG/manual reader	Uses provided context faithfully and admits missing info
Long-context model	Handles beginning/middle/end of long documents
Coding/tool model	Structured tool calls, code repair, multi-step reliability
Lightweight fallback	Fast enough for simple tasks
Experiment model	Useful for learning even if not stable

For coding/tool-heavy use, Qwen3-Coder-style models are relevant candidates. The Qwen3-Coder-30B-A3B-Instruct GGUF card is useful because it shows several local runtime paths.

For Gemma 4, I would separate “daily chat” from “agent/tool use.” It may be useful as a daily model while still needing stricter backend/version checks for function calling.

GGUF vs MLX vs Transformers: runtime path first, quality judgment second

On Apple Silicon, MLX is worth knowing about. Hugging Face has a short overview of MLX on the Hub, and MLX can be attractive on Macs.

But I would not treat GGUF vs MLX vs Transformers as a simple intelligence ranking.

A safer way to think about it:

Path	Why it is useful
GGUF / llama.cpp ecosystem	Broad local compatibility, many quants, many tools
MLX	Apple Silicon-native path, often attractive on Mac
Transformers / Safetensors	Closest to many official examples and debugging paths
Docker Model Runner	Docker-integrated serving path
Ollama / LM Studio	Very convenient local workflows

The same base model can behave differently depending on:

quantization,
embedded chat template,
tokenizer config,
context length,
sampling settings,
runtime parser,
streaming behavior,
tool-call parser,
UI/client/proxy behavior.

So I would compare this:

model + artifact + quant + runtime + UI + settings

not only this:

model name

RAG / Knowledge: debug retrieval before blaming the model

For Knowledge collections, I would start with one tiny test document before judging large manuals.

A bad RAG answer can mean several different things:

Failure type	Meaning
Retrieval failure	The right chunk was not found
Context injection failure	The chunk was found but not actually used
Generation failure	The model saw the right text but ignored or misread it
Configuration mismatch	UI/tool mode changed how Knowledge is exposed
Missing-answer failure	The model guesses instead of saying “not in the docs”

Open WebUI’s Knowledge docs and Tools docs are worth reading because Native Function Calling can change how Knowledge is exposed. In Native mode, attached knowledge may need to be actively called through tools rather than being automatically injected in the older/simple RAG style.

Depending on your Open WebUI version/settings, also check the current docs around ENABLE_KB_EXEC=True. The Open WebUI Essentials page describes kb_exec, which gives models a filesystem-style interface over Knowledge Bases in newer Native-mode setups. I would not assume you always need it, but I would know that it exists.

Minimal RAG smoke test (click for more details)

Function calling: test the exact path, not just the model card

Tool calling is not just a model feature.

The model does not directly execute functions. It produces a tool-call request, then the surrounding app/server/client parses it, executes the tool, and sends the result back. LM Studio’s Tool Use docs explain that flow clearly. The llama.cpp function-calling docs also show why chat templates and parser support matter.

I would treat function calling as a contract between:

model
+ chat template
+ tokenizer/model repo files
+ converted artifact
+ backend parser
+ streaming parser
+ OpenAI-compatible adapter
+ UI/client
+ agent loop

If one layer is stale or mismatched, symptoms can look like:

raw JSON appears in the chat
raw native tool-call tokens appear in the chat
no tool_calls field
malformed arguments
repeated tool calls
tool loops
correct first call but broken second call
streaming-only failures
backend-direct works but proxy/client fails

That does not automatically mean “the model is bad.” It can mean the local stack is not handling that model’s tool-call protocol correctly.

Gemma 4-specific caution

Gemma 4 is a good example where I would be extra careful.

Gemma 4 has official function-calling documentation, but practical local reliability depends heavily on backend freshness and model-template freshness.

Recent Gemma 4 tool-calling fixes and reports suggest this is a multi-layer protocol-boundary problem, not just a model-weight question. This HF Forum post is useful background: Gemma 4 bug fixes and Research Request.

Other useful examples:

google/gemma-4-31B-it chat-template fix discussion
vLLM Gemma 4 streaming tool parser issue
MLX-LM Gemma 4 native tool-call parser issue
llama-cpp-python Gemma 4 raw tool-call issue

I would not read those as “do not use Gemma 4.” I would read them as:

For Gemma 4 tool use, update first, then smoke-test the exact path.

Tool-calling smoke test, especially for Gemma 4 (click for more details)

For real agentic use, I would test tool-calling models separately from daily-chat models. A model can be pleasant for chat and still not be the model I would trust first with filesystem, browser, email, or shell tools.

Remote access and agentic use: the boundary is what the system can reach

Tailscale is a reasonable direction for private remote access.

But I would separate:

Remote access to Open WebUI
Direct access to the model API
Tool/agent access to files, network, shell, credentials, or devices

Containerizing the model is only one part of the safety story. For agentic AI, the bigger question is:

What can the agent reach?

Open WebUI’s Tools docs and Hardening guide are useful background because Open WebUI tools/functions can run server-side code. Docker’s Model Runner API docs are also worth reading so you know which endpoints are reachable from where.

Remote / agentic checklist (click for more details)

A low-friction roadmap

Phase 1 — keep your current stack and make it measurable

Do not rebuild everything yet. Record versions, model artifacts, quantization, context length, embeddings, and Open WebUI settings. Then run the same 10–15 tests.

Phase 2 — make RAG boring

Start with one tiny Knowledge file. Then one real manual. Then multiple collections. Avoid debugging PDF extraction, Markdown conversion, chunking, embeddings, retrieval, and generation all at once.

Phase 3 — test tools separately

Use one harmless function. Confirm the full loop:

user prompt
-> model requests tool
-> app executes tool
-> model receives result
-> model gives final answer

For Gemma 4, be stricter about backend/artifact/template freshness.

Phase 4 — compare models by role

Do not require one model to win every category.

Example:

Model family	Test separately
Gemma 4	daily chat, summarization, RAG, tool calling
Qwen3 30B-A3B	deeper reading, long context, RAG
Qwen3-Coder-style	coding, structured tool use, agent workflow
Smaller models	fast fallback tasks

Phase 5 — remote and agentic later

Only after the baseline is boring:

Tailscale access
read-only tools
toy function tools
limited real tools
human approval for risky actions

Reference links (click for more details)

Overall: your current direction does not look unreasonable. I would just shift the next step from “more models” to “small repeatable tests.” Once you have that, model choice becomes much less mysterious.