{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiddozkitpbpdl3uwazhpsggvd46m63habo3xb5l3hziln7mrnmv44",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mpkkabb4yd42"
  },
  "path": "/t/local-llm-on-macbook-m5-pro-totally-new-to-this/177286#post_2",
  "publishedAt": "2026-07-01T01:51:53.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "using Open WebUI with Docker Model Runner",
    "Qwen3-Coder-30B-A3B-Instruct GGUF card",
    "MLX on the Hub",
    "Knowledge docs",
    "Tools docs",
    "Open WebUI Essentials page",
    "Tool Use docs",
    "llama.cpp function-calling docs",
    "function-calling documentation",
    "Gemma 4 bug fixes and Research Request",
    "google/gemma-4-31B-it chat-template fix discussion",
    "vLLM Gemma 4 streaming tool parser issue",
    "MLX-LM Gemma 4 native tool-call parser issue",
    "llama-cpp-python Gemma 4 raw tool-call issue",
    "Hardening guide",
    "Model Runner API docs"
  ],
  "textContent": "Hmm… It looks like you may already be past the “which model should I choose?” stage and into the next stage:\n\n* * *\n\nYou have already done one of the hardest beginner steps: you have a local stack running at all.\n\nDocker Model Runner + Open WebUI + local models + Knowledge collections is not a strange direction. Docker has official docs for using Open WebUI with Docker Model Runner, and Open WebUI is designed to sit in front of local or OpenAI-compatible backends.\n\nSo I would not make the next step “download ten more models.”\n\nI would make the next step:\n\n> Build a small, repeatable local-AI test bench, then use that to decide which model/runtime/settings are actually working for your use cases.\n\nThat can be very simple at first: a note file or spreadsheet with fixed prompts and expected behavior.\n\n## My short take\n\nI would split your setup into layers instead of judging everything as “the model.”\n\nLayer | Examples | Why it matters\n---|---|---\nModel | Gemma 4, Qwen3, Qwen3-Coder-style models | General ability and instruction following\nArtifact | GGUF, MLX, Safetensors, quantization | Embedded templates, conversion quality, memory/speed tradeoffs\nRuntime / backend | Docker Model Runner, llama.cpp, MLX-LM, Ollama, LM Studio, vLLM | API behavior, speed, parser behavior, tool-call support\nUI / platform | Open WebUI | RAG behavior, Knowledge handling, account/tool permissions\nRAG / Knowledge | embeddings, chunks, collections, Full Context | Whether your docs are actually retrieved and used\nTools / agents | function calling, Python tools, filesystem/network tools | Reliability and safety boundary\nRemote access | Tailscale, LAN, local API ports | What is reachable from where\n\nThe practical order I would use:\n\n  1. **Freeze one baseline.**\n  2. **Make a tiny test set.**\n  3. **Change one variable at a time.**\n  4. **Debug RAG separately from model quality.**\n  5. 
**Debug tool calling separately from chat quality.**\n  6. **Only then add more models or agentic tools.**\n\n* * *\n\n## Model choice: pick by role, not by global ranking\n\nI would not try to pick one global “best” local model.\n\nA more useful split is:\n\nRole | What I would test\n---|---\nDaily assistant | Speed, low friction, clear answers\nRAG/manual reader | Uses provided context faithfully and admits missing info\nLong-context model | Handles beginning/middle/end of long documents\nCoding/tool model | Structured tool calls, code repair, multi-step reliability\nLightweight fallback | Fast enough for simple tasks\nExperiment model | Useful for learning even if not stable\n\nFor coding/tool-heavy use, Qwen3-Coder-style models are relevant candidates. The Qwen3-Coder-30B-A3B-Instruct GGUF card is useful because it shows several local runtime paths.\n\nFor Gemma 4, I would separate “daily chat” from “agent/tool use.” It may be useful as a daily model while still needing stricter backend/version checks for function calling.\n\n* * *\n\n## GGUF vs MLX vs Transformers: runtime path first, quality judgment second\n\nOn Apple Silicon, MLX is worth knowing about. 
Hugging Face has a short overview of MLX on the Hub, and MLX can be attractive on Macs.\n\nBut I would not treat GGUF vs MLX vs Transformers as a simple intelligence ranking.\n\nA safer way to think about it:\n\nPath | Why it is useful\n---|---\nGGUF / llama.cpp ecosystem | Broad local compatibility, many quants, many tools\nMLX | Apple Silicon-native path, often attractive on Mac\nTransformers / Safetensors | Closest to many official examples and debugging paths\nDocker Model Runner | Docker-integrated serving path\nOllama / LM Studio | Very convenient local workflows\n\nThe same base model can behave differently depending on:\n\n  * quantization,\n  * embedded chat template,\n  * tokenizer config,\n  * context length,\n  * sampling settings,\n  * runtime parser,\n  * streaming behavior,\n  * tool-call parser,\n  * UI/client/proxy behavior.\n\n\n\nSo I would compare this:\n\n\n    model + artifact + quant + runtime + UI + settings\n\n\nnot only this:\n\n\n    model name\n\n\n* * *\n\n## RAG / Knowledge: debug retrieval before blaming the model\n\nFor Knowledge collections, I would start with one tiny test document before judging large manuals.\n\nA bad RAG answer can mean several different things:\n\nFailure type | Meaning\n---|---\nRetrieval failure | The right chunk was not found\nContext injection failure | The chunk was found but not actually used\nGeneration failure | The model saw the right text but ignored or misread it\nConfiguration mismatch | UI/tool mode changed how Knowledge is exposed\nMissing-answer failure | The model guesses instead of saying “not in the docs”\n\nOpen WebUI’s Knowledge docs and Tools docs are worth reading because Native Function Calling can change how Knowledge is exposed. In Native mode, attached knowledge may need to be actively called through tools rather than being automatically injected in the older/simple RAG style.\n\nDepending on your Open WebUI version/settings, also check the current docs around `ENABLE_KB_EXEC=True`. 
The Open WebUI Essentials page describes `kb_exec`, which gives models a filesystem-style interface over Knowledge Bases in newer Native-mode setups. I would not assume you always need it, but I would know that it exists.\n\n* * *\n\n## Function calling: test the exact path, not just the model card\n\nTool calling is not just a model feature.\n\nThe model does not directly execute functions. It produces a tool-call request, then the surrounding app/server/client parses it, executes the tool, and sends the result back. LM Studio’s Tool Use docs explain that flow clearly. The llama.cpp function-calling docs also show why chat templates and parser support matter.\n\nI would treat function calling as a contract between:\n\n\n    model\n    + chat template\n    + tokenizer/model repo files\n    + converted artifact\n    + backend parser\n    + streaming parser\n    + OpenAI-compatible adapter\n    + UI/client\n    + agent loop\n\n\nIf one layer is stale or mismatched, symptoms can look like:\n\n\n    raw JSON appears in the chat\n    raw native tool-call tokens appear in the chat\n    no tool_calls field\n    malformed arguments\n    repeated tool calls\n    tool loops\n    correct first call but broken second call\n    streaming-only failures\n    backend-direct works but proxy/client fails\n\n\nThat does not automatically mean “the model is bad.” It can mean the local stack is not handling that model’s tool-call protocol correctly.\n\n### Gemma 4-specific caution\n\nGemma 4 is a good example where I would be extra careful.\n\nGemma 4 has official function-calling documentation, but practical local reliability depends heavily on backend freshness and model-template freshness.\n\nRecent Gemma 4 tool-calling fixes and reports suggest this is a multi-layer protocol-boundary problem, not just a model-weight question. 
This HF Forum post is useful background: Gemma 4 bug fixes and Research Request.\n\nOther useful examples:\n\n  * google/gemma-4-31B-it chat-template fix discussion\n  * vLLM Gemma 4 streaming tool parser issue\n  * MLX-LM Gemma 4 native tool-call parser issue\n  * llama-cpp-python Gemma 4 raw tool-call issue\n\n\n\nI would not read those as “do not use Gemma 4.” I would read them as:\n\n> For Gemma 4 tool use, update first, then smoke-test the exact path.\n\nFor real agentic use, I would test tool-calling models separately from daily-chat models. A model can be pleasant for chat and still not be the model I would trust first with filesystem, browser, email, or shell tools.\n\n* * *\n\n## Remote access and agentic use: the boundary is what the system can reach\n\nTailscale is a reasonable direction for private remote access.\n\nBut I would separate:\n\n  1. **Remote access to Open WebUI**\n  2. **Direct access to the model API**\n  3. **Tool/agent access to files, network, shell, credentials, or devices**\n\n\n\nContainerizing the model is only one part of the safety story. For agentic AI, the bigger question is:\n\n> What can the agent reach?\n\nOpen WebUI’s Tools docs and Hardening guide are useful background because Open WebUI tools/functions can run server-side code. Docker’s Model Runner API docs are also worth reading so you know which endpoints are reachable from where.\n\n* * *\n\n## A low-friction roadmap\n\n### Phase 1 — keep your current stack and make it measurable\n\nDo not rebuild everything yet. Record versions, model artifacts, quantization, context length, embeddings, and Open WebUI settings. Then run the same 10–15 tests.\n\n### Phase 2 — make RAG boring\n\nStart with one tiny Knowledge file. Then one real manual. Then multiple collections. 
Avoid debugging PDF extraction, Markdown conversion, chunking, embeddings, retrieval, and generation all at once.\n\n### Phase 3 — test tools separately\n\nUse one harmless function. Confirm the full loop:\n\n\n    user prompt\n    -> model requests tool\n    -> app executes tool\n    -> model receives result\n    -> model gives final answer\n\n\nFor Gemma 4, be stricter about backend/artifact/template freshness.\n\n### Phase 4 — compare models by role\n\nDo not require one model to win every category.\n\nExample:\n\nModel family | Test separately\n---|---\nGemma 4 | daily chat, summarization, RAG, tool calling\nQwen3 30B-A3B | deeper reading, long context, RAG\nQwen3-Coder-style | coding, structured tool use, agent workflow\nSmaller models | fast fallback tasks\n\n### Phase 5 — remote and agentic later\n\nOnly after the baseline is boring:\n\n\n    Tailscale access\n    read-only tools\n    toy function tools\n    limited real tools\n    human approval for risky actions\n\n\n* * *\n\nOverall: your current direction does not look unreasonable. I would just shift the next step from “more models” to “small repeatable tests.” Once you have that, model choice becomes much less mysterious.",
  "title": "Local LLM on MacBook M5 Pro - Totally New to This!"
}