{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreihbobvxutcs4n7rvzraro2fvwc5pn26zw3vogmno7nudpeubjk4pe",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mjbm5alxt7z2"
  },
  "path": "/t/what-is-the-right-way-to-configure-gguf-models-templates-parameters-model-creation/175182#post_2",
  "publishedAt": "2026-04-12T05:00:46.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "GitHub",
    "Hugging Face",
    "Ollama Documentation"
  ],
  "textContent": "The correct answers to those questions may not exist, may vary depending on the type of output you want, or may depend on the software (backend) used to handle GGUF…\n\n* * *\n\nThe right way is to treat GGUF setup as **three separate layers** :\n\n  1. **Model file** : the `.gguf` weights and metadata\n  2. **Runtime/backend** : Ollama, llama.cpp, llama-cpp-python, LM Studio, and similar\n  3. **Prompt/rendering layer** : template, system prompt, stop strings, context size, and sampling settings\n\n\n\nMost “bad GGUF results” come from getting layer 2 or 3 wrong, not from GGUF itself. GGUF is a binary format for inference with GGML-based executors. It is not, by itself, a full packaged runtime configuration. (GitHub)\n\n## The short answer\n\nThe safest general rule is:\n\n  * **Do not invent the template**\n  * **Do not rely on backend defaults**\n  * **Do not compare setups unless quant, template, stop strings, context, and sampling all match**\n  * **Inspect what the backend is actually using before you override anything** (Hugging Face)\n\n\n\nIf you follow that rule, GGUF models can perform at the same level as preconfigured models on a given backend. When they do not, the cause is usually one of these:\n\n  * wrong chat template\n  * wrong stop strings\n  * too-small context\n  * different backend defaults\n  * different quantization\n  * missing model-specific features such as tool or document support in the template. (Hugging Face)\n\n\n\n* * *\n\n## 1. The correct workflow\n\nThere is not one universal workflow. The correct workflow depends on the backend.\n\n### A. Raw llama.cpp workflow\n\nFor `llama.cpp`, the clean path is:\n\n  1. download a trusted GGUF\n  2. run it directly in `llama-cli` or `llama-server`\n  3. let the runtime use the model’s embedded `tokenizer.chat_template` by default\n  4. set context and sampling explicitly\n  5. benchmark before tuning style. 
(GitHub)\n\n\n\nWith `llama-server`, there is usually **no separate model-creation step**. You point the server at the GGUF and run it. The server’s documented default is `--ctx-size 0`, which means “load context size from model metadata.” The server also exposes `/apply-template` to show exactly how messages are being rendered into a prompt string. (GitHub)\n\n### B. llama-cpp-python workflow\n\nFor `llama-cpp-python`, the workflow is similar:\n\n  1. load the GGUF\n  2. use chat completion rather than manually concatenating prompts\n  3. let the library choose the chat format automatically unless you know you must override it\n  4. enable `verbose=True` so you can see which chat format was selected\n  5. set sampling explicitly. (GitHub)\n\n\n\nThe documented precedence is:\n\n  * `chat_handler`\n  * `chat_format`\n  * `tokenizer.chat_template` from GGUF metadata\n  * fallback to `llama-2` format. (GitHub)\n\n\n\nThat precedence order is important because it tells you exactly where mistakes can creep in.\n\n### C. Ollama workflow\n\nFor Ollama, the workflow is different because Ollama has a **packaging layer** called the `Modelfile`.\n\nIf you already have a GGUF, the official workflow is:\n\n  1. create a `Modelfile`\n  2. set `FROM /path/to/file.gguf`\n  3. run `ollama create my-model`\n  4. run `ollama run my-model`. (Ollama Documentation)\n\n\n\nOllama’s docs describe the `Modelfile` as the blueprint for a model and document `FROM`, `PARAMETER`, `TEMPLATE`, `SYSTEM`, `ADAPTER`, `MESSAGE`, and `REQUIRES`. Ollama also lets you inspect any packaged model with `ollama show --modelfile`, and the API’s `show model details` endpoint returns the model’s parameters, template, capabilities, and metadata. (Ollama Documentation)\n\n* * *\n\n## 2. 
The most important principle: the template is usually more important than the knobs\n\nThis is the core of the whole topic.\n\nHugging Face’s chat-template guidance states that using a format different from what the model was trained on will usually cause **severe, silent performance degradation**. That is the strongest general statement available on this subject, and it matches what people see in practice. (Hugging Face)\n\nSo when you ask:\n\n> “How do I know whether ChatML or LLaMA format is correct?”\n\nThe answer is:\n\n**You do not guess. You inspect.** (Hugging Face)\n\nUse this order:\n\n  1. **embedded template in GGUF metadata**\n  2. **model card or official model docs**\n  3. **backend inspection tools** such as `ollama show --modelfile` or `/api/show`\n  4. **manual override only if needed**. (GitHub)\n\n\n\nIn `llama.cpp`, `llama_chat_apply_template()` uses the template stored in `tokenizer.chat_template` by default. In `llama-cpp-python`, the same metadata is part of the selection chain. In Ollama, `TEMPLATE` is an explicit Modelfile instruction, and `ollama show --modelfile` reveals the packaged version. (GitHub)\n\n* * *\n\n## 3. What the “right” template workflow looks like\n\n### Best default rule\n\n**Start from the model’s own template. Do not replace it unless you have evidence that you should.** (GitHub)\n\n### How to verify support for special features\n\nIf you need RAG documents or tool calling, you must verify that the template actually supports them. Hugging Face’s docs say many templates simply ignore the `documents` input, and recommend checking the model card or printing the chat template to see whether the relevant key is present. (Hugging Face)\n\n### Backend-specific gotcha\n\nIn `llama.cpp` server, only models with a supported chat template work optimally with `/v1/chat/completions`, and the server says that by default the ChatML template will be used there. 
It also supports `/apply-template`, which is the easiest way to check whether the rendered prompt looks correct before generation. (GitHub)\n\nThat means for `llama.cpp` the safe path is:\n\n  * use `/v1/chat/completions` for chat\n  * inspect `/apply-template` when debugging\n  * only force `--chat-template` or `--chat-template-file` when metadata/default behavior is wrong. (GitHub)\n\n\n\n* * *\n\n## 4. The right way to build a Modelfile\n\nIf you are using Ollama with a GGUF, the **best Modelfile is usually minimal**, not clever.\n\nA good starting pattern is:\n\n\n    FROM /path/to/model.gguf\n\n    PARAMETER num_ctx 8192\n    PARAMETER temperature 0.8\n    PARAMETER top_k 40\n    PARAMETER top_p 0.9\n    PARAMETER repeat_penalty 1.1\n\n    SYSTEM \"\"\"You are a helpful assistant.\"\"\"\n\n\nThis follows the official structure of `FROM`, `PARAMETER`, and `SYSTEM`, and avoids rewriting `TEMPLATE` unless you know the model needs a custom one. Ollama documents that templates are model-specific and use Go template syntax. It also shows how to inspect an existing packaged model and copy its template and stop sequences into a new Modelfile only if you need to. (Ollama Documentation)\n\n### When to include `TEMPLATE`\n\nOnly include it if one of these is true:\n\n  * the imported GGUF is missing the right prompt wrapper\n  * you are reproducing a known-good packaged model\n  * you are working with a model card that explicitly requires a specific format\n  * you need advanced custom tool or system behavior. (Ollama Documentation)\n\n\n\nA good practical rule is:\n\n**First inspect a stock model with `ollama show --modelfile`. Then copy only the parts you actually need.** (Ollama Documentation)\n\n* * *\n\n## 5. Recommended parameter values\n\nThere is no universal “best” preset. 
But there are good **starting** values.\n\nCurrent documented defaults differ by backend:\n\nSetting | Ollama Modelfile default | llama.cpp server/cli default | llama-cpp-python default | Practical start\n---|---|---|---|---\ntemperature | 0.8 | 0.8 | 0.8 | 0.8\ntop_k | 40 | 40 | 40 | 40\ntop_p | 0.9 | 0.95 | 0.95 | 0.9 to 0.95\nmin_p | 0.0 | 0.05 | 0.05 | 0.0 to 0.05\nrepeat_penalty | 1.1 | 1.0 | 1.0 | 1.0 to 1.1\nrepeat_last_n | 64 | 64 | backend-dependent | 64\nnum_ctx / n_ctx | 2048 | server: 0 = model metadata | library-config dependent | set explicitly\n\nThese values come directly from the current Ollama Modelfile docs, llama.cpp CLI/server docs, and llama-cpp-python API reference. The main lesson is not “copy one number.” The lesson is **backend defaults differ**, so set them explicitly when you care about reproducibility. (Ollama Documentation)\n\n### My recommended starting profiles\n\n#### General chat\n\n  * temperature: `0.8`\n  * top_k: `40`\n  * top_p: `0.9` to `0.95`\n  * min_p: `0.0` to `0.05`\n  * repeat_penalty: `1.0` to `1.1`\n  * repeat_last_n: `64`\n\nThis is close to the documented defaults and works as a neutral baseline. (Ollama Documentation)\n\n\n\n#### Deterministic evaluation\n\n  * fixed `seed`\n  * stable template\n  * explicit context\n  * avoid creative sampling drift\n\nFor correctness checks, the important thing is not “a magic low temperature.” It is using the same seed and the same decoding behavior across runs. llama.cpp maintainers and guides emphasize explicit sampling settings when comparing outputs. (GitHub)\n\n\n\n#### Creative writing\n\n  * higher temperature\n  * possibly higher top_p\n  * keep repeat controls on\n\nThis changes style and diversity, but it is not the right mode for verifying setup correctness. (Ollama Documentation)\n\n\n\n* * *\n\n## 6. 
How much do these parameters affect quality?\n\nNot all knobs matter equally.\n\n### Highest impact on correctness\n\n  * template\n  * stop strings\n  * system prompt\n  * context size\n  * model-specific template kwargs or tool/document support. (Hugging Face)\n\n\n\nThese are the settings that can make a correct model look broken.\n\n### Medium impact\n\n  * temperature\n  * top_k\n  * top_p\n  * min_p\n  * repeat_penalty\n  * repeat_last_n. (Ollama Documentation)\n\n\n\nThese affect diversity, conservatism, repetition, and determinism. They matter, but they usually do not explain catastrophic failures the way a wrong template does.\n\n### Separate but major impact\n\n  * quantization level\n\nllama.cpp’s quantization docs say quantization reduces size and can speed up inference, but may introduce accuracy loss, typically measured with perplexity and related metrics. Ollama’s import docs say the same tradeoff exists when quantizing models in Ollama. (GitHub)\n\n\n\n* * *\n\n## 7. What is the ideal context size?\n\nThere is no single ideal number. The right value is:\n\n**the smallest context that fully covers your real prompts and retrieved material without truncation.** (Ollama Documentation)\n\n### Backend-specific rules\n\n#### Ollama\n\nOllama’s context-length docs say default context length depends on available VRAM, and also say that for best performance you should use the maximum context length for the model and avoid CPU offload, verifying the actual split with `ollama ps`. (Ollama Documentation)\n\nThat means in Ollama you should not just set `num_ctx` blindly. You should also check whether the model is still fully on GPU.\n\n#### llama.cpp server\n\nThe server default is `--ctx-size 0`, which means “load from model.” That is a good starting point because it respects the model metadata. 
(GitHub)\n\n#### llama.cpp completion tool\n\nThe completion README documents a default context of 4096 for that tool, which is one more reason not to assume all backends behave the same. (GitHub)\n\n### Practical rule\n\n  * For casual chat: moderate context is fine\n  * For RAG, coding, agents, and long conversations: set more context explicitly\n  * Then check memory and offload behavior, not just the number you typed. (Ollama Documentation)\n\n\n\n* * *\n\n## 8. Standard, proven configurations\n\nThere is no universal gold standard. There are only **good baseline patterns**.\n\n### Pattern 1: minimal-change pattern\n\nUse the model’s own template and stay close to backend defaults. This is the safest first run. (GitHub)\n\n### Pattern 2: explicit baseline pattern\n\nPin the settings you care about:\n\n  * template\n  * stop strings\n  * system prompt\n  * context\n  * temperature\n  * top_k\n  * top_p\n  * repeat_penalty\n  * seed. (Ollama Documentation)\n\n\n\n### Pattern 3: quantization workflow pattern\n\nIf you quantize yourself, use a high-precision GGUF as the master and quantize from that. llama.cpp warns that requantizing an already-quantized model can severely reduce quality compared with quantizing from 16-bit or 32-bit. (GitHub)\n\n* * *\n\n## 9. Common mistakes\n\n### 1. Guessing the template\n\nThis is the biggest mistake. Hugging Face’s guidance explicitly says the wrong chat format causes silent degradation. (Hugging Face)\n\n### 2. Trusting a backend fallback without checking\n\n`llama-cpp-python` will fall back to `llama-2` formatting if nothing else is available. That is convenient, but it is not proof that the model was trained on that format. (GitHub)\n\n### 3. Relying on hidden defaults\n\nOllama, llama.cpp, and llama-cpp-python do not share identical defaults. If you do not set values explicitly, you are not actually comparing like with like. (Ollama Documentation)\n\n### 4. 
Using the wrong endpoint for the job\n\nIn `llama.cpp` server, `/v1/chat/completions` expects chat-style messages and supported chat templates. If you want to inspect the exact rendered prompt, `/apply-template` is the correct debugging tool. (GitHub)\n\n### 5. Assuming tools or RAG documents work just because the model is “chat tuned”\n\nHugging Face’s docs say many templates ignore `documents`. The same logic applies to tool formatting: support depends on template and runtime, not just on model branding. (Hugging Face)\n\n### 6. Requantizing a quantized model\n\nllama.cpp warns this can severely reduce quality. (GitHub)\n\n### 7. Editing an Ollama model before inspecting it\n\nOllama’s own docs give you `ollama show --modelfile` and `/api/show`. Use them first. There are also issue reports showing imported GGUFs may not always carry the same template and parameter behavior as stock packaged models. (Ollama Documentation)\n\n* * *\n\n## 10. How to match preconfigured models on each backend\n\nThis is the practical recipe.\n\n### To match a known-good Ollama model\n\n  1. inspect it with `ollama show --modelfile` or `/api/show`\n  2. copy the template, stop strings, and parameter values\n  3. use the same quantization level\n  4. use the same context\n  5. only then compare outputs. (Ollama Documentation)\n\n\n\n### To match a known-good llama.cpp server setup\n\n  1. inspect `/props` for `chat_template` and default generation settings\n  2. inspect `/apply-template` for the exact rendered prompt\n  3. keep `--ctx-size`, sampling, and any `chat_template_kwargs` fixed\n  4. compare on the same endpoint path. (GitHub)\n\n\n\n### To match a known-good llama-cpp-python setup\n\n  1. enable `verbose=True`\n  2. confirm the selected `chat_format`\n  3. pin the same `n_ctx`, seed, and sampling values\n  4. avoid changing `chat_format` unless metadata/model card says you must. 
(GitHub)\n\n\n\n### The equality checklist\n\nFor a comparison between two setups to be fair, all of these should match:\n\n  * same base model\n  * same quant\n  * same template\n  * same stop strings\n  * same system prompt\n  * same context\n  * same sampling\n  * same seed\n  * same tool/document formatting behavior. (GitHub)\n\n\n\n* * *\n\n## 11. How to benchmark whether the configuration is correct\n\nYou need **at least three different tests**, not one.\n\n### A. Prompt-render correctness test\n\nBefore measuring quality, inspect the rendered prompt.\n\n  * Ollama: `ollama show --modelfile` or `/api/show`\n  * llama.cpp server: `/apply-template`\n  * llama-cpp-python: `verbose=True` and inspect selected chat format. (Ollama Documentation)\n\n\n\nIf the prompt wrapper is wrong, every downstream benchmark is misleading.\n\n### B. Quality test\n\nFor the same model family and tokenizer, use `llama-perplexity`. The tool docs say it measures how well the model predicts the next token and that lower is better, but also warn that perplexity is not directly comparable across different tokenizers and that finetunes can score worse on perplexity while still producing better human-rated outputs. (GitHub)\n\nSo use perplexity for:\n\n  * same model family\n  * same tokenizer\n  * same backend or close backend\n  * comparing quants or config changes. (GitHub)\n\n\n\n### C. Speed test\n\nUse `llama-bench` for llama.cpp. Its README describes it explicitly as a performance testing tool and includes examples for generation speed, prompt processing, thread counts, and GPU offload comparisons. For Ollama, the API returns timing metrics such as `total_duration`, `load_duration`, `prompt_eval_count`, `prompt_eval_duration`, `eval_count`, and `eval_duration`. (GitHub)\n\n### D. Human A/B test\n\nIf you care about perceived quality, do at least a small blind comparison on real prompts. 
llama.cpp community work on blind quant testing used a Bradley–Terry ranking approach to compare quantized variants by human votes. That is not an official benchmark standard, but it is a good reminder that human preference testing often catches differences that raw throughput numbers do not. (GitHub)\n\n* * *\n\n## 12. A concrete backend-by-backend baseline\n\n### Ollama baseline\n\nUse a minimal Modelfile first.\n\n\n    FROM /path/to/model.gguf\n    PARAMETER num_ctx 8192\n    PARAMETER temperature 0.8\n    PARAMETER top_k 40\n    PARAMETER top_p 0.9\n    PARAMETER repeat_penalty 1.1\n\n\nThen inspect with:\n\n\n    ollama show --modelfile mymodel\n\n\nThat is the most stable starting point because it follows Ollama’s documented structure and keeps the prompt layer simple. (Ollama Documentation)\n\n### llama.cpp server baseline\n\nStart with model metadata and one clean slot.\n\n\n    llama-server -m model.gguf -c 0 --alias mymodel\n\n\nThen inspect prompt rendering with `/apply-template` before changing templates. If you need tool calling or advanced model-specific Jinja behavior, use the server’s documented chat-template options and tool-calling path. (GitHub)\n\n### llama-cpp-python baseline\n\nStart from metadata-driven chat formatting.\n\n\n    from llama_cpp import Llama\n\n    llm = Llama(\n        model_path=\"model.gguf\",\n        n_ctx=8192,\n        verbose=True,\n    )\n\n    resp = llm.create_chat_completion(\n        messages=[{\"role\": \"user\", \"content\": \"Hello\"}],\n        temperature=0.8,\n        top_k=40,\n        top_p=0.95,\n        repeat_penalty=1.0,\n    )\n\n\nThis follows the documented chat-format precedence and keeps the first run close to library defaults. 
(GitHub)\n\n* * *\n\n## Final recommendation\n\nIf you want one rule to remember, use this:\n\n**The correct GGUF configuration workflow is: inspect template → run with explicit context and sampling → verify prompt rendering first → benchmark quality and speed second → only then customize.** (Hugging Face)\n\nAnd if you want one practical rule for each backend:\n\n  * **Ollama**: start with a minimal Modelfile and inspect with `ollama show --modelfile` before overriding `TEMPLATE`\n  * **llama.cpp**: trust metadata first, use `/apply-template`, and pin `--ctx-size` and sampling explicitly\n  * **llama-cpp-python**: let metadata select the chat format first, and only override if you have evidence. (Ollama Documentation)\n\n",
  "title": "What Is the Right Way to Configure GGUF Models? (Templates, Parameters, Model Creation)"
}