Ollama model registry provides wrong chat template
or a general HF issue?
This is probably the correct answer.
The most likely underlying cause is that the Ollama templates and the built-in GGUF templates (primarily for Llama.cpp) aren’t necessarily identical.
A similar issue occurred with Qwen 3.5, though the cause was different. This sort of thing happens occasionally when a new model family introduces a lot of changes. If it is reported on the HF or Ollama GitHub, it will probably get fixed eventually…
Likely cause: HF’s Ollama registry is serving a lossy template, not the quantizer breaking the GGUF
I think you found a real integration-layer bug, or at least a dangerous fallback in the Hugging Face → Ollama compatibility path.
The short answer is:
This does not look primarily like a @bartowski / quantizer configuration error, assuming your GGUF inspection is correct. If the GGUF still contains the full tokenizer.chat_template, then the quantized file likely preserved the important metadata. The suspicious transformation happens later, when Hugging Face exposes the model through the Ollama-compatible registry endpoint.
The failing boundary appears to be:
GGUF metadata:
tokenizer.chat_template = full / complex / Gemma 4-specific
↓ Hugging Face Ollama compatibility layer
hf.co/v2/<repo>/manifests/<tag>:
application/vnd.ollama.image.template = short generic Go template
↓ Ollama pull/run via hf.co
ollama show --modelfile hf.co/<repo>:<tag>:
TEMPLATE = same short generic Go template
That is why the official Ollama model can behave differently: the official Ollama Gemma 4 path uses Ollama’s own Gemma 4 renderer, while the hf.co/v2 path appears to serve a static Ollama TEMPLATE layer.
Relevant references:
- HF Hub: Use Ollama with any GGUF model
- Ollama Modelfile Reference
- HF @huggingface/ollama-utils README
- Ollama Gemma 4 renderer source
- llama.cpp chat-template wiki
- Google Gemma 4 function-calling docs
- vLLM Gemma 4 usage guide
Why this matters
A chat template is not just formatting. It is the serialization contract between structured chat messages and the raw token sequence the model actually sees.
A chat model does not literally receive this abstract structure:
[
{"role": "system", "content": "You are helpful."},
{"role": "user", "content": "Hello"}
]
It receives a rendered prompt string/token stream, for example with special role markers, turn delimiters, BOS/EOS behavior, tool declarations, image placeholders, thinking markers, and stop tokens. If that rendering is wrong, the model can load successfully but behave strangely.
So when Ollama locally sees only this simplified template:
{{ if .System }}<|turn>system
{{ .System }}<turn|>
{{ end }}{{ if .Prompt }}<|turn>user
{{ .Prompt }}<turn|>
{{ end }}<|turn>model
{{ .Response }}<turn|>
or the HF registry blob serves:
{{ if .System }}<bos><|turn>system
{{ .System }}<turn|>
{{ end }}{{ if .Prompt }}<|turn>user
{{ .Prompt }}<turn|>
{{ end }}<|turn>model
{{ .Response }}<turn|>
that is not merely a shorter display version. It may be a materially different prompt format.
For simple one-turn text chat, this can appear to work. For Gemma 4, it is very likely incomplete.
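To make the gap concrete, here is a minimal Python sketch that mirrors the short template quoted above. It is only a stand-in: Ollama renders Go templates, and how Ollama applies the template across turns or trims it at generation time is not modelled here. The `request` dictionary and its `get_weather` tool are hypothetical example data.

```python
def render_simplified(system: str, prompt: str, response: str = "") -> str:
    """Literal Python stand-in for the short template: three slots, nothing else."""
    parts = []
    if system:
        parts.append(f"<bos><|turn>system\n{system}<turn|>\n")
    if prompt:
        parts.append(f"<|turn>user\n{prompt}<turn|>\n")
    parts.append(f"<|turn>model\n{response}<turn|>")
    return "".join(parts)


# A structured request that also declares a tool (hypothetical example data).
request = {
    "system": "You are helpful.",
    "prompt": "What is the weather in Paris?",
    "tools": [{"name": "get_weather", "parameters": {"city": "string"}}],
}

rendered = render_simplified(request["system"], request["prompt"])
print(rendered)
# The tool declaration never reaches the prompt: the template has no slot for it.
print("get_weather" in rendered)
```

The point of the sketch is that nothing in a three-slot template can carry tool declarations, thinking flags, or image placeholders, no matter what the caller sends.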
Why I would not primarily blame the quantizer
A quantizer-side problem would be likely if one of these were true:
| Observation | Likely interpretation |
|---|---|
| tokenizer.chat_template is missing from the GGUF | bad conversion / incomplete GGUF metadata |
| tokenizer.chat_template inside the GGUF is already simplified | converter or quantizer likely damaged metadata |
| the HF GGUF viewer also shows the simplified template | GGUF metadata likely wrong |
| the repo contains an explicit bad template file | repo packaging issue |
| the problem appears only in one quantizer’s repo | repo-specific issue more likely |
But your evidence is different:
| Layer | Your observation | What it suggests |
|---|---|---|
| GGUF metadata | full tokenizer.chat_template still exists | quantizer likely preserved the template |
| HF model details / GGUF view | shows the full complex template | HF can read the correct metadata |
| hf.co/v2/.../manifests/IQ2_XXS | contains a short application/vnd.ollama.image.template layer | registry compatibility layer is suspicious |
| template blob | 159-byte generic Go template | conversion/selection/fallback likely lost semantics |
| local Ollama model | sees the simplified template | Ollama is consuming the served template |
| official Ollama Gemma 4 | does not show this same problem | official path uses different rendering/configuration |
That points away from “bad quantization” and toward “bad Ollama image template generated by the registry bridge.”
A quantizer can still add a workaround, but that is different from being the root cause.
The key distinction: GGUF metadata vs Ollama image template
The GGUF can contain one template while the Ollama registry image exposes another.
Your GGUF contains:
tokenizer.chat_template
The Ollama-compatible registry manifest contains:
application/vnd.ollama.image.template
Those are not the same artifact.
HF’s Ollama docs say that, by default, the template for ollama run hf.co/<namespace>/<repo> is selected from commonly used templates based on the GGUF’s built-in tokenizer.chat_template. The same docs say that if a repo provides a custom template file, it must be a Go template, not a Jinja template: HF Ollama docs.
So the intended pipeline is roughly:
GGUF tokenizer.chat_template
→ HF template selection / conversion
→ Ollama Go TEMPLATE
→ application/vnd.ollama.image.template
→ Ollama local Modelfile
Your evidence suggests that the pipeline is losing information in the middle.
Why Jinja → Go template conversion is fragile
HF / Transformers chat templates are generally Jinja-style templates. Ollama TEMPLATE uses Go template syntax.
Ollama’s own docs say TEMPLATE is the full prompt template passed to the model and that templates use Go template syntax: Ollama Modelfile Reference.
That means a bridge has to do one of these:
- convert the Jinja template into a Go template;
- map the Jinja template to a known built-in Go template;
- use a custom model-family-specific handler;
- fall back to a simpler template;
- or fail.
For simple templates, this may be fine. For Gemma 4, Qwen thinking models, multimodal models, and native tool-calling models, this is fragile.
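The difference between "fall back to a simpler template" and "fail" can be sketched as follows. This is not HF's implementation: `KNOWN_GO_TEMPLATES`, `ADVANCED_MARKERS`, `select_go_template`, and the `strict` flag are all hypothetical names, and the ChatML entry is only an illustrative mapping.

```python
class UnsupportedTemplateError(ValueError):
    """Raised when no faithful Go template is known for a Jinja source."""


# Hypothetical lookup table: marker found in the Jinja source -> hand-written Go template.
KNOWN_GO_TEMPLATES = {
    "<|im_start|>": (
        "{{ if .System }}<|im_start|>system\n{{ .System }}<|im_end|>\n{{ end }}"
        "{{ if .Prompt }}<|im_start|>user\n{{ .Prompt }}<|im_end|>\n{{ end }}"
        "<|im_start|>assistant\n{{ .Response }}<|im_end|>\n"
    ),
}

# Hypothetical heuristic: identifiers whose presence means the Jinja template
# has branches that a three-slot Go template cannot express.
ADVANCED_MARKERS = ("tools", "tool_calls", "enable_thinking")


def select_go_template(jinja_source: str, strict: bool = True) -> str:
    """Map a Jinja chat template to a known Go template, or raise."""
    advanced = [m for m in ADVANCED_MARKERS if m in jinja_source]
    for marker, go_template in KNOWN_GO_TEMPLATES.items():
        if marker in jinja_source:
            if advanced and strict:
                raise UnsupportedTemplateError(
                    f"template uses {advanced}; generic mapping would drop them"
                )
            return go_template  # lossy whenever `advanced` is non-empty
    raise UnsupportedTemplateError("no known mapping for this template")
```

With `strict=False` the function silently returns the generic template, which is the kind of quiet fallback this post suspects; with `strict=True` the same input raises instead.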
The public HF package @huggingface/ollama-utils is especially relevant because it says it handles conversion of GGUF/Jinja chat templates into the Go format used by Ollama. It also explicitly lists “the converted template is wrong” as a valid reason to add a custom handler/test.
That is almost exactly your case.
Why Gemma 4 is a bad fit for a tiny generic template
Gemma 4 is not just a plain text chat model with user and assistant turns.
The official Ollama Gemma 4 renderer handles Gemma 4-specific prompt rendering in code: Ollama gemma4.go. The renderer deals with things such as:
- BOS emission;
- system/developer messages;
- <|turn>/<turn|> markers;
- <|"|> string delimiters;
- thinking mode;
- tools;
- tool declarations;
- tool calls;
- tool responses;
- image tags;
- adjacent assistant/model turns;
- stripping thinking blocks;
- generation-prompt behavior.
Google’s Gemma 4 function-calling docs also show that tools are passed through apply_chat_template() via the tools argument: Google Gemma 4 function calling.
vLLM’s Gemma 4 guide similarly treats Gemma 4 as needing specialized support for reasoning, tool calling, and dynamic multimodal behavior: vLLM Gemma 4 usage guide.
So the official / full behavior surface is closer to:
messages
+ system/developer roles
+ thinking flags
+ tools
+ tool calls
+ tool responses
+ image/audio placeholders
+ special delimiters
+ parser expectations
→ Gemma 4-specific rendered prompt
The short HF-served Ollama template is closer to:
optional system string
+ one user prompt
+ model response marker
Those are not equivalent.
Why the official Ollama model can work while hf.co/... behaves oddly
The official Ollama model and a community GGUF pulled through hf.co are different runtime paths.
Official Ollama Gemma 4 can use:
Ollama model-library metadata
+ Gemma4Renderer
+ known params
+ known stop tokens
+ parser behavior
+ multimodal/projector handling
The HF registry path gives Ollama an Ollama-compatible image manifest with layers like:
application/vnd.ollama.image.model
application/vnd.ollama.image.template
application/vnd.ollama.image.projector
application/vnd.ollama.image.params
If that application/vnd.ollama.image.template layer is a simplified fallback, Ollama may simply use the bad template it was given.
So this difference is expected:
ollama run gemma4:<official-tag>
→ official Ollama packaging / renderer path
ollama run hf.co/bartowski/google_gemma-4-26B-A4B-it-GGUF:IQ2_XXS
→ HF-generated Ollama registry image
→ static application/vnd.ollama.image.template
That does not mean the GGUF is bad. It means the wrapper/rendering path is different.
Most likely root cause
My ranking:
1. Most likely: HF Ollama template conversion/selection fallback
HF reads the GGUF template, tries to convert or classify it, and emits a generic Gemma-ish Go template instead of a faithful Gemma 4 template.
Possible mechanisms:
- Gemma 4 is not handled by a custom mapping.
- The Jinja template is too complex or non-linear.
- The converter only supports a subset of the template.
- The matcher recognizes the <|turn> markers and chooses a generic template.
- Tool/thinking/multimodal branches are dropped.
- A fallback template is emitted instead of failing loudly.
This is the most likely explanation.
2. Also possible: a static Go TEMPLATE cannot fully express official Gemma 4 rendering
Ollama’s official Gemma 4 support is renderer code, not just a template string.
Some behavior may be awkward or impossible to express faithfully in a static Go template, especially if the renderer needs to restructure messages, merge tool results, strip thinking, or parse tool calls.
So there are two different levels of fix:
| Level | Possible fix |
|---|---|
| Narrow HF fix | Add a better Gemma 4 mapping/custom handler in the HF Ollama compatibility layer |
| Better Ollama fix | Let imported Gemma 4 GGUFs use the same Gemma 4 renderer path as official models |
| Broader ecosystem fix | Support Jinja chat templates directly in Ollama |
There is already an Ollama feature request for Jinja chat-template support: ollama/ollama#10222.
3. Possible but less likely: repo-level template override
HF supports a repo-level template file for Ollama, but it must be a Go template: HF Ollama docs.
If the repo contains such a file and it is bad, that could be a repo packaging issue. But from your evidence, the important template is being served as an HF registry layer while the GGUF metadata remains correct.
4. Least likely from your evidence: quantizer damaged the GGUF
This becomes likely only if the GGUF metadata itself is missing, truncated, or simplified.
You said the opposite: the correct tokenizer.chat_template is still there.
How to prove the failing boundary cleanly
Package three artifacts:
- the GGUF metadata;
- the HF application/vnd.ollama.image.template blob;
- the local ollama show --modelfile output.
1. Fetch the HF Ollama manifest
REPO="bartowski/google_gemma-4-26B-A4B-it-GGUF"
TAG="IQ2_XXS"
curl -sSf -L \
-H "Accept: application/vnd.docker.distribution.manifest.v2+json" \
"https://hf.co/v2/${REPO}/manifests/${TAG}" \
| jq . > hf-v2-manifest.json
Highlight this layer:
{
"mediaType": "application/vnd.ollama.image.template",
"size": 159
}
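If you prefer to inspect the saved manifest from Python rather than by eye, a small helper along these lines would do. It assumes only the manifest shape shown above (a `layers` array with `mediaType`, `size`, and `digest`); the sample manifest and its digests are placeholders, not real values.

```python
import json
from typing import Optional

TEMPLATE_MEDIA_TYPE = "application/vnd.ollama.image.template"


def find_layer(manifest: dict, media_type: str) -> Optional[dict]:
    """Return the first layer with the given mediaType, or None."""
    for layer in manifest.get("layers", []):
        if layer.get("mediaType") == media_type:
            return layer
    return None


# Hypothetical sample shaped like the manifest above; digests are placeholders.
sample = json.loads("""
{
  "layers": [
    {"mediaType": "application/vnd.ollama.image.model", "size": 1, "digest": "sha256:aaaa"},
    {"mediaType": "application/vnd.ollama.image.template", "size": 159, "digest": "sha256:bbbb"}
  ]
}
""")

layer = find_layer(sample, TEMPLATE_MEDIA_TYPE)
print(layer["size"], layer["digest"])
```

In practice you would replace `sample` with `json.load(open("hf-v2-manifest.json"))`.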
2. Fetch the template blob
TEMPLATE_DIGEST="$(
jq -r '.layers[] | select(.mediaType=="application/vnd.ollama.image.template") | .digest' \
hf-v2-manifest.json
)"
curl -sSf -L \
-H "Accept: application/vnd.docker.distribution.manifest.v2+json" \
"https://hf.co/v2/${REPO}/blobs/${TEMPLATE_DIGEST}" \
> hf-v2-template.txt
cat hf-v2-template.txt
Expected problematic result:
{{ if .System }}<bos><|turn>system
{{ .System }}<turn|>
{{ end }}{{ if .Prompt }}<|turn>user
{{ .Prompt }}<turn|>
{{ end }}<|turn>model
{{ .Response }}<turn|>
3. Show what Ollama imports
ollama pull "hf.co/${REPO}:${TAG}"
ollama show --modelfile "hf.co/${REPO}:${TAG}" \
> ollama-show-modelfile.txt
Ollama documents ollama show --modelfile as the way to inspect the model’s Modelfile: Ollama Modelfile Reference.
If ollama-show-modelfile.txt contains the same simplified template, that proves Ollama is using the registry-served template.
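The comparison can be automated with a short script. This sketch assumes the Modelfile prints the template as a triple-quoted TEMPLATE directive; if your output uses a different quoting form, the regular expression needs adjusting. The sample strings are hypothetical and shortened.

```python
import re
from typing import Optional


def extract_template(modelfile_text: str) -> Optional[str]:
    """Return the body of a triple-quoted TEMPLATE directive, or None."""
    match = re.search(
        r'^TEMPLATE\s+"""(.*?)"""', modelfile_text, re.DOTALL | re.MULTILINE
    )
    return match.group(1) if match else None


def same_template(a: str, b: str) -> bool:
    """Compare two templates, ignoring only leading/trailing whitespace."""
    return a.strip() == b.strip()


# Hypothetical, shortened inputs used only to exercise the helpers.
sample_modelfile = (
    "FROM /path/to/model.gguf\n"
    'TEMPLATE """{{ .Prompt }}<turn|>\n"""\n'
    'PARAMETER stop "<turn|>"\n'
)
sample_blob = "{{ .Prompt }}<turn|>\n"

print(same_template(extract_template(sample_modelfile), sample_blob))
```

For the real check, read `ollama-show-modelfile.txt` and `hf-v2-template.txt` in place of the two sample strings.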
4. Inspect GGUF metadata
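For example, with llama.cpp tooling (the script location has changed between llama.cpp versions; in recent checkouts it may be gguf-py/gguf/scripts/gguf_dump.py instead, and pip install gguf provides an equivalent gguf-dump command):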
python ./llama.cpp/gguf-py/scripts/gguf-dump.py \
--no-tensors \
./google_gemma-4-26B-A4B-it-IQ2_XXS.gguf \
> gguf-metadata.txt
grep -n "tokenizer.chat_template" gguf-metadata.txt
llama.cpp’s template wiki says llama_chat_apply_template() uses the template stored in model metadata key tokenizer.chat_template by default and includes a Jinja parser called minja: llama.cpp template wiki.
5. Summarize the proof
| Source | Result |
|---|---|
| GGUF metadata | full Gemma 4 tokenizer.chat_template |
| HF application/vnd.ollama.image.template blob | short generic Go template |
| ollama show --modelfile | same short generic Go template |
That table makes the issue very clear.
Behavior tests to run
Do not test only hello. A generic template can pass trivial chat while failing important branches.
Test cases that matter:
| Test | What it checks |
|---|---|
| one-turn prompt | baseline behavior |
| system prompt | system role rendering |
| multi-turn chat | history loop |
| assistant-history turn | assistant/model role rendering |
| thinking on/off | Gemma 4 thinking control |
| tool declaration | tool schema serialization |
| tool call | tool-call formatting |
| tool response | tool-response formatting |
| image input | multimodal placeholder handling |
| long answer / stop leak | stop tokens and turn terminators |
The simplified template may only pass the first one or two.
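A few of these cases can be written down as request bodies for Ollama's /api/chat endpoint. The sketch below only builds the payloads and sends nothing; the `get_weather` tool and the message texts are hypothetical, and the thinking and image cases are left out because their request fields depend on the Ollama version.

```python
import json

MODEL = "hf.co/bartowski/google_gemma-4-26B-A4B-it-GGUF:IQ2_XXS"

# Hypothetical tool schema in the OpenAI-style shape that /api/chat accepts.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def chat_payload(messages, tools=None):
    """Build a non-streaming /api/chat request body."""
    payload = {"model": MODEL, "messages": messages, "stream": False}
    if tools:
        payload["tools"] = tools
    return payload


CASES = {
    "one-turn prompt": chat_payload([{"role": "user", "content": "Hello"}]),
    "system prompt": chat_payload([
        {"role": "system", "content": "Answer in French."},
        {"role": "user", "content": "Hello"},
    ]),
    "multi-turn chat": chat_payload([
        {"role": "user", "content": "My name is Ada."},
        {"role": "assistant", "content": "Nice to meet you, Ada."},
        {"role": "user", "content": "What is my name?"},
    ]),
    "tool declaration": chat_payload(
        [{"role": "user", "content": "What is the weather in Paris?"}],
        tools=[WEATHER_TOOL],
    ),
}

for name, payload in CASES.items():
    print(name, len(json.dumps(payload)))
```

Each payload can then be POSTed to a local Ollama server (by default http://localhost:11434/api/chat), once for the hf.co model and once for the official model, and the two responses compared case by case.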
Workarounds
Workaround 1: Use the official Ollama Gemma 4 model
For normal local usage, this is the safest workaround:
ollama pull gemma4:26b
ollama run gemma4:26b
Reason: the official Ollama path can use the dedicated Gemma 4 renderer: Ollama Gemma 4 renderer.
Downside: you may not get the exact community quant you wanted.
Workaround 2: Use llama.cpp / vLLM / another direct GGUF path
For testing the exact GGUF quant, use a runtime path that can apply the embedded template more directly.
Examples of relevant references:
- HF GGUF with llama.cpp
- llama.cpp template wiki
- vLLM Gemma 4 guide
This helps answer:
Is the quantized GGUF itself bad, or is the Ollama wrapper bad?
If the same GGUF behaves better through llama.cpp/vLLM with a correct template, that supports the wrapper/template diagnosis.
Workaround 3: Import the GGUF manually into Ollama
You can bypass the hf.co/v2 registry path:
FROM /absolute/path/to/google_gemma-4-26B-A4B-it-IQ2_XXS.gguf
Then:
ollama create gemma4-local -f Modelfile
ollama show --modelfile gemma4-local
But this is not automatically a full fix. Manual import bypasses the HF registry template blob, but you still need a correct Ollama template or renderer behavior.
Workaround 4: Add a repo-level template file, if a faithful Go template exists
HF allows a repo-level template file for Ollama, but it must be a Go template, not Jinja: HF Ollama docs.
This may help for some models. For Gemma 4, be careful: a partial Go template can fix basic chat while still breaking tools, thinking, images, and parser behavior.
Workaround 5: Render the prompt yourself
For serious application testing, the most controlled workaround is:
- use the HF tokenizer/processor;
- apply the correct chat template yourself;
- send the rendered prompt through a completion-style path;
- manage stop tokens and parsing yourself.
This avoids trusting a runtime’s chat serializer.
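A rough sketch of that flow, assuming the Hugging Face `transformers` tokenizer for the original model is available and that the rendered string is sent to Ollama's /api/generate with `raw` set so no template is applied. The function names are hypothetical, the model identifiers are parameters rather than known-good values, and using `<turn|>` as a stop string is an assumption based on the turn terminator shown earlier.

```python
from typing import Optional, Sequence


def render_with_hf(model_id: str, messages: list, tools: Optional[list] = None) -> str:
    """Render messages with the tokenizer's own chat template (needs transformers)."""
    from transformers import AutoTokenizer  # lazy import: only needed for this step

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    return tokenizer.apply_chat_template(
        messages, tools=tools, tokenize=False, add_generation_prompt=True
    )


def raw_generate_payload(
    ollama_model: str, rendered_prompt: str, stop: Optional[Sequence[str]] = None
) -> dict:
    """Build an /api/generate body that bypasses Ollama's TEMPLATE."""
    payload = {
        "model": ollama_model,
        "prompt": rendered_prompt,
        "raw": True,      # send the prompt as-is, without templating
        "stream": False,
    }
    if stop:
        payload["options"] = {"stop": list(stop)}
    return payload


# Example with a hand-written prompt string standing in for render_with_hf() output.
example = raw_generate_payload(
    "gemma4-local", "<|turn>user\nHello<turn|>\n<|turn>model\n", stop=["<turn|>"]
)
print(example)
```

Whether the runtime adds its own BOS token on top of the rendered string is worth checking before trusting results from this path.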
Where to report
Primary target: Hugging Face
Best first target:
https://github.com/huggingface/huggingface.js/issues
Relevant package:
packages/ollama-utils
Why this target:
- HF documents that its Ollama path selects a template based on GGUF tokenizer.chat_template: HF Ollama docs.
- @huggingface/ollama-utils says it converts GGUF/Jinja chat templates to the Go format used by Ollama: README.
- The same README explicitly lists “the converted template is wrong” as a valid reason for adding a custom handler/test.
Suggested issue title:
hf.co/v2 Ollama registry serves simplified template for Gemma 4 GGUF despite full tokenizer.chat_template in GGUF metadata
Alternative title:
application/vnd.ollama.image.template loses Gemma 4 chat-template semantics
Precise framing:
The GGUF metadata appears correct. The problem appears between GGUF tokenizer.chat_template and the generated Ollama image template layer.
Secondary target: Ollama
Report to Ollama if you can show that imported Gemma 4 GGUFs should use the built-in Gemma 4 renderer but do not, or that the static TEMPLATE mechanism cannot represent official Gemma 4 rendering.
Relevant links:
- Ollama Gemma 4 renderer
- Ollama Jinja template support request
- Ollama Modelfile docs
Optional target: quantizer / repo maintainer
Only report to the quantizer if:
- the GGUF metadata is actually wrong;
- the repo has an explicit bad template file;
- or you want a repo-level workaround.
Suggested wording:
The GGUF metadata appears to contain the full tokenizer.chat_template, but the HF Ollama registry path is serving a simplified application/vnd.ollama.image.template layer. If a faithful Gemma 4 Go template is available, adding a repo-level template file might work around the issue for Ollama users.
That avoids wrongly blaming the quantizer.
Suggested issue body
## Summary
For `bartowski/google_gemma-4-26B-A4B-it-GGUF:IQ2_XXS`, the GGUF metadata appears to contain the full Gemma 4 `tokenizer.chat_template`, but the HF Ollama registry endpoint serves a much shorter `application/vnd.ollama.image.template` layer.
When the model is pulled through Ollama using `hf.co/...`, the local Ollama Modelfile uses this simplified template.
This appears to lose important Gemma 4 chat-template semantics.
## Affected model
- Repo: `bartowski/google_gemma-4-26B-A4B-it-GGUF`
- Tag: `IQ2_XXS`
- Possibly affects other Gemma 4 GGUF repos/tags.
## Steps to reproduce
```sh
REPO="bartowski/google_gemma-4-26B-A4B-it-GGUF"
TAG="IQ2_XXS"
curl -sSf -L \
  -H "Accept: application/vnd.docker.distribution.manifest.v2+json" \
  "https://hf.co/v2/${REPO}/manifests/${TAG}" \
  | jq .
```
Find the layer:
application/vnd.ollama.image.template
Fetch it:
TEMPLATE_DIGEST="$(
curl -sSf -L \
-H "Accept: application/vnd.docker.distribution.manifest.v2+json" \
"https://hf.co/v2/${REPO}/manifests/${TAG}" \
| jq -r '.layers[] | select(.mediaType=="application/vnd.ollama.image.template") | .digest'
)"
curl -sSf -L \
-H "Accept: application/vnd.docker.distribution.manifest.v2+json" \
"https://hf.co/v2/${REPO}/blobs/${TEMPLATE_DIGEST}"
## Actual behavior
The template blob is a short generic Go template:
{{ if .System }}<bos><|turn>system
{{ .System }}<turn|>
{{ end }}{{ if .Prompt }}<|turn>user
{{ .Prompt }}<turn|>
{{ end }}<|turn>model
{{ .Response }}<turn|>
ollama show --modelfile hf.co/${REPO}:${TAG} shows the same simplified template locally.
## Expected behavior
The generated Ollama template should either:
- preserve the semantics of the GGUF tokenizer.chat_template;
- use a Gemma 4-specific custom conversion/handler;
- or fail/omit the generated template instead of serving a misleading simplified fallback.
## Why this matters
Gemma 4 prompt rendering is not just simple system/user/model turns.
Ollama’s official Gemma 4 renderer handles BOS, system/developer messages, tools, thinking, tool calls, tool responses, image tags, and Gemma 4-specific delimiters:
github.com/ollama/ollama: model/renderers/gemma4.go (main)
package renderers
import (
"fmt"
"sort"
"strings"
"github.com/ollama/ollama/api"
)
// Gemma4Renderer renders prompts using Gemma 4's chat format with
// <|turn>/<turn|> markers, <|"|> string delimiters, and <|tool>/
// <|tool_call>/<|tool_response> tags for function calling.
type Gemma4Renderer struct {
useImgTags bool
emptyBlockOnNothink bool
}
const (
g4Q = `<|"|>` // Gemma 4 string delimiter
(Excerpt truncated; see the linked file for the full source.)
A simplified static template can make the model appear to run while silently degrading chat, tool, thinking, multimodal, or parser behavior.
## Evidence to attach
- hf-v2-manifest.json
- hf-v2-template.txt
- ollama-show-modelfile.txt
- GGUF metadata snippet showing the full tokenizer.chat_template
What maintainers could do
HF-side fixes:
- Add a Gemma 4-specific mapping/handler in the Ollama compatibility layer.
- Add tests comparing converted Go-template output against expected Gemma 4 rendering.
- Avoid silently serving a simplified fallback when conversion is incomplete.
- Document cases where a model’s Jinja template cannot be faithfully represented as an Ollama Go TEMPLATE.
Ollama-side fixes:
- Support Jinja chat templates directly.
- Let imported Gemma 4 GGUFs use equivalent Gemma 4 renderer behavior when metadata identifies the architecture.
- Expose clearer diagnostics when a model falls back to a generic template.
Quantizer/repo-side mitigations:
- Preserve tokenizer.chat_template in GGUF metadata.
- Document known Ollama limitations.
- Optionally add a repo-level Go template file if a faithful one exists.
Final answer
So, answering your question directly:
Is this a quantizer configuration error?
Probably not, assuming the GGUF metadata really contains the full template.
Is it a general HF issue?
Most likely yes: more specifically, an issue in the HF Ollama registry / template conversion / template selection layer.
Is Ollama also involved?
Yes, structurally. Ollama uses Go templates for TEMPLATE, while HF/GGUF templates are generally Jinja-style. Official Gemma 4 in Ollama has a custom renderer, which imported HF GGUFs may not get.
Why does the official Ollama model behave better?
Because it can use Ollama’s dedicated Gemma 4 renderer. The HF hf.co/v2 path appears to provide a generated static template layer instead.
What should you do now?
- Use official Ollama Gemma 4 for normal local usage.
- Use llama.cpp/vLLM/direct GGUF paths for testing the exact quant.
- File a precise issue against HF’s Ollama compatibility layer.
- Include the manifest, template blob, ollama show --modelfile, and GGUF metadata.
- Do not primarily blame the quantizer unless the GGUF metadata itself is wrong.
The best one-sentence report is:
The GGUF metadata contains the full Gemma 4 tokenizer.chat_template, but the HF hf.co/v2 Ollama registry emits a simplified application/vnd.ollama.image.template layer, causing ollama run hf.co/... to use a template that does not preserve Gemma 4 chat-template semantics.