Gemma 4 bug fixes and Research Request
Hmm… after reading your report and the recent model-repo/template updates, this looks pretty badly broken across a surprisingly wide part of the ecosystem…
Short diagnosis
After reading the original report, the Hugging Face model-repo discussions/updates, and related runtime/client issues, I would frame this slightly differently from “Gemma 4 weights are broken”.
My current read:
Gemma 4 probably has a real ecosystem-wide agentic failure mode, but the first-order failure does not look like a single model-weights bug. It looks more like a multi-layer protocol-boundary problem around Gemma 4’s native, non-JSON tool-call format and the OpenAI-compatible agent stacks trying to wrap it.
So I mostly agree with the “ecosystem-wide” part of the report. I would just be cautious about attributing the whole thing to the weights themselves.
Likely layers:
- Gemma 4 native tool-call protocol
- HF chat templates / tokenizer configs / model-repo packaging
- GGUF or other converted artifacts with stale embedded templates
- backend runtime parsers: vLLM, llama.cpp, Ollama, MLX, SGLang
- streaming delta parsers
- OpenAI-compatible proxy layers: LiteLLM / ADK-style adapters
- coding-agent clients: OpenCode / AI SDK integrations
- agent-loop recovery: retry, duplicate suppression, malformed-turn handling
The core mismatch is that Gemma 4 tool calls are not ordinary OpenAI-style JSON tool calls. Gemma 4 uses a native format like:
<|tool_call>call:func_name{key:<|"|>value<|"|>,num:42}<tool_call|>
That has different string delimiters, unquoted keys, and different multi-call behavior from JSON. Any layer that assumes “OpenAI-compatible endpoint = ordinary JSON tool-call transcript” can corrupt the conversation while converting, streaming, storing, or re-rendering tool calls.
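To make the difference from JSON concrete, here is a minimal sketch of a serializer that turns a Python dict into the native shape shown above. Only the `<|"|>` string delimiters, the unquoted keys, and the `<|tool_call>` / `<tool_call|>` wrappers come from the example; the rendering of booleans and null as `true` / `false` / `null` and the nested object and list handling are my assumptions, and `render_value` and `render_tool_call` are hypothetical helper names, not part of any runtime.

```python
GEMMA_QUOTE = '<|"|>'  # string delimiter taken from the native example above


def render_value(value):
    """Render one argument value in the Gemma-native style (sketch)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"  # assumed literal spelling
    if value is None:
        return "null"  # assumed literal spelling
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, str):
        return f"{GEMMA_QUOTE}{value}{GEMMA_QUOTE}"
    if isinstance(value, dict):  # assumed nesting rule: same key:value syntax
        inner = ",".join(f"{k}:{render_value(v)}" for k, v in value.items())
        return "{" + inner + "}"
    if isinstance(value, (list, tuple)):  # assumed list syntax
        return "[" + ",".join(render_value(v) for v in value) + "]"
    raise TypeError(f"unsupported argument type: {type(value).__name__}")


def render_tool_call(name, arguments):
    """Render a full native tool call from a function name and a dict."""
    body = ",".join(f"{k}:{render_value(v)}" for k, v in arguments.items())
    return f"<|tool_call>call:{name}{{{body}}}<tool_call|>"


print(render_tool_call("func_name", {"key": "value", "num": 42}))
```

The point of the sketch is that the input has to be a structured object. A layer that only has a JSON string in hand cannot produce this output by string concatenation.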
Useful starting links:
- Original Hugging Face forum report
- Google — Function calling with Gemma 4
- Google — Gemma 4 model card
- vLLM — Gemma4ToolParser docs
- vLLM — Gemma 4 usage guide
What kind of failure is this?
I would call this a multi-layer protocol-boundary failure, or ecosystem drift around Gemma 4 native tool calling.
| Layer | Failure mode |
|---|---|
| Official model repo / chat template | OpenAI-shaped messages are rendered incorrectly into Gemma-native dialogue. |
| GGUF / artifact distribution | Old quantized artifacts embed stale chat templates. |
| Runtime parser | Gemma native tool syntax is parsed as JSON-ish text, or not parsed at all. |
| Streaming parser | Partial deltas corrupt arguments, numbers, booleans, or boundaries. |
| Proxy / OpenAI adapter | role:"tool" / tool_calls.arguments are translated incorrectly. |
| Client / coding agent | Backend returns tool calls, but client-side stream/event parser misses them. |
| Agent loop | Malformed calls are fed back into history, causing self-reinforcing loops. |
| LoRA / fine-tune | May reduce bad generations, but does not fix broken protocol conversion. |
This explains why reports differ. vLLM streaming users may see corrupted arguments. Ollama-through-LiteLLM users may see infinite tool loops. Old-GGUF users may be running stale templates. OpenCode users may have a backend that returns tool calls, while the client fails to consume them. These all look like “Gemma 4 tools are broken”, but they are not necessarily the same bug.
The concrete bug pattern I would check first
The strongest pattern I would look for is OpenAI-style JSON arguments being re-rendered into Gemma-native syntax incorrectly.
OpenAI-compatible APIs often represent arguments as a JSON string:
{
  "tool_calls": [
    {
      "type": "function",
      "function": {
        "name": "write_file",
        "arguments": "{\"path\":\"foo.txt\",\"content\":\"hello\"}"
      }
    }
  ]
}
Gemma-native rendering instead needs a structured object that can be serialized into Gemma’s DSL:
<|tool_call>call:write_file{path:<|"|>foo.txt<|"|>,content:<|"|>hello<|"|>}<tool_call|>
If the renderer inserts the OpenAI JSON string into Gemma braces, it may produce hybrid syntax:
call:write_file{{"path":"foo.txt","content":"hello"}}
That is neither valid OpenAI JSON tool calling nor proper Gemma native tool calling.
Then the loop becomes:
correct Gemma native tool call
-> converted to OpenAI-compatible tool_calls
-> arguments stored as JSON string
-> later re-rendered into Gemma prompt
-> JSON string inserted into Gemma-native braces
-> malformed hybrid syntax appears in history
-> model imitates malformed history
-> parser fails
-> retry re-injects poisoned turn
-> loop
That is why I would treat transcript re-rendering and tool-result mapping as first-class suspects, not just sampling or LoRA.
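A minimal sketch of the guard I have in mind at the adapter boundary: decode OpenAI-style `function.arguments` into a dict before anything re-renders the turn into a Gemma prompt. The helper name `normalize_arguments` is hypothetical, and I am assuming the tool call is a plain dict shaped like the OpenAI example above.

```python
import json


def normalize_arguments(tool_call):
    """Return a copy of an OpenAI-style tool call whose arguments are a dict.

    OpenAI-compatible APIs usually carry arguments as a JSON string; a
    Gemma-native renderer needs the decoded object instead.
    """
    function = dict(tool_call["function"])  # shallow copy, input stays untouched
    args = function.get("arguments", {})
    if isinstance(args, str):
        # An empty string is treated as "no arguments" (my assumption).
        args = json.loads(args) if args.strip() else {}
    if not isinstance(args, dict):
        raise ValueError("tool-call arguments must decode to a JSON object")
    function["arguments"] = args
    return {**tool_call, "function": function}


call = {
    "type": "function",
    "function": {
        "name": "write_file",
        "arguments": "{\"path\":\"foo.txt\",\"content\":\"hello\"}",
    },
}
print(normalize_arguments(call)["function"]["arguments"])
```

Raising on undecodable arguments is deliberate: a turn that cannot be normalized is better dropped or repaired than re-rendered into history as hybrid syntax.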
The official HF model-repo template fixes point in the same direction. Recent Gemma 4 template discussions/fixes mention string-typed arguments, tool-response rendering, turn-tag balance, ordering, and thinking preservation:
- google/gemma-4-31B-it discussion #118
- google/gemma-4-E4B-it discussion #36
- google/gemma-4-12B-it discussion #12
- google/gemma-4-12B-it discussion #35
So the official model repos are part of the fix surface, not only third-party runtimes.
Why LoRA is probably not the durable first-line fix
Your LoRA attempt makes sense as an initial experiment: from the outside, the symptom can look like model-side format drift.
But after looking at the model-repo template changes and related runtime/client issues, I would treat LoRA as a behavioral mitigation rather than the most likely durable root fix.
| Approach | Can help with | Cannot fix |
|---|---|---|
| LoRA / fine-tune | Model behavior, format preference, tool-use tendency | Broken parser, streaming delta, transcript re-render, wrong role mapping |
| Template fix | Correct Gemma-native rendering | Client-side stream parser bugs |
| Runtime parser fix | Native extraction and conversion | Proxy rewriting tool results incorrectly |
| Proxy/adapter fix | OpenAI messages ↔ Gemma-native semantics | Runtime parser bugs |
| Agent-loop healing | Retry safety, duplicate suppression, malformed-call containment | Incorrect canonical protocol implementation |
So I would not say “LoRA is useless”. I would say: LoRA is not the durable root fix if the failure is protocol-boundary corruption.
Fix vs workaround
Durable fixes
| Layer | Who should fix it | Durable fix |
|---|---|---|
| Official model repo / HF | Google / HF maintainers | Canonical chat template, tokenizer config, response schema, tool-response rendering, examples. |
| Artifact / GGUF | Quant providers, Unsloth, LM Studio community, Bartowski-style distributors | Re-export/re-quant with fixed template metadata; provide known-good template overrides. |
| Backend runtime | vLLM, llama.cpp, Ollama, MLX, SGLang | Gemma-native parser/serializer, schema handling, streaming delta handling, reasoning/channel handling. |
| Proxy / adapter | LiteLLM, ADK, OpenAI-compatible bridges | Correct role:"tool" ↔ Gemma tool_responses; deserialize OpenAI arguments before Gemma rendering. |
| Client / coding agent | OpenCode, AI SDK integrations, LM Studio, Claude Code/OpenClaw adapters | Recognize streamed tool-call events, preserve IDs, support model-specific parser hooks. |
| Agent loop | App/framework authors | Retry cap, duplicate suppression, malformed-turn suppression, final-answer nudges, tool-call healing. |
| Third-party bridge repo | Community / researchers | Compatibility matrix, patched adapter, prompt-dump checker, regression suite, known-good combinations. |
Practical workaround decision tree
Setting stream:false works around streaming-parser bugs, but it will not fix stale GGUF templates or proxy role-mapping bugs, so I would work through the layers in order:
0. Update first:
   backend runtime
   model repo files
   tokenizer_config / chat_template
   GGUF / quantized artifact
   proxy / agent framework
1. If using GGUF:
   re-download a post-template-fix artifact, or use a runtime/UI that overrides stale embedded templates.
   Do not assume updating only the binary updates the embedded model template.
2. Test backend directly:
   bypass LiteLLM / ADK / OpenCode / Studio / OpenAI-compatible proxy.
3. If backend-direct works:
   suspect proxy / adapter / client layer.
   check role:"tool" vs tool_responses.
   check whether OpenAI function.arguments JSON strings are deserialized before Gemma rendering.
   check whether streamed tool-call events are recognized by the client.
4. If backend-direct fails:
   try stream:false for tool-call requests.
5. If stream:false fixes it:
   likely streaming delta / parser bug.
   keep tool calls non-streaming until that path is fixed.
6. If stream:false does not fix it:
   disable MTP / speculative decoding if enabled.
7. If single-turn works but second-turn fails:
   suspect tool_response mapping or transcript re-rendering.
   inspect the final prompt.
8. If first-turn fails:
   suspect native parser, chat template, schema complexity, or unsupported runtime.
   simplify schema and test a minimal tool.
9. If loops repeat:
   cap retries.
   suppress duplicate tool calls.
   never feed malformed assistant tool-call turns back into history.
   add a clean nudge or abort instead of replaying poisoned output.
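The loop safeguards in the last step can be sketched as a small guard object that the agent loop consults before acting on each model turn. This is an illustration, not any framework's API: `ToolLoopGuard`, its action strings, and the default retry cap of 2 are all hypothetical choices of mine.

```python
import json


class ToolLoopGuard:
    """Decide what an agent loop should do with each attempted tool call."""

    def __init__(self, max_retries=2):
        self.max_retries = max_retries
        self.retries = 0
        self.seen = set()

    def check(self, name, arguments, parsed_ok):
        """Return 'execute', 'skip_duplicate', 'nudge', or 'abort'."""
        if not parsed_ok:
            # Malformed turn: never replay it into history. Nudge a few
            # times, then give up instead of looping.
            self.retries += 1
            return "abort" if self.retries > self.max_retries else "nudge"
        # Canonical key so argument order does not defeat duplicate detection.
        key = (name, json.dumps(arguments, sort_keys=True))
        if key in self.seen:
            return "skip_duplicate"
        self.seen.add(key)
        return "execute"


guard = ToolLoopGuard(max_retries=1)
print(guard.check("read_file", {"path": "a.txt"}, parsed_ok=True))
print(guard.check("read_file", {"path": "a.txt"}, parsed_ok=True))
print(guard.check("read_file", None, parsed_ok=False))
print(guard.check("read_file", None, parsed_ok=False))
```

On a `nudge` the caller would append a clean corrective message rather than the malformed assistant turn. Whether identical calls should always be suppressed depends on the tool (polling tools may legitimately repeat), so treat the duplicate rule as a default, not a law.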
Prompt-dump checks
Dump the exact prompt reaching the model if possible.
Suspicious:
call:NAME{{"key":"value"}}
{{"key":
role: "tool"
<channel|>
<|tool_call> ... raw JSON ... <tool_call|>
Suspicious template input:
{
  "function": {
    "arguments": "{\"x\":1}"
  }
}
Preferred template input:
{
  "function": {
    "arguments": {
      "x": 1
    }
  }
}
Then the Gemma serializer can emit native syntax:
call:some_tool{x:1}
or:
call:some_tool{x:<|"|>value<|"|>}
depending on type.
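If you can capture prompt dumps, the suspicious markers listed above are easy to scan for automatically. The sketch below is a rough heuristic under my own assumptions: the regexes are approximations of those markers, and `SUSPICIOUS_PATTERNS` and `scan_prompt` are hypothetical names. A hit means "look at this prompt by hand", not "this is definitely broken".

```python
import re

# Heuristic patterns derived from the suspicious markers listed above.
SUSPICIOUS_PATTERNS = {
    "json_in_gemma_braces": re.compile(r'call:\w+\{\s*\{\s*"'),
    "double_brace_json": re.compile(r'\{\{\s*"'),
    "openai_tool_role": re.compile(r'role:\s*"tool"'),
    "channel_marker": re.compile(r"<channel\|>"),
    "raw_json_tool_call": re.compile(r'<\|tool_call>\s*\{\s*"'),
}


def scan_prompt(prompt):
    """Return the sorted names of all suspicious patterns found in a prompt."""
    return sorted(
        name for name, pattern in SUSPICIOUS_PATTERNS.items()
        if pattern.search(prompt)
    )


print(scan_prompt('call:write_file{{"path":"foo.txt","content":"hello"}}'))
print(scan_prompt("<|tool_call>call:some_tool{x:1}<tool_call|>"))
```

Running this over every prompt in a failing session and noting the first turn that trips a pattern is usually enough to tell which layer introduced the corruption.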
Minimal regression tests
Before calling something “Gemma 4 tool support”, I would run:
1. Single tool call.
2. Tool call -> tool result -> final answer.
3. Tool call -> tool result -> second tool call.
4. Multiple tool calls in one assistant turn.
5. stream:false vs stream:true.
6. MTP/speculative decoding on/off.
7. Long string argument containing comma, colon, braces, quotes.
8. Boolean / null / integer / decimal arguments.
9. Code/HTML argument.
10. Malformed tool call retry.
11. Proxy vs backend-direct.
12. Old GGUF embedded template vs patched template override.
13. Assistant content + tool_calls in the same turn.
14. Consecutive tool results.
15. Repeated identical tool call suppression.
Good torture string:
Fix: deploy, retry: twice, reason: "missing { brace } in HTML"
A regex-ish parser will often break this. A real string-aware parser should not.
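To show what "string-aware" means here, a minimal sketch of a parser for flat native calls that skips over `<|"|>`-delimited strings as a unit, so commas, colons, and braces inside a string never count as structure. It assumes flat arguments only (no nested objects or lists) and assumes `true` / `false` / `null` as literal spellings; `parse_flat_call` and `parse_scalar` are hypothetical names, not a runtime's parser.

```python
QUOTE = '<|"|>'  # Gemma-native string delimiter


def parse_scalar(token):
    """Parse an unquoted value: assumed literals, then int, then float."""
    literals = {"true": True, "false": False, "null": None}
    if token in literals:
        return literals[token]
    try:
        return int(token)
    except ValueError:
        return float(token)  # raises ValueError for anything unrecognized


def parse_flat_call(text):
    """Parse 'call:NAME{key:value,...}' with flat arguments into (name, dict)."""
    if not text.startswith("call:") or not text.endswith("}"):
        raise ValueError("not a native tool call")
    brace = text.index("{")
    name = text[len("call:"):brace]
    body = text[brace + 1:-1]
    args, pos = {}, 0
    while pos < len(body):
        colon = body.index(":", pos)
        key = body[pos:colon]
        pos = colon + 1
        if body.startswith(QUOTE, pos):
            # String value: jump straight to the closing delimiter.
            end = body.index(QUOTE, pos + len(QUOTE))
            value = body[pos + len(QUOTE):end]
            pos = end + len(QUOTE)
        else:
            comma = body.find(",", pos)
            end = len(body) if comma == -1 else comma
            value = parse_scalar(body[pos:end])
            pos = end
        args[key] = value
        if pos < len(body):
            if body[pos] != ",":
                raise ValueError(f"expected ',' at offset {pos}")
            pos += 1
    return name, args


torture = 'Fix: deploy, retry: twice, reason: "missing { brace } in HTML"'
call = "call:note{msg:" + QUOTE + torture + QUOTE + ",count:2}"
print(parse_flat_call(call))
```

For comparison, splitting the same body on commas yields four fragments instead of two arguments, which is the kind of breakage the torture string is meant to expose.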
Bottom line
I would not frame the durable fix as:
“Make Gemma 4 learn tool calling via LoRA.”
I would frame it as:
“Every layer that claims Gemma 4 support needs to handle Gemma 4’s native tool protocol losslessly, and agent loops need safeguards for malformed-output cases.”
Durable fix surface:
official template
+ artifact freshness
+ native runtime parser
+ streaming parser
+ OpenAI-compatible adapter
+ client event parser
+ transcript re-renderer
+ agent-loop healing
Immediate user workaround:
update everything
-> verify artifact/template freshness
-> test backend direct
-> disable streaming for tool calls if needed
-> disable MTP/speculative if needed
-> inspect prompt dumps
-> fix tool_response mapping
-> prevent poisoned retries
So yes: this looks pretty badly broken across the ecosystem, but not necessarily because the Gemma 4 weights are fundamentally unable to do tools. It looks more like the ecosystem is still settling around a non-JSON native tool protocol that does not fit cleanly into existing OpenAI-compatible agent assumptions.