Gemma 4 bug fixes and Research Request

Hugging Face Forums [Unofficial] June 20, 2026

Hmm… after reading your report and the recent model-repo/template updates, this looks pretty badly broken across a surprisingly wide part of the ecosystem…


Short diagnosis

After reading the original report, the Hugging Face model-repo discussions/updates, and related runtime/client issues, I would frame this slightly differently from “Gemma 4 weights are broken”.

My current read:

Gemma 4 probably has a real ecosystem-wide agentic failure mode, but the first-order failure does not look like a single model-weights bug. It looks more like a multi-layer protocol-boundary problem around Gemma 4’s native, non-JSON tool-call format and the OpenAI-compatible agent stacks trying to wrap it.

So I mostly agree with the “ecosystem-wide” part of the report. I would just be cautious about attributing the whole thing to the weights themselves.

Likely layers:

  1. Gemma 4 native tool-call protocol
  2. HF chat templates / tokenizer configs / model-repo packaging
  3. GGUF or other converted artifacts with stale embedded templates
  4. backend runtime parsers: vLLM, llama.cpp, Ollama, MLX, SGLang
  5. streaming delta parsers
  6. OpenAI-compatible proxy layers: LiteLLM / ADK-style adapters
  7. coding-agent clients: OpenCode / AI SDK integrations
  8. agent-loop recovery: retry, duplicate suppression, malformed-turn handling

The core mismatch is that Gemma 4 tool calls are not ordinary OpenAI-style JSON tool calls. Gemma 4 uses a native format like:

<|tool_call>call:func_name{key:<|"|>value<|"|>,num:42}<tool_call|>

That has different string delimiters, unquoted keys, and different multi-call behavior from JSON. Any layer that assumes “OpenAI-compatible endpoint = ordinary JSON tool-call transcript” can corrupt the conversation while converting, streaming, storing, or re-rendering tool calls.
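
To make the difference from JSON concrete, here is a minimal sketch of a serializer that turns a Python dict into the native syntax shown above. It is inferred only from the single example in this post, not from Google's specification. The helper names (`render_value`, `render_tool_call`) are hypothetical, and the lowercase `true` / `false` / `null` literals and the refusal to handle nested values are assumptions.

```python
# Sketch only: inferred from the one example above, not from an official spec.
STR_DELIM = '<|"|>'  # Gemma-native string delimiter, as shown in the example


def render_value(value):
    """Render one argument value in Gemma-style syntax (flat scalars only)."""
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"  # assumption: lowercase literals
    if value is None:
        return "null"  # assumption
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        return f"{STR_DELIM}{value}{STR_DELIM}"
    raise TypeError(f"unsupported argument type in this sketch: {type(value).__name__}")


def render_tool_call(name, arguments):
    """Render a full call; keys are emitted unquoted, unlike JSON."""
    body = ",".join(f"{key}:{render_value(val)}" for key, val in arguments.items())
    return f"<|tool_call>call:{name}{{{body}}}<tool_call|>"


if __name__ == "__main__":
    print(render_tool_call("func_name", {"key": "value", "num": 42}))
```

The point of the sketch is that the input has to be a structured object. Nothing here can do the right thing if it is handed a pre-encoded JSON string.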

Useful starting links:

  • Original Hugging Face forum report
  • Google — Function calling with Gemma 4
  • Google — Gemma 4 model card
  • vLLM — Gemma4ToolParser docs
  • vLLM — Gemma 4 usage guide

What kind of failure is this?

I would call this a multi-layer protocol-boundary failure, or ecosystem drift around Gemma 4 native tool calling.

| Layer | Failure mode |
| --- | --- |
| Official model repo / chat template | OpenAI-shaped messages are rendered incorrectly into Gemma-native dialogue. |
| GGUF / artifact distribution | Old quantized artifacts embed stale chat templates. |
| Runtime parser | Gemma native tool syntax is parsed as JSON-ish text, or not parsed at all. |
| Streaming parser | Partial deltas corrupt arguments, numbers, booleans, or boundaries. |
| Proxy / OpenAI adapter | role:"tool" / tool_calls.arguments are translated incorrectly. |
| Client / coding agent | Backend returns tool calls, but client-side stream/event parser misses them. |
| Agent loop | Malformed calls are fed back into history, causing self-reinforcing loops. |
| LoRA / fine-tune | May reduce bad generations, but does not fix broken protocol conversion. |

This explains why reports differ. vLLM streaming users may see corrupted arguments. Ollama-through-LiteLLM users may see infinite tool loops. Old-GGUF users may be running stale templates. OpenCode users may have a backend that returns tool calls, while the client fails to consume them. These all look like “Gemma 4 tools are broken”, but they are not necessarily the same bug.


The concrete bug pattern I would check first

The strongest pattern I would look for is OpenAI-style JSON arguments being re-rendered into Gemma-native syntax incorrectly.

OpenAI-compatible APIs often represent arguments as a JSON string:

{
  "tool_calls": [
    {
      "type": "function",
      "function": {
        "name": "write_file",
        "arguments": "{\"path\":\"foo.txt\",\"content\":\"hello\"}"
      }
    }
  ]
}

Gemma-native rendering instead needs the arguments as a structured object that can be serialized into Gemma’s DSL:

<|tool_call>call:write_file{path:<|"|>foo.txt<|"|>,content:<|"|>hello<|"|>}<tool_call|>

If the renderer inserts the OpenAI JSON string into Gemma braces, it may produce hybrid syntax:

call:write_file{{"path":"foo.txt","content":"hello"}}

That is neither valid OpenAI JSON tool calling nor proper Gemma native tool calling.

Then the loop becomes:

correct Gemma native tool call
  -> converted to OpenAI-compatible tool_calls
  -> arguments stored as JSON string
  -> later re-rendered into Gemma prompt
  -> JSON string inserted into Gemma-native braces
  -> malformed hybrid syntax appears in history
  -> model imitates malformed history
  -> parser fails
  -> retry re-injects poisoned turn
  -> loop

That is why I would treat transcript re-rendering and tool-result mapping as first-class suspects, not just sampling or LoRA.
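
One way to guard against this pattern is to deserialize OpenAI-style arguments before anything renders them into a Gemma prompt. The sketch below is an illustration under that assumption, not the code of any particular proxy or template. `normalize_arguments` is a hypothetical helper name.

```python
import json


def normalize_arguments(tool_call):
    """Return a copy of an OpenAI-style tool call whose arguments are a dict.

    OpenAI-compatible APIs usually carry function.arguments as a JSON string.
    A Gemma-native renderer needs the decoded object, so decode it here rather
    than letting the raw string be pasted between Gemma's braces.
    """
    function = dict(tool_call["function"])
    arguments = function.get("arguments", {})
    if isinstance(arguments, str):
        try:
            arguments = json.loads(arguments) if arguments.strip() else {}
        except json.JSONDecodeError as exc:
            # Do not pass a malformed string through to the template.
            raise ValueError(f"tool-call arguments are not valid JSON: {exc}") from exc
    if not isinstance(arguments, dict):
        raise ValueError("tool-call arguments must decode to a JSON object")
    function["arguments"] = arguments
    return {**tool_call, "function": function}


if __name__ == "__main__":
    call = {
        "type": "function",
        "function": {
            "name": "write_file",
            "arguments": "{\"path\":\"foo.txt\",\"content\":\"hello\"}",
        },
    }
    print(normalize_arguments(call)["function"]["arguments"])
```

Raising on undecodable arguments is a deliberate choice in this sketch: a turn that cannot be decoded is better dropped or repaired than re-rendered into history.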

The official HF model-repo template fixes point in the same direction. Recent Gemma 4 template discussions/fixes mention string-typed arguments, tool-response rendering, turn-tag balance, ordering, and thinking preservation:

  • google/gemma-4-31B-it discussion #118
  • google/gemma-4-E4B-it discussion #36
  • google/gemma-4-12B-it discussion #12
  • google/gemma-4-12B-it discussion #35

So the official model repos are part of the fix surface, not only third-party runtimes.


Why LoRA is probably not the durable first-line fix

Your LoRA attempt makes sense as an initial experiment: from the outside, the symptom can look like model-side format drift.

But after looking at the model-repo template changes and related runtime/client issues, I would treat LoRA as a behavioral mitigation rather than the most likely durable root fix.

| Approach | Can help with | Cannot fix |
| --- | --- | --- |
| LoRA / fine-tune | Model behavior, format preference, tool-use tendency | Broken parser, streaming delta, transcript re-render, wrong role mapping |
| Template fix | Correct Gemma-native rendering | Client-side stream parser bugs |
| Runtime parser fix | Native extraction and conversion | Proxy rewriting tool results incorrectly |
| Proxy/adapter fix | OpenAI messages ↔ Gemma-native semantics | Runtime parser bugs |
| Agent-loop healing | Retry safety, duplicate suppression, malformed-call containment | Incorrect canonical protocol implementation |

So I would not say “LoRA is useless”. I would say: LoRA is not the durable root fix if the failure is protocol-boundary corruption.


Fix vs workaround

Durable fixes

| Layer | Who should fix it | Durable fix |
| --- | --- | --- |
| Official model repo / HF | Google / HF maintainers | Canonical chat template, tokenizer config, response schema, tool-response rendering, examples. |
| Artifact / GGUF | Quant providers, Unsloth, LM Studio community, Bartowski-style distributors | Re-export/re-quant with fixed template metadata; provide known-good template overrides. |
| Backend runtime | vLLM, llama.cpp, Ollama, MLX, SGLang | Gemma-native parser/serializer, schema handling, streaming delta handling, reasoning/channel handling. |
| Proxy / adapter | LiteLLM, ADK, OpenAI-compatible bridges | Correct role:"tool" ↔ Gemma tool_responses; deserialize OpenAI arguments before Gemma rendering. |
| Client / coding agent | OpenCode, AI SDK integrations, LM Studio, Claude Code/OpenClaw adapters | Recognize streamed tool-call events, preserve IDs, support model-specific parser hooks. |
| Agent loop | App/framework authors | Retry cap, duplicate suppression, malformed-turn suppression, final-answer nudges, tool-call healing. |
| Third-party bridge repo | Community / researchers | Compatibility matrix, patched adapter, prompt-dump checker, regression suite, known-good combinations. |

Practical workaround decision tree

stream:false helps streaming parser bugs, but it will not fix stale GGUF templates or proxy role-mapping bugs.

0. Update first:
   backend runtime
   model repo files
   tokenizer_config / chat_template
   GGUF / quantized artifact
   proxy / agent framework

1. If using GGUF:
   re-download a post-template-fix artifact, or use a runtime/UI that overrides stale embedded templates.
   Do not assume updating only the binary updates the embedded model template.

2. Test backend directly:
   bypass LiteLLM / ADK / OpenCode / Studio / OpenAI-compatible proxy.

3. If backend-direct works:
   suspect proxy / adapter / client layer.
   check role:"tool" vs tool_responses.
   check whether OpenAI function.arguments JSON strings are deserialized before Gemma rendering.
   check whether streamed tool-call events are recognized by the client.

4. If backend-direct fails:
   try stream:false for tool-call requests.

5. If stream:false fixes it:
   likely streaming delta / parser bug.
   keep tool calls non-streaming until that path is fixed.

6. If stream:false does not fix it:
   disable MTP / speculative decoding if enabled.

7. If single-turn works but second-turn fails:
   suspect tool_response mapping or transcript re-rendering.
   inspect the final prompt.

8. If first-turn fails:
   suspect native parser, chat template, schema complexity, or unsupported runtime.
   simplify schema and test a minimal tool.

9. If loops repeat:
   cap retries.
   suppress duplicate tool calls.
   never feed malformed assistant tool-call turns back into history.
   add a clean nudge or abort instead of replaying poisoned output.
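
Step 9 can be sketched as a small guard object that an agent loop consults before executing a call or retrying. This is only an illustration: `LoopGuard`, its method names, the returned action strings, and the default cap of 2 are all hypothetical, not taken from any framework.

```python
import json
from dataclasses import dataclass, field


@dataclass
class LoopGuard:
    """Tracks retries and repeated calls for one agent run (illustrative only)."""

    max_retries: int = 2
    retries: int = 0
    seen_calls: set = field(default_factory=set)

    def on_malformed_turn(self):
        """Decide what to do after an unparseable assistant tool-call turn.

        Either way, the malformed turn itself should be dropped from history
        rather than replayed to the model.
        """
        self.retries += 1
        return "abort" if self.retries > self.max_retries else "nudge"

    def on_tool_call(self, name, arguments):
        """Return 'execute' for a new call, 'skip_duplicate' for a repeat."""
        # sort_keys makes the key independent of argument ordering.
        key = (name, json.dumps(arguments, sort_keys=True))
        if key in self.seen_calls:
            return "skip_duplicate"
        self.seen_calls.add(key)
        return "execute"


if __name__ == "__main__":
    guard = LoopGuard(max_retries=2)
    print(guard.on_tool_call("read_file", {"path": "a.txt"}))
    print(guard.on_tool_call("read_file", {"path": "a.txt"}))
    print([guard.on_malformed_turn() for _ in range(3)])
```

Whether an identical repeated call is always an error depends on the tool (polling tools, for example), so a real loop would likely scope duplicate suppression per turn or per tool.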

Prompt-dump checks

Dump the exact prompt reaching the model if possible.

Suspicious:

call:NAME{{"key":"value"}}
{{"key":
role: "tool"
<channel|>
<|tool_call> ... raw JSON ... <tool_call|>
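
The suspicious patterns above can be turned into a rough prompt-dump checker. This is a heuristic sketch, not a validator: the pattern names and `check_prompt_dump` are hypothetical, and the regexes only approximate the markers listed above, so both false positives and misses are possible.

```python
import re

# Heuristic patterns taken from the "Suspicious" list above.
TEXT_PATTERNS = {
    "hybrid_double_brace": re.compile(r"call:[\w.-]+\{\{"),
    "json_object_in_braces": re.compile(r'\{\{\s*"'),
    "openai_tool_role": re.compile(r'"?role"?\s*:\s*"tool"'),
    "stray_channel_tag": re.compile(r"<channel\|>"),
}
TOOL_CALL_BLOCK = re.compile(r"<\|tool_call>(.*?)<tool_call\|>", re.DOTALL)
# A JSON-style quoted key such as "path": inside a native tool-call block.
JSON_QUOTED_KEY = re.compile(r'"[\w.-]+"\s*:')


def check_prompt_dump(prompt):
    """Return the names of suspicious patterns found in a dumped prompt."""
    findings = [name for name, pattern in TEXT_PATTERNS.items() if pattern.search(prompt)]
    blocks = TOOL_CALL_BLOCK.finditer(prompt)
    if any(JSON_QUOTED_KEY.search(block.group(1)) for block in blocks):
        findings.append("raw_json_in_tool_call")
    return findings


if __name__ == "__main__":
    hybrid = '<|tool_call>call:write_file{{"path":"foo.txt"}}<tool_call|>'
    print(check_prompt_dump(hybrid))
```

An empty result does not prove the prompt is well-formed. It only means none of these specific markers appeared.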

Suspicious template input:

{
  "function": {
    "arguments": "{\"x\":1}"
  }
}

Preferred template input:

{
  "function": {
    "arguments": {
      "x": 1
    }
  }
}

Then the Gemma serializer can emit native syntax:

call:some_tool{x:1}

or:

call:some_tool{x:<|"|>value<|"|>}

depending on type.


Minimal regression tests

Before calling something “Gemma 4 tool support”, I would run:

1. Single tool call.
2. Tool call -> tool result -> final answer.
3. Tool call -> tool result -> second tool call.
4. Multiple tool calls in one assistant turn.
5. stream:false vs stream:true.
6. MTP/speculative decoding on/off.
7. Long string argument containing comma, colon, braces, quotes.
8. Boolean / null / integer / decimal arguments.
9. Code/html argument.
10. Malformed tool call retry.
11. Proxy vs backend-direct.
12. Old GGUF embedded template vs patched template override.
13. Assistant content + tool_calls in the same turn.
14. Consecutive tool results.
15. Repeated identical tool call suppression.

Good torture string:

Fix: deploy, retry: twice, reason: "missing { brace } in HTML"

A regex-ish parser will often break this. A real string-aware parser should not.
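
A minimal sketch of what "string-aware" means here: scan the argument body left to right and, once a string delimiter opens, read up to the closing delimiter without interpreting anything in between. The grammar is assumed from the examples in this post (flat `key:value` pairs, strings wrapped in `<|"|>`, bare numbers and `true` / `false` / `null`). The function names are hypothetical and this is not how any particular runtime parses Gemma 4 output.

```python
STR_DELIM = '<|"|>'


def parse_scalar(token):
    """Parse a bare (non-string) value; the literal spellings are assumptions."""
    literals = {"true": True, "false": False, "null": None}
    if token in literals:
        return literals[token]
    try:
        return int(token)
    except ValueError:
        return float(token)  # raises ValueError for anything unrecognized


def parse_args(body):
    """Parse the text between the outer braces of a Gemma-style call."""
    args, i, n = {}, 0, len(body)
    while i < n:
        colon = body.index(":", i)  # raises ValueError if a key has no colon
        key = body[i:colon].strip()
        i = colon + 1
        if body.startswith(STR_DELIM, i):
            # Inside a string: commas, colons, braces and quotes are plain text.
            start = i + len(STR_DELIM)
            end = body.index(STR_DELIM, start)
            value, i = body[start:end], end + len(STR_DELIM)
        else:
            comma = body.find(",", i)
            end = n if comma == -1 else comma
            value, i = parse_scalar(body[i:end].strip()), end
        args[key] = value
        if i < n:
            if body[i] != ",":
                raise ValueError(f"expected ',' at offset {i}")
            i += 1
    return args


def parse_tool_call(text):
    """Split one complete native call into (name, arguments)."""
    prefix, suffix = "<|tool_call>call:", "<tool_call|>"
    if not (text.startswith(prefix) and text.endswith(suffix)):
        raise ValueError("not a complete Gemma-style tool call")
    name, brace, rest = text[len(prefix):-len(suffix)].partition("{")
    if not brace or not rest.endswith("}"):
        raise ValueError("missing argument braces")
    return name, parse_args(rest[:-1])


if __name__ == "__main__":
    torture = 'Fix: deploy, retry: twice, reason: "missing { brace } in HTML"'
    call = f"<|tool_call>call:log{{msg:{STR_DELIM}{torture}{STR_DELIM},count:2}}<tool_call|>"
    print(parse_tool_call(call))
```

By contrast, splitting the same body on commas or colons cuts the torture string into several fragments, which is the failure mode a regex-ish parser tends to hit.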


Bottom line

I would not frame the durable fix as:

“Make Gemma 4 learn tool calling via LoRA.”

I would frame it as:

“Every layer that claims Gemma 4 support needs to handle Gemma 4’s native tool protocol losslessly, and agent loops need safeguards for malformed-output cases.”

Durable fix surface:

official template
  + artifact freshness
  + native runtime parser
  + streaming parser
  + OpenAI-compatible adapter
  + client event parser
  + transcript re-renderer
  + agent-loop healing

Immediate user workaround:

update everything
  -> verify artifact/template freshness
  -> test backend direct
  -> disable streaming for tool calls if needed
  -> disable MTP/speculative if needed
  -> inspect prompt dumps
  -> fix tool_response mapping
  -> prevent poisoned retries

So yes: this looks pretty badly broken across the ecosystem, but not necessarily because the Gemma 4 weights are fundamentally unable to do tools. It looks more like the ecosystem is still settling around a non-JSON native tool protocol that does not fit cleanly into existing OpenAI-compatible agent assumptions.
