Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreidfyv55anyihpx43qt6r55dhc7uw4wc4ggzsrcmarygqlmie2s7p4",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mooz6y6pzvk2"
  },
  "path": "/t/gemma-4-bug-fixes-and-research-request/176979#post_2",
  "publishedAt": "2026-06-20T03:28:28.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Original Hugging Face forum report",
    "Google — Function calling with Gemma 4",
    "Google — Gemma 4 model card",
    "vLLM — Gemma4ToolParser docs",
    "vLLM — Gemma 4 usage guide",
    "google/gemma-4-31B-it discussion #118",
    "google/gemma-4-E4B-it discussion #36",
    "google/gemma-4-12B-it discussion #12",
    "google/gemma-4-12B-it discussion #35",
    "(click for more details)"
  ],
  "textContent": "Hmm… after reading your report and the recent model-repo/template updates, this looks pretty badly broken across a surprisingly wide part of the ecosystem…\n\n* * *\n\n## Short diagnosis\n\nAfter reading the original report, the Hugging Face model-repo discussions/updates, and related runtime/client issues, I would frame this slightly differently from “Gemma 4 weights are broken”.\n\nMy current read:\n\n> Gemma 4 probably has a real ecosystem-wide agentic failure mode, but the first-order failure does not look like a single model-weights bug. It looks more like a multi-layer protocol-boundary problem around Gemma 4’s native, non-JSON tool-call format and the OpenAI-compatible agent stacks trying to wrap it.\n\nSo I mostly agree with the “ecosystem-wide” part of the report. I would just be cautious about attributing the whole thing to the weights themselves.\n\nLikely layers:\n\n  1. Gemma 4 native tool-call protocol\n  2. HF chat templates / tokenizer configs / model-repo packaging\n  3. GGUF or other converted artifacts with stale embedded templates\n  4. backend runtime parsers: vLLM, llama.cpp, Ollama, MLX, SGLang\n  5. streaming delta parsers\n  6. OpenAI-compatible proxy layers: LiteLLM / ADK-style adapters\n  7. coding-agent clients: OpenCode / AI SDK integrations\n  8. agent-loop recovery: retry, duplicate suppression, malformed-turn handling\n\n\n\nThe core mismatch is that Gemma 4 tool calls are not ordinary OpenAI-style JSON tool calls. Gemma 4 uses a native format like:\n\n\n    <|tool_call>call:func_name{key:<|\"|>value<|\"|>,num:42}<tool_call|>\n\n\nThat has different string delimiters, unquoted keys, and different multi-call behavior from JSON. 
Any layer that assumes “OpenAI-compatible endpoint = ordinary JSON tool-call transcript” can corrupt the conversation while converting, streaming, storing, or re-rendering tool calls.\n\nUseful starting links:\n\n  * Original Hugging Face forum report\n  * Google — Function calling with Gemma 4\n  * Google — Gemma 4 model card\n  * vLLM — Gemma4ToolParser docs\n  * vLLM — Gemma 4 usage guide\n\n\n\n* * *\n\n## What kind of failure is this?\n\nI would call this a **multi-layer protocol-boundary failure**, or **ecosystem drift around Gemma 4 native tool calling**.\n\nLayer | Failure mode\n---|---\nOfficial model repo / chat template | OpenAI-shaped messages are rendered incorrectly into Gemma-native dialogue.\nGGUF / artifact distribution | Old quantized artifacts embed stale chat templates.\nRuntime parser | Gemma native tool syntax is parsed as JSON-ish text, or not parsed at all.\nStreaming parser | Partial deltas corrupt arguments, numbers, booleans, or boundaries.\nProxy / OpenAI adapter | `role:\"tool\"` / `tool_calls.arguments` are translated incorrectly.\nClient / coding agent | Backend returns tool calls, but client-side stream/event parser misses them.\nAgent loop | Malformed calls are fed back into history, causing self-reinforcing loops.\nLoRA / fine-tune | May reduce bad generations, but does not fix broken protocol conversion.\n\nThis explains why reports differ. vLLM streaming users may see corrupted arguments. Ollama-through-LiteLLM users may see infinite tool loops. Old-GGUF users may be running stale templates. OpenCode users may have a backend that returns tool calls, while the client fails to consume them. 
These all look like “Gemma 4 tools are broken”, but they are not necessarily the same bug.\n\n* * *\n\n## The concrete bug pattern I would check first\n\nThe strongest pattern I would look for is **OpenAI-style JSON arguments being re-rendered into Gemma-native syntax incorrectly**.\n\nOpenAI-compatible APIs often represent arguments as a JSON string:\n\n\n    {\n      \"tool_calls\": [\n        {\n          \"type\": \"function\",\n          \"function\": {\n            \"name\": \"write_file\",\n            \"arguments\": \"{\\\"path\\\":\\\"foo.txt\\\",\\\"content\\\":\\\"hello\\\"}\"\n          }\n        }\n      ]\n    }\n\n\nGemma native rendering instead wants a structured object that can serialize to Gemma’s DSL:\n\n\n    <|tool_call>call:write_file{path:<|\"|>foo.txt<|\"|>,content:<|\"|>hello<|\"|>}<tool_call|>\n\n\nIf the renderer inserts the OpenAI JSON string into Gemma braces, it may produce hybrid syntax:\n\n\n    call:write_file{{\"path\":\"foo.txt\",\"content\":\"hello\"}}\n\n\nThat is neither valid OpenAI JSON tool calling nor proper Gemma native tool calling.\n\nThen the loop becomes:\n\n\n    correct Gemma native tool call\n      -> converted to OpenAI-compatible tool_calls\n      -> arguments stored as JSON string\n      -> later re-rendered into Gemma prompt\n      -> JSON string inserted into Gemma-native braces\n      -> malformed hybrid syntax appears in history\n      -> model imitates malformed history\n      -> parser fails\n      -> retry re-injects poisoned turn\n      -> loop\n\n\nThat is why I would treat **transcript re-rendering** and **tool-result mapping** as first-class suspects, not just sampling or LoRA.\n\nThe official HF model-repo template fixes point in the same direction. 
Recent Gemma 4 template discussions/fixes mention string-typed arguments, tool-response rendering, turn-tag balance, ordering, and thinking preservation:\n\n  * google/gemma-4-31B-it discussion #118\n  * google/gemma-4-E4B-it discussion #36\n  * google/gemma-4-12B-it discussion #12\n  * google/gemma-4-12B-it discussion #35\n\n\n\nSo the official model repos are part of the fix surface, not only third-party runtimes.\n\n* * *\n\n## Why LoRA is probably not the durable first-line fix\n\nYour LoRA attempt makes sense as an initial experiment: from the outside, the symptom can look like model-side format drift.\n\nBut after looking at the model-repo template changes and related runtime/client issues, I would treat LoRA as a behavioral mitigation rather than the most likely durable root fix.\n\nApproach | Can help with | Cannot fix\n---|---|---\nLoRA / fine-tune | Model behavior, format preference, tool-use tendency | Broken parser, streaming delta, transcript re-render, wrong role mapping\nTemplate fix | Correct Gemma-native rendering | Client-side stream parser bugs\nRuntime parser fix | Native extraction and conversion | Proxy rewriting tool results incorrectly\nProxy/adapter fix | OpenAI messages ↔ Gemma-native semantics | Runtime parser bugs\nAgent-loop healing | Retry safety, duplicate suppression, malformed-call containment | Incorrect canonical protocol implementation\n\nSo I would not say “LoRA is useless”. 
I would say: **LoRA is not the durable root fix if the failure is protocol-boundary corruption**.\n\n* * *\n\n## Fix vs workaround\n\n### Durable fixes\n\nLayer | Who should fix it | Durable fix\n---|---|---\nOfficial model repo / HF | Google / HF maintainers | Canonical chat template, tokenizer config, response schema, tool-response rendering, examples.\nArtifact / GGUF | Quant providers, Unsloth, LM Studio community, Bartowski-style distributors | Re-export/re-quant with fixed template metadata; provide known-good template overrides.\nBackend runtime | vLLM, llama.cpp, Ollama, MLX, SGLang | Gemma-native parser/serializer, schema handling, streaming delta handling, reasoning/channel handling.\nProxy / adapter | LiteLLM, ADK, OpenAI-compatible bridges | Correct `role:\"tool\"` ↔ Gemma `tool_responses`; deserialize OpenAI `arguments` before Gemma rendering.\nClient / coding agent | OpenCode, AI SDK integrations, LM Studio, Claude Code/OpenClaw adapters | Recognize streamed tool-call events, preserve IDs, support model-specific parser hooks.\nAgent loop | App/framework authors | Retry cap, duplicate suppression, malformed-turn suppression, final-answer nudges, tool-call healing.\nThird-party bridge repo | Community / researchers | Compatibility matrix, patched adapter, prompt-dump checker, regression suite, known-good combinations.\n\n### Practical workaround decision tree\n\n`stream:false` helps streaming parser bugs, but it will not fix stale GGUF templates or proxy role-mapping bugs.\n\n\n    0. Update first:\n       backend runtime\n       model repo files\n       tokenizer_config / chat_template\n       GGUF / quantized artifact\n       proxy / agent framework\n\n    1. If using GGUF:\n       re-download a post-template-fix artifact, or use a runtime/UI that overrides stale embedded templates.\n       Do not assume updating only the binary updates the embedded model template.\n\n    2. 
Test backend directly:\n       bypass LiteLLM / ADK / OpenCode / Studio / OpenAI-compatible proxy.\n\n    3. If backend-direct works:\n       suspect proxy / adapter / client layer.\n       check role:\"tool\" vs tool_responses.\n       check whether OpenAI function.arguments JSON strings are deserialized before Gemma rendering.\n       check whether streamed tool-call events are recognized by the client.\n\n    4. If backend-direct fails:\n       try stream:false for tool-call requests.\n\n    5. If stream:false fixes it:\n       likely streaming delta / parser bug.\n       keep tool calls non-streaming until that path is fixed.\n\n    6. If stream:false does not fix it:\n       disable MTP / speculative decoding if enabled.\n\n    7. If single-turn works but second-turn fails:\n       suspect tool_response mapping or transcript re-rendering.\n       inspect the final prompt.\n\n    8. If first-turn fails:\n       suspect native parser, chat template, schema complexity, or unsupported runtime.\n       simplify schema and test a minimal tool.\n\n    9. If loops repeat:\n       cap retries.\n       suppress duplicate tool calls.\n       never feed malformed assistant tool-call turns back into history.\n       add a clean nudge or abort instead of replaying poisoned output.\n\n\n* * *\n\n## Prompt-dump checks\n\nDump the exact prompt reaching the model if possible.\n\nSuspicious:\n\n\n    call:NAME{{\"key\":\"value\"}}\n    {{\"key\":\n    role: \"tool\"\n    <channel|>\n    <|tool_call> ... raw JSON ... 
<tool_call|>\n\n\nSuspicious template input:\n\n\n    {\n      \"function\": {\n        \"arguments\": \"{\\\"x\\\":1}\"\n      }\n    }\n\n\nPreferred template input:\n\n\n    {\n      \"function\": {\n        \"arguments\": {\n          \"x\": 1\n        }\n      }\n    }\n\n\nThen the Gemma serializer can emit native syntax:\n\n\n    call:some_tool{x:1}\n\n\nor:\n\n\n    call:some_tool{x:<|\"|>value<|\"|>}\n\n\ndepending on type.\n\n* * *\n\n## Minimal regression tests\n\nBefore calling something “Gemma 4 tool support”, I would run:\n\n\n    1. Single tool call.\n    2. Tool call -> tool result -> final answer.\n    3. Tool call -> tool result -> second tool call.\n    4. Multiple tool calls in one assistant turn.\n    5. stream:false vs stream:true.\n    6. MTP/speculative decoding on/off.\n    7. Long string argument containing comma, colon, braces, quotes.\n    8. Boolean / null / integer / decimal arguments.\n    9. Code/html argument.\n    10. Malformed tool call retry.\n    11. Proxy vs backend-direct.\n    12. Old GGUF embedded template vs patched template override.\n    13. Assistant content + tool_calls in the same turn.\n    14. Consecutive tool results.\n    15. Repeated identical tool call suppression.\n\n\nGood torture string:\n\n\n    Fix: deploy, retry: twice, reason: \"missing { brace } in HTML\"\n\n\nA regex-ish parser will often break this. 
A real string-aware parser should not.\n\n* * *\n\n## Bottom line\n\nI would not frame the durable fix as:\n\n> “Make Gemma 4 learn tool calling via LoRA.”\n\nI would frame it as:\n\n> “Every layer that claims Gemma 4 support needs to handle Gemma 4’s native tool protocol losslessly, and agent loops need safeguards for malformed-output cases.”\n\nDurable fix surface:\n\n\n    official template\n      + artifact freshness\n      + native runtime parser\n      + streaming parser\n      + OpenAI-compatible adapter\n      + client event parser\n      + transcript re-renderer\n      + agent-loop healing\n\n\nImmediate user workaround:\n\n\n    update everything\n      -> verify artifact/template freshness\n      -> test backend direct\n      -> disable streaming for tool calls if needed\n      -> disable MTP/speculative if needed\n      -> inspect prompt dumps\n      -> fix tool_response mapping\n      -> prevent poisoned retries\n\n\nSo yes: this looks pretty badly broken across the ecosystem, but not necessarily because the Gemma 4 weights are fundamentally unable to do tools. It looks more like the ecosystem is still settling around a non-JSON native tool protocol that does not fit cleanly into existing OpenAI-compatible agent assumptions.",
  "title": "Gemma 4 bug fixes and Research Request"
}