Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreidqppte3vtlu777nvhmd76hoipuzwdwpgyraqwplh6uvufffxy7by",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mf33mxq3zff2"
  },
  "path": "/t/kv-caching-problem-with-gemma-3/173571#post_2",
  "publishedAt": "2026-02-17T16:42:05.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "GitHub",
    "Hugging Face",
    "GitHub",
    "GitHub",
    "GitHub",
    "unsloth.ai",
    "GitHub",
    "Hugging Face",
    "Hugging Face",
    "Hugging Face",
    "GitHub",
    "GitHub",
    "unsloth.ai",
    "GitHub",
    "Hugging Face"
  ],
  "textContent": "Gemma 3 family models have some quirk…\n\n* * *\n\n## What is causing your `ValueError`\n\nThat error is thrown by `transformers` when **both** of these are true at generation time:\n\n  1. `generation_config.cache_implementation` is **not`None`** (Transformers will try to initialize/manage a cache itself), **and**\n  2. you pass `past_key_values` as a **Cache object** (e.g., `StaticCache`), meaning _you_ are managing the cache.\n\n\n\nTransformers explicitly rejects that combination and raises exactly your message. (GitHub)\n\n### Why this happens “by default” with Gemma 3\n\nMany Gemma 3 model repos ship a `generation_config.json` with:\n\n\n    \"cache_implementation\": \"hybrid\"\n\n\nSo even if you never set `cache_implementation` in code, `model.generation_config.cache_implementation` starts as `\"hybrid\"`. (Hugging Face)\n\nThat’s why `generate(past_key_values=StaticCache(...))` immediately errors: you’re passing a user cache while the model’s generation config says “use hybrid cache”. (GitHub)\n\n* * *\n\n## Why “I tried `cache_implementation=None`” often still fails with Unsloth\n\nUnsloth commonly patches `model.generate` with a fast wrapper (`unsloth_base_fast_generate`) that **overwrites caching settings**. In the wrapper, Unsloth computes a cache type (`\"static\"` / `\"hybrid\"` / `None`) and then sets:\n\n  * `kwargs[\"generation_config\"].cache_implementation = cache_implementation` (if a generation_config is passed), **or**\n  * `kwargs[\"cache_implementation\"] = cache_implementation` (otherwise),\n\n\n\nthen calls `self._old_generate(...)`. (GitHub)\n\nSo even if you set `cache_implementation=None`, the wrapper can set it back to `\"static\"`/`\"hybrid\"`, and Transformers will again see both `cache_implementation` and `past_key_values` → same `ValueError`. 
(GitHub)\n\n* * *\n\n## Solutions (pick one)\n\n### Solution 1 (recommended for prefix-caching): disable Unsloth fast generation\n\nSet before importing `unsloth`:\n\n\n    import os\n    os.environ[\"UNSLOTH_DISABLE_FAST_GENERATION\"] = \"1\"\n\n\nUnsloth documents this flag. (unsloth.ai)\n\nThen you can safely do:\n\n  * `generation_config.cache_implementation = None`\n  * pass `past_key_values=StaticCache(...)`\n\n\n\n### Solution 2: bypass the wrapper and call the original generate\n\nIf Unsloth already patched `generate`, it calls `self._old_generate(...)` internally. You can call that directly so Unsloth doesn’t re-inject `cache_implementation`. (GitHub)\n\nPattern:\n\n\n    gen = model_gemma._old_generate if hasattr(model_gemma, \"_old_generate\") else model_gemma.generate\n    gen(**inputs, generation_config=gen_cfg, past_key_values=cache, ...)\n\n\n### Solution 3: stop using a user cache (use only `cache_implementation`)\n\nIf you remove `past_key_values=StaticCache(...)` and rely purely on `cache_implementation`, Transformers will manage caching internally. This avoids the error, but it does **not** give you system-prompt _prefix reuse across independent requests_ in the same way as a persisted prefilled cache.\n\n* * *\n\n## Additional issue in your specific code: the cached prefix does not match your chat prompt\n\nYou prefill the cache with:\n\n\n    inputs_sys = tokenizer_gemma(PROMPT_SYSTEM, ...)\n\n\nBut later you generate from a prompt produced by `apply_chat_template(...)` containing a system message + user message. Those tokens are **not the same prefix** as the raw `PROMPT_SYSTEM` string.\n\nFor prefix caching to be correct/beneficial, the cached prefix must match the **exact token sequence** at the start of the later prompt. Hugging Face’s prefix caching example works because the “INITIAL_PROMPT” is exactly the prefix of the later prompts. 
(Hugging Face)\n\n**Fix:** prefill using the same chat template with only the system message, then generate from system+user.\n\n* * *\n\n## Another important Gemma 3 constraint: “unprocessed input_ids” with caches\n\nGemma 3’s model docs state that when `past_key_values` are used, the user is expected to pass only the **unprocessed** `input_ids` (tokens not already covered by the cache). (Hugging Face)\n\nTransformers `generate()` usually slices inputs appropriately, but you can still run into edge cases (especially if your cached prefix and the prompt don’t align, or if the “suffix length” becomes zero).\n\n* * *\n\n## Minimal “fix shape” for your original snippet\n\nKey changes:\n\n  * make `generation_config.cache_implementation=None` for the call\n  * prefill using chat template system-only\n  * disable Unsloth fast generation _or_ call `_old_generate`\n\n\n\n\n    import os\n    os.environ[\"UNSLOTH_DISABLE_FAST_GENERATION\"] = \"1\"  # must be set before importing unsloth\n\n    import copy, torch\n    from transformers.cache_utils import StaticCache\n    from unsloth import FastLanguageModel\n\n    model, tok = FastLanguageModel.from_pretrained(\"gemma_3_lora\", max_seq_length=2048, load_in_4bit=False)\n    model.eval()\n\n    # 1) Build SYSTEM-only prefix using chat template (token-exact prefix)\n    sys_inputs = tok.apply_chat_template(\n        [{\"role\":\"system\",\"content\": PROMPT_SYSTEM}],\n        add_generation_prompt=False,\n        return_tensors=\"pt\",\n        return_dict=True,\n    ).to(model.device)\n\n    # 2) Prefill cache with that prefix\n    cache = StaticCache(config=model.config, max_cache_len=2048, device=model.device, dtype=model.dtype)\n    with torch.no_grad():\n        cache = model(**sys_inputs, past_key_values=cache, use_cache=True).past_key_values\n\n    # 3) Build FULL prompt (system + user)\n    full_inputs = tok.apply_chat_template(\n        [{\"role\":\"system\",\"content\": 
PROMPT_SYSTEM},\n         {\"role\":\"user\",\"content\": PROMPT_INPUT.format(context=\"This is some fake data\")}],\n        add_generation_prompt=True,\n        return_tensors=\"pt\",\n        return_dict=True,\n    ).to(model.device)\n\n    # 4) Ensure cache_implementation is None when passing a Cache object :contentReference[oaicite:10]{index=10}\n    gen_cfg = copy.deepcopy(model.generation_config)\n    gen_cfg.cache_implementation = None\n\n    # 5) Generate (unpatched generate, because fast generation disabled)\n    out = model.generate(**full_inputs, generation_config=gen_cfg, past_key_values=copy.deepcopy(cache), max_new_tokens=128)\n\n\n* * *\n\n## Summary of “causes → fixes”\n\n  * **Cause A:** Gemma 3 repos often set `\"cache_implementation\": \"hybrid\"` in `generation_config.json` (Hugging Face)\n**Fix:** pass a `GenerationConfig` where `cache_implementation=None` when using `past_key_values` (GitHub)\n\n  * **Cause B:** Unsloth fast generation wrapper can forcibly set `cache_implementation` (static/hybrid) before calling the original generate (GitHub)\n**Fix:** `UNSLOTH_DISABLE_FAST_GENERATION=1` (unsloth.ai) or call `_old_generate` (GitHub)\n\n  * **Cause C (logic/prefix mismatch):** you cached raw `PROMPT_SYSTEM` tokens but generated from chat-template tokens\n**Fix:** prefill cache using system-only `apply_chat_template(...)` so the cached tokens exactly match the start of later prompts (Hugging Face)\n\n\n",
  "title": "KV Caching problem with gemma 3"
}