{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreicpsa3t3f2a25l627ihev365unnunvlzblxpc6m2zkffnsye53f4m",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mf2uwnra4qp2"
  },
  "path": "/t/kv-caching-problem-with-gemma-3/173571#post_1",
  "publishedAt": "2026-02-17T13:38:16.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "section"
  ],
  "textContent": "Hi, I have fine-tuned a Gemma 3 270M model with Unsloth.\n\nI am now trying to implement a caching mechanism for my system prompt. I went through the HF docs, particularly this section.\n\nHowever, I am getting the following error when running:\n\n\n    from transformers.cache_utils import StaticCache\n    import torch\n    from pathlib import Path\n    import json\n    from unsloth import FastLanguageModel\n    import time\n    import copy\n\n    model_gemma, tokenizer_gemma = FastLanguageModel.from_pretrained(\n        model_name = \"gemma_3_lora\", # YOUR MODEL YOU USED FOR TRAINING\n        max_seq_length = 2048,\n        load_in_4bit = False,\n    )\n\n    PROMPT_SYSTEM = \"\"\"\n    ### Instruction:\n    Extract metadata from the document text below.\n    You must output a VALID JSON object. Do not output lists, markdown, or conversational text.\n\n    Required Keys:\n    - \"document_number\": The drawing ID\n    - \"document_title\": The main title\n    - \"document_revision\": Revision code (e.g., C01)\n    - \"document_date\": YYYY-MM-DD\n    \"\"\"\n\n\n    PROMPT_INPUT = \"\"\"\n    ### Input:\n    {context}\n\n    ### Response:\n    \"\"\"\n\n    cache_sys = StaticCache(config=model_gemma.config, max_cache_len=1024, device=model_gemma.device, dtype=model_gemma.dtype)\n    inputs_sys = tokenizer_gemma(PROMPT_SYSTEM, return_tensors = \"pt\").to(\"cuda\")\n\n    with torch.no_grad():\n        cache_sys = model_gemma(\n            **inputs_sys,\n            past_key_values=cache_sys\n        ).past_key_values\n\n    text_input_template = tokenizer_gemma.apply_chat_template(\n                [\n                {\"role\" : \"system\", \"content\": PROMPT_SYSTEM},\n                {\"role\" : \"user\", \"content\": PROMPT_INPUT.format(context=\"This is some fake data\")}\n                ],\n                tokenize = False,\n                add_generation_prompt = True).removeprefix('<bos>')\n\n    tokened_input_text = tokenizer_gemma(text_input_template, return_tensors = \"pt\").to(\"cuda\")\n\n    past_key_values = copy.deepcopy(cache_sys)\n\n    outputs = model_gemma.generate(\n        **tokened_input_text,\n        temperature = 1,\n        top_p = 0.95,\n        top_k = 64,\n        past_key_values=past_key_values\n    )\n    input_length = tokened_input_text[\"input_ids\"].shape[1]\n    generated_tokens = outputs[:, input_length:]\n    response_variable = tokenizer_gemma.decode(generated_tokens[0], skip_special_tokens=True)\n\n\nI get the following error from the generate function:\n\n\n    ValueError: Passing both `cache_implementation` (used to initialize certain caches) and `past_key_values` (a Cache object) is unsupported. Please use only one of the two.\n\n\nI have tried passing cache_implementation as None, but I still get the same error.\n\nI am using `transformers==4.57.6`.",
  "title": "KV Caching problem with gemma 3"
}