{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreicpsa3t3f2a25l627ihev365unnunvlzblxpc6m2zkffnsye53f4m",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mf2uwnra4qp2"
},
"path": "/t/kv-caching-problem-with-gemma-3/173571#post_1",
"publishedAt": "2026-02-17T13:38:16.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"section"
],
"textContent": "Hi, I have fine-tuned a Gemma 3 270M model with Unsloth.\n\nI am now trying to implement a caching mechanism for my system prompt. I went through the HF docs, particularly this section.\n\nHowever, I am getting the following error when running:\n\n\n from transformers.cache_utils import StaticCache\n import torch\n from pathlib import Path\n import json\n from unsloth import FastLanguageModel\n import time\n import copy\n\n model_gemma, tokenizer_gemma = FastLanguageModel.from_pretrained(\n model_name = \"gemma_3_lora\", # YOUR MODEL YOU USED FOR TRAINING\n max_seq_length = 2048,\n load_in_4bit = False,\n )\n\n PROMPT_SYSTEM = \"\"\"\n ### Instruction:\n Extract metadata from the document text below.\n You must output a VALID JSON object. Do not output lists, markdown, or conversational text.\n\n Required Keys:\n - \"document_number\": The drawing ID\n - \"document_title\": The main title\n - \"document_revision\": Revision code (e.g., C01)\n - \"document_date\": YYYY-MM-DD\n \"\"\"\n\n\n PROMPT_INPUT = \"\"\"\n ### Input:\n {context}\n\n ### Response:\n \"\"\"\n\n cache_sys = StaticCache(config=model_gemma.config, max_cache_len=1024, device=model_gemma.device, dtype=model_gemma.dtype)\n inputs_sys = tokenizer_gemma(PROMPT_SYSTEM, return_tensors = \"pt\").to(\"cuda\")\n\n with torch.no_grad():\n cache_sys = model_gemma(\n **inputs_sys,\n past_key_values=cache_sys\n ).past_key_values\n\n text_input_template = tokenizer_gemma.apply_chat_template(\n [\n {\"role\" : \"system\", \"content\": PROMPT_SYSTEM},\n {\"role\" : \"user\", \"content\": PROMPT_INPUT.format(context=\"This is some fake data\")}\n ],\n tokenize = False,\n add_generation_prompt = True).removeprefix('<bos>')\n\n tokened_input_text = tokenizer_gemma(text_input_template, return_tensors = \"pt\").to(\"cuda\")\n\n past_key_values = copy.deepcopy(cache_sys)\n\n outputs = model_gemma.generate(\n **tokened_input_text,\n temperature = 1,\n top_p = 0.95,\n top_k = 64,\n 
past_key_values=past_key_values\n )\n input_length = tokened_input_text[\"input_ids\"].shape[1]\n generated_tokens = outputs[:, input_length:]\n response_variable = tokenizer_gemma.decode(generated_tokens[0], skip_special_tokens=True)\n\n\nI get the following error from the `generate` call:\n\n\n ValueError: Passing both `cache_implementation` (used to initialize certain caches) and `past_key_values` (a Cache object) is unsupported. Please use only one of the two.\n\n\nI have tried passing `cache_implementation` as None, but I still get the same error.\n\nI am using `transformers==4.57.6`.",
"title": "KV Caching problem with gemma 3"
}