KV Caching problem with gemma 3
Hugging Face Forums [Unofficial]
February 17, 2026
Hi, I have fine-tuned a Gemma 3 270M model with Unsloth.
I am now trying to implement a caching mechanism for my system prompt. I went through the HF docs, particularly this section.
However, I am getting an error when running the following code:
from transformers.cache_utils import StaticCache
import torch
from pathlib import Path
import json
from unsloth import FastLanguageModel
import time
import copy
model_gemma, tokenizer_gemma = FastLanguageModel.from_pretrained(
    model_name = "gemma_3_lora", # YOUR MODEL YOU USED FOR TRAINING
    max_seq_length = 2048,
    load_in_4bit = False,
)
PROMPT_SYSTEM = """
### Instruction:
Extract metadata from the document text below.
You must output a VALID JSON object. Do not output lists, markdown, or conversational text.
Required Keys:
- "document_number": The drawing ID
- "document_title": The main title
- "document_revision": Revision code (e.g., C01)
- "document_date": YYYY-MM-DD
"""
PROMPT_INPUT = """
### Input:
{context}
### Response:
"""
cache_sys = StaticCache(config=model_gemma.config, max_cache_len=1024, device=model_gemma.device, dtype=model_gemma.dtype)
inputs_sys = tokenizer_gemma(PROMPT_SYSTEM, return_tensors = "pt").to("cuda")
with torch.no_grad():
    cache_sys = model_gemma(
        **inputs_sys,
        past_key_values=cache_sys
    ).past_key_values
text_input_template = tokenizer_gemma.apply_chat_template(
    [
        {"role" : "system", "content": PROMPT_SYSTEM},
        {"role" : "user", "content": PROMPT_INPUT.format(context="This is some fake data")}
    ],
    tokenize = False,
    add_generation_prompt = True).removeprefix('<bos>')
tokened_input_text = tokenizer_gemma(text_input_template, return_tensors = "pt").to("cuda")
past_key_values = copy.deepcopy(cache_sys)
outputs = model_gemma.generate(
    **tokened_input_text,
    temperature = 1,
    top_p = 0.95,
    top_k = 64,
    past_key_values=past_key_values
)
input_length = tokened_input_text["input_ids"].shape[1]
generated_tokens = outputs[:, input_length:]
response_variable = tokenizer_gemma.decode(generated_tokens[0], skip_special_tokens=True)
I get the following error from the generate function:
ValueError: Passing both `cache_implementation` (used to initialize certain caches) and `past_key_values` (a Cache object) is unsupported. Please use only one of the two.
I have tried passing cache_implementation as None, but I still get the same error.
I am using transformers==4.57.6.
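Separately from the error, a prefilled system-prompt cache can only be reused if the cached token ids are an exact prefix of the full input's token ids. In the code above the cache is built from the raw PROMPT_SYSTEM, while the generation input goes through the chat template, so the two may not line up. A minimal sketch of how that could be checked, using hypothetical helper names (common_prefix_len, is_reusable_prefix) and made-up token ids rather than real Gemma tokenizer output:

```python
from typing import Sequence


def common_prefix_len(cached_ids: Sequence[int], full_ids: Sequence[int]) -> int:
    """Return how many leading token ids the two sequences share."""
    n = 0
    for a, b in zip(cached_ids, full_ids):
        if a != b:
            break
        n += 1
    return n


def is_reusable_prefix(cached_ids: Sequence[int], full_ids: Sequence[int]) -> bool:
    """True only if every cached token id appears, in order, at the start of full_ids."""
    return common_prefix_len(cached_ids, full_ids) == len(cached_ids)


# Made-up ids for illustration only (not produced by any real tokenizer).
cached = [2, 10, 11, 12]             # ids that were prefilled into the cache
plain = [2, 10, 11, 12, 50]          # same prefix, then new tokens
templated = [2, 99, 10, 11, 12, 50]  # a template token inserted after the first id

print(is_reusable_prefix(cached, plain))
print(is_reusable_prefix(cached, templated))
print(common_prefix_len(cached, templated))
```

With the variables from the code above, the same check could be run on inputs_sys["input_ids"][0].tolist() and tokened_input_text["input_ids"][0].tolist().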