
KV caching problem with Gemma 3

Hugging Face Forums [Unofficial] February 17, 2026

Hi, I have fine-tuned a Gemma 3 270M model with Unsloth.

I am now trying to implement a caching mechanism for my system prompt. I went through the HF docs, particularly this section.

However, I get an error when running the following:

from transformers.cache_utils import StaticCache
import torch
from pathlib import Path
import json
from unsloth import FastLanguageModel
import time
import copy

model_gemma, tokenizer_gemma = FastLanguageModel.from_pretrained(
    model_name = "gemma_3_lora", # YOUR MODEL YOU USED FOR TRAINING
    max_seq_length = 2048,
    load_in_4bit = False,
)

PROMPT_SYSTEM = """
### Instruction:
Extract metadata from the document text below.
You must output a VALID JSON object. Do not output lists, markdown, or conversational text.

Required Keys:
- "document_number": The drawing ID
- "document_title": The main title
- "document_revision": Revision code (e.g., C01)
- "document_date": YYYY-MM-DD
"""


PROMPT_INPUT = """
### Input:
{context}

### Response:
"""

# Pre-allocate a static KV cache and tokenize the system prompt on its own
cache_sys = StaticCache(config=model_gemma.config, max_cache_len=1024, device=model_gemma.device, dtype=model_gemma.dtype)
inputs_sys = tokenizer_gemma(PROMPT_SYSTEM, return_tensors = "pt").to("cuda")

# Forward pass over the system prompt only, to fill the cache
with torch.no_grad():
    cache_sys = model_gemma(
        **inputs_sys,
        past_key_values=cache_sys
    ).past_key_values

text_input_template = tokenizer_gemma.apply_chat_template(
    [
        {"role" : "system", "content": PROMPT_SYSTEM},
        {"role" : "user", "content": PROMPT_INPUT.format(context="This is some fake data")}
    ],
    tokenize = False,
    add_generation_prompt = True,
).removeprefix('<bos>')

tokened_input_text = tokenizer_gemma(text_input_template, return_tensors = "pt").to("cuda")

past_key_values = copy.deepcopy(cache_sys)

outputs = model_gemma.generate(
    **tokened_input_text,
    temperature = 1,
    top_p = 0.95,
    top_k = 64,
    past_key_values=past_key_values
)
input_length = tokened_input_text["input_ids"].shape[1]
generated_tokens = outputs[:, input_length:]
response_variable = tokenizer_gemma.decode(generated_tokens[0], skip_special_tokens=True)

I get the following error from the generate function:

ValueError: Passing both `cache_implementation` (used to initialize certain caches) and `past_key_values` (a Cache object) is unsupported. Please use only one of the two.

I have tried passing cache_implementation as None, but I still get the same error.
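One possible explanation, stated as an assumption rather than a confirmed diagnosis: the model's own generation_config may carry a default cache_implementation, and a None passed as a keyword argument to generate may be replaced by that default before the check runs. A minimal sketch of a workaround under that assumption is to clear the attribute on model.generation_config for the duration of the call and restore it afterwards. The helper names cache_implementation_cleared and generate_with_prefilled_cache are hypothetical (they are not part of transformers or unsloth), and this has not been verified against transformers 4.57.6.

```python
from contextlib import contextmanager


@contextmanager
def cache_implementation_cleared(model):
    """Temporarily set model.generation_config.cache_implementation to None.

    Assumption: the ValueError is triggered by a default stored on the
    model's generation_config, not by the keyword passed to generate().
    """
    gen_config = model.generation_config
    saved = getattr(gen_config, "cache_implementation", None)
    gen_config.cache_implementation = None
    try:
        yield model
    finally:
        # Restore the original value so later calls behave as before.
        gen_config.cache_implementation = saved


def generate_with_prefilled_cache(model, inputs, cache, **generate_kwargs):
    """Hypothetical helper: call generate() with a user-built Cache object."""
    with cache_implementation_cleared(model):
        return model.generate(**inputs, past_key_values=cache, **generate_kwargs)
```

With the names from the script above, the call would look like generate_with_prefilled_cache(model_gemma, tokened_input_text, past_key_values, temperature=1, top_p=0.95, top_k=64).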

I am using transformers==4.57.6.
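A separate point that may be worth checking once generate runs: as far as I understand, a prefilled cache is only reused correctly when the cached tokens are exactly the leading tokens of the full prompt. In the script above the cache is built from the raw PROMPT_SYSTEM, while the full prompt goes through apply_chat_template, so the two token sequences may diverge early. Below is a minimal sketch of such a check on plain lists of token ids; shared_prefix_length and is_token_prefix are hypothetical helper names, not library functions.

```python
def shared_prefix_length(prefix_ids, full_ids):
    """Count how many leading token ids the two sequences have in common."""
    count = 0
    for a, b in zip(prefix_ids, full_ids):
        if a != b:
            break
        count += 1
    return count


def is_token_prefix(prefix_ids, full_ids):
    """True only if every id in prefix_ids matches the start of full_ids."""
    return (
        len(prefix_ids) <= len(full_ids)
        and shared_prefix_length(prefix_ids, full_ids) == len(prefix_ids)
    )
```

With the tensors from the script, the two lists could be obtained as inputs_sys["input_ids"][0].tolist() and tokened_input_text["input_ids"][0].tolist(). If is_token_prefix returns False, shared_prefix_length shows the position at which the cached prompt and the templated prompt stop matching.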
