
CPU offloading error scenario

Hugging Face Forums [Unofficial] April 26, 2026

For those interested, here is my working configuration for Gemma 4 E4B it and E2B, along with a list of the trials and fixes I went through to get Gemma 4 running.

Gemma 4 is much smarter than the previous Gemma 3 12B I used. Even the smaller E2B model performs better. However, Gemma 4’s KV cache is much larger than that of previous Gemma models, and offloading is not working, so it is still a struggle to get a viable setup.

My setup uses a 12 GB VRAM GPU. Below is a list of my trials and fixes to make VRAM usage work.

VRAM trials and fixes

Because Gemma 4 uses a much larger KV cache, I tried to optimize VRAM usage as much as possible. In my best working state, I have less than 3 GB of VRAM free for the KV cache. This means I can only summarize a ~15 KB text file before running into a KV cache OOM. (See below: vision does not work in this best-case setup, and getting it to work leaves me with less than 2 GB of free VRAM.)
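To get a feel for how quickly a KV cache eats into a few GB of free VRAM, here is a rough back-of-the-envelope sketch. It assumes full attention in every layer, so it is an upper bound for models that mix in sliding-window layers. The layer count, KV head count, and head dimension below are hypothetical placeholders, not Gemma 4's actual configuration; read the real values from the model config before trusting the number.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Estimate KV cache size, assuming every layer caches the full sequence.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value=2 corresponds to bfloat16/float16.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


# Hypothetical placeholder values, NOT taken from the Gemma 4 config.
example = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)
print(f"{example / 1024 ** 3:.2f} GiB")
```

The estimate only covers the cache itself. Activations during prefill come on top of it, which may be part of why long prompts fail earlier than the cache size alone would suggest.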

1.1

I was blocked from doing any offloading until the _is_hf_initialized bug was fixed. This was recently fixed in Accelerate (or rather, will be fixed in v1.14). I applied the fix by installing the dev version:

pip install git+https://github.com/huggingface/accelerate.git

1.2

I could not use device_map="auto" until it was fixed in Transformers v5.6.0. Before that, I couldn't run any max_memory tests. Only device_map={"": 0} worked.

Even now that device_map="auto" is fixed, Gemma 4 does not spill over into shared GPU memory like previous Gemma models did.
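Since both fixes depend on the installed versions, a quick check can save some confusion. This is a minimal sketch using only the standard library. The minimum versions are the ones mentioned above (Accelerate v1.14, Transformers v5.6.0), and the helper names are my own. The comparison only looks at the leading numeric part of the version string, so a dev build such as 1.14.0.dev0 is treated as meeting 1.14.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Return the leading numeric components of a version string as ints."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        return ()
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed, minimum):
    """Compare only the numeric prefix; suffixes like '.dev0' are ignored."""
    return version_tuple(installed) >= version_tuple(minimum)


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Minimum versions as mentioned in this post.
MINIMUMS = {"accelerate": "1.14", "transformers": "5.6.0"}

for package, minimum in MINIMUMS.items():
    found = installed_version(package)
    if found is None:
        print(f"{package}: not installed")
    else:
        status = "ok" if meets_minimum(found, minimum) else f"older than {minimum}"
        print(f"{package}: {found} ({status})")
```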

1.3

If I set max_memory for the GPU to anything lower than what is required to fit the full Gemma 4 E4B model, I get this error during model loading:

Tensor.item() cannot be called on meta tensors
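For anyone trying to narrow this down: as far as I understand it, offloaded weights are kept as placeholders on PyTorch's "meta" device (shape and dtype only, no data) until they are needed, and any code that tries to read a value from such a tensor fails with exactly this message. The sketch below reproduces the error on a tiny stand-alone layer and shows a helper for listing which parameters are still on the meta device. The helper name is my own, and the same check should work on a loaded model, though I have only illustrated it here on a toy module.

```python
import torch


def meta_parameter_names(module):
    """List parameters that have no real data yet (still on the meta device)."""
    return [name for name, param in module.named_parameters() if param.is_meta]


# A toy layer created directly on the meta device, standing in for an
# offloaded module.
layer = torch.nn.Linear(4, 2, device="meta")
print(meta_parameter_names(layer))

try:
    # Reading a Python scalar needs real data, which a meta tensor lacks.
    layer.weight.sum().item()
except RuntimeError as err:
    print(f"Reproduced: {err}")
```

Running the helper on the model right after from_pretrained should show which submodules were left on the meta device.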

1.4

If I try to offload the Vision and Audio components, I get the same error during PEFT loading:

Tensor.item() cannot be called on meta tensors

(This is the issue discussed in this forum thread.)

1.5

Since I couldn’t offload anything, I purchased a motherboard that supports two GPUs. I added an old 4 GB GPU and moved my RAG model to that card, freeing up about 1.3 GB of VRAM on my main 12 GB GPU.

Vision issues

The vision component was not working correctly — the model only saw grey and distorted shapes and could not identify images or colors.

Unlike Gemma 3, I had to explicitly exclude model.vision_tower and model.multi_modal_projector from quantization before vision started working. I also excluded the audio components, although I haven’t yet confirmed whether that is necessary.

However, excluding the vision and audio components from quantization costs about 1 GB of VRAM. This leaves me with a trade-off:

Quantize vision and audio → <3 GB free VRAM
Do not quantize vision and audio → <2 GB free VRAM

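# Imports for the snippets below (import locations assumed from the class names used).
import torch
from peft import LoraConfig, PeftModel
from transformers import (
    BitsAndBytesConfig,
    DynamicCache,
    Gemma4ForConditionalGeneration,
    QuantoQuantizedCache,
)

# I had to add the "llm_int8_skip_modules" list to exclude the vision and audio components from quantization.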
QUANT_CONFIGS = {
    "gemma4-e4b-it": BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        llm_int8_enable_fp32_cpu_offload=True,
        llm_int8_skip_modules=[
            "model.vision_tower",
            "model.multi_modal_projector",
            "model.audio_tower",
            "lm_head"
        ]
    ),
}

# So far I can't see that "low_cpu_mem_usage" makes any difference.
MODEL_KWARGS = {
    "gemma4-e4b-it": {
        "dtype": torch.bfloat16,
        "attn_implementation": "sdpa",
        "trust_remote_code": False,
        "low_cpu_mem_usage": False
    },
}

# Main model loading

quant_config = QUANT_CONFIGS.get(model_id_to_load)

target_gpu = 0
gpu_props = torch.cuda.get_device_properties(target_gpu)
gpu_total_gb = gpu_props.total_memory / (1024 ** 3)

# I need a factor of 0.91 to fit the complete Gemma 4 E4B-it model.

gpu_budget_gb = int(gpu_total_gb * 0.91)
cpu_budget_gb = 24
max_memory = {target_gpu: f"{gpu_budget_gb}GiB", "cpu": f"{cpu_budget_gb}GiB"}

base_model = Gemma4ForConditionalGeneration.from_pretrained(
    MODEL_REGISTRY[model_id_to_load],
    quantization_config=quant_config,
#    device_map={"":0},
    device_map="auto",
#    device_map=device_map,
    max_memory=max_memory,
    offload_folder="e:\\Folder\\offload_temp",
    **MODEL_KWARGS[model_id_to_load]
)

# PEFT model loading

model = PeftModel.from_pretrained(
    base_model,
    lora_path,
    adapter_name=lora_source_client_name,
    is_trainable=False
)

# I tried quantizing the KV cache, but it did not help: the full KV cache is built before it is quantized, so large contexts OOM right at the start.

past_key_values = QuantoQuantizedCache(
    config=model.config,
)

# Alternative trial: DynamicCache with offloading. That did not help either: as above, the whole context is loaded into the KV cache before DynamicCache starts offloading, so large contexts still OOM.

past_key_values = DynamicCache(
    config=model.config,
    offloading=True
)

outputs = model.generate(
    **inputs,
    past_key_values=past_key_values,
    tokenizer=tokenizer,
    do_sample=True,
    stop_strings=current_stop_strings,
    **params,
)


# PEFT training raised an error because Gemma 4 works differently from previous Gemma models: I got a "Gemma4ClippableLinear" error. To fix it, I had to append ".linear" to each entry in target_modules.

if is_gemma4_model:
    lora_targets_full = [
        "q_proj.linear", "k_proj.linear", "v_proj.linear", "o_proj.linear",
        "gate_proj.linear", "up_proj.linear", "down_proj.linear"
    ]
    lora_targets_minimal = ["q_proj.linear", "v_proj.linear", "gate_proj.linear", "up_proj.linear"]
else:
    lora_targets_full = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
    lora_targets_minimal = ["q_proj", "v_proj", "gate_proj", "up_proj"]

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=lora_targets_full,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
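One detail in the max_memory setup in the loading code above: the GPU budget goes through int(), which rounds down to whole GiB, so small changes to the factor may not change the budget at all. The sketch below mirrors that calculation and compares it with a budget expressed in MiB. The 12 GiB total is a hypothetical round figure (the real total_memory of a card will differ slightly), and the function names are my own. As far as I know, Accelerate's max_memory also accepts "MiB" strings, but I have not tested this with Gemma 4.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def budget_whole_gib(total_bytes, fraction):
    """Mirror the int(...) budget from the post: whole GiB, rounded down."""
    return f"{int(total_bytes / GIB * fraction)}GiB"


def budget_mib(total_bytes, fraction):
    """Same fraction, but expressed in MiB so less is lost to rounding."""
    return f"{int(total_bytes * fraction / MIB)}MiB"


total = 12 * GIB  # hypothetical card with exactly 12 GiB

print(budget_whole_gib(total, 0.91))  # 10GiB
print(budget_whole_gib(total, 0.90))  # 10GiB, same budget as 0.91
print(budget_mib(total, 0.91))        # 11182MiB
```

With whole-GiB rounding, the factor mostly matters when it pushes the product across an integer boundary, so the MiB form is the easier one to tune.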

Hope this helps anyone struggling with Gemma 4.
