CPU offloading error scenario
For those interested, here is my working configuration for Gemma 4 E4B it and E2B, along with a list of the trials and fixes I went through to get Gemma 4 running.
Gemma 4 is much smarter than the previous Gemma 3 12B I used. Even the smaller E2B model performs better. However, Gemma 4’s KV cache is much larger than that of previous Gemma models, and offloading is not working, so it is still a struggle to get a viable setup.
My setup uses a 12 GB VRAM GPU. Below is a list of my trials and fixes to make VRAM usage work.
- VRAM trials and fixes
Because Gemma 4 uses a much larger KV cache, I tried to optimize VRAM usage as much as possible. In my best working state, I have less than 3 GB of VRAM free for the KV cache. This means I can only summarize a ~15 KB text file before running into a KV cache OOM. (See below: vision does not work in this best-case setup, and with vision working I have less than 2 GB of VRAM free.)
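To get a feel for how quickly a few GB of free VRAM runs out, the usual back-of-the-envelope KV cache estimate can be sketched as below. This is only a rough upper bound: it assumes every layer uses full attention and ignores activations during prefill. The layer, head, and dimension values in the usage lines are placeholders and are not Gemma 4's real configuration; read the real ones from the model config.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """Rough upper bound on KV cache size, assuming full attention in every layer."""
    # Factor 2: one key tensor and one value tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch_size)


def max_tokens_for_budget(free_bytes, num_layers, num_kv_heads, head_dim,
                          bytes_per_elem=2):
    """How many tokens of KV cache fit into free_bytes under the same assumption."""
    per_token = kv_cache_bytes(num_layers, num_kv_heads, head_dim, 1, bytes_per_elem)
    return free_bytes // per_token


if __name__ == "__main__":
    # Placeholder values (assumptions), bf16 cache entries = 2 bytes each.
    free = 3 * 1024 ** 3
    print(max_tokens_for_budget(free, num_layers=32, num_kv_heads=4, head_dim=256))
```

Sliding-window layers or shared KV heads would lower the real number, so treat the result as a sanity check rather than a prediction.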
1.1
I was blocked from doing any offloading until the _is_hf_initialized bug was fixed. This was recently fixed in Accelerate (or rather, will be fixed in v1.14). I applied the fix by installing the dev version:
pip install git+https://github.com/huggingface/accelerate.git
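Since several of these fixes depend on specific library versions, it can help to print what is actually installed before testing. A minimal sketch using only the standard library; the package list is an assumption based on the libraries mentioned in this post.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    names = ["accelerate", "transformers", "peft", "bitsandbytes", "torch"]
    for name, ver in installed_versions(names).items():
        print(f"{name}: {ver or 'not installed'}")
```

A dev install from Git usually shows up with a version string ending in something like `.dev0`, which is a quick way to confirm the pip command above took effect.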
1.2
I could not use device_map="auto" until it was fixed in Transformers v5.6.0. Before that, I couldn’t run any max_memory tests. Only device_map={"": 0} worked.
Even now that device_map="auto" is fixed, Gemma 4 does not spill over into shared GPU memory like previous Gemma models did.
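When testing device_map="auto", it helps to see where the modules actually ended up. Models loaded with a device map expose the placement as `hf_device_map`; the helper below just counts entries per device. It is a small sketch and the function name is made up for illustration.

```python
from collections import Counter


def summarize_device_map(device_map):
    """Count how many device-map entries were placed on each device."""
    counts = Counter(str(device) for device in device_map.values())
    return dict(counts)


if __name__ == "__main__":
    # Example with a hand-written map; with a real model you would pass
    # base_model.hf_device_map instead.
    example = {"model.embed": 0, "model.layers.0": 0, "model.layers.1": "cpu"}
    print(summarize_device_map(example))
```

If everything shows up under device `0`, nothing was offloaded, no matter what max_memory was set to.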
1.3
If I set max_memory for the GPU to anything lower than what is required to fit the full Gemma 4 E4B model, I get this error during model loading:
Tensor.item() cannot be called on meta tensors
1.4
If I try to offload the Vision and Audio components, I get the same error during PEFT loading:
Tensor.item() cannot be called on meta tensors
(This is the issue discussed in this forum thread.)
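The "meta tensors" error means some code called .item() on a tensor that has no data yet because it is still on the meta device (this is how offloaded or not-yet-loaded weights are represented). A small diagnostic sketch to list which parameters and buffers are in that state after loading; it only relies on the standard `named_parameters()` / `named_buffers()` methods of a PyTorch module, and the function name is hypothetical.

```python
def find_meta_tensors(model):
    """Return names of parameters and buffers that still live on the meta device."""
    meta_names = []
    for name, param in model.named_parameters():
        if param.device.type == "meta":
            meta_names.append(name)
    for name, buf in model.named_buffers():
        if buf.device.type == "meta":
            meta_names.append(name)
    return meta_names
```

Calling `find_meta_tensors(base_model)` right after `from_pretrained` should show which components were left without real data, which may help narrow down where the .item() call comes from.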
1.5
Since I couldn’t offload anything, I purchased a motherboard that supports two GPUs. I added an old 4 GB GPU and moved my RAG model to that card, freeing up about 1.3 GB of VRAM on my main 12 GB GPU.
- Vision issues
The vision component was not working correctly — the model only saw grey and distorted shapes and could not identify images or colors.
Unlike Gemma 3, I had to explicitly exclude model.vision_tower and model.multi_modal_projector from quantization before vision started working. I also excluded the audio components, although I haven’t yet confirmed whether that is necessary.
However, excluding the vision and audio components from quantization costs about 1 GB of VRAM. This leaves me with a trade-off:
Quantize vision and audio → <3 GB free VRAM
Do not quantize vision and audio → <2 GB free VRAM
# Imports used by the snippets below.
import torch
from peft import LoraConfig, PeftModel
from transformers import BitsAndBytesConfig, Gemma4ForConditionalGeneration

# I had to add the "llm_int8_skip_modules" list to exclude vision and audio from quantization.
QUANT_CONFIGS = {
    "gemma4-e4b-it": BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        llm_int8_enable_fp32_cpu_offload=True,
        llm_int8_skip_modules=[
            "model.vision_tower",
            "model.multi_modal_projector",
            "model.audio_tower",
            "lm_head",
        ],
    ),
}
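To confirm that the skip list really took effect, one option is to walk the loaded model and sort linear layers by class name. This sketch assumes that bitsandbytes 4-bit layers show up as a class named `Linear4bit` and untouched layers as plain `Linear`; the function name is made up for illustration.

```python
def split_by_quantization(model, quantized_class_names=("Linear4bit",)):
    """Return (quantized, unquantized) lists of linear-layer module names."""
    quantized, unquantized = [], []
    for name, module in model.named_modules():
        class_name = type(module).__name__
        if class_name in quantized_class_names:
            quantized.append(name)
        elif class_name == "Linear":
            unquantized.append(name)
    return quantized, unquantized
```

With the config above, names under `model.vision_tower` and `model.audio_tower` would be expected in the second list. Layers wrapped in other classes are ignored by this check, so an empty result for a component is not proof either way.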
# So far I can't see that "low_cpu_mem_usage" makes any difference.
MODEL_KWARGS = {
    "gemma4-e4b-it": {
        "dtype": torch.bfloat16,
        "attn_implementation": "sdpa",
        "trust_remote_code": False,
        "low_cpu_mem_usage": False,
    },
}
# Main model loading
quant_config = QUANT_CONFIGS.get(model_id_to_load)
target_gpu = 0
gpu_props = torch.cuda.get_device_properties(target_gpu)
gpu_total_gb = gpu_props.total_memory / (1024 ** 3)
# I need a factor of 0.91 to fit the complete Gemma 4 E4B it model.
# Note: int() truncates, so the budget is rounded down to whole GiB.
gpu_budget_gb = int(gpu_total_gb * 0.91)
cpu_budget_gb = 24
max_memory = {target_gpu: f"{gpu_budget_gb}GiB", "cpu": f"{cpu_budget_gb}GiB"}
base_model = Gemma4ForConditionalGeneration.from_pretrained(
    MODEL_REGISTRY[model_id_to_load],
    quantization_config=quant_config,
    # device_map={"": 0},
    device_map="auto",
    # device_map=device_map,
    max_memory=max_memory,
    offload_folder="e:\\Folder\\offload_temp",
    **MODEL_KWARGS[model_id_to_load],
)
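One detail in the budget calculation above: int() on a GiB value throws away the fractional part, so on a 12 GiB card a factor of 0.91 gives 10.92 and ends up as "10GiB". A possible variation is to express the GPU budget in MiB, which max_memory also accepts as a string. This is a sketch of that idea, not something tested against Gemma 4, and the helper name is hypothetical.

```python
def build_max_memory(total_bytes, gpu_fraction, cpu_gib, gpu_index=0):
    """Build a max_memory dict with the GPU budget in MiB to limit rounding loss."""
    gpu_mib = int(total_bytes / (1024 ** 2) * gpu_fraction)
    return {gpu_index: f"{gpu_mib}MiB", "cpu": f"{cpu_gib}GiB"}


if __name__ == "__main__":
    total = 12 * 1024 ** 3  # a nominal 12 GiB card
    print(int(total / (1024 ** 3) * 0.91))    # whole-GiB budget
    print(build_max_memory(total, 0.91, 24))  # MiB budget
```

With a real GPU you would pass `gpu_props.total_memory` as `total_bytes`. The finer granularity makes it possible to test budgets between 10 and 11 GiB, which whole-GiB rounding cannot express.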
# PEFT model loading
model = PeftModel.from_pretrained(
    base_model,
    lora_path,
    adapter_name=lora_source_client_name,
    is_trainable=False,
)
# Cache classes used in the two attempts below.
from transformers import DynamicCache, QuantoQuantizedCache

# Attempt 1: I tried to quantize the KV cache, but it did not help. The full KV cache is
# filled before it is quantized, so it still runs OOM right at the start with big contexts.
past_key_values = QuantoQuantizedCache(
    config=model.config,
)
# Attempt 2: I tried DynamicCache with offloading, but that did not help either. As above,
# the whole content is loaded into the KV cache before offloading starts to do its job,
# so it also runs OOM on big contexts.
past_key_values = DynamicCache(
    config=model.config,
    offloading=True,
)
outputs = model.generate(
    **inputs,
    past_key_values=past_key_values,
    tokenizer=tokenizer,
    do_sample=True,
    stop_strings=current_stop_strings,
    **params,
)
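Since neither cache option avoids the OOM during the initial prefill, one workaround for summarizing long files is to keep each prompt under a token budget and summarize in pieces. This is a different approach from the cache settings above and is only a sketch; the function name and the overlap idea are assumptions, not something from the original setup.

```python
def chunk_token_ids(token_ids, max_tokens, overlap=0):
    """Split a list of token ids into windows of at most max_tokens ids.

    Consecutive windows share `overlap` ids so context is not cut too abruptly.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_tokens])
        if start + max_tokens >= len(token_ids):
            break  # the last window already reached the end
    return chunks
```

Each chunk can then be summarized with its own generate call, and the partial summaries combined in a final pass. The trade-off is extra generate calls and some loss of cross-chunk context.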
# PEFT training failed at first because Gemma 4 works differently from previous Gemma models:
# I got a "Gemma4ClippableLinear" error. To fix it, I had to append ".linear" to each
# entry in target_modules.
if is_gemma4_model:
    lora_targets_full = [
        "q_proj.linear", "k_proj.linear", "v_proj.linear", "o_proj.linear",
        "gate_proj.linear", "up_proj.linear", "down_proj.linear",
    ]
    lora_targets_minimal = ["q_proj.linear", "v_proj.linear", "gate_proj.linear", "up_proj.linear"]
else:
    lora_targets_full = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
    lora_targets_minimal = ["q_proj", "v_proj", "gate_proj", "up_proj"]
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=lora_targets_full,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
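Before starting a training run, it can be useful to preview which modules a target_modules list would hit. As far as I know, PEFT matches a list entry when the module name equals it or ends with "." plus the entry; the sketch below mimics that rule on plain name strings, so treat it as an approximation of PEFT's behavior. The function name is hypothetical.

```python
def preview_lora_matches(module_names, target_modules):
    """Map each target to the module names it would match by suffix."""
    matches = {target: [] for target in target_modules}
    for name in module_names:
        for target in target_modules:
            if name == target or name.endswith("." + target):
                matches[target].append(name)
    return matches
```

Typical usage would be `preview_lora_matches([n for n, _ in base_model.named_modules()], lora_targets_full)`. A target with an empty list means no LoRA adapter would be attached for it, which is easier to spot here than in a training log.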
Hope this helps anyone struggling with Gemma 4.