{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreif4wccbvga5kfocgowufwg4fljtcyjajyqtz5626g2aa2vh2whwhe",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mkbpvfsdq3m2"
},
"path": "/t/cpu-offloading-error-scenario/175522#post_5",
"publishedAt": "2026-04-24T22:04:38.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"GitHub",
"Hugging Face"
],
"textContent": "I’ll post a draft of the issue for now:\n\n* * *\n\nThe **good actual issues** to raise are these, in this order.\n\n## Issue 1 — Primary: PEFT adapter loading fails on an already CPU/GPU-dispatched bnb 4-bit Gemma 4 model\n\n**File first at:** `huggingface/transformers`\n**Mention/cross-link:** `huggingface/peft`, `huggingface/accelerate`, `bitsandbytes-foundation/bitsandbytes`\n\n### Suggested title\n\n\n PeftModel.from_pretrained on CPU/GPU-dispatched 4-bit Gemma4 fails during Accelerate hook attachment in bitsandbytes QuantState/Params4bit\n\n\n### Why this is the strongest issue\n\nThis is the core failure:\n\n\n Base Gemma 4 loads with custom CPU/GPU device_map.\n All-GPU Gemma 4 + PEFT works.\n PEFT adapter loading triggers Accelerate dispatch/hook logic.\n The failure occurs inside bitsandbytes 4-bit state/parameter handling.\n\n\nThe concrete failure variants are related, not contradictory:\n\n\n Tensor.item() cannot be called on meta tensors\n → bitsandbytes QuantState.as_dict(packed=True)\n → nested_offset = self.offset.item()\n\n\nand, on nearby version/config paths:\n\n\n Params4bit.__new__() got an unexpected keyword argument '_is_hf_initialized'\n\n\nThe `_is_hf_initialized` family is already visible in upstream issue traffic around Transformers/Accelerate/bitsandbytes parameter reconstruction; there is a current issue for the analogous `Int8Params` case, and another issue describing `_is_hf_initialized` being passed into parameter reconstruction paths. (GitHub)\n\n### Core issue statement\n\nUse wording like this:\n\n\n The base model can be loaded with a split CPU/GPU device_map, and the all-GPU PEFT path works. The failure appears when loading a PEFT adapter onto the already-dispatched bitsandbytes 4-bit Gemma 4 base model. PeftModel.from_pretrained appears to trigger an additional Accelerate dispatch/hook path. 
That path fails inside bitsandbytes 4-bit quant-state or Params4bit handling.\n\n\n### Why Transformers first\n\nTransformers is the best first repo because this issue crosses:\n\n * Gemma 4 model integration;\n * bitsandbytes quantization integration;\n * device-map loading behavior;\n * PEFT adapter integration expectations;\n * current `_is_hf_initialized` loading behavior.\n\n\n\nAccelerate owns `dispatch_model()` and hook attachment; its docs define dispatching models across GPU, CPU, and disk according to `device_map`, and public Accelerate source/doc snippets show hook attachment is central to this path. (Hugging Face)\n\nbitsandbytes owns `Linear4bit`, `Params4bit`, and `QuantState`, but the failure is triggered by the HF integration path. So file at Transformers first and let maintainers route if needed.\n\n* * *\n\n## Issue 2 — Secondary: Passing `device_map` to PEFT breaks Gemma 4 shared-KV generation\n\n**File first at:** `huggingface/transformers`\n**Mention/cross-link:** `huggingface/peft`, `huggingface/accelerate`\n\n### Suggested title\n\n\n Gemma4 + PeftModel.from_pretrained(device_map=...) breaks shared_kv_states during generate: KeyError 22\n\n\n### Why this is a separate issue\n\nThis is not the same failure as the PEFT-load/bitsandbytes failure. It occurs later, during generation:\n\n\n Gemma4 self_attn forward\n → shared_kv_states[self.kv_shared_layer_index]\n → KeyError: 22\n\n\nThis happens only after using:\n\n\n PeftModel.from_pretrained(..., device_map=device_map)\n\n\nThat is important because passing `device_map` into PEFT is not simply “offload PEFT too.” It asks PEFT/Accelerate to redispatch the PEFT-wrapped model, using names/layout assumptions that may no longer match the original base model.\n\nGemma 4 has shared-KV-cache behavior where later layers reuse key/value states from earlier layers. If a second dispatch/hook pass changes the execution/capture path, the dict entry expected by the shared layer may not be present. 
The Gemma 4 architecture writeup describes the shared-KV-cache mechanism; Unsloth’s Gemma 4 guide also calls out shared KV state across E2B/E4B layers. (GitHub)\n\n### Core issue statement\n\n\n Passing the same base model device_map to PeftModel.from_pretrained avoids the initial adapter-load failure, but generation then fails in Gemma 4 shared-KV attention with KeyError. This suggests the PEFT/Accelerate redispatch layout breaks Gemma 4 shared_kv_states bookkeeping.\n\n\n### Why this deserves its own issue\n\nBecause the fix for Issue 1 may not automatically fix Issue 2. Issue 1 is about PEFT adapter loading over bnb 4-bit offload. Issue 2 is about Gemma 4 generation semantics after PEFT-level redispatch.\n\nDo not merge them into one maintainer action item unless you present Issue 2 as a “related second symptom.”\n\n* * *\n\n## Issue 3 — Optional/supporting: PEFT offload-dir / offload-folder handling is confusing or under-documented\n\n**File at:** `huggingface/peft`\n\n### Suggested title\n\n\n Clarify offload_dir/offload_folder handling for PeftModel.from_pretrained on already-dispatched models\n\n\n### Why it is lower priority\n\nThis is probably not the root cause of the current Gemma 4 failure, but it is part of the same user-facing confusion.\n\nThere are existing PEFT issues about `PeftModel.from_pretrained()` failing with:\n\n\n ValueError: We need an `offload_dir` to dispatch this model according to this `device_map`\n\n\nand about inconsistent `offload_dir` / `offload_folder` naming. 
(GitHub)\n\nThis is worth mentioning in Issue 1 as context, but I would not file it first unless your minimal repro specifically lands on the missing `offload_dir` error.\n\n* * *\n\n# What I would not file\n\n## Not this\n\n\n PEFT expects vision/audio towers to be on GPU.\n\n\nThat is too broad and likely inaccurate.\n\nBetter:\n\n\n PEFT adapter loading triggers redispatch/hook handling on an already-dispatched bnb 4-bit Gemma 4 model, and that dispatch path fails.\n\n\n## Not this\n\n\n CPU offloading is broken.\n\n\nToo broad. The base model can load with CPU/GPU dispatch; Accelerate supports dispatching layers across GPU, CPU, and disk by design. (Hugging Face)\n\nBetter:\n\n\n Runtime PEFT adapter loading on top of a CPU/GPU-dispatched bitsandbytes 4-bit Gemma 4 base is broken on this version set.\n\n\n## Not this as the main issue\n\n\n model.multi_modal_projector offload fails.\n\n\nOnly file a projector-specific issue after verifying that exact module key exists in the actual model. For Gemma 4 variants, bridge/module names can differ.\n\n* * *\n\n# Recommended filing plan\n\n## Best plan\n\nOpen **one primary Transformers issue** with two sections:\n\n\n A. Primary failure: PeftModel.from_pretrained on split-device bnb 4-bit Gemma4 fails during adapter load.\n B. Related failure: adding device_map to PEFT avoids load error but causes Gemma4 shared_kv_states KeyError during generate.\n\n\nThen add:\n\n\n I can split the shared_kv_states issue into a separate ticket if maintainers prefer.\n\n\nThis is efficient because maintainers can see the relationship.\n\n## If you want the cleanest tracking\n\nOpen two separate issues:\n\n 1. **Transformers Issue A:** bnb 4-bit + PEFT + Accelerate dispatch failure.\n 2. 
**Transformers Issue B:** Gemma 4 shared-KV `KeyError` when `device_map` is passed to PEFT.\n\n\n\nThen cross-link them.\n\n* * *\n\n# Minimal titles to use\n\n## Best title for main issue\n\n\n PeftModel.from_pretrained on CPU/GPU-dispatched 4-bit Gemma4 fails during Accelerate hook attachment in bitsandbytes QuantState/Params4bit\n\n\n## Best title for related issue\n\n\n Gemma4 + PeftModel.from_pretrained(device_map=...) breaks shared_kv_states during generate: KeyError 22\n\n\n## Optional PEFT docs/UX issue\n\n\n Clarify offload_dir/offload_folder behavior when loading PEFT adapters on already-dispatched models\n\n\n* * *\n\n# Key evidence to include\n\nInclude this exact contrast:\n\n\n Works:\n device_map = {\"\": 0}\n\n Fails:\n device_map = {\n \"model.vision_tower\": \"cpu\",\n \"model.audio_tower\": \"cpu\",\n \"\": 0,\n }\n\n\nMention `model.multi_modal_projector` only if verified by `named_modules()`.\n\nInclude quant config:\n\n\n BitsAndBytesConfig(\n load_in_4bit=True,\n bnb_4bit_quant_type=\"nf4\",\n bnb_4bit_use_double_quant=True,\n bnb_4bit_compute_dtype=torch.bfloat16,\n llm_int8_enable_fp32_cpu_offload=True,\n )\n\n\nMention that `llm_int8_enable_fp32_cpu_offload=True` is required/expected for CPU/disk entries in many bnb quantized `device_map` paths, even though the name is confusing; Transformers’ bitsandbytes docs describe CPU/GPU offload behavior in this quantization area. (GitHub)\n\nInclude the exact two trace tails:\n\n\n Linear4bit._save_to_state_dict\n → weight.quant_state.as_dict(packed=True)\n → nested_offset = self.offset.item()\n → Tensor.item() cannot be called on meta tensors\n\n\nand:\n\n\n Gemma4Attention.forward\n → shared_kv_states[self.kv_shared_layer_index]\n → KeyError: 22\n\n\n* * *\n\n# Bottom line\n\nThe actual issues you are raising are:\n\n 1. 
**Primary bug:** PEFT adapter loading on an already CPU/GPU-dispatched bitsandbytes 4-bit Gemma 4 model triggers Accelerate redispatch/hook handling and fails inside bitsandbytes 4-bit state/parameter handling.\n\n 2. **Secondary bug:** Passing `device_map` into PEFT is not a valid workaround for Gemma 4; it can break shared-KV generation with `KeyError: 22`.\n\n 3. **Optional docs/UX issue:** PEFT/Accelerate offload args are confusing around `offload_dir`, `offload_folder`, and already-dispatched base models.\n\n\n\n\nThose are good, concrete, maintainable issues.\n\n* * *\n\nBelow are **ready-to-paste GitHub issues**. I would open **Issue 1 first** in `huggingface/transformers`. If maintainers ask to split the shared-KV failure, open **Issue 2** separately. This framing matches Accelerate’s documented role in dispatching models across GPU/CPU/disk, PEFT’s adapter-loading surface, bitsandbytes 4-bit quant-state handling, and Gemma 4’s shared-KV-cache architecture. (Hugging Face)\n\n* * *\n\n# Issue 1\n\n## Target repo\n\n\n huggingface/transformers\n\n\n## Suggested title\n\n\n PeftModel.from_pretrained on CPU/GPU-dispatched 4-bit Gemma4 fails during Accelerate hook attachment in bitsandbytes QuantState\n\n\n## Suggested labels\n\n\n bug, Gemma4, PEFT, Accelerate, bitsandbytes, quantization, device_map, cpu-offload\n\n\n## Body\n\n\n ### System Info\n\n - OS: Windows\n - Python: please fill\n - GPU: please fill\n - NVIDIA driver: please fill\n - CUDA: please fill\n - torch: 2.8.0+cu129\n - transformers: 5.6.2\n - accelerate: 1.14.0.dev0\n - bitsandbytes: 0.49.2\n - peft: 0.19.1\n - model: Gemma 4 E4B IT\n - quantization: bitsandbytes 4-bit NF4\n - adapter type: LoRA\n - attention implementation: sdpa\n - trust_remote_code: False\n\n ### Summary\n\n A Gemma 4 E4B IT base model works when loaded fully on GPU with:\n\n ```python\n device_map = {\"\": 0}\n ```\n\n However, loading the same base model with a custom CPU/GPU `device_map` and then loading a PEFT adapter 
with `PeftModel.from_pretrained()` fails during adapter loading.\n\n The failure appears when PEFT adapter loading calls into Accelerate dispatch/hook logic. Accelerate then calls `module.state_dict()` while attaching execution hooks, which reaches bitsandbytes `Linear4bit._save_to_state_dict()`. bitsandbytes then serializes `weight.quant_state.as_dict(packed=True)` and fails because a nested quantization scalar is still on the `meta` device:\n\n ```text\n RuntimeError: Tensor.item() cannot be called on meta tensors\n ```\n\n The all-GPU path works. The failure appears specifically when the base model is already CPU/GPU-dispatched and quantized with bitsandbytes 4-bit double quantization.\n\n ### Working case\n\n ```python\n device_map = {\"\": 0}\n ```\n\n This works.\n\n ### Failing case\n\n ```python\n device_map = {\n \"model.vision_tower\": \"cpu\",\n \"model.multi_modal_projector\": \"cpu\",\n \"model.audio_tower\": \"cpu\",\n \"\": 0,\n }\n ```\n\n ### Quantization config\n\n ```python\n from transformers import BitsAndBytesConfig\n import torch\n\n quant_config = BitsAndBytesConfig(\n load_in_4bit=True,\n bnb_4bit_quant_type=\"nf4\",\n bnb_4bit_use_double_quant=True,\n bnb_4bit_compute_dtype=torch.bfloat16,\n llm_int8_enable_fp32_cpu_offload=True,\n )\n ```\n\n ### Base model load\n\n ```python\n base_model = Gemma4ForConditionalGeneration.from_pretrained(\n MODEL_ID,\n quantization_config=quant_config,\n device_map=device_map,\n max_memory=max_memory,\n offload_folder=r\"E:\\Folder\\offload_temp\",\n dtype=torch.bfloat16,\n attn_implementation=\"sdpa\",\n trust_remote_code=False,\n low_cpu_mem_usage=False,\n )\n ```\n\n ### PEFT adapter load\n\n ```python\n from peft import PeftModel\n\n if isinstance(base_model, PeftModel):\n base_model = base_model.merge_and_unload()\n\n model = PeftModel.from_pretrained(\n base_model,\n lora_path,\n adapter_name=adapter_name,\n is_trainable=False,\n )\n ```\n\n ### Error\n\n ```text\n PeftModel.from_pretrained\n → 
load_adapter\n → dispatch_model\n → attach_align_device_hook_on_blocks\n → attach_execution_device_hook\n → module.state_dict()\n → bitsandbytes Linear4bit._save_to_state_dict\n → self.weight.quant_state.as_dict(packed=True)\n → \"nested_offset\": self.offset.item()\n → RuntimeError: Tensor.item() cannot be called on meta tensors\n ```\n\n Relevant traceback tail:\n\n ```text\n File \"...peft\\peft_model.py\", line 1475, in load_adapter\n dispatch_model(\n\n File \"...accelerate\\big_modeling.py\", line 432, in dispatch_model\n attach_align_device_hook_on_blocks(\n\n File \"...accelerate\\hooks.py\", line 459, in attach_execution_device_hook\n if not hasattr(module, \"_hf_hook\") and len(module.state_dict()) > 0:\n\n File \"...torch\\nn\\modules\\module.py\", line 2260, in state_dict\n module.state_dict(\n\n File \"...bitsandbytes\\nn\\modules.py\", line 525, in _save_to_state_dict\n for k, v in self.weight.quant_state.as_dict(packed=True).items():\n\n File \"...bitsandbytes\\functional.py\", line 581, in as_dict\n \"nested_offset\": self.offset.item(),\n\n File \"...torch_meta_registrations.py\", line 7457, in meta_local_scalar_dense\n raise RuntimeError(\"Tensor.item() cannot be called on meta tensors\")\n\n RuntimeError: Tensor.item() cannot be called on meta tensors\n ```\n\n ### Expected behavior\n\n One of the following:\n\n 1. `PeftModel.from_pretrained()` should preserve the already-dispatched base model layout without triggering a bitsandbytes quant-state serialization path that reads `meta` tensors.\n 2. Accelerate hook attachment should avoid calling `state_dict()` on bitsandbytes `Linear4bit` modules whose quant-state may contain offloaded/meta placeholders.\n 3. bitsandbytes `QuantState.as_dict(packed=True)` should either materialize/move the nested offset before `.item()` or fail with a clearer unsupported-configuration error.\n 4. 
If this configuration is unsupported, the error should be raised before adapter loading with an explicit message.\n\n ### Actual behavior\n\n The base model can be loaded with the CPU/GPU `device_map`, but PEFT adapter loading triggers an additional Accelerate dispatch/hook path and fails inside bitsandbytes nested quantization-state serialization.\n\n ### Why this seems cross-library\n\n My current read:\n\n - PEFT triggers the failing path by loading the adapter with `PeftModel.from_pretrained()`.\n - Accelerate attaches dispatch/execution hooks and calls `module.state_dict()`.\n - bitsandbytes owns `Linear4bit`, `Params4bit`, and `QuantState.as_dict(packed=True)`.\n - Transformers owns the Gemma 4 integration and bitsandbytes quantizer integration.\n\n I am not sure which repository should own the final fix, but this seems to start from the Transformers/PEFT integration path.\n\n ### Additional notes\n\n - The all-GPU path works with `device_map={\"\": 0}`.\n - The failure only appears with CPU/GPU dispatch.\n - The failing field is `nested_offset`, which appears tied to `bnb_4bit_use_double_quant=True`.\n - For quantized models with CPU entries in `device_map`, `llm_int8_enable_fp32_cpu_offload=True` appears necessary even though the flag name says `int8`.\n - Passing `device_map` to `PeftModel.from_pretrained()` is not a valid workaround; it causes a separate Gemma 4 shared-KV generation failure. 
I can open that as a separate issue if preferred.\n\n ### Diagnostic snippet\n\n ```python\n def find_bnb_meta_quant_state(model):\n bad = []\n for name, module in model.named_modules():\n weight = getattr(module, \"weight\", None)\n quant_state = getattr(weight, \"quant_state\", None)\n if quant_state is None:\n continue\n\n for attr in [\"absmax\", \"code\", \"offset\"]:\n value = getattr(quant_state, attr, None)\n if value is not None and getattr(value, \"is_meta\", False):\n bad.append((name, f\"weight.quant_state.{attr}\", str(value.device)))\n\n state2 = getattr(quant_state, \"state2\", None)\n if state2 is not None:\n for attr in [\"absmax\", \"code\", \"offset\"]:\n value = getattr(state2, attr, None)\n if value is not None and getattr(value, \"is_meta\", False):\n bad.append((name, f\"weight.quant_state.state2.{attr}\", str(value.device)))\n return bad\n\n print(\"hf_device_map:\", getattr(base_model, \"hf_device_map\", None))\n print(\"bnb quant_state meta entries:\", find_bnb_meta_quant_state(base_model)[:20])\n ```\n\n ### Module-name verification snippet\n\n ```python\n for name, module in base_model.named_modules():\n lname = name.lower()\n if any(k in lname for k in [\"vision\", \"audio\", \"project\", \"embed\", \"multi\"]):\n print(name, type(module).__name__)\n ```\n\n ### Questions\n\n 1. Is `PeftModel.from_pretrained()` expected to support adapter loading on an already CPU/GPU-dispatched bitsandbytes 4-bit model?\n 2. Should PEFT avoid redispatching a model that already has `hf_device_map`?\n 3. Should Accelerate avoid calling `state_dict()` during hook attachment for bitsandbytes `Linear4bit` modules?\n 4. Should bitsandbytes handle `QuantState.offset` on `meta` more defensively in `as_dict(packed=True)`?\n 5. 
Is the recommended workaround to use all-GPU placement, to use the native Transformers `load_adapter`, or to avoid runtime PEFT injection on offloaded bnb 4-bit models?\n\n ### Relevant links\n\n ```text\n Accelerate big model dispatch docs:\n https://huggingface.co/docs/accelerate/package_reference/big_modeling\n\n Transformers bitsandbytes docs:\n https://huggingface.co/docs/transformers/quantization/bitsandbytes\n\n PEFT PeftModel docs:\n https://huggingface.co/docs/peft/package_reference/peft_model\n\n PEFT ephemeral_gpu_offload docs:\n https://huggingface.co/docs/peft/developer_guides/lora\n\n Transformers native PEFT adapter integration:\n https://huggingface.co/docs/transformers/en/peft\n\n bitsandbytes QuantState source:\n https://github.com/bitsandbytes-foundation/bitsandbytes/blob/main/bitsandbytes/functional.py\n\n Related _is_hf_initialized issue family:\n https://github.com/huggingface/transformers/issues/43872\n ```\n\n\n* * *\n\n# Issue 2\n\n## Target repo\n\n\n huggingface/transformers\n\n\n## Suggested title\n\n\n Gemma4 + PeftModel.from_pretrained(device_map=...) breaks shared_kv_states during generate: KeyError 22\n\n\n## Suggested labels\n\n\n bug, Gemma4, generation, shared-kv-cache, PEFT, Accelerate, device_map\n\n\n## Body\n\n\n ### System Info\n\n - OS: Windows\n - Python: please fill\n - GPU: please fill\n - NVIDIA driver: please fill\n - CUDA: please fill\n - torch: 2.8.0+cu129\n - transformers: 5.6.2\n - accelerate: 1.14.0.dev0\n - bitsandbytes: 0.49.2\n - peft: 0.19.1\n - model: Gemma 4 E4B IT\n - quantization: bitsandbytes 4-bit NF4\n - adapter type: LoRA\n - attention implementation: sdpa\n - trust_remote_code: False\n\n ### Summary\n\n A Gemma 4 E4B IT model works when loaded fully on GPU with:\n\n ```python\n device_map = {\"\": 0}\n ```\n\n A CPU/GPU-dispatched base model can also be loaded. 
However, if I pass the same base-model `device_map` to `PeftModel.from_pretrained()`, adapter loading gets farther, but generation fails inside Gemma 4 shared-KV attention with:\n\n ```text\n KeyError: 22\n ```\n\n The failure line is:\n\n ```python\n key_states, value_states = shared_kv_states[self.kv_shared_layer_index]\n ```\n\n This suggests that the PEFT/Accelerate redispatch layout breaks Gemma 4 shared-KV bookkeeping during generation.\n\n ### Base model load\n\n ```python\n device_map = {\n \"model.vision_tower\": \"cpu\",\n \"model.multi_modal_projector\": \"cpu\",\n \"model.audio_tower\": \"cpu\",\n \"\": 0,\n }\n\n base_model = Gemma4ForConditionalGeneration.from_pretrained(\n MODEL_ID,\n quantization_config=quant_config,\n device_map=device_map,\n max_memory=max_memory,\n offload_folder=r\"E:\\Folder\\offload_temp\",\n dtype=torch.bfloat16,\n attn_implementation=\"sdpa\",\n trust_remote_code=False,\n low_cpu_mem_usage=False,\n )\n ```\n\n ### PEFT load that triggers the generation failure\n\n ```python\n from peft import PeftModel\n\n if isinstance(base_model, PeftModel):\n base_model = base_model.merge_and_unload()\n\n model = PeftModel.from_pretrained(\n base_model,\n lora_path,\n adapter_name=adapter_name,\n device_map=device_map,\n is_trainable=False,\n )\n ```\n\n ### Generation\n\n ```python\n outputs = model.generate(\n **inputs,\n max_new_tokens=max_new_tokens,\n use_cache=True,\n )\n ```\n\n ### Error\n\n ```text\n File \"...peft\\peft_model.py\", line 2122, in generate\n outputs = self.base_model.generate(*args, **kwargs)\n\n File \"...transformers\\generation\\utils.py\", line 3768, in _prefill\n return self(**model_inputs, return_dict=True)\n\n File \"...transformers\\models\\gemma4\\modeling_gemma4.py\", line 2516, in forward\n outputs = self.model(\n\n File \"...transformers\\models\\gemma4\\modeling_gemma4.py\", line 2374, in forward\n outputs = self.language_model(\n\n File \"...transformers\\models\\gemma4\\modeling_gemma4.py\", line 
1675, in forward\n hidden_states = decoder_layer(\n\n File \"...transformers\\models\\gemma4\\modeling_gemma4.py\", line 1379, in forward\n hidden_states, _ = self.self_attn(\n\n File \"...transformers\\models\\gemma4\\modeling_gemma4.py\", line 1219, in forward\n key_states, value_states = shared_kv_states[self.kv_shared_layer_index]\n\n KeyError: 22\n ```\n\n ### Expected behavior\n\n One of the following:\n\n 1. `PeftModel.from_pretrained(..., device_map=...)` should preserve Gemma 4 shared-KV generation behavior.\n 2. Passing a base-model `device_map` into PEFT should be rejected or documented as unsupported for Gemma 4 shared-KV models.\n 3. Gemma 4 should validate/populate `shared_kv_states` robustly when Accelerate hooks / PEFT wrapping are involved.\n 4. PEFT/Accelerate should avoid a redispatch/hook layout that changes the execution path needed for Gemma 4 shared-KV state capture.\n\n ### Actual behavior\n\n The model loads and reaches `generate()`, but the first generation prefill fails because `shared_kv_states` does not contain the expected source-layer key.\n\n ### Why this seems related to PEFT/Accelerate redispatch\n\n The failure only appears after passing `device_map` to `PeftModel.from_pretrained()`. 
That appears to perform a second dispatch over the PEFT-wrapped model, rather than simply “offloading PEFT too.”\n\n The same base model works in the all-GPU case, and the first failure mode, without a PEFT `device_map`, is different: adapter loading fails during Accelerate/bitsandbytes hook/state handling.\n\n ### Notes\n\n - Gemma 4 uses a shared KV cache: later layers can reuse K/V tensors from earlier layers instead of computing their own.\n - This failure appears to be specific to Gemma 4’s shared-KV architecture.\n - For a smaller Gemma 4 reproduction, an equivalent failure can surface as `KeyError: 13`, depending on layer count / shared-KV layout.\n - Passing `device_map` to PEFT should not be recommended as a workaround for the adapter-load-time offload issue if it can break generation.\n\n ### Questions\n\n 1. Is `PeftModel.from_pretrained(..., device_map=...)` supported for Gemma 4 models with a shared KV cache?\n 2. Should PEFT avoid redispatching a base model that was already loaded with `device_map`?\n 3. Should Gemma 4 shared-KV state handling be robust to Accelerate hooks and PEFT wrapping?\n 4. 
Should the docs recommend `offload_dir`, `offload_buffers`, and `ephemeral_gpu_offload` instead of passing the same base `device_map` into PEFT?\n\n ### Relevant links\n\n ```text\n Gemma 4 shared KV cache background:\n https://huggingface.co/blog/gemma4\n\n Gemma 4 Transformers docs:\n https://huggingface.co/docs/transformers/model_doc/gemma4\n\n Accelerate big model dispatch docs:\n https://huggingface.co/docs/accelerate/package_reference/big_modeling\n\n PEFT PeftModel docs:\n https://huggingface.co/docs/peft/package_reference/peft_model\n\n Transformers native PEFT adapter integration:\n https://huggingface.co/docs/transformers/en/peft\n\n Related KV-shared layer discussion in another runtime:\n https://github.com/microsoft/onnxruntime/issues/28188\n ```\n\n\n* * *\n\n# Optional Issue 3\n\nOnly open this if you want a docs/UX issue in `huggingface/peft`, or if maintainers ask you to separate offload-argument handling from the Gemma 4/bnb failure.\n\n## Target repo\n\n\n huggingface/peft\n\n\n## Suggested title\n\n\n Clarify offload_dir/offload_folder behavior for PeftModel.from_pretrained on already-dispatched models\n\n\n## Body\n\n\n ### Summary\n\n When loading a PEFT adapter on top of a base model that was already loaded with a custom `device_map` and `offload_folder`, it is not obvious which offload arguments should be passed to `PeftModel.from_pretrained()`.\n\n The base model load uses:\n\n ```python\n base_model = AutoModelForCausalLM.from_pretrained(\n MODEL_ID,\n device_map=device_map,\n offload_folder=OFFLOAD_FOLDER,\n ...\n )\n ```\n\n But during PEFT adapter loading, the dispatch path may require:\n\n ```python\n model = PeftModel.from_pretrained(\n base_model,\n adapter_path,\n offload_dir=OFFLOAD_FOLDER,\n offload_buffers=True,\n ephemeral_gpu_offload=True,\n )\n ```\n\n This is confusing because Transformers’ `from_pretrained()` uses `offload_folder`, while Accelerate/PEFT redispatch paths refer to `offload_dir`.\n\n ### Request\n\n Please clarify in PEFT 
docs:\n\n 1. Whether `PeftModel.from_pretrained()` supports already-dispatched/offloaded base models.\n 2. Whether users should pass `offload_dir` when the base model was loaded with `offload_folder`.\n 3. Whether passing `device_map` to `PeftModel.from_pretrained()` is recommended or discouraged when the base model already has `hf_device_map`.\n 4. Whether `offload_buffers=True` is recommended for partially offloaded quantized models.\n 5. Whether `ephemeral_gpu_offload=True` is intended for this scenario.\n\n ### Relevant links\n\n ```text\n PEFT PeftModel docs:\n https://huggingface.co/docs/peft/package_reference/peft_model\n\n PEFT LoRA / ephemeral_gpu_offload docs:\n https://huggingface.co/docs/peft/developer_guides/lora\n\n Older PEFT offload_dir issue:\n https://github.com/huggingface/peft/issues/225\n\n Accelerate big model dispatch docs:\n https://huggingface.co/docs/accelerate/package_reference/big_modeling\n ```\n\n\n* * *\n\n## My recommendation\n\nOpen **Issue 1** first. It contains the main failure and enough context for maintainers to route ownership. Open **Issue 2** separately if they want the shared-KV failure split out, or if you want cleaner tracking from the start.\n\n**Do not open Issue 3 first.** It is useful, but it is a docs/UX issue. The core bug is Issue 1.",
"title": "CPU offloading error scenario"
}