Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreif4wccbvga5kfocgowufwg4fljtcyjajyqtz5626g2aa2vh2whwhe",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mkbpvfsdq3m2"
  },
  "path": "/t/cpu-offloading-error-scenario/175522#post_5",
  "publishedAt": "2026-04-24T22:04:38.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "GitHub",
    "Hugging Face"
  ],
  "textContent": "I’ll post a draft of the issue for now:\n\n* * *\n\nThe **good actual issues** to raise are these, in this order.\n\n## Issue 1 — Primary: PEFT adapter loading fails on an already CPU/GPU-dispatched bnb 4-bit Gemma 4 model\n\n**File first at:** `huggingface/transformers`\n**Mention/cross-link:** `huggingface/peft`, `huggingface/accelerate`, `bitsandbytes-foundation/bitsandbytes`\n\n### Suggested title\n\n\n    PeftModel.from_pretrained on CPU/GPU-dispatched 4-bit Gemma4 fails during Accelerate hook attachment in bitsandbytes QuantState/Params4bit\n\n\n### Why this is the strongest issue\n\nThis is the core failure:\n\n\n    Base Gemma 4 loads with custom CPU/GPU device_map.\n    All-GPU Gemma 4 + PEFT works.\n    PEFT adapter loading triggers Accelerate dispatch/hook logic.\n    The failure occurs inside bitsandbytes 4-bit state/parameter handling.\n\n\nThe concrete failure variants are related, not contradictory:\n\n\n    Tensor.item() cannot be called on meta tensors\n    → bitsandbytes QuantState.as_dict(packed=True)\n    → nested_offset = self.offset.item()\n\n\nand, on nearby version/config paths:\n\n\n    Params4bit.__new__() got an unexpected keyword argument '_is_hf_initialized'\n\n\nThe `_is_hf_initialized` family is already visible in upstream issue traffic around Transformers/Accelerate/bitsandbytes parameter reconstruction; there is a current issue for the analogous `Int8Params` case, and another issue describing `_is_hf_initialized` being passed into parameter reconstruction paths. (GitHub)\n\n### Core issue statement\n\nUse wording like this:\n\n\n    The base model can be loaded with a split CPU/GPU device_map, and the all-GPU PEFT path works. The failure appears when loading a PEFT adapter onto the already-dispatched bitsandbytes 4-bit Gemma 4 base model. PeftModel.from_pretrained appears to trigger an additional Accelerate dispatch/hook path. 
That path fails inside bitsandbytes 4-bit quant-state or Params4bit handling.\n\n\n### Why Transformers first\n\nTransformers is the best first repo because this issue crosses:\n\n  * Gemma 4 model integration;\n  * bitsandbytes quantization integration;\n  * device-map loading behavior;\n  * PEFT adapter integration expectations;\n  * current `_is_hf_initialized` loading behavior.\n\n\n\nAccelerate owns `dispatch_model()` and hook attachment; its docs define dispatching models across GPU, CPU, and disk according to `device_map`, and public Accelerate source/doc snippets show hook attachment is central to this path. (Hugging Face)\n\nbitsandbytes owns `Linear4bit`, `Params4bit`, and `QuantState`, but the failure is triggered by the HF integration path. So file at Transformers first and let maintainers route if needed.\n\n* * *\n\n## Issue 2 — Secondary: Passing `device_map` to PEFT breaks Gemma 4 shared-KV generation\n\n**File first at:** `huggingface/transformers`\n**Mention/cross-link:** `huggingface/peft`, `huggingface/accelerate`\n\n### Suggested title\n\n\n    Gemma4 + PeftModel.from_pretrained(device_map=...) breaks shared_kv_states during generate: KeyError 22\n\n\n### Why this is a separate issue\n\nThis is not the same failure as the PEFT-load/bitsandbytes failure. It occurs later, during generation:\n\n\n    Gemma4 self_attn forward\n    → shared_kv_states[self.kv_shared_layer_index]\n    → KeyError: 22\n\n\nThis happens only after using:\n\n\n    PeftModel.from_pretrained(..., device_map=device_map)\n\n\nThat is important because passing `device_map` into PEFT is not simply “offload PEFT too.” It asks PEFT/Accelerate to redispatch the PEFT-wrapped model, using names/layout assumptions that may no longer match the original base model.\n\nGemma 4 has shared-KV-cache behavior where later layers reuse key/value states from earlier layers. 
If a second dispatch/hook pass changes the execution/capture path, the dict entry expected by the shared layer may not be present. The Gemma 4 architecture writeup describes the shared-KV-cache mechanism; Unsloth’s Gemma 4 guide also calls out shared KV state across E2B/E4B layers. (GitHub)\n\n### Core issue statement\n\n\n    Passing the same base model device_map to PeftModel.from_pretrained avoids the initial adapter-load failure, but generation then fails in Gemma 4 shared-KV attention with KeyError. This suggests the PEFT/Accelerate redispatch layout breaks Gemma 4 shared_kv_states bookkeeping.\n\n\n### Why this deserves its own issue\n\nBecause the fix for Issue 1 may not automatically fix Issue 2. Issue 1 is about PEFT adapter loading over bnb 4-bit offload. Issue 2 is about Gemma 4 generation semantics after PEFT-level redispatch.\n\nDo not merge them into one maintainer action item unless you present Issue 2 as a “related second symptom.”\n\n* * *\n\n## Issue 3 — Optional/supporting: PEFT offload-dir / offload-folder handling is confusing or under-documented\n\n**File at:** `huggingface/peft`\n\n### Suggested title\n\n\n    Clarify offload_dir/offload_folder handling for PeftModel.from_pretrained on already-dispatched models\n\n\n### Why it is lower priority\n\nThis is probably not the root cause of the current Gemma 4 failure, but it is part of the same user-facing confusion.\n\nThere are existing PEFT issues about `PeftModel.from_pretrained()` failing with:\n\n\n    ValueError: We need an `offload_dir` to dispatch this model according to this `device_map`\n\n\nand about inconsistent `offload_dir` / `offload_folder` naming. 
(GitHub)\n\nThis is worth mentioning in Issue 1 as context, but I would not file it first unless your minimal repro specifically lands on the missing `offload_dir` error.\n\n* * *\n\n# What I would not file\n\n## Not this\n\n\n    PEFT expects vision/audio towers to be on GPU.\n\n\nThat is too broad and likely inaccurate.\n\nBetter:\n\n\n    PEFT adapter loading triggers redispatch/hook handling on an already-dispatched bnb 4-bit Gemma 4 model, and that dispatch path fails.\n\n\n## Not this\n\n\n    CPU offloading is broken.\n\n\nToo broad. The base model can load with CPU/GPU dispatch; Accelerate supports dispatching layers across GPU, CPU, and disk by design. (Hugging Face)\n\nBetter:\n\n\n    Runtime PEFT adapter loading on top of a CPU/GPU-dispatched bitsandbytes 4-bit Gemma 4 base is broken on this version set.\n\n\n## Not this as the main issue\n\n\n    model.multi_modal_projector offload fails.\n\n\nOnly file a projector-specific issue after verifying that exact module key exists in the actual model. For Gemma 4 variants, bridge/module names can differ.\n\n* * *\n\n# Recommended filing plan\n\n## Best plan\n\nOpen **one primary Transformers issue** with two sections:\n\n\n    A. Primary failure: PeftModel.from_pretrained on split-device bnb 4-bit Gemma4 fails during adapter load.\n    B. Related failure: adding device_map to PEFT avoids load error but causes Gemma4 shared_kv_states KeyError during generate.\n\n\nThen add:\n\n\n    I can split the shared_kv_states issue into a separate ticket if maintainers prefer.\n\n\nThis is efficient because maintainers can see the relationship.\n\n## If you want the cleanest tracking\n\nOpen two separate issues:\n\n  1. **Transformers Issue A:** bnb 4-bit + PEFT + Accelerate dispatch failure.\n  2. 
**Transformers Issue B:** Gemma 4 shared-KV `KeyError` when `device_map` is passed to PEFT.\n\n\n\nThen cross-link them.\n\n* * *\n\n# Minimal titles to use\n\n## Best title for main issue\n\n\n    PeftModel.from_pretrained on CPU/GPU-dispatched 4-bit Gemma4 fails during Accelerate hook attachment in bitsandbytes QuantState/Params4bit\n\n\n## Best title for related issue\n\n\n    Gemma4 + PeftModel.from_pretrained(device_map=...) breaks shared_kv_states during generate: KeyError 22\n\n\n## Optional PEFT docs/UX issue\n\n\n    Clarify offload_dir/offload_folder behavior when loading PEFT adapters on already-dispatched models\n\n\n* * *\n\n# Key evidence to include\n\nInclude this exact contrast:\n\n\n    Works:\n    device_map = {\"\": 0}\n\n    Fails:\n    device_map = {\n        \"model.vision_tower\": \"cpu\",\n        \"model.audio_tower\": \"cpu\",\n        \"\": 0,\n    }\n\n\nMention `model.multi_modal_projector` only if verified by `named_modules()`.\n\nInclude quant config:\n\n\n    BitsAndBytesConfig(\n        load_in_4bit=True,\n        bnb_4bit_quant_type=\"nf4\",\n        bnb_4bit_use_double_quant=True,\n        bnb_4bit_compute_dtype=torch.bfloat16,\n        llm_int8_enable_fp32_cpu_offload=True,\n    )\n\n\nMention that `llm_int8_enable_fp32_cpu_offload=True` is required/expected for CPU/disk entries in many bnb quantized `device_map` paths, even though the name is confusing; Transformers’ bitsandbytes docs describe CPU/GPU offload behavior in this quantization area. (GitHub)\n\nInclude the exact two trace tails:\n\n\n    Linear4bit._save_to_state_dict\n    → weight.quant_state.as_dict(packed=True)\n    → nested_offset = self.offset.item()\n    → Tensor.item() cannot be called on meta tensors\n\n\nand:\n\n\n    Gemma4Attention.forward\n    → shared_kv_states[self.kv_shared_layer_index]\n    → KeyError: 22\n\n\n* * *\n\n# Bottom line\n\nThe actual issues you are raising are:\n\n  1. 
**Primary bug:** PEFT adapter loading on an already CPU/GPU-dispatched bitsandbytes 4-bit Gemma 4 model triggers Accelerate redispatch/hook handling and fails inside bitsandbytes 4-bit state/parameter handling.\n\n  2. **Secondary bug:** Passing `device_map` into PEFT is not a valid workaround for Gemma 4; it can break shared-KV generation with `KeyError: 22`.\n\n  3. **Optional docs/UX issue:** PEFT/Accelerate offload args are confusing around `offload_dir`, `offload_folder`, and already-dispatched base models.\n\n\n\n\nThose are good, concrete, maintainable issues.\n\n* * *\n\nBelow are **ready-to-paste GitHub issues**. I would open **Issue 1 first** in `huggingface/transformers`. If maintainers ask to split the shared-KV failure, open **Issue 2** separately. This framing matches Accelerate’s documented role in dispatching models across GPU/CPU/disk, PEFT’s adapter-loading surface, bitsandbytes 4-bit quant-state handling, and Gemma 4’s shared-KV-cache architecture. (Hugging Face)\n\n* * *\n\n# Issue 1\n\n## Target repo\n\n\n    huggingface/transformers\n\n\n## Suggested title\n\n\n    PeftModel.from_pretrained on CPU/GPU-dispatched 4-bit Gemma4 fails during Accelerate hook attachment in bitsandbytes QuantState\n\n\n## Suggested labels\n\n\n    bug, Gemma4, PEFT, Accelerate, bitsandbytes, quantization, device_map, cpu-offload\n\n\n## Body\n\n\n    ### System Info\n\n    - OS: Windows\n    - Python: please fill\n    - GPU: please fill\n    - NVIDIA driver: please fill\n    - CUDA: please fill\n    - torch: 2.8.0+cu129\n    - transformers: 5.6.2\n    - accelerate: 1.14.0.dev0\n    - bitsandbytes: 0.49.2\n    - peft: 0.19.1\n    - model: Gemma 4 E4B IT\n    - quantization: bitsandbytes 4-bit NF4\n    - adapter type: LoRA\n    - attention implementation: sdpa\n    - trust_remote_code: False\n\n    ### Summary\n\n    A Gemma 4 E4B IT base model works when loaded fully on GPU with:\n\n    ```python\n    device_map = {\"\": 0}\n    ```\n\n    However, loading the same 
base model with a custom CPU/GPU `device_map` and then loading a PEFT adapter with `PeftModel.from_pretrained()` fails during adapter loading.\n\n    The failure appears when PEFT adapter loading calls into Accelerate dispatch/hook logic. Accelerate then calls `module.state_dict()` while attaching execution hooks, which reaches bitsandbytes `Linear4bit._save_to_state_dict()`. bitsandbytes then serializes `weight.quant_state.as_dict(packed=True)` and fails because the nested quantization offset (`QuantState.offset`) is still on the `meta` device:\n\n    ```text\n    RuntimeError: Tensor.item() cannot be called on meta tensors\n    ```\n\n    The all-GPU path works. The failure appears specifically when the base model is already CPU/GPU-dispatched and quantized with bitsandbytes 4-bit double quantization.\n\n    ### Working case\n\n    ```python\n    device_map = {\"\": 0}\n    ```\n\n    This works.\n\n    ### Failing case\n\n    ```python\n    device_map = {\n        \"model.vision_tower\": \"cpu\",\n        \"model.multi_modal_projector\": \"cpu\",  # keep only if named_modules() confirms this key\n        \"model.audio_tower\": \"cpu\",\n        \"\": 0,\n    }\n    ```\n\n    ### Quantization config\n\n    ```python\n    from transformers import BitsAndBytesConfig\n    import torch\n\n    quant_config = BitsAndBytesConfig(\n        load_in_4bit=True,\n        bnb_4bit_quant_type=\"nf4\",\n        bnb_4bit_use_double_quant=True,\n        bnb_4bit_compute_dtype=torch.bfloat16,\n        llm_int8_enable_fp32_cpu_offload=True,\n    )\n    ```\n\n    ### Base model load\n\n    ```python\n    from transformers import Gemma4ForConditionalGeneration\n\n    base_model = Gemma4ForConditionalGeneration.from_pretrained(\n        MODEL_ID,\n        quantization_config=quant_config,\n        device_map=device_map,\n        max_memory=max_memory,\n        offload_folder=r\"E:\\Folder\\offload_temp\",\n        dtype=torch.bfloat16,\n        attn_implementation=\"sdpa\",\n        trust_remote_code=False,\n        low_cpu_mem_usage=False,\n    )\n    ```\n\n    ### PEFT adapter load\n\n    ```python\n    from peft 
import PeftModel\n\n    if isinstance(base_model, PeftModel):\n        base_model = base_model.merge_and_unload()\n\n    model = PeftModel.from_pretrained(\n        base_model,\n        lora_path,\n        adapter_name=adapter_name,\n        is_trainable=False,\n    )\n    ```\n\n    ### Error\n\n    ```text\n    PeftModel.from_pretrained\n    → load_adapter\n    → dispatch_model\n    → attach_align_device_hook_on_blocks\n    → attach_execution_device_hook\n    → module.state_dict()\n    → bitsandbytes Linear4bit._save_to_state_dict\n    → self.weight.quant_state.as_dict(packed=True)\n    → \"nested_offset\": self.offset.item()\n    → RuntimeError: Tensor.item() cannot be called on meta tensors\n    ```\n\n    Relevant traceback tail:\n\n    ```text\n    File \"...peft\\peft_model.py\", line 1475, in load_adapter\n        dispatch_model(\n\n    File \"...accelerate\\big_modeling.py\", line 432, in dispatch_model\n        attach_align_device_hook_on_blocks(\n\n    File \"...accelerate\\hooks.py\", line 459, in attach_execution_device_hook\n        if not hasattr(module, \"_hf_hook\") and len(module.state_dict()) > 0:\n\n    File \"...torch\\nn\\modules\\module.py\", line 2260, in state_dict\n        module.state_dict(\n\n    File \"...bitsandbytes\\nn\\modules.py\", line 525, in _save_to_state_dict\n        for k, v in self.weight.quant_state.as_dict(packed=True).items():\n\n    File \"...bitsandbytes\\functional.py\", line 581, in as_dict\n        \"nested_offset\": self.offset.item(),\n\n    File \"...torch_meta_registrations.py\", line 7457, in meta_local_scalar_dense\n        raise RuntimeError(\"Tensor.item() cannot be called on meta tensors\")\n\n    RuntimeError: Tensor.item() cannot be called on meta tensors\n    ```\n\n    ### Expected behavior\n\n    One of the following:\n\n    1. 
`PeftModel.from_pretrained()` should preserve the already-dispatched base model layout without triggering a bitsandbytes quant-state serialization path that reads `meta` tensors.\n    2. Accelerate hook attachment should avoid calling `state_dict()` on bitsandbytes `Linear4bit` modules whose quant-state may contain offloaded/meta placeholders.\n    3. bitsandbytes `QuantState.as_dict(packed=True)` should either materialize/move the nested offset before `.item()` or fail with a clearer unsupported-configuration error.\n    4. If this configuration is unsupported, the error should be raised before adapter loading with an explicit message.\n\n    ### Actual behavior\n\n    The base model can be loaded with the CPU/GPU `device_map`, but PEFT adapter loading triggers an additional Accelerate dispatch/hook path and fails inside bitsandbytes nested quantization-state serialization.\n\n    ### Why this seems cross-library\n\n    My current read:\n\n    - PEFT triggers the failing path by loading the adapter with `PeftModel.from_pretrained()`.\n    - Accelerate attaches dispatch/execution hooks and calls `module.state_dict()`.\n    - bitsandbytes owns `Linear4bit`, `Params4bit`, and `QuantState.as_dict(packed=True)`.\n    - Transformers owns the Gemma 4 integration and bitsandbytes quantizer integration.\n\n    I am not sure which repository should own the final fix, but this seems to start from the Transformers/PEFT integration path.\n\n    ### Additional notes\n\n    - The all-GPU path works with `device_map={\"\": 0}`.\n    - The failure only appears with CPU/GPU dispatch.\n    - The failing field is `nested_offset`, which appears tied to `bnb_4bit_use_double_quant=True`.\n    - For quantized models with CPU entries in `device_map`, `llm_int8_enable_fp32_cpu_offload=True` appears necessary even though the flag name says `int8`.\n    - Passing `device_map` to `PeftModel.from_pretrained()` is not a valid workaround; it causes a separate Gemma 4 shared-KV generation 
failure. I can open that as a separate issue if preferred.\n\n    ### Diagnostic snippet\n\n    ```python\n    def find_bnb_meta_quant_state(model):\n        \"\"\"List (module name, attribute path, device) for bnb quant-state tensors still on meta.\"\"\"\n        bad = []\n        for name, module in model.named_modules():\n            weight = getattr(module, \"weight\", None)\n            quant_state = getattr(weight, \"quant_state\", None)\n            if quant_state is None:\n                continue\n\n            # Top-level quant state of the 4-bit weight.\n            for attr in [\"absmax\", \"code\", \"offset\"]:\n                value = getattr(quant_state, attr, None)\n                if value is not None and getattr(value, \"is_meta\", False):\n                    bad.append((name, f\"weight.quant_state.{attr}\", str(value.device)))\n\n            # Nested state, present when double quantization is enabled.\n            state2 = getattr(quant_state, \"state2\", None)\n            if state2 is not None:\n                for attr in [\"absmax\", \"code\", \"offset\"]:\n                    value = getattr(state2, attr, None)\n                    if value is not None and getattr(value, \"is_meta\", False):\n                        bad.append((name, f\"weight.quant_state.state2.{attr}\", str(value.device)))\n        return bad\n\n    print(\"hf_device_map:\", getattr(base_model, \"hf_device_map\", None))\n    print(\"bnb quant_state meta entries:\", find_bnb_meta_quant_state(base_model)[:20])\n    ```\n\n    ### Module-name verification snippet\n\n    ```python\n    for name, module in base_model.named_modules():\n        lname = name.lower()\n        if any(k in lname for k in [\"vision\", \"audio\", \"project\", \"embed\", \"multi\"]):\n            print(name, type(module).__name__)\n    ```\n\n    ### Questions\n\n    1. Is `PeftModel.from_pretrained()` expected to support adapter loading on an already CPU/GPU-dispatched bitsandbytes 4-bit model?\n    2. Should PEFT avoid redispatching a model that already has `hf_device_map`?\n    3. Should Accelerate avoid calling `state_dict()` during hook attachment for bitsandbytes `Linear4bit` modules?\n    4. 
Should bitsandbytes handle `QuantState.offset` on `meta` more defensively in `as_dict(packed=True)`?\n    5. Is the recommended workaround to use all-GPU placement, native `load_adapter`, or avoid runtime PEFT injection on offloaded bnb 4-bit models?\n\n    ### Relevant links\n\n    ```text\n    Accelerate big model dispatch docs:\n    https://huggingface.co/docs/accelerate/package_reference/big_modeling\n\n    Transformers bitsandbytes docs:\n    https://huggingface.co/docs/transformers/quantization/bitsandbytes\n\n    PEFT PeftModel docs:\n    https://huggingface.co/docs/peft/package_reference/peft_model\n\n    PEFT ephemeral_gpu_offload docs:\n    https://huggingface.co/docs/peft/developer_guides/lora\n\n    Transformers native PEFT adapter integration:\n    https://huggingface.co/docs/transformers/en/peft\n\n    bitsandbytes QuantState source:\n    https://github.com/bitsandbytes-foundation/bitsandbytes/blob/main/bitsandbytes/functional.py\n\n    Related _is_hf_initialized issue family:\n    https://github.com/huggingface/transformers/issues/43872\n    ```\n\n\n* * *\n\n# Issue 2\n\n## Target repo\n\n\n    huggingface/transformers\n\n\n## Suggested title\n\n\n    Gemma4 + PeftModel.from_pretrained(device_map=...) 
breaks shared_kv_states during generate: KeyError 22\n\n\n## Suggested labels\n\n\n    bug, Gemma4, generation, shared-kv-cache, PEFT, Accelerate, device_map\n\n\n## Body\n\n\n    ### System Info\n\n    - OS: Windows\n    - Python: please fill\n    - GPU: please fill\n    - NVIDIA driver: please fill\n    - CUDA: please fill\n    - torch: 2.8.0+cu129\n    - transformers: 5.6.2\n    - accelerate: 1.14.0.dev0\n    - bitsandbytes: 0.49.2\n    - peft: 0.19.1\n    - model: Gemma 4 E4B IT\n    - quantization: bitsandbytes 4-bit NF4\n    - adapter type: LoRA\n    - attention implementation: sdpa\n    - trust_remote_code: False\n\n    ### Summary\n\n    A Gemma 4 E4B IT model works when loaded fully on GPU with:\n\n    ```python\n    device_map = {\"\": 0}\n    ```\n\n    A CPU/GPU-dispatched base model can also be loaded. However, if I pass the same base-model `device_map` to `PeftModel.from_pretrained()`, adapter loading gets farther, but generation fails inside Gemma 4 shared-KV attention with:\n\n    ```text\n    KeyError: 22\n    ```\n\n    The failure line is:\n\n    ```python\n    key_states, value_states = shared_kv_states[self.kv_shared_layer_index]\n    ```\n\n    This suggests that the PEFT/Accelerate redispatch layout breaks Gemma 4 shared-KV bookkeeping during generation.\n\n    ### Base model load\n\n    ```python\n    device_map = {\n        \"model.vision_tower\": \"cpu\",\n        \"model.multi_modal_projector\": \"cpu\",\n        \"model.audio_tower\": \"cpu\",\n        \"\": 0,\n    }\n\n    base_model = Gemma4ForConditionalGeneration.from_pretrained(\n        MODEL_ID,\n        quantization_config=quant_config,\n        device_map=device_map,\n        max_memory=max_memory,\n        offload_folder=r\"E:\\Folder\\offload_temp\",\n        dtype=torch.bfloat16,\n        attn_implementation=\"sdpa\",\n        trust_remote_code=False,\n        low_cpu_mem_usage=False,\n    )\n    ```\n\n    ### PEFT load that triggers the generation failure\n\n    
```python\n    from peft import PeftModel\n\n    if isinstance(base_model, PeftModel):\n        base_model = base_model.merge_and_unload()\n\n    model = PeftModel.from_pretrained(\n        base_model,\n        lora_path,\n        adapter_name=adapter_name,\n        device_map=device_map,\n        is_trainable=False,\n    )\n    ```\n\n    ### Generation\n\n    ```python\n    outputs = model.generate(\n        **inputs,\n        max_new_tokens=max_new_tokens,\n        use_cache=True,\n    )\n    ```\n\n    ### Error\n\n    ```text\n    File \"...peft\\peft_model.py\", line 2122, in generate\n        outputs = self.base_model.generate(*args, **kwargs)\n\n    File \"...transformers\\generation\\utils.py\", line 3768, in _prefill\n        return self(**model_inputs, return_dict=True)\n\n    File \"...transformers\\models\\gemma4\\modeling_gemma4.py\", line 2516, in forward\n        outputs = self.model(\n\n    File \"...transformers\\models\\gemma4\\modeling_gemma4.py\", line 2374, in forward\n        outputs = self.language_model(\n\n    File \"...transformers\\models\\gemma4\\modeling_gemma4.py\", line 1675, in forward\n        hidden_states = decoder_layer(\n\n    File \"...transformers\\models\\gemma4\\modeling_gemma4.py\", line 1379, in forward\n        hidden_states, _ = self.self_attn(\n\n    File \"...transformers\\models\\gemma4\\modeling_gemma4.py\", line 1219, in forward\n        key_states, value_states = shared_kv_states[self.kv_shared_layer_index]\n\n    KeyError: 22\n    ```\n\n    ### Expected behavior\n\n    One of the following:\n\n    1. `PeftModel.from_pretrained(..., device_map=...)` should preserve Gemma 4 shared-KV generation behavior.\n    2. Passing a base-model `device_map` into PEFT should be rejected or documented as unsupported for Gemma 4 shared-KV models.\n    3. Gemma 4 should validate/populate `shared_kv_states` robustly when Accelerate hooks / PEFT wrapping are involved.\n    4. 
PEFT/Accelerate should avoid a redispatch/hook layout that changes the execution path needed for Gemma 4 shared-KV state capture.\n\n    ### Actual behavior\n\n    The model loads and reaches `generate()`, but the first generation prefill fails because `shared_kv_states` does not contain the expected source-layer key.\n\n    ### Why this seems related to PEFT/Accelerate redispatch\n\n    The failure only appears after passing `device_map` to `PeftModel.from_pretrained()`. That appears to perform a second dispatch over the PEFT-wrapped model, rather than simply “offloading PEFT too.”\n\n    The same base model works in the all-GPU case, and the first failure mode without PEFT `device_map` is different: adapter loading fails during Accelerate/bitsandbytes hook/state handling.\n\n    ### Notes\n\n    - Gemma 4 uses shared KV cache: later layers can reuse K/V tensors from earlier layers instead of computing their own.\n    - This failure appears to be architecture-specific to Gemma 4’s shared-KV path.\n    - For a smaller Gemma 4 reproduction, an equivalent failure can show as `KeyError: 13` depending on layer count / shared-KV layout.\n    - Passing `device_map` to PEFT should not be recommended as a workaround for the adapter-load-time offload issue if it can break generation.\n\n    ### Questions\n\n    1. Is `PeftModel.from_pretrained(..., device_map=...)` supported for Gemma 4 models with shared KV cache?\n    2. Should PEFT avoid redispatching a base model that was already loaded with `device_map`?\n    3. Should Gemma 4 shared-KV state handling be robust to Accelerate hooks and PEFT wrapping?\n    4. 
Should the docs recommend `offload_dir`, `offload_buffers`, and `ephemeral_gpu_offload` instead of passing the same base `device_map` into PEFT?\n\n    ### Relevant links\n\n    ```text\n    Gemma 4 shared KV cache background:\n    https://huggingface.co/blog/gemma4\n\n    Gemma 4 Transformers docs:\n    https://huggingface.co/docs/transformers/model_doc/gemma4\n\n    Accelerate big model dispatch docs:\n    https://huggingface.co/docs/accelerate/package_reference/big_modeling\n\n    PEFT PeftModel docs:\n    https://huggingface.co/docs/peft/package_reference/peft_model\n\n    Transformers native PEFT adapter integration:\n    https://huggingface.co/docs/transformers/en/peft\n\n    Related KV-shared layer discussion in another runtime:\n    https://github.com/microsoft/onnxruntime/issues/28188\n    ```\n\n\n* * *\n\n# Optional Issue 3\n\nOnly open this if you want a docs/UX issue in `huggingface/peft`, or if maintainers ask you to separate offload-argument handling from the Gemma 4/bnb failure.\n\n## Target repo\n\n\n    huggingface/peft\n\n\n## Suggested title\n\n\n    Clarify offload_dir/offload_folder behavior for PeftModel.from_pretrained on already-dispatched models\n\n\n## Body\n\n\n    ### Summary\n\n    When loading a PEFT adapter on top of a base model that was already loaded with a custom `device_map` and `offload_folder`, it is not obvious which offload arguments should be passed to `PeftModel.from_pretrained()`.\n\n    The base model load uses:\n\n    ```python\n    base_model = AutoModelForCausalLM.from_pretrained(\n        MODEL_ID,\n        device_map=device_map,\n        offload_folder=OFFLOAD_FOLDER,\n        ...\n    )\n    ```\n\n    But during PEFT adapter loading, the dispatch path may require:\n\n    ```python\n    model = PeftModel.from_pretrained(\n        base_model,\n        adapter_path,\n        offload_dir=OFFLOAD_FOLDER,\n        offload_buffers=True,\n        ephemeral_gpu_offload=True,\n    )\n    ```\n\n    This is confusing because 
the base-model `from_pretrained()` takes `offload_folder`, while Accelerate/PEFT redispatch paths refer to `offload_dir`.\n\n    ### Request\n\n    Please clarify in PEFT docs:\n\n    1. Whether `PeftModel.from_pretrained()` supports already-dispatched/offloaded base models.\n    2. Whether users should pass `offload_dir` when the base model was loaded with `offload_folder`.\n    3. Whether passing `device_map` to `PeftModel.from_pretrained()` is recommended or discouraged when the base model already has `hf_device_map`.\n    4. Whether `offload_buffers=True` is recommended for partially offloaded quantized models.\n    5. Whether `ephemeral_gpu_offload=True` is intended for this scenario.\n\n    ### Relevant links\n\n    ```text\n    PEFT PeftModel docs:\n    https://huggingface.co/docs/peft/package_reference/peft_model\n\n    PEFT LoRA / ephemeral_gpu_offload docs:\n    https://huggingface.co/docs/peft/developer_guides/lora\n\n    Older PEFT offload_dir issue:\n    https://github.com/huggingface/peft/issues/225\n\n    Accelerate big model dispatch docs:\n    https://huggingface.co/docs/accelerate/package_reference/big_modeling\n    ```\n\n\n* * *\n\n## My recommendation\n\nOpen **Issue 1** first. It contains the main failure and enough context for maintainers to route ownership. Open **Issue 2** separately if they want the shared-KV failure split out, or if you want cleaner tracking from the start.\n\n**Do not open Issue 3 first.** It is useful, but it is a docs/UX issue. The core bug is Issue 1.",
  "title": "CPU offloading error scenario"
}