CPU offloading error scenario

Hugging Face Forums [Unofficial] April 24, 2026

I’ll post a draft of the issue for now:


The issues actually worth raising are these, in this order.

Issue 1 — Primary: PEFT adapter loading fails on an already CPU/GPU-dispatched bnb 4-bit Gemma 4 model

File first at: huggingface/transformers
Mention/cross-link: huggingface/peft, huggingface/accelerate, bitsandbytes-foundation/bitsandbytes

Suggested title

PeftModel.from_pretrained on CPU/GPU-dispatched 4-bit Gemma4 fails during Accelerate hook attachment in bitsandbytes QuantState/Params4bit

Why this is the strongest issue

This is the core failure:

Base Gemma 4 loads with custom CPU/GPU device_map.
All-GPU Gemma 4 + PEFT works.
PEFT adapter loading triggers Accelerate dispatch/hook logic.
The failure occurs inside bitsandbytes 4-bit state/parameter handling.

The concrete failure variants are related, not contradictory:

Tensor.item() cannot be called on meta tensors
→ bitsandbytes QuantState.as_dict(packed=True)
→ nested_offset = self.offset.item()

and, on nearby version/config paths:

Params4bit.__new__() got an unexpected keyword argument '_is_hf_initialized'

The _is_hf_initialized family is already visible in upstream issue traffic around Transformers/Accelerate/bitsandbytes parameter reconstruction; there is a current issue for the analogous Int8Params case, and another issue describing _is_hf_initialized being passed into parameter reconstruction paths. (GitHub)
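To make the first failure variant easier to reason about, here is a minimal pure-Python sketch of the call chain. Every class in it (`FakeMetaScalar`, `FakeScalar`, `ToyQuantState`, `ToyLinear4bit`) and the `attach_hook` function are hypothetical stand-ins written for illustration; none of this is bitsandbytes or Accelerate code. It only assumes what the trace above states: serializing the quant state reads a scalar with `.item()`, and that read cannot succeed when the scalar has no data behind it.

```python
class FakeMetaScalar:
    """Hypothetical stand-in for a tensor on the meta device (shape only, no data)."""

    def item(self):
        raise RuntimeError("Tensor.item() cannot be called on meta tensors")


class FakeScalar:
    """Hypothetical stand-in for a materialized scalar tensor."""

    def __init__(self, value):
        self.value = value

    def item(self):
        return self.value


class ToyQuantState:
    """Toy quant state that carries a nested offset scalar."""

    def __init__(self, offset):
        self.offset = offset

    def as_dict(self):
        # Reading the scalar needs real data behind the tensor.
        return {"nested_offset": self.offset.item()}


class ToyLinear4bit:
    """Toy module whose state_dict() serializes its quant state."""

    def __init__(self, quant_state):
        self.quant_state = quant_state

    def state_dict(self):
        return dict(self.quant_state.as_dict())


def attach_hook(module):
    """Toy hook attachment that inspects state_dict() before hooking the module."""
    if len(module.state_dict()) > 0:
        module.hooked = True
    return module


# Materialized offset: hook attachment succeeds.
attach_hook(ToyLinear4bit(ToyQuantState(FakeScalar(0.5))))

# Meta offset: the same call fails as soon as state_dict() is built.
try:
    attach_hook(ToyLinear4bit(ToyQuantState(FakeMetaScalar())))
except RuntimeError as err:
    print("RuntimeError:", err)
```

The point of the sketch is that the hook code never touches the offset directly. The error surfaces two calls deeper, which is why the report reads as cross-library.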

Core issue statement

Use wording like this:

The base model can be loaded with a split CPU/GPU device_map, and the all-GPU PEFT path works. The failure appears when loading a PEFT adapter onto the already-dispatched bitsandbytes 4-bit Gemma 4 base model. PeftModel.from_pretrained appears to trigger an additional Accelerate dispatch/hook path. That path fails inside bitsandbytes 4-bit quant-state or Params4bit handling.

Why Transformers first

Transformers is the best first repo because this issue crosses:

  • Gemma 4 model integration;
  • bitsandbytes quantization integration;
  • device-map loading behavior;
  • PEFT adapter integration expectations;
  • current _is_hf_initialized loading behavior.

Accelerate owns dispatch_model() and hook attachment; its docs define dispatching models across GPU, CPU, and disk according to device_map, and public Accelerate source/doc snippets show hook attachment is central to this path. (Hugging Face)

bitsandbytes owns Linear4bit, Params4bit, and QuantState, but the failure is triggered by the HF integration path. So file at Transformers first and let maintainers route if needed.


Issue 2 — Secondary: Passing device_map to PEFT breaks Gemma 4 shared-KV generation

File first at: huggingface/transformers
Mention/cross-link: huggingface/peft, huggingface/accelerate

Suggested title

Gemma4 + PeftModel.from_pretrained(device_map=...) breaks shared_kv_states during generate: KeyError 22

Why this is a separate issue

This is not the same failure as the PEFT-load/bitsandbytes failure. It occurs later, during generation:

Gemma4 self_attn forward
→ shared_kv_states[self.kv_shared_layer_index]
→ KeyError: 22

This happens only after using:

PeftModel.from_pretrained(..., device_map=device_map)

That is important because passing device_map into PEFT is not simply “offload PEFT too.” It asks PEFT/Accelerate to redispatch the PEFT-wrapped model, using names/layout assumptions that may no longer match the original base model.

Gemma 4 has shared-KV-cache behavior where later layers reuse key/value states from earlier layers. If a second dispatch/hook pass changes the execution/capture path, the dict entry expected by the shared layer may not be present. The Gemma 4 architecture writeup describes the shared-KV-cache mechanism; Unsloth’s Gemma 4 guide also calls out shared KV state across E2B/E4B layers. (GitHub)
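The bookkeeping problem can be illustrated with a toy model. The sketch below is not Gemma 4's implementation: `ToyAttention`, `run`, and the `skip_store_for` parameter are hypothetical, and `skip_store_for` is only a stand-in for whatever a second dispatch pass might change. It assumes nothing beyond what the trace shows, namely that a reusing layer indexes a dict by `kv_shared_layer_index`.

```python
class ToyAttention:
    """Hypothetical layer: either stores its own K/V or reuses another layer's."""

    def __init__(self, layer_idx, kv_shared_layer_index=None):
        self.layer_idx = layer_idx
        self.kv_shared_layer_index = kv_shared_layer_index

    def forward(self, hidden, shared_kv_states, store=True):
        if self.kv_shared_layer_index is not None:
            # Reusing layer: raises KeyError if the source entry was never stored.
            return shared_kv_states[self.kv_shared_layer_index]
        kv = (f"k{self.layer_idx}({hidden})", f"v{self.layer_idx}({hidden})")
        if store:
            shared_kv_states[self.layer_idx] = kv
        return kv


def run(layers, hidden, skip_store_for=()):
    """Run the layers in order with a fresh shared dict, as in one forward pass."""
    shared_kv_states = {}
    out = None
    for layer in layers:
        out = layer.forward(
            hidden, shared_kv_states, store=layer.layer_idx not in skip_store_for
        )
    return out


# Layer 23 reuses the K/V computed by layer 22.
layers = [ToyAttention(22), ToyAttention(23, kv_shared_layer_index=22)]
print(run(layers, "x"))

# If layer 22's entry never lands in the dict, layer 23 fails on lookup.
try:
    run(layers, "x", skip_store_for={22})
except KeyError as err:
    print("KeyError:", err)
```

The sketch says nothing about why the entry goes missing in the real model. It only shows that the reported `KeyError` is consistent with a source layer's store step being skipped or routed elsewhere.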

Core issue statement

Passing the same base model device_map to PeftModel.from_pretrained avoids the initial adapter-load failure, but generation then fails in Gemma 4 shared-KV attention with KeyError. This suggests the PEFT/Accelerate redispatch layout breaks Gemma 4 shared_kv_states bookkeeping.

Why this deserves its own issue

Because the fix for Issue 1 may not automatically fix Issue 2. Issue 1 is about PEFT adapter loading over bnb 4-bit offload. Issue 2 is about Gemma 4 generation semantics after PEFT-level redispatch.

Do not merge them into one maintainer action item unless you present Issue 2 as a “related second symptom.”


Issue 3 — Optional/supporting: PEFT offload-dir / offload-folder handling is confusing or under-documented

File at: huggingface/peft

Suggested title

Clarify offload_dir/offload_folder handling for PeftModel.from_pretrained on already-dispatched models

Why it is lower priority

This is probably not the root cause of the current Gemma 4 failure, but it is part of the same user-facing confusion.

There are existing PEFT issues about PeftModel.from_pretrained() failing with:

ValueError: We need an `offload_dir` to dispatch this model according to this `device_map`

and about inconsistent offload_dir / offload_folder naming. (GitHub)

This is worth mentioning in Issue 1 as context, but I would not file it first unless your minimal repro specifically lands on the missing offload_dir error.


What I would not file

Not this

PEFT expects vision/audio towers to be on GPU.

That is too broad and likely inaccurate.

Better:

PEFT adapter loading triggers redispatch/hook handling on an already-dispatched bnb 4-bit Gemma 4 model, and that dispatch path fails.

Not this

CPU offloading is broken.

Too broad. The base model can load with CPU/GPU dispatch; Accelerate supports dispatching layers across GPU, CPU, and disk by design. (Hugging Face)

Better:

Runtime PEFT adapter loading on top of a CPU/GPU-dispatched bitsandbytes 4-bit Gemma 4 base is broken on this version set.

Not this as the main issue

model.multi_modal_projector offload fails.

Only file a projector-specific issue after verifying that exact module key exists in the actual model. For Gemma 4 variants, bridge/module names can differ.


Recommended filing plan

Best plan

Open one primary Transformers issue with two sections:

A. Primary failure: PeftModel.from_pretrained on split-device bnb 4-bit Gemma4 fails during adapter load.
B. Related failure: adding device_map to PEFT avoids load error but causes Gemma4 shared_kv_states KeyError during generate.

Then add:

I can split the shared_kv_states issue into a separate ticket if maintainers prefer.

This is efficient because maintainers can see the relationship.

If you want the cleanest tracking

Open two separate issues:

  1. Transformers Issue A: bnb 4-bit + PEFT + Accelerate dispatch failure.
  2. Transformers Issue B: Gemma 4 shared-KV KeyError when device_map is passed to PEFT.

Then cross-link them.


Minimal titles to use

Best title for main issue

PeftModel.from_pretrained on CPU/GPU-dispatched 4-bit Gemma4 fails during Accelerate hook attachment in bitsandbytes QuantState/Params4bit

Best title for related issue

Gemma4 + PeftModel.from_pretrained(device_map=...) breaks shared_kv_states during generate: KeyError 22

Optional PEFT docs/UX issue

Clarify offload_dir/offload_folder behavior when loading PEFT adapters on already-dispatched models

Key evidence to include

Include this exact contrast:

Works:
device_map = {"": 0}

Fails:
device_map = {
    "model.vision_tower": "cpu",
    "model.audio_tower": "cpu",
    "": 0,
}

Mention model.multi_modal_projector only if verified by named_modules().
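One way to do that verification before filing is to compare the `device_map` keys against the module names. The helper below is a sketch: `unmatched_device_map_keys` is a hypothetical name, and the module list in the demo is illustrative only, not the real Gemma 4 layout. On a real model the names would come from `[name for name, _ in base_model.named_modules()]`.

```python
def unmatched_device_map_keys(device_map, module_names):
    """Return device_map keys that do not name any module.

    The empty-string key is treated as the catch-all entry and is always accepted.
    """
    names = set(module_names)
    return [key for key in device_map if key != "" and key not in names]


# Illustrative module names only; replace with names from named_modules().
example_names = ["", "model", "model.vision_tower", "model.audio_tower"]

device_map = {
    "model.vision_tower": "cpu",
    "model.multi_modal_projector": "cpu",
    "model.audio_tower": "cpu",
    "": 0,
}

print(unmatched_device_map_keys(device_map, example_names))
```

Any key the helper returns should be dropped from the repro or replaced with the name the model actually uses.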

Include quant config:

BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_enable_fp32_cpu_offload=True,
)

Mention that llm_int8_enable_fp32_cpu_offload=True is required or expected for CPU/disk entries in many bnb quantized device_map paths, even though the name suggests it applies only to int8; Transformers’ bitsandbytes docs describe CPU/GPU offload behavior in this quantization area. (GitHub)
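A small pre-flight check can make that dependency explicit in a repro script. This is a sketch under the assumption that offload targets appear in the `device_map` as the strings `"cpu"` or `"disk"`; `has_cpu_or_disk_entries` is a hypothetical helper, not part of Transformers or bitsandbytes, and whether the flag is strictly required may depend on the installed versions.

```python
def has_cpu_or_disk_entries(device_map):
    """Return True if any device_map target is "cpu" or "disk"."""
    return any(str(target) in {"cpu", "disk"} for target in device_map.values())


all_gpu = {"": 0}
split = {"model.vision_tower": "cpu", "model.audio_tower": "cpu", "": 0}

for label, dm in [("all-GPU", all_gpu), ("split", split)]:
    if has_cpu_or_disk_entries(dm):
        print(f"{label}: expect to need llm_int8_enable_fp32_cpu_offload=True")
    else:
        print(f"{label}: no CPU/disk entries")
```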

Include the exact two trace tails:

Linear4bit._save_to_state_dict
→ weight.quant_state.as_dict(packed=True)
→ nested_offset = self.offset.item()
→ Tensor.item() cannot be called on meta tensors

and:

Gemma4Attention.forward
→ shared_kv_states[self.kv_shared_layer_index]
→ KeyError: 22

Bottom line

The actual issues you are raising are:

  1. Primary bug: PEFT adapter loading on an already CPU/GPU-dispatched bitsandbytes 4-bit Gemma 4 model triggers Accelerate redispatch/hook handling and fails inside bitsandbytes 4-bit state/parameter handling.

  2. Secondary bug: Passing device_map into PEFT is not a valid workaround for Gemma 4; it can break shared-KV generation with KeyError: 22.

  3. Optional docs/UX issue: PEFT/Accelerate offload args are confusing around offload_dir, offload_folder, and already-dispatched base models.

Those are good, concrete, actionable issues.


Below are ready-to-paste GitHub issues. I would open Issue 1 first in huggingface/transformers. If maintainers ask to split the shared-KV failure, open Issue 2 separately. This framing matches Accelerate’s documented role in dispatching models across GPU/CPU/disk, PEFT’s adapter-loading surface, bitsandbytes 4-bit quant-state handling, and Gemma 4’s shared-KV-cache architecture. (Hugging Face)
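The System Info sections in the drafts leave several fields as "please fill". A short script like the one below can collect the package versions in the same bullet format. It is a sketch that uses only the standard library; `collect_versions` and `PACKAGES` are names chosen here for illustration. It does not cover the GPU, NVIDIA driver, or CUDA fields, which still need to be filled in by hand.

```python
import platform
from importlib import metadata

# Distribution names as they appear in the System Info list.
PACKAGES = ["torch", "transformers", "accelerate", "bitsandbytes", "peft"]


def collect_versions(packages=PACKAGES):
    """Return OS, Python, and installed package versions as a dict."""
    info = {"OS": platform.system(), "Python": platform.python_version()}
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = "not installed"
    return info


for key, value in collect_versions().items():
    print(f"- {key}: {value}")
```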


Issue 1

Target repo

huggingface/transformers

Suggested title

PeftModel.from_pretrained on CPU/GPU-dispatched 4-bit Gemma4 fails during Accelerate hook attachment in bitsandbytes QuantState

Suggested labels

bug, Gemma4, PEFT, Accelerate, bitsandbytes, quantization, device_map, cpu-offload

Body

### System Info

- OS: Windows
- Python: please fill
- GPU: please fill
- NVIDIA driver: please fill
- CUDA: please fill
- torch: 2.8.0+cu129
- transformers: 5.6.2
- accelerate: 1.14.0.dev0
- bitsandbytes: 0.49.2
- peft: 0.19.1
- model: Gemma 4 E4B IT
- quantization: bitsandbytes 4-bit NF4
- adapter type: LoRA
- attention implementation: sdpa
- trust_remote_code: False

### Summary

A Gemma 4 E4B IT base model works when loaded fully on GPU with:

```python
device_map = {"": 0}
```

However, loading the same base model with a custom CPU/GPU `device_map` and then loading a PEFT adapter with `PeftModel.from_pretrained()` fails during adapter loading.

The failure appears when PEFT adapter loading calls into Accelerate dispatch/hook logic. Accelerate then calls `module.state_dict()` while attaching execution hooks, which reaches bitsandbytes `Linear4bit._save_to_state_dict()`. bitsandbytes then serializes `weight.quant_state.as_dict(packed=True)` and fails because a nested quantization scalar is still on the `meta` device:

```text
RuntimeError: Tensor.item() cannot be called on meta tensors
```

The all-GPU path works. The failure appears specifically when the base model is already CPU/GPU-dispatched and quantized with bitsandbytes 4-bit double quantization.

### Working case

```python
device_map = {"": 0}
```

This works.

### Failing case

```python
device_map = {
    "model.vision_tower": "cpu",
    # Keep the projector entry only if named_modules() confirms this exact key.
    "model.multi_modal_projector": "cpu",
    "model.audio_tower": "cpu",
    "": 0,
}
```

### Quantization config

```python
from transformers import BitsAndBytesConfig
import torch

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_enable_fp32_cpu_offload=True,
)
```

### Base model load

```python
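from transformers import Gemma4ForConditionalGeneration

# MODEL_ID and max_memory are defined elsewhere in the script.
base_model = Gemma4ForConditionalGeneration.from_pretrained(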
    MODEL_ID,
    quantization_config=quant_config,
    device_map=device_map,
    max_memory=max_memory,
    offload_folder=r"E:\Folder\offload_temp",
    dtype=torch.bfloat16,
    attn_implementation="sdpa",
    trust_remote_code=False,
    low_cpu_mem_usage=False,
)
```

### PEFT adapter load

```python
from peft import PeftModel

if isinstance(base_model, PeftModel):
    base_model = base_model.merge_and_unload()

model = PeftModel.from_pretrained(
    base_model,
    lora_path,
    adapter_name=adapter_name,
    is_trainable=False,
)
```

### Error

```text
PeftModel.from_pretrained
→ load_adapter
→ dispatch_model
→ attach_align_device_hook_on_blocks
→ attach_execution_device_hook
→ module.state_dict()
→ bitsandbytes Linear4bit._save_to_state_dict
→ self.weight.quant_state.as_dict(packed=True)
→ "nested_offset": self.offset.item()
→ RuntimeError: Tensor.item() cannot be called on meta tensors
```

Relevant traceback tail:

```text
File "...peft\peft_model.py", line 1475, in load_adapter
    dispatch_model(

File "...accelerate\big_modeling.py", line 432, in dispatch_model
    attach_align_device_hook_on_blocks(

File "...accelerate\hooks.py", line 459, in attach_execution_device_hook
    if not hasattr(module, "_hf_hook") and len(module.state_dict()) > 0:

File "...torch\nn\modules\module.py", line 2260, in state_dict
    module.state_dict(

File "...bitsandbytes\nn\modules.py", line 525, in _save_to_state_dict
    for k, v in self.weight.quant_state.as_dict(packed=True).items():

File "...bitsandbytes\functional.py", line 581, in as_dict
    "nested_offset": self.offset.item(),

File "...torch_meta_registrations.py", line 7457, in meta_local_scalar_dense
    raise RuntimeError("Tensor.item() cannot be called on meta tensors")

RuntimeError: Tensor.item() cannot be called on meta tensors
```

### Expected behavior

One of the following:

1. `PeftModel.from_pretrained()` should preserve the already-dispatched base model layout without triggering a bitsandbytes quant-state serialization path that reads `meta` tensors.
2. Accelerate hook attachment should avoid calling `state_dict()` on bitsandbytes `Linear4bit` modules whose quant-state may contain offloaded/meta placeholders.
3. bitsandbytes `QuantState.as_dict(packed=True)` should either materialize/move the nested offset before `.item()` or fail with a clearer unsupported-configuration error.
4. If this configuration is unsupported, the error should be raised before adapter loading with an explicit message.

### Actual behavior

The base model can be loaded with the CPU/GPU `device_map`, but PEFT adapter loading triggers an additional Accelerate dispatch/hook path and fails inside bitsandbytes nested quantization-state serialization.

### Why this seems cross-library

My current read:

- PEFT triggers the failing path by loading the adapter with `PeftModel.from_pretrained()`.
- Accelerate attaches dispatch/execution hooks and calls `module.state_dict()`.
- bitsandbytes owns `Linear4bit`, `Params4bit`, and `QuantState.as_dict(packed=True)`.
- Transformers owns the Gemma 4 integration and bitsandbytes quantizer integration.

I am not sure which repository should own the final fix, but this seems to start from the Transformers/PEFT integration path.

### Additional notes

- The all-GPU path works with `device_map={"": 0}`.
- The failure only appears with CPU/GPU dispatch.
- The failing field is `nested_offset`, which appears tied to `bnb_4bit_use_double_quant=True`.
- For quantized models with CPU entries in `device_map`, `llm_int8_enable_fp32_cpu_offload=True` appears necessary even though the flag name says `int8`.
- Passing `device_map` to `PeftModel.from_pretrained()` is not a valid workaround; it causes a separate Gemma 4 shared-KV generation failure. I can open that as a separate issue if preferred.

### Diagnostic snippet

```python
def find_bnb_meta_quant_state(model):
    bad = []
    for name, module in model.named_modules():
        weight = getattr(module, "weight", None)
        quant_state = getattr(weight, "quant_state", None)
        if quant_state is None:
            continue

        for attr in ["absmax", "code", "offset"]:
            value = getattr(quant_state, attr, None)
            if value is not None and getattr(value, "is_meta", False):
                bad.append((name, f"weight.quant_state.{attr}", str(value.device)))

        state2 = getattr(quant_state, "state2", None)
        if state2 is not None:
            for attr in ["absmax", "code", "offset"]:
                value = getattr(state2, attr, None)
                if value is not None and getattr(value, "is_meta", False):
                    bad.append((name, f"weight.quant_state.state2.{attr}", str(value.device)))
    return bad

print("hf_device_map:", getattr(base_model, "hf_device_map", None))
print("bnb quant_state meta entries:", find_bnb_meta_quant_state(base_model)[:20])
```

### Module-name verification snippet

```python
for name, module in base_model.named_modules():
    lname = name.lower()
    if any(k in lname for k in ["vision", "audio", "project", "embed", "multi"]):
        print(name, type(module).__name__)
```

### Questions

1. Is `PeftModel.from_pretrained()` expected to support adapter loading on an already CPU/GPU-dispatched bitsandbytes 4-bit model?
2. Should PEFT avoid redispatching a model that already has `hf_device_map`?
3. Should Accelerate avoid calling `state_dict()` during hook attachment for bitsandbytes `Linear4bit` modules?
4. Should bitsandbytes handle `QuantState.offset` on `meta` more defensively in `as_dict(packed=True)`?
5. Is the recommended workaround to use all-GPU placement, native `load_adapter`, or avoid runtime PEFT injection on offloaded bnb 4-bit models?

### Relevant links

```text
Accelerate big model dispatch docs:
https://huggingface.co/docs/accelerate/package_reference/big_modeling

Transformers bitsandbytes docs:
https://huggingface.co/docs/transformers/quantization/bitsandbytes

PEFT PeftModel docs:
https://huggingface.co/docs/peft/package_reference/peft_model

PEFT ephemeral_gpu_offload docs:
https://huggingface.co/docs/peft/developer_guides/lora

Transformers native PEFT adapter integration:
https://huggingface.co/docs/transformers/en/peft

bitsandbytes QuantState source:
https://github.com/bitsandbytes-foundation/bitsandbytes/blob/main/bitsandbytes/functional.py

Related _is_hf_initialized issue family:
https://github.com/huggingface/transformers/issues/43872
```

Issue 2

Target repo

huggingface/transformers

Suggested title

Gemma4 + PeftModel.from_pretrained(device_map=...) breaks shared_kv_states during generate: KeyError 22

Suggested labels

bug, Gemma4, generation, shared-kv-cache, PEFT, Accelerate, device_map

Body

### System Info

- OS: Windows
- Python: please fill
- GPU: please fill
- NVIDIA driver: please fill
- CUDA: please fill
- torch: 2.8.0+cu129
- transformers: 5.6.2
- accelerate: 1.14.0.dev0
- bitsandbytes: 0.49.2
- peft: 0.19.1
- model: Gemma 4 E4B IT
- quantization: bitsandbytes 4-bit NF4
- adapter type: LoRA
- attention implementation: sdpa
- trust_remote_code: False

### Summary

A Gemma 4 E4B IT model works when loaded fully on GPU with:

```python
device_map = {"": 0}
```

A CPU/GPU-dispatched base model can also be loaded. However, if I pass the same base-model `device_map` to `PeftModel.from_pretrained()`, adapter loading gets farther, but generation fails inside Gemma 4 shared-KV attention with:

```text
KeyError: 22
```

The failure line is:

```python
key_states, value_states = shared_kv_states[self.kv_shared_layer_index]
```

This suggests that the PEFT/Accelerate redispatch layout breaks Gemma 4 shared-KV bookkeeping during generation.

### Base model load

```python
device_map = {
    "model.vision_tower": "cpu",
    # Keep the projector entry only if named_modules() confirms this exact key.
    "model.multi_modal_projector": "cpu",
    "model.audio_tower": "cpu",
    "": 0,
}

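# quant_config is a 4-bit NF4 BitsAndBytesConfig with double quantization and
# llm_int8_enable_fp32_cpu_offload=True; MODEL_ID and max_memory are defined elsewhere.
base_model = Gemma4ForConditionalGeneration.from_pretrained(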
    MODEL_ID,
    quantization_config=quant_config,
    device_map=device_map,
    max_memory=max_memory,
    offload_folder=r"E:\Folder\offload_temp",
    dtype=torch.bfloat16,
    attn_implementation="sdpa",
    trust_remote_code=False,
    low_cpu_mem_usage=False,
)
```

### PEFT load that triggers the generation failure

```python
from peft import PeftModel

if isinstance(base_model, PeftModel):
    base_model = base_model.merge_and_unload()

model = PeftModel.from_pretrained(
    base_model,
    lora_path,
    adapter_name=adapter_name,
    device_map=device_map,
    is_trainable=False,
)
```

### Generation

```python
outputs = model.generate(
    **inputs,
    max_new_tokens=max_new_tokens,
    use_cache=True,
)
```

### Error

```text
File "...peft\peft_model.py", line 2122, in generate
    outputs = self.base_model.generate(*args, **kwargs)

File "...transformers\generation\utils.py", line 3768, in _prefill
    return self(**model_inputs, return_dict=True)

File "...transformers\models\gemma4\modeling_gemma4.py", line 2516, in forward
    outputs = self.model(

File "...transformers\models\gemma4\modeling_gemma4.py", line 2374, in forward
    outputs = self.language_model(

File "...transformers\models\gemma4\modeling_gemma4.py", line 1675, in forward
    hidden_states = decoder_layer(

File "...transformers\models\gemma4\modeling_gemma4.py", line 1379, in forward
    hidden_states, _ = self.self_attn(

File "...transformers\models\gemma4\modeling_gemma4.py", line 1219, in forward
    key_states, value_states = shared_kv_states[self.kv_shared_layer_index]

KeyError: 22
```

### Expected behavior

One of the following:

1. `PeftModel.from_pretrained(..., device_map=...)` should preserve Gemma 4 shared-KV generation behavior.
2. Passing a base-model `device_map` into PEFT should be rejected or documented as unsupported for Gemma 4 shared-KV models.
3. Gemma 4 should validate/populate `shared_kv_states` robustly when Accelerate hooks / PEFT wrapping are involved.
4. PEFT/Accelerate should avoid a redispatch/hook layout that changes the execution path needed for Gemma 4 shared-KV state capture.

### Actual behavior

The model loads and reaches `generate()`, but the first generation prefill fails because `shared_kv_states` does not contain the expected source-layer key.

### Why this seems related to PEFT/Accelerate redispatch

The failure only appears after passing `device_map` to `PeftModel.from_pretrained()`. That appears to perform a second dispatch over the PEFT-wrapped model, rather than simply “offloading PEFT too.”

The same base model works in the all-GPU case, and the first failure mode without PEFT `device_map` is different: adapter loading fails during Accelerate/bitsandbytes hook/state handling.

### Notes

- Gemma 4 uses shared KV cache: later layers can reuse K/V tensors from earlier layers instead of computing their own.
- This failure appears to be architecture-specific to Gemma 4’s shared-KV path.
- For a smaller Gemma 4 reproduction, an equivalent failure can show as `KeyError: 13` depending on layer count / shared-KV layout.
- Passing `device_map` to PEFT should not be recommended as a workaround for the adapter-load-time offload issue if it can break generation.

### Questions

1. Is `PeftModel.from_pretrained(..., device_map=...)` supported for Gemma 4 models with shared KV cache?
2. Should PEFT avoid redispatching a base model that was already loaded with `device_map`?
3. Should Gemma 4 shared-KV state handling be robust to Accelerate hooks and PEFT wrapping?
4. Should the docs recommend `offload_dir`, `offload_buffers`, and `ephemeral_gpu_offload` instead of passing the same base `device_map` into PEFT?

### Relevant links

```text
Gemma 4 shared KV cache background:
https://huggingface.co/blog/gemma4

Gemma 4 Transformers docs:
https://huggingface.co/docs/transformers/model_doc/gemma4

Accelerate big model dispatch docs:
https://huggingface.co/docs/accelerate/package_reference/big_modeling

PEFT PeftModel docs:
https://huggingface.co/docs/peft/package_reference/peft_model

Transformers native PEFT adapter integration:
https://huggingface.co/docs/transformers/en/peft

Related KV-shared layer discussion in another runtime:
https://github.com/microsoft/onnxruntime/issues/28188
```

Optional Issue 3

Only open this if you want a docs/UX issue in huggingface/peft, or if maintainers ask you to separate offload-argument handling from the Gemma 4/bnb failure.

Target repo

huggingface/peft

Suggested title

Clarify offload_dir/offload_folder behavior for PeftModel.from_pretrained on already-dispatched models

Body

### Summary

When loading a PEFT adapter on top of a base model that was already loaded with a custom `device_map` and `offload_folder`, it is not obvious which offload arguments should be passed to `PeftModel.from_pretrained()`.

The base model load uses:

```python
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map=device_map,
    offload_folder=OFFLOAD_FOLDER,
    # ...
)
```

But during PEFT adapter loading, the dispatch path may require:

```python
model = PeftModel.from_pretrained(
    base_model,
    adapter_path,
    offload_dir=OFFLOAD_FOLDER,
    offload_buffers=True,
    ephemeral_gpu_offload=True,
)
```

This is confusing because `from_pretrained()` uses `offload_folder`, while Accelerate/PEFT redispatch paths refer to `offload_dir`.

### Request

Please clarify in PEFT docs:

1. Whether `PeftModel.from_pretrained()` supports already-dispatched/offloaded base models.
2. Whether users should pass `offload_dir` when the base model was loaded with `offload_folder`.
3. Whether passing `device_map` to `PeftModel.from_pretrained()` is recommended or discouraged when the base model already has `hf_device_map`.
4. Whether `offload_buffers=True` is recommended for partially offloaded quantized models.
5. Whether `ephemeral_gpu_offload=True` is intended for this scenario.

### Relevant links

```text
PEFT PeftModel docs:
https://huggingface.co/docs/peft/package_reference/peft_model

PEFT LoRA / ephemeral_gpu_offload docs:
https://huggingface.co/docs/peft/developer_guides/lora

Older PEFT offload_dir issue:
https://github.com/huggingface/peft/issues/225

Accelerate big model dispatch docs:
https://huggingface.co/docs/accelerate/package_reference/big_modeling
```

My recommendation

Open Issue 1 first. It contains the main failure and enough context for maintainers to route ownership. Open Issue 2 separately if they want the shared-KV failure split out, or if you want cleaner tracking from the start.

Do not open Issue 3 first. It is useful, but it is a docs/UX issue. The core bug is Issue 1.
