CPU offloading error scenario
I’ll post a draft of the issue for now:
The issues actually worth raising are these, in this order.
Issue 1 — Primary: PEFT adapter loading fails on an already CPU/GPU-dispatched bnb 4-bit Gemma 4 model
File first at: huggingface/transformers
Mention/cross-link: huggingface/peft, huggingface/accelerate, bitsandbytes-foundation/bitsandbytes
Suggested title
PeftModel.from_pretrained on CPU/GPU-dispatched 4-bit Gemma4 fails during Accelerate hook attachment in bitsandbytes QuantState/Params4bit
Why this is the strongest issue
This is the core failure:
Base Gemma 4 loads with custom CPU/GPU device_map.
All-GPU Gemma 4 + PEFT works.
PEFT adapter loading triggers Accelerate dispatch/hook logic.
The failure occurs inside bitsandbytes 4-bit state/parameter handling.
The concrete failure variants are related, not contradictory:
Tensor.item() cannot be called on meta tensors
→ bitsandbytes QuantState.as_dict(packed=True)
→ nested_offset = self.offset.item()
and, on nearby version/config paths:
Params4bit.__new__() got an unexpected keyword argument '_is_hf_initialized'
The _is_hf_initialized family is already visible in upstream issue traffic around Transformers/Accelerate/bitsandbytes parameter reconstruction; there is a current issue for the analogous Int8Params case, and another issue describing _is_hf_initialized being passed into parameter reconstruction paths. (GitHub)
Core issue statement
Use wording like this:
The base model can be loaded with a split CPU/GPU device_map, and the all-GPU PEFT path works. The failure appears when loading a PEFT adapter onto the already-dispatched bitsandbytes 4-bit Gemma 4 base model. PeftModel.from_pretrained appears to trigger an additional Accelerate dispatch/hook path. That path fails inside bitsandbytes 4-bit quant-state or Params4bit handling.
Why Transformers first
Transformers is the best first repo because this issue crosses:
- Gemma 4 model integration;
- bitsandbytes quantization integration;
- device-map loading behavior;
- PEFT adapter integration expectations;
- current _is_hf_initialized loading behavior.
Accelerate owns dispatch_model() and hook attachment; its docs define dispatching models across GPU, CPU, and disk according to device_map, and public Accelerate source/doc snippets show hook attachment is central to this path. (Hugging Face)
bitsandbytes owns Linear4bit, Params4bit, and QuantState, but the failure is triggered by the HF integration path. So file at Transformers first and let maintainers route if needed.
Issue 2 — Secondary: Passing device_map to PEFT breaks Gemma 4 shared-KV generation
File first at: huggingface/transformers
Mention/cross-link: huggingface/peft, huggingface/accelerate
Suggested title
Gemma4 + PeftModel.from_pretrained(device_map=...) breaks shared_kv_states during generate: KeyError 22
Why this is a separate issue
This is not the same failure as the PEFT-load/bitsandbytes failure. It occurs later, during generation:
Gemma4 self_attn forward
→ shared_kv_states[self.kv_shared_layer_index]
→ KeyError: 22
This happens only after using:
PeftModel.from_pretrained(..., device_map=device_map)
That is important because passing device_map into PEFT is not simply “offload PEFT too.” It asks PEFT/Accelerate to redispatch the PEFT-wrapped model, using names/layout assumptions that may no longer match the original base model.
Gemma 4 has shared-KV-cache behavior where later layers reuse key/value states from earlier layers. If a second dispatch/hook pass changes the execution/capture path, the dict entry expected by the shared layer may not be present. The Gemma 4 architecture writeup describes the shared-KV-cache mechanism; Unsloth’s Gemma 4 guide also calls out shared KV state across E2B/E4B layers. (GitHub)
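To make the suspected mechanism concrete, here is a toy sketch of shared-KV bookkeeping. It is not Gemma 4 code: ToyLayer, run_layers, and skip_store_for are hypothetical names, and skip_store_for only stands in for whatever a second dispatch/hook pass might change. It shows how a missing source-layer entry surfaces as a KeyError on the layer index.

```python
from typing import Dict, List, Optional, Set, Tuple

KV = Tuple[str, str]


class ToyLayer:
    """Toy decoder layer (hypothetical, not the Gemma 4 implementation)."""

    def __init__(self, index: int, kv_shared_layer_index: Optional[int] = None):
        self.index = index
        # None means this layer computes its own K/V; otherwise it reuses
        # the K/V stored by the layer with this index.
        self.kv_shared_layer_index = kv_shared_layer_index

    def forward(self, shared_kv_states: Dict[int, KV], store_kv: bool = True) -> KV:
        if self.kv_shared_layer_index is None:
            kv = (f"k{self.index}", f"v{self.index}")
            if store_kv:
                shared_kv_states[self.index] = kv
            return kv
        # Raises KeyError if the source layer never stored its K/V.
        return shared_kv_states[self.kv_shared_layer_index]


def run_layers(layers: List[ToyLayer], skip_store_for: Optional[Set[int]] = None) -> List[KV]:
    """Run layers in order against one fresh shared-KV dict."""
    skip = skip_store_for or set()
    shared_kv_states: Dict[int, KV] = {}
    return [
        layer.forward(shared_kv_states, store_kv=layer.index not in skip)
        for layer in layers
    ]
```

With layers 22 (source) and 23 (sharing from 22), run_layers succeeds normally, while run_layers(layers, skip_store_for={22}) raises KeyError(22), the same shape as the failure in the trace.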
Core issue statement
Passing the same base model device_map to PeftModel.from_pretrained avoids the initial adapter-load failure, but generation then fails in Gemma 4 shared-KV attention with KeyError. This suggests the PEFT/Accelerate redispatch layout breaks Gemma 4 shared_kv_states bookkeeping.
Why this deserves its own issue
Because the fix for Issue 1 may not automatically fix Issue 2. Issue 1 is about PEFT adapter loading over bnb 4-bit offload. Issue 2 is about Gemma 4 generation semantics after PEFT-level redispatch.
Do not merge them into one maintainer action item unless you present Issue 2 as a “related second symptom.”
Issue 3 — Optional/supporting: PEFT offload-dir / offload-folder handling is confusing or under-documented
File at: huggingface/peft
Suggested title
Clarify offload_dir/offload_folder handling for PeftModel.from_pretrained on already-dispatched models
Why it is lower priority
This is probably not the root cause of the current Gemma 4 failure, but it is part of the same user-facing confusion.
There are existing PEFT issues about PeftModel.from_pretrained() failing with:
ValueError: We need an `offload_dir` to dispatch this model according to this `device_map`
and about inconsistent offload_dir / offload_folder naming. (GitHub)
This is worth mentioning in Issue 1 as context, but I would not file it first unless your minimal repro specifically lands on the missing offload_dir error.
What I would not file
Not this
PEFT expects vision/audio towers to be on GPU.
That is too broad and likely inaccurate.
Better:
PEFT adapter loading triggers redispatch/hook handling on an already-dispatched bnb 4-bit Gemma 4 model, and that dispatch path fails.
Not this
CPU offloading is broken.
Too broad. The base model can load with CPU/GPU dispatch; Accelerate supports dispatching layers across GPU, CPU, and disk by design. (Hugging Face)
Better:
Runtime PEFT adapter loading on top of a CPU/GPU-dispatched bitsandbytes 4-bit Gemma 4 base is broken on this version set.
Not this as the main issue
model.multi_modal_projector offload fails.
Only file a projector-specific issue after verifying that exact module key exists in the actual model. For Gemma 4 variants, bridge/module names can differ.
Recommended filing plan
Best plan
Open one primary Transformers issue with two sections:
A. Primary failure: PeftModel.from_pretrained on split-device bnb 4-bit Gemma4 fails during adapter load.
B. Related failure: adding device_map to PEFT avoids load error but causes Gemma4 shared_kv_states KeyError during generate.
Then add:
I can split the shared_kv_states issue into a separate ticket if maintainers prefer.
This is efficient because maintainers can see the relationship.
If you want the cleanest tracking
Open two separate issues:
- Transformers Issue A: bnb 4-bit + PEFT + Accelerate dispatch failure.
- Transformers Issue B: Gemma 4 shared-KV KeyError when device_map is passed to PEFT.
Then cross-link them.
Minimal titles to use
Best title for main issue
PeftModel.from_pretrained on CPU/GPU-dispatched 4-bit Gemma4 fails during Accelerate hook attachment in bitsandbytes QuantState/Params4bit
Best title for related issue
Gemma4 + PeftModel.from_pretrained(device_map=...) breaks shared_kv_states during generate: KeyError 22
Optional PEFT docs/UX issue
Clarify offload_dir/offload_folder behavior when loading PEFT adapters on already-dispatched models
Key evidence to include
Include this exact contrast:
Works:
device_map = {"": 0}
Fails:
device_map = {
    "model.vision_tower": "cpu",
    "model.audio_tower": "cpu",
    "": 0,
}
Mention model.multi_modal_projector only if verified by named_modules().
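One way to do that verification is a small helper that reports device_map keys with no matching module name. This is a minimal sketch: unmatched_device_map_keys is a hypothetical helper, and it assumes every key is meant to name a module (a device_map may also name individual parameters or buffers, which this check would flag).

```python
from typing import Dict, Iterable, List, Union


def unmatched_device_map_keys(
    device_map: Dict[str, Union[int, str]],
    module_names: Iterable[str],
) -> List[str]:
    """Return device_map keys that do not match any module name exactly."""
    names = set(module_names)
    names.add("")  # the root module; named_modules() also yields it as ""
    return [key for key in device_map if key not in names]


# Intended usage against a loaded model (not executed here):
#   names = [name for name, _ in base_model.named_modules()]
#   print(unmatched_device_map_keys(device_map, names))
```

An empty result means every key names a real module; any returned key should be dropped from the repro or corrected before filing.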
Include quant config:
BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_enable_fp32_cpu_offload=True,
)
Mention that llm_int8_enable_fp32_cpu_offload=True is required/expected for CPU/disk entries in many bnb quantized device_map paths, even though the name is confusing; Transformers’ bitsandbytes docs describe CPU/GPU offload behavior in this quantization area. (GitHub)
Include the exact two trace tails:
Linear4bit._save_to_state_dict
→ weight.quant_state.as_dict(packed=True)
→ nested_offset = self.offset.item()
→ Tensor.item() cannot be called on meta tensors
and:
Gemma4Attention.forward
→ shared_kv_states[self.kv_shared_layer_index]
→ KeyError: 22
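If it helps to produce these arrow chains consistently, a small stdlib-only sketch can pull the function names out of a pasted traceback. trace_chain is a hypothetical helper, and it assumes the standard CPython frame format (File "...", line N, in name).

```python
import re

# Matches frame header lines such as: File "x.py", line 12, in some_function
FRAME_RE = re.compile(r'File "[^"]*", line \d+, in (\S+)')


def trace_chain(traceback_text: str) -> str:
    """Join the function names of all traceback frames with arrows."""
    return " → ".join(FRAME_RE.findall(traceback_text))
```

The chain only lists function names, so the final exception line still needs to be copied by hand.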
Bottom line
The actual issues you are raising are:
Primary bug: PEFT adapter loading on an already CPU/GPU-dispatched bitsandbytes 4-bit Gemma 4 model triggers Accelerate redispatch/hook handling and fails inside bitsandbytes 4-bit state/parameter handling.
Secondary bug: Passing device_map into PEFT is not a valid workaround for Gemma 4; it can break shared-KV generation with KeyError: 22.
Optional docs/UX issue: PEFT/Accelerate offload args are confusing around offload_dir, offload_folder, and already-dispatched base models.
Those are good, concrete, actionable issues.
Below are ready-to-paste GitHub issues. I would open Issue 1 first in huggingface/transformers. If maintainers ask to split the shared-KV failure, open Issue 2 separately. This framing matches Accelerate’s documented role in dispatching models across GPU/CPU/disk, PEFT’s adapter-loading surface, bitsandbytes 4-bit quant-state handling, and Gemma 4’s shared-KV-cache architecture. (Hugging Face)
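The System Info sections in these drafts list package versions, so it is worth regenerating them from the environment that actually reproduces the failure. A minimal sketch using only the standard library follows; collect_versions and format_system_info are hypothetical helper names, and the package list is simply the one used in the drafts.

```python
from importlib import metadata
from typing import Dict, Iterable

PACKAGES = ("torch", "transformers", "accelerate", "bitsandbytes", "peft")


def collect_versions(packages: Iterable[str] = PACKAGES) -> Dict[str, str]:
    """Look up installed distribution versions; mark missing ones explicitly."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


def format_system_info(versions: Dict[str, str]) -> str:
    """Render versions in the same '- name: version' style as the issue body."""
    return "\n".join(f"- {name}: {version}" for name, version in versions.items())


if __name__ == "__main__":
    print(format_system_info(collect_versions()))
```

Python, GPU, driver, and CUDA details are not covered by this sketch and still need to be filled in manually.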
Issue 1
Target repo
huggingface/transformers
Suggested title
PeftModel.from_pretrained on CPU/GPU-dispatched 4-bit Gemma4 fails during Accelerate hook attachment in bitsandbytes QuantState
Suggested labels
bug, Gemma4, PEFT, Accelerate, bitsandbytes, quantization, device_map, cpu-offload
Body
### System Info
- OS: Windows
- Python: please fill
- GPU: please fill
- NVIDIA driver: please fill
- CUDA: please fill
- torch: 2.8.0+cu129
- transformers: 5.6.2
- accelerate: 1.14.0.dev0
- bitsandbytes: 0.49.2
- peft: 0.19.1
- model: Gemma 4 E4B IT
- quantization: bitsandbytes 4-bit NF4
- adapter type: LoRA
- attention implementation: sdpa
- trust_remote_code: False
### Summary
A Gemma 4 E4B IT base model works when loaded fully on GPU with:
```python
device_map = {"": 0}
```
However, loading the same base model with a custom CPU/GPU `device_map` and then loading a PEFT adapter with `PeftModel.from_pretrained()` fails during adapter loading.
The failure appears when PEFT adapter loading calls into Accelerate dispatch/hook logic. Accelerate then calls `module.state_dict()` while attaching execution hooks, which reaches bitsandbytes `Linear4bit._save_to_state_dict()`. bitsandbytes then serializes `weight.quant_state.as_dict(packed=True)` and fails because a nested quantization scalar is still on the `meta` device:
```text
RuntimeError: Tensor.item() cannot be called on meta tensors
```
The all-GPU path works. The failure appears specifically when the base model is already CPU/GPU-dispatched and quantized with bitsandbytes 4-bit double quantization.
### Working case
```python
device_map = {"": 0}
```
This works.
### Failing case
```python
device_map = {
    "model.vision_tower": "cpu",
    # Keep this key only if it appears in base_model.named_modules().
    "model.multi_modal_projector": "cpu",
    "model.audio_tower": "cpu",
    "": 0,
}
```
### Quantization config
```python
import torch
from transformers import BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_enable_fp32_cpu_offload=True,
)
```
### Base model load
```python
from transformers import Gemma4ForConditionalGeneration

# MODEL_ID and max_memory are defined elsewhere in the script.
base_model = Gemma4ForConditionalGeneration.from_pretrained(
    MODEL_ID,
    quantization_config=quant_config,
    device_map=device_map,
    max_memory=max_memory,
    offload_folder=r"E:\Folder\offload_temp",
    dtype=torch.bfloat16,
    attn_implementation="sdpa",
    trust_remote_code=False,
    low_cpu_mem_usage=False,
)
```
### PEFT adapter load
```python
from peft import PeftModel

# Merge and unwrap any previously attached adapter before loading a new one.
if isinstance(base_model, PeftModel):
    base_model = base_model.merge_and_unload()

model = PeftModel.from_pretrained(
    base_model,
    lora_path,
    adapter_name=adapter_name,
    is_trainable=False,
)
```
### Error
```text
PeftModel.from_pretrained
→ load_adapter
→ dispatch_model
→ attach_align_device_hook_on_blocks
→ attach_execution_device_hook
→ module.state_dict()
→ bitsandbytes Linear4bit._save_to_state_dict
→ self.weight.quant_state.as_dict(packed=True)
→ "nested_offset": self.offset.item()
→ RuntimeError: Tensor.item() cannot be called on meta tensors
```
Relevant traceback tail:
```text
  File "...peft\peft_model.py", line 1475, in load_adapter
    dispatch_model(
  File "...accelerate\big_modeling.py", line 432, in dispatch_model
    attach_align_device_hook_on_blocks(
  File "...accelerate\hooks.py", line 459, in attach_execution_device_hook
    if not hasattr(module, "_hf_hook") and len(module.state_dict()) > 0:
  File "...torch\nn\modules\module.py", line 2260, in state_dict
    module.state_dict(
  File "...bitsandbytes\nn\modules.py", line 525, in _save_to_state_dict
    for k, v in self.weight.quant_state.as_dict(packed=True).items():
  File "...bitsandbytes\functional.py", line 581, in as_dict
    "nested_offset": self.offset.item(),
  File "...torch\_meta_registrations.py", line 7457, in meta_local_scalar_dense
    raise RuntimeError("Tensor.item() cannot be called on meta tensors")
RuntimeError: Tensor.item() cannot be called on meta tensors
```
### Expected behavior
One of the following:
1. `PeftModel.from_pretrained()` should preserve the already-dispatched base model layout without triggering a bitsandbytes quant-state serialization path that reads `meta` tensors.
2. Accelerate hook attachment should avoid calling `state_dict()` on bitsandbytes `Linear4bit` modules whose quant-state may contain offloaded/meta placeholders.
3. bitsandbytes `QuantState.as_dict(packed=True)` should either materialize/move the nested offset before `.item()` or fail with a clearer unsupported-configuration error.
4. If this configuration is unsupported, the error should be raised before adapter loading with an explicit message.
### Actual behavior
The base model can be loaded with the CPU/GPU `device_map`, but PEFT adapter loading triggers an additional Accelerate dispatch/hook path and fails inside bitsandbytes nested quantization-state serialization.
### Why this seems cross-library
My current read:
- PEFT triggers the failing path by loading the adapter with `PeftModel.from_pretrained()`.
- Accelerate attaches dispatch/execution hooks and calls `module.state_dict()`.
- bitsandbytes owns `Linear4bit`, `Params4bit`, and `QuantState.as_dict(packed=True)`.
- Transformers owns the Gemma 4 integration and bitsandbytes quantizer integration.
I am not sure which repository should own the final fix, but this seems to start from the Transformers/PEFT integration path.
### Additional notes
- The all-GPU path works with `device_map={"": 0}`.
- The failure only appears with CPU/GPU dispatch.
- The failing field is `nested_offset`, which appears tied to `bnb_4bit_use_double_quant=True`.
- For quantized models with CPU entries in `device_map`, `llm_int8_enable_fp32_cpu_offload=True` appears necessary even though the flag name says `int8`.
- Passing `device_map` to `PeftModel.from_pretrained()` is not a valid workaround; it causes a separate Gemma 4 shared-KV generation failure. I can open that as a separate issue if preferred.
### Diagnostic snippet
```python
def find_bnb_meta_quant_state(model):
    """Return (module name, attribute path, device) for quant-state tensors on meta."""
    bad = []
    for name, module in model.named_modules():
        weight = getattr(module, "weight", None)
        quant_state = getattr(weight, "quant_state", None)
        if quant_state is None:
            continue
        for attr in ["absmax", "code", "offset"]:
            value = getattr(quant_state, attr, None)
            if value is not None and getattr(value, "is_meta", False):
                bad.append((name, f"weight.quant_state.{attr}", str(value.device)))
        # state2 is the nested quant state used with double quantization.
        state2 = getattr(quant_state, "state2", None)
        if state2 is not None:
            for attr in ["absmax", "code", "offset"]:
                value = getattr(state2, attr, None)
                if value is not None and getattr(value, "is_meta", False):
                    bad.append((name, f"weight.quant_state.state2.{attr}", str(value.device)))
    return bad


print("hf_device_map:", getattr(base_model, "hf_device_map", None))
print("bnb quant_state meta entries:", find_bnb_meta_quant_state(base_model)[:20])
```
### Module-name verification snippet
```python
# Print candidate module names to confirm the device_map keys exist.
for name, module in base_model.named_modules():
    lname = name.lower()
    if any(k in lname for k in ["vision", "audio", "project", "embed", "multi"]):
        print(name, type(module).__name__)
```
### Questions
1. Is `PeftModel.from_pretrained()` expected to support adapter loading on an already CPU/GPU-dispatched bitsandbytes 4-bit model?
2. Should PEFT avoid redispatching a model that already has `hf_device_map`?
3. Should Accelerate avoid calling `state_dict()` during hook attachment for bitsandbytes `Linear4bit` modules?
4. Should bitsandbytes handle `QuantState.offset` on `meta` more defensively in `as_dict(packed=True)`?
5. Is the recommended workaround to use all-GPU placement, native `load_adapter`, or avoid runtime PEFT injection on offloaded bnb 4-bit models?
### Relevant links
```text
Accelerate big model dispatch docs:
https://huggingface.co/docs/accelerate/package_reference/big_modeling
Transformers bitsandbytes docs:
https://huggingface.co/docs/transformers/quantization/bitsandbytes
PEFT PeftModel docs:
https://huggingface.co/docs/peft/package_reference/peft_model
PEFT ephemeral_gpu_offload docs:
https://huggingface.co/docs/peft/developer_guides/lora
Transformers native PEFT adapter integration:
https://huggingface.co/docs/transformers/en/peft
bitsandbytes QuantState source:
https://github.com/bitsandbytes-foundation/bitsandbytes/blob/main/bitsandbytes/functional.py
Related _is_hf_initialized issue family:
https://github.com/huggingface/transformers/issues/43872
```
Issue 2
Target repo
huggingface/transformers
Suggested title
Gemma4 + PeftModel.from_pretrained(device_map=...) breaks shared_kv_states during generate: KeyError 22
Suggested labels
bug, Gemma4, generation, shared-kv-cache, PEFT, Accelerate, device_map
Body
### System Info
- OS: Windows
- Python: please fill
- GPU: please fill
- NVIDIA driver: please fill
- CUDA: please fill
- torch: 2.8.0+cu129
- transformers: 5.6.2
- accelerate: 1.14.0.dev0
- bitsandbytes: 0.49.2
- peft: 0.19.1
- model: Gemma 4 E4B IT
- quantization: bitsandbytes 4-bit NF4
- adapter type: LoRA
- attention implementation: sdpa
- trust_remote_code: False
### Summary
A Gemma 4 E4B IT model works when loaded fully on GPU with:
```python
device_map = {"": 0}
```
A CPU/GPU-dispatched base model can also be loaded. However, if I pass the same base-model `device_map` to `PeftModel.from_pretrained()`, adapter loading gets farther, but generation fails inside Gemma 4 shared-KV attention with:
```text
KeyError: 22
```
The failure line is:
```python
key_states, value_states = shared_kv_states[self.kv_shared_layer_index]
```
This suggests that the PEFT/Accelerate redispatch layout breaks Gemma 4 shared-KV bookkeeping during generation.
### Base model load
```python
device_map = {
    "model.vision_tower": "cpu",
    "model.multi_modal_projector": "cpu",
    "model.audio_tower": "cpu",
    "": 0,
}

base_model = Gemma4ForConditionalGeneration.from_pretrained(
    MODEL_ID,
    quantization_config=quant_config,
    device_map=device_map,
    max_memory=max_memory,
    offload_folder=r"E:\Folder\offload_temp",
    dtype=torch.bfloat16,
    attn_implementation="sdpa",
    trust_remote_code=False,
    low_cpu_mem_usage=False,
)
```
### PEFT load that triggers the generation failure
```python
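from peft import PeftModel

# Merge and unwrap any previously attached adapter before loading a new one.
if isinstance(base_model, PeftModel):
    base_model = base_model.merge_and_unload()

model = PeftModel.from_pretrained(
    base_model,
    lora_path,
    adapter_name=adapter_name,
    device_map=device_map,  # same map as the base model; this triggers the failure
    is_trainable=False,
)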
```
### Generation
```python
outputs = model.generate(
    **inputs,
    max_new_tokens=max_new_tokens,
    use_cache=True,
)
```
### Error
```text
  File "...peft\peft_model.py", line 2122, in generate
    outputs = self.base_model.generate(*args, **kwargs)
  File "...transformers\generation\utils.py", line 3768, in _prefill
    return self(**model_inputs, return_dict=True)
  File "...transformers\models\gemma4\modeling_gemma4.py", line 2516, in forward
    outputs = self.model(
  File "...transformers\models\gemma4\modeling_gemma4.py", line 2374, in forward
    outputs = self.language_model(
  File "...transformers\models\gemma4\modeling_gemma4.py", line 1675, in forward
    hidden_states = decoder_layer(
  File "...transformers\models\gemma4\modeling_gemma4.py", line 1379, in forward
    hidden_states, _ = self.self_attn(
  File "...transformers\models\gemma4\modeling_gemma4.py", line 1219, in forward
    key_states, value_states = shared_kv_states[self.kv_shared_layer_index]
KeyError: 22
```
### Expected behavior
One of the following:
1. `PeftModel.from_pretrained(..., device_map=...)` should preserve Gemma 4 shared-KV generation behavior.
2. Passing a base-model `device_map` into PEFT should be rejected or documented as unsupported for Gemma 4 shared-KV models.
3. Gemma 4 should validate/populate `shared_kv_states` robustly when Accelerate hooks / PEFT wrapping are involved.
4. PEFT/Accelerate should avoid a redispatch/hook layout that changes the execution path needed for Gemma 4 shared-KV state capture.
### Actual behavior
The model loads and reaches `generate()`, but the first generation prefill fails because `shared_kv_states` does not contain the expected source-layer key.
### Why this seems related to PEFT/Accelerate redispatch
The failure only appears after passing `device_map` to `PeftModel.from_pretrained()`. That appears to perform a second dispatch over the PEFT-wrapped model, rather than simply “offloading PEFT too.”
The same base model works in the all-GPU case, and the first failure mode without PEFT `device_map` is different: adapter loading fails during Accelerate/bitsandbytes hook/state handling.
### Notes
- Gemma 4 uses shared KV cache: later layers can reuse K/V tensors from earlier layers instead of computing their own.
- This failure appears to be architecture-specific to Gemma 4’s shared-KV path.
- For a smaller Gemma 4 reproduction, an equivalent failure can show as `KeyError: 13` depending on layer count / shared-KV layout.
- Passing `device_map` to PEFT should not be recommended as a workaround for the adapter-load-time offload issue if it can break generation.
### Questions
1. Is `PeftModel.from_pretrained(..., device_map=...)` supported for Gemma 4 models with shared KV cache?
2. Should PEFT avoid redispatching a base model that was already loaded with `device_map`?
3. Should Gemma 4 shared-KV state handling be robust to Accelerate hooks and PEFT wrapping?
4. Should the docs recommend `offload_dir`, `offload_buffers`, and `ephemeral_gpu_offload` instead of passing the same base `device_map` into PEFT?
### Relevant links
```text
Gemma 4 shared KV cache background:
https://huggingface.co/blog/gemma4
Gemma 4 Transformers docs:
https://huggingface.co/docs/transformers/model_doc/gemma4
Accelerate big model dispatch docs:
https://huggingface.co/docs/accelerate/package_reference/big_modeling
PEFT PeftModel docs:
https://huggingface.co/docs/peft/package_reference/peft_model
Transformers native PEFT adapter integration:
https://huggingface.co/docs/transformers/en/peft
Related KV-shared layer discussion in another runtime:
https://github.com/microsoft/onnxruntime/issues/28188
```
Optional Issue 3
Only open this if you want a docs/UX issue in huggingface/peft, or if maintainers ask you to separate offload-argument handling from the Gemma 4/bnb failure.
Target repo
huggingface/peft
Suggested title
Clarify offload_dir/offload_folder behavior for PeftModel.from_pretrained on already-dispatched models
Body
### Summary
When loading a PEFT adapter on top of a base model that was already loaded with a custom `device_map` and `offload_folder`, it is not obvious which offload arguments should be passed to `PeftModel.from_pretrained()`.
The base model load uses:
```python
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map=device_map,
    offload_folder=OFFLOAD_FOLDER,
    # ... (other arguments elided)
)
```
But during PEFT adapter loading, the dispatch path may require:
```python
model = PeftModel.from_pretrained(
    base_model,
    adapter_path,
    offload_dir=OFFLOAD_FOLDER,
    offload_buffers=True,
    ephemeral_gpu_offload=True,
)
```
This is confusing because `from_pretrained()` uses `offload_folder`, while Accelerate/PEFT redispatch paths refer to `offload_dir`.
### Request
Please clarify in PEFT docs:
1. Whether `PeftModel.from_pretrained()` supports already-dispatched/offloaded base models.
2. Whether users should pass `offload_dir` when the base model was loaded with `offload_folder`.
3. Whether passing `device_map` to `PeftModel.from_pretrained()` is recommended or discouraged when the base model already has `hf_device_map`.
4. Whether `offload_buffers=True` is recommended for partially offloaded quantized models.
5. Whether `ephemeral_gpu_offload=True` is intended for this scenario.
### Relevant links
```text
PEFT PeftModel docs:
https://huggingface.co/docs/peft/package_reference/peft_model
PEFT LoRA / ephemeral_gpu_offload docs:
https://huggingface.co/docs/peft/developer_guides/lora
Older PEFT offload_dir issue:
https://github.com/huggingface/peft/issues/225
Accelerate big model dispatch docs:
https://huggingface.co/docs/accelerate/package_reference/big_modeling
```
My recommendation
Open Issue 1 first. It contains the main failure and enough context for maintainers to route ownership. Open Issue 2 separately if they want the shared-KV failure split out, or if you want cleaner tracking from the start.
Do not open Issue 3 first. It is useful, but it is a docs/UX issue. The core bug is Issue 1.