{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiakiz6xdk6kacjjfs7ame7n4mqc4lnii2aixsxznle6amard4di5a",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mkaodoclill2"
  },
  "path": "/t/cpu-offloading-error-scenario/175522#post_3",
  "publishedAt": "2026-04-24T13:28:52.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "GitHub",
    "PyTorch Docs",
    "Hugging Face"
  ],
  "textContent": "I ran a few experiments on Colab Free GPU. While there are workarounds, it also seems possible that PEFT isn’t currently behaving as expected:\n\n* * *\n\n## Short answer\n\nYour understanding is **close**, but the important correction is:\n\n> PEFT is not simply “expecting vision/audio to be on GPU.”\n> The real problem is that **PEFT adapter loading triggers a second Accelerate dispatch/hook pass over a bitsandbytes 4-bit, partially CPU-offloaded Gemma 4 model**, and that path is fragile.\n\nYou are hitting two different failure modes:\n\n  1. **Without `device_map` in PEFT**\nPEFT calls Accelerate dispatch hooks; Accelerate asks a bitsandbytes `Linear4bit` module for `state_dict()`; bitsandbytes tries to serialize nested/double-quant state; that nested quant state contains a `meta` tensor; `.item()` on a `meta` tensor fails.\n\n  2. **With `device_map` in PEFT**\nPEFT does a second dispatch using a device map that was meant for the base model, not the PEFT-wrapped model. The model loads farther, but Gemma 4 generation breaks because Gemma 4’s **shared KV cache** bookkeeping loses an expected source-layer entry, causing `KeyError: 22`.\n\n\n\n\nSo the answer is:\n\n> You can use PEFT with some offload-related options, but passing the same `device_map` into `PeftModel.from_pretrained()` is not the right fix for Gemma 4. 
It changes the dispatch layout and can break Gemma 4’s shared-KV generation path.\n\n* * *\n\n# What is happening in the first error\n\nYour original PEFT load is:\n\n\n    model = PeftModel.from_pretrained(\n        base_model,\n        lora_path,\n        adapter_name=lora_source_client_name,\n        is_trainable=False,\n    )\n\n\nThe failure happens after PEFT starts loading the adapter:\n\n\n    PeftModel.from_pretrained\n    → load_adapter\n    → accelerate.dispatch_model\n    → attach_align_device_hook_on_blocks\n    → attach_execution_device_hook\n    → module.state_dict()\n    → bitsandbytes Linear4bit._save_to_state_dict\n    → self.weight.quant_state.as_dict(packed=True)\n    → \"nested_offset\": self.offset.item()\n    → Tensor.item() cannot be called on meta tensors\n\n\nThat traceback is very specific. The failing tensor is not a LoRA adapter tensor. It is inside bitsandbytes’ 4-bit quantization state.\n\nThe key line is:\n\n\n    \"nested_offset\": self.offset.item()\n\n\nThat is tied to nested/double quantization. You enabled:\n\n\n    bnb_4bit_use_double_quant=True\n\n\nIn bitsandbytes, nested quantization stores extra quantization-state fields, and `QuantState.as_dict(packed=True)` serializes those fields. The bitsandbytes source contains the relevant `QuantState` packing logic, including nested quant-state serialization. (GitHub)\n\nPyTorch’s `meta` device is not a real data-holding device. Meta tensors carry shape/dtype metadata but no values, so data-dependent operations like `.item()` are invalid. That is why the exception says `Tensor.item() cannot be called on meta tensors`. 
(PyTorch Docs)\n\nSo the first error means:\n\n\n    A bitsandbytes nested quantization scalar is still on meta\n    when Accelerate/PEFT asks bitsandbytes to serialize state_dict.\n\n\nThat is a library interaction issue:\n\n\n    PEFT adapter loading\n    + Accelerate dispatch hooks\n    + bitsandbytes 4-bit Linear4bit\n    + double quant / nested quant_state\n    + CPU/GPU offload / meta placeholders\n\n\nIt is not simply:\n\n\n    vision/audio are on CPU, PEFT wants them on GPU\n\n\n* * *\n\n# Why `device_map={\"\": 0}` works\n\nWhen you use:\n\n\n    device_map = {\"\": 0}\n\n\nyou avoid the fragile cross-device path.\n\nEverything is on GPU 0:\n\n\n    no CPU-offloaded towers\n    no CPU/disk hooks for those modules\n    less meta placeholder machinery\n    less redispatch complexity during PEFT load\n\n\nThat does not prove the model or adapter are wrong. It proves that **the all-GPU path avoids the failure surface**.\n\nThis distinction matters:\n\nSetup | What happens\n---|---\n`device_map={\"\": 0}` | No split dispatch; PEFT works.\ncustom CPU/GPU map | Accelerate offload hooks are involved; PEFT triggers redispatch; bitsandbytes quant-state/meta problems appear.\n\n* * *\n\n# Why passing `device_map` to PEFT causes `KeyError: 22`\n\nYou tried:\n\n\n    model = PeftModel.from_pretrained(\n        base_model,\n        lora_path,\n        adapter_name=lora_source_client_name,\n        device_map=device_map,\n        is_trainable=False\n    )\n\n\nThat gets past the first loading problem, but generation fails later:\n\n\n    key_states, value_states = shared_kv_states[self.kv_shared_layer_index]\n    KeyError: 22\n\n\nThis is a **different failure**.\n\nGemma 4 has a shared KV cache architecture. In Gemma 4, the last `num_kv_shared_layers` decoder layers do not compute their own key/value projections; they reuse K/V tensors from an earlier non-shared layer of the same attention type. 
Hugging Face’s Gemma 4 blog describes this shared-KV-cache optimization explicitly. (Hugging Face)\n\nSo inside generation, Gemma 4 needs something like this:\n\n\n    shared_kv_states[source_layer_index] = (key_states, value_states)\n\n\nThen later:\n\n\n    key_states, value_states = shared_kv_states[self.kv_shared_layer_index]\n\n\nYour error says the later layer expected:\n\n\n    shared_kv_states[22]\n\n\nbut that key did not exist.\n\nThat means the layer that should have populated `shared_kv_states[22]` either:\n\n  1. did not run in the expected way;\n  2. ran under a hook/layout that did not capture/store the expected K/V state;\n  3. had its execution order/state propagation changed by the second dispatch;\n  4. or had Gemma 4’s shared-state bookkeeping disrupted by PEFT/Accelerate wrapper hooks.\n\n\n\nThe important point:\n\n> Passing `device_map` into PEFT changes the PEFT-wrapped model’s dispatch/hook structure. For Gemma 4, that can break the shared-KV path during generation.\n\nThis is why `device_map` in PEFT is not a good fix even if it avoids the original loading error.\n\n* * *\n\n# Why PEFT `device_map` is not equivalent to base-model `device_map`\n\nYour base model is loaded like this:\n\n\n    base_model = Gemma4ForConditionalGeneration.from_pretrained(\n        ...,\n        device_map=device_map,\n        offload_folder=...,\n    )\n\n\nThat is the correct place to put the base model’s device map.\n\nAfter PEFT wrapping, module names and structure are different. PEFT wraps the model, often under paths like:\n\n\n    base_model.model...\n    base_model.model.model...\n\n\nSo the same raw map:\n\n\n    {\n        \"model.vision_tower\": \"cpu\",\n        \"model.audio_tower\": \"cpu\",\n        \"\": 0\n    }\n\n\nmay not mean the same thing after wrapping.\n\nAccelerate’s `device_map` is module-name based and recursively applies placement to submodules. 
Its docs describe `dispatch_model` as spreading modules across GPU, CPU, or disk according to a device map. (GitHub)\n\nSo there are two different operations:\n\n\n    Base from_pretrained(..., device_map=...)\n        initial placement of the base model\n\n    PeftModel.from_pretrained(..., device_map=...)\n        redispatch of the PEFT-wrapped model\n\n\nThe second one is not just “offload PEFT too.” It can reattach hooks and alter runtime behavior.\n\n* * *\n\n# What PEFT offload options actually do\n\nPEFT does have offload-related controls. The PEFT docs/source mention `ephemeral_gpu_offload`, which can be used when loading adapters with partially offloaded modules. (GitHub)\n\nA safer PEFT call shape is:\n\n\n    model = PeftModel.from_pretrained(\n        base_model,\n        lora_path,\n        adapter_name=lora_source_client_name,\n        is_trainable=False,\n        offload_dir=r\"E:\\Folder\\offload_temp\",\n        offload_buffers=True,\n        ephemeral_gpu_offload=True,\n        torch_device=\"cuda:0\",\n    )\n\n\nNotice what is **not** there:\n\n\n    device_map=device_map\n\n\nThis is the distinction:\n\nOption | Meaning\n---|---\n`device_map` on base `from_pretrained` | Places the base model modules across CPU/GPU/disk.\n`offload_folder` on base `from_pretrained` | Folder for base-model offload during Transformers loading.\n`offload_dir` on PEFT load | Folder used by Accelerate/PEFT redispatch if offloaded modules are involved.\n`offload_buffers=True` | Also offload buffers when hooks need it.\n`ephemeral_gpu_offload=True` | Temporarily move offloaded pieces to GPU when needed.\n`device_map` on PEFT | Re-dispatches the PEFT-wrapped model; risky here.\n\nThere are known PEFT issue patterns around `offload_dir` not being propagated/found during `PeftModel.from_pretrained()` on offloaded base models. 
(GitHub)\n\nSo yes, you can try to make PEFT loading offload-aware, but I would do it with `offload_dir`, `offload_buffers`, and `ephemeral_gpu_offload`, **not** by passing the original base `device_map` again.\n\n* * *\n\n# The `llm_int8_enable_fp32_cpu_offload=True` confusion\n\nThis setting is confusing:\n\n\n    llm_int8_enable_fp32_cpu_offload=True\n\n\nThe name says `int8`, but in current Transformers/bitsandbytes integration, CPU/disk dispatch with quantized models often uses this flag as the gate that allows some modules to remain in full precision on CPU when a custom `device_map` contains CPU/disk entries. The Transformers bitsandbytes docs discuss CPU/GPU offload under the bitsandbytes quantization workflow, and public error messages for this path instruct users to enable `llm_int8_enable_fp32_cpu_offload=True` when modules are dispatched to CPU/disk. (Hugging Face)\n\nFor your use case:\n\n\n    4-bit + custom CPU/GPU device_map\n\n\nI would keep it enabled unless you move back to all-GPU.\n\nSo:\n\n## All-GPU path\n\n\n    quant_config = BitsAndBytesConfig(\n        load_in_4bit=True,\n        bnb_4bit_quant_type=\"nf4\",\n        bnb_4bit_use_double_quant=True,\n        bnb_4bit_compute_dtype=torch.bfloat16,\n    )\n\n\n## CPU/GPU split path\n\n\n    quant_config = BitsAndBytesConfig(\n        load_in_4bit=True,\n        bnb_4bit_quant_type=\"nf4\",\n        bnb_4bit_use_double_quant=True,\n        bnb_4bit_compute_dtype=torch.bfloat16,\n        llm_int8_enable_fp32_cpu_offload=True,\n    )\n\n\n* * *\n\n# The `bnb_4bit_use_double_quant=True` part\n\nYour original `Tensor.item()` error points very strongly at this setting:\n\n\n    bnb_4bit_use_double_quant=True\n\n\nDouble quantization is normally useful because it saves additional memory. But it also creates nested quantization state. 
Your traceback fails specifically while bitsandbytes serializes nested quant-state metadata:\n\n\n    \"nested_offset\": self.offset.item()\n\n\nSo as a diagnostic, test this:\n\n\n    bnb_4bit_use_double_quant=False\n\n\nIf the first error disappears with double quant off, then the issue is specifically:\n\n\n    bitsandbytes nested quant_state\n    + PEFT/Accelerate redispatch\n    + meta/offload\n\n\nIf it still fails, then the broader issue is:\n\n\n    bitsandbytes Params4bit / Linear4bit\n    + PEFT/Accelerate redispatch\n    + offloaded base model\n\n\nEither way, the failure is still in the quantized/offloaded dispatch stack, not simply in PEFT’s preference for GPU placement.\n\n* * *\n\n# The `model.multi_modal_projector` key may not be valid\n\nGemma 4 is multimodal. The official Transformers docs describe the base Gemma 4 model as comprising a **vision backbone**, an **audio backbone**, and a **language model**; the conditional-generation model includes the language modeling head. (Hugging Face)\n\nBut exact implementation module names matter.\n\nYou used:\n\n\n    \"model.multi_modal_projector\": \"cpu\"\n\n\nI would not assume this key exists for every Gemma 4 checkpoint/implementation.\n\nBefore using it, run:\n\n\n    for name, module in base_model.named_modules():\n        lname = name.lower()\n        if any(k in lname for k in [\"vision\", \"audio\", \"project\", \"embed\", \"multi\"]):\n            print(name, type(module).__name__)\n\n\nThen confirm:\n\n\n    print(base_model.hf_device_map)\n\n\nIf the module key does not exist, Accelerate may ignore it or warn that it does not match any submodule. 
The fallback `\"\": 0` will place everything else on GPU, but your mental model of what was offloaded will be wrong.\n\nFor Gemma 4, bridge-like names may be closer to:\n\n\n    model.embed_vision\n    model.embed_audio\n    model.audio_tower.output_proj\n\n\ndepending on the specific implementation.\n\nFor first stable testing, I would use only:\n\n\n    device_map = {\n        \"model.vision_tower\": \"cpu\",\n        \"model.audio_tower\": \"cpu\",\n        \"\": 0,\n    }\n\n\nThen add bridge/projector modules only after verifying exact names.\n\n* * *\n\n# The memory warning matters\n\nThis warning is important:\n\n\n    no modules could be assigned to device 0 due to insufficient memory:\n    0: 5668601858 bytes required\n\n\nThat means Accelerate’s current allocation attempt already believes GPU 0 needs about 5.7 GB more free memory for the proposed dispatch plan.\n\nThis is not the same as the Python exception, but it is a warning sign. It says the model placement plan is already under memory pressure.\n\nUnder memory pressure, the system is more likely to use CPU/disk offload and meta placeholders. 
That increases the chance that PEFT/Accelerate/bitsandbytes redispatch enters an unsupported or fragile path.\n\nSo treat the memory warning as part of the diagnosis:\n\n\n    The dispatch plan is already tight.\n    PEFT loading then triggers another dispatch/hook pass.\n    That pass touches bnb quant-state/meta tensors.\n    The process crashes.\n\n\n* * *\n\n# Recommended code shape\n\n## Base model loading\n\nUse a raw Windows path:\n\n\n    OFFLOAD_DIR = r\"E:\\Folder\\offload_temp\"\n\n\nThen:\n\n\n    quant_config = BitsAndBytesConfig(\n        load_in_4bit=True,\n        bnb_4bit_quant_type=\"nf4\",\n        bnb_4bit_use_double_quant=False,  # first diagnostic; turn on later\n        bnb_4bit_compute_dtype=torch.bfloat16,\n        llm_int8_enable_fp32_cpu_offload=True,\n    )\n\n    device_map = {\n        \"model.vision_tower\": \"cpu\",\n        \"model.audio_tower\": \"cpu\",\n        \"\": 0,\n    }\n\n    base_model = Gemma4ForConditionalGeneration.from_pretrained(\n        MODEL_REGISTRY[model_id_to_load],\n        quantization_config=quant_config,\n        device_map=device_map,\n        max_memory=max_memory,\n        offload_folder=OFFLOAD_DIR,\n        dtype=torch.bfloat16,\n        attn_implementation=\"sdpa\",\n        trust_remote_code=False,\n        low_cpu_mem_usage=True,\n    )\n\n\nI changed three things:\n\n\n    bnb_4bit_use_double_quant=False\n\n\nfor the first diagnostic;\n\n\n    \"model.multi_modal_projector\": \"cpu\"\n\n\nremoved until verified;\n\n\n    low_cpu_mem_usage=True\n\n\nbecause you are explicitly operating near memory/offload boundaries.\n\n* * *\n\n## PEFT loading\n\nDo **not** pass `device_map` here.\n\nUse:\n\n\n    model = PeftModel.from_pretrained(\n        base_model,\n        lora_path,\n        adapter_name=lora_source_client_name,\n        is_trainable=False,\n        offload_dir=OFFLOAD_DIR,\n        offload_buffers=True,\n        ephemeral_gpu_offload=True,\n        torch_device=\"cuda:0\",\n    
)\n\n\nThen:\n\n\n    model.eval()\n\n    model.config.use_cache = True\n    if hasattr(model, \"generation_config\") and model.generation_config is not None:\n        model.generation_config.use_cache = True\n\n\n* * *\n\n## Input placement for generation\n\nFor mixed CPU/GPU dispatched models, avoid blindly doing:\n\n\n    inputs = inputs.to(\"cuda:0\")\n\n\nInstead place token tensors on the input embedding device:\n\n\n    def get_input_embedding_device(model):\n        candidates = [model]\n\n        if hasattr(model, \"base_model\"):\n            candidates.append(model.base_model)\n            if hasattr(model.base_model, \"model\"):\n                candidates.append(model.base_model.model)\n\n        for obj in candidates:\n            try:\n                emb = obj.get_input_embeddings()\n                if emb is not None and hasattr(emb, \"weight\"):\n                    return emb.weight.device\n            except Exception:\n                pass\n\n        return torch.device(\"cuda:0\")\n\n\n    input_device = get_input_embedding_device(model)\n\n    for key, value in inputs.items():\n        if torch.is_tensor(value):\n            inputs[key] = value.to(input_device)\n\n    outputs = model.generate(\n        **inputs,\n        max_new_tokens=max_new_tokens,\n        use_cache=True,\n    )\n\n\nThis does not solve the PEFT loading bug, but it avoids a separate CPU/CUDA mismatch after loading succeeds.\n\n* * *\n\n# Should you create an issue?\n\nYes.\n\nThis is not just a configuration question. 
You have two issue-worthy failures.\n\n## Issue 1: PEFT load on offloaded 4-bit Gemma 4 fails through bitsandbytes/meta state\n\nSuggested title:\n\n\n    PeftModel.from_pretrained on offloaded 4-bit Gemma4 hits bitsandbytes nested QuantState meta tensor\n\n\nInclude this core traceback:\n\n\n    PeftModel.from_pretrained\n    → load_adapter\n    → dispatch_model\n    → attach_execution_device_hook\n    → module.state_dict()\n    → bitsandbytes Linear4bit._save_to_state_dict\n    → weight.quant_state.as_dict(packed=True)\n    → \"nested_offset\": self.offset.item()\n    → Tensor.item() cannot be called on meta tensors\n\n\nLikely owners:\n\nComponent | Why\n---|---\n**bitsandbytes** | The failing `.item()` call is inside bnb quant-state serialization.\n**Accelerate** | It calls `state_dict()` while attaching dispatch hooks.\n**PEFT** | It triggers the redispatch during adapter loading.\n**Transformers** | It integrates Gemma 4, bnb quantization, and device maps.\n\nI would file first at **Transformers** or **PEFT**, and mention bitsandbytes/Accelerate in the issue body. If you can reduce it to a pure `Linear4bit` / `dispatch_model` repro, then bitsandbytes or Accelerate becomes the better primary repo.\n\n* * *\n\n## Issue 2: Passing `device_map` into PEFT breaks Gemma 4 shared-KV generation\n\nSuggested title:\n\n\n    Gemma4 + PeftModel.from_pretrained(device_map=...) 
breaks shared_kv_states during generate: KeyError 22\n\n\nInclude:\n\n\n    PeftModel.from_pretrained(..., device_map=device_map)\n\n\nthen:\n\n\n    generate()\n    → Gemma4 self_attn\n    → shared_kv_states[self.kv_shared_layer_index]\n    → KeyError: 22\n\n\nLikely owners:\n\nComponent | Why\n---|---\n**Transformers** | `shared_kv_states` is Gemma 4 model logic.\n**PEFT** | The issue appears after PEFT wrapping/adapter loading.\n**Accelerate** | The second dispatch/hook layout is likely the trigger.\n\nI would file this one at **Transformers**, because the final failure is Gemma 4 model logic, then cross-link PEFT/Accelerate if maintainers request it.\n\n* * *\n\n# What I would not do\n\n## Do not pass the same device map to PEFT\n\nAvoid:\n\n\n    model = PeftModel.from_pretrained(\n        base_model,\n        lora_path,\n        adapter_name=lora_source_client_name,\n        device_map=device_map,\n        is_trainable=False,\n    )\n\n\nIt is not equivalent to “offload PEFT too.” It can trigger a second dispatch of the wrapped model and break Gemma 4 shared-KV state.\n\n## Do not use `merge_and_unload()` in this path\n\nThis line is not the direct cause here:\n\n\n    if isinstance(base_model, PeftModel):\n        base_model = base_model.merge_and_unload()\n\n\nBut merging/unloading adapters into quantized/offloaded models is another fragile path. 
Keep it out of the first stable inference path.\n\n## Do not offload bridge/projector modules until verified\n\nStart with:\n\n\n    \"model.vision_tower\": \"cpu\",\n    \"model.audio_tower\": \"cpu\",\n    \"\": 0,\n\n\nOnly add bridge modules after verifying exact names and after PEFT loading works.\n\n* * *\n\n# Practical decision tree\n\n## If you need it working now\n\nUse:\n\n\n    device_map = {\"\": 0}\n\n\nThat is the known-good operational path.\n\nIf E4B does not fit all-GPU with your adapter, try:\n\n\n    bnb_4bit_use_double_quant=False\n\n\nor use a smaller Gemma 4 variant for runtime.\n\n## If you need CPU-offloaded vision/audio\n\nTry this sequence:\n\n  1. Base model with vision/audio CPU only.\n  2. `bnb_4bit_use_double_quant=False`.\n  3. No PEFT `device_map`.\n  4. PEFT with `offload_dir`, `offload_buffers=True`, `ephemeral_gpu_offload=True`.\n  5. `use_cache=True`.\n  6. Inputs placed on embedding device.\n\n\n\nIf that still fails, it is not a configuration issue anymore; it is a current compatibility bug.\n\n## If you want the cleanest bug report\n\nUse a minimal matrix:\n\nCase | Expected\n---|---\nall GPU + PEFT | works\nsplit base without PEFT | works\nsplit base + PEFT without PEFT `device_map` | `Tensor.item()` meta or `Params4bit` failure\nsplit base + PEFT with PEFT `device_map` | `KeyError: 22` during generate\n\nThat is a strong report.\n\n* * *\n\n# Final answer to your direct question\n\n## “Am I offloading vision/audio to CPU?”\n\nYes, at least for the keys that actually exist:\n\n\n    \"model.vision_tower\": \"cpu\"\n    \"model.audio_tower\": \"cpu\"\n\n\nBut verify `model.multi_modal_projector`; that key may not exist or may not be the correct bridge module name.\n\n## “Does PEFT expect those to be on GPU?”\n\nNo, not in the simple sense.\n\nThe real issue is that PEFT adapter loading triggers a second dispatch/hook process over a quantized, partially offloaded model. 
That path currently interacts badly with bitsandbytes 4-bit quantization state and Gemma 4’s shared-KV architecture.\n\n## “Can I offload on PEFT as well?”\n\nPartly, yes, but not by passing the same `device_map`.\n\nUse PEFT/offload arguments:\n\n\n    offload_dir=...\n    offload_buffers=True\n    ephemeral_gpu_offload=True\n    torch_device=\"cuda:0\"\n\n\nDo **not** pass:\n\n\n    device_map=device_map\n\n\nfor Gemma 4 PEFT generation.\n\n## “Do I need to create an issue?”\n\nYes. This is issue-worthy.\n\nThe strongest issue is:\n\n\n    PeftModel.from_pretrained on CPU/GPU-dispatched 4-bit Gemma4 fails during Accelerate hook attachment because bitsandbytes nested QuantState contains meta tensor.\n\n\nThe second issue is:\n\n\n    Passing device_map to PeftModel.from_pretrained on Gemma4 causes shared_kv_states KeyError during generate.\n\n\nBoth are real cross-library edge cases, not just user error.\n\n* * *\n\n## Short summary\n\n  * The all-GPU path works because it avoids split-dispatch/offload hooks.\n  * The first error is caused by PEFT/Accelerate touching bitsandbytes 4-bit nested quant-state while something is still on `meta`.\n  * The second error is caused by passing `device_map` into PEFT, which can break Gemma 4’s shared-KV generation bookkeeping.\n  * PEFT does not simply require vision/audio on GPU.\n  * Do not pass the same `device_map` to PEFT.\n  * Use `offload_dir`, `offload_buffers=True`, and `ephemeral_gpu_offload=True` instead.\n  * Start with only `vision_tower` and `audio_tower` on CPU; verify any projector/bridge module name before offloading it.\n  * File issues with Transformers/PEFT/Accelerate/bitsandbytes; this is a genuine integration bug surface.\n\n",
  "title": "CPU offloading error scenario"
}