Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreicnbpaxr6y6olifseqsdeimynow6qrnm4pxbhbdiwws74xujujlzi",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mkahnx7kpjl2"
  },
  "path": "/t/cpu-offloading-error-scenario/175522#post_1",
  "publishedAt": "2026-04-24T10:06:02.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "I have a problem: I am trying to offload the vision and audio towers to the CPU via a device map, but I get the error below. If I change to device_map={\"\": 0}, everything works correctly. I am checking whether somebody can reproduce it and whether I need to change anything, or whether I should open an issue with PEFT, Transformers, Accelerate, or bitsandbytes.\n\nIf I understand the problem correctly, I am offloading model.vision_tower, model.multi_modal_projector, and model.audio_tower, but PEFT expects those to be on the GPU. Can’t I offload with PEFT as well?\n\nThanks in advance to all.\n\n**Model: Gemma 4 E4B IT**\n\n> Device Map:\n>  device_map = {\n>  \"model.vision_tower\": \"cpu\",\n>  \"model.multi_modal_projector\": \"cpu\",\n>  \"model.audio_tower\": \"cpu\",\n>  \"\": 0 # This sets the rest of the model to GPU 0\n>  }\n\n**Bits and Bytes Config:**\n\n>\n>         load_in_4bit=True,\n>         bnb_4bit_quant_type=\"nf4\",\n>         bnb_4bit_use_double_quant=True,\n>         bnb_4bit_compute_dtype=torch.bfloat16,\n>         llm_int8_enable_fp32_cpu_offload=True,\n>\n\n**Model Kwargs:**\n\n> \"dtype\": torch.bfloat16,\n>  \"attn_implementation\": \"sdpa\",\n>  \"trust_remote_code\": False,\n>  \"low_cpu_mem_usage\": False\n\n**Loading Code:**\n\n> base_model = Gemma4ForConditionalGeneration.from_pretrained(\n>  MODEL_REGISTRY[model_id_to_load],\n>  quantization_config=quant_config,\n>  device_map=device_map,\n>  max_memory=max_memory,\n>  offload_folder=\"e:\\Folder\\offload_temp\",\n>  **MODEL_KWARGS[model_id_to_load]\n>  )\n\n**PEFT loading:**\n\n>\n>         if isinstance(base_model, PeftModel):\n>             base_model = base_model.merge_and_unload()\n>\n>         model = PeftModel.from_pretrained(\n>             base_model,\n>             lora_path,\n>             adapter_name=lora_source_client_name,\n>             is_trainable=False\n>         )\n>\n\n**Versions:**\nTransformers: v5.6.2\nAccelerate: v1.14.0.dev0\nBitsandBytes: v0.49.2\nTorch: 2.8.0+cu129\nPeft: v0.19.1\n\n**Below Error 
happens after PEFT Loading:**\n\n> 2026-04-24 11:35:10,080 | Worker (6904) | INFO | Based on the current allocation process, no modules could be assigned to the following devices due to insufficient memory:\n>\n>   * 0: 5668601858 bytes required\n>  These minimum requirements are specific to this allocation attempt and may vary. Consider increasing the available memory for these devices to at least the specified minimum, or adjusting the model config.\n>  2026-04-24 11:35:12,470 | Worker (6904) | ERROR | Worker error: Tensor.item() cannot be called on meta tensors\n>  Traceback (most recent call last):\n>  File \"E:\\Folder\\inference_worker.py\", line 414, in inference_worker_loop\n>  model = _worker_load_model(model_id_to_load, lora_source_client_name, supports_image, supports_audio)\n>  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n>  File \"E:\\Folder\\inference_worker.py\", line 344, in _worker_load_model\n>  model = PeftModel.from_pretrained(\n>  ^^^^^^^^^^^^^^^^^^^^^^^^^^\n>  File \"E:\\Folder\\gemma_env\\Lib\\site-packages\\peft\\peft_model.py\", line 582, in from_pretrained\n>  load_result = model.load_adapter(\n>  ^^^^^^^^^^^^^^^^^^^\n>  File \"E:\\Folder\\gemma_env\\Lib\\site-packages\\peft\\peft_model.py\", line 1475, in load_adapter\n>  dispatch_model(\n>  File \"E:\\Folder\\gemma_env\\Lib\\site-packages\\accelerate\\big_modeling.py\", line 432, in dispatch_model\n>  attach_align_device_hook_on_blocks(\n>  File \"E:\\Folder\\gemma_env\\Lib\\site-packages\\accelerate\\hooks.py\", line 695, in attach_align_device_hook_on_blocks\n>  attach_align_device_hook_on_blocks(\n>  File \"E:\\Folder\\gemma_env\\Lib\\site-packages\\accelerate\\hooks.py\", line 695, in attach_align_device_hook_on_blocks\n>  attach_align_device_hook_on_blocks(\n>  File \"E:\\Folder\\gemma_env\\Lib\\site-packages\\accelerate\\hooks.py\", line 695, in attach_align_device_hook_on_blocks\n>  attach_align_device_hook_on_blocks(\n>  [Previous line repeated 3 more times]\n>  File \"E:\\Folder\\gemma_env\\Lib\\site-packages\\accelerate\\hooks.py\", line 677, in attach_align_device_hook_on_blocks\n>  attach_execution_device_hook(\n>  File \"E:\\Folder\\gemma_env\\Lib\\site-packages\\accelerate\\hooks.py\", line 470, in attach_execution_device_hook\n>  attach_execution_device_hook(\n>  File \"E:\\Folder\\gemma_env\\Lib\\site-packages\\accelerate\\hooks.py\", line 459, in attach_execution_device_hook\n>  if not hasattr(module, \"_hf_hook\") and len(module.state_dict()) > 0:\n>  ^^^^^^^^^^^^^^^^^^^\n>  File \"E:\\Folder\\gemma_env\\Lib\\site-packages\\torch\\nn\\modules\\module.py\", line 2260, in state_dict\n>  module.state_dict(\n>  File \"E:\\Folder\\gemma_env\\Lib\\site-packages\\torch\\nn\\modules\\module.py\", line 2257, in state_dict\n>  self._save_to_state_dict(destination, prefix, keep_vars)\n>  File \"E:\\Folder\\gemma_env\\Lib\\site-packages\\bitsandbytes\\nn\\modules.py\", line 525, in _save_to_state_dict\n>  for k, v in self.weight.quant_state.as_dict(packed=True).items():\n>  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n>  File \"E:\\Folder\\gemma_env\\Lib\\site-packages\\bitsandbytes\\functional.py\", line 581, in as_dict\n>  \"nested_offset\": self.offset.item(),\n>  ^^^^^^^^^^^^^^^^^^\n>  File \"E:\\Folder\\gemma_env\\Lib\\site-packages\\torch\\_meta_registrations.py\", line 7457, in meta_local_scalar_dense\n>  raise RuntimeError(\"Tensor.item() cannot be called on meta tensors\")\n>  RuntimeError: Tensor.item() cannot be called on meta tensors\n>  2026-04-24 11:35:12,517 | Worker (15192) | ERROR | Worker returned error: Worker error: Tensor.item() cannot be called on meta tensors",
  "title": "CPU offloading error scenario"
}