{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreifhwzljzkp4ozdlyv2dnqy6wp65hw5nhqtz3tozioffbarrv2osgi",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mkdszkn44j52"
  },
  "path": "/t/cpu-offloading-error-scenario/175522#post_6",
  "publishedAt": "2026-04-25T19:05:10.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "CUDA semantics — PyTorch 2.11 documentation"
  ],
  "textContent": "Hi, thanks for all your help. I have tried your recommendations.\n\nOn PEFT loading it jumped to my second GPU (GPU 1) which does not have enough VRAM. I use it for RAG.\n\nSee below the error. I pasted below all the code sections I used for the test.\n\n> QUANT_CONFIGS = {\n>  “gemma4-e4b-it”: BitsAndBytesConfig(\n>  load_in_4bit=True,\n>  bnb_4bit_quant_type=“nf4”,\n>  bnb_4bit_use_double_quant=False,\n>  bnb_4bit_compute_dtype=torch.bfloat16,\n>  llm_int8_enable_fp32_cpu_offload=True,\n>  )\n>  }\n>\n> MODEL_KWARGS = {\n>  “gemma4-e4b-it”: {\n>  “dtype”: torch.bfloat16,\n>  “attn_implementation”: “sdpa”,\n>  “trust_remote_code”: False,\n>  “low_cpu_mem_usage”: True\n>  },\n>  }\n>\n> OFFLOAD_DIR = r\"E:\\Folder\\offload_temp\"\n>\n> device_map = {\n>  “model.vision_tower”: “cpu”,\n>  “model.audio_tower”: “cpu”,\n>  “”: 0,\n>  }\n>\n> quant_config = QUANT_CONFIGS.get(model_id_to_load)\n>\n> base_model = Gemma4ForConditionalGeneration.from_pretrained(\n>  MODEL_REGISTRY[model_id_to_load],\n>  quantization_config=quant_config,\n>  device_map=device_map,\n>  max_memory=max_memory,\n>  offload_folder=OFFLOAD_DIR,\n>  **MODEL_KWARGS[model_id_to_load]\n>  )\n>\n> model = PeftModel.from_pretrained(\n>  base_model,\n>  lora_path,\n>  adapter_name=lora_source_client_name,\n>  is_trainable=False,\n>  offload_dir=OFFLOAD_DIR,\n>  offload_buffers=True,\n>  ephemeral_gpu_offload=True,\n>  torch_device=“cuda:0”,\n>  )\n>  model.eval()\n>\n> model.config.use_cache = True\n>  if hasattr(model, “generation_config”) and model.generation_config is not None:\n>  model.generation_config.use_cache = True\n>\n> past_key_values = QuantoQuantizedCache(\n>  config=model.config,\n>  )\n>\n> input_device = get_input_embedding_device(model)\n>\n> for key, value in inputs.items():\n>  if torch.is_tensor(value):\n>  inputs[key] = value.to(input_device)\n>\n> outputs = model.generate(\n>  **inputs,\n>  past_key_values=past_key_values,\n>  tokenizer=tokenizer,\n>  
do_sample=True,\n>  stop_strings=current_stop_strings,\n>  **params,\n>  )\n\nError:\n\n> 2026-04-25 20:31:12,735 | Worker (17960) | INFO | Based on the current allocation process, no modules could be assigned to the following devices due to insufficient memory:\n>\n>   * 0: 5668601858 bytes required\n>  These minimum requirements are specific to this allocation attempt and may vary. Consider increasing the available memory for these devices to at least the specified minimum, or adjusting the model config.\n>  2026-04-25 20:31:14,660 | Worker (17960) | ERROR | Worker error: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 1 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 3.36 GiB is allocated by PyTorch, and 92.14 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management ( https://pytorch.org/docs/stable/notes/cuda.html#environment-variables )\n>  Traceback (most recent call last):\n>  File \"E:\\Folder\\inference_worker.py\", line 460, in inference_worker_loop\n>  model = _worker_load_model(model_id_to_load, lora_source_client_name, supports_image, supports_audio)\n>  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n>  File \"E:\\Folder\\inference_worker.py\", line 364, in _worker_load_model\n>  model = PeftModel.from_pretrained(\n>  ^^^^^^^^^^^^^^^^^^^^^^^^^^\n>  File \"E:\\Folder\\gemma_env\\Lib\\site-packages\\peft\\peft_model.py\", line 582, in from_pretrained\n>  load_result = model.load_adapter(\n>  ^^^^^^^^^^^^^^^^^^^\n>  File \"E:\\Folder\\gemma_env\\Lib\\site-packages\\peft\\peft_model.py\", line 1475, in load_adapter\n>  dispatch_model(\n>  File \"E:\\Folder\\gemma_env\\Lib\\site-packages\\accelerate\\big_modeling.py\", line 432, in dispatch_model\n>  attach_align_device_hook_on_blocks(\n>  File 
\"E:\\Folder\\gemma_env\\Lib\\site-packages\\accelerate\\hooks.py\", line 695, in attach_align_device_hook_on_blocks\n>  attach_align_device_hook_on_blocks(\n>  File \"E:\\Folder\\gemma_env\\Lib\\site-packages\\accelerate\\hooks.py\", line 695, in attach_align_device_hook_on_blocks\n>  attach_align_device_hook_on_blocks(\n>  File \"E:\\Folder\\gemma_env\\Lib\\site-packages\\accelerate\\hooks.py\", line 695, in attach_align_device_hook_on_blocks\n>  attach_align_device_hook_on_blocks(\n>  [Previous line repeated 3 more times]\n>  File \"E:\\Folder\\gemma_env\\Lib\\site-packages\\accelerate\\hooks.py\", line 653, in attach_align_device_hook_on_blocks\n>  add_hook_to_module(module, hook)\n>  File \"E:\\Folder\\gemma_env\\Lib\\site-packages\\accelerate\\hooks.py\", line 183, in add_hook_to_module\n>  module = hook.init_hook(module)\n>  ^^^^^^^^^^^^^^^^^^^^^^\n>  File \"E:\\Folder\\gemma_env\\Lib\\site-packages\\accelerate\\hooks.py\", line 305, in init_hook\n>  set_module_tensor_to_device(module, name, self.execution_device, tied_params_map=self.tied_params_map)\n>  File \"E:\\Folder\\gemma_env\\Lib\\site-packages\\accelerate\\utils\\modeling.py\", line 335, in set_module_tensor_to_device\n>  new_value = old_value.to(device, non_blocking=non_blocking)\n>  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n>  File \"E:\\Folder\\gemma_env\\Lib\\site-packages\\bitsandbytes\\nn\\modules.py\", line 351, in to\n>  super().to(device=device, dtype=dtype, non_blocking=non_blocking),\n>  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n>  File \"E:\\Folder\\gemma_env\\Lib\\site-packages\\bitsandbytes\\nn\\modules.py\", line 401, in __torch_function__\n>  return super().__torch_function__(func, types, args, kwargs)\n>  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n>  torch.OutOfMemoryError: CUDA out of memory. 
Tried to allocate 20.00 MiB. GPU 1 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 3.36 GiB is allocated by PyTorch, and 92.14 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management ( https://pytorch.org/docs/stable/notes/cuda.html#environment-variables )\n>  2026-04-25 20:31:14,676 | Worker (17564) | ERROR | Worker returned error: Worker error: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 1 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 3.36 GiB is allocated by PyTorch, and 92.14 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management ( https://pytorch.org/docs/stable/notes/cuda.html#environment-variables )",
  "title": "CPU offloading error scenario"
}