External Publication

CPU offloading error scenario

Hugging Face Forums [Unofficial] April 25, 2026

Hi, thanks for all your help. I have tried your recommendations.

When the PEFT adapter was loaded, allocation jumped to my second GPU (GPU 1), which does not have enough VRAM. I use that GPU for RAG.

The error is shown below, after all the code sections I used for the test.

QUANT_CONFIGS = {
    "gemma4-e4b-it": BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=False,
        bnb_4bit_compute_dtype=torch.bfloat16,
        llm_int8_enable_fp32_cpu_offload=True,
    )
}

MODEL_KWARGS = {
    "gemma4-e4b-it": {
        "dtype": torch.bfloat16,
        "attn_implementation": "sdpa",
        "trust_remote_code": False,
        "low_cpu_mem_usage": True,
    },
}

OFFLOAD_DIR = r"E:\Folder\offload_temp"

device_map = {
    "model.vision_tower": "cpu",
    "model.audio_tower": "cpu",
    "": 0,
}
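For readers unfamiliar with prefix-keyed device maps, here is a minimal sketch of how a map like this one is usually interpreted, assuming longest-prefix matching on the dotted module name with the empty string as the catch-all. `resolve_device` is a hypothetical helper written for illustration, not a function from accelerate or transformers, and the submodule names in the loop are made up.

```python
def resolve_device(module_name, device_map):
    """Hypothetical helper: walk up the dotted module name until a key in
    device_map matches; the empty-string key acts as the catch-all."""
    name = module_name
    while name:
        if name in device_map:
            return device_map[name]
        # Drop the last dotted component and try the parent module.
        name = name.rpartition(".")[0]
    if "" in device_map:
        return device_map[""]
    raise KeyError(f"no device_map entry covers {module_name!r}")


device_map = {"model.vision_tower": "cpu", "model.audio_tower": "cpu", "": 0}

# Illustrative module names only; the real Gemma submodule names may differ.
for module in (
    "model.vision_tower.encoder.layers.0",
    "model.audio_tower.conv",
    "model.language_model.layers.3.self_attn",
):
    print(module, "->", resolve_device(module, device_map))
```

Under that assumption, everything below the two towers resolves to `"cpu"` and every other module falls through to device `0`, so nothing in this map by itself names GPU 1.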

quant_config = QUANT_CONFIGS.get(model_id_to_load)

base_model = Gemma4ForConditionalGeneration.from_pretrained(
    MODEL_REGISTRY[model_id_to_load],
    quantization_config=quant_config,
    device_map=device_map,
    max_memory=max_memory,
    offload_folder=OFFLOAD_DIR,
    **MODEL_KWARGS[model_id_to_load]
)

model = PeftModel.from_pretrained(
    base_model,
    lora_path,
    adapter_name=lora_source_client_name,
    is_trainable=False,
    offload_dir=OFFLOAD_DIR,
    offload_buffers=True,
    ephemeral_gpu_offload=True,
    torch_device="cuda:0",
)
model.eval()

model.config.use_cache = True
if hasattr(model, "generation_config") and model.generation_config is not None:
    model.generation_config.use_cache = True

past_key_values = QuantoQuantizedCache(
    config=model.config,
)

input_device = get_input_embedding_device(model)

for key, value in inputs.items():
    if torch.is_tensor(value):
        inputs[key] = value.to(input_device)

outputs = model.generate(
    **inputs,
    past_key_values=past_key_values,
    tokenizer=tokenizer,
    do_sample=True,
    stop_strings=current_stop_strings,
    **params,
)

Error:

2026-04-25 20:31:12,735 | Worker (17960) | INFO | Based on the current allocation process, no modules could be assigned to the following devices due to insufficient memory:
0: 5668601858 bytes required
These minimum requirements are specific to this allocation attempt and may vary. Consider increasing the available memory for these devices to at least the specified minimum, or adjusting the model config.
2026-04-25 20:31:14,660 | Worker (17960) | ERROR | Worker error: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 1 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 3.36 GiB is allocated by PyTorch, and 92.14 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
  File "E:\Folder\inference_worker.py", line 460, in inference_worker_loop
    model = _worker_load_model(model_id_to_load, lora_source_client_name, supports_image, supports_audio)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Folder\inference_worker.py", line 364, in _worker_load_model
    model = PeftModel.from_pretrained(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Folder\gemma_env\Lib\site-packages\peft\peft_model.py", line 582, in from_pretrained
    load_result = model.load_adapter(
                  ^^^^^^^^^^^^^^^^^^^
  File "E:\Folder\gemma_env\Lib\site-packages\peft\peft_model.py", line 1475, in load_adapter
    dispatch_model(
  File "E:\Folder\gemma_env\Lib\site-packages\accelerate\big_modeling.py", line 432, in dispatch_model
    attach_align_device_hook_on_blocks(
  File "E:\Folder\gemma_env\Lib\site-packages\accelerate\hooks.py", line 695, in attach_align_device_hook_on_blocks
    attach_align_device_hook_on_blocks(
  File "E:\Folder\gemma_env\Lib\site-packages\accelerate\hooks.py", line 695, in attach_align_device_hook_on_blocks
    attach_align_device_hook_on_blocks(
  File "E:\Folder\gemma_env\Lib\site-packages\accelerate\hooks.py", line 695, in attach_align_device_hook_on_blocks
    attach_align_device_hook_on_blocks(
  [Previous line repeated 3 more times]
  File "E:\Folder\gemma_env\Lib\site-packages\accelerate\hooks.py", line 653, in attach_align_device_hook_on_blocks
    add_hook_to_module(module, hook)
  File "E:\Folder\gemma_env\Lib\site-packages\accelerate\hooks.py", line 183, in add_hook_to_module
    module = hook.init_hook(module)
             ^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Folder\gemma_env\Lib\site-packages\accelerate\hooks.py", line 305, in init_hook
    set_module_tensor_to_device(module, name, self.execution_device, tied_params_map=self.tied_params_map)
  File "E:\Folder\gemma_env\Lib\site-packages\accelerate\utils\modeling.py", line 335, in set_module_tensor_to_device
    new_value = old_value.to(device, non_blocking=non_blocking)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Folder\gemma_env\Lib\site-packages\bitsandbytes\nn\modules.py", line 351, in to
    super().to(device=device, dtype=dtype, non_blocking=non_blocking),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Folder\gemma_env\Lib\site-packages\bitsandbytes\nn\modules.py", line 401, in __torch_function__
    return super().__torch_function__(func, types, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 1 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 3.36 GiB is allocated by PyTorch, and 92.14 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2026-04-25 20:31:14,676 | Worker (17564) | ERROR | Worker returned error: Worker error: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 1 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 3.36 GiB is allocated by PyTorch, and 92.14 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
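The two messages in the log refer to different devices, which is easy to miss when reading it as one block. The sketch below pulls the numbers out of the two relevant lines; the regular expressions are written for these exact messages and are an assumption, not a general parser for accelerate or PyTorch logs.

```python
import re

GIB = 1024 ** 3

# The two lines of interest, copied from the log above.
allocation_line = "0: 5668601858 bytes required"
oom_line = (
    "CUDA out of memory. Tried to allocate 20.00 MiB. "
    "GPU 1 has a total capacity of 4.00 GiB of which 0 bytes is free."
)

# INFO message: which device could not take any module, and how much it needed.
required = re.search(r"^(\d+): (\d+) bytes required", allocation_line)
device_index = int(required.group(1))
required_gib = int(required.group(2)) / GIB

# OOM message: which GPU actually ran out of memory, and its total capacity.
oom = re.search(r"GPU (\d+) has a total capacity of ([\d.]+) GiB", oom_line)
oom_gpu = int(oom.group(1))
capacity_gib = float(oom.group(2))

print(f"device {device_index} needed at least {required_gib:.2f} GiB")
print(f"OOM raised on GPU {oom_gpu} ({capacity_gib:.2f} GiB total)")
```

Read this way, the INFO line says device 0 was short by roughly 5.28 GiB for that allocation attempt, while the out-of-memory error itself was raised on GPU 1, the 4.00 GiB card.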
