Gemma 3 12B: 4-bit Quantization failing/ignored in Transformers v5.1.0 (Gemma3ForConditionalGeneration)

Hugging Face Forums [Unofficial] February 10, 2026

Hi everyone,

I’m reporting a significant regression: after upgrading to Transformers v5.1.0, 4-bit quantization appears to be ignored for Gemma 3 12B. The model no longer fits in VRAM and spills into Shared GPU Memory (system RAM), slowing inference from 7 s to 50 s.

The Evidence

- Device Map is Empty: Despite device_map="auto", model.hf_device_map returns None.
- Memory Footprint: model.get_memory_footprint() reports 7.62 GB (suggesting it thinks it is quantized), but Windows Task Manager shows 24.2 GB in use.
- Init Error: Using load_in_4bit=True directly results in:
    TypeError: Gemma3ForConditionalGeneration.__init__() got an unexpected keyword argument 'load_in_4bit'
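As a rough sanity check on the memory numbers above, the expected weight size can be worked out by hand. This is only a back-of-the-envelope sketch: the parameter count is an assumption taken from the nominal "12B" in the model name, and it ignores activations, the KV cache, and any layers that are left unquantized.

```python
def weight_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate size of the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Assumption: nominal parameter count from the "12B" model name, not a measured value.
NOMINAL_PARAMS = 12e9

for label, bits in [("bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_gib(NOMINAL_PARAMS, bits):.1f} GiB")
```

Under that assumption, unquantized bf16 weights come to roughly 22 GiB and 4-bit weights to under 6 GiB. The 24.2 GB shown in Task Manager is much closer to the bf16 figure, which would be consistent with the weights not actually being stored in 4-bit.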

Setup & Code

Hardware: RTX 3060 12GB (Windows 11)
Env A (Working): Transformers v4.57.3, bnb v0.48.2
Env B (Broken): Transformers v5.1.0, bnb v0.49.1
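For anyone trying to reproduce this, a small helper like the following can record the installed versions in each environment. It is a sketch using only the standard library; the package list is an assumption (accelerate and torch are not named in the versions above, but device_map="auto" relies on accelerate).

```python
from importlib.metadata import PackageNotFoundError, version


def package_versions(names=("transformers", "bitsandbytes", "accelerate", "torch")):
    """Return {package: installed version, or None if it is not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


print(package_versions())
```

Running it in both Env A and Env B makes it easy to see whether anything besides transformers and bitsandbytes changed between them.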

Identical loading code used in both:

    import torch
    from transformers import BitsAndBytesConfig, Gemma3ForConditionalGeneration

    # 4-bit NF4 with double quantization; compute in bfloat16
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    model = Gemma3ForConditionalGeneration.from_pretrained(
        "google/gemma-3-12b",
        quantization_config=quant_config,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
    )

Question

It seems Gemma3ForConditionalGeneration in v5.x is no longer passing the quantization_config to the underlying layers correctly on Windows. Has the initialization flow for Gemma 3 changed, or is this a known issue with the new v5 “quantization as a first-class citizen” refactor?
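One way to test that hypothesis directly is to look at what the loaded model actually contains. The helper below is a hypothetical diagnostic, not part of Transformers: it counts module class names and parameter devices/dtypes on any object exposing PyTorch's named_modules() and named_parameters(). It assumes bitsandbytes 4-bit layers show up under the class name Linear4bit.

```python
from collections import Counter


def summarize_quantization(model):
    """Summarize layer classes and parameter placement of a loaded model.

    Hypothetical helper: relies only on named_modules() / named_parameters(),
    and matches layers by class name rather than importing bitsandbytes.
    """
    module_types = Counter(type(m).__name__ for _, m in model.named_modules())
    params = list(model.named_parameters())
    return {
        # bitsandbytes 4-bit layers (assumed class name: Linear4bit)
        "linear4bit_modules": module_types.get("Linear4bit", 0),
        # ordinary, unquantized linear layers
        "plain_linear_modules": module_types.get("Linear", 0),
        "param_devices": dict(Counter(str(p.device) for _, p in params)),
        "param_dtypes": dict(Counter(str(p.dtype) for _, p in params)),
    }


# Usage, after loading the model as in the post:
#   print(summarize_quantization(model))
```

If quantization were applied, I would expect a large Linear4bit count and most parameters on cuda:0. A count of zero, or many parameters on cpu, would point to the quantization config or the device map not being applied.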
