External Publication

Gemma 3 12B: 4-bit Quantization failing/ignored in Transformers v5.1.0 (Gemma3ForConditionalGeneration)

Hugging Face Forums [Unofficial] February 16, 2026

Hi John,

I’ve run the diagnostics on v5.2.0.dev0 to check why the memory behavior has shifted compared to v4. I’ve focused on the device mapping and the materialization logs you were interested in.

Confirming Accelerate Engagement

It appears that device_map=“auto” is not being honored in the same way as previous versions.

v5 Observation: hasattr(model, “hf_device_map”) returns False. The model seems to be loading without a dispatch dictionary.
v4 Comparison: The same setup in v4.x properly returns hf_device_map: {‘’: 0}.
Result: Without the device map, the model appears to be “homeless” during the load, materializing in Shared System Memory (RAM) rather than being streamed directly to the GPU.

Proof of Full-Precision Materialization

The logs show a fundamental shift in how weights are handled.

The v4 Log (Streaming): Loading checkpoint shards: 100%|██████████| 5/5 [01:10<00:00, 14.10s/it] In v4, weights are streamed as shards, and Shared GPU memory remains flat at 0.2 GB.
The v5 Log (Materializing): Loading weights: 100%|██████████| 1065/1065 [01:03<00:00, 16.78it/s, Materializing param=…] In v5, the library logs “Materializing” for every individual parameter. During this phase, System RAM and Shared GPU Memory spike to 24GB+.

“Quantize Too Late” Evidence

Even though model.get_memory_footprint() eventually reports 7.62 GB, the physical impact on the system during and immediately after the load suggests a “full-precision copy retained” scenario:

The Mismatch: VRAM/Shared commitment sits at ~24GB until a manual “nudge” (attribute access + gc.collect()) is performed.
Tensor Inspection: Printing the top parameters by numel() shows large bfloat16 tensors present in memory during the materialization phase, rather than the intended 4-bit weights.

Conclusion: It seems that in the v5 refactor for Gemma 3, the quantization handshake is being bypassed or delayed. Instead of “Quantize-on-load,” the model is performing a full 16-bit materialization in System RAM before attempting to shrink the weights.

I’ve attached the Task Manager captures showing the v5 spike for your review.

Discussion in the ATmosphere