{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreifgzzmgnvpzynn7j4fqaxtanks525vsxflz27gg433zxq7bsb2lea",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mez7baedim42"
},
"path": "/t/gemma-3-12b-4-bit-quantization-failing-ignored-in-transformers-v5-1-0-gemma3forconditionalgeneration/173278#post_6",
"publishedAt": "2026-02-16T14:20:20.000Z",
"site": "https://discuss.huggingface.co",
"textContent": "Hi John,\n\nI’ve run the diagnostics on v5.2.0.dev0 to check why the memory behavior has shifted compared to v4. I’ve focused on the device mapping and the materialization logs you were interested in.\n\n 1. Confirming Accelerate Engagement\n\n\n\nIt appears that device_map=“auto” is not being honored in the same way as previous versions.\n\n * v5 Observation: hasattr(model, “hf_device_map”) returns False. The model seems to be loading without a dispatch dictionary.\n\n * v4 Comparison: The same setup in v4.x properly returns hf_device_map: {‘’: 0}.\n\n * Result: Without the device map, the model appears to be “homeless” during the load, materializing in Shared System Memory (RAM) rather than being streamed directly to the GPU.\n\n\n\n 2. Proof of Full-Precision Materialization\n\n\n\nThe logs show a fundamental shift in how weights are handled.\n\n * The v4 Log (Streaming): Loading checkpoint shards: 100%|██████████| 5/5 [01:10<00:00, 14.10s/it]\nIn v4, weights are streamed as shards, and Shared GPU memory remains flat at 0.2 GB.\n\n * The v5 Log (Materializing): Loading weights: 100%|██████████| 1065/1065 [01:03<00:00, 16.78it/s, Materializing param=…]\nIn v5, the library logs “Materializing” for every individual parameter. During this phase, System RAM and Shared GPU Memory spike to 24GB+.\n\n\n\n 3. “Quantize Too Late” Evidence\n\n\n\nEven though model.get_memory_footprint() eventually reports 7.62 GB, the physical impact on the system during and immediately after the load suggests a “full-precision copy retained” scenario:\n\n * The Mismatch: VRAM/Shared commitment sits at ~24GB until a manual “nudge” (attribute access + gc.collect()) is performed.\n\n * Tensor Inspection: Printing the top parameters by numel() shows large bfloat16 tensors present in memory during the materialization phase, rather than the intended 4-bit weights.\n\n\n\n\nConclusion:\nIt seems that in the v5 refactor for Gemma 3, the quantization handshake is being bypassed or delayed. Instead of “Quantize-on-load,” the model is performing a full 16-bit materialization in System RAM before attempting to shrink the weights.\n\nI’ve attached the Task Manager captures showing the v5 spike for your review.\n\n",
"title": "Gemma 3 12B: 4-bit Quantization failing/ignored in Transformers v5.1.0 (Gemma3ForConditionalGeneration)"
}