Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreihvn24vaga33xnlhlu44zej4mg74c5gewmficrlz7bkbvzwd3yw5q",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3meitdk7ne5c2"
  },
  "path": "/t/gemma-3-12b-4-bit-quantization-failing-ignored-in-transformers-v5-1-0-gemma3forconditionalgeneration/173278#post_1",
  "publishedAt": "2026-02-10T10:15:03.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "Hi everyone,\n\nI’m reporting a significant regression where 4-bit quantization is ignored for Gemma 3 12B after upgrading to Transformers v5.1.0. The model fails to load into VRAM and spills into Shared GPU Memory (System RAM), slowing inference from 7s to 50s.\n\nThe Evidence\n\n\n    - Device Map is Empty: Despite device_map=\"auto\", model.hf_device_map returns None.\n    - Memory Footprint: model.get_memory_footprint() reports 7.62 GB (suggesting it thinks it is quantized), but Windows Task Manager shows 24.2 GB in use.\n    - Init Error: Passing load_in_4bit=True directly results in:\n    \tTypeError: Gemma3ForConditionalGeneration.__init__() got an unexpected keyword argument 'load_in_4bit'\n\n\nSetup & Code\n\n\n    Hardware: RTX 3060 12GB (Windows 11)\n    Env A (Working): Transformers v4.57.3, bnb v0.48.2\n    Env B (Broken): Transformers v5.1.0, bnb v0.49.1\n\n\nThe identical loading code is used in both:\n\nPython\n\nimport torch\nfrom transformers import BitsAndBytesConfig, Gemma3ForConditionalGeneration\n\n# 4-bit NF4 quantization with double quantization and bfloat16 compute\nquant_config = BitsAndBytesConfig(\n    load_in_4bit=True,\n    bnb_4bit_quant_type=\"nf4\",\n    bnb_4bit_use_double_quant=True,\n    bnb_4bit_compute_dtype=torch.bfloat16,\n)\n\nmodel = Gemma3ForConditionalGeneration.from_pretrained(\n    \"google/gemma-3-12b\",\n    quantization_config=quant_config,\n    device_map=\"auto\",\n    torch_dtype=torch.bfloat16,\n    trust_remote_code=True,\n)\n\nQuestion\n\nIt seems Gemma3ForConditionalGeneration in v5.x is no longer passing the quantization_config to the underlying layers correctly on Windows. Has the initialization flow for Gemma 3 changed, or is this a known issue with the new v5 “quantization as a first-class citizen” refactor?",
  "title": "Gemma 3 12B: 4-bit Quantization failing/ignored in Transformers v5.1.0 (Gemma3ForConditionalGeneration)"
}