Gemma 3 12B: 4-bit Quantization failing/ignored in Transformers v5.1.0 (Gemma3ForConditionalGeneration)
Probably a known behavior change plus a bug…?
What changed in Transformers v5 (relevant to your symptoms)
Transformers v5 introduced a new weight-loading pipeline (“dynamic weight loading / converter”) and explicitly moved toward quantization being a first-class loading path, not an afterthought applied once a full-precision model is already in memory. (Hugging Face)
That is the correct direction, but it also means that the order of operations during from_pretrained() matters much more: where tensors are materialized (CPU vs GPU), when a quantization conversion runs, and when Accelerate dispatch hooks are attached.
Why your numbers look like “4-bit configured, BF16 actually loaded”
1) The 24.2 GB footprint matches BF16-ish weight residency
Gemma 3 12B is a multimodal model (Gemma3ForConditionalGeneration). Its BF16/FP16 weights are far above 12 GB, so on a 12 GB card the Windows driver will often spill into Shared GPU Memory (system RAM) instead of hard failing.
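As a rough sanity check on those sizes, the arithmetic can be sketched as below. The 12e9 parameter count is a round-number assumption (the exact count, including the vision tower, is somewhat higher), and the 4-bit figure ignores quantization constants and any layers left unquantized.

```python
# Back-of-the-envelope weight sizes. N_PARAMS is an assumed round number,
# not the exact Gemma 3 12B parameter count.
def weight_gb(n_params: float, bits_per_param: float) -> float:
    """Return the weight size in decimal GB for a given precision."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 12e9  # assumption: roughly 12B parameters

bf16_gb = weight_gb(N_PARAMS, 16)
int4_gb = weight_gb(N_PARAMS, 4)

print(f"BF16 weights : ~{bf16_gb:.1f} GB")
print(f"4-bit weights: ~{int4_gb:.1f} GB (before quantization overhead)")
```

Under that assumption, BF16 weights land right around the reported 24.2 GB, while a working 4-bit load should be a fraction of that and fit in 12 GB of VRAM.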
2) get_memory_footprint() can look “4-bit sized” even if peak / resident memory was full precision
model.get_memory_footprint() is not a reliable indicator of peak allocation during load (or of full-precision copies lingering due to allocator behavior / offload behavior). It’s common to see a “small” footprint while the OS-level counters reflect what actually got materialized and kept resident.
This exact mismatch is consistent with a v5 regression where tensors that are supposed to be quantized are materialized on the target device first and only then converted, which is “too late” to prevent the VRAM spike / spill.
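One way to see the gap between the reported footprint and what the allocator actually handled is to put the numbers side by side. This is a minimal sketch: `memory_report` and `to_gib` are hypothetical helper names, and the PyTorch counters only show what the CUDA allocator requested, not how much of it the Windows driver placed in shared memory, so Task Manager is still needed for that part.

```python
def to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return n_bytes / 1024**3


def memory_report(model) -> dict:
    """Compare the model's self-reported footprint with CUDA allocator counters."""
    import torch  # imported lazily so to_gib() works without torch installed

    return {
        "footprint_gib": to_gib(model.get_memory_footprint()),
        "cuda_allocated_gib": to_gib(torch.cuda.memory_allocated()),
        "cuda_peak_gib": to_gib(torch.cuda.max_memory_allocated()),
        "cuda_reserved_gib": to_gib(torch.cuda.memory_reserved()),
    }
```

Calling `torch.cuda.reset_peak_memory_stats()` before `from_pretrained()` and `memory_report(model)` after it gives a peak figure for the load itself. A peak far above the footprint would be consistent with full-precision tensors having been materialized on the GPU.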
The closest known regression: v5 materializes before quantizing (bitsandbytes 4-bit)
There is a highly relevant Transformers issue reporting a v5 regression: bitsandbytes 4-bit is scheduled, but the loader still materializes tensors on GPU before the quantization op runs, causing OOM or severe memory spikes. (GitHub)
The proposed fix in that issue is effectively:
- If a parameter will be quantized (mapping.quantization_operation is not None), materialize it to CPU first, then quantize, then place it on GPU.
That is exactly the kind of ordering bug that would look like “quantization ignored” on Windows (because Windows can spill into shared memory rather than throwing OOM). (GitHub)
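To make the ordering argument concrete, here is a toy simulation in plain Python. It is not the Transformers loader: `peak_gpu_bytes` and its arguments are hypothetical, and it assumes each quantized tensor is exactly a quarter of its full-precision size.

```python
def peak_gpu_bytes(param_bytes, quantize_before_placement, ratio=4):
    """Toy model of peak GPU residency while loading parameters one by one.

    param_bytes: full-precision size of each parameter, in bytes.
    quantize_before_placement: True means CPU -> quantize -> GPU;
        False means GPU -> quantize in place (the "too late" ordering).
    ratio: assumed size reduction from quantization.
    """
    resident = 0
    peak = 0
    for size in param_bytes:
        quantized = size // ratio
        if quantize_before_placement:
            # Only the already-quantized tensor ever reaches the GPU.
            resident += quantized
            peak = max(peak, resident)
        else:
            # The full-precision tensor lands on the GPU first...
            resident += size
            peak = max(peak, resident)
            # ...and only then shrinks to its quantized size.
            resident -= size - quantized
    return peak


params = [400, 400, 400]
print("quantize first:", peak_gpu_bytes(params, True))
print("quantize late :", peak_gpu_bytes(params, False))
```

Even in this simplified model the late ordering has a higher peak. If full-precision copies are not released promptly, the gap grows toward the full BF16 size.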
Why model.hf_device_map is None is a big red flag
With device_map="auto", Accelerate’s big-model dispatch normally computes a device map and stores it in model.hf_device_map. (Hugging Face)
If hf_device_map is None, it usually means one of these happened:
- Accelerate dispatch didn’t run (missing/incompatible Accelerate, or a code path that bypasses dispatch).
- The model was instantiated/loaded without the dispatch wrapper being attached (so no map is recorded).
- A nonstandard load path bypassed the “big model inference” integration.
Gemma’s own model card explicitly notes installing Accelerate and demonstrates device_map="auto" usage. (Hugging Face)
So your two “signals” line up with the same underlying theme: in v5, the load/dispatch/convert ordering and integration points changed, and your path appears to bypass or break part of that chain.
About the load_in_4bit=True TypeError
This is expected behavior: load_in_4bit is not a model __init__ argument; it’s handled by the from_pretrained() quantization integration via BitsAndBytesConfig/quantization_config. The docs show the supported pattern is passing a config object to from_pretrained(). (Hugging Face)
So treat that error as “wrong API surface”, not the core regression.
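For reference, the supported pattern looks roughly like the sketch below. The model ID `google/gemma-3-12b-it` and the NF4/BF16 settings are assumptions, since the original post's exact arguments are not shown here, and `load_gemma3_4bit` is a hypothetical wrapper name.

```python
def load_gemma3_4bit(model_id: str = "google/gemma-3-12b-it"):
    """Load Gemma 3 with a bitsandbytes 4-bit config passed to from_pretrained()."""
    # Heavy imports are kept inside the function so that defining it is cheap.
    import torch
    from transformers import BitsAndBytesConfig, Gemma3ForConditionalGeneration

    # Quantization settings travel in a config object, not as model kwargs.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",              # assumed quant type
        bnb_4bit_compute_dtype=torch.bfloat16,  # assumed compute dtype
    )
    return Gemma3ForConditionalGeneration.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",
    )
```

Passing `load_in_4bit=True` directly to the model class skips this integration, which is why it surfaces as a TypeError rather than a quantized load.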
Most likely explanation for your exact case
Putting it together:
- Transformers v5.1.0 uses the new dynamic loader more aggressively. (Hugging Face)
- In the bnb 4-bit path, there is at least one reported regression where tensors are materialized on GPU before quantization, causing a full-precision-sized residency spike. (GitHub)
- On Windows + 12 GB VRAM, that manifests as Shared GPU Memory spill and a 7s → 50s slowdown.
- Separately (but consistent with the same theme), your hf_device_map=None suggests Accelerate dispatch didn’t successfully run/attach, which further increases the chance that the loader isn’t following the expected “meta → dispatch → quantize → place” flow. (Hugging Face)
High-signal checks that will confirm which part is broken
Run these in the broken env (v5.1.0):
Confirm Accelerate is actually engaged
- If device_map="auto" is being honored, hf_device_map should be a dict. (Hugging Face)
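A small helper for this check might look like the following. `describe_dispatch` is a hypothetical name; it only reads the `hf_device_map` attribute and makes no other assumptions about the model.

```python
def describe_dispatch(model) -> str:
    """Summarize whether Accelerate recorded a device map on the model."""
    device_map = getattr(model, "hf_device_map", None)
    if not isinstance(device_map, dict):
        return "hf_device_map is missing or None: dispatch did not record a map"
    # Collect the distinct devices that modules were assigned to.
    devices = sorted({str(d) for d in device_map.values()})
    return (
        f"hf_device_map has {len(device_map)} entries "
        f"on devices: {', '.join(devices)}"
    )
```

Running `print(describe_dispatch(model))` right after loading shows whether any modules were assigned to `cpu` or `disk`, which would also explain a slowdown.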
Prove whether full-precision tensors are being materialized
- Print top-5 largest parameters by numel() and their dtype/device.
- If you see huge BF16 tensors on GPU/CPU alongside 4-bit wrappers, you’re seeing “quantize too late” or “full-precision copy retained”.
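A minimal sketch of that inspection, assuming a standard PyTorch module with `named_parameters()`; `largest_parameters` is a hypothetical helper name.

```python
def largest_parameters(model, k: int = 5):
    """Return the k largest parameters as (name, numel, dtype, device) tuples."""
    rows = [
        (name, p.numel(), str(p.dtype), str(p.device))
        for name, p in model.named_parameters()
    ]
    # Sort by element count, largest first.
    return sorted(rows, key=lambda row: row[1], reverse=True)[:k]
```

Printing the result with `for row in largest_parameters(model): print(row)` is enough. Note that bitsandbytes 4-bit weights typically show up as packed `uint8` parameters, so their `numel()` reflects the packed storage rather than the logical shape; large `bfloat16` entries among the linear layers are the thing to look for.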
Force failure instead of silent Windows spill
- Pass max_memory={0: "11GiB", "cpu": "0GiB"}.
- If the load currently “succeeds” only by spilling, this cap should make it fail early instead; an early failure supports the “full precision got materialized” hypothesis.
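The cap can be built as ordinary keyword arguments and passed to `from_pretrained()`. `capped_load_kwargs` is a hypothetical helper, and GPU index 0 with an 11 GiB budget simply mirrors the values above.

```python
def capped_load_kwargs(gpu_index: int = 0, gpu_budget: str = "11GiB") -> dict:
    """Build from_pretrained() kwargs that forbid CPU offload and cap GPU use."""
    return {
        "device_map": "auto",
        "max_memory": {gpu_index: gpu_budget, "cpu": "0GiB"},
    }


kwargs = capped_load_kwargs()
print(kwargs)
# Intended use (not executed here):
#   Gemma3ForConditionalGeneration.from_pretrained(
#       model_id, quantization_config=bnb_config, **kwargs
#   )
```

How strictly the budget is enforced depends on the dispatch path actually running, so a load that ignores the cap is itself a useful data point.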
Practical workarounds while waiting for an upstream fix
Try a v5 nightly / a commit including the #43032 fix (or apply that patch locally)
- The fix is small and directly targets your failure mode. (GitHub)
Remove trust_remote_code=True
- Gemma 3 is supported directly; the official usage snippets don’t require remote code. (Hugging Face)
Use TorchAO int4 instead of bitsandbytes (v5 direction)
- v5 explicitly integrates TorchAO as part of the “first-class quantization” effort. (Hugging Face)
Pin back to your known-good combo
- Transformers 4.57.3 + bnb 0.48.2 is a reasonable operational pin until the v5 loader/quantization ordering is confirmed fixed.
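Whichever combination you settle on, it helps to confirm what is actually installed in the environment before and after pinning. This sketch uses only the standard library; the package list is an assumption about which distributions matter here.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(
    packages=("transformers", "accelerate", "bitsandbytes", "torch"),
) -> dict:
    """Return the installed version of each package, or 'not installed'."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = "not installed"
    return found


for name, ver in installed_versions().items():
    print(f"{name}: {ver}")
```

The same output is worth pasting into the bug report, since the regression appears to be version-specific.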
What to say in your report (framing that matches current upstream work)
- “This looks like a v5 dynamic weight loading / converter ordering issue: 4-bit quantization is configured, but full-precision tensors appear to be materialized first (Windows then spills into shared memory). This matches the regression described in Transformers #43032 (materialize-to-GPU before quantize). Also, device_map="auto" doesn’t populate hf_device_map, suggesting Accelerate dispatch isn’t attaching or is bypassed in this path.”
That ties your symptoms to the specific v5 refactor points and a concrete upstream issue/patch. (Hugging Face)