Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreia5dnlgb536ollw6nzrb3447xt4gwmazdlpuvwu5ecmggz6js6ady",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mekwj2onq6h2"
  },
  "path": "/t/gemma-3-12b-4-bit-quantization-failing-ignored-in-transformers-v5-1-0-gemma3forconditionalgeneration/173278#post_4",
  "publishedAt": "2026-02-11T05:58:31.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Hugging Face",
    "GitHub",
    "GitHub",
    "Hugging Face",
    "Hugging Face",
    "Hugging Face",
    "Hugging Face",
    "GitHub",
    "Hugging Face",
    "Hugging Face",
    "GitHub",
    "Hugging Face",
    "Hugging Face",
    "Hugging Face"
  ],
  "textContent": "Probably a known behavior change plus a bug…?\n\n* * *\n\n## What changed in Transformers v5 (relevant to your symptoms)\n\nTransformers v5 introduced a **new weight-loading pipeline** (“dynamic weight loading / converter”) and explicitly moved toward **quantization being a first-class loading path** , not an afterthought applied once a full-precision model is already in memory. (Hugging Face)\n\nThat is the correct direction, but it also means that **the order of operations during`from_pretrained()` matters much more**: where tensors are _materialized_ (CPU vs GPU), when a quantization conversion runs, and when Accelerate dispatch hooks are attached.\n\n## Why your numbers look like “4-bit configured, BF16 actually loaded”\n\n### 1) The 24.2 GB footprint matches _BF16-ish_ weight residency\n\nGemma 3 12B is a multimodal model (`Gemma3ForConditionalGeneration`). Its BF16/FP16 weights are far above 12 GB, so on a 12 GB card the Windows driver will often spill into **Shared GPU Memory (system RAM)** instead of hard failing.\n\n### 2) `get_memory_footprint()` can look “4-bit sized” even if peak / resident memory was full precision\n\n`model.get_memory_footprint()` is not a reliable indicator of **peak allocation during load** (or of full-precision copies lingering due to allocator behavior / offload behavior). 
It’s common to see a “small” footprint while the OS-level counters reflect what actually got materialized and kept resident.\n\nThis exact mismatch is consistent with a v5 regression where tensors that are supposed to be quantized are **materialized on the target device first** and only then converted, which is “too late” to prevent the VRAM spike / spill.\n\n## The closest known regression: v5 materializes before quantizing (bitsandbytes 4-bit)\n\nThere is a highly relevant Transformers issue reporting a v5 regression: **bitsandbytes 4-bit is scheduled, but the loader still materializes tensors on GPU before the quantization op runs**, causing OOM or severe memory spikes. (GitHub)\n\nThe proposed fix in that issue is effectively:\n\n  * _If a parameter will be quantized (`mapping.quantization_operation is not None`), materialize it to CPU first, then quantize, then place it on GPU._\n\n\n\nThat is exactly the kind of ordering bug that would look like “quantization ignored” on Windows (because Windows can spill into shared memory rather than throwing OOM). (GitHub)\n\n## Why `model.hf_device_map is None` is a big red flag\n\nWith `device_map=\"auto\"`, Accelerate’s big-model dispatch normally computes a device map and stores it in `model.hf_device_map`. (Hugging Face)\n\nIf `hf_device_map` is `None`, it usually means one of these happened:\n\n  1. **Accelerate dispatch didn’t run** (missing/incompatible Accelerate, or a code path that bypasses dispatch).\n  2. The model was instantiated/loaded without the dispatch wrapper being attached (so no map is recorded).\n  3. A nonstandard load path bypassed the “big model inference” integration.\n\n\n\nGemma’s own model card explicitly notes installing Accelerate and demonstrates `device_map=\"auto\"` usage. 
(Hugging Face)\n\nSo your two “signals” line up with the same underlying theme: **in v5, the load/dispatch/convert ordering and integration points changed**, and your path appears to bypass or break part of that chain.\n\n## About the `load_in_4bit=True` `TypeError`\n\nThis is expected behavior: `load_in_4bit` is not a model `__init__` argument; it’s handled by the **`from_pretrained()` quantization integration** via `BitsAndBytesConfig`/`quantization_config`. The docs show the supported pattern is passing a config object to `from_pretrained()`. (Hugging Face)\n\nSo treat that error as “wrong API surface”, not the core regression.\n\n## Most likely explanation for your exact case\n\nPutting it together:\n\n  * **Transformers v5.1.0** uses the new dynamic loader more aggressively. (Hugging Face)\n  * In the **bnb 4-bit path**, there is at least one reported regression where tensors are **materialized on GPU before quantization**, causing a full-precision-sized residency spike. (GitHub)\n  * On **Windows + 12 GB VRAM**, that manifests as **Shared GPU Memory spill** and a 7s → 50s slowdown.\n  * Separately (but consistent with the same theme), your `hf_device_map=None` suggests **Accelerate dispatch didn’t successfully run/attach**, which further increases the chance that the loader isn’t following the expected “meta → dispatch → quantize → place” flow. (Hugging Face)\n\n\n\n## High-signal checks that will confirm which part is broken\n\nRun these in the broken env (v5.1.0):\n\n  1. **Confirm Accelerate is actually engaged**\n\n     * If `device_map=\"auto\"` is being honored, `hf_device_map` should be a dict. (Hugging Face)\n  2. **Prove whether full-precision tensors are being materialized**\n\n     * Print top-5 largest parameters by `numel()` and their `dtype`/`device`.\n     * If you see huge BF16 tensors on GPU/CPU alongside 4-bit wrappers, you’re seeing “quantize too late” or “full-precision copy retained”.\n  3. 
**Force failure instead of silent Windows spill**\n\n     * Pass `max_memory={0: \"11GiB\", \"cpu\": \"0GiB\"}`.\n     * If it still “loads” by spilling today, this should instead fail early; if it fails, it supports the “full precision got materialized” hypothesis.\n\n\n\n## Practical workarounds while waiting for an upstream fix\n\n  1. **Try a v5 nightly / a commit including the #43032 fix** (or apply that patch locally)\n\n     * The fix is small and directly targets your failure mode. (GitHub)\n  2. **Remove `trust_remote_code=True`**\n\n     * Gemma 3 is supported directly; the official usage snippets don’t require remote code. (Hugging Face)\n  3. **Use TorchAO int4 instead of bitsandbytes (v5 direction)**\n\n     * v5 explicitly integrates TorchAO as part of the “first-class quantization” effort. (Hugging Face)\n  4. **Pin back to your known-good combo**\n\n     * Transformers 4.57.3 + bnb 0.48.2 is a reasonable operational pin until the v5 loader/quantization ordering is confirmed fixed.\n\n\n\n## What to say in your report (framing that matches current upstream work)\n\n  * “This looks like a v5 dynamic weight loading / converter ordering issue: 4-bit quantization is configured, but full-precision tensors appear to be materialized first (Windows then spills into shared memory). This matches the regression described in Transformers #43032 (materialize-to-GPU before quantize). Also, `device_map=\"auto\"` doesn’t populate `hf_device_map`, suggesting Accelerate dispatch isn’t attaching or is bypassed in this path.”\n\n\n\nThat ties your symptoms to the specific v5 refactor points and a concrete upstream issue/patch. (Hugging Face)",
  "title": "Gemma 3 12B: 4-bit Quantization failing/ignored in Transformers v5.1.0 (Gemma3ForConditionalGeneration)"
}