{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiftycxoubqlhvwase7wahyy6awgzofwkcdiauoto2t2o6q6yqo5yi",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mkgqzl5grns2"
  },
  "path": "/t/cpu-offloading-error-scenario/175522#post_11",
  "publishedAt": "2026-04-26T23:13:11.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "John6666’s earlier issue-draft post in this thread",
    "Transformers #43873 — offloading not working as expected with quantization",
    "PEFT #3169 — LoRA + BnB INT8 + CPU offload wrong device",
    "PEFT PR #3181 — normalize output device for CPU-offloaded BnB layers",
    "Transformers #45482 — Gemma4 cross-device CPU offload errors",
    "Transformers #43872 — bitsandbytes incompatibility: Int8Params.__new__() got unexpected _is_hf_initialized",
    "Accelerate PR #3976 — Fix _is_hf_initialized attribute",
    "PEFT #3129 — Add support for Gemma4ClippableLinear / Gemma 4 QLoRA fails",
    "Ayodele01/gemma-4-E2B-Gemini-3.1-Pro-Reasoning-Distill",
    "PEFT #3129",
    "PEFT LoRA API docs — target_modules",
    "unsloth/gemma-4-E2B-it",
    "fulvian/gemma-4-e2b-medical-qlora-adapter",
    "PEFT #2321 — 4-bit Linear merge warning / different generations",
    "Transformers bitsandbytes quantization docs",
    "HF Forum — Do I need to dequantization before merging the QLoRA?",
    "PEFT #3169",
    "PEFT PR #3181",
    "Original HF Forum thread: CPU offloading error scenario",
    "Earlier John6666-style issue draft / triage post",
    "HF Forum QLoRA merge / dequantization discussion",
    "PEFT #3129 — Add support for Gemma4ClippableLinear",
    "Transformers #43872 — _is_hf_initialized / Int8Params incompatibility",
    "Transformers PR #45347 — Gemma4 device_map auto fix",
    "Transformers PR #45312 — Gemma4 KV-state sharing/cache fix",
    "Base model: unsloth/gemma-4-E2B-it",
    "Parity adapter: fulvian/gemma-4-e2b-medical-qlora-adapter",
    "Broad-target adapter: Ayodele01/gemma-4-E2B-Gemini-3.1-Pro-Reasoning-Distill",
    "Optional clean split-check adapter: welyjesch/filipino_Gemma4_E2B_FT_lora"
  ],
  "textContent": "There were all sorts of issues with various libraries, and things got pretty tangled up, so I decided to run some tests here first to sort them out:\n\n* * *\n\nI did a bit more validation and I think it is useful to keep the current state split into separate buckets. This follows the same triage style as John6666’s earlier issue-draft post in this thread, but adds fresh local validation for the merge/parity bucket.\n\n## Short version\n\nI would currently separate the situation into these issue families:\n\n  1. **Primary original issue from this thread**\nPEFT adapter loading on an already CPU/GPU-dispatched bitsandbytes 4-bit Gemma 4 base still looks like its own split/offload problem.\n\n  2. **Already fixed or release-improved pieces**\nSome `_is_hf_initialized` / bnb parameter-moving behavior appears fixed or improved upstream. Gemma 4 `device_map=\"auto\"` support also appears improved in recent Transformers releases.\n\n  3. **Open PEFT target-module issue**\n`Gemma4ClippableLinear` is still a real PEFT target-module blocker for broad-target Gemma 4 adapters.\n\n  4. **Separate local finding: direct 4-bit merge parity**\nDirect `merge_and_unload()` into a bnb 4-bit base still did not reproduce adapter-loaded output in my latest local validation. Reloading the base in bf16, loading the same adapter, and merging there still restored output parity.\n\n\n\n\nSo I would not collapse all of these into one GitHub issue.\n\n* * *\n\n## 1. 
Primary original issue: split/offloaded bnb 4-bit Gemma 4 + PEFT adapter load\n\nI still think the primary issue from this thread is best described as:\n\n\n    PEFT adapter loading fails on an already CPU/GPU-dispatched bitsandbytes 4-bit Gemma 4 base model.\n\n\nThe important contrast is:\n\n\n    all-GPU bnb 4-bit + PEFT:\n      works / can work\n\n    CPU/GPU split-dispatched bnb 4-bit + PEFT:\n      fails in offload / dispatch / hook / quant-state paths\n\n\nI would not file this as simply:\n\n\n    CPU offload is broken\n\n\nor:\n\n\n    PEFT + bitsandbytes is broken\n\n\nThose are too broad. The all-GPU path can work.\n\nA better primary issue title, if this is filed, would be something like:\n\n\n    PeftModel.from_pretrained fails on CPU/GPU-dispatched bitsandbytes 4-bit Gemma4 during adapter load / dispatch hooks\n\n\nI would probably file this first under `huggingface/transformers`, while cross-linking PEFT, Accelerate, and bitsandbytes, because the failure crosses model integration, `device_map`/offload behavior, adapter loading, dispatch hooks, and bnb quantized state.\n\nRelated public trackers that seem relevant but not identical:\n\n  * Transformers #43873 — offloading not working as expected with quantization\n  * PEFT #3169 — LoRA + BnB INT8 + CPU offload wrong device\n  * PEFT PR #3181 — normalize output device for CPU-offloaded BnB layers\n  * Transformers #45482 — Gemma4 cross-device CPU offload errors\n\n\n\n* * *\n\n## 2. 
`_is_hf_initialized` looks partly fixed, but it is not the whole story\n\nThere is a closed Transformers issue for the `_is_hf_initialized` family:\n\n  * Transformers #43872 — bitsandbytes incompatibility: Int8Params.__new__() got unexpected _is_hf_initialized\n\n\n\nThere is also a merged Accelerate PR:\n\n  * Accelerate PR #3976 — Fix _is_hf_initialized attribute\n\n\n\nThat PR is described as fixing issues when trying to move weights with bnb.\n\nSo I would classify `_is_hf_initialized` as:\n\n\n    fixed_or_release_improved_subproblem\n\n\nBut I would not say the original Forum issue is solved just because that subpath improved. The split/offload path still has other failure modes, especially meta tensor / dispatch / quant-state / cross-device behavior.\n\n* * *\n\n## 3. `Gemma4ClippableLinear` is still a separate current PEFT blocker\n\nThere is already an open PEFT issue for this:\n\n  * PEFT #3129 — Add support for Gemma4ClippableLinear / Gemma 4 QLoRA fails\n\n\n\nI re-checked this in a v3.8 local validation run with current packages:\n\npackage | version\n---|---\ntorch | `2.10.0+cu128`\ntransformers | `5.6.2`\naccelerate | `1.13.0`\npeft | `0.19.1`\nbitsandbytes | `0.49.2`\nGPU | `Tesla T4`\n\nThe broad-target adapter used for this check was:\n\n  * Ayodele01/gemma-4-E2B-Gemini-3.1-Pro-Reasoning-Distill\n\n\n\nThe target scan found:\n\nitem | count\n---|---\n`Gemma4ClippableLinear` hits | `148`\n`Linear4bit` hits | `205`\ntotal target matches | `353`\n\nAdapter load still failed with:\n\n\n    GEMMA4_CLIPPABLE_LINEAR_UNSUPPORTED\n\n\nSo the current classification from this lane is:\n\n\n    GEMMA4_CLIPPABLE_LINEAR_STILL_BLOCKS\n\n\nThis should stay separate from the split/offload issue and also separate from merge-output parity. It is an adapter-target / module-type compatibility issue. 
Broad targets such as `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, and `down_proj` can hit Gemma 4 wrapper modules that current PEFT does not accept.\n\nFor reporting purposes, I would use PEFT #3129 as the main tracker and not create a duplicate unless maintainers want a separate report with scanner output.\n\nReference docs for why broad `target_modules` matter:\n\n  * PEFT LoRA API docs — target_modules\n\n\n\n* * *\n\n## 4. New local validation: direct bnb 4-bit merge still diverges\n\nThis is not the original CPU-offload issue. It is a separate PEFT / bitsandbytes / QLoRA workflow finding.\n\nI re-ran the local parity check in v3.8 with:\n\n  * base: unsloth/gemma-4-E2B-it\n  * adapter: fulvian/gemma-4-e2b-medical-qlora-adapter\n\n\n\n### Direct 4-bit merge lane\n\n\n    4-bit base + adapter-loaded inference:\n      adapter output differs from base output\n\n    direct merge_and_unload() into bnb 4-bit base:\n      merged output does not match adapter-loaded output\n\n\nThe v3.8 classification was:\n\n\n    DIRECT_4BIT_MERGE_STILL_DIVERGES\n\n\nPrompt-level summary:\n\nclassification | count\n---|---\n`DIRECT_4BIT_MERGED_MATCHES_NEITHER` | `2`\n`DIRECT_4BIT_MERGED_MATCHES_BASE` | `1`\n\nPrompt table:\n\nprompt | classification | base != adapter | adapter == merged | base == merged\n---|---|---|---|---\n`p01_lora_short` | `DIRECT_4BIT_MERGED_MATCHES_NEITHER` | `True` | `False` | `False`\n`p02_hypertension_short` | `DIRECT_4BIT_MERGED_MATCHES_BASE` | `True` | `False` | `True`\n`p03_medical_tutor` | `DIRECT_4BIT_MERGED_MATCHES_NEITHER` | `True` | `False` | `False`\n\nThe direct 4-bit merge emitted `525` warnings:\n\n\n    Merge lora module to 4-bit linear may get different generations due to rounding errors.\n\n\nInterpretation:\n\n  * the adapter is active, because adapter-loaded output differs from the base output;\n  * direct merge into the bnb 4-bit base does not reproduce adapter-loaded output;\n  * in this run, direct merged output was 
base-like for one prompt and a third output for two prompts.\n\n\n\n### Fresh bf16 base merge lane\n\nThen I used a fresh non-quantized bf16 base:\n\n\n    import torch\n    from peft import PeftModel\n    from transformers import AutoModelForCausalLM\n\n    # BASE_MODEL_ID and ADAPTER_ID are the base and adapter repo IDs listed above.\n    # Load the base unquantized in bf16, entirely on GPU 0.\n    bf16_base = AutoModelForCausalLM.from_pretrained(\n        BASE_MODEL_ID,\n        device_map={\"\": 0},\n        torch_dtype=torch.bfloat16,\n    )\n\n    # Attach the adapter, then fold its weights into the bf16 base.\n    bf16_peft = PeftModel.from_pretrained(bf16_base, ADAPTER_ID)\n    merged_bf16 = bf16_peft.merge_and_unload()\n\n\nThe v3.8 classification was:\n\n\n    FRESH_BF16_MERGE_STILL_PASSES\n\n\nPrompt-level summary:\n\nclassification | count\n---|---\n`BF16_MERGED_MATCHES_BF16_ADAPTER` | `3`\n\nPrompt table:\n\nprompt | classification | bf16 adapter == bf16 merged\n---|---|---\n`p01_lora_short` | `BF16_MERGED_MATCHES_BF16_ADAPTER` | `True`\n`p02_hypertension_short` | `BF16_MERGED_MATCHES_BF16_ADAPTER` | `True`\n`p03_medical_tutor` | `BF16_MERGED_MATCHES_BF16_ADAPTER` | `True`\n\nThe fresh bf16 merge emitted `0` merge warnings.\n\nInterpretation:\n\n\n    direct bnb 4-bit merge:\n      adapter-loaded output not reproduced\n\n    fresh bf16 base merge:\n      adapter-loaded output reproduced\n\n\nSo this looks less like “the adapter is bad” and more like a direct bnb 4-bit merge path issue.\n\nRelated issue:\n\n  * PEFT #2321 — 4-bit Linear merge warning / different generations\n\n\n\nThat issue is close because it tracks the warning, but this local validation adds:\n\n  * cross-prompt output divergence;\n  * fresh bf16 merge parity control;\n  * current PEFT / Transformers / bnb stack confirmation.\n\n\n\nI would file this only if maintainers want a separate PEFT issue. It should not be mixed into the original CPU/GPU split-offload issue.\n\nPossible title:\n\n\n    Direct merge_and_unload into bitsandbytes 4-bit Linear4bit does not reproduce adapter-loaded output; fresh bf16 merge restores parity\n\n\nRelevant docs:\n\n  * Transformers bitsandbytes quantization docs\n\n\n\n* * *\n\n## 5. 
In-place dequantize is not the path I would recommend\n\nI also tested the in-place dequantize path earlier:\n\n\n    peft_model.dequantize()\n    peft_model.merge_and_unload()\n\n\nIn that lane:\n\n\n    dequantize(): PASS\n    merge_and_unload(): FAIL\n\n    AttributeError: 'Parameter' object has no attribute 'quant_state'\n\n\nSo I would not recommend presenting in-place dequantize as the clean solution.\n\nThe safer path, based on the local result, is:\n\n\n    reload base in bf16/fp16\n    load adapter there\n    merge there\n    validate output parity\n\n\nThis is still resource-dependent. It worked on T4 for the small E2B validation, but that should not be generalized to all Gemma 4 sizes or longer generations.\n\nRelated Forum discussion:\n\n  * HF Forum — Do I need to dequantization before merging the QLoRA?\n\n\n\n* * *\n\n## 6. What I would file / not file\n\n### File or keep tracking\n\n#### A. Primary original split/offload issue\n\nTarget:\n\n  * `huggingface/transformers`\n\n\n\nSuggested title:\n\n\n    PeftModel.from_pretrained fails on CPU/GPU-dispatched bitsandbytes 4-bit Gemma4 during adapter load / dispatch hooks\n\n\nCross-link:\n\n  * `huggingface/peft`\n  * `huggingface/accelerate`\n  * `bitsandbytes-foundation/bitsandbytes`\n\n\n\n#### B. `Gemma4ClippableLinear`\n\nUse existing issue:\n\n  * PEFT #3129\n\n\n\nAdd scanner/preflight evidence there if useful.\n\n#### C. bnb CPU-offload wrong-device family\n\nUse existing issue / PR:\n\n  * PEFT #3169\n  * PEFT PR #3181\n\n\n\n#### D. 
direct 4-bit merge-output parity\n\nMaybe a separate PEFT issue, if maintainers prefer:\n\n\n    Direct merge_and_unload into bitsandbytes 4-bit Linear4bit does not reproduce adapter-loaded output; fresh bf16 merge restores parity\n\n\n### Do not file as-is\n\nI would avoid filing these broad claims:\n\n\n    CPU offload is broken\n    PEFT + bnb 4-bit is broken\n    QLoRA adapters are broken\n    Gemma 4 adapters all fail\n    merge_and_unload is generally broken\n\n\nThose claims are too broad and do not match the evidence.\n\n* * *\n\n## 7. Reference links\n\n### Original Forum context\n\n  * Original HF Forum thread: CPU offloading error scenario\n  * Earlier John6666-style issue draft / triage post\n  * HF Forum QLoRA merge / dequantization discussion\n\n\n\n### GitHub trackers\n\n  * PEFT #3129 — Add support for Gemma4ClippableLinear\n  * PEFT #3169 — LoRA + BnB INT8 + CPU offload wrong device\n  * PEFT PR #3181 — normalize output device for CPU-offloaded BnB layers\n  * PEFT #2321 — 4-bit Linear merge warning / different generations\n  * Transformers #43872 — _is_hf_initialized / Int8Params incompatibility\n  * Transformers #43873 — offloading not working as expected with quantization\n  * Transformers #45482 — Gemma4 cross-device CPU offload errors\n  * Transformers PR #45347 — Gemma4 device_map auto fix\n  * Transformers PR #45312 — Gemma4 KV-state sharing/cache fix\n  * Accelerate PR #3976 — Fix _is_hf_initialized attribute\n\n\n\n### Docs / model links\n\n  * PEFT LoRA API docs — target_modules\n  * Transformers bitsandbytes quantization docs\n  * Base model: unsloth/gemma-4-E2B-it\n  * Parity adapter: fulvian/gemma-4-e2b-medical-qlora-adapter\n  * Broad-target adapter: Ayodele01/gemma-4-E2B-Gemini-3.1-Pro-Reasoning-Distill\n  * Optional clean split-check adapter: welyjesch/filipino_Gemma4_E2B_FT_lora\n\n\n\n* * *\n\n## Bottom line\n\nMy current interpretation is:\n\n\n    The original thread is still best treated as a split/offload PEFT adapter-loading 
problem on already-dispatched bnb 4-bit Gemma 4 models.\n\n    Some related pieces have improved upstream, especially _is_hf_initialized and Gemma4 device_map support, but that does not fully close the split/offload issue.\n\n    Gemma4ClippableLinear remains a separate PEFT target-module issue.\n\n    Direct merge into bnb 4-bit is a separate merge-output parity issue: in local v3.8 validation it still diverged from adapter-loaded output, while fresh bf16 base merge restored parity.\n\n\n## Small clarification\n\nA note on the local validation labels above:\n\n“v3.8 local validation” is just my local validation-bundle label. It is not an upstream release version, and I do not mean to imply that the exact notebook/package structure is important.\n\nThe only evidence I intended to carry forward from that run is the high-level classification:\n\n  * direct bnb 4-bit merge did not reproduce adapter-loaded output;\n  * fresh bf16-base merge did reproduce adapter-loaded output;\n  * broad Gemma 4 targets still hit `Gemma4ClippableLinear` and failed as an unsupported PEFT target;\n  * the optional split-dispatch lane was **not** re-run in that validation.\n\n\n\nSo for the original thread issue, I would still treat the existing CPU/GPU offload traces here as the main evidence. The local merge/parity check is a separate bucket.\n\nIf maintainers want a repro, I can extract a minimal script for whichever bucket is most useful:\n\n  1. split/offload adapter-load failure;\n  2. `Gemma4ClippableLinear` target-module failure;\n  3. direct 4-bit merge-output parity divergence.\n\n",
  "title": "CPU offloading error scenario"
}