Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiedqbo445unyskcta4ls2aaw54mlytuxti2ekfxn6pvztuhmbdmc4",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mn75qporzdg2"
  },
  "path": "/t/gemma4-e4b-adaptors-fuse-after-training-how/176429#post_2",
  "publishedAt": "2026-06-01T02:21:13.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit",
    "PR #846",
    "issue #907",
    "Blaizzy/mlx-vlm releases",
    "deadbydawn101/gemma-4-E4B-mlx-4bit",
    "deadbydawn101/gemma-4-E4B-opus-reasoning-claude-code-lora",
    "deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-GGUF",
    "PR #846: fix alpha/rank scaling in LoRaLayer",
    "PR #935",
    "PR #893",
    "PR #1052",
    "mlx-lm issue #1210",
    "mlx-lm issue #1242",
    "PEFT LoRA developer guide",
    "Google Gemma docs",
    "Gemma releases"
  ],
  "textContent": "Oh… This looks like a fairly complex case, with several known sources of drift compounding:\n\n* * *\n\n## TL;DR\n\nI would not read this as “Gemma 4 E4B adapters cannot be fused”.\n\nA closer reading suggests something narrower:\n\n> Your training recipe is not obviously wrong.\n> The risky part is the **custom fuse/export path**, especially because it partially dequantizes an MLX quantized Gemma 4 E4B checkpoint, deletes some quantization metadata, writes a single `model.safetensors`, and removes the shard index.\n\nThere are nearby success examples. The closest one I found is deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit, which says it trained with `mlx_vlm.lora`, rank 8, alpha 16, and fused **378 LoRA pairs** into the base weights. But its merge path is different: it explicitly dequantizes and saves the result as a **BF16 3-shard safetensors** model.\n\nSo my current hypothesis is:\n\n> The LoRA math may be roughly right, but the output checkpoint is probably not a consistent Gemma 4 E4B MLX checkpoint.\n\n* * *\n\n## Why this case is probably tricky\n\nThis is sitting at the intersection of several moving parts:\n\nLayer | Why it matters here\n---|---\nGemma 4 E4B architecture | E2B/E4B have Gemma 4-specific projection/shared-KV/multimodal structure.\n`mlx-vlm` Gemma 4 support | Recent releases include several Gemma 4-specific fixes.\nLoRA scaling | `mlx-vlm` had a known `alpha` vs `alpha/rank` scaling issue fixed in PR #846.\nQuantized MLX checkpoint layout | `.weight`, `.scales`, `.biases`, shard index, and config must stay consistent.\nVLM vs text-only loader paths | `mlx_vlm` and `mlx_lm` can behave differently in practice.\nAdapter vs fused behavior | Adapter-loaded inference can work while fused checkpoints fail.\nServer vs direct CLI inference | `mlx_vlm server --adapter-path` has had an adapter-dropping cache issue: issue #907.\n\nThe `mlx-vlm` v0.5.0 release notes are also worth reading because they 
include multiple relevant fixes: Gemma 4 quantized per-layer projection loading, Gemma 4 audio fixes, LoRA `alpha/rank` scaling, Gemma 4 LoRA training fixes, etc. See Blaizzy/mlx-vlm releases.\n\n* * *\n\n## Nearby success example vs this case\n\nThe closest success example I found is:\n\n  * deadbydawn101/gemma-4-E4B-mlx-4bit\n  * deadbydawn101/gemma-4-E4B-opus-reasoning-claude-code-lora\n  * deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit\n  * GGUF follow-up: deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-GGUF\n\n\n\nThat example is not an official proof that all E4B LoRA fuse workflows are safe, but it is useful because it is very close.\n\nItem | Your case | Nearby success example\n---|---|---\nModel family | Gemma 4 E4B | Gemma 4 E4B\nRuntime family | MLX / `mlx-vlm` | MLX / `mlx-vlm` / `mlx_lm`\nTraining command family | `mlx_vlm.lora` | `mlx_vlm.lora`\nLoRA rank | 8 | 8\nLoRA alpha | 16 | 16\nLR | `1e-5` | `1e-5`\nTraining style | completions-only | SFT completions-only\nBase precision | local E4B 8-bit | E4B MLX 4-bit\nFuse count | unknown from the post | **378 LoRA pairs**\nMerge scale | `module.scale` | explicitly `alpha / rank`\nOutput format | single `model.safetensors`, index removed | **BF16 3-shard safetensors**\nQuant metadata handling | deletes selected `.scales` / `.biases` | dequantized merged checkpoint\nLoader path shown | `mlx_vlm.load(...)` | mostly `mlx_lm` text-generation examples\n\nThe most important difference is not 4-bit vs 8-bit by itself. The important difference is:\n\n> The success example appears to turn the merged model into a coherent BF16 checkpoint.\n>  Your script may be producing a mixed quantized/floating checkpoint.\n\n* * *\n\n## The biggest red flags in the custom fuse script\n\n### 1. 
`adapter_config.json` is overwritten\n\nThis part is risky:\n\n\n    with open(ADAPTER / \"adapter_config.json\", \"w\") as f:\n        json.dump({\"rank\": 8, \"alpha\": 16.0, \"dropout\": 0.0}, f)\n\n\nEven if those values are correct this time, the fuse script should not rewrite the adapter metadata. If the adapter config contains target module information, naming conventions, or version-specific metadata, this can silently destroy useful information.\n\nSafer:\n\n\n    print((ADAPTER / \"adapter_config.json\").read_text())\n\n\nDo not modify it during fuse.\n\n* * *\n\n### 2. The output may become mixed quantized/floating-point\n\nThis is the biggest issue:\n\n\n    all_w[name + \".weight\"] = w2\n    all_w.pop(name + \".scales\", None)\n    all_w.pop(name + \".biases\", None)\n\n\nFor each LoRA target layer, you are replacing the quantized layer with a dequantized/fused `.weight` and removing its `.scales` / `.biases`.\n\nBut unless you do the same coherently for the whole checkpoint and update the config accordingly, you can end up with something like:\n\nPart of model | Possible state after script\n---|---\nLoRA-target layers | floating-point `.weight`, no `.scales` / `.biases`\nnon-target quantized layers | still quantized `.weight` + `.scales` / `.biases`\nconfig | still copied from the original quantized model\nshard index | removed\noutput file layout | single `model.safetensors`\n\nThat is not obviously a valid MLX Gemma 4 E4B checkpoint layout.\n\nThe nearby fused example says it dequantized the result to BF16 and saved as 3-shard safetensors. That is a much cleaner contract.\n\n* * *\n\n### 3. 
The shard/index behavior differs from the success example\n\nYour script writes:\n\n\n    mx.save_safetensors(str(SALIDA / \"model.safetensors\"), all_w)\n\n    idx = SALIDA / \"model.safetensors.index.json\"\n    if idx.exists():\n        idx.unlink()\n\n\nThat might work for some small/simple models, but Gemma 4 E4B MLX checkpoints have enough architecture-specific structure that I would avoid this unless I knew the loader accepted exactly this layout.\n\nThe nearby success example says:\n\n> Result dequantized to bfloat16 and saved as 3-shard safetensors.\n\nSo I would try to reproduce that style instead of collapsing everything into one file.\n\n* * *\n\n### 4. `module.scale` must be checked explicitly\n\nYour script does:\n\n\n    lu = module.scale * (module.A @ module.B)\n\n\nThat is only safe if `module.scale == alpha / rank`.\n\nFor rank 8 and alpha 16, the expected value is:\n\n\n    alpha / rank = 16 / 8 = 2.0\n\n\nThis matters because `mlx-vlm` had a known scaling bug where LoRA used raw `alpha` instead of `alpha/rank`. See PR #846: fix alpha/rank scaling in LoRaLayer.\n\nWith rank 8 / alpha 16:\n\nScale used | Effective LoRA strength\n---|---\n`2.0` | expected standard LoRA scaling\n`16.0` | 8x too strong\n\nSo please print it:\n\n\n    for name, module in model.named_modules():\n        if isinstance(module, LoRaLayer):\n            print(name, \"scale=\", float(module.scale), \"A=\", module.A.shape, \"B=\", module.B.shape)\n            break\n\n\nIf this prints `16.0`, that is a serious problem.\n\n* * *\n\n## Known related drift points\n\nThis is why I think this is a compound drift case rather than one simple bug.\n\nArea | Link | Relevance\n---|---|---\nLoRA scaling in `mlx-vlm` | PR #846 | Fixes raw `alpha` vs `alpha/rank`. 
Directly relevant to rank 8 / alpha 16.\nGemma 4 quantized projection loading | PR #935 | Shows Gemma 4 quantized projection loading was recently touched.\nGemma 4 embedding scaling | PR #893 | Earlier Gemma 4 MLX conversion/embedding behavior was not completely stable.\nGemma 4 LoRA training NaN / freeze leak | PR #1052 | Important if training used image/audio branches or if adapter size is unexpectedly large.\n`mlx_vlm server` drops adapter after first request | issue #907 | Can make adapter testing look like base-model behavior if testing through server.\nGemma 4 checkpoint round-trip/shared-KV divergence | mlx-lm issue #1210 | Shows Gemma 4 checkpoint structure can diverge across MLX runtimes.\nGemma 4 E4B 4bit/8bit load drift | mlx-lm issue #1242 | Shows E4B quantized checkpoints can be sensitive to version/key expectations.\nGeneral PEFT LoRA guide | PEFT LoRA developer guide | Useful for standard LoRA merge mental model.\nGemma official docs | Google Gemma docs | Background on Gemma variants and tuning/deployment.\nGemma 4 release history | Gemma releases | Useful for tracking how recent Gemma 4 is.\n\n* * *\n\n## What I would test before changing more code\n\n### 1. 
Does the adapter work before fuse?\n\nThis is the most important split.\n\nRun direct CLI inference, not server-based inference:\n\n\n    python -m mlx_vlm.generate \\\n      --model /Users/hal9000/Desktop/AI/modelos/gemma_4_e4b_it_8bit \\\n      --adapter-path /Volumes/ssd./ssd_gemma4/adaptadores_v2 \\\n      --prompt \"Use a fixed validation prompt here.\" \\\n      --max-tokens 128 \\\n      --temperature 0.0\n\n\nAlso try the text path if the task is text-only:\n\n\n    mlx_lm.generate \\\n      --model /Users/hal9000/Desktop/AI/modelos/gemma_4_e4b_it_8bit \\\n      --adapter-path /Volumes/ssd./ssd_gemma4/adaptadores_v2 \\\n      --prompt \"Use a fixed validation prompt here.\" \\\n      --max-tokens 128 \\\n      --temp 0.0\n\n\nInterpretation:\n\nResult | Meaning\n---|---\nbase and adapter output are identical | adapter is not being applied, or target names/config are wrong\nadapter works but fused is base-like | fuse did not merge the important deltas\nadapter works but fused is garbage | scaling / dtype / quant metadata / checkpoint layout issue\nadapter is already garbage | training/data/template/NaN issue, not fuse issue\n\n* * *\n\n### 2. Print the number of fused LoRA layers\n\nThe nearby success example says **378 LoRA pairs**.\n\nYour script already has:\n\n\n    print(\" %d capas fusionadas\" % len(to_update))\n\n\nPlease report this number.\n\nExpected ballpark, if your target coverage matches the nearby E4B example:\n\n\n    len(to_update) ≈ 378\n\n\nIf it is much lower, the fuse script is not seeing all LoRA layers.\n\n* * *\n\n### 3. Print `module.scale`\n\nExpected:\n\n\n    rank = 8\n    alpha = 16\n    scale = 2.0\n\n\nMinimal check:\n\n\n    for name, module in model.named_modules():\n        if isinstance(module, LoRaLayer):\n            print(name, \"scale=\", float(module.scale))\n            break\n\n\nIf it prints `16.0`, then you are likely applying an 8x-too-large LoRA delta.\n\n* * *\n\n### 4. 
Check adapter size and contents\n\nThe nearby E4B adapter example reports an adapter size around 658 MB:\n\n  * deadbydawn101/gemma-4-E4B-opus-reasoning-claude-code-lora\n\n\n\nCheck yours:\n\n\n    du -h /Volumes/ssd./ssd_gemma4/adaptadores_v2/*\n\n\nThen inspect whether it is really only LoRA tensors or if other weights got saved too:\n\n\n    from pathlib import Path\n    import mlx.core as mx\n\n    adapter_dir = Path(\"/Volumes/ssd./ssd_gemma4/adaptadores_v2\")\n\n    for f in adapter_dir.glob(\"*.safetensors\"):\n        print(\"FILE\", f)\n        w = mx.load(str(f))\n        print(\"tensor count:\", len(w))\n\n        suspicious = []\n        for k in w:\n            if any(s in k for s in [\"audio\", \"vision\", \"embed_audio\", \"embed_vision\"]):\n                suspicious.append(k)\n\n        print(\"suspicious audio/vision/embed keys:\", len(suspicious))\n        for k in suspicious[:50]:\n            print(\" \", k, w[k].shape)\n\n\nWhy this matters: PR #1052 mentions a Gemma 4 LoRA training issue involving `vision` backward NaNs and an `audio_tower` freeze leak. If non-LoRA weights were saved into the adapter, a simple `LoRaLayer`-only fuse script may silently drop them.\n\n* * *\n\n### 5. Check for NaN/Inf in the adapter\n\n\n    from pathlib import Path\n    import mlx.core as mx\n\n    adapter_dir = Path(\"/Volumes/ssd./ssd_gemma4/adaptadores_v2\")\n\n    for f in adapter_dir.glob(\"*.safetensors\"):\n        w = mx.load(str(f))\n        bad = []\n        for k, v in w.items():\n            vf = v.astype(mx.float32)\n            if bool(mx.any(mx.isnan(vf)).item()) or bool(mx.any(mx.isinf(vf)).item()):\n                bad.append(k)\n\n        print(f, \"bad tensors:\", len(bad))\n        for k in bad[:20]:\n            print(\" \", k, w[k].shape)\n\n\nIf this finds NaN/Inf tensors, the adapter is already compromised before fusion.\n\n* * *\n\n### 6. 
Compare base vs fused key sets\n\nThis is where the mixed-checkpoint issue should become visible.\n\n\n    from pathlib import Path\n    import mlx.core as mx\n\n    BASE = Path(\"/Users/hal9000/Desktop/AI/modelos/gemma_4_e4b_it_8bit\")\n    FUSED = Path(\"/Users/hal9000/Desktop/AI/modelos/gemma_4_e4b_it_8bit_ssd_fused\")\n\n    def load_dir(p):\n        out = {}\n        for sf in sorted(Path(p).glob(\"*.safetensors\")):\n            out.update(mx.load(str(sf)))\n        return out\n\n    base = load_dir(BASE)\n    fused = load_dir(FUSED)\n\n    base_keys = set(base)\n    fused_keys = set(fused)\n\n    print(\"base only:\", len(base_keys - fused_keys))\n    for k in sorted(base_keys - fused_keys)[:100]:\n        print(\"BASE_ONLY\", k, base[k].shape)\n\n    print(\"fused only:\", len(fused_keys - base_keys))\n    for k in sorted(fused_keys - base_keys)[:100]:\n        print(\"FUSED_ONLY\", k, fused[k].shape)\n\n    scale_keys = [k for k in fused if k.endswith(\".scales\")]\n    bias_keys = [k for k in fused if k.endswith(\".biases\")]\n    print(\"fused .scales:\", len(scale_keys))\n    print(\"fused .biases:\", len(bias_keys))\n\n\nIf only some quantization metadata remains, the checkpoint is probably not coherent.\n\n* * *\n\n## What I would change in the fuse strategy\n\nI would not try to preserve the original 8-bit checkpoint format in the first pass.\n\nInstead, I would mimic the nearby success example:\n\n  1. Load the base model.\n  2. Load the adapter.\n  3. Dequantize the base weights.\n  4. Merge LoRA with `alpha / rank`.\n  5. Save a coherent BF16 checkpoint.\n  6. 
Only after that, optionally quantize again.\n\n\n\nConceptually:\n\n\n    quantized base + adapter\n            ↓\n    dequantized BF16 base\n            ↓\n    BF16 merged/fused model\n            ↓\n    optional re-quantization\n\n\nNot:\n\n\n    quantized base\n            ↓\n    replace only some layers with floating-point fused weights\n            ↓\n    delete selected .scales/.biases\n            ↓\n    single model.safetensors with original config\n\n\nThe latter is much more likely to break loader expectations.\n\n* * *\n\n## Practical recommendation\n\n### Short-term\n\nDo not fuse.\n\nUse the adapter directly:\n\n\n    python -m mlx_vlm.generate \\\n      --model /Users/hal9000/Desktop/AI/modelos/gemma_4_e4b_it_8bit \\\n      --adapter-path /Volumes/ssd./ssd_gemma4/adaptadores_v2 \\\n      --prompt \"your prompt\" \\\n      --max-tokens 256 \\\n      --temperature 0.0\n\n\nIf this works, your adapter is probably okay and the problem is mainly fuse/export.\n\n* * *\n\n### Medium-term\n\nBuild a BF16 fused model rather than an in-place-ish quantized fused model.\n\nThe closest public success example says it used:\n\n\n    W_merged = dequantize(W_base) + (A @ B).T × (alpha / rank)\n\n\nand saved the result as BF16 3-shard safetensors.\n\nSo I would aim for:\n\nRequirement | Target\n---|---\nLoRA pairs | near 378, if matching that E4B coverage\nscale | `alpha / rank = 2.0`\noutput dtype | BF16\nquant metadata | no half-removed mixed state\nfile layout | sharded safetensors if large\nconfig | consistent with BF16 model, not stale 8-bit quant config\n\n* * *\n\n### Long-term\n\nWait for, or request, an official `mlx-vlm` fuse/export path for Gemma 4 E4B VLM adapters.\n\nThere are enough Gemma 4-specific fixes around MLX loading/training/quantization that a hand-written fuse script is fragile. 
The relevant upstream surface is still moving.\n\n* * *\n\n## My final read\n\nI think your result is probably close, but the current fuse script is crossing too many contracts at once.\n\nThe training settings are not obviously wrong because a nearby E4B example uses similar settings:\n\nSetting | You | Nearby example\n---|---|---\n`mlx_vlm.lora` | yes | yes\nrank | 8 | 8\nalpha | 16 | 16\nLR | `1e-5` | `1e-5`\ncompletions-only | yes | yes\n\nThe main divergence is after training:\n\nFuse/export detail | Your script | Nearby example\n---|---|---\nbase precision | local 8-bit | MLX 4-bit\nmerge result | partially dequantized mixed state possible | BF16 merged checkpoint\nLoRA pair count | unknown | 378\nscale | `module.scale`, must verify | `alpha / rank`\noutput files | single `model.safetensors` | 3-shard safetensors\nquant metadata | selectively deleted | dequantized output\n\nSo I would debug this as:\n\n> adapter correctness first, then LoRA coverage/scaling, then checkpoint serialization.\n\nNot as:\n\n> Gemma 4 E4B is impossible to fuse.\n\nThe fastest useful data points to post back would be:\n\n\n    mlx-vlm version:\n    mlx-lm version:\n    mlx version:\n    adapter size:\n    dynamic adapter output differs from base? yes/no\n    len(to_update):\n    first few module.scale values:\n    NaN/Inf in adapter? yes/no\n    number of .scales/.biases left in fused checkpoint:\n    base-only/fused-only key count:\n\n\nThose numbers should make the failure class much clearer.",
  "title": "Gemma4-e4b adaptors fuse after training , how?"
}