{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreide74pmml5k3dcgbx4ftp6xzvat4fag7mwhy2ytdgo5hskur2hvcy",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3ml7ksxmjxxk2"
  },
  "path": "/t/omgformer-open-source-parallel-masked-diffusion-lm-framework-v2-0-5/175799#post_1",
  "publishedAt": "2026-05-06T20:25:56.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Inception Labs’ Mercury",
    "omgformer · PyPI"
  ],
  "textContent": "# OMGFormer — Open-Source Parallel Masked Diffusion LM Framework (v2.0.5)\n\nHi everyone,\n\nI wanted to share a project that was just released on PyPI: **OMGFormer**, an open-source PyTorch framework for building and training parallel masked diffusion language models.\n\n* * *\n\n## What is it?\n\nOMGFormer implements the same class of architecture behind Inception Labs’ Mercury, the first commercial-scale diffusion LLM ($50M funded, 1100+ tokens/sec on H100). The key difference: OMGFormer is fully open-source, Apache 2.0, and lets you train your own model from scratch.\n\nInstead of generating tokens one at a time (autoregressive), it generates all tokens in parallel via iterative unmasking:\n\n\n    Step 0: \"Hello [MASK] [MASK] [MASK] [MASK]\"\n    Step 1: \"Hello world  [MASK] [MASK] [MASK]\"\n    Step 2: \"Hello world  how   are  [MASK]\"\n    Step 3: \"Hello world  how   are   you?\"\n\n\n\nA 256-token sequence takes 6–10 forward passes instead of 256. Self-Conditioning is intended to keep quality comparable at even fewer steps, though this is not yet benchmarked.\n\n* * *\n\n## What shipped (v2.0.5)\n\nThe project is very new (~3 days old, one developer) and has no benchmarks yet due to limited compute resources. 
But the codebase is surprisingly complete:\n\n**Core architecture (60 features):**\n\n  * GQA, MLA (DeepSeek-style), Sliding Window, Linear Attention\n\n  * AdaLN-Zero timestep conditioning (DiT-style)\n\n  * Self-Conditioning, Absorbing Diffusion, Remasking\n\n  * MoE: top-K, Expert Choice (Google), Soft MoE (Google Brain 2023), Shared Expert (DeepSeek)\n\n  * LoRA variants: standard, DoRA, QLoRA, rsLoRA, LoRA+\n\n  * Advanced: KV Cache, MTP head, Model Merging (SLERP/DARE/TIES), PPO/Reward head, GGUF export stub, RAG injector, Dynamic batching\n\n\n\n\n**`omg_data` (automated data pipeline):**\n\n\n    pipe = DataPipeline(language=\"tr\", task=\"chat\", size_gb=5, tokenizer=\"gpt2\")\n    dataset = pipe.build()  # finds → downloads → cleans → tokenizes automatically\n\n\n\nIt supports 15+ languages and 6 task types, with a full cleaning pipeline (dedup, HTML, URL, Unicode, language filter).\n\n**`omg_hybridomga` (unified training engine):**\n\n  * All 6 LoRA methods in one package\n\n  * Novel **OMGa** (OMG Adaptive LoRA): per-token learned gate with dual-rank adapters\n\n  * VRAM guard, OOM recovery, MorphicMemory (Markov allocation prediction + tensor reuse)\n\n  * SpectraOptimizer (FFT-domain adaptive AdamW), ResonanceScheduler (gradient-spectrum self-tuning LR)\n\n  * GradientHarmonics (wavelet noise injection), NeuralProfiler (tiny LSTM predicts OOM/explode risk)\n\n  * Unstoppable trainer: retries from checkpoint on any failure\n\n\n\n\n* * *\n\n## Current state\n\n
The architecture is solid and the code is well-written, but:\n\n  * No training benchmarks yet (developer has limited GPU access)\n\n  * Some stubs not fully implemented (Flash Attention 2 flag exists but falls back to SDPA)\n\n  * MoE not yet fully integrated into OMGConfig (listed for next release)\n\n  * No pretrained weights: you train from scratch\n\n\n\n\nThe developer is actively working on it and releases are moving fast.\n\n* * *\n\n## Installation\n\n\n    pip install omgformer           # core\n    pip install omg_data            # data pipeline\n    pip install omg_hybridomga      # training engine\n\n\n\nQuick start:\n\n\n    from omgformer import OMGConfig, OMGModel, MaskScheduler, ParallelDecoder\n\n    cfg     = OMGConfig.from_preset(\"omgformer-small\")  # ~87M params\n    model   = OMGModel(cfg)\n    sched   = MaskScheduler(steps=10, mask_token_id=cfg.mask_token_id, vocab_size=cfg.vocab_size)\n    decoder = ParallelDecoder(model, sched)\n\n\n\n  * PyPI: omgformer\n\n\n\n* * *\n\n## Looking for feedback\n\nSince there are no benchmark results yet, the community’s help would be very valuable. If anyone has spare compute and wants to run experiments (even small ones on `omgformer-tiny` or `omgformer-small`) and share results here, that would help validate (or challenge) the approach.\n\nSpecific things worth testing:\n\n  * Does loss converge normally on small datasets?\n\n  * How does generation quality compare to a similarly-sized autoregressive model at the same step budget?\n\n  * Any bugs in the data pipeline for non-English languages?\n\n\n\n\nHappy to discuss the architecture or the diffusion LM approach in general.",
  "title": "OMGFormer — Open-Source Parallel Masked Diffusion LM Framework (v2.0.5)"
}