{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreide74pmml5k3dcgbx4ftp6xzvat4fag7mwhy2ytdgo5hskur2hvcy",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3ml7ksxmjxxk2"
},
"path": "/t/omgformer-open-source-parallel-masked-diffusion-lm-framework-v2-0-5/175799#post_1",
"publishedAt": "2026-05-06T20:25:56.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"Inception Labs’ Mercury",
"omgformer · PyPI"
],
"textContent": "# OMGFormer — Open-Source Parallel Masked Diffusion LM Framework (v2.0.5)\n\nHi everyone,\n\nI wanted to share a project that was just released on PyPI: **OMGFormer**, an open-source PyTorch framework for building and training parallel masked diffusion language models.\n\n* * *\n\n## What is it?\n\nOMGFormer implements the same class of architecture behind Inception Labs’ Mercury, the first commercial-scale diffusion LLM ($50M funded, 1100+ tokens/sec on H100). The key difference: OMGFormer is fully open-source (Apache 2.0) and lets you train your own model from scratch.\n\nInstead of generating tokens one at a time (autoregressive), it generates all tokens in parallel via iterative unmasking:\n\n\n Step 0: \"Hello [MASK] [MASK] [MASK] [MASK]\"\n Step 1: \"Hello world [MASK] [MASK] [MASK]\"\n Step 2: \"Hello world how are [MASK]\"\n Step 3: \"Hello world how are you?\"\n\n\n\nA 256-token sequence takes 6–10 forward passes instead of 256. Self-Conditioning is meant to keep quality comparable at even fewer steps, though that has not been benchmarked yet.\n\n* * *\n\n## What shipped (v2.0.5)\n\nThe project is very new (~3 days old, one developer) and has no benchmarks yet due to limited compute resources. 
But the codebase already covers a lot of ground:\n\n**Core architecture (60 features):**\n\n * GQA, MLA (DeepSeek-style), Sliding Window, Linear Attention\n\n * AdaLN-Zero timestep conditioning (DiT-style)\n\n * Self-Conditioning, Absorbing Diffusion, Remasking\n\n * MoE: top-K, Expert Choice (Google), Soft MoE (Google Brain 2023), Shared Expert (DeepSeek)\n\n * LoRA variants: standard, DoRA, QLoRA, rsLoRA, LoRA+\n\n * Advanced: KV Cache, MTP head, Model Merging (SLERP/DARE/TIES), PPO/Reward head, GGUF export stub, RAG injector, Dynamic batching\n\n\n\n\n**`omg_data` (automated data pipeline):**\n\n\n pipe = DataPipeline(language=\"tr\", task=\"chat\", size_gb=5, tokenizer=\"gpt2\")\n dataset = pipe.build() # finds → downloads → cleans → tokenizes automatically\n\n\n\nIt supports 15+ languages and 6 task types, with a full cleaning pipeline (dedup, HTML, URL, Unicode, language filter).\n\n**`omg_hybridomga` (unified training engine):**\n\n * All 6 LoRA methods in one package\n\n * Novel **OMGa** (OMG Adaptive LoRA): per-token learned gate with dual-rank adapters\n\n * VRAM guard, OOM recovery, MorphicMemory (Markov allocation prediction + tensor reuse)\n\n * SpectraOptimizer (FFT-domain adaptive AdamW), ResonanceScheduler (gradient-spectrum self-tuning LR)\n\n * GradientHarmonics (wavelet noise injection), NeuralProfiler (tiny LSTM predicts OOM/explode risk)\n\n * Unstoppable trainer: retries from checkpoint on any failure\n\n\n\n\n* * *\n\n## Current state\n\n
The architecture looks solid and the code is well written, but:\n\n * No training benchmarks yet (the developer has limited GPU access)\n\n * Some features are stubs that are not fully implemented (the Flash Attention 2 flag exists but falls back to SDPA)\n\n * MoE is not yet fully integrated into `OMGConfig` (listed for the next release)\n\n * No pretrained weights: you train from scratch\n\n\n\n\nThe developer is actively working on it, and releases are moving fast.\n\n* * *\n\n## Installation\n\n\n pip install omgformer # core\n pip install omg_data # data pipeline\n pip install omg_hybridomga # training engine\n\n\n\nQuick start:\n\n\n from omgformer import OMGConfig, OMGModel, MaskScheduler, ParallelDecoder\n\n cfg = OMGConfig.from_preset(\"omgformer-small\") # ~87M params\n model = OMGModel(cfg)\n sched = MaskScheduler(steps=10, mask_token_id=cfg.mask_token_id, vocab_size=cfg.vocab_size)\n decoder = ParallelDecoder(model, sched)\n\n\n\n * PyPI: omgformer\n\n\n\n* * *\n\n## Looking for feedback\n\nSince there are no benchmark results yet, the community’s help would be very valuable. If anyone has spare compute and wants to run experiments (even small ones on `omgformer-tiny` or `omgformer-small`) and share results here, that would help validate (or challenge) the approach.\n\nSpecific things worth testing:\n\n * Does the loss converge normally on small datasets?\n\n * How does generation quality compare to a similarly sized autoregressive model at the same step budget?\n\n * Are there any bugs in the data pipeline for non-English languages?\n\n\n\n\nHappy to discuss the architecture or the diffusion LM approach in general.",
"title": "OMGFormer — Open-Source Parallel Masked Diffusion LM Framework (v2.0.5)"
}