Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreidysgz3a4pef5mhsoigfuo6tby2x7tcdz3go4tptmw2gfxnndmfg4",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3miy63asac3p2"
  },
  "path": "/t/would-this-concept-model-work/175056#post_4",
  "publishedAt": "2026-04-08T04:23:40.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "arXiv"
  ],
  "textContent": "Looks like there might be a slight bug?\n\n* * *\n\nI reviewed the actual code path, not the README. I also sanity-checked a tiny forward/backward path locally.\n\nThe short verdict is:\n\n**This is a real model implementation, not a fake scaffold.**\n**It can plausibly train into a coherent prototype.**\n**I would not launch the 1B run unchanged.**\nThe main reasons are not “ternary is impossible” or “MDLM is wrong.” The core architecture is aligned with the literature: masked diffusion language models are viable, block diffusion is a real semi-autoregressive extension with KV caching, ternary-from-scratch has precedent, and BitNet a4.8-style hybrid activation handling is the right direction. The fragile zone is still low-bit attention/activation behavior, especially when stacked on diffusion. (arXiv)\n\n## Final judgment\n\nIf you asked me, “Would this codebase probably produce a coherent 1B masked-diffusion model if I spend the compute?”, my answer is:\n\n**Probably yes, after a few fixes.**\n\nIf you asked me, “Would this exact codebase, as-is, cleanly validate the whole concept and be easy to trust at 1B/40B?”, my answer is:\n\n**No. It has a solid core plus several correctness and interpretation issues.**\n\n## What is solid\n\nThese parts are good enough that I would keep them.\n\n### 1. The core modeling choice is valid\n\nThe model is a bidirectional denoiser with absorbing-state masking and a per-sample noise level `t`. That is the right family for MDLM-style training. MDLM specifically showed that simple masked discrete diffusion can be much stronger than older diffusion-for-text setups and can support efficient samplers. (arXiv)\n\n### 2. The ternary-weight implementation is conceptually sound\n\nThe code keeps latent full-precision weights and uses STE ternary quantization in the forward pass. That is the standard kind of construction you would expect from BitNet-style training. 
The overall idea of native ternary weights is supported by BitNet b1.58. (arXiv)\n\n### 3. The A8 → A4 schedule is the right instinct\n\nThis is one of the best choices in the repo. BitNet a4.8 is not “all 4-bit everywhere from the first step.” It is selective and hybrid. Your code is directionally aligned with that. (arXiv)\n\n### 4. The block sampler has the right basic idea\n\nYour block sampler uses committed context plus an ephemeral current block. That is a sensible prototype for block diffusion. The public block-diffusion work explicitly motivates arbitrary-length generation, KV caching, and parallel token sampling in exactly this general direction. (arXiv)\n\n## Must-fix before a 1B run\n\nThese are the items I would treat as hard blockers or near-blockers.\n\n### 1. `MaskDiffusionLoss` can return `NaN`\n\nThis is the most important correctness bug I found.\n\nThe loss sets all non-supervised positions to `ignore_index` and then calls `F.cross_entropy`. If there are **zero supervised positions** in a batch, PyTorch returns `NaN`. I verified this locally with a tiny smoke test.\n\nWhy this matters:\n\n  * with very long sequences, it is rare that **no positions** are masked,\n  * but it is still a real edge case,\n  * and thinking-token exclusion makes it easier for “all masked positions are excluded from loss” to happen on small or special batches.\n\n\n\nFix:\n\n  * before calling `cross_entropy`, check `if not mask_flat.any(): return logits.new_zeros(())`.\n\n\n\nWithout this fix, rare NaNs can poison long training runs.\n\n### 2. The variable-length curriculum is mostly canceled by the dataloader\n\nYour data-prep script creates variable-length chunks. 
But `StreamingJsonlDataset` re-tokenizes the stored `\"text\"` and appends everything into one token buffer, then emits fixed `max_length` chunks.\n\nSo end to end, the effective training stream is mostly **fixed-length re-chunked windows**, not the intended weighted length distribution.\n\nWhy this matters:\n\n  * your experiments become harder to interpret,\n  * your training is less like the intended curriculum than you think,\n  * if you believe shorter and longer contexts are both important, the current pipeline largely throws that away.\n\n\n\nFix:\n\n  * store tokenized chunks directly, or\n  * keep one JSONL line = one training example, and do not flatten the entire corpus back into a global rolling token buffer.\n\n\n\n### 3. `attention_mask` is built, then ignored\n\nThe collator returns `attention_mask`, but the training loop only uses `batch[\"input_ids\"]`. The model forward path also has no attention-mask input.\n\nToday this is partly hidden because the dataset mostly emits full-length chunks. But partial chunks still exist, and if you later restore true variable-length batching, this becomes a serious issue.\n\nWhy this matters:\n\n  * padded positions can enter the corruption process,\n  * padded positions can contribute to attention,\n  * pad token is set to `eos_token` if missing, so the model can learn from EOS-padding artifacts.\n\n\n\nFix:\n\n  * propagate `attention_mask` into masking and loss,\n  * exclude padded positions from `apply_mask`,\n  * exclude them from supervised loss,\n  * ideally add real attention masking if you want genuine variable-length batches.\n\n\n\n### 4. `BlockwiseDiffusionSampler.generate()` is wrong for `num_samples > 1`\n\nThis is a real logic bug.\n\nThe block sampler accumulates `all_generated` and `block_texts` from `block_ids[0]`, then uses that same shared buffer when returning results for all samples. 
So if `num_samples > 1`, the returned outputs are effectively copies of sample 0.\n\nFix:\n\n  * keep `all_generated` per sample, not once globally,\n  * keep `block_texts` per sample too.\n\n\n\nIf you only ever sample one output at a time, this does not hurt you. But it is still a bug.\n\n### 5. `generate_sample()` can sample special tokens and silently turn them into the last vocabulary token\n\nIn the training monitor sampler, you sample from the full logit tensor, not just the normal vocabulary slice, and then at the end clamp IDs into `[0, vocab_size - 1]`.\n\nThat means:\n\n  * if the model samples the mask token or think token,\n  * the code silently converts it to the last normal token ID.\n\n\n\nThis does not corrupt training directly. It corrupts your **qualitative monitoring** and makes samples less trustworthy.\n\nFix:\n\n  * slice logits to `:vocab_size` before sampling, like your other samplers already do.\n\n\n\n## Should-fix\n\nThese are not guaranteed failures, but they weaken the experiment.\n\n### 1. Thinking tokens are under-supervised\n\nIn code terms, think positions are excluded from the direct supervised loss. They only receive gradient indirectly through answer quality.\n\nThat can work as an experimental latent-variable trick. But it is weak supervision.\n\nMy expectation:\n\n  * maybe helpful,\n  * maybe ignored,\n  * maybe unstable if you over-interpret it as “reasoning.”\n\n\n\nFor a first serious 1B run, I would either:\n\n  * disable thinking tokens, or\n  * keep only the simplest global-prefix version and remove per-block thinking.\n\n\n\n### 2. Per-block thinking at inference does not match training\n\nTraining prepends one think prefix to the sequence.\nBlock sampling can prepend think tokens **before every block**.\n\nThat is a train-test mismatch.\n\nIt may still “work” in the loose sense that the model produces something. But if the feature matters at all, this mismatch makes the result harder to trust.\n\n### 3. 
The KV quantization scheme is simpler than the strongest public guidance\n\nYour active cache path uses a simple per-head absmax quantizer for both keys and values.\n\nKIVI’s main conclusion is that keys and values do **not** want the same treatment: keys work better with per-channel quantization, values with per-token quantization. So your cache may still work, but it is not using the best-supported asymmetry yet. (arXiv)\n\n### 4. The full-sequence denoiser’s KV cache buys almost nothing\n\nIn the non-block sampler, you reset the KV cache every denoising step and re-run the full sequence. That is logically correct because the mask pattern changes every step, but it also means the cache is not giving you a real inference win there.\n\nThat is not a bug. It just means:\n\n  * KV cache matters mainly for your **block sampler**,\n  * not for the full denoiser.\n\n\n\n### 5. The default run is not actually 40B tokens\n\nThe training config computes to about **30.1B tokens**, not 40B.\n\nThat is not a correctness problem. It is a planning problem. If you want a 40B-token run, your step count needs to change.\n\n## Fine for now\n\nThese are not where I would spend time first.\n\n### 1. No causal mask in attention\n\nCorrect for diffusion.\n\n### 2. Latent full-precision weights with STE\n\nStandard for this kind of research implementation.\n\n### 3. MoE code\n\nNot the current concern because it is off by default.\n\n### 4. 
RoPE offset handling in block mode\n\nDirectionally correct and useful for committed-context generation.\n\n## What I think will happen if you run it unchanged\n\nMost likely:\n\n  * it **does train**,\n  * it gives coherent outputs,\n  * the ternary core is **not** the main source of trouble,\n  * the A8 → A4 schedule probably helps rather than hurts,\n  * but the final result is harder to interpret because the data pipeline and thinking-token behavior are not clean.\n\n\n\nThe most likely disappointments are:\n\n  * weaker-than-expected gains from low-bit activations,\n  * unclear value from thinking tokens,\n  * KV-cache quality below what you would hope from the best papers,\n  * and results that are noisier than they need to be because of the data path and edge-case loss behavior. Recent work on 4-bit attention explicitly says attention is the main obstacle because of heavy-tailed activations and precision-mismatch instability, which matches where I would expect your run to be most fragile. (arXiv)\n\n\n\n## What I think will happen if you fix the blockers\n\nThen I think the code has a **real chance** to produce a meaningful 1B prototype.\n\nNot “state of the art.”\nNot “obviously better than a same-budget AR baseline.”\nBut a real prototype that demonstrates:\n\n  * masked diffusion training,\n  * ternary-weight viability,\n  * staged hybrid activation quantization,\n  * and blockwise semi-autoregressive generation.\n\n\n\nThat is a legitimate target. MDLM supports the masked-diffusion backbone, block diffusion supports the blockwise generation idea, BitNet b1.58 supports native ternary weights, and BitNet a4.8 supports the general hybrid A8/A4 direction plus low-bit KV as an inference concept. (arXiv)\n\n## My recommendation\n\nBefore spending serious compute on 1B, I would do exactly this:\n\n  1. Fix `MaskDiffusionLoss` for the zero-supervised-token case.\n\n  2. Fix multi-sample block generation.\n\n  3. 
Fix the qualitative sampler so it cannot turn special tokens into fake normal tokens.\n\n  4. Decide whether you want:\n\n     * real variable-length training, then preserve it end to end and use masks properly, or\n     * fixed-length training, then simplify the pipeline and stop pretending otherwise.\n  5. Disable thinking tokens for the first real 1B run.\n\n  6. Treat the current KV cache as a prototype cache, not a final serving recipe.\n\n\n\n\n## Bottom line\n\nStrict code-review answer:\n\n**The codebase is structurally real and probably trainable.**\n**It is not clean enough yet for an unquestioned 1B run.**\n**The main risks are correctness and experiment-interpretation risks, not “the whole concept is impossible.”**\n\n* * *\n\nHere is the ranked patch plan I would use for your repo.\n\nThe ordering is based on one question only: **what most reduces the chance of wasting a 1B / 40B-token run**. The literature says your backbone choice is plausible: MDLM-style masked diffusion is a real language-modeling family, block diffusion is a real semi-autoregressive extension with KV reuse, ternary-from-scratch has precedent, and BitNet a4.8 supports the general idea of staged hybrid low-bit activations. The main fragility zone remains **low-bit attention/activation behavior**, not the existence of the overall concept. (arXiv)\n\n# Tier 0: patch before any serious 1B run\n\n## 1. Make `MaskDiffusionLoss` safe when there are zero supervised positions\n\n**Files:** `bitdiffusion/diffusion.py`\n\n### Why this is first\n\nThis is the only issue I found that can directly and silently poison training. 
I locally verified that your current loss returns `NaN` when every position is ignored.\n\nIn your code, the loss:\n\n  * flattens logits and targets,\n  * masks out non-supervised positions,\n  * writes `ignore_index` into all other targets,\n  * then calls `F.cross_entropy(...)`.\n\n\n\nIf **all** positions are ignored, PyTorch returns `NaN`.\n\n### Patch\n\nAdd a guard right before `cross_entropy`:\n\n\n    if not mask_flat.any():\n        return logits.new_zeros(())\n\n\n### Why it matters for your concept\n\nDiffusion training already has noisier supervision than plain next-token prediction because the supervised set changes each batch. Block diffusion adds more schedule sensitivity, and low-bit training leaves less numerical slack. A rare `NaN` is much more dangerous in this regime than in a boring baseline. The block-diffusion paper explicitly highlights variance control and noise scheduling as first-class engineering concerns, and Attn-QAT shows that low-bit attention is already the main stability bottleneck. (arXiv)\n\n### Minimal test\n\n  * unit test with `is_masked = torch.zeros(...)`\n  * assert loss is finite and exactly zero\n\n\n\n* * *\n\n## 2. Fix multi-sample block generation\n\n**Files:** `bitdiffusion/sample.py`\n\n### What is wrong\n\nIn `BlockwiseDiffusionSampler.generate()`, `all_generated` and `block_texts` are single shared Python lists, but the method returns one result per sample. The code collects tokens from `block_ids[0]` only, then reuses that same accumulated sequence for every sample.\n\nSo `num_samples > 1` is currently wrong.\n\n### Patch\n\nChange:\n\n  * `all_generated: list[int] = []`\n  * `block_texts: list[str] = []`\n\n\n\nto per-sample structures, for example:\n\n\n    all_generated = [[] for _ in range(num_samples)]\n    block_texts = [[] for _ in range(num_samples)]\n\n\nThen collect and decode per sample.\n\n### Why it matters\n\nThis does not break single-sample runs. 
But it makes batched sampling misleading, which is bad for evaluating diversity and sampler correctness. Since MDLM and block diffusion are often judged partly on generation behavior, broken multi-sample output makes the model look more deterministic or cleaner than it really is. (arXiv)\n\n### Minimal test\n\n  * run `num_samples=2` with a fixed seed and temperature > 0\n  * assert outputs are independently tracked\n  * assert internal block text lists differ when token traces differ\n\n\n\n* * *\n\n## 3. Fix `generate_sample()` so it cannot sample special tokens and silently map them to normal tokens\n\n**Files:** `bitdiffusion/train.py`\n\n### What is wrong\n\nYour qualitative monitor sampler samples from the full output vocabulary, then later clamps token IDs to `vocab_size - 1`. If the model samples the mask token or think token, that special token gets silently turned into the last normal vocabulary token.\n\nSo your training samples can look cleaner or stranger for the wrong reason.\n\n### Patch\n\nChange:\n\n\n    probs = torch.softmax(logits / temperature, dim=-1)\n\n\nto:\n\n\n    probs = torch.softmax(logits[:, :, :model.config.vocab_size] / temperature, dim=-1)\n\n\nDo not rely on post-hoc clamping.\n\n### Why it matters\n\nThis does not directly affect training, but it absolutely affects whether you trust your monitoring. In diffusion models, qualitative inspection is important because loss curves alone do not tell the whole story about generation quality. (arXiv)\n\n### Minimal test\n\n  * force logits to favor mask token\n  * assert sampler never returns an out-of-range or silently remapped normal token\n\n\n\n* * *\n\n## 4. Decide whether you want true variable-length training or fixed-length training, then make the code match\n\n**Files:** `prepare_hf_jsonl.py`, `bitdiffusion/data.py`, `bitdiffusion/train.py`\n\n### What is wrong\n\nYour prep script creates a variable-length curriculum. 
Then the dataset loader re-tokenizes each `\"text\"` field, concatenates everything into a rolling token buffer, and emits fixed `max_length` chunks. So the end-to-end training stream is mostly fixed-length again.\n\n### Patch choice A: keep variable-length training\n\n  * store tokenized examples directly\n  * keep one JSONL example = one training example\n  * use `attention_mask` throughout masking and loss\n  * do not re-flatten the corpus into a global rolling token buffer\n\n\n\n### Patch choice B: admit fixed-length training\n\n  * simplify prep\n  * stop generating variable-length chunks upstream\n  * keep fixed-length windows deliberately\n\n\n\n### My recommendation\n\nFor your first 1B run, choose **B** unless variable-length behavior is central to your research question. Fixed-length training is simpler and easier to debug.\n\n### Why it matters\n\nBlock diffusion papers emphasize variance and schedule quality. If your intended curriculum is being erased by the loader, you do not really know what you trained. Clean experimental semantics matter more here than fancy preprocessing. (arXiv)\n\n### Minimal test\n\n  * inspect a batch length histogram after collation\n  * confirm it matches what you think the loader is doing\n\n\n\n* * *\n\n# Tier 1: fix before spending the full 40B tokens\n\n## 5. Propagate `attention_mask` into corruption and loss, or remove padding entirely\n\n**Files:** `bitdiffusion/data.py`, `bitdiffusion/train.py`, `bitdiffusion/diffusion.py`, `bitdiffusion/model.py`\n\n### What is wrong\n\nThe collator builds `attention_mask`. The training loop ignores it. The model forward path also ignores it.\n\nRight now this is partly masked by your fixed-length behavior. 
But the moment you preserve variable lengths, padded positions become real positions for:\n\n  * masking,\n  * attention,\n  * loss bookkeeping.\n\n\n\nAnd because pad defaults to EOS if the tokenizer lacks a pad token, the model can learn EOS-padding artifacts.\n\n### Patch\n\nAt minimum:\n\n  * exclude padded positions from `apply_mask`\n  * exclude padded positions from `MaskDiffusionLoss`\n\n\n\nIf you later restore true variable-length batching:\n\n  * also pass an attention mask into attention\n\n\n\n### Why it matters\n\nThis is less urgent than the `NaN` fix because your current loader mostly emits full chunks. But once you want honest variable-length behavior, this becomes a correctness issue, not a cleanup. (arXiv)\n\n* * *\n\n## 6. Disable thinking tokens for the first serious 1B baseline\n\n**Files:** `bitdiffusion/diffusion.py`, `bitdiffusion/train.py`, `bitdiffusion/sample.py`\n\n### Why\n\nThis is the weakest-supervised subsystem in the code.\n\nThe code explicitly excludes thinking positions from direct supervised loss and expects them to become useful only through downstream answer gradients. That is possible in principle, but it is a weak signal. Also, training prepends one think prefix to the whole sequence, while the block sampler can prepend think tokens before every block. That is a train-test mismatch.\n\n### Patch\n\nFor the baseline 1B run:\n\n  * set `N_think = 0`\n  * set `think_prob = 0`\n  * keep the code, but remove it from the main experiment\n\n\n\nThen add it back only after the baseline works.\n\n### Why it matters\n\nYour core concept does **not** need thinking tokens to be valid. MDLM, block diffusion, ternary weights, and hybrid A8/A4 already make a complete research story. Thinking tokens add ambiguity without adding much confidence. (arXiv)\n\n* * *\n\n## 7. 
Keep the current KV cache labeled as a prototype, and do not overfit conclusions to it\n\n**Files:** `bitdiffusion/quantization.py`, `bitdiffusion/sample.py`\n\n### What is happening\n\nYour active cache path uses a simple per-head absmax scheme for both keys and values. That is fine for a prototype, but it is simpler than the best-supported KV-cache quantization approaches.\n\nKIVI’s main result is that keys and values want different treatment: keys per-channel, values per-token. Your current path does not do that. (arXiv)\n\n### Patch\n\nDo one of these:\n\n  * leave the current cache as-is, but call it a prototype cache and benchmark it honestly\n  * or implement asymmetric K/V quantization closer to KIVI\n\n\n\n### My recommendation\n\nFor the first 1B run, keep it simple and prototype-level. Do **not** burn time rewriting the cache before the base model is proven.\n\n### Why it matters\n\nKV cache is mostly an inference feature in your code, not a training feature. So this is not a blocker for pretraining. It is a blocker for making strong claims about deployment efficiency or quality retention. KIVI shows the asymmetry matters. (arXiv)\n\n* * *\n\n## 8. Add one explicit ablation checkpoint before the A8 → A4 switch\n\n**Files:** `bitdiffusion/train.py`\n\n### Patch\n\nSave:\n\n  * one checkpoint right before the activation-mode switch\n  * one checkpoint shortly after entering A4 mode\n\n\n\nAlso log:\n\n  * masked-token accuracy\n  * answer-only loss\n  * fraction of masked positions per batch\n  * gradient norm\n  * activation mode\n\n\n\n### Why it matters\n\nCurrent low-bit attention work says 4-bit attention is the main obstacle because of heavy-tailed activations and precision mismatch. If your run degrades, you want to know whether the break started:\n\n  * before A4,\n  * exactly at A4,\n  * or long after. (arXiv)\n\n\n\n* * *\n\n# Tier 2: worth fixing, but not before the first scaled run\n\n## 9. 
Separate the “full denoiser” and “block sampler” evaluation stories\n\n**Files:** `bitdiffusion/sample.py`\n\n### Why\n\nYour full denoiser resets the KV cache every denoising step, which is logically correct because the full masked pattern changes every step. That means KV cache does not buy much there. The real cache benefit is in the block sampler.\n\n### Patch\n\nReport them separately:\n\n  * full diffusion sampling quality\n  * blockwise generation quality and speed\n  * KV cache effect only inside the blockwise path\n\n\n\n### Why it matters\n\nIt makes your conclusions cleaner and more aligned with what block diffusion is actually buying. (arXiv)\n\n* * *\n\n## 10. Add smoke tests for the exact failure cases above\n\n**Files:** `tests/`\n\n### Add these tests\n\n  * `MaskDiffusionLoss` zero-supervision returns finite zero\n  * `generate_sample()` never samples special IDs into normal tokens\n  * `BlockwiseDiffusionSampler.generate(num_samples>1)` returns independent per-sample outputs\n  * data loader preserves intended length behavior\n  * one tiny forward/backward pass on CPU works\n\n\n\n### Why it matters\n\nYour code is already close enough to useful that small regressions matter. At 1B scale, simple tests are much cheaper than one wasted launch.\n\n* * *\n\n# Tier 3: optional improvements after the baseline works\n\n## 11. If you want better KV behavior, move toward asymmetric quantization\n\nThis is where I would spend time **after** the base 1B model works. KIVI gives a strong hint that asymmetry between keys and values is not cosmetic. (arXiv)\n\n## 12. If you want stronger A4 confidence, add more attention-specific diagnostics\n\nAttn-QAT makes it very clear that the hard part is not generic quantization. It is attention numerics. That suggests logging:\n\n  * attention score range\n  * softmax entropy\n  * per-head activation max\n  * fraction of saturated quantized values during A4 mode (arXiv)\n\n\n\n## 13. 
If you want thinking tokens back, add a real training signal\n\nDo this only after the plain model works. Right now they are more of a research hypothesis than a dependable subsystem.\n\n* * *\n\n# The patch order I would actually execute\n\n## This week\n\n  1. Fix `MaskDiffusionLoss` NaN case.\n  2. Fix block sampler multi-sample bug.\n  3. Fix `generate_sample()` vocabulary slicing.\n  4. Decide fixed-length vs variable-length training and simplify accordingly.\n\n\n\n## Before the 1B launch\n\n  5. Disable thinking tokens for baseline.\n  6. Add mask-aware loss/corruption if you keep any variable-length batching.\n  7. Add checkpoints around the A4 transition.\n  8. Add the small tests.\n\n\n\n## After the baseline run\n\n  9. Improve KV asymmetry.\n  10. Add attention-specific A4 diagnostics.\n  11. Reintroduce thinking tokens only as an ablation.\n\n\n\n* * *\n\n# How I think it works after this patch plan\n\nIf you apply Tier 0 and Tier 1, I think the model has a **real chance** to do what you want in the limited sense that matters first:\n\n  * train a 1B masked-diffusion model,\n  * keep the ternary-weight core,\n  * switch into hybrid low-bit activation late,\n  * produce coherent blockwise-generated text,\n  * and give you a trustworthy baseline for later KV and thinking-token experiments.\n\n\n\nThat expectation is supported by the literature around MDLM, block diffusion, BitNet b1.58, and BitNet a4.8. (arXiv)\n\nIf you do **not** apply the top patches, I think the likely failure mode is not dramatic collapse. It is worse: an expensive run that “sort of works,” but leaves you unsure whether the weak points came from diffusion, low-bit attention, your data path, or the under-supervised thinking-token mechanism. Recent 4-bit attention results are exactly why that distinction matters. (arXiv)",
  "title": "Would this concept model work?"
}