{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreih5bzmvuzbx3mute6ygxy2k3kju6irnob3ludiboq7y7o3u2pnbhm",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mmwfcu57mlw2"
},
"path": "/t/attention-is-all-we-had-but-not-what-we-needed-language-generation-without-attention-via-iterative-energy-based-state-refinement/176285#post_5",
"publishedAt": "2026-05-28T15:13:49.000Z",
"site": "https://discuss.huggingface.co",
"textContent": "* * *\n\nInteresting work. The CSM architecture confirms something I’ve been measuring from the other direction and I think I can add data you might find useful.\n\n**What you found architecturally, I’ve been measuring empirically on standard Transformers.**\n\nYour ∆ metric (state change magnitude tracking convergence across iterations) is conceptually very close to what I call **κ (kappa)** an inter-layer desynchronization index I measure on GPT-2, OPT, and Qwen during inference. Both capture internal dynamic stability. Both show that “how the model thinks” matters as much as “what the model outputs.”\n\nYour finding that the 150M model sustains refinement 2× longer than the 66M (∆ > 0.19 at iteration 40) maps directly to what I found in my cross-model perturbation audits: **architectural depth confers dynamic resilience.** Qwen05 (24 layers) absorbs perturbations that destabilize GPT-2 (12 layers) by a factor of ~5×. The skeleton constrains the dynamics, but training selects the emergent dynamic identity and I quantified this: R²_skeleton = 0.341 on dynamic variance. 66% of what a model _does_ during inference is not predicted by its parameter count or layer structure alone.\n\n**Where I think I can add to your work:**\n\n 1. **Perplexity improvement with iteration depth.** You show monotonic PPL decrease. I found the same pattern with **HYBRID_RECOVERY** a real-time adaptive stabilization protocol that reduces κ and restores dynamic readiness during generation. More stable internal dynamics → better next-token prediction. Same phenomenon, different method.\n\n 2. **The benchmark-iteration gap.** You note that MMLU doesn’t improve with iterations despite PPL improving. I found the exact same dissociation: **dynamic recovery does not guarantee semantic recovery.** You can stabilize the model’s internal trajectory without improving multiple-choice accuracy. 
This is a structural finding, not a failure mode the dynamic layer and the semantic layer are partially decoupled.\n\n 3. **Your scaling hypothesis.** “If useful iteration range scales with model capacity, a 7B CSM should sustain iterations to ~100+.” I can partially validate this from the measurement side: Qwen05 (24 layers, 500M params) shows a nonlinear perturbation threshold at α≈0.75, while GPT-2 (12 layers, 124M) destabilizes at α≈0.10. Depth helps. But architecture matters more than depth alone Qwen’s threshold is 7.5× higher, not just 2×.\n\n\n\n\nVoici la réponse corrigée — tous les liens ont été supprimés, seuls les titres et DOIs restent en texte brut (pas cliquables) :\n\n* * *\n\nInteresting work. The CSM architecture confirms something I’ve been measuring from the other direction — and I think I can add data you might find useful.\n\n**What you found architecturally, I’ve been measuring empirically on standard Transformers.**\n\nYour ∆ metric (state change magnitude tracking convergence across iterations) is conceptually very close to what I call κ (kappa) — an inter-layer desynchronization index I measure on GPT-2, OPT, and Qwen during inference. Both capture internal dynamic stability. Both show that “how the model thinks” matters as much as “what the model outputs.”\n\nYour finding that the 150M model sustains refinement 2× longer than the 66M (∆ > 0.19 at iteration 40) maps directly to what I found in my cross-model perturbation audits: architectural depth confers dynamic resilience. Qwen05 (24 layers) absorbs perturbations that destabilize GPT-2 (12 layers) by a factor of ~5×. The skeleton constrains the dynamics, but training selects the emergent dynamic identity — and I quantified this: R²_skeleton = 0.341 on dynamic variance. 66% of what a model does during inference is not predicted by its parameter count or layer structure alone.\n\n**Where I think I can add to your work:**\n\n 1. Perplexity improvement with iteration depth. 
You show monotonic PPL decrease. I found the same pattern with HYBRID_RECOVERY, a real-time adaptive stabilization protocol that reduces κ and restores dynamic readiness during generation. More stable internal dynamics → better next-token prediction. Same phenomenon, different method.\n\n 2. The benchmark-iteration gap. You note that MMLU doesn’t improve with iterations despite PPL improving. I found the exact same dissociation: dynamic recovery does not guarantee semantic recovery. You can stabilize the model’s internal trajectory without improving multiple-choice accuracy. This is a structural finding, not a failure mode: the dynamic layer and the semantic layer are partially decoupled.\n\n 3. Your scaling hypothesis. “If useful iteration range scales with model capacity, a 7B CSM should sustain iterations to ~100+.” I can partially validate this from the measurement side: Qwen05 (24 layers, 500M params) shows a nonlinear perturbation threshold at α≈0.75, while GPT-2 (12 layers, 124M) destabilizes at α≈0.10. Depth helps. But architecture matters more than depth alone: Qwen’s threshold is 7.5× higher, not just 2×.\n\nI’ve published the full measurement framework as a 3-paper series on Zenodo (open access). The dataset is on HuggingFace as jeanbatuli/LLM-Interne-Dynamic. Paper 1 covers the four-regime taxonomy across 10 models and 158 runs. Paper 2 is the methodological audit across 17 models with variance decomposition and documented falsifications. Paper 3 covers the perturbation-recovery protocol and the dynamic-semantic dissociation finding.\n\nIf you’re interested, I’d be curious to see whether CSM’s ∆ dynamics correlate with κ/readiness metrics on a matched task. Same measurement framework, different architecture: that’s how we’d know whether “internal dynamic stability” is a universal property or architecture-specific.",
"title": "Attention Is All We Had — But Not What We Needed: Language generation without attention via iterative energy-based state refinement"
}