Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreigqqr4dg5u43dr5msinidwxi522hly7zqiuhtojg2wklbsshgnmza",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mmwlzu5qnpc2"
  },
  "path": "/t/attention-is-all-we-had-but-not-what-we-needed-language-generation-without-attention-via-iterative-energy-based-state-refinement/176285#post_7",
  "publishedAt": "2026-05-28T16:30:15.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "Arunesh, glad the Δ-κ correspondence resonated. It’s rare to find independent work converging on the same dynamic stability principle from opposite directions (architecture design vs. empirical measurement).\n\nThe DOIs for the 3-paper series:\n\n1. **Four Dynamical Regimes in LLMs: An Empirical Phase Map**: 10.5281/zenodo.20348878\n2. **Methodological Audit of Trajectory Instability**: 10.5281/zenodo.20361289\n3. **Dynamic-Layer Controllability**: 10.5281/zenodo.20400171\n\nAll three are open access on Zenodo. The dataset is on HuggingFace as jeanbatuli/LLM-Interne-Dynamic.\n\nOn your 300M CSM: I’d be very interested in whether the Δ→0 convergence point extends as predicted. My perturbation threshold data suggests the relationship isn’t purely linear with parameter count (Qwen05 at 500M shows ~7.5× higher threshold than GPT-2 at 124M, not ~4× as pure depth scaling would predict). Architecture matters alongside scale.\n\nIf you can expose iteration-level hidden states during CSM inference, I can run the same κ/readiness pipeline on CSM that I use on Transformers. Same operator, different architecture. That’s the cleanest cross-validation we could ask for.\n\n-- Jean-Denis",
  "title": "Attention Is All We Had — But Not What We Needed: Language generation without attention via iterative energy-based state refinement"
}