Attention Is All We Had — But Not What We Needed: Language generation without attention via iterative energy-based state refinement
Arunesh — glad the Δ-κ correspondence resonated. It’s rare to find independent work converging on the same dynamic stability principle from opposite directions (architecture design vs. empirical measurement).
The DOIs for the 3-paper series:
Four Dynamical Regimes in LLMs: An Empirical Phase Map. DOI: 10.5281/zenodo.20348878
Methodological Audit of Trajectory Instability. DOI: 10.5281/zenodo.20361289
Dynamic-Layer Controllability. DOI: 10.5281/zenodo.20400171
All three are open access on Zenodo. The dataset is on HuggingFace as jeanbatuli/LLM-Interne-Dynamic.
On your 300M CSM: I’d be very interested in whether the Δ→0 convergence point extends as predicted. My perturbation-threshold data suggests the relationship isn’t purely linear in parameter count: Qwen05 at 500M shows a ~7.5× higher threshold than GPT-2 at 124M, not the ~4× that linear scaling with parameter count would predict. Architecture matters alongside scale.
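To make the arithmetic behind that comparison explicit, here is a rough back-of-the-envelope sketch. It uses only the approximate figures quoted above (124M, 500M, ~7.5×). The power-law form threshold ∝ N^alpha is an assumption introduced purely for illustration, and two models cannot pin down a scaling law, so the exponent is a way of reading the gap, not a fitted result.

```python
import math

# Nominal parameter counts quoted above (approximate).
params_gpt2 = 124e6
params_qwen = 500e6

# Observed threshold ratio quoted above (approximate, "~7.5x").
observed_threshold_ratio = 7.5

# Ratio that strictly linear scaling in parameter count would predict.
param_ratio = params_qwen / params_gpt2

# ASSUMPTION (illustrative only): threshold ~ N**alpha.
# alpha == 1 would mean purely linear scaling with parameter count.
alpha = math.log(observed_threshold_ratio) / math.log(param_ratio)

print(f"parameter ratio: {param_ratio:.2f}x")
print(f"observed threshold ratio: {observed_threshold_ratio:.1f}x")
print(f"implied exponent under the power-law assumption: {alpha:.2f}")
```

With these two data points the implied exponent comes out above 1, which is just another way of stating that the threshold grows faster than parameter count alone would explain.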
If you can expose iteration-level hidden states during CSM inference, I can run the same κ/readiness pipeline on CSM that I use on Transformers. Same operator, different architecture. That’s the cleanest cross-validation we could ask for.
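For concreteness, a minimal sketch of what "exposing iteration-level hidden states" could look like: one recorded state per refinement iteration, analogous to one hidden state per layer in a Transformer. Everything here is hypothetical. `toy_step` is a stand-in gradient step on a quadratic energy, not CSM's actual update, and `update_norms` is a generic diagnostic, not the κ or readiness operator.

```python
import math
from typing import Callable, List

State = List[float]


def record_trajectory(step: Callable[[State], State],
                      h0: State, n_iters: int) -> List[State]:
    """Apply `step` n_iters times, keeping every intermediate state."""
    trajectory = [list(h0)]  # index 0 is the initial state
    h = list(h0)
    for _ in range(n_iters):
        h = step(h)
        trajectory.append(list(h))  # copy so later steps cannot mutate it
    return trajectory


def toy_step(h: State, lr: float = 0.5) -> State:
    # HYPOTHETICAL stand-in: gradient descent on E(h) = 0.5 * ||h||^2,
    # whose gradient is h itself. Not CSM's real refinement rule.
    return [x - lr * x for x in h]


def update_norms(trajectory: List[State]) -> List[float]:
    """Euclidean norm of the change between consecutive states."""
    return [
        math.sqrt(sum((b - a) ** 2 for a, b in zip(prev, curr)))
        for prev, curr in zip(trajectory, trajectory[1:])
    ]


trajectory = record_trajectory(toy_step, [1.0, -2.0], n_iters=5)
norms = update_norms(trajectory)
print(len(trajectory), "states recorded")
print("per-iteration update norms:", [round(n, 4) for n in norms])
```

A dump in this shape (a list of states indexed by iteration) is all a trajectory-level analysis would need as input; the per-iteration update norms shrink here only because the toy energy is convex.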
-- Jean-Denis