Attention Is All We Had — But Not What We Needed: Language generation without attention via iterative energy-based state refinement
Hugging Face Forums [Unofficial]
May 28, 2026
Thank you, Jean. This is exactly the kind of cross-validation
that strengthens both of our findings.
The Δ-κ correspondence is striking: two independent metrics,
measured on different architectures, capture the same internal
dynamic stability. Your finding that the dynamic-semantic layers
are partially decoupled explains precisely why our MMLU
scores remain flat while perplexity improves monotonically
with iteration depth.
I’m very interested in measuring κ on CSM’s iteration
dynamics. A cross-architecture comparison would be
valuable for both of us.
I’ll read your 3-paper series carefully. Could you share
the DOIs?
I’m currently training a 300M CSM to test whether the useful
iteration range extends further with scale. I expect results
within 24 hours.
Arunesh Dwivedi
VKD Industries