Attention Is All We Had — But Not What We Needed: Language generation without attention via iterative energy-based state refinement
Hugging Face Forums [Unofficial]
May 28, 2026
Thank you for the DOIs. Will read them.
Your finding that scaling is superlinear (7.5x threshold
from 4x params) is very interesting. CSM is designed for
iteration from the ground up, so the scaling might be
even stronger here.
The 300M model finishes training in ~4 hours. I’ll run the
iteration test right after and share the delta values at
each depth (3, 5, 10, 15, 20, 25, 30, 40, 45).
For κ: yes, I can give you the full state trajectory
at each iteration — 16 vectors at every step. Same input,
different depths. You can run your κ pipeline directly on it.
Will share the data once the model is ready.
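For reference, here is a minimal sketch of what such a trajectory export could look like. It is not CSM's code: `toy_refine` is a hypothetical stand-in for the real refinement step, the vector width `DIM` is an arbitrary choice, and "delta" is assumed to mean the Frobenius norm of the change between consecutive states. Only the 16 vectors per step and the list of depths come from the post.

```python
import math
import random

DEPTHS = [3, 5, 10, 15, 20, 25, 30, 40, 45]  # depths listed in the post
NUM_VECTORS = 16  # state vectors per iteration, as stated in the post
DIM = 8           # assumed vector width, for illustration only


def toy_refine(state, target, rate=0.3):
    """Hypothetical stand-in for one refinement step: move each entry toward target."""
    return [
        [s + rate * (t - s) for s, t in zip(state_vec, target_vec)]
        for state_vec, target_vec in zip(state, target)
    ]


def frobenius_diff(a, b):
    """Frobenius norm of the difference between two states (lists of vectors)."""
    return math.sqrt(
        sum((x - y) ** 2 for va, vb in zip(a, b) for x, y in zip(va, vb))
    )


def run_trajectory(initial, target, max_depth):
    """Return [state_0, state_1, ..., state_max_depth] for one fixed input."""
    trajectory = [initial]
    for _ in range(max_depth):
        trajectory.append(toy_refine(trajectory[-1], target))
    return trajectory


def deltas_at(trajectory, depths):
    """Assumed definition of delta: change between depth d-1 and depth d."""
    return {d: frobenius_diff(trajectory[d], trajectory[d - 1]) for d in depths}


rng = random.Random(0)
initial = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_VECTORS)]
target = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_VECTORS)]

trajectory = run_trajectory(initial, target, max(DEPTHS))
deltas = deltas_at(trajectory, DEPTHS)
for depth, delta in deltas.items():
    print(f"depth {depth:2d}: delta = {delta:.3e}")
```

Running to the maximum depth once and reading off the intermediate states gives the same input at every depth, which is what a per-depth comparison needs.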
Also, could you share your email? Alternatively, you can mail me at aruneshdwivedi87@gmail.com
Arunesh