External Publication

Attention Is All We Had — But Not What We Needed: Language generation without attention via iterative energy-based state refinement

Hugging Face Forums [Unofficial] May 28, 2026

Thank you for the DOIs; I will read them. Your finding that scaling is superlinear (7.5x threshold from 4x params) is very interesting. CSM is designed for iteration from the ground up, so the scaling might be even stronger here.

The 300M model finishes training in ~4 hours. I'll run the iteration test right after and share the delta values at each depth (3, 5, 10, 15, 20, 25, 30, 40, 45).

For κ: yes, I can give you the full state trajectory at each iteration, 16 vectors at every step, for the same input at different depths. You can run your κ pipeline directly on it. I'll share the data once the model is ready.

Also, could you share your email, or mail me at aruneshdwivedi87@gmail.com?

Arunesh
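A minimal sketch of how such a trajectory dump could be organized, so the shape of the data is clear before it arrives. Everything beyond the depth list and the 16 vectors per step is an assumption made for illustration: `refine_step` is a placeholder contraction and not the CSM update, `DIM` is an arbitrary vector width, and "delta" is assumed here to mean the L2 norm of the change between consecutive states.

```python
import math
import random

NUM_VECTORS = 16   # from the post: 16 state vectors at every step
DIM = 8            # hypothetical vector width, chosen only for this demo
DEPTHS = [3, 5, 10, 15, 20, 25, 30, 40, 45]  # depths listed in the post


def refine_step(state):
    """Placeholder update, NOT the CSM rule: halve every entry."""
    return [[0.5 * x for x in vec] for vec in state]


def delta(prev, curr):
    """Assumed definition of delta: L2 norm of the change between two steps."""
    return math.sqrt(sum((c - p) ** 2
                         for pv, cv in zip(prev, curr)
                         for p, c in zip(pv, cv)))


def run_trajectory(initial_state, depth):
    """Return (states, deltas): depth + 1 states and depth deltas."""
    states = [initial_state]
    deltas = []
    for _ in range(depth):
        nxt = refine_step(states[-1])
        deltas.append(delta(states[-1], nxt))
        states.append(nxt)
    return states, deltas


# Fixed seed so every depth starts from the same input.
rng = random.Random(0)
x0 = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(NUM_VECTORS)]

# Same input, different depths: one full trajectory per depth.
trajectories = {d: run_trajectory(x0, d) for d in DEPTHS}
final_deltas = {d: trajectories[d][1][-1] for d in DEPTHS}
```

With this layout, `trajectories[d][0]` holds the `d + 1` states for depth `d` (each a list of 16 vectors), and `final_deltas[d]` is the last-step delta at that depth, which is the kind of per-depth value the iteration test would report.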
