
Attention Is All We Had — But Not What We Needed: Language generation without attention via iterative energy-based state refinement

Hugging Face Forums [Unofficial] May 28, 2026
We introduce CSM (Convergent State Machine), a language model with zero attention layers that uses energy-based iterative state refinement over 16 state vectors.

Key results:
- 66M and 150M models, zero attention anywhere
- 150M matches GPT-2 1.5B on MMLU within 0.3% (10x fewer params, 13x less data)
- Perplexity decreases monotonically with more iterations
- State dynamics scale with model size (66M → iter 15, 150M → iter 30+)
- Total training cost: under $50

Paper: Attention Is All We Had — But Not What We Needed: Convergent State Machine for Iterative Energy-Based Language Generation
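For readers unfamiliar with the general idea, here is a minimal toy sketch of iterative energy-based refinement over 16 state vectors. It is not the CSM architecture: the post does not give the energy function or the update rule, so the quadratic energy, the `anchors` targets, `DIM`, `COUPLING`, and the step size below are all hypothetical choices made only to show what "refine a set of states by descending an energy" can look like.

```python
import random

NUM_STATES = 16   # matches the 16 state vectors mentioned in the post
DIM = 8           # hypothetical vector width, chosen only for this demo
COUPLING = 0.5    # hypothetical weight of the term pulling states toward their mean


def mean_vector(states):
    """Component-wise mean of the state vectors."""
    n = len(states)
    return [sum(s[d] for s in states) / n for d in range(len(states[0]))]


def energy(states, anchors):
    """Toy quadratic energy (an assumption, not the paper's energy):
    fit each state to its anchor, plus a coupling term toward the mean state."""
    m = mean_vector(states)
    total = 0.0
    for s, a in zip(states, anchors):
        fit = sum((x - y) ** 2 for x, y in zip(s, a))
        spread = sum((x - y) ** 2 for x, y in zip(s, m))
        total += 0.5 * fit + 0.5 * COUPLING * spread
    return total


def refine_step(states, anchors, step_size=0.2):
    """One gradient-descent update of every state on the toy energy."""
    m = mean_vector(states)
    return [
        [x - step_size * ((x - a) + COUPLING * (x - mu))
         for x, a, mu in zip(s, anc, m)]
        for s, anc in zip(states, anchors)
    ]


def refine(states, anchors, iterations):
    """Run several refinement steps and record the energy after each one."""
    trace = [energy(states, anchors)]
    for _ in range(iterations):
        states = refine_step(states, anchors)
        trace.append(energy(states, anchors))
    return states, trace


rng = random.Random(0)
anchors = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_STATES)]
initial = [[0.0] * DIM for _ in range(NUM_STATES)]
final_states, trace = refine(initial, anchors, iterations=15)
```

In this toy setup the energy is a convex quadratic, so any step size below `2 / (1 + COUPLING)` lowers it at every iteration. That is only a loose analogy for the reported behavior of perplexity improving with more iterations; it says nothing about how CSM itself is trained or evaluated.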
