{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreigynm2dkrwlcyrzhmiikya3vr2xxux3rfqfk7xmnzfa3qttcnmeba",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mmwlznviifh2"
},
"path": "/t/attention-is-all-we-had-but-not-what-we-needed-language-generation-without-attention-via-iterative-energy-based-state-refinement/176285#post_9",
"publishedAt": "2026-05-28T16:50:13.000Z",
"site": "https://discuss.huggingface.co",
"textContent": "Thank you for the DOIs. Will read them.\n\nYour finding that scaling is superlinear (7.5x threshold\nfrom 4x params) is very interesting. CSM is designed for\niteration from the ground up, so the scaling might be\neven stronger here.\n\n300M model finishes training in ~4 hours. I’ll run the\niteration test right after and share the delta values at\neach depth (3, 5, 10, 15, 20, 25, 30, 40, 45).\n\nFor κ: yes, I can give you the full state trajectory\nat each iteration — 16 vectors at every step. Same input,\ndifferent depths. You can run your κ pipeline directly on it.\n\nWill share the data once the model is ready.\n\nAlso can you share your email or you can mail me on aruneshdwivedi87@gmail.com\n\nArunesh",
"title": "Attention Is All We Had — But Not What We Needed: Language generation without attention via iterative energy-based state refinement"
}