[Request] arXiv endorsement for new mech interp paper on LLM self-referential circuits
Hugging Face Forums [Unofficial]
February 10, 2026
Zenodo
When Models Examine Themselves: Vocabulary-Activation Correspondence in...
Large language models produce rich introspective language when prompted for self-examination, but whether this language reflects internal computation or sophisticated confabulation has remained unclear. We show that self-referential vocabulary tracks...
Update: a more concise version with formatting adjustments is now available. Feedback and discussion are welcome.