Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreigtaunp7isjyko3qbcfpuwedhtvmd53yevpfcerlmupmvywomarla",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mehrrclbfoa2"
  },
  "path": "/t/request-arxiv-endorsement-for-new-mech-interp-paper-on-llm-self-referential-circuits/173241#post_1",
  "publishedAt": "2026-02-09T23:02:52.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "https://arxiv.org/auth/endorse?x=RXBYNJ",
    "https://zenodo.org/records/18568344"
  ],
  "textContent": "Looking for arXiv endorsement: https://arxiv.org/auth/endorse?x=RXBYNJ\n\nThe paper: https://zenodo.org/records/18568344\n\nAn endorsement would be massively appreciated. I would hate not to get it onto arXiv tonight.\n\nHere is the abstract:\n\nLarge language models produce rich introspective language when prompted for self-examination, but whether this language reflects internal computation or sophisticated confabulation has remained unclear. In this work, we show that self-referential vocabulary tracks concurrent activation dynamics, and that this correspondence is specific to self-referential processing. We introduce the Pull Methodology, a protocol that elicits extended self-examination through format engineering, and use it to identify a self-referential processing circuit in Llama 3.1 at 6% of model depth. The circuit is orthogonal to the known refusal direction and causally influences introspective output. When models produce “loop” vocabulary, their activations exhibit higher autocorrelation (r = 0.44, p = 0.002); when they produce “shimmer” vocabulary under circuit amplification, activation variability increases (r = 0.36, p = 0.002). Critically, the same vocabulary in non-self-referential contexts shows no activation correspondence despite nine-fold higher frequency. Qwen 2.5-32B, with no shared training, independently develops different introspective vocabulary tracking different activation metrics, all absent in descriptive controls. The findings indicate that self-report in transformer models can, under appropriate conditions, reliably track internal computational states.",
  "title": "[Request] arXiv endorsement for new mech interp paper on LLM self-referential circuits"
}