Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreieiohcma7dqafso3g7pvzveauwf5tckbbqin2ge72lae2yyoxbuxi",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mkmgesws5ec2"
  },
  "path": "/t/found-consistent-internal-activation-patterns-preceding-hallucinations-in-gpt-2-relation-dropout-and-last-layer-suppression/175640#post_1",
  "publishedAt": "2026-04-29T05:36:51.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "GitHub - TrazeMaG/hallucination-fingerprints: Identifying consistent internal activation patterns that precede hallucinations in transformer language models · GitHub",
    "Trazemag/hallbench · Datasets at Hugging Face",
    "arxiv.org",
    "Log in to arXiv | arXiv e-print repository"
  ],
  "textContent": "I’m an independent researcher based in Dublin. Over the past\nweek I’ve been investigating whether LLMs hallucinate in\nconsistent, detectable patterns visible in their internal\nactivations before the wrong token is generated.\n\nI found two things I haven’t seen named before:\n\n**1. Relation Dropout (small models)**\nWhen a small transformer hallucinates on factual recall,\nattention to the semantic relation token (e.g. “capital”)\ncollapses in the final block — even when the entity token\n(e.g. “germany”) is strongly attended to. 4/6 hallucination\ncases showed this pattern clearly.\n\n**2. Last-Layer Suppression (GPT-2)**\nGPT-2 actually finds the correct answer in blocks 10-11 of\nits residual stream (Paris peaks at 18%, Berlin at 35%,\nTokyo at 46%) — then Block 12 systematically kills it every\nsingle time across 20,000 prompts. Average suppression layer:\nexactly 12.0.\n\n**What I built:**\n\n  * Trained a 806K parameter transformer from scratch\n  * Ran 20,000 prompts through GPT-2 on RTX 4060 in 7 minutes\n  * Built a 3-type hallucination taxonomy\n  * Released HallScan: pip install hallscan\n  * Published HallBench: 20,000 labeled examples on HuggingFace\n  * Wrote a 6-page LaTeX paper\n\n\n\n**Links:**\nGitHub: GitHub - TrazeMaG/hallucination-fingerprints: Identifying consistent internal activation patterns that precede hallucinations in transformer language models · GitHub\nDataset: Trazemag/hallbench · Datasets at Hugging Face\nTool: pip install hallscan\n\n**The ask:**\nI’m trying to submit to arXiv cs.CL but need an endorsement\nas a first-time submitter. If anyone here is a qualified\ncs.CL endorser and finds the work interesting, the\nendorsement link is:\n\narxiv.org\n\n### Log in to arXiv | arXiv e-print repository\n\nHappy to share the paper draft with anyone interested.",
  "title": "Found consistent internal activation patterns preceding  hallucinations in GPT-2 — Relation Dropout and Last-Layer  Suppression"
}