{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreieiohcma7dqafso3g7pvzveauwf5tckbbqin2ge72lae2yyoxbuxi",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mkmgesws5ec2"
},
"path": "/t/found-consistent-internal-activation-patterns-preceding-hallucinations-in-gpt-2-relation-dropout-and-last-layer-suppression/175640#post_1",
"publishedAt": "2026-04-29T05:36:51.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"GitHub - TrazeMaG/hallucination-fingerprints: Identifying consistent internal activation patterns that precede hallucinations in transformer language models · GitHub",
"Trazemag/hallbench · Datasets at Hugging Face",
"arxiv.org",
"Log in to arXiv | arXiv e-print repository"
],
"textContent": "I’m an independent researcher based in Dublin. Over the past\nweek I’ve been investigating whether LLMs hallucinate in\nconsistent, detectable patterns visible in their internal\nactivations before the wrong token is generated.\n\nI found two things I haven’t seen named before:\n\n**1. Relation Dropout (small models)**\nWhen a small transformer hallucinates on factual recall,\nattention to the semantic relation token (e.g. “capital”)\ncollapses in the final block — even when the entity token\n(e.g. “germany”) is strongly attended to. 4/6 hallucination\ncases showed this pattern clearly.\n\n**2. Last-Layer Suppression (GPT-2)**\nGPT-2 actually finds the correct answer in blocks 10-11 of\nits residual stream (Paris peaks at 18%, Berlin at 35%,\nTokyo at 46%) — then Block 12 systematically kills it every\nsingle time across 20,000 prompts. Average suppression layer:\nexactly 12.0.\n\n**What I built:**\n\n * Trained a 806K parameter transformer from scratch\n * Ran 20,000 prompts through GPT-2 on RTX 4060 in 7 minutes\n * Built a 3-type hallucination taxonomy\n * Released HallScan: pip install hallscan\n * Published HallBench: 20,000 labeled examples on HuggingFace\n * Wrote a 6-page LaTeX paper\n\n\n\n**Links:**\nGitHub: GitHub - TrazeMaG/hallucination-fingerprints: Identifying consistent internal activation patterns that precede hallucinations in transformer language models · GitHub\nDataset: Trazemag/hallbench · Datasets at Hugging Face\nTool: pip install hallscan\n\n**The ask:**\nI’m trying to submit to arXiv cs.CL but need an endorsement\nas a first-time submitter. If anyone here is a qualified\ncs.CL endorser and finds the work interesting, the\nendorsement link is:\n\narxiv.org\n\n### Log in to arXiv | arXiv e-print repository\n\nHappy to share the paper draft with anyone interested.",
"title": "Found consistent internal activation patterns preceding hallucinations in GPT-2 — Relation Dropout and Last-Layer Suppression"
}