Found consistent internal activation patterns preceding hallucinations in GPT-2 — Relation Dropout and Last-Layer Suppression
I’m an independent researcher based in Dublin. Over the past week I’ve been investigating whether LLM hallucinations are preceded by consistent, detectable patterns in the model’s internal activations, before the wrong token is generated.
I found two things I haven’t seen named before:
1. Relation Dropout (small models): when a small transformer hallucinates on factual recall, attention to the semantic relation token (e.g. “capital”) collapses in the final block, even when the entity token (e.g. “germany”) is strongly attended to. 4 of 6 hallucination cases showed this pattern clearly.
2. Last-Layer Suppression (GPT-2): GPT-2 actually finds the correct answer in blocks 10-11 of its residual stream (Paris peaks at 18%, Berlin at 35%, Tokyo at 46%), then block 12 suppresses it. This happened on every one of 20,000 prompts, so the average suppression layer is exactly 12.0.
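For anyone who wants to try the two checks on their own activations, here is a minimal sketch of how they could be written once the numbers are extracted. It is not HallScan's API: the function names, the thresholds (`low`, `high`, `min_peak`, `drop_ratio`) and the toy values are all assumptions made for illustration. It assumes you already have, per prompt, the final-block attention weights from the last position to each prompt token, and the per-block probability of the correct token read off the residual stream.

```python
from typing import Optional, Sequence


def relation_dropout(
    attn_row: Sequence[float],
    relation_idx: int,
    entity_idx: int,
    low: float = 0.05,   # hypothetical "collapsed" threshold
    high: float = 0.30,  # hypothetical "strongly attended" threshold
) -> bool:
    """Flag Relation Dropout for one prompt.

    attn_row holds the final-block attention weights from the last
    position to each prompt token (e.g. averaged over heads).
    """
    return attn_row[relation_idx] < low and attn_row[entity_idx] > high


def suppression_block(
    p_correct: Sequence[float],
    min_peak: float = 0.10,   # hypothetical: answer must surface at all
    drop_ratio: float = 0.5,  # hypothetical: "suppressed" = below half the peak
) -> Optional[int]:
    """Return the 1-based block where the correct token is suppressed.

    p_correct[i] is the probability of the correct token after block i + 1.
    Returns None if the answer never surfaces or is never suppressed.
    """
    peak_i = max(range(len(p_correct)), key=p_correct.__getitem__)
    peak = p_correct[peak_i]
    if peak < min_peak:
        return None
    # Look for the first later block that falls well below the peak.
    for i in range(peak_i + 1, len(p_correct)):
        if p_correct[i] < drop_ratio * peak:
            return i + 1
    return None


# Toy values for "the capital of germany is" (made up for illustration).
attn = [0.05, 0.01, 0.04, 0.70, 0.20]
flagged = relation_dropout(attn, relation_idx=1, entity_idx=3)

# Toy 12-block trace: the answer surfaces in blocks 10-11, then drops.
trace = [0.0] * 9 + [0.12, 0.18, 0.01]
block = suppression_block(trace)
```

With a 12-block trace shaped like the toy one, `suppression_block` returns 12, which is the kind of per-prompt value that would be averaged over a prompt set.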
What I built:
- Trained an 806K-parameter transformer from scratch
- Ran 20,000 prompts through GPT-2 on an RTX 4060 in 7 minutes
- Built a 3-type hallucination taxonomy
- Released HallScan: pip install hallscan
- Published HallBench: 20,000 labeled examples on HuggingFace
- Wrote a 6-page LaTeX paper
Links:
- GitHub: TrazeMaG/hallucination-fingerprints (Identifying consistent internal activation patterns that precede hallucinations in transformer language models)
- Dataset: Trazemag/hallbench (Hugging Face Datasets)
- Tool: pip install hallscan
The ask: I’m trying to submit to arXiv cs.CL, but as a first-time submitter I need an endorsement. If anyone here is a qualified cs.CL endorser and finds the work interesting, the endorsement link is on arxiv.org (arXiv login required).
Happy to share the paper draft with anyone interested.