Found consistent internal activation patterns preceding hallucinations in GPT-2 — Relation Dropout and Last-Layer Suppression

Hugging Face Forums [Unofficial] April 29, 2026

I’m an independent researcher based in Dublin. Over the past week I’ve been investigating whether LLMs hallucinate in consistent, detectable patterns visible in their internal activations before the wrong token is generated.

I found two things I haven’t seen named before:

1. Relation Dropout (small models): When a small transformer hallucinates on factual recall, attention to the semantic relation token (e.g. “capital”) collapses in the final block, even when the entity token (e.g. “germany”) is strongly attended to. 4 of the 6 hallucination cases showed this pattern clearly.

2. Last-Layer Suppression (GPT-2): GPT-2 actually finds the correct answer in blocks 10-11 of its residual stream (Paris peaks at 18%, Berlin at 35%, Tokyo at 46%). Block 12 then systematically kills it, every single time across 20,000 prompts. Average suppression layer: exactly 12.0.
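For anyone who wants to poke at the two patterns without the full pipeline, here is a minimal pure-Python sketch of both checks. It is not HallScan's API: the function names, the `drop_ratio`, `low`, and `high` thresholds, and all the example numbers are made up for illustration. It assumes you have already extracted (a) the probability of the correct token after each block, e.g. by decoding the residual stream with the unembedding matrix, and (b) the final block's attention weights from the last position, one row per head. On real GPT-2 you would normally also apply the final layer norm before unembedding, which this toy version skips.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def lens_probability(hidden, unembed, token_id):
    """Decode one residual-stream vector with an unembedding matrix
    (vocab x d_model) and return the probability of token_id."""
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
    return softmax(logits)[token_id]

def suppression_layer(probs_by_block, drop_ratio=0.5):
    """Return the 1-based block where the correct token's probability first
    falls below drop_ratio * peak after its peak, or None if it never does.
    drop_ratio is a hypothetical threshold, not the post's definition."""
    peak_idx = max(range(len(probs_by_block)), key=probs_by_block.__getitem__)
    peak = probs_by_block[peak_idx]
    for i in range(peak_idx + 1, len(probs_by_block)):
        if probs_by_block[i] < drop_ratio * peak:
            return i + 1
    return None

def relation_dropout(last_rows, relation_pos, entity_pos, low=0.05, high=0.30):
    """last_rows: final-block attention from the last position, one row per head.
    Flags the case where head-averaged attention to the relation token is
    below `low` while attention to the entity token is above `high`.
    Both thresholds are illustrative assumptions."""
    n_heads = len(last_rows)
    rel = sum(row[relation_pos] for row in last_rows) / n_heads
    ent = sum(row[entity_pos] for row in last_rows) / n_heads
    return {"relation": rel, "entity": ent, "dropout": rel < low and ent > high}

# Toy lens decode: 2-dim hidden state, 2-token vocabulary.
toy_p = lens_probability([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0]], token_id=0)

# Made-up per-block probabilities for a 12-block model: peak at block 11.
probs = [0.00, 0.00, 0.01, 0.01, 0.02, 0.03, 0.05, 0.08, 0.12, 0.16, 0.18, 0.03]
block = suppression_layer(probs)

# Made-up attention rows for tokens ["the", "capital", "of", "germany", "is"].
rows = [
    [0.10, 0.02, 0.08, 0.60, 0.20],  # head 0
    [0.15, 0.04, 0.06, 0.50, 0.25],  # head 1
]
check = relation_dropout(rows, relation_pos=1, entity_pos=3)
```

With these toy inputs, `block` is 12 and `check["dropout"]` is `True`. Averaging over heads is just one possible choice; a per-head minimum or maximum might separate the cases better on a real model.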

What I built:

  • Trained an 806K-parameter transformer from scratch
  • Ran 20,000 prompts through GPT-2 on an RTX 4060 in 7 minutes
  • Built a 3-type hallucination taxonomy
  • Released HallScan: pip install hallscan
  • Published HallBench: 20,000 labeled examples on HuggingFace
  • Wrote a 6-page LaTeX paper

Links:

  • GitHub: TrazeMaG/hallucination-fingerprints (Identifying consistent internal activation patterns that precede hallucinations in transformer language models)
  • Dataset: Trazemag/hallbench on Hugging Face
  • Tool: pip install hallscan

The ask: I’m trying to submit to arXiv cs.CL but need an endorsement as a first-time submitter. If anyone here is a qualified cs.CL endorser and finds the work interesting, the endorsement link is:

arxiv.org


Happy to share the paper draft with anyone interested.
