We found that a small LLM is systematically more confident on wrong answers than on right ones
Hugging Face Forums [Unofficial]
March 14, 2026
We tested a Hybrid Intelligence (Bio + LLM) system with Karpathy's autoresearchers across almost 30,000 experiments and found that a small LLM is systematically more confident on wrong answers than on right ones.
| Metric | Correct | Wrong |
|---|---|---|
| First-token entropy | Higher | Lower |
| Probability margin | Lower | Higher |
| t-stat | 2.28 | −3.41 |
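For anyone who wants to try this on their own model, here is a minimal sketch of how the two metrics and a t-statistic could be computed from first-token logits. The post does not say which test or sign convention was used, so Welch's t-test is an assumption here, and the function names and toy logits are hypothetical, not taken from the experiments.

```python
import math
from statistics import mean, variance


def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def first_token_entropy(logits):
    """Shannon entropy (in nats) of the first-token distribution."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def probability_margin(logits):
    """Gap between the most likely and second most likely token."""
    top = sorted(softmax(logits), reverse=True)
    return top[0] - top[1]


def welch_t(a, b):
    """Welch's t-statistic for two samples (assumed test, not confirmed by the post)."""
    return (mean(a) - mean(b)) / math.sqrt(
        variance(a) / len(a) + variance(b) / len(b)
    )


# Toy logits (made up): one peaked distribution, one flat distribution.
peaked = [5.0, 0.0, 0.0, 0.0]
flat = [1.0, 1.0, 1.0, 1.0]

print(first_token_entropy(peaked), probability_margin(peaked))
print(first_token_entropy(flat), probability_margin(flat))
```

In practice you would collect one entropy value and one margin value per question, split them by whether the answer was correct, and pass the two groups to `welch_t`. The sign of the result depends on which group goes first.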
The model is more uncertain when it’s right. More confident when it’s wrong.
This is the inverse of what calibration should look like.
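To make "what calibration should look like" concrete: in a well-calibrated model, accuracy rises with confidence. Below is a rough sketch of a reliability check that bins predictions by confidence and compares each bin's average confidence to its accuracy. Using the top-1 probability as the confidence score is an assumption (the post reports entropy and margin instead), and the four data points are invented purely to show the inverted pattern.

```python
def reliability_bins(confidences, correct, n_bins=5):
    """Group predictions into equal-width confidence bins.

    Returns a list of (average confidence, accuracy, count) for each non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    rows = []
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        rows.append((avg_conf, accuracy, len(items)))
    return rows


def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average gap between confidence and accuracy across bins."""
    total = len(confidences)
    return sum(
        n / total * abs(acc - conf)
        for conf, acc, n in reliability_bins(confidences, correct, n_bins)
    )


# Invented toy data showing the inverted pattern:
# high confidence on wrong answers, low confidence on right ones.
confidences = [0.9, 0.9, 0.3, 0.3]
correct = [False, False, True, True]

for avg_conf, accuracy, count in reliability_bins(confidences, correct):
    print(f"confidence={avg_conf:.2f} accuracy={accuracy:.2f} n={count}")
print("ECE:", expected_calibration_error(confidences, correct))
```

If accuracy falls as confidence rises across the bins, that is the inverted pattern described above.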
You can also check out our first Hybrid Intelligence model on Hugging Face: MerlinSafety/HybridIntelligence-0.5B