What a Model Doesn't Know Tells You How Big It Is
Knowledge Doesn't Compress: Reverse-Engineering the Physical Size of Frontier Models from "Obscure Facts"
Over the past two years, "Scaling Law is dead" has practically become an industry shibboleth. The reasoning sounds airtight enough — at the same MMLU/GPQA/SimpleQA score, a 7B model in 2026 now stands shoulder-to-shoulder with a 70B from 2023. Huang et al. (2025) even codified this into a widely-cited "Densing Law": capability density doubles every 3.5 months. If you only stare at benchmark numbers, the conclusion that "parameters no longer matter" practically writes itself.
But a fresh arXiv paper I read today (Bojie Li et al., 2026) does something rather clever: instead of arguing whether the Densing Law is right, it asks a more foundational question — is the "stuff" stored inside a model really all of one kind?
Because if it isn't, then the inflation we see in benchmark scores might not be measuring the genuinely scarce part at all.
A Conceptual Divide That Often Gets Glossed Over: Compressible vs. Incompressible
The authors decompose model parameters into three functional roles: $N = N_{\text{fact}} + N_{\text{proc}} + N_{\text{ling}}$.
The first is factual storage; the latter two are procedural capability (reasoning, parsing, instruction-following) and linguistic capability. These two categories aren't remotely the same when it comes to compressibility.
Procedural capability is like a mathematical formula. A person who knows calculus can solve infinitely many integration problems by memorizing a handful of rules; the same goes for models. Better Transformer architectures and smarter training recipes can give a 7B model the reasoning power of yesterday's 70B. This is what the Densing Law is actually describing — efficiency gains in $N_{\text{proc}}$ and $N_{\text{ling}}$.
Factual knowledge is more like a phone book. The fact that "USTC Hackergame was founded in 2014" cannot be derived from any other piece of knowledge, no matter how strong your reasoning is — you just have to memorize it. Per the Shannon Entropy bound, storing a fact consumes a number of bits proportional to its information content; and a Transformer's feed-forward layer can hold only roughly 2–3.6 bits per parameter (Allen-Zhu & Li, 2025; Morris et al., 2025).
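To make that floor concrete, here is a back-of-envelope sketch. The 2 to 3.6 bits-per-parameter range is the one cited above; the 100 bits per fact, and the idea that every parameter is devoted to facts, are placeholder assumptions of mine rather than figures from the paper.

```python
def max_facts(n_params: float, bits_per_param: float, bits_per_fact: float) -> float:
    """Upper bound on storable facts if every parameter held factual bits."""
    return n_params * bits_per_param / bits_per_fact

# Assumption for illustration only: an "average" fact costs 100 bits.
BITS_PER_FACT = 100.0

for n_params in (7e9, 70e9, 1e12):
    low = max_facts(n_params, 2.0, BITS_PER_FACT)    # pessimistic end of the cited range
    high = max_facts(n_params, 3.6, BITS_PER_FACT)   # optimistic end of the cited range
    print(f"{n_params:.0e} params: {low:.2e} to {high:.2e} facts")
```

Whatever the true cost per fact is, the bound is linear in parameter count: ten times the parameters buys at most ten times the facts, and no training trick changes that ratio.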
Put bluntly: factual storage runs into a physical floor you cannot negotiate around.
This echoes a thread I picked up earlier in "Model Size and Critical Frequency..." — every category of knowledge has a critical frequency $f_c \propto N^{-\alpha}$, and small models, lacking the requisite capacity, simply can't "reach" low-frequency long-tail facts. But that earlier piece offered theoretical priors; what was missing was a measuring stick you could actually apply to a black box. This IKP paper is exactly that stick.
The IKP Design: Reverse-Engineering Parameters from "How Much Obscure Stuff a Model Knows"
The authors built a test set called Incompressible Knowledge Probes (IKP), with 1,400 questions across 7 difficulty tiers (T1–T7). A few design details are worth lingering on:
The entry-level T1–T2 stuff — "What's the capital of Norway?" — is generated with help from GPT-5; T3–T7 long-tail probes are scraped from Wikidata, DBLP, and OpenAlex — for instance, the research focus of a CS professor with only a few dozen citations, or which country a particularly obscure Canadian mountain sits in. The probes are deliberately rare, so models can't sneak through by inferring from context or reasoning their way out.
There's a clever twist in the scoring mechanism: confabulating (hallucinating) costs you 1.0 point, while honestly refusing ("I don't know") scores 0. This penalizes faking it and forces the model toward the boundary of what it actually knows. That detail matters more than it looks; we'll come back to it.
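A minimal sketch of what that scoring rule does to a bluffing model. The penalty and the refusal score are the ones described above; the +1.0 reward for a correct answer and the averaging over probes are my assumptions, since this post doesn't spell them out.

```python
from enum import Enum

class Outcome(Enum):
    CORRECT = "correct"
    REFUSED = "refused"
    HALLUCINATED = "hallucinated"

# -1.0 and 0.0 follow the description above; +1.0 for a correct answer is assumed.
POINTS = {
    Outcome.CORRECT: 1.0,
    Outcome.REFUSED: 0.0,
    Outcome.HALLUCINATED: -1.0,
}

def penalized_score(outcomes):
    """Mean points per probe under the penalized rule."""
    if not outcomes:
        raise ValueError("need at least one graded probe")
    return sum(POINTS[o] for o in outcomes) / len(outcomes)

# Two hypothetical models that both truly know 40 of 100 probes.
bluffer = [Outcome.CORRECT] * 40 + [Outcome.HALLUCINATED] * 60
honest = [Outcome.CORRECT] * 40 + [Outcome.REFUSED] * 60
print(penalized_score(bluffer), penalized_score(honest))
```

Under this rule, guessing on unknown probes can only drag the score down, so the best strategy is to answer exactly what you know.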
What follows is the kind of work an econometrician would feel right at home with: calibrate against 89 open-source models with disclosed parameter counts (the Llama, Qwen, and DeepSeek families, ranging from 135M to 1.6T), and discover that IKP score against $\log_{10}(N)$ traces out a near-pristine line, with $R^2 = 0.917$ and a 14.7 percentage-point bump in accuracy for every 10x in parameters.
IKP Calibration Curve: Log-linear scaling on open models and reverse-projected positions of frontier closed models
source: Bojie Li et al., "Incompressible Knowledge Probes", arXiv:2604.24827
Leave-one-out cross-validation (LOO-CV) shows 68.5% of predictions land within 2x of true parameter count, and 87.6% within 3x. That precision is materially better than the prior practice of inferring sizes from API throughput and pricing — that older method comes with a well-known >2x uncertainty.
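The calibrate-then-invert loop can be sketched in a few lines. This is my own minimal version, not the authors' code: the calibration rows below are invented, and the helper names (`fit_line`, `estimate_params`, `loo_within`) are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def estimate_params(score, intercept, slope):
    """Invert score = intercept + slope * log10(N) to recover N."""
    return 10 ** ((score - intercept) / slope)

def fold_error(predicted, actual):
    """Symmetric ratio: 2.0 means off by 2x in either direction."""
    return max(predicted / actual, actual / predicted)

def loo_within(params, scores, factor):
    """Share of held-out models whose size is recovered within `factor`."""
    logs = [math.log10(p) for p in params]
    hits = 0
    for i, actual in enumerate(params):
        # Refit without model i, then predict its size from its score alone.
        intercept, slope = fit_line(logs[:i] + logs[i + 1:], scores[:i] + scores[i + 1:])
        if fold_error(estimate_params(scores[i], intercept, slope), actual) <= factor:
            hits += 1
    return hits / len(params)

# Invented calibration rows (parameter count, IKP score), for illustration only.
toy_params = [1e9, 7e9, 70e9, 400e9, 1.6e12]
toy_scores = [0.12, 0.26, 0.38, 0.52, 0.59]

intercept, slope = fit_line([math.log10(p) for p in toy_params], toy_scores)
print(f"slope per 10x: {slope:.3f}")
print(f"size at score 0.45: {estimate_params(0.45, intercept, slope):.2e}")
print(f"LOO within 2x: {loo_within(toy_params, toy_scores, 2.0):.0%}")
```

Because the inversion exponentiates, a small error in score becomes a multiplicative error in size, which is why the accuracy is naturally reported as "within 2x" or "within 3x" rather than as an absolute error.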
How Big Are the Closed Frontier Models, Really
Push the calibration line out to the right and the "effective knowledge capacity" of proprietary models reads off the chart. The headline numbers:
GPT-5.5 leads at 9.7T, 1.4x ahead of the runner-up Claude Opus 4.6 (5.3T). Below them sits a cluster (GPT-5, Claude Opus 4.7, o1, Grok-4, o3) between ~3.0T and ~4.1T. GPT-4o turns out to be smaller than its reputation suggests, estimated at ~720B. On the efficiency end: GPT-5 Mini at ~410B, Gemini 2.5 Flash at ~207B, Claude Haiku 4.5 at just 65B. The ratio between the largest and smallest in the proprietary fleet runs about 150x, virtually identical to the open-source side's range.
These figures dovetail with what I'd worked out via HBM memory constraints in "Small by Design or by Default...", which had the Western frontier landing in the 2T–6.4T range. Two independent paths — one from a hardware capacity ceiling, one from a knowledge capacity floor — converge on the same order of magnitude. That kind of cross-validation is far sturdier than relying on any single source.
One additional observation worth pausing on: "Pro" versions barely add any factual capacity. GPT-5 Pro scores only 1.05x higher than GPT-5 on IKP; GPT-5.5 Pro a mere 1.13x over GPT-5.5. That tracks with what the vendors themselves say: the Pro premium is about longer test-time reasoning and finer alignment, not about adding a wing to the library.
But minor version bumps (the ".x" upgrades) are an entirely different animal. The GPT-5 line jumps from roughly 55% at GPT-5 to 71.4% at GPT-5.5, a magnitude that requires actual scale-up. The implication is that OpenAI's "decimal-point upgrades" are full-scale retrains underneath. The knowledge fingerprint section below adds further evidence for this hypothesis.
The MoE Verdict: Total Parameters Decide Memory Capacity, Not Active Parameters
A long-running industry debate: when sizing up Mixture-of-Experts models, should we count total parameters or just the subset that activates per token? DeepSeek V3, with 671B total / 37B active, often gets quoted at the active number externally.
The authors run two separate regressions across 37 MoE models:
MoE knowledge capacity: total parameters vs. active parameters
source: Bojie Li et al., "Incompressible Knowledge Probes", arXiv:2604.24827
Total parameters: $R^2 = 0.79$; active parameters: $R^2 = 0.51$. Total parameters win, decisively. Translated: facts are scattered across all expert weights, so regardless of which experts a given query happens to activate, what you can recall depends on how many books are in the entire library, not on how many librarians happen to be on shift.
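The comparison itself is easy to reproduce on your own data: regress score on one size measure at a time and compare the fits. A minimal sketch, with invented rows (built so that score tracks total size) and a hypothetical helper `r_squared`; nothing here is the paper's data.

```python
import math

def r_squared(xs, ys):
    """R^2 of a one-predictor least-squares fit (the squared Pearson correlation)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)

# Invented (total_params, active_params, ikp_score) rows, for illustration only.
toy_moe = [
    (40e9, 4e9, 0.21),
    (120e9, 30e9, 0.29),
    (250e9, 10e9, 0.33),
    (600e9, 40e9, 0.39),
    (1500e9, 20e9, 0.45),
]

log_total = [math.log10(total) for total, _, _ in toy_moe]
log_active = [math.log10(active) for _, active, _ in toy_moe]
scores = [score for _, _, score in toy_moe]

print(f"R^2 vs log10(total):  {r_squared(log_total, scores):.2f}")
print(f"R^2 vs log10(active): {r_squared(log_active, scores):.2f}")
```

With a single predictor, comparing $R^2$ values is a fair fight because both regressions have the same number of free parameters.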
This also fills in the missing piece I noted earlier in "The Ceiling Held — They Just Walked Up to It" when discussing DeepSeek V4's 1.6T total / 49B active configuration. When evaluating an MoE model's grip on world knowledge, the right number to anchor on is 1.6T, not 49B.
Falsifying the Densing Law: A Slogan That's Been Badly Misread
As a description of recent benchmark dynamics, the Densing Law is fine. The trouble is that it's been widely misread as "parameter scale no longer matters." The authors take aim directly at that misreading.
The setup is clean: take 96 open-source models with known release dates and fit $\text{IKP} = \beta_0 + \beta_1\log_{10}(N) + \beta_2 \cdot \text{months}$. If the Densing Law applied to factual capacity, the time coefficient $\hat\beta_2$ should land at roughly +0.0117/month (the slope implied by the law).
Actual fit: $\hat\beta_2$ = -0.0010/month, with a 95% bootstrap confidence interval of [-0.0031, +0.0008].
IKP falsifies the Densing Law: Once parameters are controlled for, the time trend vanishes
source: Bojie Li et al., "Incompressible Knowledge Probes", arXiv:2604.24827
The point estimate is statistically indistinguishable from zero ($p = 0.34$), and the Densing Law's predicted slope is rejected at $p < 10^{-15}$. Adding release date to the scaling regression raises $R^2$ by only 0.0024, essentially zero additional explanatory power.
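A rough sanity check on the size of the predicted slope. My assumption here, not something stated in the paper as summarized above, is that a density doubling acts like a doubling of effective parameters, so the implied gain per month is the per-decade slope times $\log_{10}(2)$ divided by the doubling time.

```python
import math

DOUBLING_MONTHS = 3.5      # Densing Law doubling time quoted earlier
SLOPE_PER_DECADE = 0.147   # IKP gain per 10x parameters from the 89-model fit
CI_LOW, CI_HIGH = -0.0031, 0.0008  # reported 95% bootstrap interval

def implied_time_slope(slope_per_decade: float, doubling_months: float) -> float:
    """Score gain per month if density doubling behaved like extra parameters."""
    return slope_per_decade * math.log10(2) / doubling_months

predicted = implied_time_slope(SLOPE_PER_DECADE, DOUBLING_MONTHS)
print(f"implied slope: {predicted:+.4f} per month")
print(f"inside reported interval: {CI_LOW <= predicted <= CI_HIGH}")
```

This lands in the same neighborhood as the +0.0117/month quoted above; I'd guess the gap comes from the authors using the slope fitted on their 96-model sample rather than 0.147, though that is my inference. Either way, the prediction sits more than ten times beyond the upper end of the interval.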
The authors run an even more pointed comparison: use MMLU, MMLU-Pro, GPQA Diamond, and SimpleQA as parameter proxies. The result: reasoning-heavy benchmarks drift the fastest (GPQA Diamond moves +2pp/month, meaning a 33B model could gain 24 points in a single year without scaling up at all), while the purely factual SimpleQA shows a time slope of +0.03pp/month — indistinguishable from zero.
Stack those two findings together and the conclusion turns clinical: the apparent ease of "benchmark inflation" comes from the fact that what those benchmarks measure is precisely the compressible part. When you measure the incompressible part, parameter scale — that old workhorse — remains stubbornly, indifferently effective.
A small model trained in 2026 with the latest tricks may have elegant logic, but the volume of obscure encyclopedia entries it can fit into its head still loses to the rough bruiser from 2024.
The Shape of the Tier Ladder, and the T7 Cliff
Break out the per-tier scores and you get a remarkably tidy "knowledge staircase":
Per-tier accuracy heatmap for the top 25 models
source: Bojie Li et al., "Incompressible Knowledge Probes", arXiv:2604.24827
T1 and T2 saturate quickly (essentially 100% once you cross ~70B and ~120B respectively); T3 has the steepest slope (+32.4pp per 10x of parameters) and is the single most discriminating tier; T4 and T5 continue providing useful separation. T6 only opens up for models above ~2T (GPT-5.5 hits 38.5%, GPT-5.5 Pro 44.5%), marking the point where the proprietary frontier separates from the open-source curve.
By T7, 188 models collectively faceplant. Only Jamba-large (2.8%) and Grok-4 (1.0%) clear the 1% threshold; GPT-5 Pro, Claude Opus 4.7, Gemini 3.1 Pro, and DeepSeek V4 Pro all sit at exactly 0.0%.
The authors call this the T7 cliff. It isn't a smooth asymptote — it's a wall that every model hits at the same long-tail depth, regardless of size or training budget. Shannon entropy gives a structural explanation: the total information content of humanity's long-tail knowledge has already exceeded what even 10T-class models can absorb, and that long tail grows faster than any foreseeable parameter expansion.
Put differently: what we call a "frontier model" today remains a compressed snapshot of its era's expert discourse, very far from being a vessel for the entirety of human knowledge.
Knowledge Fingerprints: Tracing Model Lineage Through "Hallucination Similarity"
I'll admit it — this is the most fun byproduct of the whole paper.
The authors notice that two models which share a base or have undergone heavy distillation produce the same wrong answers on long-tail facts at rates significantly above what independent models would. They call this the Hallucination Similarity Score (HSS). The rough thresholds: HSS ≥ 0.30 indicates a shared base; 0.10–0.30 suggests lineage (post-training or distillation from a common ancestor); HSS < 0.10 implies independent retraining.
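Here is one way such a score could be computed. The thresholds are the ones quoted above; the definition of the score itself (the share of jointly missed probes on which both models gave the same wrong answer, after light normalization) is my assumption, and the paper's exact formula may differ.

```python
def hallucination_similarity(wrong_a: dict, wrong_b: dict) -> float:
    """Share of probes both models got wrong where the wrong answers match.

    Each argument maps probe id -> the incorrect answer that model gave.
    Assumed definition for illustration; not taken from the paper.
    """
    shared = wrong_a.keys() & wrong_b.keys()
    if not shared:
        return 0.0
    same = sum(
        1 for probe in shared
        if wrong_a[probe].strip().lower() == wrong_b[probe].strip().lower()
    )
    return same / len(shared)

def classify_hss(hss: float) -> str:
    """Apply the rough thresholds quoted in the text."""
    if hss >= 0.30:
        return "shared base"
    if hss >= 0.10:
        return "lineage"
    return "independent"

# Hypothetical wrong answers from two models.
model_a = {"q1": "1998", "q2": "Canada", "q3": "Oslo"}
model_b = {"q1": "1998 ", "q2": "Peru", "q4": "Lima"}
score = hallucination_similarity(model_a, model_b)
print(score, classify_hss(score))
```

The intuition is that correct answers carry no lineage signal (everyone converges on the truth), while wrong answers are effectively free parameters: two models landing on the same wrong answer by chance is unlikely.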
Cross-generational HSS by family: revealing which "minor version bumps" were actually full retrains
source: Bojie Li et al., "Incompressible Knowledge Probes", arXiv:2604.24827
A few highlights worth flagging:
OpenAI's GPT-5 ".x" transitions (5→5.1, 5.1→5.2, 5.3→5.4) all sit at HSS ≤ 0.08, the full-retrain regime. The "decimal-point upgrades" are, underneath the version label, complete data and training overhauls. Anthropic's Claude Opus shows similar behavior (4.6→4.7, HSS = 0.00). DeepSeek V3 → V3.1 (HSS 0.23) and V3.1 → V3.2 (0.28), on the other hand, look like incremental continued pre-training, a meaningfully different recipe.
The cross-family outliers are even more interesting. Baidu ERNIE 4.5 simultaneously shows HSS of 0.33–0.44 with GPT-4o, Llama-3, Mistral-Large, and Qwen-Max — exactly the pattern you'd expect from heavy training on mixed teacher distillation outputs. Llama 3.1 70B keeps appearing as the implicit "teacher" in many pairings, which probably reflects its position as the most widely used open base for synthetic data generation in 2024, rather than direct distillation.
Distinguishing model lineage purely from black-box API responses, without ever touching the weights — that's a serious piece of practical infrastructure for open-weight license enforcement and provenance auditing.
An Old Question with a Sharper Answer: What Does a Model Actually "Know"?
The authors take a side trip to address another perennial question: why does a model remember Professor A but draw a blank on Professor B?
Bibliometric signals (citation count, H-index) explain only about 35% of the variance. The real driver is effective mention frequency — the density with which a given entity appears in retrievable form across the training corpus. For researcher probes specifically:
Researchers with fewer than 50 citations rarely break a 15% identification rate; even researchers above 10K citations can still sit below 25%, meaning a high citation count is necessary but not sufficient. The "named-artifact" multiplier, on the other hand, is enormous: researchers tied to canonical tools like FlashAttention or ColBERT clear 86% identification regardless of citation count. Subfield matters too: researchers in Information Retrieval and Programming Languages are recognized at 1.5–2x the rate of computer architecture researchers with comparable citations, because their derivative content (tutorials, blogs, library docs, podcasts, tweets) is far denser.
There's a philosophical aftertaste to this. Frontier models aren't "arbiters of objective truth" — they're compressed snapshots of an era's expert discourse environment. Research that gets reformulated again and again by bloggers, tweet threaders, and tutorial authors enters a model's memory pool more reliably than research with ten thousand citations confined to journals. In that sense, measurement of impact is never a monotonic function of citations — it's closer to citations × name distinctiveness × named-artifact amplification × subfield ecosystem density.
A Few Minor But Worth-Remembering Byproducts
Each vendor's hallucination rate on T5–T7 (questions models don't know) — the share of wrong answers as a fraction of (wrong + refusal) — turns out to be a remarkably stable signature:
Hallucination rate by vendor on T5–T7: each vendor's "honesty fingerprint"
source: Bojie Li et al., "Incompressible Knowledge Probes", arXiv:2604.24827
Anthropic, Meta, and xAI sit at the low end (16–23%) — conservative, willing to refuse. Google's smaller Gemma models go as high as 89%–97%. Even within OpenAI there's a generational shift — GPT-4.1 family at 53%–72%, GPT-5 Nano/Mini down to 3%–4%, reflecting a deliberate change in alignment posture.
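The metric is simple enough to compute from any graded run. A minimal sketch using the definition given above (wrong answers over wrong plus refusals); the counts in the example are hypothetical.

```python
def hallucination_rate(wrong: int, refused: int) -> float:
    """Wrong answers as a share of all probes the model did not get right."""
    missed = wrong + refused
    if missed == 0:
        raise ValueError("no missed probes, so the rate is undefined")
    return wrong / missed

# Hypothetical counts on probes two models did not answer correctly.
print(f"cautious model: {hallucination_rate(wrong=20, refused=80):.0%}")
print(f"bluffing model: {hallucination_rate(wrong=90, refused=10):.0%}")
```

Correct answers are deliberately left out of the denominator, so the number describes behavior at the edge of knowledge rather than how much the model knows.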
Another finding that's less flashy but more practically loaded: safety tuning systematically depresses measured capacity. Claude Sonnet 4 refuses 175 out of 200 T5 probes (88%), while its predecessor Claude 3.7 Sonnet refused only 54%. The model isn't ignorant; it's simply choosing to keep quiet. Which means that for heavily RLHF-aligned closed models, IKP's parameter estimate should be read as a lower bound, not the median.
Stepping Back, Three Things Belong in the Same Frame
Zoom out and this paper turns out to be the third piece of a puzzle I've been tracking for the past six months. The first piece was the critical frequency framework from October 2025 — a theoretical case that knowledge acquisition is a phase transition rather than a smooth curve, and that small models are mathematically incapable of reaching low-frequency long-tail facts. The second was the HBM memory constraint formula from March 2026 — a hardware-side derivation that pegs the Western frontier at 2T–6.4T while pinning Chinese fleets between 230B and 1T due to memory ceilings.
IKP is the third: an empirical instrument that reads "effective knowledge capacity" directly off a black-box API. Starting from the theoretical prior that facts are incompressible, it builds a calibration benchmark with $R^2 = 0.917$ across 89 open models, reverse-engineers the physical size of closed frontier models, and, at $p < 10^{-15}$, nails the popular misreading that "parameter scale no longer matters" to the wall.
Theory → hardware constraint → black-box measurement. Three independent chains of evidence all converge on the same range: the largest frontier models today live in the 5T–10T zone, and they sit at least an order of magnitude away from the physical endpoint of digesting humanity's long-tail knowledge.
That T7 cliff — the one where every model trips at exactly the same depth — is precisely why the scaling story is far from over. The benchmarks have just stopped describing the part that's actually still moving.
Not getting fooled is the biggest alpha you can earn. The next time a "small model beats big model" headline rolls past, at least we now have a ruler we can pick up and measure with ourselves.