Cross-architectural runtime probability dynamics in transformer LLMs — two clusters not explained by parameter count
This is the rare part - you ran a test that partially killed your own headline result, and you’re putting it in rejected-claims and revising the preprint over it. That’s the whole game, and most people quietly don’t. Respect.
A few thoughts now that temperature is ruled in as a primary driver and the live question is the residual:
Name what a residual component would actually be measuring. Once mean entropy is matched across the panel, two models at the same entropy can still arrange that probability differently - a sharp top-1 with a thin tail vs a softer top-2 with a fatter tail can carry identical entropy but different shape. So a calibration-independent GD component, if it survives, isn’t “dynamics” - it’s distribution shape at fixed uncertainty (tail mass / top-1–top-2 margin / participation ratio at matched entropy). Framing the surviving claim that way keeps it from drifting back toward the “dynamics” reading the audit just rejected.
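To make "shape at fixed uncertainty" concrete, here is a minimal sketch, not your pipeline. The two logit vectors are invented toy inputs, and `temperature_for_entropy` and `shape_stats` are hypothetical helper names. It temperature-scales each vector to the same entropy (in nats) and then compares the top-1–top-2 margin and the participation ratio, taken here as 1 / sum(p_i^2).

```python
import math

def softmax(logits, temperature=1.0):
    # Numerically stable softmax of logits / temperature.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    # Shannon entropy in nats; zero-probability entries contribute nothing.
    return -sum(q * math.log(q) for q in p if q > 0.0)

def temperature_for_entropy(logits, target, lo=1e-3, hi=1e3, iters=200):
    # Entropy of softmax(logits / T) is non-decreasing in T, so bisect on T
    # (geometric midpoint, because T spans several orders of magnitude).
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if entropy(softmax(logits, mid)) < target:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

def shape_stats(p):
    ranked = sorted(p, reverse=True)
    return {
        "entropy": entropy(p),
        "top1_top2_margin": ranked[0] - ranked[1],
        "participation_ratio": 1.0 / sum(q * q for q in p),
    }

# Toy inputs (assumptions for illustration only): one dominant token vs
# two tied leading tokens, each over a 50-token vocabulary.
TARGET_ENTROPY = 1.0
one_peak = [6.0] + [0.0] * 49
two_peak = [3.0, 3.0] + [-1.0] * 48

stats_one = shape_stats(softmax(one_peak, temperature_for_entropy(one_peak, TARGET_ENTROPY)))
stats_two = shape_stats(softmax(two_peak, temperature_for_entropy(two_peak, TARGET_ENTROPY)))

print(stats_one)
print(stats_two)
```

Both distributions land on the same entropy, yet the margin and the participation ratio differ, which is the kind of quantity a surviving residual would have to be about.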
Scalar-T only matches the mean - go per-token / per-bin. One temperature per model equalizes average entropy, but token-level entropy varies a lot within a model. If you stratify by local-entropy bin (or temperature-match per token) and the cluster structure still appears within bins, that’s a far harder-to-dismiss residual than a mean-matched one. If it vanishes under per-bin matching, the residual was just calibration at finer grain.
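The per-bin version could look roughly like the sketch below. It assumes you already have, for each token, a cluster label, the local entropy, and some shape statistic. The record layout, the labels "A" and "B", the bin edges, and the function name `per_bin_gap` are all placeholders of mine, not anything from your code.

```python
from collections import defaultdict
from statistics import mean

def per_bin_gap(records, bin_edges, group_a="A", group_b="B"):
    """Mean shape statistic of group_a minus group_b inside each entropy bin.

    records: iterable of (group_label, token_entropy, shape_value).
    bin_edges: ascending edges; bin i covers [bin_edges[i], bin_edges[i + 1]).
    Bins that lack tokens from either group are skipped, not filled in.
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for group, token_entropy, value in records:
        for i in range(len(bin_edges) - 1):
            if bin_edges[i] <= token_entropy < bin_edges[i + 1]:
                buckets[i][group].append(value)
                break
    gaps = {}
    for i in sorted(buckets):
        by_group = buckets[i]
        if by_group[group_a] and by_group[group_b]:
            key = (bin_edges[i], bin_edges[i + 1])
            gaps[key] = mean(by_group[group_a]) - mean(by_group[group_b])
    return gaps

# Invented toy records: (cluster label, token entropy, participation ratio).
toy_records = [
    ("A", 0.2, 1.5), ("A", 0.3, 1.7), ("B", 0.25, 1.1),
    ("A", 1.2, 3.0), ("B", 1.4, 3.0),
    ("B", 2.5, 4.0),  # no "A" token in this bin, so the bin is dropped
]
gaps = per_bin_gap(toy_records, bin_edges=[0.0, 1.0, 2.0, 3.0])
for bin_range, gap in gaps.items():
    print(bin_range, round(gap, 3))
```

A gap that keeps the same sign across bins is the hard-to-dismiss case. A gap that shrinks toward zero in every bin points back to calibration.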
The migration pattern is data, not noise. You found GPT-2 and Phi move - report the post-calibration membership explicitly (who clusters with whom now). If the residual clusters re-form along a different axis - tokenizer family, pre/post-norm, attention-head config - that is the answer to “what does the residual measure.” The direction of migration is more informative than the fact of it.
Bootstrap is the gate, not a nicety, at n=8. With two of eight already migrating, “two clusters still appear post-calibration” could be a small-sample artifact. A bootstrap-over-prompts CI on the residual gap is what decides whether there’s a residual structure to explain at all - I’d run that as the elevation gate before the vocab/perturbation controls, because it’s the cheapest and the answer is binary.
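A minimal sketch of that gate, under assumptions of mine: every model is scored on the same prompts, the statistic is one number per (model, prompt), and a plain percentile bootstrap is good enough. The function name `bootstrap_gap_ci` and the toy numbers are hypothetical. It resamples prompts only and holds cluster membership fixed, so it says nothing about uncertainty from having just eight models.

```python
import random

def _avg(values):
    values = list(values)
    return sum(values) / len(values)

def bootstrap_gap_ci(cluster_a, cluster_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for mean(cluster_a) - mean(cluster_b), resampling prompts.

    cluster_a, cluster_b: lists of per-model rows, where row[p] is the
    statistic for that model on prompt p (same prompt order in every row).
    """
    rng = random.Random(seed)
    n_prompts = len(cluster_a[0])
    gaps = []
    for _ in range(n_boot):
        # One resample of prompt indices, shared by all models.
        idx = [rng.randrange(n_prompts) for _ in range(n_prompts)]
        mean_a = _avg(_avg(row[i] for i in idx) for row in cluster_a)
        mean_b = _avg(_avg(row[i] for i in idx) for row in cluster_b)
        gaps.append(mean_a - mean_b)
    gaps.sort()
    lower = gaps[int((alpha / 2) * n_boot)]
    upper = gaps[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Invented toy data: two models per cluster, 20 shared prompts.
toy_a = [[2.0 + 0.01 * p for p in range(20)], [2.1 + 0.01 * p for p in range(20)]]
toy_b = [[1.0 + 0.01 * p for p in range(20)], [1.1 + 0.01 * p for p in range(20)]]
ci_separated = bootstrap_gap_ci(toy_a, toy_b)
ci_identical = bootstrap_gap_ci(toy_a, toy_a)
print(ci_separated, ci_identical)
```

If the interval on the post-calibration gap excludes zero, there is something to explain. If it straddles zero, the vocab and perturbation controls have nothing to stand on.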
One caution on the attention companion work. Moving to attention as “a different observation level unaffected by this confound” is reasonable, but attention carries its own confounds - attention-sink / massive-activation tokens (the BOS-sink effect), head redundancy - and those are themselves training/architecture dependent. So an “architecture signature” in attention could be an attention-sink signature. The same discipline that just paid off on the logit side (find the boring mechanism first) is worth running there before it becomes the new headline.
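One cheap way to check for the boring mechanism there, sketched under assumptions: `attn` is a single head's attention matrix as nested lists, each row sums to 1, and the sink sits at position 0. `sink_mass` and `remove_sink` are hypothetical helpers, not calls from any library. The idea is to measure how much attention goes to the sink position and then recompute any "signature" on the renormalized remainder.

```python
def sink_mass(attn, sink_index=0):
    """Average attention weight that queries place on the sink position.

    attn[q][k] is the weight from query q to key k. The first query is
    skipped because under a causal mask it can only attend to itself.
    """
    rows = attn[1:]
    return sum(row[sink_index] for row in rows) / len(rows)

def remove_sink(attn, sink_index=0):
    """Drop the sink column and renormalize each row over the remaining keys.

    Rows with no mass outside the sink are returned as all zeros.
    """
    cleaned = []
    for row in attn:
        rest = [w for k, w in enumerate(row) if k != sink_index]
        total = sum(rest)
        cleaned.append([w / total for w in rest] if total > 0 else rest)
    return cleaned

# Invented toy causal attention for one head over three tokens.
toy_attn = [
    [1.0, 0.0, 0.0],
    [0.8, 0.2, 0.0],
    [0.7, 0.1, 0.2],
]
mass = sink_mass(toy_attn)
cleaned = remove_sink(toy_attn)
print(mass)
print(cleaned)
```

If the cross-model structure disappears once the sink column is removed, the signature was mostly the sink.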
Either way: falsify-first, publish the negative, revise - that’s exactly why this framework will end up trustworthy where flashier ones don’t. Good work.