{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreifnj3tg53fwzvbg5r4cmvechymzl56d7o7b6agox3bvrbm5qm2asq",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mnwhmi2trx32"
},
"path": "/t/cross-architectural-runtime-probability-dynamics-in-transformer-llms-two-clusters-not-explained-by-parameter-count/176630#post_5",
"publishedAt": "2026-06-10T08:04:30.000Z",
"site": "https://discuss.huggingface.co",
"textContent": "This is the rare part - you ran a test that partially killed your own headline result, and you’re putting it in rejected-claims and revising the preprint over it. That’s the whole game, and most people quietly don’t. Respect.\n\nA few thoughts now that temperature is ruled in as a primary driver and the live question is the _residual_ :\n\n**Name what a residual component would actually be measuring.** Once mean entropy is matched across the panel, two models at the same entropy can still arrange that probability differently - a sharp top-1 with a thin tail vs a softer top-2 with a fatter tail carry identical entropy but different _shape_. So a calibration-independent GD component, if it survives, isn’t “dynamics” - it’s **distribution shape at fixed uncertainty** (tail mass / top-1–top-2 margin / participation ratio at matched entropy). Framing the surviving claim that way keeps it from drifting back toward the “dynamics” reading the audit just rejected.\n\n**Scalar-T only matches the mean - go per-token / per-bin.** One temperature per model equalizes _average_ entropy, but token-level entropy varies a lot within a model. If you stratify by local-entropy bin (or temperature-match per token) and the cluster structure _still_ appears within bins, that’s a far harder-to-dismiss residual than a mean-matched one. If it vanishes under per-bin matching, the residual was just calibration at finer grain.\n\n**The migration pattern is data, not noise.** You found GPT-2 and Phi move - report the post-calibration _membership_ explicitly (who clusters with whom now). If the residual clusters re-form along a different axis - tokenizer family, pre/post-norm, attention-head config - that _is_ the answer to “what does the residual measure.” The direction of migration is more informative than the fact of it.\n\n**Bootstrap is the gate, not a nicety, at n=8.** With two of eight already migrating, “two clusters still appear post-calibration” could be a small-sample artifact. A bootstrap-over-prompts CI on the _residual_ gap is what decides whether there’s a residual structure to explain at all - I’d run that as the elevation gate _before_ the vocab/perturbation controls, because it’s the cheapest and the answer is binary.\n\n**One caution on the attention companion work.** Moving to attention as “a different observation level unaffected by this confound” is reasonable, but attention carries its own confounds - attention-sink / massive-activation tokens (the BOS-sink effect), head redundancy - and those are themselves training/architecture dependent. So an “architecture signature” in attention could be an attention-sink signature. The same discipline that just paid off on the logit side (find the boring mechanism first) is worth running there before it becomes the new headline.\n\nEither way: falsify-first, publish the negative, revise - that’s exactly why this framework will end up trustworthy where flashier ones don’t. Good work.",
"title": "Cross-architectural runtime probability dynamics in transformer LLMs — two clusters not explained by parameter count"
}