{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreidi6sxw4kbogcymynevdegwdyvymkoejkbnoo5nbh4tfkebkxvgeq",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mntwduxexbk2"
},
"path": "/t/cross-architectural-runtime-probability-dynamics-in-transformer-llms-two-clusters-not-explained-by-parameter-count/176630#post_2",
"publishedAt": "2026-06-09T09:11:59.000Z",
"site": "https://discuss.huggingface.co",
"textContent": "Really like this one - and what makes it worth engaging with seriously is the discipline, not just the result. An explicit rejected-claims list and “hypothesis, not finding” on the corpus story is the posture that turns a cluster plot into something credible. So in the same spirit, here are the controls I’d want to see before believing GD_ratio is measuring _dynamics_ rather than something more boring - plus one concrete test you can run from the architecture side.\n\n(Working from your writeup + the abstract, so apologies if some of this is already handled in the preprint.)\n\n**1. Rule out logit temperature first - this is the cheap one I’d run before anything else.** Your operator reads the geometry of the softmax distribution: entropy, top-k concentration, top-1/top-2 competition. All of those are dominated by the _scale_ of the logits entering the softmax, and that scale is set by things unrelated to “dynamics”: final-LN gain, tied vs untied embeddings, weight decay, label smoothing. A model trained to be more confident has sharper distributions → high G, low D → high GD_ratio, with no trajectory behavior involved. That’s actually a clean _mechanism_ for your training-corpus hypothesis (curated-data models tend to run lower-entropy), but it means GD_ratio could just be a per-model temperature reading. Test: fit one scalar temperature T per model on a held-out set to match a common mean entropy, recompute G and D on the calibrated logits, and see if the two clusters survive. If they collapse, you’ve measured calibration. If they hold, you’ve got something real _on top of_ calibration. Either way it’s a stronger paper.\n\n**2. Make the operator vocab-invariant before cross-model comparison.** Entropy and “dispersion above a 1% threshold” both scale with vocabulary size, and your panel runs 32k (TinyLlama) to ~152k (Qwen). 
It’s not a clean confound - OPT and Pythia sit at ~50k and still land in the lower cluster, so size alone isn’t the story - but raw entropy and a raw threshold count still aren’t comparable across those vocabularies. Normalizing entropy by log(V), or computing the geometry over the top-p (say 0.99) mass instead of the full tail, removes tokenizer granularity as a hidden variable so the split can’t be partly a vocab artifact.\n\n**3. The perturbation experiment needs magnitude-matching and a null.** This is the part closest to work we do (inject into the hidden state, measure the downstream distribution), so two things we’ve been burned by: (a) “same input noise” across models is a _different relative perturbation_ if their hidden-state norms differ - and activation RMS varies a lot across models, and across layers within a model. Before concluding “architecture-specific response,” scale the injected noise to each model’s per-layer activation RMS; otherwise GPT-2 “absorbing” the noise might just mean GPT-2 carries larger activations. (b) Add a zero / identity-perturbation null that must produce _no_ change - it proves the measurement path itself isn’t manufacturing the effect. With those two controls, “three reproducible perturbation signatures” becomes very hard to argue with.\n\n**4. Put error bars on the gap.** The “order of magnitude, no overlap” split is the headline and it currently rests on one GD_ratio point per model. Bootstrapping GD_ratio over prompts/sequences (resample, recompute, CI per model) lets you state non-overlap quantitatively instead of visually - with n=8 and a single author, that’s the difference between “suggestive” and “hard to dismiss.”\n\n**5. A concrete test from the large-model side.** You asked whether this holds past 1.3B. One thing we see clearly on a 12B model (Gemma) is that a few architectural choices _impose_ output geometry independent of training. 
Gemma applies **final-logit softcapping** - a tanh cap on the logits - which structurally bounds exactly the concentration/competition geometry your G and D measure. Prediction: a softcapped model will cluster by the cap, not by its corpus. So dropping Gemma (or any softcapped model) into the panel is a clean way to separate “architecture-imposed logit geometry” from “training dynamics.” More generally, past 7B the confound surface grows - softcapping, QK-norm, attention-sink / massive-activation effects - all of which move your operator without touching anything you’d call dynamics. If the two clusters survive those, the structure is real; if it reorganizes around them, _that’s_ the finding.\n\nNone of this is a knock - it’s an interesting lens and the honest-limitations framing is exactly why it’s worth pushing on. If I had to order it: temperature-normalization first, because it’s cheap and it either kills the result or makes it much stronger.",
"title": "Cross-architectural runtime probability dynamics in transformer LLMs — two clusters not explained by parameter count"
}