Cross-architectural runtime probability dynamics in transformer LLMs — two clusters not explained by parameter count
Really like this one - and what makes it worth engaging with seriously is the discipline, not just the result. An explicit rejected-claims list and “hypothesis, not finding” on the corpus story is the posture that turns a cluster plot into something credible. So in the same spirit, here are the controls I’d want to see before believing GD_ratio is measuring dynamics rather than something more boring - plus one concrete test you can run from the architecture side.
(Working from your writeup + the abstract, so apologies if some of this is already handled in the preprint.)
1. Rule out logit temperature first - this is the cheap one I’d run before anything else. Your operator reads the geometry of the softmax distribution: entropy, top-k concentration, top-1/top-2 competition. All of those are dominated by the scale of the logits entering the softmax, and that scale is set by things unrelated to “dynamics”: final-LN gain, tied vs untied embeddings, weight decay, label smoothing. A model trained to be more confident has sharper distributions → high G, low D → high GD_ratio, with no trajectory behavior involved. That’s actually a clean mechanism for your training-corpus hypothesis (curated-data models tend to run lower-entropy), but it means GD_ratio could just be a per-model temperature reading. Test: fit one scalar temperature T per model on a held-out set to match a common mean entropy, recompute G and D on the calibrated logits, and see if the two clusters survive. If they collapse, you’ve measured calibration. If they hold, you’ve got something real on top of calibration. Either way it’s a stronger paper.
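A minimal sketch of that temperature fit, under assumptions: you have a `(tokens, vocab)` array of held-out logits per model, and a shared `target_entropy` in nats that you choose yourself. The helper names (`mean_entropy`, `fit_temperature`) are hypothetical, not anything from the preprint. It uses bisection, relying on the fact that the entropy of softmax(logits / T) rises monotonically with T.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mean_entropy(logits, T=1.0):
    # Mean Shannon entropy (nats) of softmax(logits / T) over all positions.
    p = softmax(logits / T)
    return float(-(p * np.log(np.clip(p, 1e-12, None))).sum(axis=-1).mean())

def fit_temperature(logits, target_entropy, lo=1e-2, hi=1e2, iters=60):
    # Bisection on log T: entropy increases with T, so move the bracket
    # toward the side where the target lies.
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if mean_entropy(logits, mid) < target_entropy:
            lo = mid
        else:
            hi = mid
    return float(np.sqrt(lo * hi))
```

After fitting, G and D would be recomputed on `logits / T` for each model. If the target entropy falls outside what the `[lo, hi]` bracket can reach, the fit just pins to one end, so it is worth checking the achieved entropy afterwards.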
2. Make the operator vocab-invariant before cross-model comparison. Entropy and “dispersion above a 1% threshold” both scale with vocabulary size, and your panel runs from 32k (TinyLlama) to ~152k (Qwen). It’s not a clean confound - OPT and Pythia sit at ~50k and still land in the lower cluster, so size alone isn’t the story - but raw entropy and a raw threshold count still aren’t comparable across those vocabularies. Normalizing entropy by log(V), or computing the geometry over the top-p (say 0.99) mass instead of the full tail, removes tokenizer granularity as a hidden variable so the split can’t be partly a vocab artifact.
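Both normalizations are small. A sketch, assuming `p` is one next-token probability vector; the function names are hypothetical, and I don't know exactly how your operator defines dispersion, so treat this as the shape of the fix rather than a drop-in.

```python
import numpy as np

def normalized_entropy(p):
    # Entropy divided by log(V): 1.0 for a uniform distribution at any V.
    p = np.asarray(p, dtype=float)
    h = -(p * np.log(np.clip(p, 1e-12, None))).sum()
    return float(h / np.log(p.size))

def top_p_support(p, mass=0.99):
    # Smallest set of highest-probability tokens whose cumulative mass
    # reaches `mass`, renormalized to sum to 1.
    p = np.sort(np.asarray(p, dtype=float))[::-1]
    cum = np.cumsum(p)
    k = min(int(np.searchsorted(cum, mass)) + 1, p.size)
    kept = p[:k]
    return kept / kept.sum()
```

Computing concentration and competition on `top_p_support(p)` keeps the long tail, whose length depends on the tokenizer, out of the measurement.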
3. The perturbation experiment needs magnitude-matching and a null. This is the part closest to work we do (inject into hidden state, measure the downstream distribution), so two things we’ve been burned by: (a) “same input noise” across models is a different relative perturbation if their hidden-state norms differ - and activation RMS varies a lot across models, and across layers within a model. Before concluding “architecture-specific response,” scale the injected noise to each model’s per-layer activation RMS; otherwise GPT-2 “absorbing” the noise might just mean GPT-2 carries larger activations. (b) Add a zero / identity-perturbation null that must produce no change - it proves the measurement path itself isn’t manufacturing the effect. With those two controls, “three reproducible perturbation signatures” becomes very hard to argue with.
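Roughly what (a) and (b) look like in code, assuming `hidden` is the activation array at the layer you inject into. `inject_noise` and `rel_scale` are hypothetical names, and taking RMS over the whole tensor is one choice among several (per-token RMS is another).

```python
import numpy as np

def rms(x):
    # Root-mean-square magnitude of an activation tensor.
    return float(np.sqrt(np.mean(np.square(x))))

def inject_noise(hidden, rel_scale, rng):
    # Gaussian noise whose RMS is rel_scale times the RMS of `hidden`,
    # so the relative perturbation matches across models and layers.
    # rel_scale = 0.0 is the identity null: output must equal input.
    noise = rng.standard_normal(hidden.shape)
    return hidden + rel_scale * rms(hidden) * noise
```

The null run goes through exactly the same hook and measurement path with `rel_scale=0.0`; any nonzero change in G or D there is an artifact of the pipeline.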
4. Put error bars on the gap. The “order of magnitude, no overlap” split is the headline and it currently rests on one GD_ratio point per model. Bootstrapping GD_ratio over prompts/sequences (resample, recompute, CI per model) lets you state non-overlap quantitatively instead of visually - with n=8 models and a single author, that’s the difference between “suggestive” and “hard to dismiss.”
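A percentile-bootstrap sketch, assuming you keep one row of per-prompt measurements per model. `ratio_of_means` is only a hypothetical stand-in (column 0 as G, column 1 as D); swap in however GD_ratio is actually aggregated.

```python
import numpy as np

def bootstrap_ci(records, stat, n_boot=2000, alpha=0.05, seed=0):
    # Resample prompts (rows) with replacement, recompute the statistic
    # each time, and return the percentile confidence interval.
    records = np.asarray(records, dtype=float)
    rng = np.random.default_rng(seed)
    n = records.shape[0]
    boots = np.empty(n_boot)
    for b in range(n_boot):
        sample = records[rng.integers(0, n, size=n)]
        boots[b] = stat(sample)
    lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

def ratio_of_means(sample):
    # Hypothetical GD_ratio stand-in: mean G over mean D.
    return sample[:, 0].mean() / sample[:, 1].mean()

def intervals_overlap(a, b):
    # True if two (lo, hi) intervals share any point.
    return a[0] <= b[1] and b[0] <= a[1]
```

Reporting the per-model intervals and checking `intervals_overlap` between the highest lower-cluster model and the lowest upper-cluster model turns "no overlap" into a stated result.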
5. A concrete test from the large-model side. You asked whether this holds past 1.3B. One thing we see clearly on a 12B (Gemma) is that a few architectural choices impose output geometry independent of training. Gemma applies final-logit softcapping - a tanh cap on the logits - which structurally bounds exactly the concentration/competition geometry your G and D measure. Prediction: a softcapped model will cluster by the cap, not by its corpus. So dropping Gemma (or any softcapped model) into the panel is a clean way to separate “architecture-imposed logit geometry” from “training dynamics.” More generally, past 7B the confound surface grows - softcapping, QK-norm, attention-sink / massive-activation effects - all of which move your operator without touching anything you’d call dynamics. If the two clusters survive those, the structure is real; if it reorganizes around them, that’s the finding.
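For reference, the tanh cap and the bound it implies, as a sketch. The `cap` value is a parameter here; whatever a given checkpoint actually uses should be read from its config, not assumed. `max_top1_prob` is a hypothetical helper giving the largest top-1 probability any capped distribution can reach.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def softcap(logits, cap):
    # Tanh soft cap: near-identity for |logit| << cap, bounded by +/- cap.
    return cap * np.tanh(logits / cap)

def max_top1_prob(cap, vocab_size):
    # Extreme case: one logit at +cap and every other logit at -cap.
    return 1.0 / (1.0 + (vocab_size - 1) * np.exp(-2.0 * cap))
```

How tightly this binds depends on the cap relative to the logit spread: a large cap leaves the bound close to 1, so it is worth computing `max_top1_prob` for the specific model before attributing a cluster position to the cap.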
None of this is a knock - it’s an interesting lens and the honest-limitations framing is exactly why it’s worth pushing on. If I had to order it: temperature-normalization first, because it’s cheap and it either kills the result or makes it much stronger.