Cross-architectural runtime probability dynamics in transformer LLMs — two clusters not explained by parameter count
I want to share a finding from a measurement framework I’ve been working on, because the result is counterintuitive enough that I think it might interest people thinking about architectural differences between transformer LLMs.
The setup
I measured the runtime geometry of probability distributions across eight open-source attention-based transformers ranging from 70M to 1.3B parameters: Pythia-70M, DistilGPT-2, GPT-2, OPT-125M, Pythia-160M, Qwen2.5-0.5B, TinyLlama-1.1B, and Phi-1.5.
For each (token, layer) point during inference, the framework computes geometric properties of the probability distribution over the vocabulary: entropy, concentration on the top candidates, competition between the leading and runner-up tokens, dispersion above a 1% threshold. From these metrics, a bicephalic operator separates two distinct geometric tensions that probability distributions can carry, which I label G (concentration pole) and D (competition pole). The ratio between mean G and mean D, what I call the GD_ratio, becomes a per-model signature.
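To make the per-point metrics concrete, here is a minimal sketch of the four quantities named above, computed for a single probability distribution. It is an illustration under assumptions, not the framework's implementation: the function names, the top-5 cutoff for "concentration", and the use of a simple top-1 minus top-2 margin for "competition" are hypothetical choices, and the G/D operator itself is not reproduced because its definition lives in the preprint.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distribution_metrics(probs, top_k=5, threshold=0.01):
    """Summaries of one distribution over the vocabulary.

    top_k and the margin definition are assumptions for illustration;
    only the 1% threshold is taken from the post.
    """
    ranked = sorted(probs, reverse=True)
    return {
        # Shannon entropy in nats
        "entropy": -sum(p * math.log(p) for p in probs if p > 0.0),
        # Probability mass held by the top_k candidates
        "top_k_mass": sum(ranked[:top_k]),
        # Gap between the leading and runner-up tokens
        "top2_margin": ranked[0] - ranked[1],
        # Number of tokens whose probability exceeds the threshold
        "n_above_threshold": sum(1 for p in probs if p > threshold),
    }

# Toy 6-token "vocabulary": one clear winner vs. two close competitors
peaked = distribution_metrics(softmax([8.0, 1.0, 0.5, 0.0, -1.0, -2.0]))
contested = distribution_metrics(softmax([3.0, 2.9, 0.5, 0.0, -1.0, -2.0]))
```

In the toy example, the peaked distribution has low entropy and a large margin, while the contested one has a small margin and several tokens above the 1% threshold. In a real run the same function would be applied at every (token, layer) point.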
What I found
The eight models do not vary continuously on the GD_ratio. They partition into two clusters with no overlap: the smallest upper-cluster value (1.577) is about 20 times the largest lower-cluster value (0.079), a gap of more than an order of magnitude.
GPT-2 GD_ratio 2.458
Phi-1.5 GD_ratio 1.764
DistilGPT-2 GD_ratio 1.577

Qwen-0.5B GD_ratio 0.079
OPT-125M GD_ratio 0.074
Pythia-70M GD_ratio 0.059
Pythia-160M GD_ratio 0.039
TinyLlama-1.1B GD_ratio 0.021
The cluster split appears on three quantities taken separately: the GD_ratio itself, the mean G alone, and the mean D alone. Because the ratio is derived from the two means, these are not fully independent, but the separation is not an artifact of one metric.
The interesting part is what does not explain the clustering. Parameter count does not. GPT-2 has 124M parameters and is in the upper cluster. OPT-125M has 125M parameters and is in the lower cluster. Phi-1.5 has 1.3B parameters and sits with GPT-2. TinyLlama-1.1B is roughly the same size as Phi-1.5 and sits with OPT.
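The gap and the parameter-count argument can be checked directly from the reported numbers. The sketch below sorts the eight GD_ratio values, finds the largest jump on a log scale, and compares the parameter ranges of the two resulting groups. The GD_ratio values are copied from the post; the parameter counts are nominal figures (taken from the post where stated, otherwise from the model names, with 82M assumed for DistilGPT-2), and the helper name `largest_log_gap` is hypothetical.

```python
import math

# GD_ratio values as reported in the post
gd_ratio = {
    "GPT-2": 2.458, "Phi-1.5": 1.764, "DistilGPT-2": 1.577,
    "Qwen-0.5B": 0.079, "OPT-125M": 0.074, "Pythia-70M": 0.059,
    "Pythia-160M": 0.039, "TinyLlama-1.1B": 0.021,
}

# Nominal parameter counts in millions (assumption: read off model names;
# DistilGPT-2 assumed to be 82M)
params_m = {
    "GPT-2": 124, "Phi-1.5": 1300, "DistilGPT-2": 82,
    "Qwen-0.5B": 500, "OPT-125M": 125, "Pythia-70M": 70,
    "Pythia-160M": 160, "TinyLlama-1.1B": 1100,
}

def largest_log_gap(values):
    """Return (gap in log10 units, model just below, model just above)."""
    ordered = sorted(values.items(), key=lambda kv: kv[1])
    best = None
    for (name_lo, lo), (name_hi, hi) in zip(ordered, ordered[1:]):
        gap = math.log10(hi / lo)
        if best is None or gap > best[0]:
            best = (gap, name_lo, name_hi)
    return best

gap, below, above = largest_log_gap(gd_ratio)
threshold = gd_ratio[above]
upper = sorted(m for m, v in gd_ratio.items() if v >= threshold)
lower = sorted(m for m, v in gd_ratio.items() if v < threshold)

# Parameter ranges of the two clusters overlap almost entirely
upper_range = (min(params_m[m] for m in upper), max(params_m[m] for m in upper))
lower_range = (min(params_m[m] for m in lower), max(params_m[m] for m in lower))
ranges_overlap = upper_range[0] <= lower_range[1] and lower_range[0] <= upper_range[1]

print(f"largest gap: {gap:.2f} log10 units between {below} and {above}")
print(f"upper cluster params (M): {upper_range}, lower: {lower_range}")
```

The largest jump falls between Qwen-0.5B and DistilGPT-2, and the two clusters cover nearly the same parameter range, which is the sense in which size alone does not separate them. With only eight points this is a descriptive check, not a statistical test.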
What might explain it (hypothesis only)
The most parsimonious pattern I can see is that the upper cluster shares characteristics of training corpus curation. Phi-1.5 was trained on heavily curated synthetic data. GPT-2 and DistilGPT-2 share the original GPT-2 WebText distribution and tokenizer, which had its own filtering protocol. The lower cluster spans more heterogeneous training corpora, including older (OPT, Pythia) and newer (Qwen, TinyLlama) models trained on relatively unfiltered web text.
I want to be careful here: this is a hypothesis, not a finding. I do not have an experimental setup that isolates training corpus from architecture choices. The hypothesis is consistent with the data but cannot be established by it.
Why this might matter
If the two-cluster structure generalizes, any tool or analysis that implicitly assumes a single dynamic profile across transformer models will produce inconsistent results depending on which cluster the target model falls into. This includes calibration techniques, uncertainty estimation methods, and probably some interpretability approaches that were tuned on one architectural family and may not transfer cleanly to the other.
Other observations in the same study
A few things worth noting briefly:
The framework also defines a five-state taxonomy of dynamic regimes (stable, hidden turbulence, surface branching, committed, full bifurcation). The full bifurcation state turns out to be consistently transient across architectures: in the three primary models tested in depth, its self-transition probability is 0.023 (GPT-2) or exactly 0.000 (OPT-125M, Qwen-0.5B). Models pass through this regime; they do not settle into it.
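For readers unfamiliar with the term, a self-transition probability is the fraction of times a state is followed by itself in a labeled sequence. The sketch below estimates it from a list of regime labels. The state identifiers, the function name, and the toy sequence are hypothetical, and the post does not say whether transitions are counted along tokens or along layers, so the ordering here is an assumption.

```python
# Hypothetical identifiers for the five regimes named in the post
STATES = [
    "stable", "hidden_turbulence", "surface_branching",
    "committed", "full_bifurcation",
]

def self_transition_probability(sequence, state):
    """Estimate P(next == state | current == state) from a label sequence.

    Returns None if the state never appears with a successor.
    """
    visits = 0
    stays = 0
    for current, nxt in zip(sequence, sequence[1:]):
        if current == state:
            visits += 1
            if nxt == state:
                stays += 1
    return stays / visits if visits else None

# Toy sequence (not data from the study): full_bifurcation never repeats
toy = [
    "stable", "stable", "stable", "full_bifurcation", "committed",
    "committed", "stable", "full_bifurcation", "stable",
]

p_bifurcation = self_transition_probability(toy, "full_bifurcation")
p_stable = self_transition_probability(toy, "stable")
```

A value near zero, as in the toy sequence, means the state is always left on the next step, which is what "transient" means here.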
Three models tested under controlled hidden-state perturbation respond in qualitatively different ways. GPT-2 absorbs the perturbation, with state percentages shifting by less than 1.5 percentage points. OPT-125M converts the perturbation into surface dispersion (the branching state rises by 12.5 points). Qwen-0.5B destabilizes its dominant state (the stable state drops by 18.8 points). Same input noise, three distinct perturbation signatures.
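The shifts quoted above are differences in percentage points between the state distribution of a baseline run and that of a perturbed run. A minimal sketch of that comparison follows, assuming each run has already been reduced to a list of state labels; the function names and the toy labels are hypothetical and do not reproduce the study's data or its perturbation procedure.

```python
from collections import Counter

def state_percentages(labels):
    """Share of each state in a run, in percent."""
    counts = Counter(labels)
    total = len(labels)
    return {state: 100.0 * n / total for state, n in counts.items()}

def percentage_point_shift(baseline, perturbed):
    """Perturbed minus baseline share for every state seen in either run."""
    before = state_percentages(baseline)
    after = state_percentages(perturbed)
    states = sorted(set(before) | set(after))
    return {s: after.get(s, 0.0) - before.get(s, 0.0) for s in states}

# Toy runs of 10 labeled points each (illustrative only)
baseline_run = ["stable"] * 7 + ["surface_branching"] * 2 + ["committed"]
perturbed_run = ["stable"] * 5 + ["surface_branching"] * 4 + ["committed"]

shift = percentage_point_shift(baseline_run, perturbed_run)
```

In the toy example the stable state loses 20 points and the branching state gains 20, while the committed state is unchanged.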
One model (Phi-1.5) produces an anomalous taxonomy distribution under the standard threshold rule. I report it openly as needing dedicated investigation rather than smoothing it over.
What I’m not claiming
The panel is eight models, all under 1.3B parameters. The two-cluster structure could collapse, stretch, or restructure when extended to 7B+ models. I have not validated on non-transformer architectures within this study. The work is single-author and has not been independently replicated. The training-corpus hypothesis is offered, not established.
I included explicit “limited findings” and “rejected claims” sections in the paper, listing five things in each category that initial intuitions suggested but that the data either partially support or actively reject. I treat this as central to the framework’s credibility.
Why I’m posting
I would be interested in hearing whether anyone working with larger or more architecturally diverse models has observed similar partitioning phenomena in their own measurements, whether on attention, hidden states, gradients, or any other intermediate quantity. The two-cluster structure felt unexpected enough that I want to understand whether it is a transformer-wide phenomenon, an artifact of the parameter range I tested, or something specific to the particular operator I defined.
I would also be interested in alternative interpretations of the cluster split beyond the training-corpus hypothesis. Possible candidates I have considered but cannot test from this panel alone: pre-norm vs post-norm architecture, tokenizer differences, attention head configurations, intermediate layer dimensionality, positional encoding choices.
Where the details are
Full methodology, all tables, the explicit limitations section, and the list of rejected claims are in the preprint on Zenodo:
A Runtime Trajectory Dynamics Framework for Large Language Models
Most existing observability tools for large language models analyze attention patterns or final outputs in isolation, leaving the runtime dynamics of probability distributions under-characterized. We introduce V20, a framework that measures the...
Happy to discuss the operator definition, the threshold methodology, the cluster finding, or any concern about the panel size and statistical robustness.