Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreifymz3t7bf6ezd4h7ld33tmindrsji23xbi7cv3zyydegqb4mizy4",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mntc7cjwo2l2"
  },
  "path": "/t/cross-architectural-runtime-probability-dynamics-in-transformer-llms-two-clusters-not-explained-by-parameter-count/176630#post_1",
  "publishedAt": "2026-06-09T03:41:43.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Zenodo",
    "A Runtime Trajectory Dynamics Framework for Large Language Models"
  ],
  "textContent": "I want to share a finding from a measurement framework I’ve been working\non, because the result is counterintuitive enough that I think it might\ninterest people thinking about architectural differences between\ntransformer LLMs.\n\n## The setup\n\nI measured the runtime geometry of probability distributions across\neight open-source attention-based transformers ranging from 70M to 1.3B\nparameters: Pythia-70M, DistilGPT-2, GPT-2, OPT-125M, Pythia-160M,\nQwen2.5-0.5B, TinyLlama-1.1B, and Phi-1.5.\n\nFor each (token, layer) point during inference, the framework computes\ngeometric properties of the probability distribution over the vocabulary:\nentropy, concentration on the top candidates, competition between the\nleading and runner-up tokens, dispersion above a 1% threshold. From these\nmetrics, a bicephalic operator separates two distinct geometric tensions\nthat probability distributions can carry, which I label G (concentration\npole) and D (competition pole). The ratio between mean G and mean D, what\nI call the GD_ratio, becomes a per-model signature.\n\n## What I found\n\nThe eight models do not vary continuously on the GD_ratio. They partition\ninto two clusters with no overlap and roughly an order of magnitude of\ngap between them:\n\n## GPT-2 GD_ratio 2.458\n\nPhi-1.5 GD_ratio 1.764\nDistilGPT-2 GD_ratio 1.577\n\nQwen-0.5B GD_ratio 0.079\nOPT-125M GD_ratio 0.074\nPythia-70M GD_ratio 0.059\nPythia-160M GD_ratio 0.039\nTinyLlama-1.1B GD_ratio 0.021\n\nThe cluster split appears on three independent components of the operator:\nthe GD_ratio itself, the mean G alone, and the mean D alone. The\nseparation is not an artifact of one metric.\n\nThe interesting part is what does not explain the clustering. Parameter\ncount does not. GPT-2 has 124M parameters and is in the upper cluster.\nOPT-125M has 125M parameters and is in the lower cluster. Phi-1.5 has\n1.3B parameters and sits with GPT-2. 
TinyLlama-1.1B is roughly the same\nsize as Phi-1.5 and sits with OPT.\n\n## What might explain it (hypothesis only)\n\nThe most parsimonious pattern I can see is that the upper cluster shares\ncharacteristics of training corpus curation. Phi-1.5 was trained on\nheavily curated synthetic data. GPT-2 and DistilGPT-2 share the original\nGPT-2 tokenizer and WebText distribution, the latter built with its own\nfiltering protocol. The lower cluster spans more heterogeneous training corpora,\nincluding older (OPT, Pythia) and newer (Qwen, TinyLlama) architectures\ntrained on relatively unfiltered web text.\n\nI want to be careful here: this is a hypothesis, not a finding. I do not\nhave an experimental setup that isolates training corpus from architecture\nchoices. The hypothesis is consistent with the data but cannot be\nestablished by it.\n\n## Why this might matter\n\nIf the two-cluster structure generalizes, any tool or analysis that\nimplicitly assumes a single dynamic profile across transformer models\nmay produce inconsistent results depending on which cluster the target\nmodel falls into. This includes calibration techniques, uncertainty\nestimation methods, and probably some interpretability approaches that\nwere tuned on one architectural family and may not transfer cleanly to\nthe other.\n\n## Other observations in the same study\n\nA few things worth noting briefly:\n\n  * The framework also defines a five-state taxonomy of dynamic regimes\n(stable, hidden turbulence, surface branching, committed, full\nbifurcation). The full bifurcation state turns out to be consistently\ntransient across architectures: in the three primary models tested in depth,\nits self-transition probability is 0.023 for GPT-2 and exactly 0.000\nfor OPT-125M and Qwen2.5-0.5B. Models pass through this regime; they do not\nsettle into it.\n\n  * Three models tested under controlled hidden-state perturbation respond\nin qualitatively different ways. 
GPT-2 absorbs the perturbation with\nstate percentages shifting by less than 1.5 percentage points. OPT-125M converts\nthe perturbation into surface dispersion (branching state rises 12.5\npoints). Qwen2.5-0.5B destabilizes its dominant state (stable state drops\n18.8 points). Three architectural perturbation signatures, same input\nnoise.\n\n  * One model (Phi-1.5) produces an anomalous taxonomy distribution under\nthe standard threshold rule. I report it openly as needing dedicated\ninvestigation rather than smoothing it over.\n\n## What I’m not claiming\n\nThe panel is eight models, all at or below 1.3B parameters. The two-cluster\nstructure could collapse, stretch, or restructure when extended to 7B+\nmodels. I have not validated on non-transformer architectures within\nthis study. The work is single-author and has not been independently\nreplicated. The training-corpus hypothesis is offered, not established.\n\nI included explicit “limited findings” and “rejected claims” sections in\nthe paper, listing five things in each category that initial intuitions\nsuggested but that the data either partially support or actively reject.\nI treat this as central to the framework’s credibility.\n\n## Why I’m posting\n\nI would be interested in hearing whether anyone working with larger or\nmore architecturally diverse models has observed similar partitioning\nphenomena in their own measurements, whether on attention, hidden states,\ngradients, or any other intermediate quantity. The two-cluster structure\nfelt unexpected enough that I want to understand whether it is a\ntransformer-wide phenomenon, an artifact of the parameter range I tested,\nor something specific to the particular operator I defined.\n\nI would also be interested in alternative interpretations of the cluster\nsplit beyond the training-corpus hypothesis. 
Possible candidates I have\nconsidered but cannot test from this panel alone: pre-norm vs post-norm\narchitecture, tokenizer differences, attention head configurations,\nintermediate layer dimensionality, and positional encoding choices.\n\n## Where the details are\n\nFull methodology, all tables, the explicit limitations section, and the\nlist of rejected claims are in the preprint on Zenodo:\n\nA Runtime Trajectory Dynamics Framework for Large Language Models\n\nMost existing observability tools for large language models analyze attention patterns or final outputs in isolation, leaving the runtime dynamics of probability distributions under-characterized. We introduce V20, a framework that measures the...\n\nHappy to discuss the operator definition, the threshold methodology, the\ncluster finding, or any concern about the panel size and statistical\nrobustness.",
  "title": "Cross-architectural runtime probability dynamics in transformer LLMs — two clusters not explained by parameter count"
}