{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreihwdpbq62sy6fos2vy4hrrvhboai22my47quawonjhddwewvggcyi",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mnukuq67ilc2"
  },
  "path": "/t/cross-architectural-runtime-probability-dynamics-in-transformer-llms-two-clusters-not-explained-by-parameter-count/176630#post_4",
  "publishedAt": "2026-06-09T15:37:57.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "Thanks for this — and you were right.\n\nRan the temperature normalization test you suggested as priority 1. The\nresult is a partial falsification of the strongest interpretation of the\nfinding.\n\nSetup: fitted a per-model scalar temperature T to match a common target\nentropy across the panel (target = 3.01), recomputed geometry on the\ncalibrated logits, recomputed clustering.\n\nWhat happened: two models migrate between clusters after calibration.\nGPT-2 and Phi-1.5 both move. The raw clustering structure is therefore\nsubstantially driven by effective logit temperature, not purely by\nruntime dynamics.\n\nWhat this falsifies: “Raw GD_ratio directly measures model dynamics\nindependently of calibration.” This interpretation is rejected by the\ntest you proposed.\n\nWhat may still hold: residual structure after calibration. Two clusters\nstill appear post-calibration, but their composition changes — so the\nquestion becomes “is there a calibration-independent component to the\nclustering, and if so what does it measure?” That requires the additional\ncontrols you listed: vocab normalization (log V), top-p truncation,\nperturbation magnitude-matched to per-layer activation RMS, bootstrap CIs\non the gap. None of those are done yet.\n\nThe published version of the V20 preprint (deposited yesterday) overstates\nthe dynamical interpretation of the raw GD_ratio. I am preparing a\nrevision that incorporates the temperature audit explicitly. The\nfalsification is going into the rejected-claims section. The remaining\nstructure question is moving into limited-findings with the controls you\nspecified as the protocol for elevation.\n\nThe attention-based findings from companion work (different observation\nlevel than logits) are not affected by this confound and remain the\ndirection I will continue to develop.\n\nGenuinely useful review. The “temperature first because it kills the\nresult or makes it much stronger” framing was the right priority.",
  "title": "Cross-architectural runtime probability dynamics in transformer LLMs — two clusters not explained by parameter count"
}