
Cross-architectural runtime probability dynamics in transformer LLMs — two clusters not explained by parameter count

Hugging Face Forums [Unofficial] June 9, 2026

This is exceptional feedback. Thank you for taking the time to dissect the operator and the methodology with such rigor. You’ve pinpointed the exact confounds that keep me up at night, particularly regarding calibration and activation norms.

I want to address your points directly, as they align perfectly with my own internal audits (some of which are in the “Rejected Claims” section of the preprint, but deserve more prominence):

1. The Temperature/Calibration Confound (Priority #1)

You are absolutely right. GD_ratio could simply be a proxy for per-model “confidence temperature” driven by training objectives (e.g., label smoothing, final-LN gain).

Action: I will run the scalar temperature fitting test you suggested. If the clusters collapse after entropy-matching, then GD_ratio is a calibration metric, not a dynamic one. If they hold, that points to a structural geometric difference beyond simple sharpness. This is the cheapest and most decisive test. I’ll update the repo with these results.
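
A minimal sketch of what the entropy-matching step could look like, assuming each model's next-token logits are available as plain lists and that one scalar temperature per model is fitted so its mean entropy hits a shared target. The function names (`mean_entropy`, `fit_temperature`) are illustrative, not code from the repo.

```python
import math

def softmax(logits, temperature=1.0):
    # Numerically stable softmax of logits / temperature.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats; zero-probability entries contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(logit_rows, temperature):
    # Average entropy over a set of positions/prompts for one model.
    rows = [entropy(softmax(row, temperature)) for row in logit_rows]
    return sum(rows) / len(rows)

def fit_temperature(logit_rows, target_entropy, lo=1e-3, hi=1e3, iters=80):
    # Entropy is non-decreasing in temperature, so a bisection on the
    # geometric midpoint (i.e. in log-temperature) finds the match.
    # Assumes the target lies between the entropies at `lo` and `hi`.
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if mean_entropy(logit_rows, mid) < target_entropy:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)
```

With a temperature fitted per model, GD_ratio would be recomputed on `softmax(logits, T)` and the clustering checked again.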

2. Vocabulary Invariance

Good catch on the vocab size disparity (32k vs 152k).

Current Mitigation: My operator uses relative thresholds (top-k competition, dispersion above 1% of the mass), which helps, but raw entropy is definitely biased by the vocabulary size $V$.
Action: I will recompute using entropy normalized by $\log V$ and also try the “top-p mass geometry” approach you mentioned. This should isolate the shape of the distribution from its granularity.
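
A short sketch of the two vocabulary-robust measures, under stated assumptions: normalized entropy is taken as H / log(V) with V the length of the probability vector, and "top-p mass geometry" is interpreted here as the number of tokens needed to reach cumulative mass p, which is only one possible reading of that suggestion. Both function names are hypothetical.

```python
import math

def normalized_entropy(probs):
    # Entropy divided by its maximum log(V), so the result lies in [0, 1]
    # regardless of vocabulary size V.
    v = len(probs)
    if v < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(v)

def top_p_count(probs, p=0.9):
    # Smallest number of tokens whose cumulative probability reaches p.
    # (Assumed interpretation of "top-p mass geometry".)
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return count
    return len(probs)
```

A uniform distribution scores 1.0 under `normalized_entropy` whether V is 32k or 152k, which is the invariance being asked for; raw entropy would differ by log(152/32) nats between the two.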

3. Perturbation Magnitude & Nulls

This is a critical methodological flaw in many interpretability studies.

Action: I currently inject fixed Gaussian noise. I will switch to RMS-normalized noise (scaled to each layer’s activation RMS) so that perturbation strength is comparable across layers and models. I will also add the zero-perturbation null to baseline the measurement pipeline’s stability. If the “three signatures” survive both controls, that strengthens the claim that they are architectural, not artifacts of scale.
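
A minimal sketch of RMS-scaled noise injection, assuming a hidden state is available as a flat list of floats and that the noise standard deviation is set to a fixed fraction (`rel_scale`, a hypothetical parameter) of that vector's RMS. Setting `rel_scale=0.0` gives the zero-perturbation null.

```python
import math
import random

def rms(values):
    # Root-mean-square magnitude of an activation vector.
    return math.sqrt(sum(v * v for v in values) / len(values))

def perturb_rms_scaled(hidden, rel_scale, rng):
    # Gaussian noise whose standard deviation is rel_scale * RMS(hidden),
    # so layers with large activations receive proportionally larger noise.
    sigma = rel_scale * rms(hidden)
    return [h + sigma * rng.gauss(0.0, 1.0) for h in hidden]

# Usage: same relative strength at every layer, plus the null run.
rng = random.Random(0)
layer_activation = [1.0, -2.0, 3.0, -4.0]  # placeholder values
noisy = perturb_rms_scaled(layer_activation, 0.1, rng)
null = perturb_rms_scaled(layer_activation, 0.0, rng)  # returns the input unchanged
```

In a real pipeline this would be applied inside a forward hook on tensors rather than lists; the scaling rule is the same.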

4. Error Bars & Bootstrapping

Agreed. Visual separation on $n = 8$ is suggestive, not definitive.

Action: I will bootstrap the GD_ratio over prompts and seeds to generate a 95% confidence interval for each model. This will put the “no overlap” claim on a statistical footing.
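
A sketch of a percentile bootstrap for one model, assuming GD_ratio has already been computed once per (prompt, seed) pair and collected into a list. The resample count, the choice of the mean as the statistic, and the helper names are assumptions for illustration.

```python
import random

def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    # Percentile bootstrap: resample with replacement, recompute the mean,
    # and read the interval off the sorted resampled means.
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]

def intervals_overlap(a, b):
    # True if closed intervals a=(lo, hi) and b=(lo, hi) share any point.
    return a[0] <= b[1] and b[0] <= a[1]
```

Running `bootstrap_ci` per model and then `intervals_overlap` on every cross-cluster pair turns the visual separation into a checkable statement. With only eight models the between-model uncertainty remains, so this quantifies per-model stability rather than the cluster structure itself.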

5. The Large Model Test (Gemma & Softcapping)

This is a brilliant concrete test. Gemma’s logit softcapping is a known architectural constraint that directly bounds concentration metrics.

Hypothesis: If Gemma clusters with the “Low GD” group solely due to softcapping, it confirms that architecture can override training corpus signals.
Action: I am currently limited by compute for full 12B runs, but I can run inference on smaller softcapped variants or use existing hidden-state logs if available. Alternatively, I can simulate softcapping on my current panel to see if it forces them into the lower cluster.
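
A rough sketch of the simulated-softcapping idea, assuming the usual tanh form `cap * tanh(logit / cap)` applied to final logits before the softmax. The default cap of 30.0 is an illustrative assumption and should be checked against the actual model config; the helper names are hypothetical.

```python
import math

def softcap(logits, cap=30.0):
    # Smoothly bounds every logit to the open interval (-cap, cap).
    return [cap * math.tanh(x / cap) for x in logits]

def softmax(logits):
    # Numerically stable softmax.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def max_prob(logits):
    # A simple concentration measure: probability of the top token.
    return max(softmax(logits))

# Usage: compare concentration with and without the cap on the same logits.
logits = [10.0, 0.0, 0.0]  # placeholder values
before = max_prob(logits)
after = max_prob(softcap(logits, cap=5.0))
```

Because the cap compresses the gap between the top logit and the rest, `after` cannot exceed `before` for the top token. Post-hoc capping is not the same as training with the cap in place, so this only tests whether the bound alone is enough to move a model between clusters.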

Next Steps: I’m prioritizing the Temperature Normalization and RMS-scaled Perturbation audits this week. These are high-leverage controls that will either validate the core hypothesis or refine it into a more precise statement about calibration vs. dynamics.

I’ll post an update here once the bootstrapped CIs and temperature-calibrated ratios are ready. Thanks again for pushing the rigor bar higher; this is exactly how robust science gets done.

Best, Jean-Denis
