Stop Looking at Perplexity. The Real Story is in the Geometry. (3 Papers, 17 Models, 0 Benchmarks)
We are obsessed with what LLMs say. We are ignoring how they think.
While everyone is busy fine-tuning for benchmark scores, I’ve been mapping the internal trajectory instability of 17 open-source models during inference. The results don’t just challenge current evaluation methods — they expose a structural blind spot in how we understand model stability.
I’m releasing a complete 3-paper series on LIMEN (Liminal Internal Metric for Emergent Navigation). No benchmarks. No output quality metrics. Just pure, raw geometric dynamics.
Size Does Not Equal Stability
We assume bigger models are more stable. My data says otherwise.
Using a single dimensionless metric (ratio_norm = max(ct_t) / mean(ct_t)), I found that Qwen-2.5 (0.5B–3B) is the only family that consistently resides in an “Adaptive” Regime (balanced flux/stability).
Meanwhile?
- Llama-3.2-3B is “Underactive” (ratio 1.55 — rigid, low variance).
- Gemma-2B is “Chaotic” (ratio 4.42 — violent spikes, high instability debt).
- DistilGPT-2 is “Chaotic” (ratio 35.55 — extreme).
A 1.5B Qwen model is dynamically more stable than larger, instruction-tuned counterparts. This isn’t about parameter count. It’s about architectural geometry.
This ordering holds across all six prompt categories, without exception, over the 158 runs of Paper 1.
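A minimal sketch of the metric, for readers who want to compute it on their own traces. It assumes ct_t is available as a plain sequence of non-negative per-token values; how ct_t is extracted from a model is defined in the papers, not here, and the two example trajectories are made-up numbers.

```python
from statistics import fmean


def ratio_norm(ct):
    """Peak-to-mean ratio max(ct_t) / mean(ct_t) of one trajectory.

    Assumption: `ct` holds non-negative per-token values, so the
    result is dimensionless and never below 1.
    """
    values = [float(v) for v in ct]
    if not values:
        raise ValueError("trajectory is empty")
    mean = fmean(values)
    if mean <= 0:
        raise ValueError("mean must be positive for the ratio to be meaningful")
    return max(values) / mean


# Hypothetical trajectories, not data from the papers.
flat = [1.0, 1.1, 0.9, 1.0]    # low variance, ratio close to 1
spiky = [0.2, 0.1, 0.3, 5.0]   # one large spike dominates the ratio

print(round(ratio_norm(flat), 2))
print(round(ratio_norm(spiky), 2))
```

Because the ratio divides a peak by an average, a single spike in a long trajectory can move it a lot, which is worth keeping in mind when comparing runs of different lengths.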
Audit (Paper 2 — DOI: 10.5281/zenodo.20361289)
I didn’t stop at 10 models. I audited 17 models (70M to 3B params) across 1,224 runs. The findings were harsh:
- Architecture is Minority Rule: Model identity explains only 7% of dynamic variance. Prompt category explains 17%. The remaining 76% is residual noise/seed variance.
- Families are a Myth: At scale (n=17), clean “model families” dissolve into a fragmented topology (fragmentation index 0.65, Adjusted Rand Index 0.11 against prior narrative families). Only two pairs survive bootstrap: OPT-125M ↔ Pythia-160M and Phi-1.5 ↔ Qwen-0.5B.
- 80% of Trajectories are Non-Stationary: Only 13% of token-level trajectories remain in a single regime throughout generation.
Six small-panel hypotheses were explicitly falsified and documented as a primary contribution, not hidden as limitations.
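One common way to arrive at shares like the model / prompt category / residual split above is an eta-squared style decomposition of the total sum of squares. The sketch below shows that calculation under stated assumptions: the estimator actually used in Paper 2 is not specified in this post, the function name and row layout are hypothetical, and the shares only add up cleanly for a balanced design (every model run on every category the same number of times).

```python
from collections import defaultdict
from statistics import fmean


def variance_shares(rows):
    """Split total variance into model, category and residual shares.

    `rows` is an iterable of (model, category, value) tuples, one per run.
    The residual absorbs interaction, seed variance and noise.
    Assumes a balanced design; otherwise the two factor terms can overlap.
    """
    rows = list(rows)
    grand = fmean(v for _, _, v in rows)
    ss_total = sum((v - grand) ** 2 for _, _, v in rows)
    if ss_total == 0:
        raise ValueError("all values are identical; nothing to decompose")

    def ss_between(index):
        # Sum of squares explained by grouping on one column.
        groups = defaultdict(list)
        for row in rows:
            groups[row[index]].append(row[2])
        return sum(len(g) * (fmean(g) - grand) ** 2 for g in groups.values())

    ss_model = ss_between(0)
    ss_category = ss_between(1)
    ss_residual = ss_total - ss_model - ss_category
    return {
        "model": ss_model / ss_total,
        "category": ss_category / ss_total,
        "residual": ss_residual / ss_total,
    }


# Hypothetical, perfectly additive toy data (not from the papers).
toy = [("A", "x", 1.0), ("A", "y", 3.0), ("B", "x", 2.0), ("B", "y", 4.0)]
print(variance_shares(toy))
```

In the toy data the category effect is twice the size of the model effect and there is no noise, so the category share dominates and the residual is zero. Real runs with multiple seeds per cell would push most of the mass into the residual term.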
The Hidden Cycle: Collapse is Not Failure
In Paper 3, I document a robust COLLAPSE-RIVALRY cycle. When a model enters a low-entropy “Collapse” state, it returns to a high-entropy “Rivalry” state 84% of the time (on 638 observed cycles).
Collapse isn’t a crash. It’s a self-correction mechanism. Models like Qwen leverage this cycle efficiently. Others get stuck.
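To make the return-rate figure concrete, here is a minimal sketch of how such a rate could be counted from a per-token sequence of regime labels. It assumes a labelling step already exists and that a "cycle" means a run of Collapse tokens followed by its first non-Collapse label; the exact definition used in Paper 3 may differ. The label strings and the "OTHER" label are hypothetical.

```python
def collapse_return_rate(labels, collapse="COLLAPSE", rivalry="RIVALRY"):
    """Return (rate, episodes) for completed collapse episodes.

    An episode is a maximal run of `collapse` labels that is followed by
    some other label. It counts as a return when that next label is
    `rivalry`. A collapse run still open at the end of the sequence is
    ignored, since its outcome is unknown.
    """
    episodes = 0
    returns = 0
    in_collapse = False
    for label in labels:
        if label == collapse:
            in_collapse = True
        elif in_collapse:
            episodes += 1
            if label == rivalry:
                returns += 1
            in_collapse = False
    rate = returns / episodes if episodes else float("nan")
    return rate, episodes


# Hypothetical label sequence, not data from the papers.
trace = ["RIVALRY", "COLLAPSE", "COLLAPSE", "RIVALRY",
         "OTHER", "COLLAPSE", "OTHER", "COLLAPSE"]
rate, episodes = collapse_return_rate(trace)
print(rate, episodes)
```

Here two collapse episodes complete, one exits to Rivalry and one does not, and the trailing collapse is left uncounted. Reporting the episode count alongside the rate keeps small samples from looking more certain than they are.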
The Full Series (Open Access)
I’m releasing everything.
- Paper 1, Four Dynamical Regimes: The taxonomy. Why Qwen is unique. 10 models, 158 runs.
- Paper 2, Methodological Audit (DOI: 10.5281/zenodo.20361289): The falsifications. Why 76% of variance is noise. 17 models, 1,224 runs.
- Paper 3, Dynamic-Layer Controllability: Why you can fix dynamics but not semantics. 3 architectures, systematic perturbation-recovery protocol.
I’m not here to sell a solution. I’m here to expose a problem.
- If you’re working on Mechanistic Interpretability, does ct_t correlate with your circuit analyses?
- If you’re in BCI/EEG, do you see these same “Adaptive” vs “Chaotic” signatures in neural trajectories?
- If you’re a Qwen user, does this “Adaptive” geometry explain why it feels easier to fine-tune?
- If you’re training models, should dynamic stability be an evaluation metric alongside perplexity?
Critiques welcome. Falsifications encouraged. Let’s move beyond perplexity.