{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreiatpgi4c2xiyyr34mpfseobilinnozokkk5q2q45otfb47asmqfyi",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mmuc6mizvmn2"
},
"path": "/t/stop-looking-at-perplexity-the-real-story-is-in-the-geometry-3-papers-17-models-0-benchmarks/176266#post_1",
"publishedAt": "2026-05-27T18:32:35.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"Paper 1: Four Dynamical Regimes",
"Paper 3: Dynamic-Layer Controllability"
],
"textContent": "* * *\n\nWe are obsessed with what LLMs say. We are ignoring how they think.\n\nWhile everyone is busy fine-tuning for benchmark scores, I’ve been mapping the **internal trajectory instability** of 17 open-source models during inference. The results don’t just challenge current evaluation methods — they expose a structural blind spot in how we understand model stability.\n\nI’m releasing a complete 3-paper series on **LIMEN (Liminal Internal Metric for Emergent Navigation)**. No benchmarks. No output quality metrics. Just pure, raw geometric dynamics.\n\n* * *\n\n### Size Does Not Equal Stability\n\nWe assume bigger models are more stable. My data says otherwise.\n\nUsing a single dimensionless metric (`ratio_norm = max(ct_t) / mean(ct_t)`), I found that **Qwen-2.5 (0.5B–3B)** is the _only_ family that consistently resides in an **“Adaptive” Regime** (balanced flux/stability).\n\nMeanwhile?\n\n * **Llama-3.2-3B** is **“Underactive”** (ratio 1.55 — rigid, low variance).\n * **Gemma-2B** is **“Chaotic”** (ratio 4.42 — violent spikes, high instability debt).\n * **DistilGPT-2** is **“Chaotic”** (ratio 35.55 — extreme).\n\n\n\nA 1.5B Qwen model is dynamically more stable than larger, instruction-tuned counterparts. This isn’t about parameter count. It’s about **architectural geometry**.\n\nThis ordering holds across **all six prompt categories without exception** on 158 runs.\n\n* * *\n\n### Audit (Paper 2 — DOI: 10.5281/zenodo.20361289)\n\nI didn’t stop at 10 models. I audited 17 models (70M to 3B params) across 1,224 runs. The findings were harsh:\n\n 1. **Architecture is Minority Rule:** Model identity explains only **7%** of dynamic variance. Prompt category explains **17%**. The remaining **76%** is residual noise/seed variance.\n\n 2. **Families are a Myth:** At scale (n=17), clean “model families” dissolve into a fragmented topology (fragmentation index 0.65, Adjusted Rand Index 0.11 against prior narrative families). 
Only **two pairs** survive bootstrap: **OPT-125M ↔ Pythia-160M** and **Phi-1.5 ↔ Qwen-0.5B**.\n\n 3. **80% of Trajectories are Non-Stationary:** Only 13% of token-level trajectories remain in a single regime throughout generation.\n\n 4. **Six small-panel hypotheses were explicitly falsified** and documented as a primary contribution, not hidden as limitations.\n\n\n\n\n* * *\n\n### The Hidden Cycle: Collapse is Not Failure\n\nIn Paper 3, I document a robust **COLLAPSE-RIVALRY cycle**. When a model enters a low-entropy “Collapse” state, it returns to a high-entropy “Rivalry” state **84% of the time** (on 638 observed cycles).\n\nCollapse isn’t a crash. It’s a **self-correction mechanism**. Models like Qwen leverage this cycle efficiently. Others get stuck.\n\n* * *\n\n### The Full Series (Open Access)\n\nI’m releasing everything.\n\n 1. **Paper 1: Four Dynamical Regimes.** The taxonomy. Why Qwen is unique. 10 models, 158 runs.\n\n 2. **Paper 2: Methodological Audit (DOI: 10.5281/zenodo.20361289).** The falsifications. Why 76% of variance is noise. 17 models, 1,224 runs.\n\n 3. **Paper 3: Dynamic-Layer Controllability.** Why you can fix dynamics but not semantics. 3 architectures, systematic perturbation-recovery protocol.\n\n\n\n\n* * *\n\nI’m not here to sell a solution. I’m here to expose a problem.\n\n * If you’re working on **Mechanistic Interpretability**, does `ct_t` correlate with your circuit analyses?\n * If you’re in **BCI/EEG**, do you see these same “Adaptive” vs “Chaotic” signatures in neural trajectories?\n * If you’re a **Qwen user**, does this “Adaptive” geometry explain why it feels easier to fine-tune?\n * If you’re training models, should **dynamic stability** be an evaluation metric alongside perplexity?\n\n\n\nCritiques welcome. Falsifications encouraged. Let’s move beyond perplexity.",
"title": "Stop Looking at Perplexity. The Real Story is in the Geometry. (3 Papers, 17 Models, 0 Benchmarks)"
}