{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreid44iyqldfxk2dzu3jwukvcetihhkvedevxytmnax3dfs335th2bm",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mgqn4ia7g732"
  },
  "path": "/t/hidden-state-signals-in-iterative-llm-repair-what-cosine-similarity-at-layer-27-actually-tells-you/174143#post_1",
  "publishedAt": "2026-03-10T19:17:32.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "https://doi.org/10.5281/zenodo.18941566"
  ],
  "textContent": "I’ve been running a systematic experiment to answer a simple question: can you detect that an LLM is stuck in an unproductive loop by looking at its hidden states, before the output reveals it?\n\n**The setup:** Qwen2.5-7B-Instruct (4-bit), repair loop on a hard code task (LRU Cache with 5 interdependent bugs, 7 test blocks). Forward hook at Layer 27, extracting the last-token hidden state (`[:, -1, :]`) every 50 tokens. Primary signal: `max_prev_similarity`, the maximum cosine similarity between the current hidden state and any prior checkpoint.\n\n**What we found:**\n\nThe signal is real. In multiple runs, high cosine similarity appeared *before* the output started looping, and there were clear cases where the hidden-state signal detected semantic stagnation that n-gram and code-block text detectors completely missed. In two reproducible dissociation cases, the model was internally circling while producing superficially different-looking outputs.\n\n**The complication:**\n\nHigh similarity to earlier checkpoints (high coherence) is ambiguous. It marks both productive convergence (the model has found a stable, correct solution) and pathological stagnation (the model is stuck). As a standalone scalar, cosine similarity can’t distinguish attractor types.\n\nThis has a concrete implication: if you build an intervention that triggers when coherence is high, you’ll interrupt productive and stuck states alike. That’s what happened in our Phase 10.3: prompt-based interventions triggered by the signal underperformed the baseline (2/8 successes vs 3/8 for the baseline).\n\n**What would actually help:**\n\nA second signal to disambiguate. Entropy and confidence margin (both logprob-based) show a modest combined signal (AUC ~0.59 for regression detection) but aren’t enough on their own either. 
The more tractable near-term solution turned out to be architectural: a monotonic controller that preserves best-so-far state rather than trying to predict loop states in advance.\n\n**Why this matters beyond the specific task:**\n\nMost inference-time control approaches operate on the output side — token filtering, chain-of-thought steering, sampling interventions. The hidden state provides a different channel: it captures *process dynamics* rather than *output content*. The dissociation between the two is the interesting finding.\n\nFull interim report (10 phases, all results including negatives): https://doi.org/10.5281/zenodo.18941566\n\nHappy to discuss methodology, signal extraction specifics, or the monotonic controller design.",
  "title": "Hidden-State Signals in Iterative LLM Repair: What Cosine Similarity at Layer 27 Actually Tells You"
}