{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreicwqtiwn4rasaxtgoqdsthkdczlwxhwtjoqwmwu7new7w265snqpu",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mntc7g5iwva2"
  },
  "path": "/t/cuda-support-added-pre-generation-knowledge-boundary-estimator/176593#post_3",
  "publishedAt": "2026-06-09T03:37:36.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "Strong +1 to John’s core point - tighten the evaluation before touching the architecture. I’ll just add three confounds we ran into doing internal-state probing on this exact model family (Gemma), roughly ordered by how much they’d move your numbers. They’re not yet on the list and the first one is, in our experience, the silent killer.\n\n## 0. Certify the weight artifact before trusting _any_ internal-state feature\n\nYour logit-lens “crystallization” trajectory and MLP-write magnitudes are only as faithful as the weights they’re read from - and Gemma’s quantized artifacts are a minefield right now. In our own work on this family we found several GGUF/quantized conversions of Gemma silently broken: a clean bf16 forward pass gave wikitext PPL ≈ 4.7, while multiple GGUF artifacts of the _same_ weights gave ≈ 190–500 (roughly 40–100× worse). On an artifact like that, the “knowledge crystallizes across depth” signal you’re keying on is partly measuring **quantization damage**, not parametric knowledge - and it’ll look structured, because the damage is structured. We were using Gemma-4 12b. The E series are probably OK, but worth checking.\n\nConcrete: print the probed model’s wikitext PPL next to a known reference and sanity-check it before you trust a single feature. Prefer bf16 safetensors-direct for the model you probe. If you have to probe the quantized deployment target, see #2.\n\n## 1. The current LRD-vs-baseline gaps look like they’re inside split + seed noise\n\n  * Single split, ~3k examples: LRD 0.898 vs global-logreg 0.867 is a 0.03 AUROC gap.\n  * On the cross-PopQA split the **last-layer probe (0.961) already beats LRD (0.941)**. 
That’s the simplest baseline outscoring the full architecture on the transfer split - which, combined with John’s leakage observation, strongly suggests the held-out margin isn’t robust yet.\n\nWe got burned by exactly this on a different internal-state probe: a held-out score of 0.20 collapsed to ~0.13 the moment we made the train/val split context-disjoint, and single-seed runs swung ±0.04 between adjacent checkpoints - enough to flip a “win” into a “loss.” We now treat any number as **inconclusive until ≥3 seeds** with a bootstrap CI over the test set.\n\nConcrete: 3–5 seeds, report mean ± CI (or a paired bootstrap of LRD − baseline, which is what actually tells you if the gap is real). And dedup at the **entity** level, not just exact-question - PopQA and TriviaQA share entities, so question-level dedup leaves entity leakage intact, and an entity that appears in train teaches the probe the entity, not the knowledge boundary.\n\n## 2. P(knows) is quantization-instance specific, not just model specific\n\nQuantization changes _which_ facts are recoverable - we have direct evidence that coarse quantization degrades exact factual recall on this family, not just average loss. So a sidecar trained on fp16 Gemma will mis-route a Q4 deployment: the boundary literally moves. Train and calibrate the router on the **exact deployed (quantized) instance**, and re-fit the temperature per instance - calibration in particular will not transfer across quant levels.\n\n## 3. (smaller) Layer-class structure on Gemma\n\nGemma interleaves global and sliding-window attention layers - different effective ranges (global carries long-range, SWA fades past its window) and different RoPE. A GRU-over-depth that treats all layers as one uniform sequence is averaging over two different signal regimes. A cheap ablation: tag each layer with its class (or pool the two classes separately) and see whether the depth signal actually lives in the global layers. 
If it does, that’s both a smaller, more honest feature set and a more transferable one.\n\n## Bottom line\n\nJohn’s eval-tightening is the right first move. I’d prepend **step 0: certify the weight artifact** - a broken quant silently poisons every downstream internal feature, and it’s the cheapest thing to rule out. Then treat the current LRD-vs-baseline gaps as within-noise until you have multi-seed + entity-disjoint + true train→test transfer, and bind the probe to the exact quantized instance you’ll deploy. If the margin survives all of that, you’ve got a genuinely strong, deployment-relevant result - and the prompt-only, one-forward property is worth protecting the whole way.",
  "title": "CUDA support added - Pre-generation knowledge-boundary estimator"
}