Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreibz6zlowtiduujhsn2zystbpe463innfty2b2hahsqmw53d2kp554",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mjxtzckp5eo2"
  },
  "path": "/t/request-seeking-arxiv-cs-ai-endorsement-independent-researcher-llm-metacognition-benchmark-live-kaggle-leaderboard-8-frontier-models-n-69-human-panel/175421#post_1",
  "publishedAt": "2026-04-21T00:10:43.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "https://www.kaggle.com/benchmarks/rctoliveira/metacognitive-probe-measuring-llm-self-awareness"
  ],
  "textContent": "Hey folks!\n\nI’m an independent AI researcher seeking an arXiv endorsement for the **cs.AI** category (cross-list: cs.CL, cs.LG, stat.ML). This is my first arXiv submission and I don’t have an institutional affiliation, so I need a personal endorsement from someone who has published in a related category.\n\n### About the paper\n\n**Title:** “The Metacognitive Probe: Decomposing LLM Self-Knowledge into Five Measurable Dimensions”\n\nThe paper presents a 5-task diagnostic benchmark that decomposes LLM self-knowledge into separately-measurable dimensions — confidence calibration, epistemic vigilance, knowledge boundaries, calibration range, and reasoning-chain validation. Standard benchmarks (MMLU, BIG-Bench, HELM) measure _what_ models know; this instrument measures _what models know about what they know_.\n\n**Headline finding:** A 47-point within-model dissociation in Gemini 2.5 Flash — it achieves the panel’s best within-task calibration (T1-CC = 88) but the worst cross-task confidence prediction (T4-CR = 41). Flash reports confidence ≈ 100 on every factoid, including ones it gets wrong. This has direct implications for confidence-gated deployment systems.\n\nThe benchmark is evaluated on 8 frontier models (Claude Opus/Sonnet, Gemini Pro/Flash, DeepSeek-R1, GLM-5, Qwen 3, Gemma 3) and a human calibration panel (N=69). All code, data, prompts, and scoring rubrics are publicly released.\n\n### Verifiable materials\n\n  * **Live Kaggle benchmark:** https://www.kaggle.com/benchmarks/rctoliveira/metacognitive-probe-measuring-llm-self-awareness\n\n  * **Google DeepMind Hackathon entry** (Measuring Progress Toward AGI — Cognitive Abilities track)\n\n  * Happy to share the full PDF privately before you decide\n\n\n\n\n### Endorsement details\n\n  * **Category:** cs.AI (primary), cross-list cs.CL, cs.LG, stat.ML\n\n  * **Endorsement code:** I4G6HG\n\n  * To endorse, the endorser needs to have submitted 3+ papers to any cs.* category on arXiv within the last 5 years\n\n\n\n\nIf you’re an active arXiv author in any of these categories and willing to help, I’d really appreciate it. The endorsement takes about 30 seconds — just clicking a link and confirming. I’m happy to send you the paper first if you’d like to review it.\n\nThanks for your time!\n\nRafael Oliveira",
  "title": "[Request] Seeking arXiv cs.AI endorsement — independent researcher, LLM metacognition benchmark (live Kaggle leaderboard, 8 frontier models, N=69 human panel)"
}