Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreibvt4phwit7bnthrnvqkhhxhclhbjx6ouhe7dmtu2z37dldsluif4",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mmrrmtkd5oz2"
  },
  "path": "/t/arxiv-cs-cr-endorsement-request-three-preprints-on-llm-security/176240#post_1",
  "publishedAt": "2026-05-26T18:18:44.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "I’m an independent researcher writing to ask whether you’d be willing to endorse me for arxiv’s cs.CR category. I have three small preprints ready to upload, all of which sit in the LLM-security / honeypot-measurement / safety-classifier-calibration space — adjacent to your work on [SPECIFIC PAPER OR TOPIC OF THEIRS].\n\nThe most surprising of the three is a 14-page safety-research note documenting a frontier-LLM safety classifier (Claude Opus 4.7) refusing to score one specific student-LLM output on a CTI-style task, falsified across a 7-judge cross-vendor panel (Sonnet/Haiku/Gemma/foundation-sec/qwen/Llama-4/gpt-oss all engage). Has a self-correction story: I initially reported a 53 % refusal rate, then established that 15/16 of the “refusals” were upstream API credit-balance errors, leaving 1 genuine refusal with cleaner properties. Reproducibility artefacts (data + code + analyses) are released on Zenodo with DOIs.\n\nZenodo: 10.5281/zenodo.20383617 (the safety-note, ~14 pages)\n\nCompanion Paper 2 (Qwen2.5-7B QLoRA distillation, 20 pages): 10.5281/zenodo.20383612\n\nPaper 1 (honeypot measurement, 38 pages) is not yet on Zenodo but I’m happy to send the PDF.\n\nArxiv’s endorsement is per-subject and one-time — once you’ve endorsed for cs.CR I can submit all three. The code you’d give me is a 6-character string from arxiv’s UI. No reading commitment expected; a 30-second skim of the safety-note abstract should be enough to decide.\n\n[ SZYPXN ] endorsement code\n\n## **3-bullet TL;DR for the safety-note (the strongest hook)**\n\n>   1. Claude Opus 4.7 deterministically refuses to score 1 specific student-LLM output (`chunk_idx=2 ttp_summary`) — reproduces 5/5 stochasticity, 7+ trials across two production eval runs.\n>\n>   2. The refusal does NOT generalise to content-class-similar synthetic records: a 24-record probe varying defensive-infrastructure entity attribution (CISA, NIST, FBI IC3, MS-ISAC, CERT-EU, BSI, NCSC-UK, JPCERT, plus Mandiant / CrowdStrike / SentinelOne) gets 0/192 refusals across an 8-judge cross-vendor panel.\n>\n>   3. Two distinct refusal trigger modes surfaced (student-content-driven on the original record; prompt-context-conditioned on an unrelated CDN-attacker record paired with the CISA MAR PDF prompt). Methodology-correction arc: an initial 53 % refusal claim was 15 upstream API errors + 1 genuine refusal — the corrected finding is narrower but cleaner.\n>\n>\n\n\nThanks,\nfiskkrok",
  "title": "Arxiv cs.CR endorsement request — three preprints on LLM/security"
}