{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreihxg3kqvxye4utz6o3iz377bjheacltdmxfhyurcbwflrs2sitbzm",
"uri": "at://did:plc:lk3jfj3zq4k4wxnk474axylu/app.bsky.feed.post/3mm47cbtrddj2"
},
"path": "/t/i-built-arsenic-a-tool-to-analyse-what-actually-changes-when-you-upgrade-models/1381153#post_5",
"publishedAt": "2026-05-18T05:52:20.000Z",
"site": "https://community.openai.com",
"textContent": "Your questionnaire approach for probing reasoning stability and ethics operationalisation is interesting: it targets a different layer, how the model thinks about itself rather than how its outputs change.\nThe combination you’re describing (structured introspection probes alongside output-level behavioural comparison) would be genuinely useful. ARSENIC’s probe format is TOML and straightforward to extend, so adding a set of diagnostic probes alongside the standard suite wouldn’t be hard. And since ARSENIC already surfaces answer differences as it runs and validates, it’s certainly worth exploring. Thanks for the comment!",
"title": "I built ARSENIC - a tool to analyse what actually changes when you upgrade models"
}