Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreihxg3kqvxye4utz6o3iz377bjheacltdmxfhyurcbwflrs2sitbzm",
    "uri": "at://did:plc:lk3jfj3zq4k4wxnk474axylu/app.bsky.feed.post/3mm47cbtrddj2"
  },
  "path": "/t/i-built-arsenic-a-tool-to-analyse-what-actually-changes-when-you-upgrade-models/1381153#post_5",
  "publishedAt": "2026-05-18T05:52:20.000Z",
  "site": "https://community.openai.com",
  "textContent": "Your questionnaire approach for probing reasoning stability and ethics operationalisation is interesting: it examines a different layer, how the model thinks about itself, rather than how its outputs change.\nThe combination you’re describing, structured introspection probes alongside output-level behavioural comparison, would be genuinely useful. ARSENIC’s probe format is TOML and straightforward to extend, so it wouldn’t be hard to add a set of diagnostic probes alongside the standard suite. Since ARSENIC already sees answer differences as it runs and validates, it’s certainly worth exploring. Thanks for the comment!",
  "title": "I built ARSENIC - a tool to analyse what actually changes when you upgrade models"
}