I built ARSENIC - a tool to analyse what actually changes when you upgrade models
OpenAI Developer Community
May 18, 2026
Your questionnaire approach for probing reasoning stability and ethics operationalisation is interesting. It looks at a different layer: how the model thinks about itself, rather than how its outputs change.
The combination you’re describing, structured introspection probes alongside output-level behavioural comparison, would be genuinely useful. ARSENIC’s probe format is TOML and straightforward to extend, so it wouldn’t be hard to add a set of diagnostic probes alongside the standard suite. And since ARSENIC already sees answer differences as it runs and validates, it’s certainly worth exploring. Thanks for the comment!