I built ARSENIC - a tool to analyse what actually changes when you upgrade models
Hi, I’ve been working on a tool called ARSENIC that shows what changes when you upgrade from one model to another. It runs a structured probe suite against two model endpoints in parallel, compares the responses across seven dimensions (morphology, tone, factual accuracy, schema compliance, instruction following, refusal boundaries, and sentence-level claim cross-matching), and produces a migration report that tells you what actually changed and whether it’s safe to upgrade. I ran it against the GPT-4o-mini → GPT-4.1-mini transition this week. The headline findings:
Safe to upgrade: true
GPT-4.1-mini is 45% faster
More concise across open-ended prompts — ContentCompression drift on 10 probes
2 probes warrant review: a bullet list formatting regression and a historical date content omission
Validated prompt patches generated for both
The claim cross-matching is the part I’m most interested in. Rather than whole-response cosine similarity (which misses most of what actually matters), it extracts informationally significant sentences, identifies anchors — specific numbers, dates, named entities — and cross-matches at the sentence level. A response that drops “the rate is 4.5%” and replaces it with “rates vary” is a different finding from one that says the same thing differently. The former is a regression. The latter is compression drift.
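To make the idea concrete, here is a minimal sketch of sentence-level anchor matching. It is not ARSENIC's actual code: the names (`Finding`, `sentences`, `anchors`, `cross_match`) are made up for illustration, and it assumes that an "anchor" is simply any token containing a digit, which ignores named entities entirely.

```rust
use std::collections::BTreeSet;

/// Outcome for one anchored source sentence (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum Finding {
    /// Some single target sentence carries every anchor of the source sentence.
    Preserved,
    /// Even the best-matching target sentence lacks these anchors.
    AnchorDropped(Vec<String>),
}

/// Naive sentence splitter: breaks on '.', '!' or '?' when followed by
/// whitespace or end of text, so "4.5%" stays in one piece.
fn sentences(text: &str) -> Vec<String> {
    let mut out = Vec::new();
    let mut current = String::new();
    let mut chars = text.chars().peekable();
    while let Some(c) = chars.next() {
        current.push(c);
        let boundary = matches!(c, '.' | '!' | '?')
            && chars.peek().map_or(true, |n| n.is_whitespace());
        if boundary {
            let s = current.trim();
            if !s.is_empty() {
                out.push(s.to_string());
            }
            current.clear();
        }
    }
    let s = current.trim();
    if !s.is_empty() {
        out.push(s.to_string());
    }
    out
}

/// Assumption: an anchor is any whitespace-separated token containing a digit,
/// with surrounding punctuation stripped (a '%' sign is kept).
fn anchors(sentence: &str) -> BTreeSet<String> {
    sentence
        .split_whitespace()
        .map(|tok| tok.trim_matches(|c: char| !c.is_alphanumeric() && c != '%'))
        .filter(|tok| tok.chars().any(|c| c.is_ascii_digit()))
        .map(|tok| tok.to_lowercase())
        .collect()
}

/// For every source sentence that has anchors, find the target sentence
/// missing the fewest of them and report what is still absent.
fn cross_match(source: &str, target: &str) -> Vec<(String, Finding)> {
    let target_anchors: Vec<BTreeSet<String>> =
        sentences(target).iter().map(|s| anchors(s)).collect();
    sentences(source)
        .into_iter()
        .filter_map(|s| {
            let wanted = anchors(&s);
            if wanted.is_empty() {
                return None; // no anchors: not treated as a significant claim here
            }
            let missing: Vec<String> = target_anchors
                .iter()
                .map(|t| wanted.difference(t).cloned().collect::<Vec<_>>())
                .min_by_key(|m| m.len())
                .unwrap_or_else(|| wanted.iter().cloned().collect());
            let finding = if missing.is_empty() {
                Finding::Preserved
            } else {
                Finding::AnchorDropped(missing)
            };
            Some((s, finding))
        })
        .collect()
}

fn main() {
    let source = "The base rate is 4.5% as of 2023. It may change.";
    let vague = "Rates vary. They may change.";
    let reworded = "As of 2023, the base rate sits at 4.5%.";
    println!("{:?}", cross_match(source, vague));
    println!("{:?}", cross_match(source, reworded));
}
```

In this toy version the vague response is flagged because both anchors disappear, while the reworded one passes even though the sentence looks quite different. A real implementation would need a proper sentence splitter and entity recognition, and would still have to decide which unanchored sentences count as informationally significant.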
There’s also a reconcile subcommand for single-prompt repair: you give it a prompt and two endpoints, and it analyses the behavioural gap and tries to generate a validated prompt patch that makes the target model give the same result as the source model. Useful when you have a specific production prompt that broke and want to find the fix without running the full suite.
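As a rough illustration of the repair loop, here is a sketch under heavy assumptions. The function `reconcile`, the stub models, and the candidate patches are all hypothetical, and "validated" is reduced to an exact string match against the source model's output. The real subcommand presumably compares behaviour in a more tolerant way.

```rust
/// Try each candidate patch appended to the prompt and return the first
/// patched prompt for which the target model reproduces the source output.
/// Returns the original prompt if the two models already agree.
fn reconcile<S, T>(prompt: &str, source: S, target: T, patches: &[&str]) -> Option<String>
where
    S: Fn(&str) -> String,
    T: Fn(&str) -> String,
{
    let expected = source(prompt);
    if target(prompt) == expected {
        return Some(prompt.to_string());
    }
    patches
        .iter()
        .map(|patch| format!("{prompt}\n{patch}"))
        .find(|candidate| target(candidate.as_str()) == expected)
}

/// Stub standing in for the source endpoint: always answers with bullets.
fn old_model(_prompt: &str) -> String {
    "- a\n- b".to_string()
}

/// Stub standing in for the target endpoint: only uses bullets when asked.
fn new_model(prompt: &str) -> String {
    if prompt.contains("bullet list") {
        "- a\n- b".to_string()
    } else {
        "a, b".to_string()
    }
}

fn main() {
    let patches = ["Be concise.", "Format the answer as a bullet list."];
    match reconcile("List the options.", old_model, new_model, &patches) {
        Some(fixed) => println!("validated prompt:\n{fixed}"),
        None => println!("no candidate patch closed the gap"),
    }
}
```

The point of the sketch is only the shape of the loop: capture the source behaviour once, then accept a patch only after the target model has actually been re-run with it.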
Full report from the GPT-4o-mini → GPT-4.1-mini run is here if you want to see what the output looks like before building anything: markndg.github.io/arsenic/examples/gpt-4o-mini_vs_gpt-4.1-mini.html
GitHub: github.com/markndg/arsenic
Written in Rust. Model-agnostic — works with OpenAI, Anthropic, Google, Ollama, anything OpenAI-compatible. Apache 2.0.
Interested in any feedback, particularly from people who’ve been through model migrations and hit unexpected behaviour changes.