{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreid3qnfwhpv3gojdfstcmhd4nx7mavjsfgg365yb6rdjcwuvg27xsy",
"uri": "at://did:plc:lk3jfj3zq4k4wxnk474axylu/app.bsky.feed.post/3mm2qanftfdi2"
},
"path": "/t/i-built-arsenic-a-tool-to-analyse-what-actually-changes-when-you-upgrade-models/1381153#post_1",
"publishedAt": "2026-05-17T14:58:27.000Z",
"site": "https://community.openai.com",
"textContent": "Hi\nI’ve been working on something called ARSENIC that attempts to show what changes occur when upgrading between models. The idea was that it runs a structured probe suite against two model endpoints in parallel, compares the responses across seven dimensions (morphology, tone, factual accuracy, schema compliance, instruction following, refusal boundaries, and sentence-level claim cross-matching), and produces a migration report that tells you what actually changed and whether it’s safe to upgrade. I ran it against the GPT-4o-mini → GPT-4.1-mini transition this week with the headline finding:\n\n * Safe to upgrade: **true**\n\n * GPT-4.1-mini is **45% faster**\n\n * More concise across open-ended prompts — ContentCompression drift on 10 probes\n\n * 2 probes warrant review: a bullet list formatting regression and a historical date content omission\n\n * Validated prompt patches generated for both\n\n\n\n\nThe claim cross-matching is the part I’m most interested in. Rather than whole-response cosine similarity (which misses most of what actually matters), it extracts informationally significant sentences, identifies anchors — specific numbers, dates, named entities — and cross-matches at the sentence level. A response that drops “the rate is 4.5%” and replaces it with “rates vary” is a different finding from one that says the same thing differently. The former is a regression. The latter is compression drift.\n\nThere’s also a `reconcile` subcommand for single-prompt repair — you give it a prompt and two endpoints, it analyses the behavioural gap and try to generate a validated prompt patch that gives the same result as the source model. 
Useful when you have a specific production prompt that broke and want to find a fix without running the full suite.\n\nThe full report from the GPT-4o-mini → GPT-4.1-mini run is here if you want to see what the output looks like before building anything: markndg.github.io/arsenic/examples/gpt-4o-mini_vs_gpt-4.1-mini.html\n\nGitHub: github.com/markndg/arsenic\n\nWritten in Rust. Model-agnostic: works with OpenAI, Anthropic, Google, Ollama, and anything OpenAI-compatible. Licensed under Apache 2.0.\n\nI’m interested in any feedback, particularly from people who’ve been through model migrations and hit unexpected behaviour changes.",
"title": "I built ARSENIC - a tool to analyse what actually changes when you upgrade models"
}