{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreie6aypxvq7twa7bz6ifgfegr6yzm4arr6a6wi3edglm452fojx76e",
"uri": "at://did:plc:5opbpi2nomj4y3d5kpwamkrd/app.bsky.feed.post/3mlns562fzlb2"
},
"coverImage": {
"$type": "blob",
"ref": {
"$link": "bafkreicxcx7hcr72hk45c4vsmp32fjvbsxowy4usdcyziljqbbwqwnruwm"
},
"mimeType": "image/jpeg",
"size": 621191
},
"description": "The uncomfortable gap between “can edit” and “can be trusted”\n\nA lot of current AI enthusiasm is built around delegation.\n\nWe no longer ask language models only to answer questions. We ask them to modify source code, rewrite reports, refactor configuration files, reorganize spreadsheets, update structured records, transform diagrams, edit subtitles, and operate across entire project folders. In software engineering this is often described as “vibe coding”, but the pattern is broader: a human giv",
"path": "/llms-corrupt-your-documents-when-you-delegate/",
"publishedAt": "2026-05-12T12:29:31.000Z",
"site": "https://corti.com",
"tags": [
"arXiv",
"GitHub"
],
"textContent": "_The uncomfortable gap between “can edit” and “can be trusted”_\n\nA lot of current AI enthusiasm is built around delegation.\n\nWe no longer ask language models only to answer questions. We ask them to modify source code, rewrite reports, refactor configuration files, reorganize spreadsheets, update structured records, transform diagrams, edit subtitles, and operate across entire project folders. In software engineering this is often described as “vibe coding”, but the pattern is broader: a human gives a goal, the AI system manipulates artifacts, and the human supervises at a distance.\n\nThat is exactly the scenario Microsoft Research studies in the paper **“LLMs Corrupt Your Documents When You Delegate”**. The paper introduces **DELEGATE-52** , a benchmark for long-horizon delegated document editing across 52 professional domains, and the result is sobering: even strong frontier models can silently degrade documents over repeated editing workflows. The paper reports that frontier models such as Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 corrupted an average of about **25% of document content** by the end of long simulated workflows, while the average degradation across all evaluated models was substantially worse. (arXiv)\n\nThis is not a paper about models refusing instructions. 
It is about models trying to do the work, mostly following the request, and still damaging the artifact.\n\nThat distinction matters.\n\n## What DELEGATE-52 evaluates\n\nDELEGATE-52 is designed to answer a practical question:\n\n> If I hand an LLM a set of professional documents and ask it to perform a sequence of realistic edits, how much of the original document remains semantically intact after repeated delegation?\n\nThe benchmark contains work environments across 52 domains, including examples such as Python code, Docker files, database schemas, Graphviz diagrams, recipes, subtitles, accounting ledgers, genealogy records, chess notation, music notation, crystallography files, 3D object files, calendars, transit data, and more. The paper groups these domains into categories such as **Code & Configuration**, **Science & Engineering**, **Creative & Media**, **Structured Records**, and **Everyday** documents. (arXiv)\n\nEach work environment contains:\n\n 1. **Seed documents**\nReal documents found online, not synthetic templates. They are typically in textual, unencoded formats and range from roughly 2,000 to 5,000 tokens.\n 2. **Edit tasks**\nRealistic, non-trivial transformations a user might ask an AI system to perform. Examples include splitting an accounting ledger by category, converting amounts, reformatting records, or restructuring a document.\n 3. **Distractor context**\nRelated but irrelevant files, meant to simulate a realistic workspace where retrieval is not perfect and the model sees more than just the one file it needs. The paper describes distractor context in the 8,000–12,000 token range. (arXiv)\n\n\n\nThis matters because real-world AI delegation rarely happens in a clean prompt containing only the relevant data. 
It happens in messy repositories, document libraries, folders, SharePoint sites, wiki exports, generated artifacts, old versions, and “probably relevant” files returned by a search or retrieval system.\n\n## The clever part: round-trip editing\n\nOne of the hardest problems in evaluating document editing is that you often do not have a perfect reference answer.\n\nFor a coding benchmark, you might have unit tests. For a math benchmark, you might have a known answer. For a document transformation task, reference answers are expensive to create and domain-specific. How do you know whether the result is semantically equivalent?\n\nDELEGATE-52 solves this with a **round-trip relay**.\n\nInstead of evaluating a single one-way edit, each task is defined as a pair:\n\n\n Original document\n ↓ forward edit\n Transformed document\n ↓ backward edit\n Reconstructed document\n\n\nA perfect model should be able to apply the forward transformation and then apply the inverse transformation, returning the document to its original semantic state.\n\nFor example:\n\n\n Forward task:\n Split this ledger into separate files by expense category.\n\n Backward task:\n Merge the category files back into one chronological ledger.\n\n\nIf the reconstructed ledger differs from the original ledger, something was lost, altered, duplicated, reordered incorrectly, or hallucinated.\n\nThe benchmark then chains multiple round trips together. Ten round trips equal 20 model interactions. The paper calls this a **relay**, and it is designed to simulate long delegated workflows rather than isolated prompt-response interactions. (arXiv)\n\nThe core metric is the **Reconstruction Score**, or **RS@k**, which measures how well the document is preserved after `k` interactions using domain-specific similarity functions. The repository describes this directly: round trips are chained, and performance is measured by comparing the recovered document against the original using domain-specific evaluators. 
(GitHub)\n\n## Why generic similarity is not enough\n\nA key contribution of the benchmark is that it does **not** rely only on generic text similarity, Levenshtein distance, embeddings, or an LLM judge.\n\nThat would be too weak.\n\nA recipe where `200g butter` becomes `800g butter` may look textually similar but is semantically broken. A DNS zone file with one incorrect record can be operationally dangerous. A calendar entry with the wrong date is not “mostly right”. A source file with a small but critical logic change can still compile and be wrong.\n\nDELEGATE-52 therefore uses domain-specific parsers and evaluators. For a recipe, the parser might extract ingredients, quantities, units, steps, and tips. For another domain, it might parse structured records, source files, metadata, geometry, accounting entries, or notation. The paper states that these domain-specific similarity functions were designed to capture semantic equivalence and that generic similarity measures, including LLM-as-judge approaches, failed to capture nuanced semantic differences reliably. (arXiv)\n\nThis is one of the most important ideas in the paper: **document correctness is domain-specific**.\n\nThere is no universal “looks fine to me” metric that can reliably validate all delegated work.\n\n## The headline result: degradation compounds\n\nThe paper evaluates 19 LLMs from six model families, including OpenAI, Anthropic, Google Gemini, Mistral, xAI, and Moonshot models. The main experiment uses 20 delegated interactions over work environments with seed documents plus distractor context. (arXiv)\n\nThe reported results show that models degrade documents over time. The paper highlights that frontier models lose roughly a quarter of document content by the end of long workflows, and that across all models the average degradation is around 50%. (arXiv)\n\nThe most important lesson is not just that models make mistakes. 
We already knew that.\n\nThe important lesson is that **short tests are misleading**.\n\nThe paper gives examples where two models perform similarly after two interactions but diverge substantially by the twentieth interaction. Conversely, one model may start behind another and overtake it later. The authors explicitly warn that short interaction simulations are insufficient for understanding long-horizon delegated performance. (arXiv)\n\nThat has direct implications for how we evaluate AI-assisted engineering tools.\n\nA demo where an AI assistant successfully edits one file is not evidence that it can safely maintain a project over 50 edits. A benchmark where a model performs a single transformation does not tell us whether it preserves invariants across repeated transformations. A one-shot code refactor may pass, while a multi-step repository migration slowly accumulates incorrect assumptions.\n\n## Tool use did not magically fix the problem\n\nA common intuition is that agents with tools should perform better than plain LLMs. Give the model file-system access, read/write tools, Python execution, and a multi-turn loop, and surely it should preserve documents more reliably.\n\nThe DELEGATE-52 repository includes exactly this kind of agentic harness: `model_agentic.py`, where the LLM can use tools such as reading files, writing files, deleting files, and running Python in a multi-turn loop. (GitHub)\n\nThe paper’s finding is important: **basic agentic tool use did not improve performance on DELEGATE-52**. (arXiv)\n\nThat does not mean tools are useless. It means that merely wrapping a model in a tool loop is not enough. 
The agent still needs robust planning, state tracking, validation, rollback, diff awareness, semantic checks, and domain-specific correctness tests.\n\nA tool-using LLM that confidently writes corrupted files is still a corruption engine.\n\n## The failures are sparse but severe\n\nOne of the most interesting sections of the paper analyzes how the degradation happens.\n\nAt first glance, aggregate curves can make degradation look smooth, as if every interaction introduces a small amount of noise. But the paper’s deeper analysis says that is not the main failure mode.\n\nInstead, models often preserve the document reasonably well for some steps, then suffer **critical failures**: individual round trips that drop the score by 10 points or more. The authors report that these sparse critical failures explain about **80% of total document degradation**. Stronger models do not necessarily eliminate the failure mode; they delay it or experience it less often. (arXiv)\n\nThis is exactly the kind of failure that is dangerous in real delegated work.\n\nA model can look reliable for several operations, building user trust, and then silently introduce one severe corruption:\n\n * a field is dropped from a structured record;\n * a financial amount is changed;\n * a calendar recurrence rule is mangled;\n * a source file loses an edge case;\n * a dependency version is changed incorrectly;\n * a music notation file remains syntactically plausible but musically wrong;\n * a 3D object file renders differently;\n * a translation preserves style but loses a constraint.\n\n\n\nThe user may not notice because the document still “looks” valid.\n\n## Deletion versus corruption\n\nThe paper distinguishes between two broad degradation patterns:\n\n 1. **Deletion**: content disappears.\n 2. 
**Corruption**: content remains present but becomes incorrect.\n\n\n\nThis distinction is critical.\n\nThe paper finds that weaker models tend to lose content through deletion, while frontier models more often corrupt content that is still present. (arXiv)\n\nFrom a user perspective, corruption is often worse than deletion.\n\nMissing content can sometimes be spotted. Incorrect content that remains structurally plausible is harder to detect. A missing row in a ledger is bad; a row with the wrong amount, currency, date, or account can be worse. A removed test is visible in a diff; a subtly weakened assertion may not be. A missing DNS record may cause an outage; an incorrect DNS record may route traffic somewhere unintended.\n\nThis is why “the model preserved most of the file” is not enough. Preservation must be semantic, not cosmetic.\n\n## Structured domains perform better, but only relatively\n\nDELEGATE-52 shows that performance varies significantly by domain.\n\nThe paper reports that models perform better in programmatic and structured domains, such as Python and database schemas, and worse in natural language or niche domains such as recipes, fiction, transit, or textile-related formats. It also notes better performance in domains with high repetitiveness and structural density, and weaker performance in domains with rich, unrepeated vocabulary. 
(arXiv)\n\nThat fits what many practitioners observe.\n\nLLMs are comparatively strong where:\n\n * syntax is explicit\n * structure is repetitive\n * constraints are local\n * validators exist\n * tests can be executed\n * there are many examples in training data\n * the domain has machine-checkable invariants\n\n\n\nThey are weaker where:\n\n * correctness depends on domain semantics\n * the document is long and irregular\n * many entities must be tracked globally\n * subtle changes matter\n * there is no easy validator\n * the format is rare or specialized\n * human review requires expertise\n\n\n\nThis is also why AI coding often feels ahead of AI document editing in other professional domains. Code has compilers, tests, linters, type checkers, schemas, package managers, and runtime behavior. Many other professional documents do not have such rich verification infrastructure.\n\n## Global restructuring is hard\n\nThe benchmark tags edit tasks by semantic operations such as sorting, merging, splitting, classification, string manipulation, and referencing. The paper finds that tasks requiring **global document restructuring**, such as split-and-merge operations or classification across the whole document, are harder than local operations such as string manipulation. Tasks requiring multiple coordinated operations are harder still. 
(arXiv)\n\nThis is highly relevant for real workflows.\n\nThe risky tasks are not necessarily simple edits like:\n\n\n Rename this heading.\n Fix this typo.\n Change this variable name.\n Convert this field from snake_case to camelCase.\n\n\nThe risky tasks are more like:\n\n\n Refactor this module into three smaller modules while preserving behavior.\n\n Split this specification into separate requirement documents grouped by subsystem.\n\n Normalize this spreadsheet into separate tables and regenerate the summary.\n\n Convert this accounting ledger into another format and preserve all balances.\n\n Reorganize this policy document by audience and remove duplicates.\n\n Merge these calendar files and preserve recurrence rules.\n\n\nThose tasks require the model to maintain a global mental model of the artifact. That is exactly where small mistakes become structural corruption.\n\n## The image editing result is even worse\n\nThe paper also explores whether the methodology applies beyond text by creating visual work environments for image editing models. The result is even more severe: the authors report that image editing models degrade images much faster than LLMs degrade text. The best image models achieved final reconstruction scores around 28–30%, compared with roughly 70–80% for textual domains, and no image model exceeded 65% after only two interactions. (arXiv)\n\nThis is relevant because “document” should be interpreted broadly.\n\nMany professional artifacts are not plain prose:\n\n * diagrams\n * CAD-like files\n * screenshots\n * design files\n * charts\n * maps\n * slides\n * images\n * audio metadata\n * video subtitles\n * 3D assets\n\n\n\nDelegated editing of these artifacts needs even stronger validation because visual plausibility is not the same as fidelity.\n\n## The repository\n\nMicrosoft released the code in the **microsoft/DELEGATE52** repository. 
The repository contains the benchmark harness, prompts, domain-specific parsers/evaluators, and experiment runners. The README describes DELEGATE-52 as a benchmark for evaluating LLMs on long-horizon delegated document editing across 52 professional domains, and it points to the dataset hosted on Hugging Face. (GitHub)\n\nThe key files are:\n\n\n run_relay.py      Main experiment runner for chained round-trip edits\n run_single.py     Runs individual forward/backward edit pairs\n model_openai.py   OpenAI / Azure OpenAI model wrapper\n model_agentic.py  Tool-using agent harness\n domains/          Domain-specific parsers and evaluators\n prompts/          Prompt templates used during simulation\n\n\nThe public dataset contains the redistributable subset: 234 work environments across 48 domains, each with seed documents, 5–10 reversible edit pairs, and distractor context. The repository README also includes a basic example for running a relay simulation, with a clear warning that simulations call LLM APIs and therefore cost real money. (GitHub)\n\nExample command from the repository:\n\n\n python run_relay.py --model_names gpt-5.4 --domains subtitles --num_round_trips 10\n\n\nFor practitioners, the repository is useful not only as a benchmark, but as a design pattern: create domain-specific round-trip tasks, parse the resulting artifacts, and measure semantic preservation over repeated edits.\n\n## What this means for AI-assisted engineering\n\nFor software engineers, the paper should feel familiar.\n\nWe already know that AI coding assistants can produce impressive results and still introduce subtle defects. 
The difference is that software engineering has a mature validation culture:\n\n * version control\n * diffs\n * pull requests\n * tests\n * static analysis\n * CI/CD\n * type systems\n * linters\n * code review\n * runtime monitoring\n * rollback\n\n\n\nThe paper’s central message is that all delegated document workflows need a similar discipline.\n\nThe more autonomy we give an AI system, the more we need artifact-level safety mechanisms.\n\nA good AI-assisted workflow should therefore include:\n\n### 1. Version every artifact\n\nNever let an AI agent mutate important documents without version history.\n\nFor code, this means Git. For documents, it may mean SharePoint versioning, OneDrive history, document snapshots, object storage versioning, or explicit pre/post copies. The ability to diff and roll back is not optional.\n\n### 2. Prefer patch-based edits over full rewrites\n\nA model that rewrites an entire file has far more opportunity to corrupt unrelated content.\n\nWhere possible, ask for minimal patches:\n\n\n Only modify the sections required for this change.\n Return a unified diff.\n Do not rewrite unrelated sections.\n Preserve all existing identifiers, values, comments, and ordering unless explicitly instructed.\n\n\nThis does not eliminate risk, but it reduces the blast radius.\n\n### 3. Use domain-specific validators\n\nGeneric LLM review is not enough.\n\nUse validators that understand the artifact:\n\n\n Code             tests, type checks, linters, static analysis\n JSON/YAML        schema validation\n Terraform/Bicep  plan validation, policy checks\n SQL              schema diffing, migration tests\n Spreadsheets     formula checks, row/column invariants\n Accounting       balanced entries, totals, currency checks\n Calendars        recurrence validation\n Subtitles        timing overlap checks\n DNS              zone validation\n Diagrams         parse/render validation\n\n\nThis is exactly the spirit of DELEGATE-52’s domain-specific evaluators.\n\n### 4. 
Detect invariant violations\n\nBefore delegating, define what must remain true.\n\nExamples:\n\n\n The number of invoices must not change.\n All customer IDs must be preserved.\n The total balance must remain identical.\n No dates may be changed unless explicitly requested.\n All source citations must remain attached to their claims.\n All tests that passed before must pass after.\n All public method signatures must remain compatible.\n\n\nThen validate those invariants mechanically where possible.\n\n### 5. Keep humans in the loop for semantic review\n\nThe paper explicitly warns users not to generalize capability from one domain to another and says users still need to closely monitor LLM systems when delegating work. (arXiv)\n\nThe right level of review depends on risk.\n\nFor low-risk prose drafts, lightweight review may be enough. For production code, financial documents, legal text, medical records, security configuration, infrastructure changes, or customer-facing data, delegated edits should go through rigorous review.\n\n### 6. Treat long workflows differently from single edits\n\nThe paper shows that short interaction performance is not predictive of long-horizon performance. (arXiv)\n\nThat means evaluation should match the intended workflow. If your agent will perform 30 steps, do not evaluate it with one-step tasks. If your assistant will edit entire repositories, do not validate it only on isolated snippets. If your process includes retrieved context, include distractor documents in evaluation.\n\n### 7. Build rollback into agentic systems\n\nAn AI agent should not only edit. 
It should be able to checkpoint, validate, fail, rollback, and explain.\n\nA safer architecture looks like this:\n\n\n Input workspace\n ↓\n Create snapshot\n ↓\n Plan changes\n ↓\n Apply minimal patch\n ↓\n Run validators\n ↓\n Compare semantic invariants\n ↓\n Summarize diff\n ↓\n Human approval or automatic rollback\n\n\nWithout this loop, tool use may only make the model faster at corrupting files.\n\n## A practical delegated-editing checklist\n\nBefore letting an LLM edit important documents, ask:\n\n\n Do I have a clean pre-edit snapshot?\n Can I see a precise diff?\n Can I validate the file syntactically?\n Can I validate it semantically?\n Are there domain-specific invariants?\n Are unrelated sections protected?\n Can I roll back automatically?\n Is the task local or global?\n Is there distractor context that may confuse the model?\n How many sequential edits will happen?\n Do I have tests that reflect the real task, not just a demo?\n\n\nFor AI-assisted coding, that translates into:\n\n\n Use small commits.\n Use feature branches.\n Require tests after every agent step.\n Ask for plans before edits.\n Review diffs carefully.\n Prefer constrained file scopes.\n Run formatters and linters.\n Run unit and integration tests.\n Use static analysis and dependency checks.\n Do not accept broad rewrites without review.\n\n\nFor RAG and document automation systems, it translates into:\n\n\n Preserve source references.\n Validate extracted entities.\n Track document lineage.\n Compare pre/post structured representations.\n Use schema-aware parsing.\n Flag changed numbers, dates, names, IDs, and citations.\n Evaluate over multi-step workflows, not just one answer.\n\n\n## Why this paper matters\n\nThe core insight of **“LLMs Corrupt Your Documents When You Delegate”** is not that LLMs are bad. In fact, the paper also shows rapid progress: it notes that GPT-family benchmark performance improved significantly between the tested GPT 4o and GPT 5.4 models. 
(arXiv)\n\nThe real message is more nuanced:\n\n> LLMs are becoming capable enough to delegate work to, but not reliable enough to trust without verification.\n\nThat is the dangerous middle.\n\nWhen models were obviously weak, users did not trust them. When models become near-perfect, delegation will be safer. But today’s systems often live in between: impressive, useful, productive, and still capable of silent severe corruption.\n\nThat makes evaluation, validation, and workflow design critical.\n\nDELEGATE-52 gives us a useful language for this problem. It shifts the conversation from “Can the model do the task once?” to “Can the model preserve the artifact over a long delegated workflow?” That is the right question for AI-assisted engineering, document automation, enterprise copilots, and agentic systems.\n\n## Conclusion\n\nDelegation is not just prompting at a larger scale. It is an operational model where an AI system mutates valuable artifacts on behalf of a human.\n\nThat requires trust.\n\nThe Microsoft Research paper shows that today’s LLMs can still violate that trust in subtle and severe ways. They often attempt the task. They may produce plausible output. They may succeed for several steps. But over long workflows, errors compound, critical failures appear, and documents can become corrupted.\n\nThe practical takeaway is clear:\n\nUse LLMs aggressively, but do not delegate blindly.\n\nTreat AI-generated edits like untrusted code changes: version them, diff them, validate them, test them, and review them. The future of AI-assisted work is not “let the model edit everything.” It is **model capability plus engineering discipline**.\n\nThat is where reliable delegation starts.",
"title": "LLMs Corrupt Your Documents When You Delegate",
"updatedAt": "2026-05-12T12:29:32.065Z"
}