Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreifftis6e66csjzx56ljdhiq4ziiddcgt4tbtbf3senvf7jn6chbfi",
    "uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3moh6gtpaaja2"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreigvn5hg323yyfentsg54jdpf4pszesn3sz46fiufeyjzxtpaauxzy"
    },
    "mimeType": "image/webp",
    "size": 68254
  },
  "path": "/cryptokeesan/why-the-retry-loop-is-usually-the-expensive-part-of-agent-work-1e35",
  "publishedAt": "2026-06-17T01:20:28.000Z",
  "site": "https://dev.to",
  "tags": [
    "ai",
    "devtools",
    "automation"
  ],
  "textContent": "The first failure is usually not the expensive one.\n\nThe expensive part is what happens after the first failure, when the system keeps trying, keeps spending, and keeps producing the same outcome because nothing about the situation changed.\n\nWe kept running into a simple pattern: the agent would miss a step, the runtime would retry, the next attempt would see the same state, and the loop would repeat until the cost was visible in the bill or the operator log. At that point the problem stops being a model-quality issue and becomes a control-system issue.\n\n## Why the loop hurts more than the mistake\n\nA single bad step is recoverable. An unbounded retry loop compounds the mistake.\n\nThat is true for token spend, API calls, and operator attention. It is also true for trust. Once a system gets a reputation for wandering, people stop letting it touch real work.\n\nThe failure mode is boring, which is why it gets missed. Nobody looks at a happy-path demo and thinks about what happens after the third identical error. But that is where the real cost lives.\n\n## What we tried first\n\nThe obvious moves are usually the wrong ones:\n\n  * make the prompt longer\n  * add a generic retry\n  * increase the timeout\n  * let the model reason more\n  * rerun the same command with slightly different wording\n\nThose changes can make a demo look better, but they do not fix a stuck loop.\n\nIf the environment is unchanged, a retry is often just a second copy of the same mistake.\n\n## What actually worked\n\nThe fix was not smarter language. It was stricter boundaries.\n\nWe had to make the runtime answer four questions before it kept going:\n\n  1. What is the budget?\n  2. What counts as success?\n  3. What is the verifier?\n  4. What happens when the same failure repeats?\n\nA small policy block is often enough to make that concrete:\n\n    {\n      \"budget_cap\": 250,\n      \"max_attempts\": 3,\n      \"stop_on_same_error\": true,\n      \"require_verifier\": true,\n      \"emit_receipt\": true\n    }\n\nThat does not sound ambitious. That is the point.\n\nThe biggest reliability gain came from refusing to treat repeated failure as progress. Once the runtime could detect the same blocker two or three times in a row, it had permission to stop instead of pretending the next rerun would somehow be different.\n\n## Why receipts matter\n\nReceipts turn a run from a vague story into a checkable fact.\n\nA receipt should show:\n\n  * what the agent tried\n  * what changed\n  * what failed\n  * why the run stopped\n\nWithout that, a loop can hide inside a confident-sounding summary. With it, you can see the exact stopping point and decide whether the next action should be a human intervention, a different tool, or no action at all.\n\nThat is also why this kind of work ends up feeling less like prompt engineering and more like operations.\n\n## The tradeoff\n\nStricter control means the system stops earlier.\n\nThat can feel annoying when you want the agent to push through friction. But earlier stopping is cheaper than a long blind retry sequence. More importantly, it preserves operator trust.\n\nA bounded agent is less flashy than an agent that never gives up. It is also much more usable.\n\nThat is the core of the control-layer approach we keep coming back to in MartinLoop: the runtime should know when to stop, when to ask for help, and when to write down what happened.\n\n## What we are watching next\n\nThe next improvement is not more retries.\n\nIt is better failure classification so the runtime can separate:\n\n  * missing permission\n  * stale state\n  * tool mismatch\n  * external outage\n  * real task completion\n\nWhen those are distinct, the system can choose a better next step instead of recycling the same command.\n\nThat is the line between an agent that looks autonomous and an agent that is actually operable.\n\nWhat failure shape are you still letting your runtime retry too many times?",
  "title": "Why the retry loop is usually the expensive part of agent work"
}
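
The policy block quoted in the record's `textContent` can be read as the inputs to a bounded retry loop. The sketch below is one possible reading in Python, not MartinLoop's implementation. The names `run_bounded`, `Receipt`, `step`, `verifier`, and `cost_per_attempt` are hypothetical, and because the article does not give a unit for `budget_cap`, the code treats it as abstract cost units. The receipt records what was tried, what failed, and why the run stopped. It leaves out "what changed", which depends on the environment.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional, Tuple

# Values copied from the article's policy block. The unit of budget_cap is
# not stated there; this sketch assumes abstract cost units.
POLICY = {
    "budget_cap": 250,
    "max_attempts": 3,
    "stop_on_same_error": True,
    "require_verifier": True,
    "emit_receipt": True,
}


@dataclass
class Receipt:
    """Hypothetical receipt: attempts made, cost spent, and the stop reason."""
    attempts: list = field(default_factory=list)
    spent: int = 0
    stop_reason: str = ""


def run_bounded(
    step: Callable[[], Tuple[Any, Optional[str]]],
    verifier: Optional[Callable[[Any], bool]],
    policy: dict,
    cost_per_attempt: int,
) -> Receipt:
    """Run step() under the policy. step() returns (result, error),
    where error is None when the step itself reported no failure."""
    if policy["require_verifier"] and verifier is None:
        raise ValueError("policy requires a verifier")

    receipt = Receipt()
    last_error = None
    for attempt in range(1, policy["max_attempts"] + 1):
        # Stop before spending if the next attempt would exceed the budget.
        if receipt.spent + cost_per_attempt > policy["budget_cap"]:
            receipt.stop_reason = "budget_exhausted"
            return receipt

        result, error = step()
        receipt.spent += cost_per_attempt

        # A step that reports no error still has to pass the verifier.
        if error is None and verifier is not None and not verifier(result):
            error = "verifier_rejected"
        receipt.attempts.append({"attempt": attempt, "error": error})

        if error is None:
            receipt.stop_reason = "success"
            return receipt

        # Refuse to treat a repeated identical failure as progress.
        if policy["stop_on_same_error"] and error == last_error:
            receipt.stop_reason = "same_error_repeated"
            return receipt
        last_error = error

    receipt.stop_reason = "max_attempts_reached"
    return receipt
```

With this policy, a step that keeps returning the same error stops after the second attempt instead of using all three, and the receipt's `stop_reason` says why. Comparing error strings for equality is the simplest possible "same failure" test. The failure classification the article mentions as its next step would replace that comparison.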