{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiamdvycixxxqjtfvcjz2nwyxeci5nivzifudbr4if4fjwibyhfqmi",
    "uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3mpkg5kdiyww2"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreiewlhgmawf3udyhxexzhaodyk6h3bbo5ya34ellxfymvnpxet6mwi"
    },
    "mimeType": "image/webp",
    "size": 78528
  },
  "path": "/saurav_bhattacharya/your-agents-retries-are-double-charging-your-users-and-every-eval-is-green-3ija",
  "publishedAt": "2026-07-01T01:02:51.000Z",
  "site": "https://dev.to",
  "tags": [
    "ai",
    "agents",
    "observability",
    "typescript"
  ],
  "textContent": "Your agent calls a tool. The tool times out at the network layer but _actually succeeds_ on the server. Your harness sees no response, so it retries. Now `charge_customer` ran twice, `send_email` fired twice, and `create_ticket` left two tickets. The model did nothing wrong. Every eval you have is green. And a customer just got billed $198 for a $99 plan.\n\nThis is the failure mode nobody puts in a demo, because demos don't retry and demos don't have side effects that matter. Production has both. If your agent takes _actions_ — not just generates text — then retry safety is not a nice-to-have, it is the difference between an autonomous system and a liability with a scheduler.\n\nI want to argue two things. First: side-effect safety is a **Tier 1 evaluation problem** , not a prompt problem. Second: you cannot even _see_ this class of bug without a trace of what the agent actually did, which is where the eval story and the observability story become the same story.\n\n##  Why the model can't save you here\n\nThe instinct is to make the agent smarter. \"Tell it to check whether the charge already went through before retrying.\" Please don't. You are asking a non-deterministic component to enforce an invariant that must hold _every single time_. The model will comply 95% of the time and the other 5% is a chargeback.\n\nRetries don't come from the model anyway. They come from your harness, your HTTP client, your queue's at-least-once delivery, a Kubernetes pod restart mid-execution. The agent's \"reasoning\" is nowhere near the retry. So no amount of judging the agent's output tells you whether the _effect_ happened once or twice.\n\nThis is exactly why I think about evidence on an **independence axis** , not a cost axis. Evidence is only worth what the agent couldn't forge:\n\n  * **Tier 1 — proof the agent can't fake.** The side effect is externally observable. Did exactly one charge with this idempotency key hit Stripe? Does exactly one ticket exist? This is a deterministic yes/no you read from the _world_ , not from the agent.\n  * **Tier 2 — statistical signal vs a baseline the agent didn't author.** Retry-rate per tool trending up, duplicate-detection hits, latency distributions shifting. Signal, cheap, real-time.\n  * **Tier 3 — model-as-judge.** Useful for \"was this refund reasonable?\" Useless for \"did it happen twice.\" A judge is a _shared-substrate opinion_ : a signal, never a verdict, and never allowed in the hot path.\n\n\n\nDouble-execution lives entirely in Tier 1. It's binary, it's forgery-proof, and it's the 80% of production incidents that never needed an LLM to catch.\n\n##  The actual fix: idempotency keys the agent doesn't control\n\nThe correct architecture makes double-execution _impossible_ , then evaluates that the invariant held. The key insight: the idempotency key is derived by the harness from the intent, not minted by the model on each attempt.\n\n\n\n    import { createHash } from \"node:crypto\";\n\n    type ToolCall = { tool: string; args: Record<string, unknown>; runId: string };\n\n    // The key is a pure function of intent — identical across retries,\n    // because the AGENT never generates it. The harness does.\n    function idempotencyKey(call: ToolCall): string {\n      const canonical = JSON.stringify({\n        tool: call.tool,\n        args: call.args,\n        runId: call.runId, // one logical action per run, not per attempt\n      });\n      return createHash(\"sha256\").update(canonical).digest(\"hex\");\n    }\n\n    async function executeOnce(call: ToolCall, sideEffect: (key: string) => Promise<unknown>) {\n      const key = idempotencyKey(call);\n\n      // Tier 1 proof, checked BEFORE the effect: has this exact intent run?\n      const prior = await ledger.get(key);\n      if (prior?.status === \"committed\") {\n        return { key, result: prior.result, replayed: true }; // no second charge\n      }\n\n      await ledger.put(key, { status: \"in_flight\" });\n      const result = await sideEffect(key); // pass key downstream to Stripe et al.\n      await ledger.put(key, { status: \"committed\", result });\n\n      return { key, result, replayed: false };\n    }\n\n\nNow the retry is safe _by construction_ : a second attempt with the same intent replays the recorded result instead of firing the effect again. But — and this is the part people skip — **being safe is not the same as knowing you're safe.** You still have to prove it in your evals.\n\n##  The eval and the trace are one system\n\nHere's where I'll stop describing generic hygiene and tell you what I actually run: **agent-eval** to score and gate the output, and **AgentLens** to capture the trace it scores against. They ship as a unit for a reason I only appreciated after getting burned.\n\nagent-eval owns the Tier 1 gate. After every run it asserts the invariant against ground truth the agent could not author:\n\n\n\n    import { evaluate } from \"agent-eval\";\n\n    const report = await evaluate(run, {\n      checks: [\n        // Tier 1: externally observable, unforgeable proof.\n        { id: \"single-charge\", tier: 1, run: async ({ trace }) => {\n            const key = trace.toolCalls.find(t => t.tool === \"charge_customer\")?.idempotencyKey;\n            const hits = await stripe.charges.list({ metadata: { key } });\n            return { pass: hits.data.length === 1, detail: `charges=${hits.data.length}` };\n        }},\n        // Tier 2: statistical signal vs a baseline the agent didn't set.\n        { id: \"retry-rate\", tier: 2, run: ({ trace }) => {\n            const retries = trace.toolCalls.filter(t => t.replayed).length;\n            return { pass: retries <= baseline.p95, detail: `replays=${retries}` };\n        }},\n      ],\n    });\n\n\nNotice `trace.toolCalls` and `t.idempotencyKey`. Where does that come from? **AgentLens.** It records every model step and every tool step — resolved inputs, the idempotency key the harness derived, the raw provider response, whether the attempt was a replay or a fresh effect. Without that trace, the \"single-charge\" check has nothing to read. The agent's own summary (\"I charged the customer once\") is exactly the self-report you must not trust — it's shared-substrate, the agent authored it.\n\nThat's the whole thesis of pairing them. Tier 1+2 only mean something if they run over data the agent _didn't get to write_. AgentLens produces the unforgeable substrate; agent-eval renders the verdict. One captures how the agent got there, the other decides whether \"there\" was correct — and critically, Tier 1+2 run over trajectories in real time at roughly $0, while the judge stays offline for the subjective tail where it belongs.\n\n##  Ship the 80%\n\nYou do not need a smarter model to stop double-charging customers. You need:\n\n  1. Idempotency keys the model never touches, so retries are safe by construction.\n  2. A Tier 1 check that reads the _world_ and asserts exactly-once — the deterministic, real-time gate.\n  3. A trace (AgentLens) unforgeable enough that the check has real ground truth to read.\n\n\n\nReserve the model-as-judge for the genuinely subjective 20% — \"was this refund fair?\" — and label its output _opinion, not evidence_. The retry storm quietly draining your customers' cards is not in that 20%. It never was. It's the most catchable bug you have, sitting in Tier 1, waiting for you to look at the trace.",
  "title": "Your Agent's Retries Are Double-Charging Your Users (and Every Eval Is Green)"
}