{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreihfvhpwnkzpkrxyvncouqy2534omoyr7slaeg77qzyhpa2k3tn5t4",
"uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3moqzuqucnga2"
},
"coverImage": {
"$type": "blob",
"ref": {
"$link": "bafkreihhkyxplg2shc2cvaail4wzqscztn2cxi4wggbsseqg7kbzkwcrhi"
},
"mimeType": "image/webp",
"size": 77272
},
"path": "/saurav_bhattacharya/agent-model-x-harness-your-eval-layer-is-part-of-the-agent-not-a-tool-beside-it-1422",
"publishedAt": "2026-06-20T22:49:09.000Z",
"site": "https://dev.to",
"tags": [
"ai",
"agents",
"evaluation",
"observability"
],
"textContent": "There's a formula I keep coming back to when people ask why their slick demo agent falls apart in production:\n\n> **Agent = Model × Harness**\n\nThe model is the raw reasoning — Claude, GPT, whatever. It's swappable, and it's getting better on a curve you don't control. The harness is everything else: the goals, the loops, the tools, the scheduler, the retry logic. Most of the engineering that matters lives in the harness, not the model.\n\nBut here's the part most teams get wrong. They define the harness as _the plumbing to run the model_ — goals + loops + tools — and then bolt **evals** and **observability** on the side as external QA. Things you point _at_ the agent, after the fact, from outside.\n\nThat's the mistake. **Your eval layer and your trace layer are inside the harness.** They're not tools beside the agent; they're the half of the agent that makes it a closed loop instead of an open one.\n\n> **Harness = goals + loops + tools + lens + evals.** The first three let the agent _act_. The last two let it _know whether the action was any good_ — which is the only thing that turns an agent that _runs_ into an agent that _improves_.\n\nLet me make that concrete, because I learned it the unglamorous way: by watching a scheduled agent crash and discovering exactly which parts of my harness saved me.\n\n## Open loop vs closed loop\n\nAn **open-loop** agent acts and moves on. It writes the file, hits the API, ships the commit — and nobody, including the agent itself, knows if the outcome was correct. You find out when a human notices something's broken. Most \"autonomous\" agents in the wild are open-loop. 
They're impressive right up until they silently do the wrong thing for three days.\n\nA **closed-loop** agent has a feedback path:\n\n\n\n    act → observe → evaluate → correct → act better\n\n\nThe two pieces that close it are the **lens** (observability: every model call and tool step, resolved inputs, raw outputs, so a run is never a black box) and the **evals** (judgment: the output scored against a standard, such as deterministic checks, contract validation, or model-as-judge). The lens tells you _what the agent did_; the evals tell you _whether it was good_. You need both, wired into the harness itself, not run by hand once a week by a human who remembers to check.\n\n## The incident that taught me this\n\nI run a fleet of scheduled agents: background workers on cron, each doing a focused job on a timer, each in an isolated session the main process can't watch live.\n\nOne of them crashed mid-run.\n\nIn an open-loop setup, that's a silent disaster: the worker dies, produces nothing, leaves no trace. You discover it days later when the output that should exist doesn't. (If you've run background agents, you know this exact pain: the ones that \"go dark\" for a week before anyone notices.)\n\nHere's what happened instead, because the lens and eval layers were part of the harness:\n\n**1. The lens meant the failure left a record.** Every worker writes its transcript _stub-first_: the moment it starts, before doing any real work, it writes a record with `Outcome: IN-PROGRESS`. So even a worker that dies one second later has left a footprint. The crash wasn't invisible: there was a transcript sitting there, frozen mid-run, saying \"I started and never finished.\"\n\n**2. The evals meant the failure got caught, loudly and repeatedly.** Worker transcripts are validated against a versioned contract. 
A stub stuck at `IN-PROGRESS` is fine in normal mode (it might still be running), but under the \"finished\" check it becomes an _error_:\n\n\n\n    # Normal: an unfinished stub is acceptable\n    agent-eval validate transcripts/\n\n    # Finished mode: a lingering IN-PROGRESS is a hard error\n    agent-eval validate transcripts/ --finished\n\n\nSo the dead run didn't just leave a record: it got _scored as invalid_, and kept showing up in the gate's invalid list every single run until someone dealt with it. The eval layer refused to let the failure quietly blend into the background.\n\nThat's the whole point. **The lens made the failure observable; the evals made it un-ignorable.** Neither is something I went and ran by hand; they're standing parts of the harness that turned a crash from a black hole into a tracked defect.\n\n## Closing the correction step\n\nDetecting the failure is observe + evaluate. The last piece is _correct_.\n\nThe naive fix is \"add a `finally` block so the worker finalizes its own transcript on crash.\" That doesn't work, and the reason is instructive: **a crashed isolated session is gone.** The process that would run the `finally` is the thing that died. You can't ask a dead worker to clean up after itself.\n\nSo the correction step has to live _outside_ the worker, as its own small loop. I wrote a sweeper: a scheduled janitor that scans for transcripts stuck at `IN-PROGRESS` whose run is _provably over_ (older than a threshold set safely past the longest possible run time, so it can never race a worker that's still legitimately going) and finalizes them to `fail`.\n\nThe logic is deliberately boring:\n\n\n\n    for each transcript stub still marked IN-PROGRESS:\n        if age > MAX_RUN_TIME + safety_margin:   # the run is provably dead\n            rewrite Outcome -> \"fail (auto-finalized: abandoned mid-run)\"\n\n\nIdempotent, never deletes anything, runs on a timer. 
The first time it ran it finalized a backlog of long-dead stubs and dropped the gate's invalid count in finished-mode to just the known, expected exceptions. Now it keeps the loop closed automatically: a worker can still crash, but its corpse gets cleaned up within the hour without me touching anything.\n\nThat's `act → observe → evaluate → correct`, fully wired — and three of those four steps are the lens + eval layers doing their job.\n\n## Self-correcting ≠ self-improving (and why that matters)\n\nHere's where I have to be honest, because this is the part it's tempting to oversell.\n\nWhat I built is a **self-correcting** harness. The loop closes: the system detects deviations from its own standard and repairs them without a human in the loop. That's real, and it's the floor you want under any fleet of autonomous agents.\n\nBut self-correcting is not **self-improving**. Self-correcting means the system _holds the line_ — it keeps itself at the standard. Self-improving would mean the system _moves the line up_ on its own: noticing a check has false positives and tightening its own rubric, noticing a worker keeps regressing and adjusting that worker's instructions. My harness doesn't do that. _I_ still author every improvement. The loops run themselves; the loops were designed by a human.\n\nAnd honestly — that's the right place to stop, for now. A harness that rewrites its own workers and its own eval criteria unsupervised is a different and much riskier thing, precisely because **the evals would be grading work the same system authored.** The moment the thing being judged and the thing doing the judging are the same closed system with no external anchor, \"it passes its own evals\" stops meaning very much. Self-correction _within a human-set contract_ is the sound version. 
Self-modification _of the contract itself_ is where you want hard guardrails and a human gate before you flip it on.\n\nSo the honest formula, fully expanded:\n\n> **Agent = Model × Harness**, where **Harness = goals + loops + tools + lens + evals**, and it's the lens + evals that make the agent _self-correcting_. Getting from there to _self-improving_ is a real step further, and it's one you should take deliberately, with a human holding the gate.\n\n## The takeaway\n\nIf you're building agents and your evals and traces are something you run _occasionally, from outside_, you don't have a closed loop; you have an open-loop agent with some QA scripts nearby. The upgrade isn't more tooling. It's a reframe:\n\n * **Lens and evals belong inside the harness**, running on every action, not on demand.\n * **The lens makes failures observable; the evals make them un-ignorable.** That's what converts a crash from an invisible black hole into a tracked defect.\n * **Closing the correction step often means a separate loop**, because the thing that failed can't always clean up after itself.\n * **Self-correcting is the floor; self-improving is a further, deliberate step.** Keep a human on the gate before the system grades its own homework.\n\n\n\nAgent = Model × Harness. The model is improving on its own. Whether your _agent_ improves is a question about your harness, and specifically about whether you've wired the loop closed.\n\n_The two layers in this post map to two tools I use as a single unit. **agent-eval** is the eval framework: it scores and gates an agent's **output** (contract validation, drift, hallucination, staleness); the `validate --finished` check in the incident above is its work. **AgentLens** is the trace layer: it captures **how the agent got there** (every model and tool step, resolved inputs, raw outputs) so the eval signal is actually debuggable. agent-eval tells you something broke; AgentLens tells you why. 
They ship together on purpose: two halves of one feedback loop, which is exactly the argument of this post._",
"title": "Agent = Model x Harness: Your Eval Layer Is Part of the Agent, Not a Tool Beside It"
}