Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreicgft6gacvf3fxkf4p4cyz2bh2jizo5tpxoee3nxvghpzys2stxmy",
    "uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3mpsskmgz6gk2"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreieawdda2unjbw6wsh7pnwabga7cffsyas6mqacn3lvteapwcn4vgm"
    },
    "mimeType": "image/webp",
    "size": 70954
  },
  "path": "/gabrielanhaia/deploying-agents-containers-orchestration-and-scaling-the-loop-44go",
  "publishedAt": "2026-07-04T09:43:52.000Z",
  "site": "https://dev.to",
  "tags": [
    "ai",
    "agents",
    "llm",
    "python",
    "Agents in Production — Building, Tracing, and Shipping Multi-Step AI You Can Trust",
    "Observability for LLM Applications",
    "Hermes IDE",
    "GitHub",
    "xgabriel.com"
  ],
  "textContent": "  * **Book:** Agents in Production — Building, Tracing, and Shipping Multi-Step AI You Can Trust\n  * **Also by me:** Observability for LLM Applications — the companion book in _The AI Engineer's Library_ (2-book series)\n  * **My project:** Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools\n  * **Me:** xgabriel.com | GitHub\n\n\n\nThe agent works on your laptop. It passes evals. Your manager asks when it ships and you say Monday, because the modeling is done. Then you try to put it behind a load balancer and it falls apart, because you deployed it like a web service.\n\nAn agent is not a web service. A web service answers in milliseconds and forgets. An agent thinks for minutes, burns tokens across two or three providers, streams partial output to a browser, and sometimes decides to call `delete_invoice` on the eighth turn. Every deployment decision you make flows from one question: what does this thing do to your infrastructure while it is running?\n\nHere is how to package it, where to hold state, and how to scale a workload whose bottleneck is a model call you do not control.\n\n##  The shape is decided by the longest step\n\nThe single rule that saves you the most pain: **an agent's deployment shape is decided by its longest step, not its average step.**\n\nA support chatbot answers in two seconds. A code-review agent thinks for six minutes. A research agent runs for forty. You cannot put all three behind the same HTTP endpoint and expect any of them to survive. Pick the pattern that matches the longest step, then cap the rest with timeouts.\n\n  * Under 30s → stateless HTTP endpoint (Cloud Run, Fly.io).\n  * 30s to 5m with a user watching → streaming over WebSocket or SSE.\n  * 5m to an hour, async → queue plus worker (Temporal, Inngest, or Redis).\n  * Longer than an hour → still queue plus worker, whether you like it or not.\n\n\n\nDo not hold an HTTP request open for forty minutes. 
Something you did not know existed will kill it at the worst moment: a proxy, a CDN, a load-balancer idle timeout.\n\n##  Package it: pin everything, drop root\n\nThe base image is the same across every pattern. Pin your Python, pin your SDKs, run as a non-root user, install nothing you do not need.\n\n\n\n    # Dockerfile\n    FROM python:3.13-slim-bookworm AS builder\n    ENV PIP_NO_CACHE_DIR=1\n    WORKDIR /build\n    COPY requirements.txt .\n    RUN pip wheel --wheel-dir /wheels -r requirements.txt\n\n    FROM python:3.13-slim-bookworm AS runtime\n    ENV PYTHONDONTWRITEBYTECODE=1 \\\n        PYTHONUNBUFFERED=1\n    RUN groupadd -r agent && useradd -r -g agent agent\n    WORKDIR /app\n    COPY --from=builder /wheels /wheels\n    COPY requirements.txt .\n    RUN pip install --no-index \\\n        --find-links=/wheels -r requirements.txt\n    COPY app/ ./app/\n    USER agent\n    EXPOSE 8000\n    CMD [\"uvicorn\", \"app.main:app\", \\\n         \"--host\", \"0.0.0.0\", \"--port\", \"8000\"]\n\n\nTwo things earn their keep. The multi-stage build compiles wheels in stage one and copies only the runtime into stage two, so no build toolchain ships to production. And `slim-bookworm` is 130 MB against 1.1 GB for the default image. As a rough estimate, that smaller pull shaves seconds off cold pod start when you scale up under load.\n\nNever bake an API key into the image. The runtime injects it. On Kubernetes that is a mounted `Secret`; on GCP it is Secret Manager with Workload Identity; on Fly it is `fly secrets set`. The image has no credentials, the agent reads them at boot, and it never logs them.\n\nPin exact versions in `requirements.txt`. 
The SDK field names drift, and a build that worked in February will break in July if the tags float:\n\n\n\n    anthropic==0.94.1\n    langgraph==1.1.6\n    litellm==1.75.1\n    fastapi==0.118.0\n    uvicorn[standard]==0.34.0\n    redis==5.2.1\n    tenacity==9.1.2\n\n\n##  Stateless first: hold state in Redis, not the process\n\nReach for stateless unless you have a concrete reason not to. Each request starts a fresh agent. Conversation history, tool results, the scratchpad — all of it lives in Redis or Postgres, keyed by a session ID the client sends in. The process holds nothing between requests, so any pod can serve any request and a rolling deploy loses no memory.\n\n\n\n    # app/main.py\n    from fastapi import FastAPI\n    from pydantic import BaseModel\n    from anthropic import AsyncAnthropic\n    import redis.asyncio as redis\n    import json\n\n    app = FastAPI()\n    r = redis.from_url(\"redis://cache:6379/0\")\n    client = AsyncAnthropic()\n\n    class Req(BaseModel):\n        session_id: str\n        message: str\n\n    @app.post(\"/chat\")\n    async def chat(req: Req):\n        key = f\"hist:{req.session_id}\"\n        raw = await r.get(key)\n        history = json.loads(raw) if raw else []\n        history.append(\n            {\"role\": \"user\", \"content\": req.message}\n        )\n        resp = await client.messages.create(\n            model=\"claude-sonnet-4-6\",\n            max_tokens=1024,\n            messages=history,\n        )\n        reply = resp.content[0].text\n        history.append(\n            {\"role\": \"assistant\", \"content\": reply}\n        )\n        await r.set(key, json.dumps(history), ex=3600)\n        return {\"reply\": reply}\n\n\nThe state is the session, not the server. That is the whole trick. When you need a durable, long-running agent instead (one that must survive a worker restart mid-run), you move to a queue and a workflow engine, and the state lives in the workflow store rather than in Redis. 
Do not fake durability by keeping it in process memory.\n\n##  Health checks: cheap liveness, expensive readiness\n\nKubernetes probes are where agent deployments quietly break, because the defaults assume a fast service.\n\nSplit the two probes. `/healthz` is cheap: did the process start, is the event loop alive. `/readyz` is expensive: can the agent actually reach its provider. A pod that boots with a bad API key should never take traffic.\n\n\n\n    # app/probes.py\n    from fastapi import APIRouter, Response\n    from anthropic import AsyncAnthropic\n\n    router = APIRouter()\n    client = AsyncAnthropic()\n\n    @router.get(\"/healthz\")\n    async def healthz():\n        return {\"status\": \"ok\"}\n\n    @router.get(\"/readyz\")\n    async def readyz():\n        try:\n            await client.models.list()\n            return {\"status\": \"ready\"}\n        except Exception:\n            return Response(status_code=503)\n\n\nLiveness should almost never kill the pod. A legitimate long agent turn can make the event loop look stuck, and the standard advice, kill it after three failures, will murder a run that was working fine. Point liveness at `/healthz` with a generous failure threshold, and set `terminationGracePeriodSeconds` larger than your worst-case turn so a rolling deploy lets in-flight runs finish instead of severing them.\n\n##  Scaling: the bottleneck is a lock you do not own\n\nHere is what makes agents different from a normal backend. Adding workers does not add throughput past a point, because every provider caps you on requests per minute and tokens per minute, per key. Scale your pool from 10 to 100 and those limits do not move. You will just generate more 429s.\n\nSo the first thing to wire is not autoscaling. 
It is a semaphore per `(provider, model)` pair, sized to your real budget, plus retry with backoff and jitter for the noise that slips through.\n\n\n\n    import asyncio\n    from tenacity import (\n        retry, stop_after_attempt,\n        wait_exponential_jitter,\n    )\n\n    # One gate per (provider, model) pair. It caps concurrent calls, not\n    # requests per minute: size it from your RPM budget and call latency.\n    gate = asyncio.Semaphore(64)\n\n    @retry(\n        stop=stop_after_attempt(5),\n        wait=wait_exponential_jitter(initial=1, max=30),\n    )\n    async def call_model(client, **kw):\n        async with gate:\n            return await client.messages.create(**kw)\n\n\nThe workload blocks on I/O. The model call is a network wait, so concurrency is nearly free at the runtime level and expensive at the provider level. That inversion is the whole scaling story. Your CPU sits idle while a hundred coroutines wait on Claude. So do not autoscale on CPU alone; it will read near zero while you are completely saturated. Scale on in-flight request count, and cap in-flight runs per pod (four is a sane start for a 1-CPU pod with 30-second turns) so a single box does not fan out a thousand concurrent calls and eat your whole rate limit.\n\nFor a TypeScript agent the same gate is a small counting semaphore around the call:\n\n\n\n    // gate.ts: cap concurrent model calls\n    let active = 0;\n    const queue: Array<() => void> = [];\n    const LIMIT = 64;\n\n    export async function withGate<T>(\n      fn: () => Promise<T>,\n    ): Promise<T> {\n      // Re-check after waking: another caller may have taken the freed slot.\n      while (active >= LIMIT) {\n        await new Promise<void>((r) => queue.push(r));\n      }\n      active++;\n      try {\n        return await fn();\n      } finally {\n        active--;\n        queue.shift()?.();\n      }\n    }\n\n\nWhen the pool saturates, the queue grows. That is your signal. Shed load at the ingress (return 429 or park the work on a durable queue) rather than letting the agent thrash against the provider. 
A queue that holds work is also what carries you through a provider outage: the run waits for the model to come back instead of failing.\n\n##  Route through a gateway, not straight at the provider\n\nPoint your agent at one gateway that owns fallback, not at a provider SDK directly. When Claude Sonnet is rate-limited, the gateway retries on the next model in the chain and your agent code never sees it. LiteLLM Proxy is the self-hosted default; OpenRouter and Portkey are the managed options.\n\nThe catch: a fallback chain you have never exercised is a fallback chain you do not have. The failover target needs the same rate-limit headroom as your primary, and a different model means a different tokenizer and tool-call schema. Test the chain against a fixed eval set, reported as its own score, and drill it: force ten percent of traffic through the fallback once a month. If it cannot carry ten percent on a quiet Tuesday, it will not carry a hundred percent at 3 AM.\n\n##  The deployment is the easy part\n\nFour patterns, a Dockerfile, a semaphore, and a fallback chain get you something that ships on Monday and does not fall over on Tuesday. The container runs, the gateway routes around outages, the queue holds work when nothing else can. That part is mechanical.\n\nThe hard part is what happens at 3 AM when the agent starts returning confident answers to questions nobody asked and burning tokens on a loop that never terminates, which is why deployment is only worth doing on top of tracing and evals. _Agents in Production_ walks the five patterns, the scaling limits, and the degradation ladder end to end; _Observability for LLM Applications_, its companion in _The AI Engineer's Library_, is the tracing, evals, and cost-accounting layer that tells you the loop went wrong before your users do.",
  "title": "Deploying Agents: Containers, Orchestration, and Scaling the Loop"
}