Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreieyxk4i6tkh7hournabzwpsmtit3a7f7tjkdcz3v6bmqlfrr3tp4e",
    "uri": "at://did:plc:qllwm7os6w6f6hxue4mcr7mz/app.bsky.feed.post/3mlqrjmszaz32"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreihm3lddzvuu66ci2rpxrnbtwfi6cztp26azjjdlxxmyg5s6aefpdy"
    },
    "mimeType": "image/jpeg",
    "size": 92463
  },
  "description": "How we defend Arcjet’s MCP tool outputs from prompt injection by separating trusted guidance from untrusted evidence in structured responses.",
  "path": "/how-we-defend-mcp-tool-outputs-from-prompt-injection/",
  "publishedAt": "2026-05-13T16:56:41.000Z",
  "site": "https://blog.arcjet.com",
  "tags": [
    "Arcjet’s MCP server",
    "provides excellent scaffolding",
    "Arcjet",
    "prompt injections on the web",
    "MCP tools spec",
    "Arcjet Guards",
    "@arcjet"
  ],
  "textContent": "When we built Arcjet’s MCP server, the obvious security boundaries were authentication, authorization, input validation, rate limits, and confirmation prompts for mutating tools. The Go MCP SDK provides excellent scaffolding and helps structure these requirements, but it doesn't do anything special about the tool response boundary.\n\nIn a normal API, JSON is data. In an agent workflow, JSON is context. A field called `summary`, `suggestedActions`, or `reason` may be read by a model and used to decide what to do next. If that field contains attacker-controlled text, the tool has become a prompt injection delivery mechanism.\n\nArcjet helps developers secure AI applications from abuse, which means our own systems process both arbitrary HTTP request data and requests from agentic clients. These include attacker-controlled data such as the request path and headers, so indirect prompt injection through Arcjet's own MCP server is a real concern.\n\n## Tool output is model input\n\nGoogle recently published research on prompt injections on the web. They found prompt injection attempts appearing in public web content, ranging from pranks and SEO manipulation to agent deterrence, exfiltration attempts, and destructive instructions.\n\nThis is relevant because newer applications do not read only user prompts: web pages, emails, docs, logs, support tickets, issue comments, request metadata, and tool results can all end up in the model's context.\n\nOur MCP tools allow Arcjet users to query request data in aggregate and for specific requests, so they include security-relevant data such as request paths, hosts, IPs, headers, and error details. Those fields are exactly where hostile input can show up.\n\nThis is easy to miss when building the tool output. 
For example, you could create a simple string response with:\n\n\n    summary: `Request to ${path} was denied because ${headerName} contained disallowed value: ${headerValue}`;\n\nThis is unsafe because `path`, `headerName`, and `headerValue` are all attacker-controlled, and once interpolated they read to the model as server-authored guidance. Even though we sanitize the values before storing them, protecting against prompt injection isn't simply a matter of applying the right encoding.\n\n## What the MCP spec says\n\nThe MCP tools spec already points in the right direction.\n\nTools are model-controlled, meaning a model can discover and invoke them automatically. The spec recommends human confirmation for sensitive operations, supports structured output through `structuredContent`, and lets tools define an `outputSchema`.\n\nThe security considerations are especially relevant: servers must validate inputs, implement access controls, rate limit tool invocations, and sanitize tool outputs. Clients should validate tool results before passing them to the LLM.\n\n\"Sanitize\" is doing a lot of work there! We have to distinguish between Arcjet-provided guidance and attacker-controlled evidence.\n\nOur answer was to make the trust boundary visible in the response shape so clients and models have an explicit signal about which fields are trusted guidance and which are untrusted evidence.\n\n## The pattern we use\n\nOur rule is: trusted guidance must never contain untrusted text.\n\nTrusted fields are generated only from server-controlled values: enums, counters, thresholds, static templates, and policy decisions. Raw evidence goes into explicitly untrusted fields. 
This looks something like:\n\n\n    type ExplainDecisionOutput = {\n      summary: string;\n      conclusion: \"ALLOW\" | \"DENY\" | \"ERROR\";\n      reason: string;\n      suggestedActions: string[];\n      untrustedData: {\n        path?: string;\n        host?: string;\n        reasonDetails?: string;\n      };\n    };\n\nSo instead of:\n\n\n    summary: `Request to ${path} was denied because ${headerName} contained disallowed value: ${headerValue}`;\n\nwe use:\n\n\n    summary: \"Request was denied by prompt injection detection.\",\n    untrustedData: { path, headerName, headerValue }\n\nThe trusted summary is less descriptive, but it is safer. The raw evidence is still available for display, investigation, and debugging, but it is not presented as server-authored guidance. This gives the model enough context to reason about the decision while keeping the raw evidence in a separate, clearly labeled place. More sophisticated outputs explain the meaning of the values and what to do with them, but the actual values are always kept separate.\n\n## Schema text is part of the defense\n\nMCP’s `outputSchema` is not just developer experience - it's part of the security surface. The schema helps clients and models understand the structure of the result before they use it.\n\nWe use schema descriptions to label trust explicitly.\n\nDescriptions for fields like `summary` state that they are derived from server-controlled enums and counters. Descriptions for fields like `untrustedData.path` state that they are attacker-controlled and display-only.\n\nThat does not make the model magically safe, but it makes the boundary explicit. 
The client and the model have fewer reasons to confuse raw evidence with instructions.\n\n## Testing the boundary\n\nWe also added regression tests for the main failure mode: attacker-controlled text crossing into trusted fields.\n\nThe tests inject hostile strings into realistic places: request paths, hosts, reason details, rule labels, rule IDs, bot categories, filter expressions, metadata values, and top paths in analytics output.\n\nThen we assert those strings never appear in trusted fields such as `summary`, `suggestedActions`, `recommendations`, `reason`, or `risk`. If the value needs to be returned, it can appear only in `untrustedData`.\n\n## Runtime enforcement with Guards\n\nThe MCP output pattern protects what our tools say back to agents, but agents also fetch arbitrary web pages, process queue messages, summarize support tickets, and call tools that return untrusted text.\n\nThat is where Arcjet Guards fits.\n\nGuards run Arcjet security rules inside tool handlers, queue workers, background jobs, and other non-HTTP code paths. There is no `Request` object. You pass the input directly and get a decision back.\n\nIn TypeScript:\n\n\n    import { launchArcjet, tokenBucket, detectPromptInjection } from \"@arcjet/guard\";\n\n    const arcjet = launchArcjet({ key: process.env.ARCJET_KEY! 
});\n\n    const userLimit = tokenBucket({\n      label: \"user.tool_call_bucket\",\n      bucket: \"tool-calls\",\n      refillRate: 100,\n      intervalSeconds: 60,\n      maxTokens: 500,\n    });\n\n    const piRule = detectPromptInjection();\n\n    async function searchWeb(query: string, userId: string) {\n      const decision = await arcjet.guard({\n        label: \"tools.search_web\",\n        metadata: { userId },\n        rules: [userLimit({ key: userId, requested: 1 }), piRule(query)],\n      });\n\n      if (decision.conclusion === \"DENY\") {\n        return { content: \"[Blocked: unsafe tool input]\" };\n      }\n\n      return doSearch(query);\n    }\n\nOr in Python:\n\n\n    import os\n    from arcjet.guard import launch_arcjet, TokenBucket, DetectPromptInjection\n\n    arcjet = launch_arcjet(key=os.environ[\"ARCJET_KEY\"])\n\n    user_limit = TokenBucket(\n        label=\"user.task_bucket\",\n        bucket=\"task-calls\",\n        refill_rate=100,\n        interval_seconds=60,\n        max_tokens=500,\n    )\n\n    pi_rule = DetectPromptInjection()\n\n    async def process_task(user_id: str, message: str):\n        decision = await arcjet.guard(\n            label=\"tasks.generate\",\n            metadata={\"user_id\": user_id},\n            rules=[user_limit(key=user_id, requested=1), pi_rule(message)],\n        )\n\n        if decision.conclusion == \"DENY\":\n            raise RuntimeError(f\"Blocked: {decision.reason}\")\n\n        return await run_task(message)\n\n\nThe important detail is placement. Call `guard()` inline where the operation happens: inside the tool handler, task processor, queue worker, or function where untrusted input enters the system. Configure the client and rules once at module scope. 
Use stable labels, buckets, and keys so decisions are observable and rate limits do not collide.\n\nPrompt injection defense is not only about better model instructions - it's important to also make trust boundaries visible in code.\n\n## Checklist for agent tool builders\n\nIf you are building MCP tools or agent workflows:\n\n  * Separate trusted guidance from untrusted evidence in every tool response.\n  * Generate trusted fields only from enums, counters, static templates, and policy decisions.\n  * Put raw web, request, metadata, error, and tool text under clearly labeled `untrustedData`.\n  * Define an `outputSchema` and use schema descriptions to mark trust boundaries.\n  * Sanitize and, where possible, validate tool results before they reach the LLM.\n  * Scan fetched or user-supplied content before returning it to the model.\n  * Add rate limits, timeouts, redirect limits, content-type checks, and max response sizes.\n  * Require confirmation for mutating or externally visible actions.\n  * Add adversarial tests for every trusted output field the model might read.\n\n",
  "title": "How we defend MCP tool outputs from prompt injection",
  "updatedAt": "2026-05-13T16:56:42.282Z"
}