Raw Record Source

{
  "$type": "site.standard.document",
  "description": "feat: add dynamic robots.txt to block AI crawlers via Known Agents API",
  "path": "/posts/astro-block-ai-crawlers/",
  "publishedAt": "2026-01-29T00:00:00.000Z",
  "site": "https://read.ryancowl.es",
  "tags": [
    "Code"
  ],
  "textContent": "After migrating to Astro, one of the first things I wanted to port over was my automated crawler blocking setup. The Hugo version used  to fetch a list of AI crawlers and generate  at build time. With Astro, I figured I could pull from multiple sources: the Known Agents API (formerly Dark Visitors), Cloudflare's public bot list, and ai.robots.txt.\n\nAs I mentioned in the Hugo version,  has no legal or technical authority. You're trusting bots to respect rules with no mechanism to enforce them. But it can't hurt to try, and casting a wider net with multiple sources feels like a reasonable upgrade. Let's see how it works.\n\n  \n\nThe endpoint\n\nAstro's endpoints let you generate files dynamically from . In this site,  runs as a server endpoint under the Netlify adapter, so I add cache headers to avoid fetching remote lists on every request.\n\nThe Known Agents API requires a  request with an access token and a list of agent types you want to block. Add your token to a  file if you have one:\n\nThe token is optional in my version. If it is missing, the endpoint skips Known Agents instead of sending .\n\nCreate :\n\nThe idea is simple: fetch the available lists, merge and deduplicate them, and return a formatted . If one source fails, the others can still work. If all remote sources fail, the endpoint falls back to a small built-in list so the response is still useful.\n\nThe  array controls which Known Agents categories get blocked. I'm targeting AI Data Scrapers, AI Agents, AI Assistants, and Undocumented AI Agents while leaving SEO crawlers and search engines alone. You can adjust those categories to match your own preferences.\n\n  \n\nTake it for a test drive\n\nStart the dev server with  and visit . You should see something like:\n\nThe exact list will change over time because the upstream sources change. 
The response is cacheable for browsers and CDNs, so you get fresh-enough data without making every request wait on remote APIs.\n\nTracking what's actually hitting your site\n\nKnown Agents also offers a JavaScript analytics tag that tracks AI agent visits. It won't catch crawlers and scrapers (they don't run JavaScript), but it will show you visits from AI assistants and LLM-referred traffic. I added it to my base head:\n\nBetween the robots.txt blocking and the analytics tag, you get a decent picture of what AI traffic looks like on your site.\n\nGoing further\n\nFor stronger enforcement, you could block crawlers at the server level too. Netlify supports an  header via a  file:\n\nI still don't trust that crawlers will respect any of this, but combining  with server-level headers at least makes the intent clear.\n\nFurther reading\n\nKnown Agents Documentation\nAstro Endpoints Guide\nai.robots.txt\nThe text file that runs the internet",
  "title": "Automatically Block AI Crawlers in Astro"
}