DGW.ltd

Do androids dream of robots.txt?

dgw.ltd June 17, 2026

I was sent this recently – it’s an agent-readiness check from Cloudflare. They are (literally) in a better position in the stack to check this sort of thing. We are all seeing bot traffic go through the roof, with user agents such as ChatGPT-User, ClaudeBot and Google-Extended increasingly appearing in our server logs, so we know our website content is being read one step beyond the browser.

Now the audit: point isitagentready.com at your domain and it grades how prepared you are for the coming wave of AI agents crawling, reading, and acting on your behalf.

dgw.ltd scored badly. Nine things missing. My first instinct was the same as yours probably is: open nine tickets and start clearing the board. I’m glad I didn’t, because most of that list is optional at best.

What the scanner actually wants

The nine items break into three honest buckets once you stop treating a red cross as a to-do. Here’s the triage I landed on:

| Item | Verdict | Why |
| --- | --- | --- |
| Content Signals in robots.txt | ✅ Do it | Real, shipped, one line of policy |
| API Catalog (RFC 9727) | 🤔 Optional | Honest, but agents already find /wp-json/ |
| DNS-AID | ❌ Skip | DNS zone work for an unratified draft |
| OAuth/OIDC discovery | ❌ Skip | I don’t run OAuth-protected APIs |
| OAuth Protected Resource Metadata | ❌ Skip | Same |
| auth.md | ❌ Skip | Same |
| MCP Server Card | ❌ Skip | I don’t run a public MCP server |
| Agent Skills index | ❌ Skip | I don’t publish public agent skills |
| WebMCP | ❌ Skip | A pile of JS to expose “tools” I don’t have |

One worth doing. One defensible. Seven that range from premature to actively wrong for a personal blog. To be fair, it does say at the top of the page that this checks against “multiple emerging standards”, but you don’t tend to notice that above a massive fail indicator.

The one that’s real: Content Signals

Content Signals is the rare entry that’s cheap and honest.
Cloudflare rolled it out across millions of zones in late 2025, and it’s a single declarative line in robots.txt stating how you feel about AI using your content – split into search, ai-input (RAG and live answers), and ai-train (training). For most sites the defensive default is ai-train=no. This is a blog about AI and agentic coding, so I went the other way:

```php
function dgwltd_robots_content_signal( $output, $public ) {
    // ... $marker and $pos are set earlier in the function (not shown in this excerpt).
    $signal = $marker . "Content-Signal: search=yes, ai-input=yes, ai-train=yes\n";
    return substr_replace( $output, $signal, $pos, strlen( $marker ) );
}
add_filter( 'robots_txt', 'dgwltd_robots_content_signal', 20, 2 );
```

If you’re writing about this topic, you want that content in training data and surfaced in answers. Some clients actively want their content wherever users are, regardless of platform – AI summaries, social previews, aggregators, wherever. Others don’t. Both are legitimate positions, and they require different strategies.

robots.txt has always been hope, not a strategy. It signals intent to reputable crawlers, but it’s not enforceable – it’s a convention, not a promise. That said, not having one means there’s nothing for even the well-behaved crawlers to respect.

A WordPress note worth knowing: there’s no robots.txt file to edit (unless you create one manually and upload it to the server). Core serves it virtually through do_robots() and the robots_txt filter, so you hook the filter rather than drop a file. I run it at priority 20 so it lands after Yoast, which also touches that filter.

Why I skipped seven of nine

Here’s the thesis, and it’s the bit I actually care about. These scanners quietly conflate two very different things: metadata and machinery.

A Content Signal is a label. It costs one line, it’s true whether anyone reads it or not, and the worst case is a crawler ignores it. An OAuth discovery document, an MCP server card, a WebMCP tool definition – those describe machinery. Publishing them implies you run that machinery.
So a green checkmark that advertises an auth endpoint you don’t operate isn’t a feature. It’s a liability with a tidy little arrow pointing at it. Empty oauth-authorization-server metadata is strictly worse than no metadata: absent says “nothing here”, present-but-hollow says “something here, come poke it”.

WebMCP deserves a special mention, because it’s the purest version of the whole problem. It’s a browser API: you call navigator.modelContext.provideContext() and hand any visiting agent a set of tools your site can run. To pass the check I’d ship a JavaScript library exposing tools I haven’t written. I’m not shipping JS for something I’m not shipping.

Note that the WebKit team even came out against WebMCP recently – an official “oppose”:

“When a site’s actions are hard for an agent to use, that is a gap in the page’s own semantics, and the fix, in our opinion, is to close it in the platform’s shared layers (HTML and ARIA), where the user, assistive technology, and agents all benefit.”

They continue:

“Our deeper concern is architectural. An agent acting on a user’s behalf is, in effect, assistive technology: it should operate a site as the user would, and the site should not single it out for different treatment.”

Note the distinction, though: WebKit opposes WebMCP, the in-browser API – not MCP itself, which belongs server-side.

This site runs on WordPress, where the groundwork is already laid. The Abilities API landed in core in 6.9: register a capability once, typed and permission-checked, and it’s discoverable. The official MCP Adapter plugin turns those abilities into MCP tools on the server side, where the auth already lives. If I ever want agents using the site, that’s the door: registered abilities with capability checks, not a script tag hoping a browser agent turns up. My rule has always been minimal JavaScript, only where it earns its place, and a JS file advertising tools that don’t exist fails that on both counts.
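
To make “register a capability once, typed and permission-checked” concrete, a registration might look roughly like the sketch below. It assumes the registration shape described for the Abilities API in WordPress 6.9 (a category hook, an abilities hook, JSON-schema-style input and output, and separate execute and permission callbacks). The `dgwltd/search-posts` name, the `dgwltd-content` category and the `dgwltd_search_posts()` callback are hypothetical and only there for illustration.

```php
// Hypothetical sketch: names, category slug and callback are illustrative only.
add_action( 'wp_abilities_api_categories_init', 'dgwltd_register_ability_category' );
function dgwltd_register_ability_category() {
    wp_register_ability_category(
        'dgwltd-content',
        array(
            'label'       => 'Site content',
            'description' => 'Read-only abilities for finding published content.',
        )
    );
}

add_action( 'wp_abilities_api_init', 'dgwltd_register_search_ability' );
function dgwltd_register_search_ability() {
    wp_register_ability(
        'dgwltd/search-posts',
        array(
            'label'               => 'Search posts',
            'description'         => 'Returns titles and URLs of published posts matching a query.',
            'category'            => 'dgwltd-content',
            // Typed input: a single required, non-empty search string.
            'input_schema'        => array(
                'type'       => 'object',
                'properties' => array(
                    'query' => array( 'type' => 'string', 'minLength' => 1 ),
                ),
                'required'   => array( 'query' ),
            ),
            // Typed output: a list of { title, url } objects.
            'output_schema'       => array(
                'type'  => 'array',
                'items' => array(
                    'type'       => 'object',
                    'properties' => array(
                        'title' => array( 'type' => 'string' ),
                        'url'   => array( 'type' => 'string' ),
                    ),
                ),
            ),
            'execute_callback'    => 'dgwltd_search_posts',
            // The capability check lives next to the ability, not in the page.
            'permission_callback' => function () {
                return current_user_can( 'read' );
            },
        )
    );
}

// Runs the search and returns only what the output schema promises.
function dgwltd_search_posts( $input ) {
    $posts = get_posts(
        array(
            's'           => $input['query'],
            'post_status' => 'publish',
            'numberposts' => 5,
        )
    );

    return array_map(
        function ( $post ) {
            return array(
                'title' => get_the_title( $post ),
                'url'   => get_permalink( $post ),
            );
        },
        $posts
    );
}
```

The point of the shape is that the description, the types and the permission check travel together in one server-side registration, so nothing is advertised that the site cannot actually run.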
The scope matters: abilities expose what the site does – search, browse, filter, related posts – not how an agent drives the page. The site offers capabilities; it isn’t piloting a checkout to order you a burger.

The thing the scanner didn’t ask for

Something that struck me: the scanner checks for the speculative Cloudflare-flavoured stack and misses llms.txt, which is the single most on-brand item for a content site. It’s a curated markdown map of your site for LLMs – an H1, a summary, and hand-picked links. Signal over coverage, which is the whole point versus a sitemap.

So I built one. Virtual, no file, generated on template_redirect from the primary nav menu plus recent posts.

Where this leaves me

One real change shipped (Content Signals), one bonus item the scanner forgot (llms.txt). Seven checkboxes left deliberately red.

Treat agent-readiness scanners like a linter with no taste: they tell you what’s missing, not what’s wise. The gap between publishing a preference and claiming machinery you don’t run is the whole game, and a checkmark can’t see it.

We’ll no doubt see more scanners like this, and I’m not knocking their utility. But like any automated test, they’re worth taking with a pinch of salt.
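
For anyone wanting to try the virtual llms.txt idea described above, here is a minimal sketch of how it could be wired up. It is an illustration under assumptions rather than the code running on this site: the `dgwltd_llms_txt_body()` helper, the `primary` menu location slug, the section headings and the ten-post limit are all hypothetical.

```php
// Builds the llms.txt body: an H1, a one-line summary, then sections of links.
function dgwltd_llms_txt_body( string $title, string $summary, array $sections ): string {
    $lines = array( '# ' . $title, '', '> ' . $summary );

    foreach ( $sections as $heading => $links ) {
        $lines[] = '';
        $lines[] = '## ' . $heading;
        $lines[] = '';
        foreach ( $links as $label => $url ) {
            $lines[] = sprintf( '- [%s](%s)', $label, $url );
        }
    }

    return implode( "\n", $lines ) . "\n";
}

// Serves /llms.txt virtually; no file on disk is involved.
add_action(
    'template_redirect',
    function () {
        $path = trim( (string) parse_url( $_SERVER['REQUEST_URI'] ?? '', PHP_URL_PATH ), '/' );
        if ( 'llms.txt' !== $path ) {
            return;
        }

        $sections = array(
            'Pages'        => array(),
            'Recent posts' => array(),
        );

        // Assumption: the theme registers a menu location called "primary".
        $locations = get_nav_menu_locations();
        $items     = isset( $locations['primary'] ) ? wp_get_nav_menu_items( $locations['primary'] ) : false;
        foreach ( $items ? $items : array() as $item ) {
            $sections['Pages'][ $item->title ] = $item->url;
        }

        foreach ( get_posts( array( 'numberposts' => 10 ) ) as $post ) {
            $sections['Recent posts'][ get_the_title( $post ) ] = get_permalink( $post );
        }

        status_header( 200 );
        header( 'Content-Type: text/plain; charset=utf-8' );
        echo dgwltd_llms_txt_body( get_bloginfo( 'name' ), get_bloginfo( 'description' ), $sections );
        exit;
    }
);
```

Keeping the string building in a plain function makes the curated part easy to check on its own; the hook only decides when to serve it and where the links come from.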
