{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreievjyf6mfmhime6wmcxloxz3xm5ebi7romkpb56sgrfwli3qx2crq",
"uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3mpg7skouxw32"
},
"coverImage": {
"$type": "blob",
"ref": {
"$link": "bafkreiato4ffizwj5356bzk2asiqjazrpikt3iyq4g6nfxmjt5fujgncum"
},
"mimeType": "image/webp",
"size": 74666
},
"path": "/_6a9b7b682ef6dfb20e506/ai-crawlers-are-scanning-your-site-right-now-how-to-check-and-control-access-3bak",
"publishedAt": "2026-06-29T09:33:36.000Z",
"site": "https://dev.to",
"tags": [
"tutorial",
"seo",
"ai",
"AEO Checker",
"Google Search Central: Introduction to robots.txt",
"Google Search Central: Google crawlers and fetchers",
"OpenAI: Crawlers and user agents",
"Anthropic: Web crawling and crawler controls",
"The llms.txt proposal",
"aeocheck.xyz"
],
"textContent": "AI crawlers now appear in many server logs alongside traditional search bots.\nSome are used for search retrieval, some for training, and some for broader web\nindexing. If you care about AI search visibility, you need to know which ones\ncan access your public pages.\n\nThe most common accidental blocker is simple: a robots.txt rule or CDN bot\nsetting that prevents AI crawlers from reaching the content you want discovered.\n\n## The major AI crawler tokens to check\n\nHere are crawler tokens you may see in logs or robots.txt rules:\n\nCrawler token | Company | Notes\n---|---|---\nGPTBot | OpenAI | Documented OpenAI crawler token\nOAI-SearchBot | OpenAI | Documented OpenAI search-related crawler token\nChatGPT-User | OpenAI | Documented OpenAI user-triggered agent token\nClaudeBot | Anthropic | Documented Anthropic crawler token\nClaude-SearchBot | Anthropic | Documented Anthropic search-related crawler token\nGoogle-Extended | Google | Google control token for Gemini Apps and Vertex AI use\nCCBot | Common Crawl | Web corpus crawler used by many downstream systems\nPerplexityBot | Perplexity | Commonly referenced Perplexity crawler token\n\nCrawler names and purposes change. Always confirm against official platform\ndocumentation before making sitewide access decisions.\n\n## First, check what is actually happening\n\nBefore you change anything, find out who is already crawling. If you have server\nlogs:\n\n\n\n grep -E \"GPTBot|OAI-SearchBot|ChatGPT-User|ClaudeBot|Claude-SearchBot|Google-Extended|CCBot|PerplexityBot\" access.log\n\n\nIf you use Cloudflare, check bot and security events and filter by user agent.\n\nThree quick diagnostic steps:\n\n 1. Open `https://yourdomain.com/robots.txt` and look for broad `Disallow: /` rules.\n 2. Confirm the sitemap is listed in robots.txt or discoverable at `/sitemap.xml`.\n 3. 
Use our AEO Checker to validate robots.txt and flag restrictive AI crawler rules.\n\n\n\n## The most common mistake\n\nThe blunt rule that makes sites invisible to many crawlers:\n\n\n\n User-agent: *\n Disallow: /\n\n\nThis blocks every well-behaved crawler that has no more specific group of its\nown. If you see it on a public marketing site, blog, or documentation site, it\nis probably too restrictive.\n\nA more common pattern is:\n\n\n\n User-agent: *\n Disallow: /admin\n Disallow: /api\n Disallow: /private\n\n\nThis can be reasonable. The key is to make sure public content is allowed and\nsensitive areas are blocked intentionally.\n\n## The allow vs block decision\n\n**Allow public content** when you want search and AI discovery.\n\n**Selectively block sensitive paths** such as admin, account, checkout, API, and\nprivate areas.\n\n**Block completely** only when you intentionally do not want a crawler to access\nany public content.\n\nFor most content sites, SaaS marketing sites, and documentation sites, the\npractical approach is to allow public pages and block private or operational\npaths.\n\n## Configuring robots.txt\n\nHere is a simple template. A crawler obeys only the most specific group that\nmatches its token and ignores the wildcard group once it has a group of its own,\nso the named crawlers share one group with `*` and get the same private-path\nrules:\n\n\n\n User-agent: Googlebot\n User-agent: Bingbot\n User-agent: GPTBot\n User-agent: OAI-SearchBot\n User-agent: ChatGPT-User\n User-agent: ClaudeBot\n User-agent: Claude-SearchBot\n User-agent: Google-Extended\n User-agent: *\n Disallow: /admin\n Disallow: /api\n Disallow: /private\n Allow: /\n\n Sitemap: https://example.com/sitemap.xml\n\n\nPlace it at `/robots.txt`. Make sure it returns a 200 status and a plain text\nresponse.\n\n## What blocking actually does\n\nRobots.txt is a crawler instruction, not an authentication system. Major\nwell-behaved crawlers generally respect it. Bad actors may not.\n\nIf a path contains sensitive information, protect it with authentication and\nauthorization. 
Do not rely on robots.txt as a security boundary.\n\n## Watch out for CDN bot protection\n\nEven if robots.txt is correct, CDN bot protection can still block or challenge\nAI crawlers at the network level. If you use Cloudflare or another CDN, review\nbot events and WAF rules after changing crawler access.\n\n## The 5-point AI search readiness checklist\n\n 1. **Robots.txt is accessible** and returns a 200 status with a plain text response.\n 2. **Sitemap is discoverable** and contains canonical public URLs.\n 3. **AI crawler rules are intentional** rather than accidental.\n 4. **An llms.txt file exists at `/llms.txt`** if you want an AI-readable site summary.\n 5. **Structured data is present** on important pages.\n\n\n\nRun our AEO Checker to audit these signals in one scan.\n\n## The bottom line\n\nMost accidental AI crawler blocks come from broad robots.txt rules or CDN bot\nsettings. Both are fixable. The right setup is not \"allow everything forever\";\nit is to make public discovery intentional and private areas truly private.\n\n## Sources and further reading\n\n * Google Search Central: Introduction to robots.txt\n * Google Search Central: Google crawlers and fetchers\n * OpenAI: Crawlers and user agents\n * Anthropic: Web crawling and crawler controls\n * The llms.txt proposal\n\n\n\n_Originally published at aeocheck.xyz, free AI search readiness tools._",
"title": "AI Crawlers Are Scanning Your Site Right Now - How to Check and Control Access"
}