{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreibfsyytwmxjuvfnk66xmz5zc4luas5cze6jd6f2islz4gx5c5hasu",
    "uri": "at://did:plc:j4nmy4ymoeorm3j6hzbijapg/app.bsky.feed.post/3mlg6wxbjcfj2"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreig6dhfv36hlmop3abdaevdljncg4y4g7fuq6jualzq2el3xvqdnfq"
    },
    "mimeType": "image/jpeg",
    "size": 775108
  },
  "description": "A traffic spike with no statistics counterpart, 400 requests in a minute, 25 countries. The clues were all there. So was the twist: I built the bait myself.",
  "path": "/the-detective-and-the-swarm/",
  "publishedAt": "2026-05-09T11:57:54.000Z",
  "site": "https://hoeijmakers.net",
  "tags": [
    "human, bot, crawler, scraper, agent",
    "Markdown",
    "I Thought I Was Optimising for Speed",
    "Thirty Years of Caching, Sorted in an Afternoon",
    "My Visitors Are Not All Human. That Is Fine.",
    "Guests That Should Behave",
    "Markdown, the WD-40 of Digital Information"
  ],
  "textContent": "At 07:43 UTC this morning, something hit my site. Not a flood in the security sense: no alarms, no WAF triggers, nothing in the human analytics. Just a number that shouldn't have been that high, on a chart I check more out of habit than worry.\n\nFour hundred requests in sixty seconds. Then back to normal.\n\nMy human-facing analytics saw nothing. That absence is itself a clue.\n\n## Two layers\n\nMost publishers have one view of their traffic: the analytics dashboard. It shows pageviews, sessions, referrers, the countries their readers come from. It tells them what humans did.\n\nWhat it doesn't show is everything else. And everything else, it turns out, is interesting.\n\nI run a Cloudflare Worker that logs every request to a D1 database before passing it along. Every request: human, bot, crawler, scraper, agent. The Worker tries to classify each one, matching user-agent strings against a database of known bot signatures. What it can't classify, it logs as unknown.\n\nThat second layer is where the detective work happens.\n\n## Reading the evidence\n\nThe 07:43 spike broke down like this: 202 of those 400 requests were for `.md` paths. Paths like `/when-bots-become-readers.md`, `/web-traffic-and-the-rise-of-llms.md`, `/measuring-traffic-machines-bots.md`. The Worker classified 192 of them as human, because the user-agent strings looked like browsers: Chrome 114, Firefox 115, Edge 114. Perfectly formed, perfectly plausible.\n\nBut one user-agent hit 47 different `.md` paths in sixty seconds. Another hit 37. Both from the same two browser fingerprints, distributed across 25 countries.\n\nThen the robots.txt requests: 116 of them in the same minute. That's a preflight pattern, a swarm checking the rules before it reads the content.\n\nChrome 114 hasn't been a current browser for a long time. Neither has Firefox 115. These are frozen strings, a signature of bot infrastructure that picks a browser version and pins it, never updating. The user agent (UA) looks human. The behaviour doesn't.\n\nThe conclusion assembled itself: a distributed scraper, running across a proxy network or botnet, using spoofed browser identities to avoid classification. Evasive, coordinated, and genuinely clever.\n\n## The twist\n\nHere's where the detective story gets uncomfortable.\n\nThose `.md` endpoints don't exist by default in Ghost. I added them. A few months ago, as an experiment: serve each post as clean Markdown alongside the HTML version, reference them in `llms.txt`, see what happens. The idea was to make the content easier for AI systems to consume. Structured, clean, no JavaScript noise.\n\nThe scrapers found them almost immediately. So did legitimate AI users, ChatGPT-User and OAI-SearchBot among them, reading the same paths through the same door.\n\nI set the bait. They smelled it.\n\n## The idée fixe\n\nThe reflex response to a traffic spike like this is defensive. Block the IPs, rate-limit the endpoint, add a CAPTCHA, harden the WAF. There is an entire industry built around that reflex.\n\nIt rests on a premise worth examining: that keeping machines out is possible, and that it is worth the effort.\n\nNeither is quite true. A scraper that can distribute across 25 countries and rotate frozen browser UAs is not stopped by a robots.txt entry or a Cloudflare rule. It routes around friction the way water routes around a stone. And the content, once published on the public web, is going to be consumed by machine pipelines whether or not you make it easy.\n\nThe more interesting question is what you can learn by watching it happen.\n\nThe spike told me that my Markdown experiment is working, in the sense that it is attracting exactly the traffic it was designed for. It told me that the machine layer of the web is active, distributed, and more sophisticated than most people assume. It told me that the gap between what human analytics show and what the full request log shows is where the real picture lives.\n\nBlocking that traffic would have closed the window. Watching it left it open.\n\n## Signal in the noise\n\nThe thing about running a second logging layer is that most of what it catches is unremarkable. Googlebot, Bingbot, ChatGPT-User ticking through recent posts, the usual crawl of SEO tools and RSS readers. Noise.\n\nBut the noise is the baseline. Without it, the 07:43 spike is invisible. With it, you can ask: what's different about this minute? Why these paths? Why this many countries? Why frozen UAs?\n\nThe detective work is in the filtering, not the blocking.\n\n🗒️\n\nThe .md endpoints at hoeijmakers.net are intentional. Each post is available as clean Markdown alongside the HTML version. The llms.txt file indexes them. This is an ongoing experiment in machine-readable publishing.\n\n* * *\n\nRelated:\n\n  * I Thought I Was Optimising for Speed\n  * Thirty Years of Caching, Sorted in an Afternoon\n  * My Visitors Are Not All Human. That Is Fine.\n  * Guests That Should Behave\n  * Markdown, the WD-40 of Digital Information\n\n",
  "title": "The Detective and the Swarm",
  "updatedAt": "2026-05-09T12:41:27.016Z"
}