Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreihcxllohylub5gga4eonk4lwwqgn6cgl7gmgsj7zuwpqalc3b3goq",
    "uri": "at://did:plc:zyvv64s26uy7h2rwhhdq5e5f/app.bsky.feed.post/3mh45zgrg77i2"
  },
  "path": "/2026/03/12/crawl/",
  "publishedAt": "2026-03-12T04:00:00.000Z",
  "site": "https://inkdroid.org",
  "tags": [
    "platforms",
    "web-archives",
    "Henhouse",
    "Jan Fyt",
    "news",
    "Crawl API",
    "web archiving",
    "Browsertrix API",
    "National Software Reference Library",
    "Cabrinety Archive",
    "Web Application Firewall",
    "May First",
    "Jess Ogden",
    "Shawn Walker",
    "cloudflare_crawl",
    "model offerings",
    "https://doi.org/10.1177/13548565231164759",
    "https://archivaria.ca/index.php/archivaria/article/view/13733"
  ],
  "textContent": " Henhouse by Jan Fyt\n\nThe news about Cloudflare’s new Crawl API caught my attention for a few reasons. Read on for why, and what I learned when I asked it to crawl my own site as a test.\n\n* * *\n\nSo, the first reason this news was of interest was how Cloudflare’s Crawl service seemed to be helping people crawl websites with their bots, while at the same time providing the most popular technology for protecting websites from bots. This seemed like a classic fox guarding the hen house kind of situation to me, at least at first. But the little bit of reading I’ve done since makes it seem like they will still respect their own bot gatekeeping (e.g. Turnstile). So if you are using Cloudflare or some other bot mitigation technology you will have to follow their instructions to let the Cloudflare crawl bot in to collect pages. I haven’t actually tested if this is the case.\n\nThe genius here is that Cloudflare is known for its Content Delivery Network. So in theory when a user asks to crawl a website they can be delivered data from the cache, without requiring a round trip to the source website. In theory this is good because it means that the burden of scrapers on websites _might_ be greatly reduced. If you run a website with lots of high value resources for LLMs (academic papers, preprints, books, news stories, etc) the same cached content could be delivered to multiple parties without putting extra load on your server.\n\nBut, the primary reason this news caught my eye is that this service looks very much like web archiving technology to me. For example, the Browsertrix API lets you set up, start, monitor and download crawls of websites. Unlike Browsertrix, which is geared to collecting a website for viewing by a person, the Cloudflare Crawl service is oriented toward looking at the web for training LLMs. 
The service returns text content: HTML, Markdown, and structured JSON, where the JSON is the result of running the collected text through one of their LLMs with a given prompt. Why is it interesting that this is like web archiving technology?\n\nIn my dissertation research (Summers, 2020) I looked at how web archiving technology enacts different _ways of seeing_ the web from an archival perspective. I spent a year with NIST’s National Software Reference Library (NSRL) trying to understand how they were collecting software from the web, and how the tools they built embodied a particular way of valuing the web–and making certain things (e.g. software) legible (Scott, 1998). What I found was that the NSRL was engaged in a form of web archiving, where the shape of the archival records was determined by their initial conditions of use (forensic analysis). But these initial forensic uses did not _overdetermine_ the value of the records, which saw a variety of uses later, such as when the NSRL began adding software from Stanford’s Cabrinety Archive, or when the team’s personal expertise and interest in video games led them to focus on archiving content from the Steam platform.\n\nSo I guess you could say I was primed to be interested in how Cloudflare’s Crawl service _sees_ the web. This matters because models (LLMs, etc) will be built on top of data that the service has collected. But also because, if it succeeds, the service will likely get used for other things.\n\nTo test it, I simply asked it to crawl my own static website–the one that you are looking at right now. I did this for a few reasons:\n\n  1. It’s a static website, and I know exactly how many HTML pages were on it: 1,398. All the pages are directly discoverable since the homepage includes pagination links to an index page that includes each post.\n  2. I can easily look at the server logs to see what the crawler activity looks like.\n  3. 
I don’t use any kind of Web Application Firewall or other form of bot protection on my site (I do have a robots.txt but it doesn’t block `CloudflareBrowserRenderingCrawler/1.0`).\n  4. I host my website on a May First web server, which doesn’t use Cloudflare as a CDN. The web content wouldn’t intentionally be in their CDN already.\n\n\n\nThis methodology was adapted from previous work I did with Jess Ogden and Shawn Walker analyzing how the Internet Archive’s Save Page Now service shapes what content is archived from the web (Ogden, Summers, & Walker, 2023).\n\nI wrote a little helper program cloudflare_crawl to start, monitor and download the results from the crawl. While the crawler ran I simultaneously watched the server logs. Running the program looks like this:\n\n\n    $ uvx cloudflare_crawl https://inkdroid.org\n\n    created job 36f80f5e-d112-4506-8457-89719a158ce2\n    waiting for 36f80f5e-d112-4506-8457-89719a158ce2 to complete: total=1520 finished=837 skipped=1285\n    waiting for 36f80f5e-d112-4506-8457-89719a158ce2 to complete: total=1537 finished=868 skipped=1514\n    ...\n    wrote 36f80f5e-d112-4506-8457-89719a158ce2-001.json\n    wrote 36f80f5e-d112-4506-8457-89719a158ce2-002.json\n    wrote 36f80f5e-d112-4506-8457-89719a158ce2-003.json\n    wrote 36f80f5e-d112-4506-8457-89719a158ce2-004.json\n    wrote 36f80f5e-d112-4506-8457-89719a158ce2-005.json\n\nEach of the resulting JSON files contains some metadata for the crawl, as well as a list of “records”, one for each URL that was discovered.\n\n\n    {\n      \"success\": true,\n      \"result\": {\n        \"id\": \"36f80f5e-d112-4506-8457-89719a158ce2\",\n        \"status\": \"completed\",\n        \"browserSecondsUsed\": 1382.8220786132817,\n        \"total\": 1967,\n        \"finished\": 1967,\n        \"skipped\": 6862,\n        \"cursor\": 51,\n        \"records\": [\n          {\n            \"url\": \"https://inkdroid.org/\",\n            \"status\": \"completed\",\n            \"metadata\": 
{\n              \"status\": 200,\n              \"title\": \"inkdroid\",\n              \"url\": \"https://inkdroid.org/\",\n              \"lastModified\": \"Sun, 08 Mar 2026 05:00:39 GMT\"\n            },\n            \"markdown\": \"...\",\n            \"html\": \"...\"\n          },\n          {\n            \"url\": \"https://www.flickr.com/photos/inkdroid\",\n            \"status\": \"skipped\"\n          }\n        ]\n      }\n    }\n\nI decided I wasn’t interested in testing their model offerings so I didn’t ask for JSON content (the result of sending the harvested text through a model). If I had, each successful result would have had a `json` property as well. I am sure that people will use this but I was more interested in how the service interacted with the source website, and wasn’t interested in discovering the hard way how much it cost.\n\nBelow is a snippet of how the Cloudflare bot shows up in my nginx logs. As you can see, the logs provide insight into what machine on the Internet is making the request, what time it was made, and what URL on the site is being requested.\n\n\n    104.28.153.137 - - [12/Mar/2026:14:34:58 +0000] \"GET /about/ HTTP/1.1\" 200 5077 \"-\" \"CloudflareBrowserRenderingCrawler/1.0\"\n    104.28.153.137 - - [12/Mar/2026:14:34:58 +0000] \"GET /css/main.css HTTP/1.1\" 200 35504 \"https://inkdroid.org/about/\" \"CloudflareBrowserRenderingCrawler/1.0\"\n    104.28.153.137 - - [12/Mar/2026:14:34:58 +0000] \"GET /css/highlight.css HTTP/1.1\" 200 1225 \"https://inkdroid.org/about/\" \"CloudflareBrowserRenderingCrawler/1.0\"\n    104.28.153.137 - - [12/Mar/2026:14:34:58 +0000] \"GET /css/webmention.css HTTP/1.1\" 200 1238 \"https://inkdroid.org/about/\" \"CloudflareBrowserRenderingCrawler/1.0\"\n    104.28.153.137 - - [12/Mar/2026:14:34:58 +0000] \"GET /images/feed.png HTTP/1.1\" 200 8134 \"https://inkdroid.org/about/\" \"CloudflareBrowserRenderingCrawler/1.0\"\n    104.28.153.137 - - [12/Mar/2026:14:34:58 +0000] \"GET 
/js/bootstrap.min.js HTTP/1.1\" 200 17317 \"https://inkdroid.org/about/\" \"CloudflareBrowserRenderingCrawler/1.0\"\n    104.28.153.137 - - [12/Mar/2026:14:34:58 +0000] \"GET /images/ehs-trees.jpg HTTP/1.1\" 200 63047 \"https://inkdroid.org/about/\" \"CloudflareBrowserRenderingCrawler/1.0\"\n    104.28.153.137 - - [12/Mar/2026:14:34:59 +0000] \"GET /js/highlight.min.js HTTP/1.1\" 200 20597 \"https://inkdroid.org/about/\" \"CloudflareBrowserRenderingCrawler/1.0\"\n\nSo how did Cloudflare Crawl see my website?\n\n###  Crawling\n\n###  Results\n\nOne of the more interesting things was that each time I requested the website be crawled it seemed to come back with a different number of results.\n\nOgden, J., Summers, E., & Walker, S. (2023). Know(ing) Infrastructure: The Wayback Machine as object and instrument of digital research. _Convergence: The International Journal of Research into New Media Technologies_ , 135485652311647. https://doi.org/10.1177/13548565231164759\n\nScott, J. C. (1998). _Seeing like a state: How certain schemes to improve the human condition have failed_. Yale University Press.\n\nSummers, E. (2020). Appraisal talk in web archives. _Archivaria_ , _89_. Retrieved from https://archivaria.ca/index.php/archivaria/article/view/13733",
  "title": "Crawl"
}