Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreicpyg3jsh2zrftm4on2jqpav2c6c7mpj6holos2sctthwyrdo3ale",
    "uri": "at://did:plc:zyvv64s26uy7h2rwhhdq5e5f/app.bsky.feed.post/3mhb6xm6cnjq2"
  },
  "path": "/2026/03/16/seeing-the-web/",
  "publishedAt": "2026-03-16T04:00:00.000Z",
  "site": "https://inkdroid.org",
  "tags": [
    "platforms",
    "web-archives",
    "Leica Double-Gauss Lens Design",
    "news",
    "pay-per-crawl API",
    "in the docs",
    "HTTP Message Signatures",
    "crushed by bots",
    "web archiving",
    "Browsertrix API",
    "National Software Reference Library",
    "Cabrinety Archive",
    "Web Application Firewall",
    "May First",
    "Jess Ogden",
    "Shawn Walker",
    "Save Page Now",
    "cloudflare-crawl",
    "model offerings",
    "setting up",
    "ETag",
    "here",
    "https://doi.org/10.1177/13548565231164759",
    "https://theanarchistlibrary.org/library/james-c-scott-seeing-like-a-state",
    "https://doi.org/10.13016/U95C-QAYR"
  ],
  "textContent": " Leica Double-Gauss Lens Design\n\nThe news about Cloudflare’s new pay-per-crawl API caught my attention for a few reasons. Read on for why, a bit about what the results look like, and what I learned when I asked it to crawl this site as a test.\n\n* * *\n\nSo, first of all, what’s up? Cloudflare’s Crawl API helps people collect data from websites with bots, while _at the same time_ providing one of the most popular technologies for preventing websites from being crawled by bots?\n\nAt first this seemed to me like a classic fox guarding the hen house kind of situation. But the little bit of reading in the docs I’ve done since makes it seem like they will still respect their own bot gate keeping (e.g. Turnstile).\n\nIf you are using Cloudflare or some other bot mitigation technology you will have to follow their instructions to let the Cloudflare crawl bot in to collect pages. Interestingly, it appears they are using the latest specs for HTTP Message Signatures to provide this functionality, since you can’t simply let in anyone saying they are `CloudflareBrowserRenderingCrawler` right?\n\nThe genius here is that Cloudflare is known for its Content Delivery Network (CDN). So in theory (more on this below) when a user asks to crawl a website the data can be delivered from the cache, without requiring a round trip back to the source website. This could mean that in some situations the burden of scrapers on websites is greatly reduced. If you run a website with lots of high value resources for LLMs (academic papers, preprints, books, news stories, etc) the same cached content could be delivered to multiple parties without having to go back to the originating server. 
For resource constrained cultural heritage organizations that are currently getting crushed by bots I think this would be a welcome development.\n\nBut, the primary reason this news caught my eye is that if you squint right Cloudflare’s Crawl API looks very much like web archiving technology. For example, the Browsertrix API lets you set up, start, monitor and download crawls of websites. Unlike Browsertrix, which is geared to collecting a website for viewing by a person, the Cloudflare Crawl service is oriented at looking at the web for training LLMs. The service returns text content: HTML, Markdown and structured JSON data that result from running the collected text through one of their LLMs, with the given prompt.\n\n###  Seeing the Web\n\nSo why is it interesting that this is like web archiving technology?\n\nOk, maybe it isn’t interesting to you, but (ahem) in my dissertation research (Summers, 2020) I spent a lot of time (way too much time tbh) looking at how web archiving technology enacts different _ways of seeing_ the web from an archival perspective. I spent a year with NIST’s National Software Reference Library (NSRL) trying to understand how they were collecting software from the web, and how the tools they built embodied a particular way of seeing and valuing the web–and making certain things (e.g. software) legible (Scott, 1998).\n\nWhat I found was that the NSRL was engaged in a form of web archiving, where the shape of the archival records was determined by their initial conditions of use (in their case, forensics analysis). 
But these initial forensic uses did not _overdetermine_ the value of the records, which saw a variety of uses, disuses, and misuses later: such as when the NSRL began adding software from Stanford’s Cabrinety Archive, or when the team’s personal expertise and interest in video games led them to focus on archiving content from the Steam platform.\n\nSo I guess you could say I was primed to be interested in how Cloudflare’s Crawl service _sees_ the web. This matters because models (LLMs, etc) and other services will be built on top of data that they’ve collected. But also because, if it succeeds, the service will likely get repurposed for other things.\n\n###  Testing\n\nTo test how Cloudflare sees the web, I simply asked it to crawl my own static website–the one that you are looking at right now. I did this for a few reasons:\n\n  1. It’s a static website, and I know exactly how many HTML pages were on it: 1,398. All the pages are directly discoverable since the homepage includes pagination links to an index page that includes each post.\n  2. I can easily look at the server logs to see what the crawler activity looks like.\n  3. I don’t use any kind of Web Application Firewall or other form of bot protection on my site (I do have a robots.txt but it doesn’t block `CloudflareBrowserRenderingCrawler/1.0`)\n  4. I host my website on May First which doesn’t use Cloudflare as a CDN. So the web content wouldn’t intentionally be in Cloudflare’s CDN already.\n\n\n\nThis methodology was adapted from previous work I did with Jess Ogden and Shawn Walker analyzing how the Internet Archive’s Save Page Now service shapes what content is archived from the web (Ogden, Summers, & Walker, 2023).\n\nI wrote a little command line utility cloudflare-crawl to start, monitor and download the results from the crawl. While the crawler ran I simultaneously watched the server logs. 
Running the utility looks like this:\n\n\n    $ uvx https://github.com/edsu/cloudflare-crawl https://inkdroid.org\n\n    created job 36f80f5e-d112-4506-8457-89719a158ce2\n    waiting for 36f80f5e-d112-4506-8457-89719a158ce2 to complete: total=1520 finished=837 skipped=1285\n    waiting for 36f80f5e-d112-4506-8457-89719a158ce2 to complete: total=1537 finished=868 skipped=1514\n    ...\n    wrote 36f80f5e-d112-4506-8457-89719a158ce2-001.json\n    wrote 36f80f5e-d112-4506-8457-89719a158ce2-002.json\n    wrote 36f80f5e-d112-4506-8457-89719a158ce2-003.json\n    wrote 36f80f5e-d112-4506-8457-89719a158ce2-004.json\n    wrote 36f80f5e-d112-4506-8457-89719a158ce2-005.json\n\nEach of the resulting JSON files contains some metadata for the crawl, as well as a list of “records”, one for each URL that was discovered.\n\n\n    {\n      \"success\": true,\n      \"result\": {\n        \"id\": \"36f80f5e-d112-4506-8457-89719a158ce2\",\n        \"status\": \"completed\",\n        \"browserSecondsUsed\": 1382.8220786132817,\n        \"total\": 1967,\n        \"finished\": 1967,\n        \"skipped\": 6862,\n        \"cursor\": 51,\n        \"records\": [\n          {\n            \"url\": \"https://inkdroid.org/\",\n            \"status\": \"completed\",\n            \"metadata\": {\n              \"status\": 200,\n              \"title\": \"inkdroid\",\n              \"url\": \"https://inkdroid.org/\",\n              \"lastModified\": \"Sun, 08 Mar 2026 05:00:39 GMT\"\n            },\n            \"markdown\": \"...\",\n            \"html\": \"...\"\n          },\n          {\n            \"url\": \"https://www.flickr.com/photos/inkdroid\",\n            \"status\": \"skipped\"\n          }\n        ]\n      }\n    }\n\n###  Analysis\n\nI decided I wasn’t very interested in testing their model offerings, so I didn’t ask for JSON content (the result of sending the harvested text through a model). If I had, each successful result would have had a `json` property as well. 
I am sure that people will use this, but I was more interested in how the service interacted with the source website, and wasn’t interested in discovering the hard way how much it cost to run the content through their LLMs.\n\nBelow is a snippet of how the Cloudflare bot shows up in my nginx logs. As you can see the logs provide insight into what machine on the Internet is doing the request, what time it was requested, and what URL on the site is being requested.\n\n\n    104.28.153.137 - - [12/Mar/2026:14:34:58 +0000] \"GET /about/ HTTP/1.1\" 200 5077 \"-\" \"CloudflareBrowserRenderingCrawler/1.0\"\n    104.28.153.137 - - [12/Mar/2026:14:34:58 +0000] \"GET /css/main.css HTTP/1.1\" 200 35504 \"https://inkdroid.org/about/\" \"CloudflareBrowserRenderingCrawler/1.0\"\n    104.28.153.137 - - [12/Mar/2026:14:34:58 +0000] \"GET /css/highlight.css HTTP/1.1\" 200 1225 \"https://inkdroid.org/about/\" \"CloudflareBrowserRenderingCrawler/1.0\"\n    104.28.153.137 - - [12/Mar/2026:14:34:58 +0000] \"GET /css/webmention.css HTTP/1.1\" 200 1238 \"https://inkdroid.org/about/\" \"CloudflareBrowserRenderingCrawler/1.0\"\n    104.28.153.137 - - [12/Mar/2026:14:34:58 +0000] \"GET /images/feed.png HTTP/1.1\" 200 8134 \"https://inkdroid.org/about/\" \"CloudflareBrowserRenderingCrawler/1.0\"\n    104.28.153.137 - - [12/Mar/2026:14:34:58 +0000] \"GET /js/bootstrap.min.js HTTP/1.1\" 200 17317 \"https://inkdroid.org/about/\" \"CloudflareBrowserRenderingCrawler/1.0\"\n    104.28.153.137 - - [12/Mar/2026:14:34:58 +0000] \"GET /images/ehs-trees.jpg HTTP/1.1\" 200 63047 \"https://inkdroid.org/about/\" \"CloudflareBrowserRenderingCrawler/1.0\"\n    104.28.153.137 - - [12/Mar/2026:14:34:59 +0000] \"GET /js/highlight.min.js HTTP/1.1\" 200 20597 \"https://inkdroid.org/about/\" \"CloudflareBrowserRenderingCrawler/1.0\"\n\nSo how did Cloudflare Crawl see my website?\n\nMaybe it’s early days for the service, but one thing I noticed is that each time I requested the site to be crawled the results 
seemed to be radically different.\n\ncrawl time  |  completed  |  skipped  |  queued  |  errored  |  unique_urls\n---|---|---|---|---|---\n2026-03-12 13:13:00  |  165  |  84  |  |  1  |  223\n2026-03-12 13:44:00  |  72  |  4  |  2  |  |  78\n2026-03-12 14:09:00  |  1947  |  7304  |  |  23  |  9191\n2026-03-12 16:33:00  |  72  |  4  |  2  |  |  78\n2026-03-12 17:34:00  |  1948  |  7365  |  |  22  |  9191\n2026-03-13 16:50:00  |  1947  |  7363  |  |  23  |  9187\n2026-03-14 07:32:00  |  72  |  4  |  2  |  |  78\n\nThe more successful crawls did a good job of crawling the entire site. My website is well linked: it has a standard homepage with anchor tag based paging that links to all the posts. But knowing when your results are a partial crawl seems to be difficult. Knowing the actual dimensions of a “website” is one of the more difficult things about web archiving practice. The URLs that were labeled as “skipped” were not in scope for the crawl. If you wanted to include those, apparently there is an `options.includeExternalLinks` option when setting up the crawl.\n\nFrom watching the web server logs it was clear that:\n\n  1. Cloudflare doesn’t appear to be relying on previously cached data: each result appeared to require a round trip to the server. My static site didn’t change over the course of these tests, and uses ETag to signal when content changes.\n  2. Cloudflare appears to be fetching CSS, JavaScript and images for the rendering of each page (they aren’t being cached by the Browser Worker).\n  3. The throughput on the web server seemed to peak around 300 requests / minute (5 requests / second). 
For most sites this seems perfectly feasible.\n\n\n\n\n\nFor the more successful crawls it looked like there were 246 independent IP addresses within Cloudflare’s network block that were doing the crawling.\n\nip  |  request_count\n---|---\n104.28.153.88  |  405\n104.28.163.131  |  266\n104.28.161.242  |  232\n104.28.165.231  |  223\n104.28.153.132  |  212\n104.28.163.132  |  212\n104.28.163.81  |  201\n104.28.166.65  |  188\n104.28.166.121  |  186\n104.28.164.201  |  185\n104.28.153.179  |  182\n104.28.153.137  |  178\n104.28.164.202  |  172\n104.28.161.243  |  172\n104.28.166.127  |  163\n104.28.165.232  |  155\n104.28.153.119  |  153\n104.28.165.14  |  151\n104.28.153.83  |  148\n104.28.153.140  |  145\n104.28.153.87  |  145\n104.28.153.55  |  143\n104.28.153.136  |  142\n104.28.163.133  |  132\n104.28.153.118  |  131\n104.28.166.58  |  130\n104.28.163.78  |  126\n104.28.160.31  |  125\n104.28.153.139  |  124\n104.28.161.245  |  124\n104.28.163.214  |  123\n104.28.153.120  |  123\n104.28.165.230  |  121\n104.28.153.180  |  121\n104.28.164.156  |  119\n104.28.153.96  |  119\n104.28.153.64  |  112\n104.28.153.133  |  111\n104.28.166.128  |  111\n104.28.153.128  |  109\n104.28.166.126  |  104\n104.28.165.17  |  103\n104.28.165.18  |  103\n104.28.160.30  |  103\n104.28.153.134  |  101\n104.28.166.120  |  101\n104.28.153.129  |  101\n104.28.153.181  |  100\n104.28.153.86  |  100\n104.28.165.229  |  100\n104.28.163.134  |  99\n104.28.164.203  |  99\n104.28.162.194  |  98\n104.28.166.62  |  98\n104.28.163.212  |  98\n104.28.153.123  |  97\n104.28.164.154  |  97\n104.28.166.61  |  97\n104.28.161.246  |  96\n104.28.153.92  |  96\n104.28.166.125  |  96\n104.28.153.68  |  93\n104.28.159.23  |  92\n104.28.153.76  |  91\n104.28.153.71  |  91\n104.28.153.124  |  90\n104.28.158.143  |  88\n104.28.165.21  |  88\n104.28.153.94  |  87\n104.28.166.118  |  86\n104.28.161.133  |  84\n104.28.153.85  |  82\n104.28.164.152  |  82\n104.28.163.77  |  82\n104.28.153.148  |  
79\n104.28.164.150  |  79\n104.28.165.12  |  79\n104.28.161.201  |  79\n104.28.153.183  |  78\n104.28.160.65  |  78\n104.28.153.126  |  77\n104.28.153.138  |  77\n104.28.159.133  |  76\n104.28.165.20  |  75\n104.28.158.137  |  75\n104.28.153.56  |  75\n104.28.153.81  |  74\n104.28.153.131  |  73\n104.28.153.59  |  72\n104.28.166.60  |  72\n104.28.166.66  |  69\n104.28.159.120  |  69\n104.28.153.53  |  68\n104.28.153.185  |  68\n104.28.153.191  |  67\n104.28.166.119  |  66\n104.28.153.95  |  64\n104.28.165.76  |  64\n104.28.154.20  |  62\n104.28.153.121  |  57\n104.28.158.142  |  57\n104.28.160.68  |  56\n104.28.163.177  |  56\n104.28.153.80  |  56\n104.28.161.215  |  55\n104.28.161.244  |  55\n104.28.153.62  |  55\n104.28.166.134  |  55\n104.28.153.122  |  54\n104.28.165.19  |  53\n104.28.153.127  |  53\n104.28.159.118  |  53\n104.28.157.166  |  53\n104.28.153.226  |  53\n104.28.157.169  |  52\n104.28.159.111  |  48\n104.28.153.196  |  48\n104.28.161.132  |  48\n104.28.153.84  |  47\n104.28.161.214  |  47\n104.28.165.13  |  46\n104.28.153.219  |  46\n104.28.163.171  |  46\n104.28.165.15  |  45\n104.28.163.176  |  45\n104.28.159.109  |  45\n104.28.158.155  |  45\n104.28.153.218  |  45\n104.28.158.131  |  44\n104.28.161.200  |  44\n104.28.153.222  |  44\n104.28.161.197  |  44\n104.28.159.74  |  44\n104.28.158.139  |  44\n104.28.158.138  |  44\n104.28.153.235  |  43\n104.28.153.106  |  43\n104.28.164.160  |  43\n104.28.153.57  |  38\n104.28.159.119  |  37\n104.28.163.82  |  36\n104.28.153.197  |  36\n104.28.153.93  |  36\n104.28.160.25  |  35\n104.28.153.78  |  34\n104.28.153.72  |  34\n104.28.153.125  |  34\n104.28.153.61  |  34\n104.28.166.131  |  34\n104.28.158.132  |  33\n104.28.159.135  |  33\n104.28.160.34  |  33\n104.28.163.220  |  33\n104.28.153.77  |  33\n104.28.166.135  |  33\n104.28.164.155  |  33\n104.28.163.213  |  33\n104.28.158.136  |  33\n104.28.160.121  |  33\n104.28.157.174  |  33\n104.28.165.71  |  33\n104.28.153.130  |  33\n104.28.163.76  |  
32\n104.28.160.32  |  32\n104.28.160.64  |  32\n104.28.153.89  |  32\n104.28.159.110  |  32\n104.28.163.172  |  32\n104.28.154.18  |  32\n104.28.163.178  |  31\n104.28.166.124  |  30\n104.28.165.114  |  25\n104.28.153.182  |  25\n104.28.166.132  |  25\n104.28.159.108  |  24\n104.28.165.75  |  24\n104.28.157.171  |  24\n104.28.153.240  |  23\n104.28.164.204  |  23\n104.28.153.108  |  23\n104.28.159.24  |  22\n104.28.157.242  |  22\n104.28.153.63  |  22\n104.28.153.105  |  22\n104.28.159.229  |  22\n104.28.158.130  |  22\n104.28.164.213  |  22\n104.28.159.136  |  22\n104.28.164.158  |  22\n104.28.157.83  |  22\n104.28.153.107  |  22\n104.28.159.83  |  22\n104.28.157.172  |  22\n104.28.157.82  |  22\n104.28.158.145  |  22\n104.28.162.93  |  22\n104.28.163.174  |  22\n104.28.153.98  |  22\n104.28.157.170  |  21\n104.28.158.126  |  21\n104.28.165.74  |  21\n104.28.153.216  |  21\n104.28.159.112  |  21\n104.28.161.199  |  14\n104.28.153.194  |  13\n104.28.154.15  |  13\n104.28.159.232  |  13\n104.28.166.59  |  13\n104.28.159.150  |  12\n104.28.165.72  |  12\n104.28.158.252  |  12\n104.28.153.104  |  12\n104.28.158.254  |  11\n104.28.158.129  |  11\n104.28.153.58  |  11\n104.28.162.195  |  11\n104.28.160.28  |  11\n104.28.159.115  |  11\n104.28.158.255  |  11\n104.28.153.214  |  11\n104.28.153.67  |  11\n104.28.160.29  |  11\n104.28.153.195  |  11\n104.28.164.153  |  11\n104.28.160.23  |  11\n104.28.160.24  |  11\n104.28.159.114  |  11\n104.28.160.27  |  11\n104.28.160.66  |  11\n104.28.157.175  |  11\n104.28.157.173  |  11\n104.28.159.122  |  11\n104.28.154.12  |  11\n104.28.160.33  |  11\n104.28.164.159  |  11\n104.28.163.170  |  11\n104.28.165.11  |  11\n104.28.154.17  |  10\n104.28.163.222  |  10\n104.28.159.121  |  2\n104.28.157.243  |  2\n104.28.153.73  |  2\n104.28.157.233  |  2\n104.28.153.54  |  2\n104.28.158.146  |  2\n104.28.163.169  |  2\n\nI spot checked some of the HTML and it did appear to be near identical to what was on the live web. 
With the fullest results I noticed 4% of URLs were not crawled.\n\nI think there are a few directions this could go from here:\n\n  1. testing what happens when instructing the crawl to collect (instead of skip) pages that are off site\n  2. testing what happens with more dynamic content, and how long to wait for pages to render\n  3. trying to understand why truncated results come back sometimes, and if there are any signals for identifying when it is happening\n  4. exploring whether Cloudflare will lean on cached content for concurrent requests for the same content\n\n\n\nThis last point is surprising: why isn’t Cloudflare using its caching infrastructure as a way of delivering crawled content faster and with fewer resources? Maybe this would require a more significant investment on their part, and they are waiting to see if people start using it first?\n\nOne thing I didn’t mention is that the Cloudflare free plan limits you to a maximum of 100 pages per crawl. I set up a $5/month paid plan account in order to do this testing. In all my testing I seemed to use only 0.7 “browser hours”, which fits well within the 10 hours allowed per month. It currently costs $0.09 / hour when you exceed your limit.\n\nPS. If you are curious, the Marimo notebook I was using for some of the analysis can be found here.\n\n###  References\n\nOgden, J., Summers, E., & Walker, S. (2023). Know(ing) Infrastructure: The Wayback Machine as object and instrument of digital research. _Convergence: The International Journal of Research into New Media Technologies_, 135485652311647. https://doi.org/10.1177/13548565231164759\n\nScott, J. C. (1998). _Seeing like a state: How certain schemes to improve the human condition have failed_. Yale University Press. Retrieved from https://theanarchistlibrary.org/library/james-c-scott-seeing-like-a-state\n\nSummers, E. H. (2020). _Legibility Machines: Archival Appraisal and the Genealogies of Use_. Digital Repository at the University of Maryland. 
https://doi.org/10.13016/U95C-QAYR",
  "title": "Ways of Seeing the Web"
}