Raw Record Source

{
  "$type": "site.standard.document",
  "description": "feat: implement robots.txt generator to block AI scrapers in Hugo",
  "path": "/posts/hugo-block-ai-crawlers/",
  "publishedAt": "2024-07-16T00:00:00.000Z",
  "site": "https://read.ryancowl.es",
  "tags": [
    "Code"
  ],
  "textContent": "As AI companies continue to scrape content from the open web, I wanted to take small steps to protect my own content against them. Since the mid-1990s, a simple robots.txt file in the root directory of a website has communicated to bots how they should or shouldn’t crawl its pages. While this file has no legal or technical authority[^1], and relying on  is trusting bots to respect its rules with no mechanism to enforce them, I decided it can’t hurt to try. And who knows, it may help prevent at least some crawlers from shamelessly scraping content. Let’s find out!\n\nCreate a robots.txt file with Hugo\n\nHugo can generate a robots.txt file just like any other template. As a first step, I sought out a list of AI crawlers to block in that file. I came across the ai.robots.txt project, which seemed like a good starting point. I simply copied the contents of their robots.txt file into a new local file in the  directory of my local Hugo installation:\n\nI edited the newly created  to allow other bots access to the site with the following:\n\nThen I built Hugo locally and checked for the new  file in the root directory at . After confirming that worked as expected, I committed my changes and pushed to production. But then I got to thinking…\n\nAutomatically update robots.txt\n\nThe remote robots.txt file appears to be updated regularly as new crawlers are added. Instead of having to remember to check that list and manually add new entries to my local , I decided to take things a step further and integrate the update into Hugo’s build process.\n\nCreate Hugo template file\n\nThe first step is to create a new template file in Hugo. This file will also live in  and we can call it . In that file, we can use Hugo’s resources.GetRemote to snag the list of crawlers from the  GitHub repo and assign it to a variable. Then we can extract the content and use the  filter to ensure the content is treated as safe HTML. 
And finally, we can output the fetched content in the file itself.\n\nI included a couple of other things, such as a sitemap and allow rules for other bots to crawl the site. Putting it all together, my  looks something like this:\n\nModify Hugo config\n\nWith the new template file created, we just need to adjust Hugo’s configuration file to handle the rest. I’m using , so I opened it up and added the following:\n\nTake it for a test drive\n\nWe can spin up a local development server to see how it works with [^2] and take a look at . Sure enough, I see the following:\n\nFrom here, we can push changes to production. Now each time Hugo builds and deploys,  will be updated with the latest version of the  file.\n\nFurther reading\n\nI don’t trust that AI crawlers will respect  but it’s worth a shot. If you want to take this further, you can block crawlers at the server level. Here are some links I’ve found that may be helpful in pursuing that route:\nGo ahead and block AI web crawlers\nBlockin’ Bots (with )\nBlocking Bots with Nginx\n\nAs a next step, I may look into setting up the Dark Visitors Analytics Agent to see what sort of impact this does (or doesn’t) have on crawlers.\n\n[^1]: “For three decades, a tiny text file has kept the internet from chaos. This text file has no particular legal or technical authority, and it’s not even particularly complicated. It represents a handshake deal between some of the earliest pioneers of the internet to respect each other’s wishes and build the internet in a way that benefitted everybody. It’s a mini constitution for the internet, written in code.” – The text file that runs the internet\n\n[^2]: Having first manually created , I had an older version stuck in the cache. I included [](https://read.ryancowl.es/robots.txt) to ignore the cache directory.",
  "title": "Block AI Crawlers with Robots.txt"
}