Raw Record Source

{
  "$type": "site.standard.document",
  "description": "Bots are everywhere. But if they misbehave, you can put a stop to them by blocking their user agent via an nginx config change.",
  "path": "/blog/block-user-agents-on-nginx-config/",
  "publishedAt": "2024-10-05T00:00:00.000Z",
  "site": "at://did:plc:3nlkmby2zllrhcj6z5dnicui/site.standard.publication/3mnr22gea2o2d",
  "textContent": "Bots are everywhere, and so are malicious agents. They crawl, they scrape, they read... and sometimes (or most of the time?), they abuse.\n\nA few times now, I've observed increased CPU / RAM consumption on servers where I wouldn't expect any usage during off-peak hours, and after browsing through the access logs, I found that the cause was certain bots crawling my sites over and over again.\n\nThis has been made worse, for example, by certain WordPress plug-ins that generate links with unique GET parameters on each page load, making these not-so-smart crawler bots get stuck in an endless loop. What a waste of bandwidth and compute resources.\n\nTo put an end to this, I now block certain user agents by default from accessing any of the websites I host, directly in the nginx config. This ensures that malicious actors are stopped even before they can trigger any server execution or load static content. You can follow similar steps, or even swap the user agent for any other identifier you choose, to block agents from accessing your servers.\n\n\nCREATE A LIST OF BLOCKED USER AGENTS\n\nCreate a new file in /etc/nginx with the following content and any name you want.\n\nI'll use blocked_user_agent.rules here:\n\nmap $http_user_agent $blocked_user_agent {\n    # Requests are allowed by default\n    default 0;\n\n    # `~*example` will match any user agent string\n    # that has `example` anywhere inside it, ignoring case\n    # (a plain `~example` would be case-sensitive).\n    # Some examples:\n    ~*amazonbot 1;\n    ~*openai 1;\n    ~*chatgpt 1;\n    ~*gptbot 1;\n}\n\nThis list can be as short or as long as you need, and you can add or remove blocked user agents at any time.\n\n\nBLOCK REQUESTS BASED ON $BLOCKED_USER_AGENT\n\nNow that you have an easily accessible map of user agents, it's time to make this variable available to nginx and block unwanted requests.\n\nIn /etc/nginx/nginx.conf, add the following at the end of the http block:\n\nhttp {\n    # Skipping over content...\n    # (...)\n\n    # Include file with map of blocked user agents\n    include /etc/nginx/blocked_user_agent.rules;\n}\n\nFinally, in the config files for each of your sites (inside /etc/nginx/sites-enabled/), add the following inside the server block, before you start matching any locations:\n\nserver {\n    server_name aitorres.com;\n\n    # Blocking undesired user agents\n    if ($blocked_user_agent) {\n        # `444 No Response` is a non-standard, nginx-specific\n        # status code that closes the connection without\n        # sending a response. You can choose to return standard\n        # HTTP status codes instead, like `404 Not Found` or\n        # `403 Forbidden`, based on your needs.\n        return 444;\n    }\n\n    # Rest of your usual file, unchanged\n}\n\nOne more thing: reload or restart your nginx server from your shell.\n\n# Check that the config is valid\nnginx -t\n\n# Reload the server without downtime (you can restart instead)\nnginx -s reload\n\nAll done! Your server will start blocking these requests, and you should start seeing reduced resource consumption. If you have access logs enabled for your server, you'll see the requests from blocked user agents logged with the HTTP status code you chose to return.\n\nOne final note: if you ever modify the list of blocked user agents, remember to reload or restart nginx for the changes to take effect.\n\nThis method is not infallible, as it depends on the bot (or the malicious agent behind it) consistently using the same user agent, but it's a start and takes just a couple of minutes to add. Hopefully one day bots will stop misbehaving completely, but until then... ;-)",
  "title": "Block malicious user agents via nginx config"
}