
mvnobrega

Blocking bot user agents in NGINX not working

My NGINX server is experiencing high CPU load due to bots. In my access log they appear like this:

::ffff:34.195.212.30 - - [13/Jul/2024:17:29:09 -0300] "GET /sonhar-com-regar-flores/ HTTP/1.1" 503 428 "-" "ias-va/3.3 (former https://www.admantx.com + https://integralads.com/about-ias/)"

::ffff:65.109.99.207 - - [13/Jul/2024:17:29:12 -0300] "GET /letra/pessoa/page/3/ HTTP/1.1" 200 40501 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)"

::ffff:3.81.17.186 - - [13/Jul/2024:17:29:13 -0300] "GET /sonhar-com-pai-dirigindo/ HTTP/1.1" 200 43444 "-" "Mozilla/5.0 (compatible; proximic; +https://www.comscore.com/Web-Crawler)"

::ffff:66.249.68.39 - - [13/Jul/2024:17:29:13 -0300] "GET /sonhar-com-escorpiao-e-lagosta/ HTTP/1.1" 200 43942 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.6422.175 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

::ffff:185.191.171.14 - - [13/Jul/2024:17:29:13 -0300] "GET /biblia/kja/jo/39/22 HTTP/1.1" 200 8272 "-" "Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)"

And I'm trying to block the bots like this:

server {
    # [...]

    if ($http_user_agent ~* (SemrushBot|BLEXBot)) {
        return 403;
    }
}

But it just doesn't work, and I can't understand why. I've tried everything, and the only thing that works is adding deny followed by the IP address (see the sketch below). However, some bots rotate through many IPs, and it seems impractical to keep adding new ones all the time.
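
For illustration, the per-IP approach looks roughly like this (addresses taken from the log above), and the list never stops growing:

 # inside the server block
 deny 34.195.212.30;
 deny 65.109.99.207;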

Can anyone tell me why it's not working?

Note: I always restart all services after any change, and it still doesn't work.

jlrdw

Have you tried a honeypot? (A sketch of the idea is below.)

There are also services that help with this, and some GitHub packages worth checking.

Cloudflare also offers a bot-protection service.
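
A minimal sketch of the honeypot idea, assuming a hypothetical trap path /bot-trap/ that you link invisibly from your pages and disallow in robots.txt, so only crawlers that ignore robots.txt ever request it:

 location = /bot-trap/ {
     # anything landing here ignored robots.txt; log it for later blocking
     access_log /var/log/nginx/bot-trap.log;
     return 403;
 }

The addresses collected in that log can then be fed into deny rules or a tool like fail2ban.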

mvnobrega

@jlrdw I'll try some package and see if that solves it. What I'm doing should work, but I'll give a package a try anyway. Thanks

Snapey
Best Answer

robots.txt is your first defence (if you don't want your site crawled)
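
For reference, a robots.txt along these lines asks the two crawlers from the log above to stay away entirely (well-behaved bots honour it; many others do not):

 User-agent: SemrushBot
 Disallow: /

 User-agent: BLEXBot
 Disallow: /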

mvnobrega

Yes, I did that, but with robots.txt I don't get error logs to tell whether it's working. Even though the bots are listed in robots.txt, nginx keeps logging requests from them.

jlrdw

@mvnobrega My Laravel 11 install came with a robots.txt file. Had you deleted yours?

mvnobrega

@jlrdw I didn't delete it. But my server hosts a mix of Laravel and WordPress sites, so I was watching everything on the server. Anyway, I've now added the bots to robots.txt to block them there too.

mvnobrega

I ended up solving it just by including the following:

if ($http_user_agent ~* (DataForSeoBot|SemrushBot|GPTBot|CriteoBot/0.1|CriteoBot|proximic|AhrefsBot|dotbot|Amazonbot|grapeshot|BLEXBot)) {
    return 403;
}

In the NGINX custom_rules file
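
A quick way to verify a rule like this (example.com standing in for your own domain): send a request with a matching User-Agent and check that the response is 403.

 curl -I -A "SemrushBot" https://example.com/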

naden

@mvnobrega consider using an nginx map in an external file. It's much easier to maintain and can be shared between different hosts. For very high-traffic sites I also like to block all those bots via the Cloudflare WAF, so I don't even have to serve them a 403 from my application server.

Using robots.txt does not work. Lately all these "AI" companies are crawling the web like crazy and simply ignoring it.

Example below

Content of blocked-uas.conf (you can use regular expressions here)

map $http_user_agent $visitor {
  default "user";
  ~Quantcastbot "bot";
  ~Optimizer "bot";
  ~Clickagy "bot";
  ...
}

In your host file, do the following:

include blocked-uas.conf;

server {
    ...
    if ($visitor = "bot") {
        return 403; 
    }
    ...
}
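
One caveat, as far as I know: map is only valid at the http level, so blocked-uas.conf has to be included outside any server block, exactly as shown above. After editing, a config test and reload is enough (assuming a systemd-based setup):

 sudo nginx -t && sudo systemctl reload nginx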