
mvnobrega

Blocking bot user agents in NGINX not working

My NGINX server is experiencing high CPU load due to bots. In my access log they appear like this:

::ffff:34.195.212.30 - - [13/Jul/2024:17:29:09 -0300] "GET /sonhar-com-regar-flores/ HTTP/1.1" 503 428 "-" "ias-va/3.3 (former https://www.admantx.com + https://integralads.com/about-ias/)"

::ffff:65.109.99.207 - - [13/Jul/2024:17:29:12 -0300] "GET /letra/pessoa/page/3/ HTTP/1.1" 200 40501 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)"

::ffff:3.81.17.186 - - [13/Jul/2024:17:29:13 -0300] "GET /sonhar-com-pai-dirigindo/ HTTP/1.1" 200 43444 "-" "Mozilla/5.0 (compatible; proximic; +https://www.comscore.com/Web-Crawler)"

::ffff:66.249.68.39 - - [13/Jul/2024:17:29:13 -0300] "GET /sonhar-com-escorpiao-e-lagosta/ HTTP/1.1" 200 43942 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.6422.175 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

::ffff:185.191.171.14 - - [13/Jul/2024:17:29:13 -0300] "GET /biblia/kja/jo/39/22 HTTP/1.1" 200 8272 "-" "Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)"

And I'm trying to block the bots like this:

server {
    # [...]

    if ($http_user_agent ~* (SemrushBot|BLEXBot)) {
        return 403;
    }
}

But it just doesn't work, and I can't understand why. I've tried everything, and the only thing that works is adding deny followed by the IP address (see the sketch below). However, some bots rotate through many IPs, and it seems impractical to keep adding new ones all the time.
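
For illustration, the per-IP approach looks roughly like this (addresses taken from the log above), and the list never stops growing:

 # inside the server block
 deny 34.195.212.30;
 deny 65.109.99.207;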

Can anyone tell me why it's not working?

Note: I always restart all services after any change, and it still doesn't work.

jlrdw

Have you tried a honeypot? (A sketch of the idea is below.)

There are also services that help with this, and some GitHub packages worth checking.

Cloudflare also offers a bot-protection service.
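
A minimal sketch of the honeypot idea, assuming a hypothetical trap path /bot-trap/ that you link invisibly from your pages and disallow in robots.txt, so only crawlers that ignore robots.txt ever request it:

 location = /bot-trap/ {
     # anything landing here ignored robots.txt; log it for later blocking
     access_log /var/log/nginx/bot-trap.log;
     return 403;
 }

The addresses collected in that log can then be fed into deny rules or a tool like fail2ban.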

mvnobrega

@jlrdw I'll try some package and see if that solves it. What I'm doing should work, but I'll give a package a try anyway. Thanks

Snapey
Best Answer

robots.txt is your first defence (if you don't want your site crawled)
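
For reference, a robots.txt along these lines asks the two crawlers from the log above to stay away entirely (well-behaved bots honour it; many others do not):

 User-agent: SemrushBot
 Disallow: /

 User-agent: BLEXBot
 Disallow: /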

mvnobrega

Yes, I did that, but with robots.txt I don't get error logs to tell whether it's working. Even though the bots are listed in robots.txt, nginx keeps logging requests from them.

jlrdw

@mvnobrega My Laravel 11 install came with a robots.txt file. Had you deleted yours?

mvnobrega

@jlrdw I didn't delete it. But my server hosts a mix of Laravel and WordPress sites, so I was watching everything on the server. Anyway, I've now added the bots to robots.txt to block them there too.

mvnobrega

I ended up solving it just by including the following:

if ($http_user_agent ~* (DataForSeoBot|SemrushBot|GPTBot|CriteoBot/0.1|CriteoBot|proximic|AhrefsBot|dotbot|Amazonbot|grapeshot|BLEXBot)) {
    return 403;
}

In the NGINX custom_rules file
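
A quick way to verify a rule like this (example.com standing in for your own domain): send a request with a matching User-Agent and check that the response is 403.

 curl -I -A "SemrushBot" https://example.com/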

naden

@mvnobrega consider using an nginx map in an external file. It's much easier to maintain and can be shared between different hosts. For very high-traffic sites I also like to block all those bots via the Cloudflare WAF, so I don't even have to serve them a 403 from my application server.

Using robots.txt does not work. Lately all these "AI" companies are crawling the web like crazy and simply ignoring it.

Example below

Content of blocked-uas.conf (you can use regular expressions here)

map $http_user_agent $visitor {
  default "user";
  ~Quantcastbot "bot";
  ~Optimizer "bot";
  ~Clickagy "bot";
  ...
}

In your host file, do the following:

include blocked-uas.conf;

server {
    ...
    if ($visitor = "bot") {
        return 403; 
    }
    ...
}
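
One caveat, as far as I know: map is only valid at the http level, so blocked-uas.conf has to be included outside any server block, exactly as shown above. After editing, a config test and reload is enough (assuming a systemd-based setup):

 sudo nginx -t && sudo systemctl reload nginx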