RE: What's the proper robots.txt configuration for disallowing access to bots?

Jamie Sammons, modified 1 Year ago. New Member Posts: 2 Join Date: 6/16/23

Dear Support Team,

We are seeing suspicious traffic to our website that appears to originate from various MSN/Bing bots trying to index different parts and subpages of the site.

I've updated the robots.txt configuration of the site's Public Pages to the following rules:

User-agent: *
Disallow:

User-agent: bingbot
Disallow: /

User-agent: msnbot
Disallow: /

Sitemap: [$PROTOCOL$]://[$HOST$]:[$PORT$]/sitemap.xml

This should disallow access to the site's pages for any user agent whose string contains 'bingbot' or 'msnbot'.
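One way to sanity-check how these rules are interpreted, independent of Liferay, is to parse them with Python's standard `urllib.robotparser`. This is only a sketch: the helper name `is_allowed` and the example.com URL are illustrative, the Sitemap line is left out because its placeholders are only resolved by the portal, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The rules from the post above, without the Sitemap line.
RULES = """\
User-agent: *
Disallow:
User-agent: bingbot
Disallow: /
User-agent: msnbot
Disallow: /
"""


def is_allowed(rules: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under `rules`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical page URL, used only for illustration.
PAGE = "https://example.com/web/guest/home"

for agent in ("bingbot", "msnbot", "Googlebot"):
    print(agent, "allowed" if is_allowed(RULES, agent, PAGE) else "disallowed")
```

If the two Bing agents come out as disallowed here, the rules themselves are fine and the problem is more likely where the file is served from, or that the crawler does not honor it.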

Since this doesn't seem to have stopped the bots from crawling the website, do I need to add anything else to these rules, or re-apply them somehow?

Kind Regards,

Antonis

Olaf Kock, modified 1 Year ago. Liferay Legend Posts: 6441 Join Date: 9/23/08

robots.txt needs to be served from the root directory of your server, e.g. example.com/robots.txt. If you're configuring this in a secondary site without declaring a virtual host, that site's robots.txt might instead appear under example.com/web/sitename/robots.txt. In that case you might want to edit the robots.txt of your default site (typically /web/guest), as that's the one that appears in the root.

Also note that robots.txt is only a "recommendation": well-behaved robots typically honor it, but there are also rogue robots that don't care about your recommendation.
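To see which robots.txt a crawler would actually read, keep in mind that crawlers request the file from the root of the host, whatever page they are visiting. A minimal Python sketch of that lookup follows; the function names are illustrative and example.com stands in for the real host.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen


def root_robots_url(page_url: str) -> str:
    """Build the root-level robots.txt URL for the host serving `page_url`."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including any port); replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def fetch_robots(page_url: str, timeout: float = 10.0) -> str:
    """Download the robots.txt that applies to `page_url` and return it as text."""
    with urlopen(root_robots_url(page_url), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# A page inside a secondary site still maps to the root robots.txt.
print(root_robots_url("https://example.com/web/sitename/home"))
```

Calling `fetch_robots` with one of your own page URLs and comparing the result with the rules you configured shows whether the edited file is the one being served at the root.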

Aravinth Kumar, modified 1 Year ago. Regular Member Posts: 152 Join Date: 6/26/13

Hi Antonis, 

There are many ways to prevent bot attacks. One way is to use a web application firewall (WAF).

Check whether a WAF in front of your site can block the bad bot traffic.
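As a rough illustration of the kind of rule such a firewall applies, here is a sketch of user-agent filtering written as a Python WSGI middleware. It is not Liferay or WAF configuration: the names `BLOCKED_TOKENS`, `is_blocked` and `block_bots` are made up for this example, and since a User-Agent header can be spoofed, a real WAF would normally combine this with other signals.

```python
# Hypothetical block list, matching the agents named in the robots.txt rules.
BLOCKED_TOKENS = ("bingbot", "msnbot")


def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent header contains a blocked token."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_TOKENS)


def block_bots(app):
    """Wrap a WSGI app so requests from blocked agents receive 403."""
    def middleware(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT", "")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Everything else is passed through to the wrapped application.
        return app(environ, start_response)
    return middleware
```

Unlike robots.txt, a filter like this is enforced by the server, so it also affects robots that ignore the recommendation.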

Regards,

Aravinth