RE: What's the proper robots.txt configuration for disallowing access to bots?
Dear Support Team,
We are seeing suspicious traffic on the website that appears to originate from various MSN/Bing bots trying to index different parts and subpages of the site.
I've updated the robots.txt configuration of the site's Public Pages to the following rules:
User-Agent: *
Disallow:
User-agent: bingbot
Disallow: /
User-agent: msnbot
Disallow: /
Sitemap: [$PROTOCOL$]://[$HOST$]:[$PORT$]/sitemap.xml
This should disallow access to the site pages for any user agent whose name contains 'bingbot' or 'msnbot'.
Since this doesn't seem to have stopped the bots from crawling the website, do I need to add anything else to these rules, or re-apply them in some way?
Kind Regards,
Antonis
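To see how a parser reads the rules in the question, they can be fed to Python's standard urllib.robotparser module. This is only a rough sketch: it assumes the "Disallow: /" value for bingbot sits on a single line, leaves out the Sitemap line, and uses example.com as a placeholder host. It shows how Python's parser interprets the rules, which is not necessarily how Bing's crawler does.

```python
from urllib.robotparser import RobotFileParser

# Rules from the question, minus the Sitemap line (assumption: "Disallow: /"
# is written on one line for both Bing agents).
RULES = """\
User-Agent: *
Disallow:
User-agent: bingbot
Disallow: /
User-agent: msnbot
Disallow: /
"""


def is_allowed(user_agent, url, rules=RULES):
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com and the page path are placeholders, not the real site.
PAGE = "https://example.com/web/guest/home"
for agent in ("bingbot", "msnbot", "Googlebot"):
    print(agent, is_allowed(agent, PAGE))
```

If the two Bing agents come out as refused here, the rules themselves express the intent, and the remaining questions are where the file is served from and whether the bots honor it.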
robots.txt needs to be served from the root of your server, e.g. example.com/robots.txt. If you are configuring it in a secondary site without declaring a virtual host, that site's robots.txt may instead appear under example.com/web/sitename/robots.txt, where crawlers will not look for it. In that case, edit the robots.txt of your default site (typically /web/guest), as that is what appears in the root.
Also note that robots.txt is only a recommendation. Robots typically honor it, but rogue robots can simply ignore it.
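As a small illustration of the point about the root location: for any page on a host, a crawler derives a single robots.txt URL from the scheme and host alone, so a file under /web/sitename/ is never consulted. The helper name root_robots_url and the example.com URLs below are made up for this sketch.

```python
from urllib.parse import urlsplit, urlunsplit


def root_robots_url(page_url):
    """Return the robots.txt URL a crawler would request for page_url's host."""
    parts = urlsplit(page_url)
    # Only scheme and host (plus port, if any) are kept; the path is always /robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Both a regular page and a site-scoped robots.txt map to the same root file.
print(root_robots_url("https://example.com/web/sitename/some-page"))
print(root_robots_url("https://example.com/web/sitename/robots.txt"))
```

Opening that root URL in a browser is a quick way to confirm which rules are actually being served to crawlers.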
Hi Antonis,
There are many ways to prevent bot attacks. One way is to put a web application firewall (WAF) in front of the site.
Check whether a WAF can block the bad bot traffic for you.
Regards,
Aravinth
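To make the firewall idea concrete, here is a minimal sketch of the kind of rule a WAF or reverse proxy might apply: refuse requests whose User-Agent header contains a blocked token. It is written as a generic Python WSGI middleware purely for illustration. It is not Liferay configuration, and the names block_bots, site and status_for are hypothetical. A User-Agent header can be spoofed, so real WAFs usually combine this with other signals.

```python
from wsgiref.util import setup_testing_defaults

# Tokens taken from the robots.txt rules in the question.
BLOCKED_TOKENS = ("bingbot", "msnbot")


def block_bots(app, blocked=BLOCKED_TOKENS):
    """Wrap a WSGI app and answer 403 to user agents containing a blocked token."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in blocked):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware


def site(environ, start_response):
    """Stand-in for the real site (hypothetical)."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]


def status_for(user_agent):
    """Send one fake request through the filter and return the status line."""
    environ = {}
    setup_testing_defaults(environ)
    environ["HTTP_USER_AGENT"] = user_agent
    captured = []
    block_bots(site)(environ, lambda status, headers: captured.append(status))
    return captured[0]


print(status_for("Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"))
print(status_for("Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0"))
```

Unlike robots.txt, a filter of this kind is enforced by the server rather than left to the bot's good will.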