Bots and Unexpected Consequences

Don't forget about Bots as you are building your site...

Spiders, Bots and Crawlers are coming for your site. Are you ready?

Recently I was helping a client who seemed to be under a DDoS attack...

Analysis of the traffic showed a large number of incoming requests performing searches for various expected keywords.

And there were a lot of these requests. Since each one triggered a search, this naturally had an impact on their Liferay system: slowing response times, reducing platform capacity, and introducing an instability that, of course, no one was happy about.

After a lot of traffic analysis and site reviews, we came to realize that the site wasn't really under a DDoS attack; it was just being crawled...

So how does a crawler trigger what looks like a DDoS attack via search?

Unfortunately, the content itself kind of caused it.

What we found was that a number of marketing blog pages discussed aspects of the products and services the client offered, and throughout the content you'd find sections like "Click here to see our current list of products..." That, in turn, was basically a link to a search with given criteria, intended to show the search results using a custom ADT...

From a usability perspective, I mean, it really worked great. Since the links triggered a search, nothing in the blog had to be maintained: as new products or services (or content talking about them) were added to the system, they'd show up in the search results. The ADT (FreeMarker) used to render the results had some minor performance issues, but from a single user's perspective, those problems were not seen as significant.

When you think of this from the perspective of a single user browsing your site, you can imagine them finding a blog post, clicking a search link on there, then checking a couple of the result links... Nothing abnormal or abusive about that at all.

However, now consider what happens when a crawler hits your site. It is going to visit every blog page. It is going to click every one of those search links, triggering a search on your site. The poorly performing FreeMarker ADT now gets invoked rapidly and repeatedly as the crawler follows the search links, turning a minor performance problem into a major performance headache. The crawler is also going to visit every one of the search results, and if those result pages have "Click here to search..." sorts of links on them, well, it's going to follow those too...

And now imagine that you have multiple crawlers hitting your site around the same time, and how that multiplies all of the traffic a single crawler generates...
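To get a feel for the multiplication, here's a back-of-the-envelope illustration. All of the numbers are hypothetical, just to show how quickly this adds up:

```python
# Back-of-the-envelope illustration of crawler amplification.
# Every number here is made up for illustration.
blog_pages = 200            # blog posts, each containing a search link
crawlers = 5                # bots crawling the site around the same time
results_followed = 10       # result links each bot follows per search

searches = crawlers * blog_pages
result_fetches = searches * results_followed

print(f"{searches} searches, {result_fetches} result-page fetches")
# A single human visitor, by contrast, might trigger one or two searches total.
```

And that's before the crawlers follow any search links found *on* the result pages, which compounds things further.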

And that is how the client was suffering from what they thought was a DDoS attack. Multiple crawlers were processing the site at the same time, each performing a bunch of searches and then following the results...

So yeah, what seems like it might have been an elegant solution was causing significant problems when the site got crawled...

So what might some other options be?

In 7.4 we have various other ways to handle this... The Asset Publisher (classic) and Collections can be used to present a limited, controlled list of resources that you want to highlight. Rather than linking to a search, consider linking to a page and maintaining that page with the relevant details.

About the ADT and the FreeMarker usage, well, I've railed on that often enough that everyone has heard it all before, but to repeat myself: you can never judge the performance and system impact of an FM template by running it in isolation. To truly understand the FM impact, you need to exercise it under a load test. That's where you'll see the impact of multiple concurrent template processes, the drain on available resources, and the slow response times. These are the FM templates that get you into trouble, but it's not trouble you see when a template runs by itself.
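This isolation-versus-load gap is easy to demonstrate. Here's a minimal sketch (plain Python, not real FreeMarker) where a stand-in "render" function simulates roughly 50 ms of blocking work per invocation, timed once alone and then with 40 concurrent requests contending for 4 render threads:

```python
# Sketch: why a template that looks fine in isolation can fall over under load.
# render_template() is a stand-in simulating ~50 ms of blocking work.
import time
from concurrent.futures import ThreadPoolExecutor

def render_template():
    time.sleep(0.05)  # pretend this is a slow FreeMarker render

# In isolation: a single render looks harmless (~0.05 s).
start = time.perf_counter()
render_template()
solo = time.perf_counter() - start

# Under load: 40 "crawler" requests sharing 4 render threads queue up,
# so total time balloons to roughly (40 / 4) * 0.05 s.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    for _ in range(40):
        pool.submit(render_template)
# Exiting the with-block waits for all queued renders to finish.
total = time.perf_counter() - start

print(f"solo render: {solo:.3f}s, 40 concurrent renders: {total:.3f}s")
```

The single invocation finishes in a blink; the loaded run takes an order of magnitude longer, and real templates make it worse because they also contend for CPU, database connections, and memory rather than just sleeping.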

Ultimately, the moral of this story, kids, is that you can't forget about the bots, crawlers, and spiders that are out there. Understand that they will hit your site, follow your links, and do all of the things you want your users to do, but they'll do them fast, they'll do them in bulk, and they may overwhelm your system to the point that it seems like some sort of DDoS attack.

Keep that in mind as you're designing your Liferay-based solutions!


Some bots don't respect your site and I don't respect them.

True enough, but not respecting them often won't keep them from hitting your site anyway, even when you have a proper robots.txt file in place.
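For the well-behaved crawlers that do honor it, a robots.txt rule can at least keep them out of the search URLs entirely. A minimal sketch, assuming the search results page lives under /search (adjust the path to match your site's actual friendly URL):

```
User-agent: *
Disallow: /search
```

This won't stop abusive bots, but it takes the legitimate crawlers out of the search-amplification problem.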

Avoiding scenarios that let them abuse your site will help prevent that from happening.

Exactly; if I remove all the pages with performance issues, my site becomes more resistant to robots. I did this for a client who had a lot of problems with crashes and performance; today the site works well for both users and robots.