AI Bot Detection and Blocking

Worried about bots scraping your Liferay site? This post details a simple and effective solution: an Anubis-inspired OSGi servlet filter that forces browsers to solve a quick JavaScript challenge, stopping scrapers cold while remaining transparent to real users.

Introduction

My blog/project ideas typically come from client and community questions about how to do something in Liferay. However, in this case, I was inspired by a recent post on Slashdot about a fascinating project called Anubis. I headed over to their website, anubis.techaro.lol, to learn more about how it protects web content from being scraped by AI bots.

The architecture is both simple and powerful. Anubis is designed as a separate "appliance"—essentially a reverse proxy that sits between your public-facing web server and your back-end application server. When a new user requests a page, Anubis intercepts the request. Instead of passing it to the application server, it serves a lightweight page containing a JavaScript challenge. This challenge typically involves solving a small, unique cryptographic puzzle that requires a functioning JavaScript runtime.

Regular browsers can solve this in milliseconds, almost invisibly. However, the vast majority of bots and web scrapers are not full-fledged browsers; they are simple HTTP clients that don't execute JavaScript. They receive the challenge, can't solve it, and are effectively blocked from ever seeing your actual content. The verified browser (representing a real human), on the other hand, submits the correct solution and is seamlessly passed through to the application.

For larger installations with heavy traffic, using a dedicated appliance like Anubis is a fantastic solution. It offloads all the security work, includes a lot of configurable options, and ensures your application server's resources are spent serving legitimate users, not fighting off bots and scrapers.

But for smaller sites, I wondered if a full reverse proxy might be overkill. Couldn't I emulate what the Anubis project did, but in a smaller, self-contained form that would integrate well into a Liferay DXP environment? I decided to find out.

The result is a lightweight, OSGi-based servlet filter that brings the core logic of a JavaScript challenge directly into the Liferay lifecycle. In the next section, we'll dive into the details of how it was built.

Liferay Anubis

The complete project is available on GitHub: https://github.com/dnebing/liferay-anubis.

The repository contains a single Liferay module named liferay-anubis, and its core logic is surprisingly straightforward, residing in just two key Java files.

The Configuration: AnubisBotDetectionConfiguration.java

First up is the configuration interface. Since we want administrators to have full control over the filter's behavior without needing to redeploy code, I used Liferay's Configuration API.

A key benefit of using Liferay's Configuration API is its built-in inheritance model. The filter looks for a site-specific configuration first. If one isn't defined, it falls back to the instance-level settings, and if no instance configuration exists, it uses the system-wide defaults. This hierarchy lets a portal administrator define defaults at the system level while still overriding them with per-instance or per-site settings where needed.
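To make that lookup concrete, here is a minimal sketch of how the filter might ask Liferay's ConfigurationProvider for the effective settings. The helper class and method names are mine, not the project's, and it assumes the configuration is registered as site (GROUP) scoped so that Liferay resolves the fallback chain for us.

import com.liferay.portal.kernel.module.configuration.ConfigurationException;
import com.liferay.portal.kernel.module.configuration.ConfigurationProvider;

import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Reference;

// Hypothetical helper component, not the project's actual class.
@Component(service = AnubisConfigurationResolver.class)
public class AnubisConfigurationResolver {

    // For a GROUP-scoped configuration, this single call resolves the
    // hierarchy: site settings first, then instance, then system defaults.
    public AnubisBotDetectionConfiguration getConfiguration(long groupId)
        throws ConfigurationException {

        return _configurationProvider.getGroupConfiguration(
            AnubisBotDetectionConfiguration.class, groupId);
    }

    @Reference
    private ConfigurationProvider _configurationProvider;

}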

This interface defines all the tunable parameters, sketched in the example after the list:

  • Enable/Disable: A simple checkbox to turn the entire filter on or off.

  • Excluded Paths: A multi-line text field to list URL paths that should never be challenged. This is critical for functionality. For example, you wouldn't want to challenge Liferay's own resource-bundling servlet (/combo) or essential pages like the login path (/c/portal/login).

  • Ignored File Extensions: A list of file extensions (.css, .js, .png, .jpg, etc.) to ignore. This prevents the filter from trying to serve an HTML challenge page when the browser is simply requesting a stylesheet or an image.

  • Approved User Agents: This allows you to create an allow-list for known and trusted bots. You certainly don't want to block Googlebot or other search engine crawlers, as that would wreak havoc on your SEO.

  • Customizable Page Content: Finally, there are fields for the page title, custom CSS, and an HTML fragment for the challenge page. This gives administrators an easy way to style the verification page to match their site's branding, just in case a user on a slow connection happens to see it before the JavaScript completes.
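To give a feel for the shape of such an interface, here is a minimal sketch using Liferay's standard metatype annotations. The id, field names, and defaults are illustrative, not copied from the project.

import aQute.bnd.annotation.metatype.Meta;

import com.liferay.portal.configuration.metatype.annotations.ExtendedObjectClassDefinition;

// Illustrative sketch only; the real interface lives in the GitHub project.
@ExtendedObjectClassDefinition(
    category = "security",
    scope = ExtendedObjectClassDefinition.Scope.GROUP
)
@Meta.OCD(
    id = "com.example.anubis.AnubisBotDetectionConfiguration",
    localization = "content/Language",
    name = "anubis-bot-detection-configuration"
)
public interface AnubisBotDetectionConfiguration {

    // Master on/off switch for the filter.
    @Meta.AD(deflt = "true", required = false)
    boolean enabled();

    // Regex paths that are never challenged (array defaults are pipe-delimited).
    @Meta.AD(deflt = "/combo.*|/c/portal/.*", required = false)
    String[] excludedPaths();

    // Static resource extensions the filter should ignore.
    @Meta.AD(deflt = ".css|.js|.png|.jpg|.gif|.svg|.woff2", required = false)
    String[] ignoredExtensions();

    // User agent substrings identifying trusted crawlers.
    @Meta.AD(deflt = "Googlebot|bingbot", required = false)
    String[] approvedUserAgents();

    // Presentation of the challenge page.
    @Meta.AD(deflt = "Checking your browser...", required = false)
    String challengePageTitle();

    @Meta.AD(deflt = "", required = false)
    String challengePageCss();

    @Meta.AD(deflt = "", required = false)
    String challengePageHtml();

}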

The Filter: AnubisBotDetectionFilter.java

The second key file is the filter itself. This class is registered as an OSGi component implementing the standard javax.servlet.Filter interface. It uses the configuration to control its operation on every incoming GET request.

Its logic flow is simple:

  1. Check if the filter is enabled in the site configuration.

  2. Check if the request is from an already-verified session (by looking for a specific HttpSession attribute).

  3. Check if the request is for an excluded path, an ignored file extension, or from an approved user agent.

If the request is not bypassed by any of these rules, the filter steps in and presents the JavaScript challenge.
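In simplified form, the decision flow might look like the sketch below. It builds on the configuration interface sketched earlier; the attribute, parameter, and helper names are my own shorthand, the OSGi registration details are omitted, and the challenge and verification steps are left abstract because they're sketched in the next paragraphs.

import java.io.IOException;
import java.util.regex.Pattern;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpSession;

// Simplified stand-in for the real filter class.
public abstract class BotDetectionFilterSketch implements Filter {

    protected static final String CHALLENGE_PARAMETER = "anubisResponse";

    protected static final String VERIFIED_ATTRIBUTE = "ANUBIS_VERIFIED";

    @Override
    public void doFilter(
            ServletRequest servletRequest, ServletResponse servletResponse,
            FilterChain filterChain)
        throws IOException, ServletException {

        HttpServletRequest request = (HttpServletRequest)servletRequest;
        HttpServletResponse response = (HttpServletResponse)servletResponse;

        AnubisBotDetectionConfiguration configuration =
            getConfiguration(request);

        HttpSession session = request.getSession();

        // 1. Filter disabled, 2. session already verified, or 3. non-GET
        // request, excluded path, ignored extension, or approved user agent:
        // pass the request straight through.
        if (!configuration.enabled() ||
            (session.getAttribute(VERIFIED_ATTRIBUTE) != null) ||
            !"GET".equalsIgnoreCase(request.getMethod()) ||
            isBypassed(request, configuration)) {

            filterChain.doFilter(servletRequest, servletResponse);

            return;
        }

        // Otherwise either check a submitted answer or serve the challenge.
        if (request.getParameter(CHALLENGE_PARAMETER) != null) {
            verifyChallengeResponse(request, response);
        }
        else {
            sendChallengePage(request, response);
        }
    }

    protected abstract AnubisBotDetectionConfiguration getConfiguration(
        HttpServletRequest request);

    protected boolean isBypassed(
        HttpServletRequest request,
        AnubisBotDetectionConfiguration configuration) {

        String path = request.getRequestURI();

        for (String excludedPath : configuration.excludedPaths()) {
            if (Pattern.matches(excludedPath, path)) {
                return true;
            }
        }

        for (String extension : configuration.ignoredExtensions()) {
            if (path.endsWith(extension)) {
                return true;
            }
        }

        String userAgent = request.getHeader("User-Agent");

        if (userAgent != null) {
            for (String agent : configuration.approvedUserAgents()) {
                if (userAgent.contains(agent)) {
                    return true;
                }
            }
        }

        return false;
    }

    // Sketched separately below.
    protected abstract void sendChallengePage(
            HttpServletRequest request, HttpServletResponse response)
        throws IOException;

    protected abstract void verifyChallengeResponse(
            HttpServletRequest request, HttpServletResponse response)
        throws IOException;

}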

The challenge itself is designed to be simple but effective. The server generates a unique UUID (a nonce) and stores it in the user's session. This nonce is embedded into the challenge page sent to the browser. The client-side JavaScript's only job is to reverse the characters of this string and submit the result back as a URL parameter.
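Here is a rough illustration of what generating that challenge page could look like. The session attribute name, parameter name, and markup are invented for this sketch; the real filter assembles its page from the configured title, CSS, and HTML fragment.

import java.io.IOException;
import java.util.UUID;

import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class ChallengePageSketch {

    public static void sendChallengePage(
            HttpServletRequest request, HttpServletResponse response)
        throws IOException {

        // Generate the nonce and remember it in the session for later
        // verification.
        String nonce = UUID.randomUUID().toString();

        request.getSession().setAttribute("ANUBIS_NONCE", nonce);

        response.setContentType("text/html;charset=UTF-8");

        // The embedded script reverses the nonce and reloads the original
        // URL with the answer appended as a query parameter.
        response.getWriter().write(
            "<!DOCTYPE html><html><head><title>Checking your browser..." +
                "</title></head><body><p>Verifying your browser...</p>" +
                "<script>" +
                "var answer = '" + nonce + "'.split('').reverse().join('');" +
                "var separator = window.location.search ? '&' : '?';" +
                "window.location = window.location.href + separator + " +
                "'anubisResponse=' + encodeURIComponent(answer);" +
                "</script></body></html>");
    }

}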

When the filter receives this reply, it retrieves the original nonce from the session, reverses it on the back-end, and compares it to the value submitted by the browser. If they match, the client has proven it has a JavaScript runtime. The filter then sets the HttpSession attribute, marking the session as "human," and redirects the user to their originally requested URL. All future requests from that session will be allowed through without a challenge.
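The matching server-side check might be as simple as this sketch (again, the attribute and parameter names are illustrative, and the simplified redirect drops any other query parameters the original request carried):

import java.io.IOException;

import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpSession;

public class ChallengeVerificationSketch {

    public static void verifyChallengeResponse(
            HttpServletRequest request, HttpServletResponse response)
        throws IOException {

        HttpSession session = request.getSession();

        String nonce = (String)session.getAttribute("ANUBIS_NONCE");
        String answer = request.getParameter("anubisResponse");

        if ((nonce != null) && (answer != null) &&
            answer.equals(new StringBuilder(nonce).reverse().toString())) {

            // Mark the session as verified and drop the one-time nonce.
            session.setAttribute("ANUBIS_VERIFIED", Boolean.TRUE);
            session.removeAttribute("ANUBIS_NONCE");

            // Send the browser back to the originally requested page.
            response.sendRedirect(request.getRequestURI());
        }
        else {
            // No nonce or a wrong answer: refuse the request (the next GET
            // will simply receive a fresh challenge).
            response.sendError(HttpServletResponse.SC_FORBIDDEN);
        }
    }

}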

While this string-reversal task isn't as cryptographically complex as what the full Anubis project might do, it's more than enough to achieve the primary goal: blocking bots and scrapers that cannot execute JavaScript.

Conclusion

This in-application filter has proven to be a viable and effective way to block simple bots and scrapers on a Liferay DXP site. However, before deploying it in a production environment, there are a few important things to keep in mind.

Things to Keep in Mind

First, be very careful with your excluded paths, as it's possible to get yourself stuck in a redirect loop. I actually did this during testing. After logging out, the browser is redirected to /c/portal/logout. Because my session was now invalidated, the filter immediately presented a new challenge, and the browser got stuck in a loop. I solved this by adding /c/portal/.* to the excluded path list, but it's a great example of how critical these settings are.

Next, it will likely take some tuning to get your lists of approved user agents and ignored file extensions right for your specific site. The defaults in the project are a good starting point, but you should monitor your logs after deployment to ensure you aren't blocking legitimate services or resources.

Finally, be aware that external caching layers like CDNs or other caching appliances can potentially defeat this filter. If a bot requests a page that has already been cached from a real user's verified session, the CDN may serve the valid HTML content directly from its cache, and the request will never hit your application server to be challenged.

Potential Improvements

While this implementation works, there are several ways we could make it even better.

First, since we're using the HttpSession to store the validation, every time a user opens a new browser session, they will have to go through the challenge again. A more user-friendly option would be to use a secure, long-lived cookie to store the verification status.
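As a rough sketch of that idea, issuing the cookie could look like the snippet below; the name and lifetime are arbitrary, and a real implementation should store a signed, verifiable token (for example, an HMAC over a server-side secret) rather than a plain flag that any client could forge.

import javax.servlet.http.Cookie;
import javax.servlet.http.HttpServletResponse;

public class VerificationCookieSketch {

    public static void addVerificationCookie(HttpServletResponse response) {
        // Sketch only: the value should really be a signed token.
        Cookie cookie = new Cookie("ANUBIS_VERIFIED", "1");

        cookie.setHttpOnly(true);
        cookie.setSecure(true);
        cookie.setPath("/");
        cookie.setMaxAge(30 * 24 * 60 * 60); // 30 days

        response.addCookie(cookie);
    }

}

The filter would then accept either the session attribute or a valid cookie before deciding to challenge.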

Next, as I've explained exactly how the string-reversal challenge works, it would be relatively easy for a determined bot writer to counter it. A more robust solution would be to implement multiple different types of challenges (e.g., simple math, a DOM manipulation check, a timing analysis) and have the filter randomly present one to the browser. We could also improve security by adding rate-limiting to temporarily block IP addresses that repeatedly fail the challenge, protecting the server from denial-of-service attempts.
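To make the rate-limiting idea concrete, here is one possible approach, purely illustrative and not part of the project: a small in-memory failure counter keyed by client IP. A clustered deployment would need a shared store (cache, database, etc.) instead of a local map.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ChallengeFailureTracker {

    private static final int MAX_FAILURES = 5;

    private static final long WINDOW_MILLIS = 10 * 60 * 1000; // 10 minutes

    private final Map<String, Window> _failures = new ConcurrentHashMap<>();

    public boolean isBlocked(String ipAddress) {
        Window window = _failures.get(ipAddress);

        if (window == null) {
            return false;
        }

        // Expired windows no longer count against the client.
        if ((System.currentTimeMillis() - window.startTime) > WINDOW_MILLIS) {
            _failures.remove(ipAddress);

            return false;
        }

        return window.count >= MAX_FAILURES;
    }

    public void recordFailure(String ipAddress) {
        _failures.compute(
            ipAddress,
            (key, window) -> {
                if ((window == null) ||
                    ((System.currentTimeMillis() - window.startTime) >
                        WINDOW_MILLIS)) {

                    // Start a fresh window with this first failure.
                    return new Window();
                }

                window.count++;

                return window;
            });
    }

    private static class Window {

        int count = 1;
        long startTime = System.currentTimeMillis();

    }

}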

The Right Tool for the Job

It’s important to state that this is not the ideal solution for large-scale sites with heavy traffic. The in-application model means your Liferay instance still has to handle the initial request from every bot, which consumes server resources. These sites should absolutely consider a dedicated, high-performance reverse proxy solution such as Anubis itself (https://anubis.techaro.lol/).

But for smaller sites, or for developers looking for a great starting point, this project can be a perfect fit. The full source code is available on GitHub: https://github.com/dnebing/liferay-anubis. If you want to submit improvements, please send me your PRs and I'll be happy to merge them in.

If you have questions, you can usually find me on the Liferay Community Slack channels, or consider attending one of my "Ask Me Anything" sessions!