Adding Content-Signal Support to Liferay

Extending Liferay’s robots.txt to Control AI Training, Search, and Input through Cloudflare’s Content Signals

Introduction

The Content Signals Policy is a new initiative from Cloudflare, published at contentsignals.org. It extends the familiar robots.txt file so site owners can declare how their content may be used:

  • search: allow content to be indexed and shown in search results.

  • ai-input: allow content to be used as real-time input to AI models (e.g., retrieval-augmented generation (RAG) or grounding).

  • ai-train: allow content to be used for training or fine-tuning AI models.

For example, this directive allows only search indexing:

Content-Signal: ai-train=no, search=yes, ai-input=no

These signals are machine-readable tags meant to be paired with human-readable policy text, and that policy text explicitly reserves rights under EU Directive 2019/790 (the EU copyright directive's text-and-data-mining provisions).
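
For example, a published robots.txt pairing the machine-readable signal with a human-readable policy comment might look like the following. The comment below is a paraphrase for illustration only; the official recommended wording is published at contentsignals.org.

# As a condition of accessing this website, you agree to honor the
# content signals below: ai-train=no forbids using the content to
# train or fine-tune AI models, search=yes permits indexing it for
# search results, and ai-input=no forbids feeding it to AI models
# as real-time input.
User-Agent: *
Content-Signal: ai-train=no, search=yes, ai-input=no
Disallow: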

Important caveats:

  • Adoption is voluntary. Malicious crawlers may ignore the rules.

  • Cloudflare enforces these signals for its own customers, but outside of Cloudflare's network you are relying on crawlers acting in good faith.

  • The hope is that as major AI companies standardize on respecting Content Signals, all sites will benefit simply by publishing them.

How Liferay Serves /robots.txt

In Liferay, requests to /robots.txt are handled by a RobotsServlet, which loads template files based on portal properties:

robots.txt.with.sitemap=\
  com/liferay/portal/dependencies/robots_txt_with_sitemap.tmpl
robots.txt.without.sitemap=\
  com/liferay/portal/dependencies/\
  robots_txt_without_sitemap.tmpl

By default, those embedded template files look like this:

Without sitemap:

User-Agent: *
Disallow:

With sitemap:

User-Agent: *
Disallow:
Sitemap: [$PROTOCOL$]://[$HOST$]:[$PORT$]/sitemap.xml

At runtime, placeholders like [$PROTOCOL$], [$HOST$], and [$PORT$] are automatically substituted with the protocol, host, and port of the incoming request.
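
To make that concrete, here is a rough sketch of the substitution logic. This is illustrative Java only, not Liferay's actual RobotsServlet source; the class and method names are invented for the example.

// Illustrative only -- not Liferay's actual RobotsServlet implementation.
// It sketches the conceptual flow: load the template from the classpath,
// then replace the [$...$] placeholders with values from the request.
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class RobotsTemplateSketch {

    public static String render(
            String templatePath, String protocol, String host, int port)
        throws Exception {

        // Template paths are resolved against WEB-INF/classes via the
        // classpath, which is why relative paths work in portal-ext.properties.
        try (InputStream in = RobotsTemplateSketch.class
                .getClassLoader().getResourceAsStream(templatePath)) {

            if (in == null) {
                throw new IllegalArgumentException(
                    "Template not found: " + templatePath);
            }

            String template = new String(
                in.readAllBytes(), StandardCharsets.UTF_8);

            // Substitute the runtime placeholders.
            return template
                .replace("[$PROTOCOL$]", protocol)
                .replace("[$HOST$]", host)
                .replace("[$PORT$]", String.valueOf(port));
        }
    }
}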

Implementing Content-Signal in Your Environment

We can add Content Signals by overriding the default template files. Here are the full step-by-step instructions for adding Content Signals to your environment.

1) Create replacement templates

You have two options when creating the replacement templates:

Option A (same path, no extra config)

Place files at: tomcat/webapps/ROOT/WEB-INF/classes/com/liferay/portal/dependencies/ with the same names as the defaults:

robots_txt_with_sitemap.tmpl
robots_txt_without_sitemap.tmpl

Liferay will automatically use these instead of the embedded ones.
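
For example, from the root of a standard Liferay Tomcat bundle, copying your customized templates into place might look like this (the bundle layout is assumed to match the paths above):

# Run from the Liferay bundle root.
mkdir -p tomcat/webapps/ROOT/WEB-INF/classes/com/liferay/portal/dependencies
cp robots_txt_with_sitemap.tmpl robots_txt_without_sitemap.tmpl \
  tomcat/webapps/ROOT/WEB-INF/classes/com/liferay/portal/dependencies/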

Option B (custom path/name with portal-ext.properties)

If you don’t want to overwrite the stock template files, you can place your custom files anywhere under tomcat/webapps/ROOT/WEB-INF/classes/.

Then point the portal to them in portal-ext.properties.

For example, if you put them here:

tomcat/webapps/ROOT/WEB-INF/classes/custom/robots/\
  with_sitemap.tmpl
tomcat/webapps/ROOT/WEB-INF/classes/custom/robots/\
  without_sitemap.tmpl

add the following to portal-ext.properties:

robots.txt.with.sitemap=custom/robots/with_sitemap.tmpl
robots.txt.without.sitemap=custom/robots/without_sitemap.tmpl

Note: These paths are relative to the WEB-INF/classes directory, since the lookup is done via the classpath.

Example file (without sitemap):

User-Agent: *
Content-Signal: ai-train=no, search=yes, ai-input=no
Disallow:

Example file (with sitemap):

User-Agent: *
Content-Signal: ai-train=no, search=yes, ai-input=no
Disallow:
Sitemap: [$PROTOCOL$]://[$HOST$]:[$PORT$]/sitemap.xml

This keeps the original behavior intact while injecting the new Content-Signal line.

2) Clusters and containers

Of course, these changes need to be applied across the whole cluster. That typically means one of the following:

  • Updating each node: ensure every node includes the same template overrides.

  • Docker: use the supported overlay mechanism. Place your files under a host folder like:

[host-folder]/files/tomcat/webapps/ROOT/WEB-INF/classes/\
  com/liferay/portal/dependencies/robots_txt_with_sitemap.tmpl
[host-folder]/files/tomcat/webapps/ROOT/WEB-INF/classes/\
  com/liferay/portal/dependencies/\
  robots_txt_without_sitemap.tmpl

Then mount it when starting the container:

docker run -v [host-folder]:/mnt/liferay ...

The container scans /mnt/liferay/files at startup and overlays your files into the runtime classpath.
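
If you manage containers with Docker Compose instead, the equivalent mount might look like the following (the service name, image tag, and host folder are illustrative):

services:
  liferay:
    image: liferay/portal:latest
    ports:
      - "8080:8080"
    volumes:
      # Overlaid into the runtime classpath at startup, as described above.
      - ./liferay-mount:/mnt/liferay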

3) Restart

Restart your nodes (or containers) for the new templates to take effect.
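
For a Tomcat bundle, that might look like this (assuming the standard scripts shipped with the bundle):

tomcat/bin/shutdown.sh && tomcat/bin/startup.sh

or, for a Docker deployment:

docker restart [container-name]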

4) Test locally

On a developer machine, test with:

curl http://localhost:8080/robots.txt

You should see the Content-Signal: line and, if you’re using the sitemap template, a fully substituted Sitemap: URL.
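
For a local instance using the with-sitemap template above, the response should resemble the following, with the placeholders substituted from the request:

User-Agent: *
Content-Signal: ai-train=no, search=yes, ai-input=no
Disallow:
Sitemap: http://localhost:8080/sitemap.xml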

In non-local environments, replace localhost:8080 with your actual site host.

Conclusion

Adding Content Signals to Liferay’s robots.txt is simple: override the default template files with updated versions that include your policy.

But keep in mind:

  • Compliance is voluntary. Reputable crawlers will respect the signals; malicious scrapers likely won't. Pair the signals with WAF and bot defenses for stronger protection.

  • Page-level overrides exist. If a page has its own Robots configuration in the SEO tab of its page settings, that configuration overrides your global templates, so admins need to add the same Content-Signal rules there if they want them enforced at the page level.

By adopting Content Signals now, you make your preferences clear in a standard, machine-readable way, positioning your site to benefit as more crawlers begin to respect these rules.
