Adding Content-Signal Support to Liferay

Extending Liferay’s robots.txt to Control AI Training, Search, and Input through Cloudflare’s Content Signals

David H Nebinger

Introduction

The Content Signals Policy is a new initiative from Cloudflare, published at contentsignals.org. It extends the familiar robots.txt file so site owners can declare how their content may be used:

  • search: allow content to be indexed and shown in search results.

  • ai-input: allow content to be used as real-time inputs to AI models (e.g., RAG/grounding).

  • ai-train: allow content to be used for training or fine-tuning AI models.

For example, this directive allows only search indexing:

Content-Signal: ai-train=no, search=yes, ai-input=no

These signals are machine-readable tags paired with human-readable policy text. They also explicitly reserve rights under EU Directive 2019/790.
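Because the directive is machine-readable, crawlers can parse it with very little code. As a rough illustration (a hypothetical parser, not part of Liferay or any official Content Signals tooling), turning the directive into a signal-to-boolean map might look like this:

```python
# Illustrative sketch: parse a Content-Signal line into a dict mapping
# each signal name ("search", "ai-input", "ai-train") to True/False.
# This is a hypothetical helper, not an official parser.

def parse_content_signal(line: str) -> dict[str, bool]:
    """Parse e.g. 'Content-Signal: ai-train=no, search=yes' into a dict."""
    _, _, value = line.partition(":")
    signals = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, setting = part.partition("=")
        # "yes" allows the use; anything else is treated as a refusal
        signals[name.strip().lower()] = setting.strip().lower() == "yes"
    return signals

print(parse_content_signal("Content-Signal: ai-train=no, search=yes, ai-input=no"))
# {'ai-train': False, 'search': True, 'ai-input': False}
```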

Important caveats:

  • Adoption is voluntary. Malicious crawlers may ignore the rules.

  • Cloudflare enforces these for its customers, but outside of Cloudflare you’re relying on good actors.

  • The hope is that as major AI companies standardize on respecting Content Signals, all sites will benefit simply by publishing them.

How Liferay Serves /robots.txt

In Liferay, requests to /robots.txt are handled by a RobotsServlet, which loads template files based on portal properties:

robots.txt.with.sitemap=\
  com/liferay/portal/dependencies/robots_txt_with_sitemap.tmpl
robots.txt.without.sitemap=\
  com/liferay/portal/dependencies/\
  robots_txt_without_sitemap.tmpl

By default, those embedded template files look like this:

Without sitemap:

User-Agent: *
Disallow:

With sitemap:

User-Agent: *
Disallow:
Sitemap: [$PROTOCOL$]://[$HOST$]:[$PORT$]/sitemap.xml

At runtime, placeholders like [$PROTOCOL$], [$HOST$], and [$PORT$] are automatically substituted.
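Liferay performs this substitution internally when serving the response. As a rough sketch of what that rendering amounts to (this is not Liferay's actual implementation, just an illustration of the placeholder mechanism), assuming a template string and concrete protocol/host/port values:

```python
# Illustration of [$...$] placeholder substitution as performed on the
# robots.txt templates at runtime. Not Liferay's actual code.

TEMPLATE = (
    "User-Agent: *\n"
    "Disallow:\n"
    "Sitemap: [$PROTOCOL$]://[$HOST$]:[$PORT$]/sitemap.xml\n"
)

def render_robots(template: str, protocol: str, host: str, port: int) -> str:
    values = {
        "[$PROTOCOL$]": protocol,
        "[$HOST$]": host,
        "[$PORT$]": str(port),
    }
    for placeholder, value in values.items():
        template = template.replace(placeholder, value)
    return template

print(render_robots(TEMPLATE, "https", "www.example.com", 443))
# User-Agent: *
# Disallow:
# Sitemap: https://www.example.com:443/sitemap.xml
```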

Implementing Content-Signal in Your Environment

We can add Content Signals by overriding the default template files. Here are step-by-step instructions for adding Content Signals to your environment.

1) Create replacement templates

You have two options when creating the replacement templates:

Option A (same path, no extra config)

Place files at: tomcat/webapps/ROOT/WEB-INF/classes/com/liferay/portal/dependencies/ with the same names as the defaults:

robots_txt_with_sitemap.tmpl
robots_txt_without_sitemap.tmpl

Liferay will automatically use these instead of the embedded ones.

Option B (custom path/name with portal-ext.properties)

If you don’t want to overwrite the stock template files, you can place your custom files anywhere under tomcat/webapps/ROOT/WEB-INF/classes/.

Then point the portal to them in portal-ext.properties.

For example, if you put them here:

tomcat/webapps/ROOT/WEB-INF/classes/custom/robots/\
  with_sitemap.tmpl
tomcat/webapps/ROOT/WEB-INF/classes/custom/robots/\
  without_sitemap.tmpl

add the following to portal-ext.properties:

robots.txt.with.sitemap=custom/robots/with_sitemap.tmpl
robots.txt.without.sitemap=custom/robots/without_sitemap.tmpl

Note: These paths are relative to the WEB-INF/classes directory, since the lookup is done via the classpath.

Example file (without sitemap):

User-Agent: *
Content-Signal: ai-train=no, search=yes, ai-input=no
Disallow:

Example file (with sitemap):

User-Agent: *
Content-Signal: ai-train=no, search=yes, ai-input=no
Disallow:
Sitemap: [$PROTOCOL$]://[$HOST$]:[$PORT$]/sitemap.xml

This keeps the original behavior intact while injecting the new Content-Signal line.

2) Clusters and containers

In a clustered environment, these changes must be applied to every node. This typically involves either:

  • Updating each node: ensure every node includes the same overrides.

  • Docker: use the supported overlay mechanism. Place your files under a host folder like:

[host-folder]/files/tomcat/webapps/ROOT/WEB-INF/classes/\
  com/liferay/portal/dependencies/robots_txt_with_sitemap.tmpl
[host-folder]/files/tomcat/webapps/ROOT/WEB-INF/classes/\
  com/liferay/portal/dependencies/\
  robots_txt_without_sitemap.tmpl

Then mount it when starting the container:

docker run -v [host-folder]:/mnt/liferay ...

The container scans /mnt/liferay/files at startup and overlays your files into the runtime classpath.

3) Restart

Restart your nodes (or containers) for the new templates to take effect.

4) Test locally

On a developer machine, test with:

curl http://localhost:8080/robots.txt

You should see the Content-Signal: line and, if you’re using the sitemap template, a fully substituted Sitemap: URL.

In non-local environments, replace localhost:8080 with your actual site host.
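If you want to automate this check (for example, in a post-deployment smoke test), a small script can verify the directive is present in the response body. The helper below is a hypothetical sketch; feed it the text returned by the curl command above or by an HTTP client of your choice:

```python
# Hypothetical smoke-test helper: verify a robots.txt body declares a
# Content-Signal directive. Pass it the fetched response text.

def has_content_signal(robots_txt: str) -> bool:
    """Return True if any non-comment line declares a Content-Signal."""
    for line in robots_txt.splitlines():
        line = line.strip()
        if not line.startswith("#") and line.lower().startswith("content-signal:"):
            return True
    return False

sample = """User-Agent: *
Content-Signal: ai-train=no, search=yes, ai-input=no
Disallow:
"""
assert has_content_signal(sample)
```

To check a live node, fetch the body first (e.g. `urllib.request.urlopen("http://localhost:8080/robots.txt").read().decode()`) and pass it to the helper.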

Conclusion

Adding Content Signals to Liferay’s robots.txt is simple: override the default template files with updated versions that include your policy.

But keep in mind:

  • Compliance is voluntary. Reputable crawlers will respect the signals; malicious scrapers likely won’t. Pair with WAF/bot defenses for stronger protection.

  • Page-level overrides exist. If a page has its own Robots configuration in the SEO tab of the page configuration, it overrides your global templates. Admins need to add the same Content-Signal rules there if they want them enforced at the page level.

By adopting Content Signals now, you make your preferences clear in a standard, machine-readable way, positioning your site to benefit as more crawlers begin to respect these rules.
