Blogs
Extending Liferay’s robots.txt to Control AI Training, Search, and Input through Cloudflare’s Content Signals

Introduction
The Content Signals Policy is a new initiative from Cloudflare, published at contentsignals.org. It extends the familiar robots.txt file so site owners can declare how their content may be used:
- search: allow content to be indexed and shown in search results.
- ai-input: allow content to be used as real-time inputs to AI models (e.g., RAG/grounding).
- ai-train: allow content to be used for training or fine-tuning AI models.
For example, this directive allows only search indexing:
Content-Signal: ai-train=no, search=yes, ai-input=no
These signals are machine-readable tags paired with human-readable policy text. They also explicitly reserve rights under EU Directive 2019/790.
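To make the machine-readable format concrete, here is a minimal sketch of how a consumer might read a Content-Signal line into a policy map. This is a hypothetical illustration, not an official parser:

```python
def parse_content_signal(line):
    """Parse a robots.txt Content-Signal line into {signal: allowed} pairs."""
    _, _, value = line.partition(":")
    signals = {}
    for pair in value.split(","):
        key, _, setting = pair.strip().partition("=")
        if key and setting:
            signals[key] = (setting.strip().lower() == "yes")
    return signals

policy = parse_content_signal("Content-Signal: ai-train=no, search=yes, ai-input=no")
print(policy)  # {'ai-train': False, 'search': True, 'ai-input': False}
```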
Important caveats:
- Adoption is voluntary. Malicious crawlers may ignore the rules.
- Cloudflare enforces these for its customers, but outside of Cloudflare you’re relying on good actors.
- The hope is that as major AI companies standardize on respecting Content Signals, all sites will benefit simply by publishing them.
How Liferay Serves /robots.txt
In Liferay, requests to /robots.txt are handled by a RobotsServlet, which loads template files based on portal properties:
robots.txt.with.sitemap=\
    com/liferay/portal/dependencies/robots_txt_with_sitemap.tmpl
robots.txt.without.sitemap=\
    com/liferay/portal/dependencies/robots_txt_without_sitemap.tmpl
By default, those embedded template files look like this:
Without sitemap:
User-Agent: *
Disallow:
With sitemap:
User-Agent: *
Disallow:
Sitemap: [$PROTOCOL$]://[$HOST$]:[$PORT$]/sitemap.xml
At runtime, placeholders like [$PROTOCOL$], [$HOST$], and [$PORT$] are automatically substituted.
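As a rough illustration of that substitution step, the sketch below does the equivalent placeholder replacement in Python. It is a simplified model, not Liferay’s actual RobotsServlet code:

```python
def render_robots_template(template, protocol, host, port):
    """Substitute Liferay-style [$...$] placeholders into a robots.txt template."""
    replacements = {
        "[$PROTOCOL$]": protocol,
        "[$HOST$]": host,
        "[$PORT$]": str(port),
    }
    for placeholder, value in replacements.items():
        template = template.replace(placeholder, value)
    return template

tmpl = "User-Agent: *\nDisallow:\nSitemap: [$PROTOCOL$]://[$HOST$]:[$PORT$]/sitemap.xml"
print(render_robots_template(tmpl, "https", "www.example.com", 443))
```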
Implementing Content-Signal in Your Environment
We can add Content Signals by overriding the default template files. The following steps walk through adding Content Signals to your environment.
1) Create replacement templates
You have two options when creating the replacement templates:
Option A (same path, no extra config)
Place files at:
tomcat/webapps/ROOT/WEB-INF/classes/com/liferay/portal/dependencies/
with the same names as the defaults:
robots_txt_with_sitemap.tmpl
robots_txt_without_sitemap.tmpl
Liferay will automatically use these instead of the embedded ones.
Option B (custom path/name with portal-ext.properties)
If you don’t want to overwrite the stock template files, you can place your custom files anywhere under tomcat/webapps/ROOT/WEB-INF/classes/. Then point the portal to them in portal-ext.properties.
For example, if you put them here:
tomcat/webapps/ROOT/WEB-INF/classes/custom/robots/with_sitemap.tmpl
tomcat/webapps/ROOT/WEB-INF/classes/custom/robots/without_sitemap.tmpl
add the following to portal-ext.properties:
robots.txt.with.sitemap=custom/robots/with_sitemap.tmpl
robots.txt.without.sitemap=custom/robots/without_sitemap.tmpl
Note that the property values are relative to the WEB-INF/classes directory, since the lookup is done via the classpath.
Example file (without sitemap):
User-Agent: *
Content-Signal: ai-train=no, search=yes, ai-input=no
Disallow:
Example file (with sitemap):
User-Agent: *
Content-Signal: ai-train=no, search=yes, ai-input=no
Disallow:
Sitemap: [$PROTOCOL$]://[$HOST$]:[$PORT$]/sitemap.xml
This keeps the original behavior intact while injecting the new Content-Signal line.
2) Clusters and containers
Of course, these changes need to be applied to the cluster as a whole. This typically involves either:
- Updating each node: ensure every node includes the same overrides.
- Docker: use the supported overlay mechanism. Place your files under a host folder like:
[host-folder]/files/tomcat/webapps/ROOT/WEB-INF/classes/com/liferay/portal/dependencies/robots_txt_with_sitemap.tmpl
[host-folder]/files/tomcat/webapps/ROOT/WEB-INF/classes/com/liferay/portal/dependencies/robots_txt_without_sitemap.tmpl
Then mount it when starting the container:
docker run -v [host-folder]:/mnt/liferay ...
The container scans /mnt/liferay/files at startup and overlays your files into the runtime classpath.
3) Restart
Restart your nodes (or containers) for the new templates to take effect.
4) Test locally
On a developer machine, test with:
curl http://localhost:8080/robots.txt
You should see the Content-Signal: line and, if you’re using the sitemap template, a fully substituted Sitemap: URL. In non-local environments, replace localhost:8080 with your actual site host.
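For a repeatable smoke test, the same check can be scripted. This sketch validates a robots.txt body for the Content-Signal line and for leftover placeholders; the commented-out fetch is illustrative and assumes the localhost:8080 developer setup:

```python
def check_robots(body):
    """Return a list of problems found in a robots.txt body (empty list = OK)."""
    problems = []
    if not any(line.startswith("Content-Signal:") for line in body.splitlines()):
        problems.append("missing Content-Signal line")
    if "[$" in body:
        problems.append("unsubstituted [$...$] placeholder remains")
    return problems

# Live check (adjust the host for non-local environments):
# from urllib.request import urlopen
# body = urlopen("http://localhost:8080/robots.txt").read().decode("utf-8")

sample = "User-Agent: *\nContent-Signal: ai-train=no, search=yes, ai-input=no\nDisallow:\n"
print(check_robots(sample))  # []
```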
Conclusion
Adding Content Signals to Liferay’s robots.txt is simple: override the default template files with updated versions that include your policy.
But keep in mind:
- Compliance is voluntary. Reputable crawlers will respect the signals; malicious scrapers likely won’t. Pair with WAF/bot defenses for stronger protection.
- Page-level overrides exist. If a page has its own Robots configuration in the SEO tab of the page configuration, it overrides your global templates. Admins need to add the same Content-Signal rules there if they want them enforced at the page level.
By adopting Content Signals now, you make your preferences clear in a standard, machine-readable way, positioning your site to benefit as more crawlers begin to respect these rules.