So if you really want to set off a firestorm with geeks like
me, step into the group and say "VI is better than
Emacs!" Be sure to turn and walk away quickly or you might find
yourself drawn into the tussle.
Another way to set off a firestorm is to mention ChatGPT. Some
like it, some hate it.
I'm not really going to cover whether it is good or bad (that,
of course, is left to the reader).
However, if you feel strongly that OpenAI should not be
training its AI on your content authors' hard work, well, I can
help you prevent that.
So OpenAI is now using a web crawler with a specific name,
GPTBot, to crawl your site and train its AI. You can read all
about it here: https://platform.openai.com/docs/gptbot
For our purposes, though, we're more concerned about
disallowing GPTBot from scanning our site, and that is
defined here: https://platform.openai.com/docs/gptbot/disallowing-gptbot
So Liferay actually gives you full control over the robots.txt
file for each site you host. This is important: you have to handle
each site individually, so you can't expect this change to apply
automatically to all of the sites you might be hosting.
Basically we start from the Site menu, go to
Design -> Pages, and then on your
Public Pages, click the gear icon to get to the
necessary dialog. In the dialog, choose the SEO tab
from the left side to find your Robots.txt configuration.
Now, the field itself is not very smart or sophisticated.
Basically, anything that you enter here is returned verbatim
whenever someone navigates to your site's robots.txt URL.
For reference, let's check the default that is set up for you
when you create the site:
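It looks something like the following (a sketch based on the description below; the sitemap host shown here is a placeholder, and the exact output will reflect your own site):

```
User-Agent: *
Disallow:

Sitemap: https://your-site.example.com/sitemap.xml
```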
Now, just in case you're not familiar with the robots.txt
syntax, this basically matches any user agent and disallows no
pages. It also identifies where to find the
sitemap.xml file, which crawlers will typically use
next to identify which pages to visit.
Now, let's say we wanted to modify this, per OpenAI's
documentation, to disallow GPTBot while still allowing other
crawlers (such as Google's and Bing's) to access and index our
site.
We'd basically just update the robots.txt as indicated by
OpenAI, so we'd end up with:
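Per OpenAI's documented snippet (`User-agent: GPTBot` / `Disallow: /`), combined with a catch-all stanza for everyone else, the file might look something like this (the sitemap host is a placeholder):

```
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow:

Sitemap: https://your-site.example.com/sitemap.xml
```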
This simple change will prevent GPTBot from crawling our site
and training its AI on our content. The second User-agent stanza
allows any other bot to access anything our site makes available.
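If you want to sanity-check your rules before publishing them, Python's standard urllib.robotparser module can evaluate a robots.txt against different user agents. This is just a local sketch for verification, not part of Liferay:

```python
from urllib.robotparser import RobotFileParser

# The two stanzas described above: block GPTBot, allow everyone else.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow:
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# GPTBot is denied everything...
print(parser.can_fetch("GPTBot", "https://example.com/some-page"))    # False
# ...while other crawlers remain free to index the site.
print(parser.can_fetch("Googlebot", "https://example.com/some-page")) # True
```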
Well, this is where we're kind of stuck, yeah?
Unless each provider of generative AI LLMs also publishes the
bot name it uses for its own training crawler, we can't add
them to this list.
You may be better protected by changing your robots.txt file to
disallow all bots and then selectively allow the bots/crawlers you
trust (such as Google's and Bing's).
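An allow-list version might look something like this (a sketch assuming Googlebot and Bingbot are the crawlers you want to let in; the catch-all stanza shuts out everyone else):

```
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

User-agent: *
Disallow: /
```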
Even here, though, you face two problems.
First is coming up with the list to allow; would you really
want to keep people using Duck Duck Go from reaching your site
just because you don't have its bot listed? Or <enter search site here>
because you didn't list their bot?
Second, and this one's important, your robots.txt file could be
completely ignored. Do you believe for a second that, if OpenAI's bot
could no longer hit any website because everyone disallowed GPTBot,
they'd still honor this? Personally, I think they'd come up with
a second bot, not named GPTBot, and maybe not tell anyone what it
is, so they could resume crawling and training their AI. I don't know,
maybe they would still respect it, but do you believe that every
crawler operator would?
So now we know how to prevent OpenAI and GPTBot from
training their AIs on our content and protect the IP that we are hosting.
Hopefully you'll find this blog useful if this is something you
need to do for your own sites.
Note that robots.txt supports a lot of variations that you can
leverage in your site for more than just disallowing GPTBot. A
great reference on the robots.txt file can be found here:
https://developers.google.com/search/docs/crawling-indexing/robots/create-robots-txt.