Disabling ChatGPT, Bard From Training off Your Site

OpenAI now uses an official bot name when crawling content to train its AI, and Google's Bard can be blocked too. Read on to learn how to configure your site to block this training.

Introduction

So if you really want to set off a firestorm with geeks like me, step into the group and say "vi is better than Emacs!" Be sure to turn and walk away quickly, or you might find yourself included in the tussle.

Another way to set off a firestorm is to mention ChatGPT. Some like it, some hate it.

I'm not really going to cover whether it is good or bad (that, of course, is left to the reader).

However, if you feel strongly that OpenAI should not be training its AI off of your content authors' hard work, well, I can help you prevent that.

About GPTBot

So OpenAI now uses a web crawler with a specific name, GPTBot, to crawl your site and gather training data for its AI. You can read all about it here: https://platform.openai.com/docs/gptbot

For our purposes, though, we're more concerned with disallowing GPTBot from scanning our site, and that is documented here: https://platform.openai.com/docs/gptbot/disallowing-gptbot

Updating Your Robots.txt to Disallow GPTBot

So Liferay actually gives you full control over the robots.txt file for each site you host. This is important: you have to handle each site individually, so you can't expect this change to apply to all of the sites you might be hosting.

Basically we start from the Site menu, go to Design -> Pages, and then on your Public Pages, click the gear icon to open the configuration dialog. In the dialog, choose the SEO tab on the left side to find your Robots.txt configuration.

Now, the field itself is not very smart or sophisticated. Anything that you enter here is returned verbatim whenever someone navigates to www.yoursite.com/robots.txt.

For reference, let's check the default that is set up for you when you create the site:

User-Agent: *
Disallow:
Sitemap: [$PROTOCOL$]://[$HOST$]:[$PORT$]/sitemap.xml

Now, just in case you're not familiar with robots.txt syntax, this matches any user agent and disallows nothing. It also identifies where to find the sitemap.xml file, which crawlers typically use next to decide which pages to visit.
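
The [$PROTOCOL$], [$HOST$], and [$PORT$] tokens are Liferay placeholders that get filled in when the file is served. Assuming, just as an example, that your site is served at https://www.example.com on port 443, the rendered output would look something like:

User-Agent: *
Disallow:
Sitemap: https://www.example.com:443/sitemap.xml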

Now, let's say we want to modify this, per OpenAI's documentation, to disallow GPTBot while still allowing other crawlers (such as Google's and Bing's) to access and index our site.

We'd basically just update the robots.txt as indicated by OpenAI, so we'd end up with:

User-Agent: GPTBot
Disallow: /

User-Agent: *
Disallow:
Sitemap: [$PROTOCOL$]://[$HOST$]:[$PORT$]/sitemap.xml

This simple change will prevent GPTBot from crawling our site and training its AI on our content. The second User-Agent stanza still allows any other bot to access everything our site makes available.
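
If you want to sanity-check the rules, here's a minimal sketch using Python's standard urllib.robotparser module. The example.com URLs are just stand-ins for wherever your own site serves its robots.txt:

from urllib.robotparser import RobotFileParser

# Stand-in URL; point this at your own site's robots.txt.
parser = RobotFileParser("https://www.example.com/robots.txt")
parser.read()  # fetches and parses the file

# GPTBot should be blocked from everything...
print(parser.can_fetch("GPTBot", "https://www.example.com/web/guest/home"))        # False

# ...while any other crawler is still allowed.
print(parser.can_fetch("SomeOtherBot", "https://www.example.com/web/guest/home"))  # True

Of course, this only tells you what a well-behaved parser would conclude from your rules; it's still up to the crawler itself to honor them.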

Update 10/02/2023 - Blocking Google Bard / Vertex AI

So today Google announced its own robots.txt support for keeping Bard and Vertex AI from using your site's content: https://blog.google/technology/ai/an-update-on-web-publisher-controls/

The page points you to https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers where, if you scroll down, you'll find Google-Extended in the table along with the name of the token to block.

So now we can update our robots.txt file to be:

User-Agent: GPTBot
Disallow: /

User-Agent: Google-Extended
Disallow: /

User-Agent: *
Disallow:
Sitemap: [$PROTOCOL$]://[$HOST$]:[$PORT$]/sitemap.xml

Now, an important distinction must be drawn between OpenAI's approach and Google-Extended.

GPTBot is a separate crawler; it crawls on its own, and the promise is that when it finds the robots.txt file, it will see the block and respect it.

Google-Extended, though, is not a separate crawler. Google's other crawlers will continue to access your site, but the promise is that Bard and Vertex AI will not use what those crawlers collect if the robots.txt file blocks Google-Extended.

What About Other AIs?

Well, this is where we're kind of stuck, yeah?

Unless each provider of generative AI LLMs also publishes the bot name it uses for its own training crawler, we can't add it to this list.

You may be better protected by changing your robots.txt file to disallow all bots and then selectively allowing the bots/crawlers you trust (such as Google's and Bing's), something like the example below.
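
Here's a sketch of what that allowlist approach might look like. Googlebot and Bingbot are the documented crawler tokens for Google Search and Bing; check each engine's documentation for the exact names of any other crawlers you want to allow:

User-Agent: Googlebot
Disallow:

User-Agent: Bingbot
Disallow:

User-Agent: *
Disallow: /
Sitemap: [$PROTOCOL$]://[$HOST$]:[$PORT$]/sitemap.xml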

Even here, though, you face two problems.

First is coming up with the list to allow; would you really want to stop people using DuckDuckGo from getting to your site just because you don't have its bot listed? Or <enter search site here> because you didn't list their bot?

Second, and this one's important, your robots.txt file could be completely ignored. Do you believe for a second that, if OpenAI's bot could no longer hit any website because everyone disallowed GPTBot, they'd still honor it? Personally, I think they'd come up with a second bot, not named GPTBot, and maybe not tell everyone what it is, so they could resume crawling and training their AI. I don't know, maybe they would still respect it, but do you believe that everyone else would?

Conclusion

So now we know how to prevent OpenAI's GPTBot and Google's Bard and Vertex AI from training on our content, and how to protect the IP that we are hosting.

Hopefully you'll find this blog useful, if this is something you care about.

Note that robots.txt supports a lot more that you can leverage for your site beyond just disallowing GPTBot. A great reference on the robots.txt file can be found here: https://developers.google.com/search/docs/crawling-indexing/robots/create-robots-txt.