Blogs

Disabling ChatGPT From Training off Your Site

OpenAI has published an official bot name for the crawler it uses to train its AI. Read on to learn how to configure your site to block this training.

Introduction

So if you really want to set off a firestorm with geeks like myself, step into a group of them and say "vi is better than Emacs!" Be sure to turn and walk away quickly, or you might find yourself included in the tussle.

Another way to set off a firestorm is to mention ChatGPT. Some like it, some hate it.

I'm not really going to cover whether it is good or bad (that, of course, is left to the reader).

However, if you feel strongly that OpenAI should not be training its AI off of your content authors' hard work, well, I can help you do that.

About GPTBot

So OpenAI now crawls sites with a specifically named bot, GPTBot, to gather training data for its AI. You can read all about it here: https://platform.openai.com/docs/gptbot

For our purposes, though, we're more concerned with disallowing GPTBot from scanning our site, and that is documented here: https://platform.openai.com/docs/gptbot/disallowing-gptbot

Updating Your Robots.txt to Disallow GPTBot

So Liferay actually gives you full control over the robots.txt file for each site you host. This is important: you have to handle each site individually, so you can't expect this change to automatically apply to all of the sites you might be hosting.

Basically, we start from the Site menu, go to Design -> Pages, and then, on your Public Pages, click the gear icon to get to the necessary dialog. In the dialog, choose the SEO tab on the left side to find your Robots.txt configuration.

Now, the field itself is not very smart or sophisticated. Basically, anything that you enter here is going to be returned verbatim whenever someone navigates to www.yoursite.com/robots.txt.
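
If you want to double-check what your site is actually serving, a quick fetch will do. Here's a minimal sketch using Python's standard library (www.yoursite.com is a placeholder; substitute your own host):

import urllib.request

# Fetch and print the robots.txt the site is currently serving.
# www.yoursite.com is a placeholder; substitute your real host.
with urllib.request.urlopen("https://www.yoursite.com/robots.txt") as response:
    print(response.read().decode("utf-8"))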

For reference, let's check the default that is set up for you when you create the site:

User-Agent: *
Disallow:
Sitemap: [$PROTOCOL$]://[$HOST$]:[$PORT$]/sitemap.xml

Now, just in case you're not familiar with the robots.txt syntax, this will basically match any user agent, and no pages are disallowed. It also identifies how to access the sitemap.xml file, which crawlers will typically use next to identify which pages to visit.

Now, let's say we want to modify this, per OpenAI's documentation, so that GPTBot is disallowed but other crawlers (such as Google's and Bing's) can still access and index our site.

We'd basically just update the robots.txt as indicated by OpenAI, so we'd end up with:

User-Agent: GPTBot
Disallow: /

User-Agent: *
Disallow:
Sitemap: [$PROTOCOL$]://[$HOST$]:[$PORT$]/sitemap.xml

This simple change will prevent GPTBot from crawling our site and training its AI on our content. The second User-Agent stanza still allows any other bot to access anything our site makes available.
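
If you'd like to sanity-check the rules before publishing them, Python's standard library includes a robots.txt parser we can feed them to. A minimal sketch (the host and path in the URLs are placeholders):

from urllib import robotparser

# Parse our updated rules (the Sitemap line is omitted here since it
# doesn't affect allow/disallow decisions).
rules = """\
User-Agent: GPTBot
Disallow: /

User-Agent: *
Disallow:
"""

parser = robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# GPTBot is blocked from everything...
print(parser.can_fetch("GPTBot", "https://www.yoursite.com/web/guest/home"))     # False
# ...while any other bot is still allowed.
print(parser.can_fetch("Googlebot", "https://www.yoursite.com/web/guest/home"))  # True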

What About Other AIs?

Well, this is where we're kind of stuck, yeah?

Unless each provider of generative AI LLMs also publishes the bot name its training crawlers will use, we can't add them to this list.

You may be better protected by changing your robots.txt file to disallow all bots and then selectively allow the bots/crawlers you want (such as Google's and Bing's), as shown below.
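
For example, that inverted robots.txt might look something like this (a sketch only; Googlebot and Bingbot are the crawler names Google and Microsoft publish, and the allow list is entirely up to you):

User-Agent: Googlebot
Disallow:

User-Agent: Bingbot
Disallow:

User-Agent: *
Disallow: /

Sitemap: [$PROTOCOL$]://[$HOST$]:[$PORT$]/sitemap.xml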

Even here, though, you face two problems.

First is coming up with the list of bots to allow; would you really want people using DuckDuckGo to be unable to find your site because you don't have its bot listed? Or <enter search site here> because you didn't list their bot?

Second, and this one's important, your robots.txt file could be completely ignored. Do you believe for a second that, if OpenAI's bot could no longer hit any website because everyone disallowed GPTBot, they'd still honor this? Personally, I think they'd come up with a second bot, not named GPTBot, and maybe not tell everyone what it is, so they could resume crawling and training their AI. I don't know; maybe they would still respect it, but do you believe that everyone else would?

Conclusion

So now we know how to prevent OpenAI's GPTBot from training its AI on our content and protect the IP that we are hosting.

Hopefully you'll find this blog useful, if this is something you care about.

Note that robots.txt supports a lot of other directives that you can leverage in your site for more than just disallowing GPTBot. A great reference on the robots.txt file can be found here: https://developers.google.com/search/docs/crawling-indexing/robots/create-robots-txt