A.I. And Robots.txt

CCarter · Nov 13, 2024

If you've seen your content being used to train A.I. and aren't getting paid, you'll soon be able to block them.

They are working on updating the Robots.txt file so you can block A.I. bots from training/scraping your data: New Internet Rules Will Block AI Training Bots

The draft proposal for blocking AI training bots suggests three ways to block the bots:

- Robots.txt Protocols
- Meta Robots HTML Elements
- Application Layer Response Header

..

The following are the proposed meta robots directives:

<meta name="robots" content="DisallowAITraining">
<meta name="examplebot" content="AllowAITraining">
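
To make the two tags above concrete, here is a rough Python sketch of how a crawler could read them. The directive names come from the proposal quoted above. The precedence rule (a bot-specific tag beats the generic "robots" tag, and training is allowed unless DisallowAITraining is present) is my own assumption for illustration, not something confirmed by the draft, and `may_train` is a made-up helper name.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collects the content value of every <meta name=... content=...> tag."""

    def __init__(self):
        super().__init__()
        self.directives = {}  # lowercased meta name -> content value

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        content = attr_map.get("content")
        if name and content:
            self.directives[name.lower()] = content


def may_train(html, bot_name):
    """Hypothetical policy (an assumption, not taken from the draft):
    a bot-specific tag overrides the generic "robots" tag, and training
    is allowed unless DisallowAITraining is stated."""
    parser = MetaRobotsParser()
    parser.feed(html)
    generic = parser.directives.get("robots", "")
    value = parser.directives.get(bot_name.lower(), generic)
    return value.strip().lower() != "disallowaitraining"


page = """
<html><head>
<meta name="robots" content="DisallowAITraining">
<meta name="examplebot" content="AllowAITraining">
</head><body></body></html>
"""

print(may_train(page, "examplebot"))  # the bot-specific tag applies
print(may_train(page, "otherbot"))    # falls back to the generic robots tag
```

With the sample page, examplebot is allowed to train and any other bot falls under the blanket DisallowAITraining.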

Here is the official proposal: Robots Exclusion Protocol Extension to manage AI content use

What does that mean for you? Very little cause most of you are already dead, I'm talking to a graveyard of SEOs here... You should have listened to us.

eliquid · Nov 13, 2024

Not sure if this impacts what I'm about to say...

But I imagine more and more, people are just gonna use Ai instead of search.

So if I ask ChatGPT or XYZAi-Bot12 in the future:

"what's the best way to cook salmon with mango sticky rice" and the Ai generates an answer and LINKS to the website that provided the answer, is THIS gonna prevent or block that result and answer where it might have linked to me for them to click through?

Many things to think about.

ryandiscord · Nov 14, 2024

eliquid said:
Not sure if this impacts what I'm about to say...

But I imagine more and more, people are just gonna use Ai instead of search.

So if I ask ChatGPT or XYZAi-Bot12 in the future:

"what's the best way to cook salmon with mango sticky rice" and the Ai generates an answer and LINKS to the website that provided the answer, is THIS gonna prevent or block that result and answer where it might have linked to me for them to click through?

Many things to think about.

Maybe this is where the robots.txt protocol would come into play. Instead of using the meta tag that would block all AI platforms, Allow user agents that link to you, Disallow those who don't.
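
A quick sketch of that per-user-agent idea, using Python's standard `urllib.robotparser` to check who gets in. This uses the existing Allow/Disallow rules only, not the proposed AI training directives, which the standard library parser doesn't know about. `ChatGPT-User` is the token from the logs further down the thread; `ExampleTrainingBot` is a made-up name standing in for a bot that doesn't link back. Whether any given bot actually obeys the file is a separate question.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: allow the agent that links back, block the one that doesn't.
# "ExampleTrainingBot" is a hypothetical user agent used only for illustration.
ROBOTS_TXT = """\
User-agent: ChatGPT-User
Allow: /

User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path="/"):
    """Return True if the given user-agent token may fetch the path."""
    return parser.can_fetch(agent, path)


for agent in ("ChatGPT-User", "ExampleTrainingBot", "SomeOtherBot"):
    print(agent, allowed(agent))
```

Note that the parser matches on the product token (e.g. `ChatGPT-User`), so pass that rather than the full user-agent string from your logs.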

CCarter · Nov 14, 2024

eliquid said:
is THIS gonna prevent or block that result and answer where it might have linked to me for them to click through?

Just off my experience, when ChatGPT and the like are asked specific questions, they tend to go searching for the answer with their search engine partner, then visit the websites at some level and pull the data.

They don't have all the data readily available. That isn't training IMO.

This proposal is specifically about A.I. companies training their models on your data to make their stuff accurate.

When I asked ChatGPT to search for one of my brands the logs came back with this:

Code:

135.237.131.223 - - [14/Nov/2024:11:18:05 -0500] "GET /robots.txt HTTP/2.0" 301 846 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot"
135.237.131.210 - - [14/Nov/2024:11:18:05 -0500] "GET /robots.txt HTTP/2.0" 200 2175 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot"
135.237.131.223 - - [14/Nov/2024:11:18:05 -0500] "GET / HTTP/2.0" 301 370 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot"
135.237.131.210 - - [14/Nov/2024:11:18:05 -0500] "GET / HTTP/2.0" 200 9045 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot"

Their useragent is:

Code:

Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot
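
If you want to check your own logs for the same visits, here is a minimal Python sketch. It assumes your server writes the common "combined" access log layout shown above; the regex and the `AI_AGENT_TOKENS` list are my own illustration, so adjust both for your setup. The sample lines are copied from the log excerpt above.

```python
import re

# Regex for the combined access log layout seen in the excerpt above.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Substrings to look for in the user agent; extend with whatever you see in your logs.
AI_AGENT_TOKENS = ("ChatGPT-User",)


def parse_line(line):
    """Return a dict of log fields, or None if the line doesn't match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def ai_hits(lines):
    """Return (ip, path, status) for every request made by a listed agent."""
    hits = []
    for line in lines:
        entry = parse_line(line)
        if entry and any(token in entry["agent"] for token in AI_AGENT_TOKENS):
            hits.append((entry["ip"], entry["path"], int(entry["status"])))
    return hits


sample = [
    '135.237.131.223 - - [14/Nov/2024:11:18:05 -0500] "GET /robots.txt HTTP/2.0" 301 846 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot"',
    '135.237.131.210 - - [14/Nov/2024:11:18:05 -0500] "GET /robots.txt HTTP/2.0" 200 2175 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot"',
]

for hit in ai_hits(sample):
    print(hit)
```

Matching on a substring of the user agent is the simplest approach, but user agents can be spoofed, so treat the results as a rough count.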

So I guess we'll have to see when this arrives, but I doubt it will block ChatGPT and others from visiting the site for NEW information. Now for older information - who knows.

Nonbeardedman · Nov 14, 2024

It's worth considering that this may have a larger impact (from the AI bot's perspective) on certain websites/web masters than others.

Firstly, older websites may have been abandoned by their owners, which means they may not be around to update their robots.txt files. For this reason, ChatGPT, and other such AI bots, may still be able to train their models with websites, but just older ones instead of new ones.

Secondly, if this process has to be done manually, or has webmasters jumping through hoops, I see this proposal meeting the same fate as certain websites' cookie 'policy'. Many websites let viewers decline cookies, but it's a whole process where you go and 'manage' your cookies, which redirects you to another page, and then that page guides you to another, etc. If this is how webmasters will be expected to restrict bots crawling their sites, I doubt many people would still opt in (or opt out, in this case), because the effect of AI crawling their site isn't 'physical'. Many people can't see AI scraping and copying their content in real time, which is why they'd be willing to ignore it - unlike when you come across a competitor stealing your content - you're instantly able to detect that mf and take 'em down.

Finally, I believe this isn't going to have much of an impact on AI training data because blocking AI bots isn't the 'default' option. It's more like AI WILL scrape your data for training purposes UNLESS you disallow it. This usually doesn't satisfy webmasters. An example of this is DeviantArt, which lost over a million users after it announced its AI art generating tool - which was trained with artists' data on the site.

This is an incredibly bad idea that seems almost hand-crafted to alienate and enrage the DeviantArt userbase. Even if muddying the artistic waters by embracing AI was a good idea (it's not), requiring an opt-out instead of an opt-in was a completely unforced error

CreativeBloq

Despite the opt-out option being offered, many users, including renowned artists who worked on blockbuster animated films, left the platform.

In the case of this thread however, many webmasters don't really have another option. What are they going to do? Leave the internet?

So either many people are going to learn how to do this, some developer is going to make an open source extension which people can click on that'll instantly opt them out, or some higher authority is going to pressure major platforms to include a cute and simple pop-up stating: "Can we Pwease Use Your SUPER DUPER Amazing Content to Train Our Sad Wittle AI Bot ~ " before signing up for any platform or CMS - since many people use that - which users can simply accept or decline.

Or maybe it won't idk. I don't really have an idea of the technical side of how all this works.

secretagentdad · Nov 20, 2024

So, an interesting thing happened a couple of times in the last few weeks.
Keyword Sheeter had days where ChatGPT referrals were higher than Bing referrals.

Will they still send traffic if you tell them to go somewhere else?
