The 2024 changes to the data dump were made because:
Simultaneously, we know that companies have scraped or otherwise ingested Stack Overflow and Stack Exchange data to train models without proper attribution — models that they are monetizing or using for commercial purposes. We know this is happening because the companies themselves, or independent researchers, have disclosed this information.
This is also the reason for the terrible, horrible, no good, very bad Cloudflare walls:
We will heavily rate-limit access to the platform from Microsoft’s IP addresses. This includes Microsoft Azure and, therefore, will impact any community member running a service that accesses the site via Microsoft Azure. However, traffic will generally be exempt from this restriction if it uses the API - see the list of exemptions below. We are planning to implement analogous restrictions for other cloud platforms, e.g. AWS and GCP, at a later date.
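For illustration, here is a minimal Python sketch of what that kind of cloud-range throttling looks like server-side. It assumes you already have Microsoft's published Azure CIDR list (the two ranges below are placeholders, not the real list) and, purely for the sketch, that API traffic is identifiable by a hypothetical `/api/` path prefix:

```python
import ipaddress

# Placeholder CIDR blocks standing in for Microsoft's published
# "Azure IP Ranges and Service Tags" download; the real list is
# large and updated regularly.
AZURE_RANGES = [ipaddress.ip_network(cidr) for cidr in (
    "20.0.0.0/11",
    "40.74.0.0/15",
)]

def is_azure_ip(addr: str) -> bool:
    """True if the client address falls inside a known Azure range."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in AZURE_RANGES)

def should_rate_limit(addr: str, path: str) -> bool:
    # Per the announcement, API traffic is generally exempt;
    # other requests from cloud IPs get throttled.
    # (The /api/ prefix is a stand-in; Stack Exchange's real API
    # lives on a separate host.)
    if path.startswith("/api/"):
        return False
    return is_azure_ip(addr)
```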
OpenAI has released a browser, ChatGPT Atlas, designed to function as a distributed scraper:
By default, we don't use the content you browse to train our models. If you choose to opt this content in, you can enable "include web browsing" in your data controls settings. Note that even if you opt in to training, webpages that opt out of GPTBot will not be trained on. If you've enabled training for chats in your ChatGPT account, training will also be enabled for chats in Atlas. This includes website content you've attached when using the Ask ChatGPT sidebar and browser memories that inform your chats.
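For reference, the GPTBot opt-out mentioned there is the standard robots.txt mechanism: a site that serves the following is telling OpenAI's training crawler to stay away entirely.

```
# robots.txt: disallow OpenAI's training crawler site-wide
User-agent: GPTBot
Disallow: /
```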
This browser isn't trivial to programmatically distinguish from Google Chrome, per Simon Willison's research:
The Atlas user-agent is `Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/141.0.0.0 Safari/537.36`, identical to the user-agent I get for the latest Google Chrome on macOS.
Spoofing another browser's user agent is not a good-faith action unless you're trying to reduce fingerprinting. Judging by OpenAI's description, Agent Mode is rather conspicuous and leaks information about "your browsing context", so Atlas is not a good-faith attempt to reduce fingerprinting. This, plus OpenAI's track record, makes me wary of its claim that "webpages that opt out of GPTBot will not be trained on". We've sacrificed a lot of usability to keep the AI scrapers at bay, and that's all for nothing if they can just use residential IPs.
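To make the detection problem concrete, here is a minimal Python sketch (the bot tokens are illustrative, not an exhaustive list) of why a user-agent blocklist cannot catch Atlas:

```python
# Per Simon Willison's observation, Atlas and desktop Chrome send the
# exact same user-agent string, so they are indistinguishable here.
CHROME_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/141.0.0.0 Safari/537.36"
)
ATLAS_UA = CHROME_UA

def is_declared_ai_client(user_agent: str) -> bool:
    """Naive UA blocklist: only catches crawlers that identify themselves."""
    return any(token in user_agent for token in ("GPTBot", "OAI-SearchBot"))

assert not is_declared_ai_client(ATLAS_UA)  # Atlas slips straight through
```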
Should Stack Exchange block OpenAI's Atlas browser? If so, why? If not, where do we draw the line? How should we respond to similar systems?