26

The 2024 changes to the data dump were made because:

Simultaneously, we know that companies have scraped or otherwise ingested Stack Overflow and Stack Exchange data to train models without proper attribution — models that they are monetizing or using for commercial purposes. We know this is happening because the companies themselves, or independent researchers, have disclosed this information.

This is also the reason for the terrible, horrible, no good, very bad Cloudflare walls:

We will heavily rate-limit access to the platform from Microsoft’s IP addresses. This includes Microsoft Azure and, therefore, will impact any community member running a service that accesses the site via Microsoft Azure. However, traffic will generally be exempt from this restriction if it uses the API - see the list of exemptions below. We are planning to implement analogous restrictions for other cloud platforms, e.g. AWS and GCP, at a later date.
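
For illustration, the mechanism described there amounts to keying a rate limiter on published cloud IP ranges rather than on user agents, with API traffic exempted. A minimal sketch of that idea in Python follows; the ranges, limits, and exemption check are hypothetical stand-ins, not SE's actual configuration.

    import time
    from collections import defaultdict, deque
    from ipaddress import ip_address, ip_network

    # Hypothetical published ranges for one cloud provider (illustrative only).
    CLOUD_RANGES = [ip_network("20.0.0.0/11"), ip_network("40.64.0.0/10")]
    LIMIT = 30      # requests allowed per window for cloud-origin traffic
    WINDOW = 60.0   # window length in seconds

    _hits = defaultdict(deque)  # source IP -> recent request timestamps

    def allow(src_ip: str, is_api_request: bool) -> bool:
        """Sliding-window limit applied only to cloud-origin, non-API traffic."""
        if is_api_request:
            return True  # API traffic is exempt, per the announcement
        addr = ip_address(src_ip)
        if not any(addr in net for net in CLOUD_RANGES):
            return True  # not cloud-origin: no special limit
        q = _hits[src_ip]
        now = time.monotonic()
        while q and now - q[0] > WINDOW:  # drop timestamps outside the window
            q.popleft()
        if len(q) >= LIMIT:
            return False  # over the per-window budget
        q.append(now)
        return True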

OpenAI has released a browser designed to function as a distributed scraper:

By default, we don’t use the content you browse to train our models. If you choose to opt-in this content, you can enable “include web browsing” in your data controls settings. Note, even if you opt into training, webpages that opt out of GPTBot, will not be trained on. If you've enabled training for chats in your ChatGPT account, training will also be enabled for chats in Atlas. This includes website content you've attached when using the Ask ChatGPT sidebar and browser memories that inform your chats.
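
The "opt out of GPTBot" part refers to the standard robots.txt mechanism. As a sketch, here is how you could check whether a given URL is opted out, using Python's stdlib parser; example.com is a placeholder, and of course this only reports what a site requests, not what any crawler actually honours.

    from urllib.robotparser import RobotFileParser

    # A robots.txt opting out of OpenAI training would contain:
    #   User-agent: GPTBot
    #   Disallow: /
    rp = RobotFileParser("https://example.com/robots.txt")  # placeholder site
    rp.read()  # fetch and parse the live robots.txt
    print(rp.can_fetch("GPTBot", "https://example.com/questions/1"))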

This browser isn't trivial to programmatically distinguish from Google Chrome, per Simon Willison's research:

The Atlas user-agent is Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/141.0.0.0 Safari/537.36 - identical to the user-agent I get for the latest Google Chrome on macOS.

Spoofing another browser's user agent is not a good-faith action, unless you're trying to reduce fingerprinting. Judging by OpenAI's description, Agent Mode is rather conspicuous, and leaks information about "your browsing context" – so Atlas is not attempting in good faith to reduce fingerprinting. This, plus OpenAI's track record, makes me wary about OpenAI's "webpages that opt out of GPTBot, will not be trained on" claim. We've sacrificed a lot of usability to keep the AI scrapers at bay, and that's all for nothing if they can just use residential IPs.
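
To make the detection problem concrete: no server-side string test can separate Atlas from Chrome, because the reported user agents are byte-for-byte identical. A minimal sketch, with hypothetical blocklists:

    # Per Simon Willison's observation, these two user agents are identical,
    # so no string test can tell them apart.
    CHROME_UA = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                 "AppleWebKit/537.36 (KHTML, like Gecko) "
                 "Chrome/141.0.0.0 Safari/537.36")
    ATLAS_UA = CHROME_UA  # Atlas reports the same string

    def is_blocked(user_agent: str, blocklist: list[str]) -> bool:
        """Naive UA blocking: True if any blocklist token appears."""
        return any(token in user_agent for token in blocklist)

    # A blocklist targeting Atlas by name misses it entirely...
    assert not is_blocked(ATLAS_UA, ["Atlas", "GPTBot"])
    # ...and any rule keyed on "Chrome" blocks every real Chrome user too.
    assert is_blocked(ATLAS_UA, ["Chrome"]) and is_blocked(CHROME_UA, ["Chrome"])

Any rule strict enough to catch Atlas this way catches every macOS Chrome user along with it.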

Should Stack Exchange block OpenAI's Atlas browser? If so, why? If not, where do we draw the line? How should we respond to similar systems?

20
  • 7
    I think Stack Exchange Inc already shares all its data with OpenAI. Or am I mistaken? I read: Stack Overflow and OpenAI Partner to Strengthen the World’s Most Popular Large Language Models. Commented Oct 24 at 19:30
  • 6
    @KIKOSoftware That deal comes with attribution obligations (which I don't think OpenAI are meeting, but that's a different topic). OpenAI doesn't behave like it has an obligation to provide due credit for most of what the GPT models plagiarise. Commented Oct 24 at 19:33
  • 3
    @KIKOSoftware Right: it's still worth coming up with a policy, for when one of the people SE isn't "partnering" with decides they want to make a scraper that runs on customer machines. Commented Oct 24 at 19:39
  • 1
    Still pondering a fuller answer but it does feel against the AUP Commented Oct 25 at 0:13
  • 3
    @KIKOSoftware "content needs to be protected to keep these communities alive" Users on SE agreed to license their contributions under CC BY-SA: that's the protection. Commented Oct 25 at 0:14
  • 4
    @FranckDernoncourt That's the protection from Stack Exchange malfeasance. It's not protection from malfeasance of other parties. We've long had anti-SCRAPER policies: in fact, our recent laissez faire attitude is a substantial relaxation from the original "Stack Content Republishers Attributing Poorly and/or Excelling at Ranking" policy. Commented Oct 25 at 0:22
  • 3
    @wizzwizz4 The protection from Stack Exchange malfeasance is to ensure that the knowledge SE users create doesn't get legally locked into one greedy firm (like Reddit, Yahoo Answers, Expert Exchange or Quora). The spirit is allowing that knowledge to be shared with anyone. Blocking Atlas goes against that spirit. It baffles me why people contributing under CC BY-SA (and granting SE the right to do anything they want) want to prevent their work from being shared. You are using the wrong site. Commented Oct 25 at 0:42
  • 18
    @FranckDernoncourt The machine that plagiarises my contributions, using my grammar ability to lie, spam, and drive people insane, is not "my work being shared". Given your conflict of interest, your apparent inability to look past it, and how you haven't been disclosing it, I don't think you should participate in these discussions. Commented Oct 25 at 1:21
  • 2
    @wizzwizz4 How do you ensure humans don't misuse the knowledge you share? Commented Oct 25 at 1:31
  • 14
    @FranckDernoncourt Because our 'work' is getting processed, potentially de-attributed (against the CC BY-SA licence) and locked into a black box for the profit of organisations who don't really care about the commons, or the common folks. I'm fine with my work being shared, and quite a lot of it is a primary source. I'd just like folks to know where it's from and why, and not have it commingled with sources of differing authenticity. Commented Oct 25 at 1:32
  • 8
    @FranckDernoncourt Most humans don't misuse the knowledge I share. Most of those who do misuse it get dealt with by the legal system. But you can't put ChatGPT in prison, nor give it a restraining order preventing it from interacting with vulnerable people: the law doesn't really bind corporations like OpenAI, so those tools aren't available to us. Commented Oct 25 at 1:37
  • 8
    @FranckDernoncourt Furthermore, most humans can't weaponise most of the knowledge I provide them, in that anything they could get from it, they could more easily figure out themselves. ChatGPT can't think for itself: it requires our words as fuel. And it can turn discussions of pure mathematics into an interactive necronomicon. Hang out in the Stack Exchange Lobby for a month, and you'll see how much it hurts people who don't deserve to be hurt. If nothing else, we have a responsibility to put a stop to this. Commented Oct 25 at 1:39
  • 3
    @wizzwizz4 "you can't put ChatGPT in prison, nor give it a restraining order preventing it from interacting with vulnerable people" you can very much sue OpenAI eg I was just watching youtu.be/5YW0_QiXzCc (parents of a teen who killed himself following ChatGPT advice are suing OpenAI). But most ChatGPT output don't weaponize your shared knowledge either. "ChatGPT can't think for itself: it requires our words as fuel" try raising a kid without communicating with them and see how it goes. "you'll see how much it hurts people who don't deserve to be hurt." should be happy to help build AI Commented Oct 25 at 1:55
  • 7
    It seems comparable features also exist in Edge (copilot), Chrome (Gemini) and Comet. Commented Oct 25 at 12:58
  • 3
    Perplexity's shown itself repeatedly untrustworthy about scraping, and it's getting sued a lot. I wouldn't trust anything they're building without at least a rudimentary audit. Commented Oct 27 at 1:16

5 Answers

10

Well, whether SE could and whether it would are separate questions. The AUP, as of when I wrote this post, says:

Content Scraping You may not use any automated data-gathering means (including robots, spiders, scrapers, crawlers, and the like) to gather any text, files, audio or visual media, profile information, or any other content from any Network website for any use that violates the Public Network Terms of Service, including the content license, or this Acceptable Use Policy. Your usage of automated data-gathering means is exempt from this policy if either:

  1. Such automated data-gathering is necessary for accessibility-related reasons.

  2. You have obtained our express written prior consent. (You may contact [email protected] with any inquiries.)

Use of Atlas or other browsers that also scrape data is literally against the AUP.

I seem to recall an inauthentic usage policy in the past, which would be relevant for agentic usage. I can't seem to find it, though.

There are other issues as well - moderators and staff have access to information that's not meant for public consumption, and a tool that reads this and sends it back might be a problem too. Interactions with folks asking their browser to ask questions... things like that.

I guess the question is what can be done about it - user agents are notoriously unreliable, and most anti-AI measures basically rely on organisations that rarely show trustworthiness accepting being asked nicely not to scrape.
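
One alternative that comes up (see the comment below about the client hints API) is inspecting the Sec-CH-UA brand list instead of the UA string. A minimal sketch of that check, with the same caveat: the client ultimately controls these values too, so this is advisory at best.

    def parse_sec_ch_ua(header: str) -> list[str]:
        """Extract brand names from a Sec-CH-UA header, e.g.
        '"Chromium";v="141", "Google Chrome";v="141"' -> ['Chromium', 'Google Chrome'].
        Simplified; a real implementation should use a structured-field parser."""
        brands = []
        for part in header.split(","):
            part = part.strip()
            if part.startswith('"'):
                brands.append(part[1:part.index('"', 1)])
        return brands

    def claims_plain_chrome(header: str) -> bool:
        # A spoofing client can send whatever brands it likes.
        return "Google Chrome" in parse_sec_ch_ua(header)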

Practically, it feels like it's in our best interests not to allow deceptive data-mining tools on the network - even (and especially!) taking into account SE's attempts to sell this data with assurances from their clients (which I don't trust too much) that they would be attributing it.

How to disallow specific browsers for abusive behaviour - and whether it's even possible - is tricky, though.

4
  • 2
    "if it's even possible" - While I haven't personally tested it (I use Cloudflare), it's possible something like this might stop it. Commented Oct 25 at 1:52
  • 1
    Is it the Inauthentic usage policy? (Btw, I'm just now noticing that the FAQ for scraping seems to contradict the AUP; I think I'll make a post about this.) Commented Oct 25 at 11:12
  • 1
    Indeed it is - and probably worth including as part of an answer Commented Oct 25 at 11:26
  • @JourneymanGeek There's potential that the client hints API could give better info than a UA string, but that would still ultimately be under the UA's control. A UA could probably respond to a CH function call with mock or "real" info based on origin. Commented Oct 27 at 8:50
3

The primary purpose of this site is to share knowledge as widely as possible within the license restrictions (mostly attribution). Therefore any restriction should be well reasoned. Technical restrictions (if possible) may be reasonable for spammers, ... or for scrapers who disregard the license.

It comes down to whether OpenAI (and others; they are not the only AI service, there are also Google, Meta, ...) are actually observing the license or not. That's currently still not fully decided by the courts. We don't know.

We as copyright holders of the content could sue all these companies. Why don't we do this collectively?

Or we could support SO in trying to block them. But SO would not be on our side; they would only block them in order to ask them for money, and then allow them again. We would only make SO a bit richer. Is that what we want?

Anyway, I wonder why the AI services don't use the regular data dumps instead for gathering training material. It's not as if there is much change in the content within three months nowadays, and it's not as if we could technically block them from using the data dumps. The dumps also seem much better suited to processing from a technical point of view (once you get rid of the added fake content).
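
For what it's worth, the dumps are also trivial to consume: each site archive contains flat XML files such as Posts.xml, with one row element per post. A minimal sketch of streaming the questions out of one, assuming the standard dump attributes (PostTypeId 1 marks a question); error handling omitted:

    import xml.etree.ElementTree as ET

    def iter_questions(path="Posts.xml"):
        """Stream question rows from a data-dump Posts.xml without
        loading the whole file into memory."""
        for _, row in ET.iterparse(path, events=("end",)):
            if row.tag == "row" and row.get("PostTypeId") == "1":
                yield row.get("Id"), row.get("Title"), row.get("Body")
            row.clear()  # free memory as we go

    # Usage (assumes a Posts.xml from any site's dump in the working directory):
    for post_id, title, _body in iter_questions():
        print(post_id, title)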

We simply may not be able to technically block anyone there (if you include the data dumps).

But if there is something that works technically, we should extend it to any AI-based service that retrieves online information and treat them all the same. Many news platforms, for example, report significant overall declines in traffic and increases only in bot traffic. Maybe they don't want that traffic and have technical solutions for it.

1
  • "I wonder why the AI services not use the regular data dumps instead for gathering training material." They often do use the regular data dumps but at the same time, crawling is easy for those who already have a crawler (why make an exception for SE, which is just tiny portion of the web? Just crawl the web) and 3 months can matter esp. for next tech + given the typical criticism of LLM knowledge being outdated. Commented Oct 28 at 1:24
-9

This is stupid.

Blocking a user agent will never be not stupid. I don't want to see Stack Exchange on the wall of shame of sites that try to force the user not to use their user agent of choice. If the site doesn't work in the browser, SE has no responsibility to fix it, as the list of supported browsers very well describes. But Stack Exchange shouldn't deliberately sabotage a user agent at all.

2
  • 12
    We already block plenty of user agents: scrapers, spambots, DDoS programs. Cloudflare even blocks a lot of legitimate users: I'm currently banned from SEDE, for example. I think this discussion needs to start with the understanding that we're already on the wall of shame. (I am really sympathetic to this position, though: it's the commendable kind of ideological purity.) Commented Oct 25 at 2:09
  • 4
    @wizzwizz4 Those are causing performance issues for the site, so it's not the same. In this case, you would block a user behaving as a user because... their UA string is on the naughty list. You should know better than that. Commented Oct 25 at 11:23
-18

Honestly, I think that we should instead be trying to embrace this new ai search engine, as it can help students and mathematicians without a proper education learn about current affairs and progress their studies.

However, I also agree that some mathematicians may abuse the ai search engine as an easy way to do a project or an assignment. But if they do learn, isn't that also good? Maths is not about having the answers in the first place but about learning the logic and process behind them, and if open ai provides this 24/7, is it not extremely beneficial?

To conclude, I believe that Stack Overflow is a place where you can learn from other talented mathematicians, and the open ai browser only provides more help and quicker support for their learning and problems.

2
  • 14
    The M here is for meta, not Math, and we're talking about a browser, not a search engine Commented Oct 26 at 12:09
  • 3
    This answer mentions "ai search engine" twice and "open ai browser" once. Which did you mean? Commented Oct 26 at 15:42
-20

Should Stack Exchange block OpenAI's Atlas browser?

No, we should not. My understanding of Stack Exchange is that it is a platform where humans help other humans understand and gain knowledge. Blocking a web browser doesn’t go in that direction. More generally, blocking the sharing of knowledge slows down society’s progress.

From an SE Inc. business standpoint, the number of questions has already decreased 10-fold (it is actually 25-fold on SO now), so intentionally blocking the web browser from a company valued at $500 billion USD doesn't strike me as the best way to regain traffic, not to mention that the same company likely saved SE Inc.'s valuation from sinking further last year.

If not, where do we draw the line?

User contributions on Stack Exchange are licensed under CC BY-SA. The red line is drawn by the courts, which decide when one infringes that license.

23
  • 12
    Blocking Atlas wouldn't block any humans: it's currently only available for macOS users, all of whom have Safari. There are some access needs that restrict what browsers people can easily use, but Atlas is functionally equivalent to Chrome for all of those I'm aware of. Commented Oct 24 at 19:56
  • 2
    @wizzwizz4 Makes it harder for some people to access it. How would you feel if someone required you to change browsers today? Not impossible, but inconvenient. Commented Oct 24 at 19:57
  • 1
    Sometimes... getting people to change from an awful product is the right move, even if they may disagree at the time. Commented Oct 24 at 20:11
  • 1
    @KevinB You're right, that's what I tell macOS and Safari users. Commented Oct 24 at 20:12
  • 20
    @FranckDernoncourt Stack Exchange already requires me to use particular browsers. I've been advocating for SE to move away from that, and support all browsers. However, we do not exist in a political vacuum: blocking Atlas (basically Chrome with a bundled extension) is different to blocking Lynx (a console-based browser needed by some blind users). Commented Oct 24 at 20:14
  • 1
    @wizzwizz4 Intentionally blocking a web browser from a company valued at $500 billion USD is quite different than not actively supporting a hobbyist browser with, and I'm quoting Gemini so I'm sure you'll agree, a negligible market share. Commented Oct 24 at 20:18
  • 4
    "intentionally blocking the web browser from a company valued at $500 billion USD doesn't strike me as the best way to regain traffic" - what? Commented Oct 24 at 22:42
  • 2
    Why does blocking a product from a highly valued company change anything about our traffic? Commented Oct 24 at 22:46
  • 3
    That's a question of product quality/use statistics, not market valuation. Commented Oct 24 at 22:50
  • 18
    @FranckDernoncourt Some things are more important than short-term traffic maximisation. For example, spam campaigns increase the traffic to our sites, and we work hard to eliminate those as quickly as possible. Commented Oct 24 at 23:12
  • 1
    @wizzwizz4 Why is blocking Atlas short-term traffic maximisation? OpenAI replacing SE? Commented Oct 24 at 23:15
  • 8
    @wizzwizz4 While SE practically only supports a subset of browsers, the trifecta of safari(webkit), chrome (blink) and firefox (gecko) covers most options. I'm testing it on servo and ladybird for fun, and so far so good on both. Commented Oct 25 at 0:22
  • 7
    Any content generated by any built-in tools of Atlas would violate the no-LLM-generated-content policy. So I have no problem banning Atlas, if it were possible, because LLM-generated trash is worthless. OpenAI is overvalued by at least 449 billion dollars. I don't want users who use Atlas to generate content to contribute: not only will I have to flag and downvote the worthless AI garbage if that happens, some of that garbage won't be reported, and those users will get the misleading impression it's allowed. Commented Oct 26 at 1:38
  • 10
    I just want to ban LLM generated content… Commented Oct 26 at 1:47
  • 1
    @FranckDernoncourt That looks like on-device AI - data doesn't go out, and it's offline. And using that to post is still disallowed on many sites already. The concern here is it being used as a sneaky scraper. Commented Oct 27 at 0:50
