The 2024 changes to the data dump were made because:
Simultaneously, we know that companies have scraped or otherwise ingested Stack Overflow and Stack Exchange data to train models without proper attribution — models that they are monetizing or using for commercial purposes. We know this is happening because the companies themselves, or independent researchers, have disclosed this information.
This is also the reason for the terrible, horrible, no good, very bad Cloudflare walls:
We will heavily rate-limit access to the platform from Microsoft’s IP addresses. This includes Microsoft Azure and, therefore, will impact any community member running a service that accesses the site via Microsoft Azure. However, traffic will generally be exempt from this restriction if it uses the API - see the list of exemptions below. We are planning to implement analogous restrictions for other cloud platforms, e.g. AWS and GCP, at a later date.
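For illustration, here is a minimal Python sketch of what that kind of cloud-range throttling looks like server-side. It assumes you already have Microsoft's published Azure CIDR list (the two ranges below are placeholders, not the real list) and, purely for the sketch, that API traffic is identifiable by a hypothetical `/api/` path prefix:

```python
import ipaddress

# Placeholder CIDR blocks standing in for Microsoft's published
# "Azure IP Ranges and Service Tags" download; the real list is
# large and updated regularly.
AZURE_RANGES = [ipaddress.ip_network(cidr) for cidr in (
    "20.0.0.0/11",
    "40.74.0.0/15",
)]

def is_azure_ip(addr: str) -> bool:
    """True if the client address falls inside a known Azure range."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in AZURE_RANGES)

def should_rate_limit(addr: str, path: str) -> bool:
    # Per the announcement, API traffic is generally exempt;
    # other requests from cloud IPs get throttled.
    # (The /api/ prefix is a stand-in; Stack Exchange's real API
    # lives on a separate host.)
    if path.startswith("/api/"):
        return False
    return is_azure_ip(addr)
```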
OpenAI has released a browser, ChatGPT Atlas, designed to function as a distributed scraper:
By default, we don't use the content you browse to train our models. If you choose to opt this content in, you can enable "include web browsing" in your data controls settings. Note that even if you opt in to training, webpages that opt out of GPTBot will not be trained on. If you've enabled training for chats in your ChatGPT account, training will also be enabled for chats in Atlas. This includes website content you've attached when using the Ask ChatGPT sidebar and browser memories that inform your chats.
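For reference, the GPTBot opt-out mentioned there is the standard robots.txt mechanism: a site that serves the following is telling OpenAI's training crawler to stay away entirely.

```
# robots.txt: disallow OpenAI's training crawler site-wide
User-agent: GPTBot
Disallow: /
```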
This browser isn't trivial to programmatically distinguish from Google Chrome, per Simon Willison's research:
The Atlas user-agent is `Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/141.0.0.0 Safari/537.36`, identical to the user-agent I get for the latest Google Chrome on macOS.
Spoofing another browser's user agent is not a good-faith action unless you're trying to reduce fingerprinting. Judging by OpenAI's description, Agent Mode is rather conspicuous and leaks information about "your browsing context", so Atlas is not a good-faith attempt to reduce fingerprinting. This, plus OpenAI's track record, makes me wary of its claim that "webpages that opt out of GPTBot will not be trained on". We've sacrificed a lot of usability to keep the AI scrapers at bay, and that's all for nothing if they can just use residential IPs.
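To make the detection problem concrete, here is a minimal Python sketch (the bot tokens are illustrative, not an exhaustive list) of why a user-agent blocklist cannot catch Atlas:

```python
# Per Simon Willison's observation, Atlas and desktop Chrome send the
# exact same user-agent string, so they are indistinguishable here.
CHROME_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/141.0.0.0 Safari/537.36"
)
ATLAS_UA = CHROME_UA

def is_declared_ai_client(user_agent: str) -> bool:
    """Naive UA blocklist: only catches crawlers that identify themselves."""
    return any(token in user_agent for token in ("GPTBot", "OAI-SearchBot"))

assert not is_declared_ai_client(ATLAS_UA)  # Atlas slips straight through
```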
Should Stack Exchange block OpenAI's Atlas browser? If so, why? If not, where do we draw the line? How should we respond to similar systems?