June 2023 Data Dump is missing

The data dump usually gets refreshed the first weekend of the month, every 3 months.

The current data dump is still from March. Is there just a problem and it's delayed like in the past?


Relevant company information is revealed in the answers, first by former employee AMtwo, then by current employees Jody Bailey and Philippe.

44
  • 53
    "only to begin to collect more information on how it was being used and by whom" ─ how does pausing the dump help you collect this information? If anything, it prevents you from collecting information on who downloads it. If people can download it, that is an opportunity to find out who those people are. Commented Jun 13, 2023 at 22:35
  • 31
    Two questions. (1) Why do you believe that you should or must "continue to work toward the creation of certain guardrails" for the data dumps and why must these guardrails be in place specifically for AI/LLM companies? (2) Will the survey be available for people who want to use the API or data dumps but don't have the features they need today? Commented Jun 13, 2023 at 22:37
  • 45
    The claim that the plan all along was just to collect data on who is using the dump for what purposes, directly contradicts this other answer from an SE staff member who says the dumps were stopped in order to prevent LLM developers from using the data. Commented Jun 13, 2023 at 22:55
  • 42
    I have confirmation via email from Prashanth that this is, indeed, the new official policy. I'm glad to see it. Creative Commons is part of our contract with the community, and it should never be broken -- however, CC does need to address the AI issue in an updated license, in my personal opinion. @wizzwizz4 I also edited the other post to cross link to this one. Commented Jun 13, 2023 at 23:54
  • 49
    Turning the dumps back on was the right thing to do. +1 for that. But at the same time this post is still gaslighting us as to what happened and why (-1 for that) and also totally wishy-washy on what you commit to doing (-1). You've already illegally relicensed my contributions once and maintained radio silence about it, now you are hinting at a plan to do it again. Commented Jun 14, 2023 at 5:46
  • 29
    So. The data dump releases were stopped without any form of announcement and, as usual, it took someone to notice before the company admitted it was something already planned for at least a month. You say that it was "to get stats about who uses it" while a different post claims that you were past that phase and trying to gate the access to the data. After a while, coincidentally when users started to plan how to produce the data themselves, the decision is put on hold for a while and you post this answer, once again denying users the ability to reply and forcing them to use comments [cont...] Commented Jun 14, 2023 at 8:07
  • 15
    Hey all - we understand the confusion about how this answer was conveyed. I've worked with Philippe to identify ways we can address your feedback and, as such, I've merged the two questions and moved the comments from one to the other so that they're all in one place. Additionally, I've clarified the statement so that it conveys what we intended as we understand that this was being interpreted as a different story from what Jody shared previously. Apologies for any confusion. Commented Jun 14, 2023 at 13:41
  • 16
    @Catija the biggest problem is still trying to figure out what the "truth" actually is. The original statement by the CTO, the original - misleading to put it nicely - statement by Philippe - or your revised one once you saw the backlash from the statement itself. 'that this was being interpreted as a different story from what Jody shared previously.' - it literally was a different story. Jody mentioned the decision to pause/stop the dumps was intentional to prevent "abuse", the original statement said that the intention wasn't to pause. There is/was no misinterpretation. Commented Jun 14, 2023 at 14:06
  • 24
    If we want the company to communicate with us, I think it's better to expect back-and-forth communication and clarification as appropriate, rather than to demand that each message is conveyed perfectly at the first release. It's a barrier to actually communicating if they need to spend hours on each and every draft message. I think about how much iteration went into drafting the strike letter: that's okay for a one-off thing, but if every message took that much effort, the will to communicate is going to evaporate pretty quickly. Commented Jun 14, 2023 at 14:26
  • 12
    @BryanKrauseisonstrike are you kidding? How difficult is it to say this is why we disabled the dump, this is what we were thinking, clearly it's not what the community wanted so here's what we're thinking now.... Instead the CTO came out saying we intentionally stopped it for X and then another statement saying that it wasn't our intention to stop the dump, and then claiming that people are misinterpreting the statement. Commented Jun 14, 2023 at 14:43
  • 12
    @Script47 I think you're overemphasizing certain phrasings to make them unclear in your head rather than trying to reconcile. I think a better communication approach when you get conflicting information is to point out the potential conflict and ask for clarification; that has now happened, so, what's the problem? I am not playing defense for SO. I am very strongly urging them to make changes I feel are necessary to keep the sites I like to use operational. Commented Jun 14, 2023 at 14:50
  • 13
    @Script47 "Our intention was never to stop posting the data dump" is ambiguous, not different; the word stop does not indicate whether it is permanent or temporary. Jody used the word in the temporary sense, Philippe used it in a permanent sense. In Jody's post, the meaning is clarified by the word "until" later in the sentence. Now, Catija edited Philippe's post to make clear he meant the permanent sense. Now there is no ambiguity in either use of the word. However, if you assumed the statements were consistent to begin with, that interpretation was also available with the words used here. Commented Jun 14, 2023 at 15:29
  • 25
    I would have preferred Philippe's statement to acknowledge that this unannounced pause caused a lot of valid angst in the community and include an apology for this impact, and I would have preferred that it recognize explicitly the previous commitments made by the company to users regarding the release of data, but I do not think the message conflicts with the CTO, particularly after the clarifying edits. Commented Jun 14, 2023 at 15:57
  • 12
    @JeffAtwood I completely agree that CC should address LLM usage. However, ChatGPT did not use a data dump to train, but commoncrawl.org, so that's the place to start to enforce any possible CC changes, not the dump. Also, I think it's intrinsic to CC that we allow any usage that respects attribution, not merely some usage that we prefer at the moment. Commented Jun 15, 2023 at 9:04
  • 17
    As the one nursing the process, I can confirm that the final file from the June dump has been fully uploaded, has gone through the usual processing by archive.org, and is now available for consumption. It actually finished at 22:42 UTC so, if we want to be technical, it was still delivered by end of day Friday. :-) Commented Jun 17, 2023 at 1:55