The current implementation looks quite bad, but there's a lot to praise in this announcement. It also seems I have _many_ thoughts about it. (My kingdom for `<details>` / `<summary>`!)
* You've clearly set out the original goals of the project.
* Onboarding new users who have questions. This involves:
* Distinguishing between novel questions, and those already answered on the network, so that they can be handled differently.
* This has been the goal of _quite a lot_ of design decisions in the past, especially as regards the Ask Question flow. A chat interface has the _potential_ to do a much better job, especially if we have systems that can identify when a question is novel.
 * A system that can identify when questions are novel could be repurposed in other ways, such as duplicate detection. However, unless you're constructing an auxiliary database (à la Wikidata or Wikifunctions), the low-hanging fruit of duplicate detection can be picked better and more easily by a domain expert using a basic search engine.
* Novice askers often require more support than experienced askers, and different genres of question require different templates. A chat interface combining fact-finding (à la the Ask Wizard) and FAQs, perhaps with some rules to catch common errors (like a real-time Staging Ground lite), could cover some bases that the current form-based Ask Wizard doesn't.
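To make that last idea concrete, here's a toy sketch of the kind of checks a "real-time Staging Ground lite" might run against a draft question. Every rule and message here is invented by me, purely for illustration:

```python
import re

# Each rule is a predicate over the draft question, plus advice to show
# the asker when it trips. (All heuristics invented for illustration.)
RULES = [
    (
        lambda q: "doesn't work" in q.lower() and "error" not in q.lower(),
        "Describe *how* it fails: expected behaviour, actual behaviour, "
        "and any error message.",
    ),
    (
        lambda q: not re.search(r"(^|\n)(    |\t)", q),
        "Include a minimal, runnable code sample, formatted as code.",
    ),
    (
        lambda q: len(q.split()) < 30,
        "Add more detail: what you tried, and what you expected to happen.",
    ),
]

def check_draft(question: str) -> list[str]:
    """Return the advice for every rule the draft trips."""
    return [advice for trips, advice in RULES if trips(question)]

print(check_draft("My code doesn't work, please help"))
```

Most of the value here is in boring, deterministic rules: no language model required.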
* Presenting users with existing material, in a form where they _understand_ that it solves their problems.
* A system that provides users with incorrect information is worse than useless: it's _actively_ harmful. Proper attribution allows us to remove that information from the database (see [Bryan Krause's answer](https://meta.stackexchange.com/a/412392/308065)), and – more importantly – prevents the AI agent from confabulating new and exciting errors. ([Skeptics Stack Exchange](https://skeptics.stackexchange.com/) can handle the _same_ inaccurate claim repeated widely, but would not be able to cope with many _different_ inaccurate claims repeated a few times each.)
* Imparting expertise to users, so they need less hand-holding in future.
To use an example from programming: many newbies don't really _get_ that variable names are functionally irrelevant, nor how completely the computer ignores comments and style choices, so if an example looks too different from their code, they can't interpret it. This skill can be learned, but some people need a bit of a push.
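To make that concrete (my own toy example, not one from the announcement): these two functions are, to the computer, the _same_ program, however different they look to a novice.

```python
# Names, comments, and spacing exist only for the human reader;
# the computer treats these two definitions identically.

def calculate_rectangle_area(width, height):
    # Multiply the two side lengths to get the area.
    return width * height

def f(a, b):
    return a*b

assert calculate_rectangle_area(3, 4) == f(3, 4)  # both return 12
```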
* This is teaching, and therefore hard. I'd be tempted to declare this out of scope, although there _are_ ways that a chat interface could help with this: see, for example, [Rust error codes](https://doc.rust-lang.org/error_codes/error-index.html) (which are _conceptually_ a dialogue between teacher and student – see [E0562](https://doc.rust-lang.org/error_codes/E0562.html) or [E0565](https://doc.rust-lang.org/error_codes/E0565.html)). Future versions of stackoverflow.ai _could_ do this kind of thing.
* Next-token prediction systems are _particularly_ bad at teaching, because they do not possess the requisite ability to model human psychology. This is a skill that precious few humans possess – although many teachers who _don't_ have this skill can still get good outcomes by using and adapting the work of those who do (which is a skill in itself).
* Y'know what _is_ good at teaching, in text form? Books! (And written explanations, more generally.) A good book can explain things as well as, or even better than, a teacher, especially when you start getting deep into a topic (where not much is fundamentals any more, and readers who don't immediately understand an explanation can usually work it out themselves). But finding good books is quite hard. And Stack Exchange is a sort of library…
* Stack Exchange is not currently well-suited for beginner questions. When people ask a question that's already been answered, we usually close it as a duplicate (and rightly so!), so encouraging such users to post new questions is (as it stands) the wrong approach. However, beginners often require things to be explained in multiple ways, before it clicks. Even if one question has multiple answers from different perspectives, the UI isn't particularly suited for that.
I suspect that Q&A pairs aren't the right way to _represent_ beginner-help: instead, it should be more like a decision tree, where we try to identify what misunderstandings a user has, and address them. Handling this manually gets quite old, since most people have the same few misconceptions: a computer could handle this part. _But_, some people have rarer misconceptions: these could be directed to the community, and then worked into the decision tree once addressed. (I've sketched what such a tree might look like below.)
As far as getting the rarer misconceptions addressed, it _might_ be possible to shoe-horn this into the existing Q&A system, by changing the duplicate system. (Duplicates that can remain open? Or perhaps a policy change would suffice, if we can reliably ensure that the different misconceptions are clear in a question's body.)
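Here's a minimal sketch of that decision-tree idea. Every diagnostic question and explanation below is invented for illustration:

```python
# Internal nodes ask a diagnostic question; leaves hold either a canned
# explanation or a hand-off to the community.

class Node:
    def __init__(self, question, yes, no):
        self.question, self.yes, self.no = question, yes, no

class Explain:
    def __init__(self, text):
        self.text = text

ESCALATE = Explain("Rare misconception: route it to the community, "
                   "then graft the answer back into the tree.")

# A toy tree for the classic "my loop only prints the last value" issue.
tree = Node(
    "Does your code change a variable inside a loop?",
    yes=Node(
        "Do you print it after the loop, expecting every value?",
        yes=Explain("The variable holds only its final value: print "
                    "inside the loop, or collect the values in a list."),
        no=ESCALATE,
    ),
    no=ESCALATE,
)

def walk(node, answer_fn):
    """Follow yes/no answers down to an explanation."""
    while isinstance(node, Node):
        node = node.yes if answer_fn(node.question) else node.no
    return node.text

print(walk(tree, lambda question: True))  # answer "yes" to everything
```

The `ESCALATE` leaf is where the community comes in: once a rare misconception has been addressed, it gets grafted into the tree.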
* Imitating ChatGPTs' interfaces, for familiarity.
* I'm not sure why "conversational search and discovery" has an additional list item, since this seems to me like the same thing. (Functional specification versus implementation?)
* Competing with ChatGPTs, by being more useful.
* I think focusing on differentiation, and playing to our strengths (not competing with _theirs_), is key here: I'm _really_ glad you're moving in this direction. An OverflowAI that was just a ChatGPT was, I think, a huge mistake.
* You've finally acknowledged that LLM output is neither attributed, nor really attributable. Although,
* > LLMs cannot return attribution reliably
GPT models cannot return attribution _at all_. I'm _still_ trying to wrap my head around what attribution would even _mean_ for GPT output. Next-token generative language models compress the space of prose in a way that makes low-frequency provenance information rather difficult to preserve, even in principle – and while high-frequency / local provenance information _could_ in principle be preserved, the GPT architecture doesn't even _try_ to preserve it. (I expect quantisation-like schemes could reduce high-frequency provenance overhead to manageable levels in the final model, but I think you'd have to do something _clever_ to train an attributing model without a factor-of-a-billion overhead.)
Embedding all posts (or, all paragraphs?) on the network into a vector space with useful similarity properties would cut the provenance overhead from exponentially many possible provenances (which need linear space) to linearly many (which need only near-constant space). But this scheme only allows you to train a language model to _fake_ provenance quite well, which isn't attribution either: that's essentially just a search algorithm. (We're back where we started: I don't expect this to be _better_ than more traditional search algorithms.)
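For concreteness, here's a toy version of that scheme. The hashed bag-of-words `embed` is a stand-in I've invented; a real system would use a learned embedding model:

```python
import hashlib
import numpy as np

DIM = 256

def embed(text: str) -> np.ndarray:
    """Toy embedding: hashed bag-of-words, L2-normalised."""
    v = np.zeros(DIM)
    for word in text.lower().split():
        v[int(hashlib.md5(word.encode()).hexdigest(), 16) % DIM] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

# Embed every paragraph once: linear in the corpus, constant per item.
corpus = {
    "/a/1": "Use a context manager to make sure the file is closed.",
    "/a/2": "List comprehensions are usually faster than map with lambda.",
}
index = {url: embed(text) for url, text in corpus.items()}

def fake_provenance(generated: str) -> str:
    """Return the URL of the most similar paragraph: a search result
    dressed up as a citation, not real attribution."""
    q = embed(generated)
    return max(index, key=lambda url: float(index[url] @ q))

print(fake_provenance("close files with a context manager"))  # /a/1
```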
* > analyzes for correctness and comprehensiveness in order to supplement it with knowledge from the LLM
There _is_ no "knowledge from the LLM". That knowledge is always from _somewhere else_. (The rare exceptions, novel valid connections between ideas that the language model has made, are drowned out by the novel _invalid_ connections that the language model has made: philosophically, I'd argue that this is not knowledge.) Maybe you still don't _quite_ get it yet.
* Your implementation is still deficient:
> A response is created using multiple steps via RAG + multiple rounds of LLM processing
> We created an AI Agent to act as an “answer auditor”: it reads the user’s search, the quotes from SO & SE content, and analyzes for correctness and comprehensiveness
You're using the generative model as a "god of the gaps". Anything you don't (yet) know how to do properly, you're giving to the language model. And while I could condemn this approach, I cannot find it in me to be _upset_ about it: if it's worth doing, it's worth doing badly. Where you aren't familiar with the existing techniques for producing chat-like interfaces (and there is _copious_ literature on the subject), filling in the gaps with what you have to hand… kinda makes sense?
But all the criticisms that the phrase "god of the gaps" [was originally coined to describe](https://en.wikipedia.org/wiki/God_of_the_gaps#Origins_of_the_term) apply to this approach just as well. There are better ways to fill in these gaps, and I hope you'll take them just as soon as you know what they are.
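For concreteness, here's my reading of the pipeline you describe, as toy Python. Every function body is a placeholder, because the announcement names the components without specifying them:

```python
def retrieve_quotes(query: str) -> list[str]:
    # Placeholder: the real system searches SO & SE content (the RAG step).
    return ["Quote from an SO answer relevant to: " + query]

def draft_answer(query: str, quotes: list[str]) -> str:
    # Placeholder: the real system calls a generative model.
    return f"Answer to {query!r}, citing {len(quotes)} quote(s)."

def audit(query: str, quotes: list[str], answer: str) -> bool:
    # Placeholder for the "answer auditor": a generative model judging
    # correctness and comprehensiveness. This is the god-of-the-gaps
    # step, since nothing verifies the verifier.
    return True

def respond(query: str, max_rounds: int = 3) -> str:
    quotes = retrieve_quotes(query)
    for _ in range(max_rounds):
        answer = draft_answer(query, quotes)
        if audit(query, quotes, answer):
            return answer
    return "No audited answer found."

print(respond("How do I reverse a list in Python?"))
```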
* You've identified some ways people are using stackoverflow.ai. These include:
* > traditional technical searches
* > help with error messages,
* > how to build certain functions,
* > what code snippets do
* > comparing different approaches and libraries
* > asking for helping architecting and structuring apps
* > learning about different libraries and concepts.
* > The majority of queries are technical
This is _extremely_ valuable information: you can use it as phase 1 of a [Wizard of Oz design](https://en.wikipedia.org/wiki/Wizard_of_Oz_experiment). However, I don't think you benefit much from keeping the information secret, since only a few players are currently positioned to take advantage of it, and they're _all_ better able to gather it than you are.
Letting us at this dataset (redacted, of course) for mostly-manual perusal would allow us to construct expert systems, which could be chained together à la [DuckDuckHack](https://github.com/duckduckgo/duckduckhack-docs). Imagine a system like Alexa, but with the cohesiveness (and limited scope) of the Linux kernel. Making something like this work, and work well, _requires_ identifying the low-hanging fruit: it's one great big application of the 80/20 rule.
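A sketch of the chaining I have in mind, where each expert system claims the queries it understands and passes on the rest. Both "experts" here are invented examples of low-hanging fruit:

```python
import re

def error_message_expert(query: str) -> str | None:
    m = re.search(r"(\w+Error): (.+)", query)
    if m:
        return f"Known error {m.group(1)}: canned explanation for {m.group(2)!r}."
    return None  # pass

def version_expert(query: str) -> str | None:
    if re.search(r"\bpython [23]\b", query.lower()):
        return "Canned Python 2 vs 3 comparison."
    return None  # pass

EXPERTS = [error_message_expert, version_expert]

def answer(query: str) -> str:
    for expert in EXPERTS:
        result = expert(query)
        if result is not None:
            return result
    return "No expert matched: fall back to ordinary search."

print(answer("KeyError: 'name' when reading a dict"))
```

Each expert answers only what it's sure about; the 80/20 rule lives in choosing which experts to write first.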
* > The demographic also shows that it's a different set of users than stackoverflow.com, so there are good signs here for acquiring new community members over time.
This doesn't follow. We've long known that most users never actively interact with the site (this is a _good_ thing, for much the same reason that many readers are not authors). There's no reason to believe you can – or, more pertinently, _should_ – be "acquiring" them as community members. (As users, maybe: that's your choice whether to push them to register accounts, so long as it doesn't hurt the community.)