@remi-or remi-or commented Dec 8, 2025

Since many features have been added to continuous batching (CB), this PR refactors the associated tests so that we can catch new failures. The tests used to be structured as one test per model / attention implementation. Now there is a single test backend that checks that generation with and without CB is coherent. It is called in two tests:

  • we test every possible combination of parameters on one tiny llama model
  • we test a restricted set of parameters on different architectures: full attention, sliding window, etc.
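
As a rough illustration of this structure (not the code from this PR), the sketch below shows a shared backend that compares a reference generation against a continuous-batching generation, plus a helper that enumerates a full parameter grid. Everything here is hypothetical: the two `generate_*` functions are toy stand-ins for real model calls, and the parameter names `attn_implementation` and `use_cuda_graph` are only illustrative.

```python
import itertools


def generate_reference(prompt, max_new_tokens):
    """Toy stand-in for generation WITHOUT continuous batching (hypothetical)."""
    return [(sum(prompt) + i) % 100 for i in range(max_new_tokens)]


def generate_continuous_batching(prompts, max_new_tokens, **params):
    """Toy stand-in for generation WITH continuous batching (hypothetical).

    `params` is accepted but ignored; a real test would forward it to the model.
    """
    return [generate_reference(p, max_new_tokens) for p in prompts]


def check_cb_matches_reference(prompts, max_new_tokens=4, **params):
    """Shared backend: return the indices of prompts whose outputs diverge."""
    expected = [generate_reference(p, max_new_tokens) for p in prompts]
    actual = generate_continuous_batching(prompts, max_new_tokens, **params)
    return [i for i, (e, a) in enumerate(zip(expected, actual)) if e != a]


# Illustrative parameter grid; the names and values are assumptions.
PARAM_GRID = {
    "attn_implementation": ["eager", "sdpa"],
    "use_cuda_graph": [False, True],
}


def all_param_combinations(grid):
    """Expand a grid of options into one dict per combination."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in itertools.product(*(grid[k] for k in keys))]


if __name__ == "__main__":
    prompts = [[1, 2, 3], [4, 5]]
    # First kind of test: the full grid on a single (tiny) model.
    for params in all_param_combinations(PARAM_GRID):
        assert check_cb_matches_reference(prompts, **params) == [], params
```

The second kind of test would call the same `check_cb_matches_reference` backend with a restricted subset of the grid, once per architecture.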

The streaming tests were also regrouped so that they can use the same backend.

Overall, the new tests cover more ground with less code. They have already caught one bug: zero-sized CUDA graphs failed silently, which led to slight generation divergence.

