llama-server : implement universal assisted decoding #12635
Conversation
This looks really interesting! It's surprising how much crossover there is between many models' tokenisers.
Pull Request Overview
This PR implements universal assisted decoding in llama-server, enabling speculative decoding with a draft model whose tokenizer is incompatible with the main model's. The implementation translates tokens between the two models by detokenizing and retokenizing the generated text.
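As a rough sketch of that translation step (a minimal example assuming llama.cpp's existing `common_detokenize`/`common_tokenize` helpers; `translate_tokens` itself is a hypothetical name, not code from this PR):

```cpp
#include "common.h"

#include <string>
#include <vector>

// hypothetical sketch: convert tokens produced with one model's vocabulary
// into the other model's vocabulary by round-tripping through text
static std::vector<llama_token> translate_tokens(
        llama_context * ctx_src,  // context whose vocabulary produced the tokens
        llama_context * ctx_dst,  // context whose vocabulary to translate into
        const std::vector<llama_token> & tokens) {
    // detokenize with the source vocabulary, rendering special tokens as text
    const std::string text = common_detokenize(ctx_src, tokens, /*special=*/true);

    // retokenize with the destination vocabulary; the text is a continuation,
    // so no BOS is added, but special tokens are parsed back into token ids
    return common_tokenize(ctx_dst, text, /*add_special=*/false, /*parse_special=*/true);
}
```

Note that token boundaries need not align across vocabularies, so the retokenized sequence can differ in length from the input.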
Key changes:
- Removes strict vocabulary compatibility checks between draft and target models
- Adds token translation mechanism through text conversion when vocabularies are incompatible
- Introduces string replacement functionality for handling tokenizer differences
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tools/server/server.cpp | Updates server to handle incompatible vocabularies and configure token replacements |
| examples/speculative/speculative.cpp | Removes vocabulary compatibility validation checks |
| examples/speculative-simple/speculative-simple.cpp | Removes compatibility checks and updates speculative initialization |
| common/speculative.h | Updates function signatures to accept both target and draft contexts |
| common/speculative.cpp | Implements core universal assisted decoding logic with token translation |
| common/common.h | Adds replacements field to speculative parameters |
| common/arg.cpp | Adds command-line argument parsing for token replacements |
Comments suppressed due to low confidence (1)
common/speculative.cpp:30
- The comment 'ctx_main' doesn't match the struct member name 'ctx_tgt'. This should be `/* .ctx_tgt = */` to be consistent with the actual member name.

```cpp
/* .ctx_main = */ ctx_tgt,
```
```cpp
add_opt(common_arg(
    {"--spec-replace"}, "TARGET", "DRAFT",
    "translate the string in TARGET into DRAFT if the draft model and main model are not compatible",
    [](common_params & params, const std::string & tgt, const std::string & dft) {
        params.speculative.replacements.push_back({ tgt, dft });
    }
).set_examples({LLAMA_EXAMPLE_SPECULATIVE, LLAMA_EXAMPLE_SERVER}));
```
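Downstream, these pairs would be applied to the intermediate text before it is retokenized. A minimal, self-contained sketch of that substitution (the name `apply_replacements` is hypothetical, not this PR's exact code):

```cpp
#include <string>
#include <utility>
#include <vector>

// hypothetical sketch: rewrite TARGET-side strings into their DRAFT-side
// equivalents in the detokenized text before retokenizing it
static void apply_replacements(
        std::string & text,
        const std::vector<std::pair<std::string, std::string>> & replacements) {
    for (const auto & [search, replace] : replacements) {
        if (search.empty()) {
            continue; // nothing to match
        }
        size_t pos = 0;
        while ((pos = text.find(search, pos)) != std::string::npos) {
            text.replace(pos, search.size(), replace);
            pos += replace.size(); // skip past the inserted text
        }
    }
}
```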
In which cases do we need to use replacements?
I added this in case translation improves speculation accuracy. It's specifically built for instruct tags like <|im_start|> etc.
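For illustration, a hypothetical invocation might look like the following (the model files and the draft-side tag are placeholders, not taken from this PR):

```sh
# speculate with a draft model whose chat template uses a different turn marker
llama-server -m target-model.gguf -md draft-model.gguf \
    --spec-replace "<|im_start|>" "<start_of_turn>"
```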
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This pull request implements universal assisted decoding in llama-server. This is a method for performing speculative decoding with a draft model whose tokenizer is incompatible with the main model's, by detokenizing and retokenizing the generated text between the two models.
It currently works, but some improvements can be made.