llama-server : implement universal assisted decoding #12635
Conversation
This looks really interesting! It's surprising how much crossover there is between many models' tokenisers.
Pull Request Overview
This PR implements universal assisted decoding in llama-server, enabling speculative decoding with draft models whose tokenizers are incompatible with the main model's. The implementation translates tokens between the two models by detokenizing and retokenizing text.
Key changes:
- Removes strict vocabulary compatibility checks between draft and target models
- Adds a token translation mechanism that round-trips tokens through text when vocabularies are incompatible (see the sketch after this list)
- Introduces string replacement functionality for handling tokenizer differences
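
To make the core idea concrete, here is a minimal sketch of the translation round-trip (not the PR's exact code; `translate_tokens` is a hypothetical name, while `common_detokenize`/`common_tokenize` are llama.cpp's existing common helpers):

```cpp
#include "common.h"

#include <string>
#include <vector>

// Hypothetical helper: translate a token sequence from one model's
// vocabulary into another's by round-tripping through text.
static std::vector<llama_token> translate_tokens(
        llama_context * ctx_src,                      // model that produced the tokens
        llama_context * ctx_dst,                      // model that should consume them
        const std::vector<llama_token> & tokens_src) {
    // 1. detokenize with the source model, keeping special tokens
    const std::string text = common_detokenize(ctx_src, tokens_src, /*special=*/true);

    // 2. retokenize with the destination model's tokenizer;
    //    do not add BOS/EOS, but do parse special tokens in the text
    return common_tokenize(ctx_dst, text, /*add_special=*/false, /*parse_special=*/true);
}
```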
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tools/server/server.cpp | Updates the server to handle incompatible vocabularies and configure token replacements |
| examples/speculative/speculative.cpp | Removes vocabulary compatibility validation checks |
| examples/speculative-simple/speculative-simple.cpp | Removes compatibility checks and updates speculative initialization |
| common/speculative.h | Updates function signatures to accept both target and draft contexts |
| common/speculative.cpp | Implements the core universal assisted decoding logic with token translation |
| common/common.h | Adds a replacements field to the speculative parameters |
| common/arg.cpp | Adds command-line argument parsing for token replacements |
Comments suppressed due to low confidence (1)
common/speculative.cpp:30
- The comment 'ctx_main' doesn't match the struct member name 'ctx_tgt'. This should be '/* .ctx_tgt = */' to be consistent with the actual member name.
```cpp
/* .ctx_main = */ ctx_tgt,
```
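
Applying the suggestion is a one-line rename of the aligned comment:

```cpp
/* .ctx_tgt = */ ctx_tgt,
```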
```cpp
        struct common_speculative * spec,
        const std::string& input) {
    std::string result = input;
    for (const auto& pair : spec->tgt_dft_replacements) {
```
Suggested change:

```diff
- for (const auto& pair : spec->tgt_dft_replacements) {
+ for (const auto & pair : spec->tgt_dft_replacements) {
```
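
For context, a plausible completion of this helper looks like the following (a sketch, assuming `tgt_dft_replacements` holds `(target, draft)` string pairs; the PR's exact member layout may differ):

```cpp
// Sketch: rewrite the detokenized target-model text using the configured
// (target, draft) string pairs before it is retokenized for the draft model.
static std::string replace_to_dft(
        struct common_speculative * spec,
        const std::string & input) {
    std::string result = input;
    for (const auto & pair : spec->tgt_dft_replacements) {
        size_t pos = result.find(pair.first);
        while (pos != std::string::npos) {
            result.replace(pos, pair.first.length(), pair.second);
            // continue searching after the inserted replacement to avoid
            // re-matching inside the text we just wrote
            pos = result.find(pair.first, pos + pair.second.length());
        }
    }
    return result;
}
```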
```cpp
const auto *model_tgt = llama_get_model(ctx_tgt);
const auto *vocab_tgt = llama_model_get_vocab(model_tgt);
```
Fix the whitespace in a few more places:

```diff
- const auto *model_tgt = llama_get_model(ctx_tgt);
- const auto *vocab_tgt = llama_model_get_vocab(model_tgt);
+ const auto * model_tgt = llama_get_model(ctx_tgt);
+ const auto * vocab_tgt = llama_model_get_vocab(model_tgt);
```
```cpp
std::string text;
text = common_detokenize(ctx_tgt, prompt_tgt_main_model, true);
text = replace_to_dft(spec, text);
LOG_DBG("main->draft detokenized string: '%s'\n", text.c_str());
```
Follow the log pattern:

```diff
- LOG_DBG("main->draft detokenized string: '%s'\n", text.c_str());
+ LOG_DBG("%s: main->draft detokenized string: '%s'\n", __func__, text.c_str());
```
```cpp
add_opt(common_arg(
    {"--spec-replace"}, "TARGET", "DRAFT",
    "translate the string in TARGET into DRAFT if the draft model and main model are not compatible",
    [](common_params & params, const std::string & tgt, const std::string & dft) {
        params.speculative.replacements.push_back({ tgt, dft });
    }
).set_examples({LLAMA_EXAMPLE_SPECULATIVE, LLAMA_EXAMPLE_SERVER}));
```
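
As a usage illustration (the marker strings here are hypothetical): if the target and draft models use different chat-template tokens, one might pass `--spec-replace "<|im_start|>" "<start_of_turn>"` so the target marker is rewritten before the text is retokenized for the draft model.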
In which cases do we need to use replacements?
This pull request implements universal assisted decoding in llama-server. This is a method for performing speculative decoding with a draft model whose tokenizer is incompatible with the main model's, by detokenizing and retokenizing the generated text between the two models.
It currently works, but some improvements can be made.