
llama-server : implement universal assisted decoding #12635


Open · wants to merge 24 commits into master

Conversation

@g2mt commented Mar 28, 2025

This pull request implements universal assisted decoding in llama-server. This is a method for performing speculative decoding with a draft model whose tokenizer is incompatible with the main model's, by decoding and re-encoding the generated text between the two models.

It currently works, but some improvements can be made.

  • Token healing could be applied to fix any weirdness that occurs when the draft model generates tokens that don't lie on a word boundary (not sure how much this affects performance).
  • The translation process could be cached to improve sampling time; however, this might require substantial refactoring.
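
For readers new to the idea, here is a minimal sketch of the translation step, using the common_tokenize/common_detokenize helpers from llama.cpp's common library. The function name and call sites are illustrative only; this is not the code added by the PR.

    #include "common.h" // common_tokenize / common_detokenize (also pulls in llama.h)

    // Illustrative sketch (not the PR's code): map target-model tokens to
    // draft-model tokens by round-tripping through text.
    static std::vector<llama_token> translate_tgt_to_dft(
            llama_context * ctx_tgt,
            llama_context * ctx_dft,
            const std::vector<llama_token> & tokens_tgt) {
        // 1. detokenize with the target model's vocabulary
        std::string text = common_detokenize(ctx_tgt, tokens_tgt, /* special */ true);

        // 2. optionally apply user-supplied string replacements here to
        //    reconcile differences such as mismatched special-token spellings

        // 3. re-tokenize with the draft model's vocabulary
        return common_tokenize(ctx_dft, text, /* add_special */ false, /* parse_special */ true);
    }

Per the PR description, the same round trip runs in the opposite direction so that the draft model's proposed continuation can be verified by the target model in its own token space.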

@jukofyork (Collaborator)

This looks really interesting! It's surprising how much crossover there is between many models' tokenisers.

The github-actions bot added a batch of unrelated labels on Jun 28, 2025; @CISC removed them on Jul 8, 2025.
@CISC requested a review from Copilot on July 28, 2025.
Copilot AI left a comment:

Pull Request Overview

This PR implements universal assisted decoding in llama-server, enabling speculative decoding with draft models whose tokenizers are incompatible with the main model's. The implementation translates tokens between the two models by detokenizing and retokenizing text.

Key changes:

  • Removes strict vocabulary compatibility checks between draft and target models
  • Adds token translation mechanism through text conversion when vocabularies are incompatible
  • Introduces string replacement functionality for handling tokenizer differences

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Summary per file:

  • tools/server/server.cpp: Updates the server to handle incompatible vocabularies and configure token replacements
  • examples/speculative/speculative.cpp: Removes vocabulary compatibility validation checks
  • examples/speculative-simple/speculative-simple.cpp: Removes compatibility checks and updates speculative initialization
  • common/speculative.h: Updates function signatures to accept both target and draft contexts
  • common/speculative.cpp: Implements the core universal assisted decoding logic with token translation
  • common/common.h: Adds a replacements field to the speculative parameters
  • common/arg.cpp: Adds command-line argument parsing for token replacements
Comments suppressed due to low confidence (1)

common/speculative.cpp:30

  • The comment 'ctx_main' doesn't match the struct member name 'ctx_tgt'. This should be '/* .ctx_tgt = */' to be consistent with the actual member name.
        /* .ctx_main   = */ ctx_tgt,

@CISC requested a review from ggerganov and removed review requests for ggerganov, ngxson and JohannesGaessler on July 29, 2025.
struct common_speculative * spec,
const std::string& input) {
std::string result = input;
for (const auto& pair : spec->tgt_dft_replacements) {
Member:
Suggested change:
- for (const auto& pair : spec->tgt_dft_replacements) {
+ for (const auto & pair : spec->tgt_dft_replacements) {
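
For readers skimming the thread, a hedged reconstruction of the helper those hunks come from might look like the following. Only the signature, the loop header, and the tgt_dft_replacements name appear in the quoted diff; the find/replace loop body is an assumption, not the PR's exact code.

    #include <string>

    // Assumed reconstruction of replace_to_dft (referenced later in this
    // review); spec->tgt_dft_replacements is assumed to hold
    // (target-string, draft-string) pairs.
    static std::string replace_to_dft(
            struct common_speculative * spec,
            const std::string & input) {
        std::string result = input;
        for (const auto & pair : spec->tgt_dft_replacements) {
            // replace every occurrence of the target-side string with its
            // draft-side counterpart
            size_t pos = result.find(pair.first);
            while (pos != std::string::npos) {
                result.replace(pos, pair.first.size(), pair.second);
                pos = result.find(pair.first, pos + pair.second.size());
            }
        }
        return result;
    }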

Comment on lines +212 to +213
const auto *model_tgt = llama_get_model(ctx_tgt);
const auto *vocab_tgt = llama_model_get_vocab(model_tgt);
Member:
Fix the whitespace in a few more places:

Suggested change:
- const auto *model_tgt = llama_get_model(ctx_tgt);
- const auto *vocab_tgt = llama_model_get_vocab(model_tgt);
+ const auto * model_tgt = llama_get_model(ctx_tgt);
+ const auto * vocab_tgt = llama_model_get_vocab(model_tgt);

std::string text;
text = common_detokenize(ctx_tgt, prompt_tgt_main_model, true);
text = replace_to_dft(spec, text);
LOG_DBG("main->draft detokenized string: '%s'\n", text.c_str());
Member:
Follow the log pattern:

Suggested change:
- LOG_DBG("main->draft detokenized string: '%s'\n", text.c_str());
+ LOG_DBG("%s: main->draft detokenized string: '%s'\n", __func__, text.c_str());

Comment on lines +3256 to +3262
add_opt(common_arg(
{"--spec-replace"}, "TARGET", "DRAFT",
"translate the string in TARGET into DRAFT if the draft model and main model are not compatible",
[](common_params & params, const std::string & tgt, const std::string & dft) {
params.speculative.replacements.push_back({ tgt, dft });
}
).set_examples({LLAMA_EXAMPLE_SPECULATIVE, LLAMA_EXAMPLE_SERVER}));
Member:
In which cases do we need to use replacements?
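
For illustration, an invocation using the new flag might look like the following; the model files and the replacement pair are hypothetical placeholders, not an example taken from the PR:

    # -m / -md are llama-server's existing main/draft model flags;
    # --spec-replace is the flag added in this PR (TARGET then DRAFT string)
    llama-server -m target-model.gguf \
        -md draft-model.gguf \
        --spec-replace "<|im_start|>" "<s>"

Each --spec-replace pair rewrites a target-side string into its draft-side form during the detokenize/retokenize round trip, which would presumably matter when the two tokenizers spell special tokens or control sequences differently.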
