
llama-server : implement universal assisted decoding #12635


Open · wants to merge 24 commits into master

Conversation

@g2mt commented Mar 28, 2025

This pull request implements universal assisted decoding in llama-server. This is a method for performing speculative decoding with a draft model whose tokenizer is incompatible with the main model's, by decoding and re-encoding the generated text between the two models.

It currently works, but some improvements can be made.

  • Token healing could be applied to fix any weirdness that occurs when the draft model generates tokens that don't lie on a word boundary (not sure how much this affects performance).
  • The translation process could be cached to improve sampling time; however, this might require substantial refactoring.
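
For readers new to the idea, here is a minimal sketch of the translation step, using the common_tokenize/common_detokenize helpers from llama.cpp's common library. The function name and call sites are illustrative only; this is not the code added by the PR.

    #include "common.h" // common_tokenize / common_detokenize (also pulls in llama.h)

    // Illustrative sketch (not the PR's code): map target-model tokens to
    // draft-model tokens by round-tripping through text.
    static std::vector<llama_token> translate_tgt_to_dft(
            llama_context * ctx_tgt,
            llama_context * ctx_dft,
            const std::vector<llama_token> & tokens_tgt) {
        // 1. detokenize with the target model's vocabulary
        std::string text = common_detokenize(ctx_tgt, tokens_tgt, /* special */ true);

        // 2. optionally apply user-supplied string replacements here to
        //    reconcile differences such as mismatched special-token spellings

        // 3. re-tokenize with the draft model's vocabulary
        return common_tokenize(ctx_dft, text, /* add_special */ false, /* parse_special */ true);
    }

Per the PR description, the same round trip runs in the opposite direction so that the draft model's proposed continuation can be verified by the target model in its own token space.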

@jukofyork (Collaborator)

This looks really interesting! It's surprising how much crossover there is between many models' tokenisers.

The github-actions bot added a batch of unrelated labels on Jun 28, 2025; @CISC removed them on Jul 8, 2025.
@CISC requested a review from Copilot on July 28, 2025.
Copilot AI left a comment:

Pull Request Overview

This PR implements universal assisted decoding in llama-server, enabling speculative decoding with draft models whose tokenizers are incompatible with the main model's. The implementation translates tokens between the two models by detokenizing and retokenizing text.

Key changes:

  • Removes strict vocabulary compatibility checks between draft and target models
  • Adds token translation mechanism through text conversion when vocabularies are incompatible
  • Introduces string replacement functionality for handling tokenizer differences

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Summary per file:

  • tools/server/server.cpp: Updates the server to handle incompatible vocabularies and configure token replacements
  • examples/speculative/speculative.cpp: Removes vocabulary compatibility validation checks
  • examples/speculative-simple/speculative-simple.cpp: Removes compatibility checks and updates speculative initialization
  • common/speculative.h: Updates function signatures to accept both target and draft contexts
  • common/speculative.cpp: Implements the core universal assisted decoding logic with token translation
  • common/common.h: Adds a replacements field to the speculative parameters
  • common/arg.cpp: Adds command-line argument parsing for token replacements
Comments suppressed due to low confidence (1)

common/speculative.cpp:30

  • The comment 'ctx_main' doesn't match the struct member name 'ctx_tgt'. This should be '/* .ctx_tgt = */' to be consistent with the actual member name.
        /* .ctx_main   = */ ctx_tgt,

@CISC requested a review from ggerganov and removed review requests for ggerganov, ngxson and JohannesGaessler on July 29, 2025.
struct common_speculative * spec,
const std::string& input) {
std::string result = input;
for (const auto& pair : spec->tgt_dft_replacements) {
Member:
Suggested change:
- for (const auto& pair : spec->tgt_dft_replacements) {
+ for (const auto & pair : spec->tgt_dft_replacements) {
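
For readers skimming the thread, a hedged reconstruction of the helper those hunks come from might look like the following. Only the signature, the loop header, and the tgt_dft_replacements name appear in the quoted diff; the find/replace loop body is an assumption, not the PR's exact code.

    #include <string>

    // Assumed reconstruction of replace_to_dft (referenced later in this
    // review); spec->tgt_dft_replacements is assumed to hold
    // (target-string, draft-string) pairs.
    static std::string replace_to_dft(
            struct common_speculative * spec,
            const std::string & input) {
        std::string result = input;
        for (const auto & pair : spec->tgt_dft_replacements) {
            // replace every occurrence of the target-side string with its
            // draft-side counterpart
            size_t pos = result.find(pair.first);
            while (pos != std::string::npos) {
                result.replace(pos, pair.first.size(), pair.second);
                pos = result.find(pair.first, pos + pair.second.size());
            }
        }
        return result;
    }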

Comment on lines +212 to +213
const auto *model_tgt = llama_get_model(ctx_tgt);
const auto *vocab_tgt = llama_model_get_vocab(model_tgt);
Member:
Fix the whitespace in a few more places:

Suggested change:
- const auto *model_tgt = llama_get_model(ctx_tgt);
- const auto *vocab_tgt = llama_model_get_vocab(model_tgt);
+ const auto * model_tgt = llama_get_model(ctx_tgt);
+ const auto * vocab_tgt = llama_model_get_vocab(model_tgt);

std::string text;
text = common_detokenize(ctx_tgt, prompt_tgt_main_model, true);
text = replace_to_dft(spec, text);
LOG_DBG("main->draft detokenized string: '%s'\n", text.c_str());
Member:
Follow the log pattern:

Suggested change:
- LOG_DBG("main->draft detokenized string: '%s'\n", text.c_str());
+ LOG_DBG("%s: main->draft detokenized string: '%s'\n", __func__, text.c_str());

Comment on lines +3256 to +3262
add_opt(common_arg(
{"--spec-replace"}, "TARGET", "DRAFT",
"translate the string in TARGET into DRAFT if the draft model and main model are not compatible",
[](common_params & params, const std::string & tgt, const std::string & dft) {
params.speculative.replacements.push_back({ tgt, dft });
}
).set_examples({LLAMA_EXAMPLE_SPECULATIVE, LLAMA_EXAMPLE_SERVER}));
Member:
In which cases do we need to use replacements?
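
For illustration, an invocation using the new flag might look like the following; the model files and the replacement pair are hypothetical placeholders, not an example taken from the PR:

    # -m / -md are llama-server's existing main/draft model flags;
    # --spec-replace is the flag added in this PR (TARGET then DRAFT string)
    llama-server -m target-model.gguf \
        -md draft-model.gguf \
        --spec-replace "<|im_start|>" "<s>"

Each --spec-replace pair rewrites a target-side string into its draft-side form during the detokenize/retokenize round trip, which would presumably matter when the two tokenizers spell special tokens or control sequences differently.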
