Insights: ggml-org/llama.cpp

Overview
44 Releases published by 1 person
- b5949 published Jul 21, 2025
- b5950 published Jul 21, 2025
- b5952 published Jul 21, 2025
- b5953 published Jul 21, 2025
- b5954 published Jul 21, 2025
- b5956 published Jul 22, 2025
- b5957 published Jul 22, 2025
- b5958 published Jul 22, 2025
- b5959 published Jul 22, 2025
- b5960 published Jul 22, 2025
- b5961 published Jul 22, 2025
- b5962 published Jul 22, 2025
- b5963 published Jul 22, 2025
- b5965 published Jul 23, 2025
- b5966 published Jul 23, 2025
- b5967 published Jul 23, 2025
- b5968 published Jul 23, 2025
- b5970 published Jul 23, 2025
- b5972 published Jul 23, 2025
- b5973 published Jul 23, 2025
- b5975 published Jul 24, 2025
- b5976 published Jul 24, 2025
- b5978 published Jul 24, 2025
- b5979 published Jul 24, 2025
- b5980 published Jul 24, 2025
- b5981 published Jul 24, 2025
- b5984 published Jul 24, 2025
- b5985 published Jul 24, 2025
- b5986 published Jul 25, 2025
- b5987 published Jul 25, 2025
- b5988 published Jul 25, 2025
- b5989 published Jul 25, 2025
- b5990 published Jul 25, 2025
- b5992 published Jul 25, 2025
- b5993 published Jul 25, 2025
- b5994 published Jul 25, 2025
- b5995 published Jul 26, 2025
- b5996 published Jul 26, 2025
- b5997 published Jul 26, 2025
- b5998 published Jul 27, 2025
- b5999 published Jul 27, 2025
- b6000 published Jul 27, 2025
- b6001 published Jul 27, 2025
- b6002 published Jul 27, 2025
58 Pull requests merged by 39 people
- ops : update Metal (#14912) merged Jul 28, 2025
- sync : ggml (#14911) merged Jul 28, 2025
- quantize: update README.md (#14905) merged Jul 27, 2025
- Vulkan: add ops docs (#14900) merged Jul 27, 2025
- SYCL: add ops doc (#14901) merged Jul 27, 2025
- llama : clarify comment about pp and tg graphs [no ci] (#14895) merged Jul 27, 2025
- vulkan : add fp16 support for the conv_2d kernel (#14872) merged Jul 27, 2025
- vulkan: skip empty set_rows to avoid invalid API usage (#14860) merged Jul 27, 2025
- make rope_yarn_log_mul optional for deepseek2 (#14896) merged Jul 27, 2025
- Fix kq_scale for the attention layers of PLaMo2 (#14892) merged Jul 27, 2025
- Docs: add instructions in ops.md + simplify backend csv (#14889) merged Jul 27, 2025
- HIP: Enable Matrix cores for MMQ Kernels, Enable stream-K for CDNA 3 (#14624) merged Jul 26, 2025
- CANN: Implement GLU ops (#14884) merged Jul 26, 2025
- musa: fix build warnings (unused variable) (#14869) merged Jul 26, 2025
- ggml-cpu: disable GGML_NNPA by default due to instability (#14880) merged Jul 25, 2025
- metal: SSM_SCAN performance (#14743) merged Jul 25, 2025
- opencl: add fused rms_norm_mul (#14841) merged Jul 25, 2025
- docs: update HOWTO-add-model.md for ModelBase and new model classes (#14874) merged Jul 25, 2025
- Code health: Remove invalid portPos specifiers from graph dumping to dot files (#14838) merged Jul 25, 2025
- context : restore preemptive sched reset when LLAMA_SET_ROWS=0 (#14870) merged Jul 25, 2025
- mtmd : Fix 32-bit narrowing issue in export-lora and mtmd clip (#14503) merged Jul 25, 2025
- GGML: Check for null buffers in get/set/copy tensor RPC endpoints (#14868) merged Jul 25, 2025
- sched : fix multiple evaluations of the same graph with pipeline parallelism (#14855) merged Jul 25, 2025
- musa: upgrade musa sdk to rc4.2.0 (#14498) merged Jul 24, 2025
- sync : ggml (#14858) merged Jul 24, 2025
- context : perform output reorder lazily upon access after sync (#14853) merged Jul 24, 2025
- chat : fix kimi-k2 chat template (#14852) merged Jul 24, 2025
- sycl: unified semantics of block offset calculation (#14814) merged Jul 24, 2025
- fix: restore MiniCPM inference after Granite Four changes (#14850) merged Jul 24, 2025
- docs: add libcurl-dev install hint for Linux distros (#14801) merged Jul 24, 2025
- metal : fix fusion across different encoders (#14849) merged Jul 24, 2025
- sycl: fix undefined variable in work group size check (#14843) merged Jul 24, 2025
- convert: text-only support for GLM-4.1V-9B-Thinking (#14823) merged Jul 23, 2025
- CUDA: fix overflow in FA, tune performance (#14840) merged Jul 23, 2025
- CUDA: fix compilation with GGML_CUDA_F16 (#14837) merged Jul 23, 2025
- ci : correct label refactor->refactoring (#14832) merged Jul 23, 2025
- tests : add non-cont K,V FA tests (#14756) merged Jul 23, 2025
- CUDA: fix quantized KV cache + multiple sequences (#14822) merged Jul 23, 2025
- bug fix: handle saving/loading null layers in recurrent memory (#14675) merged Jul 23, 2025
- ggml: fix loongarch quantize_row_q8_1 error (#14827) merged Jul 23, 2025
- [CANN] weight format to nz for Ascend310P3 (#14407) merged Jul 23, 2025
- CUDA: add fused rms norm (#14800) merged Jul 23, 2025
- model card yaml tab->2xspace (#14819) merged Jul 22, 2025
- vulkan: fix rms_norm_mul to handle broadcasting dim0 (#14817) merged Jul 22, 2025
- llama : add model type detection for rwkv7 7B&14B (#14816) merged Jul 22, 2025
- imatrix: add option to display importance score statistics for a given imatrix file (#12718) merged Jul 22, 2025
- Mtmd: add a way to select device for vision encoder (#14236) merged Jul 22, 2025
- cuda : implement bf16 cpy ops and enable bf16 cont (#14763) merged Jul 22, 2025
- opencl: remove unreachable return (#14806) merged Jul 22, 2025
- server : allow setting --reverse-prompt arg (#14799) merged Jul 22, 2025
- cuda: remove linking to cublasLt (#14790) merged Jul 21, 2025
- opencl : fix im2col when KW!=KH (#14803) merged Jul 21, 2025
- OpenCL: add conv2d kernel (#14403) merged Jul 21, 2025
- sycl: Fix im2col (#14797) merged Jul 21, 2025
- kleidiai: add support for get_rows (#14676) merged Jul 21, 2025
- docs : fix backends table in README.md (#14796) merged Jul 21, 2025
- vulkan/cuda: Fix im2col when KW!=KH (#14789) merged Jul 21, 2025
- llama : fix --reverse-prompt crashing issue (#14794) merged Jul 21, 2025
25 Pull requests opened by 20 people
- opencl: tiled mul_mat with local memory for f16 and f32 (#14809) opened Jul 22, 2025
- convert : handle pre-quantized models (#14810) opened Jul 22, 2025
- feat(batched): Add functionality to upload benchmark test results (#14811) opened Jul 22, 2025
- sycl: refactor quantization to q8_1 (#14815) opened Jul 22, 2025
- graph : reduce splits for recurrent and hybrid models (#14825) opened Jul 23, 2025
- test-backend-ops: enables perf/eval testing of composite ops (#14833) opened Jul 23, 2025
- SvelteKit-based WebUI (#14839) opened Jul 23, 2025
- imatrix : use GGUF by default (#14842) opened Jul 24, 2025
- mtmd : add support for Voxtral (#14862) opened Jul 24, 2025
- Adding chat template support for Granite model (#14864) opened Jul 24, 2025
- Extend test case filtering (#14865) opened Jul 24, 2025
- Support intern-s1 (#14875) opened Jul 25, 2025
- model: add hunyuan dense (#14878) opened Jul 25, 2025
- GGML: Fix leak of backend buffer memory address in RPC (#14882) opened Jul 26, 2025
- SYCL: Add set_rows support for quantized types (#14883) opened Jul 26, 2025
- imatrix: calculate activation-based statistics for new format (GGUF) imatrices (#14891) opened Jul 26, 2025
- ggml-cpu : deduplicate scalar implementations (#14897) opened Jul 27, 2025
- Add support for SmallThinker model series (#14898) opened Jul 27, 2025
- Vulkan: Fix minor debug mode issues (#14899) opened Jul 27, 2025
- Vulkan: Add Integer Dot Product mul_mat_vec shader for legacy quants (#14903) opened Jul 27, 2025
- ggml : repack block_iq4_nlx8 (AVX) (#14904) opened Jul 27, 2025
- cuda : add softcap fusion (#14907) opened Jul 27, 2025
- opencl: fixed a typo (#14908) opened Jul 27, 2025
- opencl: add ops docs (#14910) opened Jul 28, 2025
- ops : update BLAS (#14914) opened Jul 28, 2025
41 Issues closed by 17 people
- Feature Request: Support EXAONE 4.0 (#14474) closed Jul 28, 2025
- Misc. bug: [SYCL] llama-cli built by Visual Studio 2022 is not working (#14086) closed Jul 28, 2025
- Research: mmap eviction (#14154) closed Jul 28, 2025
- prismatic-vlms to gguf? (#14159) closed Jul 28, 2025
- Eval bug: MultiGPU x MultiModels = 100% GPU (#14890) closed Jul 27, 2025
- Misc. bug: using --jinja flag with server and Qwen3 models removes thinking block, still works on llama-cli (#14894) closed Jul 27, 2025
- ggml_vulkan: RADV crash on ggml_set_rows due to zero size buffer (#14845) closed Jul 27, 2025
- Eval bug: gemma-3n-E4B-it-Q8_0.gguf is speaking nonsense (#14885) closed Jul 27, 2025
- Misc. bug: llama-server webui with --jinja flag does not show thinking when using reasoning models (#14007) closed Jul 27, 2025
- Vulkan Runner Frequent Crashing under workload (#14105) closed Jul 27, 2025
- Misc. bug: --cache-reuse no longer seems to be caching prompt prefixes (#14113) closed Jul 27, 2025
- Misc. bug: "llama_context_params::swa_full = true" causes very large RAM/VRAM usage (#14123) closed Jul 27, 2025
- Misc. bug: llama-server drops multi-part content for final assistant message (#14137) closed Jul 27, 2025
- Metrics should not include : in Prometheus metric names (#14150) closed Jul 27, 2025
- Feature Request: (webui) read data from /props endpoint and use it on the webui (#11717) closed Jul 26, 2025
- Support Hybrid Models (#12331) closed Jul 26, 2025
- Eval bug: s390x GGML_NNPA=ON Generates Gibberish Tokens at Different Thread Counts (#14877) closed Jul 25, 2025
- Eval bug: Generation speed loss after b5920 (#14876) closed Jul 25, 2025
- Performance regression with multiple GPUs in commit 01612b7 (#14863) closed Jul 25, 2025
- Eval bug: LLAMA_SET_ROWS=1 gibberish output with Dual GPU offload (#14795) closed Jul 25, 2025
- Eval bug: Unusual high RAM usage on Windows when running DeepSeek V3 Q2_K_XL/IQ2_XXS, on Hybrid CPU+GPU (#13978) closed Jul 25, 2025
- Eval bug: MiniCPM4 0.5B run failed (#14094) closed Jul 25, 2025
- Eval bug: Gemma3 decode and update_slots fail with parallel slots (#14097) closed Jul 25, 2025
- Eval bug: gemma3 generates infinite "and" output after commit bf9087f (#14835) closed Jul 24, 2025
- Misc. bug: llama-server --batch-size always set to 64 (#14046) closed Jul 24, 2025
- Misc. bug: Server tests /health race conditions (#14092) closed Jul 24, 2025
- Compile bug: convert.cu (#14834) closed Jul 23, 2025
- Misc. bug: CUDA docker image - libcurl: file too short (#14813) closed Jul 23, 2025
- Misc. bug: Failed to run `llama-server` when trying to recurrence the issue #14812 (#14829) closed Jul 23, 2025
- [How to serve lookahead decoding Qwen 3] (#14057) closed Jul 23, 2025
- Eval bug: Model produces gibberish or repeated output when using `-sm row` on CUDA (#14075) closed Jul 23, 2025
- Quantize bug: Ernie4.5 MoE 300B low-bit quantization crashes (#14788) closed Jul 22, 2025
- Feature Request: Built-in Token Probability Output for Inference API (#14611) closed Jul 22, 2025
- Feature Request: Direct FP8 conversion from convert_hf_to_gguf.py (#14762) closed Jul 22, 2025
- Eval bug: RWKV inference with llama-parallel gets wrong output with lmhead offloaded to GPU (#14211) closed Jul 22, 2025
- Feature Request: Support Kimi K2 (#14642) closed Jul 22, 2025
- Misc. bug: test-backend-ops: IM2COL test sometimes fail with when KW!=KH (#14777) closed Jul 21, 2025
17 Issues opened by 17 people
- Feature Request: Implement missing ops from backends (#14909) opened Jul 28, 2025
- Eval bug: Repeated sequences with gemma3 and image recognition (#14888) opened Jul 26, 2025
- Eval bug: No kernel named rms_norm_f32_sycl (#14887) opened Jul 26, 2025
- Speed regression with -fa and -ctk (#14881) opened Jul 25, 2025
- Eval bug: cline plugin for VS Code does not work with any GGUF (#14866) opened Jul 25, 2025
- Misc. bug: slow model loading to GPU when size > 64GB (Vulkan) (#14854) opened Jul 24, 2025
- Eval bug: Embedding output differs significantly between b4712 and b4713 (#14848) opened Jul 24, 2025
- Misc. bug: Regression in unified KV cache appears after `llama.cpp` release b5912 in b5913 (#14847) opened Jul 24, 2025
- Please implement phi-3-M3-coder (#14846) opened Jul 24, 2025
- Eval bug: failed to allocate compute pp buffers (#14836) opened Jul 23, 2025
- Misc. bug: llama-server issue on Windows when compiling from source code (#14826) opened Jul 23, 2025
- Eval bug: unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF:Q2_K_XL using HIP backend (AMD MI300X) outputs `GGGGG` (#14824) opened Jul 23, 2025
- Misc. bug: llama-server embedding endpoint returns vectors with just null values after a while (#14812) opened Jul 22, 2025
- Misc. bug: Server cpp no image_data being used (#14807) opened Jul 22, 2025
- ggml_vulkan: device Vulkan0 does not support 16-bit storage. (#14805) opened Jul 21, 2025
- Misc. bug: llama-quant "found point XXX not on grid: XXXX" (#14798) opened Jul 21, 2025
64 Unresolved conversations
Conversations sometimes continue on older items that are not yet closed. Below is a list of all issues and pull requests with unresolved conversations.
- Add LLaDA 8b Diffusion model (#14771) commented on Jul 28, 2025 • 7 new comments
- Fix MinicpmV model converter and clip to avoid using hardcode. (#14750) commented on Jul 28, 2025 • 4 new comments
- examples : predicted output for text generation (#14739) commented on Jul 24, 2025 • 3 new comments
- feat: Add optional prompt processing progress streaming (#14731) commented on Jul 28, 2025 • 3 new comments
- Improve Mistral models integration with llama.cpp (#14737) commented on Jul 25, 2025 • 1 new comment
- Feature request: Graphical GGUF viewer (#6715) commented on Jul 28, 2025 • 0 new comments
- Compile bug: gcc-12: error: unrecognized command-line option '-compress-mode=size' (#14260) commented on Jul 28, 2025 • 0 new comments
- Eval bug: Unexpected empty grammar stack after accepting piece: <unused32> (#14413) commented on Jul 28, 2025 • 0 new comments
- bug: GGML_ASSERT(backend_embd != nullptr) failed error at llama.cpp:14775 (#14418) commented on Jul 28, 2025 • 0 new comments
- Some models like gemma-3n crashes - rocBLAS error: Cannot read /opt/rocm-6.4.1/lib/llvm/bin/../../../lib/rocblas/library/TensileLibrary.dat: No such file or directory for GPU arch : gfx1036 (#14421) commented on Jul 28, 2025 • 0 new comments
- Misc. bug: (#14422) commented on Jul 28, 2025 • 0 new comments
- Compile bug: Looking for C++ include rocwmma/rocwmma.hpp - not found (#14538) commented on Jul 27, 2025 • 0 new comments
- Compile bug: loop not unrolled ROCm warnings (#14776) commented on Jul 27, 2025 • 0 new comments
- Eval bug: Gemma 3n on Vulkan on Ryzen APUs produces garbled output (#14525) commented on Jul 27, 2025 • 0 new comments
- Feature Request: per-chat prompt caching (#14470) commented on Jul 27, 2025 • 0 new comments
- mtmd: Any plan for mtmd to support video input and audio output? (#14295) commented on Jul 27, 2025 • 0 new comments
- Eval bug: Program crashes during long input inference when batch size is set to 16384 (#14325) commented on Jul 27, 2025 • 0 new comments
- Eval bug: Regression: Tool calls still returned in content field as JSON string instead of tool_calls array (#14697) commented on Jul 26, 2025 • 0 new comments
- Error while converting peft finetuned merged model to gguf (#12494) commented on Jul 26, 2025 • 0 new comments
- Introduce Graph Profiler (#9659) commented on Jul 25, 2025 • 0 new comments
- Overlap CUDA graph building and processing to minimize GPU idle time and improve tokens per seconds performance. (#11867) commented on Jul 25, 2025 • 0 new comments
- [WIP]backend: Integrating QNN (Qualcomm AI Engine Direct) as a dedicated backend for Qualcomm NPUs (#12063) commented on Jul 22, 2025 • 0 new comments
- Update llama-quant.cpp llama_tensor_get_type with DeepSeek friendly modifications (#12727) commented on Jul 21, 2025 • 0 new comments
- Introduce New Lookup-Table(LUT)-Based Matrix Multiplication Method (TMAC) (#13206) commented on Jul 25, 2025 • 0 new comments
- CUDA: update build CTK version to 12.8 (#13360) commented on Jul 27, 2025 • 0 new comments
- model : jina-embeddings-v3 support (#13693) commented on Jul 21, 2025 • 0 new comments
- finetune.cpp command-line arg (#13873) commented on Jul 23, 2025 • 0 new comments
- convert: add eagle2 draft arch (#13908) commented on Jul 28, 2025 • 0 new comments
- llama : support qwen3 rerank and embeddings (#14029) commented on Jul 23, 2025 • 0 new comments
- compare-commits.sh: support both llama-bench and test-backend-ops (#14392) commented on Jul 24, 2025 • 0 new comments
- mtmd : Support jinja in libmtmd (Only for QwenVL and Qwen Omni) (#14730) commented on Jul 22, 2025 • 0 new comments
- docs : mention apt installation method (#14766) commented on Jul 21, 2025 • 0 new comments
- Compile bug: Built target undefined reference std::filesystem (#14536) commented on Jul 21, 2025 • 0 new comments
- Eval bug: When I store the model on my hard drive, llama.cpp attempts to load it and then says it's warming it up with a blank run after which it crashes the terminal session. (#14297) commented on Jul 22, 2025 • 0 new comments
- Misc. bug: Gemma3 multimodal (or all VL models?): </think> tag in the image or PDF text breaks prompt processing (or token generation?) (#14143) commented on Jul 22, 2025 • 0 new comments
- Misc. bug: -sm row results in gibberish output on HIP (ROCm 6.3.3) (#13545) commented on Jul 22, 2025 • 0 new comments
- Feature Request: Add --upload to llama-bench (#14791) commented on Jul 22, 2025 • 0 new comments
- Eval bug: Nondeterministic output with ROCm backend despite zero temperature (#14727) commented on Jul 22, 2025 • 0 new comments
- Eval bug: CUDA error: operation not supported (#14692) commented on Jul 22, 2025 • 0 new comments
- [BUG] DeepSeek V3 weight_scale_inv tensor mapping not supported in converter (#14781) commented on Jul 22, 2025 • 0 new comments
- Eval bug: KV cache stopped working in b5554 version (#14071) commented on Jul 23, 2025 • 0 new comments
- server : add "token healing" support (#5765) commented on Jul 23, 2025 • 0 new comments
- Feature Request: Ability to pack multiple GGUFs into single one (#13028) commented on Jul 23, 2025 • 0 new comments
- Refactor: (clip.cpp) identify and regroup pre-processing strategies (#13077) commented on Jul 23, 2025 • 0 new comments
- deprecate llama_batch_get_one and llama_get_logits (#4491) commented on Jul 23, 2025 • 0 new comments
- Eval bug: ROCml -> ggml_cuda_compute_forward: MUL_MAT failed when running unsloth/Kimi K2 (#14787) commented on Jul 23, 2025 • 0 new comments
- Feature Request: add tool calling for deepseek-r1-0528 (#14557) commented on Jul 23, 2025 • 0 new comments
- Feature Request: Suggest to provide armv7l version to run on Raspberry Pi devices. (#14348) commented on Jul 24, 2025 • 0 new comments
- Misc. bug: LLAMA-SERVER is 40% slower than LLAMA-CLI when using identical parameters including -ot option for tensor offloading (#14201) commented on Jul 24, 2025 • 0 new comments
- Feature Request: Support GLM-4.1V-9B-Thinking (#14495) commented on Jul 24, 2025 • 0 new comments
- Regarding the build for 8060S (gfx1151): (#14734) commented on Jul 24, 2025 • 0 new comments
- Misc. bug: Failure to allocate buffer with ROCm 6.4 (#14178) commented on Jul 25, 2025 • 0 new comments
- Misc. bug: RUNPATH properties are not properly set (#13740) commented on Jul 25, 2025 • 0 new comments
- Eval bug: terminate called after throwing an instance of 'std::runtime_error' what(): Unexpected empty grammar stack after accepting piece: [control_36] (#13690) commented on Jul 25, 2025 • 0 new comments
- Misc. bug: DeepSeek-R1 0528 671b:Q4_K_XL think tags do not close sometimes (#14679) commented on Jul 25, 2025 • 0 new comments
- Feature Request: Add support for Kokoro TTS (#11050) commented on Jul 25, 2025 • 0 new comments
- Feature Request: Add TPU/Hardware Accelerator Support (e.g., Google Coral, Hailo) to llama.cpp (#11603) commented on Jul 25, 2025 • 0 new comments
- Misc. bug: missing messages in JSON export via llama-server web UI (#13552) commented on Jul 25, 2025 • 0 new comments
- Eval bug: Unable to run with Qwen3 model on rocm with gfx1100, but works on cpu (#14696) commented on Jul 25, 2025 • 0 new comments
- Feature Request: allow running llama with an idle (lowest) priority as well (#14382) commented on Jul 26, 2025 • 0 new comments
- Compile error for ggml_gemv_q4_K_8x8_q8_K on Intel x86_64 MacOS (AVX2) (#14372) commented on Jul 26, 2025 • 0 new comments
- Misc. bug: Inconsistent Gemma3 implementation in rope factor (#14367) commented on Jul 26, 2025 • 0 new comments
- Feature Request: Add support for moonshotai/Kimi-VL-A3B-Instruct (#14318) commented on Jul 26, 2025 • 0 new comments
- Eval bug: [CANN]AutoDL Ascend 910B instance running DeepSeek-r1 32B_Q8 error (#14291) commented on Jul 26, 2025 • 0 new comments