Description
Name and Version
❯ ./build/bin/llama-cli --version
version: 4295 (26a8406b)
built with cc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0 for aarch64-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-cli
Problem description & steps to reproduce
Running on my arm64 server, I updated llama.cpp yesterday and tried to run
./build/bin/llama-cli -m Qwen2.5-Coder-3B-Instruct.Q4_0_4_4.gguf ...
and got this error message (it was introduced by PR #10446, which by itself does not seem to be the change causing the slowdown):
gguf_init_from_file: tensor 'blk.0.attn_k.weight' of type 31: TYPE_Q4_0_4_4 REMOVED, use Q4_0 with runtime repacking
So I tried running the Q4_0 model instead, i.e.
./build/bin/llama-cli -m Qwen2.5-Coder-3B-Instruct.q4_0.gguf
but it is much slower. I don't believe runtime repacking is being applied; if it were, it should be about as fast as previous builds running Q4_0_4_4.
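One way to confirm whether repacking kicked in is to look at the tensor buffer lines printed while the model loads: in the good run below, llm_load_tensors reports a CPU_AARCH64 model buffer, while in the bad runs only a CPU_Mapped buffer appears. A minimal check along those lines (the log file name is just an example; on my build these lines go to stderr):
# Run briefly and capture the load log.
./build/bin/llama-cli -m Qwen2.5-Coder-3B-Instruct.q4_0.gguf -n 1 -p "hi" 2> load.log
# If runtime repacking is active, a CPU_AARCH64 buffer should be listed:
grep "model buffer size" load.log
#   llm_load_tensors: CPU_AARCH64 model buffer size = 1440.00 MiB   <- repacked
#   llm_load_tensors: CPU_Mapped model buffer size  = 1726.01 MiB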
To show what I mean, I ran four different setups: llama.cpp before and after PR #10446, each with both the q4_0_4_4 and q4_0 quantizations of the model. The models are from https://huggingface.co/bartowski/Qwen2.5-Coder-3B-Instruct-GGUF/tree/main.
llama.cpp after PR
❯ ./build/bin/llama-cli --version
version: 4295 (26a8406b)
built with cc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0 for aarch64-linux-gnu
llama.cpp before PR
❯ ./build/bin/llama-cli --version
version: 4067 (54ef9cfc)
built with cc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0 for aarch64-linux-gnu
Running the commands
./build/bin/llama-cli -m Qwen2.5-Coder-3B-Instruct.Q4_0.gguf -c 3200 --temp 0.0 --seed 0 -t 4 -n 200 -p "..."
and
./build/bin/llama-cli -m Qwen2.5-Coder-3B-Instruct.Q4_0_4_4.gguf -c 3200 --temp 0.0 --seed 0 -t 4 -n 200 -p "..."
Key: (prompt t/s, generation t/s)
| | Before PR | After PR |
|---|---|---|
| Q4_0 | 60 t/s, 10 t/s | 9 t/s, 6 t/s |
| Q4_0_4_4 | 51 t/s, 11 t/s | N/A |
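(If useful, the same comparison can also be reproduced with llama-bench, which averages several repetitions; a sketch using the same model, thread count, and token counts as above:)
# Measure prompt processing (-p) and generation (-n) throughput with 4 threads.
./build/bin/llama-bench -m Qwen2.5-Coder-3B-Instruct.Q4_0.gguf -t 4 -p 109 -n 200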
First Bad Commit
The first bad commit seems to be
❯ git bisect bad
c202cef1686182a78f8f4e253ab8d0c0ffe2fcc8 is the first bad commit
commit c202cef1686182a78f8f4e253ab8d0c0ffe2fcc8
Author: Shupei Fan <[email protected]>
Date: Thu Nov 28 20:52:03 2024 +0800
ggml-cpu: support IQ4_NL_4_4 by runtime repack (#10541)
* ggml-cpu: support IQ4_NL_4_4 by runtime repack
* ggml-cpu: add __ARM_FEATURE_DOTPROD guard
ggml/include/ggml-cpu.h | 1 +
ggml/include/ggml.h | 3 +
ggml/src/ggml-common.h | 6 +
ggml/src/ggml-cpu/ggml-cpu-aarch64.c | 321 +++++++++++++++++++++++++++++++++--
ggml/src/ggml-cpu/ggml-cpu-aarch64.h | 2 +
ggml/src/ggml-cpu/ggml-cpu.c | 27 ++-
ggml/src/ggml-cpu/ggml-cpu.cpp | 2 +-
ggml/src/ggml.c | 9 +
8 files changed, 352 insertions(+), 19 deletions(-)
All commits checked
❯ git bisect log
git bisect start
# bad: [26a8406ba9198eb6fdd8329fa717555b4f77f05f] CUDA: fix shared memory access condition for mmv (#10740)
git bisect bad 26a8406ba9198eb6fdd8329fa717555b4f77f05f
# good: [811872a59daefb25fc0c4326bcb6d8ae893c2f7c] speculative : simplify the implementation (#10504)
git bisect good 811872a59daefb25fc0c4326bcb6d8ae893c2f7c
# bad: [5e1ed95583ca552a98d8528b73e1ff81249c2bf9] grammars : add English-only grammar (#10612)
git bisect bad 5e1ed95583ca552a98d8528b73e1ff81249c2bf9
# good: [2025fa67e94358deda4740a74fe9803916cb2f60] kompute : improve backend to pass test_backend_ops (#10542)
git bisect good 2025fa67e94358deda4740a74fe9803916cb2f60
# bad: [4b3242bbea172ac0980378496fbc676d44c4f459] ggml-cpu: fix typo in gemv/gemm iq4_nl_4_4 (#10580)
git bisect bad 4b3242bbea172ac0980378496fbc676d44c4f459
# bad: [6c595676899013102fdb0aa4b06a49954300c94a] server : (tests) don't use thread for capturing stdout/stderr, bump openai client library (#10568)
git bisect bad 6c595676899013102fdb0aa4b06a49954300c94a
# bad: [76b27d29c22af03172cf211a8a31025c7c828a57] ggml : fix row condition for i8mm kernels (#10561)
git bisect bad 76b27d29c22af03172cf211a8a31025c7c828a57
# bad: [eea986f215e1dc490654d012ccf2ab62fe8f606d] cmake : fix ARM feature detection (#10543)
git bisect bad eea986f215e1dc490654d012ccf2ab62fe8f606d
# bad: [c202cef1686182a78f8f4e253ab8d0c0ffe2fcc8] ggml-cpu: support IQ4_NL_4_4 by runtime repack (#10541)
git bisect bad c202cef1686182a78f8f4e253ab8d0c0ffe2fcc8
# first bad commit: [c202cef1686182a78f8f4e253ab8d0c0ffe2fcc8] ggml-cpu: support IQ4_NL_4_4 by runtime repack (#10541)
Commands to check a good and a bad commit (a sketch for automating the check with git bisect run follows the run command below).
Compile by running rm -rdf build && cmake -B build && cmake --build build --config Release -j 4
Run using
./build/bin/llama-cli -m Qwen2.5-Coder-3B-Instruct-Q4_0.gguf -c 3200 --temp 0.0 --seed 0 -t 4 -n 40 -p "Jane comes home from work and leaves her phone in the living room. She then goes out to the shops without her phone. Her husband, Dave, moves her phone from the living room to the bedroom. When Kane gets in, where does she look for her phone?
Select from the following options:
A. The bedroom
B. The kitchen
C. The living room
D. The shed
E. Under the cooker
Think through the problem step by step before you give an answer.
"
A good run looks like
build: 4206 (2025fa67) with cc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0 for aarch64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 39 key-value pairs and 434 tensors from ../text-generation-webui/models/Qwen2.5-Coder-3B-Instruct-Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen2.5 Coder 3B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Qwen2.5-Coder
llama_model_loader: - kv 5: general.size_label str = 3B
llama_model_loader: - kv 6: general.license str = other
llama_model_loader: - kv 7: general.license.name str = qwen-research
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen2.5-3...
llama_model_loader: - kv 9: general.base_model.count u32 = 1
llama_model_loader: - kv 10: general.base_model.0.name str = Qwen2.5 Coder 3B
llama_model_loader: - kv 11: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen2.5-C...
llama_model_loader: - kv 13: general.tags arr[str,6] = ["code", "codeqwen", "chat", "qwen", ...
llama_model_loader: - kv 14: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 15: qwen2.block_count u32 = 36
llama_model_loader: - kv 16: qwen2.context_length u32 = 32768
llama_model_loader: - kv 17: qwen2.embedding_length u32 = 2048
llama_model_loader: - kv 18: qwen2.feed_forward_length u32 = 11008
llama_model_loader: - kv 19: qwen2.attention.head_count u32 = 16
llama_model_loader: - kv 20: qwen2.attention.head_count_kv u32 = 2
llama_model_loader: - kv 21: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 22: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 23: general.file_type u32 = 2
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 34: general.quantization_version u32 = 2
llama_model_loader: - kv 35: quantize.imatrix.file str = /models_out/Qwen2.5-Coder-3B-Instruct...
llama_model_loader: - kv 36: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
llama_model_loader: - kv 37: quantize.imatrix.entries_count i32 = 252
llama_model_loader: - kv 38: quantize.imatrix.chunks_count i32 = 128
llama_model_loader: - type f32: 181 tensors
llama_model_loader: - type q4_0: 248 tensors
llama_model_loader: - type q4_1: 4 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 151936
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_layer = 36
llm_load_print_meta: n_head = 16
llm_load_print_meta: n_head_kv = 2
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 256
llm_load_print_meta: n_embd_v_gqa = 256
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 3.09 B
llm_load_print_meta: model size = 1.70 GiB (4.72 BPW)
llm_load_print_meta: general.name = Qwen2.5 Coder 3B Instruct
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token = 151645 '<|im_end|>'
llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: CPU_AARCH64 model buffer size = 1440.00 MiB
llm_load_tensors: CPU_Mapped model buffer size = 1726.01 MiB
.......................................................................................
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 3200
llama_new_context_with_model: n_ctx_per_seq = 3200
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (3200) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CPU KV buffer size = 112.50 MiB
llama_new_context_with_model: KV self size = 112.50 MiB, K (f16): 56.25 MiB, V (f16): 56.25 MiB
llama_new_context_with_model: CPU output buffer size = 0.58 MiB
llama_new_context_with_model: CPU compute buffer size = 300.75 MiB
llama_new_context_with_model: graph nodes = 1266
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 4
system_info: n_threads = 4 (n_threads_batch = 4) / 4 | CPU : NEON = 1 | ARM_FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 |
sampler seed: 0
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.000
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 3200, n_batch = 2048, n_predict = 40, n_keep = 0
Jane comes home from work and leaves her phone in the living room. She then goes out to the shops without her phone. Her husband, Dave, moves her phone from the living room to the bedroom. When Kane gets in, where does she look for her phone?
Select from the following options:
A. The bedroom
B. The kitchen
C. The living room
D. The shed
E. Under the cooker
Think through the problem step by step before you give an answer.
Let's analyze the situation step by step:
1. Kane comes home from work.
2. Dave moves Kane's phone from the living room to the bedroom.
3. Kane looks for her phone in
llama_perf_sampler_print: sampling time = 6.83 ms / 149 runs ( 0.05 ms per token, 21828.30 tokens per second)
llama_perf_context_print: load time = 1654.39 ms
llama_perf_context_print: prompt eval time = 2147.19 ms / 109 tokens ( 19.70 ms per token, 50.76 tokens per second)
llama_perf_context_print: eval time = 2831.59 ms / 39 runs ( 72.60 ms per token, 13.77 tokens per second)
llama_perf_context_print: total time = 4996.44 ms / 148 tokens
A bad run looks like
build: 4207 (c202cef1) with cc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0 for aarch64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 39 key-value pairs and 434 tensors from ../text-generation-webui/models/Qwen2.5-Coder-3B-Instruct-Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen2.5 Coder 3B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Qwen2.5-Coder
llama_model_loader: - kv 5: general.size_label str = 3B
llama_model_loader: - kv 6: general.license str = other
llama_model_loader: - kv 7: general.license.name str = qwen-research
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen2.5-3...
llama_model_loader: - kv 9: general.base_model.count u32 = 1
llama_model_loader: - kv 10: general.base_model.0.name str = Qwen2.5 Coder 3B
llama_model_loader: - kv 11: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen2.5-C...
llama_model_loader: - kv 13: general.tags arr[str,6] = ["code", "codeqwen", "chat", "qwen", ...
llama_model_loader: - kv 14: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 15: qwen2.block_count u32 = 36
llama_model_loader: - kv 16: qwen2.context_length u32 = 32768
llama_model_loader: - kv 17: qwen2.embedding_length u32 = 2048
llama_model_loader: - kv 18: qwen2.feed_forward_length u32 = 11008
llama_model_loader: - kv 19: qwen2.attention.head_count u32 = 16
llama_model_loader: - kv 20: qwen2.attention.head_count_kv u32 = 2
llama_model_loader: - kv 21: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 22: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 23: general.file_type u32 = 2
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 34: general.quantization_version u32 = 2
llama_model_loader: - kv 35: quantize.imatrix.file str = /models_out/Qwen2.5-Coder-3B-Instruct...
llama_model_loader: - kv 36: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
llama_model_loader: - kv 37: quantize.imatrix.entries_count i32 = 252
llama_model_loader: - kv 38: quantize.imatrix.chunks_count i32 = 128
llama_model_loader: - type f32: 181 tensors
llama_model_loader: - type q4_0: 248 tensors
llama_model_loader: - type q4_1: 4 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 151936
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_layer = 36
llm_load_print_meta: n_head = 16
llm_load_print_meta: n_head_kv = 2
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 256
llm_load_print_meta: n_embd_v_gqa = 256
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 3.09 B
llm_load_print_meta: model size = 1.70 GiB (4.72 BPW)
llm_load_print_meta: general.name = Qwen2.5 Coder 3B Instruct
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token = 151645 '<|im_end|>'
llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: CPU_Mapped model buffer size = 1738.10 MiB
.......................................................................................
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 3200
llama_new_context_with_model: n_ctx_per_seq = 3200
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (3200) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CPU KV buffer size = 112.50 MiB
llama_new_context_with_model: KV self size = 112.50 MiB, K (f16): 56.25 MiB, V (f16): 56.25 MiB
llama_new_context_with_model: CPU output buffer size = 0.58 MiB
llama_new_context_with_model: CPU compute buffer size = 300.75 MiB
llama_new_context_with_model: graph nodes = 1266
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 4
system_info: n_threads = 4 (n_threads_batch = 4) / 4 | CPU : NEON = 1 | ARM_FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 |
sampler seed: 0
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.000
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 3200, n_batch = 2048, n_predict = 40, n_keep = 0
Jane comes home from work and leaves her phone in the living room. She then goes out to the shops without her phone. Her husband, Dave, moves her phone from the living room to the bedroom. When Kane gets in, where does she look for her phone?
Select from the following options:
A. The bedroom
B. The kitchen
C. The living room
D. The shed
E. Under the cooker
Think through the problem step by step before you give an answer.
Let's analyze the situation:
1. Kane comes home from work and finds Dave's phone in the bedroom.
2. Kane then goes to the living room to look for his own
llama_perf_sampler_print: sampling time = 6.68 ms / 149 runs ( 0.04 ms per token, 22302.05 tokens per second)
llama_perf_context_print: load time = 837.12 ms
llama_perf_context_print: prompt eval time = 11496.71 ms / 109 tokens ( 105.47 ms per token, 9.48 tokens per second)
llama_perf_context_print: eval time = 7783.24 ms / 39 runs ( 199.57 ms per token, 5.01 tokens per second)
llama_perf_context_print: total time = 19297.38 ms / 148 tokens
Relevant log output
❯ ./build/bin/llama-cli -m Qwen2.5-Coder-3B-Instruct-Q4_0.gguf -c 3200 --temp 0.0 --seed 0 -t 4 -n 200 -p "Jane comes home from work and leaves her phone in the living room. She then goes out to the shops without her phone. Her husband, Dave, moves her phone from the living room to the bedroom. When Kane gets in, where does she look for her phone?
Select from the following options:
A. The bedroom
B. The kitchen
C. The living room
D. The shed
E. Under the cooker
Think through the problem step by step before you give an answer.
"
build: 4295 (26a8406b) with cc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0 for aarch64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 39 key-value pairs and 434 tensors from ../text-generation-webui/models/Qwen2.5-Coder-3B-Instruct-Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen2.5 Coder 3B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Qwen2.5-Coder
llama_model_loader: - kv 5: general.size_label str = 3B
llama_model_loader: - kv 6: general.license str = other
llama_model_loader: - kv 7: general.license.name str = qwen-research
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen2.5-3...
llama_model_loader: - kv 9: general.base_model.count u32 = 1
llama_model_loader: - kv 10: general.base_model.0.name str = Qwen2.5 Coder 3B
llama_model_loader: - kv 11: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen2.5-C...
llama_model_loader: - kv 13: general.tags arr[str,6] = ["code", "codeqwen", "chat", "qwen", ...
llama_model_loader: - kv 14: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 15: qwen2.block_count u32 = 36
llama_model_loader: - kv 16: qwen2.context_length u32 = 32768
llama_model_loader: - kv 17: qwen2.embedding_length u32 = 2048
llama_model_loader: - kv 18: qwen2.feed_forward_length u32 = 11008
llama_model_loader: - kv 19: qwen2.attention.head_count u32 = 16
llama_model_loader: - kv 20: qwen2.attention.head_count_kv u32 = 2
llama_model_loader: - kv 21: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 22: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 23: general.file_type u32 = 2
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 34: general.quantization_version u32 = 2
llama_model_loader: - kv 35: quantize.imatrix.file str = /models_out/Qwen2.5-Coder-3B-Instruct...
llama_model_loader: - kv 36: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
llama_model_loader: - kv 37: quantize.imatrix.entries_count i32 = 252
llama_model_loader: - kv 38: quantize.imatrix.chunks_count i32 = 128
llama_model_loader: - type f32: 181 tensors
llama_model_loader: - type q4_0: 248 tensors
llama_model_loader: - type q4_1: 4 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 151936
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_layer = 36
llm_load_print_meta: n_head = 16
llm_load_print_meta: n_head_kv = 2
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 256
llm_load_print_meta: n_embd_v_gqa = 256
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 3B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 3.09 B
llm_load_print_meta: model size = 1.70 GiB (4.72 BPW)
llm_load_print_meta: general.name = Qwen2.5 Coder 3B Instruct
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token = 151645 '<|im_end|>'
llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: CPU_Mapped model buffer size = 1738.10 MiB
.......................................................................................
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 3200
llama_new_context_with_model: n_ctx_per_seq = 3200
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (3200) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CPU KV buffer size = 112.50 MiB
llama_new_context_with_model: KV self size = 112.50 MiB, K (f16): 56.25 MiB, V (f16): 56.25 MiB
llama_new_context_with_model: CPU output buffer size = 0.58 MiB
llama_new_context_with_model: CPU compute buffer size = 300.75 MiB
llama_new_context_with_model: graph nodes = 1266
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 4
system_info: n_threads = 4 (n_threads_batch = 4) / 4 | CPU : NEON = 1 | ARM_FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
sampler seed: 0
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.000
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 3200, n_batch = 2048, n_predict = 200, n_keep = 0
Jane comes home from work and leaves her phone in the living room. She then goes out to the shops without her phone. Her husband, Dave, moves her phone from the living room to the bedroom. When Kane gets in, where does she look for her phone?
Select from the following options:
A. The bedroom
B. The kitchen
C. The living room
D. The shed
E. Under the cooker
Think through the problem step by step before you give an answer.
Let's analyze the situation:
1. Kane comes home from work and finds Dave's phone in the bedroom.
2. Kane then goes to the living room to look for his own phone.
3. Since Kane's phone is in the living room, he would look for it there first.
4. If Kane's phone is not in the living room, he would look for it in the bedroom.
5. Since Kane's phone is in the bedroom, he would look for it there first.
6. If Kane's phone is not in the bedroom, he would look for it in the living room.
7. Since Kane's phone is in the bedroom, he would look for it there first.
8. If Kane's phone is not in the bedroom, he would look for it in the living room.
9. Since Kane's phone is in the bedroom, he would look for it there first.
llama_perf_sampler_print: sampling time = 31.22 ms / 309 runs ( 0.10 ms per token, 9896.55 tokens per second)
llama_perf_context_print: load time = 805.09 ms
llama_perf_context_print: prompt eval time = 12073.52 ms / 109 tokens ( 110.77 ms per token, 9.03 tokens per second)
llama_perf_context_print: eval time = 32584.03 ms / 199 runs ( 163.74 ms per token, 6.11 tokens per second)
llama_perf_context_print: total time = 44734.56 ms / 308 tokens