When using batched decoding with >1 parallel sequences, llama.cpp produces nonsensical outputs. Here is an example:
```
MacintoBookPro3:llama.cpp Ari🍉 ./batched openhermes-2.5-mistral-7b.Q4_0.gguf "<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=" 2 110 80
[...]
main: generating 2 sequences ...
main: stream 0 finished at n_cur = 110
main: stream 1 finished at n_cur = 110
sequence 0:
<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=2+2020202+2<|/im_end|>2<|<|
## 2+20+202022|0202020202202020202020202
sequence 1:
<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=0\n<|im assistant|>
02+02+2+20+2+20+20+20+2+20+2+0+2+2+02020+202+2+2+2+2+
```
However, if I use only 1 parallel sequence instead of 2, the output becomes reasonable:
```
MacintoBookPro3:llama.cpp Ari🍉 ./batched openhermes-2.5-mistral-7b.Q4_0.gguf "<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=" 1 110 80
[...]
<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=40<|im_end|>\n\n<|im_start|>user\nCan you explain how you got the answer?<|im_end|>\n\n<|im_start|>assistant\nSure! To find the sum of 20 and 2
```
I manually bisected and found that the problem was introduced by @ggerganov's change d7b800b (#4280). Indeed, after reverting the `GGML_PAD` that was added to `kv_self.n`, the model output becomes reasonable even with multiple batched sequences:
```
MacintoBookPro3:llama.cpp Ari🍉 ./batched openhermes-2.5-mistral-7b.Q4_0.gguf "<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=" 2 110 80
[...]
main: generating 2 sequences ...
main: stream 0 finished at n_cur = 110
main: stream 1 finished at n_cur = 110
sequence 0:
<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=40<|im_end|>\n\n<|im_start|>user\nWhat is 20-20?<|im_end|>\n<|im_start|>assistant\n20-20=0<|im_end|>\n
sequence 1:
<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=40<|im_end|>\n<|im_start|>user\nWhat is 20*20?<|im_end|>\n<|im_start|>assistant\n20*20=400<|im_end|>\n
```
I'm not familiar enough with the details here to understand the utility or necessity of the `GGML_PAD` operation. Any idea why it causes this issue? Should we perhaps omit it for Metal specifically?
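For anyone skimming: my (possibly wrong) understanding is that `GGML_PAD(x, 32)` rounds `x` up to the next multiple of 32, so the change enlarges the effective KV range past the last occupied cell. Here is a minimal sketch of that reading, with made-up `cell_max`/`n_ctx` values purely for illustration; the exact expression in llama.cpp may differ:

```cpp
#include <algorithm>
#include <cstdio>

// My reading of ggml's GGML_PAD: round x up to the next multiple of n
// (n a power of two), e.g. GGML_PAD(45, 32) == 64.
#define GGML_PAD(x, n) (((x) + (n) - 1) & ~((n) - 1))

int main() {
    // Hypothetical values for illustration only: cell_max stands in for the
    // highest occupied KV cache cell, n_ctx for the context size.
    int cell_max = 45;
    int n_ctx    = 512;

    // Roughly what d7b800b does: pad the effective KV range to a multiple of 32.
    int n_padded   = std::min(n_ctx, std::max(32, GGML_PAD(cell_max, 32)));

    // Roughly what my revert does: use the unpadded cell count.
    int n_reverted = std::min(n_ctx, std::max(32, cell_max));

    printf("padded: %d, reverted: %d\n", n_padded, n_reverted); // padded: 64, reverted: 45
}
```

If that reading is correct, the padding only enlarges the KV slice that participates in attention; I assume it was added for alignment/performance reasons, so presumably the padded cells should be masked out on Metal rather than the padding removed outright.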
Notes:
- I am testing using Metal on a MacBook Pro with M2 Max
- Using the OpenHermes 2.5 (Mistral) model at Q4_0 quantization; the problem also occurs with other quants (tested Q8_0)
- I wonder if there is an additional issue here: even after reverting the breaking change, the outputs with parallelization are still consistently different from the outputs without parallelization.
Thank you for all of the wonderful work going into this project!