When using batched decoding with >1 parallel sequences, llama.cpp produces nonsensical outputs. Here is an example:
```
MacintoBookPro3:llama.cpp Ari🍉 ./batched openhermes-2.5-mistral-7b.Q4_0.gguf "<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=" 2 110 80
[...]
main: generating 2 sequences ...
main: stream 0 finished at n_cur = 110
main: stream 1 finished at n_cur = 110
sequence 0:
<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=2+2020202+2<|/im_end|>2<|<|
## 2+20+202022|0202020202202020202020202
sequence 1:
<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=0\n<|im assistant|>
02+02+2+20+2+20+20+20+2+20+2+0+2+2+02020+202+2+2+2+2+
```
However, if I use only 1 parallel sequence instead of 2, the output becomes reasonable:
```
MacintoBookPro3:llama.cpp Ari🍉 ./batched openhermes-2.5-mistral-7b.Q4_0.gguf "<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=" 1 110 80
[...]
<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=40<|im_end|>\n\n<|im_start|>user\nCan you explain how you got the answer?<|im_end|>\n\n<|im_start|>assistant\nSure! To find the sum of 20 and 2
```
I manually bisected and found that the problem was introduced by @ggerganov's change d7b800b (#4280). Indeed, after reverting the `GGML_PAD` that was added to `kv_self.n`, the model output becomes reasonable even with multiple batched sequences:
```
MacintoBookPro3:llama.cpp Ari🍉 ./batched openhermes-2.5-mistral-7b.Q4_0.gguf "<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=" 2 110 80
[...]
main: generating 2 sequences ...
main: stream 0 finished at n_cur = 110
main: stream 1 finished at n_cur = 110
sequence 0:
<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=40<|im_end|>\n\n<|im_start|>user\nWhat is 20-20?<|im_end|>\n<|im_start|>assistant\n20-20=0<|im_end|>\n
sequence 1:
<|im_start|>user\nWhat is 20+20?<|im_end|>\n<|im_start|>assistant\n20+20=40<|im_end|>\n<|im_start|>user\nWhat is 20*20?<|im_end|>\n<|im_start|>assistant\n20*20=400<|im_end|>\n
```
I'm not familiar enough with the details here to understand the utility or necessity of the `GGML_PAD` operation. Any idea why it causes this issue? Should we perhaps omit it for Metal specifically?
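For anyone skimming: my (possibly wrong) understanding is that `GGML_PAD(x, 32)` rounds `x` up to the next multiple of 32, so the change enlarges the effective KV range past the last occupied cell. Here is a minimal sketch of that reading, with made-up `cell_max`/`n_ctx` values purely for illustration; the exact expression in llama.cpp may differ:

```cpp
#include <algorithm>
#include <cstdio>

// My reading of ggml's GGML_PAD: round x up to the next multiple of n
// (n a power of two), e.g. GGML_PAD(45, 32) == 64.
#define GGML_PAD(x, n) (((x) + (n) - 1) & ~((n) - 1))

int main() {
    // Hypothetical values for illustration only: cell_max stands in for the
    // highest occupied KV cache cell, n_ctx for the context size.
    int cell_max = 45;
    int n_ctx    = 512;

    // Roughly what d7b800b does: pad the effective KV range to a multiple of 32.
    int n_padded   = std::min(n_ctx, std::max(32, GGML_PAD(cell_max, 32)));

    // Roughly what my revert does: use the unpadded cell count.
    int n_reverted = std::min(n_ctx, std::max(32, cell_max));

    printf("padded: %d, reverted: %d\n", n_padded, n_reverted); // padded: 64, reverted: 45
}
```

If that reading is correct, the padding only enlarges the KV slice that participates in attention; I assume it was added for alignment/performance reasons, so presumably the padded cells should be masked out on Metal rather than the padding removed outright.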
Notes:
- I am testing using Metal on a MacBook Pro with M2 Max
- Using the OpenHermes 2.5 (Mistral) model at Q4_0 quantization; the problem also occurs with other quants (tested Q8_0)
- I wonder if there is an additional issue here: even after reverting the breaking change, the outputs with parallelization are still consistently different from the outputs without parallelization.
Thank you for all of the wonderful work going into this project!