Faster DeepSeek FA on CUDA #408
Merged
This is a port of this PR in mainline llama.cpp. The main difference to PR #386 is that now the FA kernel takes advantage of the fact that the V tensor contains the same data as the K tensor (it is a view on the K cache with an offset given by the RoPE embedding size). Hence, one can reduce the number of loads by reusing K tiles when processing V*softmax(K*Q).
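As a rough illustration of this reuse, here is a minimal sketch (hypothetical names such as TensorView, make_k_view, make_v_view, rope_dim and nope_dim; not the actual ik_llama.cpp/ggml code):

```cpp
// Purely illustrative sketch, NOT the actual ik_llama.cpp/ggml code.
// With a (RoPE, NoPE) K-cache layout each cached row looks like
//   [ k_rope (rope_dim elements) | k_nope (nope_dim elements) ]
// and the V tensor needed for V*softmax(K*Q) is exactly the NoPE part of K,
// so V can be a view into the K cache instead of separately stored data.

#include <cstddef>

struct TensorView {
    const float * data; // first element of the view
    std::size_t   ne0;  // elements per row (innermost dimension)
    std::size_t   nb1;  // stride in elements between consecutive rows
};

// The K view covers the full row: rope_dim + nope_dim elements.
TensorView make_k_view(const float * k_cache, std::size_t rope_dim, std::size_t nope_dim) {
    return { k_cache, rope_dim + nope_dim, rope_dim + nope_dim };
}

// The V view is the NoPE tail of each row: same storage as K, offset by
// rope_dim, same row stride. A kernel that already holds a K tile in shared
// memory therefore also holds the matching V tile and can skip those loads.
TensorView make_v_view(const float * k_cache, std::size_t rope_dim, std::size_t nope_dim) {
    return { k_cache + rope_dim, nope_dim, rope_dim + nope_dim };
}
```

With this arrangement the kernel can load a K tile once and slice the corresponding V tile out of it, rather than issuing a second set of global-memory loads.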
To take advantage of this new kernel I had to change the way the K cache is organized. In mainline llama.cpp the K cache stores the (RoPE, NoPE) parts in that order, and the FA kernel assumes this arrangement. But in ik_llama.cpp prior to this PR the K cache was stored as (NoPE, RoPE). As there are several places where the views into the K cache can go wrong when building the graph, the PR should be tested more thoroughly before merging. I have tested all possible combinations of mla and fa using DeepSeek-Lite and it appears to work correctly, but still.
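A small sketch of the layout difference (again with hypothetical names, not the actual graph-building code): every view into the K cache (the RoPE part, the NoPE part, and the V view) must use the offset that matches the chosen storage order, which is where the graph construction can silently go wrong:

```cpp
// Purely illustrative sketch, NOT the actual graph-building code; the enum and
// helper names are hypothetical. It only shows why all views into the K cache
// have to agree on the storage order of the RoPE and NoPE parts.

#include <cstddef>

enum class KLayout {
    NopeRope, // ik_llama.cpp before this PR: [ NoPE | RoPE ]
    RopeNope, // mainline llama.cpp, assumed by the new FA kernel: [ RoPE | NoPE ]
};

// Element offset of the RoPE part within one cached K row.
std::size_t rope_offset(KLayout layout, std::size_t rope_dim, std::size_t nope_dim) {
    (void) rope_dim;
    return layout == KLayout::RopeNope ? std::size_t{0} : nope_dim;
}

// Element offset of the NoPE part, which is also where the V view starts.
std::size_t nope_offset(KLayout layout, std::size_t rope_dim, std::size_t nope_dim) {
    (void) nope_dim;
    return layout == KLayout::RopeNope ? rope_dim : std::size_t{0};
}
```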
The next graph shows a TG performance comparison between the main branch (black) and this PR (red). The model is DeepSeek-Lite quantized with Q4_0, the GPU is an RTX-4080. We see nice performance improvements, but also a rather peculiar behavior as a function of N_KV, the number of tokens in the KV cache.
When mla = 2 or mla = 3 this PR has no effect on PP, so the next graph compares PP speed between the main branch (black) and the PR (red) for mla = 1. For reference I have also included PP performance for mla = 3 with blue symbols. In case I have not shown a graph such as this one before, it illustrates what one gives up in terms of PP performance by using a mainline llama.cpp MLA-enabled GGUF for DeepSeek models. The difference is ~25% for N_KV = 0 and nearly a factor of 2 at 60k tokens. The PR improves mla = 1 performance by a few percent.
Finally, being curious about the peculiar TG behavior as a function of N_KV, I ran sweep-bench with the llama.cpp PR, and the next graph shows a TG performance comparison between this PR and the mainline PR. We see that the two curves align very closely, so the strange behavior is not due to me screwing up the port. I wonder if @JohannesGaessler is aware of it.