
Faster DeepSeek FA on CUDA #408


Merged: 3 commits, May 12, 2025
Conversation

ikawrakow (Owner)

This is a port of the corresponding PR in mainline llama.cpp.

The main difference from PR #386 is that the FA kernel now takes advantage of the fact that the V tensor contains the same data as the K tensor (it is a view on the K cache with an offset given by the RoPE embedding size). Hence, one can reduce the number of loads by reusing K tiles when computing V*softmax(K*Q).
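To make the reuse concrete, here is a minimal CPU sketch (illustrative names, naive non-streaming softmax, not the actual CUDA kernel), assuming the mainline layout where each K-cache row is [RoPE | NoPE] and a V row is just the NoPE part of the same row:

```cpp
#include <cmath>
#include <vector>

// Single-query, single-head attention where V is a view into the K cache at
// offset n_rope. Each cache row is read once and feeds both Q*K^T and the
// V*softmax accumulation, which is the load the new kernel saves.
void fa_tile_reuse_sketch(const float *k_cache,   // [n_kv, d_k], rows = [RoPE | NoPE]
                          const float *q,         // [d_k]
                          float       *out,       // [d_v], with d_v = d_k - n_rope
                          int n_kv, int d_k, int n_rope) {
    const int d_v = d_k - n_rope;
    std::vector<float> acc(d_v, 0.0f);
    float sum = 0.0f;
    const float scale = 1.0f / std::sqrt((float)d_k);

    for (int i = 0; i < n_kv; ++i) {
        const float *k_row = k_cache + (size_t)i * d_k;   // loaded once per position
        // Q*K^T over the full row (RoPE + NoPE parts)
        float s = 0.0f;
        for (int c = 0; c < d_k; ++c) s += q[c] * k_row[c];
        const float p = std::exp(scale * s);              // unnormalized softmax weight
        sum += p;
        // V*softmax: the V row is the same memory as the NoPE part of k_row,
        // so no separate V load is needed.
        const float *v_row = k_row + n_rope;
        for (int c = 0; c < d_v; ++c) acc[c] += p * v_row[c];
    }
    for (int c = 0; c < d_v; ++c) out[c] = acc[c] / sum;
}
```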

To take advantage of this new kernel I had to change the way the K cache is organized. In mainline llama.cpp the K cache stores (RoPE, NoPE) parts in that order, and the FA kernel assumes this arrangement. But in ik_llama.cpp prior to this PR the K cache was stored as (NoPE, RoPE). As there are several places where the views into the K cache can go wrong when building the graph, the PR should be tested more thoroughly before merging. I have tested all possible combinations of mla and fa using DeepSeek-Lite and it appears to work correctly, but still.
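As a rough sketch of what the layout change means for the graph-building views (hypothetical helper names, not the actual ik_llama.cpp code), the element offsets at which the RoPE part, the NoPE part, and hence the V view start within a K-cache row differ between the two arrangements:

```cpp
#include <cstddef>

// Element offsets of the sub-rows within one K-cache row. Only meant to show
// which views change when the layout is flipped.
struct KCacheRowViews {
    size_t rope_offset;  // where the RoPE part starts
    size_t nope_offset;  // where the NoPE (latent) part starts
    size_t v_offset;     // where the V view starts (V = the NoPE part)
};

// Mainline llama.cpp layout, now also used by this PR: row = [RoPE | NoPE].
// The new FA kernel assumes this, so the V view sits at offset n_rope.
KCacheRowViews views_rope_first(size_t n_rope) {
    return { 0, n_rope, n_rope };
}

// ik_llama.cpp layout before this PR: row = [NoPE | RoPE].
KCacheRowViews views_nope_first(size_t n_nope) {
    return { n_nope, 0, 0 };
}
```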

The next graph shows a TG performance comparison between the main branch (black) and this PR (red). The model is DeepSeek-Lite quantized with Q4_0, the GPU is an RTX 4080. We see nice performance improvements, but also a rather peculiar behavior as a function of N_KV, the number of tokens in the KV cache.

[Figure z10a: TG performance vs N_KV, main branch (black) vs this PR (red)]

When mla = 2 or mla = 3 this PR has no effect on PP, so the next graph compares PP speed between the main branch (black) and the PR (red) for mla = 1. For reference I have also included PP performance for mla = 3 with blue symbols. In case I have not shown a graph such as this one before, it illustrates what one gives up in terms of PP performance by using a mainline llama.cpp MLA-enabled GGUF for DeepSeek models: the difference is ~25% for N_KV = 0 and nearly a factor of 2 at 60k tokens. The PR improves mla = 1 performance by a few percent.

[Figure z10b: PP performance for mla = 1, main branch (black) vs this PR (red), with mla = 3 (blue) for reference]

Finally, being curious about the peculiar TG behavior as a function of N_KV, I ran sweep-bench with the llama.cpp PR, and the next graph shows a TG performance comparison between this PR and the mainline PR. We see that the two curves align very closely, so the strange behavior is not due to me screwing up the port. I wonder if @JohannesGaessler is aware.

[Figure z10c: TG performance, this PR vs the mainline llama.cpp PR]

Kawrakow added 3 commits on May 11, 2025. One commit message notes: "Does not work because the RoPE portion is stored at the end in our case, while in mainline it is stored at the beginning, and the FA kernel assumes that."
@JohannesGaessler

An RTX 4080 has 76 streaming multiprocessors; the CUDA code assigns KV slices to SMs in chunks of size 256. So every 76*256 = 19456 tokens the size of the biggest workload across all of the SMs increases and there is a dip in performance. These so-called quantization effects are much more noticeable with compute than with I/O, so they become more pronounced if the I/O of a kernel is optimized.
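A quick back-of-the-envelope model (assuming only the 76 SMs and 256-token chunks mentioned above, and nothing else about the kernel) reproduces the step pattern:

```cpp
#include <cstdio>

// Number of KV chunks the busiest SM has to process for a given KV length.
// The kernel's tail is bounded by this, so it steps up every n_sm*chunk tokens.
int max_chunks_per_sm(int n_kv, int n_sm = 76, int chunk = 256) {
    const int n_chunks = (n_kv + chunk - 1) / chunk;   // total KV chunks
    return (n_chunks + n_sm - 1) / n_sm;               // chunks on the busiest SM
}

int main() {
    for (int n_kv = 4096; n_kv <= 65536; n_kv += 4096)
        std::printf("N_KV = %5d -> busiest SM processes %d chunk(s)\n",
                    n_kv, max_chunks_per_sm(n_kv));
    // The jump from 1 to 2 chunks happens just past 76*256 = 19456 tokens, from
    // 2 to 3 just past 38912, matching the dips in the TG curves.
}
```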

@Panchovix

Just tested on DeepSeek V3 0324 Q2_K_XL and it seems to have improved my TG t/s by about 1-2% (I guess with offloading there isn't much difference), but testing a smaller model (DeepSeek2 16B) on a single GPU (5090) I got about an 8-12% speedup, so pretty nice!

This is on top of PR #405.

Now I'm gonna try #409 on top of that PR and this one.
