
Bug: Perf Regression in PP throughput after Pull #461 (...R4 CUDA impl) #474

Open
usrlocalben opened this issue May 30, 2025 · 4 comments

usrlocalben commented May 30, 2025

What happened?

While testing out an IQ4 quant of R1-0528 I noticed that PP throughput on my system dropped, e.g. from 75 t/s to 12 t/s, i.e. basically equal to TG throughput. With IQ4 and Q8 shared on the GPU I expect PP > 60 t/s.

I compared with an all-Q8_0 quant and see what I expect, PP > 50 t/s (on main/HEAD today).

I bisected, and found that this problem was introduced with Pull #461 (commit 1429291).

However, my IQ4 quant doesn't have any _R4 tensors. It's Q8 shared, and IQ4_K for the remaining tensors.

The presence or absence of --run-time-repack neither causes nor avoids it.

CUDA device is RTX 8000 (Turing)

I glanced over the commit and mostly see changes that seem clearly restricted to the _R4-suffix components. There are some shared parts where n_interleaved is propagated down the template stack (iqk_mmvq.cu), but at a casual glance nothing strikes me as odd; then again, I'm not that familiar with the code. The dot-product interface did change to a mutating one that takes an accumulator pointer (previously it returned the computed result), and that could be relevant.
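
To illustrate what I mean by the interface change, here is a minimal sketch with made-up names; it is not the actual iqk_mmvq.cu code, just the shape of the old and new dot-product helpers:

// Old style (sketch): the dot-product helper returns its partial result.
__device__ float vec_dot_sketch_old(const float * x, const float * y, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) sum += x[i] * y[i];
    return sum;
}

// New style (sketch): the helper mutates a caller-owned accumulator
// instead of returning the computed value.
__device__ void vec_dot_sketch_new(const float * x, const float * y, int n, float * acc) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) sum += x[i] * y[i];
    *acc += sum;
}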

An aside, but maybe related: there were recent PRs related to mla/fa (Pulls #386 and #408) with some vague language regarding Turing support. I say vague because #386 indicates Turing is not supported, then #408 indicates support is extended to Turing, but I'm not sure they're referring to the same thing, and the changes in #408 don't seem very significant. It's not clear what the proper mla/fa settings should be on Turing at this time. I currently use -mla 2 -fa.

What operating system are you seeing the problem on?

Linux

@ikawrakow (Owner)

> However, my IQ4 quant doesn't have any _R4 tensors. It's Q8 shared, and IQ4_K for the remaining tensors.
>
> The presence or absence of --run-time-repack neither causes nor avoids it.

To make sure I understand correctly: prior to #461 you observed the same good PP performance irrespective of using or not using --run-time-repack, but after #461 you observe the same bad PP performance with or without --run-time-repack?

@ikawrakow (Owner)

Please also provide your full command line. This really makes it easier to diagnose the problem.

@usrlocalben (Author)

ik_llama.cpp/build/bin/llama-server \
  -mla 2 -fa -fmoe \
  -amb 512 \
  -c 65536 \
  -np 1 \
  --n-gpu-layers 99 \
  -ctk q8_0 \
  --run-time-repack \
  -ot "blk\.3\.ffn_up_exps=CUDA0, blk\.3\.ffn_gate_exps=CUDA0" \
  -ot "blk\.4\.ffn_up_exps=CUDA0, blk\.4\.ffn_gate_exps=CUDA0" \
  -ot "blk\.5\.ffn_up_exps=CUDA0, blk\.5\.ffn_gate_exps=CUDA0" \
  -ot "blk\.6\.ffn_up_exps=CUDA0, blk\.6\.ffn_gate_exps=CUDA0" \
  -ot "blk\.7\.ffn_up_exps=CUDA0, blk\.7\.ffn_gate_exps=CUDA0" \
  -ot "ffn_down_exps=CPU, ffn_up_exps=CPU, gate_exps=CPU" \
  --host 127.0.0.1 --port 9999 \
  --temp 0.6 --top-p 0.95 \
  -m /path/to/model/DeepSeek-R1-0528-IQ4/data.gguf

commit 24c010b3 (last known good)

rtr=yes
prompt eval time     =  161791.56 ms / 10188 tokens (   15.88 ms per token,    62.97 tokens per second)
generation eval time =  115078.31 ms /  1012 runs   (  113.71 ms per token,     8.79 tokens per second)

rtr=no
prompt eval time     =  612061.95 ms / 10188 tokens (   60.08 ms per token,    16.65 tokens per second)
generation eval time =  144322.65 ms /  1268 runs   (  113.82 ms per token,     8.79 tokens per second)


commit 14292913 (CUDA _R4)

rtr=yes
prompt eval time     =  937934.38 ms / 10188 tokens (   92.06 ms per token,    10.86 tokens per second)
generation eval time =  122195.15 ms /  1065 runs   (  114.74 ms per token,     8.72 tokens per second)

rtr=no
prompt eval time     =  613312.38 ms / 10188 tokens (   60.20 ms per token,    16.61 tokens per second)
generation eval time =  163612.05 ms /  1437 runs   (  113.86 ms per token,     8.78 tokens per second)

@ikawrakow (Owner)

Observations:

  • rtr=no has the same performance on 1429291 and on 24c010b. In both versions, when rtr=no, tensors stored in RAM get offloaded to the GPU to perform the matrix multiplications.
  • rtr=no is much slower than rtr=yes on the last known good 24c010b. On that version, when rtr=yes, tensors stored in RAM are not offloaded to the GPU because the CUDA back-end reports that it does not support matrix multiplications for the repacked types.

Conclusion: your PCI-E speed is very low, resulting in low PP performance when tensors stored in RAM are offloaded to the GPU. #461 implemented CUDA matrix multiplications for repacked tensors, so after that PR all tensors stored in RAM get offloaded to the GPU for matrix multiplications, and performance drops.
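
Schematically, the decision that changed looks like this; the names below are illustrative only, not the real back-end API:

#include <cstdio>

// Illustrative tensor types: regular quants plus an "_R4"-style repacked type.
enum class Type { Q8_0, IQ4_K, REPACKED_R4 };

// Before #461 (illustrative): the CUDA back-end reports no support for
// repacked types, so their matrix multiplications stay on the CPU and no
// weights cross the PCI-E bus during prompt processing.
static bool cuda_supports_mul_mat_before_461(Type t) {
    return t != Type::REPACKED_R4;
}

// After #461 (illustrative): repacked types are supported too, so the
// scheduler offloads these matrix multiplications, copying the CPU-resident
// weights to the GPU for every batch. On a slow PCI-E link the copy
// dominates and PP throughput drops.
static bool cuda_supports_mul_mat_after_461(Type) {
    return true;
}

int main() {
    const Type t = Type::REPACKED_R4;
    std::printf("offload repacked mul_mat: before #461 = %s, after #461 = %s\n",
                cuda_supports_mul_mat_before_461(t) ? "yes" : "no",
                cuda_supports_mul_mat_after_461(t)  ? "yes" : "no");
    return 0;
}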

Mitigations:

  • If possible, use large u-batches. This allows more work to be done per byte of data copied to the GPU. If you have enough VRAM, -b 4096 -ub 4096 will maximize PP performance.
  • Avoid offloading tensors stored in RAM to the GPU. This is accomplished with -op 26,0,27,0,29,0 where
    • 26,0 disables offloading matrix multiplications
    • 27,0 disables offloading indirect matrix multiplications (used in MoE models)
    • 29,0 disables offloading fused ffn_up+ffn_gate operations (you get these in MoE models when using -fmoe)
  • You may want to experiment with -op (op stands for offload policy, see PR #405, GPU offload policy)
    • -op 29,0 -rtr should result in the exact same performance as you had on 24c010b
    • If your PCI-E speed is so low as to give such bad performance with GPU offload enabled, adding -op 27,0 to the above may improve performance compared to what you had on 24c010b

Note that for most people, not using -op and instead using large batches with -b 4096 -ub 4096 maximizes PP performance.
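
For example, one of the following could be appended to the llama-server command posted above (everything else unchanged); which one wins depends on the actual PCI-E bandwidth and on how much VRAM is left for the larger compute buffers:

  -b 4096 -ub 4096               (larger batches: more work per byte copied to the GPU)
  -op 29,0 --run-time-repack     (should match the performance on 24c010b, per the note above)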
