
Bug: Perf Regression in PP throughput after Pull #461 (...R4 CUDA impl) #474

Open
usrlocalben opened this issue May 30, 2025 · 4 comments

usrlocalben commented May 30, 2025

What happened?

While testing out an IQ4 quant of R1-0528 I noticed that PP throughput on my system dropped, e.g. from 75 t/s to 12 t/s, i.e. basically equal to TG throughput. With IQ4 and Q8 shared on the GPU I expect PP > 60 t/s.

I compared with an all-Q8_0 quant and see what I expect, PP > 50 t/s (on main/HEAD today).

I bisected, and found that this problem was introduced with Pull #461 (commit 1429291).

However, my IQ4 quant doesn't have any _R4 tensors. It's Q8 shared, and IQ4_K for the remaining tensors.

The presence or absence of --run-time-repack neither causes nor avoids it.

CUDA device is RTX 8000 (Turing)

I glanced over the commit and mostly see changes that seem clearly restricted to the _R4-suffix components. There are some shared parts where n_interleaved is propagated down the template stack (iqk_mmvq.cu), but at a casual glance nothing strikes me as odd; then again, I'm not that familiar with the code. The dot-product interface did change to a mutating one that takes an accumulator pointer (previously it returned the computed result), and that could be relevant.
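
To illustrate what I mean by the interface change, here is a minimal sketch with made-up names; it is not the actual iqk_mmvq.cu code, just the shape of the old and new dot-product helpers:

// Old style (sketch): the dot-product helper returns its partial result.
__device__ float vec_dot_sketch_old(const float * x, const float * y, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) sum += x[i] * y[i];
    return sum;
}

// New style (sketch): the helper mutates a caller-owned accumulator
// instead of returning the computed value.
__device__ void vec_dot_sketch_new(const float * x, const float * y, int n, float * acc) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) sum += x[i] * y[i];
    *acc += sum;
}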

An aside, but maybe related: there were recent PRs related to mla/fa (Pulls #386 and #408) with some vague language regarding Turing support. I say vague because #386 indicates Turing is not supported, then #408 indicates support is extended to Turing, but I'm not sure they're referring to the same thing, and the changes in #408 don't seem very significant. It's not clear what the proper mla/fa settings should be on Turing at this time. I currently use -mla 2 -fa.

What operating system are you seeing the problem on?

Linux

@ikawrakow (Owner)

> However, my IQ4 quant doesn't have any _R4 tensors. It's Q8 shared, and IQ4_K for the remaining tensors.
>
> The presence or absence of --run-time-repack neither causes nor avoids it.

To make sure I understand correctly: prior to #461 you observed the same good PP performance irrespective of using or not using --run-time-repack, but after #461 you observe the same bad PP performance with or without --run-time-repack?

@ikawrakow (Owner)

Please also provide your full command line. This really makes it easier to diagnose the problem.

@usrlocalben (Author)

ik_llama.cpp/build/bin/llama-server \
  -mla 2 -fa -fmoe \
  -amb 512 \
  -c 65536 \
  -np 1 \
  --n-gpu-layers 99 \
  -ctk q8_0 \
  --run-time-repack \
  -ot "blk\.3\.ffn_up_exps=CUDA0, blk\.3\.ffn_gate_exps=CUDA0" \
  -ot "blk\.4\.ffn_up_exps=CUDA0, blk\.4\.ffn_gate_exps=CUDA0" \
  -ot "blk\.5\.ffn_up_exps=CUDA0, blk\.5\.ffn_gate_exps=CUDA0" \
  -ot "blk\.6\.ffn_up_exps=CUDA0, blk\.6\.ffn_gate_exps=CUDA0" \
  -ot "blk\.7\.ffn_up_exps=CUDA0, blk\.7\.ffn_gate_exps=CUDA0" \
  -ot "ffn_down_exps=CPU, ffn_up_exps=CPU, gate_exps=CPU" \
  --host 127.0.0.1 --port 9999 \
  --temp 0.6 --top-p 0.95 \
  -m /path/to/model/DeepSeek-R1-0528-IQ4/data.gguf

commit 24c010b3 (last known good)

rtr=yes
prompt eval time     =  161791.56 ms / 10188 tokens (   15.88 ms per token,    62.97 tokens per second)
generation eval time =  115078.31 ms /  1012 runs   (  113.71 ms per token,     8.79 tokens per second)

rtr=no
prompt eval time     =  612061.95 ms / 10188 tokens (   60.08 ms per token,    16.65 tokens per second)
generation eval time =  144322.65 ms /  1268 runs   (  113.82 ms per token,     8.79 tokens per second)


commit 14292913 (CUDA _R4)

rtr=yes
prompt eval time     =  937934.38 ms / 10188 tokens (   92.06 ms per token,    10.86 tokens per second)
generation eval time =  122195.15 ms /  1065 runs   (  114.74 ms per token,     8.72 tokens per second)

rtr=no
prompt eval time     =  613312.38 ms / 10188 tokens (   60.20 ms per token,    16.61 tokens per second)
generation eval time =  163612.05 ms /  1437 runs   (  113.86 ms per token,     8.78 tokens per second)

@ikawrakow (Owner)

Observations:

  • rtr=no has the same performance on 1429291 and on 24c010b. In both versions, when rtr=no, tensors stored in RAM get offloaded to the GPU to perform the matrix multiplications.
  • rtr=no is much slower than rtr=yes on the last known good 24c010b. On that version, when rtr=yes, tensors stored in RAM are not offloaded to the GPU because the CUDA back-end reports that it does not support matrix multiplications for the repacked types.

Conclusion: your PCI-E speed is very low, resulting in low PP performance when tensors stored in RAM are offloaded to the GPU. #461 implemented CUDA matrix multiplications for repacked tensors, so after that PR all tensors stored in RAM get offloaded to the GPU for matrix multiplications, and performance drops.
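
Schematically, the decision that changed looks like this; the names below are illustrative only, not the real back-end API:

#include <cstdio>

// Illustrative tensor types: regular quants plus an "_R4"-style repacked type.
enum class Type { Q8_0, IQ4_K, REPACKED_R4 };

// Before #461 (illustrative): the CUDA back-end reports no support for
// repacked types, so their matrix multiplications stay on the CPU and no
// weights cross the PCI-E bus during prompt processing.
static bool cuda_supports_mul_mat_before_461(Type t) {
    return t != Type::REPACKED_R4;
}

// After #461 (illustrative): repacked types are supported too, so the
// scheduler offloads these matrix multiplications, copying the CPU-resident
// weights to the GPU for every batch. On a slow PCI-E link the copy
// dominates and PP throughput drops.
static bool cuda_supports_mul_mat_after_461(Type) {
    return true;
}

int main() {
    const Type t = Type::REPACKED_R4;
    std::printf("offload repacked mul_mat: before #461 = %s, after #461 = %s\n",
                cuda_supports_mul_mat_before_461(t) ? "yes" : "no",
                cuda_supports_mul_mat_after_461(t)  ? "yes" : "no");
    return 0;
}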

Mitigations:

  • If possible, use large u-batches. This allows more work to be done per byte of data copied to the GPU. If you have enough VRAM, -b 4096 -ub 4096 will maximize PP performance.
  • Avoid offloading tensors stored in RAM to the GPU. This is accomplished with -op 26,0,27,0,29,0 where
    • 26,0 disables offloading matrix multiplications
    • 27,0 disables offloading indirect matrix multiplications (used in MoE models)
    • 29,0 disables offloading fused ffn_up+ffn_gate operations (you get these in MoE models when using -fmoe)
  • You may want to experiment with -op (op stands for offload policy, see PR #405, GPU offload policy)
    • -op 29,0 -rtr should result in the exact same performance as you had on 24c010b
    • If your PCI-E speed is so low as to give such bad performance with GPU offload enabled, adding -op 27,0 to the above may improve performance compared to what you had on 24c010b

Note that for most people, not using -op and instead using large batches with -b 4096 -ub 4096 maximizes PP performance.
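
For example, one of the following could be appended to the llama-server command posted above (everything else unchanged); which one wins depends on the actual PCI-E bandwidth and on how much VRAM is left for the larger compute buffers:

  -b 4096 -ub 4096               (larger batches: more work per byte copied to the GPU)
  -op 29,0 --run-time-repack     (should match the performance on 24c010b, per the note above)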
