Bug: Perf Regression in PP throughput after Pull #461 (...R4 CUDA impl) #474
Comments
To make sure I understand correctly, prior to #461 you observed the same good PP performance irrespective of using or not using …
Please also provide your full command line. This really makes it easier to diagnose the problem.
Observations:
Conclusion: PCIe speed is very low, resulting in low PP performance when tensors stored in RAM are offloaded to the GPU. #461 implemented CUDA matrix multiplications for repacked tensors, so after that PR all tensors stored in RAM get offloaded to the GPU to perform matrix multiplications, and performance drops. Mitigations:
Note that for most people not using …
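To make the PCIe explanation above concrete, here is a rough back-of-envelope sketch. The bandwidth, weight-size, and batch numbers are placeholders for illustration, not measurements from this system; the point is only that the copy time alone bounds PP once RAM-resident tensors must cross the bus for every batch:

```cpp
#include <cstdio>

// Back-of-envelope bound on prompt-processing throughput when weights kept in
// system RAM must be copied to the GPU for each offloaded matmul batch.
// All numbers are illustrative placeholders, not measurements.
int main() {
    const double ram_weights_gib = 300.0; // hypothetical size of RAM-resident tensors
    const double pcie_gib_per_s  = 3.0;   // hypothetical effective PCIe copy rate
    const int    batch_tokens    = 512;   // tokens processed per offloaded pass

    // Each batch pays at least one full transfer of the RAM-resident weights,
    // so the copy time alone caps throughput no matter how fast the GPU computes.
    const double copy_seconds   = ram_weights_gib / pcie_gib_per_s;
    const double pp_upper_bound = batch_tokens / copy_seconds;

    std::printf("copy time per batch: %.1f s -> PP upper bound: %.2f t/s\n",
                copy_seconds, pp_upper_bound);
    return 0;
}
```

With numbers in that ballpark, the PCIe transfer alone caps PP at single-digit t/s, i.e. roughly the same order as TG, which is consistent with the drop described in the report below.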
What happened?
While testing out an IQ4 quant of R1-0528 I noticed that PP throughput on my system dropped, e.g. from ~75 t/s to ~12 t/s, basically equal to TG throughput. With IQ4 plus Q8 shared tensors on the GPU I expect PP > 60 t/s.
I compare with an all-Q8_0 quant and see what I expect: PP > 50 t/s (on main/HEAD today).
I bisected, and found that this problem was introduced with Pull #461 (commit 1429291).
However, my IQ4 quant doesn't have any _R4 tensors: it uses Q8 for the shared tensors and IQ4_K for the rest. Absence/presence of --run-time-repack doesn't cause or avoid it. The CUDA device is an RTX 8000 (Turing).
I glanced over the commit and mostly see changes that seem clearly restricted to _R4 suffix components. There are some shared parts where n_interleaved is propagated down the template stack (iqk_mmvq.cu), but at a casual glance nothing strikes me as odd; I'm certainly not that familiar with the code. The dot product interface did change to a mutating one that takes an accumulator pointer (previously it returned the computed result), and that change seems worth a closer look.
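For reference, the kind of interface change I mean looks roughly like the sketch below. These are simplified stand-ins written for illustration, not the actual iqk_mmvq.cu names or signatures:

```cpp
// Simplified sketch of the two dot-product interface styles being compared.
// The names and signatures are illustrative, not the real iqk_mmvq.cu code.

// Before: the helper returned its partial result and the caller accumulated it.
float dot_before(const float* x, const float* y, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) sum += x[i] * y[i];
    return sum;
}

// After: the helper adds into an accumulator passed in by pointer.
void dot_after(const float* x, const float* y, int n, float* acc) {
    for (int i = 0; i < n; ++i) *acc += x[i] * y[i];
}
```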
An aside, but maybe related: there were recent PRs related to mla/fa that had some vague language regarding Turing support (Pulls #386 and #408). I say vague because #386 indicates Turing is not supported, then #408 indicates support is extended to Turing, but I'm not sure they're referring to the same thing, and the changes in #408 don't seem very significant. It's not clear what the proper mla/fa settings should be on Turing at this time. I currently use -mla 2 -fa.
What operating system are you seeing the problem on?
Linux