GPU offload policy #405
Conversation
Many thanks for the PR! Sorry, I think I didn't understand correctly: for the case we were discussing in #394 (comment), if we want to do the matrix multiplications on MoE models, we should specify
This PR sets
Amazing, thanks! EDIT: Compilation issue solved by doing a fresh git clone.
Not sure.
Okay, restarting didn't work either. But cloning the PR itself into a new folder worked, so I guess there is an issue with my main folder after pulling the PR separately. Now testing the PR itself, it works! Running with
Speeds are
This is about 10% faster than main llama.cpp with the same ubatch size, and GPU 0, running at PCIe 5.0 x8, saturates at the absolute limit (28-29 GiB/s, 1-2 GiB/s higher than main llama.cpp), so maybe there could be a benefit at PCIe 5.0 x16, but that is yet to be tested.
Just an update: tested other DeepSeek models (V3-0324, Chimera, R1) at Q2_K_XL, IQ3_XXS, Q3_K_S and Q3_K_XL, all working fine! So really nice work.
Thanks for testing, I appreciate it! Johannes has improved the performance
I see! I think I would have to remove some layers of some experts from the GPU to use -b 4096 and -ub 4096, which I think would increase PP but maybe decrease TG a bit? At least I have noticed that with -b 2560 and -ub 2048, with fewer layers on the GPU but more context (128K).
Yes, so it depends on what is more important to you. The TG performance decrease will be quite modest, about 1/61 per extra non-offloaded layer for DeepSeek-R1/V3 (the model has 61 layers).
What is the use case for
Oh, just that when I was testing on main llama.cpp, I had more memory usage with -b and -ub 2048 than with 2560/2048 respectively, but maybe it was because of something else. Also, losing just 1/61 of the speed per layer is probably worth it. I get 7 t/s TG on Q3_K_XL but ~80-90 t/s PP. I would trade 2 layers (roughly 6.3 t/s TG) for more PP speed.
Okay, testing Q2_K_XL with -b 4096 and -ub 4096, the PP t/s are insane.

EDIT: After some generations it just gets faster.
When part of the tensors are stored in RAM but faster back-ends (GPU) are available, the scheduler needs to decide whether to offload the data for a given op to a faster back-end or to compute the op on the CPU. This is currently done via a simple heuristic where only matrix multiplications (`GGML_MUL_MAT` and `GGML_MUL_MAT_ID`) are offloaded, and only if the batch size is larger than some threshold (currently 32). When `-fmoe` is enabled, the fused `(ffn_up*X)*unary(ffn_gate*X)` op is never offloaded. In contrast, in mainline `llama.cpp` matrix multiplications are always offloaded when the batch size is `>= 32`. The result of this is that when the batch size becomes large enough, `llama.cpp` will outperform `ik_llama.cpp` in prompt processing speed. As "large enough" depends on many factors (size of the tensors that need to be uploaded, speed of the PCI-E bus to the GPU, relative speed of the GPU vs the CPU), it is hard to devise a better offload policy that automatically makes the best decision.

Hence, this PR adds the ability to manually define the offload policy via a command line argument that can be used with all examples that use `common` (`llama-cli`, `llama-server`, `llama-sweep-bench`, `llama-perplexity`, etc.). The argument is `-op a,b` (or `--offload-policy a,b`), where `a` and `b` are integers. One can have multiple pairs following the `-op` or `--offload-policy` argument (i.e., `-op a1,b1,a2,b2,a3,b3...`). The first integer defines the op (see below); it is simply the enum value in the `ggml_op` enum. The second integer is `0` or `1` and defines if the op should be offloaded (`1`) or not offloaded (`0`) to the GPU. I know this is clunky, but I also didn't want to go with just allowing or disallowing offload for all ops. If the op is set to `-1`, offload is enabled or disabled for all ops at once.

Current list of ops
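The op indices map directly to the `ggml_op` enum of the ggml revision you built, so they can shift between versions. As a small illustrative sketch (not part of this PR), the index-to-name mapping can be printed with ggml's public `ggml_op_name()` helper:

```cpp
// list_ops.cpp -- illustrative sketch only, not part of this PR.
// Prints the integer value of every ggml_op together with its name; the
// integer is what -op / --offload-policy expects as the first number of a pair.
#include <cstdio>

#include "ggml.h"   // adjust the include path to your ggml checkout

int main() {
    for (int i = 0; i < GGML_OP_COUNT; ++i) {
        std::printf("%3d  %s\n", i, ggml_op_name(static_cast<enum ggml_op>(i)));
    }
    return 0;
}
```

Whether, e.g., `GGML_MUL_MAT` really sits at index 26 therefore depends on your ggml version; the numbers in the examples below are the ones quoted in this PR.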
Examples:

* `-op -1,0`: disable all offload to the GPU
* `-op 26,0`: disable offload of matrix multiplications to the GPU
* `-op 27,0`: disable offload of indirect matrix multiplications to the GPU (used for the experts in a MoE model)
* `-op 29,0`: disable fused up-gate-unary op offload to the GPU (applied to MoE models with `-fmoe`)
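For instance, a full invocation might look like the following (the model path is a placeholder; `-fmoe` and `-ub` are existing options discussed above, `-op` is what this PR adds):

```bash
# Keep the fused up-gate-unary op and the expert (indirect) matrix
# multiplications on the CPU, while leaving regular matrix multiplications
# offloadable; two a,b pairs are passed to a single -op argument.
./llama-server -m ./models/DeepSeek-R1-Q2_K_XL.gguf -fmoe -ub 2048 -op 27,0,29,0
```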
Note

Even if offload for an op is enabled, it may still not be offloaded based on the existing heuristics. This is important for, e.g., token generation, where the batch size is 1 and offloading would take much longer than just computing the op on the CPU.
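In other words, the user policy acts as a gate on top of the heuristic rather than replacing it. A minimal sketch of that interaction (purely illustrative, not the actual ggml scheduler code) could look like this:

```cpp
#include <array>

// Illustrative only: how a per-op offload policy set via "-op a,b" could
// combine with the existing batch-size heuristic.
struct offload_policy {
    std::array<bool, 128> allow;                       // indexed by ggml_op value
    offload_policy() { allow.fill(true); }             // by default offload is allowed

    void set(int op, bool enabled) { allow[op] = enabled; }   // "-op op,enabled"

    bool should_offload(int op, int n_tokens) const {
        if (!allow[op]) {
            return false;                              // user disabled offload, e.g. -op 29,0
        }
        // Even when allowed, the heuristic still applies: for small batches
        // (e.g. token generation with n_tokens == 1) uploading the weights
        // costs more than simply running the op on the CPU.
        return n_tokens >= 32;
    }
};
```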
Important

The PR also changes `ik_llama.cpp` to offload fused up-gate-unary ops for batch sizes `>= 32`. If you observe PP performance degradation compared to the main branch, the behavior prior to this PR can be recovered using `-op 29,0`.
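For example, to A/B this against the pre-PR behavior, the same benchmark can be run twice, once with and once without the flag (model path and batch settings are placeholders):

```bash
# New default: fused up-gate-unary op is offloaded for batches >= 32
./llama-sweep-bench -m ./models/model.gguf -fmoe -ub 2048

# Pre-PR behavior: keep the fused op on the CPU
./llama-sweep-bench -m ./models/model.gguf -fmoe -ub 2048 -op 29,0
```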
Note

Row-interleaved quants (`IQ4_K_R4`, `Q4_0_R8`, etc.) are never offloaded because there is no CUDA GEMM/GEMV for these quantization types. Hence, using `-rtr` is equivalent to `-op 26,0,27,0,29,0`.