Openblas sgemm is slower for small size matrices in aarch64 #4580
Comments
OpenBLAS does not currently provide dedicated GEMM kernels for "small" matrix sizes on ARM64, and may be switching to multithreading too early. (Also not sure if MKL would perhaps be employing GEMV here for a 1-by-N matrix; certainly a special case that OpenBLAS does not try to exploit.)
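(For anyone benchmarking this, one way to test the "switching to multithreading too early" theory is to cap the thread count for small problems. A minimal sketch, assuming you are linking against OpenBLAS so its openblas_set_num_threads extension is available; the size cutoff below is purely illustrative, not OpenBLAS's actual threshold:)

```c
#include <cblas.h>   /* OpenBLAS's cblas.h also declares its extensions */

/* Force a single thread for small problems to check whether the
 * slowdown comes from OpenBLAS spinning up threads too early.
 * The 64^3 cutoff is purely illustrative, not OpenBLAS's threshold. */
void small_sgemm(int M, int N, int K,
                 const float *A, const float *B, float *C)
{
    if ((long long)M * N * K < 64LL * 64 * 64)
        openblas_set_num_threads(1);   /* OpenBLAS-specific extension */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K, 1.0f, A, K, B, N, 0.0f, C, N);
}
```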
@martin-frbg, Thank you.
General plans to improve "everything" but no ETA - this project does not have much in the way of a permanent team behind it at present, so progress tends to be a bit unpredictable, often driven by outside contributions.
Would be interesting to see how the respective gemv equivalents perform in this particular case.
Two other options might be interesting: libxsmm, a library built specifically to address small-matrix multiplication, including narrow/tall matrices (though only the 1x512 * 512x512 case fits within their recommended size limit), and Arm Performance Libraries, which has many BLAS functions specifically targeted and optimized for aarch64. I'm curious if they perform better for this small-matrix test case.
Gemm with one dimension equal to 1 can be cast down to gemv. The question is whether those libraries use that trick.
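(For concreteness, a sketch of that mapping for the row-major M=1 call used in this issue: with beta = 0 the two calls below produce the same C, treating the 1 x K matrix A as a length-K vector:)

```c
#include <cblas.h>

/* With M == 1, C (1 x N) = A (1 x K) * B (K x N) is just a
 * matrix-vector product: C = B^T * a, where a is A viewed as a vector. */
void gemm_vs_gemv(int N, int K, const float *A, const float *B, float *C)
{
    /* GEMM form, as called in this issue (M = 1): */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                1, N, K, 1.0f, A, K, B, N, 0.0f, C, N);

    /* Equivalent GEMV form; overwrites C with the same result: */
    cblas_sgemv(CblasRowMajor, CblasTrans, K, N,
                1.0f, B, N, A, 1, 0.0f, C, 1);
}
```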
@lrbison, Thank you.
@akote123 Hm, I tried to reproduce, but I got different results.
I've also tested the
The results are dramatically different from yours. While I have not tried transposing my matrices, I suspect there is something more going on. This was run on c7g.8xlarge (32 cores).
m=1 is anomalous as it is the equivalent of gemv (the first "matrix" is actually a vector).
@brada4 You are right of course, but I didn't get time to add that to my test case last night. I've got new data this morning. Additionally, I just checked ArmPL, and it seems they catch this special case and call into sgemv, since their timings are nearly identical in both cases, and very similar to OpenBLAS sgemv times as well.
Thank you very much. I do wonder what version akote123 is/was using, as timings consistently getting worse when going from 1 to n threads for fairly large problem sizes is a bit unexpected.
I have used openblas 0.3.26.
@akote123 I believe the issue is that you are using clock() to measure time; in a multithreaded program it reports CPU time accumulated across all threads rather than wall-clock time. See https://stackoverflow.com/questions/2962785/c-using-clock-to-measure-time-in-multi-threaded-programs
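(To illustrate the pitfall: clock() sums CPU time over every thread, so an n-thread run of the same wall duration reports roughly n times the elapsed time. A minimal sketch of a wall-clock timer using POSIX clock_gettime instead:)

```c
#include <time.h>

/* Wall-clock timer: CLOCK_MONOTONIC measures real elapsed time,
 * unlike clock(), which accumulates CPU time across all threads. */
static double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* Usage:
 *   double t0 = wall_seconds();
 *   cblas_sgemm(...);
 *   printf("%.6f s\n", wall_seconds() - t0);
 */
```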
@martin-frbg Has OpenBLAS considered calling into gemv from gemm in these kinds of special cases? If I tinkered around to do so, would you consider accepting a PR, or is it just not worth it?
The topic has come up a few times in the past, e.g. #528, and I have just created a rough draft of the fairly trivial change to add this in interface/gemm.c. But if you have written something already in parallel with me, please do post your PR.
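(In spirit, the change is a small early-out before the GEMM kernels run. A rough sketch at the CBLAS level of what such a dispatch might look like; this is a hedged approximation of the idea, not the actual code in the draft:)

```c
#include <cblas.h>

/* Illustrative only: the real change belongs inside OpenBLAS's
 * interface/gemm.c, but the idea is an early-out that forwards
 * degenerate no-transpose GEMM calls to the GEMV path. */
static void sgemm_nn_dispatch(int M, int N, int K, float alpha,
                              const float *A, int lda,
                              const float *B, int ldb,
                              float beta, float *C, int ldc)
{
    if (M == 1) {
        /* C (1 x N) = alpha * a^T B + beta * C  ==  y = alpha * B^T a + beta * y */
        cblas_sgemv(CblasRowMajor, CblasTrans, K, N,
                    alpha, B, ldb, A, 1, beta, C, 1);
    } else if (N == 1) {
        /* C (M x 1) = alpha * A b + beta * C, where the vector b
         * strides through B with increment ldb */
        cblas_sgemv(CblasRowMajor, CblasNoTrans, M, K,
                    alpha, A, lda, B, ldb, beta, C, ldc);
    } else {
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    M, N, K, alpha, A, lda, B, ldb, beta, C, ldc);
    }
}
```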
Uploaded what I currently have as #4708 - bound to be some embarrassing coding errors in there still.
I have built OpenBLAS on Graviton3E with make USE_OPENMP=1 NUM_THREADS=256 TARGET=NEOVERSEV1.
MKL was built on an Ice Lake machine.
I have used openblas sgemm as
cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, M, N, K, 1.0, A, K, B, N, 0.0, C, N);
When the performance timings are compared with Intel MKL for the smaller-size matmuls, aarch64 is slower.
These are the different shapes I have checked and their timings.
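(A minimal self-contained driver for timing one of these shapes might look like the sketch below; the shape is the 1x512 * 512x512 case mentioned in the discussion above, the repetition count is arbitrary, and wall-clock timing is used per the clock() caveat above. The actual harness used for the reported timings is not shown here.)

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>

/* Wall-clock timer (see the clock() discussion above). */
static double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    const int M = 1, N = 512, K = 512;   /* one of the shapes discussed above */
    const int reps = 1000;               /* arbitrary repetition count */

    float *A = calloc((size_t)M * K, sizeof(float));
    float *B = calloc((size_t)K * N, sizeof(float));
    float *C = calloc((size_t)M * N, sizeof(float));
    if (!A || !B || !C) return 1;

    double t0 = wall_seconds();
    for (int r = 0; r < reps; r++)
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    M, N, K, 1.0f, A, K, B, N, 0.0f, C, N);
    double t1 = wall_seconds();

    printf("M=%d N=%d K=%d: %.3f us/call\n", M, N, K, (t1 - t0) / reps * 1e6);

    free(A); free(B); free(C);
    return 0;
}
```

(Link against the library under test, e.g. gcc bench.c -lopenblas.)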