
Openblas sgemm is slower for small size matrices in aarch64 #4580


Closed
akote123 opened this issue Mar 26, 2024 · 16 comments · Fixed by #4814

Comments

@akote123

akote123 commented Mar 26, 2024

I have built OpenBLAS on a Graviton3E with `make USE_OPENMP=1 NUM_THREADS=256 TARGET=NEOVERSEV1`.
MKL was built on an Ice Lake machine.

I have used OpenBLAS SGEMM as:
`cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, M, N, K, 1.0, A, K, B, N, 0.0, C, N);`

When the performance timings are compared with Intel MKL for the smaller matmul sizes, aarch64 is slower.

[image: OpenBLAS vs MKL timing comparison]

These are the different shapes I have checked and their timings.

@martin-frbg
Collaborator

OpenBLAS does not currently provide dedicated GEMM kernels for "small" matrix sizes on ARM64, and may be switching to multithreading too early. (I am also not sure whether MKL might be employing GEMV here for a 1-by-N matrix; that is certainly a special case that OpenBLAS does not try to exploit.)
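For illustration, a minimal standalone sketch (my own, not MKL or OpenBLAS internals) of why the 1-by-N case can be served by GEMV: a row-major SGEMM with M = 1 computes exactly what an SGEMV on the transposed B computes.

```c
/* Sketch: for M == 1, row-major C(1xN) = A(1xK) * B(KxN) equals B^T * A,
 * so the whole product can be done by a single SGEMV call. */
#include <stdio.h>
#include <cblas.h>

int main(void) {
    enum { K = 4, N = 3 };
    float A[K] = {1, 2, 3, 4};        /* 1 x K row vector */
    float B[K * N];                   /* K x N, row major */
    float C_gemm[N], C_gemv[N];
    for (int i = 0; i < K * N; i++) B[i] = (float)i;

    /* GEMM path with M = 1 */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                1, N, K, 1.0f, A, K, B, N, 0.0f, C_gemm, N);
    /* Equivalent GEMV path: C = B^T * A */
    cblas_sgemv(CblasRowMajor, CblasTrans, K, N,
                1.0f, B, N, A, 1, 0.0f, C_gemv, 1);

    for (int j = 0; j < N; j++)       /* the two columns should match */
        printf("%f %f\n", C_gemm[j], C_gemv[j]);
    return 0;
}
```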

@akote123
Author

@martin-frbg, thank you.
Is there a plan to improve GEMM kernels for small matrix sizes on ARM?

@martin-frbg
Collaborator

There are general plans to improve "everything", but no ETA; this project does not have much of a permanent team behind it at present, so progress tends to be a bit unpredictable and is often driven by outside contributions.

@brada4
Contributor

brada4 commented Mar 28, 2024

It would be interesting to see how the respective GEMV equivalents perform in this particular case.

@lrbison

lrbison commented May 7, 2024

Two other options might be interesting:

- libxsmm, a library built specifically to address small-matrix multiplication, including narrow/tall matrices. Only the 1x512 * 512x512 case fits within its recommended size of $(MNK)^{1/3} \le 64$ (for that shape, $(1 \cdot 512 \cdot 512)^{1/3} = 64$ exactly), but it may still be worth a try. I have seen this library perform well on Neoverse V1 (Graviton3).
- Arm Performance Libraries, which has many BLAS functions specifically targeted at and optimized for aarch64. I'm curious whether they perform better for this small-matrix test case.

@brada4
Contributor

brada4 commented May 7, 2024

GEMM with one dimension equal to 1 can be cast down to GEMV. The question is whether those libraries use that trick.

@akote123
Author

akote123 commented May 8, 2024

@lrbison, thank you.
I have checked libxsmm with batchsize = 1, m = 1, n = 512, k = 2048 and got 919 µs on Graviton3 and 887 µs on Ice Lake.

@lrbison

lrbison commented May 16, 2024

@akote123 Hm, I tried to reproduce, but I got different results.
I'm using OpenBLAS 0.3.26 as compiled by spack:

[email protected]%[email protected]~bignuma~consistent_fpcsr+dynamic_dispatch+fortran~ilp64+locking+pic+shared build_system=makefile symbol_suffix=none threads=openmp arch=linux-ubuntu20.04-neoverse_v1

I've also tested the threads=none variant. For testing, I did not use cblas_sgemm, but called sgemm_ directly, and stored my matrices column-major; i.e., my call was:

`sgemm_("n", "n", &m, &n, &k, &one, a, &m, b, &k, &zero, c, &m);`

The results are dramatically different from yours. While I have not tried transposing my matrices, I suspect there is something more going on. This was run on c7g.8xlarge (32 cores).

[image: sgemm timing results on c7g.8xlarge]
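(For reference, a self-contained version of that style of direct call; the extern prototype below is my own assumption based on the standard Fortran-77 BLAS ABI with 32-bit integers, which matches the ~ilp64 spec above.)

```c
/* Sketch: calling the Fortran-interface sgemm_ directly, column major.
 * The prototype is an assumption based on the F77 BLAS calling convention. */
#include <stdlib.h>

extern void sgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const float *alpha, const float *a, const int *lda,
                   const float *b, const int *ldb,
                   const float *beta, float *c, const int *ldc);

int main(void) {
    int m = 1, n = 512, k = 512;
    float one = 1.0f, zero = 0.0f;
    float *a = calloc((size_t)m * k, sizeof(float)); /* m x k, column major */
    float *b = calloc((size_t)k * n, sizeof(float)); /* k x n, column major */
    float *c = calloc((size_t)m * n, sizeof(float)); /* m x n, column major */

    sgemm_("n", "n", &m, &n, &k, &one, a, &m, b, &k, &zero, c, &m);

    free(a); free(b); free(c);
    return 0;
}
```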

@brada4
Contributor

brada4 commented May 16, 2024

m=1 is anomalous, as it is equivalent to GEMV (the first "matrix" is actually a vector).

@lrbison

lrbison commented May 16, 2024

@brada4 You are right of course, but I didn't get time to add that to my test case last night. I've got new data this morning:

[image: updated timings including the gemv comparison]

Additionally I just checked ArmPL, and it seems they catch this special case and call into sgemv, since their timings are nearly identical in both cases, and very similar to OpenBLAS sgemv times as well.

@martin-frbg
Collaborator

Thank you very much. I do wonder what version akote123 is/was using, as timings consistently getting worse when going from one to n threads at fairly large problem sizes is a bit unexpected.

@akote123
Author

akote123 commented May 16, 2024

I have used OpenBLAS 0.3.26.
@lrbison, I haven't set OMP_NUM_THREADS; for core pinning I have used taskset. I used the code below to benchmark, with timings taken on a c7gn.8xlarge.

```c
clock_t start_t = clock();
for (int i = 0; i < 100; i++) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K, 1.0, A, K, B, N, 0.0, C, N);
}
double time_avg = (double)(clock() - start_t) / CLOCKS_PER_SEC / 100;
fprintf(stdout, "%lf\n", time_avg);
```



@lrbison

lrbison commented May 16, 2024

@akote123 I believe the issue is that you are using clock(), but clock() measures CPU time, not wall-clock time. That means each thread adds ticks in parallel.

See https://stackoverflow.com/questions/2962785/c-using-clock-to-measure-time-in-multi-threaded-programs
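For comparison, a minimal sketch of the same loop timed with wall-clock time instead (CLOCK_MONOTONIC is the usual choice; the placeholder comment stands in for the 100 cblas_sgemm calls):

```c
/* Sketch: wall-clock timing.  Unlike clock(), CLOCK_MONOTONIC does not
 * accumulate CPU time across threads, so multithreaded runs are measured
 * fairly. */
#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    /* ... the 100 cblas_sgemm calls go here ... */

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double elapsed = (double)(t1.tv_sec - t0.tv_sec)
                   + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("avg per call: %f s\n", elapsed / 100.0);
    return 0;
}
```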

@lrbison

lrbison commented May 20, 2024

@martin-frbg, has OpenBLAS considered calling into GEMV from GEMM in these kinds of special cases? If I tinkered around to do so, would you consider accepting a PR, or is it just not worth it?

@martin-frbg
Collaborator

The topic has come up a few times in the past, e.g. #528, and I have just created a rough draft of the fairly trivial change to add this in interface/gemm.c. But if you have already written something in parallel with me, please do post your PR.
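To illustrate the general idea only (a standalone sketch with names of my own, not the draft code in interface/gemm.c): the interface layer can detect a degenerate dimension and forward the call to GEMV before the GEMM machinery, including its threading setup, is entered.

```c
/* Sketch of a GEMV fast path in front of GEMM (row-major, no-transpose
 * case only, matching this issue's benchmark; function name is hypothetical). */
#include <cblas.h>

void sgemm_with_gemv_fastpath(int M, int N, int K, float alpha,
                              const float *A, const float *B,
                              float beta, float *C)
{
    if (M == 1) {
        /* C(1xN) = alpha * A(1xK) * B(KxN) + beta * C  ==  SGEMV on B^T */
        cblas_sgemv(CblasRowMajor, CblasTrans, K, N,
                    alpha, B, N, A, 1, beta, C, 1);
    } else if (N == 1) {
        /* C(Mx1) = alpha * A(MxK) * B(Kx1) + beta * C  ==  SGEMV on A */
        cblas_sgemv(CblasRowMajor, CblasNoTrans, M, K,
                    alpha, A, K, B, 1, beta, C, 1);
    } else {
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    M, N, K, alpha, A, K, B, N, beta, C, N);
    }
}
```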

@martin-frbg
Collaborator

Uploaded what I currently have as #4708; bound to be some embarrassing coding errors in there still.
