
Openblas sgemm is slower for small size matrices in aarch64 #4580


Closed
akote123 opened this issue Mar 26, 2024 · 16 comments · Fixed by #4814

Comments

@akote123

akote123 commented Mar 26, 2024

I have built OpenBLAS on a Graviton3E with `make USE_OPENMP=1 NUM_THREADS=256 TARGET=NEOVERSEV1`.
MKL was built on an Ice Lake machine.

I have used OpenBLAS SGEMM as:
`cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, M, N, K, 1.0, A, K, B, N, 0.0, C, N);`

When the performance timings are compared with Intel MKL for the smaller matmul sizes, aarch64 is slower.

[image: OpenBLAS vs MKL timing comparison]

These are the different shapes I have checked and their timings.

@martin-frbg
Collaborator

OpenBLAS does not currently provide dedicated GEMM kernels for "small" matrix sizes on ARM64, and may be switching to multithreading too early. (I am also not sure whether MKL might be employing GEMV here for a 1-by-N matrix; that is certainly a special case that OpenBLAS does not try to exploit.)
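For illustration, a minimal standalone sketch (my own, not MKL or OpenBLAS internals) of why the 1-by-N case can be served by GEMV: a row-major SGEMM with M = 1 computes exactly what an SGEMV on the transposed B computes.

```c
/* Sketch: for M == 1, row-major C(1xN) = A(1xK) * B(KxN) equals B^T * A,
 * so the whole product can be done by a single SGEMV call. */
#include <stdio.h>
#include <cblas.h>

int main(void) {
    enum { K = 4, N = 3 };
    float A[K] = {1, 2, 3, 4};        /* 1 x K row vector */
    float B[K * N];                   /* K x N, row major */
    float C_gemm[N], C_gemv[N];
    for (int i = 0; i < K * N; i++) B[i] = (float)i;

    /* GEMM path with M = 1 */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                1, N, K, 1.0f, A, K, B, N, 0.0f, C_gemm, N);
    /* Equivalent GEMV path: C = B^T * A */
    cblas_sgemv(CblasRowMajor, CblasTrans, K, N,
                1.0f, B, N, A, 1, 0.0f, C_gemv, 1);

    for (int j = 0; j < N; j++)       /* the two columns should match */
        printf("%f %f\n", C_gemm[j], C_gemv[j]);
    return 0;
}
```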

@akote123
Author

@martin-frbg, thank you.
Is there a plan to improve GEMM kernels for small matrix sizes on ARM?

@martin-frbg
Collaborator

There are general plans to improve "everything", but no ETA; this project does not have much of a permanent team behind it at present, so progress tends to be a bit unpredictable and is often driven by outside contributions.

@brada4
Contributor

brada4 commented Mar 28, 2024

It would be interesting to see how the respective GEMV equivalents perform in this particular case.

@lrbison

lrbison commented May 7, 2024

Two other options might be interesting:

- libxsmm, a library built specifically to address small-matrix multiplication, including narrow/tall matrices. Only the 1x512 * 512x512 case fits within its recommended size of $(MNK)^{1/3} \le 64$ (for that shape, $(1 \cdot 512 \cdot 512)^{1/3} = 64$ exactly), but it may still be worth a try. I have seen this library perform well on Neoverse V1 (Graviton3).
- Arm Performance Libraries, which has many BLAS functions specifically targeted at and optimized for aarch64. I'm curious whether they perform better for this small-matrix test case.

@brada4
Contributor

brada4 commented May 7, 2024

GEMM with one dimension equal to 1 can be cast down to GEMV. The question is whether those libraries use that trick.

@akote123
Author

akote123 commented May 8, 2024

@lrbison, thank you.
I have checked libxsmm with batchsize = 1, m = 1, n = 512, k = 2048 and got 919 µs on Graviton3 and 887 µs on Ice Lake.

@lrbison

lrbison commented May 16, 2024

@akote123 Hm, I tried to reproduce, but I got different results.
I'm using OpenBLAS 0.3.26 as compiled by spack:

[email protected]%[email protected]~bignuma~consistent_fpcsr+dynamic_dispatch+fortran~ilp64+locking+pic+shared build_system=makefile symbol_suffix=none threads=openmp arch=linux-ubuntu20.04-neoverse_v1

I've also tested the threads=none variant. For testing, I did not use cblas_sgemm, but called sgemm_ directly, and stored my matrices column-major; i.e., my call was:

`sgemm_("n", "n", &m, &n, &k, &one, a, &m, b, &k, &zero, c, &m);`

The results are dramatically different from yours. While I have not tried transposing my matrices, I suspect there is something more going on. This was run on c7g.8xlarge (32 cores).

[image: sgemm timing results on c7g.8xlarge]
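(For reference, a self-contained version of that style of direct call; the extern prototype below is my own assumption based on the standard Fortran-77 BLAS ABI with 32-bit integers, which matches the ~ilp64 spec above.)

```c
/* Sketch: calling the Fortran-interface sgemm_ directly, column major.
 * The prototype is an assumption based on the F77 BLAS calling convention. */
#include <stdlib.h>

extern void sgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const float *alpha, const float *a, const int *lda,
                   const float *b, const int *ldb,
                   const float *beta, float *c, const int *ldc);

int main(void) {
    int m = 1, n = 512, k = 512;
    float one = 1.0f, zero = 0.0f;
    float *a = calloc((size_t)m * k, sizeof(float)); /* m x k, column major */
    float *b = calloc((size_t)k * n, sizeof(float)); /* k x n, column major */
    float *c = calloc((size_t)m * n, sizeof(float)); /* m x n, column major */

    sgemm_("n", "n", &m, &n, &k, &one, a, &m, b, &k, &zero, c, &m);

    free(a); free(b); free(c);
    return 0;
}
```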

@brada4
Contributor

brada4 commented May 16, 2024

m=1 is anomalous, as it is equivalent to GEMV (the first "matrix" is actually a vector).

@lrbison

lrbison commented May 16, 2024

@brada4 You are right of course, but I didn't get time to add that to my test case last night. I've got new data this morning:

[image: updated timings including the gemv comparison]

Additionally I just checked ArmPL, and it seems they catch this special case and call into sgemv, since their timings are nearly identical in both cases, and very similar to OpenBLAS sgemv times as well.

@martin-frbg
Collaborator

Thank you very much. I do wonder what version akote123 is/was using, as timings consistently getting worse when going from one to n threads at fairly large problem sizes is a bit unexpected.

@akote123
Author

akote123 commented May 16, 2024

I have used OpenBLAS 0.3.26.
@lrbison, I haven't set OMP_NUM_THREADS; for core pinning I have used taskset. I used the code below to benchmark, with timings taken on a c7gn.8xlarge.

```c
clock_t start_t = clock();
for (int i = 0; i < 100; i++) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K, 1.0, A, K, B, N, 0.0, C, N);
}
double time_avg = (double)(clock() - start_t) / CLOCKS_PER_SEC / 100;
fprintf(stdout, "%lf\n", time_avg);
```



@lrbison

lrbison commented May 16, 2024

@akote123 I believe the issue is that you are using clock(), but clock() measures CPU time, not wall-clock time. That means each thread adds ticks in parallel.

See https://stackoverflow.com/questions/2962785/c-using-clock-to-measure-time-in-multi-threaded-programs
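For comparison, a minimal sketch of the same loop timed with wall-clock time instead (CLOCK_MONOTONIC is the usual choice; the placeholder comment stands in for the 100 cblas_sgemm calls):

```c
/* Sketch: wall-clock timing.  Unlike clock(), CLOCK_MONOTONIC does not
 * accumulate CPU time across threads, so multithreaded runs are measured
 * fairly. */
#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    /* ... the 100 cblas_sgemm calls go here ... */

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double elapsed = (double)(t1.tv_sec - t0.tv_sec)
                   + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("avg per call: %f s\n", elapsed / 100.0);
    return 0;
}
```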

@lrbison

lrbison commented May 20, 2024

@martin-frbg, has OpenBLAS considered calling into GEMV from GEMM in these kinds of special cases? If I tinkered around to do so, would you consider accepting a PR, or is it just not worth it?

@martin-frbg
Collaborator

The topic has come up a few times in the past, e.g. #528, and I have just created a rough draft of the fairly trivial change to add this in interface/gemm.c. But if you have already written something in parallel with me, please do post your PR.
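To illustrate the general idea only (a standalone sketch with names of my own, not the draft code in interface/gemm.c): the interface layer can detect a degenerate dimension and forward the call to GEMV before the GEMM machinery, including its threading setup, is entered.

```c
/* Sketch of a GEMV fast path in front of GEMM (row-major, no-transpose
 * case only, matching this issue's benchmark; function name is hypothetical). */
#include <cblas.h>

void sgemm_with_gemv_fastpath(int M, int N, int K, float alpha,
                              const float *A, const float *B,
                              float beta, float *C)
{
    if (M == 1) {
        /* C(1xN) = alpha * A(1xK) * B(KxN) + beta * C  ==  SGEMV on B^T */
        cblas_sgemv(CblasRowMajor, CblasTrans, K, N,
                    alpha, B, N, A, 1, beta, C, 1);
    } else if (N == 1) {
        /* C(Mx1) = alpha * A(MxK) * B(Kx1) + beta * C  ==  SGEMV on A */
        cblas_sgemv(CblasRowMajor, CblasNoTrans, M, K,
                    alpha, A, K, B, 1, beta, C, 1);
    } else {
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    M, N, K, alpha, A, K, B, N, beta, C, N);
    }
}
```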

@martin-frbg
Collaborator

Uploaded what I currently have as #4708; bound to be some embarrassing coding errors in there still.
