Improve fmha_bwd tests performance #2376

ex-rzr · 2025-06-20T08:50:26Z

Proposed changes

tile_example_fmha_bwd takes far longer than tile_example_fmha_fwd for the same parameters, even allowing for the fact that bwd does more work.
This makes it practical to run only with very small seqlens.

The main bottleneck is the computation of ds_hp_host_ref. First I optimized its inner loop by avoiding allocation and copying of the indices (std::vector). Then I optimized its outer loop by using ParallelTensorFunctor instead of ForEach.
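To illustrate the inner-loop change, here is a minimal sketch of why passing an index vector by value is costly in a hot loop. ToyTensor, at_by_value, at_by_ref and sum_all_3d are hypothetical names made up for this example; this is not the actual host tensor code touched by the PR.

```cpp
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical toy tensor for illustration only; not ck_tile's host tensor.
struct ToyTensor
{
    std::vector<std::size_t> lengths;
    std::vector<std::size_t> strides;
    std::vector<float> data;

    explicit ToyTensor(std::vector<std::size_t> lens)
        : lengths(std::move(lens)), strides(lengths.size(), 1)
    {
        // Packed row-major strides: the last dimension is contiguous.
        for(std::size_t i = lengths.size(); i-- > 1;)
            strides[i - 1] = strides[i] * lengths[i];
        std::size_t total = 1;
        for(std::size_t len : lengths)
            total *= len;
        data.assign(total, 0.0f);
    }

    // Linear offset = sum over dimensions of idx[d] * strides[d].
    std::size_t offset(const std::vector<std::size_t>& idx) const
    {
        return std::inner_product(idx.begin(), idx.end(), strides.begin(), std::size_t{0});
    }

    // Slow variant: an lvalue argument is copied on every call,
    // which means one heap allocation per element access.
    float& at_by_value(std::vector<std::size_t> idx) { return data[offset(idx)]; }

    // Fast variant: the caller's vector is only read, never copied.
    float& at_by_ref(const std::vector<std::size_t>& idx) { return data[offset(idx)]; }
};

// Sums a 3-D tensor while reusing a single index vector for all accesses.
float sum_all_3d(ToyTensor& t)
{
    std::vector<std::size_t> idx(3);
    float acc = 0.0f;
    for(idx[0] = 0; idx[0] < t.lengths[0]; ++idx[0])
        for(idx[1] = 0; idx[1] < t.lengths[1]; ++idx[1])
            for(idx[2] = 0; idx[2] < t.lengths[2]; ++idx[2])
                acc += t.at_by_ref(idx);
    return acc;
}
```

With at_by_ref the loop performs one allocation in total (for idx), whereas calling at_by_value in the same loop would allocate once per element.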

After that, the remaining bottlenecks are copies and conversions of several large tensors of shape {nhead, real_seqlen_q, real_seqlen_k} that are implemented with ForEach. I replaced them with CopyAsType.
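The idea behind this replacement can be sketched as follows, assuming both tensors are packed and have identical shapes so that their flat storage lines up element by element. The helper name copy_as_type below is hypothetical and only mirrors the intent; it is not the signature of the real CopyAsType.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: converts every element of a flat buffer to Dst.
// No multi-dimensional index is computed; the buffers are walked linearly.
template <typename Dst, typename Src>
std::vector<Dst> copy_as_type(const std::vector<Src>& src)
{
    std::vector<Dst> dst(src.size());
    std::transform(src.begin(), src.end(), dst.begin(), [](const Src& x) {
        return static_cast<Dst>(x);
    });
    return dst;
}
```

For example, copy_as_type<double>(buf) on a std::vector<float> returns a std::vector<double> of the same length, with each element widened in place order.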

Before:

time ./bin/tile_example_fmha_bwd -d=64
[fp16|batch|bhsd] b:2, h:8/8, s:3328/3328, d:64/64, scale:0.125, bias:n, dbias:0, p_drop:0, s_randval:0, deterministic:0, mask:n, 1.722 ms, 65.88 TFlops, 31.79 GB/s, valid:y
real	7m39.239s
user	10m9.883s

After:

time ./bin/tile_example_fmha_bwd -d=64
[fp16|batch|bhsd] b:2, h:8/8, s:3328/3328, d:64/64, scale:0.125, bias:n, dbias:0, p_drop:0, s_randval:0, deterministic:0, mask:n, 1.722 ms, 65.86 TFlops, 31.79 GB/s, valid:y
real	0m7.887s
user	2m54.201s

That is, wall-clock time drops from about 7 min 39 s to about 8 s. Now bwd's runtime is comparable to fwd's.

The Run CK_TILE_FMHA Tests CI job takes about 5 hours. Let's see if this change decreases its duration as well...

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

I have added tests relevant to the introduced functionality, and the unit tests are passing locally
I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
I have added inline documentation which enables the maintainers with understanding the motivation
I have removed the stale documentation which is no longer relevant after this pull request
(If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
I have run clang-format on all changed files
Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered


ex-rzr added 4 commits June 20, 2025 09:47

Avoid passing indices (std::vector) by value to host tensor's operator()

5440415

Each access requires 2 allocations and copies of the vector.

Remove 1 unneeded vector copy from the slowest part of fmha_bwd's verification

f0cfd0e

Compute ds_hp_host_ref in parallel

388b2a3

This sequential ForEach is the slowest part of validation and it benefits from parallel computation.
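As a rough sketch of the technique, the outer loop can be split into contiguous chunks, one per thread, provided each iteration writes only to its own output element. parallel_for and row_sums are hypothetical names for this example, built on std::thread; this is not the ParallelTensorFunctor implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: runs f(i) for every i in [0, n), split into
// contiguous chunks across at most num_threads worker threads.
template <typename F>
void parallel_for(std::size_t n, std::size_t num_threads, F f)
{
    num_threads             = std::max<std::size_t>(1, std::min(num_threads, n));
    const std::size_t chunk = (n + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for(std::size_t t = 0; t < num_threads; ++t)
    {
        const std::size_t begin = t * chunk;
        const std::size_t end   = std::min(n, begin + chunk);
        if(begin >= end)
            break;
        workers.emplace_back([&f, begin, end] {
            for(std::size_t i = begin; i < end; ++i)
                f(i);
        });
    }
    // All workers are joined before f goes out of scope.
    for(auto& w : workers)
        w.join();
}

// Example use: each row is reduced independently, so rows can be
// processed in parallel without any locking.
std::vector<float> row_sums(const std::vector<float>& m,
                            std::size_t rows,
                            std::size_t cols,
                            std::size_t num_threads)
{
    std::vector<float> out(rows, 0.0f);
    parallel_for(rows, num_threads, [&](std::size_t r) {
        float acc = 0.0f;
        for(std::size_t c = 0; c < cols; ++c)
            acc += m[r * cols + c];
        out[r] = acc; // each thread writes to distinct elements
    });
    return out;
}
```

This pattern is consistent with the timings in the description, where user time stays well above real time after the change because the work is spread across cores.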

Do not use ForEach for simple copy and conversion of large tensors

6ac1f4a

These tensors all have the same shape {nhead, real_seqlen_q, real_seqlen_k} and can be copied/converted without complex computations of linear indices.
