
Improve fmha_bwd tests performance #2376


Draft: ex-rzr wants to merge 4 commits into develop


ex-rzr (Contributor) commented Jun 20, 2025

Proposed changes

tile_example_fmha_bwd takes far more time than tile_example_fmha_fwd for the same parameters, even allowing for the extra work that bwd does.
This makes it practical to run it only on very small seqlens.

The main bottleneck is the computation of ds_hp_host_ref. First I optimized its inner loop by avoiding the allocation and copying of index vectors (std::vector) on each access. Then I optimized its outer loop by using ParallelTensorFunctor instead of ForEach.
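
A minimal, self-contained sketch of the two ideas, not the actual CK code: `Tensor3D`, `offset_slow`, `offset`, `parallel_for` and `compute_ds` are hypothetical names made up for illustration, and the row-wise formula (the standard softmax backward, dS = P * (dP - rowsum(dP * P))) is only an assumed stand-in for what the host reference computes. It contrasts an accessor that allocates and copies an index vector on every call with plain stride arithmetic, and splits the outer (head, row) loop across threads.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a packed 3-D host tensor (not CK's HostTensor).
struct Tensor3D
{
    std::size_t d0, d1, d2;
    std::vector<float> data;

    Tensor3D(std::size_t a, std::size_t b, std::size_t c)
        : d0(a), d1(b), d2(c), data(a * b * c, 0.f)
    {
    }

    // Slow path: the by-value parameter and the local copy each allocate.
    std::size_t offset_slow(std::vector<std::size_t> idx) const
    {
        std::vector<std::size_t> copy = idx;
        assert(copy.size() == 3);
        return (copy[0] * d1 + copy[1]) * d2 + copy[2];
    }

    // Fast path: plain integer arguments, no heap traffic.
    std::size_t offset(std::size_t i, std::size_t j, std::size_t k) const
    {
        assert(i < d0 && j < d1 && k < d2);
        return (i * d1 + j) * d2 + k;
    }
};

// Split [0, n) into contiguous chunks and run f(i) on worker threads.
template <typename F>
void parallel_for(std::size_t n, unsigned num_threads, F f)
{
    num_threads             = std::max(1u, num_threads);
    const std::size_t chunk = (n + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for(unsigned t = 0; t < num_threads; ++t)
    {
        const std::size_t begin = t * chunk;
        const std::size_t end   = std::min(n, begin + chunk);
        if(begin >= end)
            break;
        workers.emplace_back([=] {
            for(std::size_t i = begin; i < end; ++i)
                f(i);
        });
    }
    for(auto& w : workers)
        w.join();
}

// Each (head, row) pair is independent, so rows can be processed in parallel.
void compute_ds(const Tensor3D& p, const Tensor3D& dp, Tensor3D& ds, unsigned num_threads)
{
    parallel_for(p.d0 * p.d1, num_threads, [&](std::size_t row) {
        const std::size_t i = row / p.d1;
        const std::size_t j = row % p.d1;
        float dot           = 0.f;
        for(std::size_t k = 0; k < p.d2; ++k)
            dot += dp.data[dp.offset(i, j, k)] * p.data[p.offset(i, j, k)];
        for(std::size_t k = 0; k < p.d2; ++k)
        {
            const std::size_t o = p.offset(i, j, k);
            ds.data[o]          = p.data[o] * (dp.data[o] - dot);
        }
    });
}
```

Every row writes a disjoint slice of `ds`, so the worker threads need no synchronization beyond the final `join`.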

After that, the remaining bottlenecks were copies and conversions of several large tensors of shape {nhead, real_seqlen_q, real_seqlen_k} that were implemented with ForEach. I replaced them with CopyAsType.
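
The idea behind this change can be sketched as follows. This is an illustration under stated assumptions, not CK's `CopyAsType`: `copy_as_type` and `copy_by_index` are hypothetical helpers, and both tensors are assumed to be densely packed with identical shape and layout, so a flat element-wise conversion gives the same result as visiting every multi-index.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical flat conversion: one pass over the underlying buffer.
template <typename Dst, typename Src>
std::vector<Dst> copy_as_type(const std::vector<Src>& src)
{
    std::vector<Dst> dst(src.size());
    std::transform(src.begin(), src.end(), dst.begin(), [](const Src& v) {
        return static_cast<Dst>(v);
    });
    return dst;
}

// Index-based version for comparison: builds an index vector per element
// and recomputes the linear offset each time.
template <typename Dst, typename Src>
std::vector<Dst>
copy_by_index(const std::vector<Src>& src, std::size_t d0, std::size_t d1, std::size_t d2)
{
    assert(src.size() == d0 * d1 * d2);
    std::vector<Dst> dst(src.size());
    for(std::size_t i = 0; i < d0; ++i)
        for(std::size_t j = 0; j < d1; ++j)
            for(std::size_t k = 0; k < d2; ++k)
            {
                const std::vector<std::size_t> idx{i, j, k}; // allocation per element
                const std::size_t off = (idx[0] * d1 + idx[1]) * d2 + idx[2];
                dst[off]              = static_cast<Dst>(src[off]);
            }
    return dst;
}
```

Both functions produce the same output for packed tensors; the flat version simply skips the per-element index bookkeeping.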

Before:

time ./bin/tile_example_fmha_bwd -d=64
[fp16|batch|bhsd] b:2, h:8/8, s:3328/3328, d:64/64, scale:0.125, bias:n, dbias:0, p_drop:0, s_randval:0, deterministic:0, mask:n, 1.722 ms, 65.88 TFlops, 31.79 GB/s, valid:y
real	7m39.239s
user	10m9.883s

After:

time ./bin/tile_example_fmha_bwd -d=64
[fp16|batch|bhsd] b:2, h:8/8, s:3328/3328, d:64/64, scale:0.125, bias:n, dbias:0, p_drop:0, s_randval:0, deterministic:0, mask:n, 1.722 ms, 65.86 TFlops, 31.79 GB/s, valid:y
real	0m7.887s
user	2m54.201s

I.e., wall-clock time drops from over 7 min to under 8 sec. Now bwd's runtime is comparable to fwd's.

The Run CK_TILE_FMHA Tests CI job takes about 5 hours. Let's see if this change decreases its duration as well...

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which enables the maintainers with understanding the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

ex-rzr added 4 commits June 20, 2025 09:47
Each access requires 2 allocations and copies of the vector.
This sequential ForEach is the slowest part of validation and it benefits
from parallel computation.
These tensors all have the same shape {nhead, real_seqlen_q, real_seqlen_k} and
can be copied/converted without complex computations of linear indices.