
Shard several of the most costly targets. #2266


Merged
shumway merged 8 commits into develop from shumway/refactor_targets on Jun 13, 2025

Conversation


shumway commented May 30, 2025

Proposed changes

We want to reduce build time for the CK library that is consumed by MIOpen.

Several of the build units (source files) take ~20 minutes to compile. To reduce compilation time, we split those files
into multiple smaller source files (build shards). There are three complications:

  1. The kernel type instantiations to be sharded live in std::tuple types, and we don't want to break those tuples up
    by hand at this granularity. The solution we propose is a utility metafunction, filter_tuple_by_modulo_t, that splits
    the kernel type tuples using a stride and offset (sketched right after this list).
  2. To avoid duplicating a lot of the instantiation code, we template the instantiation functions on the stride and offset.
  3. To call the code, we wrap the templated instantiation functions in the original instantiation function, and use
    extern template to make sure we don't reinstantiate these templates from the header code (see the second sketch below).
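
For illustration, here is a minimal sketch of how such a stride/offset filter can be built from std::index_sequence and std::tuple_cat. The name mirrors the PR, but treat this as an assumed shape of the utility rather than the exact CK implementation:

```cpp
#include <cstddef>
#include <tuple>
#include <type_traits>
#include <utility>

// Keep every element of Tuple whose index I satisfies I % Stride == Offset.
template <typename Tuple, std::size_t Stride, std::size_t Offset, typename Indices>
struct filter_tuple_by_modulo_impl;

template <typename... Ts, std::size_t Stride, std::size_t Offset, std::size_t... Is>
struct filter_tuple_by_modulo_impl<std::tuple<Ts...>, Stride, Offset,
                                   std::index_sequence<Is...>>
{
    // Each index contributes a one-element tuple if it matches the
    // stride/offset rule, an empty tuple otherwise; tuple_cat joins them.
    using type = decltype(std::tuple_cat(
        std::declval<std::conditional_t<Is % Stride == Offset,
                                        std::tuple<Ts>,
                                        std::tuple<>>>()...));
};

template <typename Tuple, std::size_t Stride, std::size_t Offset>
using filter_tuple_by_modulo_t = typename filter_tuple_by_modulo_impl<
    Tuple, Stride, Offset,
    std::make_index_sequence<std::tuple_size_v<Tuple>>>::type;

// Stride 3, offset 1 keeps indices 1 and 4 of a five-element tuple.
static_assert(std::is_same_v<
    filter_tuple_by_modulo_t<std::tuple<int, float, char, double, long>, 3, 1>,
    std::tuple<float, long>>);
```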

We partially automated this process with code generation, writing a CMake function
generate_sharded_instantiations that generates the source files for the sharded instantiation functions and the calling
function directly from CMake.
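
To make points 2 and 3 concrete, here is a self-contained toy version of the pattern, reusing filter_tuple_by_modulo_t from the sketch above. All names here (AllKernelTuple, add_device_conv_instances_shard, the K* kernel stand-ins) are hypothetical illustrations, not the actual CK symbols:

```cpp
#include <cstddef>
#include <string>
#include <tuple>
#include <typeinfo>
#include <vector>

// Toy stand-ins for the CK kernel types and the instance list.
struct K0 {};
struct K1 {};
struct K2 {};
struct K3 {};
using AllKernelTuple = std::tuple<K0, K1, K2, K3>;

// Register one "instance" per kernel type in the (possibly filtered) tuple.
template <typename... Kernels>
void add_device_operation_instances(std::vector<std::string>& instances,
                                    std::tuple<Kernels...>)
{
    (instances.push_back(typeid(Kernels).name()), ...);
}

// Point 2: the instantiation function is templated on the shard count
// (stride) and shard index (offset), so each explicit instantiation
// compiles only its slice of the kernel tuple.
template <std::size_t NumShards, std::size_t ShardIndex>
void add_device_conv_instances_shard(std::vector<std::string>& instances)
{
    using Shard = filter_tuple_by_modulo_t<AllKernelTuple, NumShards, ShardIndex>;
    add_device_operation_instances(instances, Shard{});
}

// In the PR, each explicit instantiation like these lives in its own
// generated .cpp file, so the shards compile in parallel:
template void add_device_conv_instances_shard<2, 0>(std::vector<std::string>&);
template void add_device_conv_instances_shard<2, 1>(std::vector<std::string>&);

// Point 3: the generated calling function keeps the original entry point.
// In its own translation unit it would see extern template declarations,
//   extern template void add_device_conv_instances_shard<2, 0>(std::vector<std::string>&);
// which stop it from re-instantiating the shards from the header code.
void add_device_conv_instances(std::vector<std::string>& instances)
{
    add_device_conv_instances_shard<2, 0>(instances);
    add_device_conv_instances_shard<2, 1>(instances);
}
```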

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added inline documentation which enables the maintainers to understand the motivation
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Discussion

I went with this solution because the strategy of breaking tuples down into smaller tuples wasn't going to
get us all the way to individual kernels. This was the best compromise I could find between verbosity and
build performance. We now have about 165 build units for this target, which shard well across 128 shards:

time ninja device_grouped_conv3d_fwd_instance

This reduces the build time for this target from about 22 minutes to 11 minutes.

Introduces a filter_tuple_by_modulo_t to break up tuples.

Drops the build time of the target from 21 minutes to under 14 minutes with 64
build processes, or 11 minutes with 128 build processes.

time ninja -j 64 device_grouped_conv3d_fwd_instance

spolifroni-amd left a comment


doesn't look like there's anything for docs.

shumway added 2 commits June 2, 2025 15:37
I wasn't sure how to test the header-only instantiation code on my
initial commit. From Jenkins CI test results, I see that there is a
test target that depends on these headers:

ninja -j 128 test_grouped_convnd_fwd

This allowed me to test the build locally. I found three mistakes I
made, mostly related to early experiments I tried on the code.
These were hard to find earlier because this PR is really too large.

I also discovered that there are five 2D convolution targets that now
dominate the compilation time. I will likely address those in a later
PR, rather than adding even more changes to this PR.
Our pattern for instantiating MIOpen templates uses duplicate
declarations (instead of headers). This is fragile, and I didn't
notice that my last commit had a bunch of link errors. I fixed these
mistakes, and the bin/test_grouped_conv_fwd test target binary now links
correctly.

bartekxk left a comment


Hi @shumway, great job. Is it possible to generate these C++ files using some Python script? Right now it is impossible to check/review whether each instance has been added correctly. It would be much easier to maintain such a Python generator instead of 150 files. It would also be very useful to port such a Python script to generate C++ files for bwd data and bwd weight.


shumway commented Jun 3, 2025


@bartekxk and I chatted, and I'll look at existing codegen inside the CK codebase to see how to set this up so that it matches what's already here.


shumway commented Jun 4, 2025

@bartekxk's suggestion to use codegen looks like a good idea. I have started to implement this as a CMake function using configure_file. This has the advantage that it's pure CMake, and we get back to essentially no extra hand-written source files. The input template generalizes well, but it may take me another day to get all the codegen rules sorted out and replace everything.
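
As a rough illustration of the configure_file approach (the file and placeholder names are hypothetical, not the PR's actual template), each generated shard source could come from a small template file like this:

```cpp
// shard_instantiation.cpp.in -- hypothetical template file. CMake's
// configure_file(... @ONLY) substitutes @NUM_SHARDS@ and @SHARD_INDEX@,
// emitting one explicit-instantiation .cpp per shard.
#include "add_device_conv_instances_shard.hpp"

template void add_device_conv_instances_shard<@NUM_SHARDS@, @SHARD_INDEX@>(
    std::vector<std::string>&);
```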

Use a CMake function with template files to generate the source files for
instantiating the kernels and to generate the calling function.
shumway requested a review from bartekxk June 4, 2025 20:27

shumway commented Jun 4, 2025

This code is much simpler now, down to 35 files changed. I followed @bartekxk's suggestion and converted to code
generation. With this simplification, I can now shard the 2D convolution targets that are limiting the build time.

Now that we have automated the shard instantiation, we can shard the 2D
convolution targets that take the longest to build. The target
test_grouped_conv2d_fwd now compiles in 15 minutes.
bartekxk previously approved these changes Jun 5, 2025

bartekxk left a comment


lgtm


bartekxk commented Jun 5, 2025

Could you just do a sanity check and verify that test_grouped_convnd_fwd generates the same output on develop and on your branch?

I used CMAKE_SOURCE_DIR to refer to the top-level source directory in
the ShardInstantiation.cmake file, but this can cause issues with
git submodules.  Instead, we should use PROJECT_SOURCE_DIR to ensure
compatibility when this project is used as a submodule in another
project.
shumway merged commit 3a0cb27 into develop Jun 13, 2025
38 of 45 checks passed
shumway deleted the shumway/refactor_targets branch June 13, 2025 10:58
BrianHarrisonAMD added a commit that referenced this pull request Jun 17, 2025
illsilin added a commit that referenced this pull request Jun 17, 2025
illsilin added a commit that referenced this pull request Jun 17, 2025