
Shard several of the most costly targets. #2373


Open
shumway wants to merge 13 commits into develop

Conversation

@shumway shumway commented Jun 19, 2025

Proposed changes

Note: This is a roll forward of #2266 which was rolled back in #2361.

We want to reduce build time for the CK library that is consumed by MIOpen.

Several of the build units (source files) take ~20 minutes to compile. To reduce compilation time, we split those files
out into multiple copies (build shards). There are three complications:

  1. The kernel type instantiations to be sharded live in std::tuple types, and we don't want to break those tuples up
    by hand. The solution we propose is a utility metafunction, filter_tuple_by_modulo_t, that splits the kernel
    type tuples using a stride and offset.
  2. To avoid duplicating a lot of the instantiation code, we template the instantiation functions on the shard
    stride and offset.
  3. To call the code, we wrap the templated instantiation functions in the original instantiation function, and use
    extern template to make sure we don't re-instantiate these templates from the header code.
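The tuple-splitting idea in step 1 can be sketched as follows. The name filter_tuple_by_modulo_t comes from this PR, but the implementation below is an illustrative guess, not CK's actual code: it keeps element i of a std::tuple whenever i % Stride == Offset.

```cpp
#include <cstddef>
#include <tuple>
#include <type_traits>
#include <utility>

// Illustrative sketch: keep tuple element i when i % Stride == Offset.
template <typename Tuple, std::size_t Stride, std::size_t Offset>
struct filter_tuple_by_modulo;

template <typename... Ts, std::size_t Stride, std::size_t Offset>
struct filter_tuple_by_modulo<std::tuple<Ts...>, Stride, Offset>
{
    // For each index, contribute either a one-element tuple or an empty
    // tuple, then concatenate the survivors with std::tuple_cat.
    template <std::size_t... Is>
    static auto select(std::index_sequence<Is...>)
        -> decltype(std::tuple_cat(
            std::declval<std::conditional_t<Is % Stride == Offset,
                                            std::tuple<Ts>,
                                            std::tuple<>>>()...));

    using type = decltype(select(std::index_sequence_for<Ts...>{}));
};

template <typename Tuple, std::size_t Stride, std::size_t Offset>
using filter_tuple_by_modulo_t =
    typename filter_tuple_by_modulo<Tuple, Stride, Offset>::type;

// With a stride of 2, shard 0 keeps indices 0, 2, 4 and shard 1 keeps 1, 3:
using KernelTuple = std::tuple<int, float, double, char, long>;
static_assert(std::is_same_v<filter_tuple_by_modulo_t<KernelTuple, 2, 0>,
                             std::tuple<int, double, long>>,
              "shard 0 keeps even-indexed kernels");
static_assert(std::is_same_v<filter_tuple_by_modulo_t<KernelTuple, 2, 1>,
                             std::tuple<float, char>>,
              "shard 1 keeps odd-indexed kernels");
```

Each shard source file then instantiates only its slice of the kernel tuple, so no kernel is compiled twice and no kernel is dropped, as long as every offset 0..Stride-1 is covered.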

We partially automated this process with code generation by writing a CMake function,
generate_sharded_instantiations, that generates the source files for the sharded instantiation functions and the
calling function directly from CMake.
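Steps 2 and 3 together look roughly like the sketch below. All names here (op_instance, add_device_op_instances_shard, add_device_op_instances) are hypothetical stand-ins for CK's real instance types and entry points; the point is only the shape of the pattern: one templated shard function, one explicit instantiation per generated source file, and a thin wrapper that preserves the original calling convention.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a device-operation instance pointer.
using op_instance = std::string;

// Complication 2: the instantiation function is templated on the shard
// stride and offset. In CK this body would instantiate
// filter_tuple_by_modulo_t<KernelTuple, NumShards, ShardIndex>; here we
// just record which shard ran.
template <std::size_t NumShards, std::size_t ShardIndex>
void add_device_op_instances_shard(std::vector<op_instance>& instances)
{
    instances.push_back("instances from shard " + std::to_string(ShardIndex) +
                        " of " + std::to_string(NumShards));
}

// Each generated shard source file would hold one explicit instantiation,
//   template void add_device_op_instances_shard<4, 0>(std::vector<op_instance>&);
// while a generated header carries matching `extern template` declarations
// so no other translation unit re-instantiates the shards (complication 3).

// The original entry point becomes a thin wrapper that calls every shard,
// so callers are unaffected by the sharding:
void add_device_op_instances(std::vector<op_instance>& instances)
{
    add_device_op_instances_shard<4, 0>(instances);
    add_device_op_instances_shard<4, 1>(instances);
    add_device_op_instances_shard<4, 2>(instances);
    add_device_op_instances_shard<4, 3>(instances);
}
```

Because the shards are independent translation units, the build system can compile them in parallel, which is where the wall-clock savings reported below come from.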

I was missing two if-blocks in two template files in #2266, which caused some kernels to be omitted:

  • library/src/tensor_operation_instance/gpu/grouped_conv3d_fwd/xdl/comp/device_grouped_conv3d_fwd_xdl_ndhwgc_gkzyxc_ndhwgk_f16_comp_instance.in
  • library/src/tensor_operation_instance/gpu/grouped_conv3d_fwd/xdl/comp/device_grouped_conv3d_fwd_xdl_ndhwgc_gkzyxc_ndhwgk_f16_comp_instance.in

The failing test for gfx94x had been disabled in MIOpen, but I have now verified all of this on gfx94x and it passes.

shumway and others added 11 commits June 17, 2025 21:16
Introduces a filter_tuple_by_modulo to break up tuples.

Drops build time of target from 21 minutes to under 14 minutes with 64
build processes, or 11 minutes with 128 build processes.

time ninja -j 64 device_grouped_conv3d_fwd_instance
I wasn't sure how to test the header-only instantiation code on my
initial commit. From Jenkins CI test results, I see that there is a
test target that depends on these headers:

ninja -j 128 test_grouped_convnd_fwd

This allowed me to test the build locally. I found three mistakes I had
made, mostly related to early experiments I tried on the code.
These were hard to find earlier because this PR is really too large.

I also discovered that there are five 2D convolution targets that now
dominate the compilation time. I will likely address those in a later
PR, rather than adding even more changes to this PR.
Our pattern for instantiating MIOpen templates uses duplicate
declarations (instead of headers). This is fragile, and I didn't
notice that my last commit had a bunch of link errors. I fixed these
mistakes, and the bin/test_grouped_conv_fwd test target binary now links
correctly.
Use a CMake function with template files to generate the source files for
instantiating the kernels and to generate the calling function.
Now that we have automated the shard instantiation, we can shard the 2D
convolution targets that take the longest to build. The target
test_grouped_conv2d_fwd now compiles in 15 minutes.
I used CMAKE_SOURCE_DIR to refer to the top-level source directory in
the ShardInstantiation.cmake file, but this can cause issues with
git submodules.  Instead, we should use PROJECT_SOURCE_DIR to ensure
compatibility when this project is used as a submodule in another
project.
Use a CMake function with template files to generate the source files for
instantiating the kernels and to generate the calling function.
Use a CMake function with template files to generate the source files for
instantiating the kernels and to generate the calling function.

@spolifroni-amd spolifroni-amd left a comment


Really good in-code comments! Very clear!

@shumway shumway changed the title Shumway/refactor targets Shard several of the most costly targets. Jun 20, 2025