-
Notifications
You must be signed in to change notification settings - Fork 199
Shard several of the most costly targets. #2373
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
shumway
wants to merge
13
commits into
develop
Choose a base branch
from
shumway/refactor-targets
base: develop
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
+1,346
−827
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Introduces a filter_tuple_by_modulo to break up tuples. Drops build time of target from 21 minutes to under 14 minutes with 64 build processes, or 11 minutes with 128 build processes. time ninja -j 64 device_grouped_conv3d_fwd_instance
I wasn't sure how to test the header-only instantiation code on my initial commit. From Jenkins CI test results, I see that there is a test target that depends on these headers: ninja -j 128 test_grouped_convnd_fwd This allowed me to test the build locally. I found three mistakes I made, mostly related to early experiments on I tried on the code. This was hard to find earlier because this PR is really too large. I also discovered that there are five 2D convolution targets that now dominate the compilation time. I will likely address those in a later PR, rather than adding even more changes to this PR.
Our pattern for instantiating MIOpen templates uses duplicate declarations (instead of headers). This is fragile, and I didn't notice that my last commit had a bunch of link errors. I fixed these mistakes, and the bin/test_grouped_conv_fwd test target binary now links correctly.
Use a CMake function with template files to generate the source files for the intantiating the kerenels and to generate the calling function.
Now that we have automated the shard instantiation, we can shard the 2D convolution targets that take the longest to build. The target test_grouped_conv2d_fwd now compiles in 15 minutes.
I used CMAKE_SOURCE_DIR to refer to the top-level source directory in the ShardInstantiation.cmake file, but this can cause issues with git submodules. Instead, we should use PROJECT_SOURCE_DIR to ensure compatibility when this project is used as a submodule in another project.
Use a CMake function with template files to generate the source files for the intantiating the kerenels and to generate the calling function.
Use a CMake function with template files to generate the source files for the intantiating the kerenels and to generate the calling function.
spolifroni-amd
approved these changes
Jun 20, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really good in-code comments! Very clear!
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Proposed changes
Note: This is a roll forward of #2266 which was rolled back in #2361.
We want to reduce build time for the CK library that is consumed by MIOpen.
Several of the build units (source files) take ~20 minutes to compile. To reduce compilation time, we split those files
out into multiple copies (build shards). There are three complications:
The solution we propose is to create a utility metafunction
filter_tuple_by_modulo_t
that splits up the kerneltype tuples using a stride and offset.
extern template
to make sure we don't reinstantiate these templates from the header code.We partially automated this processes with code generation, by writing a CMake function
generate_sharded_instantiations to generate the source files for the shared instatiation functions and the calling
function directly from CMake.
I was missing two if-blocks from two template files in #2266, which causes some kernels to be omitted:
The failing test for gfx94x was disabled in MIOpen, but now I've checked this all in gfx94x and it passes now.