ck_tile kernel for gemm with groupwise quantized A or B tensor. #2362

vj-krish · 2025-06-18T01:37:41Z

This change introduces new pipelines with the Intrawave scheduler, along with block gemm primitives that load the scale tensor into registers and dequantize the C tensor in registers after the MFMA.

Scale tensor data (AQ/BQ) is split across threads in registers and is not stored in LDS.
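As a rough illustration of dequantizing after the multiply, here is a minimal pure-Python reference model, not the ck_tile implementation. It assumes a quantized A tensor with one scale per (row, K-group), i.e. an AQ layout of M x (K / group_size); that layout and the function names `gemm_groupwise_a` and `gemm_dequant_first` are hypothetical, chosen only for this sketch. The inner loop over one K-group stands in for the MFMA step, and the scale is applied to the resulting partial sum.

```python
def gemm_groupwise_a(a_q, a_scale, b, group_size):
    """Reference C = dequant(A) @ B, applying scales to per-group partial sums.

    a_q:     M x K list of quantized A values (plain numbers here)
    a_scale: M x (K // group_size) list of scales (assumed layout)
    b:       K x N list of B values
    """
    m_len, k_len, n_len = len(a_q), len(b), len(b[0])
    if k_len % group_size != 0:
        raise ValueError("K must be a whole number of groups")
    c = [[0.0] * n_len for _ in range(m_len)]
    for m in range(m_len):
        for n in range(n_len):
            for g in range(k_len // group_size):
                # Unscaled partial product over one K-group (stand-in for MFMA).
                partial = 0.0
                for k in range(g * group_size, (g + 1) * group_size):
                    partial += a_q[m][k] * b[k][n]
                # Dequantize after the multiply: one scale per (row, group).
                c[m][n] += a_scale[m][g] * partial
    return c


def gemm_dequant_first(a_q, a_scale, b, group_size):
    """Same math, but every A element is scaled before the multiply."""
    m_len, k_len, n_len = len(a_q), len(b), len(b[0])
    c = [[0.0] * n_len for _ in range(m_len)]
    for m in range(m_len):
        for n in range(n_len):
            for k in range(k_len):
                c[m][n] += a_scale[m][k // group_size] * a_q[m][k] * b[k][n]
    return c
```

Because the scale is constant within a group, both functions compute the same sum; the first applies one multiply per group and output element instead of one per A element, which is the property that makes post-MFMA dequantization possible.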

Current support is for the following combinations, but it should be fairly straightforward to extend support to more formats.

fp8, fp8 -> f32
bf8, bf8 -> f32
i4, fp8 -> f32
i4, bf8 -> f32

The group size can be as small as the K length of the underlying WarpGemm primitive.
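The lower bound on group size can be sketched as a small check. The helper below is hypothetical and not part of ck_tile; it additionally assumes that the group size is a whole multiple of the WarpGemm K length, which the description does not state. The values 128 and 32 in the usage note are illustrative only.

```python
def warp_steps_per_group(group_size, warp_k):
    """Number of WarpGemm K steps that share one scale value.

    Assumes (not stated in the PR) that group_size is a multiple of warp_k.
    """
    if group_size < warp_k:
        raise ValueError("group size is below the WarpGemm K length")
    if group_size % warp_k != 0:
        raise ValueError("group size is not a multiple of the WarpGemm K length")
    return group_size // warp_k
```

For example, `warp_steps_per_group(128, 32)` means four consecutive WarpGemm K steps accumulate before one scale is applied, while a group size equal to `warp_k` applies a scale after every step.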

For gemm problems with a quantized B tensor, this change also introduces preliminary support for a flatmm pipeline, which loads the B tensor directly into registers.

Proposed changes

Please describe the motivation behind the pull request, whether it enables a new feature or fixes a bug. If there are associated pull requests or issues, please link them to the pull request.

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

[] I have added tests relevant to the introduced functionality, and the unit tests are passing locally
I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
I have added inline documentation which enables the maintainers with understanding the motivation
I have removed the stale documentation which is no longer relevant after this pull request
(If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
I have run clang-format on all changed files
Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered


ThomasNing · 2025-06-19T23:16:57Z

@vj-krish Thank you Vijay! Will take a look.

vj-krish requested review from illsilin, carlushuang, qianfengz, aosewski, poyenc, geyyer, bartekxk, andriy-ca, afagaj, asleepzzz, tenpercent, ThomasNing and coderfeli as code owners June 18, 2025 01:37
