[GPUCodegen] Characterize performance for dynamic fused self attention #18931
@MaheshRavishankar @Groverkss I've summarized the info about what needs to be analyzed here. I think for LLMs we only care about self-attention -- thus K2 = M.
For K1/N you can use 64/128. You can probably ignore 256.
Do we want this done for MI300X in CPX?
Either SPX or CPX on MI300X should be fine.
I'll start with CPX then...
Either should be fine, but I think we have SPX available more easily. The trends should be the same.
```mlir
!dtype = f16
!Q = tensor<1x?x64xf16>
!K = tensor<1x?x64xf16>
!V = tensor<1x?x64xf16>
!O = tensor<1x?x64xf16>

#tuning = #iree_codegen.compilation_info<
  lowering_config = #iree_gpu.lowering_config<{
    workgroup = [1, 64, 0, 0, 0],
    reduction = [0, 0, 0, 0, 32]
  }>,
  translation_info = #iree_codegen.translation_info<
    LLVMGPUVectorDistribute workgroup_size = [64, 4] subgroup_size = 64,
    {mma_schedule = #iree_gpu.mma_schedule<
       intrinsic = #iree_gpu.mma_layout<MFMA_F32_32x32x8_F16>,
       subgroup_m_count = 4, subgroup_n_count = 1>,
     llvm_func_attrs = {"amdgpu-waves-per-eu" = "2",
                        "denormal-fp-math-f32" = "preserve-sign"}}>>

#Q = affine_map<(b, m, n, k1, k2) -> (b, m, k1)>
#K = affine_map<(b, m, n, k1, k2) -> (b, k2, k1)>
#V = affine_map<(b, m, n, k1, k2) -> (b, k2, n)>
#S = affine_map<(b, m, n, k1, k2) -> ()>
#O = affine_map<(b, m, n, k1, k2) -> (b, m, n)>

func.func @main(%Q : !Q, %K : !K, %V : !V) -> !O {
  %scale = arith.constant 1.0 : !dtype
  %c1 = arith.constant 1 : index
  // The M dimension is dynamic; query its runtime size to size the output.
  %size1 = tensor.dim %Q, %c1 : !Q
  %empty = tensor.empty(%size1) : !O
  %O = iree_linalg_ext.attention
       { indexing_maps = [#Q, #K, #V, #S, #O],
         compilation_info = #tuning }
       ins(%Q, %K, %V, %scale : !Q, !K, !V, !dtype)
       outs(%empty : !O) {
  ^bb0(%score: f32):
    iree_linalg_ext.yield %score : f32
  } -> !O
  return %O : !O
}
```

I've managed to generate IR as above, but it's not compiling as of now. Is there any test/example of a dynamic attention kernel in the codebase?
Maybe we should try static sizes for those shapes. Making the shape dynamic will not give us as clear a signal yet. |
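For reference, here is a minimal, untested sketch of how such a static-size sweep could be stamped out. The file names are hypothetical, and the `#tuning` attribute is dropped for brevity (it can be re-added per tile-size configuration); otherwise this is just the IR from above with the dynamic dims made static, which lets `tensor.dim`/`tensor.empty(%size1)` collapse into a plain `tensor.empty()`:

```python
# Sketch: emit one static-shape variant of the attention IR per M value.
# For self-attention K2 == M, so a single M parameter fixes all dynamic dims.
# MLIR braces are doubled ({{ }}) to survive Python's str.format.

STATIC_ATTENTION = """\
!dtype = f16
!Q = tensor<1x{m}x64xf16>
!K = tensor<1x{m}x64xf16>
!V = tensor<1x{m}x64xf16>
!O = tensor<1x{m}x64xf16>
#Q = affine_map<(b, m, n, k1, k2) -> (b, m, k1)>
#K = affine_map<(b, m, n, k1, k2) -> (b, k2, k1)>
#V = affine_map<(b, m, n, k1, k2) -> (b, k2, n)>
#S = affine_map<(b, m, n, k1, k2) -> ()>
#O = affine_map<(b, m, n, k1, k2) -> (b, m, n)>
func.func @main(%Q : !Q, %K : !K, %V : !V) -> !O {{
  %scale = arith.constant 1.0 : !dtype
  %empty = tensor.empty() : !O
  %O = iree_linalg_ext.attention
       {{ indexing_maps = [#Q, #K, #V, #S, #O] }}
       ins(%Q, %K, %V, %scale : !Q, !K, !V, !dtype)
       outs(%empty : !O) {{
  ^bb0(%score: f32):
    iree_linalg_ext.yield %score : f32
  }} -> !O
  return %O : !O
}}
"""

# M sweep from the issue description; K2 tracks M implicitly.
M_VALUES = [1024, 2048, 3072, 4096, 5120, 6144, 7168, 8192, 16384]

for m in M_VALUES:
    with open(f"attention_m{m}.mlir", "w") as f:
        f.write(STATIC_ATTENTION.format(m=m))
```

Each file can then be compiled and timed individually, which keeps dynamic-shape overhead from muddying the signal about the kernel's intrinsic performance.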
We would like to know the performance characteristics of fused self-attention with dynamic parameters (a flop-count sketch for converting the resulting timings into throughput follows this list):
- M dynamic lengths: 1024, 2048, 3072, 4096, 5120, 6144, 7168, 8192, 16384
- M tile sizes: 16, 32, 64, and 128
- K1 values to be used: 64, 128
- K2 values: equal to M, so that the problem is self-attention
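For turning measured runtimes into a comparable throughput number: the two matmuls dominate the fused kernel's flop count, with Q·Kᵀ costing 2·B·M·K2·K1 flops and P·V costing 2·B·M·K2·N, so with K2 = M the total is roughly 2·B·M²·(K1 + N), ignoring the softmax. A small helper along these lines could report achieved TFLOP/s (the function name and defaults are mine, matching the shapes above):

```python
# Approximate achieved throughput for one fused self-attention call.
# Flop model: 2*B*M*K2*K1 (Q.K^T) + 2*B*M*K2*N (P.V), softmax ignored;
# with K2 == M this is 2*B*M*M*(K1 + N).

def attention_tflops(runtime_s: float, m: int, k1: int = 64, n: int = 64,
                     batch: int = 1) -> float:
    flops = 2 * batch * m * m * (k1 + n)
    return flops / runtime_s / 1e12

# Example: an M = 4096, K1 = N = 64 run that takes 0.5 ms.
print(f"{attention_tflops(0.5e-3, 4096):.1f} TFLOP/s")  # ~8.6 TFLOP/s
```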