[GPUCodegen] Characterize performance for dynamic fused self attention #18931

Open · manupak opened this issue Oct 29, 2024 · 8 comments

manupak commented Oct 29, 2024

We would like to characterize the performance of fused self attention across the following dynamic parameters (one sweep point is sketched after the list):

M dynamic lengths: 1024, 2048, 3072, 4096, 5120, 6144, 7168, 8192, 16384
M tile sizes: 16, 32, 64 and 128
K1 values to be used: 64, 128
K2 values: equal to M, since this is self-attention
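
As a concrete illustration of a single sweep point (a sketch, not part of the original request): with M = K2 = 2048, head dimensions K1 = N = 64, batch 1, and f16 as in the IR further down this thread, the operand types would be:

// Sketch of tensor types for one sweep point: M = K2 = 2048, K1 = N = 64, batch 1, f16.
!Q = tensor<1x2048x64xf16>   // queries: batch x M  x K1
!K = tensor<1x2048x64xf16>   // keys:    batch x K2 x K1 (K2 = M for self-attention)
!V = tensor<1x2048x64xf16>   // values:  batch x K2 x N
!O = tensor<1x2048x64xf16>   // output:  batch x M  x N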

manupak (Author) commented Oct 29, 2024

@MaheshRavishankar @Groverkss I've summarized the information about what needs to be analyzed here.
Let me know if this should be something different.

I think for LLMs we only care about self-attention, thus K2 = M.

Groverkss (Contributor) commented:

For K1/N you can use 64/128. You can probably ignore 256.

manupak changed the title from "[GPUCodegen] Characterize performance for dynamic fused self? attention" to "[GPUCodegen] Characterize performance for dynamic fused self attention" on Oct 29, 2024
manupak (Author) commented Oct 29, 2024

Do we want this done for MI300X in CPX?

Groverkss (Contributor) commented:

> Do we want this done for MI300X in CPX?

Either SPX or CPX on MI300X should be fine.

manupak (Author) commented Oct 29, 2024

I'll start with CPX then...

MaheshRavishankar (Contributor) commented:

Either should be fine, but I think we have SPX available more easily. The trends should be the same.

manupak (Author) commented Oct 29, 2024

!dtype = f16
!Q     = tensor<1x?x64xf16>
!K     = tensor<1x?x64xf16>
!V     = tensor<1x?x64xf16>
!O     = tensor<1x?x64xf16>

#tuning = #iree_codegen.compilation_info<lowering_config = #iree_gpu.lowering_config<{ workgroup = [1, 64, 0, 0, 0], reduction = [0, 0, 0, 0, 32] }>, translation_info = #iree_codegen.translation_info<LLVMGPUVectorDistribute workgroup_size = [64, 4] subgroup_size = 64 ,{mma_schedule = #iree_gpu.mma_schedule<intrinsic = #iree_gpu.mma_layout<MFMA_F32_32x32x8_F16>, subgroup_m_count = 4, subgroup_n_count = 1> , llvm_func_attrs = { "amdgpu-waves-per-eu" = "2","denormal-fp-math-f32" = "preserve-sign" }}>>
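// The #tuning above tiles M by 64 per workgroup and the K2 reduction by 32, and selects the
// LLVMGPUVectorDistribute pipeline with a 256-thread workgroup (4 subgroups of 64) using the
// MFMA_F32_32x32x8_F16 intrinsic, with all 4 subgroups distributed along M.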


#Q = affine_map<(b, m, n, k1, k2) -> (b, m, k1)>
#K = affine_map<(b, m, n, k1, k2) -> (b, k2, k1)>
#V = affine_map<(b, m, n, k1, k2) -> (b, k2, n)>
#S = affine_map<(b, m, n, k1, k2) -> ()>
#O = affine_map<(b, m, n, k1, k2) -> (b, m, n)>
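// Iteration space (b, m, n, k1, k2): Q indexes (b, m, k1), K and V index the key/value
// sequence dimension k2, and #S maps to () because the scale operand is a scalar.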

func.func @main(%Q : !Q, %K : !K, %V : !V) -> !O {
  %scale = arith.constant 1.0 : !dtype
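  // Query the dynamic sequence length (dim 1 of %Q) and use it to size the output below.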
  %c1 = arith.constant 1 : index
  %size1 = tensor.dim %Q, %c1 : !Q
  %empty = tensor.empty(%size1) : !O
  %O = iree_linalg_ext.attention 
       { indexing_maps = [#Q, #K, #V, #S, #O]
         ,compilation_info = #tuning
       }
       ins(%Q, %K, %V, %scale : !Q, !K, !V, !dtype) outs(%empty : !O) {
          ^bb0(%score: f32):
            iree_linalg_ext.yield %score : f32
        } -> !O
  return %O : !O
}

I've managed to generate IR as above, but it is not compiling as of now.
I suspect I'm missing something simpler, given that dynamic attention kernels are supported. Is that right?

Is there any test/example of a dynamic attention kernel in the codebase?

MaheshRavishankar (Contributor) commented:

Maybe we should try static sizes for those shapes. Making the shape dynamic will not give us as clear a signal yet.
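
For illustration, a minimal sketch of a static-shape variant of the IR above for one sweep point (assuming M = K2 = 2048 and K1 = N = 64; the compilation_info tuning attribute is left out for brevity, so this is an untuned illustration rather than IR taken from this issue):

// Sketch only: static-shape self attention for one sweep point (M = K2 = 2048, K1 = N = 64).
!dtype = f16
!Q = tensor<1x2048x64xf16>
!K = tensor<1x2048x64xf16>
!V = tensor<1x2048x64xf16>
!O = tensor<1x2048x64xf16>

#Q = affine_map<(b, m, n, k1, k2) -> (b, m, k1)>
#K = affine_map<(b, m, n, k1, k2) -> (b, k2, k1)>
#V = affine_map<(b, m, n, k1, k2) -> (b, k2, n)>
#S = affine_map<(b, m, n, k1, k2) -> ()>
#O = affine_map<(b, m, n, k1, k2) -> (b, m, n)>

func.func @main_static(%Q : !Q, %K : !K, %V : !V) -> !O {
  %scale = arith.constant 1.0 : !dtype
  // All dimensions are static, so no tensor.dim is needed to size the output.
  %empty = tensor.empty() : !O
  %O = iree_linalg_ext.attention
       { indexing_maps = [#Q, #K, #V, #S, #O] }
       ins(%Q, %K, %V, %scale : !Q, !K, !V, !dtype) outs(%empty : !O) {
          ^bb0(%score: f32):
            iree_linalg_ext.yield %score : f32
        } -> !O
  return %O : !O
}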
