It seems the current fused attention pads the matrices and calls into the tensor cores in every case, which wastes compute for smaller sequence lengths. We might need a DeviceBatchedGemvSoftmaxGemvPermute variant for this case.
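For context, here is a minimal NumPy sketch (all sizes are assumptions for illustration, not taken from this issue) of why attention at query length 1 is GEMV-shaped rather than GEMM-shaped: both matrix products have a single-row operand, so padding them up to tensor-core tile sizes largely does redundant work.

```python
import numpy as np

# Illustrative sizes (assumed, not from the issue): batch=4, heads=16,
# head_size=64, cached key/value length=256, and a single new query token.
B, N, H, S_kv, S_q = 4, 16, 64, 256, 1

q = np.random.randn(B, N, S_q, H).astype(np.float32)   # query for the new token
k = np.random.randn(B, N, S_kv, H).astype(np.float32)  # cached keys
v = np.random.randn(B, N, S_kv, H).astype(np.float32)  # cached values

# Q @ K^T is (1, H) x (H, S_kv) per (batch, head): a batched GEMV, not a GEMM.
scores = np.einsum("bnqh,bnkh->bnqk", q, k) / np.sqrt(H)

# Row-wise softmax over the key dimension.
p = np.exp(scores - scores.max(axis=-1, keepdims=True))
p /= p.sum(axis=-1, keepdims=True)

# P @ V is (1, S_kv) x (S_kv, H) per (batch, head): again a batched GEMV.
ctx = np.einsum("bnqk,bnkh->bnqh", p, v)                # (B, N, 1, H)

# Permute back to the (batch, seq, hidden) layout expected downstream.
out = ctx.transpose(0, 2, 1, 3).reshape(B, S_q, N * H)
print(out.shape)  # (4, 1, 1024)
```

A DeviceBatchedGemvSoftmaxGemvPermute-style path would map these four steps onto batched GEMV, softmax, batched GEMV, and a permute without the padding.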
@cloudhan Apologies for the lack of response. Can you please check whether this is still an issue with the latest ROCm 6.2? If not, please close the ticket. Thanks!
Sequence length 1 is extremely important for decoding (ASR, text generation, etc.).
In onnxruntime, we found that the rocblas gemm + softmax kernel + rocblas gemm path is much faster for this case (see the sketch below).
The shapes for the case above are as follows:
Other cases are:
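Independent of the exact shapes referenced above (not reproduced here), a rough arithmetic comparison may help quantify the waste of padding. The 16-row tile that the fused kernel is assumed to pad the query length to is a guess for illustration, not something stated in this issue:

```python
# Rough FLOP count for the two attention matrix products, per decoding step.
# All sizes and the padded query length of 16 are assumptions for illustration.
B, N, H, S_kv = 4, 16, 64, 256

def attention_matmul_flops(s_q):
    qk = 2 * s_q * S_kv * H   # Q @ K^T
    pv = 2 * s_q * H * S_kv   # P @ V
    return B * N * (qk + pv)

gemv_path = attention_matmul_flops(1)    # unpadded GEMV work for one new token
padded_path = attention_matmul_flops(16) # work if the query length is padded to a 16-row tile
print(gemv_path, padded_path, padded_path / gemv_path)  # padded path does ~16x the arithmetic
```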