Despite the optimizations for GEMM operators in Megatron-LM, we identify opportunities for further enhancement in other operators. For the attention part, we adopt FlashAttention-2 [16], which improves work partitioning between different thread blocks and warps. For LayerNorm and GeLU, we observe that previous implementations compose them from fine-grained kernels. Fusing these kernels reduces the overhead of launching multiple kernels and helps optimize memory access patterns, thereby achieving better performance.
Reference: MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs, page 5
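To make the kernel-fusion point concrete, below is a minimal sketch (not the MegaScale or Megatron-LM implementation) of fusing a bias add with a GeLU activation into a single CUDA kernel. Run separately, these two elementwise ops each pay a kernel-launch cost and round-trip the intermediate tensor through global memory; fused, there is one launch and the intermediate value stays in registers. All names and sizes here are illustrative assumptions.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cmath>

// Fused bias-add + GeLU: one kernel instead of two, so the intermediate
// (x + bias) never touches global memory and only one launch is paid.
__global__ void fused_bias_gelu(const float* __restrict__ x,
                                const float* __restrict__ bias,
                                float* __restrict__ y,
                                int rows, int cols) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= rows * cols) return;
    float v = x[idx] + bias[idx % cols];           // bias add (row-major layout)
    // tanh approximation of GeLU, common in GPT-style models
    const float c = 0.7978845608028654f;           // sqrt(2/pi)
    y[idx] = 0.5f * v * (1.0f + tanhf(c * (v + 0.044715f * v * v * v)));
}

int main() {
    // Illustrative sizes: a 1024-token batch with a 4096-wide hidden dimension.
    const int rows = 1024, cols = 4096, n = rows * cols;
    float *x, *bias, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&bias, cols * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 0.1f;
    for (int j = 0; j < cols; ++j) bias[j] = 0.01f;

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    fused_bias_gelu<<<blocks, threads>>>(x, bias, y, rows, cols);
    cudaDeviceSynchronize();
    printf("y[0] = %f\n", y[0]);

    cudaFree(x); cudaFree(bias); cudaFree(y);
    return 0;
}
```

The same idea extends to the LayerNorm case mentioned in the paper: folding the surrounding elementwise ops into one kernel rather than chaining several small ones.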