Despite the optimizations for GEMM operators in Megatron-LM, we identify opportunities for further enhancement in other operators. For the attention part, we adopt FlashAttention-2 [16], which improves work partitioning between different thread blocks and warps. For LayerNorm and GeLU, we observe that previous implementations compose them from fine-grained kernels. Fusing these kernels reduces the overhead of launching multiple kernels and helps optimize memory access patterns, thereby achieving better performance.
Reference: MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs, page 5
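To make the kernel-fusion point concrete, below is a minimal sketch (not the MegaScale or Megatron-LM implementation) of fusing a bias add with a GeLU activation into a single CUDA kernel. Run separately, these two elementwise ops each pay a kernel-launch cost and round-trip the intermediate tensor through global memory; fused, there is one launch and the intermediate value stays in registers. All names and sizes here are illustrative assumptions.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cmath>

// Fused bias-add + GeLU: one kernel instead of two, so the intermediate
// (x + bias) never touches global memory and only one launch is paid.
__global__ void fused_bias_gelu(const float* __restrict__ x,
                                const float* __restrict__ bias,
                                float* __restrict__ y,
                                int rows, int cols) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= rows * cols) return;
    float v = x[idx] + bias[idx % cols];           // bias add (row-major layout)
    // tanh approximation of GeLU, common in GPT-style models
    const float c = 0.7978845608028654f;           // sqrt(2/pi)
    y[idx] = 0.5f * v * (1.0f + tanhf(c * (v + 0.044715f * v * v * v)));
}

int main() {
    // Illustrative sizes: a 1024-token batch with a 4096-wide hidden dimension.
    const int rows = 1024, cols = 4096, n = rows * cols;
    float *x, *bias, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&bias, cols * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 0.1f;
    for (int j = 0; j < cols; ++j) bias[j] = 0.01f;

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    fused_bias_gelu<<<blocks, threads>>>(x, bias, y, rows, cols);
    cudaDeviceSynchronize();
    printf("y[0] = %f\n", y[0]);

    cudaFree(x); cudaFree(bias); cudaFree(y);
    return 0;
}
```

The same idea extends to the LayerNorm case mentioned in the paper: folding the surrounding elementwise ops into one kernel rather than chaining several small ones.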