
[Feature] Kernel Fusion of Layer Norm and GeLU #86

Open

xrsrke opened this issue Mar 2, 2024 · 0 comments

Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

xrsrke (Member) commented Mar 2, 2024

Despite the optimization for GEMM operators in MegatronLM, we identify opportunities for further enhancement in other operators. For the attention part, we adopt FlashAttention-2 [16], which improves work partitioning between different thread blocks and warps. For LayerNorm and GeLU, we observe that they are composed of fine-grained kernels in previous implementations. By fusing these kernels together, we reduce the overhead associated with launching multiple kernels and aid in optimizing memory access patterns, thereby achieving better performance.

Reference: MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs, page 5
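To make the request concrete, below is a minimal sketch of what a fused LayerNorm + GeLU forward kernel could look like in Triton. This is not the MegaScale implementation and not existing code in this repo; the names `_fused_layernorm_gelu_kernel` / `fused_layernorm_gelu`, the tanh-approximate GeLU, and the one-row-per-program layout are illustrative assumptions. The point is only to show the fusion idea: the normalization statistics and the activation are computed in a single kernel launch, so the normalized values never round-trip through global memory between the two ops. A real feature would also need a fused backward pass and benchmarking against the unfused baseline.

```python
# Sketch only: hypothetical fused LayerNorm + GeLU forward kernel in Triton.
import torch
import triton
import triton.language as tl


@triton.jit
def _fused_layernorm_gelu_kernel(
    x_ptr, weight_ptr, bias_ptr, out_ptr,
    n_cols, eps,
    BLOCK_SIZE: tl.constexpr,
):
    # One program instance normalizes and activates one row of the input.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols

    x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=0.0).to(tl.float32)

    # LayerNorm statistics stay in registers; no intermediate tensor is
    # written out between normalization and activation.
    mean = tl.sum(x, axis=0) / n_cols
    diff = tl.where(mask, x - mean, 0.0)
    var = tl.sum(diff * diff, axis=0) / n_cols
    inv_std = 1.0 / tl.sqrt(var + eps)

    w = tl.load(weight_ptr + cols, mask=mask, other=1.0)
    b = tl.load(bias_ptr + cols, mask=mask, other=0.0)
    x_hat = diff * inv_std * w + b

    # tanh-approximate GeLU applied in the same kernel.
    # tanh(t) is expressed via sigmoid: tanh(t) = 2 * sigmoid(2t) - 1.
    t = 0.7978845608028654 * (x_hat + 0.044715 * x_hat * x_hat * x_hat)
    gelu = 0.5 * x_hat * (1.0 + (2.0 * tl.sigmoid(2.0 * t) - 1.0))

    tl.store(out_ptr + row * n_cols + cols, gelu, mask=mask)


def fused_layernorm_gelu(x: torch.Tensor, weight: torch.Tensor, bias: torch.Tensor, eps: float = 1e-5):
    """Single kernel launch instead of LayerNorm followed by a separate GeLU."""
    x_2d = x.reshape(-1, x.shape[-1]).contiguous()
    n_rows, n_cols = x_2d.shape
    out = torch.empty_like(x_2d)
    _fused_layernorm_gelu_kernel[(n_rows,)](
        x_2d, weight, bias, out, n_cols, eps,
        BLOCK_SIZE=triton.next_power_of_2(n_cols),
    )
    return out.reshape(x.shape)
```

Correctness could be checked against the unfused reference, e.g. `torch.nn.functional.gelu(torch.nn.functional.layer_norm(x, (hidden,), weight, bias), approximate="tanh")`, before measuring the kernel-launch and memory-traffic savings the MegaScale paper describes.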
