
[Feature] All GPUs within the same TP group load training data from shared memory #91

Open
xrsrke opened this issue Mar 3, 2024 · 0 comments
Labels: enhancement (New feature or request), help wanted (Extra attention is needed), Low Priority

Comments

@xrsrke (Member) commented Mar 3, 2024

In a typical data-loading phase of distributed training, each GPU worker has its own data loader that reads training data into CPU memory before forwarding it to the GPU. The workers therefore compete for disk read bandwidth, which creates a bottleneck.

In the LLM training setting, GPU workers within the same machine belong to the same tensor parallel group, so their inputs for each iteration are identical. Based on this observation, MegaScale adopts a two-layer tree-based approach: a single, dedicated data loader on each machine reads the training data into a piece of shared memory, and each GPU worker then copies the data it needs to its own GPU memory. This eliminates redundant reads and significantly improves the efficiency of data transfer.

Reference: MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs, page 5
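
Below is a minimal sketch (not nanotron's implementation, just an illustration of the idea) of node-local shared-memory batch loading with PyTorch. It assumes one process per GPU launched with `torchrun`, that all local ranks on a machine form one tensor parallel group passed in as `node_group`, and that the data loader yields NumPy arrays of token ids. Names such as `SHM_NAME`, `BATCH_SHAPE`, and `get_batch_from_shared_memory` are hypothetical.

```python
import numpy as np
import torch
import torch.distributed as dist
from multiprocessing import shared_memory

BATCH_SHAPE = (8, 2048)      # (micro_batch_size, sequence_length) -- assumed
BATCH_DTYPE = np.int64       # token ids
SHM_NAME = "tp_group_batch"  # one segment per machine; hypothetical name


def get_batch_from_shared_memory(dataloader_iter, local_rank, node_group):
    nbytes = int(np.prod(BATCH_SHAPE)) * np.dtype(BATCH_DTYPE).itemsize

    if local_rank == 0:
        # Only the dedicated loader touches the disk: read one batch and
        # publish it in a named shared-memory segment.
        shm = shared_memory.SharedMemory(name=SHM_NAME, create=True, size=nbytes)
        staging = np.ndarray(BATCH_SHAPE, dtype=BATCH_DTYPE, buffer=shm.buf)
        staging[...] = next(dataloader_iter)  # assumed to yield a NumPy array
    else:
        shm = None

    # Make sure the segment is created and filled before anyone attaches.
    dist.barrier(group=node_group)

    if local_rank != 0:
        shm = shared_memory.SharedMemory(name=SHM_NAME)

    view = np.ndarray(BATCH_SHAPE, dtype=BATCH_DTYPE, buffer=shm.buf)
    # Each worker copies the (identical) batch straight to its own GPU.
    batch = torch.from_numpy(view).to(f"cuda:{local_rank}", non_blocking=True)
    torch.cuda.synchronize()

    # Wait until every local rank has copied before releasing the segment.
    dist.barrier(group=node_group)
    shm.close()
    if local_rank == 0:
        shm.unlink()
    return batch
```

A real implementation would presumably keep the segment alive across iterations (or double-buffer it) instead of creating and unlinking it every step, and overlap the host-to-device copy with compute; the sketch only shows the dedicated-loader / shared-memory hand-off described above.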

xrsrke added the enhancement, help wanted, and Low Priority labels on Mar 3, 2024