[Feature] Asynchronous Serialization #87
Labels: enhancement, good first issue, help wanted
Copy checkpoints from device memory to host memory asynchronously, then write them to disk in the background, so that checkpointing does not block training.
Reference: MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs, page 7
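
To make the request concrete, here is a minimal sketch of the two-stage pattern, using only the Python standard library. It is not this project's API: the `AsyncCheckpointer` class, its `save`/`wait` methods, and the pickle file format are all hypothetical, and `copy.deepcopy` merely stands in for the device-to-host copy (in a PyTorch setup that step would presumably be a `tensor.to("cpu", non_blocking=True)` copy into pinned host memory instead).

```python
import copy
import os
import pickle
import threading


class AsyncCheckpointer:
    """Hypothetical helper: snapshot the state, then persist it in a background thread."""

    def __init__(self):
        self._thread = None

    def save(self, state, path):
        # Allow only one write in flight: wait for the previous one first.
        self.wait()
        # Stand-in for the device-to-host copy (assumption, see lead-in).
        # This snapshot is the only step that blocks the training loop.
        snapshot = copy.deepcopy(state)
        self._thread = threading.Thread(target=self._write, args=(snapshot, path))
        self._thread.start()

    @staticmethod
    def _write(snapshot, path):
        # Write to a temporary file and rename it, so a crash mid-write
        # never leaves a truncated checkpoint at `path`.
        tmp_path = path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(snapshot, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)

    def wait(self):
        # Block until the pending background write (if any) has finished.
        if self._thread is not None:
            self._thread.join()
            self._thread = None
```

The training loop would call `save(state, path)` at each checkpoint step and keep training immediately; because the snapshot is taken before `save` returns, later parameter updates do not leak into the file being written. Calling `wait()` before exit ensures the last checkpoint is fully on disk. A real implementation would also need to surface exceptions raised in the writer thread, which this sketch ignores.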