Match resumption loss curve #362

Open
wants to merge 19 commits into
base: main
from

Conversation

Collaborator

@sichu2023 sichu2023 commented Oct 25, 2024

Summary

Add an additional resumption mechanism to the datamodule so that stop-and-go and continuous pretraining produce the same training loss curve.
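
For readers unfamiliar with the idea, here is a toy sketch of what "matching" means: if the data side records how many samples were consumed and restores that count on resume, a stopped-and-resumed run sees the same batches as a continuous one. This is not the code in this PR; `ResumableSampler`, `consumed_samples`, and `run` are hypothetical names used only for illustration.

```python
import random


class ResumableSampler:
    """Toy index sampler whose position can be saved and restored (hypothetical)."""

    def __init__(self, num_samples, batch_size, seed=0):
        self.num_samples = num_samples
        self.batch_size = batch_size
        self.seed = seed
        self.consumed_samples = 0  # the only state needed to resume

    def _epoch_order(self, epoch):
        # The shuffle depends only on (seed, epoch), so it can be rebuilt after a restart.
        rng = random.Random(self.seed + epoch)
        order = list(range(self.num_samples))
        rng.shuffle(order)
        return order

    def next_batch(self):
        batch = []
        for _ in range(self.batch_size):
            epoch, offset = divmod(self.consumed_samples, self.num_samples)
            batch.append(self._epoch_order(epoch)[offset])
            self.consumed_samples += 1
        return batch

    def state_dict(self):
        return {"consumed_samples": self.consumed_samples}

    def load_state_dict(self, state):
        self.consumed_samples = state["consumed_samples"]


def run(steps, sampler):
    """Draw `steps` batches from the sampler."""
    return [sampler.next_batch() for _ in range(steps)]


# Continuous run: 6 steps in one go.
continuous = run(6, ResumableSampler(num_samples=10, batch_size=4))

# Stop-and-go run: 3 steps, checkpoint, fresh sampler, 3 more steps.
first = ResumableSampler(num_samples=10, batch_size=4)
head = run(3, first)
checkpoint = first.state_dict()

second = ResumableSampler(num_samples=10, batch_size=4)
second.load_state_dict(checkpoint)
tail = run(3, second)

assert head + tail == continuous
```

Without the `load_state_dict` call, the second sampler would restart from sample 0 and the batches after the stop would differ, which is the kind of divergence that shows up as a mismatched loss curve.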

Details

A small discrepancy remains right at resumption, but it does not affect the subsequent training/validation curves.
WANDB log

Testing

To be implemented in a subsequent PR on stop-and-go test refactoring.

@sichu2023
Collaborator Author

/build-ci

@sichu2023 sichu2023 force-pushed the sichu/match-resumption-loss-curve branch from fadca1e to 44a9b39 Compare October 25, 2024 15:56
@sichu2023 sichu2023 force-pushed the sichu/match-resumption-loss-curve branch from 636a849 to 09a2efa Compare October 25, 2024 17:27
@sichu2023 sichu2023 force-pushed the sichu/match-resumption-loss-curve branch from 09a2efa to 714b6ef Compare October 25, 2024 17:28
@sichu2023
Collaborator Author

/build-ci

@sichu2023 sichu2023 enabled auto-merge (squash) October 25, 2024 18:03
@@ -183,6 +183,7 @@ def main(
strategy=strategy,
limit_val_batches=limit_val_batches, # This controls upsampling and downsampling
val_check_interval=val_check_interval,
log_every_n_steps=log_every_n_steps,
Collaborator

@jstjohn jstjohn Oct 25, 2024

To make logging less frequent, you need to add this in one other spot. In your nl.MegatronStrategy above, add:

        progress_interval=log_every_n_steps,
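
One way to keep the two settings from drifting apart is to derive both from a single variable. The snippet below is only a sketch with plain dictionaries: the value 50 is a made-up example, and `trainer_kwargs` / `strategy_kwargs` are hypothetical names, not identifiers from this PR.

```python
# Single source of truth for the logging interval (50 is an arbitrary example value).
log_every_n_steps = 50

# Keyword arguments for the trainer call shown in the diff above.
trainer_kwargs = {"log_every_n_steps": log_every_n_steps}

# Keyword arguments for nl.MegatronStrategy, per the review comment.
strategy_kwargs = {"progress_interval": log_every_n_steps}
```

The dictionaries would then be unpacked with `**` into the respective constructors, so changing the interval in one place updates both.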

Collaborator

@jstjohn jstjohn left a comment

Approved with one more comment about reducing the logging interval. Thanks!

Also note I pushed changes that fix geneformer to sichu/fix-pplcallback-sanity

4 participants