Match resumption loss curve #362
base: main
/build-ci
sub-packages/bionemo-geneformer/src/bionemo/geneformer/data/singlecell/datamodule.py
fadca1e to 44a9b39
…new dataloaders are requested as a side effect currently
636a849 to 09a2efa
09a2efa to 714b6ef
/build-ci
@@ -183,6 +183,7 @@ def main(
     strategy=strategy,
     limit_val_batches=limit_val_batches,  # This controls upsampling and downsampling
     val_check_interval=val_check_interval,
+    log_every_n_steps=log_every_n_steps,
To make logging less frequent, you also need to set this in one other place. In your nl.MegatronStrategy above, add
progress_interval=log_every_n_steps,
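A minimal sketch of the suggestion above, so the two settings cannot drift apart. The helper name `logging_kwargs` is hypothetical and not part of this PR; it only builds the keyword arguments and does not import NeMo. The assumption is that the strategy accepts `progress_interval`, as stated in the comment above, and that the trainer call in the diff accepts `log_every_n_steps`.

```python
from typing import Any, Dict, Tuple


def logging_kwargs(log_every_n_steps: int) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (trainer_kwargs, strategy_kwargs) that share one logging interval.

    Hypothetical helper for illustration only; it is not part of the PR.
    """
    if log_every_n_steps < 1:
        raise ValueError("log_every_n_steps must be a positive integer")
    # Keyword taken from the diff hunk above.
    trainer_kwargs = {"log_every_n_steps": log_every_n_steps}
    # Keyword taken from the review comment above.
    strategy_kwargs = {"progress_interval": log_every_n_steps}
    return trainer_kwargs, strategy_kwargs


trainer_kwargs, strategy_kwargs = logging_kwargs(50)

# Intended use (not executed here because it requires NeMo):
#   strategy = nl.MegatronStrategy(<existing arguments>, **strategy_kwargs)
#   trainer = <trainer class used in main()>(<existing arguments>, strategy=strategy, **trainer_kwargs)
```

Deriving both dictionaries from one value means a later change to the interval only has to be made once.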
Approved with one more comment about reducing the logging interval. Thanks!
Also note that I pushed changes that fix geneformer to the sichu/fix-pplcallback-sanity branch.
Summary
Add an additional resumption mechanism to the datamodule to ensure that stop-and-go and continuous pretraining have the same training loss curve.
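To illustrate the general idea (this is not the code in this PR), here is a toy sketch of one way a data pipeline can resume: it records how many samples were consumed and skips them after a restart, so a stop-and-go run sees the same batches as a continuous run. The class `ResumableBatchSampler` and its `state_dict`/`load_state_dict` methods are hypothetical names chosen for this example, and the actual datamodule change may work differently.

```python
import random
from typing import Dict, Iterator, List


class ResumableBatchSampler:
    """Toy sampler: fixed shuffled order plus a counter of consumed samples."""

    def __init__(self, num_samples: int, batch_size: int, seed: int = 0) -> None:
        self.batch_size = batch_size
        self.consumed_samples = 0
        # The order depends only on the seed, so every run sees the same sequence.
        order = list(range(num_samples))
        random.Random(seed).shuffle(order)
        self._order = order

    def state_dict(self) -> Dict[str, int]:
        # This is the only state that has to be stored in a checkpoint.
        return {"consumed_samples": self.consumed_samples}

    def load_state_dict(self, state: Dict[str, int]) -> None:
        self.consumed_samples = state["consumed_samples"]

    def __iter__(self) -> Iterator[List[int]]:
        # Start from the first sample that has not been consumed yet.
        while self.consumed_samples + self.batch_size <= len(self._order):
            start = self.consumed_samples
            batch = self._order[start:start + self.batch_size]
            self.consumed_samples += self.batch_size
            yield batch


# Continuous run: all batches in one go.
continuous = list(ResumableBatchSampler(16, 4, seed=42))

# Stop-and-go run: take two batches, checkpoint, then resume in a new sampler.
first_run = ResumableBatchSampler(16, 4, seed=42)
batches = iter(first_run)
head = [next(batches), next(batches)]
checkpoint = first_run.state_dict()

resumed_run = ResumableBatchSampler(16, 4, seed=42)
resumed_run.load_state_dict(checkpoint)
stop_and_go = head + list(resumed_run)

assert stop_and_go == continuous
```

If both runs feed identical batches to the model in the same order, any remaining difference in the loss curve has to come from somewhere other than data ordering.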
Details
A small discrepancy remains right at resumption, but it does not affect the subsequent training/validation curves.
WANDB log
Testing
To be implemented in a subsequent PR on stop-and-go test refactoring.