Match resumption loss curve #362

sichu2023 · 2024-10-25T14:35:50Z

Summary

Add an additional resumption mechanism to the datamodule so that stop-and-go and continuous pretraining produce the same training loss curve.

Details

There is still a small discrepancy right at the resumption step, but it does not affect the subsequent training or validation curves.
WANDB log
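For readers unfamiliar with the idea, the following is a minimal, self-contained sketch of how saving and restoring a consumed-sample counter lets a resumed run see the same batches as an uninterrupted one. `ToyResumableDataModule` and its fields are hypothetical stand-ins written for illustration; this is not the bionemo or NeMo implementation, which works through the Megatron data sampler.

```python
import random
from dataclasses import dataclass


@dataclass
class ToyResumableDataModule:
    """Hypothetical stand-in for a datamodule (not the bionemo class)."""

    num_samples: int
    global_batch_size: int
    seed: int = 0
    consumed_samples: int = 0  # the only state needed to resume the data order

    def _epoch_order(self, epoch: int) -> list[int]:
        # Deterministic shuffle per epoch, so the order depends only on (seed, epoch).
        rng = random.Random(self.seed + epoch)
        order = list(range(self.num_samples))
        rng.shuffle(order)
        return order

    def next_batch(self) -> list[int]:
        batch = []
        for _ in range(self.global_batch_size):
            epoch, offset = divmod(self.consumed_samples, self.num_samples)
            batch.append(self._epoch_order(epoch)[offset])
            self.consumed_samples += 1
        return batch

    def state_dict(self) -> dict:
        return {"consumed_samples": self.consumed_samples}

    def load_state_dict(self, state: dict) -> None:
        self.consumed_samples = state["consumed_samples"]


def run(steps: int, datamodule: ToyResumableDataModule) -> list[list[int]]:
    return [datamodule.next_batch() for _ in range(steps)]


# Continuous run: six steps without interruption.
continuous = run(6, ToyResumableDataModule(num_samples=10, global_batch_size=4))

# Stop-and-go run: three steps, checkpoint, then three more steps in a fresh object.
first = ToyResumableDataModule(num_samples=10, global_batch_size=4)
head = run(3, first)
checkpoint = first.state_dict()  # {"consumed_samples": 12}

resumed = ToyResumableDataModule(num_samples=10, global_batch_size=4)
resumed.load_state_dict(checkpoint)
tail = run(3, resumed)

stop_and_go = head + tail
```

In this toy setting `stop_and_go` equals `continuous`, since the batch contents depend only on the seed and the restored counter. Without the `load_state_dict` call, the resumed run would restart from sample 0 and the two curves would diverge after the checkpoint.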

Testing

To be implemented in subsequent PR on stop-and-go test refactoring.

sichu2023 · 2024-10-25T15:09:53Z

/build-ci

sichu2023 · 2024-10-25T17:54:56Z

/build-ci

jstjohn · 2024-10-25T19:08:17Z

scripts/protein/esm2/esm2_pretrain.py

@@ -183,6 +183,7 @@ def main(
        strategy=strategy,
        limit_val_batches=limit_val_batches,  # This controls upsampling and downsampling
        val_check_interval=val_check_interval,
+        log_every_n_steps=log_every_n_steps,


You need to add this in one other spot to make logging less frequent: in your nl.MegatronStrategy above, add

progress_interval=log_every_n_steps,
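As a rough illustration of the wiring described here, the sketch below parses one command-line value and fans it out to two keyword-argument dictionaries. The keyword names `log_every_n_steps` and `progress_interval` come from this thread; the helper names, the flag spelling, and the default of 50 are assumptions made for the example, and no NeMo objects are constructed.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Toy pretraining CLI (illustrative only)")
    parser.add_argument(
        "--log-every-n-steps",
        type=int,
        default=50,  # hypothetical default, not taken from the PR
        help="Number of steps between log messages.",
    )
    return parser


def logging_kwargs(log_every_n_steps: int) -> tuple[dict, dict]:
    """Return (trainer kwargs, strategy kwargs) that share one logging interval."""
    trainer_kwargs = {"log_every_n_steps": log_every_n_steps}
    strategy_kwargs = {"progress_interval": log_every_n_steps}
    return trainer_kwargs, strategy_kwargs


args = build_parser().parse_args(["--log-every-n-steps", "100"])
trainer_kwargs, strategy_kwargs = logging_kwargs(args.log_every_n_steps)
# The two dicts would then be unpacked into the trainer and strategy constructors.
```

Deriving both values from a single argument keeps the two logging intervals from drifting apart when the flag changes.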

jstjohn

Approved with one more comment about reducing the logging interval. Thanks!

Also note I pushed changes that fix geneformer to sichu/fix-pplcallback-sanity

sichu2023 marked this pull request as ready for review October 25, 2024 15:06

sichu2023 requested review from jstjohn, malcolmgreaves, skothenhill-nv, farhadrgh, dorotat-nv and pstjohn as code owners October 25, 2024 15:06

sichu2023 mentioned this pull request Oct 25, 2024

Fix training resumption #312

Draft

farhadrgh reviewed Oct 25, 2024

sub-packages/bionemo-esm2/src/bionemo/esm2/data/datamodule.py

jstjohn reviewed Oct 25, 2024

sub-packages/bionemo-esm2/src/bionemo/esm2/data/datamodule.py (outdated)

jstjohn reviewed Oct 25, 2024

sub-packages/bionemo-geneformer/src/bionemo/geneformer/data/singlecell/datamodule.py (outdated)

jstjohn reviewed Oct 25, 2024

sub-packages/bionemo-esm2/src/bionemo/esm2/data/datamodule.py (outdated)

jstjohn reviewed Oct 25, 2024

scripts/protein/esm2/esm2_pretrain.py (outdated)

jstjohn reviewed Oct 25, 2024

scripts/protein/esm2/esm2_pretrain.py

jstjohn reviewed Oct 25, 2024

sub-packages/bionemo-llm/src/bionemo/llm/data/datamodule.py (outdated)

jstjohn reviewed Oct 25, 2024

sub-packages/bionemo-llm/src/bionemo/llm/data/datamodule.py (outdated)

jstjohn reviewed Oct 25, 2024

sub-packages/bionemo-geneformer/src/bionemo/geneformer/data/singlecell/datamodule.py (outdated)

jstjohn reviewed Oct 25, 2024

sub-packages/bionemo-esm2/src/bionemo/esm2/data/datamodule.py (outdated)

pstjohn reviewed Oct 25, 2024

sub-packages/bionemo-llm/src/bionemo/llm/data/datamodule.py (outdated)

jstjohn reviewed Oct 25, 2024

scripts/protein/esm2/esm2_pretrain.py (outdated)

pstjohn reviewed Oct 25, 2024

sub-packages/bionemo-llm/src/bionemo/llm/data/datamodule.py

pstjohn approved these changes Oct 25, 2024

sichu2023 force-pushed the sichu/match-resumption-loss-curve branch from fadca1e to 44a9b39 Compare October 25, 2024 15:56

sichu2023 commented Oct 25, 2024

sub-packages/bionemo-esm2/src/bionemo/esm2/data/datamodule.py

sichu2023 and others added 5 commits October 25, 2024 17:27

add trainer.sanity_check

c1cd247

add state_dict and load_state_dict to ESM2DataModule

ff93dcd

fix ppl logging unittest

a964573

refactor to bionemo-llm datamodule.py

ceaa98d

add a function that handles the two updates that need to happen when new dataloaders are requested as a side effect currently

57403c7

sichu2023 added 11 commits October 25, 2024 17:27

switch to use NeMo WrappedDataloader

14e8bc3

add back pytorch_lightning import

4e0cb56

add argparse

0a22321

add log_every_n_steps argparse

254b2ae

improve readibility and use self.update_init_global_step

b4160b4

add mode type

384c8a8

remove save_every_n_steps and fix log_every_n_steps

eb87ce7

rename MegatronDataModule

dae5169

remove __ost_init__

06dfd21

add back kwargs in _create_dataloader

822f1fc

clean up save_every_n_steps

740f776

sichu2023 force-pushed the sichu/match-resumption-loss-curve branch from 636a849 to 09a2efa Compare October 25, 2024 17:27

import MegatronDataModule

714b6ef

sichu2023 force-pushed the sichu/match-resumption-loss-curve branch from 09a2efa to 714b6ef Compare October 25, 2024 17:28

sichu2023 added 2 commits October 25, 2024 17:36

fix MegatronDataModule __init__

0c74328

add mode in geneformer dataloader

c0857cf

sichu2023 enabled auto-merge (squash) October 25, 2024 18:03

sichu2023 requested review from jstjohn and farhadrgh October 25, 2024 18:12

jstjohn reviewed Oct 25, 2024

jstjohn approved these changes Oct 25, 2024
