Update NeMo/Megatron #302

Draft
wants to merge 8 commits into
base: main

@sichu2023 (Collaborator) commented Oct 10, 2024

Passes the ESM2 golden value tests.

Bugs

  • Some test leakage: certain tests pass individually but fail when run together
  • Fiddle error
  • The dtype error comes from Megatron-LM
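
One common cause of tests that pass individually but fail together is module-level state that one test mutates and a later test inherits. The sketch below is a generic illustration and is not taken from this repository: `_REGISTRY`, `register`, and `isolated_registry` are hypothetical names. It shows the leak and a snapshot-and-restore guard of the kind a pytest fixture could wrap.

```python
import contextlib
import copy

# Hypothetical module-level state shared by every test in the process.
_REGISTRY = {"precision": "fp32"}


def register(key, value):
    """Mutate the shared registry (stands in for any global side effect)."""
    _REGISTRY[key] = value


@contextlib.contextmanager
def isolated_registry():
    """Snapshot the registry on entry and restore it on exit."""
    snapshot = copy.deepcopy(_REGISTRY)
    try:
        yield _REGISTRY
    finally:
        _REGISTRY.clear()
        _REGISTRY.update(snapshot)


def leaky_test():
    # Changes global state and never cleans up.
    register("precision", "bf16-mixed")


def dependent_test():
    # Passes alone, fails if leaky_test ran first in the same process.
    return _REGISTRY["precision"] == "fp32"
```

Running `leaky_test()` inside `with isolated_registry():` leaves the registry as it was, so `dependent_test()` gives the same result regardless of execution order.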

Pytest errors

FAILED scripts/protein/esm2/test_esm2_pretrain.py::test_main_runs - AttributeError: 'NoneType' object has no attribute 'dtype'
FAILED scripts/protein/esm2/test_esm2_pretrain.py::test_val_dataloader_in_main_runs_with_limit_val_batches[1.0] - AttributeError: 'NoneType' object has no attribute 'dtype'
FAILED scripts/protein/esm2/test_esm2_pretrain.py::test_val_dataloader_in_main_runs_with_limit_val_batches[4] - AttributeError: 'NoneType' object has no attribute 'dtype'
FAILED scripts/protein/esm2/test_esm2_pretrain.py::test_val_dataloader_in_main_runs_with_limit_val_batches[None] - AttributeError: 'NoneType' object has no attribute 'dtype'
FAILED scripts/singlecell/geneformer/test_train.py::test_pretrain_cli - AssertionError: Pretrain script failed: python      /workspaces/bionemo-fw-ea/scripts/singlecell/geneformer/train.py         --data-dir /home/bionemo/.cache/bionemo/7bd3714b68c30c50f0a36a02fa9dd720-singlecell-testdata-20240506.tar.gz.untar/cellxgen...
FAILED sub-packages/bionemo-esm2/tests/bionemo/esm2/model/finetune/test_finetune.py::test_esm2_finetune_token_classifier[False] - AttributeError: 'NoneType' object has no attribute 'dtype'
FAILED sub-packages/bionemo-esm2/tests/bionemo/esm2/model/finetune/test_finetune.py::test_esm2_finetune_regressor[False] - AttributeError: 'NoneType' object has no attribute 'dtype'
FAILED sub-packages/bionemo-example_model/tests/bionemo/example_model/test_lightning_basic.py::test_train_mnist_litautoencoder_with_megatron_strategy_single_gpu[32] - AttributeError: 'NoneType' object has no attribute 'dtype'
FAILED sub-packages/bionemo-example_model/tests/bionemo/example_model/test_lightning_basic.py::test_train_mnist_litautoencoder_with_megatron_strategy_single_gpu[bf16-mixed] - AttributeError: 'NoneType' object has no attribute 'dtype'
FAILED sub-packages/bionemo-geneformer/tests/bionemo/geneformer/test_model.py::test_continue_from_checkpoint_geneformer - AttributeError: 'NoneType' object has no attribute 'dtype'
FAILED sub-packages/bionemo-geneformer/tests/bionemo/geneformer/test_model.py::test_finetune_geneformer - AttributeError: 'NoneType' object has no attribute 'dtype'
FAILED sub-packages/bionemo-llm/tests/bionemo/llm/utils/test_iomixin_utils.py::TestIOMixin::test_dataclass_out_of_sync - AssertionError: assert {'b': 7} == {'b': 7, 'c': 3}
FAILED sub-packages/bionemo-llm/tests/bionemo/llm/utils/test_iomixin_utils.py::TestIOMixin::test_dataclass_hparam_modify_parent_default - AssertionError: assert {'a': 7} == {'a': 7, 'b': 3, 'c': 3}
FAILED sub-packages/bionemo-testing/tests/bionemo/testing/data/test_load.py::test_default_pbss_client - botocore.exceptions.ConfigParseError: Unable to parse config file: /home/bionemo/.aws/config

dtype error comes from Megatron-LM

def sharded_param_state_fs_model_space
...
                        for state_key, state_ten in tensors.items():
                            replace_kwargs = dict(
                                key=f'{prefix}.{state_key}.{sharded_metadata.key}',
                                data=state_ten,
>                               dtype=state_ten.dtype,
                                flattened_range=slice(param_range.start, param_range.end),
                                replica_id=replica_id,
                            )
E                           AttributeError: 'NoneType' object has no attribute 'dtype'

3rdparty/Megatron-LM/megatron/core/optimizer/distrib_optimizer.py:1159: AttributeError
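
The traceback shows `state_ten` being `None` when `.dtype` is read. As a debugging aid only, and not the Megatron-LM fix, the sketch below mirrors the loop with a guard that reports which state entries are missing. `FakeTensor`, `build_replace_kwargs`, and the argument layout are hypothetical stand-ins; only the keyword names come from the excerpt above.

```python
from dataclasses import dataclass


@dataclass
class FakeTensor:
    """Hypothetical stand-in for a tensor; only the dtype attribute matters here."""
    dtype: str


def build_replace_kwargs(prefix, metadata_key, tensors, start, end, replica_id):
    """Build one kwargs dict per state tensor, failing loudly on None entries."""
    missing = [name for name, ten in tensors.items() if ten is None]
    if missing:
        # Name the offending keys instead of failing later on `.dtype`.
        raise ValueError(f"optimizer state is None for keys: {missing}")
    return [
        dict(
            key=f"{prefix}.{state_key}.{metadata_key}",
            data=state_ten,
            dtype=state_ten.dtype,
            flattened_range=slice(start, end),
            replica_id=replica_id,
        )
        for state_key, state_ten in tensors.items()
    ]
```

A guard like this does not explain why the state is empty, but it turns the `AttributeError` into a message that names the affected keys.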

else:
    kv_channels = self.config.kv_channels

extra_kwargs["softmax_scale"] = softmax_scale
Collaborator Author

This line is moved outside of the if is_te_min_version("1.10.0") block. If not for that change, we could just call super().__init__ directly without an override.
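
To illustrate the pattern described in this comment, here is a minimal sketch of a version-gated keyword build in which the softmax_scale assignment sits outside the version check. `parse_version`, `is_min_version`, and `build_extra_kwargs` are hypothetical stand-ins written for this example (the real check is is_te_min_version), and the option gated under the version check is a placeholder, since the diff context does not show it.

```python
def parse_version(version):
    """Turn '1.10.0' into (1, 10, 0) so that comparison is numeric, not lexical."""
    return tuple(int(part) for part in version.split("."))


def is_min_version(installed, required):
    """Hypothetical stand-in for a minimum-version check."""
    return parse_version(installed) >= parse_version(required)


def build_extra_kwargs(installed_version, softmax_scale):
    extra_kwargs = {}
    if is_min_version(installed_version, "1.10.0"):
        # Placeholder for whatever is only valid on newer versions.
        extra_kwargs["version_gated_option"] = True
    # Set unconditionally, i.e. outside the version check.
    extra_kwargs["softmax_scale"] = softmax_scale
    return extra_kwargs
```

With this layout, `softmax_scale` is passed on every installed version, while the gated option appears only when the minimum version is met.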

@jstjohn (Collaborator) left a comment

This is related in some ways to #304. Maybe sync with @farhadrgh and make sure that you are on a new enough NeMo for his needs as well? I think his changes were recently merged.
