amp_C undefined symbol after installing Megablocks #157

Open · RachitBansal opened this issue Oct 11, 2024 · 3 comments
@RachitBansal commented Oct 11, 2024

I am trying to set up and use megablocks to train MoE models, but I hit the following error:

Traceback (most recent call last):
  File "/n/holyscratch01/dam_lab/brachit/moes/megablocks/third_party/Megatron-LM/pretrain_gpt.py", line 8, in <module>
    from megatron import get_args
  File "/n/holyscratch01/dam_lab/brachit/moes/megablocks/third_party/Megatron-LM/megatron/__init__.py", line 13, in <module>
    from .initialize  import initialize_megatron
  File "/n/holyscratch01/dam_lab/brachit/moes/megablocks/third_party/Megatron-LM/megatron/initialize.py", line 19, in <module>
    from megatron.checkpointing import load_args_from_checkpoint
  File "/n/holyscratch01/dam_lab/brachit/moes/megablocks/third_party/Megatron-LM/megatron/checkpointing.py", line 15, in <module>
    from .utils import (unwrap_model,
  File "/n/holyscratch01/dam_lab/brachit/moes/megablocks/third_party/Megatron-LM/megatron/utils.py", line 11, in <module>
    import amp_C
ImportError: /usr/local/lib/python3.10/dist-packages/amp_C.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c106detail14torchCheckFailEPKcS2_jRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE

I am working in NGC's nvcr.io/nvidia/pytorch:23.09-py3 PyTorch container.

When I run GPT-2 training (using exp/gpt2/gpt2_gpt2_46m_1gpu.sh) before doing a pip install megablocks, it works fine, while the MoE script (exp/moe/moe_125m_8gpu_interactive.sh) fails with the error Megablocks not available.

However, after I do a pip install megablocks or pip install . in the container, even the GPT-2 script (and the MoE one) starts failing with the amp_C undefined-symbol error above.
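
For context: the missing symbol demangles to c10::detail::torchCheckFail(...), which lives inside PyTorch's libtorch, and amp_C is a C++ extension from NVIDIA Apex that the NGC container compiles against its own PyTorch build. So an error like this usually means the pip install pulled in a different torch wheel and replaced the container's PyTorch, leaving Apex linked against an ABI that is no longer there. A quick way to check (a sketch, assuming the standard NGC container layout):

  # Demangle the missing symbol; it resolves to c10::detail::torchCheckFail from libtorch
  echo '_ZN3c106detail14torchCheckFailEPKcS2_jRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE' | c++filt

  # NGC containers typically ship a pre-release torch build (version string ending in "a0");
  # if this prints a plain release version after the install, pip replaced the container's torch
  python -c "import torch; print(torch.__version__)"
  pip list | grep -Ei '^(torch|apex|megablocks)'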

@mvpatel2000 (Contributor)

I've seen this a few times when the package is built for the wrong version of PyTorch and the install ends up broken. I would print the whole install log and check whether anything is being reinstalled.
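
One way to do that (a sketch of the check suggested above, not an official procedure): capture the full log and grep for pip swapping packages out.

  # Capture the complete install log, then look for torch being uninstalled or reinstalled
  pip install megablocks 2>&1 | tee install.log
  grep -E 'Uninstalling|Successfully installed' install.log

  # If torch shows up there, installing without dependency resolution
  # (assuming the container already provides everything megablocks needs)
  # keeps the container's PyTorch in place
  pip install --no-deps megablocks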

@RachitBansal (Author)

I am using the nvcr.io/nvidia/pytorch:23.09-py3 PyTorch container, which already ships with PyTorch installed. Do you suggest installing a specific alternate version?

@mvpatel2000 (Contributor)

We use and recommend these images: https://github.com/mosaicml/composer/tree/main/docker
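
For example (the image tag below is illustrative, not from the thread; check the linked README for the current list of published tags):

  docker pull mosaicml/pytorch:latest
  docker run --gpus all -it --rm mosaicml/pytorch:latest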
