I am trying to set up and use megablocks to train MoE models on NGC's nvcr.io/nvidia/pytorch:23.09-py3 PyTorch container, but I see the following error:
```
Traceback (most recent call last):
  File "/n/holyscratch01/dam_lab/brachit/moes/megablocks/third_party/Megatron-LM/pretrain_gpt.py", line 8, in <module>
    from megatron import get_args
  File "/n/holyscratch01/dam_lab/brachit/moes/megablocks/third_party/Megatron-LM/megatron/__init__.py", line 13, in <module>
    from .initialize import initialize_megatron
  File "/n/holyscratch01/dam_lab/brachit/moes/megablocks/third_party/Megatron-LM/megatron/initialize.py", line 19, in <module>
    from megatron.checkpointing import load_args_from_checkpoint
  File "/n/holyscratch01/dam_lab/brachit/moes/megablocks/third_party/Megatron-LM/megatron/checkpointing.py", line 15, in <module>
    from .utils import (unwrap_model,
  File "/n/holyscratch01/dam_lab/brachit/moes/megablocks/third_party/Megatron-LM/megatron/utils.py", line 11, in <module>
    import amp_C
ImportError: /usr/local/lib/python3.10/dist-packages/amp_C.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c106detail14torchCheckFailEPKcS2_jRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
```
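For what it's worth, the mangled name demangles to c10::detail::torchCheckFail, a libtorch helper, which usually means amp_C (apex's extension) was compiled against a different PyTorch build than the one currently installed. A minimal way to check, assuming the paths from the traceback (the libtorch path below is my guess for this container):

```sh
# Demangle the missing symbol; it resolves to a libtorch (c10) helper,
# so amp_C was built against a different torch ABI than the one installed.
echo '_ZN3c106detail14torchCheckFailEPKcS2_jRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE' | c++filt
# -> c10::detail::torchCheckFail(char const*, char const*, unsigned int,
#    std::__cxx11::basic_string<char, ...> const&)

# Check which torch is actually importable right now.
python -c "import torch; print(torch.__version__, torch.__file__)"

# Check whether the installed libtorch still exports the symbol
# (path assumed from the dist-packages location in the traceback).
nm -D /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so | grep torchCheckFail
```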
When I try running gpt2 training (using exp/gpt2/gpt2_gpt2_46m_1gpu.sh) before doing a pip install megablocks, it works totally fine, while the MoE script (exp/moe/moe_125m_8gpu_interactive.sh) gives the error Megablocks not available.
However, after I do a pip install megablocks or pip install . in the container, even the gpt2 script (and the MoE one) starts giving the above undefined-symbol error for amp_C.
I've seen this a few times when you build against the wrong version of PyTorch and the install goes sideways. I would print the whole install log and check whether anything (torch in particular) is getting reinstalled.
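For example, something along these lines (a sketch; the grep patterns match standard pip output, though exact log lines can vary by pip version):

```sh
# Re-run the install verbosely and keep the full log.
pip install -v megablocks 2>&1 | tee install.log

# "Attempting uninstall: torch" in the log means pip replaced the
# container's NVIDIA-built torch with a PyPI wheel; extensions such as
# apex's amp_C were compiled against the original build and will then
# fail with undefined-symbol errors like the one above.
grep -E "Attempting uninstall|Installing collected packages" install.log
```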
I am using the nvcr.io/nvidia/pytorch:23.09-py3 PyTorch container, which already ships with PyTorch (and apex) preinstalled. Do you suggest installing a specific alternate version?
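If the log does show torch being swapped out, one possible workaround is to install megablocks without letting pip touch the preinstalled packages. This is a sketch under that assumption, not a verified fix; --no-deps and --no-build-isolation are standard pip flags:

```sh
# Install megablocks from a local checkout without resolving or
# reinstalling dependencies, so the container's torch/apex stay intact.
pip install . --no-deps --no-build-isolation

# Sanity-check that the container's torch is still the one in use and
# that apex's amp_C extension imports cleanly again.
python -c "import torch; print(torch.__version__, torch.__file__)"
python -c "import amp_C; print('amp_C imports OK')"
```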