System librocalution can cause issue due to implicit loading of UCX #690

Open
vchuravy opened this issue Nov 6, 2024 · 1 comment · May be fixed by #691
vchuravy commented Nov 6, 2024

Originally noted by @leios in JuliaMolSim/Molly.jl#147 (comment)

I was trying to run some code on my AMD laptop and hit bizarre errors stemming from UCS/UCX.
They look very similar to https://juliaparallel.org/MPI.jl/stable/knownissues/#UCX, which is surprising since I was not using MPI.

using AMDGPU

import Libdl
for dll in Libdl.dllist()
    @show dll
end

This showed:

dll = "/opt/rocm/lib/libhsakmt.so.1"
dll = "/usr/lib/libdrm.so.2"
dll = "/usr/lib/libdrm_amdgpu.so.1"
dll = "/usr/lib/libnuma.so.1"
dll = "/opt/rocm/lib/libamdhip64.so"
dll = "/opt/rocm/lib/libamd_comgr.so.2"
dll = "/opt/rocm/lib/libhsa-runtime64.so.1"
dll = "/usr/lib/libzstd.so.1"
dll = "/usr/lib/libncursesw.so.6"
dll = "/usr/lib/libelf.so.1"
dll = "/usr/lib/libucs.so.0"
dll = "/usr/lib/libucm.so.0"
dll = "/usr/lib/libsframe.so.1"
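
To narrow that list down to the UCX pieces, a filter along these lines could help. This is only a sketch: `is_ucx` and `ucx_libs` are hypothetical helper names, and the file-name pattern (libucp, libucs, libuct, libucm) is an assumption about how UCX components are named on a typical Linux install.

```julia
import Libdl

# Hypothetical helper: true if the file name looks like a UCX component.
# The pattern is an assumption based on the library names seen above.
is_ucx(path) = occursin(r"^libuc[pstm]\.so", basename(path))

# Hypothetical helper: keep only UCX-looking entries.
# Defaults to the libraries currently loaded into this process.
ucx_libs(paths = Libdl.dllist()) = filter(is_ucx, paths)

# Print whatever UCX libraries are loaded right now (possibly none).
for lib in ucx_libs()
    println(lib)
end
```

Running this after `using AMDGPU` should print only the `libucs`/`libucm` style entries if they were pulled in.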

Inspecting the ROCm libraries with https://github.com/haampie/libtree:

libtree /opt/rocm/lib/*
librocalution.so.1
├── libomp.so [runpath]
├── libmpi.so.40 [runpath]
│   ├── libopen-pal.so.80 [runpath]
│   │   ├── libfabric.so.1 [runpath]
│   │   │   └── libnuma.so.1 [default path]
│   │   ├── libpmix.so.2 [runpath]
│   │   │   ├── libhwloc.so.15 [ld.so.conf]
│   │   │   ├── libevent_core-2.1.so.7 [default path]
│   │   │   └── libevent_pthreads-2.1.so.7 [default path]
│   │   │       └── libevent_core-2.1.so.7 [default path]
│   │   ├── libhwloc.so.15 [runpath]
│   │   │   └── libudev.so.1 [default path]
│   │   │       └── libcap.so.2 [default path]
│   │   ├── libevent_pthreads-2.1.so.7 [runpath]
│   │   ├── libevent_core-2.1.so.7 [runpath]
│   │   ├── libuct.so.0 [runpath]
│   │   │   └── libucs.so.0 [default path]
│   │   │       ├── libzstd.so.1 [default path]
│   │   │       ├── libz.so.1 [default path]
│   │   │       ├── libsframe.so.1 [default path]
│   │   │       └── libucm.so.0 [default path]
│   │   ├── libucm.so.0 [runpath]
│   │   ├── libucs.so.0 [runpath]
│   │   └── libucp.so.0 [runpath]
│   │       ├── libuct.so.0 [default path]
│   │       └── libucs.so.0 [default path]
│   ├── libhwloc.so.15 [runpath]
│   ├── libevent_pthreads-2.1.so.7 [runpath]
│   ├── libevent_core-2.1.so.7 [runpath]
│   ├── libpmix.so.2 [runpath]
│   ├── libucs.so.0 [runpath]
│   ├── libucp.so.0 [runpath]
│   └── libfabric.so.1 [runpath]
└── librocalution_hip.so.1 [ld.so.conf]
    ├── librocblas.so.4 [ld.so.conf]
    │   └── libamdhip64.so.6 [ld.so.conf]
    │       ├── libamd_comgr.so.2 [ld.so.conf]
    │       │   ├── libz.so.1 [default path]
    │       │   ├── libncursesw.so.6 [default path]
    │       │   └── libzstd.so.1 [default path]
    │       ├── libhsa-runtime64.so.1 [ld.so.conf]
    │       │   ├── libhsakmt.so.1 [ld.so.conf]
    │       │   │   ├── libdrm.so.2 [default path]
    │       │   │   ├── libnuma.so.1 [default path]
    │       │   │   └── libdrm_amdgpu.so.1 [default path]
    │       │   │       └── libdrm.so.2 [default path]
    │       │   ├── libdrm.so.2 [default path]
    │       │   └── libelf.so.1 [default path]
    │       │       ├── libz.so.1 [default path]
    │       │       └── libzstd.so.1 [default path]
    │       └── libnuma.so.1 [default path]
    ├── libamdhip64.so.6 [ld.so.conf]
    ├── librocrand.so.1 [ld.so.conf]
    │   └── libamdhip64.so.6 [ld.so.conf]
    └── librocsparse.so.1 [ld.so.conf]
        └── libamdhip64.so.6 [ld.so.conf]

So on my system librocalution is pulling in OpenMPI and thus loads UCX implicitly.

vchuravy linked a pull request Nov 6, 2024 that will close this issue

vchuravy commented Nov 6, 2024

On my system this looked like:

 dev [loki:132223:0:132223] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7073bde72000)
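
A possible workaround, following the MPI.jl known-issues page linked above, is to stop UCX from intercepting SIGSEGV by setting `UCX_ERROR_SIGNALS` before anything loads UCX. This is a sketch under the assumption that the same setting applies here; it has not been confirmed in this thread, and it is not necessarily what the linked pull request does.

```julia
# Assumption: mirrors the value documented on the MPI.jl known-issues page.
# Leaves SIGSEGV out of the signals UCX handles, so it no longer traps it.
# Respect a value the user has already chosen.
if !haskey(ENV, "UCX_ERROR_SIGNALS")
    ENV["UCX_ERROR_SIGNALS"] = "SIGILL,SIGBUS,SIGFPE"
end

# Load AMDGPU only after the variable is set, e.g.:
# using AMDGPU
```

The variable has to be set before the UCX libraries are loaded, so this belongs at the very top of the script or in the shell environment.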
