Failures when using ROCM builds that have particular type of debug info in them (both in JLL-mixed-mode and in system-ROCM mode), e.g. on Arch Linux #620

Closed
Krastanov opened this issue Apr 13, 2024 · 3 comments

Krastanov commented Apr 13, 2024

I am filing this issue because this library:

  • does not work in Arch Linux with the arch-provided ROCM
  • but it works fine in the official Ubuntu Docker container provided by AMD, running on the same Arch host.

This is not a problem with driver installation, user permissions, etc. Rather, it is a problem with the particularly stringent standards Arch follows for builds, combined with the fact that ROCM has some potentially broken asserts. I am creating this issue to track this specific question. Please excuse me if this is not considered appropriate for this issue tracker, and please close the issue if that is the case.

This issue has overlap with:

In particular, in #371 it is already stated that:

There are three responsible parties here:

  • Arch, for not wanting more lenient builds (but all other ROCM-using tools work fine with their build, so they probably will not be making exceptions)
  • AMD, for having broken debug statements (but we cannot expect that to be resolved soon)
  • Julia's AMDGPU.jl, for being more picky than other tools (which is probably good engineering, but it would be nice if there were an Arch+Julia+AMDGPU hacker with the time to fix this and contribute a fix here; regrettably I do not have these skills yet, but I am happy to work through the debugging if there is someone to hold my hand)

It is probably reasonable to close this issue as "will not fix" if the JLL ROCM artifacts become the established way to use AMDGPU.jl (in a non-mixed pure-jll mode).

All of this was tested on my end with ROCM 6, a 7900 XTX, and Julia 1.11.

To run the official AMD Ubuntu ROCM container under Arch Linux so that you can use AMDGPU.jl inside the container, you can do:

sudo pacman -S hsa-rocr rocm-hip-runtime rocm-device-libs rocm-llvm rocminfo # usually not needed because the docker image will have its own, but useful if you do testing on the host
sudo usermod -a -G render YOUR_USERNAME # maybe not needed
sudo usermod -a -G video YOUR_USERNAME # maybe not needed
docker run -it --rm --device=/dev/kfd --device=/dev/dri --ipc=host --group-add=video --shm-size=16G --cap-add=SYS_PTRACE --security-opt seccomp=unconfined rocm/rocm-terminal /bin/bash
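
Before starting the container, it can help to confirm that the device nodes passed via `--device` actually exist on the host. The following is a minimal sketch of such a check; the helper name `check_rocm_devices` is hypothetical and is not part of ROCM, Docker, or AMDGPU.jl.

```shell
# Hypothetical helper (not part of ROCM or AMDGPU.jl): report whether each
# given path exists, and return non-zero if any of them is missing.
check_rocm_devices() {
    missing=0
    for dev in "$@"; do
        if [ -e "$dev" ]; then
            echo "found: $dev"
        else
            echo "missing: $dev"
            missing=1
        fi
    done
    return "$missing"
}

# Check the two device paths used in the docker command above.
check_rocm_devices /dev/kfd /dev/dri || echo "warning: the container may not see the GPU"
```

If either path is reported as missing, the host driver setup is the first thing to look at, since the container only receives the devices that are passed to it.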
@aksuhton

Commenting merely to give thanks for the container instructions.

@aksuhton

Returning here to say that I am finding success with out-of-the-box AMDGPU.jl.

Thank you everyone who made this possible

@Krastanov
Author

I confirm that I do not have this issue anymore either. While this contains a useful reference on how to use Docker / Podman containers with AMDGPU.jl, it does not seem to be necessary anymore.
