
Running AMDGPU for MultiGPU using an Array of Pointers #662

Closed
pedrovalerolara opened this issue Aug 13, 2024 · 5 comments
pedrovalerolara commented Aug 13, 2024

I cannot run AMDGPU.jl on multiple GPUs using an array of pointers.
The system has 2x AMD MI100 GPUs.
The code runs well on the first GPU, but not on the second. When running it I get:

```
Memory access fault by GPU node-3 (Agent handle: 0x806eef0) on address 0x7f338c601000. Reason: Page not present or supervisor privilege.
```

The equivalent code using CUDA.jl works well.
Please let me know if you have any questions.

Here is the code to replicate the problem:

```julia
function multi_scal(dev_id, alpha, x)
  i = (workgroupIdx().x - 1) * workgroupDim().x + workitemIdx().x
  @inbounds x[dev_id][i] *= alpha
  return nothing
end

x = ones(1_000)
alpha = 2.0
x_ret = Vector{Any}(undef, 2)
ndev = length(AMDGPU.devices())
AMDGPU.device!(AMDGPU.device(1))
size_array = length(x)
s_arrays = ceil(Int, size_array / ndev)
array_ret = Vector{Any}(undef, ndev)
pointer_ret = Vector{AMDGPU.Device.ROCDeviceVector{Float64,AMDGPU.Device.AS.Global}}(undef, ndev)

for i in 1:ndev
  AMDGPU.device!(AMDGPU.device(i))
  array_ret[i] = ROCArray(x[((i - 1) * s_arrays) + 1:i * s_arrays])
  pointer_ret[i] = AMDGPU.rocconvert(array_ret[i])
end

AMDGPU.device!(AMDGPU.device(1))
amdgpu_pointer_ret = ROCArray(pointer_ret)
x_ret[1] = amdgpu_pointer_ret
x_ret[2] = array_ret

numThreads = 256
threads = min(s_arrays, numThreads)
blocks = ceil(Int, s_arrays / threads)

for i in 1:ndev
    AMDGPU.device!(AMDGPU.device(i))
    dev_id = i
    println(dev_id)
    @roc groupsize=threads gridsize=blocks multi_scal(dev_id, alpha, x_ret[1])
    AMDGPU.synchronize()
    println(dev_id)
end
```
luraess (Collaborator) commented Aug 13, 2024

Thanks for sharing. Could you format the code in triple-backtick blocks, to avoid issues when copy-pasting it and for better readability? 🙏

pxl-th (Collaborator) commented Aug 13, 2024

I haven't run the code, but one problem I see here is that you create `amdgpu_pointer_ret` on the first device:

```julia
AMDGPU.device!(AMDGPU.device(1))
amdgpu_pointer_ret = ROCArray(pointer_ret)
x_ret[1] = amdgpu_pointer_ret
```

But then you pass it to kernels launched on the other devices:

```julia
AMDGPU.device!(AMDGPU.device(i))
dev_id = i
println(dev_id)
@roc groupsize=threads gridsize=blocks multi_scal(dev_id, alpha, x_ret[1])
```

pxl-th (Collaborator) commented Aug 13, 2024

How does the CUDA equivalent access the array of pointers?
I suspect you need to allocate host memory that is visible across multiple GPUs/contexts, which in CUDA is done with the `CU_MEMHOSTALLOC_PORTABLE` flag.
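For reference, the HIP runtime has a direct counterpart to that CUDA flag: `hipHostMalloc` with `hipHostMallocPortable` allocates pinned host memory visible to all devices. A minimal C sketch (assuming a ROCm/HIP toolchain; the kernel launch itself is elided):

```c
#include <hip/hip_runtime.h>
#include <stdio.h>

int main(void) {
    // Pinned host allocation visible across all devices/contexts:
    // hipHostMallocPortable is the HIP analogue of CU_MEMHOSTALLOC_PORTABLE.
    double *host_ptr = NULL;
    hipError_t err = hipHostMalloc((void **)&host_ptr,
                                   1000 * sizeof(double),
                                   hipHostMallocPortable);
    if (err != hipSuccess) {
        fprintf(stderr, "hipHostMalloc failed: %s\n", hipGetErrorString(err));
        return 1;
    }
    // ... fill host_ptr and launch kernels on any device that reads it ...
    hipHostFree(host_ptr);
    return 0;
}
```

Whether (and how) AMDGPU.jl exposes this allocation path from Julia is exactly the open question below.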

pedrovalerolara (Author) commented Aug 13, 2024

Hi folks!
Thank you for your quick responses! That is very appreciated.
It looks like this code works. The key difference is that `x_ret[i] = ROCArray(pointer_ret)` now runs inside the loop, while device `i` is current, so each device gets its own copy of the pointer table:

```julia
function multi_scal(dev_id, alpha, x)
  i = (workgroupIdx().x - 1) * workgroupDim().x + workitemIdx().x
  @inbounds x[dev_id][i] *= alpha
  return nothing
end

AMDGPU.device!(AMDGPU.device(1))
x = ones(1_000)
alpha = 2.0
ndev = length(AMDGPU.devices())
x_ret = Vector{Any}(undef, ndev + 1)
size_array = length(x)
s_arrays = ceil(Int, size_array / ndev)
array_ret = Vector{Any}(undef, ndev)
pointer_ret = Vector{AMDGPU.Device.ROCDeviceVector{Float64,AMDGPU.Device.AS.Global}}(undef, ndev)

for i in 1:ndev
  AMDGPU.device!(AMDGPU.device(i))
  array_ret[i] = ROCArray(x[((i - 1) * s_arrays) + 1:i * s_arrays])
  pointer_ret[i] = AMDGPU.rocconvert(array_ret[i])
  x_ret[i] = ROCArray(pointer_ret)
end

x_ret[ndev + 1] = array_ret

numThreads = 256
threads = min(s_arrays, numThreads)
blocks = ceil(Int, s_arrays / threads)

for i in 1:ndev
    AMDGPU.device!(AMDGPU.device(i))
    dev_id = i
    println(dev_id)
    @roc groupsize=threads gridsize=blocks multi_scal(dev_id, alpha, x_ret[i])
    AMDGPU.synchronize()
    println(dev_id)
end
```

pedrovalerolara (Author) commented Aug 13, 2024

> How does the CUDA equivalent access the array of pointers? I suspect you need to allocate host memory that is visible across multiple GPUs/contexts, which in CUDA is done with the `CU_MEMHOSTALLOC_PORTABLE` flag.

Is there a way to do so in AMDGPU?
