
Running AMDGPU for MultiGPU using an Array of Pointers #662

Closed
pedrovalerolara opened this issue Aug 13, 2024 · 5 comments
pedrovalerolara commented Aug 13, 2024

I cannot run AMDGPU.jl on multiple GPUs using an array of pointers.
The system has 2x AMD MI100 GPUs.
The code runs well on the first GPU, but not on the second. When running it I get:

```
Memory access fault by GPU node-3 (Agent handle: 0x806eef0) on address 0x7f338c601000. Reason: Page not present or supervisor privilege.
```

The equivalent code using CUDA.jl works well.
Please let me know if you have any questions.

Here is the code to replicate the problem:

```julia
function multi_scal(dev_id, alpha, x)
  i = (workgroupIdx().x - 1) * workgroupDim().x + workitemIdx().x
  @inbounds x[dev_id][i] *= alpha
  return nothing
end

x = ones(1_000)
alpha = 2.0
x_ret = Vector{Any}(undef, 2)
ndev = length(AMDGPU.devices())
AMDGPU.device!(AMDGPU.device(1))
size_array = length(x)
s_arrays = ceil(Int, size_array / ndev)
array_ret = Vector{Any}(undef, ndev)
pointer_ret = Vector{AMDGPU.Device.ROCDeviceVector{Float64,AMDGPU.Device.AS.Global}}(undef, ndev)

for i in 1:ndev
  AMDGPU.device!(AMDGPU.device(i))
  array_ret[i] = ROCArray(x[((i - 1) * s_arrays) + 1:i * s_arrays])
  pointer_ret[i] = AMDGPU.rocconvert(array_ret[i])
end

AMDGPU.device!(AMDGPU.device(1))
amdgpu_pointer_ret = ROCArray(pointer_ret)
x_ret[1] = amdgpu_pointer_ret
x_ret[2] = array_ret

numThreads = 256
threads = min(s_arrays, numThreads)
blocks = ceil(Int, s_arrays / threads)

for i in 1:ndev
    AMDGPU.device!(AMDGPU.device(i))
    dev_id = i
    println(dev_id)
    @roc groupsize=threads gridsize=blocks multi_scal(dev_id, alpha, x_ret[1])
    AMDGPU.synchronize()
    println(dev_id)
end
```
luraess (Collaborator) commented Aug 13, 2024

Thanks for sharing. Could you format the code in triple-backtick blocks, to avoid issues when copy-pasting it and for better readability? 🙏

pxl-th (Collaborator) commented Aug 13, 2024

I haven't run the code, but one problem I see here is that you create `amdgpu_pointer_ret` on the first device:

```julia
AMDGPU.device!(AMDGPU.device(1))
amdgpu_pointer_ret = ROCArray(pointer_ret)
x_ret[1] = amdgpu_pointer_ret
```

But then you pass it to kernels launched on the other devices:

```julia
AMDGPU.device!(AMDGPU.device(i))
dev_id = i
println(dev_id)
@roc groupsize=threads gridsize=blocks multi_scal(dev_id, alpha, x_ret[1])
```

pxl-th (Collaborator) commented Aug 13, 2024

How does the CUDA equivalent access the array of pointers?
I suspect you need to allocate host memory that is visible across multiple GPUs/contexts, which in CUDA is done with the `CU_MEMHOSTALLOC_PORTABLE` flag.
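For reference, the HIP runtime has a direct counterpart to that CUDA flag: `hipHostMalloc` with `hipHostMallocPortable` allocates pinned host memory visible to all devices. A minimal C sketch (assuming a ROCm/HIP toolchain; the kernel launch itself is elided):

```c
#include <hip/hip_runtime.h>
#include <stdio.h>

int main(void) {
    // Pinned host allocation visible across all devices/contexts:
    // hipHostMallocPortable is the HIP analogue of CU_MEMHOSTALLOC_PORTABLE.
    double *host_ptr = NULL;
    hipError_t err = hipHostMalloc((void **)&host_ptr,
                                   1000 * sizeof(double),
                                   hipHostMallocPortable);
    if (err != hipSuccess) {
        fprintf(stderr, "hipHostMalloc failed: %s\n", hipGetErrorString(err));
        return 1;
    }
    // ... fill host_ptr and launch kernels on any device that reads it ...
    hipHostFree(host_ptr);
    return 0;
}
```

Whether (and how) AMDGPU.jl exposes this allocation path from Julia is exactly the open question below.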

pedrovalerolara (Author) commented Aug 13, 2024

Hi folks!
Thank you for your quick responses! That is very appreciated.
It looks like this code works. The key difference is that `x_ret[i] = ROCArray(pointer_ret)` now runs inside the loop, while device `i` is current, so each device gets its own copy of the pointer table:

```julia
function multi_scal(dev_id, alpha, x)
  i = (workgroupIdx().x - 1) * workgroupDim().x + workitemIdx().x
  @inbounds x[dev_id][i] *= alpha
  return nothing
end

AMDGPU.device!(AMDGPU.device(1))
x = ones(1_000)
alpha = 2.0
ndev = length(AMDGPU.devices())
x_ret = Vector{Any}(undef, ndev + 1)
size_array = length(x)
s_arrays = ceil(Int, size_array / ndev)
array_ret = Vector{Any}(undef, ndev)
pointer_ret = Vector{AMDGPU.Device.ROCDeviceVector{Float64,AMDGPU.Device.AS.Global}}(undef, ndev)

for i in 1:ndev
  AMDGPU.device!(AMDGPU.device(i))
  array_ret[i] = ROCArray(x[((i - 1) * s_arrays) + 1:i * s_arrays])
  pointer_ret[i] = AMDGPU.rocconvert(array_ret[i])
  x_ret[i] = ROCArray(pointer_ret)
end

x_ret[ndev + 1] = array_ret

numThreads = 256
threads = min(s_arrays, numThreads)
blocks = ceil(Int, s_arrays / threads)

for i in 1:ndev
    AMDGPU.device!(AMDGPU.device(i))
    dev_id = i
    println(dev_id)
    @roc groupsize=threads gridsize=blocks multi_scal(dev_id, alpha, x_ret[i])
    AMDGPU.synchronize()
    println(dev_id)
end
```

pedrovalerolara (Author) commented Aug 13, 2024

> How does the CUDA equivalent access the array of pointers? I suspect you need to allocate host memory that is visible across multiple GPUs/contexts, which in CUDA is done with the `CU_MEMHOSTALLOC_PORTABLE` flag.

Is there a way to do so in AMDGPU?
