Accelerating MultiLayerQG on GPUs #373

Open · wants to merge 13 commits into base: main
2 changes: 2 additions & 0 deletions Project.toml
@@ -16,6 +16,7 @@ Reexport = "189a3867-3050-52da-a836-e630ba90ab69"
SpecialFunctions = "276daf66-3868-5448-9aa4-cd146d93841b"
StaticArrays = "90137ffa-7385-5640-81b9-e52037218182"
Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
KernelAbstractions = "63c18a36-062a-441e-b654-da1e3ab1ce7c"

[compat]
CUDA = "1, 2.4.2, 3.0.0 - 3.6.4, 3.7.1, 4, 5"
@@ -30,6 +31,7 @@ SpecialFunctions = "0.10, 1, 2"
StaticArrays = "0.12, 1"
Statistics = "1.6"
julia = "1.6"
KernelAbstractions = "0.9"

[extras]
BenchmarkTools = "6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf"
81 changes: 60 additions & 21 deletions src/multilayerqg.jl
@@ -18,11 +18,13 @@ using
LinearAlgebra,
StaticArrays,
Reexport,
DocStringExtensions
DocStringExtensions,
KernelAbstractions

@reexport using FourierFlows

using FourierFlows: parsevalsum, parsevalsum2, superzeros, plan_flows_rfft
using FourierFlows: parsevalsum, parsevalsum2, superzeros, plan_flows_rfft, CPU, GPU
using KernelAbstractions.Extras.LoopInfo: @unroll

nothingfunction(args...) = nothing

@@ -111,17 +113,6 @@ function Problem(nlayers::Int, # number of fluid lay
aliased_fraction = 1/3,
T = Float64)

if dev == GPU() && nlayers > 2
@warn """MultiLayerQG module is not optimized on the GPU yet for configurations with
3 fluid layers or more!

See issues on Github at https://github.com/FourierFlows/GeophysicalFlows.jl/issues/112
and https://github.com/FourierFlows/GeophysicalFlows.jl/issues/267.

To use MultiLayerQG with 3 fluid layers or more we suggest, for now, to restrict running
on CPU."""
end

if nlayers == 1
@warn """MultiLayerQG module does work for single-layer configuration but may not be as
optimized. We suggest using SingleLayerQG module for single-layer QG simulation unless
@@ -533,16 +524,50 @@ Compute the inverse Fourier transform of `varh` and store it in `var`.
"""
invtransform!(var, varh, params::AbstractParams) = ldiv!(var, params.rfftplan, varh)

"""
@kernel function PVinversion_kernel!(a, b, M, ::Val{nlayers})

Kernel for the PV/streamfunction inversion step, i.e., `qh = params.S * ψh` or
`ψh = params.S⁻¹ * qh`, computing `a[i, j, :] = M[i, j] * b[i, j, :]` at every wavenumber `(i, j)`.
"""
@kernel function PVinversion_kernel!(a, b, M, ::Val{nlayers}) where nlayers
i, j = @index(Global, NTuple)

@unroll for k = 1:nlayers

@inbounds a[i, j, k] = 0

@unroll for m = 1:nlayers
@inbounds a[i, j, k] += M[i, j][k, m] * b[i, j, m]
end

end
end
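As a quick standalone sanity check of the kernel, one can launch it on the CPU backend with small random inputs and compare against direct per-wavenumber matrix-vector products. This is a sketch: the sizes are illustrative, the `SMatrix` element type mirrors how `params.S` is stored, and the kernel is assumed to be in scope (e.g. run from inside the module, or via `GeophysicalFlows.MultiLayerQG.PVinversion_kernel!`).

```julia
using KernelAbstractions, StaticArrays

nkr, nl, nlayers = 4, 4, 3

M = [rand(SMatrix{nlayers, nlayers, Float64}) for i in 1:nkr, j in 1:nl]  # mimics params.S
b = rand(nkr, nl, nlayers)
a = similar(b)

backend = KernelAbstractions.get_backend(a)                # CPU() for plain Arrays
kernel! = PVinversion_kernel!(backend, (4, 4), (nkr, nl))  # same instantiation pattern as below
kernel!(a, b, M, Val(nlayers))
KernelAbstractions.synchronize(backend)

# The kernel should reproduce the per-wavenumber matrix-vector product
for j in 1:nl, i in 1:nkr
    @assert a[i, j, :] ≈ M[i, j] * b[i, j, :]
end
```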
Collaborator (PR author) commented on lines +533 to +545:
I rewrote the kernel in a more general form and added `Val`. The code has sped up slightly, but the 16-thread CPU still outperforms the GPU. Compare these benchmarks to what I showed here:

  • GPU
    nlayers = 12; nx = 512; prob = MultiLayerQG.Problem(nlayers, GPU(); nx); @btime stepforward!(prob)
    668.165 ms (2533 allocations: 191.19 KiB)

  • CPU with 16 threads
    nlayers = 12; nx = 512; prob = MultiLayerQG.Problem(nlayers, CPU(); nx); @btime stepforward!(prob)
    444.419 ms (113 allocations: 5.61 KiB)

Member replied:
Are you sure you are timing the GPU properly?

Collaborator (PR author) replied:
julia> nlayers = 12; nx = 512; prob = MultiLayerQG.Problem(nlayers, GPU(); nx); @benchmark CUDA.@sync CUDA.@time stepforward!(prob)

0.681338 seconds (57.95 k CPU allocations: 2.419 MiB) (18 GPU allocations: 345.250 MiB, 0.04% memmgmt time)
0.678481 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.676472 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.694825 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.678072 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.677693 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.678237 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.04% memmgmt time)
0.677198 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.676980 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.676189 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.678326 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.677010 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.677142 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.04% memmgmt time)
0.676321 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.04% memmgmt time)
0.678115 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.677461 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.678255 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.677168 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.677529 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
BenchmarkTools.Trial: 8 samples with 1 evaluation.
Range (min … max):  676.718 ms … 678.645 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     677.681 ms               ┊ GC (median):    0.00%
Time  (mean ± σ):   677.765 ms ± 618.941 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

█                     █  ██       █   █                 █   █  
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁██▁▁▁▁▁▁▁█▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁█ ▁
677 ms           Histogram: frequency by time          679 ms <

Memory estimate: 199.41 KiB, allocs estimate: 2660.

Seems to be roughly the same as above? Unless I'm misunderstanding what this benchmark is doing...
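
For reference, the usual pattern for timing GPU work with BenchmarkTools is to synchronize inside the timed expression and interpolate the problem. A minimal sketch, assuming a CUDA device and the same configuration as above:

```julia
using BenchmarkTools, CUDA, GeophysicalFlows

nlayers = 12; nx = 512
prob = MultiLayerQG.Problem(nlayers, GPU(); nx)

# Synchronizing inside the benchmarked expression measures the full kernel execution,
# not just the asynchronous launch.
@btime CUDA.@sync stepforward!($prob)
```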


"""
pvfromstreamfunction!(qh, ψh, params, grid)

Obtain the Fourier transform of the PV from the streamfunction `ψh` in each layer using
`qh = params.S * ψh`.
`qh = params.S * ψh`. The PV inversion kernel is launched over a 2D work layout spanning the horizontal wavenumber grid.
"""
function pvfromstreamfunction!(qh, ψh, params, grid)
for j=1:grid.nl, i=1:grid.nkr
CUDA.@allowscalar @views qh[i, j, :] .= params.S[i, j] * ψh[i, j, :]
end
# Larger workgroups are generally more efficient. For more generality, we could
# special-case the workgroup size when either nkr or nl is less than 8.
workgroup = 8, 8

# The worksize spans the horizontal wavenumber grid, so the kernel body runs once per (i, j)
worksize = grid.nkr, grid.nl

# Instantiate the kernel for the relevant backend
backend = KernelAbstractions.get_backend(qh)
kernel! = PVinversion_kernel!(backend, workgroup, worksize)

# Launch the kernel
S, nlayers = params.S, params.nlayers
kernel!(qh, ψh, S, Val(nlayers))

# Ensure that no other operations occur until the kernel has finished
KernelAbstractions.synchronize(backend)

return nothing
end
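As a sanity check on the GPU, the kernel-based inversion can be compared against the scalar-indexing loop it replaces. A minimal sketch: the layer count and resolution are illustrative, and a CUDA-capable device is assumed.

```julia
using CUDA, GeophysicalFlows

nlayers = 3; nx = 64
prob = MultiLayerQG.Problem(nlayers, GPU(); nx)
grid, params = prob.grid, prob.params

ψh = CuArray(rand(Complex{Float64}, grid.nkr, grid.nl, nlayers))  # random test input
qh_kernel = similar(ψh)
MultiLayerQG.pvfromstreamfunction!(qh_kernel, ψh, params, grid)

# Reference result via the previous per-wavenumber scalar-indexing approach
qh_ref = similar(ψh)
for j in 1:grid.nl, i in 1:grid.nkr
    CUDA.@allowscalar @views qh_ref[i, j, :] .= params.S[i, j] * ψh[i, j, :]
end

@assert qh_kernel ≈ qh_ref
```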
@@ -591,12 +616,26 @@ end
streamfunctionfrompv!(ψh, qh, params, grid)

Invert the PV to obtain the Fourier transform of the streamfunction `ψh` in each layer from
`qh` using `ψh = params.S⁻¹ qh`.
`qh` using `ψh = params.S⁻¹ * qh`. The PV inversion kernel is launched over a 2D work layout spanning the horizontal wavenumber grid.
"""
function streamfunctionfrompv!(ψh, qh, params, grid)
for j=1:grid.nl, i=1:grid.nkr
CUDA.@allowscalar @views ψh[i, j, :] .= params.S⁻¹[i, j] * qh[i, j, :]
end
# Larger workgroups are generally more efficient. For more generality, we could
# special-case the workgroup size when either nkr or nl is less than 8.
workgroup = 8, 8

# The worksize spans the horizontal wavenumber grid, so the kernel body runs once per (i, j)
worksize = grid.nkr, grid.nl

# Instantiate the kernel for the relevant backend
backend = KernelAbstractions.get_backend(ψh)
kernel! = PVinversion_kernel!(backend, workgroup, worksize)

# Launch the kernel
S⁻¹, nlayers = params.S⁻¹, params.nlayers
kernel!(ψh, qh, S⁻¹, Val(nlayers))

# Ensure that no other operations occur until the kernel has finished
KernelAbstractions.synchronize(backend)

return nothing
end
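The analogous check for this direction swaps `params.S` for `params.S⁻¹` (continuing the hypothetical setup from the sketch above):

```julia
qh = CuArray(rand(Complex{Float64}, grid.nkr, grid.nl, nlayers))  # random test input
ψh_kernel = similar(qh)
MultiLayerQG.streamfunctionfrompv!(ψh_kernel, qh, params, grid)

# Reference result via the previous per-wavenumber scalar-indexing approach
ψh_ref = similar(qh)
for j in 1:grid.nl, i in 1:grid.nkr
    CUDA.@allowscalar @views ψh_ref[i, j, :] .= params.S⁻¹[i, j] * qh[i, j, :]
end

@assert ψh_kernel ≈ ψh_ref
```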