Use KernelAbstractions to accelerate MultilayerQG.streamfunctionfrompv!
#112
The first step is to write a kernel, which will look something like

```julia
@kernel function invert_column!(ψh, qh, S⁻¹)
    i, j = @index(Global, NTuple)
    @inbounds ψh[i, j] .= S⁻¹[i, j] * qh[i, j]
end
```

The next step is to create a work layout over which the kernel is launched. If we restrict attention to models that always have more than 32 grid points, we can use something like

```julia
# Larger workgroups are generally more efficient. For more generality, we could
# add an if statement that uses different behavior when either nkr or nl is less than 16.
workgroup = 16, 16

# The worksize determines how many times the kernel is run
worksize = grid.nkr, grid.nl

# This (and its usage below) ensures the kernel is not run _before_ the data in qh is available
barrier = Event(dev)

# Creates a loop over the specified worksize, using workgroup to organize the computation
loop_invert_column! = invert_column!(dev, workgroup, worksize)

# Launch the kernel
event = loop_invert_column!(ψh, qh, params.invS, dependencies=barrier)

# This ensures that no other operations occur until the kernel has finished
wait(dev, event)
```
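For later readers: the `Event`/`wait` synchronization above comes from older KernelAbstractions releases; in v0.9+ events were removed in favor of `KernelAbstractions.synchronize`. Here is a minimal, self-contained CPU sketch of the same construct-then-launch pattern, using a made-up `scale!` kernel purely for illustration:

```julia
using KernelAbstractions

@kernel function scale!(a, s)
    i, j = @index(Global, NTuple)
    @inbounds a[i, j] *= s
end

a = ones(64, 64)
backend = CPU()                      # swap for a GPU backend, e.g. CUDABackend()
kernel! = scale!(backend, (16, 16))  # workgroup, as in the snippet above
kernel!(a, 2.0; ndrange=size(a))     # worksize
KernelAbstractions.synchronize(backend)
@assert all(a .== 2.0)
```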
One thing I am not totally sure about is whether …

By the way, I think this optimization also requires the columns of …
With this last suggestion, would x, y FFTs work nicely?

Oof, good point. Hmm, maybe we need to hand-write the matrix-matrix multiply then. Not sure.

Yes, it's been coming to haunt us either way…
Something like

```julia
@kernel function invert_column!(ψh, qh, S⁻¹)
    i, j = @index(Global, NTuple)
    ψh_column = view(ψh, i, j, :)
    qh_column = view(qh, i, j, :)
    @inbounds ψh_column .= S⁻¹[i, j] * qh_column
end
```

might work.
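A possible caveat (not raised explicitly above): on the GPU, `S⁻¹[i, j] * qh_column` builds a temporary vector, and dynamic allocation inside GPU kernels is generally not allowed. Storing each `S⁻¹[i, j]` as a StaticArrays `SMatrix` and each column as an `SVector` makes the product stack-allocated. A sketch under those assumptions (hypothetical sizes, v0.9+ API as in the earlier sketch):

```julia
using KernelAbstractions, StaticArrays

# Assumes ψh, qh are matrices of SVector and S⁻¹ a matrix of SMatrix;
# then the matrix-vector product below allocates nothing.
@kernel function invert_column_static!(ψh, qh, S⁻¹)
    i, j = @index(Global, NTuple)
    @inbounds ψh[i, j] = S⁻¹[i, j] * qh[i, j]
end

nkr, nl = 8, 8
S⁻¹ = [@SMatrix(rand(2, 2)) for i in 1:nkr, j in 1:nl]
qh  = [@SVector(rand(2))    for i in 1:nkr, j in 1:nl]
ψh  = similar(qh)

kernel! = invert_column_static!(CPU(), (8, 8))
kernel!(ψh, qh, S⁻¹; ndrange=(nkr, nl))
KernelAbstractions.synchronize(CPU())
```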
Otherwise, a kernel along the lines of

```julia
using KernelAbstractions.Extras.LoopInfo: @unroll

@kernel function invert_column!(ψh, qh, S⁻¹, nz)
    i, j = @index(Global, NTuple)

    @unroll for k = 1:nz
        @inbounds ψh[i, j, k] = 0

        @unroll for m = 1:nz
            @inbounds ψh[i, j, k] += S⁻¹[i, j][k, m] * qh[i, j, m]
        end
    end
end
```

might work, alternatively. Or maybe my indices are screwed up; whichever is correct. Nothing is too difficult, it's just a matter of trying it out.
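To check that the indices in the unrolled kernel aren't screwed up, here's a plain-Julia sanity check of the same loop structure against a direct per-column matrix-vector product (sizes are hypothetical):

```julia
nkr, nl, nz = 4, 4, 3
S⁻¹ = [rand(nz, nz) for i in 1:nkr, j in 1:nl]
qh  = rand(nkr, nl, nz)
ψh  = zeros(nkr, nl, nz)

# Same loop structure as the kernel body above
for i in 1:nkr, j in 1:nl, k in 1:nz
    for m in 1:nz
        ψh[i, j, k] += S⁻¹[i, j][k, m] * qh[i, j, m]
    end
end

# Reference: per-column matrix-vector product
ψh_ref = similar(ψh)
for i in 1:nkr, j in 1:nl
    ψh_ref[i, j, :] .= S⁻¹[i, j] * qh[i, j, :]
end

@assert ψh ≈ ψh_ref
```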
I should resurrect this…
What about https://github.com/mcabbott/Tullio.jl to the rescue? (just a random thought)
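For what it's worth, a hedged sketch of what the Tullio route might look like, assuming S⁻¹ is repacked into a plain 4-D array `S4` (with `S4[i, j, k, m] = S⁻¹[i, j][k, m]`), since Tullio works with flat arrays rather than a matrix of matrices:

```julia
using Tullio

nkr, nl, nz = 4, 4, 3
S⁻¹ = [rand(nz, nz) for i in 1:nkr, j in 1:nl]
S4  = [S⁻¹[i, j][k, m] for i in 1:nkr, j in 1:nl, k in 1:nz, m in 1:nz]
qh  = rand(nkr, nl, nz)

# The repeated index m is summed over; := allocates ψh.
@tullio ψh[i, j, k] := S4[i, j, k, m] * qh[i, j, m]
```

GPU execution additionally requires the relevant GPU packages to be loaded; see the Tullio README.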
There are probably a lot of solutions! I think I gave two, but there might be more.
Hi @navidcy @glwagner, I am hoping to resurrect this issue. I'd like to run a significant number of high-vertical-resolution simulations, and leveraging GPU capabilities would be hugely beneficial. I am CUDA/GPU-literate but by no means fluent, and am trying to understand this thread and your thinking behind a possible fix. For my own understanding, the main issue is the scalar indexing of S in the PV inversion step, right? (See `GeophysicalFlows.jl/src/multilayerqg.jl`, lines 536 to 548 at 634ef2d.)
It seems you've both thought of a few ways to rewrite the calculation of S and the PV-inversion matrix-matrix multiply so that performance is optimized for GPUs. I'd be more than happy to have a go at changing the code and running the tests myself, but I might need a bit of guidance on which method proposed in this thread would be a fruitful initial direction. Thanks!
The issue is the explicit loop over …
Just want to emphasize that it is easy to learn KernelAbstractions: you can start by running the docs examples, which should take a few minutes, and then spend an hour or two learning how to write your own. At that point you're ready to solve this problem and run high-res simulations.
Okay, awesome, thanks for the direction, Greg! I'll give this a go and hopefully will run into few difficulties. Famous last words…!

You made this meme???

Yes.

It's a great one ;)
I had a go at writing the kernel using … and …

I ran some simple benchmark tests on a GPU (RTX 8000), a CPU with 16 threads, and a CPU with 1 thread. For example, with …

Something I was surprised by was that a multi-threaded CPU performed slightly better than the GPU in terms of speed in some cases. For example, with …

Does this surprise you? In any case, the rewrite definitely accelerates `streamfunctionfrompv!`.
Open a PR, I think! Yeah!
The speedup is 1000x, so "significant" is an understatement.
Yes, if you can show the code or open a PR, maybe we will spot something.
KernelAbstractions.jl can be used to accelerate the function `streamfunctionfrompv!` (see `GeophysicalFlows.jl/src/multilayerqg.jl`, lines 299 to 302 at 47a2b51).

A simple example showing how to use KernelAbstractions is the "Naive Transpose": https://juliagpu.gitlab.io/KernelAbstractions.jl/examples/naive_transpose/