Use KernelAbstractions to accelerate `MultilayerQG.streamfunctionfrompv!` #112

glwagner · 2020-09-15T12:30:12Z

KernelAbstractions.jl can be used to accelerate the function

Lines 299 to 302 in 47a2b51

    
    function streamfunctionfrompv!(ψh, qh, params, grid)
      for j=1:grid.nl, i=1:grid.nkr
        CUDA.@allowscalar @views ψh[i, j, :] .= params.invS[i, j] * qh[i, j, :]
      end

A simple example showing how to use KernelAbstractions is the "Naive Transpose":

https://juliagpu.gitlab.io/KernelAbstractions.jl/examples/naive_transpose/
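For readers who do not want to follow the link, here is a minimal sketch of a naive-transpose kernel. It assumes the backend-based API of KernelAbstractions 0.9 or later (`get_backend`, `synchronize`), and the array sizes and the kernel name are made up for illustration; it is not copied from the linked page.

```julia
using KernelAbstractions

# Each work item copies one element: a[i, j] = b[j, i]
@kernel function naive_transpose_kernel!(a, b)
    i, j = @index(Global, NTuple)
    @inbounds a[i, j] = b[j, i]
end

b = rand(Float32, 64, 32)          # hypothetical input
a = similar(b, 32, 64)             # output with swapped dimensions

# For plain Arrays this returns the CPU backend; a GPU array would return its own backend
backend = KernelAbstractions.get_backend(a)

# Instantiate the kernel with a 16×16 workgroup, then launch it over every index of `a`
kernel! = naive_transpose_kernel!(backend, (16, 16))
kernel!(a, b, ndrange = size(a))
KernelAbstractions.synchronize(backend)
```

The same three steps (write the kernel, instantiate it for a backend, launch it over a range) apply to the column inversion discussed in this issue.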

glwagner · 2020-09-15T12:39:10Z

The first step is to write a kernel, which will look something like

@kernel function invert_column!(ψh, qh, S⁻¹)
    i, j = @index(Global, NTuple)
    @inbounds ψh[i, j] .= S⁻¹[i, j] * qh[i, j]
end

The next step is to create a work layout over which the kernel is launched. If we restrict attention to models that always have more than 32 grid points, we can use something like

# Larger workgroups are generally more efficient. For more generality, we could add an if statement that
# switches to different behavior when either nkr or nl is less than 16
workgroup = 16, 16

# The size determines how many times the kernel is run
worksize = grid.nkr, grid.nl

# This (and its usage below) will ensure the kernel is not run _before_ the data in qh is available
barrier = Event(dev)

# Creates a loop over the specified worksize, using workgroup to organize the computation
loop_invert_column! = invert_column!(dev, workgroup, worksize)

# Launch the kernel
event = loop_invert_column!(ψh, qh, params.invS, dependencies=barrier)

# This will ensure that no other operations occur until the kernel has finished
wait(dev, event)
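A small experiment can show how the workgroup and worksize interact when the worksize is not a multiple of the workgroup. The sketch below assumes KernelAbstractions 0.9 or later, where `Event` and `wait(dev, event)` are replaced by `synchronize(backend)`. The kernel name `mark_visited!` and the grid sizes are hypothetical. To my understanding, KernelAbstractions masks out work items that fall outside the worksize, so each index should be visited exactly once.

```julia
using KernelAbstractions

# Each work item increments the entry it owns
@kernel function mark_visited!(counts)
    i, j = @index(Global, NTuple)
    @inbounds counts[i, j] += 1
end

nkr, nl = 20, 12                   # hypothetical sizes, deliberately not multiples of 16
counts = zeros(Int, nkr, nl)

backend = KernelAbstractions.get_backend(counts)

# Workgroup (16, 16) and worksize (nkr, nl) are fixed when the kernel is instantiated
kernel! = mark_visited!(backend, (16, 16), (nkr, nl))
kernel!(counts)

# Block until the kernel has finished
KernelAbstractions.synchronize(backend)

visited_once = all(==(1), counts)
```

If `visited_once` is `true` on the target backend, a special case for small grids may not be needed, though that is worth confirming on the GPU in question.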

glwagner · 2020-09-15T12:40:31Z

One thing I am not totally sure about is whether KernelAbstractions will compile away the matrix multiplication in @inbounds ψh[i, j] .= S⁻¹[i, j] * qh[i, j]. I think that it will. If not, we may have to unroll our own loop.

glwagner · 2020-09-15T12:45:53Z

By the way, I think this optimization also requires the columns of ψh[i, j] to be stored as StaticArrays. It looks like ψh is a 3D array right now. Other parts of the code may also have to be converted to kernels if this change is made, since broadcasting over the 3D array would no longer work.
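To make the storage idea concrete, here is a rough sketch of what "columns stored as StaticArrays" could look like. Everything in it (the sizes, the matrix entries, the variable layout) is a made-up illustration using StaticArrays.jl, not the layout GeophysicalFlows actually uses.

```julia
using StaticArrays

nkr, nl = 4, 3                     # hypothetical spectral grid size
const N = 2                        # hypothetical number of layers

# One N×N matrix and one length-N column per wavenumber (i, j)
invS = [SMatrix{N, N, Float64}(1, 2, 3, 4) for i in 1:nkr, j in 1:nl]
qh   = [SVector{N, ComplexF64}(i + im * j, i - im * j) for i in 1:nkr, j in 1:nl]

# The inversion becomes an ordinary broadcast of small matrix-vector products
ψh = invS .* qh

# The layer index is now hidden inside each element rather than being an array dimension
size(ψh)                           # (nkr, nl)
```

The cost of this layout is that `ψh` is a 2D array of vectors, so any code that treats the layer as a third array dimension would need to change.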

navidcy · 2020-09-15T19:32:39Z

With this last suggestion would x, y FFTs work nicely?

glwagner · 2020-09-15T20:09:27Z

> With this last suggestion would x, y FFTs work nicely?

Oof, good point.

Hmm, maybe we need to hand-write the matrix matrix multiply then. Not sure.

navidcy · 2020-09-15T20:11:54Z

yes it's been coming to haunt us either way...
(I remember a similar discussion some months ago...)

glwagner · 2020-09-16T12:12:32Z

Something like

@kernel function invert_column!(ψh, qh, S⁻¹)
    i, j = @index(Global, NTuple)
    ψh_column = view(ψh, i, j, :)
    qh_column = view(qh, i, j, :)
    @inbounds ψh_column .= S⁻¹[i, j] * qh_column
end

might work.

glwagner · 2020-09-16T12:15:35Z

Otherwise a kernel along the lines of

using KernelAbstractions.Extras.LoopInfo: @unroll

@kernel function invert_column!(ψh, qh, S⁻¹, nz)
    i, j = @index(Global, NTuple)

    @unroll for k = 1:nz

        @inbounds ψh[i, j, k] = 0

        @unroll for m = 1:nz
            @inbounds ψh[i, j, k] += S⁻¹[i, j][k, m] * qh[i, j, m]
        end

    end
end

might work, alternatively. Or maybe my indices are screwed up --- whichever is correct.

Nothing is too difficult, it's just a matter of trying it out.
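One way to settle the question about the indices is to run the loop kernel on the CPU backend and compare it with the column-by-column product it is meant to replace. The sketch below assumes KernelAbstractions 0.9 or later and stores each `S⁻¹[i, j]` as a StaticArrays `SMatrix`; the sizes and random test data are hypothetical.

```julia
using KernelAbstractions
using KernelAbstractions.Extras.LoopInfo: @unroll
using StaticArrays

# Same loop structure as the kernel proposed in the thread
@kernel function invert_column!(ψh, qh, S⁻¹, nz)
    i, j = @index(Global, NTuple)
    @unroll for k = 1:nz
        @inbounds ψh[i, j, k] = 0
        @unroll for m = 1:nz
            @inbounds ψh[i, j, k] += S⁻¹[i, j][k, m] * qh[i, j, m]
        end
    end
end

nkr, nl, nz = 9, 16, 3             # hypothetical sizes
S⁻¹ = [rand(SMatrix{3, 3, Float64}) for i in 1:nkr, j in 1:nl]
qh  = rand(ComplexF64, nkr, nl, nz)
ψh  = zeros(ComplexF64, nkr, nl, nz)

backend = KernelAbstractions.get_backend(ψh)
kernel! = invert_column!(backend, (8, 8), (nkr, nl))
kernel!(ψh, qh, S⁻¹, nz)
KernelAbstractions.synchronize(backend)

# Reference: the explicit matrix-vector product for every (i, j)
ψh_ref = similar(ψh)
for j in 1:nl, i in 1:nkr
    ψh_ref[i, j, :] .= Matrix(S⁻¹[i, j]) * qh[i, j, :]
end

indices_agree = ψh ≈ ψh_ref
```

If `indices_agree` is `true`, the `[k, m]` ordering in the kernel matches the original product.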

navidcy · 2020-12-02T20:54:42Z

I should resurrect this..

navidcy · 2021-03-18T05:08:40Z

What about https://github.com/mcabbott/Tullio.jl to the rescue? (just a random thought)

glwagner · 2021-03-18T14:07:00Z

There are probably a lot of solutions! I think I gave two, but there might be more.

mpudig · 2024-07-25T15:21:53Z

Hi @navidcy @glwagner – I am hoping to resurrect this issue. I'd like to run a significant number of high vertical resolution simulations and leveraging GPU capabilities would be hugely beneficial...

I am CUDA/GPU-literate but by no means fluent, and am trying to understand this thread and your thinking behind what a possible fix would be. For my own understanding, the main issue is the scalar indexing of S in the PV inversion step, right?

GeophysicalFlows.jl/src/multilayerqg.jl

Lines 536 to 548 in 634ef2d

    
    """
        pvfromstreamfunction!(qh, ψh, params, grid)

    Obtain the Fourier transform of the PV from the streamfunction `ψh` in each layer using
    `qh = params.S * ψh`.
    """
    function pvfromstreamfunction!(qh, ψh, params, grid)
      for j=1:grid.nl, i=1:grid.nkr
        CUDA.@allowscalar @views qh[i, j, :] .= params.S[i, j] * ψh[i, j, :]
      end

      return nothing
    end

It seems you've both thought of a few ways to rewrite the calculation of S and the PV inversion matrix-matrix multiply so that performance is optimized for GPUs. I'd be more than happy to have a go at changing the code and running the tests myself, but I might need a bit of guidance around what method proposed in this thread would be a fruitful initial direction to go in...!

Thanks

glwagner · 2024-07-25T16:05:51Z

The issue is the explicit loop over i, j. To utilize the GPU you have to write a kernel, hopefully using KernelAbstractions like we do in Oceananigans. Check out the docs for KernelAbstractions and write/run a few kernels for the GPU to test your skills. This is a fairly simple kernel so hopefully it will be fairly straightforward.

glwagner · 2024-07-25T16:09:01Z

Just want to emphasize that it is easy to learn KernelAbstractions. You can start by running the docs examples, which should take a few minutes, and then spend an hour or two learning how to write your own. At that point you're ready to solve this problem and run high-res simulations.

mpudig · 2024-07-25T16:33:49Z

Okay, awesome, thanks for the direction, Greg! I'll give this a go and hopefully will run into few difficulties. Famous last words...!

glwagner · 2024-07-25T17:08:56Z

navidcy · 2024-07-25T21:23:34Z

you made this meme???

glwagner · 2024-07-26T05:36:18Z

yes

navidcy · 2024-07-26T05:42:39Z

It's a great one ;)

mpudig · 2024-07-29T16:58:19Z

I had a go at writing the kernel using KernelAbstractions.jl and modifying the MultiLayerQG.pvfromstreamfunction! call:

using KernelAbstractions
using KernelAbstractions.Extras.LoopInfo: @unroll

@kernel function pvfromstreamfunction_kernel!(qh, ψh, S, nlayers)
  i, j = @index(Global, NTuple)

  @unroll for k = 1:nlayers
    @inbounds qh[i, j, k] = 0

    @unroll for m = 1:nlayers
      @inbounds qh[i, j, k] += S[i, j][k, m] * ψh[i, j, m]
    end
  end
end

and

function pvfromstreamfunction!(qh, ψh, params, grid)
  # Larger workgroups are generally more efficient. For more generality, we could add an
  # if statement that switches to different behavior when either nkr or nl is less than 8
  workgroup = 8, 8

  # The worksize determines how many times the kernel is run
  worksize = grid.nkr, grid.nl

  # Instantiates the kernel for relevant backend device
  backend = KernelAbstractions.get_backend(qh)
  kernel! = pvfromstreamfunction_kernel!(backend, workgroup, worksize)

  # Launch the kernel
  S, nlayers = params.S, params.nlayers
  kernel!(qh, ψh, S, nlayers)

  # This will ensure that no other operations occur until the kernel has finished
  KernelAbstractions.synchronize(backend)

  return nothing
end
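A quick sanity check for a kernel like this is a round trip on the CPU backend: multiply by `S`, multiply by its inverse, and confirm the original field comes back. The sketch below is an assumption-laden illustration. `multiply_columns!` and `multiply_columns` are hypothetical names for a generic version of the kernel and its launcher, the matrices are random well-conditioned `SMatrix` values rather than the real stretching matrices, and KernelAbstractions 0.9 or later is assumed.

```julia
using KernelAbstractions, StaticArrays, LinearAlgebra

# Hypothetical generic kernel: out[i, j, :] = M[i, j] * inp[i, j, :]
@kernel function multiply_columns!(out, inp, M, nlayers)
    i, j = @index(Global, NTuple)
    for k = 1:nlayers
        @inbounds out[i, j, k] = 0
        for m = 1:nlayers
            @inbounds out[i, j, k] += M[i, j][k, m] * inp[i, j, m]
        end
    end
end

# Hypothetical launcher mirroring the structure of the function above
function multiply_columns(inp, M)
    out = similar(inp)
    backend = KernelAbstractions.get_backend(out)
    kernel! = multiply_columns!(backend, (8, 8), size(inp)[1:2])
    kernel!(out, inp, M, size(inp, 3))
    KernelAbstractions.synchronize(backend)
    return out
end

nkr, nl, nlayers = 17, 32, 3       # hypothetical sizes
# Diagonally dominant, hence invertible, stand-ins for S[i, j]
S  = [SMatrix{3, 3}(rand(3, 3) + 3I) for i in 1:nkr, j in 1:nl]
ψh = rand(ComplexF64, nkr, nl, nlayers)

qh           = multiply_columns(ψh, S)          # analogue of pvfromstreamfunction!
ψh_roundtrip = multiply_columns(qh, inv.(S))    # analogue of streamfunctionfrompv!

roundtrip_ok = ψh_roundtrip ≈ ψh
```

Passing the same check with GPU arrays would additionally exercise the device code path, which this CPU-only sketch does not.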

I ran some simple benchmark tests on a GPU (rtx8000), a CPU with 16 threads and a CPU with 1 thread. For example, with nlayers = 3 and nx = 128 the improvement on the GPU is significant:

GPU (new code)

nlayers = 3; nx = 128; prob = MultiLayerQG.Problem(nlayers, GPU(); nx); @btime stepforward!(prob)
1.276 ms (2345 allocations: 179.81 KiB)

GPU (old code)

nlayers = 3; nx = 128; prob = MultiLayerQG.Problem(nlayers, GPU(); nx); @btime stepforward!(prob)
2.986 s (867442 allocations: 199.74 MiB)

Something I was surprised by was that a multi-threaded CPU performed slightly better than the GPU in terms of speed in some cases. For example, with nlayers = 12 and nx = 512:

GPU (new code)

nlayers = 12; nx = 512; prob = MultiLayerQG.Problem(nlayers, GPU(); nx); @btime stepforward!(prob)
1.179 s (2541 allocations: 191.94 KiB)

CPU with 16 threads (new code)

nlayers = 12; nx = 512; prob = MultiLayerQG.Problem(nlayers, CPU(); nx); @btime stepforward!(prob)
525.084 ms (49375 allocations: 8.28 MiB)

Does this surprise you?

In any case, the rewrite definitely accelerates MultiLayerQG.pvfromstreamfunction! on GPUs for arbitrary nlayers. Let me know if you would like me to open a PR to add the changes!
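For reference, the ratios implied by the timings quoted above can be worked out directly. The variable names below are only labels for the reported numbers.

```julia
# nlayers = 3, nx = 128 on the GPU (seconds per stepforward!)
t_gpu_old = 2.986
t_gpu_new = 1.276e-3
gpu_speedup = t_gpu_old / t_gpu_new        # roughly 2300x

# nlayers = 12, nx = 512 with the new code (seconds per stepforward!)
t_gpu_12 = 1.179
t_cpu_12 = 0.525084
gpu_vs_cpu = t_gpu_12 / t_cpu_12           # GPU takes a bit more than twice as long

println("GPU speedup (new vs old): ", round(Int, gpu_speedup), "x")
println("GPU / 16-thread CPU time: ", round(gpu_vs_cpu, digits = 2))
```

These are single `@btime` measurements from one machine, so the ratios are indicative only.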

mpudig · 2024-10-24T20:08:26Z

@navidcy @glwagner Just following up on the above in case you missed it. Let me know if you think I should open a PR to add this to the MultiLayerQG code.

navidcy · 2024-10-24T20:11:53Z

Open a PR I think! Yeah!

glwagner · 2024-10-24T21:18:54Z

> For example, with nlayers = 3 and nx = 128 the improvement on the GPU is significant:

The speedup is over 2000x (2.986 s vs. 1.276 ms), so "significant" is an understatement.

> Does this surprise you?

Yes, if you can show the code or open a PR maybe we will spot something.

navidcy added 🎮 gpu optimization 🏎 ❓ question Further information is requested labels Dec 2, 2020

navidcy added the 🚑 help wanted Extra attention is needed label Jan 28, 2021

navidcy mentioned this issue Mar 6, 2021

Warning when user wants MultiLayerQG on the GPU #208

Merged

navidcy mentioned this issue Nov 16, 2021

Optimized PV inversion for two layer case in MultilayerQG #267

Closed

navidcy mentioned this issue Jun 10, 2023

Fixing sign error for mean meridional PV gradient Qy in MultiLayerQG module #329

Merged

mpudig added a commit to mpudig/GeophysicalFlows.jl that referenced this issue Jul 26, 2024

First attempt at writing kernel and workspace following Greg suggesti…

4f7d797

…ons on issue FourierFlows#112.

mpudig mentioned this issue Oct 28, 2024

Accelerating MultiLayerQG on GPUs #373

Open
