Rewrite the 3D buffer copy example using different uniformElements loops #2377

fwyzard · 2024-09-05T23:10:31Z

No description provided.

fwyzard · 2024-09-06T08:31:34Z

@psychocoderHPC @SimeonEhrig would you prefer that the three kernels use three different approaches (just to showcase them), or only one of them? And if only one, which one?

SimeonEhrig · 2024-09-09T06:42:23Z

I like the comparison of the three approaches.

SimeonEhrig · 2024-09-09T06:44:14Z

@fwyzard Can you please push again? It looks like the CI trigger didn't work.

fwyzard · 2024-09-10T08:00:48Z

> Looks like the CI trigger didn't work.

It's still not working :(

psychocoderHPC · 2024-09-10T09:52:02Z

@fwyzard please amend and force-push this PR again; the CI is fixed.

example/bufferCopy/src/bufferCopy.cpp

psychocoderHPC · 2024-09-13T07:48:56Z

One job in the CI failed with:

Test project /builds/hzdr/crp/alpaka/build
      Start  1: bufferCopy
 1/52 Test  #1: bufferCopy .......................Subprocess aborted***Exception:   0.38 sec
Using alpaka accelerator: AccCpuSerial<3,unsigned long>
0,0,0:0 0,0,1:1 0,1,0:2 0,1,1:3 1,0,0:4 1,0,1:5 1,1,0:6 1,1,1:7 
0,0,0:0 0,0,1:1 0,1,0:2 0,1,1:3 1,0,0:4 1,0,1:5 1,1,0:6 1,1,1:7 
0,0,0:0 0,0,1:1 0,1,0:2 0,1,1:3 1,0,0:4 1,0,1:5 1,1,0:6 1,1,1:7 
0,0,0:0 0,0,1:1 0,1,0:2 0,1,1:3 1,0,0:4 1,0,1:5 1,1,0:6 1,1,1:7 
Using alpaka accelerator: AccCpuThreads<3,unsigned long>
0,0,1:1 0,0,0:0 0,1,1:3 0,1,0:2 1,0,0:4 1,0,1:5 1,1,0:6 1,1,1:7 
0,0,1:1 0,0,0:0 0,1,0:2 0,1,1:3 1,0,0:4 1,0,1:5 1,1,1:7 1,1,0:6 
0,0,0:0 0,0,1:1 0,1,0:2 0,1,1:3 1,0,0:4 1,0,1:5 1,1,0:6 1,1,1:7 
0,0,0:0 0,0,1:1 0,1,0:2 0,1,1:3 1,0,0:4 1,0,1:5 1,1,0:6 1,1,1:7 
Using alpaka accelerator: AccGpuCudaRt<3,unsigned long>
terminate called after throwing an instance of 'std::runtime_error'
  what():  /builds/hzdr/crp/alpaka/include/alpaka/queue/cuda_hip/QueueUniformCudaHipRt.hpp(175) 'TApi::streamSynchronize(queue.getNativeHandle())' A previous API call (not this one) set the error  : 'cudaErrorLaunchOutOfResources': 'too many resources requested for launch'!

I restarted the job to see whether it is a temporary issue, but typically this error means that the kernel was launched with parameters that are invalid for the GPU device, e.g. too many threads per block.

psychocoderHPC · 2024-09-13T08:10:15Z

I checked the register footprint of this example locally to see whether high register usage, which reduces the valid block size, could be the reason we can no longer use as many threads per block.
The result is that the new scheme introduces a very large register overhead, which will significantly reduce performance in applications.

(left shows the dev branch, right shows this PR)

psychocoderHPC · 2024-09-13T08:18:48Z

IMO the reason we run into the error 'too many resources requested for launch' is that this example has two kernels but calculates a valid work division for only one of them. Unfortunately, we use the kernel with the lower register footprint to calculate the work division, and the result is not valid for the second kernel. We do the same in other examples, but this is the first time it has shown that this is not a good idea.

Note: I am not against using the new iterator for this example. I think an easy fix is to add a getValidWorkDiv call for the second kernel, and then everything should work.
This example may be a good starting point for a deeper register footprint analysis to trace the root of the register pressure.
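
The failure mode described above can be illustrated without alpaka. The following is a minimal sketch under a simplified model, not alpaka's `getValidWorkDiv`: it assumes the per-block register budget is the only limit, and the names `KernelInfo`, `DeviceLimits`, `maxThreadsFor` and `launchFits` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical description of a compiled kernel; only its register use matters here.
struct KernelInfo
{
    std::uint32_t registersPerThread;
};

// Hypothetical device limits; any values used with this are illustrative only.
struct DeviceLimits
{
    std::uint32_t maxThreadsPerBlock; // hard upper bound on the block size
    std::uint32_t registersPerBlock;  // registers available to a single block
};

// Largest block size that fits the register budget for ONE specific kernel.
// Simplified model: ignores allocation granularity, shared memory and other limits.
inline std::uint32_t maxThreadsFor(KernelInfo const& kernel, DeviceLimits const& dev)
{
    assert(kernel.registersPerThread > 0);
    return std::min(dev.maxThreadsPerBlock, dev.registersPerBlock / kernel.registersPerThread);
}

// True if launching `kernel` with `threadsPerBlock` stays within the register budget.
inline bool launchFits(KernelInfo const& kernel, DeviceLimits const& dev, std::uint32_t threadsPerBlock)
{
    return threadsPerBlock <= maxThreadsFor(kernel, dev);
}
```

In this model, with 65536 registers per block and a limit of 1024 threads, a kernel using 32 registers per thread may run 1024 threads per block, while a kernel using 128 registers per thread is limited to 512. Reusing the first kernel's block size for the second one fails the check, which is why the block size has to be derived separately for each kernel.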

psychocoderHPC · 2024-09-13T08:45:26Z

I opened #2382 to track possible optimizations

psychocoderHPC · 2024-09-13T11:03:18Z

Note: I am a little confused by the test, even in the version on the develop branch.
The test does not verify the results; it only prints them to the terminal.
I realized that the results shown for the GPU are wrong, which means the test is broken.

Using alpaka accelerator: AccCpuSerial<3,unsigned long>
0,0,0:0 0,0,1:1 0,1,0:2 0,1,1:3 1,0,0:4 1,0,1:5 1,1,0:6 1,1,1:7 
0,0,0:0 0,0,1:1 0,1,0:2 0,1,1:3 1,0,0:4 1,0,1:5 1,1,0:6 1,1,1:7 
0,0,0:0 0,0,1:1 0,1,0:2 0,1,1:3 1,0,0:4 1,0,1:5 1,1,0:6 1,1,1:7 
0,0,0:0 0,0,1:1 0,1,0:2 0,1,1:3 1,0,0:4 1,0,1:5 1,1,0:6 1,1,1:7 
Using alpaka accelerator: AccGpuCudaRt<3,unsigned long>
0,124374521316120,0:0 0,124374521316120,0:0 0,124374521316120,0:0 0,124374521316120,0:0 1,124374521316120,0:0 1,124374521316120,0:0 1,124374521316120,0:0 1,124374521316120,0:0 
0,124374521316120,0:0 0,124374521316120,0:0 0,124374521316120,0:0 0,124374521316120,0:0 1,124374521316120,0:0 1,124374521316120,0:0 1,124374521316120,0:0 1,124374521316120,0:0 
0,0,0:0 0,0,1:1 0,1,0:2 0,1,1:3 1,0,0:4 1,0,1:5 1,1,0:6 1,1,1:7 
0,0,0:0 0,0,1:1 0,1,0:2 0,1,1:3 1,0,0:4 1,0,1:5 1,1,0:6 1,1,1:7
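
A check along the following lines could replace the visual inspection of the printed values. This is only a sketch: it assumes a densely packed host buffer (no pitch) in which every element is expected to hold its own linear index, as the `z,y,x:value` output above suggests. `Extent3D`, `linearize` and `countMismatches` are hypothetical helpers, not part of the example or of alpaka.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Extents ordered {z, y, x}, from the slowest to the fastest varying dimension.
using Extent3D = std::array<std::size_t, 3>;

// Row-major linear index of element (z, y, x) in a densely packed 3D buffer.
inline std::size_t linearize(std::size_t z, std::size_t y, std::size_t x, Extent3D const& extent)
{
    assert(z < extent[0] && y < extent[1] && x < extent[2]);
    return (z * extent[1] + y) * extent[2] + x;
}

// Counts the elements whose value differs from their own linear index.
// A result of 0 means the buffer holds the expected pattern.
inline std::size_t countMismatches(std::vector<std::size_t> const& data, Extent3D const& extent)
{
    assert(data.size() == extent[0] * extent[1] * extent[2]);
    std::size_t mismatches = 0;
    for(std::size_t z = 0; z < extent[0]; ++z)
        for(std::size_t y = 0; y < extent[1]; ++y)
            for(std::size_t x = 0; x < extent[2]; ++x)
            {
                std::size_t const idx = linearize(z, y, x, extent);
                if(data[idx] != idx)
                    ++mismatches;
            }
    return mismatches;
}
```

A test could then fail whenever `countMismatches` returns a non-zero value, independently of how the values are printed.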

psychocoderHPC · 2024-09-13T11:11:48Z

It is a printf bug: %zu does not work here and must be replaced with %lu or %llu.
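
One portable way to avoid depending on `%zu` is to cast each `std::size_t` argument explicitly and print it with `%llu`. The host-side sketch below shows the idea; `formatEntry` is a hypothetical helper, and the actual change made in the example is not shown in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Formats one entry as "z,y,x:value".
// Every std::size_t is cast to unsigned long long so that %llu always matches
// the argument type, regardless of how std::size_t is defined on the platform.
inline std::string formatEntry(std::size_t z, std::size_t y, std::size_t x, std::size_t value)
{
    char buffer[128]; // four 64-bit numbers need at most 4 * 20 digits plus 3 separators
    int const written = std::snprintf(
        buffer,
        sizeof(buffer),
        "%llu,%llu,%llu:%llu",
        static_cast<unsigned long long>(z),
        static_cast<unsigned long long>(y),
        static_cast<unsigned long long>(x),
        static_cast<unsigned long long>(value));
    assert(written > 0 && static_cast<std::size_t>(written) < sizeof(buffer));
    return std::string(buffer, static_cast<std::size_t>(written));
}
```

The same cast-and-`%llu` pattern can be applied directly to a `printf` call inside a kernel, assuming the back end supports `%llu`.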

fwyzard added the Type:Example label Sep 5, 2024

fwyzard requested review from psychocoderHPC and SimeonEhrig September 6, 2024 08:30

SimeonEhrig previously approved these changes Sep 9, 2024

fwyzard force-pushed the rewrite_bufferCopy branch from 85fa529 to c67a881 September 9, 2024 07:28

psychocoderHPC previously approved these changes Sep 10, 2024

fwyzard force-pushed the rewrite_bufferCopy branch from c67a881 to a45dac3 September 10, 2024 09:52

psychocoderHPC reviewed Sep 10, 2024

example/bufferCopy/src/bufferCopy.cpp (outdated)

fwyzard dismissed stale reviews from SimeonEhrig and psychocoderHPC via c272fd5 September 10, 2024 13:13

fwyzard force-pushed the rewrite_bufferCopy branch from a45dac3 to c272fd5 September 10, 2024 13:13

psychocoderHPC requested changes Sep 11, 2024

example/bufferCopy/src/bufferCopy.cpp (outdated)

fwyzard force-pushed the rewrite_bufferCopy branch 2 times, most recently from d7b1039 to ee766b1 September 11, 2024 11:12

Rename uniformElements[XYZ] to uniformElementsAlong[XYZ], etc.

6614f7e

fwyzard force-pushed the rewrite_bufferCopy branch 3 times, most recently from dfbe6c8 to 4f046dc September 12, 2024 15:34

Fix uniformElementsAlong<Dim>, etc.

b2f2f2f

fwyzard force-pushed the rewrite_bufferCopy branch 2 times, most recently from e009662 to de6ca89 September 12, 2024 20:47

psychocoderHPC mentioned this pull request Sep 13, 2024

reduce register footprint for the new iterators #2382

Closed

fwyzard force-pushed the rewrite_bufferCopy branch from de6ca89 to c2c9b07 September 13, 2024 14:53

fwyzard requested a review from psychocoderHPC September 13, 2024 14:53

Rewrite the 3D buffer copy example using different uniformElements loops

cc40198

fwyzard force-pushed the rewrite_bufferCopy branch from c2c9b07 to cc40198 September 13, 2024 15:17

psychocoderHPC approved these changes Sep 16, 2024

psychocoderHPC merged commit 20f75cc into alpaka-group:develop Sep 16, 2024
22 checks passed

psychocoderHPC mentioned this pull request Sep 16, 2024

reduce Uniform/Independent-Elements iterator register footprint #2383

Merged
