Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite the 3D buffer copy example using different uniformElements loops #2377

Merged
merged 3 commits into from
Sep 16, 2024

Conversation

fwyzard
Copy link
Contributor

@fwyzard fwyzard commented Sep 5, 2024

No description provided.

@fwyzard
Copy link
Contributor Author

fwyzard commented Sep 6, 2024

@psychocoderHPC @SimeonEhrig do you prefer that the three kernels use three different approaches (just to showcase them) or only one of them ? And, in case, which one ?

@SimeonEhrig
Copy link
Member

I like the comparison of the three approaches.

SimeonEhrig
SimeonEhrig previously approved these changes Sep 9, 2024
@SimeonEhrig
Copy link
Member

@fwyzard Can you please push again. Looks like the CI trigger didn't worked.

@fwyzard
Copy link
Contributor Author

fwyzard commented Sep 10, 2024

Looks like the CI trigger didn't worked.

It's still not working :(

psychocoderHPC
psychocoderHPC previously approved these changes Sep 10, 2024
@psychocoderHPC
Copy link
Member

@fwyzard please amend and force push this PR again, the CI is fixed

@fwyzard fwyzard force-pushed the rewrite_bufferCopy branch 2 times, most recently from d7b1039 to ee766b1 Compare September 11, 2024 11:12
@fwyzard fwyzard force-pushed the rewrite_bufferCopy branch 3 times, most recently from dfbe6c8 to 4f046dc Compare September 12, 2024 15:34
@fwyzard fwyzard force-pushed the rewrite_bufferCopy branch 2 times, most recently from e009662 to de6ca89 Compare September 12, 2024 20:47
@psychocoderHPC
Copy link
Member

one job in the ci failed with

Test project /builds/hzdr/crp/alpaka/build
      Start  1: bufferCopy
 1/52 Test  #1: bufferCopy .......................Subprocess aborted***Exception:   0.38 sec
Using alpaka accelerator: AccCpuSerial<3,unsigned long>
0,0,0:0 0,0,1:1 0,1,0:2 0,1,1:3 1,0,0:4 1,0,1:5 1,1,0:6 1,1,1:7 
0,0,0:0 0,0,1:1 0,1,0:2 0,1,1:3 1,0,0:4 1,0,1:5 1,1,0:6 1,1,1:7 
0,0,0:0 0,0,1:1 0,1,0:2 0,1,1:3 1,0,0:4 1,0,1:5 1,1,0:6 1,1,1:7 
0,0,0:0 0,0,1:1 0,1,0:2 0,1,1:3 1,0,0:4 1,0,1:5 1,1,0:6 1,1,1:7 
Using alpaka accelerator: AccCpuThreads<3,unsigned long>
0,0,1:1 0,0,0:0 0,1,1:3 0,1,0:2 1,0,0:4 1,0,1:5 1,1,0:6 1,1,1:7 
0,0,1:1 0,0,0:0 0,1,0:2 0,1,1:3 1,0,0:4 1,0,1:5 1,1,1:7 1,1,0:6 
0,0,0:0 0,0,1:1 0,1,0:2 0,1,1:3 1,0,0:4 1,0,1:5 1,1,0:6 1,1,1:7 
0,0,0:0 0,0,1:1 0,1,0:2 0,1,1:3 1,0,0:4 1,0,1:5 1,1,0:6 1,1,1:7 
Using alpaka accelerator: AccGpuCudaRt<3,unsigned long>
terminate called after throwing an instance of 'std::runtime_error'
  what():  /builds/hzdr/crp/alpaka/include/alpaka/queue/cuda_hip/QueueUniformCudaHipRt.hpp(175) 'TApi::streamSynchronize(queue.getNativeHandle())' A previous API call (not this one) set the error  : 'cudaErrorLaunchOutOfResources': 'too many resources requested for launch'!

I restarted the job to see if it is a temporary issue but typical it means that GPU device used invalid kernel start parameter e.g. to many threads per block.

@psychocoderHPC
Copy link
Member

I did a local check of the register footprint for this example to check if it could be that we can not use as many threads per block anymore due to high register usage which will reduce the valid blocksize.
The result is that the new schema is introducing a very large register overhead which will reduce the performance significantly in applications.

register_footprint
(left shows the dev branch, right shows this PR)

@psychocoderHPC
Copy link
Member

IMO the problem that we run into the error 'too many resources requested for launch' is that we have two kernel in this example but only calculate a valid workdiv for one of them. In this example, we use unfortunately the kernel with the lower register footprint to calculate the workdiv and this is not valid for the second kernel. We have other examples where we do this but this is the first time where it shows that this is not good.

Note: I am not against using the new iterator for this example. I think an easy fix is adding getValidWorkdv for the second kernel and all should work.
This example may perhaps be a good starting point for deeper register footprint analysis to trace the root of the register pressure.

@psychocoderHPC
Copy link
Member

I opened #2382 to track possible optimizations

@psychocoderHPC
Copy link
Member

Note: I am a little bit confused by the test even the develop branch version.
The test does not verify the results and prints it only to the terminal.
I realized that the results are wrong for GPU, this means the test is broken.

Using alpaka accelerator: AccCpuSerial<3,unsigned long>
0,0,0:0 0,0,1:1 0,1,0:2 0,1,1:3 1,0,0:4 1,0,1:5 1,1,0:6 1,1,1:7 
0,0,0:0 0,0,1:1 0,1,0:2 0,1,1:3 1,0,0:4 1,0,1:5 1,1,0:6 1,1,1:7 
0,0,0:0 0,0,1:1 0,1,0:2 0,1,1:3 1,0,0:4 1,0,1:5 1,1,0:6 1,1,1:7 
0,0,0:0 0,0,1:1 0,1,0:2 0,1,1:3 1,0,0:4 1,0,1:5 1,1,0:6 1,1,1:7 
Using alpaka accelerator: AccGpuCudaRt<3,unsigned long>
0,124374521316120,0:0 0,124374521316120,0:0 0,124374521316120,0:0 0,124374521316120,0:0 1,124374521316120,0:0 1,124374521316120,0:0 1,124374521316120,0:0 1,124374521316120,0:0 
0,124374521316120,0:0 0,124374521316120,0:0 0,124374521316120,0:0 0,124374521316120,0:0 1,124374521316120,0:0 1,124374521316120,0:0 1,124374521316120,0:0 1,124374521316120,0:0 
0,0,0:0 0,0,1:1 0,1,0:2 0,1,1:3 1,0,0:4 1,0,1:5 1,1,0:6 1,1,1:7 
0,0,0:0 0,0,1:1 0,1,0:2 0,1,1:3 1,0,0:4 1,0,1:5 1,1,0:6 1,1,1:7 

@psychocoderHPC
Copy link
Member

It is a printf bug %zu is not working and must be %lu or %llu

@psychocoderHPC psychocoderHPC merged commit 20f75cc into alpaka-group:develop Sep 16, 2024
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants