
Compiling Grid for AMD GPUS #343
Open · philomat opened this issue Mar 2, 2021 · 31 comments

philomat commented Mar 2, 2021

I know the wiki says there is currently no support for AMD GPUs. But I saw commits concerning HIP. Is there a way one could try experimenting with Grid on AMD GPUs?

paboyle (Owner) commented Mar 4, 2021

Yes - HIP is believed to be working, but not yet efficient, on AMD GPUs.
You might try Benchmark_dwf_fp32 with the --dslash-unroll flag; it is new as of a few days ago.

paboyle (Owner) commented Mar 4, 2021

The status of multi-GPU and of the "NVLink" equivalent is untested. Using --enable-shm=none and MPI between GPUs is probably safer.
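
For illustration, a minimal configure sketch along those lines - the flags are the ones that appear elsewhere in this thread, while the compiler names and the MPI launch are placeholders rather than a tested recipe:

../configure --enable-accelerator=hip \
             --enable-simd=GPU --enable-gen-simd-width=64 \
             --enable-comms=mpi3-auto \
             --enable-shm=none \
             --enable-unified=no \
             CXX=hipcc MPICXX=mpicxx
# then launch one MPI rank per GPU (e.g. mpirun -np 4 on a four-GPU node),
# so inter-GPU traffic goes through MPI rather than the shared-memory path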

paboyle (Owner) commented Mar 11, 2021

BTW, I have benchmarked AMD MI50 and MI100, but want to revisit with the new explicit Nc=3 kernel.

I have also compiled under HIP on Summit for Nvidia and got the same performance as the CUDA compile.

philomat (Author) commented:

I was able to compile Grid and also to run the benchmark you suggested. However, some of the tests are failing. For example, Test_wilson_clover fails with:

Grid : Message : MemoryManager::Init() Using hipMalloc
Grid : Message : 0.335145 s : Grid is setup to use 1 threads
Grid : Message : 0.335153 s : Grid floating point word size is REALF4
Grid : Message : 0.335154 s : Grid floating point word size is REALD8
Grid : Message : 0.335155 s : Grid floating point word size is REAL8
Memory access fault by GPU node-2 (Agent handle: 0x1f04be0) on address 0x7f3e57b92000. Reason: Page not present or supervisor privilege.
[qcd20g01:2774949] *** Process received signal ***
[qcd20g01:2774949] Signal: Aborted (6)
[qcd20g01:2774949] Signal code:  (-6)
[qcd20g01:2774949] [ 0] /lib64/libpthread.so.0(+0x12dd0)[0x7f3d7c01bdd0]
[qcd20g01:2774949] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7f3d7a6c170f]
[qcd20g01:2774949] [ 2] /lib64/libc.so.6(abort+0x127)[0x7f3d7a6abb25]
[qcd20g01:2774949] [ 3] /opt/rocm-3.9.0/lib/libhsa-runtime64.so.1(+0x1bd2b)[0x7f3d796d2d2b]
[qcd20g01:2774949] [ 4] /opt/rocm-3.9.0/lib/libhsa-runtime64.so.1(+0x61f4d)[0x7f3d79718f4d]
[qcd20g01:2774949] [ 5] /opt/rocm-3.9.0/lib/libhsa-runtime64.so.1(+0x1fd97)[0x7f3d796d6d97]
[qcd20g01:2774949] [ 6] /lib64/libpthread.so.0(+0x82de)[0x7f3d7c0112de]
[qcd20g01:2774949] [ 7] /lib64/libc.so.6(clone+0x43)[0x7f3d7a785e83]
[qcd20g01:2774949] *** End of error message ***
Aborted (core dumped) 

Or Test_nersc_io fails because the plaquette is not correctly reproduced:

Grid : Message : 2.770978 s : NERSC Configuration ./ckpoint_lat.4000 checksum 5c0ac22a header   5c0ac22a
Grid : Message : 2.770987 s : NERSC Configuration ./ckpoint_lat.4000 plaquette 9.83463e-05 header    0.0507244
Grid : Message : 2.771006 s : NERSC Configuration ./ckpoint_lat.4000 link_trace 0.240144 header    0.000115627
 Plaquette mismatch 
Test_nersc_io: /home/scior/Grid/Grid/parallelIO/NerscIO.h:201: static void Grid::NerscIO::readConfiguration(Grid::NerscIO::GaugeField &, Grid::FieldMetaData &, std::string, GaugeStats) [GaugeStats = Grid::GaugeStatistics<Grid::PeriodicGaugeImpl<Grid::GaugeImplTypes<Grid::Grid_simd<thrust::complex<double>, Grid::GpuVector<4, Grid::GpuComplex<HIP_vector_type<double, 2>>>>, 3, 12>>>]: Assertion `fabs(clone.plaquette -header.plaquette ) < 1.0e-5' failed.
[qcd20g01:2775537] *** Process received signal ***
[qcd20g01:2775537] Signal: Aborted (6)
[qcd20g01:2775537] Signal code:  (-6)
[qcd20g01:2775537] [ 0] /lib64/libpthread.so.0(+0x12dd0)[0x7f68a70c8dd0]
[qcd20g01:2775537] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7f68a576e70f]
[qcd20g01:2775537] [ 2] /lib64/libc.so.6(abort+0x127)[0x7f68a5758b25]
[qcd20g01:2775537] [ 3] /lib64/libc.so.6(+0x219f9)[0x7f68a57589f9]
[qcd20g01:2775537] [ 4] /lib64/libc.so.6(+0x2fcc6)[0x7f68a5766cc6]
[qcd20g01:2775537] [ 5] ./Test_nersc_io[0x40ef28]
[qcd20g01:2775537] [ 6] ./Test_nersc_io[0x40736a]
[qcd20g01:2775537] [ 7] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f68a575a6a3]
[qcd20g01:2775537] [ 8] ./Test_nersc_io[0x40602e]
[qcd20g01:2775537] *** End of error message ***
Aborted (core dumped)

Other tests, e.g. Test_wilson_even_odd, seem to work fine.

The configure command I used is:

../configure --enable-unified=no --enable-shm=no --enable-accelerator=hip --enable-comms=mpi3-auto --enable-simd=GPU --enable-gen-simd-width=64 CXX=/opt/rocm-3.9.0/bin/hipcc MPICXX=mpicxx CXXFLAGS="-fPIC -I/opt/rocm-3.9.0/ -I/home/scior/Thrust/" --with-lime=../depencencies/lime

paboyle (Owner) commented Mar 18, 2021

Thanks - I haven't tried WilsonClover on GPU, to be honest, so I am not absolutely sure it works on Nvidia either.

Re. the plaquette - this does work on CUDA, so it is something interesting to look at on HIP...
Where are you running this?

paboyle (Owner) commented Mar 18, 2021

HIP is definitely in the "experimental" category for now, but getting everything to work would be good.
Glad to see you are running on ROCm 3.9, which is recent/up to date.

philomat (Author) commented:

I am running on a machine at JLab.

paboyle (Owner) commented Mar 18, 2021

I should have asked specifically what hardware you are running on, rather than where it is physically located.

philomat (Author) commented:

It's a machine equipped with four Vega 20 cards and an AMD EPYC CPU.
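
For reference, a quick way to confirm what the ROCm stack sees on such a node (a sketch assuming the standard ROCm tools in their default location; Vega 20 reports as gfx906, which matches the compile target that shows up later in this thread):

/opt/rocm/bin/rocminfo | grep -i 'name:.*gfx'   # GPU agents, e.g. gfx906 for Vega 20
/opt/rocm/bin/rocm-smi                          # quick per-GPU temperature/VRAM/usage overview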

paboyle (Owner) commented Mar 18, 2021

Can you tell me the performance you get with

benchmarks/Benchmark_dwf_fp32 --grid 16.16.16.16

and

benchmarks/Benchmark_dwf_fp32 --grid 16.16.16.16 --dslash-unroll

Thanks

philomat (Author) commented:

Here are the results for Benchmark_dwf_fp32 --grid 16.16.16.16:

Grid : Message : 3.846120 s : *****************************************************************
Grid : Message : 3.846121 s : * Benchmarking DomainWallFermionR::Dhop                  
Grid : Message : 3.846123 s : * Vectorising space-time by 8
Grid : Message : 3.846124 s : * VComplexF size is 64 B
Grid : Message : 3.846126 s : * SINGLE precision 
Grid : Message : 3.846127 s : * Using GENERIC Nc WilsonKernels
Grid : Message : 3.846128 s : *****************************************************************
Grid : Message : 3.989939 s : Called warmup
Grid : Message : 8.192125 s : Called Dw 1000 times in 4.20216e+06 us
Grid : Message : 8.192172 s : mflop/s =   329383
Grid : Message : 8.192175 s : mflop/s per rank =  329383
Grid : Message : 8.192177 s : mflop/s per node =  329383
Grid : Message : 8.192179 s : RF  GiB/s (base 2) =   669.298
Grid : Message : 8.192181 s : mem GiB/s (base 2) =   418.311
Grid : Message : 8.192877 s : norm diff   1.08494e-16
Grid : Message : 8.222942 s : #### Dhop calls report 
Grid : Message : 8.222945 s : WilsonFermion5D Number of DhopEO Calls   : 2002
Grid : Message : 8.222948 s : WilsonFermion5D TotalTime   /Calls        : 2106.55 us
Grid : Message : 8.222950 s : WilsonFermion5D CommTime    /Calls        : 0.838162 us
Grid : Message : 8.222952 s : WilsonFermion5D FaceTime    /Calls        : 0 us
Grid : Message : 8.222954 s : WilsonFermion5D ComputeTime1/Calls        : 2105.66 us
Grid : Message : 8.222956 s : WilsonFermion5D ComputeTime2/Calls        : 0 us
Grid : Message : 8.222959 s : Average mflops/s per call                : 334643
Grid : Message : 8.222961 s : Average mflops/s per call per rank       : 334643
Grid : Message : 8.222963 s : Average mflops/s per call per node       : 334643
Grid : Message : 8.222971 s : Average mflops/s per call (full)         : 334501
Grid : Message : 8.222973 s : Average mflops/s per call per rank (full): 334501
Grid : Message : 8.222975 s : Average mflops/s per call per node (full): 334501
Grid : Message : 8.222977 s : WilsonFermion5D Stencil
Grid : Message : 8.222980 s :  Stencil calls 1001
Grid : Message : 8.222989 s :  Stencil halogtime 0.0679321
Grid : Message : 8.222991 s :  Stencil gathertime 0
Grid : Message : 8.222993 s :  Stencil gathermtime 0
Grid : Message : 8.222997 s :  Stencil mergetime 0.0649351
Grid : Message : 8.223033 s :  Stencil decompresstime 0.0759241
Grid : Message : 8.223040 s : WilsonFermion5D StencilEven
Grid : Message : 8.223046 s : WilsonFermion5D StencilOdd
Grid : Message : 8.223049 s : WilsonFermion5D Stencil     Reporti()
Grid : Message : 8.223050 s :  timer0 (HaloGatherOpt) 0.525475
Grid : Message : 8.223053 s :  timer1 (Communicate)   0.0464535
Grid : Message : 8.223055 s :  timer2 (CommsMerge )   0.0774226
Grid : Message : 8.223059 s :  timer3 (commsMergeShm) 0.137363
Grid : Message : 8.223061 s : WilsonFermion5D StencilEven Reporti()
Grid : Message : 8.223062 s : WilsonFermion5D StencilOdd  Reporti()
Grid : Message : 8.579627 s : Compare to naive wilson implementation Dag to verify correctness
Grid : Message : 8.579647 s : Called DwDag
Grid : Message : 8.579648 s : norm dag result 0.0116481
Grid : Message : 8.589324 s : norm dag ref    0.0116857
Grid : Message : 8.599907 s : norm dag diff   7.42172e-17
Grid : Message : 8.620787 s : Calling Deo and Doe and //assert Deo+Doe == Dunprec
Grid : Message : 8.663312 s : src_e0.000481094
Grid : Message : 8.672821 s : src_o0.000478626
Grid : Message : 8.681868 s : *********************************************************
Grid : Message : 8.681870 s : * Benchmarking DomainWallFermionF::DhopEO                
Grid : Message : 8.681871 s : * Vectorising space-time by 8
Grid : Message : 8.681873 s : * SINGLE precision 
Grid : Message : 8.681874 s : * Using GENERIC Nc WilsonKernels
Grid : Message : 8.681875 s : *********************************************************
Grid : Message : 10.828743 s : Deo mflop/s =   323085
Grid : Message : 10.828758 s : Deo mflop/s per rank   323085
Grid : Message : 10.828760 s : Deo mflop/s per node   323085
Grid : Message : 10.828762 s : #### Dhop calls report 
Grid : Message : 10.828763 s : WilsonFermion5D Number of DhopEO Calls   : 1001
Grid : Message : 10.828765 s : WilsonFermion5D TotalTime   /Calls        : 2144.66 us
Grid : Message : 10.828767 s : WilsonFermion5D CommTime    /Calls        : 1.67133 us
Grid : Message : 10.828769 s : WilsonFermion5D FaceTime    /Calls        : 0 us
Grid : Message : 10.828771 s : WilsonFermion5D ComputeTime1/Calls        : 2142.9 us
Grid : Message : 10.828773 s : WilsonFermion5D ComputeTime2/Calls        : 0 us
Grid : Message : 10.828776 s : Average mflops/s per call                : 328827
Grid : Message : 10.828781 s : Average mflops/s per call per rank       : 328827
Grid : Message : 10.828783 s : Average mflops/s per call per node       : 328827
Grid : Message : 10.828787 s : Average mflops/s per call (full)         : 328556
Grid : Message : 10.828789 s : Average mflops/s per call per rank (full): 328556
Grid : Message : 10.828792 s : Average mflops/s per call per node (full): 328556
Grid : Message : 10.828794 s : WilsonFermion5D Stencil
Grid : Message : 10.828799 s : WilsonFermion5D StencilEven
Grid : Message : 10.828802 s : WilsonFermion5D StencilOdd
Grid : Message : 10.828805 s :  Stencil calls 1001
Grid : Message : 10.828808 s :  Stencil halogtime 0.0629371
Grid : Message : 10.828814 s :  Stencil gathertime 0
Grid : Message : 10.828817 s :  Stencil gathermtime 0
Grid : Message : 10.828819 s :  Stencil mergetime 0.0659341
Grid : Message : 10.828822 s :  Stencil decompresstime 0.0589411
Grid : Message : 10.828824 s : WilsonFermion5D Stencil     Reporti()
Grid : Message : 10.828827 s : WilsonFermion5D StencilEven Reporti()
Grid : Message : 10.828829 s : WilsonFermion5D StencilOdd  Reporti()
Grid : Message : 10.828830 s :  timer0 (HaloGatherOpt) 1.04895
Grid : Message : 10.828833 s :  timer1 (Communicate)   0.0889111
Grid : Message : 10.828835 s :  timer2 (CommsMerge )   0.150849
Grid : Message : 10.828838 s :  timer3 (commsMergeShm) 0.293706
Grid : Message : 10.840104 s : r_e0.00576327
Grid : Message : 10.845382 s : r_o12.0372
Grid : Message : 10.850773 s : res0.011619
Grid : Message : 10.911593 s : norm diff   0
Grid : Message : 10.967415 s : norm diff even  0
Grid : Message : 10.976614 s : norm diff odd   0

and here are the results for Benchmark_dwf_fp32 --grid 16.16.16.16 --dslash-unroll:

Grid : Message : 3.841540 s : *****************************************************************
Grid : Message : 3.841541 s : * Benchmarking DomainWallFermionR::Dhop                  
Grid : Message : 3.841542 s : * Vectorising space-time by 8
Grid : Message : 3.841543 s : * VComplexF size is 64 B
Grid : Message : 3.841546 s : * SINGLE precision 
Grid : Message : 3.841547 s : * Using Nc=3       WilsonKernels
Grid : Message : 3.841548 s : *****************************************************************
Grid : Message : 3.990716 s : Called warmup
Grid : Message : 9.631144 s : Called Dw 1000 times in 5.64041e+06 us
Grid : Message : 9.631178 s : mflop/s =   245394
Grid : Message : 9.631181 s : mflop/s per rank =  245394
Grid : Message : 9.631183 s : mflop/s per node =  245394
Grid : Message : 9.631185 s : RF  GiB/s (base 2) =   498.634
Grid : Message : 9.631187 s : mem GiB/s (base 2) =   311.646
Grid : Message : 9.631881 s : norm diff   9.95688e-14
Grid : Message : 9.661613 s : #### Dhop calls report 
Grid : Message : 9.661618 s : WilsonFermion5D Number of DhopEO Calls   : 2002
Grid : Message : 9.661621 s : WilsonFermion5D TotalTime   /Calls        : 2825.97 us
Grid : Message : 9.661624 s : WilsonFermion5D CommTime    /Calls        : 0.883117 us
Grid : Message : 9.661626 s : WilsonFermion5D FaceTime    /Calls        : 0 us
Grid : Message : 9.661628 s : WilsonFermion5D ComputeTime1/Calls        : 2825.03 us
Grid : Message : 9.661630 s : WilsonFermion5D ComputeTime2/Calls        : 0 us
Grid : Message : 9.661635 s : Average mflops/s per call                : 249428
Grid : Message : 9.661637 s : Average mflops/s per call per rank       : 249428
Grid : Message : 9.661642 s : Average mflops/s per call per node       : 249428
Grid : Message : 9.661645 s : Average mflops/s per call (full)         : 249346
Grid : Message : 9.661714 s : Average mflops/s per call per rank (full): 249346
Grid : Message : 9.661718 s : Average mflops/s per call per node (full): 249346
Grid : Message : 9.661723 s : WilsonFermion5D Stencil
Grid : Message : 9.661727 s :  Stencil calls 1001
Grid : Message : 9.661730 s :  Stencil halogtime 0.0729271
Grid : Message : 9.661732 s :  Stencil gathertime 0
Grid : Message : 9.661736 s :  Stencil gathermtime 0
Grid : Message : 9.661738 s :  Stencil mergetime 0.0549451
Grid : Message : 9.661740 s :  Stencil decompresstime 0.0589411
Grid : Message : 9.661747 s : WilsonFermion5D StencilEven
Grid : Message : 9.661751 s : WilsonFermion5D StencilOdd
Grid : Message : 9.661756 s : WilsonFermion5D Stencil     Reporti()
Grid : Message : 9.661757 s :  timer0 (HaloGatherOpt) 0.566933
Grid : Message : 9.661761 s :  timer1 (Communicate)   0.045954
Grid : Message : 9.661764 s :  timer2 (CommsMerge )   0.0789211
Grid : Message : 9.661768 s :  timer3 (commsMergeShm) 0.147353
Grid : Message : 9.661770 s : WilsonFermion5D StencilEven Reporti()
Grid : Message : 9.661772 s : WilsonFermion5D StencilOdd  Reporti()
Grid : Message : 10.187450 s : Compare to naive wilson implementation Dag to verify correctness
Grid : Message : 10.187640 s : Called DwDag
Grid : Message : 10.187650 s : norm dag result 12546.2
Grid : Message : 10.284430 s : norm dag ref    12.1948
Grid : Message : 10.390220 s : norm dag diff   7.52814e-14
Grid : Message : 10.597580 s : Calling Deo and Doe and //assert Deo+Doe == Dunprec
Grid : Message : 10.102292 s : src_e0.502726
Grid : Message : 10.111715 s : src_o0.509216
Grid : Message : 10.120448 s : *********************************************************
Grid : Message : 10.120450 s : * Benchmarking DomainWallFermionF::DhopEO                
Grid : Message : 10.120451 s : * Vectorising space-time by 8
Grid : Message : 10.120453 s : * SINGLE precision 
Grid : Message : 10.120454 s : * Using Nc=3       WilsonKernels
Grid : Message : 10.120455 s : *********************************************************
Grid : Message : 12.975335 s : Deo mflop/s =   242917
Grid : Message : 12.975349 s : Deo mflop/s per rank   242917
Grid : Message : 12.975351 s : Deo mflop/s per node   242917
Grid : Message : 12.975353 s : #### Dhop calls report 
Grid : Message : 12.975354 s : WilsonFermion5D Number of DhopEO Calls   : 1001
Grid : Message : 12.975356 s : WilsonFermion5D TotalTime   /Calls        : 2851.97 us
Grid : Message : 12.975358 s : WilsonFermion5D CommTime    /Calls        : 1.74825 us
Grid : Message : 12.975360 s : WilsonFermion5D FaceTime    /Calls        : 0 us
Grid : Message : 12.975362 s : WilsonFermion5D ComputeTime1/Calls        : 2850.14 us
Grid : Message : 12.975364 s : WilsonFermion5D ComputeTime2/Calls        : 0 us
Grid : Message : 12.975367 s : Average mflops/s per call                : 247231
Grid : Message : 12.975371 s : Average mflops/s per call per rank       : 247231
Grid : Message : 12.975374 s : Average mflops/s per call per node       : 247231
Grid : Message : 12.975377 s : Average mflops/s per call (full)         : 247072
Grid : Message : 12.975379 s : Average mflops/s per call per rank (full): 247072
Grid : Message : 12.975381 s : Average mflops/s per call per node (full): 247072
Grid : Message : 12.975384 s : WilsonFermion5D Stencil
Grid : Message : 12.975386 s : WilsonFermion5D StencilEven
Grid : Message : 12.975391 s : WilsonFermion5D StencilOdd
Grid : Message : 12.975394 s :  Stencil calls 1001
Grid : Message : 12.975401 s :  Stencil halogtime 0.0559441
Grid : Message : 12.975405 s :  Stencil gathertime 0
Grid : Message : 12.975408 s :  Stencil gathermtime 0
Grid : Message : 12.975410 s :  Stencil mergetime 0.0549451
Grid : Message : 12.975414 s :  Stencil decompresstime 0.0659341
Grid : Message : 12.975416 s : WilsonFermion5D Stencil     Reporti()
Grid : Message : 12.975417 s : WilsonFermion5D StencilEven Reporti()
Grid : Message : 12.975422 s : WilsonFermion5D StencilOdd  Reporti()
Grid : Message : 12.975425 s :  timer0 (HaloGatherOpt) 1.13487
Grid : Message : 12.975428 s :  timer1 (Communicate)   0.0879121
Grid : Message : 12.975432 s :  timer2 (CommsMerge )   0.157842
Grid : Message : 12.975435 s :  timer3 (commsMergeShm) 0.280719
Grid : Message : 12.989486 s : r_e12547.9
Grid : Message : 12.994729 s : r_o6.10374
Grid : Message : 13.188000 s : res1578.28
Grid : Message : 13.610200 s : norm diff   0
Grid : Message : 13.116742 s : norm diff even  0
Grid : Message : 13.126443 s : norm diff odd   0

philomat (Author) commented:

I just ran Test_wilson_clover on Summit and the test ran without any errors.

paboyle (Owner) commented Mar 18, 2021

Thanks.

My hypothesis that --dslash-unroll might fix the performance issues is not correct, then.

Glad to hear it re. Clover - it's a HIP/CUDA difference, and not general breakage of Clover.

More joy to look forward to...

If you were able to track down which accelerator_for / line of code fails with Clover, that would help.

philomat (Author) commented:

The error occurs in the constructor of WilsonCloverFermion. To be more precise, it is happening in ImportGauge in WilsonCloverFermion.h on line 109.

paboyle (Owner) commented Mar 20, 2021

Could you either

A) run it under a debugger (gdb), trap the fault, and ask it for a backtrace with "bt",

OR

B) go to

Grid/qcd/action/fermion/implementation/WilsonCloverFermionImplementation.h

  1. uncomment the cout << statements, and
  2. stick a print statement in at lines 73, 81, 91, 127, and 140.

A) is not guaranteed to work because I don't know how the GPU runtime is operating, but it is significantly less effort if you know how to use a debugger, so it is what I would try first. I think there is a rocm-gdb or rocm-lldb tool.
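
A minimal sketch of option (A), assuming the ROCm debugger is installed as rocgdb (the comment above only says "rocm-gdb or rocm-lldb", so the binary name on this system may differ); plain gdb may also be enough to catch the host-side abort:

rocgdb ./Test_wilson_clover
# (gdb) run
# ... wait for the "Memory access fault by GPU node-2 ..." abort ...
# (gdb) bt        # the backtrace should show which accelerator_for was in flight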

paboyle (Owner) commented Mar 20, 2021

Though on the AMD node I had access to, the ROCm debugger didn't work for me.

philomat (Author) commented Mar 23, 2021 via email

paboyle (Owner) commented Mar 23, 2021

That was enough to go on for me to eyeball at least one error.

paboyle (Owner) commented Mar 23, 2021

More later - I'll try and patch develop.

paboyle (Owner) commented Mar 26, 2021

Sorry - reviewed again and the code looks right. Darn it...

jdmaia (Contributor) commented Jul 19, 2021

Hi guys,

I just saw this. I have been working on Grid for some weeks now, and it seems the Wilson clover implementation exceeds the maximum limit of local memory per thread (128k for now). That could explain the runtime error (more recent ROCm releases have an assertion against that, which makes the code fail to compile).

philomat (Author) commented:

Hi,

I just tried to compile Grid on a new AMD GPU (MI100) machine at JLab. Unfortunately, I get errors during compilation:

error: stack size limit exceeded (131088) in _ZN4Grid11LambdaApplyIZNS_3adjINS_7iScalarINS_7iMatrixINS3_INS_9Grid_simdIN6thrust7complexIdEENS_9GpuVectorILi4ENS_10GpuComplexI15HIP_vector_typeIdLj2EEEEEEEELi8EEELi4EEEEEEENS_7LatticeIT_EERKSK_EUlmmmE_EEvmmmSJ_
error: stack size limit exceeded (131552) in _ZN4Grid11LambdaApplyIZNS_12outerProductINS_7iScalarINS_7iVectorINS3_INS_9Grid_simdIN6thrust7complexIdEENS_9GpuVectorILi4ENS_10GpuComplexI15HIP_vector_typeIdLj2EEEEEEEELi8EEELi4EEEEESH_EENS_7LatticeIDTcl12outerProductcvT__EcvT0__EEEEERKNSI_ISJ_EERKNSI_ISK_EEEUlmmmE_EEvmmmSJ_
2 errors generated when compiling for gfx906.

My configure command:
../configure --enable-unified=no --enable-shm=no --enable-accelerator=hip --enable-comms=mpi3-auto --enable-simd=GPU --enable-gen-simd-width=64 CXX=/opt/rocm-4.3.0/bin/hipcc MPICXX=mpicxx CXXFLAGS="-fPIC -I/opt/rocm-4.3.0/ -std=c++14"

Any ideas how to solve this?

chulwoo1 (Collaborator) commented Aug 19, 2021 via email

paboyle (Owner) commented Aug 21, 2021 via email

philomat (Author) commented:

Hi Peter,

this is the complete output of make:

[scior@qcdi2001 Grid]$ make
cp version-cache Version.h
make all-am
make[1]: Entering directory '/u/home/scior/Grid/build/Grid'
cp version-cache Version.h
CXX util/version.o
CXX qcd/action/fermion/instantiation/WilsonAdjImplD/WilsonCloverFermionInstantiationWilsonAdjImplD.o
error: stack size limit exceeded (131088) in _ZN4Grid11LambdaApplyIZNS_3adjINS_7iScalarINS_7iMatrixINS3_INS_9Grid_simdIN6thrust7complexIdEENS_9GpuVectorILi4ENS_10GpuComplexI15HIP_vector_typeIdLj2EEEEEEEELi8EEELi4EEEEEEENS_7LatticeIT_EERKSK_EUlmmmE_EEvmmmSJ_
error: stack size limit exceeded (131552) in _ZN4Grid11LambdaApplyIZNS_12outerProductINS_7iScalarINS_7iVectorINS3_INS_9Grid_simdIN6thrust7complexIdEENS_9GpuVectorILi4ENS_10GpuComplexI15HIP_vector_typeIdLj2EEEEEEEELi8EEELi4EEEEESH_EENS_7LatticeIDTcl12outerProductcvT__EcvT0__EEEEERKNSI_ISJ_EERKNSI_ISK_EEEUlmmmE_EEvmmmSJ_
2 errors generated when compiling for gfx906.
make[1]: *** [Makefile:3468: qcd/action/fermion/instantiation/WilsonAdjImplD/WilsonCloverFermionInstantiationWilsonAdjImplD.o] Error 1
make[1]: Leaving directory '/u/home/scior/Grid/build/Grid'
make: *** [Makefile:2490: all] Error 2
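
One way to see how close each kernel in that translation unit gets to the limit - a sketch rather than something from this thread; it assumes automake's V=1 verbose mode and a ROCm clang new enough to support the kernel-resource-usage remark:

# print the full hipcc command for the failing object instead of the terse "CXX ..." line
make V=1 qcd/action/fermion/instantiation/WilsonAdjImplD/WilsonCloverFermionInstantiationWilsonAdjImplD.o
# re-run that hipcc line by hand with the extra option
#   -Rpass-analysis=kernel-resource-usage
# and look at the reported ScratchSize values, which correspond to the per-thread
# stack usage that the "stack size limit exceeded" errors refer to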

philomat (Author) commented:

Any progress on this issue? I pulled Grid a couple of days ago and still get the same error.

jdmaia (Contributor) commented Oct 12, 2021

@philomat For now I'm avoiding this problem by conditionally compiling the problematic operators, which seems to be fine for building the main benchmark binary (Benchmark_ITT). I still need to take a look at the code and see whether we can reduce the amount of local data allocated per thread, or place it somewhere else, to avoid hitting this issue.

paboyle (Owner) commented Nov 25, 2021

I've run on Spock and it is doing well on Benchmark_ITT and Benchmark_dwf_fp32.

I've added the systems/Spock directory with compile and run scripts.
I'm getting 1.3 TF/s on an MI100.

paboyle (Owner) commented Nov 25, 2021

I also get 4 TF/s on a whole Spock node (4x MI100).

james-simone commented:

I'm also hitting the "stack frame size exceeds limit" error.
commit: HEAD detached at 135808d
Ubuntu 20.04 container
rocm-5.0.0/clang/14.0.0
CXX=hipcc CXXFLAGS=" -std=c++14 -I/opt/rocm/rocthrust/include -I/usr/local/openmpi/include -I/usr/local/fftw/include -I/usr/local/hdf5/include -I/usr/local/scidac/include " LDFLAGS=" -L/opt/rocm/rocthrust/lib -L/usr/local/openmpi/lib -L/usr/local/fftw/lib -L/usr/local/hdf5/lib -L/usr/local/scidac/lib " LIBS="-lmpi" MPICXX=mpicxx /var/tmp/Grid/configure --prefix=/usr/local/grid --enable-accelerator=hip --enable-comms=mpi3-auto --enable-gen-simd-width=64 --enable-numa --enable-openmp --enable-simd=GPU --enable-unified=no

atamazov commented:

There are some hardware-related limitations on the stack frame size on AMD GPUs. You need to reduce the usage of private memory in the kernels.

Note that gfx10 GPUs can use twice as much private memory as gfx9 because of the narrower wave size (32 vs 64 lanes).

Details can be found here: llvm/llvm-project@1ed4caf
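
As a quick sanity check on the numbers reported earlier in the thread (just arithmetic, matching the 128k-per-thread figure mentioned above):

echo $((128 * 1024))        # 131072-byte per-thread limit
echo $((131088 - 131072))   # first failing kernel is 16 bytes over
echo $((131552 - 131072))   # second failing kernel is 480 bytes over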
