
Compiling Grid for AMD GPUS #343
Open · philomat opened this issue Mar 2, 2021 · 31 comments

philomat commented Mar 2, 2021

I know the wiki says there is currently no support for AMD GPUs. But I saw commits concerning HIP. Is there a way one could try experimenting with Grid on AMD GPUs?

paboyle (Owner) commented Mar 4, 2021

Yes - HIP is believed to be working, but not yet efficient, on AMD GPUs.
You might try Benchmark_dwf_fp32 with the --dslash-unroll flag; it is new as of a few days ago.

paboyle (Owner) commented Mar 4, 2021

The status of multi-GPU and of the "NVLink" equivalent is untested. Using --enable-shm=none and MPI between GPUs is probably safer.
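
For illustration, a minimal configure sketch along those lines - the flags are the ones that appear elsewhere in this thread, while the compiler names and the MPI launch are placeholders rather than a tested recipe:

../configure --enable-accelerator=hip \
             --enable-simd=GPU --enable-gen-simd-width=64 \
             --enable-comms=mpi3-auto \
             --enable-shm=none \
             --enable-unified=no \
             CXX=hipcc MPICXX=mpicxx
# then launch one MPI rank per GPU (e.g. mpirun -np 4 on a four-GPU node),
# so inter-GPU traffic goes through MPI rather than the shared-memory path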

paboyle (Owner) commented Mar 11, 2021

BTW, I have benchmarked AMD MI50 and MI100, but want to revisit with the new explicit Nc=3 kernel.

I have also compiled under HIP on Summit for Nvidia and got the same performance as the CUDA compile.

philomat (Author) commented:

I was able to compile Grid and also to run the benchmark you suggested. However, some of the tests are failing. For example, Test_wilson_clover fails with:

Grid : Message : MemoryManager::Init() Using hipMalloc
Grid : Message : 0.335145 s : Grid is setup to use 1 threads
Grid : Message : 0.335153 s : Grid floating point word size is REALF4
Grid : Message : 0.335154 s : Grid floating point word size is REALD8
Grid : Message : 0.335155 s : Grid floating point word size is REAL8
Memory access fault by GPU node-2 (Agent handle: 0x1f04be0) on address 0x7f3e57b92000. Reason: Page not present or supervisor privilege.
[qcd20g01:2774949] *** Process received signal ***
[qcd20g01:2774949] Signal: Aborted (6)
[qcd20g01:2774949] Signal code:  (-6)
[qcd20g01:2774949] [ 0] /lib64/libpthread.so.0(+0x12dd0)[0x7f3d7c01bdd0]
[qcd20g01:2774949] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7f3d7a6c170f]
[qcd20g01:2774949] [ 2] /lib64/libc.so.6(abort+0x127)[0x7f3d7a6abb25]
[qcd20g01:2774949] [ 3] /opt/rocm-3.9.0/lib/libhsa-runtime64.so.1(+0x1bd2b)[0x7f3d796d2d2b]
[qcd20g01:2774949] [ 4] /opt/rocm-3.9.0/lib/libhsa-runtime64.so.1(+0x61f4d)[0x7f3d79718f4d]
[qcd20g01:2774949] [ 5] /opt/rocm-3.9.0/lib/libhsa-runtime64.so.1(+0x1fd97)[0x7f3d796d6d97]
[qcd20g01:2774949] [ 6] /lib64/libpthread.so.0(+0x82de)[0x7f3d7c0112de]
[qcd20g01:2774949] [ 7] /lib64/libc.so.6(clone+0x43)[0x7f3d7a785e83]
[qcd20g01:2774949] *** End of error message ***
Aborted (core dumped) 

Or Test_nersc_io fails because the plaquette is not correctly reproduced:

Grid : Message : 2.770978 s : NERSC Configuration ./ckpoint_lat.4000 checksum 5c0ac22a header   5c0ac22a
Grid : Message : 2.770987 s : NERSC Configuration ./ckpoint_lat.4000 plaquette 9.83463e-05 header    0.0507244
Grid : Message : 2.771006 s : NERSC Configuration ./ckpoint_lat.4000 link_trace 0.240144 header    0.000115627
 Plaquette mismatch 
Test_nersc_io: /home/scior/Grid/Grid/parallelIO/NerscIO.h:201: static void Grid::NerscIO::readConfiguration(Grid::NerscIO::GaugeField &, Grid::FieldMetaData &, std::string, GaugeStats) [GaugeStats = Grid::GaugeStatistics<Grid::PeriodicGaugeImpl<Grid::GaugeImplTypes<Grid::Grid_simd<thrust::complex<double>, Grid::GpuVector<4, Grid::GpuComplex<HIP_vector_type<double, 2>>>>, 3, 12>>>]: Assertion `fabs(clone.plaquette -header.plaquette ) < 1.0e-5' failed.
[qcd20g01:2775537] *** Process received signal ***
[qcd20g01:2775537] Signal: Aborted (6)
[qcd20g01:2775537] Signal code:  (-6)
[qcd20g01:2775537] [ 0] /lib64/libpthread.so.0(+0x12dd0)[0x7f68a70c8dd0]
[qcd20g01:2775537] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7f68a576e70f]
[qcd20g01:2775537] [ 2] /lib64/libc.so.6(abort+0x127)[0x7f68a5758b25]
[qcd20g01:2775537] [ 3] /lib64/libc.so.6(+0x219f9)[0x7f68a57589f9]
[qcd20g01:2775537] [ 4] /lib64/libc.so.6(+0x2fcc6)[0x7f68a5766cc6]
[qcd20g01:2775537] [ 5] ./Test_nersc_io[0x40ef28]
[qcd20g01:2775537] [ 6] ./Test_nersc_io[0x40736a]
[qcd20g01:2775537] [ 7] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f68a575a6a3]
[qcd20g01:2775537] [ 8] ./Test_nersc_io[0x40602e]
[qcd20g01:2775537] *** End of error message ***
Aborted (core dumped)

Other tests, e.g. Test_wilson_even_odd, seem to work fine.

The configure command I used is:

../configure --enable-unified=no --enable-shm=no --enable-accelerator=hip --enable-comms=mpi3-auto --enable-simd=GPU --enable-gen-simd-width=64 CXX=/opt/rocm-3.9.0/bin/hipcc MPICXX=mpicxx CXXFLAGS="-fPIC -I/opt/rocm-3.9.0/ -I/home/scior/Thrust/" --with-lime=../depencencies/lime

paboyle (Owner) commented Mar 18, 2021

Thanks - I haven't tried WilsonClover on GPU, to be honest, so I am not absolutely sure it works on Nvidia either.

Re. the plaquette - this does work on CUDA, so it is something interesting to look at on HIP...
Where are you running this?

paboyle (Owner) commented Mar 18, 2021

HIP is definitely in the "experimental" category for now, but getting everything to work would be good.
Glad to see you are running on ROCm 3.9, which is recent/up to date.

philomat (Author) commented:

I am running on a machine at JLab.

paboyle (Owner) commented Mar 18, 2021

I should have asked specifically what hardware you are running on, rather than where it is physically located.

philomat (Author) commented:

It's a machine equipped with four Vega 20 cards and an AMD EPYC CPU.
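
For reference, a quick way to confirm what the ROCm stack sees on such a node (a sketch assuming the standard ROCm tools in their default location; Vega 20 reports as gfx906, which matches the compile target that shows up later in this thread):

/opt/rocm/bin/rocminfo | grep -i 'name:.*gfx'   # GPU agents, e.g. gfx906 for Vega 20
/opt/rocm/bin/rocm-smi                          # quick per-GPU temperature/VRAM/usage overview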

paboyle (Owner) commented Mar 18, 2021

Can you tell me the performance you get with

benchmarks/Benchmark_dwf_fp32 --grid 16.16.16.16

and

benchmarks/Benchmark_dwf_fp32 --grid 16.16.16.16 --dslash-unroll

Thanks

philomat (Author) commented:

Here are the results for Benchmark_dwf_fp32 --grid 16.16.16.16:

Grid : Message : 3.846120 s : *****************************************************************
Grid : Message : 3.846121 s : * Benchmarking DomainWallFermionR::Dhop                  
Grid : Message : 3.846123 s : * Vectorising space-time by 8
Grid : Message : 3.846124 s : * VComplexF size is 64 B
Grid : Message : 3.846126 s : * SINGLE precision 
Grid : Message : 3.846127 s : * Using GENERIC Nc WilsonKernels
Grid : Message : 3.846128 s : *****************************************************************
Grid : Message : 3.989939 s : Called warmup
Grid : Message : 8.192125 s : Called Dw 1000 times in 4.20216e+06 us
Grid : Message : 8.192172 s : mflop/s =   329383
Grid : Message : 8.192175 s : mflop/s per rank =  329383
Grid : Message : 8.192177 s : mflop/s per node =  329383
Grid : Message : 8.192179 s : RF  GiB/s (base 2) =   669.298
Grid : Message : 8.192181 s : mem GiB/s (base 2) =   418.311
Grid : Message : 8.192877 s : norm diff   1.08494e-16
Grid : Message : 8.222942 s : #### Dhop calls report 
Grid : Message : 8.222945 s : WilsonFermion5D Number of DhopEO Calls   : 2002
Grid : Message : 8.222948 s : WilsonFermion5D TotalTime   /Calls        : 2106.55 us
Grid : Message : 8.222950 s : WilsonFermion5D CommTime    /Calls        : 0.838162 us
Grid : Message : 8.222952 s : WilsonFermion5D FaceTime    /Calls        : 0 us
Grid : Message : 8.222954 s : WilsonFermion5D ComputeTime1/Calls        : 2105.66 us
Grid : Message : 8.222956 s : WilsonFermion5D ComputeTime2/Calls        : 0 us
Grid : Message : 8.222959 s : Average mflops/s per call                : 334643
Grid : Message : 8.222961 s : Average mflops/s per call per rank       : 334643
Grid : Message : 8.222963 s : Average mflops/s per call per node       : 334643
Grid : Message : 8.222971 s : Average mflops/s per call (full)         : 334501
Grid : Message : 8.222973 s : Average mflops/s per call per rank (full): 334501
Grid : Message : 8.222975 s : Average mflops/s per call per node (full): 334501
Grid : Message : 8.222977 s : WilsonFermion5D Stencil
Grid : Message : 8.222980 s :  Stencil calls 1001
Grid : Message : 8.222989 s :  Stencil halogtime 0.0679321
Grid : Message : 8.222991 s :  Stencil gathertime 0
Grid : Message : 8.222993 s :  Stencil gathermtime 0
Grid : Message : 8.222997 s :  Stencil mergetime 0.0649351
Grid : Message : 8.223033 s :  Stencil decompresstime 0.0759241
Grid : Message : 8.223040 s : WilsonFermion5D StencilEven
Grid : Message : 8.223046 s : WilsonFermion5D StencilOdd
Grid : Message : 8.223049 s : WilsonFermion5D Stencil     Reporti()
Grid : Message : 8.223050 s :  timer0 (HaloGatherOpt) 0.525475
Grid : Message : 8.223053 s :  timer1 (Communicate)   0.0464535
Grid : Message : 8.223055 s :  timer2 (CommsMerge )   0.0774226
Grid : Message : 8.223059 s :  timer3 (commsMergeShm) 0.137363
Grid : Message : 8.223061 s : WilsonFermion5D StencilEven Reporti()
Grid : Message : 8.223062 s : WilsonFermion5D StencilOdd  Reporti()
Grid : Message : 8.579627 s : Compare to naive wilson implementation Dag to verify correctness
Grid : Message : 8.579647 s : Called DwDag
Grid : Message : 8.579648 s : norm dag result 0.0116481
Grid : Message : 8.589324 s : norm dag ref    0.0116857
Grid : Message : 8.599907 s : norm dag diff   7.42172e-17
Grid : Message : 8.620787 s : Calling Deo and Doe and //assert Deo+Doe == Dunprec
Grid : Message : 8.663312 s : src_e0.000481094
Grid : Message : 8.672821 s : src_o0.000478626
Grid : Message : 8.681868 s : *********************************************************
Grid : Message : 8.681870 s : * Benchmarking DomainWallFermionF::DhopEO                
Grid : Message : 8.681871 s : * Vectorising space-time by 8
Grid : Message : 8.681873 s : * SINGLE precision 
Grid : Message : 8.681874 s : * Using GENERIC Nc WilsonKernels
Grid : Message : 8.681875 s : *********************************************************
Grid : Message : 10.828743 s : Deo mflop/s =   323085
Grid : Message : 10.828758 s : Deo mflop/s per rank   323085
Grid : Message : 10.828760 s : Deo mflop/s per node   323085
Grid : Message : 10.828762 s : #### Dhop calls report 
Grid : Message : 10.828763 s : WilsonFermion5D Number of DhopEO Calls   : 1001
Grid : Message : 10.828765 s : WilsonFermion5D TotalTime   /Calls        : 2144.66 us
Grid : Message : 10.828767 s : WilsonFermion5D CommTime    /Calls        : 1.67133 us
Grid : Message : 10.828769 s : WilsonFermion5D FaceTime    /Calls        : 0 us
Grid : Message : 10.828771 s : WilsonFermion5D ComputeTime1/Calls        : 2142.9 us
Grid : Message : 10.828773 s : WilsonFermion5D ComputeTime2/Calls        : 0 us
Grid : Message : 10.828776 s : Average mflops/s per call                : 328827
Grid : Message : 10.828781 s : Average mflops/s per call per rank       : 328827
Grid : Message : 10.828783 s : Average mflops/s per call per node       : 328827
Grid : Message : 10.828787 s : Average mflops/s per call (full)         : 328556
Grid : Message : 10.828789 s : Average mflops/s per call per rank (full): 328556
Grid : Message : 10.828792 s : Average mflops/s per call per node (full): 328556
Grid : Message : 10.828794 s : WilsonFermion5D Stencil
Grid : Message : 10.828799 s : WilsonFermion5D StencilEven
Grid : Message : 10.828802 s : WilsonFermion5D StencilOdd
Grid : Message : 10.828805 s :  Stencil calls 1001
Grid : Message : 10.828808 s :  Stencil halogtime 0.0629371
Grid : Message : 10.828814 s :  Stencil gathertime 0
Grid : Message : 10.828817 s :  Stencil gathermtime 0
Grid : Message : 10.828819 s :  Stencil mergetime 0.0659341
Grid : Message : 10.828822 s :  Stencil decompresstime 0.0589411
Grid : Message : 10.828824 s : WilsonFermion5D Stencil     Reporti()
Grid : Message : 10.828827 s : WilsonFermion5D StencilEven Reporti()
Grid : Message : 10.828829 s : WilsonFermion5D StencilOdd  Reporti()
Grid : Message : 10.828830 s :  timer0 (HaloGatherOpt) 1.04895
Grid : Message : 10.828833 s :  timer1 (Communicate)   0.0889111
Grid : Message : 10.828835 s :  timer2 (CommsMerge )   0.150849
Grid : Message : 10.828838 s :  timer3 (commsMergeShm) 0.293706
Grid : Message : 10.840104 s : r_e0.00576327
Grid : Message : 10.845382 s : r_o12.0372
Grid : Message : 10.850773 s : res0.011619
Grid : Message : 10.911593 s : norm diff   0
Grid : Message : 10.967415 s : norm diff even  0
Grid : Message : 10.976614 s : norm diff odd   0

and here are the results for Benchmark_dwf_fp32 --grid 16.16.16.16 --dslash-unroll:

Grid : Message : 3.841540 s : *****************************************************************
Grid : Message : 3.841541 s : * Benchmarking DomainWallFermionR::Dhop                  
Grid : Message : 3.841542 s : * Vectorising space-time by 8
Grid : Message : 3.841543 s : * VComplexF size is 64 B
Grid : Message : 3.841546 s : * SINGLE precision 
Grid : Message : 3.841547 s : * Using Nc=3       WilsonKernels
Grid : Message : 3.841548 s : *****************************************************************
Grid : Message : 3.990716 s : Called warmup
Grid : Message : 9.631144 s : Called Dw 1000 times in 5.64041e+06 us
Grid : Message : 9.631178 s : mflop/s =   245394
Grid : Message : 9.631181 s : mflop/s per rank =  245394
Grid : Message : 9.631183 s : mflop/s per node =  245394
Grid : Message : 9.631185 s : RF  GiB/s (base 2) =   498.634
Grid : Message : 9.631187 s : mem GiB/s (base 2) =   311.646
Grid : Message : 9.631881 s : norm diff   9.95688e-14
Grid : Message : 9.661613 s : #### Dhop calls report 
Grid : Message : 9.661618 s : WilsonFermion5D Number of DhopEO Calls   : 2002
Grid : Message : 9.661621 s : WilsonFermion5D TotalTime   /Calls        : 2825.97 us
Grid : Message : 9.661624 s : WilsonFermion5D CommTime    /Calls        : 0.883117 us
Grid : Message : 9.661626 s : WilsonFermion5D FaceTime    /Calls        : 0 us
Grid : Message : 9.661628 s : WilsonFermion5D ComputeTime1/Calls        : 2825.03 us
Grid : Message : 9.661630 s : WilsonFermion5D ComputeTime2/Calls        : 0 us
Grid : Message : 9.661635 s : Average mflops/s per call                : 249428
Grid : Message : 9.661637 s : Average mflops/s per call per rank       : 249428
Grid : Message : 9.661642 s : Average mflops/s per call per node       : 249428
Grid : Message : 9.661645 s : Average mflops/s per call (full)         : 249346
Grid : Message : 9.661714 s : Average mflops/s per call per rank (full): 249346
Grid : Message : 9.661718 s : Average mflops/s per call per node (full): 249346
Grid : Message : 9.661723 s : WilsonFermion5D Stencil
Grid : Message : 9.661727 s :  Stencil calls 1001
Grid : Message : 9.661730 s :  Stencil halogtime 0.0729271
Grid : Message : 9.661732 s :  Stencil gathertime 0
Grid : Message : 9.661736 s :  Stencil gathermtime 0
Grid : Message : 9.661738 s :  Stencil mergetime 0.0549451
Grid : Message : 9.661740 s :  Stencil decompresstime 0.0589411
Grid : Message : 9.661747 s : WilsonFermion5D StencilEven
Grid : Message : 9.661751 s : WilsonFermion5D StencilOdd
Grid : Message : 9.661756 s : WilsonFermion5D Stencil     Reporti()
Grid : Message : 9.661757 s :  timer0 (HaloGatherOpt) 0.566933
Grid : Message : 9.661761 s :  timer1 (Communicate)   0.045954
Grid : Message : 9.661764 s :  timer2 (CommsMerge )   0.0789211
Grid : Message : 9.661768 s :  timer3 (commsMergeShm) 0.147353
Grid : Message : 9.661770 s : WilsonFermion5D StencilEven Reporti()
Grid : Message : 9.661772 s : WilsonFermion5D StencilOdd  Reporti()
Grid : Message : 10.187450 s : Compare to naive wilson implementation Dag to verify correctness
Grid : Message : 10.187640 s : Called DwDag
Grid : Message : 10.187650 s : norm dag result 12546.2
Grid : Message : 10.284430 s : norm dag ref    12.1948
Grid : Message : 10.390220 s : norm dag diff   7.52814e-14
Grid : Message : 10.597580 s : Calling Deo and Doe and //assert Deo+Doe == Dunprec
Grid : Message : 10.102292 s : src_e0.502726
Grid : Message : 10.111715 s : src_o0.509216
Grid : Message : 10.120448 s : *********************************************************
Grid : Message : 10.120450 s : * Benchmarking DomainWallFermionF::DhopEO                
Grid : Message : 10.120451 s : * Vectorising space-time by 8
Grid : Message : 10.120453 s : * SINGLE precision 
Grid : Message : 10.120454 s : * Using Nc=3       WilsonKernels
Grid : Message : 10.120455 s : *********************************************************
Grid : Message : 12.975335 s : Deo mflop/s =   242917
Grid : Message : 12.975349 s : Deo mflop/s per rank   242917
Grid : Message : 12.975351 s : Deo mflop/s per node   242917
Grid : Message : 12.975353 s : #### Dhop calls report 
Grid : Message : 12.975354 s : WilsonFermion5D Number of DhopEO Calls   : 1001
Grid : Message : 12.975356 s : WilsonFermion5D TotalTime   /Calls        : 2851.97 us
Grid : Message : 12.975358 s : WilsonFermion5D CommTime    /Calls        : 1.74825 us
Grid : Message : 12.975360 s : WilsonFermion5D FaceTime    /Calls        : 0 us
Grid : Message : 12.975362 s : WilsonFermion5D ComputeTime1/Calls        : 2850.14 us
Grid : Message : 12.975364 s : WilsonFermion5D ComputeTime2/Calls        : 0 us
Grid : Message : 12.975367 s : Average mflops/s per call                : 247231
Grid : Message : 12.975371 s : Average mflops/s per call per rank       : 247231
Grid : Message : 12.975374 s : Average mflops/s per call per node       : 247231
Grid : Message : 12.975377 s : Average mflops/s per call (full)         : 247072
Grid : Message : 12.975379 s : Average mflops/s per call per rank (full): 247072
Grid : Message : 12.975381 s : Average mflops/s per call per node (full): 247072
Grid : Message : 12.975384 s : WilsonFermion5D Stencil
Grid : Message : 12.975386 s : WilsonFermion5D StencilEven
Grid : Message : 12.975391 s : WilsonFermion5D StencilOdd
Grid : Message : 12.975394 s :  Stencil calls 1001
Grid : Message : 12.975401 s :  Stencil halogtime 0.0559441
Grid : Message : 12.975405 s :  Stencil gathertime 0
Grid : Message : 12.975408 s :  Stencil gathermtime 0
Grid : Message : 12.975410 s :  Stencil mergetime 0.0549451
Grid : Message : 12.975414 s :  Stencil decompresstime 0.0659341
Grid : Message : 12.975416 s : WilsonFermion5D Stencil     Reporti()
Grid : Message : 12.975417 s : WilsonFermion5D StencilEven Reporti()
Grid : Message : 12.975422 s : WilsonFermion5D StencilOdd  Reporti()
Grid : Message : 12.975425 s :  timer0 (HaloGatherOpt) 1.13487
Grid : Message : 12.975428 s :  timer1 (Communicate)   0.0879121
Grid : Message : 12.975432 s :  timer2 (CommsMerge )   0.157842
Grid : Message : 12.975435 s :  timer3 (commsMergeShm) 0.280719
Grid : Message : 12.989486 s : r_e12547.9
Grid : Message : 12.994729 s : r_o6.10374
Grid : Message : 13.188000 s : res1578.28
Grid : Message : 13.610200 s : norm diff   0
Grid : Message : 13.116742 s : norm diff even  0
Grid : Message : 13.126443 s : norm diff odd   0

philomat (Author) commented:

I just ran Test_wilson_clover on Summit and the test ran without any errors.

paboyle (Owner) commented Mar 18, 2021

Thanks.

My hypothesis that --dslash-unroll might fix the performance issues is not correct, then.

Glad to hear it re. Clover - it's a HIP/CUDA difference, and not general breakage of Clover.

More joy to look forward to...

If you were able to track down which accelerator_for / line of code fails with Clover, that would help.

philomat (Author) commented:

The error occurs in the constructor of WilsonCloverFermion. To be more precise, it is happening in ImportGauge in WilsonCloverFermion.h on line 109.

paboyle (Owner) commented Mar 20, 2021

Could you either

A) run it under a debugger (gdb), trap the fault, and ask it for a backtrace with "bt",

OR

B) go to

Grid/qcd/action/fermion/implementation/WilsonCloverFermionImplementation.h

  1. uncomment the cout << statements, and
  2. stick a print statement in at lines 73, 81, 91, 127, and 140.

A) is not guaranteed to work because I don't know how the GPU runtime is operating, but it is significantly less effort if you know how to use a debugger, so it is what I would try first. I think there is a rocm-gdb or rocm-lldb tool.
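
A minimal sketch of option (A), assuming the ROCm debugger is installed as rocgdb (the comment above only says "rocm-gdb or rocm-lldb", so the binary name on this system may differ); plain gdb may also be enough to catch the host-side abort:

rocgdb ./Test_wilson_clover
# (gdb) run
# ... wait for the "Memory access fault by GPU node-2 ..." abort ...
# (gdb) bt        # the backtrace should show which accelerator_for was in flight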

paboyle (Owner) commented Mar 20, 2021

Though on the AMD node I had access to, the ROCm debugger didn't work for me.

philomat (Author) commented Mar 23, 2021 via email

paboyle (Owner) commented Mar 23, 2021

That was enough to go on for me to eyeball at least one error.

paboyle (Owner) commented Mar 23, 2021

More later - I'll try and patch develop.

paboyle (Owner) commented Mar 26, 2021

Sorry - reviewed again and the code looks right. Darn it...

jdmaia (Contributor) commented Jul 19, 2021

Hi guys,

I just saw this. I have been working on Grid for some weeks now, and it seems the Wilson clover implementation exceeds the maximum limit of local memory per thread (128k for now). That could explain the runtime error (more recent ROCm releases have an assertion against that, which makes the code fail to compile).

philomat (Author) commented:

Hi,

I just tried to compile Grid on a new AMD GPU (MI100) machine at JLab. Unfortunately, I get errors during compilation:

error: stack size limit exceeded (131088) in _ZN4Grid11LambdaApplyIZNS_3adjINS_7iScalarINS_7iMatrixINS3_INS_9Grid_simdIN6thrust7complexIdEENS_9GpuVectorILi4ENS_10GpuComplexI15HIP_vector_typeIdLj2EEEEEEEELi8EEELi4EEEEEEENS_7LatticeIT_EERKSK_EUlmmmE_EEvmmmSJ_
error: stack size limit exceeded (131552) in _ZN4Grid11LambdaApplyIZNS_12outerProductINS_7iScalarINS_7iVectorINS3_INS_9Grid_simdIN6thrust7complexIdEENS_9GpuVectorILi4ENS_10GpuComplexI15HIP_vector_typeIdLj2EEEEEEEELi8EEELi4EEEEESH_EENS_7LatticeIDTcl12outerProductcvT__EcvT0__EEEEERKNSI_ISJ_EERKNSI_ISK_EEEUlmmmE_EEvmmmSJ_
2 errors generated when compiling for gfx906.

My configure command:
../configure --enable-unified=no --enable-shm=no --enable-accelerator=hip --enable-comms=mpi3-auto --enable-simd=GPU --enable-gen-simd-width=64 CXX=/opt/rocm-4.3.0/bin/hipcc MPICXX=mpicxx CXXFLAGS="-fPIC -I/opt/rocm-4.3.0/ -std=c++14"

Any ideas how to solve this?

chulwoo1 (Collaborator) commented Aug 19, 2021 via email

paboyle (Owner) commented Aug 21, 2021 via email

philomat (Author) commented:

Hi Peter,

this is the complete output of make:

[scior@qcdi2001 Grid]$ make
cp version-cache Version.h
make all-am
make[1]: Entering directory '/u/home/scior/Grid/build/Grid'
cp version-cache Version.h
CXX util/version.o
CXX qcd/action/fermion/instantiation/WilsonAdjImplD/WilsonCloverFermionInstantiationWilsonAdjImplD.o
error: stack size limit exceeded (131088) in _ZN4Grid11LambdaApplyIZNS_3adjINS_7iScalarINS_7iMatrixINS3_INS_9Grid_simdIN6thrust7complexIdEENS_9GpuVectorILi4ENS_10GpuComplexI15HIP_vector_typeIdLj2EEEEEEEELi8EEELi4EEEEEEENS_7LatticeIT_EERKSK_EUlmmmE_EEvmmmSJ_
error: stack size limit exceeded (131552) in _ZN4Grid11LambdaApplyIZNS_12outerProductINS_7iScalarINS_7iVectorINS3_INS_9Grid_simdIN6thrust7complexIdEENS_9GpuVectorILi4ENS_10GpuComplexI15HIP_vector_typeIdLj2EEEEEEEELi8EEELi4EEEEESH_EENS_7LatticeIDTcl12outerProductcvT__EcvT0__EEEEERKNSI_ISJ_EERKNSI_ISK_EEEUlmmmE_EEvmmmSJ_
2 errors generated when compiling for gfx906.
make[1]: *** [Makefile:3468: qcd/action/fermion/instantiation/WilsonAdjImplD/WilsonCloverFermionInstantiationWilsonAdjImplD.o] Error 1
make[1]: Leaving directory '/u/home/scior/Grid/build/Grid'
make: *** [Makefile:2490: all] Error 2
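
One way to see how close each kernel in that translation unit gets to the limit - a sketch rather than something from this thread; it assumes automake's V=1 verbose mode and a ROCm clang new enough to support the kernel-resource-usage remark:

# print the full hipcc command for the failing object instead of the terse "CXX ..." line
make V=1 qcd/action/fermion/instantiation/WilsonAdjImplD/WilsonCloverFermionInstantiationWilsonAdjImplD.o
# re-run that hipcc line by hand with the extra option
#   -Rpass-analysis=kernel-resource-usage
# and look at the reported ScratchSize values, which correspond to the per-thread
# stack usage that the "stack size limit exceeded" errors refer to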

philomat (Author) commented:

Any progress on this issue? I pulled Grid a couple of days ago and still get the same error.

jdmaia (Contributor) commented Oct 12, 2021

@philomat For now I'm avoiding this problem by conditionally compiling the problematic operators, which seems to be fine for building the main benchmark binary (Benchmark_ITT). I still need to take a look at the code and see whether we can reduce the amount of local data allocated per thread, or place it somewhere else, to avoid hitting this issue.

paboyle (Owner) commented Nov 25, 2021

I've run on Spock and it is doing well on Benchmark_ITT and Benchmark_dwf_fp32.

I've added the systems/Spock directory with compile and run scripts.
I'm getting 1.3 TF/s on an MI100.

paboyle (Owner) commented Nov 25, 2021

I also get 4 TF/s on a whole Spock node (4x MI100).

james-simone commented:

I'm also hitting the "stack frame size exceeds limit" error.
commit: HEAD detached at 135808d
Ubuntu 20.04 container
rocm-5.0.0/clang/14.0.0
CXX=hipcc CXXFLAGS=" -std=c++14 -I/opt/rocm/rocthrust/include -I/usr/local/openmpi/include -I/usr/local/fftw/include -I/usr/local/hdf5/include -I/usr/local/scidac/include " LDFLAGS=" -L/opt/rocm/rocthrust/lib -L/usr/local/openmpi/lib -L/usr/local/fftw/lib -L/usr/local/hdf5/lib -L/usr/local/scidac/lib " LIBS="-lmpi" MPICXX=mpicxx /var/tmp/Grid/configure --prefix=/usr/local/grid --enable-accelerator=hip --enable-comms=mpi3-auto --enable-gen-simd-width=64 --enable-numa --enable-openmp --enable-simd=GPU --enable-unified=no

atamazov commented:

There are some hardware-related limitations on the stack frame size on AMD GPUs. You need to reduce the usage of private memory in the kernels.

Note that gfx10 GPUs can use twice as much private memory as gfx9 because of the narrower wave size (32 vs 64 lanes).

Details can be found here: llvm/llvm-project@1ed4caf
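
As a quick sanity check on the numbers reported earlier in the thread (just arithmetic, matching the 128k-per-thread figure mentioned above):

echo $((128 * 1024))        # 131072-byte per-thread limit
echo $((131088 - 131072))   # first failing kernel is 16 bytes over
echo $((131552 - 131072))   # second failing kernel is 480 bytes over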
