Increased gsi.x and enkf.x executable wall times following Orion Rocky 9 upgrade #1166
Comments
The MSU helpdesk replied. They have not experienced any slowdown in executable wall times following the Rocky 9 upgrade. There was one MSU user whose jobs ran slower, but this was traced to an errant environment variable. The GSI build uses the following modulefiles. Were all these modules recompiled for Orion Rocky 9? Was there any change in build options? Do all the modules perform as efficiently on Rocky 9 as they did on CentOS 7?
We're using the same set of versions/build options as before. The only difference that jumps out at me is the change in compiler and MPI. You could try the 'unified-env-intel-2023.2.4' environment (also under spack-stack-1.6.0 on Orion) and see if that makes any difference: /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/unified-env-intel-2023.2.4/install/modulefiles/Core
The GSI does not build with the unified environment alone; it needs the GSI addon environment.
We did not profile the executables. Watching this issue.
Right, sorry, forgot about the addon environment. Please try /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/gsi-addon-env-intel-2023.2.4/install/modulefiles/Core
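For reference, a minimal sketch of pointing a build shell at this addon environment; the metamodule names and versions below are assumptions based on later comments in this thread (e.g., the stack-intel module), not a verified recipe:

```shell
# Make the gsi-addon environment visible to the module system, then load the
# spack-stack metamodules; names/versions here are illustrative.
module use /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/gsi-addon-env-intel-2023.2.4/install/modulefiles/Core
module load stack-intel/2023.2.4
module load stack-intel-oneapi-mpi   # default version for this environment
module avail                         # confirm the gsi-addon packages show up
```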
@AlexanderRichert-NOAA, the following modifications were made to a copy of the GSI modulefile; however, an error was encountered after adding them.
Timers were added to the GSI source code (see the sketch below for the general approach). Comparison of Orion and Hercules timings:
Orion
Hercules
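The timer code itself was not captured here; the following self-contained Fortran sketch shows the general technique with the standard system_clock intrinsic (inside an MPI code such as the GSI, MPI_Wtime would be the more natural choice). The dummy loop stands in for the section being timed:

```fortran
! Minimal wall-clock timer sketch; not the actual GSI instrumentation.
program timer_sketch
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
  integer(int64) :: c0, c1, crate
  real(real64)   :: elapsed, x
  integer        :: i

  call system_clock(count_rate=crate)   ! ticks per second
  call system_clock(c0)

  ! --- section being timed (placeholder work) ---
  x = 0.0_real64
  do i = 1, 10000000
     x = x + sqrt(real(i, real64))
  end do
  ! ----------------------------------------------

  call system_clock(c1)
  elapsed = real(c1 - c0, real64) / real(crate, real64)
  print '(a,f10.3,a)', 'section wall time = ', elapsed, ' s'
  print *, 'checksum (prevents the loop from being optimized away):', x
end program timer_sketch
```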
Sorry about that, please try again. I forgot to add a line to the stack-intel module.
@AlexanderRichert-NOAA, got a bit further but still encountered an error; below is what I see.
The modified module file is:
Below are wall times in seconds for the indicated bufr observation reads on Orion and Hercules.
Orion
Hercules
As a test, compile the same GSI code on Hera and Dogwood (WCOSS2) and run the same cases there (a generic build sketch follows below). The same comment has been added to GSI issue #771.
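The exact build commands were not captured above; purely as an illustration, the GSI is a CMake project and a generic out-of-source build looks roughly like the following (platform modules, submodules, and any GSI-specific CMake options are omitted and would need to be set per machine):

```shell
# Illustrative only: configure and build gsi.x/enkf.x out of source.
git clone https://github.com/NOAA-EMC/GSI.git
cd GSI && mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j 8            # produces gsi.x and enkf.x in the build tree
```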
@jbathegit, @jack-woollen, @KateFriedman-NOAA, @DavidHuber-NOAA: have any of you noticed an increase in the time it takes to read bufr files on Orion following the Rocky 9 upgrade?
See GSI issue #771 for additional details.
bufr/12.0.1 test
debufr test: use the bufr utility debufr to decode a sample file (a usage sketch follows below).
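The exact debufr invocation was not captured above; a hedged sketch of how such a test might look (the input path and the -o output option are assumptions about the local setup, and timing is done with the shell's time builtin):

```shell
# Decode a BUFR file with the NCEPLIBS-bufr debufr utility and record the wall time.
module load bufr/12.0.1                         # version quoted in the comment above
time debufr -o /tmp/prepbufr.decoded.txt ./prepbufr
```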
@AlexanderRichert-NOAA suggested compiling the GSI and ENKF on Hercules and then running the ctests on Orion with those executables. After compiling, the ctests reported the following wall times on Orion:
runtime.global_4denvar_hiproc_contrl.txt: The total amount of wall time = 1394.165947
runtime.global_4denvar_hiproc_updat.txt: The total amount of wall time = 751.449358
runtime.global_4denvar_loproc_contrl.txt: The total amount of wall time = 1059.320163
runtime.global_4denvar_loproc_updat.txt: The total amount of wall time = 986.416513
runtime.global_enkf_hiproc_contrl.txt: The total amount of wall time = 159.291133
runtime.global_enkf_hiproc_updat.txt: The total amount of wall time = 160.180497
runtime.global_enkf_loproc_contrl.txt: The total amount of wall time = 204.069604
runtime.global_enkf_loproc_updat.txt: The total amount of wall time = 186.185513
runtime.hafs_3denvar_hybens_hiproc_contrl.txt: The total amount of wall time = 635.210342
runtime.hafs_3denvar_hybens_hiproc_updat.txt: The total amount of wall time = 625.259733
runtime.hafs_3denvar_hybens_loproc_contrl.txt: The total amount of wall time = 719.252038
runtime.hafs_3denvar_hybens_loproc_updat.txt: The total amount of wall time = 664.469968
runtime.hafs_4denvar_glbens_hiproc_contrl.txt: The total amount of wall time = 686.031241
runtime.hafs_4denvar_glbens_hiproc_updat.txt: The total amount of wall time = 677.697564
runtime.hafs_4denvar_glbens_loproc_contrl.txt: The total amount of wall time = 804.612061
runtime.hafs_4denvar_glbens_loproc_updat.txt: The total amount of wall time = 738.414877
runtime.rtma_hiproc_contrl.txt: The total amount of wall time = 354.441046
runtime.rtma_hiproc_updat.txt: The total amount of wall time = 351.002411
runtime.rtma_loproc_contrl.txt: The total amount of wall time = 362.241694
runtime.rtma_loproc_updat.txt: The total amount of wall time = 366.629561
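For what it's worth, a listing in exactly this format can be pulled from the per-test runtime logs with a one-line grep; the location of the runtime.*.txt files relative to the working directory is an assumption here:

```shell
# Collect the total wall time line from each regression-test runtime log.
grep "The total amount of wall time" runtime.*.txt
```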
I asked around, and JCSDA is not seeing specific file I/O cases of severe slowdown in applications. But someone mentioned that it is important to use
Thanks for checking @srherbener
Update: JCSDA has now seen SOCA running much slower than it should on Orion. Here are excerpts from a Slack discussion today:
Is that with GNU only (UFS and JEDI), or is that also with Intel?
@travissluka can you help clarify? Thanks! |
GNU (I didn't try intel), and JEDI only
It would be useful to know if the GSI slowdown reported above for Intel also happens with GNU, and vice versa, if the JEDI slowdown also happens with Intel. From what it sounds like, yes. That would be another good data point that the problem is somewhere in the underlying fabric or Slurm's integration with it.
Another data point: I'm trying on hercules/intel and it's going just as slow (maybe even slower, actually) for SOCA. File I/O doesn't seem to be a particular problem; it seems to be MPI related. If I throw more nodes at it, the variational solver goes even slower (inverse scalability!).
I believe I found the issue with the Intel/GCC compilers on Orion. The installations are bitwise identical to those on Hercules. The intel-oneapi-compilers/2023.1.0.lua modulefile for Orion indicates that the Intel compilers are built for the Icelake architecture, but Orion's CPUs are Skylake. This may produce illegal or less-than-optimal instructions in certain circumstances. I have notified the Orion team and am awaiting a response, but will follow up here when I hear back. Since the gcc compiler on Orion (/usr/bin/gcc) is identical to the one on Hercules, I believe it may share the same problem. To test this, I installed Intel 2023.1.0 (targeting the Skylake architecture) via spack and compiled BUFR with it. The debufr application still runs just as slow this way as with the Icelake installation of the Intel compilers. I believe, but cannot prove yet, that this is because Intel is referencing Icelake GNU libraries. To prove it, I think I would need to set up a container on Orion with its own GCC and GNU libraries. Not having much experience with containers, it would be nice to have some help setting this up.
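For anyone wanting to double-check a node's microarchitecture, a quick hedged sketch using standard Linux tools (nothing Orion-specific; run on a compute node of interest):

```shell
# Report the CPU model and which AVX-512 subsets the node advertises;
# Skylake-SP exposes avx512f/cd/bw/dq/vl, while Icelake adds several more subsets.
lscpu | grep -i "model name"
grep -o 'avx512[a-z0-9_]*' /proc/cpuinfo | sort -u
```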
Rebuild GSI.
From a discussion with an Intel contact (an application engineer), I think this is a good avenue to pursue. Here are his comments:
@srherbener Thank you for this very useful information. Based on the differences in AVX512 instructions between Skylake and Icelake cores, I performed a test on Orion where I added compiler flags to restrict the instruction set (see the illustrative flags below). If possible, I would still appreciate a container on Orion with its own GLIBC (2.34) and GCC (11.3.0), just so I could rule out any issues in the system GLIBC library. I attempted to install my own GLIBC/GCC via a crosstool-ng installation here:
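The specific flags used in that test were not captured here; purely as an illustration, one common way to keep the Intel classic compilers from emitting AVX-512 is to target an older instruction set explicitly (the flag values below are assumptions, not necessarily what was used; -march=ivybridge does appear in the link line quoted later in this thread):

```shell
# Illustrative only: build for a pre-AVX-512 target so no zmm-register code is generated.
export CFLAGS="-march=ivybridge"
export FFLAGS="-march=ivybridge"
# Alternative for icc/ifort: -xCORE-AVX2 limits vectorization to AVX2.
```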
@DavidHuber-NOAA can you provide the path for your AVX512-less debufr executable? I had tried to build without AVX512 but it wasn't clear that it was correctly leaving out all those functions. Some kind of instruction-related issue would seem to be consistent with my observation that there's no one piece of the code where things are slow (for example, eliminating exponentiation, and subsequently Intel's special 'pow' functions, sped up the code disproportionately on Orion compared with Hercules, and yet didn't explain even the majority of the time difference...).
@AlexanderRichert-NOAA Sure, you can find it here:
Hm, I'm not sure how significant it is, but it seems it's still using some AVX512 routines (I seem to recall that the Intel memcpy/memset routines were one of the top items in the gprof runs I did):
$ nm /work/noaa/global/dhuber/LIBS/bufr_11.7.0_noavx512/build/install/bin/debufr | grep avx
00000000004f1ff0 T __intel_avx_memmove
00000000004ef750 T __intel_avx_rep_memcpy
00000000004f0f20 T __intel_avx_rep_memset
00000000004ec300 T __intel_mic_avx512f_memcpy
00000000004ede80 T __intel_mic_avx512f_memset
Edit: I just noticed the 'mic' in those two avx512 routines, so those may not be getting used on regular CPUs anyway. In any case, it might be worth running the avx512-less one with gprof to see if there are certain routines that are running slow, vs. just generally running slower.
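A minimal sketch of the gprof run suggested above, assuming the executable can be rebuilt with profiling enabled; the input file name is a placeholder:

```shell
# Rebuild with -pg (accepted by both icc and gcc), run the slow case, then
# read the flat profile to see where the time is going.
export CFLAGS="-pg" FFLAGS="-pg" LDFLAGS="-pg"
# ... reconfigure and rebuild debufr here ...
./debufr ./sample.bufr            # writes gmon.out in the current directory
gprof ./debufr gmon.out | head -40
```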
Interesting. I compiled that executable with
It looks like the linking step also included the following:
/apps/spack-managed/gcc-11.3.1/intel-oneapi-compilers-2023.2.4-cuxy653alrskny7hlvpp6hdx767d3xnt/compiler/2023.2.4/linux/bin/intel64/icc -g -traceback -O3 -march=ivybridge CMakeFiles/debufr.dir/debufr.c.o CMakeFiles/debufr.dir/debufr.F90.o -o debufr ../src/libbufr_4.a -lifport -lifcoremt -lpthread
Describe the bug
gsi.x and enkf.x wall times significantly increased following the Orion Rocky 9 upgrade. See this comment in GSI issue #754 for details.
To Reproduce
Steps to reproduce the behavior:
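The original reproduction steps were not captured in this extract; a hedged sketch based on the ctest runs discussed in the comments above (test names follow the runtime logs quoted there, and the location of the runtime files is a guess):

```shell
# Build the GSI as usual, then run the regression tests and compare wall times.
cd GSI/build
ctest -R global_4denvar --output-on-failure
ctest -R global_enkf    --output-on-failure
grep "The total amount of wall time" $(find . -name "runtime.*.txt")   # log location is a guess
```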
Expected behavior
We expect Orion Rocky 9 gsi.x and enkf.x wall times to be comparable with what we observed when building and running on Orion CentOS 7.
System:
Orion
Additional context
The modulefiles used to build gsi.x and enkf.x are gsi_orion.intel.lua and gsi_common.lua. gsi_orion.intel.lua sets the compiler and library modules for Orion. Ticket RDHPCS#2024062754000098 has been opened with the Orion helpdesk. Perhaps system settings changed with the Orion Rocky 9 upgrade.
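One quick check along those lines is to inspect what the compiler metamodule actually sets under the new OS; the module name/version below are assumptions based on the environment paths quoted earlier in this thread:

```shell
# Show the environment changes made by the stack-intel metamodule
# (module writes to stderr, hence the redirection before grepping).
module use /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/gsi-addon-env-intel-2023.2.4/install/modulefiles/Core
module show stack-intel/2023.2.4 2>&1 | grep -Ei "prepend|setenv"
```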