
Orion: Migration to Rocky9 OS #2694

Closed
KateFriedman-NOAA opened this issue Jun 17, 2024 · 36 comments · Fixed by #2877
Labels: feature (New feature or request)
@KateFriedman-NOAA

What new functionality do you need?

Support for Rocky9 on Orion

Use: /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/gsi-addon-env-rocky9

What are the requirements for the new functionality?

System and components build/run on Orion Rocky9

Acceptance Criteria

System builds and runs without issues related to the OS

Suggest a solution (optional)

No response

@KateFriedman-NOAA added the feature (New feature or request) and triage (Issues that are triage) labels on Jun 17, 2024
@GeorgeGayno-NOAA

@KateFriedman-NOAA - UFS_UTILS will be updated under this issue: ufs-community/UFS_UTILS#963

@WalterKolczynski-NOAA removed the triage (Issues that are triage) label on Jun 17, 2024
@RussTreadon-NOAA

GDASApp issue #1159 documents Orion Rocky 9 updates.

@KateFriedman-NOAA

UWM issue: ufs-community/ufs-weather-model#2332

@GeorgeGayno-NOAA

I see multiple stacks:

  • /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/unified-env-rocky9/install/modulefiles/Core
  • /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/gsi-addon-env-rocky9

Which should we use?

@aerorahul

aerorahul commented Jun 21, 2024

> I see multiple stacks:
>
>   • /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/unified-env-rocky9/install/modulefiles/Core
>   • /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/gsi-addon-env-rocky9
>
> Which should we use?

Based on
https://github.com/ufs-community/UFS_UTILS/blob/65b530560c0a1620982d1857fdb36d65be17b867/modulefiles/build.orion.intel.lua#L5
I think /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/unified-env-rocky9/install/modulefiles/Core

@DavidHuber-NOAA

Opened issue NOAA-EMC/UPP#983 for the UPP.
Opened issue JCSDA/spack-stack#1158 to fix a CRTM-fix file problem in the spack-stack installation, a prerequisite for the GSI (NOAA-EMC/GSI#754).

GeorgeGayno-NOAA added a commit to ufs-community/UFS_UTILS that referenced this issue Jun 26, 2024
Point to the Rocky 9 spack-stack. 

Fixes #963.
Related to NOAA-EMC/global-workflow#2694
@RussTreadon-NOAA

RussTreadon-NOAA commented Jun 26, 2024

Changes to build and run gsi.x and enkf.x on Orion Rocky 9 committed to RussTreadon-NOAA/GSI:feature/orion_rocky9. ctests show significant increase in gsi.x and enkf.x wall times. See GSI PR #764 for details.

@RussTreadon-NOAA

GDASApp and GSI updates

GDASApp and GSI build and run on Orion Rocky 9 as of GDASApp PR #1180 and GSI PR #764.

Note that gsi.x and enkf.x run approximately 2x slower on Orion Rocky 9 than on Orion CentOS 7. This slowdown has been reported via

  • RDHPCS ticket #2024062754000098
  • spack-stack issue #1166

@KateFriedman-NOAA

KateFriedman-NOAA commented Jul 1, 2024

Branches for prepobs and fit2obs:

https://github.com/KateFriedman-NOAA/prepobs/tree/feature/orion_rocky9
https://github.com/KateFriedman-NOAA/Fit2Obs/tree/feature/orion_rocky9

Installs on Orion for testing (unable to install under glopara space on Orion):

/work/noaa/global/kfriedma/git/prepobs/feature-orion_rocky9
/work/noaa/global/kfriedma/git/fit2obs/feature-orion_rocky9

Will need to update g-w modulefile path for prepobs once installed:
https://github.com/NOAA-EMC/global-workflow/blob/develop/modulefiles/module_base.orion.lua#L48

Will need to update fit2obs_ver to 1.1.2:
https://github.com/NOAA-EMC/global-workflow/blob/develop/versions/run.spack.ver#L35
...and have it installed everywhere else as well. I can assist with installs elsewhere.

@DavidHuber-NOAA

The UFS weather model was updated to Rocky 9 on Orion at hash e784814.

@DavidHuber-NOAA

Unfortunately, this does not include the UPP, so that will need to be incorporated at a later hash. I will help with the UPP today.

@JessicaMeixner-NOAA

JessicaMeixner-NOAA commented Jul 1, 2024

> The UFS weather model was updated to Rocky 9 on Orion at hash e784814.

PR #2729 has the updates for this. I believe that once the updates from #2700 are merged to develop and then into this PR branch, the issue with the CI tests will be resolved.

@DavidHuber-NOAA

The UPP is waiting on a new spack-stack installation of g2tmpl/1.12.0 (spack-stack #1164).

@aerorahul

#2741 will partially address this issue. Remaining items after #2741:

  • UPP: @WenMeng-NOAA is aware
  • UFSWM: #2729 is in progress

@KateFriedman-NOAA

Updated prepobs for Orion Rocky9 is available here on Orion:

/work/noaa/global/kfriedma/glopara/git/prepobs/v1.0.2

Have also installed new v1.0.2 on both WCOSS2s and Jet. Will install on Hera when it's back from monthly maintenance tomorrow.
The following updates are needed within g-w:

  1. Update these lines:
    https://github.com/NOAA-EMC/global-workflow/blob/develop/modulefiles/module_base.orion.lua#L48
    https://github.com/NOAA-EMC/global-workflow/blob/develop/modulefiles/module_base.hercules.lua#L46

    to:

    prepend_path("MODULEPATH", "/work/noaa/global/kfriedma/glopara/git/prepobs/v1.0.2/modulefiles")

  2. Update prepobs_run_ver=1.0.2 here:
    https://github.com/NOAA-EMC/global-workflow/blob/develop/versions/run.spack.ver#L32

@KateFriedman-NOAA

Updated Fit2Obs install on Orion is available here: /work/noaa/global/kfriedma/glopara/git/Fit2Obs/v1.1.2

Have also installed in official global group account spaces on both WCOSS2s and Jet. Will install on Hera when it is back from monthly maintenance.

For this issue the following changes are needed in g-w:

  1. Temporarily change these lines to use the Orion install in my space ahead of Walter's return:

    https://github.com/NOAA-EMC/global-workflow/blob/develop/modulefiles/module_base.orion.lua#L51
    https://github.com/NOAA-EMC/global-workflow/blob/develop/modulefiles/module_base.hercules.lua#L49

  2. Update the Fit2Obs version to fit2obs_ver=1.1.2 here:

    https://github.com/NOAA-EMC/global-workflow/blob/develop/versions/run.spack.ver#L35

@DavidHuber-NOAA

UPP PR: NOAA-EMC/UPP#987
UFS weather model PR that it would be nice to include the UPP update in: ufs-community/ufs-weather-model#2326

@KateFriedman-NOAA

The new prepobs/v1.0.2 also ran successfully on Orion.

Awaiting cycled tests on Hera/WCOSS2/Orion to get far enough to invoke Fit2Obs.

KateFriedman-NOAA added a commit to KateFriedman-NOAA/global-workflow that referenced this issue Jul 8, 2024
- set the prepobs MODULEPATH to where the install
will be when set back to official glopara space

Refs NOAA-EMC#2694
@RussTreadon-NOAA

GSI update
dclock timers were added to a copy of GSI develop at 529bb796. The code was compiled on Dogwood, Hera, Hercules, and Orion. The global_4denvar ctest was run on each machine. Key timings are tabulated below.

| machine  | total read_obs time (s) | read_satwnd time (s) | total gsi.x wall time (s) |
|----------|-------------------------|----------------------|---------------------------|
| dogwood  | 169.988                 | 120.264              | 403.800888                |
| hera     | 194.612                 | 137.553              | 466.024682                |
| hercules | 164.571                 | 116.393              | 444.405860                |
| orion    | 608.372                 | 435.489              | 964.591370                |

Comparison of timings across the machines shows that Hercules, Hera, and Dogwood are comparable. Orion is the outlier. Reading bufr files takes significantly longer on Orion following the Rocky 9 upgrade. The reason(s) for this increased wall time remain, at present, unknown.
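To make the table's message concrete, the share of total gsi.x wall time spent in read_obs can be computed from the figures above (an illustrative calculation added here, not part of the original report):

```python
# Timings taken from the global_4denvar ctest table above.
timings = {
    # machine: (read_obs s, read_satwnd s, total gsi.x wall s)
    "dogwood":  (169.988, 120.264, 403.800888),
    "hera":     (194.612, 137.553, 466.024682),
    "hercules": (164.571, 116.393, 444.405860),
    "orion":    (608.372, 435.489, 964.591370),
}

for machine, (read_obs, _read_satwnd, total) in timings.items():
    print(f"{machine:9s} read_obs share of wall time = {read_obs / total:.0%}")
```

Roughly 37-42% of gsi.x wall time goes to read_obs on Dogwood, Hera, and Hercules, versus about 63% on Orion, consistent with bufr reading being the dominant contributor to the slowdown.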

For additional details see

KateFriedman-NOAA added a commit to KateFriedman-NOAA/global-workflow that referenced this issue Jul 11, 2024
@KateFriedman-NOAA

Short cycled atmos-only low-res testing of the new prepobs and fit2obs versions has completed successfully on WCOSS2-Dogwood, Hera, Orion, and Hercules. Will initiate a PR to update the prepobs and fit2obs versions in develop.

@RussTreadon-NOAA

@KateFriedman-NOAA , where are log files from the Hera, Orion, and Hercules tests? I would like to compare wall times across the machines. GSI ctests indicate that gsi.x runs noticeably slower on Orion (GSI issue #771). debufr tests also show this executable to run slower on Orion (NCEPLIBS-bufr issue #608).

KateFriedman-NOAA added a commit to KateFriedman-NOAA/global-workflow that referenced this issue Jul 11, 2024
@KateFriedman-NOAA

KateFriedman-NOAA commented Jul 11, 2024

@RussTreadon-NOAA Yes, please see the following:

Hera:

/scratch1/NCEPDEV/global/Kate.Friedman/expdir/testprepobs
/scratch2/NCEPDEV/stmp1/Kate.Friedman/comrot/testprepobs

Orion:

/work/noaa/global/kfriedma/expdir/testprepobs
/work/noaa/stmp/kfriedma/comrot/testprepobs

Hercules:

/work/noaa/global/kfriedma/expdir/testprepobsherc
/work/noaa/stmp/kfriedma/comrot/testprepobsherc

I also ran this same test on WCOSS2-Dogwood if that would be helpful.

I had to increase the eobs walltime on Orion (15 mins -> 30 mins) but otherwise left walltimes as-is and did not have any other time-limit failures.

@KateFriedman-NOAA

@RussTreadon-NOAA In case it's helpful, here is the WCOSS2-Dogwood test:

/lfs/h2/emc/global/noscrub/kate.friedman/expdir/devcycprepobs
/lfs/h2/emc/ptmp/kate.friedman/comrot/devcycprepobs

@RussTreadon-NOAA

Thank you @KateFriedman-NOAA .

gsi.x wall times are considerably higher on Orion compared to other platforms.

Orion

orion-login-2:/work/noaa/stmp/kfriedma/comrot/testprepobs/logs$ grep "wall" */gdasanal.log
2021122100/gdasanal.log: 0: The total amount of wall time                        = 1908.889531
2021122106/gdasanal.log: 0: The total amount of wall time                        = 2038.964334
2021122112/gdasanal.log: 0: The total amount of wall time                        = 1848.529683
2021122118/gdasanal.log: 0: The total amount of wall time                        = 1904.650164
2021122200/gdasanal.log: 0: The total amount of wall time                        = 1883.218843
orion-login-2:/work/noaa/stmp/kfriedma/comrot/testprepobs/logs$ grep "wall" */enkfgdaseobs.log
2021122100/enkfgdaseobs.log: 0: The total amount of wall time                        = 901.077139
2021122106/enkfgdaseobs.log: 0: The total amount of wall time                        = 884.285354
2021122112/enkfgdaseobs.log: 0: The total amount of wall time                        = 933.364648
2021122118/enkfgdaseobs.log: 0: The total amount of wall time                        = 1084.572618
2021122200/enkfgdaseobs.log: 0: The total amount of wall time                        = 983.077523

Hera

Hera(hfe03):/scratch2/NCEPDEV/stmp1/Kate.Friedman/comrot/testprepobs/logs$ grep wall */gdasanal.log
2021122100/gdasanal.log: 0: The total amount of wall time                        = 1116.593922
2021122106/gdasanal.log: 0: The total amount of wall time                        = 1293.681421
2021122112/gdasanal.log: 0: The total amount of wall time                        = 1131.990532
2021122118/gdasanal.log: 0: The total amount of wall time                        = 1168.874366
2021122200/gdasanal.log: 0: The total amount of wall time                        = 1103.230691
Hera(hfe03):/scratch2/NCEPDEV/stmp1/Kate.Friedman/comrot/testprepobs/logs$ grep "wall" */enkfgdaseobs.log
2021122100/enkfgdaseobs.log: 0: The total amount of wall time                        = 407.097608
2021122106/enkfgdaseobs.log: 0: The total amount of wall time                        = 432.787526
2021122112/enkfgdaseobs.log: 0: The total amount of wall time                        = 434.973986
2021122118/enkfgdaseobs.log: 0: The total amount of wall time                        = 532.029862
2021122200/enkfgdaseobs.log: 0: The total amount of wall time                        = 443.266375
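Averaging the gdasanal wall times grepped above gives a quick estimate of the Orion-to-Hera slowdown (a back-of-the-envelope calculation on the logged values, added for illustration):

```python
# gdasanal wall times (s) from the testprepobs logs quoted above.
orion = [1908.889531, 2038.964334, 1848.529683, 1904.650164, 1883.218843]
hera  = [1116.593922, 1293.681421, 1131.990532, 1168.874366, 1103.230691]

mean = lambda xs: sum(xs) / len(xs)
ratio = mean(orion) / mean(hera)
print(f"Orion mean {mean(orion):.0f} s, Hera mean {mean(hera):.0f} s, ratio {ratio:.2f}x")
```

So the gdasanal step runs roughly 1.6x slower on Orion than on Hera in this test.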

Interestingly, Orion, Hercules, and WCOSS2 (Dogwood) run the prep step with 4 streams while Hera runs with 1 stream, so Orion prep step wall times are compared against Hercules and Dogwood. Various executables in the 2021122100 gfsprep job have larger wall times on Orion than on Hercules and Dogwood. The same holds for gdasprep and other cycles.

Orion

orion-login-2:/work/noaa/stmp/kfriedma/comrot/testprepobs/logs$ grep "wall" 2021122100/gfsprep.log
The total amount of wall time                        = 2.378960
The total amount of wall time                        = 58.736572
The total amount of wall time                        = 0.683556
The total amount of wall time                        = 2.369928
The total amount of wall time                        = 59.592429
The total amount of wall time                        = 0.772289
The total amount of wall time                        = 2.356385
The total amount of wall time                        = 58.627895
The total amount of wall time                        = 0.618208
The total amount of wall time                        = 2.378561
The total amount of wall time                        = 59.753184
The total amount of wall time                        = 0.767621
The total amount of wall time                        = 1.102821
The total amount of wall time                        = 72.996897
The total amount of wall time                        = 6.940320
The total amount of wall time                        = 0.701914
The total amount of wall time                        = 1.628848
The total amount of wall time                        = 115.640379

Hercules

hercules-login-3:/work/noaa/stmp/kfriedma/comrot/testprepobsherc/logs$ grep "wall" 2021122100/gfsprep.log
The total amount of wall time                        = 1.891698
The total amount of wall time                        = 44.080027
The total amount of wall time                        = 0.417157
The total amount of wall time                        = 1.888169
The total amount of wall time                        = 44.332188
The total amount of wall time                        = 0.416644
The total amount of wall time                        = 1.905027
The total amount of wall time                        = 44.460939
The total amount of wall time                        = 0.397002
The total amount of wall time                        = 1.885472
The total amount of wall time                        = 45.027571
The total amount of wall time                        = 0.226408
The total amount of wall time                        = 0.437389
The total amount of wall time                        = 21.519730
The total amount of wall time                        = 2.531554
The total amount of wall time                        = 0.176123
The total amount of wall time                        = 0.583526
The total amount of wall time                        = 42.280456

WCOSS2 (Dogwood)

russ.treadon@dlogin05:/lfs/h2/emc/ptmp/kate.friedman/comrot/devcycprepobs/logs> grep "wall" 2021122100/gfsprep.log                  
The total amount of wall time                        = 0.936697
The total amount of wall time                        = 23.490267
The total amount of wall time                        = 0.217773
The total amount of wall time                        = 2.307517
The total amount of wall time                        = 24.180469
The total amount of wall time                        = 0.219624
The total amount of wall time                        = 2.280056
The total amount of wall time                        = 24.032663
The total amount of wall time                        = 0.212903
The total amount of wall time                        = 0.984444
The total amount of wall time                        = 23.692733
The total amount of wall time                        = 0.217461
The total amount of wall time                        = 1.772865
The total amount of wall time                        = 20.959409
The total amount of wall time                        = 2.440648
The total amount of wall time                        = 0.141279
The total amount of wall time                        = 0.587792
The total amount of wall time                        = 40.554083
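Lining up the four large per-stream entries in the gfsprep logs above (the ~59 s values on Orion, ~44 s on Hercules, ~24 s on Dogwood) quantifies the prep slowdown; this pairing of log entries is my reading of the logs, added here for illustration:

```python
# The four large per-stream wall times (s) from each machine's gfsprep log above.
orion    = [58.736572, 59.592429, 58.627895, 59.753184]
hercules = [44.080027, 44.332188, 44.460939, 45.027571]
dogwood  = [23.490267, 24.180469, 24.032663, 23.692733]

mean = lambda xs: sum(xs) / len(xs)
print(f"Orion vs Hercules: {mean(orion) / mean(hercules):.2f}x")
print(f"Orion vs Dogwood:  {mean(orion) / mean(dogwood):.2f}x")
```

On this pairing the dominant prep executables run roughly 1.3x slower on Orion than Hercules and about 2.5x slower than Dogwood.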

@KateFriedman-NOAA

Thanks for comparing these walltimes @RussTreadon-NOAA ! Interesting that the prep jobs also run slower on Orion. I'm guessing it's a similar reason to the GSI slowness.

@RussTreadon-NOAA

@KateFriedman-NOAA , issue #2759 reports an interesting finding from your Hera and Hercules parallels. No metplus files are generated. The Dogwood test generates metplus files. Shouldn't this be true for the Hera and Hercules parallels?

@KateFriedman-NOAA

> @KateFriedman-NOAA , issue #2759 reports an interesting finding from your Hera and Hercules parallels. No metplus files are generated. The Dogwood test generates metplus files. Shouldn't this be true for the Hera and Hercules parallels?

Good catch @RussTreadon-NOAA ! We'll look into it.

@RussTreadon-NOAA

> > @KateFriedman-NOAA , issue #2759 reports an interesting finding from your Hera and Hercules parallels. No metplus files are generated. The Dogwood test generates metplus files. Shouldn't this be true for the Hera and Hercules parallels?
>
> Good catch @RussTreadon-NOAA ! We'll look into it.

See g-w issue #2759 for an update - found a way to generate stats but not the correct solution.

@KateFriedman-NOAA

EMC_verif-global was updated for Orion Rocky9. See NOAA-EMC/EMC_verif-global#127.

Update global-workflow to use new hash: NOAA-EMC/EMC_verif-global@df296f4

@malloryprow

> EMC_verif-global was updated for Orion Rocky9. See NOAA-EMC/EMC_verif-global#127.
>
> Update global-workflow to use new hash: NOAA-EMC/EMC_verif-global@df296f4

@KateFriedman-NOAA Could you update the EMC_verif-global hash again? NOAA-EMC/EMC_verif-global@0d9e0b6

@KateFriedman-NOAA

Will do, thanks for the updated hash @malloryprow !

KateFriedman-NOAA added a commit to KateFriedman-NOAA/global-workflow that referenced this issue Jul 17, 2024
@KateFriedman-NOAA

@DavidHuber-NOAA is testing the latest EMC_verif-global hash to resolve issue #2759. That hash also includes updates for Orion Rocky9, so it will take care of both things in a single PR. FYI @aerorahul
