Update Chicoma-CPU and add Chicoma-GPU #73
Conversation
@jonbob and @vanroekel, I haven't run any tests on this yet but wanted to give you a heads up that I'm working on it. I will test all the supported compilers (4 on CPU and 4 on GPU) early next week to make sure they can run a simple E3SM test. After that, I'll ask for your input.
@jonbob, I'm not having any luck testing this on Chicoma. Any runs of
@xylar -- I'll try it later today
@xylar -- I was able to successfully build:
I'll test again tomorrow. Thanks for the help, @jonbob!
After I fixed those lines, it's still complaining about "-m" when it tries to run. From the e3sm.log:
And here's more output:
Maybe another line break?
Yep, it's possible. I had missed the same formatting issues with
With a few fixes, I am now able to run tests on
However, I'm not able to build mct with
I haven't been able to figure out what is supposed to be providing
On Perlmutter,
I'm giving up on this for now. @vanroekel, if this becomes pressing for you, I suggest getting some help from LANL IC on this.
Thank you for working on this, @xylar, I appreciate it. I'll try to pick this up and push on it later in the week.
@vanroekel, one thought I had was that maybe
@xylar you were right on the money. When I logged onto the gpu partition, there was a /usr/lib64/libcuda.so.1 file.
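For reference, a quick way to confirm this kind of node-dependent visibility is to run the same check on both node types. A minimal sketch, assuming the partition is named `gpu` in Chicoma's Slurm setup:

```bash
# Sketch: compare library visibility between the front end and a compute
# node. The partition name "gpu" is an assumption, not confirmed here.
ls -l /usr/lib64/libcuda.so*                      # on the front-end (login) node
srun --partition=gpu --nodes=1 --time=00:05:00 \
  ls -l /usr/lib64/libcuda.so*                    # on a GPU compute node
```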
Okay, that's going to mean that everything for
I can test things there tomorrow if I find the time.
@xylar I have a bit of time to push on this this morning, do you have a test I could try on chicoma-gpu to verify? Would I change something like this one
to
?
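For anyone following along, retargeting a CIME test at the GPU partition comes down to the machine_compiler suffix on the test name. A minimal sketch; the test type, grid, and compset below are illustrative stand-ins, only the `chicoma-gpu_nvidiagpu` suffix is the point:

```bash
# Hypothetical example of aiming a test at the GPU partition; the test
# name is illustrative, not necessarily one used in this PR.
cd cime/scripts
./create_test SMS_D_Ln9.ne4pg2_oQU480.F2010.chicoma-gpu_nvidiagpu
```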
My change to the test worked. There is an error (different from what you saw before) that I'll look into:
- <MAX_MPITASKS_PER_NODE>64</MAX_MPITASKS_PER_NODE>
+ <MAX_MPITASKS_PER_NODE>128</MAX_MPITASKS_PER_NODE>
@mark-petersen, the issue you pointed out on Slack should be fixed here.
Yes, this should definitely be 128 for chicoma-cpu. Thanks.
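A quick way to double-check the committed value; the file path assumes E3SM's usual machine-config layout and may differ:

```bash
# Sanity check that chicoma-cpu now advertises 128 MPI tasks per node;
# the path below assumes E3SM's standard layout.
grep -A 30 'MACH="chicoma-cpu"' cime_config/machines/config_machines.xml \
  | grep MAX_MPITASKS_PER_NODE
```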
@vanroekel, that sounds to me like something isn't configured right in the
Yes, that seems like a good thing to try. Maybe @jonbob has something even simpler, but given that most of the wait is compile time, it doesn't hurt to test MPAS-O, MPAS-Seaice, and MALI all in one go.
That's about as small a test as we could come up with for all three components. You could always try a C- or D-case and just have to build one active component, but it may not save you much with the parallel build
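As a sketch of what such a single-active-component case might look like; the grid and compset here are illustrative stand-ins, not a test from this PR:

```bash
# Hypothetical ocean-only (C-case) smoke test: SMS_D_Ln9 is a short
# debug smoke test, and the grid/compset are illustrative examples.
./create_test SMS_D_Ln9.T62_oQU240.CMPASO-NYF.chicoma-gpu_nvidiagpu
```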
A small update on the
Well, an unfortunate update: it seems the missing files are only visible on the front-end nodes, while libcuda.so is only visible on the compute nodes. I'm working with LANL support on how to address this.
A bit of progress: I have a workaround for the pkg-config error. I'm now able to build all the dependencies on GPU, but am now getting an error in the MPAS build.
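For context, workarounds for this class of pkg-config failure usually amount to pointing `PKG_CONFIG_PATH` at wherever the .pc files actually live. A sketch; the directory is a hypothetical placeholder, not the actual fix used here:

```bash
# Hypothetical workaround: expose missing .pc files to pkg-config.
# The directory below is a placeholder, not the real location.
export PKG_CONFIG_PATH=/path/to/pkgconfig:${PKG_CONFIG_PATH}
pkg-config --list-all | grep -i cuda   # verify the package is now visible
```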
@jonbob or @xylar, do either of you know where MPAS picks up build options, so I can try to remove the options that are 'unknown'?
@vanroekel, these are presumably coming from here:
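One way to hunt down where a rejected flag enters the build is a recursive grep over the machine configs. A sketch; the path assumes E3SM's layout, and the flag is a placeholder for whatever the compiler rejected:

```bash
# Placeholder search: substitute the actual flag the compiler rejected.
# The "--" keeps grep from parsing the leading dash as its own option.
grep -rn -- "-someflag" cime_config/machines/
```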
@vanroekel, I suspect the problem might be that we're missing the equivalent of:
@vanroekel, see if the macros I just added make a difference.
Thanks @xylar! I took these and made one more change, and I got it to build! Do you want me to pass you my small changes or push to this branch? However, it still won't run or submit. I'm getting this error:
@philipwjones any suggestions on what this means? Here are the GPU directives:
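The actual directives were lost in this transcript, but for reference, NVHPC-style OpenACC builds typically pass flags along these lines; this is illustrative, not necessarily what this PR used:

```bash
# Illustrative NVHPC OpenACC compile line; cc80 targets A100-class GPUs.
# The source file name is a placeholder.
nvfortran -acc -gpu=cc80 -Minfo=accel -c example_module.F
```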
@vanroekel, that sounds like progress! Yes, just push to this branch.
It would be convenient to remove badger in this PR, since chicoma replaced badger. Otherwise, we should remove the badger machine file section in a separate PR.
@vanroekel Do you have the actual batch submit command from the logs? GRES is a resource error, so I'm not sure what you actually asked for...
So, a bit of a funny story: I figured out the GRES error. It turns out it triggers when you use an account value in sbatch that doesn't have access to chicoma-gpu, and my tests had been using a chicoma-cpu-only account. When I switched to a different account, it submitted! I'm testing the E3SM test again and will report back and push changes soon.
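For anyone who hits the same GRES error, Slurm can list which accounts and partitions a user may submit to. A sketch; the exact fields available vary by site:

```bash
# List this user's Slurm associations; an account missing chicoma-gpu
# access here would explain the GRES submission error.
sacctmgr show associations user=$USER format=Account,Partition,QOS
```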
Sounds like progress! User errors are usually the easiest ones to fix (once you spot them).
nvidiagpu now works
Okay, I just pushed changes that allowed me to build, submit, and run
I've only tested
Anything else you'd like me to test?
@vanroekel, could you run the same test with
After that, let's call it good. We can always make follow-up PRs to fix anything that comes up.
I've moved to E3SM-Project#6228, so please report testing on
This merge makes a few updates to Chicoma-CPU and adds support for Chicoma's GPU partition.