Update Chicoma-CPU and add Chicoma-GPU #73
Conversation
@jonbob and @vanroekel, I haven't run any tests on this yet but wanted to give you a heads up that I'm working on it. I will test all the supported compilers (4 on CPU and 4 on GPU) early next week to make sure they can run a simple E3SM test. After that, I'll ask for your input.
@jonbob, I'm not having any luck testing this on Chicoma. Any runs of
@xylar -- I'll try it later today
@xylar -- I was able to successfully build:
I'll test again tomorrow. Thanks for the help, @jonbob!
After I fixed those lines, it's still complaining about "-m" when it tries to run. From the e3sm.log:
And here's more output:
Maybe another line break?
Yep, it's possible. I had missed the same formatting issues with
With a few fixes, I am now able to run tests on
However, I'm not able to build mct with
I haven't been able to figure out what is supposed to be providing
On Perlmutter,
I'm giving up on this for now. @vanroekel, if this becomes pressing for you, I suggest getting some help from LANL IC on this.
Thank you for working on this, @xylar, I appreciate it. I'll try to pick this up and push on it later in the week.
@vanroekel, one thought I had was that maybe
@xylar you were right on the money. When I logged onto the gpu partition, there was a /usr/lib64/libcuda.so.1 file.
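For reference, a quick way to confirm this kind of node-dependent visibility is to run the same check on both node types. A minimal sketch, assuming the partition is named `gpu` in Chicoma's Slurm setup:

```bash
# Sketch: compare library visibility between the front end and a compute
# node. The partition name "gpu" is an assumption, not confirmed here.
ls -l /usr/lib64/libcuda.so*                      # on the front-end (login) node
srun --partition=gpu --nodes=1 --time=00:05:00 \
  ls -l /usr/lib64/libcuda.so*                    # on a GPU compute node
```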
Okay, that's going to mean that everything for
I can test things there tomorrow if I find the time.
@xylar I have a bit of time to push on this this morning, do you have a test I could try on chicoma-gpu to verify? Would I change something like this one
to
?
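For anyone following along, retargeting a CIME test at the GPU partition comes down to the machine_compiler suffix on the test name. A minimal sketch; the test type, grid, and compset below are illustrative stand-ins, only the `chicoma-gpu_nvidiagpu` suffix is the point:

```bash
# Hypothetical example of aiming a test at the GPU partition; the test
# name is illustrative, not necessarily one used in this PR.
cd cime/scripts
./create_test SMS_D_Ln9.ne4pg2_oQU480.F2010.chicoma-gpu_nvidiagpu
```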
My change to the test worked. There is an error (different from what you saw before) that I'll look into:
- <MAX_MPITASKS_PER_NODE>64</MAX_MPITASKS_PER_NODE>
+ <MAX_MPITASKS_PER_NODE>128</MAX_MPITASKS_PER_NODE>
@mark-petersen, the issue you pointed out on Slack should be fixed here.
Yes, this should definitely be 128 for chicoma-cpu. Thanks.
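A quick way to double-check the committed value; the file path assumes E3SM's usual machine-config layout and may differ:

```bash
# Sanity check that chicoma-cpu now advertises 128 MPI tasks per node;
# the path below assumes E3SM's standard layout.
grep -A 30 'MACH="chicoma-cpu"' cime_config/machines/config_machines.xml \
  | grep MAX_MPITASKS_PER_NODE
```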
@vanroekel, that sounds to me like something isn't configured right in the
Yes, that seems like a good thing to try. Maybe @jonbob has something even simpler, but given that most of the wait is compile time, it doesn't hurt to test MPAS-O, MPAS-Seaice, and MALI all in one go.
That's about as small a test as we could come up with for all three components. You could always try a C- or D-case and just have to build one active component, but it may not save you much with the parallel build
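As a sketch of what such a single-active-component case might look like; the grid and compset here are illustrative stand-ins, not a test from this PR:

```bash
# Hypothetical ocean-only (C-case) smoke test: SMS_D_Ln9 is a short
# debug smoke test, and the grid/compset are illustrative examples.
./create_test SMS_D_Ln9.T62_oQU240.CMPASO-NYF.chicoma-gpu_nvidiagpu
```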
A small update on the
Well, an unfortunate update: it seems the missing files are only visible on the front-end nodes, while libcuda.so is only visible on the compute nodes. I'm working with LANL support on how to address this.
A bit of progress: I have a workaround for the pkg-config error. I'm now able to build all the dependencies on GPU, but am now getting an error in the MPAS build.
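For context, workarounds for this class of pkg-config failure usually amount to pointing `PKG_CONFIG_PATH` at wherever the .pc files actually live. A sketch; the directory is a hypothetical placeholder, not the actual fix used here:

```bash
# Hypothetical workaround: expose missing .pc files to pkg-config.
# The directory below is a placeholder, not the real location.
export PKG_CONFIG_PATH=/path/to/pkgconfig:${PKG_CONFIG_PATH}
pkg-config --list-all | grep -i cuda   # verify the package is now visible
```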
@jonbob or @xylar, do either of you know where MPAS picks up build options, so I can try to remove the options that are 'unknown'?
@vanroekel, these are presumably coming from here:
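One way to hunt down where a rejected flag enters the build is a recursive grep over the machine configs. A sketch; the path assumes E3SM's layout, and the flag is a placeholder for whatever the compiler rejected:

```bash
# Placeholder search: substitute the actual flag the compiler rejected.
# The "--" keeps grep from parsing the leading dash as its own option.
grep -rn -- "-someflag" cime_config/machines/
```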
@vanroekel, I suspect the problem might be that we're missing the equivalent of:
@vanroekel, see if the macros I just added make a difference.
Thanks @xylar! I took these and made one more change, and I got it to build! Do you want me to pass you my small changes or push to this branch? However, it still won't run or submit. I'm getting this error:
@philipwjones any suggestions on what this means? Here are the GPU directives:
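The actual directives were lost in this transcript, but for reference, NVHPC-style OpenACC builds typically pass flags along these lines; this is illustrative, not necessarily what this PR used:

```bash
# Illustrative NVHPC OpenACC compile line; cc80 targets A100-class GPUs.
# The source file name is a placeholder.
nvfortran -acc -gpu=cc80 -Minfo=accel -c example_module.F
```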
@vanroekel, that sounds like progress! Yes, just push to this branch.
It would be convenient to remove badger in this PR, since chicoma replaced badger. Otherwise, we should remove the badger machine file section in a separate PR.
@vanroekel Do you have the actual batch submit command from the logs? GRES is a resource error, so I'm not sure what you actually asked for...
So, a bit of a funny story: I figured out the GRES error. It turns out it triggers when you use an account value in sbatch that doesn't have access to chicoma-gpu, and my tests had been using a chicoma-cpu-only account. When I switched to a different account, it submitted! I'm testing the E3SM test again and will report back and push changes soon.
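For anyone who hits the same GRES error, Slurm can list which accounts and partitions a user may submit to. A sketch; the exact fields available vary by site:

```bash
# List this user's Slurm associations; an account missing chicoma-gpu
# access here would explain the GRES submission error.
sacctmgr show associations user=$USER format=Account,Partition,QOS
```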
Sounds like progress! User errors are usually the easiest ones to fix (once you spot them).
nvidiagpu now works
Okay, I just pushed changes that allowed me to build, submit, and run
I've only tested
Anything else you'd like me to test?
@vanroekel, could you run the same test with
After that, let's call it good. We can always make follow-up PRs to fix anything that comes up.
I've moved to E3SM-Project#6228, so please report testing on
This merge makes a few updates to Chicoma-CPU and adds support for Chicoma's GPU partition.