btl_tcp_endpoint open-mpi error in ufs-weather-model regression test #1313

Open
NickSzapiro-NOAA opened this issue Sep 26, 2024 · 1 comment
Labels: bug (Something is not working), OAR-EPIC (NOAA Oceanic and Atmospheric Research and Earth Prediction Innovation Center)

Comments


NickSzapiro-NOAA commented Sep 26, 2024

Describe the bug
A GNU debug build of a ufs-weather-model regression test under development for GEFS fails during initialization with the following error:

[../../../../../opal/mca/btl/tcp/btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (removed) failed: Address already in use (98). 

This looks similar to an existing Open MPI issue (open-mpi/ompi#7246) and may be related to all available ports being in use.

It would be good to confirm that this is indeed the cause and to resolve it if possible (perhaps by changing the number of tasks or ports?).
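
For reference, error code 98 on Linux is EADDRINUSE. The snippet below is only a minimal sketch of how that errno arises in general: it binds two TCP sockets to the same local address and port. It does not reproduce the Open MPI code path or the port-exhaustion scenario, and the function name `second_bind_errno` is made up for illustration.

```python
import errno
import socket


def second_bind_errno(host: str = "127.0.0.1") -> int:
    """Bind two TCP sockets to the same address and port.

    Returns the errno raised by the second bind, or 0 if it succeeded.
    Illustration only: this is not how Open MPI opens its connections.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind((host, 0))  # port 0 lets the kernel pick a free port
        port = first.getsockname()[1]
        try:
            second.bind((host, port))  # same address and port, no SO_REUSEADDR
        except OSError as exc:
            return exc.errno
        return 0
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    code = second_bind_errno()
    print(code, errno.errorcode.get(code))
```

If the failure on Hera really is port exhaustion, the same errno would come from the kernel having no free local port to hand out rather than from an explicit port clash as in this sketch.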

To Reproduce
Run the GNU cpld_debug_gefs regression test on Hera:

git clone https://github.com/NickSzapiro-NOAA/ufs-weather-model.git
cd ufs-weather-model
git checkout RT_bmark_gefs
git submodule update --init --recursive
cd tests
./rt.sh -a {ACCT} -n "cpld_debug_gefs gnu"

Expected behavior
The regression test should run to completion.

System:
Hera

Additional context
Since this appears to be an issue involving Open MPI, the NOAA RDHPCS help desk suggested opening an issue here.

@NickSzapiro-NOAA NickSzapiro-NOAA added the bug Something is not working label Sep 26, 2024
@climbfuji climbfuji added the OAR-EPIC NOAA Oceanic and Atmospheric Research and Earth Prediction Innovation Center label Sep 26, 2024
NickSzapiro-NOAA (Author) commented

After changing tasks/memory, I now get this error instead:

7 682: The OSC pt2pt component does not support MPI_THREAD_MULTIPLE in this release.
682: Workarounds are to run on a single node, or to use a system with an RDMA
682: capable network such as Infiniband.
6: [h2c02:3430684] *** An error occurred in MPI_Win_create
6: [h2c02:3430684] *** reported by process [3376939008,6]
6: [h2c02:3430684] *** on communicator MPI COMMUNICATOR 74 DUP FROM 73
6: [h2c02:3430684] *** MPI_ERR_WIN: invalid window
6: [h2c02:3430684] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
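
Since the output above is interleaved, here is a rough sketch for grouping such lines by their numeric prefix. It assumes each line has the shape `<number>: <message>` and that the number identifies the task that wrote it; that reading of the prefix is a guess, and `group_by_prefix` is a hypothetical helper, not part of any tool used here.

```python
import re
from collections import defaultdict

# Assumed line shape: "<number>: <message>". What the number means is a guess.
PREFIX = re.compile(r"^\s*(\d+):\s?(.*)$")


def group_by_prefix(lines):
    """Collect messages under their numeric prefix; skip lines without one."""
    groups = defaultdict(list)
    for line in lines:
        match = PREFIX.match(line)
        if match:
            groups[int(match.group(1))].append(match.group(2).rstrip())
    return dict(groups)


# A few lines copied from the log above.
sample = [
    "682: The OSC pt2pt component does not support MPI_THREAD_MULTIPLE in this release.",
    "6: [h2c02:3430684] *** An error occurred in MPI_Win_create",
    "6: [h2c02:3430684] *** MPI_ERR_WIN: invalid window",
]

if __name__ == "__main__":
    for prefix, messages in sorted(group_by_prefix(sample).items()):
        print(prefix)
        for message in messages:
            print("   ", message)
```

Grouped this way, the MPI_THREAD_MULTIPLE message and the MPI_Win_create failure show up under different prefixes (682 and 6), which may help when comparing against the full log.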
