Orion, Intel environment for spack-stack-1.8.0 breaks the system tar command #1355

Open
srherbener opened this issue Oct 22, 2024 · 10 comments
Assignees
Labels
bug Something is not working INFRA JEDI Infrastructure

Comments

@srherbener
Collaborator

Describe the bug

The Orion, Intel spack-stack-1.8.0 environment, specifically the LD_LIBRARY_PATH setting, interferes with the execution of the system tar command. See the next section on reproducing the error.

Simply running tar outside of the ecbuild command gets the same failure. After some tracing it appears that things go awry when loading the gzip functionality, where the spack-stack /apps/contrib/spack-stack/spack-stack-1.8.0/envs/ue-intel-2021.9.0/install/intel/2021.9.0/libxcrypt-4.4.35-ebrdc3w/lib/libcrypt.so.2 shared library gets loaded instead of the system libcrypt library (/usr/lib64/lib/libcrypt.so.2).
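
One way to reproduce this kind of tracing is to compare what the dynamic loader resolves with and without LD_LIBRARY_PATH. The following is only a sketch: `resolved_libs` is a hypothetical helper name, and it assumes `ldd` is available on the system.

```shell
#!/bin/sh
# Sketch: list the libraries the dynamic loader resolves for a binary,
# once with the current LD_LIBRARY_PATH and once with it unset.
# resolved_libs is a hypothetical helper, not part of spack-stack.
resolved_libs() {
  # Strip the load addresses, which change from run to run.
  ldd "$1" 2>/dev/null | sed 's/ (0x[0-9a-f]*)$//' | sort
}

bin="$(command -v tar)"
tmp="$(mktemp -d)"

resolved_libs "$bin" > "$tmp/with_path.txt"
( unset LD_LIBRARY_PATH; resolved_libs "$bin" ) > "$tmp/without_path.txt"

# Any differing line is a library that LD_LIBRARY_PATH redirects.
if diff "$tmp/with_path.txt" "$tmp/without_path.txt"; then
  echo "LD_LIBRARY_PATH does not change what $bin loads"
fi
```

The same helper can be pointed at `$(command -v gzip)` to check the gzip side. On glibc systems, running the command with `LD_DEBUG=libs` set prints the loader's search in more detail.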

I found that if LD_LIBRARY_PATH is unset (or /usr/lib64/lib is prepended to it), the tar command works properly.

I need help with coming up with a workable fix for this. I've tried

  • Alter the CRTM test/CMakeLists.txt file to unset LD_LIBRARY_PATH before running tar
    • This involves the cmake execute_process command, and I have probably done something wrong when I tried this. It sure seems you should be able to get this to work.
  • Defining and exporting a bash function for tar that unsets LD_LIBRARY_PATH and then runs /usr/bin/tar
    • This works when run in the shell, but does not work when run from cmake
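
A possible reason the exported function is not picked up: CMake's execute_process launches the program directly rather than through a shell, so bash functions are never consulted. A wrapper executable on PATH does not have that limitation. The sketch below assumes the CRTM CMakeLists.txt invokes tar by name (so a PATH lookup applies); `wrapper_dir` is a hypothetical location.

```shell
#!/bin/sh
# Sketch: put a wrapper executable named "tar" ahead of the real one on PATH.
# wrapper_dir is a hypothetical location; any directory on PATH would do.
# Assumes "tar" is not shadowed by a shell function or alias at this point.
real_tar="$(command -v tar)"      # resolve the system tar before changing PATH
wrapper_dir="$(mktemp -d)"

cat > "$wrapper_dir/tar" <<EOF
#!/bin/sh
# Clear LD_LIBRARY_PATH for this one process, then hand off to the real tar.
unset LD_LIBRARY_PATH
exec "$real_tar" "\$@"
EOF
chmod +x "$wrapper_dir/tar"

PATH="$wrapper_dir:$PATH"
export PATH
```

With this in place, LD_LIBRARY_PATH stays as the modules set it for everything except tar, and ecbuild would need to be run from the same shell so that it inherits the modified PATH.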

Does anyone have any ideas about how to address this?

Also, could someone check my environment setup to make sure I'm not missing something?

Thanks!

To Reproduce
Steps to reproduce the behavior:

Load the Intel environment by sourcing a script file that contains the following sequence:

#!/bin/bash

echo "Loading EWOK-SKYLAB Environment Using Spack-Stack 1.8.0"

SPACK_STACK_INTEL_ENV=/apps/contrib/spack-stack/spack-stack-1.8.0/envs/ue-intel-2021.9.0

# load modules
module purge
module use $SPACK_STACK_INTEL_ENV/install/modulefiles/Core
module load stack-intel/2021.9.0
module load stack-intel-oneapi-mpi/2021.9.0
module load stack-python/3.11.7

jedi-host-post-load() {
  module swap git-lfs git-lfs/3.1.2
}

# This is a fix for the issue where the spack-stack-1.8.0 udunits
# module does not get loaded properly. Without this workaround, the
# udunits module from the "spack-managed" gets loaded instead and
# ecbuild on jedi-bundle fails.
#
# Setting LMOD_TMOD_FIND_FIRST gets rid of the default marking
# of modules, and the modification of MODULEPATH makes sure
# that spack-stack-1.8.0 modules are found first before same
# named modules in other directories (ie, "spack-managed")
export LMOD_TMOD_FIND_FIRST=yes
module use $SPACK_STACK_INTEL_ENV/install/modulefiles/intel/2021.9.0

# Load JEDI modules
module load jedi-fv3-env
module load ewok-env
module load soca-env

The export and module use commands near the end are the workaround to get the proper udunits package to load.

Run ecbuild:

ecbuild -DPython3_EXECUTABLE=$(which python3) $JEDI_SRC

This results in the following error:

-- Building tests for CRTM v3.1.1.
-- Downloading CRTM coeffs files from: https://bin.ssec.wisc.edu/pub/s4/CRTM//fix_REL-3.1.1.2.tgz to /work2/noaa/jcsda/herbener/jedi/build-intel/test_data/3.1.1/fix_REL-3.1.1.2.tgz
-- Checking if /work2/noaa/jcsda/herbener/jedi/build-intel/test_data/3.1.1/fix_REL-3.1.1.2 already exists...
-- Untarring the downloaded file (~2 minutes) to /work2/noaa/jcsda/herbener/jedi/build-intel/test_data/3.1.1
tar: Relink `/apps/spack-managed/gcc-11.3.1/intel-oneapi-compilers-2023.1.0-sb753366rvywq75zeg4ml5k5c72xgj72/compiler/2023.1.0/linux/compiler/lib/intel64_lin/libimf.so' with `/usr/lib64/libm.so.6' for IFUNC symbol `sincosf'
CMake Error at crtm/test/CMakeLists.txt:106 (message):
  Failed to untar the file.


-- Configuring incomplete, errors occurred!

Expected behavior
The tar command run from the CRTM CMake configuration should complete successfully.

System:
Orion, Intel


@climbfuji
Collaborator

This is a known bug in the Intel oneAPI distribution itself. I sent the Intel developers a bug fix for it at the beginning of the calendar year, and I also sent the bug fix to the Orion/Hercules sysadmins. According to @RatkoVasic-NOAA, this problem was fixed for some of the libraries in the oneAPI distribution, but maybe not all?

@srherbener
Collaborator Author

> This is a known bug in the Intel oneAPI distribution itself. I sent the Intel developers a bug fix for it at the beginning of the calendar year, and I also sent the bug fix to the Orion/Hercules sysadmins. According to @RatkoVasic-NOAA, this problem was fixed for some of the libraries in the oneAPI distribution, but maybe not all?

Thanks for the response @climbfuji! Very helpful information.

@RatkoVasic-NOAA do you think there might be some libraries in the oneAPI installation that have not been repaired yet? And repairing those might address this issue? Thanks!

@RatkoVasic-NOAA
Collaborator

@srherbener I avoided that error by purging all loaded modules from my environment. For the spack-stack installation (both on Orion and Hercules) I started with 'module purge', and then all errors associated with "Failed to untar the file" disappeared.

@climbfuji
Collaborator

Isn't module purge part of the standard instructions for everyone before loading any spack-stack modules?

Also, I would be surprised if that really solved the problem - but I'd be happy to be surprised, for sure :-)

@srherbener srherbener added the INFRA JEDI Infrastructure label Oct 23, 2024
@srherbener
Collaborator Author

> @srherbener I avoided that error by purging all loaded modules from my environment, so for spack-stack installation (both on Orion and Hercules) I started with 'module purge' and then all errors associated with "Failed to untar the file" disappeared.

@RatkoVasic-NOAA in the example environment setting (in the description above) I have a call to module purge before the rest of the module load commands. Is this what you are referring to? Or is it something else that I am missing? Thanks!

@RatkoVasic-NOAA
Collaborator

@srherbener Here is what happened to me: while installing spack-stack, I was getting the same error message as you (I wasn't aware that I hadn't purged modules before installation).
Modules loaded by default were:

  1) contrib/0.1  2) noaatools/3.1  3) intel-oneapi-compilers/2023.1.0

Then I purged the modules and the error messages disappeared.
I thought the sysadmins had fixed the problematic libraries and that it was working because of that, but I realized that it was most likely because I hadn't purged modules before installing spack-stack on Orion (and Hercules).
I hope I managed to explain my train of thought and the chain of events. :-)

@climbfuji
Collaborator

So that means the sysadmins didn't fix anything yet, you just unloaded the modules when you had a problem. I've seen in the past that some applications don't show this problem, while others do. On Discover, for example, fv3-jedi would run fine, but geos-jedi failed with the above error.

Someone other than a weird dude from a different agency with no purpose on Orion/Hercules should be making a lot of noise all the way up the hierarchy until the sysadmins fix this.

@srherbener
Collaborator Author

Right after logging into orion, I see this:

orion-login-2[5] herbener$ module list
No modules loaded
orion-login-2[7] herbener$ echo $MODULEPATH
/apps/spack-managed/modulefiles/linux-rocky9-x86_64/Core:/apps/other/modulefiles:/apps/containers/modulefiles:/apps/licensed/modulefiles

Then I source our JCSDA, JEDI orion, intel environment script, which does a module purge first before any module load commands. Now I see:

(venv-intel) orion-login-2[11] herbener$ module list

Currently Loaded Modules:
  1) intel-oneapi-compilers/2023.1.0  73) py-h5py/3.11.0
  2) stack-intel/2021.9.0             74) py-cftime/1.0.3.4
  3) intel-oneapi-mpi/2021.9.0        75) py-netcdf4/1.5.8
  4) stack-intel-oneapi-mpi/2021.9.0  76) py-bottleneck/1.3.7
  5) gettext/0.21                     77) py-numexpr/2.8.4
  6) glibc/2.34                       78) py-et-xmlfile/1.0.1
  7) libxcrypt/4.4.35                 79) py-openpyxl/3.1.2
  8) zlib-ng/2.1.6                    80) py-six/1.16.0
  9) sqlite/3.43.2                    81) py-python-dateutil/2.8.2
 10) util-linux-uuid/2.38.1           82) py-pytz/2023.3
 11) python/3.11.7                    83) py-pyxlsb/1.0.10
 12) stack-python/3.11.7              84) py-xlrd/2.0.1
 13) snappy/1.1.10                    85) py-xlsxwriter/3.1.7
 14) zstd/1.5.2                       86) py-xlwt/1.3.0
 15) c-blosc/1.21.5                   87) py-pandas/1.5.3
 16) curl/7.76.1                      88) py-pycodestyle/2.11.0
 17) hdf5/1.14.3                      89) py-pyhdf/0.10.4
 18) netcdf-c/4.9.2                   90) libyaml/0.2.5
 19) netcdf-fortran/4.6.1             91) py-pyyaml/6.0.1
 20) fms/2024.02                      92) py-scipy/1.12.0
 21) cmake/3.27.9                     93) py-packaging/23.1
 22) git/2.31.1                       94) py-xarray/2023.7.0
 23) nccmp/1.9.0.1                    95) jedi-base-env/1.0.0
 24) parallel-netcdf/1.12.3           96) jedi-fv3-env/1.0.0
 25) parallelio/2.6.2                 97) py-awscrt/0.16.16
 26) python-venv/1.0                  98) py-colorama/0.4.6
 27) py-pip/23.1.2                    99) py-cryptography/38.0.1
 28) wget/1.21.3                     100) py-distro/1.8.0
 29) base-env/1.0.0                  101) py-docutils/0.19
 30) boost/1.84.0                    102) py-jmespath/1.0.1
 31) openblas/0.3.24                 103) py-wcwidth/0.2.7
 32) py-setuptools/63.4.3            104) py-prompt-toolkit/3.0.38
 33) py-numpy/1.23.5                 105) py-ruamel-yaml/0.17.16
 34) bufr/12.1.0                     106) py-ruamel-yaml-clib/0.2.7
 35) eigen/3.4.0                     107) py-urllib3/1.26.12
 36) eckit/1.27.0                    108) awscli-v2/2.13.22
 37) gsl-lite/0.37.0                 109) ecflow/5.11.4
 38) netcdf-cxx4/4.3.1               110) py-botocore/1.34.44
 39) py-pybind11/2.11.0              111) py-s3transfer/0.10.0
 40) bufr-query/0.0.2                112) py-boto3/1.34.44
 41) ecbuild/3.7.2                   113) py-contourpy/1.0.7
 42) libpng/1.6.37                   114) py-cycler/0.11.0
 43) openjpeg/2.3.1                  115) py-fonttools/4.39.4
 44) eccodes/2.33.0                  116) py-kiwisolver/1.4.5
 45) fftw/3.3.10                     117) py-pillow/9.5.0
 46) fckit/0.11.0                    118) py-pyparsing/3.1.2
 47) fiat/1.2.0                      119) py-matplotlib/3.7.4
 48) ectrans/1.2.0                   120) proj/9.2.1
 49) qhull/2020.2                    121) py-certifi/2023.7.22
 50) atlas/0.38.1                    122) py-pyproj/3.6.0
 51) sp/2.5.0                        123) py-pyshp/2.3.1
 52) gsibec/1.2.1                    124) geos/3.12.1
 53) libjpeg/2.1.0                   125) py-shapely/1.8.0
 54) krb5/1.21.2                     126) py-cartopy/0.23.0
 55) libtirpc/1.3.3                  127) py-smmap/5.0.0
 56) hdf/4.2.15                      128) py-gitdb/4.0.9
 57) jedi-cmake/1.4.0                129) py-gitpython/3.1.40
 58) libxt/1.3.0                     130) py-click/8.1.7
 59) libxmu/1.1.4                    131) py-pyjwt/2.4.0
 60) libxpm/3.5.17                   132) py-charset-normalizer/3.3.0
 61) libxaw/1.0.15                   133) py-idna/3.4
 62) udunits/2.2.28                  134) py-requests/2.31.0
 63) ncview/2.1.9                    135) py-globus-sdk/3.25.0
 64) json/3.11.2                     136) py-globus-cli/3.16.0
 65) json-schema-validator/2.3.0     137) py-markupsafe/2.1.3
 66) odc/1.5.2                       138) py-jinja2/3.1.2
 67) py-attrs/21.4.0                 139) ewok-env/1.0.0
 68) py-pycparser/2.21               140) antlr/2.7.7
 69) py-cffi/1.15.1                  141) gsl/2.7.1
 70) py-findlibs/0.0.2               142) nco/5.1.6
 71) py-eccodes/1.5.0                143) soca-env/1.0.0
 72) py-f90nml/1.4.3                 144) git-lfs/3.1.2

 

echo $MODULEPATH
/apps/contrib/spack-stack/spack-stack-1.8.0/envs/ue-intel-2021.9.0/install/modulefiles/intel/2021.9.0:/apps/contrib/spack-stack/spack-stack-1.8.0/envs/ue-intel-2021.9.0/install/modulefiles/intel-oneapi-mpi/2021.9.0-p2ray63/intel/2021.9.0:/apps/spack-managed/modulefiles/linux-rocky9-x86_64/intel-oneapi-mpi/2021.9.0-a66eaip/oneapi/2023.1.0:/apps/contrib/spack-stack/spack-stack-1.8.0/envs/ue-intel-2021.9.0/install/modulefiles/gcc/12.2.0:/apps/spack-managed/modulefiles/linux-rocky9-x86_64/oneapi/2023.1.0:/apps/contrib/spack-stack/spack-stack-1.8.0/envs/ue-intel-2021.9.0/install/modulefiles/Core:/apps/spack-managed/modulefiles/linux-rocky9-x86_64/Core:/apps/other/modulefiles:/apps/containers/modulefiles:/apps/licensed/modulefiles

Then I try tar:

(venv-intel) orion-login-2[13] herbener$ tar tzfv build-intel/test_data/3.1.1/fix_REL-3.1.1.2.tgz 
tar: Relink `/apps/spack-managed/gcc-11.3.1/intel-oneapi-compilers-2023.1.0-sb753366rvywq75zeg4ml5k5c72xgj72/compiler/2023.1.0/linux/compiler/lib/intel64_lin/libimf.so' with `/usr/lib64/libm.so.6' for IFUNC symbol `sincosf'
Segmentation fault (core dumped)

which breaks. If I wipe out LD_LIBRARY_PATH, then the tar command works:

(venv-intel) orion-login-2[14] herbener$ LD_LIBRARY_PATH="" tar tzfv build-intel/test_data/3.1.1/fix_REL-3.1.1.2.tgz 
drwxr-xr-x bjohnson/domain users 0 2024-08-14 10:53 fix_REL-3.1.1.2/
drwxr-xr-x bjohnson/domain users 0 2024-08-29 14:45 fix_REL-3.1.1.2/fix/
drwxr-xr-x bjohnson/domain users 0 2024-02-26 10:05 fix_REL-3.1.1.2/fix/EmisCoeff/
drwxr-xr-x bjohnson/domain users 0 2024-02-26 10:05 fix_REL-3.1.1.2/fix/EmisCoeff/IR_Ice/
drwxr-xr-x bjohnson/domain users 0 2024-02-26 10:05 fix_REL-3.1.1.2/fix/EmisCoeff/IR_Ice/SEcategory/
drwxr-xr-x bjohnson/domain users 0 2024-02-26 10:05 fix_REL-3.1.1.2/fix/EmisCoeff/IR_Ice/SEcategory/netCDF/
...

The module purge approach does not appear to help with this issue.

After some debugging, I discovered that the issue appears to be that the LD_LIBRARY_PATH set by the module loads (see the initial description above) places the spack-stack libxcrypt path in front of the system libcrypt path. So when tar executes, the wrong libcrypt library (i.e., the spack-stack one) gets loaded instead of the correct one, which is the system library. Unfortunately, we need LD_LIBRARY_PATH to keep the order we are getting so that the jedi-bundle build and tests all work correctly.

@RatkoVasic-NOAA
Collaborator

I see. How about prepending the system path to libcrypt to LD_LIBRARY_PATH in the modulefile? Then exec will find that one first and use it instead of spack-stack's.
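
The prepend idea can be tried in a shell before touching any modulefile. This is a rough sketch: `prepend_ld_library_path` is a hypothetical helper, and `/usr/lib64` is an assumed system library directory (taken from the `libm.so.6` path in the error message). Since the comment above says the module-defined order is needed for the jedi-bundle build, the sketch confines the change to a subshell.

```shell
#!/bin/sh
# Sketch: put one directory at the front of LD_LIBRARY_PATH.
# prepend_ld_library_path is a hypothetical helper, not a spack-stack function.
prepend_ld_library_path() {
  dir="$1"
  case "${LD_LIBRARY_PATH:-}" in
    "$dir" | "$dir":*) ;;                        # already first, nothing to do
    "") LD_LIBRARY_PATH="$dir" ;;                # variable was empty or unset
    *) LD_LIBRARY_PATH="$dir:$LD_LIBRARY_PATH" ;;
  esac
  export LD_LIBRARY_PATH
}

# Scope the change to a subshell so the module-defined order is untouched
# for everything else. /usr/lib64 is an assumed system library directory.
(
  prepend_ld_library_path /usr/lib64
  echo "inside subshell: $LD_LIBRARY_PATH"
  # the tar invocation would go here
)
echo "outside subshell: ${LD_LIBRARY_PATH:-<unset>}"
```

Note that putting a system directory first affects every library found there, not only libcrypt, which is why limiting the scope to the tar call seems safer than changing the modulefile globally.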

@climbfuji
Collaborator

The underlying problem however is this:

tar: Relink `/apps/spack-managed/gcc-11.3.1/intel-oneapi-compilers-2023.1.0-sb753366rvywq75zeg4ml5k5c72xgj72/compiler/2023.1.0/linux/compiler/lib/intel64_lin/libimf.so' with `/usr/lib64/libm.so.6' for IFUNC symbol `sincosf'
Segmentation fault (core dumped)

It only shows up in libcrypt because the spack-stack libcrypt links (per ldd) against libimf.so, which has the bug I described above.
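
For anyone who wants to confirm that dependency on their own system, a small check along these lines might help. It is a sketch only: `depends_on` and `LIB_TO_CHECK` are hypothetical names, and the default path is the libcrypt location quoted in the issue description.

```shell
#!/bin/sh
# Sketch: report whether a shared library (or executable) pulls in another
# library at load time. depends_on is a hypothetical helper name.
depends_on() {
  # $1: path to inspect, $2: substring of the dependency name
  ldd "$1" 2>/dev/null | grep -q "$2"
}

# Default path is taken from the issue description; set LIB_TO_CHECK to override.
lib="${LIB_TO_CHECK:-/apps/contrib/spack-stack/spack-stack-1.8.0/envs/ue-intel-2021.9.0/install/intel/2021.9.0/libxcrypt-4.4.35-ebrdc3w/lib/libcrypt.so.2}"

if depends_on "$lib" libimf; then
  echo "$lib links against libimf"
else
  echo "$lib does not list libimf (or could not be inspected)"
fi
```

Running the same check against the system libcrypt should show whether it is free of the libimf dependency.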

@eap eap self-assigned this Oct 28, 2024