[WIP] {2023.06}[foss/2023a] PyTorch v2.1.2 with CUDA/12.1.1 #586

trz42 · 2024-05-24T07:22:12Z

WORK IN PROGRESS

This PR aims to add PyTorch/2.1.2 with CUDA/12.1.1. The build may not work out of the box, so this PR also documents our progress, the issues we hit and the workarounds applied.

PyTorch with CUDA requires cuDNN, so this PR also builds cuDNN using the changes provided by #581 and #579. The changes from #579 would first have to be ingested, so we need additional changes here; we try to document well what we do, and why.

Initially, we only build for compute capability 7.0. Later, we build for all architectures from Pascal onwards, excluding architectures for embedded GPUs and very special compute capabilities such as 9.0a. That is, the list of compute capabilities could be 6.0,6.1,7.0,7.5,8.0,8.6,8.9,9.0.
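The selection rule described above can be sketched in Python. This is only an illustration: the candidate list and the set of embedded-GPU capabilities are assumptions written by hand (they are not queried from `nvcc`), and the helper names are hypothetical. The comma-separated form is the one used in this PR; the semicolon-separated form is what PyTorch's `TORCH_CUDA_ARCH_LIST` accepts.

```python
# Illustrative candidate list (assumption, not queried from the CUDA toolkit).
CANDIDATES = ["5.0", "5.2", "5.3", "6.0", "6.1", "6.2", "7.0", "7.2",
              "7.5", "8.0", "8.6", "8.7", "8.9", "9.0", "9.0a"]
# Capabilities assumed to belong to embedded (Jetson-class) GPUs.
EMBEDDED = {"5.3", "6.2", "7.2", "8.7"}


def select_compute_capabilities(candidates, minimum=(6, 0), embedded=EMBEDDED):
    """Keep capabilities from Pascal (6.0) on, minus embedded and suffixed ones."""
    selected = []
    for cc in candidates:
        if not cc.replace(".", "").isdigit():
            continue  # skip special variants such as "9.0a"
        if cc in embedded:
            continue  # skip embedded GPUs
        major, minor = (int(part) for part in cc.split("."))
        if (major, minor) < minimum:
            continue  # skip pre-Pascal architectures
        selected.append(cc)
    return selected


def as_comma_list(ccs):
    """Comma-separated form, as written in this PR."""
    return ",".join(ccs)


def as_torch_arch_list(ccs):
    """Semicolon-separated form, as accepted by TORCH_CUDA_ARCH_LIST."""
    return ";".join(ccs)


if __name__ == "__main__":
    print(as_comma_list(select_compute_capabilities(CANDIDATES)))
```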

…-layer into 2023.06-software.eessi.io-cuDNN-8.9.2.26-system

- `EESSI-install-software.sh`
  - use `scripts/gpu_support/nvidia/install_cuda_and_libraries.sh` with `scripts/gpu_support/nvidia/eessi-2023.06-cuda-and-libraries.yml`
- `create_lmodsitepackage.py`
  - consolidate `eessi_{cuda,cudnn}_enabled_load_hook` functions in a single one (`eessi_cuda_and_libraries_enabled_load_hook`)
  - the remaining hook is prepared to easily add new modules, e.g., cuTENSOR
- `eb_hooks.py`
  - put code that iterates over all files replacing non-distributable ones with symlinks into `host_injections` with a common function (`replace_non_distributable_files_with_symlinks`)
- `install_scripts.sh`
  - add files to copy to CVMFS (see `nvidia_files`)
- `scripts/gpu_support/nvidia/install_cuda_and_libraries.sh`
  - improved creation of tmp directory
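To illustrate the idea behind replacing non-distributable files with symlinks, here is a minimal Python sketch. It is not the code in `eb_hooks.py`: the function name `symlink_non_distributable_files`, its signature and the `is_distributable` callback are assumptions made for this example, and `target_root` stands in for a location under `host_injections`.

```python
import os


def symlink_non_distributable_files(install_dir, target_root, is_distributable):
    """Replace each non-distributable file under install_dir by a symlink.

    The symlink points to the same relative path below target_root
    (hypothetically, a directory under host_injections).
    Returns the sorted list of relative paths that were replaced.
    """
    replaced = []
    for dirpath, _dirnames, filenames in os.walk(install_dir):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            rel_path = os.path.relpath(full_path, install_dir)
            # leave existing symlinks and distributable files untouched
            if os.path.islink(full_path) or is_distributable(rel_path):
                continue
            os.remove(full_path)
            os.symlink(os.path.join(target_root, rel_path), full_path)
            replaced.append(rel_path)
    return sorted(replaced)
```

The symlinks are dangling until the host site provides the real files under the target location, which matches the intent that such files are not shipped in the repository itself.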

eessi-bot · 2024-05-24T07:22:15Z

Instance eessi-bot-mc-aws is configured to build:

arch x86_64/generic for repo eessi-hpc.org-2023.06-compat
arch x86_64/generic for repo eessi-hpc.org-2023.06-software
arch x86_64/generic for repo eessi.io-2023.06-compat
arch x86_64/generic for repo eessi.io-2023.06-software
arch x86_64/intel/haswell for repo eessi-hpc.org-2023.06-compat
arch x86_64/intel/haswell for repo eessi-hpc.org-2023.06-software
arch x86_64/intel/haswell for repo eessi.io-2023.06-compat
arch x86_64/intel/haswell for repo eessi.io-2023.06-software
arch x86_64/intel/skylake_avx512 for repo eessi-hpc.org-2023.06-compat
arch x86_64/intel/skylake_avx512 for repo eessi-hpc.org-2023.06-software
arch x86_64/intel/skylake_avx512 for repo eessi.io-2023.06-compat
arch x86_64/intel/skylake_avx512 for repo eessi.io-2023.06-software
arch x86_64/amd/zen2 for repo eessi-hpc.org-2023.06-compat
arch x86_64/amd/zen2 for repo eessi-hpc.org-2023.06-software
arch x86_64/amd/zen2 for repo eessi.io-2023.06-compat
arch x86_64/amd/zen2 for repo eessi.io-2023.06-software
arch x86_64/amd/zen3 for repo eessi-hpc.org-2023.06-compat
arch x86_64/amd/zen3 for repo eessi-hpc.org-2023.06-software
arch x86_64/amd/zen3 for repo eessi.io-2023.06-compat
arch x86_64/amd/zen3 for repo eessi.io-2023.06-software
arch aarch64/generic for repo eessi-hpc.org-2023.06-compat
arch aarch64/generic for repo eessi-hpc.org-2023.06-software
arch aarch64/generic for repo eessi.io-2023.06-compat
arch aarch64/generic for repo eessi.io-2023.06-software
arch aarch64/neoverse_n1 for repo eessi-hpc.org-2023.06-compat
arch aarch64/neoverse_n1 for repo eessi-hpc.org-2023.06-software
arch aarch64/neoverse_n1 for repo eessi.io-2023.06-compat
arch aarch64/neoverse_n1 for repo eessi.io-2023.06-software
arch aarch64/neoverse_v1 for repo eessi-hpc.org-2023.06-compat
arch aarch64/neoverse_v1 for repo eessi-hpc.org-2023.06-software
arch aarch64/neoverse_v1 for repo eessi.io-2023.06-compat
arch aarch64/neoverse_v1 for repo eessi.io-2023.06-software

eessi-bot · 2024-05-24T07:22:16Z

Instance eessi-bot-mc-azure is configured to build:

arch x86_64/amd/zen4 for repo eessi-hpc.org-2023.06-compat
arch x86_64/amd/zen4 for repo eessi-hpc.org-2023.06-software
arch x86_64/amd/zen4 for repo eessi.io-2023.06-compat
arch x86_64/amd/zen4 for repo eessi.io-2023.06-software

trz42 · 2024-05-24T07:24:00Z

We run a first attempt without doing any modifications (e.g., to work around issues)...

bot: build inst:aws repo:eessi.io-2023.06-software arch:zen2

eessi-bot · 2024-05-24T07:24:03Z

Updates by the bot instance eessi-bot-mc-aws (click for details)

received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42
- expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:
- submitted job 11348, for details & status see [WIP] {2023.06}[foss/2023a] PyTorch v2.1.2 with CUDA/12.1.1 #586 (comment)

eessi-bot · 2024-05-24T07:24:04Z

Updates by the bot instance eessi-bot-mc-azure (click for details)

received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42
- expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:
- no jobs were submitted
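The two bot responses above show the abbreviated command being expanded and only the matching instance submitting a job. A rough Python sketch of that behaviour follows. It is a guess at the logic, not the bot's implementation: the function names are hypothetical, and matching the `instance` filter as a substring of the instance name is an assumption.

```python
# Abbreviations as they appear in the "expanded format" lines of the bot output.
ABBREVIATIONS = {"inst": "instance", "repo": "repository", "arch": "architecture"}


def expand_command(command):
    """Expand abbreviated filter keys, e.g. 'inst:aws' -> 'instance:aws'."""
    action, *args = command.split()
    expanded = [action]
    for arg in args:
        key, _, value = arg.partition(":")
        expanded.append(f"{ABBREVIATIONS.get(key, key)}:{value}")
    return " ".join(expanded)


def instance_matches(expanded_command, instance_name):
    """Return True if the instance filter (if any) matches this bot instance.

    Assumption: the filter value is matched as a substring of the name.
    """
    for arg in expanded_command.split()[1:]:
        key, _, value = arg.partition(":")
        if key == "instance" and value not in instance_name:
            return False
    return True


if __name__ == "__main__":
    cmd = expand_command("build inst:aws repo:eessi.io-2023.06-software arch:zen2")
    print(cmd)
    for name in ("eessi-bot-mc-aws", "eessi-bot-mc-azure"):
        print(name, "submits job" if instance_matches(cmd, name) else "no jobs were submitted")
```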

eessi-bot · 2024-05-24T07:24:07Z

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_586/11348

failed with

You requested to load UCX-CUDA  which relies on the CUDA runtime environment
and driver libraries. In order to be able to use the module, you will need to
make sure EESSI can find the GPU driver libraries on your host system.
For more information on how to do this, see https://www.eessi.io/docs/gpu/.

While processing the following module(s):
    Module fullname                             Module Filename
    ---------------                             ---------------
    UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1  /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/modules/all/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1.lua

we also need the changes from Allow overriding the Lmod GPU driver check #579

date	job status	comment
May 24 07:24:07 UTC 2024	submitted	job id `11348` awaits release by job manager
May 24 07:24:09 UTC 2024	released	job awaits launch by Slurm scheduler
May 24 07:25:11 UTC 2024	running	job `11348` is running
May 24 07:39:24 UTC 2024	finished	😢 FAILURE
Details
- ✅ job output file `slurm-11348.out`
- ❌ found message matching `ERROR:`
- ❌ found message matching `FAILED:`
- ❌ found message matching `required modules missing:`
- ✅ found message(s) matching `No missing installations`
- ✅ found message matching `.tar.gz created!`
Artefacts
- `eessi-2023.06-software-linux-x86_64-amd-zen2-1716535927.tar.gz`
  - size: 698 MiB (732486169 bytes)
  - entries: 75
  - modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
    - `cuDNN/8.9.2.26-CUDA-12.1.1.lua`
  - software under 2023.06/software/linux/x86_64/amd/zen2/software
    - `cuDNN/8.9.2.26-CUDA-12.1.1`
  - other under 2023.06/software/linux/x86_64/amd/zen2
    - `2023.06/init/easybuild/eb_hooks.py`
    - `2023.06/scripts/gpu_support/nvidia/eessi-2023.06-cuda-and-libraries.yml`
    - `2023.06/scripts/gpu_support/nvidia/install_cuda_and_libraries.sh`
    - `.lmod/SitePackage.lua`
May 24 07:39:25 UTC 2024	test result	😁 SUCCESS
ReFrame Summary
- [ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
- ✅ job output file `slurm-11348.out`
- ❌ found message matching `ERROR:`
- ✅ no message matching `[\sFAILED\s].Ran . test case`
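The job status reported above is derived from messages found in the job output file. The following Python sketch mimics that kind of check using the patterns visible in the report. How the bot actually combines them is not shown here, so the rule "success only if no error pattern is found and all required patterns are present" is an assumption, as are the names used.

```python
# Patterns taken from the bot report above.
ERROR_PATTERNS = ["ERROR:", "FAILED:", "required modules missing:"]
REQUIRED_PATTERNS = ["No missing installations", ".tar.gz created!"]


def check_job_output(text):
    """Classify job output as SUCCESS or FAILURE.

    Returns (status, error patterns found, required patterns missing).
    Assumption: any error pattern or any missing required pattern means FAILURE.
    """
    found_errors = [p for p in ERROR_PATTERNS if p in text]
    missing_required = [p for p in REQUIRED_PATTERNS if p not in text]
    status = "SUCCESS" if not found_errors and not missing_required else "FAILURE"
    return status, found_errors, missing_required


if __name__ == "__main__":
    sample = "No missing installations\n/tmp/example.tar.gz created!\n"
    print(check_job_output(sample))
```

Under this rule a job can still produce a tarball and be marked as failed, which is consistent with the reports in this thread where `.tar.gz created!` was found but `ERROR:` was found as well.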

trz42 · 2024-05-24T08:07:22Z

Building after applying the changes provided by #579...

bot: build inst:aws repo:eessi.io-2023.06-software arch:zen2

eessi-bot · 2024-05-24T08:07:26Z

Updates by the bot instance eessi-bot-mc-aws (click for details)

received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42
- expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:
- submitted job 11349, for details & status see [WIP] {2023.06}[foss/2023a] PyTorch v2.1.2 with CUDA/12.1.1 #586 (comment)

eessi-bot · 2024-05-24T08:07:26Z

Updates by the bot instance eessi-bot-mc-azure (click for details)

received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42
- expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:
- no jobs were submitted

eessi-bot · 2024-05-24T08:07:30Z

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_586/11349

failed with the same error (possibly because the environment variable EESSI_OVERRIDE_GPU_CHECK is not set or not passed through to the Prefix shell)

You requested to load UCX-CUDA  which relies on the CUDA runtime environment
and driver libraries. In order to be able to use the module, you will need to
make sure EESSI can find the GPU driver libraries on your host system.
For more information on how to do this, see https://www.eessi.io/docs/gpu/.

While processing the following module(s):
    Module fullname                             Module Filename
    ---------------                             ---------------
    UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1  /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/modules/all/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1.lua

need to add some code for passing that environment variable into the Prefix shell (see 58120d2)
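One way to pass a variable such as `EESSI_OVERRIDE_GPU_CHECK` into a nested shell is to prepend `export` statements to the command that is handed to that shell. The sketch below shows the idea in Python; the actual change lives in a shell script, so this is only a model of the approach. The function name and the tuple of variable names are assumptions.

```python
import shlex

# Variables to pass through; only the one named in this PR is listed.
PASSTHROUGH_VARS = ("EESSI_OVERRIDE_GPU_CHECK",)


def prepend_exports(command, environ, names=PASSTHROUGH_VARS):
    """Prefix command with 'export NAME=value;' for each variable set in environ."""
    exports = [
        f"export {name}={shlex.quote(environ[name])};"
        for name in names
        if name in environ
    ]
    return " ".join(exports + [command])


if __name__ == "__main__":
    print(prepend_exports("./EESSI-install-software.sh", {"EESSI_OVERRIDE_GPU_CHECK": "1"}))
```

Variables that are not set in the calling environment are simply left out, so the nested shell sees the same "unset" state as the caller.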

date	job status	comment
May 24 08:07:29 UTC 2024	submitted	job id `11349` awaits release by job manager
May 24 08:08:30 UTC 2024	released	job awaits launch by Slurm scheduler
May 24 08:09:32 UTC 2024	running	job `11349` is running
May 24 08:23:46 UTC 2024	finished	😢 FAILURE
Details
- ✅ job output file `slurm-11349.out`
- ❌ found message matching `ERROR:`
- ❌ found message matching `FAILED:`
- ❌ found message matching `required modules missing:`
- ✅ found message(s) matching `No missing installations`
- ✅ found message matching `.tar.gz created!`
Artefacts
- `eessi-2023.06-software-linux-x86_64-amd-zen2-1716538578.tar.gz`
  - size: 698 MiB (732497279 bytes)
  - entries: 75
  - modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
    - `cuDNN/8.9.2.26-CUDA-12.1.1.lua`
  - software under 2023.06/software/linux/x86_64/amd/zen2/software
    - `cuDNN/8.9.2.26-CUDA-12.1.1`
  - other under 2023.06/software/linux/x86_64/amd/zen2
    - `2023.06/init/easybuild/eb_hooks.py`
    - `2023.06/scripts/gpu_support/nvidia/eessi-2023.06-cuda-and-libraries.yml`
    - `2023.06/scripts/gpu_support/nvidia/install_cuda_and_libraries.sh`
    - `.lmod/SitePackage.lua`
May 24 08:23:46 UTC 2024	test result	😁 SUCCESS
ReFrame Summary
- [ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
- ✅ job output file `slurm-11349.out`
- ❌ found message matching `ERROR:`
- ✅ no message matching `[\sFAILED\s].Ran . test case`

trz42 · 2024-05-24T09:04:03Z

Trying again...

bot: build inst:aws repo:eessi.io-2023.06-software arch:zen2

eessi-bot · 2024-05-24T09:04:05Z

Updates by the bot instance eessi-bot-mc-aws (click for details)

received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42
- expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:
- submitted job 11357, for details & status see [WIP] {2023.06}[foss/2023a] PyTorch v2.1.2 with CUDA/12.1.1 #586 (comment)

eessi-bot · 2024-05-24T09:04:06Z

Updates by the bot instance eessi-bot-mc-azure (click for details)

received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42
- expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:
- no jobs were submitted

trz42 · 2024-05-25T05:05:18Z

Cleaned up the code for creating/updating the Lmod configuration files (lmodrc.lua and SitePackage.lua) and reinstated the setting of EESSI_OVERRIDE_GPU_CHECK=1 (but moved it to bot/build.sh). With this, the build should be able to load CUDA/12.1.1 and continue with the installation. We can let it run until it possibly hits the linker error.

bot: build inst:aws repo:eessi.io-2023.06-software arch:zen2

eessi-bot · 2024-05-25T05:05:21Z

Updates by the bot instance eessi-bot-mc-aws (click for details)

received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42
- expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:
- submitted job 11430, for details & status see [WIP] {2023.06}[foss/2023a] PyTorch v2.1.2 with CUDA/12.1.1 #586 (comment)

eessi-bot · 2024-05-25T05:05:21Z

Updates by the bot instance eessi-bot-mc-azure (click for details)

received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42
- expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:
- no jobs were submitted

eessi-bot · 2024-05-25T05:05:25Z

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_586/11430

this hit a different error (it might be a fluke, because the next job went on to build magma successfully):

[ 10%] Building CUDA object CMakeFiles/magma.dir/magmablas/zlarfbx.cu.o
/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/CUDA/12.1.1/bin/nvcc -forward-unknown-to-host-compiler -DMAGMA_CUDA_ARCH_MIN=700 -DMAGMA_HAVE_CUDA=1 --options-file CMakeFiles/magma.dir/includes_CUDA.rsp -O3 -DNDEBUG -std=c++17 --generate-code=arch=compute_52,code=[compute_52,sm_52] -Xcompiler=-fPIC --compiler-options -fPIC, -MD -MT CMakeFiles/magma.dir/magmablas/zlarfbx.cu.o -MF CMakeFiles/magma.dir/magmablas/zlarfbx.cu.o.d -x cu -c /tmp/bot/easybuild/build/magma/2.7.2/foss-2023a-CUDA-12.1.1/magma-2.7.2/magmablas/zlarfbx.cu -o CMakeFiles/magma.dir/magmablas/zlarfbx.cu.o
sh: 1: cudafe++: not found
make[2]: *** [CMakeFiles/magma.dir/build.make:4390: CMakeFiles/magma.dir/magmablas/zlacpy_sym_in.cu.o] Error 127
make[2]: *** Waiting for unfinished jobs....
sh: 1: cudafe++: not found
make[2]: *** [CMakeFiles/magma.dir/build.make:4405: CMakeFiles/magma.dir/magmablas/zlacpy_sym_out.cu.o] Error 127
sh: 1: cudafe++: not found
make[2]: *** [CMakeFiles/magma.dir/build.make:4420: CMakeFiles/magma.dir/magmablas/zlag2c.cu.o] Error 127
sh: 1: cudafe++: not found
make[2]: *** [CMakeFiles/magma.dir/build.make:4435: CMakeFiles/magma.dir/magmablas/clag2z.cu.o] Error 127
In file included from /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/CUDA/12.1.1/include/cuda_fp16.h:4019,
                 from /opt/eessi/2023.06/software/linux/x86_64/amd/zen2/software/CUDA/12.1.1/targets/x86_64-linux/include/cublas_api.h:77,
                 from /opt/eessi/2023.06/software/linux/x86_64/amd/zen2/software/CUDA/12.1.1/targets/x86_64-linux/include/cublas_v2.h:69,
                 from /tmp/bot/easybuild/build/magma/2.7.2/foss-2023a-CUDA-12.1.1/magma-2.7.2/include/magma_types.h:72,
                 from /tmp/bot/easybuild/build/magma/2.7.2/foss-2023a-CUDA-12.1.1/magma-2.7.2/include/magma_copy.h:12,
                 from /tmp/bot/easybuild/build/magma/2.7.2/foss-2023a-CUDA-12.1.1/magma-2.7.2/include/magmablas.h:12,
                 from /tmp/bot/easybuild/build/magma/2.7.2/foss-2023a-CUDA-12.1.1/magma-2.7.2/include/magma_v2.h:22,
                 from /tmp/bot/easybuild/build/magma/2.7.2/foss-2023a-CUDA-12.1.1/magma-2.7.2/control/magma_internal.h:63,
                 from /tmp/bot/easybuild/build/magma/2.7.2/foss-2023a-CUDA-12.1.1/magma-2.7.2/magmablas/zlanhe.cu:12:
/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/CUDA/12.1.1/include/cuda_fp16.hpp:65:10: fatal error: nv/target: No such file or directory
   65 | #include <nv/target>
      |          ^~~~~~~~~~~
compilation terminated.
make[2]: *** [CMakeFiles/magma.dir/build.make:4465: CMakeFiles/magma.dir/magmablas/zlanhe.cu.o] Error 1
sh: 1: cudafe++: not found

date	job status	comment
May 25 05:05:24 UTC 2024	submitted	job id `11430` awaits release by job manager
May 25 05:06:38 UTC 2024	released	job awaits launch by Slurm scheduler
May 25 05:07:57 UTC 2024	running	job `11430` is running
May 25 05:40:33 UTC 2024	finished	😢 FAILURE
Details
- ✅ job output file `slurm-11430.out`
- ❌ found message matching `ERROR:`
- ❌ found message matching `FAILED:`
- ❌ found message matching `required modules missing:`
- ✅ found message(s) matching `No missing installations`
- ✅ found message matching `.tar.gz created!`
Artefacts
- `eessi-2023.06-software-linux-x86_64-amd-zen2-1716615134.tar.gz`
  - size: 698 MiB (732495353 bytes)
  - entries: 75
  - modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
    - `cuDNN/8.9.2.26-CUDA-12.1.1.lua`
  - software under 2023.06/software/linux/x86_64/amd/zen2/software
    - `cuDNN/8.9.2.26-CUDA-12.1.1`
  - other under 2023.06/software/linux/x86_64/amd/zen2
    - `2023.06/init/easybuild/eb_hooks.py`
    - `2023.06/scripts/gpu_support/nvidia/eessi-2023.06-cuda-and-libraries.yml`
    - `2023.06/scripts/gpu_support/nvidia/install_cuda_and_libraries.sh`
    - `.lmod/SitePackage.lua`
May 25 05:40:33 UTC 2024	test result	😁 SUCCESS
ReFrame Summary
- [ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
- ✅ job output file `slurm-11430.out`
- ❌ found message matching `ERROR:`
- ✅ no message matching `[\sFAILED\s].Ran . test case`

trz42 · 2024-05-25T05:24:10Z

Commented out code (in eessi_container.sh) that used different "tricks" to disable the GPU check in the Lmod hook. It should still work because we set EESSI_OVERRIDE_GPU_CHECK in bot/build.sh (and we pass it through into the Prefix shell in run_in_compat_layer_env.sh).

bot: build inst:aws repo:eessi.io-2023.06-software arch:zen2

eessi-bot · 2024-05-25T05:24:12Z

Updates by the bot instance eessi-bot-mc-aws (click for details)

received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42
- expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:
- submitted job 11431, for details & status see [WIP] {2023.06}[foss/2023a] PyTorch v2.1.2 with CUDA/12.1.1 #586 (comment)

eessi-bot · 2024-05-25T05:24:13Z

Updates by the bot instance eessi-bot-mc-azure (click for details)

received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42
- expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:
- no jobs were submitted

eessi-bot · 2024-05-25T05:24:16Z

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_586/11431

failed with the same ld error as in [WIP] {2023.06}[foss/2023a] PyTorch v2.1.2 with CUDA/12.1.1 #586 (comment)

date	job status	comment
May 25 05:24:16 UTC 2024	submitted	job id `11431` awaits release by job manager
May 25 05:25:06 UTC 2024	released	job awaits launch by Slurm scheduler
May 25 05:31:02 UTC 2024	running	job `11431` is running
May 25 07:19:16 UTC 2024	finished	😢 FAILURE
Details
- ✅ job output file `slurm-11431.out`
- ❌ found message matching `ERROR:`
- ❌ found message matching `FAILED:`
- ❌ found message matching `required modules missing:`
- ✅ found message(s) matching `No missing installations`
- ✅ found message matching `.tar.gz created!`
Artefacts
- `eessi-2023.06-software-linux-x86_64-amd-zen2-1716621051.tar.gz`
  - size: 1000 MiB (1049201591 bytes)
  - entries: 188
  - modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
    - `cuDNN/8.9.2.26-CUDA-12.1.1.lua`
    - `magma/2.7.2-foss-2023a-CUDA-12.1.1.lua`
  - software under 2023.06/software/linux/x86_64/amd/zen2/software
    - `cuDNN/8.9.2.26-CUDA-12.1.1`
    - `magma/2.7.2-foss-2023a-CUDA-12.1.1`
  - other under 2023.06/software/linux/x86_64/amd/zen2
    - `2023.06/init/easybuild/eb_hooks.py`
    - `2023.06/scripts/gpu_support/nvidia/eessi-2023.06-cuda-and-libraries.yml`
    - `2023.06/scripts/gpu_support/nvidia/install_cuda_and_libraries.sh`
    - `.lmod/SitePackage.lua`
May 25 07:19:16 UTC 2024	test result	😁 SUCCESS
ReFrame Summary
- [ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
- ✅ job output file `slurm-11431.out`
- ❌ found message matching `ERROR:`
- ✅ no message matching `[\sFAILED\s].Ran . test case`

trz42 · 2024-05-25T08:49:26Z

Added a pre_configure hook that adds the missing directory containing libcupti.so.12 to LIBRARY_PATH. The build should now proceed further, maybe even finish...
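The core of such a hook is prepending one directory to `LIBRARY_PATH` without duplicating it. A minimal Python sketch of that step follows. The helper names are hypothetical, and the location `extras/CUPTI/lib64` below the CUDA installation root is an assumption about where `libcupti.so.12` lives; how the resulting value is wired into the EasyBuild configure step in `eb_hooks.py` is not shown.

```python
import os


def cupti_lib_dir(cuda_root):
    """Return the assumed CUPTI library directory below a CUDA installation."""
    return os.path.join(cuda_root, "extras", "CUPTI", "lib64")


def prepend_to_path_var(environ, name, directory):
    """Prepend directory to the path-like variable name in environ (no duplicates).

    Returns the new value of the variable.
    """
    entries = [entry for entry in environ.get(name, "").split(os.pathsep) if entry]
    if directory not in entries:
        entries.insert(0, directory)
    environ[name] = os.pathsep.join(entries)
    return environ[name]


if __name__ == "__main__":
    # EBROOTCUDA is set by EasyBuild-generated CUDA modules; the fallback path is made up.
    cuda_root = os.environ.get("EBROOTCUDA", "/path/to/CUDA/12.1.1")
    env = dict(os.environ)
    print(prepend_to_path_var(env, "LIBRARY_PATH", cupti_lib_dir(cuda_root)))
```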

bot: build inst:aws repo:eessi.io-2023.06-software arch:zen2

eessi-bot · 2024-05-25T08:49:28Z

Updates by the bot instance eessi-bot-mc-aws (click for details)

received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42
- expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:
- submitted job 11432, for details & status see [WIP] {2023.06}[foss/2023a] PyTorch v2.1.2 with CUDA/12.1.1 #586 (comment)

eessi-bot · 2024-05-25T08:49:29Z

Updates by the bot instance eessi-bot-mc-azure (click for details)

received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42
- expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:
- no jobs were submitted

eessi-bot · 2024-05-25T08:49:32Z

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_586/11432

failed due to an indentation error in eb_hooks.py

date	job status	comment
May 25 08:49:31 UTC 2024	submitted	job id `11432` awaits release by job manager
May 25 08:50:44 UTC 2024	released	job awaits launch by Slurm scheduler
May 25 08:55:05 UTC 2024	running	job `11432` is running
May 25 09:05:20 UTC 2024	finished	😢 FAILURE
Details
- ✅ job output file `slurm-11432.out`
- ❌ found message matching `ERROR:`
- ✅ no message matching `FAILED:`
- ✅ no message matching `required modules missing:`
- ❌ no message matching `No missing installations`
- ✅ found message matching `.tar.gz created!`
Artefacts
- `eessi-2023.06-software-linux-x86_64-amd-zen2-1716627633.tar.gz`
  - size: 0 MiB (15695 bytes)
  - entries: 4
  - modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
    - no module files in tarball
  - software under 2023.06/software/linux/x86_64/amd/zen2/software
    - no software packages in tarball
  - other under 2023.06/software/linux/x86_64/amd/zen2
    - `2023.06/init/easybuild/eb_hooks.py`
    - `2023.06/scripts/gpu_support/nvidia/eessi-2023.06-cuda-and-libraries.yml`
    - `2023.06/scripts/gpu_support/nvidia/install_cuda_and_libraries.sh`
    - `.lmod/SitePackage.lua`
May 25 09:05:20 UTC 2024	test result	😁 SUCCESS
ReFrame Summary
- [ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
- ✅ job output file `slurm-11432.out`
- ❌ found message matching `ERROR:`
- ✅ no message matching `[\sFAILED\s].Ran . test case`

trz42 · 2024-05-25T09:11:18Z

Try again...

bot: build inst:aws repo:eessi.io-2023.06-software arch:zen2

eessi-bot · 2024-05-25T09:11:21Z

Updates by the bot instance eessi-bot-mc-aws (click for details)

received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42
- expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:
- submitted job 11433, for details & status see [WIP] {2023.06}[foss/2023a] PyTorch v2.1.2 with CUDA/12.1.1 #586 (comment)

eessi-bot · 2024-05-25T09:11:21Z

Updates by the bot instance eessi-bot-mc-azure (click for details)

received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42
- expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:
- no jobs were submitted

eessi-bot · 2024-05-25T09:11:25Z

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_586/11433

date	job status	comment
May 25 09:11:24 UTC 2024	submitted	job id `11433` awaits release by job manager
May 25 09:12:41 UTC 2024	released	job awaits launch by Slurm scheduler
May 25 09:14:01 UTC 2024	running	job `11433` is running
May 25 18:49:05 UTC 2024	finished	😁 SUCCESS
Details
- ✅ job output file `slurm-11433.out`
- ✅ no message matching `ERROR:`
- ✅ no message matching `FAILED:`
- ✅ no message matching `required modules missing:`
- ✅ found message(s) matching `No missing installations`
- ✅ found message matching `.tar.gz created!`
Artefacts
- `eessi-2023.06-software-linux-x86_64-amd-zen2-1716662368.tar.gz`
  - size: 1208 MiB (1267078707 bytes)
  - entries: 12927
  - modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
    - `cuDNN/8.9.2.26-CUDA-12.1.1.lua`
    - `magma/2.7.2-foss-2023a-CUDA-12.1.1.lua`
    - `PyTorch/2.1.2-foss-2023a-CUDA-12.1.1.lua`
  - software under 2023.06/software/linux/x86_64/amd/zen2/software
    - `cuDNN/8.9.2.26-CUDA-12.1.1`
    - `magma/2.7.2-foss-2023a-CUDA-12.1.1`
    - `PyTorch/2.1.2-foss-2023a-CUDA-12.1.1`
  - other under 2023.06/software/linux/x86_64/amd/zen2
    - `2023.06/init/easybuild/eb_hooks.py`
    - `2023.06/scripts/gpu_support/nvidia/eessi-2023.06-cuda-and-libraries.yml`
    - `2023.06/scripts/gpu_support/nvidia/install_cuda_and_libraries.sh`
    - `.lmod/SitePackage.lua`
May 25 18:49:05 UTC 2024	test result	😁 SUCCESS
ReFrame Summary
- [ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
- ✅ job output file `slurm-11433.out`
- ✅ no message matching `ERROR:`
- ✅ no message matching `[\sFAILED\s].Ran . test case`

trz42 · 2024-05-25T19:27:33Z

Now, try building for multiple compute capabilities (6.0,6.1,7.0,7.5,8.0,8.6,8.9,9.0)...
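Building for several compute capabilities means the CUDA compiler gets one code-generation flag per capability. The sketch below turns the list above into flags in the same form as the `--generate-code=arch=compute_52,code=[compute_52,sm_52]` flag visible in the magma build log earlier in this thread. The function name is hypothetical, and the build systems involved generate these flags themselves; this only shows the mapping.

```python
def generate_code_flags(compute_capabilities):
    """Map '6.0,7.0' style input to one --generate-code flag per capability."""
    flags = []
    for cc in compute_capabilities.split(","):
        arch = cc.strip().replace(".", "")  # "7.0" -> "70"
        flags.append(
            f"--generate-code=arch=compute_{arch},code=[compute_{arch},sm_{arch}]"
        )
    return flags


if __name__ == "__main__":
    for flag in generate_code_flags("6.0,6.1,7.0,7.5,8.0,8.6,8.9,9.0"):
        print(flag)
```

Each extra capability adds device code to the resulting binaries, which is one plausible reason the tarball for this multi-capability build is larger than the one for compute capability 7.0 alone.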

bot: build inst:aws repo:eessi.io-2023.06-software arch:zen2

eessi-bot · 2024-05-25T19:27:35Z

Updates by the bot instance eessi-bot-mc-aws (click for details)

received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42
- expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:
- submitted job 11437, for details & status see [WIP] {2023.06}[foss/2023a] PyTorch v2.1.2 with CUDA/12.1.1 #586 (comment)

eessi-bot · 2024-05-25T19:27:36Z

Updates by the bot instance eessi-bot-mc-azure (click for details)

received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42
- expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:
- no jobs were submitted

eessi-bot · 2024-05-25T19:27:39Z

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_586/11437

date	job status	comment
May 25 19:27:39 UTC 2024	submitted	job id `11437` awaits release by job manager
May 25 19:28:29 UTC 2024	released	job awaits launch by Slurm scheduler
May 25 19:29:49 UTC 2024	running	job `11437` is running
May 26 06:30:09 UTC 2024	finished	😁 SUCCESS
Details
- ✅ job output file `slurm-11437.out`
- ✅ no message matching `ERROR:`
- ✅ no message matching `FAILED:`
- ✅ no message matching `required modules missing:`
- ✅ found message(s) matching `No missing installations`
- ✅ found message matching `.tar.gz created!`
Artefacts
- `eessi-2023.06-software-linux-x86_64-amd-zen2-1716704446.tar.gz`
  - size: 1634 MiB (1714103717 bytes)
  - entries: 12927
  - modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
    - `cuDNN/8.9.2.26-CUDA-12.1.1.lua`
    - `magma/2.7.2-foss-2023a-CUDA-12.1.1.lua`
    - `PyTorch/2.1.2-foss-2023a-CUDA-12.1.1.lua`
  - software under 2023.06/software/linux/x86_64/amd/zen2/software
    - `cuDNN/8.9.2.26-CUDA-12.1.1`
    - `magma/2.7.2-foss-2023a-CUDA-12.1.1`
    - `PyTorch/2.1.2-foss-2023a-CUDA-12.1.1`
  - other under 2023.06/software/linux/x86_64/amd/zen2
    - `2023.06/init/easybuild/eb_hooks.py`
    - `2023.06/scripts/gpu_support/nvidia/eessi-2023.06-cuda-and-libraries.yml`
    - `2023.06/scripts/gpu_support/nvidia/install_cuda_and_libraries.sh`
    - `.lmod/SitePackage.lua`
May 26 06:30:09 UTC 2024	test result	😁 SUCCESS
ReFrame Summary
- [ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
- ✅ job output file `slurm-11437.out`
- ✅ no message matching `ERROR:`
- ✅ no message matching `[\sFAILED\s].Ran . test case`

trz42 · 2024-06-18T09:45:50Z

Currently not actively being worked on, because we first need to rework/implement support for building for GPUs, which also depends on support for dev.eessi.io.

truib added 11 commits May 17, 2024 11:13

{2023.06}[system] cuDNN/8.9.2.26-CUDA-12.1.1

6664591

add x permissions to install script

0d744e7

fix arguments to cuDNN install script

0d8a896

handle multiple dependencies to CUDA and related packages

bd3469e

fix syntax

db85e23

generalized CUDA/libraries installation script and easystack file

3aad5d9

small improvements after testing script

e191905

Merge branch '2023.06-software.eessi.io' of github-trz:EESSI/software…

12fcec5

…-layer into 2023.06-software.eessi.io-cuDNN-8.9.2.26-system

don't copy removed file

7cd0d00

{2023.06}[foss/2023a] PyTorch v2.1.2 with CUDA/12.1.1

2008870

trz42 added the help wanted (Extra attention is needed), 2023.06-software.eessi.io (2023.06 version of software.eessi.io) and accel:nvidia labels May 24, 2024

trz42 marked this pull request as draft May 24, 2024 07:24

add changes from EESSI#579

68aa119

make sure envvar is set in prefix environment

58120d2

truib added 2 commits May 25, 2024 06:57

redo code to create/update Lmod rc/SitePackage files

ba8cc7d

move setting EESSI_OVERRIDE_GPU_CHECK to bot/build.sh

c1a7590

comment code for 'tricks' to override GPU checks

e020246

add pre_configure hook that adds CUPTI lib dir

2e78b71

fix indention error

2af6b1a

trz42 mentioned this pull request May 25, 2024

{2023.06}[foss/2023a] PyTorch v2.1.2 with CUDA/12.1.1 NorESSI/software-layer#369

Merged

support multiple compute capabilities

e05eafb

trz42 mentioned this pull request May 27, 2024

Allow overriding the Lmod GPU driver check #579

Merged

trz42 removed the help wanted (Extra attention is needed) label Jun 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] {2023.06}[foss/2023a] PyTorch v2.1.2 with CUDA/12.1.1 #586

[WIP] {2023.06}[foss/2023a] PyTorch v2.1.2 with CUDA/12.1.1 #586


