{2023.06}[system] cuDNN/8.9.2.26-CUDA-12.1.1 #581
bot: build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2
eb_hooks.py
Outdated
# iterate over all files in the CUDA installation directory
for dir_path, _, files in os.walk(self.installdir):
    for filename in files:
        full_path = os.path.join(dir_path, filename)
        # we only really care about real files, i.e. not symlinks
        if not os.path.islink(full_path):
            # check if the current file is part of the allowlist
            basename = filename.split('.')[0]
            if '.' in filename:
                extension = '.' + filename.split('.')[1]
            if basename in allowlist:
                self.log.debug("%s is found in allowlist, so keeping it: %s", basename, full_path)
            elif '.' in filename and extension in allowlist:
                self.log.debug("%s is found in allowlist, so keeping it: %s", extension, full_path)
            else:
                self.log.debug("%s is not found in allowlist, so replacing it with symlink: %s",
                               filename, full_path)
                # if it is not in the allowlist, delete the file and create a symlink to host_injections
                host_inj_path = full_path.replace('versions', 'host_injections')
                # make sure source and target of symlink are not the same
                if full_path == host_inj_path:
                    raise EasyBuildError("Source (%s) and target (%s) are the same location, are you sure you "
                                         "are using this hook for a NESSI installation?",
                                         full_path, host_inj_path)
                remove_file(full_path)
                symlink(host_inj_path, full_path)
Isn't this identical to what is done for CUDA? We should probably just create a function that takes the installdir and allowlist as arguments and does this.
Good point.
Actually there are subtle differences. For CUDA, the EULA/README lists files you can distribute. For cuDNN the LICENSE lists what type of files you can distribute. These differences require small modifications. For example, in the hook for CUDA we have:
    basename = filename.split('.')[0]
    if basename in allowlist:
        self.log.debug("%s is found in allowlist, so keeping it: %s", basename, full_path)
    else:
        self.log.debug("%s is not found in allowlist, so replacing it with symlink: %s",
                       basename, full_path)
For cuDNN, we have
    basename = filename.split('.')[0]
    if '.' in filename:
        extension = '.' + filename.split('.')[1]
    if basename in allowlist:
        self.log.debug("%s is found in allowlist, so keeping it: %s", basename, full_path)
    elif '.' in filename and extension in allowlist:
        self.log.debug("%s is found in allowlist, so keeping it: %s", extension, full_path)
    else:
        self.log.debug("%s is not found in allowlist, so replacing it with symlink: %s",
                       filename, full_path)
Anyhow, the differences are relatively small, so a function would require a parameter that allows it to distinguish between CUDA and cuDNN (and in the future maybe other packages such as cuTENSOR).
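A minimal sketch of what such a shared function could look like, assuming a boolean parameter selects the cuDNN-style extension matching. The name `replace_non_distributable_files_with_symlinks` is the one mentioned later in this PR, but the signature, the `extension_based` flag and the return value are hypothetical. To stay self-contained, the sketch uses `os.remove`, `os.symlink` and `ValueError` in place of the `remove_file`, `symlink` and `EasyBuildError` used in the hook.

```python
import os


def replace_non_distributable_files_with_symlinks(installdir, allowlist, extension_based=False):
    """Replace every real file that is not covered by the allowlist with a
    symlink to the corresponding path under host_injections.

    extension_based is a hypothetical flag: when True (cuDNN-style), a file is
    also kept if its second dot-separated field (e.g. '.h') is in the allowlist.
    Returns the list of paths that were replaced.
    """
    replaced = []
    for dir_path, _, files in os.walk(installdir):
        for filename in files:
            full_path = os.path.join(dir_path, filename)
            # only real files matter, existing symlinks are left alone
            if os.path.islink(full_path):
                continue
            parts = filename.split('.')
            basename = parts[0]
            extension = '.' + parts[1] if len(parts) > 1 else None
            if basename in allowlist:
                continue
            if extension_based and extension in allowlist:
                continue
            host_inj_path = full_path.replace('versions', 'host_injections')
            # make sure source and target of the symlink are not the same
            if full_path == host_inj_path:
                raise ValueError("Source (%s) and target (%s) are the same location"
                                 % (full_path, host_inj_path))
            os.remove(full_path)
            os.symlink(host_inj_path, full_path)
            replaced.append(full_path)
    return replaced
```

With this shape, the CUDA hook would call the function with its file-name allowlist and the default flag, while the cuDNN hook would pass `extension_based=True` together with an allowlist that mixes names and extensions.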
For the extension part, perhaps we should split on all . and look for the last non-numeric entry (which should be the extension)? I can imagine there could be files like libcuda.so.520.12.1
Scratch that, you already have a good solution: you are taking the second entry, which is virtually guaranteed to be the extension.
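For comparison, the two approaches discussed here can be sketched as small helpers. Both function names are hypothetical and not part of `eb_hooks.py`; the first mirrors the `filename.split('.')[1]` logic in the diff, the second implements the "last non-numeric entry" idea.

```python
def second_field_extension(filename):
    """Approach used in the hook: take the second dot-separated field.
    Returns None when the filename contains no dot."""
    parts = filename.split('.')
    return '.' + parts[1] if len(parts) > 1 else None


def last_non_numeric_extension(filename):
    """Alternative from the review: take the last dot-separated field
    (after the basename) that is not purely numeric.
    Returns None when no such field exists."""
    for part in reversed(filename.split('.')[1:]):
        if not part.isdigit():
            return '.' + part
    return None
```

For `libcuda.so.520.12.1` both helpers return `.so`. They only disagree on names such as `libfoo.1.so`, where the first returns `.1` and the second returns `.so`.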
Added a function that implements the suggestion in 74a9a55
eb_hooks.py
Outdated
        ec_dict['builddependencies'].append(dep)
    value = '\n'.join([value, 'setenv("EESSICUDNNVERSION","%s")' % cudnn_version])
    if key in ec_dict:
        if not value in ec_dict[key]:
This check is probably no longer good enough: we're looking for the exact string, but that is not likely to exist (even though the add_property("arch","gpu") most likely does exist, since the applications should also have a CUDA dep). What we really need to do is:
- Grab what is there already
- Split it on \n
- Add any missing elements
- Put it back together again and replace it
Either this, or only modify/add the modluafooter once in the entire function.
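The four steps above can be sketched as a small helper. The function name `merge_modluafooter` and its signature are hypothetical; the sketch assumes the footer is a plain newline-separated string, as in the diff.

```python
def merge_modluafooter(existing, new_lines):
    """Merge new footer lines into an existing modluafooter string.

    existing:  current footer text (may be empty or None)
    new_lines: list of individual Lua statements to ensure are present
    Each line ends up in the result exactly once, in first-seen order.
    """
    # grab what is there already and split it on newlines
    lines = existing.split('\n') if existing else []
    # add any missing elements
    for line in new_lines:
        if line not in lines:
            lines.append(line)
    # put it back together again
    return '\n'.join(lines)
```

Called with a footer that already holds the CUDA entries and the two cuDNN statements, this keeps a single `add_property("arch","gpu")` line and appends only the `setenv("EESSICUDNNVERSION",...)` line.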
You mean the if not value in ec_dict[key] is not good enough?
No, because the value is a composite string of property and the setenv, and the property will already (very likely) exist from the CUDA part of this hook
The module file for cuDNN contains the following:
-- Built with EasyBuild version 4.9.1
add_property("arch","gpu")
setenv("EESSICUDAVERSION","12.1.1")
For something that builds on top of cuDNN, we would need the above and something like
setenv("EESSICUDNNVERSION","8.9.2.26")
As currently implemented, for something that builds on top of cuDNN I believe you will have
-- Built with EasyBuild version 4.9.1
add_property("arch","gpu")
setenv("EESSICUDAVERSION","12.1.1")
add_property("arch","gpu")
setenv("EESSICUDNNVERSION","8.9.2.26")
as it will see if the entire string add_property("arch","gpu")\nsetenv("EESSICUDNNVERSION","8.9.2.26") is in the footer
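The duplication can be reproduced in isolation. This is a sketch that assumes the branch guarded by the check appends `value` to the existing footer; that part of the hook is not shown in the diff above.

```python
# footer as already written by the CUDA part of the hook
footer = '\n'.join(['add_property("arch","gpu")',
                    'setenv("EESSICUDAVERSION","12.1.1")'])

# composite value built for the cuDNN part
value = '\n'.join(['add_property("arch","gpu")',
                   'setenv("EESSICUDNNVERSION","8.9.2.26")'])

# substring check on the whole composite string: it is not found,
# even though its first line is already present in the footer
if value not in footer:
    footer = '\n'.join([footer, value])  # assumed append step

print(footer)
```

The printed footer contains `add_property("arch","gpu")` twice, which matches the module file listed above.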
Ack. Working on something to implement the desired footer (and avoiding duplication of code).
Updated the function. It still produces the same footer for cuDNN. I guess a real test would be a build that uses cuDNN. @ocaisa can you check if the function looks better now?
Retry after fixing args to bot: build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2
@trz42 The installation looks suspiciously large at 700MB, are you sure your hook is cleaning out the files it should?
Full package is 1.4 GB.
Rebuild after changing hook function that handles dependencies and creates modluafooter entries...
bot: build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2
One more time...
bot: build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2
@trz42 I will take your updated
I also get the feeling that if we are going to move to easystack files (a good idea) then we should probably ship the ones we expect people to use
Just updated the script with some improvements/fixes after my own testing.
…-layer into 2023.06-software.eessi.io-cuDNN-8.9.2.26-system
- `EESSI-install-software.sh`
  - use `scripts/gpu_support/nvidia/install_cuda_and_libraries.sh` with `scripts/gpu_support/nvidia/eessi-2023.06-cuda-and-libraries.yml`
- `create_lmodsitepackage.py`
  - consolidate `eessi_{cuda,cudnn}_enabled_load_hook` functions in a single one (`eessi_cuda_and_libraries_enabled_load_hook`)
  - the remaining hook is prepared to easily add new modules, e.g., cuTENSOR
- `eb_hooks.py`
  - put code that iterates over all files replacing non-distributable ones with symlinks into `host_injections` with a common function (`replace_non_distributable_files_with_symlinks`)
- `install_scripts.sh`
  - add files to copy to CVMFS (see `nvidia_files`)
- `scripts/gpu_support/nvidia/install_cuda_and_libraries.sh`
  - improved creation of tmp directory
Run another build after several changes...
bot: build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2
@trz42 Can we close this now?
PR merged! Moved
requires:
Attempt to add cuDNN, which is a dependency of other packages such as TensorFlow and PyTorch.
Major additions/changes:
- `scripts/gpu_support/nvidia/install_cuda_and_libraries.sh` with `scripts/gpu_support/nvidia/eessi-2023.06-cuda-and-libraries.yml`
  - … `CUDA` and `cuDNN` packages under `.../host_injections`
- `EESSI-install-software.sh`
  - use `scripts/gpu_support/nvidia/install_cuda_and_libraries.sh` with `scripts/gpu_support/nvidia/eessi-2023.06-cuda-and-libraries.yml` to install `CUDA`, `cuDNN` under `.../host_injections`
- `eb_hooks.py`
  - put code that iterates over all files replacing non-distributable ones with symlinks into `host_injections` with a common function (`replace_non_distributable_files_with_symlinks`)
  - … `post_sanitycheck_hook` which replaces files with symlinks into corresponding paths under `.../host_injections` for all files that cannot be redistributed
  - … `cuDNN` to a build dependency (see `inject_gpu_property`)
- `create_lmodsitepackage.py`
  - consolidate `eessi_{cuda,cudnn}_enabled_load_hook` functions in a single one (`eessi_cuda_and_libraries_enabled_load_hook`)
- `install_scripts.sh`
  - add files to copy to CVMFS (see `nvidia_files`)