Create_tarball might not include Lmod cache and config files correctly for GPU builds #722

Closed
casparvl opened this issue Sep 19, 2024 · 7 comments · Fixed by #744

Comments

@casparvl
Collaborator

I see that new files were created during the install step:

>> Creating/updating Lmod RC file...
/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/.lmod/lmodrc.lua
ESC[32m/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/.lmod/lmodrc.lua createdESC[0m
>> Creating/updating Lmod SitePackage.lua ...
/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/.lmod/SitePackage.lua
ESC[32m/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/.lmod/SitePackage.lua createdESC[0m

But the create_tarball script only looks at the .lmod directory directly under the CPU prefix:

if [ -d ${eessi_version}/software/${os}/${cpu_arch_subdir}/.lmod ]; then
    # include Lmod cache and configuration file (lmodrc.lua),
    # skip whiteout files and backup copies of Lmod cache (spiderT.old.*)
    find ${eessi_version}/software/${os}/${cpu_arch_subdir}/.lmod -type f | egrep -v '/\.wh\.|spiderT.old' >> ${files_list}
fi
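A minimal sketch of how the same filter could also be applied to an accelerator prefix. The function name `list_lmod_files` and the variable `accel_subdir` are hypothetical and not taken from the actual create_tarball script. The sketch assumes `accel_subdir` holds something like `accel/nvidia/cc80`, or is empty for CPU-only builds.

```shell
#!/usr/bin/env bash
# Sketch only: 'list_lmod_files' and 'accel_subdir' are hypothetical names.

# Print all regular files under <dir>/.lmod, skipping whiteout files
# and backup copies of the Lmod cache (spiderT.old.*).
# Prints nothing if <dir>/.lmod does not exist.
list_lmod_files() {
    local dir="$1"
    if [ -d "${dir}/.lmod" ]; then
        # '|| true' keeps the function from failing when grep filters out everything
        find "${dir}/.lmod" -type f | grep -E -v '/\.wh\.|spiderT\.old' || true
    fi
}

# Possible usage, with the variables from the snippet above:
#   cpu_prefix="${eessi_version}/software/${os}/${cpu_arch_subdir}"
#   list_lmod_files "${cpu_prefix}" >> "${files_list}"
#   if [ -n "${accel_subdir}" ]; then
#       list_lmod_files "${cpu_prefix}/${accel_subdir}" >> "${files_list}"
#   fi
```

Because `find` is rooted at `<prefix>/.lmod`, the CPU call never descends into `accel/`, so the accelerator prefix needs its own call.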

@ocaisa
Member

ocaisa commented Sep 24, 2024

So this needs an adjustment in the check for creating these files; in fact, they shouldn't be touched at all by GPU PRs.

@ocaisa
Member

ocaisa commented Sep 24, 2024

Well, I think there's no check: we just create the files, and if they differ they are picked up. So we just need to chomp off accel/nvidia/cc80 from the target path (if it is present).
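For illustration, chomping off the accelerator part of a directory prefix can be done with plain shell parameter expansion. This is a sketch only; `strip_accel_subdir` is a hypothetical helper name, and it assumes the accelerator part always starts at a path component named `accel`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: drop a trailing accelerator subdirectory
# (e.g. /accel/nvidia/cc80) from a directory prefix, if present.
strip_accel_subdir() {
    local path="$1"
    # ${path%%/accel/*} removes the longest suffix matching "/accel/*";
    # paths without "/accel/" are returned unchanged.
    echo "${path%%/accel/*}"
}

strip_accel_subdir "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80"
# -> /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2
```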

@ocaisa
Member

ocaisa commented Sep 25, 2024

@casparvl Looking into recent accel PRs, I don't see these files appearing...so perhaps there has already been a PR to fix this?

@bedroge
Collaborator

bedroge commented Sep 27, 2024

We also discussed this in the call on Wednesday, and I think we agreed that we do want to make these files for every accelerator? Besides having accelerator-specific hooks (which then won't be considered at all for other CPU/GPU targets), it also allows us to build an accelerator-specific Lmod cache, that we can enable by adding the lmodrc.lua file to $LMOD_RC.
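A rough sketch of what enabling such an accelerator-specific lmodrc.lua could look like in an init script. `add_accel_lmodrc` is a hypothetical helper, not something that exists in EESSI. The sketch assumes `$LMOD_RC` is treated as a colon-separated list of files, and it makes no claim about which entry takes precedence.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append <prefix>/.lmod/lmodrc.lua to $LMOD_RC,
# but only if that file actually exists.
add_accel_lmodrc() {
    local lmodrc="$1/.lmod/lmodrc.lua"
    if [ -f "${lmodrc}" ]; then
        # Only insert the ':' separator when LMOD_RC is already set and non-empty
        export LMOD_RC="${LMOD_RC:+${LMOD_RC}:}${lmodrc}"
    fi
}

# Possible usage (the prefix is taken from the logs in this issue):
#   add_accel_lmodrc "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80"
```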

@bedroge
Copy link
Collaborator

bedroge commented Sep 27, 2024

So that would mean that instead of merging #744, we would need to make sure that these files get created and that they're included in the tarball.

For the ESPResSO builds from PR #748, I do still see this output in the slurm log:

>> Creating/updating Lmod RC file...
/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/.lmod/lmodrc.lua
ESC[32m/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/.lmod/lmodrc.lua createdESC[0m
>> Creating/updating Lmod SitePackage.lua ...
/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/.lmod/SitePackage.lua
ESC[32m/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/.lmod/SitePackage.lua createdESC[0m

So it looks like they are still being created. Looking at the installation script, I think it only does it at the end (here: https://github.com/EESSI/software-layer/blob/2023.06-software.eessi.io/EESSI-install-software.sh#L312), but not at the beginning (here: https://github.com/EESSI/software-layer/blob/2023.06-software.eessi.io/EESSI-install-software.sh#L155). Not sure if that's an issue.

I'll open a PR to make sure that these files are included in the tarball.

@bedroge
Collaborator

bedroge commented Sep 27, 2024

I checked the contents of the files that were generated by the ESPResSO job:

$ cat software.eessi.io/overlay-upper/versions/2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/.lmod/lmodrc.lua 
propT = {
}
scDescriptT = {
    {
        ["dir"] = "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/.lmod/cache",
        ["timestamp"] = "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/.lmod/cache/timestamp",
    },
}

That's perfect, and would allow us to generate a cache for the GPU builds.

The SitePackage.lua is just an exact copy of all the other ones. That's also expected, but probably not what we want? @casparvl what was your idea here? Can you even somehow stack multiple SitePackage.lua files on top of each other? Or do you want to let the CPU one include the one for the GPU?

@casparvl
Collaborator Author

casparvl commented Sep 30, 2024

Well, Lmod cannot handle multiple SitePackage.lua files. But you can import those manually, i.e. you'd point LMOD_PACKAGE_PATH to the 'real' SitePackage.lua, and then have that one 'import' (dofile, as it's called in Lua) whatever other SitePackage.lua files you want.

We do this for allowing host-specific hooks, see e.g.

local archSitePackage = archHostInjections .. "/.lmod/SitePackage.lua"

That's how we can import from two specific file locations; see the docs on this: https://www.eessi.io/docs/site_specific_config/lmod_hooks/#location-of-the-hooks

But, the order in which we do this is important. We first register the EESSI hook

hook.register("load", eessi_load_hook)

Then, we load the site-specific hooks

load_site_specific_hooks()

This allows sites to override what we do in EESSI.

Thinking about the GPU part... The CPU SitePackage.lua is always there, so I guess that should be the one on LMOD_PACKAGE_PATH. Optionally, we could run a dofile on a SitePackage.lua under the accelerator prefix (accel/.../.lmod/SitePackage.lua). But that would definitely not be the same SitePackage.lua file, only something that contains additions to the SitePackage.lua in the CPU prefix path...

So it probably requires a small modification to our current create_lmodsitepackage.py to have the CPU SitePackage.lua include the GPU one, plus separate logic for creating a GPU SitePackage.lua (or an extra argument, e.g. create_lmodsitepackage.py --cpu or --gpu, if you prefer).
