Fixes NOT_RESPONDING being kept during shutdown #503
Merged
Conversation
changed RESUME to POWER_DOWN and removed delete call which is now handled via Slurm that calls terminate.sh
jkrue approved these changes on Jun 3, 2024
jkrue added a commit that referenced this pull request on Sep 6, 2024
* fixed rule setting for security groups * fixed multiple network is now list causing error bugs. * trying to figure out why route applying only works once. * Added more echo's for better debugging. * updated most tests * fixed validate_configuration.py tests. * Updated tests for startup.py * fixed bug in terminate that caused assume_yes to work as assume_no * updated terminate_cluster tests. * fixed formatting improved pylint * adapted tests * updated return threading test * updated provider_handler * tests not finished yet * Fixed server regex issue * test list clusters updated * fixed too open cluster_id regex * added missing "to" * fixed id_generation tests * renamed configuration handler to please linter * removed unnecessary tests and updated remaining * fixed remaining "subnet list gets handled as a single subnet" bug and finalized multiple routes handling. * updated tests not finished yet * improved code style * fixed tests further. One to fix left. * fixed additional tests * fixed all tests for ansible configurator * fixed comment * fixed multiple tests * fixed a few tests * Fixed create * fixed some issues regarding * fixing test_provider.py * removed infrastructure_cloud.yml * minor fixes * fixed all tests * removed print * changed prints to log * removed log * fixed None bug where [] is expected when no sshPublicKeyFile is given. * removed master from compute if use master as compute is false * reconstructured role additional in order to make it easier to include. Added quotes for consistency. * Updated all tests (#448) * updated most tests * fixed validate_configuration.py tests. * Updated tests for startup.py * fixed bug in terminate that caused assume_yes to work as assume_no * updated terminate_cluster tests. * fixed formatting improved pylint * adapted tests * updated return threading test * updated provider_handler * tests not finished yet * Fixed server regex issue * test list clusters updated * fixed too open cluster_id regex * added missing "to" * fixed id_generation tests * renamed configuration handler to please linter * removed unnecessary tests and updated remaining * updated tests not finished yet * improved code style * fixed tests further. One to fix left. * fixed additional tests * fixed all tests for ansible configurator * fixed comment * fixed multiple tests * fixed a few tests * Fixed create * fixed some issues regarding * fixing test_provider.py * removed infrastructure_cloud.yml * minor fixes * fixed all tests * removed print * changed prints to log * removed log * Introduced yaml lock (#464) * removed unnecessary close * simplified update_hosts * updated logging to separate folder and file based on creation date * many small changes and introducing locks * restructured log files again. Removed outdated key warnings from bibigrid.yml * added a few logs * further improved logging hierarchy * Added specific folder places for temporary job storage. This might solve the "SlurmSpoolDir full" bug. * Improved logging * Tried to fix temps and tried update to 23.11 but has errors so commented that part out * added initial space * added existing worker deletion on worker startup if worker already exists as no worker would've been started if Slurm would've known about the existing worker. This is not the best solution. 
(#468) * made waitForServices a cloud specific key (#465) * Improved log messages in validate_configuration.py to make fixing your configuration easier when using a hybrid-/multi-cloud setup (#466) * removed unnecessary line in provider.py and added cloud information to every log in validate_configuration.py for easier fixing. * track resources for providers separately to make quota checking precise * switched from low level cinder to high level block_storage.get_limits() * added keyword for ssh_timeout and improved argument passing for ssh. * Update issue templates * fixed a missing LOG * removed overwritten variable instantiation * Update bug_report.md * removed trailing whitespaces * added comment about sshTimeout key * Create dependabot.yml (#479) * Code cleanup and minor improvement (#482) * fixed :param and :return to @param and @return * many spelling mistakes fixed * added bibigrid_version to common configuration * added timeout to common_configuration * removed debug verbosity and improved log message wording * fixed is_active structure * fixed pip dependabot.yml * added documentation. Changed timeout to 2**(2+attempts) to decrease number of unlikely to work attempts * 474 allow non on demandpermanent workers (#487) * added worker server start without anything else * added host entry for permanent workers * added state unknown for permanent nodes * added on_demand key for groups and instances for ansible templating * fixed wording * temporary solution for custom execute list * added documentation for onDemand * added ansible.cfg replacement * fixed path. Added ansible.cfg to the gitignore * updated default creation and gitignore. Fixed non-vital bug that didn't reset hosts for new cluster start. * Code cleanup (#490) * fixed :param and :return to @param and @return * many spelling mistakes fixed * added bibigrid_version to common configuration * attempted zabbix linting fix. Needs testing. * fixed double import * Slurm upgrade fixes (#473) * removed slurm errors * added bibilog to show output log of most recent worker start. Tried fixing the slurm23.11 bug. * fixed a few vpnwkr -> vpngtw remnants. Excluded vpngtw from slurm setup * improved comments regarding changes and versions * removed cgroupautomount as it is defunct * Moved explicit slurm start to avoid errors caused by resume and suspend programs not being copied to their final location yet * added word for clarification * Fixed non-fatal bug that lead to non 0 exits on runs without any error. 
* changed slurm apt package to slurm-bibigrid * set version to 23.11.* * added a few more checks to make sure everything is set up before installing packages * Added configuration pinning * changed ignore_error to failed_when false * fixed or ignored lint fatals * Update tests (#493) * updated tests * removed print * updated tests * updated tests * fixed too loose condition * updated tests * added cloudScheduling and userRoles in bibigrid.yml * added userRoles in documentation * added varsFiles and comments * added folder path in documentation * fixed naming * added that vars are optional * polished userRoles documentation * 439 additional ansible roles (#495) * added roles structure * updated roles_path * fixed upper lower case * improved customRole implementation * minor fixes regarding role_paths * improved variable naming of user_roles * added documentation for other configurations * added new feature keys * fixed template files not being j2 * added helpful comments and removed no longer used roles/additional/ * userRoles crashes if no role set * fixed ansible.cfg path '"' * implemented partition system * added keys customAnsibleCfg and customSlurmConf as keys that stop the automatic copying * improved spacing * added logging * updated documentation * updated tests. Improved formatting * fix for service being too fast for startup * fixed remote src * changed RESUME to POWER_DOWN and removed delete call which is now handled via Slurm that calls terminate.sh (#503) * Update check (#499) * updated validate_configuration.py in order to provide schema validation. Moved cloud_identifier setting even closer to program start in order to be able to log better when performing other actions than create. * small log change and fix of schema key vpnInstance * updated tests * removed no longer relevant test * added schema validation tests * fixed ftype. Errors with multiple volumes. * made automount bound to defined mountPoints and therefore customizable * added empty line and updated bibigrid.yml * fixed nfsshare regex error and updated check to fit to the new name mountpoint pattern * hotfix: folder creation now before accessing hosts.yml * fixed tests * moved dnsmasq installation infront of /etc/resolv removal * fixed tests * fixed nfs exports by removing unnecessary "/" at the beginning * fixed master running slurmd but not being listed in slurm.conf. Now set to drained. * improved logging * increased timeout. Corrected comment in slurm.j2 * updated info regarding timeouts (changed from 4 to 5). * added SuspendTimeout as optional to elastic_scheduling * updated documentation * permission fix * fixes #394 * fixes #394 (also for hybrid cluster) * increased ResumeTimeout by 5 minutes. yml to yaml * changed all yml to yaml (as preferred by yaml) * updated timeouts. updated tests * fixes #394 - remove host from zabbix when terminated * zabbix api no longer used when not set in configuration * pleased linting by using false instead of no * added logging of traceroute even if debug flag is not set when error is not known. Added a few other logs * Update action 515 (#516) * configuration update possible 515 * added experimental * fixed indentation * fixed missing newline at EOF. Summarized restarts. * added check for running workers * fixed multiple workers due to faulty update * updated tests and removed done todos * updated documentation * removed print * Added apt-reactivate-auto-update to reactivate updates at the end of the playbook run (#518) * changed theia to 900. 
Added apt-reactivate-auto-update as new 999. * added new line at end of file * changed list representation * added multiple configuration keys for boot volume handling * updated documentation * updated documentation for new volumes and for usually ignored keys * updated and added tests --------- Co-authored-by: Jan Krueger <jkrueger@cebitec.uni-bielefeld.de>
jkrue added a commit that referenced this pull request on Sep 26, 2024
* fixed rule setting for security groups * fixed multiple network is now list causing error bugs. * trying to figure out why route applying only works once. * Added more echo's for better debugging. * updated most tests * fixed validate_configuration.py tests. * Updated tests for startup.py * fixed bug in terminate that caused assume_yes to work as assume_no * updated terminate_cluster tests. * fixed formatting improved pylint * adapted tests * updated return threading test * updated provider_handler * tests not finished yet * Fixed server regex issue * test list clusters updated * fixed too open cluster_id regex * added missing "to" * fixed id_generation tests * renamed configuration handler to please linter * removed unnecessary tests and updated remaining * fixed remaining "subnet list gets handled as a single subnet" bug and finalized multiple routes handling. * updated tests not finished yet * improved code style * fixed tests further. One to fix left. * fixed additional tests * fixed all tests for ansible configurator * fixed comment * fixed multiple tests * fixed a few tests * Fixed create * fixed some issues regarding * fixing test_provider.py * removed infrastructure_cloud.yml * minor fixes * fixed all tests * removed print * changed prints to log * removed log * fixed None bug where [] is expected when no sshPublicKeyFile is given. * removed master from compute if use master as compute is false * reconstructured role additional in order to make it easier to include. Added quotes for consistency. * Updated all tests (#448) * updated most tests * fixed validate_configuration.py tests. * Updated tests for startup.py * fixed bug in terminate that caused assume_yes to work as assume_no * updated terminate_cluster tests. * fixed formatting improved pylint * adapted tests * updated return threading test * updated provider_handler * tests not finished yet * Fixed server regex issue * test list clusters updated * fixed too open cluster_id regex * added missing "to" * fixed id_generation tests * renamed configuration handler to please linter * removed unnecessary tests and updated remaining * updated tests not finished yet * improved code style * fixed tests further. One to fix left. * fixed additional tests * fixed all tests for ansible configurator * fixed comment * fixed multiple tests * fixed a few tests * Fixed create * fixed some issues regarding * fixing test_provider.py * removed infrastructure_cloud.yml * minor fixes * fixed all tests * removed print * changed prints to log * removed log * Introduced yaml lock (#464) * removed unnecessary close * simplified update_hosts * updated logging to separate folder and file based on creation date * many small changes and introducing locks * restructured log files again. Removed outdated key warnings from bibigrid.yml * added a few logs * further improved logging hierarchy * Added specific folder places for temporary job storage. This might solve the "SlurmSpoolDir full" bug. * Improved logging * Tried to fix temps and tried update to 23.11 but has errors so commented that part out * added initial space * added existing worker deletion on worker startup if worker already exists as no worker would've been started if Slurm would've known about the existing worker. This is not the best solution. 
(#468) * made waitForServices a cloud specific key (#465) * Improved log messages in validate_configuration.py to make fixing your configuration easier when using a hybrid-/multi-cloud setup (#466) * removed unnecessary line in provider.py and added cloud information to every log in validate_configuration.py for easier fixing. * track resources for providers separately to make quota checking precise * switched from low level cinder to high level block_storage.get_limits() * added keyword for ssh_timeout and improved argument passing for ssh. * Update issue templates * fixed a missing LOG * removed overwritten variable instantiation * Update bug_report.md * removed trailing whitespaces * added comment about sshTimeout key * Create dependabot.yml (#479) * Code cleanup and minor improvement (#482) * fixed :param and :return to @param and @return * many spelling mistakes fixed * added bibigrid_version to common configuration * added timeout to common_configuration * removed debug verbosity and improved log message wording * fixed is_active structure * fixed pip dependabot.yml * added documentation. Changed timeout to 2**(2+attempts) to decrease number of unlikely to work attempts * 474 allow non on demandpermanent workers (#487) * added worker server start without anything else * added host entry for permanent workers * added state unknown for permanent nodes * added on_demand key for groups and instances for ansible templating * fixed wording * temporary solution for custom execute list * added documentation for onDemand * added ansible.cfg replacement * fixed path. Added ansible.cfg to the gitignore * updated default creation and gitignore. Fixed non-vital bug that didn't reset hosts for new cluster start. * Code cleanup (#490) * fixed :param and :return to @param and @return * many spelling mistakes fixed * added bibigrid_version to common configuration * attempted zabbix linting fix. Needs testing. * fixed double import * Slurm upgrade fixes (#473) * removed slurm errors * added bibilog to show output log of most recent worker start. Tried fixing the slurm23.11 bug. * fixed a few vpnwkr -> vpngtw remnants. Excluded vpngtw from slurm setup * improved comments regarding changes and versions * removed cgroupautomount as it is defunct * Moved explicit slurm start to avoid errors caused by resume and suspend programs not being copied to their final location yet * added word for clarification * Fixed non-fatal bug that lead to non 0 exits on runs without any error. 
* changed slurm apt package to slurm-bibigrid * set version to 23.11.* * added a few more checks to make sure everything is set up before installing packages * Added configuration pinning * changed ignore_error to failed_when false * fixed or ignored lint fatals * Update tests (#493) * updated tests * removed print * updated tests * updated tests * fixed too loose condition * updated tests * added cloudScheduling and userRoles in bibigrid.yml * added userRoles in documentation * added varsFiles and comments * added folder path in documentation * fixed naming * added that vars are optional * polished userRoles documentation * 439 additional ansible roles (#495) * added roles structure * updated roles_path * fixed upper lower case * improved customRole implementation * minor fixes regarding role_paths * improved variable naming of user_roles * added documentation for other configurations * added new feature keys * fixed template files not being j2 * added helpful comments and removed no longer used roles/additional/ * userRoles crashes if no role set * fixed ansible.cfg path '"' * implemented partition system * added keys customAnsibleCfg and customSlurmConf as keys that stop the automatic copying * improved spacing * added logging * updated documentation * updated tests. Improved formatting * fix for service being too fast for startup * fixed remote src * changed RESUME to POWER_DOWN and removed delete call which is now handled via Slurm that calls terminate.sh (#503) * Update check (#499) * updated validate_configuration.py in order to provide schema validation. Moved cloud_identifier setting even closer to program start in order to be able to log better when performing other actions than create. * small log change and fix of schema key vpnInstance * updated tests * removed no longer relevant test * added schema validation tests * fixed ftype. Errors with multiple volumes. * made automount bound to defined mountPoints and therefore customizable * added empty line and updated bibigrid.yml * fixed nfsshare regex error and updated check to fit to the new name mountpoint pattern * hotfix: folder creation now before accessing hosts.yml * fixed tests * moved dnsmasq installation infront of /etc/resolv removal * fixed tests * fixed nfs exports by removing unnecessary "/" at the beginning * fixed master running slurmd but not being listed in slurm.conf. Now set to drained. * improved logging * increased timeout. Corrected comment in slurm.j2 * updated info regarding timeouts (changed from 4 to 5). * added SuspendTimeout as optional to elastic_scheduling * updated documentation * permission fix * fixes #394 * fixes #394 (also for hybrid cluster) * increased ResumeTimeout by 5 minutes. yml to yaml * changed all yml to yaml (as preferred by yaml) * updated timeouts. updated tests * fixes #394 - remove host from zabbix when terminated * zabbix api no longer used when not set in configuration * pleased linting by using false instead of no * added logging of traceroute even if debug flag is not set when error is not known. Added a few other logs * Update action 515 (#516) * configuration update possible 515 * added experimental * fixed indentation * fixed missing newline at EOF. Summarized restarts. * added check for running workers * fixed multiple workers due to faulty update * updated tests and removed done todos * updated documentation * removed print * Added apt-reactivate-auto-update to reactivate updates at the end of the playbook run (#518) * changed theia to 900. 
Added apt-reactivate-auto-update as new 999. * added new line at end of file * changed list representation * added multiple configuration keys for boot volume handling * updated documentation * updated documentation for new volumes and for usually ignored keys * updated and added tests * Pleasing Dependabot * Linting now uses python 3.10 * added early termination when configuration file not found * added dontUploadCredentials documentation * fixed broken links * added dontUploadCredentials to schema valiation * fixed dontUploadCredential ansible start bug * prevented BiBiGrid from looking for other keys if created key doesn't work to spot key issues earlier * prevented BiBiGrid from looking for other keys if created key doesn't work to spot key issues earlier * updated requirements.txt * restricted clouds.yaml access * moved openstack credentials permission change to server only * added '' to 3.10 * converted implicit to explicit octet notation * added "" and fixed a few more implicit octets * added "" * added missing " * added allow_agent=False to further prevent BiBiGrid from looking for keys * removed hardcoded /vol/ * updated versions * removed unnecessary comments and commented out Workflow execution --------- Co-authored-by: Jan Krueger <jkrueger@cebitec.uni-bielefeld.de>
Instead of shutting down the nodes manually in the fail script and setting the node back to RESUME, the fail script now sets the node state to POWER_DOWN, which causes Slurm to automatically call terminate.sh, which then terminates the node.
This seems to prevent the NOT_RESPONDING flag. In any case, it involves Slurm more closely in the shutdown process and is therefore probably the better solution.
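For illustration, a minimal sketch of what the fail script's new behaviour could look like; the variable name, reason text, and paths are assumptions for this example, not the repository's actual script:

```bash
#!/bin/bash
# Hypothetical sketch of the fail script change (not the project's actual code).
# Slurm invokes its ResumeFailProgram with the list of failed nodes as the first
# argument, e.g. "bibigrid-worker0-[0-2]"; the variable name is illustrative.
FAILED_NODES="$1"

# Old behaviour: terminate the servers here and set the nodes back to RESUME.
# New behaviour: only mark the nodes POWER_DOWN; Slurm then runs its
# SuspendProgram (terminate.sh), which performs the actual termination.
scontrol update NodeName="$FAILED_NODES" State=POWER_DOWN Reason="resume failed"
```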
This should be tested with a larger run.
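For context on why POWER_DOWN ends up calling terminate.sh: the coupling lives in Slurm's power-saving settings. A hedged slurm.conf excerpt with assumed paths, script names, and timeout values (not the values from the project's slurm.j2 template) might look like this:

```
# Illustrative power-saving / cloud-scheduling settings; paths and values are assumptions.
SuspendProgram=/opt/slurm/terminate.sh   # run for nodes entering POWER_DOWN; terminates the instance
ResumeProgram=/opt/slurm/create.sh       # run when a powered-down node is requested again
ResumeFailProgram=/opt/slurm/fail.sh     # run when a resume attempt exceeds ResumeTimeout
SuspendTimeout=60
ResumeTimeout=1200
```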