Codespell: config, pre-commit hook + fixed typos #32

Open. Wants to merge 4 commits into base: master.
4 changes: 4 additions & 0 deletions .codespellrc
@@ -0,0 +1,4 @@
[codespell]
skip = .git,*.pdf,*.svg
#
# ignore-words-list =
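
For reference, a minimal sketch of how this configuration might be exercised locally, assuming a codespell release (2.x or later) that auto-discovers `.codespellrc` in the working directory; the explicit flags simply mirror the keys above:

```sh
# Run from the repository root; recent codespell versions pick up .codespellrc automatically.
codespell

# Equivalent explicit invocation mirroring the config keys above.
codespell --skip=".git,*.pdf,*.svg"

# False positives could be silenced via the (currently commented-out) ignore-words-list key;
# the entries below are hypothetical placeholders:
#   ignore-words-list = nd,ect
```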
5 changes: 5 additions & 0 deletions .pre-commit-config.yaml
@@ -65,3 +65,8 @@ repos:
hooks:
- id: flake8
exclude: ^dm/

- repo: https://github.com/codespell-project/codespell
rev: v2.2.5
hooks:
- id: codespell
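
A hedged sketch of how contributors would typically exercise the new hook, assuming `pre-commit` is already installed:

```sh
# One-time setup: install the git hooks defined in .pre-commit-config.yaml.
pre-commit install

# Run only the new codespell hook across the whole tree.
pre-commit run codespell --all-files

# Or run every configured hook (flake8, codespell, ...) before pushing.
pre-commit run --all-files
```
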
8 changes: 4 additions & 4 deletions CHANGELOG.md
@@ -120,7 +120,7 @@ All notable changes to this project will be documented in this file.
## \[5.6.3\]

- Increase project metadata timeouts.
- Add cluster cloud logging fiter output.
- Add cluster cloud logging filter output.
- Add Ubuntu 22.04 LTS support.
- Add preliminary ARM64 image support for T2A instances.
- Add Debian 11 support.
@@ -140,7 +140,7 @@ All notable changes to this project will be documented in this file.
- startup-script - Add logging level prefixes for parsing.
- Fix slurm and slurm-gcp logs not showing up in Cloud Logging.
- resume.py - No longer validate machine_type with placement groups.
- Raise error from incorrect settings with dependant inputs.
- Raise error from incorrect settings with dependent inputs.
- For gcsfuse network storage, server_ip can be null/None or "".
- Fix munge mount export from controller.
- Enable DebugFlags=Power by default
@@ -349,7 +349,7 @@ All notable changes to this project will be documented in this file.
- Fix potential race condition in loading BQ job data.
- Remove deployment manager support.
- Update Nvidia to 470.82.01 and CUDA to 11.4.4
- Reenable gcsfuse in ansible and workaround the repo gpg check problem
- Re-enable gcsfuse in ansible and workaround the repo gpg check problem

## \[4.1.5\]

@@ -400,7 +400,7 @@ All notable changes to this project will be documented in this file.

## \[4.0.4\]

- Configure sockets, cores, threads on compute nodes for better performace with
- Configure sockets, cores, threads on compute nodes for better performance with
`cons_tres`.

## \[4.0.3\]
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -22,7 +22,7 @@ If you make an automated change (changing a function name, fixing a pervasive
spelling mistake), please send the command/regex used to generate the changes
along with the patch, or note it in the commit message.

While not required, we encourage use of `git format-patch` to geneate the patch.
While not required, we encourage use of `git format-patch` to generate the patch.
This ensures the relevant author line and commit message stay attached. Plain
`diff`'d output is also okay. In either case, please attach them to the bug for
us to review. Spelling corrections or documentation improvements can be
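
As a side note on the `git format-patch` workflow mentioned above, a minimal sketch (the commit range is illustrative):

```sh
# Export the most recent commit as a patch file, keeping author and commit message intact.
git format-patch -1 HEAD

# Or export every commit on the current branch that is not yet on master.
git format-patch master
```
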
2 changes: 1 addition & 1 deletion README.md
@@ -61,7 +61,7 @@ to help you get up and running and stay running.
Issues and/or enhancement requests can be submitted to
[SchedMD's Bugzilla](https://bugs.schedmd.com).

Also, join comunity discussions on either the
Also, join community discussions on either the
[Slurm User mailing list](https://slurm.schedmd.com/mail.html) or the
[Google Cloud & Slurm Community Discussion Group](https://groups.google.com/forum/#!forum/google-cloud-slurm-discuss).

2 changes: 1 addition & 1 deletion ansible/docker-playbook.yml
@@ -50,7 +50,7 @@
msg: >
OS ansible_distribution version ansible_distribution_major_version is not
supported.
Please use a suported OS in list:
Please use a supported OS in list:
- CentOS 7
- Rocky 8
- Debian 10, 11
2 changes: 1 addition & 1 deletion ansible/playbook.yml
@@ -51,7 +51,7 @@
msg: >
OS ansible_distribution version ansible_distribution_major_version is not
supported.
Please use a suported OS in list:
Please use a supported OS in list:
- CentOS 7
- Rocky 8
- Debian 10, 11
2 changes: 1 addition & 1 deletion ansible/roles/google_cloud_ops_agents/defaults/main.yml
@@ -4,5 +4,5 @@ package_state: present
version: latest
# Local path to the config file to copy onto the remote server. Can be absolute or relative.
main_config_file: ''
# Local path to the additonal config directory whose contents will be copied onto the remote server. Can be absolute or relative.
# Local path to the additional config directory whose contents will be copied onto the remote server. Can be absolute or relative.
additional_config_dir: ''
@@ -2,7 +2,7 @@
plugin: gcp_compute
auth_kind: serviceaccount
projects:
# This value will be subsituted by sed command
# This value will be substituted by sed command
- ENTER_PROJECT_NAME
keyed_groups:
# Create groups from GCE labels
2 changes: 1 addition & 1 deletion ansible/tf-playbook.yml
@@ -48,7 +48,7 @@
msg: >
OS ansible_distribution version ansible_distribution_major_version is not
supported.
Please use a suported OS in list:
Please use a supported OS in list:
- CentOS 7
- Rocky 8
- Debian 10, 11
2 changes: 1 addition & 1 deletion docs/cloud.md
@@ -28,7 +28,7 @@ There are two deployment methods for cloud cluster management:

This deployment method leverages
[GCP Marketplace](./glossary.md#gcp-marketplace) to make setting up clusters a
breeze without leaving your browser. While this method is simplier and less
breeze without leaving your browser. While this method is simpler and less
flexible, it is great for exploring what `slurm-gcp` is!

See the [Marketplace Guide](./marketplace.md) for setup instructions and more
8 changes: 4 additions & 4 deletions docs/faq.md
@@ -82,7 +82,7 @@ out. Tickets can be submitted via

### How do I move data for a job?

Data can be migrated to and from external sources using a worflow of dependant
Data can be migrated to and from external sources using a workflow of dependent
jobs. A [workflow submission script](../jobs/submit_workflow.py.py) and
[helper jobs](../jobs/data_migrate/) are provided. See
[README](../jobs/README.md) for more information.
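
The helper script above wraps this pattern; at the plain Slurm level, an equivalent hand-rolled chain of dependent jobs might look like the following sketch (the three script names are hypothetical placeholders):

```sh
# Chain three hypothetical stages with Slurm job dependencies.
stage_in=$(sbatch --parsable stage_in.sh)                          # stage data in
main=$(sbatch --parsable --dependency=afterok:${stage_in} main.sh) # run the actual work
sbatch --dependency=afterok:${main} stage_out.sh                   # stage results out
```
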
@@ -197,8 +197,8 @@ it may be allocated jobs again.
### How do I limit user access to only using login nodes?

By default, all instances are configured with
[OS Login](./glossary.md#os-login). This keeps UID and GID of users consistant
accross all instances and allows easy user control with
[OS Login](./glossary.md#os-login). This keeps UID and GID of users consistent
across all instances and allows easy user control with
[IAM Roles](./glossary.md#iam-roles).

1. Create a group for all users in `admin.google.com`.
@@ -221,7 +221,7 @@ accross all instances and allows easy user control with
1. Select boxes for login nodes
1. Add group as a member with the **IAP-secured Tunnel User** role. Please see
[Enabling IAP for Compute Engine](https://cloud.google.com/iap/docs/enabling-compute-howto)
for mor information.
for more information.
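
As a rough sketch of the IAM step above (the instance name, zone, and group address are placeholders; the role ID shown is the one backing the IAP-secured Tunnel User role):

```sh
# Grant the login-node user group IAP tunnel access to a single login node.
gcloud compute instances add-iam-policy-binding example-login-0 \
  --zone=us-central1-a \
  --member="group:cluster-users@example.com" \
  --role="roles/iap.tunnelResourceAccessor"
```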

### What Slurm image do I use for production?

2 changes: 1 addition & 1 deletion docs/federation.md
@@ -115,7 +115,7 @@ please refer to [multiple-slurmdbd](#multiple-slurmdbd) section.

### Additional Requirements

- User UID and GID are consistant accross all federated clusters.
- User UID and GID are consistent across all federated clusters.

## Multiple Slurmdbd

8 changes: 4 additions & 4 deletions docs/hybrid.md
@@ -26,7 +26,7 @@ This guide focuses on setting up a hybrid [Slurm cluster](./glossary.md#slurm).
With hybrid, there are different challenges and considerations that need to be
taken into account. This guide will cover them and their recommended solutions.

There is a clear seperation of how on-prem and cloud resources are managed
There is a clear separation of how on-prem and cloud resources are managed
within your hybrid cluster. This means that you can modify either side of the
hybrid cluster without disrupting the other side! You manage your on-prem and
our [Slurm cluster module](../terraform/slurm_cluster/README.md) will manage the
@@ -71,7 +71,7 @@ and terminating nodes in the cloud:
- Creates compute node resources based upon Slurm job allocation and
configured compute resources.
- `slurmsync.py`
- Synchronizes the Slurm state and the GCP state, reducing discrepencies from
- Synchronizes the Slurm state and the GCP state, reducing discrepancies from
manual admin activity or other edge cases.
- May update Slurm node states, create or destroy GCP compute resources or
other script managed GCP resources.
@@ -253,7 +253,7 @@ controller to be able to burst into the cloud.

### Manage Secrets

Additionally, [MUNGE](./glossary.md#munge) secrets must be consistant across the
Additionally, [MUNGE](./glossary.md#munge) secrets must be consistent across the
cluster. There are a few safe ways to deal with munge.key distribution:

- Use NFS to mount `/etc/munge` from the controller (default behavior).
@@ -270,7 +270,7 @@ connections to the munge NFS is critical.

- Isolate the cloud compute nodes of the cluster into their own project, VPC,
and subnetworks. Use project or network peering to enable access to other
cloud infrastructure in a controlled mannor.
cloud infrastructure in a controlled manner.
- Setup firewall rules to control ingress and egress to the controller such that
only trusted machines or networks use its NFS.
- Only allow trusted private address (ranges) for communication to the
2 changes: 1 addition & 1 deletion docs/images.md
@@ -116,7 +116,7 @@ custom Slurm image.

### Creation

Install software dependencies and build images from configation.
Install software dependencies and build images from configuration.

See [slurm-gcp packer project](../packer/README.md) for details.

8 changes: 4 additions & 4 deletions docs/tpu.md
@@ -9,7 +9,7 @@
- [Overview](#overview)
- [Supported TPU types](#supported-tpu-types)
- [Supported Tensorflow versions](#supported-tensorflow-versions)
- [Slurm-gcp compatiblity matrix](#slurm-gcp-compatiblity-matrix)
- [Slurm-gcp compatibility matrix](#slurm-gcp-compatibility-matrix)
- [Terraform](#terraform)
- [Quickstart Examples](#quickstart-examples)
- [TPU example job](#tpu-example-job)
@@ -36,7 +36,7 @@ first it is important to take into account the following considerations.
the partition ResumeTimeout and SuspendTimeout that contains TPU nodes.
- Slurm is executed in TPU nodes using a docker container.
- TPU nodes in Slurm will have different name that the one seen in GCP, that is
because TPU names cannot be choosen or known before starting them up.
because TPU names cannot be chosen or known before starting them up.
- python 3.7 or above is needed for the TPU API module to work. In consequence
TPU nodes will not work with all the OS, like for example CentOS 7, see more
in the [compatibility matrix](#slurm-gcp-compatiblity-matrix).
@@ -60,7 +60,7 @@ At this moment the following tensorflow versions are supported:

- 2.12.0

## Slurm-gcp compatiblity matrix
## Slurm-gcp compatibility matrix

Due to the fact that the TPU support has some requirements as having python >=
3.7 installed not all the OS support it, this table can be used to see the
@@ -165,7 +165,7 @@ would be like this:
> sbatch -p normal -N 1 : -p tpu -N 1 sbatch_script.sh

This will allocate a node in the normal partition and a node in the tpu
partition, both in the same heterogenous job.
partition, both in the same heterogeneous job.
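
The same heterogeneous allocation can also be written inside the batch script itself; a sketch assuming a Slurm release that supports the `hetjob` directive (the TPU workload command is a placeholder):

```sh
#!/bin/bash
#SBATCH --partition=normal --nodes=1
#SBATCH hetjob
#SBATCH --partition=tpu --nodes=1

srun --het-group=0 hostname               # runs on the normal-partition node
srun --het-group=1 python tpu_workload.py # placeholder workload on the TPU node
```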

### Static TPU nodes

2 changes: 1 addition & 1 deletion jobs/README.md
@@ -18,7 +18,7 @@ $ sbatch --export=MIGRATE_INPUT=/tmp/seq.txt,MIGRATE_OUTPUT=/tmp/shuffle.txt \
## submit_workflow.py

This script is a runner that submits a sequence of 3 jobs as defined in the
input structured yaml file. The three jobs submitted can be refered to as:
input structured yaml file. The three jobs submitted can be referred to as:
`stage_in`; `main`; and `stage_out`. `stage_in` should move data for `main` to
consume. `main` is the main script that may consume and generate data.
`stage_out` should move data generated from `main` to an external location.
2 changes: 1 addition & 1 deletion packer/docker/build.py
@@ -150,7 +150,7 @@ def get_tf_versions(yaml_file_path):
"install_lustre": "false",
"source_image_project_id": "irrelevant",
"zone": "irrelevant",
"tf_version": "overriden",
"tf_version": "overridden",
}
file_params["project_id"] = args.project_id
file_params["slurm_version"] = args.slurm_version
2 changes: 1 addition & 1 deletion scripts/resume.py
@@ -161,7 +161,7 @@ def create_instances_request(nodes, partition_name, placement_group, job_id=None
if job_id is not None and partition.enable_job_exclusive
else None
)
# overwrites properties accross all instances
# overwrites properties across all instances
body.instanceProperties = instance_properties(
nodeset, model, placement_group, labels
)
2 changes: 1 addition & 1 deletion scripts/startup.sh
@@ -96,7 +96,7 @@ fi
touch $FLAGFILE

function tpu_setup {
#allow the following command to fail, as this attibute does not exist for regular nodes
#allow the following command to fail, as this attribute does not exist for regular nodes
docker_image=$($CURL $URL/instance/attributes/slurm_docker_image 2> /dev/null || true)
if [ -z $docker_image ]; then #Not a tpu node, do not do anything
return
2 changes: 1 addition & 1 deletion terraform/slurm_cluster/README.md
@@ -38,7 +38,7 @@ use.
Partitions define what compute resources are available to the controller so it
may allocate jobs. Slurm will resume/create compute instances as needed to run
allocated jobs and will suspend/terminate the instances after they are no longer
needed (e.g. IDLE for SuspendTimeout duration). Static nodes are persistant;
needed (e.g. IDLE for SuspendTimeout duration). Static nodes are persistent;
they are exempt from being suspended/terminated under normal conditions. Dynamic
nodes are burstable; they will scale up and down with workload.

@@ -14,7 +14,7 @@

## Overview

This exmaple creates a
This example creates a
[slurm_instance_template](../../../modules/slurm_instance_template/README.md).
It is compatible with:

@@ -14,7 +14,7 @@

## Overview

This exmaple creates a
This example creates a
[slurm_instance_template](../../../modules/slurm_instance_template/README.md)
intended to be used by the
[slurm_partition](../../../modules/slurm_partition/README.md).
@@ -14,7 +14,7 @@

## Overview

This exmaple creates a
This example creates a
[slurm_instance_template](../../../modules/slurm_instance_template/README.md)
intended to be used by the
[slurm_controller_instance](../../../modules/slurm_controller_instance/README.md).
@@ -14,7 +14,7 @@

## Overview

This exmaple creates a
This example creates a
[slurm_instance_template](../../../modules/slurm_instance_template/README.md)
intended to be used by the
[slurm_login_instance](../../../modules/slurm_login_instance/README.md).
@@ -14,7 +14,7 @@

## Overview

This exmaple creates a
This example creates a
[Slurm partition](../../../modules/slurm_partition/README.md).

## Usage
@@ -26,7 +26,7 @@ It is recommended to pass in an
[instance template](../../../../docs/glossary.md#instance-template) generated by
the [slurm_instance_template](../slurm_instance_template/README.md) module.

The controller is responisble for managing compute instances defined by multiple
The controller is responsible for managing compute instances defined by multiple
[slurm_partition](../slurm_partition/README.md).

The controller instance run [slurmctld](../../../../docs/glossary.md#slurmctld),
2 changes: 1 addition & 1 deletion terraform/slurm_cluster/modules/slurm_files/README.md
@@ -38,7 +38,7 @@ use.
Partitions define what compute resources are available to the controller so it
may allocate jobs. Slurm will resume/create compute instances as needed to run
allocated jobs and will suspend/terminate the instances after they are no longer
needed (e.g. IDLE for SuspendTimeout duration). Static nodes are persistant;
needed (e.g. IDLE for SuspendTimeout duration). Static nodes are persistent;
they are exempt from being suspended/terminated under normal conditions. Dynamic
nodes are burstable; they will scale up and down with workload.

4 changes: 2 additions & 2 deletions terraform/slurm_cluster/modules/slurm_files/README_TF.md
@@ -81,7 +81,7 @@ No modules.
| <a name="input_enable_hybrid"></a> [enable\_hybrid](#input\_enable\_hybrid) | Enables use of hybrid controller mode. When true, controller\_hybrid\_config will<br>be used instead of controller\_instance\_config and will disable login instances. | `bool` | `false` | no |
| <a name="input_epilog_scripts"></a> [epilog\_scripts](#input\_epilog\_scripts) | List of scripts to be used for Epilog. Programs for the slurmd to execute<br>on every node when a user's job completes.<br>See https://slurm.schedmd.com/slurm.conf.html#OPT_Epilog. | <pre>list(object({<br> filename = string<br> content = string<br> }))</pre> | `[]` | no |
| <a name="input_extra_logging_flags"></a> [extra\_logging\_flags](#input\_extra\_logging\_flags) | The list of extra flags for the logging system to use. See the logging\_flags variable in scripts/util.py to get the list of supported log flags. | `map(bool)` | `{}` | no |
| <a name="input_google_app_cred_path"></a> [google\_app\_cred\_path](#input\_google\_app\_cred\_path) | Path to Google Applicaiton Credentials. | `string` | `null` | no |
| <a name="input_google_app_cred_path"></a> [google\_app\_cred\_path](#input\_google\_app\_cred\_path) | Path to Google Application Credentials. | `string` | `null` | no |
| <a name="input_install_dir"></a> [install\_dir](#input\_install\_dir) | Directory where the hybrid configuration directory will be installed on the<br>on-premise controller (e.g. /etc/slurm/hybrid). This updates the prefix path<br>for the resume and suspend scripts in the generated `cloud.conf` file.<br><br>This variable should be used when the TerraformHost and the SlurmctldHost<br>are different.<br><br>This will default to var.output\_dir if null. | `string` | `null` | no |
| <a name="input_job_submit_lua_tpl"></a> [job\_submit\_lua\_tpl](#input\_job\_submit\_lua\_tpl) | Slurm job\_submit.lua template file path. | `string` | `null` | no |
| <a name="input_login_network_storage"></a> [login\_network\_storage](#input\_login\_network\_storage) | Storage to mounted on login and controller instances<br>* server\_ip : Address of the storage server.<br>* remote\_mount : The location in the remote instance filesystem to mount from.<br>* local\_mount : The location on the instance filesystem to mount to.<br>* fs\_type : Filesystem type (e.g. "nfs").<br>* mount\_options : Options to mount with. | <pre>list(object({<br> server_ip = string<br> remote_mount = string<br> local_mount = string<br> fs_type = string<br> mount_options = string<br> }))</pre> | `[]` | no |
@@ -96,7 +96,7 @@ No modules.
| <a name="input_partitions"></a> [partitions](#input\_partitions) | Cluster partitions as a list. | `list(any)` | `[]` | no |
| <a name="input_project_id"></a> [project\_id](#input\_project\_id) | The GCP project ID. | `string` | n/a | yes |
| <a name="input_prolog_scripts"></a> [prolog\_scripts](#input\_prolog\_scripts) | List of scripts to be used for Prolog. Programs for the slurmd to execute<br>whenever it is asked to run a job step from a new job allocation.<br>See https://slurm.schedmd.com/slurm.conf.html#OPT_Prolog. | <pre>list(object({<br> filename = string<br> content = string<br> }))</pre> | `[]` | no |
| <a name="input_slurm_bin_dir"></a> [slurm\_bin\_dir](#input\_slurm\_bin\_dir) | Path to directroy of Slurm binary commands (e.g. scontrol, sinfo). If 'null',<br>then it will be assumed that binaries are in $PATH. | `string` | `null` | no |
| <a name="input_slurm_bin_dir"></a> [slurm\_bin\_dir](#input\_slurm\_bin\_dir) | Path to directory of Slurm binary commands (e.g. scontrol, sinfo). If 'null',<br>then it will be assumed that binaries are in $PATH. | `string` | `null` | no |
| <a name="input_slurm_cluster_name"></a> [slurm\_cluster\_name](#input\_slurm\_cluster\_name) | The cluster name, used for resource naming and slurm accounting. | `string` | n/a | yes |
| <a name="input_slurm_conf_tpl"></a> [slurm\_conf\_tpl](#input\_slurm\_conf\_tpl) | Slurm slurm.conf template file path. | `string` | `null` | no |
| <a name="input_slurm_control_addr"></a> [slurm\_control\_addr](#input\_slurm\_control\_addr) | The IP address or a name by which the address can be identified.<br><br>This value is passed to slurm.conf such that:<br>SlurmctldHost={var.slurm\_control\_host}\({var.slurm\_control\_addr}\)<br><br>See https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmctldHost | `string` | `null` | no |
2 changes: 1 addition & 1 deletion terraform/slurm_cluster/modules/slurm_files/outputs.tf
@@ -30,7 +30,7 @@ output "config" {

precondition {
condition = length(local.x_nodeset_overlap) == 0
error_message = "All nodeset names must be unqiue among all nodeset types."
error_message = "All nodeset names must be unique among all nodeset types."
}
}
