Codespell: config, pre-commit hook + fixed typos #32

Open. Wants to merge 4 commits into base: master.
4 changes: 4 additions & 0 deletions .codespellrc
@@ -0,0 +1,4 @@
[codespell]
skip = .git,*.pdf,*.svg
#
# ignore-words-list =
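
For reference, a minimal sketch of how this configuration might be exercised locally, assuming a codespell release (2.x or later) that auto-discovers `.codespellrc` in the working directory; the explicit flags simply mirror the keys above:

```sh
# Run from the repository root; recent codespell versions pick up .codespellrc automatically.
codespell

# Equivalent explicit invocation mirroring the config keys above.
codespell --skip=".git,*.pdf,*.svg"

# False positives could be silenced via the (currently commented-out) ignore-words-list key;
# the entries below are hypothetical placeholders:
#   ignore-words-list = nd,ect
```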
5 changes: 5 additions & 0 deletions .pre-commit-config.yaml
@@ -65,3 +65,8 @@ repos:
hooks:
- id: flake8
exclude: ^dm/

- repo: https://github.com/codespell-project/codespell
rev: v2.2.5
hooks:
- id: codespell
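
A hedged sketch of how contributors would typically exercise the new hook, assuming `pre-commit` is already installed:

```sh
# One-time setup: install the git hooks defined in .pre-commit-config.yaml.
pre-commit install

# Run only the new codespell hook across the whole tree.
pre-commit run codespell --all-files

# Or run every configured hook (flake8, codespell, ...) before pushing.
pre-commit run --all-files
```
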
8 changes: 4 additions & 4 deletions CHANGELOG.md
@@ -120,7 +120,7 @@ All notable changes to this project will be documented in this file.
## \[5.6.3\]

- Increase project metadata timeouts.
- Add cluster cloud logging fiter output.
- Add cluster cloud logging filter output.
- Add Ubuntu 22.04 LTS support.
- Add preliminary ARM64 image support for T2A instances.
- Add Debian 11 support.
@@ -140,7 +140,7 @@ All notable changes to this project will be documented in this file.
- startup-script - Add logging level prefixes for parsing.
- Fix slurm and slurm-gcp logs not showing up in Cloud Logging.
- resume.py - No longer validate machine_type with placement groups.
- Raise error from incorrect settings with dependant inputs.
- Raise error from incorrect settings with dependent inputs.
- For gcsfuse network storage, server_ip can be null/None or "".
- Fix munge mount export from controller.
- Enable DebugFlags=Power by default
@@ -349,7 +349,7 @@ All notable changes to this project will be documented in this file.
- Fix potential race condition in loading BQ job data.
- Remove deployment manager support.
- Update Nvidia to 470.82.01 and CUDA to 11.4.4
- Reenable gcsfuse in ansible and workaround the repo gpg check problem
- Re-enable gcsfuse in ansible and workaround the repo gpg check problem

## \[4.1.5\]

@@ -400,7 +400,7 @@ All notable changes to this project will be documented in this file.

## \[4.0.4\]

- Configure sockets, cores, threads on compute nodes for better performace with
- Configure sockets, cores, threads on compute nodes for better performance with
`cons_tres`.

## \[4.0.3\]
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -22,7 +22,7 @@ If you make an automated change (changing a function name, fixing a pervasive
spelling mistake), please send the command/regex used to generate the changes
along with the patch, or note it in the commit message.

While not required, we encourage use of `git format-patch` to geneate the patch.
While not required, we encourage use of `git format-patch` to generate the patch.
This ensures the relevant author line and commit message stay attached. Plain
`diff`'d output is also okay. In either case, please attach them to the bug for
us to review. Spelling corrections or documentation improvements can be
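
As a side note on the `git format-patch` workflow mentioned above, a minimal sketch (the commit range is illustrative):

```sh
# Export the most recent commit as a patch file, keeping author and commit message intact.
git format-patch -1 HEAD

# Or export every commit on the current branch that is not yet on master.
git format-patch master
```
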
2 changes: 1 addition & 1 deletion README.md
@@ -61,7 +61,7 @@ to help you get up and running and stay running.
Issues and/or enhancement requests can be submitted to
[SchedMD's Bugzilla](https://bugs.schedmd.com).

Also, join comunity discussions on either the
Also, join community discussions on either the
[Slurm User mailing list](https://slurm.schedmd.com/mail.html) or the
[Google Cloud & Slurm Community Discussion Group](https://groups.google.com/forum/#!forum/google-cloud-slurm-discuss).

2 changes: 1 addition & 1 deletion ansible/docker-playbook.yml
@@ -50,7 +50,7 @@
msg: >
OS ansible_distribution version ansible_distribution_major_version is not
supported.
Please use a suported OS in list:
Please use a supported OS in list:
- CentOS 7
- Rocky 8
- Debian 10, 11
2 changes: 1 addition & 1 deletion ansible/playbook.yml
@@ -51,7 +51,7 @@
msg: >
OS ansible_distribution version ansible_distribution_major_version is not
supported.
Please use a suported OS in list:
Please use a supported OS in list:
- CentOS 7
- Rocky 8
- Debian 10, 11
2 changes: 1 addition & 1 deletion ansible/roles/google_cloud_ops_agents/defaults/main.yml
@@ -4,5 +4,5 @@ package_state: present
version: latest
# Local path to the config file to copy onto the remote server. Can be absolute or relative.
main_config_file: ''
# Local path to the additonal config directory whose contents will be copied onto the remote server. Can be absolute or relative.
# Local path to the additional config directory whose contents will be copied onto the remote server. Can be absolute or relative.
additional_config_dir: ''
@@ -2,7 +2,7 @@
plugin: gcp_compute
auth_kind: serviceaccount
projects:
# This value will be subsituted by sed command
# This value will be substituted by sed command
- ENTER_PROJECT_NAME
keyed_groups:
# Create groups from GCE labels
2 changes: 1 addition & 1 deletion ansible/tf-playbook.yml
@@ -48,7 +48,7 @@
msg: >
OS ansible_distribution version ansible_distribution_major_version is not
supported.
Please use a suported OS in list:
Please use a supported OS in list:
- CentOS 7
- Rocky 8
- Debian 10, 11
2 changes: 1 addition & 1 deletion docs/cloud.md
@@ -28,7 +28,7 @@ There are two deployment methods for cloud cluster management:

This deployment method leverages
[GCP Marketplace](./glossary.md#gcp-marketplace) to make setting up clusters a
breeze without leaving your browser. While this method is simplier and less
breeze without leaving your browser. While this method is simpler and less
flexible, it is great for exploring what `slurm-gcp` is!

See the [Marketplace Guide](./marketplace.md) for setup instructions and more
8 changes: 4 additions & 4 deletions docs/faq.md
@@ -82,7 +82,7 @@ out. Tickets can be submitted via

### How do I move data for a job?

Data can be migrated to and from external sources using a worflow of dependant
Data can be migrated to and from external sources using a workflow of dependent
jobs. A [workflow submission script](../jobs/submit_workflow.py.py) and
[helper jobs](../jobs/data_migrate/) are provided. See
[README](../jobs/README.md) for more information.
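
The helper script above wraps this pattern; at the plain Slurm level, an equivalent hand-rolled chain of dependent jobs might look like the following sketch (the three script names are hypothetical placeholders):

```sh
# Chain three hypothetical stages with Slurm job dependencies.
stage_in=$(sbatch --parsable stage_in.sh)                          # stage data in
main=$(sbatch --parsable --dependency=afterok:${stage_in} main.sh) # run the actual work
sbatch --dependency=afterok:${main} stage_out.sh                   # stage results out
```
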
@@ -197,8 +197,8 @@ it may be allocated jobs again.
### How do I limit user access to only using login nodes?

By default, all instances are configured with
[OS Login](./glossary.md#os-login). This keeps UID and GID of users consistant
accross all instances and allows easy user control with
[OS Login](./glossary.md#os-login). This keeps UID and GID of users consistent
across all instances and allows easy user control with
[IAM Roles](./glossary.md#iam-roles).

1. Create a group for all users in `admin.google.com`.
@@ -221,7 +221,7 @@ accross all instances and allows easy user control with
1. Select boxes for login nodes
1. Add group as a member with the **IAP-secured Tunnel User** role. Please see
[Enabling IAP for Compute Engine](https://cloud.google.com/iap/docs/enabling-compute-howto)
for mor information.
for more information.
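
As a rough sketch of the IAM step above (the instance name, zone, and group address are placeholders; the role ID shown is the one backing the IAP-secured Tunnel User role):

```sh
# Grant the login-node user group IAP tunnel access to a single login node.
gcloud compute instances add-iam-policy-binding example-login-0 \
  --zone=us-central1-a \
  --member="group:cluster-users@example.com" \
  --role="roles/iap.tunnelResourceAccessor"
```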

### What Slurm image do I use for production?

2 changes: 1 addition & 1 deletion docs/federation.md
@@ -115,7 +115,7 @@ please refer to [multiple-slurmdbd](#multiple-slurmdbd) section.

### Additional Requirements

- User UID and GID are consistant accross all federated clusters.
- User UID and GID are consistent across all federated clusters.

## Multiple Slurmdbd

8 changes: 4 additions & 4 deletions docs/hybrid.md
@@ -26,7 +26,7 @@ This guide focuses on setting up a hybrid [Slurm cluster](./glossary.md#slurm).
With hybrid, there are different challenges and considerations that need to be
taken into account. This guide will cover them and their recommended solutions.

There is a clear seperation of how on-prem and cloud resources are managed
There is a clear separation of how on-prem and cloud resources are managed
within your hybrid cluster. This means that you can modify either side of the
hybrid cluster without disrupting the other side! You manage your on-prem and
our [Slurm cluster module](../terraform/slurm_cluster/README.md) will manage the
@@ -71,7 +71,7 @@ and terminating nodes in the cloud:
- Creates compute node resources based upon Slurm job allocation and
configured compute resources.
- `slurmsync.py`
- Synchronizes the Slurm state and the GCP state, reducing discrepencies from
- Synchronizes the Slurm state and the GCP state, reducing discrepancies from
manual admin activity or other edge cases.
- May update Slurm node states, create or destroy GCP compute resources or
other script managed GCP resources.
@@ -253,7 +253,7 @@ controller to be able to burst into the cloud.

### Manage Secrets

Additionally, [MUNGE](./glossary.md#munge) secrets must be consistant across the
Additionally, [MUNGE](./glossary.md#munge) secrets must be consistent across the
cluster. There are a few safe ways to deal with munge.key distribution:

- Use NFS to mount `/etc/munge` from the controller (default behavior).
@@ -270,7 +270,7 @@ connections to the munge NFS is critical.

- Isolate the cloud compute nodes of the cluster into their own project, VPC,
and subnetworks. Use project or network peering to enable access to other
cloud infrastructure in a controlled mannor.
cloud infrastructure in a controlled manner.
- Setup firewall rules to control ingress and egress to the controller such that
only trusted machines or networks use its NFS.
- Only allow trusted private address (ranges) for communication to the
2 changes: 1 addition & 1 deletion docs/images.md
@@ -116,7 +116,7 @@ custom Slurm image.

### Creation

Install software dependencies and build images from configation.
Install software dependencies and build images from configuration.

See [slurm-gcp packer project](../packer/README.md) for details.

8 changes: 4 additions & 4 deletions docs/tpu.md
@@ -9,7 +9,7 @@
- [Overview](#overview)
- [Supported TPU types](#supported-tpu-types)
- [Supported Tensorflow versions](#supported-tensorflow-versions)
- [Slurm-gcp compatiblity matrix](#slurm-gcp-compatiblity-matrix)
- [Slurm-gcp compatibility matrix](#slurm-gcp-compatibility-matrix)
- [Terraform](#terraform)
- [Quickstart Examples](#quickstart-examples)
- [TPU example job](#tpu-example-job)
@@ -36,7 +36,7 @@ first it is important to take into account the following considerations.
the partition ResumeTimeout and SuspendTimeout that contains TPU nodes.
- Slurm is executed in TPU nodes using a docker container.
- TPU nodes in Slurm will have different name that the one seen in GCP, that is
because TPU names cannot be choosen or known before starting them up.
because TPU names cannot be chosen or known before starting them up.
- python 3.7 or above is needed for the TPU API module to work. In consequence
TPU nodes will not work with all the OS, like for example CentOS 7, see more
in the [compatibility matrix](#slurm-gcp-compatiblity-matrix).
@@ -60,7 +60,7 @@ At this moment the following tensorflow versions are supported:

- 2.12.0

## Slurm-gcp compatiblity matrix
## Slurm-gcp compatibility matrix

Due to the fact that the TPU support has some requirements as having python >=
3.7 installed not all the OS support it, this table can be used to see the
@@ -165,7 +165,7 @@ would be like this:
> sbatch -p normal -N 1 : -p tpu -N 1 sbatch_script.sh

This will allocate a node in the normal partition and a node in the tpu
partition, both in the same heterogenous job.
partition, both in the same heterogeneous job.
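
The same heterogeneous allocation can also be written inside the batch script itself; a sketch assuming a Slurm release that supports the `hetjob` directive (the TPU workload command is a placeholder):

```sh
#!/bin/bash
#SBATCH --partition=normal --nodes=1
#SBATCH hetjob
#SBATCH --partition=tpu --nodes=1

srun --het-group=0 hostname               # runs on the normal-partition node
srun --het-group=1 python tpu_workload.py # placeholder workload on the TPU node
```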

### Static TPU nodes

2 changes: 1 addition & 1 deletion jobs/README.md
@@ -18,7 +18,7 @@ $ sbatch --export=MIGRATE_INPUT=/tmp/seq.txt,MIGRATE_OUTPUT=/tmp/shuffle.txt \
## submit_workflow.py

This script is a runner that submits a sequence of 3 jobs as defined in the
input structured yaml file. The three jobs submitted can be refered to as:
input structured yaml file. The three jobs submitted can be referred to as:
`stage_in`; `main`; and `stage_out`. `stage_in` should move data for `main` to
consume. `main` is the main script that may consume and generate data.
`stage_out` should move data generated from `main` to an external location.
2 changes: 1 addition & 1 deletion packer/docker/build.py
@@ -150,7 +150,7 @@ def get_tf_versions(yaml_file_path):
"install_lustre": "false",
"source_image_project_id": "irrelevant",
"zone": "irrelevant",
"tf_version": "overriden",
"tf_version": "overridden",
}
file_params["project_id"] = args.project_id
file_params["slurm_version"] = args.slurm_version
2 changes: 1 addition & 1 deletion scripts/resume.py
@@ -161,7 +161,7 @@ def create_instances_request(nodes, partition_name, placement_group, job_id=None
if job_id is not None and partition.enable_job_exclusive
else None
)
# overwrites properties accross all instances
# overwrites properties across all instances
body.instanceProperties = instance_properties(
nodeset, model, placement_group, labels
)
2 changes: 1 addition & 1 deletion scripts/startup.sh
@@ -96,7 +96,7 @@ fi
touch $FLAGFILE

function tpu_setup {
#allow the following command to fail, as this attibute does not exist for regular nodes
#allow the following command to fail, as this attribute does not exist for regular nodes
docker_image=$($CURL $URL/instance/attributes/slurm_docker_image 2> /dev/null || true)
if [ -z $docker_image ]; then #Not a tpu node, do not do anything
return
2 changes: 1 addition & 1 deletion terraform/slurm_cluster/README.md
@@ -38,7 +38,7 @@ use.
Partitions define what compute resources are available to the controller so it
may allocate jobs. Slurm will resume/create compute instances as needed to run
allocated jobs and will suspend/terminate the instances after they are no longer
needed (e.g. IDLE for SuspendTimeout duration). Static nodes are persistant;
needed (e.g. IDLE for SuspendTimeout duration). Static nodes are persistent;
they are exempt from being suspended/terminated under normal conditions. Dynamic
nodes are burstable; they will scale up and down with workload.

@@ -14,7 +14,7 @@

## Overview

This exmaple creates a
This example creates a
[slurm_instance_template](../../../modules/slurm_instance_template/README.md).
It is compatible with:

@@ -14,7 +14,7 @@

## Overview

This exmaple creates a
This example creates a
[slurm_instance_template](../../../modules/slurm_instance_template/README.md)
intended to be used by the
[slurm_partition](../../../modules/slurm_partition/README.md).
@@ -14,7 +14,7 @@

## Overview

This exmaple creates a
This example creates a
[slurm_instance_template](../../../modules/slurm_instance_template/README.md)
intended to be used by the
[slurm_controller_instance](../../../modules/slurm_controller_instance/README.md).
@@ -14,7 +14,7 @@

## Overview

This exmaple creates a
This example creates a
[slurm_instance_template](../../../modules/slurm_instance_template/README.md)
intended to be used by the
[slurm_login_instance](../../../modules/slurm_login_instance/README.md).
@@ -14,7 +14,7 @@

## Overview

This exmaple creates a
This example creates a
[Slurm partition](../../../modules/slurm_partition/README.md).

## Usage
@@ -26,7 +26,7 @@ It is recommended to pass in an
[instance template](../../../../docs/glossary.md#instance-template) generated by
the [slurm_instance_template](../slurm_instance_template/README.md) module.

The controller is responisble for managing compute instances defined by multiple
The controller is responsible for managing compute instances defined by multiple
[slurm_partition](../slurm_partition/README.md).

The controller instance run [slurmctld](../../../../docs/glossary.md#slurmctld),
2 changes: 1 addition & 1 deletion terraform/slurm_cluster/modules/slurm_files/README.md
@@ -38,7 +38,7 @@ use.
Partitions define what compute resources are available to the controller so it
may allocate jobs. Slurm will resume/create compute instances as needed to run
allocated jobs and will suspend/terminate the instances after they are no longer
needed (e.g. IDLE for SuspendTimeout duration). Static nodes are persistant;
needed (e.g. IDLE for SuspendTimeout duration). Static nodes are persistent;
they are exempt from being suspended/terminated under normal conditions. Dynamic
nodes are burstable; they will scale up and down with workload.

4 changes: 2 additions & 2 deletions terraform/slurm_cluster/modules/slurm_files/README_TF.md
@@ -81,7 +81,7 @@ No modules.
| <a name="input_enable_hybrid"></a> [enable\_hybrid](#input\_enable\_hybrid) | Enables use of hybrid controller mode. When true, controller\_hybrid\_config will<br>be used instead of controller\_instance\_config and will disable login instances. | `bool` | `false` | no |
| <a name="input_epilog_scripts"></a> [epilog\_scripts](#input\_epilog\_scripts) | List of scripts to be used for Epilog. Programs for the slurmd to execute<br>on every node when a user's job completes.<br>See https://slurm.schedmd.com/slurm.conf.html#OPT_Epilog. | <pre>list(object({<br> filename = string<br> content = string<br> }))</pre> | `[]` | no |
| <a name="input_extra_logging_flags"></a> [extra\_logging\_flags](#input\_extra\_logging\_flags) | The list of extra flags for the logging system to use. See the logging\_flags variable in scripts/util.py to get the list of supported log flags. | `map(bool)` | `{}` | no |
| <a name="input_google_app_cred_path"></a> [google\_app\_cred\_path](#input\_google\_app\_cred\_path) | Path to Google Applicaiton Credentials. | `string` | `null` | no |
| <a name="input_google_app_cred_path"></a> [google\_app\_cred\_path](#input\_google\_app\_cred\_path) | Path to Google Application Credentials. | `string` | `null` | no |
| <a name="input_install_dir"></a> [install\_dir](#input\_install\_dir) | Directory where the hybrid configuration directory will be installed on the<br>on-premise controller (e.g. /etc/slurm/hybrid). This updates the prefix path<br>for the resume and suspend scripts in the generated `cloud.conf` file.<br><br>This variable should be used when the TerraformHost and the SlurmctldHost<br>are different.<br><br>This will default to var.output\_dir if null. | `string` | `null` | no |
| <a name="input_job_submit_lua_tpl"></a> [job\_submit\_lua\_tpl](#input\_job\_submit\_lua\_tpl) | Slurm job\_submit.lua template file path. | `string` | `null` | no |
| <a name="input_login_network_storage"></a> [login\_network\_storage](#input\_login\_network\_storage) | Storage to mounted on login and controller instances<br>* server\_ip : Address of the storage server.<br>* remote\_mount : The location in the remote instance filesystem to mount from.<br>* local\_mount : The location on the instance filesystem to mount to.<br>* fs\_type : Filesystem type (e.g. "nfs").<br>* mount\_options : Options to mount with. | <pre>list(object({<br> server_ip = string<br> remote_mount = string<br> local_mount = string<br> fs_type = string<br> mount_options = string<br> }))</pre> | `[]` | no |
@@ -96,7 +96,7 @@ No modules.
| <a name="input_partitions"></a> [partitions](#input\_partitions) | Cluster partitions as a list. | `list(any)` | `[]` | no |
| <a name="input_project_id"></a> [project\_id](#input\_project\_id) | The GCP project ID. | `string` | n/a | yes |
| <a name="input_prolog_scripts"></a> [prolog\_scripts](#input\_prolog\_scripts) | List of scripts to be used for Prolog. Programs for the slurmd to execute<br>whenever it is asked to run a job step from a new job allocation.<br>See https://slurm.schedmd.com/slurm.conf.html#OPT_Prolog. | <pre>list(object({<br> filename = string<br> content = string<br> }))</pre> | `[]` | no |
| <a name="input_slurm_bin_dir"></a> [slurm\_bin\_dir](#input\_slurm\_bin\_dir) | Path to directroy of Slurm binary commands (e.g. scontrol, sinfo). If 'null',<br>then it will be assumed that binaries are in $PATH. | `string` | `null` | no |
| <a name="input_slurm_bin_dir"></a> [slurm\_bin\_dir](#input\_slurm\_bin\_dir) | Path to directory of Slurm binary commands (e.g. scontrol, sinfo). If 'null',<br>then it will be assumed that binaries are in $PATH. | `string` | `null` | no |
| <a name="input_slurm_cluster_name"></a> [slurm\_cluster\_name](#input\_slurm\_cluster\_name) | The cluster name, used for resource naming and slurm accounting. | `string` | n/a | yes |
| <a name="input_slurm_conf_tpl"></a> [slurm\_conf\_tpl](#input\_slurm\_conf\_tpl) | Slurm slurm.conf template file path. | `string` | `null` | no |
| <a name="input_slurm_control_addr"></a> [slurm\_control\_addr](#input\_slurm\_control\_addr) | The IP address or a name by which the address can be identified.<br><br>This value is passed to slurm.conf such that:<br>SlurmctldHost={var.slurm\_control\_host}\({var.slurm\_control\_addr}\)<br><br>See https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmctldHost | `string` | `null` | no |
2 changes: 1 addition & 1 deletion terraform/slurm_cluster/modules/slurm_files/outputs.tf
@@ -30,7 +30,7 @@ output "config" {

precondition {
condition = length(local.x_nodeset_overlap) == 0
error_message = "All nodeset names must be unqiue among all nodeset types."
error_message = "All nodeset names must be unique among all nodeset types."
}
}
