docs: GDE Argo disaster recovery runbook TDE-1281 (#814)

#### Motivation

Create a disaster recovery Runbook for restoring Argo Workflows with EKS and RDS so that the cluster can be rebuilt easily. Moved from linz/topo-aws-infrastructure#366 - should address all the issues brought up there.

#### Checklist

- [ ] Tests updated
- [x] Docs updated
- [x] Issue linked in Title

Co-authored-by: Alice Fage <afage@linz.govt.nz>
linz · Oct 15, 2024 · 1b8b0ab
1 parent b851ea5
Showing 1 changed file with 150 additions and 0 deletions.
diff --git a/docs/infrastructure/gde-argo-runbook.md b/docs/infrastructure/gde-argo-runbook.md
@@ -0,0 +1,150 @@
+# Disaster recovery runbook
+
+**Warning:** If you go through this process while the cluster is still functional, agree on a time for the deployment with the end users first. Teardown and restore should not take more than 4 hours.
+
+## Purpose
+
+Rebuild the Argo Workflows cluster from scratch, restoring existing database contents.
+
+## Prerequisites
+
+1. [`node`](https://nodejs.org/)
+2. [`helm`](https://helm.sh/docs/intro/install/)
+3. [`kubectl`](https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/) - should be the same version as the EKS version of the original cluster. At time of writing, this is only available by looking for `KubernetesVersion.of('VERSION')` in the code (for example, `KubernetesVersion.of('1.30')`).
+4. [`argo`](https://github.com/argoproj/argo-workflows/releases/) - should be the same version as the Argo Workflows Server version of the original cluster. At time of writing, this is only available by looking for `appVersion = 'vVERSION'` in the code (for example, `appVersion = 'v3.5.5'`).
+5. You need to be able to log in using the following AWS accounts and roles to restore production:
+   - LI Topo production account as admin
+   - ODR access account as admin using the admin profile
+
+## Setup
+
+We need to make sure we're starting from a sane repository state. Skip any steps you're _sure_ you don't need to do:
+
+1. Clone the [Open Data Registry repo](https://github.com/linz/open-data-registry-cdk/): `git clone git@github.com:linz/open-data-registry-cdk.git`
+2. Go into the Open Data Registry repo: `cd open-data-registry-cdk`.
+3. Install dependencies: `npm install`.
+4. Exit the Open Data Registry repo: `cd ..`.
+5. Clone the [Topo AWS infrastructure repo](https://github.com/linz/topo-aws-infrastructure/): `git clone git@github.com:linz/topo-aws-infrastructure.git`
+6. Clone [this repo](https://github.com/linz/topo-workflows/): `git clone git@github.com:linz/topo-workflows.git`
+7. Go into the Topo workflows repo: `cd topo-workflows`
+8. Clean the repository of any generated files: `git clean -d --force -x`
+9. Reset any changes to files: `git reset --hard HEAD`
+10. Check out the relevant commit: `git checkout ID`. This could be, for example, `origin/master` or the commit used to deploy the old production cluster.
+11. Install dependencies: `npm install`
+12. Log into the LI Topo production account as admin
+
+## [Teardown existing cluster](./destroy.md)
+
+If any of the cluster infrastructure exists but is not functional, see the above link for how to tear it down completely.
+
+## Update database version if necessary
+
+1. Get the details of the most recent production database snapshot: `aws rds describe-db-snapshots --output json --query="sort_by(DBSnapshots[?contains(DBSnapshotIdentifier,'workflows-argodb')], &SnapshotCreateTime)[-1]"`
+2. Compare `EngineVersion` from the above output to `PostgresEngineVersion.VER_` in the code.
+3. If they differ, update `PostgresEngineVersion.VER_` in the code to match the snapshot `EngineVersion`.
+4. If you updated the version, commit the change, push it, and create a pull request.
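+
+CDK constant names such as `PostgresEngineVersion.VER_15_3` appear to follow from the engine version by replacing dots with underscores. A minimal sketch of that mapping, assuming a constant exists for the version in question; `engine_version_to_cdk_constant` is a hypothetical helper:
+
+```shell
+# Hypothetical helper: map an RDS EngineVersion such as "15.3" to the
+# suffix of the matching CDK constant, for example "VER_15_3".
+engine_version_to_cdk_constant() {
+    printf 'VER_%s\n' "$(printf '%s' "$1" | tr '.' '_')"
+}
+
+# Usage (needs AWS credentials; appends `.EngineVersion` to the query from step 1):
+#   snapshot_version="$(aws rds describe-db-snapshots --output text --query="sort_by(DBSnapshots[?contains(DBSnapshotIdentifier,'workflows-argodb')], &SnapshotCreateTime)[-1].EngineVersion")"
+#   engine_version_to_cdk_constant "$snapshot_version"
+```
+
+Check the printed name against the `PostgresEngineVersion.VER_` value currently in the code before changing anything.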
+
+## Deployment of new cluster
+
+1. Set AWS Account ID for CDK: `export CDK_DEFAULT_ACCOUNT="$(aws sts get-caller-identity --query Account --output text)"`.
+2. Deploy prod cluster using all the relevant roles as maintainers:
+
+   ```
+   ci_role="$(aws iam list-roles --output=text --query="Roles[?starts_with(RoleName, 'CiTopoProd-CiRole')].Arn")"
+   admin_role="arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):role/AccountAdminRole"
+   workflow_maintainer_role="$(aws cloudformation describe-stacks --output=text --query="Stacks[].Outputs[].OutputValue" --stack-name=TopographicSharedResourcesProd)"
+   npx cdk deploy --context=maintainer-arns="${ci_role},${admin_role},${workflow_maintainer_role}" Workflows
+   ```
+
+3. Deploy Argo Workflows without archiving:
+
+   1. Connect AWS CLI to the new cluster: `aws eks update-kubeconfig --name=Workflows`.
+   2. Create the Argo Workflows configuration files: `npx cdk8s synth`.
+   3. Remove the `persistence` section of `dist/0005-argo-workflows.k8s.yaml` to disable workflow archiving to database. For example:
+
+      ```patch
+      --- dist/0005-argo-workflows.k8s.yaml.orig
+      +++ dist/0005-argo-workflows.k8s.yaml
+      @@ -88,26 +88,6 @@
+               keyFormat: "{{workflow.creationTimestamp.Y}}-{{workflow.creationTimestamp.m}}/{{workflow.creationTimestamp.d}}-{{workflow.name}}/{{pod.name}}"
+               region: ap-southeast-2
+               useSDKCreds: true
+      -    persistence:
+      -      [redacted]
+      -        tableName: argo_workflows
+           workflowDefaults:
+             spec:
+               parallelism: 3
+      ```
+
+   4. Apply the configuration files: `kubectl apply --filename=dist/`. The first run may fail due to [CRD async behaviour](initial.deployment.md#custom-resource-definitions); if it does, run the command a second time.
+
+4. Create a temporary RDS database from the snapshot identified when finding the engine version above:
+   1. Get details of the new cluster database: `aws rds describe-db-instances --query="DBInstances[?DBName=='argo'].{EndpointAddress: Endpoint.Address, DBSubnetGroupName: DBSubnetGroup.DBSubnetGroupName, VpcSecurityGroupIds: VpcSecurityGroups[].VpcSecurityGroupId}"`.
+   2. Go to https://ap-southeast-2.console.aws.amazon.com/rds/home?region=ap-southeast-2#db-snapshot:engine=postgres;id=ID, replacing "ID" with the `DBSnapshotIdentifier`.
+   3. Click on _Actions_ → _Restore snapshot_.
+   4. Under _Availability and durability_: select _Single DB Instance_.
+   5. Under _Settings_ set _DB instance identifier_ to "temp-argo-db".
+   6. Under _Instance configuration_: select _Burstable classes_ and _db.t3.micro_.
+   7. Under _Connectivity_ → _DB subnet group_: select the DB subnet group of the new cluster.
+   8. Under _Connectivity_ → _Existing VPC security groups_: select the VPC security group of the new cluster.
+   9. Click _Restore DB instance_.
+   10. Wait for the temporary DB to get to the "Available" state.
+5. Dump the temporary database to the new Argo database:
+
+   1. Submit a ["sleep" workflow](../../workflows/test/sleep.yml) to the new Argo Workflows installation to spin up a pod:
+      `argo submit --namespace=argo workflows/test/sleep.yml`. This will be used to connect to RDS to dump the database to a file.
+   2. Connect to the sleep pod (it can take a while for the pod to spin up, so you might have to retry the second command):
+
+      ```
+      pod_name="$(kubectl --namespace=argo get pods --output=name | grep --only-matching 'test-sleep-.*')"
+      kubectl --namespace=argo exec --stdin --tty "$pod_name" -- /bin/bash
+      ```
+
+   3. Install the PostgreSQL client:
+
+      ```
+      apt update
+      apt install -y postgresql-client
+      ```
+
+   4. Get the temporary DB endpoint address: `aws rds describe-db-instances --query="DBInstances[?DBInstanceIdentifier=='temp-argo-db'].Endpoint.Address"`.
+   5. Dump the temporary database to a file, replacing ENDPOINT with the temp-argo-db endpoint address: `pg_dump --host=ENDPOINT --username=argo_user --dbname=argo > argodbdump`.
+      You will be prompted for a password; get it from the [AWS Systems Manager Parameter Store](https://ap-southeast-2.console.aws.amazon.com/systems-manager/parameters/%252Feks%252Fargo%252Fpostgres%252Fpassword/description?region=ap-southeast-2&tab=Table).
+   6. Load the dump into the new Argo database, replacing ENDPOINT with the new cluster endpoint address:
+      `psql --host=ENDPOINT --username=argo_user --dbname=argo < argodbdump`.
+      You will be prompted for a password; get it from the [AWS Systems Manager Parameter Store](https://ap-southeast-2.console.aws.amazon.com/systems-manager/parameters/%252Feks%252Fargo%252Fpostgres%252Fpassword/description?region=ap-southeast-2&tab=Table).
+
+6. Redeploy the cluster configuration files to enable the connection to the database and turn on workflow archiving:
+
+   1. Run `npx cdk8s synth` to recreate the `persistence` section in `dist/0005-argo-workflows.k8s.yaml`.
+   2. Redeploy the Argo config file: `kubectl replace --filename=dist/0005-argo-workflows.k8s.yaml`.
+   3. Restart the workflow controller and the server:
+
+      ```
+      kubectl --namespace=argo rollout restart deployment argo-workflows-workflow-controller
+      kubectl --namespace=argo rollout restart deployment argo-workflows-server
+      ```
+
+7. Trigger deployment of Argo workflows. If you created a pull request [above](#update-database-version-if-necessary), merging it will trigger the job. Otherwise you have to trigger the [main workflow](https://github.com/linz/topo-workflows/actions/workflows/main.yml) manually.
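+
+Step 5.2 above notes that the sleep pod can take a while to appear, so the lookup may need retrying. A minimal sketch of such a retry loop; `find_sleep_pod` and `SLEEP_SECONDS` are hypothetical names, and the attempt count and delay are assumptions:
+
+```shell
+# Hypothetical helper: poll until a pod whose name contains "test-sleep-"
+# exists in the argo namespace, then print its name.
+# $1 is the number of attempts (default 30); SLEEP_SECONDS is the delay between them.
+find_sleep_pod() {
+    attempts="${1:-30}"
+    while [ "$attempts" -gt 0 ]; do
+        pod_name="$(kubectl --namespace=argo get pods --output=name | grep --only-matching 'test-sleep-.*' | head -n 1)"
+        if [ -n "$pod_name" ]; then
+            printf '%s\n' "$pod_name"
+            return 0
+        fi
+        attempts=$((attempts - 1))
+        sleep "${SLEEP_SECONDS:-10}"
+    done
+    echo "no test-sleep pod found" >&2
+    return 1
+}
+
+# Usage:
+#   pod_name="$(find_sleep_pod)" &&
+#       kubectl --namespace=argo wait --for=condition=Ready "pod/$pod_name" --timeout=300s &&
+#       kubectl --namespace=argo exec --stdin --tty "$pod_name" -- /bin/bash
+```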
+
+## Open Data Registry
+
+**Warning:** This section is for _production only._ When developing in non-prod environments, skip to the next section.
+
+1. Go to the Open Data Registry repo: `cd ../open-data-registry-cdk`.
+2. Copy the [LINZ Open Data Registry account](https://github.com/linz/topo-aws-infrastructure/blob/master/src/accounts/odr/README.md) CDK context declaration here: `cp ../topo-aws-infrastructure/src/accounts/odr/cdk.context.json .`.
+3. In `cdk.context.json`, update the ARN of the role whose name starts with "Workflows-EksWorkflowsArgoRunnerServiceAccountRole" to the output of `aws iam list-roles --output=text --query="Roles[?contains(RoleName, 'Workflows-EksWorkflowsArgoRunnerServiceAccountRole')].Arn"`.
+4. Log into ODR access account as admin using the admin profile.
+5. Deploy the ODR datasets stack: `npx cdk deploy Datasets`.
+6. Commit the updated CDK context:
+   1. `cp cdk.context.json ../topo-aws-infrastructure/src/accounts/odr/cdk.context.json`
+   2. `cd ../topo-aws-infrastructure`
+   3. Commit, push, and create a pull request for this branch.
+
+## Finalise
+
+1. Let the users know that Argo is once again available.
+2. Tidy up:
+   1. Delete the _temporary_ database in the AWS web console → RDS or with `aws rds delete-db-instance --db-instance-identifier=temp-argo-db --skip-final-snapshot`.
+   2. Terminate the sleep workflow: `argo --namespace=argo stop "$(argo --namespace=argo list --output=name)"`.
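+
+Because the delete command skips the final snapshot, it is worth guarding against pointing it at the wrong instance. A minimal sketch, assuming the temporary instance identifier starts with "temp-" as in "temp-argo-db"; `delete_temp_db` is a hypothetical helper:
+
+```shell
+# Hypothetical helper: delete the temporary RDS instance and wait until it is gone.
+# Refuses any identifier that does not start with "temp-".
+delete_temp_db() {
+    db_id="${1:-temp-argo-db}"
+    case "$db_id" in
+        temp-*) ;;
+        *)
+            echo "refusing to delete non-temporary instance: $db_id" >&2
+            return 1
+            ;;
+    esac
+    aws rds delete-db-instance --db-instance-identifier="$db_id" --skip-final-snapshot &&
+        aws rds wait db-instance-deleted --db-instance-identifier="$db_id"
+}
+
+# Usage:
+#   delete_temp_db
+```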