feat: add checkpoint uds-core slim package #818

Open
wants to merge 39 commits into
base: main

Racer159
Contributor

@Racer159 Racer159 commented Sep 25, 2024

Description

This adds a roughly 75% faster way to deploy or reset a full uds-core cluster. In theory it would also work for other preloaded setups, such as testing GitLab Runner with GitLab.

Normal:
image

Checkpoint:
image

Tradeoffs:

  • Requires sudo - not sure of a great way around this without mangling volume permissions for containerd
  • May become unwieldy with more permutations (i.e. with layers work)
  • The cluster would be fully published (so all credentials are reused)

Related Issue

Fixes #N/A

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Other (security config, docs update, etc)

Checklist before merging

@Racer159 Racer159 changed the title from "feat: add frozen uds-core slim package" to "feat: add checkpoint uds-core slim package" Sep 27, 2024
@Racer159 Racer159 marked this pull request as ready for review September 27, 2024 22:54
@Racer159 Racer159 requested a review from a team as a code owner September 27, 2024 22:54
@Racer159 Racer159 self-assigned this Sep 27, 2024
@Racer159
Contributor Author

Racer159 commented Sep 28, 2024

Checkpoint task passed in this PR (except for the actual publish task)
image

Contributor

@catsby catsby left a comment

I'm not an approver, but the code looks good to me. I would like to see more information on how to use this package, though, so it's clearer how, why, and when someone would want to use it.

packages/checkpoint-dev/README.md (outdated)
packages/checkpoint-dev/zarf.yaml
tasks.yaml (outdated)
.github/workflows/checkpoint.yaml (outdated)
.github/actions/setup/action.yaml (outdated)
packages/checkpoint-dev/zarf.yaml
packages/checkpoint-dev/checkpoint.sh (outdated)
tasks/deploy.yaml (outdated)
packages/checkpoint-dev/checkpoint.sh (outdated)
Comment on lines +44 to +51
"/var/lib/kubelet")
echo "Copying $SOURCE to ${DATA_DIR}/kubelet_data/"
sudo cp -a "$SOURCE"/. "${DATA_DIR}/kubelet_data/"
;;
"/var/lib/rancher/k3s")
echo "Copying $SOURCE to ${DATA_DIR}/k3s_data/"
sudo cp -a "$SOURCE"/. "${DATA_DIR}/k3s_data/"
;;
Contributor

During creation I see these errors (which cause the deploy to fail later):

     cp: /var/lib/docker/volumes/c0d8ea4ead46f3c6649218be409e19d1cd63bfcc68f32d548a116c7924d7a793/_data/.: No such file or directory
     cp: /var/lib/docker/volumes/822e843b8cf644f9c4c9118671f6014d32ad84a062d690e69b07d5c6fdfcfbe2/_data/.: No such file or directory

I think docker on macOS is pretty much universally run inside a VM, so the /var/lib/docker/volumes paths exist only inside that VM and not on the host. In my case the VM can be accessed with colima ssh, but Docker Desktop, Rancher Desktop, etc. would likely have similar issues and their own ways to access the VM.
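To make the failure mode above easier to spot, here is a minimal sketch of a pre-flight check. It asks the daemon where a volume lives and tests whether that path is visible from the current shell. The function name `volume_on_host` and the overridable `DOCKER` variable are hypothetical and not part of the PR; `docker volume inspect --format '{{ .Mountpoint }}'` is the only real CLI call used.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Overridable so the check can be dry-run without a daemon (assumption for this sketch).
DOCKER="${DOCKER:-docker}"

# volume_on_host VOLUME
# Succeeds only if the mountpoint reported by the daemon is a directory
# visible from the shell running this script. When docker runs inside a VM
# (colima, Docker Desktop, ...), the path exists only inside the VM, so this fails.
volume_on_host() {
  local mountpoint
  mountpoint="$("$DOCKER" volume inspect --format '{{ .Mountpoint }}' "$1")"
  [ -d "$mountpoint" ]
}

# Example use (hypothetical volume name):
#   if ! volume_on_host my-volume; then
#     echo "volume data is not reachable from this host" >&2
#   fi
```

On Linux, /var/lib/docker is normally readable only by root, so the check would need to run with the same privileges as the copy itself (for example under sudo) to give a meaningful answer.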

I was able to rewrite a portion of this script to use docker cp instead and got closer (at least I didn't get errors with the volumes). I think this is probably a better, more agnostic option, and it simplifies a lot of this logic: no looping through volumes, just copying the two paths we need explicitly. I was hoping it might also remove the need for sudo, but in my case one of the paths still gave permission errors until I added sudo. I'm sure there's some efficiency loss here, but since this is create time I think it's worth it to make this work across distros? Locally it still took less than a minute to run, which seems decently performant (granted, I couldn't get it to run successfully before, so I'm unsure of the real comparison).

I'd be curious to hear your thoughts on this. I dropped the script changes into a gist since there were a handful of changes across the whole file: https://gist.github.com/mjnagel/6d681678df83067169c4e652466f704f
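For readers who don't open the gist, the docker cp idea can be sketched roughly as follows. This is not the gist's content, only an illustration under stated assumptions: the node container name `k3d-k3s-default-server-0` is k3d's default and may differ for this package, `copy_from_node` and the `DOCKER`/`NODE`/`DATA_DIR` variables are hypothetical, and the destination directory names mirror the excerpt under review.

```shell
#!/usr/bin/env bash
set -euo pipefail

DOCKER="${DOCKER:-docker}"                # overridable for dry runs (assumption)
NODE="${NODE:-k3d-k3s-default-server-0}"  # assumed k3d node container name
DATA_DIR="${DATA_DIR:-./data}"            # assumed checkpoint staging directory

# copy_from_node SRC DEST
# Copies the contents of SRC inside the node container into DEST on the host.
# `docker cp -a` is archive mode (keeps uid/gid); the trailing "/." copies the
# directory's contents rather than the directory itself.
copy_from_node() {
  local src="$1" dest="$2"
  mkdir -p "$dest"
  echo "Copying ${NODE}:${src} to ${dest}/"
  "$DOCKER" cp -a "${NODE}:${src}/." "${dest}/"
}

# Example use, copying the two paths explicitly instead of looping over volumes:
#   copy_from_node /var/lib/kubelet     "${DATA_DIR}/kubelet_data"
#   copy_from_node /var/lib/rancher/k3s "${DATA_DIR}/k3s_data"
```

Because the copy goes through the daemon, it should behave the same whether docker runs natively or inside a VM. As noted above, some paths may still need sudo to preserve ownership on the host side.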

I also had to add --no-xattrs to the final tar command; I got warnings/errors without it (I suspect that's some macOS <> Linux stuff). This got me much closer, but I hit some issues with the token:

time="2024-10-02T15:19:18Z" level=fatal msg="starting kubernetes: preparing server: bootstrap data already found and encrypted with different token"

I tried to tweak the commands around startup (using the k3d --token option rather than the k3s arg) and validated that the token exists after extraction, but couldn't figure this one out. I'd be curious whether you hit the same issue with my modified script and can figure out what's wrong.
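As an illustration of the --no-xattrs change mentioned above, here is a minimal sketch that archives a scratch directory without extended attributes. The archive name and directory layout are hypothetical stand-ins; the PR's actual tar command is not shown in this thread.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Scratch directories stand in for the real checkpoint data (assumption).
work="$(mktemp -d)"
mkdir -p "$work/data/k3s_data"
echo "placeholder" > "$work/data/k3s_data/example.txt"

# --no-xattrs tells tar not to store extended attributes, which avoids
# attribute-related warnings when an archive moves between macOS and Linux.
tar --no-xattrs -czf "$work/checkpoint.tar.gz" -C "$work/data" .

# List the archive to confirm the contents were captured.
tar -tzf "$work/checkpoint.tar.gz"
```

Both GNU tar and the bsdtar shipped with macOS accept this flag, so the same command should work on either side.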

@bburky
Member

bburky commented Oct 25, 2024

Probably ignore all of the following. I tried testing CRIU (docker checkpoint or podman container checkpoint). It almost works with podman, except that it does not support nested containers. Recording my notes here anyway.


Did you try docker checkpoint, which uses CRIU and is somewhat meant for this purpose? ...It is still experimental, though, and requires "experimental": true in your /etc/docker/daemon.json, plus installing a CRIU package in your Linux distro...
https://docs.docker.com/reference/cli/docker/checkpoint/
https://criu.org/Docker
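A minimal sketch of the daemon setting described above, written to a scratch file instead of /etc/docker/daemon.json so it can be inspected safely. The helper name `write_experimental_config` is hypothetical. As far as I know, the daemon expects a JSON boolean for this key rather than a quoted string.

```shell
#!/usr/bin/env bash
set -euo pipefail

# write_experimental_config FILE
# Writes a minimal daemon.json that enables experimental features.
# NOTE: this overwrites FILE; merge by hand if other daemon settings already exist.
write_experimental_config() {
  local file="$1"
  printf '{\n  "experimental": true\n}\n' > "$file"
}

# Dry run against a scratch file rather than the real /etc/docker/daemon.json.
scratch="$(mktemp)"
write_experimental_config "$scratch"
cat "$scratch"
```

After the real file is in place and the daemon has been restarted, `docker version --format '{{.Server.Experimental}}'` should report whether the setting took effect, and `criu check` can confirm the kernel side.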

If you use --checkpoint-dir you can save the checkpoint to disk and restore it after recreating the container (possibly on a different machine). There seems to be a bug during restore, but there's a workaround, see below.

docker rm -f count
sudo rm -rf /tmp/checkpoint

docker run -d --name=count busybox /bin/sh -c 'for i in $(seq 9999999); do echo "$i" && sleep 1; done'
docker checkpoint create --checkpoint-dir=/tmp/checkpoint count checkpoint1
docker rm count

docker create --name count busybox
# Apparently `docker start --checkpoint-dir` is broken, use workaround: https://github.com/moby/moby/issues/37344#issuecomment-450782189
# docker start --checkpoint-dir /tmp/checkpoint --checkpoint checkpoint1 count
sudo mv /tmp/checkpoint/checkpoint1 "/var/lib/docker/containers/$(docker ps -aq --no-trunc --filter name=count)/checkpoints/"
docker start --checkpoint=checkpoint1 count

docker ps
docker logs -f count

The biggest downside would be this is near impossible to use with Docker Desktop. A big advantage is the cluster never actually "stops", it's magically paused and resumed elsewhere.

Podman seems to support this too, and seems to be a bit more fully supported. k3d (somewhat) supports Podman too. Unlike docker, Podman's CRIU support includes volumes, and capturing multiple containers at once. It can apparently pack the checkpoint into an OCI image too (useful for publishing to GHCR?)

Except... this whole idea may be useless because I don't think CRIU supports checkpointing nested namespaces (which is how k3d works to embed sub-containers inside its parent docker container for the k8s node)
https://github.com/checkpoint-restore/criu/blob/v4.0/criu/include/namespaces.h#L47-L48

limactl start template://podman-rootful
export DOCKER_HOST=unix://$HOME/.lima/podman-rootful/sock/podman.sock
k3d cluster create

limactl shell podman-rootful sudo podman container checkpoint --export=/tmp/lima/checkpoint.tgz k3d-k3s-default-server-0 k3d-k3s-default-serverlb
# Error:
#   Can't dump nested pid namespace for 4663

4 participants