feat!: add optional k3s-cuda base image flavors #118
Conversation
In local testing:
Example of a full UDS Core cluster with a passing CUDA workload test: root@law-server ~jlaw/dev/uds-k3d/tmp/uds-k3d 117-feat-optional-cuda-image-and-package 12m 39s
❯ uds run validate-cuda --set CUDA_TEST="cuda-vector-add"
NOTE Using config file
pod/cuda-test-pod created
✔ Completed "Deploy the test pod to the cluster"
NOTE Using config file
NOTE Using config file
• Waiting for Pod/cuda-test-pod in namespace default to exist.
• Waiting for Pod/cuda-test-pod in namespace default to be {.status.phase}=Succeeded.
NOTE Using config file
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
✔ Completed "Await test completion and then display the test results"
NOTE Using config file
pod "cuda-test-pod" deleted
✔ Completed "Remove the completed test pod"
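For context, the `cuda-vector-add` test exercised above boils down to scheduling a pod that requests a GPU resource. A minimal sketch of such a manifest follows; the image tag and field values are illustrative assumptions, not copied from this repo's task definitions:

```yaml
# Hypothetical sketch of a CUDA vector-add test pod. The sample image tag is an
# assumption (NVIDIA publishes several cuda-sample vectoradd tags on nvcr.io);
# the GPU limit is what causes the scheduler to place it on a GPU-capable node.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-test-pod
  namespace: default
spec:
  restartPolicy: Never
  containers:
    - name: cuda-vector-add
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1
      resources:
        limits:
          nvidia.com/gpu: 1  # claims one GPU advertised by the NVIDIA device plugin
```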
root@law-server ~jlaw/dev/uds-k3d/tmp/uds-k3d 117-feat-optional-cuda-image-and-package 6s
❯ k get node -A
NOTE Using config file
NAME STATUS ROLES AGE VERSION
k3d-uds-server-0 Ready control-plane,master 33m v1.30.4+k3s1
root@law-server ~jlaw/dev/uds-k3d/tmp/uds-k3d 117-feat-optional-cuda-image-and-package
❯ k describe nodes -A
NOTE Using config file
Name: k3d-uds-server-0
Roles: control-plane,master
[DELETED]
["server","--disable","local-storage","--disable","traefik","--disable","metrics-server","--disable","servicelb","--tls-san","0.0.0.0","--...
k3s.io/node-config-hash: EJB25YOEUV7WTJ5L6EPXNGX2JEFPI4XJIX6RPJHUCHDCXHXSTF4A====
k3s.io/node-env: {"K3S_KUBECONFIG_OUTPUT":"/output/kubeconfig.yaml","K3S_TOKEN":"********"}
nfd.node.kubernetes.io/feature-labels:
cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.AVXVNNI,cpu-cpuid.BHI_CTRL,cpu-cpuid.CETIBT,cpu-cpuid.CETSS,cpu-cpuid...
node.alpha.kubernetes.io/ttl: 0
nvidia.com/gpu-driver-upgrade-enabled: true
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Wed, 09 Oct 2024 11:23:09 -0400
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: k3d-uds-server-0
AcquireTime: <unset>
RenewTime: Wed, 09 Oct 2024 11:56:18 -0400
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Wed, 09 Oct 2024 11:54:47 -0400 Wed, 09 Oct 2024 11:23:09 -0400 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Wed, 09 Oct 2024 11:54:47 -0400 Wed, 09 Oct 2024 11:23:09 -0400 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Wed, 09 Oct 2024 11:54:47 -0400 Wed, 09 Oct 2024 11:23:09 -0400 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Wed, 09 Oct 2024 11:54:47 -0400 Wed, 09 Oct 2024 11:23:09 -0400 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 172.18.0.2
Hostname: k3d-uds-server-0
Capacity:
cpu: 32
ephemeral-storage: 955657596Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 65680500Ki
nvidia.com/gpu: 2
pods: 110
Allocatable:
cpu: 32
ephemeral-storage: 929663708660
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 65680500Ki
nvidia.com/gpu: 2
pods: 110
System Info:
Machine ID:
System UUID:
Boot ID: 7ea5c482-108d-4a53-8b8c-7acd6a4b6f42
Kernel Version: 6.9.3-76060903-generic
OS Image: Ubuntu 22.04.3 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.7.20-k3s1
Kubelet Version: v1.30.4+k3s1
Kube-Proxy Version: v1.30.4+k3s1
PodCIDR: 10.42.0.0/24
PodCIDRs: 10.42.0.0/24
ProviderID: k3s://k3d-uds-server-0
Non-terminated Pods: (55 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
authservice authservice-5784cf6fb4-s9zk2 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 7m15s
grafana grafana-7949f7b65f-x5c68 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 8m1s
istio-admin-gateway admin-ingressgateway-6b88cb4fb7-pgggb 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 16m
istio-passthrough-gateway passthrough-ingressgateway-557f85ff88-dh68m 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 16m
istio-system istiod-d7d47f664-t92pv 500m (1%) 0 (0%) 2Gi (3%) 0 (0%) 17m
istio-tenant-gateway tenant-ingressgateway-6dc87cc74d-ktp54 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 16m
keycloak keycloak-0 600m (1%) 3 (9%) 640Mi (0%) 2Gi (3%) 15m
kube-system coredns-5666759999-v52vm 100m (0%) 0 (0%) 70Mi (0%) 170Mi (0%) 32m
kube-system gpu-feature-discovery-nmqbz 0 (0%) 0 (0%) 0 (0%) 0 (0%) 31m
kube-system gpu-operator-7fd97f74cf-9p5c5 200m (0%) 500m (1%) 100Mi (0%) 350Mi (0%) 32m
kube-system gpu-operator-node-feature-discovery-gc-6788b6ccf8-q6jnv 0 (0%) 0 (0%) 0 (0%) 0 (0%) 32m
kube-system gpu-operator-node-feature-discovery-master-bc9c67575-c29b5 0 (0%) 0 (0%) 0 (0%) 0 (0%) 32m
kube-system gpu-operator-node-feature-discovery-worker-mjlcd 0 (0%) 0 (0%) 0 (0%) 0 (0%) 32m
kube-system nvidia-dcgm-exporter-2t4z7 0 (0%) 0 (0%) 0 (0%) 0 (0%) 31m
kube-system nvidia-device-plugin-daemonset-n8sdl 0 (0%) 0 (0%) 0 (0%) 0 (0%) 31m
kube-system nvidia-operator-validator-4rf5g 0 (0%) 0 (0%) 0 (0%) 0 (0%) 31m
loki loki-backend-0 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 10m
loki loki-backend-1 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 10m
loki loki-backend-2 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 10m
loki loki-gateway-755b6f4bfc-mhr7b 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 10m
loki loki-read-7c7686d949-m8rsb 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 10m
loki loki-read-7c7686d949-p4xnc 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 10m
loki loki-read-7c7686d949-zf8qh 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 10m
loki loki-write-0 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 10m
loki loki-write-1 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 10m
loki loki-write-2 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 10m
metrics-server metrics-server-74cb6d9866-2xflp 200m (0%) 2 (6%) 328Mi (0%) 1Gi (1%) 16m
monitoring alertmanager-kube-prometheus-stack-alertmanager-0 150m (0%) 2100m (6%) 456Mi (0%) 1152Mi (1%) 8m48s
monitoring kube-prometheus-stack-kube-state-metrics-6d8b84c76-xf2g6 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 8m51s
monitoring kube-prometheus-stack-operator-5c66fd5dc4-gwsf6 200m (0%) 2500m (7%) 640Mi (0%) 1536Mi (2%) 8m51s
monitoring kube-prometheus-stack-prometheus-node-exporter-bpf7s 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 8m51s
monitoring prometheus-kube-prometheus-stack-prometheus-0 250m (0%) 2600m (8%) 768Mi (1%) 5248Mi (8%) 8m48s
neuvector neuvector-controller-pod-55b46c88cc-s67nv 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 10m
neuvector neuvector-controller-pod-55b46c88cc-w5lmp 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 8m21s
neuvector neuvector-controller-pod-55b46c88cc-whtwt 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 9m36s
neuvector neuvector-enforcer-pod-7lrwq 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 13m
neuvector neuvector-manager-pod-598bd94df7-j7rwf 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 13m
neuvector neuvector-scanner-pod-5bb75fd8d5-94z7f 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 13m
neuvector neuvector-scanner-pod-5bb75fd8d5-cs2vp 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 13m
neuvector neuvector-scanner-pod-5bb75fd8d5-fz6xz 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 13m
pepr-system pepr-uds-core-794f7cc6b6-fdsgl 200m (0%) 2500m (7%) 192Mi (0%) 1280Mi (1%) 16m
pepr-system pepr-uds-core-794f7cc6b6-lcm4c 200m (0%) 2500m (7%) 192Mi (0%) 1280Mi (1%) 16m
pepr-system pepr-uds-core-watcher-7fd6ddf9f5-dqpw2 200m (0%) 2500m (7%) 192Mi (0%) 1280Mi (1%) 16m
uds-dev-stack ensure-machine-id-wccwz 100m (0%) 100m (0%) 50Mi (0%) 50Mi (0%) 32m
uds-dev-stack local-path-provisioner-7cbf488c7f-mq8vn 0 (0%) 0 (0%) 0 (0%) 0 (0%) 32m
uds-dev-stack metallb-controller-77cb7f5d88-z7v8f 0 (0%) 0 (0%) 0 (0%) 0 (0%) 32m
uds-dev-stack metallb-speaker-k6q55 0 (0%) 0 (0%) 0 (0%) 0 (0%) 32m
uds-dev-stack minio-64774797ff-jrvn7 150m (0%) 0 (0%) 256Mi (0%) 0 (0%) 32m
uds-dev-stack nginx-d44h7 0 (0%) 0 (0%) 0 (0%) 0 (0%) 32m
uds-runtime uds-runtime-86c9c868d6-47b7v 350m (1%) 2750m (8%) 256Mi (0%) 2Gi (3%) 7m21s
vector vector-ptw8b 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 8m23s
velero velero-6f4d774858-qtbld 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 6m43s
zarf agent-hook-86cdfdc664-45q2x 100m (0%) 500m (1%) 32Mi (0%) 128Mi (0%) 20m
zarf agent-hook-86cdfdc664-vfb88 100m (0%) 500m (1%) 32Mi (0%) 128Mi (0%) 20m
zarf zarf-docker-registry-7958884866-7tdzh 100m (0%) 3 (9%) 256Mi (0%) 2Gi (3%) 15m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 6400m (20%) 81050m (253%)
memory 9964Mi (15%) 47418Mi (73%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
nvidia.com/gpu 0 0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 33m kube-proxy
Normal Starting 33m kubelet Starting kubelet.
Warning InvalidDiskCapacity 33m kubelet invalid capacity 0 on image filesystem
Normal NodeHasSufficientMemory 33m (x2 over 33m) kubelet Node k3d-uds-server-0 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 33m (x2 over 33m) kubelet Node k3d-uds-server-0 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 33m (x2 over 33m) kubelet Node k3d-uds-server-0 status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 33m kubelet Updated Node Allocatable limit across pods
Normal NodeReady 33m kubelet Node k3d-uds-server-0 status is now: NodeReady
Normal Synced 33m cloud-node-controller Node synced successfully
Normal NodePasswordValidationComplete 33m k3s-supervisor Deferred node password secret validation complete
Normal RegisteredNode 32m node-controller Node k3d-uds-server-0 event: Registered Node k3d-uds-server-0 in Controller
root@law-server ~jlaw/dev/uds-k3d/tmp/uds-k3d 117-feat-optional-cuda-image-and-package
❯ uds zarf tools kubectl exec -it daemonset/nvidia-device-plugin-daemonset -n kube-system -c nvidia-device-plugin -- nvidia-smi
NOTE Using config file
Wed Oct 9 15:57:35 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.58.02 Driver Version: 555.58.02 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4070 ... Off | 00000000:01:00.0 Off | N/A |
| N/A 43C P8 3W / 115W | 1MiB / 8188MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 4090 Off | 00000000:28:00.0 Off | Off |
| 30% 29C P8 23W / 450W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
I have confirmed that base Zarf package deployment, + UDS Core, and + UDS Core Slim Dev (IAW this repository's

Requested reviews:

@CollectiveUnicorn: since you have NVIDIA GPUs, your review is for confirming that the Zarf package works, and that the instructions in

@rjferguson21 @Racer159 @mjnagel: (1 of) your reviews are required for the permission to merge into main and for checking the new/modified patterns and documentation.

**Filled-in Documentation Commands:**

# situation 1: only the UDS K3d `cuda` flavor package
uds run default-cuda
# situation 2: UDS Core full on top of the UDS K3d `cuda` flavor package
export PACKAGE_VERSION=0.9.0
uds run default-cuda
uds zarf package deploy oci://ghcr.io/defenseunicorns/packages/uds/core:0.30.0-upstream --confirm
# situation 3: UDS Core slim dev on top of the UDS K3d `cuda` flavor package (uses a published image from a fork of this repo branch)
uds deploy k3d-core-slim-dev:0.30.0 --set K3D_EXTRA_ARGS="--gpus=all --image=ghcr.io/justinthelaw/uds-k3d/cuda-k3s:v1.28.8-k3s1-cuda-12.5.0-base-ubuntu22.04" --confirm
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait --generate-name \
-n kube-system \
--values values/nvidia-gpu-operator-values.yaml \
  nvidia/gpu-operator
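Once the operator chart above settles, a quick sanity check (a sketch; exact pod names vary by operator version, so this just greps for the NVIDIA components and reads the node's allocatable GPU count):

```shell
#!/bin/sh
# Hedged verification sketch: requires a kubeconfig pointed at the k3d cluster.
# List the NVIDIA operator/validator/device-plugin pods deployed to kube-system.
kubectl get pods -n kube-system | grep -i nvidia

# Read how many GPUs the node advertises as allocatable (dots in the resource
# name must be escaped inside a kubectl jsonpath expression).
kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'
```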
Closing this PR in favor of moving the k3s-cuda image publishing workflow, tasks, and actions to the UDS AI repository.
Description
To run and test all variations, please see this PR branch's README.md and docs/GPU.md. Because images are not yet published, you must run
uds run default-cuda
to build locally. You can still set the K3S_IMAGE_VERSION
and CUDA_IMAGE_VERSION
variables. Please be sure to consult the NVIDIA documentation linked in GPU.md if you run into issues with your local GPU environment. Proof of the publishing workflows functioning, and proof of published package + image deployments working, can be found here.
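As a sketch of the version overrides mentioned above (this assumes UDS task variables are passed via `--set`; the pinned values here are illustrative, not taken from this PR):

```shell
#!/bin/sh
# Hypothetical example: build the cuda flavor locally while pinning the base
# k3s and CUDA image versions. Values shown are illustrative only.
uds run default-cuda \
  --set K3S_IMAGE_VERSION=v1.30.4-k3s1 \
  --set CUDA_IMAGE_VERSION=12.5.0-base-ubuntu22.04
```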
BREAKING CHANGES
cuda
, added and documentedCHANGES
yolo
mode)Related Issue
Fixes #117
Type of change
Checklist before merging