feat!: add optional k3s-cuda base image flavors #118
Conversation
In local testing:
Example of a full UDS Core cluster with a passing CUDA workload test: root@law-server ~jlaw/dev/uds-k3d/tmp/uds-k3d 117-feat-optional-cuda-image-and-package 12m 39s
❯ uds run validate-cuda --set CUDA_TEST="cuda-vector-add"
NOTE Using config file
pod/cuda-test-pod created
✔ Completed "Deploy the test pod to the cluster"
NOTE Using config file
NOTE Using config file
• Waiting for Pod/cuda-test-pod in namespace default to exist.
• Waiting for Pod/cuda-test-pod in namespace default to be {.status.phase}=Succeeded.
NOTE Using config file
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
✔ Completed "Await test completion and then display the test results"
NOTE Using config file
pod "cuda-test-pod" deleted
✔ Completed "Remove the completed test pod"
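For context, the `cuda-vector-add` test exercised above boils down to scheduling a pod that requests a GPU resource. A minimal sketch of such a manifest follows; the image tag and field values are illustrative assumptions, not copied from this repo's task definitions:

```yaml
# Hypothetical sketch of a CUDA vector-add test pod. The sample image tag is an
# assumption (NVIDIA publishes several cuda-sample vectoradd tags on nvcr.io);
# the GPU limit is what causes the scheduler to place it on a GPU-capable node.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-test-pod
  namespace: default
spec:
  restartPolicy: Never
  containers:
    - name: cuda-vector-add
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1
      resources:
        limits:
          nvidia.com/gpu: 1  # claims one GPU advertised by the NVIDIA device plugin
```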
root@law-server ~jlaw/dev/uds-k3d/tmp/uds-k3d 117-feat-optional-cuda-image-and-package 6s
❯ k get node -A
NOTE Using config file
NAME STATUS ROLES AGE VERSION
k3d-uds-server-0 Ready control-plane,master 33m v1.30.4+k3s1
root@law-server ~jlaw/dev/uds-k3d/tmp/uds-k3d 117-feat-optional-cuda-image-and-package
❯ k describe nodes -A
NOTE Using config file
Name: k3d-uds-server-0
Roles: control-plane,master
[DELETED]
["server","--disable","local-storage","--disable","traefik","--disable","metrics-server","--disable","servicelb","--tls-san","0.0.0.0","--...
k3s.io/node-config-hash: EJB25YOEUV7WTJ5L6EPXNGX2JEFPI4XJIX6RPJHUCHDCXHXSTF4A====
k3s.io/node-env: {"K3S_KUBECONFIG_OUTPUT":"/output/kubeconfig.yaml","K3S_TOKEN":"********"}
nfd.node.kubernetes.io/feature-labels:
cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.AVXVNNI,cpu-cpuid.BHI_CTRL,cpu-cpuid.CETIBT,cpu-cpuid.CETSS,cpu-cpuid...
node.alpha.kubernetes.io/ttl: 0
nvidia.com/gpu-driver-upgrade-enabled: true
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Wed, 09 Oct 2024 11:23:09 -0400
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: k3d-uds-server-0
AcquireTime: <unset>
RenewTime: Wed, 09 Oct 2024 11:56:18 -0400
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Wed, 09 Oct 2024 11:54:47 -0400 Wed, 09 Oct 2024 11:23:09 -0400 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Wed, 09 Oct 2024 11:54:47 -0400 Wed, 09 Oct 2024 11:23:09 -0400 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Wed, 09 Oct 2024 11:54:47 -0400 Wed, 09 Oct 2024 11:23:09 -0400 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Wed, 09 Oct 2024 11:54:47 -0400 Wed, 09 Oct 2024 11:23:09 -0400 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 172.18.0.2
Hostname: k3d-uds-server-0
Capacity:
cpu: 32
ephemeral-storage: 955657596Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 65680500Ki
nvidia.com/gpu: 2
pods: 110
Allocatable:
cpu: 32
ephemeral-storage: 929663708660
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 65680500Ki
nvidia.com/gpu: 2
pods: 110
System Info:
Machine ID:
System UUID:
Boot ID: 7ea5c482-108d-4a53-8b8c-7acd6a4b6f42
Kernel Version: 6.9.3-76060903-generic
OS Image: Ubuntu 22.04.3 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.7.20-k3s1
Kubelet Version: v1.30.4+k3s1
Kube-Proxy Version: v1.30.4+k3s1
PodCIDR: 10.42.0.0/24
PodCIDRs: 10.42.0.0/24
ProviderID: k3s://k3d-uds-server-0
Non-terminated Pods: (55 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
authservice authservice-5784cf6fb4-s9zk2 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 7m15s
grafana grafana-7949f7b65f-x5c68 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 8m1s
istio-admin-gateway admin-ingressgateway-6b88cb4fb7-pgggb 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 16m
istio-passthrough-gateway passthrough-ingressgateway-557f85ff88-dh68m 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 16m
istio-system istiod-d7d47f664-t92pv 500m (1%) 0 (0%) 2Gi (3%) 0 (0%) 17m
istio-tenant-gateway tenant-ingressgateway-6dc87cc74d-ktp54 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 16m
keycloak keycloak-0 600m (1%) 3 (9%) 640Mi (0%) 2Gi (3%) 15m
kube-system coredns-5666759999-v52vm 100m (0%) 0 (0%) 70Mi (0%) 170Mi (0%) 32m
kube-system gpu-feature-discovery-nmqbz 0 (0%) 0 (0%) 0 (0%) 0 (0%) 31m
kube-system gpu-operator-7fd97f74cf-9p5c5 200m (0%) 500m (1%) 100Mi (0%) 350Mi (0%) 32m
kube-system gpu-operator-node-feature-discovery-gc-6788b6ccf8-q6jnv 0 (0%) 0 (0%) 0 (0%) 0 (0%) 32m
kube-system gpu-operator-node-feature-discovery-master-bc9c67575-c29b5 0 (0%) 0 (0%) 0 (0%) 0 (0%) 32m
kube-system gpu-operator-node-feature-discovery-worker-mjlcd 0 (0%) 0 (0%) 0 (0%) 0 (0%) 32m
kube-system nvidia-dcgm-exporter-2t4z7 0 (0%) 0 (0%) 0 (0%) 0 (0%) 31m
kube-system nvidia-device-plugin-daemonset-n8sdl 0 (0%) 0 (0%) 0 (0%) 0 (0%) 31m
kube-system nvidia-operator-validator-4rf5g 0 (0%) 0 (0%) 0 (0%) 0 (0%) 31m
loki loki-backend-0 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 10m
loki loki-backend-1 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 10m
loki loki-backend-2 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 10m
loki loki-gateway-755b6f4bfc-mhr7b 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 10m
loki loki-read-7c7686d949-m8rsb 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 10m
loki loki-read-7c7686d949-p4xnc 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 10m
loki loki-read-7c7686d949-zf8qh 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 10m
loki loki-write-0 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 10m
loki loki-write-1 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 10m
loki loki-write-2 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 10m
metrics-server metrics-server-74cb6d9866-2xflp 200m (0%) 2 (6%) 328Mi (0%) 1Gi (1%) 16m
monitoring alertmanager-kube-prometheus-stack-alertmanager-0 150m (0%) 2100m (6%) 456Mi (0%) 1152Mi (1%) 8m48s
monitoring kube-prometheus-stack-kube-state-metrics-6d8b84c76-xf2g6 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 8m51s
monitoring kube-prometheus-stack-operator-5c66fd5dc4-gwsf6 200m (0%) 2500m (7%) 640Mi (0%) 1536Mi (2%) 8m51s
monitoring kube-prometheus-stack-prometheus-node-exporter-bpf7s 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 8m51s
monitoring prometheus-kube-prometheus-stack-prometheus-0 250m (0%) 2600m (8%) 768Mi (1%) 5248Mi (8%) 8m48s
neuvector neuvector-controller-pod-55b46c88cc-s67nv 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 10m
neuvector neuvector-controller-pod-55b46c88cc-w5lmp 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 8m21s
neuvector neuvector-controller-pod-55b46c88cc-whtwt 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 9m36s
neuvector neuvector-enforcer-pod-7lrwq 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 13m
neuvector neuvector-manager-pod-598bd94df7-j7rwf 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 13m
neuvector neuvector-scanner-pod-5bb75fd8d5-94z7f 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 13m
neuvector neuvector-scanner-pod-5bb75fd8d5-cs2vp 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 13m
neuvector neuvector-scanner-pod-5bb75fd8d5-fz6xz 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 13m
pepr-system pepr-uds-core-794f7cc6b6-fdsgl 200m (0%) 2500m (7%) 192Mi (0%) 1280Mi (1%) 16m
pepr-system pepr-uds-core-794f7cc6b6-lcm4c 200m (0%) 2500m (7%) 192Mi (0%) 1280Mi (1%) 16m
pepr-system pepr-uds-core-watcher-7fd6ddf9f5-dqpw2 200m (0%) 2500m (7%) 192Mi (0%) 1280Mi (1%) 16m
uds-dev-stack ensure-machine-id-wccwz 100m (0%) 100m (0%) 50Mi (0%) 50Mi (0%) 32m
uds-dev-stack local-path-provisioner-7cbf488c7f-mq8vn 0 (0%) 0 (0%) 0 (0%) 0 (0%) 32m
uds-dev-stack metallb-controller-77cb7f5d88-z7v8f 0 (0%) 0 (0%) 0 (0%) 0 (0%) 32m
uds-dev-stack metallb-speaker-k6q55 0 (0%) 0 (0%) 0 (0%) 0 (0%) 32m
uds-dev-stack minio-64774797ff-jrvn7 150m (0%) 0 (0%) 256Mi (0%) 0 (0%) 32m
uds-dev-stack nginx-d44h7 0 (0%) 0 (0%) 0 (0%) 0 (0%) 32m
uds-runtime uds-runtime-86c9c868d6-47b7v 350m (1%) 2750m (8%) 256Mi (0%) 2Gi (3%) 7m21s
vector vector-ptw8b 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 8m23s
velero velero-6f4d774858-qtbld 100m (0%) 2 (6%) 128Mi (0%) 1Gi (1%) 6m43s
zarf agent-hook-86cdfdc664-45q2x 100m (0%) 500m (1%) 32Mi (0%) 128Mi (0%) 20m
zarf agent-hook-86cdfdc664-vfb88 100m (0%) 500m (1%) 32Mi (0%) 128Mi (0%) 20m
zarf zarf-docker-registry-7958884866-7tdzh 100m (0%) 3 (9%) 256Mi (0%) 2Gi (3%) 15m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 6400m (20%) 81050m (253%)
memory 9964Mi (15%) 47418Mi (73%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
nvidia.com/gpu 0 0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 33m kube-proxy
Normal Starting 33m kubelet Starting kubelet.
Warning InvalidDiskCapacity 33m kubelet invalid capacity 0 on image filesystem
Normal NodeHasSufficientMemory 33m (x2 over 33m) kubelet Node k3d-uds-server-0 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 33m (x2 over 33m) kubelet Node k3d-uds-server-0 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 33m (x2 over 33m) kubelet Node k3d-uds-server-0 status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 33m kubelet Updated Node Allocatable limit across pods
Normal NodeReady 33m kubelet Node k3d-uds-server-0 status is now: NodeReady
Normal Synced 33m cloud-node-controller Node synced successfully
Normal NodePasswordValidationComplete 33m k3s-supervisor Deferred node password secret validation complete
Normal RegisteredNode 32m node-controller Node k3d-uds-server-0 event: Registered Node k3d-uds-server-0 in Controller
root@law-server ~jlaw/dev/uds-k3d/tmp/uds-k3d 117-feat-optional-cuda-image-and-package
❯ uds zarf tools kubectl exec -it daemonset/nvidia-device-plugin-daemonset -n kube-system -c nvidia-device-plugin -- nvidia-smi
NOTE Using config file
Wed Oct 9 15:57:35 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.58.02 Driver Version: 555.58.02 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4070 ... Off | 00000000:01:00.0 Off | N/A |
| N/A 43C P8 3W / 115W | 1MiB / 8188MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 4090 Off | 00000000:28:00.0 Off | Off |
| 30% 29C P8 23W / 450W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
I have confirmed that base Zarf package deployment, + UDS Core, and + UDS Core Slim Dev (IAW this repository's

Requested reviews:

@CollectiveUnicorn: since you have NVIDIA GPUs, your review is for confirming that the Zarf package works, and that the instructions in

@rjferguson21 @Racer159 @mjnagel: (1 of) your reviews are required for the permission to merge into main and for checking the new/modified patterns and documentation.

**Filled-in Documentation Commands:**

# situation 1: only the UDS K3d `cuda` flavor package
uds run default-cuda
# situation 2: UDS Core full on top of the UDS K3d `cuda` flavor package
export PACKAGE_VERSION=0.9.0
uds run default-cuda
uds zarf package deploy oci://ghcr.io/defenseunicorns/packages/uds/core:0.30.0-upstream --confirm
# situation 3: UDS Core slim dev on top of the UDS K3d `cuda` flavor package (uses a published image from a fork of this repo branch)
uds deploy k3d-core-slim-dev:0.30.0 --set K3D_EXTRA_ARGS="--gpus=all --image=ghcr.io/justinthelaw/uds-k3d/cuda-k3s:v1.28.8-k3s1-cuda-12.5.0-base-ubuntu22.04" --confirm
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait --generate-name \
-n kube-system \
--values values/nvidia-gpu-operator-values.yaml \
  nvidia/gpu-operator
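Once the operator chart above settles, a quick sanity check (a sketch; exact pod names vary by operator version, so this just greps for the NVIDIA components and reads the node's allocatable GPU count):

```shell
#!/bin/sh
# Hedged verification sketch: requires a kubeconfig pointed at the k3d cluster.
# List the NVIDIA operator/validator/device-plugin pods deployed to kube-system.
kubectl get pods -n kube-system | grep -i nvidia

# Read how many GPUs the node advertises as allocatable (dots in the resource
# name must be escaped inside a kubectl jsonpath expression).
kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'
```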
Closing this PR in favor of moving the k3s-cuda image publishing workflow, tasks, and actions to the UDS AI repository.
Description
To run and test all variations, please see this PR branch's README.md and docs/GPU.md. Because images are not yet published, you must run
uds run default-cuda
to build locally. You can still set the K3S_IMAGE_VERSION
and CUDA_IMAGE_VERSION
variables. Please be sure to consult the NVIDIA documentation linked in GPU.md if you run into issues with your local GPU environment. Proof of the publishing workflows functioning, and proof of published package + image deployments working, can be found here.
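As a sketch of the version overrides mentioned above (this assumes UDS task variables are passed via `--set`; the pinned values here are illustrative, not taken from this PR):

```shell
#!/bin/sh
# Hypothetical example: build the cuda flavor locally while pinning the base
# k3s and CUDA image versions. Values shown are illustrative only.
uds run default-cuda \
  --set K3S_IMAGE_VERSION=v1.30.4-k3s1 \
  --set CUDA_IMAGE_VERSION=12.5.0-base-ubuntu22.04
```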
BREAKING CHANGES
cuda
, added and documentedCHANGES
yolo
mode)Related Issue
Fixes #117
Type of change
Checklist before merging