Netpols block kubeapi in long lived EKS cluster #821

Open
ntwkninja opened this issue Sep 25, 2024 · 3 comments
Labels
possible-bug Something may not be working

Comments

ntwkninja (Member) commented Sep 25, 2024

Environment

Device and OS: Bottlerocket
App version: 1.30
Kubernetes distro being used: AWS EKS
Other:

Steps to reproduce

  1. Deploy UDS Core with standard accoutrements
  2. Wait a few days for API IPs to change
  3. Try to do something that triggers an api action
  4. Check metrics-server, NeuVector, monitoring, Promtail, etc. for errors (see the checks sketched below)
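
For step 4, one quick way to surface the failures is to scan Warning events and probe the resource metrics API directly. This is a rough sketch; the metrics-server namespace and deployment name are assumed from the events shown in the Visual Proof below and may differ in your install.

# recent Warning events across all namespaces, oldest last
kubectl get events -A --field-selector type=Warning --sort-by=.lastTimestamp
# the resource metrics API fails once metrics-server can no longer reach the apiserver
kubectl top nodes
# look for timeouts / "context deadline exceeded" against the apiserver in metrics-server logs
kubectl logs -n metrics-server deploy/metrics-server --tail=50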

Expected result

The kubeapi netpols are updated as AWS rotates the API server endpoint IPs.

Actual Result

kubeapi addresses are not updated after being initially set

Visual Proof (screenshots, videos, text, etc)

NAMESPACE              LAST SEEN                  TYPE      REASON                    OBJECT                                          MESSAGE
istio-admin-gateway    31m (x56 over 21d)         Normal    SuccessfullyReconciled    Service/admin-ingressgateway                    Successfully reconciled
istio-login-gateway    31m (x55 over 21d)         Normal    SuccessfullyReconciled    Service/login-ingressgateway                    Successfully reconciled
istio-tenant-gateway   31m (x56 over 21d)         Normal    SuccessfullyReconciled    Service/tenant-ingressgateway                   Successfully reconciled
metrics-server         29m (x2451 over 4d19h)     Warning   Unhealthy                 Pod/metrics-server-59c9dddf69-8l4fk             Liveness probe failed: Get "http://100.64.75.152:15020/app-health/metrics-server/livez": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
metrics-server         4m1s (x29499 over 4d19h)   Warning   BackOff                   Pod/metrics-server-59c9dddf69-8l4fk             Back-off restarting failed container metrics-server in pod metrics-server-59c9dddf69-8l4fk_metrics-server(f619eae8-61d7-420c-a104-0c786e51242a)
istio-admin-gateway    2m47s (x3259 over 13h)     Warning   FailedGetResourceMetric   HorizontalPodAutoscaler/admin-ingressgateway    failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)
istio-login-gateway    2m47s (x3259 over 13h)     Warning   FailedGetResourceMetric   HorizontalPodAutoscaler/login-ingressgateway    failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)
istio-system           2m47s (x3259 over 13h)     Warning   FailedGetResourceMetric   HorizontalPodAutoscaler/istiod                  failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)
istio-tenant-gateway   2m47s (x3259 over 13h)     Warning   FailedGetResourceMetric   HorizontalPodAutoscaler/tenant-ingressgateway   failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)
keycloak               2m47s (x3259 over 13h)     Warning   FailedGetResourceMetric   HorizontalPodAutoscaler/keycloak                failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)
zarf                   2m47s (x3259 over 13h)     Warning   FailedGetResourceMetric   HorizontalPodAutoscaler/zarf-docker-registry    failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)

Severity/Priority

Additional Context

# get the current kube-apiserver endpoint IPs from the API server EndpointSlice
# (the default/kubernetes Service is where EKS publishes the control plane IPs)
IP1=$(kubectl get endpointslices.discovery.k8s.io -n default -l kubernetes.io/service-name=kubernetes -o json | jq -r '.items[0].endpoints[0] | select(.addresses != null) | .addresses[]' | head -n 1)
IP2=$(kubectl get endpointslices.discovery.k8s.io -n default -l kubernetes.io/service-name=kubernetes -o json | jq -r '.items[0].endpoints[1] | select(.addresses != null) | .addresses[]' | head -n 1)
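
With those fresh IPs in hand, the stale rules are easy to spot by dumping every ipBlock CIDR referenced by the cluster's NetworkPolicies and checking whether the new addresses are present. A minimal sketch, assuming the operator renders each API endpoint as a /32 egress CIDR (adjust if your policies use a broader range):

# dump every ipBlock CIDR referenced by any NetworkPolicy in the cluster
kubectl get networkpolicies -A -o json | jq -r '[.. | .cidr? // empty] | unique[]'
# the fresh apiserver IPs should appear in that list; if not, the kubeapi netpols are stale
echo "expected: ${IP1}/32 ${IP2}/32"
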
ntwkninja added the possible-bug label on Sep 25, 2024
mjnagel (Contributor) commented Sep 25, 2024

Does this resolve itself after a pepr watcher pod restart? I think in the past we've seen this issue when pepr "stops watching" the endpoints.

We have also floated the idea of adding a config option for end users to specify a CIDR range instead of relying on the pepr watch. We should probably just add that at this point given the inconsistency seen with the watch.
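
For anyone hitting this in the meantime, a rough sketch of the watcher-restart workaround. The pepr-system namespace and the pepr-uds-core-watcher deployment name are assumptions based on a default UDS Core install; confirm the actual name with the first command.

# confirm the watcher deployment name in your cluster
kubectl get deploy -n pepr-system
# bounce the watcher so it re-establishes its watch on the apiserver endpoints
kubectl rollout restart deploy/pepr-uds-core-watcher -n pepr-system
kubectl rollout status deploy/pepr-uds-core-watcher -n pepr-system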

ntwkninja (Member, Author) commented Sep 25, 2024

> Does this resolve itself after a pepr watcher pod restart? I think in the past we've seen this issue when pepr "stops watching" the endpoints.

I'll try restarting the watcher and report back. Update: that worked.

mjnagel modified the milestone: 0.29.0 (Sep 25, 2024)
joelmccoy (Contributor) commented:

Want to call out that we ran into this in our internal clusters as well during a k8s upgrade. Kicking the pepr watcher pod reconciled all the netpols.
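
For completeness, a quick post-restart sanity check, reusing IP1/IP2 from the snippet in the issue description, to confirm the policies were actually reconciled:

# after bouncing the watcher, the current apiserver IPs should show up in the generated egress CIDRs
kubectl get networkpolicies -A -o json | jq -r '[.. | .cidr? // empty] | unique[]' | grep -F -e "${IP1}" -e "${IP2}"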
