
Add printing of kubectl error message #714

Open
MrMarvin wants to merge 1 commit into main
Conversation

MrMarvin

What are you trying to accomplish with this PR?
Currently, if an error occurs at the kubectl apply call while applying a manifest, that error gets swallowed by krane: it reports that an error occurred, but not what the error is.

This behaviour makes it practically impossible to debug the underlying issue, for example permission problems on the (service) account used to run krane in a CD environment.

Example:

[INFO][2020-04-20 14:21:46 +0000][staging][someapp-staging]	
[INFO][2020-04-20 14:21:46 +0000][staging][someapp-staging]	------------------------------------------Result: FAILURE-------------------------------------------
[FATAL][2020-04-20 14:21:46 +0000][staging][someapp-staging]	Command failed: apply -f /tmp/d20200420-9481-sf4wgt --prune --all --prune-whitelist\=/v1/configmap --prune-whitelist\=/v1/endpoints --prune-whitelist\=/v1/event --prune-whitelist\=/v1/limitrange --prune-whitelist\=/v1/persistentvolumeclaim --prune-whitelist\=/v1/pod --prune-whitelist\=/v1/podtemplate --prune-whitelist\=/v1/replicationcontroller --prune-whitelist\=/v1/resourcequota --prune-whitelist\=/v1/secret --prune-whitelist\=/v1/serviceaccount --prune-whitelist\=/v1/service --prune-whitelist\=apps/v1/daemonset --prune-whitelist\=apps/v1/deployment --prune-whitelist\=apps/v1/replicaset --prune-whitelist\=apps/v1/statefulset --prune-whitelist\=autoscaling/v2beta1/horizontalpodautoscaler --prune-whitelist\=batch/v1beta1/cronjob --prune-whitelist\=batch/v1/job --prune-whitelist\=certmanager.k8s.io/v1alpha1/certificaterequest --prune-whitelist\=certmanager.k8s.io/v1alpha1/certificate --prune-whitelist\=certmanager.k8s.io/v1alpha1/challenge --prune-whitelist\=certmanager.k8s.io/v1alpha1/issuer --prune-whitelist\=certmanager.k8s.io/v1alpha1/order --prune-whitelist\=cloud.google.com/v1beta1/backendconfig --prune-whitelist\=crd.projectcalico.org/v1/networkpolicy --prune-whitelist\=extensions/v1beta1/ingress --prune-whitelist\=internal.autoscaling.k8s.io/v1alpha1/capacityrequest --prune-whitelist\=networking.gke.io/v1beta1/managedcertificate --prune-whitelist\=nodemanagement.gke.io/v1alpha1/updateinfo --prune-whitelist\=policy/v1beta1/poddisruptionbudget --prune-whitelist\=rbac.authorization.k8s.io/v1/rolebinding --prune-whitelist\=rbac.authorization.k8s.io/v1/role --prune-whitelist\=scalingpolicy.kope.io/v1alpha1/scalingpolicy
[FATAL][2020-04-20 14:21:46 +0000][staging][someapp-staging]	

Not helpful: even if a user had access to the system/environment to replicate the kubectl apply call manually, krane has already cleaned up the temp dir by that point.

With this change:

[INFO][2020-04-20 15:44:40 +0000][staging][someapp-staging]	------------------------------------------Result: FAILURE-------------------------------------------
[FATAL][2020-04-20 15:44:40 +0000][staging][someapp-staging]	Command failed: apply -f /tmp/d20200420-757-1v2j9wa --prune --all --prune-whitelist\=/v1/configmap --prune-whitelist\=/v1/endpoints --prune-whitelist\=/v1/event --prune-whitelist\=/v1/limitrange --prune-whitelist\=/v1/persistentvolumeclaim --prune-whitelist\=/v1/pod --prune-whitelist\=/v1/podtemplate --prune-whitelist\=/v1/replicationcontroller --prune-whitelist\=/v1/resourcequota --prune-whitelist\=/v1/secret --prune-whitelist\=/v1/serviceaccount --prune-whitelist\=/v1/service --prune-whitelist\=apps/v1/daemonset --prune-whitelist\=apps/v1/deployment --prune-whitelist\=apps/v1/replicaset --prune-whitelist\=apps/v1/statefulset --prune-whitelist\=autoscaling/v2beta1/horizontalpodautoscaler --prune-whitelist\=batch/v1beta1/cronjob --prune-whitelist\=batch/v1/job --prune-whitelist\=certmanager.k8s.io/v1alpha1/certificaterequest --prune-whitelist\=certmanager.k8s.io/v1alpha1/certificate --prune-whitelist\=certmanager.k8s.io/v1alpha1/challenge --prune-whitelist\=certmanager.k8s.io/v1alpha1/issuer --prune-whitelist\=certmanager.k8s.io/v1alpha1/order --prune-whitelist\=cloud.google.com/v1beta1/backendconfig --prune-whitelist\=crd.projectcalico.org/v1/networkpolicy --prune-whitelist\=extensions/v1beta1/ingress --prune-whitelist\=internal.autoscaling.k8s.io/v1alpha1/capacityrequest --prune-whitelist\=networking.gke.io/v1beta1/managedcertificate --prune-whitelist\=nodemanagement.gke.io/v1alpha1/updateinfo --prune-whitelist\=policy/v1beta1/poddisruptionbudget --prune-whitelist\=rbac.authorization.k8s.io/v1/rolebinding --prune-whitelist\=rbac.authorization.k8s.io/v1/role --prune-whitelist\=scalingpolicy.kope.io/v1alpha1/scalingpolicy:
error: error pruning namespaced object networking.gke.io/v1beta1, kind=managedcertificate: managedcertificates.networking.gke.io is forbidden: user "someserviceaccount@shopify-someapp-staging.iam.gserviceaccount.com" cannot list resource "managedcertificates" in api group "networking.gke.io" in the namespace "someapp-staging"
[FATAL][2020-04-20 15:44:40 +0000][staging][someapp-staging]	

How is this accomplished?
Add the collected error to the message presented to the user.
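For illustration, the idea is roughly the following Ruby sketch, assuming a helper that shells out to kubectl (the method and variable names are made up for this example, not krane's actual internals): capture kubectl's stderr and append it to the failure message instead of discarding it.

```ruby
require "open3"

# Hypothetical sketch only: run `kubectl apply` and include the captured
# stderr in the error raised to the user, so the underlying cause
# (e.g. an RBAC "forbidden" error) shows up in the deploy log.
def apply_manifests(dir)
  cmd = ["kubectl", "apply", "-f", dir, "--prune", "--all"]
  out, err, status = Open3.capture3(*cmd)
  return out if status.success?

  raise "Command failed: #{cmd.join(' ')}:\n#{err}"
end
```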

What could go wrong?
Leaking sensitive information, maybe, somehow, eventually?

@MrMarvin MrMarvin requested a review from a team as a code owner April 21, 2020 09:16
@timothysmith0609
Contributor

Thanks for your contribution!

Error reporting is tricky in Krane, since we need to avoid the scenario where we leak secret information (this can happen if kubectl apply fails and dumps an unencrypted secret resource, for example). Krane tries to be smart about this, only suppressing output when there is a risk of exposing sensitive information, but there are fundamental difficulties that constrain our ability to guarantee correct behaviour. This logic exists in record_apply_failure (the line preceding your change). If you are not deploying any sensitive content, then the solution here is to figure out why output is being suppressed in the first place. Can you confirm whether or not output should be suppressed in your stated case? If not, we can explore that.

@MrMarvin
Author

Thanks @timothysmith0609, this helps clarify the situation.
While I agree that filtering secrets out of downstream kubectl output is a fundamentally hard problem, I do not completely agree with the assumptions made.
Not potentially exposing secrets has to be weighed against blocking users from using krane at all. In this particular case, the only option left would be to ditch krane and fall back to managing kubectl apply manually again (shell script or whatever), including the ability to shoot yourself in the foot by leaking secrets. As it stands, it is unusable.

We should at least

  1. Clearly state that additional error context is available but withheld for security reasons.
  2. Offer a way to get that output, or allow an administrator to re-run the exact same kubectl command manually, i.e. not wipe the temp dirs in the error case (rough sketch below).
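A rough sketch of both suggestions, using made-up names rather than anything from krane itself: clean up the rendered manifests only on success, and on failure say explicitly that more context exists but was withheld, and where the manifests were kept.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical sketch: keep the temp dir around on failure so an administrator
# can re-run the exact same `kubectl apply -f <dir>` by hand.
def with_manifest_dir
  dir = Dir.mktmpdir("krane")
  yield dir
  FileUtils.remove_entry(dir) # only reached on success
rescue StandardError
  warn "Apply failed. kubectl output was withheld for security reasons; " \
       "rendered manifests kept at #{dir} for manual inspection."
  raise
end
```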

> Can you confirm whether or not output should be suppressed in your stated case? If not, we can explore that.

To be honest, I don't know. We do deploy a secrets.ejson alongside our resource files with krane, so I guess yes? But with my understanding of how krane splits all resources into multiple steps, this particular step's only reference to 'secret' is in its prune whitelists.
The output I posted above is complete and unmodified, and it does not contain anything sensitive. So maybe not?
