
[bug][v1.23]: cluster_ca_cert and cluster_ca_key always trigger cluster updater #530

Open · ddelange opened this issue Mar 31, 2022 · 31 comments

@ddelange (Author)

Hi again!

[screenshot]

I just tried out v1.23 and spotted cluster_ca_cert and cluster_ca_key triggering the cluster updater. I haven't provided the secrets block in my cluster config.

@eddycharly (Owner) commented Mar 31, 2022

Thanks for reporting.
Does it happen every time, or only when you upgrade from a previous version of the provider?

@ddelange (Author) commented Apr 1, 2022

I would say every time. After a couple of applies like that, I checked again this morning and hit the same scenario on a terraform apply: revision = 13 -> 14 on kops_cluster.cluster.

Maybe useful info: I completely destroyed the 1.22 cluster and restarted with 1.23 in a different AZ as part of the upgrade, so the chances of 1.22 remnants are small (although I didn't explicitly check whether the tfstate was empty/deleted before spinning up 1.23).

@argoyle (Contributor) commented Apr 1, 2022

Just to help narrow down what the problem might be: I don't see this on my own prod cluster, where I have a docker config defined, nor on a fresh test cluster where I have no docker config.

@eddycharly (Owner)

@ddelange did you see the same problem with 1.22, or only with 1.23?

@eddycharly (Owner)

I spent some time trying to reproduce the issue, but I didn't succeed.
Can you share your tf config?

@ddelange (Author) commented Apr 1, 2022

Hmm, interesting! Thanks for checking, guys. This was not the case before upgrading to 1.23. The only diff to cluster.tf, apart from the version bump, was adding the containerd.config_override block. Adding a (basic-auth) private docker registry to the cluster was the whole reason for upgrading to 1.23: 1.22 tops out at containerd 1.4, and I needed 1.5+ to make use of containerd's registry mirror auth functionality.

Here's an excerpt of our cluster.tf
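
For illustration only (this is a hedged sketch, not the author's excerpt), a containerd config_override block enabling registry mirror auth could look roughly like this; the registry host and credentials are placeholders, and the TOML keys follow containerd 1.5's CRI registry config:

resource "kops_cluster" "cluster" {
  // ...
  containerd {
    // containerd 1.5+ CRI registry mirror with basic auth (placeholder host/credentials)
    config_override = <<EOF
version = 2
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."registry.example.com"]
  endpoint = ["https://registry.example.com"]
[plugins."io.containerd.grpc.v1.cri".registry.configs."registry.example.com".auth]
  username = "example-user"
  password = "example-password"
EOF
  }
  // ...
}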

@eddycharly (Owner)

This looks similar to what I tested.
You used the provider v1.23.0-alpha.1 with k8s 1.23, right?

@ddelange (Author) commented Apr 1, 2022

Correct, 1.23.0-alpha.1 and:

variable "kubernetes_version" {
  type        = string
  description = "Kubernetes version to use for the cluster. MAJOR.MINOR here should not be newer than the kops provider version in versions.tf ref https://kops.sigs.k8s.io/welcome/releases/"
  default     = "v1.23.5"
}

I'm now trying to isolate the bug to the config_override block (applying without it now, to see whether a subsequent apply still triggers the updater).
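
As a hedged sketch of how this presumably ties together (the provider pin in versions.tf that the variable's description refers to, and the variable feeding the cluster spec; the source address and attribute names are assumptions, not taken from the author's files):

// versions.tf (assumed): the cluster's Kubernetes MAJOR.MINOR should not be newer than this provider version
terraform {
  required_providers {
    kops = {
      source  = "eddycharly/kops"
      version = "1.23.0-alpha.1"
    }
  }
}

// cluster.tf (assumed wiring of the variable above)
resource "kops_cluster" "cluster" {
  // ...
  kubernetes_version = var.kubernetes_version
  // ...
}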

@eddycharly (Owner)

I don't think your issue is related to the config override block, but thanks for trying it out.

@ddelange (Author) commented Apr 1, 2022

Looks unrelated indeed. Behaviour stays the same 🤔

@eddycharly (Owner)

Could be related to permissions on the S3 bucket.
Can you check that the files are correctly stored in the bucket (pki/private/kubernetes-ca/keyset.yaml)?

@ddelange (Author) commented Apr 1, 2022

The file exists and looks like a valid manifest. It was created (r/w perms) by my AWS account yesterday when I created the 1.23 cluster, with the same permissions as the rest of the files, e.g. those under /instancegroups.

@eddycharly (Owner)

Hmmm, I'm running dry on ideas :-(
There was an invalid reference to the k/k package in v1.23.0-alpha.1, but I don't think it would cause such an issue.
I can try to cut v1.23.0-alpha.2, though.

@ddelange (Author) commented Apr 1, 2022

Many thanks for taking the time!

I'll just close this for now, and if it disappears over time (or I miraculously find a fix) I'll report back here.

Have a nice weekend 💥

@ddelange closed this as completed Apr 1, 2022
@ddelange (Author) commented Apr 2, 2022

The hotfix seems to be holding steady :)

  lifecycle {
    ignore_changes = [
      secrets,
    ]
  }

@eddycharly (Owner)

It’s a hack, you shouldn’t need this.
If possible I would empty the bucket and try from scratch completely.

@eddycharly (Owner)

I just released v1.23.0-alpha.2 but I doubt it will fix your issue.

@ddelange (Author) commented Apr 2, 2022

Thanks for the ping!

@ddelange (Author) commented Apr 3, 2022

> If possible I would empty the bucket and try from scratch completely.

Seems like that did not solve the issue 🤔

@peter-svensson

@argoyle and I had this issue as well on some clusters.
It seems to work if we always define secrets in the config, like:

  secrets {
    docker_config   = "{}"
  }

@eddycharly (Owner)

Interesting, I am going to reopen and investigate the issue again.

@eddycharly reopened this Jun 2, 2022
@eddycharly (Owner)

I was able to somewhat reproduce the issue:

  • create a cluster with ca cert/private key
  • update the cluster, removing ca cert/private key
  • next applies always trigger an update

Would that look like the scenario you are hitting?

@ddelange (Author) commented Jun 2, 2022

FWIW, I never used the secrets block.

@eddycharly (Owner)

I have a good suspect in mind but no access to an AWS account, making it difficult to track it down.

CAs are stored in pki/private/kubernetes-ca/keyset.yaml (maybe in pki/private/ca/keyset.yaml with older kOps versions).

When one doesn't provide a CA cert/key, kOps will create one, and I'm not sure whether it is stored in the same place (I suppose it is, but can't confirm).

From what I understand, it's not possible to remove a CA that is in use, so that is probably related.

What could happen:

  • You create a cluster without specifying the CA
  • kOps generates a CA automatically and stores it
  • Finally the CA ends up in the state
  • When you run a plan, if you don't specify the secrets block, terraform tries to delete the CA
  • Deleting a CA is not possible (it can be rotated, but a CA that is in use cannot be deleted)
  • The CA keeps being added back to the state, so there is a permanent diff

Now, if you specify the secrets block (you can leave it empty), it looks like terraform no longer complains (because the CA cert/key are marked as computed).

resource "kops_cluster" "cluster" {
  // ...
  secrets {}
  // ...
}

It would be nice if someone could confirm this.

@eddycharly (Owner)

@peter-svensson @argoyle as mentioned in my comment above, I suspect you can use an empty secrets block; no need for a dummy docker_config.

@ddelange (Author) commented Jun 2, 2022

--- a/k8s/kops/cluster.tf
+++ b/k8s/kops/cluster.tf
@@ -217,12 +217,7 @@ EOF
     }
   }

-  lifecycle {
-    ignore_changes = [
-      secrets,
-    ]
-  }
+  secrets {}
 }

This change did indeed not trigger the updater!

@argoyle (Contributor) commented Jun 2, 2022

Seems to do the trick with our setup as well 🎉

@ddelange (Author) commented Jun 2, 2022

We also have an

  authorization {
    always_allow {}
  }

block for the same reason btw :)

@eddycharly (Owner)

Great, thanks for testing it!
I will consider making secrets required and documenting that it can be left empty.

@eddycharly (Owner)

@ddelange why not RBAC?

  authorization {
    rbac {}
  }

@ddelange (Author) commented Jun 15, 2022

We're spinning up Rancher v2 on the cluster via helm chart v2.6.5.

I flipped the authorization config this morning and recreated the cluster, but the Rancher mechanics (lots of opaque helm jobs, etc.) don't like it, and the cluster no longer shows up in the Rancher UI. I've been skimming through the various pod logs, but found no RBAC-related messages.

Ironically, some time ago I managed to fix another Rancher-related issue by creating a ClusterRole, which shouldn't have been necessary given that we had always_allow, if I understand correctly 🤔

Like I wrote there, spinning up with always_allow still shows rbac:

$ kubectl api-versions | grep rbac
rbac.authorization.k8s.io/v1

EDIT: now successfully changed to rbac. The cluster was only not being recognised because I was logged in to Rancher via GitHub OAuth (my permissions had been recreated with the default standard-user role after I wiped the cluster); logging in as admin allowed me to see the cluster again.
