
[v4.y] Test VERIFY_PEER / VERIFY_NONE work against real cluster #545

Merged: 8 commits, Mar 14, 2022


cben (Collaborator) commented Feb 26, 2022

Follow-up to #540. (That PR fixed #525, which was only broken on master, not v4.y, but it's good to have these tests here too.)

Adds tests that Client honors VERIFY_PEER/VERIFY_NONE, by connecting to the temporary cluster that runs during the test/config/update_certs_k0s.rb script.
They are skipped during a regular rake test, so they will NOT run in CI.
(Can we run docker --privileged in GitHub Actions?)
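A minimal, stdlib-only Ruby sketch of the property these tests check (not the PR's actual test code): a local TLS server with a throwaway self-signed certificate stands in for the k0s apiserver, and a bare `OpenSSL::SSL::SSLSocket` stands in for `Client`. The method names `self_signed_cert`, `handshake` and `verify_mode_demo` are hypothetical. The expectation is that VERIFY_NONE connects, VERIFY_PEER rejects the certificate while it is untrusted, and VERIFY_PEER connects once that certificate is in the client's cert store.

```ruby
require 'openssl'
require 'socket'

# Hypothetical: a throwaway self-signed cert, standing in for a cluster's
# serving certificate that the client may or may not be told to trust.
def self_signed_cert
  key = OpenSSL::PKey::RSA.new(2048)
  name = OpenSSL::X509::Name.parse('/CN=localhost')
  cert = OpenSSL::X509::Certificate.new
  cert.version = 2 # X.509 v3
  cert.serial = 1
  cert.subject = name
  cert.issuer = name
  cert.public_key = key.public_key
  cert.not_before = Time.now - 60
  cert.not_after = Time.now + 3600
  cert.sign(key, OpenSSL::Digest.new('SHA256'))
  [cert, key]
end

# One TLS handshake against 127.0.0.1:port with the given verify mode.
# Returns :connected, or :rejected if certificate verification failed.
def handshake(port, verify_mode, trusted_cert = nil)
  ctx = OpenSSL::SSL::SSLContext.new
  ctx.verify_mode = verify_mode
  if trusted_cert
    store = OpenSSL::X509::Store.new
    store.add_cert(trusted_cert)
    ctx.cert_store = store
  end
  tcp = TCPSocket.new('127.0.0.1', port)
  ssl = OpenSSL::SSL::SSLSocket.new(tcp, ctx)
  ssl.connect
  :connected
rescue OpenSSL::SSL::SSLError
  :rejected
ensure
  # Best-effort cleanup; the peer may already have hung up.
  ssl&.close rescue nil
  tcp&.close rescue nil
end

# Serves one handshake per case with the same cert, and reports how each
# client verify mode treated it.
def verify_mode_demo
  cert, key = self_signed_cert
  server_ctx = OpenSSL::SSL::SSLContext.new
  server_ctx.cert = cert
  server_ctx.key = key
  tcp_server = TCPServer.new('127.0.0.1', 0) # port 0: any free port
  ssl_server = OpenSSL::SSL::SSLServer.new(tcp_server, server_ctx)
  port = tcp_server.addr[1]

  cases = {
    verify_none: [OpenSSL::SSL::VERIFY_NONE, nil],
    verify_peer_untrusted: [OpenSSL::SSL::VERIFY_PEER, nil],
    verify_peer_trusted: [OpenSSL::SSL::VERIFY_PEER, cert]
  }

  server = Thread.new do
    cases.size.times do
      begin
        ssl_server.accept.close # the server-side handshake runs inside accept
      rescue StandardError
        # The client aborted the handshake (e.g. it rejected our cert).
      end
    end
  end

  results = cases.transform_values { |mode, ca| handshake(port, mode, ca) }
  server.join(5)
  results
ensure
  ssl_server&.close
end
```

Calling `verify_mode_demo` returns a Hash keyed by case. The trusted VERIFY_PEER case guards against a test that only ever sees failures: VERIFY_PEER has to reject the untrusted certificate and also accept the trusted one.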

cben force-pushed the v4.y-test_real_cluster_ssl_verify branch from ef89b7f to e12b5a3 on February 26, 2022 23:02
cben (Collaborator, Author) commented Feb 26, 2022

Ooh, exciting! Yes, it can run in CI. (IIUC service containers can't be --privileged, but directly executing docker run --privileged on Linux works 💡)

  • It failed with SSL errors different from what I expected 🎉
    https://github.com/ManageIQ/kubeclient/runs/5346969200?check_suite_focus=true#step:6:60

    Some runs pass the full test suite, some fail. Usually Ruby 3.0 was the first to fail, but once it was 2.5 that failed. I'm starting to think it's a simple race condition with k0s starting up: the kubeconfig existing doesn't guarantee the apiserver is up yet?
    => Yep, saw curl also failing:
    curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to 127.0.0.1:6443
    curl: (35) OpenSSL SSL_connect: Connection reset by peer in connection to 127.0.0.2:6443
    => Added polling until the apiserver is serving.

  • Should switch to the docker.pkg.github.com/k0sproject/k0s/k0s image so GitHub Actions doesn't hammer Docker Hub.

  • Can it run on macOS / Windows? NO.
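The polling fix could look roughly like this hypothetical Ruby helper (`wait_until` and `WaitTimeout` are illustrative names, not code from the PR): retry a readiness probe against a monotonic-clock deadline, treating exceptions as "not ready yet".

```ruby
# Hypothetical readiness helper: k0s writing its kubeconfig does not mean
# the apiserver is already serving, so retry a probe until it succeeds.
class WaitTimeout < StandardError; end

# Calls the block every `interval` seconds until it returns a truthy value.
# Exceptions from the block (e.g. Errno::ECONNRESET during startup) count
# as "not ready yet". Raises WaitTimeout once `timeout` seconds have passed.
def wait_until(timeout:, interval: 0.5)
  # Monotonic clock: unaffected by wall-clock adjustments on the CI host.
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  last_error = nil
  loop do
    begin
      result = yield
      return result if result
    rescue StandardError => e
      last_error = e
    end
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise WaitTimeout, "not ready after #{timeout}s (last error: #{last_error.inspect})"
    end
    sleep interval
  end
end
```

Because curl's failures above happened inside SSL_connect, a probe that only opens a TCP connection to port 6443 could succeed too early; the block passed to `wait_until` would need to complete a TLS handshake or an HTTP request.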

cben force-pushed the v4.y-test_real_cluster_ssl_verify branch 4 times, most recently from 71a5ab8 to ca920f1 on February 27, 2022 13:50
cben force-pushed the v4.y-test_real_cluster_ssl_verify branch 4 times, most recently from 79a20c1 to f4c1a48 on March 4, 2022 14:12
cben force-pushed the v4.y-test_real_cluster_ssl_verify branch 3 times, most recently from a40c256 to cafebe0 on March 5, 2022 22:18
cben added 2 commits March 6, 2022 00:19
Yay, `docker run --privileged` is allowed! k0s starts up really fast too.
(Only on Linux; keeping the regular `rake test` on macOS because it has no `docker`.)

Sometimes minitest starts and then just hangs, printing... nothing?

    Run options: --seed 34017

    # Running:

    Error: The operation was canceled.

GitHub's default job timeout is a generous 6 hours!
https://docs.github.com/en/actions/learn-github-actions/usage-limits-billing-and-administration#usage-limits
But waiting is unproductive and wasteful (and burns our free minutes).

Setting the outer job timeout longer, to leave room for `bundle install`.
Hoping that an inner SIGTERM, received while the job is still running, may let
the test print some traceback. I have no idea where it's getting stuck.
(macOS has no `timeout` command, but most or all stuck runs were on Linux.)

(I would suspect `exercise_watcher_with_timeout`, but I also saw the same
symptoms on another PR without k0s tests.)
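One way to get the traceback this commit hopes for is a SIGTERM handler that dumps every thread's backtrace before the process dies. This is a hypothetical sketch for something like test/test_helper.rb, assuming the CI step wraps the tests in `timeout` (which sends SIGTERM by default); `thread_dump` and `install_term_thread_dump` are illustrative names.

```ruby
# Hypothetical: format the current backtrace of every live thread.
def thread_dump
  Thread.list.map do |thread|
    frames = thread.backtrace || ['(no backtrace available)']
    "Thread #{thread.inspect}:\n  #{frames.join("\n  ")}"
  end.join("\n")
end

# Hypothetical: on SIGTERM (e.g. from `timeout`), print where every thread
# is stuck, then let the process terminate as it otherwise would have.
def install_term_thread_dump
  Signal.trap('TERM') do
    $stderr.puts thread_dump
    # Restore Ruby's default SIGTERM handling and re-send the signal,
    # so the dump does not swallow the termination.
    Signal.trap('TERM', 'DEFAULT')
    Process.kill('TERM', Process.pid)
  end
end

install_term_thread_dump
```

Re-sending the signal after restoring the default handling keeps the process exiting on SIGTERM; the handler only adds output.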
cben force-pushed the v4.y-test_real_cluster_ssl_verify branch from cafebe0 to 367d432 on March 5, 2022 22:20
cben (Collaborator, Author) commented Mar 9, 2022

OK, pinpointed the timeout:

KubeclientRealClusterTest#test_real_cluster_verify_none = 1.15 s = .
KubeclientRealClusterTest#test_real_cluster_verify_peer = 
Error: Process completed with exit code 124.

Probably exercise_watcher_with_timeout, which suggests that .finish on watches is generally less reliable than I believed :-(
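Since the verbose output points at test_real_cluster_verify_peer and the job only died at the outer `timeout` (exit code 124), a per-operation deadline inside the test would fail fast and name the stuck frames. A hypothetical sketch (`run_with_deadline` and `StuckThreadError` are illustrative, not kubeclient API): run the blocking work, for example iterating a watch that `.finish` is expected to stop, in its own thread and report that thread's backtrace if it overruns.

```ruby
# Hypothetical error carrying the stuck thread's backtrace in its message.
class StuckThreadError < StandardError; end

# Runs the block in its own thread. Returns the block's value if it finishes
# within `seconds`; otherwise kills the thread and raises StuckThreadError
# showing where it was blocked.
def run_with_deadline(seconds, &work)
  thread = Thread.new(&work)
  # Thread#join returns nil on timeout, and re-raises if the block raised.
  return thread.value if thread.join(seconds)

  frames = thread.backtrace || []
  thread.kill
  thread.join(1)
  raise StuckThreadError, "still running after #{seconds}s:\n  #{frames.join("\n  ")}"
end
```

The returned failure replaces an opaque exit code 124 with a stack trace from inside the stuck operation.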

cben (Collaborator, Author) commented Mar 9, 2022

@russell @grosser @agrare I still need to tackle the runaway test ^, but please review the general direction.

I want real end-to-end tests of whether TLS verification actually verifies. The setup here, starting a real cluster, works surprisingly fast and opens the door to many more real-cluster tests in the future...

Review thread on test/test_helper.rb (outdated, resolved)
grosser (Contributor) left a comment


❤️ nice approach

cben and others added 3 commits March 10, 2022 11:16
Co-authored-by: Michael Grosser <michael@grosser.it>
Shows a stacktrace within Rake itself, not within the test.
cben (Collaborator, Author) commented Mar 14, 2022

Ignoring truffleruby, which always fails anyway (opened #551).
So far I have failed to reproduce locally the new cases sometimes getting stuck; it only happens on CI, at random.

Same CI situation when adding these on the master branch (#550).

But these provide real value, so I'm going to merge and we'll see how flaky it is in the future... (Worst case, we can always fall back to not running real-cluster tests in CI.)

cben merged commit b6d9098 into ManageIQ:v4.y Mar 14, 2022

2 participants