[GSoC] Compatibility Changes in Trial Controller #2394

Electronic-Waste · 2024-07-24T13:55:20Z

What this PR does / why we need it:

I made some compatibility changes to the Trial Controller. Design details: https://github.com/kubeflow/katib/blob/master/docs/proposals/push-based-metrics-collection.md#compatibility-changes-in-trial-controller

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

Docs included if any changes are user facing

Electronic-Waste · 2024-08-01T17:32:33Z

@andreyvelich @johnugeorge PTAL👀 if you are available. Thanks!

ref issue: #2340

andreyvelich

Thank you for this @Electronic-Waste!
cc @kubeflow/wg-automl-leads

pkg/controller.v1beta1/trial/trial_controller.go

pkg/controller.v1beta1/trial/trial_controller_util.go

andreyvelich · 2024-08-07T12:53:30Z

/rerun-all

pkg/controller.v1beta1/trial/trial_controller_util.go

andreyvelich · 2024-08-12T14:21:23Z

/assign @johnugeorge @tenzen-y
Please review it when you have time.

tenzen-y · 2024-08-14T11:09:00Z

> /assign @johnugeorge @tenzen-y
> Please review it when you have time.

ACK

tenzen-y

Basically, lgtm

pkg/controller.v1beta1/trial/trial_controller_util.go

pkg/controller.v1beta1/trial/trial_controller_test.go

johnugeorge · 2024-08-23T11:02:41Z

/lgtm

tenzen-y · 2024-08-23T13:58:42Z

/area gsoc

Signed-off-by: Electronic-Waste <2690692950@qq.com>

Electronic-Waste · 2024-09-09T11:28:07Z

@andreyvelich @tenzen-y I've fixed the flaky error by separating the UTs and adjusting the EXPECT() mock clauses!

AFAIK, the flakiness comes from the fact that the number of times reconciliation is triggered is not deterministic, so the number of times each mocked function is called is not deterministic either. The most annoying part was that GetObservationLog and ReportObservationLog must be called in order, and we didn't know how many times each would be triggered when using the Push MC.

So I kept some EXPECT() calls inside gomock.InOrder and added new expectations with .AnyTimes() outside it. Please let me know if any modifications need to be made to this PR. Thanks!
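
For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of why relaxing the expectations can remove this kind of flakiness. It does not use gomock and is not the code from this PR: `fakeMock` and `simulate` are hypothetical stand-ins, and the assumption that only the first reconcile reports the observation log is made purely for illustration.

```go
package main

import "fmt"

// fakeMock is a hypothetical, hand-rolled stand-in for a gomock-generated mock.
type fakeMock struct {
	ordered []string        // calls that must occur in this order, similar in spirit to gomock.InOrder
	next    int             // index of the next ordered call still expected
	relaxed map[string]bool // methods allowed any number of times, similar in spirit to .AnyTimes()
}

// call records one invocation and returns an error if no expectation matches it.
func (m *fakeMock) call(method string) error {
	if m.next < len(m.ordered) && m.ordered[m.next] == method {
		m.next++
		return nil
	}
	if m.relaxed[method] {
		return nil
	}
	return fmt.Errorf("unexpected call to %s", method)
}

// simulate runs the given number of reconcile loops against the mock.
// Assumption (illustration only): the first loop reads and then reports the
// observation log, and every later loop only reads it again.
func simulate(reconciles int, withAnyTimes bool) error {
	m := &fakeMock{
		ordered: []string{"GetObservationLog", "ReportObservationLog"},
		relaxed: map[string]bool{},
	}
	if withAnyTimes {
		m.relaxed["GetObservationLog"] = true
	}
	for i := 0; i < reconciles; i++ {
		if err := m.call("GetObservationLog"); err != nil {
			return err
		}
		if i == 0 {
			if err := m.call("ReportObservationLog"); err != nil {
				return err
			}
		}
	}
	// All ordered expectations must have been consumed by the end.
	if m.next < len(m.ordered) {
		return fmt.Errorf("missing call to %s", m.ordered[m.next])
	}
	return nil
}

func main() {
	fmt.Println("strict, 3 reconciles: ", simulate(3, false))
	fmt.Println("relaxed, 3 reconciles:", simulate(3, true))
}
```

With only the ordered expectations, the outcome depends on how many reconcile loops happen to run. Adding the relaxed expectation keeps the ordering check for the first calls while tolerating extra reads.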

Electronic-Waste · 2024-09-09T15:32:43Z

@kubeflow/wg-automl-leads it seems that the coverage report jobs failed unexpectedly. Could you please re-run these test cases and check them again?

andreyvelich · 2024-09-09T15:50:58Z

@Electronic-Waste I think you can re-trigger the tests by adding this comment: /rerun-all

andreyvelich · 2024-09-09T15:51:02Z

/rerun-all

Electronic-Waste · 2024-09-09T15:59:16Z

Thanks @andreyvelich
What I meant was re-running only those few failed test cases. Sorry for not realizing that I can re-run them all myself. I'll re-run them on my own in the future.

andreyvelich · 2024-09-09T16:05:38Z

Not sure why Coveralls fails on the report, though.
@Electronic-Waste Could you please try to investigate it?
cc @kubeflow/wg-training-leads

Electronic-Waste · 2024-09-09T16:15:43Z

@andreyvelich AFAIK, it fails sometimes. It may return to normal in a few hours.

Electronic-Waste · 2024-09-10T08:51:18Z

@andreyvelich There is an issue describing it: lemurheavy/coveralls-public#1716

It seems that the coverage report will fail if we rerun the old CI build: lemurheavy/coveralls-public#1716 (comment)

Signed-off-by: Electronic-Waste <2690692950@qq.com>

Electronic-Waste · 2024-09-11T15:13:19Z

> Some test cases are so flaky.
> @Electronic-Waste Could you investigate them?

@tenzen-y I think the flaky issue of the UTs has been solved now. Could you take a look and check whether it looks good to you?

Electronic-Waste · 2024-09-11T15:50:56Z

> @andreyvelich There is an issue describing it: lemurheavy/coveralls-public#1716
>
> It seems that the coverage report will fail if we rerun the old CI build: lemurheavy/coveralls-public#1716 (comment)

@tenzen-y FYR, the coverage report will fail if we rerun the old CI build.

Maybe we should restart all CI builds and test them all.

tenzen-y · 2024-09-11T15:57:57Z

/rerun-all

Electronic-Waste · 2024-09-11T16:02:38Z

@tenzen-y The /rerun-all command only restarts Go Test. We need to restart all CI builds, since the reported error is:

{"message":"Can't add a job to a build that is already closed. Build 10789114390 is closed. See docs.coveralls.io/parallel-builds","error":true}

Maybe this comment lemurheavy/coveralls-public#1716 (comment) is useful for reference.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

Electronic-Waste · 2024-09-11T16:09:10Z

I made some tiny changes to the code comments, which will retrigger all CI builds.

Maybe this can help you check the robustness of UTs in Go Test @tenzen-y :)

Electronic-Waste · 2024-09-12T13:12:31Z

/rerun-all

Electronic-Waste · 2024-09-13T14:00:47Z

> @andreyvelich @tenzen-y I've fixed the flaky error by separating the UTs and adjusting the EXPECT() mock clauses!
>
> AFAIK, the flakiness comes from the fact that the number of times reconciliation is triggered is not deterministic, so the number of times each mocked function is called is not deterministic either. The most annoying part was that GetObservationLog and ReportObservationLog must be called in order, and we didn't know how many times each would be triggered when using the Push MC.
>
> So I kept some EXPECT() calls inside gomock.InOrder and added new expectations with .AnyTimes() outside it. Please let me know if any modifications need to be made to this PR. Thanks!

@tenzen-y I'm sure that the flaky error has been addressed now. May I ask whether you need to check it again or not? I can trigger the CI builds again by pushing more tiny changes since the coverage report only works when we retrigger all CI builds.

tenzen-y · 2024-09-18T18:35:02Z

> > @andreyvelich @tenzen-y I've fixed the flaky error by separating the UTs and adjusting the EXPECT() mock clauses!
> > AFAIK, the flakiness comes from the fact that the number of times reconciliation is triggered is not deterministic, so the number of times each mocked function is called is not deterministic either. The most annoying part was that GetObservationLog and ReportObservationLog must be called in order, and we didn't know how many times each would be triggered when using the Push MC.
> > So I kept some EXPECT() calls inside gomock.InOrder and added new expectations with .AnyTimes() outside it. Please let me know if any modifications need to be made to this PR. Thanks!
>
> @tenzen-y I'm sure that the flaky error has been addressed now. May I ask whether you need to check it again or not? I can trigger the CI builds again by pushing more tiny changes since the coverage report only works when we retrigger all CI builds.

Thank you for driving this! Based on the CI results over the past 2 weeks, I am sure that we succeeded in getting rid of the root causes of the flakiness.

https://github.com/kubeflow/katib/actions/workflows/test-go.yaml

tenzen-y

Most things lgtm

pkg/controller.v1beta1/trial/trial_controller_test.go

Signed-off-by: Electronic-Waste <2690692950@qq.com>

tenzen-y

Those were great improvements!
Thank you for doing this!

/lgtm
/approve

google-oss-prow · 2024-09-19T06:24:49Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [andreyvelich,tenzen-y]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Electronic-Waste · 2024-09-19T07:22:48Z

Thank you for your detailed review @tenzen-y!

This PR is on hold now. Can you remove the hold label so that this PR can be merged?

tenzen-y · 2024-09-19T07:23:50Z

> Thank you for your detailed review @tenzen-y!
>
> This PR is on hold now. Can you remove the hold label so that this PR can be merged?

Sure.
/hold cancel

google-oss-prow bot requested a review from andreyvelich July 24, 2024 13:55

google-oss-prow bot added the size/M label Jul 24, 2024

google-oss-prow bot requested review from anencore94 and johnugeorge July 24, 2024 13:55

Electronic-Waste mentioned this pull request Jul 24, 2024

[GSoC] Project6: Push-based Metrics Collection for Katib #2340

Open

6 tasks

andreyvelich reviewed Aug 1, 2024

google-oss-prow bot added size/L and removed size/M labels Aug 2, 2024

Electronic-Waste requested a review from andreyvelich August 2, 2024 21:36

andreyvelich reviewed Aug 7, 2024

pkg/controller.v1beta1/trial/trial_controller_util.go (resolved)

google-oss-prow bot assigned johnugeorge and tenzen-y Aug 12, 2024

Electronic-Waste requested a review from andreyvelich August 15, 2024 09:42

tenzen-y reviewed Aug 15, 2024

pkg/controller.v1beta1/trial/trial_controller_util.go (resolved)

pkg/controller.v1beta1/trial/trial_controller_test.go (outdated, resolved)

google-oss-prow bot added the lgtm label Aug 23, 2024

google-oss-prow bot added the area/gsoc label Aug 23, 2024

Electronic-Waste added 8 commits August 25, 2024 12:04

chore: add condition branch in requeue logic.

4410c1b

Signed-off-by: Electronic-Waste <2690692950@qq.com>

chore: add ReportObservationLog in katib_manager_util.go.

8a92977

Signed-off-by: Electronic-Waste <2690692950@qq.com>

chore: add ReportTrialUnavailableMetrics func.

4fb810c

Signed-off-by: Electronic-Waste <2690692950@qq.com>

chore: insert unavailable value into Katib DB.

712af68

Signed-off-by: Electronic-Waste <2690692950@qq.com>

fix: fix lint error.

76e3c95

Signed-off-by: Electronic-Waste <2690692950@qq.com>

fix: add nil condition judgement.

7f6271b

Signed-off-by: Electronic-Waste <2690692950@qq.com>

fix: add nil condition judgement in trial_controller_util.go

46a5afd

Signed-off-by: Electronic-Waste <2690692950@qq.com>

chore(trial): delete nil check of MC kind in the Trial controller.

59e98a3

Signed-off-by: Electronic-Waste <2690692950@qq.com>

test(trial): fix typo error.

ffe6089

Signed-off-by: Electronic-Waste <2690692950@qq.com>

test(trial): make some tiny changes.

ddd2cb3

Signed-off-by: Electronic-Waste <2690692950@qq.com>

tenzen-y reviewed Sep 18, 2024

pkg/controller.v1beta1/trial/trial_controller_test.go (resolved)

pkg/controller.v1beta1/trial/trial_controller_test.go (outdated, resolved)

pkg/controller.v1beta1/trial/trial_controller_test.go (outdated, resolved)

Electronic-Waste added 3 commits September 19, 2024 02:53

fix(trial): move cancel func to t.Cleanup.

6a7a528

Signed-off-by: Electronic-Waste <2690692950@qq.com>

fix(trial): use the propagated gomega instance to improve debuggability.

7cb4e3e

Signed-off-by: Electronic-Waste <2690692950@qq.com>

fix(trial): use gofmt to reformat code.

9604ed4

Signed-off-by: Electronic-Waste <2690692950@qq.com>

Electronic-Waste mentioned this pull request Sep 19, 2024

Improve Debuggability in UTs of Trial Controller #2431

Open

tenzen-y reviewed Sep 19, 2024

google-oss-prow bot added the lgtm label Sep 19, 2024

google-oss-prow bot removed the do-not-merge/hold label Sep 19, 2024

google-oss-prow bot merged commit 867c40a into kubeflow:master Sep 19, 2024
63 checks passed
