Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GSoC] Compatibility Changes in Trial Controller #2394

Merged

Conversation

Electronic-Waste
Copy link
Member

What this PR does / why we need it:

I made some compatibility changes to the Trial Controller. Design details: https://github.com/kubeflow/katib/blob/master/docs/proposals/push-based-metrics-collection.md#compatibility-changes-in-trial-controller

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

  • Docs included if any changes are user facing

@Electronic-Waste
Copy link
Member Author

@andreyvelich @johnugeorge PTAL👀 if you are available. Thanks!

ref issue: #2340

Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this @Electronic-Waste!
cc @kubeflow/wg-automl-leads

pkg/controller.v1beta1/trial/trial_controller.go Outdated Show resolved Hide resolved
pkg/controller.v1beta1/trial/trial_controller.go Outdated Show resolved Hide resolved
pkg/controller.v1beta1/trial/trial_controller_util.go Outdated Show resolved Hide resolved
pkg/controller.v1beta1/trial/trial_controller_util.go Outdated Show resolved Hide resolved
@andreyvelich
Copy link
Member

/rerun-all

@andreyvelich
Copy link
Member

/assign @johnugeorge @tenzen-y
Please review it when you have time.

@tenzen-y
Copy link
Member

/assign @johnugeorge @tenzen-y Please review it when you have time.

ACK

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically, lgtm

@johnugeorge
Copy link
Member

/lgtm

@google-oss-prow google-oss-prow bot added the lgtm label Aug 23, 2024
@tenzen-y
Copy link
Member

/area gsoc

Signed-off-by: Electronic-Waste <2690692950@qq.com>
Signed-off-by: Electronic-Waste <2690692950@qq.com>
Signed-off-by: Electronic-Waste <2690692950@qq.com>
Signed-off-by: Electronic-Waste <2690692950@qq.com>
Signed-off-by: Electronic-Waste <2690692950@qq.com>
Signed-off-by: Electronic-Waste <2690692950@qq.com>
Signed-off-by: Electronic-Waste <2690692950@qq.com>
Signed-off-by: Electronic-Waste <2690692950@qq.com>
@Electronic-Waste
Copy link
Member Author

@andreyvelich @tenzen-y I've fixed the flaky error by separating the UTs and adjusting the EXPECT() mock clause!

AFAIK, the flaky error is caused by the uncertain triggering times of the reconciliation, thus giving rise to the uncertainty of the times we call the function. The most annoying issue was that we must call GetObservationLog and ReportObservationLog in turn and we didn't know the times they were triggered when we use Push MC.

So I reserved some EXPECT() function in the gomock.InOrder and added some new .AnyTimes() clause outside it. Please let me know if there are any modifications need to be made to this PR. Thanks!

@Electronic-Waste
Copy link
Member Author

@kubeflow/wg-automl-leads it seems that the coverage reports have some accidents. Could you please re-rerun these test cases and check them again?

@andreyvelich
Copy link
Member

@Electronic-Waste I think, you can re-trigger tests by add this comment: /rerun-all

@andreyvelich
Copy link
Member

/rerun-all

@Electronic-Waste
Copy link
Member Author

Thanks @andreyvelich
I think what I mean is running those few failed test cases. Sorry for not realizing that I can re-run them all without you. I'll re-rerun them on my own in the future.

@andreyvelich
Copy link
Member

andreyvelich commented Sep 9, 2024

Not sure why coveralls fails on report tho.
@Electronic-Waste Please can you try to investigate it ?
cc @kubeflow/wg-training-leads

@Electronic-Waste
Copy link
Member Author

@andreyvelich AFAIK It fails sometimes. It may turn normal in a few hours.

@Electronic-Waste
Copy link
Member Author

@andreyvelich There is an issue describing it: lemurheavy/coveralls-public#1716

It seems that the coverage report will fail if we rerun the old CI build: lemurheavy/coveralls-public#1716 (comment)

Signed-off-by: Electronic-Waste <2690692950@qq.com>
@Electronic-Waste
Copy link
Member Author

Some test cases are so flaky.
@Electronic-Waste Could you investigate them?

@tenzen-y I think the flaky issue of UTs has been solved now. Could take a look if that looks good to you?

@Electronic-Waste
Copy link
Member Author

Electronic-Waste commented Sep 11, 2024

@andreyvelich There is an issue describing it: lemurheavy/coveralls-public#1716

It seems that the coverage report will fail if we rerun the old CI build: lemurheavy/coveralls-public#1716 (comment)

@tenzen-y FYR, the coverage report will fail if we rerun the old CI build.

Maybe we should restart all CI builds and test them all.

@tenzen-y
Copy link
Member

/rerun-all

@Electronic-Waste
Copy link
Member Author

Electronic-Waste commented Sep 11, 2024

@tenzen-y The /rerun-all command only restarts Go Test. We need to restart all CI builds since the bug report is:

{"message":"Can't add a job to a build that is already closed. Build 10789114390 is closed. See docs.coveralls.io/parallel-builds","error":true}

Maybe this comment lemurheavy/coveralls-public#1716 (comment) is useful for reference.

Signed-off-by: Electronic-Waste <2690692950@qq.com>
@Electronic-Waste
Copy link
Member Author

Electronic-Waste commented Sep 11, 2024

I made some tiny changes in the comment lines. It will retrigger all CI builds.

Maybe this can help you check the robustness of UTs in Go Test @tenzen-y :)

@Electronic-Waste
Copy link
Member Author

/rerun-all

@Electronic-Waste
Copy link
Member Author

Electronic-Waste commented Sep 13, 2024

@andreyvelich @tenzen-y I've fixed the flaky error by separating the UTs and adjusting the EXPECT() mock clause!

AFAIK, the flaky error is caused by the uncertain triggering times of the reconciliation, thus giving rise to the uncertainty of the times we call the function. The most annoying issue was that we must call GetObservationLog and ReportObservationLog in turn and we didn't know the times they were triggered when we use Push MC.

So I reserved some EXPECT() function in the gomock.InOrder and added some new .AnyTimes() clause outside it. Please let me know if there are any modifications need to be made to this PR. Thanks!

@tenzen-y I'm sure that the flaky error has been addressed now. May I ask whether you need to check it again or not? I can trigger the CI builds again by pushing more tiny changes since the coverage report only works when we retrigger all CI builds.

@tenzen-y
Copy link
Member

@andreyvelich @tenzen-y I've fixed the flaky error by separating the UTs and adjusting the EXPECT() mock clause!
AFAIK, the flaky error is caused by the uncertain triggering times of the reconciliation, thus giving rise to the uncertainty of the times we call the function. The most annoying issue was that we must call GetObservationLog and ReportObservationLog in turn and we didn't know the times they were triggered when we use Push MC.
So I reserved some EXPECT() function in the gomock.InOrder and added some new .AnyTimes() clause outside it. Please let me know if there are any modifications need to be made to this PR. Thanks!

@tenzen-y I'm sure that the flaky error has been addressed now. May I ask whether you need to check it again or not? I can trigger the CI builds again by pushing more tiny changes since the coverage report only works when we retrigger all CI builds.

Thank you for driving this! Throughout this 2 week CI result, I am sure that we succeeded to get rid of flakiness root causes.

https://github.com/kubeflow/katib/actions/workflows/test-go.yaml

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The most things lgtm

pkg/controller.v1beta1/trial/trial_controller_test.go Outdated Show resolved Hide resolved
pkg/controller.v1beta1/trial/trial_controller_test.go Outdated Show resolved Hide resolved
Signed-off-by: Electronic-Waste <2690692950@qq.com>
Signed-off-by: Electronic-Waste <2690692950@qq.com>
Signed-off-by: Electronic-Waste <2690692950@qq.com>
Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was great improvments!
Thank you for doing this!

/lgtm
/approve

@google-oss-prow google-oss-prow bot added the lgtm label Sep 19, 2024
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [andreyvelich,tenzen-y]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Electronic-Waste
Copy link
Member Author

Thank you for your detailed review @tenzen-y!

This PR is holding now. Can you remove the hold label so that this PR can be merged?

@tenzen-y
Copy link
Member

Thank you for your detailed review @tenzen-y!

This PR is holding now. Can you remove the hold label so that this PR can be merged?

Sure.
/hold cancel

@google-oss-prow google-oss-prow bot merged commit 867c40a into kubeflow:master Sep 19, 2024
63 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants