[SDK] test: Add e2e test for tune function. #2399

Merged: 23 commits, Aug 6, 2024
Commits
e195c74 fix(sdk): fix error field metrics_collector in tune function. (Electronic-Waste, Jul 30, 2024)
87a3a65 test(sdk): Add e2e tests for tune function. (Electronic-Waste, Jul 30, 2024)
153cdef test(sdk): add missing field parameters. (Electronic-Waste, Jul 30, 2024)
b43f603 refactor(test/sdk): add run-e2e-tune-api.py. (Electronic-Waste, Jul 30, 2024)
6925087 test(sdk): delete tune testing code in run-e2e-experiment. (Electronic-Waste, Jul 30, 2024)
c69cda6 test(sdk): add blank lines. (Electronic-Waste, Jul 30, 2024)
4edc40c test(sdk): add verbose and temporarily delete e2e-experiment test. (Electronic-Waste, Jul 31, 2024)
92a3bea test(sdk): add namespace_labels. (Electronic-Waste, Jul 31, 2024)
e61bc6d test(sdk): add time.sleep(5). (Electronic-Waste, Aug 1, 2024)
71d3aac test(sdk): add error output. (Electronic-Waste, Aug 1, 2024)
6ae4616 test(sdk): build random image for tune. (Electronic-Waste, Aug 1, 2024)
72b3b48 test(sdk): delete extra debug log. (Electronic-Waste, Aug 1, 2024)
a86a0ae refactor(test/sdk): create separate workflow for tune. (Electronic-Waste, Aug 1, 2024)
02b90b2 test(sdk): change api to API. (Electronic-Waste, Aug 1, 2024)
aec2649 test(sdk): change the permission of scripts. (Electronic-Waste, Aug 1, 2024)
b7140d2 test(sdk): delete exit code & comment image pulling. (Electronic-Waste, Aug 1, 2024)
108ec4a test(sdk): delete image pulling phase. (Electronic-Waste, Aug 1, 2024)
ef7a05c test(sdk): refactor workflow file to use template. (Electronic-Waste, Aug 2, 2024)
33fec7b test(sdk): mark experiments and trial-images as not required. (Electronic-Waste, Aug 6, 2024)
42a5d3e test(sdk): pass tune-api param to setup-minikube.sh. (Electronic-Waste, Aug 6, 2024)
6beeeaa test(sdk): fix err in template-e2e-test. (Electronic-Waste, Aug 6, 2024)
dde4402 test(sdk): add debug logs. (Electronic-Waste, Aug 6, 2024)
c962aa5 test(sdk): reorder params and delete logs. (Electronic-Waste, Aug 6, 2024)
41 changes: 41 additions & 0 deletions .github/workflows/e2e-test-tune-api.yaml
@@ -0,0 +1,41 @@
name: E2E Test with tune API

on:
  pull_request:
    paths-ignore:
      - "pkg/ui/v1beta1/frontend/**"

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  e2e:
    runs-on: ubuntu-22.04
    timeout-minutes: 120
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup Test Env
        uses: ./.github/workflows/template-setup-e2e-test
        with:
          kubernetes-version: ${{ matrix.kubernetes-version }}

      - name: Setup Minikube Cluster
        shell: bash
        run: ./test/e2e/v1beta1/scripts/gh-actions/setup-minikube.sh

      - name: Setup Katib
        shell: bash
        run: ./test/e2e/v1beta1/scripts/gh-actions/setup-katib.sh

      - name: Run E2E Experiment
        shell: bash
        run: ./test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.sh

    strategy:
      fail-fast: false
      matrix:
        # Detail: https://hub.docker.com/r/kindest/node
        kubernetes-version: ["v1.27.11", "v1.28.7", "v1.29.2"]
2 changes: 1 addition & 1 deletion sdk/python/v1beta1/kubeflow/katib/api/katib_client.py
@@ -386,7 +386,7 @@ def tune(

         # Add metrics collector to the Katib Experiment.
         # For now, we only support the parameter `kind`, whose default value is `StdOut`, to specify the kind of metrics collector.
-        experiment.spec.metrics_collector = models.V1beta1MetricsCollectorSpec(
+        experiment.spec.metrics_collector_spec = models.V1beta1MetricsCollectorSpec(
             collector=models.V1beta1CollectorSpec(kind=metrics_collector_config["kind"])
         )
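
A minimal sketch of what the corrected assignment produces, written against the SDK's generated models (an illustration assuming keyword construction of the models; it is not part of this diff):

from kubeflow.katib import models

# Sketch only: mirrors what tune() now does internally. The old attribute,
# metrics_collector, is not a field of V1beta1ExperimentSpec, so the collector
# setting was dropped on serialization; metrics_collector_spec is the generated
# field that serializes to `metricsCollectorSpec` in the Experiment CRD.
metrics_collector_config = {"kind": "StdOut"}  # `kind` is the only supported key
spec = models.V1beta1ExperimentSpec(
    metrics_collector_spec=models.V1beta1MetricsCollectorSpec(
        collector=models.V1beta1CollectorSpec(kind=metrics_collector_config["kind"])
    )
)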

4 changes: 4 additions & 0 deletions test/e2e/v1beta1/scripts/gh-actions/build-load.sh
@@ -162,6 +162,10 @@ for name in "${TRIAL_IMAGE_ARRAY[@]}"; do
  run "$name" "examples/$VERSION/trial-images/$name/Dockerfile"
done

# Testing image for tune function
echo -e "\nBuilding testing image for tune function..."
_build_containers "suggestion-hyperopt" "$CMD_PREFIX/suggestion/hyperopt/$VERSION/Dockerfile"

echo -e "\nCleanup Build Cache...\n"
docker buildx prune -f

139 changes: 1 addition & 138 deletions test/e2e/v1beta1/scripts/gh-actions/run-e2e-experiment.py
@@ -1,13 +1,13 @@
import argparse
import logging
import time

from kubeflow.katib import ApiClient
from kubeflow.katib import KatibClient
from kubeflow.katib import models
from kubeflow.katib.constants import constants
from kubeflow.katib.utils.utils import FakeResponse
from kubernetes import client
from verify import verify_experiment_results
import yaml

# Experiment timeout is 40 min.
@@ -17,143 +17,6 @@
logging.basicConfig(level=logging.INFO)


def verify_experiment_results(
    katib_client: KatibClient,
    experiment: models.V1beta1Experiment,
    exp_name: str,
    exp_namespace: str,
):

    # Get the best objective metric.
    best_objective_metric = None
    for metric in experiment.status.current_optimal_trial.observation.metrics:
        if metric.name == experiment.spec.objective.objective_metric_name:
            best_objective_metric = metric
            break

    if best_objective_metric is None:
        raise Exception(
            "Unable to get the best metrics for objective: {}. Current Optimal Trial: {}".format(
                experiment.spec.objective.objective_metric_name,
                experiment.status.current_optimal_trial,
            )
        )

    # Get Experiment Succeeded reason.
    for c in experiment.status.conditions:
        if (
            c.type == constants.EXPERIMENT_CONDITION_SUCCEEDED
            and c.status == constants.CONDITION_STATUS_TRUE
        ):
            succeeded_reason = c.reason
            break

    trials_completed = experiment.status.trials_succeeded or 0
    trials_completed += experiment.status.trials_early_stopped or 0
    max_trial_count = experiment.spec.max_trial_count

    # If Experiment is Succeeded because of Max Trial Reached, all Trials must be completed.
    if (
        succeeded_reason == "ExperimentMaxTrialsReached"
        and trials_completed != max_trial_count
    ):
        raise Exception(
            "All Trials must be Completed. Max Trial count: {}, Experiment status: {}".format(
                max_trial_count, experiment.status
            )
        )

    # If Experiment is Succeeded because of Goal reached, the metrics must be correct.
    if succeeded_reason == "ExperimentGoalReached" and (
        (
            experiment.spec.objective.type == "minimize"
            and float(best_objective_metric.min) > float(experiment.spec.objective.goal)
        )
        or (
            experiment.spec.objective.type == "maximize"
            and float(best_objective_metric.max) < float(experiment.spec.objective.goal)
        )
    ):
        raise Exception(
            "Experiment goal is reached, but metrics are incorrect. "
            f"Experiment objective: {experiment.spec.objective}. "
            f"Experiment best objective metric: {best_objective_metric}"
        )

    # Verify Suggestion's resources. Suggestion name = Experiment name.
    suggestion = katib_client.get_suggestion(exp_name, exp_namespace)

    # For the Never or FromVolume resume policies Suggestion must be Succeeded.
    # For the LongRunning resume policy Suggestion must be always Running.
    for c in suggestion.status.conditions:
        if (
            c.type == constants.EXPERIMENT_CONDITION_SUCCEEDED
            and c.status == constants.CONDITION_STATUS_TRUE
            and experiment.spec.resume_policy == "LongRunning"
        ):
            raise Exception(
                f"Suggestion is Succeeded while Resume Policy is {experiment.spec.resume_policy}."
                f"Suggestion conditions: {suggestion.status.conditions}"
            )
        elif (
            c.type == constants.EXPERIMENT_CONDITION_RUNNING
            and c.status == constants.CONDITION_STATUS_TRUE
            and experiment.spec.resume_policy != "LongRunning"
        ):
            raise Exception(
                f"Suggestion is Running while Resume Policy is {experiment.spec.resume_policy}."
                f"Suggestion conditions: {suggestion.status.conditions}"
            )

    # For Never and FromVolume resume policies verify Suggestion's resources.
    if (
        experiment.spec.resume_policy == "Never"
        or experiment.spec.resume_policy == "FromVolume"
    ):
        resource_name = exp_name + "-" + experiment.spec.algorithm.algorithm_name

        # Suggestion's Service and Deployment should be deleted.
        for i in range(10):
            try:
                client.AppsV1Api().read_namespaced_deployment(
                    resource_name, exp_namespace
                )
            except client.ApiException as e:
                if e.status == 404:
                    break
                else:
                    raise e
            # Deployment deletion might take some time.
            time.sleep(1)
        if i == 10:
            raise Exception(
                "Suggestion Deployment is still alive for Resume Policy: {}".format(
                    experiment.spec.resume_policy
                )
            )

        try:
            client.CoreV1Api().read_namespaced_service(resource_name, exp_namespace)
        except client.ApiException as e:
            if e.status != 404:
                raise e
        else:
            raise Exception(
                "Suggestion Service is still alive for Resume Policy: {}".format(
                    experiment.spec.resume_policy
                )
            )

        # For FromVolume resume policy PVC should not be deleted.
        if experiment.spec.resume_policy == "FromVolume":
            try:
                client.CoreV1Api().read_namespaced_persistent_volume_claim(
                    resource_name, exp_namespace
                )
            except client.ApiException:
                raise Exception("PVC is deleted for FromVolume Resume Policy")


def run_e2e_experiment(
    katib_client: KatibClient,
    experiment: models.V1beta1Experiment,
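The helper deleted above now lives in a shared verify module, which both run-e2e-experiment.py and the new run-e2e-tune-api.py import. Since verify.py itself is not shown in this diff, here is a minimal sketch of its assumed interface (only the signature is implied by the surrounding code):

# verify.py (assumed interface)
from kubeflow.katib import KatibClient, models

def verify_experiment_results(
    katib_client: KatibClient,
    experiment: models.V1beta1Experiment,
    exp_name: str,
    exp_namespace: str,
) -> None:
    """Raise an Exception when a finished Experiment's results are inconsistent."""
    ...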
97 changes: 97 additions & 0 deletions test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.py
@@ -0,0 +1,97 @@
import argparse
import logging

from kubeflow.katib import KatibClient
from kubeflow.katib import search
from kubernetes import client
from verify import verify_experiment_results

# Experiment timeout is 40 min.
EXPERIMENT_TIMEOUT = 60 * 40

# The default logging config.
logging.basicConfig(level=logging.INFO)


def run_e2e_experiment_create_by_tune(
    katib_client: KatibClient,
    exp_name: str,
    exp_namespace: str,
):
    # Create Katib Experiment and wait until it is finished.
    logging.debug("Creating Experiment: {}/{}".format(exp_namespace, exp_name))

    # Use the test case from the getting-started tutorial.
    # https://www.kubeflow.org/docs/components/katib/getting-started/#getting-started-with-katib-python-sdk
    # [1] Create an objective function.
    def objective(parameters):
        import time
        time.sleep(5)
        result = 4 * int(parameters["a"]) - float(parameters["b"]) ** 2
        print(f"result={result}")

    # [2] Create the hyperparameter search space.
    parameters = {
        "a": search.int(min=10, max=20),
        "b": search.double(min=0.1, max=0.2),
    }

    # [3] Create a Katib Experiment with 4 Trials and 2 CPUs per Trial,
    # and wait until the Experiment reaches the Succeeded condition.
    katib_client.tune(
        name=exp_name,
        namespace=exp_namespace,
        objective=objective,
        parameters=parameters,
        objective_metric_name="result",
        max_trial_count=4,
        resources_per_trial={"cpu": "2"},
    )
    experiment = katib_client.wait_for_experiment_condition(
        exp_name, exp_namespace, timeout=EXPERIMENT_TIMEOUT
    )

    # Verify the Experiment results.
    verify_experiment_results(katib_client, experiment, exp_name, exp_namespace)

    # Print the Experiment and Suggestion.
    logging.debug(katib_client.get_experiment(exp_name, exp_namespace))
    logging.debug(katib_client.get_suggestion(exp_name, exp_namespace))


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--namespace", type=str, required=True, help="Namespace for the Katib E2E test",
    )
    parser.add_argument(
        "--verbose", action="store_true", help="Verbose output for the Katib E2E test",
    )
    args = parser.parse_args()

    if args.verbose:
        logging.getLogger().setLevel(logging.DEBUG)

    katib_client = KatibClient()

    # Enable metrics-collector injection for the test namespace.
    # metadata.labels may be None on an unlabeled namespace, so fall back to {}.
    namespace_labels = client.CoreV1Api().read_namespace(args.namespace).metadata.labels or {}
    if 'katib.kubeflow.org/metrics-collector-injection' not in namespace_labels:
        namespace_labels['katib.kubeflow.org/metrics-collector-injection'] = 'enabled'
        client.CoreV1Api().patch_namespace(args.namespace, {'metadata': {'labels': namespace_labels}})

    # Test with run_e2e_experiment_create_by_tune.
    exp_name = "tune-example"
    exp_namespace = args.namespace
    try:
        run_e2e_experiment_create_by_tune(katib_client, exp_name, exp_namespace)
        logging.info("---------------------------------------------------------------")
        logging.info(f"E2E test succeeded for Experiment created by tune: {exp_namespace}/{exp_name}")
    except Exception as e:
        logging.info("---------------------------------------------------------------")
        logging.info(f"E2E test failed for Experiment created by tune: {exp_namespace}/{exp_name}")
        raise e
    finally:
        # Delete the Experiment.
        logging.info("---------------------------------------------------------------")
        logging.info("---------------------------------------------------------------")
        katib_client.delete_experiment(exp_name, exp_namespace)
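
For readers new to the search helpers used in this test, a rough sketch of what they expand to, based on the SDK's generated parameter models (constructor details are assumed; this is an illustration, not code from the PR):

from kubeflow.katib import models

# search.int(min=10, max=20) builds a parameter spec along these lines; the CRD
# stores numeric bounds as strings, and the parameter name is filled in later
# from the key of the `parameters` dict that is passed to tune().
manual_int = models.V1beta1ParameterSpec(
    parameter_type="int",
    feasible_space=models.V1beta1FeasibleSpace(min="10", max="20"),
)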
38 changes: 38 additions & 0 deletions test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.sh
@@ -0,0 +1,38 @@
#!/usr/bin/env bash

# Copyright 2024 The Kubeflow Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# This shell script is used to run the E2E test for the Katib tune API.

set -o errexit
set -o nounset
set -o pipefail

cd "$(dirname "$0")"

echo "Katib deployments"
kubectl -n kubeflow get deploy
echo "Katib services"
kubectl -n kubeflow get svc
echo "Katib pods"
kubectl -n kubeflow get pod
echo "Katib persistent volume claims"
kubectl get pvc -n kubeflow
echo "Available CRDs"
kubectl get crd

python run-e2e-tune-api.py --namespace default \
--verbose || (kubectl get pods -n kubeflow && exit 1)