Implement parallel execution for DAG tasks #4128
base: advanced-dag
…#4067)
provide an example, edited from pipeline.yml
more focus on dependencies for user dag lib
more powerful user interface
load and dump new yaml format
fix
fix: reversed logic in add_edge
rename
refactor due to reviewer's comments
generate task.name if not given
add comments for add_edge
add `print_exception_no_traceback` when raise
make `Dag.tasks` a property
print dependencies for `__repr__`
move `get_unique_task_name` to common_utils
rename methods to use downstream/edge terminology
@cblmemo I'm currently working on implementing a cancellation mechanism for tasks that have already started or are queued for execution (similar to your setup with replicas preparing to launch). I'm currently using That said, I noticed you used
This is mainly due to logging. Threading will share a same I cannot find a way to do this kind of logging redirection back then. If you figured out a way, pls let me know ;)
skypilot/sky/utils/ux_utils.py Lines 80 to 121 in 7971aa2
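To make the logging concern above concrete, here is a minimal sketch of one way per-thread output redirection can be done with the standard library. It is not the code referenced in `ux_utils.py`, and `ThreadRoutedStream`, `run_task`, and `demo` are hypothetical names introduced only for illustration. The idea is to install one proxy object as `sys.stdout` and let each thread choose its own target through `threading.local`.

```python
import io
import sys
import threading


class ThreadRoutedStream:
    """Hypothetical proxy that forwards writes to a per-thread target."""

    def __init__(self, default):
        self._default = default
        self._local = threading.local()

    def set_target(self, stream):
        # threading.local keeps this assignment private to the calling thread.
        self._local.target = stream

    def _current(self):
        return getattr(self._local, 'target', self._default)

    def write(self, data):
        return self._current().write(data)

    def flush(self):
        self._current().flush()


def run_task(name, router, buffers):
    # Each task thread redirects only its own output.
    buf = io.StringIO()
    buffers[name] = buf
    router.set_target(buf)
    print(f'log line from {name}')


def demo():
    original = sys.stdout
    router = ThreadRoutedStream(original)
    sys.stdout = router
    buffers = {}
    try:
        threads = [
            threading.Thread(target=run_task, args=(name, router, buffers))
            for name in ('task-a', 'task-b')
        ]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
    finally:
        # Always restore the real stdout, even if a task fails.
        sys.stdout = original
    return {name: buf.getvalue() for name, buf in buffers.items()}
```

Calling `demo()` returns one captured string per task. Output written by child processes or by C extensions bypasses `sys.stdout`, so this approach only covers Python-level writes.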
Co-authored-by: Tian Xia <cblmemo@gmail.com>
…amed" Otherwise, users can not refer to the task by name in the DAG. This reverts commit 8486352.
Hi @cblmemo, could you check the PR? All TODOs are done; log download isn't added due to complexity - run logs are accessible only if the cluster isn’t terminated.
By the way, I refactored
Thanks for adding this feature, @andylizf! This is awesome. LGTM except for some nits. After testing it should be ready to go ;)
Before merging, please list all the tests you've run in the PR description :))
WHERE spot_job_id=(?) AND end_at IS null
AND status NOT IN (?, ?)""",
Why change this?
As mentioned in #4128 (comment), we should not cancel jobs that have already started running.
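The rule stated above (cancel only work that has not started) has a direct analogue in Python's `concurrent.futures`, sketched below. This is an illustration with the standard library, not the managed-jobs state logic touched by this PR, and `cancel_not_started` is a hypothetical name. `Future.cancel()` returns `False` for a call that is already running and `True` for one that is still queued.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def cancel_not_started():
    """Try to cancel one running and one queued task; report what happened."""
    started = threading.Event()
    release = threading.Event()

    def long_task():
        started.set()
        release.wait(timeout=5)
        return 'finished'

    # A single worker guarantees the second submission stays queued.
    with ThreadPoolExecutor(max_workers=1) as pool:
        running = pool.submit(long_task)
        queued = pool.submit(lambda: 'never runs')
        started.wait(timeout=5)

        running_cancelled = running.cancel()  # already running: not cancelled
        queued_cancelled = queued.cancel()    # still pending: cancelled

        release.set()
        return running_cancelled, queued_cancelled, running.result(timeout=5)
```

The running task is left to finish on its own, which matches the intent of excluding already-started jobs from cancellation.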
 else:
     if job_id is None:
         assert job_name is not None
         job_ids = managed_job_state.get_nonterminal_job_ids_by_name(
             job_name)
         if len(job_ids) == 0:
             return f'No running managed job found with name {job_name!r}.'
         if len(job_ids) > 1:
             with ux_utils.print_exception_no_traceback():
                 raise ValueError(
                     f'Multiple running jobs found with name {job_name!r}.')
         job_id = job_ids[0]

-    return stream_logs_by_id(job_id, follow)
+    return stream_logs_by_id(job_id, task_id, follow)
Let's remove the `else` and revert the indents? We should reduce indentation where possible.
Thank you for the input! I agree that too much indentation isn’t ideal, but in this case, a couple of levels seem fine. Early exits work well for handling specific errors, while here, both cases are part of the function’s main flow.
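For readers following the exchange above, here is a small sketch of the two shapes under discussion. Everything in it is hypothetical: the `controller` branch, the `find_ids` callback, and the returned strings stand in for the real function, whose `if` branch is not shown in the snippet. Both versions behave identically; they differ only in nesting depth.

```python
def pick_job_id(job_name, find_ids):
    """Resolve a job name to exactly one id (hypothetical helper)."""
    job_ids = find_ids(job_name)
    if len(job_ids) == 0:
        raise LookupError(f'No running job found with name {job_name!r}.')
    if len(job_ids) > 1:
        raise ValueError(f'Multiple running jobs found with name {job_name!r}.')
    return job_ids[0]


def stream_nested(controller, job_id, job_name, find_ids):
    # Shape kept by the author: both branches read as the main flow.
    if controller:
        return f'controller logs for {job_id}'
    else:
        if job_id is None:
            job_id = pick_job_id(job_name, find_ids)
        return f'job logs for {job_id}'


def stream_flat(controller, job_id, job_name, find_ids):
    # Shape suggested by the reviewer: early return, one less indent level.
    if controller:
        return f'controller logs for {job_id}'
    if job_id is None:
        job_id = pick_job_id(job_name, find_ids)
    return f'job logs for {job_id}'
```

Because the first branch ends in `return`, dropping the `else` changes nothing at runtime, so the choice is purely about readability.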
…nd `RedirectOutputForThread`
Closes #4055
This PR implements parallel execution for DAG tasks in the jobs controller, addressing issue #4055. The changes allow for efficient execution of complex DAGs with independent tasks running concurrently, significantly improving performance for workflows with parallel components.
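As a rough illustration of the idea described above, the sketch below groups a DAG into levels of mutually independent tasks and runs each level concurrently. It assumes a level-by-level scheme and a plain dictionary of dependencies; the PR does not state how `JobsController` actually forms its groups, and `parallel_groups` and `run_dag` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_groups(deps):
    """Split a DAG into levels whose tasks do not depend on each other.

    `deps` maps each task name to the set of upstream task names.
    """
    remaining = {task: set(upstream) for task, upstream in deps.items()}
    groups = []
    while remaining:
        ready = sorted(t for t, upstream in remaining.items() if not upstream)
        if not ready:
            raise ValueError('Cycle or unknown dependency in DAG.')
        groups.append(ready)
        for task in ready:
            del remaining[task]
        for upstream in remaining.values():
            upstream.difference_update(ready)
    return groups


def run_dag(deps, run_task):
    """Run each level in a thread pool; a level starts after the previous one."""
    results = {}
    with ThreadPoolExecutor() as pool:
        for group in parallel_groups(deps):
            # pool.map preserves input order, so names and results line up.
            for name, result in zip(group, pool.map(run_task, group)):
                results[name] = result
    return results
```

For a diamond-shaped DAG where `train_a` and `train_b` both depend on `load`, and `eval` depends on both, the two training tasks fall into the same group and run at the same time. A level-based scheme waits for the slowest task in each level, so a scheduler that launches each task as soon as its own upstreams finish could achieve more overlap.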
Changes
JobsController: identify and execute parallel task groups
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh