
Implement parallel execution for DAG tasks #4128

Open · wants to merge 82 commits into base: advanced-dag

Conversation

andylizf (Contributor)

Closes #4055

This PR implements parallel execution for DAG tasks in the jobs controller, addressing issue #4055. The changes allow for efficient execution of complex DAGs with independent tasks running concurrently, significantly improving performance for workflows with parallel components.
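For readers skimming the PR, the sketch below illustrates the general idea of wave-by-wave DAG scheduling with a thread pool: in each round, every task whose dependencies have completed is submitted in parallel. This is a minimal, hypothetical example (the dag dict and run_task are made up); it is not the JobsController code in this PR, which additionally handles state tracking, cancellation, and resource cleanup.

    import concurrent.futures

    # Hypothetical DAG: each task maps to the set of tasks it depends on.
    dag = {
        'a': set(),
        'b': {'a'},
        'c': {'a'},
        'd': {'b', 'c'},
    }

    def run_task(name: str) -> None:
        print(f'running {name}')

    completed = set()
    with concurrent.futures.ThreadPoolExecutor() as pool:
        while len(completed) < len(dag):
            # Every task whose dependencies are all done can run concurrently.
            ready = [t for t, deps in dag.items()
                     if t not in completed and deps <= completed]
            futures = {pool.submit(run_task, t): t for t in ready}
            for fut in concurrent.futures.as_completed(futures):
                fut.result()  # Propagate any exception raised by the task.
                completed.add(futures[fut])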

Changes

  • Modified JobsController to identify and execute parallel task groups
  • Implemented thread-safe task execution and monitoring
  • Added concurrent resource management and cleanup

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@andylizf (Contributor, Author)

@cblmemo I'm currently working on implementing a cancellation mechanism for tasks that have already started or are queued for execution (similar to your setup with replicas preparing to launch). I'm currently using Future.cancel(), but it doesn't fully cover cancellation of tasks that are already in progress. I haven't switched to using threading.Event yet, which might improve this.
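For concreteness, here is a minimal sketch of the two mechanisms being compared, using a plain ThreadPoolExecutor as a stand-in rather than this PR's controller code: Future.cancel() only succeeds for tasks still waiting in the queue, while a shared threading.Event lets tasks that are already running stop cooperatively. run_task and the timings are illustrative only.

    import concurrent.futures
    import threading
    import time

    def run_task(task_id: int, cancel_event: threading.Event) -> str:
        # Cooperative cancellation: the task checks the event between steps.
        for _ in range(10):
            if cancel_event.is_set():
                return f'task {task_id} stopped early'
            time.sleep(0.1)
        return f'task {task_id} finished'

    cancel_event = threading.Event()
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(run_task, i, cancel_event) for i in range(4)]
        # Future.cancel() only removes tasks that have not started yet.
        still_running = [f for f in futures if not f.cancel()]
        # Tasks already in progress must be signalled to stop cooperatively.
        cancel_event.set()
        for f in still_running:
            print(f.result())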

That said, I noticed you used Process for managing the launch and termination of replicas. I don't see any clear advantages to using Process over Thread, especially since Thread should handle task cancellation just as well without the overhead of creating separate processes. Could you clarify your reasoning for choosing Process here? Is there a specific limitation you're addressing with this approach?

@cblmemo (Collaborator) commented Oct 19, 2024

(Quoting @andylizf's comment above.)

This is mainly due to logging. Threads share the same sys.stdout, which makes the following code infeasible.

I couldn't find a way to do this kind of logging redirection back then. If you figure out a way, please let me know ;)

class RedirectOutputForProcess:
    """Redirects stdout and stderr to a file.

    This class enables output redirect for multiprocessing.Process.

    Example usage:

        p = multiprocessing.Process(
            target=RedirectOutputForProcess(func, file_name).run, args=...)

    This is equal to:

        p = multiprocessing.Process(target=func, args=...)

    Plus redirect all stdout/stderr to file_name.
    """

    def __init__(self, func: Callable, file: str, mode: str = 'w') -> None:
        self.func = func
        self.file = file
        self.mode = mode

    def run(self, *args, **kwargs):
        with open(self.file, self.mode, encoding='utf-8') as f:
            sys.stdout = f
            sys.stderr = f
            # Reconfigure the logger since the logger is initialized before
            # with the previous stdout/stderr.
            sky_logging.reload_logger()
            logger = sky_logging.init_logger(__name__)
            # The subprocess_util.run('sky status') inside
            # sky.execution::_execute cannot be redirected, since we cannot
            # directly operate on the stdout/stderr of the subprocess. This
            # is because some code in skypilot will specify the stdout/stderr
            # of the subprocess.
            try:
                self.func(*args, **kwargs)
            except Exception as e:  # pylint: disable=broad-except
                logger.error(f'Failed to run {self.func.__name__}. '
                             f'Details: {common_utils.format_exception(e)}')
                with ux_utils.enable_traceback():
                    logger.error(f' Traceback:\n{traceback.format_exc()}')
                raise
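On the "if you figured out a way" point: one possible approach with threads is to install a thread-aware proxy on sys.stdout once and let each worker thread register its own log file. The sketch below is illustrative and untested against SkyPilot (ThreadLocalStdout and worker are made-up names); the caveat in the code above still applies, since subprocesses that write directly to inherited file descriptors are not captured, and logging handlers created before the swap keep the old stream.

    import sys
    import threading

    class ThreadLocalStdout:
        """Proxy that forwards writes to a per-thread file, if one is set."""

        def __init__(self, fallback):
            self._fallback = fallback
            self._local = threading.local()

        def redirect_to(self, fileobj):
            self._local.target = fileobj

        def stop_redirect(self):
            self._local.target = None

        def write(self, data):
            target = getattr(self._local, 'target', None) or self._fallback
            return target.write(data)

        def flush(self):
            target = getattr(self._local, 'target', None) or self._fallback
            target.flush()

    # Install once, in the main thread, before workers start.
    sys.stdout = ThreadLocalStdout(sys.stdout)

    def worker(log_path: str) -> None:
        with open(log_path, 'w', encoding='utf-8') as f:
            sys.stdout.redirect_to(f)
            try:
                # print() looks up sys.stdout on every call, so this write
                # goes to the per-thread file.
                print('this line goes to', log_path)
            finally:
                sys.stdout.stop_redirect()

    threads = [threading.Thread(target=worker, args=(f'task_{i}.log',))
               for i in range(3)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()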

@andylizf (Contributor, Author) commented Nov 2, 2024

  • Controller-side cluster's launch logs accessibility: Currently, logs like ~/sky_logs/sky-2024-10-28-04-14-45-846061/task_3_launch.log are stored only on the controller machine, so they can only be accessed by connecting to it.
  • Cluster-side task run log persistence: As demonstrated in the example where task 0's logs are no longer available after completion, we need to implement proper log retention for completed tasks.

Hi @cblmemo, could you check the PR? All TODOs are done; log download isn't added due to complexity - run logs are accessible only if the cluster isn’t terminated.

@andylizf (Contributor, Author) commented Nov 2, 2024

By the way, I refactored _follow_replica_logs, but the logic may differ slightly. Could you take a look?

sky/backends/cloud_vm_ray_backend.py (outdated, resolved)
sky/jobs/controller.py (outdated, resolved)
sky/jobs/controller.py (outdated, resolved)
Co-authored-by: Tian Xia <cblmemo@gmail.com>
@cblmemo (Collaborator) left a comment


Thanks for adding this feature @andylizf! This is awesome. LGTM except for some nits. After testing it should be ready to go ;)

Before merging, please update all tests you've done in the PR description :))

sky/jobs/controller.py (outdated, resolved)
sky/jobs/controller.py (outdated, resolved)
sky/jobs/controller.py (resolved)
sky/jobs/state.py (outdated, resolved)
Comment on lines +480 to +481
WHERE spot_job_id=(?) AND end_at IS null
AND status NOT IN (?, ?)""",
Collaborator


Why change this?

Contributor Author


As mentioned in #4128 (comment), we should not cancel jobs that have already started running.

Comment on lines +626 to +639
    else:
        if job_id is None:
            assert job_name is not None
            job_ids = managed_job_state.get_nonterminal_job_ids_by_name(
                job_name)
            if len(job_ids) == 0:
                return f'No running managed job found with name {job_name!r}.'
            if len(job_ids) > 1:
                with ux_utils.print_exception_no_traceback():
                    raise ValueError(
                        f'Multiple running jobs found with name {job_name!r}.')
            job_id = job_ids[0]

-       return stream_logs_by_id(job_id, follow)
+       return stream_logs_by_id(job_id, task_id, follow)
Collaborator


Let's remove the else and revert the indents? We should reduce indentation if possible.

Contributor Author


Thank you for the input! I agree that too much indentation isn’t ideal, but in this case, a couple of levels seem fine. Early exits work well for handling specific errors, while here, both cases are part of the function’s main flow.

sky/jobs/utils.py (outdated, resolved)
sky/jobs/utils.py (outdated, resolved)
sky/jobs/utils.py (outdated, resolved)
sky/jobs/utils.py (outdated, resolved)

Successfully merging this pull request may close these issues.

[Jobs] Parallel execution for DAG