
Implement parallel execution for DAG tasks #4128

Open · wants to merge 82 commits into base: advanced-dag

Conversation

andylizf (Contributor)

Closes #4055

This PR implements parallel execution for DAG tasks in the jobs controller, addressing issue #4055. The changes allow for efficient execution of complex DAGs with independent tasks running concurrently, significantly improving performance for workflows with parallel components.
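For readers skimming the PR, the sketch below illustrates the general idea of wave-by-wave DAG scheduling with a thread pool: in each round, every task whose dependencies have completed is submitted in parallel. This is a minimal, hypothetical example (the dag dict and run_task are made up); it is not the JobsController code in this PR, which additionally handles state tracking, cancellation, and resource cleanup.

    import concurrent.futures

    # Hypothetical DAG: each task maps to the set of tasks it depends on.
    dag = {
        'a': set(),
        'b': {'a'},
        'c': {'a'},
        'd': {'b', 'c'},
    }

    def run_task(name: str) -> None:
        print(f'running {name}')

    completed = set()
    with concurrent.futures.ThreadPoolExecutor() as pool:
        while len(completed) < len(dag):
            # Every task whose dependencies are all done can run concurrently.
            ready = [t for t, deps in dag.items()
                     if t not in completed and deps <= completed]
            futures = {pool.submit(run_task, t): t for t in ready}
            for fut in concurrent.futures.as_completed(futures):
                fut.result()  # Propagate any exception raised by the task.
                completed.add(futures[fut])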

Changes

  • Modified JobsController to identify and execute parallel task groups
  • Implemented thread-safe task execution and monitoring
  • Added concurrent resource management and cleanup

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@andylizf (Contributor, Author)

@cblmemo I'm currently working on implementing a cancellation mechanism for tasks that have already started or are queued for execution (similar to your setup with replicas preparing to launch). I'm currently using Future.cancel(), but it doesn't fully cover cancellation of tasks that are already in progress. I haven't switched to using threading.Event yet, which might improve this.
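For concreteness, here is a minimal sketch of the two mechanisms being compared, using a plain ThreadPoolExecutor as a stand-in rather than this PR's controller code: Future.cancel() only succeeds for tasks still waiting in the queue, while a shared threading.Event lets tasks that are already running stop cooperatively. run_task and the timings are illustrative only.

    import concurrent.futures
    import threading
    import time

    def run_task(task_id: int, cancel_event: threading.Event) -> str:
        # Cooperative cancellation: the task checks the event between steps.
        for _ in range(10):
            if cancel_event.is_set():
                return f'task {task_id} stopped early'
            time.sleep(0.1)
        return f'task {task_id} finished'

    cancel_event = threading.Event()
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(run_task, i, cancel_event) for i in range(4)]
        # Future.cancel() only removes tasks that have not started yet.
        still_running = [f for f in futures if not f.cancel()]
        # Tasks already in progress must be signalled to stop cooperatively.
        cancel_event.set()
        for f in still_running:
            print(f.result())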

That said, I noticed you used Process for managing the launch and termination of replicas. I don't see any clear advantages to using Process over Thread, especially since Thread should handle task cancellation just as well without the overhead of creating separate processes. Could you clarify your reasoning for choosing Process here? Is there a specific limitation you're addressing with this approach?

@cblmemo (Collaborator) commented Oct 19, 2024

(Quoting @andylizf's comment above.)

This is mainly due to logging. Threads share the same sys.stdout, which makes the following code infeasible.

I couldn't find a way to do this kind of logging redirection back then. If you figure out a way, please let me know ;)

class RedirectOutputForProcess:
    """Redirects stdout and stderr to a file.

    This class enables output redirect for multiprocessing.Process.

    Example usage:

        p = multiprocessing.Process(
            target=RedirectOutputForProcess(func, file_name).run, args=...)

    This is equal to:

        p = multiprocessing.Process(target=func, args=...)

    Plus redirect all stdout/stderr to file_name.
    """

    def __init__(self, func: Callable, file: str, mode: str = 'w') -> None:
        self.func = func
        self.file = file
        self.mode = mode

    def run(self, *args, **kwargs):
        with open(self.file, self.mode, encoding='utf-8') as f:
            sys.stdout = f
            sys.stderr = f
            # Reconfigure the logger since the logger is initialized before
            # with the previous stdout/stderr.
            sky_logging.reload_logger()
            logger = sky_logging.init_logger(__name__)
            # The subprocess_util.run('sky status') inside
            # sky.execution::_execute cannot be redirected, since we cannot
            # directly operate on the stdout/stderr of the subprocess. This
            # is because some code in skypilot will specify the stdout/stderr
            # of the subprocess.
            try:
                self.func(*args, **kwargs)
            except Exception as e:  # pylint: disable=broad-except
                logger.error(f'Failed to run {self.func.__name__}. '
                             f'Details: {common_utils.format_exception(e)}')
                with ux_utils.enable_traceback():
                    logger.error(f' Traceback:\n{traceback.format_exc()}')
                raise
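On the "if you figured out a way" point: one possible approach with threads is to install a thread-aware proxy on sys.stdout once and let each worker thread register its own log file. The sketch below is illustrative and untested against SkyPilot (ThreadLocalStdout and worker are made-up names); the caveat in the code above still applies, since subprocesses that write directly to inherited file descriptors are not captured, and logging handlers created before the swap keep the old stream.

    import sys
    import threading

    class ThreadLocalStdout:
        """Proxy that forwards writes to a per-thread file, if one is set."""

        def __init__(self, fallback):
            self._fallback = fallback
            self._local = threading.local()

        def redirect_to(self, fileobj):
            self._local.target = fileobj

        def stop_redirect(self):
            self._local.target = None

        def write(self, data):
            target = getattr(self._local, 'target', None) or self._fallback
            return target.write(data)

        def flush(self):
            target = getattr(self._local, 'target', None) or self._fallback
            target.flush()

    # Install once, in the main thread, before workers start.
    sys.stdout = ThreadLocalStdout(sys.stdout)

    def worker(log_path: str) -> None:
        with open(log_path, 'w', encoding='utf-8') as f:
            sys.stdout.redirect_to(f)
            try:
                # print() looks up sys.stdout on every call, so this write
                # goes to the per-thread file.
                print('this line goes to', log_path)
            finally:
                sys.stdout.stop_redirect()

    threads = [threading.Thread(target=worker, args=(f'task_{i}.log',))
               for i in range(3)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()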

@andylizf (Contributor, Author) commented Nov 2, 2024

  • Controller-side cluster's launch logs accessibility: Currently, logs like ~/sky_logs/sky-2024-10-28-04-14-45-846061/task_3_launch.log are stored only on the controller machine, so they can only be accessed by connecting to it.
  • Cluster-side task run log persistence: As demonstrated in the example where task 0's logs are no longer available after completion, we need to implement proper log retention for completed tasks.

Hi @cblmemo, could you check the PR? All TODOs are done; log download isn't added due to complexity - run logs are accessible only if the cluster isn’t terminated.

@andylizf (Contributor, Author) commented Nov 2, 2024

By the way, I refactored _follow_replica_logs, but the logic may differ slightly. Could you take a look?

sky/backends/cloud_vm_ray_backend.py (outdated, resolved)
sky/jobs/controller.py (outdated, resolved)
sky/jobs/controller.py (outdated, resolved)
Co-authored-by: Tian Xia <cblmemo@gmail.com>
@cblmemo (Collaborator) left a comment


Thanks for adding this feature @andylizf! This is awesome. LGTM except for some nits. After testing it should be ready to go ;)

Before merging, please update all tests you've done in the PR description :))

sky/jobs/controller.py (outdated, resolved)
sky/jobs/controller.py (outdated, resolved)
sky/jobs/controller.py (resolved)
sky/jobs/state.py (outdated, resolved)
Comment on lines +480 to +481
WHERE spot_job_id=(?) AND end_at IS null
AND status NOT IN (?, ?)""",
Collaborator


Why change this?

Contributor Author


As mentioned in #4128 (comment), we should not cancel jobs that have already started running.

Comment on lines +626 to +639
    else:
        if job_id is None:
            assert job_name is not None
            job_ids = managed_job_state.get_nonterminal_job_ids_by_name(
                job_name)
            if len(job_ids) == 0:
                return f'No running managed job found with name {job_name!r}.'
            if len(job_ids) > 1:
                with ux_utils.print_exception_no_traceback():
                    raise ValueError(
                        f'Multiple running jobs found with name {job_name!r}.')
            job_id = job_ids[0]

-       return stream_logs_by_id(job_id, follow)
+       return stream_logs_by_id(job_id, task_id, follow)
Collaborator


Let's remove the else and revert the indents? We should reduce indentation if possible.

Contributor Author


Thank you for the input! I agree that too much indentation isn’t ideal, but in this case, a couple of levels seem fine. Early exits work well for handling specific errors, while here, both cases are part of the function’s main flow.

sky/jobs/utils.py (outdated, resolved)
sky/jobs/utils.py (outdated, resolved)
sky/jobs/utils.py (outdated, resolved)
sky/jobs/utils.py (outdated, resolved)

Successfully merging this pull request may close these issues.

[Jobs] Parallel execution for DAG