## Overall design

The executor uses `dali::tasking` as the main run-time library. `dali::tasking` ensures that the tasks are executed in the correct order; it's the job of the executor to define the dependencies between the tasks and to apply other constraints.

The heart of the system is `ExecGraph` - a graph structure that stores detailed information about the execution of the pipeline's graph nodes. `ExecGraph` consists of `ExecNode`s connected with `ExecEdge`s. There are two kinds of graph nodes:

* Operator node - an `ExecNode` which stores an instance of a DALI `OperatorBase`. Its inputs and outputs correspond precisely to the ones in the operator's `OpSpec`.
* Output node - a node which gathers the outputs of the operator nodes that comprise the pipeline's output. The inputs of this node are the pipeline's outputs. It returns a `dali::Workspace` by value.

The `ExecGraph` is normally created by lowering the `graph::OpGraph`. The life cycle of the graph is:

* Construction (e.g. by lowering)
* Topological sorting
* Validation
* Analysis & optimization
* Usage (in a loop)
  * PrepareIteration
  * Launch

## Graph structure

`ExecGraph` is a directed acyclic graph which stores the nodes and edges in linked lists. Each node has an array of input edges and an array of output descriptors. An output descriptor aggregates a list of consumers and some properties of the output (device, pinnedness and similar). `ExecEdge` is a structure used for navigating the graph. It links an output of a producer node to an input of a consumer node.

The graph has no dangling edges. All graph inputs start with an (inputless) ExternalSource node and all pipeline outputs contribute to the output node. Unused outputs have an output descriptor, but no outgoing edges.

After construction, the graph is sorted and analyzed. The sorting is a topological sort with an additional partitioning that guarantees that BatchSizeProviders appear first.

## Implementing order of execution

The order of execution is implemented with `dali::tasking` dependency mechanisms. See: http://dali-ci-01:7070/docs/17070353/doxygen/html/namespacedali_1_1tasking.html

A combined sketch of the mechanisms described in the following subsections appears at the end of this section.

### Task state and integrity

Each node keeps track of its current main task and the main task from the previous iteration. Since operators are non-reentrant and potentially stateful, the current task must succeed the previous one. Simply adding `main_task_->Succeed(prev_task_)` ensures that the `tasking::Scheduler` will not begin the task until the previous iteration is complete.

### Data dependencies

Each node's main task subscribes to the outputs of its producers. This not only guarantees the order of execution, but also provides the mechanism by which the data is passed between operators. Each operator node returns one task output per operator output.

### Concurrency limit

For various reasons we may want to limit the concurrency of operators. Obviously, an operator cannot run in parallel with itself, for the reasons outlined above - but we may also want to limit the number of different operators from a given group running concurrently. For example, due to technical limitations of `dali::ThreadPool`, it's impossible to run multiple CPU operators simultaneously, because concurrent submission of work to the thread pool results in a hang. Concurrency is limited with a `tasking::Semaphore` shared pointer stored in a node.

### Output buffer limit

When scheduling multiple iterations ahead, it's possible for "bubbles" to form - if an operator produces its data quickly but its consumers are slow, the operator node would create multiple output buffers, which would live inside the tasking framework as data passed between tasks. In order to limit the number of active output buffers, we need another semaphore - but this semaphore must stay lowered until all the consumers are done with the data. To achieve this, an auxiliary (empty) task is scheduled to succeed all of the consumers and, upon completion, it raises the semaphore.

#### Example

In this example the operator Op1 has a maximum of 2 output buffers. Iteration 1 proceeds without delay (each operator waits only for its previous iteration). In iteration 2, however, the operator Op1 has to wait before it allocates an output buffer. The blue boxes represent the operators' "main" tasks, the red boxes the "release_outputs" tasks, and the green boxes semaphore operations.

![image](https://github.com/user-attachments/assets/a3025551-96af-42e3-a238-544df2d66121)

_NOTE: This diagram represents task life cycle, not thread activity - with `tasking::Scheduler` the worker threads never actually wait as long as there are tasks to execute. Resource (e.g. semaphore) acquisition follows "wait-all" semantics, so the extended "acquire" boxes are not an accurate representation of "waiting on a semaphore"._
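The snippet below is a minimal sketch of how the four mechanisms above could be wired together for one iteration. Only `Succeed`, `tasking::Scheduler`, `tasking::Semaphore` and the `main_task_`/`prev_task_` members are taken from the text; `Task::Create`, `Subscribe`, `GuardedBy`, `ReleaseAfterRun` and `AddTask` are assumed names standing in for the actual `dali::tasking` calls, and `SketchNode` is a deliberately simplified stand-in for `ExecNode`.

```cpp
#include <memory>
#include <vector>

namespace t = dali::tasking;

// Simplified stand-in for ExecNode - just enough state to show the wiring.
struct SketchNode {
  std::shared_ptr<t::Task> main_task_, prev_task_;
  std::shared_ptr<t::Semaphore> concurrency_sem_;  // limits parallel operators
  std::shared_ptr<t::Semaphore> output_sem_;       // limits live output buffers
  std::vector<SketchNode *> producers, consumers;
};

void ScheduleIteration(const std::vector<SketchNode *> &topo_order,
                       t::Scheduler &sched) {
  for (SketchNode *node : topo_order) {
    auto main = t::Task::Create([node]() { /* run the operator */ });

    // Task state & integrity: iteration N succeeds iteration N-1 of the
    // same node, because operators are non-reentrant and stateful.
    if (node->prev_task_)
      main->Succeed(node->prev_task_);

    // Data dependencies: subscribing to a producer's output both orders the
    // tasks and carries the data between them (real edges also identify
    // which of the producer's outputs is consumed).
    for (SketchNode *prod : node->producers)
      main->Subscribe(prod->main_task_);

    // Concurrency limit: e.g. at most one CPU operator in flight.
    if (node->concurrency_sem_)
      main->GuardedBy(node->concurrency_sem_);

    // Output buffer limit: lowered here, raised by release_outputs below.
    if (node->output_sem_)
      main->GuardedBy(node->output_sem_);

    node->prev_task_ = node->main_task_;
    node->main_task_ = main;
  }

  // The auxiliary "release_outputs" tasks can only be wired once all main
  // tasks of this iteration exist, because they succeed the consumers.
  for (SketchNode *node : topo_order) {
    if (!node->output_sem_)
      continue;
    auto release_outputs = t::Task::Create([]() { /* empty task body */ });
    for (SketchNode *cons : node->consumers)
      release_outputs->Succeed(cons->main_task_);
    // Raise the output-buffer semaphore once all consumers are done.
    release_outputs->ReleaseAfterRun(node->output_sem_);
    sched.AddTask(release_outputs);
  }

  for (SketchNode *node : topo_order)
    sched.AddTask(node->main_task_);
}
```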
## Workspace lifecycle

The new executor follows a "linear" memory usage model - buffers are created as needed and thrown away as soon as they're no longer used. The memory pool is solely responsible for efficient memory recycling.

Even though the buffers are disposable, the workspace object contains some additional structure (e.g. the mapping of argument input names to indices) which we don't need to recreate each time the operator runs. Because of that, each `ExecNode` stores a `Workspace` object. The workspace is removed from the `ExecNode` at the beginning of the task body and returned to it when the task completes.

Life cycle:

- Get workspace from ExecNode
- (run the task)
- Clear workspace (*)
- Put workspace back in ExecNode

_(*) Clearing the workspace means removing all TensorLists from it_

## ExecNode Tasks

### Operator task

The operator task performs the following operations (a sketch combining them with the workspace life cycle is shown below):

- get the inputs from parent tasks
- wait for inputs in the operator's stream/order
- put the inputs into the workspace
- apply default input layouts, if necessary
- compute the batch size
- create the outputs
- run operator's Setup
- resize the outputs, if necessary
- run operator's Run
- propagate metadata
- record CUDA events
- restore empty input layouts, if necessary

### Output task

- wait for the inputs
- construct the "output workspace", in which the workspace outputs are the task's inputs (i.e. the pipeline's outputs)
- move the output workspace to the task's return value
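Putting the list above together with the workspace life cycle, the operator task body could look roughly like this sketch. The `OperatorBase::Setup`/`Run` calls follow DALI's usual operator interface; everything else (`GetWorkspace`, `PutWorkspace` and the free-standing helpers) is a hypothetical name standing in for the corresponding executor step.

```cpp
#include <utility>
#include <vector>

// Sketch of the operator task body; helper names are hypothetical.
void RunOperatorTask(ExecNode &node, tasking::Task &task) {
  // Get the workspace from the ExecNode - its structure (e.g. the argument
  // input name mapping) is reused across iterations.
  Workspace ws = node.GetWorkspace();

  SetInputs(ws, task);           // get the inputs from the parent tasks
  WaitForInputs(ws);             // wait in the operator's stream/order
  ApplyDefaultInputLayouts(ws);  // if necessary
  int batch_size = InferBatchSize(ws);
  CreateOutputs(ws, batch_size);

  std::vector<OutputDesc> output_descs;
  if (node.op->Setup(output_descs, ws))  // Setup may request output shapes...
    ResizeOutputs(ws, output_descs);     // ...in which case we resize them
  node.op->Run(ws);

  PropagateMetadata(ws);
  RecordOutputEvents(ws);        // CUDA events for downstream synchronization
  RestoreEmptyInputLayouts(ws);  // if necessary

  ClearWorkspace(ws);            // remove all TensorLists
  node.PutWorkspace(std::move(ws));  // return the workspace to the node
}
```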
---------

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>