-
Notifications
You must be signed in to change notification settings - Fork 608
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Epic] CUDA HAL driver rewrite for production #13245
Comments
@julianwa @antiagainst @mattwalsh @benvanik I believe this is still in Lei's queue, please update if any progress has been made. |
@julianwa @mattwalsh @antiagainst Noting here, this may need to be bumped up in priority before end of Q2. |
@antiagainst when do you expect the metal back-end to land? I assume that it would be poor choice to start this prior to that. I'm just seeing what the earliest timeframe to start this is -- without considering the other work you have planned for Q2 for the moment. |
I'm under the impression that the CUDA HAL rewrite is scheduled to start at Q3. I'm expecting to finish up and land the Metal HAL impl by end of this month. So earliest we can start is June. |
I've updated the title and fleshed out some of the goals and plans in the issue, after discussing with @benvanik. This is not final and we'll continue refine the details as we go. |
This commit starts a CUDA HAL driver rewrite under `experimental/`. We create a new `cuda2/` directory to host the new code to avoid interrupting the current CodeGen development. This commit just brings in the basics for boot up a new HAL driver, including dynamic symbols management, error status management, and IREE HAL driver implementation. Most of the code is directly copied from existing HAL driver, with noticeable changes: * Split CUDA and NCCL dynamic symbols into separate structures for better organziation and allowing optionality. * Fleshed out CUDA error to IREE status conversions. * Better organized code blocks and improved error messages and various comments. Building this commmit with `-DIREE_EXTERNAL_HAL_DRIVERS=cuda2`, we can have `tools/iree-run-module --dump_devices` showing `cuda2` devices, in parallel to the existing CUDA one. Progress towards #13245
This commit ports over existing CUDA driver allocator and buffer implementation. The main logic is kept as-is, with one noticeable changes--context wrapper is dropped and fields in it are directly put in various API calls. This is to make supporting multiple device and stream easier later. Other changes are just polishing on comments and errors. Progress towards #13245
This commit ports over existing CUDA HAL driver code to enable creating devices, allocators, and buffers. It's mostly NFC, with just symbol renaming and turning various functionalities into unimplemented. With this commit, we are able to pick up CTS tests. Progress towards #13245
This commit reuses the existing CUDA HAL driver's descriptor set and pipeline layout implementation. Along the way, dropped the context wrapper and adds more documentation. Progress towards #13245
This commit starts a HIP backend in HAL. Following the steps of CUDA rewrite (iree-org#13245) effort, this backend will provide improvements over the existing ROCm HAL backend. Building this commmit with -DIREE_EXTERNAL_HAL_DRIVERS=hip will initialize this driver and the driver information can be seen using `tools/iree-run-module --dump_devices`.
This commit starts a HIP backend in HAL. Following the steps of CUDA rewrite (iree-org#13245) effort, this backend will provide improvements over the existing ROCm HAL backend. Building this commmit with -DIREE_EXTERNAL_HAL_DRIVERS=hip will initialize this driver and the driver information can be seen using `tools/iree-run-module --dump_devices`.
This commit changes the stream command buffer to keep track of a fixed list of descriptor sets. This simplifies the logic of pushing descriptors/constants; we only perform kernel parameter serialization at the dispatch time. This means we don't pay the overhead if multiple push descriptor/constant commands are issued before dispatching. Also we don't need to sort descriptors anymore. Progress towards #13245
Legacy sync mode forces waiting on semaphore immediately so it effectively runs all async allocation/execution in sync mode. Now we have proper semaphore emulation in cuda2, we don't need this anymore. Disable it for end-to-end op tests for experimental cuda2 driver. Progress towards #13245
This commit starts a HIP backend in HAL. Following the steps of the new CUDA backend (`cuda2`) (#13245) effort, HIP will provide enhancements over existing ROCm HAL backend which is based on the current CUDA backend. This commit can be built with `-DIREE_EXTERNAL_HAL_DRIVERS=hip` which will initialize the HIP driver. Device registration can be verified using `tools/iree-run-module --dump_devices`.
This commit changes the graph command buffer to keep track of a fixed list of descriptor sets. This simplifies the logic of pushing descriptors/constants; we only perform kernel parameter serialization at the dispatch time. This means we don't pay the overhead if multiple push descriptor/constant commands are issued before dispatching. Also we don't need to sort descriptors anymore. Progress towards iree-org#13245
) This commit improves the graph node dependency management by avoiding serializing all nodes in the graph command buffer to a linear chain. Command buffer barriers become empty CUDA graph nodes depending on nodes created after a previous barrier. Progress towards iree-org#13245 --------- Co-authored-by: Lei Zhang <antiagainst@gmail.com>
Events aren't used by the compiler right now. Progress towards iree-org#13245
…5437) This commit changes the stream command buffer to keep track of a fixed list of descriptor sets. This simplifies the logic of pushing descriptors/constants; we only perform kernel parameter serialization at the dispatch time. This means we don't pay the overhead if multiple push descriptor/constant commands are issued before dispatching. Also we don't need to sort descriptors anymore. Progress towards iree-org#13245
Legacy sync mode forces waiting on semaphore immediately so it effectively runs all async allocation/execution in sync mode. Now we have proper semaphore emulation in cuda2, we don't need this anymore. Disable it for end-to-end op tests for experimental cuda2 driver. Progress towards iree-org#13245
This commit starts a HIP backend in HAL. Following the steps of the new CUDA backend (`cuda2`) (iree-org#13245) effort, HIP will provide enhancements over existing ROCm HAL backend which is based on the current CUDA backend. This commit can be built with `-DIREE_EXTERNAL_HAL_DRIVERS=hip` which will initialize the HIP driver. Device registration can be verified using `tools/iree-run-module --dump_devices`.
…15673) When GPU signals the CUevent-implemented semaphore to a new value, we need to perform a `cuLaunchHostFunc()` to invoke the CPU to signal to that value. At the same time, we can try to advance the deferred queue to release more workload to the GPU. Advancing the deferred queue cannot happen as part of the `cuLaunchHostFunc()` call, given that is in a driver thread. Per the documentation: The host function must not make any CUDA API calls. Attempting to use a CUDA API may result in CUDA_ERROR_NOT_PERMITTED, but this is not required. The host function must not perform any synchronization that may depend on outstanding CUDA work not mandated to run earlier. Host functions without a mandated order (such as in independent streams) execute in undefined order and may be serialized. We were actually doing that and seeing deadlock. So this commit changes the implemention to create a new thread dedicated to monitoring and releasing workload to the GPU, to avoid calling CUDA APIs as part of the `cuLaunchHostFunc()` call. Progress towards #13245
The device already retains the devic event pool, so the device event pool should not retain the device again. Otherwise, they both have at least one retain and will never trigger proper free naturally. Progress towards #13245
This commit moves the CUDA HAL driver rewrite to the `hal/drivers` directory given it's functional and ready for normal usage. By this we can start run tests with CI to make sure it does not regress. Further improvements can happen directly in this directory. This provides an easy route for trying out the rewrite before eventually replace the existing HAL driver. Along the way wired up BUILD configurations. Progress towards #13245
#14620 landed; now
Also sent https://groups.google.com/g/iree-discuss/c/q2_HgHmGma0 to the group. |
#16107 switches the new cuda2 impl on by default. Following up work is to delete cuda1 code and rename cuda2 symbols to cuda, but not blocking, so this issue is marked as closed now. |
This is the last step of refreshing the cuda HAL implementation. Mostly done via the following command: ```sh find runtime/ -type f -exec sed -i 's/CUDA2/CUDA/g' {} \; find runtime/ -type f -exec sed -i 's/cuda2/cuda/g' {} \; ``` Related to #13245
This is the last step of refreshing the cuda HAL implementation. Mostly done via the following command: ```sh find runtime/ -type f -exec sed -i 's/CUDA2/CUDA/g' {} \; find runtime/ -type f -exec sed -i 's/cuda2/cuda/g' {} \; ``` Related to #13245
Background
The current CUDA HAL driver was prototyped in the early days and have implementation details rendering production usage hard. To name a few big ones: full stream synchronization after each queue execute, no HAL semaphore support, no stream-ordered allocation support, linear nodes in CUDA graph, tightly integrated context wrapper making it hard to support multi queue/device. There are also various other smaller issues, like tightly coupled CUDA and NCCL symbols hindering optionality, etc.
Goals and Plans
We would need to rewrite portions of the CUDA HAL driver and modernize others accordingly to better implement IREE HAL abstractions and support future production use cases. Concretely, we would like the CUDA HAL driver to:
The above, if done properly, should make supporting multi queue/device training/inference and interleaving different CUDA library/CodeGen solutions and even external systems approaches easier.
Note that this doesn't concern the kernel and executable side much. There are places we can improve there too; but those are largely detachable and can happen separately.
The overall plan is to start a new experimental HAL driver in experiemental/,
cuda2
, and pull in pieces from existing CUDA HAL driver when suitable, and improve and land incrementally.Concrete Tasks
These are major abstract tasks now; as we go we'll break them down into smaller ones and create seprate task list to track.
The text was updated successfully, but these errors were encountered: