[Epic] CUDA HAL driver rewrite for production #13245

allieculp · 2023-04-21T22:28:54Z

Background

The current CUDA HAL driver was prototyped in the early days and have implementation details rendering production usage hard. To name a few big ones: full stream synchronization after each queue execute, no HAL semaphore support, no stream-ordered allocation support, linear nodes in CUDA graph, tightly integrated context wrapper making it hard to support multi queue/device. There are also various other smaller issues, like tightly coupled CUDA and NCCL symbols hindering optionality, etc.

Goals and Plans

We would need to rewrite portions of the CUDA HAL driver and modernize others accordingly to better implement IREE HAL abstractions and support future production use cases. Concretely, we would like the CUDA HAL driver to:

Support async workload submission for better GPU utilization
Implement proper HAL semaphore for host/device <-> host/device synchronization
Stream-ordered allocation to sequence allocations and compute
Implement HAL command buffers with CUDA graphs for low launch overhead
Factor in multi queue/device in various APIs to not limit the support

The above, if done properly, should make supporting multi queue/device training/inference and interleaving different CUDA library/CodeGen solutions and even external systems approaches easier.

Note that this doesn't concern the kernel and executable side much. There are places we can improve there too; but those are largely detachable and can happen separately.

The overall plan is to start a new experimental HAL driver in experiemental/, cuda2, and pull in pieces from existing CUDA HAL driver when suitable, and improve and land incrementally.

Concrete Tasks

These are major abstract tasks now; as we go we'll break them down into smaller ones and create seprate task list to track.

### Basics
- [ ] https://github.com/openxla/iree/pull/13942
- [ ] https://github.com/openxla/iree/pull/13955
- [ ] https://github.com/openxla/iree/pull/14011
- [ ] https://github.com/openxla/iree/pull/14063
- [ ] https://github.com/openxla/iree/issues/13250 
- [ ] Document this CUDA HAL driver and its relationship with other ecosystem components

### Memory Management
- [ ] https://github.com/openxla/iree/pull/13985
- [ ] https://github.com/openxla/iree/pull/13440

### Compute Pipeline
- [ ] https://github.com/openxla/iree/pull/14038
- [ ] https://github.com/openxla/iree/pull/14062
- [ ] https://github.com/openxla/iree/pull/14326
- [ ] https://github.com/openxla/iree/pull/14524
- [ ] https://github.com/openxla/iree/pull/14525
- [ ] https://github.com/openxla/iree/pull/14526
- [ ] https://github.com/openxla/iree/pull/15325
- [ ] https://github.com/openxla/iree/pull/14857
- [ ] https://github.com/openxla/iree/pull/15423

### Synchronization
- [ ] https://github.com/openxla/iree/pull/14325
- [ ] https://github.com/openxla/iree/issues/4727

### Collectives
- [ ] https://github.com/openxla/iree/pull/14059

The text was updated successfully, but these errors were encountered:

allieculp · 2023-05-03T20:11:55Z

@julianwa @antiagainst @mattwalsh @benvanik I believe this is still in Lei's queue, please update if any progress has been made.

allieculp · 2023-05-04T21:37:37Z

@julianwa @mattwalsh @antiagainst Noting here, this may need to be bumped up in priority before end of Q2.

julianwa · 2023-05-08T01:52:31Z

@antiagainst when do you expect the metal back-end to land? I assume that it would be poor choice to start this prior to that. I'm just seeing what the earliest timeframe to start this is -- without considering the other work you have planned for Q2 for the moment.

antiagainst · 2023-05-08T05:40:27Z

I'm under the impression that the CUDA HAL rewrite is scheduled to start at Q3. I'm expecting to finish up and land the Metal HAL impl by end of this month. So earliest we can start is June.

antiagainst · 2023-06-03T20:16:34Z

I've updated the title and fleshed out some of the goals and plans in the issue, after discussing with @benvanik. This is not final and we'll continue refine the details as we go.

This commit starts a CUDA HAL driver rewrite under `experimental/`. We create a new `cuda2/` directory to host the new code to avoid interrupting the current CodeGen development. This commit just brings in the basics for boot up a new HAL driver, including dynamic symbols management, error status management, and IREE HAL driver implementation. Most of the code is directly copied from existing HAL driver, with noticeable changes: * Split CUDA and NCCL dynamic symbols into separate structures for better organziation and allowing optionality. * Fleshed out CUDA error to IREE status conversions. * Better organized code blocks and improved error messages and various comments. Building this commmit with `-DIREE_EXTERNAL_HAL_DRIVERS=cuda2`, we can have `tools/iree-run-module --dump_devices` showing `cuda2` devices, in parallel to the existing CUDA one. Progress towards #13245

This commit ports over existing CUDA driver allocator and buffer implementation. The main logic is kept as-is, with one noticeable changes--context wrapper is dropped and fields in it are directly put in various API calls. This is to make supporting multiple device and stream easier later. Other changes are just polishing on comments and errors. Progress towards #13245

This commit ports over existing CUDA HAL driver code to enable creating devices, allocators, and buffers. It's mostly NFC, with just symbol renaming and turning various functionalities into unimplemented. With this commit, we are able to pick up CTS tests. Progress towards #13245

This commit reuses the existing CUDA HAL driver's descriptor set and pipeline layout implementation. Along the way, dropped the context wrapper and adds more documentation. Progress towards #13245

Progress towards #13245

This commit starts a HIP backend in HAL. Following the steps of CUDA rewrite (iree-org#13245) effort, this backend will provide improvements over the existing ROCm HAL backend. Building this commmit with -DIREE_EXTERNAL_HAL_DRIVERS=hip will initialize this driver and the driver information can be seen using `tools/iree-run-module --dump_devices`.

This commit changes the stream command buffer to keep track of a fixed list of descriptor sets. This simplifies the logic of pushing descriptors/constants; we only perform kernel parameter serialization at the dispatch time. This means we don't pay the overhead if multiple push descriptor/constant commands are issued before dispatching. Also we don't need to sort descriptors anymore. Progress towards #13245

Legacy sync mode forces waiting on semaphore immediately so it effectively runs all async allocation/execution in sync mode. Now we have proper semaphore emulation in cuda2, we don't need this anymore. Disable it for end-to-end op tests for experimental cuda2 driver. Progress towards #13245

This commit starts a HIP backend in HAL. Following the steps of the new CUDA backend (`cuda2`) (#13245) effort, HIP will provide enhancements over existing ROCm HAL backend which is based on the current CUDA backend. This commit can be built with `-DIREE_EXTERNAL_HAL_DRIVERS=hip` which will initialize the HIP driver. Device registration can be verified using `tools/iree-run-module --dump_devices`.

This commit changes the graph command buffer to keep track of a fixed list of descriptor sets. This simplifies the logic of pushing descriptors/constants; we only perform kernel parameter serialization at the dispatch time. This means we don't pay the overhead if multiple push descriptor/constant commands are issued before dispatching. Also we don't need to sort descriptors anymore. Progress towards iree-org#13245

) This commit improves the graph node dependency management by avoiding serializing all nodes in the graph command buffer to a linear chain. Command buffer barriers become empty CUDA graph nodes depending on nodes created after a previous barrier. Progress towards iree-org#13245 --------- Co-authored-by: Lei Zhang <antiagainst@gmail.com>

Events aren't used by the compiler right now. Progress towards iree-org#13245

…5437) This commit changes the stream command buffer to keep track of a fixed list of descriptor sets. This simplifies the logic of pushing descriptors/constants; we only perform kernel parameter serialization at the dispatch time. This means we don't pay the overhead if multiple push descriptor/constant commands are issued before dispatching. Also we don't need to sort descriptors anymore. Progress towards iree-org#13245

Legacy sync mode forces waiting on semaphore immediately so it effectively runs all async allocation/execution in sync mode. Now we have proper semaphore emulation in cuda2, we don't need this anymore. Disable it for end-to-end op tests for experimental cuda2 driver. Progress towards iree-org#13245

This commit starts a HIP backend in HAL. Following the steps of the new CUDA backend (`cuda2`) (iree-org#13245) effort, HIP will provide enhancements over existing ROCm HAL backend which is based on the current CUDA backend. This commit can be built with `-DIREE_EXTERNAL_HAL_DRIVERS=hip` which will initialize the HIP driver. Device registration can be verified using `tools/iree-run-module --dump_devices`.

…15673) When GPU signals the CUevent-implemented semaphore to a new value, we need to perform a `cuLaunchHostFunc()` to invoke the CPU to signal to that value. At the same time, we can try to advance the deferred queue to release more workload to the GPU. Advancing the deferred queue cannot happen as part of the `cuLaunchHostFunc()` call, given that is in a driver thread. Per the documentation: The host function must not make any CUDA API calls. Attempting to use a CUDA API may result in CUDA_ERROR_NOT_PERMITTED, but this is not required. The host function must not perform any synchronization that may depend on outstanding CUDA work not mandated to run earlier. Host functions without a mandated order (such as in independent streams) execute in undefined order and may be serialized. We were actually doing that and seeing deadlock. So this commit changes the implemention to create a new thread dedicated to monitoring and releasing workload to the GPU, to avoid calling CUDA APIs as part of the `cuLaunchHostFunc()` call. Progress towards #13245

The device already retains the devic event pool, so the device event pool should not retain the device again. Otherwise, they both have at least one retain and will never trigger proper free naturally. Progress towards #13245

This commit moves the CUDA HAL driver rewrite to the `hal/drivers` directory given it's functional and ready for normal usage. By this we can start run tests with CI to make sure it does not regress. Further improvements can happen directly in this directory. This provides an easy route for trying out the rewrite before eventually replace the existing HAL driver. Along the way wired up BUILD configurations. Progress towards #13245

antiagainst · 2024-01-10T01:04:56Z

#14620 landed; now cuda2 is in hal/drivers. Next, we need:

Replace various tests/benchmarks in-tree to use cuda2 and address potential issues
[After bing stable of some time..]
Delete cuda and rename cuda2 as cuda

Also sent https://groups.google.com/g/iree-discuss/c/q2_HgHmGma0 to the group.

antiagainst · 2024-01-23T22:22:02Z

#16107 switches the new cuda2 impl on by default. Following up work is to delete cuda1 code and rename cuda2 symbols to cuda, but not blocking, so this issue is marked as closed now.

This is the last step of refreshing the cuda HAL implementation. Mostly done via the following command: ```sh find runtime/ -type f -exec sed -i 's/CUDA2/CUDA/g' {} \; find runtime/ -type f -exec sed -i 's/cuda2/cuda/g' {} \; ``` Related to #13245

Related to #13245

This is the last step of refreshing the cuda HAL implementation. Mostly done via the following command: ```sh find runtime/ -type f -exec sed -i 's/CUDA2/CUDA/g' {} \; find runtime/ -type f -exec sed -i 's/cuda2/cuda/g' {} \; ``` Related to #13245

allieculp assigned mattwalsh and allieculp Apr 21, 2023

aaron-schneider added the epic Tracking issues for related deliverables (as in Agile dev) label Apr 26, 2023

allieculp changed the title ~~[Epic] CUDA HAL rewrite (asynchronous w/ support for CUDA graphs)~~ [Epic] CUDA HAL Async Implementation (asynchronous w/ support for CUDA graphs) May 2, 2023

allieculp added the p1 label May 4, 2023

allieculp mentioned this issue May 4, 2023

Flesh out HAL conformance testing suite #13250

Closed

julianwa mentioned this issue May 18, 2023

[Epic] OpenXLA PJRT multiple client (single-node, multi-GPU) support #13255

Open

allieculp assigned antiagainst May 18, 2023

antiagainst changed the title ~~[Epic] CUDA HAL Async Implementation (asynchronous w/ support for CUDA graphs)~~ [Epic] CUDA HAL driver rewrite for production Jun 3, 2023

antiagainst mentioned this issue Jun 3, 2023

[cuda] Implement basics for a CUDA HAL driver rewrite #13942

Merged

This was referenced Jun 5, 2023

[cuda] NFC: Split files for CUDA and NCCL dynamic symbols #13954

Merged

[cuda] Dump useful GPU characteristics #13955

Merged

[cuda] Port over allocator and buffer implementation #13985

Merged

This was referenced Jun 8, 2023

[cuda] Wire up basic creating devices, allocators, and buffers #14011

Merged

[cuda] Dump whether the device has integrated memory #13986

Merged

antiagainst mentioned this issue Jun 9, 2023

[cuda] Port over descriptor set and pipeline layout #14038

Merged

antiagainst added a commit that referenced this issue Jun 12, 2023

[cuda] Dump more synchronization related attributes (#14074)

31c2d24

Progress towards #13245

antiagainst mentioned this issue Nov 14, 2023

[cuda] Add command-line option to drop legacy sync mode #15582

Merged

antiagainst mentioned this issue Nov 24, 2023

[cuda] Fix deadlock when advancing deferred queue in driver thread #15673

Merged

antiagainst mentioned this issue Dec 5, 2023

[Epic] HIP HAL driver rewrite #15790

Closed

ramiro050 pushed a commit to ramiro050/iree that referenced this issue Dec 19, 2023

[cuda] Mark event related APIs as unimplemented (iree-org#15382)

31ddbb9

Events aren't used by the compiler right now. Progress towards iree-org#13245

antiagainst mentioned this issue Jan 9, 2024

[cuda] Break cyclic retain between device and device event pool #16088

Merged

antiagainst mentioned this issue Jan 17, 2024

[cuda] Switch cuda2 on and cuda1 off by default #16107

Merged

antiagainst closed this as completed in #16107 Jan 23, 2024

antiagainst closed this as completed in 3b3cef9 Jan 23, 2024

This was referenced Jan 27, 2024

[cuda] Drop cuda1 HAL implementation code #16188

Merged

[cuda] NFC: rename cuda2 to cuda #16232

Merged

antiagainst mentioned this issue Jan 28, 2024

[doc] Update docs about the CUDA HAL driver #16234

Merged

antiagainst added a commit that referenced this issue Jan 29, 2024

[doc] Update docs about the CUDA HAL driver (#16234)

1c83020

Related to #13245

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Epic] CUDA HAL driver rewrite for production #13245

[Epic] CUDA HAL driver rewrite for production #13245

allieculp commented Apr 21, 2023 •

edited by antiagainst

Loading

allieculp commented May 3, 2023

allieculp commented May 4, 2023

julianwa commented May 8, 2023 •

edited

Loading

antiagainst commented May 8, 2023

antiagainst commented Jun 3, 2023

antiagainst commented Jan 10, 2024

antiagainst commented Jan 23, 2024

[Epic] CUDA HAL driver rewrite for production #13245

[Epic] CUDA HAL driver rewrite for production #13245

Comments

allieculp commented Apr 21, 2023 • edited by antiagainst Loading

Background

Goals and Plans

Concrete Tasks

allieculp commented May 3, 2023

allieculp commented May 4, 2023

julianwa commented May 8, 2023 • edited Loading

antiagainst commented May 8, 2023

antiagainst commented Jun 3, 2023

antiagainst commented Jan 10, 2024

antiagainst commented Jan 23, 2024

allieculp commented Apr 21, 2023 •

edited by antiagainst

Loading

julianwa commented May 8, 2023 •

edited

Loading