Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Epic] HIP HAL driver rewrite #15790

Closed
antiagainst opened this issue Dec 5, 2023 · 0 comments · Fixed by #16568
Closed

[Epic] HIP HAL driver rewrite #15790

antiagainst opened this issue Dec 5, 2023 · 0 comments · Fixed by #16568
Assignees
Labels
hal/hip Runtime HIP HAL backend

Comments

@antiagainst
Copy link
Contributor

antiagainst commented Dec 5, 2023

Background

The current ROCm HAL driver was based on the initial CUDA HAL driver; both are prototypes in the early days. We are in the process of rewriting the CUDA HAL driver, tracked in #13245, to fix various issues listed there. The ROCm HAL inherits similar issues and needs to be rewritten too. This is the tracking epic for this effort.

Goals and Plans

The goals to achieve are similar to the CUDA HAL driver rewrite, listed in #13245. The overall plan is to start a new experimental HAL driver in experiemental/, hip, and leverage corresponding pieces in cuda2 there to bring up.

Along the way, we want to flesh out CTS #13250.

Concrete Tasks

These are major abstract tasks now; as we go we'll break them down into smaller ones and create seprate task list to track.

### Basics
- [ ] https://github.com/openxla/iree/pull/15506
- [ ] https://github.com/openxla/iree/pull/15887
### Memory Management
- [ ] https://github.com/openxla/iree/pull/15791
### Compute Pipeline
- [ ] https://github.com/openxla/iree/pull/15910
- [ ] https://github.com/openxla/iree/pull/15937
- [ ] https://github.com/openxla/iree/pull/16290
- [ ] Enable graph command buffer with latest HIP API
### Synchronization
- [ ] https://github.com/openxla/iree/pull/16305
### Collectives
- [ ] Support collective APIs
### HAL conformance testing suite
- [ ] https://github.com/openxla/iree/pull/16380
nithinsubbiah added a commit that referenced this issue Dec 11, 2023
This patch adds buffer, allocator, and memory pool implementation for
the new HIP backend.

Progress towards #15790
nithinsubbiah added a commit that referenced this issue Dec 13, 2023
ramiro050 pushed a commit to ramiro050/iree that referenced this issue Dec 19, 2023
This patch adds buffer, allocator, and memory pool implementation for
the new HIP backend.

Progress towards iree-org#15790
ramiro050 pushed a commit to ramiro050/iree that referenced this issue Dec 19, 2023
ramiro050 pushed a commit to ramiro050/iree that referenced this issue Dec 19, 2023
nithinsubbiah added a commit that referenced this issue Feb 2, 2024
antiagainst added a commit that referenced this issue Feb 13, 2024
The `hipDrvGraphAddMemcpyNode` symbol is experimental and not yet
available. So mark update/copy buffer as unimplmented for now till we
have it. For now we will first get stream command buffer up and running.

Progress towards #15790
antiagainst added a commit that referenced this issue Feb 14, 2024
This guarantees that we always have basic bookkeeping structure updated
for the resource. So if something goes wrong when creating the native
executable and we are relying on the `iree_hal_executable_destroy` to
destroy it, we don't crash.

Progress towards #15790
monorimet pushed a commit that referenced this issue Feb 14, 2024
The `hipDrvGraphAddMemcpyNode` symbol is experimental and not yet
available. So mark update/copy buffer as unimplmented for now till we
have it. For now we will first get stream command buffer up and running.

Progress towards #15790
monorimet pushed a commit that referenced this issue Feb 14, 2024
This guarantees that we always have basic bookkeeping structure updated
for the resource. So if something goes wrong when creating the native
executable and we are relying on the `iree_hal_executable_destroy` to
destroy it, we don't crash.

Progress towards #15790
antiagainst added a commit that referenced this issue Feb 27, 2024
This commits fixes a few issues in pending action queue to
resolve driver deadlock issues:

* In host launch func, which is called from a driver thread,
  we cannot invoke any GPU API. Otherwise we might see
  deadlock. This includes cleaning up the actions after
  execution--it may involve buffer releasing/unregistering
  which was the issue causing hip driver hang. Now move
  this cleanup into the worker thread. This is done by adding
  a state field to each action to indicate whether it's alive
  or zombie. We enqueue each action again after done
  execution by flipping its state to zombie to let the worker
  thread to cleanup.
* The worker thread can have five states--two normal states
  (idle waiting or workload pending), three exit states (requested,
  committed, error). They have increasing priorities w.r.t.
  overwriting. We cannot overwrite state later in the list
  without checking. This guarantees that exit requests are
  properly respected and not dropping to the floor so to have
  clean exit.
* When the worker thread is waken to process ready list, we
  need to immediately flip the worker state from workload
  pending to idle waiting, before any real processing. This
  makes sure we don't drop new workload enqueued while
  we are processing, and the worker thread can be waken
  up again properly later.

With the above fixes, we can pass all stablehlo/tosa e2e op
tests on hip driver without hang or crashes. The same
change is mirrored to the cuda pending action queue.

Fixes #15790
Progress towards #16504
@ScottTodd ScottTodd added the hal/hip Runtime HIP HAL backend label Jul 9, 2024
nithinsubbiah added a commit that referenced this issue Jul 30, 2024
This patch retires the ROCm HAL backend given the new HIP HAL backend is
functionally complete (#15790).
The new backend has been tested and proved to give on par or better
performance compared to the ROCm backend.

The documentation still preserves the `ROCm` term to denote the AMD
compute software while using `HIP` specifically for the runtime.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hal/hip Runtime HIP HAL backend
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants