-
Notifications
You must be signed in to change notification settings - Fork 608
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Epic] HIP HAL driver rewrite #15790
Labels
hal/hip
Runtime HIP HAL backend
Comments
This was referenced Dec 5, 2023
nithinsubbiah
added a commit
that referenced
this issue
Dec 11, 2023
This patch adds buffer, allocator, and memory pool implementation for the new HIP backend. Progress towards #15790
This was referenced Dec 13, 2023
ramiro050
pushed a commit
to ramiro050/iree
that referenced
this issue
Dec 19, 2023
This patch adds buffer, allocator, and memory pool implementation for the new HIP backend. Progress towards iree-org#15790
ramiro050
pushed a commit
to ramiro050/iree
that referenced
this issue
Dec 19, 2023
ramiro050
pushed a commit
to ramiro050/iree
that referenced
this issue
Dec 19, 2023
nithinsubbiah
added a commit
that referenced
this issue
Jan 9, 2024
banach-space
pushed a commit
to banach-space/iree
that referenced
this issue
Jan 10, 2024
nithinsubbiah
added a commit
that referenced
this issue
Feb 5, 2024
antiagainst
added a commit
that referenced
this issue
Feb 13, 2024
The `hipDrvGraphAddMemcpyNode` symbol is experimental and not yet available. So mark update/copy buffer as unimplmented for now till we have it. For now we will first get stream command buffer up and running. Progress towards #15790
antiagainst
added a commit
that referenced
this issue
Feb 14, 2024
This guarantees that we always have basic bookkeeping structure updated for the resource. So if something goes wrong when creating the native executable and we are relying on the `iree_hal_executable_destroy` to destroy it, we don't crash. Progress towards #15790
monorimet
pushed a commit
that referenced
this issue
Feb 14, 2024
The `hipDrvGraphAddMemcpyNode` symbol is experimental and not yet available. So mark update/copy buffer as unimplmented for now till we have it. For now we will first get stream command buffer up and running. Progress towards #15790
monorimet
pushed a commit
that referenced
this issue
Feb 14, 2024
This guarantees that we always have basic bookkeeping structure updated for the resource. So if something goes wrong when creating the native executable and we are relying on the `iree_hal_executable_destroy` to destroy it, we don't crash. Progress towards #15790
antiagainst
added a commit
that referenced
this issue
Feb 27, 2024
This commits fixes a few issues in pending action queue to resolve driver deadlock issues: * In host launch func, which is called from a driver thread, we cannot invoke any GPU API. Otherwise we might see deadlock. This includes cleaning up the actions after execution--it may involve buffer releasing/unregistering which was the issue causing hip driver hang. Now move this cleanup into the worker thread. This is done by adding a state field to each action to indicate whether it's alive or zombie. We enqueue each action again after done execution by flipping its state to zombie to let the worker thread to cleanup. * The worker thread can have five states--two normal states (idle waiting or workload pending), three exit states (requested, committed, error). They have increasing priorities w.r.t. overwriting. We cannot overwrite state later in the list without checking. This guarantees that exit requests are properly respected and not dropping to the floor so to have clean exit. * When the worker thread is waken to process ready list, we need to immediately flip the worker state from workload pending to idle waiting, before any real processing. This makes sure we don't drop new workload enqueued while we are processing, and the worker thread can be waken up again properly later. With the above fixes, we can pass all stablehlo/tosa e2e op tests on hip driver without hang or crashes. The same change is mirrored to the cuda pending action queue. Fixes #15790 Progress towards #16504
nithinsubbiah
added a commit
that referenced
this issue
Jul 30, 2024
This patch retires the ROCm HAL backend given the new HIP HAL backend is functionally complete (#15790). The new backend has been tested and proved to give on par or better performance compared to the ROCm backend. The documentation still preserves the `ROCm` term to denote the AMD compute software while using `HIP` specifically for the runtime.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Background
The current ROCm HAL driver was based on the initial CUDA HAL driver; both are prototypes in the early days. We are in the process of rewriting the CUDA HAL driver, tracked in #13245, to fix various issues listed there. The ROCm HAL inherits similar issues and needs to be rewritten too. This is the tracking epic for this effort.
Goals and Plans
The goals to achieve are similar to the CUDA HAL driver rewrite, listed in #13245. The overall plan is to start a new experimental HAL driver in experiemental/,
hip
, and leverage corresponding pieces incuda2
there to bring up.Along the way, we want to flesh out CTS #13250.
Concrete Tasks
These are major abstract tasks now; as we go we'll break them down into smaller ones and create seprate task list to track.
The text was updated successfully, but these errors were encountered: