
GPU autoscheduling with Mullapudi2016: the reference implementation #7787

Draft · wants to merge 4 commits into main from mullapudi2016-gpu
Conversation

antonysigma
Contributor

Rationale:

  1. To compare the GPU auto-scheduling performance of Mullapudi2016 against Li2018 and Anderson2021.

  2. To reduce the following claims to practice, quoting the original Mullapudi2016 article:

Portability to Different Architectures: GPU Performance: The inlining, tiling, and grouping processes are otherwise similar to the CPU case. Groups resulting from merging are mapped to CUDA kernels by designating the outer tile loops as GPU block grid dimensions and the inner tile loops as GPU thread block dimensions. All intermediate buffers within a group are allocated in GPU shared memory.

  3. To implement the so-called "single level tiling only" limitation in the Mullapudi2016 and Sioutas2020 algorithms, according to the findings in the Anderson2021 paper:

[Mullapudi et al] develops an automatic scheduling technique using a heuristic cost model and a greedy stage grouping algorithm... but its search space is smaller compared to ours among other reasons because it only supports a single level of tiling, and as we discuss in Section 6.2, this excludes a number of high performance schedules.


Change summary:

Reverse engineer the GPU scheduling feature as stated in Section 5.4 of Mullapudi's article:

Mullapudi, Adams, Sharlet, Ragan-Kelley, Fatahalian. Automatically scheduling Halide image processing pipelines. ACM Transactions on Graphics, 35(4), Article 83, pp. 1–11. https://doi.org/10.1145/2897824.2925952

When target=cuda is detected in the code generator command-line arguments, intercept all vectorize and parallel scheduling calls requested by the auto-vectorization and auto-parallelization algorithms with the class GPUTilingDedup for deferred execution.

Implement the class GPUTilingDedup to ensure all Halide GPU schedule calls are idempotent: no matter how many times the Stage is vectorized, reordered, parallelized, and then reordered again, the reorder and gpu_threads() schedules are called exactly once.
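
For illustration, a minimal sketch of that idempotency guard, assuming a deferred-execution design; the member names and bookkeeping below are hypothetical, not the exact contents of the PR's GPUTilingDedup:

#include "Halide.h"

#include <map>
#include <string>

// Hypothetical sketch: collect vectorize()/parallel() requests and apply
// the GPU equivalents exactly once.
class GPUTilingDedup {
    Halide::Stage &stage;
    // Deferred thread-dimension requests, keyed by variable name so that
    // repeated requests for the same dimension collapse into one entry.
    std::map<std::string, Halide::VarOrRVar> thread_vars;
    bool is_flushed = false;  // guards against applying the schedule twice

public:
    explicit GPUTilingDedup(Halide::Stage &s)
        : stage(s) {
    }

    // Record a vectorize()/parallel() request instead of applying it now.
    void record(const Halide::VarOrRVar &v) {
        thread_vars.emplace(v.name(), v);
    }

    // Apply gpu_threads() exactly once, no matter how many times the
    // auto-vectorizer and auto-parallelizer revisit this Stage.
    void flush() {
        if (is_flushed) {
            return;
        }
        is_flushed = true;
        for (const auto &[name, v] : thread_vars) {
            stage.gpu_threads(v);
        }
    }
};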

Also, intercept all split and reorder scheduling calls made by Mullapudi's auto-splitting algorithm.

Implement the class GPUTileHelper to enforce atomic application of the GPU schedules. If the current stage is compute_root, mark all auto-split inner dimensions as gpu_threads and outer dimensions as gpu_blocks. If the Stage is compute_at another Stage, mark all vectorize dimensions as gpu_threads.

If auto-splitting of the current stage does not result in any tile, implement a rudimentary tiling with tile size = vector_length x parallel_factor.

If Mullapudi does not call any split, vectorize, or parallel schedules, assume a scalar reduction routine and implement it on the GPU via single_thread.
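
For concreteness, the schedules these helpers aim to emit correspond roughly to the following hand-written Halide code; the tile sizes and variable names are illustrative assumptions, not what the autoscheduler will always choose:

#include "Halide.h"
using namespace Halide;

void illustrate_emitted_schedules(Func f, Func g, Func sum_stage, Var x, Var y) {
    Var xo("xo"), xi("xi"), yo("yo"), yi("yi");

    // compute_root stage: auto-split outer dimensions map to gpu_blocks,
    // inner dimensions to gpu_threads. When the auto-splitter yields no
    // tile, a rudimentary vector_length x parallel_factor tile (16x16 here,
    // purely as an example) is created first.
    f.compute_root()
        .tile(x, y, xo, yo, xi, yi, 16, 16)
        .gpu_blocks(xo, yo)
        .gpu_threads(xi, yi);

    // Stage compute_at another Stage: the vectorized dimension becomes
    // gpu_threads.
    g.compute_at(f, xo).gpu_threads(x);

    // No split/vectorize/parallel at all: treat it as a scalar reduction
    // and run it in a single GPU thread.
    sum_stage.compute_root().gpu_single_thread();
}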

cc @aekul, @jrk, @abadams.

See also: #7491

@abadams
Member

abadams commented Aug 22, 2023

Thanks for this! IIRC the original GPU version of this autoscheduler was what we charitably describe as "research code", and was never fit for production.

@antonysigma antonysigma force-pushed the mullapudi2016-gpu branch 3 times, most recently from 4bfdf3f to f195efa Compare August 22, 2023 16:46
Contributor Author

@antonysigma antonysigma left a comment


Thanks for this! IIRC the original GPU version of this autoscheduler was what we charitably describe as "research code", and was never fit for production.

Hi @abadams ,

Thank you for reviewing the code, and dotting the i's and crossing the t's. I concur that GPU scheduling is an experimental feature and should be highlighted as such in the user_warning. Could you please show me where to warn the user?

I am also open to an additional option bool ArchParams::emit_gpu_schedules = false;, parsable via the generator command-line interface. Though I highly doubt anyone would go through the hassle of setting target=host-cuda-cuda_capability_?? just to disable the GPU auto-scheduler.

My primary goal is to get this PR upstreamed, so that everybody can benefit from the auto-scheduler comparison and other studies. The generated demo.schedule.h can be sub-optimal; we all expect that end users will tweak it for their use cases.

@abadams
Member

abadams commented Aug 22, 2023

As this is an attempted reconstruction of his GPU autoscheduler, I should probably tag @ravi-teja-mullapudi to see if this looks sane, because this will affect how people cite and compare to his work in future.

@steven-johnson
Contributor

Several bot failures with:

/home/halidenightly/build_bot/worker/halide-testbranch-main-llvm18-x86-32-linux-make/halide-source/src/autoschedulers/mullapudi2016/AutoSchedule.cpp:2830:21: error: unused variable ‘types’ [-Werror=unused-variable]

@antonysigma
Contributor Author

antonysigma commented Aug 23, 2023

Several bot failures with:

/home/halidenightly/build_bot/worker/halide-testbranch-main-llvm18-x86-32-linux-make/halide-source/src/autoschedulers/mullapudi2016/AutoSchedule.cpp:2830:21: error: unused variable ‘types’ [-Werror=unused-variable]

Done removing the offending line. I also rebased the changes on top of main.

Update: perhaps we need a separate PR to check for unused variables in the CMake configs:

diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index 47e90864d..83ded47a1 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -587,6 +587,8 @@ target_compile_options(
         $<$<CXX_COMPILER_ID:GNU,Clang,AppleClang>:-Wno-unused-function>
         $<$<CXX_COMPILER_ID:GNU,Clang,AppleClang>:-Wno-unused-macros>
         $<$<CXX_COMPILER_ID:GNU,Clang,AppleClang>:-Wno-unused-parameter>
+        $<$<CXX_COMPILER_ID:GNU,Clang,AppleClang>:-Wno-unused-variable>
+        $<$<CXX_COMPILER_ID:GNU,Clang,AppleClang>:-Wno-unused-const-variable>
 
         $<$<CXX_COMPILER_ID:Clang,AppleClang>:-Wno-c++98-compat-pedantic>
         $<$<CXX_COMPILER_ID:Clang,AppleClang>:-Wno-c++98-compat>

@antonysigma
Contributor Author

antonysigma commented Aug 24, 2023

@steven-johnson and @abadams, thank you for testing the PR on the CI. Yes, the failure is triggered by the CMake build option -DHalide_TARGET=host-[metal|gpu]. I didn't know we could do that; I like the feature and will reproduce it on my machine.

There are two types of generator failures:

Functions that are compute_at() a gpu_block() loop must specify the innermost gpu_block() loop for that Func. It stems from the over-estimated "L2 cache size per thread" machine parameter; the value should have been ~70 kB instead of 16 MB. It is described in the original paper as a limitation, not a bug.

But yeah, we should have a better exception handling mechanism for this actionable error. I need help to improve the user experience.
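
If the cache-size estimate is indeed the root cause, one workaround is to override the machine parameter on the generator command line; this sketch assumes the standard Mullapudi2016 autoscheduler parameters, and the value 71680 (~70 kB) is illustrative:

./demo.generator -g demo -o . \
    -p libautoschedule_mullapudi2016.so \
    target=host-cuda \
    autoscheduler=Mullapudi2016 \
    autoscheduler.last_level_cache_size=71680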

Another generator failure: Functions that are compute_at() a gpu_block() loop cannot have their own gpu_block() loops. It happens in scalar reduction stages scheduled with compute_at. Resolving the bug...

@steven-johnson
Contributor

Updated to main branch to fix OSX WebGPU failures

@antonysigma antonysigma force-pushed the mullapudi2016-gpu branch 3 times, most recently from 9bd065d to 3dcb5d4 Compare August 25, 2023 03:04
@antonysigma
Contributor Author

antonysigma commented Aug 25, 2023

Update: The GPU scheduling extension for Mullapudi2016 passes all Buildbot tests except for autograd_grad.generator and local_laplacian_generator.

  1. autograd_grad passes the Buildbot tests, but the unnamed Var x triggers a basic_string::_M_construct == null error on LLVM 16 and a !name.empty() error on LLVM 18.

  2. local_laplacian_generator triggers a subtle !name.empty() exception in the Halide IR.

@abadams Yeah, I agree that the Buildbot CI jobs ensure production-quality auto-schedulers, which was not the original goal of Mullapudi2016's GPU extensions. I will switch this PR to a draft and work on issue 2 later next week.

@antonysigma antonysigma force-pushed the mullapudi2016-gpu branch 4 times, most recently from cb3eb57 to a36d902 Compare August 29, 2023 20:21
@antonysigma antonysigma marked this pull request as draft August 30, 2023 15:34
@antonysigma antonysigma force-pushed the mullapudi2016-gpu branch 2 times, most recently from 183e240 to b1e89ce Compare November 10, 2023 21:01
@steven-johnson
Contributor

This PR is a year old at this point -- is it defunct?

@antonysigma
Contributor Author

antonysigma commented Aug 23, 2024

This PR is a year old at this point -- is it defunct?

Hi @steven-johnson, this branch is still active. I actively use it to generate working schedules for image processing pipelines on GPUs like Jetson and RTX cards.

I also rebase the branch to upstream's top of tree on a monthly basis.

I recall that we cannot merge it because it doesn't pass a few CI tests. There are test cases on GitHub Actions where the following combination results in an autoscheduler exception:

  • target is set to host-cuda,
  • the Halide algorithm under test is related to deep-learning use cases, e.g. auto_grad, and
  • the Mullapudi2016 autoscheduler is enabled.

Could you show me a way to skip these combinations in GitHub Actions? Mullapudi2016 is designed for camera ISP use cases, so it makes sense for the GPU reference implementation to report exceptions on deep-learning pipelines.
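
For what it's worth, a hypothetical CTest-side workaround (the variable name HL_TARGET and the exact wiring of the runners are assumptions) would be to gate an exclusion regex on the GPU target:

# Hypothetical CI step: skip the deep-learning generators when the
# target is CUDA-enabled and Mullapudi2016 is under test.
if [[ "${HL_TARGET}" == host-cuda* ]]; then
    ctest --exclude-regex "autograd|local_laplacian"
else
    ctest
fi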

@antonysigma antonysigma force-pushed the mullapudi2016-gpu branch 3 times, most recently from b1e89ce to e692617 Compare August 24, 2024 01:13
Comment on lines -2825 to +3306

-    int tile_inner_index = dims.size() - outer_dims.size() - 1;
+    internal_assert(dims.size() >= outer_dims.size());
+    const auto tile_inner_index = std::max(int(dims.size() - outer_dims.size()) - 1, 0);
Contributor Author

For tensor multiplications, the assertion dims.size() > outer_dims.size() fails because the GPU schedule wants to map the outer dimensions to gpu_blocks. If the assertion is ignored, the integer tile_inner_index becomes invalid (i.e. a value less than zero).

Here, I am trying to clamp tile_inner_index to zero. This may break more test cases. Help wanted here.

@antonysigma antonysigma force-pushed the mullapudi2016-gpu branch 2 times, most recently from d36af45 to 8b38958 Compare September 5, 2024 15:46
@antonysigma
Contributor Author

@steven-johnson Update: the CI job halide-testbench passes almost all tests, except the following:

The following tests FAILED:
	  1 - bgu_filter (Subprocess aborted)
	139 - iir_blur_filter (Subprocess aborted)
	433 - nl_means_process (Subprocess aborted)

I suspect Mullapudi2016 is not designed for algorithms that are not chained stencil pipelines. Is there a way to skip benchmarks on the CI runners based on the CMake option -DHalide_target=host-cuda?

Details of the failure when Halide_target=host-cuda:

bgu_filter: Error: CUDA error: CUDA_ERROR_ILLEGAL_ADDRESS cuCtxSynchronize failed
iir_blur: Error: Input buffer input is accessed at 5, which is beyond the max (3) in dimension 2
nl_means: Error: CUDA error: CUDA_ERROR_OUT_OF_MEMORY cuMemAlloc failed

And when Halide_target=host-metal:

bgu_filter[39113:94564599] Metal API Validation Enabled
-[MTLDebugComputeCommandEncoder _validateThreadsPerThreadgroup:]:1267: failed assertion `(threadsPerThreadgroup.width(12) * threadsPerThreadgroup.height(21) * threadsPerThreadgroup.depth(4))(1008) must be <= 896. (kernel threadgroup size limit)'
Manually-tuned time: 19.9856ms
Required regular expression not found. Regex=[Success!]
