
Move more pass from Flow stage to GlobalOptimization stage. #14707

Merged: 14 commits merged into iree-org:main from the flow-shuffle branch on Aug 28, 2023

Conversation

hanhanW (Contributor) commented on Aug 16, 2023:

- Move four more passes to the GlobalOptimization stage (a sketch of the resulting pipeline shape follows below):
  - ConvertElementwiseToLinalgPass
  - GeneralizeLinalgNamedOpsPass
  - FuseDequantizationMatmulPass
  - FoldUnitExtentDimsPass
- Move the Flow transformation_pipeline.mlir test to GlobalOptimization/test. It mainly tests the ConvertElementwiseToLinalg pass, which is also tested upstream, so we can probably remove it as a follow-up.
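A minimal sketch of that pipeline shape after the move. The pass names come from the list above; the builder name and factory signatures are illustrative assumptions for this sketch, not the literal IREE source:

```cpp
#include <memory>

#include "mlir/Pass/Pass.h"
#include "mlir/Pass/PassManager.h"

// Forward declarations for the four moved passes. The pass names come from
// the PR description; these factory signatures are assumed for the sketch.
std::unique_ptr<mlir::Pass> createConvertElementwiseToLinalgPass();
std::unique_ptr<mlir::Pass> createGeneralizeLinalgNamedOpsPass();
std::unique_ptr<mlir::Pass> createFuseDequantizationMatmulPass();
std::unique_ptr<mlir::Pass> createFoldUnitExtentDimsPass();

// The four passes now run during GlobalOptimization, before the Flow stage
// begins, so Flow sees already-generalized, unit-dim-folded linalg ops.
void buildGlobalOptimizationPassPipeline(mlir::OpPassManager &pm) {
  pm.addPass(createConvertElementwiseToLinalgPass());
  pm.addPass(createGeneralizeLinalgNamedOpsPass());
  pm.addPass(createFuseDequantizationMatmulPass());
  pm.addPass(createFoldUnitExtentDimsPass());
}
```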

hanhanW added labels on Aug 16, 2023: infrastructure/benchmark (Relating to benchmarking infrastructure), benchmarks:cuda (Run default CUDA benchmarks), benchmarks:x86_64 (Run default x86_64 benchmarks), benchmarks:comp-stats (Run default compilation statistics benchmarks), benchmarks:android-cpu (Run default Android CPU benchmarks), benchmarks:android-gpu (Run default Android GPU benchmarks)
github-actions bot commented on Aug 16, 2023:

Abbreviated Benchmark Summary

@ commit 96335ad4164fa6c47702f65a2f51fde33430d79f (vs. base 31a51206afaccd6293c1e7a77e5b9c2ebdcaff7e)

Regressed Latencies 🚩

| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
| --- | --- | --- | --- |
| MobileNetV2_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu][default-flags] local_task(embedded_elf)[4-thread,full-inference,system-scheduling] with zeros @ pixel-4[big-core] | 14.671 (vs. 13.529, 8.44%↑) | 14.064 | 1.334 |
| MobileBertSquad_fp16(tflite) [arm-valhall-vulkan_android31-vulkan_spirv][default-flags,demote-f32-to-f16] vulkan(none)[full-inference,default-flags] with zeros @ pixel-6-pro[gpu] | 78.804 (vs. 74.269, 6.11%↑) | 78.777 | 0.563 |

Improved Latencies 🎉

| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
| --- | --- | --- | --- |
| MobileNetV3Small_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu][default-flags] local_sync(embedded_elf)[full-inference,default-flags] with zeros @ pixel-6-pro[little-core] | 65.093 (vs. 73.163, 11.03%↓) | 65.090 | 0.030 |
| MobileNetV2_fp32(tflite) [vmvx-generic-vmvx-vmvx][experimental-flags] local_task(vmvx_module)[4-thread,full-inference,system-scheduling] with zeros @ pixel-4[big-core] | 5101.562 (vs. 5687.836, 10.31%↓) | 5139.404 | 146.600 |
| MobileNetV3Small_fp32(tflite) [vmvx-generic-vmvx-vmvx][experimental-flags] local_task(vmvx_module)[4-thread,full-inference,system-scheduling] with zeros @ pixel-4[big-core] | 991.911 (vs. 1090.495, 9.04%↓) | 988.962 | 17.841 |

[Top 3 out of 10 results shown]

Improved Stream IR Dispatch Count (# of cmd.dispatch ops) 🎉

| Benchmark Name | Stream IR Dispatch Count (# of cmd.dispatch ops) |
| --- | --- |
| Unet2dPT(linalg) [cuda-sm_80-linux_gnu-cuda][default-flags,compile-stats] | 958 (vs. 980, 2.24%↓) |

For more information:

Source Workflow Run

hanhanW marked this pull request as ready for review on August 16, 2023 21:41
hanhanW (Contributor, Author) commented on Aug 16, 2023:

The results are interesting... overall they look positive to me.

MaheshRavishankar (Contributor) commented:
I'd try to triage the difference in the number of dispatches created.

MaheshRavishankar (Contributor) left a review comment:

I think this needs a bit of triage on the number of dispatches created.

hanhanW (Contributor, Author) commented on Aug 17, 2023:

Agreed that we need more investigation; I was just trying to see what happens in this case. I'd like to scope this down to moving SetEncoding to the FlowPreprocessing stage. (The investigation can happen when we work on improving the const-eval heuristic. At that point we will need to do some basic fusion and move some passes to the preprocessing stage.)

@hanhanW hanhanW changed the title [Flow] Move more passes to FlowPreprocessing stage. [Flow] Move SetEncoding pass to FlowPreprocessing stage. Aug 18, 2023
```diff
-      .addPass(IREE::Flow::createConvert1X1FilterConv2DToMatmulPass);
-  passManager.addPass(IREE::Flow::createEraseUnusedLinalgOperands());
+      .addPass(IREE::Flow::createConvert1X1FilterConv2DToMatmulPass)
+      .addPredicatedPass(clEnableDataTiling, createSetEncodingPass);

   // Start of Flow pipeline, verify input legality.
   passManager.addPass(IREE::Flow::createVerifyInputLegalityPass());
```
A review comment on this diff (Contributor):

Do you want to move this also before the preprocessing passes?
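For context on the `.addPredicatedPass(clEnableDataTiling, createSetEncodingPass)` call in the diff above, a minimal sketch of what such a helper might look like; this wrapper class is hypothetical, not IREE's actual utility:

```cpp
#include <functional>
#include <memory>

#include "mlir/Pass/Pass.h"
#include "mlir/Pass/PassManager.h"

// Hypothetical fluent wrapper around OpPassManager: each pass is added only
// when its predicate (typically a command-line flag) is true.
class PredicatedPassManager {
public:
  explicit PredicatedPassManager(mlir::OpPassManager &pm) : pm(pm) {}

  PredicatedPassManager &
  addPass(std::function<std::unique_ptr<mlir::Pass>()> create) {
    pm.addPass(create());
    return *this;
  }

  PredicatedPassManager &
  addPredicatedPass(bool enable,
                    std::function<std::unique_ptr<mlir::Pass>()> create) {
    // Skip the pass entirely when the flag is off, keeping the pipeline
    // declaration readable as a single chained expression.
    if (enable)
      pm.addPass(create());
    return *this;
  }

private:
  mlir::OpPassManager &pm;
};
```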

@hanhanW hanhanW changed the title [Flow] Move SetEncoding pass to FlowPreprocessing stage. Move more pass from Flow stage to GlobalOptimization stage. Aug 24, 2023
hanhanW (Contributor, Author) commented on Aug 28, 2023:

@MaheshRavishankar We can't move the raising-special-ops pass before global optimization right now, because it introduces a huge regression on CUDA; something falls back to sequential computation. So far we can move a few passes to the global optimization phase. Please take a look and see if it is okay to move them there. It helps us enable const-eval for data tiling (i.e., #14792) because we won't need to handle rank-reduced cases: we can run FoldUnitExtentDimsPass before the SetEncoding pass (see the ordering sketch below).
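A small ordering sketch of that last point, with assumed factory names and an assumed helper function (this is not the literal pipeline code):

```cpp
#include <memory>

#include "mlir/Pass/Pass.h"
#include "mlir/Pass/PassManager.h"

// Assumed factories, named after the passes discussed above.
std::unique_ptr<mlir::Pass> createFoldUnitExtentDimsPass();
std::unique_ptr<mlir::Pass> createSetEncodingPass();

// Folding unit-extent dims before SetEncoding means the encoding logic (and
// const-eval for data tiling) never has to handle rank-reduced tensors.
void addDataTilingPasses(mlir::OpPassManager &pm, bool enableDataTiling) {
  pm.addPass(createFoldUnitExtentDimsPass());
  if (enableDataTiling)
    pm.addPass(createSetEncodingPass());
}
```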

benvanik (Collaborator) commented:
(wonder if this is CUDA transform dialect matchers or some other CUDA-specific patterns special cased on rank or something?)

hanhanW (Contributor, Author) commented on Aug 28, 2023:

> (wonder if this is CUDA transform dialect matchers or some other CUDA-specific patterns special cased on rank or something?)

I don't know at this moment. :) My take is that the RaiseSpecialOps pass is preprocessing for fusion, so if we move it before const-eval, const-eval could break the behavior: it hoists some ops into globals, which breaks up the graph, and then the matchers no longer work. It is one of the passes that should run right before fusion.

qedawkins (Contributor) commented:
Can we run RaiseSpecialOps in multiple places? I could see it being worth running both before and after FoldUnitExtentDims/other passes.

benvanik (Collaborator) commented:
RaiseSpecialOps indeed would be useful to run at various stages, including possibly as part of a fixed point iteration around many such passes.
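A rough sketch of what a fixed-point loop around such passes could look like, assuming convergence is detected by comparing printed IR snapshots (a real implementation would use something cheaper, such as an IR fingerprint):

```cpp
#include <string>

#include "llvm/Support/raw_ostream.h"
#include "mlir/IR/BuiltinOps.h"
#include "mlir/Pass/PassManager.h"
#include "mlir/Support/LogicalResult.h"

// Print the module to a string so two snapshots can be compared. This is an
// expensive but simple convergence check, for illustration only.
static std::string snapshot(mlir::ModuleOp module) {
  std::string ir;
  llvm::raw_string_ostream os(ir);
  module->print(os);
  os.flush();
  return ir;
}

// Run the given pipeline repeatedly until the IR stops changing, with an
// iteration cap to guard against oscillating pattern sets.
mlir::LogicalResult runToFixedPoint(mlir::PassManager &pm,
                                    mlir::ModuleOp module,
                                    int maxIterations = 8) {
  std::string before = snapshot(module);
  for (int i = 0; i < maxIterations; ++i) {
    if (mlir::failed(pm.run(module)))
      return mlir::failure();
    std::string after = snapshot(module);
    if (after == before)
      return mlir::success(); // converged
    before = std::move(after);
  }
  return mlir::success(); // cap reached; accept the current IR
}
```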

qedawkins (Contributor) commented:
Cool, that matches my thinking about the pass as well. It should basically contain patterns that are always worth applying; we just don't know exactly when they will apply.

MaheshRavishankar (Contributor) commented:
I'd rather wait for things to stabilize before we run them multiple times...

MaheshRavishankar (Contributor) left a review comment:

Ok, this is fine for now.

hanhanW (Contributor, Author) commented on Aug 28, 2023:

I think we can run it multiple times if we know where we want to apply them, but I'm not going to study whether it can go in multiple places in this PR. The intention here is to move more passes to the global optimization phase where it makes sense.

hanhanW merged commit afa74de into iree-org:main on Aug 28, 2023
58 checks passed
hanhanW deleted the flow-shuffle branch on August 28, 2023 23:43
stellaraccident (Collaborator) commented:

Note that this increased the amount of constant folding by 20-30x on llama2: it seems to be mostly memorizing a bunch of collapse_shapes. We may want to deny-list collapse_shape as a constexpr leaf node, because there is seldom any real value in const-evaling a metadata change.

It does not seem to have made any significant change to latency, either at runtime or in compile time (except for one compile-time outlier that I assume was a fluke).
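A hedged sketch of that deny-list idea, assuming a policy hook in the const-eval hoisting heuristic that decides whether an op is worth materializing as a constant (the hook name is made up for this example):

```cpp
#include "mlir/Dialect/Tensor/IR/Tensor.h"
#include "mlir/IR/Operation.h"

// Hypothetical hoisting-policy hook: metadata-only reshapes are essentially
// free at runtime, so const-evaluating them just duplicates the underlying
// constant in memory without saving any real work.
bool isWorthConstEvaling(mlir::Operation *op) {
  if (mlir::isa<mlir::tensor::CollapseShapeOp, mlir::tensor::ExpandShapeOp>(op))
    return false;
  return true;
}
```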
