Move more passes from Flow stage to GlobalOptimization stage. #14707
Conversation
hanhanW commented Aug 16, 2023 (edited)
- Move four more passes to the GlobalOptimization stage (see the sketch after this list):
  - ConvertElementwiseToLinalgPass
  - GeneralizeLinalgNamedOpsPass
  - FuseDequantizationMatmulPass
  - FoldUnitExtentDimsPass
- Move the Flow transformation_pipeline.mlir test to GlobalOptimization/test. It mainly tests the ConvertElementwiseToLinalg pass, which is also tested upstream, so we can probably remove it as a follow-up.
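For illustration, a minimal sketch of what registering the moved passes in the GlobalOptimization pipeline could look like. The builder and pass-creation names below follow IREE's existing naming conventions but are assumptions, not the literal diff:

```cpp
#include "mlir/Pass/PassManager.h"

// Hypothetical sketch of the GlobalOptimization pipeline after this change.
// The create*Pass() names mirror the passes listed above; the surrounding
// builder function is illustrative only.
void buildGlobalOptimizationPassPipeline(mlir::OpPassManager &passManager) {
  // Convert remaining elementwise ops into linalg.generic so later
  // optimizations (and const-eval) see a uniform representation.
  passManager.addPass(createConvertElementwiseToLinalgPass());
  // Generalize named linalg ops into linalg.generic form.
  passManager.addPass(createGeneralizeLinalgNamedOpsPass());
  // Fuse dequantization into matmul before dispatch formation.
  passManager.addPass(createFuseDequantizationMatmulPass());
  // Fold unit-extent dims so later stages do not have to handle
  // rank-reduced cases.
  passManager.addPass(createFoldUnitExtentDimsPass());
}
```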
The result is interesting... overall it looks positive to me.
I'd try to triage the difference in the number of dispatches created.
I think this needs a bit of triage on the number of dispatches created.
Agreed that we need more investigation. I was just trying to see what's happening in this case. I'd like to scope it to moving SetEncoding to the FlowPreprocessing stage. (The investigation can happen when we work on improving the const-eval heuristic; at that point we will need to do some basic fusion and move some passes to the preprocessing stage.)
```cpp
      .addPass(IREE::Flow::createConvert1X1FilterConv2DToMatmulPass)
      .addPredicatedPass(clEnableDataTiling, createSetEncodingPass);
  passManager.addPass(IREE::Flow::createEraseUnusedLinalgOperands());
```

```cpp
  // Start of Flow pipeline, verify input legality.
  passManager.addPass(IREE::Flow::createVerifyInputLegalityPass());
```
Do you want to move this also before the preprocessing passes?
@MaheshRavishankar We can't move the raising-special-ops pass before global optimization right now, because it introduces a huge regression on CUDA: something falls into sequential computation. So far we can move a few passes to the global optimization phase. Please take a look and see if it is okay to move them there. It helps us enable const-eval for data tiling (i.e., #14792) because we won't need to handle rank-reduced cases; we can run the FoldUnitExtentDimsPass before the SetEncoding pass.
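To make that ordering concrete, a minimal sketch, assuming the pass-creation functions and clEnableDataTiling flag shown in the diff excerpt above; the wrapper itself is hypothetical:

```cpp
#include "mlir/Pass/PassManager.h"

// Hypothetical ordering: fold unit-extent dims first so SetEncoding never
// sees rank-reduced tensors, then set encodings when data tiling is enabled.
static void addDataTilingPrep(mlir::OpPassManager &passManager) {
  passManager.addPass(createFoldUnitExtentDimsPass());
  if (clEnableDataTiling)
    passManager.addPass(createSetEncodingPass());
}
```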
(I wonder if this is the CUDA transform dialect matchers or some other CUDA-specific patterns special-cased on rank or something?)
I don't know at this moment. :) My take is that the RaiseSpecialOps pass is preprocessing for fusion. If we move it before const-eval, const-eval could break the behavior: it hoists some ops to globals, which breaks the graph, so the matchers no longer work. It is one of the passes that should be run right before fusion.
Can we run RaiseSpecialOps in multiple places? I could see it being worth running both before and after FoldUnitExtentDims/other passes.
RaiseSpecialOps indeed would be useful to run at various stages, including possibly as part of a fixed-point iteration around many such passes.
Cool, that matches my thinking about the pass as well. It should basically contain patterns that are always worth applying, but we don't know exactly when they're going to apply.
I'd rather wait for things to stabilize before we run them multiple times...
Ok, this is fine for now.
I think we can run it multiple times if we know where we want to apply it, but I'm not going to study whether we can put it in multiple places in this PR. The intention here is to move more passes to the global optimization phase where it makes sense.
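For illustration only, a sketch of what running it in more than one place could look like; the pass-creation name is assumed from IREE's naming convention, and the placement is hypothetical rather than something this PR does:

```cpp
#include "mlir/Pass/PassManager.h"

// Hypothetical: run RaiseSpecialOps both early and late. The early run
// happens before const-eval can hoist pieces of a pattern into globals;
// the late run re-raises anything exposed by the intermediate passes,
// right before dispatch region formation where fusion needs it.
static void addRaiseSpecialOpsRuns(mlir::OpPassManager &passManager) {
  passManager.addPass(IREE::Flow::createRaiseSpecialOps());
  // ... const-eval, FoldUnitExtentDims, other global optimizations ...
  passManager.addPass(IREE::Flow::createRaiseSpecialOps());
}
```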
Note that this increased the amount of constant folding by 20-30x on llama2: it is mostly memorizing a bunch of collapse_shapes, it seems. We may want to deny-list collapse_shape as a constexpr leaf node, because there is seldom any real value in const-evaling a metadata change. It does not seem to have made any significant change to latency, either at runtime or in compile time (except for one compile-time outlier that I assume was a fluke).
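A deny-list like that could plug into the const-expr leaf decision as a simple op filter. The sketch below is a minimal illustration using only upstream MLIR types and a hypothetical hook name, not IREE's actual const-eval API:

```cpp
#include "mlir/Dialect/Tensor/IR/Tensor.h"
#include "mlir/IR/Operation.h"

// Hypothetical const-expr filter: skip pure metadata changes such as
// tensor.collapse_shape / tensor.expand_shape, since const-evaling them
// only bakes a layout change into a new global.
static bool isWorthConstEvaling(mlir::Operation *op) {
  return !llvm::isa<mlir::tensor::CollapseShapeOp,
                    mlir::tensor::ExpandShapeOp>(op);
}
```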