-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DmaLoopSubsumption] Relax circular dma loop subsumption condition #826
Conversation
yzhang93
commented
Oct 4, 2024
- This PR relaxes the condition for circular dma ops loop subsumption, so that npu.circular_dma_cpy_nd ops can be hoisted out of the loop even if there is other npu.dma_cpy_nd user of the same connection op after it.
- With this change, we can further subsume loops and hoist npu.dma_cpy_nd ops out of the loop. This PR makes use of [ConvertToDma] Add options to tranpose dma dimensions on target side #812 and brings the dma optimizations in Passes.cpp.
compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/Passes.cpp
Outdated
Show resolved
Hide resolved
compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/test/dma_loop_subsumption_circular.mlir
Outdated
Show resolved
Hide resolved
compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIEDmaLoopSubsumption.cpp
Show resolved
Hide resolved
compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/test/dma_loop_subsumption_circular.mlir
Show resolved
Hide resolved
compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/test/dma_loop_subsumption_circular.mlir
Show resolved
Hide resolved
{ | ||
RewritePatternSet patterns(context); | ||
populateDmaLoopSubsumptionPattern(patterns, std::move(deviceModel), | ||
onlyZeroStrideOnOuterDim); | ||
if (failed(applyPatternsAndFoldGreedily(parentOp, std::move(patterns)))) { | ||
parentOp->emitOpError("failed to subsume loops into DMA operations"); | ||
return signalPassFailure(); | ||
} | ||
} | ||
{ | ||
auto targetAttr = IREE::HAL::ExecutableTargetAttr::lookup(parentOp); | ||
std::optional<AMDAIEDevice> maybeDevice = getConfigAMDAIEDevice(targetAttr); | ||
if (!maybeDevice) { | ||
parentOp->emitOpError() | ||
<< "has no AMDAIEDevice in the target attribute configuration. This " | ||
"device-specific information is required to determine when loops " | ||
"can be subsumed into DMA operations, and must be attached to a " | ||
"containing ModuleOp."; | ||
RewritePatternSet patterns(context); | ||
populateStridedOpCombinationPattern(patterns); | ||
if (failed(applyPatternsAndFoldGreedily(parentOp, std::move(patterns)))) { | ||
parentOp->emitOpError("failed to compose strided operations"); | ||
return signalPassFailure(); | ||
} | ||
AMDAIE::AMDAIEDeviceModel deviceModel = | ||
AMDAIE::getDeviceModel(maybeDevice.value()); | ||
} | ||
{ | ||
RewritePatternSet patterns(context); | ||
populateCanonicalizeDoublyStridedOpPatterns(patterns, false); | ||
if (failed(applyPatternsAndFoldGreedily(parentOp, std::move(patterns)))) { | ||
parentOp->emitOpError( | ||
"failed to canonicalize doubly strided DMA operations"); | ||
return signalPassFailure(); | ||
} | ||
} | ||
{ | ||
RewritePatternSet patterns(context); | ||
populateDmaLoopSubsumptionPattern(patterns, std::move(deviceModel), | ||
onlyZeroStrideOnOuterDim); | ||
} | ||
populateStridedOpCombinationPattern(patterns); | ||
if (failed(applyPatternsAndFoldGreedily(parentOp, std::move(patterns)))) { | ||
parentOp->emitOpError("failed to compose strided operations"); | ||
return signalPassFailure(); | ||
if (failed(applyPatternsAndFoldGreedily(parentOp, std::move(patterns)))) { | ||
parentOp->emitOpError("failed to compose strided operations"); | ||
return signalPassFailure(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Applying the patterns one by one will always miss some composition opportunities. We should be populating all patterns and applying them together. This will be the most versatile and keep rewriting until all DMA composition opportunities are exhausted:
RewritePatternSet patterns(context);
populateDmaLoopSubsumptionPattern(patterns, std::move(deviceModel),
onlyZeroStrideOnOuterDim);
populateStridedOpCombinationPattern(patterns);
populateCanonicalizeDoublyStridedOpPatterns(patterns, false);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've got a couple of passes which look like they're doing canonicalization
of DMA ops. Can they be put into a single AMDAIECanonicalizeDma.cpp, which
we then just run like vanilla canonicalization?
AMDAIECanonicalizeDoublyStridedOp.cpp
AMDAIECanonicalizeNpuDmaCpyNd.cpp
AMDAIECombineStridedOps.cpp
AMDAIEDmaComposition.cpp
AMDAIEDmaCSE.cpp
AMDAIEDmaLoopSubsumption.cpp
AMDAIENormalizeLoopBounds.cpp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(obviously not for this PR, but maybe some refactoring for later)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AMDAIECombineStridedOps.cpp
AMDAIEDmaComposition.cpp
AMDAIEDmaCSE.cpp
AMDAIEDmaLoopSubsumption.cpp
AMDAIENormalizeLoopBounds.cpp
I don't think these passes are doing canonicalization.
AMDAIECombineStridedOps.cpp
AMDAIEDmaComposition.cpp
AMDAIEDmaLoopSubsumption.cpp
These involve optimization related transformations across multiple operations (multiple DMA, DMA and loops), which is not a canonicalization.
AMDAIEDmaCSE.cpp
This is a CSE pass, not canonicalization.
AMDAIENormalizeLoopBounds.cpp
What does this have to do with DMA ops?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Applying the patterns one by one will always miss some composition opportunities. We should be populating all patterns and applying them together. This will be the most versatile and keep rewriting until all DMA composition opportunities are exhausted:
RewritePatternSet patterns(context); populateDmaLoopSubsumptionPattern(patterns, std::move(deviceModel), onlyZeroStrideOnOuterDim); populateStridedOpCombinationPattern(patterns); populateCanonicalizeDoublyStridedOpPatterns(patterns, false);
I think the reason it doesn't work as expected is because it generate following IR, and %48 and %49 are not combinable.
%47 = amdaie.npu.dma_cpy_nd %7([] [] [], %30[0, %46, 0] [4, 64, 32] [32, 128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>
%48 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[0, %45] [32, 64] [128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>
amdaie.npu.dma_wait(%48, MM2S)
%49 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[32, %45] [96, 64] [128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>
amdaie.npu.dma_wait(%47, MM2S)
amdaie.npu.dma_wait(%49, MM2S)
But if we keep some order of those patterns, it could generate the IR as
%48 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[0, 0, %45] [4, 32, 64] [4096, 128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
%48 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[0, 0, %45] [4, 32, 64] [4096, 128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>
Isn't this the same as:
%48 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[0, %45] [128, 64] [128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>
I think you can adjust CombineStridedOps
, so the following is combined into the result above:
48 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[0, %45] [32, 64] [128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>
%49 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[32, %45] [96, 64] [128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant canonicalization in a general sense. Canonicalization in the strict sense should be done directly with op class methods. I don't want to spend time arguing abstract points for which I have no evidence but my intuition says it's good to have lots of lowering passes, but few, general purpose canonicalization/cse/normalization passes which can be run anywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can adjust
CombineStridedOps
, so the following is combined into the result above:48 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[0, %45] [32, 64] [128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>> %49 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[32, %45] [96, 64] [128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>
Are you sure this is combinable for a more general case regardless of what we already know from this example? It is not consistent with the current logic that inserting a new dimension if two npu has the same number of dimensions of addressing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can adjust
CombineStridedOps
, so the following is combined into the result above:48 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[0, %45] [32, 64] [128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>> %49 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[32, %45] [96, 64] [128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>
Are you sure this is combinable for a more general case regardless of what we already know from this example?
Yes, because the above two DMA ops are just an extension of each other, so no new dimension is needed. This is an edge case which is just not handled currently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant canonicalization in a general sense. Canonicalization in the strict sense should be done directly with op class methods.
In that sense, you could argue that every lowering pass is performing a canonicalization. so that the canonical
form is the most efficiently lowered form for the given hardware, and therefore you could put the whole codegen inside a single canonicalization pass. That's against the point of having lowering passes, so I think we shouldn't look at it like that, and unless we have a very good reason to deviate, we should stay with the definition and use of canonicalization within MLIR: https://mlir.llvm.org/docs/Canonicalization/.
This PR adds a corner case for strided op combination. With this PR, the following strides ops: ``` 48 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[0, %45] [32, 64] [128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>> %49 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[32, %45] [96, 64] [128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>> ``` can be combined as `%48 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[0, %45] [128, 64] [128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>> ` Addressed review comments #826 (comment).
dfccad3
to
9d10fd7
Compare
9d10fd7
to
099513d
Compare
passManager.addPass(createAMDAIEConvertToDmaPass()); | ||
// For matmul pipelines, we do transpose on target side for pack ops to get | ||
// better performance. While for convolution pipelines, the same settings | ||
// cause 'aie.dma_bd' error, so for now keep using transpose on source for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the exact error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error is 'aie.dma_bd' op Cannot give more than 3 dimensions for step sizes and wraps in this tile (got 4 dimensions).
The full dump IR is https://gist.github.com/yzhang93/fd34425cc223eccbd233c7d1051f2d63
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM