
[DmaLoopSubsumption] Relax circular dma loop subsumption condition #826

Merged · 6 commits · Oct 10, 2024

Conversation

yzhang93 (Contributor) commented Oct 4, 2024

  1. This PR relaxes the condition for circular DMA op loop subsumption, so that npu.circular_dma_cpy_nd ops can be hoisted out of the loop even if another npu.dma_cpy_nd user of the same connection op follows them (see the sketch below).
  2. With this change, we can further subsume loops and hoist npu.dma_cpy_nd ops out of the loop. This PR makes use of [ConvertToDma] Add options to tranpose dma dimensions on target side #812 and brings the DMA optimizations into Passes.cpp.
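
To illustrate point 1, a minimal before/after sketch (SSA names, shapes, and loop bounds are hypothetical; the syntax follows the IR snippets quoted later in this thread):

```
// Before: %0 could not be hoisted because %conn has a later
// npu.dma_cpy_nd user (%1) inside the loop.
scf.for %iv = %c0 to %c4 step %c1 {
  %0 = amdaie.npu.circular_dma_cpy_nd %conn([0, 0] [32, 64] [64, 1], [] [] [])
  %1 = amdaie.npu.dma_cpy_nd %conn([] [] [], %src[0, %iv] [32, 64] [128, 1])
}

// After: the loop-invariant circular op is hoisted out of the loop,
// and %1 becomes a candidate for loop subsumption on its own.
%0 = amdaie.npu.circular_dma_cpy_nd %conn([0, 0] [32, 64] [64, 1], [] [] [])
scf.for %iv = %c0 to %c4 step %c1 {
  %1 = amdaie.npu.dma_cpy_nd %conn([] [] [], %src[0, %iv] [32, 64] [128, 1])
}
```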

Comment on lines 55 to 88
```
auto targetAttr = IREE::HAL::ExecutableTargetAttr::lookup(parentOp);
std::optional<AMDAIEDevice> maybeDevice = getConfigAMDAIEDevice(targetAttr);
if (!maybeDevice) {
  parentOp->emitOpError()
      << "has no AMDAIEDevice in the target attribute configuration. This "
         "device-specific information is required to determine when loops "
         "can be subsumed into DMA operations, and must be attached to a "
         "containing ModuleOp.";
  return signalPassFailure();
}
AMDAIE::AMDAIEDeviceModel deviceModel =
    AMDAIE::getDeviceModel(maybeDevice.value());
{
  RewritePatternSet patterns(context);
  populateDmaLoopSubsumptionPattern(patterns, std::move(deviceModel),
                                    onlyZeroStrideOnOuterDim);
  if (failed(applyPatternsAndFoldGreedily(parentOp, std::move(patterns)))) {
    parentOp->emitOpError("failed to subsume loops into DMA operations");
    return signalPassFailure();
  }
}
{
  RewritePatternSet patterns(context);
  populateStridedOpCombinationPattern(patterns);
  if (failed(applyPatternsAndFoldGreedily(parentOp, std::move(patterns)))) {
    parentOp->emitOpError("failed to compose strided operations");
    return signalPassFailure();
  }
}
{
  RewritePatternSet patterns(context);
  populateCanonicalizeDoublyStridedOpPatterns(patterns, false);
  if (failed(applyPatternsAndFoldGreedily(parentOp, std::move(patterns)))) {
    parentOp->emitOpError(
        "failed to canonicalize doubly strided DMA operations");
    return signalPassFailure();
  }
}
```
Collaborator:

Applying the patterns one by one will always miss some composition opportunities. We should be populating all patterns and applying them together. This will be the most versatile and keep rewriting until all DMA composition opportunities are exhausted:

RewritePatternSet patterns(context);
populateDmaLoopSubsumptionPattern(patterns, std::move(deviceModel),
                                  onlyZeroStrideOnOuterDim);
populateStridedOpCombinationPattern(patterns);
populateCanonicalizeDoublyStridedOpPatterns(patterns, false);
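
A minimal sketch completing the snippet above (same names as the snippet; illustrative, not the exact pass code), with a single greedy application of the combined pattern set:

```
// All three pattern sets in one RewritePatternSet: the greedy driver
// keeps rewriting until no DMA composition opportunity remains.
RewritePatternSet patterns(context);
populateDmaLoopSubsumptionPattern(patterns, std::move(deviceModel),
                                  onlyZeroStrideOnOuterDim);
populateStridedOpCombinationPattern(patterns);
populateCanonicalizeDoublyStridedOpPatterns(patterns, false);
if (failed(applyPatternsAndFoldGreedily(parentOp, std::move(patterns)))) {
  parentOp->emitOpError("failed to compose DMA operations");
  return signalPassFailure();
}
```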

Contributor:

We've got a couple of passes which look like they're doing canonicalization
of DMA ops. Can they be put into a single AMDAIECanonicalizeDma.cpp, which
we then just run like vanilla canonicalization?

AMDAIECanonicalizeDoublyStridedOp.cpp
AMDAIECanonicalizeNpuDmaCpyNd.cpp
AMDAIECombineStridedOps.cpp
AMDAIEDmaComposition.cpp
AMDAIEDmaCSE.cpp
AMDAIEDmaLoopSubsumption.cpp
AMDAIENormalizeLoopBounds.cpp
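
For concreteness, a hypothetical sketch of what such a combined pass could look like (the pass name and base class are illustrative and do not exist in the repo as-is; as the discussion below notes, whether these patterns belong together is contested):

```
namespace {
// Sketch: one pass that populates the DMA-related cleanup patterns into
// a single set and runs them greedily, like vanilla canonicalization.
struct AMDAIECanonicalizeDmaPass
    : public impl::AMDAIECanonicalizeDmaBase<AMDAIECanonicalizeDmaPass> {
  void runOnOperation() override {
    Operation *parentOp = getOperation();
    MLIRContext *context = &getContext();
    RewritePatternSet patterns(context);
    populateCanonicalizeDoublyStridedOpPatterns(patterns, false);
    populateStridedOpCombinationPattern(patterns);
    if (failed(applyPatternsAndFoldGreedily(parentOp, std::move(patterns))))
      return signalPassFailure();
  }
};
}  // namespace
```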

Contributor:

(obviously not for this PR, but maybe some refactoring for later)

Collaborator:

> AMDAIECombineStridedOps.cpp
> AMDAIEDmaComposition.cpp
> AMDAIEDmaCSE.cpp
> AMDAIEDmaLoopSubsumption.cpp
> AMDAIENormalizeLoopBounds.cpp

I don't think these passes are doing canonicalization.

> AMDAIECombineStridedOps.cpp
> AMDAIEDmaComposition.cpp
> AMDAIEDmaLoopSubsumption.cpp

These involve optimization-related transformations across multiple operations (multiple DMAs, or DMAs and loops), which is not canonicalization.

> AMDAIEDmaCSE.cpp

This is a CSE pass, not canonicalization.

> AMDAIENormalizeLoopBounds.cpp

What does this have to do with DMA ops?

yzhang93 (Contributor Author) commented Oct 8, 2024

> Applying the patterns one by one will always miss some composition opportunities. We should be populating all patterns and applying them together. This will be the most versatile and keep rewriting until all DMA composition opportunities are exhausted:
>
> RewritePatternSet patterns(context);
> populateDmaLoopSubsumptionPattern(patterns, std::move(deviceModel),
>                                   onlyZeroStrideOnOuterDim);
> populateStridedOpCombinationPattern(patterns);
> populateCanonicalizeDoublyStridedOpPatterns(patterns, false);

I think the reason it doesn't work as expected is that it generates the following IR, where %48 and %49 are not combinable.

%47 = amdaie.npu.dma_cpy_nd %7([] [] [], %30[0, %46, 0] [4, 64, 32] [32, 128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>
%48 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[0, %45] [32, 64] [128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>
amdaie.npu.dma_wait(%48, MM2S)
%49 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[32, %45] [96, 64] [128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>
amdaie.npu.dma_wait(%47, MM2S)
amdaie.npu.dma_wait(%49, MM2S)

But if we keep a specific order for those patterns, it can generate the following IR:
%48 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[0, 0, %45] [4, 32, 64] [4096, 128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>

Collaborator:

> %48 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[0, 0, %45] [4, 32, 64] [4096, 128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>

Isn't this the same as:

%48 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[0, %45] [128, 64] [128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>

I think you can adjust CombineStridedOps, so the following is combined into the result above:

%48 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[0, %45] [32, 64] [128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>
%49 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[32, %45] [96, 64] [128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>

Contributor:

I meant canonicalization in a general sense. Canonicalization in the strict sense should be done directly with op class methods. I don't want to spend time arguing abstract points for which I have no evidence, but my intuition says it's good to have lots of lowering passes and few general-purpose canonicalization/CSE/normalization passes which can be run anywhere.

Contributor Author:

> I think you can adjust CombineStridedOps, so the following is combined into the result above:
>
> %48 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[0, %45] [32, 64] [128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>
> %49 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[32, %45] [96, 64] [128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>

Are you sure this is combinable in the general case, beyond what we already know from this example? It is not consistent with the current logic, which inserts a new dimension when two NPU DMA ops have the same number of addressing dimensions.

Collaborator:

> I think you can adjust CombineStridedOps, so the following is combined into the result above:
>
> %48 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[0, %45] [32, 64] [128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>
> %49 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[32, %45] [96, 64] [128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>

> Are you sure this is combinable in the general case, beyond what we already know from this example?

Yes, because the above two DMA ops are just an extension of each other, so no new dimension is needed. This is an edge case which is just not handled currently.
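
For concreteness, a hedged sketch of the check this edge case needs (function and variable names are hypothetical, not the actual CombineStridedOps code): two access patterns are extensions of each other when their strides and all-but-outermost offsets/sizes match and the second op's outermost offset starts exactly where the first op ends, in which case the outermost sizes can simply be summed.

```
#include <cstdint>
#include <vector>

// Hypothetical helper: returns true and writes the combined outer size
// if pattern B is a contiguous extension of pattern A along the
// outermost dimension. Offsets/sizes/strides are per dimension,
// outermost first. Real DMA ops also carry dynamic offsets (e.g. %45
// above); this sketch only handles the static case.
bool isContiguousExtension(const std::vector<int64_t> &offsetsA,
                           const std::vector<int64_t> &sizesA,
                           const std::vector<int64_t> &stridesA,
                           const std::vector<int64_t> &offsetsB,
                           const std::vector<int64_t> &sizesB,
                           const std::vector<int64_t> &stridesB,
                           int64_t &combinedOuterSize) {
  size_t rank = offsetsA.size();
  if (offsetsB.size() != rank || stridesA != stridesB) return false;
  // All inner dimensions must match exactly.
  for (size_t i = 1; i < rank; ++i)
    if (offsetsA[i] != offsetsB[i] || sizesA[i] != sizesB[i]) return false;
  // B must start on the row right after A's last row: in the example
  // above, A covers rows [0, 32) and B covers rows [32, 128).
  if (offsetsB[0] != offsetsA[0] + sizesA[0]) return false;
  combinedOuterSize = sizesA[0] + sizesB[0];  // 32 + 96 = 128
  return true;
}
```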

jtuyls (Collaborator) commented Oct 9, 2024

> I meant canonicalization in a general sense. Canonicalization in the strict sense should be done directly with op class methods.

In that sense, you could argue that every lowering pass is performing a canonicalization, such that the canonical form is the most efficiently lowered form for the given hardware, and therefore you could put the whole codegen inside a single canonicalization pass. That defeats the point of having lowering passes, so I don't think we should look at it that way; unless we have a very good reason to deviate, we should stick with the definition and use of canonicalization within MLIR: https://mlir.llvm.org/docs/Canonicalization/.

yzhang93 added a commit that referenced this pull request Oct 9, 2024
This PR adds a corner case for strided op combination. With this PR, the
following strided ops:

```
%48 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[0, %45] [32, 64] [128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>
%49 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[32, %45] [96, 64] [128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>
```

can be combined as

```
%48 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[0, %45] [128, 64] [128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>
```

Addressed review comments
#826 (comment).
yzhang93 force-pushed the relax_circular_dma_condition branch from dfccad3 to 9d10fd7 on October 9, 2024, 22:24
yzhang93 force-pushed the relax_circular_dma_condition branch from 9d10fd7 to 099513d on October 9, 2024, 23:09
passManager.addPass(createAMDAIEConvertToDmaPass());
// For matmul pipelines, we do transpose on target side for pack ops to get
// better performance. While for convolution pipelines, the same settings
// cause 'aie.dma_bd' error, so for now keep using transpose on source for
Collaborator:

What's the exact error?

Contributor Author:

The error is `'aie.dma_bd' op Cannot give more than 3 dimensions for step sizes and wraps in this tile (got 4 dimensions)`. The full IR dump is at https://gist.github.com/yzhang93/fd34425cc223eccbd233c7d1051f2d63.

jtuyls (Collaborator) left a comment:

LGTM

yzhang93 merged commit a533e7d into nod-ai:main on Oct 10, 2024 · 6 checks passed