
[DmaLoopSubsumption] Relax circular dma loop subsumption condition #826

Merged · 6 commits · Oct 10, 2024

Conversation

yzhang93 (Contributor) commented Oct 4, 2024

  1. This PR relaxes the condition for circular DMA op loop subsumption, so that npu.circular_dma_cpy_nd ops can be hoisted out of the loop even if another npu.dma_cpy_nd user of the same connection op follows them (see the sketch below).
  2. With this change, we can further subsume loops and hoist npu.dma_cpy_nd ops out of the loop. This PR makes use of [ConvertToDma] Add options to tranpose dma dimensions on target side #812 and brings the DMA optimizations into Passes.cpp.
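
To illustrate point 1, a minimal before/after sketch (SSA names, shapes, and loop bounds are hypothetical; the syntax follows the IR snippets quoted later in this thread):

```
// Before: %0 could not be hoisted because %conn has a later
// npu.dma_cpy_nd user (%1) inside the loop.
scf.for %iv = %c0 to %c4 step %c1 {
  %0 = amdaie.npu.circular_dma_cpy_nd %conn([0, 0] [32, 64] [64, 1], [] [] [])
  %1 = amdaie.npu.dma_cpy_nd %conn([] [] [], %src[0, %iv] [32, 64] [128, 1])
}

// After: the loop-invariant circular op is hoisted out of the loop,
// and %1 becomes a candidate for loop subsumption on its own.
%0 = amdaie.npu.circular_dma_cpy_nd %conn([0, 0] [32, 64] [64, 1], [] [] [])
scf.for %iv = %c0 to %c4 step %c1 {
  %1 = amdaie.npu.dma_cpy_nd %conn([] [] [], %src[0, %iv] [32, 64] [128, 1])
}
```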

Comment on lines 55 to 88
```
auto targetAttr = IREE::HAL::ExecutableTargetAttr::lookup(parentOp);
std::optional<AMDAIEDevice> maybeDevice = getConfigAMDAIEDevice(targetAttr);
if (!maybeDevice) {
  parentOp->emitOpError()
      << "has no AMDAIEDevice in the target attribute configuration. This "
         "device-specific information is required to determine when loops "
         "can be subsumed into DMA operations, and must be attached to a "
         "containing ModuleOp.";
  return signalPassFailure();
}
AMDAIE::AMDAIEDeviceModel deviceModel =
    AMDAIE::getDeviceModel(maybeDevice.value());
{
  RewritePatternSet patterns(context);
  populateDmaLoopSubsumptionPattern(patterns, std::move(deviceModel),
                                    onlyZeroStrideOnOuterDim);
  if (failed(applyPatternsAndFoldGreedily(parentOp, std::move(patterns)))) {
    parentOp->emitOpError("failed to subsume loops into DMA operations");
    return signalPassFailure();
  }
}
{
  RewritePatternSet patterns(context);
  populateStridedOpCombinationPattern(patterns);
  if (failed(applyPatternsAndFoldGreedily(parentOp, std::move(patterns)))) {
    parentOp->emitOpError("failed to compose strided operations");
    return signalPassFailure();
  }
}
{
  RewritePatternSet patterns(context);
  populateCanonicalizeDoublyStridedOpPatterns(patterns, false);
  if (failed(applyPatternsAndFoldGreedily(parentOp, std::move(patterns)))) {
    parentOp->emitOpError(
        "failed to canonicalize doubly strided DMA operations");
    return signalPassFailure();
  }
}
```
Collaborator:

Applying the patterns one by one will always miss some composition opportunities. We should be populating all patterns and applying them together. This will be the most versatile and keep rewriting until all DMA composition opportunities are exhausted:

RewritePatternSet patterns(context);
populateDmaLoopSubsumptionPattern(patterns, std::move(deviceModel),
                                  onlyZeroStrideOnOuterDim);
populateStridedOpCombinationPattern(patterns);
populateCanonicalizeDoublyStridedOpPatterns(patterns, false);
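
A minimal sketch completing the snippet above (same names as the snippet; illustrative, not the exact pass code), with a single greedy application of the combined pattern set:

```
// All three pattern sets in one RewritePatternSet: the greedy driver
// keeps rewriting until no DMA composition opportunity remains.
RewritePatternSet patterns(context);
populateDmaLoopSubsumptionPattern(patterns, std::move(deviceModel),
                                  onlyZeroStrideOnOuterDim);
populateStridedOpCombinationPattern(patterns);
populateCanonicalizeDoublyStridedOpPatterns(patterns, false);
if (failed(applyPatternsAndFoldGreedily(parentOp, std::move(patterns)))) {
  parentOp->emitOpError("failed to compose DMA operations");
  return signalPassFailure();
}
```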

Contributor:

We've got a couple of passes which look like they're doing canonicalization
of DMA ops. Can they be put into a single AMDAIECanonicalizeDma.cpp, which
we then just run like vanilla canonicalization?

AMDAIECanonicalizeDoublyStridedOp.cpp
AMDAIECanonicalizeNpuDmaCpyNd.cpp
AMDAIECombineStridedOps.cpp
AMDAIEDmaComposition.cpp
AMDAIEDmaCSE.cpp
AMDAIEDmaLoopSubsumption.cpp
AMDAIENormalizeLoopBounds.cpp
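
For concreteness, a hypothetical sketch of what such a combined pass could look like (the pass name and base class are illustrative and do not exist in the repo as-is; as the discussion below notes, whether these patterns belong together is contested):

```
namespace {
// Sketch: one pass that populates the DMA-related cleanup patterns into
// a single set and runs them greedily, like vanilla canonicalization.
struct AMDAIECanonicalizeDmaPass
    : public impl::AMDAIECanonicalizeDmaBase<AMDAIECanonicalizeDmaPass> {
  void runOnOperation() override {
    Operation *parentOp = getOperation();
    MLIRContext *context = &getContext();
    RewritePatternSet patterns(context);
    populateCanonicalizeDoublyStridedOpPatterns(patterns, false);
    populateStridedOpCombinationPattern(patterns);
    if (failed(applyPatternsAndFoldGreedily(parentOp, std::move(patterns))))
      return signalPassFailure();
  }
};
}  // namespace
```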

Contributor:

(obviously not for this PR, but maybe some refactoring for later)

Collaborator:

> AMDAIECombineStridedOps.cpp
> AMDAIEDmaComposition.cpp
> AMDAIEDmaCSE.cpp
> AMDAIEDmaLoopSubsumption.cpp
> AMDAIENormalizeLoopBounds.cpp

I don't think these passes are doing canonicalization.

> AMDAIECombineStridedOps.cpp
> AMDAIEDmaComposition.cpp
> AMDAIEDmaLoopSubsumption.cpp

These involve optimization-related transformations across multiple operations (multiple DMAs, or DMAs and loops), which is not canonicalization.

> AMDAIEDmaCSE.cpp

This is a CSE pass, not canonicalization.

> AMDAIENormalizeLoopBounds.cpp

What does this have to do with DMA ops?

yzhang93 (Contributor Author) commented Oct 8, 2024

> Applying the patterns one by one will always miss some composition opportunities. We should be populating all patterns and applying them together. This will be the most versatile and keep rewriting until all DMA composition opportunities are exhausted:
>
> RewritePatternSet patterns(context);
> populateDmaLoopSubsumptionPattern(patterns, std::move(deviceModel),
>                                   onlyZeroStrideOnOuterDim);
> populateStridedOpCombinationPattern(patterns);
> populateCanonicalizeDoublyStridedOpPatterns(patterns, false);

I think the reason it doesn't work as expected is that it generates the following IR, where %48 and %49 are not combinable.

%47 = amdaie.npu.dma_cpy_nd %7([] [] [], %30[0, %46, 0] [4, 64, 32] [32, 128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>
%48 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[0, %45] [32, 64] [128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>
amdaie.npu.dma_wait(%48, MM2S)
%49 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[32, %45] [96, 64] [128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>
amdaie.npu.dma_wait(%47, MM2S)
amdaie.npu.dma_wait(%49, MM2S)

But if we keep a specific order for those patterns, it can generate the following IR:
%48 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[0, 0, %45] [4, 32, 64] [4096, 128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>

Collaborator:

> %48 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[0, 0, %45] [4, 32, 64] [4096, 128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>

Isn't this the same as:

%48 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[0, %45] [128, 64] [128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>

I think you can adjust CombineStridedOps, so the following is combined into the result above:

%48 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[0, %45] [32, 64] [128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>
%49 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[32, %45] [96, 64] [128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>

Contributor:

I meant canonicalization in a general sense. Canonicalization in the strict sense should be done directly with op class methods. I don't want to spend time arguing abstract points for which I have no evidence, but my intuition says it's good to have lots of lowering passes and few general-purpose canonicalization/CSE/normalization passes which can be run anywhere.

Contributor Author:

> I think you can adjust CombineStridedOps, so the following is combined into the result above:
>
> %48 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[0, %45] [32, 64] [128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>
> %49 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[32, %45] [96, 64] [128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>

Are you sure this is combinable in the general case, beyond what we already know from this example? It is not consistent with the current logic, which inserts a new dimension when two NPU DMA ops have the same number of addressing dimensions.

Collaborator:

> I think you can adjust CombineStridedOps, so the following is combined into the result above:
>
> %48 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[0, %45] [32, 64] [128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>
> %49 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[32, %45] [96, 64] [128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>

> Are you sure this is combinable in the general case, beyond what we already know from this example?

Yes, because the above two DMA ops are just an extension of each other, so no new dimension is needed. This is an edge case which is just not handled currently.
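
For concreteness, a hedged sketch of the check this edge case needs (function and variable names are hypothetical, not the actual CombineStridedOps code): two access patterns are extensions of each other when their strides and all-but-outermost offsets/sizes match and the second op's outermost offset starts exactly where the first op ends, in which case the outermost sizes can simply be summed.

```
#include <cstdint>
#include <vector>

// Hypothetical helper: returns true and writes the combined outer size
// if pattern B is a contiguous extension of pattern A along the
// outermost dimension. Offsets/sizes/strides are per dimension,
// outermost first. Real DMA ops also carry dynamic offsets (e.g. %45
// above); this sketch only handles the static case.
bool isContiguousExtension(const std::vector<int64_t> &offsetsA,
                           const std::vector<int64_t> &sizesA,
                           const std::vector<int64_t> &stridesA,
                           const std::vector<int64_t> &offsetsB,
                           const std::vector<int64_t> &sizesB,
                           const std::vector<int64_t> &stridesB,
                           int64_t &combinedOuterSize) {
  size_t rank = offsetsA.size();
  if (offsetsB.size() != rank || stridesA != stridesB) return false;
  // All inner dimensions must match exactly.
  for (size_t i = 1; i < rank; ++i)
    if (offsetsA[i] != offsetsB[i] || sizesA[i] != sizesB[i]) return false;
  // B must start on the row right after A's last row: in the example
  // above, A covers rows [0, 32) and B covers rows [32, 128).
  if (offsetsB[0] != offsetsA[0] + sizesA[0]) return false;
  combinedOuterSize = sizesA[0] + sizesB[0];  // 32 + 96 = 128
  return true;
}
```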

jtuyls (Collaborator) commented Oct 9, 2024

> I meant canonicalization in a general sense. Canonicalization in the strict sense should be done directly with op class methods.

In that sense, you could argue that every lowering pass is performing a canonicalization, such that the canonical form is the most efficiently lowered form for the given hardware, and therefore you could put the whole codegen inside a single canonicalization pass. That defeats the point of having lowering passes, so I don't think we should look at it that way; unless we have a very good reason to deviate, we should stick with the definition and use of canonicalization within MLIR: https://mlir.llvm.org/docs/Canonicalization/.

yzhang93 added a commit that referenced this pull request Oct 9, 2024
This PR adds a corner case for strided op combination. With this PR, the
following strided ops:

```
%48 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[0, %45] [32, 64] [128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>
%49 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[32, %45] [96, 64] [128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>
```

can be combined as

```
%48 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[0, %45] [128, 64] [128, 1]) : source_type = !amdaie.logicalobjectfifo<memref<16384xi32>>
```

Addressed review comments
#826 (comment).
yzhang93 force-pushed the relax_circular_dma_condition branch from dfccad3 to 9d10fd7 on October 9, 2024, 22:24
yzhang93 force-pushed the relax_circular_dma_condition branch from 9d10fd7 to 099513d on October 9, 2024, 23:09
passManager.addPass(createAMDAIEConvertToDmaPass());
// For matmul pipelines, we do transpose on target side for pack ops to get
// better performance. While for convolution pipelines, the same settings
// cause 'aie.dma_bd' error, so for now keep using transpose on source for
Collaborator:

What's the exact error?

Contributor Author:

The error is `'aie.dma_bd' op Cannot give more than 3 dimensions for step sizes and wraps in this tile (got 4 dimensions)`. The full IR dump is at https://gist.github.com/yzhang93/fd34425cc223eccbd233c7d1051f2d63.

jtuyls (Collaborator) left a comment:

LGTM

yzhang93 merged commit a533e7d into nod-ai:main on Oct 10, 2024 · 6 checks passed