[TileAndFuse] Add thread groups for convolution ops #695
Conversation
IR for AIR lowering is below. The compilation fails in
@erwei-xilinx is this something that can be fixed (assuming the IR below looks valid)? Let me know if you'd like the full IR trace. FWIW, this allocation of blocks and threads to convolution works ok with the objectFifo pipeline (which is currently running out of BDs in a late mlir-aie pass).
The suggestion from @erwei-xilinx to run canonicalization to reduce the 4-d scf.parallel to a 2-d scf.parallel works perfectly. We wonder why canonicalization doesn't do the same for scf.forall.
Force-pushed from 9ba7444 to 46bb421.
This is a good starting point. I suggest putting this in a separate utility function.
uint32_t nbIndVars = std::count_if(tileSizesVal.begin(), tileSizesVal.end(),
                                   [](int64_t t) { return t != 0; });
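A minimal sketch of the separate utility the review suggests, assuming the pass keeps the tile sizes in a container like the `tileSizesVal` above (the function name `countNonZeroTileSizes` is hypothetical, and `std::vector` stands in for whatever container the real code uses):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper: count the non-zero tile sizes, i.e. the number of
// induction variables the tiled loop nest will have.
static uint32_t countNonZeroTileSizes(const std::vector<int64_t> &tileSizesVal) {
  return static_cast<uint32_t>(
      std::count_if(tileSizesVal.begin(), tileSizesVal.end(),
                    [](int64_t t) { return t != 0; }));
}
```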
// See mlir::gpu::MappingId enum: there are currently 13 values.
Maybe good to include the reference https://github.com/llvm/llvm-project/blob/e8063702cfbbf39f0b92283d0588dee264b5eb2b/mlir/include/mlir/Dialect/GPU/IR/GPUDeviceMappingAttr.td#L37.
if (i == 0)
  mapping.push_back(getMappingAttributeForDimension(1));
else if (i == 1)
  mapping.push_back(getMappingAttributeForDimension(0));
else
  mapping.push_back(getMappingAttributeForDimension(i));
This might be problematic if we have a batch dimension. Typically, we should assign DimZ to the batch dimension.
Force-pushed from 57ae4ac to 38c7885.
@@ -32,7 +32,25 @@ func.func @conv_2d_nhwc_hwcf(%arg0: tensor<2x14x14x32xbf16>, %arg1: tensor<3x3x3
// TILE-LEVEL-0-SAME: {
// TILE-LEVEL-0: linalg.fill
// TILE-LEVEL-0: linalg.conv_2d_nhwc_hwcf
// TILE-LEVEL-0: }
// TILE-LEVEL-0: } {mapping = [#gpu.block<y>, #gpu.block<x>, #gpu.block<z>]}
In my opinion, the order of the mapping attributes for blocks and threads should correspond to each other. Although we are not using the block attributes at the moment, it's good to keep the attributes in the same order.
Yeah, so the logic is actually identical for threads and blocks. What's happening in this example is that the tiling sizes at thread and block levels are different:
For blocks: [0, 4, 4, 4, 0, 0, 0]
For threads: [1, 1, 4, 4, 0, 0, 0]
The logic implemented in this PR works as follows:
First, assign attributes to dimensions with tile size greater than 1. For threads, that is dimensions 2 and 3.
Second, assign attributes to dimensions with tile size equal to 1. For threads, that is dimensions 0 and 1.
The attributes are assigned in the order y, then x, then z, then linear_dim_0, linear_dim_1, etc.
For [1, 1, 4, 4, 0, 0, 0], after step 1 the assigned dimensions are [none, none, y, x, none, none, none], and after step 2 they are [z, linear_dim_0, y, x, none, none, none].
And that is why at the thread level we end up with [z, linear_dim_0, y, x].
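To make the two steps concrete, here is a small standalone sketch of the assignment order (illustrative only: the PR's actual code builds #gpu.block / #gpu.thread attributes via getMappingAttributeForDimension, whereas this sketch just uses strings, and assignMappingLabels is a made-up name):

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Sketch of the two-step assignment described above.
// Preference order: y, x, z, then the linear dims.
static std::vector<std::string>
assignMappingLabels(const std::vector<int64_t> &tileSizes) {
  const std::vector<std::string> order = {"y", "x", "z", "linear_dim_0",
                                          "linear_dim_1", "linear_dim_2"};
  std::vector<std::string> labels(tileSizes.size(), "none");
  size_t next = 0;
  // Step 1: dimensions with tile size > 1 get the first labels.
  for (size_t i = 0; i < tileSizes.size() && next < order.size(); ++i)
    if (tileSizes[i] > 1)
      labels[i] = order[next++];
  // Step 2: dimensions with tile size == 1 get the remaining labels.
  for (size_t i = 0; i < tileSizes.size() && next < order.size(); ++i)
    if (tileSizes[i] == 1)
      labels[i] = order[next++];
  return labels;
}

int main() {
  // Thread-level tile sizes from the example above.
  for (const auto &label : assignMappingLabels({1, 1, 4, 4, 0, 0, 0}))
    std::cout << label << " "; // prints: z linear_dim_0 y x none none none
  std::cout << "\n";
}
```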
Thanks for the explanation. Yeah, I can follow the steps to set up the mapping attributes. It's just not typical for the attributes to differ for the same dim between the block and thread mappings. I don't have a good solution other than hardcoding the dimensions. On the other hand, since the block attributes are not used anyway, maybe we could remove them?
I just tried removing block dimension tiling, but for small matmul examples where the block tile sizes are all 1 there's a problem: an scf.forall without tiling attributes gets canonicalized away when the iteration space is size 1, i.e. the block-level scf.forall gets removed completely. Ideally we'd be able to work without this outer scf.forall, but currently the passes aren't set up to handle that, I guess. So removing block dimensions for matmuls isn't something we can do immediately.
I could remove them for convolution, but we might have the same issue when we have small convolutions.
So yeah, not sure. Maybe, for now, it's ok to keep the block dimensions as they are for convolution?
I see... We should find a way to solve the scf.forall canonicalization problem. Could you add a TODO comment for the block mapping attributes?
Ok, I'll add a comment. Thanks for accepting :)
// TILE-LEVEL-1-SAME: {
// TILE-LEVEL-1: linalg.fill
// TILE-LEVEL-1: linalg.conv_2d_nhwc_hwcf
// TILE-LEVEL-1: } {mapping = [#gpu.thread<z>, #gpu.thread<linear_dim_0>, #gpu.thread<y>, #gpu.thread<x>]}
This looks good to me.
Force-pushed from 38c7885 to 282b2ff.
This works for now because the sizes of dimensions DimZ and LinearDim0 are 1 with our tiling strategy, and so DimY and DimX map to the rows and columns of the AIE array. Follow-up work: a pass to collapse scf.forall ops with more than 2 induction variables down to just 2. So instead of (i,j,k) in (2,3,5), for example, we could have (i,l) in (2,15) and then j = l/5, k = l%5.
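A plain C++ sketch of just the index arithmetic behind that proposed collapse (not an actual pass): iterating (i, l) over (2, 15) and recovering j and k with division and modulo visits exactly the same (i, j, k) points as iterating over (2, 3, 5) directly.

```cpp
#include <cstdio>

int main() {
  // Collapsed form: (i, l) in (2, 15) instead of (i, j, k) in (2, 3, 5).
  for (int i = 0; i < 2; ++i) {
    for (int l = 0; l < 15; ++l) {
      int j = l / 5; // recovers j in [0, 3)
      int k = l % 5; // recovers k in [0, 5)
      std::printf("i=%d j=%d k=%d\n", i, j, k);
    }
  }
  return 0;
}
```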