Towards vectorized convolution (first PR) #864
base: main
Conversation
The memory loads are 256 bit == 32 byte, so why do we need 64 byte alignment? I read the peano slack, but this still isn't clear to me.
compiler/plugins/target/AMD-AIE/aievec/VectorToVectorConversions.cpp
You're right, we need 32 byte alignment. I've fixed the two incorrect characters in the summary; it doesn't change the reasoning. We still want to transfer_read 64 bytes, and then extract the 32 bytes we want from those. Let me try and explain with a toy model... consider 8 bytes in tile memory, from which we want to extract a subset of bytes.
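To make the toy model concrete, here is a hedged sketch (not code from the PR): 8 bytes of tile memory where loads must start at 4-byte-aligned addresses, so an unaligned 4-byte read is done as one aligned double-width load plus an in-register extract.

```python
# Toy model (illustrative assumption, not the PR's actual code):
# 8 bytes of tile memory; hardware loads must be 4-byte aligned.
mem = list(range(8))  # tile memory holding bytes 0..7

def read4(offset):
    """Read 4 bytes from an arbitrary offset (0..4) using only an
    aligned, double-width (8-byte) load followed by an extract."""
    base = (offset // 4) * 4      # round the offset down to alignment
    wide = mem[base:base + 8]     # one aligned 8-byte load
    lo = offset - base            # position inside the wide vector
    return wide[lo:lo + 4]        # the "extract" step

print(read4(2))  # unaligned offset 2 -> bytes [2, 3, 4, 5]
```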
Thanks, that makes sense now.
Force-pushed from 8f0db8a to 4107fd4
This PR does 2 things:

1. Moves aievec lowering before scf-to-cf, because computing the alignment information needed to support convolution vectorization is more difficult in cf than in scf.
2. Makes flattening of `transfer_read` ops more "aggressive". This is a large copy-and-paste from upstream MLIR, sorry. Read on for more info.

The current convolution workflow ends up with core code to load from the input image that looks like:
What is the alignment of the transfer_read above? In other words, what is the largest power of 2 that divides the offset of the transfer_read for all values of %arg1, %arg2, and %arg3? It is 16 bytes (consider %arg2 = 1).
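The alignment claim can be checked numerically. The sketch below is hedged: the IR snippet is not shown above, so the mapping of %arg1/%arg2/%arg3 to memref dimensions and their trip counts are illustrative assumptions; only the memref shape and the 16-byte answer come from the text.

```python
# Hedged check of the alignment computation for memref<1x3x4x6x8xbf16>.
from functools import reduce
from math import gcd

# Row-major strides in elements for shape (1, 3, 4, 6, 8); bf16 = 2 bytes.
strides = [576, 192, 48, 8, 1]
strides_bytes = [s * 2 for s in strides]  # [1152, 384, 96, 16, 2]

# Assumed index-to-dimension mapping (the real IR is elided above):
# %arg1 sweeps the size-3 dim, %arg3 the size-4 dim, %arg2 the size-6 dim.
offsets = [
    a1 * strides_bytes[1] + a3 * strides_bytes[2] + a2 * strides_bytes[3]
    for a1 in range(3) for a3 in range(4) for a2 in range(6)
]

# The alignment is the largest value dividing every reachable byte offset.
alignment = reduce(gcd, offsets)
print(alignment)  # 16, achieved at %arg2 = 1
```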
This small alignment is a problem for the AIE instruction set, and without a change to the IR, the transfer_read will lower to inefficient, scalar code. Actually it's currently even worse than that -- it isn't correctly scalarized, and we see numerical errors for basic convolutions unless we disable vectorization (see the discussion in the peano Slack channel). We need 32 byte alignment.
The solution that the aievec dialect/project has hit upon is implemented in the following pattern: https://github.com/Xilinx/mlir-aie/blob/9fe5fb5386dbf087aca9bfba3815cd5bfa56d80d/lib/Dialect/AIEVec/Transforms/VectorToVectorConversions.cpp#L119
The pattern converts an unaligned transfer_read into an aligned transfer_read of twice the length, followed by a vector.extract_strided_slice operation. For our convolution example, we therefore want to transfer_read a vector of 64 bf16 elements, and then extract the 32 bf16 elements that we actually want.
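The effect of that pattern can be sketched as follows. This is a hedged model, not the pattern's implementation: plain integers stand in for bf16 elements, and the wide read is assumed to be aligned to the original 32-element vector length.

```python
import numpy as np

# Model of the aievec pattern's effect: an unaligned 32-element read
# becomes an aligned 64-element read plus a strided-slice extract.
VEC = 32                                  # elements we actually want
mem = np.arange(512, dtype=np.uint16)     # toy memory (uint16 stands in for bf16)

def read_unaligned(offset):
    base = (offset // VEC) * VEC          # round down to an aligned base
    wide = mem[base:base + 2 * VEC]       # aligned transfer_read of 64 elements
    lo = offset - base                    # start of the wanted slice
    return wide[lo:lo + VEC]              # vector.extract_strided_slice step

# An offset of 40 elements is not 32-aligned, yet the load itself is.
print(read_unaligned(40))  # same as mem[40:72]
```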
Clearly, a `transfer_read` of 64 elements from something of type `memref<1x3x4x6x8xbf16>` is not possible (because 6*8 = 48, which does not divide 64). We need some flattening. After running the upstream `FlattenContiguousRowMajorTransfer` pass, the memref is flattened as follows:

but this is still insufficient, as the innermost dimension 48 still does not divide 64. This PR therefore makes the flattening more aggressive, so that we get IR like:
With IR like this we will be able to perform the aievec trick linked to above (future PR).
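The arithmetic behind the flattening argument above can be checked with a short sketch (the shape and pass behavior are taken from the PR text; everything else is illustrative):

```python
from math import prod

shape = (1, 3, 4, 6, 8)   # memref<1x3x4x6x8xbf16>, sizes in elements
want = 64                 # length of the widened transfer_read, in elements

# Per the PR text, upstream FlattenContiguousRowMajorTransfer collapses
# the innermost contiguous dims, leaving an innermost dim of 6 * 8 = 48:
partial_inner = shape[3] * shape[4]
print(partial_inner)                 # 48; a 64-element read cannot fit

# The aggressive flattening in this PR collapses all dims into one:
full_inner = prod(shape)
print(full_inner)                    # 576; 64 divides 576, so the read fits
```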