
Towards vectorized convolution (first PR) #864

Open
wants to merge 3 commits into
base: main

Conversation

newling
Contributor

@newling newling commented Oct 29, 2024

This PR does 2 things:

  1. Moves aievec lowering before scf-to-cf, because computing the alignment information needed to support convolution vectorization is more difficult in cf than in scf.

  2. Makes the flattening of transfer_read ops more "aggressive". This involves a large copy-and-paste from upstream MLIR, sorry. Read on for more info.

The current convolution workflow ends up with core code to load from the input image that looks like:

scf.for %arg1 = %c0 to %c3 step %c1 {
  scf.for %arg2 = %c0 to %c3 step %c1 {
    scf.for %arg3 = %c0 to %c4 step %c1 {
      %0 = vector.transfer_read 
         %reinterpret_cast_54[%c0, %arg1, %arg3, %arg2, %c0], %cst 
         {in_bounds = [true, true]} : memref<1x3x4x6x8xbf16>, vector<4x8xbf16>
      ... 
    }
  }
}

What is the alignment of the transfer_read above? In other words, what is the largest power of 2 that divides the byte offset of the transfer_read for all values of %arg1, %arg2, and %arg3? It is 16 bytes (consider %arg2 = 1).
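The alignment claim can be checked with a short sketch (not from the PR): enumerate the byte offset for every loop-index combination and take the gcd. The strides below follow from the row-major layout of memref<1x3x4x6x8xbf16>.

```python
from math import gcd

# Row-major strides of memref<1x3x4x6x8xbf16>, in elements: [576, 192, 48, 8, 1].
# The read indexes [0, %arg1, %arg3, %arg2, 0], and bf16 is 2 bytes per element.
def byte_offset(arg1, arg2, arg3):
    elem_offset = arg1 * 192 + arg3 * 48 + arg2 * 8
    return elem_offset * 2  # elements -> bytes

offsets = [byte_offset(a1, a2, a3)
           for a1 in range(3) for a2 in range(3) for a3 in range(4)]
alignment = gcd(*(o for o in offsets if o != 0))
print(alignment)  # 16
```

The gcd of all non-zero offsets is 16, confirming the 16-byte alignment stated above.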

This small alignment is a problem for the AIE instruction set: without a change to the IR, the transfer_read lowers to inefficient scalar code. Actually, it's currently even worse than that -- it isn't correctly scalarized, and we see numerical errors for basic convolutions unless we disable vectorization (see the discussion in the peano Slack channel). We need 32 byte alignment.

The solution that the aievec dialect/project has hit upon is implemented in the following pattern: https://github.com/Xilinx/mlir-aie/blob/9fe5fb5386dbf087aca9bfba3815cd5bfa56d80d/lib/Dialect/AIEVec/Transforms/VectorToVectorConversions.cpp#L119

The pattern converts an unaligned transfer_read into an aligned transfer_read of twice the length, followed by a vector.extract_strided_slice operation. For our convolution example, we therefore want to transfer_read a vector of 64 bf16 elements, and then extract the 32 bf16 elements that we actually want.
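As a toy illustration of that pattern (a sketch, not the actual aievec lowering), an unaligned read can be replaced by an aligned read of twice the length followed by an extract of the wanted elements:

```python
# Pretend tile memory, one element per slot; the wanted 32-element read
# starts at an unaligned offset (40 is not a multiple of 32).
data = list(range(128))
want_start, want_len = 40, 32
ALIGN = 32

# Round the start down to the alignment and read twice the wanted length.
aligned_start = (want_start // ALIGN) * ALIGN              # 32
wide = data[aligned_start : aligned_start + 2 * want_len]  # 64-element aligned read

# Mimic vector.extract_strided_slice: pull the wanted 32 elements
# out of the wide vector.
shift = want_start - aligned_start
narrow = wide[shift : shift + want_len]
assert narrow == data[want_start : want_start + want_len]
```

The aligned read over-fetches, and the extract recovers exactly the elements the original unaligned read would have produced.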

Clearly, a transfer_read of 64 elements from something of type memref<1x3x4x6x8xbf16> is not possible (because 6*8 = 48, which does not divide 64). We need some flattening. After running the upstream FlattenContiguousRowMajorTransfer pass, the memref is flattened as follows:

scf.for %arg1 = %c0 to %c3 step %c1 {
  scf.for %arg2 = %c0 to %c3 step %c1 {
    scf.for %arg3 = %c0 to %c4 step %c1 {
      %collapse_shape = memref.collapse_shape %reinterpret_cast_54 [[0], [1], [2], [3, 4]] : memref<1x3x4x6x8xbf16> into memref<1x3x4x48xbf16>
      %0 = affine.apply affine_map<()[s0] -> (s0 * 8)>()[%arg2]
      %1 = vector.transfer_read %collapse_shape[%c0, %arg1, %arg3, %0], %cst {in_bounds = [true]} : memref<1x3x4x48xbf16>, vector<32xbf16>
      ...
    }
  }
}

but this is still insufficient, as the innermost dimension 48 still does not divide 64. This PR therefore makes the flattening more aggressive, so that we get IR like:

scf.for %arg1 = %c0 to %c3 step %c1 {
  scf.for %arg2 = %c0 to %c3 step %c1 {
    scf.for %arg3 = %c0 to %c4 step %c1 {
      %collapse_shape = memref.collapse_shape %reinterpret_cast_51 [[0, 1, 2, 3, 4]] : memref<1x3x4x6x8xbf16> into memref<576xbf16>
      %0 = affine.apply affine_map<()[s0, s1, s2] -> (s0 * 192 + s1 * 48 + s2 * 8)>()[%arg1, %arg3, %arg2]
      %1 = vector.transfer_read %collapse_shape[%0], %cst {in_bounds = [true]} : memref<576xbf16>, vector<32xbf16>
    }
  }
}

With IR like this we will be able to perform the aievec trick linked to above (future PR).
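A quick sketch (not from the PR) confirms that the affine map produced by the aggressive flattening, (s0, s1, s2) -> s0*192 + s1*48 + s2*8, reproduces the original 5-d linearized offset into memref<1x3x4x6x8xbf16>:

```python
# Row-major strides of memref<1x3x4x6x8xbf16>, in elements.
strides = (576, 192, 48, 8, 1)

def offset_5d(arg1, arg2, arg3):
    # Original indexing: [0, %arg1, %arg3, %arg2, 0].
    idx = (0, arg1, arg3, arg2, 0)
    return sum(i * s for i, s in zip(idx, strides))

def offset_flat(arg1, arg2, arg3):
    # The affine map applied after collapsing to memref<576xbf16>.
    return arg1 * 192 + arg3 * 48 + arg2 * 8

# Check agreement over the full loop nest.
ok = all(offset_5d(a1, a2, a3) == offset_flat(a1, a2, a3)
         for a1 in range(3) for a2 in range(3) for a3 in range(4))
print(ok)  # True
```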

@jtuyls
Collaborator

jtuyls commented Oct 29, 2024

We need 64 byte alignment.

The memory loads are 256 bit == 32 byte, so why do we need 64 byte alignment? I read the peano slack, but this still isn't clear to me.

@newling
Contributor Author

newling commented Oct 29, 2024

We need 64 byte alignment.

The memory loads are 256 bit == 32 byte, so why do we need 64 byte alignment? I read the peano slack, but this still isn't clear to me.

You're right, we need 32 byte alignment. I've updated the 2 incorrect characters in the summary; it doesn't change the reasoning. We still want to transfer_read 64 bytes and then extract the 32 bytes we want from those.

Let me try and explain with a toy model... consider 8 bytes in tile memory:

01234567

from which we want to put bytes 12 into a register. Suppose that the hardware constrains us to start transfers from memory to registers at even byte addresses. The aievec trick is to transfer_read 0123 into a (larger) register and then, in a subsequent step, extract the 12 into a smaller register. The instructions for this second step at the HW level are:

  1. two extracts (0123 -> 01 and 23)
    https://www.xilinx.com/htmldocs/xilinx2023_2/aiengine_ml_intrinsics/intrinsics/group__intr__gpvectorconv__elem.html
  2. one shift (concatenates the top bits from 01 and the bottom bits from 23)
    https://www.xilinx.com/htmldocs/xilinx2023_2/aiengine_ml_intrinsics/intrinsics/group__intr__gpvectorop__shift.html
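The toy model above can be sketched in a few lines (illustration only, using Python byte slices to stand in for registers):

```python
mem = b"01234567"  # toy tile memory

# Step 1: one aligned load of twice the width we want
# (loads may only start at even byte addresses).
wide = mem[0:4]                  # b"0123"

# Step 2a: two "extracts", splitting the wide register into halves.
lo, hi = wide[0:2], wide[2:4]    # b"01" and b"23"

# Step 2b: one "shift" concatenating the top byte of lo
# with the bottom byte of hi.
result = lo[1:] + hi[:1]
print(result)  # b"12"
```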

@jtuyls
Copy link
Collaborator

jtuyls commented Oct 30, 2024


Thanks, that makes sense now.

@newling newling force-pushed the towards_vectorized_convolution branch from 8f0db8a to 4107fd4 on October 30, 2024 17:50