Commit

Updated documentation for all convolution-based designs (#1399)
Co-authored-by: Joseph Melber <jgmelber@gmail.com>
Co-authored-by: Kristof Denolf <kristof.denolf@amd.com>
3 people authored Apr 25, 2024
1 parent 4039c7d commit 04fce19
Showing 12 changed files with 140 additions and 209 deletions.
103 changes: 37 additions & 66 deletions programming_examples/ml/bottleneck/README.md

Large diffs are not rendered by default.

1 change: 0 additions & 1 deletion programming_examples/ml/bottleneck/requirements.txt

This file was deleted.

71 changes: 48 additions & 23 deletions programming_examples/ml/conv2d/README.md
@@ -10,43 +10,68 @@

# <ins>Convolution 2D</ins>
## Introduction
Convolution is a crucial part of various machine learning and computer vision tasks, such as image recognition, object detection, and image segmentation. This README provides instructions for implementing convolution on a single AI Engine (AIE) core with 8-bit precision.

At its core, convolution is a mathematical operation that combines an input tensor and a filter to produce an output tensor. The input tensor is a multi-dimensional matrix with width, height, and channel dimensions. The filter is also represented as a multi-dimensional matrix with filter height, width, input channels, and output channels (the input-channel count matches that of the input tensor). The filter is systematically applied to different regions of the input tensor. At each step, the filter is element-wise multiplied by the overlapping region of the input tensor, and the element-wise products are summed to produce a single value, representing the result of the convolution for that region. This process is repeated for all possible regions of the input tensor, producing an output tensor called the feature map.

The process of applying the filter to different regions of the input tensor is often visualized as a sliding window moving across the input data. The size of the sliding window corresponds to the size of the filter, and it moves with a certain stride (the number of pixels it shifts at each step). The convolution operation consists of seven nested loops, iterating over the input height, input width, input channels, output channels, filter height, filter width, and batch size, each loop corresponding to a different aspect of the operation. This systematic process extracts features from the input tensor, yielding the output feature map. In this design, we vectorize a two-dimensional convolution with a 1x1 filter size.
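
For reference, here is a minimal, unvectorized Python/NumPy sketch of those seven loops. The stride of 1, absence of padding, and NHWC/OIHW tensor orders are assumptions of this illustration, not a description of the AIE kernel itself:

```python
import numpy as np

def conv2d_reference(act, wgt):
    """Naive 2D convolution: stride 1, no padding, NHWC activations, OIHW weights."""
    N, H, W, Cin = act.shape        # batch, input height, input width, input channels
    Cout, _, KH, KW = wgt.shape     # output channels, input channels, filter height, filter width
    out = np.zeros((N, H - KH + 1, W - KW + 1, Cout), dtype=np.int32)  # wide accumulator
    for n in range(N):                              # batch size
        for oy in range(out.shape[1]):              # input/output height
            for ox in range(out.shape[2]):          # input/output width
                for ic in range(Cin):               # input channels
                    for oc in range(Cout):          # output channels
                        for ky in range(KH):        # filter height
                            for kx in range(KW):    # filter width
                                # element-wise multiply and accumulate for one output value
                                out[n, oy, ox, oc] += int(act[n, oy + ky, ox + kx, ic]) * int(wgt[oc, ic, ky, kx])
    return out
```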

## Source Files Overview

```
.
+-- act_layout.png # Figure describing input/output data layout.
+-- aie2.py # A Python script that defines the AIE array structural design using MLIR-AIE operations.
+-- Makefile # Contains instructions for building and compiling software projects.
+-- README.md # This file.
+-- run.lit # For LLVM Integrated Tester (LIT) of the design.
+-- test.py # Python code testbench for the design example.
```

## NPU Implementation
1. Kernel Optimization: To optimize convolution operations on AIE, we vectorize the code using AIE vector intrinsics. We load 8 elements of the input channel into vector registers using the vector load intrinsic and perform the convolution using vector MAC/MUL operations on this loaded data. We implement zero-padding to handle boundary conditions and ensure accurate convolution results, particularly at the edges of feature maps. The input to each vector operation is a 4x8 matrix corresponding to 4 elements of a row and 8 input channels (a rough illustration follows this list).

2. Quantization: We use `int8` precision for activations and weights. At `int8` precision, AIE offers the highest compute density, with 256 MACs/cycle.

3. Data Layout: We optimize activation and weight layout to enhance memory access patterns and enable effective utilization of AIE parallel processing units, ultimately improving the performance of 2D convolution operations.
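
As a rough illustration of points 1 and 2, the following NumPy sketch emulates one vectorized int8 multiply-accumulate step for the 1x1 case. The tile shapes and the requantization scale are assumptions made for the example, not values taken from the kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumed tile shapes for one vector step of the 1x1 convolution:
# 4 row pixels x 8 input channels of int8 activations, and an 8x8 int8
# weight tile mapping 8 input channels to 8 output channels.
act_tile = rng.integers(-128, 128, size=(4, 8), dtype=np.int8)
wgt_tile = rng.integers(-128, 128, size=(8, 8), dtype=np.int8)

# Multiply-accumulate in a wide (int32) accumulator: each output element
# sums 8 int8 x int8 products over the input channels.
acc = act_tile.astype(np.int32) @ wgt_tile.astype(np.int32)   # (4, 8): 4 pixels x 8 output channels

# Requantize the accumulator back to int8 with an assumed example scale.
scale = 2 ** -6
out_tile = np.clip(np.round(acc * scale), -128, 127).astype(np.int8)
```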

## Data Layout
We must ensure that the data layout is compatible with efficient SIMD processing and rearrange the input data into a format where contiguous elements represent consecutive X-dimension values for each channel. We adopt a channels-last memory ordering, denoted as Y{C/8}X{C8}, to exploit channel parallelism by ensuring that channels become the densest dimension. Operating on 8 elements at a time, we process 8 channels of the same width position simultaneously. We then traverse the entire width dimension, handling the remaining channels in batches of 8, and this process continues row-wise, resulting in our final data layout pattern: Y{C/8}X{C8}. This optimized layout enhances memory access patterns and enables effective utilization of parallel processing units, ensuring that data can be efficiently loaded into SIMD registers and processed in parallel.

The figure below shows our channel-parallel data layout (Y{C/8}X{C8}) for a tensor of dimension 8x8x16:

<p align="center">
<picture>
<source media="(prefers-color-scheme: light)" srcset="act_layout.png">
<img alt="block" src="act_layout.png" >
</picture>
<h3 align="center">Channel parallel data layout for activations. An AIE core processes 8 channels in parallel per vector operation.
</h3>
</p>


In the Y{C/8}X{C8} (with N=1) data layout, the data is organized in memory as follows:

* Y: Represents the output feature map dimension.
* C/8: Denotes the number of 8-channel groups.
* X: Represents the input feature map dimension.
* C8: Indicates that 8 elements of the input channel are stored contiguously and processed together.
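
As a reference for this rearrangement, a minimal NumPy sketch might look like the following. The assumption that the activations start in row-major (Y, X, C) order is made for illustration only:

```python
import numpy as np

# Assumed starting point: an 8x8x16 activation tensor in row-major (Y, X, C) order,
# matching the 8x8x16 example in the figure above.
Y, X, C = 8, 8, 16
act_yxc = np.arange(Y * X * C, dtype=np.int32).reshape(Y, X, C)

# Rearrange to Y{C/8}X{C8}: split the channels into groups of 8 and make the
# 8-wide channel group the innermost (densest) dimension.
act_ycxc8 = act_yxc.reshape(Y, X, C // 8, 8).transpose(0, 2, 1, 3)
print(act_ycxc8.shape)  # (8, 2, 8, 8) -> (Y, C/8, X, C8)
```
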
{O/8}{I/8}YX{I8}{O8} Weight Layout:

We align the weight layout as O/8, I/8, Y, X, I8, O8 to match the input tensor processing. We first load the weight tensor and reorganize it into this layout, where the dimensions represent output channel groups, input channel groups, kernel height, kernel width, input channels in groups of 8, and output channels in groups of 8. By aligning the weight layout in this manner, we enable seamless integration with the input data layout, maximizing parallelism and minimizing memory access overhead.
In the {O/8}{I/8}YX{I8}{O8} data layout, the data is organized in memory as follows:

* O/8: Denotes the number of 8-output-channel groups.
* I/8: Denotes the number of 8-input-channel groups.
* Y: Represents the kernel height.
* X: Represents the kernel width.
* I8: Indicates that 8 elements of the input channel are processed together.
* O8: Indicates that 8 elements of the output channel are processed together.
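
Similarly, a minimal NumPy sketch of the weight rearrangement, again assuming an (O, I, Y, X) starting order purely for illustration, could be:

```python
import numpy as np

# Assumed starting point: 1x1 convolution weights for 16 input and 16 output
# channels, stored in (O, I, Y, X) order.
OC, IC, KY, KX = 16, 16, 1, 1
wgt_oiyx = np.arange(OC * IC * KY * KX, dtype=np.int32).reshape(OC, IC, KY, KX)

# Rearrange to {O/8}{I/8}YX{I8}{O8}: split both channel dimensions into groups
# of 8 and move the 8-wide groups innermost, with output channels densest.
wgt_tiled = (
    wgt_oiyx.reshape(OC // 8, 8, IC // 8, 8, KY, KX)   # (O/8, O8, I/8, I8, Y, X)
            .transpose(0, 2, 4, 5, 3, 1)               # (O/8, I/8, Y, X, I8, O8)
)
print(wgt_tiled.shape)  # (2, 2, 1, 1, 8, 8)
```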

## Compilation
To compile the design:
Binary file added programming_examples/ml/conv2d/act_layout.png
1 change: 0 additions & 1 deletion programming_examples/ml/conv2d/requirements.txt

This file was deleted.

