diff --git a/programming_examples/ml/bottleneck/README.md b/programming_examples/ml/bottleneck/README.md
index 40a69e8576..b1a2229537 100644
--- a/programming_examples/ml/bottleneck/README.md
+++ b/programming_examples/ml/bottleneck/README.md
@@ -8,15 +8,15 @@
//
//===----------------------------------------------------------------------===//-->
-# The Bottleneck Block
+# Bottleneck Block
## Introduction
-The bottleneck block is a key component in deep neural network architectures, such as ResNet. It is designed to help address the challenge of training very deep networks by reducing the computational cost while maintaining or improving performance. This README provides an overview of the process and considerations for accelerating a single bottleneck block.
+The bottleneck block is a key component in deep neural network architectures like ResNet. It is designed to help address the challenge of training deep networks by reducing computational costs while maintaining or improving performance. This README provides an overview of the process and considerations for accelerating a bottleneck block on a single NPU column using four AI Engine (AIE) cores.
## Bottleneck Block Overview
The components and functionality of a standard bottleneck block:
-* Identity Mapping: The core idea behind bottleneck blocks is the concept of identity mapping. Traditional neural network layers aim to learn a mapping from input to output. In contrast, a bottleneck block learns a residual mapping, which is the difference between the input and the output. The original input is then added back to this residual mapping to obtain the final output. Mathematically, this can be represented as `output = input+ residual.`
+* Identity Mapping: The core idea behind bottleneck blocks is the concept of identity mapping. Traditional neural network layers aim to learn how to map from input to output. In contrast, a bottleneck block learns a residual mapping, which is the difference between the input and the output. The original input is then added to this residual mapping to obtain the final output. Mathematically, this can be represented as `output = input + residual` (see the sketch after this list).
* Convolutional Layers: Bottleneck blocks typically consist of one or more convolutional layers. These layers are responsible for learning features from the input data. Convolutional layers apply filters/kernels to the input feature maps to extract relevant patterns and features. The number of filters, kernel size, and other parameters can vary based on the specific architecture and requirements.
@@ -24,87 +24,58 @@ The components and functionality of a standard bottleneck block:
* Batch Normalization: Batch normalization is often employed after convolutional layers to stabilize and accelerate the training process. It normalizes the activations of each layer, making optimization more robust and efficient.
-* Skip Connection (Identity Shortcut): This is the hallmark of bottleneck blocks. The skip connection directly passes the input from one layer to a later layer without any modification. It provides an alternative, shorter path for gradient flow during training. If the input and output dimensions of the bottleneck block are the same, the skip connection directly adds the input to the output. If the dimensions differ, the skip connection might include a 1x1 convolutional layer to adjust the dimensions accordingly.
+* Skip Connection (Identity Shortcut): This is the hallmark of bottleneck blocks. The skip connection directly passes the input from one layer to a later layer without modification. It provides an alternative, shorter path for gradient flow during training. If the input and output dimensions of the bottleneck block are the same, the skip connection directly adds the input to the output. If the dimensions differ, the skip connection might include a 1x1 convolutional layer to adjust the dimensions accordingly.
* Final Output: The final output of the bottleneck block is obtained by adding the input to the output of the convolutional layers (including any adjustments made to match dimensions via the skip connection).
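+
+As a rough illustration of the identity-mapping idea, here is a NumPy sketch with a placeholder residual branch (the shapes and the `residual_branch` function are illustrative, not the design's actual kernels):
+
+```python
+import numpy as np
+
+# Toy residual branch standing in for the 1x1 -> 3x3 -> 1x1 convolution stack.
+def residual_branch(x):
+    return 0.1 * x  # placeholder computation
+
+x = np.random.randn(1, 32, 32, 64).astype(np.float32)  # NHWC activation (illustrative shape)
+out = x + residual_branch(x)  # output = input + residual (identity shortcut)
+```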
-
-## Acceleration Techniques
-1. Depth-First Implementation: Spatial architectures provide coarse-grained flexibility that allows for tailoring of the dataflow to optimize data movement. By tailoring the dataflow, we implement depth-first schedule for a bottleneck block routing the output of one convolutional operation on an AIE core directly to another convolutional operation on a separate AIE core, all without the need to transfer intermediate results off-chip. This approach effectively minimizes the memory footprint associated with intermediate data, mitigating the overhead of costly off-chip accesses leading to increase in the overall performance.
+## Source Files Overview
-2. Data Layout: Optimize activation and weight layout to enhance memory access patterns and enables effective utilization of AIE parallel processing units, ultimately improving the performance of 2D convolution operations.
-
-3. Kernel Optimzation: To optimize convolution operations on AIE, we vectorize the code using AIE vector intrinsics. We load 8 elements of the input channel into vector registers using vector load intrinsic. We apply the convolution operation on this loaded data, utilizing for enhanced computational efficiency. To ensure accurate convolution results, particularly at the edges of feature maps, we implement zero-padding to handle boundary conditions. This comprehensive approach optimizes convolution processing on AIE, facilitating efficient and accurate feature extraction in neural network applications. Input is 4x8 matrix corresponding to 4 element of row and 8 input channels.
-
-4. Quantization: We use int8 precision for activationa and weights. At int8 precision, AIE offers the highest compute density with 256 MAC/cycle.
-
-5. Layer Fused: We perform two levels of fusion. First, we fuse ReLU in convolution using SRS capabilities of AIE. Second, we fuse BatchNorm into convolution weights.
-
-
-
-## Data Layout
-We need to ensure that the data layout is compatible with efficient SIMD processing and rearrange the input data into a format where contiguous elements represent consecutive X-dimension values for each channel. For more efficient processing, we adopt a channels-last memory ordering, denoted as NYCXC8, to ensure that channels become the densest dimension. Operating on 8 elements simultaneously, we process 8 channels with the same width at once. Subsequently, we traverse the entire width dimension, handling the remaining channels in batches of 8. This process continues row-wise, resulting in our final data layout pattern: NYCXC8. This optimized layout enhances memory access patterns and enables effective utilization of parallel processing units, ultimately improving the performance of 2D convolution operations. This transformation ensures that data can be efficiently loaded into SIMD registers and processed in parallel.
-
-YCXC8 Input/Output Data Layout:
-
-In the YCXC8 (with N=1) data layout, the data is organized in memory as follows:
-
-* Y: Represents the output feature map dimension.
-* C: Denotes the number of channels.
-* X: Represents the input feature map dimension.
-* C8: Indicates that 8 elements of the input channel are processed together.
-
-OIYXI8O8 Weight Layout:
-
-We align the weight layout as specified: O,I,Y,X,I8,O8, to match the input image processing. We first load the weight tensor, organizing it to match this layout, where dimensions represent: output channels, input channels, kernel height, kernel width, input channel groups of 8, and output channel groups of 8. By aligning the weight layout in this manner, we enable seamless integration with the input data layout, maximizing parallelism and minimizing memory access overhead.
-
-In the OIYXI8O8 data layout, the data is organized in memory as follows:
-
-* O: Denotes the number of output channels.
-* I: Denotes the number of input channels.
-* Y: Represents the kernel height.
-* X: Represents the kernel weight.
-* I8: Indicates that 8 elements of the input channel are processed together.
-* O8: Indicates that 8 elements of the output channel are processed together.
+```
+.
++-- aie2.py # A Python script that defines the AIE array structural design using MLIR-AIE operations.
++-- bottleneck_block.png # Figure describing the layers in the bottleneck block after fusing ReLU and batch norm into the convolution layer.
++-- bottleneck_pipeline.png # Figure describing our implementation of the bottleneck block on a single NPU column.
++-- Makefile # Contains instructions for building and compiling software projects.
++-- README.md # This file.
++-- run.lit # For LLVM Integrated Tester (LIT) of the design.
++-- test.py # Python code testbench for the design example.
+```
-## Fusing Convolution and Batch Normalization
+## NPU Implementation
-We assume the BatchNorm layer is fused into Convoluion Layer. Fusing BatchNorm into convolution involves incorporating the normalization step directly into the convolution operation. This is achieved by modifying the weights of the convolutional filters to include the scaling and shifting factors. Specifically, the weights are adjusted such that the convolution operation performs the normalization, scaling, and shifting in a single step.
+We map a bottleneck block onto a single column of the NPU in a depth-first manner, where the output of one convolutional operation on an AIE core is sent directly to another convolutional operation on a separate AIE core, all without the need to transfer intermediate results off-chip.
+In our bottleneck pipeline implementation, every adjacent ReLU operation is fused into the convolution operation using the approach described in [conv2d_fused_relu](../conv2d_fused_relu). Fusing adjacent convolution and batch norm layers is another inference-time optimization, which involves updating the weight and bias of the convolution layer (a folding sketch follows the figure below). The remaining layers of the bottleneck block are mapped onto a single column of the NPU with one `Shim Tile (0,0)` and one `Mem Tile (0,1)`, along with four AIE compute tiles spanning from (0,2) to (0,5), as illustrated in the figure below.
-## Fusing ReLU
+
+
+
Depth-first implementation of bottleneck block pipeline on a single column of NPU.
+
+
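+
+The batch norm folding mentioned above follows the standard inference-time formulation; below is a minimal NumPy sketch (the function and parameter names are ours, not from this design's source):
+
+```python
+import numpy as np
+
+def fold_batchnorm(weight, bias, gamma, beta, mean, var, eps=1e-5):
+    """Fold an inference-time BatchNorm into the preceding convolution.
+
+    weight: (O, I, kH, kW) convolution weights; bias: (O,).
+    gamma, beta, mean, var: per-output-channel BatchNorm parameters, shape (O,).
+    """
+    scale = gamma / np.sqrt(var + eps)              # per-output-channel scale
+    w_folded = weight * scale[:, None, None, None]  # scale each output filter
+    b_folded = (bias - mean) * scale + beta         # shift the bias
+    return w_folded, b_folded
+```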
-Fusing ReLU into the convolution operation can further optimize the implementation by reducing memory bandwidth requirements and computational overhead. ReLU activation function introduces non-linearity by setting negative values to zero and leaving positive values unchanged. Utilize SIMD instructions to efficiently compute ReLU activation in parallel with convolution. After performing the convolution operation, apply ReLU activation function at vector register level.
-We use `aie::set_rounding()` and `aie::set_saturation()` to set the rounding and saturation modes for the computed results in the accumulator. Seeting round mode `postitive_inf` rounds halfway towards positive infinity while setting saturation to `aie::saturation_mode::saturate` saturation rounds an uint8 range (0, 255).
+The data movement within this pipeline is orchestrated using the ObjectFifo (OF) primitive. Initially, the input activation is brought into the array via `Shim Tile (0,0)`. We broadcast the data to both `AIE (0,2)` and `AIE (0,4)` (the latter via `Mem Tile (0,1)`) to perform the very first convolution and the skip-addition operation in the bottleneck block, respectively. Since `AIE (0,4)` must await additional data from other kernels before proceeding with its execution, buffering the data for tile (0,4) within `Mem Tile (0,1)` is imperative to prevent any stalls in the broadcast. Due to the data's size, direct buffering in the smaller L1 memory module of `AIE (0,4)` is impractical. Therefore, we require two OFs: one broadcasting to tile (0,2) and the Mem tile, and another moving data between the Mem tile and tile (0,4). These two OFs are linked so that data from the first OF is implicitly copied to the second OF through the Mem tile's DMA.
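+
+The two linked OFs can be sketched as follows (modeled on this design's `aie2.py`; the tile variables, FIFO names, and the `act_ty` memref type are illustrative, and exact IRON API signatures vary across mlir-aie versions):
+
+```python
+from aie.dialects.aie import *  # tile, object_fifo, object_fifo_link, ...
+
+# ... inside the @device body, with act_ty an appropriate memref type ...
+ShimTile, MemTile = tile(0, 0), tile(0, 1)
+ComputeTile2, ComputeTile4 = tile(0, 2), tile(0, 4)
+
+# OF 1: bring the input activation in and broadcast it to the first conv core
+# (0,2) and to the Mem tile, which buffers the skip-path copy.
+act_in = object_fifo("act_in", ShimTile, [ComputeTile2, MemTile], 2, act_ty)
+
+# OF 2: forward the buffered copy from the Mem tile to the skip-add core (0,4).
+skip_buf = object_fifo("skip_buf", MemTile, ComputeTile4, 2, act_ty)
+
+# Linking the OFs tells the compiler to implicitly copy data from act_in to
+# skip_buf through the Mem tile's DMA.
+object_fifo_link(act_in, skip_buf)
+```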
-```
-::aie::set_saturation(
- aie::saturation_mode::saturate); // Needed to saturate properly to uint8
-::aie::set_rounding(
- aie::rounding_mode::positive_inf); // Needed to saturate properly to uint8
-```
-After convolution and ReLU fusion, the output data is generate in YCXC8 layout. Ensure that the output data layout is compatible with subsequent layers or processing steps in the neural network architecture.
+Starting from `AIE (0,2)`, data is processed by each compute tile, with the intermediate activations forwarded to the subsequent tile. `AIE (0,2)` handles the 1x1 convolution with the fused ReLU operation. Based on our hand analysis, we partition the 3x3 convolution across two cores, `AIE (0,3)` and `AIE (0,5)`, to balance computation and distribute the weights effectively across the two cores. The feature map from the 1x1 convolution is therefore broadcast to `AIE (0,3)` and `AIE (0,5)` so that all required input channels are available for generating output feature maps in the subsequent 3x3 convolution. We split the output feature map processing across these cores, with each core computing half of the total output channels. The outputs from `AIE (0,3)` and `AIE (0,5)` are then merged in `AIE (0,4)` to perform the final 1x1 convolution. This final convolution also performs the skip addition, combining the initial input to the bottleneck block with the output of the 1x1 convolution, after which the final ReLU activation produces the output feature map. This output feature map is transmitted from `AIE (0,4)` back to the output via `Shim Tile (0,0)`. Although not shown in the figure, weights are transferred separately using a `Shim Tile (0,0)` channel into `Mem Tile (0,1)`, which distributes them across the appropriate AIE cores in parallel, leveraging the Mem tile's large number of DMA channels.
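+
+Functionally, the split of the 3x3 convolution across `AIE (0,3)` and `AIE (0,5)` is a partition over output channels, since each output channel depends only on the full input feature map. A NumPy sketch of this equivalence (illustrative shapes, not the design's actual tile sizes):
+
+```python
+import numpy as np
+
+def conv3x3(x, w):  # x: (H, W, Cin); w: (Cout, 3, 3, Cin); stride 1, zero padding
+    H, W, Cin = x.shape
+    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
+    out = np.zeros((H, W, w.shape[0]), dtype=x.dtype)
+    for o in range(w.shape[0]):
+        for i in range(H):
+            for j in range(W):
+                out[i, j, o] = np.sum(xp[i:i + 3, j:j + 3, :] * w[o])
+    return out
+
+x = np.random.randn(8, 8, 64).astype(np.float32)     # feature map from the 1x1 conv
+w = np.random.randn(64, 3, 3, 64).astype(np.float32)
+
+# Each core sees the full input feature map but computes half the output channels.
+out_core03 = conv3x3(x, w[:32])   # AIE (0,3): first half of output channels
+out_core05 = conv3x3(x, w[32:])   # AIE (0,5): second half of output channels
+merged = np.concatenate([out_core03, out_core05], axis=-1)  # merged at AIE (0,4)
+assert np.allclose(merged, conv3x3(x, w))
+```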
+We use the following architectural techniques to implement our bottleneck pipeline:
-### Benefits of ReLU Fusion:
+1. Depth-First Implementation: Spatial architectures provide coarse-grained flexibility that allows the dataflow to be tailored to optimize data movement. We exploit this to implement a depth-first schedule for the bottleneck block, where the output of one convolutional operation on an AIE core is sent directly to another convolutional operation on a separate AIE core, all without the need to transfer intermediate results off-chip. This approach minimizes the memory footprint of intermediate data, mitigating the overhead of costly off-chip accesses and increasing overall performance.
-1. Reduced Memory Bandwidth:
-By fusing ReLU into the convolution operation, unnecessary memory accesses and data transfers associated with separate ReLU computation are eliminated, leading to reduced memory bandwidth requirements.
+2. Data Layout: Optimize activation and weight layout to enhance memory access patterns and enable effective utilization of AIE parallel processing units, ultimately improving the performance of 2D convolution operations. Please refer to our [conv2d](../conv2d) design for details on the data layout.
-2. Improved Performance:
-Fusing ReLU reduces the number of instructions executed per element, resulting in improved computational efficiency and overall performance of the convolution operation.
+3. Kernel Optimization: Please refer to our [conv2d](../conv2d) design for details on vectorizing the 2D convolution.
-3. Simplified Code Structure:
-Fusing ReLU into the convolution kernel simplifies the code structure and reduces the overhead associated with separate activation function calls, leading to cleaner and more maintainable code.
+4. Quantization: We use `int8` precision for activations and weights. At `int8` precision, AIE offers the highest compute density with 256 MAC/cycle.
-4. Enhanced Resource Utilization:
-By combining convolution and ReLU operations, computational resources such as CPU cores or SIMD units are utilized more efficiently, maximizing throughput and achieving better resource utilization.
+5. Layer Fusion: We employ AIE's SRS capabilities to fuse ReLU directly into the convolution operation. This integration optimizes performance by eliminating separate ReLU computations, streamlining the convolution process. Please refer to our [conv2d_fused_relu](../conv2d_fused_relu) design for details on fusing ReLU into the convolution layer.
## Compilation
To compile the design:
diff --git a/programming_examples/ml/bottleneck/bottleneck_pipeline.png b/programming_examples/ml/bottleneck/bottleneck_pipeline.png
new file mode 100644
index 0000000000..e91b231be5
Binary files /dev/null and b/programming_examples/ml/bottleneck/bottleneck_pipeline.png differ
diff --git a/programming_examples/ml/bottleneck/requirements.txt b/programming_examples/ml/bottleneck/requirements.txt
deleted file mode 100644
index 08ed5eeb4b..0000000000
--- a/programming_examples/ml/bottleneck/requirements.txt
+++ /dev/null
@@ -1 +0,0 @@
-torch
\ No newline at end of file
diff --git a/programming_examples/ml/conv2d/README.md b/programming_examples/ml/conv2d/README.md
index b2d93f066d..9ea0d96c72 100644
--- a/programming_examples/ml/conv2d/README.md
+++ b/programming_examples/ml/conv2d/README.md
@@ -10,43 +10,68 @@
# Convolution 2D
## Introduction
-Convolution is a crucial part of various machine learning and computer vision tasks, such as image recognition, object detection, and image segmentation. This README provides instructions for implementing convolution on AI Engine.
+Convolution is a crucial part of various machine learning and computer vision tasks, such as image recognition, object detection, and image segmentation. This README provides instructions for implementing convolution on a single AI Engine (AIE) core with 8-bit precision.
-At its core, it is a mathematical operation that combines an input image and a filter to produce an output image. The input data is represented as a multi-dimensional matrix, such as an image with height, width, and channels (e.g., RGB channels). The filter is also represented as a multi-dimensional matrix with filter height, width, input and output channels (the same number of channels as the input data). The filter is systematically applied to different regions of the input data. At each step, the filter is element-wise multiplied with the overlapping region of the input data. The element-wise products are summed up to produce a single value, which represents the result of the convolution operation for that region. This process is repeated for all possible regions of the input data, producing an output matrix called the feature map.
+At its core, it is a mathematical operation that combines an input tensor and a filter to produce an output tensor. The input tensor is a multi-dimensional matrix with width, height, and channel dimensions. The filter is also represented as a multi-dimensional matrix with filter height, width, input channels, and output channels (the input-channel count matches the input tensor's channels). The filter is systematically applied to different regions of the input data. At each step, the filter is element-wise multiplied by the overlapping region of the input tensor. The element-wise products are summed up to produce a single value, representing the result of the convolution operation for that region. This process is repeated for all possible regions of the input tensor, producing an output tensor called the feature map.
-The process of applying the filter to different regions of the input data is often visualized as a sliding window moving across the input data. The size of the sliding window corresponds to the size of the filter, and it moves with a certain stride (the number of pixels it moves at each step). The convolution operation consists of seven nested loops, iterating over the input height, input lenght, input channel, output channel, filter height, filter length, and the batch size, each loop corresponding to different aspect of the operation. This systematic process extracts features from the input image, yielding the output feature map, illustrating the computational intricacies of convolution.
+The process of applying the filter to different regions of the input tensor is often visualized as a sliding window moving across the input data. The size of the sliding window corresponds to the size of the filter, and it moves with a certain stride (the number of pixels it moves at each step). The convolution operation consists of seven nested loops, iterating over the input height, input width, input channel, output channel, filter height, filter width, and batch size, each loop corresponding to a different aspect of the operation. This systematic process extracts features from the input tensor, yielding the output feature map. In this design, we vectorize a two-dimensional convolution with a 1x1 filter size.
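+
+As a point of reference, the seven-loop structure can be written directly in NumPy (a naive sketch assuming an NHWC activation layout, stride 1, and no padding; not the vectorized AIE kernel):
+
+```python
+import numpy as np
+
+def conv2d_reference(x, w):
+    """x: (N, H, W, Cin) input tensor; w: (Cout, kH, kW, Cin) filter."""
+    N, H, W, Cin = x.shape
+    Cout, kH, kW, _ = w.shape
+    out = np.zeros((N, H - kH + 1, W - kW + 1, Cout), dtype=np.float32)
+    for n in range(N):                          # batch size
+        for oy in range(H - kH + 1):            # input/output height
+            for ox in range(W - kW + 1):        # input/output width
+                for oc in range(Cout):          # output channel
+                    for ky in range(kH):        # filter height
+                        for kx in range(kW):    # filter width
+                            for ic in range(Cin):  # input channel
+                                out[n, oy, ox, oc] += x[n, oy + ky, ox + kx, ic] * w[oc, ky, kx, ic]
+    return out
+
+x = np.random.randn(1, 8, 8, 4).astype(np.float32)
+w = np.random.randn(8, 3, 3, 4).astype(np.float32)
+print(conv2d_reference(x, w).shape)  # (1, 6, 6, 8)
+```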
-## Acceleration Techniques
-1. Kernel Optimzation: To optimize convolution operations on AIE, we vectorize the code using AIE vector intrinsics. We load 8 elements of the input channel into vector registers using vector load intrinsic. We apply the convolution operation on this loaded data, utilizing for enhanced computational efficiency. To ensure accurate convolution results, particularly at the edges of feature maps, we implement zero-padding to handle boundary conditions. This comprehensive approach optimizes convolution processing on AIE, facilitating efficient and accurate feature extraction in neural network applications. Input is 4x8 matrix corresponding to 4 element of row and 8 input channels.
-2. Quantization: We use int8 precision for activationa and weights. At int8 precision, AIE offers the highest compute density with 256 MAC/cycle.
+## Source Files Overview
-3. Data Layout: Optimize activation and weight layout to enhance memory access patterns and enables effective utilization of AIE parallel processing units, ultimately improving the performance of 2D convolution operations.
+```
+.
++-- act_layout.png # Figure describing input/output data layout.
++-- aie2.py # A Python script that defines the AIE array structural design using MLIR-AIE operations.
++-- Makefile # Contains instructions for building and compiling software projects.
++-- README.md # This file.
++-- run.lit # For LLVM Integrated Tester (LIT) of the design.
++-- test.py # Python code testbench for the design example.
+```
+
+## NPU Implementation
+1. Kernel Optimization: To optimize convolution operations on AIE, we vectorize the code using AIE vector intrinsics. We load 8 elements of the input channel into vector registers using a vector load intrinsic and perform the convolution using vector MAC/MUL operations on this loaded data. We implement zero-padding to handle boundary conditions and ensure accurate convolution results, particularly at the edges of feature maps. The input is a 4x8 matrix corresponding to 4 elements of a row and 8 input channels (see the sketch after this list).
+
+2. Quantization: We use `int8` precision for activation and weights. At `int8` precision, AIE offers the highest compute density with 256 MAC/cycle.
+
+3. Data Layout: We optimize activation and weight layout to enhance memory access patterns and enable effective utilization of AIE parallel processing units, ultimately improving the performance of 2D convolution operations.
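+
+As a rough model of one vector step from item 1 (a NumPy emulation, not AIE intrinsics): 4 row elements times 8 input channels are multiplied against an 8-input-by-8-output weight sub-block, which is 4x8x8 = 256 MACs, matching the `int8` compute density noted in item 2.
+
+```python
+import numpy as np
+
+acc = np.zeros((4, 8), dtype=np.int32)  # accumulator: 4 pixels x 8 output channels
+x_vec = np.random.randint(-128, 128, (4, 8), dtype=np.int8)  # 4 pixels x 8 input channels
+w_blk = np.random.randint(-128, 128, (8, 8), dtype=np.int8)  # 8 input x 8 output channels
+acc += x_vec.astype(np.int32) @ w_blk.astype(np.int32)       # one emulated vector MAC step
+```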
## Data Layout
-We need to ensure that the data layout is compatible with efficient SIMD processing and rearrange the input data into a format where contiguous elements represent consecutive X-dimension values for each channel. For more efficient processing, we adopt a channels-last memory ordering, denoted as NYCXC8, to ensure that channels become the densest dimension. Operating on 8 elements simultaneously, we process 8 channels with the same width at once. Subsequently, we traverse the entire width dimension, handling the remaining channels in batches of 8. This process continues row-wise, resulting in our final data layout pattern: NYCXC8. This optimized layout enhances memory access patterns and enables effective utilization of parallel processing units, ultimately improving the performance of 2D convolution operations. This transformation ensures that data can be efficiently loaded into SIMD registers and processed in parallel.
+We must ensure that the data layout is compatible with efficient SIMD processing and rearrange the input data into a format where contiguous elements represent consecutive X-dimension values for each channel. We adopt a channels-last memory ordering, denoted as Y{C/8}X{C8}, to exploit channel parallelism by making channels the densest dimension. Operating on 8 elements at a time, we process 8 channels at the same width position at once. We then traverse the entire width dimension, handling the remaining channels in batches of 8, and continue row by row, which yields our final data layout pattern: Y{C/8}X{C8}. This optimized layout enhances memory access patterns, ensures data can be loaded efficiently into SIMD registers, and enables effective utilization of AIE parallel processing units, ultimately improving the performance of 2D convolution operations. A NumPy sketch of this rearrangement follows the figure and dimension key below.
+
+The figure below shows our channel-parallel data layout (Y{C/8}X{C8}) for a tensor of dimension 8x8x16:
+
+
+
+
+
Channel parallel data layout for activations. An AIE core processes 8 channels in parallel per vector operation.
+
+
+
+
+In the Y{C/8}X{C8} (with N=1) data layout, the data is organized in memory as follows:
-YCXC8 Input/Output Data Layout:
+* C8: Indicates that 8 elements of the input channel are processed together.
+* X: Represents the feature map width.
+* C/8: Denotes the number of 8-channel groups.
+* Y: Represents the feature map height.
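+
+This rearrangement is a reshape plus transpose; a NumPy sketch (assuming the source tensor is stored channels-last as (Y, X, C)):
+
+```python
+import numpy as np
+
+def to_ycxc8(x):
+    """Rearrange a (Y, X, C) activation into Y{C/8}X{C8} memory order."""
+    Y, X, C = x.shape
+    assert C % 8 == 0
+    # (Y, X, C) -> (Y, X, C/8, C8) -> (Y, C/8, X, C8)
+    return x.reshape(Y, X, C // 8, 8).transpose(0, 2, 1, 3).copy()
+
+x = np.arange(8 * 8 * 16, dtype=np.int32).reshape(8, 8, 16)  # the 8x8x16 example above
+print(to_ycxc8(x).shape)  # (8, 2, 8, 8): Y, C/8, X, C8
+```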
-In the YCXC8 (with N=1) data layout, the data is organized in memory as follows::
-* Y: Represents the output feature map dimension.
-* C: Denotes the number of channels.
-* X: Represents the input feature map dimension.
-* C8: Indicates that 8 elements of the input channel are processed together.
+{O/8}{I/8}YX{I8}{O8} Weight Layout:
-OIYXI8O8 Weight Layout:
+We align the weight layout as specified: O/8, I/8, Y, X, I8, O8, to match the input tensor processing. We first load the weight tensor and organize it to match this layout, where dimensions represent output channels, input channels, kernel height, kernel width, input channel groups of 8, and output channel groups of 8. By aligning the weight layout in this manner, we enable seamless integration with the input data layout, maximizing parallelism and minimizing memory access overhead.
-We align the weight layout as specified: O,I,Y,X,I8,O8, to match the input image processing. We first load the weight tensor, organizing it to match this layout, where dimensions represent: output channels, input channels, kernel height, kernel width, input channel groups of 8, and output channel groups of 8. By aligning the weight layout in this manner, we enable seamless integration with the input data layout, maximizing parallelism and minimizing memory access overhead.
+In the {O/8}{I/8}YX{I8}{O8} data layout, the data is organized in memory as follows:
-In the OIYXI8O8 data layout, the data is organized in memory as follows:
+* O8: Indicates that 8 elements of the output channel are processed together.
+* I8: Indicates that 8 elements of the input channel are processed together.
+* X: Represents the kernel width.
+* Y: Represents the kernel height.
+* I/8: Denotes the number of 8-input-channel groups.
+* O/8: Denotes the number of 8-output-channel groups.
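+
+A NumPy sketch of the corresponding weight rearrangement (assuming the source weights are stored as (O, I, Y, X)):
+
+```python
+import numpy as np
+
+def to_weight_layout(w):
+    """Rearrange (O, I, Y, X) weights into {O/8}{I/8}YX{I8}{O8} memory order."""
+    O, I, Y, X = w.shape
+    assert O % 8 == 0 and I % 8 == 0
+    # (O, I, Y, X) -> (O/8, O8, I/8, I8, Y, X) -> (O/8, I/8, Y, X, I8, O8)
+    return w.reshape(O // 8, 8, I // 8, 8, Y, X).transpose(0, 2, 4, 5, 3, 1).copy()
+
+w = np.random.randint(-128, 128, (64, 64, 1, 1), dtype=np.int8)  # 1x1 kernel example
+print(to_weight_layout(w).shape)  # (8, 8, 1, 1, 8, 8)
+```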
-* O: Denotes the number of output channels.
-* I: Denotes the number of input channels.
-* Y: Represents the kernel height.
-* X: Represents the kernel weight.
-* I8: Indicates that 8 elements of the input channel are processed together.
-* O8: Indicates that 8 elements of the output channel are processed together.
## Compilation
To compile the design:
diff --git a/programming_examples/ml/conv2d/act_layout.png b/programming_examples/ml/conv2d/act_layout.png
new file mode 100644
index 0000000000..7630e06ea0
Binary files /dev/null and b/programming_examples/ml/conv2d/act_layout.png differ
diff --git a/programming_examples/ml/conv2d/requirements.txt b/programming_examples/ml/conv2d/requirements.txt
deleted file mode 100644
index 08ed5eeb4b..0000000000
--- a/programming_examples/ml/conv2d/requirements.txt
+++ /dev/null
@@ -1 +0,0 @@
-torch
\ No newline at end of file
diff --git a/programming_examples/ml/conv2d_fused_relu/README.md b/programming_examples/ml/conv2d_fused_relu/README.md
index 3f4a2264cd..4969d177e9 100644
--- a/programming_examples/ml/conv2d_fused_relu/README.md
+++ b/programming_examples/ml/conv2d_fused_relu/README.md
@@ -11,74 +11,47 @@
# Convolution with Fused ReLU
## Introduction
-Convolution is a crucial part of various machine learning and computer vision tasks, such as image recognition, object detection, and image segmentation. ReLU (Rectified Linear Unit ) is one of the most commonly used activation functions due to its simplicity and effectiveness. This README provides instructions for implementing convolution with ReLU activation function on AI Engine.
+In [conv2d](../conv2d), we describe how to implement a two-dimensional convolution kernel on AIE, while [relu](../relu) describes the implementation of the Rectified Linear Unit (ReLU) activation function on AIE. This README provides instructions for fusing convolution with the ReLU activation function on a single AI Engine (AIE) core.
-At its core, convolution is a mathematical operation that combines an input image and a filter to produce an output image. The input data is represented as a multi-dimensional matrix, such as an image with height, width, and channels (e.g., RGB channels). The filter is also represented as a multi-dimensional matrix with filter height, width, input and output channels (the same number of channels as the input data). The filter is systematically applied to different regions of the input data. At each step, the filter is element-wise multiplied with the overlapping region of the input data. The element-wise products are summed up to produce a single value, which represents the result of the convolution operation for that region. This process is repeated for all possible regions of the input data, producing an output matrix called the feature map.
-The process of applying the filter to different regions of the input data is often visualized as a sliding window moving across the input data. The size of the sliding window corresponds to the size of the filter, and it moves with a certain stride (the number of pixels it moves at each step). The convolution operation consists of seven nested loops, iterating over the input height, input lenght, input channel, output channel, filter height, filter length, and the batch size, each loop corresponding to different aspect of the operation. This systematic process extracts features from the input image, yielding the output feature map, illustrating the computational intricacies of convolution.
+## Source Files Overview
-## Acceleration Techniques
-1. Kernel Optimzation: To optimize convolution operations on AIE, we vectorize the code using AIE vector intrinsics. We load 8 elements of the input channel into vector registers using vector load intrinsic. We apply the convolution operation on this loaded data, utilizing for enhanced computational efficiency. To ensure accurate convolution results, particularly at the edges of feature maps, we implement zero-padding to handle boundary conditions. This comprehensive approach optimizes convolution processing on AIE, facilitating efficient and accurate feature extraction in neural network applications. Input is 4x8 matrix corresponding to 4 element of row and 8 input channels.
-
-2. Quantization: We use int8 precision for activationa and weights. At int8 precision, AIE offers the highest compute density with 256 MAC/cycle.
-
-3. Layer Fused: We perform two levels of fusion. First, we fuse ReLU in convolution using SRS capabilities of AIE. Second, we fuse BatchNorm into convolution weights.
-
-4. Data Layout: Optimize activation and weight layout to enhance memory access patterns and enables effective utilization of AIE parallel processing units, ultimately improving the performance of 2D convolution operations.
-
-## Data Layout
-We need to ensure that the data layout is compatible with efficient SIMD processing and rearrange the input data into a format where contiguous elements represent consecutive X-dimension values for each channel. For more efficient processing, we adopt a channels-last memory ordering, denoted as NYCXC8, to ensure that channels become the densest dimension. Operating on 8 elements simultaneously, we process 8 channels with the same width at once. Subsequently, we traverse the entire width dimension, handling the remaining channels in batches of 8. This process continues row-wise, resulting in our final data layout pattern: NYCXC8. This optimized layout enhances memory access patterns and enables effective utilization of parallel processing units, ultimately improving the performance of 2D convolution operations. This transformation ensures that data can be efficiently loaded into SIMD registers and processed in parallel.
-
-YCXC8 Input/Output Data Layout:
-
-In the YCXC8 (with N=1) data layout, the data is organized in memory as follows:
-
-* Y: Represents the output feature map dimension.
-* C: Denotes the number of channels.
-* X: Represents the input feature map dimension.
-* C8: Indicates that 8 elements of the input channel are processed together.
-
-OIYXI8O8 Weight Layout:
-
-We align the weight layout as specified: O,I,Y,X,I8,O8, to match the input image processing. We first load the weight tensor, organizing it to match this layout, where dimensions represent: output channels, input channels, kernel height, kernel width, input channel groups of 8, and output channel groups of 8. By aligning the weight layout in this manner, we enable seamless integration with the input data layout, maximizing parallelism and minimizing memory access overhead.
-
-In the OIYXI8O8 data layout, the data is organized in memory as follows:
+```
+.
++-- aie2.py # A Python script that defines the AIE array structural design using MLIR-AIE operations.
++-- Makefile # Contains instructions for building and compiling software projects.
++-- README.md # This file.
++-- run.lit # For LLVM Integrated Tester (LIT) of the design.
++-- test.py # Python code testbench for the design example.
+```
-* O: Denotes the number of output channels.
-* I: Denotes the number of input channels.
-* Y: Represents the kernel height.
-* X: Represents the kernel weight.
-* I8: Indicates that 8 elements of the input channel are processed together.
-* O8: Indicates that 8 elements of the output channel are processed together.
+## Fusing ReLU
+Fusing ReLU into the convolution operation can optimize performance by reducing unnecessary data movement, leading to lower external memory bandwidth requirements and computational overhead. The ReLU activation function introduces non-linearity by setting negative values to zero and leaving positive values unchanged. For fixed-point arithmetic, we can utilize the Shift-Round-Saturate (SRS) capability of AIE to apply the required transformation, shifting out lower-order bits, rounding, and saturating, via the SRS family of intrinsics. These intrinsics let us apply the ReLU activation efficiently while the data is still in the accumulation registers, eliminating any extra data movement by fusing at the vector register level.
+After performing the convolution operation, we use `aie::set_rounding()` and `aie::set_saturation()` to set the rounding and saturation modes for the computed results in the accumulator. Setting the rounding mode to `positive_inf` rounds halfway cases towards positive infinity, while setting saturation to `aie::saturation_mode::saturate` clamps results to the uint8 range (0, 255).
-## Fusing ReLU
-Fusing ReLU into the convolution operation can further optimize the implementation by reducing memory bandwidth requirements and computational overhead. ReLU activation function introduces non-linearity by setting negative values to zero and leaving positive values unchanged. Utilize SIMD instructions to efficiently compute ReLU activation in parallel with convolution. After performing the convolution operation, apply ReLU activation function at vector register level.
-We use `aie::set_rounding()` and `aie::set_saturation()` to set the rounding and saturation modes for the computed results in the accumulator. Seeting round mode `postitive_inf` rounds halfway towards positive infinity while setting saturation to `aie::saturation_mode::saturate` saturation rounds an uint8 range (0, 255).
```
-::aie::set_saturation(
- aie::saturation_mode::saturate); // Needed to saturate properly to uint8
::aie::set_rounding(
- aie::rounding_mode::positive_inf); // Needed to saturate properly to uint8
+    aie::rounding_mode::positive_inf); // Needed to round properly to uint8
+::aie::set_saturation(
+    aie::saturation_mode::saturate); // Needed to saturate properly to uint8
```
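+
+The numeric effect of this rounding/saturation pair can be emulated in NumPy (a sketch of the semantics only, not the SRS intrinsic itself; the shift amount is illustrative):
+
+```python
+import numpy as np
+
+def srs_to_uint8(acc, shift):
+    """Emulate shift-round-saturate: round halfway cases toward +inf, clamp to uint8."""
+    scaled = np.floor(acc / float(2 ** shift) + 0.5)  # round half toward positive infinity
+    return np.clip(scaled, 0, 255).astype(np.uint8)   # clamping at 0 realizes ReLU
+
+acc = np.array([-300, -1, 96, 8128, 300000], dtype=np.int32)  # accumulator values
+print(srs_to_uint8(acc, shift=6))  # [  0   0   2 127 255]
+```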
-After convolution and ReLU fusion, the output data is generate in YCXC8 layout. Ensure that the output data layout is compatible with subsequent layers or processing steps in the neural network architecture.
+The output data is generated in Y{C/8}X{C8} layout. Please refer to our [conv2d](../conv2d) design for details on the data layout.
-### Benefits of ReLU Fusion:
+### Benefits of Fusing Convolution and ReLU:
1. Reduced Memory Bandwidth:
-By fusing ReLU into the convolution operation, unnecessary memory accesses and data transfers associated with separate ReLU computation are eliminated, leading to reduced memory bandwidth requirements.
+Fusing ReLU into the convolution operation eliminates unnecessary memory accesses and data transfers associated with separate ReLU computations, leading to reduced memory bandwidth requirements.
2. Improved Performance:
Fusing ReLU reduces the number of instructions executed per element, resulting in improved computational efficiency and overall performance of the convolution operation.
-3. Simplified Code Structure:
-Fusing ReLU into the convolution kernel simplifies the code structure and reduces the overhead associated with separate activation function calls, leading to cleaner and more maintainable code.
+3. Enhanced Resource Utilization:
+Combining convolution and ReLU operations allows computational resources to be utilized more efficiently, maximizing throughput and achieving better resource utilization.
-4. Enhanced Resource Utilization:
-By combining convolution and ReLU operations, computational resources such as CPU cores or SIMD units are utilized more efficiently, maximizing throughput and achieving better resource utilization.
## Compilation
To compile the design:
diff --git a/programming_examples/ml/conv2d_fused_relu/requirements.txt b/programming_examples/ml/conv2d_fused_relu/requirements.txt
deleted file mode 100644
index 08ed5eeb4b..0000000000
--- a/programming_examples/ml/conv2d_fused_relu/requirements.txt
+++ /dev/null
@@ -1 +0,0 @@
-torch
\ No newline at end of file
diff --git a/programming_examples/ml/resnet/README.md b/programming_examples/ml/resnet/README.md
index de4cc92535..5bb146b006 100755
--- a/programming_examples/ml/resnet/README.md
+++ b/programming_examples/ml/resnet/README.md
@@ -8,93 +8,59 @@
//
//===----------------------------------------------------------------------===//-->
-# ResNet with Offloaded Conv2_x Bottleneck Blocks
+# ResNet with Offloaded Conv2_x Layers
## Introduction
-ResNet [[1]](#1) is a convolutional neural network architecture that has gained significant popularity for various computer vision tasks, including image classification, object detection, and image segmentation. It is renowned for its depth and efficiency in training very deep networks.
-
-This README focuses on a specific optimization technique applied to ResNet, specifically targeting the offloading of the conv2_x part of the bottleneck blocks. By offloading computations to dedicated hardware accelerators or specialized processors, we aim to improve the overall efficiency and speed of the network, especially when deploying it on resource-constrained devices or in scenarios where real-time processing is critical.
-
+ResNet [[1]](#1) is a convolutional neural network architecture that has gained significant popularity for various computer vision tasks, including image classification, object detection, and image segmentation. It is renowned for its depth and efficiency in training very deep networks. This README focuses on our implementation of the conv2_x layers of the ResNet architecture using three columns of NPU.
## ResNet Architecture Overview
-ResNet consists of several key components:
+ResNet consists of the following key components:
-1. Input Layer: Accepts input image data with dimensions typically set to 224x224x3 (width, height, RGB channels).
+1. Input Layer: This layer accepts input image data with dimensions typically set to 224x224x3 (width, height, and RGB channels).
2. Convolutional Layers: The initial layers perform convolution operations to extract basic features from the input image.
3. Bottleneck Blocks:
- * ResNet is composed of multiple bottleneck blocks grouped into different stages (conv2_x, conv3_x, conv4_x, conv5_x).
- * Each bottleneck block contains convolutional layers and shortcut connections that facilitate the learning of residual mappings.
- * The conv2_x stage is particularly targeted for offloading computations in this optimization.
+ * ResNet is composed of multiple bottleneck blocks grouped into different stages (conv2_x, conv3_x, conv4_x, conv5_x).
+ * Each bottleneck block contains convolutional layers and shortcut connections that facilitate the learning of residual mappings.
+ * The conv2_x stage is particularly targeted for offloading computations in this optimization.
4. Pooling Layers: Max pooling layers reduce the spatial dimensions of the feature maps.
5. Fully Connected Layer: Produces the final output predictions, typically followed by a softmax activation for classification tasks.
+## Source Files Overview
-## Offloading Conv2_x Bottleneck Blocks
-The conv2_x stage of ResNet comprises a series of bottleneck blocks, each containing convolutional layers responsible for learning more complex features from the input data. By offloading the computations within these blocks to AI Engine, we aim to:
-
-* Reduce the computational burden on the main processing unit (e.g., CPU or GPU).
-* Improve overall inference speed and efficiency, especially in scenarios where real-time processing is crucial.
-* Enable deployment on resource-constrained devices with limited computational resources.
-
-## Usage and Deployment
-To leverage the optimized ResNet with offloaded conv2_x bottleneck blocks:
-* [IRON Programming](https://github.com/Xilinx/mlir-aie/tree/gagan_asplos_resnet/programming_examples/ml/resnet/layers_conv2_x): Demonstrates the IRON flow for offloading conv2_x to AIE.
-
-
-## Acceleration Techniques
-1. Depth-First/Layer-Fused Implementation: Spatial architectures provide coarse-grained flexibility that allows for tailoring of the dataflow to optimize data movement. By tailoring the dataflow, we implement depth-first schedule for a bottleneck block routing the output of one convolutional operation on an AIE core directly to another convolutional operation on a separate AIE core, all without the need to transfer intermediate results off-chip. This approach effectively minimizes the memory footprint associated with intermediate data, mitigating the overhead of costly off-chip accesses leading to increase in the overall performance.
-
-
-2. Data Layout: Optimize activation and weight layout to enhance memory access patterns and enables effective utilization of AIE parallel processing units, ultimately improving the performance of 2D convolution operations.
-
-3. Kernel Optimzation: To optimize convolution operations on AIE, we vectorize the code using AIE vector intrinsics. We load 8 elements of the input channel into vector registers using vector load intrinsic. We apply the convolution operation on this loaded data, utilizing for enhanced computational efficiency. To ensure accurate convolution results, particularly at the edges of feature maps, we implement zero-padding to handle boundary conditions. This comprehensive approach optimizes convolution processing on AIE, facilitating efficient and accurate feature extraction in neural network applications. Input is 4x8 matrix corresponding to 4 element of row and 8 input channels.
-
-4. Quantization: We use int8 precision for activationa and weights. At int8 precision, AIE offers the highest compute density with 256 MAC/cycle.
-
-5. Layer Fused: We perform two levels of fusion. First, we fuse ReLU in convolution using SRS capabilities of AIE. Second, we fuse BatchNorm into convolution weights.
-
-
-## Data Layout
-We need to ensure that the data layout is compatible with efficient SIMD processing and rearrange the input data into a format where contiguous elements represent consecutive X-dimension values for each channel. For more efficient processing, we adopt a channels-last memory ordering, denoted as NYCXC8, to ensure that channels become the densest dimension. Operating on 8 elements simultaneously, we process 8 channels with the same width at once. Subsequently, we traverse the entire width dimension, handling the remaining channels in batches of 8. This process continues row-wise, resulting in our final data layout pattern: NYCXC8. This optimized layout enhances memory access patterns and enables effective utilization of parallel processing units, ultimately improving the performance of 2D convolution operations. This transformation ensures that data can be efficiently loaded into SIMD registers and processed in parallel.
-
-YCXC8 Input/Output Data Layout:
-
-In the YCXC8 (with N=1) data layout, the data is organized in memory as follows:
-
-* Y: Represents the output feature map dimension.
-* C: Denotes the number of channels.
-* X: Represents the input feature map dimension.
-* C8: Indicates that 8 elements of the input channel are processed together.
+```
+.
++-- layers_conv2_x # Implementation of ResNet conv2_x layers on NPU
+| +-- aie2.py # A Python script that defines the AIE array structural design using MLIR-AIE operations.
+| +-- Makefile # Contains instructions for building and compiling software projects.
+| +-- resnet_conv2x_pipeline.png # Figure describing our implementation of conv2_x layers on NPU.
+| +-- run.lit # For LLVM Integrated Tester (LIT) of the design.
+| +-- test.py # Python code testbench for the design example.
++-- README.md # This file.
-OIYXI8O8 Weight Layout:
+```
-We align the weight layout as specified: O,I,Y,X,I8,O8, to match the input image processing. We first load the weight tensor, organizing it to match this layout, where dimensions represent: output channels, input channels, kernel height, kernel width, input channel groups of 8, and output channel groups of 8. By aligning the weight layout in this manner, we enable seamless integration with the input data layout, maximizing parallelism and minimizing memory access overhead.
+## NPU Implementation
+The conv2_x stage of ResNet comprises a series of bottleneck blocks, each containing convolutional, batch norm, and ReLU layers responsible for learning more complex features from the input data. By offloading the computations within these blocks to AI Engine, we aim to:
-In the OIYXI8O8 data layout, the data is organized in memory as follows:
+* Reduce the computational burden on the main processing unit (e.g., CPU or GPU).
+* Improve overall inference speed and efficiency.
-* O: Denotes the number of output channels.
-* I: Denotes the number of input channels.
-* Y: Represents the kernel height.
-* X: Represents the kernel weight.
-* I8: Indicates that 8 elements of the input channel are processed together.
-* O8: Indicates that 8 elements of the output channel are processed together.
+The figure below shows our implementation of the conv2_x layers of the ResNet architecture using three columns of NPU.
+
+
+
+
ResNet conv2_x stage's bottleneck blocks are stacked depth-first to avoid unnecessary off-chip data movement.
+
+
-## Fusing Convolution and Batch Normalization
+Similar to our [bottleneck design](../../bottleneck), we implement the conv2_x layers depth-first. Our implementation connects the output of one bottleneck block on an NPU column to another on a separate column, all without the necessity of transferring intermediate results off-chip. Compared to the [bottleneck design](../../bottleneck), the first bottleneck block in the conv2_x stage requires an additional 1x1 convolution on the `AIE (0,4)` tile to handle the channel mismatch in the skip addition between the input from the skip path and the input from the non-skip path. This mismatch arises because the initial input activation transferred over the skip path has fewer channels than the output of the non-skip path. To overcome this issue, an additional 1x1 convolution is introduced in the skip path that increases the number of channels, as sketched below.
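+
+A NumPy sketch of this extra skip-path projection (ResNet-50 conv2_x shapes used purely for illustration):
+
+```python
+import numpy as np
+
+def conv1x1(x, w):  # x: (H, W, Cin); w: (Cout, Cin)
+    return np.tensordot(x, w, axes=([2], [1]))  # -> (H, W, Cout)
+
+x_in   = np.random.randn(56, 56, 64).astype(np.float32)    # block input on the skip path
+y_main = np.random.randn(56, 56, 256).astype(np.float32)   # non-skip path output (256 ch)
+w_proj = np.random.randn(256, 64).astype(np.float32)       # 1x1 projection weights
+
+skip = conv1x1(x_in, w_proj)         # raise the skip path from 64 to 256 channels
+out  = np.maximum(y_main + skip, 0)  # skip addition followed by the final ReLU
+```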
-We assume the BatchNorm layer is fused into Convoluion Layer. Fusing BatchNorm into convolution involves incorporating the normalization step directly into the convolution operation. This is achieved by modifying the weights of the convolutional filters to include the scaling and shifting factors. Specifically, the weights are adjusted such that the convolution operation performs the normalization, scaling, and shifting in a single step.
+After the initial processing in the first bottleneck block, the output is sent directly to the second bottleneck block on a separate NPU column. The output activation is broadcast to both `AIE (1,5)` and `AIE (1,3)` via `Mem Tile (1,1)`. The second bottleneck's processing proceeds as described in the [bottleneck design](../../bottleneck). Similarly, the third bottleneck block consumes the output of the second directly, avoiding any need to send intermediate activations off-chip. After processing in the third bottleneck block, the final output is transmitted from tile `AIE (2,4)` back to the output via `Shim Tile (2,0)`, completing the seamless flow of computation within the NPU architecture. Thus, our depth-first implementation avoids all unnecessary off-chip data movement for intermediate tensors.
-## Fusing ReLU
-Fusing ReLU into the convolution operation can further optimize the implementation by reducing memory bandwidth requirements and computational overhead. ReLU activation function introduces non-linearity by setting negative values to zero and leaving positive values unchanged. Utilize SIMD instructions to efficiently compute ReLU activation in parallel with convolution. After performing the convolution operation, apply ReLU activation function at vector register level.
-We use `aie::set_rounding()` and `aie::set_saturation()` to set the rounding and saturation modes for the computed results in the accumulator. Seeting round mode `postitive_inf` rounds halfway towards positive infinity while setting saturation to `aie::saturation_mode::saturate` saturation rounds an uint8 range (0, 255).
-```
-::aie::set_saturation(
- aie::saturation_mode::saturate); // Needed to saturate properly to uint8
-::aie::set_rounding(
- aie::rounding_mode::positive_inf); // Needed to saturate properly to uint8
-```
-After convolution and ReLU fusion, the output data is generate in YCXC8 layout. Ensure that the output data layout is compatible with subsequent layers or processing steps in the neural network architecture.
## Compilation
To compile the design:
diff --git a/programming_examples/ml/resnet/layers_conv2_x/requirements.txt b/programming_examples/ml/resnet/layers_conv2_x/requirements.txt
deleted file mode 100755
index 08ed5eeb4b..0000000000
--- a/programming_examples/ml/resnet/layers_conv2_x/requirements.txt
+++ /dev/null
@@ -1 +0,0 @@
-torch
\ No newline at end of file
diff --git a/programming_examples/ml/resnet/layers_conv2_x/resnet_conv2x_pipeline.png b/programming_examples/ml/resnet/layers_conv2_x/resnet_conv2x_pipeline.png
new file mode 100644
index 0000000000..8e311edd7c
Binary files /dev/null and b/programming_examples/ml/resnet/layers_conv2_x/resnet_conv2x_pipeline.png differ
diff --git a/programming_guide/section-6/README.md b/programming_guide/section-6/README.md
index f54c812ab3..714b8b6249 100644
--- a/programming_guide/section-6/README.md
+++ b/programming_guide/section-6/README.md
@@ -31,7 +31,7 @@ There are a number of example designs available [here](../../programming_example
## Exercises
-1. In [bottlneck](../../programming_examples/ml/bottleneck/) design following a dataflow approach, how many elements does the 3x3 convolution operation require to proceed with its computation?
+1. In the [bottleneck](../../programming_examples/ml/bottleneck/) design, which follows a dataflow approach, how many rows of input data does the 3x3 convolution operation require to proceed with its computation?
2. Suppose you have a bottleneck block with input dimensions of 32x32x256. After passing through the 1x1 convolutional layer, the output dimensions become 32x32x64. What would be the output dimensions after the subsequent 3x3 convolutional layer, assuming a stride of 1, no padding, and 64 output channels?
-----