diff --git a/docs/conferenceDescriptions/asplos24TutorialDescription.md b/docs/conferenceDescriptions/asplos24TutorialDescription.md index b7e1d85a79..cec2e59db4 100644 --- a/docs/conferenceDescriptions/asplos24TutorialDescription.md +++ b/docs/conferenceDescriptions/asplos24TutorialDescription.md @@ -1,30 +1,30 @@ -# ASPLOS'24 Tutorial: Levering MLIR to Design for AI Engines on Ryzen AI +# ASPLOS'24 Tutorial: Leveraging MLIR to Design for AI Engines on Ryzen™ AI ## Introduction -The AI Engine array in the NPU of the AMD Ryzen AI device includes a set of VLIW vector processors with adaptable interconnect. This tutorial is targeted at performance engineers and tool developers who are looking for fast and completely open source design tools to support their research. Participants will first get insight into the AI Engine compute and data movement capabilities. Through small design examples expressed in the MLIR-AIE python language bindings and executed on an Ryzen AI device, participants will leverage AI Engine features for optimizing performance of increasingly complex designs. The labs will be done on Ryzen AI enabled miniPCs giving participants the ability to execute their own designs on real hardware. +The AI Engine array in the NPU of the AMD Ryzen™ AI device includes a set of VLIW vector processors with adaptable interconnect. This tutorial targets performance engineers and tool developers looking for fast and completely open-source design tools to support their research. Participants will first get insight into the AI Engine compute and data movement capabilities. Through small design examples expressed in the MLIR-AIE Python language bindings and executed on a Ryzen™ AI device, participants will leverage AI Engine features to optimize the performance of increasingly complex designs. The labs will be done on Ryzen™ AI-enabled miniPCs, giving participants the ability to execute their own designs on real hardware. This tutorial will cover the following key topics: 1. AI Engine architecture introduction -1. AIE core, array configuration and host application code compilation +1. AIE core, array configuration, and host application code compilation 1. Data movement and communication abstraction layers 1. Tracing for performance monitoring 1. Putting it all together on larger examples: matrix multiplication, convolutions as building blocks for ML and computer vision examples ## Agenda -Date: Saturday April 27th 2024 (morning) +Date: Saturday, April 27th, 2024 (morning) Location: Hilton La Jolla Torrey Pines, San Diego, California (with ASPLOS’24) -Prerequisite: please bring your laptop, so that you can ssh into our Ryzen AI enabled miniPCs for the hands-on exercises. +Prerequisite: please bring your laptop so that you can SSH into our Ryzen™ AI-enabled miniPCs for the hands-on exercises. 
### Contents and Timeline (tentative) | Time | Topic | Presenter | Slides or Code | |------|-------|-----------|----------------| | 08:30am | Intro to spatial compute and explicit data movement | Kristof | [Programming Guide](../../programming_guide/) | -| 08:45am | "Hello World" from Ryzen AI | Joe | [AI Engine Basic Building Blocks](../../programming_guide/section-1/) | -| 09:00am | Data movement on Ryzen AI with objectFIFOs | Joe | [Data Movement](../../programming_guide/section-2/) | +| 08:45am | "Hello World" from Ryzen™ AI | Joe | [AI Engine Basic Building Blocks](../../programming_guide/section-1/) | +| 09:00am | Data movement on Ryzen™ AI with objectFIFOs | Joe | [Data Movement](../../programming_guide/section-2/) | | 09:30am | Your First Program | Kristof | [My First Program](../../programming_guide/section-3) | | 09:50am | Exercise 1: Build and run your first program | All | [Passthrough](../../programming_examples/basic/passthrough_kernel/) | | 10:00am | Break | | | @@ -44,8 +44,8 @@ Prerequisite: please bring your laptop, so that you can ssh into our Ryzen AI en *Joseph Melber* is a Senior Member of Technical Staff in AMD’s Research and Advanced Development group. At AMD, he is working on hardware architectures and compiler technologies for current and future AMD devices. He received a BS in electrical engineering from the University Buffalo, as well as MS and PhD degrees from the electrical and computer engineering department at Carnegie Mellon University. His research interests include runtime systems, compiler abstractions for data movement, and hardware prototypes for future adaptive heterogeneous computing architectures. -*Kristof Denolf* is a Fellow in AMD's Research and Advanced Development group where he is working on energy efficient computer vision and video processing applications to shape future AMD devices. He earned a M.Eng. in electronics from the Katholieke Hogeschool Brugge-Oostende (1998), now part of KULeuven, a M.Sc. in electronic system design from Leeds Beckett University (2000) and a Ph.D. from the Technical University Eindhoven (2007). He has over 25 years of combined research and industry experience at IMEC, Philips, Barco, Apple, Xilinx and AMD. His main research interest are all aspects of the cost-efficient and dataflow oriented design of video, vision and graphics systems. +*Kristof Denolf* is a Fellow in AMD's Research and Advanced Development group where he is working on energy-efficient computer vision and video processing applications to shape future AMD devices. He earned an M.Eng. in electronics from the Katholieke Hogeschool Brugge-Oostende (1998), now part of KULeuven, an M.Sc. in electronic system design from Leeds Beckett University (2000), and a Ph.D. from the Technical University Eindhoven (2007). He has over 25 years of combined research and industry experience at IMEC, Philips, Barco, Apple, Xilinx, and AMD. His main research interests are all aspects of the cost-efficient and dataflow-oriented design of video, vision, and graphics systems. -*Phil James-Roxby* is a Senior Fellow in AMD’s Research and Advanced Development group, working on compilers and runtimes to support current and future AMD devices, particularly in the domain on AI processing. In the past, he has been responsible for a number of software enablement activities for hardware devices, including SDNet and SDAccel at Xilinx, and the original development environement for the AI Engines. 
He holds a PhD from the University of Manchester on hardware acceleration of embedded machine learning applications, and his main research interest continues to be how to enable users to efficiently use diverse hardware in heterogenous systems. +*Phil James-Roxby* is a Senior Fellow in AMD’s Research and Advanced Development group, working on compilers and runtimes to support current and future AMD devices, particularly in the domain of AI processing. In the past, he has been responsible for a number of software enablement activities for hardware devices, including SDNet and SDAccel at Xilinx, and the original development environment for the AI Engines. He holds a PhD from the University of Manchester on hardware acceleration of embedded machine learning applications, and his main research interest continues to be how to enable users to efficiently use diverse hardware in heterogeneous systems. -*Samuel Bayliss* is a Fellow in the Research and Advanced Development group at AMD. His academic experience includes formative study at Imperial College London, for which he earned MEng and PhD degrees in 2006 and 2012 respectively. He is energized by his current work in advancing compiler tooling using MLIR, developing programming abstractions for parallel compute and evolving hardware architectures for efficient machine learning. \ No newline at end of file +*Samuel Bayliss* is a Fellow in the Research and Advanced Development group at AMD. His academic experience includes formative study at Imperial College London, for which he earned MEng and PhD degrees in 2006 and 2012, respectively. He is energized by his current work in advancing compiler tooling using MLIR, developing programming abstractions for parallel compute and evolving hardware architectures for efficient machine learning. \ No newline at end of file diff --git a/programming_examples/basic/README.md b/programming_examples/basic/README.md index 9d9a57169f..bfe8a881ef 100644 --- a/programming_examples/basic/README.md +++ b/programming_examples/basic/README.md @@ -10,14 +10,14 @@ # Basic Programming Examples -These programming examples provide a good starting point to illustrate how to build commonly used compute kernels (both single core and multicore data processing pipelines). They serve to highlight how designs can be described in python and lowered through the mlir-aie tool flow to an executable that runs on the NPU. [Passthrough Kernel](./passthrough_kernel) and [Vector Scalar Mul](./vector_scalar_mul) are good designs to get started with. Please see [section 3](../../programming_guide/section-3/) of the [programming guide](../../programming_guide/) for a more detailed guide on developing designs. +These programming examples provide a good starting point to illustrate how to build commonly used compute kernels (both single-core and multi-core data processing pipelines). They serve to highlight how designs can be described in Python and lowered through the mlir-aie tool flow to an executable that runs on the NPU. [Passthrough Kernel](./passthrough_kernel) and [Vector Scalar Mul](./vector_scalar_mul) are good designs to get started with. Please see [section 3](../../programming_guide/section-3/) of the [programming guide](../../programming_guide/) for a more detailed guide on developing designs. -* [Passthrough DMAs](./passthrough_dmas) - This design demonstrates data movement to implement a memcpy operation using object FIFOs just using DMAs, without involving the AIE core. 
+* [Passthrough DMAs](./passthrough_dmas) - This design demonstrates data movement to implement a memcpy operation using object FIFOs and only DMAs, without involving the AIE core. * [Passthrough Kernel](./passthrough_kernel) - This design demonstrates a simple AIE implementation for vectorized memcpy on a vector of integer involving AIE core kernel programming. * [Vector Scalar Add](./vector_scalar_add) - Single tile performs a very simple `+` operation where the kernel loads data from local memory, increments the value by `1` and stores it back. * [Vector Scalar Mul](./vector_scalar_mul) - Single tile performs `vector * scalar` of size `4096`. The kernel does a `1024` vector multiply and is invoked multiple times to complete the full `vector * scalar` compute. * [Vector Reduce Add](./vector_reduce_add) - Single tile performs a reduction of a vector to return the `sum` of the elements. * [Vector Reduce Max](./vector_reduce_max) - Single tile performs a reduction of a vector to return the `max` of the elements. * [Vector Reduce Min](./vector_reduce_min) - Single tile performs a reduction of a vector to return the `min` of the elements. -* [Vector Exp](./vector_exp) - A simple element wise exponent function, using the look up table capabilities of the AI Engine. +* [Vector Exp](./vector_exp) - A simple element-wise exponent function, using the look up table capabilities of the AI Engine. * [Matrix Multiplication](./matrix_multiplication) - This directory contains multiple designs spanning: single core and multi-core (whole array) matrix-matrix multiplication, and matrix-vector multiplication designs. It also contains sweep infrastructure for benchmarking. \ No newline at end of file diff --git a/programming_examples/basic/dma_transpose/README.md b/programming_examples/basic/dma_transpose/README.md index 32dd7ac3d3..5a73dde0e3 100644 --- a/programming_examples/basic/dma_transpose/README.md +++ b/programming_examples/basic/dma_transpose/README.md @@ -12,8 +12,8 @@ This reference design can be run on a Ryzen™ AI NPU. -In the [design](./aie2.py) a 2-D array in row-major layout is read from external memory to `ComputeTile2` with a transposed layout, -by using an implicit copy via the compute tile's Data Movement Accelerator (DMA). The data is read from and written to external memory through Shim tile (`col`, 0). +In the [design](./aie2.py), a 2-D array in a row-major layout is read from external memory to `ComputeTile2` with a transposed layout, +by using an implicit copy via the compute tile's Data Movement Accelerator (DMA). The data is read from and written to external memory through the Shim tile (`col`, 0). The implicit copy is performed using the `object_fifo_link` operation that specifies how input data arriving via `of_in` should be sent further via `of_out` by specifically leveraging the compute tile's DMA. This operation and its functionality are described in more depth in [Section-2b](../../../programming_guide/section-2/section-2b/README.md/#object-fifo-link) of the programming guide. 
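+As a concrete illustration of the linking mechanism, the following is a minimal sketch in the IRON Python bindings. It shows only a plain forwarding link; the FIFO names, tile coordinates, and object size are illustrative assumptions, the transpose addressing of the actual [design](./aie2.py) is not reproduced, and the type helper `T` is assumed to be imported as in the programming guide examples:
+
+```python
+# Sketch: implicit copy through a compute tile's DMA using two linked object FIFOs.
+from aie.dialects.aie import *                 # IRON / mlir-aie dialect bindings
+from aie.extras.context import mlir_mod_ctx    # mlir-aie context
+
+def passthrough_link_sketch():
+    with mlir_mod_ctx() as ctx:
+
+        @device(AIEDevice.npu)
+        def device_body():
+            memRef_ty = T.memref(1024, T.i32())    # one "object" = 1024 x i32 (assumed size)
+
+            ShimTile = tile(0, 0)                  # moves data to/from external memory
+            ComputeTile2 = tile(0, 2)              # its DMA performs the implicit copy
+
+            # Depth 2 gives ping-pong buffering on each connection.
+            of_in = object_fifo("of_in", ShimTile, ComputeTile2, 2, memRef_ty)
+            of_out = object_fifo("of_out", ComputeTile2, ShimTile, 2, memRef_ty)
+
+            # Link the two FIFOs: data arriving on of_in is forwarded to of_out
+            # by ComputeTile2's DMA, with no code running on the AIE core.
+            object_fifo_link(of_in, of_out)
+
+        print(ctx.module)
+```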
diff --git a/programming_examples/basic/matrix_multiplication/matrix_vector/README.md b/programming_examples/basic/matrix_multiplication/matrix_vector/README.md index b808ec4e32..c01f13c58f 100644 --- a/programming_examples/basic/matrix_multiplication/matrix_vector/README.md +++ b/programming_examples/basic/matrix_multiplication/matrix_vector/README.md @@ -10,7 +10,7 @@ # Matrix Vector Multiplication -One tiles in one or more columns perform a `matrix * vector` multiply on bfloat16 data type where `MxK` is `288x288`. The kernel itself computes `32x32 (MxK)` so it is invoked multiple times to complete the full matvec compute. +One tile in one or more columns performs a `matrix * vector` multiply on bfloat16 data type where `MxK` is `288x288`. The kernel itself computes `32x32 (MxK)` so it is invoked multiple times to complete the full matvec compute. You need c++23 for bfloat16_t support. It can be found in g++-13: https://lindevs.com/install-g-on-ubuntu diff --git a/programming_examples/basic/matrix_multiplication/single_core/README.md b/programming_examples/basic/matrix_multiplication/single_core/README.md index e4b8ad4729..2fe5158e76 100644 --- a/programming_examples/basic/matrix_multiplication/single_core/README.md +++ b/programming_examples/basic/matrix_multiplication/single_core/README.md @@ -10,7 +10,7 @@ # Matrix Multiplication -Single tile performs a `matrix * matrix` multiply on bfloat16 data type where `MxKxN` is `128x128x128`. The kernel itself computes `64x32x64 (MxKxN)` so it is invoked multiple times to complete the full matmul compute. +A single tile performs a `matrix * matrix` multiply on bfloat16 data type where `MxKxN` is `128x128x128`. The kernel itself computes `64x32x64 (MxKxN)` so it is invoked multiple times to complete the full matmul compute. You need c++23 for bfloat16_t support. It can be found in g++-13: https://lindevs.com/install-g-on-ubuntu diff --git a/programming_examples/basic/matrix_multiplication/whole_array/README.md b/programming_examples/basic/matrix_multiplication/whole_array/README.md index f91249721d..bdf9e71778 100644 --- a/programming_examples/basic/matrix_multiplication/whole_array/README.md +++ b/programming_examples/basic/matrix_multiplication/whole_array/README.md @@ -10,7 +10,7 @@ # Matrix Multiplication Array -Multiple tiles in a single column perform a `matrix * matrix` multiply on bfloat16 data type where `MxKxN` is `256x256x256`. The kernel itself computes `64x64x64 (MxKxN)` so it is invoked multiple times to complete the full matmul compute. +Multiple tiles in a single column perform a `matrix * matrix` multiply on bfloat16 data type where `MxKxN` is `256x256x256`. The kernel computes `64x64x64 (MxKxN)` and is invoked multiple times to complete the full matmul compute. You need c++23 for bfloat16_t support. It can be found in g++-13: https://lindevs.com/install-g-on-ubuntu diff --git a/programming_examples/basic/matrix_scalar_add/README.md b/programming_examples/basic/matrix_scalar_add/README.md index 1b50d4af86..c29df4bfaf 100644 --- a/programming_examples/basic/matrix_scalar_add/README.md +++ b/programming_examples/basic/matrix_scalar_add/README.md @@ -10,9 +10,9 @@ # Matrix Scalar Addition -Single tile performs a very simple `+` operation where the kernel loads data from local memory, increments the value by `1` and stores it back. The DMA in the Shim tile is programmed to bring the bottom left `8x16` portion of a larger `16x128` matrix into the tile to perform the operation. 
This reference design can be run on either a RyzenAI NPU or a VCK5000. +A single tile performs a very simple `+` operation where the kernel loads data from local memory, increments the value by `1` and stores it back. The DMA in the Shim tile is programmed to bring the bottom left `8x16` portion of a larger `16x128` matrix into the tile to perform the operation. This reference design can be run on either a Ryzen™ AI NPU or a VCK5000. -The kernel executes on AIE tile (`col`, 2). Input data is brought to the local memory of the tile from Shim tile (`col`, 0). The value of `col` is dependent on whether the application is targetting NPU or VCK5000. The Shim tile is programmed with a 2D DMA to only bring a 2D submatrix into the AIE tile for processing. +The kernel executes on AIE tile (`col`, 2). Input data is brought to the local memory of the tile from Shim tile (`col`, 0). The value of `col` depends on whether the application is targeting NPU or VCK5000. The Shim tile is programmed with a 2D DMA to bring only a 2D submatrix into the AIE tile for processing. To compile and run the design for NPU: ``` diff --git a/programming_examples/basic/passthrough_dmas/README.md b/programming_examples/basic/passthrough_dmas/README.md index d85156df10..b3c2e682aa 100644 --- a/programming_examples/basic/passthrough_dmas/README.md +++ b/programming_examples/basic/passthrough_dmas/README.md @@ -12,7 +12,7 @@ This reference design can be run on a Ryzen™ AI NPU. -In the [design](./aie2.py) data is brought from external memory to `ComputeTile2` and back, without modification from the tile, by using an implicit copy via the compute tile's Data Movement Accelerator (DMA). The data is read from and written to external memory through Shim tile (`col`, 0). +In the [design](./aie2.py) data is brought from external memory to `ComputeTile2` and back, without modification from the tile, by using an implicit copy via the compute tile's Data Movement Accelerator (DMA). The data is read from and written to external memory through the Shim tile (`col`, 0). The implicit copy is performed using the `object_fifo_link` operation that specifies how input data arriving via `of_in` should be sent further via `of_out` by specifically leveraging the compute tile's DMA. This operation and its functionality are described in more depth in [Section-2b](../../../programming_guide/section-2/section-2b/03_Link_Distribute_Join/README.md#object-fifo-link) of the programming guide. diff --git a/programming_examples/basic/passthrough_kernel/README.md b/programming_examples/basic/passthrough_kernel/README.md index a012e66209..ed6dfbf218 100644 --- a/programming_examples/basic/passthrough_kernel/README.md +++ b/programming_examples/basic/passthrough_kernel/README.md @@ -10,11 +10,11 @@ # Passthrough Kernel: -This IRON design flow example, called "Passthrough Kernel", demonstrates a simple AIE implementation for vectorized memcpy on a vector of integers. In this design, a single AIE core performs the memcpy operation on a vector with a default length `4096`. The kernel is configured to work on `1024` element sized subvectors, and is invoked multiple times to complete the full copy. The example consists of two primary design files: `aie2.py` and `passThrough.cc`, and a testbench `test.cpp` or `test.py`. +This IRON design flow example, called "Passthrough Kernel", demonstrates a simple AIE implementation for vectorized memcpy on a vector of integers. 
In this design, a single AIE core performs the memcpy operation on a vector with a default length `4096`. The kernel is configured to work on `1024` element-sized subvectors and is invoked multiple times to complete the full copy. The example consists of two primary design files: `aie2.py` and `passThrough.cc`, and a testbench `test.cpp` or `test.py`. ## Source Files Overview -1. `aie2.py`: A Python script that defines the AIE array structural design using MLIR-AIE operations. This generates MLIR that is then compiled using `aiecc.py` to produce design binaries (ie. XCLBIN and inst.txt for the NPU in Ryzen™ AI). +1. `aie2.py`: A Python script that defines the AIE array structural design using MLIR-AIE operations. The file generates MLIR that is then compiled using `aiecc.py` to produce design binaries (i.e., XCLBIN and inst.txt for the NPU in Ryzen™ AI). 1. `passThrough.cc`: A C++ implementation of vectorized memcpy operations for AIE cores. Found [here](../../../aie_kernels/generic/passThrough.cc). @@ -30,15 +30,15 @@ This simple example effectively passes data through a single compute tile in the 1. An object FIFO called "of_in" connects a Shim Tile to a Compute Tile, and another called "of_out" connects the Compute Tile back to the Shim Tile. 1. The runtime data movement is expressed to read `4096` uint8_t data from host memory to the compute tile and write the `4096` data back to host memory. 1. The compute tile acquires this input data in "object" sized (`1024`) blocks from "of_in" and copies them to another output "object" it has acquired from "of_out". Note that a vectorized kernel running on the Compute Tile's AIE core copies the data from the input "object" to the output "object". -1. After the vectorized copy is performed the Compute Tile releases the "objects" allowing the DMAs (abstracted by the object FIFO) to transfer the data back to host memory and copy additional blocks into the Compute Tile, "of_out" and "of_in" respectively. +1. After the vectorized copy is performed, the Compute Tile releases the "objects", allowing the DMAs (abstracted by the object FIFO) to transfer the data back to host memory and copy additional blocks into the Compute Tile, "of_out" and "of_in" respectively. -It is important to note that the Shim Tile and Compute Tile DMAs move data concurrently, and the Compute Tile's AIE Core is also processing data concurrent with the data movement. This is made possible by expressing depth `2` in declaring, for example, `object_fifo("in", ShimTile, ComputeTile2, 2, memRef_ty)` to denote ping-pong buffers. +It is important to note that the Shim Tile and Compute Tile DMAs move data concurrently, and the Compute Tile's AIE Core also processes data concurrently with the data movement. This is made possible by expressing depth `2` in declaring, for example, `object_fifo("in", ShimTile, ComputeTile2, 2, memRef_ty)` to denote ping-pong buffers. ## Design Component Details ### AIE Array Structural Design -This design performs a memcpy operation on a vector of input data. The AIE design is described in a python module as follows: +This design performs a memcpy operation on a vector of input data. The AIE design is described in a Python module as follows: 1. **Constants & Configuration:** The script defines input/output dimensions (`N`, `n`), buffer sizes in `lineWidthInBytes` and `lineWidthInInt32s`, and tracing support. 
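+To make the ping-pong pattern concrete, a rough sketch of the core's acquire/release loop is shown below. It assumes `of_in`, `of_out`, `ComputeTile2`, `memRef_ty`, and the usual IRON imports (`aie.dialects.aie`, `aie.dialects.scf`, `sys`) are declared as in the design above; the kernel symbol name and argument list are illustrative and may differ from the actual [aie2.py](./aie2.py):
+
+```python
+# Sketch: compute tile core loop for the passthrough kernel, with depth-2
+# (ping-pong) object FIFOs so DMAs and the core can overlap their work.
+passThroughLine = external_func(
+    "passThroughLine",                          # assumed kernel symbol in passThrough.cc
+    inputs=[memRef_ty, memRef_ty, T.i32()],     # input line, output line, line width
+)
+
+@core(ComputeTile2, "passThrough.cc.o")
+def core_body():
+    for _ in for_(sys.maxsize):                 # run indefinitely
+        elem_out = of_out.acquire(ObjectFifoPort.Produce, 1)
+        elem_in = of_in.acquire(ObjectFifoPort.Consume, 1)
+        call(passThroughLine, [elem_in, elem_out, 1024])   # copy one 1024-element object
+        of_in.release(ObjectFifoPort.Consume, 1)   # let the DMA refill the input buffer
+        of_out.release(ObjectFifoPort.Produce, 1)  # let the DMA drain the output buffer
+        yield_([])
+```
+
+Because each object FIFO holds two objects, the Shim Tile and Compute Tile DMAs can be filling or draining one buffer while the core works on the other.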
diff --git a/programming_examples/basic/vector_exp/README.md b/programming_examples/basic/vector_exp/README.md index 7b6fe0eb23..6c13f33578 100644 --- a/programming_examples/basic/vector_exp/README.md +++ b/programming_examples/basic/vector_exp/README.md @@ -11,19 +11,19 @@ # Vector $e^x$ -This example shows how the look up table capability of the AIE can be used to perform approximations to well known functions like $e^x$. +This example shows how the look up table capability of the AIE can be used to perform approximations to well-known functions like $e^x$. This design uses 4 cores, and each core operates on `1024` `bfloat16` numbers. Each core contains a lookup table approximation of the $e^x$ function, which is then used to perform the operation. $e^x$ is typically used in machine learning applications with relatively small numbers, typically around 0..1, and also will return infinity for input values larger than 89, so a small look up table approximation method is often accurate enough compared to a more exact approximation like Taylor series expansion. ## Source Files Overview -1. `aie2.py`: A Python script that defines the AIE array structural design using MLIR-AIE operations. This generates MLIR that is then compiled using `aiecc.py` to produce design binaries (ie. XCLBIN and inst.txt for the NPU in Ryzen™ AI). +1. `aie2.py`: A Python script that defines the AIE array structural design using MLIR-AIE operations. This generates MLIR that is then compiled using `aiecc.py` to produce design binaries (i.e., XCLBIN and inst.txt for the NPU in Ryzen™ AI). -1. `bf16_exp.cc`: A C++ implementation of vectorized table lookup operations for AIE cores. The lookup operation `getExpBf16` operates on vectors of size `16` loading the vectorized accumulator registers with the look up table results. It is then necessary to copy the accumulator register to a regular vector register, before storing back into memory. The source can be found [here](../../../aie_kernels/aie2/bf16_exp.cc). +1. `bf16_exp.cc`: A C++ implementation of vectorized table lookup operations for AIE cores. The lookup operation `getExpBf16` operates on vectors of size `16`, loading the vectorized accumulator registers with the look up table results. It is then necessary to copy the accumulator register to a regular vector register before storing it back into memory. The source can be found [here](../../../aie_kernels/aie2/bf16_exp.cc). 1. `test.cpp`: This C++ code is a testbench for the design example. The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and executing the AIE design on the NPU. After executing, the program verifies the results. -The design also uses a single file from the AIE runtime, in order to initialize the look up table contents to approximate the $e^x$ function. +The design also uses a single file from the AIE runtime to initialize the look up table contents to approximate the $e^x$ function. ## Usage diff --git a/programming_examples/basic/vector_scalar_add/README.md b/programming_examples/basic/vector_scalar_add/README.md index b1cb33333f..5223393ffe 100644 --- a/programming_examples/basic/vector_scalar_add/README.md +++ b/programming_examples/basic/vector_scalar_add/README.md @@ -12,7 +12,7 @@ Single tile performs a very simple `+` operation where the kernel loads data from local memory, increments the value by `1` and stores it back. -The kernel executes on AIE tile (0, 2). 
Input data is brought to the local memory of the tile from Shim tile (0, 0), through Mem tile (0, 1). The size of the input data from the Shim tile is `16xi32`. The data is stored in the Mem tile and sent to the AIE tile in smaller pieces of size `8xi32`. Output data from the AIE tile to the Shim tile follows the same process, in reverse. +The kernel executes on AIE tile (0, 2). Input data is brought to the local memory of the tile from the Shim tile (0, 0) through the Mem tile (0, 1). The size of the input data from the Shim tile is `16xi32`. The data is stored in the Mem tile and sent to the AIE tile in smaller pieces of size `8xi32`. Output data from the AIE tile to the Shim tile follows the same process, in reverse. This example does not contain a C++ kernel file. The kernel is expressed in Python bindings for the `memref` and `arith` dialects that is then compiled with the AIE compiler to generate the AIE core binary. diff --git a/programming_examples/basic/vector_scalar_mul/README.md b/programming_examples/basic/vector_scalar_mul/README.md index 2ee29e2e19..1ba1649ff9 100644 --- a/programming_examples/basic/vector_scalar_mul/README.md +++ b/programming_examples/basic/vector_scalar_mul/README.md @@ -10,11 +10,11 @@ # Vector Scalar Multiplication: -This IRON design flow example, called "Vector Scalar Multiplication", demonstrates a simple AIE implementation for vectorized vector scalar multiply on a vector of integers. In this design, a single AIE core performs the vector scalar multiply operation on a vector with a default length `4096`. The kernel is configured to work on `1024` element sized subvectors, and is invoked multiple times to complete the full scaling. The example consists of two primary design files: `aie2.py` and `scale.cc`, and a testbench `test.cpp` or `test.py`. +This IRON design flow example, called "Vector Scalar Multiplication", demonstrates a simple AIE implementation for vectorized vector scalar multiply on a vector of integers. In this design, a single AIE core performs the vector scalar multiply operation on a vector with a default length `4096`. The kernel is configured to work on `1024` element-sized subvectors, and is invoked multiple times to complete the full scaling. The example consists of two primary design files: `aie2.py` and `scale.cc`, and a testbench `test.cpp` or `test.py`. ## Source Files Overview -1. `aie2.py`: A Python script that defines the AIE array structural design using MLIR-AIE operations. This generates MLIR that is then compiled using `aiecc.py` to produce design binaries (ie. XCLBIN and inst.txt for the NPU in Ryzen™ AI). +1. `aie2.py`: A Python script that defines the AIE array structural design using MLIR-AIE operations. This generates MLIR that is then compiled using `aiecc.py` to produce design binaries (i.e., XCLBIN and inst.txt for the NPU in Ryzen™ AI). 1. `scale.cc`: A C++ implementation of scalar and vectorized vector scalar multiply operations for AIE cores. Found [here](../../../aie_kernels/aie2/scale.cc). @@ -29,16 +29,16 @@ This IRON design flow example, called "Vector Scalar Multiplication", demonstrat This simple example uses a single compute tile in the NPU's AIE array. The design is described as shown in the figure to the right. The overall design flow is as follows: 1. An object FIFO called "of_in" connects a Shim Tile to a Compute Tile, and another called "of_out" connects the Compute Tile back to the Shim Tile. 1. 
The runtime data movement is expressed to read `4096` int32_t data from host memory to the compute tile and write the `4096` data back to host memory. A single int32_t scale factor is also transferred form host memory to the Compute Tile. -1. The compute tile acquires this input data in "object" sized (`1024`) blocks from "of_in" and stores the result to another output "object" it has acquired from "of_out". Note that a scalar or vectorized kernel running on the Compute Tile's AIE core multiplies the data from the input "object" by a scale factor before storing to the output "object". -1. After the compute is performed the Compute Tile releases the "objects" allowing the DMAs (abstracted by the object FIFO) to transfer the data back to host memory and copy additional blocks into the Compute Tile, "of_out" and "of_in" respectively. +1. The compute tile acquires this input data in "object" sized (`1024`) blocks from "of_in" and stores the result to another output "object" it has acquired from "of_out". Note that a scalar or vectorized kernel running on the Compute Tile's AIE core multiplies the data from the input "object" by a scale factor before storing it to the output "object". +1. After the compute is performed, the Compute Tile releases the "objects", allowing the DMAs (abstracted by the object FIFO) to transfer the data back to host memory and copy additional blocks into the Compute Tile, "of_out" and "of_in" respectively. -It is important to note that the Shim Tile and Compute Tile DMAs move data concurrently, and the Compute Tile's AIE Core is also processing data concurrent with the data movement. This is made possible by expressing depth `2` in declaring, for example, `object_fifo("in", ShimTile, ComputeTile2, 2, memRef_ty)` to denote ping-pong buffers. +It is important to note that the Shim Tile and Compute Tile DMAs move data concurrently, and the Compute Tile's AIE Core also processes data concurrently with the data movement. This is made possible by expressing depth `2` in declaring, for example, `object_fifo("in", ShimTile, ComputeTile2, 2, memRef_ty)` to denote ping-pong buffers. ## Design Component Details ### AIE Array Structural Design -This design performs a memcpy operation on a vector of input data. The AIE design is described in a python module as follows: +This design performs a memcpy operation on a vector of input data. The AIE design is described in a Python module as follows: 1. **Constants & Configuration:** The script defines input/output dimensions (`N`, `n`), buffer sizes in `N_in_bytes` and `N_div_n` blocks, the object FIFO buffer depth, and vector vs scalar kernel selection and tracing support booleans. @@ -62,7 +62,7 @@ This design performs a memcpy operation on a vector of input data. The AIE desig ### AIE Core Kernel Code -`scale.cc` contains a C++ implementations of scalar and vectorized vector scalar multiplcation operation designed for AIE cores. It consists of two main sections: +`scale.cc` contains a C++ implementation of scalar and vectorized vector scalar multiplication operation designed for AIE cores. It consists of two main sections: 1. **Scalar Scaling:** The `scale_scalar()` function processes one data element at a time, taking advantage of AIE scalar datapath to load, multiply and store data elements. 
diff --git a/programming_examples/basic/vector_vector_add/README.md b/programming_examples/basic/vector_vector_add/README.md index 2ed5a82605..c7dd75676a 100644 --- a/programming_examples/basic/vector_vector_add/README.md +++ b/programming_examples/basic/vector_vector_add/README.md @@ -10,9 +10,9 @@ # Vector Vector Add -Single tile performs a very simple `+` operations from two vectors loaded into memory. The tile then stores the sum of those two vectors back to external memory. This reference design can be run on either a RyzenAI NPU or a VCK5000. +A single tile performs a very simple `+` operation on two vectors loaded into memory. The tile then stores the sum of those two vectors back to external memory. This reference design can be run on either a Ryzen™ AI NPU or a VCK5000. -The kernel executes on AIE tile (`col`, 2). Both input vectors are brought into the tile from Shim tile (`col`, 0). The value of `col` is dependent on whether the application is targetting NPU or VCK5000. The AIE tile performs the summation operations and the Shim tile brings the data back out to external memory. +The kernel executes on AIE tile (`col`, 2). Both input vectors are brought into the tile from Shim tile (`col`, 0). The value of `col` depends on whether the application is targeting NPU or VCK5000. The AIE tile performs the summation operations, and the Shim tile brings the data back out to external memory. To compile and run the design for NPU: ``` diff --git a/programming_examples/basic/vector_vector_mul/README.md b/programming_examples/basic/vector_vector_mul/README.md index 54ab0bd4e1..331f832033 100644 --- a/programming_examples/basic/vector_vector_mul/README.md +++ b/programming_examples/basic/vector_vector_mul/README.md @@ -10,9 +10,9 @@ # Vector Vector Multiplication -Single tile performs a very simple `*` operations from two vectors loaded into memory. The tile then stores the element wise multiplication of those two vectors back to external memory. This reference design can be run on either a RyzenAI NPU or a VCK5000. +A single tile performs a very simple `*` operation on two vectors loaded into memory. The tile then stores the element-wise multiplication of those two vectors back to external memory. This reference design can be run on either a Ryzen™ AI NPU or a VCK5000. -The kernel executes on AIE tile (`col`, 2). Both input vectors are brought into the tile from Shim tile (`col`, 0). The value of `col` is dependent on whether the application is targetting NPU or VCK5000. The AIE tile performs the multiplication operations and the Shim tile brings the data back out to external memory. +The kernel executes on the AIE tile (`col`, 2). Both input vectors are brought into the tile from the Shim tile (`col`, 0). The value of `col` depends on whether the application targets NPU or VCK5000. The AIE tile performs the multiplication operations, and the Shim tile brings the data back out to external memory. To compile and run the design for NPU: ``` diff --git a/programming_examples/ml/eltwise_add/README.md b/programming_examples/ml/eltwise_add/README.md index 415b0c828c..f801bfe13b 100644 --- a/programming_examples/ml/eltwise_add/README.md +++ b/programming_examples/ml/eltwise_add/README.md @@ -10,14 +10,14 @@ # Eltwise Add -This design implements a `bfloat16` based element wise addition between two vectors, performed in parallel on two cores in a single column. 
This will end up being I/O bound due to the low compute intensity, and in a practical ML implementation, is an example of the type of kernel that is likely best fused onto another more compute dense kernel (e.g. a convolution or GEMM). +This design implements a `bfloat16` based element-wise addition between two vectors, performed in parallel on two cores in a single column. Element-wise addition usually ends up being I/O bound due to the low compute intensity. In a practical ML implementation, it is an example of the type of kernel that is likely best fused onto another more compute-dense kernel (e.g., a convolution or GEMM). Please refer to [bottleneck](../bottleneck/) design on fusing element-wise addition with convolution for the skip addition. ## Source Files Overview -1. `aie2.py`: A Python script that defines the AIE array structural design using MLIR-AIE operations. This generates MLIR that is then compiled using `aiecc.py` to produce design binaries (ie. XCLBIN and inst.txt for the NPU in Ryzen AI). +1. `aie2.py`: A Python script that defines the AIE array structural design using MLIR-AIE operations. This generates MLIR that is then compiled using `aiecc.py` to produce design binaries (i.e., XCLBIN and inst.txt for the NPU in Ryzen™ AI). -1. `add.cc`: A C++ implementation of a vectorized vector addition operation for AIE cores. The code uses the AIE API which is a C++ header-only library providing types and operations that get translated into efficient low-level intrinsics, and whose documentation can be found [here](https://www.xilinx.com/htmldocs/xilinx2023_2/aiengine_api/aie_api/doc/index.html). The source can be found [here](../../../aie_kernels/aie2/add.cc). +1. `add.cc`: A C++ implementation of a vectorized vector addition operation for AIE cores. The code uses the AIE API, which is a C++ header-only library providing types and operations that get translated into efficient low-level intrinsics, and whose documentation can be found [here](https://www.xilinx.com/htmldocs/xilinx2023_2/aiengine_api/aie_api/doc/index.html). The source can be found [here](../../../aie_kernels/aie2/add.cc). 1. `test.cpp`: This C++ code is a testbench for the design example. The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and executing the AIE design on the NPU. After executing, the script verifies the memcpy results and optionally outputs trace data. diff --git a/programming_examples/ml/eltwise_mul/README.md b/programming_examples/ml/eltwise_mul/README.md index 9dc531705c..1db8cde361 100644 --- a/programming_examples/ml/eltwise_mul/README.md +++ b/programming_examples/ml/eltwise_mul/README.md @@ -10,14 +10,14 @@ # Eltwise Multiplication -This design implements a `bfloat16` based element wise multiplication between two vectors, performed in parallel on two cores in a single column. This will end up being I/O bound due to the low compute intensity, and in a practical ML implementation, is an example of the type of kernel that is likely best fused onto another more compute dense kernel (e.g. a convolution or GEMM). +This design implements a `bfloat16` based element-wise multiplication between two vectors, performed in parallel on two cores in a single column. Element-wise multiplication usually ends up being I/O bound due to the low compute intensity. In a practical ML implementation, it is an example of the type of kernel that is likely best fused onto another more compute-dense kernel (e.g., a convolution or GEMM). 
## Source Files Overview -1. `aie2.py`: A Python script that defines the AIE array structural design using MLIR-AIE operations. This generates MLIR that is then compiled using `aiecc.py` to produce design binaries (ie. XCLBIN and inst.txt for the NPU in Ryzen AI). +1. `aie2.py`: A Python script that defines the AIE array structural design using MLIR-AIE operations. This generates MLIR that is then compiled using `aiecc.py` to produce design binaries (i.e., XCLBIN and inst.txt for the NPU in Ryzen™ AI). -1. `add.cc`: A C++ implementation of a vectorized vector multiplication operation for AIE cores. The code uses the AIE API which is a C++ header-only library providing types and operations that get translated into efficient low-level intrinsics, and whose documentation can be found [here](https://www.xilinx.com/htmldocs/xilinx2023_2/aiengine_api/aie_api/doc/index.html). The source can be found [here](../../../aie_kernels/aie2/add.cc). +1. `add.cc`: A C++ implementation of a vectorized vector multiplication operation for AIE cores. The code uses the AIE API, which is a C++ header-only library providing types and operations that get translated into efficient low-level intrinsics, and whose documentation can be found [here](https://www.xilinx.com/htmldocs/xilinx2023_2/aiengine_api/aie_api/doc/index.html). The source can be found [here](../../../aie_kernels/aie2/add.cc). 1. `test.cpp`: This C++ code is a testbench for the design example. The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and executing the AIE design on the NPU. After executing, the script verifies the memcpy results and optionally outputs trace data. diff --git a/programming_examples/ml/relu/README.md b/programming_examples/ml/relu/README.md index 6d093c6bd8..c450bb8b24 100644 --- a/programming_examples/ml/relu/README.md +++ b/programming_examples/ml/relu/README.md @@ -13,20 +13,20 @@ ReLU, which stands for Rectified Linear Unit, is a type of activation function that is widely used in neural networks, particularly in deep learning models. It is defined mathematically as: $ReLU(x) = max(0,x)$ -This function takes a single number as input and outputs the maximum of zero and the input number. Essentially, it passes positive values through unchanged, and clamps all the negative values to zero. +This function takes a single number as input and outputs the maximum of zero and the input number. Essentially, it passes positive values through unchanged, and clamps all the negative values to zero. Please refer to [conv2d_fused_relu](../conv2d_fused_relu/) design on fusing ReLU with convolution. ## Key Characteristics of ReLU: * Non-linear: While it looks like a linear function, ReLU introduces non-linearity into the model, which is essential for learning complex patterns in data. * Computational Efficiency: One of ReLU's biggest advantages is its computational simplicity. Unlike other activation functions like sigmoid or tanh, ReLU does not involve expensive operations (e.g., exponentials), which makes it computationally efficient and speeds up the training and inference processes. -This design implements a `bfloat16` based ReLU on a vector, performed in parallel on two cores in a single column. This will end up being I/O bound due to the low compute intensity, and in a practical ML implementation, is an example of the type of kernel that is likely best fused onto another more compute dense kernel (e.g. a convolution or GEMM). 
+This design implements a `bfloat16` based ReLU on a vector, performed in parallel on two cores in a single column. This will end up being I/O bound due to the low compute intensity, and in a practical ML implementation, is an example of the type of kernel that is likely best fused onto another more compute-dense kernel (e.g., a convolution or GEMM). ## Source Files Overview -1. `aie2.py`: A Python script that defines the AIE array structural design using MLIR-AIE operations. This generates MLIR that is then compiled using `aiecc.py` to produce design binaries (ie. XCLBIN and inst.txt for the NPU in Ryzen AI). +1. `aie2.py`: A Python script that defines the AIE array structural design using MLIR-AIE operations. This generates MLIR that is then compiled using `aiecc.py` to produce design binaries (i.e., XCLBIN and inst.txt for the NPU in Ryzen™ AI). -1. `relu.cc`: A C++ implementation of a vectorized ReLU operation for AIE cores, which is 1:1 implementation of the inherent function using low level intrinsics. The AIE2 allows an element-wise max of 32 `bfloat16` numbers against a second vector register containing all zeros, implementing the $ReLU(x) = max(0,x)$ function directly. The source can be found [here](../../../aie_kernels/aie2/relu.cc). +1. `relu.cc`: A C++ implementation of a vectorized ReLU operation for AIE cores, which is a 1:1 implementation of the inherent function using low-level intrinsics. The AIE2 allows an element-wise max of 32 `bfloat16` numbers against a second vector register containing all zeros, implementing the $ReLU(x) = max(0,x)$ function directly. The source can be found [here](../../../aie_kernels/aie2/relu.cc). 1. `test.cpp`: This C++ code is a testbench for the design example. The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and executing the AIE design on the NPU. After executing, the script verifies the memcpy results and optionally outputs trace data. diff --git a/programming_examples/ml/resnet/README.md b/programming_examples/ml/resnet/README.md index 5bb146b006..39d863d687 100755 --- a/programming_examples/ml/resnet/README.md +++ b/programming_examples/ml/resnet/README.md @@ -55,9 +55,9 @@ The below figures shows our implementation of the conv2_x layers of the ResNet a

-Similar to our [bottleneck design](../../bottleneck), we implement conv2_x layers depth-first. Our implementation connects the output of one bottleneck block on an NPU column to another on a separate column, all without the necessity of transferring intermediate results off-chip. Compared to [bottleneck design](../../bottleneck), the first bottleneck block in the conv2_x stage requires an additional 1x1 convolution on the `AIE (0,4)` tile to handle channel mismatch for the skip addition between the input from the skip path and the input from the non-skip path. This mismatch arises because the initial input activation transferred from the skip path possesses fewer input channels compared to the output on the non-skip path. To overcome this issue, an additional 1x1 convolution is introduced in the skip path that the increases the number of channels. +Similar to our [bottleneck](../../bottleneck) design, we implement conv2_x layers depth-first. Our implementation connects the output of one bottleneck block on an NPU column to another on a separate column, all without the necessity of transferring intermediate results off-chip. Compared to the [bottleneck](../../bottleneck) design, the first bottleneck block in the conv2_x stage requires an additional 1x1 convolution on the `AIE (0,4)` tile to handle channel mismatch for the skip addition between the input from the skip path and the input from the non-skip path. This mismatch arises because the initial input activation transferred from the skip path possesses fewer input channels compared to the output on the non-skip path. To overcome this issue, an additional 1x1 convolution is introduced in the skip path that increases the number of channels. -After the initial processing in the first bottleneck block, the output is sent directly to the second bottleneck block on a separate NPU column. The output activation is broadcasted to both `AIE (1,5)` and `AIE (1,3)` via `Mem Tile (1,1)`. The second bottleneck's processing proceeds as described in [bottleneck design](../../bottleneck). Similarly, the subsequent bottleneck block requires the output from the second bottleneck, avoiding any need to send intermediate activations off-chip. Upon processing in the third bottleneck block, the final output is transmitted from tile `AIE (2,4)` back to the output via `Shim tile (2,0)`, completing the seamless flow of computation within the NPU architecture. Thus, our depth-first implementation avoids any unnecessary off-chip data movement for intermediate tensors. +After the initial processing in the first bottleneck block, the output is sent directly to the second bottleneck block on a separate NPU column. The output activation is broadcasted to both `AIE (1,5)` and `AIE (1,3)` via `Mem Tile (1,1)`. The second bottleneck's processing proceeds as described in the [bottleneck](../../bottleneck) design. Similarly, the subsequent bottleneck block requires the output from the second bottleneck, avoiding any need to send intermediate activations off-chip. Upon processing in the third bottleneck block, the final output is transmitted from tile `AIE (2,4)` back to the output via `Shim tile (2,0)`, completing the seamless flow of computation within the NPU architecture. Thus, our depth-first implementation avoids any unnecessary off-chip data movement for intermediate tensors. 
diff --git a/programming_examples/ml/softmax/README.md b/programming_examples/ml/softmax/README.md index 0c17a1c50c..a413993996 100644 --- a/programming_examples/ml/softmax/README.md +++ b/programming_examples/ml/softmax/README.md @@ -36,17 +36,17 @@ The softmax function is a mathematical function commonly used in machine learnin The softmax function employs the exponential function $e^x$, similar to the example found [here](../../basic/vector_exp/). Again to efficiently implement softmax, a lookup table approximation is utilized. -In addition, and unlike any of the other current design examples, this example uses MLIR dialects as direct input, including the `vector`,`affine`,`arith` and `math` dialects. This is shown in the [source](./bf16_softmax.mlir). This is intended to be generated from a higher level description, but is shown here as an example of how you can use other MLIR dialects as input. +In addition, and unlike any of the other current design examples, this example uses MLIR dialects as direct input, including the `vector`,`affine`,`arith` and `math` dialects. This is shown in the [source](./bf16_softmax.mlir). This is intended to be generated from a higher-level description but is shown here as an example of how you can use other MLIR dialects as input. The compilation process is different from the other design examples, and is shown in the [Makefile](./Makefile). 1. The input MLIR is first vectorized into chunks of size 16, and a C++ file is produced which has mapped the various MLIR dialects into AIE intrinsics, including vector loads and stores, vectorized arithmetic on those registers, and the $e^x$ approximation using look up tables 1. This generated C++ is compiled into a first object file -1. A file called `lut_based_ops.cpp` from the AIE2 runtime libary is compiled into a second object file. This file contains the look up table contents to approximate the $e^x$ function. +1. A file called `lut_based_ops.cpp` from the AIE2 runtime library is compiled into a second object file. This file contains the look up table contents to approximate the $e^x$ function. 1. A wrapper file is also compiled into an object file, which prevents C++ name mangling, and allows the wrapped C function to be called from the strucural Python -1. These 3 object files and combined into a single .a file, which is then referenced inside the aie2.py structural Python. +1. These 3 object files are combined into a single .a file, which is then referenced inside the aie2.py structural Python. -This is a slightly more complex process than the rest of the examples, which typically only use a single object file containing the wrapped C++ function call, but is provided to show how a library based flow can also be used. +This is a slightly more complex process than the rest of the examples, which typically only use a single object file containing the wrapped C++ function call, but is provided to show how a library-based flow can also be used. ## Usage diff --git a/programming_examples/vision/color_detect/README.md b/programming_examples/vision/color_detect/README.md index f2f24dbea6..96558d3774 100644 --- a/programming_examples/vision/color_detect/README.md +++ b/programming_examples/vision/color_detect/README.md @@ -20,13 +20,13 @@ The pipeline is mapped onto a single column of the npu device, with one Shim til width="1150">

-The data movement of this pipeline is described using the ObjectFifo (OF) primitive. Input data is brought into the array via the Shim tile. The data then needs to be broadcasted both to AIE tile (0, 2) and AIE tile (0, 5). However, tile (0, 5) has to wait for additional data from the other kernels before it can proceed with its execution, so in order to avoid any stalls in the broadcast, data for tile (0, 5) is instead buffered in the Mem tile. Because of the size of the data, the buffering couldn't directly be done in the smaller L1 memory module of tile (0, 5). This is described using two OFs, one for the broadcast to tile (0, 2) and the Mem tile, and one for the data movement between the Mem tile and tile (0, 5). The two OFs are linked to express that data from the first OF should be copied to the second OF implicitly through the Mem tile's DMA. +The data movement of this pipeline is described using the ObjectFifo (OF) primitive. Input data is brought into the array via the Shim tile. The data then needs to be broadcasted both to AIE tile (0, 2) and AIE tile (0, 5). However, tile (0, 5) has to wait for additional data from the other kernels before it can proceed with its execution, so in order to avoid any stalls in the broadcast, data for tile (0, 5) is instead buffered in the Mem tile. Because of the size of the data, the buffering could not be done directly in the smaller L1 memory module of tile (0, 5). This is described using two OFs, one for the broadcast to tile (0, 2) and the Mem tile, and one for the data movement between the Mem tile and tile (0, 5). The two OFs are linked to express that data from the first OF should be copied to the second OF implicitly through the Mem tile's DMA. Starting from tile (0, 2) data is processed by each compute tile and the result is sent to the next tile. This is described by a series of one-to-one OFs. An OF also describes the broadcast from tile (0, 2) to tiles (0, 3) and (0, 4). As the three kernels `bitwiseOR`, `gray2rgba` and `bitwiseAND` are mapped together on AIE tile (0, 5), two OFs are also created with tile (0, 5) being both their source and destination to describe the data movement between the three kernels. Finally, the output is sent from tile (0, 5) to the Mem tile and then back to the output through the Shim tile. -To compile desing in Windows: +To compile design in Windows: ``` make make colorDetect.exe diff --git a/programming_guide/README.md b/programming_guide/README.md index df0471ebc0..6adf1dda44 100644 --- a/programming_guide/README.md +++ b/programming_guide/README.md @@ -12,9 +12,9 @@ -The AI Engine (AIE) array is a spatial compute architecture: a modular and scalable system with spatially distributed compute and memories. Its compute dense vector processing runs independently and concurrently to explicitly scheduled data movement. Since the vector compute core (green) of each AIE can only operate on data in its L1 scratchpad memory (light blue), data movement accelerators (purple) bi-directionally transport this data over a switched (dark blue) interconnect network, from any level in the memory hierarchy. +The AI Engine (AIE) array is a spatial compute architecture: a modular and scalable system with spatially distributed compute and memories. Its compute-dense vector processing runs independently and concurrently to explicitly scheduled data movement. 
Since the vector compute core (green) of each AIE can only operate on data in its L1 scratchpad memory (light blue), data movement accelerators (purple) bi-directionally transport this data over a switched (dark blue) interconnect network from any level in the memory hierarchy. -Programming the AIE-array configures all its spatial building blocks: the compute cores' program memory, the data movers' buffer descriptors, interconnect with switches, etc. This guide introduces our Interface Representation for hands-ON (IRON) close-to-metal programming of the AIE-array. IRON is an open access toolkit enabling performance engineers to build fast and efficient, often specialized designs through a set of Python language bindings around mlir-aie, our MLIR-based representation of the AIE-array. mlir-aie provides the foundation from which complex and performant AI Engine designs can be defined and is supported by simulation and hardware implementation infrastructure. +Programming the AIE-array configures all its spatial building blocks: the compute cores' program memory, the data movers' buffer descriptors, interconnect with switches, etc. This guide introduces our Interface Representation for hands-ON (IRON) close-to-metal programming of the AIE-array. IRON is an open-access toolkit enabling performance engineers to build fast and efficient, often specialized designs through a set of Python language bindings around mlir-aie, our MLIR-based representation of the AIE-array. mlir-aie provides the foundation from which complex and performant AI Engine designs can be defined and is supported by simulation and hardware implementation infrastructure. > **NOTE:** For those interested in better understanding how AI Engine designs are defined at the MLIR level, take a look through the [MLIR tutorial](../mlir_tutorials/) material. mlir-aie also serves as a lower layer for other higher-level abstraction MLIR layers such as [mlir-air](https://github.com/Xilinx/mlir-air). @@ -24,16 +24,16 @@ This IRON AIE programming guide first introduces the language bindings for AIE-a
Section 0 - Getting Set Up for IRON * Introduce recommended hardware to target with IRON -* Simple instructions to set up your hardware, tools and environment +* Simple instructions to set up your hardware, tools, and environment
Section 1 - Basic AI Engine building blocks * Introduce the AI Engine building blocks for expressing an application design -* Give example of python bindings for MLIR source that definre AIE tiles +* Give an example of Python bindings for MLIR source that define AIE tiles
Section 2 - Data Movement (Object FIFOs) -* Introduce topic of objectfifos and how they abstract connections between tiles and data in the AIE array memories +* Introduce the topic of objectfifos and how they abstract connections between tiles and data in the AIE array memories * Explain key objectfifo data movement patterns * Introduce more complex objectfifo connection patterns (link/ broadcast, join/ distribute) * Demonstrate objectfifos with practical examples @@ -41,18 +41,18 @@ This IRON AIE programming guide first introduces the language bindings for AIE-a
Section 3 - My First Program -* Introduce example of first simple program (Vector Scalar Multiplication) -* Illustrate how to run designs on Ryzen™ AI enabled hardware +* Introduce an example of the first simple program (Vector Scalar Multiplication) +* Illustrate how to run designs on Ryzen™ AI-enabled hardware
-
Section 4 - Vector programming & Peformance Measurement +
Section 4 - Vector programming & Performance Measurement -* Discuss topic of vector programming at the kernel level +* Discuss the topic of vector programming at the kernel level * Introduce performance measurement (trace) and how we measure cycle count and efficiency * Performant Vector Scalar Multiplication design example
Section 5 - Example Vector Designs -* Introduce additional vector design examples with exercises to measure performance on each +* Introduce additional vector design examples with exercises to measure their performance: * Passthrough * Vector $e^x$ * Vector Scalar Addition diff --git a/programming_guide/section-0/README.md b/programming_guide/section-0/README.md index c4b5558927..a9f98fbf92 100644 --- a/programming_guide/section-0/README.md +++ b/programming_guide/section-0/README.md @@ -10,7 +10,7 @@ # Section 0 - Getting Set Up for IRON -This programming guide focuses on application programming for the NPU found in Ryzen™ AI laptops and mini PCs. The latest information on Ryzen AI CPUs can be found [here](https://www.amd.com/en/products/processors/consumer/ryzen-ai.html). +This programming guide focuses on application programming for the NPU found in Ryzen™ AI laptops and mini PCs. The latest information on Ryzen™ AI CPUs can be found [here](https://www.amd.com/en/products/processors/consumer/ryzen-ai.html). ## Recommended Hardware diff --git a/programming_guide/section-1/README.md b/programming_guide/section-1/README.md index 3444d9e392..04888e0f7c 100644 --- a/programming_guide/section-1/README.md +++ b/programming_guide/section-1/README.md @@ -10,10 +10,10 @@ # Section 1 - Basic AI Engine building blocks -When we program the AIE-array, we need to declare and configure its structural building blocks: compute tiles for vector processing, memory tiles as larger level-2 shared scratchpads, and shim tiles supporting data movement to external memory. In this programming guide, we will be utilizing the IRON Python bindings for MLIR-AIE components to describe our design at the tile level of granularity. Later on, when we focus on kernel programming, we will explore vector programming in C/C++. But let's first look at a basic Python source file (named [aie2.py](./aie2.py)) for an IRON design. +When we program the AIE-array, we need to declare and configure its structural building blocks: compute tiles for vector processing, memory tiles as larger level-2 shared scratchpads, and shim tiles supporting data movement to external memory. In this programming guide, we will utilize the IRON Python bindings for MLIR-AIE components to describe our design at the tile level of granularity. Later on, we will explore vector programming in C/C++ when we focus on kernel programming. But let's first look at a basic Python source file (named [aie2.py](./aie2.py)) for an IRON design. ## Walkthrough of Python source file (aie2.py) -At the top of this Python source, we include modules that define the IRON AIE language bindings `aie.dialects.aie` and the mlir-aie context `aie.extras.context` which binds to MLIR definitions for AI Engines. +At the top of this Python source, we include modules that define the IRON AIE language bindings `aie.dialects.aie` and the mlir-aie context `aie.extras.context`, which binds to MLIR definitions for AI Engines. ``` from aie.dialects.aie import * # primary mlir-aie dialect definitions @@ -23,9 +23,9 @@ Then we declare a structural design function that will expand into MLIR code whe ``` # AI Engine structural design function def mlir_aie_design(): - <... AI Engine device, blocks and connections ...> + <... AI Engine device, blocks, and connections ...> ``` -Let's look at how we declare the AI Engine device, blocks and connections. We start off by declaring our AIE device via `@device(AIEDevice.npu)` or `@device(AIEDevice.xcvc1902)`. 
The blocks and connections themselves will then be declared inside the `def device_body():`. Here, we instantiate our AI Engine blocks, which in this first example are AIE compute tiles.
+Let's look at how we declare the AI Engine device, blocks, and connections. We start off by declaring our AIE device via `@device(AIEDevice.npu)` or `@device(AIEDevice.xcvc1902)`. The blocks and connections themselves will then be declared inside the `def device_body():`. Here, we instantiate our AI Engine blocks, which are AIE compute tiles in this first example.
The arguments for the tile declaration are the tile coordinates (column, row). We assign each declared tile to a variable in our Python program.
@@ -41,7 +41,7 @@ The arguments for the tile declaration are the tile coordinates (column, row). W
    ComputeTile2 = tile(2, 3)
    ComputeTile3 = tile(2, 4)
```
-Once we are done declaring our blocks (and connections) within our design function, we move onto the main body of our program where we call the function and output our design in MLIR. This is done by first declaring the MLIR context via the `with mlir_mod_ctx() as ctx:` line. This indicates that subsequent indented Python code is in the MLIR context and we follow this by calling our previosly defined design function `mlir_aie_design()`. This means all the code within the design function is understood to be in the MLIR context and contains the IRON custom Python binding definitions of the more detailed MLIR block definitions. The final line is `print(ctx.module)` which takes the code defined in our MLIR context and prints it stdout. This will then convert our Python-bound code to its MLIR equivalent and print it to stdout.
+Once we are done declaring our blocks (and connections) within our design function, we move on to the main body of our program, where we call the function and output our design in MLIR. This is done by first declaring the MLIR context via the `with mlir_mod_ctx() as ctx:` line. This indicates that subsequent indented Python code is in the MLIR context, and we follow this by calling our previously defined design function `mlir_aie_design()`. This means all the code within the design function is understood to be in the MLIR context and contains the IRON custom Python binding definitions of the more detailed MLIR block definitions. The final line is `print(ctx.module)`, which converts the Python-bound code defined in our MLIR context to its MLIR equivalent and prints it to stdout.
```
# Declares that subsequent code is in mlir-aie context
with mlir_mod_ctx() as ctx:
@@ -50,7 +50,7 @@ with mlir_mod_ctx() as ctx:
## Other Tile Types
-Next to the compute tiles, an AIE-array also contains data movers for accessing L3 memory (also called shim DMAs) and larger L2 scratchpads (called mem tiles) which are available since the AIE-ML generation - see [the introduction of this programming guide](../README.md). Declaring these other types of structural blocks follows the same syntax but requires physical layout details for the specific target device. Shim DMAs typically occupy row 0, while mem tiles (when available) often reside on row 1. The following code segment declares all the different tile types found in a single NPU column.
+Next to the compute tiles, an AIE-array also contains data movers for accessing L3 memory (also called shim DMAs) and larger L2 scratchpads (called mem tiles), which have been available since the AIE-ML generation - see [the introduction of this programming guide](../README.md). Declaring these other types of structural blocks follows the same syntax but requires physical layout details for the specific target device. Shim DMAs typically occupy row 0, while mem tiles (when available) often reside on row 1. The following code segment declares all the different tile types found in a single NPU column. ``` # Device declaration - here using aie2 device NPU @@ -67,11 +67,11 @@ Next to the compute tiles, an AIE-array also contains data movers for accessing ``` ## Exercises -1. To run our Python program from the command line, we type `python3 aie2.py` which converts our Python structural design into MLIR source code. This works from the command line if our design environment already contains the mlir-aie Python-bound dialect module. We included this in the [Makefile](./Makefile), so go ahead and run `make` now. Then take a look at the generated MLIR source under `build/aie.mlir`. +1. To run our Python program from the command line, we type `python3 aie2.py`, which converts our Python structural design into MLIR source code. This works from the command line if our design environment already contains the mlir-aie Python-bound dialect module. We included this in the [Makefile](./Makefile), so go ahead and run `make` now. Then take a look at the generated MLIR source under `build/aie.mlir`. -2. Run `make clean` to remove the generated files. Then introduce an error to the Python source such as misspelling `tile` to `tilex` and then run `make` again. What messages do you see? +2. Run `make clean` to remove the generated files. Then introduce an error to the Python source, such as misspelling `tile` to `tilex`, and then run `make` again. What messages do you see? -3. Run `make clean` again. Now change the error by renaming `tilex` back to `tile`, but change the coordinates to (-1,3) which is an invalid location. Run `make` again. What messages do you see now? +3. Run `make clean` again. Now change the error by renaming `tilex` back to `tile`, but change the coordinates to (-1,3), which is an invalid location. Run `make` again. What messages do you see now? 4. No error is generated but our code is invalid. Take a look at the generated MLIR code under `build/aie.mlir`. This generated output is invalid MLIR syntax and running our mlir-aie tools on this MLIR source will generate an error. We do, however, have some additional Python structural syntax checks that can be enabled if we use the function `ctx.module.operation.verify()`. This verifies that our Python-bound code has valid operation within the mlir-aie context. diff --git a/programming_guide/section-2/README.md b/programming_guide/section-2/README.md index 0679fb2cb1..199280d203 100644 --- a/programming_guide/section-2/README.md +++ b/programming_guide/section-2/README.md @@ -13,12 +13,12 @@ In this section of the programming guide, we introduce the Object FIFO high-level communication primitive used to describe the data movement within the AIE array. At the end of this guide you will: 1. have a high-level understanding of the communication primitive API, 2. have learned how to initialize and access an Object FIFO through meaningful design examples, -3. 
understand the design decisions which led to current limitations and/or restrictions in the Object FIFO design,
+3. understand the design decisions that led to current limitations and/or restrictions in the Object FIFO design,
4. know where to find more in-depth material of the Object FIFO implementation and lower-level lowering.
-To understand the need for a data movement abstraction we must first understand the hardware architecture with which we are working. The AIE array is a [spatial compute architecture](../README.md) with explicit data movement requirements. Each compute unit of the array works on data that is stored within its L1 memory module and that data needs to be explicitly moved there as part of the AIE's array global data movement configuration. This configuration involves several specialized hardware resources which handle the data movement over the entire array in such a way that data arrives at its destination without loss. The Object FIFO provides users with a way to specify the data movement in a more human-comprehensible and accessible manner, without sacrificing some of the more advanced control possibilities which the hardware provides.
+To understand the need for a data movement abstraction, we must first understand the hardware architecture with which we are working. The AIE array is a [spatial compute architecture](../README.md) with explicit data movement requirements. Each compute unit of the array works on data that is stored within its L1 memory module, and that data needs to be explicitly moved there as part of the AIE array's global data movement configuration. This configuration involves several specialized hardware resources that handle the data movement over the entire array in such a way that data arrives at its destination without loss. The Object FIFO provides users with a way to specify the data movement in a more human-comprehensible and accessible manner, without sacrificing some of the more advanced control possibilities which the hardware provides.
-> **NOTE:** For more in-depth, low-level material on Object FIFO programming in MLIR please see the MLIR-AIE [tutorials](../mlir_tutorials).
+> **NOTE:** For more in-depth, low-level material on Object FIFO programming in MLIR, please see the MLIR-AIE [tutorials](../mlir_tutorials).
This guide is split into five sections, where each section builds on top of the previous ones:
> **NOTE:** Section 2e contains several practical code examples with common design patterns using the Object FIFO which can be quickly picked up and tweaked for desired use.
diff --git a/programming_guide/section-2/section-2a/README.md b/programming_guide/section-2/section-2a/README.md
index 0876393bf3..2f254e5e69 100644
--- a/programming_guide/section-2/section-2a/README.md
+++ b/programming_guide/section-2/section-2a/README.md
@@ -65,7 +65,7 @@
def acquire(self, port, num_elem)
```
Based on the `num_elem` input representing the number of acquired elements, the acquire function will either directly return an object, or an array of objects.
-The Object FIFO is an ordered primitive and the API keeps track for each process which object is the next one that they will have access to when acquiring, based on how many they have already acquired and released.
Specifically, the first time a process acquires an object it will have access to the first object of the Object FIFO, and after releasing it and acquiring a new one, it'll have access to the second object, and so on until the last object, after which the order starts from the first one again. When acquiring multiple objects and accessing them in the returned array, the object at index 0 will always be the oldest object that that process has access to, which may not be the first object in the pool of that Object FIFO.
+The Object FIFO is an ordered primitive and the API keeps track, for each process, of which object is the next one it will have access to when acquiring, based on how many objects it has already acquired and released. Specifically, the first time a process acquires an object it will have access to the first object of the Object FIFO, and after releasing it and acquiring a new one, it'll have access to the second object, and so on until the last object, after which the order starts from the first one again. When acquiring multiple objects and accessing them in the returned array, the object at index 0 will always be the oldest object that process has access to, which may not be the first object in the pool of that Object FIFO.
To release one or multiple objects users should use the release function of the `object_fifo` class:
```python
@@ -100,7 +100,7 @@ def core_body():
        of0.release(ObjectFifoPort.Consume, 3)
```
-The figure below illustrates this code: Each of the 4 drawings represents the state of the system during one iteration of execution. In the first three iterations, the producer process on tile A, drawn in blue, progressively acquires the elements of `of0` one by one. Once the third element has been released in the forth iteration, the consumer process on tile B, drawn in green, is able to acquire all three objects at once.
+The figure below illustrates this code: Each of the 4 drawings represents the state of the system during one iteration of execution. In the first three iterations, the producer process on tile A, drawn in blue, progressively acquires the elements of `of0` one by one. Once the third element has been released in the fourth iteration, the consumer process on tile B, drawn in green, is able to acquire all three objects at once.
@@ -132,7 +132,7 @@ As was mentioned in the beginning of this section, the AIE architecture is a spa
A more in-depth, yet still abstract, view of the Object FIFO's depth is that the producer and each consumer have their own working resource pool available in their local memory modules which they can use to send and receive data in relation to the data movement described by the Object FIFO. The Object FIFO primitive and its lowering typically allocate the depth of each of these pools such that the resulting behaviour matches that of the conceptual depth.
-The user does however have the possibility to manually choose the depth of these pools. This feature is available because, while the Object FIFO primitive tries to offer a unified representation of the data movement across the AIE array, it also aims to provide performance programmers with the tools to more finely control it.
+The user does, however, have the possibility to manually choose the depth of these pools. This feature is available because, while the Object FIFO primitive tries to offer a unified representation of the data movement across the AIE array, it also aims to provide performance programmers with the tools to control it more finely.
For example, in the code snippet below `of0` describes the data movement between producer A and consumer B: ```python diff --git a/programming_guide/section-2/section-2b/01_Reuse/README.md b/programming_guide/section-2/section-2b/01_Reuse/README.md index 5ffb69d2d4..c518547ff1 100644 --- a/programming_guide/section-2/section-2b/01_Reuse/README.md +++ b/programming_guide/section-2/section-2b/01_Reuse/README.md @@ -10,7 +10,7 @@ # Object FIFO Reuse Pattern -In the previous [section](../../section-2a/README.md#accessing-the-objects-of-an-object-fifo) it was mentioned that the Object FIFO acquire and release functions can be paired together to achieve the behaviour of a sliding window with data reuse. Specifically, this communication pattern occurs when a producer or a consumer of an Object FIFO releases less objects than it had previously acquired. As acquiring from an Object FIFO does not destroy the data, unreleased objects can continue to be used without requiring new copies of the data. +In the previous [section](../../section-2a/README.md#accessing-the-objects-of-an-object-fifo) it was mentioned that the Object FIFO acquire and release functions can be paired together to achieve the behaviour of a sliding window with data reuse. Specifically, this communication pattern occurs when a producer or a consumer of an Object FIFO releases fewer objects than it had previously acquired. As acquiring from an Object FIFO does not destroy the data, unreleased objects can continue to be used without requiring new copies of the data. It is important to note that each new acquire function will return a new object or array of objects that a process can access, which **includes unreleased objects from previous acquire calls**. The process should always use the result of the **most recent** acquire call to access unreleased objects to ensure a proper lowering through the Object FIFO primitive. diff --git a/programming_guide/section-2/section-2c/README.md b/programming_guide/section-2/section-2c/README.md index 8c5c36480d..7b9b02219b 100644 --- a/programming_guide/section-2/section-2c/README.md +++ b/programming_guide/section-2/section-2c/README.md @@ -25,7 +25,7 @@ While the Object FIFO primitive aims to reduce the complexity tied to data movem Tile DMAs interact directly with the memory modules of their tiles and are responsible for pushing and retrieving data to and from the AXI stream interconnect. When data is pushed onto the stream, the user can program the DMA's n-dimensional address generation scheme such that the data's layout when pushed may be different than how it is stored in the tile's local memory. In the same way, a user can also specify in what layout a DMA should store the data retrieved from the AXI stream. -DMA blocks contain buffer descriptor operations that summarize what data is being moved, from what offset, how much of it, and in what layout. These buffer descriptors are the `AIE_DMABDOp` operations in MLIR and have their own auto-generated python binding (available under `/python/aie/dialects/_aie_ops_gen.py` after the repository is built): +DMA blocks contain buffer descriptor operations that summarize what data is being moved, from what offset, how much of it, and in what layout. 
These buffer descriptors are the `AIE_DMABDOp` operations in MLIR and have their own auto-generated Python binding (available under `/python/aie/dialects/_aie_ops_gen.py` after the repository is built):
```python
def dma_bd (
diff --git a/programming_guide/section-2/section-2e/04_distribute_L2/README.md b/programming_guide/section-2/section-2e/04_distribute_L2/README.md
index 5c2e93c276..b736977dc5 100644
--- a/programming_guide/section-2/section-2e/04_distribute_L2/README.md
+++ b/programming_guide/section-2/section-2e/04_distribute_L2/README.md
@@ -24,7 +24,7 @@ The design in [distribute_L2.py](./distribute_L2.py) uses an Object FIFO `of_in`
object_fifo_link(of_in, [of_in0, of_in1, of_in2])
```
-All compute tiles are running the same process of acquring one object from their respective input Object FIFOs to consume, adding `1` to all of its entries, and releasing the object. The [join design](../05_join_L2/) shows how the data is sent back out to external memory and tested.
+All compute tiles are running the same process of acquiring one object from their respective input Object FIFOs to consume, adding `1` to all of its entries, and releasing the object. The [join design](../05_join_L2/) shows how the data is sent back out to external memory and tested.
Other examples containing this data movement pattern are available in the [programming_examples/matrix_multiplication/](../../../../programming_examples/basic/matrix_multiplication/).
diff --git a/programming_guide/section-2/section-2e/05_join_L2/README.md b/programming_guide/section-2/section-2e/05_join_L2/README.md
index 64f1ef902a..41490aed7b 100644
--- a/programming_guide/section-2/section-2e/05_join_L2/README.md
+++ b/programming_guide/section-2/section-2e/05_join_L2/README.md
@@ -24,7 +24,7 @@ The design in [join_L2.py](./join_L2.py) uses three Object FIFOs from each of th
object_fifo_link([of_out0, of_out1, of_out2], of_out)
```
-All compute tiles are running the same process of acquring one object from their respective input Object FIFOs to produce, writing `1` to all of its entries, and releasing the object.
+All compute tiles are running the same process of acquiring one object from their respective output Object FIFOs to produce, writing `1` to all of its entries, and releasing the object.
This design is combined with the previous [distribute](../04_distribute_L2/distribute_L2.py) design to achieve a full data movement from external memory to the AIE array and back. The resulting code is available in [distribute_and_join_L2.py](./distribute_and_join_L2.py). It is possible to build, run and test it with the following commands:
```
diff --git a/programming_guide/section-2/section-2f/README.md b/programming_guide/section-2/section-2f/README.md
index 5b389e3ba5..ca73436251 100644
--- a/programming_guide/section-2/section-2f/README.md
+++ b/programming_guide/section-2/section-2f/README.md
@@ -23,7 +23,7 @@
Not all data movement patterns can be described with Object FIFOs. This **advanced** section goes into detail about how a user can express data movement using the Data Movement Accelerators (or `DMA`) on AIE tiles. To better understand the code and concepts introduced in this section it is recommended to first read the [Advanced Topic of Section - 2a on DMAs](../section-2a/README.md/#advanced-topic--data-movement-accelerators).
-The AIE architecture currently has three different types of tiles: compute tiles, referred to as "tile", memory tiles referred to as "Mem tiles", and external memory interface tiles referred to as "Shim tiles". Each of these tiles have their own attributes regarding compute capabilities and memory capacity, but the base design of their DMAs is the same. The different types of DMAs can be intialized using the constructors in [aie.py](../../../python/dialects/aie.py):
+The AIE architecture currently has three different types of tiles: compute tiles, referred to as "tile", memory tiles referred to as "Mem tiles", and external memory interface tiles referred to as "Shim tiles". Each of these tiles has its own attributes regarding compute capabilities and memory capacity, but the base design of their DMAs is the same. The different types of DMAs can be initialized using the constructors in [aie.py](../../../python/dialects/aie.py):
```python
@mem(tile)       # compute tile DMA
@shim_dma(tile)  # Shim tile DMA
diff --git a/programming_guide/section-2/section-2g/README.md b/programming_guide/section-2/section-2g/README.md
index 1ed9ae0f4b..4712c714a1 100644
--- a/programming_guide/section-2/section-2g/README.md
+++ b/programming_guide/section-2/section-2g/README.md
@@ -27,7 +27,7 @@ The operations that will be described in this section must be placed in a separa
### Guide to Managing Runtime Data Movement to/from Host Memory
-In high-performance computing applications, efficiently managing data movement and synchronization is crucial. This guide provides a comprehensive overview of how to utilize the `npu_dma_memcpy_nd` and `npu_sync` functions to manage data movement at runtime from/to host memory to/from the AIE array (for example in the Ryzen™ AI NPU).
+In high-performance computing applications, efficiently managing data movement and synchronization is crucial. This guide provides a comprehensive overview of how to utilize the `npu_dma_memcpy_nd` and `npu_sync` functions to manage data movement at runtime from/to host memory to/from the AIE array (for example, in the Ryzen™ AI NPU).
#### **Efficient Data Movement with `npu_dma_memcpy_nd`**
@@ -129,7 +129,7 @@ npu_sync(0, 0, 0, 1)
#### **Best Practices for Data Movement and Synchronization**
-- **Sync to Reuse Buffer Descriptors**: Each `npu_dma_memcpy_nd` is assigned a `bd_id`. There are a maximum of `16` BDs available to use in each Shim Tile. It is "safe" to reuse BDs once all transfers are complete, this can be managed by properly syncronizing taking into account the BDs that must have completed to transfer data into the array to complete a compute operation. And then sync on the BD that receives the data produced by the compute operation to write it back to host memory.
+- **Sync to Reuse Buffer Descriptors**: Each `npu_dma_memcpy_nd` is assigned a `bd_id`. There are a maximum of `16` BDs available to use in each Shim Tile. It is "safe" to reuse BDs once all transfers are complete; this can be managed by synchronizing properly, taking into account the BDs that must have completed to transfer data into the array for a compute operation, and then syncing on the BD that receives the data produced by that compute operation to write it back to host memory. A minimal sketch of this pattern is shown after this list.
- **Note Non-blocking Transfers**: Overlap data transfers with computation by leveraging the non-blocking nature of `npu_dma_memcpy_nd`.
- **Minimize Synchronization Overhead**: Synchronize judiciously to avoid excessive overhead that might degrade performance.
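+The following minimal sketch illustrates the sync-to-reuse pattern, using the argument style shown earlier in this guide. The objectFIFO names (`"of_in"`, `"of_out"`), the host buffers `A` and `C`, and the transfer sizes are illustrative placeholders rather than a complete runtime sequence.
+```python
+# Hypothetical runtime sequence: one transfer into the array and one out of it.
+npu_dma_memcpy_nd(metadata="of_in", bd_id=1, mem=A, sizes=[1, 1, 1, 4096])   # host -> AIE array
+npu_dma_memcpy_nd(metadata="of_out", bd_id=0, mem=C, sizes=[1, 1, 1, 4096])  # AIE array -> host
+
+# Wait on the channel that writes "of_out" back to host memory. Once this sync
+# returns, the compute has consumed the input and produced the output, so both
+# bd_id 0 and bd_id 1 can safely be reassigned to later npu_dma_memcpy_nd calls.
+npu_sync(column=0, row=0, direction=0, channel=0)
+```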
diff --git a/programming_guide/section-3/README.md b/programming_guide/section-3/README.md
index 166cab665e..ec6471b28a 100644
--- a/programming_guide/section-3/README.md
+++ b/programming_guide/section-3/README.md
@@ -12,7 +12,7 @@
-This section creates a first program that will run on the AIE-array. As shown in the figure on the right, we will have to create both binaries for the AIE-array (device) and CPU (host) parts. For the AIE-array, a structural description and kernel code is compiled into the AIE-array binaries: an XCLBIN file ("final.xclbin") and an instruction sequence ("inst.txt"). The host code ("test.exe") loads these AIE-array binaries and contains the test functionality.
+This section creates the first program that will run on the AIE-array. As shown in the figure on the right, we will have to create both binaries for the AIE-array (device) and CPU (host) parts. For the AIE-array, a structural description and kernel code is compiled into the AIE-array binaries: an XCLBIN file ("final.xclbin") and an instruction sequence ("inst.txt"). The host code ("test.exe") loads these AIE-array binaries and contains the test functionality.
For the AIE-array structural description we will combine what you learned in [section-1](../section-1) for defining a basic structural design in Python with the data movement part from [section-2](../section-2).
@@ -81,7 +81,7 @@ We also need to set up the data movement to/from the AIE-array: configure n-dime
npu_sync(column=0, row=0, direction=0, channel=0)
```
-Finally, we need to configure how the compute core accesses the data moved to its L1 memory, in objectFIFO terminology: we need to program the acquire and release patterns of "of_in", "of_factor" and "of_out". Only a single factor is needed for the complete 4096 vector, while for every processing iteration on a sub-vector, we need to acquire and object of 1024 integers to read from from "of_in" and and one similar sized object from "of_out". Then we call our previously declared external function with the acquired objects as operands. After the vector scalar operation, we need to release both objects to their respective "of_in" and "of_out" objectFIFO. After the 4 sub-vector iterations, we release the "of_factor" objectFIFO.
+Finally, we need to configure how the compute core accesses the data moved to its L1 memory, in objectFIFO terminology: we need to program the acquire and release patterns of "of_in", "of_factor" and "of_out". Only a single factor is needed for the complete 4096 vector, while for every processing iteration on a sub-vector, we need to acquire an object of 1024 integers to read from "of_in" and one similarly sized object from "of_out". Then we call our previously declared external function with the acquired objects as operands. After the vector scalar operation, we need to release both objects to their respective "of_in" and "of_out" objectFIFO. After the 4 sub-vector iterations, we release the "of_factor" objectFIFO.
This access and execute pattern runs on the AIE compute core `ComputeTile2` and needs to get linked against the precompiled external function "scale.o". We run this pattern in a very large loop to enable enqueuing multiple rounds vector scalar multiply work from the host code.
@@ -122,11 +122,11 @@ Note that since the scalar factor is communicated through an object, it is provi
## Host Code
-The host code is acts as environment setup and testbench for the Vector Scalar Multiplication design example.
The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and kick off the execution the AIE design on the NPU. After running, it verifies the results and optionally outputs trace data. Both a C++ [test.cpp](./test.cpp) and Python [test.py](./test.py) variant of this code are available.
+The host code acts as the environment setup and testbench for the Vector Scalar Multiplication design example. The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and kicking off the execution of the AIE design on the NPU. After running, it verifies the results and optionally outputs trace data. Both a C++ [test.cpp](./test.cpp) and Python [test.py](./test.py) variant of this code are available.
For convenience, a set of test utilities support common elements of command line parsing, the XRT-based environment setup and with testbench functionality: [test_utils.h](../../runtime_lib/test_lib/test_utils.h) or [test.py](../../python/utils/test.py).
-The host code contains following elements:
+The host code contains the following elements:
1. *Parse program arguments and set up constants*: the host code typically takes at least as arguments: `-x` the XCLBIN file, `-k` kernel name (with default name "MLIR_AIE"), and `-i` the instruction sequence file as arguments since it is its task to load those files and set the kernel name. Both the XCLBIN and instruction sequence are generated when compiling the AIE-array Structural Description and kernel code with `aiecc.py`.
@@ -134,7 +134,7 @@ The host code contains following elements:
1. *Create XRT environment*: so that we can use the XRT runtime
-1. *Create XRT buffer objects* for the instruction sequence, inputs (vector a and factor) and output (vector c). Note that the `kernel.group_id()` needs to match the order of `def sequence(A, F, C):` in the data movement to/from the AIE-array of python AIE-array structural description, starting with ID number 2 for the first sequence argument and then icrementing by 1.
+1. *Create XRT buffer objects* for the instruction sequence, inputs (vector a and factor) and output (vector c). Note that the `kernel.group_id()` needs to match the order of `def sequence(A, F, C):` in the data movement to/from the AIE-array of the Python AIE-array structural description, starting with ID number 2 for the first sequence argument and then incrementing by 1.
1. *Initialize and synchronize*: host to device XRT buffer objects
diff --git a/programming_guide/section-4/README.md b/programming_guide/section-4/README.md
index e4a787afae..abe99b6010 100644
--- a/programming_guide/section-4/README.md
+++ b/programming_guide/section-4/README.md
@@ -10,9 +10,9 @@
# Section 4 - Vector Programming & Performance Measurement
-Now that you've had a chance to walk through the components of compiling and running a program on the Ryzen AI hardware in [section-3](../section-3), we will start looking at how we measure performance and utilize vector programming technqiues to fully leverage the power of the AI Engines for parallel compute.
+Now that you've had a chance to walk through the components of compiling and running a program on the Ryzen™ AI hardware in [section-3](../section-3), we will start looking at how we measure performance and utilize vector programming techniques to fully leverage the power of the AI Engines for parallel compute.
-It's helpful to first examine perfomance measurement before we delve into vector programming in order to get a baseline for where our application performance is. There are many factors that contribute to performance including latency, throughput and power efficiency. Performance measurement is an active area of research to provide more powerful tools for users to measure the speedup of their appication on AIEs. In [section-4a](./section-4a) and [section-4b](./section-4b/), we look a performance from the perspective of timers and trace. Then in [section-4c](./section-4c), we look more closely at how to vectorize AIE kernel code.
+It is helpful to first examine performance measurement before we delve into vector programming in order to get a baseline for where our application performance is. There are many factors that contribute to performance, including latency, throughput, and power efficiency. Performance measurement is an active area of research to provide more powerful tools for users to measure the speedup of their application on AIEs. In [section-4a](./section-4a) and [section-4b](./section-4b/), we look at performance from the perspective of timers and trace. Then in [section-4c](./section-4c), we look more closely at how to vectorize AIE kernel code.
* [Section 4a - Timers](./section-4a)
* [Section 4b - Trace](./section-4b)
diff --git a/programming_guide/section-5/README.md b/programming_guide/section-5/README.md
index 46786fe2a6..0e22fde08a 100644
--- a/programming_guide/section-5/README.md
+++ b/programming_guide/section-5/README.md
@@ -10,7 +10,7 @@
# Section 5 - Example Vector Designs
-The [programming examples](../../programming_examples) are a number of sample designs which further help explain many of the unique features of AI Engines and the NPU array in Ryzen™ AI.
+The [programming examples](../../programming_examples) are a number of sample designs that further help explain many of the unique features of AI Engines and the NPU array in Ryzen™ AI.
## Simplest
@@ -39,7 +39,7 @@ The [passthrough DMAs](../../programming_examples/basic/passthrough_dmas/) examp
| [Multi core GEMM](../../programming_examples/basic/matrix_multiplication/whole_array) | bfloat16 | A matrix-matrix multiply using 16 AIEs with operand broadcast. Uses a simple "accumulate in place" strategy |
| [GEMV](../../programming_examples/basic/matrix_multiplication/matrix_vector) | bfloat16 | A vector-matrix multiply returning a vector
-## Machine Kearning Kernels
+## Machine Learning Kernels
| Design name | Data type | Description |
|-|-|-|
@@ -56,7 +56,7 @@
1. Take a look at the testbench in our [Vector Exp](../../programming_examples/basic/vector_exp/) example [test.cpp](../../programming_examples/basic/vector_exp/test.cpp). Take note of the data type and the size of the test vector. What do you notice?
-1. What is the communication to computation ratio in [ReLU](../../programming_examples/ml/relu/)?
+1. What is the communication-to-computation ratio in [ReLU](../../programming_examples/ml/relu/)? (A rough way to set up this kind of estimate is sketched below.)
1. **HARD** Which basic example is a component in [Softmax](../../programming_examples/ml/softmax/)?
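+As a hint for the communication-to-computation exercise, one back-of-the-envelope approach (with purely illustrative numbers) is to compare bytes moved with operations performed. For an elementwise kernel that reads and writes $N$ `bfloat16` values, the data movement is roughly $2 \times 2N$ bytes (input plus output at 2 bytes per element), while the compute is on the order of $N$ operations (one compare/select per element), i.e. about 4 bytes moved per operation. The actual ratio for [ReLU](../../programming_examples/ml/relu/) depends on its data type and tiling.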
diff --git a/programming_guide/section-6/README.md b/programming_guide/section-6/README.md
index 714b8b6249..8be9768955 100644
--- a/programming_guide/section-6/README.md
+++ b/programming_guide/section-6/README.md
@@ -10,29 +10,29 @@
# Section 6 - Larger Example Designs
-There are a number of example designs available [here](../../programming_examples/) which further help explain many of the unique features of AI Engines and the NPU array in Ryzen™ AI. This section contains more complex application designs for both vision and machine learning use cases. In particular we will describe a ResNet implementation on for Ryzen™ AI.
+There are a number of example designs available [here](../../programming_examples/), which further help explain many of the unique features of AI Engines and the NPU array in Ryzen™ AI. This section contains more complex application designs for both vision and machine learning use cases. In particular, we will describe a ResNet implementation for Ryzen™ AI.
## Vision Kernels
| Design name | Data type | Description |
|-|-|-|
-| [Vision Passthrough](../../programming_examples/vision/vision_passthrough/) | i8 | A simple pipeline with just one `passThrough` kernel. This pipeline's main purpose is to test whether the data movement works correctly to copy a greyscale image. |
-| [Color Detect](../../programming_examples/vision/color_detect/) | i32 | This multi-kernel, multi-core pipeline detects colors in an RGBA image. |
-| [Edge Detect](../../programming_examples/vision/edge_detect/) | i32 | A mult-kernel, multi-core pipeline that detects edges in an image and overlays the detection on the original image. |
-| [Color Threshold](../../programming_examples/vision/color_threshold/) | i32 | A mult-core data-parallel implementation of color thresholding of a RGBA image. |
+| [Vision Passthrough](../../programming_examples/vision/vision_passthrough/) | i8 | A simple pipeline with just one `passThrough` kernel. This pipeline mainly aims to test whether the data movement works correctly to copy a greyscale image. |
+| [Color Detect](../../programming_examples/vision/color_detect/) | i32 | This multi-kernel, multi-core pipeline detects colors in an RGBA image. |
+| [Edge Detect](../../programming_examples/vision/edge_detect/) | i32 | A multi-kernel, multi-core pipeline that detects edges in an image and overlays the detection on the original image. |
+| [Color Threshold](../../programming_examples/vision/color_threshold/) | i32 | A multi-core data-parallel implementation of color thresholding of an RGBA image. |
## Machine Learning Designs
| Design name | Data type | Description |
|-|-|-|
-|[bottleneck](../../programming_examples/ml/bottleneck/)|ui8|A Bottleneck Residual Block is a variant of the residual block that utilises three convolutions, using 1x1, 3x3 and 1x1 filter sizes, respectively. The use of a bottleneck reduces the number of parameters and computations.|
-|[resnet](../../programming_examples/ml/resnet/)|ui8|ResNet with offloaded conv2_x bottleneck blocks. The implementation features kernel fusion and dataflow optimizations highlighting the unique architectural capabilties of AI Engines.|
+|[bottleneck](../../programming_examples/ml/bottleneck/)|ui8|A Bottleneck Residual Block is a variant of the residual block that utilizes three convolutions, using 1x1, 3x3, and 1x1 filter sizes, respectively. The implementation features fusing of multiple kernels and dataflow optimizations, highlighting the unique architectural capabilities of AI Engines.|
+|[resnet](../../programming_examples/ml/resnet/)|ui8|ResNet with offloaded conv2_x layers. It features a depth-first implementation of multiple bottleneck blocks across multiple NPU columns.|
## Exercises
-
-1. In [bottleneck](../../programming_examples/ml/bottleneck/) design following a dataflow approach, how many rows of input data does the 3x3 convolution operation require to proceed with its computation?
-2. Suppose you have a bottleneck block with input dimensions of 32x32x256. After passing through the 1x1 convolutional layer, the output dimensions become 32x32x64. What would be the output dimensions after the subsequent 3x3 convolutional layer, assuming a stride of 1 and no padding and output channel of 64?
+1. In the [bottleneck](../../programming_examples/ml/bottleneck/) design, how many different types of fused computations do you observe?
+2. In the [bottleneck](../../programming_examples/ml/bottleneck/) design, which follows a dataflow approach, how many elements does the 3x3 convolution operation require from the 1x1 convolution core to proceed with its computation?
+3. Suppose you have a bottleneck block with input dimensions of 32x32x256. After passing through the 1x1 convolutional layer, the output dimensions become 32x32x64. What would be the output dimensions after the subsequent 3x3 convolutional layer, assuming a stride of 1 with no padding and 64 output channels?
-----
[[Prev - Section 5](../section-5/)] [[Top](..)]