From a2e9a83167c80af4ee8c2cecd50f01a13e0683e9 Mon Sep 17 00:00:00 2001
From: singagan <53442471+singagan@users.noreply.github.com>
Date: Thu, 25 Apr 2024 20:39:52 +0200
Subject: [PATCH] documentation fixes for asplos24TutorialDescription, programming examples, and programming guide (#1416)
---
 .../asplos24TutorialDescription.md | 20 +++++++++----------
 programming_examples/basic/README.md | 6 +++---
 .../basic/dma_transpose/README.md | 4 ++--
 .../matrix_vector/README.md | 2 +-
 .../single_core/README.md | 2 +-
 .../whole_array/README.md | 2 +-
 .../basic/matrix_scalar_add/README.md | 4 ++--
 .../basic/passthrough_dmas/README.md | 2 +-
 .../basic/passthrough_kernel/README.md | 10 +++++-----
 .../basic/vector_exp/README.md | 8 ++++----
 .../basic/vector_scalar_add/README.md | 2 +-
 .../basic/vector_scalar_mul/README.md | 14 ++++++-------
 .../basic/vector_vector_add/README.md | 4 ++--
 .../basic/vector_vector_mul/README.md | 4 ++--
 programming_examples/ml/eltwise_add/README.md | 6 +++---
 programming_examples/ml/eltwise_mul/README.md | 6 +++---
 programming_examples/ml/relu/README.md | 8 ++++----
 programming_examples/ml/resnet/README.md | 4 ++--
 programming_examples/ml/softmax/README.md | 8 ++++----
 .../vision/color_detect/README.md | 4 ++--
 programming_guide/README.md | 20 +++++++++----------
 programming_guide/section-0/README.md | 2 +-
 programming_guide/section-1/README.md | 18 ++++++++---------
 programming_guide/section-2/README.md | 6 +++---
 .../section-2/section-2a/README.md | 6 +++---
 .../section-2/section-2b/01_Reuse/README.md | 2 +-
 .../section-2/section-2c/README.md | 2 +-
 .../section-2e/04_distribute_L2/README.md | 2 +-
 .../section-2/section-2e/05_join_L2/README.md | 2 +-
 .../section-2/section-2f/README.md | 2 +-
 .../section-2/section-2g/README.md | 4 ++--
 programming_guide/section-3/README.md | 10 +++++-----
 programming_guide/section-4/README.md | 4 ++--
 programming_guide/section-5/README.md | 6 +++---
 programming_guide/section-6/README.md | 20 +++++++++----------
 35 files changed, 113 insertions(+), 113 deletions(-)

diff --git a/docs/conferenceDescriptions/asplos24TutorialDescription.md b/docs/conferenceDescriptions/asplos24TutorialDescription.md
index b7e1d85a79..cec2e59db4 100644
--- a/docs/conferenceDescriptions/asplos24TutorialDescription.md
+++ b/docs/conferenceDescriptions/asplos24TutorialDescription.md
@@ -1,30 +1,30 @@
-# ASPLOS'24 Tutorial: Levering MLIR to Design for AI Engines on Ryzen AI
+# ASPLOS'24 Tutorial: Leveraging MLIR to Design for AI Engines on Ryzen™ AI

 ## Introduction

-The AI Engine array in the NPU of the AMD Ryzen AI device includes a set of VLIW vector processors with adaptable interconnect. This tutorial is targeted at performance engineers and tool developers who are looking for fast and completely open source design tools to support their research. Participants will first get insight into the AI Engine compute and data movement capabilities. Through small design examples expressed in the MLIR-AIE python language bindings and executed on an Ryzen AI device, participants will leverage AI Engine features for optimizing performance of increasingly complex designs. The labs will be done on Ryzen AI enabled miniPCs giving participants the ability to execute their own designs on real hardware.
+The AI Engine array in the NPU of the AMD Ryzen™ AI device includes a set of VLIW vector processors with adaptable interconnect. This tutorial targets performance engineers and tool developers looking for fast and completely open-source design tools to support their research. 
Participants will first get insight into the AI Engine compute and data movement capabilities. Through small design examples expressed in the MLIR-AIE python language bindings and executed on a Ryzen™ AI device, participants will leverage AI Engine features to optimize the performance of increasingly complex designs. The labs will be done on Ryzen™ AI-enabled miniPCs, giving participants the ability to execute their own designs on real hardware. This tutorial will cover the following key topics: 1. AI Engine architecture introduction -1. AIE core, array configuration and host application code compilation +1. AIE core, array configuration, and host application code compilation 1. Data movement and communication abstraction layers 1. Tracing for performance monitoring 1. Putting it all together on larger examples: matrix multiplication, convolutions as building blocks for ML and computer vision examples ## Agenda -Date: Saturday April 27th 2024 (morning) +Date: Saturday, April 27th, 2024 (morning) Location: Hilton La Jolla Torrey Pines, San Diego, California (with ASPLOS’24) -Prerequisite: please bring your laptop, so that you can ssh into our Ryzen AI enabled miniPCs for the hands-on exercises. +Prerequisite: please bring your laptop so that you can SSH into our Ryzen™ AI-enabled miniPCs for the hands-on exercises. ### Contents and Timeline (tentative) | Time | Topic | Presenter | Slides or Code | |------|-------|-----------|----------------| | 08:30am | Intro to spatial compute and explicit data movement | Kristof | [Programming Guide](../../programming_guide/) | -| 08:45am | "Hello World" from Ryzen AI | Joe | [AI Engine Basic Building Blocks](../../programming_guide/section-1/) | -| 09:00am | Data movement on Ryzen AI with objectFIFOs | Joe | [Data Movement](../../programming_guide/section-2/) | +| 08:45am | "Hello World" from Ryzen™ AI | Joe | [AI Engine Basic Building Blocks](../../programming_guide/section-1/) | +| 09:00am | Data movement on Ryzen™ AI with objectFIFOs | Joe | [Data Movement](../../programming_guide/section-2/) | | 09:30am | Your First Program | Kristof | [My First Program](../../programming_guide/section-3) | | 09:50am | Exercise 1: Build and run your first program | All | [Passthrough](../../programming_examples/basic/passthrough_kernel/) | | 10:00am | Break | | | @@ -44,8 +44,8 @@ Prerequisite: please bring your laptop, so that you can ssh into our Ryzen AI en *Joseph Melber* is a Senior Member of Technical Staff in AMD’s Research and Advanced Development group. At AMD, he is working on hardware architectures and compiler technologies for current and future AMD devices. He received a BS in electrical engineering from the University Buffalo, as well as MS and PhD degrees from the electrical and computer engineering department at Carnegie Mellon University. His research interests include runtime systems, compiler abstractions for data movement, and hardware prototypes for future adaptive heterogeneous computing architectures. -*Kristof Denolf* is a Fellow in AMD's Research and Advanced Development group where he is working on energy efficient computer vision and video processing applications to shape future AMD devices. He earned a M.Eng. in electronics from the Katholieke Hogeschool Brugge-Oostende (1998), now part of KULeuven, a M.Sc. in electronic system design from Leeds Beckett University (2000) and a Ph.D. from the Technical University Eindhoven (2007). 
He has over 25 years of combined research and industry experience at IMEC, Philips, Barco, Apple, Xilinx and AMD. His main research interest are all aspects of the cost-efficient and dataflow oriented design of video, vision and graphics systems. +*Kristof Denolf* is a Fellow in AMD's Research and Advanced Development group where he is working on energy-efficient computer vision and video processing applications to shape future AMD devices. He earned an M.Eng. in electronics from the Katholieke Hogeschool Brugge-Oostende (1998), now part of KULeuven, an M.Sc. in electronic system design from Leeds Beckett University (2000), and a Ph.D. from the Technical University Eindhoven (2007). He has over 25 years of combined research and industry experience at IMEC, Philips, Barco, Apple, Xilinx, and AMD. His main research interests are all aspects of the cost-efficient and dataflow-oriented design of video, vision, and graphics systems. -*Phil James-Roxby* is a Senior Fellow in AMD’s Research and Advanced Development group, working on compilers and runtimes to support current and future AMD devices, particularly in the domain on AI processing. In the past, he has been responsible for a number of software enablement activities for hardware devices, including SDNet and SDAccel at Xilinx, and the original development environement for the AI Engines. He holds a PhD from the University of Manchester on hardware acceleration of embedded machine learning applications, and his main research interest continues to be how to enable users to efficiently use diverse hardware in heterogenous systems. +*Phil James-Roxby* is a Senior Fellow in AMD’s Research and Advanced Development group, working on compilers and runtimes to support current and future AMD devices, particularly in the domain of AI processing. In the past, he has been responsible for a number of software enablement activities for hardware devices, including SDNet and SDAccel at Xilinx, and the original development environment for the AI Engines. He holds a PhD from the University of Manchester on hardware acceleration of embedded machine learning applications, and his main research interest continues to be how to enable users to efficiently use diverse hardware in heterogeneous systems. -*Samuel Bayliss* is a Fellow in the Research and Advanced Development group at AMD. His academic experience includes formative study at Imperial College London, for which he earned MEng and PhD degrees in 2006 and 2012 respectively. He is energized by his current work in advancing compiler tooling using MLIR, developing programming abstractions for parallel compute and evolving hardware architectures for efficient machine learning. \ No newline at end of file +*Samuel Bayliss* is a Fellow in the Research and Advanced Development group at AMD. His academic experience includes formative study at Imperial College London, for which he earned MEng and PhD degrees in 2006 and 2012, respectively. He is energized by his current work in advancing compiler tooling using MLIR, developing programming abstractions for parallel compute and evolving hardware architectures for efficient machine learning. 
\ No newline at end of file diff --git a/programming_examples/basic/README.md b/programming_examples/basic/README.md index 9d9a57169f..bfe8a881ef 100644 --- a/programming_examples/basic/README.md +++ b/programming_examples/basic/README.md @@ -10,14 +10,14 @@ # Basic Programming Examples -These programming examples provide a good starting point to illustrate how to build commonly used compute kernels (both single core and multicore data processing pipelines). They serve to highlight how designs can be described in python and lowered through the mlir-aie tool flow to an executable that runs on the NPU. [Passthrough Kernel](./passthrough_kernel) and [Vector Scalar Mul](./vector_scalar_mul) are good designs to get started with. Please see [section 3](../../programming_guide/section-3/) of the [programming guide](../../programming_guide/) for a more detailed guide on developing designs. +These programming examples provide a good starting point to illustrate how to build commonly used compute kernels (both single-core and multi-core data processing pipelines). They serve to highlight how designs can be described in Python and lowered through the mlir-aie tool flow to an executable that runs on the NPU. [Passthrough Kernel](./passthrough_kernel) and [Vector Scalar Mul](./vector_scalar_mul) are good designs to get started with. Please see [section 3](../../programming_guide/section-3/) of the [programming guide](../../programming_guide/) for a more detailed guide on developing designs. -* [Passthrough DMAs](./passthrough_dmas) - This design demonstrates data movement to implement a memcpy operation using object FIFOs just using DMAs, without involving the AIE core. +* [Passthrough DMAs](./passthrough_dmas) - This design demonstrates data movement to implement a memcpy operation using object FIFOs just using DMAs without involving the AIE core. * [Passthrough Kernel](./passthrough_kernel) - This design demonstrates a simple AIE implementation for vectorized memcpy on a vector of integer involving AIE core kernel programming. * [Vector Scalar Add](./vector_scalar_add) - Single tile performs a very simple `+` operation where the kernel loads data from local memory, increments the value by `1` and stores it back. * [Vector Scalar Mul](./vector_scalar_mul) - Single tile performs `vector * scalar` of size `4096`. The kernel does a `1024` vector multiply and is invoked multiple times to complete the full `vector * scalar` compute. * [Vector Reduce Add](./vector_reduce_add) - Single tile performs a reduction of a vector to return the `sum` of the elements. * [Vector Reduce Max](./vector_reduce_max) - Single tile performs a reduction of a vector to return the `max` of the elements. * [Vector Reduce Min](./vector_reduce_min) - Single tile performs a reduction of a vector to return the `min` of the elements. -* [Vector Exp](./vector_exp) - A simple element wise exponent function, using the look up table capabilities of the AI Engine. +* [Vector Exp](./vector_exp) - A simple element-wise exponent function, using the look up table capabilities of the AI Engine. * [Matrix Multiplication](./matrix_multiplication) - This directory contains multiple designs spanning: single core and multi-core (whole array) matrix-matrix multiplication, and matrix-vector multiplication designs. It also contains sweep infrastructure for benchmarking. 
\ No newline at end of file diff --git a/programming_examples/basic/dma_transpose/README.md b/programming_examples/basic/dma_transpose/README.md index 32dd7ac3d3..5a73dde0e3 100644 --- a/programming_examples/basic/dma_transpose/README.md +++ b/programming_examples/basic/dma_transpose/README.md @@ -12,8 +12,8 @@ This reference design can be run on a Ryzen™ AI NPU. -In the [design](./aie2.py) a 2-D array in row-major layout is read from external memory to `ComputeTile2` with a transposed layout, -by using an implicit copy via the compute tile's Data Movement Accelerator (DMA). The data is read from and written to external memory through Shim tile (`col`, 0). +In the [design](./aie2.py), a 2-D array in a row-major layout is read from external memory to `ComputeTile2` with a transposed layout, +by using an implicit copy via the compute tile's Data Movement Accelerator (DMA). The data is read from and written to external memory through the Shim tile (`col`, 0). The implicit copy is performed using the `object_fifo_link` operation that specifies how input data arriving via `of_in` should be sent further via `of_out` by specifically leveraging the compute tile's DMA. This operation and its functionality are described in more depth in [Section-2b](../../../programming_guide/section-2/section-2b/README.md/#object-fifo-link) of the programming guide. diff --git a/programming_examples/basic/matrix_multiplication/matrix_vector/README.md b/programming_examples/basic/matrix_multiplication/matrix_vector/README.md index b808ec4e32..c01f13c58f 100644 --- a/programming_examples/basic/matrix_multiplication/matrix_vector/README.md +++ b/programming_examples/basic/matrix_multiplication/matrix_vector/README.md @@ -10,7 +10,7 @@ # Matrix Vector Multiplication -One tiles in one or more columns perform a `matrix * vector` multiply on bfloat16 data type where `MxK` is `288x288`. The kernel itself computes `32x32 (MxK)` so it is invoked multiple times to complete the full matvec compute. +One tile in one or more columns performs a `matrix * vector` multiply on bfloat16 data type where `MxK` is `288x288`. The kernel itself computes `32x32 (MxK)` so it is invoked multiple times to complete the full matvec compute. You need c++23 for bfloat16_t support. It can be found in g++-13: https://lindevs.com/install-g-on-ubuntu diff --git a/programming_examples/basic/matrix_multiplication/single_core/README.md b/programming_examples/basic/matrix_multiplication/single_core/README.md index e4b8ad4729..2fe5158e76 100644 --- a/programming_examples/basic/matrix_multiplication/single_core/README.md +++ b/programming_examples/basic/matrix_multiplication/single_core/README.md @@ -10,7 +10,7 @@ # Matrix Multiplication -Single tile performs a `matrix * matrix` multiply on bfloat16 data type where `MxKxN` is `128x128x128`. The kernel itself computes `64x32x64 (MxKxN)` so it is invoked multiple times to complete the full matmul compute. +A single tile performs a `matrix * matrix` multiply on bfloat16 data type where `MxKxN` is `128x128x128`. The kernel itself computes `64x32x64 (MxKxN)` so it is invoked multiple times to complete the full matmul compute. You need c++23 for bfloat16_t support. 
It can be found in g++-13: https://lindevs.com/install-g-on-ubuntu

diff --git a/programming_examples/basic/matrix_multiplication/whole_array/README.md b/programming_examples/basic/matrix_multiplication/whole_array/README.md
index f91249721d..bdf9e71778 100644
--- a/programming_examples/basic/matrix_multiplication/whole_array/README.md
+++ b/programming_examples/basic/matrix_multiplication/whole_array/README.md
@@ -10,7 +10,7 @@

 # Matrix Multiplication Array

-Multiple tiles in a single column perform a `matrix * matrix` multiply on bfloat16 data type where `MxKxN` is `256x256x256`. The kernel itself computes `64x64x64 (MxKxN)` so it is invoked multiple times to complete the full matmul compute.
+Multiple tiles in a single column perform a `matrix * matrix` multiply on bfloat16 data type where `MxKxN` is `256x256x256`. The kernel computes `64x64x64 (MxKxN)` and is invoked multiple times to complete the full matmul compute.

 You need c++23 for bfloat16_t support. It can be found in g++-13: https://lindevs.com/install-g-on-ubuntu

diff --git a/programming_examples/basic/matrix_scalar_add/README.md b/programming_examples/basic/matrix_scalar_add/README.md
index 1b50d4af86..c29df4bfaf 100644
--- a/programming_examples/basic/matrix_scalar_add/README.md
+++ b/programming_examples/basic/matrix_scalar_add/README.md
@@ -10,9 +10,9 @@

 # Matrix Scalar Addition

-Single tile performs a very simple `+` operation where the kernel loads data from local memory, increments the value by `1` and stores it back. The DMA in the Shim tile is programmed to bring the bottom left `8x16` portion of a larger `16x128` matrix into the tile to perform the operation. This reference design can be run on either a RyzenAI NPU or a VCK5000.
+A single tile performs a very simple `+` operation where the kernel loads data from local memory, increments the value by `1` and stores it back. The DMA in the Shim tile is programmed to bring the bottom left `8x16` portion of a larger `16x128` matrix into the tile to perform the operation. This reference design can be run on either a Ryzen™ AI NPU or a VCK5000.

-The kernel executes on AIE tile (`col`, 2). Input data is brought to the local memory of the tile from Shim tile (`col`, 0). The value of `col` is dependent on whether the application is targetting NPU or VCK5000. The Shim tile is programmed with a 2D DMA to only bring a 2D submatrix into the AIE tile for processing.
+The kernel executes on AIE tile (`col`, 2). Input data is brought to the local memory of the tile from Shim tile (`col`, 0). The value of `col` depends on whether the application is targeting NPU or VCK5000. The Shim tile is programmed with a 2D DMA to bring only a 2D submatrix into the AIE tile for processing.

 To compile and run the design for NPU:
 ```
diff --git a/programming_examples/basic/passthrough_dmas/README.md b/programming_examples/basic/passthrough_dmas/README.md
index d85156df10..b3c2e682aa 100644
--- a/programming_examples/basic/passthrough_dmas/README.md
+++ b/programming_examples/basic/passthrough_dmas/README.md
@@ -12,7 +12,7 @@

 This reference design can be run on a Ryzen™ AI NPU.

-In the [design](./aie2.py) data is brought from external memory to `ComputeTile2` and back, without modification from the tile, by using an implicit copy via the compute tile's Data Movement Accelerator (DMA). The data is read from and written to external memory through Shim tile (`col`, 0). 
+In the [design](./aie2.py) data is brought from external memory to `ComputeTile2` and back, without modification from the tile, by using an implicit copy via the compute tile's Data Movement Accelerator (DMA). The data is read from and written to external memory through the Shim tile (`col`, 0). The implicit copy is performed using the `object_fifo_link` operation that specifies how input data arriving via `of_in` should be sent further via `of_out` by specifically leveraging the compute tile's DMA. This operation and its functionality are described in more depth in [Section-2b](../../../programming_guide/section-2/section-2b/03_Link_Distribute_Join/README.md#object-fifo-link) of the programming guide. diff --git a/programming_examples/basic/passthrough_kernel/README.md b/programming_examples/basic/passthrough_kernel/README.md index a012e66209..ed6dfbf218 100644 --- a/programming_examples/basic/passthrough_kernel/README.md +++ b/programming_examples/basic/passthrough_kernel/README.md @@ -10,11 +10,11 @@ # Passthrough Kernel: -This IRON design flow example, called "Passthrough Kernel", demonstrates a simple AIE implementation for vectorized memcpy on a vector of integers. In this design, a single AIE core performs the memcpy operation on a vector with a default length `4096`. The kernel is configured to work on `1024` element sized subvectors, and is invoked multiple times to complete the full copy. The example consists of two primary design files: `aie2.py` and `passThrough.cc`, and a testbench `test.cpp` or `test.py`. +This IRON design flow example, called "Passthrough Kernel", demonstrates a simple AIE implementation for vectorized memcpy on a vector of integers. In this design, a single AIE core performs the memcpy operation on a vector with a default length `4096`. The kernel is configured to work on `1024` element-sized subvectors and is invoked multiple times to complete the full copy. The example consists of two primary design files: `aie2.py` and `passThrough.cc`, and a testbench `test.cpp` or `test.py`. ## Source Files Overview -1. `aie2.py`: A Python script that defines the AIE array structural design using MLIR-AIE operations. This generates MLIR that is then compiled using `aiecc.py` to produce design binaries (ie. XCLBIN and inst.txt for the NPU in Ryzen™ AI). +1. `aie2.py`: A Python script that defines the AIE array structural design using MLIR-AIE operations. The file generates MLIR that is then compiled using `aiecc.py` to produce design binaries (ie. XCLBIN and inst.txt for the NPU in Ryzen™ AI). 1. `passThrough.cc`: A C++ implementation of vectorized memcpy operations for AIE cores. Found [here](../../../aie_kernels/generic/passThrough.cc). @@ -30,15 +30,15 @@ This simple example effectively passes data through a single compute tile in the 1. An object FIFO called "of_in" connects a Shim Tile to a Compute Tile, and another called "of_out" connects the Compute Tile back to the Shim Tile. 1. The runtime data movement is expressed to read `4096` uint8_t data from host memory to the compute tile and write the `4096` data back to host memory. 1. The compute tile acquires this input data in "object" sized (`1024`) blocks from "of_in" and copies them to another output "object" it has acquired from "of_out". Note that a vectorized kernel running on the Compute Tile's AIE core copies the data from the input "object" to the output "object". -1. 
After the vectorized copy is performed the Compute Tile releases the "objects" allowing the DMAs (abstracted by the object FIFO) to transfer the data back to host memory and copy additional blocks into the Compute Tile, "of_out" and "of_in" respectively. +1. After the vectorized copy is performed, the Compute Tile releases the "objects", allowing the DMAs (abstracted by the object FIFO) to transfer the data back to host memory and copy additional blocks into the Compute Tile, "of_out" and "of_in" respectively. -It is important to note that the Shim Tile and Compute Tile DMAs move data concurrently, and the Compute Tile's AIE Core is also processing data concurrent with the data movement. This is made possible by expressing depth `2` in declaring, for example, `object_fifo("in", ShimTile, ComputeTile2, 2, memRef_ty)` to denote ping-pong buffers. +It is important to note that the Shim Tile and Compute Tile DMAs move data concurrently, and the Compute Tile's AIE Core also processes data concurrently with the data movement. This is made possible by expressing depth `2` in declaring, for example, `object_fifo("in", ShimTile, ComputeTile2, 2, memRef_ty)` to denote ping-pong buffers. ## Design Component Details ### AIE Array Structural Design -This design performs a memcpy operation on a vector of input data. The AIE design is described in a python module as follows: +This design performs a memcpy operation on a vector of input data. The AIE design is described in a Python module as follows: 1. **Constants & Configuration:** The script defines input/output dimensions (`N`, `n`), buffer sizes in `lineWidthInBytes` and `lineWidthInInt32s`, and tracing support. diff --git a/programming_examples/basic/vector_exp/README.md b/programming_examples/basic/vector_exp/README.md index 7b6fe0eb23..6c13f33578 100644 --- a/programming_examples/basic/vector_exp/README.md +++ b/programming_examples/basic/vector_exp/README.md @@ -11,19 +11,19 @@ # Vector $e^x$ -This example shows how the look up table capability of the AIE can be used to perform approximations to well known functions like $e^x$. +This example shows how the look up table capability of the AIE can be used to perform approximations to well-known functions like $e^x$. This design uses 4 cores, and each core operates on `1024` `bfloat16` numbers. Each core contains a lookup table approximation of the $e^x$ function, which is then used to perform the operation. $e^x$ is typically used in machine learning applications with relatively small numbers, typically around 0..1, and also will return infinity for input values larger than 89, so a small look up table approximation method is often accurate enough compared to a more exact approximation like Taylor series expansion. ## Source Files Overview -1. `aie2.py`: A Python script that defines the AIE array structural design using MLIR-AIE operations. This generates MLIR that is then compiled using `aiecc.py` to produce design binaries (ie. XCLBIN and inst.txt for the NPU in Ryzen™ AI). +1. `aie2.py`: A Python script that defines the AIE array structural design using MLIR-AIE operations. This generates MLIR that is then compiled using `aiecc.py` to produce design binaries (i.e., XCLBIN and inst.txt for the NPU in Ryzen™ AI). -1. `bf16_exp.cc`: A C++ implementation of vectorized table lookup operations for AIE cores. The lookup operation `getExpBf16` operates on vectors of size `16` loading the vectorized accumulator registers with the look up table results. 
It is then necessary to copy the accumulator register to a regular vector register, before storing back into memory. The source can be found [here](../../../aie_kernels/aie2/bf16_exp.cc). +1. `bf16_exp.cc`: A C++ implementation of vectorized table lookup operations for AIE cores. The lookup operation `getExpBf16` operates on vectors of size `16`, loading the vectorized accumulator registers with the look up table results. It is then necessary to copy the accumulator register to a regular vector register before storing it back into memory. The source can be found [here](../../../aie_kernels/aie2/bf16_exp.cc). 1. `test.cpp`: This C++ code is a testbench for the design example. The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and executing the AIE design on the NPU. After executing, the program verifies the results. -The design also uses a single file from the AIE runtime, in order to initialize the look up table contents to approximate the $e^x$ function. +The design also uses a single file from the AIE runtime to initialize the look up table contents to approximate the $e^x$ function. ## Usage diff --git a/programming_examples/basic/vector_scalar_add/README.md b/programming_examples/basic/vector_scalar_add/README.md index b1cb33333f..5223393ffe 100644 --- a/programming_examples/basic/vector_scalar_add/README.md +++ b/programming_examples/basic/vector_scalar_add/README.md @@ -12,7 +12,7 @@ Single tile performs a very simple `+` operation where the kernel loads data from local memory, increments the value by `1` and stores it back. -The kernel executes on AIE tile (0, 2). Input data is brought to the local memory of the tile from Shim tile (0, 0), through Mem tile (0, 1). The size of the input data from the Shim tile is `16xi32`. The data is stored in the Mem tile and sent to the AIE tile in smaller pieces of size `8xi32`. Output data from the AIE tile to the Shim tile follows the same process, in reverse. +The kernel executes on AIE tile (0, 2). Input data is brought to the local memory of the tile from the Shim tile (0, 0) through the Mem tile (0, 1). The size of the input data from the Shim tile is `16xi32`. The data is stored in the Mem tile and sent to the AIE tile in smaller pieces of size `8xi32`. Output data from the AIE tile to the Shim tile follows the same process, in reverse. This example does not contain a C++ kernel file. The kernel is expressed in Python bindings for the `memref` and `arith` dialects that is then compiled with the AIE compiler to generate the AIE core binary. diff --git a/programming_examples/basic/vector_scalar_mul/README.md b/programming_examples/basic/vector_scalar_mul/README.md index 2ee29e2e19..1ba1649ff9 100644 --- a/programming_examples/basic/vector_scalar_mul/README.md +++ b/programming_examples/basic/vector_scalar_mul/README.md @@ -10,11 +10,11 @@ # Vector Scalar Multiplication: -This IRON design flow example, called "Vector Scalar Multiplication", demonstrates a simple AIE implementation for vectorized vector scalar multiply on a vector of integers. In this design, a single AIE core performs the vector scalar multiply operation on a vector with a default length `4096`. The kernel is configured to work on `1024` element sized subvectors, and is invoked multiple times to complete the full scaling. The example consists of two primary design files: `aie2.py` and `scale.cc`, and a testbench `test.cpp` or `test.py`. 
+This IRON design flow example, called "Vector Scalar Multiplication", demonstrates a simple AIE implementation for vectorized vector scalar multiply on a vector of integers. In this design, a single AIE core performs the vector scalar multiply operation on a vector with a default length `4096`. The kernel is configured to work on `1024` element-sized subvectors, and is invoked multiple times to complete the full scaling. The example consists of two primary design files: `aie2.py` and `scale.cc`, and a testbench `test.cpp` or `test.py`. ## Source Files Overview -1. `aie2.py`: A Python script that defines the AIE array structural design using MLIR-AIE operations. This generates MLIR that is then compiled using `aiecc.py` to produce design binaries (ie. XCLBIN and inst.txt for the NPU in Ryzen™ AI). +1. `aie2.py`: A Python script that defines the AIE array structural design using MLIR-AIE operations. This generates MLIR that is then compiled using `aiecc.py` to produce design binaries (i.e., XCLBIN and inst.txt for the NPU in Ryzen™ AI). 1. `scale.cc`: A C++ implementation of scalar and vectorized vector scalar multiply operations for AIE cores. Found [here](../../../aie_kernels/aie2/scale.cc). @@ -29,16 +29,16 @@ This IRON design flow example, called "Vector Scalar Multiplication", demonstrat This simple example uses a single compute tile in the NPU's AIE array. The design is described as shown in the figure to the right. The overall design flow is as follows: 1. An object FIFO called "of_in" connects a Shim Tile to a Compute Tile, and another called "of_out" connects the Compute Tile back to the Shim Tile. 1. The runtime data movement is expressed to read `4096` int32_t data from host memory to the compute tile and write the `4096` data back to host memory. A single int32_t scale factor is also transferred form host memory to the Compute Tile. -1. The compute tile acquires this input data in "object" sized (`1024`) blocks from "of_in" and stores the result to another output "object" it has acquired from "of_out". Note that a scalar or vectorized kernel running on the Compute Tile's AIE core multiplies the data from the input "object" by a scale factor before storing to the output "object". -1. After the compute is performed the Compute Tile releases the "objects" allowing the DMAs (abstracted by the object FIFO) to transfer the data back to host memory and copy additional blocks into the Compute Tile, "of_out" and "of_in" respectively. +1. The compute tile acquires this input data in "object" sized (`1024`) blocks from "of_in" and stores the result to another output "object" it has acquired from "of_out". Note that a scalar or vectorized kernel running on the Compute Tile's AIE core multiplies the data from the input "object" by a scale factor before storing it to the output "object". +1. After the compute is performed, the Compute Tile releases the "objects", allowing the DMAs (abstracted by the object FIFO) to transfer the data back to host memory and copy additional blocks into the Compute Tile, "of_out" and "of_in" respectively. -It is important to note that the Shim Tile and Compute Tile DMAs move data concurrently, and the Compute Tile's AIE Core is also processing data concurrent with the data movement. This is made possible by expressing depth `2` in declaring, for example, `object_fifo("in", ShimTile, ComputeTile2, 2, memRef_ty)` to denote ping-pong buffers. 
+It is important to note that the Shim Tile and Compute Tile DMAs move data concurrently, and the Compute Tile's AIE Core also processes data concurrently with the data movement. This is made possible by expressing depth `2` in declaring, for example, `object_fifo("in", ShimTile, ComputeTile2, 2, memRef_ty)` to denote ping-pong buffers. ## Design Component Details ### AIE Array Structural Design -This design performs a memcpy operation on a vector of input data. The AIE design is described in a python module as follows: +This design performs a memcpy operation on a vector of input data. The AIE design is described in a Python module as follows: 1. **Constants & Configuration:** The script defines input/output dimensions (`N`, `n`), buffer sizes in `N_in_bytes` and `N_div_n` blocks, the object FIFO buffer depth, and vector vs scalar kernel selection and tracing support booleans. @@ -62,7 +62,7 @@ This design performs a memcpy operation on a vector of input data. The AIE desig ### AIE Core Kernel Code -`scale.cc` contains a C++ implementations of scalar and vectorized vector scalar multiplcation operation designed for AIE cores. It consists of two main sections: +`scale.cc` contains a C++ implementation of scalar and vectorized vector scalar multiplication operation designed for AIE cores. It consists of two main sections: 1. **Scalar Scaling:** The `scale_scalar()` function processes one data element at a time, taking advantage of AIE scalar datapath to load, multiply and store data elements. diff --git a/programming_examples/basic/vector_vector_add/README.md b/programming_examples/basic/vector_vector_add/README.md index 2ed5a82605..c7dd75676a 100644 --- a/programming_examples/basic/vector_vector_add/README.md +++ b/programming_examples/basic/vector_vector_add/README.md @@ -10,9 +10,9 @@ # Vector Vector Add -Single tile performs a very simple `+` operations from two vectors loaded into memory. The tile then stores the sum of those two vectors back to external memory. This reference design can be run on either a RyzenAI NPU or a VCK5000. +A single tile performs a very simple `+` operation from two vectors loaded into memory. The tile then stores the sum of those two vectors back to external memory. This reference design can be run on either a Ryzen™ AI NPU or a VCK5000. -The kernel executes on AIE tile (`col`, 2). Both input vectors are brought into the tile from Shim tile (`col`, 0). The value of `col` is dependent on whether the application is targetting NPU or VCK5000. The AIE tile performs the summation operations and the Shim tile brings the data back out to external memory. +The kernel executes on AIE tile (`col`, 2). Both input vectors are brought into the tile from Shim tile (`col`, 0). The value of `col` depends on whether the application is targeting NPU or VCK5000. The AIE tile performs the summation operations, and the Shim tile brings the data back out to external memory. To compile and run the design for NPU: ``` diff --git a/programming_examples/basic/vector_vector_mul/README.md b/programming_examples/basic/vector_vector_mul/README.md index 54ab0bd4e1..331f832033 100644 --- a/programming_examples/basic/vector_vector_mul/README.md +++ b/programming_examples/basic/vector_vector_mul/README.md @@ -10,9 +10,9 @@ # Vector Vector Multiplication -Single tile performs a very simple `*` operations from two vectors loaded into memory. The tile then stores the element wise multiplication of those two vectors back to external memory. 
This reference design can be run on either a RyzenAI NPU or a VCK5000. +A single tile performs a very simple `*` operation from two vectors loaded into memory. The tile then stores the element-wise multiplication of those two vectors back to external memory. This reference design can be run on either a Ryzen™ AI NPU or a VCK5000. -The kernel executes on AIE tile (`col`, 2). Both input vectors are brought into the tile from Shim tile (`col`, 0). The value of `col` is dependent on whether the application is targetting NPU or VCK5000. The AIE tile performs the multiplication operations and the Shim tile brings the data back out to external memory. +The kernel executes on the AIE tile (`col`, 2). Both input vectors are brought into the tile from the Shim tile (`col`, 0). The value of `col` depends on whether the application targets NPU or VCK5000. The AIE tile performs the multiplication operations, and the Shim tile brings the data back out to external memory. To compile and run the design for NPU: ``` diff --git a/programming_examples/ml/eltwise_add/README.md b/programming_examples/ml/eltwise_add/README.md index 415b0c828c..f801bfe13b 100644 --- a/programming_examples/ml/eltwise_add/README.md +++ b/programming_examples/ml/eltwise_add/README.md @@ -10,14 +10,14 @@ # Eltwise Add -This design implements a `bfloat16` based element wise addition between two vectors, performed in parallel on two cores in a single column. This will end up being I/O bound due to the low compute intensity, and in a practical ML implementation, is an example of the type of kernel that is likely best fused onto another more compute dense kernel (e.g. a convolution or GEMM). +This design implements a `bfloat16` based element wise addition between two vectors, performed in parallel on two cores in a single column. Element-wise addition usually ends up being I/O bound due to the low compute intensity. In a practical ML implementation, it is an example of the type of kernel that is likely best fused onto another more compute-dense kernel (e.g., a convolution or GEMM). Please refer to [bottleneck](../bottleneck/) design on fusing element-wise addition with convolution for the skip addition. ## Source Files Overview -1. `aie2.py`: A Python script that defines the AIE array structural design using MLIR-AIE operations. This generates MLIR that is then compiled using `aiecc.py` to produce design binaries (ie. XCLBIN and inst.txt for the NPU in Ryzen AI). +1. `aie2.py`: A Python script that defines the AIE array structural design using MLIR-AIE operations. This generates MLIR that is then compiled using `aiecc.py` to produce design binaries (i.e., XCLBIN and inst.txt for the NPU in Ryzen™ AI). -1. `add.cc`: A C++ implementation of a vectorized vector addition operation for AIE cores. The code uses the AIE API which is a C++ header-only library providing types and operations that get translated into efficient low-level intrinsics, and whose documentation can be found [here](https://www.xilinx.com/htmldocs/xilinx2023_2/aiengine_api/aie_api/doc/index.html). The source can be found [here](../../../aie_kernels/aie2/add.cc). +1. `add.cc`: A C++ implementation of a vectorized vector addition operation for AIE cores. The code uses the AIE API, which is a C++ header-only library providing types and operations that get translated into efficient low-level intrinsics, and whose documentation can be found [here](https://www.xilinx.com/htmldocs/xilinx2023_2/aiengine_api/aie_api/doc/index.html). 
The source can be found [here](../../../aie_kernels/aie2/add.cc). 1. `test.cpp`: This C++ code is a testbench for the design example. The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and executing the AIE design on the NPU. After executing, the script verifies the memcpy results and optionally outputs trace data. diff --git a/programming_examples/ml/eltwise_mul/README.md b/programming_examples/ml/eltwise_mul/README.md index 9dc531705c..1db8cde361 100644 --- a/programming_examples/ml/eltwise_mul/README.md +++ b/programming_examples/ml/eltwise_mul/README.md @@ -10,14 +10,14 @@ # Eltwise Multiplication -This design implements a `bfloat16` based element wise multiplication between two vectors, performed in parallel on two cores in a single column. This will end up being I/O bound due to the low compute intensity, and in a practical ML implementation, is an example of the type of kernel that is likely best fused onto another more compute dense kernel (e.g. a convolution or GEMM). +This design implements a `bfloat16` based element-wise multiplication between two vectors, performed in parallel on two cores in a single column. Element-wise multiplication usually ends up being I/O bound due to the low compute intensity. In a practical ML implementation, it is an example of the type of kernel that is likely best fused onto another more compute-dense kernel (e.g., a convolution or GEMM). ## Source Files Overview -1. `aie2.py`: A Python script that defines the AIE array structural design using MLIR-AIE operations. This generates MLIR that is then compiled using `aiecc.py` to produce design binaries (ie. XCLBIN and inst.txt for the NPU in Ryzen AI). +1. `aie2.py`: A Python script that defines the AIE array structural design using MLIR-AIE operations. This generates MLIR that is then compiled using `aiecc.py` to produce design binaries (i.e., XCLBIN and inst.txt for the NPU in Ryzen™ AI). -1. `add.cc`: A C++ implementation of a vectorized vector multiplication operation for AIE cores. The code uses the AIE API which is a C++ header-only library providing types and operations that get translated into efficient low-level intrinsics, and whose documentation can be found [here](https://www.xilinx.com/htmldocs/xilinx2023_2/aiengine_api/aie_api/doc/index.html). The source can be found [here](../../../aie_kernels/aie2/add.cc). +1. `add.cc`: A C++ implementation of a vectorized vector multiplication operation for AIE cores. The code uses the AIE API, which is a C++ header-only library providing types and operations that get translated into efficient low-level intrinsics, and whose documentation can be found [here](https://www.xilinx.com/htmldocs/xilinx2023_2/aiengine_api/aie_api/doc/index.html). The source can be found [here](../../../aie_kernels/aie2/add.cc). 1. `test.cpp`: This C++ code is a testbench for the design example. The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and executing the AIE design on the NPU. After executing, the script verifies the memcpy results and optionally outputs trace data. 
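For orientation, the vectorized kernels referenced in these element-wise examples follow a simple load/compute/store pattern over fixed-width vectors. The sketch below is illustrative only, assuming the AIE API header and a 16-lane `bfloat16` vector; the function name, signature, and loop structure are placeholders rather than the shipped `add.cc` kernel.

```cpp
// Illustrative sketch only, not the shipped kernel: an element-wise bfloat16
// add written with the AIE API's vector load/compute/store pattern.
#include <aie_api/aie.hpp>

extern "C" void eltwise_add_bf16_sketch(bfloat16 *a, bfloat16 *b, bfloat16 *c,
                                        int n) {
  constexpr int vec = 16; // assumed lane count; real kernels tune this per type
  for (int i = 0; i < n; i += vec) {
    aie::vector<bfloat16, vec> va = aie::load_v<vec>(a + i); // load 16 inputs
    aie::vector<bfloat16, vec> vb = aie::load_v<vec>(b + i);
    aie::store_v(c + i, aie::add(va, vb)); // element-wise add, store 16 results
  }
}
```

Keeping the loop body to whole-vector loads, a single vector operation, and whole-vector stores is what lets each iteration map cleanly onto the AIE vector datapath.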
diff --git a/programming_examples/ml/relu/README.md b/programming_examples/ml/relu/README.md index 6d093c6bd8..c450bb8b24 100644 --- a/programming_examples/ml/relu/README.md +++ b/programming_examples/ml/relu/README.md @@ -13,20 +13,20 @@ ReLU, which stands for Rectified Linear Unit, is a type of activation function that is widely used in neural networks, particularly in deep learning models. It is defined mathematically as: $ReLU(x) = max(0,x)$ -This function takes a single number as input and outputs the maximum of zero and the input number. Essentially, it passes positive values through unchanged, and clamps all the negative values to zero. +This function takes a single number as input and outputs the maximum of zero and the input number. Essentially, it passes positive values through unchanged, and clamps all the negative values to zero. Please refer to [conv2d_fused_relu](../conv2d_fused_relu/) design on fusing ReLU with convolution. ## Key Characteristics of ReLU: * Non-linear: While it looks like a linear function, ReLU introduces non-linearity into the model, which is essential for learning complex patterns in data. * Computational Efficiency: One of ReLU's biggest advantages is its computational simplicity. Unlike other activation functions like sigmoid or tanh, ReLU does not involve expensive operations (e.g., exponentials), which makes it computationally efficient and speeds up the training and inference processes. -This design implements a `bfloat16` based ReLU on a vector, performed in parallel on two cores in a single column. This will end up being I/O bound due to the low compute intensity, and in a practical ML implementation, is an example of the type of kernel that is likely best fused onto another more compute dense kernel (e.g. a convolution or GEMM). +This design implements a `bfloat16` based ReLU on a vector, performed in parallel on two cores in a single column. This will end up being I/O bound due to the low compute intensity, and in a practical ML implementation, is an example of the type of kernel that is likely best fused onto another more compute-dense kernel (e.g., a convolution or GEMM). ## Source Files Overview -1. `aie2.py`: A Python script that defines the AIE array structural design using MLIR-AIE operations. This generates MLIR that is then compiled using `aiecc.py` to produce design binaries (ie. XCLBIN and inst.txt for the NPU in Ryzen AI). +1. `aie2.py`: A Python script that defines the AIE array structural design using MLIR-AIE operations. This generates MLIR that is then compiled using `aiecc.py` to produce design binaries (i.e., XCLBIN and inst.txt for the NPU in Ryzen™ AI). -1. `relu.cc`: A C++ implementation of a vectorized ReLU operation for AIE cores, which is 1:1 implementation of the inherent function using low level intrinsics. The AIE2 allows an element-wise max of 32 `bfloat16` numbers against a second vector register containing all zeros, implementing the $ReLU(x) = max(0,x)$ function directly. The source can be found [here](../../../aie_kernels/aie2/relu.cc). +1. `relu.cc`: A C++ implementation of a vectorized ReLU operation for AIE cores, which is a 1:1 implementation of the inherent function using low-level intrinsics. The AIE2 allows an element-wise max of 32 `bfloat16` numbers against a second vector register containing all zeros, implementing the $ReLU(x) = max(0,x)$ function directly. The source can be found [here](../../../aie_kernels/aie2/relu.cc). 1. `test.cpp`: This C++ code is a testbench for the design example. 
The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and executing the AIE design on the NPU. After executing, the script verifies the memcpy results and optionally outputs trace data. diff --git a/programming_examples/ml/resnet/README.md b/programming_examples/ml/resnet/README.md index 5bb146b006..39d863d687 100755 --- a/programming_examples/ml/resnet/README.md +++ b/programming_examples/ml/resnet/README.md @@ -55,9 +55,9 @@ The below figures shows our implementation of the conv2_x layers of the ResNet a
-Similar to our [bottleneck design](../../bottleneck), we implement conv2_x layers depth-first. Our implementation connects the output of one bottleneck block on an NPU column to another on a separate column, all without the necessity of transferring intermediate results off-chip. Compared to [bottleneck design](../../bottleneck), the first bottleneck block in the conv2_x stage requires an additional 1x1 convolution on the `AIE (0,4)` tile to handle channel mismatch for the skip addition between the input from the skip path and the input from the non-skip path. This mismatch arises because the initial input activation transferred from the skip path possesses fewer input channels compared to the output on the non-skip path. To overcome this issue, an additional 1x1 convolution is introduced in the skip path that the increases the number of channels.
+Similar to our [bottleneck](../../bottleneck) design, we implement conv2_x layers depth-first. Our implementation connects the output of one bottleneck block on an NPU column to another on a separate column, all without the necessity of transferring intermediate results off-chip. Compared to the [bottleneck](../../bottleneck) design, the first bottleneck block in the conv2_x stage requires an additional 1x1 convolution on the `AIE (0,4)` tile to handle channel mismatch for the skip addition between the input from the skip path and the input from the non-skip path. This mismatch arises because the initial input activation transferred from the skip path possesses fewer input channels compared to the output on the non-skip path. To overcome this issue, an additional 1x1 convolution is introduced in the skip path that increases the number of channels.

-After the initial processing in the first bottleneck block, the output is sent directly to the second bottleneck block on a separate NPU column. The output activation is broadcasted to both `AIE (1,5)` and `AIE (1,3)` via `Mem Tile (1,1)`. The second bottleneck's processing proceeds as described in [bottleneck design](../../bottleneck). Similarly, the subsequent bottleneck block requires the output from the second bottleneck, avoiding any need to send intermediate activations off-chip. Upon processing in the third bottleneck block, the final output is transmitted from tile `AIE (2,4)` back to the output via `Shim tile (2,0)`, completing the seamless flow of computation within the NPU architecture. Thus, our depth-first implementation avoids any unnecessary off-chip data movement for intermediate tensors.
+After the initial processing in the first bottleneck block, the output is sent directly to the second bottleneck block on a separate NPU column. The output activation is broadcasted to both `AIE (1,5)` and `AIE (1,3)` via `Mem Tile (1,1)`. The second bottleneck's processing proceeds as described in the [bottleneck](../../bottleneck) design. Similarly, the subsequent bottleneck block requires the output from the second bottleneck, avoiding any need to send intermediate activations off-chip. Upon processing in the third bottleneck block, the final output is transmitted from tile `AIE (2,4)` back to the output via `Shim tile (2,0)`, completing the seamless flow of computation within the NPU architecture. Thus, our depth-first implementation avoids any unnecessary off-chip data movement for intermediate tensors. 
diff --git a/programming_examples/ml/softmax/README.md b/programming_examples/ml/softmax/README.md index 0c17a1c50c..a413993996 100644 --- a/programming_examples/ml/softmax/README.md +++ b/programming_examples/ml/softmax/README.md @@ -36,17 +36,17 @@ The softmax function is a mathematical function commonly used in machine learnin The softmax function employs the exponential function $e^x$, similar to the example found [here](../../basic/vector_exp/). Again to efficiently implement softmax, a lookup table approximation is utilized. -In addition, and unlike any of the other current design examples, this example uses MLIR dialects as direct input, including the `vector`,`affine`,`arith` and `math` dialects. This is shown in the [source](./bf16_softmax.mlir). This is intended to be generated from a higher level description, but is shown here as an example of how you can use other MLIR dialects as input. +In addition, and unlike any of the other current design examples, this example uses MLIR dialects as direct input, including the `vector`,`affine`,`arith` and `math` dialects. This is shown in the [source](./bf16_softmax.mlir). This is intended to be generated from a higher-level description but is shown here as an example of how you can use other MLIR dialects as input. The compilation process is different from the other design examples, and is shown in the [Makefile](./Makefile). 1. The input MLIR is first vectorized into chunks of size 16, and a C++ file is produced which has mapped the various MLIR dialects into AIE intrinsics, including vector loads and stores, vectorized arithmetic on those registers, and the $e^x$ approximation using look up tables 1. This generated C++ is compiled into a first object file -1. A file called `lut_based_ops.cpp` from the AIE2 runtime libary is compiled into a second object file. This file contains the look up table contents to approximate the $e^x$ function. +1. A file called `lut_based_ops.cpp` from the AIE2 runtime library is compiled into a second object file. This file contains the look up table contents to approximate the $e^x$ function. 1. A wrapper file is also compiled into an object file, which prevents C++ name mangling, and allows the wrapped C function to be called from the strucural Python -1. These 3 object files and combined into a single .a file, which is then referenced inside the aie2.py structural Python. +1. These 3 object files are combined into a single .a file, which is then referenced inside the aie2.py structural Python. -This is a slightly more complex process than the rest of the examples, which typically only use a single object file containing the wrapped C++ function call, but is provided to show how a library based flow can also be used. +This is a slightly more complex process than the rest of the examples, which typically only use a single object file containing the wrapped C++ function call, but is provided to show how a library-based flow can also be used. ## Usage diff --git a/programming_examples/vision/color_detect/README.md b/programming_examples/vision/color_detect/README.md index f2f24dbea6..96558d3774 100644 --- a/programming_examples/vision/color_detect/README.md +++ b/programming_examples/vision/color_detect/README.md @@ -20,13 +20,13 @@ The pipeline is mapped onto a single column of the npu device, with one Shim til width="1150"> -The data movement of this pipeline is described using the ObjectFifo (OF) primitive. Input data is brought into the array via the Shim tile. 
The data then needs to be broadcasted both to AIE tile (0, 2) and AIE tile (0, 5). However, tile (0, 5) has to wait for additional data from the other kernels before it can proceed with its execution, so in order to avoid any stalls in the broadcast, data for tile (0, 5) is instead buffered in the Mem tile. Because of the size of the data, the buffering couldn't directly be done in the smaller L1 memory module of tile (0, 5). This is described using two OFs, one for the broadcast to tile (0, 2) and the Mem tile, and one for the data movement between the Mem tile and tile (0, 5). The two OFs are linked to express that data from the first OF should be copied to the second OF implicitly through the Mem tile's DMA. +The data movement of this pipeline is described using the ObjectFifo (OF) primitive. Input data is brought into the array via the Shim tile. The data then needs to be broadcasted both to AIE tile (0, 2) and AIE tile (0, 5). However, tile (0, 5) has to wait for additional data from the other kernels before it can proceed with its execution, so in order to avoid any stalls in the broadcast, data for tile (0, 5) is instead buffered in the Mem tile. Because of the size of the data, the buffering could not be done directly in the smaller L1 memory module of tile (0, 5). This is described using two OFs, one for the broadcast to tile (0, 2) and the Mem tile, and one for the data movement between the Mem tile and tile (0, 5). The two OFs are linked to express that data from the first OF should be copied to the second OF implicitly through the Mem tile's DMA. Starting from tile (0, 2) data is processed by each compute tile and the result is sent to the next tile. This is described by a series of one-to-one OFs. An OF also describes the broadcast from tile (0, 2) to tiles (0, 3) and (0, 4). As the three kernels `bitwiseOR`, `gray2rgba` and `bitwiseAND` are mapped together on AIE tile (0, 5), two OFs are also created with tile (0, 5) being both their source and destination to describe the data movement between the three kernels. Finally, the output is sent from tile (0, 5) to the Mem tile and then back to the output through the Shim tile. -To compile desing in Windows: +To compile design in Windows: ``` make make colorDetect.exe diff --git a/programming_guide/README.md b/programming_guide/README.md index df0471ebc0..6adf1dda44 100644 --- a/programming_guide/README.md +++ b/programming_guide/README.md @@ -12,9 +12,9 @@ -The AI Engine (AIE) array is a spatial compute architecture: a modular and scalable system with spatially distributed compute and memories. Its compute dense vector processing runs independently and concurrently to explicitly scheduled data movement. Since the vector compute core (green) of each AIE can only operate on data in its L1 scratchpad memory (light blue), data movement accelerators (purple) bi-directionally transport this data over a switched (dark blue) interconnect network, from any level in the memory hierarchy. +The AI Engine (AIE) array is a spatial compute architecture: a modular and scalable system with spatially distributed compute and memories. Its compute-dense vector processing runs independently and concurrently to explicitly scheduled data movement. Since the vector compute core (green) of each AIE can only operate on data in its L1 scratchpad memory (light blue), data movement accelerators (purple) bi-directionally transport this data over a switched (dark blue) interconnect network from any level in the memory hierarchy. 
-Programming the AIE-array configures all its spatial building blocks: the compute cores' program memory, the data movers' buffer descriptors, interconnect with switches, etc. This guide introduces our Interface Representation for hands-ON (IRON) close-to-metal programming of the AIE-array. IRON is an open access toolkit enabling performance engineers to build fast and efficient, often specialized designs through a set of Python language bindings around mlir-aie, our MLIR-based representation of the AIE-array. mlir-aie provides the foundation from which complex and performant AI Engine designs can be defined and is supported by simulation and hardware implementation infrastructure. +Programming the AIE-array configures all its spatial building blocks: the compute cores' program memory, the data movers' buffer descriptors, interconnect with switches, etc. This guide introduces our Interface Representation for hands-ON (IRON) close-to-metal programming of the AIE-array. IRON is an open-access toolkit enabling performance engineers to build fast and efficient, often specialized designs through a set of Python language bindings around mlir-aie, our MLIR-based representation of the AIE-array. mlir-aie provides the foundation from which complex and performant AI Engine designs can be defined and is supported by simulation and hardware implementation infrastructure. > **NOTE:** For those interested in better understanding how AI Engine designs are defined at the MLIR level, take a look through the [MLIR tutorial](../mlir_tutorials/) material. mlir-aie also serves as a lower layer for other higher-level abstraction MLIR layers such as [mlir-air](https://github.com/Xilinx/mlir-air). @@ -24,16 +24,16 @@ This IRON AIE programming guide first introduces the language bindings for AIE-a