diff --git a/aie_kernels/aie2/scale.cc b/aie_kernels/aie2/scale.cc index 5d277fd209..b4f64ee0b5 100755 --- a/aie_kernels/aie2/scale.cc +++ b/aie_kernels/aie2/scale.cc @@ -25,7 +25,7 @@ void scale_scalar(T *a, T *c, T factor, const int32_t N) { event1(); } -// Vectorized scale template +// Vectorized scale template (general case) // Assume N is multiple of 16 template void scale_vectorized(T *a, T *c, int32_t factor, const int32_t N) { @@ -46,7 +46,7 @@ void scale_vectorized(T *a, T *c, int32_t factor, const int32_t N) { event1(); } -// Vectorized scale template for int32_t (acc64 used) +// Vectorized scale template (int32_t case, acc64 used) // Assume N is multiple of 16 template <> void scale_vectorized(int32_t *a, int32_t *c, int32_t factor, diff --git a/programming_guide/README.md b/programming_guide/README.md index 6adf1dda44..fc0fd22aab 100644 --- a/programming_guide/README.md +++ b/programming_guide/README.md @@ -44,11 +44,10 @@ This IRON AIE programming guide first introduces the language bindings for AIE-a * Introduce an example of the first simple program (Vector Scalar Multiplication) * Illustrate how to run designs on Ryzen™ AI-enabled hardware -
Section 4 - Vector programming & Performance Measurement +
Section 4 - Performance Measurement & Vector Programming -* Discuss the topic of vector programming at the kernel level -* Introduce performance measurement (trace) and how we measure cycle count and efficiency -* Performant Vector Scalar Multiplication design example +* Introduce performance measurement (timers, trace) +* Discuss the topic of vector programming at the kernel level
Section 5 - Example Vector Designs diff --git a/programming_guide/section-3/README.md b/programming_guide/section-3/README.md index ec6471b28a..c2810b35e9 100644 --- a/programming_guide/section-3/README.md +++ b/programming_guide/section-3/README.md @@ -16,20 +16,21 @@ This section creates the first program that will run on the AIE-array. As shown For the AIE-array structural description we will combine what you learned in [section-1](../section-1) for defining a basic structural design in Python with the data movement part from [section-2](../section-2). -For the AIE kernel code, we will start with non-vectorized code that will run on the scalar processor part of an AIE. [section-4](../section-4) will introduce how to vectorize a compute kernel to harvest the compute density of the AIE. +For the AIE kernel code, we will start with non-vectorized code that will run on the scalar processor part of an AIE. [Section-4](../section-4) will introduce how to vectorize a compute kernel to harvest the compute density of the AIE. The host code can be written in either C++ (as shown in the figure) or in Python. We will also introduce some convenience utility libraries for typical test functionality and to simplify context and buffer creation when the [Xilinx RunTime (XRT)](https://github.com/Xilinx/XRT) is used, for instance in the [AMD XDNA Driver](https://github.com/amd/xdna-driver) for Ryzen™ AI devices. -Throughout this section, a [vector scalar multiplication](../../programming_examples/basic/vector_scalar_mul/) (c = a * factor) will be used as an example. Vector scalar multiplication takes an input vector a and computes the output vector c by multiplying each element of a with a factor. In our example, the total vector size is set to 4096 integers (32b) that will processed in chunks of 1024. -This design is also available in the [programming_examples](../../programming_examples) of this repository. We will first introduce the AIE-array structural description, the review the kernel code and then introduce the host code. Finally we will show ho to run the design on Ryzen™ AI enabled hardware. +Throughout this section, a [vector scalar multiplication](../../programming_examples/basic/vector_scalar_mul/) (c = a * factor) will be used as an example. Vector scalar multiplication takes an input vector `a` and computes the output vector `c` by multiplying each element of a with a factor. In this example, the total vector size is set to 4096 (16b) that will processed in chunks of 1024. + +This design is also available in the [programming_examples](../../programming_examples) of this repository. We will first introduce the AIE-array structural description, review the kernel code and then introduce the host code. Finally we will show how to run the design on Ryzen™ AI enabled hardware. ## AIE-array Structural Description -The [aie2.py](../../programming_examples/basic/vector_scalar_mul/aie2.py) AIE-array structural description (see [section-1](../section-1)) deploys both a compute core (green) for the multiplication and a shimDMA (purple) for data movement of both input vector a and output vector c residing in external memory. +The [aie2.py](../../programming_examples/basic/vector_scalar_mul/aie2.py) AIE-array structural description (see [section-1](../section-1)) deploys both a compute core (green) for the multiplication and a shimDMA (purple) for data movement of both input vector `a` and output vector `c` residing in external memory. 
```python # Device declaration - here using aie2 device NPU @@ -41,7 +42,7 @@ def device_body(): ComputeTile2 = tile(0, 2) ``` -We also need to declare that the compute core will run an external function: a kernel written in C++ that will be linked into the design as pre-compiled kernel (more details in the next subsection). To get our initial design running on the AIE-array, we will run a generic version of the vector scalar multiply run on the scalar processor of the AIE. +We also need to declare that the compute core will run an external function: a kernel written in C++ that will be linked into the design as pre-compiled kernel (more details below). To get our initial design running on the AIE-array, we will run a generic version of the vector scalar multiply design here in this directory that is run on the scalar processor of the AIE. This local version will use `int32_t` datatype instead of the default `int16_t`for the [programming_examples version](../../programming_examples/basic/vector_scalar_mul/). ```python # Type declarations @@ -57,7 +58,7 @@ Since the compute core can only access L1 memory, input data needs to be explici -This enables looking at the data movement in the AIE-array from a logical view where we deploy 3 objectFIFOs: "of_in" to bring in the vector a, "of_factor" to bring in the scalar factor, and "of_out" to move the output vector c, all using shimDMA. Note that the objects for "of_in" and "of_out" are declared to have the `memRef_ty` type: 1024 int32 elements, while the factor is an object containing a single integer. All objectFIFO are set up using a depth size of 2 to enable the concurrent execution to the Shim Tile and Compute Tile DMAs data movement with the processing on the compute core. +This enables looking at the data movement in the AIE-array from a logical view where we deploy 3 objectFIFOs: "of_in" to bring in the vector `a`, "of_factor" to bring in the scalar factor, and "of_out" to move the output vector `c`, all using shimDMA. Note that the objects for "of_in" and "of_out" are declared to have the `memRef_ty` type: 1024 int32 elements, while the factor is an object containing a single integer. All objectFIFOs are set up using a depth size of 2 to enable the concurrent execution to the Shim Tile and Compute Tile DMAs with the processing on the compute core. ```python # AIE-array data movement with object fifos @@ -66,7 +67,7 @@ This enables looking at the data movement in the AIE-array from a logical view w of_out = object_fifo("out", ComputeTile2, ShimTile, 2, memRef_ty) ``` -We also need to set up the data movement to/from the AIE-array: configure n-dimensional DMA transfers in the shimDMAs to read/write to/from L3 external memory. For NPU, this is done with the `npu_dma_memcpy_nd` function (more details in [section 2-g](../section-2/section-2g)). Note that the n-dimensional transfer has a size of 4096 int32 elements and that the `metadata` argument in the `npu_dma_memcpy_nd` needs to match the `name` argument of the corresponding object FIFO. +We also need to set up the data movement to/from the AIE-array: configure n-dimensional DMA transfers in the shimDMAs to read/write to/from L3 external memory. For NPU, this is done with the `npu_dma_memcpy_nd` function (more details in [section 2-g](../section-2/section-2g)). 
Note that the n-dimensional transfer has a size of 4096 int32 elements and that the `metadata` argument in the `npu_dma_memcpy_nd` needs to match the `name` argument of the corresponding object FIFO, in this case `in`, `inFactor` and `out`. ```python # To/from AIE-array data movement @@ -81,9 +82,9 @@ We also need to set up the data movement to/from the AIE-array: configure n-dime npu_sync(column=0, row=0, direction=0, channel=0) ``` -Finally, we need to configure how the compute core accesses the data moved to its L1 memory, in objectFIFO terminology: we need to program the acquire and release patterns of "of_in", "of_factor" and "of_out". Only a single factor is needed for the complete 4096 vector, while for every processing iteration on a sub-vector, we need to acquire and object of 1024 integers to read from "of_in" and one similar sized object from "of_out". Then we call our previously declared external function with the acquired objects as operands. After the vector scalar operation, we need to release both objects to their respective "of_in" and "of_out" objectFIFO. After the 4 sub-vector iterations, we release the "of_factor" objectFIFO. +Finally, we need to configure how the compute core accesses the data moved to its L1 memory, in objectFIFO terminology: we need to program the acquire and release patterns of "of_in", "of_factor" and "of_out". Only a single factor is needed for the complete 4096 vector, while for every processing iteration on a sub-vector, we need to acquire an object of 1024 integers to read from "of_in" and one similar sized object from "of_out". Then we call our previously declared external function with the acquired objects as operands. After the vector scalar operation, we need to release both objects to their respective "of_in" and "of_out" objectFIFO. And finally after the 4 sub-vector iterations, we release the "of_factor" objectFIFO. -This access and execute pattern runs on the AIE compute core `ComputeTile2` and needs to get linked against the precompiled external function "scale.o". We run this pattern in a very large loop to enable enqueuing multiple rounds vector scalar multiply work from the host code. +This access and execute pattern runs on the AIE compute core `ComputeTile2` and needs to get linked against the precompiled external function `"scale.o"`. We run this pattern in a very large loop to enable enqueuing multiple rounds of vector scalar multiply work from the host code. ```python @core(ComputeTile2, "scale.o") @@ -105,7 +106,7 @@ This access and execute pattern runs on the AIE compute core `ComputeTile2` and ## Kernel Code -We can program the AIE compute core using C++ code and compile it with `xchesscc` into an kernel object file. In this section, we will use a generic implementation of the vector scalar multiplication that can run on the scalar processor part of the AIE. The `vector_scalar_mul_aie_scalar` function processes one data element at a time, taking advantage of AIE scalar datapath to load, multiply and store data elements. +We can program the AIE compute core using C++ code and compile it with `xchesscc` into a kernel object file. For our local verion of vector scalar multiply, we will use a generic implementation of the `scale.cc` source (called [vector_scalar_mul.cc](./vector_scalar_mul.cc)) that can run on the scalar processor part of the AIE. The `vector_scalar_mul_aie_scalar` function processes one data element at a time, taking advantage of AIE scalar datapath to load, multiply and store data elements. 
```c void vector_scalar_mul_aie_scalar(int32_t *a_in, int32_t *c_out, @@ -122,19 +123,24 @@ Note that since the scalar factor is communicated through an object, it is provi ## Host Code -The host code acts as environment setup and testbench for the Vector Scalar Multiplication design example. The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and kick off the execution the AIE design on the NPU. After running, it verifies the results and optionally outputs trace data. Both a C++ [test.cpp](./test.cpp) and Python [test.py](./test.py) variant of this code are available. +The host code acts as an environment setup and testbench for the Vector Scalar Multiplication design example. The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and kick off the execution of the AIE design on the NPU. After running, it verifies the results and optionally outputs trace data (to be covered in [section-4c](../section-4/section-4c/). Both a C++ [test.cpp](./test.cpp) and Python [test.py](./test.py) variants of this code are available. -For convenience, a set of test utilities support common elements of command line parsing, the XRT-based environment setup and with testbench functionality: [test_utils.h](../../runtime_lib/test_lib/test_utils.h) or [test.py](../../python/utils/test.py). +For convenience, a set of test utilities support common elements of command line parsing, the XRT-based environment setup and testbench functionality: [test_utils.h](../../runtime_lib/test_lib/test_utils.h) or [test.py](../../python/utils/test.py). The host code contains the following elements: -1. *Parse program arguments and set up constants*: the host code typically takes at least as arguments: `-x` the XCLBIN file, `-k` kernel name (with default name "MLIR_AIE"), and `-i` the instruction sequence file as arguments since it is its task to load those files and set the kernel name. Both the XCLBIN and instruction sequence are generated when compiling the AIE-array Structural Description and kernel code with `aiecc.py`. +1. *Parse program arguments and set up constants*: the host code typically requires the following 3 arguments: + * `-x` the XCLBIN file + * `-k` kernel name (with default name "MLIR_AIE") + * `-i` the instruction sequence file as arguments + + This is because it is its task to load those files and set the kernel name. Both the XCLBIN and instruction sequence are generated when compiling the AIE-array structural description and kernel code with `aiecc.py`. 1. *Read instruction sequence*: load the instruction sequence from the specified file in memory 1. *Create XRT environment*: so that we can use the XRT runtime -1. *Create XRT buffer objects* for the instruction sequence, inputs (vector a and factor) and output (vector c). Note that the `kernel.group_id()` needs to match the order of `def sequence(A, F, C):` in the data movement to/from the AIE-array of Python AIE-array structural description, starting with ID number 2 for the first sequence argument and then incrementing by 1. +1. *Create XRT buffer objects* for the instruction sequence, inputs (vector `a` and `factor`) and output (vector `c`). Note that the `kernel.group_id()` needs to match the order of `def sequence(A, F, C):` in the data movement to/from the AIE-array of python AIE-array structural description, starting with ID number 2 for the first sequence argument and then incrementing by 1. 
This mapping is described as well in the [python utils documentation](../../python/utils/README.md#configure-shimdma). 1. *Initialize and synchronize*: host to device XRT buffer objects diff --git a/programming_guide/section-4/README.md index abe99b6010..a681de4d10 100644 --- a/programming_guide/section-4/README.md +++ b/programming_guide/section-4/README.md @@ -8,9 +8,9 @@ // //===----------------------------------------------------------------------===//--> -# Section 4 - Vector Programming & Performance Measurement +# Section 4 - Performance Measurement & Vector Programming -Now that you've had a chance to walk through the components of compiling and running a program on the Ryzen™ AI hardware in [section-3](../section-3), we will start looking at how we measure performance and utilize vector programming techniques to fully leverage the power of the AI Engines for parallel compute. +Now that you've had a chance to walk through the components of compiling and running a program on the Ryzen™ AI hardware in [section-3](../section-3), we will start looking at how we measure performance and utilize vector programming techniques to fully leverage the power of the AI Engines for parallel compute. It is helpful to first examine performance measurement before we delve into vector programming in order to get a baseline for where our application performance is. There are many factors that contribute to performance including latency, throughput and power efficiency. Performance measurement is an active area of research to provide more powerful tools for users to measure the speedup of their application on AIEs. In [section-4a](./section-4a) and [section-4b](./section-4b/), we look at performance from the perspective of timers and trace. Then in [section-4c](./section-4c), we look more closely at how to vectorize AIE kernel code. diff --git a/programming_guide/section-4/section-4a/Makefile index 370ac81cfc..ad1a03468e 100644 --- a/programming_guide/section-4/section-4a/Makefile +++ b/programming_guide/section-4/section-4a/Makefile @@ -31,7 +31,7 @@ ${targetname}.exe: ${srcdir}/test.cpp rm -rf _build mkdir -p _build cd _build && ${powershell} cmake -E env CXXFLAGS="-std=c++23 -ggdb" cmake ${srcdir} -D CMAKE_C_COMPILER=gcc-13 -D CMAKE_CXX_COMPILER=g++-13 -DTARGET_NAME=${targetname} - cd _build && ${powershell} cmake --build , --config Release + cd _build && ${powershell} cmake --build .
--config Release ifeq "${powershell}" "powershell.exe" cp _build/${targetname}.exe $@ else @@ -41,11 +41,11 @@ endif run: ${targetname}.exe build/final.xclbin build/insts.txt ${powershell} ./$< -x build/final.xclbin -i build/insts.txt -k MLIR_AIE -run-1: ${targetname}.exe build/final.xclbin build/insts.txt - ${powershell} ./$< -x build/final.xclbin -i build/insts.txt -k MLIR_AIE --iters 1 -v 2 +run-10: ${targetname}.exe build/final.xclbin build/insts.txt + ${powershell} ./$< -x build/final.xclbin -i build/insts.txt -k MLIR_AIE --iters 10 -run-4: ${targetname}.exe build/final.xclbin build/insts.txt - ${powershell} ./$< -x build/final.xclbin -i build/insts.txt -k MLIR_AIE --iters 4 -v 2 +run-10-warmup: ${targetname}.exe build/final.xclbin build/insts.txt + ${powershell} ./$< -x build/final.xclbin -i build/insts.txt -k MLIR_AIE --iters 10 --warmup 4 run_py: build/final.xclbin build/insts.txt ${powershell} python3 test.py -x build/final.xclbin -i build/insts.txt -k MLIR_AIE diff --git a/programming_guide/section-4/section-4a/README.md b/programming_guide/section-4/section-4a/README.md index 79acbb02fe..507dbb375d 100644 --- a/programming_guide/section-4/section-4a/README.md +++ b/programming_guide/section-4/section-4a/README.md @@ -10,7 +10,7 @@ # Section 4a - Timers -* [Section 4 - Vector Programming & Peformance Measurement](../../section-4) +* [Section 4 - Performance Measurement & Vector Programming](../../section-4) * Section 4a - Timers * [Section 4b - Trace](../section-4b) * [Section 4c - Kernel Vectorization and Optimization](../section-4c) @@ -20,7 +20,7 @@ We begin by first looking at timers for measuring application performance and what that tells us. The performance of an accelerated AI Engine application involves a number of components on the software stack, from invoking the application at the OS level, to passing control on to the kernel drivers, moving and dispatching work to the AIE array, running the accelerated application on AIE cores, and finally returning the data to the application for next-step processing. The most straightforward way to capture the performance of this entire stack of communication and processing is with an application timer, also known as the "wall clock" time. This gives us the upper bounds for how long an AIE accelerated application takes but adds to it the OS and kernel driver overhead. This is something that can be minimized when running multiple iterations of an acclerated program or running a sufficiently compute intensive application. Let's take a look at how we add the "wall clock" timer to an example program. ## Application timer - Modifying [test.cpp](./test.cpp) -Adding the application timer is as simple as noting a start and stop time surrounding the calling of the kernel function. We can use the clock timer from the chrono library which is imported via `import ` but this may already be imported by other libraries (in our `test.cpp`, this is the case). Then we record the start and stop time of our chrono timer with function calls surrounding our kernel function call like the following: +Adding the application timer is as simple as noting a start and stop time surrounding the calling of the kernel function. We can use the clock timer from the chrono library which is imported via `import ` but this may already be imported by other libraries (this is the case in our `test.cpp`). 
Then we record the start and stop time of our chrono timer with timer function calls surrounding our kernel function as follows: ```c++ auto start = std::chrono::high_resolution_clock::now(); @@ -34,9 +34,9 @@ Adding the application timer is as simple as noting a start and stop time surrou This provides us with a good baseline for how long our accelerated kernel function takes. ## Multiple iterations -A timer for a single kernel function call is a useful starting data point for understanding performance but there can be a lot of variability and overhead for a single call that is smoothed out when run multiple times. In order to benchmark the steady-state kernel run time, we can add code around our kernel call to execute multiple times and capture the minimium, maximize and average time that our kernel takes. +A timer for a single kernel function call is a useful starting point for understanding performance but there can be a lot of variability and overhead for a single call that is smoothed out when run multiple times. In order to benchmark the steady-state kernel run time, we can add code around our kernel call to execute multiple times and capture the minimum, maximum and average time that our kernel takes. -In our example [test.cpp](./test.cpp), we wrap our calls within a for loop (based on `num_iter`/ number of iterations). +In our example [test.cpp](./test.cpp), we wrap our calls within a for loop (based on `num_iter` or number of iterations). ```c++ unsigned num_iter = n_iterations + n_warmup_iterations; @@ -44,7 +44,7 @@ In our example [test.cpp](./test.cpp), we wrap our calls within a for loop (base <... kernel run code ...> } ``` -It is also useful to run the kernel a number of times prior to recording the steady-state average. This hides initial startup timing overhead that sometimes occurs during the first few runs. We call these initial loops warmup iterations which do not include verifying results and measuring the kernel function time. +It is also useful to run the kernel a number of times prior to calculating the steady-state run times. This hides initial startup timing overhead that sometimes occurs during the first few runs. We call these initial loops warmup iterations, where we do not verify the results or measure the run time during warmup. ```c++ for (unsigned iter = 0; iter < num_iter; iter++) { <... kernel run code ...> } @@ -64,19 +64,23 @@ Finally, we accumulate relevant timer data to calculate and track average, minim npu_time_max = (npu_time > npu_time_max) ? npu_time : npu_time_max; } ``` -We can then compute and print the actual average, minimum and maximum run times. +We can then compute and print the actual average, minimum and maximum run times at the end of our host code. ```c++ std::cout << "Avg NPU time: " << npu_time_total / n_iterations << "us." << std::endl; std::cout << "Min NPU time: " << npu_time_min << "us." << std::endl; std::cout << "Max NPU time: " << npu_time_max << "us." << std::endl; ``` +In addition, if you have an estimate of the number of MACs each kernel execution takes, you can report additional performance data such as GFLOPs as can be seen in the matrix multiplication example [test.cpp](../../../programming_examples/basic/matrix_multiplication/test.cpp#L170).
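Putting these fragments together, a minimal sketch of the warmup-plus-measurement pattern might look like the code below. This is only an illustration: `run_kernel_once()` is a hypothetical stand-in for the actual XRT kernel invocation and buffer synchronization done in [test.cpp](./test.cpp), and the iteration counts are example values, not the defaults of the design.

```c++
#include <chrono>
#include <iostream>

// Hypothetical stand-in for enqueuing the kernel, waiting for completion and
// syncing output buffers back to the host (see test.cpp for the real calls).
void run_kernel_once() { /* kernel invocation goes here */ }

int main() {
  unsigned n_iterations = 10;        // measured iterations (example value)
  unsigned n_warmup_iterations = 4;  // warmup iterations (example value)
  unsigned num_iter = n_iterations + n_warmup_iterations;
  float npu_time_total = 0, npu_time_min = 9999999, npu_time_max = 0;

  for (unsigned iter = 0; iter < num_iter; iter++) {
    auto start = std::chrono::high_resolution_clock::now();
    run_kernel_once();
    auto stop = std::chrono::high_resolution_clock::now();

    if (iter < n_warmup_iterations)
      continue; // warmup runs: skip verification and timing

    float npu_time =
        std::chrono::duration_cast<std::chrono::microseconds>(stop - start)
            .count();
    npu_time_total += npu_time;
    npu_time_min = (npu_time < npu_time_min) ? npu_time : npu_time_min;
    npu_time_max = (npu_time > npu_time_max) ? npu_time : npu_time_max;
  }

  std::cout << "Avg NPU time: " << npu_time_total / n_iterations << "us." << std::endl;
  std::cout << "Min NPU time: " << npu_time_min << "us." << std::endl;
  std::cout << "Max NPU time: " << npu_time_max << "us." << std::endl;
  return 0;
}
```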
+ ## Exercises -1. Take a look at the timer code in our example [test.cpp](./test.cpp). Then build and run the design by calling `make run` and note the reported average "wall clock" time. What value did you see? 1. Our design was run once with a single iteration and no warmup. Let's run our design again by calling `make run` again. What reported Avg NPU time did you see this time? -1. Let's set our iterations to 10 and run again with `make run` which recompiles our host code for `test.cpp`. What reported Avg NPU time do you see this time? +1. Let's set our iterations to 10 and run again with `make run-10` which passes in the argument `--iters 10` to our executable. What reported Avg NPU time do you see this time? + +1. Finally, let's add 4 warmup iterations to cut higher outliers when the application is first run by calling `make run-10-warmup`. This passes in the `--warmup 4` argument to our executable. What reported Avg NPU time do you see this time? ----- [[Up]](../../section-4) [[Next]](../section-4b) diff --git a/programming_guide/section-4/section-4b/README.md index 6a5a6d3832..1676eaa295 100644 --- a/programming_guide/section-4/section-4b/README.md +++ b/programming_guide/section-4/section-4b/README.md @@ -10,7 +10,7 @@ # Section 4b - Trace -* [Section 4 - Vector Programming & Peformance Measurement](../../section-4) +* [Section 4 - Performance Measurement & Vector Programming](../../section-4) * [Section 4a - Timers](../section-4a) * Section 4b - Trace * [Section 4c - Kernel Vectorization and Optimization](../section-4c) @@ -34,21 +34,21 @@ Enabling trace support can be done with the following steps: Enabling tracing means (1a) configuring the trace units for a given tile and then (1b) routing the generated events packets through the stream switches to the shim DMA where we can write them to a buffer in DDR for post-runtime processing. ### (1a) Configure trace units for an AIE tile -The first necessary component for trace configuration is setting the right values for the trace control registers for each tile that we want to enable tracing for. In addition, the generated trace packets will need to be routed to shimDMA and then written to one of the 3 inout buffers. We have abstracted these two steps with the python wrapper function `configure_simple_tracing_aie2` which is in [python/utils/test.py](../../../python/utils/test.py) and is described in more detail in the [README.md under python/utils](../../../python/utils). An example of how this function is used is shown below for quick reference +The first necessary component for trace configuration is setting the right values for the trace control registers for each tile that we want to enable tracing for. In addition, the generated trace packets will need to be routed to shimDMA and then written to one of the 3 inout buffers. We have abstracted these two steps with the python wrapper function `configure_simple_tracing_aie2` which is in [python/utils/test.py](../../../python/utils/test.py) and is described in more detail in the [README](../../../python/utils) under `python/utils`. An example of how this function is used is shown below for quick reference: ```python trace_utils.configure_simple_tracing_aie2( ComputeTile2, ShimTile, - ddr_id=1, + ddr_id=2, size=traceSizeInBytes, offset=tensorSize, ) ``` This block is defined within the sequence definition for `@FuncOp.from_py_func` where we define the shimDMA data movement to the 3 inout buffers.
-> **Note** This simplification works very well for the trace buffer from a single tile to the shimDMA. However, if we want to do something more complicated like allocating the trace buffer from multiple tiles into a single larger buffer, this function will not be able to express that. For that, please consult the [README.md under python/utils](../../../python/utils) for more guidance on how to customize the trace configuration. +> **Note** This simplification works very well for the trace buffer from a single tile to the shimDMA. However, if we want to do something more advanced like allocating the trace buffer from multiple tiles into a single larger buffer, this function will not be able to express that. For that, please consult the [README](../../../python/utils) under `python/utils` for more guidance on how to customize the trace configuration. ### (1b) Define trace event routes from tile to shimDMA -Once the trace units and shimDMA are configured, we need to define how the trace packets are routed from compute tile to shim tile. This is done via circuit switched flows or packet switched flows as described below. +Once the trace units and shimDMA are configured, we need to define how the trace packets are routed from compute tile to shim tile. This is done via circuit switched flows or packet switched flows as described below. Note that trace units in the MemTile and ShimTile can also be configured and routed. #### Circuit switched flows An example of a simple circuit switch routing flow to route trace event packets from a compute tile to a shimDMA would be: @@ -57,12 +57,45 @@ An example of a simple circuit switch routing flow to route trace event packets flow(ComputeTile, WireBundle.Trace, 0, ShimTile, WireBundle.DMA, 1) ``` -It is important to consider the path this routing might take and how many other streams might be using that same path. This points to whether our design may experience stream routing congestion or not. While capturing trace events are non-intrusive (does not affect the performance of the AIE cores), the routing of these trace packets are not and need to be balanced in your design to prevent congestion. +`flow` creates a circuit switched flow between src and dest and has the general syntax of: ```python flow(source, source_bundle, source_channel, dest, dest_bundle, dest_channel) ``` * *source* - source tile of the flow * *source_bundle* - type of source WireBundle (see full list in AIEAttrs.td) * *source_channel* - source channel index * *dest* - destination tile of the flow * *dest_bundle* - type of destination WireBundle (see full list in AIEAttrs.td) * *dest_channel* - destination channel index + +It is important to consider the path this routing might take and how many other streams might be using that same path. This points to whether our design may experience stream routing congestion or not. While capturing trace events is non-intrusive (it does not affect the performance of the AIE cores), the routing of these trace packets needs to be balanced in your design to prevent congestion. #### Packet switched flows The alternative to circuit switched routes is packet switched routes. The benefit of this is the ability to share a single stream switch routing channel between multiple routes. The drawback is the slight overhead of data packet headers as well as needing to gauge how much congestion might be present on a shared route given the data movement requirement of the AIE array design.
This means that if multiple flows are sharing the same channel, any particular flow might experience backpressure while another flow is serviced. Depending on the performance requirement of the design, this may or may not have a performance impact. -To support packet switched flows, we need to declare packet flows and attach both a `packet ID` and `packet type` to the packets. `Packet type` in particular is needed to distinguish packets coming from different tiles types (tile core, tile memory, memtiles, shimtiles). The association between tile trace unit and packet types are as follows: +In IRON python bindings, we declare packet flows with the following syntax: ```python packetflow(pkt_id, source, source_port, source_channel, dest, dest_port, dest_channel, keep_pkt_header) ``` * *pkt_id* - unique packet ID * *source* - source tile of the packet flow * *source_port* - type of source WireBundle (see full list in AIEAttrs.td). Some examples include `WireBundle.Trace`, `WireBundle.DMA`, `WireBundle.North` * *source_channel* - source channel index. For a given port, we often use multiple channels such as DMA channel 0 and DMA channel 1. In AIE2 core tiles, trace ports use channel 0 for the tile core and 1 for the tile memory. * *dest* - destination tile of the packet flow * *dest_port* - type of destination WireBundle (see full list in AIEAttrs.td) * *dest_channel* - destination channel index * *keep_pkt_header* - boolean flag to keep the packet header + +MLIR examples are similar and are included below for quick reference but are more fully defined in the [AIE Dialect online documentation](https://xilinx.github.io/mlir-aie/AIE.html): ```mlir packetflow(1) { aie.packet_source<%tile02, Trace : 0> // core trace aie.packet_dest<%tile00, DMA : 1> } {keep_pkt_header = "true"} ``` + +To support packet switched flows, we need to declare packet flows and attach both a `packet ID` and `packet type` to the packets. `Packet type` in particular is needed to distinguish packets coming from different tile types (tile core, tile memory, memtiles, shimtiles). The association between tile trace units and packet types is as follows: | Tile trace unit | packet type | |-----------------|-------------| @@ -73,11 +106,7 @@ To support packet switched flows, we need to declare packet flows and attach bot **NOTE**: Quick reminder that most source flow channels from `WireBundle.Trace` will use channel 0, but the `Tile memory` actually uses channel 1. -The `packet IDs`, on the other hand, can be anything you want as long as they are globally unique to distinguish routes from one another. An example is shown below for two tiles where both tile core and tile memory trace units are routed. Note the `packet ID` used after the `packetflow` keyword. Also note that we set `keep_pkt_hdr = true` as we would like to keep the packet headers when they are moved to DDR so we can distinguish the packets during post-run parsing. - -In IRON python bindings, we declare packet flows with the following syntax: - -`packetflow(packet ID, Source Tile, Source Port, Source Port Channel, Destination Tile, Destination Port, Destination Port Channel, Keep Packet Header boolean)` +The `packet IDs`, on the other hand, can be variable but must be globally unique to distinguish routes from one another. An example is shown below for two tiles where both tile core and tile memory trace units are routed. Note the `packet ID` used after the `packetflow` keyword.
Also note that we set `keep_pkt_hdr = true` as we would like to keep the packet headers when they are moved to DDR so we can distinguish the packets during post-run parsing. ```python packetflow(1, ComputeTile2, WireBundle.Trace, 0, ShimTile, WireBundle.DMA, 1, keep_pkt_hdr=True) # core trace @@ -85,27 +114,13 @@ packetflow(2, ComputeTile2, WireBundle.Trace, 1, ShimTile, WireBundle.DMA, 1, ke packetflow(3, ComputeTile3, WireBundle.Trace, 0, ShimTile, WireBundle.DMA, 1, keep_pkt_hdr=True) # core trace packetflow(4, ComputeTile3, WireBundle.Trace, 1, ShimTile, WireBundle.DMA, 1, keep_pkt_hdr=True) # core mem trace ``` -* packet ID - The first argument that uniquely identifies each packet flow. - -Then we have 3 arguments for the source and 3 for the destination. -* `Tile` - Previously defined tile -* `Port` - Wire bundles for the port including `WireBundle.Trace`, `WireBundle.DMA`, `WireBundle.North`, etc. -* `Channel` # - For a given port, we often use multiple channels such as DMA channel 0 and DMA channel 1. Another example in AIE2, trace ports use channel 0 for the tile core and 1 for the tile memory. - -MLIR examples are similar and are included below for quick reference but are more fully defined in the [AIE Dialect online documentation](https://xilinx.github.io/mlir-aie/AIE.html): -```mlir -packetflow(1) { - aie.packet_source<%tile02, Trace : 0> // core trace - aie.packet_dest<%tile00, DMA : 1> -} {keep_pkt_header = "true"} -``` ## 2. Configure host code to read trace data and write it to a text file -Once the trace units are configured and enabled, we want the host code to read the trace data from DDR and write it out to a text file for post-run processing. To give a better sense of how this comes together, this section provides an example design source files and Makefile whose kernel is based off the [Vector Scalar Add example](../../../programming_examples/basic/vector_scalar_add/). +Once the trace units are configured and enabled, we want the host code to read the trace data from DDR and write it out to a text file for post-run processing. To give a better sense of how this comes together, this section provides an example design that is again a simplifed version of the [Vector Scalar Multiply example](../../../programming_examples/basic/vector_scalar_mul/). ### AIE structural design code ([aie2.py](./aie2.py)) -In order to write the DDR data to a text file, we need to decide where we want the DDR data to first be stored and then read from that location, before writing to a text file. This starts inside the [aie2.py](./aie2.py) file where we use the `configure_simple_tracing_aie2` function call to configure the trace units and program the shimDMA to write to one of the 3 inout buffers. There are many ways to configure our structural design to write this data out but one pattern is the following: `inout0` is for input data, `inout1` is for output data, and `inout2` is for output trace data as illustrated below: +In order to write the DDR data to a text file, we need to decide where we want the DDR data to first be stored and then read from that location, before writing to a text file. This starts inside the [aie2.py](./aie2.py) file where we use the `configure_simple_tracing_aie2` function call to configure the trace units and program the shimDMA to write to 1 of the 3 inout buffers. 
There are many ways to configure our structural design to write this data out but one pattern is the following: `inout0` is for input data, `inout1` is for output data, and `inout2` is for output trace data as illustrated below: | inout0 | inout1 | inout2 | |--------|--------|--------| @@ -124,12 +139,11 @@ As described in [python/utils](../../../python/utils) for `trace.py`, we configu | 1 | inout1 | | 2 | inout2 | -Our section-4b example is modeled after the [Vector Scalar Multiply example](../../../programming_examples/basic/vector_scalar_mul). Here, we are using the second inout mapping pattern (inputA, inputB, outputC + trace) in our [aie2.py](./aie.py) source where `inout0` is called `A` (the vetor input), `inout1` is called `F` (the scalar input) and `inout2` is called `C` (the vector output). Since the trace is mapped to `inout2`, we set `ddr_id=2` and the offset to be the output data buffer size since the trace is appended after the data (`offset=4096*4`). +In our simplified vector scalar multiply example, we are using the second inout mapping pattern (inputA, inputB, outputC + trace) as seen in the [aie2.py](./aie.py) source where `inout0` is called `A` (the vector input), `inout1` is called `F` (the scalar input) and `inout2` is called `C` (the vector output). Since the trace is mapped to `inout2`, we set `ddr_id=2` and set the offset to be the output data buffer size given the trace is appended after the data (`offset=4096*4`). Once [aie2.py](./aie2.py) is configured to output trace data through one of the 3 inout buffers with matching `ddr_id` config and `offset`, we turn our attention to the host code to read the DDR data and write it to a file. - -> **NOTE** In our example design, the [aie2.py](./aie2.py) and associated [Makefile](./Makefile), we provide a Makefile target `run` for standard build and `trace` for trace-enabled build. The trace-enabled build passes the trace buffer size as an argument to [aie2.py](./aie2.py) which conditionally enables the trace `flow` and calls `configure_simple_tracing_aie2` as long as `trace_size` is > 0. This is also true for the [Vector Scalar Multiply example](../../../programming_examples/basic/vector_scalar_mul). +> **NOTE** In our example design ([aie2.py](./aie2.py), [Makefile](./Makefile)), we provide a Makefile target `run` for standard build and `trace` for trace-enabled build. The trace-enabled build passes the trace buffer size as an argument to [aie2.py](./aie2.py) which conditionally enables the trace `flow` and calls `configure_simple_tracing_aie2` as long as `trace_size` is > 0. This is also true for the [Vector Scalar Multiply example](../../../programming_examples/basic/vector_scalar_mul). ### (2a) C/C++ Host code ([test.cpp](./test.cpp)) The main changes needed for [test.cpp](./test.cpp) is the increase in the output buffer size to account for the trace buffer size, being careful to read only the output buffer portion when verifying correctness of the results. We also need to be sure to pass the correct buffer offset which points to the trace buffer data when calling `write_out_trace`. @@ -146,7 +160,7 @@ Within the [test.cpp](./test.cpp), we redefine OUT_SIZE to be the sum of output ```c++ int OUT_SIZE = IN_SIZE + trace_size; ``` -All subsequent references to the output buffer size should use `OUT_SIZE`. The exception is when we want to verify the output results which should be bounded by the original output buffer size, in this case `IN_SIZE`. +All subsequent references to the output buffer size should use `OUT_SIZE`. 
The exception is when we want to verify the output results which should be bounded by the original output buffer size, in this case `IN_SIZE`. Finally, the function to write the trace output to a file as defined in `aie.utils.trace` is `write_out_trace` and we need to pass it the pointer in the output buffer where the trace data begins, the trace buffer size and the trace file name (default is `trace.txt`). ```c++ @@ -155,7 +169,7 @@ Finally, the function to write the trace output to a file as defined in `aie.uti ``` ### (2b) Python Host code ([test.py](./test.py)) -In the [Makefile](./Makefile), we also have a `trace_py` target which calls the python host code `test.py`. Here in addition to the `-t ${trace_size}`, we also define the `-s ${data_size}` which is the data size (in uint32) for our Vector Scalar Multiply kernel. +In the [Makefile](./Makefile), we also have a `trace_py` target which calls the python host code `test.py`. Here in addition to the `-t ${trace_size}`, we also define the `-s ${data_size}` which is the data size (in int32) for our version of the vector scalar multiply kernel. ```Makefile trace_py: build/final_trace_${data_size}.xclbin build/insts_${data_size}.txt ${powershell} python3 test.py -x build/final_trace_${data_size}.xclbin -i build/insts_${data_size}.txt -k MLIR_AIE -t ${trace_size} -s ${data_size} @@ -170,7 +184,7 @@ During verification, the `output_buffer` excludes the trace data and uses the `r entire_buffer = bo_inout2.read(OUT_SIZE, 0).view(np.uint32) output_buffer = entire_buffer[:INOUT2_VOLUME] ``` -Finally, we read `trace buffer` from the entire_buffer starting a the offset of the `INOUT2_VOLUME` and pass the trace buffer to the python equivalent of `write_out_trace` which is defined in `aie.utils.trace`. +Finally, we read `trace buffer` from the entire_buffer starting at the offset of the `INOUT2_VOLUME` and pass the trace buffer to the python equivalent of `write_out_trace` which is defined in `aie.utils.trace`. > **Note** This version doesn't need the trace_size as our python function recognizes when the array is empty. ```python if opts.trace_size > 0: @@ -183,7 +197,7 @@ Once the packet trace text file is generated (`trace.txt`), we use a python-base ```Makefile ../../../programming_examples/utils/parse_trace.py --filename trace.txt --mlir build/aie_trace.mlir --colshift 1 > trace_vs.json ``` -This leverages the python parse scripts under [programming_examples/utils](../../../programming_examples/utils/). Follow [this link](../../../programming_examples/utils/) to get more details about how to use the python parse scripts and how they are coded. +This leverages the python parse scripts under [programming_examples/utils](../../../programming_examples/utils/). Follow [this link](../../../programming_examples/utils/) to get more details about how to use the python parse scripts. ## 4. Open json file in a visualization tool like Perfetto Open https://ui.perfetto.dev in your browser and then open up the waveform json file generated in step 3. You can navigate the waveform viewer as you would a standard waveform viewer and can even zoom/pan the waveform with the a,s,w,d keyboard keys. @@ -191,25 +205,31 @@ Open https://ui.perfetto.dev in your browser and then open up the waveform json ## Additional Debug Hints * If you are getting 0's in your trace outputs. Check these things: * Buffer offset for the DMA is the right size (based on output buffer size) - * The correct tile is being routed to the the correct shim DMA. 
It's not uncommon in a multi core design to route the wrong tile, espeically if the tile symbols/ names are very similar or confusing. - * Check matching packet IDs for packet-routed flows. The packet flow ID must match the configured ID value in Trace Control 1 register or else the packets don't get routed. + * The correct tile is being routed to the correct shim DMA. It's not uncommon in a multi core design to route the wrong tile, especially if the tile names might be very similar. + * For packet-routed flows, check for correctly matching packet IDs. The packet flow ID must match the configured ID value in Trace Control 1 register or else the packets don't get routed. ## Exercises -1. Let's give tracing a try. In this directory, we're been examining a local design based off the `Vector Scalar Mul` example. Run `make trace` to compile the design and generate a trace file and run the `prase_trace.py` script on it to generate the `trace_4b.json` waveform file. Open this in http://ui.perfetto.dev. if you zoom into the region of interest with the keyboard shortcut key W and S to zoom in and out respectively and A and D to pan left and right. You should seem a wave like the following: +1. Let's give tracing a try. In this directory, we've been examining a simplified version of the `vector scalar multiply` example. Run `make trace`. This compiles the design, generates a trace data file, and runs `parse_trace.py` to generate the `trace_4b.json` waveform file. + + **NOTE** In this example, `make`, `make run` and `make trace` will all build a structural design with tracing enabled to keep things simple. But only `make trace` will enable tracing in the host code and call `parse_trace.py`. In contrast, the reference `vector scalar multiply` example has a more robust `Makefile` where `make` and `make run` build the structural design with tracing disabled. + + Open this waveform json in http://ui.perfetto.dev. Zoom into the region of interest with the keyboard shortcut keys W and S to zoom in and out respectively, and A and D to pan left and right. You should see a wave like the following: - Based on this wave, You can mouse over each chunk of continguous data for `PortRunning0` (input dma port) and `PortRunning1` (output dma port). What is the chunk size? How many input and output chunks are there? This shoudl match iteration loop bounds in our exmple design. + Based on this wave, you can mouse over each chunk of contiguous data for `PortRunning0` (input dma port) and `PortRunning1` (output dma port). What is the chunk size? How many input and output chunks are there? This should match the iteration loop bounds in our example design. - Here, we notice a few signals worth mentioning. + Here, there are a few common events in our waveform that are further described below. * `Event0` - The event marking the beginning of our kernel. See [vector_scalar_mul.cc](./vector_scalar_mul.cc) where we added the function `event0()` before the loop. This is generally a handy thing to do to attach an event to the beginning of our kernel. - * `Event1` - The event marking the end of our kernel. See [vector_scalar_mul.cc](./vector_scalar_mul.cc) where we added the function `event1()` before the loop. Much like event0, attaching event1 to the end of our kernel is also helpful. + * `Event1` - The event marking the end of our kernel. See [vector_scalar_mul.cc](./vector_scalar_mul.cc) where we added the function `event1()` after the loop. Much like event0, attaching event1 to the end of our kernel is also helpful (see the short kernel sketch after this list). * `VectorInstr` - Vector instructions like vector MAC or vector load/store. Here, we are running a scalar implementation so there are no vector events. - * `PortRunning0` - Mapped to Port 0 which is by default configured to the S2MM0 input (DMA from stream to local memory) - * `PortRunning1` - Mapped to Port 1 which is by default configured to the MM2S0 output (DMA from local memory to stream) - * `LockStall` - Any locks that are stalled in the core + * `PortRunning0` - Mapped to Port 0 which is by default configured to the S2MM0 input (DMA from stream to local memory). This is usually the first input. + * `PortRunning1` - Mapped to Port 1 which is by default configured to the MM2S0 output (DMA from local memory to stream). This is usually the first output. + * `LockStall` - Any lock stalls * `LockAcquiresInstr` - Any lock acquire requests * `LockReleaseInstr` - Any lock release requests
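To make the marker events concrete, here is a rough sketch of how a scalar kernel brackets its main loop with `event0()` and `event1()`. This is a simplified illustration, not a verbatim copy of [vector_scalar_mul.cc](./vector_scalar_mul.cc); the event functions are assumed to be provided by the AIE kernel headers included in that file.

```c++
// Simplified sketch: bracketing the kernel's main loop with trace markers.
// event0()/event1() are assumed to come from the AIE kernel environment.
void vector_scalar_mul_aie_scalar(int32_t *a_in, int32_t *c_out,
                                  int32_t *factor, int32_t N) {
  event0(); // start marker: shows up as Event0 in the trace
  for (int32_t i = 0; i < N; i++) {
    c_out[i] = *factor * a_in[i];
  }
  event1(); // end marker: shows up as Event1 in the trace
}
```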
+ We will look at more exercises with Trace and performance measurement in the next [section](../section-4c). + ----- [[Prev]](../section-4a) [[Up]](../../section-4) [[Next]](../section-4c) diff --git a/programming_guide/section-4/section-4c/README.md index 15dbfd8fac..e96cd8a7b4 100644 --- a/programming_guide/section-4/section-4c/README.md +++ b/programming_guide/section-4/section-4c/README.md @@ -10,18 +10,18 @@ # Section 4c - Kernel Vectorization and Optimization -* [Section 4 - Vector Programming & Performance Measurement](../../section-4) +* [Section 4 - Performance Measurement & Vector Programming](../../section-4) * [Section 4a - Timers](../section-4a) * [Section 4b - Trace](../section-4b) * Section 4c - Kernel Vectorization and Optimization ----- -Now that we are able to measure the total application time ([section-4a](../section-4a/)) and have examined the kernel performance via tracing ([section-4b](../section-4b)), we will take a closer look at kernel vectorization. We will be using the [vector-scalar multiply example](../../../programming_examples/basic/vector_scalar_mul/) rather than a local copy of that same design to illustrate kernel vectorization concepts. Note that by default, that example design is working with 16-bit data (vs 32-bit of our local examples) and has `vectorized=True`. +Now that we are able to measure the total application time ([section-4a](../section-4a/)) and have seen how we can look at kernel performance via tracing ([section-4b](../section-4b)), we will take a closer look at kernel vectorization and compare performance using trace. We will now switch to using the [vector-scalar multiply example](../../../programming_examples/basic/vector_scalar_mul/) rather than a local copy of that same design to illustrate kernel vectorization concepts. Note that by default, that example design is working with 16-bit data (vs 32-bit in our previous section-4 examples) and has `vectorized=True`. Go ahead and read the design example summary for [vector-scalar multiply](../../../programming_examples/basic/vector_scalar_mul/) first to get an idea of the different components of this example design. Then, let's take a closer look at the kernel source file ([scale.cc](../../../aie_kernels/aie2/scale.cc)).
-In [scale.cc](../../../aie_kernels/aie2/scale.cc), we see that the scalar code is relatively straight forward: +In [scale.cc](../../../aie_kernels/aie2/scale.cc), we see that the scalar code is relatively straight forward and similar to the scalar code we used in [section-4bb](../section-4b): ```C++ template void scale_scalar(T *a, T *c, T factor, const int32_t N) { @@ -38,7 +38,7 @@ Here, the code iterates over the input vector (`a`) and multiplies each element ### AIE API To vectorize this, we first need to familiarize ourselves with the AIE API which abstracts the underlying AIE processor and associated low-level intrinsics with an higher level C++ API. Documentation for AIE API (2023.2 Vitis tools) can be found [here](https://www.xilinx.com/htmldocs/xilinx2023_2/aiengine_api/aie_api/doc/modules.html). To view details on the vector x scalar multiplier, on the left pane, navigate to *AI Engine API User Guide -> API Reference -> Arithmetic* and select the first `aie::mul` which shows a `Vec * E` where `E` is an elementary data type like a scalar int. -To be able to use this AIE API function in our kernel code, we first need to include the AIE API headers. +To be able to use this AIE API function in our kernel code, we first need to include the AIE API headers in our kernel source. ```C++ #include ``` @@ -48,10 +48,10 @@ Then, we declare a vector as follows: ```C++ aie::vector my_vector ``` -* T - data type, such as `int32_t` -* vec_factor - vector size, such as 16. +* T - data type, such as `int16_t` +* vec_factor - vector size, such as 32. -The size of the vector depends on the type. For example, the standard vector register in AIE2 is **512 bits**. For `int32_t`, that means we can store 16 of them in 1x 512b vector register. Extending this to the other supported data types, we have the following abbreviated table: +The size of the vector depends on the type. For example, the standard vector register in AIE2 is **512 bits**. For `int16_t`, that means we can store 32 of them in 1x 512b vector register. Extending this to the other supported data types, we have the following abbreviated table: | Data type | Vector size | |-----------|-------------| @@ -80,7 +80,7 @@ Finally, we get to the `aie::mul` call which takes a vector and a scalar as argu ```C++ aie::accum cout ``` -The accumulator data type in this case is 32x 32-bit accumulator. We store the computed results back to local memory using the vector store function `aie::store_v` as shown: +The accumulator data type in this case is 32x 32-bit accumulator. We store the computed results back to local memory using the vector store function `aie::store_v`. Note that for `int32_t` datatypes, we require a larger accumulator (`acc64`). ```C++ T *__restrict pC1 = c; @@ -111,26 +111,27 @@ void scale_vectorized(T *a, T *c, int32_t factor, const int32_t N) { } ``` -In this first example, the vectorization strategy was relatively straight forward. Instead of iterating over a vector of values and doing a single scalar multiply, we load a vector of input values, iterate over a smaller loop to perform a vector*scalar operation using the AIE API functions, and then store the vector of results back to local memory. +In this example, the vectorization strategy was relatively straight forward. 
Instead of iterating over a vector of values and doing a single scalar multiply, we load a vector of input values, iterate over a smaller loop to perform a vector*scalar operation using the AIE API functions, and then store the vector of results back to local memory. -> **NOTE** - AIE API is a portable programming interface that is implemented as a C++ header-only library providing types and operations that get translated into generation specific efficient low-level intrinsics. AIE kernels can also be programmed directly in these low-level C++ intrinsics: [AIE1 Intrinsics User Guide - v2023.2](https://www.xilinx.com/htmldocs/xilinx2023_2/aiengine_intrinsics/intrinsics/index.html) and [AIE2 Intrinsics User Guide - v2023.2](https://www.xilinx.com/htmldocs/xilinx2023_2/aiengine_ml_intrinsics/intrinsics/index.html) +> **NOTE** - AIE API is a portable programming interface that is implemented as a C++ header-only library providing types and operations that get translated into generation specific efficient low-level intrinsics. AIE kernels can also be programmed directly in these low-level C++ intrinsics: [AIE1 Intrinsics User Guide - v2023.2](https://www.xilinx.com/htmldocs/xilinx2023_2/aiengine_intrinsics/intrinsics/index.html) and [AIE2 Intrinsics User Guide - v2023.2](https://www.xilinx.com/htmldocs/xilinx2023_2/aiengine_ml_intrinsics/intrinsics/index.html) ## Vectorization Exercises -1. Let's take a look at the trace for our vector scalar design. First, let's edit our [vector_scalar_mul design](../../../programming_examples/basic/vector_scalar_mul/) so that the [aie2.py](../../../programming_examples/basic/vector_scalar_mul/aie2.py) source file has `vectorized=False`. In the [aie2.py](../../../programming_examples/basic/vector_scalar_mul/aie2.py) sourcee code, we simply select the scalar version of the kernel function. Then run `make trace`. After the trace compilation is complete, open `trace_vs.json` in https://ui.perfetto.dev and measure the delta between `event 0` and `event 1`. Note that in the Perfetto waveform, 1 us is equal to 1 clock cycle. How many cycles did you measure? +1. Let's take a look at the trace for our vector scalar design. First, let's edit our [vector_scalar_mul design](../../../programming_examples/basic/vector_scalar_mul/) so that the [aie2.py](../../../programming_examples/basic/vector_scalar_mul/aie2.py) source file has `vectorized=False`. In the [aie2.py](../../../programming_examples/basic/vector_scalar_mul/aie2.py) source code, we now have selected the scalar version of the kernel function. Then run `make trace`. After the trace compilation is complete, open `trace_vs.json` in https://ui.perfetto.dev and measure the delta between `event 0` and `event 1`. Note that in the Perfetto waveform, 1 us is equal to 1 clock cycle. How many cycles did you measure? + +1. Now let's turn vectorization back on by changing `vectorized=True`. But we're also going to disable an pragma guided optimization first to see its effect. In the [scale.cc](../../../aie_kernels/aie2/scale.cc), comment out the line after the `for loop` that says `chess_prepare_for_pipelining chess_loop_range(16, )`. **NOTE** Be sure you're editing the general template and not the `int32_t` template specialization. The general version should be the first one. Then rerun the compilation (`make clean; make trace`). Measure the delta between `event 0` and `event 1` again. What value do you see now? -1. Now let's turn vectorization back on by changing `vectorized=True`. 
That's quite an improvemnt, ~20X reduction in compute latency. However, there's more optimization that can be had with vetor code and that involves compilation pragmas.

-1. Go back to [scale.cc](../../../aie_kernels/aie2/scale.cc) and uncomment the line with `chess_prepare_for_pipelining chess_loop_range(16, )`. The rerun the compilation (`make clean; make trace`). Measure the delta between `event 0` and `event 1` again. What value do you see now?
+1. Go back to [scale.cc](../../../aie_kernels/aie2/scale.cc) and uncomment the line with `chess_prepare_for_pipelining chess_loop_range(16, )` to enable those pragmas. Then rerun the compilation (`make clean; make trace`). Measure the delta between `event 0` and `event 1` again. What value do you see now?

-    Now, we're really seeing some savings (another factor ~6X savings or ~140X compare to the scalar version) The line we added help guide the compiler to find optimal schedules. In particular for kernel loops, `chess_prepare_for_pipelining` and `chess_loop_range(16, )` are particularly useful.
-    * `chess_prepare_for_pipelining` - Used in the innermost loop to tell the compiler to enable software pipelining. This is necessary for subsequent loop optimization pragmas to be useful
-    * `chess_loop_range(MIN, MAX)` - An extremely helpful pragma. This tells the compiler how many minimum or maximum iterations we expect this loop to have. We often parameterize loop bounds based on size and even if the upper bound is declared as a const, it's still a runtime computed value. Giving the MIN value is particular helpful because it guides the scheduler to know how many iterations we have and can therefore properly schedule the loop instructions.
+    Now, we're really seeing some savings (another ~6X savings, or ~140X compared to the scalar version). The line we added helps guide the compiler to find optimal schedules. For kernel loops, `chess_prepare_for_pipelining` and `chess_loop_range(16, )` are particularly useful.
+    * `chess_prepare_for_pipelining` - Used in the innermost loop to tell the compiler to enable software pipelining. This is needed to enable subsequent loop optimization pragmas.
+    * `chess_loop_range(MIN, MAX)` - An extremely helpful pragma. This tells the compiler the minimum and maximum number of iterations we expect this loop to have. We often parameterize loop bounds based on size, and even if the loop size is declared as a const, it is still a runtime-computed value. Giving the MIN value in this pragma is particularly helpful because it lets the scheduler know how many iterations we have and therefore properly schedule the loop instructions for that number rather than the worst case of 1. The placement of these pragmas on the kernel's inner loop is sketched below.
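Continuing the sketch from earlier, the pragmas sit between the `for` statement and the loop body. The snippet below is illustrative only: it reuses the placeholder names from the earlier sketch, and the `chess_*` keywords are only understood by the AIE kernel (chess) compiler.
```C++
for (int i = 0; i < F; i++)
  chess_prepare_for_pipelining chess_loop_range(16, ) // enable pipelining; expect at least 16 iterations
  {
    aie::vector<T, vec_factor> A0 = aie::load_v<vec_factor>(pA1);
    pA1 += vec_factor;
    aie::accum<acc32, vec_factor> cout = aie::mul(A0, factor);
    aie::store_v(pC1, cout.template to_vector<T>(0));
    pC1 += vec_factor;
  }
```
Commenting that pragma line out, as in the exercise above, leaves the loop functionally identical but removes the scheduling hints, which is why the measured cycle count changes so dramatically.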
## Optimization - Coding for the Architecture
-At this point, we've vectorized our code to better leverage the AIE hardware and saw significant performance gains, but is our design fully optimized? How do we know if we've used the powerful AIE hardware to its full potential? This requires a deeper understanding of the underlying AIE architecture and coding for performance with the hardware in mind. For this next section, we will focus on **AIE2** (aka AIE-ML) that's at the heart of the Ryzen AI NPU. AIE2 is optimized for ML workloads which means multiply-accumulate operations like matrix multiplication style compute would leverage the hardware the best. We will also start our exploration by continuing with the vector-scalar multiply example. While it does not expose a sufficient amount of compute to exploit every optimization, it still provides a good starting point in understanding what design considerations are needed to code optimal designs.
+At this point, we've vectorized our code to better leverage the AIE hardware and saw significant performance gains, but is our design fully optimized? How do we know if we've used the powerful AIE hardware to its full potential? This requires a deeper understanding of the underlying AIE architecture and coding for performance with the hardware in mind. For this next section, we will focus on **AIE2** (aka AIE-ML), which is at the heart of the Ryzen™ AI NPU. AIE2 is optimized for ML workloads, which means multiply-accumulate operations such as matrix-multiplication-style compute leverage the hardware best. We will also start our exploration by continuing with the vector-scalar multiply example. While it does not expose a sufficient amount of compute to exploit every optimization, it still provides a good starting point for understanding what design considerations are needed to code optimal designs.

### The Vector Unit - Loads

@@ -138,11 +139,11 @@ The first step in optimizing our code even further is to have a picture of the A

-As we can see, vector registers are loaded from 2 parallel Load Units, each capable of loading 256 bits per clock cycle from local L1 memory. We have 12 512-bit vector registers which feed into each Permute block and eventually, the Multiplier block. It is important then to always think in terms of 2 256-bit parallel loads per clock cycle. If, for example, you try to load 2048-bits of data per clock in order to do your compute, you will be less efficient as that would require more than 1 cycle. Another important note is that the loads must come from different L1 memory banks or else a bank conflict will occur. The bank conflict penalty is small but would reduce opitimal performance.
+As we can see, vector registers are loaded from 2 parallel Load Units, each capable of loading 256 bits per clock cycle from local L1 memory. We have 12 512-bit vector registers which feed into each Permute block and, eventually, the Multiplier block. It is important, then, to always think in terms of 2x 256-bit parallel loads per clock cycle. If, for example, you try to load 2048 bits of data per clock in order to do your compute, it would be less efficient as that would require multiple cycles. Another important note is that the loads must come from different L1 memory banks or else a bank conflict will occur. The bank conflict penalty is small, but it reduces performance below the optimum.
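As a concrete but purely illustrative example, the 32-lane `int16_t` load used in our kernel is 512 bits, so at best it occupies both 256-bit Load Units for a single cycle. The pointer name below is a placeholder carried over from the earlier sketch.
```C++
// One 512-bit load is composed of 2x 256-bit loads; at best both issue in the same
// cycle on the two Load Units, provided the two halves come from different L1 banks.
aie::vector<int16_t, 32> A0 = aie::load_v<32>(pA1);
pA1 += 32;
```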
### The Vector Unit - Multiply and Add (MAC)

-Once data is loaded and permuted, it passes to the Multiplier block which supports a wide list of AIE data types. The multiply results then pass through an optional post-add step (very common for matrix multiply) before eventually being stored in the accumulator registers. There are 9x 512-bit accumulator registers. Accumulator registers are larger so data precision can be maintained. A well optimized piece of code would schedule 1 vector MAC (VMAC) every cycle.
+Once data is loaded and permuted, it passes to the Multiplier block which supports a wide range of AIE data types. The multiply results then pass through an optional post-add step (very common for matrix multiply) before eventually being stored in the accumulator registers. There are 9x 512-bit accumulator registers. Accumulator registers are larger so data precision can be maintained. A well-optimized piece of code would strive to schedule 1 vector MAC (VMAC) every cycle.

### The Vector Unit - SRS and Stores

@@ -150,11 +151,11 @@ Once data has been computed (either in 1 cycle or accumulated over a number of c

-The SRS path is on the right of the diagram above with the corollary path, the Upshift (UPS) path on the left.
+The SRS path is on the right of the diagram above, with its complementary path, the Upshift (UPS) path, on the left. Upshift moves data from the vector registers to the accumulator registers.

### The Vector Unit - Shift/ Shuffle/ Adder Path

-Finally, we have an additional parallel processing path which performs shift, shuffle, simple addition, comparison and a host of other functions. This path runs in parallel with the main integer vector datapath and may be tasked to do the aforementioned functions without the need of the VMAC datapath if a VMAC is not needed in our code.
+Finally, we have an additional parallel processing path which performs shift, shuffle, simple addition, comparison and a number of other vector functions. This path runs in parallel with the main integer vector datapath and may be tasked to do the aforementioned functions without the need of the VMAC datapath.

@@ -163,16 +164,16 @@ It is very helpful to have in mind this processing datapath and the way in which

### Multiplier Utilization Efficiency

-Now that we have a better understanding of the architecture, let's take a closer look at hardware efficiency.The following diagram shows the various AIE architecture blocks we talked about along with a table of generalized compte.
+Now that we have a better understanding of the architecture, let's take a closer look at hardware efficiency. The following diagram shows the various AIE architecture blocks we talked about along with a table of generalized compute.

> **NOTE** - Matrix multiplication mode table is in the AIE API User Guide [here](https://www.xilinx.com/htmldocs/xilinx2023_2/aiengine_api/aie_api/doc/group__group__mmul.html). Another way to see the total number of MACs for different bit precisions is the `Table: Supported Precision Width of the Vector Data Path` in the [AM020 spec](https://docs.amd.com/r/en-US/am020-versal-aie-ml/Functional-Overview).

-This table tells us that for 16-bit x 16-bit compute, we have 64 MACs available per cycle. However, these MACs are targeting Matrix Multiplication (with its accompanying post-addition steps). In practice, we have 32 accumulator lanes available. That means for eltwise operations, we can only use 32 MACs per cycle.
+This table tells us that for 16-bit x 16-bit compute, we have 64 MACs available per cycle. However, these MACs are targeting matrix multiplication (with its accompanying post-addition steps). In practice, we have 32 accumulator lanes available. That means for eltwise operations, we can only use 32 MACs per cycle.

#### MAC efficiency
-Using this information and our Vector Scalar Multiply example, we know that each call to the kernel passes in an array of 1024 16-bit data. With 32 MACs available, our `vector_factor` is 32 and therefore, we would ideally need 1024 / 32 = 32 cycles to process this amount of data given our 32 MACs per clock eltwise vector MAC configuration. Our final optimized cycle count for the kernel was 72 cycles or roughly 2x the ideal number of cycles.
+Using this information and our Vector Scalar Multiply example, we know that each call to the kernel passes in an array of 1024 16-bit values. With 32 MACs available, our `vector_factor` is 32 and therefore, we would ideally need 1024 / 32 = 32 cycles to process this amount of data given our 32 MACs-per-clock eltwise vector MAC configuration. Our final optimized cycle count for the kernel was 72 cycles, or roughly 2x the ideal number of cycles.

Total MAC efficiency is a product of the (MAC schedule efficiency) x (per clock MAC utilization efficiency).
* MAC schedule efficiency - Ideal MAC cycles / Actual MAC cycles (e.g. 32/ 72 = 44%)
@@ -183,7 +184,7 @@ Let's file that result away but look at our algorithm from load/store bandwidth

#### Load/ Store Bandwidth efficiency

-To process a vector of 32 int16 values times a scalar, let's ignore the scalar load and focus only on the vector one. 32 int16 = 512-bits which would take 2x 256-bit loads or 2 cycles per MAC. It might be possible to do it in a single cycle if the data is striped across banks perfectly. We also need to store 2x 256-bits which will take 2 cycles since we only have 1 Store Unit. This means that even if we could do a VMAC every cycle, we need 2 cycles to load the inputs an store the outputs. This explains why our optimized vector results was 72, since based on this 2 cycle requirement, our minimum cycles for our data size is 64 cycles. The remaining 6 cycles is loop preamble, loop postamble and function initialization and cleanup overhead.
+To process a vector of 32 int16 values times a scalar, let's ignore the scalar load and focus only on the vector one. 32 int16 = 512 bits, which would take 2x 256-bit loads, or 2 cycles. It might be possible to do it in a single cycle if the data is interleaved across banks. We also need to store 2x 256 bits, which will take 2 cycles since we only have 1 Store Unit. This means that even if we could do a VMAC every cycle, we need 2 cycles to load the inputs and store the outputs. This explains why our optimized vector result was 72 cycles: based on this 2-cycle requirement, the minimum cycle count for our data size is 64 cycles. The remaining 8 cycles are loop preamble, loop postamble and function initialization and cleanup overhead.
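As a rough sanity check, the cycle budget described above can be written out explicitly. The figures below simply restate the reasoning in this section and are illustrative only.
```C++
// Illustrative cycle budget for one kernel call (1024 x int16, vec_factor = 32).
constexpr int elems           = 1024;                              // int16 elements per kernel call
constexpr int vec_factor      = 32;                                // lanes per VMAC
constexpr int iterations      = elems / vec_factor;                // 32 inner-loop iterations
constexpr int cycles_per_iter = 2;                                 // 2x 256-bit stores through 1 Store Unit
constexpr int min_loop_cycles = iterations * cycles_per_iter;      // 64 cycles, store-bound
constexpr int measured_cycles = 72;                                // optimized kernel measurement from the trace
constexpr int overhead_cycles = measured_cycles - min_loop_cycles; // preamble/postamble + setup/cleanup
```
Seen this way, the optimized kernel is already close to the store-bandwidth limit rather than the MAC limit.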
#### Data routing efficiency
The load/store bandwidth is already a bottleneck in our 16-bit Vector Scalar Multiply example for the compute. But what about data movement via streams and DMAs. We need to process 1024 chunks of 16-bit data or 512 32-bit quantities. Because our stream switch moves data in 32-bit granularity, we need 512 cycles in order to load in the data to L1 and to move the data out of L1 to L2/L3.
@@ -196,7 +197,7 @@ The load/store bandwidth is already a bottleneck in our 16-bit Vector Scalar Mul
| Load/Store| 64 | 50% / 100% |
| DMA | 512 | 100% |

-Looking at this table, we quickly see that the data movement is the bottleneck.
+Looking at this table, we quickly see that the data movement is the bottleneck for the overall kernel.

## Optimization Exercises - Part 1
1. Rerun the final optimized code and take a look at the resulting waveform.
@@ -208,7 +209,7 @@ Looking at this table, we quickly see that the data movement is the bottleneck.

## Diving Deep - Examining the Microcode
Let's take another look at the results of our [vector_scalar_mul design](../../../programming_examples/basic/vector_scalar_mul/). Let's also go back one step and comment out `chess_prepare_for_pipelining chess_loop_range(16, )` and rerun the compilation (`make clean; make trace`).

-At this point, we can actually take a look at the `microcode`. The `microcode` is the precise schedule of instructions that our AIE executes in order to run the kernel program. This microcode can usually be found under `build/core_0_2.elf.lst` where the two numbers for the core indicates its column and row position respectively. So if your design has multiple cores, then each core will have its own .lst file. If you were to open the file, you will see a lot of information. Comment lines will have a . in front of it. The other lines are the instructions and are structured as follows:
+At this point, we can actually take a look at the `microcode`. The `microcode` is the precise schedule of instructions that our AIE executes in order to run the kernel program. This microcode can usually be found under `build/core_0_2.elf.lst`, where the two numbers for the core indicate its column and row position respectively. So if your design has multiple cores, then each core will have its own `.lst` file. If you were to open the file, you will see a lot of information. Comment lines have a `.` in front of them. The other lines are the instructions and are structured as follows:

```
Instruction Line Number ---- Encoded Instruction ---- 1 or more slots of ISA commands
@@ -231,25 +232,27 @@ Instruction Line Number ---- Encoded Instruction ---- 1 or more slots of ISA com
Fully analyzing and understanding this microcode is beyond the scope of this programming guide but we will focus on key parts of this microcode, labeled by 3 types of comments in particular:

-`.label vector_scalar_mul_aie` followed by `.function_start` - The start of the function we're interested in. The name after label is the function name but this might have additional characters if the function is generated from a template.
+`.label vector_scalar_mul_aie` followed by `.function_start` - The start of the function we're interested in. The name after the label is the function name, but it might have additional characters if the function is generated from a template.

`.label ZLS_...` - The start of a zero-overhead loop
`.label ZLE_...` - The end of a zero-overhead loop.
-> **NOTE** The line after this label is the last line within the loop, not just the lines strictly between `ZLS` and `ZLE`. In general, labels pertain the line after the label.
+> **NOTE** The line after this label is the last line within the loop, not just the lines strictly between `ZLS` and `ZLE`. In general, labels refer to the line after the label.

Let's examine this more closely in our example.

## Optimization Exercises - Part 2
-1. Open `build/core_0_2.elf.lst` and take a look through the file. You'll see a lot of helpful comments but it may be a bit too much comments to be able to see patterns in the microcode clearly. Run a simple cleanup script from the vector_scalar_mul example directory
+1. Go back and comment out the pragma lines (`chess_prepare_for_pipelining chess_loop_range(16, )`) again and rerun the build (`make clean; make trace`). Open `build/core_0_2.elf.lst` and take a look through the file. You'll see a lot of helpful comments, but there may be too many of them to see patterns in the microcode clearly.
Run a simple cleanup script from within the vector_scalar_mul example directory
`../../utils/clean_microcode.sh build/core_0_2.elf.lst`
-    This will remove some of the extra comments. Open up the `core_0_2.elf.lst` file again and search for `.label vector_scalar_mul_aie`. Then scroll down until you see the first `.label ZLS ..` line. Count the number of lines until you reach the first `.label ZLE ..` line and add 1 to that total (since the line after ZLE is within the loop). How many lines are in this inner loop?
+    This will remove some of the extra comments. Open up the `core_0_2.elf.lst` file again and search for `.label vector_scalar_mul_int16_vector`. Then scroll down until you see the first `.label ZLS ..` line after that. Count the number of lines until you reach the first `.label ZLE ..` line and add 1 to that total (since the line after ZLE is within the loop). How many lines are in this inner loop?

1. Now look at each line (including the one after ZLE) and count how many lines contain a `VMUL` or `VMAC` in it? What number do you get?

-1. The number you got gives us a rough idea of how optimized the innermost loop of our algorithm is. In this case, we have 1 VMAC out of 15 cycles or ~6% MAC utilization. If the inner loop take 15 cycles and we iterate 32 times, how many cycles should this version take and how close are we to the measured cycle count?
+1. The number you got gives us a rough idea of how optimized the innermost loop of our algorithm is. In this case, we have 1 VMAC out of 15 cycles or ~6% MAC utilization. If the inner loop takes 15 cycles and we iterate 32 times, how many cycles should this version take and how close are we to the measured cycle count?
+
+1. Now go back and uncomment the pragma lines again and rerun the build and cleanup script (`make clean; make trace; ../../utils/clean_microcode.sh build/core_0_2.elf.lst`). Search for `vector_scalar_mul_int16_vector` again and count the number of inner loop lines, as well as the `VMUL/VMAC` lines, again. How many do you see? This matches our hand calculation that the inner loop is limited to 2 cycles per iteration by the vector stores.

-----

diff --git a/python/utils/README.md b/python/utils/README.md
index 3fc8cfeb44..211386adaa 100644
--- a/python/utils/README.md
+++ b/python/utils/README.md
@@ -52,33 +52,43 @@ Test/ Host code utilities.
* `pack4bytes`
  * Pack 4 bytes into a 32-bit word
* `configure_simple_tracing_aie2`
-  * This function abstracts a number of python functions for configuring a core tile and an associated shim tile. It does not define the trace packet routing betweent he two however. To better appreciate what this wrapper function does, we need to delve more deeply into the details on how trace units are configured.
-
-Within the `func.func @sequence` block, we add a set of configuration register writes (`aiex.npu.write32`) to configure the tile trace units and the shimDMA.
-### How to configure wrapper and default values
-The minimum function call we need is:
-```python
-trace_utils.configure_simple_tracing_aie2(tile, shim)
-```
-This version allows the default values to make certain assumptions such as:
-* `channel`=1 - to configure S2MM channel
-* `bd_id`=13 - 13 is far enough that's unlikely to have conflict
-* `ddr_id`=2 - Maps to inout2 buffer
-* `size`=8192 - 8,192 bytes for trace buffer size
-* `offset`=0 - An offset=0 means the trace data is in its own inout buffer (not appended to another channel)
-* `start`=0x1 - Start event triggers right away
-* `stop`=0x0 - No Stop event
-* `events`=[0x4B, 0x22, 0x21, 0x25, 0x2D, 0x2C, 0x1A, 0x4F] - standard template of events commonly used
-
-Another common use case might be:
-```python
-trace_utils.configure_simple_tracing_aie2(tile, shim, size=8192, offset=output_size, ddr_id_=2)
-```
-This one allows us to control the size, offset, and inout buffer mapping.
+  * This function abstracts a number of python functions for configuring a core tile and an associated shim tile. It does not define the trace packet routing between the two, however.
+
+    Function arguments:
+    * `channel` - S2MM channel used
+    * `bd_id` - DMA bd used. Be careful that we do not conflict with the auto-assigned bds allocated by `npu_dma_memcpy_nd` calls
+    * `ddr_id` - Maps to one of the 3 inout buffers (1,2,3)
+    * `size` - trace buffer size (in bytes)
+    * `offset` - offset (in bytes) where trace buffer data should begin
+    * `start` - start event
+    * `stop` - stop event
+    * `events` - Vector of 8 events that we are tracing
+
+    The minimum function call supported is:
+    ```python
+    trace_utils.configure_simple_tracing_aie2(tile, shim)
+    ```
+    This version uses the default argument values described below:
+    * `channel`=1 - to configure S2MM channel 1
+    * `bd_id`=13 - 13 is far enough that it's unlikely to conflict
+    * `ddr_id`=2 - Maps to inout2 buffer
+    * `size`=8192 - 8,192 bytes for trace buffer size
+    * `offset`=0 - An offset=0 means the trace data is in its own inout buffer (not appended to another channel)
+    * `start`=0x1 - Start event triggers right away when tile is enabled
+    * `stop`=0x0 - No Stop event
+    * `events`=[0x4B, 0x22, 0x21, 0x25, 0x2D, 0x2C, 0x1A, 0x4F] - standard template of events commonly used
+
+    A more common use case might be:
+    ```python
+    trace_utils.configure_simple_tracing_aie2(tile, shim, size=8192, offset=output_size, ddr_id=2)
+    ```
+    This one allows us to control the size, offset, and inout buffer mapping.
+    To better appreciate what this wrapper function does, we need to delve more deeply into the details on how trace units are configured.

### Configure tile trace settings
+Within the `func.func @sequence` block, we issue a set of configuration register writes (`aiex.npu.write32`) to configure the tile trace units, and `aiex.npu.writebd_shimtile` operations to configure the shimDMA.
+
For a give AIE2 tile, we configure the trace control registers for the tile core and tile memory separately. There are 4 registers we generally use to configure the trace unit behavior. 2 are for configuring the general trace control and the other 2 are to specify which events our tile's trace hardware is monitoring. AIE2 core module registers can be found in [AM025](https://docs.amd.com/r/en-US/am025-versal-aie-ml-register-reference/).
@@ -134,7 +144,7 @@ The table below describes which events the trace hardware monitors.
This info is also found online in [AM025](https://docs.amd.com/r/en-US/am025-versal-aie-ml-register-reference/) for [Trace Event 0](https://docs.amd.com/r/en-US/am025-versal-aie-ml-register-reference/Trace_Event0-CORE_MODULE-Register) and [Trace Event 1](https://docs.amd.com/r/en-US/am025-versal-aie-ml-register-reference/Trace_Event1-CORE_MODULE-Register).

-There is an extensive lists of trace events but here, we will only describe a few common ones.
+There is an extensive list of trace events, but we will only list a few common ones here.
| Some common events | event ID | dec value |
|--------------------|----------|-----------|
| True |0x01| 1 |
@@ -147,7 +157,7 @@ There is an extensive lists of trace events but here, we will only describe a fe
| Lock stall |0x1A| 26 |
| Core Port Running 1 |0x4F| 79 |
| Core Port Running 0 |0x4B| 75 |
-* A more exhaustive list of events for core tile, core memory, memtile and shim tile can be found in [this header file](https://github.com/Xilinx/aie-rt/blob/main-aie/driver/src/events/xaie_events_aie.h). However, not all events are yet supported in `parse_eventIR.py` at this time.
+* A more exhaustive list of events for core tile, core memory, memtile and shim tile can be found in this [header file](https://github.com/Xilinx/aie-rt/blob/main-aie/driver/src/events/xaie_events_aie.h). However, not all events are yet supported in `parse_eventIR.py`.

**NOTE**: The "Core Instruction - Event 0/1" are special intrinsics you can add to your kernel code to trigger an event during the running of your core program. Within the kernel code, they look like:
```c++
@@ -159,6 +169,8 @@ This can be placed at the beginning and end of your code block to estimate the t

Example Trace Events 0/1 Config

+Setting the trace event registers can again be done using the `aiex.npu.write32` function targeting the correct register addresses (0x340E0 and 0x340E4 for core tiles). An example of setting them in your host code is below:
+
in C/C++
```c++
// Events 0-3 monitored
@@ -215,6 +227,7 @@ in C/C++
// This is necessary to capture the Port_Running_0 and Port_Running_1 events
// Port 0 - Master/ID=1, Port 1 - Slave/ID=1
aiex.npu.write32 { column = 0 : i32, row = 4 : i32, address = 0x3FF00 : ui32, value = 0x121 : ui32 }
+aiex.npu.write32 { column = 0 : i32, row = 4 : i32, address = 0x3FF04 : ui32, value = 0x0 : ui32 }
```
in Python
```python