Fixed typos and Makefile (#1420)
Co-authored-by: Joseph Melber <jgmelber@gmail.com>
jackl-xilinx and jgmelber authored Apr 25, 2024
1 parent 312a989 commit 1b44c27
Showing 9 changed files with 182 additions and 137 deletions.
4 changes: 2 additions & 2 deletions aie_kernels/aie2/scale.cc
@@ -25,7 +25,7 @@ void scale_scalar(T *a, T *c, T factor, const int32_t N) {
event1();
}

// Vectorized scale template
// Vectorized scale template (general case)
// Assume N is multiple of 16
template <typename T>
void scale_vectorized(T *a, T *c, int32_t factor, const int32_t N) {
@@ -46,7 +46,7 @@ void scale_vectorized(T *a, T *c, int32_t factor, const int32_t N) {
event1();
}

// Vectorized scale template for int32_t (acc64 used)
// Vectorized scale template (int32_t case, acc64 used)
// Assume N is multiple of 16
template <>
void scale_vectorized<int32_t>(int32_t *a, int32_t *c, int32_t factor,
7 changes: 3 additions & 4 deletions programming_guide/README.md
@@ -44,11 +44,10 @@ This IRON AIE programming guide first introduces the language bindings for AIE-a
* Introduce an example of the first simple program (Vector Scalar Multiplication)
* Illustrate how to run designs on Ryzen™ AI-enabled hardware
</details>
<details><summary><a href="./section-4">Section 4 - Vector programming & Performance Measurement</a></summary>
<details><summary><a href="./section-4">Section 4 - Performance Measurement & Vector Programming</a></summary>

* Discuss the topic of vector programming at the kernel level
* Introduce performance measurement (trace) and how we measure cycle count and efficiency
* Performant Vector Scalar Multiplication design example
* Introduce performance measurement (timers, trace)
* Discuss the topic of vector programming at the kernel level
</details>
<details><summary><a href="./section-5">Section 5 - Example Vector Designs</a></summary>

34 changes: 20 additions & 14 deletions programming_guide/section-3/README.md
@@ -16,20 +16,21 @@ This section creates the first program that will run on the AIE-array. As shown

For the AIE-array structural description we will combine what you learned in [section-1](../section-1) for defining a basic structural design in Python with the data movement part from [section-2](../section-2).

For the AIE kernel code, we will start with non-vectorized code that will run on the scalar processor part of an AIE. [section-4](../section-4) will introduce how to vectorize a compute kernel to harvest the compute density of the AIE.
For the AIE kernel code, we will start with non-vectorized code that will run on the scalar processor part of an AIE. [Section-4](../section-4) will introduce how to vectorize a compute kernel to harvest the compute density of the AIE.

The host code can be written in either C++ (as shown in the figure) or in Python. We will also introduce some convenience utility libraries for typical test functionality and to simplify context and buffer creation when the [Xilinx RunTime (XRT)](https://github.com/Xilinx/XRT) is used, for instance in the [AMD XDNA Driver](https://github.com/amd/xdna-driver) for Ryzen™ AI devices.

<img align="right" width="410" height="84" src="../assets/vectorScalarMul.svg">

Throughout this section, a [vector scalar multiplication](../../programming_examples/basic/vector_scalar_mul/) (c = a * factor) will be used as an example. Vector scalar multiplication takes an input vector a and computes the output vector c by multiplying each element of a with a factor. In our example, the total vector size is set to 4096 integers (32b) that will processed in chunks of 1024.
This design is also available in the [programming_examples](../../programming_examples) of this repository. We will first introduce the AIE-array structural description, the review the kernel code and then introduce the host code. Finally we will show ho to run the design on Ryzen™ AI enabled hardware.
Throughout this section, a [vector scalar multiplication](../../programming_examples/basic/vector_scalar_mul/) (c = a * factor) will be used as an example. Vector scalar multiplication takes an input vector `a` and computes the output vector `c` by multiplying each element of `a` by a factor. In this example, the total vector size is set to 4096 (16b) elements that will be processed in chunks of 1024.
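
As a quick reference point before building the AIE version, the computation itself amounts to the following (a minimal NumPy sketch using the example's sizes; all variable names here are illustrative, not from the design files):

```python
import numpy as np

# Reference model of c = a * factor: 4096 elements processed
# in 4 chunks of 1024, mirroring the example's tiling.
N = 4096
CHUNK = 1024
factor = 3
a = np.arange(N, dtype=np.int16)
c = np.zeros_like(a)
for i in range(0, N, CHUNK):
    c[i : i + CHUNK] = a[i : i + CHUNK] * factor
assert np.array_equal(c, a * factor)
```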

This design is also available in the [programming_examples](../../programming_examples) of this repository. We will first introduce the AIE-array structural description, review the kernel code and then introduce the host code. Finally we will show how to run the design on Ryzen™ AI enabled hardware.

## AIE-array Structural Description

<img align="right" width="150" height="350" src="../assets/vectorScalarMulPhysicalDataFlow.svg">

The [aie2.py](../../programming_examples/basic/vector_scalar_mul/aie2.py) AIE-array structural description (see [section-1](../section-1)) deploys both a compute core (green) for the multiplication and a shimDMA (purple) for data movement of both input vector a and output vector c residing in external memory.
The [aie2.py](../../programming_examples/basic/vector_scalar_mul/aie2.py) AIE-array structural description (see [section-1](../section-1)) deploys both a compute core (green) for the multiplication and a shimDMA (purple) for data movement of both input vector `a` and output vector `c` residing in external memory.

```python
# Device declaration - here using aie2 device NPU
@@ -41,7 +42,7 @@ def device_body():
ComputeTile2 = tile(0, 2)
```

We also need to declare that the compute core will run an external function: a kernel written in C++ that will be linked into the design as pre-compiled kernel (more details in the next subsection). To get our initial design running on the AIE-array, we will run a generic version of the vector scalar multiply run on the scalar processor of the AIE.
We also need to declare that the compute core will run an external function: a kernel written in C++ that will be linked into the design as a pre-compiled kernel (more details below). To get our initial design running on the AIE-array, we will run a generic version of the vector scalar multiply design, found here in this directory, that runs on the scalar processor of the AIE. This local version uses the `int32_t` datatype instead of the `int16_t` default used in the [programming_examples version](../../programming_examples/basic/vector_scalar_mul/).

```python
# Type declarations
@@ -57,7 +58,7 @@ Since the compute core can only access L1 memory, input data needs to be explici

<img align="right" width="300" height="300" src="../assets/vector_scalar.svg">

This enables looking at the data movement in the AIE-array from a logical view where we deploy 3 objectFIFOs: "of_in" to bring in the vector a, "of_factor" to bring in the scalar factor, and "of_out" to move the output vector c, all using shimDMA. Note that the objects for "of_in" and "of_out" are declared to have the `memRef_ty` type: 1024 int32 elements, while the factor is an object containing a single integer. All objectFIFO are set up using a depth size of 2 to enable the concurrent execution to the Shim Tile and Compute Tile DMAs data movement with the processing on the compute core.
This enables looking at the data movement in the AIE-array from a logical view where we deploy 3 objectFIFOs: "of_in" to bring in the vector `a`, "of_factor" to bring in the scalar factor, and "of_out" to move the output vector `c`, all using shimDMA. Note that the objects for "of_in" and "of_out" are declared to have the `memRef_ty` type: 1024 int32 elements, while the factor is an object containing a single integer. All objectFIFOs are set up with a depth of 2 to enable the Shim Tile and Compute Tile DMA data movements to run concurrently with the processing on the compute core.

```python
# AIE-array data movement with object fifos
@@ -66,7 +67,7 @@ This enables looking at the data movement in the AIE-array from a logical view w
of_out = object_fifo("out", ComputeTile2, ShimTile, 2, memRef_ty)

```
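The collapsed lines above declare the other two objectFIFOs in the same form as `of_out`; a sketch consistent with the description (the scalar object type name is an assumption):

```python
# Sketch of the collapsed declarations (scalar_memRef_ty is assumed):
of_in = object_fifo("in", ShimTile, ComputeTile2, 2, memRef_ty)
of_factor = object_fifo("inFactor", ShimTile, ComputeTile2, 2, scalar_memRef_ty)
```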
We also need to set up the data movement to/from the AIE-array: configure n-dimensional DMA transfers in the shimDMAs to read/write to/from L3 external memory. For NPU, this is done with the `npu_dma_memcpy_nd` function (more details in [section 2-g](../section-2/section-2g)). Note that the n-dimensional transfer has a size of 4096 int32 elements and that the `metadata` argument in the `npu_dma_memcpy_nd` needs to match the `name` argument of the corresponding object FIFO.
We also need to set up the data movement to/from the AIE-array: configure n-dimensional DMA transfers in the shimDMAs to read/write to/from L3 external memory. For NPU, this is done with the `npu_dma_memcpy_nd` function (more details in [section 2-g](../section-2/section-2g)). Note that the n-dimensional transfer has a size of 4096 int32 elements and that the `metadata` argument in the `npu_dma_memcpy_nd` needs to match the `name` argument of the corresponding object FIFO, in this case `in`, `inFactor` and `out`.

```python
# To/from AIE-array data movement
@@ -81,9 +82,9 @@ We also need to set up the data movement to/from the AIE-array: configure n-dime
npu_sync(column=0, row=0, direction=0, channel=0)
```
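The collapsed transfer calls take this general shape (a sketch: the `metadata` names follow the objectFIFOs above, while the buffer handles `A`, `F`, `C` and the `bd_id` values are assumptions):

```python
# Sketch only - buffer handles and bd_ids are assumptions:
npu_dma_memcpy_nd(metadata="in", bd_id=1, mem=A, sizes=[1, 1, 1, 4096])
npu_dma_memcpy_nd(metadata="inFactor", bd_id=2, mem=F, sizes=[1, 1, 1, 1])
npu_dma_memcpy_nd(metadata="out", bd_id=0, mem=C, sizes=[1, 1, 1, 4096])
```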

Finally, we need to configure how the compute core accesses the data moved to its L1 memory, in objectFIFO terminology: we need to program the acquire and release patterns of "of_in", "of_factor" and "of_out". Only a single factor is needed for the complete 4096 vector, while for every processing iteration on a sub-vector, we need to acquire and object of 1024 integers to read from "of_in" and one similar sized object from "of_out". Then we call our previously declared external function with the acquired objects as operands. After the vector scalar operation, we need to release both objects to their respective "of_in" and "of_out" objectFIFO. After the 4 sub-vector iterations, we release the "of_factor" objectFIFO.
Finally, we need to configure how the compute core accesses the data moved to its L1 memory, in objectFIFO terminology: we need to program the acquire and release patterns of "of_in", "of_factor" and "of_out". Only a single factor is needed for the complete 4096-element vector, while for every processing iteration on a sub-vector, we need to acquire an object of 1024 integers to read from "of_in" and a similarly sized object from "of_out" to write to. Then we call our previously declared external function with the acquired objects as operands. After the vector scalar operation, we release both objects back to their respective "of_in" and "of_out" objectFIFOs. Finally, after the 4 sub-vector iterations, we release the "of_factor" objectFIFO.

This access and execute pattern runs on the AIE compute core `ComputeTile2` and needs to get linked against the precompiled external function "scale.o". We run this pattern in a very large loop to enable enqueuing multiple rounds vector scalar multiply work from the host code.
This access and execute pattern runs on the AIE compute core `ComputeTile2` and needs to get linked against the precompiled external function `"scale.o"`. We run this pattern in a very large loop to enable enqueuing multiple rounds of vector scalar multiply work from the host code.

```python
@core(ComputeTile2, "scale.o")
@@ -105,7 +106,7 @@ This access and execute pattern runs on the AIE compute core `ComputeTile2` and
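
In outline, the collapsed core body implements the acquire/call/release pattern just described (a sketch only; the loop helpers `for_`/`yield_`, the external-function handle `scale` and the exact `call` operands are assumptions):

```python
@core(ComputeTile2, "scale.o")
def core_body():
    # Very large outer loop so the host can enqueue multiple rounds of work
    for _ in for_(sys.maxsize):
        elem_factor = of_factor.acquire(ObjectFifoPort.Consume, 1)
        # 4 sub-vector iterations of 1024 elements each
        for _ in for_(4):
            elem_in = of_in.acquire(ObjectFifoPort.Consume, 1)
            elem_out = of_out.acquire(ObjectFifoPort.Produce, 1)
            call(scale, [elem_in, elem_out, elem_factor, 1024])
            of_in.release(ObjectFifoPort.Consume, 1)
            of_out.release(ObjectFifoPort.Produce, 1)
            yield_([])
        of_factor.release(ObjectFifoPort.Consume, 1)
        yield_([])
```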

## Kernel Code

We can program the AIE compute core using C++ code and compile it with `xchesscc` into an kernel object file. In this section, we will use a generic implementation of the vector scalar multiplication that can run on the scalar processor part of the AIE. The `vector_scalar_mul_aie_scalar` function processes one data element at a time, taking advantage of AIE scalar datapath to load, multiply and store data elements.
We can program the AIE compute core using C++ code and compile it with `xchesscc` into a kernel object file. For our local version of the vector scalar multiply design, we will use a generic implementation of the `scale.cc` source, called [vector_scalar_mul.cc](./vector_scalar_mul.cc), that can run on the scalar processor part of the AIE. The `vector_scalar_mul_aie_scalar` function processes one data element at a time, taking advantage of the AIE scalar datapath to load, multiply and store data elements.

```c
void vector_scalar_mul_aie_scalar(int32_t *a_in, int32_t *c_out,
@@ -122,19 +123,24 @@ Note that since the scalar factor is communicated through an object, it is provi
## Host Code
The host code acts as environment setup and testbench for the Vector Scalar Multiplication design example. The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and kick off the execution the AIE design on the NPU. After running, it verifies the results and optionally outputs trace data. Both a C++ [test.cpp](./test.cpp) and Python [test.py](./test.py) variant of this code are available.
The host code acts as an environment setup and testbench for the Vector Scalar Multiplication design example. The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and kicking off the execution of the AIE design on the NPU. After running, it verifies the results and optionally outputs trace data (to be covered in [section-4b](../section-4/section-4b/)). Both C++ ([test.cpp](./test.cpp)) and Python ([test.py](./test.py)) variants of this code are available.
For convenience, a set of test utilities support common elements of command line parsing, the XRT-based environment setup and with testbench functionality: [test_utils.h](../../runtime_lib/test_lib/test_utils.h) or [test.py](../../python/utils/test.py).
For convenience, a set of test utilities supports common elements of command-line parsing, XRT-based environment setup, and testbench functionality: [test_utils.h](../../runtime_lib/test_lib/test_utils.h) or [test.py](../../python/utils/test.py).
The host code contains the following elements:
1. *Parse program arguments and set up constants*: the host code typically takes at least as arguments: `-x` the XCLBIN file, `-k` kernel name (with default name "MLIR_AIE"), and `-i` the instruction sequence file as arguments since it is its task to load those files and set the kernel name. Both the XCLBIN and instruction sequence are generated when compiling the AIE-array Structural Description and kernel code with `aiecc.py`.
1. *Parse program arguments and set up constants*: the host code typically requires the following three arguments:
    * `-x` the XCLBIN file
    * `-k` the kernel name (with default name "MLIR_AIE")
    * `-i` the instruction sequence file

    The host code is responsible for loading these files and setting the kernel name. Both the XCLBIN and instruction sequence are generated when compiling the AIE-array structural description and kernel code with `aiecc.py`.
1. *Read instruction sequence*: load the instruction sequence from the specified file in memory
1. *Create XRT environment*: so that we can use the XRT runtime
1. *Create XRT buffer objects* for the instruction sequence, inputs (vector a and factor) and output (vector c). Note that the `kernel.group_id(<number>)` needs to match the order of `def sequence(A, F, C):` in the data movement to/from the AIE-array of Python AIE-array structural description, starting with ID number 2 for the first sequence argument and then incrementing by 1.
1. *Create XRT buffer objects* for the instruction sequence, inputs (vector `a` and `factor`) and output (vector `c`). Note that the `kernel.group_id(<number>)` needs to match the order of `def sequence(A, F, C):` in the data movement to/from the AIE-array in the Python AIE-array structural description, starting with ID number 2 for the first sequence argument and then incrementing by 1 (see the sketch after this list). This mapping is also described in the [Python utils documentation](../../python/utils/README.md#configure-shimdma).
1. *Initialize and synchronize*: host to device XRT buffer objects
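
A sketch of the buffer-object setup this implies, using `pyxrt` as [test.py](./test.py) does (the buffer names, the int32 sizes and the simplified instruction parsing are assumptions):

```python
import numpy as np
import pyxrt as xrt

# Set up device, xclbin, hardware context and kernel handle
device = xrt.device(0)
xclbin = xrt.xclbin("build/final.xclbin")
device.register_xclbin(xclbin)
context = xrt.hw_context(device, xclbin.get_uuid())
kernel = xrt.kernel(context, "MLIR_AIE")

# Read the instruction sequence (one hex word per line; parsing simplified)
with open("build/insts.txt") as f:
    instr_v = np.array([int(w, 16) for w in f.read().split()], dtype=np.uint32)

# group_id(0) carries the instructions; sequence(A, F, C) maps to 2, 3, 4
bo_instr = xrt.bo(device, instr_v.nbytes, xrt.bo.cacheable, kernel.group_id(0))
bo_a = xrt.bo(device, 4096 * 4, xrt.bo.host_only, kernel.group_id(2))  # vector a
bo_f = xrt.bo(device, 4, xrt.bo.host_only, kernel.group_id(3))         # factor
bo_c = xrt.bo(device, 4096 * 4, xrt.bo.host_only, kernel.group_id(4))  # vector c
```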
4 changes: 2 additions & 2 deletions programming_guide/section-4/README.md
@@ -8,9 +8,9 @@
//
//===----------------------------------------------------------------------===//-->

# <ins>Section 4 - Vector Programming & Performance Measurement</ins>
# <ins>Section 4 - Performance Measurement & Vector Programming</ins>

Now that you've had a chance to walk through the components of compiling and running a program on the Ryzen™ AI hardware in [section-3](../section-3), we will start looking at how we measure performance and utilize vector programming techniques to fully leverage the power of the AI Engines for parallel compute.
Now that you've had a chance to walk through the components of compiling and running a program on the Ryzen™ AI hardware in [section-3](../section-3), we will start looking at how we measure performance and utilize vector programming techniques to fully leverage the power of the AI Engines for parallel compute.

It is helpful to first examine performance measurement before we delve into vector programming in order to get a baseline for where our application performance is. There are many factors that contribute to performance including latency, throughput and power efficiency. Performance measurement is an active area of research to provide more powerful tools for users to measure the speedup of their application on AIEs. In [section-4a](./section-4a) and [section-4b](./section-4b/), we look at performance from the perspective of timers and trace. Then in [section-4c](./section-4c), we look more closely at how to vectorize AIE kernel code.
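
As a first taste of the timers approach, the host code can simply bracket the kernel invocation with a wall-clock timer (a sketch; the invocation form and the buffer-object names carried over from the host-code setup are assumptions):

```python
import time

# Time one NPU kernel invocation end to end on the host
start = time.perf_counter()
run = kernel(bo_instr, instr_v.nbytes // 4, bo_a, bo_f, bo_c)  # invocation form assumed
run.wait()
stop = time.perf_counter()
print(f"NPU run time: {(stop - start) * 1e6:.1f} us")
```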

10 changes: 5 additions & 5 deletions programming_guide/section-4/section-4a/Makefile
@@ -31,7 +31,7 @@ ${targetname}.exe: ${srcdir}/test.cpp
rm -rf _build
mkdir -p _build
cd _build && ${powershell} cmake -E env CXXFLAGS="-std=c++23 -ggdb" cmake ${srcdir} -D CMAKE_C_COMPILER=gcc-13 -D CMAKE_CXX_COMPILER=g++-13 -DTARGET_NAME=${targetname}
cd _build && ${powershell} cmake --build , --config Release
cd _build && ${powershell} cmake --build . --config Release
ifeq "${powershell}" "powershell.exe"
cp _build/${targetname}.exe $@
else
@@ -41,11 +41,11 @@ endif
run: ${targetname}.exe build/final.xclbin build/insts.txt
${powershell} ./$< -x build/final.xclbin -i build/insts.txt -k MLIR_AIE

run-1: ${targetname}.exe build/final.xclbin build/insts.txt
${powershell} ./$< -x build/final.xclbin -i build/insts.txt -k MLIR_AIE --iters 1 -v 2
run-10: ${targetname}.exe build/final.xclbin build/insts.txt
${powershell} ./$< -x build/final.xclbin -i build/insts.txt -k MLIR_AIE --iters 10

run-4: ${targetname}.exe build/final.xclbin build/insts.txt
${powershell} ./$< -x build/final.xclbin -i build/insts.txt -k MLIR_AIE --iters 4 -v 2
run-10-warmup: ${targetname}.exe build/final.xclbin build/insts.txt
${powershell} ./$< -x build/final.xclbin -i build/insts.txt -k MLIR_AIE --iters 10 --warmup 4

run_py: build/final.xclbin build/insts.txt
${powershell} python3 test.py -x build/final.xclbin -i build/insts.txt -k MLIR_AIE
