
Commit

General fixes to IRON guide (#1557)
Co-authored-by: Mario Ruiz <mruiznog@xilinx.com>
mariodruiz and Mario Ruiz authored Jun 19, 2024
1 parent 29d5cec commit 618df50
Showing 12 changed files with 58 additions and 57 deletions.
16 changes: 8 additions & 8 deletions programming_guide/section-1/README.md
@@ -15,12 +15,12 @@ When we program the AIE-array, we need to declare and configure its structural b
## <ins>Walkthrough of Python source file (aie2.py)</ins>
At the top of this Python source, we include modules that define the IRON AIE language bindings `aie.dialects.aie` and the mlir-aie context `aie.extras.context`, which binds to MLIR definitions for AI Engines.

```
```python
from aie.dialects.aie import * # primary mlir-aie dialect definitions
from aie.extras.context import mlir_mod_ctx # mlir-aie context
```
Then we declare a structural design function that will expand into MLIR code when it is called from within an mlir-aie context (see the last part of this subsection).
```
```python
# AI Engine structural design function
def mlir_aie_design():
<... AI Engine device, blocks, and connections ...>
@@ -31,7 +31,7 @@ The arguments for the tile declaration are the tile coordinates (column, row). W

> **NOTE:** The actual tile coordinates used on the device when the program is run may deviate from the ones declared here. For example, on the NPU on Ryzen™ AI (`@device(AIEDevice.npu)`), these coordinates tend to be relative coordinates, as the runtime scheduler may assign the design to a different available column at runtime.
```
```python
# Device declaration - here using aie2 device NPU
@device(AIEDevice.npu1_1col)
def device_body():
@@ -41,8 +41,8 @@
ComputeTile2 = tile(2, 3)
ComputeTile3 = tile(2, 4)
```
Once we are done declaring our blocks (and connections) within our design function, we move onto the main body of our program where we call the function and output our design in MLIR. This is done by first declaring the MLIR context via the `with mlir_mod_ctx() as ctx:` line. This indicates that subsequent indented Python code is in the MLIR context, and we follow this by calling our previously defined design function `mlir_aie_design()`. This means all the code within the design function is understood to be in the MLIR context and contains the IRON custom Python binding definitions of the more detailed MLIR block definitions. The final line is `print(ctx.module)`, which takes the code defined in our MLIR context and prints it stdout. This will then convert our Python-bound code to its MLIR equivalent and print it to stdout.
```
Once we are done declaring our blocks (and connections) within our design function, we move onto the main body of our program where we call the function and output our design in MLIR. This is done by first declaring the MLIR context via the `with mlir_mod_ctx() as ctx:` line. This indicates that subsequent indented Python code is in the MLIR context, and we follow this by calling our previously defined design function `mlir_aie_design()`. This means all the code within the design function is understood to be in the MLIR context and contains the IRON custom Python binding definitions of the more detailed MLIR block definitions. The final line is `print(ctx.module)`, which takes the code defined in our MLIR context and prints it to stdout. This will then convert our Python-bound code to its MLIR equivalent and print it to stdout.
```python
# Declares that subsequent code is in mlir-aie context
with mlir_mod_ctx() as ctx:
mlir_aie_design() # Call design function within the mlir-aie context
@@ -52,9 +52,9 @@ with mlir_mod_ctx() as ctx:
## <ins>Other Tile Types</ins>
In addition to the compute tiles, an AIE-array also contains data movers for accessing L3 memory (also called shim DMAs) and larger L2 scratchpads (called mem tiles), which have been available since the AIE-ML generation - see [the introduction of this programming guide](../README.md). Declaring these other types of structural blocks follows the same syntax but requires physical layout details for the specific target device. Shim DMAs typically occupy row 0, while mem tiles (when available) often reside on row 1. The following code segment declares all the different tile types found in a single NPU column.

```
```python
# Device declaration - here using aie2 device NPU
@device(AIEDevice.npu1_1col)
@device(AIEDevice.npu1)
def device_body():

# Tile declarations
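    # (Hypothetical continuation: the remaining declarations are elided in this diff view;
    #  the coordinates below are illustrative.)
    ShimTile     = tile(0, 0)   # shim DMA in row 0 for host/L3 access
    MemTile      = tile(0, 1)   # mem tile in row 1 with the larger L2 scratchpad
    ComputeTile1 = tile(0, 2)   # compute tiles occupy rows 2 and above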
@@ -76,7 +76,7 @@ Next to the compute tiles, an AIE-array also contains data movers for accessing
4. No error is generated, but our code is invalid. Take a look at the generated MLIR code under `build/aie.mlir`. This generated output is invalid MLIR syntax and running our mlir-aie tools on this MLIR source will generate an error. We do, however, have some additional Python structural syntax checks that can be enabled if we use the function `ctx.module.operation.verify()`. This verifies that our Python-bound code contains valid operations within the mlir-aie context.

Qualify the `print(ctx.module)` call with a check on `ctx.module.operation.verify()` using a code block like the following:
```
```python
res = ctx.module.operation.verify()
if res == True:
    print(ctx.module)
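# Hypothetical continuation (elided in this diff view): report the failure instead
else:
    print(res)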
@@ -69,7 +69,7 @@ Similarly, the order in `fifoIns` specifies which input object will make up whic

<img src="./../../../assets/Join.png" height="200">

The following code snippet describes the figure above. There are three Object FIFOs: `of2` has a producer tile B and a consumer tile A, while `of0` and `of1` have C and D respectively as their producer tiles and B as their consumer tile. The link specifies that data from `of0` and `of1` is joined into `of2`. In this link, B is the shared tile where the implicit data copy will take place via B's DMAs. We can also note how `of0` and `of1`'s datatypes are half of `of2`'s, which means that objects from `of0` will become the first half of objects in `of2` while objects in `of1` will become the second half, based on their order in the link.
The following code snippet describes the figure above. There are three Object FIFOs: `of0` has a producer tile B and a consumer tile A, while `of1` and `of2` have C and D respectively as their producer tiles and B as their consumer tile. The link specifies that data from `of1` and `of2` is joined into `of0`. In this link, B is the shared tile where the implicit data copy will take place via B's DMAs. We can also note how `of1` and `of2`'s datatypes are half of `of0`'s, which means that objects from `of1` will become the first half of objects in `of0` while objects in `of2` will become the second half, based on their order in the link.
```python
A = tile(1, 0)
B = tile(1, 1)
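# (Hypothetical continuation: the rest of this snippet is elided in the diff view;
#  tile coordinates, depths, and buffer sizes below are illustrative.)
C = tile(1, 3)
D = tile(2, 3)

of0 = object_fifo("objfifo0", B, A, 2, T.memref(256, T.i32()))  # B -> A
of1 = object_fifo("objfifo1", C, B, 2, T.memref(128, T.i32()))  # C -> B, first half of of0
of2 = object_fifo("objfifo2", D, B, 2, T.memref(128, T.i32()))  # D -> B, second half of of0
object_fifo_link([of1, of2], of0)  # join of1 and of2 into of0 via B's DMAs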
32 changes: 16 additions & 16 deletions programming_guide/section-2/section-2c/README.md
@@ -46,7 +46,7 @@ A data layout transformation is presented as a tuple of pairs, where each pair r
```c
[<size_2, stride_2>, <size_1, stride_1>, <size_0, stride_0>]
```
Transformations can be expressed in up to three dimensions on each compute and Shim tile, and in up to four dimensions on Mem tiles. The first pair of this array gives the outer-most dimension's stride and size `<size_2, stride_2>`, while the last pair of the array gives the inner-most dimension's stride and size `<size_0,stride_0>`. All strides are expressed in **multiples of the element width**.
Transformations can be expressed in up to three dimensions on each compute and Shim tile, and in up to four dimensions on Mem tiles. The first pair of this array gives the outer-most dimension's stride and size `<size_2, stride_2>`, while the last pair of the array gives the inner-most dimension's stride and size `<size_0, stride_0>`. All strides are expressed in **multiples of the element width**.

> **NOTE:** For 4B data types only, the inner-most dimension's stride must be 1 by design.
@@ -56,9 +56,9 @@ int *buffer;
for(int i = 0; i < size_2; i++)
for(int j = 0; j < size_1; j++)
for(int k = 0; k < size_0; k++)
# access/store element at/to buffer[ i * stride_2
# + j * stride_1
# + k * stride_0]
// access/store element at/to buffer[ i * stride_2
// + j * stride_1
// + k * stride_0]
```

As a practical example, here is an access pattern that corresponds to alternating between even and odd elements every 8 elements in a 128 element buffer/stream:
@@ -67,14 +67,14 @@ aie.dma_bd(%buf : memref<128xi32>, 0, 128, [<8, 16>, <2, 1>, <8, 2>])
```
which translates to:
```c
for(int i = 0; i < 8; i++) # size_2
for(int j = 0; j < 2; j++) # size_1
for(int k = 0; k < 8; k++) # size_0
# access/store element at/to index:
for(int i = 0; i < 8; i++) // size_2
for(int j = 0; j < 2; j++) // size_1
for(int k = 0; k < 8; k++) // size_0
// access/store element at/to index:
(
i * 16 # stride_2
+ j * 1 # stride_1
+ k * 2 # stride_0
i * 16 // stride_2
+ j * 1 // stride_1
+ k * 2 // stride_0
)
```

@@ -116,12 +116,12 @@ of0 = object_fifo
```
The access pattern of the transformation can be written as:
```c
for(int i = 0; i < 2; i++) # size_1
for(int j = 0; j < 3; j++) # size_0
# access/store element at/to index:
for(int i = 0; i < 2; i++) // size_1
for(int j = 0; j < 3; j++) // size_0
// access/store element at/to index:
(
i * 16 # stride_1
+ j * 2 # stride_0
i * 16 // stride_1
+ j * 2 // stride_0
)
```
and further represented as in the image below:
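For reference, an `object_fifo` declaration that produces this access pattern might look roughly like the sketch below; the tile names, depth, and buffer size are assumptions, and only the `dimensionsToStream` pairs follow from the loop nest above.
```python
A = tile(1, 0)
B = tile(1, 3)
of0 = object_fifo(
    "objfifo0", A, B, 2, T.memref(48, T.i32()),
    dimensionsToStream=[(2, 16), (3, 2)],  # [(size_1, stride_1), (size_0, stride_0)]
)
```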
2 changes: 1 addition & 1 deletion programming_guide/section-2/section-2d/README.md
@@ -49,7 +49,7 @@ memRef_48_ty = T.memref(48, T.i32())
# Input data movement

of_in = object_fifo("in", ShimTile, MemTile, buffer_depth, memRef_data_ty)
of_in1 = object_fifo("in0", MemTile, ComputeTile, buffer_depth, memRef_data_ty)
of_in0 = object_fifo("in0", MemTile, ComputeTile, buffer_depth, memRef_data_ty)
object_fifo_link(of_in, of_in0)


@@ -20,10 +20,10 @@ The design in [ext_to_core.py](./ext_to_core.py) uses an Object FIFO `of_in` to
of_out = object_fifo("out", ComputeTile2, ShimTile, 2, memRef_24_ty) # Output
```

Both a consumer and a producer process are running on `ComputeTile2`. The producer process acquires one object from `of_in` to consume and one object from `of_out` to produce into. It then reads the value of the input object and adds `1` to all its entries before releasing both objects.
Both consumer and producer processes are running on `ComputeTile2`. The producer process acquires one object from `of_in` to consume and one object from `of_out` to produce into. It then reads the value of the input object and adds `1` to all its entries before releasing both objects.
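
As a rough sketch (assuming the Object FIFO acquire/release API and the imports used elsewhere in this guide; the loop structure and the comment placeholder are illustrative), the producer process could look like:
```python
@core(ComputeTile2)
def core_body():
    for _ in for_(sys.maxsize):                               # run indefinitely
        elem_in = of_in.acquire(ObjectFifoPort.Consume, 1)    # one input object
        elem_out = of_out.acquire(ObjectFifoPort.Produce, 1)  # one output object
        # ... read each entry of elem_in, add 1, and store the result into elem_out ...
        of_in.release(ObjectFifoPort.Consume, 1)
        of_out.release(ObjectFifoPort.Produce, 1)
        yield_([])
```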

It is possible to build, run and test this design with the following commands:
```
```bash
make
make run
```
@@ -30,7 +30,7 @@ The design in [ext_to_coreL2.py](./ext_to_core.py) is very similar to the one in
The processes on the compute tile work the same way as in the previous design. The producer process acquires one object from `of_in1` to consume and one object from `of_out1` to produce into. It then reads the value of the input object and adds `1` to all its entries before releasing both objects.

It is possible to build, run and test this design with the following commands:
```
```bash
make
make run
```
@@ -27,7 +27,7 @@ The design in [join_L2.py](./join_L2.py) uses three Object FIFOs from each of th
All compute tiles are running the same process of acquiring one object from their respective input Object FIFOs to produce, writing `1` to all of its entries, and releasing the object.
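
A rough sketch of how such a join could be declared (the FIFO names, depths, and element types here are assumptions based on the description above):
```python
of_in0 = object_fifo("in0", ComputeTile0, MemTile, 2, memRef_8_ty)
of_in1 = object_fifo("in1", ComputeTile1, MemTile, 2, memRef_8_ty)
of_in2 = object_fifo("in2", ComputeTile2, MemTile, 2, memRef_8_ty)
of_out = object_fifo("out", MemTile, ShimTile, 2, memRef_24_ty)
object_fifo_link([of_in0, of_in1, of_in2], of_out)  # join three L1 streams into one L2 buffer
```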

This design is combined with the previous [distribute](../04_distribute_L2/distribute_L2.py) design to achieve a full data movement from external memory to the AIE array and back. The resulting code is available in [distribute_and_join_L2.py](./distribute_and_join_L2.py). It is possible to build, run and test it with the following commands:
```
```bash
make
make run
```
6 changes: 3 additions & 3 deletions programming_guide/section-2/section-2g/README.md
@@ -21,9 +21,9 @@

-----

In the preceding sections, we looked at how we can describe data movement between tiles *within* the AIE-array. However, to do anything useful, we need to get data from outside the array, i.e. from the "host", into the AIE-array and back. On NPU devices, we can achieve this with the operations described in this section.
In the preceding sections, we looked at how we can describe data movement between tiles *within* the AIE-array. However, to do anything useful, we need to get data from outside the array, i.e., from the "host", into the AIE-array and back. On NPU devices, we can achieve this with the operations described in this section.

The operations that will be described in this section must be placed in a separate `sequence` function. The arguments to this function describe buffers that will be available on the host side; the body of the function describes how those buffers are moved into the AIE-array. [Section 3](../../../programming_examples/) contains an example.
The operations that will be described in this section must be placed in a separate `sequence` function. The arguments to this function describe buffers that will be available on the host side; the body of the function describes how those buffers are moved into the AIE-array. [Section 3](../../section-3/) contains an example.
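
As a rough sketch, such a function could look like the following; the decorator, buffer types, and Object FIFO names are assumptions, and only the `npu_dma_memcpy_nd` call shape mirrors the example further below.
```python
N = 30
tensor_ty = T.memref(N, T.i32())

@FuncOp.from_py_func(tensor_ty, tensor_ty)
def sequence(input_buffer, output_buffer):
    # Move N elements from the host input buffer into the Object FIFO labeled "of_in" ...
    npu_dma_memcpy_nd("of_in", 0, input_buffer, sizes=[1, 1, 1, N])
    # ... and move N results from "of_out" back into the host output buffer.
    npu_dma_memcpy_nd("of_out", 1, output_buffer, sizes=[1, 1, 1, N])
    # Wait for the output transfer to complete before returning control to the host
    # (npu_sync here is an assumption about the synchronization primitive).
    npu_sync(column=0, row=0, direction=0, channel=0)
```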

### Guide to Managing Runtime Data Movement to/from Host Memory

@@ -51,7 +51,7 @@ It is important to note that dimension 0 of the **`sizes`** and all **`strides`*
npu_dma_memcpy_nd("of_in", 0, input_buffer, sizes=[1, 1, 1, 30])
```

The example above describes a linear transfer of 30 data elements, or 120 Bytes, from the `input_buffer` in host memory into an object FIFO with matching metadata labled "of_in". The `size` dimensions are expressed right to left where the right is dimension 0 and the left dimension 3. Higher dimensions not used should be set to `1`.
The example above describes a linear transfer of 30 data elements, or 120 Bytes, from the `input_buffer` in host memory into an object FIFO with matching metadata labeled "of_in". The `size` dimensions are expressed right to left where the right is dimension 0 and the left dimension 3. Higher dimensions not used should be set to `1`.


#### **Advanced Techniques for Multi-dimensional `npu_dma_memcpy_nd`**
2 changes: 1 addition & 1 deletion programming_guide/section-3/README.md
@@ -106,7 +106,7 @@ This access and execute pattern runs on the AIE compute core `ComputeTile2` and

## Kernel Code

We can program the AIE compute core using C++ code and compile it with `xchesscc` into a kernel object file. For our local verion of vector scalar multiply, we will use a generic implementation of the `scale.cc` source (called [vector_scalar_mul.cc](./vector_scalar_mul.cc)) that can run on the scalar processor part of the AIE. The `vector_scalar_mul_aie_scalar` function processes one data element at a time, taking advantage of AIE scalar datapath to load, multiply and store data elements.
We can program the AIE compute core using C++ code and compile it with `xchesscc` into a kernel object file. For our local version of vector scalar multiply, we will use a generic implementation of the `scale.cc` source (called [vector_scalar_mul.cc](./vector_scalar_mul.cc)) that can run on the scalar processor part of the AIE. The `vector_scalar_mul_aie_scalar` function processes one data element at a time, taking advantage of AIE scalar datapath to load, multiply and store data elements.

```c
void vector_scalar_mul_aie_scalar(int32_t *a_in, int32_t *c_out,
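                                  int32_t *factor, int32_t N) {
  // Hypothetical continuation (elided in this diff view); parameter names are illustrative.
  for (int i = 0; i < N; i++) {
    // One element per iteration on the scalar datapath: load, multiply, store.
    c_out[i] = *factor * a_in[i];
  }
}
```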