
Commit

General fixes to IRON guide (#1557)
Co-authored-by: Mario Ruiz <mruiznog@xilinx.com>
mariodruiz and Mario Ruiz authored Jun 19, 2024
1 parent 29d5cec commit 618df50
Showing 12 changed files with 58 additions and 57 deletions.
16 changes: 8 additions & 8 deletions programming_guide/section-1/README.md
@@ -15,12 +15,12 @@ When we program the AIE-array, we need to declare and configure its structural b
## <ins>Walkthrough of Python source file (aie2.py)</ins>
At the top of this Python source, we include modules that define the IRON AIE language bindings `aie.dialects.aie` and the mlir-aie context `aie.extras.context`, which binds to MLIR definitions for AI Engines.

```
```python
from aie.dialects.aie import * # primary mlir-aie dialect definitions
from aie.extras.context import mlir_mod_ctx # mlir-aie context
```
Then we declare a structural design function that will expand into MLIR code when it is called from within an mlir-aie context (see the last part of this subsection).
```
```python
# AI Engine structural design function
def mlir_aie_design():
<... AI Engine device, blocks, and connections ...>
@@ -31,7 +31,7 @@ The arguments for the tile declaration are the tile coordinates (column, row). W

> **NOTE:** The actual tile coordinates used on the device when the program is run may deviate from the ones declared here. For example, on the NPU on Ryzen™ AI (`@device(AIEDevice.npu)`), these coordinates tend to be relative coordinates, as the runtime scheduler may assign the design to a different available column at runtime.
```
```python
# Device declaration - here using aie2 device NPU
@device(AIEDevice.npu1_1col)
def device_body():
@@ -41,8 +41,8 @@
ComputeTile2 = tile(2, 3)
ComputeTile3 = tile(2, 4)
```
Once we are done declaring our blocks (and connections) within our design function, we move onto the main body of our program where we call the function and output our design in MLIR. This is done by first declaring the MLIR context via the `with mlir_mod_ctx() as ctx:` line. This indicates that subsequent indented Python code is in the MLIR context, and we follow this by calling our previously defined design function `mlir_aie_design()`. This means all the code within the design function is understood to be in the MLIR context and contains the IRON custom Python binding definitions of the more detailed MLIR block definitions. The final line is `print(ctx.module)`, which takes the code defined in our MLIR context and prints it stdout. This will then convert our Python-bound code to its MLIR equivalent and print it to stdout.
```
Once we are done declaring our blocks (and connections) within our design function, we move onto the main body of our program where we call the function and output our design in MLIR. This is done by first declaring the MLIR context via the `with mlir_mod_ctx() as ctx:` line. This indicates that subsequent indented Python code is in the MLIR context, and we follow this by calling our previously defined design function `mlir_aie_design()`. This means all the code within the design function is understood to be in the MLIR context and contains the IRON custom Python binding definitions of the more detailed MLIR block definitions. The final line is `print(ctx.module)`, which takes the code defined in our MLIR context and prints it to stdout. This will then convert our Python-bound code to its MLIR equivalent and print it to stdout.
```python
# Declares that subsequent code is in mlir-aie context
with mlir_mod_ctx() as ctx:
mlir_aie_design() # Call design function within the mlir-aie context
@@ -52,9 +52,9 @@ with mlir_mod_ctx() as ctx:
## <ins>Other Tile Types</ins>
In addition to the compute tiles, an AIE-array also contains data movers for accessing L3 memory (also called shim DMAs) and larger L2 scratchpads (called mem tiles), which have been available since the AIE-ML generation - see [the introduction of this programming guide](../README.md). Declaring these other types of structural blocks follows the same syntax but requires physical layout details for the specific target device. Shim DMAs typically occupy row 0, while mem tiles (when available) often reside on row 1. The following code segment declares all the different tile types found in a single NPU column.

```
```python
# Device declaration - here using aie2 device NPU
@device(AIEDevice.npu1_1col)
@device(AIEDevice.npu1)
def device_body():

# Tile declarations
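    # (Hypothetical continuation: the remaining declarations are elided in this diff view;
    #  the coordinates below are illustrative.)
    ShimTile     = tile(0, 0)   # shim DMA in row 0 for host/L3 access
    MemTile      = tile(0, 1)   # mem tile in row 1 with the larger L2 scratchpad
    ComputeTile1 = tile(0, 2)   # compute tiles occupy rows 2 and above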
@@ -76,7 +76,7 @@ Next to the compute tiles, an AIE-array also contains data movers for accessing
4. No error is generated, but our code is invalid. Take a look at the generated MLIR code under `build/aie.mlir`. This generated output is invalid MLIR syntax and running our mlir-aie tools on this MLIR source will generate an error. We do, however, have some additional Python structural syntax checks that can be enabled if we use the function `ctx.module.operation.verify()`. This verifies that our Python-bound code contains valid operations within the mlir-aie context.

Qualify the `print(ctx.module)` call with a check on `ctx.module.operation.verify()` using a code block like the following:
```
```python
res = ctx.module.operation.verify()
if res == True:
    print(ctx.module)
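# Hypothetical continuation (elided in this diff view): report the failure instead
else:
    print(res)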
@@ -69,7 +69,7 @@ Similarly, the order in `fifoIns` specifies which input object will make up whic

<img src="./../../../assets/Join.png" height="200">

The following code snippet describes the figure above. There are three Object FIFOs: `of2` has a producer tile B and a consumer tile A, while `of0` and `of1` have C and D respectively as their producer tiles and B as their consumer tile. The link specifies that data from `of0` and `of1` is joined into `of2`. In this link, B is the shared tile where the implicit data copy will take place via B's DMAs. We can also note how `of0` and `of1`'s datatypes are half of `of2`'s, which means that objects from `of0` will become the first half of objects in `of2` while objects in `of1` will become the second half, based on their order in the link.
The following code snippet describes the figure above. There are three Object FIFOs: `of0` has a producer tile B and a consumer tile A, while `of1` and `of2` have C and D respectively as their producer tiles and B as their consumer tile. The link specifies that data from `of1` and `of2` is joined into `of0`. In this link, B is the shared tile where the implicit data copy will take place via B's DMAs. We can also note how `of1` and `of2`'s datatypes are half of `of0`'s, which means that objects from `of1` will become the first half of objects in `of0` while objects in `of2` will become the second half, based on their order in the link.
```python
A = tile(1, 0)
B = tile(1, 1)
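# (Hypothetical continuation: the rest of this snippet is elided in the diff view;
#  tile coordinates, depths, and buffer sizes below are illustrative.)
C = tile(1, 3)
D = tile(2, 3)

of0 = object_fifo("objfifo0", B, A, 2, T.memref(256, T.i32()))  # B -> A
of1 = object_fifo("objfifo1", C, B, 2, T.memref(128, T.i32()))  # C -> B, first half of of0
of2 = object_fifo("objfifo2", D, B, 2, T.memref(128, T.i32()))  # D -> B, second half of of0
object_fifo_link([of1, of2], of0)  # join of1 and of2 into of0 via B's DMAs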
32 changes: 16 additions & 16 deletions programming_guide/section-2/section-2c/README.md
@@ -46,7 +46,7 @@ A data layout transformation is presented as a tuple of pairs, where each pair r
```c
[<size_2, stride_2>, <size_1, stride_1>, <size_0, stride_0>]
```
Transformations can be expressed in up to three dimensions on each compute and Shim tile, and in up to four dimensions on Mem tiles. The first pair of this array gives the outer-most dimension's stride and size `<size_2, stride_2>`, while the last pair of the array gives the inner-most dimension's stride and size `<size_0,stride_0>`. All strides are expressed in **multiples of the element width**.
Transformations can be expressed in up to three dimensions on each compute and Shim tile, and in up to four dimensions on Mem tiles. The first pair of this array gives the outer-most dimension's stride and size `<size_2, stride_2>`, while the last pair of the array gives the inner-most dimension's stride and size `<size_0, stride_0>`. All strides are expressed in **multiples of the element width**.

> **NOTE:** For 4B data types only, the inner-most dimension's stride must be 1 by design.
@@ -56,9 +56,9 @@ int *buffer;
for(int i = 0; i < size_2; i++)
for(int j = 0; j < size_1; j++)
for(int k = 0; k < size_0; k++)
# access/store element at/to buffer[ i * stride_2
# + j * stride_1
# + k * stride_0]
// access/store element at/to buffer[ i * stride_2
// + j * stride_1
// + k * stride_0]
```

As a practical example, here is an access pattern that corresponds to alternating between even and odd elements every 8 elements in a 128 element buffer/stream:
@@ -67,14 +67,14 @@ aie.dma_bd(%buf : memref<128xi32>, 0, 128, [<8, 16>, <2, 1>, <8, 2>])
```
which translates to:
```c
for(int i = 0; i < 8; i++) # size_2
for(int j = 0; j < 2; j++) # size_1
for(int k = 0; k < 8; k++) # size_0
# access/store element at/to index:
for(int i = 0; i < 8; i++) // size_2
for(int j = 0; j < 2; j++) // size_1
for(int k = 0; k < 8; k++) // size_0
// access/store element at/to index:
(
i * 16 # stride_2
+ j * 1 # stride_1
+ k * 2 # stride_0
i * 16 // stride_2
+ j * 1 // stride_1
+ k * 2 // stride_0
)
```

@@ -116,12 +116,12 @@ of0 = object_fifo
```
The access pattern of the transformation can be written as:
```c
for(int i = 0; i < 2; i++) # size_1
for(int j = 0; j < 3; j++) # size_0
# access/store element at/to index:
for(int i = 0; i < 2; i++) // size_1
for(int j = 0; j < 3; j++) // size_0
// access/store element at/to index:
(
i * 16 # stride_1
+ j * 2 # stride_0
i * 16 // stride_1
+ j * 2 // stride_0
)
```
and further represented as in the image below:
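For reference, an `object_fifo` declaration that produces this access pattern might look roughly like the sketch below; the tile names, depth, and buffer size are assumptions, and only the `dimensionsToStream` pairs follow from the loop nest above.
```python
A = tile(1, 0)
B = tile(1, 3)
of0 = object_fifo(
    "objfifo0", A, B, 2, T.memref(48, T.i32()),
    dimensionsToStream=[(2, 16), (3, 2)],  # [(size_1, stride_1), (size_0, stride_0)]
)
```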
2 changes: 1 addition & 1 deletion programming_guide/section-2/section-2d/README.md
@@ -49,7 +49,7 @@ memRef_48_ty = T.memref(48, T.i32())
# Input data movement

of_in = object_fifo("in", ShimTile, MemTile, buffer_depth, memRef_data_ty)
of_in1 = object_fifo("in0", MemTile, ComputeTile, buffer_depth, memRef_data_ty)
of_in0 = object_fifo("in0", MemTile, ComputeTile, buffer_depth, memRef_data_ty)
object_fifo_link(of_in, of_in0)


@@ -20,10 +20,10 @@ The design in [ext_to_core.py](./ext_to_core.py) uses an Object FIFO `of_in` to
of_out = object_fifo("out", ComputeTile2, ShimTile, 2, memRef_24_ty) # Output
```

Both a consumer and a producer process are running on `ComputeTile2`. The producer process acquires one object from `of_in` to consume and one object from `of_out` to produce into. It then reads the value of the input object and adds `1` to all its entries before releasing both objects.
Both consumer and producer processes are running on `ComputeTile2`. The producer process acquires one object from `of_in` to consume and one object from `of_out` to produce into. It then reads the value of the input object and adds `1` to all its entries before releasing both objects.
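
As a rough sketch (assuming the Object FIFO acquire/release API and the imports used elsewhere in this guide; the loop structure and the comment placeholder are illustrative), the producer process could look like:
```python
@core(ComputeTile2)
def core_body():
    for _ in for_(sys.maxsize):                               # run indefinitely
        elem_in = of_in.acquire(ObjectFifoPort.Consume, 1)    # one input object
        elem_out = of_out.acquire(ObjectFifoPort.Produce, 1)  # one output object
        # ... read each entry of elem_in, add 1, and store the result into elem_out ...
        of_in.release(ObjectFifoPort.Consume, 1)
        of_out.release(ObjectFifoPort.Produce, 1)
        yield_([])
```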

It is possible to build, run and test this design with the following commands:
```
```bash
make
make run
```
@@ -30,7 +30,7 @@ The design in [ext_to_coreL2.py](./ext_to_core.py) is very similar to the one in
The processes on the compute tile work the same way as in the previous design. The producer process acquires one object from `of_in1` to consume and one object from `of_out1` to produce into. It then reads the value of the input object and adds `1` to all its entries before releasing both objects.

It is possible to build, run and test this design with the following commands:
```
```bash
make
make run
```
@@ -27,7 +27,7 @@ The design in [join_L2.py](./join_L2.py) uses three Object FIFOs from each of th
All compute tiles are running the same process of acquiring one object from their respective input Object FIFOs to produce, writing `1` to all of its entries, and releasing the object.
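
A rough sketch of how such a join could be declared (the FIFO names, depths, and element types here are assumptions based on the description above):
```python
of_in0 = object_fifo("in0", ComputeTile0, MemTile, 2, memRef_8_ty)
of_in1 = object_fifo("in1", ComputeTile1, MemTile, 2, memRef_8_ty)
of_in2 = object_fifo("in2", ComputeTile2, MemTile, 2, memRef_8_ty)
of_out = object_fifo("out", MemTile, ShimTile, 2, memRef_24_ty)
object_fifo_link([of_in0, of_in1, of_in2], of_out)  # join three L1 streams into one L2 buffer
```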

This design is combined with the previous [distribute](../04_distribute_L2/distribute_L2.py) design to achieve a full data movement from external memory to the AIE array and back. The resulting code is available in [distribute_and_join_L2.py](./distribute_and_join_L2.py). It is possible to build, run and test it with the following commands:
```
```bash
make
make run
```
6 changes: 3 additions & 3 deletions programming_guide/section-2/section-2g/README.md
@@ -21,9 +21,9 @@

-----

In the preceding sections, we looked at how we can describe data movement between tiles *within* the AIE-array. However, to do anything useful, we need to get data from outside the array, i.e. from the "host", into the AIE-array and back. On NPU devices, we can achieve this with the operations described in this section.
In the preceding sections, we looked at how we can describe data movement between tiles *within* the AIE-array. However, to do anything useful, we need to get data from outside the array, i.e., from the "host", into the AIE-array and back. On NPU devices, we can achieve this with the operations described in this section.

The operations that will be described in this section must be placed in a separate `sequence` function. The arguments to this function describe buffers that will be available on the host side; the body of the function describes how those buffers are moved into the AIE-array. [Section 3](../../../programming_examples/) contains an example.
The operations that will be described in this section must be placed in a separate `sequence` function. The arguments to this function describe buffers that will be available on the host side; the body of the function describes how those buffers are moved into the AIE-array. [Section 3](../../section-3/) contains an example.
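
As a rough sketch, such a function could look like the following; the decorator, buffer types, and Object FIFO names are assumptions, and only the `npu_dma_memcpy_nd` call shape mirrors the example further below.
```python
N = 30
tensor_ty = T.memref(N, T.i32())

@FuncOp.from_py_func(tensor_ty, tensor_ty)
def sequence(input_buffer, output_buffer):
    # Move N elements from the host input buffer into the Object FIFO labeled "of_in" ...
    npu_dma_memcpy_nd("of_in", 0, input_buffer, sizes=[1, 1, 1, N])
    # ... and move N results from "of_out" back into the host output buffer.
    npu_dma_memcpy_nd("of_out", 1, output_buffer, sizes=[1, 1, 1, N])
    # Wait for the output transfer to complete before returning control to the host
    # (npu_sync here is an assumption about the synchronization primitive).
    npu_sync(column=0, row=0, direction=0, channel=0)
```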

### Guide to Managing Runtime Data Movement to/from Host Memory

@@ -51,7 +51,7 @@ It is important to note that dimension 0 of the **`sizes`** and all **`strides`*
npu_dma_memcpy_nd("of_in", 0, input_buffer, sizes=[1, 1, 1, 30])
```

The example above describes a linear transfer of 30 data elements, or 120 Bytes, from the `input_buffer` in host memory into an object FIFO with matching metadata labled "of_in". The `size` dimensions are expressed right to left where the right is dimension 0 and the left dimension 3. Higher dimensions not used should be set to `1`.
The example above describes a linear transfer of 30 data elements, or 120 Bytes, from the `input_buffer` in host memory into an object FIFO with matching metadata labeled "of_in". The `size` dimensions are expressed right to left where the right is dimension 0 and the left dimension 3. Higher dimensions not used should be set to `1`.


#### **Advanced Techniques for Multi-dimensional `npu_dma_memcpy_nd`**
2 changes: 1 addition & 1 deletion programming_guide/section-3/README.md
@@ -106,7 +106,7 @@ This access and execute pattern runs on the AIE compute core `ComputeTile2` and

## Kernel Code

We can program the AIE compute core using C++ code and compile it with `xchesscc` into a kernel object file. For our local verion of vector scalar multiply, we will use a generic implementation of the `scale.cc` source (called [vector_scalar_mul.cc](./vector_scalar_mul.cc)) that can run on the scalar processor part of the AIE. The `vector_scalar_mul_aie_scalar` function processes one data element at a time, taking advantage of AIE scalar datapath to load, multiply and store data elements.
We can program the AIE compute core using C++ code and compile it with `xchesscc` into a kernel object file. For our local version of vector scalar multiply, we will use a generic implementation of the `scale.cc` source (called [vector_scalar_mul.cc](./vector_scalar_mul.cc)) that can run on the scalar processor part of the AIE. The `vector_scalar_mul_aie_scalar` function processes one data element at a time, taking advantage of AIE scalar datapath to load, multiply and store data elements.

```c
void vector_scalar_mul_aie_scalar(int32_t *a_in, int32_t *c_out,
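                                  int32_t *factor, int32_t N) {
  // Hypothetical continuation (elided in this diff view); parameter names are illustrative.
  for (int i = 0; i < N; i++) {
    // One element per iteration on the scalar datapath: load, multiply, store.
    c_out[i] = *factor * a_in[i];
  }
}
```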