diff --git a/programming_examples/basic/matrix_multiplication/README.md b/programming_examples/basic/matrix_multiplication/README.md
new file mode 100644
index 0000000000..88b701ffa2
--- /dev/null
+++ b/programming_examples/basic/matrix_multiplication/README.md
@@ -0,0 +1,19 @@
+
+
+# Matrix Multiplication
+
+Subdirectories in this directory contain example designs that implement matrix multiplication on the AI-Engine-enabled AMD Neural Processing Unit (NPU).
+
+> These designs all follow largely the same structure and rely on the same basic concepts. The [whole-array design](whole_array/README.md) contains a representative in-depth explanation of this structure and these concepts. In the explanations for the other designs, we rely on the whole-array design as a base and only highlight the differences.
+
+* [`single_core`](single_core) - This design performs matrix-matrix multiplication on a single AI Engine core.
+* [`whole_array`](whole_array) - This design evolves `single_core` by splitting the computation and parallelizing it. It utilizes all available AI Engine cores simultaneously.
+* [`matrix_vector`](matrix_vector) - This design is a specialization to the matrix-vector-multiplication case, which poses unique challenges due to lower computation density. *Work in progress.*
\ No newline at end of file
diff --git a/programming_examples/basic/matrix_multiplication/matrix_vector/README.md b/programming_examples/basic/matrix_multiplication/matrix_vector/README.md
index c01f13c58f..1eae2e2c23 100644
--- a/programming_examples/basic/matrix_multiplication/matrix_vector/README.md
+++ b/programming_examples/basic/matrix_multiplication/matrix_vector/README.md
@@ -8,11 +8,21 @@
//
//===----------------------------------------------------------------------===//-->
-# Matrix Vector Multiplication
+# Matrix-Vector Multiplication
-One tile in one or more columns performs a `matrix * vector` multiply on bfloat16 data type where `MxK` is `288x288`. The kernel itself computes `32x32 (MxK)` so it is invoked multiple times to complete the full matvec compute.
+In this design, one or more AI Engine compute cores (spread across hardware columns, configurable as `n_cores`) perform a matrix-*vector* multiplication. We use a `bfloat16` data type, and the dimensions of the `A` matrix `M`×`K` are set to `288`×`288` by default (`N`, the number of columns in `B`, is always `1`, since `B` is a vector). The kernel itself consumes chunks of `32`×`32` (`M`×`K`) of `A`, so it is invoked multiple times to complete the full result.
-You need c++23 for bfloat16_t support. It can be found in g++-13: https://lindevs.com/install-g-on-ubuntu
+> This design relies on the same basic concepts as the [whole-array matrix-matrix multiplication design](../whole_array/README.md), and it is structured very similarly to that design. Please refer to the in-depth explanation of that design along with the below outlined differences for a better understanding of this design.
+
+## Differences from the [Whole-Array Matrix-Matrix Multiplication Design](../whole_array/README.md)
+
+- A specialized matrix-*vector* microkernel, named `matvec_vectorized`, is used in this design, as opposed to the more general matrix-matrix microkernel (`matmul_vectorized`) used in the matrix-matrix multiplication designs.
+- The data movement in this design varies as follows: An identical `32`-element chunk of the vector `B` is **broadcast** to the cores in all columns, whereas _distinct_ subsequent `32`×`32`-sized tiles of the `A` matrix are **distributed** to the cores. As such, each core is responsible for a distinct `32`-element chunk of the output vector `C`. These chunks are assembled (**joined**) at the shim tile level (in the `sequence()` function).
+- This design does not use all available compute cores. Instead, it uses at most one core in each hardware column; the variable `n_cores` defines the number of columns to be used. It would, however, be possible to extend this design to use all cores. The sketch below illustrates how the work is distributed across the cores that are used.
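+
+The following NumPy sketch illustrates the broadcast/distribute/join pattern described above. It is a host-side reference only, not the AIE code itself: `float32` stands in for `bfloat16`, and the round-robin assignment of row bands to cores is an illustrative assumption.
+
+```python
+import numpy as np
+
+M = K = 288            # default problem size: A is M x K, B is a length-K vector
+tile = 32              # the microkernel consumes 32 x 32 chunks of A
+n_cores = 4            # number of columns/cores in use (configurable)
+
+A = np.random.rand(M, K).astype(np.float32)
+B = np.random.rand(K).astype(np.float32)
+C = np.zeros(M, dtype=np.float32)
+
+for band in range(M // tile):      # each band yields one 32-element chunk of C
+    core = band % n_cores          # distinct A tiles are *distributed* over cores
+    rows = slice(band * tile, (band + 1) * tile)
+    for chunk in range(K // tile): # the same B chunk is *broadcast* to all cores
+        cols = slice(chunk * tile, (chunk + 1) * tile)
+        C[rows] += A[rows, cols] @ B[cols]   # one microkernel invocation on `core`
+
+assert np.allclose(C, A @ B, rtol=1e-3)      # per-core chunks joined into C
+```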
+
+## Building and Running the Design
+
+You need C++23 for `bfloat16_t` support. It can be found in g++-13: https://lindevs.com/install-g-on-ubuntu
To compile design:
```
diff --git a/programming_examples/basic/matrix_multiplication/single_core/README.md b/programming_examples/basic/matrix_multiplication/single_core/README.md
index 2fe5158e76..1d2b454bb7 100644
--- a/programming_examples/basic/matrix_multiplication/single_core/README.md
+++ b/programming_examples/basic/matrix_multiplication/single_core/README.md
@@ -8,11 +8,20 @@
//
//===----------------------------------------------------------------------===//-->
-# Matrix Multiplication
+# Matrix Multiplication - Single Core Design
-A single tile performs a `matrix * matrix` multiply on bfloat16 data type where `MxKxN` is `128x128x128`. The kernel itself computes `64x32x64 (MxKxN)` so it is invoked multiple times to complete the full matmul compute.
+In this design, a single AI Engine compute core performs a matrix-matrix multiplication. The matrices use the `bfloat16` data type, and the dimensions are set (by default) to `M`×`K`×`N` = `128`×`128`×`128`. The kernel operates on chunks of `64`×`32`×`64` (`m`×`k`×`n`), so it is invoked multiple times to complete the full result.
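+
+As a rough illustration of this chunking, the short Python sketch below counts how many kernel invocations cover the full problem at the default sizes (the variable names are ours, chosen for clarity):
+
+```python
+# Tile-count arithmetic for the default single-core problem size.
+M = K = N = 128        # full matrices: A is M x K, B is K x N, C is M x N
+m, k, n = 64, 32, 64   # chunk sizes processed per kernel invocation
+
+output_tiles = (M // m) * (N // n)      # m x n tiles of C computed by the core: 4
+steps_per_tile = K // k                 # accumulation steps per output tile: 4
+print(output_tiles * steps_per_tile)    # 16 kernel invocations in total
+```
+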
-You need c++23 for bfloat16_t support. It can be found in g++-13: https://lindevs.com/install-g-on-ubuntu
+> This design is a simplification of the [whole-array design](../whole_array/README.md). Instead of utilizing all available AI Engine compute cores in parallel, this design performs all computation on a single core. To understand this design better, please refer to the discussion of the whole-array design and the differences outlined below.
+
+## Differences from the [Whole-Array Design](../whole_array/README.md)
+
+* This design supports tracing; see [below](#tracing).
+* Only a single core performs computations. As such, we only need a single ObjectFIFO for each of the transfers between the levels (shim → memory, memory → compute, and back). These ObjectFIFOs are named `inA`, `inB` and `outC` (between shim and memory tile), and `memA`, `memB` and `memC` (between memory tile and compute core).
+
+## Building and Running the Design
+
+You need C++23 for `bfloat16_t` support. It can be found in g++-13: https://lindevs.com/install-g-on-ubuntu
To compile design:
```
diff --git a/programming_examples/basic/matrix_multiplication/whole_array/README.md b/programming_examples/basic/matrix_multiplication/whole_array/README.md
index bdf9e71778..0e8e7d20c8 100644
--- a/programming_examples/basic/matrix_multiplication/whole_array/README.md
+++ b/programming_examples/basic/matrix_multiplication/whole_array/README.md
@@ -1,26 +1,226 @@
-
-# Matrix Multiplication Array
+# Matrix Multiplication - Whole Array Design
-Multiple tiles in a single column perform a `matrix * matrix` multiply on bfloat16 data type where `MxKxN` is `256x256x256`. The kernel computes `64x64x64 (MxKxN)`, which is invoked multiple times to complete the full matmul compute.
+The code in this directory showcases an example matrix multiplication design for a Ryzen AI device with an NPU (Neural Processing Unit). The NPU consists of an array of compute cores, called AI Engines (AIEs). The example design configures each of those compute cores to perform multiplications of distinct sub-matrices in parallel.
-You need c++23 for bfloat16_t support. It can be found in g++-13: https://lindevs.com/install-g-on-ubuntu
+At a high level, the code does the following (in order):
+
+1. [**Defining Matrix Dimensions and Data Types:**](#1-defining-matrix-dimensions-and-data-types) We first specify the dimensions `M`, `K`, `N` for the input matrices `A` (`M`×`K`) and `B` (`K`×`N`), and the output matrix `C` (`M`×`N`), as well as their data type. To enable efficient computation, our design will split large input matrices into smaller sub-matrix blocks on two levels; we thus also define the sizes of those sub-matrices. At the first level, the constants `m`, `k`, and `n` define the size of the submatrices processed by each AIE core. At the second level, we further subdivide using smaller sizes `r`, `s` and `t`; these are the sizes required by the vector computation intrinsics of the AIEs.
+
+1. [**Constructing an AIE Array Configuration:**](#2-constructing-an-aie-array-configuration) The NPU hardware is comprised of components laid out in a two-dimensional grid of rows and columns. Based on the matrix sizes and tiling factors, we choose the number of rows, columns, and total number of compute cores of the AIE device that the design should utilize. We then configure the AI Engine array, memory tiles, and shim tiles.
+
+1. [**Defining Data Movement Inside the NPU:**](#3-defining-data-movement-inside-the-npu) ObjectFIFOs are a data movement abstraction for buffering data and synchronizing between AIE components. We configure ObjectFIFOs for `A`, `B` and `C` to transfer and buffer data between AIE components in chunks of the previously defined sizes (`m`×`k`, `k`×`n` and `m`×`n`, respectively).
+
+1. [**Defining Core Computations:**](#4-defining-core-computations) The `core_body()` function contains the code that will be loaded onto each AIE core. This code describes the matrix multiplication using the input submatrices `a` and `b` acquired through the ObjectFIFOs. The results are accumulated in the output submatrix `c`.
+
+1. [**Defining External Data Transfer Sequences:**](#5-defining-external-data-transfer-sequences) The `sequence()` function sets up matrix data movement from the host into the AIE compute cores, and back to the host after computation. It initializes Data Movement Accelerator (DMA) transfers, sets memory access patterns, and performs synchronization.
+
+1. **Generating the Design:** The `my_matmul()` function triggers the code generation process and represents the main entry point of the design. The final print statement outputs the MLIR representation of the AIE array configuration.
+
+In summary, this design leverages an AI Engine accelerator to accomplish matrix multiplication efficiently by breaking large matrices into smaller, manageable submatrices. The design uses parallelism, pipelining, and efficient data movement strategies to minimize computation time on the AI Engine array.
+
+## Building and Running the Design
+
+As configured, this design will set up an array of AIEs to perform matrix-matrix multiplication on a `bfloat16` data type, with the problem size `M`×`K`×`N` set to `512`×`512`×`512`. The per-core tile size is configured as `64`×`64` for `a`, `b`, and `c`.
+
+You will need C++23 for `bfloat16_t` support in the `test.cpp`, which can be found in `g++-13`: [https://lindevs.com/install-g-on-ubuntu](https://lindevs.com/install-g-on-ubuntu)
+
+To compile the design:
-To compile design:
```
make
make matrixMultiplication.exe
```
To run the design:
+
```
make run
```
+
+## Detailed Design Explanation
+
+The configuration of the AI Engine array is described in the `aie2.py` file.
+It is linked against a compute microkernel which is implemented in C++.
+The following sections elaborate on each of the steps outlined in the high-level summary above.
+
+> Note: The term "tile" has two distinct meanings in the following discussion that should be distinguishable from context:
+> * AIE tiles are components of the hardware, specifically Shim, Memory and Compute tiles.
+> * Matrix tiles are smaller sub-matrices of the larger input and output matrices.
+
+### 1. Defining Matrix Dimensions and Data Types
+
+In the first section of the code in `aie2.py`, we define the following constants:
+
+| Matrix | Size | Submatrix Size (1.) | Vector Intrinsic Size (2.) |
+|---------------|-----------|---------------------|-----------------------|
+| `A` (Input) | `M` × `K` | `m` × `k` | `r` × `s` |
+| `B` (Input) | `K` × `N` | `k` × `n` | `s` × `t` |
+| `C` (Output) | `M` × `N` | `m` × `n` | `r` × `t` |
+
+
+The input and output matrix sizes are given by the user. We subdivide the input matrices `A`, `B` and the output matrix `C` into smaller, manageable "tiles" (or submatrices) at two levels:
+
+1. **Tiling to Compute Core Submatrix Chunks:** The input and output matrices stream to/from the AIE compute cores in chunks of size `m`×`k`, `k`×`n` and `m`×`n`. Tiling into these chunks allows the computation cores to work on distinct sub-sections of the input matrices in parallel, which improves performance. It also reduces on-chip memory requirements. The final result is re-assembled from the sub-matrix results of all cores. (A short sketch of these tiling constants follows at the end of this section.)
+
+ > This tiling occurs in the `sequence()` function describing the host-to-memory-tile transfer.
+We describe it further below, in section *"5. Defining External Data Transfer Sequences"*.
+
+1. **Tiling to Vector Intrinsic Size:** The AIE compute cores calculate the matrix multiplication using efficient "multiply-accumulate" vector intrinsic instructions (`MAC` instructions). These hardware instructions process very small blocks of the matrix: size `r`×`s` blocks of `A` and size `s`×`t` blocks of `B`, producing an output of size `r`×`t` (`C`).
+ > This tiling occurs in the inner-AIE data movements. We describe it in the section *"3. Defining Data Movement Inside the NPU"*.
+
+ > The vector intrinsic size is dictated by the hardware and the compute microkernel.
+
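+As a quick reference, the sketch below spells out one consistent set of these constants (the `r`, `s`, `t` values of `4`, `8`, `4` are an illustrative choice matching the `matmul_vectorized_4x8x4_bf16_bf16` kernel mentioned at the end of this README) and the divisibility constraints that both tiling levels must satisfy:
+
+```python
+M = K = N = 512        # full matrix sizes: A is M x K, B is K x N, C is M x N
+m = k = n = 64         # sub-matrix chunk processed by each AIE core
+r, s, t = 4, 8, 4      # vector-intrinsic block sizes (illustrative bf16 choice)
+
+# Both tiling levels must divide the problem evenly for this design.
+assert M % m == 0 and K % k == 0 and N % n == 0
+assert m % r == 0 and k % s == 0 and n % t == 0
+
+print((M // m) * (N // n))   # 64 m x n output tiles, each accumulated over K // k = 8 steps
+```
+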
+### 2. Constructing an AIE Array Configuration
+
+In the next section of the code, we obtain handles to the components of the hardware.
+
+The Neural Processing Unit (NPU) is physically structured as an array of 6 rows and 4 columns. The lower two rows contain so-called "shim" and "memory" tiles, and the upper four rows are made up of AIE compute cores (AIEs):
+
+1. **Shim tiles:** A single row of shim tiles on the bottom of the core array is responsible for interfacing with the external host for data movement. In our code, they are represented by a list: `[_0_ShimTile, _1_ShimTile, _2_ShimTile, _3_ShimTile]`
+
+1. **Memory tiles:** A row of memory tiles with scratchpad memory is located above the shim tiles. These memory cores are responsible for staging and distributing the data during processing. In our code, they are represented by a list: `[_0_MemTile, _1_MemTile, _2_MemTile, _3_MemTile]`
+
+1. **Compute tiles:** In each of the four columns, there are 4 rows of computation tiles above the memory tiles. This makes for a total of 16 computation cores, which in this design are configured to perform the matrix multiplication. In our code, they are represented by a list of lists, `cores`, showing their two-dimensional arrangement.
+
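+The (purely conceptual) snippet below mirrors this arrangement as plain Python data; the string labels are placeholders for the shim, memory, and compute tile handles that `aie2.py` obtains from the device bindings.
+
+```python
+n_cols, n_core_rows = 4, 4
+
+shim_tiles = [f"ShimTile(col={c}, row=0)" for c in range(n_cols)]
+mem_tiles  = [f"MemTile(col={c}, row=1)"  for c in range(n_cols)]
+# `cores` is a list of lists mirroring the 2D arrangement of compute tiles.
+cores = [[f"ComputeTile(col={c}, row={r})" for r in range(2, 2 + n_core_rows)]
+         for c in range(n_cols)]
+print(len(cores) * len(cores[0]))   # 16 compute cores in total
+```
+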
+### 3. Defining Data Movement Inside the NPU
+
+We use "ObjectFIFOs" to abstractly describe the data movement and synchronization between AIE Compute, Memory and Shim tiles. ObjectFIFOs present an interface that behaves like a First-In-First-Out queue. To achieve this, they take care of DMA configuration, acquiring and releasing locks, and managing buffers.
+
+There are several ObjectFIFOs used in this design, which are created using the `object_fifo()` Python binding:
+
+1. Host → Memory Tiles: `inA_fifos`, `inB_fifos` move the input matrices from the external host, via the shim tiles in row 0, to the memory tiles in row 1.
+
+2. Memory Tiles → Compute Tiles: `memA_fifos`, `memB_fifos` move input data from the memory tiles in row 1 to the compute tiles in rows 2-5.
+
+3. Compute Tiles → Memory Tiles → Host: Analogously, `memC_fifos` and `OutC_fifos` move the output data from the compute cores to the memory tiles (`memC_fifos`) and from there to the external host via the shim tiles (`OutC_fifos`).
+
+Each of `inA_fifos`, `inB_fifos`, `OutC_fifos`, `memA_fifos`, `memB_fifos` and `memC_fifos` are Python dictionaries, containing a separate ObjectFIFO instance for each column of AIE compute cores in the array. The respective `*_names` lists contain the names of these ObjectFIFOs.
+
+Of note is the `object_fifo_link()` operation. This operation establishes a connection between the `mem*` FIFOs and the `in*` and `outC` FIFOs. By linking ObjectFIFOs, the output received at one end of the source FIFO is fed as input into the ObjectFIFO listed as the destination.
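+
+The toy Python model below is only meant to make the structure of these FIFO collections and the link semantics concrete; the dictionary keys and the `deque`-based queues are our own placeholders, not the actual `object_fifo()` objects created in `aie2.py`.
+
+```python
+from collections import deque
+
+n_cols = 4
+inA_fifos  = {f"inA_{c}":  deque() for c in range(n_cols)}   # shim tile -> memory tile
+memA_fifos = {f"memA_{c}": deque() for c in range(n_cols)}   # memory tile -> compute tile
+
+def link(src: deque, dst: deque) -> None:
+    """Conceptual object_fifo_link(): whatever the source FIFO outputs feeds the destination."""
+    while src:
+        dst.append(src.popleft())
+
+inA_fifos["inA_0"].append("one m x k tile of A")
+link(inA_fifos["inA_0"], memA_fifos["memA_0"])
+print(memA_fifos["memA_0"].popleft())   # the tile is now buffered for the compute tile
+```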
+
+#### Tiling and Data Layout Transformations
+
+We assume our data are stored in **row-major format** in the host's memory. For processing on the AIE compute cores, we need to transform the data layouts such that the above-listed *sub-matrix tiles* are laid out contiguously in AIE compute core memory. Thankfully, the AIE hardware has extensive support for transforming data at no extra cost using the DMAs as the data is received and sent. In the following, we will explain how we make use of this hardware feature to transform our data.
+
+##### Tiling to Vector Intrinsic Size
+
+The `memA_fifos` and `memB_fifos` receive sub-matrices of size `m`×`k` and `k`×`n`, respectively. The FIFOs translate those matrices from a row-major format into the `r`×`s`-sized and `s`×`t`-sized blocks required by the hardware's vector intrinsics before sending them into the compute cores' memory.
+
+For matrix A (`memA_fifos`), this transformation is expressed using the following wraps and strides as a list of tuples `(wrap, stride)`, given as arguments to the `object_fifo()` operation:
+(Note that `//` denotes integer floor-division in Python.)
+
+
+```python
+ [
+ (m // r, r * k), # Pair 1
+ (k // s, s), # Pair 2
+ (r, k), # Pair 3
+ (s, 1), # Pair 4
+ ]
+```
+
+Let us break down each component of this pattern. We do so back-to-front for ease of understanding:
+
+* Pair 4: `(s, 1)`
+ * This dimension represents the transfer of a single row of an `r`×`s`-sized tile (our target tile size after the transformation).
+ * Wrap: `s` is the length of a row of an `r`×`s`-sized block in units of 4 bytes (i32 elements).
+ * Stride: A stride of `1` retrieves contiguous elements.
+* Pair 3: `(r, k)`
+ * Together with the previous dimension, this dimension represents the transfer of a single `r`×`s`-sized tile.
+ * Wrap: `r` is the number of rows of an `r`×`s`-sized tile.
+ * Stride: `k` is the stride between the first elements of consecutive rows along the `m` dimension, i.e. adding this stride to a memory address points to the element in the matrix directly below the original address.
+* Pair 2: `(k // s, s)`
+ * Together with the previous dimensions, this dimension represents the transfer of one row of `r`×`s`-sized tiles, i.e. the first `r`×`k` elements of the input array (the first `r` full rows of the sub-matrix).
+ * Wrap: `k // s` is the number of `r`×`s`-sized tiles along the `k` (columns) dimension.
+ * Stride: `s` is the stride between starting elements of consecutive blocks along the `k` dimension, i.e. adding this stride to a memory address points to the same element in the `r`×`s`-sized block directly to the right of the block of the original address.
+* Pair 1: `(m // r, r * k)`
+ * Together with the previous dimensions, this dimension transfers the entire `m`×`k`-sized matrix as blocks of `r`×`s`-sized tiles.
+ * Wrap: `m // r` is the number of `r`×`s`-sized blocks along the `m` (rows) dimension.
+ * Stride: `r * k` is the stride between starting elements of consecutive blocks along the `m` dimension, i.e. adding this stride to a memory address points to the same element in the `r`×`s`-sized block directly below the block of the original address.
+
+> You can use this [data layout visualizer](http://andreroesti.com/data-layout-viz/data_layout.html) to better understand data layout transformations expressed as wraps and strides.
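+
+To make the pattern above concrete, the following self-contained Python sketch applies a list of `(wrap, stride)` pairs to a small row-major matrix and checks that the result is the expected sequence of `r`×`s` tiles. It works in units of elements (rather than 4-byte words) purely for readability, and the helper function is our own illustration, not part of the design.
+
+```python
+import numpy as np
+
+def apply_wraps_and_strides(flat, pattern):
+    """Yield elements of `flat` following [(wrap, stride), ...], outermost dimension first."""
+    def walk(offset, dims):
+        if not dims:
+            yield flat[offset]
+            return
+        (wrap, stride), rest = dims[0], dims[1:]
+        for i in range(wrap):
+            yield from walk(offset + i * stride, rest)
+    yield from walk(0, pattern)
+
+m, k, r, s = 8, 8, 4, 2                       # small illustrative sizes
+A = np.arange(m * k)                          # a row-major m x k matrix, flattened
+pattern = [(m // r, r * k), (k // s, s), (r, k), (s, 1)]
+
+transformed = np.fromiter(apply_wraps_and_strides(A, pattern), dtype=A.dtype)
+# The same reshuffle in NumPy terms: r x s tiles, traversed tile-row by tile-row,
+# with each tile stored row-major.
+expected = A.reshape(m // r, r, k // s, s).swapaxes(1, 2).reshape(-1)
+assert np.array_equal(transformed, expected)
+```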
+
+The matrix B transformation (`memB_fifos`) is equivalent after substituting the correct dimensions (`k`×`n` instead of `m`×`k` and `s`×`t` instead of `r`×`s`).
+
+Analogously, the output matrix C is transformed from `r`×`t`-sized blocks back into a row-major matrix of contiguous rows with size `m`×`n`.
+
+
+### 4. Defining Core Computations
+
+The `core_body()` function defines the computation that each core will perform.
+We define a `core_body()` function for each compute core `i`, inside of which we do the following:
+
+ * We acquire a slot in the output buffer into which we will produce the next `m`×`n`-tile of output in `memC_fifos`. We name the acquired buffer `elem_out`.
+ * We zero out the acquired output slot using `call(zero, [elem_out])`, since it may contain stale results.
+ * `K // k` times, we:
+ * We acquire the next `m`×`k`-tile of `A`, and the next `k`×`n` tile of `B` from ObjectFIFOs `memA_fifos[i]` and `memB_fifos[i]`, respectively, as `elem_in_a` and `elem_in_b`.
+ * We call our compute microkernel (implemented in C++ and linked against this design) to perform the matrix multiplication calculation, with `call(matmul, [elem_in_a, elem_in_b, elem_out])`.
+ The result is summed element-wise in `elem_out` together with previous iterations.
+ * We release `elem_in_a` and `elem_in_b`.
+* After the complete result for the current `m`×`n`-block has been calculated, we can release `elem_out`. A short NumPy sketch of this accumulation loop follows below.
+
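+The sketch below mirrors this loop on the host in NumPy (`float32` standing in for `bfloat16`); the lists `a_tiles` and `b_tiles` play the role of the tiles acquired from `memA_fifos[i]` and `memB_fifos[i]`.
+
+```python
+import numpy as np
+
+m, k, n, K = 64, 64, 64, 512
+a_tiles = [np.random.rand(m, k).astype(np.float32) for _ in range(K // k)]
+b_tiles = [np.random.rand(k, n).astype(np.float32) for _ in range(K // k)]
+
+elem_out = np.empty((m, n), dtype=np.float32)
+elem_out[:] = 0.0                        # the "zero" kernel clears stale results
+for j in range(K // k):                  # K // k acquire / compute / release steps
+    elem_out += a_tiles[j] @ b_tiles[j]  # the "matmul" kernel accumulates into elem_out
+# elem_out now holds one complete m x n tile of C and can be released.
+```
+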
+### 5. Defining External Data Transfer Sequences
+
+The function signature of the `sequence()` function lists as its arguments all the external host buffers that we wish to read from or write to via the AI Engine's shim tiles. The body of this function describes how these buffers are transferred from and to the host, including tiling the input matrices into `m`×`k`- and `k`×`n`-sized sub-matrices, and combining the `m`×`n`-sized output tiles into the larger `M`×`N` output matrix buffer.
+
+* The `tile_row_block` variable segments the `M` dimension (the rows of `A`) into smaller chunks, each containing `rows_per_block` tile rows. This is done so the buffer descriptors (BDs) can be reused for efficient DMA transfers.
+* For each column `i`:
+ * For each `tile_row` in the current row block:
+ * The DMA transfer function `npu_dma_memcpy_nd` loads a segment of matrix A and matrix B data (submatrix a, submatrix b) from the host into the corresponding `inA_fifos` and `inB_fifos` for the respective column, maintaining the appropriate strides and offsets.
+ * Analogously to the data layout transformations described [further above](#tiling-and-data-layout-transformations) to translate a `m`×`k` matrix into blocks of `r`×`s`-submatrices, this transfer translates the input `M`×`K` and `K`×`N` matrices into submatrices of size `m`×`k` and `k`×`n`.
+ > Note that data layout transformations in the `npu_dma_memcpy_nd` operation are expressed in units of 4 bytes. This is why you will see all strides and the lowest-dimension length multiplied by a factor of `word_size_in` or `word_size_out` (to get the size in bytes) and then divided by four (to get the size in units of 4 bytes). This discrepancy will be streamlined in future versions.
+ * The DMA transfer function `npu_dma_memcpy_nd` sends a segment of matrix C data (submatrix c) from the corresponding `outC_fifos` for the respective column, back to the host while maintaining the appropriate strides and offsets.
+ * After completing DMA transfers for each column, `npu_sync` is used to synchronize their completion.
+
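+As a small worked example of the unit conversion mentioned in the note above (illustrative numbers only, not the literal arguments used in `aie2.py`):
+
+```python
+# Convert element counts into the 4-byte units expected by npu_dma_memcpy_nd.
+M = K = 512
+m = k = 64
+word_size_in = 2                               # bytes per bfloat16 element
+
+tile_row_len_in_words = k * word_size_in // 4  # 32 four-byte words per m x k tile row
+row_stride_in_words   = K * word_size_in // 4  # 256 words between consecutive rows of A
+print(tile_row_len_in_words, row_stride_in_words)
+```
+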
+## Compute Microkernels
+
+This C++ code demonstrates how to implement matrix multiplication for different data types and operations using the AIE (AI Engine) API and templates. The AI Engine is designed for efficient computation and data movement, especially for matrix-multiplication-intensive machine learning workloads. The code has the following main components:
+
+1. `matmul_scalar`: A scalar function that performs matrix multiplication for input matrices `a` and `b` and adds the result to matrix `c`. This function iterates through each row in matrix `a` and each column in matrix `b`, performing the multiplication of the corresponding elements and accumulating their sum to populate matrix `c`.
+
+1. `matmul_vectorized` and `matmul_vectorized_2x2`: Vectorized matrix multiplication functions for different block sizes and input/output types for the AI Engine. These functions use the AIE API for efficient vectorized matrix multiplication, with support for various input and output tensor data types (e.g., int16, bfloat16).
+
+1. `matmul_vectorized_4x4x4_i16_i16`, `matmul_vectorized_4x8x4_bf16_bf16`, and `matmul_vectorized_4x8x4_bf16_f32`: Helper functions for calling the corresponding `matmul_vectorized` functions with specific input and output types and block sizes.
+
+1. Extern "C" interface functions: These functions provide a C-compatible interface to the main matrix multiplication functions, making it easier to call these functions from other languages or environments.
+
+1. Zeroing functions: Functions like `zero_vectorized` and `zero_scalar` initialize the output matrix (`c_out`) with all zero values.
+
+This code showcases efficient performance in matrix multiplication-intensive workloads and can be adapted for other types of inputs and operations as needed.
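+
+For reference, the semantics of `matmul_scalar` described in item 1 can be written in a few lines of Python (our own rendering, not the C++ source); note that the product sum is *added* onto the existing contents of `c`:
+
+```python
+def matmul_scalar_reference(a, b, c):
+    rows, inner, cols = len(a), len(b), len(b[0])
+    for i in range(rows):
+        for j in range(cols):
+            acc = 0
+            for x in range(inner):
+                acc += a[i][x] * b[x][j]
+            c[i][j] += acc          # accumulate on top of existing values in c
+    return c
+
+print(matmul_scalar_reference([[1, 2], [3, 4]],
+                              [[5, 6], [7, 8]],
+                              [[0, 0], [0, 0]]))   # [[19, 22], [43, 50]]
+```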