Matrix Multiplication README Improvements (#1421)
Co-authored-by: André Rösti <an.roesti@gmail.com>
jgmelber and andrej authored Apr 25, 2024
1 parent fab1d72 commit 13a9bbe
Showing 4 changed files with 250 additions and 12 deletions.
19 changes: 19 additions & 0 deletions programming_examples/basic/matrix_multiplication/README.md
<!---//===- README.md --------------------------*- Markdown -*-===//
//
// This file is licensed under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
// Copyright (C) 2024, Advanced Micro Devices, Inc.
//
//===----------------------------------------------------------------------===//-->

# Matrix Multiplication

Subdirectories in this directory contain example designs that implement matrix multiplication on the AI-Engine-enabled AMD Neural Processing Unit (NPU).

> These designs all follow largely the same structure and rely on the same basic concepts. The [whole-array design](whole_array/README.md) contains a representative in-depth explanation of this structure and these concepts. In the explanations for the other designs, we rely on the whole-array design as a base and only highlight the differences.

* [`single_core`](single_core) - This design performs matrix-matrix multiplication on a single AI Engine core.
* [`whole_array`](whole_array) - This design evolves `single_core` by splitting the computation and parallelizing it; it utilizes all available AI Engine cores simultaneously.
* [`matrix_vector`](matrix_vector) - This design is a specialization to the matrix-vector-multiplication case, which poses unique challenges due to lower computation density. *Work in progress.*
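
All of these designs compute the same mathematical result and differ only in how the work is tiled and mapped onto the AI Engine array. As a plain-Python point of reference (an illustration, not code from this repository):

```python
import numpy as np

def matmul_reference(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """The computation all designs below implement: C = A @ B.
    Inputs are bfloat16 on hardware; float32 stands in for it here."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=np.float32)
    for i in range(M):
        for j in range(N):
            for k in range(K):
                C[i, j] += A[i, k] * B[k, j]
    return C
```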

programming_examples/basic/matrix_multiplication/matrix_vector/README.md

# Matrix-Vector Multiplication

In this design, one or more AI Engine compute cores (spread across hardware columns; the number is configurable as `n_cores`) perform a matrix-*vector* multiplication. We use the `bfloat16` data type, and the dimensions of the `A` matrix `M`&times;`K` are set to `288`&times;`288` by default (`N`, the number of columns of `B`, is always `1`, since `B` is a vector). The kernel itself consumes chunks of `32`&times;`32` (`M`&times;`K`) of `A`, so it is invoked multiple times to complete the full result.
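
As an illustration of this decomposition (plain Python standing in for the vectorized AIE microkernel; the helper names here are ours, not the design's):

```python
import numpy as np

M = K = 288   # whole-problem dimensions; N is implicitly 1
m = k = 32    # portion handled by one kernel invocation

def kernel_32x32(A_tile, b_chunk, c_chunk):
    """Stand-in for matvec_vectorized: one 32x32-by-32 multiply-accumulate."""
    c_chunk += A_tile @ b_chunk   # updates the output chunk in place

def matvec(A, b):
    c = np.zeros(M, dtype=np.float32)
    for i in range(0, M, m):       # each 32-row band of A yields one chunk of c
        for j in range(0, K, k):   # K/k = 9 accumulation steps per chunk
            kernel_32x32(A[i:i+m, j:j+k], b[j:j+k], c[i:i+m])
    return c
```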

> This design relies on the same basic concepts as the [whole-array matrix-matrix multiplication design](../whole_array/README.md), and it is structured very similarly to that design. For a better understanding of this design, please refer to the in-depth explanation of that design together with the differences outlined below.

## Differences from the [Whole-Array Matrix-Matrix Multiplication Design](../whole_array/README.md)

- A specialized matrix-*vector* microkernel, named `matvec_vectorized`, is used in this design, as opposed to the more general matrix-matrix microkernel (`matmul_vectorized`) used in the matrix-matrix multiplication designs.
- The data movement in this design differs as follows: an identical `32`-element chunk of the vector `B` is **broadcast** to the cores in all columns, whereas _distinct_ subsequent `32`&times;`32`-sized tiles of the `A` matrix are **distributed** to the cores. As such, each core is responsible for a distinct `32`-element chunk of the output vector `C`. These chunks are assembled (**joined**) at the shim tile level (in the `sequence()` function); see the sketch after this list.
- This design does not use all available compute cores. Instead, it uses at most one core in each hardware column; the variable `n_cores` defines the number of columns to be used. It would, however, be possible to extend this design to use all cores.
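
The following plain-Python sketch mimics that movement pattern (the round-robin assignment of `A` bands to cores is an assumption for illustration; the authoritative tile ordering is defined by the design itself):

```python
import numpy as np

M = K = 288
m = k = 32
n_cores = 4   # at most one core per hardware column

def matvec_multi_core(A, b):
    chunks = {}
    for core in range(n_cores):
        # Distinct 32-row bands of A are *distributed*, one band per step.
        for i in range(core * m, M, n_cores * m):
            acc = np.zeros(m, dtype=np.float32)
            for j in range(0, K, k):
                acc += A[i:i+m, j:j+k] @ b[j:j+k]   # same b chunk on every core (*broadcast*)
            chunks[i] = acc
    # The per-core chunks of C are *joined* into the full output vector.
    return np.concatenate([chunks[i] for i in sorted(chunks)])
```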

## Building and Running the Design

You need C++23 for `bfloat16_t` support. It can be found in g++-13: https://lindevs.com/install-g-on-ubuntu

To compile the design:
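(The command below is an assumption based on the `make` convention used across these examples; the original is truncated in this view.)

```
make
```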

programming_examples/basic/matrix_multiplication/single_core/README.md

# Matrix Multiplication - Single Core Design

In this design, a single AI Engine compute core performs a matrix-matrix multiplication. The matrices use the `bfloat16` data type, and the dimensions are set (by default) to `M`&times;`K`&times;`N` = `128`&times;`128`&times;`128`. The kernel operates on chunks of `64`&times;`32`&times;`64` (`m`&times;`k`&times;`n`), so it is invoked multiple times to complete the full result.
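
The tiling loop this amounts to, sketched in plain Python (the kernel stand-in is ours; the real kernel is vectorized AIE code):

```python
import numpy as np

M = K = N = 128          # full problem size (default)
m, k, n = 64, 32, 64     # tile handled by one kernel invocation

def kernel(A_tile, B_tile, C_tile):
    """Stand-in for matmul_vectorized: one m-by-k-by-n multiply-accumulate."""
    C_tile += A_tile @ B_tile   # updates the output tile in place

def matmul_single_core(A, B):
    C = np.zeros((M, N), dtype=np.float32)
    for i in range(0, M, m):
        for j in range(0, N, n):
            for p in range(0, K, k):   # K/k = 4 accumulation steps per tile
                kernel(A[i:i+m, p:p+k], B[p:p+k, j:j+n], C[i:i+m, j:j+n])
    return C
```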

> This design is a simplification of the [whole-array design](../whole_array/README.md). Instead of utilizing all available AI Engine compute cores in parallel, this design performs all computation on a single core. To understand this design better, please refer to the discussion of the whole-array design and the differences outlined below.

## Differences from the [Whole-Array Design](../whole_array/README.md)

* This design supports tracing; see [below](#tracing).
* Only a single core performs computations. As such, we only need a single ObjectFIFO for each of the transfers between the levels (shim &rightarrow; memory, memory &rightarrow; compute, and back). These ObjectFIFOs are named `inA`, `inB`, `outC` and `memA`, `memB` and `memC`, respectively.
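
As a conceptual model only (Python queues standing in for ObjectFIFOs; this is not the IRON API, and the depth of 2, suggesting double-buffering, is an assumption):

```python
from queue import Queue

# Stand-ins for the six ObjectFIFOs named above.
inA, inB = Queue(maxsize=2), Queue(maxsize=2)    # shim -> memory tile
memA, memB = Queue(maxsize=2), Queue(maxsize=2)  # memory tile -> compute core
memC = Queue(maxsize=2)                          # compute core -> memory tile
outC = Queue(maxsize=2)                          # memory tile -> shim

def memory_tile_step():
    """Pass one buffer of each input down a level and one result back up."""
    memA.put(inA.get())
    memB.put(inB.get())
    outC.put(memC.get())
```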

## Building and Running the Design

You need C++23 for `bfloat16_t` support. It can be found in g++-13: https://lindevs.com/install-g-on-ubuntu

To compile the design:
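(As above, the command is truncated here; `make` is assumed from the convention used across these examples.)

```
make
```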