Matrix Multiplication README Improvements (#1421)
Co-authored-by: André Rösti <an.roesti@gmail.com>
jgmelber and andrej authored Apr 25, 2024
1 parent fab1d72 commit 13a9bbe
Showing 4 changed files with 250 additions and 12 deletions.
19 changes: 19 additions & 0 deletions programming_examples/basic/matrix_multiplication/README.md
<!---//===- README.md --------------------------*- Markdown -*-===//
//
// This file is licensed under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
// Copyright (C) 2024, Advanced Micro Devices, Inc.
//
//===----------------------------------------------------------------------===//-->

# Matrix Multiplication

Subdirectories in this directory contain example designs that implement matrix multiplication on the AI-Engine-enabled AMD Neural Processing Unit (NPU).

> These designs all follow largely the same structure and rely on the same basic concepts. The [whole-array design](whole_array/README.md) contains a representative in-depth explanation of this structure and these concepts. In the explanations for the other designs, we rely on the whole-array design as a base and only highlight the differences.

* [`single_core`](single_core) - This design performs matrix-matrix multiplication on a single AI Engine core.
* [`whole_array`](whole_array) - This design evolves `single_core` by splitting the computation and parallelizing it; it utilizes all available AI Engine cores simultaneously.
* [`matrix_vector`](matrix_vector) - This design is a specialization to the matrix-vector-multiplication case, which poses unique challenges due to lower computation density. *Work in progress.*
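
All of these designs compute the same mathematical result and differ only in how the work is tiled and mapped onto the AI Engine array. As a plain-Python point of reference (an illustration, not code from this repository):

```python
import numpy as np

def matmul_reference(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """The computation all designs below implement: C = A @ B.
    Inputs are bfloat16 on hardware; float32 stands in for it here."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=np.float32)
    for i in range(M):
        for j in range(N):
            for k in range(K):
                C[i, j] += A[i, k] * B[k, j]
    return C
```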

programming_examples/basic/matrix_multiplication/matrix_vector/README.md

# Matrix-Vector Multiplication

In this design, one or more AI Engine compute cores (spread across hardware columns; the number is configurable as `n_cores`) perform a matrix-*vector* multiplication. We use the `bfloat16` data type, and the dimensions of the `A` matrix `M`&times;`K` are set to `288`&times;`288` by default (`N`, the number of columns of `B`, is always `1`, since `B` is a vector). The kernel itself consumes chunks of `32`&times;`32` (`M`&times;`K`) of `A`, so it is invoked multiple times to complete the full result.
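
As an illustration of this decomposition (plain Python standing in for the vectorized AIE microkernel; the helper names here are ours, not the design's):

```python
import numpy as np

M = K = 288   # whole-problem dimensions; N is implicitly 1
m = k = 32    # portion handled by one kernel invocation

def kernel_32x32(A_tile, b_chunk, c_chunk):
    """Stand-in for matvec_vectorized: one 32x32-by-32 multiply-accumulate."""
    c_chunk += A_tile @ b_chunk   # updates the output chunk in place

def matvec(A, b):
    c = np.zeros(M, dtype=np.float32)
    for i in range(0, M, m):       # each 32-row band of A yields one chunk of c
        for j in range(0, K, k):   # K/k = 9 accumulation steps per chunk
            kernel_32x32(A[i:i+m, j:j+k], b[j:j+k], c[i:i+m])
    return c
```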

> This design relies on the same basic concepts as the [whole-array matrix-matrix multiplication design](../whole_array/README.md), and it is structured very similarly to that design. For a better understanding of this design, please refer to the in-depth explanation of that design together with the differences outlined below.

## Differences from the [Whole-Array Matrix-Matrix Multiplication Design](../whole_array/README.md)

- A specialized matrix-*vector* microkernel, named `matvec_vectorized`, is used in this design, as opposed to the more general matrix-matrix microkernel (`matmul_vectorized`) used in the matrix-matrix multiplication designs.
- The data movement in this design differs as follows: an identical `32`-element chunk of the vector `B` is **broadcast** to the cores in all columns, whereas _distinct_ subsequent `32`&times;`32`-sized tiles of the `A` matrix are **distributed** to the cores. As such, each core is responsible for a distinct `32`-element chunk of the output vector `C`. These chunks are assembled (**joined**) at the shim tile level (in the `sequence()` function); see the sketch after this list.
- This design does not use all available compute cores. Instead, it uses at most one core in each hardware column; the variable `n_cores` defines the number of columns to be used. It would, however, be possible to extend this design to use all cores.
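
The following plain-Python sketch mimics that movement pattern (the round-robin assignment of `A` bands to cores is an assumption for illustration; the authoritative tile ordering is defined by the design itself):

```python
import numpy as np

M = K = 288
m = k = 32
n_cores = 4   # at most one core per hardware column

def matvec_multi_core(A, b):
    chunks = {}
    for core in range(n_cores):
        # Distinct 32-row bands of A are *distributed*, one band per step.
        for i in range(core * m, M, n_cores * m):
            acc = np.zeros(m, dtype=np.float32)
            for j in range(0, K, k):
                acc += A[i:i+m, j:j+k] @ b[j:j+k]   # same b chunk on every core (*broadcast*)
            chunks[i] = acc
    # The per-core chunks of C are *joined* into the full output vector.
    return np.concatenate([chunks[i] for i in sorted(chunks)])
```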

## Building and Running the Design

You need C++23 for `bfloat16_t` support. It can be found in g++-13: https://lindevs.com/install-g-on-ubuntu

To compile the design:
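(The command below is an assumption based on the `make` convention used across these examples; the original is truncated in this view.)

```
make
```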

programming_examples/basic/matrix_multiplication/single_core/README.md

# Matrix Multiplication - Single Core Design

In this design, a single AI Engine compute core performs a matrix-matrix multiplication. The matrices use the `bfloat16` data type, and the dimensions are set (by default) to `M`&times;`K`&times;`N` = `128`&times;`128`&times;`128`. The kernel operates on chunks of `64`&times;`32`&times;`64` (`m`&times;`k`&times;`n`), so it is invoked multiple times to complete the full result.
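
The tiling loop this amounts to, sketched in plain Python (the kernel stand-in is ours; the real kernel is vectorized AIE code):

```python
import numpy as np

M = K = N = 128          # full problem size (default)
m, k, n = 64, 32, 64     # tile handled by one kernel invocation

def kernel(A_tile, B_tile, C_tile):
    """Stand-in for matmul_vectorized: one m-by-k-by-n multiply-accumulate."""
    C_tile += A_tile @ B_tile   # updates the output tile in place

def matmul_single_core(A, B):
    C = np.zeros((M, N), dtype=np.float32)
    for i in range(0, M, m):
        for j in range(0, N, n):
            for p in range(0, K, k):   # K/k = 4 accumulation steps per tile
                kernel(A[i:i+m, p:p+k], B[p:p+k, j:j+n], C[i:i+m, j:j+n])
    return C
```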

> This design is a simplification of the [whole-array design](../whole_array/README.md). Instead of utilizing all available AI Engine compute cores in parallel, this design performs all computation on a single core. To understand this design better, please refer to the discussion of the whole-array design and the differences outlined below.

## Differences from the [Whole-Array Design](../whole_array/README.md)

* This design supports tracing; see [below](#tracing).
* Only a single core performs computations. As such, we only need a single ObjectFIFO for each of the transfers between the levels (shim &rightarrow; memory, memory &rightarrow; compute, and back). These ObjectFIFOs are named `inA`, `inB`, `outC` and `memA`, `memB` and `memC`, respectively.
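
As a conceptual model only (Python queues standing in for ObjectFIFOs; this is not the IRON API, and the depth of 2, suggesting double-buffering, is an assumption):

```python
from queue import Queue

# Stand-ins for the six ObjectFIFOs named above.
inA, inB = Queue(maxsize=2), Queue(maxsize=2)    # shim -> memory tile
memA, memB = Queue(maxsize=2), Queue(maxsize=2)  # memory tile -> compute core
memC = Queue(maxsize=2)                          # compute core -> memory tile
outC = Queue(maxsize=2)                          # memory tile -> shim

def memory_tile_step():
    """Pass one buffer of each input down a level and one result back up."""
    memA.put(inA.get())
    memB.put(inB.get())
    outC.put(memC.get())
```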

## Building and Running the Design

You need C++23 for `bfloat16_t` support. It can be found in g++-13: https://lindevs.com/install-g-on-ubuntu

To compile the design:
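(As above, the command is truncated here; `make` is assumed from the convention used across these examples.)

```
make
```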