Showing 12 changed files with 2,681 additions and 15 deletions.
@@ -0,0 +1,28 @@
//===- trace_packet_routing.mlir ------------------------------------------------*- MLIR -*-===//
//
// Copyright (C) 2024, Advanced Micro Devices, Inc.
// SPDX-License-Identifier: MIT
//
//===----------------------------------------------------------------------===//
// REQUIRES: ryzen_ai, chess

// RUN: aie-opt --aie-create-packet-flows %s | FileCheck %s
// CHECK-LABEL: module @trace_packet_routing {

module @trace_packet_routing {
  aie.device(npu1_4col) {
    %tile_0_0 = aie.tile(0, 0)
    %tile_1_0 = aie.tile(1, 0)
    %tile_0_2 = aie.tile(0, 2)
    %tile_0_3 = aie.tile(0, 3)

    aie.packet_flow(0) {
      aie.packet_source<%tile_0_2, Trace : 0> // core trace
      aie.packet_dest<%tile_0_0, DMA : 1>
    } {keep_pkt_header = true}
    aie.packet_flow(1) {
      aie.packet_source<%tile_0_3, Trace : 0> // core trace
      aie.packet_dest<%tile_1_0, DMA : 1>
    } {keep_pkt_header = true}
  }
}

38 changes: 38 additions & 0 deletions
test/npu-xrt/matrix_multiplication_using_cascade/README.md
@@ -0,0 +1,38 @@
<!---//===- README.md --------------------------*- Markdown -*-===//
//
// This file is licensed under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
// Copyright (C) 2024, Advanced Micro Devices, Inc.
//
//===----------------------------------------------------------------------===//-->

## MM Cascade Design Example
This example multiplies two 16x16 matrices of `i32` elements, i.e. (16 x 16) * (16 x 16). Four versions are compared to examine whether the K dimension can be distributed across multiple cores.

### Plainx1 Version<br>
Generated by the IREE end-to-end flow; uses a single core.

### Plainx4 Version<br>
Uses four cores in an output-stationary arrangement.

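
As a rough illustration of what "output stationary" can mean here, the sketch below splits the output matrix across four simulated cores, each of which keeps its own output tile in place while iterating over the full K dimension. The row-wise split and the names `output_stationary` and `matmul_reference` are assumptions made for illustration; the README does not say how the output is actually tiled.

```python
# Hypothetical model, not the design's actual code: an output-stationary split
# of a 16x16x16 integer matmul over four simulated cores.
M = N = K = 16
NUM_CORES = 4


def matmul_reference(a, b):
    """Plain single-core matmul used as the ground truth."""
    return [[sum(a[i][k] * b[k][j] for k in range(K)) for j in range(N)]
            for i in range(M)]


def output_stationary(a, b):
    """Each simulated core owns M / NUM_CORES rows of C (an assumed tiling)."""
    c = [[0] * N for _ in range(M)]
    rows_per_core = M // NUM_CORES
    for core in range(NUM_CORES):
        row_start = core * rows_per_core
        for i in range(row_start, row_start + rows_per_core):
            for j in range(N):
                acc = 0  # the accumulator stays on this core for all of K
                for k in range(K):
                    acc += a[i][k] * b[k][j]
                c[i][j] = acc
    return c


a = [[i + k for k in range(K)] for i in range(M)]
b = [[k - j for j in range(N)] for k in range(K)]
assert output_stationary(a, b) == matmul_reference(a, b)
```

In this model no partial sums cross core boundaries, and each core only initializes its own quarter of the output.
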
### Bufferx4 Version<br>
Four cores are chained horizontally, and the intermediate accumulations are passed between them through shared buffers implemented as ObjectFIFOs.

### Cascadex4 Version<br>
Also uses four cores, but the intermediate accumulations are communicated through the cascade port.

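
The chained versions distribute the K dimension rather than the output. A minimal sketch of that idea, assuming each of the four cores handles a contiguous quarter of K and forwards the full M x N partial sum to the next core, is shown below. The function names are hypothetical, and the hand-off is modeled as a plain Python value, not as an ObjectFIFO or a cascade port.

```python
# Hypothetical model of splitting K across a chain of four cores.
M = N = K = 16
NUM_CORES = 4
K_PER_CORE = K // NUM_CORES


def matmul_reference(a, b):
    """Plain single-core matmul used as the ground truth."""
    return [[sum(a[i][k] * b[k][j] for k in range(K)) for j in range(N)]
            for i in range(M)]


def core_stage(core, a, b, partial_in):
    """Add this core's K-slice contribution to the incoming partial sums."""
    k_start = core * K_PER_CORE
    partial_out = [row[:] for row in partial_in]
    for i in range(M):
        for j in range(N):
            for k in range(k_start, k_start + K_PER_CORE):
                partial_out[i][j] += a[i][k] * b[k][j]
    return partial_out


def chained_matmul(a, b):
    # The accumulation buffer is M x N regardless of how K is split.
    partial = [[0] * N for _ in range(M)]
    for core in range(NUM_CORES):
        # Hand-off to the next core (shared buffer or cascade in the real designs).
        partial = core_stage(core, a, b, partial)
    return partial


a = [[i + k for k in range(K)] for i in range(M)]
b = [[k - j for j in range(N)] for k in range(K)]
assert chained_matmul(a, b) == matmul_reference(a, b)
```

The sketch runs the stages sequentially, so it checks only that the partial sums add up to the full product; it says nothing about timing.
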
### Results<br>
Timings measured from the trace files:

| Version   | Total  | Init  | Compute |
|-----------|--------|-------|---------|
| Plainx1   | 25.6us | 7.6us | 18.0us  |
| Plainx4   | 6.7us  | 2.0us | 4.7us   |
| Bufferx4  | 32.0us | 7.6us | 24.4us  |
| Cascadex4 | 13.9us | 7.6us | 6.3us   |

The Bufferx4 version is the slowest, slower even than Plainx1, because of the frequent lock-related operations on the shared buffers.

The Cascadex4 version almost halves the total latency of Plainx1 (13.9us vs. 25.6us) while using four times as many cores. The performance gain is constrained by the initialization time of the accumulation buffer, which depends only on M x N and therefore stays at 7.6us.
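
The ratios behind these observations can be recomputed from the table. The snippet below is a small sketch that only does arithmetic on the reported numbers; the dictionary layout and the helper name `speedup` are illustrative choices, not part of the design.

```python
# Timings copied from the results table, in microseconds.
timings_us = {
    "Plainx1":   {"total": 25.6, "init": 7.6, "compute": 18.0},
    "Plainx4":   {"total": 6.7,  "init": 2.0, "compute": 4.7},
    "Bufferx4":  {"total": 32.0, "init": 7.6, "compute": 24.4},
    "Cascadex4": {"total": 13.9, "init": 7.6, "compute": 6.3},
}

baseline = timings_us["Plainx1"]


def speedup(version, phase):
    """Speedup of `version` over Plainx1 for the given phase (hypothetical helper)."""
    return baseline[phase] / timings_us[version][phase]


for version, t in timings_us.items():
    # Each row of the table satisfies total = init + compute.
    assert abs(t["init"] + t["compute"] - t["total"]) < 1e-6
    init_share = t["init"] / t["total"]
    print(f"{version:10s} total x{speedup(version, 'total'):.2f} "
          f"compute x{speedup(version, 'compute'):.2f} "
          f"init share {init_share:.0%}")
```

For Cascadex4 this gives about 1.84x on total latency but about 2.86x on the compute phase alone, with initialization accounting for more than half of the total time.
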