Streaming Interface of HLS modules
Each HLS module is implemented in one or more versions, currently stored in the blas/ folder.
Each module receives and produces data using FIFO buffers (Intel channels). The interface (i.e. how data is received and produced, which tiling schema is supported, etc.) is described in the header of the corresponding implementation file.
To favor composability, level-2 and level-3 HLS modules are provided in different versions, each capturing a different tiling schema or set of input parameters.
The following sections report the main details of the HLS module interfaces.
Please Note: this page is a Work In Progress
Description: ASUM takes the sum of the absolute values.
Input: Data is received from the input stream CHANNEL_VECTOR_X. Data elements must be streamed with a padding equal to W, the vectorization width used in the module. Padding data must be set to zero.
Output: The result is streamed in the output channel CHANNEL_OUT at the end of the computation.
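The padding rule can be illustrated with a small host-side model in which the input channel is represented by a plain Python list. This is only a sketch: it assumes that "a padding equal to W" means the stream length is rounded up to a multiple of W with zeros, and the names pad_to_width and asum_stream are hypothetical, not part of the library.

```python
def pad_to_width(values, width):
    """Return a copy of `values` zero-padded to a multiple of `width`.

    Assumption: this is what the padding rule for W requires.
    """
    remainder = len(values) % width
    if remainder == 0:
        return list(values)
    return list(values) + [0.0] * (width - remainder)


def asum_stream(stream, width):
    """Model of ASUM: consume `width` elements per step, return one result at the end."""
    assert len(stream) % width == 0, "stream must be padded to a multiple of width"
    total = 0.0
    for start in range(0, len(stream), width):
        chunk = stream[start:start + width]
        total += sum(abs(v) for v in chunk)  # zero padding adds nothing to the sum
    return total


x = [1.0, -2.0, 3.0, -4.0, 5.0]
padded = pad_to_width(x, 4)      # 5 elements become 8 when W = 4
result = asum_stream(padded, 4)
```

Because the padding elements are zero, they do not change the sum of absolute values, which is why the module can always read W elements per step.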
Description: AXPY computes a constant times a vector plus a vector.
Input: Data is received from two input streams, CHANNEL_VECTOR_X and CHANNEL_VECTOR_Y. Data elements must be streamed with a padding equal to W, the vectorization width used in the module. Padding data must be set to zero.
Output: The result is streamed in the output channel CHANNEL_OUT, W elements at a time.
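A minimal model of this behaviour, assuming both input streams are already zero-padded to a multiple of W, is sketched below. The generator axpy_stream is a hypothetical name used only for illustration; each yielded list stands for one group of W results written to the output channel.

```python
def axpy_stream(alpha, x_stream, y_stream, width):
    """Yield alpha*x + y in chunks of `width` elements, one chunk per step."""
    assert len(x_stream) == len(y_stream), "both inputs must have the same padded length"
    assert len(x_stream) % width == 0, "inputs must be padded to a multiple of width"
    for start in range(0, len(x_stream), width):
        x_chunk = x_stream[start:start + width]
        y_chunk = y_stream[start:start + width]
        yield [alpha * xv + yv for xv, yv in zip(x_chunk, y_chunk)]


# Three real elements, padded with one zero so that the length is a multiple of W = 2.
chunks = list(axpy_stream(2.0, [1.0, 2.0, 3.0, 0.0], [10.0, 20.0, 30.0, 0.0], 2))
```

Note that the padding positions also appear in the output, so in this model the consumer is expected to discard them.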
Description: DOT performs the dot product of two vectors.
Input: Data is received from two input streams, CHANNEL_VECTOR_X and CHANNEL_VECTOR_Y. Data elements must be streamed with a padding equal to W, the vectorization width used in the module. Padding data must be set to zero.
Output: The result is streamed in the output channel CHANNEL_OUT at the end of the computation.
Description: GEMV performs one of the matrix-vector operations:
- y := alpha*A*x + beta*y,
- or y := alpha*A**T*x + beta*y, where A is an NxM matrix.
This module is released in two versions (v1 and v2).
Input: Data is received from three different channels (CHANNEL_VECTOR_X, CHANNEL_VECTOR_Y and CHANNEL_MATRIX_A). Input data must be padded with zeros according to the reference tile sizes (TILE_N and TILE_M).
Version 1:
- A is non-transposed: tiles are received by rows, and the elements in each tile are sent row-by-row. The input vector x (M elements) must be sent in its entirety N/TILE_N times (i.e. len_y/tile_y).
- A is transposed: tiles are received by columns, and the elements in each tile are sent column-by-column. The input vector x (N elements) must be sent in its entirety M/TILE_M times.
Version 2:
- A is transposed: tiles are received by columns, and tile elements are sent row-by-row. The input vector x (N elements) must be sent in its entirety M/TILE_M times.
- A is non-transposed: tiles are received by rows, and tile elements are sent column-by-column. The input vector x (M elements) must be sent in its entirety N/TILE_N times.
Output: The result is streamed in the output channel CHANNEL_VECTOR_OUT as soon as it is available.
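To make the streaming order concrete, the following sketch models the Version 1, non-transposed case on the host side: A is streamed in tiles by rows with row-by-row elements, and x is replayed N/TILE_N times. It assumes N and M are already multiples of the tile sizes, and all function names are hypothetical; the real kernels read from Intel channels rather than Python iterators.

```python
def stream_matrix_by_row_tiles(a, tile_n, tile_m):
    """Yield the elements of `a` tile by tile: tiles ordered by rows,
    elements inside each tile sent row-by-row."""
    n, m = len(a), len(a[0])
    for ti in range(0, n, tile_n):          # tile-rows, top to bottom
        for tj in range(0, m, tile_m):      # tiles within a tile-row
            for i in range(ti, ti + tile_n):
                for j in range(tj, tj + tile_m):
                    yield a[i][j]


def replay_vector(x, times):
    """Yield the whole vector `x`, `times` times in a row."""
    for _ in range(times):
        yield from x


def gemv_consumer(alpha, beta, a_stream, x_stream, y, n, m, tile_n, tile_m):
    """Consume both streams in arrival order and return alpha*A*x + beta*y."""
    a_it, x_it = iter(a_stream), iter(x_stream)
    out = []
    for ti in range(0, n, tile_n):
        acc = [0.0] * tile_n                # partial results for this tile-row
        for _ in range(0, m, tile_m):
            x_tile = [next(x_it) for _ in range(tile_m)]
            for i in range(tile_n):
                for j in range(tile_m):
                    acc[i] += next(a_it) * x_tile[j]
        out.extend(alpha * acc[i] + beta * y[ti + i] for i in range(tile_n))
    return out


N, M, TILE_N, TILE_M = 4, 4, 2, 2
A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
x = [1.0, 0.0, -1.0, 2.0]
y = [1.0, 1.0, 1.0, 1.0]

a_stream = list(stream_matrix_by_row_tiles(A, TILE_N, TILE_M))
x_stream = list(replay_vector(x, N // TILE_N))   # x is sent N/TILE_N times
y_out = gemv_consumer(2.0, 1.0, a_stream, x_stream, y, N, M, TILE_N, TILE_M)
```

In this model each tile-row needs all M elements of x, which is why x has to be replayed once per tile-row, while a block of TILE_N results can be emitted as soon as its tile-row has been consumed.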
Description: GER computes A := alpha*x*y**T + A, where A is an NxM matrix, x is an N-element vector, y is an M-element vector, and alpha is a scalar.
This module is released in four different versions.
Input: Data is received from three different channels (CHANNEL_VECTOR_X, CHANNEL_VECTOR_Y and CHANNEL_MATRIX_A). Input data must be padded with zeros according to the reference tile sizes (TILE_N and TILE_M).
Version 1: A is received in tiles of size TILE_N x TILE_M by rows; tile elements are sent row-by-row. The input vector y must be sent N/TILE_N times. Input vector x is read once.
Version 2: A is received in tiles of size TILE_N x TILE_M by columns; tile elements are sent column-by-column. The input vector x must be sent M/TILE_M times. Input vector y is read once.
Version 3: A is received in tiles of size TILE_N x TILE_M by columns; tile elements are sent row-by-row. The input vector x must be sent M/TILE_M times. Input vector y is read once.
Version 4: A is received in tiles of size TILE_N x TILE_M by rows; tile elements are sent column-by-column. The input vector y must be sent N/TILE_N times. Input vector x is read once.
Output: The result is streamed in the output channel CHANNEL_MATRIX_OUT, tile by tile as soon as it is available, in the same order in which the tiles of the input matrix arrive.
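The four GER versions differ only in the order of tiles and of elements inside each tile. The sketch below enumerates the (row, column) indices of A in the order each version expects them, together with the number of times x and y are sent. It assumes N and M are multiples of the tile sizes; GER_VERSIONS, ger_index_order and ger_replays are hypothetical names introduced for this illustration.

```python
# version -> (order of tiles, order of elements inside a tile)
GER_VERSIONS = {
    1: ("rows", "rows"),
    2: ("cols", "cols"),
    3: ("cols", "rows"),
    4: ("rows", "cols"),
}


def ger_index_order(version, n, m, tile_n, tile_m):
    """Yield (i, j) indices of A in the streaming order of the given version."""
    tiles_by, elems_by = GER_VERSIONS[version]
    tile_rows = range(0, n, tile_n)
    tile_cols = range(0, m, tile_m)
    if tiles_by == "rows":
        tiles = [(ti, tj) for ti in tile_rows for tj in tile_cols]
    else:
        tiles = [(ti, tj) for tj in tile_cols for ti in tile_rows]
    for ti, tj in tiles:
        if elems_by == "rows":
            for i in range(ti, ti + tile_n):
                for j in range(tj, tj + tile_m):
                    yield (i, j)
        else:
            for j in range(tj, tj + tile_m):
                for i in range(ti, ti + tile_n):
                    yield (i, j)


def ger_replays(version, n, m, tile_n, tile_m):
    """Return how many times each input vector is sent for the given version."""
    tiles_by, _ = GER_VERSIONS[version]
    if tiles_by == "rows":
        return {"x": 1, "y": n // tile_n}
    return {"x": m // tile_m, "y": 1}


order_v1 = list(ger_index_order(1, 4, 4, 2, 2))
order_v3 = list(ger_index_order(3, 4, 4, 2, 2))
replays_v1 = ger_replays(1, 4, 4, 2, 2)
```

The replay counts follow from the tile order: when tiles arrive by rows, every tile-row needs the whole of y but only one block of x, and the opposite holds when tiles arrive by columns.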
Description: GEMM implements matrix-matrix multiplication with accumulation. The implementation adopts two levels of tiling: the outermost for the memory (size MTILE x MTILE) and the innermost for the computation (CTILE_ROWS x CTILE_COLS).
Input: Matrix A has size NxK and arrives through channel CHANNEL_MATRIX_A.
For each outer tile (of size MTILE x MTILE), inner blocks are received one after the other.
The entire outer tile-row (MTILE x K) is sent a number of times equal to the number of outer tiles in matrix B (check helpers/read_matrix_a_notrans_gemm.cl for an example).
Matrix B is received through channel CHANNEL_MATRIX_B. For each outer tile, inner blocks are received one after the other. Each outer tile row is sent multiple times (check helpers/read_matrix_b_notrans_gemm.cl for an example).
In both cases, input data must be padded with zeros according to the reference tile size MTILE.
Output: The kernel computes the matrix C in tiles by rows, row streamed. The results are sent in the channel CHANNEL_MATRIX_OUT. Accumulation must be performed on the receiving kernel (check helpers/write_matrix_gemm.cl for an example).
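The role of the receiving kernel can be sketched with a small host-side model. It rests on two assumptions that are not confirmed by this page: that the streamed values are the computed product entries, and that "accumulation" means C := beta*C + received value. The function receive_and_accumulate is a hypothetical stand-in and does not reproduce helpers/write_matrix_gemm.cl; it also assumes that the dimensions of C are multiples of MTILE.

```python
def receive_and_accumulate(stream, c, beta, mtile):
    """Consume results that arrive in tiles by rows, row streamed,
    and accumulate them into `c` in place (assumed: c = beta*c + value)."""
    n, m = len(c), len(c[0])
    assert n % mtile == 0 and m % mtile == 0, "C must be a multiple of MTILE in this model"
    it = iter(stream)
    for ti in range(0, n, mtile):            # outer tiles, by rows
        for tj in range(0, m, mtile):
            for i in range(ti, ti + mtile):  # elements of the tile, row by row
                for j in range(tj, tj + mtile):
                    c[i][j] = beta * c[i][j] + next(it)
    return c


# A 2x4 result streamed as two 2x2 tiles: [[1, 2], [5, 6]] and then [[3, 4], [7, 8]].
product_stream = [1.0, 2.0, 5.0, 6.0, 3.0, 4.0, 7.0, 8.0]
C = [[2.0, 2.0, 2.0, 2.0], [2.0, 2.0, 2.0, 2.0]]
receive_and_accumulate(product_stream, C, 0.5, 2)
```

Keeping the accumulation in the receiver means the compute kernel never has to read C, which fits the streaming style used by the other modules.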
Additional notes: A similar interface is used for the systolic implementation.