Streaming Interface of HLS modules

Each HLS module is implemented in one or more versions, currently stored in the blas/ folder. Each module receives and produces data through FIFO buffers (Intel channels). The interface (i.e. how data is received and produced, which tiling schemes are supported, etc.) is described in the header of the corresponding implementation file. To favor composability, level-2 and level-3 HLS modules are provided in different versions, each supporting a different tiling scheme or set of input parameters.

In the following, we report the main details of the HLS module interfaces.

Please Note: this page is a Work In Progress

Level-1

ASUM

Description: ASUM computes the sum of the absolute values of the elements of a vector.

Input: Data is received from the input stream CHANNEL_VECTOR_X. The stream must be padded so that its length is a multiple of W, the vectorization width used in the module. Padding data must be set to zero.

Output: The result is streamed to the output channel CHANNEL_OUT at the end of the computation.
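The padding rule can be illustrated with a minimal host-side sketch in Python. It assumes that "a padding equal to W" means zero-padding the stream up to the next multiple of W; the helper names `stream_padded` and `asum` are hypothetical and are not part of the library.

```python
from typing import Iterable, Iterator, List


def stream_padded(x: List[float], w: int) -> Iterator[List[float]]:
    """Yield x in chunks of exactly w elements, zero-padding the last chunk.

    Assumption: the stream length must be a multiple of the width w.
    """
    padded_len = ((len(x) + w - 1) // w) * w  # round up to a multiple of w
    padded = x + [0.0] * (padded_len - len(x))
    for i in range(0, padded_len, w):
        yield padded[i:i + w]


def asum(chunks: Iterable[List[float]]) -> float:
    """Accumulate the absolute values chunk by chunk; emit one result at the end."""
    total = 0.0
    for chunk in chunks:
        total += sum(abs(v) for v in chunk)
    return total


result = asum(stream_padded([1.0, -2.0, 3.0], w=2))
```

Because the padding elements are zero, they do not change the sum, which is why the module can always process full W-wide groups.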

AXPY

Description: AXPY computes a constant times a vector plus a vector (y := alpha*x + y).

Input: Data is received from two input streams, CHANNEL_VECTOR_X and CHANNEL_VECTOR_Y. Each stream must be padded so that its length is a multiple of W, the vectorization width used in the module. Padding data must be set to zero.

Output: The result is streamed to the output channel CHANNEL_OUT, W elements at a time.
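As a rough illustration of a W-wide output stream, the following Python sketch consumes two chunked inputs and emits one W-element group per step. The names `chunks` and `axpy_stream` are hypothetical, and the zero-padding of the tail reflects an assumed reading of the padding rule.

```python
from typing import Iterable, Iterator, List


def chunks(v: List[float], w: int) -> Iterator[List[float]]:
    """Split v into w-wide chunks, zero-padding the tail (assumed padding rule)."""
    padded = v + [0.0] * (-len(v) % w)
    for i in range(0, len(padded), w):
        yield padded[i:i + w]


def axpy_stream(alpha: float,
                x_chunks: Iterable[List[float]],
                y_chunks: Iterable[List[float]]) -> Iterator[List[float]]:
    """Emit alpha*x + y one w-wide chunk at a time."""
    for xc, yc in zip(x_chunks, y_chunks):
        yield [alpha * xv + yv for xv, yv in zip(xc, yc)]


out = list(axpy_stream(2.0,
                       chunks([1.0, 2.0, 3.0], 2),
                       chunks([10.0, 20.0, 30.0], 2)))
```

The last output group still holds W elements; the padded positions evaluate to zero, so a consumer that knows the true length can drop them.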

DOT

Description: DOT performs the dot product of two vectors.

Input: Data is received from two input streams, CHANNEL_VECTOR_X and CHANNEL_VECTOR_Y. Each stream must be padded so that its length is a multiple of W, the vectorization width used in the module. Padding data must be set to zero.

Output: The result is streamed to the output channel CHANNEL_OUT at the end of the computation.

Level-2

GEMV

Description: GEMV performs one of the matrix-vector operations:

y := alpha*A*x + beta*y,
or y := alpha*A**T*x + beta*y, where A is an NxM matrix.

This module is released in two versions (v1 and v2).

Input: Data is received from three different channels (CHANNEL_VECTOR_X, CHANNEL_VECTOR_Y and CHANNEL_MATRIX_A). Input data must be padded with zeros according to the reference tile sizes (TILE_N and TILE_M).

Version 1:

A is non-transposed: tiles are received by rows, and the elements of each tile are sent row-by-row. The input vector x (M elements) must be sent in its entirety N/TILE_N times (i.e. len_y/tile_y).
A is transposed: tiles are received by columns, and the elements of each tile are sent column-by-column. The input vector x (N elements) must be sent in its entirety M/TILE_M times.

Version 2:

A is transposed: tiles are received by columns, and the elements of each tile are sent row-by-row. The input vector x (N elements) must be sent in its entirety M/TILE_M times.
A is non-transposed: tiles are received by rows, and the elements of each tile are sent column-by-column. The input vector x (M elements) must be sent in its entirety N/TILE_N times.

Output: Result is streamed in the output channel CHANNEL_VECTOR_OUT as soon as it is available.
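The tile orderings described for the two versions can be sketched with a small Python generator that lists matrix indices in streaming order. This is only an illustration under assumptions: "tiles by rows" is read as visiting all tiles of one tile-row before moving to the next, the matrix dimensions are assumed to be already padded to multiples of the tile sizes, and `matrix_stream_order` is a hypothetical name.

```python
from typing import Iterator, Tuple


def matrix_stream_order(n: int, m: int, tile_n: int, tile_m: int,
                        tiles_by: str, elements_by: str) -> Iterator[Tuple[int, int]]:
    """Yield (row, col) indices of an n x m matrix in streaming order.

    tiles_by:    "rows" or "columns", the order in which tiles are visited.
    elements_by: "rows" or "columns", the order of elements inside each tile.
    n and m are assumed to be multiples of tile_n and tile_m (after padding).
    """
    tiles_n, tiles_m = n // tile_n, m // tile_m
    if tiles_by == "rows":
        tiles = [(ti, tj) for ti in range(tiles_n) for tj in range(tiles_m)]
    else:
        tiles = [(ti, tj) for tj in range(tiles_m) for ti in range(tiles_n)]
    for ti, tj in tiles:
        if elements_by == "rows":
            local = [(i, j) for i in range(tile_n) for j in range(tile_m)]
        else:
            local = [(i, j) for j in range(tile_m) for i in range(tile_n)]
        for i, j in local:
            yield (ti * tile_n + i, tj * tile_m + j)


# Tiles by rows, elements row-by-row, on a 4x4 matrix with 2x2 tiles.
order = list(matrix_stream_order(4, 4, 2, 2, "rows", "rows"))
x_replays = 4 // 2  # N/TILE_N replays of x for this ordering
```

With tiles visited by rows, every tile-row needs all of x again, which is where the N/TILE_N replay count comes from.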

GER

Description: GER computes A := alpha*x*y**T + A, where A is an NxM matrix, x is an N-element vector, y is an M-element vector, and alpha is a scalar.

This module is released in 4 different versions.

Input: Data is received from three different channels (CHANNEL_VECTOR_X, CHANNEL_VECTOR_Y and CHANNEL_MATRIX_A). Input data must be padded with zeros according to the reference tile sizes (TILE_N and TILE_M).

Version 1: A is received in tiles of size TILE_N x TILE_M by rows, tile elements are sent row-by-row. The input vector y must be sent N/TILE_N times. Input vector x is read once.

Version 2: A is received in tiles of size TILE_N x TILE_M by columns, tile elements are sent column-by-column. The input vector x must be sent M/TILE_M times. Input vector y is read once.

Version 3: A is received in tiles of size TILE_N x TILE_M by columns, tile elements are sent row-by-row. The input vector x must be sent M/TILE_M times. Input vector y is read once.

Version 4: A is received in tiles of size TILE_N x TILE_M by rows, tile elements are sent column-by-column. The input vector y must be sent N/TILE_N times. Input vector x is read once.

Output: The result is streamed to the output channel CHANNEL_MATRIX_OUT, tile by tile as soon as it is available, in the same order in which the tiles of the input matrix arrive.
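To see why one vector is replayed while the other is read once, here is a minimal Python sketch of the Version 1 ordering (tiles by rows, elements row-by-row). It is a software model under assumptions, not the module's implementation: `ger_v1` is a hypothetical name, the dimensions are assumed to be multiples of the tile sizes, and the result is returned as a whole matrix rather than streamed tile by tile.

```python
from typing import List


def ger_v1(alpha: float, a: List[List[float]], x: List[float], y: List[float],
           tile_n: int, tile_m: int) -> List[List[float]]:
    """Compute alpha*x*y^T + a while consuming x once and y N/TILE_N times."""
    n, m = len(x), len(y)
    x_stream = iter(x)                   # x is read once
    y_stream = iter(y * (n // tile_n))   # y is replayed once per tile-row
    out = [[0.0] * m for _ in range(n)]
    for ti in range(n // tile_n):
        x_tile = [next(x_stream) for _ in range(tile_n)]
        for tj in range(m // tile_m):
            y_tile = [next(y_stream) for _ in range(tile_m)]
            for i in range(tile_n):          # elements row-by-row in the tile
                for j in range(tile_m):
                    r, c = ti * tile_n + i, tj * tile_m + j
                    out[r][c] = alpha * x_tile[i] * y_tile[j] + a[r][c]
    return out


x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 0.0, -1.0, 2.0]
a = [[float(r + c) for c in range(4)] for r in range(4)]
updated = ger_v1(2.0, a, x, y, tile_n=2, tile_m=2)
```

Each tile-row of A uses only TILE_N elements of x but every element of y, so y has to be available again for the next tile-row.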

Level-3

GEMM

Description: GEMM implements matrix-matrix multiplication with accumulation. The implementation adopts two levels of tiling: the outermost for the memory (size MTILE x MTILE) and the innermost for the computation (size CTILE_ROWS x CTILE_COLS).

Input: Matrix A has size NxK and arrives through channel CHANNEL_MATRIX_A. For each outer tile (MTILE x MTILE size), inner blocks are received one after the other. The entire outer tile-row (MTILE x K) is sent a number of times equal to the number of outer tiles in matrix B (see helpers/read_matrix_a_notrans_gemm.cl for an example).

Matrix B is received through channel CHANNEL_MATRIX_B. For each outer tile, inner blocks are received one after the other. Each outer tile-row is sent multiple times (see helpers/read_matrix_b_notrans_gemm.cl for an example).

In both cases, input data must be padded with zeros according to the reference tile size MTILE.

Output: The kernel computes the matrix C in tiles by rows, with each tile streamed row-by-row. The results are sent to the channel CHANNEL_MATRIX_OUT. Accumulation must be performed in the receiving kernel (see helpers/write_matrix_gemm.cl for an example).
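The replay of the outer tile-rows of A can be sketched as a send schedule in Python. This is an assumption-laden illustration: it reads "the number of outer tiles in matrix B" as the number of outer tile columns of B, takes M as the number of columns of B, and assumes all dimensions are already padded to multiples of MTILE. The names `padded_size` and `a_outer_tile_schedule` are hypothetical; the referenced helper kernels remain the authoritative example.

```python
from typing import Iterator, Tuple


def padded_size(size: int, mtile: int) -> int:
    """Round a dimension up to the next multiple of mtile (zero padding)."""
    return ((size + mtile - 1) // mtile) * mtile


def a_outer_tile_schedule(n: int, k: int, m: int,
                          mtile: int) -> Iterator[Tuple[int, int]]:
    """Yield (tile_row, tile_k) for each outer tile of A, in send order.

    Assumption: each MTILE x K tile-row of A is resent once per outer
    tile column of B, where B has size K x M.
    """
    tiles_n, tiles_k, tiles_m = n // mtile, k // mtile, m // mtile
    for ti in range(tiles_n):
        for _ in range(tiles_m):        # replay the tile-row of A
            for tk in range(tiles_k):
                yield (ti, tk)


n = padded_size(3, 2)   # 3 rows padded to 4
schedule = list(a_outer_tile_schedule(n, k=2, m=6, mtile=2))
```

Under these assumptions every outer tile of A appears M/MTILE times in the schedule, once for each output tile of C in the same tile-row.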

Additional notes: A similar interface is used for the systolic implementation.
