Host API

Tiziano De Matteis edited this page Nov 18, 2020 · 9 revisions

The FBLAS Host API enables a programmer to offload the execution of a numerical routine to the FPGA directly from a host program.

As is common in the FPGA scenario, a program that uses FBLAS routines consists of two parts: the FPGA programming bitstream(s), and the host program that manages the application and launches FBLAS routines on the FPGA.

To obtain these, the user should go through the following steps:

  1. Specify the routines that will be executed;
  2. Generate the OpenCL code using the host code generator;
  3. Compile the code for emulation or for full synthesis;
  4. Write the host code.

Please note: in the following, all commands are executed inside the FBLAS base folder.

Specify the routines

Since the FPGA has a limited amount of resources, the user has to specify the routines that she will use in her own program. To do this, she has to write a routines specification file, which details all the characteristics of the desired routines. The routines specification file is a JSON file with the following structure:

{
    "routine": [
        {
            "blas_name" : "dot",
            "user_name" : "sdot",
            "type" : "float",
            "width" : 16
        },
        {
            "blas_name" : "scal",
            "user_name" : "sscal",
            "type" : "float",
            "incx" : 2
        }
    ],
    "platform" : "stratix 10"
}

The file has an item routine, which is an array of routine specifications (one or more, two in the example) and an item platform that specifies the target FPGA.

Each routine is characterized by different properties:

  • mandatory properties: these are the same for all the routines and are:

    • blas_name a string indicating the routine, according to the BLAS library nomenclature (in the example dot and scal, two Level 1 routines);
    • user_name a string indicating the name that the user will use for calling this routine. It must be unique and it can be used to have multiple routines of the same type that are simultaneously in execution;
    • type a string (float or double) that indicates the numerical precision used by the routine;
  • optional properties: these characterize the routine behavior. If unspecified, a default value will be considered. These properties may change according to the particular routine. In general, they can be:

    • functional properties, which specify the logic of the routine. These are usually BLAS parameters. For instance, in the example the second routine accesses vector x with a stride of 2;
    • non-functional properties, which affect the performance of the routine. For example, the width property specifies the spatial parallelism that will be used.

The meaning of all the properties, as well as the list of the mandatory and optional properties, can be found in the JSON File specification section at the bottom of this page.

The platform item is optional. For the time being, FBLAS supports the Arria 10 (arria 10) and Stratix 10 (stratix 10) FPGA families. If unspecified, Stratix 10 is assumed as the target architecture.
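The rules above can be checked before invoking the code generator. The following is a minimal Python sketch, not part of FBLAS: the function name check_spec is hypothetical, and it only encodes the constraints stated on this page (mandatory properties, unique user_name, float/double type, supported platform strings). The actual code generator may apply further or different checks.

```python
import json

# Constraints taken from the text above; the real generator may check more.
MANDATORY = ("blas_name", "user_name", "type")
TYPES = ("float", "double")
PLATFORMS = ("arria 10", "stratix 10")


def check_spec(spec):
    """Return a list of problems found in a routines specification (hypothetical helper)."""
    routines = spec.get("routine")
    if not isinstance(routines, list) or not routines:
        return ["'routine' must be a non-empty array"]

    problems = []
    seen_names = set()
    for i, routine in enumerate(routines):
        # Every routine must carry the three mandatory properties.
        for key in MANDATORY:
            if key not in routine:
                problems.append(f"routine {i}: missing '{key}'")
        # The precision must be float or double.
        if "type" in routine and routine["type"] not in TYPES:
            problems.append(f"routine {i}: unsupported type '{routine['type']}'")
        # user_name must be unique across the file.
        name = routine.get("user_name")
        if name in seen_names:
            problems.append(f"routine {i}: duplicate user_name '{name}'")
        elif name is not None:
            seen_names.add(name)

    # platform is optional; Stratix 10 is the documented default.
    platform = spec.get("platform", "stratix 10")
    if platform not in PLATFORMS:
        problems.append(f"unsupported platform '{platform}'")
    return problems


example = json.loads("""
{
    "routine": [
        {"blas_name": "dot",  "user_name": "sdot",  "type": "float", "width": 16},
        {"blas_name": "scal", "user_name": "sscal", "type": "float", "incx": 2}
    ],
    "platform": "stratix 10"
}
""")
print(check_spec(example))  # prints []
```

An empty list means that the file satisfies the constraints listed here; it does not guarantee that the code generator will accept it.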

Systolic array for GEMM

FBLAS allows the user to generate a systolic-array-based implementation of GEMM. Please note: this is still an experimental feature, subject to further improvements. To generate the systolic array, the user has to add the JSON parameter "systolic": true to her configuration file.

Because it is implemented as an infinite loop, the systolic implementation can be called only once per FPGA reprogramming, and its parameters cannot be changed afterwards. We therefore suggest using this version only for designs in which the GEMM has to scale across the whole chip.

Generation of the OpenCL code

To generate the OpenCL code, the user can invoke the code generator by passing as argument the JSON file containing the routine specifications:

[user@host fblas]$ python codegen/host_codegen.py routines.json [output directory]

The code generator will produce a set of OpenCL files (one per routine) and a new JSON file (generated_routines.json) that conveys information used at runtime. With the specification file shown above, the code generator produces the files sdot.cl, sscal.cl and generated_routines.json.

If the user doesn't specify an output directory the files will be produced in /tmp/. If the input JSON file contains errors, the code generator will print proper error messages.
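As a quick sanity check after code generation, the expected output files can be listed from the specification. The sketch below is not part of FBLAS: expected_outputs is a hypothetical helper, and the naming rule (user_name plus the .cl extension) is inferred from the example on this page rather than from the generator's source.

```python
import posixpath


def expected_outputs(spec, output_dir="/tmp/"):
    """List the files the generator is expected to write (naming inferred from the example)."""
    # One OpenCL file per routine, named after its user_name (assumption).
    files = [posixpath.join(output_dir, routine["user_name"] + ".cl")
             for routine in spec["routine"]]
    # Plus the JSON file that is later passed to the host program.
    files.append(posixpath.join(output_dir, "generated_routines.json"))
    return files


spec = {"routine": [{"user_name": "sdot"}, {"user_name": "sscal"}]}
for path in expected_outputs(spec):
    print(path)
```

Passing a different output_dir mirrors the optional second argument of the code generator.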

Compile the code for emulation or for full synthesis

At this point, the user can compile the generated OpenCL code with the Intel AOC compiler.

For example, for emulating the routines over an x86 host, the user can compile them with the emulator flag:

[user@host fblas]$ aoc -march=emulator -board=p520_max_sg280l /tmp/sscal.cl /tmp/sdot.cl

where -board indicates one of the target boards available on the machine.

For full synthesis (this could take hours) the user can invoke the command without the emulator flag.

Write the host code

To use the compiled routines, the user has to write a host program that invokes them over the FPGA. FBLAS comes with a C++ host API that can be used for this purpose. It is a header-only library.

The following example shows an excerpt of host code that uses the routines specified in the previous steps (the full code can be found in the demo folder):

#include <fblas_environment.hpp>


int main(int argc, char *argv[])
{
    //... Command-line argument parsing...

    //create FBLAS environment
    FBLASEnvironment fb(program_path,json_path);
    
    //create data
    float *x,*y;
    float res,cpu_res;
    posix_memalign ((void **)&x, IntelFPGAOCLUtils::AOCL_ALIGNMENT, n*sizeof(float));
    posix_memalign ((void **)&y, IntelFPGAOCLUtils::AOCL_ALIGNMENT, n*sizeof(float));
    generate_vector(x,n);
    generate_vector(y,n);   

    //get context and device
    cl::Context context=fb.get_context();
    cl::Device device=fb.get_device();
    cl::CommandQueue queue;
    IntelFPGAOCLUtils::createCommandQueue(context,device,queue);

    //create buffer over fpga
    cl::Buffer fpga_x(context, CL_MEM_READ_WRITE|CL_CHANNEL_1_INTELFPGA, n *sizeof(float));
    cl::Buffer fpga_y(context, CL_MEM_READ_ONLY|CL_CHANNEL_2_INTELFPGA, n * sizeof(float));
    cl::Buffer fpga_res(context, CL_MEM_READ_WRITE|CL_CHANNEL_3_INTELFPGA,  sizeof(float));

    //copy data
    queue.enqueueWriteBuffer(fpga_x,CL_TRUE,0,n*sizeof(float),x);
    queue.enqueueWriteBuffer(fpga_y,CL_TRUE,0,n*sizeof(float),y);

    //scale every other element of x, starting from the first (stride 2)
    fb.sscal("sscal",floor(n/2),alpha,fpga_x,2);
    //compute the dot product
    fb.sdot("sdot",n,fpga_x,1,fpga_y,1,fpga_res);

    //copy back the result
    queue.enqueueReadBuffer(fpga_res,CL_TRUE,0,sizeof(float),&res);

To use the host API, the programmer must include the fblas_environment.hpp file, contained in the include/ folder, in her own host code.

Then:

  1. The FBLASEnvironment object is created. Its constructor initializes the OpenCL environment and reconfigures the FPGA with the compiled binary. In particular, it:

    • creates the platform, context, and device objects. By default, the target device is the first available Intel FPGA platform. The context and device objects can be obtained through getter methods (get_context() and get_device() in the example);
    • loads the binary program and reconfigures the FPGA;
    • creates the kernels for invocation, according to the specification contained in the JSON file.
  2. Following a classical OpenCL flow, the user creates her own input data (properly aligned to help data transfer to/from the FPGA), creates the FPGA buffers, and copies the data to the device.

  3. At this point, the user can invoke the FBLAS routines. They are defined as methods of the FBLASEnvironment object and take as input:

    • a string, which corresponds to the user-defined name;
    • a list of parameters, which correspond to the routine parameters according to the BLAS documentation (visit: http://www.netlib.org/blas/)
    • since OpenCL kernels do not return anything, in the cases in which the BLAS routine has a return value this is saved into a memory area allocated over the FPGA (this is the case of dot in the tutorial)
  4. The user copies back the result(s)
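To validate the value copied back from the FPGA, it helps to compute the same result on the CPU. The following Python sketch is a plain reference model of the two calls in the excerpt, following the BLAS convention that element i of a strided vector lives at index i*inc (for positive increments). The functions scal and dot are illustrative stand-ins, not FBLAS calls, and the input values are arbitrary.

```python
def scal(n, alpha, x, incx=1):
    """Reference SCAL: multiply n elements of x by alpha, stepping by incx."""
    for i in range(n):
        x[i * incx] *= alpha


def dot(n, x, incx, y, incy):
    """Reference DOT: sum of products of n elements of x and y."""
    return sum(x[i * incx] * y[i * incy] for i in range(n))


n = 8
alpha = 2.0
x = [1.0] * n
y = [1.0] * n

# Same sequence as the host excerpt: scale n/2 elements of x with stride 2,
# then compute the dot product over all n elements.
scal(n // 2, alpha, x, 2)
cpu_res = dot(n, x, 1, y, 1)

print(x)        # prints [2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0]
print(cpu_res)  # prints 12.0
```

When comparing against the single-precision FPGA result, a relative tolerance is advisable, since the order of the floating-point additions on the device may differ from the CPU one.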

By default, FBLAS calls are synchronous, meaning that control returns to the host program only when the routine has completed. FBLAS calls can also be asynchronous. To this end, FBLAS relies on OpenCL events. Each routine optionally accepts two additional parameters:

  • a pointer to a vector of cl::Event objects, which are the set of events that need to complete before the routine can be executed
  • a pointer to an event, that will contain the event that identifies the routine (and can be used for further synchronizations).

Considering the previous example, the routine invocations become:

    std::vector<cl::Event> scal_event, dot_event;
    cl::Event e;
    fb.sscal("sscal",floor(n/2),alpha,fpga_x,2,nullptr, &e);
    scal_event.push_back(e);

    fb.sdot("sdot",n,fpga_x,1,fpga_y,1,fpga_res,&scal_event,&e);
    dot_event.push_back(e);

    queue.enqueueReadBuffer(fpga_res,CL_TRUE,0,sizeof(float),&res,&dot_event);

How to compile the host code

The host code can be compiled with gcc. The user must provide the compilation flags needed to include the fblas_environment.hpp header (in the include/ folder) and the rapidjson library (under rapidjson/include), as well as any other include and link flags required by the Intel FPGA OpenCL toolchain, which can be retrieved with the aocl compile-config and aocl link-config command-line utilities.
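The pieces of the compilation command can be assembled as in the following Python sketch. It is only an illustration: build_host_command is a hypothetical helper, the -std=c++11 flag and the file names host.cpp and host are assumptions, and the two flag strings stand in for whatever aocl compile-config and aocl link-config print on the user's machine.

```python
import shlex


def build_host_command(source, output, compile_flags, link_flags, fblas_root="."):
    """Assemble a g++ command line for an FBLAS host program (illustrative only)."""
    cmd = ["g++", "-std=c++11", source,
           # FBLAS host API header and the bundled rapidjson headers.
           "-I", f"{fblas_root}/include",
           "-I", f"{fblas_root}/rapidjson/include"]
    # Flags as printed by `aocl compile-config`.
    cmd += shlex.split(compile_flags)
    cmd += ["-o", output]
    # Flags as printed by `aocl link-config`.
    cmd += shlex.split(link_flags)
    return cmd


# Placeholder flag strings; on a real system, capture them for instance with
# subprocess.run(["aocl", "compile-config"], capture_output=True, text=True).stdout
cmd = build_host_command("host.cpp", "host",
                         "-I/opt/aocl/host/include",
                         "-L/opt/aocl/host/lib -lOpenCL")
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run, or the printed line can be pasted into a shell from the FBLAS base folder.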

JSON File specification

The JSON file provided by the user contains the specification of a set of routines. As explained above, each routine is identified by a set of mandatory and optional properties.

The mandatory properties, required for all routines, are:

  • blas_name, which indicates the BLAS routine that the user wants to use;
  • user_name, the unique name used to call the FBLAS routine;
  • type, which specifies the precision (float or double).

The optional properties are:

  • incx / incy: vector access strides;
  • tile N size / tile M size: dimensions of the tiles;
  • uplo: for routines that take triangular matrices as input, either L or U;
  • trans: N for a non-transposed matrix, T for a transposed one;
  • side: for TRSM, either Left or Right.

In the case of malformed JSON, the generator will output error messages.