FFT in SDSoC (C callable IP)

t-kuha edited this page Feb 28, 2018 · 5 revisions

Let's migrate the fft_single example from Vivado HLS to SDSoC.

  • SDSoC version used: 2016.3

  • Target Platform: Digilent Zybo (XC7Z010-1CLG400C) with Linux as OS

1. Import fft_single example into SDSoC

First off, create an SDSoC project named fft_single:

Create SDSoC Project

Choose Zybo as platform:

Select Zybo

... and Linux SMP as Software platform:

Select Linux SMP

... we do not use Templates (select Empty Application):

Select Empty Application

Locate fft_single example files in: < SDSoC installation path >/2016.3/Vivado_HLS/examples/design/FFT/fft_single

Example source files

Then copy the source files (*.cpp & *.h) into the "src" folder of the project.

Copy source files

You may also want to copy data files into the project:

Copy data folder

In SDx Project Settings, select fft_top() as the HW function, leaving the clock frequency at its default:

Specify HW function

For quick iteration, uncheck Generate bitstream & Generate SD card image, and check Estimate Performance:

Project settings

At this point, the project should look like this:

Project overview

2. Optimize

2.1 1st try

See whether the project compiles without any modification: Project -> Build

The compilation log (sds_fft_top.log) should look like this with some WARNINGs and ERRORs:

Part of SDS log

2.2 Modify code

We now have to resolve these WARNINGs & ERRORs...

2.2.1 WARNINGs

According to page 58 of UG1027 (SDSoC Environment User Guide, 2016.3), #pragma HLS interface for a top-level function argument is ignored, so we have to comment those pragmas out in fft_top.cpp:

//#pragma HLS interface ap_hs port=direction
//#pragma HLS interface ap_fifo depth=1 port=ovflo
//#pragma HLS interface ap_fifo depth=FFT_LENGTH port=in,out

We also comment out the following pragmas, which no longer apply once in and out are replaced by float arrays in the next step:

//#pragma HLS data_pack variable=in
//#pragma HLS data_pack variable=out

2.2.2 ERRORs

There is also a restriction on the data width of top-level function arguments: each argument must be 8, 16, 32, or 64 bits wide. We therefore need to change the data types of complex<data_in_t> in[FFT_LENGTH] & complex<data_out_t> out[FFT_LENGTH]. Here, for simplicity, we define the arguments (in & out) as 32-bit floating point (float) and convert the data type inside the HW function.

  • Let's redefine fft_top(). Note that the real & imaginary parts each need their own argument. The .h file:
// Use generic C type for HW function arguments
void fft_top(
    bool direction,
    // cmpxDataIn in[FFT_LENGTH],
    // cmpxDataOut out[FFT_LENGTH],
    float in_re[FFT_LENGTH], float in_im[FFT_LENGTH],
    float out_re[FFT_LENGTH], float out_im[FFT_LENGTH],
    bool* ovflo);

... and .cpp file:

void fft_top( ... )
{
    ...
    // dummy_proc_fe(direction, &fft_config, in, xn);
    dummy_proc_fe(direction, &fft_config, in_re, in_im, xn);
    ...
    // dummy_proc_be(&fft_status, ovflo, xk, out);
    dummy_proc_be(&fft_status, ovflo, xk, out_re, out_im);
}
  • We rewrite HW sub functions accordingly:
void dummy_proc_fe(
    bool direction,
    config_t* config, 
    // cmpxDataIn in[FFT_LENGTH],
    float in_re[FFT_LENGTH],
    float in_im[FFT_LENGTH],
    cmpxDataIn out[FFT_LENGTH])
{
    int i;
    config->setDir(direction);
    config->setSch(0x2AB);
    for (i=0; i< FFT_LENGTH; i++){
        // out[i] = in[i];
        out[i].real(in_re[i]);
        out[i].imag(in_im[i]);
    }
}

void dummy_proc_be(
    status_t* status_in, 
    bool* ovflo,
    cmpxDataOut in[FFT_LENGTH],
    float out_re[FFT_LENGTH],
    float out_im[FFT_LENGTH]
    /*cmpxDataOut out[FFT_LENGTH]*/)
{
    int i; 
    for (i=0; i< FFT_LENGTH; i++){
        // out[i] = in[i];
        out_re[i] = in[i].real();
        out_im[i] = in[i].imag();
    }
    *ovflo = status_in->getOvflo() & 0x1;
}

and main():

    // static cmpxDataIn xn_input[SAMPLES];
    // static cmpxDataOut xk_output[SAMPLES];
    float in_re[SAMPLES] = {0};
    float in_im[SAMPLES] = {0};
    float out_re[SAMPLES] = {0};
    float out_im[SAMPLES] = {0};
    ...
        // xn_input[line_no-5] = cmpxDataIn(input_data_re, input_data_im);
        in_re[line_no - 5] = input_data_re;
        in_im[line_no - 5] = input_data_im;
    ...
    // fft_top(FWD_INV, xn_input, xk_output, &ovflo);
    fft_top(FWD_INV, in_re, in_im, out_re, out_im, &ovflo);
    ...
            //if (golden != xk_output[i].real())
            if (golden.to_float() != out_re[i])
            {
    ...
                cout << "Frame:" << frame << " index: " << i 
                     << "  Golden: " <<  golden.to_float()
                     << " vs. RE Output: " << setprecision(14)
                     << out_re[i] /*xk_output[i].real().to_float()*/ << endl;
            }
    ...
            //if (golden != xk_output[i].imag())
            if (golden.to_float() != out_im[i])
            {
                error_num++;
                cout << "Frame:" << frame << " index: " << i 
                     << "  Golden: " << golden.to_float()
                     << " vs. IM Output: " << setprecision(14)
                     << out_im[i] /*xk_output[i].imag().to_float()*/ << endl;
            }
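
The split of complex samples into separate real & imaginary float arrays can be sketched in plain C++ as below. This is only an illustration: split_complex() and merge_complex() are hypothetical helper names that do not exist in the example, and std::complex<float> stands in for the fixed-point cmpxDataIn / cmpxDataOut types.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical helper: split complex samples into the two float arrays
// that the modified fft_top() takes as in_re / in_im.
void split_complex(const std::vector<std::complex<float> >& in,
                   float* re, float* im)
{
    for (std::size_t i = 0; i < in.size(); ++i) {
        re[i] = in[i].real();
        im[i] = in[i].imag();
    }
}

// Hypothetical helper: rebuild complex samples from out_re / out_im.
std::vector<std::complex<float> > merge_complex(const float* re,
                                                const float* im,
                                                std::size_t n)
{
    std::vector<std::complex<float> > out(n);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = std::complex<float>(re[i], im[i]);
    }
    return out;
}
```

Keeping the split on the software side means every HW function argument is a plain 32-bit float array, which satisfies the data width restriction described above.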

2.2.3 Try build again

Building the project again produces new errors, this time from the linker:

2nd try

So, we need separate definitions of fft_top() for the SW part and the HW part. To do that, we can use the __SDSVHLS__ macro:

#ifdef __SDSVHLS__
// This part is compiled into HW function by Vivado HLS
void fft_top(...)
{
    //#pragma HLS interface ap_hs port=direction
    ...
}
#else
// This part is compiled as SW function by gcc and calls HW function
#include <stdio.h>
void fft_top(
    bool direction,
    float in_re[FFT_LENGTH], float in_im[FFT_LENGTH],
    float out_re[FFT_LENGTH], float out_im[FFT_LENGTH],
    bool* ovflo)
{
    printf("SDSoC Stub Function %s() ...\n", __FUNCTION__);
}
#endif
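
To see which branch a given compiler picks, the same guard can be tried in isolation. The following is a minimal sketch; build_flavor() is a hypothetical function used only for this illustration. With an ordinary compiler such as g++, __SDSVHLS__ is not defined, so the #else branch is the one that gets compiled.

```cpp
#include <cstring>

// Hypothetical helper: reports which side of the guard was compiled.
const char* build_flavor()
{
#ifdef __SDSVHLS__
    // Compiled only when Vivado HLS synthesizes the HW function.
    return "hardware";
#else
    // Compiled by the regular software compiler.
    return "software stub";
#endif
}
```

Calling build_flavor() from a program built with a regular compiler returns "software stub", which mirrors how the stub version of fft_top() ends up in the software build.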

2.2.4 Build succeeded

After all those changes, the project builds with no errors. However, the estimated performance is unreasonably low: about 3 seconds per execution.

2nd try

2.3 Optimize for performance

To reduce data transfer time, we use sds_alloc() & sds_free() to allocate/release memory for input/output data:

First, we have to include the header:

#include <sstream>
#include "sds_lib.h"    // for sds_***()
using namespace std;

... then allocate memory using sds_alloc():

    // float in_re[SAMPLES] = {0};
    // float in_im[SAMPLES] = {0};
    // float out_re[SAMPLES] = {0};
    // float out_im[SAMPLES] = {0};
    float* in_re = (float*) sds_alloc(SAMPLES*sizeof(float));
    float* in_im = (float*) sds_alloc(SAMPLES*sizeof(float));
    float* out_re = (float*) sds_alloc(SAMPLES*sizeof(float));
    float* out_im = (float*) sds_alloc(SAMPLES*sizeof(float));

... and remember to release that memory using sds_free().

    sds_free(in_re);
    sds_free(in_im);
    sds_free(out_re);
    sds_free(out_im);
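
The snippets above do not check the pointers returned by sds_alloc(), and the original stack arrays were zero-initialized with = {0}. A more defensive variant might look like the sketch below. alloc_samples() is a hypothetical helper, and the malloc/free fallback is an assumption that only exists so the code can be tried off-target. It relies on the __SDSCC__ macro being defined by the SDSoC compilers, which you should verify against UG1027 for your version.

```cpp
#include <cstddef>
#include <cstdlib>

#ifdef __SDSCC__
#include "sds_lib.h"    // sds_alloc() / sds_free() on the target
#else
// Assumption for off-target builds only: fall back to malloc/free.
static void* sds_alloc(std::size_t size) { return std::malloc(size); }
static void  sds_free(void* p)           { std::free(p); }
#endif

// Hypothetical helper: allocate n floats and zero them,
// matching the "= {0}" initialization of the original arrays.
// Returns NULL if the allocation fails.
float* alloc_samples(std::size_t n)
{
    float* p = static_cast<float*>(sds_alloc(n * sizeof(float)));
    if (p != NULL) {
        for (std::size_t i = 0; i < n; ++i) {
            p[i] = 0.0f;
        }
    }
    return p;
}
```

In main(), each buffer would then be obtained with alloc_samples(SAMPLES), checked against NULL before use, and released with sds_free() as shown above.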

Optionally (*1), in fft_top.h, we can add SDS pragmas that mark the buffers as physically contiguous, so that SDSoC selects simple DMA (AXI_DMA_SIMPLE) for faster data transfer:

#pragma SDS data mem_attribute(in_re:PHYSICAL_CONTIGUOUS)
#pragma SDS data mem_attribute(in_im:PHYSICAL_CONTIGUOUS)
#pragma SDS data mem_attribute(out_re:PHYSICAL_CONTIGUOUS)
#pragma SDS data mem_attribute(out_im:PHYSICAL_CONTIGUOUS)
void fft_top( 
    ...

*1 In this case, SDSoC selects simple DMA automatically, so we don't actually have to add those #pragmas.

To reduce function execution time, we also want to apply inlining to HW sub functions:

void dummy_proc_fe( ... )
{
#pragma HLS INLINE
    ...
}

void dummy_proc_be( ... )
{
#pragma HLS INLINE
    ...
}

... and loop pipelining as usual:

void dummy_proc_fe( ... )
{
    ...
    for (i=0; i< FFT_LENGTH; i++){
#pragma HLS PIPELINE
    ...
    }
}

void dummy_proc_be( ... )
{
    ...
    for (i=0; i< FFT_LENGTH; i++){
#pragma HLS PIPELINE
    ...
    }
}
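
Put together, a sub function with both pragmas applied might look like the simplified sketch below. pack_samples() is a hypothetical stand-in for dummy_proc_fe(), std::complex<float> replaces cmpxDataIn, and the FFT_LENGTH value of 1024 is assumed for illustration (the real value comes from fft_top.h). A regular software compiler ignores the HLS pragmas, so the function can also be checked on a PC.

```cpp
#include <complex>

// Assumed value for illustration; the real FFT_LENGTH comes from fft_top.h.
static const int FFT_LENGTH = 1024;

// Hypothetical stand-in for dummy_proc_fe(): packs two float arrays
// into complex samples, one sample per loop iteration.
void pack_samples(const float in_re[FFT_LENGTH],
                  const float in_im[FFT_LENGTH],
                  std::complex<float> out[FFT_LENGTH])
{
#pragma HLS INLINE
    for (int i = 0; i < FFT_LENGTH; i++) {
#pragma HLS PIPELINE
        out[i] = std::complex<float>(in_re[i], in_im[i]);
    }
}
```

The INLINE pragma sits at the top of the function body and the PIPELINE pragma sits inside the loop it applies to, matching the placement in the snippets above.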

2.4 Estimate final performance

Let's estimate performance again...

Final estimation

We now get far fewer estimated cycles & slightly lower resource utilization. Seems O.K.

3. Build HW & Run it!

  • Make sure Generate bitstream & Generate SD card image are checked in the Options section of SDx Project Settings.

Project setting

  • Then build the project (Project -> Build), which should finish successfully. Below is the Data Motion Network report. We can see that simple DMA is implemented for the in/out data.

Data motion network report

  • Copy the contents of sd_card folder & data folder into an SD card.

SD Card image

  • Insert the SD card into your Zybo & power on to boot Linux.

  • cd to /mnt (where the program is located) & run the program as follows:

Invoke program

  • Test PASSED!!! We are now able to accelerate FFT on FPGA without writing FFT code.

Final log


The SD card files (except for image.ub) are available in the repo: fft_single/sd_card