FFT in SDSoC (C callable IP)

Let's migrate the fft_single example from Vivado HLS to SDSoC.

SDSoC version used: 2016.3
Target Platform: Digilent Zybo (XC7Z010-1CLG400C) with Linux as OS

1. Import fft_single example into SDSoC

First off, create an SDSoC project named fft_single:

Create SDSoC Project

Choose Zybo as platform:

Select Zybo

... and Linux SMP as Software platform:

Select Linux SMP

... we do not use Templates (select Empty Application):

Select Empty Application

Locate the fft_single example files in: < SDSoC installation path >/2016.3/Vivado_HLS/examples/design/FFT/fft_single

Example source files

Then copy the source files (*.cpp & *.h) into the "src" folder of the project.

Copy source files

You may also want to copy data files into the project:

Copy data folder

In SDx Project Settings, select fft_top() as the HW function, leaving the clock frequency at its default:

Specify HW function

For quick iteration, uncheck Generate bitstream & Generate SD card image, and check Estimate Performance:

Project settings

At this point, the project should look like this:

Project overview

2. Optimize

2.1 1st try

See whether the project compiles without any modification: Project -> Build

The compilation log (sds_fft_top.log) should look like this with some WARNINGs and ERRORs:

Part of SDS log

2.2 Modify code

We now have to resolve these WARNINGs & ERRORs...

2.2.1 WARNINGs

According to page 58 of UG1027 (SDSoC Environment User Guide, 2016.3), #pragma HLS interface for a top-level function argument is ignored, so we have to comment those pragmas out in fft_top.cpp:

//#pragma HLS interface ap_hs port=direction
//#pragma HLS interface ap_fifo depth=1 port=ovflo
//#pragma HLS interface ap_fifo depth=FFT_LENGTH port=in,out

We also comment out the following pragmas. They refer to the in and out arguments, which the code change in 2.2.2 removes, so they would no longer have any effect:

//#pragma HLS data_pack variable=in
//#pragma HLS data_pack variable=out

2.2.2 ERRORs

There is also a restriction on the data width of the top-level function's arguments: each argument must be 8, 16, 32 or 64 bits wide. We therefore need to change the data type of complex<data_in_t> in[FFT_LENGTH] & complex<data_out_t> out[FFT_LENGTH]. For simplicity, we define the arguments (in & out) as 32-bit floating point (float) and convert the data type inside the HW function.
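As a rough illustration of the width rule, the check below can be compiled on a host machine. It is only a sketch: has_sdsoc_friendly_width() and Packed24 are hypothetical names introduced here, not part of SDSoC, and sizeof() reports storage size, so it is only a proxy and says nothing about the bit width of arbitrary-precision HLS types.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// Hypothetical helper (not an SDSoC API): true when T occupies
// exactly 8, 16, 32 or 64 bits of storage.
template <typename T>
constexpr bool has_sdsoc_friendly_width() {
    return sizeof(T) * CHAR_BIT == 8  || sizeof(T) * CHAR_BIT == 16 ||
           sizeof(T) * CHAR_BIT == 32 || sizeof(T) * CHAR_BIT == 64;
}

// Stand-in for a type with an awkward width (3 bytes = 24 bits).
struct Packed24 { unsigned char bytes[3]; };

// float is the type chosen for the new fft_top() arguments.
static_assert(has_sdsoc_friendly_width<float>(),
              "float is expected to be 32 bits wide");
static_assert(!has_sdsoc_friendly_width<Packed24>(),
              "a 24-bit type does not satisfy the width rule");
```

The static_assert lines fail at compile time if an argument type stops satisfying the rule, which is cheaper than waiting for the SDSoC build to report it.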

Let's redefine fft_top(). Note that the real and imaginary parts each need their own argument. The .h file:

// Use generic C type for HW function arguments
void fft_top(
    bool direction,
    // cmpxDataIn in[FFT_LENGTH],
    // cmpxDataOut out[FFT_LENGTH],
    float in_re[FFT_LENGTH], float in_im[FFT_LENGTH],
    float out_re[FFT_LENGTH], float out_im[FFT_LENGTH],
    bool* ovflo);

... and .cpp file:

void fft_top( ... )
{
    ...
    // dummy_proc_fe(direction, &fft_config, in, xn);
    dummy_proc_fe(direction, &fft_config, in_re, in_im, xn);
    ...
    // dummy_proc_be(&fft_status, ovflo, xk, out);
    dummy_proc_be(&fft_status, ovflo, xk, out_re, out_im);
}

We rewrite HW sub functions accordingly:

void dummy_proc_fe(
    bool direction,
    config_t* config, 
    // cmpxDataIn in[FFT_LENGTH],
    float in_re[FFT_LENGTH],
    float in_im[FFT_LENGTH],
    cmpxDataIn out[FFT_LENGTH])
{
    int i;
    config->setDir(direction);
    config->setSch(0x2AB);
    for (i=0; i< FFT_LENGTH; i++){
        // out[i] = in[i];
        out[i].real(in_re[i]);
        out[i].imag(in_im[i]);
    }
}

void dummy_proc_be(
    status_t* status_in, 
    bool* ovflo,
    cmpxDataOut in[FFT_LENGTH],
    float out_re[FFT_LENGTH],
    float out_im[FFT_LENGTH]
    /*cmpxDataOut out[FFT_LENGTH]*/)
{
    int i; 
    for (i=0; i< FFT_LENGTH; i++){
        // out[i] = in[i];
        out_re[i] = in[i].real();
        out_im[i] = in[i].imag();
    }
    *ovflo = status_in->getOvflo() & 0x1;
}

and main():

    // static cmpxDataIn xn_input[SAMPLES];
    // static cmpxDataOut xk_output[SAMPLES];
    float in_re[SAMPLES] = {0};
    float in_im[SAMPLES] = {0};
    float out_re[SAMPLES] = {0};
    float out_im[SAMPLES] = {0};
    ...
        // xn_input[line_no-5] = cmpxDataIn(input_data_re, input_data_im);
        in_re[line_no - 5] = input_data_re;
        in_im[line_no - 5] = input_data_im;
    ...
    // fft_top(FWD_INV, xn_input, xk_output, &ovflo);
    fft_top(FWD_INV, in_re, in_im, out_re, out_im, &ovflo);
    ...
            //if (golden != xk_output[i].real())
            if (golden.to_float() != out_re[i])
            {
    ...
                cout << "Frame:" << frame << " index: " << i 
                     << "  Golden: " <<  golden.to_float()
                     << " vs. RE Output: " << setprecision(14)
                     << out_re[i] /*xk_output[i].real().to_float()*/ << endl;
            }
    ...
            //if (golden != xk_output[i].imag())
            if (golden.to_float() != out_im[i])
            {
                error_num++;
                cout << "Frame:" << frame << " index: " << i 
                     << "  Golden: " << golden.to_float()
                     << " vs. IM Output: " << setprecision(14)
                     << out_im[i] /*xk_output[i].imag().to_float()*/ << endl;
            }
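The real/imaginary split used in dummy_proc_fe() and dummy_proc_be() can be tried in isolation on a host compiler. The sketch below assumes std::complex<float> as a stand-in for cmpxDataIn / cmpxDataOut and a small N in place of FFT_LENGTH. merge_re_im(), split_re_im() and round_trip_ok() are hypothetical names, and no FFT is performed: it only checks that merging and splitting are inverses.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>

// Assumptions: std::complex<float> stands in for cmpxDataIn / cmpxDataOut,
// and N stands in for FFT_LENGTH.
const std::size_t N = 8;
typedef std::complex<float> cmpx_t;

// Merge two float arrays into one complex array (the copy loop of dummy_proc_fe()).
void merge_re_im(const float in_re[N], const float in_im[N], cmpx_t out[N]) {
    for (std::size_t i = 0; i < N; i++) {
        out[i] = cmpx_t(in_re[i], in_im[i]);
    }
}

// Split one complex array into two float arrays (the copy loop of dummy_proc_be()).
void split_re_im(const cmpx_t in[N], float out_re[N], float out_im[N]) {
    for (std::size_t i = 0; i < N; i++) {
        out_re[i] = in[i].real();
        out_im[i] = in[i].imag();
    }
}

// Merge then split, and verify that every sample comes back unchanged.
bool round_trip_ok() {
    float re[N], im[N], re2[N], im2[N];
    cmpx_t packed[N];
    for (std::size_t i = 0; i < N; i++) {
        re[i] = static_cast<float>(i);
        im[i] = -static_cast<float>(i);
    }
    merge_re_im(re, im, packed);
    split_re_im(packed, re2, im2);
    for (std::size_t i = 0; i < N; i++) {
        if (re[i] != re2[i] || im[i] != im2[i]) return false;
    }
    return true;
}
```

Exact equality is safe here because the values are only copied. In the real design the data passes through a fixed-point FFT, so the comparison in main() goes through to_float() instead.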

2.2.3 Try build again

Building the project again produces new errors, this time from the linker:

2nd try

So we need separate definitions of fft_top() for the SW part and the HW part. To do that, we can use the __SDSVHLS__ macro:

#ifdef __SDSVHLS__
// This part is compiled into HW function by Vivado HLS
void fft_top(...)
{
    //#pragma HLS interface ap_hs port=direction
    ...
}
#else
// This part is compiled as SW function by gcc and calls HW function
#include <stdio.h>
void fft_top(
    bool direction,
    float in_re[FFT_LENGTH], float in_im[FFT_LENGTH],
    float out_re[FFT_LENGTH], float out_im[FFT_LENGTH],
    bool* ovflo)
{
    printf("SDSoC Stub Function %s() ...\n", __FUNCTION__);
}
#endif
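To see which branch a given compiler pass takes, the same #ifdef pattern can be reduced to a function that reports the selected variant. This is only a sketch and fft_top_variant() is a hypothetical name. On a plain host compiler, where __SDSVHLS__ is not defined, the #else branch is the one compiled.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper: reports which definition the current compiler pass selected.
const char* fft_top_variant() {
#ifdef __SDSVHLS__
    // Seen only by Vivado HLS when it is invoked by SDSoC.
    return "hw";
#else
    // Seen by gcc (and by any ordinary host compiler).
    return "sw-stub";
#endif
}
```

Both branches must keep the same function signature, as in the fft_top() stub above, so that callers compile unchanged in either pass.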

2.2.4 Build succeeded

After all those changes, the project builds with no errors. However, the estimated performance is unreasonably low: the estimate corresponds to about 3 seconds per execution.

2nd try

2.3 Optimize for performance

To reduce data transfer time, we use sds_alloc() & sds_free() to allocate and release memory for the input/output data.

First, we have to include the header:

#include <sstream>
#include "sds_lib.h"    // for sds_***()
using namespace std;

... then allocate the buffers using sds_alloc():

    // float in_re[SAMPLES] = {0};
    // float in_im[SAMPLES] = {0};
    // float out_re[SAMPLES] = {0};
    // float out_im[SAMPLES] = {0};
    float* in_re = (float*) sds_alloc(SAMPLES*sizeof(float));
    float* in_im = (float*) sds_alloc(SAMPLES*sizeof(float));
    float* out_re = (float*) sds_alloc(SAMPLES*sizeof(float));
    float* out_im = (float*) sds_alloc(SAMPLES*sizeof(float));

... and remember to release them using sds_free():

    sds_free(in_re);
    sds_free(in_im);
    sds_free(out_re);
    sds_free(out_im);
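The stack arrays replaced above were zero-initialized with = {0}, and the sds_alloc() calls do not check for failure. A minimal sketch of a wrapper that restores both is shown below. alloc_samples() is a hypothetical helper, and the malloc()/free() fallback is an assumption added only so the snippet also compiles on a host without sds_lib.h (it relies on the __SDSCC__ macro being defined by the SDSoC compilers).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

#ifdef __SDSCC__
#include "sds_lib.h"    // sds_alloc() / sds_free() when built with sdscc / sds++
#else
// Host-only fallback (assumption): plain heap memory, NOT physically contiguous.
static void* sds_alloc(std::size_t size) { return std::malloc(size); }
static void  sds_free(void* p) { std::free(p); }
#endif

// Hypothetical helper: allocate n floats and zero them.
// Returns NULL if the allocation fails.
float* alloc_samples(std::size_t n) {
    float* p = static_cast<float*>(sds_alloc(n * sizeof(float)));
    if (p != NULL) {
        for (std::size_t i = 0; i < n; i++) {
            p[i] = 0.0f;   // mirrors the "= {0}" of the original arrays
        }
    }
    return p;
}
```

In main(), this would be used as float* in_re = alloc_samples(SAMPLES); followed by a NULL check before the first write, with sds_free(in_re) at the end as shown above.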

Optionally (*1), in fft_top.h, we can add SDSoC #pragmas to make SDSoC choose simple DMA (AXI_DMA_SIMPLE) for faster data transfer:

#pragma SDS data mem_attribute(in_re:PHYSICAL_CONTIGUOUS)
#pragma SDS data mem_attribute(in_im:PHYSICAL_CONTIGUOUS)
#pragma SDS data mem_attribute(out_re:PHYSICAL_CONTIGUOUS)
#pragma SDS data mem_attribute(out_im:PHYSICAL_CONTIGUOUS)
void fft_top( 
    ...

*1 In this case, SDSoC chooses simple DMA automatically, so we don't actually have to add those #pragmas.

To reduce function execution time, we also inline the HW sub functions:

void dummy_proc_fe( ... )
{
#pragma HLS INLINE
    ...
}

void dummy_proc_be( ... )
{
#pragma HLS INLINE
    ...
}

... and pipeline their loops as usual:

void dummy_proc_fe( ... )
{
    ...
    for (i=0; i< FFT_LENGTH; i++){
#pragma HLS PIPELINE
    ...
    }
}

void dummy_proc_be( ... )
{
    ...
    for (i=0; i< FFT_LENGTH; i++){
#pragma HLS PIPELINE
    ...
    }
}

2.4 Estimate final performance

Let's estimate performance again:

Final estimation

The estimated cycle count is much lower and resource utilization is slightly lower, which looks acceptable.

3. Build HW & Run it!

Make sure Generate bitstream & Generate SD card image are checked in the Options section of SDx Project Settings.

Project setting

Then build the project (Project -> Build), which should finish successfully. Below is the Data Motion Network report. We can see that simple DMA is implemented for the in/out data.

Data motion network report

Copy the contents of the sd_card folder & the data folder onto an SD card.

SD Card image

Insert the SD card into your Zybo & power it on to boot Linux.
Then cd to /mnt (where the program is located) & run the program as follows:

Invoke program

Test PASSED!!! We are now able to accelerate FFT on FPGA without writing FFT code.

Final log

The SD card files (except for image.ub) are available in the repo: fft_single/sd_card
