FFT in SDSoC (C callable IP)
Let's migrate the fft_single example from Vivado HLS to SDSoC.
- SDSoC version used: 2016.3
- Target Platform: Digilent Zybo (XC7Z010-1CLG400C) with Linux as OS
First off, create an SDSoC project named fft_single:
Choose Zybo as platform:
... and Linux SMP as Software platform:
... we do not use Templates (select Empty Application):
Locate fft_single example files in: < SDSoC installation path >/2016.3/Vivado_HLS/examples/design/FFT/fft_single
Then copy the source files (*.cpp & *.h) into "src" folder of the project.
You may also want to copy data files into the project:
In SDx Project Settings, select fft_top() as the HW function, leaving the clock frequency at its default:
For quick iteration, uncheck Generate bitstream & Generate SD card image, and check Estimate Performance.
At this point, the project should look like this:
See whether the project compiles without any modification: Project -> Build
The compilation log (sds_fft_top.log) should look like this with some WARNINGs and ERRORs:
We now have to resolve these WARNINGs & ERRORs...
According to page 58 of UG1027 (SDSoC Environment User Guide, 2016.3), #pragma HLS interface on a top-level function argument is ignored, so we comment those pragmas out in fft_top.cpp:
//#pragma HLS interface ap_hs port=direction
//#pragma HLS interface ap_fifo depth=1 port=ovflo
//#pragma HLS interface ap_fifo depth=FFT_LENGTH port=in,out
We also comment out the following pragmas, which no longer apply once the arguments are changed as described below:
//#pragma HLS data_pack variable=in
//#pragma HLS data_pack variable=out
The arguments of a top-level function are also restricted in data width: each must be 8, 16, 32, or 64 bits wide. We therefore need to change the data types of complex<data_in_t> in[FFT_LENGTH] & complex<data_out_t> out[FFT_LENGTH]. For simplicity, we define the arguments (in & out) as 32-bit floating point (float) and convert the data type inside the HW function.
- Let's redefine fft_top(). Note that the real & imaginary parts each need their own argument. The .h file:
// Use generic C types for HW function arguments
void fft_top(
    bool direction,
//  cmpxDataIn in[FFT_LENGTH],
//  cmpxDataOut out[FFT_LENGTH],
    float in_re[FFT_LENGTH], float in_im[FFT_LENGTH],
    float out_re[FFT_LENGTH], float out_im[FFT_LENGTH],
    bool* ovflo);
... and .cpp file:
void fft_top( ... )
{
    ...
//  dummy_proc_fe(direction, &fft_config, in, xn);
    dummy_proc_fe(direction, &fft_config, in_re, in_im, xn);
    ...
//  dummy_proc_be(&fft_status, ovflo, xk, out);
    dummy_proc_be(&fft_status, ovflo, xk, out_re, out_im);
}
- We rewrite HW sub functions accordingly:
void dummy_proc_fe(
    bool direction,
    config_t* config,
//  cmpxDataIn in[FFT_LENGTH],
    float in_re[FFT_LENGTH],
    float in_im[FFT_LENGTH],
    cmpxDataIn out[FFT_LENGTH])
{
    int i;
    config->setDir(direction);
    config->setSch(0x2AB);
    for (i=0; i< FFT_LENGTH; i++){
//      out[i] = in[i];
        out[i].real(in_re[i]);
        out[i].imag(in_im[i]);
    }
}

void dummy_proc_be(
    status_t* status_in,
    bool* ovflo,
    cmpxDataOut in[FFT_LENGTH],
    float out_re[FFT_LENGTH],
    float out_im[FFT_LENGTH]
    /*cmpxDataOut out[FFT_LENGTH]*/)
{
    int i;
    for (i=0; i< FFT_LENGTH; i++){
//      out[i] = in[i];
        out_re[i] = in[i].real();
        out_im[i] = in[i].imag();
    }
    *ovflo = status_in->getOvflo() & 0x1;
}
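The split/merge conversion done by the two helper functions above can be tried on a desktop compiler. The following is only a minimal sketch: std::complex<float> stands in for the fixed-point cmpxDataIn / cmpxDataOut types, and kLen, cplx_t, merge_planes() and split_planes() are hypothetical names that do not exist in the fft_single sources.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>

// Hypothetical stand-ins: the real project takes FFT_LENGTH and the
// fixed-point complex types (cmpxDataIn / cmpxDataOut) from fft_top.h.
const std::size_t kLen = 8;
typedef std::complex<float> cplx_t;

// Merge separate real/imaginary planes into one complex array,
// as dummy_proc_fe() does on the way into the FFT core.
void merge_planes(const float* re, const float* im, cplx_t* out) {
    assert(re && im && out);
    for (std::size_t i = 0; i < kLen; ++i) {
        out[i] = cplx_t(re[i], im[i]);
    }
}

// Split a complex array back into real/imaginary planes,
// as dummy_proc_be() does on the way out of the FFT core.
void split_planes(const cplx_t* in, float* re, float* im) {
    assert(in && re && im);
    for (std::size_t i = 0; i < kLen; ++i) {
        re[i] = in[i].real();
        im[i] = in[i].imag();
    }
}
```

With std::complex<float> the round trip is lossless. In the real HW function the values pass through fixed-point types, so some quantization is to be expected there.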
and main():
// static cmpxDataIn xn_input[SAMPLES];
// static cmpxDataOut xk_output[SAMPLES];
float in_re[SAMPLES] = {0};
float in_im[SAMPLES] = {0};
float out_re[SAMPLES] = {0};
float out_im[SAMPLES] = {0};
...
// xn_input[line_no-5] = cmpxDataIn(input_data_re, input_data_im);
in_re[line_no - 5] = input_data_re;
in_im[line_no - 5] = input_data_im;
...
// fft_top(FWD_INV, xn_input, xk_output, &ovflo);
fft_top(FWD_INV, in_re, in_im, out_re, out_im, &ovflo);
...
//if (golden != xk_output[i].real())
if (golden.to_float() != out_re[i])
{
    ...
    cout << "Frame:" << frame << " index: " << i
         << " Golden: " << golden.to_float()
         << " vs. RE Output: " << setprecision(14)
         << out_re[i] /*xk_output[i].real().to_float()*/ << endl;
}
...
//if (golden != xk_output[i].imag())
if (golden.to_float() != out_im[i])
{
    error_num++;
    cout << "Frame:" << frame << " index: " << i
         << " Golden: " << golden.to_float()
         << " vs. IM Output: " << setprecision(14)
         << out_im[i] /*xk_output[i].imag().to_float()*/ << endl;
}
After building the project again, we encounter new errors (linker errors), as shown below:
So we need separate definitions of fft_top() for the SW part and the HW part. To do that, we can use the __SDSVHLS__ macro:
#ifdef __SDSVHLS__
// This part is compiled into HW function by Vivado HLS
void fft_top(...)
{
//#pragma HLS interface ap_hs port=direction
    ...
}
#else
// This part is compiled as SW function by gcc and calls HW function
#include <stdio.h>
void fft_top(
    bool direction,
    float in_re[FFT_LENGTH], float in_im[FFT_LENGTH],
    float out_re[FFT_LENGTH], float out_im[FFT_LENGTH],
    bool* ovflo)
{
    printf("SDSoC Stub Function %s() ...\n", __FUNCTION__);
}
#endif
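To see the macro-based split in isolation, here is a small sketch that compiles with an ordinary C++ compiler. negate_all() and compiled_for_hls() are hypothetical toy functions, not part of fft_single. Unlike the stub above, the software branch here computes a real result, which is just one possible choice if you want the test bench to stay meaningful outside SDSoC.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical toy function, not part of fft_single. It only shows how
// one of two definitions is selected at compile time.
#ifdef __SDSVHLS__
// Seen only when Vivado HLS synthesizes the HW function.
void negate_all(const float in[4], float out[4]) {
    for (int i = 0; i < 4; i++) {
        out[i] = -in[i];
    }
}
#else
// Seen by the ordinary C++ compiler. The tutorial's stub only prints a
// message; this sketch also computes a software reference result.
void negate_all(const float in[4], float out[4]) {
    assert(in && out);
    std::printf("SW version of %s()\n", __func__);
    for (int i = 0; i < 4; i++) {
        out[i] = -in[i];
    }
}
#endif

// Reports which branch this translation unit was built with.
bool compiled_for_hls() {
#ifdef __SDSVHLS__
    return true;
#else
    return false;
#endif
}
```

On a desktop build __SDSVHLS__ is not defined, so compiled_for_hls() returns false and the software branch runs.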
After all those changes, the project builds with no errors. However, the estimated performance is unreasonably poor: about 3 seconds per execution.
To reduce data transfer time, we use sds_alloc() & sds_free() to allocate and release the memory for the input/output data.
First, we have to include the header:
#include <sstream>
#include "sds_lib.h" // for sds_***()
using namespace std;
... then allocate the buffers using sds_alloc():
// float in_re[SAMPLES] = {0};
// float in_im[SAMPLES] = {0};
// float out_re[SAMPLES] = {0};
// float out_im[SAMPLES] = {0};
float* in_re = (float*) sds_alloc(SAMPLES*sizeof(float));
float* in_im = (float*) sds_alloc(SAMPLES*sizeof(float));
float* out_re = (float*) sds_alloc(SAMPLES*sizeof(float));
float* out_im = (float*) sds_alloc(SAMPLES*sizeof(float));
... and remember to release the buffers using sds_free():
sds_free(in_re);
sds_free(in_im);
sds_free(out_re);
sds_free(out_im);
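The allocation pattern above can be wrapped so that the same test bench also builds outside SDSoC. This is only a sketch under two assumptions: that the SDSoC compilers define __SDSCC__ (as described in UG1027), and that freshly allocated buffers should be cleared explicitly because the stack arrays they replace were initialized with {0}. alloc_samples() and free_samples() are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

#ifdef __SDSCC__
#include "sds_lib.h"  // sds_alloc() / sds_free()
#endif

// Hypothetical helper (not in the original project): allocate a
// zero-filled float buffer. Returns NULL if the allocation fails.
float* alloc_samples(std::size_t count) {
    assert(count > 0);
#ifdef __SDSCC__
    float* p = static_cast<float*>(sds_alloc(count * sizeof(float)));
#else
    float* p = static_cast<float*>(std::malloc(count * sizeof(float)));
#endif
    if (p != NULL) {
        // Assume nothing about fresh heap contents; clear explicitly.
        for (std::size_t i = 0; i < count; ++i) {
            p[i] = 0.0f;
        }
    }
    return p;
}

// Release a buffer obtained from alloc_samples(); NULL is ignored.
void free_samples(float* p) {
    if (p == NULL) {
        return;
    }
#ifdef __SDSCC__
    sds_free(p);
#else
    std::free(p);
#endif
}
```

In main() this would read, for example, float* in_re = alloc_samples(SAMPLES); followed by a NULL check before the buffer is used.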
Optionally (*1), in fft_top.h, we can add SDSoC pragmas that make SDSoC infer simple DMA (AXI_DMA_SIMPLE) for faster data transfer:
#pragma SDS data mem_attribute(in_re:PHYSICAL_CONTIGUOUS)
#pragma SDS data mem_attribute(in_im:PHYSICAL_CONTIGUOUS)
#pragma SDS data mem_attribute(out_re:PHYSICAL_CONTIGUOUS)
#pragma SDS data mem_attribute(out_im:PHYSICAL_CONTIGUOUS)
void fft_top(
...
*1 In this case, SDSoC already infers simple DMA automatically, so these #pragmas are not strictly necessary.
To reduce function execution time, we also want to apply inlining to HW sub functions:
void dummy_proc_fe( ... )
{
#pragma HLS INLINE
    ...
}
void dummy_proc_be( ... )
{
#pragma HLS INLINE
    ...
}
... and loop pipelining as usual:
void dummy_proc_fe( ... )
{
    ...
    for (i=0; i< FFT_LENGTH; i++){
#pragma HLS PIPELINE
        ...
    }
}
void dummy_proc_be( ... )
{
    ...
    for (i=0; i< FFT_LENGTH; i++){
#pragma HLS PIPELINE
        ...
    }
}
Let's estimate performance again...
The estimated cycle count is now far lower, and resource utilization is slightly lower as well. That looks O.K.
- Make sure Generate bitstream & Generate SD card image are checked in the Options section of SDx Project Settings.
- Then build the project (Project -> Build), which will finish successfully. Below is Data Motion Network result. We can see simple DMA is implemented for in/out data.
- Copy the contents of sd_card folder & data folder into an SD card.
- Insert the SD card into your Zybo & power on to boot Linux.
- cd to /mnt (where the program is located) & run the program as follows:
- Test PASSED!!! We are now able to accelerate FFT on FPGA without writing FFT code.
The SD card files (except for image.ub) are available in the repo: fft_single/sd_card