 Open Access
 Total Downloads : 1411
 Authors : T.Tirumala Koteswara Rao, S. Sarath Chandra
 Paper ID : IJERTV2IS100873
 Volume & Issue : Volume 02, Issue 10 (October 2013)
 Published (First Online): 25102013
 ISSN (Online) : 22780181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Implementation of 64Point FFT Processor Based on Radix2 Using VERILOG
T.TIRUMALA KOTESWARA RAO1, S. SARATH CHANDRA2
Student of M. Tech Department of Electronics and Communication Engineering1, QIS Institute Of Technology, Ongole.
Associate Professor2, Department of Electronics and Communication Engineering, QIS Institute Of Technology, Ongole.
Abstract
A Fast Fourier transform is an efficient algorithm to compute the discrete Fourier Transform (DFT). The operation has a high computational requirement of large number of operations (N2 complex multiplications and N (N1) additions). This makes computational and implementation very difficult. Short length structures are can be obtain higher length FFT. To obtain VLSI structure by using 4point FFTs to construct Npoint FFT rather than 8point FFT. In this paper the proposed architecture is higher order FFT and it is split into three stages and each stage is radix2 based 4point FFT to reduce the number of operations. In this architecture each stage requires 8 complex additions/subtractions to reduce the no of complex multiplications after each 4point FFT and to keep pipe line way of computation of design. Proposed architecture is implemented using verilog HDL XILINX ISE 12.1. The performance of the proposed architecture is implemented in terms of relative error. The proposed architecture gives best compromise in terms of speed.
Key words: FPGA, 8point FFT, 4point FFT, spatial distribution, temporal distribution.
1. Introduction
The Discrete Fourier Transform (DFT) is one of the most important tools used in digital signal processing applications. It has been widely implemented digital communications such as Radars, Ultra wide band receivers (UWB) and many other applications. Computing this operation has high computational requirement and large number of operations (N2 complex multiplications and N (N1)
additions).This makes computing and implementation very difficult to realize. To reduce the number of
operations a fast algorithm has been introduced by CooleyTukey [2] called Fast Fourier Transform (FFT). Later FFT reduces the computational complexity from O (N2) to O (NlogN). To reduce the complexity of FFT algorithm other researchers propose numerous techniques like radix4 [2], split radix [3]. By using these two techniques we can able to avoid the radix2 structure. These architectures are based on either Decimation in Time Domain (DIT) or Decimation in Frequency (DIF). Much other architecture was proposed on the basis of these architectures. In another way there is growing interest in the Field of Field Programmable Gate Arrays (FPGA). FPGAs have potentially substantially accelerated computational algorithms like FFTs. The Higher Order FFTs are implemented by using HighCost FPGAs. It is not possible to instantiate 512point FFT with the XILINX IP core to implement in Spartan3 family.
To reach this challenge, we present a VLSI structure to allow higher order FFT to be implemented into low cost FPGAs. The remainder this paper is organized as follows. The section I regarding the back ground work for the DFT. Algorithm, in section II is devoted to the proposed low architecture and section III is described about the two kinds of distributions (Temporal and Spatial Distributions). In section II we detail the principle and structure for generalized to higher order FFTs. After that techniques to save area are illustrated. Section IV is presents experimental results and comparisons with IP core and former work quoted in the literature. In Section V finally we conclude the paper.
I Background
For a given sequence x of n samples, the Discrete Fourier transform (DFT) frequency components X (k) may be defined.
X (k) x(n)W
X (k) x(n)W
N 1
n.k
N
n0
2 j
(1)
is not highly and inhomogeneous. Another solution for constructing the 64point FFT is to split 64points into three stages of 4point FFTs. For L=M=K=4, then the Equation for 64point FFT is given by
where ,
WN e N
the twiddle factors, n and k are
3 3 3
(16l 4mk )(16 p4qs)
respectively the time and frequency domain indexes.
X (s 4q 16 p) x(l, m, k)W64
l 0 m0 k 0
0 k N 1, 0 n N 1
And N is the DFT
(7)
length. Let us consider that the N = M. T, k = s + T. t and n = l +M. m, where M, T are integers s, l {0,
1.M1} and t, m{0,1.T1}. Applying these considerations in (1), we obtain (2)
M 1 T 1
M 1 T 1
(2)M .T
(2)M .T
X (s T.t) x(l M .m)W l M .msT .t
l 0 m0
It can be found that (2) is equivalent to
M 1 T 1
(M .T
(M .T
X (s T.t) x(l M .m)W l M .msT .t 3)
l 0 m0
Y And finally, (3) can be rewritten
Optimizations
The 64point FFT according to the signal flow graph of fig 3 Radix4 is the processing element. The 64point FFT composed of a control unit and 3 blocks of 4point FFT units, two blocks of multipliers with two phase generator units, with a complex 64point memory unit. The control unit managing FFT4 blocks and multipliers and memory unit. The control unit also generates addresses of inputs and out puts of each block.

Radix4 modification: Outputs of such algorithms are presented by the following
M 1
T 1
equations
X (s T.t) W l.t W l.s x(l M .m)W m.s (4) A C
M M .T T
l 0 m0
X (0) x(0) x(2) x(1) x(3)
Equation (4) means that it is possible to realize Npoint FFT to first Decomposing into one Mpoint and one Npoint FFT where N =M.T and then combining them. For example take 64point as a case study after that it is generalized for higher order FFT. To perform the 64 point FFT take N=M=8. Then equation (4) is given as follows:
B D
X (1) x(0) x(2) j(x(1) x(3))
X (2) x(0) x(2) x(1) x(3)
X (3) x(0) x(2) jx(1) jx(3)
(8)
7
7
X (s 8.t)
W l.t
W l.s xl 8.mW m.s (5)
The signal flow graph of radix4 structure is
7
7
l 0
8 64 8
m0
given in the fig2. The radix4 algorithm is composed of
8 complex additions/subtractions. To reduce the
Equation (5) illustrates that the 64point is expressed by using twodimensional structure of 8 point FFT. According to the Equation (5) the processing higher order element is 8point.

LOW AREA ARCHITECTURE
The Npoint FFT split in to three stages according to next equation.
X (s Mq MKp) x(l, m, k)W
X (s Mq MKp) x(l, m, k)W
L1 M 1 K 1
(MKl Mmk )( MKp Mq s)
N
l 0 m0 k 0
(6)
The 64point FFT may be constructed by either of 8point, 4point, 2point FFTs. The structured
number of complex multipliers after each 4point stage and to perform the pipeline way of computational design we have to modify the 4point FFT architecture.
The Radix4 is computed as multi input and multi output. This structure requires 4 multipliers in one clock cycle. This structure presents a high speed design but it requires a S2P block to serialize data. For this reason we have to re design the proposed architecture. The resulting design produces one output for one clock cycle. Inter mediate signals are used to know the parallel processing of the data.
x(0)
X (0)
Control unit
Control uni


Memory unit: Phase generator generates the twiddle factor for each output of 4point FFT. The multiplier performs the complex multiplication and stores it in the 64point complex memory. This memory is also known as shared memory. By using sharing memory, the memory is reused and shared between all the blocks.
x(2)
x(1)
x(3)
Fig2 Radix4 butter fly diagram
X (0)
X (0)
X (0)
FFT
4St age 1
FFT
4St age 1
Multip lier
Phase genera
FFT
4St
age 2
FFT
4St
age 2
FFT
4St
age 3
FFT
4St
age 3
Multip lier
Phase genera
The 8point FFT outputs are available and multiplication can be started after 5 clock cycles. The complex multiplications done by Block multiplier needs two clock cycles to perform 49 complex multiplications. 5 clock cycles after the last stage of 8 point FFT blocks the outputs of 64point FFTs are available. The main advantage of this architecture is high speed and low latency. The implementation of this architecture on FPGA needs high memory and high number of complex multiplications and additions.
Fig1 Signal Flow Graph of the Proposed Low area architecture
The computation of 64point FFT based on 4 point FFT needs 3 complex memories. But in the architecture we are using only one memory and that is divided in to the four small parts containing 16point complex memories in order to improve the latency. By using shared memory we have a problem that it has only one write port. A part of data is saved already and it is not used. If we use the dual port memory and that will be synthesized as BRAM. These are available in the low cost FPGAs.

Techniques for Proposed Low Area

Spatial Distribution
The possible realization of 64point FFT is presented in the signal flow graph of fig1. In this distribution the computation of 64point FFT is composed on five levels. The First levels have two serial to parallel converters used to store real and imaginary data. The second level composed of 8 blocks of 8point FFT split radix DIT. The non trivial multiplication is given by the third block and it has 49 complex multipliers. The second level is as same as that of second level. Final level is composed of two P2S blocks to give the data in a serial way. All the input data are ready to processed at 64th clock cycle.
For this reason the architecture is not suitable for Low cost FPGA such as Spartan 3 family.

Temporal distribution
The 64point FFT may also be realized by temporal distribution. The Temporal distribution has also having 5 levels. In the second and fourth levels it has the only one block of 8point FFT when compared with spatial distribution. The First level is composed of one S2P block to serialize the input data. The input data may be parallised in order to perform the computation. The S2P blocks are implemented by delay registers. The control unit manages the input data addresses and the first 8point input data has the address in the form of 8j, j E (0, 1.. 7).
At the 56th clock cycle the input may
processed to the first stage of 8point FFT. The 8point FFT outputs are available and multiplication can be started after the 5 clock cycles. At the 57th clock cycle the indexed data 8j+1 may be transformed by the first 8point FFT and the multiplier outputs are available after 7 clock cycles.The last result of multiplier available at the 71st clock cycle. The results are stored in the complex 64point memory. Similarly, the second 8point FFT may be processed and stored in the 64 point complex memory.
S2P
S2P
FFT8
FFT8
Multiplier
Multiplier
FFT8
FFT8
P2S
P2S
Control unit
Control unit
FFT8
FFT8
FFT8
FFT8
FFT8
FFT8
FFT8
FFT8
Real part
FFT8
FFT8
FFT8
FFT8
Imagin ary
S2P
S2P
P2S
P2S
part
S2
FF T8
FF T8
P Multi
S2 P
Phase
P2 Real
64
poin t com plex
64
poin t com plex
FF T8
FF T8
P 2S
P 2S
part
Imagina ry part
FFT8
FFT8
FFT8
FFT8
FFT8
FFT8
FFT8
FFT8
FFT8
FFT8
FFT8
FFT8
FFT8
FFT8
FFT8
FFT8
Fig3 Signal flow graph of the spatial distribution

Compromise analysis
In both spatial and temporal distributions the 64point is first decomposed into 8point FFTs. In terms of throughput both are same and latency in both cases is also same. The latency in both the architectures are given by the formula L (N) = N+7log (N2). The consumed area is the major difference between the two distributions. The temporal distribution consumes 7 times less area when compared with spatial distribution because it using only one block of 8point FFT instead of 8 in the spatial distribution. The number of multiplications and number of 8point blocks is reduced to 7 and 2 respectively. The complex data memory may be removed by using the multiplier to store the result.
Fig4 Signal flow graph of the temporal distribution
To low cost FPGAs. The other limitation is that the number of inputs by 8point FFT we have inputs N=8n . To overcome this problem the 8point may be split into 4point using radix4 algorithm. By using this algorithm we can able to reduce 2% of occupied slices in spartan3E XC3S500.


Results

Simulation Results
relative error of real value
relative error
relative error
0.6
0.4
0.2
relative error
relative error
0
0.2
0.4
0.6
The Multiplier outputs are stored in the S2P registers by using the addresses 8j,
0.8
0 10 20 30 40 50 60 70
real value of input sequence
8j+1.. Proceeded one can use the addresses.
The major drawback of decomposition of higher order FFT into 8point FFT is related to the hardware consumed resources. The percentage of occupied slices in spartan3E XC3S500 while we are using split radix DIT description with 8point FFT is 30%. The resources are over flowed to design higher order FFT, for this reason we are optimizing our design
Fig5 Relative error of real value of input sequence
The relative error may be calculated by comparing the MATLAB simulation results with the verilog coding implemented for low cost FPGAs. The phase relative error is as follows. The Relative phase error of phase value gives us the phase angle and the relative error of real value.
0.2
0.1
0
relative error
relative error
0.1
0.2
0.3
0.4
0.5
relative error of phase angle
Rev
Fig6 Relative error of phase value to the input sequence
Rev
Fig6 Relative error of phase value to the input sequence

The techniques to implement higher order FFT into Low Cost FPGA are proposed and implemented. An optimized architecture proposed for higher orders after a detailed study and it is implemented. Our Future work is devoted to the FPGA implementation by the optimization of block multipliers and algorithm proposed in [6] to replace embedded multipliers.
relative error 

relative error 

0.6
0 10 20 30 40 50 60 7
phase value of input sequence
Fig6 Relative error of phase value to the input sequence

Synthesis results
Fig7 TOP LEVEL

Implementation results
Fig 8 Implementation Results of higer order FFT
REFERENCES

Yousri Ouerhani, Maher Jridi and A. Alfalou Implementation techniques of higher order FFT in to low cost FPGA.

W. Cooley and 1. Tukey, An algorithm for the machine calculation of Complex Fourier series, Math. Comput., vol. 19, pp. 297301, April 1965.

A. Y. Oppenheim, R. W. Schafer, and J. R. Buck,DiscreteTime Signal Processing, 2nd ed. Englewood Cliffs, NJ: PrenticeHall, 1998.

H. Sorensen, M. Heindeman, and C. Burrus, On computing the splitradix FFT, IEEE Trans. Acoustics, Speech, Signal Process, vo1.34, pp. 152156, 1986.

K. Maharatna, E. Grass, and Ulrich Jagldhold, A 64Point Fourier Transform Chip for High Speed Wireless LAN Application Using OFDM, IEEE 1. SolidState Circuits, vol. 39, pp. 484 493, March 2004.

Xilinx Product Specification, High perfomance 64point Complex FFTIIF Y.7.0 June 2009 [online]. Available on: http://www.xilinx.com/ipcenter.

M. Jridi and A. Alfalou, A LowPower. High Speed DCT architecture for image compression: principle and implementation, in Proc. VLSI Syst. in Chip Conf (VLSISoC), pp. 304309, Sept 2010.

M. Jridi and A. Alfalou, Direct Digital Frequency Synthetizer with CORDIC Algorithm and Taylor Series Approximation for Digital Receivers, Euro Journal of Scientific Research, vol. 30, No. 4, pp. 542553.