 Open Access
 Total Downloads : 1039
 Authors : Suresh Kumar Dunna, B Vijaya Bhaskar , R Suryaprakash
 Paper ID : IJERTV1IS6415
 Volume & Issue : Volume 01, Issue 06 (August 2012)
 Published (First Online): 30082012
 ISSN (Online) : 22780181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Implementation of Higher Order FFT Processor Using FPGA
Implementation of Higher Order FFT Processor Using FPGA
Suresh Kumar Dunna1, B Vijaya Bhaskar2 , R Suryaprakasp
1M.Tech 2ndyear, VLSI System Design, Dept of ECE, St.Theressa Inst. of Engg and Tech , Garividi, Vijayanagaram, AP, India.
2HOD, Dept of ECE, St.Theressa Institute of Engineering & Technology , Garividi, Vijayanagaram, AP. India.
3Assistant Professor, Dept of ECE, St.Theressa Institute of Engineering & Technology , Garividi, Vijayanagaram AP. India.
Abstract In this paper, our objective is to detail knowhow and techniques that can help the designer of electronic circuits to develop and to optimize their own IP in a reasonable time. For this reason, we propose to optimize existing FFT algorithms for lowcost FPGA implementations. For that, we have used short length structures to obtain higher length transforms. Indeed, we can obtain a VLSI structure by using log4(N) 4point FFTs to construct Npoint FFT rather than (N/8) logs (N) 8point FFTs. Furthermore, two techniques are used to yield with VLSI architecture. Firstly, the radix4 FFT is modified to process one sample per clock cycle. Secondly, the memory is shared and divided into 4 parts to reduce the consumed resources and to improve the overall latency. (Abstract)
KeywordsFPGA,8pointFFT,4pointFFT,spatial distribution, temporal distribution

To meet with this challenge, we present in this paper a VLSI architecture to allow the implementation of high order FFT into low cost FPGAs. The remainder of this paper is organized as follows. In section II, definition and two kinds of distributions (spatial and temporal) are introduced. Section III is devoted to the proposed low area architecture. We detail the principle and the structure of 64point FFT which may be generalized to higher orders. Then, techniques to save area are illustrated. Section IV presents the experimental results and comparisons with IP core and prior works quoted in the literature. Finally, we summarize and conclude this paper in section V.


Definition
For a given sequence x of n samples, the DFT frequency components X (k) may be defined.
The Discrete Fourier Transform (DFT) is one of the most important tools used in Digital Signal Processing applications. It has been widely implemented in digital communication systems such as Radars, Ultra Wide Band (UWB) receivers
and many other applications. Computing this operation has
X (k )
2 j
N 1
N
x(n)W n.k
n 0
(1)
a high computational requirement and needs a large number
Where
WN e N
is the twiddle factor, n and k are
of operations (N2 complex multiplications and N. (N 1) respectively the time and frequency indexes,
complex additions).This makes computing and implementation very difficult to realize. To reduce the number
0 k N
1, 0 n
N 1 and N is the DFT length.
of operations a fast algorithm has been introduced by CooleyTukey[1] and called Fast Fourier Transform (FFT). The latter, reduces complexity from O(N2) to O(NlogN).
Let us consider N M.T , k s T.t and
n l M.m, where M ,T are integer and
Other researchers, propose numerous techniques such as radix4 [2], split radix [3] to avoid radix2structure in order to reduce the complexity of FFT algorithm. These
s,l 0,1…M 1 and t, m 0,1…T
considerations in (1), we obtain (2).
1 . Applying these
architectures are either based on the DecimationinTime
X (s
T.t)
M 1T
1
x(l
M .m)W
l M .m
s T .t
(DIT) or on the DecimationinFrequency(DIP). Several designs based on these architectures were proposed in order to
l 0 m 0
M .T
(2)
implement these algorithms. On the other hand, there is a growing interest in Field Programmable Gate Arrays (FPGAs)
It can be found that (2) is equivalent to
because of their potential to substantially accelerate computational intensive algorithms such as FFTs.
X (s
T.t)
M 1T
1
x(l
M .m)W
l M .m
s T .t
(3)
Unfortunately, high order FFT are almost implemented into
l 0 m 0
M .T
high cost FPGAs. For example, it is not possible to instantiate S12point FFT with the Xilinx IP core to implement it in Spartan 3 family.
and finally, (3) can be rewritten
X (s
T.t)
M 1 T
W
l.t
M
1
M .T
W l.s x(l
M .m)W m.s
(4)
by S2P blocks which are implemented by means of delay registers. On the other side, the control unit manages the
T
l 0 m 0
Equation (4) means that it is possible to realize Npoint FFT by first decomposing into one Mpoint and one T point FFT where N = M.T, and then combining them. To illustrate this by example, we take the 64point as a case study after that we can make generalization to a higher order. To perform 64point FFT we may choose M = T = 8. Then equation (4) is
input data addresses. The first 8point input data has the address in the format 8j, j E {O, 1, … 7}. On the 56th clock cycle these data have been proceeded to the first stage of 8point FFT. After 5 clock cycles, the 8point FFT outputs are available and multiplication can be started. Similarly, on the 57th clock cycle, data indexed 8j + 1 will be transformed by the first 8point FFT and after 7 clock
X (s
8.t)
7 7
W
l.t
8
l 0 m 0
W l.s x l
8.m W m.s
(5)
cycles, results data will be available at the multiplier output. And so one until the last result of multiplier output which will be available at the 71st clock cycle.
64
8
Equation (5) means that is possible to express the 64 point FFT by twodimensional structure of 8point FFT. The processing element of higher order FFT according to equation (5) is the 8point. Hence, the performance of high length depends in 8point performance. The choice of 8point FFT structure becomes crucial. In this work, the 8point FFT architecture used is the Split Radix DIT because of its lower number of arithmetic operations.

Spatial distribution
One possible realization of the 64point FFT is presented in the Signal Flow Graph (SFG) of Fig. 1. It can be observed that computing 64point FFT is composed on five levels. The first level is composed of two serial to parallel blocks used to store real and imaginary part of data presented in a serial way. The second floor is composed of 8 blocks of 8point FFT Split Radix DIT. The third block contains 49 complex multipliers used to compute non trivial complex multiplication. The fourth is similar to the second one. the last level is composed of two parallel to serial blocks gives data in a serial way. At the 64th clock cycle all input data are ready to be proceeded. After 5 clock cycles, the 8point FFT outputs are available and multiplication can be started. Block multiplier needs 2 clock cycles to perform the 49 complex multiplications. The 64point FFT outputs are available 5 clock cycles after the last stage of 8point FFT transformation. Hence, the main advantage of this architecture is the high speed and lowlatency. However, the implementation of this architecture on FPGA needs high memory, high number of complex multipliers and complex adders. Therefore, this architecture is not suitable for low cost FPGA such as Spartan 3 family.

Temporal distribution
Another possible realization of the 6point FFT is illustrated in Fig. 2. According to this structure, the first stage is realized by one block of 8point FFT rather than 8 as in Fig.
1. Similarly, the third stage is performed by only one block of 8point FFT rather than 8. Consequently, the control unit in Fig. 2 plays an important role to synchronize all the treatments. This architecture performs FFT in a pipeline way.
First, input data comes in a serial manner. To perform the computation input data have to be parallelized. This is realized
These results are stored on the fly on 64complex data memory. Likewise, the second 8point FFT stage will proceed the stored data to compute 64point FFT.
Fig. 1. Signal Flow Graph of the spatial distribution FFT architecture.
Fig 2. Signal Flow Graph of the temporal distribution FFT architecture

Compromise analysis
Some concluding remarks related to this section have to be drawn. Firstly, decomposing a high length FFT to 8point FFTs may be done in a spatial or in temporal distribution. In terms of throughput, the two distributions present one
complex output per clock cycle since data have to be serialized by P2S component. On the other hand, the latency which represents the elapsed time to get the first result is the same. In fact, for a given N = 8n where n is the number of stages, the latency in both architectures may be expressed as L(N) = N + 7log8(N 2). The main difference between the two distributions is the consumed area. Obviously, the second architecture consumes averagely 7 times less area than the first one. The number of 8point
memorizing unit, it is used also to generate addresses of the inputs and the outputs of each block.
3.2.1 Radix4 modification: Outputs of such algorithm are presented in next equations
A C
FFT blocks pass from 16 to 2 and the number of nontrivial multiplier pass from 49 to 7. Furthermore, the complex data memory used in Fig.2 may be avoided by storing the multiplier
X (0)
x(0)
x(2)
x(1)
x(3)
Outputs on S2P register. Indeed, since input data at B D
address 8j, 8j+l,.. Are proceeded one can use these addresses
to store the multiplier outputs.
Definitively, the major drawback of the decomposition
of high length FFT on 8point FFTs is related to the hardware consumed resources of the 8point FFT. Synthesis results of the split radix DIT description of 8 point FFT show that the percentage of occupied slices in Spartan3E XC3S500 is about 30%. Therefore, to design a higher order FFT, the FPGA resources will be overflowed. Another drawback is about the limitation of the number of input with exclusively 8point FFT elements since N = 8n. To overcome this problem we replace the 8point FFT by a 4point FFT using radix4 algorithm. This choice is
X (1)
X (2)
X (3)
x(0)
x(0)
x(0)
x(2)
x(2)
x(2)
j(x(1) x(1)
jx(1)
x(3)) x(3)
jx(3)
(8)
reinforced by the synthesis results of radix 4 in terms of slice occupation which is about 2%.



Definition
The Npoint FFT equation can be split into three stages according to next equation.
The SFG of the radix4 structure is illustrated in Fig. 4.
It is shown that radix4 algorithm is composed of 8 complex additions/subtractions.
In order to reduce the number of complex multipliers, after each 4point FFT and to keep the pipeline way in computation of the design we modify the 4point FFT architecture.
X (s
Mq MKp)
L 1 M 1 K 1
l 0 m 0 k 0
x(l, m, k)W (MKl Mm k )( MKp Mq s)
N
(6)
For N = 64, one possible solution consists on constructing the 64point FFT according to the temporal distribution by using 8point, 4point FFT and 2point FFTs. The obtained design is not highly structured and inhomogeneous. The second solution consists in constructing the 64point FFT by three stages of 4point FFT. For L = M = K = 4, 64point FFT equation can be written as
X (s
4q 16 p)
3 3 3
64
x(l, m, k)W (16l 4m k )(16 p 4q s)
Fig 3. Signal Flow Graph of the proposed low area 64point FFT

Optimizations

l 0 m 0 k 0
(7)
architecture.
Usually, the radix4 is computed as multiinputs multi outputs system. This structure requires 4 multipliers in one
Using the radix4 processing element, we can represent the64point FFT according to SFG in Fig. 3. The 64point FFT is composed of a control unit, three blocks 4point FFT units, two blocks multipliers units with two phase generator units and a complex 64point memory unit. The control unit, indeed of managing the FFf4, multipliers and
clock cycle. It is true that this structure presents a high speed design, but almost a P2S block is used to serialize data. For these reasons, we rectify the architecture in order to have one multiplier per clock cycle. So, the resulting design have one complex input and give one complex output per clock cycle as represented in Fig. 4. Intermediate signals A, B, C
and D used in the diagram are indicated to understand the parallel computing.
3.2.2 Sharing memory: For each output of the 4point sFFT block the phase generator generates the correspondent twiddle factor and the multiplier unit performs the complex multiplication and stores the result on 64 complex data memory. This last will be reused and shared between all the blocks as it is shown on Fig. 3.
Usually, computing 64point FFT based on 4point FFT needs 3 complex memories. In our architecture we use only one complex 64point. Moreover, this memory is divided into four small 16point complex memories in order to improve the latency. Indeed, the problem behind this consists in using one shared memory with only one writer port. This is impossible since a part of data already saved in the memory are not used. Furthermore, if we use a dual port memory, this will be synthesized as BRAM blocks which are oversize and available in limited number in low cost FPGAs.
Fig.4 Radix4 butterfly diagram


Synthesis results
Fig 5.TOP LEVEL

Implementation results
Fig 6.implementation results of higher order FFT

Techniques to implement high order FFT into low cost FPGAs were presented and validated. After a comprehensive and a comparative study of existing high order FFTs, an optimized architecture of 64point FFT was proposed. The transition between 64point and 256point was exploited. Higher order FFTs could be obtained with the same manner. Our future work for the FPGA implementation will be devoted to the optimization of the block multiplier and the use of the method proposed in [7]to replace embedded multipliers.
REFERENCES

Yousri Ouerhani, Maher Jridi and A. Alfalou,implementation techniques of higher order FFT into low cost FPGA Equipe Vision, Laboratoire L@bISEN, CS 42807, 29228 Brest Cedex 2, Francee mail: {yousri.ouerhani.maher.jridi and ayman.alfalou}@isen.fr

W. Cooley and 1. Tukey, An algorithm for the machine calculation of Complex Fourier series, Math. Comput., vol. 19, pp. 297301, April 1965.

A. Y. Oppenheim, R. W. Schafer, and J. R. Buck,DiscreteTime Signal Processing, 2nd ed. Englewood Cliffs, NJ: PrenticeHall, 1998.

H. Sorensen, M. Heindeman, and C. Burrus, On computing the splitradix FFT, IEEE Trans. Acoustics, Speech, Signal Process, vo1.34, pp. 152156, 1986.

K. Maharatna, E. Grass, and Ulrich Jagldhold, A 64Point Fourier Transform Chip for HighSpeed Wireless LAN Application Using OFDM, IEEE 1. SolidState Circuits, vol. 39, pp. 484493, March 2004.

Xilinx Product Specification, High perfomance 64point Complex FFTIIF Y.7.0 June 2009 [online]. Available on: http://www.xilinx.com/ipcenter.

M. Jridi and A. Alfalou, A LowPower. HighSpeed DCT architecture for image ompression: principle and implementation, in Proc. VLSI Syst. in Chip Conf (VLSISoC), pp. 304309, Sept 2010.

M. Jridi and A. Alfalou, Direct Digital Frequency Synthetizer with CORDIC Algorithm and Taylor Series Approximation for Digital Receivers, Euro Journal of Scientific Research, vol. 30, No. 4, pp. 542553, 2009.