 Open Access
 Total Downloads : 8
 Authors : Ragavi.V, K.Renuka
 Paper ID : IJERTCONV1IS06074
 Volume & Issue : ICSEM – 2013 (Volume 1 – Issue 06)
 Published (First Online): 30072018
 ISSN (Online) : 22780181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Design of pipelined parallel FFT architectures Using folding transformation
RAGAVI.V, K.RENUKA

(VLSI Design) Assistant Professor (ECE)
Srinivasan Engineering College Srinivasan Engineering College
ragavimevlsi@gmail.com1 renudev20@gmail.com2
Abstract – In this paper, a novel approach to develop parallel pipelined architectures for the Fast Fourier transform (FFT) is presented. The folding transformation and register minimization techniques are proposed for designing FFT architectures. Novel parallelpipelined 128point radix24 FFT architecture for the computation of complex and real valued fast Fourier transform are derived. For Complex valued Fast Fourier Transform (CFFT), the proposed architecture takes benefit of underutilized hardware in the serial architecture to derive Lparallel architectures not including the increment of hardware complexity by a factor of L. In addition to, the new parallelpipelined architecture for the computation of Realvalued Fast Fourier Transform (RFFT) is presented. To reduce the hardware complexity, the proposed architecture exploits redundancy in the computation of FFT samples. A comparison is shown between the proposed design and the previous architectures.
Index Terms Fast Fourier Transform (FFT), folding, radix24, register minimization.

INTRODUCTION
DFT is one of the most important tools in the field of digital signal processing. Several Fast Fourier Transform (FFT) algorithms have been developed over the years due to its computational complexity. FFT plays a critical role in modern digital communications such as Digital Video Broadcasting (DVB) and Orthogonal Frequency Division Multiplexing (OFDM) systems. The design of pipelined architectures for computation of FFT of complex valued signals (CFFT) has been carried out. Different algorithms have been developed to reduce the computational complexity, of which CooleyTukey radix2 FFT [1] is very popular.
Algorithms such as radix4 [2], splitradix [3] and radix22 [4] have been developed based on the basic radix2 FFT approach. The one of the most classical approaches for pipelined
implementation of radix2 FFT is Radix2 multipath delay commutator (R2MDC) [5]. A standard usage of the storage buffer in R2MDC leads to the Radix2 Singlepath delay feedback (R2SDF) [6] architecture with reduced memory.
The architectures are developed for a specificpoint FFT in [7] and [8], whereas hypercube theory is used to
derive the architectures in [9].The method of developing these architectures from the algorithms is not well established.
In additional, most of these hardware architectures are not fully utilized and require high hardware complexity. In the period of high speed digital communications, the high throughput and low power designs are essential to meet the speed and power requirements while keeping the hardware overhead to a minimum. In this paper, a new approach to design the architecture from the FFT flow graphs is presented. Folding transformation [10] and register minimization techniques [11], [12] are used to derive several known FFT architectures.
If the input samples are real then the spectrum is symmetric and approximately half of the operations are redundant. The applications such as speech, audio, image, radar, and biomedical signal processing, a specialized hardware implementation is best suitable to meet the real time constraints. The implantable or portable device saves power by using this type of implementation which is a key limitation. Few pipelined architectures for real valued signals have been proposed [13] based on the Brunn algorithm. However, these are not widely used. Different algorithms such as doubling algorithm, packing algorithm have been proposed for computation of RFFT. These approaches are based on removing the redundancies of the CFFT while the input is real. RFFT is calculated using the CFFT architecture in an efficient manner [14].
In the folding transformation, many butterflies in the same column can be mapped to one butterfly unit. If the FFT size is N, a folding factor of N/2 leads to 2parallel architecture and in another design, a folding factor of N/4 leads to design 4parallel architectures in which four samples are processed in the same clock cycle. Various folding sets lead to a family of FFT architectures. Alternatively, known FFT architectures can also be described by the proposed methodology by selecting the appropriate folding set. To reduce latency and the number of storage elements, folding sets are designed. The prior FFT architectures were derived in an informal way, and their derivations were not explained in a systematic way. This is the effort to simplify the design of FFT architectures for arbitrary level of parallelism in an efficient manner by
Ragavi.V, K.Renuka
487
means of the folding transformation. In this paper, the prior design architectures are explained by constructing the specific folding sets. Then new architecture is derived for radix and levels of parallelism and for either DecimationIn Time (DIT) or DecimationInFrequency (DIF) flow graphs. The new architecture achieves full hardware utilization. It may be noted that all prior parallel FFT architectures did not achieve full hardware utilization. The new real FFT architecture is also presented based on higher radices.
In [15], the parallelpipelined architectures for the computation of RFFT based on radix22 and radix23 algorithms have been proposed. The real FFT architectures are not fully utilized. This drawback is removed by proposed methodology. The novel parallelpipelined FFT architectures for the realvalued signals with full hardware utilization based on radix24 algorithm is presented. It combines the advantages of radix2n algorithms, which requires fewer complex multipliers when compared to radix
2 algorithm, with the reduction of operations using redundancy.
This paper is organized as follows. The folding transformation and register minimization based FFT architectures design is presented in Section II. The proposed architecture for complex FFT is explained in Section III. The proposed architecture for real FFT is explained in Section IV. In Section V, the proposed architecture is compared with the previous approaches and some conclusions are drawn in Section VI.

FFT ARCHITECTURES DESIGN TECHNIQUES

In this section, the folding transformation method and register minimization to derive several known FFT architectures is illustrated in general. The process is described using an 8point radix2 DIF FFT as an example. It can be extended to other radices in a similar fashion. Fig. 1 shows the flow graph of a radix2 8point DIF FFT. The graph is divided into three stages and each of them consists of a set of butterflies and multipliers. The twiddle factor in between the stages indicates a multiplication by where WN denotes the Nth root of unity, among its exponent evaluated modulo N.
Fig.1 Flow graph of a radix2 8point DIF FFT.
This algorithm can be represented as a data flow graph (DFG) as shown in Fig. 2. The nodes in the DFG represent tasks or computations. In this case, all the nodes represent the butterfly computations of the radix2 FFT
algorithm. Assume nodes A and B have the multiplier operation on the bottom edge of the butterfly. The folding transformation is used on the DFG in Fig. 2 to derive a pipelined architecture.
To transform the DFG, a folding set is required which is an ordered set of operations executed by the same functional unit. Each folding set cntains K entries some of which may be null operations is called the folding factor, i.e., the number of operations folded into a single functional unit. The operation in the jth position within the folding set (where j goes from 0 to K1) is executed by the functional unit during the time partition. The term j is the folding order, i.e., the time instance to which the node is scheduled to be executed in hardware.
Fig. 2 DFG of a radix2 8point DIF FFT.
For example, consider the folding set A = {7, 7, 7, 7, A0, A1, A2, A3} for K=8. The operation belongs to the folding set A with the folding order 4. The functional unit executes the operations A0, A1, A2, A3 at the respective time instances and will be idle during the null operations. The systematic folding techniques are used to derive the 8 point FFT architecture. Consider an edge e connecting the nodes U and V with w (e) delays. The folding equation (1) for the edge e is
DF (U V) = K w(e)PU + v – u (1)
where PU is the number of pipeline stages in the hardware unit which executes the node U [10]. By using folding sets, folding equations are derived with negative delays (w/o pipeline) and non negative delays (with pipeline or retiming). Consider folding of the DFG in Fig.2 with the folding sets
A = {7, 7, 7, 7, A0, A1, A2, A3}
B = {B2, B3, 7, 7, 7, 7, B0, B1}
C = {C1, C2, C3, 7, 7, 7, 7, C0}.
Assume that the butterfly operations do not have any pipeline stages, i.e, PA=0, PB=0, PC=0.Retiming and/or pipelining can be used to either satisfy DFUV) 0 or determine that the folding sets are not feasible [10]. The negative delays on some edges can be observed. The equations are
DF (A0 B0) = 2 DF (B0 C0) = 1
Ragavi.V, K.Renuka
488
DF (A0 B2) = – 4 DF (B0 C1) = – 6 DF (A1 B1) = 2 DF (B1 C0) = 0 DF (A1 B1) = – 4 DF (B1 C1) = – 7 DF (A2 B0) = 0 DF (B2 C2) = 1 DF (A2 B2) = – 6 DF (B2 C3) = 2 DF (A3 B1) = 0 DF (B3 C2) = 0
DF (A3 B3) = – 6 DF (B3 C3) =1 (2)
DF (A0 B0) = 2 
DF (B0 C0) = 1 

DF (A0 B2) = 4 
DF (B0 C1) = 2 

DF (A1 B1) = 2 
DF (B1 C0) = 0 

DF (A1 B1) = 4 
DF (B1 C1) = 1 

DF (A2 B0) = 0 
DF (B2 C2) = 1 

DF (A2 B2) = 2 
DF (B2 C3) = 2 

DF (A3 B1) = 0 
DF (B3 C2) = 0 

DF (A3 B3) = 2 
DF (B3 C3) = 1 
(3) 
The DFG can be pipelined is shown to ensure that folded hardware has nonnegative number of delays. The folded delays for the pipelined DFG are
FOLDING SETS
FOLDING EQUATIONS
RETIMING FOLDING EQUATIONS
FOLDED ARCHITECTURE
REGISTER ALLOCATION
LIFETIME ANALYSIS
Fig. 3 Block diagram of FFT design techniques
The technique for minimizing register is lifetime analysis
[12] which analyzes the time for when a data is produced (Tinput) and when a data finally is consumed (Toutput).T input = u + PU (4)
T output = u + PU + maxv {DF (UV)} (5)
where u is the folding order of U and PU is the number of pipelining stages in the functional unit that executes u. From
Tinput Toutput
NODE
(3) the 24 registers are required to implement the folded architecture. Lifetime analysis technique is used to design the folded architecture with minimum possible registers. For example, in the current 8point FFT design, consider the variables y0, y1,. . . y7, i.e., the outputs at the nodes A0,A1,A2,A3 respectively. It takes 16 registers to synthesize these edges in the folded architecture. The linear lifetime table and lifetime chart for these variables is shown in Fig. 4 and Fig. 5. From the lifetime chart, it can be seen that the folded architecture requires 4 registers as opposed to 16 registers in a straightforward implementation. The next step is to perform forwardbackward register allocation.
yo 
46 
y1 
57 
y2 
– 
y3 
– 
y4 
48 
y5 
59 
y6 
68 
y7 
79 
Fig.4 Linear lifetime table
Fig.5 Linear lifetime chart
Fig. 6 Register allocation table.
From the allocation table in Fig.6 and the folding equations, the final architecture in Fig. 7 can be synthesized and can be derived by minimizing the registers on all variables at once. The hardware utilization is only 50% in the derived architecture. This can also be observed from the folding sets where half of the time null operations are being executed, i.e., hardware is idle. The pipelined parallel FFT architectures are presented by using this methodology.
Ragavi.V, K.Renuka
489
Fig. 7 Folded architecture.

PROPOSED ARCHITECTURES WITH COMPLEX INPUTS (CFFT)
The proposed approach can be described using folding methodology [10].The 4parallel 128point FFT architecture can be derived using the following folding sets.
A = {A0, A1, A2, A3} A = {A0, A1, A2, A3}
B = {B3, B0, B1, B2} B = {B3, B0, B1, B2}
C = {C1, C2, C3, C0} C = {C1, C2, C3, C0}
D = {D1, D2, D3, D0} D = {D1, D2, D3, D1}
The folded architecture can be derived by writing the folding equation [10] for the edges in the flow graph. The register minimization techniques and the forward and backward register allocation scheme [12] are applied to derive the architecture and then the final architecture can be derived. The 128point FFT flow graph is based on radix24 algorithm which is decimated in time. To achieve the high throughput requirement with low hardware cost, both the proposed pipelining method and radix2n algorithms are exploited in this design.
The proposed 4parallel 128point FFT architecture is shown in fig.8. It consists of two parallel data paths processing two input samples. Each data path consists of seven butterfly units, four constant and two full complex multipliers, delay elements and multiplexers. The function of delay elements and switches is to store and reorder the input data until the other available data is received for the butterfly operation. The four output data values generated after the first stage are multiplied by constant twiddle factors (W18 = eij2/8, W38 = eij23/8). These twiddle factors can be implemented efficiently using Canonic Signed Digit (CSD) approach. The outputs after the third stage are multiplied by the nontrivial twiddle factor.
128 point CFFT architecture based on radix24 algorithm
Another constant multiplier stage is required before the sixth butterfly stage. The CSD complex constant multiplier processes the multiplication of twiddle factors W8, W16, W24, W48. These twiddle factors correspond to cos (/8), sin (/8), and cos(/4).

PROPOSED ARCHITECTURE WITH REAL INPUTS (RFFT)
The proposed radix24 4parallel architecture is explained using N = 128 point FFT is shown in fig.9. Further, the folding sets can be modified to derive Lparallel architectures of any Npoint RFFT.
Fig.8 Proposed
The radix24 algorithm is described in detail in [8]. We can modify the flow graph similar to the other radices. The advantage of radix24 algorithm is that it needs only one full multiplier every four stages. To derive the 4parallel architecture divides the nodes into two groups. The nodes in the same group are processed by the same computation unit. Consider the following folding sets.
A = {A0, A2, A4, A6} A = {A1, A3, A5, A7}
B = {B1, B3, B0, B2} B = {B5, B7, B4, B6}
D = {D3, D0, D2, D1} D = {D7, D4, D6, D5}
Ragavi.V, K.Renuka
490
The mapping of nodes to different butterfly structures can be different in the case of 4parallel architecture. The nodes {B4,…, B7} can be implemented with only a complex multiplier instead of BFIV structure, as these nodes consists of only complex multiplication operation.
Three different butterfly structures are necessary to handle the real and complex data paths. Similar to radix23 architectures, complex multipliers need to operate on samples computed at different time instances.
V. COMPARISON AND ANALYSIS
A.) Complex FFT:
A comparison is made between the previous pipelined architectures and the proposed ones for the case of computing an Npoint complex FFT in Table 1. The comparison is made in terms of required number of complex multipliers, adders, delay elements, twiddle factors and
throughput. The proposed architectures can process 4 samples in parallel, thus achieving a higher performance than previous designs. When compared to some previous architecture, the proposed design doubles the throughput and halves the latency.
B.) Real FFT:
The Table 2 shows the hardware complexity and the throughput of the previous architectures and the proposed ones for computing an Npoint Real FFT. The hardware complexity of the architectures depends on the required number of multipliers, adders and delay elements. The performance is represented by throughput. The number of multiplier required in the radix24 architecture is less compared to the previous designs. The proposed RFFT architecture leads to low hardware complexity.
Fig.9 Proposed 128point RFFT architecture based on radix24 algorithm
Table : 1 Comparison of Pipelined Hardware Architectures for the Computation of N Point CFFT
ARCHITECTURE 
# MULTIPLIERS 
# ADDERS 
# DELAYS 
THROUGHPUT 
R2MDC 
2(log4N1) 
4log4N 
3N/22 
1 
R2SDF 
2(log4N1) 
4log4N 
N1 
1 
R22SDF 
(log4N1) 
4log4N 
N1 
1 
R23SDF 
(log8N1) 
4log4N 
N1 
1 
Radix2 (4parallel) 
4(log4N1) 
8log4N 
2N4 
4 
Radix22 (4parallel) 
3(log4N1) 
8log4N 
2N4 
4 
Radix23 (4parallel) 
log8N1 
4log4N 
3N/22 
2 
Radix24 (4parallel) 
2(log16N1) 
4log2N 
N4 
4 
Ragavi.V, K.Renuka
491
VI. CONCLUSION
A novel four parallel 128point radix24 FFT architecture has been developed using proposed method. The hardware costs of delay elements and complex adders and the number of complex multipliers is reduced using higher radix FFT algorithm by using proposed approach. The throughput can be further increased by adding more pipeline stages which is possible due to the feedforward nature of the design. The power consumption can also be reduced and leads to low hardware complexity in proposed architectures compared to previous architectures. The simulation can be done by using Modelsim software. A generalized approach to design efficient architectures for the computation of RFFT is also proposed. The approach can be extended to radix25 and higher radix algorithms. Further higher parallel architectures can be developed using the proposed approach.
REFERENCES

J. Cooley and J. Tuckey, An Algorithm for the Machine Calculation of the Complex fourier series, Math. Comput, vol. 19, pp. 297301, Apr. 1965.

J.A.C. Bingham, Multicarrier modulation for data transmission: an idea whose time has come, IEEE Communication Magazine, vol. 28, no. 5, pp. 514, May 1990.
Table : 2 Comparison of Pipelined Hardware Architectures for the Computation of N Point RFFT
ARCHITE CTURE
#MULTIP LIERS
#ADD ERS
#DEL AYS
THROUG HPUT
R2MDC
2(log4N1)
4log2N
2(3N/2
2)
1
R2SDF
2(log4N1)
4log2N
2(N1)
1
R22SDF
(log4N1)
4log2N
2(N1)
1
R23SDF
(log8N1)
4log2N
2(N1)
1
Radix23(4 parallel)
2(log8N1)
4log2N
2
< 2N
4
Radix24(4 parallel)
2(log16N1)
4log2N
2
< 2N
4

P. Duhamel, Implementation of splitradix FFT algorithms for complex, real, and realsymmetric data, IEEE Trans. Acoust., Speech,Signal Process., vol. 34, no. 2, pp. 285295, Apr. 1986.

S. He and M. Torkelson, A new approach to pipeline FFT processor, in Proc. of IPPS, 1996, pp. 766770.

L. R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing, Englewood Cliffs, NJ: Prentice Hall, 1975.

E. H. Wold and A. M. Despain, Pipeline and parallel pipeline FFT processors for VLSI implementation, IEEE Trans. Comput., vol.C33, no. 5, pp. 414426, May 1984.

Y. W. Lin, H. Y. Liu, and C. Y. Lee, A 1GS/s FFT/IFFT processor for UWB applications, IEEE J. Solid State Circuits, vol. 40, no. 8, pp. 17261735, Aug. 2005.

J. Lee, H. Lee, S. I. Cho, S. S. Choi, A HighSpeed two parallel radix24 FFT/IFFT processor for MBOFDM UWB systems, IEEE Trans. on Fundamentals of Electronics, Communications and Computer Sciences, pp. 12061211, April 2008.

M. Garrido, Efficient hardware architectures for the computation of the FFT and other related signal processing algorithms in real time, Ph.D. dissertation, Dept. Signal, Syst., Radiocommun., Univ. Politecnica Madrid, Madrid, Spain, 2009.

K. K. Parhi, C. Y. Wang, and A. P. Brown, Synthesis of control circuits in folded pipelined DSP architectures, IEEE J. SolidState Circuits, vol. 27, no. 1, pp. 2943, Jan. 1992.

K. K. Parhi, Systematic synthesis of DSP data format converters using lifetime analysis and forwardbackward register allocation, IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 39, no. 7, pp. 423440, Jul. 1992.

K. K. Parhi, Calculation of minimum number of registers in arbitrary life time chart, IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 41, no. 6, pp. 434436, Jun. 1995.

Y. Wu, New FFT structures based on the Bruun algorithm, IEEE Trans. Acoust., Speech Signal Process., vol. 38, no. 1, pp. 188191, Jan. 1990.

W.W. Smith and J. M. Smith, Handbook of RealTime Fast Fourier Transforms. Piscataway, NJ: WileyIEEE Press, 1995.

M. Ayinala, M. Brown, K.K. Parhi, Pipelined parallel FFT architectures via folding transformation, IEEE Transactions on VLSI Systems, pp. 10681081, vol.20, no. 6, June 2012.
Ragavi.V, K.Renuka
492