Design Of Pipelined Parallel FFT Architectures Using Folding Transformation

V. Ragavi

doi:10.17577/IJERTV2IS70296

Volume 02, Issue 07 (July 2013)

Design Of Pipelined Parallel FFT Architectures Using Folding Transformation

DOI : 10.17577/IJERTV2IS70296

Download Full-Text PDF Cite this Publication

Open Access
Article Download / Views: 254
Total Downloads : 1268
Authors : V. Ragavi
Paper ID : IJERTV2IS70296
Volume & Issue : Volume 02, Issue 07 (July 2013)
Published (First Online): 15-07-2013
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License

PDF Version

View

Text Only Version

Design Of Pipelined Parallel FFT Architectures Using Folding Transformation

V.Ragavi

PG Scholar

Srinivasan Engineering College, India

Abstract – In this paper, a novel approach to develop parallel pipelined architectures for the Fast Fourier transform (FFT) is presented. The folding transformation and register minimization techniques are proposed for designing FFT architectures. Novel parallel-pipelined 128-point radix-24 FFT architecture for the computation of complex and real valued fast Fourier transform are derived. For Complex valued Fast Fourier Transform (CFFT), the proposed architecture takes benefit of underutilized hardware in the serial architecture to derive L-parallel architectures not including the increment of hardware complexity by a factor of L. In addition to, the new parallel- pipelined architecture for the computation of Real- valued Fast Fourier Transform (RFFT) is presented. To reduce the hardware complexity, the proposed architecture exploits redundancy in the computation of FFT samples. A comparison is shown between the proposed design and the previous architectures.

Keywords Fast Fourier Transform (FFT), folding, radix-24, register minimization.

INTRODUCTION

DFT is one of the most important tools in the field of digital signal processing. Several Fast Fourier Transform (FFT) algorithms have been developed over the years due to its computational complexity. FFT plays a critical role in modern digital communications such as Digital Video Broadcasting (DVB) and Orthogonal Frequency Division Multiplexing (OFDM) systems. The design of pipelined architectures for computation of FFT of complex valued signals (CFFT) has been carried out. Different algorithms have been developed to reduce the computational complexity, of which Cooley-Tukey radix-2 FFT [1] is very popular.

Algorithms such as radix-4 [2], split-radix
[3] and radix-22 [4] have been developed based on the basic radix-2 FFT approach. The one of the most classical approaches for pipelined implementation of radix-2 FFT is Radix-2 multi-

path delay commutator (R2MDC) [5]. A standard usage of the storage buffer in R2MDC leads to the Radix-2 Single-path delay feedback (R2SDF) [6] architecture with reduced memory.

The architectures are developed for a specific-point FFT in [7] and [8], whereas hypercube theory is used to derive the architectures in [9].The method of developing these architectures from the algorithms is not well established.

In additional, most of these hardware architectures are not fully utilized and require high hardware complexity. In the period of high speed digital communications, the high throughput and low power designs are essential to meet the speed and power requirements while keeping the hardware overhead to a minimum. In this paper, a new approach to design the architecture from the FFT flow graphs is presented. Folding transformation [10] and register minimization techniques [11], [12] are used to derive several known FFT architectures.

If the input samples are real then the spectrum is symmetric and approximately half of the operations are redundant. The applications such as speech, audio, image, radar, and biomedical signal processing, a specialized hardware implementation is best suitable to meet the real- time constraints. The implantable or portable device saves power by using this type of implementation which is a key limitation. Few pipelined architectures for real valued signals have been proposed [13] based on the Brunn algorithm. However, these are not widely used. Different algorithms such as doubling algorithm, packing algorithm have been proposed for computation of RFFT. These approaches are based on removing the redundancies of the CFFT while the input is real. RFFT is calculated using the CFFT architecture in an efficient manner [14].

In the folding transformation, many butterflies in the same column can be mapped to one butterfly unit. If the FFT size is N, a folding factor of N/2 leads to 2-parallel architecture and in another design, a folding factor of N/4 leads to design 4-parallel architectures in which four samples are processed in the same clock cycle. Various folding sets lead to a family of FFT

architectures. Alternatively, known FFT architectures can also be described by the proposed methodology by selecting the appropriate folding set. To reduce latency and the number of storage elements, folding sets are designed. The prior FFT architectures were derived in an informal way, and their derivations were not explained in a systematic way. This is the effort to simplify the design of FFT architectures for arbitrary level of parallelism in an efficient manner by means of the folding transformation. In this paper, the prior design architectures are explained by constructing the specific folding sets. Then new architecture is derived for radix and levels of parallelism and for either Decimation-In-Time (DIT) or Decimation- In-Frequency (DIF) flow graphs. The new architecture achieves full hardware utilization. It may be noted that all prior parallel FFT architectures did not achieve full hardware utilization. The new real FFT architecture is also presented based on higher radices.

In [15], the parallel-pipelined architectures for the computation of RFFT based on radix-22 and radix-23 algorithms have been proposed. The real FFT architectures are not fully utilized. This drawback is removed by proposed methodology. The novel parallel-pipelined FFT architectures for the real-valued signals with full hardware utilization based on radix-24 algorithm is presented. It combines the advantages of radix-2n algorithms, which requires fewer complex multipliers when compared to radix-2 algorithm, with the reduction of operations using redundancy.

This paper is organized as follows. The folding transformation and register minimization based FFT architectures design is presented in Section II. The proposed architecture for complex FFT is explained in Section III. The proposed architecture for real FFT is explained in Section IV. In Section V, the proposed architecture is compared with the previous approaches and some conclusions are drawn in Section VI.
FFT ARCHITECTURES DESIGN TECHNIQUES

In this section, the folding transformation method and register minimization to derive several known FFT architectures is illustrated in general. The process is described using an 8-point radix-2 DIF FFT as an example. It can be extended to other radices in a similar fashion. Figure. 1 shows the flow graph of a radix-2 8-point DIF FFT. The graph is divided into three stages and each of them consists of a set of butterflies and multipliers.

Figure.1 Flow graph of a radix-2 8-point DIF FFT.

This algorithm can be represented as a data flow graph (DFG) as shown in fig. 2. The nodes in the DFG represent tasks or computations. In this case, all the nodes represent the butterfly computations of the radix-2 FFT algorithm. Assume nodes A and B have the multiplier operation on the bottom edge of the butterfly. The folding transformation is used on the DFG to derive a pipelined architecture. To transform the DFG, a folding set is required which is an ordered set of operations executed by the same functional unit. Each folding set contains K entries some of which may be null operations is called the folding factor,

i.e., the number of operations folded into a single functional unit. The operation in the jth position within the folding set (where j goes from 0 to K-1) is executed by the functional unit during the time partition. The term j is the folding order, i.., the

time instance to which the node is scheduled to be executed in hardware.

Figure.2 DFG of a radix-2 8-point DIF FFT.

For example, consider the folding set A = {, , , , A0, A1, A2, A3} for K=8. The operation A0 belongs to the folding set A with the folding order

4. The functional unit executes the operations A0, A1, A2, A3 at the respective time instances and will be idle during the null operations. The systematic folding techniques are used to derive the 8-point FFT architecture. Consider an edge e connecting the nodes U and V with w (e) delays.

The folding equation (1) for the edge e is

DF (U V) = K w(e)-PU + v – u (1)

where PU is the number of pipeline stages in the hardware unit which executes the node U [10]. By using folding sets, folding equations are derived with negative delays (w/o pipeline) and non negative delays (with pipeline or retiming). Consider folding of the DFG in fig.2 with the folding sets

A = {, , , , A0, A1, A2, A3}

B = {B2, B3, , , , , B0, B1}

C = {C1, C2, C3, , , , , C0}.

Assume that the butterfly operations do not have any pipeline stages, i.e, PA=0, PB=0, PC=0.Retiming and/or pipelining can be used to either satisfy DFUV) 0 or determine that the

folding sets are not feasible [10]. The negative delays on some edges can be observed. The equations are

DF (A0 B0) = 2 DF (B0 C0) = 1 DF (A0 B2) = – 4 DF (B0 C1) = – 6 DF (A1 B1) = 2 DF (B1 C0) = 0 DF (A1 B1) = – 4 DF (B1 C1) = – 7 DF (A2 B0) = 0 DF (B2 C2) = 1 DF (A2 B2) = – 6 DF (B2 C3) = 2 DF (A3 B1) = 0 DF (B3 C2) = 0

DF (A3 B3) = – 6 DF (B3 C3) =1 (2)

The DFG can be pipelined is shown to ensure that folded hardware has non-negative number of delays. The folded delays for the pipelined DFG are

DF (A0 B0) = 2 DF (B0 C0) = 1 DF (A0 B2) = 4 DF (B0 C1) = 2 DF (A1 B1) = 2 DF (B1 C0) = 0 DF (A1 B1) = 4 DF (B1 C1) = 1 DF (A2 B0) = 0 DF (B2 C2) = 1 DF (A2 B2) = 2 DF (B2 C3) = 2 DF (A3 B1) = 0 DF (B3 C2) = 0

RETIMING FOLDING EQUATIONS

RETIMING FOLDING EQUATIONS

DF (A3 B3) = 2 DF (B3 C3) = 1 (3)

FOLDING SETS

FOLDING EQUATIONS

FOLDING SETS

FOLDING EQUATIONS

FOLDED ARCHITECTURE

REGISTER ALLOCATION

FOLDED ARCHITECTURE

REGISTER ALLOCATION

LIFETIME ANALYSIS

LIFETIME ANALYSIS

Figure. 3 Block diagram of FFT design techniques

The technique for minimizing register is lifetime analysis [12] which analyzes the time for when a data is produced (Tinput) and when a data finally is consumed (Toutput).

T input = u + PU (4)

T output = u + PU + maxv {DF (UV)} (5)

where u is the folding order of U and PU is the number of pipelining stages in the functional unit that executes u. From (3) the 24 registers are required to implement the folded architecture. Lifetime analysis technique is used to design the folded architecture with minimum possible registers. For example, in the current 8-point FFT design, consider the variables y0, y1,. . . y7, i.e., the outputs at the nodes A0,A1,A2,A3 respectively. It takes 16 registers to synthesize these edges in the folded architecture. The linear lifetime table and lifetime chart for these variables is shown in figure. 4 and figure. 5. From the lifetime chart, it can be seen that the folded architecture requires 4 registers as opposed to 16 registers in a straightforward implementation. The next step is to perform forward-backward register allocation.

NODE

Tinput Toutput

yo

46

y1

57

y2

–

y3

–

y4

48

y5

59

y6

68

y7

79

Figure.4 Linear lifetime table

Figure.5 Linear lifetime chart

Figure. 6 Register allocation table.

From the allocation table in Fig.6 and the folding equations, the final architecture in Fig. 7 can be synthesized and can be derived by minimizing the registers on all variables at once. The hardware utilization is only 50% in the derived architecture. This can also be observed from the folding sets where half of the time null operations are being executed, i.e., hardware is idle. The pipelined parallel FFT architectures are presented by using this methodology.
PROPOSED ARCHITECTURES WITH COMPLEX INPUTS (CFFT)

The proposed approach can be described using folding methodology [10].The 4-parallel 128- point FFT architecture can be derived using the following folding sets.

A = {A0, A1, A2, A3} A = {A0, A1, A2, A3}

B = {B3, B0, B1, B2} B = {B3, B0, B1, B2}

C = {C1, C2, C3, C0} C = {C1, C2, C3, C0}

D = {D1, D2, D3, D0} D = {D1, D2, D3, D1}

The folded architecture can be derived by writing the folding equation [10] for the edges in the flow graph. The register minimization techniques and the forward and backward register allocation scheme [12] are applied to derive the architecture and then the final architecture can be derived. The 128-point FFT flow graph is based on radix-24 algorithm which is decimated in time. To achieve the high throughput requirement with low hardware cost, both the proposed pipelining method and radix-2n algorithms are exploited in this design.

Figure. 7 Folded architecture.

The proposed 4-parallel 128-point FFT architecture is shown in figure.8. It consists of two parallel data paths processing two input samples. Each data path consists of seven butterfly units, four constant and two full complex multipliers, delay elements and multiplexers. The function of delay elements and switches is to store and reorder the input data until the other available data is received for the butterfly operation. The four output data values generated after the first stage are

multiplied by constant twiddle factors (W18 = eij2/8, W38 = eij23/8).

Figure.8 Proposed 128 point CFFT architecture based on radix-24 algorithm

Another constant multiplier stage is required before the sixth butterfly stage. The CSD complex constant multiplier processes the multiplication of twiddle factors W8, W16, W24, W48. These twiddle factors correspond to cos (/8),

sin (/8), and cos(/4).
PROPOSED ARCHITECTURE WITH REAL INPUTS (RFFT)

The proposed radix-24 4-parallel architecture is explained using N = 128 point FFT is shown in figure.9. Further, the folding sets can be modified to derive L-parallel architectures of any N-point RFFT.

The radix-24 algorithm is described in detail in [8]. We can modify the flow graph similar to the other radices. The advantage of radix-24

algorithm is that it needs only one full multiplier every four stages. To derive the 4-parallel architecture divides the nodes into two groups. The nodes in the same group are processed by the same computation unit. Consider the following folding sets.

A = {A0, A2, A4, A6} A = {A1, A3, A5, A7}

B = {B1, B3, B0, B2} B = {B5, B7, B4, B6}

C = {C2, C1, C3, C0} C = {C6, C5, C7, C4}

D = {D3, D0, D2, D1} D = {D7, D4, D6, D5}

The mapping of nodes to different butterfly structures can be different in the case of 4-parallel architecture. The nodes {B4,…, B7} can be implemented with only a complex multiplier

instead of BFIV structure, as these nodes consists of only complex multiplication operation. Three different butterfly structures are necessary to handle the real and complex data paths.

COMPARISON AND ANALYSIS

A.) Complex FFT:

A comparison is made between the previous pipelined architectures and the proposed ones for the case of computing an N-point complex FFT in Table I. The comparison is made in terms of required numer of complex multipliers, adders, delay elements, twiddle factors and throughput. The proposed architectures can process 4 samples in parallel, thus achieving a higher performance than previous designs. The proposed design doubles the throughput and halves the latency.

B.) Real FFT:

The Table II shows the hardware complexity and the throughput of the previous architectures and the proposed ones for computing an N-point Real FFT. The hardware complexity of the architectures depends on the required number of multipliers, adders and delay elements. The performance is represented by throughput. The number of multiplier required in the radix-24 architecture is less compared to the previous designs. The proposed RFFT architecture leads to low hardware complexity.

Figure.9 Proposed 128-point RFFT architecture based on radix-24 algorithm

TABLE: I

Comparison of pipelined hardware architectures for the computation of N -point CFFT

Architecture	#Multipliers	#Adders	#Delays	Control	Throughput
R2MDC	2(log4N-1)	4log4N	3N/2-2	Simple	1
R2SDF	2(log4N-1)	4log4N	N-1	Simple	1
R4SDC	(log4N-1)	3log4N	2N-2	Complex	1
R22SDF	(log4N-1)	4log4N	N-1	Simple	1
R23SDF	(log8N-1)	4log4N	N-1	Simple	1
Radix-2 FFT	4(log4N-1)	8log4N	2N-4	Simple	2
Radix-22 FFT	3(log4N-1)	8log4N	2N-4	Simple	2
Radix-23 FFT	log8N-1	4log4N	3N/2-2	Simple	2
Radix-24 FFT	2(log16N-1)	4log2N	N-4	Simple	4

TABLE: II

Comparison of pipelined hardware architectures for the computation of N -point RFFT

Architecture	#Multipliers	#Adders	#Delays	Control	Throughput
R2MDC	2(log4N-1)	4log4N	3N/2-2	Simple	1
R2SDF	2(log4N-1)	4log4N	N-1	Simple	1
R4SDC	(log4N-1)	3log4N	2N-2	Simple	1
R22SDF	(log4N-1)	4log4N	N-1	Simple	1
Radix 2 FFT	2(log4N-1)	4log4N-2	9N/8-2	Simple	2
Radix-22 FFT	2(log4N-1)	4log4N-2	9N/8-2	Simple	2
Radix-23 FFT	(log8N-1)	4log4N-2	N-2	Simple	2
Radix-24 FFT	2(log16N-1)	4log2N-2	< 2N	Simple	4

CONCLUSION

A novel four parallel 128-point radix- 24 FFT architecture has been developed using proposed method. The hardware costs of delay elements and complex adders and the number of complex multipliers is reduced using higher radix FFT algorithm by using proposed approach. The throughput can be further increased by adding more pipeline stages which is possible due to the feed-forward nature of the design. The power consumption can also be reduced and leads to low hardware complexity in proposed architectures compared to previous architectures. The simulation can be done by using Modelsim software. A generalized approach to design efficient architectures for the computation of RFFT is also proposed. The approach can be extended to radix-25 and higher radix algorithms. Further higher parallel architectures can be developed using the proposed approach.

REFERENCES

M. Ayinala, M. Brown, K.K. Parhi, Pipelined parallel FFT architectures via folding transformation, IEEE Transactions on VLSI Systems, pp. 1068-1081, vol.20, no. 6, June 2012.
A. V. Oppenheim, R.W. Schafer, and J.R.Buck, Discrete-Time Signal Processing, 2nd ed. Englewood Cliffs, NJ: Prentice-Hall, 1998.
P. Duhamel, Implementation of split-radix FFT algorithms for complex,real, and real-symmetric data, IEEE Trans. Acoust., Speech,Signal Process., vol. 34, no. 2, pp. 285295, Apr. 1986.
S. He and M. Torkelson, A new approach to pipeline FFT processor, in Proc. of IPPS, 1996, pp. 766770.
L. R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1975.
E. H. Wold and A. M. Despain, Pipeline and parallel-pipeline FFT processors for VLSI implementation, IEEE Trans. Comput., vol. C-33, no. 5, pp. 414426, May 1984.
A. M. Despain, Fourier transfom using CORDIC iterations, IEEE Trans. Comput., vol. C-233, no. 10, pp. 9931001, Oct. 1974.
E. E. Swartzlander, W. K. W. Young, and S. J. Joseph, A radix-4 delay commutator for fast Fourier transform processor implementation, IEEE J. Solid- State Circuits, vol. SC-19, no. 5, pp. 702709, Oct. 1984.
E. E. Swartzlander, V. K. Jain, and H. Hikawa, A radix-8 wafer scale FFT processor, J. VLSI Signal Process., vol. 4, no. 2/3, pp. 165176, May 1992.
G. Bi and E. V. Jones, A pipelined FFT processor for word-sequential data, IEEE Trans. Acoust., Speech, Signal Process., vol. 37, no. 12, pp. 19821985, Dec. 1989.
Y. W. Lin, H. Y. Liu, and C. Y. Lee, A 1-GS/s FFT/IFFT processor for UWB applications, IEEE J. Solid-State Circuits, vol. 40, no. 8, pp. 17261735, Aug. 2005.
J. Lee, H. Lee, S. I. Cho, and S. S. Choi, A High- Speed two parallel radix-24 FFT/IFFT processor for MB- OFDM UWB systems, in Proc IEEE Int. Symp. Circuits Syst., 2006, pp. 47194722.
J. Palmer and B. Nelson, A parallel FFT architecture for FPGAs,Lecture Notes Comput. Sci., vol. 3203, pp. 948953, 2004.
M. Shin and H. Lee, A high-speed four parallel radix-24 FFT/IFFT processor for UWB applications, in Proc. IEEE ISCAS, 2008, pp.960963.
M. Garrido, Efficient hardware architectures for the computation of the FFT and other related signal processing algorithms in real time, Ph.D. dissertation, Dept. Signal, Syst., Radiocommun., Univ. Politecnica Madrid, Madrid, Spain, 2009.
K. K. Parhi, C. Y. Wang, and A. P. Brown, Synthesis of control circuits in folded pipelined DSP architectures, IEEE J. Solid-State Circuits, vol. 27, no. 1, pp. 2943, Jan. 1992.
K. K. Parhi, Systematic synthesis of DSP data format converters using lifetime analysis and forward- backward register allocation, IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 39, no. 7, pp. 423440, Jul. 1992.
J. W. Cooley and J. Tukey, An algorithm for machine calculation of complex fourier series, Math. Comput., vol. 19, pp. 297301, Apr.1965.
K. K. Parhi, Calculation of minimum number of registers in arbitrary life time chart, IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 41, no. 6, pp. 434436, Jun. 995.
D. Sinha and A. H. Tewfik, TDCS, OFDM, and MC-CDMA: A brief tutorial, IEEE Commun.Mag., vol. 43, no. 9, pp. S11S16, Sep. 2005.
J. W. Picone, Signal modeling techniques in speech recognition, Proc. IEEE, vol. 81, no. 9, pp. 12151247, Sep. 1993.
M. H. Cheng, L. C. Chen, Y. C. Hung, and C. M. Yang, A real-time maximum likelihood heart-rate estimator for wearable textile sensors, in Proc. IEEE 30th Annu. Int. Conf. EMBS, 2008, pp. 254257.
T. Netoff, Y. Park, and K. K. Parhi, Seizure prediction using costsensitive support vector machine, in Annu. Int. Conf. EMBS, 2009, pp. 33223325.
R. Storn, A novel radix-2 pipeline architecture for the coputation of the DFT, in Proc. IEEE ISCAS, 1988, pp. 18991902.
Y. Wu, New FFT structures based on the Bruun algorithm, IEEETrans. Acoust., Speech Signal Process., vol. 38, no. 1, pp. 188191,

Jan. 1990

NODE	Tinput Toutput
yo	46
y1	57
y2	–
y3	–
y4	48
y5	59
y6	68
y7	79

Design Of Pipelined Parallel FFT Architectures Using Folding Transformation

PG Scholar

Srinivasan Engineering College, India

Leave a Reply