Design and Implementation of Low Power-Area efficient Rounded Truncated Multiple Constant Multiplier for FIR filters

Download Full-Text PDF Cite this Publication

Text Only Version

Design and Implementation of Low Power-Area efficient Rounded Truncated Multiple Constant Multiplier for FIR filters

Veera Boopathy. E, PG Scholar, VLSI Design, VSB Engineering College,

Karur.

R. Dhayabarani,

Associate Professor,

      1. Engineering College, Karur.

        AbstractFaithfully rounded truncated multipliers are used to obtain highly economical Finite Impulse Response filter designs. Both bit width optimization and hardware resources are equally considered devoid of sacrificing the frequency response and output signal accuracy. In order to reduce total area cost asymmetrical coefficient quantization with suitable filter order is suggested. An improved version of truncated multipliers is utilized to realize multiple constant multiplication

        / accumulation in a direct FIR structure. In comparative analysis with earlier FIR design methodologies illustrate that the proposed designs accomplish best area and power results.

        KeywordsDigital signal processing (DSP), faithful rounding, finite impulse response (FIR) filter, truncated multipliers, VLSI design.

        1. INTRODUCTION

          In Digital Signal Processing (DSP) and communication systems, Finite Impulse Response (FIR) digital filter is one of the essential components whose response to any finite length input settles to zero in finite time. They are extensively applied in various portable applications with reduced area and power resources. An expression for FIR filter of order M is

          =0

          =0

          y(n) = 1 x(n-i)

          In occurrence of linear phase, the coefficients are symmetric or asymmetric with ai = aMI or ai = aMi.

          Fig. 1. Direct form structure of linear-phase even-order FIR filters

          Fig. 2. Transposed form structure of linear-phase even-order FIR filters.

          Two primitive FIR structures are direct form and transposed form as illustrated in Fig.1 meant for a linear- phase FIR filter with even-order.Fig.1(a) is direct form FIR Structure in which the multiple constant multiplication (MCM)/accumulation (MCMA) module accomplish parallel multiplications of specific delayed signals with respective filter coefficients, then pursued by accumulation of entire product terms. Consequently, in MCMA multiplier operands delayed input signals x (n i) and coefficients ai. Fig. 1(b) is transposed form FIR Structure in which MCM module has multiplier operands as current input signal x (n) and coefficients. The outcomes of distinct constant multiplications use structure adders (SAs) with delay elements. Several papers on design and implementation of low-cost otherwise high-speed FIR filters are presented [1] [13], [15][19]. In order to prevent expensive multipliers, earlier hardware implementations of digital FIR filters are segregated into two classes as multiplier less based and memory based.

          Multiplier less-based designs accomplish MCM by means of shift and add operations and assign common sub- operations by Canonical Signed Digit (CSD) recoding and common sub-expression elimination to reduce the adder cost of MCM [1][10].Area reduction is attained effectively by considering optimization of coefficient quantization and CSE equally in [18] and [19]. The transposed structure is employed in several multiplier less MCM-based FIR filter designs to allow for cross-coefficient sharing and likely to be faster, especially when the filter order is large. But, area of delay elements is higher in contrast with that of the direct form because of the range expansion of the constant multiplications and the successive additions in the SAs. In

          [17] Gustafsson and Blad by means of integer linear programming pipelined the carry-save adder trees in the constant multiplications to reduce area cost of full adders, half adders, and registers that provide high-throughput (TP) FIR filter designs.

          Memory-based FIR designs contain two kinds of methods as Look-Up Table (LUT) methods and Distributed Arithmetic (DA) methods [11][13]. In order to achieve the constant multiplications in MCM, LUT-based design stores odd multiples of the input signal in ROMs [11]. Distributed Arithmetic method recursively accumulate bit-level partial results for the inner product calculation in FIR filtering [12] [13].

          Optimization of bit widths for filter coefficients is a crucial design problem of FIR filter realization which comprises direct influence on area cost of arithmetic units and registers. Furthermore, as the bit widths increase after multiplications, several DSP applications do not require full- precision outputs. Alternatively, it is advantageous to produce faithfully rounded outputs where overall error present in quantization and rounding is merely one unit of the last place (ULP) described as the weighting of the least significant bit (LSB) of the outputs.

          By means of faithfully rounded truncated multipliers, a low-cost realization of FIR filter build on the direct structure in Fig. 1(a) is presented. MCMA module is accomplished by accumulating all the partial products (PPs) where redundant partial product bits (PPBs) are detached devoid of disturbing the final precision of the outputs. Non-uniform quantization including unequal word lengths are applied to reduce bit widths of all the filter coefficients with the purpose of reducing hardware cost yet achieving the requirement of frequency response.

          This concise is arranged as pursues. Non-uniform quantization and optimization of filter coefficients is described in segment II. Partial Products generation and compression in the faithfully rounded MCMA module is depicted in segment II. The comparative analysis of results is presented in segment IV.

        2. COEFFICIENT QUANTIZATION AND

          OPTIMIZATION

          recoding through digit set of {0, 1, -1} and radix-4 modified Booth recoding through digit set of {0, 1, -1, 2, -2} are considered and then one outcomes in lesser area cost is chosen.

          In non-uniform quantization, the bit width of each coefficient is reduced until frequency response is no longer satisfied. At last non-uniformly quantized coefficients are adjusted through addition or subtraction of LSB weight of each coefficient as well as more feasible bit width reduction is examined. The algorithm in Fig. 3 is applied to obtain filter order M as well as non-uniformly quantized coefficients reduce area cost in the FIR filter realization.

        3. PP TRUNCATION AND COMPRESSION Here, direct form structure of FIR filter design is

          implemented where entire products are summed up in MCMA module. It is further efficient to gather every PPS into one PPB matrix in spite of accumulating different multiplication for every product including carry-save addition to decrease height of the matrix to two, pursued by a final carry propagation adder. The difference between individual multiplications and combined multiplication for ((A*B) + (C*D)) is demonstrated in fig.4.

          FREQUENCY RESPONSE SPECIFICATION

          FINDING FILTER ORDER AND COEFFICIENTS

          HARDWARE OPTIMIZATION

          COEFFICIENT QUANTIZATION

          Fig. 4. Multiplication / Accumulation using (a) individual PP compression

          A basic ow of FIR lter design and implementation can be classified into three phase as shown in fig.2 such as obtaining lter order and coefcients, coefcient quantization and hardware optimization. Filter order and corresponding coefcients of innite precision are determined first to meet specication of the frequency response. Subsequently, coefcients are quantized to nite bit accuracy. Finally, various optimization methodologies such as CSE are applied to minimize the area cost of hardware implementations. everal earlier FIR lter implementations concentrate on hardware optimization stage.

          As soon as FIR lter operations, owing to bit width expansion after multiplications output signals have greater bit width. Simply partial bits of the full-precision outputs are required in various realistic conditions. Here direct FIR structure with MCMA is implemented as area cost of ip- ops in the delay elements is lesser in contrast with transposed form. In addition three design stages in Fig. 2 are considered to accomplish more economical hardware design through faithfully rounded output signals.

          Number of nonzero digits following coefficient quantization is reduced by performing recoding. Both CSD

          and (b) combined PP compression.

          Fig. 5. Generation of PPBs considering sign extention and negation.

          The sign bit of every PP row is inversed to prevent sign extension bits and by the property (S=1-S) bias constant is included, where S is sign bit of a pp row as revealed in fig.5. PPB matrix ultimate row holds every bias constant and over bared white round symbolize PPBs complements.

          The FIR filter realization demands total error established in arithmetic operations should not be greater than one ulp. The truncated multiplier design in [14] is customized thus more PPBs can be removed which directs to minimum area

          cost. The comparison between two distinct methods is revealed in fig.6. The elimination of redundant PPBs in [14] is compiled of three steps such as deletion, truncation and rounding. Two PPB rows are arranged non removable as they will be detached at succeeding truncation and rounding. Two PPB rows are arranged non removable as they will be detached at succeeding truncation and rounding.

          `

          Fig. 6. Truncated Multipiler designs using (a) approach in [14] and (b) improved version.

          The error limits prior and later accumulating offset constants for deletion, truncation and rounding are specified as pursued,

          • 1 ulp ED 0 – 1 ulp ED = ED + 1 ulp 1 ulp

            2 4 4 4

          • 1 ulp ET 0 – 1 ulp < ET = ET + 1 ulp 1 ulp

            2 4 4 4

          • ulp < ER 0 – 1 ulp < ER = ER + 1 ulp 1 ulp

            • ulp ED 0 – 1 ulp ED = ED + 1 ulp 1 ulp

              2 2 2

            • ulp < ER 0 – 1 ulp < ER = ER + 1 ulp 1 ulp

          2 2 2

          -ulp < E = (ED + ER) ulp

          More PPBs are removed in enhanced version as the limits of deletion error are twofold bigger than that in [14] which direct to lesser area in successive PPB compression. MCMA architecture including truncation (MCMAT) is shown in fig. 7 which eliminates excessive PPBs.

          Fig.7. Overall FIR filter architecture using multiple constant multipliers/accumulators with faithfully rounded truncation (MCMAT).

          In L-shaped block undeletable PPBs are symbolized as white rounds then removal PPBs are symbolized as gray rounds. Rounding of resulting bits following PP compression is symbolized as crossed circles. PPB matrix final row signifies entire offset and bias constants needed with sign bit adjustment.

        4. EXPERIMENTAL RESULTS AND

          COMPARISONS

          According to the requirements provided inside Table I [15], [16] three FIR filters are realized. Actual filter order is indicated as M whereas area optimized filter order is indicated as Mopt. For coefficients which are uniformly quantized including filter order Mopt, the total fractional bits is represented with B. Without including MSB, effective

          2

          -ulp < E = (ED + ET + ER) ulp

          2 2 word length is indicated as EWL. Standardized edge frequencies to one for pass band and stop band are specified as fpass and fstop respectively and their equivalent peak-to-peak

          The deleted bits, truncated bits and rounded bits are

          symbolized as gray circles, crossed green circles and crossed red circles correspondingly as revealed in fig. 6 (a). An enhanced form of rounded truncated multiplier design is suggested here as illustrated in Fig. 6(b). One row of PPBs is simply finished as non removable and PPB removal made up of just deletion and rounding. The processes of deletion and rounding in enhanced version have error limits as follows:

          ripples are indicated as Apass and Astop.

          TABLE I. SPECIFICATIONS OF THREE FIR FILTERS UNDER

          Filter

          M

          Mopt

          B

          EWL

          fpass

          fstop

          Apass (dB)

          Astop (dB)

          A

          LP

          25

          28

          11

          10

          0.15

          0.25

          0.09

          46

          B

          LP

          59

          64

          15

          12

          0.02

          0.07

          0.20

          60

          C

          HP

          121

          121

          19

          17

          0.40

          0.37

          0.10

          80

          Filter

          M

          Mopt

          B

          EWL

          fpass

          fstop

          Apass (dB)

          Astop (dB)

          A

          LP

          25

          28

          11

          10

          0.15

          0.25

          0.09

          46

          B

          LP

          59

          64

          15

          12

          0.02

          0.07

          0.20

          60

          C

          HP

          121

          121

          19

          17

          0.40

          0.37

          0.10

          80

          SIGNIFICANCE

          TABLE II. SYNTHESIS RESULTS OF FILTER A WITH 12-BIT INPUT AND OUTPUT SIGNALS

          Filter A (28-tap LP)

          MCM

          (um2)

          SAs (um2)

          DFFs

          (um2)

          Area (um2)

          Delay (ns)

          TP (M

          data/s)

          Power (mw)

          AP/TP

          Struct.

          CSD/Booth

          4527

          9737

          9655

          23919

          4.78

          209

          1.15

          2.38

          Trans.

          13460

          5030

          18490

          6.75

          148

          0.91

          2.05

          Direct

          NRSCSE[1]

          3151

          9737

          9655

          22543

          4.78

          209

          1.10

          2.14

          Trans.

          MBPG[2]

          3141

          9737

          9655

          22543

          4.78

          209

          1.14

          2.22

          Trans.

          LUT [11]

          16155

          9737

          9655

          35547

          4.77

          210

          1.66

          5.08

          Trans.

          1D DA [12]

          3146

          5526

          8672

          4.83

          17

          0.36

          3.32

          Direct

          MCMA

          10161

          5030

          15191

          6.58

          152

          0.90

          1.63

          Direct

          MCMA opt

          9277

          5030

          14307

          6.31

          159

          0.81

          1.32

          Direct

          MCMAT_I

          7716

          5030

          12746

          6.35

          158

          0.71

          1.04

          Direct

          MCMAT_II

          7460

          5030

          12490

          6.35

          158

          0.70

          1

          Direct

          TABLE III. SYNTHESIS RESULTS OF FILTER B WITH 12-BIT INPUT AND OUTPUT SIGNALS.

          Filter B (64-tap LP)

          MCM

          (um2)

          SAs (um2)

          DFFs

          (um2)

          Area (um2)

          Delay (ns)

          TP (M

          data/s)

          Power (mw)

          AP/TP

          Struct.

          CSD/Booth

          16657

          27382

          25691

          69730

          5.30

          189

          3.25

          2.47

          Trans.

          39086

          11736

          50822

          8.20

          122

          2.57

          2.21

          Direct

          NRSCSE[1]

          11245

          27382

          25691

          64318

          5.30

          189

          3.17

          2.22

          Trans.

          MBPG[2]

          10871

          27382

          25691

          63944

          5.31

          188

          3.37

          2.36

          Trans.

          LUT [11]

          50121

          27382

          25691

          103194

          5.29

          189

          4.83

          5.44

          Trans.

          1D DA [12]

          13215

          12279

          25494

          5.99

          14

          0.94

          3.53

          Direct

          MCMA

          29705

          11736

          41441

          7.74

          129

          2.50

          1.66

          Direct

          MCMA opt

          28391

          11736

          40127

          7.63

          131

          2.40

          1.52

          Direct

          MCMAT_I

          22654

          11736

          34390

          7.43

          135

          2.04

          1.07

          Direct

          MCMAT_II

          21510

          11736

          33246

          7.43

          135

          1.97

          1

          Direct

          TABLE IV. SYNTHESIS RESULTS OF FILTER C WITH 12-BIT INPUT AND OUTPUT SIGNALS.

          Filter C (121-tap LP)

          MCM

          (um2)

          SAs (um2)

          DFFs

          (um2)

          Area (um2)

          Delay (ns)

          TP (M

          data/s)

          Power (mw)

          AP/TP

          Struct.

          CSD/Booth

          50813

          62047

          57451

          170311

          6.23

          161

          8.82

          3.99

          Trans.

          103263

          22353

          125616

          9.74

          103

          6.40

          3.34

          Direct

          NRSCSE[1]

          31430

          62047

          57451

          150928

          6.23

          161

          8.19

          3.28

          Trans.

          MBPG[2]

          30422

          62047

          57451

          149920

          6.23

          161

          8.45

          3.36

          Trans.

          LUT [11]

          124717

          62047

          57451

          244215

          6.22

          161

          13.70

          3.88

          Trans.

          1D DA [12]

          32068

          22990

          55058

          7.17

          11

          2.04

          4.36

          Direct

          MCMA

          71888

          22353

          94241

          8.94

          112

          6.00

          2.16

          Direct

          MCMA opt

          69024

          22353

          91377

          8.99

          111

          5.65

          2.00

          Direct

          MCMAT_I

          49278

          22353

          71631

          8.24

          121

          4.37

          1.11

          Direct

          MCMAT_II

          46018

          22353

          68371

          8.24

          121

          4.14

          1

          Direct

          By means of 90-nm standard cell library of Taiwan Semiconductor Manufacturing Company, performance effects for 12 fractional bits input and output signals are demonstrated in Tables IIIV with Synopsys De- sign Compiler. Arithmetic units (MCM and SAs) along with D- type Flip-Flops (DFFs) comprise overall area of architecture in addition to this TP is characterized in metric of million output data samples per second (M data/s). Exploiting Synopsys Prime Time, power with frequency of 50 MHz is obtained. For distinguishing several designs below collective metrics of area, power and speed operation, standardized areapower product divided by TP (AP/TP) is incorporated additionally.

          In contrast to CSD recoding, CSE can efficiently decrease amount of adders employed in MCM of multiplierless design using transposed structure. Various approaches of CSE are Non Recursive Signed CSE (NRSCSE) [1] and Multiroot Binary Partition Graph (MBPG) [2], while Structured Adders are not reduced. Through combined optimization of coefficient quantization and CSE as recommended in [18] and [19] further MCM adder cost shall be decreased. LUT realized by dual-port fragmented memory sharing memory cells is used to keep odd multiples of constant for constant

          multiplication as in [11]. Above method requires full-custom design of LUT circuits which can be achieved by ROM applying verilog codes. If modified ROM circuits are prepared, area cost can be further decreased. Using distributed arithmetic for FIR filter realization, 1-D as well as 2-D systolic arrays are offered in [12]. Because of cost issues, 1-D distributed arithmetic design only applied here.

          Main purpose of several earlier transposed structure FIR filter designs is to reduce adders cost in MCM which consumes below 20% of overall area. MCM cost certainly in transposed-form NRSCSE and MBPG can be efficiently decreased [18], [19]. Due to the limit increase of outcomes following MCM, area of DFFs in transposed structure is increased and also SAs are not optimized. FIR filter design proposed comprises four types. MCMA is basic realization applying merged PP compression using uniformly quantized coefficients and by using non-uniform coefficient quantization, an enhanced design is attained as MCMA_opt. The enhanced versions which truncate PPBs applying approaches as illustrated in fig. 6(a) and (b) are denoted as MCMAT_I and MCMAT_II correspondingly. The comparative analysis of four enhanced versions designed for filter C is demonstrated in fig. 8. Overall number of PPBs

          prior to compression in MCMAT_II, MCMAT_I, MCMA_opt and MCMA are 1946, 2124, 3175 and 3407 correspondingly. MCMAT_II, MCMAT_I and MCMA_opt possess area decrease rates of 27, 24% and 3% correspondingly which are lesser than earlier models of TP contrast to basic MCMA structures.

          Fig. 8. (a) Area, (b) Delay and (c) Power of proposed designs for filter C.

          The designs proposed appreciably decrease area however critical path delay is greater than before as entire MCMA operations are carried out in one clock cycle. By means of including pipeline registers in PP compression, delay can be probably decreased as recommended in [17] as main purpose is to reduce number of FAs, HAs and registers applying integer linear programming. Here main idea is to construct FIR filter designs for portable functions with average execution rate as main design factors are area and power. The modified designs possess reduced area and power in contrast with earlier MCM designs as revealed in Tables IIIV. The cost of adder in MCM module is lowered through CSE of transposed FIR configuration however the compensated cost added DFFs which in turn uses almost 40% of overall area. Based on memory-based designs, to produce output 1-D DA

          [12] demands 12 cycles while LUT based designs [11] have high area cost. According to united metrics of area, power and TP conditions, the modified designs retain finest AP/TP including MCMAT_II standardized to one.

        5. CONCLUSION

Reduced area, power and cost FIR filter designs is provided in this concise altogether with respect to coefficient bit width optimization and hardware sources realization. The study confirms that MCMAT direct FIR structure heads to minimum area, cost and power utilization, even if several earlier designs are build on transposed form.

REFERENCES

  1. M. M. Peiro, E. I. Boemo and L. Wanhammar, Design of high-speed multiplierless lters using a nonrecursive signed common subexpression algorithm, IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 49, no. 3, pp. 196203, Mar. 2002.

  2. C.-H. Chang, J. Chen, and A. P. Vinod, Information theoretic approach to complexity reduction of FIR lter design, IEEE Trans. Circuits Syst.I, Reg. Papers, vol. 55, no. 8, pp. 23102321, Sep. 2008.

  3. F. Xu, C. H. Chang, and C. C. Jong, Contention resolutionA new approach to versatile subexpressions sharing in multiple constant multiplications,IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 55, no. 2, pp. 559571, Mar. 2008.

  4. F. Xu, C. H. Chang, and C. C. Jong, Contention resolution algorithms for common subexpression elimination in digital lter design, IEEE Trans.Circuits Syst. II, Exp. Briefs, vol. 52, no. 10, pp. 695700, Oct. 2005.

  5. I.-C. Park and H.-J. Kang, Digital lter synthesis based on an algorithm to generate all minimal signed digit representations, IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 21, no. 12, pp. 15251529, Dec. 2002.

  6. C.-Y. Yao, H.-H. Chen, T.-F. Lin, C.-J. J. Chien, and X.-T. Hsu, A novel common-subexpression-elimination method for synthesizing

    xed-point FIR lters, IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 51, no. 11, pp. 22152221, Sep. 2004.

  7. O. Gustafsson, Lower bounds for constant multiplication problems,IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 54, no. 11, pp. 974978, Nov. 2007.

  8. Y. Voronenko and M. Puschel, Multiplierless multiple constant multiplication, ACM Trans. Algorithms, vol. 3, no. 2, pp. 138, May 2007.

  9. D. Shi and Y. J. Yu, Design of linear phase FIR lters with high probability of achieving minimum number of adders, IEEE Trans. Circuits Syst.I, Reg. Papers, vol. 58, no. 1, pp. 126136, Jan. 2011.

  10. R. Huang, C.-H. H. Chang, M. Faust, N. Lotze, and Y. Manoli, Signextension avoidance and word-length optimization by positive- offset representation for FIR lter design, IEEE Trans. Circuits Syst. II, Exp.Briefs, vol. 58, no. 12, pp. 916920, Oct. 2011.

  11. P. K. Meher, New approach to look-up-table design and memory- based realization of FIR digital lter, IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 57, no. 3, pp. 592603, Mar. 2010.

  12. P. K. Meher, S. Candrasekaran, and A. Amira, FPGA realization of FIR lters by efcient and exible systolization using distributed arithmetic,IEEE Trans. Signal Process., vol. 56, no. 7, pp. 30093017, Jul. 2008.

  13. S. Hwang, G. Han, S. Kang, and J.-S. Kim, New distributed arithmetic algorithm for low-power FIR lter implementation, IEEE Signal Process. Lett., vol. 11, no. 5, pp. 463466, May 2004.

  14. H.-J. Ko and S.-F. Hsiao, Design and application of faithfully rounded and truncated multipliers with combined deletion, reduction, truncation, and rounding, IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 58, no. 5, pp. 304308, May 2011.

  15. H. Samueli, An improved search algorithm for the design of multiplierless FIR lters with powers-of-two coefcient, IEEE Trans. Circuits Syst., vol. 36, no. 7, pp. 10441047, Jul. 1989.

  16. Y. C. Lin and S. Parker, Discrete coefcient FIR digital lter design based upon an LMS criteria, IEEE Trans. Circuits Syst., vol. 30, no. 10, pp. 723739, Oct. 1983.

  17. Blad and O. Gustafsson, Integer linear programming-based bit-level optimization for high-speed FIR lter architecture, Circuits Syst. Signal Process., vol. 29, no. 1, pp. 81101, Feb. 2010.

  18. F. Xu, C. H. Chang, and C. C. Jong, Design of low-complexity FIR lters based on signed-powers-of-two coefcients with reusable common subexpressions, IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 26, no. 10, pp. 18981907, Oct. 2007.

  19. Y. J. Yu and Y. C. Lim, Design of linear phase FIR lters in subexpression space using mixed integer linear programming, IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 54, no. 10, pp. 23302338, Oct. 2007.

  20. K. C. Bickerstaff, M. Schulte, and E. E. Swartzlander, Jr., Reduced area multipliers, in Proc. Int. Conf. Appl.-Specic Array Processors, 1993, pp. 478489.

Leave a Reply

Your email address will not be published. Required fields are marked *