 Open Access
 Total Downloads : 10
 Authors : Veera Boopathy. E, R. Dhayabarani
 Paper ID : IJERTCONV3IS16018
 Volume & Issue : TITCON – 2015 (Volume 3 – Issue 16)
 Published (First Online): 30072018
 ISSN (Online) : 22780181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Design and Implementation of Low PowerArea efficient Rounded Truncated Multiple Constant Multiplier for FIR filters
Veera Boopathy. E, PG Scholar, VLSI Design, VSB Engineering College,
Karur.
R. Dhayabarani,
Associate Professor,

Engineering College, Karur.
AbstractFaithfully rounded truncated multipliers are used to obtain highly economical Finite Impulse Response filter designs. Both bit width optimization and hardware resources are equally considered devoid of sacrificing the frequency response and output signal accuracy. In order to reduce total area cost asymmetrical coefficient quantization with suitable filter order is suggested. An improved version of truncated multipliers is utilized to realize multiple constant multiplication
/ accumulation in a direct FIR structure. In comparative analysis with earlier FIR design methodologies illustrate that the proposed designs accomplish best area and power results.
KeywordsDigital signal processing (DSP), faithful rounding, finite impulse response (FIR) filter, truncated multipliers, VLSI design.

INTRODUCTION
In Digital Signal Processing (DSP) and communication systems, Finite Impulse Response (FIR) digital filter is one of the essential components whose response to any finite length input settles to zero in finite time. They are extensively applied in various portable applications with reduced area and power resources. An expression for FIR filter of order M is
=0
=0
y(n) = 1 x(ni)
In occurrence of linear phase, the coefficients are symmetric or asymmetric with ai = aMI or ai = aMi.
Fig. 1. Direct form structure of linearphase evenorder FIR filters
Fig. 2. Transposed form structure of linearphase evenorder FIR filters.
Two primitive FIR structures are direct form and transposed form as illustrated in Fig.1 meant for a linear phase FIR filter with evenorder.Fig.1(a) is direct form FIR Structure in which the multiple constant multiplication (MCM)/accumulation (MCMA) module accomplish parallel multiplications of specific delayed signals with respective filter coefficients, then pursued by accumulation of entire product terms. Consequently, in MCMA multiplier operands delayed input signals x (n i) and coefficients ai. Fig. 1(b) is transposed form FIR Structure in which MCM module has multiplier operands as current input signal x (n) and coefficients. The outcomes of distinct constant multiplications use structure adders (SAs) with delay elements. Several papers on design and implementation of lowcost otherwise highspeed FIR filters are presented [1] [13], [15][19]. In order to prevent expensive multipliers, earlier hardware implementations of digital FIR filters are segregated into two classes as multiplier less based and memory based.
Multiplier lessbased designs accomplish MCM by means of shift and add operations and assign common sub operations by Canonical Signed Digit (CSD) recoding and common subexpression elimination to reduce the adder cost of MCM [1][10].Area reduction is attained effectively by considering optimization of coefficient quantization and CSE equally in [18] and [19]. The transposed structure is employed in several multiplier less MCMbased FIR filter designs to allow for crosscoefficient sharing and likely to be faster, especially when the filter order is large. But, area of delay elements is higher in contrast with that of the direct form because of the range expansion of the constant multiplications and the successive additions in the SAs. In
[17] Gustafsson and Blad by means of integer linear programming pipelined the carrysave adder trees in the constant multiplications to reduce area cost of full adders, half adders, and registers that provide highthroughput (TP) FIR filter designs.Memorybased FIR designs contain two kinds of methods as LookUp Table (LUT) methods and Distributed Arithmetic (DA) methods [11][13]. In order to achieve the constant multiplications in MCM, LUTbased design stores odd multiples of the input signal in ROMs [11]. Distributed Arithmetic method recursively accumulate bitlevel partial results for the inner product calculation in FIR filtering [12] [13].
Optimization of bit widths for filter coefficients is a crucial design problem of FIR filter realization which comprises direct influence on area cost of arithmetic units and registers. Furthermore, as the bit widths increase after multiplications, several DSP applications do not require full precision outputs. Alternatively, it is advantageous to produce faithfully rounded outputs where overall error present in quantization and rounding is merely one unit of the last place (ULP) described as the weighting of the least significant bit (LSB) of the outputs.
By means of faithfully rounded truncated multipliers, a lowcost realization of FIR filter build on the direct structure in Fig. 1(a) is presented. MCMA module is accomplished by accumulating all the partial products (PPs) where redundant partial product bits (PPBs) are detached devoid of disturbing the final precision of the outputs. Nonuniform quantization including unequal word lengths are applied to reduce bit widths of all the filter coefficients with the purpose of reducing hardware cost yet achieving the requirement of frequency response.
This concise is arranged as pursues. Nonuniform quantization and optimization of filter coefficients is described in segment II. Partial Products generation and compression in the faithfully rounded MCMA module is depicted in segment II. The comparative analysis of results is presented in segment IV.

COEFFICIENT QUANTIZATION AND
OPTIMIZATION
recoding through digit set of {0, 1, 1} and radix4 modified Booth recoding through digit set of {0, 1, 1, 2, 2} are considered and then one outcomes in lesser area cost is chosen.
In nonuniform quantization, the bit width of each coefficient is reduced until frequency response is no longer satisfied. At last nonuniformly quantized coefficients are adjusted through addition or subtraction of LSB weight of each coefficient as well as more feasible bit width reduction is examined. The algorithm in Fig. 3 is applied to obtain filter order M as well as nonuniformly quantized coefficients reduce area cost in the FIR filter realization.

PP TRUNCATION AND COMPRESSION Here, direct form structure of FIR filter design is
implemented where entire products are summed up in MCMA module. It is further efficient to gather every PPS into one PPB matrix in spite of accumulating different multiplication for every product including carrysave addition to decrease height of the matrix to two, pursued by a final carry propagation adder. The difference between individual multiplications and combined multiplication for ((A*B) + (C*D)) is demonstrated in fig.4.
FREQUENCY RESPONSE SPECIFICATION
FINDING FILTER ORDER AND COEFFICIENTS
HARDWARE OPTIMIZATION
COEFFICIENT QUANTIZATION
Fig. 4. Multiplication / Accumulation using (a) individual PP compression
A basic ow of FIR lter design and implementation can be classified into three phase as shown in fig.2 such as obtaining lter order and coefcients, coefcient quantization and hardware optimization. Filter order and corresponding coefcients of innite precision are determined first to meet specication of the frequency response. Subsequently, coefcients are quantized to nite bit accuracy. Finally, various optimization methodologies such as CSE are applied to minimize the area cost of hardware implementations. everal earlier FIR lter implementations concentrate on hardware optimization stage.
As soon as FIR lter operations, owing to bit width expansion after multiplications output signals have greater bit width. Simply partial bits of the fullprecision outputs are required in various realistic conditions. Here direct FIR structure with MCMA is implemented as area cost of ip ops in the delay elements is lesser in contrast with transposed form. In addition three design stages in Fig. 2 are considered to accomplish more economical hardware design through faithfully rounded output signals.
Number of nonzero digits following coefficient quantization is reduced by performing recoding. Both CSD
and (b) combined PP compression.
Fig. 5. Generation of PPBs considering sign extention and negation.
The sign bit of every PP row is inversed to prevent sign extension bits and by the property (S=1S) bias constant is included, where S is sign bit of a pp row as revealed in fig.5. PPB matrix ultimate row holds every bias constant and over bared white round symbolize PPBs complements.
The FIR filter realization demands total error established in arithmetic operations should not be greater than one ulp. The truncated multiplier design in [14] is customized thus more PPBs can be removed which directs to minimum area
cost. The comparison between two distinct methods is revealed in fig.6. The elimination of redundant PPBs in [14] is compiled of three steps such as deletion, truncation and rounding. Two PPB rows are arranged non removable as they will be detached at succeeding truncation and rounding. Two PPB rows are arranged non removable as they will be detached at succeeding truncation and rounding.
`
Fig. 6. Truncated Multipiler designs using (a) approach in [14] and (b) improved version.
The error limits prior and later accumulating offset constants for deletion, truncation and rounding are specified as pursued,

1 ulp ED 0 – 1 ulp ED = ED + 1 ulp 1 ulp
2 4 4 4

1 ulp ET 0 – 1 ulp < ET = ET + 1 ulp 1 ulp
2 4 4 4

ulp < ER 0 – 1 ulp < ER = ER + 1 ulp 1 ulp

ulp ED 0 – 1 ulp ED = ED + 1 ulp 1 ulp
2 2 2

ulp < ER 0 – 1 ulp < ER = ER + 1 ulp 1 ulp

2 2 2
ulp < E = (ED + ER) ulp
More PPBs are removed in enhanced version as the limits of deletion error are twofold bigger than that in [14] which direct to lesser area in successive PPB compression. MCMA architecture including truncation (MCMAT) is shown in fig. 7 which eliminates excessive PPBs.
Fig.7. Overall FIR filter architecture using multiple constant multipliers/accumulators with faithfully rounded truncation (MCMAT).
In Lshaped block undeletable PPBs are symbolized as white rounds then removal PPBs are symbolized as gray rounds. Rounding of resulting bits following PP compression is symbolized as crossed circles. PPB matrix final row signifies entire offset and bias constants needed with sign bit adjustment.


EXPERIMENTAL RESULTS AND
COMPARISONS
According to the requirements provided inside Table I [15], [16] three FIR filters are realized. Actual filter order is indicated as M whereas area optimized filter order is indicated as Mopt. For coefficients which are uniformly quantized including filter order Mopt, the total fractional bits is represented with B. Without including MSB, effective
2
ulp < E = (ED + ET + ER) ulp
2 2 word length is indicated as EWL. Standardized edge frequencies to one for pass band and stop band are specified as fpass and fstop respectively and their equivalent peaktopeak
The deleted bits, truncated bits and rounded bits are
symbolized as gray circles, crossed green circles and crossed red circles correspondingly as revealed in fig. 6 (a). An enhanced form of rounded truncated multiplier design is suggested here as illustrated in Fig. 6(b). One row of PPBs is simply finished as non removable and PPB removal made up of just deletion and rounding. The processes of deletion and rounding in enhanced version have error limits as follows:
ripples are indicated as Apass and Astop.
TABLE I. SPECIFICATIONS OF THREE FIR FILTERS UNDER
Filter
M
Mopt
B
EWL
fpass
fstop
Apass (dB)
Astop (dB)
A
LP
25
28
11
10
0.15
0.25
0.09
46
B
LP
59
64
15
12
0.02
0.07
0.20
60
C
HP
121
121
19
17
0.40
0.37
0.10
80
Filter
M
Mopt
B
EWL
fpass
fstop
Apass (dB)
Astop (dB)
A
LP
25
28
11
10
0.15
0.25
0.09
46
B
LP
59
64
15
12
0.02
0.07
0.20
60
C
HP
121
121
19
17
0.40
0.37
0.10
80
SIGNIFICANCE
TABLE II. SYNTHESIS RESULTS OF FILTER A WITH 12BIT INPUT AND OUTPUT SIGNALS
Filter A (28tap LP)
MCM
(um2)
SAs (um2)
DFFs
(um2)
Area (um2)
Delay (ns)
TP (M
data/s)
Power (mw)
AP/TP
Struct.
CSD/Booth
4527
9737
9655
23919
4.78
209
1.15
2.38
Trans.
13460
5030
18490
6.75
148
0.91
2.05
Direct
NRSCSE[1]
3151
9737
9655
22543
4.78
209
1.10
2.14
Trans.
MBPG[2]
3141
9737
9655
22543
4.78
209
1.14
2.22
Trans.
LUT [11]
16155
9737
9655
35547
4.77
210
1.66
5.08
Trans.
1D DA [12]
3146
5526
8672
4.83
17
0.36
3.32
Direct
MCMA
10161
5030
15191
6.58
152
0.90
1.63
Direct
MCMA opt
9277
5030
14307
6.31
159
0.81
1.32
Direct
MCMAT_I
7716
5030
12746
6.35
158
0.71
1.04
Direct
MCMAT_II
7460
5030
12490
6.35
158
0.70
1
Direct
TABLE III. SYNTHESIS RESULTS OF FILTER B WITH 12BIT INPUT AND OUTPUT SIGNALS.
Filter B (64tap LP)
MCM
(um2)
SAs (um2)
DFFs
(um2)
Area (um2)
Delay (ns)
TP (M
data/s)
Power (mw)
AP/TP
Struct.
CSD/Booth
16657
27382
25691
69730
5.30
189
3.25
2.47
Trans.
39086
11736
50822
8.20
122
2.57
2.21
Direct
NRSCSE[1]
11245
27382
25691
64318
5.30
189
3.17
2.22
Trans.
MBPG[2]
10871
27382
25691
63944
5.31
188
3.37
2.36
Trans.
LUT [11]
50121
27382
25691
103194
5.29
189
4.83
5.44
Trans.
1D DA [12]
13215
12279
25494
5.99
14
0.94
3.53
Direct
MCMA
29705
11736
41441
7.74
129
2.50
1.66
Direct
MCMA opt
28391
11736
40127
7.63
131
2.40
1.52
Direct
MCMAT_I
22654
11736
34390
7.43
135
2.04
1.07
Direct
MCMAT_II
21510
11736
33246
7.43
135
1.97
1
Direct
TABLE IV. SYNTHESIS RESULTS OF FILTER C WITH 12BIT INPUT AND OUTPUT SIGNALS.
Filter C (121tap LP)
MCM
(um2)
SAs (um2)
DFFs
(um2)
Area (um2)
Delay (ns)
TP (M
data/s)
Power (mw)
AP/TP
Struct.
CSD/Booth
50813
62047
57451
170311
6.23
161
8.82
3.99
Trans.
103263
22353
125616
9.74
103
6.40
3.34
Direct
NRSCSE[1]
31430
62047
57451
150928
6.23
161
8.19
3.28
Trans.
MBPG[2]
30422
62047
57451
149920
6.23
161
8.45
3.36
Trans.
LUT [11]
124717
62047
57451
244215
6.22
161
13.70
3.88
Trans.
1D DA [12]
32068
22990
55058
7.17
11
2.04
4.36
Direct
MCMA
71888
22353
94241
8.94
112
6.00
2.16
Direct
MCMA opt
69024
22353
91377
8.99
111
5.65
2.00
Direct
MCMAT_I
49278
22353
71631
8.24
121
4.37
1.11
Direct
MCMAT_II
46018
22353
68371
8.24
121
4.14
1
Direct
By means of 90nm standard cell library of Taiwan Semiconductor Manufacturing Company, performance effects for 12 fractional bits input and output signals are demonstrated in Tables IIIV with Synopsys De sign Compiler. Arithmetic units (MCM and SAs) along with D type FlipFlops (DFFs) comprise overall area of architecture in addition to this TP is characterized in metric of million output data samples per second (M data/s). Exploiting Synopsys Prime Time, power with frequency of 50 MHz is obtained. For distinguishing several designs below collective metrics of area, power and speed operation, standardized areapower product divided by TP (AP/TP) is incorporated additionally.
In contrast to CSD recoding, CSE can efficiently decrease amount of adders employed in MCM of multiplierless design using transposed structure. Various approaches of CSE are Non Recursive Signed CSE (NRSCSE) [1] and Multiroot Binary Partition Graph (MBPG) [2], while Structured Adders are not reduced. Through combined optimization of coefficient quantization and CSE as recommended in [18] and [19] further MCM adder cost shall be decreased. LUT realized by dualport fragmented memory sharing memory cells is used to keep odd multiples of constant for constant
multiplication as in [11]. Above method requires fullcustom design of LUT circuits which can be achieved by ROM applying verilog codes. If modified ROM circuits are prepared, area cost can be further decreased. Using distributed arithmetic for FIR filter realization, 1D as well as 2D systolic arrays are offered in [12]. Because of cost issues, 1D distributed arithmetic design only applied here.
Main purpose of several earlier transposed structure FIR filter designs is to reduce adders cost in MCM which consumes below 20% of overall area. MCM cost certainly in transposedform NRSCSE and MBPG can be efficiently decreased [18], [19]. Due to the limit increase of outcomes following MCM, area of DFFs in transposed structure is increased and also SAs are not optimized. FIR filter design proposed comprises four types. MCMA is basic realization applying merged PP compression using uniformly quantized coefficients and by using nonuniform coefficient quantization, an enhanced design is attained as MCMA_opt. The enhanced versions which truncate PPBs applying approaches as illustrated in fig. 6(a) and (b) are denoted as MCMAT_I and MCMAT_II correspondingly. The comparative analysis of four enhanced versions designed for filter C is demonstrated in fig. 8. Overall number of PPBs
prior to compression in MCMAT_II, MCMAT_I, MCMA_opt and MCMA are 1946, 2124, 3175 and 3407 correspondingly. MCMAT_II, MCMAT_I and MCMA_opt possess area decrease rates of 27, 24% and 3% correspondingly which are lesser than earlier models of TP contrast to basic MCMA structures.
Fig. 8. (a) Area, (b) Delay and (c) Power of proposed designs for filter C.
The designs proposed appreciably decrease area however critical path delay is greater than before as entire MCMA operations are carried out in one clock cycle. By means of including pipeline registers in PP compression, delay can be probably decreased as recommended in [17] as main purpose is to reduce number of FAs, HAs and registers applying integer linear programming. Here main idea is to construct FIR filter designs for portable functions with average execution rate as main design factors are area and power. The modified designs possess reduced area and power in contrast with earlier MCM designs as revealed in Tables IIIV. The cost of adder in MCM module is lowered through CSE of transposed FIR configuration however the compensated cost added DFFs which in turn uses almost 40% of overall area. Based on memorybased designs, to produce output 1D DA
[12] demands 12 cycles while LUT based designs [11] have high area cost. According to united metrics of area, power and TP conditions, the modified designs retain finest AP/TP including MCMAT_II standardized to one. 
CONCLUSION

Reduced area, power and cost FIR filter designs is provided in this concise altogether with respect to coefficient bit width optimization and hardware sources realization. The study confirms that MCMAT direct FIR structure heads to minimum area, cost and power utilization, even if several earlier designs are build on transposed form.
REFERENCES

M. M. Peiro, E. I. Boemo and L. Wanhammar, Design of highspeed multiplierless lters using a nonrecursive signed common subexpression algorithm, IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 49, no. 3, pp. 196203, Mar. 2002.

C.H. Chang, J. Chen, and A. P. Vinod, Information theoretic approach to complexity reduction of FIR lter design, IEEE Trans. Circuits Syst.I, Reg. Papers, vol. 55, no. 8, pp. 23102321, Sep. 2008.

F. Xu, C. H. Chang, and C. C. Jong, Contention resolutionA new approach to versatile subexpressions sharing in multiple constant multiplications,IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 55, no. 2, pp. 559571, Mar. 2008.

F. Xu, C. H. Chang, and C. C. Jong, Contention resolution algorithms for common subexpression elimination in digital lter design, IEEE Trans.Circuits Syst. II, Exp. Briefs, vol. 52, no. 10, pp. 695700, Oct. 2005.

I.C. Park and H.J. Kang, Digital lter synthesis based on an algorithm to generate all minimal signed digit representations, IEEE Trans. Comput.Aided Design Integr. Circuits Syst., vol. 21, no. 12, pp. 15251529, Dec. 2002.

C.Y. Yao, H.H. Chen, T.F. Lin, C.J. J. Chien, and X.T. Hsu, A novel commonsubexpressionelimination method for synthesizing
xedpoint FIR lters, IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 51, no. 11, pp. 22152221, Sep. 2004.

O. Gustafsson, Lower bounds for constant multiplication problems,IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 54, no. 11, pp. 974978, Nov. 2007.

Y. Voronenko and M. Puschel, Multiplierless multiple constant multiplication, ACM Trans. Algorithms, vol. 3, no. 2, pp. 138, May 2007.

D. Shi and Y. J. Yu, Design of linear phase FIR lters with high probability of achieving minimum number of adders, IEEE Trans. Circuits Syst.I, Reg. Papers, vol. 58, no. 1, pp. 126136, Jan. 2011.

R. Huang, C.H. H. Chang, M. Faust, N. Lotze, and Y. Manoli, Signextension avoidance and wordlength optimization by positive offset representation for FIR lter design, IEEE Trans. Circuits Syst. II, Exp.Briefs, vol. 58, no. 12, pp. 916920, Oct. 2011.

P. K. Meher, New approach to lookuptable design and memory based realization of FIR digital lter, IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 57, no. 3, pp. 592603, Mar. 2010.

P. K. Meher, S. Candrasekaran, and A. Amira, FPGA realization of FIR lters by efcient and exible systolization using distributed arithmetic,IEEE Trans. Signal Process., vol. 56, no. 7, pp. 30093017, Jul. 2008.

S. Hwang, G. Han, S. Kang, and J.S. Kim, New distributed arithmetic algorithm for lowpower FIR lter implementation, IEEE Signal Process. Lett., vol. 11, no. 5, pp. 463466, May 2004.

H.J. Ko and S.F. Hsiao, Design and application of faithfully rounded and truncated multipliers with combined deletion, reduction, truncation, and rounding, IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 58, no. 5, pp. 304308, May 2011.

H. Samueli, An improved search algorithm for the design of multiplierless FIR lters with powersoftwo coefcient, IEEE Trans. Circuits Syst., vol. 36, no. 7, pp. 10441047, Jul. 1989.

Y. C. Lin and S. Parker, Discrete coefcient FIR digital lter design based upon an LMS criteria, IEEE Trans. Circuits Syst., vol. 30, no. 10, pp. 723739, Oct. 1983.

Blad and O. Gustafsson, Integer linear programmingbased bitlevel optimization for highspeed FIR lter architecture, Circuits Syst. Signal Process., vol. 29, no. 1, pp. 81101, Feb. 2010.

F. Xu, C. H. Chang, and C. C. Jong, Design of lowcomplexity FIR lters based on signedpowersoftwo coefcients with reusable common subexpressions, IEEE Trans. Comput.Aided Design Integr. Circuits Syst., vol. 26, no. 10, pp. 18981907, Oct. 2007.

Y. J. Yu and Y. C. Lim, Design of linear phase FIR lters in subexpression space using mixed integer linear programming, IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 54, no. 10, pp. 23302338, Oct. 2007.

K. C. Bickerstaff, M. Schulte, and E. E. Swartzlander, Jr., Reduced area multipliers, in Proc. Int. Conf. Appl.Specic Array Processors, 1993, pp. 478489.