Adaptive FIR Filter based on Distributed Arithmetic and LMS Algorithm for Low-Area and Low-Power

DOI : 10.17577/IJERTV3IS20906

Download Full-Text PDF Cite this Publication

Text Only Version

Adaptive FIR Filter based on Distributed Arithmetic and LMS Algorithm for Low-Area and Low-Power

K. Jebin Roy

M.E. VLSI Design

Anand Institute of Higher Technology Affiliated to Anna University, Chennai 603103, India

R. Ramya M.Tech. VLSI Design, Sathyabama University, Accredited by NAAC, Chennai 600119, India

Abstract In this manuscript, we proposed a novel pipelined architecture for low-power and low-area adaptive FIR filter based on distributed arithmetic (DA) and LMS algorithm. DA is bit-serial computational process and uses parallel look-up table (LUTs) apprise and equivalent implementation of filtering and weight-update operations to appliance high throughput filter rates irrespective of the filter length. The full adder based conditional signed carry save accumulation for DA-based inner product computation is replaced and design by using 10 transistor full adder based carry save accumulation of shift accumulation, with the intention of the proposed design, which can reduce the area complexity and power consumption. The least-mean-square (LMS) algorithm adaptation is functioned to update the weight and abate the mean square error between the assessed and chosen output. The weight increment block based adder/subtractor cells is swapped by carry save adder in order to reduce area difficulty. It comprises of multiplexors, smaller LUT, and practically half the number of transistors compared to the present DA-based design.

Index Terms: Adaptive Filter, Distributed Arithmetic (DA), Finite Impulse Response (FIR), Least Mean Square (LMS) Algorithm, Lookup table (LUT).


    Adaptive filters find extensive use in many signal processing applications such as channel equalization, echo cancellation, noise cancellation [1]. The finite impulse response (FIR) filters whose weights are updated by the famous Widrow-Hoff least mean square (LMS) algorithm is the most popularly used adaptive filter not only due to its simplicity but also due to its satisfactory convergence performance [5]. The direct form configuration on the onward path of the FIR filter results in a long critical path due to an inner product computation to obtain a filter output. Consequently, it is required to reduce the critical path of the structure if the input signal has high sampling rate. By reducing the critical path of the structure, thereby, the critical path could not exceed the sampling period.

    Distributed arithmetic (DA) is so named because it performed arithmetic operation. DA is bit serial computation in nature and it eliminates the need for hardware multipliers

    and is capable of implementing large order filters with very high throughput. A lot of study has been done to implement the DA based adaptive FIR filter for area efficient design, the multiplier-less distributed arithmetic (DA) based technique has achieved plenteous popularity for its high throughput, but it results are increased in cost-effective, area and time efficient computing structures [8]. DA based hardware efficient adaptive FIR filter inner product has been suggested by Allred et al. [2] using two separate lookup tables (LUTs) Filtering lookup table and Auxiliary lookup table for filtering and weight updating module. Later, Guo and DeBrunner [3], [4] have improved the design structure in [2] by using only one lookup table instead of two LUTs for both filter and weight updating module. On the other hand, the design process in [2], [3], [4] and [8] require more cycles for lookup table (LUT) update for each new sample, hence it do not support high sampling rate. Meher and Park have improved the design with low adaptation delay for high speed DA based adaptive filter [6]. In a recent paper, Meher and Park proposed a new DA based adaptive filter architecture for low power, low area and high throughput with very low adaptation delay [7].

    This brief proposes an adaptive FIR filter using distributed arithmetic for area efficient design. High Throughput is achieved by using a parallel lookup table update and equivalent implementation of filtering and weight-updating operations. The conditional signed carry saved accumulation for DA-based inner product computation is designed by using 10 transistor full adder based carry saved accumulation of shift accumulation. The use of the proposed design helps to reduce the area complexity and power consumption.

    In the next section, a brief study of the least mean square (LMS) adaptive algorithm, followed by the description of the proposed DA based technique filter in Section 3. The structure of the proposed adaptive filter and description of the proposed DA based adaptive FIR filter in Section 4. Results and Conclusions are given in Section 5 and 6.


    The LMS algorithm computes a filter output and an error value that is equal to the difference between the current

    filter output and the desired response for every clock cycle. In every training cycle, the estimated error is then used to

    and the inner product given by (7) can be computed as

    = 1 2 . 0 , = 1 .

    update the filter weights. The weights of LMS adaptive filter




    during the nth iteration is updated according to the following equations [6]:

    + 1 = + µ . (1a)


    = (1b)

    = . (1c)

    The input vector and the weight vector at the nth training iteration are respected given by

    = , 1 , , + 1 (2)

    = 0 , 1 , , 1 (2)

    is the desired response, and is the filter output of the iteration. denotes the error value generated during the iteration, which is used to update the weights, µ is the convergence factor, and N is the filter length.

    In the case of filter designs, the feedback error

    becomes available after certain number of cycles, called the

    adaptation delay. The pipelined architectures therefore use the delayed error for updating the current weight instead of the most recent error, where is the adaptation delay. The weight update equation of such delayed LMS adaptive filter is given by

    + 1 = + µ . .



    In each cycle, the LMS adaptive filter needs to perform an inner-product computation which contributes to the most of the critical path. Let the inner product computation of (1c) be given by

    Meanwhile any element of the N-point bit sequence

    0 1 can either be 1 or 0, the partial sum for = 0, 1,, 1 can have 2 possible values. If the entire 2 possible values sum are precomputed and stored in a LUT, the partial sum can be read out from the LUT using the bit sequence as address bits for computing the inner product.

    Figure 1: DA-based implementation of four point inner product

    Figure 2: Carry save implementation of shift accumulation

    The inner product of (8) can therefore be calculated in cycles of carry save implementation of shift accumulation, followed by LUT-read operations corresponding to number of bit slices for 0 1, as shown in Fig. 1. Since the carry save implementation of shift accumulation in Fig. 2 required more area and power consumption.


    = 1 .


    Where and for 0 1 form the N point vectors corresponding to the current weights and most recent 1 input respectively. Let us assume be the bit width of the weigt, every component of the vector weight may be expressed in 2s complement representation


    = 0 + 1 . 2


    Where denotes the bit of . Substituting (5), we can write (4) in an expanded form

    = 1 . 0 + 1 . 1 . 2





    Figure 3: 10T 1-Bit Full Adder

    To convert the sum-of-product form of (4) into a distributed form, the order of summations over the indices and in

    (6) can be interchanged to have

    = 1 . 0 + 1 2 . 1 .

    The carry save implementation of shift accumulation based full adder is design by using 10 transistor one bit-full adder

    [9] as shown in Fig. 4. The bit slices of vector are fed one





    after the next in the LSB to the MSB order to the carry save

    accumulator. Finally, the sum and carry output of the carry save accumulator is obtained after clock cycle are required to be added by a final adder. The content of the LUT location can be expressed as


    = 1 .


    where is the + 1 th bit of the – bit binary representation of integer for 0 2 1 can be precomputed and stored in RAM based LUT of 2 words. However, instead of storing 2 words in LUT, we store (2 1) words in a DA table of (2 1) registers.

    Figure 5: Distributed arithmetic table

    Figure 6: Proposed structure of DA-based LMS adaptive filter length

    = 4

    DA table for N=4 is shown in Fig. 5. DA table contains only 15 registers to store the precalculated sums of input words. In DA table, seven new values of are computed by seven adders in parallel.


    Figure 7: Proposed structure of DA-based LMS adaptive filter of length N=16

    A straight-forward DA-based implementation of inner product requires LUT of very large size. For that reason, the computation of the inner products of large orders needs to be decomposed [4] into small adaptive filtering blocks as shown in Fig. 6 and large order adaptive filters shown in Fig. 7.

    The structure of DA-based adaptive filter of length N=4 comprises of a four-point inner-product block and a weight- increment block along with additional circuits for the

    computation of error value and control word for the barrel shifters. The four-point inner-product block [shown in Fig. 1] contains a DA table consisting of an array of 15 registers as shown in Fig. 5 which stores the partial inner products for 0 < 15 and a 16:1 multiplexor to select the content of one of those registers from the DA table. Bit slices of weights A = 3 2 1 0 for 0 1 are

    fed to the MUX as control in LSB to- MSB order, and the output of the MUX is fed to the carry save accumulator using 10T full adder as shown in Fig. 4. After bit cycles, the carry save accumulator shift accumulates all the partial inner products and generates a sum and carry output word of size (+2) bit each. The carry and sum words are shifted added with an input carry 1 to generate filter output which is subsequently subtracted from the desired output to obtain the error .

    Figure 8: Structure of the weight-increment block for = 4

    if r6=1 then t= 000; else if r5 = 1 then t= 001; else if r4 = 1 then t= 010; else if r3 = 1 then t= 011; else if r2 = 1 then t= 100; else if r1 = 1 then t= 101; else if r0 = 1 then t= 110;

    Figure 9: Logic used for generation of control word for the barrel shifter for

    = 8

    As in the case in [4], all the bits of the error except the most significant bit (MSB) one are ignored (8th Bit). The remaining bits are magnitude of the error, the magnitude of the computed error is decoded to generate the control word for the barrel shifter. The logic used for the generation of control word for the barrel shifter is shown in Fig. 9. The number of shifts in that case is increased by locations accordingly to reduce the hardware complexity. The weight increment unit as shown in Fig. 8 for = 4 comprises of 4 barrel shifters and four carry save adder cells. The barrel shifter shifts the different input values for = 0, 1, 2 1 by appropriate number of locations. The barrel shifter yields the desired increments are fed to the carry save adder with the sign bit from the error value. The sign bit of the error is used as the control for the 2:1 MUX to select any one of the sum or carry output from the Carry save adder. The output of the MUX is fed to the Byte- parallel to Bit-serial converter to convert 8 bit data into 1 bit data. The output waveform of DA-based adaptive FIR filter (N=16) as shown in Fig. 10.

    Figure 10: DA-based LMS adaptive FIR filter of length N=4


    Thus the existing and proposed designs in [7] and [8] are implemented in Xilinx 14.1 using verilog code. Along with area and power of corresponding design are measured using Tanner 15.1 EDA in 250nm CMOS technology.

    Table 1: Implementation Results Using Xilinx 14.1 and Tanner 15.1









    N = 16




    N = 16




In this script, an adaptive FIR filter using distributed arithmetic (DA) for area efficient design is implemented. High throughput is drastically enriched by parallel (LUTs) update and equivalent implementation of filtering and weight-update operations. The proposed carry save accumulation using 10 transistor full adder schemes of signed partial inner products for the computation of the filter output and also modified in weight increment block. By this way it utilizes low area, low power consumption and the throughput of the filter rates increases irrespective of the filter length.


  1. B. Widrow and S. D. Stearns, Adaptive signal processing. Prentice Hall, Englewood Cliffs, NJ, 1985.

  2. D. J. Allred, H. Yoo, V. Krishnan, W. Huang, and D. V. Anderson,

    LMS adaptive filters using distributed arithmetic for high throughput, IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 52, no.7,pp. 13271337, Jul. 2005.

  3. R. Guo and L. S. DeBrunner, Two high-performance adaptive filter implementation schemes using distributed arithmetic, IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 58, no. 9, pp. 600604, Sep. 2011.

  4. R. Guo and L. S. DeBrunner, A novel adaptive filter implementation scheme using distributed arithmetic, in Proc. Asilomar Conf. Signals, Syst., Comput., , pp. 160164, Nov. 2011.

  5. S. Haykin and B. Widrow Least-Mean-Square Adaptive Filters Hoboken, NJ, USA: Wiley, 2003.

  6. P. K. Meher and S. Y. Park, High-throughput pipelined realization of adaptive FIR filter based on distributed arithmetic, in VLSI Symp. Tech. Dig., Oct. 2011, pp. 428433.

  7. Sang Yoon Park, Member, IEEE, and Pramod Kumar Meher, Senior Member, IEEE Low-Power, High-Throughput, and Low-Area Adaptive FIR Filter Based on Distributed Arithmetic IEEE Transactions On Circuits And SystemsIi: Express Briefs, Vol. 60, No. 6 June 2013.

  8. S. A. White Applications of the distributed arithmetic to digital signal processing: A tutorial review, IEEE ASSP Mag., vol. 6, no. 3, pp. 419, Jul. 1989.

  9. Hanan A.Mahmoud and Magdy.A. Bayoumi, A 10-Transistor Low Power High Speed Full Adder Cell.

Leave a Reply