 Open Access
 Total Downloads : 412
 Authors : Nagaraj Gowd H, K. Santha, I. V. Rameswar Reddy
 Paper ID : IJERTV2IS110759
 Volume & Issue : Volume 02, Issue 11 (November 2013)
 Published (First Online): 30112013
 ISSN (Online) : 22780181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Fast Block LMS Adaptive Filter Using DA Technique for High Performance in FGPA
Nagaraj Gowd H 1, K.Santha2, I.V.Rameswar Reddy3
1, 2, 3 Dept. Of ECE, AVR & SVR Engineering College, Kurnool, A.P, India
AbstractThis paper proposes a design and implementation of high throughput adaptive digital filter using Fast Block Least Mean Squares (FBLMS) adaptive algorithm. The filter structure is based on DA Technic, which is able to calculate the inner product by shifting, and accumulating of partial products and storing in lookup table, also the desired adaptive digital filter will be multiplier less. Thus a DA based implementation of adaptive filter is highly computational and area efficient. Furthermore, the fundamental building blocks in the DA architecture map well to the architecture of todays Field Programmable Gate Arrays (FPGA). FPGA implementation results conforms that the proposed DA based adaptive filter can implement with significantly smaller area usage, (about 45%) less than that of the existing FBLMS algorithm based adaptive filter.
Keywords FBLMS, distributed arithmetic, FFT, cyclic convolution.

INTRODUCTION
Adaptive digital filters are widely used in the area of signal processing such as echo cancelation, noise cancelation, channel equalization for communications and networking systems [1], [2]. The necessity of hardware implementation requires various of performances such as high speed, low power dissipation and good convergence characteristics.
Fast block least mean square (FBLMS) algorithm proposed by Clark et al. [3] is one of the fastest and computationally efficient adaptive algorithm since here the process of filtering and adaption is done in frequency domain by using FFT algorithms. But there is still possibility to further enhanced the throughput of FBLMS algorithm based adaptive filters by applying the concept of Distributed arithmetic (DA) proposed by A. Peled and B. Liu [4]. Using bit level rearrangement of a multiply accumulate terms, DA can hide the complex hardware multipliers and therefore, the desired system becomes multiplierless. DA is a powerful technique for reducing the size of a parallel hardware multiplyaccumulate that is well suited for FPGA designs. Recently there has been a trend to implement DSP functions using FPGAs. While application specific integrated circuits (ASICs) are the traditional solution to high performance applications due to high development costs and timetomarket factor. The main reason behind the popularity of the FPGA is due to balance that FPGAs provide the designer in terms of flexibility, cost, and timetomarket.
The concept of DA has already applied to LMS based adaptive filters [5], [6], [7], [8] but, not to FBLMS algorithm based adaptive filters. This paper proposes a new hardware efficient implementation of FBLMS algorithm based adaptive filter using DA. The proposed architecture reduces the area requirement of original FBLMS by reducing the number of hardware multipliers using DA, result as adaptive filter with low power dissipation and high throughput.

DISTRIBUTED ARITHMETIC BACKGROUND
Consider the following inner product of two L dimensional vectors C and x, where c is a constant vector, x is the input sample vector, and y is the result.
The computation in distributed arithmetic is represented by (4). The values of bkn are either 0 or 1, resulting in bracketed term in (4) having only 2B possible values. Since c a constant vector, the bracketed term can be recomputed and stored in memory using either lookup table (LUT) or ROM. The lookup table is then addressed using the individual bits of input
samples, Xk with the final result y computed after B cycles, regardless of lengths of vectors c and x. A comprehensive tutorial review of OA linear filters is given in [9].

THE EXISTING FAST BLOCK LMS (FBLMS) BASED ADPATIVE FILTER
Consider a BLMS based adaptive filter, that takes an input sequence x(n), which is partitioned into nonoverlapping blocks of length P each by means of a serialtoparallel converter, and the blocks of data so produced are applied to an FIR filter of length L, one block at a time. The tap weights of the filter are updated after the collection of each block of data samples, so that the adaptation of the filter proceeds on block byblock basis rather than on a samplebysample basis as in conventional LMS algorithm [1].
With the jth block, (j E Z) consisting of x (j P + r) , r E Zp = 0,1, ….. , P 1, the filter coefficients are updated from block to block as,
where w(j) = [WO(j)Wl(j)…..WL_l(j)]t is the tap weight vector corresponding to the jth block, x (jP+r) = [x (jP+r) x(jP+r1) ….x(jP+r – L+l)]t and e(jP+r) is the output error at n= jP+r , given by,
The sequence d (j P + r) is the socalled desired response available during the initial training period and y (jP + r) is the filter output at n = j P + r , given as,
The parameter Âµ, popularly called the step size parameter is to be chosen as 0Âµ for convergence of the algorithm. For the lth subbloc within the i th block, O l K 1 i.e., forn = jP+r,r = 0,1, …… P1,j = iK +l, the filter output y (n) = wt (j) X (n) is obtained by convolving the input data sequence x(n) with the filter coefficient vector wt (j) and thus can be realized efficiently by the overlapsave method via M = L + P
1 point FFT, where the first L 1 points come from the previous subblock, for which the output is to be discarded. Similarly, the weight update term in (5) above, viz., be obtained by the usual circular correlation technique, by employing M point FFT and setting the last P 1 output terms as zero.

PROPOSED IMPLEMENTATION
The throughput of FBLMS based adaptive filter is limited by computational complexity lies in FFT (and IFFT) block. It is possible to enhance the throughput of system by implantation of that FFT (and IFFT) block with reduced hardware complexity. OA is one of the efficient techniques, in which, by means of a bit level rearrangement of a multiply accumulate terms; FFT can be implemented without multiplier. Since the main hardware complexity of the system
is due to hardware multipliers and introduction of OA eliminates the need of that multipliers and resulting system will have high throughput and also have low power dissipation. There are many fast Fourier transform (FFT) algorithm like radix2, CooleyTukey, Winograd, Good Thomas, Rader etc. But using OA, FFT can be efficiently calculated by jointly employing the GoodThomas and Rader algorithms. GoodThomas algorithm reexpresses the discrete Fourier transform (OFT) of a size N = NIN2 as a two dimensional NIN2 OFT, but only for the case where Nl and N2 are relatively prime. It is easily seen from the definition of the OFT that the transform of a length N real sequence x(n) has conjugate symmetry, i.e. X(N K) = X*(K). This property facilitates to compute only half of the transform, as the remaining half is redundant and need not be calculated. Rader algorithm provides straightforward way to compute only half of the conjugate symmetric outputs without calculating the others, which is not possible with other algorithms like radix, CooleyThkey and Winograd. Algorithm presented here first decomposes the one dimensional OFT into a multidimensional OFT using the index map proposed by Good [10]. Next, a method which is based on the index permutation proposed by Rader [11] is used to convert the short OFTs into convolution. This method changes a prime length N OFT of real data into two convolutions of length (N 1)/2. Oneconvolution is cyclic and the other is cyclic or skewcyclic. The index mapping suggest by Good and Thomas for n is
and as index mapping for k results
If we substitute the GoodThomas index map in the equation for OFT matrix it follows
Steps for GoodThomas FFT Algorithm
An N = N1N2point Off can be computed according to following steps:

Index transform of input sequence, according to (8).

Computation of N2 OFfs of length Nl using Rader algorithm.

Computation of Nl OFfs of length N2 using Rader algorithm.

Index transform of input sequence, according to (9).
Consider the length N= 14, suppose we have Nl=7 and N2=2 then mapping for the input index according to n. 2nl+7n2 mod 14 and k = 4k l + 7k2 mod 14 for output index results and using these index transforms we can construct the signal flow graph as shown in Fig. 1.
From Fig. 1 we realize that first stage has 2 OFfs each having 7 points and second stage has 7 OFfs each having of length 2. One of the interesting thing here is multiplication with twiddle factors between the stages is not required. Now consider Nl = 7 if the data are real we need to calculate only half of the transform. Also, as Rader showed the zero frequency term must be calculated separately.
In matrix t form, we write
If real and imaginary parts of W matrix in (13) are separated, a simplification is possible. Consider first the real part using notation in matrix that k stands for cos(27rk/7). The real part of (13) becomes
Using the notation of k for sin(27rk/7) gives, for the imaginary part of (13).
The (14) and (15) are cyclic convolution relation. Since in our problem we always convolve with the same coefficients (In case of Off it is twiddle factor matrix), arithmetic efficiency can be improved by pre calculating some of the intermediate results. These are stored in table in memory and simply addressed as needed. Using distributed arithmetic this can be implemented efficiently and detail can be found in [12]. Here we will present only the structure Fig. 2 best suited to the Off calculated by cyclic convolution.
Initially Rl to R7 are cleared to zero and the Xi'S are loaded into registers Rl to R3 after addition. Then all Rl to R3 are shifted by one bit, the last bit of each register is in R4. The ROM output will be added to R5. Circular shift of R4 produces at ROM output are added to R5 to R7. when the first cycle is completed content of all Rl to R7 except R4 are right shifted by one bit and the second cycle starts. At Bth cycle function at AD D _SU B changed from adder to subtraction and after this Bth cycle the content of R5 to R7 gives final FFT coefficient and zero frequency component can be calculated separately applying accumulate and addition of Xi'
as shown in Fig. 2 and the FFT block so obtained has no multipliers at the expanse of increased adder requirement and memory requirement in order to store some pre calculated values.
In the proposed architecture, using the DA based FFT (and IFFT) block that is described above, we recast the existing FBLMS based adaptive filter. Since, the DA based FFT block provides only half of the conjugate symmetric outputs without calculating others (reaming can calculated by conjugating them) and in DA based IFFT block we require to feed only half of the conjugate symmetric coefficients not all, which is not possible in existing FBLMS based adaptive filters since here radix2 based FFT (and IFFT) blocks are employed and in radix2 there is no such facility to calculate only half of the conjugate symmetric outputs hence, in this case half of the processed data are redundant. Since in our proposed FBLMS algorithm based adaptive filter, we require to calculate only half of the conjugate symmetric coefficients, under that condition the hardware requirements for our proposed system is approximately half of that of existing one. In our proposed architecture shown in Fig. 3 the computation off frequency domain outputs requires 8( If + 1) or 8( Mt + 1) number of multiplication and required number of addition is 16N + 2. 5(N1 + N2) + 2, here total number of multiplications are much less than that of required number of multiplications in the existing FBLMS algorithm based adaptive filter (can be compared from Table I and Table II) at the expanse of increased memory and adder requirement, which drastically reduces the hardware complexity for higher order filters as shown in Fig. 4 and it results with a adaptive filter which has high throughput and low power dissipation with reduced area requirement.


EXPERIMENTAL RESULTS
Proposed and existed architectures are implemented for filter length7 and length8 respectively. Verilog codes are written for both of these designs and synthesized using Xilnx 10.1 version. Family of device was Virtex II Pro and target device was 2vpl 00ff16966.
Fig. 5 shows the logic utilization of both the architectures. Although the filter length used in case of existed system is 1 more then that of our proposed one, but the FPGA resource utilization in case of existed system was approximately 45% greater than that of proposed one which can be clearly seen from Fig. 6. From Table III it is clear that our proposed architecture based adaptive filter is 30.2% faster than that of existed one.

CONCLUSION
In this paper, we have proposed a new hardwareefficient adaptive filter structure for very high throughput FBLMS adaptive filters and its implementation details were presented.
The concept of DA involves for implementation of FFT block without any hardware multiplier using LUT and adders. Due to reduced hardware complexity the proposed DA based FBLMS adaptive filter is best suitable for implementation of higher order filters in FPGA efficiently with minimum area requirement, low power dissipation and high throughput.
REFERENCES
[I] S. Haykin, Adaptive Filter Theory, 4th ed., T. Kailath, Ed. Pearson Education, 2008.
B. FarhangBoroujeny, Adaptive Filters: Theory and Applications, Chichester, Ed. Wiley, 1998.

S. K. M. Gregory A. Clark and S. R. Parker, "Block implementation of adaptive digital filters," IEEE Transactions on Circuits and Systems, vol. 28, pp. 584 – 592, 1981.

A. Peled and B. Liu, "A new hardware realization of digital filters," IEEE Transactions On Acoustics, Speech, And Signal Processing, vol. 22, pp.456, December 1974.

c. H. Wei and 1. 1. Lou, "Multi memory block structure for implementing a digital adaptive filter using distributed arithmetic," lEEE Proceedings, Electronic Circuits and Systems, vol. 133, February 1986.

DJ Allred, W. Huang, Y.Krishnan, H. Yoo, D.V Anderson, "An FPGA implementation for a high throughput adaptive filter using distributed arithmetic." 12th Annual IEEE Symposium on FieldProgrammable Custom Computing Machines, pp. 324 – 325, 2004.

Daniel 1. Allred, Walter Huang, Venkatesh Krishnan, Heejong Yoo, and David Y. Anderson, "LMS adaptive filters using distributed arithmetic for high throughput," IEEE Transactions on Circuits and Systems, vol. 52, pp. 1327 – 1337, July 2005.

N. J. Sorawat Chivapreecha, Aungkana Jaruvarakul and K. Dejhan, "Adaptive equalization architecture using distributed arithmetic for partial response channels," IEEE Tenth International Symposium on Consumer Electronics, 2006.

S. A. White, "Applications of distributed arithmetic to digital signal processing: A tutorial review," IEEE ASSP Magazine, July 1989.

C. S. Burrus, "Index mappings for multidimensional formulation of the DFT and convolution," IEEE Transactions On Acoustics, Speech, And Signal Processing, vol. 25, pp. 239242, June 1977.

C. M. Rader, "Discrete fourier transforms when the number of datam samples is prime," IEEE Proceedings vol. 56, pp. 11071108, June 1968.

S. Chu and C. S. Burrus, "A prime factor FFT algorithm using distributed arithmetic," IEEE Transactions On Acoustics, Speech, And Signal Processing, vol. 30, April 1982.