 Open Access
 Total Downloads : 1494
 Authors : P. Sowmithri, V.Prasanth
 Paper ID : IJERTV1IS5033
 Volume & Issue : Volume 01, Issue 05 (July 2012)
 Published (First Online): 02082012
 ISSN (Online) : 22780181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
New Approach to LookUpTable Optimization for MemoryBased Realization of FIR Digital Filter
New Approach to LookUpTable Optimization for MemoryBased
Realization of FIR Digital Filter
P. sowmithri#, V.prasanth*
#ECE Department pragati Engineering College, surampalem, India.
Abstract FiniteImpulse response (FIR) digital filter is widely used as a basic tool in various signal processing and image processing applications. New approaches to LUT basedmultiplication are suggested to reduce the LUTsize over that of conventional design. Distributed arithmetic (DA)based computation is popular for its potential for efficient memorybased implementation of finite impulse response (FIR) filter where the filter outputs are computed as innerproduct of inputsample vectors and filtercoefficient vector.LUT optimization is the main key factor in our project, in order to reduce power and area. One of the techniques has to be implemented in LUT to get required qualities. That is Modified Odd multiple storage (O.M.S) technique. We have synthesized the DAbased design and LUTmultiplier based design of 4tap FIR filters.
KeywordsDistributedarithmetic,FIR filter,throughput,memorybased implementation,LUTmultiplierbased approach.

Finiteimpulse response (FIR) filters are basic processing elements in applications such as video signal processing and audio signal processing. The order of an FIR filter primarily determines the width of the transitionband, such that the higher the filter order, the sharper is the transition between a pass band and a adjusted stop band. Many applications in digital communication (channel equalization frequency channelization) speech processing (adaptive noise cancellation) seismic signal processing (noise elimination) and several other areas of signal processing require large order
FIR filters. Since number of multiply accumulate (MAC) operations required for filter outputs increases linearly with the filter order, realtime Implementation of these filters of large orders is a challenging task. Along with the progressive device scaling, semiconductor memory has become cheaper, faster and more power efficient. More over according to the projections of the international technology roadmap for semiconductors embedded memories will have dominating presence in the system onchips (SoCs), which may exceed 90%,of the total Soc content. It has also been found that the transistor packing density of memory components are not only higher but also increasing much faster than those of logic components. Apart from that, memorybased computing structures are more regular then the multiplyaccumulate structures and offer many other advantages. e.g., greater potential for highthroughput and lowlatency implementation,(since memory access time is much shorter than the usual multiplicationtime)and are expected to have less dynamic power consumption due to less switching activities for memory read operations compared to the conventional multiplier. Memorybased structures are wellsuited for many digital signal processing (DSP) algorithms, which involve multiplication with a fixed set of coefficients. There are two basic variants of memorybased techniques. One of them is based on distributed arithmetic (DA) for inner product computation and the other is based on the computation of multiplication by lookuptable (LUT). In the LUTmultiplierbased approach, multiplications of input values with a fixed co efficient are performed by an LUT consisting of all possible precomputed product value corresponding to all possible values of input multiplicand while in the DAbased approach, an LUT is used to store all
possible values of innerproducts of a fixed N point vector with any possible N point bitvector. If the innerproducts are implemented in a straight forward way, the memorysize of LUTmultiplier based implementation increases exponentially with the wordlength of input values, while that of the DAbased approach increases exponentially with the innerproductlength. Attempts have been made to reduce the memoryspace in DAbased architectures using offset binary coding (OBC) and group distributed technique. A decomposition scheme is used for reducing the memory size of DAbased implementation of FIR filter. But, it is observed that the reduction of memorysize achieved by such decompositions is accompanied by increase in latency as well as the number of adders and latches.
II. LUT DESIGN FOR MEMORY BASED MULTIPLICATION
The basic principle of memorybased multiplication is depicted in Fig. 1. Let A be a fixed coefficient and X be an input word to be multiplied with A.
Fig: 1. Conventional LUTbased multiplier
If we assume X to be an unsigned binary number of wordlength L, there can be possible values of X, and accordingly, there can be possible values of product C=A.X. Therefore, for the conventional implementation of memorybased multiplication, a memory unit of words is required to be used as lookuptable consisting of precomputed product values corresponding to all possible values of X. The productword ,for 1, is stored at the memory location whose address is the Same as the binary value of , such that if Lbit binary value of is used as address for the memoryunit, then the corresponding Product value is readout from the memory.
TABLE1
LUT WORDS AND PRODUCT VALUES FOR INPUT WORD LENGTH L=4
Although possible values of X correspond to possible Values of C=A.X, recently we have shown that only ( /2 ) Words corresponding to the odd multiples of A may only beStored in the LUT
.One of the possible product words isZero, while all the rest ( / 2 ) 1 are even multiples of A which could be derived by leftshift operations of one of the odd multiples of A. We illustrate this in Table I for L=4. At eight memory locations, eight odd multiples AÃ— (2i+1) are stored as pi=0. 12,3,.7The even multiples 2A,4A and 8Aare derived by leftshift operations of A . Similarly, 6A and 12A are derived by leftshifting 3A, while 10A and 14A are derived by leftshifting 5A and 7A, respectively. The address X (0000) corresponds to (A.X) =0, which can be obtained by resetting the LUT output. For an input multiplicand of wordsize L similarly, only ( / 2 ) odd multiple values need to be stored in the memorycore of the LUT, while the other ( /2 1) nonzero values could be derived by leftshift operations of the stored values. Based on the above, an LUT for the multiplication of an L bit input with Wbit coefficient is designed by the following strategy:

A memoryunit of ( /L) words of (W+L)bit width is used to store all the odd multiples of A.

A barrelshifter for producing a maximum of (L

1) left shifts is used to derive all the even multiples of A.

The Lbit input word is mapped to (L1) bit LUTaddressby an encoder.

The controlbits for the barrelshifter are derived by a
Controlcircuit to perform the necessary shifts of the LUT
Output. Besides, a RESET signal is generated by the same
Control circuit to reset the LUT output when X=0.

Proposed LUTBased Multiplier for 4Bit Input
The proposed LUTbased multiplier for input word size L=4is shown in Fig. 2. It consists of a memory array of eight wordsof (W+4) bit width and a 3 to8 line address decoder, alongwith a NORcell, a barrelshifter, a 4to3 bit encoder to map the 4bit input operand to 3bit LUTaddress, and a control circuit for generating the controlword for the barrelshifter, and the RESET ignal for the NOR cell.
The 4to3 bit input encoder is shown in Fig. 2(b). It receivesa fourbit input word (x3x2x1x0) and maps that onto the threebit address word ,according to the logical relations
The precomputed values of AÃ—(2i+1) are stored as for I=0,1,2,,7 at 8 consecutive locations of the memoryarray as specified in Table I in bit inverted form.The decoder takes the 3bit address from the input encoder,and generates 8 wordselect signals, 0 I 7}, to select the referenced word from the memoryarray. The output of the memoryarray is either AX or its submultiple in Bitinverted form depending on the value of X. From Table I, we find that the LUT output is required to be shifted through a1location to left
when the input operand is one of the values
{(0010),(0110),(1010),(1110)}.
Two leftshifts are required if X is either (0 1 0 0)
or (1 1 0 0). Only when the input word X= (1000), three shifts are required. For all other possible input operands, no shifts are required. Since the maximum number of leftshifts required on the storedword is three, a twostage logarithmic barrel shifter is adequate to perform the necessary left shift operations.
Fig: 2. (a). Proposed LUT design based 4bit multiplier
Fig: 2. (b). The 4to3 bits input encoder.
Fig: 2. (c). Control circuit
Fig: 2. (d). Twostage logarithmic barrelshifter for W=4
Fig: 2. (e). Structure of the NORcell
The number of shifts required to be performed on the output of the LUT and the controlbits S0 and S1 for different values of X are shown Table I. The control circuit [shown in Fig. 2(c)] accordingly generates the controlbits given by
Fig: 3. Memorybased multiplier using dualport memoryarrays=(W+4).
A logarithmic barrelshifter for W=L=4 is shownin Fig. 2(d). It consists of two stages of 2to1 line bit level multiplexers with inverted output, where each of the two stagesinvolves (W+4) number of 2 input ANDORINVERT (AOI) gates .The control bits and are fed to the AOI gates of stage1 and stage2 of the barrelshifter, respectively. Since each stage of the AOI gates perform inverted multiplexing, after two stages of inverted multiplexing, outputs with desired number of shifts are produced by the barrelshifter in (the usual) uninverted form.
The input X= (0000) corresponds to multiplication by X=0 which results in the product value A.X=0. Therefore, when the input operand word X= (0000), the output of theLUT is required to be reset. The reset function neither is implementedby a NORcell consisting of (W+ 4) NOR gates as shown inFig. 2(e)using an activehigh RESET. The RESET bit is fed as one of the inputs of all those NOR gates, and the other input lines of (W+4) NOR gates of NOR
cell are fed with (W+4) bits of LUT output in parallel. When X= (0000) , the control circuit in Fig. 2(c), generates an activehigh RESET according to the logic expression:
When RESET=1, the outputs of all the NOR gates become0, so that the barrelshifter is fed with (W+4) number of zeros. When RESET=0, the outputs of all the NOR gates becomethe complement of the LUT outputbits. Note that, keeping thisin view, the product values are stored in the LUT in bitinverted form. Reset function can be implemented by an array of 2input AND gates in a straightforward way, but the implementation of reset by the NORcell is preferable since the NOR gates have Simpler CMOS implementation compared with the AND gates. Moreover, instead of using a separate NORcell, the NOR gates could be integrated with memoryarray if the LUT is implemented by a ROM .The NOR cells, therefore, could be eliminated by using a ROM of 9 words, where the 9th word is zero and RESET is used as its wordselect signal.
Multiplication of an 8bit input with a Wbit fixed coefficientcan be performed through a pair of multiplications using aDualport memory of 8 words (or two singleport memory units)along with a pair of decoders, encoders, NOR cells and barrel shifters as shown in Fig. 3. The shiftadder performs leftshift operation of the output of the barrelshifter corresponding to more significant half of input by four bitlocations, and adds that to the output of the other barrelshifter.

MEMORYBASED STRUCTURES FOR FIR FILTER USING LOOKUPTABLE MULTIPLIERS
We derive here the proposed structure for memory based realization of an N tap FIR filter, and discuss the design of memory cell to be used as LUT in the structure. The inputoutput relationship of an N tap FIR filter in timedomain is given by
Where h(n), for n=0,1,…,N1, represent the filter
Coefficients, while x(ni),for i=0,1,. ,N1, represent N recent input samples, and y(n) represents the current output sample. Memory based multipliers can be implemented for signed as well as unsigned operands. In case of signed operands, the input words and the stored product values need to be in twos complement representation. Since the stored product values require signextension in case of twos complement representation during shiftadd operations, the LUTbased multiplication could have a simpler implementation when the multiplicands are unsigned numbers. Besides, without loss of generality, we can assume the input samples{x(n)} to be unsigned numbers and the filter coefficients
{h(n)} to be signed numbers, in general. Keeping this in view, we write above equation alternatively as
Where
denotes the absolute value of h(n) and sign(n) = Â±1 is the signfactor, which could be absorbed with the additions of the corresponding term. The above equation, then may be written in a recursive form
D stands for the delay operator, such that Do(n I)=x(ni1),for I=1,2,,N1.

MemoryBased FIR Filter Using Conventional LUT
The recursive computation of FIR filter output according to is represented by a transposed form dataflow graph (DFG) shown in Fig. 4.It consists of N multiplication nodes (M) and (N1) add subtract (AS) nodes.
Fig: 4.Modified transposed form data flow graph (DFG) of an Ntap FIR filter for LUTmultiplier based implementation. (a)TheDFG. (b)Function of each multiplication node (M) and addsubtract node (AS) of the DFG.
The function of these nodesis depicted in Fig. 4(b).Each multiplication node performs the multiplication of an input sample value with the absolute value of a filter coefficient. The AS node adds or subtracts its input from top with or from that of its input from the left when the corresponding filter coefficient is positive or negative, respectively. It may be noted here that each of the multiplication nodes of the DFG performs multiplications of input samples with a fixed positive number. This feature can be utilized to implement the multiplications by an LUT that stores the results of multiplications of all possible input values with the multiplying coefficient of a node as unsigned numbers. The multiplication of an Lbit unsigned input with W bit magnitude part of fixed filterweight, to be performed by each of the multiplicationnodes of the DFG, can be implemented conventionally by a dualport memory unit consisting of ( / 2 ) words of (W+L) bit width. Each of the (N1) AS nodes of the DFG along with a neighboring delay element can be mapped to an addsubtract (AS) cell.
A fully pipelined structure for tap FIR filter for input word length L=8, as shown in Fig. 5, is derived accordingly from the DFG of Fig. 4. It consists of N memoryunits for conventional LUT based multiplication, along with (N1) AS cells and a delay register. During each cycle, all the 8 bits of current input sample x (n) are fed to all the LUT multipliers in parallel as a pair of 4bit addresses X1and X2 .The structure of the LUTmultiplier is shown in Fig. 5(b). It consists of a dualport memory unit of size [16Ã— (W+4)] (consisting of 16 words of (W+4)bit width) and a shiftadd (SA) cell. The SA cell shifts its rightinput to left by four
bitlocations and adds the shifted value with its other input to produe a (W+8)bit output. The shift operation in the shiftadd cells is hardwired with the adders, so that no additional shifters are required.
The output of the multipliers is fed to the pipeline of AS cells in parallel. Each AS cell performs exactly the same function as that of the AS node of the DFG. It consists of either an adder or a subtractor depending on whether the corresponding filter weight h(n) is positive or negative, respectively.Besides, each of the SA cells consists of a pipeline latch corresponding to the delays in the DFG of Fig. 4. The FIR filter structure of Fig. 5 takes one input sample in each clock cycle, and produces one filter output in each cycle. The first filter output is obtained after a latency of three cycles (1 cycleeach for memory output, the SA cell and the last AS cell). But, the first (N1) outputs are not correct because they do not contain the contributions of all the filter coefficients. The correct output of this structure would thus be available after a latency of (N+2) cycles.
Fig: 5.(a) Conventional LUTmultiplierbased structure of an Ntap transposed form FIR filter for inputwidth L=8.
Fig: 5. (b)Structure of each LUTmultiplier
Fig: 6. (a). Structure of Nth order FIR filter using proposed LUTmultiplier.
Fig:6.(b).The dualport segmented memory core for the Nth order FIR filter.

MemoryBased FIR Filter Using Proposed LUT Design
The memorybased structure of FIR filter (for 8bit inputs) using the proposed LUT design is shown in Fig. 6.It differs from that of the conventional memorybased structure of FIR filter of Fig. 5 in two design aspects.

The conventional LUTmultiplier is replaced by proposed, Oddmultiplestorage LUT, so that the multiplication by an Lbit word could be implemented by ( / 2 ) /2 words in the LUT in a dualport memory, as described in previous.

Since the same pair of address words X1and X2 are used by all the NLUTmultipliers in Fig. 5, only one memory module with N segments could be used instead of N modules. If all the multiplications are implemented by a singlememory module, the hardware complexity of 2(N1) decoder circuits (used in Fig. 5) could be eliminated.
As shown in Fig. 6, the proposed structure of FIR filter consists of a single memorymodule, and an array of N shiftadd (SA) cells, (N1) AS cells and
a delay register. The structure of memory module of Fig. 6(a) is similar to that of Fig. 3. Like the structure of Fig. 3, it consists of a pair of 4to3 bit encoders and control circuits and a pair of 3to8 line decoders to generate the necessary control signals and word select signals for the dualport memory core. The dualport memory core (shown in Fig. 6(b)) consists of [8Ã— (W+4)] Ã—N array of bitlevel memoryelements arranged in 8 rows of [(W+4)] bit width. Each row of this memory consists of N segments, where each segment is(W+4) bit wide. Each segment of this dualport core is of size 8Ã—(W+4), such that the ith memory segment stores the 8 words corresponding to the multiplication of any 4bit input with the filter weight h(i) for 0 i N1. During each cycle, a pair of 4bit subwords X1and X2 are derived from the recentmost input sample x(n) and fed to the pair of 4to3 bit encoders and control circuits, which produce two sets of wordselect signals (WS1 and WS2), a pair of control signals((s01, s00 )) and( s11, s10 )) two reset signals. All these signals are fed to the dualport memorycore as shown in Fig. 6. N segments of the memorycore then produce N pairs of corresponding output, those are fed subsequently to the N pairs of barrelshifters through the 2N NOR cells. The array of N pairs of barrelshifters thus produce N pairs of output ((hi) . X1, h(i).X2) for 0 i N1. Since the same pair of address words X1 and X2 are used by
all the N LUTmultipliers in Fig. 5, only one memory module with N segments could be used instead of N independent memory modules.


RESULTS
Simulation results of 4tab FIR filter:
(a). For 4bit input operand X=(x3x2x1x0).
(b) For 8bit input operands.
RTL Schematic diagram for 4tab FIR filter:

CONCLUSION
The proposed LUTmultiplierbased design involves half the memory than the DAbased and conventional LUTbased designs at the cost of
~4NW AOI gates and nearly 2NW NAND/NOR gates. It performs less delay time for various memories based computations. By oddmultiple storage scheme, for addresslength 4, the LUT size is reduced to half by using a twostage logarithmic barrelshifter and neither number of NOR gates, where W is the wordlength of the fixed multiplying coefficients. Three memorybased structures having unit throughput rate are designed further for the implementation of FIR filter. One of the structures is based on DA principle, and the other two are based on LUT based multiplier using the conventional and the proposed LUT designs. All the structures are found to have the same or nearly the same cycle periods, which depend on the implementation of adders, the wordlength and the filter order.
REFERENCES
[1] International TechnologyRoadmap for Semiconductors. [Online]. [2]P.K.Meher,Newlookup tableoptimizationsformemory basedmultiplication,inProc.ISIC,Dec.2009,pp. 663666. [3] A. K. Sharma, AdvancedSemiconductor M e m o r i e s : Architectures,DesignsandApplications. Piscataway,NJ:IEEEPress,2003.
[4]K.Meher,NewapproachtoLUTimplementationa ndaccumulation formemory basedmultiplication,inProc.IEEEISCAS,May2009.
[5] P.K.Meher,Memory basedhardwareforresourceconstrained digital signalprocessingsystems,inProc.6thInt.Conf.ICI CS,Dec.2007, pp.14. [6]K.K.Parhi,VLSIDigitalSignalProcesing Systems: DesignandImplementation. New York: Wiley,1999.