 Open Access
 Total Downloads : 1307
 Authors : M.Purna Kishore, P.Srinivas
 Paper ID : IJERTV1IS7345
 Volume & Issue : Volume 01, Issue 07 (September 2012)
 Published (First Online): 26092012
 ISSN (Online) : 22780181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
LUT Optimization for Memorybased Computation
LUT Optimization for MemoryBased Computation
1. M.Purna kishore 2. P.Srinivas
Pursuing M.Tech, NCET, Vijayawada Assoc. Prof, NCET, Vijayawada
AbstractRecently, we have proposed the antisymmetric product coding (APC) and oddmultiplestorage (OMS) techniques for lookuptable (LUT) design for memorybased multipliers to be used in digital signal processing applications. Each of these techniques results in the reduction of the LUT size by a factor of two. In this brief, we present a different form of APC and a modified OMS scheme, in order to combine them for efficient memorybased multiplication. The proposed combined approach provides a reduction in LUT size to onefourth of the conventional LUT. We have also suggested a simple technique for selective sign reversal to be used in the proposed design. It is shown that the proposed LUT design for small input sizes can be used for efficient implementation of highprecision multiplication by input operand decomposition. It is found that the proposed LUTbased multiplier involves comparable area and time complexity for a word size of 8 bits, but for higher word sizes, it involves significantly less area and less multiplication time than the canonicalsigneddigit
Index TermsDigital signal processing (DSP) chip, lookuptable (LUT)based computing, memorybased computing.

INTRODUCTION
Digital signal processing algorithms typically require a large number of mathematical operations to be performed quickly and repetitively on a set of data. Signals are constantly converted from analog to digital, manipulated digitally, and then converted again to analog form, as diagrammed below. Many DSP applications have constraints on latency; that is, for the system to work, the DSP operation must be completed within some fixed time, and deferred processing is not viable. Digital signal processing:
Inorder to reach a certain criteria memory based computation plays a vital role in dsp (digital signal processing) application.

FILTER DESIGNING :
Finite impulse response (FIR) digital filter is widely used as a basic tool in various signal processing and image processing applications. The order of an FIR filter primarily determines the width of the transitionband, such that the higher the filter order, the sharper is the transition between a passband and adjacent stopband. Many applications in digital Communication (channel equalization, frequency channelization), speech processing (adaptive noise cancelation), seismic signal processing (noise elimination), and several other areas of signal processing require large order FIR filters. Since the number of multiplyaccumulate
(MAC) operations required per filter output increases linearly with the filter order, realtime implementation of these filters of large orders is a challenging task. Several attempts have, therefore, been made and continued to develop low complexity dedicated VLSI systems for these filters.
As the scaling in silicon devices has progressed over the last four decades, semiconductor memory has become cheaper, faster and more powerefficient. According to the projections of the international technology roadmap for semiconductors (ITRS) , embedded memories will continue to have dominating presence in the systemonchip (SoC), which may exceed 90% of total SoC content. It has also been found thatthe transistor packing density of SRAM is not only high, but also increasing much faster than the transistor density of logic devices .

BINARY MULTIPLICATION:
Multiplication in binary is similar to its decimal counterpart. Two numbers A and B can be multiplied by partial products: for each digit in B, the product of that digit in A is calculated and written on a new line, shifted leftward so that its rightmost digit lines up with the digit in B that was used. The sum of all these partial products gives the final result.

MEMORY BASED MULTIPLICATION :
The inputoutput relationship of an Ntap FIR filter in time domain is given by
where h(n), for n = 0,1,2,——–N1, represent the filter
coefficients x(ni), while for i== 0,1,2,——–N1, for x(n) , represent recent input samples y(n), and represents the current output sample. Memorybased multipliers can be implemented for signed as well as unsigned operands

FIR FILTER ARCHITECTURE
The objectives of this work are:
Multiplying two binary numbers one number is fixed X[4:0] and another variable A
Using APCOMS combined LUT design for the
multiplication of Wbit fixed coefficient A with 5bit input X.
Number of calculations reduced and memory required is less to perform multiplication.
For 16 and 32bit word sizes, respectively, it offers more than 30% and 50% of saving in areadelay product over the corresponding CSD multipliers.

ANTI SYMMETRIC PRODUCT CODING:
Anti symmetric product coding is the technique used to process the multiplication based on LUT multiplication which reduces the size of conventional lut by 50 % .
The anti symmetric product coding is based on the antisymmetric coding i.e the 2s complement phenomenon which is used to reduce the LUT size by half.
For simplicity of presentation, we assume both X and A to be positive integers.2 The product words for different values of X for L = 5 are shown in Table I. It may be observed in this table that the input word X on the first column of each row is the twos complement of that on the third column of the same row. In addition, the sum of product values corresponding to these two input values on the same row is 32A. Let the product values on the second and fourth columns of a row be u and v, respectively.
Since one can write
u = [(u + v)/2 (v u)/2] and
v = [(u + v)/2 + (v u)/2], for (u + v) = 32A, we can have
The product values on the second and fourth columns of Table I therefore have a negative mirror symmetry. This behavior of the product words can be used to reduce the LUT size,where, instead of storing u and v, only [(v u)/2] is stored for a pair of input on a given row. The 4bit LUT addresses and corresponding coded words are listed on the fifth and sixth columns of the table, respectively. Since the representation of the product is derived from the anti symmetric behavior of the products, we can name it as anti symmetric product code.
The 4bit address X= x3x2x1x0 of the APC word is given by
X = XL, if x4 = 1
=XL , if x4 = 0
where XL = (x3x2x1x0) is the four less significant bits of X, and XL is the twos complement of XL.

LUT BASED MULTIPLICATION USING APC OMS MODIFIED OPTIMIZATION TECHNIQUE
The APC approach, although providing a reduction in LUT size by a factor of two, incorporates substantial overhead of area and time to perform the twos complement operation of LUT output for sign modification and that of the input operand for input mapping. However, we find that when the APC approach is combined with the OMS technique, the twos complement operations could be very much simplified since the input address and LUT output could always be transformed into odd integers.

LUT COMBINED APCOMS BASED MULTIPLICAT ION TECHNIQUE
The proposed APCOMS combined design of the LUT for L = 5 and for any coefficient width W is shown in Fig. 2.4. It consists of an LUT of nine words of (W + 4)bit width, a four tonineline address decoder, a barrel shifter, an address generation circuit, and a control circuit for generating the RESET signal an control word (s1s0) for the barrel shifter.
The recomputed values of A Ã— (2i + 1) are stored as Pi, for
i = 0, 1, 2, . . . , 7, at the eight consecutive locations of the memory array, as specified in Table II, while 2A is stored for input X = (00000) at LUT address 1000, as specified in Table III. The decoder takes the 4bit address from the address generator and generates nine wordselect signals, i.e.,
{wi, for 0 i 8}, to select the referenced word from the
LUT. The 4to9line decoder is a simple modification of 3 to8line decoder.
The control bits s0 and s1 to be used by the barrel shifter to produce the desired number of shifts of the LUT output are generated by the control circuit, according to the relations.
Here a simple design for sign modification of the LUT output.
TABLE 3
The product values and encoded words for input words X = (00000) and (10000) are separately shown in Table III. For X
= (00000), the desired encoded word 16A is derived
by 3bit left shifts of 2A [stored at address (1000)]. For X = (10000), the APC word 0 is derived by resetting the LUT output, by an activehigh RESET signal given by
ADDER/SUBSTRACTOR (ANTISYMMETRY GENERATION CIRCUIT)
The adder /sub circuit is also called as an ant symmetryl generation circuit
Based on the sign of x4,the circuit generates the anti symmetry based on the msb of x input.


LUT OPTIMATION


Basic Components of LUT Optimization :
The modules contributed for combined APCOMS based LUT optimization technique are
1 .Xin generation module (based on antisymmetric process)

Address generation module

line decoder 4. 9*(w+4) LUT
> line selector module
> multiplier result module
> resultant multiplier module

Barrel Shifter

Add/Substractor (Sign Determination) module
Xin generation module (based on antisymmetric process): A input of 5bit length is given as input to this module. It used to generate antisymetric of last 4bits (Xin(3 to 0)) when the msb of Xin i.e Xin(4) is 0 and and process the same input when the msb of Xin is 1 hence only 16 combinations will be achived for 5bit of input as in table 1.
Block diagram
If (xin(4) = 0) then
Xcomps = Xin(4) & 2scomplement of(Xin(3 to 0)); Else
Xcomps = Xin


Address Generation Unit :
The address generation unit generats the 4bit address for the input given by Xin generation module the 4bit address is named as d.
The reset output will be set when the input combination Xin = 10000;
Inorder to make the output of the barrel shifter to 0 .
The oupt s[1:0] are used to get the shift terminology in barrel shifter maximum of 3 shifts .
din[3: W[8:0
Figure: 4 to 9 Line decoder
The 4 input lines din is converted into 9 output lines w which is used to calculate the LUT output .
A decoder is a device which does the reverse of an encoder, undoing the encoding so that the original information can be retrieved. The same method used to encode is usually just reversed in order to decoder.
Decoding is necessary in applications such as data multiplexing, 7 segment display and memory address decoding.
A simple CPU with 8 registers may use 3to8 logic decoders inside the instruction decoder to select two source registers of the register file to feed into the ALU as well as the destination register to accept the output of the ALU. A typical CPU instruction decoder also includes several other things.
LUT selector
W[8:0]
LUT selector is used to generate PVN (product value number) which is used to calculate the corresponding product value i.e (PVN X A)
The PVN is calculated depending on the W input corresponding bit set in order to generate the stored APC word i.e
The possible PVN values are
When w = 000000001 then PVN = 1 When w = 000000010 then PVN = 3 When w = 000000100 then PVN = 5 When w = 000001000 then PVN = 7 MULTIPLIER RESULT:
Multiplier result module is used to calculate multiplication of individual bits of operand and get the individual multiplication results .
Ex: 1 0 1 1 (A)
Ã— 1 0 1 0 (B)
———
0 0 0 0 
ress0 
i.e B(0) 
X A 

+ 
1 0 1 1 
ress1 
i.e B(1) 
X A 
+ 
0 0 0 0 
ress2 
i.e B(2) 
X A 
+ 
1 0 1 1 
ress3 
i.e B(3) 
X A 
BARREL SHIFTER :
Barrel Shifter is an combinational logic circuit which is used to do any no. of shifts for one clock cycle. Depending upon the s the no of shifts is decided and output outp is given .
Fig:Block diagram:
For example, take a 4bit barrel shifter, with inputs A, B, C and D. The shifter can cycle the order of the bits ABCD as DABC, CDAB, or BCDA; in this case, no bits are lost. That is, it can shift all of the outputs up to three positions to the right (and thus make any cyclic combination of A, B, C and D). The barrel shifter has a variety of applications, including being a useful component in microprocessors (alongside the ALU).
Implementation
A barrel shifter is often implemented as a cascade of parallel 2Ã—1 multiplexers. For a 4bit barrel shifter, an intermediate signal is used which shifts by two bits, or passes the same data, based on the value of S[1]. This signal is then shifted by another multiplexer, which is controlled by S[0]:
im = IN, if S[1] == 0
= IN << 2, if S[1] == 1 OUT = im, if S[0] == 0
= im << 1, if S[0] == 1
It is used to add the intermediate results to 16A to get the final output .It may make output 0 when clr is high.
u = [(u + v)/2 (v u)/2] and
v = [(u + v)/2 + (v u)/2], for (u + v) = 32A,
When xin(4 ) = 1 then sign value = 1 When xin(4) = 0 then sign value = 0.
4bit_ripple_carry_addersubtracter.svg
In digital circuits, an addersubtractor is a circuit that is capable of adding or subtracting numbers.
This works because when D = 1 the A input to the adder is really and the carry in is 1. Adding Bto and 1 yields the desired subtraction of B A.
The addersubtractor above could easily be extended to include more functions. For example, a 2to1 multiplexer could be introduced on each Bi that would switch between zero and Bi; this could be used (in conjunction with D = 1) to yield the two's complement of A since .
A further step would be to change the 2to1 mux on A to a 4to1 with the third input being zero, then replicating this on Bi thus yielding the following output functions:
0 (with the both Ai and Bi input set to zero and D = 0) 1 (with the both Ai and Bi input set to zero and D = 1) A (with the Bi input set to zero)
B (with the Ai input set to zero)
A + 1 (with the Bi input set to zero and D = 1) B + 1 (with the Ai input set to zero and D = 1) A + B
A B B A
(with Ai set to invert; Bi set to zero; and D = 0)

A (with Ai set to invert; Bi set to zero; and D = 1) (with Bi set to invert; Ai set to zero; and D = 0)

B (with Bi set to invert; Ai set to zero; and D = 1)
By adding more logic in front of the adder, a single adder can be converted into much more than just an adder an ALU.
LUT APC OMS Optimization Top Model
The APC approach, although providing a reduction in LUT size by a factor of two, incorporates substantial overhead of area and time to perform the twos complement operation of LUT output for sign modification and that of the input operand for input mapping.
The proposed APCOMS combined design of theLUT for L = 5 and for any coefficient width W is shown in Fig. 2.4. It consists of an LUT of nine words of (W + 4)bit width, a four tonineline address decoder, a barrel shifter, an address generation circuit, and a control circuit for generating the RESET signal and control word (s1s0) for the barrel shifter.
The recomputed values of A Ã— (2i + 1) are stored as Pi, for i = 0, 1, 2, . . . , 7, at the eight consecutive locations of the
memory array, as specified in Table II, while 2A is stored for input X = (00000) at LUT address 1000, as specified in Table III. The decoder takes the 4bit address from the address generator and generates nine wordselect signals, i.e.,
{wi, for 0 i 8}, to select the referenced word from the LUT.
fig 2.4 lut combined apcoms based multiplication technique
Here we observe that they will Antisymmetry in the address for the LSB 4 bits. We will get all the address from 0 to 15 for 0 to 31.Thus we reduce the memory locations required to store coefficients by half. Then we will store only odd coefficients in the look up table .
Thus we reduce the number of coefficients by half again. On total we have reduced the number coefficients by quarter.
RTL SCHEMATIC:
SIMULATION RESULTS:
Xin Generation Module:
ADDRESS GENERATION MODULE:
LINE SELECTOR:
MULTIPLIER RESULT MODULE:
RESULTANT MULTIPLICATION MODULE:
BARREL SHIFTER :
ADDER/SUBSTRACTOR (sign determination module) :
References:

LUT Optimization for MemoryBased Computation Meher, P.K IEEE Transactions onCircuits and Systems II: Express Briefs, April 2010Vol 57 ,
Issue: 4 pp 285 – 289

MBARC: A Scalable Memory Based Reconfigurable omputing Framework for Nanoscale Devices, IEEE 2008 9781424419227/08 PP:77 82

L. Carloni et al., Theory of latencyinsensitive design, IEEE TCAD, 2001.

M. Tehranipoor, Defect tolerance for molecular electronicsbased nanofabrics using builtin selftest procedure, DFT, 2005.

A. Dehon et al., Seven strategies for tolerating highly defective fabrication, IEEE Design & Test of Computers, 2005, pp: 306315.

M. Mishra and S.C. Goldstein, Defect Tolerance at the End of the Roadmap, ITC, 2003, pp: 12011211.

S.C. Goldstein et al., NanoFabrics: Spatial Computing Using Molecular Electronics, ISCA, 2001.

R. F. Service, Molecules get wired, Science, vol. 294, 2001.

Yong Chen et al., Nanoscale molecularswitch crossbar circuits, Nanotechnology 14, pp. 462468, 2003.

C. P. Collier et al., Electronically configurable molecularbased logic gates, Science, vol. 285, pp 391394, 1999.

A. Dehon et al., Hybrid CMOS/nanoelectronic digital circuits: devices, architectures, and design automation, ICCAD, 2005.

M. M. Ziegler and M. R. Stan, CMOS/Nano CoDesign for Crossbar Based Molecular Electronic System, IEEE Trans. on Nanotech. 2003.

M. M. Ziegler and M. R. Stan, Design and Analysis of crossbar circuits for molecular nanoelectronics, IEEE Nano, pp. 323327, 2002.

P. Farm et al., Nanoeda: architecture and design methodology for nanoscale elecctronic systems, Swedish SoC Conf., 2003.