 Open Access
 Total Downloads : 124
 Authors : Nandita Jaiswal, Soumitra Sarkar
 Paper ID : IJERTV5IS080189
 Volume & Issue : Volume 05, Issue 08 (August 2016)
 DOI : http://dx.doi.org/10.17577/IJERTV5IS080189
 Published (First Online): 16082016
 ISSN (Online) : 22780181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Optimized Design Implementation of Direct MemoryBased Hardware for Efficient Resource Constraint Digital Signal Processing Systems
Nandita Jaiswal
M.Tech (VLSI),
Department of Electronics & Communication Engineering Hindustan College of Science & Technology
Farah, Mathura, India
Abstract An advance approach for DirectMemoryBased hardware for areadelaypower efficient systems for commonly encountered computationintensive cores of digital signal processing (DSP) systems is presented by combining three techniques, where the memorysize is reduced to oneeighth at the cost of some increase in combinational circuit complexity for signed magnitude numbers. Each of these techniques results in the reduction of the memory size by a factor of two. It is shown that by efficiently combining signbit exclusion technique, a different form of antisymmetric product coding (APC) and a modified oddmultiplestorage (OMS) scheme, we get an optimized directmemorybased multiplication hardware for resourceconstraint DSP systems which provides a reduction in memory size to oneeighth over conventional directmemory based hardware, at the cost of a marginal area overhead. The proposed design for small input sizes can be used for efficient implementation of highprecision multiplication by input operand decomposition. The proposed optimized design also offers almost 87.5% and 85% reductions in direct memory size for L=5 bits and L=6 bits signedmagnitude numbers respectively, over conventional directmemory size.
Keywords Digital signal processing (DSP), directmemory based computing, very large scale integration (VLSI).

INTRODUCTION
Rapid advancement in very large scale integration (VLSI) technology and hardware performance of digital devices have paved way to efficient memorybased computing systems as alternative to the conventional logiconly computing in order to meet the stringent constraint and growing requirements of the digital signal processing (DSP) systems in different application environments. Since DSP is considered as the major component of the digital revolution that is currently taking place around the world, it is therefore, important to design dedicated VLSI chips for fast and efficient computation of the DSP applications. It is observed that algorithms optimized for softwareimplementation, in general, are not wellsuited for dedicated hardwareimplementation. Appropriate algorithm design has a major role on developing a hardware entity that can meet the system requirements and specification. Not only it should necessarily lead to reduction of computational complexity, but also should facilitate maximization of concurrency by exploiting the possible parallelism to achieve highthroughput performance. Moreover, the architecture should be developed synergetic with the underlying algorithms to derive a cost effective and areatimepower efficient optimal VLSI. Memorybased designs consequently are gaining substantial popularity in the
Soumitra Sarkar
M.Tech Project Coordinator,
Department of Electronics & Communication Engineering Hindustan College of Science & Technology
Farah, Mathura, India
DSP application space. DSP algorithms involve multipliers that not only consume most of the resources of the system but also involve most of the computationtime. Significant researches have, therefore, been made in the past two decades for efficient multiplier less implementation of DSP systems.
Most of the DSP algorithms involve repetitive multiply accumulate operations and innerproduct computation. Besides, very often the multiplying coefficients (e.g., filter coefficients or transform kernel coefficients) remain constant during the DSP operations. This behavior of DSP algorithms is utilized to realize the memorybased computing systems. There are two basic variants of memorybased computing techniques found to be popularly used. One of the techniques is the direct memorybased implementation of multiplications [1], while the second is based on distributed arithmetic (DA) [2]. The DA principle is used primarily to compute the inner products by repeated shiftadd operations of partial products corresponding to the successive bitvectors of one of the input vectors. Whereas in the directmemorybased implementations, the multiplications of input values with the fixed coefficients are performed by a lookuptable (LUT), where each of the LUTs contains the precomputed product values for all possible values of input samples.
Apart from that, memorybased computing structures are more regular than the multiplyaccumulate structures and offer many other advantages, e.g., greater potential for high throughput and lowlatency implementation and less dynamic power consumption [11]. However, we find that now increasing efforts are made to carry any sort of significant work on optimization for memorybased multiplication.
The rest of the paper follows as: in the next section, we have discussed comparison study of techniques used in direct memorybased computation. In Section III, the proposed optimized technique used in directmemorybased multiplication are discussed which are signbit exclusion, the APC and the modified OMS technique. The Section IV explains to use optimized directmemorybased multiplication for signed and unsigned operands. The algorithmic design for proposed techniques is given in Section V and the optimized hardware implementation of directmemorybased multiplier is presented in Section VI. The results of the proposed multiplier along with the conclusion are presented in Section VII.

COMPARISON STUDY OF TECHNIQUES USED IN DIRECTMEMORYBASED COMPUTATION
DirectMemorybased computing is well suited for many DSP algorithms, which involve multiplication with a fixed set of coefficients. In the conventional directmemorybased implementations, the multiplications of input value B with the fixed coefficients A are performed by a lookuptable (LUT) of size 2L, (L is the wordlength of B) where each value of the LUT contains the precomputed product values P=AB for all possible values of input samples. The product word ABi is stored at the location Bi for 0 Bi 2L 1, such that if an L bit binary value of Bi is used as the address for the LUT, then
the corresponding product value is available as its output. Several architectures have been reported in the literature for memorybased implementation of DSP algorithms involving orthogonal transforms and digital filters [1][8].
In [9], odd multiples of the fixed coefficient is required to be stored, which is referred to as the oddmultiplestorage (OMS) scheme. Using OMS approach, one can reduce the LUT size to half, but it has significant combinational overhead since it requires a barrelshifter along with a controlcircuit to generate the controlbits for producing a maximum of (L 1) leftshifts, and an encoder to map the Lbit input word to (L 1) bit LUT address.
In [10], there is anti symmetric product coding (APC) approach, in this the LUT size can also be reduced to half, where the product words are recoded as anti symmetric pairs. The APC approach, although providing a reduction in LUT size by a factor of two, incorporates substantial overhead of area and time to perform the twos complement operation of LUT output for sign modification and that of the input operand for input mapping.
However, we find that when the APC approach is combined with the OMS technique, the twos complement operations could be very much simplified since the input address and LUT output could alwaysbe transformed into odd integers. However, the OMS technique in [9] cannot be combined with the APC scheme in [10], since the APC words generated according to [10] are odd numbers. Moreover, the OMS scheme in [9] does not provide an efficient implementation when combined with the APC technique. In [11], a combined APCOMS scheme reduces the LUT size to onefourth. But further optimization can also be achieved, with our proposed scheme that combines three techniques which reduces the LUT size to oneeight. We therefore present three schemes for optimization of LUT with lower areaand timeoverhead. One of the proposed optimization is based on exclusion of signbit from the LUT address, and the other two optimization is based on a coding of stored product word, where a different form of APC and combined that with a modified form of the OMS scheme for efficient memory based multiplication are presented.

PROPOSED OPTIMIZED TECHNIQUE EMPLOYED IN DIRECT MEMORYBASED
MULTIPLICATION
The optimization of product values stored could easily be performed for unsigned as well as signed magnitudes numbers. Besides, numbers could be fractions or integers in fixedpoint format. But, for simplicity we assume here the multiplicand B to be an integer in signmagnitude
representation, while the constant A is assumed to be either in signmagnitude or in 2s complements representation. We present here the proposed signbit exclusion scheme, the APC technique and its further optimization by combining it with a modified form of OMS.

DirectMemoryBased SignBit Exclusion Technique:
As the name suggest in this technique the signbit of multiplier and multiplicand is excluded, and the product values stored is P=AÂ·B, where A is magnitudepart of A and B is magnitudepart of B. The signedbit which is the most significant bit (MSB) of A and B are XOR operated and the result of XOR operation is concatenated with P to get the true result of multiplication. Since B is an (L 1)bit binary number, all possible product values of AÂ·B can be stored as 2(L1) LUT words which reduces its size to half.
The product words required to be stored for different values of B for directmemorybased multiplication for L = 6 is shown in Table I. The product word corresponding to B = (1 b4 b3 b2 b1 b0) is negative of that for B = (0 b4 b3 b2 b1 b0) for any given value of B = (b4 b3 b2 b1 b0). Therefore, product words on the fourth column can be derived by negating the product word stored at the second column on the same row. Therefore, instead of 64 product words only 32 values of A Â·
B for all possible values of B are required to be stored, as shown in the sixth column. The technique requires only one additional XOR gate to determine the sign of product word P.
TABLE I. DIRECTMEMORYBASED SIGNBIT EXCLUSION TECHNIQUE FOR L=6
Input, B address, B
product
values
Input, B address, B
product
values
stored words
2s comp
sign magnitude
0 0 0 0 0 0
0
1 0 0 0 0 0
0
0
0
0 0 0 0 0 1
A
1 0 0 0 0 1
A
A
A
0 0 0 0 1 0
2A
1 0 0 0 1 0
2A
2A
2A
0 0 0 0 1 1
3A
1 0 0 0 1 1
3A
3A
3A
0 0 0 1 0 0
4A
1 0 0 1 0 0
4A
4A
4A
0 0 0 1 0 1
5A
1 0 0 1 0 1
5A
5A
5A
0 0 0 1 1 0
6A
1 0 0 1 1 0
6A
6A
6A
0 0 0 1 1 1
7A
1 0 0 1 1 1
7A
7A
7A
0 0 1 0 0 0
8A
1 0 1 0 0 0
8A
8A
8A
0 0 1 0 0 1
9A
1 0 1 0 0 1
9A
9A
9A
0 0 1 0 1 0
10A
1 0 1 0 1 0
10A
10A
10A
0 0 1 0 1 1
11A
1 0 1 0 1 1
11A
11A
11A
0 0 1 1 0 0
12A
1 0 1 1 0 0
12A
12A
12A
0 0 1 1 0 1
13A
1 0 1 1 0 1
13A
13A
13A
0 0 1 1 1 0
14A
1 0 1 1 1 0
14A
14A
14A
0 0 1 1 1 1
15A
1 0 1 1 1 1
15A
15A
15A
0 1 0 0 0 0
16A
1 1 0 0 0 0
16A
16A
16A
0 1 0 0 0 1
17A
1 1 0 0 0 1
17A
17A
17A
0 1 0 0 1 0
18A
1 1 0 0 1 0
18A
18A
18A
0 1 0 0 1 1
19A
1 1 0 0 1 1
19A
19A
19A
0 1 0 1 0 0
20A
1 1 0 1 0 0
20A
20A
20A
0 1 0 1 0 1
21A
1 1 0 1 0 1
21A
21A
21A
0 1 0 1 1 0
22A
1 1 0 1 1 0
22A
22A
22A
0 1 0 1 1 1
23A
1 1 0 1 1 1
23A
23A
23A
0 1 1 0 0 0
24A
1 1 1 0 0 0
24A
24A
24A
0 1 1 0 0 1
25A
1 1 1 0 0 1
25A
25A
25A
0 1 1 0 1 0
26A
1 1 1 0 1 0
26A
26A
26A
0 1 1 0 1 1
27A
1 1 1 0 1 1
27A
27A
27A
0 1 1 1 0 0
28A
1 1 1 1 0 0
28A
28A
28A
0 1 1 1 0 1
29A
1 1 1 1 0 1
29A
29A
29A
0 1 1 1 1 0
30A
1 1 1 1 1 0
30A
30A
30A
0 1 1 1 1 1
31A
1 1 1 1 1 1
31A
31A
31A
The signbit exclusion technique can also be applied for 2s complement representation of coefficient A, which stores the product words in 2s complement representation, and requires a 2s complement unit along with a 2:1 MUX to change the sign of LUT output for negative values of B. If the sign bit is 0 the result is product value stored in LUT but if sign bit is 1 the true result is 2s complemnt of product value stored in LUT. The address of stored product word is the same as the magnitude bit of input.

DirectMemoryBased AntiSymmetric Product Coding (Apc) Technique:
In AntiSymmetric Product Coding we arrange the inputs in such a way that we can utilize the antisymmetric property to get the output. Here for convince we assume both numbers to be positive, the product words for different input values of B for L = 5 are shown in Table II. The table is so arranged that the input word on the first column of each row is the 2s complement of that on the third column of the same row. Also, the sum of product values corresponding to these two input values on the same row is 32A. Let the product values on the second and fourth columns of a row be j and k, respectively.
Since one can write j = [(j + k)/2 (k j)/2] and
k = [(j + k)/2 + (k j)/2].
For (j + k) = 32A, we can have
j = 16A – [(k j)/2] (1a)
k = 16A + [(k j)/2] (1b)
The product values on the second and fourth columns therefore have negative mirror symmetry. This behavior of the product words can be used to reduce the LUT size, where, instead of storing j and k, only [(k – j)/2] is stored for a pair of input on a given row. The 4bit LUT addresses and corresponding coded words are listed on the fifth and sixth columns of the table, respectively. Since the arrangement of products is done by the antisymmetric behavior of the products, we called it antisymmetric product coding.
To evaluate the address of APC words if MSB of the input is 1 then address is rest of the least significant bits (LSB) but if MSB of the input is 0 then address is the 2s complement of rest of the LSBs. Therefore, 4bit address B= (b3 b2 b1 b0) of the APC word is given by:
(2)
Where BL = (b3 b2 b1 b0) is the four less significant bits of B, and BL is the 2s complement of BL. The desired product could be obtained by adding or subtracting the stored value (kj)/2 to or from the fixed value 16A when b4 is 1or 0, respectively, i.e.
Product word = 16A + (sign value) Ã— (APC word) (3)
Where (APC word) = (kj)/2, sign value =+1 for b4 = 1 and sign value = 1 for b4 = 0. The product value for B = (10000) corresponds to APC value zero, which could be derived by resetting the LUT output, instead of storing that in the LUT.
TABLE II. DIRECTMEMORYBASED APC TECHNIQUE FOR L=5
Input, B
Product values
Input, B
Product values
Address B, b3 b2 b1 b0
APC
words
0 0 0 0 1
A
1 1 1 1 1
31A
1 1 1 1
15A
0 0 0 1 0
2A
1 1 1 1 0
30A
1 1 1 0
14A
0 0 0 1 1
3A
1 1 1 0 1
29A
1 1 0 1
13A
0 0 1 0 0
4A
1 1 1 0 0
28A
1 1 0 0
12A
0 0 1 0 1
5A
1 1 0 1 1
27A
1 0 1 1
11A
0 0 1 1 0
6A
1 1 0 1 0
26A
1 0 1 0
10A
0 0 1 1 1
7A
1 1 0 0 1
25A
1 0 0 1
9A
0 1 0 0 0
8A
1 1 0 0 0
24A
1 0 0 0
8A
0 1 0 0 1
9A
1 0 1 1 1
23A
0 1 1 1
7A
0 1 0 1 0
10A
1 0 1 1 0
22A
0 1 1 0
6A
0 1 0 1 1
11A
1 0 1 0 1
21A
0 1 0 1
5A
0 1 1 0 0
12A
1 0 1 0 0
20A
0 1 0 0
4A
0 1 1 0 1
13A
1 0 0 1 1
19A
0 0 1 1
3A
0 1 1 1 0
14A
1 0 0 1 0
18A
0 0 1 0
2A
0 1 1 1 1
15A
1 0 0 0 1
17A
0 0 0 1
A
1 0 0 0 0
16A
1 0 0 0 0
16A
0 0 0 0
0

For B= (0 0 0 0 0), the encoded word to be stored is 16 A.


DirectMemoryBased Modified Odd Multiple Storage (Oms) Technique:
In Odd Multiple Storage technique instead of storing all the 2L possible values of product P = A B as in conventional,
here only (2L/2) words corresponding to the odd multiples of A may be stored in the LUT, while all the even multiples of A could be derived by leftshift operations of one of those odd multiples. In Table III, we have shown that, the even multiples 2A, 4A, and 8A are derived by leftshift operations of A. Similarly, 6A and 12A are derived by left shifting 3A, while 10A and 14A are derived by left shifting 5A and 7A, respectively. A barrel shifter for producing a maximum of three left shifts could be used to derive all the even multiples of A.
In Modified OMS technique the address of the APC stored words becomes the input B of OMS such that when we combine APCOMS technique we get the reduction in LUT size by onefourth over conventional. At the eight memory locations the eight odd multiples of product words are stored by relation Pi = A Ã— (2i + 1) for i =0, 1, 2 . . . . 7. As required by (3), the word to be stored for B = (00000) is not 0 but 16A, which we can obtain from A by four left shifts using a barrel shifter. However, if 16A is not derived from A, only a maximum of three left shifts is required to obtain all other even multiples of A. A maximum of three bit shifts can be implemented by a twostage logarithmic barrel shifter, but the implementation of four shifts requires a threestage barrel shifter. Therefore, it would be a more efficient strategy to store 2A for input B = (00000), so that the product 16A can be derived by three arithmetic left shifts. The product values and encoded words for input words B= (00000) and (10000) are separately shown in Table IV. For B= (00000), the desired encoded word 16A is derived by 3bit left shifts of 2A [stored at address (1000)]. For B = (10000), the APC word 0 is derived by resetting the LUT output, by an activehigh RESET signal given by:
(4)
TABLE III. DIRECTMEMORYBASED OMS TECHNIQUE OF APC WORDS FOR L=5
1 0
Input, B b3 b2 b1 b0
Product value
No. of shifts
Control S1 S0
Shifted Input, B
Stored odd APC words
address, d3 d2 d1 d0
0 0 0 1
A
0
0 0
0 0 0 1
P0=A
0 0 0 0
0 0 1 0
2 A
1
0 1
0 1 0 0
4 A
2
1 0
1 0 0 0
8 A
3
1 1
0 0 1 1
3A
0
0 0
0 0 1 1
P1=3A
0 0 0 1
0 1 1 0
2 3A
1
0 1
1 1 0 0
4 3A
2
0 1 0 1
5A
0
0 0
0 1 0 1
P2=5A
0 0 1 0
1 0 1 0
2 5A
1
1 0
0 1 1 1
7A
0
0 0
0 1 1 1
P3=7A
0 0 1 1
1 1 1 0
2 7A
1
0 1
1 0 0 1
9A
0
0 0
1 0 0 1
P4=9A
0 1 0 0
1 0 1 1
11A
0
0 0
1 0 1 1
P5=11A
0 1 0 1
1 1 0 1
13A
0
0 0
1 1 0 1
P6=13A
0 1 1 0
1 1 1 1
15A
0
0 0
1 1 1 1
P7=15A
0 1 1 1
TABLE IV. PRODUCTS AND ENCODED WORDS FOR B=(00000) AND B=(10000)
Input, B b4 b3 b2 b1 b0
Product value
Encoded word
Stored values
No. of shifts
address d3 d2 d1 d0
Control S1 S0
1 0 0 0 0
16A
0
– – –
– –
– – – –
– –
0 0 0 0 0
0
16A
2A
3
1 0 0 0
11
It may be seen from Tables III and IV that the 5bit input word B can be mapped into a 4bit LUT address (d3d2d1d0), by a simple set of mapping relation
, for i = 0, 1, 2 and (5)
Where B = (b3 b2 b1 b0) is generated by shiftingout all the leading zeros of B by an arithmetic right shift followed by address mapping, i.e.
(6)
Where YL and YL are derived by circularly shiftingout all the leading zeros of BL and BL, respectively. The RESET signal can alternatively be generated as (d3 AND b4).
(7)
The control bits s0 and s1 to be used by the barrel shifter to produce the desired number of shifts of the LUT output are generated by the control circuit, according to the relation


OPTIMIZED DIRECTMEMORYBASED MULTIPLICATION FOR SIGNED AND UNSIGNED OPERANDS
In this section, we discuss that the directmemorybased multiplication of input B with fixed coefficient A could be easily carried out for any combination of signed and unsigned magnitude number by just modifying the design.

Both Operands in Signedmagnitude form :
The APCOMS combined optimization of the LUT can be performed for signed values of A and B with the help of sign bit exclusion technique. All the three technique are well utilized when both operands are in signmagnitude form, the multiples of magnitude of the fixed coefficient are to be stored in the LUT, and the sign of the product could be obtained by the XOR operation of sign bits of both multiplicands. When both operands are in twos complement forms, a twos complement operation of the output of the LUT is required to be performed for MSB equal to 1.

Both Operands in UnsignedMagnitude form :
When both the operands A and B are in unsigned magnitude form then there is no need for the signbit exclusion technique for the optimization. Here only the APC OMS combined optimization technique is used for reduction in LUT size.

Input is UnsignedMagnitude and fixed coefficient is Signed Magnitude form:
For the multiplication of unsigned input B with signed coefficient A, the products could be stored in twos complement representation, and the signmodification circuit checks the MSB of the output to give a 2s complement as true output for MSB equal to 1 and as it is otherwise. A straight forward implementation of the signmodification circuit involves multiplexing of the LUT output and its twos complement, to reduce the areatime complexity.

Input is SignedMagnitude and fixed coefficient is Unsigned Magnitude form:
For the multiplication of signed input B with unsigned coefficient A, as the signbit of input is excluded by the sign bit exclusion technique the products could be stored as it is, and the signmodification circuit checks the MSB of the input to give a 2s complement as true output for MSB equal to 1 and as it is otherwise. Here also all the three techniques are well utilized with no modification in proposed combined design.


ALGORITHMIC DESIGN FOR PROPOSED TECHNIQUES USED IN OPTIMIZED DIRECTEMORY
BASED MULTIPLICATION
This section presents the design algorithms of techniques
us
(7a)
(7b)
Note that (s1 s0) is a 2bit binary equivalent of the required number of shifts specified in Tables III and IV.
ed in optimized directmemorybased multiplication.

Algoritmic Design for DirectMemoryBased SignBit Exclusion technique:
The design of this technique for L= 6 bits, consists of a memory array of 32 word size which stores the precomputed product values and an address generating circuit which is in form of a 5To 32 line decoder for address mapping the input to a particular memory location. It also consist of a 2s
complement unit which generates the product words in 2s complement representation along with a 2:1 MUX to change the sign of LUT output for negative values of input.
ALGORITHM
Step1: Let fixed coefficient input A be 10 bit word.
Step2: Multiplicand input B (b5 b4 b3 b2 b1 b0) is 6 bit word. Step3: Product output P is 16 bit word.
Step4: Memory component has precomputed 32 Stored Product Words SPW of 16 bit size.
Output SPW = AÂ·B corresponding to input B.
Step5: Signal is declared for address D (d4 d3 d2 d1 d0) of 5 bits.
Step6: Signal is declared for SPW of 16 bits. Step7: Begin for finding address D value:
Address D = B (i.e. b4 b3 b2 b1 b0);
Step8: Begin Process 1 for finding SPW for given address D:
Case Address D is
(Here a list of address value D and corresponding to it SPW value is given according to table I ).
end Case; End Process 1;
Step9: Begin Process 2 for finding true product P:
If signbit b5 = 0 (i.e. for positive number) then Product P = SPW;
else signbit b5 = 1 (i.e. for negative number) then Product P = 2s complement of SPW;
End Process 2; Step10: End

Algoritmic Design for DirectMemoryBased Anti Symmetric Product Coding (APC) Technique:
In this technique for L=5 bit, taking only the magnitude part of signed 6bit numbers or unsigned 5bit numbers, here the design consist of a fourinput memory array of 16 words to store the APC values of product words as given in the sixth column of Table II, except on the last row, where 2A is stored for input B= (00000) instead of storing a 0 for input B = (10000). Besides, it consists of an addressmapping circuit and an add/subtract circuit. The addressmapping circuit generates the desired address B= (b3 b2 b1 b0) according to (2). A straightforward implementation of address mapping can be done by multiplexing BL and BL using b4 as the control bit. The output of the memory table is added with or subtracted from 16A, for b4 = 1 or 0, respectively, according to (3) by the add/subtract cell. Hence, b4 is used as the control for the add/subtract cell.
ALGORITHM
Step1: Let fixed coefficient input A be 10 bit word.
Step2: Multiplicand input B (b4 b3 b2 b1 b0) is 5 bit unsigned number or is B for signed 6 bit number.
Step3: Product output P is 16 bit word.
Step4: Memory component has precomputed 16 APC words. Step5: ignal is declared for address B (b3 b2 b1 b0) of 4 bit. Step6: Signal is declared for APC word of 16 bits.
Step7: Begin Process 1 for finding address B : If b4 = 1 then
Address B = (b3 b2 b1 b0); else b4 = 0 then
Address B =2s complement of (b3 b2 b1 b0); End Process 1;
Step8: Begin Process 2 for finding APC for given address B:
Case Address B is
(Here a list of address value Band corresponding to it APC value is given according to table II).
end Case; End Process 2;
Step9: Begin Process 3 for finding true product P: If b4 = 1 then
Product P = 16A + APC; else b4 = 0 then
Product P = 16A – APC;
End Process 3; Step10: End

Algoritmic Design for DirectMemoryBased Modified
Odd Multiple Storage (OMS) Technique:
In this technique for L=4 bits taking unsigned 4bit number or APC words of L=5 bit and for any coefficient width W, here the design consists of a memory array of nine words of (W + 4)bit width, a fourtonineline address decoder, a barrel shifter, an address generation circuit, and a control circuit for generating the RESET signal and control word (s1s0) for the barrel shifter. The precomputed values of A Ã— (2i + 1) are stored as Pi, for i = 0, 1, 2, . . . , 7, at the eight consecutive locations of the memory array, as specified in Table III, while 2A is stored for input B = (00000) at memory address 1000, as specified in Table IV. The decoder takes the 4bit address from the address generator and generates nine wordselect signals, i.e., {vi, for 0 i 8}, to select the referenced word from the memory. The control bits s0 and s1 to be used by the barrel shifter to produce the desired number of shifts of the memory output are generated by the control circuit, according to (7a) and (7b). The RESET signal can be generated by (7).
The addressgenerator circuit receives the input operand B and maps that onto the 4bit address word (d3d2d1d0), according to (5) and (6).
ALGORITHM
Step1: Let fixed coefficient input A be 10 bit word.
Step2: Multiplicand input B (b3 b2 b1 b0) is 4 bit unsigned number or is APC word of L=5 bit.
Step3: Product output P is 16 bit word.
Step4: Memory component has precomputed 9 OMS words. Step5: Signal is declared for address D (d3 d2 d1 d0) of 4 bits. Step6: Signal is declared for OMS word of 16 bits.
Step7: Signal is declared for control S (s1 s0) of 2 bits. Step8: Signal is declared for RESET of 1 bit.
Step9: Begin Process 1for finding address and control signal:
Case Input B is
(Here a list of input values B and corresponding to it Control signal value S and address location D is given according to table III and IV).
end Case; End Process 1;
Step10: Begin Process 2 for finding OMS for given address D:
Case Address D is
(Here a list of address value D and corresponding to it OMS value is given according to table III and IV). end Case;
End Process 2;
Step11: Begin Process 3 for finding true product value.
RESET = d3. b4
If RESET = 1 then Product P= 0
else RESET=0 then
Product P= OMS << S (i.e. OMS is left shift by S) End Process 3;
Step12: End

Algorithmic Design Of Combined Technique Used In Proposed Optimized DirectMemoryBased Multiplication:
The algorithmic design principle here is to utilize all the three technique SignBit Exclusion, APC and Modified OMS efficiently in an optimized manner to get our proposed design which reduce the LUT memory size to oneeighth of the conventional LUT. By efficiently combine all the technique we get optimized directmemorybased multiplication hardware. It consists of an address generator and control circuit, 4To9 addressline decoder, 9(W+6) LUT memory units, barrel shifter, add/subtract unit and 2s complement/signmodification unit.
ALGORITHM
Step1: Let fixed coefficient input A be 10 bit word.
Step2: Multiplicand input B (b5 b4 b3 b2 b1 b0) is 6 bit signed number.
Step3: Product output P is 16 bit word.
Step4: Memory component has precomputed 9 OMS words. Step5: Signal is declared for address D(d3 d2 d1 d0) of 4 bits. Step6: Signal is declared for input B(b3 b2 b1 b0) of
OMS technique of 4 bits.
Step7: Signal is declared for OMS word of 16 bits. Step8: Signal is declared for APC word of 16 bits. Step9: Signal is declared for control S(s1 s0) of 2 bits. Step10: Signal is declared for RESET of 1 bit.
Step11: Begin Process 1 for finding Input B of OMS value:
Case Input (b5 b4) is [Value (b5 b4) = 00
Input of OMS B =2s complement of (b3 b2 b1 b0); Value (b5 b4) = 01
Input of OMS B = (b3 b2 b1 b0); Value (b5 b4) = 10
Input of OMS B =2s complement of (b3 b2 b1 b0); Value (b5 b4) = 11
Input of OMS B = (b3 b2 b1 b0);] end Case;
End Process 1;
Step12: Begin Process 2 for finding address D and control signal S:
Case Input B is
(Here a list of input values B and corresponding to it Control signal value S and address location D is given according to tables III and IV).
end Case; End Process 2;
Step13: Begin Process 3 for finding OMS for given address D:
Case Address D is
(Here a list of address D and corresponding to it OMS value is given according to tables III and IV). end Case;
End Process 3;
Step14: Begin Process 4 for finding APC product value: RESET = d3. b4;
If RESET = 1 then Product APC= 0;
else RESET=0 then
Product APC= OMS << S; (i.e. OMS is left shift by S) End Process 4;
Step15: Begin Process 5 for finding true product P;
Case Input (b5 b4) is [Value (b5 b4) = 00
Product P= 16A APC; Value (b5 b4) = 01
Product P= 16A + APC; Value (b5 b4) = 10
Product P=2s complement of (16A APC); Value (b5 b4) = 11
Product P=2scomplement of (16A + APC);] end Case;
End Process 5; Step16: End


HARDWARE IMPLEMENTATION OF DIRECTMEMORY BASED MULTIPLIER USING THE PROPOSED OPTIMIZATION
TECHNIQUE
The hardware implementation of directmemorybased multiplier for an Lbit input with a Wbit coefficient using the proposed optimization scheme is shown in fig.1. The multiplicand input (b5 b4 b3 b2 b1 b0) is applied to address generator and control circuit to generate the desired address location (d3d2d1d0), RESET and control signal (s1 s0), according to (5), (6), (7), (7a) and (7b) respectively. The function of a 4To9 addressline decoder is to take the 4bit address from the address generator and generate nine word select signals, i.e., {vi, for 0 i 8}, to select the referenced word from the memory unit. The memory unit consist of LUT of nine words of (W + 6)bit width, which stores the pre computed values of Pi =AÃ— (2i + 1), for i = 0, 1, 2, . . . , 7, at the eight consecutive locations of the LUT memory unit, as specified in Table III, while 2A is stored for input B = (00000) at LUT address 1000, as specified in Table IV. The control bits s0 and s1 is used by the barrel shifter to produce the desired number of shifts of the LUT memory output. The output of the barrel shifter is added with or subtracted from 16A, for b4 = 1 or 0, respectively, according to (3) by the add/subtract unit. Hence, b4 is used as the control for the add/subtract unit. At the 2s complement/sign modification unit, a twos complement operation of the output of add/subtract cell is required to be performed for b5=1 and remain as it is for b5=0, when both operands are in twos complement form. Here b5 act as a sign control signal. The RTL schematic of Optimized DirectMemoryBased Multiplier Design, Using SignBit Exclusion, APC and OMS Technique is shown in Fig.2.

RESULT AND CONCLUSION
The proposed optimized memorybased multiplier is coded in VHDL and synthesized in Xilinx ISE 9.1i Project Navigator, for word size L = 5 and 6 bits for signed magnitude numbers and L= 5 bits unsigned magnitude numbers respectively. Forunsigned numbers we use APC and modified OMS scheme whereas for signedmagnitude numbers the signbit exclusion
Fig. 1. Proposed Optimized DirectMemoryBased Multiplier Design, Using SignBit Exclusion, APC and OMS Technique.
Fig. 2. RTL schematic of Optimized DirectMemoryBased Multiplier Design, Using SignBit Exclusion, APC and OMS Technique.
scheme is included in APC and modified OMS scheme to get an optimized LUT multiplier which reduces the memory size to oneeighth of conventional LUT. Simulation result, area utilization and timing analysis for 6 bit signed number, 5 bit unsigned number and 5 bit signed number are shown in fig.3, fig.4, fig.5, fig.6, fig.7, fig.8, fig.9, fig.10 and fig.11 respectively.
Fig. 3. Simulation Result of Proposed Optimized Design for Signed Magnitude Numbers, L=6 Bits
Fig. 4. Area Utilization of Proposed Optimized Design for Signed Magnitude Numbers, L=6 Bits
Fig. 5.
Fig. 6. Timing Report of Proposed Optimized Design for SignedMagnitude Numbers, L=6 Bits
Fig. 7. Simulation Result of Optimized Design for UnsignedMagnitude Numbers, L=5 Bits
Fig. 8. Area Utilization of Optimized Design for UnsignedMagnitude Numbers, L=5 Bits
Fig. 9. Timing Report of Optimized Design for UnsignedMagnitude Numbers, L=5 Bits
Fig. 10. Simulation Result of Proposed Optimized Design for Signed Magnitude Numbers, L=5 Bits
Fig. 11. Area Utilization of Proposed Optimized Design for Signed Magnitude Numbers, L=5 Bits
Fig. 12. Timing Report of Proposed Optimized Design for SignedMagnitude Numbers, L=5 Bits
Comparison of memory block and LUT memory size for L= 6 bits by different techniques is shown in Table V. In Table VI we have compared the results derived from are design summary and synthesis report for area and timing parameters. It is found that for signed magnitude numbers of 5 bit word size it offers more saving in areadelay product as of
5 bit word size unsigned magnitude numbers. With our proposed optimized design LUT memory size for L= 6 bits is 14.065% of conventional LUT and for L= 5 bits is 12.5% of conventional LUT. Thus, the proposed optimized design also offers almost 85% and 87.5% reductions in LUT memory size for L=6 bits and L=5 bits respectively over conventional LUT. The design also saves a lot of multiplication computation power as it is directmemory based and no product computation are carried out, hence with a very negligible trade off in delay a lot of area and power are saved.
The LUT multiplier could be used for memorybased implementation of linear and circular convolution, cosine and sinusoidal transforms, and innerproduct computation. The performance of memorybased structures with different memory implementations could be studied in future for different DSP applications. Although the memory core of the directmemorybased multiplier is reduced to nearly one eighth by the proposed optimization technique, it is not efficient for operands of small widths, since it requires an adder to add the offset value. However, it could be used for multiplication with input of large word size by an input decomposition scheme. When the width of the input multiplicand is large, direct implementation of LUT multiplier involves a very large LUT. Therefore, the input word could be
decomposed into a certain number of segments or subwords, and the partial products pertaining to different subwords could be shift added to obtain the desired product. In brief, we have shown the possibility of using directmemorybased multipliers to implement the constant multiplication for DSP applications. The full advantages of proposed design, however, could be derived if the LUTs are implemented as NAND or NOR readonly memories and the arithmetic shifts for large operand word size are implemented by an array barrel shifter using metaloxidesemiconductor transistors. Further work could still be done to derive SignBit Exclusion, OMSAPCbased LUTs for higher input sizes with different forms of decompositions and parallel and pipelined addition schemes for suitable areadelay tradeoffs.
TABLE V. COMPARISON STUDY OF OPTIMIZED DESIGN FOR UNSIGNED MAGNITUDE NUMBERS WITH PROPOSED OPTIMIZED DESIGN FOR SIGNEDMAGNITUDE
NUMBERS
TABLE VI. COMPARISON OF MEMORY BLOCKS AND LUT SIZE FOR L= 6
METHOD 
CONVENTIONAL 
SIGNBIT EXCLUSION 
APC 
OMS 
SIGNBIT EXCLUSION + APC 
APC+OMS 
SIGNBIT EXCLUSION + APC + OMS 
MEMORY BLOCK 
64 
32 
33 
33 
17 
17 
9 
LUT SIZE 
100% 
50% 
51.5625% 
51.5625% 
26.5625% 
26.5625% 
14.0625% 
REFERENCES

International Technology Roadmap for Semiconductors. [Online].
Available: http://public.itrs.net/

J.I. Guo, C.M. Liu, and C.W. Jen, The efficient memorybased VLSI array design for DFT and DCT, IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 39, no. 10, pp. 723733, Oct. 1992.

H.R. Lee, C.W. Jen, and C.M. Liu, On the design automation of the memorybased VLSI architectures for FIR filters, IEEE Trans. Consum. Electron., vol. 39, no. 3, pp. 619629, Aug. 1993.

D. F. Chiper, M. N. S. Swamy, M. O. Ahmad, and T. Stouraitis, A systolic array architecture for the discrete sine transform, IEEE Trans. Signal Process., vol. 50, no. 9, pp. 23472354, Sep. 2002.

H.C. Chen, J.I. Guo, T.S. Chang, and C.W. Jen, A memory efficient realization of cyclic convolution and its application to discrete cosine transform, IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 3, pp. 445453, Mar. 2005.

D. F. Chiper, M. N. S. Swamy, M. O. Ahmad, and T. Stouraitis, Systolic algorithms and a memorybased design approach for a unified architecture for the computation of DCT/DST/IDCT/IDST, IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 52, no. 6, pp. 1125 1137, Jun. 2005.

P. K. Meher, Systolic designs for DCT using a lowcomplexity concurrent convolutional formulation, IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 9, pp. 10411050, Sep. 2006.

P. K. Meher, Memorybased hardware for resourceconstrained digital signal processing systems, in Proc. 6th Int. Conf. ICICS, Dec. 2007, pp. 14.

P. K. Meher, New approach to LUT implementation and accumulation for memorybased multiplication, in Proc. IEEE ISCAS, May 2009, pp. 453456.

P. K. Meher, New lookuptable optimizations for memorybased multiplication, in Proc. ISIC, Dec. 2009, pp. 663666.

P.K.Mehra, LUT optimization for MemoryBased Computation, IEEE Trans. Circuits and Syst.II:Express Briefs, vol.57,no.4,pp.285289, Apr.2010.