 Open Access
 Authors : Aneesh R, Sarin K Mohan
 Paper ID : IJERTV3IS040039
 Volume & Issue : Volume 03, Issue 04 (April 2014)
 Published (First Online): 03-04-2014
 ISSN (Online) : 2278-0181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Design and Analysis of High Speed, Area Optimized 32×32-Bit Multiply Accumulate Unit Based on Vedic Mathematics
Aneesh R
ER&DCI Institute of Technology, CDAC Thiruvananthapuram
Kerala, INDIA
Sarin K Mohan
ER&DCI Institute of Technology, CDAC Thiruvananthapuram
Kerala, INDIA
Abstract: This paper describes the implementation of a 32×32-bit multiply accumulate (MAC) unit designed using ancient Vedic mathematical techniques. It presents the efficiency of the Urdhva Tiryakbhyam Vedic method, which changes the multiplication process itself: it enables parallel generation of partial products and eliminates unwanted multiplication and addition steps. The multiply accumulate unit is a key component in most digital signal processors. To balance the key performance characteristics of speed, power and area, a gate-level implementation is adopted throughout the work. Several commonly available adders are compared, and the best one is used to add the partial products generated in the Vedic multiplication, reducing the combinational delay in the critical path. The design is coded in VHDL, and the analysis in terms of speed, power and area is done on a Virtex-6 FPGA using the Xilinx ISE 13.1 tool.
Keywords: Vedic mathematics, Vedic multiplier, Multiply Accumulate Unit, FPGA, VHDL

INTRODUCTION
Multiplication is a fundamental arithmetic operation. Multiply and accumulate (MAC) operations are used in many digital signal processing (DSP) applications such as FFT, DFT and convolution, and also in the arithmetic and logic unit of microprocessors [8], [9]. In many DSP applications, the multiply accumulate component is a major contributor to the critical path delay and affects the performance of the application. Low critical path delay and low power consumption are major specifications for many applications. This paper describes a high-speed, area-efficient and low-power 32×32-bit multiply accumulate unit based on Vedic mathematics.
For multiplication operations in DSP applications, delay and throughput are the two major specifications from a researcher's and designer's perspective. Delay is the actual time taken to compute the algorithm, and throughput is the number of multiplication operations that can be completed in a specified time. Minimizing power consumption and latency in digital system design involves optimization at all levels of the design [8], [9]. This includes choosing the best algorithm for the operation based on the specification, this being the highest level of design, then the circuit style, the topology and finally the technology used to implement the digital circuits.
The most common multiplication algorithms in digital hardware are the array multiplication algorithm and the Booth multiplication algorithm. The array multiplier is an efficient layout of a combinational multiplier, based on the add-and-shift algorithm. Each partial product is generated by multiplying the multiplicand by one multiplier bit; the partial products are shifted according to their bit order and then added. An N-bit multiplier requires (N-1) adders. The Booth multiplier is used for signed-number multiplication and treats positive and negative numbers in the same manner. It uses the shift-and-add method to obtain the result: each multiplier bit generates one multiple of the multiplicand, which is added to the partial product. An N-bit multiplicand requires N adders.
This paper uses a Vedic-mathematics-based approach to reduce the number of partial products for multiplication, which in effect reduces the number of adders. Vedic mathematics is the ancient Indian system of mathematics, based on sixteen sutras and their sub-sutras mentioned in the Atharva Veda. It deals with various branches of mathematics such as arithmetic, algebra, geometry, trigonometry, conics, astronomy and calculus.
The proposed architecture of the 32×32-bit multiply accumulate unit based on Vedic mathematics is shown in Fig1. The inputs operand_1 and operand_2 are 32-bit data inputs. The 64-bit output of the MAC is available on the Result pin, and the carry from the addition is available on the Carry pin. The 64-bit adder adds the multiplier result to the value stored in the accumulator. To enhance the performance of the MAC unit, the adder used for multiplication and accumulation is selected through a comparison of several adders.
Fig1: Top level representation of MAC unit

VEDIC MATHEMATICS
The multiplier is based on the Urdhva Tiryakbhyam (Vertical and Crosswise) algorithm of ancient Indian Vedic mathematics. The Urdhva Tiryakbhyam sutra is a general multiplication formula applicable to all cases of multiplication. It is based on a concept through which all partial products can be generated and then added concurrently. The sutra thus provides parallelism in the generation of partial products. The partial products are summed using a fast carry save adder. Because the partial products and their sums are calculated in parallel blocks, the multiplier's contribution to the critical path delay of the system is kept small.
The strategy for developing the 32 x 32-bit Vedic multiplier is to design a 2 x 2-bit Vedic multiplier as the basic building module of the system. In the next stage, a 4 x 4-bit multiplier is designed using the 2 x 2-bit Vedic multiplier. The 8 x 8, 16 x 16 and 32 x 32-bit Vedic multipliers are designed in the same manner. A fast carry save adder is used for the partial product addition at all stages.

2×2 Vedic Multiplier
In the 2 x 2-bit multiplier, each operand has two bits and the result of the multiplication has four bits. Each input ranges from 00 to 11, and the output lies in the range 0000 to 1001 (3 x 3 = 9). Fig 2 shows the stepwise multiplication of two binary numbers using the Vedic mathematics technique.
Fig2: Multiplication of 10 x 10
The first step is the vertical multiplication of the LSBs of both operands. The second step is the crosswise multiplication of the bits and the addition of the two resulting partial products. The third step is the vertical multiplication of the MSBs of both operands and the addition of the carry propagated from step 2. Fig3 shows the hardware realization of the 2×2 Vedic multiplier.
Fig3: Hardware realization of 2×2 block
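The three steps above can be sketched as a bit-level model. This is a minimal illustration in Python rather than the paper's VHDL; the function names `half_adder` and `vedic_2x2` are assumptions made for this sketch, and the use of two half adders is one common realization of these steps, not necessarily the exact circuit of Fig3.

```python
def half_adder(a, b):
    """Return (sum, carry) for two 1-bit inputs."""
    return a ^ b, a & b


def vedic_2x2(a, b):
    """Multiply two 2-bit unsigned values with the vertical and crosswise steps."""
    a0, a1 = a & 1, (a >> 1) & 1
    b0, b1 = b & 1, (b >> 1) & 1

    # Step 1: vertical product of the two LSBs gives result bit 0.
    p0 = a0 & b0
    # Step 2: the two crosswise products are added; the sum is result bit 1.
    p1, c1 = half_adder(a1 & b0, a0 & b1)
    # Step 3: vertical product of the MSBs plus the carry from step 2
    # gives result bits 2 and 3.
    p2, p3 = half_adder(a1 & b1, c1)

    return (p3 << 3) | (p2 << 2) | (p1 << 1) | p0
```

For the operands of Fig2, `vedic_2x2(0b10, 0b10)` evaluates to `0b0100`.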

4×4 Vedic Multiplier
The 4×4 multiplication is decomposed into four 2×2 multiplications performed in parallel. This reduces the number of stages in the multiplication and thus the delay of the multiplier. The figure below shows the block level representation of the 4×4 Vedic multiplier.
Fig3: Block level representation of 4×4 multiplier block
The advantage of this mechanism is that a larger N-bit operand can be divided into halves of n = N/2 bits, which can be further divided into n/2-bit halves, and so on until the operand width reaches 2 bits. These 2-bit blocks can then be multiplied in parallel, which increases the speed of operation. The adder is selected based on the comparative study described in Section III.
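The halving scheme can be illustrated with a short recursive model. This is a sketch in Python under stated assumptions: the names `vedic_2x2` and `vedic_multiply` are hypothetical, the operand width `n` is assumed to be a power of two, and Python's `+` stands in for the hardware adder that combines the four sub-products. In hardware the four sub-multipliers operate in parallel; the sequential recursion here only models the arithmetic.

```python
def vedic_2x2(a, b):
    """2x2 base block built from AND gates and two half adders."""
    a0, a1 = a & 1, (a >> 1) & 1
    b0, b1 = b & 1, (b >> 1) & 1
    p0 = a0 & b0
    x, y = a1 & b0, a0 & b1
    p1, c1 = x ^ y, x & y
    z = a1 & b1
    p2, p3 = z ^ c1, z & c1
    return p0 | (p1 << 1) | (p2 << 2) | (p3 << 3)


def vedic_multiply(a, b, n):
    """Multiply two n-bit unsigned values using four (n/2)-bit sub-multipliers."""
    if n == 2:
        return vedic_2x2(a, b)
    half = n // 2
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, a >> half
    b_lo, b_hi = b & mask, b >> half

    # Four independent sub-products (parallel blocks in hardware).
    ll = vedic_multiply(a_lo, b_lo, half)
    lh = vedic_multiply(a_lo, b_hi, half)
    hl = vedic_multiply(a_hi, b_lo, half)
    hh = vedic_multiply(a_hi, b_hi, half)

    # Align the sub-products by their weights and add them.
    return ll + ((lh + hl) << half) + (hh << n)
```

Calling `vedic_multiply(a, b, 32)` exercises the full 32 → 16 → 8 → 4 → 2 decomposition described in the following subsections.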

8×8 Vedic Multiplier
The 8×8 multiplier is formed from four 4×4 multiplier blocks. The operands are 8 bits wide (n = 8), whereas the result is 16 bits wide. Each input is broken into smaller blocks of size n/2 = 4. The resulting 4-bit data blocks are given as inputs to the 4×4 multiplier blocks, which are themselves formed from 2×2 blocks. The 8-bit results produced by the 4×4 multiplier blocks are given to the adder. Fig 4 shows the block level representation of the 8×8 multiplier block.
Fig4: Block level representation of 8×8 multiplier block

16×16 Vedic Multiplier
The 16×16 multiplier is designed using four 8×8 multiplier blocks. Both operands are 16 bits wide (n = 16) and the result is 32 bits wide. Each input is broken into smaller blocks of size n/2 = 8. The resulting 8-bit data blocks are applied to the inputs of the 8×8 multiplier blocks. Fig5 represents the block level view of the 16×16 multiplier.
Fig5: Block level representation of 16×16 multiplier block

32×32 Vedic Multiplier
The 32×32 multiplier is made using four 16×16 multiplier blocks. The operands are 32 bits wide (n = 32), whereas the result is 64 bits wide. Each input is broken into smaller blocks of size n/2 = 16. The resulting 16-bit data blocks are applied to the inputs of the 16×16 multiplier blocks. Fig6 represents the block level view of the 32×32 multiplier.
Fig6: Block level representation of 32×32 multiplier block


ANALYSIS OF ADDER
Adders play the key role in multiplication. As the number of input bits in the addition increases, the delay of the system also increases. To reduce the delay in the adder circuit, a comparative study of four popular adders with various input widths is carried out. The adders are implemented with basic gates.

Carry Save Adder
The carry save adder is best suited to adding three or more operands. It is simply a set of full adders and half adders. In a conventional n-bit implementation, the carry is rippled to the next stage of the same row, so the next lower stage must wait, even though its inputs are ready, until the sum and carry outputs arrive from the stage above. This induces delay. The delay can be reduced if the carry out is passed diagonally to the next lower stage instead of rippling to the next stage of the same row. The carry is thus not propagated within a row but saved until the last row, where the final result is obtained. The carry still ripples in that bottom row, but the total delay due to rippling is significantly reduced. Fig 7 represents the carry save adder structure for a 4-bit addition.
Fig7: Block level representation of 4bit carry save adder
The intermediate carry and sum are generated by the half adders and given to the full adders to perform the addition.
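The carry-saving idea can be shown on three operands. The following is a minimal Python sketch, not the 4-bit structure of Fig7: the names `carry_save_step` and `add_three` are assumptions for illustration, and the final `+` stands in for the last row, where the carry does ripple.

```python
def carry_save_step(x, y, z):
    """Reduce three operands to a sum vector and a carry vector.

    Every bit position acts as an independent full adder, so no carry
    travels between positions in this step.
    """
    sum_vector = x ^ y ^ z
    carry_vector = ((x & y) | (y & z) | (x & z)) << 1  # carries have weight 2
    return sum_vector, carry_vector


def add_three(x, y, z):
    """Add three unsigned operands with one carry-save step and one final addition."""
    sum_vector, carry_vector = carry_save_step(x, y, z)
    # Only this final addition needs carry propagation.
    return sum_vector + carry_vector
```

Because each carry-save step is a single full-adder delay regardless of operand width, several partial products can be reduced before any carry has to propagate.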

Carry Skip Adder
The carry skip adder reduces the delay as compared with the carry lookahead adder and the ripple carry adder. It divides the input word into blocks. Within each block, a ripple carry adder produces the sum bits and the carry bit. The carry skip adder reduces the delay due to carry computation by skipping over groups of consecutive adder stages. If the two input bits differ at every position of a block, the sum is generated but no carry is computed within the block, and the incoming carry is passed directly to the next block. If both inputs to a block are zero, the incoming carry is not propagated. Fig 8 represents the general block representation of the carry skip adder.
Fig 8: General block diagram of the carry skip adder
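The block-wise skip behaviour can be modelled as follows. This is a sketch in Python under stated assumptions: the names `ripple_block` and `carry_skip_add`, the 16-bit width and the 4-bit block size are illustrative choices, and the width is assumed to be a multiple of the block size.

```python
def ripple_block(a, b, cin, width):
    """Ripple carry addition inside one block; returns (sum_bits, carry_out)."""
    total, carry = 0, cin
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        total |= (ai ^ bi ^ carry) << i
        carry = (ai & bi) | (carry & (ai ^ bi))
    return total, carry


def carry_skip_add(a, b, cin=0, width=16, block=4):
    """Add two unsigned values block by block; returns (sum, carry_out)."""
    mask = (1 << block) - 1
    result, carry = 0, cin
    for low in range(0, width, block):
        a_blk, b_blk = (a >> low) & mask, (b >> low) & mask
        sum_bits, ripple_cout = ripple_block(a_blk, b_blk, carry, block)
        # If the inputs differ at every position, the block only propagates:
        # the skip path forwards the incoming carry without waiting for
        # the ripple chain.
        propagate_all = (a_blk ^ b_blk) == mask
        carry = carry if propagate_all else ripple_cout
        result |= sum_bits << low
    return result, carry
```

In software both paths give the same carry value; the benefit in hardware is that the skip path is shorter than the ripple chain it bypasses.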

Ripple Carry Adder
The ripple carry adder is made using full adders only. The N-bit data inputs are applied directly and in parallel to the N full adders, but each sum bit settles only after the carry from the previous stage arrives, because the carry output from each stage is rippled to the next stage. If there is no incoming carry, the first full adder can be replaced with a half adder. Fig9 shows the general block diagram of the 4-bit ripple carry adder.
Fig 9: Block diagram of 4bit ripple carry adder
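A chain of full adders can be modelled in a few lines. This is a minimal Python sketch; the names `full_adder` and `ripple_carry_add` are assumptions made for illustration.

```python
def full_adder(a, b, cin):
    """Return (sum, carry_out) for three 1-bit inputs."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_carry_add(a, b, cin=0, width=4):
    """Add two unsigned values of the given width; returns (sum, carry_out)."""
    total, carry = 0, cin
    for i in range(width):
        # Stage i cannot finish until the carry from stage i-1 is known.
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total, carry
```

The loop makes the delay behaviour visible: the carry variable passes through every stage in turn, so the worst-case delay grows linearly with the width.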

Carry Lookahead Adder
The significant delay of the ripple carry adder comes from waiting for the carry at every stage. It can be minimized by computing the carries up front from the inputs. The carry lookahead adder is based on the generate and propagate approach. Fig10 represents the logic equations used for the 4-bit carry lookahead adder.
Fig10: logic equation for 4bit carry lookahead adder
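The generate and propagate approach can be written out for four bits. The sketch below uses the standard textbook lookahead equations, with generate g_i = a_i AND b_i and propagate p_i = a_i XOR b_i; it is a Python illustration under that assumption, not a transcription of Fig10, and the function name `carry_lookahead_add4` is hypothetical.

```python
def carry_lookahead_add4(a, b, c0=0):
    """Add two 4-bit unsigned values; returns (sum, carry_out)."""
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(4)]  # generate
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(4)]  # propagate

    # Every carry is expressed directly in terms of g, p and c0,
    # so no carry waits for the previous stage.
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))

    carries = [c0, c1, c2, c3]
    total = sum((p[i] ^ carries[i]) << i for i in range(4))
    return total, c4
```

Each carry expression is a two-level AND-OR function of the inputs, which is why the carry delay does not grow with the bit position within the 4-bit group.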

Delay Analysis of Adders
The above adders are designed in VHDL for various input widths and implemented on a Xilinx Virtex-6 XC6VLX550T FPGA, and their speed, area and power figures are computed. The figure below shows the delay comparison for all the adders.
Fig 10: Delay comparison of adders
From the above comparison it is clear that the carry lookahead adder has the minimum delay, so the carry lookahead adder is used for the addition in the multiply accumulate operation.


MULTIPLY ACCUMULATE UNIT
The multiply accumulate unit performs the multiplication and addition operations. For multiplication the proposed method uses the Vedic multiplier, and for addition it uses the carry lookahead adder. Fig11 shows the block level representation of the multiply accumulate unit.
Fig11: Architecture of MAC unit
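The behaviour of the unit can be summarized with a small model. This is a behavioural sketch in Python, not the paper's VHDL: the class name `MacUnit` and method `step` are assumptions, Python's `*` and `+` stand in for the Vedic multiplier and the 64-bit carry lookahead adder, and the Result and Carry outputs follow the pin description in the introduction.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


class MacUnit:
    """Behavioural model of a 32x32-bit multiply accumulate unit."""

    def __init__(self):
        self.accumulator = 0  # 64-bit accumulator register

    def step(self, operand_1, operand_2):
        """Multiply the two 32-bit operands and add the product to the accumulator.

        Returns (result, carry): the 64-bit Result and the 1-bit Carry
        out of the 64-bit addition.
        """
        product = (operand_1 & MASK32) * (operand_2 & MASK32)  # 64-bit product
        total = self.accumulator + product
        self.accumulator = total & MASK64
        carry = total >> 64
        return self.accumulator, carry
```

For example, two successive calls `step(3, 4)` and `step(5, 6)` on a fresh unit return `(12, 0)` and then `(42, 0)`.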
The multiply accumulate unit combines the minimum-delay adder with a Vedic mathematics based multiplier. The use of the Vedic multiplier significantly reduces the number of adders required for the multiplier implementation. Fig 12 gives the number of additions and multiplications required to implement the multiplier using the conventional and the Vedic approach.
Fig 12: Multipliers and additions required for implementation

SIMULATION AND IMPLEMENTATION
Register transfer level modeling of the 32×32-bit multiply accumulate unit is done in VHDL and implemented on a Xilinx Virtex-6 XC6VLX550T FPGA using the Xilinx ISE tool kit. The maximum combinational path delay of the MAC unit is 21.849 ns. Fig 13 shows the implementation summary of the 32×32 MAC unit on the FPGA.
Fig 13: Combinational delay summary from the Xilinx ISE tool

CONCLUSIONS
A low-power, area-efficient 32×32-bit MAC optimized for combinational path delay is designed and implemented on an FPGA. For an efficient implementation, various adders are studied and compared, and the carry lookahead adder is used in the final implementation. A future enhancement of the Vedic multiplier is to pipeline the design to achieve higher throughput.
ACKNOWLEDGMENT
This research work was carried out as a continuation of the thesis work for the Master of Technology in VLSI Design and Embedded Systems at ER&DCI Institute of Technology, Thiruvananthapuram.
REFERENCES
[1] T.T Hoang, M. Sjalander, Double throughput Multiply Accumulate Unit for Flexcore processor enhancements, IEEE International Symposium on Parallel and Distributed Processing, pp. 1-4, 2009.
[2] Jagadguru Swami Sri Bharati Krishna Trithaji Maharaja, Vedic Mathematics, Motilal Banarsidas Publishers Pvt. Ltd, Delhi, 2009.
[3] A.D. Booth, A signed Binary multiplication Technique, Qrt. J. Mech. App. Math., Vol. 4, no. 2, pp. 236-240, 1951.
[4] Akhter S., VHDL implementation of a fast NxN multiplier based on Vedic mathematic, pp. 472-475, ECCTD 2007.
[5] Shamsiah Suhaili and Othman Sidek, Design and implementation of reconfigurable ALU on FPGA, 3rd International Conference on Electrical & Computer Engineering ICECE 2004, 28-30 December 2004, Dhaka, Bangladesh, pp. 47-56.
[6] Tam Anh Chu, Booth multiplier with low power high performance Input circuitry, US patent 6393454 B1, May 21 2002.
[7] Frank Marzona, Vedic mathematics: The Scientific Heritage of Ancient India, Proceedings of the Pennsylvania State System of Higher Education Mathematics Association Conference, held at Mansfield University, 1997, volume one.
[8] Aneesh R, Jijuk, Sreekumari B, Design and Implementation of Bluetooth MAC core with RFCOMM on FPGA, proceedings of INDICON 2012.
[9] Aneesh R, Jijuk, Design of FPGA based 8bit RISC controller IP core using VHDL, proceedings of INDICON 2012.