Area Efficient High Bit Rate Serial-serial Multiplier with 1’s Asynchronous Counters

DOI : 10.17577/IJERTV1IS7256

Download Full-Text PDF Cite this Publication

Text Only Version

Area Efficient High Bit Rate Serial-serial Multiplier with 1’s Asynchronous Counters


ECE Department Pragati Engineering College, Surampalem, India

In this Paper, a technique for Serial Multiplication completes the partial product Formation in n cycles is presented. The Proposed technique effectively forms partial product matrix in just n cycles for an nxn multiplication instead of at least 2n cycles in traditional Serial-Serial Multiplier. Here Serial- Serial algorithm is used. The newly developed design is capable of processing input data (GBs) without buffering and with reduced number of computational cycles .This multiplication of partial products by considering two series inputs among which one is starting from LSB and other from MSB. The architecture consists of a series of asynchronous 1s counters instead of 5:3 Counters, then critical path is limited to AND gate & D flip- flop. The proposed multiplier consists of a Serial-Serial data accumulator and CSA, is designed to carry out both signed and unsigned multiplication. We can able to extend area reduction of proposed multiplier by using CLA instead of RCA in CSA.

  1. Serial-serial multiplication techniques have been in use for many years . The proposed a structure based on 1s counters, which computed the N bits of the product of an NxN multiplication in N bits clock cycles, using N processing cells. Multipliers are the fundamental and essential building blocks of VLSI systems. The design and implementation approaches of multipliers contribute substantially to the area, speed and power consumption of computational intensive

    VLSI systems.Hardware implementation of a multiplication operation consists of three stages, specifically the generation of partial products (PPs), the reduction of partial products (PPs), and the final carry propagation addition. The partial products can be generated either in parallel or serially, depending on the target application and the availability of input data. The partial products are reduced by carry-save adders (CSA) using an array or tree structure. Carry propagation addition is inevitable when the number of partial products is reduced to two rows. This final adder can be a simple ripple carry adder (RCA) for low power or a carry look-ahead adder (CLA) for high speed. As the height of PP tree increases linearly with the word length of the multiplier, it aggravates the area, delay and power dissipation of the two subsequent stages.

    Therefore, it is highly desirable to reduce the number of partial products before the CSA Stage. In the proposed method the partial product formation is revamped using an algorithm named as serial-serial algorithm which is explained in the following sections. The generated partial products are passed to a group of asynchronous 1s counters for accumulation. The counters will count the number of ones in the partial products which is used for addition. In the following sections an approach to the design of serial multiplier that is capable of processing input data without input buffering and with reduced total number of computational cycles is proposed.


    In a serial-serial multiplier both the operands are loaded in a bit-serial fashion, reducing the data input pads to two serial multipliers are popular for their low area and power. Bit- serial processing can result in efficient communications, both within and between VLSI chips, because of the reduced number of interconnections required. Serial multiplier designs which are particularly suitable for applications where input

    data are sequentially presented .The operating speeds are determined mainly by the Propagation delays along the critical path within the processing elements. A major advantage offered by bit serial processors is when the operands are available only one bit at a time, the processing speed can be improved using bit-serial arithmetic elements. The structures that use this approach can achieve moderate speeds with comparatively small area. The existing method of multiplication is the CSAS (carry save add shift) multiplier

    .CSAS architecture consists of a FA, D FF and an AND gate in the critical path for unsigned multiplication and an additional EXOR gate for signed multiplication. This architecture takes 2n cycles for the partial product formation.

    Parallel multipliers are popular as the size is less critical due to technology scaling. However due to the emerging development of the on-chip serial-link bus architectures serial-serial multipliers could find their potential roles in the new generation of SoCs and FPGAs. In the following sections, approach to the design of serial multiplier that is ca- pable of processing input data at Gb/s without input buffering and with reduced total number of computational cycles is proposed.


    The paper addresses an algorithm named serial algorithm that reduces the computation time of the partial products such that they can be formed in just n cycles for an n x n multiplier. According to this algorithm the partial product row and column structure is revamped. The figures (1) and

    (2) shows the partial product formation of the conventional and the proposed multiplier.

    Figure 1: Conventional PP Formation

    Figure 2: Proposed PP Formation

    The partial products so formed are counted column wise for the number of ones.

    Figure 3: Hardware architecture of a 3-bit 1s counter.

    The counters corresponding to the columns that have a 1 input are incremented. The counters can be clocked at high frequency and all the operands will be accumulated at the end of the clock. The final outputs of the counters need to be further reduced to only two rows of partial products by a CSA tree.


    Accumulation is an integral part of serial multiplier design. A typical accumulator is simply an adder that successively adds the current input with the value stored in its internal register.

    Generally, the adder can be a simple RCA but the speed of accumulation is limited by the carry propagation chain. The accumulation can be speed up by using a CSA with two registers to store the intermediate sum and carry vectors, but a more complex fast vector merged adder is needed to add the final outputs of these registers. In either case, the basic functional unit is an FA cell. A new approach to serial accumulation of data by using asynchronous counters is suggested here which essentially count the number of 1s in respective input sequences (columns).

    Figure 4: Architecture of Accumulator

    For an 8 bit operands multiplication the counter counts the number of ones in the columns and these count values are positioned and arranged for addition.


    This section describes an unsigned multiplier in which the operands are fed serially one starting from LSB and the other from MSB Using this feeding sequence and the proposed counter based accumulation, it takes only n cycles to complete the entire partial product generation for nxn multiplication.

    The product of two unsigned numbers X and Y can be written as



    Where xi and yj are the ith and jth bits of X and Y with bit 0 being the LSB.

    Reversing the sequence of index I and rearranging the above equation can be written as




    The partial product row PPr can be generated in rth cycle If X is fed rom MSB (bit n-1) first and Y is fed from LSB first (bit 0), then in the rth cycle ,PPrC is a partial product bit generated by the current input bits xn-r-1 and yr,PPrL are partial product bits of the current input bit, yr and each of the preceding input bits of X, ie xn-k-1, for k=0,1,.r-1.and PPrR are the partial product bits of the input bit xn-r-1,and each of the preceding input bits of Y, i.e Yr-k-1 for k = 0,1,2,.r-1. By appropriately sequencing the input bits of X and Y in to a shift register, one PP (PPr) in each cycle can be generated. As a result P can be obtained in n cycles.

    The figure (5) illustrates the PP generation of an 8X8 multiplier for unsigned numbers using the serial algorithm.

    According to the algorithm, the partial products formation sequence is as shown in the figure (2) like a pyramid. From the figure it is clear that in the first cycle of operation the pp X7Y0 is formed. In the next cycle x7 is shifted one position to the left and y0 is shifted one position to the right. At the same time the x6 and y1 is also generated. In this way at the next cycle x7 and x6 will be shifted again one position to the

    left and y0 and y1 one position to the right. Similarly the process continues till the last row of partial products is being formed. Once the partial products are formed they are accumulated using the asynchronous ones counter explained in the last section.

    Figure 5: Proposed Architecture for 8×8 Serial Serial Unsigned Multiplications


The above said architecture can be modified such that the same could be used for the multiplication of signed numbers. The signed multiplication is carried out with Baugh Wooley algorithm. By combining the Baugh Wooley and the serial algorithm, the architecture could be modified to perform signed multiplication. Using Baugh Wooley algorithm the equation for the multiplication of two signed numbers in twos complement form can be written as


Using the proposed architecture the multiplication of the signed numbers can be written as

Cycle n-1. The generation of

has to be delayed until cycle n-1. The remaining terms

The difference between 3 and 4 can be written as


can be computed during the initial n- 1cycles. Hence, the difference can be corrected in the CSA tree. It is trivial that a n- 1-bit shift register, a NAND gate and several FAs are required for adding .


+ 2n – 22n-1 (5)

The above expression could be simplified as



To extend the unsigned multiplier architecture for signed multiplication without introducing a high over-head, the difference expressed in 6must be simplified. Since n-2 2i =


n-2 2j ,the following summation terms embedded in (6) can

be simplified by the closed form expression of a geometric progression





The difference is added to the proposed architecture such that the architecture can be used for signed multiplication. Thus we can write

P = P + (8)

In the proposed PP generation method, Yn-1 arrives only in

Figure 6: Proposed Architecture for 8×8 Serial Serial Signed Multiplication

The architecture of the proposed 2s complement serial- serial multiplier is depicted in the figure (6). A control input is required to latch Xn-1 in the first clock cycle to generate

Serially in n-1 cycles. The bits of to be added in the CSA tree is shown in the figure.


    Look ahead carry algorithm speed up the operation to perform addition, because in this algorithm carry for the next stages is calculated in advance based on the input signals. By using this CLA, the carry propagation time is

    reduced by using a tree like circuit to compute the carry rapidly. The CLA exploits the fact that the carry generator by a bit position depends on the 3 inputs to that position.

    If X & Y are two inputs then

    if X=Y=1 a carry generated independently of the carry from the previous bit position

    X=Y=0 no carry generated.

    X Y a carry generated if and only if the previous bit- position generates a carry.

    The multiplier architectures area can be reduced with CLA than RCA used in the carry propagation addition stage of CSA.


    Fig: Simulation results of 8 x 8 unsigned multiplication using RCA.

    Fig: Simulation results of 8 x 8 signed multiplication using RCA.

    Fig: Simulation results of 8 x 8 unsigned multiplication using CLA.

    Fig: Simulation results of 8 x 8 signed multiplication

    using CLA.

    In this section the results of the proposed work is compared with the existing carry save adder multiplier using RCA and CLA .The architecture is implemented in VHDL and simulated using XYLINX.


    Bit Rate

    Operating Mode

    Number of


    Using RCA

    Using CLA


    8 x 8

    Un Signed




    8 x 8




    Table 1: Proposed serial-serial multiplication Area comparison

    International Journal of Engineering Research & Technology (IJERT)

    ISSN: 2278-0181

    Vol. 1 Issue 7, September – 2012

    The proposed work can be made area efficient by using CLA for PP addition than RCA. From the synthesis report which is shown in the above table, the no. of slices are reduced by using CLA than RCA.


    In this paper a new method of computing serial- serial multiplication is introduced by using low complexity asynchronous counters. By exploiting the relationship among the bits of a partial product matrix it is possible to generate all the rows serially in just n cycles for an n x n multiplication. Employing counters to count no of ones in each column allows the partial products bits to be generated on-the-fly and partially accumulated in place with a critical path delay of only an AND gate and a DFF . The counter-based accumulation reduces the partial product height logarithmically and makes it possible to achieve an effective reduction rate. The proposed method outperforms many serial-serial and serial-parallel multipliers in speed. This approach has clear advantage of low I/O requirement and hence is most suitable for complex SOCs, advanced FPGAs and high speed bit serial applications.


    1. P. Ienne and M. A Viredaz, Bit-serial multipliers and squarers, IEEE Trans. Comput., vol. 43, no. 12, pp. 14451450, Dec. 1994.

    2. Aggoun, A. Ashur, and M. K. Ibrahim, Area- time efficient serial-serial multipliers, in Proc. IEEE Conf. Circuits Syst. (ISCAS), Geneva, Switzerland, 2000, pp. 585588.

    3. A high bit rate serial-serial multiplier [online].

    4. R. Gnanasekaran, On a bit-serial input and bit-serial output multiplier, IEEE Trans. Comput., vol. C-32, no. 9, pp. 878880, 1983.

    5. O. Nibouche, A. Bouridarie, and M. Nibouche, New architectures for serial-serial multiplication, in Proc. IEEE Conf. Circuits Syst. (ISCAS), Sydney, Australia, 2001, vol. 2, pp. 705708

Leave a Reply