Design of modulo 2n-1 multiplier Based on Radix-8 Booth Algorithm using Residue Number System

DOI : 10.17577/IJERTV1IS3181

Download Full-Text PDF Cite this Publication

Text Only Version

Design of modulo 2n-1 multiplier Based on Radix-8 Booth Algorithm using Residue Number System


M.Tech Student, Dept. of ECE Vaagdevi Institute of Technology & Science,Proddatur, Kadapa (Dt.), A.P.

Assistant Professor, Dept. of ECE Vagdevi Institute of Technology &Science,Proddatur,Kadapa(Dt) A.P.


HOD, Dept. of ECE

Vagdevi Institute of Technology & Science,Proddatur,Kadapa(Dt) A.P.

Modular arithmetic operations (inversion, multiplication and exponentiation) are used in several cryptography applications. A special moduli set of forms {2n-1, 2n, 2n +1} are preferred over the generic moduli due to the ease of hardware implementation of modulo arithmetic functions as well as system-level inter-modulo operations, such as RNS-to-binary conversion and sign detections. With this precept, a family of radix-8 Booth encoded modulo 2n-1 multipliers, with delay adaptable to the RNS multiplier delay, is proposed.

The first-ever family of low-area and low-power radix-8 Booth encoded modulo 2n-

1 multiplier whose delay can be tuned to match the RNS delay closely has been proposed in this paper. A CSA tree with end- around-carry addition for accumulation of redundant partial products and a Sklansky parallel-prefix structure has also been implemented.

Index Terms Public Key Cryptographic (PKC),Booth algorithm, modulo arithmetic, multiplier, residue number system (RNS)

  1. RIVEST, Shamir, and Adleman (RSA) and elliptic curve cryptography (ECC) are two of the most well established and widely used public key cryptographic (PKC) algorithms.

    The encryption and decryption of these PKC algorithms are performed by repeated modulo multiplications [1][3].

    These multiplications differ from those encountered in signal processing and general computing applications in their sheer operand size. Key sizes in the range of 512~1024 bits and 160~512 bits are typical in RSA and ECC, respectively [4][7]. Hence, the long carry propagation of large integer multiplication is the bottleneck in hardware implementation of PKC. The residue number system (RNS) has emerged as a promising alternative number representation for the design of faster and low power multipliers owing to its merit to distribute a long integer multiplication into several shorter and independent modulo multiplications

    Modulo 2n+1, 2n , 2n-1 addition and multiplication are the crucial operations in the IDEA algorithm and also modulo 2n+1 arithmetic operations are used in Fermat number transform computation. Moduli choices of the forms {2n+1, 2n, 2n-1} have received significant attention because they offer very efficient circuits when considering the area * time2 product and efficient converters from and to the binary system. Therefore, designing efficient modulo 2n-1 multipliers is an interesting issue. Modulo 2n-

    1 multiplication is used extensively in Residue Number System (RNS) based Digital Signal Processing (DSP) and cryptography units.

    The modulo 2n 1 multiplication of two numbers n-bit each follows 3 steps: production of n2 partial products modulo 2n 1 reduction of this n2 partial products 2n 1 into two numbers of n bits addition of these two numbers modulo 2n 1 with the preceding adder.

  2. The radix-8 Booth encoding reduces the number of partial products to

    which is more aggressive than the radix-4 Booth encoding. However, in the radix-8 Booth encoded modulo 2n-1 multiplication, not all modulo-reduced partial products can be generated using the bitwise circular-left-

    2 -1

    shift operation and bitwise inversion. Particularly, the hard multiple |+3X| n is to be generated by an n -bit end-around-carry

    addition of X and 2X.

    Recoding of binary numbers was first hinted at by Booth four decades ago. MacSorley proposed a modification of Booths algorithm a decade after. The modified Booths algorithm (radix-4 recoding) starts by appending a zero to the right of x0 (multiplier LSB). Triplets are taken

    beginning at position x-1 and continuing to the MSB with one bit overlapping between adjacent triplets. If the number of bits in X (excluding x-1) is odd, the sign (MSB) is extended one position to ensure that the last triplet contains 3 bits. In every step we will get a signed digit that will multiply the multiplicand to generate a partial product entering the Wallace reduction tree. The meaning of each triplet can be seen in table I:

    Table I: Radix-4 encoding

    This recoding scheme applied to a parallel multiplier halves the number of partial products so the multiplication time and the hardware requirements decrease. This gain is possible at the expense of somewhat more complex operations in every step. However, that the required multiples of Y {0, Y,

    2Y} are available by merely shifting Y to the left. Although the algorithms and operations specified above seem rather arbitrary at the first sight, they are based on meaningful number systems. If one focuses on what modifications are being done to X, then one may arrive at a different representation for the 2s-complement number X as shown in figure 2:

    Figure 2: Signed-digit representation

    where digits Di are one of -2, -1, 0, 1, 2found in the table of figure 1, based on the value of triplets in the form (xi+2 xi+1 xi). Here we have a signed digit representation of

    X in radix-4. Signed-digit number representation allows redundancy to exist. Thanks to this we can make a parallel recodification that is, all triplets are recoded at the same time, and the value of each triplet is independent from the adjacent triplets.

    Radix-8 recoding applies the same algorithm as radix-4, but now we take quartets of bits instead of triplets. Each quartet is codified as a signed-digit using the table II:

    Here we have an odd multiple of the multiplicand, 3Y, which is not immediately available. To generate it we need to perform this previous add: 2Y+Y=3Y. But we are designing a multiplier for specific purpose and thereby the multiplicand belongs to a previously known set of numbers which are stored in a memory chip. We have tried to take advantage of this fact, to ease the bottleneck of the radix-8 architecture, that is, the generation of 3Y. In this manner we try to attain a better overall multiplication time, or at least comparable to the time we could obtain using radix-4 architecture (with the additional advantage of using a less number of transistors). To generate 3Y with 21-bit words we only have to add 2Y+Y, that is, to add the number with the same number shifted one position to the left, getting in this way a new 23-bit word, as shown in figure 3:

    Figure 3: 21-bit previous add

    In fact, only a 21-bit adder is needed to generate the bit positions from z1 to z21. Bits z0 and z22 are directly known because z0=y0 and z22=y20 (sign bit of the 2s-complement number; 3Y and Y have the same sign). If in the memory from where we take the numbers just two additional bits are stored together with each value of the set of numbers, we can decompose the previous add in three shorter adds that can be done in parallel. In this way, the delay is the same of a 7-bit adder:

    Bits which are going to be stored are the two intermediate carry signals c8 and c15. Before each word of the set of numbers is stored in the memory, the value of its intermediate carries has to be obtained and stored beside it. In this way, they are immediately available when it is required to perform the previous add to get the multiple 3Y of one of the numbers that belongs to the set.

    The radix-4 Booth encoding tchnique is most prevalent as all required modulo reduced partial products can be generated by circular-left-shift operation and bit-wise complementation, thereby minimizing the hardware complexity. The reduction in the number of partial products is determined by the radix of the Booth encoding technique employed. Reduction of partial products by more than half is possible with higher radix Booth encoding. Similar to the radix-4 algorithm, the radix-8 Booth encoding algorithm can be considered as a digit set conversion of four consecutive multiplier bits y3i-1,y3i,y3i+1,y3i+2 , yi

    {0, 1}from Y, to di , di [4, 4], for i = 0, 1,


    The digit set conversion is given by


    where y-1, yn, yn+1 and yn+2 are zero. For the radix-8 Booth encoded modulo 2n-1 multiplier, the required modulo-reduced partial products are shown in Table III.

    From Table 3, the necessary modulo-reduced partial products except ±3X can be generated by circular-left-shift operat-ion and/or bit- wise complemen-tation of the multiplicand, X. The generation of±3X requires a large word- length adder which increases the critical path delay of the multiplier significantly.


    carry propagation length to n bits by representing the hard multiple as a sum and a redundant end-around-carry bit pair. The resultant [n/3] +1 end-around-carry bits in the partial product matrix may lead to a marginal increase in the CSA tree depth and consequently, may aggravate the delay of the CSA tree. In which case, it is not sufficient to reduce the carry propagation length to merely n bits using the above technique.

    Since the absolute difference between the noncritical modulo 2n-1 multiplier delay and the system critical path delay depends on the degree of imbalance in the moduli word- length of a RNS, the delays cannot be equalized by arbitrarily fixing the carry propagation length to n bits. Instead, we propose to accomplish the adaptive delay equalization by representing the hard multiple in a partially-redundant form [48].

    A. Generation of Partially-Redundant Hard Multiple

    Let X


    2 -1

    and 2X


    2 -1

    be added by a group

    results also confirm that the proposed method helps pathologists distinguish exact lesion sizes and regions

To ensure that the radix-8 Booth encoded modulo 2n-1multiplier does not constitute the system critical path of a high- DR moduli set based RNS multiplier, the carry propagation length

in the hard multiple generation should not exceed n bits. To this end, the carry

of M (=n/k) k -bit RCAs such that there is no carry propagation between


the adders. Fig. 2 shows this addition for n=8 and k =4,where the sum and carry-out bits from the RCA block j are represented as S j andcij for i [0,k-1] and j [0,M- 1],respectively. In Fig. 2, the carry-out of RCA 0,C30 , is not propagated to the carry input of RCA 1 but preserved as one of the partial product bits to be accumulated in the CSA tree. The binary weight of the carry-out C31of RCA 1 has,however, exceeded the maximum range of the modulus and has to be modulo reduced before it can be accumulated by the CSA tree.

By Property 2, the binary weight of C31 can be reduced from 28 to 20 . Thus, C31 is inserted at the least significant bit (lsb)position in Fig.

6. It should be stressed that the carry-out C31 is a partial carry propagated through only k most significant FAs and hence, is different from the end-around-carry bit in the modulo 2n-1 addition of X and 2X , i.e., c7 of Fig.

5.From Fig. 6, the partially-redundant form of

propagation through the HAs in Fig. 1 can be

eliminated by making the end-around-carry



2 -1

is given by the partial-sum and

bit c7

a partial product bit to be accumulated

partial-carry pair(S,C)

in the CSA tree. This technique reduces the

Fig. 5. Generation of partially-redundant

Fig. 7. Generation of partially-redundant simple multiples.



using k-bit RCAs

2 -1

Fig.6. Generation of partially-redundant B+3X n


Since modulo 2n-1 negation is equivalent to

Fig. 8. Modulo-reduced partial products and

2 -1

CC for X Y 8

The addends for the computation of the

bitwise complementation by Property 1, the

biased hard multiple, B+3X


2 -1

in a

negative hard multiple in a partially-

partially-redundant form are X


2 -1

,2X 2n-

redundant -3X

as follows:


2 -1

=( , ), is computed

1and B or equivalently S,C and B. Since is chosen to be a binary word that has logic ones at bit positions 2kj and logic zeros at other bit

2 -1

positions, |B+3X| n can be generated by

To avoid having many long strings of ones in , an appropriate bias B, , is added to the hard multiple such that both

C and are sparse [48]. The value of B is chosen as

simple XNOR and OR operations on the bits of S and C at bit positions 2kj. Fig. 6 illustrates how these bits in the sum and the carry outputs of RCA 0 and RCA 1 are modified.

2 -1

In general |B+3X| n , is given by the partial-sum and partial-carry pair

(BS,BC)such that


partially-redundant form. Fig. 7 shows the

2 -1

2 -1

2 -1

biased simple multiples, |B+0| n , |B+X| n ,

2 -1

|B+2X| n

, and |B+4X| n

represented in a


For j=0, 1 M-1. Let

partially-redundant form for n=8. From Fig. 10, it can be seen that the generation of these biased multiples involves only shift and selective complementation of the multiplicand bits without additional hardware overhead.

The i-th partial product of a radix-8 Booth encoded modulo 2n-1 multiplier is given by

2 -1

PPi= |23i ·di · X n


To include the bias B necessary for partially- redundant representation of PPi , (12) is modified to

2 -1

PPi =|23i (B+di · X) n


Using Property 3, the modulo 2n-1 multiplication by 23i in (13) is efficiently implemented as bitwise circular-left-shift of the biased multiple, (B+di · X). For n=8 and k =4 ,Fig. 8 illustrates the partial product

2 -1

matrix of X·Y) 8 with

Fig. 9. Modulo-reduced partial product generation.

2 -1

2 -1

It can be easily verified that the sum of (BS, BC) and() modulo 2n-1 is |2B| n . Therefore,()represents the partially-redundant form of B-3X| n .

B. Generation of Partially-Redundant Simple Multiples

The proposed technique represents the hard multiple in a biased partially- redundant form. Since the occurrences of

the hard multiple cannot be predicted at design time, all multiples must be uniformly represented. Similar to the hard multiple, all other Booth encoded multiples listed in Table I must also be biased and generated in a

Fig.10. (a) Bit-slice of Booth Encoder (BE).

(b) Bit-slice of Booth Selector (BS).

(n/3+1) partial products in partially- redundant representation.Each PPi consists of an n-bit vector, ppi7—ppi1 ppi0 and a vector of n/k=2 redundant carry bits, qi1 and qi0 . Since qi0 and qi1 are the carry-out bits of the RCAs, they are displaced by k -bit positions for a given PPi. The bits, qij is displaced circularly to the left of q(i-1) j by 3 bits, i.e., q20 and q21 are displaced circularly to the left of q10 and q11 by 3 bits, respectively and q10 and q11 are in turn displaced to the left of q00 and q01 by 3 bits, respectively. The last partial product in

Fig. 8 is the Compensation Constant (CC) for the bias introduced in the partially-redundant representation.

The generation of the modulo- reduced partial products, PP0, PP1 and PP2, in a partially-redundant representation using Booth Encoder (BE) and Booth Selector (BS) blocks are illustated in Fig. 8. The BE block produces a signed one-hot encoded digit from adjacent overlapping multiplier bits as illustrated in Fig. 10(a). The signed one-hot encoded digit is then used to select the correct multiple to generate PPi. A bit-slice of the radix-8 BS for the partial product bit, PPij is shown in Fig. 10(b).

Fig. 11. Modulo-reduced partial product accumulation.

As the bit positions of qij do not overlap, as shown in Fig. 8, they can be merged into a single partial product for accumu-lation.The merged partial products , PPi and the constant CC are Accumulated using a CSA tree with end-around-carry addition at each CSA level and a final two-operand modulo 2n-1 adder as shown in Fig. 11.

A family of low-area and low-power modulo 2n-1 multipliers with variable delay to achieve delay balance amongst individual modulo channels in a high-DR RNS multiplier was proposed. The delay of the proposed multiplier is controlled by the word- length of the small parallel RCAs that are used to compute the requisite hard multiple of the radix-8 Booth encoded multiplication in a partially-redundant form. From synthesis results constrained by the critical channel delay of the RNS, it was shown that the proposed multiplier simultaneously reduces the area as well as the power dissipation of the radix-4 Booth encoded multiplier for n28, which is the useful dynamic range of RNS multiplication to meet the minimum key-size requirements of ECC and RSA algorithms.

[1] R. Rivest, A. Shamir, and L.Adle-man, A method for obtaining digital signatures and public key cryptosy-stems, Commun. ACM, vol. 21, no.2, pp. 120126, Feb. 1978.[2] V. Miller, Use of elliptic curves in cryptography, in Proc. Advancesin Cryptology-CRYPTO85, Lecture Notes in Computer Science, 1986,vol. 218, pp. 417 426.

  1. N. Koblitz, Elliptic curve cryptosys- tems, Mathemat. of Comput. , vol.48, no. 177, pp. 203209, Jan. 1987.

  2. National Institute of Standards and

    Technology. Available: http://

  3. A. K. Lenstra and E. R. Verheul, Selecting cryptographic key sizes,J. Cryptol., vol. 14, no. 4, pp. 255293, Aug. 2001.

  4. C. McIvor, M. McLoone, and J. V.

    McCanny, Modified Montgomery modular multiplication and RSA exponentiation techniques, IEE Proc. Comput. and Dig. Techniq., vol. 151, no. 6, pp. 402408, Nov.2004.

  5. C. McIvor,M.McLoone, and J. V. McCanny, Hardware elliptic curve cryptographic processors over , IEEE Trans. Circuits Syst. I,Reg. Papers, vol. 53, no. 9, pp. 19461957, Sep. 2006.

  6. D. M. Schinianakis, A. P. Fournaris, H. E. Michail, A. P. Kakarountas,and T. Stoura itis, An RNS implementation of an Elliptic curve point multiplier, IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 56, no.6, pp. 12021213, Jun. 2009.

  7. J. C. Bajard and L. Imbert, A full RNS implementation of RSA, IEEE Trans. Comput. Brief Contributions , vol. 53, no. 6, pp. 769774, Jun.2004.

  8. H. Nozaki, M. Motoyama, A. Shimbo, and S. Kawamura, Implementation of RSA algorithm based on RNS Montgomery multiplication, in Proc. Workshop on Cryptographic Hardware and Embedded Systems,Paris, France, May 2001, pp. 364 376.

  9. T. Stoura itis and V. Paliouras, Considering the alternatives in lowpower design, IEEE Circuits Devices Mag., vol. 17, no. 4, pp. 2229,Jul. 2001.

  10. S. Pontarelli, G. C. Cardarilli, M. Re, and

    A. Salsano, Totally fault tolerant RNS based FIR filters, in Proc. 14th IEEE Int. On-Line Testing Symp., Rhodes, Greece, Jul. 2008, pp. 192194.

  11. Mr.K.Ramamohan Reddy,is currently doing post graduation in Vagdevi institute of Tech & Science,Proddatur,Kadapa(Dt)

    A.P with the specialization of VLSI [14]

    Mr.V.Ramesh,is an Assistant Professor in ECE dept. at Vagdevi institute of Tech&Science,Proddatur.He obtained his B.Tech degree in ECE from MeRITS,JNTUA,Udayagiri in 2007,M.Tech degree in Electronic Instrumentation and Communication Systems from S.V.University,Tirupati in 2009.He has 3years of teaching experience and his area of interest includes Electronics instrumentation and Antennas.He has published 4 papers in referred international journals and also 2papers at national level conferences.

Mr.C.Mahammed Aslam is the HOD,Dept.of ECE at Vagdevi institute of Tech&Science,Proddatur.He obtained his B.Tech degree in ECE from Dr.Babasaheb Ambedkar Maratwada University,Aurangabad in 1997,M.Tech degree in Digital Electronics and Computer Science from JNTU,Hyderabad in 2007.He ahs 10 years of teaching experience.He registered his Ph.D from JNTU,Anantapur in digital Image Processing.His area of interest is Microprocessor and EDC circuits. He has published 2 papers in international journals.

Leave a Reply