 Open Access
 Total Downloads : 2087
 Authors : K.Ramamohan Reddy, V.Ramesh, C.Md.Aslam
 Paper ID : IJERTV1IS3181
 Volume & Issue : Volume 01, Issue 03 (May 2012)
 Published (First Online): 30052012
 ISSN (Online) : 22780181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Design of modulo 2^{n}1 multiplier Based on Radix8 Booth Algorithm using Residue Number System
K.RAMAMOHAN REDDY V.RAMESH
M.Tech Student, Dept. of ECE Vaagdevi Institute of Technology & Science,Proddatur, Kadapa (Dt.), A.P.
Assistant Professor, Dept. of ECE Vagdevi Institute of Technology &Science,Proddatur,Kadapa(Dt) A.P.
C.Md.ASLAM
HOD, Dept. of ECE
Vagdevi Institute of Technology & Science,Proddatur,Kadapa(Dt) A.P.
Modular arithmetic operations (inversion, multiplication and exponentiation) are used in several cryptography applications. A special moduli set of forms {2n1, 2n, 2n +1} are preferred over the generic moduli due to the ease of hardware implementation of modulo arithmetic functions as well as systemlevel intermodulo operations, such as RNStobinary conversion and sign detections. With this precept, a family of radix8 Booth encoded modulo 2n1 multipliers, with delay adaptable to the RNS multiplier delay, is proposed.
The firstever family of lowarea and lowpower radix8 Booth encoded modulo 2n
1 multiplier whose delay can be tuned to match the RNS delay closely has been proposed in this paper. A CSA tree with end aroundcarry addition for accumulation of redundant partial products and a Sklansky parallelprefix structure has also been implemented.
Index Terms Public Key Cryptographic (PKC),Booth algorithm, modulo arithmetic, multiplier, residue number system (RNS)

RIVEST, Shamir, and Adleman (RSA) and elliptic curve cryptography (ECC) are two of the most well established and widely used public key cryptographic (PKC) algorithms.
The encryption and decryption of these PKC algorithms are performed by repeated modulo multiplications [1][3].
These multiplications differ from those encountered in signal processing and general computing applications in their sheer operand size. Key sizes in the range of 512~1024 bits and 160~512 bits are typical in RSA and ECC, respectively [4][7]. Hence, the long carry propagation of large integer multiplication is the bottleneck in hardware implementation of PKC. The residue number system (RNS) has emerged as a promising alternative number representation for the design of faster and low power multipliers owing to its merit to distribute a long integer multiplication into several shorter and independent modulo multiplications
Modulo 2n+1, 2n , 2n1 addition and multiplication are the crucial operations in the IDEA algorithm and also modulo 2n+1 arithmetic operations are used in Fermat number transform computation. Moduli choices of the forms {2n+1, 2n, 2n1} have received significant attention because they offer very efficient circuits when considering the area * time2 product and efficient converters from and to the binary system. Therefore, designing efficient modulo 2n1 multipliers is an interesting issue. Modulo 2n
1 multiplication is used extensively in Residue Number System (RNS) based Digital Signal Processing (DSP) and cryptography units.
The modulo 2n 1 multiplication of two numbers nbit each follows 3 steps: production of n2 partial products modulo 2n 1 reduction of this n2 partial products 2n 1 into two numbers of n bits addition of these two numbers modulo 2n 1 with the preceding adder.

The radix8 Booth encoding reduces the number of partial products to
which is more aggressive than the radix4 Booth encoding. However, in the radix8 Booth encoded modulo 2n1 multiplication, not all moduloreduced partial products can be generated using the bitwise circularleft
2 1
shift operation and bitwise inversion. Particularly, the hard multiple +3X n is to be generated by an n bit endaroundcarry
addition of X and 2X.
Recoding of binary numbers was first hinted at by Booth four decades ago. MacSorley proposed a modification of Booths algorithm a decade after. The modified Booths algorithm (radix4 recoding) starts by appending a zero to the right of x0 (multiplier LSB). Triplets are taken
beginning at position x1 and continuing to the MSB with one bit overlapping between adjacent triplets. If the number of bits in X (excluding x1) is odd, the sign (MSB) is extended one position to ensure that the last triplet contains 3 bits. In every step we will get a signed digit that will multiply the multiplicand to generate a partial product entering the Wallace reduction tree. The meaning of each triplet can be seen in table I:
Table I: Radix4 encoding
This recoding scheme applied to a parallel multiplier halves the number of partial products so the multiplication time and the hardware requirements decrease. This gain is possible at the expense of somewhat more complex operations in every step. However, that the required multiples of Y {0, Y,
2Y} are available by merely shifting Y to the left. Although the algorithms and operations specified above seem rather arbitrary at the first sight, they are based on meaningful number systems. If one focuses on what modifications are being done to X, then one may arrive at a different representation for the 2scomplement number X as shown in figure 2:
Figure 2: Signeddigit representation
where digits Di are one of 2, 1, 0, 1, 2found in the table of figure 1, based on the value of triplets in the form (xi+2 xi+1 xi). Here we have a signed digit representation of
X in radix4. Signeddigit number representation allows redundancy to exist. Thanks to this we can make a parallel recodification that is, all triplets are recoded at the same time, and the value of each triplet is independent from the adjacent triplets.
Radix8 recoding applies the same algorithm as radix4, but now we take quartets of bits instead of triplets. Each quartet is codified as a signeddigit using the table II:
Here we have an odd multiple of the multiplicand, 3Y, which is not immediately available. To generate it we need to perform this previous add: 2Y+Y=3Y. But we are designing a multiplier for specific purpose and thereby the multiplicand belongs to a previously known set of numbers which are stored in a memory chip. We have tried to take advantage of this fact, to ease the bottleneck of the radix8 architecture, that is, the generation of 3Y. In this manner we try to attain a better overall multiplication time, or at least comparable to the time we could obtain using radix4 architecture (with the additional advantage of using a less number of transistors). To generate 3Y with 21bit words we only have to add 2Y+Y, that is, to add the number with the same number shifted one position to the left, getting in this way a new 23bit word, as shown in figure 3:
Figure 3: 21bit previous add
In fact, only a 21bit adder is needed to generate the bit positions from z1 to z21. Bits z0 and z22 are directly known because z0=y0 and z22=y20 (sign bit of the 2scomplement number; 3Y and Y have the same sign). If in the memory from where we take the numbers just two additional bits are stored together with each value of the set of numbers, we can decompose the previous add in three shorter adds that can be done in parallel. In this way, the delay is the same of a 7bit adder:
Bits which are going to be stored are the two intermediate carry signals c8 and c15. Before each word of the set of numbers is stored in the memory, the value of its intermediate carries has to be obtained and stored beside it. In this way, they are immediately available when it is required to perform the previous add to get the multiple 3Y of one of the numbers that belongs to the set.
The radix4 Booth encoding tchnique is most prevalent as all required modulo reduced partial products can be generated by circularleftshift operation and bitwise complementation, thereby minimizing the hardware complexity. The reduction in the number of partial products is determined by the radix of the Booth encoding technique employed. Reduction of partial products by more than half is possible with higher radix Booth encoding. Similar to the radix4 algorithm, the radix8 Booth encoding algorithm can be considered as a digit set conversion of four consecutive multiplier bits y3i1,y3i,y3i+1,y3i+2 , yi
{0, 1}from Y, to di , di [4, 4], for i = 0, 1,
/3.
The digit set conversion is given by
(1)
where y1, yn, yn+1 and yn+2 are zero. For the radix8 Booth encoded modulo 2n1 multiplier, the required moduloreduced partial products are shown in Table III.
From Table 3, the necessary moduloreduced partial products except Â±3X can be generated by circularleftshift operation and/or bit wise complementation of the multiplicand, X. The generation ofÂ±3X requires a large word length adder which increases the critical path delay of the multiplier significantly.
TABLE III: MODULOREDUCED PARTIAL PRODUCTS FOR RADIX8 BOOTH ENCODING
carry propagation length to n bits by representing the hard multiple as a sum and a redundant endaroundcarry bit pair. The resultant [n/3] +1 endaroundcarry bits in the partial product matrix may lead to a marginal increase in the CSA tree depth and consequently, may aggravate the delay of the CSA tree. In which case, it is not sufficient to reduce the carry propagation length to merely n bits using the above technique.
Since the absolute difference between the noncritical modulo 2n1 multiplier delay and the system critical path delay depends on the degree of imbalance in the moduli word length of a RNS, the delays cannot be equalized by arbitrarily fixing the carry propagation length to n bits. Instead, we propose to accomplish the adaptive delay equalization by representing the hard multiple in a partiallyredundant form [48].
A. Generation of PartiallyRedundant Hard Multiple
Let X
n
2 1
and 2X
n
2 1
be added by a group
results also confirm that the proposed method helps pathologists distinguish exact lesion sizes and regions
To ensure that the radix8 Booth encoded modulo 2n1multiplier does not constitute the system critical path of a high DR moduli set based RNS multiplier, the carry propagation length
in the hard multiple generation should not exceed n bits. To this end, the carry
of M (=n/k) k bit RCAs such that there is no carry propagation between
i
the adders. Fig. 2 shows this addition for n=8 and k =4,where the sum and carryout bits from the RCA block j are represented as S j andcij for i [0,k1] and j [0,M 1],respectively. In Fig. 2, the carryout of RCA 0,C30 , is not propagated to the carry input of RCA 1 but preserved as one of the partial product bits to be accumulated in the CSA tree. The binary weight of the carryout C31of RCA 1 has,however, exceeded the maximum range of the modulus and has to be modulo reduced before it can be accumulated by the CSA tree.
By Property 2, the binary weight of C31 can be reduced from 28 to 20 . Thus, C31 is inserted at the least significant bit (lsb)position in Fig.
6. It should be stressed that the carryout C31 is a partial carry propagated through only k most significant FAs and hence, is different from the endaroundcarry bit in the modulo 2n1 addition of X and 2X , i.e., c7 of Fig.
5.From Fig. 6, the partiallyredundant form of
propagation through the HAs in Fig. 1 can be
eliminated by making the endaroundcarry
+3X
n
2 1
is given by the partialsum and
bit c7
a partial product bit to be accumulated
partialcarry pair(S,C)
in the CSA tree. This technique reduces the
Fig. 5. Generation of partiallyredundant
Fig. 7. Generation of partiallyredundant simple multiples.
+3X
2n1
using kbit RCAs
2 1
Fig.6. Generation of partiallyredundant B+3X n
where
Since modulo 2n1 negation is equivalent to
Fig. 8. Moduloreduced partial products and
2 1
CC for X Y 8
The addends for the computation of the
bitwise complementation by Property 1, the
biased hard multiple, B+3X
n
2 1
in a
negative hard multiple in a partially
partiallyredundant form are X
n
2 1
,2X 2n
redundant 3X
as follows:
n
2 1
=( , ), is computed
1and B or equivalently S,C and B. Since is chosen to be a binary word that has logic ones at bit positions 2kj and logic zeros at other bit
2 1
positions, B+3X n can be generated by
To avoid having many long strings of ones in , an appropriate bias B, , is added to the hard multiple such that both
C and are sparse [48]. The value of B is chosen as
simple XNOR and OR operations on the bits of S and C at bit positions 2kj. Fig. 6 illustrates how these bits in the sum and the carry outputs of RCA 0 and RCA 1 are modified.
2 1
In general B+3X n , is given by the partialsum and partialcarry pair
(BS,BC)such that
where
partiallyredundant form. Fig. 7 shows the
2 1
2 1
2 1
biased simple multiples, B+0 n , B+X n ,
2 1
B+2X n
, and B+4X n
represented in a
and
For j=0, 1 M1. Let
partiallyredundant form for n=8. From Fig. 10, it can be seen that the generation of these biased multiples involves only shift and selective complementation of the multiplicand bits without additional hardware overhead.
The ith partial product of a radix8 Booth encoded modulo 2n1 multiplier is given by
2 1
PPi= 23i Â·di Â· X n
(12)
To include the bias B necessary for partially redundant representation of PPi , (12) is modified to
2 1
PPi =23i (B+di Â· X) n
(13)
Using Property 3, the modulo 2n1 multiplication by 23i in (13) is efficiently implemented as bitwise circularleftshift of the biased multiple, (B+di Â· X). For n=8 and k =4 ,Fig. 8 illustrates the partial product
2 1
matrix of XÂ·Y) 8 with
Fig. 9. Moduloreduced partial product generation.
2 1
2 1
It can be easily verified that the sum of (BS, BC) and() modulo 2n1 is 2B n . Therefore,()represents the partiallyredundant form of B3X n .
B. Generation of PartiallyRedundant Simple Multiples
The proposed technique represents the hard multiple in a biased partially redundant form. Since the occurrences of
the hard multiple cannot be predicted at design time, all multiples must be uniformly represented. Similar to the hard multiple, all other Booth encoded multiples listed in Table I must also be biased and generated in a
Fig.10. (a) Bitslice of Booth Encoder (BE).
(b) Bitslice of Booth Selector (BS).
(n/3+1) partial products in partially redundant representation.Each PPi consists of an nbit vector, ppi7—ppi1 ppi0 and a vector of n/k=2 redundant carry bits, qi1 and qi0 . Since qi0 and qi1 are the carryout bits of the RCAs, they are displaced by k bit positions for a given PPi. The bits, qij is displaced circularly to the left of q(i1) j by 3 bits, i.e., q20 and q21 are displaced circularly to the left of q10 and q11 by 3 bits, respectively and q10 and q11 are in turn displaced to the left of q00 and q01 by 3 bits, respectively. The last partial product in
Fig. 8 is the Compensation Constant (CC) for the bias introduced in the partiallyredundant representation.
The generation of the modulo reduced partial products, PP0, PP1 and PP2, in a partiallyredundant representation using Booth Encoder (BE) and Booth Selector (BS) blocks are illustated in Fig. 8. The BE block produces a signed onehot encoded digit from adjacent overlapping multiplier bits as illustrated in Fig. 10(a). The signed onehot encoded digit is then used to select the correct multiple to generate PPi. A bitslice of the radix8 BS for the partial product bit, PPij is shown in Fig. 10(b).
Fig. 11. Moduloreduced partial product accumulation.
As the bit positions of qij do not overlap, as shown in Fig. 8, they can be merged into a single partial product for accumulation.The merged partial products , PPi and the constant CC are Accumulated using a CSA tree with endaroundcarry addition at each CSA level and a final twooperand modulo 2n1 adder as shown in Fig. 11.
A family of lowarea and lowpower modulo 2n1 multipliers with variable delay to achieve delay balance amongst individual modulo channels in a highDR RNS multiplier was proposed. The delay of the proposed multiplier is controlled by the word length of the small parallel RCAs that are used to compute the requisite hard multiple of the radix8 Booth encoded multiplication in a partiallyredundant form. From synthesis results constrained by the critical channel delay of the RNS, it was shown that the proposed multiplier simultaneously reduces the area as well as the power dissipation of the radix4 Booth encoded multiplier for n28, which is the useful dynamic range of RNS multiplication to meet the minimum keysize requirements of ECC and RSA algorithms.
[1] R. Rivest, A. Shamir, and L.Adleman, A method for obtaining digital signatures and public key cryptosystems, Commun. ACM, vol. 21, no.2, pp. 120126, Feb. 1978.[2] V. Miller, Use of elliptic curves in cryptography, in Proc. Advancesin CryptologyCRYPTO85, Lecture Notes in Computer Science, 1986,vol. 218, pp. 417 426.
N. Koblitz, Elliptic curve cryptosys tems, Mathemat. of Comput. , vol.48, no. 177, pp. 203209, Jan. 1987.

National Institute of Standards and
Technology. Available: http:// csrc.nist.gov/publications/PubsSPs.html

A. K. Lenstra and E. R. Verheul, Selecting cryptographic key sizes,J. Cryptol., vol. 14, no. 4, pp. 255293, Aug. 2001.

C. McIvor, M. McLoone, and J. V.
McCanny, Modified Montgomery modular multiplication and RSA exponentiation techniques, IEE Proc. Comput. and Dig. Techniq., vol. 151, no. 6, pp. 402408, Nov.2004.

C. McIvor,M.McLoone, and J. V. McCanny, Hardware elliptic curve cryptographic processors over , IEEE Trans. Circuits Syst. I,Reg. Papers, vol. 53, no. 9, pp. 19461957, Sep. 2006.

D. M. Schinianakis, A. P. Fournaris, H. E. Michail, A. P. Kakarountas,and T. Stoura itis, An RNS implementation of an Elliptic curve point multiplier, IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 56, no.6, pp. 12021213, Jun. 2009.

J. C. Bajard and L. Imbert, A full RNS implementation of RSA, IEEE Trans. Comput. Brief Contributions , vol. 53, no. 6, pp. 769774, Jun.2004.

H. Nozaki, M. Motoyama, A. Shimbo, and S. Kawamura, Implementation of RSA algorithm based on RNS Montgomery multiplication, in Proc. Workshop on Cryptographic Hardware and Embedded Systems,Paris, France, May 2001, pp. 364 376.

T. Stoura itis and V. Paliouras, Considering the alternatives in lowpower design, IEEE Circuits Devices Mag., vol. 17, no. 4, pp. 2229,Jul. 2001.

S. Pontarelli, G. C. Cardarilli, M. Re, and
A. Salsano, Totally fault tolerant RNS based FIR filters, in Proc. 14th IEEE Int. OnLine Testing Symp., Rhodes, Greece, Jul. 2008, pp. 192194.

Mr.K.Ramamohan Reddy,is currently doing post graduation in Vagdevi institute of Tech & Science,Proddatur,Kadapa(Dt)
A.P with the specialization of VLSI [14]
Mr.V.Ramesh,is an Assistant Professor in ECE dept. at Vagdevi institute of Tech&Science,Proddatur.He obtained his B.Tech degree in ECE from MeRITS,JNTUA,Udayagiri in 2007,M.Tech degree in Electronic Instrumentation and Communication Systems from S.V.University,Tirupati in 2009.He has 3years of teaching experience and his area of interest includes Electronics instrumentation and Antennas.He has published 4 papers in referred international journals and also 2papers at national level conferences.
Mr.C.Mahammed Aslam is the HOD,Dept.of ECE at Vagdevi institute of Tech&Science,Proddatur.He obtained his B.Tech degree in ECE from Dr.Babasaheb Ambedkar Maratwada University,Aurangabad in 1997,M.Tech degree in Digital Electronics and Computer Science from JNTU,Hyderabad in 2007.He ahs 10 years of teaching experience.He registered his Ph.D from JNTU,Anantapur in digital Image Processing.His area of interest is Microprocessor and EDC circuits. He has published 2 papers in international journals.