Montgomery Modular Multiplication in Residue Arithmetic Jean-Claude Bajard LIRMM Montpellier, France
Laurent-Stephane Didier Universite de Bretagne Occidentale Brest, France
[email protected]
[email protected]
Peter Kornerup SDU/Odense University Odense, Denmark
[email protected]
November 1, 2000 Abstract
We present a new RNS modular multiplication for very large operands. The algorithm is based on Montgomery's method adapted to residue arithmetic. By choosing the moduli of the RNS system reasonably large, an eect corresponding to a redundant high-radix implementation is achieved, due to the carry-free nature of residue arithmetic. The actual computation in the multiplication takes place in constant time, where the unit of time is a few simple residue operations. However, it is necessary twice to convert values from one residue system into another, operations which takes O(n) time on O(n) processors, where n is the number of moduli in the RNS systems. Thus these conversions are the bottlenecks of the method, and any future improvements in RNS base conversions, or the use of particular residue systems, can immediately be applied.
1 Introduction Many cryptosystems employs modular multiplications and exponentiation on very large numbers [14, 5, 10], and various algorithms have been proposed [3, 8, 21, 19, 18, 11]. Most of them use redundant (possibly high-radix) standard number systems and Montgomery's modular multiplication [9]. On the other hand the Residue Number System (RNS) is also of particular interest, because of the parallel and carry free nature of its arithmetic [17, 20]. Note that the Montgomery modular multiplication takes place in a modi ed residue system, where operands and results contain an extra factor M , for some suitably chosen value of M . Mapping in and out of this residue system is simple, and its cost may be amortized over many multiplications, when these are used for modular exponentiation. However, if applied to RSA encryption [14] as well as decryption, we may just as well 1
assume that the message itself is considered the RNS representation of a number, thus mapping in and out of the RNS system is not necessary. This is particularly interesting for the Fiat-Shamir authenti cation protocol [5, 10], where only modular multiplications are used (no exponentiation). We shall thus not further discuss the implications of using this modi ed residue system. We have previously [1, 2] proposed two RNS versions of the Montgomery algorithm for modular multiplication. To compute A B mod N , an intermediate value Q is to be determined such that A B + Q N is a multiple of M , the product of the moduli of the RNS base. The quotient Q was computed digitwise in the Mixed Radix System (MRS). The result of one pass of the algorithm was then obtained in an auxiliary RNS base, using n (the size of the RNS base) RNS computations. Here we propose to compute Q with a single parallel RNS calculation in one RNS base, but to be able to divide out the factor M it is necessary to convert Q into an auxiliary RNS base, such that we can evaluate the result of the algorithm, R = (A B + Q N ) M ?1 , in the auxiliary RNS base. The major costs now lie in conversions from one base into another. Two such O(n)-time parallel conversion algorithms are described, where the simpler one based on [15] unfortunately cannot be employed for the conversion of Q. But it can be used to convert the result R back into the original base, allowing it to be used as an operand for another multiplication. The conversion of Q can be performed through an intermediate MRS-representation, which unfortunately is twice as costly, either in time or in space. Note that for regularity purposes, we want base conversions which can be executed on simple cells (the n residue \channels"), which excludes the use of O(log(n))time algorithms where multi-operand addition in a cell is performed in a tree structure. Section 2 introduces the notation used in the residue and the mixed radix systems employed. In Section 3 the Montgomery algorithm is brie y introduced and its adaption to the RNS system is discussed, together with a brief proof of correctness. Section 4 then introduces the two conversion algorithms and their use for our new modular multiplication algorithm. Section 5 combines the basic RNS multiplication with the conversions, and nally Section 6 contains some conclusions.
2 The Residue and Mixed Radix Number Systems We begin with a short summary of the RNS system, and introduce our terminology: The vector fm1; m2 ; ; mng forms a set of moduli, called the RNS-base Bn , where the mi's are relatively prime. M is the value of the product Qni=1 mi . The vector fx1 ; ; xng is the RNS representation of X , an integer less than M , where xi = jX jmi = X mod mi Due to the Chinese Remainder Theorem, any X less than M has one and only one RNS-representation. Addition and multiplication modulo M can be implemented in parallel in linear space O(n), and performed in one single step without any carry propagation, by de ning +RNS and RNS as component-wise operations [7, 17, 20]: A +RNS B jaj + bj jmj ; for j 2 f1; ; ng 2
A RNS B jaj bj jmj ; for j 2 f1; ; ng: We also de ne \exact division", A RNS B , assuming that B divides A, gcd(B; M ) = 1: R = A RNS B r^j for j 2 f1; ; ng where r^j is computed as:
r^j = rj (B )?1 mj mj ;
(1)
where (X )?1 mj denotes the inverse of X modulo mj for X and mj relatively prime. The Mixed Radix System (MRS) associated with this RNS is de ned using the same base of moduli. With (x01; ; x0n); 0 x0i < mi, being the MRS representation of X , an integer less than M , then
X = x01 + x02 m1 + x03 m2m1 + + x0n m1 mn?1: Observe that the value of x01 in the MRS representation is identical to the value of the rst component x1 of the RNS representation of X , when we use the same base vector for the two systems. We shall use the notation x0i for components of a MRS representation, as opposed to xi for the residues of an RNS representation. Conversion from RNS into MRS representation is often used for comparison of RNS numbers, but the MRS system is not well suited for computations. In the following we shall use an intermediate MRS representation to convert a number given in one RNS system into another RNS system. f1 ; m f2 ; ; m fn We shall also introduce an auxiliary base Ben~ = fm ~ g such that each element is relatively prime to each element of Bn. In this system the RNS representations of an integer X is: XRNS = fxe1; xe2 ; ; xen~ g f f fi , we shall assume that M and we note that with M = Qni~=1 m > M.
3 A new RNS algorithm Based on the original M-reduce algorithm by Montgomery [9] we want to compute the modular product, ABM ?1 mod N , for given A; B; N and M , where N < M and M is chosen such that reductions modulo M are \easy", which will be the case when M is the product of the moduli of the RNS system used. As we shall see below, to be able to perform the computations in RNS arithmetic, it is necessary to use two RNS systems. Hence the operands A and B must be available in both systems. And since for exponentiations the result of a multiplication may be used again as input for other multiplications, the result should also be delivered in both systems.
3
Stimulus: A residue base Bn , fm1 ; m2; ; mng, where M = A residue base Bn~ , fm1; m2 ; ; mn~ g, where M = e
f
f
f
f
n m i=1 i Qn ~ f i i=1 m Q
A modulus N expressed in RNS in the two bases, f with GCD(N; M )=1, and satisfying 0 < 2N < M < M Integer A given in RNS in the two RNS bases Integer B given in RNS in the two RNS bases Response: An integer R < 2N expressed in the two RNS bases, such that: R ABM ?1 (modN ) For that we compute an intermediate value Q, Q < M , such that: A B + Q N is a multiple of M . Then in RNS the representation of A B + Q N in Bn is composed only of zeros. As we have Q < M , Q can be easily obtained in the RNS base Bn. For i = 1::n, we have (ai bi + qi ni) mod mi = 0, and thus deduce that: for i = 1::n; we have qi = (mi ? ai bi ) (ni)?1 mi mod mi :
(2)
But note that since A B + Q N is a multiple of M , it cannot be represented in the system with base Bn. Hence to compute the nal result R = (A B + Q N ) M ?1 it is necessary to compute its value in an RNS system using another base Ben~ , and thus not only to have A; B and N available in Ben~ , but also to convert Q into that system. Hence we obtain the following algorithm:
Algorithm 1 (RNS Modular Multiplication) Function: RNS Modular Montgomery Multiplication Stimulus: A residue base Bn , fm1; m2 ; ; mng, where M = A residue base Bn~ , fm1 ; m2 ; ; mn~ g, where M = e
f
f
f
f
n m i=1 i n~ m i=1 fi
Q
Q
f f where gcd(M; M ) = 1 and M < M A modulus N expressed in RNS in the two bases, f with gcd(N; M ) = 1, gcd(N; M ) = 1, and 0 < 2N < M Integer A given in RNS in the two RNS bases Integer B given in RNS in the two RNS bases with A B < M N Response: An integer R < 2N expressed in the two RNS bases such that R ABM ?1 (modN ) Method: Q (?A RNS B ) RNS N ?1 in Bn Conversion of the representation of Q from Bn to Ben~ R (A RNS B +RNS Q RNS N ) RNS M ?1 in Ben~ Conversion of the representation of R from Ben~ to Bn
Since Q < M and AB < MN it follows that R < 2N , and it is easy to see that the result R satis es R ABM ?1 (modN ). With this new version of Montgomery's 4
algorithm, base conversions become the major operations of the algorithm, as the two RNS computations can be performed in parallel on all the individual residues.
4 Base conversion
All conversions of RNS representations from one base Bn into another Ben~ , satisfying f f gcd(M; M ) = 1, where M; M are the products of the moduli of the systems, must in some way or other implicitly calculate the value of the numbers represented. For our purpose here we want conversion algorithms, which can be executed on a set of simple processors available for the RNS computations (the \channels").
4.1 Using an extra modulus
We consider XRNS = fx1; x2 ; : : : ; xng represented in the system Bn with X 2 [0; M [ and construct X using the Chinese Remainder Theorem [7] by the following expression: n
X
X=
i=1
xi
!
M ?1 i
mi
Mi mod M
(3)
where Mi = mMi , and Mi?1 mi is the inverse of Mi modulo mi . Thus we have: (xi
M ?1 i
(
xi if j = i M ) mod m = i j mi 0 else
The normal use of this method is to reconstruct the integer value of X in a classical fi , we could number system. Now if we only want to obtain the residue of X modulo m use the expression (3). But the modulo-M reduction gives some problems evaluating the fi . However, residues modulo m n
X
i=1
xi
M ?1 i
mi
Mi
mi
= X + M
(4)
for some value of where 0 < n. In 1989 Shenoy et Kumaresan [15], proposed to use an extra modulus mn+1 to evaluate : n X ?1 ?1 (5) = jM jmn+1 xi Mi mi Mi ? jX jmn+1 : i=1
Thus it is now possible to compute jX jme j by
x~j =
n
X
i=1
xi
M ?1 i
mi
Mi
M ?1
mn+1
ej
? jM jm
m ej
mn+1
m ej
(6)
M ?1
and i mi Mi Observing that the constants i mi Mi mn+1 , jM jm ej m ej mn+1 can be precomputed, then and x~j are evaluated with n + 1 multiplications and n additions. The only dependence on is in the last multiplication and addition to compute 5
, jM ?1 j
for j = 1::n~:
x~j . Thus, in parallel using n~ + 1 channels, and all the x~j can be evaluated in n + 2 multiplications and n + 1 addition steps. Note that the value of is bounded by n, which in practice is much smaller than the other moduli mi; i = 1::n. Hence the extra channel computing can operate with a modulus mn+1 n smaller than the rest, possibly even a power of 2. However, a major drawback of this method is that one must know one extra residue, jX jmn+1 , which cannot be computed by (2) for Q in our algorithm, since (amn+1 bmn+1 + qmn+1 nmn+1 ) mod mmn+1 is unknown. But if an extra residue for R can be computed by some means then this algorithm can be used to convert the representation of R.
4.2 Using a mixed radix system
The only way to convert the representation of Q is to use an intermediate positional representation like the Mixed Radix System [7]. Consider X 2 [0; M [ with MRS representation such that: X = x01 + x02 m1 + x03 m1 m2 + + x0n m1 mn?1; then x0i; i = 1::n, can be found from the RNS representation XRNS = fx1 ; ; xn g using the following expressions:
x01 = x1 x02 = (x2 ? x01 )jm?1 1 jm2 m2 ?1 0 x03 = ((x3 ? x01 )jm?1 1 jm3 ? x2 )jm2 jm3 m3 ?1 ?1 0 0 x04 = (((x4 ? x01 )jm?1 1 jm4 ? x2 )jm2 jm4 ) ? x3 )jm3 jm4 m4 ... ?1 ?1 0 0 x0n = ( (xn ? x01 )jm?1 1 jmn ? x2 )jm2 jmn ) ? ? xn?1 )jmn?1 jmn mn
Assuming that the constants jm?1 i jmk for i < k, have been precomputed, then the evaluation of x0k can be done with k ? 1 multiplications and k ? 1 additions. f Then, knowing a MRS representation of X , and assuming that M > M , we can e compute the RNS representation of X in Bn~ as
e
x~i = x01 + x02 jm1jmi + x03 jm1m2 jmi + + x0n jm1 mn?1 jmi mi : e
e
e
If we assume that the constants jm1 mk jme i are precomputed, then for i = 1::n~ , x~i can be evaluated in parallel in n~ channels, using n multiplications and n ? 1 addition steps. Note that the evaluation of x~i for i = 1::n~ can begin with x01 = x1 . Thus if n~ + n channels are available, the conversion of Q from RNS into MRS in Bn can take place in n channels, while in parallel the conversion from MRS into RNS in base Ben~ takes place in the other n~ channels, and the complete conversion of Q can then be performed in n multiplication- and n ? 1 addition-steps. Also observe that it is possible here to compute any number of extra residues, and thus in particular to calculate the extra residue needed for R, allowing a base conversion of R to be performed by the previous and simpler method. 6
5 The RNS multiplication algorithm Figure 1 describes the execution of our algorithm, illustrating the time and area complexities. In is here assumed that the parallelism is maximal, but there are many intermediate possibilities using fewer channels. With maximal parallelism, using n + n~ +1 channels the time complexity is approximately 2nT , where T is the time for one table look-up plus one modular multiply-add operation. With max(n; n~) + 1 channels the time is approximately 3nT . Note that the conversion through MRS-representation is about twice as costly in time or space, as conversion using an extra modulus. 2
?
RNS operations in Bn @ @ @
n (n ? 1) +
@ @
RNS operations in Ben~ RNS evaluation of Q in Bn extension of Q from Bn to Ben~ using MRS conversion
@
@ @
?
3 ; 2 +
@ @
@
RNS evaluation of R in Ben~
?
extension of R from Ben~ to Bn using Shenoy-Kumaresan
(~n + 2) (~n + 1) + ?
n moduli of Bn
-
KA
e moduli of Bn ~ A n extra modulus mn+1 e
-
Figure 1: Evaluation of A B M ?1 in RNS For applications in cryptology, say with N 21024 , it is possible to choose n = n~ = 32, where each channel is realized by a very simple 32-bit processor, i.e., a total of 65 processors. Each processor has to be able to compute additions and multiplications, modulo some speci c 32-bit prime, and to store some 32-entry tables of 32-bit constants.
Example We consider the systems B5 = f3; 7; 13; 19; 29g, B5 = f5; 11; 17; 23; 31g, the f
extra modulus me = 8 and operands A, B and N . Thus, we have M = 150423 and f M = 666655. 3 A 1 B 1 N 2
7 3 5 3
In Bn 13 19 9 14 1 9 4 4
29 25 25 17
5 11 1 8 1 1 3 7 7
In Bfn Base 10 17 23 31 2 5 5 26386 1 21 19 72931 12 15 16 142523
The computation of A B M ?1 mod N is detailed as follows: In Bn me 1 2 1 16 11 1 5 0 5 13 0 3 0 26 7 10 4 1 18 7 5 1 4 6 16 7 6 1 4 6 16 7 6 0 5 0 15 14 5 2 2 2 15 19 0 2 1 12 10 16 3 2 0 2 8 6 10
In Bfn 1 1 4 4 0 2
1 1 1 5 16 16 2 11 10 4 4 3 2 6 12 0 13 3
Computation Q (?A B ) N ?1 1 Extension of Q from Bn to Bfn 16 trough MRS conversion 17 19 13 5 R (A B + Q N ) M ?1 Conversion of R through Shenoy and Kumaresan algorithm
(= ) A B M ?1 mod N = 103917
5.1 Using speci c sets of moduli
The use of speci c moduli can improve conversion methods using either the Chinese Remainder Theorem or a Mixed Radix System. Many such systems have been previously published using small sets of three or more moduli like f2k ? 1; 2k ; 2k + 1g for the most popular set. The bene t of such systems is twofold. First, since operations modulo 2k ? 1, 2k or 2k + 1 are reduced to simple logic operations, RNS operations are easier in this system. Second, the values involved in the CRT or MRS conversion algorithms are simple numbers which greatly simpli es the conversion computations. A recent result published in [6] shows that a conversion from RNS to binary using the set f2k ? 1; 2k; 2k?1 ? 1g has a O(n) space complexity, and a delay equal to one 2k ? 1-bit adder and two k ? 1 adders without any lookup table. However, as discussed above for applications in crypto-algorithms, much larger systems are needed. The only two systems we have found0 with a parameterizable number of moduli 1m km k m 2 m 2 2 of the form 2 1, are the systems f2 ? 1; 2 + 1; 2 + 1; ; 2 + 1g proposed in [13], and systems f2n1 1; 2n2 1; ; 2nL 1g in [16] where the exponents n1; n2 ; ; nL are mutually prime, not divisible by 3. The rst kind of systems are extremely un-balanced. Actually, the largest modulus here approximately p equals the product of all the remaining moduli, i.e., it is of about the same size as M , where M is the system modulus. The second kind of systems are also un-balanced when a dynamic range of say 1024 bits is needed. It is even simply impossible with 32-bit processors to nd suciently many such mutually prime moduli pairs of the form 2k 1. And with 64-bit processors at least 20 pairs of moduli of the form 2k 1 will be needed, in which case the system can hardly be considered balanced. Also, conversion 8
is performed trough two levels where the rst conversion step is easy, the second step has the same complexity as a regular CRT conversion.
6 Conclusions With this new approach of the modular multiplication in RNS, we observe that base extension is a key operation. All the complexity is due to this transformation. Shenoy and Kumaresan proposed an ecient method based on the CRT, by which a logarithmic time complexity can be obtained with n2 processors. Utilizing n parallel RNS channels the algorithm can be implemented in n + 1 steps, using one extra channel for an extra residue. But since such an extra residue cannot be obtained for the quotient Q, and if the computation is to be performed on the set of parallel RNS processors, base extension must use an intermediate MRS representation, which increases the complexity. We note that an advantage of MRS is that it is easy to add a modulus. In fact, the stored constants in each channel do not have to be modi ed, we just have to add a new channel to generate the residue for this extra modulus. Compared to other number systems, RNS can be considered a real parallel system. Our method is easily implementable with processors connected by a bus, where communication is reduced to at most one broadcast per step. If modular multiplication is to be implemented on processors which do not support arithmetic on the full data-width (say 1024 bits), we believe that an RNS implementation is preferable to a high-radix implementation. Given a number of smaller (say 32-bit) processors, parallelism is easier to exploit in RNS with O(n) processors, than in a redundant ordinary radix system. And it is not necessary to employ the full n + n~ + 1 processors for maximal parallelism, any number will do, even a single processor. Furthermore, when the application is in cryptography, we note that the message does not need to be converted into the special Montgomery residue system (containing the factor M ), it can be directly considered an RNS representation.
References [1] J.-C. Bajard, L.-S. Didier, and P. Kornerup, An RNS Montgomery modular multiplication algorithm, IEEE Transaction on Computers 47(7), pp. 766{776, 1998. [2] J.-C. Bajard, L.-S. Didier, P. Kornerup and F. Rico, Some improvements on RNS Montgomery Modular Multiplication, SPIE's International Symposium on Optical Science and Technology 30 July - 4 August 2000, San Diego, California USA. [3] E.F. Brickell. A Survey of Hardware Implementations of RSA. In Gilles Brassard, editor, Advances in Cryptology - CRYPTO '89, LNCS-435, pages 368{370. SpringerVerlag, 1990. [4] S.E. Eldridge and C. D. Walter. Hardware implementation of Montgomery's modular multiplication algorithm. IEEE Transaction on Computers, 42(6):693{699, June 1993. 9
[5] A. Fiat and A. Shamir. How to prove yourself: Practical solutions to identi cation and signature problems. In Advances in Cryptology - Proceedings of Crypto'86, pages 186{194, 1986. [6] A. A. Hiasat and S. H. Abdel-Aty-Zohdy. Residue-to-binary arithmetic converter for the moduli set (2k ; 2k ? 1; 2k?1 ? 1). IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 45(2):204{9, 1998. [7] D.E. Knuth. Seminumerical Algorithms, Volume 2 of The Art of Computer Programming. Addison-Wesley, 2 edition, 1981. [8] P. Kornerup. High-radix modular multiplication for cryptosystems. In G. Jullien M.J Irwin, E. Swartzlander, editors, 11th IEEE Symposium on Computer Arithmetic, pages 277{283, Windsor, Canada, 1993. IEEE Computer Society Press. [9] P. Montgomery. Modular multiplication without trial division. Mathematics of Computation, 44(170):519{521, April 1985. [10] S. Micali and A. Shamir. An improvement of the Fiat-Shamir identi cation and signature scheme. In Advances in Cryptology - Proceedings of Crypto'88, pages 244{ 247, 1988. [11] H. Orup. Simplifying Quotient Determination in High-Radix Modular Multiplication. In S. Knowles and W. H. McAllister, editors, Proc. 12th IEEE Symposium on Computer Arithmetic. IEEE Computer Society, 1995. [12] P. Paillier. Low-cost double-size modular exponentiation or how to stretch your cryptoprocessor. In H. Imai and Y. Zheng, editors, Second International Workshop on Practice and Theory in Public Key Cryptography, PKC'99, LNCS-1560, pages 223-234. Springer Verlag, 1999. [13] F. Pourbigharaz and H. M. Yassine. A signed-digit architecture for residue to binary transformation. IEEE Transactions on Computers, 46(10):1146{50, 1997. [14] R. L. Rivest, A. Shamir, and L. Adleman. A method for obtaining digital signatures and public-key cryptosystems. Communications of the ACM, 21(2):120{126, 1978. [15] A. P. Shenoy and R. Kumaresan. Fast base extension using a redundant modulus in RNS. IEEE Transactions on Computer, 38(2):292{296, 1989. [16] A. Skavantzos and M. Abdallah. Implementation issues of the two-level residue number system with pairs of conjugate moduli. IEEE Transactions on Signal Processing, 47(3):826{38, 1999. [17] N. Szabo and R. I. Tanaka. Residue Arithmetic and its application to Computer Technology. McGraw-Hill, 1967. [18] M. Shand and J. Vuillemin. Fast Implementations of RSA Cryptography. In M.J. Irwin E. Swartzlander and G. Jullien, editors, Proc. 11th IEEE Symposium on Computer Arithmetic, pages 252{259. IEEE Computer Society, 1993. 10
[19] N. Takagi. Modular Multiplication Algorithm with Triangle Addition. In M.J. Irwin, E. Swartzlander and G. Jullien, editors, Proc. 11th IEEE Symposium on Computer Arithmetic, pages 272{276. IEEE Computer Society, 1993. [20] F.J. Taylor. Residue Arithmetic: A Tutorial with Examples. COMPUTER, pages 50{62, May 1984. [21] C.D. Walter. Systolic Modular Multiplication. IEEE Transactions on Computers, C-42(3):376{378, March 1993.
11