Rapid Modular Reduction Algorithm

1 Sattar J. Aboud, 2 Mohammed Al-Fayoumi

1 Department of Computer Science & Technology, University of Bedfordshire, UK

2 Faculty of Information Technology, El-ISra University, Amman, Jordan

Abstract In this paper, we introduce a new modular reduction algorithm that divides the binary string of the integer to be reduced into blocks according to its runs. Its computing cost depends on the number of runs in the binary string. We give a complete cost analysis of the algorithm. The analysis indicates that the new reduction method is about twice as fast as the popular Barrett reduction method.

1. Introduction

The speed of many cryptosystems depends on the speed of modular reduction. Among existing modular reduction methods, Barrett reduction and Montgomery reduction are the two most widely accepted. Montgomery reduction [1] is not efficient for a single modular multiplication, but it can be used efficiently in computations that involve many multiplications for given inputs. Barrett reduction [2] is appropriate when many reductions are performed with a single modulus. Both algorithms are alike in that the costly divisions of classical reduction are replaced by less costly multiplications. In 1993, Bosselaers et al. [3] stated that classical reduction, Barrett reduction, and Montgomery reduction each perform best in a specific area of use. In 1998, Win et al. [4] stated that the difference between Montgomery reduction and Barrett reduction was insignificant in their implementation of field arithmetic in $F_p$ for a 192-bit prime $p$. In 2011, Dupaquis and Venelli [5] proposed a method that allows the use of redundant modular arithmetic. The redundant method can be used to increase the differential side-channel resistance of asymmetric systems. In 2013, Cao and Wu [6] showed that the base $b > 3$ in Barrett reduction can be replaced by the base 2. This improvement solves the data expansion problem in Barrett reduction and gives a small cost reduction. To accelerate modular reduction, a lookup-table method was proposed by Walter in 1991 [7]. When the size of the pre-computed table is manageable, the algorithm is efficient. These modular reductions partition the binary string of the integer to be reduced into fixed-size blocks, for example 32 bits or 64 bits. In 1997, Lim et al. [8] tested classical reduction, Barrett reduction, and Montgomery reduction against lookup-table reduction. They showed that their lookup-table reduction runs about two to three times faster than the Montgomery method. They recommended the base $b = 2^{32}$ for Barrett reduction. However, they did not give a computing-time analysis of these methods.

In this paper, we present a new reduction algorithm that relies on addition. Unlike the existing lookup-table approach, the new method divides the binary string of the integer to be reduced into blocks according to its runs. Its cost depends on the number of runs. The new technique requires fewer additions than the existing lookup-table method. We give a complete complexity analysis of the algorithm.

2. Existing Reduction Algorithms

In this section we describe the existing modular reduction algorithms.

2.1. Montgomery Reduction Algorithm

Montgomery presented the algorithm in 1985 [1]. Montgomery reduction is a technique that allows efficient implementation of modular multiplication without explicitly carrying out the classical modular reduction step. The steps of the algorithm are as follows.

Algorithm 1
Input: integers $m = (m_{n-1} \ldots m_1 m_0)_b$ with $\gcd(m, b) = 1$, $R = b^n$, $m' = -m^{-1} \bmod b$, and $T = (t_{2n-1} \ldots t_1 t_0)_b < mR$
Output: $TR^{-1} \bmod m$
1. $A \leftarrow T$ (with $A = (a_{2n-1} \ldots a_1 a_0)_b$);
2. for $i$ from $0$ to $n - 1$ do
   2.1 $u_i \leftarrow a_i m' \bmod b$;
   2.2 $A \leftarrow A + u_i m b^i$;
3. $A \leftarrow A / b^n$;
4. if $A \ge m$ then
   4.1 $A \leftarrow A - m$;
5. return($A$).

Example Suppose $m = 72639$, $b = 10$, $R = 10^5$, and $T = 7118368$. Thus $n = 5$, $m' = -m^{-1} \bmod 10 = 1$, $T \bmod m = 72385$, and $TR^{-1} \bmod m = 39796$. Table 1 shows the iterations of step 2 in Algorithm 1.
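The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than an optimized implementation: the function name `montgomery_reduce` and its argument order are assumptions made here, digits are extracted with ordinary integer arithmetic instead of word operations, and `pow(m, -1, b)` (available from Python 3.8) is used to obtain $m^{-1} \bmod b$.

```python
def montgomery_reduce(T, m, b, n):
    """Return T * R^(-1) mod m for R = b**n.

    Assumes gcd(m, b) == 1 and 0 <= T < m * R (hypothetical helper).
    """
    R = b ** n
    m_prime = (-pow(m, -1, b)) % b       # m' = -m^(-1) mod b
    A = T
    for i in range(n):
        a_i = (A // b ** i) % b          # current i-th base-b digit of A
        u_i = (a_i * m_prime) % b        # step 2.1
        A += u_i * m * b ** i            # step 2.2: makes digit i of A zero
    A //= R                              # step 3: exact, the low n digits are zero
    if A >= m:                           # step 4: one subtraction is enough
        A -= m
    return A


# Values from the example: m = 72639, b = 10, n = 5, T = 7118368
result = montgomery_reduce(7118368, 72639, 10, 5)
print(result)
```

With the values of the example, the sketch prints 39796, which agrees with $TR^{-1} \bmod m$ as stated above.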

Remarks
1. Algorithm 1 does not require $m' = -m^{-1} \bmod R$, but only $m' = -m^{-1} \bmod b$. This is due to the choice $R = b^n$.
2. At step 2.1 of Algorithm 1 with $i = l$, $A$ has the property that $a_j = 0$ for $0 \le j \le l - 1$. Step 2.2 does not modify these values, but replaces $a_l$ by $0$. It follows that in step 3, $A$ is divisible by $b^n$.
3. Before the division in step 3, the value of $A$ equals $T$ plus a multiple of $m$. Thus, after step 3, $A = (T + km)/b^n$ is an integer and $A \equiv TR^{-1} \pmod m$. It remains to show that $A$ is less than $2m$, so that in step 4 a subtraction rather than a division is sufficient. Before the division in step 3, $A = T + \sum_{i=0}^{n-1} u_i b^i m$, but $\sum_{i=0}^{n-1} u_i b^i m < b^n m = Rm$ and $T < Rm$. Thus $A < 2Rm$, and after the division of $A$ by $R$ in step 3, $A < 2m$ as needed.
4. Algorithm 1 does not require any single-precision divisions. Steps 2.1 and 2.2 of Algorithm 1 together need a total of $n + 1$ single-precision multiplications. As these steps are executed $n$ times, the total number of single-precision multiplications is $n(n + 1)$.

2.2. Barrett Reduction

In 2014, Sattar Aboud and Edmond Prakash [9] described Barrett reduction. Barrett reduction calculates $r = x \bmod m$ given $x$ and $m$. Algorithm 2 needs the pre-calculation of the number $u = \lfloor b^{2k} / m \rfloor$. It is useful if many reductions are done with a single modulus. For instance, every RSA encryption for one individual needs reduction modulo that individual's public key modulus. The pre-calculation takes a fixed amount of work, which is insignificant in comparison to the cost of modular exponentiation. The radix $b$ is selected to be close to the word length of the processor; we assume that $b > 3$ in Algorithm 2. The steps of Algorithm 2 are as follows.

Algorithm 2
Input: positive integers $x = (x_{2k-1} \ldots x_1 x_0)_b$, $m = (m_{k-1} \ldots m_1 m_0)_b$ with $m_{k-1} \ne 0$,



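$u = \lfloor b^{2k} / m \rfloor$
Output: $r = x \bmod m$
1. $q_1 \leftarrow \lfloor x / b^{k-1} \rfloor$; $q_2 \leftarrow q_1 \cdot u$; $q_3 \leftarrow \lfloor q_2 / b^{k+1} \rfloor$;
2. $r_1 \leftarrow x \bmod b^{k+1}$; $r_2 \leftarrow q_3 \cdot m \bmod b^{k+1}$; $r \leftarrow r_1 - r_2$;
3. if $r < 0$ then $r \leftarrow r + b^{k+1}$;
4. while $r \ge m$ do $r \leftarrow r - m$;
5. return($r$)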

Example Suppose that $b = 4$, $k = 3$, $x = (313221)_b$, and $m = (233)_b$, that is, $x = 3561$ and $m = 47$. Then $u = \lfloor 4^6 / m \rfloor = 87 = (1113)_b$, $q_1 = \lfloor (313221)_b / 4^2 \rfloor = (3132)_b$, $q_2 = (3132)_b \cdot (1113)_b = (10231302)_b$, $q_3 = (1023)_b$, $r_1 = (3221)_b$, $r_2 = (1023)_b \cdot (233)_b \bmod b^4 = (3011)_b$, and $r = r_1 - r_2 = (210)_b$. Thus $x \bmod m = 36$.
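The Barrett steps can be followed in a short Python sketch. It mirrors steps 1 to 5 of Algorithm 2 with Python integers standing in for base-$b$ digit arrays; the function name `barrett_reduce` and the optional argument `u` are assumptions made for this illustration, and the divisions by powers of $b$ would be digit shifts in a real multi-precision implementation.

```python
def barrett_reduce(x, m, b, k, u=None):
    """Return x mod m, assuming 0 <= x < b**(2k), b**(k-1) <= m < b**k, b > 3."""
    if u is None:
        u = b ** (2 * k) // m            # pre-calculated once per modulus
    q1 = x // b ** (k - 1)               # step 1
    q2 = q1 * u
    q3 = q2 // b ** (k + 1)
    r1 = x % b ** (k + 1)                # step 2
    r2 = (q3 * m) % b ** (k + 1)
    r = r1 - r2
    if r < 0:                            # step 3
        r += b ** (k + 1)
    while r >= m:                        # step 4: at most two subtractions
        r -= m
    return r


# Values from the example: b = 4, k = 3, x = 3561, m = 47
barrett_result = barrett_reduce(3561, 47, 4, 3)
print(barrett_result)
```

For the example values the sketch prints 36, matching $x \bmod m$ above.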

Remarks
1. All divisions performed in Algorithm 2 are simple right-shifts of the base-$b$ representation.
2. The variable $q_2$ is used only to calculate $q_3$. As the $k + 1$ least significant digits of $q_2$ are not needed to determine $q_3$, only a partial multiple-precision multiplication $q_1 \cdot u$ is required. The only effect of the $k + 1$ least significant digits on the higher-order digits is the carry from position $k + 1$ to position $k + 2$. Provided the base $b$ is sufficiently large with respect to $k$, this carry can be calculated by computing only the digits at positions $k$ and $k + 1$. Hence the $k - 1$ least significant digits of $q_2$ need not be calculated. As $u$ and $q_1$ have at most $k + 1$ digits, determining $q_3$ needs at most $(k + 1)^2 - \binom{k}{2} = (k^2 + 5k + 2)/2$ single-precision multiplications.
3. In step 2 of Algorithm 2, $r_2$ can also be calculated by a partial multiple-precision multiplication that computes just the $k + 1$ least significant digits of $q_3 \cdot m$. This can be done in at most $\binom{k+1}{2} + k$ single-precision multiplications.
4. Algorithm 2 is based on the observation that $\lfloor x / m \rfloor$ can be written as $Q = \lfloor (x / b^{k-1})(b^{2k} / m)(1 / b^{k+1}) \rfloor$. Moreover, $Q$ can be approximated by the quantity $q_3 = \lfloor \lfloor x / b^{k-1} \rfloor u / b^{k+1} \rfloor$. Observe that $q_3$ is never larger than the true quotient $Q$ and is at most 2 smaller.
5. Write $x = Qm + R$ with $0 \le R < m$. In step 2, $-b^{k+1} < r_1 - r_2 < b^{k+1}$, $r_1 - r_2 \equiv (Q - q_3)m + R \pmod{b^{k+1}}$, and $0 \le (Q - q_3)m + R < 3m < b^{k+1}$. If $r_1 - r_2 \ge 0$ then $r_1 - r_2 = (Q - q_3)m + R$. If $r_1 - r_2 < 0$ then $r_1 - r_2 + b^{k+1} = (Q - q_3)m + R$. In either case, step 4 is repeated at most twice since $0 \le r < 3m$.

2.3 Customized Reduction

In 1991, Walter presented another type of reduction method [7]. If the modulus has a special form, a customized reduction method can be used for more efficient calculation. Assume the modulus $m$ is a $t$-digit base-$b$ integer of the form $m = b^t - c$, where $c$ is an $l$-digit integer with $l < t$. Algorithm 3 calculates $x \bmod m$ for any positive integer $x$ using only shifts, additions, and single-precision multiplications. The steps of the algorithm are as follows.

Algorithm 3
Input: a base $b$, a positive integer $x$, and a modulus $m = b^t - c$, where $c$ is an $l$-digit base-$b$ integer for some $l < t$
Output: $r = x \bmod m$





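1. $q_0 \leftarrow \lfloor x / b^t \rfloor$; $r_0 \leftarrow x - q_0 b^t$; $r \leftarrow r_0$; $i \leftarrow 0$;
2. while $q_i > 0$ do
   2.1 $q_{i+1} \leftarrow \lfloor q_i c / b^t \rfloor$;
   2.2 $r_{i+1} \leftarrow q_i c - q_{i+1} b^t$;
   2.3 $i \leftarrow i + 1$;
   2.4 $r \leftarrow r + r_i$;
3. while $r \ge m$ do
   3.1 $r \leftarrow r - m$;
4. return($r$).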

Example Assume $b = 4$, $m = 935 = (32213)_4$, and $x = 31085 = (13211231)_4$. As $m = 4^5 - (1121)_4$, $c = (1121)_4$. Here $t = 5$ and $l = 4$. At step 3, $r = (102031)_4$; since $r \ge m$, step 3 calculates $r - m = (3212)_4$.
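The example can be traced with a small Python sketch of Algorithm 3. The name `reduce_special` is an assumption made here, and `divmod` by $b^t$ stands in for the digit shifts of a multi-precision implementation.

```python
def reduce_special(x, b, t, c):
    """Return x mod m for a modulus of the form m = b**t - c, with 0 < c < b**t."""
    bt = b ** t
    m = bt - c
    q, r = divmod(x, bt)             # step 1: q0 and r0
    while q > 0:                     # step 2
        q, r_i = divmod(q * c, bt)   # q_{i+1} and r_{i+1}
        r += r_i
    while r >= m:                    # step 3
        r -= m
    return r


c_digits = int("1121", 4)            # c = (1121)_4 = 89
special_result = reduce_special(31085, 4, 5, c_digits)
print(special_result)
```

The sketch prints 230, which is $(3212)_4$ as in the example.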

Remarks
1. Assume that $x$ has $2t$ base-$b$ digits. If $l \le t/2$, then Algorithm 3 runs step 2 at most $s = 3$ times, needing 2 multiplications by $c$. In general, if $l$ is about $(s - 2)t/(s - 1)$, then Algorithm 3 runs step 2 about $s$ times. So Algorithm 3 needs around $sl$ single-precision multiplications.
2. If $c$ has few non-zero digits, then multiplication by $c$ is relatively cheap. If $c$ is large but has few non-zero digits, the number of iterations of Algorithm 3 will be larger, but each iteration needs only a simple multiplication.
3. Algorithm 3 can be adapted to the case $m = b^t + c$ for some positive integer $c < b^t$: in step 2.4, replace $r \leftarrow r + r_i$ with $r \leftarrow r + (-1)^i r_i$.
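Remark 3 can be illustrated with a variant of the same loop. The sketch below is an assumption-laden illustration: the name `reduce_plus_form` is hypothetical, and because the alternating sum may become negative, the final normalisation uses Python's `%` operator, which is a choice made here and is not spelled out in the remark.

```python
def reduce_plus_form(x, b, t, c):
    """Return x mod m for a modulus of the form m = b**t + c, with 0 < c < b**t."""
    bt = b ** t
    m = bt + c
    q, r = divmod(x, bt)             # q0 and r0
    sign = -1                        # r_1 enters with sign (-1)**1
    while q > 0:
        q, r_i = divmod(q * c, bt)
        r += sign * r_i              # r <- r + (-1)**i * r_i
        sign = -sign
    return r % m                     # r can be negative or exceed m here


plus_result = reduce_plus_form(31085, 4, 5, 89)   # m = 4**5 + 89 = 1113
print(plus_result)
```

The alternating signs come from $b^t \equiv -c \pmod m$, so each further fold of the quotient flips the sign of the contribution.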

3. The Proposed Method

The proposed modular reduction method is based on a lookup table and performs better than the methods described above.

3.1 The Algorithm

The steps of the algorithm are as follows:
1. Select two integers $c$ and $n$, where $n$ is the modulus and $c$ is its bit size.
2. Select an integer $f$ with $0 < f < 2^{2c}$.
3. Let $e_0 = \lfloor \log_2 f \rfloor + 1$.
4. Flip every bit of $f$ to get an integer $f_1$ with $f = (2^{e_0} - 1) - f_1$.
5. Let $e_1 = \lfloor \log_2 f_1 \rfloor + 1$.
6. Flip every bit of $f_1$ to get an integer $f_2$ with $f = (2^{e_0} - 1) - (2^{e_1} - 1) + f_2$.
7. Continue in the same way to obtain $f = (2^{e_0} - 1) - (2^{e_1} - 1) + (2^{e_2} - 1) - \cdots + (-1)^{j-1}(2^{e_{j-1}} - 1) + (-1)^j f'$,
8. where $e_{j-1} > c \ge e_j$ and $e_j$ is the bit size of $f'$; therefore $e_0 > e_1 > \cdots > e_j$.
9. Look up the pre-calculated table for the values $z[e_0], \ldots, z[e_{j-1}]$, where $z[e] = 2^e \bmod n$, using the sequence $e_0, \ldots, e_{j-1}$.
10. Find $z = (z[e_0] - 1) - (z[e_1] - 1) + (z[e_2] - 1) - \cdots + (-1)^{j-1}(z[e_{j-1}] - 1) + (-1)^j f'$   (1)
11. Determine $f \equiv z \pmod n$.

Example Suppose $n = 97 = (1100001)_2$, $c = 7$ (bit size), $f = 3135 = (110000111111)_2$, $e_0 = 12$, and $z[e_0] = 2^{12} \bmod 97 = 22$. The steps to calculate $f \bmod n$ are as follows:
Flip $f$: $f_1 = (1111000000)_2$, $e_1 = 10$, $z[e_1] = 2^{10} \bmod 97 = 54$.
Flip $f_1$: $f_2 = (111111)_2$, $e_2 = 6 \ (\le c)$. Therefore $f' = (111111)_2$.
So $(z[e_0] - 1) - (z[e_1] - 1) + (111111)_2 = 21 - 53 + 63 = 31$.

3.2 Remarks

To get the sequence $e_0, \ldots, e_{j-1}$ and $f'$ in formula (1), the above process needs to flip all bits of the strings. In fact, this sequence and $f'$ depend only on the runs in the left string of binary($f$). A run is a maximal substring whose bit positions all hold the same digit, 0 or 1. The sequence can therefore be obtained by counting the size of every run in the left string. Assume that

binary($f$) $= a_0 \,\|\, a_1 \,\|\, \cdots \,\|\, a_{j-1} \,\|\, a'_j$   (2)

where the notation $a \,\|\, b$ means that string $a$ is concatenated with string $b$, the $a_i$ are runs with sizes $d_i$, and $a'_j$ is the remaining string. Thus

$e_1 = e_0 - d_0, \ \ldots, \ e_{j-1} = e_{j-2} - d_{j-2}, \ e_j = e_{j-1} - d_{j-1}$   (3)

Observe that the size of the string $a'_j$ is $e_j$. Therefore, if $j$ is even then $f' = (a'_j)_2$; else, if $j$ is odd, then $f' = 2^{e_j} - 1 - (a'_j)_2$.

So, if $j$ is even then

$f \equiv z[e_0] - z[e_1] + z[e_2] - \cdots + (-1)^{j-1} z[e_{j-1}] + (a'_j)_2 \pmod n$   (4)

else, if $j$ is odd, then

$f \equiv z[e_0] - z[e_1] + z[e_2] - \cdots + (-1)^{j-1} z[e_{j-1}] + (a'_j)_2 - 2^{e_j} \pmod n$   (5)

Run-length encoding is a simple and practical form of data compression in which a run of data is stored as a single data value and a count rather than as the original run. However, no one has reported such a run-length reduction method to date.

Example Suppose $n = 97 = (1100001)_2$, $c = 7$, $f = 3135 = (110000111111)_2$, and $e_0 = 12$. The runs in the left string of binary($f$) are $a_0 = 11$ and $a_1 = 0000$. Their sizes are $d_0 = 2$ and $d_1 = 4$. Thus
$e_1 = e_0 - d_0 = 12 - 2 = 10$
$e_2 = e_1 - d_1 = 10 - 4 = 6$
As $e_2 = 6 < 7 = c$ we obtain $j = 2$ and $a'_2 = 111111$; thus $f' = (a'_2)_2 = (111111)_2 = 63$. Therefore $f = 2^{e_0} - 2^{e_1} + f' = 2^{12} - 2^{10} + 63 \equiv 22 - 54 + 63 = 31 \pmod{97}$.
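The claim that the exponents obtained by repeated bit flipping coincide with those obtained from run sizes can be checked numerically. The sketch below is only an illustration; the function names `flip_exponents` and `run_exponents` are assumptions made here, and both return the list $e_0, \ldots, e_{j-1}$ together with $f'$.

```python
from itertools import groupby


def flip_exponents(f, c):
    """Exponents e_0..e_{j-1} and f' obtained by flipping all bits repeatedly."""
    exps = []
    while f.bit_length() > c:
        e = f.bit_length()
        exps.append(e)
        f = (1 << e) - 1 - f             # flip every one of the e bits
    return exps, f


def run_exponents(f, c):
    """The same quantities obtained by counting run sizes, as in relation (3)."""
    bits = bin(f)[2:]
    e = len(bits)
    exps = []
    for _, run in groupby(bits):         # maximal runs of equal bits, left to right
        if e <= c:
            break
        exps.append(e)
        e -= len(list(run))              # e_{i+1} = e_i - d_i
    tail = f & ((1 << e) - 1)            # value of the remaining string a'_j
    j = len(exps)
    f_prime = tail if j % 2 == 0 else (1 << e) - 1 - tail
    return exps, f_prime


print(flip_exponents(3135, 7), run_exponents(3135, 7))
```

For $f = 3135$ and $c = 7$ both functions give the exponents 12 and 10 with $f' = 63$, as in the two worked examples.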

3.3 Timing Analysis

Obtaining $e_0, \ldots, e_{j-1}$ and $f'$ needs only a few inexpensive bit operations. As $e_0, \ldots, e_{j-1}$ is ordered, that is, $e_0 > e_1 > \cdots > e_{j-1}$, the cost of looking up $z[e_0], \ldots, z[e_{j-1}]$ in $T$ is negligible. There are $j$ additions to find $z$. As $|z[t]| < n/2$ for $t \in \{e_0, \ldots, e_{j-1}\}$, we have $|z| < (j + 2)n/2 \le (\lfloor (j + 2)/2 \rfloor + 1)n$.

Algorithm 2
INPUT: $n$, $c = \mathrm{bitsize}(n)$, $f$ with $0 < f < 2^{2c}$, $T = (z[c + 1], z[c + 2], \ldots, z[2c])$
OUTPUT: $f \bmod n$
if ($f < n$) then return($f$)
if (bitsize($f$) $= c$) then return($f - n$)
$s :=$ binary[$f$]; $e :=$ bitsize[$f$];
$y := 1$; $z := z[e]$; $d := 0$; $t := 0$;
for $i$ ($e - 1$ downto $0$) do
    $b :=$ StringTake[$s$, $i$];
    if ($b = y$) then
        $d := d + 1$;
    else
        $e := e - d$; $t := t + 1$;
        if ($e > c$) then
            $z := z + (-1)^t z[e]$;
            $y := \mathrm{mod}(y + 1, 2)$; $d := 1$;
        else
            $a :=$ StringTake[$s$, $-e$];
            if ($\mathrm{mod}(t, 2) = 0$) then $z := z + (a)_2$ else $z := z + (a)_2 - 2^e$;
            break;
while ($z \ge n$) do $z := z - n$;
while ($z < 0$) do $z := z + n$;
return($z$);
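A runnable reading of the run-scanning pseudocode is sketched below in Python. Several points are assumptions made for this illustration and are not fixed by the text: the table is a dictionary built by the hypothetical helper `build_table` with $z[e] = 2^e \bmod n$ for $e = c + 1, \ldots, 2c$; the run counter restarts at 1 because the bit that ends one run opens the next; the table term is added only while $e > c$, in line with formulas (4) and (5); and a final run of 1's that reaches the least significant bit is handled by subtracting 1.

```python
def build_table(n):
    """Hypothetical table layout: z[e] = 2**e mod n for e = c+1 .. 2c."""
    c = n.bit_length()
    return {e: pow(2, e, n) for e in range(c + 1, 2 * c + 1)}


def run_reduce(f, n, table):
    """Return f mod n for 0 <= f < 2**(2c), scanning the runs of binary(f)."""
    c = n.bit_length()
    if f < n:
        return f
    e = f.bit_length()
    if e == c:
        return f - n
    y, d, t = "1", 0, 0                  # current run digit, run size, run index
    z = table[e]
    for bit in bin(f)[2:]:
        if bit == y:
            d += 1
            continue
        e -= d                           # a run of size d has just ended
        t += 1
        if e > c:
            z += (-1) ** t * table[e]
            y, d = bit, 1                # this bit opens the next run
        else:
            a = f & ((1 << e) - 1)       # value of the remaining e bits
            z += a if t % 2 == 0 else a - (1 << e)
            break
    else:
        if y == "1":                     # binary(f) ends inside a run of 1's
            z -= 1
    while z >= n:
        z -= n
    while z < 0:
        z += n
    return z


T97 = build_table(97)
print(run_reduce(3135, 97, T97))
```

For $f = 3135$ and $n = 97$ the sketch prints 31, the value obtained in the worked example.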

It needs about $\lfloor (j + 2)/2 \rfloor$ subtractions to find $z \bmod n$. The proposed algorithm therefore needs about $3j/2$ additions of $c$-bit integers. Table 1 summarizes the comparison between the proposed method and the Barrett method. The computation of $\lfloor f / b^i \rfloor \cdot q$ determines the cost of Barrett reduction.

Table 1: Cost comparison between Barrett reduction and the proposed method

Method | Arithmetic operations ($c$-bit integers) | Pre-computations | Byte/bit scans
Barrett reduction | 1 multiplication, 3 additions | Value $q$ | $c/8$ bytes
The proposed method | $3j/2$ additions | Table $T$ ($c$ items) | $c$ bits

The proposed method needs more bit scans. However, we emphasize that the entire cost of the bit scans is less than the cost of one addition of $c$-bit integers. The value of $j$ is decisive for the comparison. Clearly $j \le c$; if the left string of binary($f$) is $(1010\ldots10)$ then $j = c$. For a random integer $f$, it is expected that there are roughly $c/2$ runs and $c/2$ 1's. Therefore we have $j \approx c/2$. This means the proposed method is faster than the Barrett method at the cost of a small amount of storage. Incidentally, the storage space is approximately 1M for a pre-calculated table with respect to a modulus of 1024 bits, which is acceptable for most devices.

The Montgomery method is unsuitable for processing the string $11\ldots1$, while the proposed method can handle such a string efficiently. The proposed method, in its basic form, is inefficient at handling the string $1010\ldots10$. If hundreds of modular multiplications are needed for a modular exponentiation, then it is worthwhile to employ the two methods appropriately. As they need the same pre-calculated table, we can easily merge the two methods. The combined reduction method is described as follows.

4. The Combined Reduction Method

Assume that $n$ is a modulus with $c = \mathrm{bitsize}(n)$ and $T$ is a pre-calculated table. To find $f \bmod n$, the combined reduction method runs as follows.
1. Take the left string of binary($f$), that is, the string that remains when the right string of size $c$ is removed.
2. Count the number of 1's in the left string and denote it by $x$.
3. Count the number of runs in the left string and denote it by $o$.
4. If $x \le o$ then employ algorithm 1. Else, employ algorithm 2.

To illustrate the above procedure, consider the string $(101010111101)_2$. For this string $x = 8$ and $o = 9$. Therefore algorithm 1 is used. However, it is better to use algorithm 1 for the first 6 bits and algorithm 2 for the last 6 bits. If there is a long run of 1's, then we should use algorithm 2 for that run. The following algorithm can be used to compute $f \bmod n$ with $n < f < n^2$.

Phase 1: The steps of this phase are as follows.
1. Set $e_0 = \mathrm{bitsize}[f]$.
2. Take the left string of binary($f$), where the size of the right string equals $c$.
3. Count the number of 1's in the left string and denote it by $x$.
4. If $x > c/2$, then flip all bits of binary($f$).
5. Denote the new number by $f'$.
6. Compute $f = (2^{e_0} - 1) - f'$. In this case, the number of 1's in the left string of $f'$ is at most $c/2$. Thus, we consider $f' \bmod n$.
For ease, suppose that $x \le c/2$.

Phase 2: The steps of this phase are as follows.
1. Count the runs in the left string to get a vector $w = (l_0, z_0; l_1, z_1; \ldots; l_j, z_j)$, where $l_0$ is the size of the first run of 1's in the left string and $z_0$ is the size of the first run of 0's, ..., $l_j$ is the size of the last run of 1's and $z_j$ is the size of the last run of 0's.
2. Here $l_i \ge 1$ for $0 \le i \le j$ and $z_i \ge 1$ for $0 \le i \le j - 1$.

Phase 3: The steps of this phase are as follows.
1. Let $e_t = c + \sum_{i=t}^{j} (l_i + z_i)$ for $0 \le t \le j$.
2. for $t$ from $0$ to $j$ do
   if $l_t \le 2$ then $u_t = \sum_{m = e_t - l_t + 1}^{e_t} z[m - 1]$
   else $u_t = z[e_t] - z[e_t - l_t]$

Phase 4: The steps of this phase are as follows.
1. Find $r = \sum_{t=0}^{j} u_t$.
2. Find $f \bmod n$.

Remark We emphasize that the proposed algorithm needs to look up the pre-calculated table about $1 + c/2$ times and needs around $c/2$ additions of $c$-bit integers in the worst case. Because Barrett reduction needs one multiplication of $c$-bit integers, the method is expected to be about twice as fast as Barrett reduction. We will test this method at a later stage.

Example Assume that $f = 58809 = (1110010110111001)_2$ and $n = 267 = (100001011)_2$, so that $f < n^2$. Then $c = 9$, the left string is $(1110010)$, and $w = (3, 2; 1, 1)$. Thus
$u_0 = z[16] - z[13] = 121 - 182 = -61$
$u_1 = z[10] = -44$ (as a signed residue, since $2^{10} \bmod 267 = 223 \equiv -44$)
$r = -61 - 44 = -105$
Therefore $f \equiv -105 + (110111001)_2 = -105 + 441 \equiv 69 \pmod{267}$.
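Phases 2 to 4 can be traced with the following Python sketch. It is an illustration under stated assumptions: the name `combined_reduce` is hypothetical, Phase 1 (the optional flip when the left string has many 1's) is omitted, the table is rebuilt on each call as a dictionary with $z[k] = 2^k \bmod n$ for $k = c, \ldots, 2c$, and the residues are unsigned, so $z[10] = 223$ appears where the example uses the signed value $-44$.

```python
from itertools import groupby


def combined_reduce(f, n):
    """Return f mod n for 0 <= f < 2**(2c), adding one term per run of 1's."""
    c = n.bit_length()
    z = {k: pow(2, k, n) for k in range(c, 2 * c + 1)}   # assumed table layout
    left, right = f >> c, f & ((1 << c) - 1)
    bits = bin(left)[2:] if left else ""
    e = c + len(bits)                    # exponent just above the current run
    r = 0
    for bit, run in groupby(bits):
        l = len(list(run))
        if bit == "1":
            if l <= 2:                   # short run: one lookup per 1 bit
                r += sum(z[m - 1] for m in range(e - l + 1, e + 1))
            else:                        # long run: 2**e - 2**(e - l)
                r += z[e] - z[e - l]
        e -= l
    return (r + right) % n


combined_result = combined_reduce(58809, 267)
print(combined_result)
```

For $f = 58809$ and $n = 267$ the sketch prints 69, in agreement with the example.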

5. Conclusions

In this paper, we presented a rapid modular reduction method based on a lookup table, which needs few arithmetic operations at the cost of a little storage space. Our analysis indicates that the proposed reduction is about twice as fast as Barrett reduction. More importantly, the proposed method requires only a reasonable amount of storage, less than 1 M for a 1024-bit modulus, which makes it convenient and appropriate for small devices.

References

[1] Montgomery P., "Modular multiplication without trial division", Mathematics of Computation, 44, pp. 519-521, 1985.
[2] Barrett P., "Implementing the Rivest Shamir and Adleman public key encryption algorithm on a standard digital signal processor", Advances in Cryptology-CRYPTO'86, LNCS, 263, pp. 311-323, Springer-Verlag, 1987.
[3] Fischer W. and Seifert J., "Duality between Multiplication and Modular Reduction", IACR ePrint Archive, 2005.
[4] Kaihara M. and Takagi N., "Bipartite Modular Multiplication Method", IEEE Transactions on Computers, Volume 57, Number 2, pp. 157-164, 2008.
[5] Dupaquis V. and Venelli A., "Redundant Modular Reduction Algorithms", Proceedings of CARDIS 2011, LNCS, 7079, pp. 102-114, Springer-Verlag, 2011.
[6] Cao Z. J. and Wu X. J., "An improvement of the Barrett modular reduction algorithm", International Journal of Computer Mathematics, Taylor & Francis, 2013.
[7] Walter C., "Faster modular multiplication by operand scaling", Advances in Cryptology-CRYPTO'91, LNCS, 576, pp. 313-323, Springer-Verlag, 1991.
[8] Lim C., Hwang H. and Lee P., "Fast modular reduction with pre-computation", Proceedings of the 1997 Korea-Japan Joint Workshop on Information Security and Cryptology, pp. 65-79, 1997.
[9] Aboud S. J. and Prakash E., "Modular Reduction Methods", International Journal of Scientific & Engineering Research, Volume 5, Issue 5, May 2014.
