A New Euclidean Division Algorithm for Residue Number Systems

Jean-Claude Bajard and Laurent Stéphane Didier
Laboratoire d’Informatique de Marseille, CMI, Université de Provence, 39 rue Joliot-Curie, 13453 Marseille Cedex, FRANCE

Jean-Michel Muller
CNRS, Laboratoire de l’Informatique du Parallélisme, 46 Allée d’Italie, 69364 Lyon Cedex 07, FRANCE

Abstract

We propose in this paper a new algorithm and architecture for performing divisions in residue number systems. Our algorithm is suitable for residue number systems with large moduli, with the aim of manipulating very large integers on a parallel computer or a special-purpose architecture. The two basic features of our algorithm are, on the one hand, the use of a high-radix division method and, on the other hand, the use of a floating-point arithmetic that should run in parallel with the modular arithmetic.

1 Introduction

A Residue Number System (abbreviated as RNS) is composed of moduli $m_1, m_2, \ldots, m_n$ that are relatively prime integers (i.e., $\gcd(m_i, m_j) = 1$ if $i \neq j$). A number $X$ is represented in the RNS by the residues $x_i = X \bmod m_i$. When performing additions, subtractions and multiplications, the operations can be performed in parallel for each modulus, using separate ALUs: this allows RNS computations to be completed quickly [14, 24, 28]. Unfortunately, division and comparison appear difficult to perform in a residue number system [10, 15]. The origin of such systems is quite old: it goes back to the well-known Chinese Remainder Theorem (CRT) [18]. A history of Residue Number Systems can be found in [27]. In the following, the set of relatively prime integers $(m_1, \ldots, m_n)$ is called the RNS base, and to simplify we assume that the $m_i$'s are prime numbers.

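The CRT shows that for any $n$-tuple $(x_1, \ldots, x_n)$ such that $0 \le x_i < m_i$, there exists a unique integer $X$, $0 \le X < M = \prod_{i=1}^{n} m_i$, such that $(x_1, \ldots, x_n)$ represents $X$. The CRT also gives an algorithm that computes $X$.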
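As a minimal illustration of the representation and of the CRT reconstruction described above, the following Python sketch uses a small, arbitrarily chosen RNS base. The moduli and the function names `to_rns`, `rns_add`, `rns_mul` and `from_rns` are assumptions made for this example, not notation from the paper.

```python
from math import prod

MODULI = (7, 11, 13)  # illustrative pairwise-coprime moduli (assumed for the example)

def to_rns(x, moduli=MODULI):
    """Residues of x modulo each modulus of the base."""
    return tuple(x % m for m in moduli)

def rns_add(a, b, moduli=MODULI):
    """Channel-wise addition: each modulus is handled independently."""
    return tuple((x + y) % m for x, y, m in zip(a, b, moduli))

def rns_mul(a, b, moduli=MODULI):
    """Channel-wise multiplication: each modulus is handled independently."""
    return tuple((x * y) % m for x, y, m in zip(a, b, moduli))

def from_rns(residues, moduli=MODULI):
    """CRT reconstruction of the unique integer in [0, prod(moduli))."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)  # pow(Mi, -1, m): modular inverse (Python 3.8+)
    return total % M
```

Results are only meaningful while the true value stays below the product of the moduli (1001 here); detecting when it does not is exactly the kind of magnitude question that the residues alone do not answer directly.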

In 1989, D. Gamberger [13, 12] proposed a new algorithm for performing divisions in an RNS. Then, Lu and Chiang proposed an RNS division algorithm based on the combination of a classical division method and a parity checking method [5, 22].

In this paper, we propose a new division algorithm for Residue Number Systems. Our algorithm has been designed for applications where manipulation of huge numbers is required. The first part of this paper is devoted to a “weighted representation”, where a floating-point-like representation is added to the usual RNS representation. In the second part we present our new algorithm for RNS division, which gives the quotient and the remainder of the Euclidean division of two numbers given in RNS form. Then we analyze the various features of our algorithm.

2 Weighted RNS arithmetic

As stated above, our algorithm uses a “weighted arithmetic”, based on the combination of an RNS arithmetic and a floating-point-like arithmetic. The basic idea behind this, namely keeping the order of magnitude of RNS numbers, is not new: it has been proposed [8, 15] in order to detect overflows. What is new is that we use it for comparisons and quotient-digit estimations.

2.1 Definition of the FPL representation

Assume that is the RNS representation of an integer in the RNS-base , where . The integers will be represented in radix . We assume that all the ’s are between the same two consecutive powers , that is to say: there exists such that

Thus, all the computations can be carried out using -digit numbers. Define and as integers satisfying: (1)

Also define and as integers satisfying: (2) (3)

In other words, (4) is a

-digit number, and

is a small number

. We have: (5)

therefore can be viewed as a (weighted) “floating-point representation” of . In the following, it will be called the Floating-Point Like (FPL) representation of the RNS number . There may be several possible values of that satisfy the requirement of the above definition. In practice, those possible values do not all lead to the same accuracy. The most accurate representation is the one with the largest value of .

The FPL representations will be manipulated in parallel with the RNS representations, using the classical floating-point algorithm for addition and a slightly modified one for multiplication (to compute , one first computes using the usual floating-point algorithm, then one multiplies the result by a floating-point approximation of to take into account the “weight” that appears in (5)). Most of the time, the FPL representations will suffice for comparing numbers. However, if the numbers being compared are very close, or if, after many computations, the FPL representations have become too inaccurate, it will become necessary to “refresh” them (i.e. to re-compute them from the RNS representations). It is not necessary to perform a refreshment after each operation: a refreshment is performed when the error on the FPL representations becomes so large that a comparison is impossible. In the following, we show how to construct and how to refresh the FPL representation of an RNS number.
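The idea of carrying an approximation next to the residues, and refreshing it only when a comparison is inconclusive, can be sketched as follows. This is an illustration under simplifying assumptions, not the FPL format defined above: a Python `float` replaces the digit-based mantissa and exponent, the error bookkeeping is a crude first-order bound, and the class `Weighted`, the function `less_than` and the moduli are names and values invented for the example.

```python
from math import prod

MODULI = (13, 17, 19, 23)   # illustrative pairwise-coprime moduli (assumed)
M = prod(MODULI)            # 96577
UNIT = 2.0 ** -52           # rounding error charged per floating-point operation

class Weighted:
    """RNS residues paired with a float approximation and a relative error bound."""

    def __init__(self, residues, approx, rel_err):
        self.residues, self.approx, self.rel_err = residues, approx, rel_err

    @classmethod
    def from_int(cls, x):
        return cls(tuple(x % m for m in MODULI), float(x), UNIT)

    def exact(self):
        """Exact value via the CRT; stands in for the costly conversion."""
        return sum(r * (M // m) * pow(M // m, -1, m)
                   for r, m in zip(self.residues, MODULI)) % M

    def refresh(self):
        """Recompute the approximation from the residues."""
        self.approx, self.rel_err = float(self.exact()), UNIT

    def __add__(self, other):
        res = tuple((a + b) % m for a, b, m in zip(self.residues, other.residues, MODULI))
        # For non-negative operands the relative error of a sum is at most the larger one.
        return Weighted(res, self.approx + other.approx,
                        max(self.rel_err, other.rel_err) + UNIT)

    def __mul__(self, other):
        res = tuple((a * b) % m for a, b, m in zip(self.residues, other.residues, MODULI))
        # Crude bound: relative errors roughly add under multiplication.
        return Weighted(res, self.approx * other.approx,
                        2 * (self.rel_err + other.rel_err) + UNIT)

def less_than(a, b):
    """Compare with the approximations when they are conclusive, else refresh."""
    margin = 2 * (a.rel_err * a.approx + b.rel_err * b.approx)
    if abs(a.approx - b.approx) > margin:
        return a.approx < b.approx
    a.refresh()
    b.refresh()
    return a.exact() < b.exact()
```

In this sketch the exact fallback is only reached when the two approximations are closer than their combined error margin, which mirrors the remark that a refreshment is needed only when a comparison has become impossible.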

2.2 Refreshment

To compute or to refresh the FPL representation of an RNS number $X$, one can use the Mixed-Radix System associated to the RNS base [18]. Unfortunately, this solution seems intrinsically sequential. Another solution is to directly implement the formula that appears in the proof of the Chinese Remainder Theorem. We choose this last solution because it looks much more parallelisable. The proof of the CRT uses the relation:

$$X \equiv \sum_{i=1}^{n} x_i \, M_i \left(M_i^{-1} \bmod m_i\right) \pmod{M}, \quad \text{with } M = \prod_{i=1}^{n} m_i \text{ and } M_i = M / m_i.$$

We can notice that

$$\frac{X}{M} \equiv \sum_{i=1}^{n} \frac{x_i \left(M_i^{-1} \bmod m_i\right)}{m_i} \pmod{1}. \qquad (6)$$

The refreshment method is characterized by the fact that all the computations are performed with integers. Since the evaluation is done modulo 1, we just have to manipulate the fractional part of each term $x_i \left(M_i^{-1} \bmod m_i\right) / m_i$, and the fractional part of the sum.
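The observation that only fractional parts matter can be checked numerically. The sketch below assumes standard CRT notation (the product of the moduli and the modular inverse of each cofactor) and uses exact rational arithmetic from Python's `fractions` module; the truncated fixed-width integer arithmetic of the refreshment method is deliberately not modelled, and the function name `crt_fraction` is hypothetical.

```python
from fractions import Fraction
from math import prod

def crt_fraction(residues, moduli):
    """Return X / M as an exact fraction in [0, 1), keeping only fractional parts."""
    M = prod(moduli)
    acc = Fraction(0)
    for x, m in zip(residues, moduli):
        inv = pow(M // m, -1, m)            # inverse of the cofactor M/m modulo m
        acc += Fraction((x * inv) % m, m)   # fractional part of one term
        acc %= 1                            # discard the integer part of the running sum
    return acc
```

For the base (7, 11, 13) and the residues (2, 1, 5) of 408, the three terms are 3/7, 4/11 and 8/13; their sum is 1409/1001, whose fractional part is 408/1001.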

When and how do we obtain a significant ?

First of all, we can remark that for each integer ,

Define to say:

as the integer constituted by the first

, we have

digits of the fractional part of

, that is

-digit number -digit number is a -digit number

We have, (7) The terms are pre-computed and stored constants. Now, let us multiply the -digit integer by . The result is a -digit integer. We note the integer whose digits are the first digits of the fractional part of . That is:

-digit number “integer part”

-digit number

We have,


-digit number “error part”

From Eq. (7) we deduce,

(8)

Define as the integer whose digits are the first , that is:

digits of the fractional part of

Thus, (9) So, we deduce from (8), (9) and (6) that:

Thus, Using Eq. (1) we can conclude that,

This gives: (10) If we compare (4) and (10), we can see that if

then one can choose:

and if , this means that is small compared with ( ). In such a case, we start the process again, by first multiplying (in the RNS system) by , and trying to convert it to the FPL system ( will be added to the obtained exponent). As a matter of fact, even if , if is small compared with , it may be advantageous to multiply (in the RNS system) by , where is such that , and to start the process again, to get a much more accurate FPL representation. The following theorem summarizes our result.

Theorem 1 If

is such that:


then we can obtain

such that:

only using integer arithmetic with numbers less than .

Remark. It is useful to refresh the FPL representation every additions or multiplications. In this case we can perform the refreshment two times: the first one to find a “good” , the second one to find the corresponding .

2.3 Implementation of the refreshment

The implementation uses -digit Arithmetic Units (AUs) and an FPL unit, which are connected to each other with a -digit bus. This bus is composed of sub-buses connected by gates for the simulation of a binary tree for the evaluation of (figure 1). Thus, we can assume a computation time for . Each -digit arithmetic unit has its own registers and memory to contain , and local variables like , or the partial value of . Such a unit can be a single -digit adder with a control part. It can perform specific additions to compute , (just additions with no overflow control: the overflows correspond to the integer part) and addition and multiplication modulo . The control part of each unit can be commanded by a single sequencer with instructions for the different operations in a ROM. We can use redundant adders. All the numbers are positive, so it is easy to only consider the least significant part knowing that the first significant digit must be positive. To have an idea of the size (which is ), a Borrow-Save implementation (radix and digits in ) with and (which gives ) uses around transistors for the arithmetic part. So it is possible to implement other operations in hardware, for example a modular multiplication [2].
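The binary-tree evaluation of the sum can be pictured in software as a pairwise reduction in which each level halves the number of partial values. The sketch below is only an analogy for the gated bus: the function name `tree_sum_mod` is hypothetical, and reducing modulo a fixed value stands in for additions whose overflow (the integer part) is simply dropped.

```python
def tree_sum_mod(values, modulus):
    """Pairwise (binary-tree) reduction; returns (sum mod modulus, number of levels)."""
    level = [v % modulus for v in values]
    if not level:
        return 0, 0
    depth = 0
    while len(level) > 1:
        # Each pair of neighbours is combined; in hardware these additions run in parallel.
        nxt = [(level[i] + level[i + 1]) % modulus
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # an unpaired value moves up unchanged
        level = nxt
        depth += 1
    return level[0], depth
```

With 32 inputs the reduction finishes in 5 levels, which is the logarithmic depth that the tree-shaped bus is meant to provide.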

3 An RNS division algorithm

We note the RNS representation of . We note the representation of in the Mixed-Radix system associated to the RNS-base . That is to say, with . If is non-zero then the function SupRNS returns the RNS representation of , else it returns the RNS representation of 1. In all cases this function returns a flag such that if then , else

Figure 1: Architecture of the implementation. The Arithmetic Units (AU) are linked to the FPL unit (FPL-U) by a bus whose gated sub-buses form a binary tree. After initialization, at each step of the evaluation the gates either open or close the connections, one AU sends its value to another AU, and the receiving AU computes a partial value.


Theorem 2 If and then for given numbers and such that: with

Algorithm 1
precomputing: construction of
initialization: and
loop: while or ( and SupRNS )
if if else refresh
end: if
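The general shape of such a division, with large quotient digits estimated from a floating-point approximation, exact updates done channel by channel in the RNS, and one exact comparison at the end, can be sketched as follows. This is not a transcription of Algorithm 1: the moduli, the safety factor `SHRINK`, and the use of an exact CRT conversion in place of the refreshment and of the mixed-radix comparison are all assumptions made to keep the example short and checkable.

```python
from math import prod

MODULI = (227, 229, 233, 239, 241, 251)  # illustrative primes (assumed); M is about 2**47
M = prod(MODULI)
SHRINK = 1.0 - 2.0 ** -40                # keeps every estimated digit <= the true quotient

def to_rns(x):
    return tuple(x % m for m in MODULI)

def from_rns(r):
    """Exact CRT conversion; stands in for refreshment and the final exact comparison."""
    return sum(x * (M // m) * pow(M // m, -1, m) for x, m in zip(r, MODULI)) % M

def rns_divide(a, b):
    """Return (q, r) in RNS form with a = q*b + r and 0 <= r < b, for b != 0."""
    b_approx = float(from_rns(b))
    q, r = to_rns(0), a
    while True:
        r_approx = float(from_rns(r))              # refreshed approximation of the remainder
        digit = int(r_approx / b_approx * SHRINK)  # under-estimated quotient digit
        if digit == 0:
            break
        # Channel-wise updates: r <- r - digit*b and q <- q + digit.
        r = tuple((x - digit * y) % m for x, y, m in zip(r, b, MODULI))
        q = tuple((x + digit) % m for x, m in zip(q, MODULI))
    if from_rns(r) >= from_rns(b):                 # final exact correction
        r = tuple((x - y) % m for x, y, m in zip(r, b, MODULI))
        q = tuple((x + 1) % m for x, m in zip(q, MODULI))
    return q, r
```

Because each digit is deliberately under-estimated, the remainder never becomes negative, and when the estimate finally drops to zero the remainder is below twice the divisor, so a single exact comparison and at most one subtraction complete the division.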

4 Analysis of the RNS division algorithm

4.1 Proof of the “while” loop: decrease of

Let us denote

. We have,

and, Thus,


to

, the following algorithm evaluates

(11) as we have, we obtain,

thus,

If we have, (12) (this is possible with two refreshments) then,

so,

Thus (13) As, we have,

Eq. (13) becomes,

as we have, we obtain,

(14) Thus we have no overflow problems in because . And, if then the next value of can be , else . This confirms that decreases to .

4.2 When

As we have seen in (14), and considering (12) we deduce that just before the last iteration we have and . Now the last iteration gives with (13),

and (14) becomes, in other words for

,

(15) This requires at most another iteration with an exact comparison using a Mixed Radix number system.

4.3 Implementations

Now we can complete the description of the implementation of the refreshment, in particular the description of the implementation of the function RNSsup. The evaluation of the RNS representation of is done in two steps. One in the FPL-unit for the evaluation of two numbers and such that . is a -digit number, thus modulo for each . To compute

and

Algorithm 2 if then we compute If

else

in the FPL unit

such that then inf

and


To compute modulo in the -th Arithmetic Unit (AU):

The FPL unit broadcasts to each arithmetic unit. The -th AU receives and computes such that modulo . The value modulo of is read in a local table or is computed using the radix- decomposition of (all the digits are less than ) with a lookup of the values modulo of for in a table. In this last case the table is very small (if and then it only contains one value). Thus is evaluated with at most modular multiplications.

4.4 Performances

At each step of the “while” loop we compute the RNSsup function, we perform one refreshment, one modular multiplication and two modular additions. The number of steps is at most equal to . Thus the cost of this loop is at most , in other words . Now the last step of our algorithm is done in time [18].

Theorem 3 The proposed RNS-division algorithm is a -time algorithm with space.

5 Other recent algorithms

Comparison and division are difficult problems in RNS arithmetic, and many algorithms have been proposed in the literature to cope with them. D. Gamberger presented an original RNS division algorithm without comparisons [13]. The execution time of his algorithm only depends on the value of the divisor. Each iteration requires modular steps, and the number of iterations is often greater than . For example with moduli , , , and , with a divisor close to , the maximum number of iterations is equal to . The execution time of our algorithm does not depend on the value of the divisor; it only depends on the size of the difference between the dividend and the divisor. The worst case of our algorithm has a better execution time than the average case of Gamberger’s algorithm.

More recently, M. Lu and J.S. Chiang proposed an RNS division based on a classical high-radix division algorithm with comparisons performed using parity checking [5, 22]. To have an efficient -time algorithm they use tables. The size of each table is proportional to . For example, if then the size of the tables is close to bits. At each iteration, Lu and Chiang’s algorithm computes a binary digit of the quotient. Each iteration requires the evaluation of . Our algorithm computes a radix- digit at each iteration. Each iteration requires a floating point division and the evaluation of . Thus the execution time of our algorithm is better as soon as times the time of one iteration of Lu and Chiang’s algorithm is greater than the floating point division time. Moreover, we are not limited by the size of a table.


6 Conclusion

We have proposed an efficient algorithm for RNS division, the implementation of which is realistic. The execution time of this algorithm is better than that of previously published algorithms, and it does not require large tables. We do not claim that our algorithm is attractive for applications such as computer cards (e.g. credit cards or phone cards); for such applications, other algorithms (see for instance [26]) look more promising. The use of our algorithm to perform modular multiplications of huge numbers (i.e. an RNS multiplication followed by an RNS division) would be comparable to the use of modular multiplication algorithms proposed by N. Takagi [25, 26] or P. Kornerup [19] as soon as .

References

[1] G. Alia and E. Martinelli. On the lower bound to the VLSI complexity of number conversion from weighted to residue representation. IEEE Transactions on Computers, 42(8):962, August 1993.
[2] G. Alia and E. Martinelli. A VLSI modulo m multiplier. IEEE Transactions on Computers, 40(7):873, July 1994.
[3] D.K. Banerji, T.Y. Cheung, and V. Ganesan. A High-Speed Division Method in Residue System. In 5th IEEE Symp. on Comp. Arith., page 158, 1981.
[4] P.W. Beame, S.A. Cook, and H.J. Hoover. Log depth circuits for division and related problems. SIAM J. Comp., 15(4):994, November 1986.
[5] J.S. Chiang and Mi Lu. A general division algorithm for residue number system. In P. Kornerup and D. Matula, editors, Proceedings of the 10th Symposium on Computer Arithmetic, page 76. IEEE Computer Society Press, 1991.
[6] W.A. Chren. A New Residue Number System Division Algorithm. Comput. Math. Appl., 19(7):13, 1990.
[7] L. Ciminiera and P. Montuschi. Over-redundant digit sets and the design of digit-by-digit division units. IEEE Transactions on Computers, 43(3):269, March 1994.
[8] E.D. Di Claudio, G. Orlandi, and F. Piazza. A systolic redundant residue arithmetic error correction circuit. IEEE Transactions on Computers, 42(4):427, April 1994.
[9] G.I. Davida and B. Litow. Fast parallel arithmetic via modular representation. SIAM J. Comp., 20(4):756, August 1991.
[10] G. Dimauro, S. Impedovo, and G. Pirlo. A new technique for fast number comparison in the residue number systems. IEEE Transactions on Computers, 42(5):607, May 1993.
[11] S.E. Eldridge and C.D. Walter. Hardware implementation of Montgomery’s modular multiplication algorithm. IEEE Transactions on Computers, 42(6):693, July 1993.
[12] D. Gamberger. Incomplete specified numbers in residue number system - Definition and applications. In M.D. Ercegovac and E. Swartzlander, editors, Proceedings of the 9th Symposium on Computer Arithmetic, pages 210–215. IEEE Computer Society Press, 1989.
[13] D. Gamberger. New Approach to Integer Division in Residue Number System. In P. Kornerup and D. Matula, editors, Proceedings of the 10th Symposium on Computer Arithmetic, pages 84–91. IEEE Computer Society Press, 1991.
[14] H.L. Garner. The Residue Number System. IRE Trans. Electronic Computer, 8:140, June 1959.
[15] S. Kaushik. Sign detection in non-redundant residue number system with reduced information. In 6th Symposium on Computer Architecture. IEEE Computer Society Press, 1983.
[16] Y.A. Keir, P.W. Cheney, and M. Tannenbaum. Division and Overflow Detection in Residue Number Systems. IRE Trans. Electron. Comp., 11, August 1962.
[17] E. Kinoshita, H. Kosako, and Y. Kojima. General Division in Symmetric Residue Number System. IEEE Transactions on Computers, 22:134, February 1973.
[18] D. Knuth. The Art of Computer Programming, volume 2. Addison Wesley, 1973.
[19] P. Kornerup. High-radix modular multiplication for cryptosystem. In 11th IEEE Symposium on Computer Arithmetic, 1993.
[20] P. Kornerup. A systolic, linear-array multiplier for a class of right-shift algorithm. IEEE Transactions on Computers, 43(8):892, August 1994.
[21] M.L. Lin, E. Leiss, and B. McInnis. Division and Sign Detection Algorithms for Residue Number Systems. Comput. Math. Appl., 10(4/5):331, 1984.
[22] Mi Lu and J.S. Chiang. A novel division algorithm for the residue number system. IEEE Transactions on Computers, 1992.
[23] M. Shand and J. Vuillemin. Fast implementation of RSA cryptography. In ARITH11, 11th Symposium on Computer Arithmetic, Windsor, Canada, 1993.
[24] N.S. Szabo and R.I. Tanaka. Residue Arithmetic and its Applications to Computer Technology. McGraw-Hill, 1967.
[25] N. Takagi. A Radix-4 Modular Multiplication Hardware Algorithm Efficient for Iterative Modular Multiplications. In P. Kornerup and D. Matula, editors, Proceedings of the 10th Symposium on Computer Arithmetic, pages 35–41. IEEE Computer Society Press, 1991.
[26] N. Takagi. A modular multiplication algorithm with triangle additions. In Proceedings of the 11th Symposium on Computer Arithmetic, page 272. IEEE Computer Society Press, 1993.
[27] F.J. Taylor. Residue Arithmetic: a tutorial with examples. IEEE Computer Magazine, May 1984.
[28] F.J. Taylor. A more efficient residue arithmetic implementation of the FFT. In 5th Symposium on Computer Arithmetic. IEEE Computer Society Press, 1985.
[29] C.D. Walter. Systolic modular multiplication. IEEE Transactions on Computers, 42(3):376, March 1993.
