Laboratoire de l’Informatique du Parallélisme Ecole Normale Supérieure de Lyon Unité de recherche associée au CNRS n°1398

FPGA Implementation of Polynomial Evaluation Algorithms

Milos Ercegovac, Jean-Michel Muller, Arnaud Tisserand

November 1995

Research Report No 95-34

Ecole Normale Supérieure de Lyon, 46 Allée d’Italie, 69364 Lyon Cedex 07, France. Telephone: (+33) 72.72.80.00. Fax: (+33) 72.72.80.80. Email: [email protected]−lyon.fr


Abstract

The most-significant-digit-first function evaluation method (E-method) allows efficient evaluation of polynomials and certain rational functions on custom hardware. The computation time is of the order of m carry-free additions, where m is the number of digits of the result. We discuss a digit-parallel and a digit-serial implementation of this method on a DECPeRLe-1 board, built from Xilinx FPGAs. After presenting the E-method, we describe the architecture of the DECPeRLe-1 board, present our designs and analyze their performance.

Keywords: On-line arithmetic, FPGA, Polynomial evaluation

1 Introduction

This paper shows how reprogrammable hardware, such as field-programmable gate arrays (FPGAs), can be used to evaluate polynomials and thus approximations of transcendental functions. This approach is attractive in two situations: when too few users need a given function to make a dedicated hardware implementation economical (e.g., for special functions such as Bessel functions and the Gamma function), and when the required precision is greater than the precision provided by the available floating-point processors.

A standard way to evaluate a transcendental function is to approximate it by a polynomial or a rational function. Any continuous function can be approximated, as closely as desired, by a polynomial. There are other methods (for instance shift-and-add CORDIC methods [14, 15]), but they apply to a small number of functions only. The best minimax polynomial approximation of a given degree to a function can be obtained, for example, by Remes' algorithm [11]. As a consequence, transcendental functions are frequently computed using polynomials (see for instance [6, 13]). Polynomial or rational approximations of the most common transcendental functions can be found in [10].

1.1 The E-method

The E-method, introduced in [7, 8], allows efficient solution of diagonally dominant systems of linear equations on simple and highly regular hardware. Since the evaluation of polynomials and certain rational functions can be achieved by solving the corresponding linear systems, the E-method is an attractive general approach for function evaluation. In this paper we concentrate on the evaluation of polynomials assuming radix-2 arithmetic.

Consider the evaluation of $p(x) = p_n x^n + p_{n-1} x^{n-1} + \cdots + p_0$. One can show that $p(x)$ is equal to $y_0$, where $[y_0, y_1, \ldots, y_n]^t$ is the solution of the following linear system:

\[
\begin{bmatrix}
1 & -x & 0 & \cdots & 0 \\
0 & 1 & -x & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & 1 & -x \\
0 & \cdots & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix} y_0 \\ y_1 \\ \vdots \\ y_{n-1} \\ y_n \end{bmatrix}
=
\begin{bmatrix} p_0 \\ p_1 \\ \vdots \\ p_{n-1} \\ p_n \end{bmatrix}
\tag{1}
\]
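Because the matrix of system (1) is upper bidiagonal, back-substitution from the last row gives $y_n = p_n$ and $y_i = p_i + x\,y_{i+1}$, which is Horner's rule. The following Python sketch checks this on exact rationals. The function name `solve_bidiagonal` and the sample coefficients are illustrative assumptions and do not come from the report.

```python
from fractions import Fraction as F

def solve_bidiagonal(p, x):
    """Solve system (1) by back-substitution: y_n = p_n, y_i = p_i + x*y_{i+1}."""
    n = len(p) - 1
    y = [F(0)] * (n + 1)
    y[n] = p[n]
    for i in range(n - 1, -1, -1):
        y[i] = p[i] + x * y[i + 1]
    return y

# Hypothetical sample polynomial (coefficients p_0 .. p_3) and argument.
p = [F(1, 2), F(-1, 4), F(3, 8), F(1, 8)]
x = F(3, 32)
y = solve_bidiagonal(p, x)
print(y[0])  # y_0, which equals p(x)
```

The first component `y[0]` can be compared with a direct evaluation of the polynomial to confirm that solving (1) and evaluating $p(x)$ are the same computation.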

The radix-2 E-method consists in solving this linear system by using the following basic recursion:

\[
w^{(0)} = [p_0, p_1, \ldots, p_n]^t, \qquad
w^{(j)} = 2\left[ w^{(j-1)} - A\,d^{(j-1)} \right]
\tag{2}
\]

where $A$ is the matrix of system (1), i.e., for $i = 0, \ldots, n$,

\[
w_i^{(j)} = 2\left[ w_i^{(j-1)} - d_i^{(j-1)} + d_{i+1}^{(j-1)}\, x \right]
\]

where the values $d_i^{(j)} \in \{-1, 0, 1\}$ (with $d_{n+1}^{(j)} = 0$, since the last row of the matrix has no off-diagonal term). Define the number $D_i^{(j)} = d_i^{(0)}.d_i^{(1)} d_i^{(2)} \ldots d_i^{(j)}$ (the $d_i^{(j)}$ are the digits of a radix-2 signed-digit [1] representation of $D_i^{(j)}$). One can show that if the sequence $|w_i^{(j)}|$ is bounded, then $D_i^{(j)}$ goes to $y_i$ as $j$ goes to infinity. The problem at step $j$ is to find a selection function that gives the digits $d_i^{(j)}$ from the terms $w_i^{(j)}$ such that the values $w_i^{(j+1)}$ remain bounded. In [8], the following selection function (a form of rounding) is proposed

\[
s(u) =
\begin{cases}
0, & \text{if } |u| < 1/2 \\
\operatorname{sign} u, & \text{otherwise}
\end{cases}
\tag{3}
\]

and applied to the following cases:

1. $d_i^{(j)} = s(w_i^{(j)})$, i.e., the selection requires non-redundant $w_i^{(j)}$;

2. $d_i^{(j)} = s(\hat{w}_i^{(j)})$, where $\hat{w}_i^{(j)}$ is an approximation of $w_i^{(j)}$ (in practice, $\hat{w}_i^{(j)}$ is deduced from a few digits of $w_i^{(j)}$ by rounding or truncation).

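Assume

\[
\forall i,\ |p_i| \le \xi, \qquad |x| \le \alpha, \qquad \left| w_i^{(j)} - \hat{w}_i^{(j)} \right| \le \frac{\Delta}{2}.
\]

The E-method gives a correct result provided that the above defined bounds $\xi$, $\alpha$, and $\Delta$ satisfy

\[
\xi = \frac{1}{2}(1 + \Delta), \qquad 0 < \Delta < 1, \qquad \ldots
\]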
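The behaviour of recursion (2) with the selection function (3) can be explored with a small exact simulation. The Python sketch below is illustrative only: it uses case 1 (selection on the exact $w_i^{(j)}$), rational arithmetic instead of a redundant hardware representation, and sample values with $|p_i| \le 3/4$ and $|x| \le 1/8$ that are assumptions of the example. The names `select` and `e_method` are hypothetical.

```python
from fractions import Fraction as F

def select(u):
    """Selection function s(u) of Eq. (3): 0 if |u| < 1/2, else sign(u)."""
    if abs(u) < F(1, 2):
        return 0
    return 1 if u > 0 else -1

def e_method(p, x, m):
    """Run m steps of recursion (2); return D_0 built from digits d^(0)..d^(m-1)."""
    n = len(p) - 1
    w = list(p)                      # w^(0) = [p_0, ..., p_n]
    D = [F(0)] * (n + 1)
    for j in range(m):
        d = [select(wi) for wi in w]
        # Digit d^(j) has weight 2^-j (d^(0) is the integer digit).
        D = [Di + F(di, 2 ** j) for Di, di in zip(D, d)]
        d_next = d[1:] + [0]         # convention d_{n+1} = 0
        w = [2 * (w[i] - d[i] + d_next[i] * x) for i in range(n + 1)]
        # With |x| <= 1/8 and |p_i| <= 3/2 the residuals stay bounded.
        assert all(abs(wi) <= F(3, 2) for wi in w)
    return D[0]

# Hypothetical sample data.
p = [F(1, 2), F(-1, 4), F(3, 8), F(1, 8)]
x = F(3, 32)
exact = sum(c * x ** i for i, c in enumerate(p))
approx = e_method(p, x, 16)
print(float(exact), float(approx))
```

In this setting the error after $m$ digits is $2^{-m} \sum_i x^i w_i^{(m)}$, so with bounded residuals the result gains roughly one bit per step, most significant digit first.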

\[
\begin{cases}
\omega \le 2 - 2\alpha \\
\omega \ge -2\tau_1 + 2 + 2\alpha \\
\omega \ge 2\tau_1 + 2^{-k+1} + 2\alpha \\
\omega \ge 2\tau_2 - 2^{-k+1} + 2\alpha \\
\omega \ge 2 + 2\alpha - 2\tau_2 + 2^{-k+2}
\end{cases}
\]

For $k = 2$ and $\alpha = 1/8$, we get:

\[
\begin{cases}
\omega \le 7/4 \\
\omega \ge 9/4 - 2\tau_1 \\
\omega \ge 2\tau_1 + 3/4 \\
\omega \ge 2\tau_2 - 1/4 \\
\omega \ge 13/4 - 2\tau_2
\end{cases}
\]

These conditions are satisfied if $\omega = 7/4$, $\tau_1 = 1/4$ or $1/2$ and $\tau_2 = 1$ or $3/4$. With these values, the following table sums up the possible choices for $d_i^{(j-1)}$ depending on the bits of weight $2^1$, $2^0$, $2^{-1}$ and $2^{-2}$ of the 2's complement representation of $\hat{w}_i^{(j-1)}$.

2's complement    value of \hat{w}_i^{(j-1)}    d_i^{(j-1)}
00.00              0                             0
00.01              1/4                           1 or 0
00.10              1/2                           1
00.11              3/4                           1
01.00              1                             1
01.01              5/4                           1
01.10              3/2                           1
01.11              7/4                           1
10.00             -2                             never occurs
10.01             -7/4                          -1
10.10             -3/2                          -1
10.11             -5/4                          -1
11.00             -1                            -1
11.01             -3/4                          -1 or 0
11.10             -1/2                           0
11.11             -1/4                           0

It is worth noting that only the first three bits of $\hat{w}_i^{(j-1)}$ are needed to choose $d_i^{(j-1)}$.
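The remark that three bits suffice can be made concrete. The Python sketch below picks, for the two rows of the table that allow a choice, the value that makes the digit independent of the bit of weight $2^{-2}$; this corresponds to the thresholds $1/2$ (positive side) and $3/4$ (negative side), which is an assumption of the example. The function names `value` and `select_3bits` are hypothetical.

```python
from fractions import Fraction as F

def value(b1, b0, bm1, bm2):
    """Value of the 2's complement pattern b1 b0 . bm1 bm2 (b1 has weight -2)."""
    return -2 * b1 + b0 + F(bm1, 2) + F(bm2, 4)

def select_3bits(b1, b0, bm1):
    """Digit chosen from the bits of weight 2^1, 2^0 and 2^-1 only."""
    if b1 == 0:                          # estimate is non-negative
        return 1 if (b0 or bm1) else 0   # 1 as soon as the estimate is >= 1/2
    return 0 if (b0 and bm1) else -1     # 0 only for 11.1x, i.e. -1/2 and -1/4

# Print the resulting digit for every 4-bit pattern of the table.
for bits in range(16):
    b1, b0, bm1, bm2 = (bits >> 3) & 1, (bits >> 2) & 1, (bits >> 1) & 1, bits & 1
    print(f"{b1}{b0}.{bm1}{bm2}", value(b1, b0, bm1, bm2), select_3bits(b1, b0, bm1))
```

Each printed digit lies within the set of choices the table allows for that pattern; the pattern 10.00 is listed only for completeness, since the table marks it as never occurring.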

2.2 Selection in Borrow-Save Form

Let us call $\hat{w}_i^{(j)}$ the number obtained by converting into 2's complement the number constituted by the bits of weights greater than or equal to $2^{-k}$ of the borrow-save representation of $w_i^{(j)}$. We have:

\[
\hat{w}_i^{(j)} - 2^{-k} \le w_i^{(j)} \le \hat{w}_i^{(j)} + 2^{-k}
\tag{6}
\]

We assume that $|x|$ is less than $\alpha$, and that we want to have $|w_i^{(j)}| \le \omega$ for any $j$. We also assume that the choice of $d_i^{(j-1)}$ is the following:

\[
d_i^{(j-1)} =
\begin{cases}
1 & \text{if } \hat{w}_i^{(j-1)} \ge \tau \\
0 & \text{if } -\tau + 2^{-k} \le \hat{w}_i^{(j-1)} \le \tau - 2^{-k} \\
-1 & \text{if } \hat{w}_i^{(j-1)} \le -\tau
\end{cases}
\]

We want to find values of $k$, $\tau$, and $\omega$ such that $|w_i^{(j-1)}| \le \omega \Rightarrow |w_i^{(j)}| \le \omega$. In a manner similar to that of the previous section, we derive the following requirements:

\[
\begin{cases}
\omega \le 2 - 2\alpha \\
\omega \ge -2\tau + 2^{-k+1} + 2 + 2\alpha \\
\omega \ge 2\tau + 2\alpha
\end{cases}
\]

For instance, with $k = 2$ and $\alpha = 1/8$ we get conditions that are satisfied with $\omega = 7/4$ and $\tau = 1/2$. These conditions are similar to those of the carry-save implementation, allowing also the use of only 3 bits of $\hat{w}_i^{(j-1)}$.

We conclude that the carry-save and borrow-save representations have a similar complexity in the implementation of the selection function. We assume the borrow-save representation in the rest of the paper.
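The borrow-save requirements can be checked numerically. The Python sketch below is an illustration under stated assumptions: the threshold is called `tau`, the three inequalities are written as $\omega \le 2 - 2\alpha$, $\omega \ge -2\tau + 2^{-k+1} + 2 + 2\alpha$ and $\omega \ge 2\tau + 2\alpha$, and the estimate is taken to be a multiple of $2^{-k}$ within $2^{-k}$ of the true residual. The function names are hypothetical.

```python
from fractions import Fraction as F

def conditions_hold(k, alpha, tau, omega):
    """Check the three requirements on omega for given k, alpha and tau."""
    return (omega <= 2 - 2 * alpha
            and omega >= -2 * tau + F(2) ** (-k + 1) + 2 + 2 * alpha
            and omega >= 2 * tau + 2 * alpha)

def select(w_hat, tau):
    """Threshold selection on the estimate w_hat."""
    if w_hat >= tau:
        return 1
    if w_hat <= -tau:
        return -1
    return 0

def bound_preserved(k, alpha, tau, omega):
    """Sweep all estimates and extreme residuals: does |w| <= omega survive a step?"""
    step = F(1, 2 ** k)
    limit = int((omega + step) / step) + 1
    for t in range(-limit, limit + 1):
        w_hat = t * step
        lo, hi = max(w_hat - step, -omega), min(w_hat + step, omega)
        if lo > hi:
            continue                       # no admissible w for this estimate
        d = select(w_hat, tau)
        for w in (lo, hi):                 # the update is linear, ends suffice
            for e in (-alpha, alpha):      # extreme values of d_{i+1} * x
                if abs(2 * (w - d + e)) > omega:
                    return False
    return True

print(conditions_hold(2, F(1, 8), F(1, 2), F(7, 4)),
      bound_preserved(2, F(1, 8), F(1, 2), F(7, 4)))
```

The sweep is a direct check of the implication $|w_i^{(j-1)}| \le \omega \Rightarrow |w_i^{(j)}| \le \omega$ and can be used to try other values of $k$, $\alpha$ and the threshold.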

2.3 Mapping of LUs onto FPGA board

In our first design we map one line-unit (LU) to a single FPGA. This limits the precision to 32 bits and, consequently, the largest polynomial degree to 11 (since $|x| \le 1/8$). Fig. 3 shows the mapping of the $i$th line-unit (which computes $d_i$) and the communication scheme.

Figure 3: Line-unit mapping and communication scheme (loading stage and computing stage; WB = West Bus, SB = South Bus, DC = Direct Connection)

The coefficients of the polynomial are loaded into their respective LUs at the beginning through the West bus (16-bit width). During that step, we can select the function to be evaluated by loading the corresponding coefficients. The control signals distributed by the South bus determine which FPGA of the matrix will get the digits of $p_i$ and which part of the coefficients is transmitted during each cycle ($p_i$ may have many digits, whereas the bus is only 16 bits wide). During the second step, the “computing step”, $x$ enters the last LU. Thus the function evaluator can work in a pipeline mode, and $x$ can change at each step.

The E-method allows computation of all $d_i^{(j)}$ in parallel. This requires a direct link between LUs, which is not possible on the FPGA board used. In this design, $d_i^{(j)}$ is obtained after $d_{i+1}^{(j)}$ because the bus limits communication between the LUs to a single digit at a time. Fig. 4 shows the activity of the FPGAs during the computation. In this figure the time unit is the LU cycle time. The time needed to compute the first digit of the result $p(x)$, which corresponds to the pipeline latency, is $(n + 1)$ LU cycles. The total evaluation time of $p(x)$ is this latency plus $m$ LU cycles.

The implementation of the selection function is based on a conversion of the first 3 digits from the borrow-save system to the 2's complement system and does not affect the cycle time significantly.

The design obtained in this first implementation of the E-method is not very fast. Our goal in this design was to have one LU per FPGA, in order to be able to evaluate polynomials of large degree. On the DECPeRLe-1 board, with the architecture described above, we can evaluate degree-11 polynomials of a 32-bit argument. The clock frequency is around 30 MHz and 8 clock cycles are required for computing a new digit $d_i$.

Our first design is limited to a precision of 32 bits, which corresponds roughly to single precision. It is impossible to put more digits in a single FPGA. Of course, we can use several FPGAs to implement one LU; the attainable precision then becomes much larger, and we can evaluate polynomials with 64-bit precision. With more recent FPGAs, the previous results would be better: the Xilinx 4025 chips are four times larger than ours, and they run at least twice as fast. With such chips, and with our first design, one could evaluate degree-40 polynomials with 128-bit operands, or degree-20 polynomials with 64-bit operands, at a clock frequency of around 60 MHz.
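The timing figures of the digit-parallel design can be combined into a rough estimate. The Python sketch below assumes that one LU cycle equals the 8 clock cycles needed for a new digit, that the latency is $(n + 1)$ LU cycles and that $m$ further LU cycles deliver the $m$ digits; the function name and parameter names are hypothetical, and the result is an order-of-magnitude model rather than a measured value.

```python
def evaluation_time_ns(n, m, clock_mhz=30.0, clocks_per_lu_cycle=8):
    """Return (latency, total time) in ns for degree n and m result digits."""
    lu_cycle_ns = clocks_per_lu_cycle * 1000.0 / clock_mhz
    latency_ns = (n + 1) * lu_cycle_ns          # first digit of p(x)
    total_ns = latency_ns + m * lu_cycle_ns     # remaining digits, one per LU cycle
    return latency_ns, total_ns

# Degree-11 polynomial, 32-digit result, figures quoted for the first design.
latency, total = evaluation_time_ns(n=11, m=32)
print(f"latency = {latency:.0f} ns, total = {total / 1000:.2f} us")
```

Changing `clock_mhz` to 60 gives the corresponding estimate for the faster chips mentioned above, under the same assumptions.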

3 On-Line Implementation of the E-Method

In on-line arithmetic [9], the data circulate through the arithmetic operators in a digit-serial fashion, most significant digit first. One of the main advantages of on-line arithmetic compared to conventional (i.e., least significant digit first) digit-serial arithmetic is that certain functions (division, square root, elementary functions) are computable most significant digit first, thus allowing digit-level overlap between successive computations. Moreover, in most applications, the most significant digits are the most important, so it is more attractive to obtain them first.

In this section, we assume that the input operand $x$ is available on-line, i.e., serially, most significant digit first. To take this into account, we modify the basic recurrence of the E-method. We get:

\[
w_i^{(j)} = 2\left( w_i^{(j-1)} + x^{(j-1)} d_{i+1}^{(j-1)} - d_i^{(j-1)} + 2^{-\delta-1}\, x_{\delta+j}\, D_{i+1}^{(j-1)} \right)
\]

where $x^{(j-1)}$ denotes the part of $x$ received so far, $x_{\delta+j}$ the incoming digit and $\delta$ the on-line delay.

In our design, the selection uses 6 bits of $w_i^{(j)}$. Fig. 5 shows the architecture of the on-line LUs. There is more control logic in the on-line units because the accumulator receives digits during each step of the computation. The implementation of the on-line E-method is more complex than the digit-parallel one. The control logic requires more space in one FPGA and there are more registers used for synchronization. With one LU per XC3090 we can manipulate 20-bit arguments only, and the frequency is about 33 MHz. As in the previous section, those results would improve to 80-bit arguments, and to around 66 MHz, with more recent chips.

Figure 4: LU activity during the evaluation

Figure 5: On-line LU architecture

The loading step is performed in an on-line fashion, which simplifies the interface. The coefficients of the polynomial are distributed into their respective LUs digit by digit. This mode allows simpler chip connectivity. Fig. 6 shows the mapping of one on-line LU.

Figure 6: On-line LU implementation
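The idea of an operand arriving most significant digit first can be illustrated independently of the hardware. The Python sketch below accumulates the partial values of a radix-2 signed-digit fraction as its digits arrive; the digit string and the function name `online_prefixes` are assumptions made for the example.

```python
from fractions import Fraction as F

def online_prefixes(digits):
    """Yield the partial values of 0.d_1 d_2 ... as digits arrive, MSD first."""
    value = F(0)
    for k, d in enumerate(digits, start=1):
        assert d in (-1, 0, 1)       # radix-2 signed digits
        value += F(d, 2 ** k)        # digit d_k has weight 2^-k
        yield value

# Hypothetical on-line input.
digits = [1, 0, -1, 1, 1, -1, 0, 1]
prefixes = list(online_prefixes(digits))
x = prefixes[-1]
for j, partial in enumerate(prefixes, start=1):
    # After j digits the unknown tail is at most 2^-j in magnitude.
    print(j, partial, abs(x - partial) <= F(1, 2 ** j))
```

After $j$ digits the operand is already known to within $2^{-j}$, which is what allows a consumer such as an LU to start working before the whole operand has arrived.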

4 Variable Precision Implementation

We have seen in the previous sections that the major problem is the limited precision. In order to provide flexible precision in a reconfigurable architecture for polynomial evaluation, we consider the following “multiple-precision” scheme [7]. The idea is to segment the adders and registers. We design a small circuit which computes the digits of $w_i^{(j)}$ whose rank is between $k$ and $k + \sigma$, where $\sigma$ is the segment size and $k$ is a multiple of $\sigma$ between 0 and $m$. To perform the computation of one term $d_i^{(j)}$, we iterate $\lceil m/\sigma \rceil$ cycles. We use the memory banks of the DECPeRLe-1 board to store the inactive register segments. The computation part of the LU circuit is now simpler, but the control logic necessary to perform the communications is more complex. An overview of this architecture is given in Fig. 7. The connectivity between the RAM banks and the matrix of FPGAs is designed to keep all the FPGAs busy on every clock cycle (once the pipeline is initialized).

Figure 7: Multi-precision architecture (RAM / line-unit connectivity; each matrix FPGA $i$, $i \in \{1, 2, 3, 4\}$, contains a 3-input borrow-save adder)

If we waited for the end of the $\lceil m/\sigma \rceil$ iterations to add $-d_i^{(j-1)}$, we would get a very significant delay for each computation of a term $d_i^{(j)}$. In order to overcome this drawback, we add $-d_i^{(j-1)}$ during the first iteration. This is possible because in a redundant number system there is no carry propagation. With this improvement the global delay remains of the same order as in the previous designs. A very high level of pipelining is possible with this technique. All the data (digits of $x$, $w$) can be distributed in on-line fashion.

Because of the slow access to the static RAM banks (50 ns), the frequency of this architecture is limited to 20 MHz. The number of cycles for the generation of one digit $d_i^{(j)}$ depends on the required precision (the number of memory accesses). Here, each segment has a size of 8 bits (4 borrow-save digits). Several cycles are required to feed each LU with the proper $x$, $w$ and $D$ segments. This architecture may be viewed as a fast multiple-precision kernel usable in many applications. The maximal precision is limited mainly by the RAM size.
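The dependence of the digit time on the precision can be sketched with a simple count. The Python model below assumes 4 borrow-save digits per segment and a 20 MHz clock, as quoted above; the number of clock cycles spent per segment is not given in the text, so `cycles_per_segment` is a hypothetical parameter, as are the function names.

```python
import math

def cycles_per_digit(m, segment_digits=4, cycles_per_segment=1):
    """Clock cycles to produce one digit when w is split into segments."""
    segments = math.ceil(m / segment_digits)   # one iteration per segment
    return segments * cycles_per_segment

def digit_time_ns(m, clock_mhz=20.0, **kwargs):
    """Time per result digit in ns for an m-digit residual."""
    return cycles_per_digit(m, **kwargs) * 1000.0 / clock_mhz

for m in (32, 64, 128):
    print(m, cycles_per_digit(m), digit_time_ns(m))
```

Under these assumptions the time per digit grows linearly with the precision, while the hardware per LU stays fixed, which is the trade-off the segmented scheme makes.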

5 Conclusion

We have analyzed, designed and implemented two versions of the E-method for polynomial evaluation using the DECPeRLe-1 board with Xilinx 3090 FPGA chips. Our designs demonstrate a relatively modest performance and very good flexibility with respect to custom VLSI implementations. Using more advanced FPGA chips, we expect significant improvement in performance and usability. We have also indicated a scheme suitable for large-precision evaluation of polynomials using the E-method.

Acknowledgements

This work has been supported in part by the NSF Grant INT9217512 “On-Line Arithmetic: From Theoretical Studies to Implementations” and the CNRS Project “L’arithmétique en ligne: des études théoriques à l’implémentation.”

References

[1] A. Avizienis. Signed-digit number representations for fast parallel arithmetic. IRE Transactions on Electronic Computers, 10:389–400, 1961. Reprinted in E.E. Swartzlander, Computer Arithmetic, Vol. 2, IEEE Computer Society Press Tutorial, 1990.

[2] J.C. Bajard, J. Duprat, S. Kla, and J.M. Muller. Some operators for on-line radix-2 computations. Journal of Parallel and Distributed Computing, 22(2):336–345, 1994.

[3] P. Bertin, D. Roncin, and J. Vuillemin. Introduction to programmable active memories. Technical report, DEC Paris Research Lab., 1989.

[4] P. Bertin, D. Roncin, and J. Vuillemin. Programmable active memories: A performance assessment. Technical Report 24, DEC Paris Research Lab., 1993.

[5] P. Bertin, M. Shand, and J. Vuillemin. Hardware speedups in long integer multiplication. Technical report, DEC Paris Research Laboratory, 1991.

[6] W. Cody and W. Waite. Software Manual for the Elementary Functions. Prentice-Hall Inc, 1980.

[7] M.D. Ercegovac. A general method for evaluation of functions and computation in a digital computer. PhD thesis, Dept. of Computer Science, University of Illinois, Urbana-Champaign, 1975.

[8] M.D. Ercegovac. A general hardware-oriented method for evaluation of functions and computations in a digital computer. IEEE Trans. Comp., C-26(7):667–680, 1977.

[9] M.D. Ercegovac and K.S. Trivedi. On-line algorithms for division and multiplication. IEEE Trans. Comp., C-26(7):681–687, 1977.

[10] J.F. Hart. Computer Approximations. Wiley, 1968.

[11] E. Remes. Sur un procédé convergent d’approximations successives pour déterminer les polynômes d’approximation. C.R. Acad. Sci. Paris, 198, 1934.

[12] M. Shand and J.E. Vuillemin. Fast implementations of RSA cryptography. In M.J. Irwin, E.E. Swartzlander, and G. Jullien, editors, 11th Symposium on Computer Arithmetic, pages 252–259, Los Alamitos, CA, June 1993. IEEE Computer Society Press.

[13] P.T.P. Tang. Table lookup algorithms for elementary functions and their error analysis. In P. Kornerup and D. Matula, editors, Proceedings of the 10th IEEE Symposium on Computer Arithmetic, pages 232–236. IEEE Computer Society Press, June 1991.

[14] J. Volder. The CORDIC computing technique. IRE Transactions on Electronic Computers, 1959. Reprinted in E.E. Swartzlander, Computer Arithmetic, Vol. 1, IEEE Computer Society Press Tutorial, 1990.

[15] J. Walther. A unified algorithm for elementary functions. In Joint Computer Conference Proceedings, 1971. Reprinted in E.E. Swartzlander, Computer Arithmetic, Vol. 1, IEEE Computer Society Press Tutorial, 1990.

[16] XILINX. The Programmable Gate Array Data Book, 1992.

FPGA Implementation of Polynomial Evaluation Algorithms Milos Ercegovac Jean-Michel Muller Arnaud Tisserand

November 1995

Research Report

No

95-34

Ecole Normale Supérieure de Lyon 46 Allée d’Italie, 69364 Lyon Cedex 07, France Téléphone : (+33) 72.72.80.00 Télécopieur : (+33) 72.72.80.80 Adresse électronique : [email protected]−lyon.fr

FPGA Implementation of Polynomial Evaluation Algorithms Milos Ercegovac Jean-Michel Muller Arnaud Tisserand November 1995

Abstract The most-significant-digit-first function evaluation method (E-method) allows efficient evaluation of polynomials and certain rational functions on custom hardware. The time required for the computation is of the order of m carry-free addition operations, m being the number of digits in the result. We discuss a digit-parallel and a digit-serial implementation of this method on a DecPeRLe-1 board, made up with Xilinx FPGAs. After a presentation of the E-method, we give a description of the architecture of the Dec-PeRLe-1 board, present our designs and analyze their performances. Keywords: On-line arithmetic, FPGA, Polynomial evaluation Résumé La méthode d’évaluation “chiffre de poids fort en tête” (E-méthode) permet l’évaluation rapide de polynômes et de certaines fonctions rationnelles par matériel. Le temps requis pour le calcul est de l’ordre du temps nécessité par m additions sans propagation de retenue, où m est le nombre de chiffres du résultat. On propose une implantation parallèle et une implantation série de cette méthode, sur une carte DEC-PeRLe1, constituée de circuits FPGA Xilinx. Après une description de la E-méthode, nous donnons une description de l’architecture de la carte Dec-PeRLe1, nous présentons nos implantations et analysons leurs performances. Mots-clés: Arithmétique en-ligne, FPGA, Évaluation de polynômes

1 Introduction This paper shows how re-programmable hardware, such as field-programmable gate arrays (FPGAs), can be used to compute some polynomials and thus approximations of transcendental functions. This is an attractive approach when

there are not enough potential users that need to compute a function to make a dedicated hardware implementation of a particular function economical (e.g., for special functions such as Bessel functions and the Gamma function). when the precision required is greater than the precision provided by the available floating-point processors.

A standard way to evaluate a transcendental function is to approximate it by a polynomial or a rational function. Any continuous function can be approximated, as closely as desired, by a polynomial. There are other methods (for instance shift-andadd CORDIC methods [14, 15]), but those methods are applicable to a small number of functions only. The best minimax polynomial approximation of a given degree of a function can be obtained, for example, by the Remes’ algorithm [11]. As a consequence, the transcendental functions are frequently computed using polynomials (see for instance [6, 13]). Polynomial or rational approximations of the most common transcendental functions can be found in [10].

1.1 The E-method The E-method, introduced in [7, 8], allows efficient solution of diagonally dominant systems of linear equations on simple and highly regular hardware. Since the evaluation of polynomials and certain rational functions can be achieved by solving the corresponding linear systems, the E-method is an attractive general approach for function evaluation. In this paper we concentrate on the evaluation of polynomials assuming radix-2 arithmetic. Consider evaluation of p(x) = pn xn + pn?1 xn?1 + + p0 . One can show that p(x) is equal to y0 , where [y0 ; y1 ; : : : ; yn ]t is the solution of the following linear system: 2 6 6 6 6 6 6 6 6 6 6 6 6 6 4

1 ?x 0 0 1 ?x 0 0 0 1 ?x 0 ..

.. .

0

.

..

.

..

.

..

.

..

.

..

.

..

.

..

.

..

.

..

.

..

.

0 0 0 .. .

0 0

1 ?x 0 1

3

2 7 76 76 76 76 76 76 76 76 76 76 76 74 5

y0 y1 y2 .. .

yn?1 yn

3

2

7 7 7 7 7 7 7 7 7 7 7 5

6 6 6 6 6 6 6 6 6 6 6 4

=

p0 p1 p2 .. .

pn?1 pn

3 7 7 7 7 7 7 7 7 7 7 7 5

(1)

The radix-2 E-method consists in solving this linear system by using the following basic recursion: 1

i.e., for i = 0; : : : ; n,

w(0) = [p0 ; p1 ; : : : ; pn ]t ( j ) w = 2 w(j?1) ? Ad(j?1)

(2)

h j ?1) xi wi(j) = 2 wi(j?1) ? di(j?1) + di(+1

(j )

(j )

(0) (1) (2)

(j )

where the values di 2 f?1; 0; 1g. Define the number Di = di :di di : : : di (the d(ij) are the digits of a radix-2 signed-digit [1] representation of Di(j) ). One can show (j ) (j ) that if the sequence jwi j is bounded, then Di goes to yi as j goes to infinity. The problem at step j is to find a selection function that gives a value of the terms d(ij) from the terms wi(j) such that the values wi(j+1) will remain bounded. In [8], the following selection function (a form of rounding) is proposed

s(u) =

0;

sign u;

juj < 1=2 otherwise;

if

(3)

and applied to the following cases: 1. 2.

d(ij) = s(wi(j)), i.e., the selection requires non-redundant wi(j) ;

d(ij) = s(w^i(j)), where w^i(j) is an approximation of wi(j) (in practice, w^i(j) is deduced (j ) from a few digits of wi by the means of a rounding or a truncation)

Assume

8 < :

8i; jpi j jxj 1=8 jwi(j) ? w^i(j) j 2

The E-method gives a correct result provided that the above defined bounds ; , and satisfy 8

= 12 (1 + ) 0 > > > < > > > > :

For k

! ! ! ! !

= 2 and = 18 , we get:

2 ? 2 ?2 1 + 2 + 2

2 1 + 2?k+1 + 2 2 2 ? 2?k+1 + 2 2 + 2 ? 2 2 + 2?k+2

8 > > > >

> ! 2 2 ? 1=4 > > : ! 13=4 ? 2 2 These conditions are satisfied if ! = 7=4, 1 = 1=4 or 1=2 and 2 = 1 or 3=4. With (j ?1) depending these values, the following table sums up the possible choices for di 5

on the bits of weight w^i(j?1) .

21 ; 20 ; 2?1

and

2’s complement 00. 00 00. 01 00. 10 00. 11 01. 00 01. 01 01. 10 01. 11 10. 00 10. 01 10. 10 10. 11 11. 00 11. 01 11. 10 11. 11

2?2

of the 2’s complement representation of

w^i(j?1) 0 1/4 1/2 3/4 1 5/4 3/2 7/4 -2 -7/4 -3/2 -5/4 -1 -3/4 -1/2 -1/4

di(j?1)

0 1 or 0 1 1 1 1 1 1 never occurs -1 -1 -1 -1 -1 or 0 0 0

(j ?1) are needed to choose d(j ?1) .

It is worth noting that only the first three bits of w ^i

i

2.2 Selection in Borrow-Save Form ^i(j ) the number obtained by converting into 2’s complement the number Let us call w constituted by the bits of weights greater than or equal to 2?k of the borrow-save (j ) representation of wi . We have: w^i(j) ? 2?k jwi(j) j w^i(j) + 2?k (6) (j ) We assume that jxj is less than , and that we want to have jwi j ! for any j . We (j ?1) is the following: also assume that the choice of di 8 > 1 if w^i(j ?1) < ( j ? 1) di = > 0 if + 2?k w^i(j?1) ? 2?k : ?1 if w^i(j?1) ? (j ?1) j ! ) jw(j ) j !. In a We want to find values of k; ; and ! such that jwi i

manner similar to that of the previous section, we derive the following requirements:

! 2 ? 2 ! ?2 + 2?k+1 + 2 + 2 ! 2 + 2 6

For instance, with k = 2 and = 18 we get conditions that are satisfied with ! = 7=4 and = 1=2. These conditions are similar to those of the carry-save implementation, ^i(j ?1) . allowing also the use of only 3 bits of w We conclude that the carry-save and borrow-save representations have a similar complexity in the implementation of the selection function. We assume the borrowsave representation in the rest of the paper.

2.3 Mapping of LUs onto FPGA board In our first design we map one LU to a single FPGA. This limits the precision to 32 bits and, consequently, the largest polynomial degree to 11 (since x 1=8). Fig. 3 shows the mapping of the ith line-unit (which computes di ) and the communication scheme. IN

wi x

control logic L O A D

pi x di+1

WB

DC

seg 31

seg 0

seg 30

seg 1

.. .

.. .

seg 18

seg 13

seg 17

seg 14

seg 16

seg 15

?di S

15

14

13

12

11

10

9

8

7

6

5

4

3

2

1

0

LOADING STAGE

DC

control logic SB control WB = West Bus SB = South Bus DC = Direct Connection

OUT

di 15

14

13

12

11

10

9

8

7

6

5

4

3

2

1

0

d0k

COMPUTING STAGE

Figure 3: Line-unit mapping and communication scheme The coefficients of the polynomial are loaded into their respective LUs at the beginning through the West bus (16-bit width). During that step, we can select the function to be evaluated by loading the corresponding coefficients. The control signals distributed by the southern bus determine which FPGA of the matrix will get the digits of pi and which part of the coefficients is transmitted during each cycle (pi may have many digits, whereas the bus is 16 bits only). During the second step, 7

the “computing step”, x enters the last LU. Thus the function evaluator can work in a pipeline mode. x can change at each step. The E-method allows computation (j ) of all di in parallel. This requires direct link between LUs which is not possible (j ) (j ) on the FPGA board used. In this design, di is obtained after di+1 because the bus limits communication of a single digit at a time between the LUs. Fig. 4 shows the activity of the FPGA during the computation. In this figure the time unit is the LU cycle time . = (n + 1) is the time needed to compute the first digit of the result p(x). It corresponds to the pipeline latency. The total evaluation time of p(x) is = + m The implementation of the selection function is based on a conversion from the borrow-save system to the 20 s complement system of the first 3 digits and does not affect the cycle time significantly. The design obtained in the first implementation of the E-method is not very fast. Our goal in this design was to have one LU per FPGA, in order to be able to evaluate polynomials of large degree. In the DECPeRLe-1 board, with the architecture described above, we can evaluate degree-11 polynomials. We can evaluate functions of 32-bit argument with this solution. The clock frequency is around 30 MHz and 8 clock cycles are required for computing a new digit di . Our first design is limited to a precision of 32 bits. which corresponds roughly to single precision. It is impossible to put more digits in a single FPGA. Of course, we can use several FPGAs for implementing a LU. In this case, the attainable precision becomes much larger. In this case we can evaluate polynomials with 64-bit precision. With more recent FPGAs, the previous results would be better: the Xilinx 4025 chips are four times larger than ours, and they run at least twice as fast. 
With such chips, and with our first design, one could evaluate degree 40 polynomials with 128-bit operands, or degree 20 polynomials with 64-bit operands, at a clock frequency of around 60MHz.

3 On-Line Implementation of the E-Method In on-line arithmetic [9], the data circulates through the arithmetic operators in a digit-serial fashion, most significant digit first. One of the main advantages of on-line arithmetic compared to the conventional (i.e., least significant digit first) digit-serial arithmetic is that certain functions (division, square root, elementary functions) ar computable most significant digit first, thus allowing digit-level overlap between successive computations. Moreover, in most applications, the most significant digits are the most important, so it is more attractive to obtain them first. In this section, we assume that the input operand x is available on-line, i.e., serially, most significant digit first. To take this into account, we modify the basic recurrence of the E-method. We get:

j ?1) ? d(j ?1) + 2??1 x D(j ?1) ) wi(j) = 2 (wi(j?1) + x(j?1) di(+1 +j i+1 i (j ) In our design, the selection uses 6 bits of wi . Fig. 5 shows the architecture of the

on-line LUs. There is more control logic in the on-line units because the accumulator receives digits during each step of the computation. The implementation of the on-line E-method is more complex than the digitparallel one. The control logic requires more space in one FPGA and there are more 8

p0

LU 0 . . . LU 14 LU 15

d(0) 1

p1

LU 1

p14

p15

d(0) 15

d(1) d(0) 0 0 d(1) 1

(1) d(0) 14 d14 d(1) 15

d(15m)

d(1m)

d(0m)

d(14m) time

load

LU 0 LU 1 . . .

p(x)

p(x0 )

p(x00 )

LU 14 LU 15 init

time

(m + 1) Figure 4: LU activity during the evaluation

9

One position left shift (r)

to line-unit i ? 1

wi(j?1) j ?1) Di(+1

r??1 S di(j?1)

x(j?1) x+j

d(ij)

j ?1) di(+1

on-line input

from line-unit i + 1

Figure 5: On-line LU architecture

10

registers used for synchronization. With one LU per XC3090 we can manipulate 20bit arguments only, and the frequency is about 33MHz. As in the previous section, those results would improve to 80-bit arguments, and to around 66 MHz with more recent chips. The loading step is performed in an on-line fashion which simplifies the interface. The coefficients of the polynomial are distributed into their respective LUs digit-by-digit. This mode allows simpler chip connectivity. Fig. 6 shows the mapping of one on-line LU.


Figure 6: On-line LU implementation
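The on-line recurrence of this section can be simulated in software to see how the digits of p(x) emerge most significant digit first. The following is a minimal sketch under stated assumptions, not a model of the FPGA design: it uses exact rational arithmetic instead of borrow-save registers, selects each digit by exact rounding rather than from 6 bits of the residual, takes an on-line delay of 2, and restricts the inputs to 0 <= x <= 1/8 and |p_i| <= 3/4 so that the residuals stay within [-3/2, 3/2]. These bounds were chosen for the sketch and are not the bounds of the E-method itself. The names `select`, `online_e_method`, `x_bits`, and `delta` are hypothetical.

```python
from fractions import Fraction


def select(w):
    """Round the residual w to the nearest digit in {-1, 0, 1}.

    Exact rounding is an assumption of this sketch; the hardware
    selection only inspects a few leading bits of w.
    """
    if w >= Fraction(1, 2):
        return 1
    if w <= Fraction(-1, 2):
        return -1
    return 0


def online_e_method(p, x_bits, m, delta=2):
    """Evaluate p[0] + p[1]*x + ... + p[n]*x**n digit by digit.

    p      : coefficients as Fractions, |p[i]| <= 3/4 (sketch assumption)
    x_bits : binary digits of x, x_bits[k] has weight 2**-(k+1),
             with 0 <= x <= 1/8 (sketch assumption)
    m      : number of result digits to produce
    delta  : assumed on-line delay (digits of x known before step 1)
    """
    n = len(p) - 1
    w = [Fraction(c) for c in p]      # residuals w_i, initialised to p_i
    d = [0] * (n + 2)                 # last digits d_i; d[n+1] stays 0
    D = [Fraction(0)] * (n + 2)       # accumulated digits D_i; D[n+1] stays 0
    # Prefix of x made of the first delta digits.
    x_prefix = sum(Fraction(b, 2 ** (k + 1))
                   for k, b in enumerate(x_bits[:delta]))
    for j in range(1, m + 1):
        idx = delta + j - 1           # 0-based position of digit x_{delta+j}
        x_new = x_bits[idx] if idx < len(x_bits) else 0
        # All line-units update simultaneously from the previous digits.
        w = [2 * (w[i] - d[i] + x_prefix * d[i + 1]
                  + Fraction(x_new, 2 ** (delta + 1)) * D[i + 1])
             for i in range(n + 1)]
        x_prefix += Fraction(x_new, 2 ** (delta + j))
        for i in range(n + 1):
            d[i] = select(w[i])
            D[i] += Fraction(d[i], 2 ** j)
    return D[0]                       # digits produced by line-unit 0


# Example: p(x) = 1/2 - x/4 + 3x^2/8 at x = 0.0001011 (binary) = 11/128.
coeffs = [Fraction(1, 2), Fraction(-1, 4), Fraction(3, 8)]
bits = [0, 0, 0, 1, 0, 1, 1]
approx = online_e_method(coeffs, bits, m=16)
```

In this sketch the correction term weighted by 2^(-delta-1) accounts for the newly received digit of x multiplying all the digits already produced by the next line-unit, which is why each unit needs the accumulated value D as well as the last digit d.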

4 Variable Precision Implementation We have seen in the previous sections that the major problem is the limited precision. In order to provide flexible precision in a reconfigurable architecture for polynomial evaluation, we consider the following “multiple-precision” scheme [7]. The idea is to segment the adders and registers. We design a small circuit which computes the digits of $w_i^{(j)}$ whose rank lies within one segment, that is, between $k$ and $k$ plus the segment size, where $k$ is a multiple of the segment size between 0 and $m$. To perform the computation of one term $d_i^{(j)}$, we iterate $\lceil m/k \rceil$ cycles. We use the memory banks of the DECPeRLe-1 board to store the inactive register segments. The computation part of the LU circuit is now simpler, but the control logic necessary to perform the communications is more complex. An overview of this architecture is given in Fig. 7. The connectivity between the RAM banks and the matrix of FPGAs is designed so that all the FPGAs work at each clock cycle (once the pipeline is initialized).

Figure 7: Multi-precision architecture

If we wait for the end of the $\lceil m/k \rceil$ iterations to add $-d_i^{(j-1)}$, we would get a very significant delay for each computation of a term $d_i^{(j)}$. In order to overcome this drawback, we add it during the first iteration. This is possible because, in a redundant number system, there is no carry propagation. With this improvement the global delay remains of the same order as in the previous designs. A very high level of pipelining is possible with this technique. All the data (digits of x, w) can be distributed in on-line fashion.

Because of the slow access to the static RAM banks (50 ns), the frequency of this architecture is limited to 20 MHz. The number of cycles for the generation of one digit $d_i^{(j)}$ depends on the required precision (the number of memory accesses). Here, each segment has a size of 8 bits (4 borrow-save digits). Several cycles are required to feed each LU with the proper x, w and D segments. This architecture may be viewed as a fast multiple-precision kernel usable in many applications. The maximal precision is mainly limited by the RAM size.
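The segmentation idea relies on the fact that a redundant addition step needs only local information. The sketch below illustrates this with a carry-save step processed one 8-bit segment at a time. It is an illustration under assumptions, not the circuit of Fig. 7: carry-save on non-negative integers is swapped in for the borrow-save representation used in the paper, segments are visited least significant first, and the helper names (`split`, `join`, `csa_full`, `csa_segmented`) are hypothetical. Only a single bit is handed from one segment to the next, so the result matches the full-width step however many segments are used.

```python
SEG_BITS = 8  # segment width in bits, as in the text (8 bits per segment)
SEG_MASK = (1 << SEG_BITS) - 1


def split(value, n_segs):
    """Cut a word into n_segs segments, least significant segment first."""
    return [(value >> (SEG_BITS * t)) & SEG_MASK for t in range(n_segs)]


def join(segs):
    """Reassemble a word from its segments."""
    return sum(seg << (SEG_BITS * t) for t, seg in enumerate(segs))


def csa_full(s, c, a, width):
    """Reference: one carry-save step (s, c) + a on whole words."""
    mask = (1 << width) - 1
    majority = (s & c) | (s & a) | (c & a)
    return (s ^ c ^ a) & mask, (majority << 1) & mask


def csa_segmented(s_segs, c_segs, a_segs):
    """Same step, one segment per iteration.

    The only information crossing a segment boundary is one bit
    (carry_in); there is no carry propagation along the word.
    """
    out_s, out_c = [], []
    carry_in = 0
    for s, c, a in zip(s_segs, c_segs, a_segs):
        majority = (s & c) | (s & a) | (c & a)
        out_s.append(s ^ c ^ a)
        out_c.append(((majority << 1) & SEG_MASK) | carry_in)
        carry_in = majority >> (SEG_BITS - 1)
    return out_s, out_c


# Example with a 32-bit word stored as four 8-bit segments.
WIDTH, N_SEGS = 32, 4
s0, c0, a0 = 0x12345678, 0x0F0F0F0F, 0xDEADBEEF
seg_s, seg_c = csa_segmented(split(s0, N_SEGS), split(c0, N_SEGS),
                             split(a0, N_SEGS))
ref_s, ref_c = csa_full(s0, c0, a0, WIDTH)
```

Because each iteration touches only one segment of each operand, the inactive segments can sit in memory, which is the role the text assigns to the RAM banks.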

5 Conclusion We have analyzed, designed and implemented two versions of the E-method for polynomial evaluation using the DEC PeRLe-1 board with Xilinx 3090 FPGA chips. Our designs demonstrate relatively modest performance but very good flexibility compared with custom VLSI implementations. Using more advanced FPGA chips, we expect significant improvement in performance and usability. We have also indicated a scheme suitable for large-precision evaluation of polynomials using the E-method.

Acknowledgements This work has been supported in part by the NSF Grant INT9217512 “On-Line Arithmetic: From Theoretical Studies to Implementations” and the CNRS Project “L’arithmétique en ligne: des études théoriques à l’implémentation.”

References

[1] A. Avizienis. Signed-digit number representations for fast parallel arithmetic. IRE Transactions on Electronic Computers, 10:389–400, 1961. Reprinted in E.E. Swartzlander, Computer Arithmetic, Vol. 2, IEEE Computer Society Press Tutorial, 1990.

[2] J.C. Bajard, J. Duprat, S. Kla, and J.M. Muller. Some Operators for On-line Radix-2 Computations. Journal of Parallel and Distributed Computing, 22(2):336–345, 1994.

[3] P. Bertin, D. Roncin, and J. Vuillemin. Introduction to programmable active memories. Technical report, DEC Paris Research Laboratory, 1989.

[4] P. Bertin, D. Roncin, and J. Vuillemin. Programmable active memories: A performance assessment. Technical Report 24, DEC Paris Research Laboratory, 1993.

[5] P. Bertin, M. Shand, and J. Vuillemin. Hardware speedups in long integer multiplication. Technical report, DEC Paris Research Laboratory, 1991.

[6] W. Cody and W. Waite. Software Manual for the Elementary Functions. Prentice-Hall Inc, 1980.

[7] M.D. Ercegovac. A general method for evaluation of functions and computation in a digital computer. PhD thesis, Dept. of Computer Science, University of Illinois, Urbana-Champaign, 1975.

[8] M.D. Ercegovac. A general hardware-oriented method for evaluation of functions and computations in a digital computer. IEEE Trans. Comp., C-26(7):667–680, 1977.

[9] M.D. Ercegovac and K.S. Trivedi. On-line Algorithms for Division and Multiplication. IEEE Trans. Comp., C-26(7):681–687, 1977.

[10] J.F. Hart. Computer Approximations. Wiley, 1968.

[11] E. Remes. Sur un procédé convergent d’approximations successives pour déterminer les polynômes d’approximation. C.R. Acad. Sci. Paris, 198, 1934.

[12] M. Shand and J.E. Vuillemin. Fast implementations of RSA cryptography. In M.J. Irwin, E.E. Swartzlander, and G. Jullien, editors, 11th Symposium on Computer Arithmetic, pages 252–259, Los Alamitos, CA, June 1993. IEEE Computer Society Press.

[13] P.T.P. Tang. Table lookup algorithms for elementary functions and their error analysis. In P. Kornerup and D. Matula, editors, Proceedings of the 10th IEEE Symposium on Computer Arithmetic, pages 232–236. IEEE Computer Society Press, June 1991.

[14] J. Volder. The CORDIC computing technique. IRE Transactions on Electronic Computers, 1959. Reprinted in E.E. Swartzlander, Computer Arithmetic, Vol. 1, IEEE Computer Society Press Tutorial, 1990.

[15] J. Walther. A unified algorithm for elementary functions. In Joint Computer Conference Proceedings, 1971. Reprinted in E.E. Swartzlander, Computer Arithmetic, Vol. 1, IEEE Computer Society Press Tutorial, 1990.

[16] XILINX, The Programmable Gate Array Data Book, 1992.