FPGA Implementation of a Variable Precision CORDIC Processor E. Saez J. Villalba J. Hormigo F.J. Quiles J.I. Benavides E.L. Zapata
November 1998 Technical Report No: UMA-DAC-98/26
Published in: 13th Conf. on Design of Circuits and Integrated Systems (DCIS’98) Madrid, Spain, pp. 604-609, November 17--20, 1998
University of Malaga Department of Computer Architecture C. Tecnologico • PO Box 4114 • E-29080 Malaga • Spain
FPGA implementation of a variable precision CORDIC processor
E. Saez (2), J. Villalba (1), J. Hormigo (1), F.J. Quiles (2), J. I. Benavides (2), E. L. Zapata (1)
(1) Dpto. Arquitectura de Computadores. Universidad de Malaga. [email protected]
(2) U. D. Arquitectura de Ordenadores. Dpto. Electrotecnia y Electronica. Universidad de Cordoba. SPAIN
Abstract
In this paper we present the FPGA implementation of a new word-serial CORDIC processor working with variable precision. It has been designed to take advantage of the successive shifts of the coordinates involved in the CORDIC algorithm, which significantly reduces the total number of cycles required. Special attention has been paid to the design of the critical path to obtain an acceptable cycle time. The resulting FPGA device is able to work with word lengths from 16 bits through 128 bits.
1 Introduction
The variable precision technique is normally used to improve the accuracy and reliability of numerical computation. It addresses the problems produced by roundoff errors and catastrophic cancellation [10][6]. Several software tools have been developed to manage this problem, but their main disadvantage is their long execution time for numerically intensive applications [13]. Some hardware designs attempt to overcome the large overhead of these software solutions [9][11][5][3][4], and recently a whole coprocessor working with variable precision and interval arithmetic was designed [15][12][14]. Nevertheless, the evaluation of elementary functions is not included in these papers. We present an implementation to support variable precision, based on one of the most efficient algorithms for computing elementary functions: the CORDIC algorithm [18].
The architecture presented has been successfully implemented on a Xilinx FPGA, the XC4025EPG223-3 integrated circuit. We have chosen this technology to test the functionality of our design, taking advantage of its shorter development time, re-programmability and cost-effectiveness for low volume production [2]. FPGAs provide high flexibility and low cost for design modifications. Our aim is to show the characteristics and results of the FPGA implementation of this architecture.
The paper is structured as follows: in section 2 we present the basic CORDIC algorithm. In section 3 we propose the architecture of the word-serial CORDIC processor in variable precision. In section 4 we present the details of the FPGA implementation of this architecture. Finally, in section 5 we provide some conclusions and future work.
2 Fundamentals of the CORDIC algorithm
The CORDIC algorithm (COordinate Rotation DIgital Computer) was introduced to compute trigonometric functions and later generalized to compute linear and hyperbolic functions [18][19]. It is an iterative algorithm suitable for VLSI implementation, as it only employs adders and shifters, and it has a wide application domain. In recent years, different researchers have paid special attention to the improvement of the algorithm (see [8]). By means of the CORDIC algorithm, a vector $(x, y)$ is rotated through a given angle (rotation mode) or is taken to one of the coordinate axes (vectoring mode). These operations can be performed in linear, circular and hyperbolic coordinate systems, but only the circular coordinate system is considered here. The algorithm is based on rotations over prefixed, known elementary angles ($\tan^{-1}(2^{-i})$ in circular coordinates), which are stored in a table. The basic iteration or microrotation in circular coordinates is:
\[
\begin{aligned}
x(j+1) &= x(j) - \sigma_j 2^{-j} y(j) \\
y(j+1) &= y(j) + \sigma_j 2^{-j} x(j) \\
z(j+1) &= z(j) - \sigma_j \tan^{-1}(2^{-j})
\end{aligned}
\tag{1}
\]
where $(x(0), y(0))$ are the initial coordinates of the vector and the $z$ coordinate accumulates the angle. The coefficient $\sigma_j \in \{-1, +1\}$ specifies the direction of each microrotation. $n + 1$ iterations are needed to produce $n$-bit precision. The final coordinates are scaled by the factor
Figure 1: Architecture (x, y and z data paths with file registers FRX, FRY and FRZ, auxiliary registers ARX and ARY, barrel shifters BSX and BSY, table of angles, shift register, and m-bit adder/subtractors connected through bus 1 and bus 2)
\[
K = \prod_{j=0}^{n} \sqrt{1 + \sigma_j^2 \, 2^{-2j}} \tag{2}
\]
which is a constant since $|\sigma_j| = 1$. This scale factor must be compensated. The most usual method consists of adding some extra scaling iterations [7], but the factor can also be compensated in parallel with the CORDIC iterations [17]. The CORDIC algorithm can be implemented using different radices [1]. Nevertheless, due to the inherent complexity of variable precision, we prefer to work with the classical radix-2 CORDIC, since radix-4 CORDIC has a higher hardware cost.

3 CORDIC processor for variable precision
The architecture presented in this section is an extension of the classical CORDIC architecture to support variable precision and is shown in Figure 1. This architecture has a basic word length of $m$ bits, but is able to work with words of $n$ bits ($n > m$) in such a way that one CORDIC iteration is composed of $\lceil n/m \rceil$ cycles. To do this, we must obtain a suitable decomposition of the two basic operations used in CORDIC: addition and shift. This is possible by applying the classic bit-serial techniques to our word-serial system, which we now briefly describe.
The addition operation is quite simple and can be found in [15]. Basically, we add the different words starting from the least significant word (LSW), using an $m$-bit adder and storing the carry from one word to the next one.
The shift is decomposed into two partial shifts. The first one is a word-level shift and the second one is a bit-level shift. For example, if we are in CORDIC iteration $i$ and we are computing the third word of $x(i+1)$, we need to locate the third word of $x(i)$ and the third word of $2^{-i} y(i)$ (see (1)). Let us assume that $i$ can be decomposed as $i = d \cdot m + r$ with $r < m$. To locate this last word, we have to shift the value $y(i)$ to the right by $d$ full words plus $r$ bits. Now, we select the third word. Figure 2 shows this situation in the case of $i = d \cdot m + r$ with $d = 1$. In this figure we can see that the value of $2^{-i} y(i)$ is obtained from $y(i)$ by means of a shift of a full word plus a shift of $r$ bits.
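The word-serial addition and the two-level shift described in this section can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware described in the paper: the names `to_words`, `from_words`, `word_serial_add` and `shifted_word` are hypothetical, numbers are two's-complement integers stored LSW first, and the word length is fixed to the $m = 16$ used in the implementation.

```python
M = 16                      # basic word length m, in bits
MASK = (1 << M) - 1


def to_words(value, n_words):
    """Split a two's-complement integer into n_words m-bit words, LSW first."""
    value &= (1 << (M * n_words)) - 1
    return [(value >> (M * k)) & MASK for k in range(n_words)]


def from_words(words):
    """Rebuild the signed integer from its words (LSW first)."""
    n_bits = M * len(words)
    value = sum(w << (M * k) for k, w in enumerate(words))
    return value - (1 << n_bits) if value >> (n_bits - 1) else value


def word_serial_add(a, b):
    """Add two multiword numbers with one m-bit adder, one word per cycle."""
    carry, out = 0, []
    for wa, wb in zip(a, b):            # start from the LSW
        s = wa + wb + carry
        out.append(s & MASK)
        carry = s >> M                  # carry stored for the next word
    return out


def shifted_word(y, i, k):
    """Word k of (y >> i), using a word-level shift d and a bit-level shift r."""
    d, r = divmod(i, M)                 # i = d*m + r with r < m
    sign = MASK if y[-1] >> (M - 1) else 0

    def get(idx):                       # words beyond the MSW are sign extension
        return y[idx] if idx < len(y) else sign

    low, high = get(k + d), get(k + d + 1)      # the two words feeding the shifter
    return (((high << M) | low) >> r) & MASK    # 2m bits in, m bits out


def word_serial_shift(y, i):
    """All words of (y >> i), produced one word at a time."""
    return [shifted_word(y, i, k) for k in range(len(y))]
```

Each call to `shifted_word` reads two adjacent words, which mirrors the reason given in the text for a shifter with $2m$ input bits and $m$ output bits.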
Figure 2: Word-level shift and bit-level shift

In Figure 1 we can see three data paths corresponding to the coordinates $x$, $y$ and $z$ (see equations (1)). The classic registers that support the CORDIC algorithm have been replaced by the file registers FRX, FRY and FRZ. The auxiliary registers ARX and ARY make the shift possible, since two words are involved in this operation (see Figure 2). For the same reason the shifters have $2m$ input bits and $m$ output bits.
The $z$ data path incorporates a shift register. The function of this register is to replace the table of angles for iterations with $i > n/3$. From this iteration on, we can use the approximation $\tan^{-1}(2^{-i}) = 2^{-i}$ [16][20], and therefore use a shift register to hold the value $2^{-i}$ instead of the table of angles. In this way, the size of the table is reduced to a third. This was one of the main handicaps of a high precision CORDIC, since the table of angles grows with the word length.
An important reduction in the number of cycles can be obtained by taking advantage of the nature of the CORDIC algorithm. In principle, every CORDIC iteration is composed of as many cycles as words per coordinate are needed ($\lceil n/m \rceil$). Nevertheless, since at iteration $i$ the second operand of each equation in (1) has $i$ leading zeros (leading ones if it is negative), we can avoid the addition/subtraction of the leading all-zero (or all-one) words that appear in iterations with $i > m$. In other words, we can stop the computation when the sign extension words are reached. This is only possible when no carry is propagated to the MSWs. We have studied the probability of this situation occurring, and we conclude that the number of cycles required to perform a whole rotation is approximately halved.
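As an illustration of the microrotations of equation (1) combined with the angle approximation used for $i > n/3$, the following sketch performs a rotation in fixed-point integer arithmetic. It is a simplified single-word software model, not the word-serial hardware: the function name `cordic_rotate`, the number of guard bits and the division by $K$ at the end (instead of scaling iterations) are assumptions made for the example.

```python
import math


def cordic_rotate(theta, n=32, guard=8):
    """Rotate (1, 0) by theta radians using n + 1 radix-2 microrotations."""
    frac = n + guard                        # fractional bits of the fixed-point format
    one = 1 << frac
    table_len = n // 3 + 1                  # angles are stored only for i <= n/3
    table = [round(math.atan(2.0 ** -i) * one) for i in range(table_len)]

    x, y, z = one, 0, round(theta * one)
    for i in range(n + 1):
        sigma = 1 if z >= 0 else -1         # rotation mode: drive z towards zero
        # Beyond i > n/3 the table is replaced by the shifted constant 2^-i.
        angle = table[i] if i < table_len else one >> i
        x, y, z = (x - sigma * (y >> i),
                   y + sigma * (x >> i),
                   z - sigma * angle)

    # Scale factor of equation (2); here it is simply divided out at the end.
    k = math.prod(math.sqrt(1 + 4.0 ** -j) for j in range(n + 1))
    return x / one / k, y / one / k
```

For example, `cordic_rotate(0.5)` returns approximations of `cos(0.5)` and `sin(0.5)`. With `n = 32` only the first 11 elementary angles come from the table; the remaining 22 are generated by shifting.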
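The saving obtained by stopping at the sign extension words can be estimated with a short calculation. The sketch below is an idealized count and not the probabilistic study mentioned in the text: it assumes that at iteration $i$ only the $n - i$ low-order bits of the shifted operand are significant, it ignores carries that propagate into the sign extension words, and it ignores the cycle used to load the auxiliary registers. The function name `cycles_per_rotation` is hypothetical.

```python
import math


def cycles_per_rotation(n, m):
    """Return (full, reduced) adder cycles for the n + 1 microrotations."""
    words = math.ceil(n / m)
    full = (n + 1) * words                  # every word processed in every iteration
    # At iteration i the upper words of the shifted operand are pure sign
    # extension, so only the words holding the n - i low bits are processed.
    reduced = sum(math.ceil(max(n - i, 0) / m) for i in range(n + 1))
    return full, reduced
```

For $n = 128$ and $m = 16$ this gives 1032 cycles without the optimization and 576 with it, a ratio of about 0.56, which is consistent with the statement that the number of cycles is approximately halved.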
4 FPGA implementation of the CORDIC processor
The architecture has been designed using the XILINX development tools for LCAs of the XC4000E series, which is a high speed family, and the FPGA device has been programmed using XACT-STEP version 6.01. This device is the XC4025EPG223-3 chip. The implementation corresponds to a variable precision CORDIC processor with a maximum and minimum precision of 128 bits and 16 bits, respectively. The basic word length is 16 bits. We attempted to obtain a 256-bit precision processor, but encountered problems with the size of the file registers. The table of angles must be outside the chip for the same reason. Nevertheless, all the circuitry to address the table has been included in the device. The required size of this table is 512x16 (1 KByte).
Since this is a variable precision processor, one of the input parameters to the system is the precision with which we want to work. The precision indicator provides the number of words to be used for a particular application. This number goes from one through eight (16 bits through 128 bits).
We use the technique of adding scaling iterations to compensate the scale factor, since it is the solution with the lowest hardware cost. It only requires the redirection of some paths. The shifts involved in this operation are fixed for every iteration. These scaling iterations reduce the norm of the vector, which is increased by the successive CORDIC iterations (see section 2). To avoid overflow problems, we prefer to perform the scaling iterations at the beginning of processing. The subsystem in charge of controlling the scaling operations has two small memories, which indicate the number of scaling iterations to perform for the selected precision and the values of the shifts for every iteration.
The file register architecture of FRX and FRY is presented in Figure 3. It has two output ports and one input port.

Figure 3: File register architecture

In the control section, there are two registers to hold the bit-level shift and the word-level shift (see Figure 2). The register for the bit-level shift is a counter and controls the variable shifters of the data paths, whereas the register for the word-level shift is a shift register and controls the file registers subsystem.
The outputs to the common buses were initially implemented using three-state buffers, but this produced many bus access conflicts, even though we paid special attention to this point. Because the design is implemented on a programmable device, we cannot ensure that the propagation delays of the output enable control lines of the three-state buffers are the same (unless the routing is carried out specifically for this purpose). Since the FPGA device has limited resources, the routing is performed as a function of the available resources, which makes it difficult to time these lines appropriately unless we control the routing explicitly. Therefore, we have used open collector elements instead of three-state buffers, thus preventing any conflict.

Figure 4: Pin-out of the FPGA (COORD_X[15:0], COORD_Y[15:0], COORD_Z[15:0], ANGLE[15:0], ADDRESS[10:0], CLK, RESET, MODE, START, EOC)
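The two control registers that hold the bit-level and word-level shifts can be modelled in a few lines. This is a behavioural sketch only: the class name `ShiftControl` is hypothetical, and the one-hot encoding of the word-level shift register is an assumption, since the text states only that it is a shift register.

```python
class ShiftControl:
    """Tracks i = d*m + r with a counter (r) and a shift register (d)."""

    def __init__(self, m=16, max_words=8):
        self.m = m
        self.r = 0                                  # bit-level shift: a counter
        self.d_onehot = [1] + [0] * max_words       # word-level shift: one-hot register

    def next_iteration(self):
        """Advance to the next CORDIC iteration (i -> i + 1)."""
        self.r += 1
        if self.r == self.m:                        # counter wraps: one more full word
            self.r = 0
            self.d_onehot = [0] + self.d_onehot[:-1]

    @property
    def d(self):
        """Current word-level shift, decoded from the one-hot register."""
        return self.d_onehot.index(1)
```

In this model the counter value `r` would drive the barrel shifters, while the position of the single set bit in `d_onehot` would select which words are read from the file registers.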
Figure 4 shows the pin-out of the resulting FPGA. The number of lines has been minimized as far as possible. Nevertheless, the total number is high, since three coordinates of 16 bits are input parameters. The buses COORD_X, COORD_Y and COORD_Z are in charge of the input/output of the different words of each coordinate. Bus ADDRESS supplies the address to the external ROM containing the elementary angles, which are read through the bus ANGLE. The remaining lines of Figure 4 are control signals. Line MODE provides the operation mode of the device (rotation or vectoring). Signal START controls the beginning of the acquisition of the words which compose the coordinates. Line EOC indicates the end of the full CORDIC operation. Finally, the lines CLK and RESET are the clock and the re-initialization signals of the device.
After a reset, the operation of the system begins by enabling the START line during the first cycle. In this cycle the value of the desired precision must be presented on the bus COORD_X, so that the precision indicator is loaded. After this, the values of the different words of each coordinate are loaded using the corresponding buses, one per cycle (from the LSW through the MSW). Now, the CORDIC iterations can start. The first cycle of every iteration is dedicated to loading the auxiliary registers ARX and ARY (see Figure 1). Next, we work with the different words of the coordinates, stopping the computation when the sign extension words are reached (see previous section). When all the iterations have been performed, the EOC line is enabled.
Simulations have shown that the critical path is the $y$ data path, as we foresaw. This data path has a few more logic elements to support the vectoring mode of the CORDIC algorithm. Applying a suitable safety margin, the cycle time obtained is 125 ns (8 MHz), which is acceptable taking into account the limited routing resources of the XC4025 device.
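The cycle time above can be turned into a rough upper bound for the duration of the microrotations. The following sketch is a back-of-the-envelope estimate under stated assumptions: it charges $\lceil n/m \rceil$ word cycles plus one auxiliary register load cycle to each of the $n + 1$ iterations, and it leaves out the scaling iterations, the loading of the operands and the savings from stopping at the sign extension words. The function name `rotation_time_us` is hypothetical.

```python
import math

CYCLE_NS = 125      # cycle time reported for the XC4025 implementation


def rotation_time_us(n, m=16, load_cycles=1):
    """Upper-bound time of the n + 1 microrotations, in microseconds."""
    words = math.ceil(n / m)
    cycles = (n + 1) * (words + load_cycles)    # one extra cycle loads ARX/ARY
    return cycles * CYCLE_NS / 1000
```

Under these assumptions a 128-bit rotation takes at most 1161 cycles, that is, 145.125 microseconds, and a 16-bit rotation takes 34 cycles, that is, 4.25 microseconds.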
5 Conclusion and future work
Using the design presented in this paper we have integrated a full processor in an FPGA device to deal with elementary functions with variable precision up to 128 bits. The selection of a relatively simple and very efficient algorithm such as CORDIC to compute elementary functions is one of the keys to the feasibility of the integration, in spite of the large amount of hardware required by most high precision systems.
Special attention has been paid to the minimization of the critical path (the $y$ data path) to obtain an acceptable frequency. From the results obtained, we think that a full-custom version of this processor would achieve very good performance. A comparison with other alternatives has not been included, since this is the first design of a CORDIC processor for variable precision. The possible alternative would be the implementation of elementary functions based on a table look-up method. Nevertheless, the complexity of that system is very high, since the size of the tables is very large, and therefore a design based on table look-up is not feasible in an FPGA.
A future improvement to this design may be to avoid the preliminary cycle needed at the beginning of every CORDIC iteration to load the auxiliary registers. Basically, there are two alternatives: first, adding a new output port to the file registers so that the two words used in the shift are read at the same time, instead of only one word as in the proposed system; second, initializing the auxiliary registers in the last cycle of the previous iteration. The hardware implications of both alternatives need to be studied. It would also be convenient to take direct control over the place and route stages to optimize the delays of the different lines through the critical path, so that the frequency can be increased.
References [1] E. Antelo, J. Villalba, J.D. Bruguera, and E.L. Zapata. High performance rotation architectures based on radix-4 CORDIC algorithm. IEEE Transactions on Computers, 46(8):855{ 870, Aungust 1997. [2] S. D. Brown and al. Field{programmable gate arrays. Kluwer Academic Publishers, 1992. [3] T. M. Carter. Cascade: Hardware for high/variable precision arithmetic. In M. D. Ercegovac and E. Swartzlander, editors, Proc. 9th Symposium on Computer Arithmetic, pages 184{191, 1989. [4] D. M. Chiarulli, W. G. Rudd, and D. A. Buell. Draft: A dynamically recon gurable processor for integer arithmetic. In Proc. 7th Symposium on Computer Arithmetic, pages 309{318, 1985. [5] M. S. Cohen, T. E. Hull, and V. C. Hamarcher. Cadac: A controlled-precision decimal arith-
[6] [7] [8] [9] [10]
metic unit. IEEE Trans. on Computers, C- [19] J.S. Walther. A uni ed algorithm for elementary functions. Proc. Spring. Joint Comput. 32:370{377, 1983. Conf., pages 379{385, 1971. D. Goldberg. What every computer scientist should know about oating-point arithmetic. [20] S. Wang, V. Piuri, and E. E. Swartzlander, Jr. ACM Computing Surveys, 23:5{48, 1991. Hybrid cordic algorithms. IEEE Trans. Computers, 46(11):1202{1207, November 1997. G.L. Haviland and A.A. Tuszynski. A CORDIC arithmetic processor chip. IEEE Trans. on Comput., C-29(2):68{79, Feb 1980. Y.H. Hu. CORDIC{based VLSI architectures for Digital Signal Processing. IEEE Signal Processing Magazine, (7):16{35, July 1992. J. Kernhof. A cmos- oating point processing chip for veri ed exact vector arithmetic. ESSIRC 94, 1994. A. Knofel. Hardware kernel for scienti c/engineering computations. Scienti c Computing with Automatic Result Veri cation, E. Adams and U. Kulisch, eds., pages 549{570,
1993. [11] M. Muller, C. Rub, and W. Rulling. Exact accumulation of oating-point numbers. In P. Kornerup and D. W. Matula, editors,
Proc. 10th Symposium on Computer Arithmetic, pages 64{69, 1991.
[12] M. J. Schulte and E. E. Swartzlander, Jr. Hardware design and arithmetic algorithms for a variable-precision, interval arithmetic coprocessor. In Proc. 12th Symposium on Computer Arithmetic, pages 222{229, 1995. [13] M. J. Schulte and E. E. Swartzlander, Jr. A processor for staggered interval arithmetic. Proc. Int. Conf. Application-Speci c Array Processors, pages 104{112, 1995.
[14] M. J. Schulte and E. E. Swartzlander, Jr. Variable-precision, interval arithmetic coprocessors. Reliable Computing, 2(1):47{62, 1996. [15] Michael J. Schulte. A Variable-Precision, Interval Arithmetic Processor. PhD thesis, The University of Texas at Austin, May 1996. [16] J. Villalba. Dise~no de arquitecturas cordic multidimensionales. Ph.D Universidad de Malaga, pages 40{87, November 1995. [17] J. Villalba, T. Lang, and E. L. Zapata. Parallel compensation of the scale factor for the cordic algorithm. In press in VLSI of signal Processing, 1998. [18] J.E. Volder. The CORDIC Trigonometric Computing Technique. IRE Trans. Elect. Comput., EC(8):330{334, 1959.