Radix-4 Vectoring CORDIC Algorithm and Architectures

vol. 19, no. 2, pp. 127-147, July 1998. University of Malaga. Department of Computer Architecture. C. Tecnologico • PO Box 4114 • E-29080 Malaga • Spain ...
Radix-4 Vectoring CORDIC Algorithm and Architectures J. Villalba E. Antelo J.D. Bruguera E.L. Zapata

July 1998 Technical Report No: UMA-DAC-98/20

Published in: J. of VLSI Signal Processing Systems for Signal, Image, and Video Technology vol. 19, no. 2, pp. 127-147, July 1998

University of Malaga Department of Computer Architecture C. Tecnologico • PO Box 4114 • E-29080 Malaga • Spain

Radix-4 vectoring CORDIC algorithm and architectures by

J. Villalba (1), E. Antelo (2), J.D. Bruguera (2) and E.L. Zapata (1)

(1) Dept. of Computer Architecture, University of Malaga, Malaga, Spain. [email protected]
(2) Dept. Electronica y Computacion, University of Santiago de Compostela, Spain. [email protected]

MANUSCRIPT FOR PARHI AND TAYLOR ASAP-96 SPECIAL ISSUE

Author responsible for correspondence: Prof. Julio Villalba Moreno Dpto. Arquitectura de Computadores Universidad de Malaga, Complejo Tecnologico P.O.BOX : 4114 Malaga E-29080 (SPAIN)

Tel. +34-5-2132787, fax +34-5-2132790, e-mail: [email protected]

This work was supported in part by the Ministry of Education and Science (CICYT) of Spain under contract TIC96-1125-C03-01.

Abstract

In this work we extend the radix-4 CORDIC algorithm to the vectoring mode (the radix-4 CORDIC algorithm was proposed recently by the authors for the rotation mode). The extension to the vectoring mode is not straightforward, since the digit selection function is more complex in the vectoring case than in the rotation case; as in the rotation mode, the scale factor is not constant. Although the radix-4 CORDIC algorithm in vectoring mode has a recurrence similar to that of the radix-4 division algorithm, there are specific issues concerning the vectoring algorithm that demand dedicated study. We present the digit selection for non-redundant and redundant arithmetic (following two different approaches: arithmetic comparisons and table look-up), the computation and compensation of the scale factor, and the implementation of the algorithm (with both types of digit selection) in a word-serial architecture. When compared with conventional radix-2 (redundant and non-redundant) architectures, the radix-4 algorithms present a significant speed-up for angle calculation. For the computation of the magnitude the speed-up is very slight, due to the non-constant scale factor in the radix-4 algorithm.

1 Introduction

The CORDIC algorithm (COordinate Rotation DIgital Computer) was introduced to compute trigonometric functions and generalized to compute linear and hyperbolic functions [1][2]. It is an iterative algorithm suitable for VLSI implementation because it employs only adders and shifters, and it has a broad application field. Different researchers have paid special attention to improving the algorithm in the last few years, as referenced in [3]. By means of the CORDIC algorithm, a vector $(x, y)$ is rotated by an angle (rotation mode) or it is taken to the coordinate axis (vectoring mode). The algorithm is based on rotations over given elementary angles. The basic iteration or microrotation is:

$$
\begin{aligned}
x_{i+1} &= x_i + \sigma_i \cdot 2^{-i} \cdot y_i \\
y_{i+1} &= y_i - \sigma_i \cdot 2^{-i} \cdot x_i \\
z_{i+1} &= z_i + \sigma_i \cdot \tan^{-1}(2^{-i})
\end{aligned}
\tag{1}
$$

where $(x_0, y_0)$ are the initial coordinates of the vector, the $z$ coordinate accumulates the angle, and $\sigma_i \in \{-1, +1\}$ specifies the direction of each microrotation. Each iteration introduces a scaling over both coordinates given by the expression $k_i = \sqrt{1 + \sigma_i^2 \cdot 2^{-2i}}$, and thus, after $n$ iterations the final $x$ and $y$ coordinates are scaled by the factor:

$$
K = \prod_{i=0}^{n-1} \sqrt{1 + \sigma_i^2 \cdot 2^{-2i}} \tag{2}
$$

In radix-2 CORDIC, the scale factor is constant since $|\sigma_i| = 1$ (for a detailed review of CORDIC see [3]). To preserve the magnitude of the vector, this factor must be compensated.

In recent years many computationally intensive applications have been proposed, including matrix computations and computer graphics algorithms that need to compute angles [4][5][6]. In [4][5][7] the angle computation and rotation operations are performed by means of CORDIC modules. It has been shown that processors for angle calculation and rotation based on CORDIC modules present a significant improvement in performance compared to the more conventional approach using standard modules such as division, multiplication or square root [4].

In the classic radix-2 CORDIC algorithm, roughly one bit of the result is computed in each iteration. In [8], we designed a rotator based on a radix-4 CORDIC algorithm (rotation mode only). In that paper we proved that if radix-4 is used instead of radix-2, the total number of microrotations of the CORDIC algorithm is halved because two bits of the result are computed in each iteration. However, the radix-4 CORDIC algorithm presents two drawbacks:

1. The selection of $\sigma_i$ is more complex than in the radix-2 case.
2. The scale factor is not constant, since $\sigma_i$ takes values other than $+1$ or $-1$.

As we show in [8], for the rotation mode the selection of $\sigma_i$ only depends on $z_i$ and the resulting digit selection table is simple. The scale factor is computed by combining a look-up table and linear approximations of the scale factor of each microrotation. The resulting architectures prove to be efficient when compared to the conventional radix-2 architectures (both with redundant and non-redundant arithmetic).

In this paper we extend the radix-4 CORDIC algorithm to the vectoring mode (both for redundant and non-redundant representation of the operands). This extension is very interesting for applications that are based on the angle computation and rotation operations. However, the extension is not straightforward due to the selection of $\sigma_i$. In the vectoring case the selection of $\sigma_i$ depends on both the $x$ and $y$ coordinates, and the resulting selection is more complex than in the rotation mode. On the other hand, the scale factor may be computed and compensated in a similar way as in the rotation mode.

The organization of the paper is as follows. First, in Section 2 we present the radix-4 CORDIC vectoring algorithm and demonstrate its convergence. In Section 3 we deal with the selection of $\sigma_i$. Section 4 is dedicated to the scale factor computation and compensation. In Section 5 we illustrate the implementation of the algorithm in a word-serial architecture. Finally, Section 6 is dedicated to the evaluation and comparison and Section 7 to the conclusions.
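As a point of reference for what follows, the radix-2 microrotation (1) and the constant scale factor (2) can be sketched in a few lines of Python. This is only an illustrative floating-point model, not the hardware discussed in the paper: the function names are hypothetical, and the rule "choose $\sigma_i$ from the sign of $y_i$" is the usual vectoring-mode convention, assumed here rather than taken from the text above. It also assumes $x_0 \ge 0$.

```python
import math

def cordic_radix2_vectoring(x0, y0, n):
    """Apply n microrotations of (1), choosing sigma_i so that y moves toward zero."""
    x, y, z = x0, y0, 0.0
    for i in range(n):
        sigma = 1 if y >= 0 else -1              # assumed vectoring rule: follow the sign of y
        x, y = x + sigma * 2.0**-i * y, y - sigma * 2.0**-i * x
        z += sigma * math.atan(2.0**-i)          # z accumulates the rotated angle
    return x, y, z

def scale_factor_radix2(n):
    """Scale factor K of (2); constant because |sigma_i| = 1 in every iteration."""
    return math.prod(math.sqrt(1 + 2.0**(-2 * i)) for i in range(n))

x, y, z = cordic_radix2_vectoring(0.8, 0.6, 32)
magnitude = x / scale_factor_radix2(32)          # compensation of the scale factor
angle = z                                        # close to atan2(0.6, 0.8)
```

Because $K$ does not depend on the data, it can be precomputed once; this is exactly the property that is lost when $\sigma_i$ may take values other than $\pm 1$.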

2 Radix-4 CORDIC Algorithm

In this section we develop a radix-4 CORDIC algorithm in the vectoring mode. We prove the convergence and give the precision and number of iterations. First we extend the iterative equations of the radix-2 CORDIC algorithm to radix-4 [8]. Basically, we use elementary angles of the form $\tan^{-1}(\sigma_i 4^{-i})$ instead of $\tan^{-1}(2^{-i})$, in such a way that equations (1) become:

$$
\begin{aligned}
x_{i+1} &= x_i + \sigma_i \cdot 4^{-i} \cdot y_i \\
y_{i+1} &= y_i - \sigma_i \cdot 4^{-i} \cdot x_i \\
z_{i+1} &= z_i + \alpha_i(\sigma_i)
\end{aligned}
\tag{3}
$$

where $\alpha_i(\sigma_i) = \tan^{-1}(\sigma_i 4^{-i})$, $\sigma_i$ takes values in the digit set $\{-a, \dots, 0, \dots, +a\}$, and $2 \le a \le 3$. The number of iterations to achieve $n$ bits of precision is $n/2$. The rotated coordinates are scaled by the factor

$$
K = \prod_i \left(1 + \sigma_i^2 \cdot 4^{-2i}\right)^{1/2} \tag{4}
$$

The scale factor $K$ is not constant, as it depends on the $\sigma_i$ values. In the vectoring mode, coordinate $y$ is taken to 0 as the iterations progress. Therefore, in each iteration, the $\sigma_i$ value must be selected so that the $y$ coordinate approaches zero. At the end of the iterations we only have to compensate the $x$ coordinate, as the $y$ coordinate is zero within the precision. Coordinate $z$ does not require any correction.

2.1 Convergence of the radix-4 CORDIC algorithm in the vectoring mode

In order to prove the convergence of the radix-4 CORDIC algorithm we have to prove that variable $y$ approaches zero as the index of the microrotations increases. In order to obtain a set of iterations where the radix-4 CORDIC vectoring may be efficiently performed, we consider the scaled value of $y_i$. That is, we define a new variable $w_i$:

$$
w_i = 4^i y_i \tag{5}
$$

This equation introduces a scaling on $y_i$ of the same order as the decrease produced in $y_i$ in each iteration. In this way, we keep the value of $w_i$ bounded. This simplifies the selection criteria and eliminates possible imprecisions in the calculation. A similar solution is used in [5][9]. With this change, equations (3) become:

$$
\begin{aligned}
x_{i+1} &= x_i + \sigma_i \cdot 4^{-2i} \cdot w_i \\
w_{i+1} &= 4 \cdot (w_i - \sigma_i \cdot x_i) \\
z_{i+1} &= z_i + \alpha_i(\sigma_i)
\end{aligned}
\tag{6}
$$
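The change of variable (5) can be checked numerically. The following sketch (hypothetical function names, an arbitrary illustrative digit sequence, floating-point arithmetic) applies the same digits to the $(x, y)$ recurrence of (3) and to the $(x, w)$ recurrence of (6) and compares the results under the substitution $w_i = 4^i y_i$.

```python
def step_y_form(x, y, sigma, i):
    """One microrotation of (3) on the (x, y) pair."""
    return x + sigma * 4.0**-i * y, y - sigma * 4.0**-i * x

def step_w_form(x, w, sigma, i):
    """One microrotation of (6) on the (x, w) pair, with w_i = 4**i * y_i."""
    return x + sigma * 4.0**(-2 * i) * w, 4.0 * (w - sigma * x)

digits = [1, -1, 2, -1, -1, -2, 1, 0]    # arbitrary digits from {-2, ..., +2}
xa, ya = 0.8, 0.6
xb, wb = 0.8, 0.6                        # w_0 = 4**0 * y_0 = y_0
for i, sigma in enumerate(digits):
    xa, ya = step_y_form(xa, ya, sigma, i)
    xb, wb = step_w_form(xb, wb, sigma, i)

scaled_y = 4.0**len(digits) * ya         # should agree with wb up to rounding
```

In the $w$ form the only data-dependent shift left is the one by $4^{-2i}$ in the $x$ update; the factor 4 in the $w$ update is a fixed two-bit shift.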

These equations reduce the number of barrel shifters to only one (see Section 5). Based on these new equations, we may obtain a selection criterion for $\sigma_i$ in each microrotation that guarantees the convergence of the algorithm. The digit set for $\sigma_i$ is $\{-a, \dots, 0, \dots, +a\}$. This radix-4 digit set can be minimally redundant if $a = 2$ or maximally redundant if $a = 3$. As in the case of the radix-4 SRT division [10], it is convenient to define a redundancy factor $\rho = a/3$.

The main drawback of the radix-4 algorithm is the selection of the $\sigma_i$ values. In order to bound the $w$ variable in each iteration, $\sigma_i$ must be a function of the values of both $x_i$ and $w_i$ (see the iteration on $w$ in equation (6)). The selection process is therefore more complex than in the radix-4 algorithm in the rotation mode [8], where the $\sigma_i$ values only depend on the $z$ variable in each iteration.

We now propose the selection intervals for $\sigma_i$, that is, if $w_i \in [L_q(x_i), U_q(x_i)]$ then $\sigma_i = q$, with $q \in \{-a, \dots, 0, \dots, +a\}$. The selection interval $[L_q(x_i), U_q(x_i)]$ must assure that $w_{i+1}$ is bounded when $\sigma_i = q$ is selected. The selection intervals we propose are defined by:

$$
L_q(x_i) = (q - \rho) \cdot x_i \tag{7}
$$

and

$$
U_q(x_i) = (q + \rho) \cdot x_i \tag{8}
$$

These intervals are similar to the selection intervals used in the radix-4 SRT division [10]. This is due to the fact that the iteration on $w$ is similar to the iteration used in division. The iteration for division has the form $w_{i+1} = 4 \cdot (w_i - q_i \cdot d)$, where $q_i$ is the quotient digit and $d$ is the divisor. However, there are two important differences between the division algorithm and the CORDIC algorithm. First, in division the divisor is constant for all of the iterations, whereas in the CORDIC algorithm the $x$ coordinate takes a different value in each iteration. Secondly, if redundant arithmetic is considered, in the division algorithm the divisor is represented in a non-redundant form, but in the CORDIC algorithm the $x$ coordinate is represented in redundant form, leading to a more complex selection function. Although both algorithms are very similar, these two differences impose different constraints on each of them. Consequently, a particular study must be made for the radix-4 CORDIC algorithm.

The intervals given in expressions (7) and (8) have to assure a) the continuity condition [10] and b) the bounding of $w_i$. The continuity condition implies that the selection intervals must cover the whole range of $w_i$. It is assured if $L_q(x_i) \le U_{q-1}(x_i)$ for all $i$ and $q$. Based on the definitions of $L_q(x_i)$ and $U_{q-1}(x_i)$ (see equations (7) and (8)), the continuity condition is assured if $\rho \ge 1/2$. As we have defined $\rho = a/3$ and the minimum value of $a$ we can select is $a = 2$, the minimum value for $\rho$ is $2/3$, and then the continuity condition is satisfied.

Next, we demonstrate that $w_i$ is bounded in each of the iterations with the selection intervals given in expressions (7) and (8). Without any loss of generality, in what follows we assume that $x_0 \ge 0$ and that $x_0$ and $y_0$ are fractional values, one of them normalized within the interval $[0.5, 1)$. The bound for $w_i$ is:

$$
|w_i| \le 4 \cdot \rho \cdot x_i \tag{9}
$$

Proof: we prove this by induction.

• Base case ($i = 1$). We consider two sets of values that $w_0$ may take in relation to $x_0$: either $|w_0|$ is lower than or equal to $4\rho x_0$, or it takes larger values.

a) Let us consider the set of values $|w_0| \le 4\rho x_0$. Since $|w_0| \le 4\rho x_0$, it is clear that $\exists\, q \in \{-a, \dots, 0, \dots, +a\}$ that satisfies

$$
q \cdot x_0 - \rho \cdot x_0 \le w_0 \le q \cdot x_0 + \rho \cdot x_0 \tag{10}
$$

Subtracting $q \cdot x_0$, multiplying by 4 and observing (6) we have:

$$
-4\rho x_0 \le w_1 \le 4\rho x_0 \tag{11}
$$

Thus, as $x_i$ is an increasing sequence [5], we may write:

$$
|w_1| \le 4\rho x_1 \tag{12}
$$

b) We now consider the set of values $|w_0| > 4\rho x_0$. Let us assume that for this set of values we select $q = 2$; we are going to see that with this selection expression (9) is satisfied and there is still a bound for $w_1$. The worst case occurs when the ratio between $w_0$ and $x_0$ is infinite, that is, when $x_0 = 0$. If $q = 2$ and $x_0 = 0$, in the first iteration ($i = 0$) we have (see equations (6)):

$$
x_1 = 2 w_0 \qquad w_1 = 4 w_0
$$

We can thus write that:

$$
w_1 = 4 w_0 = 2 \cdot 2 w_0 = 2 x_1 \le 4\rho x_1 \tag{13}
$$

since $\rho = a/3$ with $a = 2$ or $a = 3$.

• Induction hypothesis ($i = m - 1$). We assume as true that $|w_{m-1}| \le 4 \cdot \rho \cdot x_{m-1}$.

• Induction step ($i = m$). Because of the induction hypothesis it is true that there is a $q$ satisfying:

$$
q \cdot x_{m-1} - \rho x_{m-1} \le w_{m-1} \le q \cdot x_{m-1} + \rho x_{m-1} \tag{14}
$$

Subtracting $q \cdot x_{m-1}$, multiplying by 4 and taking (6) into account we may write:

$$
-4\rho x_{m-1} \le w_m \le 4\rho x_{m-1} \tag{15}
$$

Therefore, as $x_i$ is an increasing sequence [5], we may write:

$$
|w_m| \le 4\rho x_m \tag{16}
$$

Q.E.D.

Now, we prove that $x_i$ is bounded. If we substitute the expression (9) in the first equation of (6) we obtain:

$$
x_{i+1} \le x_i + 4\rho\sigma_i 4^{-2i} x_i = x_i \left(1 + 4\rho\sigma_i 4^{-2i}\right) \tag{17}
$$

Expressing this inequality as a function of $x_0$ we have:

$$
x_{i+1} \le x_0 \prod_{k=0}^{i} \left(1 + 4\rho\sigma_k 4^{-2k}\right) \tag{18}
$$

The maximum value of this expression is reached when $\sigma_j = a \;\; \forall j \le i$ and $\rho = 1$. Therefore, we obtain the expression:

$$
x_{i+1} \le x_0 \prod_{k=0}^{i} \left(1 + 12 \cdot 4^{-2k}\right) \tag{19}
$$

This product is convergent since the corresponding infinite product is also convergent. The infinite product is convergent since it fulfils two conditions: the general term $(1 + 12 \cdot 4^{-2k})$ tends to 1 when $k$ goes to infinity, and the series $u_1 + u_2 + \dots + u_p + \dots$ with $u_p = 12 \cdot 4^{-2p}$ is convergent too. Therefore, taking into account expression (9) and the bound of $x_i$, we conclude that $w_i$ is bounded.

In Figure 1 we show the selection intervals for the cases $\rho = 1$ and $\rho = 2/3$. In this figure we only show the intervals for positive values of $w_i$. The intervals are symmetrical for the negative values of $w_i$. As we can see in the figure, there is an overlap between the selection intervals. Observe that the overlap is greater in the case $\rho = 1$ than in the case $\rho = 2/3$. This overlap implies that we do not need to make exact comparisons to determine the selection interval. We can determine the suitable interval based on estimations of $w_i$ and $x_i$. This is very useful, both for non-redundant and redundant representations. For example, for redundant carry-save arithmetic we only have to assimilate (convert from redundant to non-redundant representation) a reduced number of most significant bits of $w_i$ and $x_i$ to determine the $\sigma_i$ value. In the next section we obtain the selection function based on estimations.
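The convergence argument can be exercised with a small floating-point model of the recurrence (6). The sketch below is not the selection function developed in the next section: to stay short it picks the digit as the nearest integer to $w_i / x_i$, clamped to the digit set, which places $w_i$ inside $[L_q(x_i), U_q(x_i)]$ whenever $|w_i| \le 4\rho x_i$ but uses a division that the hardware avoids. The function name and the test inputs are assumptions for illustration, and the inputs must satisfy $x_0 > 0$ and $|w_0| \le 4\rho x_0$.

```python
import math

def radix4_vectoring(x0, y0, n_iter, a=2):
    """Iterate (6) with digits chosen so that w_i lies in [L_q(x_i), U_q(x_i)]."""
    rho = a / 3                                   # redundancy factor
    x, w, z = x0, y0, 0.0                         # w_0 = y_0
    digits = []
    for i in range(n_iter):
        # Bound (9); the small slack only absorbs floating-point rounding.
        assert abs(w) <= 4 * rho * x * (1 + 1e-12)
        # Illustrative selection by exact ratio (assumption, not the paper's method).
        q = max(-a, min(a, round(w / x)))
        x, w = x + q * 4.0**(-2 * i) * w, 4.0 * (w - q * x)
        z += math.atan(q * 4.0**-i)
        digits.append(q)
    # Data-dependent scale factor (4) for the digits actually used.
    K = math.prod(math.sqrt(1 + q * q * 4.0**(-2 * i)) for i, q in enumerate(digits))
    return x, w, z, K

x, w, z, K = radix4_vectoring(0.8, 0.6, 16)
angle, magnitude = z, x / K
```

With 16 microrotations the accumulated angle agrees with `math.atan2(0.6, 0.8)` to roughly 32 bits, and dividing the final $x$ by the digit-dependent $K$ recovers the magnitude of the input vector.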

2.1.1 Precision and number of iterations obtained in radix-4

After $p$ iterations, the angle between the vector $(x_p, y_p)$ and the $x$ axis is (see expressions (5) and (9)):

$$
\tan^{-1}\left(\frac{y_p}{x_p}\right) = \tan^{-1}\left(\frac{4^{-p} w_p}{x_p}\right) \le \tan^{-1}\left(2 \cdot \rho \cdot 2^{-2p+1}\right) \tag{20}
$$

If $\rho = 2/3$ this value is slightly greater than the value obtained by means of $2p$ standard radix-2 CORDIC iterations ($\tan^{-1}(2^{-2p+1})$) and slightly less than the value obtained with $2p - 1$ iterations ($\tan^{-1}(2^{-2p+2})$). If $\rho = 1$ this value coincides with that of $2p - 1$ standard radix-2 iterations. Therefore, the radix-4 CORDIC algorithm in vectoring mode basically halves the number of microrotations with respect to the standard radix-2 CORDIC algorithm.

3 Selection function

In this section we obtain the selection functions for the radix-4 CORDIC algorithm in the vectoring mode. We obtain a selection function that is valid for hardware implementation in redundant as well as non-redundant arithmetic. To do this, we use two different techniques that produce two different selection functions. The first technique is based on arithmetic comparison and the second one is based on a look-up table. There is no clear difference in terms of area and time between the two techniques, and only a real implementation would tell us which one is better. For this reason, we explain both techniques.

3.1 Selection function by arithmetic comparisons

This method is based on comparing coordinate $w$ to a couple of comparison points so that the $\sigma_i$ value is obtained. To reduce the number of comparisons we choose $a = 2$, so that $q \in \{0, \pm 1, \pm 2\}$. On the other hand, the only restrictions on the input data are $|x_0| < 1$, $|w_0| < 1$ and that one of them must be normalized. Without any loss of generality we also assume that $x_0 \ge 0$. For clarity in the presentation, in what follows we assume non-redundant arithmetic. The extension to redundant arithmetic is considered at the end of this subsection.

3.1.1 Obtaining fixed comparison points for all the iterations

Let us assume the selection intervals given by $[L_q(x_i), U_q(x_i)]$ (see expressions (7) and (8)). We define $P_i(1)$ as the comparison point used for discriminating between the values $\sigma_i = 0$ and $\sigma_i = 1$, and we define $P_i(2)$ as the comparison point used for discriminating between the values $\sigma_i = 1$ and $\sigma_i = 2$ (we define $P_i(-1)$ and $P_i(-2)$ in a similar way). The comparison points that we have defined must belong to the overlap intervals and be easy to calculate and implement. Two suitable selections for the comparison points are:

$$
P_i(\pm 1) = \pm\tfrac{1}{2}\, x_i \qquad P_i(\pm 2) = \pm\tfrac{3}{2}\, x_i \tag{21}
$$

because they are simple and belong to the overlap intervals (see Figure 1b). However, it is necessary to recalculate the comparison points in each iteration, as they depend on the numerical value of $x_i$.

The alternative we now present makes it necessary to calculate the comparison points $P_i(1)$ and $P_i(2)$ only in a few initial iterations; they remain fixed for the rest of the iterations. We are going to calculate from which iteration on the comparison points obtained are valid for the remaining iterations. As $x_i$ is an increasing sequence, the sequences $L_q(x_i)$ and $U_q(x_i)$ are also increasing (see expressions (7) and (8) and Figure 1b). According to this, for the comparison points of the $i$-th stage ($P_i(1)$ and $P_i(2)$) to still be valid as comparison points in the remaining iterations, they must belong to the overlap intervals of these iterations. In Figure 2 we see how there is a common overlap area between all the iterations for $q = 0$ and $q = 1$. We seek an iteration $i$ such that the comparison points belong to the common overlap area, that is, we seek a $P_i(1)$ and $P_i(2)$ such that (see Figure 2):

$$
L_1(x_\infty) \le P_i(1) \le U_0(x_i) \tag{22}
$$

$$
L_2(x_\infty) \le P_i(2) \le U_1(x_i) \tag{23}
$$

(The arguments would be the same for $P_i(-1)$ and $P_i(-2)$.) Let us analyze equation (22); taking into account expressions (7) and (8) and that the value of $P_i(1)$ is $1/2 \cdot x_i$, we may write:

$$
\tfrac{1}{3} \cdot x_\infty \le \tfrac{1}{2} \cdot x_i \le \tfrac{2}{3} \cdot x_i \tag{24}
$$

A top bound for $x_i$ is obtained by making $\sigma_j = 2$ for every $j > i$ in the equation on $x$ in (6). If we consider $\sigma_i = 2$ and substitute the expression (9) with $\rho = 2/3$ in the equation on $x$ of (6) we obtain:

$$
x_{i+1} \le x_i + \tfrac{16}{3}\, 4^{-2i} x_i = x_i \left(1 + \tfrac{16}{3}\, 4^{-2i}\right) \tag{25}
$$

Now, we obtain the value $x_{i+k}$ as a function of $x_i$:

$$
x_{i+k} \le x_i \prod_{j=i}^{i+k-1} \left(1 + \tfrac{16}{3}\, 4^{-2j}\right) \tag{26}
$$

As shown in Section 2.1, this product is convergent. The first inequality of (24) can be written as

$$
\frac{x_\infty}{x_i} \le \frac{3}{2} \tag{27}
$$

For practical implementations it is enough to take $x_{i+k}$ with a large $k$ value instead of $x_\infty$. Therefore, condition (27) can be substituted by

$$
\frac{x_{i+k}}{x_i} \le \frac{3}{2} \tag{28}
$$

with $k$ large enough. From expression (26) we can write

$$
\frac{x_{i+k}}{x_i} \le \prod_{j=i}^{i+k-1} \left(1 + \tfrac{16}{3} \cdot 4^{-2j}\right) \tag{29}
$$
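The product in (29) is easy to evaluate numerically. The sketch below (hypothetical function name) computes it for the first few starting iterations and compares it with the threshold $3/2$ of (28). The second threshold, $9/8$, is not stated in the text: it is what the same reasoning gives for condition (23) if $P_i(2) = \tfrac{3}{2} x_i$ and $L_2(x) = \tfrac{4}{3} x$ are assumed, and should be read as a derived assumption of this example.

```python
import math

def growth_bound(i, k=500):
    """Upper bound on x_{i+k} / x_i from (29), for rho = 2/3 and sigma_j = 2."""
    return math.prod(1 + (16 / 3) * 4.0**(-2 * j) for j in range(i, i + k))

bounds = [growth_bound(i) for i in range(3)]

# First i for which the bound meets each threshold.
first_ok_p1 = next(i for i, b in enumerate(bounds) if b <= 3 / 2)   # threshold of (28)
first_ok_p2 = next(i for i, b in enumerate(bounds) if b <= 9 / 8)   # assumed threshold for (23)
```

Since the factors approach 1 very quickly, a handful of terms already determines the product to many digits; the value `k=500` simply mirrors the setting quoted in the text.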

This product converges very quickly. We have verified that for $i = 0$ and $k = 500$ a bound for this product is 8.7, which does not fulfil condition (28), whereas for $i = 1$ a good bound is 1.4, which does fulfil condition (28). Therefore, we conclude that the value $P_1(1)$ can be used as a comparison point for the rest of the iterations, as this value is always within the overlap that is produced in the following iterations, that is:

$$
P_i(1) = P_1(1) \quad \forall i \ge 1 \tag{30}
$$

Proceeding in the same way from expression (23), we find that the value $P_2(2)$ may be used as a comparison point for the rest of the iterations, that is:

$$
P_i(2) = P_2(2) \quad \forall i \ge 2 \tag{31}
$$

Therefore, we only have to calculate the comparison points in the first three iterations. From then on, the values calculated as comparison points ($P_2(2)$ and $P_1(1)$) are valid for the following iterations. Figure 3 shows the evolution of the common overlap bound and the location of $P_1(1)$ and $P_2(2)$. As a consequence, the selection function is:

$$
\sigma_i =
\begin{cases}
+2 & \text{if } w_i > P_i(2) \\
+1 & \text{if } P_i(1) < w_i \le P_i(2) \\
\phantom{+}0 & \text{if } P_i(-1) < w_i \le P_i(1) \\
-1 & \text{if } P_i(-2) < w_i \le P_i(-1) \\
-2 & \text{if } w_i \le P_i(-2)
\end{cases}
\tag{32}
$$

with

$$
P_i(\pm 1) = \pm
\begin{cases}
\tfrac{1}{2}\, x_0 & \text{if } i = 0 \\
\tfrac{1}{2}\, x_1 & \text{if } i \ge 1
\end{cases}
\qquad
P_i(\pm 2) = \pm
\begin{cases}
\tfrac{3}{2}\, x_i & \text{if } i \le 1 \\
\tfrac{3}{2}\, x_2 & \text{if } i \ge 2
\end{cases}
\tag{33}
$$
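The selection function (32) with the frozen comparison points (33) can be sketched as follows. This is a floating-point model with full-precision comparisons (no truncation), the function names are hypothetical, and the negative comparison points are taken as $P_i(-q) = -P_i(q)$, which is an assumption based on the symmetry of the intervals. Inputs are assumed to satisfy $x_0 \ge 0.5$ and $|w_0| < 1$.

```python
def select_digit(w, p1, p2):
    """Selection function (32) for a = 2, given p1 = P_i(1) and p2 = P_i(2)."""
    if w > p2:
        return 2
    if w > p1:
        return 1
    if w > -p1:          # assumes P_i(-1) = -P_i(1)
        return 0
    if w > -p2:          # assumes P_i(-2) = -P_i(2)
        return -1
    return -2

def vectoring_fixed_points(x0, w0, n_iter):
    """Iterate (6) using comparison points that are frozen as in (33)."""
    x, w = x0, w0
    p1 = p2 = 0.0
    for i in range(n_iter):
        if i <= 1:
            p1 = 0.5 * x                 # P_i(1) = x_i / 2, frozen from i = 1 on
        if i <= 2:
            p2 = 1.5 * x                 # P_i(2) = 3 x_i / 2, frozen from i = 2 on
        sigma = select_digit(w, p1, p2)
        # w_i must lie inside [L_q(x_i), U_q(x_i)] for rho = 2/3.
        assert abs(w - sigma * x) <= (2 / 3) * x + 1e-12
        x, w = x + sigma * 4.0**(-2 * i) * w, 4.0 * (w - sigma * x)
    return x, w

x, w = vectoring_fixed_points(0.8, 0.6, 16)
```

After the third iteration no multiplication of $x_i$ by $1/2$ or $3/2$ is needed any more; only the two stored values are compared against $w_i$.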

3.1.2 Size of the comparators

As the comparison points depend on the values of the $x$ coordinate, the comparison with these points must be carried out by means of an add/subtract operation. This operation may be carried out in parallel with the shift associated with equations (6) in each iteration. However, we need $n$-bit additions/subtractions, which increase the hardware and slow down the comparison process (which would be significantly longer than the delay through the shifters). In what follows we prove that it is enough to perform the comparison with a few most significant bits of $P_i(1)$ and $P_i(2)$ for the coefficients to still be correct. This implies that fast adders/subtractors may be used for performing the comparison, saving hardware and obtaining comparison times of the same order as the delay of the shifters.

Reducing the number of bits to be compared

We now calculate the number of most significant bits of $w_i$ and $P_i(q)$ needed to perform the comparison without making errors. We must prevent the comparison from producing different results for the truncated and non-truncated values of $w_i$ and $P_i(q)$. Let us call $\hat{w}_i$ and $\hat{P}_i(q)$ the truncated values of $w_i$ and $P_i(q)$ using $f$ fractional bits. Let $b$ and $c$ be any values taken by $w_i$ and let $\hat{b}$ and $\hat{c}$ be the corresponding truncated values. Let us assume that the truncated values with $f$ fractional bits of the upper limit of the overlap interval $U_q(x_i)$ and the comparison point $P_i(q+1)$ correspond to the same value (see Figure 4). The value of $\hat{b}$ and $\hat{c}$ is the same as that of $\hat{P}_i(q+1)$ and for them we select $\sigma_i = q$; however, point $c$ is higher than $U_q(x_i)$ and does not permit any selection except $\sigma_i = q + 1$, and thus the decision made for point $c$ with the truncated values is not correct.

In order to prevent this situation, it is necessary for the truncated values $\hat{P}_i(q+1)$ and $\hat{U}_q(x_i)$ to be different. Thus, the distance between $P_i(q+1)$ and $U_q(x_i)$ must be greater than the precision we observe for the truncated values of $w_i$ ($2^{-f}$ = distance between two consecutive points of $\hat{w}_i$, see Figure 4). In this way, we always have $\hat{P}_i(q+1) \ne \hat{U}_q(x_i)$. We can express this mathematically as:

$$
|U_q(x_i) - P_i(q+1)| > 2^{-f} \tag{34}
$$

From Figure 1 we see that the overlap interval is $1/3 \cdot x_i$, and as $x_i$ is an increasing sequence, the amplitude of the intervals also grows (see Figure 3). For the same reason, the extremes of the intervals $L_q(x_i)$ and $U_q(x_i)$ are shifted in the same direction (see Figure 3). Consequently, the smallest of the distances between $P_i(q+1)$ and $U_q(x_i)$ arises in the initial iterations, and depends on the smallest value that $x_i$ may have in these iterations. Due to the normalization employed, the minimum and maximum magnitudes of the initial vector $(x_0, w_0)$ are $0.5$ and $\sqrt{2}$. The extremes of the intervals are $U_0(x_i) = 2/3 \cdot x_i$ and $U_1(x_i) = 5/3 \cdot x_i$ (see Figure 1).

We now calculate the number of fractional bits needed. Let us assume that $i > 0$; due to the normalization of the input data, the minimum value of $x_i$ is $0.5$. Thus, taking into account that the distance between $P_i(q+1)$ and $U_q(x_i)$ is greater than or equal to $1/6 \cdot x_i$, it must happen that $1/6 \cdot x_i > 2^{-f}$. From this expression we can deduce that $f > 3.5$, that is, we need at least 4 fractional bits. Reasoning in a similar fashion, we find that for $i = 0$ we need at least 5 fractional bits. Summarizing, we need to assimilate 5 fractional bits for $x_0$ and $w_0$ (a total of 7 bits taking into account the two bits of the integer part, including sign) and 4 fractional bits for $x_i$ and $w_i$ ($i \ge 1$) (a total of 8 bits taking into account the four bits of the integer part, including sign). We may now rewrite the selection function (32) as follows:

$$
\sigma_i =
\begin{cases}
+2 & \text{if } \hat{w}_i > \hat{P}_i(2) \\
+1 & \text{if } \hat{P}_i(1) < \hat{w}_i \le \hat{P}_i(2) \\
\phantom{+}0 & \text{if } \hat{P}_i(-1) < \hat{w}_i \le \hat{P}_i(1) \\
-1 & \text{if } \hat{P}_i(-2) < \hat{w}_i \le \hat{P}_i(-1) \\
-2 & \text{if } \hat{w}_i \le \hat{P}_i(-2)
\end{cases}
\tag{35}
$$

where $\hat{P}_i(1)$ and $\hat{P}_i(2)$ are the truncated values of $P_i(1)$ and $P_i(2)$ (see expressions (33)).

3.1.3 Extension to redundant arithmetic

Without any loss of generality, we assume that we use carry-save redundant arithmetic. We truncate $w_i$ to $f$ fractional bits. Since we are not taking into account bits with weight less than $2^{-f}$ for $w_i$, the maximum error is $2^{-f}$ in the sum word and $2^{-f}$ in the carry word, so the total error is $2^{-f+1}$. As this error is positive, the truncated value $\hat{w}_i$ and the real value $w_i$ satisfy:

$$
\hat{w}_i \le w_i < \hat{w}_i + 2^{-f+1} \tag{36}
$$

Now the truncated values $\hat{P}_i(q+1)$ and $\hat{U}_q(x_i)$ are guaranteed to differ only if the distance between $P_i(q+1)$ and $U_q(x_i)$ is larger than the precision we observe for the truncated values of $w_i$ in redundant arithmetic, $2^{-f+1}$. Condition (34) is now transformed into:

$$
|U_q(x_i) - P_i(q+1)| > 2^{-f+1} \tag{37}
$$

Using the same arguments, we find that in redundant arithmetic it is necessary to observe one more fractional bit. However, if the input data $(x_0, w_0)$ are in conventional arithmetic, this condition does not have to be applied for $i = 0$, and thus the number of fractional bits is 5 for all the iterations. Consequently, the selection function we propose in redundant carry-save arithmetic coincides with (35), truncating $P_i(q)$ and $w_i$ to 5 fractional bits.
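The error bound (36) can be illustrated with integers standing in for the two carry-save words. In this sketch (hypothetical names, small word lengths chosen only so that the check can be exhaustive) each word is an integer in units of $2^{-n}$, and truncation drops the $n - f$ low bits of each word separately before the two short words are added.

```python
def truncate_carry_save(s, c, n, f):
    """Estimate of w = s + c using only the bits of weight >= 2**-f of each word.

    s and c are integers in units of 2**-n (Python ints behave like
    two's complement under >>); the result is in units of 2**-f.
    """
    k = n - f
    return (s >> k) + (c >> k)

n, f = 8, 5
worst = 0
for s in range(-64, 64):
    for c in range(-64, 64):
        w_hat = truncate_carry_save(s, c, n, f)
        err = (s + c) - (w_hat << (n - f))      # w - w_hat, in units of 2**-n
        assert 0 <= err < 2 * 2 ** (n - f)      # i.e. 0 <= w - w_hat < 2**(-f+1)
        worst = max(worst, err)
```

The error is never negative and stays below two units of the last retained position, which is why one extra fractional bit is enough to restore the margin available in the non-redundant case.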

3.2 Selection function by look-up table

We prefer to develop this method in redundant arithmetic. The non-redundant arithmetic version can be easily obtained from the redundant one and is studied in Section 3.2.2. For the selection process we have to take into account that both $w_i$ and $x_i$ are represented in redundant carry-save form, but due to the overlap between the selection intervals, we can take an estimation of these values to obtain $\sigma_i$. Assume that we assimilate $w_i$ up to the $t$-th fractional bit, and $x_i$ up to the $\delta$-th fractional bit. We call the assimilated values $\hat{w}_i$ and $\hat{x}_i$ respectively. Therefore we can write that:

$$
\hat{w}_i \le w_i < \hat{w}_i + 2^{-t+1} \quad \text{and} \quad \hat{x}_i \le x_i < \hat{x}_i + 2^{-\delta+1} \tag{38}
$$

Now we have to obtain relations between $t$ and $\delta$ that assure the convergence of the algorithm, that is, the conditions that assure a correct selection of $\sigma_i$. We follow Figure 5 to obtain the suitable values for $t$ and $\delta$. To make a correct selection of $\sigma_i$ from an estimation of $x_i$ ($\hat{x}_i$), the overlap ($\Delta_q[\hat{x}_i]$) we have to consider between the intervals $q$ and $q - 1$ is:

$$
\Delta_q[\hat{x}_i] = U_{q-1}[\hat{x}_i] - L_q[\hat{x}_i + 2^{-\delta+1}] \tag{39}
$$

The value of $\Delta_q[\hat{x}_i]$ is the worst-case overlap, dependent only on the value of the estimation of $x_i$. In this way the selection only depends on the assimilated value of $x_i$ and not on the true value. On the other hand, to make a correct selection using an estimation of $w_i$, it is necessary that the overlap between the intervals ($\Delta_q[\hat{x}_i]$) be greater than $2^{-t}$ (this is the same case as the radix-4 SRT division [10]). Therefore, the selection with estimations will be correct if the following condition is satisfied:

$$
\Delta_q[\hat{x}_i] \ge 2^{-t} \tag{40}
$$

From condition (40) and taking into account equations (7), (8) and (39), we obtain:

$$
(\rho + q - 1) \cdot \hat{x}_i - (q - \rho) \cdot (\hat{x}_i + 2^{-\delta+1}) \ge 2^{-t} \tag{41}
$$

The worst-case condition is obtained for the greatest allowable value of $q$, that is $q = a = 3 \cdot \rho$. Then we obtain a new expression:

$$
(2 \cdot \rho - 1) \cdot \hat{x}_i - 2 \cdot \rho \cdot 2^{-\delta+1} \ge 2^{-t} \tag{42}
$$

The values of $\delta$ and $t$ are constrained by the values of $\rho$ and $\hat{x}_i$. To obtain $\delta$ and $t$ independent of the value of $x_i$, we must consider the worst case in expression (42), that is, we have to take $\hat{x}_i$ as the minimum possible value for $x_i$. We assume that $x_0$ or $y_0$ is normalized in the range $[0.5, 1)$, and then $x_1 \ge 0.5$ [5]. The minimum value of $x_i$ is 0.5 since $x_{i+1} \ge x_i \ge x_1$ [5]. Then we take $\hat{x}_i = 0.5$ in expression (42). The parameter $\rho$ can take the values $2/3$ or $1$. From condition (42) we obtain suitable values for $\delta$ and $t$ for both cases, $\rho = 2/3$ and $\rho = 1$:

a) $\rho = 2/3$: For this case $a = 2$ and $\sigma_i \in \{-2, -1, 0, +1, +2\}$. In the CORDIC iteration (see equation (6)) $\sigma_i$ multiplies the values of $x_i$ and $w_i$, so this digit set is very attractive: all nonzero digits are powers of two, and the multiplication by $\sigma_i$ can be done only by shifting. Suitable values are $\delta = 5$ and $t = 5$ (actually $t = 4$ could be used, but the resulting digit selection would be very complex).

b) $\rho = 1$: In this case $a = 3$ and then $\sigma_i \in \{-3, -2, -1, 0, +1, +2, +3\}$. The introduction of the value 3 in the digit set implies that an additional adder must be incorporated to perform the multiplication by 3. For this case we obtain $\delta = 4$ and $t = 3$ (as in the previous case, $t = 2$ could be used but this would result in a complex selection function).

As the selection of $\sigma_i$ is in the critical path of our architecture, we make design decisions in order to achieve a reduced critical path time at the cost of more silicon area, so we take $\rho = 1$, which leads to a less complex selection than the case $\rho = 2/3$, since the values of $\delta$ and $t$ are lower in case b) (note that for a digit selection table the number of input bits is critical).

We have determined the number of fractional bits of $x_i$ and $w_i$ that must be assimilated ($\delta$ and $t$). Now we have to determine the number of integer bits to assimilate for both operands, to obtain the total number of bits to be assimilated.
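Condition (42) can be checked with exact rational arithmetic. In the sketch below the symbols `delta` and `t` stand for the number of assimilated fractional bits of $x_i$ and $w_i$ (the name `delta` is a label chosen here, as are the function names), and the worst case $\hat{x}_i = 0.5$ is used. The search returns the smallest `delta` that leaves a positive overlap and then the smallest `t` compatible with it; these are the minimal settings, which the text mentions as usable at the cost of a more complex selection function.

```python
from fractions import Fraction

def margin(rho, delta, x_hat=Fraction(1, 2)):
    """Left-hand side of (42): worst-case overlap for the estimate x_hat."""
    return (2 * rho - 1) * x_hat - 2 * rho * Fraction(1, 2 ** (delta - 1))

def smallest_parameters(rho, max_bits=12):
    """Smallest delta with a positive margin, then the smallest t satisfying (42)."""
    for delta in range(1, max_bits + 1):
        m = margin(rho, delta)
        if m > 0:
            t = next(t for t in range(max_bits + 1) if Fraction(1, 2 ** t) <= m)
            return delta, t
    raise ValueError("no solution within max_bits")

results = {rho: smallest_parameters(rho) for rho in (Fraction(2, 3), Fraction(1))}

# The (delta, t) pairs quoted in the text also satisfy (42):
chosen_ok = (Fraction(1, 2 ** 5) <= margin(Fraction(2, 3), 5)
             and Fraction(1, 2 ** 3) <= margin(Fraction(1), 4))
```

Using `Fraction` avoids a false positive at the boundary values of `delta`, where the margin is exactly zero and floating-point rounding could make it appear slightly positive.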

The maximum value of $x_i$ ($x_{max}$) is given by [5]

$$
x_{max} = K_{max} \cdot (x_0^2 + w_0^2)^{1/2}
$$

where $K_{max}$ is the maximum value of the scale factor. The maximum value of the scale factor depends on the value of $a$. If the scale factor is too large, the range of $x$ is also large and then more integer bits must be assimilated. In order to reduce the value of $K_{max}$ we perform the microrotation $i = 0$ as a radix-2 microrotation with $\sigma_0 \in \{-1, +1\}$. We can do that since we assume that the maximum angle to be computed is within the interval $[-\pi/2, +\pi/2]$. It can be easily demonstrated that, making the microrotation $i = 0$ radix-2, this range is covered. In this way, according to equations (2) and (4) and taking the maximum value for $\sigma_i$ in every iteration (that is, $\sigma_0 = 1$ and $\sigma_i = 3 \;\; \forall i \ge 1$), we obtain a scale factor of $K_{max} = 1.80068$, and taking into account the expression for $x_{max}$, we obtain $x_{max} = 2.55$. As the range for the angle is $[-\pi/2, +\pi/2]$, the value of $x_i$ is always positive. Then we have to assimilate six bits of $x_i$ in each iteration: two integer bits (without sign) and four fractional bits.

The maximum value for $w_i$ ($w_{max}$) is easy to obtain, taking into account that $|w_i| < 4 \cdot \rho \cdot x_i$. As we have selected $\rho = 1$, $w_{max} = 4 \cdot x_{max} = 10.2$. Then we have to assimilate eight bits of $w$: five integer bits and three fractional bits.
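The figures quoted for $K_{max}$, $x_{max}$ and $w_{max}$ can be reproduced with a short computation. The sketch below uses hypothetical names and assumes, as the text does, a radix-2 first microrotation followed by $|\sigma_i| = 3$, an initial magnitude bounded by $\sqrt{2}$ (from $|x_0|, |w_0| < 1$), and $\rho = 1$.

```python
import math

def k_max(n_iter=32, a=3):
    """Largest scale factor: |sigma_0| = 1 at i = 0, then |sigma_i| = a for i >= 1."""
    k = math.sqrt(1 + 1.0)                                  # i = 0, radix-2 term of (2)
    for i in range(1, n_iter):
        k *= math.sqrt(1 + a * a * 4.0**(-2 * i))           # i >= 1, terms of (4)
    return k

K = k_max()
x_max = K * math.sqrt(2)          # largest initial magnitude is below sqrt(2)
w_max = 4 * x_max                 # bound (9) with rho = 1

x_int_bits = math.ceil(math.log2(x_max))        # unsigned integer bits needed for x
w_int_bits = math.ceil(math.log2(w_max)) + 1    # integer bits for w, plus a sign bit
```

The computed `x_max` comes out slightly below the rounded value 2.55 used in the text, and the integer bit counts agree with the two and five bits stated above.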

3.2.1 Size of the look-up table

The selection of $\sigma_i$ is done by implementing a selection function (usually a look-up table) whose inputs are the assimilated bits of $w_i$ and $x_i$. The look-up table would thus have a total of 14 input bits, which is very large and would make the look-up operation too slow. To reduce the complexity of the table we make use of the scaling technique. This technique has been widely used for division [10], and consists of scaling the dividend and the divisor such that the scaled divisor lies within a certain range. The scaling does not affect the result, since the quotient depends only on the ratio between the dividend and the divisor and scaling does not change this ratio. The same idea can be applied to our CORDIC algorithm: scaling $w_i$ and $x_i$ does not affect the angle to be computed. To make the implementation simpler we perform the scaling over the assimilated values, and not over the full-length words of $w_i$ and $x_i$.

We propose scaling the value of $\hat{x}_i$ so that the scaled value lies in the interval $[0.5, 1)$. As the range of $\hat{x}_i$ is $[0.5, 2.55)$, the scaling operation involves only right shifts, and it does not affect the result since the assimilation error is reduced as well. The scaling operation is very simple: if $\hat{x}_i \in [0.5, 1)$ no scaling is performed, if $\hat{x}_i \in [1, 2)$ one right shift is performed, and if $\hat{x}_i \in [2, 2.55)$ two right shifts are carried out. The scaling also affects $\hat{w}_i$, whose scaled value lies within the range $(-4.5, 4.25)$. We only have to consider 3 bits of $\hat{x}_i$ (since $\hat{x}_i \geq 0.5$, the bit with weight 0.5 is always one) and 7 bits of $\hat{w}_i$. Following a procedure similar to that of radix-4 SRT division [10], we have obtained the selection function for $\sigma_i$. Table 1 shows the selection function to be implemented by means of a look-up table.

                                        $\hat{w}_i \times 4$
$\hat{x}_i$         $\sigma_i=-3$  $\sigma_i=-2$  $\sigma_i=-1$  $\sigma_i=0$  $\sigma_i=1$  $\sigma_i=2$  $\sigma_i=3$
[0.5000, 0.5625)    [-11, -8]      [-7, -6]       [-5, -3]       [-2, 0]       [1, 2]        [3, 4]        [5, 10]
[0.5625, 0.6250)    [-12, -8]      [-7, -6]       [-5, -3]       [-2, 0]       [1, 2]        [3, 5]        [6, 11]
[0.6250, 0.6875)    [-13, -8]      [-7, -6]       [-5, -3]       [-2, 0]       [1, 2]        [3, 5]        [6, 12]
[0.6875, 0.7500)    [-14, -8]      [-7, -6]       [-5, -3]       [-2, 0]       [1, 4]        [5, 6]        [7, 13]
[0.7500, 0.8125)    [-15, -8]      [-7, -6]       [-5, -3]       [-2, 0]       [1, 4]        [5, 6]        [7, 14]
[0.8125, 0.8750)    [-16, -8]      [-7, -6]       [-5, -3]       [-2, 0]       [1, 4]        [5, 8]        [9, 15]
[0.8750, 0.9375)    [-17, -13]     [-12, -6]      [-5, -3]       [-2, 0]       [1, 4]        [5, 8]        [9, 16]
[0.9375, 1.0000)    [-18, -13]     [-12, -6]      [-5, -3]       [-2, 0]       [1, 4]        [5, 8]        [9, 17]

Table 1: Selection table for $\sigma_i$.

As we can see in this table, the least significant bit of the scaled value of $\hat{w}_i$ does not affect the selection, and therefore the look-up table has 9 input bits (3 corresponding to $\hat{x}_i$ and 6 corresponding to $\hat{w}_i$) and 3 output bits.
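The scaling step and the table look-up described in this subsection can be sketched in software as follows. This is only an illustration: the bounds are transcribed from Table 1, while the names `UPPER`, `scale` and `select_sigma`, the use of floating-point inputs, and the truncation of $4 \cdot \hat{w}_i$ by `math.floor` are assumptions made for the example rather than details given in the text.

```python
import math

# Inclusive upper bounds of 4*w_hat for digits -3, -2, -1, 0, 1, 2 (larger values select 3).
# One row per x_hat interval of width 1/16, starting at 0.5 (transcribed from Table 1).
UPPER = [
    (-8, -6, -3, 0, 2, 4),
    (-8, -6, -3, 0, 2, 5),
    (-8, -6, -3, 0, 2, 5),
    (-8, -6, -3, 0, 4, 6),
    (-8, -6, -3, 0, 4, 6),
    (-8, -6, -3, 0, 4, 8),
    (-13, -6, -3, 0, 4, 8),
    (-13, -6, -3, 0, 4, 8),
]

def scale(x_hat: float, w_hat: float):
    """Right-shift both estimates so that the scaled x_hat lies in [0.5, 1)."""
    if not 0.5 <= x_hat < 2.55:
        raise ValueError("x_hat outside the range assumed in the text")
    shift = 0 if x_hat < 1.0 else (1 if x_hat < 2.0 else 2)
    return x_hat / 2 ** shift, w_hat / 2 ** shift

def select_sigma(x_hat: float, w_hat: float) -> int:
    """Return the digit selected by Table 1 for the (unscaled) estimates."""
    xs, ws = scale(x_hat, w_hat)
    row = int((xs - 0.5) * 16)   # 3 bits of x_hat below the always-one 0.5 bit
    w4 = math.floor(ws * 4)      # two fractional bits kept (assumed truncation)
    for digit, upper in zip(range(-3, 3), UPPER[row]):
        if w4 <= upper:
            return digit
    return 3

print(select_sigma(1.5, 3.0))  # scaled to x_hat = 0.75, 4*w_hat = 6
```

Only the upper bound of each interval is stored because the intervals in every row of Table 1 are contiguous; a hardware table would instead be indexed directly by the 9 input bits.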

3.2.2 The look-up table method in non-redundant arithmetic

The non-redundant arithmetic case can be seen as a simplification of the redundant arithmetic case. Now, the maximum assimilation error in coordinates $w$ and $x$ is $2^{-t}$ and $2^{-\delta}$ respectively, and expression (38) becomes:

$\hat{w}_i \leq w_i < \hat{w}_i + 2^{-t}$ and $\hat{x}_i \leq x_i < \hat{x}_i + 2^{-\delta}$    (43)

In this case, the overlap is q [^xi ]  0 and condition (42) becomes: 2?  (22? 1) x^i (44)

In this case the best option is to choose $\rho = 2/3$, since we avoid the adders needed to work with the value $\sigma_i = \pm 3$, and the size of the table obtained is 9 input bits (6 bits for $w$ and 3 bits for $x$), which is similar to the redundant case with $\rho = 1$.

4 Scale Factor

If we are interested in the magnitude of the vector, it is necessary to compensate the scale factor. We use the same technique that appears in [8] to solve the non-constant scale factor problem. Basically, after the first $n/8 + 1$ microrotations we access a table to obtain the value of the scale factor, and in the next iterations ($n/8 + 1 \leq i \leq n/4$) we calculate the scale factor by a shift-and-add operation in each iteration, since for these iterations the scale factor generated, $k_i = \sqrt{1 + \sigma_i^2 4^{-2i}}$, may be approximated by the first two terms of its Taylor series expansion ($k_i \approx 1 + 1/2 \cdot \sigma_i^2 4^{-2i}$). Finally, we perform the division by $K$ ($K = \prod k_i$) using the radix-4 CORDIC algorithm in the vectoring mode and linear coordinates (this is a conventional radix-4 division). The radix-4 CORDIC equations in linear coordinates are the following:

$x_{i+1} = x_i$
$w_{i+1} = 4 \cdot (w_i - \sigma_i \cdot x_i)$    (45)
$z_{i+1} = z_i + \sigma_i \cdot 4^{-i}$

Performing the same analysis as in Section 2 we can test the convergence of the algorithm in linear coordinates. Also, the same selection functions given in Section 3 can be used. After $n/2$ iterations, we obtain the following value in coordinate $z$:

$z_{n/2} = z_0 + w_0 / x_0$    (46)

Therefore, performing a suitable selection of the coordinates, we can carry out the division. In the next section we analyze the hardware requirements for calculating and compensating the scale factor.
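The linear-coordinate recurrence (45) and its result (46) can be illustrated with exact rational arithmetic. This is a sketch only: the digit selection used here (nearest integer to $w_i / x_i$, clipped to $[-3, 3]$) is a simple stand-in chosen for the example, not the selection function of Section 3, and the function name `radix4_divide` is hypothetical. It assumes $x_0 > 0$ and $|w_0 / x_0| \leq 3.5$.

```python
from fractions import Fraction

def radix4_divide(w0, x0, n_iter=16, z0=0):
    """Radix-4 linear-mode recurrence; returns z, an approximation of z0 + w0/x0."""
    x, w, z = Fraction(x0), Fraction(w0), Fraction(z0)
    for i in range(n_iter):
        # Illustrative digit selection (NOT the paper's): round and clip to [-3, 3].
        sigma = max(-3, min(3, round(w / x)))
        w = 4 * (w - sigma * x)            # w_{i+1} = 4 (w_i - sigma_i x_i)
        z += sigma * Fraction(1, 4 ** i)   # z_{i+1} = z_i + sigma_i 4^{-i}
        # x_{i+1} = x_i, so x is left unchanged
    return z

q = radix4_divide(Fraction(3, 4), Fraction(5, 8), n_iter=16)
print(float(q))  # close to 0.75 / 0.625 = 1.2
```

With this selection the residual satisfies $|w_i| \leq 2 x_0$ after every step, so the error after $N$ iterations is bounded by $2 \cdot 4^{-N}$. For scale factor compensation the operands would be the scaled coordinate and $K$.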

5 Architectures

In this section we present different architectures that implement the radix-4 CORDIC algorithm in the vectoring mode. Several options exist: redundant or non-redundant arithmetic; selection with arithmetic comparisons or selection by table; word-serial or pipelined. We illustrate the implementation of the radix-4 vectoring algorithm with two of these architectures. First, we consider a word-serial architecture using selection by the arithmetic comparison method in conventional arithmetic. Then we present a word-serial architecture using selection by look-up table in redundant arithmetic. Finally, some comments are given on the implementation in a pipelined architecture.

5.1 Architecture for the arithmetic comparison method

In this subsection we develop a word-serial architecture in non-redundant arithmetic with selection by arithmetic comparisons. First, we design the hardware to implement the selection function (35), where the zero-skipping technique is incorporated. Then, we design the data path, explaining the different operations that are carried out on this architecture.

5.1.1 Implementation of the selection function with the zero-skipping technique

If a microrotation obtains $\sigma_i = 0$, then $x_{i+1} = x_i$ and $w_{i+1} = 4 w_i$ (see equations (6)), and the value of $\sigma_{i+1}$ can be obtained directly from $w_i$ (note that $4 w_i$ is no more than a shift of $w_i$). Thus, if in a microrotation we obtain, in parallel, coefficients $\sigma_i$ and $\sigma_{i+1}$ and the first is zero, it is not necessary to carry out microrotation $i$ because the rotation angle is zero, and we can proceed directly to microrotation $i + 1$. In this way, we skip iteration $i$, reducing the total number of microrotations. This technique is called the zero-skipping technique and was initially developed for the division algorithm [11]. In [8] we used this technique for the rotation mode, and concluded that a reduction of about 20% in the total number of iterations can be achieved.

Figure 6 presents the hardware implementation of the selection functions incorporating the zero-skipping technique. Registers P1 and P2 keep the values of the comparison points $\hat{P}_i(1)$ and $\hat{P}_i(2)$. In order to apply the zero-skipping technique, we must carry out a double comparison with the comparison points. On the one hand, we use two 7-bit comparators (basically the hardware necessary for generating the carry $c_7$ of a 7-CLA) for comparing the 8 MSBs of $w_i$ with the comparison points in order to obtain $\sigma_i$; on the other hand, we need a twin architecture (indicated with dotted lines in Figure 6) that carries out the comparison with the 8 bits that follow the two most significant bits of $w_i$ in order to obtain $\sigma_{i+1}$ (if $\sigma_i = 0$, $w_{i+1} = 4 w_i$ and the two MSBs of $w_i$ are zero). The Control Logic block in Figure 6 generates the value of $\sigma_i$ from the signal $c_7$ of each comparator and the sign of $w_i$. The skip is activated when $\sigma_i = 0$. For low values of $n$ this technique may not be efficient enough; in that case, we eliminate the hardware indicated with dotted lines in Figure 6.
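The effect of zero skipping on the iteration count can be sketched by counting cycles for a given digit sequence. The model below is an assumption made for illustration: each cycle evaluates $\sigma_i$ and $\sigma_{i+1}$ in parallel, so at most one zero digit is skipped per cycle. The function name `count_cycles` is hypothetical, and the sketch does not attempt to reproduce the 20% figure reported in [8], which depends on the statistics of the digits.

```python
def count_cycles(digits):
    """Cycles needed when a zero digit at position i lets the same cycle perform
    microrotation i+1 (at most one skip per cycle, since only sigma_i and
    sigma_{i+1} are evaluated in parallel)."""
    cycles, i = 0, 0
    while i < len(digits):
        if digits[i] == 0 and i + 1 < len(digits):
            i += 2   # skip microrotation i, perform microrotation i+1
        else:
            i += 1   # perform microrotation i
        cycles += 1
    return cycles

print(count_cycles([1, 0, 2, 0, 0, 3]))  # 4 cycles instead of 6
```

In the example, the zeros at positions 1 and 3 are skipped, while the zero at position 4 is executed as the second digit of a cycle.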

5.1.2 Design of a data path

In this section we analyze in detail a possible word-serial architecture in non-redundant arithmetic. Figure 7 shows the architecture of paths $x$, $w$ and $z$. The vectoring mode is realized by programming the data paths as a function of the current iteration. The basic processes are the calculation of the comparison points, the radix-4 CORDIC iterations in circular coordinates (equations (6)), the calculation of the scale factor and, finally, its compensation (radix-4 CORDIC in linear coordinates). Some of these processes are carried out in parallel, and we now describe how the system works as a function of the iteration. Module A in Figure 7 corresponds to the hardware that implements the selection function with the zero-skipping technique, which is shown in detail in Figure 6. Table 2 may help in following the description; it reflects the operation mode of the paths $x$, $w$ and $z$, the function carried out by each path, and the operation performed in Module A for each of the iterations.

1. Iterations i=0 to i=n/8

The main processes carried out in these iterations are the calculation of the comparison points, the processing of the corresponding radix-4 CORDIC iterations and the generation of the address in the scale factor table corresponding to the angle to be rotated.

i               Mode (x)          Mode (w and z)    Function x                                        Function w                               Function z                                  Module A operation
0               Eval. $P_i(2)$    -                 $P_0(2) = 3/2 \cdot x_0$                          -                                        -                                           Load P1, P2
0               CORDIC circular   CORDIC circular   $x_1 = x_0 + \sigma_0 w_0$                        $w_1 = 4(w_0 - \sigma_0 x_0)$            $z_1 = z_0 + \alpha_0(\sigma_0)$            Comput. $\sigma_0$
1               Eval. $P_i(2)$    -                 $P_1(2) = 3/2 \cdot x_1$                          -                                        -                                           Load P1, P2
1               CORDIC circular   CORDIC circular   $x_2 = x_1 + \sigma_1 4^{-2} w_1$                 $w_2 = 4(w_1 - \sigma_1 x_1)$            $z_2 = z_1 + \alpha_1(\sigma_1)$            Comput. $\sigma_1$
2               Eval. $P_i(2)$    -                 $P_2(2) = 3/2 \cdot x_2$                          -                                        -                                           Load P1, P2
2 to n/8        CORDIC circular   CORDIC circular   $x_{i+1} = x_i + \sigma_i 4^{-2i} w_i$            $w_{i+1} = 4(w_i - \sigma_i x_i)$        $z_{i+1} = z_i + \alpha_i(\sigma_i)$        Comput. $\sigma_i$
n/8+1 to n/4    CORDIC circular   CORDIC circular   $x_{i+1} = x_i + \sigma_i 4^{-2i} w_i$            $w_{i+1} = 4(w_i - \sigma_i x_i)$        $z_{i+1} = z_i + \alpha_i(\sigma_i)$        Comput. $\sigma_i$, Store $|\sigma_i|$
n/4+1 to n/2    Compute scale     CORDIC circular   $k_{i+1} = k_i (1 + \sigma_j^2 2^{-4j-1})$ *      $w_{i+1} = 4(w_i - \sigma_i x_{n/4})$    $z_{i+1} = z_i + \alpha_i(\sigma_i)$        Comput. $\sigma_i$
                factor
n/2 to n **     CORDIC linear     CORDIC linear     $x_{r+1} = x_r$                                   $w_{r+1} = 4(w_r - \sigma_r K)$          $z_{r+1} = z_r + \sigma_r 4^{-r}$, r=i-n/2  Comput. $\sigma_r$

* j=i-n/8
** Only for scale factor compensation purposes

Table 2: Operation mode and function of the paths $x$, $w$, $z$ and Module A

The evaluation of the comparison points $\hat{P}_0(1)$, $\hat{P}_0(2)$, $\hat{P}_1(1)$, $\hat{P}_1(2)$ and $\hat{P}_2(2)$ is performed by means of specific-purpose iterations using data path $x$. In the first, third and fifth iterations, we program this path for the evaluation of comparison point $P_i(2)$ (Eval. $P_i(2)$ mode in Table 2). In the second, fourth and from the sixth iteration on, the hardware is programmed to evaluate the radix-4 CORDIC equations in circular coordinates (CORDIC circular mode in Table 2), that is, equations (6). MUX-1 allows the value $3/2 \cdot x_i$ to be obtained (leftmost input) and, together with MUX-2,3,4, permits selecting the appropriate input for supporting $\sigma_i = \pm 1, \pm 2$, or $\sigma_{i+1} = \pm 1, \pm 2$ if a zero skip takes place. As we can see in Table 2, it is not necessary to calculate $P_i$ after $i = 2$, since the comparison point obtained in iteration $i = 2$ is valid for the remaining iterations. The $\sigma_i$ values obtained in these first iterations are also used for addressing the table that stores the different scale factors (see Section 4).

2. Iterations i=n/8+1 to i=n/4

During these iterations, paths $x$, $w$ and $z$ evaluate equations (6) (radix-4 CORDIC in circular coordinates). Module A calculates $\sigma_i$ ($\sigma_{i+1}$ if a zero skip occurs) and puts the values $|\sigma_i|$ in a shift register, which is needed for the calculation of the scale factor in later iterations.

3. Iterations i=n/4+1 to i=n/2

The main processing carried out during this period is the calculation of the scale factor (on data path $x$) and the completion of the angle computation (on data path $z$). Taking into account that the maximum value obtained experimentally for $w_i$ is 5.3, we have that $\sigma_i 4^{-2i} w_i < 2^{-n+1}$ if $i \geq n/4 + 1$. Therefore we obtain, from (6), that $x_{i+1} \approx x_i$ for the precision considered, and it is not necessary to use path $x$ from this iteration on.

Now path $x$ is free, and it is used to evaluate the scale factor (Compute Scale factor mode in Table 2). To preserve the value $x_{n/4}$ obtained in the previous iterations we use the auxiliary register RX' in Figure 7. The scale factor produced by the first $n/8 + 1$ rotations is obtained from the scale factor table, and it is initially loaded into register RX (see Figure 7). From then on, the approximation $k_i = 1 + 1/2 \cdot \sigma_i^2 4^{-2i}$ [8] is computed on path $x$ (see Section 4). At the end of these iterations we have obtained the value of the scale factor $K$, which is in register RX, and the value of the final rotated angle, which is in register RZ. At that moment we have obtained one of the two results generated by the CORDIC algorithm: the value of the angle (argument of the initial vector). In applications that do not require the evaluation of the magnitude of the vector (for example, angle calculation and rotation [5]) the scale factor table and the calculation of the scale factor carried out in data path $x$ are not necessary.

4. Iterations i=n/2+1 to i=n

These iterations compensate the scale factor in those applications that require it. To do this, we program paths $x$, $w$ and $z$ to perform the radix-4 CORDIC algorithm in the vectoring mode in linear coordinates (CORDIC linear mode in Table 2; see also the radix-4 CORDIC equations in linear coordinates (45)). Registers RW and RX were loaded with $x_{n/4}$ and $K$ respectively in iteration $i = n/2$. The value $4^{-i}$ of the equation in $z$ of (45) is obtained directly from the angle table as follows: from iteration $i \geq n/6$ we have that $\tan^{-1}(\sigma_i 4^{-i}) \approx \sigma_i 4^{-i}$, and thus the values $4^{-i}$ with $i \geq n/6$ are already present in the table; we only have to add to the primitive angle table the values $4^{-i}$ with $i < n/6$ (for example, for $n = 32$ we would have to add six values: $4^0, 4^{-1}, \ldots, 4^{-5}$). Consequently, after iteration $i = n$ we obtain in RZ the value of the magnitude of the initial vector (see equation (46)).
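The overall flow described in the four phases above (circular iterations, accumulation of the angle, tracking of the scale factor and its final compensation) can be sketched in floating point as follows. This is a behavioural illustration only, under several assumptions: the recurrences are those listed in Table 2, the first microrotation is radix-2, the digit selection is a simple rounding of $w_i / x_i$ clipped to $[-3, 3]$ rather than the comparison or table methods of Section 3, the scale factor is computed exactly instead of through the table and Taylor approximation, and the final compensation is a plain division. The function name `radix4_vectoring` is hypothetical.

```python
import math

def radix4_vectoring(x0, y0, n_iter=16):
    """Return (angle, magnitude) of the vector (x0, y0); assumes x0 >= 0, not both zero."""
    x, w, z, k = float(x0), float(y0), 0.0, 1.0
    for i in range(n_iter):
        if i == 0:
            sigma = 1 if w >= 0 else -1              # radix-2 first microrotation
        else:
            sigma = max(-3, min(3, round(w / x)))    # illustrative selection only
        t = sigma * 4.0 ** (-i)                      # tangent of the microrotation angle
        # x_{i+1} = x_i + sigma_i 4^{-2i} w_i ; w_{i+1} = 4 (w_i - sigma_i x_i)
        x, w = x + sigma * 4.0 ** (-2 * i) * w, 4.0 * (w - sigma * x)
        z += math.atan(t)                            # z_{i+1} = z_i + alpha_i(sigma_i)
        k *= math.sqrt(1.0 + t * t)                  # k_i = sqrt(1 + sigma_i^2 4^{-2i})
    return z, x / k                                  # compensation done by plain division

angle, magnitude = radix4_vectoring(0.6, 0.8)
print(angle, magnitude)
```

Because the digits depend on the data, the accumulated factor `k` differs from one input to another, which is why the architecture needs the scale factor table and the compensation iterations.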

5.2 Architecture for the look-up table method

Figure 9 shows the architecture that implements the radix-4 algorithm with this method using redundant carry-save arithmetic. The considerations for calculating and compensating the scale factor are similar to those of the previous subsection, and they have been omitted to make the architecture easier to understand. Since for the selection with table we have considered that $\sigma_i$ may take the value $\pm 3$, two 4-to-2 adders are needed to compute the products $3 \cdot x_i$ and $3 \cdot w_i$. These values are needed if $|\sigma_i| = 3$. Two word multiplexers select $|\sigma_i| \cdot x_i$ and $|\sigma_i| \cdot w_i$. Finally, two 4-to-2 adder/subtracters produce $x_{i+1}$ and $w_{i+1}$.

Figure 8 shows the block diagram of the digit selection network, which is in charge of the selection of $\sigma_i$. Six bits of $x_i$ and eight bits of $w_i$ are assimilated. From the two most significant bits of $\hat{x}_i$ we obtain the shift required for the scaling. The scaling of both $\hat{x}_i$ and $\hat{w}_i$ is performed by means of multiplexers, whose outputs are the inputs to the look-up table. The table has 9 input bits and 3 output bits. For the $z$ data path, a look-up table is needed to store the microrotation angles, together with a multiplexer that selects the microrotation angle according to the value of $\sigma_i$; finally, a 3-to-2 adder/subtracter performs the iteration to obtain $z_{i+1}$. We use a 3-to-2 adder/subtracter since the microrotation angle has a non-redundant representation.

5.3 Pipelined architectures

For a pipeline the crucial points are the hardware cost, the latency and the throughput (related to the cycle time). For the radix-4 CORDIC vectoring algorithm, the selection of $\sigma_i$ has to be implemented in each of the iterations. As we have seen in previous sections, the selection (using arithmetic comparison or table) is more complex (in time and in hardware) than in the case of the radix-2 algorithm. The problems with the pipelined implementation are two-fold:

1. Since the shift to be performed in each microrotation is known in advance, hardwired shifts are used. Thus in this architecture there is no overlap between the digit selection and the shift operation, and the complex digit selection is fully in the critical path.

2. The hardware associated with each microrotation is replicated. This implies replicating the hardware for digit selection. Furthermore, in the case of selection by table, two additional adders are needed in each microrotation to perform the multiplication by 3.

Due to these factors, it seems that a fully pipelined radix-4 CORDIC vectoring architecture needs additional research to be efficient.

6 Evaluation and comparison

In this section we compare the word-serial architectures proposed in the previous sections with each other, with the architectures proposed in [5] and [9] in the case of redundant arithmetic, and with a conventional radix-2 architecture (with $n$ iterations) when non-redundant arithmetic is used. To carry out the evaluation we express the delay and area of each hardware element in terms of the delay and area of one full adder ($t_{fa}$ and $a_{fa}$). We have used the reference technology and library (ES2-ECPD10 Standard Cells Library, 1 $\mu$m double metal CMOS [12]) for the hardware elements which do not have recognized delays in terms of the delay of one full adder. The delays and areas assumed for the different hardware elements are shown in Table 3.

Element                          Symbol          Delay ($t_{fa}$)                  Area ($a_{fa}$)
Buffer                           $t_{buf}$       1.0                               1
2-to-1 mux                       $t_{2-1mux}$    0.5                               0.5n
3-to-1 mux                       $t_{3-1mux}$    0.5                               0.75n
4-to-1 mux                       $t_{4-1mux}$    0.5                               n
6-to-1 mux                       $t_{6-1mux}$    1.0                               1.25n
Register                         $t_{reg}$       1.5                               n
3-to-2 csa                       $t_{3-2csa}$    1                                 n
4-to-2 csa                       $t_{4-2csa}$    1.5                               1.5n
7-CLA*                           $t_{7-CLA}$     2.5                               4
Ripple adder                     $t_{ripple}$    0.83n                             n
Constant width carry skip**      $t_{cwcs}$      13 (15)                           45 (87)
CLA                              $t_{cla-n}$     $\lceil \log_2(n) \rceil$         2n
Barrel shifter, r levels         $t_{bs}$        $0.5 \cdot \lceil \log_2(r) \rceil$   $n \log_2 r$

* Only the logic to obtain $c_7$
** Values for n=32 and n=64 (the latter in brackets)

Table 3: Delays assumed for hardware elements

For the 4-to-2 carry-save adder/subtracter we have assumed the implementation given in [13], which has the same delay as a 4-to-2 carry-save adder. We have taken into account the delay introduced by the buffers, which are necessary for control signals that are heavily loaded. We would like to emphasize that a true comparison between different implementations is possible only if actual implementations are considered and logic-level simulations are carried out. Therefore, we present a rough, first-order comparison based on Table 3. Nevertheless, we claim that it expresses the general trend between the different designs.

In [5] a radix-2 CORDIC architecture with on-line redundant arithmetic is proposed, but the word-serial architecture is also considered. In that work the $\sigma_i$ values can be $\{0, \pm 1\}$, resulting in a non-constant scale factor. This architecture is especially interesting when we are only interested in calculating the angle. In [9] a radix-2 CORDIC architecture in redundant arithmetic with a constant scale factor is proposed. In this case, the most significant bits of $w$ are used to estimate the $\sigma_i$ values (truncated to $t$ fractional bits). Due to the estimation error it is necessary to repeat some microrotations to ensure convergence. To be exact, it is necessary to repeat one microrotation every $t - 1$ iterations during the first $n/2$ microrotations. For $i > n/2$, the value $\sigma_i = 0$ is allowed, and only one repetition is necessary. The conventional radix-2 architecture with non-redundant arithmetic can be found in [3]. Basically, the word-serial version of the three architectures has the same

data path, which is shown in Figure 10; the main differences are the selection function and the fact that, when redundant arithmetic is used, the data paths are of double complexity and the adders are carry-save. For the case of redundant arithmetic, when only angle calculation is considered we compare our architectures with the architecture proposed in [5] (only $n$ iterations), but when the calculation of the magnitude of the vector is considered we compare our architectures with the architecture proposed in [9] ($n$ iterations plus some repetitions, but with a constant scale factor). Based on Figure 10 and according to Table 3, the delay of a CORDIC iteration is:

$t_{rd2} = t_{2-1mux} + t_{reg} + \max(t_{bs}, t_{sel}) + t_{buf} + t_{2-1mux} + t_{adder}$    (47)

where $t_{sel}$ is the delay of the logic that selects the $\sigma_i$ value (about $3 t_{fa}$ in [9] and $1.5 t_{fa}$ in [5], and negligible for the radix-2 architecture with non-redundant arithmetic) and $t_{adder}$ is the delay of the final adder. According to the delays given in Table 3, this corresponds to a cycle time of $t_{rd2} = (3.5 + \max(t_{bs}, t_{sel}) + t_{adder}) \cdot t_{fa}$. For $n$ bits of precision, $n$ iterations are needed in the classic radix-2 CORDIC [3] and in [5], so the total computation time is $T_{rd2} = n \cdot t_{rd2}$, whereas in [9] some repetitions are needed, resulting in a total computation time of $T_{rd2} = \left(n + \left\lceil \frac{n/2}{t-1} \right\rceil\right) \cdot t_{rd2}$. According to Table 3 and Figure 10, the total area in non-redundant arithmetic is

$A_{rd2} = (6 \cdot n + n \log_2 n + A_{sel} + 3 \cdot A_{adder}) a_{fa}$    (48)

where $A_{sel}$ is the area of the logic that selects the $\sigma_i$ value (module SEL in Figure 10) and $A_{adder}$ is the area of the adder selected. $A_{sel}$ is practically negligible for non-redundant arithmetic, whereas in the redundant case it is about $1.5 a_{fa}$ in [5] and $5 a_{fa}$ in [9] (assuming six fractional bits for truncation). In the carry-save case, the barrel shifter, registers and multiplexers are duplicated, the adders of the $x$ and $w$ data paths become 4-2 CSAs, and the adder of the $z$ data path is a 3-2 CSA. Therefore, basically, the hardware is double that required when carry-propagate adders are used.

6.1 Comparison of the architecture of the arithmetic comparison method

According to Table 3, the delay of one iteration for the design proposed in Section 5.1 (see Figure 7) is given by:

$t_{rd4comp} = t_{3-1mux} + t_{reg} + \max\{t_{2-1mux} + t_{modA},\ t_{2-1mux} + t_{bs}\} + t_{buf} + t_{6-1mux} + t_{adder}$    (49)

and the area is

$A_{rd4comp} = (8.25 n + A_{modA} + n \log_2 n + 3 A_{adder}) a_{fa}$    (50)

where $t_{modA}$ and $A_{modA}$ correspond to the delay and the area of Module A of Figure 7 respectively (the area is calculated without considering the compensation of the scale factor; this means that register RX' is not included

and MUX-2 has only three inputs). The main difference between the redundant and non-redundant versions of the arithmetic comparison method presented in Section 3.1 is that the number of fractional bits needed to truncate the $x$ and $w$ coordinates is increased by one bit (see Subsection 3.1.3). Therefore, in the redundant arithmetic version each of the 7-CLA* blocks of Figure 6 (Module A of Figure 7) must be replaced by two 8-bit 4-2 CSAs followed by an 8-CLA. According to Table 3, we estimate that Module A of Figure 7 has a delay of $4 t_{fa}$ and an area of $18 a_{fa}$ in conventional arithmetic, and a delay of $5.5 t_{fa}$ and an area of $30 a_{fa}$ in carry-save arithmetic. Hence, for $n \leq 64$ and according to Table 3, we have $t_{rd4comp} = (4 + t_{modA} + t_{adder}) \cdot t_{fa}$, and the total computation time is $T_{rd4comp} = (0.80 \cdot n/2 + 3) \cdot t_{rd4comp}$ if only angle computation is considered, and $T_{rd4comp} = (0.80 \cdot n + 3) \cdot t_{rd4comp}$ if magnitude computation is also considered. The factor 0.80 takes into account the reduction in the number of iterations when the zero-skipping technique is used. Basically, the area of the architecture of Figure 7 in redundant arithmetic is double that expressed in (50). To allow the calculation of the magnitude of the vector in applications that require it, we add a 2-to-1 multiplexer and a register to the design proposed in [9] for storing the value of the $x$ coordinate after the first $n/2$ microrotations, and carry out the compensation of the scale factor in parallel with the final iterations ($i > n/2$).

Table 4 shows the speedup and area ratio obtained using ripple adders, constant width carry skip adders, carry lookahead adders and carry-save adders for $n = 32$ bits and $n = 64$ bits for the architecture based on arithmetic comparison. We give the time and area ratio for the calculation of the angle, and for the calculation of the magnitude of the vector. When magnitude calculation is considered the area ratio is the same as when only the calculation of the angle is considered.

n     Adder                        $A_{rd2}/A_{rd4comp}$   $T_{rd2}/T_{rd4comp}$ (angle)   $T_{rd2}/T_{rd4comp}$ (angle & magnitude)
32    Ripple                       0.8                     1.9                             1.2
32    Constant width carry skip    0.8                     1.8                             1.1
32    Carry lookahead              0.9                     1.7                             1.1
32    Carry-save                   0.8                     1.4                             0.9
64    Ripple                       0.9                     2.2                             1.2
64    Constant width carry skip    0.9                     2.0                             1.1
64    Carry lookahead              0.9                     2.0                             1.1
64    Carry-save                   0.9                     1.6                             0.9

Table 4: Speedup and area ratio for the arithmetic comparison method

From Table 4 we deduce that for applications that only require the calculation of the angle, the speedup for 32 bits ranges from 1.9 to 1.7 using different non-redundant adders (for 64 bits it ranges from 2.2 to 2.0), whereas only a slight gain is obtained when the magnitude of the vector is computed. In the case of redundant arithmetic, the speedup for the computation of the angle is smaller but still significant (1.4 for 32 bits and 1.6 for 64 bits), and for the computation of the magnitude the proposed architecture is slightly slower than the radix-2 architecture. The area ratio remains nearly constant in all cases (between 0.8 and 0.9), which means that the architecture proposed in this paper occupies between 12.5% and 25% more area.

The behaviour of the different ratios of Table 4 can be explained as follows: the basic cycle time of one iteration for our design is slightly greater, since more multiplexers are used and because of the delay of Module A. Nevertheless, the number of iterations for the compared design is 32 and 64, whereas these numbers are about 16 and 29 for the angle calculation in our architecture (note that the zero-skipping technique reduces the total number of microrotations by about 20%, i.e., for $n = 64$ bits we need 32 radix-4 iterations times 0.8 plus 3 comparison point computation iterations, that is, 29). Therefore, a speedup of about 2 is obtained. However, the number of microrotations in our architecture goes up to 29 and 55 when magnitude computation is considered, while the compared design still needs 32 and 64 iterations, since its scale factor compensation is carried out in parallel during the last $n/2$ iterations. Consequently, the time gain is slight. We can also see from Table 4 that the time ratio decreases when a faster adder is used. This is because the delay of Module A of Figure 7 is constant and belongs to the critical path, so its relative weight in the cycle time grows as the delay of the final adder, on which the cycle time strongly depends, shrinks. The same reason explains why the time ratio improves when the number of bits $n$ increases, that is, when we go from $n = 32$ to $n = 64$.
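The angle-only speedups of Table 4 follow from the first-order delay model, and the calculation can be sketched as below. The sketch uses the cycle time from (47), the simplified expression $t_{rd4comp} = 4 + t_{modA} + t_{adder}$ stated in the text for $n \leq 64$, the Module A delays estimated above, and the adder delays of Table 3. For the carry-save case it assumes that the radix-2 reference is [5] with $t_{sel} = 1.5$ and that the final adder is a 4-to-2 CSA. The function names and dictionary keys are illustrative, and only $n = 32$ and $n = 64$ are supported because Table 3 gives the carry-skip delay for those two sizes only.

```python
import math

def adder_delay(kind: str, n: int) -> float:
    """Adder delay in units of t_fa (Table 3); n must be 32 or 64."""
    return {
        "ripple": 0.83 * n,
        "carry_skip": {32: 13.0, 64: 15.0}[n],
        "cla": float(math.ceil(math.log2(n))),
        "carry_save": 1.5,                      # 4-to-2 CSA
    }[kind]

def angle_speedup(kind: str, n: int) -> float:
    """T_rd2 / T_rd4comp for angle calculation only."""
    redundant = kind == "carry_save"
    t_add = adder_delay(kind, n)
    t_bs = 0.5 * math.ceil(math.log2(n))        # barrel shifter
    t_sel = 1.5 if redundant else 0.0           # radix-2 selection: [5], or negligible
    t_mod_a = 5.5 if redundant else 4.0         # Module A delay
    t_rd2 = 3.5 + max(t_bs, t_sel) + t_add      # cycle time from (47)
    t_rd4 = 4.0 + t_mod_a + t_add               # simplified (49), n <= 64
    total_rd2 = n * t_rd2
    total_rd4 = (0.80 * n / 2 + 3) * t_rd4      # zero skipping plus 3 extra iterations
    return total_rd2 / total_rd4

for kind in ("ripple", "carry_skip", "cla", "carry_save"):
    print(kind, round(angle_speedup(kind, 32), 1))
```

For $n = 32$ this model gives ratios that round to the angle column of Table 4. It is a rough estimate in the same spirit as the comparison in the text, not a substitute for logic-level simulation.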

6.2 Comparison of the architecture of the look-up table method

Based on Figure 9 we have evaluated the critical path of this redundant arithmetic architecture, which is given by:

$t_{rd4table} = t_{2-1mux} + t_{reg} + \max\{t_{ds},\ t_{bs} + t_{4-2csa}\} + t_{buf} + t_{4-1mux} + t_{4-2csa}$    (51)

and the area is given by

$A_{rd4table} = 22 n + 2 n \log_2 n + a_{ds}$    (52)

where $t_{ds}$ and $a_{ds}$ correspond to the delay and area of the digit selection network respectively (see Figure 8). We have evaluated the delay and area of the digit selection network. The delay is the sum of the delay of one 8-bit carry look-ahead adder (for assimilation), the delay of one 3-to-1 multiplexer, and the delay of the look-up table (whose area is about $50 a_{fa}$) that implements the selection of $\sigma_i$. To obtain the delay of the look-up table, we have assumed an implementation with sparse logic and synthesized it with a multiple-level logic optimization tool, MIS [14]. Our digit selection logic has 9 inputs and 3 outputs, and the delay we have obtained for it is $2 \cdot t_{fa}$. In this way, taking into account the delays assumed, we obtain a delay of $5.5 \cdot t_{fa}$ and an area of about $75 a_{fa}$ for the digit selection network without the zero-skipping technique, and about $100 a_{fa}$ if this technique is considered (note that it is only necessary to duplicate the indexing logic instead of the whole table). Therefore, according to Table 3, for $n \leq 64$ bits we have $t_{rd4table} = 10.5 t_{fa}$ and the area is $22 n + 2 n \log_2 n + 100$ if the zero-skipping technique is implemented. The total computation time is $T_{rd4table} = (0.80 \cdot n/2 + 1) \cdot 10.5 \cdot t_{fa}$.

The architecture presented in Figure 9 is in redundant arithmetic. As discussed in Section 3.2.2, the non-redundant arithmetic version of this method obtains the best results for the minimally redundant digit set ($a = 2$); therefore, the first two 4-2 CSAs of Figure 9 disappear, as does the 8-bit carry look-ahead adder (for assimilation, see Figure 8), whereas the final 4-2 CSAs and the 3-2 CSA are replaced by non-redundant adders. Therefore, basically, the hardware is halved, the cycle time becomes $t_{rd4table} = 3.5 + \max\{t_{ds}, t_{bs}\} + t_{adder}$, and the area is $A_{rd4table} = 7.5 n + n \log_2 n + 100 + 3 \cdot A_{adder}$.

Table 5 shows the speedup and area ratios for the architecture based on the look-up table method, using the zero-skipping technique, for $n = 32$ bits and only for angle calculation. The ratios for the non-redundant architectures have been obtained by considering the minimally redundant digit set ($a = 2$), whereas the carry-save version uses the maximally redundant digit set ($a = 3$). The behaviour and the reasoning for $n = 64$ are quite similar to those of the previous subsection, and have therefore been omitted here for reasons of clarity. Similarly, the magnitude computation has been excluded too.

Adder                        $A_{rd2}/A_{rd4table}$   $T_{rd2}/T_{rd4table}$
Ripple                       0.7                      1.9
Constant width carry skip    0.8                      1.8
Carry lookahead              0.8                      1.7
Carry-save                   0.7                      1.7

Table 5: Speedup and area ratio for the look-up table method

From this table we can see that for non-redundant arithmetic we obtain the same results as in the case of the arithmetic comparison, and we reach the same conclusion. For the redundant arithmetic version, the maximally redundant digit set produces a good ratio (1.7). The results for non-redundant arithmetic using the maximally redundant digit set would be worse (e.g., for the ripple adder the ratio goes down from 1.9 to 1.3, since the adders needed for the coefficient $\sigma_i = \pm 3$ belong to the critical path).
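The cycle time of $10.5 t_{fa}$ and the carry-save speedup of 1.7 quoted in this subsection can be reproduced from (51), (47) and Table 3. The sketch below assumes, as in the text, a digit selection delay made up of an 8-bit CLA, a 3-to-1 multiplexer and a table delay of $2 t_{fa}$, and takes the radix-2 reference to be [5] with $t_{sel} = 1.5$ and a 4-to-2 CSA as the final adder. The function name is illustrative.

```python
import math

def lookup_table_timing(n: int = 32):
    """Return (cycle time in t_fa, angle-only speedup over radix-2 [5]) in carry-save arithmetic."""
    t_bs = 0.5 * math.ceil(math.log2(n))    # barrel shifter
    t_csa42 = 1.5                            # 4-to-2 carry-save adder
    t_ds = 3.0 + 0.5 + 2.0                   # 8-bit CLA + 3-to-1 mux + table = 5.5
    # Critical path (51): 2-1 mux, register, max{...}, buffer, 4-1 mux, 4-2 CSA
    t_rd4 = 0.5 + 1.5 + max(t_ds, t_bs + t_csa42) + 1.0 + 0.5 + t_csa42
    total_rd4 = (0.80 * n / 2 + 1) * t_rd4
    # Radix-2 reference: cycle time from (47) with t_sel = 1.5 and a 4-2 CSA adder
    t_rd2 = 3.5 + max(t_bs, 1.5) + t_csa42
    return t_rd4, (n * t_rd2) / total_rd4

cycle, speedup = lookup_table_timing(32)
print(cycle, round(speedup, 1))
```

The digit selection network dominates the `max` term for $n \leq 64$, which is why the cycle time does not depend on $n$ in that range.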

6.3 Comparison of the arithmetic comparison method with the look-up table method

Table 6 shows the time and area ratios when we compare both methods proposed in this paper for the radix-4 CORDIC vectoring algorithm.

Adder                        $A_{rd4comp}/A_{rd4table}$   $T_{rd4comp}/T_{rd4table}$
Ripple                       1                            1
Constant width carry skip    1                            1
Carry lookahead              1                            1
Carry-save                   0.9                          1.2

Table 6: Speedup and area ratio when comparing both methods.

From this table we can see that both methods of selection lead to similar results in terms of area and computation time, although a better computation time is obtained by the table selection method when redundant arithmetic is used.

7 Conclusions

In this work we have extended the radix-4 CORDIC algorithm to the vectoring mode. We have developed two methods for obtaining the selection functions, which are valid for implementation in both redundant and non-redundant arithmetic. The problem of having a non-constant scale factor is solved efficiently by introducing a small table, combined with the reuse of the $x$ data path. We have obtained two word-serial architectures for the algorithm, in redundant and in conventional arithmetic. It seems that more research into a specific radix-4 algorithm for the pipelined architecture might result in an efficient implementation. From the analysis of these architectures we can conclude that the radix-4 CORDIC vectoring algorithm is very efficient for the computation of the angle in word-serial architectures, such as arise in some of the most important classical applications of the CORDIC algorithm (e.g., SVD [5]).

Figure 1: Selection intervals for a) $\rho = 1$ and b) $\rho = 2/3$ (horizontal axis $x_i$, vertical axis $w_i$; regions q = 0 to q = 3 with the overlap bounds between consecutive digits).

Figure 2: Common overlap area between q = 0 and q = 1 for all iterations.

Figure 3: Evolution of the common overlap area and location of $P_1(1)$ and $P_2(2)$

Figure 4: Real and truncated values

Figure 5: Worst case overlap between the selection intervals.

Figure 6: Implementation of the selection function

Figure 7: Architecture for the arithmetic comparison method

Figure 8: Implementation of the selection function.

Figure 9: Architecture of the processor.

Figure 10: Radix-2 Architecture

References

[1] J.E. Volder. The CORDIC trigonometric computing technique. IRE Trans. Elect. Comput., EC-8:330-334, 1959.
[2] J.S. Walther. A unified algorithm for elementary functions. Proc. Spring Joint Comput. Conf., pages 379-385, 1971.
[3] Y.H. Hu. CORDIC-based VLSI architectures for digital signal processing. IEEE Signal Processing Magazine, (7):16-35, July 1992.
[4] J.R. Cavallaro and F.T. Luk. CORDIC arithmetic for an SVD processor. Journal of Parallel and Distributed Computing, 5:271-290, 1988.
[5] M.D. Ercegovac and T. Lang. Redundant and on-line CORDIC: Application to matrix triangularization and SVD. IEEE Trans. on Comput., 39(6):725-740, June 1990.
[6] G. Knittel. Proven-prompt vector normalizer. Proc. Sixth IEEE International ASIC Conference and Exhibit, USA, pages 112-115, 1993.
[7] N.D. Hemkumar and J.R. Cavallaro. Redundant and on-line CORDIC for unitary transformations. IEEE Transactions on Computers, 43(8):941-954, 1994.
[8] E. Antelo, J. Villalba, J.D. Bruguera, and E.L. Zapata. High performance rotation architectures based on radix-4 CORDIC algorithm. IEEE Transactions on Computers, 46(8):855-870, August 1997.
[9] J. Lee and T. Lang. Constant-factor redundant CORDIC for angle calculation and rotation. IEEE Trans. on Comput., 41(8):1016-1025, August 1992.
[10] M. Ercegovac and T. Lang. Division and square root: Digit-recurrence algorithms and implementations. Kluwer Academic Pub., 1994.
[11] P. Montuschi and L. Ciminiera. Reducing iteration time when result digit is zero for radix 2 SRT-division and square root with redundant remainders. IEEE Transactions on Computers, 42(2), February 1993.
[12] European Silicon Structures. ES2 ECPD10 library databook. 1992.
[13] E. Antelo, J.D. Bruguera, J. Villalba, and E.L. Zapata. Redundant CORDIC rotator based on parallel prediction. Proc. IEEE 12th Symposium on Computer Arithmetic (ARITH-12), pages 172-179, July 1995.
[14] R.K. Brayton, R. Rudell, A. Sangiovanni-Vincentelli, and A.R. Wang. MIS: A multiple-level logic optimization system. IEEE Transactions on Computer-Aided Design, CAD-6:1062-1081, 1987.
