The CORDIC Householder algorithm

7 downloads 0 Views 655KB Size Report
CORDIC rotations are applied in parallel to zero out the second and fourth ..... Householder reflection into a proper rotation, with de- terminant l . Thus, our new ...
The CORDIC Householder Algorithm Shen-Fu Hsiao

Jean-Marc Delosme

Department of Electrical Engineering Yale University New Haven, CT 06520 Abstract A novel n-dimensional (n-D) CORDIC algorithm. for Euclidean and pseudo-ELclidian rotations,% proposed. The new algorithm is closely related to Householder transformations. It is shown to converge faster than CORDIC algorithms developed earlier for n = 3 and 4. Processor architectures for the algorithm are presented. The area and time performance of n-D CORDIC processors are evaluated. For a comparable time performance, the processors require significantly less area than parallel Householder processors. Furthermore, arrays of n-D Euclidean CORDIC processors are shown to speed up the QR decomposition of rectangular matrices by a factor of n - l with respect to a 2-D CORDIC processor array.

1

r

1 Introduction 1.1 2-D CORDIC Algorithms The Coordinate Rotation DIgital Computer introduced by Volder [13] and extended Walther [15 , is an algorithm operating on 2dimensional [2-D) vectors to perform rotations using simple arithmetic primitives. The algorithm has several variants and can be used to calculate a variety of elementary functions. While the traditional approach to vector iotation calls for square-root extract&, division, and multiplication, the CORDIC algorithm requires only additions and shifts. The algorithm approximates a 2-D rotation Rz (Euclidean or hyperbolic) by Rz,; of p elementary rotations the product

Figure 1: The architecture of a 2-D CORDIC processor . applied to bring the vector along the direction [l 0IT (evaluation process). The signs 6i are selected according to 6; = sign(z1,;-1 . xz,;-l), where xjj-1 denotes the j t h component of the vector at the beginning of iteration i and, initially, x j , ~= zj. The sei p , thus obtained may be used quence &;, 1 to rotate other vectors by the same amount (application process). Like the multiplication by the unscaled rotations Uz,i, multiplication by the scaling fac1 uta)-1/2is implemented by additions and tor shifts, see e.g. [9]. While no repetition is needed for the 2-D Euclidean CORDIC algorithm, it is necessary in the hyperbolic case to have a repetition for t 2. -- 2-4 , 2-13 , T 4 0 2-k,2-(3k+1),. . . [15]. Fig. 1 illustr..ztes the 2-D CORDIC processor architecture.

<
2 , are used in lieu of 2-D CORDIC processors.

(4)

where z j , i 7 1 denotes the j t h component of the vector at the beginning of iteration i. In the 4-D case, the 4 x 4 skew-symmetric matrix T4,i defined as

w a s selected in [6] [7]. Because the rotation matrix

2 R4,;

CORDIC Householder Algorithm

= (1- T 4 , i ) ( l +%,i)-' 2.1

is the square of the matrix

The 3-D Euclidean

Case

As mentioned above, the 3-D elementary rotation R3,i in equation ( 3 ) has either [l 1 1IT or [l -1 1IT as rotation axis. If instead the rotation axis is chosen to be either [0 1 11' or (0 1 -1IT, the matrix T3,i is replaced by which has a simpler unnormalized part than R 4 , , , the matrices defined by equation (6) were selected as 4-D CORDIC elementary rotation matrices in 161 171.

257

(and rotations) as a particular case of pseudo-Euclidean norm (and rotations). For an n x n matrix H , to represent a pseudo-Euclidean rotation, it should preserve the pseudo-norm of an arbitrary vector x,i.e.,

-

kk-+--p

-..I

L.

1.1

Figure 2 : Domains after the first three iterations of R3.

Since the above equation is true for any vector, H , must satisfy H,TE,H, = E,. (12) i.e., it must be orthogonal with respect to E,. If T 3 , i in equation (7) is generalized t o an n-dimensional space with signature matrix E,, i.e.,

and the elementary rotation matrices become

R

'

- (1 + 2 t y x 2tis1,i

-2tis2,i

1

-ti

-ti

..

-2t;s2,is1,i

0 0

0 0

'.'

...

0

0

...

with signs selected during the evaluation process according to SI,,

= sign(z1,i-l .12,i-1),

= Sign(z1,i-l

'

where the signs in the matrix

(9)

( 'y 1

Sn,i=

0

0

0 0

. '

0

.

bn-1

0

0

.''

0

.;. ...

0

)

,

(14)

Sn-1,i

are selected during the evaluation process according to sj,i

= (Ti . s i g n ( q 1 - 1 . 2j+1+1),

(15)

+

the matrix Hn,i = ( I - T,,i)(lT n , i ) - l is equal to

Hn,, = (1 + ct;)-' x 1 - ct; -2ti

2ti

1

+ (c - 2)t;

... ...

2ti

-2t;

... ' ' '

2t;

Without loss of generality, we will assume the values of ui ordered according to:

1

...

2ti -2t;

...

-2t;

...

2tf

... ...

-2t; 2t; 2tf

...

n

+ (c + 2)t;

2ti -2t;

2tf If k = 72, Cn = In and IIzIlc,, is the Euclidean norm of the vector I. We shall view the Euclidean norm

k+l

258

+ (c - 2)t; 2t; 2t;

... ...

2t!

...

2

1

1

...

...

The pseudo-norm 11111,~ of an n-D vector 2 in pseudoEuclidean space is defined as 1 1 ~ 1 1 ; ~ = zTCnz, where E, is a signature matrix of the form 0

sn,i,

Sn,i

The General n-D Case

1

-[]

13,i-1).

The matrix R 3 i is simpler than R 3 , i . Moreover, as is discussed next,'no repetition is necessary with this new family of elementary rotations. Using the method outlined in Section 1 , we can determine whether any repetition is needed and prove the convergence of the new 3-D algorithm by determining the cone C; of all the rotated vectors after each iteration i . The cones Ci have the origin for vertex and surround the first axis. The intersections of the cones C1, C2 and C 3 with the plane 21 = 1 are shown in Fig. 2 . The cones are invariant under a permutation of 2 2 and 2 3 . The new 3-D CORDIC algorithm, R 3 , converges slightly faster than R3 since it does not require any repetition while R 3 calls for the repetition of the iterations with ti = Y 42,- 7 , 2 - 1 2 , 2-22 . . . [7]. Furthermore, the domain obtained by intersectkg the cone Ci with the plane 11 = 1 is bounded by a square box with half-side 2 t i / ( l - 2 t f ) .

2.2

= ZC,Z.

(H,#&(H~z)

*.'

k

where c = 2k - n - 1. Hn,i does satisfy equation ( 1 2 ) and hence can be used as an elementary rotation matrix. Equation (16) defines the elementary rotation matrix for a general n-D CORDIC algorithm, applicable to both Euclidean and pseudo-Euclidean spaces. When E n = Inl Hn,i is an elementary Euclidean rotation matrix and the more specific notation R,,i is used instead of H,,,i. If the signature matrix for an odd dimensional pseudo-Euclidean space is such that t = ( n 1 ) / 2 , i.e., c = 0 , n o scaling is required to normalize the rotated vectors. Elementary CORDIC rotation matrices for n = 2 , 3 and 4 are exhibited below: 0 n = 2, 6 1 = 1:

new algorithm with elementary rotations R4,i is a rational CORDIC algorithm whose unnormalized elementary rotations are only slightly more complex. The convergence of the new 4-D CORDIC algorithm can be analyzed and proved using the procedure outlined in Section 1. The cones Ci have for vertex the origin and surround the first axis, thus they are fully characterized by their intersection with the hyperplane 2 1 = 1. Since they are symmetric with respect to 2 2 , 2 3 , 2 4 , these 3-D polytopes can further be projected orthogonally onto the hyperplane 2 2 = 0 to study the convergence behavior. Fig. 3 represents this projection for the cones C1, C2 and C3.

+

0

Figure 3: Domains after the first three iterations of R e .

n = 3 , u1 = u2 = 1: H3,i = R3,i in equation (8), n = 3 , U 1 = l,u2 = -1: H3,i = s3,i

R4,;

r

-2ti 2ti

2t; 1 - 2t! 2t;

= ( 1 + 3t;)-’X 2ti 1 - 3tf -2ti 1 +t; -2ti -2ti

-22; -2t;

-2t; 1 2ti2t;

2ti

-at: 1 + t; -2t;

+

1

2ti -2t2 -24 1 +t;

s3,j,

Algorithm R4 does not require any repetition and converges faster than Rq, which requires repetition of the iterat,ions with t ; = 2-3, 2-5, 2-9, 2-16,. . . [7]. The domain obtained by intersecting Ci with the hyperplane z1 = 1 is bounded by a 3-D cube with half-side 2ti/(l- 3 t 3

(19)

)

2.3

Structure of t h e Elementary Rotations

The n-D elementary rotation matrix H,,j in equation ( 1 6 ) is the product of a reflection with respect to the hyperplane z 1 = 0 and a ‘pseudo-Householder’ reflection. Indeed, H,,i can be rewritten as (20)

n = 4, u1 =

1,u2

H4,i = ( 1 - t / 1+ t f

= u3 = -1:

yx 2t;

where 2ti

2ti

1 E,, =

[

-1

0

.’

O

!) ( ) and

1 0 O

(21) The elementary rotations R 2 , i and H2,i are the square of the elementary rotations R2,; in equation ( 1 ) with U set to 1 and -1, respectively. We call the algorithm of equation ( 1 ) a ‘square-root’ 2-D CORDIC algorithm and the algorithms of equations ( 1 7 ) and ( 1 8 ) ‘rational’ CORDIC algorithms. The rational 2-D Euclidean CORDIC algorithm has been used in [5] [8] to speed up the Singular Value Decomposition (SVD) computations and in [12] to obtain a redundant arithmetic algorithm with constant scaling. For n = 4, while the algorithm with elementary rotations R4,i given by equation ( 6 ) is a square-root CORDIC algorithm, the

0



1

ti!’i

ti Sn-

.

1,;

The matrix I, -2C,u uTCnu)-’uT is a reflection with respect to the hyperp ane normal to the vector U . It is orthogonal with respect to C,, and has determinant -1. It, is a generalization of the Householder reflection to pseudo-Euclidean spaces. The role of the multiplication by the reflection E, is to transform the pseudoHouseholder reflection into a proper rotation, with determinant l . Thus, our new elementary rotation matrix Hn,i is essentially a Householder reflection in pseudoEuclidean space. This is the origin for the name given to the new CORDIC algorithm.

\

259

21,;-1

x2j-1

x3,i-1

24,;-1

I

I

6-to-2 CSA

2t,x1,, 2t,22,1

,!&I&,&

I tTX2,I

5-to-2 CSA Figure 4: A possible architecture for a 4-D CORDIC processor.

Figure 5 : Modifications to the 4-D CORDIC processor architecture of Fig. 4 in order to perform scaling iterations in the processor

Processor Architectures

3 3.1

(0, f l , f 2 , 4 3 , f 4 , f 5 ) . In the rotation mode, the multiplexers select data from the left-hand side inputs. The sign controls s ; , ~ , 0~ 5 , k # 1 5 3 , at iteration i are se, ~s k , i s l , ; where SO,^ = 1. The lected such that s ; , ~= iteration index i is dropped for clarity in Fig. 5 . In the scaling mode, the multiplexers select the data from their right-hand side input. The sign controls are such that

A 4-D Processor Architecture

A possible architecture to implement the 4-D CORDIC algorithm in equation (20) is illustrated in Fig. 4. By adding proper control inputs, c1, c 2 and c3, to the sign operators, this 4-D CORDIC processor can also execute the 2-D CORDIC algorithm in equation (17) and the 3D CORDIC algorithm in equation (8). More precisely, in the (unscaled) rotation mode, the set of coefficients (131, C2, c 3 ) = ( - l , O , -1) selects the 2-D CORDIC function if 2 3 , i - l and q i - 1 are preset to zero; (c1, c 2 , c3) = ( 0 , -1,0) selects the 3-D CORDIC function if z 4 , , - 1 is preset t o zero; and ( c l , c 2 , c3) = (-1, -1,l) selects the 4-D CORDIC function. Multiplication by a p b i t accurate approximation of the 3t?)-l can be performed scaling factor S 4 = by shifts and additions using q additional iterations, where q is small compared to p . However, the processor in Fig. 4 must be slightly modified to also perform the scaling operations. Several 2-to-1 niultiplexers should be inserted at the input of the Carry-SaveAdders (CSA). Fig. 5 shows the modifications of the processor to enable scaling of the first and second components. The modifications for the third and fourth components are similar to those for the second component; the sign controls are replaced by

= c3i s O , l = s l , O = s2,0 = s3,0, c1 I

I

l

l

I

and

,

,

(‘3,0> ‘3,ll ‘3,2),

s4

=

bi

E

The scaling factor S 4 is approximated by a;t; bit:) where a; E {0,*2} and

n:=,(l+

+

I

I

I

c2

= s1,3

I

s0,2

I

I

I

s2,3

,

s3,2i I

= s 1 , 2 = s2,1 = ‘3,l

We found that q = 5 extra iterations are required to perform the scaling operations in the 4-D CORDIC algorithm for 32 bits of accuracy. A custom chip implementing the rational 2-D CORDIC rotation R 2 i in equation (17) has been designed at Yale and fabricated through MOSIS with a 2p CMOS process. It operates on 32-bit fixed-point words with 5 extra guard bits. The 2t; and t: shifters are realized by trees of 2-to-1 multiplexers, providing higher performance than barrel shifters. The area of two shifters by t i and t:, A 3 h 2 , is about 1.4 times the area of a single shifter, Ash, due to the possibility of overlapping layouts. The adders in this chip are realized by a variant of the Abraham-Gajski tree structure [l].The depth of the trees to realize both shifters and adders is Fog2 b) where b is the word length, including guard bits. The area of such an adder is about 1.6 times A s h . However, it is possible to design a faster adder

nfz1(l+

(‘2,Ol ‘ 2 , l l ‘2,3)

I

260

3.2

with smaller area usin multiple output domino logic and carry-look-ahead flo]. The area of this smaller adder, Aadd, is about 0.6 times Ash. The area of a 3to-2 Carry-Save-Adder array, A,,,, is about 0.13 times Ash. The area of a 2-to-1 word multiplexer, A,,,, is about 0.08 times Ash. To sum up,

n-D Processor Architectures

Turning now t o general n-D CORDIC algorithms, we found computationally that the first iteration (with tl = 2-2) should be repeated twice for n = 5 , 6 , 7 and 8 to ensure convergence. The architecture in Fig. 4 can be easily generalized to implement an n-D CORDIC processor. For instance, a 5-D CORDIC processor reAsh2 = 1.4A,h Aadd = 0.6A,h quires an area A ~ ? D 1 . 3 A 4 ~and a computation time A,,, = 0.13A,h A,,, = 0.08A,h. T~D N l.lT4D. Therefore, a 5-D CORDIC processor Note that the sensitivity of these factors-1.4, 0.6, 0.13 takes about 40 % of the time (= 3 T 2 ~ )of two 2-D and 0.08-to the bit accuracy is low since Ash, Ash:! CORDIC processors to perform a 5-D rotation, with and Aadd increase with bit accuracy slightly faster than essentially no increase in the area-time product. linearly, while A,,, and A,, are proportional to the An alternative architecture for an n-D CORDIC probit accuracy. cessor, based on the the representation in terms of The total area for the 4-D CORDIC processor, A ~ D , pseudo-Householder reflection given in equation (22), is well approximated by the sum of the areas of the is shown in Fig. 6. Because of an additional shifter deadders, shifters, CSA arrays, multiplexers and interlay, this ‘CORDIC Householder architecture’ requires connection wires. The area of the wires, A ~ D - ~ ; , . ~is, , a longer computation time for each CORDIC iteration about 2 times Ash in this 4-D processor. Therefore than the plain ‘CORDIC architecture’ exemplified in Fig. 4. On the other hand, it requires less area for high A ~ D 22 4 A s h ~ 4Aadd A s - ~ c s ~ dimensional rotations ( n > 8). 11Amux A4~-wire +3&-:!CSA x1,i-1 x2,i-1 Xn,i-l N 5.6A,h 2.4A,h 0.52A,h +1.17A,h + 0.88A,h 2A,h = 12.57A,h

+

+

+

+

+

I

+



I

Since the square-root 2-D CORDIC implementation in Fig. 1 requires A2D

N

N

+ +

Ash:! 2Aadd 2Amtix AZD-rvire 1.4ASh 1.2Ash 0.16ASh 0.58ASh 3.34A,h,

+

+

the area of a 4-D CORDIC processor is about 3.8 times the area of a 2-D CORDIC processor. The delay of the 6-to-2 and 5-to-2 CSA, TCSA,is about three gate delays, 3T a t e , where Tgatedenotes the average delay per gate. $he delay of a 2-to-1 multiplexer, TmUx, is Tgate.The delay of a shifter, T , h i f t e r , is about c1 Pog, bl gates delay for b-bit internal wordlength, where the factor c1 accounts for the delay of the intermediate buffer stages driving the wires (q = 1.3). The delay of an adder is about c:!pog, bl gate delays, where c:! accounts for the large carry-look-ahead gates (cg 1.3). The delay to drive the interconnection wires, Twires,is about 2Tgate. Therefore, for 32-bit accuracy, the delay of one 4-D CORDIC iteration is

T ~ D21 T s h i t e r 2 6.5date = 19qate

+

T m u x + TCSA + Twires + T g a t e + 3Tgate + 6.5Tgate + 2 T g a t e Tadder

while the delay for one 2-D CORDIC iteration is T2D

Tshi ter

+ +

6.5date = 16Tgate.

Tmux Tgate

Tadder + T w i r e s + 6.5Tgate + 2Tgate

Figure 6: Architecture of an n-D pseudo-Euclidean CORDIC processor (c = 2k - n - 1).

To evaluate or apply a single 4-D rotation the proposed processor takes 60% of the time of two 2-D CORDIC processors at the expense of an area increase by a factor of 1.9. The gain in speed is thus obtained with a slight increase in the area-time product.

3.3 Area and Time Comparisons Two architectures have been presented: ‘CORDIC architecture’, which is the generalization of Fig. 4, and ‘CORDIC Householder architecture’ of Fig. 6. Since

261

the pseudo-Householder transformation is considered to be the most efficient way to implement n-D vector rotations, we shall compare the two CORDIC architectures to a highly parallel implementation of the pseudo-Householder transformation. To any n-D vector z = [ t 1 2 2 ... z,]~ in a space with signature matrix E, may be associated a pseudeHouseholder reflection bringing t along the axis el = [l 0 . . .OIT:

H = I, - 2C,U(UTCnU)--1UT,

+

where U = t sign(il)llxllEnel. The pseudoHouseholder transformation calls for division, squareroot and multiplications. Division and square-root extraction can be implemented by the Newton-Raphson method using multiplier and ROMs [14]. The multiplications are performed using n multipliers in parallel. For comparison purposes, the area and delay of a multiplier will be expressed in terms of the units Ash and Tgate,respectively. The 32 x 32 multiplier presented in [ll],which uses the Booth algorithm and a Wallace tree, will be used for our comparison. Compared to the 32 x 32 adder in [lo], its area and delay are

200 180 160 140 120 100

2 4

6

8 10-n

ACHTCH

80 60

0.3

40 20

0.1 2 4

6

8 1 0

n

A HTH 2

4

6

8 10-fl

Figure 7: Area, time and area-time product as function of dimension n ( k 3 2 ) .

AmpyN 35Aadd N 21A,h, T mpy 21 5Aadd 2: 5C2 P O g 2 b ] T g a t e . In Fig. 7, Ac,AcH, AH,Tc,TcH and TH denote the respective areas and delays of ‘CORDIC’, ‘CORDIC Householder’ and ‘Householder’ implementations for the Euclidean rotation of an n-D vector. compares the areas of these implementations, their delays, and Fig. 7(c) their area-time products. The two implementations of the new CORDIC algorithm are clearly preferable in terms of area-time product to the parallel implementation of the pseudoHouseholder transformation.

Ft.

4

p,

Application t o t h e QR Decomposition on a Processor Array

Givens rotations are widely used in modern signal processing algorithms and the square-root 2-D CORDIC algorithm is often employed to implement these rotations. Two well-known examples are the QR Decomposition (QRD) and Singular Value Decomposition (SVD)[2][3]. We shall examine here the QRD of a rectangular matrix to show how higher dimensional CORDIC algorithms can speed up array implementations of matrix computations. The n-D CORDIC algorithm discussed in Sect,ion 2 enables n - 1 elements to be zeroed out at each rotation instead of a single element for a Givens rotation. To triangularize an 1 x m matrix A with elements a i j , an n-D rotation is applied at time step 1 to zero out a2 1,. ... u n , l in the first column of A . Simultaneously, this rotation is applied to the other elements of rows 1 to n. At time step 2, another n-D rotation is applied to rows 1 and n+1 to 272-1 to zero out a n + l , l , .. . , ~ n - l , while an ( n - 1)-D rotation is applied to rows 2 to n to zero out (13,2, u 4 , 2 , . ’ . , un,2. Proceeding with this greedy approach, an 1 x m matrix is triangularized in

CORDIC

n- D CORDIC

~ ,

262

CORDIC

n- D CORDIC

CORDIC

Figure 8: Array of n-D CORDIC processors for decomposing an 1 x 4 matrix.

W D

2-D CORDIC

4-U CORDIC

1

[2] H.M. Ahmed, J.-M. Delosme and M. Morf, Highly Concurrent Computing Structures for Matriz Arithmetic and Signal Processing, IEEE Computer, Vol. 15, No. 1, pp. 65-82, Jan. 1982. [3] J.R. Cavallaro and F.T. Luk, CORDICArithmetic f o r an SVD processor, Journal of Parallel and Distributed Computing, pp. 271-290, June 1988.

Table 1: Comparison, for the QRD of 1 x m matrices, of the computation times for one problem, and of the periods and processor areas. The time unit is Tgoteand the area unit is Ash.

[4] J.-M. Delosme, VLSI Implementation of Rotations

in Pseudo-Euclidean Spaces, Proc. 1983 IEEE Conf. on ASSP, pp. 927-930, Apr. 1983. [5] J.-M. Delosme, A Processor for Two-dimensional Symmetric Eigenvalue and Singular Value Arrays, Proc. 21st Asilomar Conf. on Circ., Syst., and Comp., pp. 217-221, Nov. 1987.

+ +

r(1 - l)/(n - 1)1 ( m - 1) time steps on a triangular array of m(m 1)/2 n-D CORDIC processors as shown in Fig. 8. Each time step allows for a whole n-D CORDIC rotation. If several 1 x m matrices are pipelined into the array, the pipeline period to compute one QRD is [ ( l - l)/(n - 1)1 time steps. Note that the ability of an n-D CORDIC processor to also perform lower dimensional rotations is crucial in this context. Table 1 compares the computation time, the pipeline period and the processor area required for the QRD of 1 x m matrices using the square-root 2-D CORDIC algorithm and the new 4-D CORDIC algorithm; p is the bit accuracy, and q1 and 42 are the number of additional iterations required for scaling operations in the 2-D and 4-D algorithms, respectively. The 4-D CORDIC algorithm provides a speed up of 2.5 at the expense of an area increase by a factor of 3.8. The time performance is thus significantly improved with only a 50% increase in the area-time product.

5

[6] J.-M. Delcsme, CORDIC Algorithms: Theory and Extensions, Advanced Algorithms and Architectures for Signal Processing IV, Proc. SPIE 1152, pp. 131-145, Aug. 1989. [71 J.-M. Delosme and S.-F. Hsiao, CORDIC Algorithms in Four Dimensions, Advanced Signal Processing Algorithms, Architectures and Implementations, Proc. SPIE 1348, pp. 349-360, July 1990. J.-M. Delosme, Bit-level Systolic Algorithm for the Symmetric Eigenvalue Problem, Proc. Int. Conf. on Application Specific Array Processors, pp. 770781, Princeton, Sept. 1990. G. Haviland and A. Tuszynski, A CORDICArithmetic Processor Chip, IEEE Trans. on Computers, Vol. C-29, No. 2, pp. 68-79, Feb. 1980.

Conclusion

1,s.Hwang and A.L. Fisher, Ultrafast Compact

A general n-D CORDIC algorithm, applicable to rotations in Euclidean and pseudo-Euclidean spaces, has been introduced. In the 3-D and 4-D Euclidean cases, the algorithm does not require any repetition and thus improves on previously developed CORDIC algorithms. The elementary rotations in the new algorithm are essentially pseudo-Householder reflections. Parallel implementations of the algorithm require, for n 5 10, much less area than a parallel implementation of the Householder transformation to achieve roughly the same computation time and, in addition, they are simpler to design. Finally, the example of the QR decomposition illustrates the use of the new n-D algorithms in application-specific processor arrays in order to speed up matrix computations.

6

32-bat CMOS Adders in Multiple-Output Domino Logic, IEEE J . Solid State Circuits, Vol. SC-24, pp. 358-369, Apr. 1989.

M. Nagamatsu et al., A 15-ns 32 x 32-b CMOS Multiplier with an Improved Parallel Structure, IEEE J . Solid State Circuits, Vol. SC-25, pp. 494496, Apr. 1990. N. Takagi, T. Asada and S. Yajima, A Hardware Algorithm for Computing Sane and Cosine Using Redundant Binary Representation, Trans. IECE Japan, Vol. J69-D, No. 6, pp. 841-847, June 86 (in Japanese). English translation available in Systems and Computers in Japan, Vol. 18, No. 8, pp. 1-9, Aug. 1987.

Acknowledgement

[13] J.E. Volder, The CORDIC Trigonometric Computing Technique, IRE Trans. on Electronic Computers, Vol. EC-8, No. 3, pp. 330-334, Sept. 1959.

This work was supported by the Defense Advanced Project Research Agency (DARPA) under contract N00014-88K-0573.

[14] S. Waser and M.J. Flynn, Introduction to Arithmetic for Digital Systems Designs, CBS College Publishing, 1982.

References [l) J . Abraham and D. Gajski, Easily Testable, High-

[15] J.S. Walther, A Unified Algorithm for Elementary Functions, AFIPS Conf. Proc., Vol. 38, pp. 379335, 1971.

speed Realization of Register- fiansfer-Level Operations, Proc. 10th Fault-Tolrant Computing Symp., pp. 339-344, Oct. 1980.

263