High-Speed Inverse Square Roots

Michael J. Schulte and Kent E. Wires
Computer Architecture and Arithmetic Laboratory
Electrical Engineering and Computer Science Department
Lehigh University, Bethlehem, PA 18015, USA
[email protected] and [email protected]

Abstract

Inverse square roots are used in several digital signal processing, multimedia, and scientific computing applications. This paper presents a high-speed method for computing inverse square roots. This method uses a table lookup, operand modification, and multiplication to obtain an initial approximation to the inverse square root. This is followed by a modified Newton-Raphson iteration, consisting of one square, one multiply-complement, and one multiply-add operation. The initial approximation and Newton-Raphson iteration employ specialized hardware to reduce the delay, area, and power dissipation. Application of this method is illustrated through the design of an inverse square root unit for operands in the IEEE single precision format. An implementation of this unit with a 4-layer metal, 2.5 Volt, 0.25 micron CMOS standard cell library has a cycle time of 6.7 ns, an area of 0.41 mm², a latency of five cycles, and a throughput of one result per cycle.

1. Introduction

Square roots and inverse square roots are important in several digital signal processing, multimedia, and scientific computing applications [1], [2], [3], [4]. For many computations, including vector normalization [1], least squares lattice filters [2], Cholesky decomposition [4], and Givens rotations [5], a square root is first computed and then used as the divisor in a subsequent divide operation. A more efficient method for performing this computation is to first compute the inverse square root and then use it as the multiplier in a subsequent multiply operation [1]. Because of its usefulness in 3D graphics applications, special instructions for inverse square root have been added to the Motorola AltiVec [6] and Advanced Micro Devices 3DNow! [7] Instruction Set Extensions.

Although several algorithms have been developed for computing square roots or inverse square roots, these algorithms typically have either long latencies or high memory requirements [5], [8], [9]. Digit recurrence algorithms, such as those presented in [10], [11], [12], need less area than other methods, but have linear convergence and often require a large number of iterations. On the other hand, methods that employ parallel polynomial approximations, such as those presented in [4], [13], [14], [15], have short latencies yet require large amounts of memory and area.

This paper presents a high-speed method for computing inverse square roots. This method uses a variation of the algorithm presented in [16] to obtain an initial approximation to the inverse square root. The initial approximation requires a table lookup, operand modification, and multiplication. After the initial approximation, a modified Newton-Raphson iteration is used to produce an accurate inverse square root. The initial approximation and modified Newton-Raphson iteration are implemented using specialized hardware to reduce the delay, area, and power dissipation. The method presented in this paper is similar to the method presented in [1], except it uses a more accurate initial approximation, requires only a single Newton-Raphson iteration, and employs truncated multipliers and a specialized squaring unit. Consequently, it requires significantly less memory and area.

Section 2 describes the method for computing inverse square roots. Section 3 presents the design of a hardware unit that uses this method to compute inverse square roots for numbers in the IEEE single precision format [17]. Section 4 gives our conclusions. A similar method for computing high-speed reciprocal approximations is described in [18].

2. Inverse square root method

The method presented in this paper produces an approximation Y to the inverse square root of a number X. The method for calculating Y consists of the following steps:

1. Compute an initial approximation R ≈ 1/√X using a variation of the method described in [16].

2. Perform a modified Newton-Raphson iteration to produce a more accurate approximation Y ≈ 1/√X.

To reduce the hardware requirements, the inverse square root method uses truncated multipliers and a specialized squaring unit.

  k    range of n
  0    n ≤ 3
  1    4 ≤ n ≤ 7
  2    8 ≤ n ≤ 14
  3    15 ≤ n ≤ 27
  4    28 ≤ n ≤ 52
  5    53 ≤ n ≤ 101
  6    102 ≤ n ≤ 198

Table 1. Values of k for less than one ulp error.
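For illustration, the following is a minimal floating-point sketch of these two steps (Python; the function names and the quality of the seed are assumptions for demonstration, and the actual unit operates on truncated fixed-point values as described below):

    import math

    def inv_sqrt(x, seed):
        # Step 1: initial approximation R ~ 1/sqrt(X) (Section 2.3).
        r = seed(x)
        # Step 2: modified Newton-Raphson iteration
        #   Y = R + R*(1 - X*R^2)/2,
        # realized as one square, one multiply-complement,
        # and one multiply-add.
        s = r * r               # square:              S = R^2
        c = 1.0 - x * s         # multiply-complement: C = 1 - X*S
        return r + r * (c / 2)  # multiply-add:        Y = R + R*(C/2)

    # Example with a deliberately imperfect seed (about 10 accurate bits):
    y = inv_sqrt(2.0, lambda v: (1.0 + 2.0**-10) / math.sqrt(v))
    print(y, 1.0 / math.sqrt(2.0))

Since the iteration roughly doubles the number of accurate bits, a sufficiently accurate initial approximation allows a single iteration to reach single precision.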

2.1. Truncated multipliers

In the discussion to follow, it is assumed that an unsigned n-bit multiplicand A is multiplied by an unsigned n-bit multiplier B to produce an unsigned 2n-bit product P. For fractional numbers, the values for A, B, and P are

A = Σ_{i=1}^{n} a_i·2^{−i},   B = Σ_{i=1}^{n} b_i·2^{−i},   P = Σ_{i=1}^{2n} p_i·2^{−i}   (1)

The multiplication matrix for P = A · B is shown in Figure 1a. To avoid excessive growth in word size, P is often rounded to n bits.

Figure 1. Multiplication matrices: (a) standard multiplication matrix; (b) truncated multiplication matrix.

With truncated multiplication, only the n + k most significant columns of the multiplication matrix are used to compute the product, which is then truncated to n bits [19], [20].

Using a method similar to the one presented in [20], the partial product bits in column n + k + 1 are added to column n + k to compensate for the error that occurs by eliminating columns n + k + 1 to 2n. To compensate for truncating the (n + k)-bit result to only n bits, ones are added to columns n + 2 to n + k, as shown in Figure 1b. To add these ones, k − 1 half adders are changed to specialized half adders. Specialized half adders are equivalent to full adders that have one input set to one and require roughly the same number of transistors as regular half adders [21]. As described in [22], this method of truncated multiplication results in a maximum absolute error that is bounded by

E_trunc ≤ 0.5 + Σ_{i=1}^{⌊(n−k)/2⌋} (n − k + 2 − 2i)·2^{−k−2i−1}   (2)

units in the last place (ulps). Table 1 shows values of k that limit the maximum absolute error due to truncated multiplication to less than one ulp for different ranges of n [22].

Truncated multipliers require significantly less hardware than conventional parallel multipliers. A conventional n by n array multiplier requires n² AND gates, n² − 2n full adders, and n half adders [19]. A truncated array multiplier, with t = n − k, requires t(t − 1)/2 fewer AND gates, (t − 1)(t − 2)/2 fewer full adders, and (t − 1) fewer half adders than an equivalent conventional array multiplier [22]. This reduction in hardware leads to a significant decrease in delay, area, and power dissipation. For example, a 32 by 32 truncated array multiplier with k = 4 has 17% less delay, 39% less area, and 41% less average power dissipation than a conventional 32 by 32 array multiplier [23]. Similarly, a conventional n by n Parallel Reduced Area Multiplier [21] requires n² AND gates, n² − 4n + 3 + S full adders, n − 1 half adders, and a (2n − 2 − S)-bit carry propagate adder, where S is the number of full adder stages required to reduce the partial products. A truncated Parallel Reduced Area Multiplier requires approximately t(t − 1)/2 fewer AND gates, (t − 2)(t − 3)/2 + S fewer full adders, and t fewer half adders. The size of the carry-propagate adder is reduced by approximately (t − S) bits [22]. With minor modifications, the technique of truncated multiplication can be extended to handle two's complement multipliers, multiply-add and multiply-complement units, and multipliers for which A, B, and P are different sizes [22]. These modifications do not significantly increase the area or delay of truncated multiplication.
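As a concrete illustration of this scheme, the following bit-level sketch (an assumed Python model, not the authors' hardware) keeps only the n + k most significant columns, folds the partial products of column n + k + 1 into column n + k, and adds the constant ones to columns n + 2 through n + k:

    import random

    def truncated_multiply(a, b, n, k):
        # a and b are n-bit fractions represented as integers in [0, 2^n),
        # i.e., A = a * 2^-n.  Returns the n-bit truncated product.
        a_bits = [(a >> (n - i)) & 1 for i in range(1, n + 1)]   # a_1 .. a_n
        b_bits = [(b >> (n - j)) & 1 for j in range(1, n + 1)]   # b_1 .. b_n
        cols = [0] * (n + k + 1)            # cols[c]: column of weight 2^-c
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                c = i + j
                if c <= n + k:
                    cols[c] += a_bits[i - 1] * b_bits[j - 1]
                elif c == n + k + 1:        # fold column n+k+1 into column n+k
                    cols[n + k] += a_bits[i - 1] * b_bits[j - 1]
        for c in range(n + 2, n + k + 1):   # k-1 ones (specialized half adders)
            cols[c] += 1
        total = sum(v << (n + k - c) for c, v in enumerate(cols))
        return total >> k                   # truncate the (n+k)-bit result to n bits

    # With k chosen from Table 1 (n = 16 gives k = 3), the result stays
    # within one ulp of the exact product truncated to n bits:
    n, k = 16, 3
    a, b = random.getrandbits(n), random.getrandbits(n)
    print(truncated_multiply(a, b, n, k), (a * b) >> n)

The columns that are never formed account for exactly the t(t − 1)/2 eliminated AND gates counted above, with t = n − k.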

2.2. Specialized squaring unit

The modified Newton-Raphson iteration uses the square of the initial approximation. Rather than having a parallel multiplier compute the square, a specialized squaring unit, which requires significantly less hardware, is employed. Figure 2a shows an 8 by 8 partial product matrix for S = A². Since a_i·a_j = a_j·a_i, the matrix is symmetric with respect to the antidiagonal. This allows the number of partial products to be reduced by using the identities a_i·a_j + a_j·a_i = 2·a_i·a_j and a_i·a_i = a_i. As shown in Figure 2b, the original matrix can be replaced by an equivalent matrix consisting of the partial products on the antidiagonal, plus the partial products above the antidiagonal shifted one position to the left [24]. To improve the regularity of the matrix and further reduce its maximum height, a technique presented in [25] is employed. This technique uses the identity

a_i + a_i·a_{i+1} = 2·a_i·a_{i+1} + a_i·ā_{i+1}   (3)

This allows the partial products a_i and a_i·a_{i+1} in one column to be replaced by a_i·ā_{i+1} in the same column and a_i·a_{i+1} in the next column to the left, as shown in Figure 2c.

Figure 2. 8 by 8 squaring matrices: (a) original matrix; (b) reduced matrix; (c) optimized matrix.

An n-bit squaring unit that uses Dadda's method [26] for reducing the partial products requires (n² + n − 2)/2 AND gates, (n² − 7n + 10)/2 full adders, 2⌈n/2⌉ − 3 half adders, n − 1 inverters, and a (2n − 3)-bit carry propagate adder [27]. In comparison, a conventional n by n Dadda tree multiplier requires n² AND gates, n² − 4n + 3 full adders, n − 1 half adders, and a (2n − 2)-bit carry propagate adder [21].
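The reduction from Figure 2a to Figure 2b can be checked with a short script. The following is an assumed Python model (for verifying the algebra, not describing the hardware):

    def square_reduced(a, n):
        # a is an n-bit fraction represented as an integer in [0, 2^n).
        # Antidiagonal terms a_i*a_i = a_i stay in column 2i; each pair
        # a_i*a_j + a_j*a_i (i < j) becomes a single a_i*a_j moved one
        # column to the left (column i+j-1).
        bits = [(a >> (n - i)) & 1 for i in range(1, n + 1)]
        cols = [0] * (2 * n + 1)
        for i in range(1, n + 1):
            cols[2 * i] += bits[i - 1]                      # a_i a_i = a_i
            for j in range(i + 1, n + 1):
                cols[i + j - 1] += bits[i - 1] * bits[j - 1]
        return sum(v << (2 * n - c) for c, v in enumerate(cols))

    import random
    a = random.getrandbits(8)
    assert square_reduced(a, 8) == a * a   # matches the full matrix of Figure 2a

The optimization of identity (3) only rebalances bits between adjacent columns, so it leaves this sum unchanged while reducing the maximum column height.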

2.3. Initial approximation

A variation of the method presented in [16] is used to obtain an initial approximation R ≈ 1/√X. As in [16], it is assumed that X = [1.x_1 x_2 … x_n] (x_i ∈ {0, 1}). With this method, X is divided into two parts: X1 = [1.x_1 x_2 … x_m] and X2 = [x_{m+1} x_{m+2} … x_n]·2^{−m}. The initial approximation is computed as

R = X′ · C′   (4)

where X′ = [1.x_1 x_2 … x_m x̄_{m+1} x_{m+1} x̄_{m+2} … x̄_{n_X′}], C′ = X1^{−3/2} − 3·2^{−m−2}·X1^{−5/2} + 33·2^{−2m−6}·X1^{−7/2}, and n_X′ denotes the number of bits in the fractional part of X′. The values for X′ are obtained from X by complementing n_X′ − m bits of X.
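An assumed Python sketch of this computation is given below. It follows (4) literally; the parameters m = 7 and n_X′ = 16 are illustrative choices rather than the paper's design values, and double-precision arithmetic stands in for the rounded table entries and the truncated multiplier:

    import math

    def initial_approx(x, n=24, m=7, n_xp=16):
        # x is a significand in [1, 2) with n fractional bits.
        frac = int(round((x - 1.0) * 2**n))          # x_1 ... x_n
        bits = [(frac >> (n - i)) & 1 for i in range(1, n + 1)]
        x1 = 1.0 + (frac >> (n - m)) / 2.0**m        # X1 = [1.x_1 ... x_m]
        # Table entry addressed by x_1 ... x_m:
        # C' = X1^(-3/2) - 3*2^(-m-2)*X1^(-5/2) + 33*2^(-2m-6)*X1^(-7/2)
        c = (x1**-1.5 - 3 * 2.0**(-m - 2) * x1**-2.5
             + 33 * 2.0**(-2 * m - 6) * x1**-3.5)
        # Operand modification:
        # X' = [1.x_1 .. x_m  ~x_{m+1}  x_{m+1}  ~x_{m+2} .. ~x_{n_xp}]
        xp = x1
        xp += (1 - bits[m]) * 2.0**(-m - 1)          # ~x_{m+1}
        xp += bits[m] * 2.0**(-m - 2)                #  x_{m+1}
        for j in range(2, n_xp - m + 1):             # ~x_{m+j}, j = 2 .. n_xp-m
            xp += (1 - bits[m + j - 1]) * 2.0**(-m - 1 - j)
        return xp * c                                # R = X' * C'

    x = 1.640625
    print(initial_approx(x), 1.0 / math.sqrt(x))

With these assumptions the sketch agrees with 1/√x to roughly sixteen bits, which is the regime in which a single Newton-Raphson iteration suffices for single precision.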

The values for C′ are precomputed, rounded to nearest, and stored in a table (e.g., a RAM or ROM) that is addressed with the bits [x_1 x_2 … x_m]. For floating point numbers, the least significant bit of the exponent is also used to index the table. If the exponent is odd, the value for C′ is multiplied by 1/√2 before it is rounded and stored in the table. This method varies slightly from the method presented in [16], because only the n_X′ most significant fractional bits of X′ are used and R is computed using truncated multiplication. Although these modifications result in a small amount of additional error, they significantly reduce the amount of hardware required for the multiplication. The maximum absolute error in the initial approximation is bounded by

R