DESIGN AND IMPLEMENTATION OF RECIPROCAL UNIT

Dongdong Chen, Bintian Zhou, Zhan Guo and Peter Nilsson
Department of Electroscience, Lund University, Sweden

Abstract – This paper presents the design and implementation of a reciprocal unit in which the initial approximation of the reciprocal is obtained using a look-up table and a multiplication. The efficient construction of the look-up table is described in detail, and an error analysis for ROMs of different sizes is given. The presented design uses a $2^7 \times 16$-bit ROM followed by two Newton-Raphson iterations. It takes 10 clock cycles to compute an approximation with 52-bit accuracy of the reciprocal of a double-precision floating-point number.
I. INTRODUCTION

Division is important in many applications in digital signal and image processing, computer graphics and scientific computing [1], yet it is the most time-consuming of the four basic arithmetic operations. A high-speed reciprocal unit is therefore very useful, because a division can be replaced as follows: the reciprocal of the divisor is computed first and is then used as the multiplier in a subsequent multiplication with the dividend. This method is particularly efficient when several divisions by the same divisor are needed, since once the reciprocal of the divisor has been found for the first division, each subsequent division requires just one additional multiplication.

Creating an efficient look-up table is a key point in the design of the reciprocal unit. As the required precision of the approximation increases (e.g., beyond 16 bits), the size of the memory needed to implement the table look-ups becomes prohibitive. Several techniques have been devised to reduce the amount of memory required to approximate the elementary functions while keeping their performance at an acceptable level. These techniques include parallel polynomial approximations, table-driven polynomial approximations and rational approximations [2].

This paper presents the design and implementation of a reciprocal unit in which the initial approximation is obtained using an efficient look-up table and a multiplication. Following the initial approximation, two Newton-Raphson iterations are performed to obtain the final approximation of the reciprocal with 52-bit accuracy. The algorithm used to create the look-up table is based on the first-order Taylor expansion. Compared with symmetric bipartite tables [2], the look-up table created by this algorithm is much smaller for the same accuracy.
In this design, the method requires reading a value from the table and multiplying this value by the modified operand to obtain the initial approximation [3]. The Newton-Raphson iteration is the second stage of our implementation; it is realized with a squarer, a multiplier and a subtracter. The reciprocal operation takes only 10 clock cycles, and a division then requires just one subsequent multiplication. The disadvantage of our design is that 52-bit accuracy is not guaranteed. However, in many applications of digital signal and image processing, an error that still leaves a guaranteed accuracy of 51 bits may not be a problem.

The remainder of this paper is organized as follows. Section II first gives the algorithm for creating the look-up table and then describes how the size of the ROM is determined. Section III presents the Newton-Raphson iteration and how it is applied in our design; the error analysis of the Newton-Raphson iteration is also given in that section. Section IV gives an overview of the architecture of the reciprocal unit and describes how it works in every clock cycle. Section V presents the implementation results in ASIC and FPGA. Finally, Section VI gives the conclusion.

0-7803-9197-7/05/$20.00 © 2005 IEEE.
Authorized licensed use limited to: University of Saskatchewan. Downloaded on January 27, 2010 at 02:09 from IEEE Xplore. Restrictions apply.

II. CREATING LOOK-UP TABLE

The look-up table is central to our design: it determines the size of the ROM and also how much accuracy can be guaranteed. The algorithm used to create the look-up table is as follows.

According to IEEE Standard 754 [4], the operand X is a normalized 64-bit double-precision floating-point number whose mantissa lies in the range $1 \le X < 2$. An implicit leading one and the 52 fractional bits constitute the mantissa. The 53-bit mantissa is expressed as

$$X_{mantissa} = [1.x_1 x_2 x_3 \ldots x_{52}] \tag{1}$$

X can be split into two parts, one from the 1st to the m-th bit and the other from the (m+1)-th to the 52nd bit [3]:

$$X_1 = [1.x_1 x_2 x_3 \ldots x_m] \tag{2}$$

$$X_2 = [0.x_{m+1} x_{m+2} \ldots x_{52}] \times 2^{-m} \tag{3}$$

$$X_{mantissa} = X_1 + X_2 \tag{4}$$

The initial reciprocal approximation $X^{-1}$ is obtained by the first-order Taylor expansion:

$$X^{-1} = (X_1 + 2^{-m-1})^{-1} - (X_1 + 2^{-m-1})^{-2} (X_2 - 2^{-m-1}) \tag{5}$$
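The accuracy of the expansion in equation (5) can be checked numerically. The following is a minimal sketch in exact rational arithmetic, not a model of the hardware; the function names `split_mantissa` and `taylor_reciprocal` and the sample mantissa are illustrative assumptions. Writing $a = X_1 + 2^{-m-1}$, the difference between $1/X$ and the right-hand side of equation (5) equals $(X - a)^2 / (a^2 X)$, which is non-negative and smaller than $2^{-(2m+2)}$ because $|X - a| \le 2^{-m-1}$ and $a, X \ge 1$.

```python
import math
from fractions import Fraction

def split_mantissa(x: Fraction, m: int) -> tuple[Fraction, Fraction]:
    """Split a mantissa 1 <= x < 2 into X1 (eq. 2) and X2 (eq. 3)."""
    scale = 1 << m
    x1 = Fraction(math.floor(x * scale), scale)  # leading 1 and first m fraction bits
    x2 = x - x1                                  # remaining low-order bits, 0 <= x2 < 2^-m
    return x1, x2

def taylor_reciprocal(x: Fraction, m: int) -> Fraction:
    """First-order Taylor approximation of 1/x around X1 + 2^-(m+1), eq. (5)."""
    x1, x2 = split_mantissa(x, m)
    half_step = Fraction(1, 1 << (m + 1))        # 2^-(m+1)
    a = x1 + half_step                           # expansion point
    return 1 / a - (x2 - half_step) / a**2

if __name__ == "__main__":
    m = 7
    # Arbitrary 53-bit mantissa chosen only for illustration.
    x = 1 + Fraction(0x8F5C28F5C28F6, 1 << 52)
    error = 1 / x - taylor_reciprocal(x, m)
    print(f"error = {float(error):.3e}, bound 2^-(2m+2) = {2.0 ** -(2 * m + 2):.3e}")
```

Exact fractions are used here so that the measured error reflects only the Taylor truncation and not any rounding in the check itself.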
Another expression can be derived from equation (5) as follows:

$$X^{-1} = (X_1 + 2^{-m-1})^{-2} \times [(X_1 + 2^{-m-1}) - (X_2 - 2^{-m-1})] \tag{6}$$

In equation (6), the first factor $(X_1 + 2^{-m-1})^{-2}$ is a constant that is read from the ROM addressed by $X_1$ (without the leading 1). The remaining factor $[(X_1 + 2^{-m-1}) - (X_2 - 2^{-m-1})]$ is obtained from the operand modifier, which keeps the bits from the 1st to the m-th unchanged and inverts the bits from the (m+1)-th to the 52nd (inversion is written as ~):

$$C = (X_1 + 2^{-m-1})^{-2} \tag{7}$$

$$X' = [1.x_1 x_2 \ldots x_m \, {\sim}x_{m+1} \, {\sim}x_{m+2} \ldots {\sim}x_{52}] \tag{8}$$

Multiplying the term C by the modified operand X' gives the initial reciprocal approximation, whose accuracy is (2m+2) bits; the corresponding table size is $2^m \times (2m+2)$ bits. If a reciprocal with 52-bit accuracy is to be computed, the table should have a size of $2^3 \times 8 = 64$ bits or $2^6 \times 14 = 896$ bits, followed by 3 or 2 Newton-Raphson iterations respectively.

In a practical implementation of the reciprocal unit, however, the algorithm needs to be amended, because the designer must consider area and power consumption. In the modified algorithm [5], equation (3) is revised to

$$X_2 = [0.x_{m+1} x_{m+2} \ldots x_{2m}] \times 2^{-m} \tag{9}$$

where m < 26. Correspondingly, X' becomes

$$X' = [1.x_1 x_2 \ldots x_m \, {\sim}x_{m+1} \, {\sim}x_{m+2} \ldots {\sim}x_{2m}] \tag{10}$$

These modifications reduce the accuracy of the initial reciprocal approximation. To keep the ROM size, which is determined by $2^m \times (2m+2)$, at a minimum, the value of m must be chosen carefully. A group of 256 test vectors, each a 53-bit mantissa, is selected. They are uniformly distributed between 1 and 2, and some special test vectors are included as well, such as all ones or all zeros in the 52-bit fractional part.

First, m = 6 is chosen, because theoretically 52-bit accuracy can be achieved after two Newton-Raphson iterations. The architecture simulated in MATLAB matches the hardware implementation of the reciprocal unit. The simulation results show, however, that more than half of the 256 test vectors cannot achieve 52-bit accuracy after the reciprocal computation, so choosing m = 6 is not feasible. Following this simulation, m = 7, 8 and 9 are selected. Four look-up tables are created, with sizes of $2^6 \times 14$, $2^7 \times 16$, $2^8 \times 18$ and $2^9 \times 20$ bits respectively. Figure 1 compares the true values with the simulation results for the reciprocals of the 256 test vectors at the least significant 13 bits of the fractional part. For m = 6, the bits that differ between the true values and the simulation results lie within the least significant 13 bits of the mantissa, whereas for m = 7, 8 and 9 they lie within the least significant 7 bits. For m = 7, 8 and 9, almost every result achieves 52-bit accuracy, and the total number of erroneous bits is very close in the three simulations. Of these three, the $2^7 \times 16$-bit table is the smallest, and the accuracy of the initial reciprocal approximation obtained from it is 14 bits. This means that the approximation can be truncated during the Newton-Raphson iterations without affecting the 52-bit accuracy. A ROM of $2^7 \times 16$ bits is therefore chosen for our implementation; it is smaller than the $2^{10} \times 20$-bit ROM in [5].

Figure 1 Error Analysis of 256 Test Vectors' Reciprocal
Table 1 The Values in the ROM

Address bits (7 bits)    ROM's values (16 bits)
0000000                  1111111000000010
0000001                  1111101000011010
0000010                  1111011001001001
0000011                  1111001010001101
0000100                  1110111011101000
0000101                  1110101101010111
0000110                  1110011111011010
…                        …
1111111                  0100000001000000

Table 1 shows a selection of the values stored in the ROM. In total, the ROM holds 128 values of 16 bits each, so a 16-bit value is read from the ROM through 7 address bits. The output of the ROM is then multiplied by the result of the operand modifier to obtain the initial approximation of the reciprocal.
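The ROM contents and the initial approximation can be sketched in a few lines. This is an illustration in exact rational arithmetic rather than a bit-accurate model of the design: the names `build_rom`, `modified_operand` and `initial_approximation` are hypothetical, and storing each word as C scaled by $2^{16}$ and truncated (not rounded) is an assumption, made because it reproduces the entries listed in Table 1. The operand modifier follows equation (10) with m = 7.

```python
import math
from fractions import Fraction

M = 7          # number of address bits (m in the text)
ROM_BITS = 16  # width of one ROM word, 2m + 2

def build_rom(m: int = M, bits: int = ROM_BITS) -> list[int]:
    """ROM word for each address: C = (X1 + 2^-(m+1))^-2 as a `bits`-bit fraction."""
    rom = []
    for addr in range(1 << m):
        # X1 + 2^-(m+1), with X1 = 1 + addr / 2^m
        a = Fraction(2 * ((1 << m) + addr) + 1, 1 << (m + 1))
        c = 1 / a**2                                # eq. (7)
        rom.append(math.floor(c * (1 << bits)))     # assumption: truncation, not rounding
    return rom

def modified_operand(frac52: int, m: int = M) -> Fraction:
    """X' of eq. (10): first m fraction bits kept, the next m bits inverted."""
    top = frac52 >> (52 - 2 * m)     # first 2m bits of the 52-bit fraction
    top ^= (1 << m) - 1              # invert the lower m of them
    return 1 + Fraction(top, 1 << (2 * m))

def initial_approximation(frac52: int, rom: list[int],
                          m: int = M, bits: int = ROM_BITS) -> Fraction:
    """ROM word addressed by the first m fraction bits, times the modified operand."""
    addr = frac52 >> (52 - m)
    c = Fraction(rom[addr], 1 << bits)
    return c * modified_operand(frac52, m)

if __name__ == "__main__":
    rom = build_rom()
    print(format(rom[0], "016b"))    # 1111111000000010, first row of Table 1
    print(format(rom[127], "016b"))  # 0100000001000000, last row of Table 1
    frac = 0x8F5C28F5C28F6           # arbitrary 52-bit fraction, for illustration only
    x = 1 + Fraction(frac, 1 << 52)
    print(float(1 / x - initial_approximation(frac, rom)))  # error of the initial estimate
```

The input `frac52` is the 52-bit fraction field of the mantissa as an integer; the truncation of the product in multiplier 1 is not modelled here.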
III. NEWTON-RAPHSON ITERATION

Newton-Raphson iteration is a well-known iterative method for approximating a root of a non-linear function. Let $f(x)$ be a well-behaved function and let r be a root of the equation $f(x) = 0$. We start with an estimate $x_0$ that is close to r and let $r = x_0 + h$, so that the number h measures how far the estimate $x_0$ is from the truth. Since h is small, the linear approximation can be used to conclude that [6]

$$0 = f(r) = f(x_0 + h) \approx f(x_0) + h f'(x_0) \tag{11}$$

and therefore, unless $f'(x_0)$ is close to 0,

$$h \approx -\frac{f(x_0)}{f'(x_0)} \tag{12}$$

It follows that

$$r = x_0 + h \approx x_0 - \frac{f(x_0)}{f'(x_0)} \tag{13}$$

Our new, improved estimate $x_1$ of r is therefore given by

$$x_1 = x_0 - \frac{f(x_0)}{f'(x_0)} \tag{14}$$

Continuing in this way, if $x_i$ is the current estimate, the next estimate $x_{i+1}$ is given by

$$x_{i+1} = x_i - \frac{f(x_i)}{f'(x_i)} \tag{15}$$

Equation (15) is called the Newton-Raphson formula. To compute the reciprocal, the following function and its derivative are used:

$$f(x) = 1/x - X \tag{16}$$

$$f'(x) = -1/x^2 \tag{17}$$

Substituting equations (16) and (17) into equation (15) yields

$$x_{i+1} = x_i (2 - X x_i) \tag{18}$$

Equation (18) can be rewritten as

$$x_{i+1} = 2 x_i - X x_i^2 \tag{19}$$

which can be implemented in hardware and doubles the accuracy in each iteration. Using the form of equation (19), one squaring, one multiplication, one shift and one subtraction are required to compute $x_{i+1}$. Let $\delta_i = 1/X - x_i$ be the error at the i-th iteration; then [7]

$$\delta_{i+1} = 1/X - x_{i+1} = 1/X - x_i (2 - x_i X) \tag{20}$$

which can also be expressed as

$$\delta_{i+1} = X (1/X - x_i)^2 = X \delta_i^2 \tag{21}$$

Equation (21) shows that the absolute error decreases quadratically with each Newton-Raphson iteration, since it is proportional to the square of the previous error.

IV. HARDWARE IMPLEMENTATION OF THE RECIPROCAL UNIT

Figure 2 shows the architecture of the reciprocal unit. It takes 10 clock cycles to compute an approximation with 52-bit accuracy of the reciprocal of a double-precision floating-point number. The architecture can be viewed as two stages. In the first stage, the initial reciprocal approximation of X is obtained through the look-up table and the operand modifier. The size of the ROM is $2^7 \times 16$ bits, which is a trade-off between hardware cost and accuracy. The second stage implements the Newton-Raphson iteration.

Figure 2 The Architecture of Reciprocal Unit

At the beginning, the first m (m = 7) bits are taken from the mantissa of X and used as the address bits to read a 16-bit value from the ROM. At the same time, the first 15 bits of X are passed to the operand modifier, which consists of seven inverters. Of these 15 bits, the 7 bits following the leading one are kept unchanged and the 7 least significant bits are inverted. The values from the ROM and from the operand modifier are thus both available in the first clock cycle.

In the second clock cycle, the output of the ROM is multiplied by the output of the operand modifier in multiplier 1 to obtain the initial approximation of the reciprocal of X. The output of multiplier 1 is truncated to 16 bits and concatenated with 13 zero bits. It is selected by MUX1 and sent to the squarer, whose result is available in the third clock cycle. A squarer is used here because it is smaller than a multiplier. Each Newton-Raphson iteration thus requires only one squaring, one multiplication and one subtraction. In the 4th and 5th clock cycles, the 53-bit mantissa is multiplied by the result of the squarer in multiplier 2, followed by a rounding operation. In the 6th clock cycle, the shifted output of multiplier 1 is selected by MUX2, and the rounded result of multiplier 2 is subtracted from it, as in equation (19). This gives the result of the first Newton-Raphson iteration. It is then truncated and selected by MUX1 to be squared in the 7th clock cycle. In the 8th and 9th clock cycles, the multiplication is performed again. MUX2 then selects the shifted value of the first iteration result, which is passed to the subtracter together with the rounded product. In the 10th clock cycle, the result of the second Newton-Raphson iteration is obtained, which is the approximation with 52-bit accuracy.

Among the 256 test cases, there are a few that cannot achieve 52-bit accuracy. In other words, 52-bit accuracy cannot be guaranteed for all operands in this design, but 51-bit accuracy is always achieved. The reciprocal unit takes 2 clock cycles to obtain the initial approximation of the reciprocal, and each Newton-Raphson iteration takes 4 clock cycles. Consequently, it takes 10 clock cycles to obtain the reciprocal approximation of a double-precision floating-point number in our reciprocal unit.
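The two Newton-Raphson iterations and the error recurrence of equation (21) can be illustrated with a short sketch. It evaluates equation (19) in exact rational arithmetic and deliberately ignores the truncation and rounding steps of the data path described above, so it shows the ideal behaviour only. The function names are hypothetical, and the starting error of $2^{-14}$ is an assumption standing in for the 14-bit accuracy of the initial approximation.

```python
import math
from fractions import Fraction

def newton_raphson_step(x_i: Fraction, X: Fraction) -> Fraction:
    """One iteration in the form of eq. (19): x_{i+1} = 2*x_i - X*x_i^2."""
    square = x_i * x_i        # squarer
    product = X * square      # multiplier 2
    return 2 * x_i - product  # shift (2*x_i) and subtraction

def reciprocal_errors(X: Fraction, x0: Fraction, iterations: int = 2) -> list[Fraction]:
    """Return the errors delta_i = 1/X - x_i for i = 0 .. iterations."""
    errors = [1 / X - x0]
    x = x0
    for _ in range(iterations):
        x = newton_raphson_step(x, X)
        errors.append(1 / X - x)
    return errors

if __name__ == "__main__":
    X = Fraction(3, 2)                    # sample mantissa, 1 <= X < 2
    x0 = 1 / X - Fraction(1, 1 << 14)     # assumed initial estimate with error 2^-14
    for i, delta in enumerate(reciprocal_errors(X, x0)):
        print(f"delta_{i} is about 2^{math.log2(float(delta)):.1f}")
```

In this idealized setting each error equals X times the square of the previous one, so an initial error of $2^{-14}$ falls below $2^{-52}$ after two iterations for any $1 \le X < 2$. The hardware result can differ in the last bits because of the truncation and rounding that this sketch leaves out.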
V. IMPLEMENTATION RESULTS

The proposed architecture of the reciprocal unit is modeled in VHDL, synthesized with AMIS 0.35 µm 5-metal technology using Synopsys, and routed using Silicon Ensemble [8]. The RTL and gate-level netlists are both verified against the same test vectors generated from the MATLAB fixed-point model. Table 2 compares our design with [5] in several aspects.

Table 2 Comparison of This Design with [5]

                              This work          [5]
ROM size (bits)               $2^7 \times 16$    $2^{10} \times 20$
Max clock frequency (MHz)     66.7               200
Core area (gates number)      29110              67908
Latency (clock cycles)        10                 12

The design contains 29k core gates, and its worst-case delay path is in the squarer, which determines the maximum clock frequency of 66.7 MHz. The core area of the chip is 2 mm × 2 mm. The reciprocal unit is also implemented on a Xilinx Virtex XCV1000E FPGA board with package bg560 and speed grade -8. The implementation occupies 1 of 96 BRAMs, 1 of 4 GCLK I/O blocks, 109 of 404 I/O blocks and 3249 of 12288 slices. The maximum clock frequency is 46.3 MHz.
VI. CONCLUSION

This paper presents the design of a reciprocal unit for double-precision floating-point numbers. The reciprocal unit uses a ROM of $2^7 \times 16$ bits for the initial approximation algorithm described in this paper, and it takes 10 clock cycles to compute the reciprocal using a look-up table and two Newton-Raphson iterations. 52-bit accuracy is achieved in almost all cases. The presented architecture can be modified further to achieve a higher clock frequency and a smaller area.
REFERENCES

[1] M. J. Schulte, J. E. Stine and K. E. Wires, “High-Speed Reciprocal Approximations”, Conference Record of the Thirty-First Asilomar Conference on Signals, Systems & Computers, vol. 2, 2-5 Nov. 1997, pp. 1183-1187.
[2] M. J. Schulte and J. E. Stine, “Symmetric bipartite tables for accurate function approximation”, Proceedings of the 13th IEEE Symposium on Computer Arithmetic, 6-9 July 1997, pp. 175-183.
[3] N. Takagi, “Generating a power of an operand by a table look-up and a multiplication”, Proceedings of the 13th IEEE Symposium on Computer Arithmetic, 6-9 July 1997, pp. 126-131.
[4] Institute of Electrical and Electronics Engineers, New York, NY, ANSI/IEEE 754-1985 Standard for Binary Floating-Point Arithmetic, 1985.
[5] U. Kucukkabak and A. Akkas, “Design and implementation of reciprocal unit using table look-up and Newton-Raphson iteration”, Euromicro Symposium on Digital System Design (DSD 2004), 31 Aug.-3 Sept. 2004, pp. 249-253.
[6] Andrew Adler, “Notes on Newton-Raphson method”, available online: http://www.math.ubc.ca/~adler/math104184/newtonmethod.pdf
[7] Behrooz Parhami, “Computer Arithmetic: Algorithms and Hardware Designs”, Oxford University Press, October 1999, pp. 261-272.
[8] Zhan Guo, “A short introduction to ASIC design flow in AMIS library”, available online: http://www.es.lth.se/home/zgo/, April 2003.