An Efficient Hardware Implementation for a Reciprocal Unit

Andreas Habegger, Andreas Stahel, Josef Goette, and Marcel Jacomet
Bern University of Applied Sciences, MicroLab, CH-2501 Biel-Bienne, Switzerland

Abstract

The computation of the reciprocal of a numerical value is an important ingredient of many algorithms. We present a compact hardware architecture that computes reciprocals by two or three Newton-Raphson iterations to obtain the accuracy of the IEEE 754 single- and double-precision standards, respectively. We estimate the initialization value by a specially designed second-order polynomial approximating the reciprocal. By using a second-order polynomial, we succeed in using one single hardware architecture for both the polynomial-approximation computations and the Newton-Raphson iterations. We therefore obtain a very compact hardware implementation for the complete reciprocal computation.

Keywords: Arithmetic inversion, reciprocal, Newton-Raphson, polynomial initialization, Nelder-Mead, hardware algorithm.
1. Introduction

Among the four basic arithmetic operations, division is the most complicated one. Nevertheless, hardware algorithms often need fast and compact division units, precisely because division is time-, chip-area-, and power-consuming. Division can either be done directly, N/D, or by first computing the reciprocal of the denominator, 1/D, followed by a multiplication with the numerator N; the latter method is especially useful if different numerators are to be divided by the same denominator. We concentrate on computing the reciprocal.

An overview and comparison of various division algorithms is given in [1]. Division approximation algorithms are recursive in nature and can be grouped into algorithms with linear convergence and algorithms with quadratic convergence [2]. Non-restoring, restoring, SRT (Sweeney, Robertson, and Tocher [3, 4, 5]), and CORDIC [6, 7] are examples of linear convergence; reciprocation by Newton-Raphson and Goldschmidt's division by convergence [8] are examples of algorithms with quadratic convergence. The algorithms with linear convergence
suffer from high latency, while the algorithms with quadratic convergence are costly in terms of chip area and computational complexity. Many efforts have been made to improve reciprocal division algorithms with quadratic convergence. A polynomial-based division algorithm for low resolutions is proposed in [9]. An improvement of Goldschmidt's division-by-convergence algorithm has recently been shown in [10]. Various improvements of the Newton-Raphson method have been published, some of which concentrate on so-called modified Newton-Raphson algorithms, while others focus on the approximation of the initial start value of the Newton-Raphson algorithm. Look-up table solutions for the initial value approximation are common [11], but by nature they need quite large memories if high accuracy combined with a low iteration count is required. To reduce the look-up table memory size, [12] introduces the Symmetric Bipartite Table Method.
2. Architecture

The Newton-Raphson method is an iterative approximation algorithm. In iterative algorithms, high hardware efficiency with respect to chip area is achieved by using all hardware elements in every iteration cycle; no sleeping hardware block should be present. We distinguish two phases in the Newton-Raphson method: the computation of the initial start value and the iterative Newton-Raphson solution approximation. The hardware architectures presented so far use different hardware units for these two phases, so one of them is always inactive. Our approach to an efficient reciprocal hardware unit is to use a single hardware unit for both the initial value approximation and the Newton-Raphson approximation. With such an approach, the hardware unit is never partially in sleep mode, resulting in an optimally efficient use of the chip area without additional delay or latency. The Newton-Raphson method for computing the reciprocal of a numerical value is
xᵢ₊₁ = xᵢ · (2 − xᵢ · D) = −D · xᵢ² + 2 · xᵢ + 0 .    (1)
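As a quick illustration of iteration (1), the following floating-point sketch (not the fixed-point datapath of Figure 1, and with an arbitrary start value rather than the polynomial initial value) shows how the approximation error roughly squares with every step:

```python
# Floating-point sketch of iteration (1): x <- x * (2 - x * D).
# This is not the fixed-point datapath of Figure 1; the start value x = 1.0
# is a deliberately crude placeholder, not the polynomial initial value.
D = 0.7
x = 1.0
true_reciprocal = 1.0 / D
for i in range(5):
    print(i, abs(x - true_reciprocal))   # error roughly squares every step
    x = x * (2.0 - x * D)                # one Newton-Raphson step, eq. (1)
```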
Observe that the right-hand side of the above equation is a second-order polynomial with the constant term being zero. Therefore, using a second-order polynomial for the initial value computation, we can use the same hardware element for the polynomial computations as for the Newton-Raphson iterations; only a minor modification is needed, see Figure 1. The second-order polynomial used to calculate the initial value x₀, approximating the reciprocal of D, is

x₀ = a · D² + b · D + c .    (2)

As both equations, (1) and (2), evaluate second-order polynomials, a hardware unit combining both of them is easily found, as Figure 1 shows. With the multiplexors set to their right positions, the hardware unit calculates the initial value approximation, using the polynomial with coefficients a, b, and c. If the multiplexors are in their left positions, it computes Newton-Raphson iterations using the data on the feedback path.
Figure 1. Efficient hardware implementation for a reciprocal unit combining initial value and Newton-Raphson approximation in a single hardware element.
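The following Python sketch illustrates, in software terms, the "one datapath, two phases" arrangement of Figure 1: one second-order-polynomial evaluator serves both the initial guess (2) and every Newton-Raphson step (1), with only the selected operands changing. The function names poly2 and reciprocal are illustrative, plain floating point replaces the fixed-point datapath, and the coefficients a, b, c are a simple least-squares stand-in, not the optimized coefficients determined in Section 3.

```python
# Both phases reuse the same second-order-polynomial element p*y^2 + q*y + r;
# only the operands selected by the multiplexers of Figure 1 differ.
import numpy as np

def poly2(p, q, r, y):
    """The shared hardware element: evaluate p*y^2 + q*y + r."""
    return p * y * y + q * y + r

# Placeholder coefficients: quadratic least-squares fit of 1/D on [0.5, 1],
# NOT the paper's optimized polynomial coefficients.
d_grid = np.linspace(0.5, 1.0, 101)
a, b, c = np.polyfit(d_grid, 1.0 / d_grid, 2)

def reciprocal(D, iterations=2):
    x = poly2(a, b, c, D)              # phase 1: initial value, eq. (2)
    for _ in range(iterations):
        x = poly2(-D, 2.0, 0.0, x)     # phase 2: Newton-Raphson step, eq. (1)
    return x

print(reciprocal(0.7), 1.0 / 0.7)      # two iterations already come close
```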
Our goal is to achieve the approximation precision needed for single-precision IEEE 754 floating-point values in two Newton-Raphson iterations, and the precision needed for double-precision floating-point values in three Newton-Raphson iterations. The Newton-Raphson method doubles its precision with every iteration; thus the 24 bits needed for a 32-bit single-precision floating-point number require an initial value precision of more than 6 bits, and the 53 bits needed for a 64-bit double-precision floating-point number require an even higher initial value precision of more than 6.625 bits. The goal is thus to find a second-order polynomial that achieves the requested initial precision of 6.625 bits (see the discussion in Section 3).
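A back-of-the-envelope check of this argument, assuming only that the number of correct bits doubles with every iteration:

```python
# With k precision-doubling iterations, the initial value must already be
# accurate to target_bits / 2**k bits.
for name, target_bits, iterations in [("single", 24, 2), ("double", 53, 3)]:
    print(name, target_bits / 2 ** iterations, "initial bits required")
# prints 6.0 for single precision and 6.625 for double precision
```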
According to the above arguments, the number widths indicated in Figure 1 are valid for single-precision accuracy using two Newton-Raphson iterations; to realize double-precision accuracy, the indicated number widths must be replaced as follows: 14 → 27, 18 → 18, 25 → 54, 26 → 55, and 28 → 57. As mentioned above, we end up with a hardware unit that calculates the initial guess in a first cycle and then, depending on the desired accuracy, the final value in two or three further iterations, respectively.
3. Theory

To determine the value of 1/D for some D ≤ 1, we solve the equation

f(x) = 1/x − D = 0 .
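For reference, applying the general Newton-Raphson update to this function recovers iteration (1); the following derivation sketch uses only quantities already defined above.

```latex
% Newton-Raphson update applied to f(x) = 1/x - D; its root is x = 1/D.
\[
  f(x) = \frac{1}{x} - D, \qquad f'(x) = -\frac{1}{x^{2}},
\]
\[
  x_{i+1} = x_i - \frac{f(x_i)}{f'(x_i)}
          = x_i + x_i^{2}\left(\frac{1}{x_i} - D\right)
          = x_i\,(2 - x_i\,D) = -D\,x_i^{2} + 2\,x_i ,
\]
% which is exactly iteration (1).
```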