On Local Roundoff Errors in Floating-Point Arithmetic
TOYOHISA KANEKO AND BEDE LIU
Princeton University, Princeton, New Jersey

ABSTRACT. A bound on the relative error in floating-point addition using a single-precision accumulator with guard digits is derived. It is shown that even with a single guard digit, the accuracy can be almost as good as that using a double-precision accumulator. A statistical model for the roundoff error in double-precision multiplication and addition is also derived. The model is confirmed by experimental measurements.

KEY WORDS AND PHRASES: roundoff error, guard digit, computer arithmetic

CR CATEGORIES: 5.11
1. Introduction
This paper investigates two problems in connection with the local roundoff error in floating-point arithmetic operations. A bound on the relative error in floating-point addition using a single-precision accumulator with guard digits is derived in Section 2. Error bounds have been known for both single- and double-precision additions [3, 11]. Our study is motivated in part by the appearance of some recent computers using guard digits, such as the IBM 360 series, which uses radix 16 and a guard digit for addition [2]. It is shown in Section 2 that even with one guard digit, the accuracy of a single-precision addition can be almost as good as that using a double-precision accumulator, provided the radix is moderately large.¹

The second problem is the statistical characterization of local roundoff errors in double-precision multiplication and addition. A statistical model is derived in Section 3 which, in view of the result obtained in Section 2, can also be applied to single-precision addition with a guard digit. Statistical models for roundoff errors have been used in investigating the propagation of errors for digital computer solution of differential equations, algebraic processes, digital filters, and the fast Fourier transform [5, 6, 8]. However, in most of these studies, binary arithmetic is assumed and the local roundoff error is taken to be independent and uniformly distributed in its range. The assumption of a uniform density, although quite satisfactory for binary arithmetic, does not agree well with experimental observation for a larger radix. As will be seen in Section 4, the model developed in Section 3 agrees quite well with experimental measurements.
2. Single-Precision Addition Using Guard Digits
This section investigates the roundoff error caused by floating-point addition on a single-precision register with guard digits. Multiplication is not considered, since a double-precision register would normally be used.

Copyright © 1973, Association for Computing Machinery, Inc. General permission to republish, but not for profit, all or part of this material is granted provided that ACM's copyright notice is given and that reference is made to the publication, to its date of issue, and to the fact that reprinting privileges were granted by permission of the Association for Computing Machinery.
This work was supported by the Air Force Office of Scientific Research, USAF, under Grant No. AFOSR-71-2101, and by the National Science Foundation under Grant GK-24187.
Authors' present addresses: T. Kaneko, IBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598; B. Liu, Department of Electrical Engineering, Brackett Hall, Engineering Quadrangle, Princeton University, Princeton, NJ 08540.
¹ According to an anonymous reviewer, this problem is also treated in [1, 7].
Journal of the Association for Computing Machinery, Vol. 20, No. 3, July 1973, pp. 391-398.
FIG. 1. Floating-point addition on a single-precision accumulator with $g$ guard digits. (The diagram shows the mantissas $b_2$ and the right-shifted $b_1$ in the accumulator, the guard digits, and the digits to be discarded.)
Consider the addition of two floating-point numbers $x_1 = (\mathrm{sgn})_1 \beta^{a_1} b_1$ and $x_2 = (\mathrm{sgn})_2 \beta^{a_2} b_2$, where $\beta$ is the radix, $a_1$ and $a_2$ are the characteristics, and $b_1$ and $b_2$ are the mantissas. Only normalized operation is considered here, i.e. the leading digit of a nonzero mantissa is not zero. Thus $1 > b_1 \ge 1/\beta$ or $b_1 = 0$, and $1 > b_2 \ge 1/\beta$ or $b_2 = 0$. The number of digits in the mantissa is denoted by $t$. We can take $a_1 \le a_2$ and $(\mathrm{sgn})_2$ to be $+1$ without loss of generality. We define the relative error $\varepsilon$ by the equation

$$\mathrm{fl}(x_1 + x_2) = (x_1 + x_2)(1 + \varepsilon), \tag{1}$$
where $\mathrm{fl}(\cdot)$ denotes the actual calculated result using floating-point operations. We shall derive a bound for the relative error $\varepsilon$. The first step in the addition is the right shifting of the mantissa $b_1$ by $(a_2 - a_1)$ places. Unlike the usual single-precision procedure, which loses all the digits beyond the most significant $t$ digits, we keep a total of $(t + g)$ digits, where $g \ge 1$ is the number of "guard" digits. With the use of these guard digits, those digits beyond the most significant $(t + g)$ digits are rounded or chopped,² as the case may be. See Figure 1 for an illustration. Thus, if $a_2 - a_1 \le g$, all the digits of $b_1$ are retained. Otherwise an error will be introduced by this right shift. Let $A_1$ be the result after the shift. Then

$$A_1 = b_1 \beta^{-(a_2 - a_1)} + \varepsilon_1, \tag{2}$$
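with

$$
\begin{aligned}
&\varepsilon_1 = 0 && \text{if } a_2 - a_1 \le g,\\
&-\beta^{-(t+g)}/2 \le \varepsilon_1 \le \beta^{-(t+g)}/2 && \text{if } a_2 - a_1 > g \text{ and rounding is used},\\
&-\beta^{-(t+g)} \le \varepsilon_1 \le 0 && \text{if } a_2 - a_1 > g \text{ and chopping is used}.
\end{aligned} \tag{3}
$$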
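The alignment step of eqs. (2) and (3) can be sketched with exact rational arithmetic. This is an illustrative sketch rather than code from the paper: the helper names `quantize` and `align`, the mode strings "round" and "chop", and the defaults $\beta = 16$, $t = 6$, $g = 1$ (chosen to match the IBM 360 example mentioned in the Introduction) are all assumptions.

```python
from fractions import Fraction
import math

def quantize(v, digits, beta=16, mode="round"):
    """Keep `digits` radix-beta fractional digits of a magnitude v >= 0."""
    scaled = Fraction(v) * beta**digits
    if mode == "round":
        n = math.floor(scaled + Fraction(1, 2))   # add half a unit, then truncate
    else:
        n = math.floor(scaled)                    # chopping: drop the extra digits
    return Fraction(n, beta**digits)

def align(b1, shift, t=6, g=1, beta=16, mode="round"):
    """Right-shift mantissa b1 by `shift` places, keeping t + g digits.

    Returns (A1, eps1) as in eq. (2): A1 = b1 * beta**-shift + eps1.
    """
    exact = Fraction(b1) / beta**shift
    A1 = quantize(exact, t + g, beta, mode)
    return A1, A1 - exact

b1 = Fraction(0xABCDEF, 16**6)                    # a six-digit hexadecimal mantissa
for shift in (1, 3):
    for mode in ("round", "chop"):
        print(shift, mode, align(b1, shift, mode=mode)[1])
```

With a shift of one place, no more than $g$, the shifted mantissa fits in the $t + g$ digits and the error is zero. With a shift of three places the error falls in the rounding or chopping range of eq. (3).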
The next step in the addition of $x_1$ and $x_2$ is to add algebraically $A_1$ to the mantissa $b_2$. The result is left-shifted if renormalization is needed, and right-shifted by one if an overflow occurs. All the digits beyond the first $t$ digits are either rounded off or chopped off. Let $L$ be the number of places of left shift, $L \ge -1$, and let $A_2$ be the result after rounding or chopping. It is clear then that

$$A_2 = \beta^{-L}\{\beta^{L}[b_2 + (\mathrm{sgn})_1 A_1] + \varepsilon_2\}, \tag{4}$$

with

$$1 > \left|\beta^{L}[b_2 + (\mathrm{sgn})_1 A_1] + \varepsilon_2\right| \ge 1/\beta \tag{5}$$

and

$$
\begin{aligned}
&\varepsilon_2 = 0 && \text{if } L \ge g,\\
&-\beta^{-t}/2 \le \varepsilon_2 \le \beta^{-t}/2 && \text{if } L < g \text{ and rounding is used},\\
&-\beta^{-t} \le \varepsilon_2 \le 0 && \text{if } L < g \text{ and chopping is used}.
\end{aligned} \tag{6}
$$
The final result of the entire addition is

$$\mathrm{fl}(x_1 + x_2) = (\mathrm{sgn})_2 \beta^{a_2} A_2, \tag{7}$$
² In rounding, a 1 or a 0 is added to the $t$-th digit depending on whether or not the $(t+1)$-th digit is $\beta/2$ or larger. In chopping, those digits beyond $t$ are simply dropped.
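which, on using eqs. (2) and (4), becomes

$$\mathrm{fl}(x_1 + x_2) = (x_1 + x_2)\left[1 + \left(\varepsilon_1 (\mathrm{sgn})_1 + \varepsilon_2 \beta^{-L}\right)\left(b_2 + (\mathrm{sgn})_1 b_1 \beta^{-(a_2 - a_1)}\right)^{-1}\right]. \tag{8}$$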
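Therefore the error $\varepsilon$ defined by eq. (1) is given by

$$\varepsilon = \varepsilon_1 (\mathrm{sgn})_1 \left[b_2 + (\mathrm{sgn})_1 b_1 \beta^{-(a_2 - a_1)}\right]^{-1} + \varepsilon_2 \left\{\beta^{L}\left[b_2 + (\mathrm{sgn})_1 b_1 \beta^{-(a_2 - a_1)}\right]\right\}^{-1}. \tag{9}$$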
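The whole procedure leading to eq. (7) can be simulated with exact rational arithmetic, which makes the relative error of eq. (1) directly observable. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `quantize`, `decompose` and `fl_add` are hypothetical, operands are assumed to be exactly representable with $t$ digits, rounding and chopping act on magnitudes (sign-magnitude mantissas), and general signs are handled instead of fixing $(\mathrm{sgn})_2 = +1$.

```python
from fractions import Fraction
import math

def quantize(m, digits, beta, mode):
    """Round or chop a magnitude m >= 0 to `digits` radix-beta fractional digits."""
    scaled = Fraction(m) * beta**digits
    n = math.floor(scaled + Fraction(1, 2)) if mode == "round" else math.floor(scaled)
    return Fraction(n, beta**digits)

def decompose(x, beta):
    """Return (sign, a, b) with x = sign * beta**a * b and 1/beta <= b < 1 (x nonzero)."""
    sign, b, a = (1 if x > 0 else -1), abs(Fraction(x)), 0
    while b >= 1:                      # shift right until the mantissa is below 1
        b, a = b / beta, a + 1
    while b < Fraction(1, beta):       # shift left until the leading digit is nonzero
        b, a = b * beta, a - 1
    return sign, a, b

def fl_add(x1, x2, t=6, g=1, beta=16, mode="round"):
    """Simulate fl(x1 + x2) on a t-digit accumulator with g guard digits."""
    if x1 == 0 or x2 == 0:
        return Fraction(x1) + Fraction(x2)
    s1, a1, b1 = decompose(x1, beta)
    s2, a2, b2 = decompose(x2, beta)
    if a1 > a2:                        # enforce a1 <= a2 as in the text
        (s1, a1, b1), (s2, a2, b2) = (s2, a2, b2), (s1, a1, b1)
    # Step 1: right-shift b1 by a2 - a1 places, keeping t + g digits (eqs. 2 and 3).
    A1 = quantize(b1 / beta**(a2 - a1), t + g, beta, mode)
    # Step 2: algebraic addition of the aligned mantissas.
    s = s2 * b2 + s1 * A1
    if s == 0:
        return Fraction(0)
    # Step 3: renormalize and keep t digits (eqs. 4 to 6); here L = -a.
    sign, a, m = decompose(s, beta)
    A2 = sign * quantize(m, t, beta, mode) * Fraction(beta) ** a
    return Fraction(beta) ** a2 * A2   # eq. (7), with the sign carried inside A2

x1, x2 = Fraction(1), -(1 - Fraction(1, 16**6))
print(fl_add(x1, x2, g=1, mode="chop") / (x1 + x2))   # one guard digit
print(fl_add(x1, x2, g=0, mode="chop") / (x1 + x2))   # no guard digit
```

In this example the subtraction $1 - (1 - 16^{-6})$ is exact with one guard digit, whereas without a guard digit the chopped alignment shift yields a result 16 times the true difference.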
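Suppose $a_2 - a_1 \le g$. Then $\varepsilon_1 = 0$ and the denominator of the second term in eq. (9) is simply $\beta^{L}[b_2 + (\mathrm{sgn})_1 A_1]$, which according to inequalities (5) and (6) satisfies

$$\left|\beta^{L}\left[b_2 + (\mathrm{sgn})_1 b_1 \beta^{-(a_2 - a_1)}\right]\right| \ge \beta^{-1} - \beta^{-t}. \tag{10}$$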
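Thus, for $a_2 - a_1 \le g$, the first term in eq. (9) is zero and the second term can be bounded by using inequalities (6) and (10). The result is

$$
\begin{aligned}
&-\beta^{-t+1}(1 - \beta^{-t+1})^{-1}/2 \le \varepsilon \le \beta^{-t+1}(1 - \beta^{-t+1})^{-1}/2 && \text{for rounding},\\
&-\beta^{-t+1}(1 - \beta^{-t+1})^{-1} \le \varepsilon \le 0 && \text{for chopping}.
\end{aligned}
$$

Since usually $1 \gg \beta^{-t+1}$, the above inequalities can be replaced, for all practical purposes, by

$$
\begin{aligned}
&-\beta^{-t+1}/2 \le \varepsilon \le \beta^{-t+1}/2 && \text{for rounding},\\
&-\beta^{-t+1} \le \varepsilon \le 0 && \text{for chopping}.
\end{aligned} \tag{11}
$$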
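Suppose now $a_2 - a_1 > g$. Since $\left|b_2 + (\mathrm{sgn})_1 b_1 \beta^{-(a_2 - a_1)}\right| \ge \beta^{-1} - \beta^{-(g+1)}$, the first term on the right side of eq. (9) is bounded by

$$
\left|\varepsilon_1 (\mathrm{sgn})_1 \left[b_2 + (\mathrm{sgn})_1 b_1 \beta^{-(a_2 - a_1)}\right]^{-1}\right| \le
\begin{cases}
\tfrac{1}{2}\beta^{-(t+g)}\left[\beta^{-1} - \beta^{-(g+1)}\right]^{-1} & \text{for rounding},\\
\beta^{-(t+g)}\left[\beta^{-1} - \beta^{-(g+1)}\right]^{-1} & \text{for chopping}.
\end{cases} \tag{12}
$$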
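In order to bound the second term, we notice that the number of places of possible left shift is at most one since $a_2 - a_1 > g$. So $L \le 1$, and we have, from eqs. (2), (3), (5), and (6),

$$
\left|\beta^{L}\left[b_2 + (\mathrm{sgn})_1 b_1 \beta^{-(a_2 - a_1)}\right]\right| = \left|\beta^{L}[b_2 + (\mathrm{sgn})_1 A_1] + \varepsilon_2 - \varepsilon_2 - \varepsilon_1 \beta^{L} (\mathrm{sgn})_1\right| \ge \beta^{-1} - \beta^{-t-g+1} - \beta^{-t}. \tag{13}
$$

Again, since $\beta^{-1} \gg \beta^{-t}$, the terms $\beta^{-t-g+1}$ and $\beta^{-t}$ on the right side of inequality (13) are negligible. So for all practical purposes the second term on the right side of eq. (9) is bounded by

$$
\begin{aligned}
&-\tfrac{1}{2}\beta^{-t+1} \le \varepsilon_2 \left\{\beta^{L}\left[b_2 + (\mathrm{sgn})_1 b_1 \beta^{-(a_2 - a_1)}\right]\right\}^{-1} \le \tfrac{1}{2}\beta^{-t+1} && \text{for rounding},\\
&-\beta^{-t+1} \le \varepsilon_2 \left\{\beta^{L}\left[b_2 + (\mathrm{sgn})_1 b_1 \beta^{-(a_2 - a_1)}\right]\right\}^{-1} \le 0 && \text{for chopping}.
\end{aligned} \tag{14}
$$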
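From inequalities (11), (12), and (14), we obtain finally a bound for the relative error $\varepsilon$:

$$
\begin{aligned}
&-\frac{\beta^{-t+1}}{2(1 - \beta^{-g})} \le \varepsilon \le \frac{\beta^{-t+1}}{2(1 - \beta^{-g})} && \text{for rounding},\\
&-\frac{\beta^{-t+1}}{1 - \beta^{-g}} \le \varepsilon \le \frac{\beta^{-t-g+1}}{1 - \beta^{-g}} && \text{for chopping}.
\end{aligned} \tag{15}
$$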
When compared with the known bounds for a double-precision accumulator [3, 11], it is seen that the error is worse by a factor of $1/(1 - \beta^{-g})$. For the case of an IBM 360 computer, $\beta = 16$, $t = 6$, and $g = 1$, this factor is 16/15. Therefore, even with one guard digit, a single-precision accumulator can perform addition almost as accurately as a double-precision accumulator if the radix is moderately large. At the end of Section 3 we shall give reasons why the statistical model of the roundoff error derived there for double-precision accumulators may also be applied to single-precision addition with guard digits.
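The bound of eq. (15) and the factor $1/(1 - \beta^{-g})$ are easy to tabulate exactly. The sketch below is illustrative only: the function names `degradation_factor` and `guard_digit_bound` and the mode strings are assumptions, and $g \ge 1$ is required, since the factor is undefined for $g = 0$.

```python
from fractions import Fraction

def degradation_factor(beta, g):
    """Factor 1/(1 - beta**-g) by which eq. (15) exceeds the double-precision bound (g >= 1)."""
    return 1 / (1 - Fraction(1, beta**g))

def guard_digit_bound(beta=16, t=6, g=1, mode="round"):
    """Return (lower, upper) bounds on the relative error from eq. (15)."""
    unit = Fraction(1, beta ** (t - 1))           # beta**(-t+1)
    k = degradation_factor(beta, g)
    if mode == "round":
        return -unit * k / 2, unit * k / 2
    return -unit * k, unit * k / beta**g          # chopping: upper bound is beta**(-t-g+1) * k

for g in (1, 2, 3):
    print(g, degradation_factor(16, g))           # prints 1 16/15, then 2 256/255, then 3 4096/4095
```

For radix 16 the factor is already within about 7 percent of unity with one guard digit, and each further guard digit brings it closer to unity by roughly another factor of 16.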
3. A Statistical Model of Floating-Point Roundoff Errors
(A) MULTIPLICATION ERRORS. The floating-point multiplication of two numbers $x_1 = (\mathrm{sgn})_1 \beta^{a_1} b_1$ and $x_2 = (\mathrm{sgn})_2 \beta^{a_2} b_2$ consists of the addition of the characteristics $a_1$ and $a_2$ and the multiplication of the mantissas $b_1$ and $b_2$. The sign of the product is determined by the rule of algebra and the product of mantissas is normalized if necessary. The product $b_1 b_2$ in general has $2t$ or $2t - 1$ digits, and therefore must be rounded or chopped to $t$ digits. Assume that rounding or chopping takes place after any necessary normalization shifts and let $[b_1 b_2]_t$ denote the normalized $t$-digit mantissa of $b_1 b_2$. Then it is clear that

$$[b_1 b_2]_t = b_1 b_2 + \delta \tag{16}$$
with $1/\beta \le b_1 b_2 < 1$ and

$$
\begin{aligned}
&-\beta^{-t}/2 < \delta < \beta^{-t}/2 && \text{for rounding},\\
&-\beta^{-t} < \delta \le 0 && \text{for chopping}.
\end{aligned} \tag{17}
$$
The relative error $\varepsilon$ defined by

$$\mathrm{fl}(x_1 \cdot x_2) = (x_1 \cdot x_2)(1 + \varepsilon) \tag{18}$$

is seen to relate to $\delta$ by

$$\varepsilon = \delta / b_1 b_2. \tag{19}$$
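Equations (16) to (19) can be illustrated on a single pair of mantissas. This is a hedged sketch in exact rational arithmetic; the function name `mul_mantissas`, its return convention, and the symbol `delta` for the absolute error of eq. (16) are assumptions made for illustration.

```python
from fractions import Fraction
import math

def mul_mantissas(b1, b2, t=6, beta=16, mode="round"):
    """Return ([b1*b2]_t, delta, eps) for normalized mantissas b1 and b2."""
    p = Fraction(b1) * Fraction(b2)
    while p < Fraction(1, beta):       # normalize the product so that 1/beta <= p < 1
        p *= beta
    scaled = p * beta**t
    n = math.floor(scaled + Fraction(1, 2)) if mode == "round" else math.floor(scaled)
    rounded = Fraction(n, beta**t)
    delta = rounded - p                # absolute error of eq. (16)
    return rounded, delta, delta / p   # eps = delta / (b1 b2), eq. (19)

b = Fraction(16**6 - 1, 16**6)         # largest six-digit hexadecimal mantissa
rounded, delta, eps = mul_mantissas(b, b)
print(rounded, delta, float(eps))
```

Dividing the absolute error by a mantissa that may be as small as $1/\beta$ is what stretches the range of $\varepsilon$ to $\beta$ times that of $\delta$.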
Since the error $\delta$ arises from a quantization operation, it is reasonable to assume that it is a random variable, independent of $b_1 b_2$ and uniformly distributed in its range, i.e. in $(-\beta^{-t}/2, \beta^{-t}/2)$ for rounding and in $(-\beta^{-t}, 0)$ for chopping [10]. Therefore the probability density function of the relative error $\varepsilon$ can be determined from those of $\delta$ and of $b_1 b_2$. On using a reciprocal density for the mantissa $b_1 b_2$ recently reported for floating-point numbers [4],

$$
f_{b_1 b_2}(u) =
\begin{cases}
(u \ln \beta)^{-1}, & 1/\beta \le u < 1,\\
0, & \text{otherwise},
\end{cases} \tag{20}
$$
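the probability density of the relative error $\varepsilon$ can be calculated to be

$$
f_\varepsilon(u)\big|_{\text{rounding}} =
\begin{cases}
0, & |u| > \beta^{-t+1}/2,\\
(\beta - 1)/(\beta^{-t+1} \ln \beta), & |u| < \beta^{-t}/2,\\
(\beta^{-t+1} - 2u)/(2u\,\beta^{-t+1} \ln \beta), & \beta^{-t}/2 < u < \beta^{-t+1}/2,\\
(-\beta^{-t+1} - 2u)/(2u\,\beta^{-t+1} \ln \beta), & -\beta^{-t+1}/2 < u < -\beta^{-t}/2;
\end{cases} \tag{21}
$$

$$
f_\varepsilon(u)\big|_{\text{chopping}} =
\begin{cases}
0, & u > 0 \text{ or } u < -\beta^{-t+1},\\
(\beta - 1)/(\beta^{-t+1} \ln \beta), & -\beta^{-t} < u < 0,\\
(-\beta^{-t+1} - u)/(u\,\beta^{-t+1} \ln \beta), & -\beta^{-t+1} < u < -\beta^{-t}.
\end{cases} \tag{22}
$$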
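The two densities translate directly into code, and a numerical integration gives a quick check that each one integrates to unity. This is a sketch under assumptions: the function names, the defaults $\beta = 16$ and $t = 6$, and the midpoint-rule helper `integrate` are illustrative choices, and the two tail branches of eq. (21) are merged by writing them in terms of $|u|$.

```python
import math

def density_rounding(u, beta=16, t=6):
    """Eq. (21): density of the relative error under rounding."""
    unit = float(beta) ** (1 - t)                 # beta**(-t+1)
    if abs(u) > unit / 2:
        return 0.0
    if abs(u) <= unit / (2 * beta):               # |u| <= beta**-t / 2: flat central part
        return (beta - 1) / (unit * math.log(beta))
    # Both tails of eq. (21) reduce to this expression in |u|.
    return (unit - 2 * abs(u)) / (2 * abs(u) * unit * math.log(beta))

def density_chopping(u, beta=16, t=6):
    """Eq. (22): density of the relative error under chopping."""
    unit = float(beta) ** (1 - t)
    if u > 0 or u < -unit:
        return 0.0
    if u >= -unit / beta:                         # -beta**-t <= u <= 0: flat part
        return (beta - 1) / (unit * math.log(beta))
    return (-unit - u) / (u * unit * math.log(beta))

def integrate(f, lo, hi, n=20000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * h) for i in range(n)) * h

unit = 16.0 ** -5
print(integrate(density_rounding, -unit / 2, unit / 2))
print(integrate(density_chopping, -unit, 0.0))
```

Both integrals should come out close to 1, which is a useful sanity check on the reconstructed branches and their ranges.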
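Equation (22) is plotted in Figure 2. The mean and mean squared values can be easily calculated:

$$E\{\varepsilon\}\big|_{\text{rounding}} = 0, \qquad E\{\varepsilon^2\}\big|_{\text{rounding}} = \beta^{-2t}(\beta^2 - 1)(24 \ln \beta)^{-1}, \tag{23}$$

$$E\{\varepsilon\}\big|_{\text{chopping}} = -\beta^{-t}(\beta - 1)(2 \ln \beta)^{-1}, \qquad E\{\varepsilon^2\}\big|_{\text{chopping}} = \beta^{-2t}(\beta^2 - 1)(6 \ln \beta)^{-1}. \tag{24}$$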
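The moments in eqs. (23) and (24) can be checked by sampling the model itself. The sketch below is a Monte Carlo illustration under the model's own assumptions (a uniform absolute error that is independent of a reciprocally distributed mantissa); it is not the experimental procedure of Section 4. The function names, the seed, and the sample size are arbitrary choices, and the chopping mean is taken as negative because the chopping error is never positive.

```python
import math
import random

def sample_relative_error(rng, beta=16, t=6, mode="round"):
    """Draw one relative error eps = delta / b under the model of eqs. (16) to (20)."""
    b = beta ** (-rng.random())                   # reciprocal density 1/(b ln beta) on (1/beta, 1]
    h = float(beta) ** (-t)
    delta = rng.uniform(-h / 2, h / 2) if mode == "round" else rng.uniform(-h, 0.0)
    return delta / b

def model_moments(beta=16, t=6, mode="round"):
    """Mean and mean square of eps from eqs. (23) and (24)."""
    ln = math.log(beta)
    if mode == "round":
        return 0.0, beta ** (-2 * t) * (beta**2 - 1) / (24 * ln)
    return -(beta ** (-t)) * (beta - 1) / (2 * ln), beta ** (-2 * t) * (beta**2 - 1) / (6 * ln)

rng = random.Random(1)
n = 200_000
for mode in ("round", "chop"):
    xs = [sample_relative_error(rng, mode=mode) for _ in range(n)]
    mean, mean_sq = sum(xs) / n, sum(x * x for x in xs) / n
    print(mode, mean, mean_sq, model_moments(mode=mode))
```

Drawing the mantissa as $\beta^{-U}$ with $U$ uniform on $[0, 1)$ is a standard way to sample the reciprocal density of eq. (20), since $-\log_\beta b$ is then uniform.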
(B) ADDITION ERROR. The floating-point addition using a double-precision accumulator is carried out in the same manner as that described in Section 2 with $g = t$. If $a_2 - a_1 > t$, the entire $x_1$ is lost and the relative error introduced is seen to be bounded by $\beta^{-t}$. Since this happens with rather small probability for moderately high $t$ [9], we need only consider the case $a_2 - a_1 \le t$. The relative error is given by eq. (9) of Section 2
FIG. 2. Probability density of multiplication roundoff error, double-precision accumulator, eq. (22) (chopping arithmetic, $\beta = 16$, $t = 6$). Axes: $f_\varepsilon(u)$ versus $u$, with $u$ in units of $16^{-6}$.
with $\varepsilon_1 = 0$. Unlike in multiplication, there is a significant probability that $\varepsilon_2 = 0$. This can happen when (1) $a_2 = a_1$ and $(\mathrm{sgn})_1 = -1$, or (2) $a_2 = a_1$ and $(\mathrm{sgn})_1 = +1$ but $L = 0$, or (3) $a_2 = a_1 + 1$ and $L = 1$. Because the probability of $a_2 = a_1$ increases for a larger radix $\beta$, the probability of $\varepsilon_2 = 0$ grows with increasing radix. Let $p_0$ denote the probability that $\varepsilon_2 = 0$. For the problems we have studied, $p_0$ is rather dependent on the data and the problem at hand. It is observed to range from about 0.3 to about 0.8, with the usual value around 0.55. Thus the probability density of $\varepsilon_2$ consists of two components, a delta function of strength $p_0$ at the origin and a uniform distribution over $(-\beta^{-t}/2, \beta^{-t}/2)$ or over $(-\beta^{-t}, 0)$, depending on whether rounding or chopping is used. Using again a reciprocal distribution for the mantissa, eq. (20), we find that the probability density of the relative error $\varepsilon$ is given by
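$$
f_\varepsilon(u)\big|_{\text{rounding}} =
\begin{cases}
0, & |u| > \beta^{-t+1}/2,\\
p_0\,\delta(u) + (1 - p_0)(\beta - 1)/(\beta^{-t+1} \ln \beta), & |u| \ldots\\
(1 - p_0)(\beta^{-t+1} - 2u)/(2u\,\beta^{-t+1} \ln \beta), & \ldots\\
(1 - p_0)(-\beta^{-t+1} - 2u)/(2u\,\beta^{-t+1} \ln \beta), & \ldots
\end{cases}
$$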