NOLTA, IEICE Paper
Fast quadruple-double floating point format

Naoya Yamanaka 1 a) and Shin'ichi Oishi 2

1 Research Institute for Science and Engineering, Waseda University
3-4-1 Okubo, Shinjuku, Tokyo 169-8555, Japan

2 Faculty of Science and Engineering, Waseda University
3-4-1 Okubo, Shinjuku, Tokyo 169-8555, Japan

a) [email protected]

Nonlinear Theory and Its Applications, IEICE, vol. 5, no. 1, pp. 15-34, (c) IEICE 2014. DOI: 10.1587/nolta.5.15
Received May 20, 2013; Revised September 27, 2013; Published January 1, 2014

Abstract: An efficient format and fast algorithms of basic operations for 4-fold working precision are proposed. The proposed format is an unevaluated sum of four double precision numbers, capable of representing at least 203 bits of mantissa. Hence it is slightly less accurate than the quad-double format proposed by Hida et al. [1]; however, the presented algorithms based on this format are faster than the quad-double algorithms. Numerical experiments show that the proposed algorithms are efficient.

Key Words: high precision computation, quadruple floating point arithmetic
1. Introduction

In a numerical calculation we sometimes need higher than double precision floating point arithmetic to get a confident result. One alternative is to rewrite the program to use a software package implementing arbitrary-precision extended floating-point arithmetic, such as MPFR [2] or exflib [3], and to try to choose a suitable precision. There are possibilities intermediate between the largest hardware floating point format and general arbitrary precision software, which combine a considerable amount of extra precision with a relatively modest factor of loss of speed. Such an approach (e.g. the double-double/quad-double format [1], the triple-double format [4] and ARPREC [5]) is to store numbers in a multiple-component format, where a number is expressed as an unevaluated sum of ordinary floating point words, each with its own mantissa and exponent. In this paper we present an efficient format and describe fast algorithms for basic operations. The proposed format is also an unevaluated sum of four double precision numbers, and it is capable of representing at least 203 bits of mantissa (50 bits × 3 + 53 bits). The proposed format and algorithms require nothing other than the double precision format and its operations as defined by the IEEE 754 Standard [6].
1.1 Previous work

We collect some notes on previous work. In the following we assume the computer arithmetic to satisfy the IEEE 754 standard [6]. Constructing a high precision floating point arithmetic is a basic task in numerical analysis, so there are several algorithms for it, among them [1, 4, 8]. These approaches are based on techniques called error free transformations. Denote a, b ∈ F and ◦ ∈ {+, −, ·}. Furthermore, we denote by fl(·) the result of floating-point operations in rounding to nearest corresponding to the IEEE 754 arithmetic standard. It is known that the error of a floating-point operation ◦ is itself a floating-point number:

  x = fl(a ◦ b)  ⟹  x + y = a ◦ b with y ∈ F. (1)

We list these algorithms in Table I, and the symbols for these algorithms are shown in Fig. 1 (the symbols for the double precision sum and product are also included there).

Table I. Algorithms of error free transformations.

  TwoSum         Error-free transformation of the sum of two floating-point numbers [7].
  Split          Error-free transformation of the split of a floating-point number [8].
  TwoProduct     Error-free transformation of the product of two floating-point numbers [8].
  TwoProductFMA  Error-free transformation of the product using Fused-Multiply-and-Add [9].
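For concreteness, here is a minimal C++ sketch of two of these transformations, TwoSum [7] and TwoProductFMA [9], assuming IEEE 754 double arithmetic with rounding to nearest; the function names are ours, not the paper's.

  #include <cmath>
  #include <utility>

  // TwoSum [7]: returns (x, y) with x = fl(a + b) and x + y = a + b exactly.
  std::pair<double, double> two_sum(double a, double b) {
      double x = a + b;
      double z = x - a;
      double y = (a - (x - z)) + (b - z);
      return {x, y};
  }

  // TwoProductFMA [9]: returns (x, y) with x = fl(a * b) and x + y = a * b exactly;
  // the fused multiply-add yields the exact residual of the product.
  std::pair<double, double> two_product_fma(double a, double b) {
      double x = a * b;
      double y = std::fma(a, b, -x);
      return {x, y};
  }

Such code must be compiled without value-changing optimizations (e.g. without -ffast-math); otherwise the compiler may simplify the residual computations away.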
In 1971, Dekker gave an application of error free transformations to obtain algorithms for doubled length floating point numbers [8]. He defined a doubled length floating point number a as a pair (a0, a1) satisfying

  |a1| ≤ Cu |a0 + a1|, (2)

where u denotes the machine epsilon (see Section 2) and C is some constant not much larger than 1. The value of the number a is equal to a0 + a1. In 1991, based on the doubled length floating point number, Priest gave algorithms for exact addition and multiplication and arbitrarily accurate division of arbitrary precision numbers using only fixed precision floating point arithmetic operations [10]. In 1997, Shewchuk proposed methods which are closely related to the methods developed by Priest, but are faster [11]. The improvement in speed arises partly because Priest's algorithms run on a wide variety of floating-point architectures, with different radices and rounding behavior, whereas Shewchuk's algorithms are limited to, and optimized for, radix 2. The ARPREC package developed by Bailey et al. uses arrays of double floating point numbers to represent high precision floating point arithmetic [5]. This software is based in part on the Fortran 90 MPFUN package [12], which is a sophisticated portable multiprecision library that uses digits of machine-dependent radix stored as single precision floating-point values. Recently, Lauter proposed a format for calculating with three double precision floating point numbers, called the triple-double format [4], and Hida et al. proposed a format for four double precision floating point numbers, called the quad-double format [1]. Lauter defined the form of a triple-double number a as a triplet (a1, a2, a3) ∈ F such that

  a = a1 + a2 + a3. (3)
Similar to a triple-double number, a quad-double number is an unevaluated sum of four double precision numbers. The quad-double number (a0, a1, a2, a3) represents the exact sum

  a = a0 + a1 + a2 + a3. (4)

Fig. 1. From left, the symbols of the IEEE double precision sum and product operators, TwoSum and TwoProduct.
Let ulp(x) ("unit in the last place" of a real number x) denote the distance between the two closest straddling double floating-point numbers. Note that for any given number x, there can be many representations as an unevaluated sum of four doubles. Hence we require the quadruple (a0, a1, a2, a3) to satisfy

  |a_{i+1}| ≤ (1/2) ulp(a_i) (5)

for i = 0, 1, 2, with equality occurring only if a_i = 0 or the last bit of a_i is 0 (that is, round-to-even is used in case of ties). Note that the first double a0 is a double-precision approximation to the quad-double number a, accurate to almost half an ulp. Thus the quadruple is capable of representing at least 212 bits of mantissa. (The sign bits play the role of an additional bit of the mantissa.) The algorithms proposed by Hida et al. are specialized to the double format, so they can be made faster than arbitrary precision software of the same precision.

Here, we introduce the multiplication algorithm of the quad-double format. Let a = (a0, a1, a2, a3) and b = (b0, b1, b2, b3) be two quad-double numbers. Assume (without loss of generality) that a and b are of order 1. After multiplication, we need to accumulate the 10 terms of order O(u^3) or larger:

  a × b ≈ a0b0 + (a0b1 + a1b0) + (a0b2 + a1b1 + a2b0) + (a0b3 + a1b2 + a2b1 + a3b0),

where a0b0 is the O(1) term, a0b1 + a1b0 are the O(u) terms, a0b2 + a1b1 + a2b0 are the O(u^2) terms, and a0b3 + a1b2 + a2b1 + a3b0 are the O(u^3) terms. Note that terms of smaller order (such as a1b3, which is O(u^4)) are not even computed. For i + j ≤ 3, let (pij, qij) = TwoProduct(ai, bj). Then pij = O(u^{i+j}) and qij = O(u^{i+j+1}). Now there is one term (p00) of order O(1), three (p01, p10, q00) of order O(u), five (p02, p11, p20, q01, q10) of order O(u^2), and seven of order O(u^3). We can then accumulate all the terms by their order, starting with the O(u) terms (see Fig. 2). In the diagram, there are three different summation boxes. The first (topmost) one is Three-Sum, and the next two are, respectively, Six-Three-Sum (which sums six doubles and outputs the first three components) and Nine-Two-Sum (which sums nine doubles and outputs the first two components). Six-Three-Sum computes the sum of six doubles to three doubles' worth of accuracy (i.e., to a relative error of O(u^3)). This is done by dividing the inputs into two groups of three and performing Three-Sum on each group; then the two sums are added together, in a manner similar to quad-double addition [1]. Nine-Two-Sum computes the sum of nine doubles to double-double accuracy [1]. This is done by pairing the inputs to create four double-double numbers and a single double precision number, and performing addition of two double-double numbers recursively until one arrives at a double-double output. See Fig. 3.

Fig. 2. Accumulation phase in the multiplication algorithm of quad-double.

Fig. 3. Three-Sum, Six-Three-Sum and Nine-Two-Sum.
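For orientation, one common realization of the Three-Sum box chains TwoSum calls; the following C++ sketch reuses the two_sum helper from the sketch above. It follows the general description in [1] and is illustrative, not the authors' exact code.

  #include <array>

  // Three-Sum sketch: returns (r0, r1, r2) with r0 + r1 + r2 = a + b + c exactly,
  // components ordered by decreasing magnitude, built from three TwoSum calls.
  std::array<double, 3> three_sum(double a, double b, double c) {
      auto [t1, t2] = two_sum(a, b);    // t1 + t2 == a + b
      auto [r0, t3] = two_sum(t1, c);   // r0 + t3 == t1 + c
      auto [r1, r2] = two_sum(t2, t3);  // r1 + r2 == t2 + t3
      return {r0, r1, r2};
  }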
1.2 Our approach and goal

In this paper, we propose an efficient format and describe fast algorithms for basic operations, similar to the quad-double format. The proposed format is also an unevaluated sum of four double precision numbers, but it is capable of representing at least 203 bits of mantissa (50 bits × 3 + 53 bits). Hence it is slightly less accurate than the quad-double format; in exchange, some algorithms of the proposed format are designed to permit Three-Sum, Six-Three-Sum and Nine-Two-Sum to be replaced by double floating point additions (see Section 3). As a result, the proposed algorithms for the format are faster than those for the quad-double format. For example, the proposed multiplication algorithm is about 1.7 times faster than that of the quad-double format. Numerical experiments show that the proposed algorithms are efficient compared with quad-double, MPFR and exflib.
2. Preliminaries

2.1 Notation and basic properties of floating point arithmetic

Floating point encodings and functionality are defined in the IEEE 754 Standard [6], last revised in 2008. Goldberg [13] gives a good introduction to floating point and many of the issues that arise. The standard mandates that binary floating point data of double format be encoded in three fields: a one-bit sign field, followed by an 11-bit field encoding the exponent offset by a numeric bias specific to each format, and 52 bits encoding the mantissa. Thus, double floating point numbers have the form ±mantissa × 2^exponent, where the mantissa is a number m0.m1···m52 with mi ∈ {0, 1}, and the exponent is an integer such that −1022 ≤ e ≤ 1023. F denotes the set of floating point numbers according to the IEEE 754 arithmetic standard. The other elements of F are +∞, −∞, and NaN (Not a Number, used for invalid operations). F contains normalized and subnormal numbers. A normalized number is a number with −1022 ≤ e ≤ 1023 and m0 ≠ 0. A subnormal number is a number with e = −1023 and m0 = 0. In this paper, we ignore the possibility of overflow and underflow. Therefore the phrase "provided no overflow or underflow occurs" should be appended to each result stated below. However, we stress that our intent is merely to demonstrate the feasibility of the proposed format and algorithms up to the limits imposed by overflow and underflow. The set of floating-point numbers which have an m-bit mantissa is denoted by F_m, and the relative rounding error unit, half the distance from 1.0 to the next larger floating-point number in F_m, is denoted by u_m. For IEEE 754 double precision, u = 2^{−53}. Furthermore, F_{i,j} denotes the set of floating point numbers represented by

  a = ±( Σ_{k=0}^{min(52, e−i)} d_k / 2^k ) × 2^e, (6)

for i ≤ e ≤ j.
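As a small illustration of the three-field encoding described at the start of this subsection, the following C++ sketch decodes the sign, exponent and mantissa fields of a double; the helper name is ours.

  #include <cstdint>
  #include <cstdio>
  #include <cstring>

  // Decode the IEEE 754 double fields: 1 sign bit, 11 exponent bits
  // (bias 1023), 52 mantissa bits. memcpy avoids aliasing problems.
  void print_fields(double x) {
      uint64_t bits;
      std::memcpy(&bits, &x, sizeof bits);
      unsigned sign     = (unsigned)(bits >> 63);
      int exponent      = (int)((bits >> 52) & 0x7FF) - 1023;  // unbiased
      uint64_t mantissa = bits & 0xFFFFFFFFFFFFFull;
      std::printf("sign=%u exponent=%d mantissa=0x%013llx\n",
                  sign, exponent, (unsigned long long)mantissa);
  }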
The IEEE 754 standard requires support for a handful of operations. These include the arithmetic operations add, subtract, multiply, divide, square root, and so on [6]. The results of these operations are guaranteed to be the same for all implementations of the standard, for a given format and rounding mode. A consequence of the specifications for the arithmetic operations given by the IEEE 754 standard is the following: if a ◦ b is neither a subnormal number nor an infinity nor a NaN, then

  |a ◦ b − fl(a ◦ b)| ≤ u |a ◦ b|, (7)

with rounding to nearest (even).
Furthermore, the next lemma by Sterbenz is a well known result about accuracy.

Lemma 1 (Sterbenz [14]) Let two floating-point numbers a, b ∈ F of the same sign be not too far apart. More precisely, for a, b ≥ 0 we have

  (1/2) a ≤ b ≤ 2a  ⟹  fl(b − a) = b − a. (8)

Let ufp ("unit in the first place") be defined by

  0 ≠ r ∈ R  ⟹  ufp(r) := 2^{⌊log2 |r|⌋}, (9)

where we set ufp(0) := 0 [15]. Furthermore, we define the floating-point predecessor and successor of a real number r by

  pred_m(r) := max {f ∈ F_m | f < r}, (10)
  succ_m(r) := min {f ∈ F_m | r < f}. (11)
A floating-point number f is called an (m-bit) faithful rounding of a real number r if there is no other floating-point number between f and r [15]. It follows that f = r in the case when r ∈ F_m.

Definition 1 A floating-point number f ∈ F_m is called a faithful rounding of a real number r ∈ R if

  pred_m(f) < r < succ_m(f). (12)
We denote this by f ∈ faithful_m(r). For r ∈ F_m this implies f = r. Rump et al. have extended Definition 1 to a sequence a0, ···, a_{k−1} [16].

Definition 2 A sequence a0, ···, a_{k−1} ∈ F_m is called an (m-bit and k-fold) faithful rounding of s ∈ R if

  a_i ∈ faithful_m( s − Σ_{v=0}^{i−1} a_v )  for 0 ≤ i ≤ k − 1. (13)

Furthermore, for an integer n, numbers a0, ···, a_{k−2} ∈ F_m and a_{k−1} ∈ F_n, the sequence a0, ···, a_{k−1} is also called an (m-bit, n-bit and k-fold) faithful rounding of s if

  a_i ∈ faithful_m( s − Σ_{v=0}^{i−1} a_v )  for 0 ≤ i ≤ k − 2,  and  a_{k−1} ∈ faithful_n( s − Σ_{v=0}^{k−2} a_v ). (14)
Then the following lemmas hold:

Lemma 2 (Rump et al. [16]) Suppose the sequence a0, ···, a_{k−2} (∈ F_m), a_{k−1} (∈ F_n) is an m-bit, n-bit and k-fold faithful rounding of some s ∈ R. Then, for 1 ≤ i ≤ k − 1,

  |a_i| ≤ 2 u_m ufp(a_{i−1}). (15)

Lemma 3 (Rump et al. [16]) If a sequence a0, ···, a_{k−2} (∈ F_m), a_{k−1} (∈ F_n) is an m-bit, n-bit and k-fold faithful rounding of some s ∈ R, then

  | s − Σ_{v=0}^{k−1} a_v | < 2 u_m^k ufp(s),   | s − Σ_{v=0}^{k−1} a_v | < 2 u_m^k ufp(a0). (16)
2.2 Error Free Transformations

Extracting the high order part: In addition to the Split algorithm, there is another splitting algorithm, called ExtractScalar, which extracts the high order part of a floating point number. The floating-point number is split relative to σ, a fixed power of 2.

Algorithm 1 Error-free transformation extracting the high order part.
function [q, p′] = ExtractScalar(σ, p)
  q = fl((σ + p) − σ)
  p′ = fl(p − q)
end

The following lemma about the result of Algorithm 1 holds:

Lemma 4 (Rump et al. [15]) Let q and p′ be the results of Algorithm 1 applied to floating-point numbers σ and p. Assume σ = 2^k ∈ F for some k ∈ Z, and assume |p| ≤ 2^{−M} σ for some 0 ≤ M ∈ N. Then

  p = q + p′,   |p′| ≤ uσ,   |q| ≤ 2^{−M} σ. (17)

In Split, a 53-bit floating-point number is split into two parts relative to its exponent and, using sign bits, both the high and the low part have at most 26 significant bits in the mantissa. In ExtractScalar, a floating-point number is split relative to σ, a fixed power of 2. The higher and the lower part of the splitting may have between 0 and 53 significant bits, depending on σ.
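A direct C++ transcription of Algorithm 1 as a sketch; σ must be a power of two with |p| bounded as in Lemma 4, and the expressions must not be reassociated by the compiler. The struct and names are ours.

  // ExtractScalar (Algorithm 1): splits p relative to the power of two sigma.
  // Afterwards q + p_low == p, with q the high-order part of p.
  struct Extracted { double q; double p_low; };

  Extracted extract_scalar(double sigma, double p) {
      double q = (sigma + p) - sigma;  // rounds away the low-order bits of p
      double p_low = p - q;            // exact by Lemma 4
      return {q, p_low};
  }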
3. Proposed Method

In this section, we present a faster format than the quad-double format. Calculations based on the quad-double format are very fast in terms of computational time compared to other formats, e.g. Shewchuk's algorithm, MPFR and exflib [2, 3, 11]. However, the accumulation phase in the multiplication algorithm of quad-double format (Fig. 2) is a bit complicated, because it needs to take care of results of several kinds of magnitude. As a result, the algorithm uses the three complicated summation boxes Three-Sum, Six-Three-Sum and Nine-Two-Sum (Fig. 3). The main idea in constructing a faster format is to permit these summation boxes to be replaced by double floating point additions. The accumulation phase in the multiplication algorithm of the proposed format is shown in Fig. 4. In the figure, the symbol ⊕ means a floating point addition over all input values, and Renormalization means the algorithm expressed in Algorithm 2. To achieve this, we need to consider a format in which some of the floating point numbers are stored with fewer than 53 bits. In addition, the input values of each symbol ⊕ need to be of the same order, because each result of the double floating point additions has to be expressible as a double floating point number.
Fig. 4. Accumulation phase in the multiplication algorithm of the proposed format.
3.1 Format

The proposed format is an unevaluated sum of three floating-point numbers in F_50 and one floating point number in F. For a0, a1, a2 ∈ F_50 and a3 ∈ F, the number (a0, a1, a2, a3) represents the exact sum a = a0 + a1 + a2 + a3. We then require the quadruple (a0, a1, a2, a3) to satisfy

  a0 ∈ faithful_50(a), (18)
  a1 ∈ faithful_100(a) − a0, (19)
  a2 ∈ faithful_150(a) − a0 − a1, (20)
  |a3| ≤ 2 u_50 ufp(a2). (21)
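As a sketch, the proposed quadruple can be held in a plain C++ struct of four doubles; the type itself cannot enforce the constraints (18)-(21), which are maintained by the renormalization step. This representation is ours, not the paper's source.

  // Sketch of the proposed format: a0, a1, a2 are kept in F50 (at most 50
  // significant bits) by Renormalize; a3 is an ordinary 53-bit double.
  struct quad {
      double a0, a1, a2, a3;
      // Rounded double approximation of the represented value a0+a1+a2+a3.
      double to_double() const { return ((a3 + a2) + a1) + a0; }
  };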
3.2 Addition and subtraction

Algorithm 4 Proposed addition algorithm
function [b0, b1, b2, b3] = addition(x0, x1, x2, x3, y0, y1, y2, y3)
  n = epart(x0) − epart(y0)
  switch n
    case n > 203
      b0 = x0; b1 = x1; b2 = x2; b3 = x3; return;
    case 150 < n ≤ 203
      a0 = x0; a1 = x1; a2 = x2; a3 = fl(x3 + y0);
    case 100 < n ≤ 150
      [a0, a1, a2, a3] = add3(x0, x1, x2, x3, y0, y1, y2, y3);
    case 50 < n ≤ 100
      [a0, a1, a2, a3] = add2(x0, x1, x2, x3, y0, y1, y2, y3);
    case 1 < n ≤ 50
      [a0, a1, a2, a3] = add1(x0, x1, x2, x3, y0, y1, y2, y3);
    case −1 ≤ n ≤ 1
      [a0, a1, a2, a3] = add0(x0, x1, x2, x3, y0, y1, y2, y3);
    case −50 ≤ n < −1
      [a0, a1, a2, a3] = add1(y0, y1, y2, y3, x0, x1, x2, x3);
    case −100 ≤ n < −50
      [a0, a1, a2, a3] = add2(y0, y1, y2, y3, x0, x1, x2, x3);
    case −150 ≤ n < −100
      [a0, a1, a2, a3] = add3(y0, y1, y2, y3, x0, x1, x2, x3);
    case −203 ≤ n < −150
      a0 = y0; a1 = y1; a2 = y2; a3 = fl(y3 + x0);
    case n < −203
      b0 = y0; b1 = y1; b2 = y2; b3 = y3; return;
  end
  [b0, b1, b2, b3] = Renormalize(a0, a1, a2, a3)
end
Fig. 7. Accumulation phase in the addition algorithm of the proposed format.
Remark 2 The integer n denotes the difference of the exponent parts of the two numbers. When calculating n, it is desirable to avoid the binary logarithm. For example, the frexp function in math.h, or the epart function shown below, are better ways to get the exponent part.

  union doubleint { double d; unsigned long long int i; };

  inline int epart(const double &x) {
      union doubleint u = {x};
      return (int)((u.i >> 52) & 0x7FF) - 1023;
  }

In Algorithm 4, we call the algorithms shown in Algorithm 5-Algorithm 8. These algorithms are selected depending on the exponent difference of the two input numbers. For example, if n in Algorithm 4 satisfies 100 < |n| ≤ 150, then Algorithm 5 is called. This branching is the main point of Algorithm 4, i.e. the elimination of unnecessary calculations is done in Algorithm 5-Algorithm 8. In Algorithm 8, there are many branches that check whether the floating point numbers are zero or not. This part is inescapable because heavy cancellation may occur in the addition algorithm when the orders of the two input numbers are almost the same, and the leading floating point number of the proposed format must not be zero if the number itself is not zero. From the point of view of computational cost, Algorithm 4 may seem expensive because there are many branches. But if an array of function pointers is available, such branches can be handled quickly.

Algorithm 5 Proposed addition algorithm for the case of 100 < |n| ≤ 150 in Algorithm 4 (add3)
function [a0, a1, a2, a3] = add3(x0, x1, x2, x3, y0, y1, y2, y3)
  a0 = x0; a1 = x1;
  [s, t] = ExtractScalar(2^{53−150} · ufp(a0), y0);
  a2 = fl(x2 + s);
  a3 = fl(x3 + t + y1);
end

Algorithm 6 Proposed addition algorithm for the case of 50 < |n| ≤ 100 in Algorithm 4 (add2)
function [a0, a1, a2, a3] = add2(x0, x1, x2, x3, y0, y1, y2, y3)
  a0 = x0;
  [s, t] = ExtractScalar(2^{53−100} · ufp(a0), y0);
  [u, v] = ExtractScalar(2^{53−150} · ufp(a0), y1);
  a1 = fl(x1 + s);
  a2 = fl(x2 + t + u);
  a3 = fl(x3 + v + y2);
end

Algorithm 7 Proposed addition algorithm for the case of 1 < |n| ≤ 50 in Algorithm 4 (add1)
function [a0, a1, a2, a3] = add1(x0, x1, x2, x3, y0, y1, y2, y3)
  [s, t] = ExtractScalar(2^{53−50} · ufp(x0), y0);
  [u, v] = ExtractScalar(2^{53−100} · ufp(x0), y1);
  [w, z] = ExtractScalar(2^{53−150} · ufp(x0), y2);
  a0 = fl(x0 + s);
  a1 = fl(x1 + t + u);
  a2 = fl(x2 + v + w);
  a3 = fl(x3 + z + y3);
end

Algorithm 8 Proposed addition algorithm for the case of |n| ≤ 1 in Algorithm 4 (add0)
function [a0, a1, a2, a3] = add0(x0, x1, x2, x3, y0, y1, y2, y3)
  a0 = fl(x0 + y0); a1 = fl(x1 + y1); a2 = fl(x2 + y2); a3 = fl(x3 + y3);
  if a0 == 0
    if a1 == 0
      if a2 == 0
        if a3 ≠ 0
          a0 = a3; a1 = a2 = a3 = 0.0;
        end
      else
        a0 = a2; a1 = a3; a2 = a3 = 0.0;
      end
    else
      a0 = a1; a1 = a2; a2 = a3; a3 = 0.0;
    end
  end
end

We show an example of this technique in the C language. The first program below can be replaced by the second.

  #include <stdio.h>
  int main(void) {
      int n;
      scanf("%d", &n);
      switch (n) {
          case 0: printf("A\n"); break;
          case 1: printf("B\n"); break;
      }
      return 0;
  }

  #include <stdio.h>
  void func0(void) { printf("A\n"); }
  void func1(void) { printf("B\n"); }
  void (*Switch[2])(void) = {func0, func1};
  int main(void) {
      int n;
      scanf("%d", &n);
      Switch[n]();
      return 0;
  }
Similar to this example, if we set 4091 function pointers for all n into an array (n is the difference of the exponent parts, so its range is −2045 ≤ n ≤ 2045), then Algorithm 4 works, and the computational speed is faster than that of an algorithm using a switch statement, as sketched below. Here we present a theorem about the error of the addition algorithm.

Theorem 3 The maximum error of Algorithm 4 is 9 · 2^{−203} (ufp(x0) + ufp(y0) + ufp(b0)).

The proof of this theorem can be found in the Appendix. The subtraction operator is implemented as the addition a + (−b), so it has the same algorithm and properties as addition.
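A compact C++ sketch of the dispatch mentioned above, assuming the epart function of Remark 2. The qd type and the kernel names are illustrative, and only two representative kernels are shown; the remaining slots would be filled with the branches of Algorithm 4.

  // Function-pointer dispatch for Algorithm 4: one slot per exponent
  // difference n in [-2045, 2045], i.e. 4091 slots, indexed by n + 2045.
  struct qd { double c[4]; };
  typedef qd (*AddKernel)(const qd&, const qd&);

  static qd add_copy_x(const qd& x, const qd&) { return x; }  // n > 203: y is negligible
  static qd add_tail(const qd& x, const qd& y) {              // 150 < n <= 203
      return { x.c[0], x.c[1], x.c[2], x.c[3] + y.c[0] };
  }

  static AddKernel dispatch_table[4091];

  static void init_dispatch() {
      for (int n = -2045; n <= 2045; ++n) {
          AddKernel k = add_copy_x;  // placeholder; real code sets every case
          if (150 < n && n <= 203) k = add_tail;
          // ... the remaining branches of Algorithm 4 go here ...
          dispatch_table[n + 2045] = k;
      }
  }

  static qd qd_add(const qd& x, const qd& y, int n) {  // n = epart(x0) - epart(y0)
      return dispatch_table[n + 2045](x, y);
  }

Filling the table once at start-up turns the large case analysis into a single indexed call.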
3.3 Multiplication and division

The main policy in constructing a multiplication algorithm is the replacement of the summation boxes in the multiplication algorithm of quad-double format by double floating point additions, as we mentioned at the opening of this section (see Fig. 4). In this subsection we present the details of the proposed multiplication algorithm. First, to achieve the proposed format, we need to modify the multiplication algorithm TwoProduct for m-bit floating point numbers. For a pair of m-bit floating point numbers (x, y) with x, y ∈ F_m, Algorithm 9 outputs a new pair (a, b) with a, b ∈ F_m satisfying

  x × y = a + b,   |b| ≤ u_m |a|. (33)
Algorithm 9 m-bit TwoProduct (27 < m < 53)
function [a, b] = TwoProduct_m(x, y)
  z = fl(x · y)
  [x1, x2] = Split(x)
  [y1, y2] = Split(y)
  w = fl(x2 · y2 − (((z − x1 · y1) − x2 · y1) − x1 · y2))
  [a, v] = ExtractScalar(2^{53−m} · ufp(z), z)
  b = fl(w + v)
end

Based on Algorithm 9 and the policy already discussed in Section 3.1, we propose a multiplication algorithm.

Algorithm 10 Proposed multiplication algorithm
function [a0, a1, a2, a3] = multiplication(x0, x1, x2, x3, y0, y1, y2, y3)
  [z0, b0] = TwoProduct_50(x0, y0)
  [b1, c0] = TwoProduct_50(x0, y1)
  [b2, c1] = TwoProduct_50(x1, y0)
  z1 = fl(b0 + b1 + b2)
  [c2, d0] = TwoProduct_50(x0, y2)
  [c3, d1] = TwoProduct_50(x1, y1)
  [c4, d2] = TwoProduct_50(x2, y0)
  z2 = fl(c0 + c1 + c2 + c3 + c4)
  z3 = fl(d0 + d1 + d2 + x0 · y3 + x1 · y2 + x2 · y1 + x3 · y0)
  [a0, a1, a2, a3] = Renormalize(z0, z1, z2, z3)
end

Here, we give two theorems about the errors of Algorithm 9 and Algorithm 10.

Theorem 4 For numbers x, y ∈ F_m, the result of Algorithm 9 satisfies (33).

Proof. Since x, y ∈ F_m, we have xy ∈ F_{2m}. Since |v| ≤ 2^{−m} · ufp(z) (from Lemma 5) and w ∈ F_{2m−53}, it follows that fl(w + v) = w + v. Using Lemma 5 again completes the proof.

Theorem 5 The upper bound of the error of Algorithm 10 is 274 · 2^{−203} · ufp(a0).

The proof of this theorem can be found in the Appendix. The division algorithm of the proposed format is constructed from the results of the multiplication algorithm. Let x = (x0, x1, x2, x3) and y = (y0, y1, y2, y3) be two numbers of the proposed format. We first compute an approximate quotient z0 = x0/y0. We then compute the remainder r0 = x − z0 × y and the correction term z1 = r0/y0. We can continue this process to obtain four terms z0, z1, z2 and z3.
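As an illustrative variant, Algorithm 9 can be transcribed to C++ using the FMA-based product transformation from Table I in place of the Split-based one, reusing the ufp and extract_scalar sketches given earlier; this is a sketch for exposition, not the paper's implementation.

  #include <cmath>

  // m-bit TwoProduct sketch (cf. Algorithm 9), using std::fma for the exact
  // product residual. Requires 27 < m < 53; a gets at most m bits, a + b == x*y.
  void two_product_m(double x, double y, int m, double& a, double& b) {
      double z = x * y;               // z = fl(x * y)
      double w = std::fma(x, y, -z);  // exact: z + w == x * y
      Extracted e = extract_scalar(std::ldexp(ufp(z), 53 - m), z);
      a = e.q;                        // high part of z with at most m bits
      b = w + e.p_low;                // fl(w + v), exact by Theorem 4's argument
  }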
4. Numerical results

In this section we present timings of the proposed algorithms. We tested the computational times in the following environment: Intel Core i7, 1.8 GHz, Mac OS X 10.8.2, 32 GB memory. All proposed algorithms were tested in C++. All operators of the proposed format are implemented by operator overloading, and all other functions which need to be prepared are implemented as inline functions. Furthermore, the function epart shown in Remark 2 is used to get the exponent part, and the array of function pointers is used in the addition and subtraction operators as we mentioned in 3.2. All floating point operations are done in IEEE 754 double precision. We use the compiler g++ 4.7.2 with the -O2 option. Moreover, the extra compile options -msse2 -mfpmath=sse have to be used [19] to avoid the 80-bit double extended precision format with a 64-bit mantissa in TwoSum and TwoProduct.

Algorithm 11 Proposed division algorithm
function [a0, a1, a2, a3] = division(x0, x1, x2, x3, y0, y1, y2, y3)
  z0 = fl(x0/y0)
  [z0, e] = ExtractScalar(2^{53−50} · ufp(z0), z0)
  [t0, t1, t2, t3] = multiplication(y0, y1, y2, y3, z0, 0, 0, 0)
  [r0, r1, r2, r3] = addition(x0, x1, x2, x3, −t0, −t1, −t2, −t3)
  z1 = fl(r0/y0)
  [z1, e] = ExtractScalar(2^{53−50} · ufp(z1), z1)
  [t0, t1, t2, t3] = multiplication(y0, y1, y2, y3, z0, z1, 0, 0)
  [r0, r1, r2, r3] = addition(x0, x1, x2, x3, −t0, −t1, −t2, −t3)
  z2 = fl(r0/y0)
  [z2, e] = ExtractScalar(2^{53−50} · ufp(z2), z2)
  [t0, t1, t2, t3] = multiplication(y0, y1, y2, y3, z0, z1, z2, 0)
  [r0, r1, r2, r3] = addition(x0, x1, x2, x3, −t0, −t1, −t2, −t3)
  z3 = fl(r0/y0)
  [a0, a1, a2, a3] = Renormalize(z0, z1, z2, z3)
end

We show the numerical results in terms of computational costs for the multiplication algorithms in Table II, and the results for the four basic operation algorithms are summarized in Table III. All calls are designed to execute in a pipelined way.

Remark 3 We made random 203-bit numbers for the tests using the Boost C++ Libraries [20]. Furthermore, to measure the computing time, we used the class cpu_timer in that software.

Remark 4 The proposed format has a 203-bit mantissa, and quad-double has a 212-bit mantissa. We set a 203-bit mantissa in the MPFR software and a 63-digit mantissa in the exflib software, since 2^{−203} ≈ 7.77 · 10^{−62}. (This software needs to be set by digits.)

Table II. Measured computing time (sec) and measured computing time per bit (sec/bit) for 10 million calls of each algorithm.

                     106 bit              159 bit              203 bit              212 bit
Double-double [1]    0.58 (5.4 × 10^−3)
Triple-double [4]                         1.63 (1.0 × 10^−2)
Proposed format                                                2.70 (1.3 × 10^−2)
Quad-double [1]                                                                     4.59 (2.0 × 10^−2)
exflib [3]           4.05 (3.8 × 10^−2)   5.05 (3.1 × 10^−2)   6.23 (3.0 × 10^−2)   6.50 (3.0 × 10^−2)
MPFR [2]             28.3 (2.6 × 10^−1)   31.1 (1.9 × 10^−1)   31.1 (1.5 × 10^−1)   32.4 (1.5 × 10^−1)

Table III. Measured computing time (sec) and time ratio for 10 million calls of each algorithm.

                     +/−            ×             ÷
Proposed format      0.98 (1)       2.70 (1)      8.76 (1)
Quad-double [1]      0.95 (0.9)     4.59 (1.7)    15.3 (1.7)
exflib [3]           2.84 (2.8)     6.23 (2.3)    16.3 (1.8)
MPFR [2]             23.0 (23.)     31.1 (11.)    36.0 (4.1)
Up to 212 bits, we can see that the algorithms which store numbers in a multiple-component format (double-double, triple-double, the proposed format and quad-double) are faster than exflib and MPFR. However, the measured computing time per bit of the multiple-component formats becomes worse as the precision increases. This means that the algorithms of exflib and MPFR incur a larger overhead to prepare the format, so their measured computing times per bit look worse at low precision. Next, we focus on the results of the proposed format. As mentioned, the proposed format is similar to the quad-double format, and our proposed algorithms are based on the algorithms of the quad-double format. Despite the many branches used in the proposed addition algorithm, the results for the proposed format and the quad-double format are almost the same, because the array of function pointers is used. By the replacement, the three summation boxes of quad-double format are changed to double floating point summations, and therefore the proposed multiplication algorithm is 1.7 times faster. Since the division algorithm of the proposed format is constructed from the results of the multiplication algorithm, the division algorithm delivers a similar result: the proposed algorithm is faster than that of the quad-double format.
5. Concluding remarks

We presented algorithms and the performance of basic operations on the proposed format. The proposed format is an unevaluated sum of four double precision numbers, similar to quad-double arithmetic, but it is capable of representing at least 203 bits of mantissa (50 bits × 3 + 53 bits). The main idea of this format is to permit the three summation boxes in the multiplication algorithm of quad-double format to be replaced by double floating point additions. In other words, to achieve this idea we considered a format in which some of the floating point numbers are stored with fewer than 53 bits, and the input values of the floating point additions need to be of the same order. Furthermore, we proposed fast algorithms for the basic operations of the proposed format. In the addition algorithm of quad-double format, we can see that there are some calculations that no longer need to be done. To remove these unnecessary calculations and to construct a faster algorithm for the proposed format, we presented an algorithm with many branches, unlike the addition algorithm of quad-double format. Since the proposed format stores its mantissa as 203 contiguous bits, it is easy to construct an efficient algorithm once the exponent difference of the two input numbers is known. Finally, from the numerical results, we can see that the proposed algorithms for the proposed format are faster than quad-double format, exflib and MPFR. For example, the proposed multiplication algorithm is about 1.7 times faster than that of quad-double format.
Acknowledgments This paper is a part of the outcome of research performed under a Waseda University Grant for Special Research Projects (Project number: 2012A-508) and this work was supported by JSPS KAKENHI Grant Number 24700015.
Appendix A. Proof of Theorem 1

Theorem 1 If a sequence x0, ···, x3 satisfying

  x0 ∈ F,   x1 ∈ F_{e−47:e−99},   x2 ∈ F_{e−97:e−149},   x3 ∈ F_{e−147:e−199}, (A-1)

is given, where e denotes log2(ufp(x0)), then the result a0, a1, a2, a3 of Algorithm 2 satisfies

  x0 + x1 + x2 + x3 = a0 + a1 + a2 + a3. (A-2)

Proof. Let x = x0 + x1 + x2 + x3. Using a0, the error |x − a0| is

  |x − a0| = |x0 + x1 + x2 + x3 − a0| = |(s0 + t0) + x2 + x3 − a0| (A-3)
  = |(a0 + b0) + t0 + x2 + x3 − a0| = |b0 + t0 + x2 + x3|. (A-4)
From Lemma 5,

  |b0| ≤ u · 2^{53−50} ufp(s0) ≤ u_50 ufp(a0), (A-5)
  |t0| ≤ u |s0| ≤ u ufp(a0), (A-6)
  |x2| ≤ 2^{−96} ufp(x0) ≤ 2^{−94} ufp(a0), (A-7)
  |x3| ≤ 2^{−146} ufp(x0) ≤ 2^{−144} ufp(a0), (A-8)

thus

  |x − a0| = |b0 + t0 + x2 + x3| < 2 u_50 ufp(a0). (A-9)

Here, ufp(s0) may be different from ufp(a0). This difference can happen when a carry bit occurs in ExtractScalar. If there is no carry bit in the calculation of ExtractScalar(2^{53−50} · ufp(s0), s0), then

  a0 ∈ F_{es:es−49}, (A-10)

where es = log2(ufp(s0)). If a carry bit occurs, then a0 = 2 ufp(s0). In both cases, a0 lies in F_50. Thus a0 is a faithful rounding of x. From the definition of TwoSum and Lemma 5, x0 + x1 = s0 + t0 = a0 + b0 + t0. By (18), (19) and a0 ∈ F_50,

  b0 + t0 = (x0 + x1) − a0 ∈ F_{ex−48:ex−99} ⊂ F, (A-11)

where ex = log2(ufp(x0)), so fl(b0 + t0) = b0 + t0 is proved. In the same way, from a1 ∈ F_50 and (22),

  b1 + t1 = (x0 + x1 + x2) − (a0 + a1) ∈ F_{ex−98:ex−149} ⊂ F, (A-12)

and from a2 ∈ F_50 and (23),

  b2 + t2 = (x0 + x1 + x2 + x3) − (a0 + a1 + a2) ∈ F_{ex−148:ex−199} ⊂ F. (A-13)

Thus fl(bi + ti) = bi + ti is proved for 0 ≤ i ≤ 2. Using this result,

  x0 + x1 + x2 + x3 = (s0 + t0) + x2 + x3 = (a0 + b0 + t0) + x2 + x3 (A-14)
  = a0 + (s1 + t1) + x3 = a0 + (a1 + b1 + t1) + x3 (A-15)
  = a0 + a1 + (s2 + t2) = a0 + a1 + (a2 + b2 + t2) = a0 + a1 + a2 + a3, (A-16)

and the desired equation follows.
B. Proof of Theorem 2

Theorem 2 If a sequence x0, ···, x3 satisfying

  x0 ∈ F,   x1 ∈ F_{e−47:e−99},   x2 ∈ F_{e−97:e−149},   x3 ∈ F_{e−147:e−199} ∩ F, (B-1)

is given, where e denotes log2(ufp(x0)), then the result sequence a0, a1, a2, a3 of Algorithm 2 satisfies the format (18)-(21), and the maximum error of Algorithm 2 is

  |x0 + x1 + x2 + x3 − (a0 + a1 + a2 + a3)| ≤ 2 · 2^{−203} · ufp(a0). (B-2)

Proof. By the proof of Theorem 1, a0, a1, a2 ∈ F_50 and

  fl(b0 + t0) = b0 + t0   and   fl(b1 + t1) = b1 + t1. (B-3)

From the definitions of TwoSum and ExtractScalar,

  |e| := |x0 + x1 + x2 + x3 − (a0 + a1 + a2 + a3)| = |b2 + t2 − a3|. (B-4)

Since |b2| ≤ 2^{−150} · ufp(a0) and |t2| ≤ 2^{−150} · ufp(a0),

  |e| = |b2 + t2 − a3| = |b2 + t2 − fl(b2 + t2)| ≤ u (|b2| + |t2|) (B-5)
  ≤ u · 2 · 2^{−150} · ufp(a0) = 2 · 2^{−203} · ufp(a0), (B-6)

and the desired inequality follows.
C. Proof of Theorem 3

Theorem 3 The maximum error of Algorithm 4 is 9 · 2^{−203} (ufp(x0) + ufp(y0) + ufp(b0)).

Proof. We consider the cases separately. In this proof we treat the case of positive n as an example.

(I) |n| > 203. In this case, there is no computation, so no error occurs.

(II) 150 < |n| ≤ 203. In this case, an error occurs in the calculation of a3:

  |e1| ≤ u |x3| ≤ u · u_50^3 · ufp(x0) = 2^{−203} · ufp(x0). (C-1)

(III) 100 < |n| ≤ 150. In this case, an error occurs in the calculation of a3. (a2 is error free because fl(x2 + s) = x2 + s; see the proof of Theorem 1 for details.)

  |e1| = |a3 − (x3 + t + y1)| ≤ 3u (|x3| + |t| + |y1|) (C-2)
  ≤ 3u ( u_50^3 · ufp(x0) + u · 2^{53−150} · ufp(x0) + u_50 · ufp(y0) ) (C-3)
  ≤ 3u ( u_50^3 · ufp(x0) + u · 2^{53−150} · ufp(x0) + u_50^3 · ufp(x0) ) ≤ 9 · 2^{−203} · ufp(x0). (C-4)

(IV) 50 < |n| ≤ 100. In this case, an error occurs in the calculation of a3:

  |e1| = |a3 − (x3 + t + y2)| ≤ 3u (|x3| + |t| + |y2|) (C-5)
  ≤ 3u ( u_50^3 · ufp(x0) + u · 2^{53−150} · ufp(x0) + u_50^2 · ufp(y0) ) (C-6)
  ≤ 3u ( u_50^3 · ufp(x0) + u · 2^{53−150} · ufp(x0) + u_50^3 · ufp(x0) ) ≤ 9 · 2^{−203} · ufp(x0). (C-7)

(V) 1 < |n| ≤ 50. In this case, an error occurs in the calculation of a3:

  |e1| = |a3 − (x3 + t + y3)| ≤ 3u (|x3| + |t| + |y3|) (C-8)
  ≤ 3u ( u_50^3 · ufp(x0) + u · 2^{53−150} · ufp(x0) + u_50^3 · ufp(y0) ) (C-9)
  ≤ 3u ( u_50^3 · ufp(x0) + u · 2^{53−150} · ufp(x0) + u_50^3 · ufp(x0) ) ≤ 9 · 2^{−203} · ufp(x0). (C-10)

(VI) |n| ≤ 1. In this case, an error occurs in the calculation of a3:

  |e1| = |a3 − (x3 + y3)| ≤ 2u (|x3| + |y3|) ≤ 2u ( u_50^3 ufp(x0) + u_50^3 · ufp(y0) ) (C-11)
  = 2 · 2^{−203} (ufp(x0) + ufp(y0)). (C-12)

In all cases, |e1| ≤ 9 · 2^{−203} (ufp(x0) + ufp(y0)). In addition, we need to consider the error e2 of Algorithm 2:

  |e| ≤ |e1| + |e2| ≤ 9 · 2^{−203} (ufp(x0) + ufp(y0)) + 2 · 2^{−203} · ufp(b0) (C-13)
  ≤ 9 · 2^{−203} (ufp(x0) + ufp(y0) + ufp(b0)). (C-14)

This completes the proof.
D. Proof of Theorem 5

Theorem 5 The upper bound of the error of Algorithm 10 is 274 · 2^{−203} · ufp(a0).

Proof. First we consider the error of the steps before Renormalize:

  |e1| = |(x0 + x1 + x2 + x3)(y0 + y1 + y2 + y3) − (z0 + z1 + z2 + z3)| (D-1)
  = |x0y0 + x0y1 + x1y0 + x0y2 + x1y1 + x2y0 + x0y3 + x1y2 + x2y1 + x3y0 (D-2)
    + x1y3 + x2y2 + x3y1 + x2y3 + x3y2 + x3y3 − (z0 + z1 + z2 + z3)| (D-3)
  = |z0 + b0 + b1 + c0 + b2 + c1 + c2 + d0 + c3 + d1 + c4 + d2 + x0y3 + x1y2 + x2y1 + x3y0
    + x1y3 + x2y2 + x3y1 + x2y3 + x3y2 + x3y3 − (z0 + z1 + z2 + z3)|. (D-4)

From the definition of x, y and Lemma 5,

  |x1y3| < 2 u_50 · ufp(x0) · 2 u_50^3 · ufp(y0) = 4 u_50^4 · ufp(x0) · ufp(y0), (D-5)
  |x2y2| < 2 u_50^2 · ufp(x0) · 2 u_50^2 · ufp(y0) = 4 u_50^4 · ufp(x0) · ufp(y0), (D-6)
  |x3y1| < 2 u_50^3 · ufp(x0) · 2 u_50 · ufp(y0) = 4 u_50^4 · ufp(x0) · ufp(y0), (D-7)
  |x2y3| < 2 u_50^2 · ufp(x0) · 2 u_50^3 · ufp(y0) = 4 u_50^5 · ufp(x0) · ufp(y0), (D-8)
  |x3y2| < 2 u_50^3 · ufp(x0) · 2 u_50^2 · ufp(y0) = 4 u_50^5 · ufp(x0) · ufp(y0), (D-9)
  |x3y3| < 2 u_50^3 · ufp(x0) · 2 u_50^3 · ufp(y0) = 4 u_50^6 · ufp(x0) · ufp(y0), (D-10)

so the upper bound of the omitted part is as follows:

  |eo| := |x1y3 + x2y2 + x3y1 + x2y3 + x3y2 + x3y3| < 13 · u_50^4 · ufp(x0) · ufp(y0). (D-11)

Next, from the feature of the error free transformations, when e denotes log2(ufp(z0)) we have

  bi ∈ F_{e−49:e−99}   for 0 ≤ i ≤ 2, (D-12)
  ci ∈ F_{e−99:e−149}  for 0 ≤ i ≤ 4. (D-13)

Since the bi are of the same order, and likewise the ci, we have

  z1 = fl(b0 + b1 + b2) = b0 + b1 + b2, (D-14)
  z2 = fl(c0 + c1 + c2 + c3 + c4) = c0 + c1 + c2 + c3 + c4, (D-15)

thus

  |e1| ≤ |d0 + d1 + d2 + x0y3 + x1y2 + x2y1 + x3y0 − z3| + |eo| (D-16)

holds. From the definition of x, y we have

  |d0| ≤ u_50 |x0y2| < u_50 · 2 · ufp(x0) · 2 u_50^2 · ufp(y0) = 4 u_50^3 · ufp(x0) · ufp(y0), (D-17)
  |d1| ≤ u_50 |x1y1| < u_50 · 2 u_50 · ufp(x0) · 2 u_50 · ufp(y0) = 4 u_50^3 · ufp(x0) · ufp(y0), (D-18)
  |d2| ≤ u_50 |x2y0| < u_50 · 2 u_50^2 · ufp(x0) · 2 · ufp(y0) = 4 u_50^3 · ufp(x0) · ufp(y0), (D-19)
  |x0y3| < 2 · ufp(x0) · 2 u_50^3 · ufp(y0) = 4 u_50^3 · ufp(x0) · ufp(y0), (D-20)
  |x1y2| < 2 u_50 · ufp(x0) · 2 u_50^2 · ufp(y0) = 4 u_50^3 · ufp(x0) · ufp(y0), (D-21)
  |x2y1| < 2 u_50^2 · ufp(x0) · 2 u_50 · ufp(y0) = 4 u_50^3 · ufp(x0) · ufp(y0), (D-22)
  |x3y0| < 2 u_50^3 · ufp(x0) · 2 · ufp(y0) = 4 u_50^3 · ufp(x0) · ufp(y0), (D-23)

thus

  |e1| ≤ |d0 + d1 + d2 + x0y3 + x1y2 + x2y1 + x3y0 − z3| + |eo| (D-24)
  ≤ 8u · 4 u_50^3 · ufp(x0) · ufp(y0) + |eo| (D-25)
  < 32 · 2^{−203} · ufp(x0) · ufp(y0) + 13 · u_50^4 · ufp(x0) · ufp(y0) (D-26)
  = 136 · 2^{−203} · ufp(x0) · ufp(y0). (D-27)

Finally, we need to consider the error e2 of Algorithm 2. Combined with the statements above,

  |e| ≤ |e1| + |e2| ≤ 136 · 2^{−203} · ufp(x0) · ufp(y0) + 2 · 2^{−203} · ufp(a0) (D-28)
  ≤ 136 · 2^{−203} · |z0| + 2 · 2^{−203} · ufp(a0) (D-29)
  ≤ 274 · 2^{−203} · ufp(a0). (D-30)

This completes the proof.
References
[1] Y. Hida, X.S. Li, and D.H. Bailey, "Quad-double arithmetic: algorithms, implementation, and application," Report LBL-46996, October 2000.
[2] "The GNU MPFR Library," http://www.mpfr.org/
[3] "exflib - extend precision floating-point arithmetic library," http://www-an.acs.i.kyoto-u.ac.jp/~fujiwara/exflib/
[4] C.Q. Lauter, "Basic building blocks for a triple-double intermediate format," Research report RR-5702, INRIA, September 2005.
[5] D.H. Bailey, Y. Hida, X.S. Li, and B. Thompson, "ARPREC: an arbitrary precision computation package," Lawrence Berkeley National Laboratory, Berkeley, CA 94720, 2002.
[6] "IEEE Standard for floating-point arithmetic," IEEE Std 754-2008, 2008.
[7] D.E. Knuth, "The art of computer programming: seminumerical algorithms," vol. 2, Addison-Wesley, Reading, Massachusetts, 1969.
[8] T.J. Dekker, "A floating-point technique for extending the available precision," Numer. Math., vol. 18, pp. 224-242, 1971.
[9] T. Ogita, S.M. Rump, and S. Oishi, "Accurate sum and dot product," SIAM J. Sci. Comput., vol. 26, no. 6, pp. 1955-1988, 2005.
[10] D.M. Priest, "Algorithms for arbitrary precision floating point arithmetic," Proc. 10th Symposium on Computer Arithmetic, pp. 132-143, IEEE Computer Society Press, 1991.
[11] J.R. Shewchuk, "Adaptive precision floating-point arithmetic and fast robust geometric predicates," Discrete & Computational Geometry, vol. 18, no. 3, pp. 305-363, 1997.
[12] D.H. Bailey, "MPFUN: A portable high performance multiprecision package," RNR Technical Report RNR-90-022 (2nd ed.), NASA Ames Research Center, Moffett Field, CA, 1992.
[13] D. Goldberg, "What every computer scientist should know about floating-point arithmetic," ACM Comput. Surveys, vol. 23, no. 1, pp. 5-47, 1991.
[14] P.H. Sterbenz, "Floating-point computation," Prentice-Hall, 1973.
[15] S.M. Rump, T. Ogita, and S. Oishi, "Accurate floating-point summation part I: Faithful rounding," SIAM J. Sci. Comput., vol. 31, no. 1, pp. 189-224, 2008.
[16] S.M. Rump, T. Ogita, and S. Oishi, "Accurate floating-point summation part II: Sign, K-fold faithful and rounding to nearest," SIAM J. Sci. Comput., vol. 31, no. 2, pp. 1269-1302, 2008.
[17] S.M. Rump, "Error estimation of floating-point summation and dot product," BIT Numerical Mathematics, vol. 52, no. 1, pp. 201-220, 2012.
[18] D.H. Bailey, "A Fortran-90 double-double library," http://www.nersc.gov/~dhbailey/mpdist/mpdist.html
[19] Intel Corporation, "Intel C++ Intrinsics Reference," Document Number 312482-003US.
[20] "Boost C++ Libraries," http://www.boost.org/