A Reference Implementation for Extended and Mixed Precision BLAS

James Demmel†   Xiaoye Li‡   David Bailey‡   Michael Martin†   Anil Kapur†   Jimmy Iskandar†

September 3, 1999
Abstract
This paper describes a C implementation of the proposed new BLAS Standard. Permitting mixtures of input/output types and precisions, as well as higher internal precision, the new BLAS Standard contains many more subroutines than the existing standard. We have developed an automated process for generating and systematically testing this large number of routines. We believe our methodology is applicable to other languages besides C. In particular, the algorithms used in our testing code should be valuable to other BLAS implementors.
1 Introduction

This library of routines is part of a reference implementation for the Dense and Banded BLAS routines, along with their Extended and Mixed Precision versions, as documented in Chapters 2 and 4 of the new BLAS Standard. The standard is obtainable from the following URL, to which we refer the reader for details: http://www.netlib.org/cgi-bin/checkout/blast/blast.pl
EXTENDED PRECISION is only used internally; the input and output arguments remain just Single or Double as in the existing BLAS. At present, we only allow Single, Double, or Extra internal precision. Extra precision is implemented as double-double precision (128-bit total, 106-bit significand).

* This research was supported in part by the National Science Foundation Cooperative Agreement No. ACI-9619020, NSF Grant No. ACI-9813362, the Department of Energy Grant Nos. DE-FG03-94ER25219 and DE-FC03-98ER25351, and gifts from the IBM Shared University Research Program, Sun Microsystems, and Intel. This project also utilized resources of the National Energy Research Scientific Computing Center (NERSC), which is supported by the Director, Office of Advanced Scientific Computing Research, Division of Mathematical, Information, and Computational Sciences of the U.S. Department of Energy under contract number DE-AC03-76SF00098. The information presented here does not necessarily reflect the position or the policy of the Government and no official endorsement should be inferred.
† Computer Science Division, University of California, Berkeley, CA 94720. [email protected], [email protected], [email protected], [email protected].
‡ NERSC, Lawrence Berkeley National Laboratory, 1 Cyclotron Rd, Berkeley, CA 94720. [email protected], [email protected].
We have designed all our routines assuming that single precision arithmetic is actually done in IEEE single precision (32 bits) and that double precision arithmetic is actually done in IEEE double precision (64 bits). (The routines also pass our tests on an Intel machine with 80-bit floating-point registers.)

MIXED PRECISION permits some input/output arguments to be of different types (mixing real and complex) or precisions (mixing single and double). There are two sets of subroutines for each BLAS function. One set has a calling sequence similar to the existing BLAS, but the arguments may mix types and precisions, and the internal precision is the same as the output precision. The calling sequence of the second set of subroutines contains an extra argument PREC at the end. PREC is a runtime variable specifying the internal precision to be used for the calculation.
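To make the two interfaces concrete, here is a hedged sketch of calling a double-precision dot product both ways. The routine names (BLAS_ddot, BLAS_ddot_x) and enumeration values (blas_no_conj, blas_prec_extra) follow the proposed C bindings, but the exact spellings and argument order should be checked against the standard document cited above; the header name is our assumption.

    #include "blas_extended.h"  /* assumed header name for this implementation */

    void dot_example(int n, const double *x, const double *y, double *r)
    {
        double alpha = 1.0, beta = 1.0;

        /* First set: calling sequence like the existing BLAS (the argument
           types may also be mixed, e.g. a double/single variant); the
           internal precision equals the output precision. */
        BLAS_ddot(blas_no_conj, n, alpha, x, 1, beta, y, 1, r);

        /* Second set: the trailing PREC argument selects the internal
           precision at runtime; blas_prec_extra requests double-double. */
        BLAS_ddot_x(blas_no_conj, n, alpha, x, 1, beta, y, 1, r,
                    blas_prec_extra);
    }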
2 Code Generation with the M4 Macro Processor

In the existing BLAS, there are usually 4 routines associated with each operation: all input, output, and internal variables are single or double precision, and real or complex. But under the new extended and mixed precision rules (see Chapter 4 of the standard for details), the input, output and internal variables may have different precisions and types. The combinations of all these types therefore result in many more routines associated with each operation. For example, DOT has 32 routines altogether: 4 "standard" versions (from Chapter 2) and 28 mixed and extended precision versions (from Chapter 4). In addition, the 16 versions with extended precision support up to three internal precisions that can be chosen at runtime. We have attempted to automate the code generation as much as possible, using the M4 [2] macro processor to facilitate this task.
2.1 Basic Arithmetic Operations

The idea is to define a macro for each fundamental arithmetic operation. The macro's argument list contains the variables, accompanied by their types and precisions. For example, for the operation c ← a + b, we define the following macro:

    ADD(c, c_type, a, a_type, b, b_type)

where x_type can be one of: real-single, real-double, real-extra, complex-single, complex-double, or complex-extra. Inside the macro body, we use an "if-test" on c_type, a_type and b_type to generate the appropriate code segment for "+". This is similar to operator overloading in C++, but we do it manually. All these if-tests are evaluated at macro-evaluation time, and do not appear in the executable code. Indeed, our goal was to produce efficient C code, which means minimizing branches in inner loops.
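As an illustration of what such a macro expands to, here is a minimal sketch (our own construction, not the literal macro output) of straight-line C code for ADD(c, real-extra, a, real-extra, b, real-double): adding a double into a double-double value with Knuth's two-sum, which recovers the exact rounding error of an addition using only ordinary double arithmetic.

    /* Sketch: c = a + b, where a and c are double-double pairs (hi + lo)
       and b is an ordinary double. No wider hardware format is needed. */
    static void dd_add_d(double a_hi, double a_lo, double b,
                         double *c_hi, double *c_lo)
    {
        double s   = a_hi + b;                      /* rounded sum */
        double bb  = s - a_hi;                      /* part of b absorbed in s */
        double err = (a_hi - (s - bb)) + (b - bb);  /* exact error (two-sum) */
        double t   = err + a_lo;                    /* fold in low-order word */
        *c_hi = s + t;                              /* renormalize */
        *c_lo = t - (*c_hi - s);
    }

Each type combination gets its own branch-free sequence like this at macro-expansion time, which is why the generated sources are so much larger than the macro files (Section 2.2).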
Other macros include SUB, MUL, DIV, DECLARE (variable declaration), ASSIGN, etc. Since these macros are shared among all the BLAS routines, we put them in a common header file, named cblas.m4.h, which has about 1000 lines of code.
2.2 BLAS Functions

Each BLAS routine also has its own macro file, such as dot.m4, spmv.m4 and gbmv.m4, to generate the specific functions. All the macro files are located in the m4/ subdirectory. Each macro file is structured in 3 levels. Take dot.m4 as an example.

The top level generates the 32 subroutines with the correct names, type statements, and switch statements based on PREC.

The middle level is the actual dot product algorithm. It is written exactly once, in such a way that it supports all combinations of types and precisions. It is about as long and complicated as a straightforward C implementation, with the difference that statements like prod = x[i]*y[i] are converted into M4 macro calls.

The bottom level consists of calls to the macros that perform the fundamental operations. For example, the inner loop of the M4 macro for the dot product is simply as follows (the M4 parameters $2, $3, and $4 are types):

    for (i = 0; i < n; ++i) {
        GET_VECTOR_ELEMENT(x_ii, x_i, ix, $2)  /* put ix-th element of vector x into x_ii */
        GET_VECTOR_ELEMENT(y_ii, y_i, iy, $3)  /* put iy-th element of vector y into y_ii */
        MUL(prod, $4, x_ii, $2, y_ii, $3)      /* prod = x[i]*y[i] */
        ADD(sum, $4, sum, $4, prod, $4)        /* sum = sum+prod */
        ix += incx;
        iy += incy;
    } /* endfor */
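For instance, in the all-real-double instantiation ($2 = $3 = $4 = real-double), the loop above would expand to plain C of roughly this shape (a sketch of the generated code's structure, not its literal text):

    for (i = 0; i < n; ++i) {
        x_ii = x_i[ix];        /* GET_VECTOR_ELEMENT */
        y_ii = y_i[iy];        /* GET_VECTOR_ELEMENT */
        prod = x_ii * y_ii;    /* MUL: all operands double, a plain multiply */
        sum  = sum + prod;     /* ADD: a plain add */
        ix += incx;
        iy += incy;
    }

When $4 is an extra-precision type, MUL and ADD instead expand to multi-instruction double-double sequences like the one sketched in Section 2.1.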
The motivation for this macro-based approach is simplified software engineering. For example, the file dot.m4 of M4 macros for the dot product is 365 lines long (220 non-comment lines) but expands into 9526 lines in 32 C subroutines implementing the different versions of DOT. Similarly, the macros for SPMV expand from 255 lines (198 non-comment lines) to 62044 lines in 32 C subroutines. (This does not count the shared M4 macros in the m4/ directory.)
3 Testing

The goal of the testing code is to validate the underlying implementation. The challenge is twofold. First, we must thoroughly test routines claiming to use extra precision internally, where the test code is not allowed to declare any extra precision variables or use any other extra precision facilities not available to the code being tested. This requires great care in generating test data. Second, we must use M4 to automatically generate the many versions of test code needed for the many versions of the code being tested.

For each BLAS routine, we perform the following steps in the test code:

1. Generate input scalars, vectors and arrays, according to the routine's specification, so that the result exposes the internal precision actually used.
2. Call the BLAS routine.
3. For each output, compute a "test ratio" of the computed error to the theoretical error bound, that is, |computed value − "true value"| / error bound.

In Step 1, we use a nested loop structure to loop over all the input parameters and vary all possible values for each parameter, generating a large number of test cases. By design, the test ratios computed in Step 3 should be bounded by 1. A large ratio indicates that the computed result is either completely wrong, or not as accurate as claimed in the specification. The following sections discuss how we generate "good" inputs in order to reveal the internal precisions actually implemented.
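In code, Step 3 is a single division of the following form (a sketch with our own variable names; the denominator is the right-hand side of the error bound (2) derived in the next section):

    /* Test ratio for one output: <= 1 means the result is as accurate as
       claimed. S and U are the scale and underflow terms of inequality (2);
       fabs is from <math.h>. */
    double bound = (n + 2) * (eps_int + eps_acc) * S + U + eps_out * fabs(r_acc);
    double ratio = fabs(r_comp - r_acc) / bound;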
3.1 Testing DOT
The DOT function computes

    r ← β·r_in + α·( Σ_{i=1..n} x_i·y_i )        (1)
By careful error analysis, we can derive the following error bound (see Appendix A for its derivation):

    |r_comp − r_acc| ≤ (n + 2)·(ε_int + ε_acc)·S + U + ε_out·|r_acc|        (2)

where

    r_comp = r computed by the routine being tested, with the accumulation done in internal precision ε_int, and then rounded to the output precision ε_out
    r_acc  = the computed result using precision ε_acc
    ε_int  = internal precision claimed by the routine being tested
    λ_int  = underflow threshold in the claimed internal precision
    ε_acc  = our most accurate precision (106 bits) used to compute r_acc
    λ_acc  = underflow threshold in our most accurate precision
    ε_out  = output precision
    λ_out  = underflow threshold in the output precision
    S      = |α|·Σ_{i=1..n} |x_i·y_i| + |β·r_in|
    U      = (|α|·n + 2)·(λ_int + λ_acc) + λ_out

The U term accommodates all possible underflows at various places. The returned "test ratio" is the ratio of the left side over the right side of inequality (2). This ratio should be at most 1, and often is rather less than 1, because the factor n in the error bound is pessimistic and √n is more typical if rounding is equally likely to be up or down. Table 1 lists the machine precisions and the underflow thresholds(1) in the IEEE floating-point standard [1].

                             Single     Double     Double-Double
    Machine precision ε      2^-24      2^-53      2^-106
    Underflow threshold λ    2^-126     2^-1022    2^-968

    Table 1: Some floating-point machine constants in the IEEE standard.

    (1) We assume "flush to zero" on underflow.

In order to estimate whether the implemented internal precision is at least as high as claimed by the routine, it is not sufficient to generate random inputs: for random inputs, the last term in the error bound (2) will very likely dominate, and wipe out the first term. Therefore our strategy is to make |r| (hence |r_acc|) much smaller than S, so that the ε_int term dominates the error bound. In our test generator, we choose α, β, r, x and y judiciously so as to cancel as many bits as possible when evaluating r.

If the routine being tested claims to use internal precision ε_int_claimed with underflow threshold λ_int_claimed, then the denominator of our test ratio will be the right-hand side of error bound (2) with ε_int = ε_int_claimed and λ_int = λ_int_claimed. The actual error is bounded by the same formula with ε_int = ε_int_actual and λ_int = λ_int_actual. Presuming that over many tests the actual error occasionally approaches its bound, our test ratio will be as large as

    ratio ≈ [(n + 2)·(ε_int_actual + ε_acc)·S + U_actual + ε_out·|r_acc|]
            / [(n + 2)·(ε_int_claimed + ε_acc)·S + U_claimed + ε_out·|r_acc|]        (3)

Since ε_acc ≤ min(ε_int_actual, ε_int_claimed), the U terms are typically much smaller than the other terms, and r_acc is also very small by our choice of data (Section 3.1.1), so the above ratio is roughly

    ratio ≈ ε_int_actual / ε_int_claimed        (4)

which we compare to 1. Therefore, a large ratio gives us a sense of the actual precision, assuming underflow does not dominate.
3.1.1 Generating input scalars and vectors

In the test code, we must generate three scalars α, β, r (r is overwritten by the inner product on output) and two vectors x and y. To make the inner product small in magnitude, we designed the following algorithm to generate these quantities. (Here, we assume that the inputs and output are in single precision, and the internal precision is double-double. The other situations are simpler than this.)

1. Choose n, and random α, β and x_i, i = 1...n.

2. Choose the leading y_i, i = 1...n−k+1, to add bits into the prefix sum Σ_{i=1..n−k+1} x_i·y_i, such that the sum contains at least 106 bits:
    y_1 = random()
    s = x_1 * y_1
    do j = 2, n − k + 1
        s = s * 2^-30       /* shift right 30 bits */
        y_j = s / x_j
    enddo
3. Choose the remaining y_i, i = n−k+2...n, and r, to cancel as many bits as possible:

    do j = n − k + 2, n
        s = Σ_{i=1..j−1} x_i * y_i   /* very accurately (*) */
        y_j = −s / x_j               /* s + x_j*y_j cancels the leading 24 bits of s */
    enddo
    s = Σ_{i=1..n} x_i * y_i         /* very accurately */
    r = −α·s / β                     /* β·r + α·s cancels the leading 24 bits of α·s */

The pseudorandom numbers generated are uniformly distributed in (0, 1). In summary, the first n−k+1 y_i's are chosen to add bits into the prefix sum, and the last k−1 y_i's, together with r, are chosen to cancel bits in the final sum. How big should k be, i.e., how many cancellations do we need? It depends on the relative sizes of ε_out and ε_acc. Basically,

    k = ⌈ log ε_acc / log ε_out ⌉        (5)
which is as large as ⌈106/24⌉ = 5 for single precision output and double-double accurate precision. This means that we cannot test dot products so simply with n < 6. For smaller n, we do not have sufficient freedom to add and cancel all 106 bits. Instead, we use algebraic identities to make the inner product small:

1. n = 1: Choose β = α at random, a ∈ [1/2, 1) at random with only 12 leading bits nonzero, x_1 = a + ε_out exactly, y_1 = a − ε_out exactly, and r = −a^2 exactly, so that the final result equals −α·ε_out^2 (a small standalone demonstration of this case follows this list).
2. n = 2: Choose β = α at random, r = x_1, and y_1 = −1, to make x_1·y_1 + r = 0 exactly. Choose y_2 such that the final result x_2·y_2 is much smaller than r in magnitude.
3. n = 3: Choose r = 0. Choose y_1 = −x_3 and y_3 = x_1 to make x_1·y_1 + x_3·y_3 = 0 exactly. Choose y_2 such that the final result x_2·y_2 is much smaller than x_1·y_1 in magnitude.
4. n = 4: Choose r = 0. Choose y_1 = −x_4, y_3 = 0, and y_4 = x_1 to make x_1·y_1 + x_3·y_3 + x_4·y_4 = 0 exactly. Choose y_2 such that the final result x_2·y_2 is much smaller than x_1·y_1 in magnitude.
5. n = 5: Choose r = 0. Choose y_1 = −x_5, y_2 = −x_4, y_4 = x_2, and y_5 = x_1 to make x_1·y_1 + x_2·y_2 + x_4·y_4 + x_5·y_5 = 0 exactly. Choose y_3 such that |x_3·y_3| is much smaller than max(|x_1·y_1|, |x_2·y_2|).
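As a sanity check on the n = 1 identity, here is a small self-contained C demonstration (our own illustration with concrete values, taking α = β = 1): every operation below is exact in the stated precision, so the true result −ε_out^2 is known in closed form without any extended-precision computation.

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        float eps_out = ldexpf(1.0f, -24);  /* single-precision epsilon 2^-24 */
        float a  = 0.75f;                   /* in [1/2,1), few leading bits set */
        float x1 = a + eps_out;             /* exact: fits in 24-bit significand */
        float y1 = a - eps_out;             /* exact */
        float r  = -a * a;                  /* exact: a has few nonzero bits */
        /* x1*y1 = a^2 - eps_out^2 exactly in double, so the sum collapses: */
        double result = (double)x1 * (double)y1 + (double)r;
        printf("result = %g, -eps_out^2 = %g\n",
               result, -(double)eps_out * eps_out);
        return 0;
    }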
In all the cases above, we have at least 2 summands available in the inner product (1). According to our design strategy, we can cancel from 24 to 106 leading bits in the final sum; that is, 10^-32 ≲ |r_acc| ≲ 10^-8. For some special inputs, however, where there is at most 1 term in the inner product, |r_acc| cannot be made very small, and is O(1). These include: n = 0; α = 0; and α ≠ 0, β = 0 with n ≤ 1. For these cases, any internal precision higher than the output precision cannot be revealed by our tests, and indeed any higher internal precision cannot change the answer by more than one ulp (possibly making the answer slightly less accurate!).
3.1.2 Obtaining an accurate r_acc

One possibility is to use an arbitrary precision package, such as MPFUN, but our goal is to make this test code self-contained, and to use no extra precision facilities not available to the code being tested. Since most of our routines can be reduced to a series of dot products, testing can be based on DOT; we only need a trusted dot routine. Currently, we use our own dot routine with double-double internal precision to compute r_acc, which is accurate to 106 bits. That is, ε_acc = 2^-106 and λ_acc = 2^-968. This implies that any internal precision higher than double-double cannot be detected, and may result in a tiny test ratio. A very tiny test ratio (such as zero) may also occur if the result happens to be computed exactly.

The careful reader may wonder how we can trust test code that uses our own accurate dot product to test itself, as well as all the other XBLAS. Some test cases are generated where r_acc is known exactly by a mathematical formula, because there is exact term-by-term cancellation, so for those cases we do not depend on our trusted dot product being correct. Indeed, our own trusted accurate dot product had to pass these other tests. These are precisely the cases with 1 ≤ n ≤ 5 in Section 3.1.1.
3.2 Testing SPMV and GBMV

The SPMV, GBMV and many other BLAS2 functions compute

    y ← β·y + α·A·x        (6)
where A is a symmetric or band matrix. Testing this is no more difficult than testing DOT, because each component of the output vector y is a dot product of the corresponding row of A with the vector x, and satisfies the error bound (2). So we can iterate over the n rows, generate each row of A using almost the same test generator as for DOT, and compute a test ratio for each component of the output y.

However, some modifications are needed in the test generator. Consider the case of symmetric A. For the first row of A, we can use exactly the same algorithm discussed in Section 3.1.1. For the second row, by symmetry, the (2,1) entry must be the same as the (1,2) entry, so we only have n − 1 free entries to choose; furthermore, the x vector is now fixed. For the third row of A, we only have n − 2 free entries to choose, and so on. To accommodate this need, our test generator is modified as follows. The inputs are the parameters already fixed by prior calls to the generator; the outputs are the free parameters chosen by this call. The generator first evaluates the partial sum using the fixed input parameters, and computes the number of bits B available in the partial sum. If B < 106, we add a few more terms to the partial sum so that B ≥ 106. Afterwards, we generate the remaining free parameters to cancel bits in the running sum. The way to add and cancel bits is the same as described in Section 3.1.1. This approach can be generalized to most other Level 2 and Level 3 BLAS, except for the triangular solve function.
3.3 Testing TRSV

It appears to be more difficult to test TRSV systematically by the same simple scheme described above. This is work in progress.
4 Remarks

We assume the underlying floating-point arithmetic conforms to the IEEE standard. We do not deal with any non-IEEE arithmetic, most notably the old Cray arithmetic. Though the algorithm was not designed for Intel machines with 80-bit floating-point registers, our tests indicate that it works on them. It is likely that the double precision code would work on an Intel machine with 80-bit floating-point registers provided that the rounding flag is set to round all floating-point results to double, but we have not confirmed this. The same comment applies to single precision on Intel machines: the rounding flag should be set to single.

Our code is meant to be a reference implementation, serving much the same purpose as the Fortran BLAS that appeared on Netlib. The generated code consists only of straightforward loops. We did not include any architectural optimizations, such as blocking for locality, loop unrolling, etc., but we believe our code runs about as fast as straightforward but careful code written by hand. Performance testing remains to be done.

Although we have not tried, we believe similarly structured macros could generate Fortran 77 and Fortran 95 code as well. However, we note that conventional operator overloading, where the implementation of each binary operation depends only on the argument types and not the result type, is not enough: it cannot implement Extra = Double × Double correctly. The alternative would be to promote one argument to the output type, but this could be unnecessarily slow if implemented poorly (e.g., Extra = Extra × Double is significantly more expensive than Extra = Double × Double); see the sketch below.
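To illustrate why the result type matters, here is a minimal sketch of the Extra = Double × Double operation (our illustration using Dekker's classical splitting, not necessarily the code the MUL macro emits): the exact 106-bit product of two doubles is produced as a double-double pair, something an overloaded "*" returning only a double cannot express.

    /* Sketch: exact product of two doubles as a double-double (hi, lo).
       Each operand is split into two halves whose partial products are
       exact; summing them recovers the rounding error of a*b. Assumes
       round-to-nearest and no overflow in the splitting step. */
    static void two_prod(double a, double b, double *hi, double *lo)
    {
        const double split = 134217729.0;   /* 2^27 + 1 (Dekker's constant) */
        double t, a1, a2, b1, b2;

        t  = split * a;                     /* split a into a1 + a2 */
        a1 = t - (t - a);
        a2 = a - a1;

        t  = split * b;                     /* split b into b1 + b2 */
        b1 = t - (t - b);
        b2 = b - b1;

        *hi = a * b;                        /* rounded product */
        *lo = ((a1 * b1 - *hi) + a1 * b2 + a2 * b1) + a2 * b2;  /* exact error */
    }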
5 Acknowledgements

We thank Prof. W. Kahan for constructive discussions on the design of the test code. Weihua Shen, Berkat Tung and Ben Wanzo contributed code in the early stages of development.
References

[1] IEEE standard for binary floating-point arithmetic (ANSI/IEEE Std 754-1985). IEEE, 1985.

[2] M4 macro processor. http://www.cs.utah.edu/csinfo/texinfo/m4/m4.html.
A Error bound for the inner product

To derive the error bound for the inner product (1), we use the following notation:

    ε_out   = output precision (e.g., 2^-24 in single)
    λ_out   = underflow threshold in the output precision
    ε_int   = internal precision claimed by the routine being tested
    λ_int   = underflow threshold in the claimed internal precision
    ε_acc   = our most accurate precision used to compute r (Section 3.1.2)
    λ_acc   = underflow threshold in our most accurate precision
    S       = |α|·Σ_{i=1..n} |x_i·y_i| + |β·r|
    r_truth = the correct answer in exact arithmetic
    r_comp  = r computed by the routine being tested, with the accumulation done in internal precision ε_int, and then rounded to the output precision ε_out
    r_acc   = the computed result using precision ε_acc

The usual error analysis says that the correctly computed and rounded result satisfies
    |r_comp − r_truth| ≤ (n + 2)·ε_int·S + (|α|·n + 2)·λ_int + ε_out·|r_truth| + λ_out        (7)
This takes into account all possible underflows in the accumulation and the final rounding. We cannot compare directly against this error bound, because we do not know r_truth. We only know r_acc, as described in Section 3.1.2, which satisfies
    |r_acc − r_truth| ≤ (n + 2)·ε_acc·S + (|α|·n + 2)·λ_acc        (8)
Applying the triangle inequality to (7) and (8), we get

    |r_comp − r_acc| ≤ (n + 2)·(ε_int + ε_acc)·S + (|α|·n + 2)·(λ_int + λ_acc) + λ_out + ε_out·|r_truth|        (9)

From error bound (8) we know
    |r_truth| ≤ |r_acc| + (n + 2)·ε_acc·S + (|α|·n + 2)·λ_acc        (10)

Substituting (10) into (9) and omitting the second-order terms (of order ε_out·ε_acc and ε_out·λ_acc), we obtain the final error bound that the test code can compute:
    |r_comp − r_acc| ≤ (n + 2)·(ε_int + ε_acc)·S + (|α|·n + 2)·(λ_int + λ_acc) + λ_out + ε_out·|r_acc|        (11)
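For completeness, the substitution step spelled out: plugging (10) into the ε_out·|r_truth| term of (9) gives

    |r_comp − r_acc| ≤ (n + 2)·(ε_int + ε_acc)·S + (|α|·n + 2)·(λ_int + λ_acc) + λ_out
                       + ε_out·|r_acc| + ε_out·(n + 2)·ε_acc·S + ε_out·(|α|·n + 2)·λ_acc

and the last two terms, of order ε_out·ε_acc and ε_out·λ_acc respectively, are the second-order terms omitted to arrive at (11).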