Multiprecision Algorithms
Nick Higham, School of Mathematics, The University of Manchester
http://www.maths.manchester.ac.uk/~higham   @nhigham   nickhigham.wordpress.com
Samuel D. Conte Distinguished Lecture, Department of Computer Science, Purdue University, March 2018
Download this talk from http://bit.ly/conte18
Conte and De Boor
Early textbook (1980; reissued 2018) giving a componentwise backward error analysis of LU factorization: the computed solution of Ax = b satisfies

  (A + ΔA)x̂ = b,   |ΔA| ≤ u_n |A| + u_n (3 + u_n) |L̂||Û|,

where u_n = 1.01nu and |A| = (|a_ij|).
Outline

Multiprecision arithmetic: floating-point arithmetic supporting multiple precisions.
  Low precision: half, or less.
  High precision: quadruple, possibly arbitrary.
How to exploit different precisions to achieve faster algorithms with higher accuracy.

Download this talk from http://bit.ly/conte18
IEEE Standard 754-1985 and 2008 Revision

  Type        Size       Range       u = 2^-t
  half        16 bits    10^±5       2^-11  ≈ 4.9 × 10^-4
  single      32 bits    10^±38      2^-24  ≈ 6.0 × 10^-8
  double      64 bits    10^±308     2^-53  ≈ 1.1 × 10^-16
  quadruple   128 bits   10^±4932    2^-113 ≈ 9.6 × 10^-35

Arithmetic ops (+, −, ∗, /, √) performed as if first calculated to infinite precision, then rounded.
Default: round to nearest, round to even in case of a tie.
Half precision is a storage format only.
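A quick MATLAB check of the unit roundoffs in the table, as a minimal sketch (u is half the spacing of floating-point numbers just above 1, i.e. eps/2; the half precision value is written out from its 11-bit significand since base MATLAB has no fp16 eps):

  % Unit roundoffs from the table; eps(class) is the spacing above 1, so u = eps/2.
  u_half   = 2^(-11)            % 4.8828e-04 (fp16: 11 significand bits incl. the implicit bit)
  u_single = eps('single')/2    % 5.9605e-08
  u_double = eps('double')/2    % 1.1102e-16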
Model for Rounding Error Analysis

For x, y ∈ F,

  fl(x op y) = (x op y)(1 + δ),   |δ| ≤ u,   op = +, −, ∗, /.

Also holds for op = √.

Model is weaker than fl(x op y) being correctly rounded.
Precision versus Accuracy

  fl(abc) = ab(1 + δ1) · c(1 + δ2)
          = abc(1 + δ1)(1 + δ2)
          ≈ abc(1 + δ1 + δ2),   |δi| ≤ u.

Precision = u. Accuracy ≈ 2u.

Accuracy is not limited by precision.
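A minimal MATLAB sketch of this point, with single precision playing the role of the working precision and double precision as the reference (the particular values of a, b, c are arbitrary):

  % Form a*b*c in single precision; two roundings give a relative error of order 2u.
  a = single(1/3); b = single(7/11); c = single(13/9);
  ref = (1/3)*(7/11)*(13/9);                      % reference product in double
  rel_err = abs(double(a*b*c) - ref)/abs(ref)     % typically a small multiple of u
  u = eps('single')/2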
ARM NEON
NVIDIA Tesla P100 (2016), V100 (2017)
"The Tesla P100 is the world's first accelerator built for deep learning, and has native hardware ISA support for FP16 arithmetic."
V100 tensor cores do a 4 × 4 matrix multiply in one clock cycle.

  TFLOPS   double   single   half/tensor
  P100      4.7      9.3      18.7
  V100      7        14       112
AMD Radeon Instinct MI25 GPU (2017)
"24.6 TFLOPS FP16 or 12.3 TFLOPS FP32 peak GPU compute performance on a single board . . . Up to 82 GFLOPS/watt FP16 or 41 GFLOPS/watt FP32 peak GPU compute performance."
Machine Learning

"For machine learning as well as for certain image processing and signal processing applications, more data at lower precision actually yields better results with certain algorithms than a smaller amount of more precise data."

"We find that very low precision is sufficient not just for running trained networks but also for training them." — Courbariaux, Bengio & David (2015)

We're solving the wrong problem (Scheinberg, 2016), so we don't need an accurate solution.
Low precision provides regularization.
Low precision encourages flat minima to be found.
Deep Learning for Java
Climate Modelling

T. Palmer, More reliable forecasts with less precise computations: a fast-track route to cloud-resolved weather and climate simulators?, Phil. Trans. R. Soc. A, 2014:
  Is there merit in representing variables at sufficiently high wavenumbers using half or even quarter precision floating-point numbers?
T. Palmer, Build imprecise supercomputers, Nature, 2015.
Fp16 for Communication Reduction

ResNet-50 training on ImageNet.
Solved in 60 mins on 256 TESLA P100s at Facebook (2017).
Solved in 15 mins on 1024 TESLA P100s at Preferred Networks, Inc. (2017) using ChainerMN (Takuya Akiba, SIAM PP18):
"While computation was generally done in single precision, in order to reduce the communication overhead during all-reduce operations, we used half-precision floats . . . In our preliminary experiments, we observed that the effect from using half-precision in communication on the final model accuracy was relatively small."
Preconditioning with Adaptive Precision

Anzt, Dongarra, Flegar, H & Quintana-Ortí (2018):
For sparse A and an iterative Ax = b solver, execution time and energy are dominated by data movement.
Block Jacobi preconditioning: D = diag(D_i), where D_i = A_ii. Solve D^{-1}Ax = D^{-1}b.
All computations are at fp64.
Compute D^{-1} and store D_i^{-1} in fp16, fp32 or fp64, depending on κ(D_i).
Simulations and energy modelling show this can outperform a fixed precision preconditioner. (A sketch of the block-precision selection follows.)
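A minimal MATLAB sketch of the precision-selection idea, not the authors' implementation: the function name, the block interface and the κ thresholds (10^2, 10^6) are illustrative assumptions, and the fp16 branch assumes the half type (Fixed-Point Designer) is available.

  function [Dinv, prec] = adaptive_block_jacobi(A, blk)
  % ADAPTIVE_BLOCK_JACOBI  Invert each diagonal block at fp64, then choose a
  % storage precision for the inverse from its condition number.
  % blk holds the starting index of each diagonal block of A.
  n = size(A,1); blk = [blk(:)', n+1];
  Dinv = cell(1, numel(blk)-1); prec = cell(1, numel(blk)-1);
  for i = 1:numel(blk)-1
      idx = blk(i):blk(i+1)-1;
      Di = full(A(idx, idx));
      Bi = inv(Di);                         % computed at fp64
      kappa = cond(Di, inf);
      if kappa < 1e2
          Dinv{i} = half(Bi);   prec{i} = 'fp16';   % requires Fixed-Point Designer
      elseif kappa < 1e6
          Dinv{i} = single(Bi); prec{i} = 'fp32';
      else
          Dinv{i} = Bi;         prec{i} = 'fp64';
      end
  end
  end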
Error Analysis in Low Precision

For the inner product x^T y of n-vectors the standard error bound is

  |fl(x^T y) − x^T y| ≤ nu |x|^T |y| + O(u^2).

In half precision, u ≈ 4.9 × 10^-4, so nu = 1 for n = 2048.

What happens when nu > 1?
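The bound is easy to check numerically; a minimal MATLAB sketch in single precision (u ≈ 6 × 10^-8), since base MATLAB has no fp16 arithmetic — in fp16 the same experiment reaches nu = 1 already at n = 2048:

  % Compare the actual inner product error with the bound n*u*|x|'*|y|.
  n = 1e6;
  x = rand(n,1,'single'); y = rand(n,1,'single');
  exact = double(x)'*double(y);                    % reference computed in double
  err   = abs(double(x'*y) - exact);
  bound = n*(eps('single')/2)*(abs(double(x))'*abs(double(y)));
  fprintf('error = %.2e, bound = %.2e\n', err, bound)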
Outline

1. Higher Precision
2. Iterative Refinement
Need for Higher Precision

Long-time simulations.
Resolving small-scale phenomena.
Khanna, High-Precision Numerical Simulations on a CUDA GPU: Kerr Black Hole Tails, 2013.
Beliakov and Matiyasevich, A Parallel Algorithm for Calculation of Determinants and Minors Using Arbitrary Precision Arithmetic, 2016.
Ma and Saunders, Solving Multiscale Linear Programs Using the Simplex Method in Quadruple Precision, 2015.
Going to Higher Precision

If we have quadruple or higher precision, how can we modify existing algorithms to exploit it?

To what extent are existing algorithms precision-independent?
Newton-type algorithms: just decrease the tolerance?
How little higher precision can we get away with?
Gradually increase the precision through the iterations?
For Krylov methods the number of iterations can depend on the precision, so lower precision might not give the fastest computation!
Randomization + Extra Precision

Exploit A = X diag(λ_i) X^{-1} ⇒ f(A) = X diag(f(λ_i)) X^{-1}.
Davies (2007): "approximate diagonalization":

  function F = funm_randomized(A,fun)
  % FUNM_RANDOMIZED  Evaluate the function fun at A via a random perturbation
  % and eig, working at twice the current number of digits.
  d = digits; digits(2*d);                  % double the working precision
  tol = 10^(-d);
  E = randn(size(A),class(A));              % random perturbation
  [V,D] = eig(A + (tol*norm(A,'fro')/norm(E,'fro'))*E);
  F = V*diag(fun(diag(D)))/V;
  digits(d)                                 % restore the original precision

Perturbation ensures diagonalizability.
Extra precision overcomes the effect of the perturbation.
Availability of Multiprecision in Software

Maple, Mathematica, PARI/GP, Sage.
MATLAB: Symbolic Math Toolbox, Multiprecision Computing Toolbox (Advanpix).
Julia: BigFloat.
mpmath and SymPy for Python.
GNU MP Library. GNU MPFR Library.
(Quad only): some C and Fortran compilers.
Gone, but not forgotten: Numerical Turing (Hull et al., 1985).
Cost of Quadruple Precision

How fast is quadruple precision arithmetic? Compare MATLAB double precision, the Symbolic Math Toolbox VPA arithmetic with digits(34), and the Multiprecision Computing Toolbox (Advanpix) with mp.Digits(34), which is optimized for quad.

Ratios of times (Intel Broadwell-E Core i7-6800K @ 3.40GHz, 6 cores):

                           mp/double   vpa/double   vpa/mp
  LU, n = 250                  98        25,000       255
  eig, nonsymm, n = 125        75         6,020        81
  eig, symm, n = 200           32        11,100       342
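A rough MATLAB sketch of how such a ratio can be measured, assuming the Symbolic Math Toolbox is available (the Advanpix toolbox would be used analogously via mp.Digits(34) and mp(A)); the exact ratios vary with machine and version.

  n = 250; A = randn(n);
  t_double = timeit(@() lu(A));          % LU at double precision
  digits(34);                            % 34 significant digits ~ quad
  Av = vpa(A);
  tic; [L,U] = lu(Av); t_vpa = toc;      % LU at variable precision
  fprintf('vpa/double LU time ratio: %.0f\n', t_vpa/t_double)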
Outline

1. Higher Precision
2. Iterative Refinement
Accelerating the Solution of Ax = b

A ∈ R^{n×n} nonsingular. Standard method for solving Ax = b: factorize A = LU, solve LUx = b, all at the working precision.
Can we solve Ax = b faster and/or more accurately by exploiting multiprecision arithmetic?
Iterative Refinement for Ax = b (classic)

Solve Ax_0 = b by LU factorization in double precision.
  r = b − Ax_0        (quad precision)
  Solve Ad = r        (double precision)
  x_1 = x_0 + d       (double precision)
(x_0 ← x_1 and iterate as necessary.)

Programmed in J. H. Wilkinson, Progress Report on the Automatic Computing Engine (1948).
Popular up to the 1970s, exploiting cheap accumulation of inner products.
Iterative Refinement (1970s, 1980s)

Solve Ax_0 = b by LU factorization.
  r = b − Ax_0
  Solve Ad = r
  x_1 = x_0 + d
Everything in double precision.

Skeel (1980). Jankowski & Woźniakowski (1977) for a general solver.
Iterative Refinement (2000s)

Solve Ax_0 = b by LU factorization in single precision.
  r = b − Ax_0        (double precision)
  Solve Ad = r        (single precision)
  x_1 = x_0 + d       (double precision)

Dongarra et al. (2006). Motivated by single precision being at least twice as fast as double.
Iterative Refinement in Three Precisions

A, b given in precision u.
Solve Ax_0 = b by LU factorization in precision u_f.
  r = b − Ax_0        (precision u_r)
  Solve Ad = r        (precision u_f)
  x_1 = fl(x_0 + d)   (precision u)

The three previous usages are special cases.
Choose precisions from half, single, double, quadruple subject to u_r ≤ u ≤ u_f.
Can we compute more accurate solutions faster? (A runnable sketch of the loop follows.)
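A minimal MATLAB sketch of the refinement loop, here with u_f = single and u = u_r = double so that it runs in base MATLAB (a genuine u_r = quad residual would need, e.g., the Advanpix toolbox); the test matrix is an arbitrary choice.

  n = 100; A = gallery('randsvd', n, 1e6); b = randn(n,1);
  [L,U,p] = lu(single(A), 'vector');            % factorize at precision uf = single
  x = double(U \ (L \ single(b(p))));           % initial solve, stored at u = double
  for k = 1:5
      r = b - A*x;                              % residual at precision ur (double here)
      d = double(U \ (L \ single(r(p))));       % correction using the low precision factors
      x = x + d;                                % update at precision u
      fprintf('iter %d: normwise backward error %.2e\n', ...
              k, norm(b - A*x, inf)/(norm(A, inf)*norm(x, inf) + norm(b, inf)));
  end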
Existing Rounding Error Analysis

Wilkinson (1963): fixed-point arithmetic.
Moler (1967): floating-point arithmetic.
Higham (1997, 2002): more general analysis for an arbitrary solver.
Dongarra et al. (2006): lower precision LU.
At most two precisions, and all require κ(A)u < 1.

New Analysis

Applies to any solver.
Covers backward and forward errors; focus on the forward error here.
Allows κ(A)u ≳ 1.
New Analysis

Assume the computed solution to Ad_i = r_i has normwise relative backward error O(u_f) and satisfies

  ‖d_i − d̂_i‖∞ / ‖d_i‖∞ ≤ u_f θ_i < 1.

Define µ_i by

  ‖A(x − x̂_i)‖∞ = µ_i ‖A‖∞ ‖x − x̂_i‖∞,

and note that κ∞(A)^{-1} ≤ µ_i ≤ 1.
Condition Numbers

With |A| = (|a_ij|):

  cond(A, x) = ‖ |A^{-1}| |A| |x| ‖∞ / ‖x‖∞,
  cond(A)    = cond(A, e) = ‖ |A^{-1}| |A| ‖∞,
  κ∞(A)      = ‖A‖∞ ‖A^{-1}‖∞,

and 1 ≤ cond(A, x) ≤ cond(A) ≤ κ∞(A).
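For dense A of modest size these quantities can be computed directly; a small MATLAB sketch (naive, with an explicit inverse):

  % Skeel-style condition numbers, computed naively via inv(A).
  condAx   = @(A,x) norm(abs(inv(A))*abs(A)*abs(x), inf) / norm(x, inf);
  condA    = @(A)   norm(abs(inv(A))*abs(A), inf);
  kappainf = @(A)   norm(A, inf)*norm(inv(A), inf);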
Convergence Result

Theorem (Carson & H, 2018). For IR in precisions u_r ≤ u ≤ u_f, if

  φ_i = 2 u_f min( cond(A), κ∞(A) µ_i ) + u_f θ_i

is sufficiently less than 1, the forward error is reduced on the ith iteration by a factor ≈ φ_i until an iterate x̂ satisfies

  ‖x − x̂‖∞ / ‖x‖∞ ≲ 4n u_r cond(A, x) + u.

The analogous standard bound would have µ_i = 1 and u_f θ_i = κ(A) u_f.
Precision Combinations

H = half, S = single, D = double, Q = quad. Combinations "u_f u u_r":

  Traditional:   HHS, HHD, HHQ, SSD, SSQ, DDQ
  1970s/1980s:   HHH, SSS, DDD, QQQ
  2000s:         HSS, HDD, HQQ, SDD, SQQ, DQQ
  3 precisions:  HSD, HSQ, HDQ, SDQ
Results for LU Factorization (1)

  uf  u  ur    κ∞(A)    Backward error      Forward error
                        norm     comp
  H   S  S     10^4     S        S          cond(A, x) · S
  H   S  D     10^4     S        S          S
  H   D  D     10^4     D        D          cond(A, x) · D
  H   D  Q     10^4     D        D          D
  S   S  S     10^8     S        S          cond(A, x) · S
  S   S  D     10^8     S        S          S
  S   D  D     10^8     D        D          cond(A, x) · D
  S   D  Q     10^8     D        D          D

(Entries give the order of the limiting error, S ≈ 10^-8, D ≈ 10^-16; the κ∞(A) column gives the assumed bound on the condition number.)
Results (2): HSD vs. SSD

(Same table as above: the HSD and SSD rows achieve the same limiting errors, but HSD uses a cheaper half precision factorization at the cost of requiring κ∞(A) ≤ 10^4.)

Can we get the benefit of "HSD" while allowing a larger range of κ∞(A)?
Extending the Range of Applicability

Recall that the convergence condition is

  φ_i = 2 u_f min( cond(A), κ∞(A) µ_i ) + u_f θ_i ≪ 1.

We need both terms to be smaller than κ∞(A) u_f. Recall that

  ‖d_i − d̂_i‖∞ / ‖d_i‖∞ ≤ u_f θ_i,
  µ_i ‖A‖∞ ‖x − x_i‖∞ = ‖A(x − x_i)‖∞ = ‖b − Ax_i‖∞ = ‖r_i‖∞.
Bounding µ_i

For a stable solver, in the early stages we expect

  ‖r_i‖ / (‖A‖ ‖x_i‖) ≈ u ≪ ‖x − x_i‖ / ‖x‖,

or equivalently µ_i ≪ 1. But close to convergence

  ‖r_i‖ / (‖A‖ ‖x_i‖) ≈ u ≈ ‖x − x_i‖ / ‖x‖,

or µ_i ≈ 1.

Conclude that µ_i ≪ 1 initially and µ_i → 1 as the iteration converges.
Bounding θ_i

u_f θ_i bounds the relative error in the solution of Ad_i = r_i. We need u_f θ_i ≪ 1.
Standard solvers cannot achieve this for very ill conditioned A!
Empirically observed by Rump (1990): if L̂ and Û are the computed LU factors of A from GEPP, then

  κ(L̂^{-1} A Û^{-1}) ≈ 1 + κ(A)u,

even for κ(A) ≫ u^{-1}.
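Rump's observation is easy to reproduce; a small MATLAB sketch (the test matrix and its target condition number are arbitrary choices, and with partial pivoting the computed factors satisfy P*A ≈ L̂Û, so it is P*A that is preconditioned):

  n = 100;
  A = gallery('randsvd', n, 1e18);   % 2-norm condition number about 1e18 >> 1/u
  [L, U, P] = lu(A);                 % GEPP: P*A ~ L*U
  M = (L \ (P*A)) / U;               % inv(L)*P*A*inv(U), approximately the identity
  fprintf('cond(A) = %.1e, cond(inv(L)*P*A*inv(U)) = %.1e\n', cond(A), cond(M))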
Preconditioned GMRES

To compute the updates d_i we apply GMRES to

  Ã d_i ≡ Û^{-1} L̂^{-1} A d_i = Û^{-1} L̂^{-1} r_i.

Ã is applied in twice the working precision.
κ(Ã) ≪ κ(A) typically.
Rounding error analysis shows we get an accurate d̂_i even for numerically singular A.
Call the overall algorithm GMRES-IR.
The GMRES convergence rate is not directly related to κ(Ã).
Cf. Kobayashi & Ogita (2015), who explicitly form Ã. (A sketch of one GMRES-IR step follows.)
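A minimal MATLAB sketch of GMRES-IR's update solve, not the authors' code: the factorization is done in single (u_f), everything else in double, and for simplicity the products with Ã are applied at the working precision rather than at twice it; the matrix, tolerance and iteration counts are illustrative.

  n = 100; A = gallery('randsvd', n, 1e12); b = randn(n,1);
  [L, U, P] = lu(single(A));                         % factorization at uf = single
  L = double(L); U = double(U); P = double(P);
  x = U \ (L \ (P*b));                               % initial solve
  for k = 1:3
      r = b - A*x;                                   % residual (ideally at ur = quad)
      Atilde = @(v) U \ (L \ (P*(A*v)));             % v -> U^{-1} L^{-1} P A v
      rhs = U \ (L \ (P*r));
      [d, flag] = gmres(Atilde, rhs, [], 1e-8, n);   % solve Atilde*d = rhs
      x = x + d;
  end
  fprintf('normwise backward error %.2e\n', ...
          norm(b - A*x, inf)/(norm(A, inf)*norm(x, inf) + norm(b, inf)))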
Benefits of GMRES-IR

Recall H = 10^-4, S = 10^-8, D = 10^-16, Q = 10^-34.

  Solver     uf  u  ur    κ∞(A)     Backward error     Forward error
                                    norm     comp
  LU         H   D  Q     10^4      D        D          D
  LU         S   D  Q     10^8      D        D          D
  GMRES-IR   H   D  Q     10^12     D        D          D
  GMRES-IR   S   D  Q     10^16     D        D          D

How many GMRES iterations are required? Some tests with 100 × 100 matrices . . .
Test 1: LU-IR, (uf, u, ur) = (S, D, D)

κ∞(A) ≈ 10^10, σ_i = α^i. Divergence!
[Plot: forward error (ferr), normwise backward error (nbe) and componentwise backward error (cbe) against refinement step.]
Test 1: LU-IR, (uf, u, ur) = (S, D, Q)

κ∞(A) ≈ 10^10, σ_i = α^i. Divergence!
[Plot: ferr, nbe and cbe against refinement step.]
Test 1: LU-IR, (uf, u, ur) = (S, D, Q)

κ∞(A) ≈ 10^4, σ_i = α^i. Convergence.
[Plot: ferr, nbe and cbe against refinement step.]
Test 1: GMRES-IR, (uf, u, ur) = (S, D, Q)

κ∞(A) ≈ 10^10, σ_i = α^i, GMRES iterations (2, 3). Convergence.
[Plot: ferr, nbe and cbe against refinement step.]
Test 2: GMRES-IR, (uf, u, ur) = (H, D, Q)

κ∞(A) ≈ 10^2, one small σ_i, GMRES iterations (8, 8, 8). Convergence.
[Plot: ferr, nbe and cbe against refinement step.]
Test 3: GMRES-IR, (uf, u, ur) = (H, D, Q)

κ∞(A) ≈ 10^12, σ_i = α^i, GMRES iterations (100, 100). Take x_0 = 0 because of overflow! Convergence.
[Plot: ferr, nbe and cbe against refinement step.]
Conclusions

Both low and high precision floating-point arithmetic are becoming more prevalent, in hardware and software.
We need a better understanding of the behaviour of algorithms in low precision arithmetic.
IR with LU in a lower precision can be twice as fast as traditional IR, albeit with a restriction on κ(A).
Judicious use of a little high precision can bring major benefits.
GMRES-IR converges when traditional IR doesn't, thanks to the preconditioned GMRES solution of Ad_i = r_i.
GMRES-IR handles κ∞(A) ≈ u^{-1}.
Further work: tune the convergence tolerance, alternative preconditioners, etc.
References I

H. Anzt, J. Dongarra, G. Flegar, N. J. Higham, and E. S. Quintana-Ortí. Adaptive precision in block-Jacobi preconditioning for iterative sparse linear system solvers. Concurrency Computat.: Pract. Exper., 2018.

G. Beliakov and Y. Matiyasevich. A parallel algorithm for calculation of determinants and minors using arbitrary precision arithmetic. BIT, 56(1):33–50, 2016.
References II

E. Carson and N. J. Higham. A new analysis of iterative refinement and its application to accurate solution of ill-conditioned sparse linear systems. SIAM J. Sci. Comput., 39(6):A2834–A2856, 2017.

E. Carson and N. J. Higham. Accelerating the solution of linear systems by iterative refinement in three precisions. SIAM J. Sci. Comput., 40(2):A817–A847, 2018.

M. Courbariaux, Y. Bengio, and J.-P. David. Training deep neural networks with low precision multiplications, 2015. ArXiv preprint 1412.7024v5.
References III

E. B. Davies. Approximate diagonalization. SIAM J. Matrix Anal. Appl., 29(4):1051–1064, 2007.

N. J. Dingle and N. J. Higham. Reducing the influence of tiny normwise relative errors on performance profiles. ACM Trans. Math. Software, 39(4):24:1–24:11, 2013.

N. J. Higham. Iterative refinement for linear systems and LAPACK. IMA J. Numer. Anal., 17(4):495–509, 1997.
References IV

N. J. Higham. Accuracy and Stability of Numerical Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, second edition, 2002. ISBN 0-89871-521-0. xxx+680 pp.

G. Khanna. High-precision numerical simulations on a CUDA GPU: Kerr black hole tails. J. Sci. Comput., 56(2):366–380, 2013.
References V

Y. Kobayashi and T. Ogita. A fast and efficient algorithm for solving ill-conditioned linear systems. JSIAM Lett., 7:1–4, 2015.

J. Langou, J. Langou, P. Luszczek, J. Kurzak, A. Buttari, and J. Dongarra. Exploiting the performance of 32 bit floating point arithmetic in obtaining 64 bit accuracy (revisiting iterative refinement for linear systems). In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, Nov. 2006.
References VI

D. Ma and M. Saunders. Solving multiscale linear programs using the simplex method in quadruple precision. In M. Al-Baali, L. Grandinetti, and A. Purnama, editors, Numerical Analysis and Optimization, number 134 in Springer Proceedings in Mathematics and Statistics, pages 223–235. Springer-Verlag, Berlin, 2015.
References VII

A. Roldao-Lopes, A. Shahzad, G. A. Constantinides, and E. C. Kerrigan. More flops or more precision? Accuracy parameterizable linear equation solvers for model predictive control. In 17th IEEE Symposium on Field Programmable Custom Computing Machines, pages 209–216, Apr. 2009.

K. Scheinberg. Evolution of randomness in optimization methods for supervised machine learning. SIAG/OPT Views and News, 24(1):1–8, 2016.