Multiprecision Algorithms

Nick Higham
School of Mathematics
The University of Manchester

Director of Research, School of Mathematics
http://www.maths.manchester.ac.uk/~higham
@nhigham, nickhigham.wordpress.com

Samuel D. Conte Distinguished Lecture, Department of Computer Science, Purdue University, March 2018


Conte and De Boor

1980, 2018.

Early textbook componentwise backward error analysis of LU factorization: the computed solution x̂ of Ax = b satisfies

    (A + ΔA)x̂ = b,    |ΔA| ≤ u_n |A| + u_n(3 + u_n)|L̂||Û|,

where u_n = 1.01nu and |A| = (|a_ij|).


Outline

Multiprecision arithmetic: floating-point arithmetic supporting multiple precisions.

Low precision: half, or less.
High precision: quadruple, possibly arbitrary.

How to exploit different precisions to achieve faster algorithms with higher accuracy.

Download this talk from http://bit.ly/conte18


IEEE Standard 754-1985 and 2008 Revision

  Type        Size       Range        u = 2^{-t}
  half        16 bits    10^{±5}      2^{-11}  ≈ 4.9 × 10^{-4}
  single      32 bits    10^{±38}     2^{-24}  ≈ 6.0 × 10^{-8}
  double      64 bits    10^{±308}    2^{-53}  ≈ 1.1 × 10^{-16}
  quadruple   128 bits   10^{±4932}   2^{-113} ≈ 9.6 × 10^{-35}

Arithmetic operations (+, −, ∗, /, √) are performed as if first calculated to infinite precision, then rounded. Default: round to nearest, with round to even in case of a tie. Half precision is a storage format only.
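As a quick check on the u column, the unit roundoffs for the formats MATLAB supports natively can be queried directly (a minimal sketch; fp16 is not a native type here, so its value is entered by hand):

    % Unit roundoff u for the IEEE formats.
    u_single = eps('single')/2    % 5.96e-08, i.e. 2^(-24)
    u_double = eps('double')/2    % 1.11e-16, i.e. 2^(-53)
    u_half   = 2^(-11)            % 4.88e-04; fp16 is a storage format only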


Model for Rounding Error Analysis

For x, y ∈ F,

    fl(x op y) = (x op y)(1 + δ),   |δ| ≤ u,   op = +, −, ∗, /.

The same holds for op = √.

The model is weaker than fl(x op y) being correctly rounded.
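A minimal MATLAB illustration of the model for one operation, taking single as the working precision and double as the reference (variable names are my own):

    % One rounded operation in single precision: the delta in the model obeys |delta| <= u.
    x = single(pi);  y = single(exp(1));
    z = x*y;                                                  % one rounding at precision u
    delta = (double(z) - double(x)*double(y)) / (double(x)*double(y));
    abs(delta) <= eps('single')/2                             % returns true (logical 1)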



Precision versus Accuracy

    fl(ab · c) = ab(1 + δ_1) · c(1 + δ_2)
               = abc(1 + δ_1)(1 + δ_2)
               ≈ abc(1 + δ_1 + δ_2),     |δ_i| ≤ u.

Precision = u. Accuracy ≈ 2u.

Accuracy is not limited by precision.
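The same kind of check for the three-factor product above: two roundings in single precision give a relative error of order u to 2u (a sketch, not from the talk):

    % Two roundings accumulate: fl(ab*c) has relative error of order u to 2u.
    a = single(1.1);  b = single(2.3);  c = single(4.7);
    reference = double(a)*double(b)*double(c);   % exact product of the stored values
    computed  = (a*b)*c;                         % two single-precision roundings
    relerr = abs(double(computed) - reference)/abs(reference)
    u = eps('single')/2                          % relerr is of order u, at most about 2u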


ARM NEON


NVIDIA Tesla P100 (2016), V100 (2017)

“The Tesla P100 is the world’s first accelerator built for deep learning, and has native hardware ISA support for FP16 arithmetic.”

V100 tensor cores do a 4 × 4 matrix multiply in one clock cycle.

          TFLOPS
          double   single   half/tensor
  P100    4.7      9.3      18.7
  V100    7        14       112


AMD Radeon Instinct MI25 GPU (2017)

“24.6 TFLOPS FP16 or 12.3 TFLOPS FP32 peak GPU compute performance on a single board . . . Up to 82 GFLOPS/watt FP16 or 41 GFLOPS/watt FP32 peak GPU compute performance.”


Machine Learning

“For machine learning as well as for certain image processing and signal processing applications, more data at lower precision actually yields better results with certain algorithms than a smaller amount of more precise data.”

“We find that very low precision is sufficient not just for running trained networks but also for training them.” — Courbariaux, Bengio & David (2015)

We’re solving the wrong problem (Scheinberg, 2016), so we don’t need an accurate solution.
Low precision provides regularization.
Low precision encourages flat minima to be found.


Deep Learning for Java


Climate Modelling

T. Palmer, More reliable forecasts with less precise computations: a fast-track route to cloud-resolved weather and climate simulators?, Phil. Trans. R. Soc. A, 2014: Is there merit in representing variables at sufficiently high wavenumbers using half or even quarter precision floating-point numbers?

T. Palmer, Build imprecise supercomputers, Nature, 2015.


Fp16 for Communication Reduction

ResNet-50 training on ImageNet. Solved in 60 minutes on 256 TESLA P100s at Facebook (2017). Solved in 15 minutes on 1024 TESLA P100s at Preferred Networks, Inc. (2017) using ChainerMN (Takuya Akiba, SIAM PP18):

“While computation was generally done in single precision, in order to reduce the communication overhead during all-reduce operations, we used half-precision floats . . . In our preliminary experiments, we observed that the effect from using half-precision in communication on the final model accuracy was relatively small.”


Preconditioning with Adaptive Precision

Anzt, Dongarra, Flegar, H & Quintana-Ortí (2018): for sparse A and an iterative Ax = b solver, execution time and energy are dominated by data movement.

Block Jacobi preconditioning: D = diag(D_i), where D_i = A_ii. Solve D^{-1}Ax = D^{-1}b.
All computations are at fp64. Compute D^{-1} and store D_i^{-1} in fp16, fp32, or fp64, depending on κ(D_i).

Simulations and energy modelling show this can outperform a fixed-precision preconditioner. A sketch of the idea follows.
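A minimal sketch of the idea, not the authors' code: invert each diagonal block and store it in the lowest precision its conditioning allows. MATLAB here has no native fp16 storage, so the sketch uses only single/double, the function name adaptive_block_jacobi is hypothetical, and the threshold 1e4 is illustrative.

    % Sketch: block Jacobi with adaptively chosen storage precision per block.
    function P = adaptive_block_jacobi(A, blk)
    % A: matrix; blk: vector of diagonal block sizes. Returns a cell array of
    % inverted diagonal blocks, each stored in single or double.
    P = cell(numel(blk), 1);
    idx = 0;
    for i = 1:numel(blk)
        r  = idx+1 : idx+blk(i);
        Di = full(A(r, r));
        Dinv = inv(Di);
        if cond(Di) < 1e4      % well conditioned: single storage suffices
            P{i} = single(Dinv);
        else                   % otherwise keep double
            P{i} = Dinv;
        end
        idx = idx + blk(i);
    end
    end
    % To apply the preconditioner: y(r) = double(P{i})*x(r) blockwise,
    % with all computations carried out at fp64, as on the slide.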



Error Analysis in Low Precision

For an inner product x^T y of n-vectors the standard error bound is

    | fl(x^T y) − x^T y | ≤ nu |x|^T |y| + O(u^2).

In half precision, u ≈ 4.9 × 10^{-4}, so nu = 1 for n = 2048.

What happens when nu > 1?
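Half precision is not available natively in this setting, so the following single-precision analogue (u ≈ 6 × 10^{-8}) merely illustrates how the nu factor in the bound grows with n; the observed error is usually far below the worst-case bound:

    % Inner product accumulated in single precision versus a double reference.
    n = 1e6;
    xs = single(rand(n,1));  ys = single(rand(n,1));
    exact  = double(xs)'*double(ys);     % reference computed in double
    approx = double(xs'*ys);             % accumulated in single
    relerr = abs(approx - exact)/abs(exact)
    bound  = n*eps('single')/2           % the nu factor, about 0.06 here;
                                         % for nonnegative data |x|'|y| = x'y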


Outline

1. Higher Precision
2. Iterative Refinement


Need for Higher Precision

Long-time simulations. Resolving small-scale phenomena.

Khanna, High-Precision Numerical Simulations on a CUDA GPU: Kerr Black Hole Tails, 2013.
Beliakov and Matiyasevich, A Parallel Algorithm for Calculation of Determinants and Minors Using Arbitrary Precision Arithmetic, 2016.
Ma and Saunders, Solving Multiscale Linear Programs Using the Simplex Method in Quadruple Precision, 2015.



Going to Higher Precision

If we have quadruple or higher precision, how can we modify existing algorithms to exploit it?

To what extent are existing algorithms precision independent?
Newton-type algorithms: just decrease the tolerance?
How little higher precision can we get away with?
Gradually increase the precision through the iterations?
For Krylov methods the number of iterations can depend on the precision, so lower precision might not give the fastest computation!


Randomization + Extra Precision

Exploit A = X diag(λ_i) X^{-1} ⇒ f(A) = X diag(f(λ_i)) X^{-1}.

Davies (2007): “approximate diagonalization”:

    function F = funm_randomized(A,fun)
    % Evaluate the function fun at the matrix A by randomized approximate
    % diagonalization. Doubling digits raises the working precision only for
    % arithmetics controlled by digits (e.g. vpa); the code is schematic.
    d = digits; digits(2*d);                  % double the working precision
    tol = 10^(-d);
    E = randn(size(A),class(A));              % random perturbation
    [V,D] = eig(A + (tol*norm(A,'fro') ...
                  /norm(E,'fro'))*E);         % perturbed matrix is diagonalizable
    F = V*diag(fun(diag(D)))/V;
    digits(d)                                 % restore the original precision

The perturbation ensures diagonalizability. Extra precision overcomes the effect of the perturbation.


Availability of Multiprecision in Software

Maple, Mathematica, PARI/GP, Sage.
MATLAB: Symbolic Math Toolbox, Multiprecision Computing Toolbox (Advanpix).
Julia: BigFloat.
mpmath and SymPy for Python.
GNU MP Library. GNU MPFR Library.
(Quad only): some C and Fortran compilers.

Gone, but not forgotten: Numerical Turing (Hull et al., 1985).


Cost of Quadruple Precision

How fast is quadruple precision arithmetic? Compare MATLAB double precision, Symbolic Math Toolbox vpa arithmetic with digits(34), and the Multiprecision Computing Toolbox (Advanpix) with mp.Digits(34), which is optimized for quad.

Ratios of times, Intel Broadwell-E Core i7-6800K @3.40GHz, 6 cores:

                           mp/double   vpa/double   vpa/mp
  LU, n = 250                  98        25,000       255
  eig, nonsymm, n = 125        75         6,020        81
  eig, symm, n = 200           32        11,100       342
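A sketch of the kind of timing experiment behind the table, assuming the Symbolic Math Toolbox (vpa) and the Advanpix toolbox (mp, mp.Digits) are installed and that both types support the three-output lu call; the ratios will vary with the machine:

    % Rough timing of LU at double vs. 34-digit arithmetic (hypothetical setup).
    n = 250;  A = rand(n);
    tic, [L1,U1,P1] = lu(A);        t_double = toc;
    digits(34)                      % Symbolic Math Toolbox: 34-digit vpa
    tic, [L2,U2,P2] = lu(vpa(A));   t_vpa = toc;
    mp.Digits(34);                  % Advanpix Multiprecision Computing Toolbox
    tic, [L3,U3,P3] = lu(mp(A));    t_mp = toc;
    ratios = [t_mp/t_double, t_vpa/t_double, t_vpa/t_mp]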


Outline

1. Higher Precision
2. Iterative Refinement


Accelerating the Solution of Ax = b

A ∈ R^{n×n} nonsingular. Standard method for solving Ax = b: factorize A = LU, solve LUx = b, all at working precision.

Can we solve Ax = b faster and/or more accurately by exploiting multiprecision arithmetic?


Iterative Refinement for Ax = b (classic)

Solve Ax_0 = b by LU factorization    double precision
r = b − Ax_0                          quad precision
Solve Ad = r                          double precision
x_1 = x_0 + d                         double precision
(x_0 ← x_1 and iterate as necessary.)

Programmed in J. H. Wilkinson, Progress Report on the Automatic Computing Engine (1948). Popular up to the 1970s, exploiting cheap accumulation of inner products.


Iterative Refinement (1970s, 1980s)

Solve Ax_0 = b by LU factorization
r = b − Ax_0
Solve Ad = r
x_1 = x_0 + d
Everything in double precision.

Skeel (1980). Jankowski & Woźniakowski (1977) for a general solver.


Iterative Refinement (2000s)

Solve Ax_0 = b by LU factorization    single precision
r = b − Ax_0                          double precision
Solve Ad = r                          single precision
x_1 = x_0 + d                         double precision

Dongarra et al. (2006). Motivated by single precision being at least twice as fast as double.


Iterative Refinement in Three Precisions

A, b given in precision u.

Solve Ax_0 = b by LU factorization    precision u_f
r = b − Ax_0                          precision u_r
Solve Ad = r                          precision u_f
x_1 = fl(x_0 + d)                     precision u

The three previous usages are special cases. Choose the precisions from half, single, double, quadruple subject to u_r ≤ u ≤ u_f.

Can we compute more accurate solutions faster?
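A minimal MATLAB sketch of the scheme with u_f = single and u = double; lacking native quadruple, the residual here is also computed in double, which is where extra precision (u_r) would enter if available. The function name lu_ir3 is my own.

    % Sketch: LU-IR with u_f = single, u = double, u_r = double (quad not native here).
    function x = lu_ir3(A, b, nsteps)
    [L, U, p] = lu(single(A), 'vector');    % factorize once at precision u_f
    x = double(U \ (L \ single(b(p))));     % initial solve with the low-precision factors
    for k = 1:nsteps
        r = b - A*x;                        % residual at precision u_r (ideally quad,
                                            % e.g. via a multiprecision toolbox)
        d = double(U \ (L \ single(r(p)))); % correction solved with the u_f factors
        x = x + d;                          % update at precision u
    end
    end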


Existing Rounding Error Analysis

Wilkinson (1963): fixed-point arithmetic.
Moler (1967): floating-point arithmetic.
Higham (1997, 2002): more general analysis for an arbitrary solver.
Dongarra et al. (2006): lower precision LU.

At most two precisions, and all require κ(A)u < 1.

New Analysis

Applies to any solver. Covers backward and forward errors; the focus here is on the forward error. Allows κ(A)u ≳ 1.


New Analysis

Assume the computed solution d̂_i to Ad_i = r_i has normwise relative backward error O(u_f) and satisfies

    ‖d_i − d̂_i‖∞ / ‖d_i‖∞ ≤ u_f θ_i < 1.

Define µ_i by ‖A(x − x̂_i)‖∞ = µ_i ‖A‖∞ ‖x − x̂_i‖∞, and note that κ∞(A)^{-1} ≤ µ_i ≤ 1.


Condition Numbers

With |A| = (|a_ij|):

    cond(A, x) = ‖ |A^{-1}||A||x| ‖∞ / ‖x‖∞,
    cond(A) = cond(A, e) = ‖ |A^{-1}||A| ‖∞,
    κ∞(A) = ‖A‖∞ ‖A^{-1}‖∞.

    1 ≤ cond(A, x) ≤ cond(A) ≤ κ∞(A).
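For a small dense matrix these quantities can be computed explicitly (a sketch; for large matrices one would estimate them instead):

    % Condition numbers for a small ill-conditioned example.
    n = 10;
    A = gallery('randsvd', n, 1e8);      % 2-norm condition number 1e8
    x = randn(n, 1);
    Ainv = inv(A);
    cond_Ax = norm(abs(Ainv)*abs(A)*abs(x), inf) / norm(x, inf)   % cond(A,x)
    cond_A  = norm(abs(Ainv)*abs(A), inf)                         % cond(A)
    kappa_A = norm(A, inf) * norm(Ainv, inf)                      % kappa_inf(A)
    % The chain 1 <= cond(A,x) <= cond(A) <= kappa_inf(A) always holds.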


Convergence Result

Theorem (Carson & Higham, 2018). For IR in precisions u_r ≤ u ≤ u_f, if

    φ_i = 2u_f min(cond(A), κ∞(A)µ_i) + u_f θ_i

is sufficiently less than 1, then the forward error is reduced on the ith iteration by a factor ≈ φ_i until an iterate x̂ satisfies

    ‖x − x̂‖∞ / ‖x‖∞ ≲ 4n u_r cond(A, x) + u.

The analogous standard bound would have µ_i = 1 and u_f θ_i = κ∞(A)u_f.


Precision Combinations

H = half, S = single, D = double, Q = quad. Combinations “u_f u u_r”:

  Traditional:   HHS HHD HHQ SSD SSQ DDQ
  1970s/1980s:   HHH SSS DDD QQQ
  2000s:         HSS HDD HQQ SDD SQQ DQQ
  3 precisions:  HSD HSQ HDQ SDQ


Results for LU Factorization (1)

  u_f   u   u_r   κ∞(A)   Backward error (norm, comp)   Forward error
  H     S   S     10^4     S     S                       cond(A,x)·S
  H     S   D     10^4     S     S                       S
  H     D   D     10^4     D     D                       cond(A,x)·D
  H     D   Q     10^4     D     D                       D
  S     S   S     10^8     S     S                       cond(A,x)·S
  S     S   D     10^8     S     S                       S
  S     D   D     10^8     D     D                       cond(A,x)·D
  S     D   Q     10^8     D     D                       D



Results (2): HSD vs. SSD

  u_f   u   u_r   κ∞(A)   Backward error (norm, comp)   Forward error
  H     S   S     10^4     S     S                       cond(A,x)·S
  H     S   D     10^4     S     S                       S
  H     D   D     10^4     D     D                       cond(A,x)·D
  H     D   Q     10^4     D     D                       D
  S     S   S     10^8     S     S                       cond(A,x)·S
  S     S   D     10^8     S     S                       S
  S     D   D     10^8     D     D                       cond(A,x)·D
  S     D   Q     10^8     D     D                       D

Can we get the benefit of “HSD” while allowing a larger range of κ∞(A)?


Extending the Range of Applicability

Recall that the convergence condition is

    φ_i = 2u_f min(cond(A), κ∞(A)µ_i) + u_f θ_i ≪ 1.

We need both terms to be smaller than κ∞(A)u_f. Recall that

    ‖d_i − d̂_i‖∞ / ‖d_i‖∞ ≤ u_f θ_i,

    µ_i ‖A‖∞ ‖x − x_i‖∞ = ‖A(x − x_i)‖∞ = ‖b − Ax_i‖∞ = ‖r_i‖∞.


Bounding µ_i

For a stable solver, in the early stages we expect

    ‖r_i‖ / (‖A‖‖x_i‖) ≈ u ≪ ‖x − x_i‖ / ‖x‖,

or equivalently µ_i ≪ 1. But close to convergence

    ‖r_i‖ / (‖A‖‖x_i‖) ≈ u ≈ ‖x − x_i‖ / ‖x‖,

or µ_i ≈ 1.

Conclude that µ_i ≪ 1 initially and µ_i → 1 as the iteration converges.


Bounding θ_i

u_f θ_i bounds the relative error in the solution of Ad_i = r_i. We need u_f θ_i ≪ 1. Standard solvers cannot achieve this for very ill conditioned A!

Empirically observed by Rump (1990): if L̂ and Û are the computed LU factors of A from GEPP, then

    κ(L̂^{-1} A Û^{-1}) ≈ 1 + κ(A)u,

even for κ(A) ≫ u^{-1}.
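The observation is easy to reproduce in a few lines (a sketch; gallery('randsvd') supplies the ill-conditioned test matrix and lu performs GEPP in double precision):

    % Rump's observation: the GEPP factors precondition A well even when A is
    % numerically singular.
    n = 100;
    A = gallery('randsvd', n, 1e16);     % 2-norm condition number about 1e16
    [L, U, P] = lu(A);                   % GEPP in double precision
    kappa_A    = cond(A)                 % huge: A is numerically singular
    kappa_prec = cond(L \ (P*A) / U)     % approx 1 + kappa(A)*u, i.e. modest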


Preconditioned GMRES

To compute the updates d_i we apply GMRES to

    Ã d_i ≡ Û^{-1} L̂^{-1} A d_i = Û^{-1} L̂^{-1} r_i.

Ã is applied in twice the working precision. Typically κ(Ã) ≪ κ(A). Rounding error analysis shows we get an accurate d̂_i even for numerically singular A.

Call the overall algorithm GMRES-IR. The GMRES convergence rate is not directly related to κ(Ã). Cf. Kobayashi & Ogita (2015), who explicitly form Ã.
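A sketch of GMRES-IR entirely at working precision: the low-precision LU factors are passed to MATLAB's gmres as preconditioners. The extra-precision application of the preconditioned matrix used in the analysis is not reproduced here, and the function name gmres_ir is my own.

    % Sketch: GMRES-IR with u_f = single LU factors used as preconditioners.
    function x = gmres_ir(A, b, nsteps, gtol)
    n = size(A, 1);
    [L, U] = lu(single(A));          % low-precision LU (permutation folded into L)
    L = double(L);  U = double(U);   % promote the factors for use in gmres
    x = U \ (L \ b);                 % initial solve
    for k = 1:nsteps
        r = b - A*x;                 % residual (ideally at extra precision)
        [d, flag] = gmres(A, r, [], gtol, n, L, U);   % preconditioned GMRES for the update
        x = x + d;                   % flag not checked in this sketch
    end
    end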




Benefits of GMRES-IR

Recall H = 10^{-4}, S = 10^{-8}, D = 10^{-16}, Q = 10^{-34}.

              u_f   u   u_r   κ∞(A)    Backward error (nrm, cmp)   F'error
  LU          H     D   Q     10^4      D     D                     D
  LU          S     D   Q     10^8      D     D                     D
  GMRES-IR    H     D   Q     10^12     D     D                     D
  GMRES-IR    S     D   Q     10^16     D     D                     D

How many GMRES iterations are required? Some tests with 100 × 100 matrices . . .


Test 1: LU-IR, (u_f, u, u_r) = (S, D, D), κ∞(A) ≈ 10^{10}, σ_i = α^i. Divergence!

[Plot: forward error (ferr), normwise backward error (nbe), and componentwise backward error (cbe) versus refinement step.]

Test 1: LU-IR, (u_f, u, u_r) = (S, D, Q), κ∞(A) ≈ 10^{10}, σ_i = α^i. Divergence!

[Plot: forward error (ferr), normwise backward error (nbe), and componentwise backward error (cbe) versus refinement step.]

Test 1: LU-IR, (u_f, u, u_r) = (S, D, Q), κ∞(A) ≈ 10^4, σ_i = α^i. Convergence.

[Plot: errors versus refinement step.]

Test 1: GMRES-IR, (u_f, u, u_r) = (S, D, Q), κ∞(A) ≈ 10^{10}, σ_i = α^i, GMRES iterations (2, 3). Convergence.

[Plot: forward error (ferr), normwise backward error (nbe), and componentwise backward error (cbe) versus refinement step.]

Test 2: GMRES-IR, (u_f, u, u_r) = (H, D, Q), κ∞(A) ≈ 10^2, one small σ_i, GMRES iterations (8, 8, 8). Convergence.

[Plot: forward error (ferr), normwise backward error (nbe), and componentwise backward error (cbe) versus refinement step.]

Test 3: GMRES-IR, (u_f, u, u_r) = (H, D, Q), κ∞(A) ≈ 10^{12}, σ_i = α^i, GMRES iterations (100, 100). Take x_0 = 0 because of overflow! Convergence.

[Plot: errors versus refinement step.]

Conclusions

Both low and high precision floating-point arithmetic are becoming more prevalent, in hardware and software.
We need a better understanding of the behaviour of algorithms in low precision arithmetic.
IR with LU in lower precision is twice as fast as traditional IR, albeit with restricted κ(A).
Judicious use of a little high precision can bring major benefits.
GMRES-IR converges when traditional IR doesn't, thanks to the preconditioned GMRES solution of Ad_i = r_i. GMRES-IR handles κ∞(A) ≈ u^{-1}.
Further work: tune the convergence tolerance, alternative preconditioners, etc.


References I

H. Anzt, J. Dongarra, G. Flegar, N. J. Higham, and E. S. Quintana-Ortí. Adaptive precision in block-Jacobi preconditioning for iterative sparse linear system solvers. Concurrency Computat.: Pract. Exper., 2018.

G. Beliakov and Y. Matiyasevich. A parallel algorithm for calculation of determinants and minors using arbitrary precision arithmetic. BIT, 56(1):33–50, 2016.


References II

E. Carson and N. J. Higham. A new analysis of iterative refinement and its application to accurate solution of ill-conditioned sparse linear systems. SIAM J. Sci. Comput., 39(6):A2834–A2856, 2017.

E. Carson and N. J. Higham. Accelerating the solution of linear systems by iterative refinement in three precisions. SIAM J. Sci. Comput., 40(2):A817–A847, 2018.

M. Courbariaux, Y. Bengio, and J.-P. David. Training deep neural networks with low precision multiplications, 2015. ArXiv preprint 1412.7024v5.


References III

E. B. Davies. Approximate diagonalization. SIAM J. Matrix Anal. Appl., 29(4):1051–1064, 2007.

N. J. Dingle and N. J. Higham. Reducing the influence of tiny normwise relative errors on performance profiles. ACM Trans. Math. Software, 39(4):24:1–24:11, 2013.

N. J. Higham. Iterative refinement for linear systems and LAPACK. IMA J. Numer. Anal., 17(4):495–509, 1997.


References IV

N. J. Higham. Accuracy and Stability of Numerical Algorithms. Second edition, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2002. ISBN 0-89871-521-0. xxx+680 pp.

G. Khanna. High-precision numerical simulations on a CUDA GPU: Kerr black hole tails. J. Sci. Comput., 56(2):366–380, 2013.


References V

Y. Kobayashi and T. Ogita. A fast and efficient algorithm for solving ill-conditioned linear systems. JSIAM Lett., 7:1–4, 2015.

J. Langou, J. Langou, P. Luszczek, J. Kurzak, A. Buttari, and J. Dongarra. Exploiting the performance of 32 bit floating point arithmetic in obtaining 64 bit accuracy (revisiting iterative refinement for linear systems). In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, Nov. 2006.


References VI

D. Ma and M. Saunders. Solving multiscale linear programs using the simplex method in quadruple precision. In M. Al-Baali, L. Grandinetti, and A. Purnama, editors, Numerical Analysis and Optimization, number 134 in Springer Proceedings in Mathematics and Statistics, pages 223–235. Springer-Verlag, Berlin, 2015.


References VII

A. Roldao-Lopes, A. Shahzad, G. A. Constantinides, and E. C. Kerrigan. More flops or more precision? Accuracy parameterizable linear equation solvers for model predictive control. In 17th IEEE Symposium on Field Programmable Custom Computing Machines, pages 209–216, Apr. 2009.

K. Scheinberg. Evolution of randomness in optimization methods for supervised machine learning. SIAG/OPT Views and News, 24(1):1–8, 2016.
