Multiprecision Algorithms
Nick Higham, School of Mathematics, The University of Manchester
http://www.maths.manchester.ac.uk/~higham   @nhigham   nickhigham.wordpress.com
Samuel D. Conte Distinguished Lecture, Department of Computer Science, Purdue University, March 2018
Download this talk from http://bit.ly/conte18
Conte and De Boor
Early textbook (1980; reissued 2018) giving a componentwise backward error analysis of LU factorization: the computed solution of Ax = b satisfies

  (A + ΔA)x̂ = b,   |ΔA| ≤ u_n |A| + u_n (3 + u_n) |L̂||Û|,

where u_n = 1.01nu and |A| = (|a_ij|).
Outline

Multiprecision arithmetic: floating-point arithmetic supporting multiple precisions.
  Low precision: half, or less.
  High precision: quadruple, possibly arbitrary.
How to exploit different precisions to achieve faster algorithms with higher accuracy.

Download this talk from http://bit.ly/conte18
IEEE Standard 754-1985 and 2008 Revision

  Type        Size       Range       u = 2^-t
  half        16 bits    10^±5       2^-11  ≈ 4.9 × 10^-4
  single      32 bits    10^±38      2^-24  ≈ 6.0 × 10^-8
  double      64 bits    10^±308     2^-53  ≈ 1.1 × 10^-16
  quadruple   128 bits   10^±4932    2^-113 ≈ 9.6 × 10^-35

Arithmetic ops (+, −, ∗, /, √) performed as if first calculated to infinite precision, then rounded.
Default: round to nearest, round to even in case of a tie.
Half precision is a storage format only.
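A quick MATLAB check of the unit roundoffs in the table, as a minimal sketch (u is half the spacing of floating-point numbers just above 1, i.e. eps/2; the half precision value is written out from its 11-bit significand since base MATLAB has no fp16 eps):

  % Unit roundoffs from the table; eps(class) is the spacing above 1, so u = eps/2.
  u_half   = 2^(-11)            % 4.8828e-04 (fp16: 11 significand bits incl. the implicit bit)
  u_single = eps('single')/2    % 5.9605e-08
  u_double = eps('double')/2    % 1.1102e-16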
Model for Rounding Error Analysis

For x, y ∈ F,

  fl(x op y) = (x op y)(1 + δ),   |δ| ≤ u,   op = +, −, ∗, /.

Also holds for op = √.

Model is weaker than fl(x op y) being correctly rounded.
Precision versus Accuracy

  fl(abc) = ab(1 + δ1) · c(1 + δ2)
          = abc(1 + δ1)(1 + δ2)
          ≈ abc(1 + δ1 + δ2),   |δi| ≤ u.

Precision = u. Accuracy ≈ 2u.

Accuracy is not limited by precision.
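A minimal MATLAB sketch of this point, with single precision playing the role of the working precision and double precision as the reference (the particular values of a, b, c are arbitrary):

  % Form a*b*c in single precision; two roundings give a relative error of order 2u.
  a = single(1/3); b = single(7/11); c = single(13/9);
  ref = (1/3)*(7/11)*(13/9);                      % reference product in double
  rel_err = abs(double(a*b*c) - ref)/abs(ref)     % typically a small multiple of u
  u = eps('single')/2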
ARM NEON
NVIDIA Tesla P100 (2016), V100 (2017)
"The Tesla P100 is the world's first accelerator built for deep learning, and has native hardware ISA support for FP16 arithmetic."
V100 tensor cores do a 4 × 4 matrix multiply in one clock cycle.

  TFLOPS   double   single   half/tensor
  P100      4.7      9.3      18.7
  V100      7        14       112
AMD Radeon Instinct MI25 GPU (2017)
"24.6 TFLOPS FP16 or 12.3 TFLOPS FP32 peak GPU compute performance on a single board . . . Up to 82 GFLOPS/watt FP16 or 41 GFLOPS/watt FP32 peak GPU compute performance."
Machine Learning

"For machine learning as well as for certain image processing and signal processing applications, more data at lower precision actually yields better results with certain algorithms than a smaller amount of more precise data."

"We find that very low precision is sufficient not just for running trained networks but also for training them." — Courbariaux, Bengio & David (2015)

We're solving the wrong problem (Scheinberg, 2016), so we don't need an accurate solution.
Low precision provides regularization.
Low precision encourages flat minima to be found.
Deep Learning for Java
Climate Modelling

T. Palmer, More reliable forecasts with less precise computations: a fast-track route to cloud-resolved weather and climate simulators?, Phil. Trans. R. Soc. A, 2014:
  Is there merit in representing variables at sufficiently high wavenumbers using half or even quarter precision floating-point numbers?
T. Palmer, Build imprecise supercomputers, Nature, 2015.
Fp16 for Communication Reduction

ResNet-50 training on ImageNet.
Solved in 60 mins on 256 TESLA P100s at Facebook (2017).
Solved in 15 mins on 1024 TESLA P100s at Preferred Networks, Inc. (2017) using ChainerMN (Takuya Akiba, SIAM PP18):
"While computation was generally done in single precision, in order to reduce the communication overhead during all-reduce operations, we used half-precision floats . . . In our preliminary experiments, we observed that the effect from using half-precision in communication on the final model accuracy was relatively small."
Preconditioning with Adaptive Precision

Anzt, Dongarra, Flegar, H & Quintana-Ortí (2018):
For sparse A and an iterative Ax = b solver, execution time and energy are dominated by data movement.
Block Jacobi preconditioning: D = diag(D_i), where D_i = A_ii. Solve D^{-1}Ax = D^{-1}b.
All computations are at fp64.
Compute D^{-1} and store D_i^{-1} in fp16, fp32 or fp64, depending on κ(D_i).
Simulations and energy modelling show this can outperform a fixed precision preconditioner. (A sketch of the block-precision selection follows.)
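A minimal MATLAB sketch of the precision-selection idea, not the authors' implementation: the function name, the block interface and the κ thresholds (10^2, 10^6) are illustrative assumptions, and the fp16 branch assumes the half type (Fixed-Point Designer) is available.

  function [Dinv, prec] = adaptive_block_jacobi(A, blk)
  % ADAPTIVE_BLOCK_JACOBI  Invert each diagonal block at fp64, then choose a
  % storage precision for the inverse from its condition number.
  % blk holds the starting index of each diagonal block of A.
  n = size(A,1); blk = [blk(:)', n+1];
  Dinv = cell(1, numel(blk)-1); prec = cell(1, numel(blk)-1);
  for i = 1:numel(blk)-1
      idx = blk(i):blk(i+1)-1;
      Di = full(A(idx, idx));
      Bi = inv(Di);                         % computed at fp64
      kappa = cond(Di, inf);
      if kappa < 1e2
          Dinv{i} = half(Bi);   prec{i} = 'fp16';   % requires Fixed-Point Designer
      elseif kappa < 1e6
          Dinv{i} = single(Bi); prec{i} = 'fp32';
      else
          Dinv{i} = Bi;         prec{i} = 'fp64';
      end
  end
  end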
Error Analysis in Low Precision

For the inner product x^T y of n-vectors the standard error bound is

  |fl(x^T y) − x^T y| ≤ nu |x|^T |y| + O(u^2).

In half precision, u ≈ 4.9 × 10^-4, so nu = 1 for n = 2048.

What happens when nu > 1?
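The bound is easy to check numerically; a minimal MATLAB sketch in single precision (u ≈ 6 × 10^-8), since base MATLAB has no fp16 arithmetic — in fp16 the same experiment reaches nu = 1 already at n = 2048:

  % Compare the actual inner product error with the bound n*u*|x|'*|y|.
  n = 1e6;
  x = rand(n,1,'single'); y = rand(n,1,'single');
  exact = double(x)'*double(y);                    % reference computed in double
  err   = abs(double(x'*y) - exact);
  bound = n*(eps('single')/2)*(abs(double(x))'*abs(double(y)));
  fprintf('error = %.2e, bound = %.2e\n', err, bound)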
Outline

1. Higher Precision
2. Iterative Refinement
Need for Higher Precision

Long-time simulations.
Resolving small-scale phenomena.
Khanna, High-Precision Numerical Simulations on a CUDA GPU: Kerr Black Hole Tails, 2013.
Beliakov and Matiyasevich, A Parallel Algorithm for Calculation of Determinants and Minors Using Arbitrary Precision Arithmetic, 2016.
Ma and Saunders, Solving Multiscale Linear Programs Using the Simplex Method in Quadruple Precision, 2015.
Going to Higher Precision

If we have quadruple or higher precision, how can we modify existing algorithms to exploit it?

To what extent are existing algorithms precision-independent?
Newton-type algorithms: just decrease the tolerance?
How little higher precision can we get away with?
Gradually increase the precision through the iterations?
For Krylov methods the number of iterations can depend on the precision, so lower precision might not give the fastest computation!
Randomization + Extra Precision

Exploit A = X diag(λ_i) X^{-1} ⇒ f(A) = X diag(f(λ_i)) X^{-1}.
Davies (2007): "approximate diagonalization":

  function F = funm_randomized(A,fun)
  % FUNM_RANDOMIZED  Evaluate the function fun at A via a random perturbation
  % and eig, working at twice the current number of digits.
  d = digits; digits(2*d);                  % double the working precision
  tol = 10^(-d);
  E = randn(size(A),class(A));              % random perturbation
  [V,D] = eig(A + (tol*norm(A,'fro')/norm(E,'fro'))*E);
  F = V*diag(fun(diag(D)))/V;
  digits(d)                                 % restore the original precision

Perturbation ensures diagonalizability.
Extra precision overcomes the effect of the perturbation.
Availability of Multiprecision in Software

Maple, Mathematica, PARI/GP, Sage.
MATLAB: Symbolic Math Toolbox, Multiprecision Computing Toolbox (Advanpix).
Julia: BigFloat.
mpmath and SymPy for Python.
GNU MP Library. GNU MPFR Library.
(Quad only): some C and Fortran compilers.
Gone, but not forgotten: Numerical Turing (Hull et al., 1985).
Cost of Quadruple Precision

How fast is quadruple precision arithmetic? Compare MATLAB double precision, the Symbolic Math Toolbox VPA arithmetic with digits(34), and the Multiprecision Computing Toolbox (Advanpix) with mp.Digits(34), which is optimized for quad.

Ratios of times (Intel Broadwell-E Core i7-6800K @ 3.40GHz, 6 cores):

                           mp/double   vpa/double   vpa/mp
  LU, n = 250                  98        25,000       255
  eig, nonsymm, n = 125        75         6,020        81
  eig, symm, n = 200           32        11,100       342
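A rough MATLAB sketch of how such a ratio can be measured, assuming the Symbolic Math Toolbox is available (the Advanpix toolbox would be used analogously via mp.Digits(34) and mp(A)); the exact ratios vary with machine and version.

  n = 250; A = randn(n);
  t_double = timeit(@() lu(A));          % LU at double precision
  digits(34);                            % 34 significant digits ~ quad
  Av = vpa(A);
  tic; [L,U] = lu(Av); t_vpa = toc;      % LU at variable precision
  fprintf('vpa/double LU time ratio: %.0f\n', t_vpa/t_double)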
Outline

1. Higher Precision
2. Iterative Refinement
Accelerating the Solution of Ax = b

A ∈ R^{n×n} nonsingular. Standard method for solving Ax = b: factorize A = LU, solve LUx = b, all at the working precision.
Can we solve Ax = b faster and/or more accurately by exploiting multiprecision arithmetic?
Iterative Refinement for Ax = b (classic)

Solve Ax_0 = b by LU factorization in double precision.
  r = b − Ax_0        (quad precision)
  Solve Ad = r        (double precision)
  x_1 = x_0 + d       (double precision)
(x_0 ← x_1 and iterate as necessary.)

Programmed in J. H. Wilkinson, Progress Report on the Automatic Computing Engine (1948).
Popular up to the 1970s, exploiting cheap accumulation of inner products.
Iterative Refinement (1970s, 1980s)

Solve Ax_0 = b by LU factorization.
  r = b − Ax_0
  Solve Ad = r
  x_1 = x_0 + d
Everything in double precision.

Skeel (1980). Jankowski & Woźniakowski (1977) for a general solver.
Iterative Refinement (2000s)

Solve Ax_0 = b by LU factorization in single precision.
  r = b − Ax_0        (double precision)
  Solve Ad = r        (single precision)
  x_1 = x_0 + d       (double precision)

Dongarra et al. (2006). Motivated by single precision being at least twice as fast as double.
Iterative Refinement in Three Precisions

A, b given in precision u.
Solve Ax_0 = b by LU factorization in precision u_f.
  r = b − Ax_0        (precision u_r)
  Solve Ad = r        (precision u_f)
  x_1 = fl(x_0 + d)   (precision u)

The three previous usages are special cases.
Choose precisions from half, single, double, quadruple subject to u_r ≤ u ≤ u_f.
Can we compute more accurate solutions faster? (A runnable sketch of the loop follows.)
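A minimal MATLAB sketch of the refinement loop, here with u_f = single and u = u_r = double so that it runs in base MATLAB (a genuine u_r = quad residual would need, e.g., the Advanpix toolbox); the test matrix is an arbitrary choice.

  n = 100; A = gallery('randsvd', n, 1e6); b = randn(n,1);
  [L,U,p] = lu(single(A), 'vector');            % factorize at precision uf = single
  x = double(U \ (L \ single(b(p))));           % initial solve, stored at u = double
  for k = 1:5
      r = b - A*x;                              % residual at precision ur (double here)
      d = double(U \ (L \ single(r(p))));       % correction using the low precision factors
      x = x + d;                                % update at precision u
      fprintf('iter %d: normwise backward error %.2e\n', ...
              k, norm(b - A*x, inf)/(norm(A, inf)*norm(x, inf) + norm(b, inf)));
  end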
Existing Rounding Error Analysis

Wilkinson (1963): fixed-point arithmetic.
Moler (1967): floating-point arithmetic.
Higham (1997, 2002): more general analysis for an arbitrary solver.
Dongarra et al. (2006): lower precision LU.
At most two precisions, and all require κ(A)u < 1.

New Analysis

Applies to any solver.
Covers backward and forward errors; focus on the forward error here.
Allows κ(A)u ≳ 1.
New Analysis

Assume the computed solution to Ad_i = r_i has normwise relative backward error O(u_f) and satisfies

  ‖d_i − d̂_i‖∞ / ‖d_i‖∞ ≤ u_f θ_i < 1.

Define µ_i by

  ‖A(x − x̂_i)‖∞ = µ_i ‖A‖∞ ‖x − x̂_i‖∞,

and note that κ∞(A)^{-1} ≤ µ_i ≤ 1.
Condition Numbers

With |A| = (|a_ij|):

  cond(A, x) = ‖ |A^{-1}| |A| |x| ‖∞ / ‖x‖∞,
  cond(A)    = cond(A, e) = ‖ |A^{-1}| |A| ‖∞,
  κ∞(A)      = ‖A‖∞ ‖A^{-1}‖∞,

and 1 ≤ cond(A, x) ≤ cond(A) ≤ κ∞(A).
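For dense A of modest size these quantities can be computed directly; a small MATLAB sketch (naive, with an explicit inverse):

  % Skeel-style condition numbers, computed naively via inv(A).
  condAx   = @(A,x) norm(abs(inv(A))*abs(A)*abs(x), inf) / norm(x, inf);
  condA    = @(A)   norm(abs(inv(A))*abs(A), inf);
  kappainf = @(A)   norm(A, inf)*norm(inv(A), inf);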
Convergence Result

Theorem (Carson & H, 2018). For IR in precisions u_r ≤ u ≤ u_f, if

  φ_i = 2 u_f min( cond(A), κ∞(A) µ_i ) + u_f θ_i

is sufficiently less than 1, the forward error is reduced on the ith iteration by a factor ≈ φ_i until an iterate x̂ satisfies

  ‖x − x̂‖∞ / ‖x‖∞ ≲ 4n u_r cond(A, x) + u.

The analogous standard bound would have µ_i = 1 and u_f θ_i = κ(A) u_f.
Precision Combinations

H = half, S = single, D = double, Q = quad. Combinations "u_f u u_r":

  Traditional:   HHS, HHD, HHQ, SSD, SSQ, DDQ
  1970s/1980s:   HHH, SSS, DDD, QQQ
  2000s:         HSS, HDD, HQQ, SDD, SQQ, DQQ
  3 precisions:  HSD, HSQ, HDQ, SDQ
Results for LU Factorization (1)

  uf  u  ur    κ∞(A)    Backward error      Forward error
                        norm     comp
  H   S  S     10^4     S        S          cond(A, x) · S
  H   S  D     10^4     S        S          S
  H   D  D     10^4     D        D          cond(A, x) · D
  H   D  Q     10^4     D        D          D
  S   S  S     10^8     S        S          cond(A, x) · S
  S   S  D     10^8     S        S          S
  S   D  D     10^8     D        D          cond(A, x) · D
  S   D  Q     10^8     D        D          D

(Entries give the order of the limiting error, S ≈ 10^-8, D ≈ 10^-16; the κ∞(A) column gives the assumed bound on the condition number.)
Results (2): HSD vs. SSD

(Same table as above: the HSD and SSD rows achieve the same limiting errors, but HSD uses a cheaper half precision factorization at the cost of requiring κ∞(A) ≤ 10^4.)

Can we get the benefit of "HSD" while allowing a larger range of κ∞(A)?
Extending the Range of Applicability

Recall that the convergence condition is

  φ_i = 2 u_f min( cond(A), κ∞(A) µ_i ) + u_f θ_i ≪ 1.

We need both terms to be smaller than κ∞(A) u_f. Recall that

  ‖d_i − d̂_i‖∞ / ‖d_i‖∞ ≤ u_f θ_i,
  µ_i ‖A‖∞ ‖x − x_i‖∞ = ‖A(x − x_i)‖∞ = ‖b − Ax_i‖∞ = ‖r_i‖∞.
Bounding µ_i

For a stable solver, in the early stages we expect

  ‖r_i‖ / (‖A‖ ‖x_i‖) ≈ u ≪ ‖x − x_i‖ / ‖x‖,

or equivalently µ_i ≪ 1. But close to convergence

  ‖r_i‖ / (‖A‖ ‖x_i‖) ≈ u ≈ ‖x − x_i‖ / ‖x‖,

or µ_i ≈ 1.

Conclude that µ_i ≪ 1 initially and µ_i → 1 as the iteration converges.
Bounding θ_i

u_f θ_i bounds the relative error in the solution of Ad_i = r_i. We need u_f θ_i ≪ 1.
Standard solvers cannot achieve this for very ill conditioned A!
Empirically observed by Rump (1990): if L̂ and Û are the computed LU factors of A from GEPP, then

  κ(L̂^{-1} A Û^{-1}) ≈ 1 + κ(A)u,

even for κ(A) ≫ u^{-1}.
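Rump's observation is easy to reproduce; a small MATLAB sketch (the test matrix and its target condition number are arbitrary choices, and with partial pivoting the computed factors satisfy P*A ≈ L̂Û, so it is P*A that is preconditioned):

  n = 100;
  A = gallery('randsvd', n, 1e18);   % 2-norm condition number about 1e18 >> 1/u
  [L, U, P] = lu(A);                 % GEPP: P*A ~ L*U
  M = (L \ (P*A)) / U;               % inv(L)*P*A*inv(U), approximately the identity
  fprintf('cond(A) = %.1e, cond(inv(L)*P*A*inv(U)) = %.1e\n', cond(A), cond(M))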
Preconditioned GMRES

To compute the updates d_i we apply GMRES to

  Ã d_i ≡ Û^{-1} L̂^{-1} A d_i = Û^{-1} L̂^{-1} r_i.

Ã is applied in twice the working precision.
κ(Ã) ≪ κ(A) typically.
Rounding error analysis shows we get an accurate d̂_i even for numerically singular A.
Call the overall algorithm GMRES-IR.
The GMRES convergence rate is not directly related to κ(Ã).
Cf. Kobayashi & Ogita (2015), who explicitly form Ã. (A sketch of one GMRES-IR step follows.)
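A minimal MATLAB sketch of GMRES-IR's update solve, not the authors' code: the factorization is done in single (u_f), everything else in double, and for simplicity the products with Ã are applied at the working precision rather than at twice it; the matrix, tolerance and iteration counts are illustrative.

  n = 100; A = gallery('randsvd', n, 1e12); b = randn(n,1);
  [L, U, P] = lu(single(A));                         % factorization at uf = single
  L = double(L); U = double(U); P = double(P);
  x = U \ (L \ (P*b));                               % initial solve
  for k = 1:3
      r = b - A*x;                                   % residual (ideally at ur = quad)
      Atilde = @(v) U \ (L \ (P*(A*v)));             % v -> U^{-1} L^{-1} P A v
      rhs = U \ (L \ (P*r));
      [d, flag] = gmres(Atilde, rhs, [], 1e-8, n);   % solve Atilde*d = rhs
      x = x + d;
  end
  fprintf('normwise backward error %.2e\n', ...
          norm(b - A*x, inf)/(norm(A, inf)*norm(x, inf) + norm(b, inf)))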
Benefits of GMRES-IR

Recall H = 10^-4, S = 10^-8, D = 10^-16, Q = 10^-34.

  Solver     uf  u  ur    κ∞(A)     Backward error     Forward error
                                    norm     comp
  LU         H   D  Q     10^4      D        D          D
  LU         S   D  Q     10^8      D        D          D
  GMRES-IR   H   D  Q     10^12     D        D          D
  GMRES-IR   S   D  Q     10^16     D        D          D

How many GMRES iterations are required? Some tests with 100 × 100 matrices . . .
Test 1: LU-IR, (uf, u, ur) = (S, D, D)

κ∞(A) ≈ 10^10, σ_i = α^i. Divergence!
[Plot: forward error (ferr), normwise backward error (nbe) and componentwise backward error (cbe) against refinement step.]
Test 1: LU-IR, (uf, u, ur) = (S, D, Q)

κ∞(A) ≈ 10^10, σ_i = α^i. Divergence!
[Plot: ferr, nbe and cbe against refinement step.]
Test 1: LU-IR, (uf, u, ur) = (S, D, Q)

κ∞(A) ≈ 10^4, σ_i = α^i. Convergence.
[Plot: ferr, nbe and cbe against refinement step.]
Test 1: GMRES-IR, (uf, u, ur) = (S, D, Q)

κ∞(A) ≈ 10^10, σ_i = α^i, GMRES iterations (2, 3). Convergence.
[Plot: ferr, nbe and cbe against refinement step.]
Test 2: GMRES-IR, (uf, u, ur) = (H, D, Q)

κ∞(A) ≈ 10^2, one small σ_i, GMRES iterations (8, 8, 8). Convergence.
[Plot: ferr, nbe and cbe against refinement step.]
Test 3: GMRES-IR, (uf, u, ur) = (H, D, Q)

κ∞(A) ≈ 10^12, σ_i = α^i, GMRES iterations (100, 100). Take x_0 = 0 because of overflow! Convergence.
[Plot: ferr, nbe and cbe against refinement step.]
Conclusions

Both low and high precision floating-point arithmetic are becoming more prevalent, in hardware and software.
We need a better understanding of the behaviour of algorithms in low precision arithmetic.
IR with LU in a lower precision can be twice as fast as traditional IR, albeit with a restriction on κ(A).
Judicious use of a little high precision can bring major benefits.
GMRES-IR converges when traditional IR doesn't, thanks to the preconditioned GMRES solution of Ad_i = r_i.
GMRES-IR handles κ∞(A) ≈ u^{-1}.
Further work: tune the convergence tolerance, alternative preconditioners, etc.
References I

H. Anzt, J. Dongarra, G. Flegar, N. J. Higham, and E. S. Quintana-Ortí. Adaptive precision in block-Jacobi preconditioning for iterative sparse linear system solvers. Concurrency Computat.: Pract. Exper., 2018.

G. Beliakov and Y. Matiyasevich. A parallel algorithm for calculation of determinants and minors using arbitrary precision arithmetic. BIT, 56(1):33–50, 2016.
References II

E. Carson and N. J. Higham. A new analysis of iterative refinement and its application to accurate solution of ill-conditioned sparse linear systems. SIAM J. Sci. Comput., 39(6):A2834–A2856, 2017.

E. Carson and N. J. Higham. Accelerating the solution of linear systems by iterative refinement in three precisions. SIAM J. Sci. Comput., 40(2):A817–A847, 2018.

M. Courbariaux, Y. Bengio, and J.-P. David. Training deep neural networks with low precision multiplications, 2015. ArXiv preprint 1412.7024v5.
References III

E. B. Davies. Approximate diagonalization. SIAM J. Matrix Anal. Appl., 29(4):1051–1064, 2007.

N. J. Dingle and N. J. Higham. Reducing the influence of tiny normwise relative errors on performance profiles. ACM Trans. Math. Software, 39(4):24:1–24:11, 2013.

N. J. Higham. Iterative refinement for linear systems and LAPACK. IMA J. Numer. Anal., 17(4):495–509, 1997.
References IV

N. J. Higham. Accuracy and Stability of Numerical Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, second edition, 2002. ISBN 0-89871-521-0. xxx+680 pp.

G. Khanna. High-precision numerical simulations on a CUDA GPU: Kerr black hole tails. J. Sci. Comput., 56(2):366–380, 2013.
References V

Y. Kobayashi and T. Ogita. A fast and efficient algorithm for solving ill-conditioned linear systems. JSIAM Lett., 7:1–4, 2015.

J. Langou, J. Langou, P. Luszczek, J. Kurzak, A. Buttari, and J. Dongarra. Exploiting the performance of 32 bit floating point arithmetic in obtaining 64 bit accuracy (revisiting iterative refinement for linear systems). In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, Nov. 2006.
References VI

D. Ma and M. Saunders. Solving multiscale linear programs using the simplex method in quadruple precision. In M. Al-Baali, L. Grandinetti, and A. Purnama, editors, Numerical Analysis and Optimization, number 134 in Springer Proceedings in Mathematics and Statistics, pages 223–235. Springer-Verlag, Berlin, 2015.
References VII

A. Roldao-Lopes, A. Shahzad, G. A. Constantinides, and E. C. Kerrigan. More flops or more precision? Accuracy parameterizable linear equation solvers for model predictive control. In 17th IEEE Symposium on Field Programmable Custom Computing Machines, pages 209–216, Apr. 2009.

K. Scheinberg. Evolution of randomness in optimization methods for supervised machine learning. SIAG/OPT Views and News, 24(1):1–8, 2016.