Numerical Performance of an Asynchronous Jacobi Iteration

J.M. Bull (1) and T.L. Freeman (2)*

(1) Centre for Novel Computing, Department of Computer Science, University of Manchester, Manchester, M13 9PL, UK.
(2) Department of Mathematics, University of Manchester, Manchester, M13 9PL, UK.

Abstract. We examine the effect of removing synchronisation points from a parallel implementation of a simple iterative algorithm, Jacobi's method for linear systems. We find that in some cases the asynchronous version requires fewer iterations to converge than its synchronous counterpart. We show that this behaviour can be explained in terms of the presence or absence of oscillations in the sequence of error vectors in the synchronous version, and that removing the synchronisation point can damp the oscillations.

1 Introduction

There are a number of problems in linear and nonlinear numerical algebra where the solution vector x* can be obtained by an iterative method. Such a method generates a sequence of vectors x(t), t = 1, 2, 3, ..., so that x(t) → x* as t → ∞. Typically x(t+1) = f(x(t)), where f(x) = (f_1(x), f_2(x), ..., f_n(x))^T is the iteration function. Such methods are suitable for implementation on a parallel computer, by assigning subsets of the elements of x to the different processors. On each iteration a processor updates its elements of x(t), broadcasts the results to the other processors and in turn receives the other elements of x(t) from the other processors. However, speedup is limited because of the synchronisation point in each iteration (to exchange the latest values of x(t)).

What happens if we remove the synchronisation point from the algorithm and allow it to run in an asynchronous (or chaotic) mode? In general this will affect both the efficiency of the algorithm and the necessary and sufficient conditions required to ensure convergence. As regards efficiency, we expect there to be a trade-off between a reduced time per iteration (by removing the synchronisation point) and a slower rate of convergence, resulting from the use of older data when computing the next iterate. We examine these issues for a simple iterative method, Jacobi's algorithm for the solution of a system of linear equations. We find that in some cases the convergence rate is improved by the switch to the asynchronous implementation. We present results for several linear systems, which are chosen to demonstrate that under some circumstances the removal of the synchronisation point damps oscillatory behaviour in the iteration.

* The second author acknowledges the support of the NATO Collaborative Research Grant 920037.

2 Parallel Jacobi algorithms

The classical Jacobi algorithm (denoted J) for the system of linear equations Ax = b, where A is an n × n, non-singular matrix, can be written as:

Choose a starting vector x(0)
for t = 1, 2, 3, ...,
    for i = 1, 2, ..., n,
        x_i(t) = \frac{1}{a_{ii}} \Big( b_i - \sum_{j=1}^{i-1} a_{ij} x_j(t-1) - \sum_{j=i+1}^{n} a_{ij} x_j(t-1) \Big)
    end
    if convergence test is satisfied then stop
end

If we let A = L + D + U, where L is strictly lower triangular, D is diagonal, and U is strictly upper triangular, then the algorithm can be written as

x(t) = M_J x(t-1) + D^{-1} b,

where M_J = -D^{-1}(L + U) is the Jacobi iteration matrix. A natural parallel version of this algorithm is obtained by assigning sets of the x_i to processors. At each step every processor computes the new values of its set of the x_i. It then broadcasts these to every other processor, and waits until it has received the new values of the x_i from every other processor before proceeding with the next step. Note that within each processor we can implement the algorithm in a Gauss-Seidel fashion by using any new values of the x_i that are available. This algorithm, denoted JLGS (Jacobi with local Gauss-Seidel), usually converges faster than the standard Jacobi algorithm (particularly when the number of processors p is small) at no extra cost. As described above, it is a synchronous version (which we will denote SJLGS) since, after each processor has broadcast, it blocks until it has received elements of x(t) from all the other processors. To obtain an asynchronous version of the algorithm (AJLGS), we simply remove this synchronisation point and allow the messages from the other processors to be received at any point in the iteration. Each processor proceeds with the computations, using the most recently received values of the elements of x when it requires them. Synchronous and asynchronous versions of the standard Jacobi algorithm (SJ and AJ respectively) can of course be derived in exactly the same way.

We mention here the conditions for convergence of the algorithms. For SJ the condition ρ(M_J) < 1, where M_J is the Jacobi iteration matrix and ρ(·) denotes the spectral radius of a matrix, is necessary and sufficient (see Section 10.1.2 of [5]). For AJ a sufficient condition is ρ(|M_J|) < 1, where |M_J| denotes the matrix whose (i, j) entry is the absolute value of the (i, j) entry of M_J (see Section 6.2 of [2], and [3], [4] and [6]). This condition also seems to be necessary in practice, even though it cannot be demonstrated formally (see Section 6.3.1 of [2], and [3]). The same results also apply to JLGS given a suitable redefinition of the iteration matrix.
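To make the splitting concrete, the following sketch (not the authors' Fortran implementation; the NumPy-based structure and helper names are ours) forms M_J = -D^{-1}(L + U) explicitly, runs the synchronous Jacobi update x(t) = M_J x(t-1) + D^{-1} b, and evaluates the two spectral-radius quantities discussed above.

```python
import numpy as np

def jacobi_iteration_matrix(A):
    """M_J = -D^{-1}(L + U) for the splitting A = L + D + U."""
    D = np.diag(np.diag(A))
    return -np.diag(1.0 / np.diag(A)) @ (A - D)

def spectral_radius(M):
    return np.max(np.abs(np.linalg.eigvals(M)))

def jacobi(A, b, x0, tol=4e-6, max_iter=10000):
    """Synchronous Jacobi: x(t) = M_J x(t-1) + D^{-1} b."""
    M_J = jacobi_iteration_matrix(A)
    c = b / np.diag(A)                  # D^{-1} b
    x = x0.copy()
    for t in range(1, max_iter + 1):
        x_new = M_J @ x + c
        if np.linalg.norm(x_new - x, np.inf) < tol:
            return x_new, t
        x = x_new
    return x, max_iter

def convergence_conditions(A):
    """rho(M_J) < 1 is necessary and sufficient for SJ;
    rho(|M_J|) < 1 is a sufficient condition for the asynchronous variant."""
    M_J = jacobi_iteration_matrix(A)
    return spectral_radius(M_J), spectral_radius(np.abs(M_J))
```

A parallel SJLGS/AJLGS code would distribute blocks of x across processors rather than forming M_J explicitly; the sketch only illustrates the iteration and the convergence conditions.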

3 Results

We now present the results of implementations of SJLGS and AJLGS on the Intel iPSC/2 and iPSC/860 hypercubes at the SERC Daresbury Laboratory. A diagonally dominant matrix A of dimension n was constructed by adding nI_n to a matrix with random entries uniformly distributed on [0, 1]. The vector b was chosen to ensure that the solution vector x* = A^{-1}b = (1, 1, ..., 1)^T. The initial vector x(0) was defined by (x(0))_i = 5i, i = 1, ..., n. Single precision (32 bit) arithmetic was used and the iteration was deemed to have converged when ||x(t) - x*||_∞ < 4 × 10^{-6}. Each implementation was run on 2, 4, 8 and 16 processors. Table 1 gives the number of iterations and total time required for problems of dimensions n = 128 and n = 512 on both the iPSC/2 and iPSC/860. Because AJLGS is a chaotic implementation the number of iterations can vary from one run to the next. The figures given here are the median number of iterations from a set of five runs.

Table 1. Number of iterations and total time required by SJLGS and AJLGS for n = 128 and n = 512 on the iPSC/2 and iPSC/860

                                                No. of processors
                                                2       4       8       16
iPSC/2    n = 128  SJLGS  No. of iterations     17      22      24      26
                          Time (ms)             946     694     424     357
                   AJLGS  No. of iterations     18      24      26      20
                          Time (ms)             945     712     465     316
          n = 512  SJLGS  No. of iterations     18      23      26      28
                          Time (ms)             15585   10094   5935    3521
                   AJLGS  No. of iterations     16      19      23      29
                          Time (ms)             13146   7959    5012    3464
iPSC/860  n = 128  SJLGS  No. of iterations     17      22      24      26
                          Time (ms)             152.1   120.6   51.2    53.2
                   AJLGS  No. of iterations     15      17      19      19
                          Time (ms)             142.7   89.0    43.2    44.6
          n = 512  SJLGS  No. of iterations     18      23      26      28
                          Time (ms)             2294.3  1506.2  906.7   582.5
                   AJLGS  No. of iterations     18      21      24      25
                          Time (ms)             2477.8  1465.0  875.5   526.4
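The test set-up described before Table 1 can be reproduced as in the following sketch. The random-number generator, seed and helper names are our assumptions; only the construction itself (A = nI_n plus uniform [0, 1] entries, b chosen so that x* is the vector of ones, (x(0))_i = 5i, single precision, tolerance 4 × 10^{-6}) is taken from the text.

```python
import numpy as np

def make_test_problem(n, seed=0):
    """Diagonally dominant test matrix: A = n*I_n + (random entries in [0, 1])."""
    rng = np.random.default_rng(seed)            # generator and seed are assumptions
    A = n * np.eye(n) + rng.uniform(0.0, 1.0, size=(n, n))
    x_star = np.ones(n)
    b = A @ x_star                               # so that A^{-1} b = (1, ..., 1)^T
    x0 = 5.0 * np.arange(1, n + 1)               # (x(0))_i = 5i
    # single precision (32 bit), as in the experiments
    return (A.astype(np.float32), b.astype(np.float32),
            x0.astype(np.float32), x_star.astype(np.float32))

def converged(x, x_star, tol=4e-6):
    """Convergence test used above: ||x(t) - x*||_inf < 4e-6."""
    return np.linalg.norm(x - x_star, np.inf) < tol
```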

The most interesting feature of these results is that the asynchronous version AJLGS often converges in fewer iterations than its synchronous counterpart. This is counterintuitive, since we might expect the asynchronous version, which is likely to be making use of less up-to-date information when computing its iterates, to converge more slowly. We seek an explanation for this behaviour in the next section. A second observation is that the asynchronous version often requires more time to execute each iteration: in Intel Fortran the overhead incurred in maintaining asynchronous communication can be larger than the synchronisation overhead of synchronous communication.

4 Oscillations of the errors

We suspect that the faster convergence rate of AJLGS is due to damping of oscillations of the iterates generated by the synchronous version of the algorithm about the solution x*. We note that this damping effect only appears if the time required to pass messages between processors is not much greater than the time required for each processor to perform its part of an iteration. To analyse this effect further we consider the general iteration

x(t+1) = M x(t) + c.

We will assume that ρ(M) < 1, ensuring convergence to a fixed point x*. If we define the error vector at step t by e(t) = x(t) - x*, then it is easy to show that e(t) = M^t e(0). We wish to measure the (degree of) oscillation in the series e(0), e(1), e(2), .... A natural measure is provided by the series

C(t) = \frac{e(t)^T e(t+1)}{\|e(t)\|_2 \, \|e(t+1)\|_2},  t = 0, 1, 2, ...,

where we note that C(t), t = 0, 1, 2, ..., are the cosines of the angles between successive error vectors. We restrict attention to the case when M and c are real. Let λ_1, λ_2, ..., λ_n be the eigenvalues of M, ordered so that |λ_1| ≥ |λ_2| ≥ ... ≥ |λ_n|, and let y_1, y_2, ..., y_n be the corresponding eigenvectors. There are two important cases:

1. λ_1 is real and |λ_1| > |λ_2|.
2. λ_1 is complex, λ_2 is its complex conjugate, and |λ_1| > |λ_3|.

Suppose further that e(0) ∈ span{y_1, y_2, ..., y_n}. (If M is diagonalisable then this will be true for any e(0).) Then

e(0) = \sum_{j=1}^{n} c_j y_j   and   e(t) = \sum_{j=1}^{n} c_j \lambda_j^t y_j.

In case 1, e(t) → c_1 λ_1^t y_1 as t → ∞. This is equivalent to applying a naive power method to M, and e(t) converges to a multiple of the eigenvector y_1. Thus C(t) → 1 if λ_1 > 0 (corresponding to no oscillations in the error terms) and C(t) → -1 if λ_1 < 0 (corresponding to the strongest possible oscillations in the error terms). Case 2 is more complicated, but it can be shown that

C(t) → \frac{\cos\theta + B \cos(2\theta t + \phi)}{[1 + D \cos(2\theta t + \psi)]^{1/2} [1 + E \cos(2\theta t + \chi)]^{1/2}},

where θ = arg(λ_1) and B, D, E, φ, ψ and χ depend on M and e(0) but not on t. In this case there is always some degree of oscillation in the error terms.

This analysis applies equally to J and JLGS given suitable definitions of the iteration matrix M. We will restrict our attention to J, however, because we can readily construct systems which display each type of behaviour. We now describe such systems, and present the results of solving them using both synchronous (SJ) and asynchronous (AJ) versions of the standard Jacobi algorithm.
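As an illustration of the measure C(t), the short sketch below (our own, with assumed helper names) generates the error sequence e(t) = M^t e(0) and evaluates the cosines between successive error vectors. For case 1 the values tend to +1 or -1 according to the sign of λ_1; for case 2 they settle into the periodic pattern given by the formula above.

```python
import numpy as np

def oscillation_cosines(M, e0, steps=50):
    """C(t) = e(t)^T e(t+1) / (||e(t)||_2 ||e(t+1)||_2) for e(t) = M^t e(0)."""
    errors = [np.asarray(e0, dtype=float)]
    for _ in range(steps):
        errors.append(M @ errors[-1])            # e(t+1) = M e(t)
    C = []
    for t in range(steps):
        num = errors[t] @ errors[t + 1]
        den = np.linalg.norm(errors[t]) * np.linalg.norm(errors[t + 1])
        C.append(num / den)
    return np.array(C)
```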

Let J_n be an n × n matrix with zero diagonal and ones in all off-diagonal positions. We define the following matrices:

P_α = αnI_n - J_n,   Q_α = αnI_n + J_n,   and   R_{α,β} = nS + J_n,

where S is a diagonal matrix with S_ii = α if i is odd and S_ii = β if i is even. If M_J(A) denotes the iteration matrix of the Jacobi algorithm applied to the linear system Ax = b, then it can be shown that M_J(P_α) has a dominant eigenvalue λ_1(M_J(P_α)) = (1 - 1/n)/α. Similarly λ_1(M_J(Q_α)) = -(1 - 1/n)/α. For certain choices of α and β, and for even n, M_J(R_{α,β}) has a complex pair of dominant eigenvalues.

In [1] numerical results are given for a linear system derived from the five-point finite-difference method for Laplace's equation on a rectangular grid. This system does not fit into any of the above categories as, although λ_1(M_J) is real and positive, λ_2(M_J) = -λ_1(M_J). In this case, C(t) → (c_1^2 - c_2^2)/(c_1^2 + c_2^2) as t → ∞. Since (c_1^2 - c_2^2)/(c_1^2 + c_2^2) can take any value between -1 and 1, the degree of oscillation of the iterates about the solution tends to a constant determined solely by the initial vector x(0).
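A sketch of the test matrices P_α, Q_α and R_{α,β} defined above (function names are ours); the final lines check, for one example, that the dominant eigenvalue of the Jacobi iteration matrix of P_α agrees with the value (1 - 1/n)/α quoted above.

```python
import numpy as np

def J_n(n):
    """Zero diagonal, ones in all off-diagonal positions."""
    return np.ones((n, n)) - np.eye(n)

def P(alpha, n):
    return alpha * n * np.eye(n) - J_n(n)

def Q(alpha, n):
    return alpha * n * np.eye(n) + J_n(n)

def R(alpha, beta, n):
    # S_ii = alpha if i is odd, beta if i is even (1-based indexing, as in the text)
    s = np.where(np.arange(1, n + 1) % 2 == 1, alpha, beta)
    return n * np.diag(s) + J_n(n)

def jacobi_iteration_matrix(A):
    D = np.diag(np.diag(A))
    return -np.diag(1.0 / np.diag(A)) @ (A - D)

n = 128
eigs = np.linalg.eigvals(jacobi_iteration_matrix(P(2.0, n)))
dominant = eigs[np.argmax(np.abs(eigs))]
print(dominant, (1.0 - 1.0 / n) / 2.0)   # both should be about 0.496
```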

5 Further results

Table 2 shows the results of solving linear systems with coefficient matrices P_{1.1}, P_{2.0}, Q_{1.1}, Q_{2.0} and R_{1.0,-10.0} for n = 128, using SJ and AJ on the Intel iPSC/2. The right-hand side vector b is chosen so that the solution x* = (1, 1, ..., 1)^T, and the initial vector is chosen so that C(t) = 1 for every t for P_α, and C(t) = -1 for every t for Q_α. Because these problems are not as well conditioned as those of Section 3, the convergence condition for the iteration is relaxed to ||e(t)||_∞ < 6 × 10^{-5}. We observe that for the cases where C(t) = -1 (Q_{1.1} and Q_{2.0}), and therefore the iterates of SJ are highly oscillatory, the number of iterations required by AJ is consistently less than is required by SJ, and substantially so when the convergence is slow (Q_{1.1}). Conversely, when C(t) = 1 (P_{1.1} and P_{2.0}) and there are no oscillations in the iterates of SJ, AJ requires more iterations than SJ. For the linear system R_{1.0,-10.0} x = b, AJ sometimes requires fewer iterations than SJ, sometimes more. This numerical evidence supports our hypothesis that the asynchronous version (AJ) is damping out oscillatory behaviour. Some further experimentation with small matrices confirms that the same effect is also present when the JLGS algorithm is used, although it is less marked.

The matrices of Section 3 have λ_1(M_J) real and negative, and so we expect that the iterates will oscillate and hence that convergence can be accelerated by using the asynchronous version of the algorithm. In many cases this behaviour is observed. The results of [1] show that AJ requires more iterations than SJ for the linear system derived from Laplace's equation. We must assume that in this case either x(0) was such that C(t) tended to a positive constant value, or the communication/computation ratio was high.


Table 2. Number of iterations and total time required by AJ and SJ for selected test problems on the iPSC/2.

Problem                                         No. of processors
                                                2       4       8       16
P_{1.1} x = b        SJ  No. of iterations      158     158     158     158
                         Time (ms)              8709    4842    2676    2052
                     AJ  No. of iterations      173     199     197     179
                         Time (ms)              10302   6509    3936    2910
P_{2.0} x = b        SJ  No. of iterations      24      24      24      24
                         Time (ms)              1323    734     407     311
                     AJ  No. of iterations      27      30      29      25
                         Time (ms)              1609    983     551     408
Q_{1.1} x = b        SJ  No. of iterations      156     156     156     156
                         Time (ms)              8598    4780    2643    2024
                     AJ  No. of iterations      36      46      62      48
                         Time (ms)              2143    1504    1176    782
Q_{2.0} x = b        SJ  No. of iterations      24      24      24      24
                         Time (ms)              1323    735     406     310
                     AJ  No. of iterations      18      23      20      15
                         Time (ms)              1073    751     382     242
R_{1.0,-10.0} x = b  SJ  No. of iterations      21      21      21      21
                         Time (ms)              1157    643     356     271
                     AJ  No. of iterations      18      22      19      15
                         Time (ms)              1072    719     362     243

6 Conclusions

The results of Section 3 show that an asynchronous version of Jacobi's method sometimes requires fewer iterations to converge than the corresponding synchronous version. This is contrary to the expectation that removing synchronisation points will cause older data to be used when computing iterates, and hence delay convergence. We have shown, using carefully designed test problems, that this accelerated convergence occurs when the errors in the synchronised Jacobi algorithm are oscillatory and that the removal of the synchronisation points has a damping effect. However, the details of the mechanism responsible for this are not yet well understood.

References

1. Baudet, G.M. (1978) Asynchronous iterative methods for multiprocessors. J. Assoc. Comp. Mach. 25, 226–244.
2. Bertsekas, D.P. and J.N. Tsitsiklis (1989) Parallel and Distributed Computation: Numerical Methods. Prentice Hall, New Jersey.
3. Bull, J.M. (1991) Asynchronous Jacobi iterations on local memory parallel computers. M.Sc. Thesis, University of Manchester, Manchester, UK.
4. Chazan, D. and W. Miranker (1969) Chaotic relaxation. Linear Algebra and its Applications 2, 199–222.
5. Golub, G.H. and C.F. Van Loan (1989) Matrix Computations (2nd edition). Johns Hopkins, Baltimore.
6. Li, L. (1989) Convergence of asynchronous iteration with arbitrary splitting form. Linear Algebra and its Applications 113, 119–127.