Scalable Stable Solvers for Non-symmetric Narrow-Banded Linear Systems

Peter Arbenz ([email protected])
Swiss Federal Institute of Technology (ETH), Institute of Scientific Computing, CH-8092 Zurich, Switzerland

Markus Hegland ([email protected])
Computer Sciences Laboratory, RSISE, ANU, Canberra ACT 0200, Australia
Abstract

Banded linear systems with large bandwidths can be solved by methods similar to those for full linear systems. In particular, parallel algorithms based on torus-wrap mapping and Gaussian elimination with partial pivoting have been used with success. These algorithms are not suitable, however, if the bandwidth is small, say, between 1 and 100. As the bandwidth limits the amount of parallelism available at any elimination step, severe load imbalance results if large numbers of processors are used for narrow-banded systems. Most of the parallel algorithms suggested so far for solving narrow-banded systems require some sort of diagonal dominance for stability reasons. Thus they can break down or, worse, they can generate results with very large errors. New algorithms with properties similar to those of the partial pivoting algorithm for uniprocessors have been studied in [3, 4]. However, in order to achieve stability, computational overhead is required, as in the uniprocessor case. As it is often hard to determine a priori whether pivoting is needed, it is suggested to use partial pivoting in any case. (Notice that this is the strategy of the dense linear solvers in LAPACK.) Of course, this is not the optimal choice if pivoting is not needed. The loss in efficiency is studied here for the case of narrow-banded linear systems. Practical experiments have been carried out on the Fujitsu AP1000 and on the Intel Paragon.
[Figure 1: Banded matrix]
1 Introduction

A banded linear system of equations

    \sum_{j=1}^{n} a_{ij} x_j = y_i,   i = 1, \ldots, n,

has a matrix A = [a_{ij}]_{i,j=1}^{n} with nonzero elements exclusively in a band enclosing the diagonal, see Figure 1. Banded linear systems occur naturally in the discretisation of one-dimensional boundary value problems or as building blocks of higher-dimensional problems. If the linear system is diagonally dominant, i.e., if

    |a_{ii}| > \sum_{j=1, j \ne i}^{n} |a_{ij}|,   i = 1, \ldots, n,
then it can be solved with Gaussian elimination, which consists of three steps:

1. LU factorisation of A,
2. solution of Lz = y (forward elimination),
3. solution of Ux = z (backward substitution).

Let the upper and lower half-bandwidths of the matrix A be denoted by k_u and k_l, respectively. Then it can be seen that the lower and upper triangular Gauss factors L and U are banded with bandwidths k_l and k_u, respectively. An upper bound for the number of additions for Gaussian elimination is (see also e.g. [6])

    ((k_l + 1)(k_u + 1) - 1) n,

and the same bound holds for multiplications. In addition, fewer than k_l n divisions are required. These bounds are asymptotically exact for large n.

If A is not diagonally dominant, then the LU factorisation may not exist or its computation may be unstable. Thus it is advisable to use partial pivoting with the elimination in this case. The corresponding factorisation is PA = LU, where P is the row permutation matrix. Pivoting additionally requires about k_l n comparisons and n row interchanges. More importantly, the bandwidth of U can get as large as k_l + k_u. (L loses its bandedness but still has only k_l + 1 nonzeros per column and can therefore be stored in its original place.) The wider the band of U, the higher the number of arithmetic operations, and our previous upper bound for the additions increases in the worst case to

    ((k_l + 1)(k_l + k_u + 1) - 1) n.

This bound is obtained by counting the additions for solving a banded system with lower and upper half-bandwidths k_l and k_l + k_u, respectively, without pivoting. The overhead introduced by pivoting, which may be as big as (k_l + k_u)/k_u, is unavoidable if one cannot guarantee stability of the LU factorisation. Therefore the methods for solving banded systems in packages like LAPACK incorporate partial pivoting and accept the overhead. Note that this overhead is particular to banded and sparse linear systems; it does not occur for dense matrices!

In Section 2 parallel Gaussian elimination for diagonally dominant systems will be discussed. In Section 3 it will be shown how this method can be generalised to include partial pivoting. The amount and nature of the pivoting overhead for the parallel case is further explored there.
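The effect of partial pivoting on the band structure is easy to observe numerically. The following small sketch (our illustration, not part of the original paper; it assumes NumPy and SciPy are available) factors a banded test matrix with scipy.linalg.lu and checks the claims above: for a diagonally dominant matrix no rows are exchanged and U keeps the upper bandwidth k_u, while in the general case the upper bandwidth of U may grow up to k_l + k_u and L retains at most k_l + 1 nonzeros per column.

import numpy as np
from scipy.linalg import lu

rng = np.random.default_rng(1)
n, kl, ku = 200, 3, 5

# random test matrix with lower half-bandwidth kl and upper half-bandwidth ku
band = np.triu(np.tril(rng.standard_normal((n, n)), ku), -kl)

def upper_bw(M, tol=1e-12):
    """Largest superdiagonal index carrying a nonzero entry."""
    i, j = np.nonzero(np.abs(M) > tol)
    return int((j - i).max())

for name, A in [("general banded", band),
                ("diagonally dominant", band + 2 * n * np.eye(n))]:
    P, L, U = lu(A)                      # A = P @ L @ U, partial pivoting
    nnz_L = np.count_nonzero(np.abs(L) > 1e-12, axis=0).max()
    print(f"{name:20s}: upper bandwidth of U = {upper_bw(U)} (bound kl+ku = {kl+ku}), "
          f"max nonzeros per column of L = {nnz_L} (bound kl+1 = {kl+1})")

On a typical run the general banded case reports an upper bandwidth of U close to k_l + k_u = 8, whereas the diagonally dominant case stays at k_u = 5.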
[Figure 2: Blocking of a banded matrix]
2 Parallel Gaussian elimination in the diagonally dominant case

All three steps of the Gaussian elimination procedure are recursive. Thus the exploitable parallelism is limited to the parallelism of each elimination step, which is k_l k_u for the LU factorisation and k_l and k_u, respectively, for the forward elimination and the back substitution. If this maximal amount of parallelism is exploited, the granularity of the independent tasks is very small, which leads to heavy performance losses due to parallel overhead costs for synchronisation and communication. Thus for any reasonable performance one requires larger granularity, and so the bandwidth has to be very large if this "inherent" parallelism is to be used [5].

There are, however, means by which the amount of parallelism in banded Gaussian elimination can be increased. In principle this is done by permuting the matrix. A useful permutation is based on blocking the band matrix A as in Figure 2. The blocking is characterised by 2p - 1 diagonal blocks, where p large blocks (p ≪ n) are interleaved with p - 1 blocks of size max(k_l, k_u). Note that this blocking reinterprets the band matrix as a block tridiagonal matrix.
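As a concrete illustration (ours, not from the paper), the 2p - 1 diagonal block sizes of such a blocking can be generated as follows; spreading the remainder of the division over the first large blocks is an arbitrary choice.

def figure2_blocking(n, p, kl, ku):
    """Diagonal block sizes for the blocking of Figure 2: p large blocks of roughly
    n/p rows interleaved with p - 1 separator blocks of size max(kl, ku)."""
    sep = max(kl, ku)
    big, rem = divmod(n - (p - 1) * sep, p)
    sizes = []
    for i in range(p):
        sizes.append(big + (1 if i < rem else 0))   # distribute the remainder
        if i < p - 1:
            sizes.append(sep)
    assert sum(sizes) == n and len(sizes) == 2 * p - 1
    return sizes

print(figure2_blocking(100, 4, 3, 5))   # [22, 5, 21, 5, 21, 5, 21]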
A block tridiagonal system can be solved by cyclic reduction (CR), which consists of a sequence of odd-even permutations, bringing A to the form displayed in Figure 3. The elimination consists of three steps:

1. factorisation of the odd diagonal blocks A_1, A_3, A_5, ...,
2. updating of the diagonal blocks A_2, A_4, ...,
3. forming the new off-diagonal blocks.

    \begin{bmatrix}
    A_1 &     &     & C_1 &     \\
        & A_3 &     & B_3 & C_3 \\
        &     & A_5 &     & B_5 \\
    B_2 & C_2 &     & A_2 &     \\
        & B_4 & C_4 &     & A_4
    \end{bmatrix}

Figure 3: Permuted block tridiagonal matrix

Note that the new "reduced" system is again block tridiagonal, now with square blocks of order max(k_l, k_u). It furthermore is diagonally dominant, so the above three-step procedure can be applied repeatedly, leading to smaller and smaller reduced systems. One obtains large overall granularity by choosing p relatively small compared to n. In this case the first elimination step is by far the most costly. This choice of granularity is a typical scalability consideration, which reflects the fact that in order to profit from high levels of parallelism one also requires large problems.
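The three steps can be carried out on a small example. The sketch below (our illustration, not the authors' implementation) stores the block tridiagonal matrix as dense NumPy blocks, with B_i and C_i coupling block row i to block rows i-1 and i+1 as in Figure 3; for simplicity it uses equal block sizes instead of the blocking of Figure 2, and it solves the reduced system directly rather than recursing.

import numpy as np

rng = np.random.default_rng(0)

def block_tridiag_from_band(Afull, b):
    """Cut a banded matrix into block tridiagonal form with uniform block size b.
    Block row i (1-based) holds B[i] (coupling to block i-1), A[i], C[i] (to i+1)."""
    m = Afull.shape[0] // b
    A, B, C = {}, {}, {}
    for i in range(1, m + 1):
        r = slice((i - 1) * b, i * b)
        A[i] = Afull[r, r].copy()
        if i > 1:
            B[i] = Afull[r, (i - 2) * b:(i - 1) * b].copy()
        if i < m:
            C[i] = Afull[r, i * b:(i + 1) * b].copy()
    return A, B, C, m

def one_cr_step(A, B, C, y, m):
    """One cyclic reduction step: (1) factorise the odd diagonal blocks, (2) update the
    even diagonal blocks, (3) form the new off-diagonal blocks and the reduced rhs."""
    Ar, Br, Cr, yr = {}, {}, {}, {}
    for j in range(2, m, 2):                 # even block rows survive
        lo, hi = j - 1, j + 1                # odd neighbours (both exist, m odd)
        b = A[lo].shape[0]
        # step 1: "factorise" A[lo], A[hi]; plain solves stand in for an LU that is reused
        Slo = np.linalg.solve(A[lo], np.column_stack([C[lo], y[lo]]))
        Shi = np.linalg.solve(A[hi], np.column_stack([B[hi], y[hi]]))
        # step 2: Schur-complement update of the even diagonal block and rhs
        Ar[j] = A[j] - B[j] @ Slo[:, :b] - C[j] @ Shi[:, :b]
        yr[j] = y[j] - B[j] @ Slo[:, b] - C[j] @ Shi[:, b]
        # step 3: new off-diagonal blocks coupling the surviving even blocks
        if lo > 1:
            Br[j] = -B[j] @ np.linalg.solve(A[lo], B[lo])
        if hi < m:
            Cr[j] = -C[j] @ np.linalg.solve(A[hi], C[hi])
    return Ar, Br, Cr, yr

# small diagonally dominant banded test problem, five blocks of size 4 (p = 3)
n, b, kl, ku = 20, 4, 2, 3                   # block size b >= max(kl, ku)
Afull = np.triu(np.tril(rng.standard_normal((n, n)), ku), -kl) + n * np.eye(n)
yfull = rng.standard_normal(n)

A, B, C, m = block_tridiag_from_band(Afull, b)
y = {i: yfull[(i - 1) * b:i * b] for i in range(1, m + 1)}
Ar, Br, Cr, yr = one_cr_step(A, B, C, y, m)

# The reduced system is again block tridiagonal (and diagonally dominant); the parallel
# algorithm would recurse on it, here it is simply assembled and solved directly.
evens = list(range(2, m, 2))
R = np.zeros((len(evens) * b, len(evens) * b))
r = np.zeros(len(evens) * b)
for a, j in enumerate(evens):
    R[a * b:(a + 1) * b, a * b:(a + 1) * b] = Ar[j]
    r[a * b:(a + 1) * b] = yr[j]
    if j in Br:
        R[a * b:(a + 1) * b, (a - 1) * b:a * b] = Br[j]
    if j in Cr:
        R[a * b:(a + 1) * b, (a + 1) * b:(a + 2) * b] = Cr[j]
x = dict(zip(evens, np.split(np.linalg.solve(R, r), len(evens))))

# back-substitution for the eliminated odd blocks
for i in range(1, m + 1, 2):
    rhs = y[i].copy()
    if i > 1:
        rhs -= B[i] @ x[i - 1]
    if i < m:
        rhs -= C[i] @ x[i + 1]
    x[i] = np.linalg.solve(A[i], rhs)

xfull = np.concatenate([x[i] for i in range(1, m + 1)])
print(np.allclose(xfull, np.linalg.solve(Afull, yfull)))   # True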
One can see that the number of additions required by cyclic reduction is asymptotically bounded by (see also [2])

    n \left( (k_l + k_u + 1)^2 - 1 \right).

A similar bound holds for the number of multiplications. The ratio of the number of operations required by CR to the number of operations required by banded Gaussian elimination is termed the redundancy, and one has approximately

    \mu_{n,p} = \frac{(k_l + k_u + 1)^2 - 1}{(k_l + 1)(k_u + 1) - 1}.

If k_u = \beta k_l, then \mu_{n,p} is bounded by (1 + \beta)^2 / \beta, a bound that can be arbitrarily large if k_l and k_u are very different. The redundancy is the price that has to be paid in order to get moderate levels of parallelism and high granularity at the same time. If we assume k_l ≈ k_u, the parallel speedup of this method (compared to the sequential algorithm) for p processors is limited by

    S_{n,p} \lesssim \frac{p}{4}.

The measured speedups on the Fujitsu AP1000 are displayed in Table 1 and those on the Intel Paragon in Table 2. The missing numbers could not be determined because the problems were too big to fit into the available memory. In this situation, to be able to give speedup numbers, one-processor times have been obtained by linear extrapolation of the times for problems with equal k_l and k_u but smaller n. Columns 2 and 3 in Table 2 indicate that this is a permissible approach.

   p   n = 20000, k_l = k_u = 10   n = 100000, k_l = k_u = 10   n = 100000, k_l = k_u = 50
   1        5825                        29125†                       300938†
   2        7042 (0.83)                    --                            --
   4        3537 (1.6)                  17621 (1.7)                      --
   8        1781 (3.3)                   8826 (3.3)                      --
  16         913 (6.4)                   4436 (6.6)                   56617 (5.3)
  32         474 (12)                    2244 (13)                    29270 (10)
  64         266 (22)                    1153 (25)                    15712 (19)
 128         168 (35)                     605 (48)                     9097 (33)

Table 1: Execution times in milliseconds (speedups) for parallel Gaussian elimination without pivoting on the Fujitsu AP1000. (†: one-processor time estimated)
   p   n = 20000, k_l = k_u = 10   n = 100000, k_l = k_u = 10   n = 100000, k_l = k_u = 50
   1        1102                         5504                        32726†
   2        1369 (0.80)                  6868 (0.80)                     --
   4         688 (1.6)                   3424 (1.6)                   22762 (1.4)
   8         349 (3.2)                   1715 (3.2)                   11420 (2.9)
  16         182 (6.1)                    861 (6.3)                    5765 (5.7)
  32         100 (11)                     438 (13)                     2971 (11)
  64          61 (18)                     229 (24)                     1594 (21)
 128          43 (25)                     126 (44)                      929 (35)

Table 2: Execution times in milliseconds (speedups) for parallel Gaussian elimination without pivoting on the Intel Paragon. (†: one-processor time estimated)

For small processor numbers the speedups obtained are clearly better than predicted. The reason for this is that the computations that make up the parallel overhead are executed mostly at higher Mflop/s rates than the LU factorisation [1]. The degradation of the speedup with higher processor numbers is due to the parallelisation overhead stemming from the solution of the reduced system, which grows like k^3 \log_2(p), k = max(k_l, k_u) [3].
3 Parallel partial pivoting

When partial pivoting is combined with CR, the amount of parallelism is reduced. This can be seen in Figure 3. Rows in the fourth block row, containing B_2 and C_2, may have to be exchanged with rows in the first and second block rows. Thus the second block cannot be eliminated independently of the first. This deficiency can be overcome by changing the blocking strategy and by allowing non-square diagonal blocks, thus exposing a block bidiagonal structure of the banded matrix, as displayed in Figure 4. The block bidiagonal structure features mostly lower triangular diagonal blocks which alternate in size between approximately n/p and k_l + k_u. The CR algorithm for block bidiagonal systems can now be applied [7]. Again it consists of a sequence of odd/even block permutation and elimination steps. The odd/even block permutation produces a block structure as displayed in Figure 5. The block structure is simpler than in the block tridiagonal case. However, the diagonal blocks are rectangular. This seems peculiar but does not cause any further problems for the factorisation, as partial pivoting is used. The important property of this permutation is that no new dependencies are introduced which would inhibit parallel partial pivoting.
[Figure 4: Block bidiagonal structure of a banded matrix]

    \begin{bmatrix}
    A_1 &     &     &     &     \\
        & A_3 &     & B_3 &     \\
        &     & A_5 &     & B_5 \\
    B_2 &     &     & A_2 &     \\
        & B_4 &     &     & A_4
    \end{bmatrix}

Figure 5: Odd/even permuted block-bidiagonal matrix
The steps required for the elimination are very similar to those of the previous case, including the factorisations and the forming of the reduced system, which is block bidiagonal again, so that cyclic reduction can be applied recursively. The addition count can be seen to be bounded by [3, 4]

    n (k_l + k_u) \left( 2(k_l + k_u) + 3 \right).                              (1)

A similar bound holds for multiplications. From (1) one sees that the redundancy factor for this case is

    \mu^{PP}_{n,p} = \frac{(k_l + k_u)(2(k_l + k_u) + 3)}{(k_l + 1)(k_l + k_u + 1) - 1},      (2)

which is bounded by 2(1 + \beta) for k_u = \beta k_l. If we assume again k_l ≈ k_u, the upper bound for the parallel speedup becomes

    S^{PP}_{n,p} \lesssim \frac{p}{4}.

Notice that \mu^{PP}_{n,p} as well as S^{PP}_{n,p} are worst-case bounds. So it is hard to say what one should expect in real situations. A particularly bad situation appears to occur if the sequential algorithm does not pivot while the parallel algorithm does. This actually happens if the pivoting parallel algorithm is used for solving a diagonally dominant system! Because of the rectangular blocking of this algorithm, cf. Figure 4, the main diagonal of the original matrix is no longer the main diagonal of the diagonal blocks A_{2i-1}, i > 1. In the j-th step of the factorisation of A_{2i-1} the parallel algorithm exchanges rows j and j + k_u. The upper triangular factor therefore has k_u off-diagonals. So, in the diagonally dominant case we have

    \tilde{\mu}^{PP}_{n,p} = \frac{(k_l + k_u)(k_l + 2 k_u + 3)}{(k_l + 1)(k_u + 1) - 1},

which is bounded by (1 + \beta)(1 + 2\beta)/\beta if k_u = \beta k_l. For \beta ≈ 1 this is not much higher than the estimate of equation (2). For large n and equal half-bandwidths the parallel speedup would be around p/6 instead of p/4.
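As a quick numerical illustration of these worst-case factors (our sketch, simply evaluating the redundancy expressions derived above for the half-bandwidths used in the experiments below):

def mu_cr(kl, ku):     # cyclic reduction without pivoting (Section 2)
    return ((kl + ku + 1) ** 2 - 1) / ((kl + 1) * (ku + 1) - 1)

def mu_pp(kl, ku):     # pivoting CR vs. sequential banded GE with pivoting, eq. (2)
    return (kl + ku) * (2 * (kl + ku) + 3) / ((kl + 1) * (kl + ku + 1) - 1)

def mu_pp_dd(kl, ku):  # pivoting CR applied to a diagonally dominant system
    return (kl + ku) * (kl + 2 * ku + 3) / ((kl + 1) * (ku + 1) - 1)

for kl, ku in [(10, 10), (50, 50)]:
    print(kl, ku, round(mu_cr(kl, ku), 2), round(mu_pp(kl, ku), 2), round(mu_pp_dd(kl, ku), 2))
# (10, 10): 3.67  3.74  5.5     (50, 50): 3.92  3.94  5.88
# i.e. close to the asymptotic values 4, 4 and 6, hence speedup limits of about p/4 and p/6.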
   p   n = 20000, k_l = k_u = 10   n = 100000, k_l = k_u = 10   n = 100000, k_l = k_u = 50
   1        4923                        24615†                       295688†
   2        7633 (0.6)                     --                            --
   4        4213 (1.2)                  17383 (1.4)                      --
   8        2153 (2.3)                   9297 (2.6)                      --
  16        1133 (4.4)                   5265 (4.7)                      --
  32         627 (7.9)                   2699 (9.1)                   57507 (5.1)
  64         389 (13)                    1421 (17)                    36840 (8.0)
 128         285 (17)                     791 (31)                    28384 (10)

Table 3: Execution times in milliseconds (speedups) for parallel Gaussian elimination with pivoting on the Fujitsu AP1000. The system is diagonally dominant. (†: one-processor time estimated)
   p   n = 20000, k_l = k_u = 10   n = 100000, k_l = k_u = 10   n = 100000, k_l = k_u = 50
   1        1330                         6650                        37710†
   2        1306 (1.0)                   6524 (1.0)                      --
   4         665 (2.0)                   3276 (2.0)                      --
   8         341 (3.9)                   1653 (4.0)                   22422 (1.7)
  16         184 (7.2)                    843 (7.9)                   12184 (3.1)
  32         128 (10)                     440 (15)                     7242 (5.2)
  64          83 (16)                     240 (28)                     4994 (7.6)
 128          63 (21)                     139 (48)                     4065 (9.3)

Table 4: Execution times in milliseconds (speedups) for parallel Gaussian elimination with pivoting on the Intel Paragon. The system is diagonally dominant. (†: one-processor time estimated)

In Tables 3 and 4 the speedups actually obtained are displayed for implementations of the algorithm described above on the Fujitsu AP1000 and the Intel Paragon, respectively. The underlying system of equations was the same diagonally dominant one that yielded the numbers in Tables 1 and 2. The same remarks as at the end of the previous section apply. The operations in the parallel overhead are executed in even larger blocks; this may be the reason for the better speedups than in Tables 1 and 2. The speedups are even so high that the absolute execution times are shorter with the pivoting algorithm. However, as the size of the reduced system is larger, a stronger degradation of the speedup is observed with increasing processor numbers.
   p   n = 20000, k_l = k_u = 10   n = 100000, k_l = k_u = 10   n = 100000, k_l = k_u = 50
   1        7451                        37255†                           --
   2        9772 (0.8)                     --                            --
   4        4932 (1.5)                  24501 (1.5)                      --
   8        2541 (2.9)                  12305 (3.0)                      --
  16        1354 (5.5)                   6233 (6.0)                      --
  32         780 (9.6)                   3219 (12)                    78497
  64         498 (15)                    1731 (22)                    53079
 128         380 (20)                     997 (37)                    43261

Table 5: Execution times in milliseconds (speedups) for parallel Gaussian elimination with pivoting on the Fujitsu AP1000. The system is not diagonally dominant. (†: one-processor time estimated)
   p   n = 20000, k_l = k_u = 10   n = 100000, k_l = k_u = 10   n = 100000, k_l = k_u = 50
   1        1941                         9709                        62761†
   2        1491 (1.3)                   7444 (1.3)                      --
   4         755 (2.6)                   3742 (2.6)                      --
   8         387 (5.0)                   1881 (5.16)                  27634 (2.3)
  16         206 (9.4)                    953 (10)                    14759 (4.3)
  32         119 (16)                     491 (20)                     8524 (7.4)
  64          77 (25)                     264 (37)                     5597 (11)
 128          60 (32)                     153 (63)                     4325 (15)

Table 6: Execution times in milliseconds (speedups) for parallel Gaussian elimination with pivoting on the Intel Paragon. The system is not diagonally dominant. (†: one-processor time estimated)

In Tables 5 and 6 we present execution times and speedups for the same system as before, but with the large diagonal elements replaced by zeros. This enforces pivoting in the serial as well as in the parallel solver. Not surprisingly, the execution times increased compared with those of Tables 3 and 4. What is surprising is the considerable improvement of the speedups.
4 Conclusion

Both from theoretical considerations and from the timings one can see that the overhead of pivoting seems to be about the same for sequential and parallel algorithms for the solution of banded linear systems. Thus, if one is prepared to pay this overhead in the sequential case (and this is what is usually done), one should do so in the parallel case too. Algorithms with restricted pivoting do not provide advantages that would justify their usage and the resulting exposure to instability. Our timings indicate that the pivoting algorithm may be chosen in all cases in which solving the reduced system is negligible compared with the local computations. This is the case if p is small compared with n.

Often routines are suggested which include pivoting but do so only "locally", exchanging only rows which reside on the same processor. This amounts to doing the blocking into a block tridiagonal system but factorising the diagonal blocks with LU factorisation with partial pivoting. It can be seen that the number of additions of such a procedure is bounded by

    n (k_l + k_u) \left( 2(k_l + k_u) + 3 \right),

which leads to a redundancy of 3 instead of 4 for the parallel algorithm, and we suggest that this might not be worth the risk one takes of getting a wrong result.
References

[1] P. Arbenz, On experiments with a parallel direct solver for diagonally dominant banded linear systems, in Euro-Par '96, L. Bouge, P. Fraigniaud, A. Mignotte, and Y. Robert, eds., Springer-Verlag, Berlin, 1996, pp. 11-21. (Lecture Notes in Computer Science, 1124).

[2] P. Arbenz and W. Gander, A survey of direct parallel algorithms for banded linear systems, Tech. Report 221, ETH Zurich, Computer Science Department, November 1994. Available at URL http://www.inf.ethz.ch/publications/tr200.html.

[3] P. Arbenz and M. Hegland, The stable parallel solution of general narrow banded linear systems, Tech. Report 252, ETH Zurich, Computer Science Department, November 1996. Available at URL http://www.inf.ethz.ch/publications/tr200.html.

[4] P. Arbenz and M. Hegland, The stable parallel solution of general narrow banded linear systems, in Proceedings of the Eighth SIAM Conference on Parallel Processing for Scientific Computing, M. Heath et al., eds., Philadelphia, PA, 1997. 8 pages (compact disk).

[5] A. Gupta, F. G. Gustavson, M. Joshi, and S. Toledo, The design, implementation, and evaluation of a banded linear solver for distributed-memory parallel computers, Research Report RC 20481, IBM T. J. Watson Research Center, Yorktown Heights, NY, June 1996.

[6] G. H. Golub and C. F. Van Loan, Matrix Computations, The Johns Hopkins University Press, Baltimore, MD, 2nd ed., 1989.

[7] M. Hegland, Divide and conquer for the solution of banded linear systems of equations, in Proceedings of the Fourth Euromicro Workshop on Parallel and Distributed Processing, IEEE Computer Society Press, Los Alamitos, CA, 1996, pp. 394-401.