Efficient Parallelization of Relaxation Iterative Methods for Banded Linear Systems

Pedro Diniz and Tao Yang
Department of Computer Science
University of California at Santa Barbara
Santa Barbara, CA 93106
{pedro,tyang}@cs.ucsb.edu

December 1994

Abstract

In this paper we present an efficient parallel implementation of relaxation iterative methods, such as Gauss-Seidel (GS) and Successive-Over-Relaxation (SOR), for solving banded linear systems on distributed memory machines. We introduce a novel partitioning and scheduling scheme in our implementation which allows perfect overlapping of computation with communication, hence minimizing latency effects. We provide analytic and experimental results to verify the effectiveness of our approach and discuss its incorporation in other iterative solvers. Experiments on nCUBE and Intel Paragon machines for several sample banded matrices show that the proposed approach yields good performance on these machines.

Keywords: Partitioning, Scheduling, SOR, Banded Linear Systems. AMS(MOS) subject classifications: 68M20; 65Y05, 65F10.

The work of the first author was supported by JNICT - Junta Nacional de Investigação Científica e Tecnológica and by the Fulbright program. The second author is supported by NSF-CCR-9409695 and a startup fund from UCSB.


1 Introduction

Iterative methods can be used to solve large banded linear systems on concurrent processors. These methods range from the complex Conjugate-Gradient (CG) with preconditioning to the simple Successive-Over-Relaxation (SOR). Due to their simplicity, SOR methods are still used in solving practical engineering problems [2, 5]. In addition, they can also be valuable for obtaining quick approximations of the solution in intermediate steps of more powerful methods (e.g. the preconditioned Conjugate-Gradient), or can even replace direct methods when parallelizing applications [10]. In this paper we concentrate on the implementation of the SOR method on message-passing multiprocessors.

The main difficulty in efficiently parallelizing the SOR methods stems from the intra-iteration data dependencies (cf. the Jacobi method), as iterate values depend on computed values of the same iteration. A possible approach, taken in the multi-coloring scheme, is to eliminate some of these dependencies by performing variable relabeling (coloring) [9]. This scheme increases the amount of exploitable parallelism since nodes assigned the same color can be computed concurrently. Adams and Jordan [1] have shown these methods to be numerically equivalent to performing several SOR iterations computed simultaneously in the absence of convergence tests. However, in order to preserve inter-color data dependencies, processors must communicate between the computation phases associated with each color. This constitutes a drawback in the implementation of multicoloring schemes on message-passing machines, as communication and computation cannot be overlapped in order to hide communication latency.

In this paper we examine the parallelization of SOR iterations and design a partitioning and scheduling method such that computation can be perfectly overlapped with communication to hide the communication latency. Our work uses the idea of overlapping the computation among several iterations, but also carefully designs the execution ordering and the synchronization points for data communication. We provide analytic and experimental results to verify the effectiveness of our approach and compare it with multicoloring schemes. Experimental results show this scheme to deliver good performance and scalability on nCUBE and Paragon distributed-memory multiprocessors.

This paper is organized as follows. In Section 2 we formulate the application of the SOR method to solving banded linear systems. Section 3 introduces the difficulties in implementing such a method on message-passing machines, in particular discussing the issues of data mapping and computation/communication scheduling. Section 4 presents the partitioning and scheduling scheme proposed in our algorithm. Section 5 gives an analysis of the performance of this algorithm under synchronous and asynchronous communication models. Section 6 describes the experiments, comparing the performance of our algorithm against a naive implementation of the SOR algorithm as well as a multicoloring approach. In this section we also discuss the applicability of our scheme to other iterative solvers. Conclusions and future work are presented in Section 7.

2 Iterative Relaxation Methods

We consider the Successive-Over-Relaxation (SOR) iterative method for solving $Ax = b$ when $A$ is $n \times n$ banded with bandwidth $w$ and thus half bandwidth $h = (w-1)/2$. Figure 1 below illustrates a banded linear system for $n = 8$ and $w = 5$.

$$
\begin{pmatrix}
a_{11} & a_{12} & a_{13} &        &        &        &        &        \\
a_{21} & a_{22} & a_{23} & a_{24} &        &        &        &        \\
a_{31} & a_{32} & a_{33} & a_{34} & a_{35} &        &        &        \\
       & a_{42} & a_{43} & a_{44} & a_{45} & a_{46} &        &        \\
       &        & a_{53} & a_{54} & a_{55} & a_{56} & a_{57} &        \\
       &        &        & a_{64} & a_{65} & a_{66} & a_{67} & a_{68} \\
       &        &        &        & a_{75} & a_{76} & a_{77} & a_{78} \\
       &        &        &        &        & a_{86} & a_{87} & a_{88}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \\ x_8 \end{pmatrix}
=
\begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \\ b_6 \\ b_7 \\ b_8 \end{pmatrix}
$$

Figure 1: Example of a banded linear system ($n = 8$, $w = 5$).

For such systems the SOR iteration below defines the value of a given unknown's approximation at iteration $k+1$ (denoted by $x_i^{k+1}$), where we have assumed $a_{ii} \neq 0$ and $a_{ij} = 0$ for $j < 1$ or $j > n$:

$$x_i^{k+1} = (1 - \omega)\,x_i^k + \frac{\omega}{a_{ii}}\left(b_i - \sum_{j=i-h}^{i-1} a_{ij}\,x_j^{k+1} - \sum_{j=i+1}^{i+h} a_{ij}\,x_j^k\right). \tag{1}$$

Convergence conditions for the SOR as well as for the special case of Gauss-Seidel methods are extensively analyzed by Young [12] and thus omitted here. From the above formulation several remarks are in order:

1. The value of a given unknown's iteration approximation depends only on the values of the unknowns within an index distance of the half bandwidth: one set of such unknowns from the current iteration and another set from the previous iteration. Formally, $x_i^{k+1}$ depends on $x_i^k$, on $x_j^{k+1}$ for $j = i-h, \ldots, i-1$, and on $x_l^k$ for $l = i+1, \ldots, i+h$.

2. Instead of a single matrix element, each non-null $a_{ij}$ can be viewed as a matrix subblock (with some zero elements for blocks near the band boundary). Under this scheme, scalar inversion would be replaced by a matrix solve and each $x_i$ would represent a set of unknowns. This blocking scheme has been used in previous research (e.g. [10]) as a means to improve the locality, both temporal and spatial, of memory accesses, in order to take advantage of the memory hierarchy organization.

Each SOR iteration can be viewed as a set of tasks to be executed with precedence constraints arising from the underlying data dependences. Let the task denoted by $T_{i,j}$ ($i \neq j$) compute the multiplication of $a_{i,j}$ and $x_j$. Task $T_{i,i}^k$ performs the computation to derive the new iteration value $x_i^{k+1}$ based on the old value $x_i^k$, on $b_i$, and on the two partial sums from $T_{i,i-h}^{k+1}, \ldots, T_{i,i-1}^{k+1}$ and from $T_{i,i+1}^k, \ldots, T_{i,i+h}^k$. Figure 2 below illustrates the task dependences, in the form of a task dependence graph (TDG), for a generic banded matrix system with half bandwidth $h$ and for a generic variable $x_i$. The edges labeled 1 represent data dependences across iterations (dependence distance 1).

[Figure 2: tasks $T_{i,i-h}, \ldots, T_{i,i-1}$ and $T_{i,i+1}, \ldots, T_{i,i+h}$ feed $T_{i,i}$ through edges labeled 0; $T_{i,i}$ in turn feeds $T_{i-h,i}, \ldots, T_{i-1,i}$ through edges labeled 1 and $T_{i+1,i}, \ldots, T_{i+h,i}$ through edges labeled 0.]

Figure 2: Task Dependence Graph (TDG) for a generic system of bandwidth $w = 2h + 1$.

A sequential implementation of this method simply loops through the matrix rows computing the new iterate values. The data dependencies of the algorithm are automatically satisfied, as $T_{i,j}^k$ is always executed before $T_{i,i}^k$ and $T_{i,i}^k$ is performed before $T_{i,j}^{k+1}$, i.e. cross-iteration dependencies are respected. Denoting the cost of a floating point operation by $c$, the total sequential cost per iteration is given by equation (2):

$$T_{seq} = 2n(h+2)c. \tag{2}$$
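For concreteness, a minimal sequential sketch of one such banded SOR iteration is given below (in C, the language of the experimental codes). The band storage layout (row-major, $2h+1$ entries per row, diagonal at offset $h$) is an assumption made for illustration, not the paper's data structure.

    #include <stddef.h>

    /* One SOR iteration over a banded n x n system. The band of A is stored
     * row-major: a[i*(2h+1) + (h+d)] holds a_{i,i+d} for -h <= d <= h.
     * x is updated in place, so entries with index below i already hold the
     * new iterates when row i is processed, exactly as in equation (1). */
    void sor_banded_iteration(int n, int h, double omega,
                              const double *a, const double *b, double *x)
    {
        int w = 2 * h + 1;
        for (int i = 0; i < n; i++) {
            double s = b[i];
            for (int d = -h; d <= h; d++) {
                int j = i + d;
                if (d == 0 || j < 0 || j >= n)
                    continue;
                s -= a[(size_t)i * w + (size_t)(h + d)] * x[j];
            }
            x[i] = (1.0 - omega) * x[i] + omega * s / a[(size_t)i * w + h];
        }
    }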

3 Message-Passing Multiprocessor Implementation

We now describe a naive implementation of the SOR iterations and a multi-coloring scheme on multiprocessors. For both we provide a simple analysis based on the communication cost model presented below. We then introduce the idea of our approach.

3.1 Communication Model

We will consider two communication models, an asynchronous and a synchronous one. Under the first model, sending a message causes no delay to the sending processor, whereas in the synchronous case the sender blocks until the message has been delivered at its destination. In either model the transmission time is modeled as $\alpha + \beta m$, where $\alpha$ is the message startup time, $\beta$ is the transmission and buffering cost per numeric element, and $m$ is the message length in number of numeric elements. Receiving a message blocks the receiving processor until the message is delivered. The computation cost per element, either a floating point add or multiply, is modeled by $c$. Under either model we do not consider communication cost effects due to long messages (e.g. contention) or due to non-neighboring communication. This approximation simplifies the analysis and is supported by current multiprocessor implementations, as contention-free networks and wormhole routing have become common.

3.2 Naive implementation

We assign each processor $r = n/p$ consecutive rows of the system matrix, as well as the corresponding components of both the solution vector and the right-hand side vector, as discussed above. In the absence of convergence tests, processors may simultaneously compute values of different iterations, as depicted in Figure 3 by consecutive parallelogram-shaped computations. While processor $p_i$ is in iteration $k$, processor $p_{i+1}$ is still in iteration $k-1$. Computations are labeled with the iteration they refer to.

[Figure 3: processors 1-4 (horizontal axis) against time (vertical axis); staggered parallelogram-shaped computations labeled k-1, k and k+1 show consecutive SOR iterations proceeding concurrently on different processors.]

Figure 3: Overlapping of several SOR iterations.

In a naive implementation of the SOR iterations each processor computes the new iteration values of its components. As soon as the first $h$ rows are processed, the corresponding values are sent to the "upper" processor in a single message. For the last $h$ rows, however,

after having sent the new values, each processor waits for the "lower" processor to reply with a message containing updated information about its values. Figure 4 illustrates the communication dependences in this naive algorithm for each of the parallelogram sections. Task $T_1$ corresponds to the computation associated with the first $h$ of the $r$ rows assigned to each processor. $T_2$ and $T_3$ correspond to the next $r - 2h$ and the last $h$ rows respectively.

[Figure 4: the r local rows are split into $T_1$ (first h rows), $T_2$ (next r-2h rows) and $T_3$ (last h rows); messages M are sent after $T_1$ and around $T_3$.]

Figure 4: Processor computation partitioning and scheduling for the naive algorithm corresponding to each of the parallelogram sections of Figure 3.

For both communication models, and because each processor has to wait for the next one to reply with the updated iterate values, the parallel time per iteration of each computing section is given by equation (3) below:

$$T_{par}^{naive} = 2r(h+1)c + 4(\alpha + \beta h) \approx wrc + 4\alpha + 2\beta w. \tag{3}$$

Despite the simplicity of the algorithm, the synchronization for the last $h$ rows introduces an unnecessary waiting time while the "lower" processor computes its $T_1$ section. We expect this to be a limiting factor for large values of the half bandwidth $h$, as the computational cost of $T_1$ is $O(h^2)$.
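A hedged sketch of this naive per-section schedule follows, written against MPI rather than the native nCUBE/Paragon primitives used in the experiments; the halo layout and the helper sor_update_rows are assumptions introduced here for illustration.

    #include <mpi.h>

    /* Hypothetical helper: applies the SOR update of equation (1) to the
     * local rows [lo, hi), reading neighbor values from the halos. */
    extern void sor_update_rows(double *loc, int lo, int hi);

    /* One parallelogram section of the naive schedule on processor me,
     * 0 <= me < p. x has layout [upper halo (h) | r rows | lower halo (h)]. */
    void naive_section(int me, int p, int r, int h, double *x)
    {
        double *loc = x + h;
        MPI_Status st;
        sor_update_rows(loc, 0, h);              /* T1: first h rows */
        if (me > 0)                              /* ship them "up" */
            MPI_Send(loc, h, MPI_DOUBLE, me - 1, 0, MPI_COMM_WORLD);
        sor_update_rows(loc, h, r - h);          /* T2: middle rows */
        sor_update_rows(loc, r - h, r);          /* T3: last h rows */
        if (me < p - 1) {                        /* ship "down", then block */
            MPI_Send(loc + r - h, h, MPI_DOUBLE, me + 1, 0, MPI_COMM_WORLD);
            MPI_Recv(loc + r, h, MPI_DOUBLE, me + 1, 0, MPI_COMM_WORLD, &st);
        }
        if (me > 0)                              /* refresh the upper halo */
            MPI_Recv(x, h, MPI_DOUBLE, me - 1, 0, MPI_COMM_WORLD, &st);
    }

The blocking receive right after the second send is exactly the unnecessary wait discussed above: nothing useful is computed while the "lower" processor finishes its $T_1$.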

3.3 Multicoloring implementation

The multicoloring technique relies on the ability to label the nodes of the physical domain so as to eliminate some of the intra-iteration dependencies of the iterative method used to compute them. The labeling of the variables is the key factor, yielding matrices with diagonal blocks [9]. The unknowns associated with these diagonal blocks are assigned the same color and can be computed in parallel. After a global exchange of the values of the variables of a given color, the sweep for the variables of another color can proceed, implicitly overlapping multiple SOR iterations. Adams and Jordan explored this idea for several regular stencils over square domains. Figure 5 below shows a coloring corresponding to the banded matrix of Figure 1.

[Figure 5: the unknowns $x_1, \ldots, x_8$ of Figure 1 laid out on the physical domain with a coloring such that same-colored nodes are mutually independent.]

Figure 5: Multicoloring of a physical domain corresponding to the banded matrix of Figure 1.

Assume that the colored variables are evenly distributed among the $p$ processors. After the computation for one color, the new values of the variables with this color are exchanged between processors. The number of such variables per processor is about $\frac{n}{(h+1)p}$. There are $(h+1)$ communication steps, yielding a parallel time per iteration given by equation (4):

$$T_{par}^{multicoloring} = \frac{2nc(2h+1)}{p} + (h+1)\left(\alpha + \beta\,\frac{n}{(h+1)p}\right). \tag{4}$$

Adams and Jordan [1] have shown multicoloring methods to be equivalent to overlapping multiple SOR iterations. These methods are effective if the communication overhead is negligible. In the presence of non-zero communication costs, as is the case on current message-passing machines [3], multicoloring methods exhibit significant communication costs, making them less effective. Experimental results presented in Section 6 for the tested matrix systems do show the communication costs to be significant.
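To illustrate the drawback, a minimal sketch of one multicolored iteration follows (assuming MPI and an all-gather style exchange; the helper update_color is hypothetical): every color phase ends in a blocking exchange, leaving no independent computation to hide its latency.

    #include <mpi.h>

    /* Hypothetical helper: SOR-updates the local unknowns of one color. */
    extern void update_color(int color, double *x_mine);

    /* One multicolored SOR iteration; nloc local unknowns per processor.
     * Each of the ncolors phases pays a startup cost before the next color
     * can start, so communication and computation never overlap. */
    void multicolor_sor_iteration(int ncolors, int nloc,
                                  double *x_mine, double *x_all)
    {
        for (int color = 0; color < ncolors; color++) {
            update_color(color, x_mine);
            MPI_Allgather(x_mine, nloc, MPI_DOUBLE,
                          x_all, nloc, MPI_DOUBLE, MPI_COMM_WORLD);
        }
    }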

3.4 A New Pipelined Approach

From the discussion in the previous two sections one can draw the following remarks. First, multicoloring methods implicitly overlap multiple iterations, but fail to exploit more parallelism due to the inter-color dependencies. Second, the naive implementation of the SOR algorithm incurs unnecessary waiting time at the end of its computation phase. This suggests an improved algorithm obtained by modifying the way the computation and communication steps are performed and scheduled. This new approach, the overlapped algorithm, described in the next section, exploits both the idea of overlapping successive SOR iterations and the overlapping of computation with communication. This is achieved by scheduling the communication and computation differently while still respecting the algorithm's data dependencies.

4 The Overlapped Algorithm

The data mapping in this approach is the same as in the naive implementation of the SOR iterations. A row-wise computation along the full bandwidth of the matrix within each processor, as suggested by a naive interpretation of the SOR iteration definition above, would prolong the overall length of each processor's computation due to the data dependencies at the boundary rows. Instead we reorder the way the computation for each iterate is done so as to allow the overlap of computation with communication. As a result, the proposed algorithm is numerically equivalent to the sequential implementation of the SOR algorithm. In our algorithm the computation of each iteration at each processor $p_i$ is divided into five tasks ($T_1, T_2, T_3, T_4, T_5$) along the band of the matrix, based on the inter-processor data dependencies, as depicted in Figure 6. Edges labeled M represent messages sent between processors.

Due to the banded structure, and imposing $r \geq 2h$, communication is restricted to the processors assigned to the adjacent sets of matrix rows, named respectively the "lower" and "upper" processors. Furthermore, careful data mapping allows communication to be kept in a linear pattern, hence between physically neighboring processors. Notice that if $r < 2h$ the basic scheduling scheme is still applicable, but now at the expense of more communication messages, not only between neighboring processors but between logical processors at distance $2, \ldots, \lceil \frac{2h}{r} \rceil$.

[Figure 6: the parallelogram section is partitioned into $T_1$ (first h rows), $T_2$ (next r-2h rows) and $T_3$ (last h rows), with $T_4$ and $T_5$ covering the remaining off-diagonal tasks executed after the sends; messages M are attached to the $T_1$ and $T_3$/$T_5$ boundaries.]

Figure 6: Processor computation partitioning and scheduling for the overlapped algorithm corresponding to each parallelogram computing section.

Each processor repeatedly executes the partitioned tasks in the order $T_1, T_2, T_3, T_4$ and $T_5$, as depicted in Figure 7. Communication occurs when $T_1$ uses the data from $T_3$ at processor $p_{i-1}$, and when $T_1$ produces data used in $T_5$ of processor $p_{i-1}$. $T_2$, $T_3$ and $T_4$ only need local data. Computation and communication can be overlapped, as $T_4$ in processor $p_i$ may be computed while processor $p_{i+1}$ executes $T_1$ and sends a message back.

[Figure 7: time lines for processors $p_{i-1}$, $p_i$ and $p_{i+1}$, each repeating $T_1\,T_2\,T_3\,T_4\,T_5$; successive processors are offset so that, while $p_i$ executes $T_1\,T_2\,T_3$, processor $p_{i+1}$ finishes the $T_4\,T_5$ of its previous section.]

Figure 7: Processor computation scheduling for the overlapped algorithm in terms of $(T_1, \ldots, T_5)$.

Figure 8 below outlines an implementation of the overlapped algorithm. The improvement resulting from this scheduling can be significant when the band is large, as both the computational cost of $T_1$ and the message size grow with $w$, quadratically and linearly respectively.

1: for each of the first $h$ rows $j$ do execute $T_{j,j-h}, T_{j,j-h+1}, \ldots, T_{j,j}$
2: send the newly computed values of $x_j$ to $p_{i-1}$
3: for all rows $j$ but the first or the last $h$ do execute $T_{j,j-h}, T_{j,j-h+1}, \ldots, T_{j,j}$
4: for each of the last $h$ rows $j$ do execute $T_{j,j-h}, T_{j,j-h+1}, \ldots, T_{j,j}$
5: send the newly computed values of $x_j$ to $p_{i+1}$
6: for every row $j$ do execute the tasks $T_{j,j+1}, \ldots, T_{j,l}$ with $l$ not exceeding the last row assigned to $p_i$
7: receive values from $p_{i-1}$ and $p_{i+1}$
8: for all rows $j$ do execute the remaining tasks $T_{j,l}$, up to $T_{j,j+h}$

Figure 8: SOR TDG scheduling for $p_i$.
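A hedged sketch of this schedule follows, assuming MPI with nonblocking sends in place of the original machine-specific primitives; the halo layout and the helpers diag_part, right_local and right_remote are illustrative stand-ins for steps 1-8 of Figure 8.

    #include <mpi.h>

    /* Hypothetical helpers mirroring Figure 8. */
    extern void diag_part(double *loc, int lo, int hi); /* T_{j,j-h..j} on rows [lo,hi) */
    extern void right_local(double *loc, int r, int h); /* T4: right-of-diagonal tasks
                                                           using only local values */
    extern void right_remote(double *loc, int r, int h);/* T5: remaining tasks that
                                                           need the received halos */

    /* One parallelogram section of the overlapped schedule on processor me.
     * x has layout [upper halo (h) | r local rows | lower halo (h)]. */
    void overlapped_section(int me, int p, int r, int h, double *x)
    {
        double *loc = x + h;
        MPI_Request rq[2];
        int nrq = 0;
        diag_part(loc, 0, h);                          /* T1 (step 1) */
        if (me > 0)                                    /* step 2: no blocking */
            MPI_Isend(loc, h, MPI_DOUBLE, me - 1, 0, MPI_COMM_WORLD, &rq[nrq++]);
        diag_part(loc, h, r - h);                      /* T2 (step 3) */
        diag_part(loc, r - h, r);                      /* T3 (step 4) */
        if (me < p - 1)                                /* step 5: no blocking */
            MPI_Isend(loc + r - h, h, MPI_DOUBLE, me + 1, 0,
                      MPI_COMM_WORLD, &rq[nrq++]);
        right_local(loc, r, h);                        /* T4 hides latency (step 6) */
        if (me > 0)                                    /* step 7: refresh halos */
            MPI_Recv(x, h, MPI_DOUBLE, me - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        if (me < p - 1)
            MPI_Recv(loc + r, h, MPI_DOUBLE, me + 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        right_remote(loc, r, h);                       /* T5 (step 8) */
        MPI_Waitall(nrq, rq, MPI_STATUSES_IGNORE);
    }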

5 Analysis

We provide analysis to show that our partitioning and scheduling method delivers good performance. We derive the Speedup and Efficiency of the proposed method for both asynchronous and synchronous communication models. We further assume the number of iterations to be very large. First we give the computational costs of the five tasks defined in the overlapped algorithm in Table 1, which will be used in the rest of this section. Also, we consider $r = 2h + k$ with $k > 0$, as mentioned above.

$T_1 = 2h(h+1)c$
$T_2 = 2(r-2h)(h+1)c$
$T_3 = 2h(h+1)c$
$T_4 = 2h\left(r - h + \frac{h-1}{2}\right)c$
$T_5 = h(h+1)c$

Table 1: Computational costs of $T_1, \ldots, T_5$ for the overlapped algorithm.

5.1 Asynchronous communication

Under this model, as we can observe in Figure 7, we have inter-processor data dependences between the task pairs $(T_3, T_1)$ and $(T_1, T_5)$.

By inspection of Figure 7, the condition for a total overlap of computation with communication is

$$T_4 \geq 2(\alpha + \beta h) + T_1, \quad\text{i.e.}\quad 2h\left(r - h + \frac{h-1}{2}\right)c \geq 2(\alpha + \beta h) + 2h(h+1)c.$$

We can solve this inequality for $h$ and we have

$$h \geq \frac{1}{2}\left[\left(\frac{2\beta}{c} + 3 - 2k\right) + \sqrt{\left(\frac{2\beta}{c} + 3 - 2k\right)^2 + \frac{8\alpha}{c}}\right] = h_{crit}. \tag{5}$$

Thus for $h \geq h_{crit}$ there is total overlap of computation and communication. In this case the Speedup is simply $p$, with 100% Efficiency. Notice again that we have made the assumption of a very large number of iterations. For $h < h_{crit}$, however, each processor may have to wait for the next processor to finish its computation of $T_1$ and send the message containing the corresponding updated solution values back. In this case we have $T_{par} = T_1 + T_2 + T_3 + \max(2(\alpha+\beta h) + T_1,\, T_4) + T_5$, i.e. $T_{par} = T_2 + T_3 + 2T_1 + 2(\alpha+\beta h) + T_5$, as $T_4 < 2(\alpha+\beta h) + T_1$. Thus

$$T_{par} = 2r(h+1)c + 2h(h+1)c + 2(\alpha + \beta h) = 2(h+1)(r+h)c + 2\alpha + 2\beta h. \tag{6}$$

Hence

$$Speedup = \frac{(h+2)\,n}{(h+1)(r+h) + \frac{\alpha}{c} + \frac{\beta}{c}h}. \tag{7}$$

By approximating $(h+2) \approx (h+1) \approx h$ we have

$$Speedup = \frac{p}{1 + \frac{p}{n}\left(h + \frac{\alpha}{hc} + \frac{\beta}{c}\right)}. \tag{8}$$

Thus we obtain the overhead and isoefficiency functions [6]:

$$T_{overhead}(W, p) = np\left(h + \frac{\alpha}{hc} + \frac{\beta}{c}\right), \tag{9}$$

where $W = n \cdot w$ stands for the total amount of work of each SOR iteration. This reveals an isoefficiency of $\Theta(p)$ for fixed band $w$, hence showing the algorithm to be optimally scalable: to keep the same efficiency when we increase the amount of work per processor, we only need a linear increase in the number of processors. If the band is increased, however, the number of processors must be reduced.

Theorem 1 Assume the number of iterations to be large and an asynchronous communication model. For $h \geq h_{crit}$, the Speedup and the Efficiency of the overlapped algorithm on $p$ processors are close to $p$ and 100% respectively. For $h < h_{crit}$, the Speedup is approximately $\frac{p}{1 + \frac{p}{n}\left(h + \frac{\alpha}{hc} + \frac{\beta}{c}\right)}$ and the Efficiency $\frac{1}{1 + \frac{p}{n}\left(h + \frac{\alpha}{hc} + \frac{\beta}{c}\right)}$. The isoefficiency is $\Theta(p)$.

Several remarks are in order.

- For large values of $r$, i.e. either very large matrices or few processors, $h_{crit}$ may be approximated by $h_{crit} = \frac{1}{2}\left[\left(\frac{2\beta}{c} - 2k\right) + \sqrt{\left(\frac{2\beta}{c} - 2k\right)^2 + \frac{8\alpha}{c}}\right]$.

- For values of $h < r < 2h$ a similar analysis is possible, as each processor only needs to communicate with adjacent processors; similar threshold values result from an analogous analysis. For values of $r < h$, however, a more complex analysis is required, as the local data has to be sent to more than one "adjacent" processor.
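As a purely illustrative check of equation (5), assume the machine constants $\alpha/c = 100$ and $\beta/c = 10$ (assumed round numbers, not measurements of the nCUBE-2 or the Paragon) and $k = 10$, i.e. $r = 2h + 10$. Then

$$h_{crit} = \frac{1}{2}\left[(20 + 3 - 20) + \sqrt{(20 + 3 - 20)^2 + 800}\right] = \frac{1}{2}\left(3 + \sqrt{809}\right) \approx 15.7,$$

so under these assumed constants the model predicts total overlap, and hence Speedup close to $p$, whenever the half bandwidth satisfies $h \geq 16$.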

5.2 Synchronous communication

Under this model no overlapping of communication with computation is possible. We derive the Speedup and Efficiency of the overlapped algorithm by a critical path analysis. By inspection of Figure 7, the parallel time for each iteration is given by

$$T_{par} = T_1 + T_2 + T_3 + \max(T_1, T_4) + T_5 + 2(\alpha + \beta h),$$

which under the assumption that $r \geq 2h$ yields

$$T_{par} = 2r(h+1)c + 2(\alpha + \beta h) \approx wrc + 2\alpha + \beta w.$$

Hence the Speedup and Efficiency are

$$Speedup = \frac{p}{1 + \frac{p(2\alpha + \beta w)}{nwc}} \qquad\text{and}\qquad Efficiency = \frac{1}{1 + \frac{p(2\alpha + \beta w)}{nwc}}. \tag{10}$$

Theorem 2 Assume the number of iterations to be large and a synchronous communication model. Then the overlapped algorithm on $p$ processors achieves a Speedup and Efficiency given respectively by $\frac{p}{1 + \frac{p(2\alpha + \beta w)}{nwc}}$ and $\frac{1}{1 + \frac{p(2\alpha + \beta w)}{nwc}}$. The isoefficiency of this algorithm is $\Theta(p)$.
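The closed forms of Theorem 2 are straightforward to evaluate; a small C helper (illustrative only, taking the model constants as inputs) might read:

    /* Predicted per-iteration speedup under the synchronous model,
     * following equation (10); alpha, beta and c are the model constants. */
    double predicted_speedup_sync(double p, double n, double w,
                                  double alpha, double beta, double c)
    {
        return p / (1.0 + p * (2.0 * alpha + beta * w) / (n * w * c));
    }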

6 Experiments and Results

We compare the performance, in terms of peak Mflops, of a naive implementation of the SOR iterations against our overlapped method for several matrices on nCUBE-2 and Intel Paragon multiprocessors. Both codes were written in C and compiler optimized (ncc version 3.2 on the nCUBE / icc version R4.5 on the Intel) without any assembly routines. All calculations are performed in double precision arithmetic. Table 2 (left) below presents the attained Mflops performance for 64 processors on both machines and varying matrix bandwidth. We verify the scalability of our approach in two settings, namely for a fixed problem size and for a constant amount of work per processor, i.e. $r \times band = K$. This is known as Scaled-Speedup [11]. Due to memory constraints, only data for the Paragon is shown (Table 2 right).

Left: 64 procs, n = 12,800              Right: Paragon, band = 161
band   nCUBE-2          Paragon         procs   n: 12,800   r: 200
       naive  overlap   naive  overlap
 21    39.5   37.3      346    309         4      41          38
 41    37.8   41.4      387    430         8      81          76
 81    32.0   43.9      361    537        16     162         152
161    23.8   45.3      280    608        32     314         304

Table 2: Overlapped algorithm performance results [Mflops].

Left: 64 procs, n = 12,800              Right: Paragon, band = 161
band   nCUBE-2   Paragon                procs   n: 12,800   r: 200
 21    24.5      149                       4     22.4        11.1
 41    26.0      165                       8     41.3        22.1
 81    26.1      167                      16     73.9        44.3
161    26.5      170                      32    121.4        88.9

Table 3: Multicoloring algorithm performance results [Mflops].

6.1 Discussion

The performance of the overlapped algorithm is impressive for high values of the bandwidth. An excellent scalability is also revealed, making it attractive for very large systems. The results show the overlapped algorithm to clearly outperform the naive implementation in all but the sample cases where the matrix bandwidth is small. We attribute this to the added code complexity as well as to a poorer fit to the memory hierarchy.

Multicoloring methods reveal poor performance. It is clear from the results that communication and computation cannot be overlapped in this method. This is particularly serious on distributed-memory multiprocessors. The "pure" multicoloring method can also benefit from the idea of overlapping several iterations among the same colored variables, yielding a significant improvement (typically 2 and 5 times better on the nCUBE-2 and the Paragon respectively). Even with this enhancement, multicoloring methods are by far outperformed by the overlapped algorithm. Although the incorporation of the overlapped scheme proposed in our algorithm into the multicoloring methods is possible, we believe this would not lead to a significant performance improvement, as the ratio of the number of messages between the two approaches is still of $O\!\left(\frac{(h+1)\,n}{p}\right)$.

6.2 Other Iterative Methods

Incorporating the overlapping scheme proposed in our algorithm into other methods is not advantageous in situations where there is no inter-iteration dependence between processors. As an example, iterative methods like CG cannot benefit from this approach. However, it is possible to take advantage of overlapping several CG iterations, as described by Leland [7, 8]. The main obstacle to parallelization in this method is the synchronization caused by the dot products. By substituting the recurrences for $r$ (the residual) and $p$ (the search direction) into the expressions for the dot products, one can obtain expressions for the dot products in terms of the earlier $r$ and $p$ values, hence avoiding some communication and therefore minimizing the synchronization delay. One instance of this substitution is sketched below.
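As standard CG algebra (a sketch of the idea only, not necessarily the exact formulation of [7, 8]; here $\alpha_k$ denotes the CG step length, not the message startup cost), expanding the recurrence $r_{k+1} = r_k - \alpha_k A p_k$ inside the residual dot product gives

$$\langle r_{k+1}, r_{k+1}\rangle = \langle r_k, r_k\rangle - 2\alpha_k \langle r_k, A p_k\rangle + \alpha_k^2 \langle A p_k, A p_k\rangle,$$

so the reduction needed at step $k+1$ can be assembled from dot products already available at step $k$, allowing the corresponding global communication to be initiated earlier and overlapped with other work.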

7 Conclusions and future work

The scheduling algorithm for SOR iterations on banded linear systems presented here reveals good performance over a naive scheduling for large values of the matrix bandwidth, as well as over multicoloring methods. Perfect scalability was observed for fixed bandwidth and increasing dimension of the linear system. We believe that this result makes our approach attractive for obtaining fast solution approximations as an intermediate step in other more powerful methods. Future work will concentrate on assessing the performance of the technique presented here on shared-memory multiprocessors, as well as on extending it to other linear solvers.

Acknowledgements

We thank Robert Leland at Sandia National Laboratories and Apostolos Gerasoulis at Rutgers University for several insightful discussions. This research was performed in part using parallel computing resources located at Sandia National Laboratories in Albuquerque, New Mexico.

References

[1] L. Adams and H. Jordan, Is SOR color-blind?, SIAM J. Sci. Stat. Comp., 7 (1986), pp. 490-506.
[2] R. Barrett et al., Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, SIAM Publications, Philadelphia, PA, 1994.
[3] T. H. Dunigan, Performance of the INTEL iPSC/860 and nCUBE 6400 Hypercube, ORNL/TM-11790, Oak Ridge National Lab., TN, 1991.
[4] G. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon and D. Walker, Solving Problems on Concurrent Processors, Prentice-Hall, New Jersey, 1988.
[5] G. Huang and W. Ongsakol, An Efficient Task Allocation Algorithm and its use to Parallelize Irregular Gauss-Seidel Type Algorithms, in Proc. of the Eighth International Parallel Processing Symposium, Cancun, Mexico, (1994), pp. 497-501.
[6] V. Kumar, A. Grama, A. Gupta and G. Karypis, Introduction to Parallel Computing: Design and Analysis of Algorithms, Benjamin-Cummings, 1994.
[7] R. Leland and J. Rollett, Evaluation of a Parallel Conjugate Gradient Algorithm, in Numerical Methods in Fluid Dynamics III, ed. K. Morton and M. Baines, Oxford Univ. Press, 1988.
[8] Robert W. Leland, The Effectiveness of Parallel Iterative Algorithms for Solution of Large Sparse Linear Systems, PhD Thesis, Oxford University, 1989.
[9] James M. Ortega, Introduction to Parallel and Vector Solution of Linear Systems, Plenum Press, 1988.
[10] Jaswinder Singh and John Hennessy, Finding and Exploiting Parallelism in an Ocean Simulation Program: Experience, Results and Implications, Journal of Parallel and Distributed Computing, vol. 15, no. 1, (1992), pp. 27-58.
[11] Xian-He Sun and Diane T. Rover, Scalability of Parallel Algorithm-Machine Combinations, IEEE Transactions on Parallel and Distributed Systems, vol. 5, no. 6, (1994), pp. 599-613.
[12] David Young, Iterative Solution of Large Linear Systems, Academic Press, 1971.
