In Proceedings of CONPAR '94 - VAPP VI, Sept. 6-8, 1994, Linz, Austria
Parallelization of non-simultaneous iterative methods for systems of linear equations*

Beniamino Di Martino and Giulio Iannello
Dipartimento di Informatica e Sistemistica, Università di Napoli "Federico II", Via Claudio, 21 - 80125 Napoli, Italy
e-mail: dimartin, [email protected]
Abstract. This paper proposes a general execution scheme for parallelizing a class of iterative algorithms characterized by strong data dependencies between iterations. This class includes non-simultaneous iterative methods for solving systems of linear equations, such as Gauss-Seidel and SOR, and long-range methods. The paper presents a set of code transformations that make it possible to derive the parallel form of the algorithm starting from sequential code. The performance of the proposed execution scheme is then analyzed with respect to an abstract model of the underlying parallel machine.
1 Introduction

Considerable research activity has been devoted in recent years to the programming of parallel computers. In this respect, special attention has been paid to computational science, which currently represents the main field where parallel computers can be successfully employed [1, 5, 6]. Parallelization techniques concentrate on data parallelism, a characteristic of most computation-intensive applications, and produce parallel programs executing in SPMD (Single Program Multiple Data) mode [3]. Although SPMD code can be less efficient than general parallel code, it can achieve very high efficiency for numerical algorithms and it can be generated relatively simply by automatic or semi-automatic tools. The output of parallelizers is generally code in a conventional language (e.g. Fortran or C), with calls to a run-time library (e.g. PVM [6]) implementing a virtual parallel machine that hides hardware details and improves portability.
* We wish to thank P. Sguazzero for his helpful hints and suggestions, and IBM ECSEC for having made available to us the SP1 machine on which the experimental measures have been performed. This work has been supported by Consiglio Nazionale delle Ricerche under funds of "Progetto Finalizzato Sistemi Informatici e Calcolo Parallelo" and by MURST under 40% funds.
Although in some cases DO-loops can be automatically parallelized through static analysis, the usefulness of parallelizers is limited by two main factors. First, they often require human supervision to improve the performance of the parallelized code. For instance, since efficient execution on currently available hardware enforces a lower bound on the computation/communication ratio, communications must be grouped and performed after a sufficient number of iterations (message vectorization). Second, there are many algorithms amenable to straightforward parallelization in principle, but that do not satisfy the conditions for efficient execution in SPMD mode.

In this paper we present a general parallel execution scheme for a class of iterative algorithms characterized by strong data dependencies between iterations. This class includes non-simultaneous iterative methods for solving systems of linear equations, such as Gauss-Seidel and SOR, and long-range methods. These algorithms cannot be easily parallelized following the SPMD approach, since the elements in the data domain must be updated serially. The computation can instead be organized according to a sort of generalized pipeline, where each stage receives a portion of the data domain (the grain) and, after updating its own elements, performs a partial computation of elements assigned to other stages. Even though the proposed execution scheme cannot be automatically derived using currently available tools, we give a set of code transformations that make it possible to derive the final parallel form starting from a sequential version of the algorithm. The performance of the proposed execution scheme is then analyzed with respect to an abstract model of the underlying parallel machine based on the LogP model recently proposed by Culler et al. [2]. In spite of the highly sequential nature of the algorithms considered, the proposed execution scheme surprisingly guarantees very high efficiency.
2 Parallelization of non-simultaneous methods

In this paper we are concerned with the parallelization of a class of numerical methods for the iterative solution of systems of linear equations, i.e., finding a solution to the vector equation Ax = b, where A is the N x N matrix of linear equation coefficients, and x, b are N-dimensional arrays. In non-simultaneous methods, or methods of successive corrections, the variables x_j^{(k+1)} at the (k+1)-th iteration are updated in sequence, using the newly obtained j - 1 values of the preceding variables from step k+1 and the "old" N - j values of the remaining variables from step k. One of these methods is Successive Over Relaxation (SOR). The update prescription is [7]:

x_j^{(k+1)} = \frac{\omega}{a_{jj}} \left( b_j - \sum_{i=1}^{j-1} a_{ji} x_i^{(k+1)} - \sum_{i=j+1}^{N} a_{ji} x_i^{(k)} \right) - (\omega - 1)\, x_j^{(k)}     (1)

For ω = 1 the method is called Gauss-Seidel iteration.
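For concreteness, the following minimal sketch (ours, not taken from the paper) implements one in-place sweep of (1); the names A, B, X and OMEGA are illustrative.

  SUBROUTINE SOR_SWEEP(A, B, X, N, OMEGA)
  ! One sweep of update (1). Elements of X are updated in place and in
  ! order, so X(I) already holds the value of step k+1 for I < J and
  ! still holds the value of step k for I > J.
  ! Illustrative sketch only; A, B, X, OMEGA are assumed names.
    IMPLICIT NONE
    INTEGER, INTENT(IN) :: N
    REAL(8), INTENT(IN) :: A(N,N), B(N), OMEGA
    REAL(8), INTENT(INOUT) :: X(N)
    INTEGER :: I, J
    REAL(8) :: SIGMA
    DO J = 1, N
      SIGMA = 0.0D0
      DO I = 1, N
        IF (I .NE. J) SIGMA = SIGMA + A(J,I)*X(I)
      ENDDO
      X(J) = (OMEGA/A(J,J))*(B(J) - SIGMA) - (OMEGA - 1.0D0)*X(J)
    ENDDO
  END SUBROUTINE SOR_SWEEP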
When the matrix of coefficients is a (2s+1)-diagonal matrix (0 ≤ s < N), the computational complexity can be improved by modifying the update expression to (taking the Gauss-Seidel iteration as an example):

x_j^{(k+1)} = \frac{1}{a_{jj}} \left( b_j - \sum_{i=-s}^{-1} a_{j(j+i)} x_{j+i}^{(k+1)} - \sum_{i=+1}^{+s} a_{j(j+i)} x_{j+i}^{(k)} \right)     (2)
This is called a stencil computation, and s is the stencil size, which can range from N (full-range interaction) to 1 (nearest-neighbor interaction).

The computation corresponding to the above formulas can be parallelized according to the following execution strategy. Our data domain consists of the array x of N elements. The most natural decomposition of the data domain for execution on P processors is the partitioning of x into P blocks, each containing a grain of n = N/P elements. Unfortunately, elements assigned to different processors cannot be updated in parallel, since an element x_j can only be updated after all elements x_h, 1 ≤ h < j, have been updated. One way of working with this constraint is to make the processors work in a pipeline, so that already updated elements can be passed over and used for updating elements with higher indices. In the general case, however, interactions between processors would substantially differ from what happens in a usual pipeline. In fact, if the stencil s is greater than the grain n, each stage must receive values from stages other than the one immediately preceding it. Moreover, after the update has taken place, the new values must also be communicated to previous stages in order to make them available before the next iteration begins. This complex communication pattern would lead to a complete serialization of the pipeline as s approaches N.

To overcome this difficulty, we have adopted a computational workload distribution strategy which follows an "altruistic" approach. After updating its own elements, each processor computes partial sums for updating elements assigned to subsequent and preceding processors in the pipeline, using only the elements assigned to it. As soon as these partial sums are computed, they are sent to the processors that will need them, so that these processors can start a new update step. This step now requires, for each element to be updated, only the computation of the partial sum over the elements assigned to that processor, and a final sum of the partial results received from the other stages of the pipeline.

Figure 1 shows the Gantt diagrams of the execution strategy just described. We have adopted the following conventions. The white rectangles represent the update of the elements assigned to each processor, whereas the thin horizontal lines represent the computation of partial sums to be sent to other processors. Arrows represent communications of partial sums. To avoid too many symbols for each iteration, we have reported only one arrow to represent all communications between each pair of processors. The arrow is then placed at the beginning of the segment representing the computation of a group of partial sums, to mean that such sums are communicated individually as soon as their evaluation has been completed (i.e. no message vectorization is performed).
Fig. 1. Gantt diagrams of the proposed parallel execution scheme in the cases: (a) s < n and (b) s = N, P = 5.
Finally, communications are assumed to be ideal, i.e. without overhead (the length of the segments representing computations is not modified by the presence of communications) and with zero latency (arrows are drawn vertically), to simplify an informal analysis of the execution scheme just described. Figure 1 shows that in ideal conditions the outlined strategy can lead to very high efficiency, because all idle times can be reduced to a negligible fraction of the overall iteration time. Under the more realistic assumption that communications have an associated overhead and a non-zero latency, communications of partial results should be aggregated (message vectorization) to reduce the communication overhead. In section 4 we will analyze this scheme in more detail and take message vectorization and communication overhead into account.
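Before turning to code transformations, it may help to restate the "altruistic" distribution in formulas (the notation S_j^{(m)} below is ours, not the paper's). Since the bracketed sums in (1) are associative, they can be split over the P blocks of indices; writing B_m for the block assigned to processor m:

b_j - \sum_{i \ne j} a_{ji} x_i \;=\; b_j - \sum_{m=1}^{P} S_j^{(m)}, \qquad S_j^{(m)} \;=\; \sum_{i \in B_m,\ i \ne j} a_{ji} x_i ,

where each x_i carries the superscript (k+1) or (k) prescribed by (1). Each S_j^{(m)} depends only on the block of x owned by processor m, so it can be computed locally and shipped to the owner of x_j, which merely adds the P partial sums and applies the final scaling.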
3 Sequential code restructuring and parallelization

The sequential code to compute expressions (1) and (2) assumes the generalized form of the loop nest in fig. 2a, where s is the stencil. The outer loop on index k represents the iteration for convergence, the intermediate loop on j represents the scanning of the elements of x, while the inner loop on i represents the update of the j-th element of x, based on the values of the neighbor elements belonging to the stencil. The + sign in the updating statement can be substituted by any associative operator, while fij and gj are generic functions, which could depend on subscripts i and j. In the Gauss-Seidel iteration, for example, the associative operator is addition, while the other functions correspond to products with elements of the coefficient matrix A.
(a)
  DO k = 1,NITER
    DO j = 1,N
      DO i = -s,s
        IF ((i+j).GT.0 .AND. (i+j).LE.N) THEN
          x(j) = x(j) + fij(x(j+i))
        ENDIF
      ENDDO
      x(j) = gj(x(j))
    ENDDO
  ENDDO

(b)
  DO k = 1,NITER
    DO j = 1,N
      DO i = max(1,-s+j), min(N,s+j)
        x(j) = x(j) + fij(x(i))
      ENDDO
      x(j) = gj(x(j))
    ENDDO
  ENDDO

(c)
  DO k = 1,NITER
    DO m = 1,P
      DO j = 1,N
        DO i = max(1,-s+j,(m-1)*N/P+1), min(N,s+j,m*N/P)
          x(j) = x(j) + fij(x(i))
        ENDDO
        x(j) = gj(x(j))
      ENDDO
    ENDDO
  ENDDO

(d)
  DO k = 1,NITER
    DO l = 1,C
      DO m = 1,P
        DO j = (l-1)*N/C+1, l*N/C
          DO i = max(1,-s+j,(m-1)*N/P+1), min(N,s+j,m*N/P)
            x(j) = x(j) + fij(x(i))
          ENDDO
          x(j) = gj(x(j))
        ENDDO
      ENDDO
    ENDDO
  ENDDO

(e)
  DO k = 1,NITER
    DO l = 1,C
      DO m = 1,P
        IF ((l.LE.(m-1)*C/P) .OR. (l.GT.m*C/P)) THEN
          DO j = (l-1)*N/C+1, l*N/C
            DO i = max(1,-s+j,(m-1)*N/P+1), min(N,s+j,m*N/P)
              x(j) = x(j) + fij(x(i))
            ENDDO
          ENDDO
        ENDIF
      ENDDO
      DO m = 1,P
        IF ((l.GT.(m-1)*C/P) .AND. (l.LE.m*C/P)) THEN
          DO j = (l-1)*N/C+1, l*N/C
            DO i = max(1,-s+j,(m-1)*N/P+1), min(N,s+j,m*N/P)
              x(j) = x(j) + fij(x(i))
            ENDDO
            x(j) = gj(x(j))
          ENDDO
        ENDIF
      ENDDO
    ENDDO
  ENDDO

Fig. 2. (a) Loop nest we are considering, and iteration space for s = 2 and s = N; (b) skewing; (c) first tiling; (d) second tiling; (e) code legalization.
Before proceeding to the actual code parallelization, it is worthwhile to perform a number of code transformations that implement the "altruistic" workload distribution and message vectorization. We begin by giving some basic concepts and definitions. A d-nested loop can be represented in an iteration space Z^d [8]. Each loop iteration is identified by a point in the iteration space. Fig. 2a shows the iteration space of the loop nest we are considering, for two values of the stencil s. The execution order of iterations is constrained by their data dependencies. Iteration reordering transformations on a loop nest are legal if the transformed code maintains a legal execution order of its iterations, i.e. an execution order which satisfies all dependence constraints. Within the framework of the iteration space, it is easy to represent graphically several iteration reordering transformations, such as loop interchange, reversal, coalescing, interleaving, skewing and tiling (or blocking).

Skewing. The first transformation we apply to the loop nest is skewing [8]. We skew the loop on i with respect to the loop on j by a factor of 1, leading to the transformed code shown in fig. 2b. This figure also shows the transformation of the iteration space of fig. 2a, for the two stencil sizes. The effect of this transformation is simpler code when tiling is applied, which is graphically visible in the rectangular shape of the tiles (see below).

Tiling. This is a transformation that allows for both parallelism and locality within a loop nest [8]. In its general form, it transforms a d-nested loop into a 2d-nested loop. As a result, the d inner loops of the 2d-nested loop perform the execution of a d-dimensional tile of iterations, while the d outer loops control the execution of the tiles. This result is reflected graphically in the partitioning of the iteration space into blocks. We apply tiling transformations to the skewed code in two steps, corresponding to the tiling of the loop on i and of the loop on j, respectively.

First, the loop on i is divided into P groups of iterations (the identifier P reminds us that we will distribute these groups among the processors). The transformed code is shown in fig. 2c, which also shows this first (vertical) tiling of the iteration space. This tiling transformation corresponds to our computational workload distribution. In fact, for each value of m, the partial sum computed by each iteration of the loop on j is the contribution of tile m (processor m, after parallelization) to the update of x_j. At every iteration of the loop on j, the partial sum is actually computed by the inner loop on i, which turns out to be empty for those values of m for which the corresponding tile must not contribute to the update of x_j. It is worth noting that tile m makes use only of those elements x_i belonging to the same subset of adjacent elements of x. This leads to a partition of x into P blocks, which can be distributed among the processors together with the corresponding tiles.

The second tiling transformation divides the loop on j into C blocks of iterations. The transformed code is shown in fig. 2d, which also shows this second (horizontal) tiling of the iteration space. This transformation corresponds to message vectorization. In fact, each tile in this second tiling computes a set of partial sums contributing to the update of the elements of x belonging to the same block of data, according to the partition determined by the former tiling.
In terms of parallel execution, this means that all these partial sums can be grouped and communicated together (i.e. in a single message) to the processor that needs them. Of course, if for some value of index l no partial sums are actually produced (see above), no communication operation should take place at the end of the corresponding tile. It is important to note that the advantage of message vectorization is counterbalanced by the introduction of an idle time for the processors, since they must wait for partial sums computed by other processors before completing an iteration. The C parameter can be used to control the degree of vectorization. For C = P we have complete vectorization and all partial sums directed to one processor are grouped together. Conversely, for C = N no tiling, and hence no vectorization, is performed and partial sums are communicated individually. We will consider this issue in more detail in section 4.

Code legalization. After the last transformation, the sequential code is no longer legal, since a sequential execution of the tiles does not satisfy the ordering constrained by data dependences.² The sequential code can be made legal again with the transformation reported in figure 2e. For each iteration of the outer controlling loop of the horizontal tiling (the loop on l), this transformation reorganizes the execution of the tiles belonging to the l-th "tiles row". More precisely, the execution of the tile which actually performs the update of the elements x_j belonging to some block of data (dark area in figure 2e) is postponed until after the execution of the other tiles in the row (light areas in figure 2e). In terms of parallel execution, this means that the processor performing the update actually waits for the processors that are computing partial sums for it.

Parallelization. Once the sequential code restructuring has been performed, parallelization can be accomplished using a message passing system such as PVM [6]. Code parallelization involves three main steps: data decomposition, loop iteration distribution and data communication. Data decomposition requires the partitioning of the array to be updated into blocks, and the assignment of each block to a processor. Likewise, loop iteration distribution is readily performed by assigning to each processor all tiles that refer to local elements of array x only. Graphically, this corresponds to assigning to the processors vertical columns of tiles in the iteration space of figure 2c. Finally, communication operations and synchronization can be easily added to the code by inserting non-blocking send and blocking receive primitives, respectively at the end and at the beginning of the (horizontal) tiles controlled by the loop on l. To cope with the case in which no sums are computed for some value of l, a check has to be introduced, so that communications and the subsequent update of x are skipped if no interaction must take place. A sketch of the resulting node program is given below.
² The last tiling transformations are not legal because, for a tiling transformation to be legal, it should apply to a nest of fully permutable loops [8]. We have applied a skewing transformation with a skewing factor that does not make the nest fully permutable.
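As an illustration (not taken from the paper), the following sketch shows the shape the node program of processor me (1 ≤ me ≤ P) might take after these steps. SENDSUMS and RECVSUMS are hypothetical wrappers around the non-blocking send and blocking receive primitives of the message-passing layer; RECVSUMS is assumed to accumulate the received partial sums into its argument, the mapping from block index l to its owning processor is left implicit, and the check that skips the send when the local i-range is empty is omitted for brevity.

  ! Hedged SPMD sketch of one node program derived from fig. 2e.
  ! fij/gj are the generic functions of the loop nest; psum is a local
  ! buffer; SENDSUMS/RECVSUMS are assumed communication wrappers.
  DO k = 1, NITER
    DO l = 1, C
      jlo = (l-1)*N/C + 1
      jhi = l*N/C
      IF (l.LE.(me-1)*C/P .OR. l.GT.me*C/P) THEN
        ! "altruistic" phase: partial sums for a block owned by another
        ! processor, computed from the local elements of x only
        DO j = jlo, jhi
          psum(j) = 0.0
          DO i = max(1,-s+j,(me-1)*N/P+1), min(N,s+j,me*N/P)
            psum(j) = psum(j) + fij(x(i))
          ENDDO
        ENDDO
        ! non-blocking send of the whole group of partial sums
        ! (message vectorization) to the processor owning block l
        CALL SENDSUMS(l, psum(jlo), jhi-jlo+1)
      ELSE
        ! update phase: wait for (and accumulate) the partial sums sent
        ! by the other processors, add the local contribution, finalize
        CALL RECVSUMS(l, x(jlo), jhi-jlo+1)
        DO j = jlo, jhi
          DO i = max(1,-s+j,(me-1)*N/P+1), min(N,s+j,me*N/P)
            x(j) = x(j) + fij(x(i))
          ENDDO
          x(j) = gj(x(j))
        ENDDO
      ENDIF
    ENDDO
  ENDDO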
4 Performance evaluation

In this section we analyze the performance of the execution scheme outlined in section 2. The analysis is performed using the LogP model proposed by Culler et al. in [2] for the design and analysis of portable parallel algorithms. The LogP model characterizes a parallel machine with the four parameters L (latency), o (overhead), g (gap), and P (number of processors). Unit time, called a cycle, is assumed for local operations. Sequential computations performed on individual processors are therefore modeled by their operation count. The parameters L, o and g are measured as multiples of the cycle. The gap parameter, which is related to network capacity, is not explicitly considered in the following. In fact, it can easily be shown that, whenever efficiency is reasonably high, communication operations are always sufficiently spaced in time so as not to saturate the network.

We begin by analyzing the proposed execution scheme in the simplest case, when communications are limited to adjacent processors (s ≤ n). From the diagram reported in figure 1a and informally discussed in section 2, we can observe that the outer iterations performed by each processor consist of four phases, corresponding to computing the updated values for the first n - s elements of the grain, the updated values for the remaining s elements of the grain, the s partial results to be sent to the next processor, and the s partial results to be sent to the previous processor. The four phases are shown in figure 3a, where their operation counts a, b, c and d, respectively, are represented by the area over the line representing each phase. From figure 3a we can easily obtain for a, b, c and d the values:

a = 2ns - \frac{5}{2}\,s^2, \qquad b = \frac{3}{2}\,s^2, \qquad c = d = \frac{1}{2}\,s^2

Since the message from the next processor is not required to start a new iteration, the efficiency attainable in the ideal case is very high, and it is limited only by the difference between the values of b and c (fig. 1a).

Figure 3b shows the Gantt diagram of the same algorithm in the more realistic case of non-zero values for L and o and with complete vectorization of messages (i.e. all partial sums are sent together at the end of the corresponding phase). In the diagram, the overhead time wasted on sending or receiving messages is represented by little black squares, while the slope of the arrows representing communications takes into account the non-zero value of L. From figure 3b we can evaluate the efficiency of the computation as the ratio between the time spent doing useful work in one iteration and the time needed to complete one iteration (iteration time):

\eta = \frac{\text{useful work}}{\text{useful work} + \text{overhead} + \text{wait time}} = \frac{1}{1 + s/n + (3\,o + L)/(n\,s)}

Figure 3c shows the Gantt diagram for the same algorithm when messages are partially vectorized. For the graphical representation, the diagrams shown in the figure assume that the partial sums to be communicated to the next and previous processors are grouped into two sequences of two messages each.
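As a consistency check (not spelled out in the paper), the four counts a, b, c and d obtained from figure 3a add up to the cost of a full sequential update of the grain with a (2s+1)-point stencil:

a + b + c + d = \left(2ns - \tfrac{5}{2}s^2\right) + \tfrac{3}{2}s^2 + \tfrac{1}{2}s^2 + \tfrac{1}{2}s^2 = 2ns ,

so in this case the useful work per outer iteration on each processor is 2ns cycles, the numerator of the efficiency expression above.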
Fig. 3. Gantt diagrams of the algorithm in the case s ≤ n: (a) operation counts of the four phases; (b) complete message vectorization; (c) partial message vectorization.

More generally, in the following we will denote by N_C the number of messages making up each sequence. According to this convention, the parameter C introduced in the previous section corresponds to the product N_C · P. From figure 3c we can observe that the overhead in the general case is 4·o·N_C, while the waiting time is 2(o + L) + b' + c', where b' = b - c + c' and c' = c/N_C. Using these values we can easily compute the efficiency as a function of the LogP parameters and of the degree of vectorization, represented by the reciprocal of N_C.

We can now examine in more detail the performance of the proposed execution scheme in the case s = N. Figure 4a shows the operation count of the computation phases performed locally by each processor. Since there are as many phases as processors and each phase takes n² elementary operations, the useful work done in one iteration takes P·n² cycles. Figure 4b shows the Gantt diagram of the algorithm execution on five processors, assuming full vectorization of messages. To show how iterations synchronize, the messages received from the other processors have been added to the diagram in fig. 4b. As required by the algorithm, the i-th iteration starts as soon as the partial sums sent by all other processors have been received. The behavior shown in figure 4b is quite general and independent of the number of processors used. From figures 4a and 4b, observing that the events marked by the dashed line must be executed sequentially, we can derive the expression P(o + n² + n² + o + L) for the iteration time.
Fig. 4. Gantt diagrams of the algorithm in the case s = N.

Remembering that the useful work done in one iteration is P·n², the efficiency attainable when messages are fully vectorized can readily be computed. Finally, we analyze the execution of the algorithm when messages are only partially vectorized and each group of partial sums is transmitted in N_C distinct communications. The Gantt diagram corresponding to this situation is shown in figure 4c. As in the case just considered, for the iteration time we obtain the expression P(n² + N_C·o + n²/N_C + o + L), which gives an efficiency:

\eta = \frac{1}{1 + 1/N_C + ((N_C + 1)\,o + L)/n^2}     (3)

From this expression for η, we find that the efficiency is maximum when³ N_C ≈ n/√o. Substituting this value in the expression for η we get:

\eta_{max} = \frac{1}{1 + 2\sqrt{o}/n + (o + L)/n^2}

³ Actually, vectorization makes the message length variable and L dependent on N_C. This leads to a slightly different expression for the optimum value of N_C. However, for all reasonable values of the parameters involved, the given approximate expression can be safely assumed. For further details see the extended version of this paper [4].
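For completeness, the quoted optimum follows directly from minimizing the iteration time over N_C (a step the paper leaves implicit):

\frac{d}{dN_C}\, P\Bigl(n^2 + N_C\,o + \frac{n^2}{N_C} + o + L\Bigr) = P\Bigl(o - \frac{n^2}{N_C^2}\Bigr) = 0 \;\Longrightarrow\; N_C = \frac{n}{\sqrt{o}} ,

and substituting 1/N_C = √o/n and N_C·o = n√o into (3) gives the expression for η_max above.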
This last result shows that, for large enough values of the grain n, a proper choice of the degree of message vectorization leads to an efficiency very close to 1. These results have been extensively confirmed by performance measures carried out on different parallel machines, including a network of workstations. For reasons of space we report here only figures concerning the efficiency of a Gauss-Seidel algorithm (s = N) running on an IBM SP1 with eight processors. The algorithm was implemented using the PVM communication library [6].
Fig. 5. Measured and computed efficiency of a Gauss-Seidel algorithm (s = N), as a function of N_C (P = 8), for grain sizes n = 500, 1000, 1500.

Figure 5 compares the measured efficiency with that computed using expression (3). The values used for o and L are 8000 and 1000 cycles, respectively. Even though the measured efficiency turns out to be slightly lower than the one computed analytically, the ability of the model to predict the optimum value of N_C for different grain sizes is apparent.
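As a quick plausibility check (ours, not the paper's), the small program below evaluates expression (3) at the optimum N_C = n/√o for the parameter values quoted above; for n = 1500 it predicts N_C ≈ 17 and an efficiency of about 0.89.

  PROGRAM EFFCHECK
  ! Evaluate the efficiency predicted by expression (3) at the optimal
  ! degree of vectorization NC = n/sqrt(o), for the LogP parameters
  ! quoted in the text (o = 8000 cycles, L = 1000 cycles).
  ! Illustrative check only, not code from the paper.
    IMPLICIT NONE
    REAL(8), PARAMETER :: O = 8000.0D0, LAT = 1000.0D0
    REAL(8), PARAMETER :: GRAIN(3) = (/ 500.0D0, 1000.0D0, 1500.0D0 /)
    REAL(8) :: N, NC, ETA
    INTEGER :: IG
    DO IG = 1, 3
      N   = GRAIN(IG)
      NC  = N / SQRT(O)
      ETA = 1.0D0 / (1.0D0 + 1.0D0/NC + ((NC + 1.0D0)*O + LAT)/N**2)
      PRINT *, 'n =', N, '  optimal NC =', NC, '  predicted efficiency =', ETA
    ENDDO
  END PROGRAM EFFCHECK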
5 Conclusions
In this paper we have presented and analyzed a parallel execution scheme for a class of iterative methods including non-simultaneous methods for solving systems of linear equations. We are not aware of any previous systematic application of this scheme to algorithms implementing non-simultaneous methods. After an informal discussion of the general parallelization strategy, we have presented a set of code transformations to derive the parallel version of a sequential algorithm belonging to the class considered.
This procedure cannot be supported by commercially available parallelizers such as FORGE, because the final execution model does not fit into the loosely synchronous SPMD model. Nevertheless, the final code is still in some sense SPMD, and the transformations are those usually employed in conjunction with automatic tools. This could suggest that the capabilities of available parallelizers could be extended so as to also integrate our strategy. We have then presented a performance analysis of the proposed scheme based on the LogP model. We have shown how the overall performance is related to the main parameters of the target machine and under which assumptions our strategy attains reasonably high efficiency. Experimental measures have validated the qualitative results of our analysis, and confirmed the usefulness of models like LogP for designing efficient and widely portable parallel algorithms.
References

1. P. Brinch Hansen, "Model programs for computational science: A programming methodology for multicomputers", Concurrency: Practice and Experience, vol. 5, no. 5, pp. 407-423, 1993.
2. D. Culler et al., "LogP: Towards a Realistic Model of Parallel Computation", Proc. Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 1-12, May 1993.
3. F. Darema et al., "A Single Program Multiple Data Computational Model for EPEX/Fortran", Parallel Computing, pp. 11-24, April 1988.
4. B. Di Martino and G. Iannello, "Parallelization of non-simultaneous iterative methods for systems of linear equations", Tech. Rep. TR-CPS-019-94, Feb. 1994.
5. G.C. Fox et al., Solving Problems on Concurrent Processors, Prentice Hall, Englewood Cliffs, New Jersey, 1988.
6. G.A. Geist, V.S. Sunderam, "Network-Based Concurrent Computing on the PVM System", Concurrency: Practice and Experience, vol. 4, pp. 293-311, June 1992.
7. J.R. Westlake, A Handbook of Numerical Matrix Inversion and Solution of Linear Equations, Wiley, New York, 1968.
8. M. Wolf and M. Lam, "A Loop Transformation Theory and an Algorithm to Maximize Parallelism", IEEE Transactions on Parallel and Distributed Systems, vol. 2, no. 4, pp. 452-471, 1991.
This article was processed using the LaTEX macro package with LLNCS style