On Time Optimal Supernode Shape

Edin Hodzic
2665 N. First St. #300, San Jose, CA 95134

Weijia Shang
Department of Computer Engineering, Santa Clara University, Santa Clara, CA 95053
Abstract: With the objective of minimizing the total execution time of a parallel program on a distributed memory parallel computer, this paper discusses the selection of an optimal supernode shape of a supernode transformation (also known as tiling). We assume that the communication cost is dominated by the startup penalty and therefore can be approximated by a constant. We identify three parameters of a supernode transformation: supernode size, relative side lengths, and cutting hyperplane directions. For algorithms with perfectly nested loops and uniform dependencies, we give a closed form expression for an optimal linear schedule vector, and a necessary and sufficient condition for optimal relative side lengths. We prove that the total running time is minimized by a cutting hyperplane direction matrix whose rows are from the surface of the polar cone of the cone spanned by the dependence vectors, also known as the tiling cone. The results are derived in continuous space and should for that reason be considered approximate.
Keywords: Supernode partitioning, tiling, parallelizing compilers, distributed memory multicomputer.
1 Introduction
Supernode partitioning is a transformation technique that groups a number of iterations of a nested loop in order to reduce the communication startup cost. This paper addresses the problem of selecting optimal cutting hyperplane directions and optimal supernode relative side lengths with the objective of minimizing the total running time, the sum of communication time and computation time, assuming a large number of available processors which execute multiple supernodes.

A problem in distributed memory parallel systems is the communication startup cost, the time it takes a message to reach the transmission media from the moment of its initiation. The communication startup cost is usually orders of magnitude greater than the time to transmit a message across the transmission media or to compute the data in a message. Supernode transformation was proposed in [14], and has been studied in [1, 2, 3, 4, 15, 18, 20, 25, 26, 27] and elsewhere to reduce the communication startup cost. Informally, in a supernode transformation, several iterations of a loop are grouped into one supernode and this supernode is assigned to a processor as a unit for execution. The data of the iterations in the same supernode that need to be sent to another processor are grouped into a single message, so that the number of communication startups is reduced from the number of iterations in a supernode to one. A supernode transformation is characterized by the supernode size, the relative lengths of the sides of a supernode, and the directions of the hyperplanes which slice the iteration index space of the given algorithm into supernodes. All three factors affect the total running time. A larger supernode size reduces the communication startup cost, but may delay the computation of other processors waiting for the message and therefore result in a longer total running time. Also, a square supernode may not be as good as a rectangular supernode of the same size. In this paper, the selection of optimal cutting hyperplane directions and optimal relative side lengths is addressed.

The rest of the paper is organized as follows. Section 2 presents necessary definitions, assumptions, and terminology. Section 3 discusses our results in detail. Section 4 briefly describes related work and the contribution of this work compared to previous work. Section 5 concludes this paper. A bibliography of related work is included at the end.
2 Basic definitions, models and assumptions
This author was supported by AT&T Labs. This research was supported in part by the National Science Foundation under Grant CCR-9502889 and by the Clare Boothe Luce Professorship from the Henry Luce Foundation.

The architecture under consideration is a parallel computer with distributed memory. Each processor has access only to its local memory and is capable of
communicating with other processors by passing messages. In our model, the cost of sending a message is represented by t_s, the message startup time. The computation speed of a single processor is characterized by the time it takes to compute a single iteration of a nested loop. This parameter is denoted by t_c.

Algorithms under consideration consist of a single nested loop with uniform dependencies [22]. Such algorithms can be described by a pair (J, D), where J is an iteration index space and D is an n × m dependence matrix. Each column of the dependence matrix represents a dependence vector. The cone generated by the dependence vectors is called the dependence cone. The cone generated by the vectors orthogonal to the facets of the dependence cone is called the tiling cone. We assume that m ≥ n, matrix D has full rank (equal to the number of loop nests n), and all elements on the main diagonal of the Smith normal form of D are equal to one. As discussed in [23], if the above assumptions are not satisfied, then the iteration index space J contains independent components and can be partitioned into several independent sub-algorithms for which the above assumptions are satisfied.

In a supernode transformation, the iteration space is sliced by n independent families of parallel equidistant hyperplanes. The hyperplanes partition the iteration index space into n-dimensional parallelepiped supernodes (or tiles). The hyperplanes of one family can be specified by a normal vector orthogonal to the hyperplanes. The square matrix consisting of the n normal vectors as rows is denoted by H. H is of full rank because the n hyperplanes are assumed to be independent. These parallelepiped supernodes can also be described by the n linearly independent vectors which are the supernode sides. As described in [20], the column vectors of matrix E = H^{-1} are the n side vectors. A supernode template T is defined as one of the full supernodes translated to the origin 0, i.e., T = {j : 0 ≤ Hj < 1}. The supernode index space J_s, obtained by the supernode transformation H, is:

J_s = { j_s : j_s = ⌊Hj⌋, j ∈ J }.   (1)

The supernode dependence matrix D_s¹, resulting from supernode transformation H, consists of the elements of the set:

D_s = { d_s : d_s = ⌊H(j + d)⌋, d ∈ D, j ∈ T }.
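To make the transformation concrete, here is a small sketch in Python (an illustrative example with assumed data, not taken from the paper): a 2D loop with unit dependence vectors is tiled by axis-aligned supernodes of side 2, and the supernode index space of equation (1) and the supernode dependence set D_s are computed by direct enumeration.

import numpy as np
from itertools import product

# Illustrative sketch (assumed example): a 2D loop with unit dependence
# vectors, tiled by axis-aligned supernodes of side 2,
# i.e. H = diag(1/2, 1/2) and E = H^{-1} = diag(2, 2).
H = np.diag([0.5, 0.5])                     # cutting hyperplane matrix
D = np.array([[1, 0],
              [0, 1]])                      # n x m dependence matrix (columns are dependence vectors)
J = [np.array(j) for j in product(range(8), range(8))]   # iteration index space

def supernode_index(j):
    """j_s = floor(H j), as in equation (1)."""
    return tuple(int(x) for x in np.floor(H @ j))

Js = {supernode_index(j) for j in J}        # supernode index space

# Supernode dependence set D_s = { floor(H (j + d)) : d in D, j in T },
# where T = { j : 0 <= H j < 1 } is the supernode template.
T = [np.array(t) for t in product(range(2), range(2))]
Ds = {supernode_index(t + D[:, k]) for t in T for k in range(D.shape[1])}

print(sorted(Js))    # the 4 x 4 grid of supernode indices
print(sorted(Ds))    # {(0, 0), (0, 1), (1, 0)}: components are 0 or 1

In this toy example the computed D_s contains only 0/1 vectors and includes the identity columns, matching the assumption I ⊆ D_s discussed next.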
As discussed in [20, 26]², the partitioning hyperplanes defined by matrix H have to satisfy HD ≥ 0, i.e., each entry in the product matrix is greater than or equal to zero, in order for (J_s, D_s) to be computable. This implies that the cone formed by the column vectors of E has to contain all dependence vectors in D. Therefore, the components of the vectors in D_s defined above are nonnegative. In our analysis, throughout the paper, we further assume that all dependence vectors of the original algorithm are properly contained in the supernode template. Consequently, the components of D_s are only 0 or 1, and I ⊆ D_s. This is a reasonable assumption for real world problems [5, 26].

To present our analysis, the following additional notations are introduced. The column vector l = [l_1, ..., l_n]^T is called the supernode side length vector. Let L be an n × n diagonal matrix with vector l on its diagonal, and let E_u be a matrix with unit determinant and column vectors in the directions of the corresponding column vectors of matrix E. Then, E = E_u L. Thus, the components of vector l are the supernode side lengths in units of the corresponding columns of E_u. We define the cutting hyperplane direction matrix as H_u = E_u^{-1}. The supernode size, or supernode volume, denoted by g, is defined as the number of iterations in one supernode. The supernode volume g, the matrix of extreme vectors E, and the supernode side length vector l are related as g = |E| = ∏_{i=1}^n l_i. The relative supernode side length vector, r = (r_1, r_2, ..., r_n), is defined as:

r = l / ⁿ√g,

and clearly ∏_{i=1}^n r_i = 1. Vector r describes the side lengths of supernodes relative to the supernode size. For example, if H_u = I, the identity matrix, n = 2, and r = (1, 1), then the supernode is a square. However, if r = (√2, √2/2) with the same H_u and n, then the supernode is a rectangle of the same size as the square supernode, but with the ratio of the two sides being 2:1. We also use R to denote the diagonal n × n matrix with vector r on its diagonal, and Q = R^{-1}. Then, E = ⁿ√g E_u R and H = ⁿ√(1/g) Q H_u. A supernode transformation is completely specified by H_u, r, and g, and is therefore denoted by (H_u, r, g). The advantage of factoring matrix H this way is that it allows us to study the three supernode transformation parameters separately.

¹ We use D to denote either a matrix or a set consisting of the column vectors of matrix D. Whether it is a matrix or a set should be clear from the context.
² Implication 2 of Corollary 1 of [26].
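As a quick numerical check of these relations, the following sketch (assumed 2D example with axis-aligned supernodes of side lengths 4 and 2; the numbers are illustrative) builds E and H from the triple (H_u, r, g) and verifies that they are consistent with E = H^{-1}.

import numpy as np

# Sketch of the factorization H = (1/g)^(1/n) * Q * H_u for an assumed 2D example:
# rectangular, axis-aligned supernodes with side lengths l = (4, 2).
n  = 2
Eu = np.eye(n)                      # unit-determinant matrix of side directions
Hu = np.linalg.inv(Eu)              # cutting hyperplane direction matrix
l  = np.array([4.0, 2.0])           # supernode side length vector
g  = np.prod(l)                     # supernode size g = |E| = prod(l_i) = 8
r  = l / g**(1.0 / n)               # relative side lengths, prod(r_i) = 1
R  = np.diag(r)
Q  = np.linalg.inv(R)

E = g**(1.0 / n) * Eu @ R           # E = g^(1/n) * E_u * R
H = (1.0 / g)**(1.0 / n) * Q @ Hu   # H = g^(-1/n) * Q * H_u

assert np.allclose(E, np.diag(l))           # side vectors have the chosen lengths
assert np.allclose(H, np.linalg.inv(E))     # consistent with H = E^{-1}
print(r, np.prod(r))                        # [sqrt(2), 1/sqrt(2)], product 1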
For an algorithm A = (J, D), a linear schedule [22] is defined as σ : J → N such that σ(j) = ⌊πj + γ⌋, ∀ j ∈ J³, where π is a linear schedule vector, a row vector with n rational components, min{πd_i : d_i ∈ D} = 1, and γ = −min{πj : j ∈ J}. A linear schedule assigns each node j ∈ J an execution step such that the dependence relations are respected. We approximate the length of a linear schedule with:
P(π) = max{ π(j1 − j2) : j1, j2 ∈ J_s }.   (2)
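The schedule length (2) can be evaluated directly by enumerating pairs of points; in practice only the extreme points of the index space matter. A minimal sketch, assuming a 4 x 4 supernode index space and the schedule vector π = (1, 1):

import numpy as np
from itertools import product

# Sketch: approximate linear schedule length (2) for an assumed rectangular
# supernode index space, using the schedule vector pi = (1, 1).
pi = np.array([1.0, 1.0])
Js = [np.array(js) for js in product(range(4), range(4))]   # supernode index space

# P(pi) = max { pi (j1 - j2) : j1, j2 in Js }; the maximum is attained at
# extreme points of the index space.
P = max(pi @ (j1 - j2) for j1 in Js for j2 in Js)
print(P)   # 6.0 here: from (3, 3) to (0, 0)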
Note that the j1 and j2 for which σ(j1) − σ(j2) is maximum are always extreme points of the iteration index space.

The execution of an algorithm (J, D) is as follows. We apply a supernode transformation (H_u, r, g) and obtain (J_s, D_s). A time optimal linear schedule can be found for (J_s, D_s). The execution based on the linear schedule alternates between computation and communication phases. That is, in step i, we assign the supernodes j ∈ J_s with the same σ(j) = i to available processors. After each processor finishes all the computations of a supernode, the processors communicate by passing messages in order to exchange partial results. After the communication is done, we go to step i + 1. Hence, the total running time of an algorithm depends on all of the following: (J, D), H_u, g, r, π, t_c, t_s.

The total running time is the sum of the total computation time and the total communication time, which are multiples of the number of phases in the execution. The linear schedule length corresponds to the number of communication phases in the execution. We approximate the number of computation phases and the number of communication phases by the linear schedule length (2). The total running time is then the sum of the computation time T_comp and the communication time T_comm in one phase, multiplied by the number of phases P. The computation time is the number of iterations in one supernode multiplied by the time it takes to compute one iteration, T_comp = g t_c. The cost of communicating the data computed in one supernode to other dependent supernodes is denoted by T_comm. If c is the number of processors to which the data needs to be sent, then T_comm = c t_s. This model of communication greatly simplifies the analysis, and is acceptable when the message transmission time can be overlapped with other operations, such as computation or the communication startup of the next message, or when the communication startup time dominates the communication operation. Thus, the total running time is:

T = P (T_comp + T_comm) = P (g t_c + c t_s).   (3)

³ v1 v2 denotes the vector dot product of vectors v1 and v2.
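A minimal sketch of the running time model (3); the machine parameters t_c and t_s below are assumed values chosen only to reflect a startup-dominated communication cost:

# Sketch: total running time model (3), T = P * (g*tc + c*ts), with assumed
# machine parameters; ts is taken much larger than tc (startup-dominated model).
def total_time(P, g, c, tc=1e-6, ts=1e-3):
    Tcomp = g * tc          # one computation phase: g iterations per supernode
    Tcomm = c * ts          # one communication phase: c messages, startup cost only
    return P * (Tcomp + Tcomm)

# Example: schedule length P = 6, supernode size g = 1000 iterations,
# data sent to c = 2 neighbouring processors per phase.
print(total_time(P=6, g=1000, c=2))   # 6 * (1000*1e-6 + 2*1e-3) = 0.018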
3 Optimal Supernode Shape
In this section, we present the results pertaining to the time optimal supernode shape, i.e., the supernode relative side length matrix R and the cutting hyperplane direction matrix H_u, derived in the model and under the assumptions set out in the previous section. In the model with constant communication cost, only the linear schedule vector and the linear schedule length in expression (3) depend on the supernode shape. Therefore, in order to minimize the total running time, we need to choose the supernode shape that minimizes the linear schedule length of the transformed algorithm. The problem is a non-linear programming problem:

minimize (over π_s, Q, and H_u)   max{ ⁿ√(1/g) π_s Q H_u (j1 − j2) : j1, j2 ∈ J }   (4)
subject to:   π_s D_s ≥ 1;
              Q is a diagonal matrix with det(Q) = 1;
              det(H_u) = 1;
              H_u D ≥ 0;

where the scalar ⁿ√(1/g) is a constant that can be computed independently of H_u and Q and, without loss of generality, can be excluded from the objective function.
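To illustrate how the supernode shape enters the objective of (4), the following sketch (an assumed 2D setting, not from the paper: H_u = I, π_s = 1, a 100 x 200 iteration index space) evaluates the objective, up to the constant ⁿ√(1/g), for square supernodes and for 1:2 rectangular supernodes. The rectangular shape, which matches the aspect ratio of the index space, yields the shorter schedule.

import numpy as np
from itertools import product

# Sketch: effect of the relative side lengths on the objective of (4),
# assuming Hu = I, pi_s = 1, and a 100 x 200 iteration index space.
N1, N2 = 100, 200
corners = [np.array(c) for c in product((0, N1), (0, N2))]
pi_s = np.ones(2)
Hu = np.eye(2)

def objective(r):
    """max over corner pairs of pi_s Q Hu (j1 - j2), with Q = diag(1/r)."""
    Q = np.diag(1.0 / np.asarray(r))
    return max(pi_s @ Q @ Hu @ (j1 - j2) for j1 in corners for j2 in corners)

print(objective([1.0, 1.0]))            # square supernodes:          300.0
print(objective([2**-0.5, 2**0.5]))     # 1:2 rectangular supernodes: ~282.8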
We studied the selection of the supernode size in [9]. The floor operator of (1) has been dropped in the objective function to simplify the model. It can be shown that the resulting error in the linear schedule length is bounded by ∑_i π_s,i + 2, which is insignificant for components of π_s close to 1 and large iteration index spaces.

Theorem 1 gives a closed form of the optimal linear schedule vector for the transformed algorithm.

Theorem 1. An optimal supernode transformation, with I ⊆ D_s, has an optimal linear schedule π_s = 1.

Proof: As defined in Section 2, min π_s d_i = 1, d_i ∈ D_s. Since I ⊆ D_s, and in order to have a feasible linear schedule, i.e., π_s D_s > 0, we must have π_s ≥ 1. If all extreme projection vectors have non-negative components, then their linear schedule length is minimized by the linear schedule vector with the smallest components, i.e., π_s = 1. If there are extreme projection vectors with negative components, then an initial optimal linear schedule vector may be different from 1. We still must have π_s ≥ 1 in order to satisfy the definition of the linear schedule vector, i.e., min π_s d_i = 1. Let the i-th component of π_s be greater than 1. Then, we
can set π_s' = π_s S and modify the supernode shape by setting Q' = S^{-1} Q, where S is a diagonal matrix with S_ii = ⁿ√(π_s,i) / π_s,i and S_jj = ⁿ√(π_s,i) for j ≠ i. The linear schedule of all vectors in the transformed algorithm remains the same, i.e., the linear schedule of a vector v' = S^{-1} v is now π_s' v' = π_s S S^{-1} v = π_s v. However, min π_s' > 1, and we can divide π_s' by min π_s', which shortens the linear schedule of all points. Therefore, we have obtained a shorter linear schedule for the algorithm, with one more component of the linear schedule vector equal to 1. Continuing the process, we eventually arrive at the linear schedule with all components equal to one. □

Theorem 2 gives a necessary and sufficient condition for an optimal relative side length matrix R, and consequently its inverse matrix Q, assuming the optimal linear schedule vector 1.

Figure 1: Construction of vector s. Illustration for the proof of Theorem 2.

Theorem 2. Let g and H_u be fixed, let the linear schedule vector be 1, and let M be the set of maximal projection vectors in the transformed space. The relative side length vector r is optimal if and only if the vector with equal components, v, belongs to the cone generated by the maximal projection vectors of the transformed algorithm:

v ∈ cone(M).

Proof: Let m be the linear schedule length and, without loss of generality, let v be a vector with equal components such that 1v = 1a = m, a ∈ M.

Sufficient condition: Let vector v be included in cone(M) and let the corresponding relative supernode side length matrix be R. Consider another supernode transformation close to the original, with R' = diag(r_i'), ∏ r_i' = 1, slightly different from R. Suppose the image of v under the transformation with R' is v' = Q'Rv and the schedule length of v' is 1v'. Then, based on the relation between the geometric and arithmetic means, 1v' ≥ 1v because ∏ v_i' = ∏ v_i. The new maximal projection vectors' linear schedule length can only be greater than or equal to 1v', which is at least P(v) = 1v. Therefore, the supernode relative side length matrix R is optimal.

Necessary condition: We prove by contradiction. Let R be optimal and assume v is not in cone(M). Then there exists a separating hyperplane Z: zx = 0 for all x ∈ Z, such that za < 0 for all a ∈ M, and z1 > 0. Let s = z + 1. We can select the normal vector z arbitrarily close to orthogonal to vector 1 and of arbitrary length, in order to ensure s_i > 0 and ∏ s_i = 1. The former is ensured by selecting a sufficiently small length of vector z. The latter is ensured by selecting an appropriate angle between z and 1 such that s lies on the curve ∏ x_i = 1. Based on the relation between the arithmetic and geometric means, we must have s1 > 1·1 = n, which together with s1 = 1·1 + z1 implies z1 > 0. The latter is the case by the construction of Z, and thus the construction of s is feasible. Figure 1 illustrates the construction of vector s in two-dimensional space. Then, by further scaling the supernode index space by diag(s), i.e., by choosing R' = diag(s)^{-1} R, we improve the linear schedule length
of vectors in M:

z a < 0,
(s − 1) a < 0,
s a < 1 a,
1 diag(s) a < 1 a = m.
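The sufficiency argument rests on the arithmetic/geometric mean inequality: among vectors with the same product of components, the vector with equal components has the smallest value of 1·v. A small numerical check of this fact (illustrative sketch with assumed random perturbations):

import numpy as np

# Sketch: the arithmetic/geometric mean fact used in the proof of Theorem 2,
# checked numerically on random perturbations (illustrative, not from the paper).
rng = np.random.default_rng(0)
n, m = 2, 6.0
v = np.full(n, m / n)                       # vector with equal components, 1.v = m

for _ in range(1000):
    # random positive rescaling with product of factors equal to 1,
    # mimicking a change of relative side lengths R' with prod(r_i') = 1
    f = rng.uniform(0.2, 5.0, size=n)
    f /= np.prod(f) ** (1.0 / n)
    v_prime = f * v                         # same product of components as v
    assert np.ones(n) @ v_prime >= np.ones(n) @ v - 1e-9   # 1.v' >= 1.v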