Three-Dimensional Orthogonal Tile Sizing Problem: A Mathematical Programming Approach

R. Andonov (1), N. Yanev (2) and H. Bourzoufi (1)

(1) LIMAV, Universite de Valenciennes, Le Mont Houy - B.P. 311, 59304 Valenciennes Cedex, France
E-mail: andonov[bourzouf]@univ-valenciennes.fr
(2) University of Sofia, Faculty of Mathematics and Informatics, 5, J. Baucher str., Sofia, Bulgaria
We discuss in this paper the problem of finding the optimal tiling transformation of three-dimensional uniform recurrences on a two-dimensional torus/grid of distributed-memory general-purpose machines. We show that even for the simplest class of recurrences admitting such a transformation, the corresponding problem of minimizing the total running time is a non-trivial non-linear integer programming problem. For the latter we derive an O(1) algorithm for finding a good approximate solution. The theoretical evaluations and the experimental results show that the solution obtained in this way approximates the original minimum sufficiently well in the context of the considered problem. Such analytical results are of obvious interest and can be successfully used in parallelizing compilers as well as in performance tuning of parallel codes.

Keywords: blocking, partitioning, communication/computation overlap, coarse grain pipelining, parallelizing compilers, integer non-linear optimization, dynamic programming recurrences
1 Introduction

A common approach to finding an optimal task granularity that balances the available parallelism against the cost of communications, when a nested loop program is executed in SPMD (Single Program Multiple Data) fashion on a DMM, is iteration space tiling (also called "super-node partitioning") [27, 14, 1, 22]. It may be used as a technique in parallelizing compilers (see [13, 20], where it is called coarse-grain pipelining) as well as in performance tuning of parallel codes by hand (see also [16, 23, 24, 17]). A tile in the iteration space is a collection of iterations to be executed as a single unit, with the following protocol: all the (non-local) data required for each tile is first obtained by appropriate communication calls; the body of the tile is executed next, and this is repeated iteratively. The communications are performed by send/receive calls, and the code for the body contains no calls to any communication routine. The tiling problem can be broadly defined as the problem of choosing the tile parameters (notably the shape and size) in an optimal manner. It may be decomposed into two subproblems:
choosing a "good" tile shape [14, 23, 22, 6, 7], and finding the "best" tile size [18, 23, 15, 4, 13, 20]. For the former problem, the communication cost is approximated by the number of dependency vectors crossing a tile boundary. The latter problem assumes that the tile shape is first given and then seeks to minimize the total execution time. In general, this appears to be a hard integer non-linear optimization problem, and researchers use different heuristics. This does not, however, exclude particular classes where the optimal solution of a sufficiently good approximation of the original problem can be found analytically.

In this paper we propose an O(1) algorithm for solving the tile sizing problem when three-dimensional uniform dependency loops are executed in SPMD mode on a two-dimensional torus of general-purpose processors. More precisely, we address the problem of computing, for all z = [i, j, k]^T in Dom ⊆ Z³₊, a function Y[z] given by

    Y[z] = f(Y[z − d_1], Y[z − d_2], ..., Y[z − d_h])    (1)

with appropriate boundary conditions in a domain D = {[i, j, k] : 0 < i ≤ l, 0 < j ≤ m, 0 < k ≤ n}, also called the iteration space. We call D = [d_1, d_2, ..., d_h] the dependency matrix of (1). We assume D to be a full rank matrix of size 3 × h, where h ≥ 3. f is a single-valued function, computed at point z in a single unit of time, and Y[z] is its value at this point. In addition to these common assumptions, we suppose D to have non-negative components, i.e. D ≥ 0. Typical examples of such recurrences can be found in dynamic programming algorithms for solving optimization problems such as finding the edit distance between strings [25] or bidimensional knapsack type problems [19, 9]. Under these assumptions the iteration space is a parallelepiped and the dependencies admit orthogonal tiling (i.e. they are such that tiles whose boundaries are parallel to the domain boundary are valid). We assume in this case a rectangular parallelepiped tile shape and we focus only on the tile sizing problem.
Recently, Andonov and Rajopadhye [4] solved this problem in the two-dimensional version of (1) and showed how the methods of systolic array synthesis can be used for optimal tiling. The main observation is that, independent of the original dependencies, the tiled computation for all programs in the considered class can be abstracted by a single, specific system of uniform recurrence equations (UREs). An analysis of this URE using a realistic model of the communication cost yields a non-linear integer minimization problem having the total running time as objective function and where all the available resources (the size of the iteration space as well as the size of the processor array) appear in the problem constraints. The result in [4] gives a closed form formula for the optimal tile size. Here we extend this approach to the three-dimensional case, where the non-linear minimization problem is significantly harder and requires the development of a new technique for its solution. We also reveal new and useful relationships between the objective function and the optimal tile volume, and we conducted an experimental validation of the proposed technique. Let us mention again that the emphasis of this paper, being an extension of previous results, is on the new points in our approach, such as the algorithm for solving the minimization problem and the specificities of the three-dimensional program implementation. (A few concrete examples of such recurrences are given in Section 4.) More details about
the definitions and notations used, as well as about some practical aspects of the experimental validation, can be found by the interested reader in [4, 3]. The remainder of this paper is organized as follows. In the following section, we obtain analytic formulae for the running time of a tiled program on a torus of general-purpose multiprocessors when D is the identity matrix I. We solve the corresponding optimization problem in Section 3. We proceed to the more general case D ≥ 0 in Section 4. We give examples and discussions of the obtained results in Section 5, and we conclude in Section 6.
2 Orthogonal tiling in a rectangular parallelepiped

2.1 Framework of the approach and its "systolic" steps

We present our approach when h = 3 and D is the identity matrix I, and we postpone the discussion of the more general case of (1) to Section 4. Under these assumptions, (1) is a perfect loop nest of depth 3 with the basis vectors {e_i}, i = 1, 2, 3, as dependency vectors and whose iteration space is the 3-dimensional rectangular parallelepiped D. We are interested in optimizing the global execution time when executing Eqn. (1) on a 2-dimensional grid of general-purpose processors. Similarly to the two-dimensional case of equation (1), it can be easily argued using the ideas from [6, 22] why the orthogonal tile shape is the most convenient (see the more detailed discussion in [4]). We therefore focus on the determination of the tile size, assuming the orthogonal tile shape. Following as closely as possible the approach from [4], we perform the following steps:

1. Partition the iteration space into orthogonal tiles (rectangular parallelepipeds) of size r × s × t. Leave r, s and t as free variables.

2. "Cluster" the recurrence into tiles, yielding a new uniform recurrence over a new domain. Each tile is considered as an atomic computation.

3. Apply systolic space-time transformations, yielding a virtual two-dimensional systolic array.

4. Implement this array on a torus of size p × q = P by performing multiple passes.

5. Relax the systolic timing model to account for practical machines, and obtain a formula for the total running time of the final implementation. This formula will be expressed as a function of r, s and t, as well as of other parameters of the problem and architecture, which are constants.

6. Choose the tile size that minimizes this running time by solving the corresponding optimization problem. This is a discrete (non-linear) optimization problem, and special techniques are used to solve it.
In the first step of our procedure we partition the domain into rectangular parallelepipeds of size r × s × t, specified by the planes with normals e_1, e_2 and e_3. Now the tile graph is a parallelepiped of ⌈l/r⌉ × ⌈m/s⌉ × ⌈n/t⌉ nodes (we will drop the ceilings from all subsequent formulae), and the tile dependencies remain e_1, e_2, e_3. Then, applying the projection method from data dependence analysis [21], we map this to a two-dimensional systolic array by choosing a(i, j, k) = (i, j) as an allocation function and t(i, j, k) = i + j + k as a timing function. This space-time transformation gives a virtual array of ⌈l/r⌉ × ⌈m/s⌉ nodes. We consider the P processors in the network as a logical p × q two-dimensional torus, with P = p × q. We will denote the (i, j)-th processor (0 ≤ i < p, 0 ≤ j < q) in this mesh as P_{i,j}. By setting p = 1 or q = 1 we automatically obtain linear arrays in our model. The projection of the original tile graph, containing l/r × m/s nodes, will be distributed to the processor nodes, so that tile node (i, j) is assigned to node P_{(i−1) mod p, (j−1) mod q}.

2.2 Relaxing the systolic model: execution time on DMM

We now develop an expression for the running time of such a tiled program on a distributed memory machine, which is valid for the range of possible values of r, s and t, using the standard communication model [20, 18, 11, 10]. The code executed for a tile is the standard loop (the receive call is a blocking one, to ensure synchronization):

    repeat
        receive(north); receive(west);
        compute(body);
        send(south); send(east);
    end
We use a commonly used communication model [8, 20] in which the time to transmit a message of v words between two processors is given by

    transfer(v) = β + τ_t·v    (2)

where β is the start-up latency and τ_t is the transmission rate for a specific architecture. We assume that the time to execute a single instance of the loop body in (1) is τ_a. The time to execute a tile body is then t_a = τ_a·rst. We will use the convention that P, λ and T denote period, latency and completion time, respectively. Let us define the tile period P_t as the time elapsed (in the steady state) between corresponding instructions of any two successive tiles (two tiles are considered successive if one depends directly on the other) that are mapped to the same processor. The tile latency, λ, is the time elapsed between corresponding instructions of two successive tiles mapped to adjacent processors. Since P_t is at least as large as the time that the CPU is busy during each tile (four OS calls, and the tile body), we can specify it as

    P_t = 4β + τ_a·rst    (3)

Now consider the latency, λ_v, between tiles when the volume of the transmitted message is v. Assume that two (successive) tiles start simultaneously on adjacent processors. Because of
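In computational terms, the model so far can be written down directly. The following is a minimal sketch; the identifiers beta, tau_t and tau_a are our names for the paper's Greek constants, all expressed in the same time unit:

```python
# Minimal sketch of the communication/computation model of Sec. 2.2.
# beta: start-up latency; tau_t: transmission rate per word;
# tau_a: time to execute one instance of the loop body.

def transfer(v, beta, tau_t):
    """Time to transmit a message of v words, Eqn. (2)."""
    return beta + tau_t * v

def tile_period(r, s, t, beta, tau_a):
    """Tile period P_t = 4*beta + tau_a*r*s*t, Eqn. (3):
    four OS calls plus a tile body of r*s*t unit computations."""
    return 4 * beta + tau_a * r * s * t
```

For instance, with the Paragon-like constants measured in Section 5 (β = 90, τ_t = 0.011, τ_a = 0.51, in microseconds), transfer(1000, 90, 0.011) is about 101 μs.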
the dependency between the tiles, the receive on the second tile will block until the first tile is finished, its results are sent, and they arrive into user memory. Hence,

    λ_v = P_t + τ_t·v − 2β = 2β + τ_t·v + τ_a·rst    (4)

Observe that because the call to receive is overlapped, we subtract a factor of 2β, and the τ_t·v accounts for the transmission time. Due to the dependencies, the only data to transmit is a facet of the parallelepiped. Therefore, taking into account the volume of the message in axis i (respectively j), we get for the latency λ_i (respectively λ_j) the following formulae:

    λ_i = P_t + τ_t·st − 2β = 2β + τ_t·st + τ_a·rst    (5)

    λ_j = P_t + τ_t·rt − 2β = 2β + τ_t·rt + τ_a·rst    (6)
Similarly, we can define the latency λ_k in axis k. Since this axis is the direction of projection (and the data stay in the memory of the same processor), we simply neglect the propagation time and we have:

    λ_k = P_t    (7)

Let us note that relation (3) is valid only if β + τ_a·rst ≥ max(τ_t·rt, τ_t·st), which is always true for the parallel machines considered here. We have a rectangular parallelepiped tile graph of L × M × N tiles, where L = l/r, M = m/s and N = n/t (assuming that r divides l, s divides m and t divides n). The total running time of the program is then obtained by a straightforward application of systolic-like counting arguments (for the sake of simplicity we count the corner tiles twice; we believe the impact of that simplification is negligible, especially for large values of l, m and n). In general, we will need (L/p)·(M/q) passes through the torus of p × q processors, where each pass treats in a pipelined way p × q × N tiles, each one of size r × s × t. In fact, a pass performs N horizontal strides, each one of p × q × 1 tiles. This means that each pass consumes a parallelepiped of (p·r) × (q·s) × n points of the iteration space. Note that because the k axis is chosen as the direction of projection, we explore the iteration space completely in the vertical direction (i.e., we perform N = n/t tiles in the k axis) before starting a new pass in the i or j axis. Therefore, the period in the k axis is given by:

    P_k = N·P_t    (8)

A single pass involves communications between p processors in the i axis, and we define the latency of a pass in the i axis as p·λ_i. Similarly, the latency in the j axis is defined as q·λ_j. The period (8) and the two latencies (in the i and j axes) give us three constraints that determine when the next pass of p × q × N tiles can start. More precisely, they are: the N tiles in axis k must be completely finished, and the last (p-th) processor in axis i or the last (q-th) processor in axis j should have produced some results. Hence, the pass-latency of the torus is

    λ_torus = max(P_k, min(p·λ_i, q·λ_j))    (9)
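The latency and pass-latency formulae can likewise be collected in a short sketch (identifier names are ours; it assumes t divides n):

```python
def tile_period(r, s, t, beta, tau_a):
    """Tile period P_t = 4*beta + tau_a*r*s*t, Eqn. (3)."""
    return 4 * beta + tau_a * r * s * t

def latencies(r, s, t, beta, tau_t, tau_a):
    """Tile latencies of Eqns. (5)-(7): the message is a tile facet
    (s*t words in axis i, r*t words in axis j, nothing in axis k)."""
    P_t = tile_period(r, s, t, beta, tau_a)
    lam_i = P_t + tau_t * s * t - 2 * beta  # Eqn. (5)
    lam_j = P_t + tau_t * r * t - 2 * beta  # Eqn. (6)
    lam_k = P_t                             # Eqn. (7)
    return lam_i, lam_j, lam_k

def pass_latency(n, r, s, t, p, q, beta, tau_t, tau_a):
    """Pass-latency of the torus, Eqn. (9), with P_k = N*P_t, Eqn. (8)."""
    lam_i, lam_j, _ = latencies(r, s, t, beta, tau_t, tau_a)
    P_k = (n // t) * tile_period(r, s, t, beta, tau_a)
    return max(P_k, min(p * lam_i, q * lam_j))
```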
The execution time of a single pass is given by

    T_p = (p − 1)·λ_i + (q − 1)·λ_j + P_k    (10)

The entire program is executed in (L/p)·(M/q) passes through the torus, and hence the last pass can only start at ((L·M)/(p·q) − 1)·λ_torus. Thus, the total running time is T(r, s, t) = ((L·M)/(p·q) − 1)·λ_torus + T_p, and we need to solve the following optimization problem:

    min { T(r, s, t) : 0 < r ≤ l/p, 0 < s ≤ m/q, 0 < t ≤ n; r, s, t integer }    (11)

Considering the possible cases in (9) and simplifying, we get:

    P_k ≥ min(p·λ_i, q·λ_j)  ⟹  T_1(r, s, t) = (L·M)/(p·q)·N·P_t + (p − 1)·λ_i + (q − 1)·λ_j

    P_k ≤ p·λ_i ≤ q·λ_j  ⟹  T_2(r, s, t) = ((L·M)/q − 1)·λ_i + (q − 1)·λ_j + N·P_t    (12)

    P_k ≤ q·λ_j ≤ p·λ_i  ⟹  T_3(r, s, t) = ((L·M)/p − 1)·λ_j + (p − 1)·λ_i + N·P_t

subject to

    1 ≤ r ≤ l/p, 1 ≤ s ≤ m/q and 1 ≤ t ≤ n    (13)
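The three-case running time (12) is easy to evaluate numerically. The helper below is hypothetical (not from the paper) and assumes, for simplicity, that r, s and t divide l, m and n exactly:

```python
def total_time(l, m, n, r, s, t, p, q, beta, tau_t, tau_a):
    """Evaluate T(r, s, t) according to the case analysis of Eqn. (12)."""
    L, M, N = l // r, m // s, n // t
    P_t = 4 * beta + tau_a * r * s * t                    # Eqn. (3)
    lam_i = 2 * beta + tau_t * s * t + tau_a * r * s * t  # Eqn. (5)
    lam_j = 2 * beta + tau_t * r * t + tau_a * r * s * t  # Eqn. (6)
    P_k = N * P_t                                         # Eqn. (8)
    if P_k >= min(p * lam_i, q * lam_j):                  # region I: T_1
        return (L * M) / (p * q) * P_k + (p - 1) * lam_i + (q - 1) * lam_j
    if p * lam_i <= q * lam_j:                            # region II: T_2
        return ((L * M) / q - 1) * lam_i + (q - 1) * lam_j + N * P_t
    return ((L * M) / p - 1) * lam_j + (p - 1) * lam_i + N * P_t  # T_3
```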
3 Solution of the optimization problem

3.1 Three-dimensional case

We saw in the previous section that minimizing the total execution time of Eqn. (1) is equivalent to finding the minimum of the non-linear objective function (12) in the parallelepiped domain (13). This domain is divided by the non-linear constraint P_k = min(p·λ_i, q·λ_j) into two regions. In the first one (called region I), P_k ≥ min(p·λ_i, q·λ_j) and the cost function is given by T_1(r, s, t). In the second one (called region II), P_k ≤ min(p·λ_i, q·λ_j) and the cost function is given either by T_2(r, s, t) or by T_3(r, s, t). Similarly to the two-dimensional case [4], it is easy to argue that region II is not an interesting one (except on the boundaries s = m/q or r = l/p, which we consider later on). In fact, the constraint P_k ≤ min(p·λ_i, q·λ_j), which is equivalent to (n/t)·P_t ≤ min(p·λ_i, q·λ_j), is never satisfied asymptotically, since the left side depends on n, while the right side depends on none of the parameters of the iteration space. The case when this constraint is satisfied as a strict
inequality can be considered as a waste of resources, since the problem size is too small to make effective use of the available processors. In this case we can reduce p (if p·λ_i = min(p·λ_i, q·λ_j)) until we are at the boundary between the two regions, and we re-solve the problem for this reduced number of processors. The obtained solution is guaranteed to be no worse than the original one, since T_2(r, s, t) is independent of p. When q·λ_j = min(p·λ_i, q·λ_j), we reduce q and we obtain the same result. Therefore our problem reduces to minimizing T_1(r, s, t) in region I. Substituting the definitions of L, M, N and P_t, λ_i, λ_j in (12) and simplifying, we obtain the problem given below.

    Minimize T(r, s, t) = lmn/(pqrst)·(4β + τ_a·rst) + (p − 1)·(2β + τ_t·st + τ_a·rst) + (q − 1)·(2β + τ_t·rt + τ_a·rst)    (14)

    subject to
        1 ≤ r ≤ l/p    (15)
        1 ≤ s ≤ m/q    (16)
        1 ≤ t ≤ n    (17)
        min( p·(2β + τ_t·st + τ_a·rst), q·(2β + τ_t·rt + τ_a·rst) ) ≤ (n/t)·(4β + τ_a·rst)    (18)

The following lemma is a natural extension of the corresponding lemma from the two-dimensional case; the proof follows the same lines and uses similar arguments as in [4].
Lemma 1
The solution of problem (14)-(18) satisfies either t = 1 or (r = l/p and s = m/q).

Proof
Let (r, s, t) be a point satisfying 1 < t and s < m/q. In this case T strictly decreases if we reduce t, keeping ts and r fixed (i.e., move along a hyperbola, increasing s, decreasing t and keeping r): only the term (q − 1)·τ_t·rt in (14) decreases, and the remaining terms are constants. On the other hand, if any point (r, s, t) satisfies (18), any other point (r', s', t') hyperbolically to its left (i.e., r' = r and s'·t' = s·t) also satisfies it, since in this case the RHS in (18) increases, while the LHS can only decrease. A similar argument holds if we consider a point (r, s, t) satisfying 1 < t and r < l/p.

Lemma 1 reduces the dimension of the solution space of problem (14)-(18). This lemma shows that the solution lies either on the edge r = l/p, s = m/q of the parallelepiped or on its facet t = 1. To find the minimum of (14) in the first case (when r and s are constants) is an easy task. The function in this case has the form F(x) = A/x + B·x + C, the sum of a hyperbola and a straight line. This is a classic, strictly convex function, which has a unique minimum (for positive x) at

    x* = sqrt(A/B)    (19)
whose value is F(x*) = C + 2·sqrt(A·B). If we wish to minimize the function at an integer point within an interval bounded by a and b (where a < b, and there is at least one integer between them), the minimum occurs at one of the two integers ⌊x*⌋ or ⌈x*⌉ (or at ⌈a⌉ or ⌊b⌋ if x* is too small or too large). Let us denote by t* the minimizer of the function T(r, s, t) on the edge E_1 = {(r, s, t) : r = l/p, s = m/q, 1 ≤ t ≤ n}. By annulling the derivative, we obtain that the function is minimized on this edge at the point

    t* = sqrt( 4β·npq / ( τ_a·lm·(p + q − 2) + τ_t·(mp·(p − 1) + lq·(q − 1)) ) )    (20)

3.2 Exploring the facet t = 1 of the parallelepiped

Finding the minimum on a facet of the parallelepiped is much harder. Let us point out that the facet t = 1 corresponds to an edge in the two-dimensional case, where the solution is obvious [4]; the analysis presented below is necessary only for the three-dimensional case. Let us assume, for the sake of simplicity, that p·λ_i = min(p·λ_i, q·λ_j). The case q·λ_j = min(p·λ_i, q·λ_j) is symmetric to the previous one, and the results are easily obtained from the analysis which follows. When t = 1 we get the problem
    Minimize T(r, s) = T(r, s, 1) = lmn/(pqrs)·(4β + τ_a·rs) + (p − 1)·(2β + τ_t·s + τ_a·rs) + (q − 1)·(2β + τ_t·r + τ_a·rs)    (21)

    subject to
        1 ≤ r ≤ l/p    (22)
        1 ≤ s ≤ m/q    (23)
        p·(2β + τ_t·s + τ_a·rs) ≤ n·(4β + τ_a·rs)    (24)

The properties of the objective function T(r, s) are given by the following theorem.

Theorem 2
For the function T(r, s) over the domain R²₊ = {(r, s) : r > 0, s > 0} the following holds:

(i) The global minimum of T(r, s) is attained at the point (r*, s*), where

    r* = sqrt( (p − 1)/(q − 1) · v* ),    s* = sqrt( (q − 1)/(p − 1) · v* )    (25)

and v* is the globally optimal tile volume, given by

    v* = sqrt( 4β·lmn / (pq·(p + q − 2)·τ_a) )    (26)

(ii) T(r, s) is a strictly convex function over the domain R_v = R²₊ ∩ {(r, s) : rs ≤ v*}.
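Numerically, the closed forms obtained so far are straightforward to evaluate. The following sketch (identifier names are ours) collects the integer minimization of F(x) = A/x + B·x + C from Eqn. (19), the edge minimizer (20), and the global optimum (25)-(26):

```python
import math

def int_min_hyperbola(A, B, C, lo, hi):
    """Integer minimizer of F(x) = A/x + B*x + C on [lo, hi] (A, B > 0).
    The real minimizer is x* = sqrt(A/B), Eqn. (19); the integer optimum
    is floor(x*) or ceil(x*), clamped to the interval."""
    F = lambda x: A / x + B * x + C
    x_star = math.sqrt(A / B)
    cand = {math.floor(x_star), math.ceil(x_star), math.ceil(lo), math.floor(hi)}
    return min((x for x in cand if lo <= x <= hi), key=F)

def t_star_edge(l, m, n, p, q, beta, tau_t, tau_a):
    """Minimizer of T(l/p, m/q, t) on the edge E_1, Eqn. (20)."""
    num = 4 * beta * n * p * q
    den = tau_a * l * m * (p + q - 2) + tau_t * (m * p * (p - 1) + l * q * (q - 1))
    return math.sqrt(num / den)

def global_optimum(l, m, n, p, q, beta, tau_a):
    """Optimal tile volume v*, Eqn. (26), and the tile sides (r*, s*),
    Eqn. (25), which lie on the line s = r*(q-1)/(p-1)."""
    v = math.sqrt(4 * beta * l * m * n / (p * q * (p + q - 2) * tau_a))
    r = math.sqrt((p - 1) / (q - 1) * v)
    s = math.sqrt((q - 1) / (p - 1) * v)
    return v, r, s
```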
Proof
(i) It is convenient to partition the points of R²₊ into classes H_v = {(r, s) ∈ R²₊ : rs = v}. Then, from R²₊ = ∪_v H_v, we have

    min_{R²₊} T(r, s) = min_v min_{H_v} T(r, s)    (27)

It is easy to see that T(r, s) can be written as T(r, s) = g(v) + τ_t·( sqrt((p − 1)·s) − sqrt((q − 1)·r) )² for some function g(v). Hence, for a fixed v, the minimum is achieved when the second term in this sum is annulled. We obtain in this way that argmin_{H_v} T(r, s) = (r_v, s_v), where (r_v, s_v) is the intersection point of the hyperbola rs = v with the line s = r·(q − 1)/(p − 1), i.e. r_v = sqrt((p − 1)/(q − 1)·v), s_v = sqrt((q − 1)/(p − 1)·v). Therefore, (27) is reduced to

    min_{R²₊} T(r, s) = min_{v>0} T( sqrt((p − 1)/(q − 1)·v), sqrt((q − 1)/(p − 1)·v) ) = min_{x>0} f(x)

where

    f(x) = (4β·lmn/(pq))·(1/x²) + τ_a·(p + q − 2)·x² + 2·τ_t·x·sqrt((p − 1)(q − 1)) + const,    x = sqrt(v)

The function f(x) is, in fact, the function T(r, s) along the optimal line s = r·(q − 1)/(p − 1), suitably expressed in terms of rs = v = x². The function f(x) is strictly convex and its minimal value can be found by annulling its derivative. In order to obtain a closed form solution we discard the coefficient 2·τ_t·sqrt((p − 1)(q − 1)) from the equation f'(x) = 0 and obtain in this way

    x* = sqrt(v*) = ( 4β·lmn / (τ_a·pq·(p + q − 2)) )^{1/4}    (28)

which implies (25). Our approximation underestimates v_opt by less than (p + q − 1)·τ_t, and respectively r_opt and s_opt by less than (p + q − 1)·τ_t/s and (p + q − 1)·τ_t/r. All our numerical evaluations show that for real instances of the function (on the Intel Paragon system the transmission time is τ_t = 0.044 μs) this affects the fifth digit to the right of the decimal point, so we do not lose precision by using the compact formula (28). Let us point out that discarding the terms multiplied by τ_t is a common approach in this research domain. However, we observed that discarding these terms at some earlier step of the analysis could result in a volume value far from the optimal one. The impact of these terms on the final precision of the results still needs to be clarified.

(ii) The Hessian of the function T(r, s) is given by the matrix

    H(r, s) = [ 8β·lmn/(pq·r³·s)                                    4β·lmn/(pq·r²·s²) + (p − 1)·τ_a + (q − 1)·τ_a ]    (29)
              [ 4β·lmn/(pq·r²·s²) + (p − 1)·τ_a + (q − 1)·τ_a      8β·lmn/(pq·r·s³) ]
The determinant of the matrix H(r, s) is positive if and only if

    8β·lmn/(pq·r²·s²) > 4β·lmn/(pq·r²·s²) + (p + q − 2)·τ_a  ⟺  rs < sqrt( 4β·lmn / (pq·(p + q − 2)·τ_a) ) = v*,

which is always true in R_v.

Finding the optimum of the constrained problem is oriented toward the class of real problems, i.e. problems with lmn >> pq, where τ_a, τ_t and β are parameters of order 10⁻⁶-10⁻⁵ sec. This allows the main results to be obtained without much rigour, by a simple geometric approach. Our strategy is justified by the shape of a typical surface of the objective function over the plane (v, s), as shown in the left-hand graphic of Fig. 3, where G(v, s) is the function T(r, s) expressed in terms of rs = v and s. The main characteristics of such surfaces are: the objective function changes more rapidly along the isolines G(v, s = const) than along the isolines G(v = const, s), which are very close to straight lines with small slope.
Figure 1: Illustration of the global/constrained minimum point position for three objective functions

Fig. 1 illustrates all possible cases for the global/constrained minimum point. As we have already shown, regardless of the values of the parameters of the function T(r, s), its global minimum lies on the line s = r·(q − 1)/(p − 1), which obviously depends only on the parameters p, q. Assume that the points A, B and C are the global minima corresponding to three different settings of the parameters of the function T(r, s) (p and q fixed). As proven above, the global minimum is at the intersection of the line s = r·(q − 1)/(p − 1) and the hyperbola rs = v* with v* given by (28). For these three instances of the optimization problem, the graphics reveal the three principally
possible positions of the global optimum point (in fact there are four cases: when the line crosses the right edge of the rectangle, the point B is situated to the right of the line r = l/p, and the analysis follows the same arguments): the optimal hyperbola crosses (or does not cross) the restrictive rectangle D, and the intersection point lies (or does not lie) in it. In the first case (illustrated by the hyperbola rs = v_1), the global minimum point A does not lie in D and the hyperbola does not cross the rectangle. This is equivalent to v_1 > lm/(pq) (l/p and m/q are the coordinates of the point A*). Because of the convexity of the function, the constrained optimum in this case is obviously at A*, and the final analysis follows the discussion for t > 1. In the second case (rs = v_2), the hyperbola crosses the rectangle but the global optimum point B lies outside of it. Then the solution B* is the integer point nearest to the intersection of rs = v_2 and s = m/q. And finally, when the global minimum C is in D, the solution C* is given by the integer point in the interior of the rectangle nearest to the point C: this is because the objective function is convex in a small neighbourhood of the global minimum (r*, s*) (see (ii) of Theorem 2). However, all our numerical experiments show that even in this case the integer points closest to the intersection of rs = v_3 with the borders of the rectangle give a sufficiently good approximation of the solution (i.e. the solution always lies on a border of the rectangle). As we mentioned, this is justified by the shape of the surface in Fig. 3. The interpretation is that tiles having a feasible tile size and a volume close to the optimal one provide close-to-optimal behaviour. Also note that the objective function is strictly convex on the borders, so the minimum (even the integer minimum) is easily found. A similar analysis holds in the cases when A*, B* or C* violates inequality (24); we show furthermore that this cannot happen in realistic instances. In all cases it is straightforward that the above considerations allow for creating an algorithm of O(1) complexity for solving the problem (14)-(18).
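The case analysis above amounts to a constant-time procedure. The following sketch uses hypothetical helper names; the τ_t term is discarded as in (28), the non-linear constraint (24) is ignored (cf. Sec. 3.3), and p, q ≥ 2 and divisibility of l by p and m by q are assumed:

```python
import math

def solve_facet(l, m, n, p, q, beta, tau_a):
    """O(1) approximation of the minimizer of (21)-(23) on the facet
    t = 1, following the three cases A, B, C of Fig. 1."""
    v = math.sqrt(4 * beta * l * m * n / (p * q * (p + q - 2) * tau_a))  # (26)
    r = math.sqrt((p - 1) / (q - 1) * v)                                 # (25)
    s = math.sqrt((q - 1) / (p - 1) * v)
    if v > (l / p) * (m / q):
        # Case A: the hyperbola rs = v* misses the rectangle entirely;
        # the constrained optimum is the corner A* = (l/p, m/q), and the
        # final analysis falls back to the t > 1 edge.
        return l // p, m // q
    if s > m / q:
        # Case B: global optimum above the rectangle; take the integer
        # point nearest to the intersection of rs = v* with s = m/q.
        return round(v / (m / q)), m // q
    if r > l / p:
        # Symmetric case: clamp to the border r = l/p.
        return l // p, round(v / (l / p))
    # Case C: global optimum inside the rectangle; round it.
    return round(r), round(s)
```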
3.3 Analysis of the non-linear constraint

For lack of space we mention here only the main results of this section; the proofs and the algorithm can be found in the extended version [5]. We prove that, under the reasonable assumption n > p, the non-linear constraint (24) is active only if p > ((τ_t + τ_a)/τ_a)·n, which is obviously unlikely in our model.
4 Extension of the results when D is 3 × h with non-negative components

All formulae and results in the previous sections have been presented under the assumption that the dependency matrix D is the identity matrix I. In this section we gradually relax this assumption. Let us first suppose D to be a diagonal matrix with integer components d_ii, i = 1, 2, 3, such that d_ii ≥ 1. The first modification in our approach is in the formula for the latency in each axis. It is easily seen that the volume of the message transmitted between any two tiles in axis j (say) is no longer the volume of the tile facet in this axis, but this volume multiplied by d_jj, i.e. d_jj·rt. We have respectively to update (6) and all resulting formulae. This constant multiplier does not modify the nature of the analysis. To guarantee local communications between tile nodes we impose d_jj as a lower bound for the variable s. For the variable t we can still have t ≥ 1, since the non-local communications are in fact
projected into the processors' memory, as we do for the knapsack problem [4]. As for the tile graph dependencies, it is known [14] that the tile graph will always have the standard basis vectors as dependencies, provided the tiles are large relative to the original dependencies. Additional dependencies, if any, will lie in the cone of the basis vectors and can be represented as linear combinations of these. The corresponding data can always be combined with the results of the orthogonal dependencies and their communication can be subsumed (i.e. the appropriate constant multiplier in the latency should be found). Therefore, static diagonal dependencies can be easily ignored for analysis purposes. A similar strategy can also be applied to some particular cases of dynamic dependencies. Let us consider, for example, the following knapsack-like recurrences. For all (i, j, k) ∈ D, where D := {(i, j, k) : 0 ≤ i ≤ l, 0 ≤ j ≤ m, 0 ≤ k ≤ n}, compute

    f(i, j, k) = max( f(i, j, k − 1), f(i − w_k, j − u_k, k − 1) + p_k )    (30)

Such recurrences appear when applying dynamic programming to bidimensional knapsack problems [19, 9]. The dependencies are d_1 = [0, 0, 1]^T and d_2 = [w_k, u_k, 1]^T. The peculiarity of the knapsack-like recurrence is that the latter dependency is dynamic: it is different for different k in the domain and may vary from one problem instance to another. We can easily hide one of the dynamic components in the memory of the processors by choosing the i axis as the direction of projection (as we do for the mono-knapsack problem [4]). For the dynamic component in the j axis we have to look for another solution. Consider the instance of the projection of the dependence graph on the plane (j, k) given in Fig. 2 (a), where u_1 = 2, u_2 = 1, u_3 = 3, u_4 = 2, u_5 = 3. Suppose now that we have q processors in the j axis. Following the above described strategy we impose 3 ≤ s ≤ m/q and we obtain the orthogonal tile graph in Fig. 2 (c).
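For reference, the recurrence (30) can be evaluated sequentially by a direct triple loop. This is a naive sketch of the dependence structure, not of the tiled parallel code; w, u and prof are 1-indexed lists holding the item weights w_k, u_k and profits p_k:

```python
def knapsack_3d(l, m, n, w, u, prof):
    """Straightforward sequential evaluation of the bidimensional knapsack
    recurrence, Eqn. (30):
        f(i,j,k) = max(f(i,j,k-1), f(i-w_k, j-u_k, k-1) + p_k)
    with zero boundary conditions; returns f(l, m, n)."""
    f = [[[0] * (n + 1) for _ in range(m + 1)] for _ in range(l + 1)]
    for k in range(1, n + 1):
        for i in range(l + 1):
            for j in range(m + 1):
                best = f[i][j][k - 1]           # skip item k
                if i >= w[k] and j >= u[k]:     # take item k if it fits
                    best = max(best, f[i - w[k]][j - u[k]][k - 1] + prof[k])
                f[i][j][k] = best
    return f[l][m][n]
```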
However, the difficulty with the knapsack-like dependencies is how to evaluate the volume of the message transmitted between tiles. As we observe in Fig. 2 (c), this volume may change from tile to tile and also from one problem instance to another. We therefore consider a range of problem instances whose u_k are uniformly distributed in an interval [u_min, u_max] with expected value E[u]; this value is a constant for the considered range of problem instances. We suggest in our approach to approximate the volume of the transmitted message by E[u]·t, where t is the tile facet surface, and to add the local communications by guaranteeing the constraint u_max ≤ s ≤ m/q.
5 Experimental results, discussions and illustrations

In this section we summarize the results of our experiments on the Intel Paragon at IRISA (Campus de Beaulieu, 35042 Rennes Cedex, France). The Paragon's processing nodes are arranged in a two-dimensional rectangular grid (56 Intel i860 processors). Each node contains a message processor acting in parallel with the application processor and accessing memory by DMA. To validate the theoretical results, we created special-purpose parallel code which models
Figure 2: Knapsack-like recurrences: illustration of the process from the original dependence graph (a) to its corresponding tile graph (c)

recurrence (1). First of all, we designed a series of experiments to precisely measure the constants β, τ_t and τ_a. We obtained β = 90 μs, τ_t = 0.011 μs per byte and τ_a = 0.51 μs. Let us point out that the purpose of this section is only to illustrate the peculiarities of the model in the three-dimensional case. Questions like where the gap of 10 percent between the predicted and the experimental time comes from, the way we measure these constants, some aspects of the impact of the cache on the precision of the results, comparisons with other models, etc., are discussed in detail in our experimentally oriented paper [3], which concerns the two-dimensional case. We next performed experiments to get the running time of our code for a wide range of values of r and s and to compare it with the shape of our theoretical objective function over the plane (v, s). This is illustrated in Fig. 3, and we observe that the theoretical timing function and the experimental results have the same global behaviour. Next, we performed experiments to validate the values of the optimal tile size obtained by our analytical solution. We show a series of such experiments in Fig. 4 and Fig. 5. Fig. 4 illustrates the case when the global minimum is attained on the edge E_1. On the left of this figure we show the objective function/execution time values over the entire interval for t; the theoretical optimal value of t for the considered instance is 6, according to our formula (20). On the right of this figure we zoom in on the neighborhood of this value. This case corresponds to performing only one pass, since the optimal values of r and s are respectively l/p and m/q.
Our experiments show that performing one pass with the tile height given by (20) is optimal when the rectangular parallelepiped is not very asymmetric (at least this happens to be so with our values of the machine constants). Fig. 5 illustrates the opposite case, when the minimum lies on the plane t = 1. To get this case we took a very asymmetric domain: l = 30000, m = 2000, n = 20. Fig. 5 (a) illustrates how quickly the program loses efficiency as t increases. In Figures 5 (b) and 5 (c) we compare the theoretical with the experimental values on the edges E2 = {(r, s, t) : 1 ≤ r ≤ l/p, s = m/q, t = 1} and E3 = {(r, s, t) : r = l/p, 1 ≤ s ≤ m/q, t = 1}. The minimum is attained on edge E2 and is given
Figure 3: Plots of the theoretical vs experimental execution time over the volume/tile-size plane for a given instance: l=5000, m=5000, n=20, p=5, q=5
Figure 4: Plots of time vs tile size when the optimal solution needs one pass. The considered instance is l=500, m=400, n=5000, p=5, q=4

by r_opt = 142 and s_opt = m/q = 500. This can be clearly seen in Fig. 5 (d), which is a magnified part of 5 (b). In all these cases the theoretical predictions matched the experimental results very closely. This instance is also very interesting for the following observation. By taking the predicted minimum over the plane t = 1, given by the point r_opt = 142, s_opt = 500, where the objective function value is about 30 sec., we gain 10 sec. with respect to the best time of 40 sec. obtained at the point (l/p, m/q). This simply means that by optimizing with respect to t but performing only one pass, we stay far from the optimal running time obtained when performing several passes (in the considered case we perform 42 passes with the optimal tile size (142, 500, 1)). When multiple passes are required, the last processors produce their results before the first ones are ready to accept them, hence message buffering is needed. A deadlock could appear (when all the processors' buffers are full) if the communications are not correctly synchronized between the processors on the border of the grid. This can be avoided by using threads (lightweight processes) to handle the communications between boundary processors. The good agreement between the theoretical and experimental times, especially in this case, is
Figure 5: Plots of time vs tile size when the optimal solution needs several passes. The considered instance is l=30000, m=2000, n=20, p=5, q=4

a good criterion both for our theoretical model and for our parallel code. The program allowing several passes is more complicated than the code allowing only one pass because of the need to manage the buffers between passes. This demands higher competence and more effort from the programmer. However, our example shows it is worth providing this possibility in the parallel code in the case of an asymmetric domain: it gives the best behavior when n is relatively small compared to l and m. If we are not forced to fix the projection axis, the best choice is to project along the longest domain dimension; in this case a single pass usually provides the optimal (or close to optimal) tiling. This appears to be the best strategy when the code is designed to perform only one pass. Let us point out again that when some restrictive factors impose fixing the direction of projection, the program providing multiple passes can be significantly better, as shown above.
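The inter-pass buffer management described above can be sketched as a bounded producer/consumer pair: the compute loop deposits boundary results into a bounded local buffer while a communication thread drains it, so the compute loop blocks on its own buffer instead of deadlocking inside a send call. This Python sketch is purely illustrative (names and pass count are hypothetical), not the actual Paragon code.

```python
import queue
import threading

N_PASSES = 8
outbox = queue.Queue(maxsize=2)   # bounded per-processor message buffer
delivered = []

def comm_thread():
    # Drains the buffer: stands in for the asynchronous forwarding of
    # boundary data to the processor on the opposite border of the grid.
    while True:
        msg = outbox.get()
        if msg is None:           # sentinel: no more passes
            break
        delivered.append(msg)

t = threading.Thread(target=comm_thread)
t.start()
for p in range(N_PASSES):
    result = f"pass-{p}-boundary-data"   # results of the last tile of a pass
    outbox.put(result)   # blocks only while the bounded buffer is full,
                         # never deadlocks inside a communication call
outbox.put(None)
t.join()
```

The bounded `maxsize` plays the role of the finite system buffers: backpressure is absorbed locally rather than propagated into the message layer.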
[Figure 6 consists of three speedup vs number-of-processors plots (4 to 32 processors), each comparing the ideal and the experimental speedup: left, the optimal tile size (r_opt, s_opt, t_opt); middle, (r_opt, s_opt, t=10), not optimal; right, (r_opt, s_opt, t=20), not optimal.]
Figure 6: Speedup/tile size dependency for a given instance (l=1000, m=1000, n=100)

Finally, we conducted several experiments to measure the speedup/tile size dependency. All the runs are on instances in which only the t-coordinate of the optimal point varies. At the optimal tile size (the leftmost plot in Fig. 6) we observe good agreement between the theoretically expected linear speedup and the experimentally obtained one. However, for runs with a non-optimal tile size the speedup degrades significantly. This illustrates well the impact of an optimal tiling strategy on the program's efficiency.
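The speedup curves of Fig. 6 are simply T(1)/T(P); a helper of the following form recovers them from raw timings. The timing values below are hypothetical, chosen only to mimic the degradation pattern, and are not the measured Paragon times.

```python
def speedups(t_one_proc, times_by_nprocs):
    """Compute speedup S(P) = T(1)/T(P) for each processor count P."""
    return {P: t_one_proc / t for P, t in sorted(times_by_nprocs.items())}

# Hypothetical timings in seconds: an optimally tiled run vs a run with a
# non-optimal tile height t.
optimal_t  = {4: 25.4, 8: 12.9, 16: 6.6, 32: 3.5}
nonoptimal = {4: 31.0, 8: 17.8, 16: 10.9, 32: 7.2}
s_opt = speedups(100.0, optimal_t)    # close to linear
s_bad = speedups(100.0, nonoptimal)   # degrades as P grows
```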
6 Conclusion

We have addressed in this paper the problem of determining the tile size that minimizes the total execution time of a three-dimensional uniform dependency computation when mapped onto grid/torus distributed-memory general-purpose machines. We focus on rectangular parallelepiped shaped iteration spaces. We reveal new and useful properties of the total running time function in the three-dimensional case. These properties are interesting not only in themselves, but also permit deriving an O(1) algorithm for solving a non-linear integer programming problem whose hardness is reinforced by the non-convexity of its continuous relaxation. Although from a mathematical point of view this is an approximation algorithm, the loss of precision is so insignificant that the results can be considered exact in the context of the considered application. The validity of the proposed approach has been confirmed by our experiments on the Intel Paragon. Although the considered class may seem very restrictive, similar assumptions are often made by most researchers treating the problem. For example, the Fortran D and PARADIGM compilers automatically determine the block size for coarse-grain pipelined loops [20, 12], assuming that the iteration space is rectangular. Moreover, even though they treat arbitrarily nested loops, only the innermost loops are tiled. In contrast, we explore the entire three-dimensional solution space and, for any value of the iteration space size and any torus configuration, we ensure the optimal task granularity. Such a global search can provide better solutions especially when the iteration space is very asymmetric, which is often the case in dynamic programming algorithms. Let us also mention the algorithm developed by Wolf and Lam [26, 2] for applying unimodular transformations which leave the original loop nests in a canonical form consisting of a nest of fully permutable loop nests.
It would be interesting to see how our result could be integrated into such algorithms, and this is a subject of our ongoing research. Open questions also include the extension of these results to more than three dimensions, and to cases where orthogonal tiling is not possible.
References

[1] C. Ancourt and F. Irigoin. Scanning Polyhedra with DO Loops. In Third Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 39-50. ACM SIGPLAN, ACM Press, 1991.
[2] J. Anderson and M. Lam. Global Optimization for Parallelism and Locality on Scalable Parallel Machines. ACM Sigplan Notices, 28(6):112-125, June 1993.
[3] R. Andonov, H. Bourzoufi, and S. Rajopadhye. Two-Dimensional Orthogonal Tiling: from Theory to Practice. In International Conference on High Performance Computing (HiPC), pages 225-231, December 1996. Trivandrum, India.
[4] R. Andonov and S. Rajopadhye. Optimal Tiling of Two-Dimensional Uniform Recurrences. Technical Report PI-999, IRISA, Campus de Beaulieu, Rennes, France, October 1995. (Submitted to JPDC.) This report supersedes Optimal Tiling (IRISA, PI-792), January 1994. Part of these results were presented at CONPAR 94-VAPP VI, Lecture Notes in Computer Science 854, pages 701-712, Springer Verlag, 1994.
[5] R. Andonov, N. Yanev, and H. Bourzoufi. Three-Dimensional Orthogonal Tile Sizing Problem: Mathematical Programming Approach. Technical Report RR-96-3, LIMAV, Université de Valenciennes, Le Mont Houy - B.P. 311, 59304 Valenciennes, France, May 1996.
[6] P. Boulet, A. Darte, T. Risset, and Y. Robert. (Pen)-Ultimate Tiling? INTEGRATION, the VLSI Journal, 17:33-51, November 1994.
[7] P-Y. Calland and T. Risset. Precise Tiling for Uniform Loop Nests. In P. Cappello, C. Mongenet, G-R. Perrin, P. Quinton, and Y. Robert, editors, Application Specific Array Processors, pages 330-337, July 1995. Strasbourg, France.
[8] J.J. Choi, J. Dongarra, and D.W. Walker. PUMMA: Parallel Universal Matrix Multiplication Algorithms on Distributed Memory Concurrent Computers. Concurrency: Practice and Experience, 6(7):543-570, 1994.
[9] A. Fréville and G. Plateau. An Exact Search for the Solution of the Surrogate Dual of the 0-1 Bidimensional Knapsack Problem. European J. Oper. Res., 68:413-421, 1993.
[10] A.G. Geist and M.T. Heath. Matrix Factorization on a Hypercube Multiprocessor. In Hypercube Multiprocessors, pages 161-180, 1986.
[11] D.C. Grunwald and D.A. Reed. Networks for Parallel Processors: Measurements and Prognostications. In Third Conference on Hypercube Concurrent Computer Applications, pages 610-619, January 1988.
[12] S. Hiranandani, K. Kennedy, and C. Tseng. Evaluating Compiler Optimizations for Fortran D. Journal of Parallel and Distributed Computing, 21:27-45, 1994.
[13] S. Hiranandani, K. Kennedy, and C. Tseng. Evaluation of Compiler Optimization for Fortran D on MIMD Distributed-Memory Machines. In 1992 ACM International Conference on Supercomputing, 1992.
[14] F. Irigoin and R. Triolet. Supernode Partitioning. In 15th ACM Symposium on Principles of Programming Languages, pages 319-328. ACM, January 1988.
[15] C-T. King, W-H. Chou, and L. Ni. Pipelined Data-Parallel Algorithms: Part I - Concept and Modelling. IEEE Transactions on Parallel and Distributed Systems, 1(4):470-485, October 1990.
[16] C-T. King, W-H. Chou, and L. Ni. Pipelined Data-Parallel Algorithms: Part II - Design. IEEE Transactions on Parallel and Distributed Systems, 1(4):486-499, October 1990.
[17] M. Lam and M. Wolf. Automatic Blocking by Compilers. In J. Dongarra, K. Kennedy, P. Messina, D. Sorensen, and R. Voigt, editors, SIAM Conference on Parallel Processing for Scientific Computing, pages 537-542, 1993.
[18] S. Miguet and Y. Robert. Path Planning on a Ring of Processors. Intern. J. Computer Math., 32:61-74, 1990.
[19] M. Minoux. Programmation mathématique, volume 2. Dunod, 1983.
[20] D. Palermo, E. Su, A. Chandy, and P. Banerjee. Communication Optimizations Used in the PARADIGM Compiler for Distributed-Memory Multicomputers. In International Conference on Parallel Processing, St. Charles, IL, August 1994.
[21] P. Quinton and Y. Robert. Algorithmes et architectures systoliques. Masson, 1989. English translation: Systolic Algorithms and Architectures, Prentice Hall, September 1991.
[22] J. Ramanujam and P. Sadayappan. Tiling Multidimensional Iteration Spaces for Non-Shared-Memory Machines. In Supercomputing 91, pages 111-120, 1991.
[23] R. Schreiber and J. Dongarra. Automatic Blocking of Nested Loops. Technical Report 38, RIACS, NASA Ames Research Center, August 1990.
[24] V. Van Dongen. Loop Parallelization on Distributed Memory Machines: Problem Statement. In Proceedings of EPPP, 1993.
[25] M.S. Waterman. Mathematical Methods for DNA Sequences. CRC Press, Boca Raton, Florida, 1989.
[26] M. Wolf and M.S. Lam. A Loop Transformation Theory and an Algorithm to Maximize Parallelism. IEEE Transactions on Parallel and Distributed Systems, 2(4):452-471, October 1991.
[27] M. Wolfe. Iteration Space Tiling for Memory Hierarchies. In Parallel Processing for Scientific Computing (SIAM), pages 357-361, 1987.