Improving Code Generation in the Polytope Model
Yosr SLAMA*
Mohamed JEMNI†
Abstract In the literature, one can now find several efficient scheduling and allocation techniques that describe a high-performance parallel execution of loop nests. Nevertheless, few works have been devoted to code generation expressing the parallelism derived by those techniques. In [4], we proposed an approach for merging two target program parts. Our aim here is to generalize this work to permit merging any number of target program parts, in order to improve code generation in the polytope model.
Keywords : Automatic parallelization, code generation, polytope model, target program part.
1
Introduction
Research in automatic parallelization is especially focused on regular programming structures, i.e. nested loops. In this context, the community has proposed a variety of tools and transformation techniques. Recently, a mathematically based model, called the polytope model [1, 2, 5], was designed in order to define a general framework for automatic parallelization of nested loops. Parallelization in this model consists of three steps. The first models the loop nest, called the source program, by a convex polytope defining its index space. The latter is represented by a system of linear inequalities expressing the loop index bounds. The second applies an affine space/time coordinate transformation (constructed from a schedule and an allocation) to the source polytope, generating a new one, called the target polytope. The third step, code generation, consists in deriving the target program by scanning the target polytope through a nest of target loops. Each space dimension becomes a parallel loop and each time dimension becomes a sequential loop.

The formalism on which this model is based, i.e. linear algebra, limits it to regular transformations. In fact, in the basic model [5], only affine transformations (represented by a matrix) can be treated. Lately, the model was extended, first, to deal with singular matrices [6] and, second, to allow the use of affine-by-statement and piecewise affine transformations [3, 9]. With this second extension, the code generation step has to construct the target program from a set of programs already derived, called target program parts. Our main goal here is to improve this step by giving an efficient approach to derive the target program. In [4], we presented algorithms for merging two target program parts in the polytope model. We give here a generalization of these algorithms that permits merging several program parts.

In the remainder of this paper, we first give some basics and useful definitions. Then, we study parallelization in the basic polytope model and its extensions. Finally, we present the general approach for code generation by merging target program parts.
2
Basics
A polyhedron $P$ in $\mathbb{R}^n$ is defined as follows: $P = \{x \mid x \in \mathbb{R}^n,\ Ax \le b\}$, simply denoted $(A, b)$; $A$ is an $(n', n)$ matrix and $b$ a vector of $\mathbb{R}^{n'}$. A bounded polyhedron is called a polytope. In $\mathbb{Z}^n$, a polytope is "finite", i.e. it has a finite number of elements. A set $E$ is said to be convex iff $\forall x, y \in E$, $\lambda x + (1 - \lambda) y \in E$ for $0 \le \lambda \le 1$. Note that convexity is the basic characteristic of a polyhedron [7]. Source programs in the polytope model are n nested do loops with regularity restrictions on the loop indices, i.e. the loop bounds are symbolic constants or affine functions of the surrounding loop counters and those constants.

* Faculté des Sciences de Tunis, LAPP, Dép. Informatique, 1060 Belvédère, Tunis, [email protected]
† Faculté des Sciences de Tunis, LAPP, Dép. Informatique, 1060 Belvédère, Tunis, [email protected]
3
The basic polytope model
The parallelization process in the basic polytope model [5] transforms a source program into a target program via three steps, which are as follows:

3.1 Source polytope: it defines the index space of the source program. It is represented by a set of linear inequalities; each inequality expresses a loop bound in a distinct dimension. Each point in the source polytope represents an iteration instance of the loop nest. The coordinates of the point are given by the values of the loop indices at that iteration. The source polytope is convex.

3.2 Target polytope: this is the most important step in the model. It consists in finding an affine coordinate transformation that satisfies some efficiency criteria, e.g. minimizes execution time. Generally, this transformation is called a space-time mapping; it transforms the source polytope into another polytope in which some dimensions enumerate time and the others enumerate space. The aim of this step is then to find both an affine schedule $t$ and an affine allocation $a$ on the index space $I$. Mathematically, this is expressed as follows: $\forall x \in I,\ t(x) = \lambda x + \alpha$ ($\lambda \in \mathbb{Z}^{r \times n}$, $\alpha \in \mathbb{Z}^{r}$) and $\forall x \in I,\ a(x) = \sigma x + \beta$ ($\sigma \in \mathbb{Z}^{(n-r) \times n}$, $\beta \in \mathbb{Z}^{n-r}$). $T$, the matrix composed of the $r$ rows of $\lambda$ followed by the $n - r$ rows of $\sigma$, is the requested transformation. Once the schedule is determined, the allocation must be chosen such that the transformation is invertible ($|T| \ne 0$). The target polytope is then expressed as $(AT^{-1}, b)$, where $A$ is the integer matrix associated with the source polytope $(A, b)$. Indeed, let $y = Tx$; then $x = T^{-1}y$ and $Ax \le b$ is written $(AT^{-1})y \le b$.

3.3 Target program: the effect of the applied transformation is to change the execution order of the operations of the source program. The target program is then determined by scanning the target polytope through a nest of target loops; each space dimension becomes a parallel loop and each time dimension becomes a sequential loop. The parallelism may be synchronous (resp. asynchronous) if the r outermost (resp. innermost) loops are sequential.
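The construction of the target polytope can be illustrated on a small case. The following Python sketch is only an illustration under assumptions that are not in the paper: a hypothetical two-deep source nest with bounds 1..n, a hypothetical space-time matrix T whose first row is the schedule t = i + j and whose second row is the allocation a = j, and helper names (`inverse_2x2`, `mat_mul`, `mat_vec`, `satisfies`) chosen for this example. It checks that scanning $(AT^{-1}, b)$ yields exactly the images of the source iterations.

```python
from fractions import Fraction
from itertools import product

def inverse_2x2(T):
    """Exact inverse of a 2x2 integer matrix with non-zero determinant."""
    (a, b), (c, d) = T
    det = a * d - b * c
    if det == 0:
        raise ValueError("T must be invertible (|T| != 0)")
    return [[Fraction(d, det), Fraction(-b, det)],
            [Fraction(-c, det), Fraction(a, det)]]

def mat_mul(X, Y):
    """Plain matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def mat_vec(X, v):
    """Product of a matrix (list of rows) with a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in X]

def satisfies(A, b, x):
    """True when A x <= b holds row by row."""
    return all(lhs <= rhs for lhs, rhs in zip(mat_vec(A, x), b))

# Hypothetical source nest: do i = 1, n / do j = 1, n, with n = 4.
n = 4
A = [[-1, 0], [1, 0], [0, -1], [0, 1]]   # -i <= -1, i <= n, -j <= -1, j <= n
b = [-1, n, -1, n]

# Hypothetical space-time matrix: schedule t = i + j, allocation a = j.
T = [[1, 1], [0, 1]]
A_target = mat_mul(A, inverse_2x2(T))    # the target polytope is (A T^-1, b)

source_points = [x for x in product(range(1, n + 1), repeat=2) if satisfies(A, b, x)]
image = {tuple(mat_vec(T, x)) for x in source_points}

# Scan a bounding box of the target space and keep the points of (A T^-1, b).
scanned = {y for y in product(range(0, 2 * n + 1), range(0, n + 1))
           if satisfies(A_target, b, y)}
```

Because this particular T has determinant 1, every integer point of the target polytope is the image of a source iteration; for a non-unimodular T the scan would also have to skip the integer points that have no integer preimage.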
4
Extensions
We explain below how the polytope model has been extended to affine-by-statement and piecewise affine transformations.

4.1 Affine-by-statement transformations: a transformation is called affine by statement when it has an affine expression for each statement of the source program. The polytope model was extended [9, 3] to deal with these transformations as follows:
• Instead of the whole source program, we consider a set of programs, called source program parts, each having only one statement as loop body. These programs have the same index space as the original program.
• We parallelize each source program part in the basic polytope model. Note that all the source program parts have the same index space and thus the same source polytope. The generated parallel programs are called target program parts.
• We merge the target program parts in order to generate the target program. A merge method has been proposed in [3]. In this paper, we propose another method.

4.2 Piecewise affine transformations: a piecewise affine transformation on the index space is treated in the same manner as an affine-by-statement transformation. The difference lies in the construction of the source program parts. In this case, each program part corresponds to an index subspace in which the transformation has an affine expression.
5
Problem statement
Let us consider the following loop as a source program:

    do i = 1, n
      A[i] = i        (S1)
      B[i] = i + 1    (S2)
    end do

A valid schedule and allocation may be as follows:

    $t(\langle S_1, i \rangle) = 0$,  $a(\langle S_1, i \rangle) = i$
    $t(\langle S_2, i \rangle) = 0$,  $a(\langle S_2, i \rangle) = i$

If we apply the model to each statement, we obtain the following two target program parts:

    do t = 0, 0
      dopar p = 1, n
        S1(t, p)
      end dopar
    end do

    do t = 0, 0
      dopar p = 1, n
        S2(t, p)
      end dopar
    end do

Applying classical merge techniques [3] to these two target program parts, we obtain program (P1) below. In this program, the n instances of the loop body can be executed in parallel; each instance is an instance of S1 followed by an instance of S2. But this contradicts the fact that S1 and S2 have the same schedule, i.e. 0. Therefore, the amount of parallelism extracted by the model is not fully exploited. To achieve this objective, we should allocate S2 to a set of processors different from those allocated to S1. We propose two solutions, (P2) and (P3). The first uses parbegin/parend primitives, whereas the second uses if-then-else guards.

(P1)
    do t = 0, 0
      dopar p = 1, n
        S1(t, p)
        S2(t, p)
      end dopar
    end do

(P2)
    do t = 0, 0
      parbegin
        dopar p = 1, n
          S1(t, p)
        end dopar
        dopar p = n + 1, 2 * n
          S2(t, p - n)
        end dopar
      parend
    end do

(P3)
    do t = 0, 0
      dopar p = 1, 2 * n
        if (1 <= p <= n) then
          S1(t, p)
        else
          if (n + 1 <= p <= 2 * n) then
            S2(t, p - n)
          end if
        end if
      end dopar
    end do
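The difference between the classical merge (P1) and the guarded merge (P3) can be made concrete by recording which operations each processor runs at t = 0. The Python sketch below is an illustration only: the function names (`run_p1`, `run_p3`, `depth`, `instances`) and the value n = 4 are assumptions of this example, and the guard of S2 is taken to be n + 1 <= p <= 2n.

```python
from collections import defaultdict

def run_p1(n):
    """P1: processor p (1..n) executes S1(0, p) and then S2(0, p)."""
    work = defaultdict(list)
    for p in range(1, n + 1):          # dopar p = 1, n
        work[p].append(("S1", 0, p))
        work[p].append(("S2", 0, p))
    return work

def run_p3(n):
    """P3: processors 1..n run S1, processors n+1..2n run S2 (guarded form)."""
    work = defaultdict(list)
    for p in range(1, 2 * n + 1):      # dopar p = 1, 2n
        if 1 <= p <= n:
            work[p].append(("S1", 0, p))
        elif n + 1 <= p <= 2 * n:
            work[p].append(("S2", 0, p - n))
    return work

def depth(work):
    """Largest number of operations any single processor runs at t = 0."""
    return max(len(ops) for ops in work.values())

def instances(work):
    """Set of all statement instances executed, regardless of processor."""
    return {op for ops in work.values() for op in ops}

n = 4
p1, p3 = run_p1(n), run_p3(n)
```

Both programs execute the same statement instances, but in P1 every processor serializes two operations that share the schedule 0, whereas in P3 each of the 2n processors runs a single operation.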
6
General method
Let $\{P^1, P^2, \ldots, P^q\}$ be a set of q target program parts. $P^j$ ($1 \le j \le q$) is a nest of r sequential loops surrounding n − r parallel loops, having statement $S^j$ as a body. We propose here an approach to merge these program parts, which takes into consideration the type of the loops (i.e. sequential or parallel), in order to generate a single target program. The main principle of this method is to collect all operations having the same schedule and to ensure their execution in parallel by allocating them to different processors.

We start by merging the serial loops enumerating time, by applying a recursive algorithm called Merge-time. The latter calls a second algorithm, called Merge-space, in order to merge the parallel loops enumerating space when necessary. The Merge-time algorithm starts by dividing the union of the time iteration spaces of all the program parts into a set of subspaces, according to the distribution of the statements over the whole time iteration space. Each subspace is represented by a loop nest. In the subspaces where a set S of more than one statement is to be executed, the algorithm Merge-space is called. Its principle is to guarantee simultaneous execution of the instances of the statements of S having the same schedule. This simultaneous execution is achieved only if Merge-space allocates the instances of the statements of S to different processors. We therefore shift some space loops when some parallel operations are assigned to the same processor.

Two different solutions are proposed to merge space loops. The first, using guards, groups all space loops into a single one whose iteration space is the union of the iteration sub-spaces of all the loops. The second solution uses parbegin/parend primitives that delimit the space loops to be merged, thus guaranteeing their simultaneous execution.
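The division of the time iteration space into subspaces can be sketched for a single time dimension as follows. This Python fragment is a minimal sketch, not the authors' implementation: the function name `partition` and the dictionary layout are assumptions, and the time bounds used are those of the three program parts of the Example section (S1 on 1..80, S2 on 21..120, S3 on 61..100).

```python
def partition(bounds):
    """Split the union of time intervals into sub-intervals with a constant statement set.

    bounds maps a statement name to its inclusive (lower, upper) time bounds.
    Returns a list of ((B, B_prime), statements) sorted by B.
    """
    # Every lower bound opens a piece; every upper bound + 1 opens the next one.
    cuts = sorted({lo for lo, _ in bounds.values()} |
                  {hi + 1 for _, hi in bounds.values()})
    pieces = []
    for start, nxt in zip(cuts, cuts[1:]):
        end = nxt - 1
        active = sorted(s for s, (lo, hi) in bounds.items()
                        if lo <= start and end <= hi)
        if active:                      # skip gaps where no statement is scheduled
            pieces.append(((start, end), active))
    return pieces

time_bounds = {"S1": (1, 80), "S2": (21, 120), "S3": (61, 100)}
pieces = partition(time_bounds)
```

Sub-intervals that hold a single statement can be emitted directly, while those that hold several statements are the ones on which the space loops have to be merged.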
7
Algorithms
For simplicity, we use the following notations: $L_i^j$ (resp. $U_i^j$) is the lower (resp. upper) loop bound of $P^j$ at level i. $N_i^j$ denotes the loop nest starting from the loop at level i ($1 \le i \le n$) associated with statement $S^j$. Program $P^j$ is thus denoted $N_1^j$ ($1 \le j \le q$). At a level i, $N_i$ is the list of nests to be merged and $L_i$ (resp. $U_i$) contains the lower (resp. upper) bounds of these nests. Note that the first call is Merge-time($N_1$) with $N_1 = \{N_1^j \mid 1 \le j \le q\}$.

Algorithm Merge-time (N_i)
begin
  Partition (inputs: L_i, U_i; outputs: I, l)
      /* I is the set of intervals [B_k, B'_k] with, for each, the set of statements S_k executed in it. l is the cardinality of I */
  do k = 1, l
    if |S_k| = 1 then                      /* There is only one statement S^j */
      Emit: { do x_i = B_k, B'_k
                N_{i+1}^{j}
              end do }
    else                                   /* There are several statements S^{j1}, ..., S^{jm} */
      N_{i+1} <- { N_{i+1}^{j1}, ..., N_{i+1}^{jm} }
      Emit: { do x_i = B_k, B'_k }
      if i = r then                        /* The last loop enumerating time */
        Merge-space (N_{i+1}, {}, ..., {})
      else
        Merge-time (N_{i+1})
      end if
      Emit: { end do }
    end if
  end do
end                                        /* End of Merge-time */

Algorithm Merge-space (N_i, G_1, G_2, ..., G_m)      /* m is the cardinality of N_i */
begin
  if i = n + 1 then                        /* End of recursive calls: emit the guards */
    Emit: { if conjunction(G_1) then S^1   /* conjunction(G_j) is the conjunction of the conditions of G_j */
            else
              if conjunction(G_2) then S^2
              else
                ...
                  if conjunction(G_m) then S^m
                  end if
                ...
              end if
            end if }
  else                                     /* Merge of loops enumerating space of level i (r + 1 <= i <= n) */
    Shift (N_i)                            /* Shift loops of level i. This guarantees that different processors are allocated to simultaneous operations [8] */
    do k = 1, m                            /* Save guards of level i */
      G_k <- G_k + (L_i^k <= x_i <= U_i^k) /* G_k contains the guards corresponding to statement S^k */
    end do
    N_{i+1} <- { N_{i+1}^{1}, ..., N_{i+1}^{m} }
    Emit: { do x_i = mL_i, MU_i }          /* Emit the merged loop. mL_i = min over j = 1..m of L_i^j; MU_i = max over j = 1..m of U_i^j */
    Merge-space (N_{i+1}, G_1, ..., G_m)   /* Next level */
    Emit: { end do }
  end if
end                                        /* End of Merge-space */
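The Shift step is detailed in [8] and not in this paper, so the following Python sketch only illustrates one possible policy, stated here as an assumption: the first interval is kept, and any later interval that starts at or before the highest bound used so far is moved to just after that bound. The names `shift` and `merged_bounds` are hypothetical. A statement that originally ran on processor p then runs on processor p + offset, so the merged program calls it with the argument x_i − offset.

```python
def shift(intervals):
    """Return (offset, (lower, upper)) per interval so that shifted intervals are pairwise disjoint.

    Assumed policy: keep the first interval; move any later interval that starts
    at or before the highest bound used so far to just after that bound.
    """
    placed = []
    highest = None
    for lo, hi in intervals:
        offset = 0
        if highest is not None and lo <= highest:
            offset = highest + 1 - lo
        lo, hi = lo + offset, hi + offset
        highest = hi if highest is None else max(highest, hi)
        placed.append((offset, (lo, hi)))
    return placed

def merged_bounds(placed):
    """Bounds of the merged loop: min of the lower bounds, max of the upper bounds."""
    return (min(lo for _, (lo, _) in placed),
            max(hi for _, (_, hi) in placed))

# First space level of three hypothetical nests with bounds 1..30, 1..100 and 81..200.
three = shift([(1, 30), (1, 100), (81, 200)])
two = shift([(1, 100), (81, 200)])
```

Under this policy the saved guards are simply the shifted intervals, and intervals that are already disjoint from all earlier ones are left where they are.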
Remark 7.1 We give here only the version of Merge-space that uses guards. The version using primitives consists of a shift followed by the emission of parbegin/parend.
8
Example
In order to illustrate our merge approach, we present below three target program parts, (P4), (P5) and (P6), each having one time loop and two space loops, to which we apply the first version (using guards) of our algorithm [4, 8]. The second version can easily be deduced by inserting parbegin/parend primitives in the appropriate places.

(P4)
    do t1 = 1, 80
      dopar p1 = 1, 30
        dopar p2 = 5, 25
          S1(t1, p1, p2)
        end dopar
      end dopar
    end do

(P5)
    do t1 = 21, 120
      dopar p1 = 1, 100
        dopar p2 = 51, 90
          S2(t1, p1, p2)
        end dopar
      end dopar
    end do

(P6)
    do t1 = 61, 100
      dopar p1 = 81, 200
        dopar p2 = 1, 100
          S3(t1, p1, p2)
        end dopar
      end dopar
    end do

The resulting program (P7) is the following. In the two segments that contain S3, the p2 loop starts at 1, the minimum of the lower bounds of the merged loops, so that all instances of S3 are scanned.

(P7)
    do t1 = 1, 20
      dopar p1 = 1, 30
        dopar p2 = 5, 25
          S1(t1, p1, p2)
        end dopar
      end dopar
    end do
    do t1 = 21, 60
      dopar p1 = 1, 130
        dopar p2 = 5, 65
          if (1 <= p1 <= 30) and (5 <= p2 <= 25) then
            S1(t1, p1, p2)
          else
            if (31 <= p1 <= 130) and (26 <= p2 <= 65) then
              S2(t1, p1 - 30, p2 + 25)
            end if
          end if
        end dopar
      end dopar
    end do
    do t1 = 61, 80
      dopar p1 = 1, 250
        dopar p2 = 1, 100
          if (1 <= p1 <= 30) and (5 <= p2 <= 25) then
            S1(t1, p1, p2)
          else
            if (31 <= p1 <= 130) and (51 <= p2 <= 90) then
              S2(t1, p1 - 30, p2)
            else
              if (131 <= p1 <= 250) and (1 <= p2 <= 100) then
                S3(t1, p1 - 50, p2)
              end if
            end if
          end if
        end dopar
      end dopar
    end do
    do t1 = 81, 100
      dopar p1 = 1, 220
        dopar p2 = 1, 100
          if (1 <= p1 <= 100) and (51 <= p2 <= 90) then
            S2(t1, p1, p2)
          else
            if (101 <= p1 <= 220) and (1 <= p2 <= 100) then
              S3(t1, p1 - 20, p2)
            end if
          end if
        end dopar
      end dopar
    end do
    do t1 = 101, 120
      dopar p1 = 1, 100
        dopar p2 = 51, 90
          S2(t1, p1, p2)
        end dopar
      end dopar
    end do
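A merged program of this kind can be checked mechanically against its parts. The Python sketch below is an illustration under stated assumptions: the guards of (P7) are transcribed by hand, the p2 loops of the two segments that contain S3 are taken to start at 1 (the minimum of the lower bounds, as the Merge-space comment on mL_i prescribes), and the names `PARTS`, `part_instances` and `p7` are hypothetical. For a given time step it compares the statement instances executed by (P7) with those executed by (P4), (P5) and (P6) separately.

```python
# (name, time bounds, p1 bounds, p2 bounds) of the three program parts P4, P5, P6.
PARTS = [
    ("S1", (1, 80), (1, 30), (5, 25)),
    ("S2", (21, 120), (1, 100), (51, 90)),
    ("S3", (61, 100), (81, 200), (1, 100)),
]

def part_instances(t, parts=PARTS):
    """Instances (name, p1, p2) that the separate program parts execute at time t."""
    out = set()
    for name, (t_lo, t_hi), (a_lo, a_hi), (b_lo, b_hi) in parts:
        if t_lo <= t <= t_hi:
            out |= {(name, p1, p2)
                    for p1 in range(a_lo, a_hi + 1)
                    for p2 in range(b_lo, b_hi + 1)}
    return out

def p7(t):
    """Instances executed by the merged program at time t, one list entry per active processor."""
    out = []
    if 1 <= t <= 20:
        out = [("S1", p1, p2) for p1 in range(1, 31) for p2 in range(5, 26)]
    elif 21 <= t <= 60:
        for p1 in range(1, 131):
            for p2 in range(5, 66):
                if 1 <= p1 <= 30 and 5 <= p2 <= 25:
                    out.append(("S1", p1, p2))
                elif 31 <= p1 <= 130 and 26 <= p2 <= 65:
                    out.append(("S2", p1 - 30, p2 + 25))
    elif 61 <= t <= 80:
        for p1 in range(1, 251):
            for p2 in range(1, 101):        # assumed lower bound 1
                if 1 <= p1 <= 30 and 5 <= p2 <= 25:
                    out.append(("S1", p1, p2))
                elif 31 <= p1 <= 130 and 51 <= p2 <= 90:
                    out.append(("S2", p1 - 30, p2))
                elif 131 <= p1 <= 250 and 1 <= p2 <= 100:
                    out.append(("S3", p1 - 50, p2))
    elif 81 <= t <= 100:
        for p1 in range(1, 221):
            for p2 in range(1, 101):        # assumed lower bound 1
                if 1 <= p1 <= 100 and 51 <= p2 <= 90:
                    out.append(("S2", p1, p2))
                elif 101 <= p1 <= 220 and 1 <= p2 <= 100:
                    out.append(("S3", p1 - 20, p2))
    elif 101 <= t <= 120:
        out = [("S2", p1, p2) for p1 in range(1, 101) for p2 in range(51, 91)]
    return out

def same_work(t):
    """True when the merged program runs each expected instance exactly once at time t."""
    merged = p7(t)
    return len(merged) == len(set(merged)) and set(merged) == part_instances(t)
```

Since each processor (p1, p2) takes at most one branch of the guard chain, agreement of the two instance sets also means that all operations sharing a time step run on distinct processors.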
9
Conclusion
In order to improve code generation in the polytope model, we proposed in this paper a general method to merge a set of target program parts generated by the polytope model. The proposed method takes into account the type of the loops in the target program parts and derives two solutions: the first uses guards when it merges space loops, whereas the second uses primitives such as parbegin/parend. The main benefit of this approach is that it fully exploits the parallelism detected by the applied transformation, by guaranteeing the effective simultaneous execution of parallel operations. As future work, we intend to automate our approach and to integrate it into an available release of a parallelizing compiler, namely LooPo.
Acknowledgments Thanks to Pr. Zaher Mahjoub (Faculty of Sciences of Tunis) for his valuable remarks and discussions and for his reading of this paper.
References
[1] C. Ancourt & F. Irigoin, Scanning polyhedra with DO loops, In Proc. 3rd ACM SIGPLAN Symp. on Principles & Practice of Parallel Programming, pp. 39-50, ACM Press, 1991.
[2] P. Feautrier, Automatic parallelization in the Polytope Model, T.Rept, Laboratoire PRiSM, Université de Versailles St-Quentin, 1996.
[3] M. Griebl, C. Lengauer & S. Wetzel, Code generation in the polytope model, com. PACT'98, Paris, 1998.
[4] M. Jemni & Y. Slama, Merging target program parts in the Polytope Model, com. PDPTA'99, Vol. 4, pp. 1978-1984, Las Vegas, 1999.
[5] C. Lengauer, Loop parallelization in the polytope model, CONCUR'93, Lecture Notes in Computer Science, No. 715, pp. 398-416, Springer Verlag, 1993.
[6] W. Li & K. Pingali, A singular loop transformation framework based on non singular matrices, International Journal of Parallel Programming, Vol. 22, No. 2, pp. 183-205, 1994.
[7] A. Schrijver, Theory of Linear and Integer Programming, Series in Discrete Mathematics, John Wiley & Sons, 1986.
[8] Y. Slama, Etude de la parallélisation des nids de boucles par le modèle polyédrique, mémoire de DEA, Faculté des Sciences de Tunis, Département des Sciences de l'Informatique, 1999.
[9] S. Wetzel, Automatic code generation in the Polytope Model, mémoire de Master, Univ. Passau, Fak. Math & Inf., 1995.