Finding synchronization-free slices of operations in arbitrarily nested loops

Anna Beletska1, Wlodzimierz Bielecki2, Krzysztof Siedlecki2, Pierluigi San Pietro1

1 Dipartimento di Elettronica e Informazione, Politecnico di Milano, via Ponzio 34/5, 20122 Milano, Italy, {beletska, sanpietr}@elet.polimi.it
2 Faculty of Computer Science, Technical University of Szczecin, Zolnierska 49, 70-210 Szczecin, Poland, {wbielecki, ksiedlecki}@wi.ps.pl
Abstract: This paper presents a new approach for extracting synchronization-free parallelism represented by dependent statement instances of an arbitrarily nested loop. The presented algorithms can be applied to both uniform and non-uniform loops. Their main advantage is that more synchronization-free parallelism may be extracted than with existing techniques. Our approach, based on operations on relations and sets, requires an exact dependence analysis, such as the one by Pugh and Wonnacott, where dependences are found in the form of tuple relations. Results of experiments with the NAS benchmark are presented.
Keywords: loop transformations, arbitrarily nested loops, synchronization-free parallelism, slicing.
1. Introduction

Finding synchronization-free parallelism available in loops is of great importance for parallel and distributed computing, enhancing code locality, and reducing memory requirements. It is important that a compiler be able to find parallelism that does not require synchronization when loops are executed on shared-memory multiprocessors, or that does not require communication when they are executed on distributed systems. The purpose of synchronization-free parallelism, however, is not limited to this, since it may also reduce memory requirements and increase performance on a uniprocessor system by enhancing data locality. Numerous techniques have been developed to extract synchronization-free parallelism available in loops, for example, [1,2,4,13,14,16,18,20,21,28]. However, all well-known techniques have limitations that prevent extracting the entire synchronization-free parallelism available in arbitrarily nested loops. The goal of this paper is to present an approach that permits us to extract more synchronization-free parallelism from arbitrarily nested loops than well-known techniques do, as well as to define the research problems still to be resolved in order to exploit the full power of this approach.
2. Background

A loop is called perfectly nested if all its statements are comprised within the innermost nest. Otherwise the loop is called imperfectly nested. An arbitrarily nested loop can be either perfectly or imperfectly nested. Given a nest of n loops, the iteration vector I of a particular iteration of the innermost loop is a vector of integers that contains the iteration numbers for each of the loops in order of nesting level [1]. We refer to a particular execution of a statement T for a certain iteration I of the loops surrounding this statement as an operation T(I), or an instance of statement T at iteration I. Two operations T1(I) and T2(J) are dependent if both access the same memory location and at least one access is a write. T1(I) and T2(J) are called the source and destination of a dependence, respectively. An operation space is the set of all operations that are executed by a loop.

In this paper, we deal with affine loop nests, i.e., loops where 1) lower and upper bounds as well as array subscripts and conditionals are affine functions of surrounding loop indices and possibly of structure parameters (i.e., parameterized loop bounds), and 2) the loop steps are known positive constants.

The approach presented in this paper requires an exact representation of both loop-independent and loop-carried dependences, and consequently an exact dependence analysis that detects a dependence if and only if it actually exists¹. In this paper, we use the term dependence to mean a loop-independent or loop-carried dependence. In general, any known technique extracting exact dependences can be applied to implement our algorithm, but the description of the algorithms and the carrying out of experiments depend, respectively, on the representation of exact dependences and on the available tools for dependence analysis. To describe and carry out experiments with our algorithms, we chose the dependence analysis proposed by Pugh and Wonnacott [23], where dependences are represented with dependence relations comprised of Presburger formulas, which can be built up out of linear constraints over integer variables, logical connectives, and universal and existential quantifiers. This analysis is implemented in Petit, a research tool for dependence analysis and program transformations developed by the Omega Project (http://www.cs.umd.edu/projects/omega).

In this paper, we distinguish between a dependence, which is a pair of dependent operations (source and destination), and a dependence relation, which represents all dependences among the operations of a statement (self dependences) or of a pair of statements. To expose all the sources and destinations of the dependences yielded by a dependence relation R, we must compute the domain and range of R, respectively. We also distinguish between a dependence graph, representing all the dependences among loop operations, and a reduced dependence graph, composed of a vertex for each statement si, 1≤i≤r, of the loop and of edges joining vertices according to the dependence relations Ri,j, i, j ∈ [1, r], exposed by an exact dependence analysis, where r is the number of statements within the loop body.
¹ A non-exact representation of dependences is also possible, but it causes the loss of some parallelism because of the over-approximation of dependences, whereas this work aims at extracting maximal synchronization-free parallelism.
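To make the tuple-relation representation concrete, consider a minimal illustration of ours (not taken from the paper): the loop for (i=1; i<=n; i++) a[i] = a[i-1]; carries a flow dependence described by the relation {[i] → [i+1] : 1 ≤ i ≤ n−1}. Petit manipulates such relations symbolically as Presburger formulas; the Python sketch below merely enumerates the relation for a fixed n to show what its domain and range are.

def dependence_relation(n):
    """All (source, destination) iteration pairs of {[i] -> [i+1] : 1 <= i <= n-1}."""
    return {((i,), (i + 1,)) for i in range(1, n)}

def domain(rel):
    return {src for (src, _) in rel}

def range_(rel):
    return {dst for (_, dst) in rel}

R = dependence_relation(5)
print(sorted(domain(R)))   # [(1,), (2,), (3,), (4,)] -- all dependence sources
print(sorted(range_(R)))   # [(2,), (3,), (4,), (5,)] -- all dependence destinations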
An ultimate dependence source (destination) is a source (destination) that is not the destination (source) of another dependence. The ultimate dependence sources and destinations represented by a relation R can be found by means of the following calculations: (domain R − range R) and (range R − domain R), respectively.

The algorithm presented in this paper deals with a strongly connected component (SCC), i.e., a maximal subset of the vertices and edges of a reduced dependence graph such that for every pair of its vertices there exists a directed path from each vertex to the other.
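Continuing the toy enumeration above (ours, for illustration only), the ultimate dependence sources and destinations of an explicit finite relation fall out directly from the two set differences just given:

def ultimate_sources(rel):
    """domain R - range R: sources that are not destinations of another dependence."""
    return {s for (s, _) in rel} - {d for (_, d) in rel}

def ultimate_destinations(rel):
    """range R - domain R: destinations that are not sources of another dependence."""
    return {d for (_, d) in rel} - {s for (s, _) in rel}

R = {((i,), (i + 1,)) for i in range(1, 5)}   # {[i] -> [i+1] : 1 <= i <= 4}
print(ultimate_sources(R))        # {(1,)}
print(ultimate_destinations(R))   # {(5,)}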
3. Iteration Space Slicing

Iteration space slicing [23] takes dependence information as input to find all statement instances that must be executed to produce the correct values for specified array elements. In this paper, we introduce a specific definition of a slice, as follows.

Definition 1. Given a dependence graph, D, defined by a set of dependence relations, S, a slice is a weakly connected component of graph D, i.e., a maximal subgraph of D such that for each pair of vertices in the subgraph there exists a directed or undirected path.

If there exist two or more slices in D, then, taking into account the above definition, we may conclude that all slices are synchronization-free, i.e., there is no dependence between them.

Definition 2. The source(s) of a slice is the ultimate dependence source(s) that this slice comprises.

We use standard operations on relations and sets, such as intersection (∩), union (∪), difference (−), domain (dom R), range (ran R), composition of relations (∘), relation application (given a relation R and a set S, R(S) := { [e′] | ∃ [e] ∈ S such that [e] → [e′] ∈ R }), positive transitive closure (R⁺), and transitive closure (R* = R⁺ ∪ I, where I is the identity relation). These operations are described in detail in [19].
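The sketch below is a toy model of these operations (ours, not the paper's): it treats a relation as an explicitly enumerated, finite set of pairs, whereas the Omega library manipulates relations symbolically and can compute closures of parameterized relations.

def compose(r2, r1):
    """(r2 ∘ r1): all pairs (a, c) with a -> b in r1 and b -> c in r2."""
    return {(a, c) for (a, b) in r1 for (b2, c) in r2 if b == b2}

def apply_rel(rel, s):
    """Relation application R(S) := { e' | exists e in S with e -> e' in R }."""
    return {e2 for (e, e2) in rel if e in s}

def positive_closure(rel):
    """R+ : grow the relation by self-composition until a fixed point."""
    result = set(rel)
    while True:
        new = compose(result, result)
        if new <= result:
            return result
        result |= new

def closure(rel, universe):
    """R* = R+ ∪ I, where I is the identity relation on the given universe."""
    return positive_closure(rel) | {(e, e) for e in universe}

R = {(1, 2), (2, 3)}
print(positive_closure(R))                     # {(1, 2), (2, 3), (1, 3)}
print(apply_rel(closure(R, {1, 2, 3}), {1}))   # {1, 2, 3}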
The algorithm proposed in this paper uses a modified Floyd-Warshall algorithm [17] for calculating all transitive dependences among operations. It starts with the results of an exact dependence analysis, in the form of relations Ri,j describing direct dependences between every pair of statements i and j in an SCC. At each iteration of the algorithm, the relations Ri,j are updated, yielding at the end relations that represent all transitive dependences between each pair of statements i and j in the SCC. The algorithm is shown in Figure 1.

Input: A set of dependence relations {Ri,j} describing direct dependences between each pair of statements i, j in an SCC /* Ri,j can be empty if a dependence analysis does not extract direct dependences between statements i and j */

foreach statement r
  foreach statement p
    foreach statement q
      Rp,q := Rp,q ∪ Rr,q ∘ (Rr,r)* ∘ Rp,r;

Output: After running the algorithm, each Ri,j (in our algorithm we denote it as R̄i,j) describes all transitive dependences between statements i and j in the SCC.

Figure 1. The modified Floyd-Warshall algorithm.
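As a sketch (again on explicit finite relations rather than the symbolic tuple relations the paper operates on), the triple loop of Figure 1 can be written as follows; R is assumed to hold a (possibly empty) relation for every pair of statements.

def compose(r2, r1):
    return {(a, c) for (a, b) in r1 for (b2, c) in r2 if b == b2}

def star(rel, universe):
    """R* = R+ ∪ I on the given universe of operations."""
    result = {(e, e) for e in universe} | set(rel)
    while True:
        new = compose(result, result)
        if new <= result:
            return result
        result |= new

def modified_floyd_warshall(R, stmts, universe):
    """Update R in place so that R[p, q] holds all transitive dependences
    from statement p to statement q (denoted R̄p,q in the text)."""
    for r in stmts:
        for p in stmts:
            for q in stmts:
                # Rp,q := Rp,q ∪ Rr,q ∘ (Rr,r)* ∘ Rp,r
                via_r = compose(R[r, q],
                                compose(star(R[r, r], universe), R[p, r]))
                R[p, q] |= via_r
    return R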
4. Extracting synchronization-free slices and code generation

In this section, we describe an algorithm to find the synchronization-free slices available in an arbitrarily nested loop and to generate code that scans the slices, and the iterations of each slice, in lexicographical order. Given precomputed dependence relations for a loop, our approach consists of the following steps.

1. Extract a set comprising all ultimate dependence sources.
2. Extract the sources of slices from the set comprising the ultimate dependence sources.
3. Find the operations belonging to each synchronization-free slice.
4. Generate code scanning synchronization-free slices and the operations of each slice in lexicographical order.

Our algorithm deals with only one strongly connected component (SCC). Hence, before applying it, the reduced dependence graph has to be transformed into its strongly connected component graph. If there exist two or more SCCs in a loop, then code is first generated independently for each SCC, and the resulting code is then assembled from the code generated for each SCC, taking into account the topological order of the SCCs in the reduced dependence graph; a sketch of this decomposition follows below.
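The decomposition into SCCs and their topological ordering is standard graph machinery; a minimal sketch (ours, using Kosaraju's algorithm, which emits the SCCs sources-first) could look as follows.

from collections import defaultdict

def scc_topological_order(vertices, edges):
    """Return the SCCs of the reduced dependence graph in topological order.
    vertices: statement identifiers; edges: pairs (i, j) for every
    non-empty dependence relation Ri,j."""
    succ, pred = defaultdict(list), defaultdict(list)
    for i, j in edges:
        succ[i].append(j)
        pred[j].append(i)

    # Pass 1: record vertices in order of DFS finishing time.
    order, seen = [], set()
    def dfs(v):
        seen.add(v)
        for w in succ[v]:
            if w not in seen:
                dfs(w)
        order.append(v)
    for v in vertices:
        if v not in seen:
            dfs(v)

    # Pass 2: sweep the transposed graph in reverse finishing order;
    # each sweep collects one SCC, and SCCs come out sources-first.
    comp, sccs = {}, []
    for v in reversed(order):
        if v in comp:
            continue
        current, stack = [], [v]
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp[u] = len(sccs)
            current.append(u)
            stack.extend(w for w in pred[u] if w not in comp)
        sccs.append(current)
    return sccs

# Statements 1..4; statements 1 and 2 depend on each other cyclically:
print(scc_topological_order([1, 2, 3, 4], [(1, 2), (2, 1), (2, 3), (3, 4)]))
# -> [[1, 2], [3], [4]]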
Algorithm 1, Part 1: Sources extraction.

Input: a set S := {Ri,j | i, j ∈ [1, q]} of dependence relations representing an SCC, where the values of i, j are the statement identifiers numbered in the order in which the statements appear in the source text; each Ri,j denotes the union of all the relations describing dependences between statements i and j; q is the number of vertices in the SCC; n is the dimension of the loop.

Output: set Sources, composed of the (lexicographically minimal) sources of slices.

Method:
begin
1. foreach relation Ri,j ∈ S do
   1.1. Transform relation Ri,j so that each of its input and output tuples has exactly n elements, by inserting the value −1 at the rightmost positions of the tuples, i.e., replace a tuple [e] = [e1, e2, ..., en−k], where k is some integer, with the tuple [e1, e2, ..., en−k, −1, −1, ..., −1] ending with k copies of −1.
   1.2. Extend the input and output tuples of Ri,j with additional elements holding the identifiers of statements i and j, respectively, i.e., transform Ri,j := {[e] → [e′]} into Ri,j := {[e, i] → [e′, j]}.
2. Find a set, UDS, containing the ultimate dependence sources, as the difference between the union of the domains and the union of the ranges of all the relations in S:
   UDS := ∪Ri,j∈S dom Ri,j − ∪Ri,j∈S ran Ri,j.
3. Calculate the exact transitive closure, R*, representing all the transitive dependences in the SCC, by applying the modified Floyd-Warshall algorithm to calculate the relations R̄i,j representing all transitive dependences between each pair of statements i, j in the SCC (note: R̄i,j should not be confused with (Ri,j)⁺, the exact positive transitive closure of a single dependence relation describing dependences between statements i and j), and then computing R* as follows:
   R* := (∪1≤i,j≤q R̄i,j) ∪ I,
   where I is the identity relation.
4. Form a relation, R_USC, representing all pairs of ultimate dependence sources that are connected (by an undirected path) in the dependence graph formed on the basis of the SCC:
   R_USC := {[e] → [e′] | e, e′ ∈ UDS, e′ ≻ e, R*(e′) ∩ R*(e) ≠ ∅},
   where e′ ≻ e means that e′ is lexicographically greater than e.
5. Form the set, Sources, comprising the (lexicographically minimal) sources of slices:
   Sources := UDS − ran R_USC.
end

If R*, calculated in step 3 of the above algorithm, is described with affine forms, then the set Sources produced in step 5 is also affine, and the following algorithm can be applied for code generation. A sketch of Part 1 on explicit relations is given below.
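Putting the pieces together, here is a toy end-to-end sketch of Part 1 (ours, on explicitly enumerated relations; a real implementation performs these steps symbolically with the Omega library, since the relations are parameterized):

def normalize(rel, n, i, j):
    """Steps 1.1-1.2: pad tuples to length n with -1 and append statement ids."""
    pad = lambda e, sid: tuple(e) + (-1,) * (n - len(e)) + (sid,)
    return {(pad(e, i), pad(e2, j)) for (e, e2) in rel}

def closure(pairs):
    """R+ of a finite relation, by iterating composition to a fixed point."""
    result = set(pairs)
    while True:
        new = {(a, d) for (a, b) in result for (c, d) in result if b == c}
        if new <= result:
            return result
        result |= new

def slice_sources(relations, n):
    """relations: dict mapping (i, j) to a set of (source, destination) pairs."""
    R = set().union(*(normalize(rel, n, i, j)
                      for (i, j), rel in relations.items()))
    uds = {e for (e, _) in R} - {e2 for (_, e2) in R}   # step 2
    Rplus = closure(R)                                  # step 3 (finite case)
    Rstar = lambda e: {e} | {d for (s, d) in Rplus if s == e}
    # Steps 4-5: drop every ultimate source that shares a slice with a
    # lexicographically smaller one; the minimal source per slice remains.
    covered = {e2 for e in uds for e2 in uds
               if e2 > e and Rstar(e) & Rstar(e2)}
    return uds - covered

# One statement (id 1) with chain dependences [1]->[2] and [2]->[3]:
print(slice_sources({(1, 1): {((1,), (2,)), ((2,), (3,))}}, n=1))
# -> {(1, 1)}  (iteration [1] of statement 1 is the only slice source)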
Algorithm 1, Part 2: Code generation.

Input: Set Sources, represented with affine forms, as produced by Algorithm 1, Part 1; the exact transitive closure R*, as produced by step 3 of Algorithm 1, Part 1; the relation R_USC, as produced by step 4 of Algorithm 1, Part 1.

Output: Code scanning synchronization-free slices and the operations of each slice in lexicographical order for the SCC.

Method:
begin
  genLoops (in: Sources; out: OuterLoops, L_I);
  foreach I in L_I do
    S_Slice := R* (R_USC* (I)); /* note that if R_USC = ∅, then R_USC*(I) = I */
    genLoops (in: S_Slice; out: InnerLoops, L_J);
    foreach J in L_J do
      genLoopBody (in: OuterLoops, InnerLoops, J; out: LoopBody);
end

where genLoops (in: OperSet; out: Loops, VectorList) generates a set, Loops, of loop nests to scan the operations comprised in set OperSet, and returns a list, VectorList, of corresponding parameterized iteration vectors comprising the loop index values. Because set Sources is described with affine forms, any well-known technique, such as [3,6,8,25,29], can be applied to implement the function genLoops. genLoopBody (in: OuterLoops, InnerLoops, Iter; out: LoopBody) generates the body, LoopBody, of InnerLoops, containing the statements of the source loop to be executed at iteration Iter, and inserts the loops InnerLoops into the corresponding nest of loops OuterLoops. To generate LoopBody, the last elements of the tuples representing set InnerLoops are taken into consideration (they hold the statement identifiers and allow the appropriate statements to be chosen for insertion into the loop body).

Note that the outermost loops generated by the above algorithm can be executed in parallel, because they scan the independent sources of synchronization-free slices.

Our paper deals only with extracting the dependent operations belonging to slices. In order to scan independent operations, additional code should be added, which can be generated as follows. For each statement Si, 1≤i≤n, executed by the source loop, we calculate a set of independent operations, S_INDi, as the difference between the operation space of this statement, OSi, and the set containing all the dependent operations yielded by Si, that is:

S_INDi := OSi − (∪Ri,k∈S domain Ri,k ∪ ∪Rk,i∈S range Rk,i), where 1 ≤ k ≤ n,

as illustrated by the sketch below.
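A toy sketch of this computation on explicit relations (ours; the paper's sets are symbolic): the independent operations of a statement are its operation space minus every operation that appears as a dependence source or destination in a relation involving that statement.

def independent_ops(op_space, relations, stmt):
    """op_space: set of iteration tuples of statement stmt;
    relations: dict mapping (i, k) to a set of (source, destination) pairs."""
    dependent = set()
    for (i, k), rel in relations.items():
        if i == stmt:                        # stmt acts as a dependence source
            dependent |= {src for (src, _) in rel}
        if k == stmt:                        # stmt acts as a dependence destination
            dependent |= {dst for (_, dst) in rel}
    return op_space - dependent              # S_IND for statement stmt

# Iterations 1..5 of statement 1, with dependences [1]->[2] and [2]->[3]:
print(independent_ops({(i,) for i in range(1, 6)},
                      {(1, 1): {((1,), (2,)), ((2,), (3,))}}, stmt=1))
# -> {(4,), (5,)}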
Because all the constraints of S_INDi are affine (for the same reason as for set Sources), we can apply any well-known technique to generate loops scanning the elements of each set S_INDi, 1≤i≤n. Obviously, these loops can be executed in parallel.

To illustrate Algorithm 1, let us consider the following example.

Example 1.

for (i=1; i