Performance Optimization of a Class of Loops Involving Sums of Products of Sparse Arrays

Chi-Chung Lam†, P. Sadayappan†, Daniel Cociorva‡, John Wilkins‡, Mebarek Alouani‡

Abstract. Multi-dimensional integrals of products of several arrays arise in certain scientific computations. To optimize the performance of such computations on parallel computers, the total number of arithmetic operations and the total amount of communication need to be minimized. This paper addresses the operation minimization sub-problem and the communication minimization sub-problem. Earlier work had addressed these problems in a restricted context of dense arrays, with additional constraints. In this paper, general solutions are developed that handle sparse arrays and other features, such as fast Fourier transforms and multiple uses of arrays, that are characteristic of real computational physics applications. The new algorithm for the operation minimization sub-problem has been implemented and used to generate solutions that improve over the best manually-optimized ones by a factor of two.

1 Introduction

This paper addresses the optimization of a class of loop computations that implement multi-dimensional integrals of the product of several (possibly sparse) arrays. Such integral calculations arise, for example, in the computation of electronic properties of semiconductors and metals [1, 5, 9]. The objective is to minimize the execution time of such computations on a parallel computer. In addition to the performance optimization issues pertaining to inter-processor communication and data locality enhancement, there is opportunity to apply algebraic transformations using the properties of commutativity, associativity and distributivity to minimize the total number of arithmetic operations. In order to make the performance optimization problem more tractable, we view it in terms of three independent sub-problems:

1. Given a specification of the required computation as a multi-dimensional sum of the product of input arrays, determine an equivalent sequence of nested loops that computes the result using a minimum number of arithmetic operations.

2. Given an operation-count-optimal form of the computation (from the solution of the above sub-problem), determine the data distribution of input, intermediate and result arrays, and the mapping of computations among processors, to minimize communication cost for load-balanced parallel execution.

3. Given a sequence of loop computations to be performed on each processor (determined by solving the above two sub-problems), apply loop transformations such as loop permutation, loop tiling, and loop fusion to minimize the total number of cache misses in execution.

Supported in part by the National Science Foundation under grant DMR-9520319.
† Department of Computer and Information Science, The Ohio State University, Columbus, Ohio 43210, {clam, saday}@cis.ohio-state.edu
‡ Department of Physics, The Ohio State University, Columbus, Ohio 43210, {cociorva, mea, wilkins}@pacific.mps.ohio-state.edu


In previous work, the first sub-problem was proved to be NP-complete and an efficient pruning search strategy was proposed [7]. A polynomial-time solution to the second sub-problem was presented by us in [8]. Our prior work constrained all arrays to be distinct and dense, i.e. no two arrays are the same and all elements are nonzero. However, in practice some of the arrays in the product terms occur more than once in the expression and some of them are sparse. Moreover, some of the product terms are exponential functions, which permit some products to be computed more efficiently using a fast Fourier transform rather than an explicit matrix-vector product. In this paper, we develop an optimization framework that appropriately models identical arrays, sparsity and FFT operations. The operation minimization algorithm has been implemented and has been used to obtain significant improvement in the number of operations for self-energy electronic structure calculations in a tight-binding scheme.

Reduction of arithmetic operations has traditionally been done by compilers using the technique of common subexpression elimination [4]. Loop transformations that improve locality and parallelism have been studied extensively in recent years [6, 11]. However, we are unaware of any work on loop transformation based on the distributive law as a means to minimize arithmetic operations. Chatterjee et al. consider the optimal alignment of arrays in evaluating array expressions on massively parallel machines [2, 3], but they do not consider distribution and replication of arrays.

The rest of this paper describes how we handle the three features, namely multiple occurrences of arrays, sparsity and fast Fourier transforms (FFTs), that are characteristic of the computational physics applications considered. Section 2 presents the operation minimization algorithm and shows by an example how the implemented algorithm finds solutions that are better than the best manually-optimized solutions. Section 3 considers the optimal partitioning of data and loops to minimize communication and computational costs for execution on parallel machines. Section 4 provides conclusions.

2 Operation Minimization

In the class of computations considered, the final result to be computed can be expressed as multi-dimensional integrals of the product of many input arrays. Due to commutativity, associativity and distributivity, there are many different ways to obtain the same final result, and they can differ widely in the number of floating point operations required. The problem of finding an equivalent form that computes the result with the least number of operations is not trivial, so a software tool for doing this is desirable. Section 2.1 briefly describes this operation minimization sub-problem and a pruning search algorithm presented earlier [7] for dense arrays. The extension of the algorithm to handle identical arrays, sparsity and FFTs is described in Sections 2.2, 2.3 and 2.4, respectively. An example of its application is given in Section 2.5.

2.1 Preliminaries

Consider for example the following multi-dimensional integral:

S[t] = Σ_{(i,j,k)} A[i,j,t] · B[j,k,t]

If implemented directly as expressed above, the computation would require 2 · N_i · N_j · N_k · N_t arithmetic operations. However, assuming that associative reordering of the operations and use of the distributive law of multiplication over addition are acceptable for the floating-point computations, the above computation can be rewritten in various ways. One equivalent form that requires only N_i · N_j · N_t + N_k · N_j · N_t + 2 · N_j · N_t operations is as follows:

f1[j,t] = Σ_i A[i,j,t]
f2[j,t] = Σ_k B[j,k,t]
f3[j,t] = f1[j,t] · f2[j,t]
S[t]    = Σ_j f3[j,t]
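For concreteness, take illustrative extents N_i = N_j = N_k = N_t = 100 (our numbers, not from the paper): the direct form costs 2 · N_i · N_j · N_k · N_t = 2×10⁸ operations, while the rewritten form costs N_i N_j N_t + N_k N_j N_t + 2 N_j N_t = 10⁶ + 10⁶ + 2×10⁴ ≈ 2.02×10⁶, roughly a hundredfold reduction.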

The sequence of steps in computing the multi-dimensional integrals can be expressed as a sequence of formulae. Each formula computes some intermediate result, and the last formula gives the final result.

[Fig. 1. A binary tree representation of the above formula sequence: leaves A[i,j,t] and B[j,k,t]; summation nodes Σ_i and Σ_k compute f1 and f2; a product node computes f3; the root node Σ_j computes S.]

A formula is either a product of two input/intermediate arrays, or an integral/summation over one index of an input/intermediate array. A sequence of formulae can also be represented as a binary tree. For instance, the binary tree representation of the above formula sequence is shown in Figure 1. A pruning search algorithm for finding the optimal sequence of formulae for dense arrays is given below; a code sketch follows the list.

1. Form a list of the product terms. Let T_i denote the i-th product term and I_X the set of index variables in X[...]. Set r and c to zero. Set d to the number of product terms.

2. While there exists a summation index (say i) that appears in exactly one term (say T_a[...]) in the list and a > c, increment r and d and create a formula f_r[...] = Σ_i T_a[...], where I_{f_r} = I_{T_a} − {i}. Remove T_a[...] from the list. Append T_d[...] = f_r[...] to the list. Set c to a.

3. Increment r and d and form a formula f_r[...] = T_a[...] · T_b[...], where T_a[...] and T_b[...] are two terms in the list such that a < b and b > c, giving priority to terms that have exactly the same set of indices. The indices for f_r are I_{f_r} = I_{T_a} ∪ I_{T_b}. Remove T_a[...] and T_b[...] from the list. Append T_d[...] = f_r[...] to the list. Set c to b. Go to step 2.

4. When steps 2 and 3 can no longer be performed, a valid formula sequence has been obtained. To obtain all valid sequences, exhaust all alternatives in step 3 using depth-first search.
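The following Python sketch illustrates the search. It is our own minimal rendering, not the authors' implementation: the cost model is the dense operation count used above, the a > c / b > c bookkeeping and the same-index priority rule of step 3 are omitted, and sparsity and FFTs are ignored.

from math import prod

# Illustrative extents (our own); indices in sum_indices are summed
# over, the rest (here t) are free indices of the result.
ranges = {'i': 100, 'j': 100, 'k': 100, 't': 100}
sum_indices = {'i', 'j', 'k'}

def term_ops(idx):
    # operations proportional to the size of the array formed/summed
    return prod(ranges[h] for h in idx)

def search(terms, cost, best):
    """terms: list of frozensets of index variables, one per term."""
    # Step 2: sum away any summation index confined to a single term.
    for i in sum_indices:
        holders = [t for t in terms if i in t]
        if len(holders) == 1:
            t = holders[0]
            rest = [u for u in terms if u is not t] + [t - {i}]
            return search(rest, cost + term_ops(t), best)
    if len(terms) == 1:                       # step 4: sequence complete
        return min(best, cost)
    # Step 3: branch over every pair of terms for the next product.
    for a in range(len(terms)):
        for b in range(a + 1, len(terms)):
            f = terms[a] | terms[b]
            rest = [terms[x] for x in range(len(terms))
                    if x not in (a, b)] + [f]
            c = cost + term_ops(f)
            if c < best:                      # prune costly branches
                best = search(rest, c, best)
    return best

# S[t] = Σ_{i,j,k} A[i,j,t]·B[j,k,t]: prints 2020000, matching the
# operation count N_i·N_j·N_t + N_k·N_j·N_t + 2·N_j·N_t derived above.
print(search([frozenset('ijt'), frozenset('jkt')], 0, float('inf')))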

2.2 Identical Arrays

In some computational physics integrals, the same input array may appear in more than one product term, and each occurrence of the identical input array may be associated with different index variables. Formulae involving identical input arrays may actually be computing the same intermediate result. Consider the following function:

R[i,k,m] = Σ_{(j,l)} A[i,j] · B[l,m] · A[k,l] · B[j,k]

Assume N_i = N_j = N_k = N_l = N_m. One formula sequence that computes R[i,k,m] is:

f1[i,j,k] = A[i,j] · B[j,k]
f2[i,k]   = Σ_j f1[i,j,k]
f3[k,l,m] = B[l,m] · A[k,l]
f4[k,m]   = Σ_l f3[k,l,m]
R[i,k,m]  = f2[i,k] · f4[k,m]

Here, f1 and f3 compute the same intermediate result, since f3[k,l,m] = f1[k,l,m]. We call f1 and f3 equivalent formulae, and their costs should be counted only once; in other words, f3 can be obtained without any additional cost. Notice that f2 and f4 are also equivalent to each other, as f4[k,m] = f2[k,m]. Thus, f4 has a zero cost.

To detect equivalent formulae, we compare each newly-formed formula against each existing formula and determine whether the two formulae are of the same type (i.e. both multiplications or both summations), the operand arrays are equivalent, and the index variables in the operand arrays (and the summation index for summation formulae) can be mapped one-to-one from one formula to the other. In the above example, f1 and f3 are equivalent because they are both products of arrays A and B, and the indices k, l and m in f3 can be mapped one-to-one to the indices i, j and k in f1, respectively. Also, f4 is equivalent to f2 because both involve summation, their operands f3 and f1 are equivalent, and the indices k and m can be mapped one-to-one to i and k.
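A natural way to implement this test (our sketch with an assumed data layout, not the paper's code) is to give every formula a canonical key that renames index variables in order of first appearance; two formulae are equivalent exactly when their keys match.

def canonical(formula):
    """Canonical key for a formula.  A product is
    ('mul', (name1, idx1), (name2, idx2)); a summation is
    ('sum', sum_index, (name, idx)).  Operand names are assumed to
    already be the representatives of their equivalence classes."""
    mapping = {}
    def rename(idx):
        return tuple(mapping.setdefault(v, len(mapping)) for v in idx)
    if formula[0] == 'mul':
        # sort operands so A*B and B*A get the same key; a full
        # implementation would need a truly canonical operand order
        ops = sorted(formula[1:])
        return ('mul',) + tuple((name, rename(idx)) for name, idx in ops)
    _, i, (name, idx) = formula
    return ('sum', name, rename(idx), mapping[i])

seen, rep = {}, {}
defs = [('f1', ('mul', ('A', ('i', 'j')), ('B', ('j', 'k')))),
        ('f2', ('sum', 'j', ('f1', ('i', 'j', 'k')))),
        ('f3', ('mul', ('B', ('l', 'm')), ('A', ('k', 'l')))),
        ('f4', ('sum', 'l', ('f3', ('k', 'l', 'm'))))]
for label, f in defs:
    # replace operand names by the representatives of their classes
    if f[0] == 'sum':
        _, i, (name, idx) = f
        f = ('sum', i, (rep.get(name, name), idx))
    else:
        f = ('mul',) + tuple((rep.get(n, n), ix) for n, ix in f[1:])
    rep[label] = seen.setdefault(canonical(f), label)
    print(label, '->', rep[label])   # f3 -> f1, f4 -> f2 (zero cost)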

2.3 Sparsity

When some input arrays are sparse, the intermediate and final results may be sparse as well, and the number of operations required for each formula is lower than if the arrays were dense. Hence, there is a need to determine the sparsity of the intermediate and final results and the reduced cost of each formula. We make two assumptions about the sparsity of the input arrays. First, we assume that sparsity only exists between two of the array dimensions, i.e. whether an element is zero depends only on the indices of exactly two dimensions. Second, we assume sparsity is uniform, i.e. the non-zero elements are uniformly distributed over the range of values of an array dimension if it involves sparsity. However, these assumptions are not enough to determine the sparsity of the result arrays; it depends further on the structure of the sparse arrays. In the computational physics applications we consider, some array dimensions correspond to points in three-dimensional space, and sparsity arises from the fact that certain physical quantities diminish with distance and are treated as zero beyond a cut-off limit. For such sparse arrays, we develop an analytical model for the sparsity of the result array in terms of the sparsity of the operand arrays of a summation or multiplication formula, based on the following observation: finding the resulting sparsity is equivalent to finding the probability that a set of randomly distributed points in 3D space satisfies some given constraints on their pairwise distances.

We represent the sparsity of each array as a list of sparsity entries. Under the above assumptions, input arrays can only have a single sparsity entry each, but intermediate arrays may have multiple sparsity entries. Each sparsity entry contains the two dimensions of the array involving the sparsity and a sparsity factor. The sparsity factor is defined as the fraction of non-zero elements between the two dimensions (which is always between 0 and 1) and is proportional to the cube of the cut-off limit between the two points in 3D space corresponding to the two dimensions. The overall sparsity of an intermediate array (i.e. the fraction of non-zero elements in the entire array) is the product of all sparsity factors in its sparsity entries. The sparsity of an array can also be viewed as a graph in which each array dimension involving sparsity is a vertex and each sparsity entry is an edge connecting the two vertices for the two array dimensions in that entry. We call this graph a sparsity graph. For convenience, when an array reference is given, we use index variables to refer to array dimensions and to label the vertices in a sparsity graph. As an example, Figure 2(a) shows the sparsity entry and the sparsity graph of an array referenced by A[i,j,k] in which sparsity exists between the last two dimensions and 2% of its elements are non-zero.

For a multiplication formula, we combine the sparsity entries of the two operand arrays to obtain the sparsity entries of the result array, and examine the resulting sparsity graph. If no cycle is formed in the graph, no further work is needed, and the overall sparsity of the result array is the product of those of the operand arrays, since the dimensions involving sparsity represent independent points in 3D space. A cycle of size 2 is formed if both operand arrays have sparsity between dimensions referenced by the same pair of index variables.
The two sparsity entries forming the cycle can be coalesced into one by keeping the smaller of the two sparsity factors. This is because, for both operand elements to be non-zero, the distance between the pair of points that the two index variables represent in 3D space must be less than the smaller of the two cut-off limits corresponding to the two sparsity factors. Figure 2(c) and (d) show the resulting sparsity entries and the sparsity graphs for the two above cases. Determining the overall sparsity in the presence of cycles of size 3 or larger requires solving multi-dimensional integrals and is not considered in this paper.

For a summation formula, the overall sparsity of the result array is the probability that, for a set of randomly distributed points in 3D space, there exists a point (corresponding to the summation index) such that some given distance constraints are satisfied.

    Array                                     Sparsity entries        Overall sparsity   Sparsity graph
(a) A[i,j,k]                                  ⟨j,k,0.02⟩              0.02               j - k
(b) B[j,l]                                    ⟨j,l,0.05⟩              0.05               j - l
(c) f1[i,j,k,l] = A[i,j,k] · B[j,l]           ⟨j,k,0.02⟩ ⟨j,l,0.05⟩   0.001              k - j - l
(d) f2[i,j,k,l] = A[i,j,l] · f1[i,j,k,l]      ⟨j,k,0.02⟩ ⟨j,l,0.02⟩   0.0004             k - j - l
(e) f3[i,k,l] = Σ_j f1[i,j,k,l]               ⟨k,l,0.262⟩             0.262              k - l

Fig. 2. Sparsity entries and sparsity graphs of arrays.

We first copy the sparsity entries from the operand array to the result array, paying special attention to the entries involving the summation index. If only one sparsity entry has the summation index, we remove it from the result array. If the summation index (say i) appears in exactly two sparsity entries, say ⟨i,j,s1⟩ and ⟨i,k,s2⟩, we replace the two entries with a single entry ⟨j,k,(∛s1 + ∛s2)³⟩, because the cut-off limit of the distance between the two points represented by j and k equals the sum of the two given cut-off limits. Figure 2(e) shows an example of the second case. When the summation index appears in more than two sparsity entries, we need to solve some multi-dimensional integrals.

Once the sparsity of a result array is known, finding the cost of the formula that computes the result array is straightforward. In a multiplication formula, the number of operations is the same as the number of non-zero elements in the result array. In a summation formula, the number of operations equals the number of non-zero elements in the operand array minus the number of non-zero elements in the result array (since adding N numbers requires only N − 1 additions). These operation counts are exact if mechanisms exist to ensure operations are performed only on non-zero operands; otherwise, the operation counts represent lower bounds on the actual number of operations.
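These propagation rules translate directly into code. The sketch below is ours: the entry layout (dim1, dim2, factor) is an assumption, and the cases the text defers to multi-dimensional integrals (cycles of size 3 or larger, summation indices touching three or more entries) are left unhandled. The overall sparsity of a result is the product of its entries' factors.

def multiply_sparsity(entries1, entries2):
    """Combine the sparsity entries of two multiplication operands.
    A 2-cycle (same dimension pair in both operands) is coalesced by
    keeping the smaller factor; larger cycles are not handled here."""
    merged = {}
    for d1, d2, s in entries1 + entries2:
        key = tuple(sorted((d1, d2)))
        merged[key] = min(s, merged.get(key, 1.0))
    return [(d1, d2, s) for (d1, d2), s in merged.items()]

def sum_sparsity(entries, i):
    """Propagate sparsity entries through summation over index i."""
    with_i = [e for e in entries if i in e[:2]]
    rest = [e for e in entries if i not in e[:2]]
    if len(with_i) <= 1:          # a lone entry involving i is dropped
        return rest
    if len(with_i) == 2:          # cut-off radii (~ s^(1/3)) add up
        (a1, b1, s1), (a2, b2, s2) = with_i
        j = a1 if b1 == i else b1
        k = a2 if b2 == i else b2
        return rest + [(j, k, (s1 ** (1/3) + s2 ** (1/3)) ** 3)]
    raise NotImplementedError("needs multi-dimensional integrals")

# Fig. 2: f1 = A[i,j,k]·B[j,l], then f3 = Σ_j f1
f1 = multiply_sparsity([('j', 'k', 0.02)], [('j', 'l', 0.05)])
print(f1)                     # [('j','k',0.02), ('j','l',0.05)]
print(sum_sparsity(f1, 'j'))  # [('k','l',0.2619...)] as in Fig. 2(e)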

2.4 Fast Fourier Transform

In many of the multi-dimensional summations, some of the product terms are exponential functions of some of the indices. Since the exponential function is unique, we consider all product terms that are exponential functions to be identical. We also assume that products of exponential functions can be obtained at zero cost, since such products are themselves exponential functions. The existence of exponential functions in the product terms permits some formulae to be computed as fast Fourier transforms (FFTs), which may involve significantly fewer operations than if the same result were computed as a multiplication followed by a summation. The general form of an FFT formula is f_r[K,j] = Σ_i X[K,i] · exp[i,j], where K is a set of indices, exp[i,j] = e^(−2πij√−1/N), and j ∉ K. If X is dense, the number of operations in computing f_r is (∏_{k∈K} N_k) · C · N log₂ N, where N = max(N_i, N_j), C is a constant that depends on the FFT algorithm in use, and C · N log₂ N is the cost of each individual FFT. Sparsity in the operand array can lower both components of the FFT cost: sparsity factors involving the summation index i reduce the size of the individual FFTs, and those not involving i reduce the number of FFTs. Whether X is sparse or not, the number of operations can be expressed as (size(f_r)/N_j) · C · N log₂ N, where N = max(N_j · size(X)/size(f_r), N_j), size(A) denotes the number of non-zero elements in array A, and N_j · size(X)/size(f_r) is the number of non-zero elements from X that participate in each individual FFT. For example, consider the formula

f1[j,k,l] = Σ_i X[i,k,l] · exp[i,j]

in which X has sparsity entries ⟨i,k,0.16⟩ and ⟨k,l,0.2⟩. The resulting sparsity of f1 is given by ⟨k,l,0.2⟩. If N_i = 800, N_j = N_k = N_l = 100 and C = 10, then size(X) = 800 · 100² · 0.16 · 0.2 = 2.56×10⁵, size(f1) = 100³ · 0.2 = 2×10⁵, N = max(100 · 2.56×10⁵ / 2×10⁵, 100) = 128, and the number of operations is (2×10⁵/100) · 10 · 128 · log₂ 128 = 1.792×10⁷.

Since the result array and its sparsity are the same whether it is computed using an FFT or not, this choice does not affect the choices for subsequent formulae. Thus, we can compare the FFT cost with the combined cost of multiplication and summation and choose the lower of the two.
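As a check on the arithmetic, the cost model can be written down directly (a sketch with our own function names; the sizes are non-zero counts computed from the sparsity factors):

from math import log2

def fft_formula_cost(size_X, size_fr, Nj, C):
    """Operation count for f_r[K,j] = Σ_i X[K,i]·exp[i,j]:
    size(f_r)/N_j individual FFTs, each of size
    N = max(N_j·size(X)/size(f_r), N_j) and cost C·N·log2(N)."""
    N = max(Nj * size_X / size_fr, Nj)
    return (size_fr / Nj) * C * N * log2(N)

# The worked example above: N_i = 800, N_j = N_k = N_l = 100, C = 10,
# X with sparsity entries ⟨i,k,0.16⟩ and ⟨k,l,0.2⟩.
size_X = 800 * 100**2 * 0.16 * 0.2    # 2.56e5 non-zero elements
size_f1 = 100**3 * 0.2                # 2.0e5 non-zero elements
print(fft_formula_cost(size_X, size_f1, Nj=100, C=10))  # 17920000.0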

2.5 An Example

The algorithm described above for finding a formula sequence with the minimum number of operations has been implemented and tested. We have applied the program to several computational physics formulae that involve very complex integrals. The optimal solutions generated by the program are usually a factor of two better than the best manually-optimized solutions. One example, for the determination of self-energy in the electronic structure of solids, is specified by the following input file to the program:

sum r,r1,RL,RL1,RL2,RL3 Y[r,RL] Y[r,RL2] Y[r1,RL3] Y[r1,RL1] G[RL1,RL,t] G[RL2,RL3,t]
    exp[k,r] exp[G,r] exp[k,r1] exp[G1,r1]
RL,RL1,RL2,RL3 1000
t 100
k 10
G,G1 1000
r,r1 100000
sparse Y 1,2,0.1
end

The first two lines show the integral to be computed. The next five lines give the ranges of the index variables, and the remaining lines specify the sparsity in the input arrays. Note that the first four product terms involve an identical input array (as do the next two product terms), and the input array Y is sparse. The constant C in FFT formulae is set to 10. The best hand-obtained formula sequence has a cost of 3.54×10¹⁵ operations. Our program enumerated 369 formula sequences and found 42 sequences with the same minimum cost of 1.89×10¹⁵ operations. One minimum-cost sequence (which involves two FFT formulae) generated by the program is shown below:

f1[r,RL,RL1,t] = Y[r,RL] * G[RL1,RL,t]              cost= 1e+12
f2[r,RL1,t]    = sum RL f1[r,RL,RL1,t]              cost= 9.9e+11       dense
f5[r,RL2,r1,t] = Y[r,RL2] * f2[r1,RL2,t]            cost= 1e+14
f6[r,r1,t]     = sum RL2 f5[r,RL2,r1,t]             cost= 9.9e+13       dense
f7[k,r,r1]     = exp[k,r] * exp[k,r1]               cost= 0             dense
f10[r,r1,t]    = f6[r,r1,t] * f6[r1,r,t]            cost= 1e+12         dense
f11[k,r,r1,t]  = f7[k,r,r1] * f10[r,r1,t]           cost= 1e+13         dense
f13[k,r1,t,G]  = fft r f11[k,r,r1,t] * exp[G,r]     cost= 1.660964e+15  dense
f15[k,t,G,G1]  = fft r1 f13[k,r1,t,G] * exp[G1,r1]  cost= 1.660964e+13  dense

3 Communication Minimization on Parallel Computers

Given a sequence of formulae, we now address the sub-problem of finding the optimal partitioning of arrays and operations among the processors, in order to minimize inter-processor communication and computational costs when implementing the computation on a message-passing parallel computer. Section 3.1 briefly describes a multi-dimensional processor view and a dynamic programming algorithm we proposed earlier [8] for dense arrays. The modifications to this algorithm for handling identical arrays, sparsity and FFTs are discussed in Sections 3.2, 3.3 and 3.4.

3.1 Preliminaries

In our multi-dimensional view of the processors, each array can be distributed or replicated along one or more of the processor dimensions. We use a d-tuple to denote the partitioning of the elements of a data array on a d-dimensional processor array. The k-th position in the d-tuple corresponds to the k-th processor dimension. Each position may be one of the following: an index variable distributed along that processor dimension, a `*' denoting replication of data along that processor dimension, or a `1' denoting that only the first processor along that processor dimension is assigned any data. If an index variable appears as an array subscript but not in the d-tuple, then the corresponding dimension of the array is not distributed. Conversely, if an index variable appears in the d-tuple but not in the array, then the data are replicated along the corresponding processor dimension, which is the same as replacing that index variable with a `*'. As an example, suppose the processors form a 4-dimensional array. For a 3-dimensional data array X[i,j,k], the 4-tuple ⟨i,j,*,1⟩ specifies that the first and second dimensions of X are distributed along the first and second processor dimensions respectively (the third dimension of X is not distributed), and that data are replicated along the third processor dimension and assigned only to processors whose fourth processor dimension equals 1.

Let T_move(X, α, β) denote the communication cost of moving the elements of an array X from an initial distribution α to a final distribution β. We measure T_move empirically on the target parallel computer, for each possible pair of α and β and for several different message sizes. Given a sequence of formulae to be computed, we assume all processors compute the same formula at any given time. Let T_calc(f_r, β) denote the computational cost of calculating an intermediate array f_r with β as the distribution of the operand array(s). For multiplication, and for summation where the summation index is not distributed, we have

T_calc(f_r, β) = (∏_{h ∈ I_{f_r}} N_h) / (∏_{β_k ∈ I_{f_r}} p_k)

where p_k is the number of processors along processor dimension k, β_k is the k-th position in β, and I_{f_r} is the set of index variables in f_r. For the case of summation where the summation index i is distributed, partial sums of f_r are first formed on each processor and then either consolidated on one processor along the i dimension or replicated on all processors along that processor dimension. We denote by T_calc1 and T_move1 the computational and communication costs of forming the sum without replication, and by T_calc2 and T_move2 those with replication. We assume the input arrays can initially be distributed among the processors in any way at zero cost, as long as they are not replicated.

A dynamic programming algorithm that determines the distribution of dense arrays to minimize the computational and communication costs is given below; a code sketch follows the example at the end of this subsection.

1. Transform the given sequence of formulae into a binary tree (see Section 2.1).

2. Let T(X, α) be the minimal total cost for the subtree rooted at X with distribution α. Initialize T(X, α) for each leaf node X of the binary tree and each d-tuple α as follows:

T(X, α) = 0, if NoReplicate(α)
T(X, α) = min_{β : NoReplicate(β)} { T_move(X, β, α) }, otherwise

where NoReplicate(α) is a predicate meaning that α involves no replication.

3. Perform a bottom-up traversal of the binary tree. For each internal node f_r and each d-tuple α, calculate T(f_r, α) as follows.

Case (a): f_r is a multiplication node with two children X and Y. We need both X and Y to have the same distribution, say β, before f_r can be formed. After the multiplication, the product may be redistributed if necessary. Thus,

T(f_r, α) = min_β { T(X, β) + T(Y, β) + T_calc(f_r, β) + T_move(f_r, β, α) }

Case (b): f_r is a summation node over index i with a child X. X may have any distribution β. If i ∈ β, each processor first forms partial sums of f_r, and then we either combine the partial sums on one processor along the i dimension or replicate them on all processors along that processor dimension. Afterwards, the sum may be redistributed if necessary. Thus,

T(f_r, α) = min_β { T(X, β) + min( T_calc1(f_r, β) + T_move1(f_r, β, α), T_calc2(f_r, β) + T_move2(f_r, β, α) ) }

In either case, save in Dist(f_r, α) the distribution β that minimizes T(f_r, α).

4. When step 3 finishes for all nodes and all d-tuples, the minimal total cost for the entire tree is min_α { T(R, α) }, where R is the root of the tree. The distribution α that minimizes the total cost is the optimal distribution for R. The optimal distributions for the other nodes can be obtained by tracing back Dist(f_r, α) in a top-down manner, starting from Dist(R, α).

The running time complexity of this algorithm is O(nm²), where n is the number of internal nodes in the binary tree and m is the number of different possible distribution d-tuples. The storage requirement for T(f_r, α) and Dist(f_r, α) is O(nm).

An Example. The above algorithm for dense arrays has been implemented. As an illustration, we apply it to a triple matrix multiplication problem specified by the following input file to the program:

f1[i,j,k] = A[i,j] * B[j,k]
f2[i,k] = sum j f1[i,j,k]
f3[k,l,m] = C[k,l] * D[l,m]
f4[k,m] = sum l f3[k,l,m]
f5[i,k,m] = f2[i,k] * f4[k,m]
f6[i,m] = sum k f5[i,k,m]
i 200
j 96000
k 400
l 84000
m 200
end

The first six lines specify the sequence of formulae. The next five lines provide the ranges of the index variables. Note that the matrices are rectangular. The target parallel machine is a Cray T3E at the Ohio Supercomputer Center. We empirically measured the processor speed for the computation kernel (found to be about 400 Mflops) and T_move for each possible pair of initial and final distributions for several different message sizes. These measurements are given as auxiliary input to the program. Eight processors viewed as a logical two-dimensional 2 × 4 array are specified. The optimal distribution of the arrays that minimizes the total computational and communication time, as generated by the program, is shown in the table below. Two d-tuples in the distribution column indicate redistribution of the array. T_calc and T_move are expressed in seconds. For f2 and f4, the partial sums are not replicated.

Array       Size       Distribution (β → α)   T_calc(f_r, β)   T_move(f_r, β, α)
A[i,j]      1.92×10⁷   ⟨i,j⟩ → ⟨*,j⟩          0.000            0.793
B[j,k]      3.84×10⁷   ⟨k,j⟩                  0.000            0.000
C[k,l]      3.36×10⁷   ⟨k,l⟩                  0.000            0.000
D[l,m]      1.68×10⁷   ⟨m,l⟩ → ⟨*,l⟩          0.000            0.694
f1[i,j,k]   7.68×10⁹   ⟨k,j⟩                  2.400            0.000
f2[i,k]     8.00×10⁴   ⟨k,j⟩ → ⟨i,*⟩          2.400            0.024
f3[k,l,m]   6.72×10⁹   ⟨k,l⟩                  2.100            0.000
f4[k,m]     8.00×10⁴   ⟨k,l⟩ → ⟨*,m⟩          2.100            0.024
f5[i,k,m]   1.60×10⁷   ⟨i,m⟩                  0.005            0.000
f6[i,m]     4.00×10⁴   ⟨i,m⟩                  0.005            0.000
Total time                                    9.010            1.535
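A compact sketch of the dynamic program follows. It is our own rendering for dense arrays only: the node type and helpers are defined locally, the candidate d-tuples and the cost callbacks T_calc/T_move are assumed inputs (the paper measures T_move empirically), and the two summation variants are collapsed into a single T_calc for brevity.

from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)

def postorder(node):
    for child in node.children:
        yield from postorder(child)
    yield node

def replicated(dist):
    return '*' in dist      # d-tuple replicates along some dimension

def optimize(root, dists, Tcalc, Tmove):
    """T[name][a]: minimal cost of the subtree with that node left in
    distribution a.  `choice` records the operand distribution b that
    achieved it, for the top-down trace-back of step 4."""
    T, choice = {}, {}
    for node in postorder(root):
        T[node.name] = {}
        for a in dists:
            if not node.children:   # leaf: unreplicated starts are free
                T[node.name][a] = 0 if not replicated(a) else min(
                    Tmove(node, b, a) for b in dists if not replicated(b))
                continue
            # operands must share a common distribution b (case (a)/(b))
            costs = {b: sum(T[c.name][b] for c in node.children)
                        + Tcalc(node, b) + Tmove(node, b, a)
                     for b in dists}
            b_best = min(costs, key=costs.get)
            T[node.name][a] = costs[b_best]
            choice[node.name, a] = b_best
    return T, choice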


3.2 Identical Arrays

With the introduction of identical input arrays and equivalent formulae, a formula sequence must be represented as a directed acyclic graph (DAG) instead of a binary tree, since each identical input array or equivalent formula appears as an operand in more than one subsequent formula and thus has multiple parents. Direct application of the above algorithm to a DAG would result in an incorrect computation of the communication costs at nodes with multiple parents. The following changes are proposed to the algorithm. Let X be a multi-parent node.

- To each parent of X, we add an extra T_move cost for the redistribution of X.
- The minimization over the distribution of X is performed not at its parents, but at the dominator node of X, denoted DOM(X) (the closest ancestor of X through which every path from X to the root must pass).
- For each node f_r on any path from X to DOM(X), we keep a separate T(f_r, α) for each possible distribution of X.

To illustrate the changes, consider the following formula sequence, in which f1 has two parents:

f1[i,j,k] = A[i,j] · B[j,k]
f2[i,j,k] = f1[i,j,k] · C[i,k]
f3[i,j,k] = f1[j,k,i] · f2[i,j,k]

The equations for finding the lowest cost are:

T(f1, α) = min_β { T(A, β) + T(B, β) + T_calc(f1, β) + T_move(f1, β, α) }

T(f2, α)|(f1=γ) = min_β { T(f1, γ) + T_move(f1, γ, β) + T(C, β) + T_calc(f2, β) + T_move(f2, β, α) }

T(f3, α) = min_β { min_γ { T(f1, γ) + T_move(f1, γ, β) + T(f2, β)|(f1=γ) } + T_calc(f3, β) + T_move(f3, β, α) }

Note that the complexity of the revised algorithm is now O(n m^(t+2)), where n is the number of internal nodes in the DAG, m is the number of different possible distribution d-tuples, and t is the maximum number of multi-parent nodes that are `open' at a time. A DAG also leads to another complication, called the Steiner tree effect [10]: the distributions of the parent nodes of a multi-parent node X may be obtainable at lower cost by including more `transit' nodes between X and its parents. The revised algorithm takes care of this effect for two-parent nodes. To obtain optimal solutions for nodes with more than two parents, more Steiner trees with more T_move terms would have to be considered.

3.3 Sparsity

A sparse array is said to be evenly distributed among the processors if an equal number of array elements is assigned to each processor. We do not consider uneven distributions of sparse arrays, as they would lead to load imbalance and probably sub-optimal performance. Under the uniform sparsity assumption, a sparse array is guaranteed to be evenly distributed if no two distributed array dimensions appear in the same sparsity entry or are reachable from each other in the sparsity graph (see Section 2.3); a small sketch of this check follows below. As an example, if the array X[i,j,k,l] has sparsity entries ⟨i,j,0.1⟩ and ⟨j,k,0.1⟩, then at most one of the three indices i, j and k may be distributed; otherwise an uneven distribution could result.

Since zero elements in sparse arrays do not participate in computation or data movement, the array-size component of the computational or communication costs for a sparse array equals its number of non-zero elements. In other words, T_calc(X, β) = T_calc(X′, β) and T_move(X, α, β) = T_move(X′, α, β), where X′ is a dense array with the same number of non-zero elements as X. These formulae for T_calc and T_move of sparse arrays are exact unless the indices assigned to a processor before and after redistribution of X are mutually reachable in the sparsity graph, in which case T_move is an approximation.
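The even-distribution condition is a reachability test on the sparsity graph, which a few lines of union-find can implement (our sketch, not the authors' code):

def evenly_distributable(distributed_dims, entries):
    """True iff no two distributed dimensions are connected, directly
    or transitively, in the sparsity graph built from `entries`."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for d1, d2, _ in entries:
        parent[find(d1)] = find(d2)
    roots = [find(d) for d in distributed_dims if d in parent]
    return len(roots) == len(set(roots))

entries = [('i', 'j', 0.1), ('j', 'k', 0.1)]     # X[i,j,k,l] above
print(evenly_distributable(['i', 'k'], entries))  # False: i-j-k linked
print(evenly_distributable(['i', 'l'], entries))  # True: l unconstrained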


3.4 Fast Fourier Transform

Assuming that exponential functions are computed on the fly (by FFT routines), they are neither stored as arrays nor moved between processors, and the costs of forming them are usually absorbed into the FFT costs. Thus, we can simplify a DAG by replacing a multiplication node whose children are exponential functions with an exponential function leaf node. An FFT formula introduces into the DAG a new kind of node called an FFT node, which has the summation index as its label and the operand array and an exponential function as its two children. Let f_r be an FFT node with summation index i and operand arrays X[K,i] and exp[i,j], and let β be the distribution of X. The minimal total cost for the DAG rooted at f_r with distribution α is evaluated as follows. If i ∉ β, each processor independently performs serial FFTs on its local portion of X; otherwise, groups of processors perform parallel FFTs collectively on their portions of X. Afterwards, f_r may be redistributed if necessary. Hence,

T(f_r, α) = min_β { T(X, β) + T_calc3(f_r, β) + T_move(f_r, β, α) }  (if i ∉ β)
T(f_r, α) = min_β { T(X, β) + T_calc4(f_r, β) + T_move4(f_r, β) + T_move(f_r, β′, α) }  (otherwise)

where T_calc3 is the computational cost of forming the serial FFTs, T_calc4 and T_move4 are the computational and communication costs of forming the parallel FFTs, and β′ is β with i replaced by j.

4 Conclusions

In this paper, we have considered a compile-time optimization problem motivated by computational physics applications. The computations are essentially multi-dimensional integrals of the product of several arrays. In practice, some arrays may be sparse and some may be identical. Moreover, some integrals involving exponential functions can be computed more efficiently with fast Fourier transforms. These three characteristics were addressed in this paper, and generalizations of earlier algorithms for the operation minimization sub-problem and the communication minimization sub-problem were proposed. Work is in progress on the implementation of the changes to the communication minimization program and on the empirical measurement of the computational and communication costs associated with FFT routines. We also plan to develop an automatic code generator that takes array partitioning information as input and produces the source code of a parallel program that computes the desired multi-dimensional integral.

References

[1] W. Aulbur, Parallel implementation of quasiparticle calculations of semiconductors and insulators, Ph.D. Dissertation, Ohio State University, Columbus, October 1996.
[2] S. Chatterjee, J. R. Gilbert, R. Schreiber, and S.-H. Teng, Automatic array alignment in data-parallel programs, 20th Annual ACM SIGACT/SIGPLAN Symposium on Principles of Programming Languages, New York, pp. 16-28, 1993.
[3] S. Chatterjee, J. R. Gilbert, R. Schreiber, and S.-H. Teng, Optimal evaluation of array expressions on massively parallel machines, ACM TOPLAS, 17 (1), pp. 123-156, Jan. 1995.
[4] C. N. Fischer and R. J. LeBlanc Jr., Crafting a Compiler, Menlo Park, CA: Benjamin/Cummings, 1991.
[5] M. S. Hybertsen and S. G. Louie, Electron correlation in semiconductors and insulators: band gaps and quasiparticle energies, Phys. Rev. B, 34 (1986), pp. 5390.
[6] K. Kennedy and K. S. McKinley, Optimizing for parallelism and data locality, 1992 ACM Intl. Conf. on Supercomputing, pp. 323-334, July 1992.
[7] C. Lam, P. Sadayappan, and R. Wenger, On optimizing a class of multi-dimensional loops with reductions for parallel execution, Parallel Processing Letters, Vol. 7, No. 2, pp. 157-168, 1997.
[8] C. Lam, P. Sadayappan, and R. Wenger, Optimization of a class of multi-dimensional integrals on parallel machines, Eighth SIAM Conference on Parallel Processing for Scientific Computing, March 1997.
[9] H. N. Rojas, R. W. Godby, and R. J. Needs, Space-time method for ab initio calculations of self-energies and dielectric response functions of solids, Phys. Rev. Lett., 74 (1995), pp. 1827.
[10] P. Winter, Steiner problems in networks: a survey, Networks, 17 (1987), pp. 129-167.
[11] M. E. Wolf and M. S. Lam, A data locality optimizing algorithm, SIGPLAN '91 Conf. on Programming Language Design and Implementation, pp. 30-44, June 1991.
