Extracting data ow information for parallelizing FORTRAN ... - CiteSeerX

0 downloads 0 Views 973KB Size Report
array variables with coupled linear subscript expressions. We consider the ...... linear algebra, there exists a vector u 2 V such that the set of all possible solutions ...... 40] S. Lipschutz, Schaum's Linear Algebra, McGraw-Hill, New York, 1974.
Extracting data ow information for parallelizing FORTRAN nested loop kernels Edward Walker Submitted for the degree of Doctor of Philosophy The University of York Advanced Computer Architecture Group, The Department of Computer Science. June 1994

Thesis Abstract

Currently available parallelizing FORTRAN compilers expend a large amount of e ort in determining data independent statements in a program such that these statements can be scheduled in parallel without need for synchronisation. This thesis hypothesises that it is just as important to derive exact data ow information about the data dependencies where they exist. We focus on the speci c problem of imperative nested loop parallelization by describing a direct method for determining the distance vectors of the inter-loop data dependencies in an n-nested loop kernel. These distance vectors de ne dependence arcs between iterations which are represented as points in n-dimensional euclidean space. To demonstrate some of the bene ts gained from deriving such exact data ow information about a nested loop computation we show how implicit task graph information about the computation can be deduced. Deriving the implicit task graph of the computation enables the parallelization of a class of loop kernels which was previously thought dicult. The class of loop kernels are those involving data dependent array variables with coupled linear subscript expressions. We consider the parallelization of these loop kernels using the DO ACROSS parallel construct on shared memory and distributed memory architectures, and we compare our suggested schemes with other current proposals. We demonstrate improved execution time pro les when our schemes are adopted. Also, we show how an exact data independence test can be derived for multi-dimensional array variables by formulating a dependence constraint system using the derived dependence distance vectors. Through careful implementation and making approximating assumptions where necessary we show that our test remains exact for a very large subset of randomly generated scenarios and that it exhibits \tractable" execution times.

i

Contents 1 Introduction

1

2 Data ow analysis methods

8

1.1 Motivation and aims : : : : : : : : : : : : : : : : : : : : : : : : : : : 1.2 The thesis plan : : : : : : : : : : : : : : : : : : : : : : : : : : : : : :

2.1 Interval methods : : : : : : : : : : : : : : : : 2.2 Index subscript analysis methods : : : : : : : 2.2.1 The GCD-Banerjee test : : : : : : : : 2.2.2 The I-test : : : : : : : : : : : : : : : : 2.3 Coupled index subscripts analysis methods : 2.3.1 The Delta test : : : : : : : : : : : : : 2.3.2 The  test : : : : : : : : : : : : : : : : 2.3.3 The Power test : : : : : : : : : : : : : 2.3.4 General integer programming methods 2.4 Concluding remarks : : : : : : : : : : : : : :

3 Parallelizing nested loop kernels

: : : : : : : : : :

: : : : : : : : : :

3.1 Spreading : : : : : : : : : : : : : : : : : : : : : : 3.1.1 Low-level spreading : : : : : : : : : : : : 3.1.2 High-level spreading : : : : : : : : : : : : 3.2 Index space transformation : : : : : : : : : : : : 3.2.1 Loop restructurers : : : : : : : : : : : : : 3.2.2 Matrix transformations : : : : : : : : : : 3.3 Mapping loop iterations onto parallel processors 3.3.1 Static allocation schemes : : : : : : : : : ii

: : : : : : : : : : : : : : : : : :

: : : : : : : : : : : : : : : : : :

: : : : : : : : : : : : : : : : : :

: : : : : : : : : : : : : : : : : :

: : : : : : : : : : : : : : : : : :

: : : : : : : : : : : : : : : : : :

: : : : : : : : : : : : : : : : : :

: : : : : : : : : : : : : : : : : :

: : : : : : : : : : : : : : : : : :

: : : : : : : : : : : : : : : : : :

: : : : : : : : : : : : : : : : : :

3 5

9 11 11 13 14 15 16 16 18 19

21

23 23 24 30 31 35 40 41

3.3.2 Dynamic allocation schemes : : : : : : : : : : 3.4 Evaluating parallelizing compilers : : : : : : : : : : : 3.4.1 Dependence breaking syntax transformations 3.5 Concluding remarks : : : : : : : : : : : : : : : : : :

4 Data dependence distance vectors

4.1 Basic concepts and de nitions : : : : : : : : : : : : : 4.1.1 Iterations and subscript expressions : : : : : 4.1.2 Index spaces : : : : : : : : : : : : : : : : : : 4.1.3 Related de nitions : : : : : : : : : : : : : : : 4.1.4 Basic row-reduction matrix operations : : : : 4.2 Deriving ow dependence distance vectors : : : : : : 4.3 Intermediate ow distance vectors : : : : : : : : : : 4.4 Intermediate anti distance vectors : : : : : : : : : : 4.4.1 Deriving anti dependence distance vectors : : 4.5 Array kills and output dependence distance vectors : 4.5.1 Computation procedure : : : : : : : : : : : : 4.5.2 Examples of computation procedure : : : : : 4.6 Concluding remarks : : : : : : : : : : : : : : : : : :

5 Execution pro ling on parallel architectures 5.1 Instrumenting the loop computation : : : 5.2 The processor model : : : : : : : : : : : : 5.2.1 Communicating cost functions : : 5.3 Tracking statements and shadow variables 5.4 Concluding remarks : : : : : : : : : : : :

6 Shared memory architectures

: : : : :

: : : : :

: : : : :

: : : : :

: : : : :

: : : : :

6.1 The problem de nition : : : : : : : : : : : : : : : : : 6.2 The DO ACROSS loop on a SMA : : : : : : : : : : : : 6.2.1 The run-time dependence checking approach 6.2.2 The basic dependence vector approach : : : : 6.3 Synchronising along dependence distance vectors : : iii

: : : : : : : : : : : : : : : : : : : : : : : : : : :

: : : : : : : : : : : : : : : : : : : : : : : : : : :

: : : : : : : : : : : : : : : : : : : : : : : : : : :

: : : : : : : : : : : : : : : : : : : : : : : : : : :

: : : : : : : : : : : : : : : : : : : : : : : : : : :

: : : : : : : : : : : : : : : : : : : : : : : : : : :

: : : : : : : : : : : : : : : : : : : : : : : : : : :

: : : : : : : : : : : : : : : : : : : : : : : : : : :

: : : : : : : : : : : : : : : : : : : : : : : : : : :

45 46 47 48

50

51 51 52 54 56 58 63 66 67 69 71 72 74

75

75 76 78 79 82

83

83 85 86 87 89

6.3.1 The coarse grain DD scheme : : : : : : : : : : 6.3.2 Synchronising along anti dependencies : : : : : 6.4 Comparing the DD scheme with the BDV scheme : : : 6.4.1 Deriving the data dependence distance vectors 6.4.2 Nested loops with many data dependencies : : 6.4.3 Unravelling parallelism : : : : : : : : : : : : : : 6.5 Concluding remarks : : : : : : : : : : : : : : : : : : :

: : : : : : :

: : : : : : :

: : : : : : :

: : : : : : :

: : : : : : :

: : : : : : :

: : : : : : :

: : : : : : :

7 Distributed memory architectures

7.1 The problem de nition : : : : : : : : : : : : : : : : : : : : : : : : : : 7.2 The mapping problem : : : : : : : : : : : : : : : : : : : : : : : : : : 7.2.1 Correspondence between task graph and nested loop computation : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 7.3 The impact of reducing the communication costs : : : : : : : : : : : 7.4 A fast clustering algorithm : : : : : : : : : : : : : : : : : : : : : : : 7.4.1 The LCFM algorithm : : : : : : : : : : : : : : : : : : : : : : 7.4.2 The time complexity of the LCFM algorithm : : : : : : : : : 7.5 Data distribution via task distribution : : : : : : : : : : : : : : : : : 7.5.1 The alignment problem : : : : : : : : : : : : : : : : : : : : : 7.6 Generating the communication requirements : : : : : : : : : : : : : : 7.6.1 Eliminating the anti dependencies : : : : : : : : : : : : : : : 7.6.2 Generating the macro data ow DAG : : : : : : : : : : : : : 7.7 Execution code kernel : : : : : : : : : : : : : : : : : : : : : : : : : : 7.8 Experimental evaluation of the task partitioner : : : : : : : : : : : : 7.8.1 K-cube connected distributed memory multicomputers : : : : 7.9 Related work : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 7.10 Concluding remarks : : : : : : : : : : : : : : : : : : : : : : : : : : :

8 Exact data independence testing

8.1 The distance test : : : : : : : : : : : : : : : : : : : : : : 8.2 Phase 1: Checking for ow dependence : : : : : : : : : : 8.2.1 Generating the ow dependence distance vector : 8.3 Phase 2: Checking for feasibility : : : : : : : : : : : : : iv

: : : :

: : : :

: : : :

: : : :

: : : :

: : : :

: : : :

90 99 100 101 102 104 107

114

114 116 117 118 120 120 127 129 130 131 131 134 136 139 139 143 143

147

148 150 150 153

8.4

8.5

8.6 8.7

8.3.1 Hyperplanes and polyhedra : : : : : : : : : : : : : : : : 8.3.2 Formulating the dependence constraint system : : : : : 8.3.3 Fourier-Motzkin variable elimination : : : : : : : : : : : 8.3.4 Deriving the compact constraining polyhedron N : : : : Phase 3: Checking for integer solutions : : : : : : : : : : : : : : 8.4.1 Integer solvability : : : : : : : : : : : : : : : : : : : : : 8.4.2 The exact integer test : : : : : : : : : : : : : : : : : : : Experimental results : : : : : : : : : : : : : : : : : : : : : : : : 8.5.1 The implemented algorithm : : : : : : : : : : : : : : : : 8.5.2 Implementation details and approximating assumptions 8.5.3 Timing analysis : : : : : : : : : : : : : : : : : : : : : : : Comparison with the Omega test : : : : : : : : : : : : : : : : : Concluding remarks : : : : : : : : : : : : : : : : : : : : : : : :

9 Conclusions and future directions

: : : : : : : : : : : : :

: : : : : : : : : : : : :

: : : : : : : : : : : : :

154 155 157 159 165 165 169 173 173 176 180 182 183

184

9.1 Major contributions : : : : : : : : : : : : : : : : : : : : : : : : : : : 184 9.2 Future directions : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 186

v

List of Figures 1.1 Parallelizing translation procedure : : : : : : : : : : : : : : : : : : : 1.2 Major results overview : : : : : : : : : : : : : : : : : : : : : : : : : :

2 6

2.1 Bounding regions for array reference A : : : : : : : : : : : : : : : : : 11 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 3.10

Evaluation tree for (a) expression (3.1) and (b) expression (3.2) : : Data dependence DAG for code fragment 3.1 : : : : : : : : : : : : : Dependence graph for code fragment 3.2 : : : : : : : : : : : : : : : : Execution pro le: (a) from a greedy schedule and (b) after slope adjustment : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : Software pipelined program graph for code fragment 3.2 : : : : : : : Data dependence DAG for code fragment 3.3: (a) before reversal (b) after reversal : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : Execution order for code fragment 3.4 using the hyperplane method. Allocation via loop spreading for code fragment 3.5 : : : : : : : : : : Linear projection for allocating iterations onto a linear array : : : : Parallelization sequence for nested loop kernels : : : : : : : : : : : :

24 27 29 30 30 35 37 42 43 48

4.1 Convex hull subspace region J for code fragment 4.1 : : : : : : : : : 53 4.2 Output dependence vector between iterations in a loop kernel : : : : 56 4.3 The array variable de nition in Ii is killed by the rede nition in Ij : 61 5.1 The processor model : : : : : : : : : : : : : : : : : : : : : : : : : : : 77 6.1 Loop-level parallelism pro le for code fragment 6.1 : : : : : : : : : : 84 6.2 The data dependence DAG for a uniform recurrence equation with data dependence vector (1,2) : : : : : : : : : : : : : : : : : : : : : : 90 vi

6.3 The coarse grain DD synchronisation scheme : : : : : : : : : : : : : 6.4 Code fragment 6.1 synchronised using the coarse grain DD scheme : 6.5 1/T versus P plot for loop kernel three (Note that the BDV plots are parallel to the P axis) : : : : : : : : : : : : : : : : : : : : : : : : : : 6.6 Synthetic loop kernel one : : : : : : : : : : : : : : : : : : : : : : : : 6.7 Synthetic loop kernel two : : : : : : : : : : : : : : : : : : : : : : : : 6.8 Synthetic loop kernel three : : : : : : : : : : : : : : : : : : : : : : : 6.9 A column cyclic allocation scheme : : : : : : : : : : : : : : : : : : : 6.10 Synthetic loop kernel one synchronised using the DD scheme : : : : 6.11 Synthetic loop kernel two synchronised using the DD scheme : : : : 6.12 Synthetic loop kernel three synchronised using the DD scheme : : : 6.13 Synthetic loop kernel one synchronised using the BDV scheme : : : : 6.14 Synthetic loop kernel two synchronised using the BDV scheme : : : 6.15 Synthetic loop kernel three synchronised using the BDV scheme : : : 6.16 1/T versus P plot for loop kernel one : : : : : : : : : : : : : : : : : : 6.17 1/T versus P plot for loop kernel two : : : : : : : : : : : : : : : : : : 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 7.10 7.11 7.12 7.13 7.14 7.15

A clustering example with communication costs : : : : : : : : : : : A fork and join example : : : : : : : : : : : : : : : : : : : : : : : : Condition to merge dependence chains at a fork : : : : : : : : : : : Traversing root nodes in the LCFM algorithm : : : : : : : : : : : : The linear clustering procedure : : : : : : : : : : : : : : : : : : : : Example clustering trace of the LCFM algorithm : : : : : : : : : : Some regular data partitioning schemes for a 6  6 array structure Eliminating anti dependence vectors : : : : : : : : : : : : : : : : : Transformation from data dependence DAG macro data ow DAG Communication requirement for iteration (1,6) : : : : : : : : : : : The high level description of the execute code kernel : : : : : : : : Synthetic loop kernel one : : : : : : : : : : : : : : : : : : : : : : : Synthetic loop kernel two : : : : : : : : : : : : : : : : : : : : : : : Synthetic loop kernel three : : : : : : : : : : : : : : : : : : : : : : 1/T versus P plot for loop kernel one : : : : : : : : : : : : : : : : : vii

: : : : : : : : : : : : : : :

91 98 103 105 105 105 107 110 110 111 111 112 112 113 113 119 122 123 125 126 128 131 133 135 136 137 140 140 140 144

7.16 1/T versus P plot for loop kernel two : : : : : : : : : : : : : : : : : : 7.17 1/T versus P plot for loop kernel three : : : : : : : : : : : : : : : : : 7.18 Task load pro le for each processor in a 256-PE architecture executing loop kernel one : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 7.19 Task load pro le for each processor in a 256-PE architecture executing loop kernel two : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 7.20 Task load pro le for each processor in a 256-PE architecture executing loop kernel three : : : : : : : : : : : : : : : : : : : : : : : : : : : : : Feasible convex region S and saddle points : : : : : : : : : : : : : : Hash table for generated constraint inequalities in terms of (x1 x2x3 ) Inequality constraint elimination graph : : : : : : : : : : : : : : : : : The convex region S 0 : : : : : : : : : : : : : : : : : : : : : : : : : : : The three cases in two dimensions: i) C1 de nes a new tighter polyhedron N , ii) C2 is redundant, and iii) C3 is inconsistent. : : : : : : 8.6 Two tier hash scheme for maintaining the system of inequalities : : : 8.7 Frequency versus execution time of the distance test with certainty factor 98.6% : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 8.8 Probability density function for the execution time of the distance test with certainty factor 98.6% : : : : : : : : : : : : : : : : : : : : :

8.1 8.2 8.3 8.4 8.5

viii

144 145 145 145 146 157 160 161 170 178 179 181 181

Acknowledgements

My research work in Computer Science started with my initial interest in systolic array design. Thanks is therefore due to H. T. Kung for sparking my enthusiasm. Thanks also to Gary Morgan and Alan Wood for enlightening discussions. In particular, I will like to thank Gary for giving me the Stream Machine project; the birth place of many of the ideas found in this thesis. Thanks to Bruce Cass and Zygmunt Ulanowski, who were both part of the Stream Machine \team", for tolerating my panic induced rantings. Many thanks to the people in the Advanced Computer Architecture Group, both past and present, for a friendly and conducive research environment, and also for welcoming me at the beginning. Thanks to Elaine Yeo who patiently waited and stood by me while I struggled through this journey. I love you very much. Most of all, thanks to my parents, who taught me the virtues of hardwork and a good education; I would not be who I am without them. Finally, thanks to the producers of BBC's \Horizon" who recharged my enthusiasm for science every monday evening for the past four years. It was fun, but I'm glad its over.

ix

Declaration

I declare that this Thesis was composed my myself and that the work it describes is my own, except where stated in the text. Edward Walker

x

HPF MPP DMA SMA DAG LHS RHS RDC BDV DD def-ref

Glossary

High performance FORTRAN Massively parallel processor Distributed memory architecture Shared memory architecture Directed acyclic graph Left hand side Right hand side Run-time dependency checking Basic dependence vectors Data dependence De ning and referencing pair J Loop kernel index space xi Index variable for the ith loop X Index variable set figen The ith subscript of the de ning array variable fiuse The ith subscript of the referencing array variable coe (xi ; expr) Coecient of xi in linear functional expr di Data dependence distance vector d0i Backward data dependence distance vector  Set of dependence distance vectors  Dependence di erence vector gen vi De ned variable viuse Referenced variable IDisti Intermediate ow distance for ith dimension IAntii Intermediate anti distance for ith dimension

owi Flow dependence distance vector antii Anti dependence distance vector outi Output dependence distance vector

xi

Chapter 1

Introduction The last few years have seen the wide-spread availability of massively parallel processor (MPP) supercomputers such as the Cray T3D and the Intel Paragon. This current generation of MPP architectures has made it more important than ever to develop techniques which will compile programs to utilise eciently their wealth of available computing resources. Easy and ecient programming of these architectures is essential for making them \general-purpose" and to allow parallel processing to enter into the mainstream of computer applications development. Currently, the dominant programming language for MPP architectures is FORTRAN 77. The language persists in high performance computing because of the large and stable FORTRAN numeric code libraries which have been developed over the years. This inertia is forcing FORTRAN to evolve into the standard programming language for MPP architectures, as is seen in the draft for High Performance Fortran (HPF)[26]. There are three approaches to programming an MPP architecture using an imperative language like FORTRAN 77. The rst approach augments the language with a set of directives. The programmer then becomes responsible for inserting directives into strategic parts of the program instructing the compiler on how best to perform the parallelization process. The second approach incorporates parallel constructs directly into the language de nition. Examples of this approach are seen in HPF and IBM PARALLEL FORTRAN [26, 59], where language constructs such as PARALLEL DO and PARALLEL CASE are introduced. The third approach passes the 1

CHAPTER 1. INTRODUCTION

2

FORTRAN

Parallel FORTRAN dialect

Architecture

Figure 1.1: Parallelizing translation procedure responsibility of parallelization onto an automatic tool which extracts parallelism directly from the serial program. The third approach is closely related to the second in that the rst stage in an automatic parallelization scheme for a serial program is to generate an intermediate parallel form of the computation. This intermediate parallel form can be the parallel dialect version of the original serial program and is in fact the translation scheme adopted by almost all parallelizing FORTRAN compiler projects to date. The stages in the automatic parallelization of a FORTRAN program are succinctly illustrated in gure 1.1. Developing automatic parallelizing techniques for FORTRAN programs is important because it allows many of the currently available FORTRAN code libraries to execute on MPP architectures. Furthermore, in the parallelizing compiler researcher's point of view, the issues involved in parallelizing a FORTRAN program subsumes many of the problems faced by compiler developers of other languages. The lessons learnt in developing a parallelizing FORTRAN compiler can therefore be applied to compilers for other languages with more explicit parallel constructs. Also, developing a parallelizing compiler for an imperative language allows a programmer to develop code for an MPP architecture in a familiar language like C or FORTRAN. Such programs would also result in more portable code if e ective compilers can be developed for the di erent classes of MPP architectures. This thesis will concentrate on the automatic parallelization of FORTRAN programs. We assume an input language with no parallelizing semantic extensions and we describe procedures to translate and implement parallel FORTRAN constructs eciently. We further focus our problem by concentrating on the parallelizing trans-

CHAPTER 1. INTRODUCTION

3

lation of nested loop kernels since they account for the largest amount of parallelism in an imperative program [34]. The parallel constructs relevant to our translation schemes are the PARALLEL DO and the DO ACROSS. The PARALLEL DO concurrently executes all iterations in a loop dimension while the DO ACROSS overlaps the execution of iterations constrained by the data dependencies in the computation.

1.1 Motivation and aims Any parallelization scheme for nested loop kernels must be constrained by the data dependencies within the computation. Only when this is done satisfactorily will the newly parallelized version of the program be input/output equivalent to the original. The data dependencies between statements in a nested loop kernel can be due either to the textual order in which they are de ned, or to the temporal order in which they are executed. Thus for the simple loop shown in code fragment 1.1 below, the constraining data dependencies are characterised by 1. a data dependence for scalar variable A, between statements S1 and S2, due to the textual ordering, and 2. a data dependence for array variable C, between di erent occurrences of statement S2, due to the execution order of the loop kernel. DO I = 1, N S1:

A = B( I )

S2:

C( I+1 ) = A + C( I )

END Code Fragment - 1.1

Data dependencies which result from the textual ordering of the statements within a loop body are known as intra-loop dependencies and those which result from the execution order of the loop kernel are known as inter-loop dependencies. This thesis will look at the particular problem of determining data ow information of inter-loop data dependencies resulting from the de nition and reference (or def-ref) of array variable instances in the execution of an n-nested loop kernel.

CHAPTER 1. INTRODUCTION

4

Note C(I+1) and C(I) are de nition and reference variables respectively in code fragment 1.1. Our motivation is that many scienti c and engineering programs have nested loop constructs computing large array structures and, as we will show in this thesis, determining the nature of the data dependencies in such computation will aid us in our e orts to parallelize them. Much of parallelizing compiler research has so far concentrated on developing analysis methods for determining if a def-ref array variable pair is data independent, i.e. the de nition and reference variables access di erent regions within the array structure. There is very little research into methods which determine the actual access patterns of data dependent array variables. Determining the access pattern for an array variable in a nested loop kernel is dicult because the compiler must be able to accurately analyse the variable's subscript expressions. In particular, all studies into the parallelization of n-nested loop kernels assume array variables with subscripts of the form:

x ; : : :; xn) = : : : A(x ? c ; : : :; xn ? cn) : : :

A(

1

1

1

where xi is an index variable for the ith nest level, A(x1; x2; : : :; xn) and A(x1 ? c1; : : :; xn ? cn) is a pair of n-dimensional array de nitions and references in the nested loop kernel, and c1:::n are integer constants. There is in fact a large class of algorithms which are characterised by array variables with such simple subscript expressions. These algorithms are known as uniform recurrence equations and many scienti c and engineering programs utilise them, i.e. nite di erence methods, relaxation algorithms, etc. The parallelization of uniform recurrence equations was rst studied by Karp, Miller and Winograd [27] and subsequently re ned by later research; notably with Lamport's introduction of the hyperplane method [36]. The basis for many of our current parallelization schemes for nested loop kernels, as will be described in chapter 3 of this thesis, rely on the theoretical foundations laid out by these researchers. Notably, some parallelization schemes also depend on the accurate determination of the dependence distance vector or the dependence direction vector. The dependence distance vector for the uniform recurrence equation described above is de ned

CHAPTER 1. INTRODUCTION

5

by the vector (c1; c2; : : :; cn). Whereas, the dependence direction vector is de ned by the vector (sign(c1); sign(c2); : : :; sign(cn)), with sign(ci) taking values from the set f; =; g depending on the sign of the integer constant ci and \" representing an unknown direction. Surprisingly, there is very little research into developing methods for accurately determining these data dependence distance vectors. In particular, we are still not able to derive the data dependence distance vector of the def-ref variable pair, A(i+j,3*i+j+3) = ... A(i+j+1,i+2*j+4), or even for array variable pairs with much simpler subscripts when they appear in a nested loop kernel. Determining the dependence distance vector allows us to deduce the access pattern of the def-ref array variable pair concerned. The dependence distance vector completely describes whether a datum generated at an iteration is needed at another; it de nes the data ow for the associated computation. This thesis presents direct methods for determining the distance vectors of the inter-loop data dependencies de ning a nested loop computation. The thesis will further demonstrate some of the potential bene ts gained when such exact data ow information is known. It will show how knowing the exact data ow information allows a class of loop kernels to be parallelized which were previously thought impossible. It will also show how a data independence test can be formulated for multi-dimensional array variables with potentially better time complexity characteristics then other known independence tests.

1.2 The thesis plan A broad overview of the organisation of the results of this thesis is illustrated in gure 1.2 for clarity. The thesis hypothesises that deriving exact data ow information about the data dependencies which de ne a nested loop computation results in more e ective methods for parallelizing the nested loop kernel. The data ow information we derive takes the form of the data dependence distance vectors which describe exactly where values are de ned and referenced within the loop computation. Chapter 4 describes our procedure for determining the data dependence distance vector of a def-ref variable pair. Chapters 6, 7 and 8 develop the implications, and demonstrate

CHAPTER 1. INTRODUCTION

6 CHAPTER 6

Synchronization insertion strategy for parallelizing nested loop kernels on SMAs

CHAPTER 4 Extracting Data Dependence Distance Vectors

CHAPTER 7 Strategy for generating message passing version of nested loop kernels on DMAs

CHAPTER 8 Technique for determining data independence between "carried" array variables involved in a nested loop kernel

EXPERIMENTS Comparison with BDV and RDC scheme

EXPERIMENTS Comparison with HPF regular iteration partitioning scheme

EXPERIMENTS Timing results for 500 randomly generated cases

Figure 1.2: Major results overview the bene ts, of knowing such data ow information in a nested loop computation. Chapter 2 reviews data ow analysis techniques employed by current compilers to determine the data independence of array variables within a nested loop kernel. Such analysis is important in a parallelizing compiler because it determines completely when two statements can be executed in parallel and is therefore needed for the e ective translation of a loop kernel into it's PARALLEL DO version. Chapter 3 reviews current transformation techniques employed by state-of-the-art compilers to parallelize nested loop kernels. The chapter describes current techniques used by compilers to expose the parallelism in a nested loop kernel. It also describes how these parallelized loop kernels are mapped onto parallel computer architectures. Chapters 6 and 7 present methods for the parallelizing translation of a large class of loop kernels for which current methods are shown to be inadequate. The class of nested loop kernels targeted by our method are those involving array variables with coupled linear subscript expressions. Current methods often make approximating assumptions when such loop kernels are encountered, resulting in the loss of much potential parallelism. Loop kernels with coupled array variables account for up to

CHAPTER 1. INTRODUCTION

7

44 % of all loop kernels found in current scienti c and engineering programs [61] and, hence, are an important class to consider. Chapters 6 and 7 expound on speci c issues related to the parallelization of such loop kernels on a shared memory architecture, i.e. SMA, and a distributed memory architecture, i.e. DMA, respectively. Speci cally, chapter 6 will show how synchronisation statements may be inserted to implement the parallel DO ACROSS version of a loop on a SMA. We show that our proposed method derives better execution time pro les on parallel processor architectures than other currently proposed methods. Chapter 7 shows how a message-passing version of the DO ACROSS can be generated for a DMA. We compare an aspect of our technique with the regular iteration partitioning schemes currently proposed for HPF, and demonstrate that our technique derives improved execution times. We measure the e ectiveness of our proposed parallelization strategies in chapters 6 and 7 by comparing execution time pro les on a simulated parallel processor model. Chapter 5 describes our simulation strategy developed from a code instrumentation technique rst proposed by Kumar [32]. Data ow analysis methods, which make approximations as to the data dependence or independence of an array variable pair, are termed inexact. Chapter 8 describes an exact data ow analysis method resulting from having determined the data dependence distance vectors of the def-ref array variables in a nested loop kernel. We further show the applicability of our data ow analysis method in a realistic parallelizing compiler framework and compare our method with the Omega test; another exact data ow analysis method. Finally, chapter 9 concludes the thesis by summarising our results and discussing the future directions in which the research expounded in this thesis can take. The results presented are the theoretical \stepping-stones" for an implementation in a more realistic parallelizing compiler framework. To be able to demonstrate the utility of the results in this thesis on \real" world programs is, as in all other similar e orts, the projected aim of our research.

Chapter 2

Data ow analysis methods Data ow analysis is important in an optimising compiler because it establishes how data is de ned and referenced within the control ow graph. This information is important because it determines when an optimising transformation can be applied. Traditional data ow analysis methods, such as iterative and interval analysis [1], generate summary information for basic blocks in the control ow graph. Such summary information takes the form of the in and out sets for the basic block concerned. The in and out sets de ne the reaching de nitions entering and leaving a basic block where a reaching de nition de nes a valid variable assignment. An example is shown in the code fragment below where the reaching de nition for variable A in statement d3 is the de nition d2, i.e. A = 3. The previous de nition of the value for variable A in d1 is killed by the rede nition of A in d2. d1:

A = 4 * B

d2:

A = 3

d3:

C = A * 3

Such summary information completely determines the constraints on the order in which statements in a program are allowed to execute. In the example above, statement d3 must always execute after d2, but there is no order constraint on statement d1. Iterative and interval data ow analysis methods are only applicable to scalar variables. Scienti c and engineering programs with nested loop kernels referencing large array structures [61, 32] require much more sophisticated analysis methods 8

CHAPTER 2. DATA FLOW ANALYSIS METHODS

9

for determining the data ow pattern between array variables. Analysing data dependencies between array variables is dicult because of the large number of array elements which have to be analysed, and the need to analyse the array variable subscript expressions which can be arbitrarily complex. Many compilers make the conservative assumption that a data dependence always exists between the de nition and reference of an array variable pair. In the code fragment shown below: DO I = 1, N, 2 d1:

A(2*I) = ...

d2:

...

= A(2*I+1) ...

ENDDO Code Fragment - 2.1

many compilers would assume a data dependence between the array variable definition A(2*I) and the array variable reference A(2*I+1). However, a data ow analysis method should be able to disambiguate the subscript expressions and determine that the statements d1 and d2 are, in fact, data independent since A(2*I) de nes even numbered elements and A(2*I+1) references odd numbered elements. Since the statements d1 and d2 are data independent, they can be executed in parallel without synchronisation thereby resulting in a speedup of up to 2  N over the original serial execution. This chapter reviews the data ow analysis methods which have been proposed in the literature. Section 2.1 looks at an extension of the interval analysis method to support array variables. Section 2.2 describes general analysis methods for array variables with uncoupled subscripts. Finally, section 2.3 describes some very accurate techniques which have been proposed for analysing subscript systems even in the presence of coupled subscripts.

2.1 Interval methods Gross and Steenkiste [22] have extended the interval analysis method to support array variables. Their technique de nes regions for an array reference for which an array de nition is known to be valid. For example, in the code fragment shown

CHAPTER 2. DATA FLOW ANALYSIS METHODS

10

below, de nition d1 is valid for the region A(1:5,6:10) and d2 is valid for the region A(6:10,1:5). DO I = 1, 5 DO J = 6, 10 d1:

A( I, J ) = ...

ENDDO ENDDO DO I = 6, 10 DO J = 1, 5 d2:

A( J, I ) = ...

ENDDO ENDDO DO I = 1, 10 DO J = 1, 10 d3:

...

= A( I, J ) ...

ENDDO ENDDO

Since the two regions do not overlap, as shown in gure 2.1, we determine the two de nitions to be independent and some future reference to A, i.e. statement d3, will have reaching de nitions d1 or d2. The condition under which a reaching de nition is valid is expressed in terms of the index variables of the enclosing loops. Thus for condition f1  I  5 ^ 6  J  10g, de nition d1 is valid and for f6  I  10 ^ 1  J  5g, de nition d2 is valid. The extended interval method was implemented in the W2 compiler for the Warp machine. Its primary advantage over the other schemes presented later is that it represents a uni ed approach to handling global data ow analysis; it handles scalars in the same way by looking at them as arrays of dimension zero. One disadvantage of the scheme is that it only handles rectangular array access patterns. As such the access pattern of array A in code fragment 2.1 cannot be di erentiated. Also, the scheme can only handle a single def-ref array pair within a nested loop. Furthermore, it does not consider coupled subscript expressions. The array subscripts are said to

CHAPTER 2. DATA FLOW ANALYSIS METHODS

11

10 d1 6 5 d2 1 1

5 6

10

Figure 2.1: Bounding regions for array reference A be coupled when an index variable appears in more than one subscript expression. In fact, coupled subscripts result in inaccuracies in many of the data ow analysis schemes presented later.

2.2 Index subscript analysis methods There is a further class of data ow analysis strategies which attempt to analyse the array subscript expressions directly. For example, in code fragment 2.1 many schemes will attempt to solve the dependence equation (2.1) 2I1 = 2I2 + 1 =) 2I1 ? 2I2 = 1

(2.1)

with 1  I1 , I2  N. If a solution exists for equation (2.1), we deduce an occurrence when a value de ned in variable A, in statement d1, is referenced in statement d2. This information is important because statements in a loop kernel which are deduced to be data independent can then be executed in parallel.

2.2.1 The GCD-Banerjee test The expression in equation (2.1) is known as a linear diophantine equation in two variables. From the GCD theorem in Number Theory, equation (2.1) has an integer solution if and only if the greatest common divisor (GCD) of the coecients in the

CHAPTER 2. DATA FLOW ANALYSIS METHODS

12

LHS exactly divides the constant term in the RHS. Thus, for dependence equation (2.1), the GCD of the coecients on the LHS, GCD(2,2)  2, does not exactly divide the constant term on the RHS. Hence we can deduce no data dependence. This test is known as the GCD test and it was rst introduced by Uptal Banerjee [5]. The major disadvantage of the GCD test is that it only predicts the existence of an unconstrained integer solution. Where an integer solution exists which does not satisfy the bound inequalities constraint, i.e. 1  I1, I2  N, the GCD test will predict a data dependence when none exists. Another test which takes the bound constraints into consideration uses the intermediate value theorem. A real solution exists if the RHS is found to lie within the minimum and maximum values which the LHS can take. Therefore for dependence equation (2.1), min(2I1min ? 2I2max ) = 2 ? 2N and max(2I1max ? 2I2min) = 2N ? 2. If 2 ? 2N  1  2N ? 2, equation (2.1) has a real solution within the bound constraints, otherwise a real solution does not exist. This bounds test is commonly referred to as the Banerjee test as it was also rst introduced by Uptal Banerjee [5]. Note that the Banerjee test only decides if a constrained real solution exists. An array subscript analysis scheme can be generalised for a pair of def-ref mdimensional array variables illustrated by the code fragment shown below: DO

x

1

= . . .

L, U

DO

1

xn

d1: d2:

1

L n , Un gen (X ); : : :; f gen(X )) = ... A(f m use(X )) ... = A(f use(X ); : : :; fm =

1

1

ENDDO . . . ENDDO

where X is the index set f x1; : : :; xn g and figen (X ) and fiuse(X ) are linear functionals in terms of X . A data dependence exists between d1 and d2 if the system of

CHAPTER 2. DATA FLOW ANALYSIS METHODS

13

diophantine equations,

f gen(X ) ? f use(X 0)  1

1

a ; x ? b ; x0 +    + a ;nxn ? b ;nx0n 11

1

11

1

1

1

= c1

     fmgen(X ) ? fmuse(X 0)  am; x ? bm; x0 +    + am;n xn ? bm;n x0n = cm 1

1

1

1

or described more concisely

 a ; v +    + a ; nv n = c      Sm  am; v +    + am; nv n = cm S

1

11 1

12

1 1

2

2

1

(2.2)

2

has an integer solution subject to the following subscript constraint inequalities

L 

v ;v

1

1

2

1

  LN  v n? ; v n 2

1

U   Un

2

(2.3)

This problem is similar to integer programming where given the subscript equalities represented by system (2.2) and the inequalities represented by system (2.3), an integer solution represented by the vector ~v = (v1 ; v2; : : :; v2n) is required which will maximise some cost function cost(~v). Our data ow analysis problem is simpler in that we only determine if an integer solution exists, or in the case of a geometric interpretation, we aim to determine if the hyperplanes described by the subscript equalities intersect at S, within a region V bounded by the subscript constraint inequalities, with S containing some integer point.

2.2.2 The I-test The GCD-Banerjee test does not determine exactly if a def-ref array variable pair is data independent. The reason for this is that it does not distinguish the case where the real solution satis es the bounds constraints, (2.3), but the integer solution does not; both tests will determine a dependence. The I-test proposed by Kong, Klappholz and Psarris [31], integrates the two tests and determines exactly whether an integer solution exists for each equation in the dependence system (2.2) constrained by the bounds constraints (2.3).

CHAPTER 2. DATA FLOW ANALYSIS METHODS

14

The I-test poses the subscript equality equations as interval equations of the form, a1;1x1 ? b1;1x01 +    + a1;nxn ? b1;nx0n = [L1; U1] (2.4)    am;1 x1 ? bm;1x01 +    + am;n xn ? bm;nx0n = [Lm; Um] where for the ith subscript equality, the interval equation

ai; v +    + ai; nv n = [Li; Ui] 1 1

2

(2.5)

2

is checked for integer-solvability. If g is the GCD of the coecients fai;1, : : :, ai;2ng, an integer solution exists if Li  g dLi =g e  Ui . Each interval equation is then successively transformed through a series of variable elimination steps for which the step to eliminate v2n is shown below:

ai; v +    + ai; n? v n? = [Li ? ai; nU n + a?nL n; Ui ? a nL n + a?nU n] 1 1

2

+ 2

1 2

2

2

1

2

+ 2

2

2

2

assuming jai;2nj  U2n ? L2n + 1 and

a = a if a  0, 0 otherwise a? = a if a  0, 0 otherwise +

With each new interval equation generated the bounds test is performed. If the procedure encounters an equation which is inconsistent, the I-test concludes that the array variables concerned are independent. The I-test is an exact test for def-ref array variable pairs possessing uncoupled subscripts. Array variables with coupled subscripts require more sophisticated techniques to disambiguate the array variable access patterns.

2.3 Coupled index subscripts analysis methods We have so far avoided the complications involved in the analysis of coupled subscript expressions. For all the tests described so far multi-dimensional array variables are tested subscript-by-subscript. That is, the systems (2.2) and (2.3) are formulated and tested one array dimension at a time.

CHAPTER 2. DATA FLOW ANALYSIS METHODS

15

An array variable has coupled subscripts if there exists some index variable xi 2 X which appears in two or more subscript expression elds. For example, the array variable A(x+y,x) is said to have coupled subscripts. For a def-ref array variable pair with coupled subscripts a subscript-by-subscript test will yield an over conservative estimate. For example, in the code fragment below a subscript-bysubscript application of the GCD-Banerjee test will determine a data dependence when none exists. DO I = 1, N d1:

A(i+1,i+2) = A(i,i) + 3

ENDDO Code Fragment - 2.2

2.3.1 The Delta test The Delta test proposed by Goft, Kennedy and Tseng [21] accounts for coupled subscripts by using a constraint propagating technique. Their scheme rst involves partitioning the subscript equality expressions into coupled and uncoupled groups, i.e. checking for separability. They then attempt to solve the subscript equalities in an order where an uncoupled subscript equality is favoured over a coupled one. A data independence test (i.e. the GCD-Banerjee test as suggested by the authors) is then performed and constraints are generated in the form of either 1) a dependence hyperplane, 2) a dependence distance or 3) a dependence point. The generated constraints are propagated into the next subscript test where independence is concluded when either the application of the data independence test determines so, or the set of constraints do not intersect. A simple example of the application of the Delta test can be seen in conjunction with code fragment 2.2. The rst subscript dimension is checked where a data independence cannot be determined. A dependence distance constraint, c1 : i1 - j1 = 1, is generated. The second subscript dimension is then checked where again a data independence cannot be determined. A second dependence distance constraint, c2 : i1 ? j1 = 2, is generated. Since c1 \ c2 = ;, we conclude that the def-ref array variable pair is data independent.

CHAPTER 2. DATA FLOW ANALYSIS METHODS

16

2.3.2 The  test The -test proposed by Li, Yew and Zhu [38] is an ecient data independence test for two multi-dimensional array variables. It is one of the few tests that attempts to solve system (2.2) simultaneously. The test is shown to be especially ecient for two-dimensional arrays. Studies of scienti c and engineering programs [61] have shown that two-dimensional arrays are the most common type of array structures used. The central idea of the -test, for two-dimensional arrays is the determination of a set of and  lines, on a (1; 2) plane, which de ne the dependence system. The and  lines are of the form a1 + b2 = 0 where a and b are integer constants. The -test for data independence involves forming linear combinations of the and  lines,

f1 ;2 =  S +  S 1

1

2

2

called a -plane, and showing that each -plane does not intersect the subspace V de ned by the constraining subscript inequalities, as de ned in system (2.3). To check if the -plane intersects V a simple bounds computation is performed. The procedure rst determines min(f1 ;2 ) and max(f1 ;2 ) and if min(f1 ;2 )  0  max(f1;2 ) we then conclude that f1 ;2 intersects V.

2.3.3 The Power test The -test, like the Banerjee bounds test, determines if there is a constrained real solution to the equalities in system (2.2). A true dependence will only occur if there is a constrained integer solution to the system. The Power test, proposed by Wolfe and Tseng [69], integrates the generalised GCD test for unconstrained integer solutions with the Fourier-Motzkin variable elimination technique for constrained real solutions. The test will determine if there is a constrained integer solution if one exists. The test can also handle triangular, trapezoidal and other convex loop bounds. There is a slight inaccuracy in the test, however, which will be explained later. The generalised GCD test was proposed by Banerjee to determine if an unconstrained integer solution exists for the system (2.2). We start with the system (2.2)

CHAPTER 2. DATA FLOW ANALYSIS METHODS

17

which can be concisely described by the matrix-vector product pair

~v  A = ~c where ~v = (v1; : : :; v2n),

0 BB A = B B@

(2.6)

1

a ;  a ; n C C 11

.. .

12

.. .

...

am;    am; n 1

CC A

2

and ~c = (c1; : : :; c2n)T . In the generalised GCD test, (2.6) can be factored (eg. Gaussian elimination) into a unimodular matrix U and an echelon matrix D such that U  A = D. The system (2.6) has an integer solution if and only if there exists an integer vector ~t such that ~tD = ~c. Since D is an echelon matrix, ~t can be found by back substitution. If ~t cannot be determined the system has no solution and the def-ref array pair are independent. If ~t has an integer solution,

~v = ~tU

(2.7)

is the solution to the system (2.6). The Power test continues by determining if the integer solution for expression (2.7) satis es the constraint system (2.3). It uses the Fourier-Motzkin variable elimination method to determine this. For example, consider the system of three constraint inequalities

t +t > 1 t ?t < 2 t < ?2 1

2

1

2

(2.8)

2

Transforming system (2.8),

t + t > 1 =) t > 1 ? t t ? t < 2 =) t < 2 + t 1

2

1

2

(2.9)

1

2

1

2

(2.10)

The Fourier-Motzkin eliminates t1 by projecting (2.9) > (2.10) =) 2 + t2 > 1 ? t 2 =) t2 > ? 21

CHAPTER 2. DATA FLOW ANALYSIS METHODS

18

and combining the new inequality with the original system (2.8),

? 12 < t < ?2 2

which is inconsistent. We therefore conclude no feasible solution to the system of inequalities. Now the solution to expression (2.7) for v1 , : : :, v2n is expressed in terms of the free variables tm+1 , : : :, t2n. The Power Test substitutes the relevant terms in system (2.3) with their free variable solution from expression (2.7) and solves the new constraint system using the Fourier-Motzkin variable elimination method. If an inconsistent pair of bounds is encountered the test deduces no data dependencies between the def-ref array variable pair. There is an inaccuracy in the Power test, for where there is an inequality

h t + h t +    + h k tk  U 1 1

2 2

and we wish to eliminate variable tk ,

hk tk  U ? h t ? h t ?    hk? tk? =) tk  U ? h t ? h th ?    hk? tk? k 1 1

2 2

1

1

1 1

2 2

1

1

the Power test takes the ceiling of the division on the RHS, i.e.

 U ? h t ? h t ?    ? hk? tk?  tk  h 1 1

2 2

1

1

k

The ceiling operator in this case is the source of imprecision because it enlarges the solution space and may consequently introduce integer solutions where none exist.

2.3.4 General integer programming methods All the data ow analysis methods which have been described in this chapter are inexact, in that a dependence between a def-ref array variable pair may be reported when none exists. As mentioned earlier in this chapter, an exact answer can be derived through integer programming. Whereas integer programming requires a vector ~v to be found which minimises or maximises some cost function, a data independence test need only determine if there is a feasible solution within a convex set in R2n and if that solution space contains at least one integer point. Wallace

CHAPTER 2. DATA FLOW ANALYSIS METHODS

19

[67] Lu, Chen [41] and Pugh [53] have adopted this approach. Note that there is already a very large body of work on integer programming and that the method is generally agreed to be NP-complete. Wallace proposes the Constraint-Matrix method which uses the simplex method in linear programming modi ed to solve integer programming problems. The number of pivoting steps in his method is bounded by a limit to prevent a condition known as cycling, where the simplex method is known not to converge onto a solution. Lu and Chen describe a new integer programming algorithm but the cost of using their procedure is dicult to access as they do not provide a metric. Finally, Pugh proposes the Omega test which uses a modi ed version of the Fourier-Motzkin variable elimination algorithm descibed earlier. Through careful implementation, Pugh claims the Omega test to be practical for commonly encountered def-ref array variable pairs. The Omega test has, however, a worst case exponential time complexity, although Pugh claims that the algorithm is actually bounded within polynomial time for common cases.

2.4 Concluding remarks Data ow analysis for multi-dimensional array variables requires complicated techniques to disambiguate the access patterns de ned by the array subscript expressions. Table 2.1 summaries the characteristics of the data ow analysis methods described in this chapter. Almost all the proposed analysis methods have avoided the complexity of an exact method by deeming it sucient to provide an approximate solution. The exceptions have been methods which aim to solve the data ow analysis problem exactly as integer programs. However, these integer programming techniques have worst case exponential time complexity which have discouraged many parallelizing tool developers from adopting them. Pugh claims the Omega test [53], which is an integer programming algorithm, to have a typical performance which is \acceptable". He makes this claim based on experiments which employ the Omega test on a collection of commonly used scienti c and engineering loop kernels. He does not, however, prove his assertion for the general case. In chapter 8 we present an exact data independence test based on a formulation of the dependence

CHAPTER 2. DATA FLOW ANALYSIS METHODS

20

Method

constrained integer

integer

constrained real

coupled subscript

exact

Interval analysis GCD test Banerjee test I-test -test Delta test Power test Integer programming

Yes No No Yes No Yes Yes Yes

No Yes No Yes No Yes Yes Yes

No No Yes Yes Yes Yes Yes Yes

No No No No Yes Yes Yes Yes

No No No No No No No Yes

Table 2.1: Data ow analysis algorithms constraint system using the dependence distance vectors derived from techniques described in chapter 4. We call this test the distance test. The distance test has worst case exponential time complexity but we show that it has potentially better performance, in the worst case, than the Omega test. We have also run experiments employing the distance test on a very large number of randomly generated scenarios and we show its performance behaviour to be quite acceptable when carefully \constrained".

Chapter 3

Parallelizing nested loop kernels Manually programming massively parallel processor (MPP) architectures to utilise eciently the wealth of available computing resources is, at best, ine ectual and error-prone. Automatic parallelizers are advantageous because 1) correct parallel programs can be assured, 2) programs become more portable, and 3) problem solving is aided when issues such as task partitioning, data placement and synchronisation are handled by the compiler. In this chapter we review the current state-of-the-art in parallelizing compiler technology. We only examine techniques employed in parallelizing imperative languages, i.e. FORTRAN 77, with minimal or, ideally, no parallelizing semantic extensions. Furthermore, because nested iterative loop structures represent the largest amount of potential parallelism in an imperative program this review will only cover techniques which deal with such structures. Although this represents a subset of current parallelizing compiler research it is still arguably the most important of the di erent sub-areas. A major omission from this review is the parallelization of imperative code for very-long-instruction-word (VLIW) computers which is beyond the scope of this thesis. Also, we do not review technqiues to parallelize special program structures such as recursion, sparse matrix computation and computation with indirection, because they do not directly apply to the applications we examine in this thesis. The reader is directed to the survey paper [7] for a good introduction to the topics not covered. Parallelizing compiler technology has been studied since the early sixties and 21

CHAPTER 3. PARALLELIZING NESTED LOOP KERNELS

22

many techniques have been proposed and subsequently discarded. In commercial use, there are two source-to-source FORTRAN parallelizers: VAST and KAP. These employ similar parallelizing technologies and represent the state-of-the-art. Experimental parallelizing compilers also exist in academia. Of the major experimental compilers, the most successful have been Parafrase II [49] at the University of Illinois, Pfc [3] at Rice University, Vienna FORTRAN [9] at the University of Vienna and PTRAN [59] at IBM Yorktown. We do not review each compiler separately because many of the techniques employed by them are similar. Individual techniques in the areas of syntactic restructurers and loop-processor allocation schemes are presented instead. Sections 3.1 and 3.2 review the di erent techniques used by modern supercompilers to derive the parallel form of nested loop kernels. Section 3.3 reviews techniques currently used to map the derived parallel form of a nested loop kernel onto the parallel architecture. Finally, section 3.4 reviews the e ectiveness studies which have been conducted on these techniques. There are in general two approaches to parallelizing loop structures in an imperative language: spreading and index space parallelization. Spreading unfolds parallel tasks in the loop body and distributes the tasks across processors in the parallel architecture. Index space parallelization seeks to expose parallel sections in the iteration execution sequence of the loop computation by modifying the order of its execution. The distinction between spreading and index space parallelization is actually highly ariti cal as both techniques borrow ideas from each other. We nd the classi cation useful in this chapter as it serves to provide some semblance of order in the vast amount of research that has been undertaken in this area. Note that parallel loops are denoted by the key word PARALLEL DO in the loop header. For serial loops transformed into their parallel form we ignore special cases for the rst and last iteration if they arise. This is done in order to present the essential transformation concepts uncluttered by special case observations.


3.1 Spreading

The spreading approach to parallelizing loop kernels can be further divided into low-level spreading and high-level spreading. Low-level spreading parallelizes loop kernels at the primitive operator level, while high-level spreading parallelizes loop kernels at the statement level. In the following sections we discuss each in turn.

3.1.1 Low-level spreading

If we assume a fine-grain parallelism model, the expression

    a × b × c + a × b                                                   (3.1)

can have its arithmetic operators, represented in the evaluation tree shown in figure 3.1(a), spread over four processors. The operators can then be executed in parallel constrained only by the data dependencies shown in the tree's arcs. Assuming that the addition and multiplication operators take one unit cycle to execute and that the evaluation tree is solved in parallel, the execution time is calculated to be three unit cycles as opposed to four unit cycles in a pure serial implementation.

Two optimisations proposed by Kuck [33] can be applied to low-level spreading: forward substitution and tree-height reduction. In forward substitution, variables in an assignment statement are substituted with the expressions which generate their values. For example, in the code fragment shown below, variable A in statement S2 is replaced with the RHS of statement S1, giving a single assignment statement S3.

S1:  A = C * 2          ⟹    S3:  D = C * 2 * E
S2:  D = A * E

Forward substitution increases the effective parallelism in an evaluation tree and increases the overall parallelism that can be exploited by the compiler. In tree-height reduction, expression (3.1) is reduced through the property of associativity into expression (3.2).

    a × b × (c + 1)                                                     (3.2)

The modified evaluation tree for (3.2) is shown in figure 3.1(b). In the modified evaluation tree three processors are sufficient to exploit the available parallelism. Furthermore, the total execution time has now been reduced to two unit cycles. Note the reduction in the height of the evaluation tree.


Figure 3.1: Evaluation tree for (a) the original expression (3.1) and (b) expression (3.2) after tree-height reduction

Tree-height reduction may also be achieved by using other algebraic properties such as commutativity and distributivity. Low-level spreading assumes an architecture which can effectively take advantage of fine-grain parallelism. Unfortunately such architectures are not common, as such a model requires excessive synchronisation. Low-level spreading is therefore not often used; spreading at a higher level is preferred.
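The effect of tree-height reduction can be seen directly on the two expression trees. The following Python sketch is ours and purely illustrative; the tuple representation and the function name are not taken from the thesis.

# Expression trees as nested tuples: (operator, left, right) or a leaf symbol.
def height(node):
    if not isinstance(node, tuple):
        return 0
    _, left, right = node
    return 1 + max(height(left), height(right))

# Expression (3.1): a*b*c + a*b
expr_3_1 = ('+', ('*', ('*', 'a', 'b'), 'c'), ('*', 'a', 'b'))
# Expression (3.2): a*b*(c+1), obtained by factoring out a*b
expr_3_2 = ('*', ('*', 'a', 'b'), ('+', 'c', 1))

print(height(expr_3_1))   # 3 operator levels -> three unit cycles in parallel
print(height(expr_3_2))   # 2 operator levels -> two unit cycles in parallel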

3.1.2 High-level spreading

In high-level spreading, statement-level tasks are spread over the parallel computer. Consider the code fragment shown below:


DO I = 1, N
S1:   A(I) = B * 5
S2:   C(I-2) = A(I-2) + D
ENDDO

The two assignment statements, S1 and S2, can be executed in parallel within an instance of the loop body because there are no intra-loop data dependencies. The parallelism within the loop body can be further improved with an optimisation called loop unrolling [34]. In the case of the above code fragment the semantics of the computation will remain unchanged when the loop is unrolled by one iteration. Loop unrolling can increase the amount of parallelism within the loop body, as is seen in the transformed code fragment shown below. Note that COBEGIN and COEND state that the set of statements they surround should be executed in parallel.

DO I = 1, N, 2
  COBEGIN
S1:   A(I) = B * 5
S2:   C(I-2) = A(I-2) + D
S3:   A(I+1) = B * 5
S4:   C(I-1) = A(I-1) + D
  COEND
ENDDO

The fundamental rule in loop unrolling is that, given a distance vector (δ1, ..., δn) for an inter-loop data dependence, the loop at nesting level i can be unrolled δi − 1 times without violating any data dependencies [50]. In the above code fragment, the inter-loop dependence distance for the variable A is two. Loop I can therefore be unrolled by one without violating the inter-loop flow dependence.
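This rule can be checked mechanically. Below is a small Python sketch (ours; the function name is illustrative, and it assumes every given vector carries the dependence at the chosen level, i.e. has a positive component there).

def max_unroll(distance_vectors, level):
    # Largest legal unroll count for the loop at `level` (0-based):
    # the minimum over all dependencies of delta_i - 1.
    return min(d[level] - 1 for d in distance_vectors)

# The example above has a single dependence on A with distance (2,),
# so the I loop may be unrolled by one extra iteration.
print(max_unroll([(2,)], 0))   # -> 1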


Cycle shrinking

Polychronopolus [50] generalised loop unrolling into a transformation called cycle shrinking. In this transformation a nested loop kernel is linearised and then transformed into a two-level nested loop: an outermost serial and an innermost parallel loop. Note that even though there may be a data dependence between two arbitrary statements, there are instances of the two statements which are not involved in a currently active dependence. Cycle shrinking extracts these dependence-free instances and generates an inner PARALLEL DO loop. To illustrate the method, consider the code fragment below:

DO I = 1, N1
  DO J = 1, N2
    A(I,J) = A(I-1,J-1) * 3
  ENDDO
ENDDO
Code Fragment - 3.1

Part of the data dependence DAG for code fragment 3.1 is shown in figure 3.2. The two-level nested loop has a single data dependence vector d1 = (1,1) whose lexical distance is given by λ1 = 1 + 1·N2. Cycle shrinking will first linearise the loop kernel such that

DO IJ = 1, N1 * N2
  IF ( IJ > λ1 ) THEN A(IJ) = A( IJ - λ1 ) * 3
ENDDO

The transformation then allows groups of iterations, where no two are a source and a sink of the same data dependence, to execute in parallel. In figure 3.2, a group is highlighted within the dotted enclosure.

Figure 3.2: Data dependence DAG for code fragment 3.1

The restructured program with the innermost parallel loop is given below:

DO K = 1, (N1 * N2)/λ1
  t1 = (K - 1) * λ1 + 1
  t2 = K * λ1
  PARALLEL DO IJ = t1, t2
    IF ( IJ > λ1 ) THEN A(IJ) = A( IJ - λ1 ) * 3
  ENDDO
ENDDO

The effect of the new restructured loop is similar to having unrolled the linearised loop by λ1 − 1 times and executing the statements in the now enlarged loop body in parallel. Cycle shrinking, however, can only be applied efficiently to loop kernels with constant data dependence vectors, i.e. where the distance between dependent statements is constant throughout the execution of the algorithm.
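A minimal sketch of the grouping performed by cycle shrinking, assuming the constant dependence vector of code fragment 3.1 (the Python code and the small loop bounds are ours, for illustration only):

# Cut the linearised loop into blocks of lambda iterations; within a block no
# iteration is both the source and the sink of the dependence, so each block
# can be executed as a PARALLEL DO.
N1, N2 = 4, 3                       # small illustrative bounds
lam = 1 + 1 * N2                    # lexical distance of d1 = (1,1)

def blocks(n_iterations, lam):
    for k in range(1, n_iterations // lam + 1):
        t1 = (k - 1) * lam + 1
        t2 = k * lam
        yield list(range(t1, t2 + 1))

for group in blocks(N1 * N2, lam):
    print(group)                    # each group may execute concurrently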

Loop skewing

A further transformation which facilitates spreading is loop skewing. To illustrate this consider the code fragment below:

DO I = 1, 10
S1:   A(I) = B + C
S2:   D(I) = A(I) + 6
ENDDO

Note that the intra-loop flow dependence for variable A prevents the loop body from being executed in parallel. If this dependence is transformed into an inter-loop dependence, i.e. the dependence is carried between different iterations, statements S1 and S2 can then be executed in parallel. This is what loop skewing does. Loop skewing transforms the above code fragment into the form

DO I = 2, 11
  COBEGIN
S1:   A(I) = B + C
S2:   D(I-1) = A(I-1) + 6
  COEND
ENDDO

The flow dependence for variable A is now satisfied by the serial execution of the loop. The data dependence has been converted into the inter-loop version and the loop body can now be executed in parallel. Loop skewing is just one of a set of many possible loop restructurers. We shall describe some of the more important ones in a later section.

Software pipelining

Perhaps the most important spreading technique is software pipelining. This was originally derived from the concept of compaction scheduling suggested by Aiken and Nicolau [2]. It has been shown that, given a nested loop kernel with constant inter-loop dependencies, the statement execution schedule derived from software pipelining is optimal, i.e. the execution time is minimal. We first present the general software pipelining algorithm and then illustrate it with an example. Assume a nested loop program with n statements in the loop body, represented by the set S, and p inter-loop dependencies. In software pipelining the loop computation is first executed and each statement is scheduled in a greedy fashion, i.e. a statement is scheduled as soon as all its reference arguments have been defined. Aiken and Nicolau have shown that a repeating pattern can be detected when up to O(np) iterations are allowed to execute. Software pipelining then partitions the set S into sets of statements related by a dependence chain, denoted by Ci. Note that there exists Ccri ∈ S such that the statements of Ccri are on the critical path.


Figure 3.3: Dependence graph for code fragment 3.2

Assuming that length(Ci) and P(Ci) denote the length of Ci and the number of inter-loop dependencies in Ci respectively, the slope of Ci is defined by

    slope(Ci) = length(Ci)/P(Ci)   if P(Ci) ≠ 0
              = 0                  if P(Ci) = 0

The function slope(Ci) defines the rate at which the statements related to the dependence chain Ci are executed in the greedy schedule. Once a repeating behaviour is detected from the greedy schedule, it is readjusted such that, for all Ci ∈ S, slope(Ci) = slope(Ccri). For an example, consider the code fragment below:

DO I = 1, 10
S1:   B = C + 4
S2:   D(I) = B + D(I)
S3:   B(I) = A(I-1) * 5
S4:   A(I) = B(I) + 6
ENDDO
Code Fragment - 3.2

There are two dependence chains within the above code fragment. These are C1 = {S1, S2} and C2 = {S3, S4}, as illustrated in figure 3.3. Note that C2 describes the critical path and that slope(C1) = 0 and slope(C2) = 3. The greedy schedule for the first four iterations of the above code fragment is shown in table 3.4(a). By adjusting the schedule of the statements in C1 such that its rate of execution is the same as that for C2, the adjusted profile is as shown in table 3.4(b). The final program graph from the execution pattern in table 3.4(b) is shown in figure 3.5.

(a) Greedy schedule

Time    It. 1    It. 2    It. 3    It. 4
  1     S1 S3    S1       S1       S1
  2     S2 S4    S2       S2       S2
  3              S3
  4              S4
  5                       S3
  6                       S4
  7                                S3
  8                                S4

(b) After slope adjustment

Time    It. 1    It. 2    It. 3    It. 4
  1     S1 S3
  2     S2 S4
  3              S1 S3
  4              S2 S4
  5                       S1 S3
  6                       S2 S4
  7                                S1 S3
  8                                S2 S4

Figure 3.4: Execution profile: (a) from a greedy schedule and (b) after slope adjustment

Figure 3.5: Software pipelined program graph for code fragment 3.2
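The greedy phase of software pipelining can be illustrated with a short Python sketch (ours). It assumes unit execution times and only the dependencies of figure 3.3, and it reproduces the start times shown in figure 3.4(a).

def greedy_schedule(iterations):
    # S1 -> S2 within an iteration; S3 -> S4 within an iteration;
    # S4 -> S3 of the next iteration through the array A.
    time = {}
    for i in range(1, iterations + 1):
        time[('S1', i)] = 1                       # no incoming dependence
        time[('S2', i)] = time[('S1', i)] + 1     # uses B produced by S1
        prev = time.get(('S4', i - 1), 0)         # inter-loop dependence on A
        time[('S3', i)] = prev + 1
        time[('S4', i)] = time[('S3', i)] + 1
    return time

sched = greedy_schedule(4)
print(sched[('S3', 4)], sched[('S4', 4)])         # 7 8, as in figure 3.4(a)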

3.2 Index space transformation

The alternative way of parallelizing a nested loop kernel is by transforming the index space of the kernel directly. These transformations modify the index loop statements, changing the execution order of the iterations. The aim of these loop restructurers is to expose the parallelism in a nested loop kernel while preserving the semantics of the original program. In section 3.2.1 we describe some of the many loop restructurers that have been proposed and outline their rationale.

A question faced by a parallelizing compiler designer is: "which loop restructurers should be used and in what order should they be applied?" This is a fundamental problem which has important implications. At the moment, restructurers are applied in an ad hoc manner. However, a loop restructurer can sometimes hinder the application of another, and optimal parallelization can sometimes only be achieved by applying a subset of the available loop restructurers in a specific sequence. A more theoretically rigorous approach to loop restructuring, based on unimodular matrix transformations, has recently been emerging. Such an approach is slowly producing insights into the reasons why loop restructurers work the way they do and may lead us to a more unified approach to loop parallelization. Section 3.2.2 describes some of the more important results in this area.

3.2.1 Loop restructurers

We describe five important loop restructurers: loop distribution, loop coalescing, loop alignment, loop interchanging and loop reversal. We also provide examples of the use of each. All of the restructurers described in this section are included in the survey and tutorial papers [46, 7].

Loop distribution

Loop distribution is a technique used to isolate parallel sections of a nested loop kernel. Consider the code fragment shown below:

DO I = 1, 10
S1:   A(I) = A(I-1) * 3
S2:   B(I) = A(I) + 4
ENDDO

The loop in the above code fragment cannot be parallelized because of the inter-loop flow dependence seen in the tight recurrence in statement S1. We can however isolate the recurrence by distributing the statements over two loops as shown in the modified loop below:


DO I = 1, 10
S1:   A(I) = A(I-1) * 3
ENDDO
PARALLEL DO I = 1, 10
S2:   B(I) = A(I) + 4
ENDDO

In general loop distribution can be used to isolate parallelizable portions of a nested loop kernel. An important scenario is also demonstrated in the above code fragment. Some architectures have special hardware which supports the efficient solution of recurrences. In such cases loop distribution can isolate a tight recurrence from the general loop computation to allow such hardware to be employed. Loop distribution is not possible, however, when a dependence cycle exists between the statements in the loop body. Thus, the code fragment shown below cannot be distributed.

DO I = 1, 10
S1:   A(I) = A(I-1) * 3
S2:   B(I) = A(I+1) + 4
ENDDO

Loop coalescing

Our second loop restructuring transformation is loop coalescing. Loop coalescing is not used specifically to parallelize a loop. It is often used to increase the iteration length so that the parallelized loop can be executed more efficiently. In most architectures the start-up cost for a parallel loop is very high, in which case it becomes desirable to avoid short parallel loops. For example, consider the code fragment below:


PARALLEL DO I = 1, 1000
  PARALLEL DO J = 1, 2
S1:     A(I,J) = B(I,J) * 4
  ENDDO
ENDDO

This code fragment will require the individual start-up of one thousand parallel loops, each performing just two iterations. This is clearly very inefficient. Coalescing the I and J loops into a single IJ loop, as shown in the code fragment below, results in only one start-up cost being incurred.

PARALLEL DO IJ = 1, 1000*2
S1:   A( f1(IJ), f2(IJ) ) = B( f1(IJ), f2(IJ) ) * 4
ENDDO

Note that loop coalescing will require the generation of the functions f1 and f2 which are needed to recover the original loop index terms. In the case above, with IJ enumerating the iterations in their original execution order, f1(IJ) = ⌈IJ/2⌉ and f2(IJ) = ((IJ-1) mod 2) + 1.
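A quick way to convince oneself that such recovery functions are correct is to enumerate both loops. The Python sketch below is ours and assumes the row-major linearisation used above; it checks that the coalesced loop visits exactly the iterations of the original loop nest.

import math

def f1(ij): return math.ceil(ij / 2)          # recovers I
def f2(ij): return ((ij - 1) % 2) + 1         # recovers J

original  = [(i, j) for i in range(1, 1001) for j in range(1, 3)]
coalesced = [(f1(ij), f2(ij)) for ij in range(1, 2001)]
assert original == coalesced                  # same iterations, same order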

Loop alignment

The next loop restructuring transformation is loop alignment. Loop alignment has exactly the opposite effect to that of loop skewing introduced in the previous section. Loop alignment converts an inter-loop dependence into an intra-loop one. Consider the code fragment shown below:

DO I = 1, 10
S1:   A(I) = B(I) * 5
S2:   C(I) = A(I-1) + C
ENDDO

The code fragment cannot be parallelized because of the existence of the inter-loop flow dependence for variable A. Loop alignment adjusts the index statements and the subscript expressions such that the inter-loop dependence is converted into an intra-loop one. The modified code fragment is shown below:

PARALLEL DO I = 2, 11
S1:   A(I-1) = B(I-1) * 5
S2:   C(I) = A(I-1) + C
ENDDO

Loop interchanging and loop reversal

We introduce the ideas of loop interchanging and loop reversal together. Loop interchanging, as the name suggests, switches the nesting levels of a double nested loop kernel. Loop reversal reverses the execution order of the loop iterations. Loop interchanging is useful in the situation illustrated by the code fragment below:

DO I = 1, 1000
  DO J = 1, 2
S1:     A(I,J) = A(I-1,J+1) * B(I)
  ENDDO
ENDDO
Code Fragment - 3.3

The index space for the code fragment is shown in figure 3.6. Note that it is pointless to parallelize the inner J loop because the start-up cost of initiating one thousand parallel loops of two iterations each would quickly overwhelm any benefits gained from the extracted parallelism. Ideally we would like the I loop to be parallelized. If we could interchange the loops such that the I loop is the inner loop, we could then parallelize the inner loop and incur only a minimal start-up cost. The loops cannot be interchanged, however, because the inter-loop dependence is a backward dependence with respect to the J dimension; the new interchanged execution order would not satisfy the inter-loop dependence. By reversing the J loop, however, the loops can now be interchanged.

Figure 3.6: Data dependence DAG for code fragment 3.3: (a) before reversal and (b) after reversal

In the transformed code fragment shown below, the inner I loop can now be parallelized:

DO J = 2, 1, -1
  PARALLEL DO I = 1, 1000
S1:     A(I,J) = A(I-1,J+1) * B(I)
  ENDDO
ENDDO

Loop interchanging is a special case of a more general transformation called loop permutation. For code fragments with more than two nesting levels, finding a useful loop permutation is very difficult. In the next section we introduce a formal notation for the transformations we have presented. We hope that by formalising the currently ad hoc manner in which we derive these parallelizing transformations we can better understand the loop parallelization process.

3.2.2 Matrix transformations

An index space transformation may be viewed as the two mapping functions Π : Z^n → Z^t and J : Z^n → Z^s. We define n to be the number of nesting levels in the original loop kernel and t, s ≤ n. The transformations Π and J are denoted the temporal and spatial mappings respectively. Thus, given an iteration Ii = (l1, ..., ln), J(Ii) defines the processor coordinates to which the iteration is assigned and Π(Ii) defines the index coordinates at which the iteration is scheduled.

The hyperplane method

In his seminal paper [36], Lamport defines J to be a one-to-one mapping expressed as a unimodular matrix such that J(I) = I'. The matrix J is unimodular, i.e. det(J) = ±1, such that the inverse transformation J^-1 maps I' back to the original iteration point I. Lamport also defined the temporal mapping Π. This mapping concurrently executes the body of the loop computation for all iterations Ii ∈ J lying in the set {Ii : Ii ∈ J, Π(Ii) = c ∈ Z^k}. This set defines a collection of (n − k)-dimensional planes in Z^n. In effect, Π transforms a set of n nested serial loops into a kernel with k serial loops and one parallel inner loop, as shown below:

DO I1 = L1, U1                          DO I1 = L1', U1'
  ...                                     ...
  DO In = Ln, Un              ⟹           DO Ik = Lk', Uk'
    body                                    PARALLEL DO Ik+1 = Lk+1', Uk+1'
                                              body

For k = 1, we are in effect generating parallel time hyperplanes, for which the temporal mapping is defined as Π : Z^n → Z. This is the hyperplane method as described by Lamport [36], which is also sometimes referred to as the wavefront method. Lamport pointed out that the sufficient condition for a correct temporal mapping is that it preserves the sense of the intrinsic data dependencies when applied to the nested loop computation. This ensures that in an algorithm where iterations Ii and Ik are data dependent and Ii occurs before Ik, Π(Ii) must then also be scheduled before Π(Ik) in the transformed algorithm. Since Π is a linear mapping, a correct linear function Π must satisfy Π(di) > 0 for all data dependence vectors di extracted from the kernel. As an example, consider the two nested loop kernel in the code fragment shown below:


Figure 3.7: Execution order for code fragment 3.4 using the hyperplane method.

DO I = 1, N1
  DO J = 1, N2
    A(I,J) = A(I-1,J) + A(I,J-1)
  ENDDO
ENDDO
Code Fragment - 3.4

The data dependence DAG for the computation in the above code fragment is illustrated in figure 3.7. The dependence distance vectors are d1 = (1,0) and d2 = (0,1). A possible linear schedule, defined by the temporal mapping Π : K·I + J = c, is the set of dashed lines intersecting the index space in figure 3.7. The linear schedule illustrated is for c ∈ [t, t + 4]. All iterations lying on the same line may be scheduled to execute concurrently since Π(di) > 0. A uniform recurrence equation is defined to be a computation where the derivation of a function p at some instance identified by x depends on the result of some instance x − d, where d is some constant term. It has been shown that these uniform recurrence equations may be optimally scheduled using linear schedules [60]. Assuming Π = (π1, π2, ..., πn), Lamport identified the problem of minimising the parallel execution time of the transformed algorithm as an integer programming problem requiring the minimisation of the expression

    M1·π1 + M2·π2 + ··· + Mn·πn                                         (3.3)

where Mi = Ui − Li + 1. Shang and Fortes [60] re-expressed the problem as the minimisation of the expression

    f = max( Π(Ii − Ik) : Ii, Ik ∈ J ) / min( Π(di) : di ∈ Δ )          (3.4)

subject to Π(di) > 0, with Δ denoting the set of data dependence vectors extracted from the loop kernel. The procedure which they propose to minimise equation (3.4) is shown to be bounded by time complexity O(2^4n · n^3), where n denotes the depth of the nested loop kernel. They also highlight a method for deriving an optimal linear schedule using linear programming, which has a time complexity of O(2^n · 2^m · 2^2n · m · n^3), where m is the number of dependence vectors extracted from the nested loop program.
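For small kernels a valid linear schedule can even be found by exhaustive search. The Python sketch below is ours (the bounds and the search range are illustrative); it selects a Π satisfying Π(di) > 0 for the dependence vectors of code fragment 3.4 and minimising expression (3.3).

from itertools import product

deps = [(1, 0), (0, 1)]                 # d1 and d2 of code fragment 3.4
M = (100, 100)                          # loop extents M1 = N1, M2 = N2

def valid(pi):
    return all(sum(p * d for p, d in zip(pi, dep)) > 0 for dep in deps)

candidates = [pi for pi in product(range(-3, 4), repeat=2) if valid(pi)]
best = min(candidates, key=lambda pi: sum(m * p for m, p in zip(M, pi)))
print(best)                             # (1, 1): the hyperplane I + J = c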

Unimodular matrix transformations

Many loop transformations may be described by selecting an appropriate unimodular matrix J. Utpal Banerjee has expounded a theory of loop permutation, based on unimodular matrices, in [6]. A permutation of loop nest (x1, ..., xn) to loop nest (xJ1, ..., xJn) may be described by an n × n identity matrix in which the diagonal one in the ith row is permuted to the Ji-th column. If we take code fragment 3.3 as an example, the permutation (I,J) → (J,I) can be described by the matrix

    ( 0  1 )
    ( 1  0 )

Banerjee has shown that a loop permutation which preserves the sense of the original dependence structure is a sufficient condition for a valid transformation. In other words, if there exists di which is a forward dependence, then J(di) must also be a forward dependence. An interesting lemma, derived by Banerjee, is that there are no valid loop permutations for a double nested loop with dependence distance vector (1,-1). This is in fact the dependence distance vector we obtain from code fragment 3.3, and it explains why the loop interchange transformation could not be applied directly.


Wolfe and Lam [68] have investigated unimodular matrix transformations which combine the loop permutation, skewing and reversal transformations. They show that the reversal of the ith loop may be described by an n × n identity matrix with the ith diagonal element equal to -1. They also show how loop skewing can be described by a lower triangular matrix. Wolfe and Lam have derived an efficient procedure to determine an optimal permute-skew-reverse transformation matrix. As was similarly shown in Banerjee's work on loop permutation, Wolfe and Lam showed that the sufficient condition for a valid permute-skew-reverse transform is one where the sense of the dependence structure of the original loop kernel is preserved. If we take code fragment 3.3 as our example again, the combined reverse-permute matrix transformation is represented by

    ( -1  0 ) ( 0  1 )   ( 0  -1 )
    (  0  1 ) ( 1  0 ) = ( 1   0 )

The transformation of the dependence distance vector for code fragment 3.3 is therefore given by

    ( 0  -1 ) (  1 )   ( 1 )
    ( 1   0 ) ( -1 ) = ( 1 )

Since the sense of the dependence distance vector is preserved, we know that the reversal of the inner loop followed by the interchange of the two loops is a valid transformation.

Another parallelization scheme which may be represented by a unimodular matrix transformation is partitioning. In partitioning, we attempt to derive independent chains of iterations such that the loop kernel may be re-expressed as a set of outer parallel and inner serial loops. A two-level nested partitioned loop is shown in the code fragment below:

PARALLEL DO I = 1, 10
  DO J = 1, 5
    A(I,J) = A(I,J-1) * B(I,J)
  ENDDO
ENDDO


The above code fragment executes ten independent chains of five iterations each. This is only possible because the flow dependence for the array variable A does not cross the I loop. D'Hollander [14], and Peir and Cytron [48], have defined partitioning as the problem of finding the independent execution sets

    Xi = { (i1, ..., in) :  i1 = i0i,1 + ai,1·δ1,1 + ··· + ai,m·δm,1,
                            ...
                            in = i0i,n + ai,1·δ1,n + ··· + ai,m·δm,n }

with (i0i,1, ..., i0i,n) and (ai,1, ..., ai,m) belonging to the set of initial iteration points and integer vectors respectively. Also, where Δ is the set of data dependence vectors and m is the number of vectors in the set Δ, we assume the notation di = (δi,1, ..., δi,n) ∈ Δ. Each execution set is independent of the execution of any of the other execution sets and each is a subset of the index space J. From the above representation of the partitioning problem, we can deduce two steps for its solution:

 • Determine the set of initial points (i0i,1, ..., i0i,n).
 • Solve the system of diophantine equations specified above.

Where this can be done successfully, the execution sets may then be executed in parallel. Peir and D'Hollander have proposed a method whereby the set of dependence distance vectors Δ, expressed as an n × m matrix D, is factored by a unimodular matrix U into an upper triangular matrix DT. The strides for the transformed loops may then be determined from the diagonal elements. The set of initial points may also be determined by a back substitution procedure on matrix DT. A detailed description of the above partitioning algorithm can be found in the relevant papers.
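The validity condition used throughout this subsection, namely that a transformation must map every distance vector to a lexicographically positive vector, is easy to check mechanically. The following Python sketch is ours; it applies the combined reverse-interchange matrix derived above to the distance vector of code fragment 3.3.

def lex_positive(v):
    for x in v:
        if x != 0:
            return x > 0
    return False

def apply(T, d):
    return tuple(sum(T[r][c] * d[c] for c in range(len(d))) for r in range(len(T)))

T = ((0, -1),    # combined reverse-interchange matrix from the text above
     (1,  0))
d = (1, -1)      # distance vector of code fragment 3.3

print(apply(T, d), lex_positive(apply(T, d)))   # (1, 1) True -> valid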

3.3 Mapping loop iterations onto parallel processors

After deriving a set of parallel loops, a parallelizing compiler then maps the iterations in the derived parallel form onto the processors in the parallel architecture. Much of the literature concerning the mapping of nested loop kernels onto parallel architectures has concentrated on the mapping problem for SMAs. We will first summarise the current research results for SMAs and then outline results for other architectures.

Mapping iterations onto a parallel computer can be done either at compile time (i.e. static mapping) or at run time (i.e. dynamic mapping). Dynamic mapping schemes often incur a large run-time overhead. The method is, however, preferred for loop body computations with variable execution times, i.e. where there are conditional branches with very different execution times.

3.3.1 Static allocation schemes

Given a parallel loop with the set of iterations {I1, I2, ..., Ik} and a set of processors {p1, p2, ..., pm}, two very simple static mapping schemes may be used:

1. assign iteration Il to processor (l mod m) + 1, or
2. assign iteration Il to processor (l div m) + 1

where l ∈ [1, k]. If k mod m ≠ 0 and the loop body is large, like an inner serial loop, the simple scheme will result in a very poor load balance. A modified scheme, known as loop spreading, has been proposed by Wu [70]. For an n-nested loop kernel, assume the set of statements {S1(I), S2(I), ..., Sk(I)} with k ∈ Z and I denoting the iteration at which the statement is evoked. Loop spreading will allocate all instances of statement S1 first, then instances of statement S2 and so on. The allocation is done such that the statements Sl(I1) and Sl(I2) are allocated to pj and pj+1 respectively, with I1 < I2. Hence, for the simple loop kernel shown in code fragment 3.5 below, allocating four processors to the iterations in the loop computation results in the assignment shown in figure 3.8.

DO I = 1, 6
  S1
  S2
ENDDO
Code Fragment - 3.5

P1      P2      P3      P4
S1(1)   S1(2)   S1(3)   S1(4)
S1(5)   S1(6)   S2(1)   S2(2)
S2(3)   S2(4)   S2(5)   S2(6)

Figure 3.8: Allocation via loop spreading for code fragment 3.5

When nested parallel loops are encountered, coalescing will result in a better allocation. However, this may sometimes be deemed too expensive because of the need to generate the functions which recover the original loop indices. Polychronopolus [51] and Wang [66] have studied the allocation problem for the case of nested parallel loops. Consider the double nested parallel loops below:

PARALLEL DO I = 1, 3
  PARALLEL DO J = 1, 5
    { Body }
  ENDDO
ENDDO

When allocating N processing elements to this double nested loop, we could assign P processing elements to the inner loop, leaving ⌈N/P⌉ clusters to be allocated to the outer parallel loop. In other words, only ⌈N/P⌉ instances of the outer loop can be executed in parallel, with P processing elements executing the iterations in the inner parallel loop. Assuming N = 4, there are three possible allocation schemes for the above code fragment: 1) three processors for the inner loop and one cluster for the outer loop, 2) one processor for the inner loop and three clusters for the outer loop, or 3) two processors for the inner loop and two clusters for the outer loop. If we assume Pi to be the number of clusters allocated to the ith loop, the allocation problem for an N-processor parallel computer and an n-nested loop kernel is defined to be that of deriving the set {P1, ..., Pn} such that the execution time is minimised and P1 × ··· × Pn = N. Polychronopolus [51] proposed a dynamic programming algorithm, called OPTAL, which solves the above optimisation problem, and Wang [66] improved on the basic algorithm to derive an algorithm with a better time complexity.


Figure 3.9: Linear projection for allocating iterations onto a linear array

For the case of nested parallel loops with generalised linear triangular bounds, O'Boyle and Hedayat [44] have shown how a perfectly load-balanced mapping onto a parallel architecture can be achieved. Their method describes an invariance condition via which a perfect load balance can be achieved. The invariance condition defines a situation where there exists a nesting level i with bounds which do not reference index variables at another level, and all other levels do not have bounds referencing the index variable associated with the invariant level, xi. Assuming the upper and lower bounds of an n-nested loop kernel are expressed by the n × n matrices U and L respectively, O'Boyle and Hedayat describe how a lower triangular unimodular transformation matrix T can be derived such that

    T · U = Ub
    T · L = Lb

where Ub and Lb satisfy the invariance condition. The nesting level in Ub and Lb which satisfies this invariance condition can then be permuted to the outermost loop, allowing it to be partitioned equitably onto a parallel computer architecture.


Systolic array designers allocate iterations in an n-nested loop kernel by projecting the n-dimensional index space onto a k-dimensional processor space. The idea is illustrated for code fragment 3.4 by figure 3.9. Here a one-dimensional projection, I = c, is used to allocate iterations in the computation onto processing elements in a linear array structure. Necessary and sufficient conditions for the correct mapping of iterations onto a regular multi-dimensional array structure have been reported by Lee and Kedem [37]. An interesting extension to such a projection approach to scheduling is reported in the work by King and Ni [28, 29]. They have extended the projection method to scheduling considerations in a pipelined multicomputer. Their technique involves grouping linear blocks of iterations onto the available processors in a pipelined multicomputer. A condition for correct allocation is that the contracted structure resulting from a grouping must never be cyclic. They present some results whereby groupings can be performed which will not result in cyclic contracted structures.

Much of the research into the mapping of nested loop kernels onto DMAs has concentrated on the mapping of the PARALLEL DO construct [30, 55]. The general strategy is to generate a communication-cost-efficient data partitioning and derive the task partitioning using the owner-computes policy [25]. Important work by Li et al. [39] and Gupta et al. [24] describes techniques for generating communication-efficient message passing programs from nested loop kernels. Li and Chan [39] describe a technique which recognises commonly occurring communication patterns so that they can be mapped onto communication-efficient functions present in the parallel computer. Their technique then decides how the array data accessed in the nested loop computation should be partitioned over the parallel computer. Their sole criterion for the mapping is the minimisation of communication costs. Gupta and Banerjee [24] use the same cost metric, the minimisation of the communication cost, for deciding on a mapping strategy. Their basic technique decides on a partitioning strategy for the array variables accessed in the nested loop computation based on a set of communication cost constraints.
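As a simple illustration of the projection idea (ours, with small illustrative bounds), the iterations of code fragment 3.4 can be assigned to a linear array by using the I coordinate as the processing element index, exactly as in figure 3.9.

N1, N2 = 4, 3                                        # small illustrative bounds
allocation = {}
for i in range(1, N1 + 1):
    for j in range(1, N2 + 1):
        allocation.setdefault(i, []).append((i, j))  # projection I = c: PE index = i

for pe, iters in allocation.items():
    print('PE', pe, iters)                           # each PE receives one column of J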


3.3.2 Dynamic allocation schemes

Dynamic allocation schemes may also be used instead of the static schemes described in the previous section. Polychronopolus has suggested Guided Self Scheduling (GSS) [52]. The GSS scheme assumes implicit coalescing, where all parallel nested loops are coalesced giving a linear parallel loop with N iterations. Assuming P processors in a parallel computer, GSS allocates a block of iterations, called a chunk, to the set of processors {p1, p2, ..., pP} as they become available. An optimal schedule is one which guarantees that all P processors terminate within B time units of each other, where B is the amount of time required to execute one iteration. In the GSS scheme, the function used to derive the chunk size for allocation to an available processor pi is given by

    xi = ⌈Ri / P⌉,    Ri+1 = Ri − xi

where the range of iterations assigned to pi is [N − Ri + 1, ..., N − Ri + xi], with R1 = N. An improvement to GSS has been suggested by Tzen and Ni [64]. Their method is called Trapezoidal Self-Scheduling (TSS). It uses a simpler chunk function than GSS. Since each processor has to perform some computation to determine the chunk size to be allocated to it during the scheduling phase, a simpler chunk size function decreases the overheads incurred. TSS uses a decreasing linear chunk size function which introduces a very small run-time overhead. They have performed experiments which show speedup improvements when TSS is used instead of GSS.

In the case where loop coalescing is not possible, as in some complex nested loop structures, a two-level scheduling algorithm has been proposed by Fang, Yew and Zhu [15]. In their system the high-level scheduler manages the precedence relationships in a task pool of PARALLEL DO loops while the low-level scheduler allocates iterations to processors. They describe the implementation of a parallel linked-list structure to store the ready queue of parallel loops. The high-level scheduler selects a parallel loop from this pool when the previous parallel loop has completed execution. For the low-level schedule, the authors suggest using GSS to schedule the iterations in the parallel loop that is being executed.
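The GSS recurrence is easily simulated. The Python sketch below is ours (N and P are illustrative); it lists the iteration ranges handed out to successive processor requests.

import math

def gss_chunks(N, P):
    # x_i = ceil(R_i / P), R_{i+1} = R_i - x_i, with R_1 = N.
    R = N
    while R > 0:
        x = math.ceil(R / P)
        lo, hi = N - R + 1, N - R + x
        yield (lo, hi)
        R -= x

print(list(gss_chunks(20, 4)))
# [(1, 5), (6, 9), (10, 12), (13, 14), (15, 16), (17, 17), (18, 18), (19, 19), (20, 20)]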


3.4 Evaluating parallelizing compilers

In a recent paper by Blume and Eigenmann [8] on the performance analysis of current state-of-the-art parallelizing compilers, the authors claim,

    "Our most important findings are that available restructurers often cause insignificant performance gains in real programs and that only few restructuring techniques contribute to this gain."

Their claim is significant, augmenting similar results which have been obtained in other studies. Nobayashi and Eoyang [43] studied the performance improvement of automatically generated vectorised kernels over hand-vectorised versions of the same kernels executing on three vector machines: the Cray X-MP, Fujitsu VP and NEC SX. They show that fewer than half of the automatically vectorised kernels actually reach 70% of the performance of the hand-vectorised kernels. Cheng and Pace [10] compared the speedups of automatically parallelized programs using KAP over hand-optimised versions of the programs executing on a single-processor Cray. Their test suite consisted of 25 different programs including the Perfect benchmark. The results they obtain show that less than 10% of the parallelized programs actually show any performance improvement over that of the single-processor Cray version. Blume and Eigenmann [8] also compared speedups obtained from the automatically parallelized version of the Perfect benchmark with the unparallelized version when executed on an Alliant FX/80 machine. They also conclude insignificant performance gains. Petersen and Padua [47] have instrumented automatically parallelized versions of scientific and engineering programs and compared their speedups on an ideal parallel machine with those of their ideally parallelized versions. They conclude that much parallelism remains undetected by modern parallelizing compilers.

We know that the problem cannot be due to a lack of inherent parallelism in scientific and engineering FORTRAN programs. The work by Kumar [32] and by Petersen and Padua [47] shows that there is a significant amount of parallelism in such programs. Blume and Eigenmann [8], Singh and Hennessy [62], and Petersen and Padua [47] offer suggestions as to areas which can be improved.

It now seems that dependence-breaking syntax transformations make the most significant contribution to speedups, and not the loop restructuring transformations described in the earlier sections. Some of these dependence-breaking syntax transformations include variable expansion, array privatization, and reduction removal. We describe each in turn in the following section.

3.4.1 Dependence breaking syntax transformations

An example of the variable expansion and array privatization transformations is illustrated using the code fragment below:

DO I = 1, 10
  DO J = 1, 10
    A( J ) = B * 5
    C( I,J ) = C( I,J ) + A( J )
  ENDDO
ENDDO

The variable A is clearly being used as a temporary variable. Even though no value is being carried across the iterations, the outer I loop is prevented from being parallelized. Array privatization can be used to break the false dependence, allowing parallelization as shown below:

PARALLEL DO I = 1, 10
  PRIVATE VAR A(10)
  DO J = 1, 10
    A( J ) = B * 5
    C( I,J ) = C( I,J ) + A( J )
  ENDDO
ENDDO

Similarly, variable expansion could also be used to achieve the same effect. The variable expansion of array A is shown below:

PARALLEL DO I = 1, 10
  DO J = 1, 10
    A(I,J) = B * 5
    C( I,J ) = C( I,J ) + A( I,J )
  ENDDO
ENDDO

Identify data independent tasks  -->  Derive parallel form  -->  Map onto architecture

Figure 3.10: Parallelization sequence for nested loop kernels

A loop reduction operator, where some global property is gleaned from an array structure, is a common occurrence in scientific and engineering programs. An example of a reduction operator is shown in the code fragment below:

DO I = 1, MAX
  SUM = SUM + A( I )
ENDDO

Loop reduction operators are intrinsically serial, and special algorithmic substitutions are often the only way to improve their execution time on a parallel computer. Identifying such reduction structures and replacing them with their parallel versions can substantially improve performance, as has been noted in [8]. Other common loop reduction operators include finding the max/min of a list, matrix pivoting, etc.
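As an illustration of such a substitution (ours, not a transformation taken from the thesis), a summation reduction can be replaced by a pairwise tree reduction whose levels consist of independent additions, giving O(log n) parallel steps instead of n − 1 serial ones.

def tree_sum(values):
    vals = list(values)
    while len(vals) > 1:
        pairs = zip(vals[0::2], vals[1::2])
        reduced = [a + b for a, b in pairs]   # additions at one level are independent
        if len(vals) % 2:                     # carry an odd trailing element over
            reduced.append(vals[-1])
        vals = reduced
    return vals[0]

A = list(range(1, 11))
assert tree_sum(A) == sum(A)                  # 55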

3.5 Concluding remarks

All the parallelization schemes presented in this chapter may be summarised by the sequence shown in figure 3.10. Except for the mapping phase, the identification of data independent tasks and the derivation of their parallel forms is predominantly a compile-time activity.

A method which adopts a different approach is the DO ACROSS loop. The DO ACROSS is a much more powerful construct than the PARALLEL DO as it enables the parallelization of loops with inter-loop dependencies. The DO ACROSS was first described by Cytron [13]. It derives parallelism using a different approach from all the other methods described. Rather than determining the available parallel tasks at compile time, the DO ACROSS inserts synchronisation statements into the loop computation to resolve data dependencies at run time: statements are executed immediately upon having their data dependencies satisfied. Synchronisation primitives such as post(), which signals the completion of an iteration, and wait(), which suspends an iteration, are often used [42]. An example of the DO ACROSS is illustrated in the code fragment shown below:

DO ACROSS i = 1, 1000
      wait(i)
S1:   A(i) = ...
S2:   ... = A(i-2) ...
      post(i-2)
ENDDO

The post() and wait() synchronisation primitives enforce the data dependence between the occurrences of the array variable A in the loop computation. The wait(i) suspends the execution of statement S1 at iteration i until the completion of the source of its data dependence, S2 at iteration i-2, has been signalled by post(i-2). The DO ACROSS loop has received much less attention than the PARALLEL DO loop, primarily due to the perceived complexity involved in generating these synchronising statements. The method is very important, however, for its flexibility and for the much larger class of nested loop kernels to which it can be applied. Chapters 6 and 7 concentrate on deriving methods for implementing the DO ACROSS on SMAs and DMAs respectively. We describe strategies whereby the DO ACROSS can be derived for nested loop computations with non-constant data dependence patterns using information derived from the data dependence vectors generated by the techniques outlined in chapter 4.
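One possible realisation of this run-time synchronisation is sketched below in Python, using threads and per-iteration events (the sketch is ours; it enforces the same distance-two flow dependence on A but does not claim to reproduce the exact post/wait indexing of the fragment above).

import threading

N = 12
A = [0] * (N + 1)
posted = [threading.Event() for _ in range(N + 1)]

def iteration(i):
    A[i] = i * i                    # S1: define A(i)
    posted[i].set()                 # post: A(i) is now available
    if i - 2 >= 1:
        posted[i - 2].wait()        # wait for iteration i-2 before the use in S2
        _ = A[i - 2]                # S2: ... = A(i-2)

threads = [threading.Thread(target=iteration, args=(i,)) for i in range(1, N + 1)]
for t in threads: t.start()
for t in threads: t.join()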

Chapter 4

Data dependence distance vectors

This chapter presents a method for determining the dependence distance vectors of def-ref array variable pairs involved in a nested loop computation. Most importantly, our method is able to deal with array variables possessing general linear subscript expressions which are coupled; a large class of subscript expressions which were previously not handled efficiently. Coupled subscripts occur when two or more index variables occur simultaneously in an array subscript expression. In an empirical study of scientific and engineering programs, Shen, Li and Yew [61] claim that over 44% of all array variables in scientific programs have coupled subscripts. Being able to deal efficiently with coupled subscripts is therefore important.

Section 4.1 explains the basic concepts and definitions used in the rest of this chapter. Section 4.2 presents some examples illustrating our proposed method for deriving the data dependence distance vectors. The aim of this section is to provide a step-by-step overview of the analysis procedure before a fuller exposition is embarked upon in the later sections. Sections 4.3, 4.4 and 4.5 develop our procedure for determining the distance vectors for three special classes of data dependencies: flow, anti and output. The three classes of data dependencies completely characterise the dependencies resulting from "carried" array values in a nested loop kernel. A fuller definition of the three classes of data dependencies is presented in section 4.1.


4.1 Basic concepts and definitions

Let Z, Z+ and Z^n denote the set of integers, positive integers and n-tuples of integers respectively. Also let R and R^n denote the set of reals and n-tuples of reals respectively. We are concerned with the parallelization of loop kernels of the form shown below:

DO x1 = L1, U1, st1
  ...
  DO xn = Ln, Un, stn
    A(f^gen(X)) = ...
    ...          = A(f^use(X))
  ENDDO
  ...
ENDDO

We number the nesting levels 1 to n, from the outer level to the inner. The index variable at nesting level i is denoted by xi and the set of all index variables in the nested loop kernel is denoted X = {x1, x2, ..., xn}. The lower and upper bounds at nesting level i are denoted by Li and Ui respectively, and we note that they need not be constant integers. For nesting level i, the bounds may be expressed by some linear expression involving the set of index variables {x1, x2, ..., xi-1}. The stride sti, at nesting level i, denotes the increment steps that the index variable xi takes during the execution of the nested loop kernel.

4.1.1 Iterations and subscript expressions

We let li denote a particular instance of the index variable xi and I = (l1, ..., ln) an iteration. The expressions f^gen(X) and f^use(X) are the subscript expressions of a def-ref array variable pair, which are restricted to linear functionals in terms of X. That is, in the general case where an m-dimensional array has been declared, the subscript expressions are defined to be

    f_l^gen(X) = Al,1·x1 + Al,2·x2 + ··· + Al,n·xn
    f_l^use(X) = Bl,1·x1 + Bl,2·x2 + ··· + Bl,n·xn


with 1 ≤ l ≤ m and Al,j, Bl,j ∈ Z. Note that, in general, m need not be equal to n. The terms Ai,j and Bi,j are also known as the coefficients of f_i^gen(X) and f_i^use(X), and A*,j and B*,j are known as the bounding coefficients for the index variable xj. The expressions f_i^gen(X) and f_i^use(X) denote the subscript expression in the ith field of the m subscript dimensions of the definition and reference array variables respectively. To illustrate this, the def-ref pair for an m-dimensional array variable in the above loop kernel is defined as

    A( f_1^gen(X), ..., f_m^gen(X) ) = ...
    ... = A( f_1^use(X), ..., f_m^use(X) )

We will on occasion use a function coeff(xj, expr) which returns the bounding coefficient for index variable xj, where expr is any linear expression in terms of X. In particular, coeff(xj, f_i^gen) = Ai,j and coeff(xj, f_i^use) = Bi,j. An array variable has coupled subscripts if there exists some index variable xi ∈ X which appears in two or more subscript expression fields. For a referenced array variable, this translates to the predicate

    ∃ j, k ∈ [1, m] such that j ≠ k ∧ Bj,i ≠ 0 ∧ Bk,i ≠ 0

For example, the array variable A(x+y, x) is said to have coupled subscripts.
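The predicate above translates directly into a test on the coefficient matrix of a referenced array variable. A small Python sketch follows (ours; the list-of-lists representation of B is illustrative).

def has_coupled_subscripts(B):
    # B[f][i] is the coefficient of index variable x_i in subscript field f.
    n = len(B[0])
    for i in range(n):
        if sum(1 for row in B if row[i] != 0) >= 2:
            return True
    return False

# A(x+y, x): subscript fields (x + y, x) over index variables (x, y)
B = [[1, 1],      # field 1: x + y
     [1, 0]]      # field 2: x
print(has_coupled_subscripts(B))   # True, since x appears in both fields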

4.1.2 Index spaces

Assuming an n-nested loop kernel with index variables described by the set X = {x1, x2, ..., xn}, we introduce the following definitions.

Definition 4.1  We define J to be the index space, which is the set of all lattice points enclosed within a vector subspace in n-dimensional integer euclidean space, i.e. J ⊆ Z^n.

The n-dimensional integer lattice space J defines a Z^n vector subspace region which the iterations in the nested loop kernel span.

Example 4.1: For the nested loop kernel defined in code fragment 4.1 below, the region spanned by the iterations is shown by the shaded area in figure 4.1.


Figure 4.1: Convex hull subspace region J for code fragment 4.1

The shaded region, which is a convex hull, defines the index space J for the nested loop kernel concerned.

DO I = 1, 10, 1
  DO J = I, I+10, 1
    A(J+1) = A(J) + B
  ENDDO
ENDDO
Code Fragment - 4.1

□

Geometrically we can think of J as describing a convex hull which satisfies the system of inequalities described in (4.1) below. The inequalities describe the domain from which the values for the index variables may be derived and are constrained by the relevant loop bounds as defined below:

    L1 ≤ x1 ≤ U1
    L2 ≤ x2 ≤ U2
         ⋮                                                              (4.1)
    Ln ≤ xn ≤ Un

We denote a feasible iteration I = (l1, l2, ..., ln) if the iteration is a member of J, and an infeasible iteration Î = (l1', l2', ..., ln') if it is not a member of J. An order relation is defined on feasible iterations lexicographically, e.g. where J is some vector subspace in Z^3 and the two feasible iterations (1,0,2), (0,3,5) ∈ J, then (1,0,2) > (0,3,5). All algebraic manipulations involving the inequality symbols will assume the lexical meaning illustrated above. Furthermore, for the equality operator, two n-tuples are equal if and only if all coordinates in the two n-tuples are equal. We assume a vector is defined to be positive if its left-most nonzero coordinate is positive. A vector is defined to be strictly positive if every nonzero coordinate is a positive integer. Thus, (1,2,3) = (1,2,3), (0,1,-4) is positive and (1,2,3) is strictly positive. For our convenience we shall also define a function ⌊S⌋_Ik which returns the subset of feasible iterations S' ⊆ S such that ∀ Ii ∈ S', Ii < Ik.
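The lexicographic conventions above can be stated operationally. A short Python sketch (ours) of the positivity tests:

def is_positive(v):
    # positive: the left-most non-zero coordinate is positive
    for x in v:
        if x != 0:
            return x > 0
    return False

def is_strictly_positive(v):
    # strictly positive: every non-zero coordinate is positive
    return all(x > 0 for x in v if x != 0)

print(is_positive((0, 1, -4)))            # True
print(is_strictly_positive((0, 1, -4)))   # False
print(is_strictly_positive((1, 2, 3)))    # True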

Definition 4.2  We define X to be the set of variables, both scalar and array, which are either defined or referenced at some point in the index space J.

The set of variables defined or referenced in a loop body computation is represented by the set X. For example, in code fragment 4.1 above, X = { A(1), ..., A(21), B }. Where the variable vi, which is a member of X (i.e. vi ∈ X), is defined or referenced at some iteration Ii ∈ J, we denote its occurrence notationally by vi^gen(Ii) and vi^use(Ii) respectively.

4.1.3 Related definitions

The complete data dependence constraints for a nested loop kernel are expressed in the set Δ.

Definition 4.3  The set of data dependence distance vectors extracted from an n-nested loop kernel, denoted by the term Δ, is defined to be the set {di : di ∈ Z^n, i ∈ [1, k]}, where di denotes a data dependence distance vector for a def-ref variable pair involved in the nested loop computation.

To illustrate a data dependence distance vector for a 2-dimensional loop computation, consider the scenario where the variable vi ∈ X is defined at iteration (1, 1) and then later referenced at iteration (2, 3). The definition and the later reference of the variable vi describe a data dependence whose distance vector is the 2-tuple (1, 2). Thus, if v^gen(Ii) ≡ v^use(Ik), then di = Ik − Ii.


Definition 4.4  A data dependence distance vector di is defined to be the n-tuple (δi,1(X), ..., δi,n(X)), where δi,k(X) is a linear functional in terms of X and, for all Ii, Ik ∈ J such that v^gen(Ii) ≡ v^use(Ik),

    Ii + di(Ii) = Ik

Thus for the set Δ in definition 4.3, di(Ii) ∈ Δ represents a data dependence distance vector which is incident from every iteration Ii ∈ J. A useful method for representing the computation in a nested loop kernel is to construct a directed task graph Q in n-dimensional euclidean space called the data dependence DAG. Each task in the graph represents an iteration in the nested loop computation, and an arc connecting two iterations, say Ii and Ij, exists if there is a data dependence between Ii and Ij for some variable vi ∈ X. Thus the arcs represent data dependence vectors for all di ∈ Δ. Geometrically, a data dependence vector di(Ii) can also be interpreted as a vector in n-dimensional euclidean space where Ij − Ii = di(Ii).

Now a data dependence will exist if v^gen(Ii) ≡ v^use(Ik). There are three cases to be considered. For the first case, where Ii < Ik, the data dependence is defined as an inter-loop (i.e. it crosses an iteration boundary) flow dependence. For the second case, where Ii > Ik, the data dependence is defined as an inter-loop anti dependence. And for the third case, where Ii = Ik, the data dependence is defined as an intra-loop data dependence. An intra-loop dependence can be either an anti dependence or a flow dependence depending on the textual ordering in which the definition and reference occur. A further class of inter-loop data dependencies is defined for the case v^gen(Ii) ≡ v^gen(Ik). This class is generally termed an output dependence, and it is inter-loop if Ii ≠ Ik or intra-loop otherwise. We elaborate on the concept of output dependencies later in this section.

Definition 4.5  Given the m subscript expressions f_i^gen(X) and f_i^use(X) for an m-dimensional def-ref array variable pair, the dependence difference vector σ is defined to be the m-tuple (σ1, σ2, ..., σm) where

    σi = f_i^gen(X) − f_i^use(X),   i ∈ [1, m]

The concept of difference vectors will be used later in developing a method for determining the data dependence distance vectors for def-ref array pairs.


Figure 4.2: Output dependence vector between iterations in a loop kernel

Definition 4.6  An output dependence distance vector outi is defined to be the n-tuple (δ1, ..., δn) where outi is a positive vector and, for all Ii such that Ii + outi ∈ J,

    vi^gen(Ii) ≡ vi^gen(Ii + outi)

As was described earlier in this section, an output dependence occurs when the value in an array variable is redefined during the execution of a nested loop kernel. The output dependence distance vector describes an arc in J which points to a later iteration in which the value of an array element, defined in the current iteration, is redefined. Thus, as illustrated in figure 4.2, where A(4) is defined in the two iterations Ii and Ik, the distance vector for this output dependence is given by the vector subtraction Ik − Ii.

4.1.4 Basic row-reduction matrix operations

Throughout this thesis we will encounter systems of the form shown in (4.2).

    A1,1·x1 + A1,2·x2 + ··· + A1,n·xn = b1
    A2,1·x1 + A2,2·x2 + ··· + A2,n·xn = b2
                    ⋮                                                   (4.2)
    Am,1·x1 + Am,2·x2 + ··· + Am,n·xn = bm

As a preliminary, we describe the method by which such systems are solved. Given a matrix-vector formulation of system (4.2),

    ( A1,1  A1,2  ···  A1,n ) ( x1 )   ( b1 )
    ( A2,1  A2,2  ···  A2,n ) ( x2 )   ( b2 )
    (  ⋮     ⋮          ⋮   ) (  ⋮ ) = (  ⋮ )                           (4.3)
    ( Am,1  Am,2  ···  Am,n ) ( xn )   ( bm )


which we write as A·x⃗ = b⃗. We can solve (4.3) by forming the augmented matrix Â = (A | b⃗):

    ( A1,1  A1,2  ···  A1,n  b1 )
    ( A2,1  A2,2  ···  A2,n  b2 )
    (  ⋮     ⋮          ⋮    ⋮  )                                       (4.4)
    ( Am,1  Am,2  ···  Am,n  bm )

Note that (4.3) is completely determined by (4.4). Solving (4.2) thus involves solving (4.4), and we do this via a standard procedure known as row reduction. The algorithm for row reduction using Gaussian elimination, modified for integer values, is detailed below.

Algorithm to row-reduce a matrix into echelon form

Step 1  Suppose the jth column is the first column with a non-zero entry. Interchange the rows such that the largest non-zero entry in the jth column is in the first row. If there is no column with a non-zero entry we exit.

Step 2  For every ith row other than the first (i.e. i > 1) we seek to eliminate all entries in the jth column. We achieve this by performing the series of operations

    Ri ← −Ai,j·R1 + A1,j·Ri

where Ri denotes the entire ith row entry in the matrix being row reduced.

Step 3  Repeat steps 1 and 2 with the submatrix formed by deleting the first row. If deleting the first row results in an empty matrix, we exit.

The row-reduced matrix A' is now said to be in echelon form. A matrix is in echelon form if the number of zeros preceding the first non-zero entry of a row increases row by row until only zero rows remain, i.e.

    A'1,j1, A'2,j2, ..., A'r,jr > 0,   where j1 < j2 < ··· < jr and r is the row rank


The row-reduced matrix A' in echelon form determines the equivalent system for (4.3) in the form

    A'1,j1·xj1 + ·········· + A'1,n·xn = b'1
    A'2,j2·xj2 + ········ + A'2,n·xn = b'2
                    ⋮                                                   (4.5)
    A'r,jr·xjr + ··· + A'r,n·xn = b'r

where j1 < j2 < ··· < jr ≤ n. If jr = n ≤ m, the echelon system (4.5) can then be solved by backward substitution, where

    xjk = ( 1 / A'k,jk ) · ( b'k − Σ(i = jk+1 ... n) A'k,i·xi ),   k = n, ..., 1     (4.6)

If m < n, the solution to the echelon system (4.5) can still be formulated. Each unknown xi which does not appear at the beginning of any of the equations in (4.5) is denoted a free variable, and the solution to (4.5) is then characterised in terms of n − m free variables.
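The procedure above is straightforward to implement. The following Python sketch is ours; it handles the case where the reduced system has a pivot in every column (so back substitution applies) and uses exact rational arithmetic.

from fractions import Fraction

def echelon(M):
    M = [row[:] for row in M]
    top = 0
    for col in range(len(M[0])):
        rows = [r for r in range(top, len(M)) if M[r][col] != 0]
        if not rows:
            continue
        piv = max(rows, key=lambda r: abs(M[r][col]))   # largest non-zero entry
        M[top], M[piv] = M[piv], M[top]
        for r in range(top + 1, len(M)):
            if M[r][col] != 0:
                # R_r <- -A_{r,col} * R_top + A_{top,col} * R_r (integer row operation)
                M[r] = [-M[r][col] * a + M[top][col] * b for a, b in zip(M[top], M[r])]
        top += 1
        if top == len(M):
            break
    return M

def back_substitute(M):
    # M is an augmented echelon matrix (A' | b') with as many rows as unknowns.
    n = len(M)
    x = [Fraction(0)] * n
    for k in range(n - 1, -1, -1):
        s = sum(M[k][i] * x[i] for i in range(k + 1, n))
        x[k] = Fraction(M[k][n] - s, M[k][k])
    return x

aug = [[2, 1, 5],       # 2x + y = 5
       [1, 3, 10]]      # x + 3y = 10
print(back_substitute(echelon(aug)))   # [Fraction(1, 1), Fraction(3, 1)]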

4.2 Deriving flow dependence distance vectors

This section describes how the flow dependence distance vector of a def-ref array variable pair may be determined. The method is illustrated first, before a more formal derivation of the procedure is described. The first example illustrates a simple uniform recurrence equation, for which we derive its set of associated dependence distance vectors. The example is trivial and the procedure may seem tedious for the simple array subscript expressions involved. The intention, however, is to highlight the analysis procedure step by step, without the complication of more elaborate subscript expressions. The next example illustrates the same analysis procedure but with a def-ref array variable pair with more sophisticated subscript expressions.

Example 4.2: Consider the simplified relaxation algorithm in the loop kernel shown below.

CHAPTER 4. DATA DEPENDENCE DISTANCE VECTORS

59

DO i = 1, 100 DO j = 1, 100 DO k = 1, 100 U(j,k) = (U(j+1,k) + U(j,k+1) + U(j-1,k) + U(j,k-1))*0.25 ENDDO ENDDO ENDDO

This relaxation example was rst presented in the seminal paper by Lamport [36]. We will determine the six ow dependence distance vectors associated with the above example. Note that there are four def-ref array variable pairs in the above code fragment: 1. ( U(j,k) ! U(j+1,k) ) 2. ( U(j,k) ! U(j,k+1) ) 3. ( U(j,k) ! U(j-1,k) ) 4. ( U(j,k) ! U(j,k-1) ). To derive the data dependence distance vector for the rst def-ref variable pair, namely (U(j,k) $ U(j+1,k)), we rst derive the di erence vector  = (1; 2) such that 1 = f1gen (X ) ? f1use(X ) = j-j-1 = ?1 2 = f2gen (X ) ? f2use(X ) = k-k = 0 Letting  = (-1,0), we next derive a set of intermediate ow distances. These intermediate ow distances are denoted IDisti and they de ne the distance, in the ith dimension of the loop computation, between a de nition and a reference of an array variable. The system used to compute these intermediate ow distances is

CHAPTER 4. DATA DEPENDENCE DISTANCE VECTORS given by (4.7) below:

Pn coe (x ; f use(X ))  IDist =  i i i Pn coe (x ; f use(X ))  IDist =  i i i =1

1

1

=1

2

2

.. . Pn coe (x ; f use(X ))  IDist =  i m i m i=1

60

(4.7)

For the rst def-ref variable pair and  derived above, the system (4.7) is immediately solved such that IDistj = -1 and IDistk = 0 Since IDisti is not represented by the system of intermediate ow distances IDisti is said to be invariant in the ith dimension and taken to be 1. The three-tuple (IDisti ,IDistj ,IDistk ) = (1,-1,0) represents the distance vector for the ow dependence between the array variable pair. Thus for the array variable U(j,k) de ned at iteration (i,j,k), the same array variable is referenced again at iterations (i,j-1,k+0), (i+1,j-1,k+0), : : :, (Ui,j-1,k+0). Note, however, that not all of the Ui ? i + 1 ow distance vectors represent true data dependencies. Some of the ow vectors are invalidated by the rede nition of the array variable U(j,k) at some future iteration. The output dependence vector for the array variable U(j,k) is given by the 3-tuple (1,0,0). The output dependence vector (1,0,0) tells us where, in the J subspace, the array variable U(j,k) is next de ned. If the execution sequence for the loop kernel is checked the reader can verify that for all iterations (i,j,k) 2 J , the same array variable on the LHS of the assignment statement is rede ned at iteration (i+1,j,k) 2 J . This phenomenon is known as array kills and is illustrated in gure 4.3. The gure shows ow dependence vectors Ii ! Ik and Ij ! Ik . Because of an output dependence, the same array variable element is de ned in both Ii and Ij . The ow dependence vector from Ii to Ik is said to be killed as the actual value referenced is that determined at iteration Ij . We use the oor operator as described in section 4.1.2 to obtain the true ow dependence distance vector, given the existence of an output dependence. Thus,

b(1; ?1; 0)c

;;

(1 0 0)

= f(0; ?1; 0); (1; ?1; 0)g = fd1; d2g

Similarly, the ow dependence distance vectors for the other three def-ref array

CHAPTER 4. DATA DEPENDENCE DISTANCE VECTORS

61

TIME Ii output

killed Ij

flow

flow Ik

Figure 4.3: The array variable de nition in Ii is killed by the rede nition in Ij variable pairs are derived below:

 d = ( 0, 0, -1 ) 3

 d = ( 1, 0, -1 ) 4

 d = ( 0, 1, 0 ) 5

 d = ( 0, 0, 1 ) 6

The nested loop kernel therefore has a total of six data dependence distance vectors associated with the computation. 2 Example 4.3: Consider the nested loop kernel illustrated in the code fragment below: DO i = 1, 1000 DO j = 1, 1000 A( i+j,3*i+j+3) = ... ...

= A( i+j+1, i+2*j+4) ...

ENDDO ENDDO

This loop fragment was rst presented by Tzen and Ni [65] to illustrate the diculties in parallelizing loop kernels involving array variables with coupled subscript expressions.

CHAPTER 4. DATA DEPENDENCE DISTANCE VECTORS

62

The def-ref array variable pair in the example is de ned by the pair (A(i+j,3*i+j+3) $ A( i+j+1,i+2*j+4 )) where

f gen(X ) = i + j

f use(X ) = i + j + 1

1

1

f gen(X ) = 3  i + j + 3

f use(X ) = i + 2  j + 4

2

2

Deriving the di erence vector  = (1; 2):

?1  = f gen (X ) ? f use(X ) =  = f gen (X ) ? f use(X ) = 2i ? j ? 1 1

1

1

2

2

2

We may now derive the intermediate ow distances by formulating system (4.7): IDisti + IDistj = ?1 IDisti + 2  IDistj = 2i ? j ? 1 The above system may be more concisely expressed by the matrix-vector expression A~x = ~c where A is the 2  2 matrix of the coecients of the LHS of the system of equations above, i.e. 1 0 1 1 CA B@ 1 2 and ~x and ~c are the vectors (IDisti ; IDistj )T and (?1; 2i ? j ? 1)T respectively. Accordingly, 1 1 0 10 0 ? 1 IDist 1 1 1 C CA CA B@ B@ A = B@ 2i ? j ? 1 IDist2 1 2 By formulating the augmented matrix A and row-reducing, i.e. row2 = row2 ? row1, the equivalent echelon form may then be derived,

0 1 B@ 1 1 ?1 CA 0 ?1 2i ? j

The echelon system may be solved by back substitution, giving the ow dependence distance vector (IDisti; IDistj ) = (-1-2i+j,2i-j). The next task, as in the previous example, is to derive the output dependence vector. As such, we solve the system of homogeneous equations formed from the de ning array subscript expressions: i+j

= 0

3i + j + 3 = 0

CHAPTER 4. DATA DEPENDENCE DISTANCE VECTORS

63

Expressing the system as a matrix-vector expression, as for the intermediate ow system, we get 10 1 0 1 0 B@ 1 1 CA B@ i CA = B@ 0 CA ?3 j 3 1 The augmented matrix A may be similarly formulated and row-reduced, i.e. row2 = row2 ? 3row1, to the lower echelon form,

1 0 B@ 1 1 0 CA 0 ?2 ?3

By back substituting the echelon system, the output dependence matrix is derived to be (-3/2,3/2). A feasible output dependence vector must be a positive vector with integer coordinates. The vector (-3/2,3/2) is not considered feasible because it is neither positive nor composed of integer coordinates. The sole data dependence distance vector for the example is therefore

d

1

2

= (-1-2i+j,2i-j)

Section 4.3 formally derives the intermediate ow system used in the above two examples. Section 4.4 further describes an equivalent intermediate system for anti dependencies. Finally, section 4.5 discusses the concept of array kills and describes a procedure by which the output dependence vectors may be derived.

4.3 Intermediate ow distance vectors Our basic technique for deriving the ow dependence distance vector solves for the condition when all m subscript elds, of the def-ref array variable concerned, are equivalent. This condition is expressed with the predicate

fjgen(Ii)  fjuse(Ik ) where 1  j  m and Ii < Ik . Now assuming the term IDistz is de ned to be the intermediate ow distance for the z th dimension such that iterations (l1, : : :, lz , : : :, ln) and (l1, : : :, lz + IDistz , : : :, ln) have equivalent values for the ith subscript eld of the def-ref array variable pair concerned, we provide theorem 4.1 below which generates all iteration instances

CHAPTER 4. DATA DEPENDENCE DISTANCE VECTORS

64

which have equivalent values for the ith subscript eld with an iteration instance assumed to be Ii .

Theorem 4.1 The intermediate ow distance in the zth dimension, denoted by

IDistz , for the ith subscript equivalence of an array variable de nition at Ii and a later array variable reference is de ned by the expression n X z=1

(Bi;z  IDistz ) = i (Ii)

(4.8)

where Bi;z is the constant bounding coecient of the index variable xz in the expression fiuse and i (Ii ) is the ith component of the di erence vector .

Proof:

For the two iterations Ii = (l1; l2; : : :; ln) and Ii0 = (l1 + IDist1 ; l2 + IDist2 ; : : :; ln + IDistn ), a solution exists for the equivalence

fzgen (Ii) = fzuse (Ii0) () Ai; l +    + Ai;z lz +    + Ai;nln = Bi; (l + IDist ) +    + Bi;z (lz + IDistz ) +    + Bi;n(ln + IDistn) () Bi; IDist +    + Bi;z IDistz +    + Bi;nIDistn = (Ai; ? Bi; )l +    + (Ai;z ? Bi;z )lz +    + (Ai;n ? Bi;n )ln () Pnz (Bi;z  IDistz) = i(Ii) 1 1

1

1

1

1

1

1

1

1

=1

2

A yet more general expression for the intermediate ow terms, with the stride terms incorporated, is given in corollary 4.1 shown below:

Corollary 4.1 Assuming the loop increment at nesting level z is given by the stride stz , then it follows that

n X z=1

(Bi;z  stz  IDistz ) = i (Ii)

(4.9)

where IDistz is the iteration distance for the z th index dimension and Bi;z is the bounding coecient of the index variable xz in the expression fiuse.

Theorem 4.1 and corollary 4.1 are signi cant in that expressions (4.8) and (4.9) de ne the ow distance vector for the ith subscript eld of the m-dimensional array variable. In other words, the intermediate ow vector (IDist1 , IDist2, : : :, IDistn ), for the ith subscript eld, tells us that the ith subscript expression will be equivalent

CHAPTER 4. DATA DEPENDENCE DISTANCE VECTORS

65

at iterations (l1, l2, : : :, ln) and (l1 + IDist1, l2 + IDist2 , : : :, ln + IDistn). For a true ow dependence to exist between a def-ref array variable pair, we must nd the intersection for all ow distance vectors for all m subscript elds. Accordingly we solve the system of diophantine equations (4.10) illustrated below:

Pn coe (v ; f use(v))  IDist =  i i i Pn coe (v ; f use(v))  IDist =  i i i =1

1

1

=1

2

2

.. . Pn coe (v ; f use(v))  IDist =  i m i m i=1

(4.10)

from which the ow dependence distance vector, owi = (IDist1, IDist2 , : : :, IDistn ), can then be derived. Example 4.4: Consider the inner loop kernel in the code fragment below: DO i = 1, 1000 x(3i) = c + d f = x(2i-1) END

The di erence vector for the def-ref array variable pair ( x(3i) $ x(2i-1) ) is  = (i+1) and the bounding coecient for the index variable i in the reference array subscript eld is B1;1 = 2. Formulating system (4.10) for intermediate ow distances 2  IDisti =

i+1

=) IDisti = 12 (i + 1) We can assume the ow distance vector to be de ned by 1

ow = (i + 1) 2 Thus, for iteration i=3, we calculate ow = 2. Hence there is a ow dependence of distance 2 from iteration i=3 to i=5, i.e. (i

2

= 3)

?! (i

x(9)

= 5)

CHAPTER 4. DATA DEPENDENCE DISTANCE VECTORS

66

4.4 Intermediate anti distance vectors As in the case for determining the ow dependence distance vectors, the anti dependence distance vectors in a nested loop computation may be determined by solving for the condition de ned by the predicate

fjgen(Ii)  fjuse(Ik ) where 1  j  m and Ik < Ii . The intermediate anti vector (IAnti1; IAnti2 ; : : :; IAntin) describes a positive vector in n-dimensional index space J , for which the referencing array variable and the de ning array variable have equivalent subscript elds. Theorem 4.2 de nes the general expression from which the intermediate anti vector for the ith array subscript equivalence can be determined.

Theorem 4.2 The intermediate anti distance in the zth dimension, denoted by

IAntiz , for the ith subscript equivalence of an array variable reference at Ii and a later array variable de nition is given by the expression n X z=1

(Ai;z  IAntiz ) = ?i (Ii )

where Ai;z is the constant bounding coecient of the index variable vz in the expression figen and i (Ii) is the ith component of the di erence vector (v ).

Proof:

For the two iterations Ii = (l1; l2; : : :; ln) and Ii0 = (l1 + IAnti1; l2 + IAnti2; : : :; ln + IAntin ), a solution exists for the equivalence

fzgen (Ii) = fzuse (Ii0) () Ai; (l + IAnti ) +    + Ai;z (lz + IAntiz ) +    + Ai;n(ln + IAntin) = Bi; l +    + Bi;z lz +    + Bi;n ln () Ai; IAnti +    + Ai;z IAntiz +    + Ai;nIAntin = (Bi; ? Ai; )l +    + Bi;z ? Ai;z )lz +    + (Bi;n ? Ai;n)ln () Pnz (Ai;z  IAntiz ) = ?i (Ii) 1

1

1

1 1

1

1

2

1

1

1

=1

As for the intermediate ow term, a more general expression for the intermediate anti term is given in corollary 4.2 shown below, where the stride term for each nesting level is now incorporated.

CHAPTER 4. DATA DEPENDENCE DISTANCE VECTORS

67

Corollary 4.2 Assuming the loop increment at nesting level z is given by the stride stz ,

n X z=1

(Ai;z  stz  IAntiz ) = ?i (Ii )

where IAntiz is the iteration distance for the z th index dimension and Ai;z is the bounding coecient of the index variable xz in the expression figen.

Note that the intermediate anti vector, (IAnti1; IAnti2 ; : : :; IAntin) for the ith subscript eld, tells us that the ith subscript eld will be equivalent at iterations (l1; l2; : : :; ln) and (l1 +IAnti1; l2 +IAnti2 ; : : :; ln +IAntin ). As before, for a true anti dependence to exist, i.e. an anti vector which exists for all subscript expressions in the m-dimensional array variable, the system of diophantine equations in (4.11), illustrated below, must be solved. The solution of system (4.11) below derives the anti dependence distance vector antii = (IAnti1 ; : : :; IAntin ).

Pn coe (v ; f gen(v))  IAnti = ? i i i Pn coe (v ; f gen(v))  IAnti = ? i i i =1

1

2

=1

2

2

(4.11)

.. . Pn coe (v ; f gen(v))  IAnti = ? i m i m i=1

4.4.1 Deriving anti dependence distance vectors Example 4.5: Consider the loop body computation rst seen in example 4.3, A( i+j, 3*i+j+3 )

= 

   =    A(

i+j+1, i+2*j+4 )

The def-ref array variable pair is represented by ( A( i+ 2*j+4 ) ), where

f gen(v) = i + j f gen (v) = 3i + j + 3

i+j, 3*i+j+3 )

f use1(v) = i + j + 1 f use(v) = i + 2j + 4

1

1

2

2

We can derive the di erence vector (v ) = (1 (v ); 2(v )),

 = f gen (v) ? f use(v) = ?1  = f gen (v) ? f use(v) = 2i ? j ? 1 1

1

1

2

2

2

:::

$ A(

i+j+1,

CHAPTER 4. DATA DEPENDENCE DISTANCE VECTORS

68

for which the intersections for the intermediate anti distances, from the system of diophantine equations in (4.11), is given by IAntii + IAntij = 1 3  IAntii + IAntij = ?2i + j + 1 The system can be expressed by the matrix-vector expression A~x = ~c where A is the 2  2 matrix of the coecients of the LHS of the system above, i.e.

1 0 1 1 CA B@ 3 1

and ~x and ~c are vectors (IAntii ; IAntij )T and (1; ?2i + j +1)T respectively. Accordingly,

0 1 10 0 IAnti 1 1 i C CA B@ B A = B@ @ 3 1

IAntij

1

1 CA ?2i + j + 1

By formulating the augmented matrix A and row-reducing, i.e. row2 = row2 ? 3row1, the equivalent echelon form can then be derived,

1 0 1 CA B@ 1 1 0 ?2 ?2i + j ? 2

The echelon system can be solved by back substitution, from which the derived anti dependence distance vector is (IAntii ; IAntij ) = (j/2-j,1-j/2+i)

2

Example 4.6: Consider the loop body computation also used as an example

loop kernel by Tzen and Ni [65],

A( 2*j+3, i+1 )

= :::

::: =

A( i+j+3, 2*i+1 )

:::

The def-ref array variable pair is represented by (A( 2j+3, i+1 ) $ A( where f1gen (v) = 2j + 3 f1use(v) = i + j + 3 f2gen (v) = i + 1 f2use(v) = 2i + 1

i+j+3, 2i+1 )),

CHAPTER 4. DATA DEPENDENCE DISTANCE VECTORS

69

We can derive the di erence vector 1(v ) = (1(v ); 2(v ))

 = f gen (v) ? f use(v) = j ? i  = f gen (v) ? f use(v) = ?i 1

1

1

2

2

2

for which the intersections for the intermediate anti distances, from the system of diophantine equations in (4.11), is given by 2  IAntij = ?j + i IAntii =

i

The system can similarly be expressed by the matrix-vector expression A~x = ~c where A is the 2  2 matrix of the coecients of the LHS of the system of equations above, i.e. 1 0 B@ 1 0 CA 0 2 and ~x and ~c are the vectors (IAntii ; IAntij )T and (i; ?j + i)T respectively. Accordingly, 0 10 1 0 1 B@ 1 0 CA B@ IAntii CA = B@ i CA 0 2 IAntij ?j + i The system can now be solved by back substitution from which the anti dependence distance vector is derived to be (IAntii ; IAntij ) = (-i/2,j-i/2)

2

4.5 Array kills and output dependence distance vectors An array kill is a phenomenon whereby an output dependence vector invalidates a

ow dependence. We de ne the phenomenon more concisely and derive the necessary and sucient condition for which a ow dependence remains valid. Let outi and owi be some output and ow dependence vectors derived for some arbitrary nested loop kernel and assume Ii 2 J . Now

Ii + outi = Ik if and only if vigen(Ii)  vigen(Ik ) and

Ii + owi = Ik0 if and only if vigen (Ii)  viuse(Ik0 )

CHAPTER 4. DATA DEPENDENCE DISTANCE VECTORS

70

The ow dependence from Ii to Ik0 is killed if and only if Ik  Ik0 , or similarly

Ii + owi  Ii + outi =) owi  outi Conversely, the the ow dependence remains valid if and only if Ik > Ik0 , or similarly

Ii + outi > Ii + owi =) owi > outi Deriving the set of output dependence vectors involves solving the system of homogeneous diophantine equations formulated from the coecients in the de ning array subscripts. The intuitive rationale for this may be dicult to see immediately, but the proof of the approach involves the null subspace properties of a system of linear equations.

Theorem 4.3 The set of output dependence distance vectors is derived by solving the system of homogeneous equations

f gen (X ) = 0 f gen (X ) = 0 1

2

.. . fmgen (X ) = 0

(4.12)

over J.

Proof: Let V denote the solution subspace for the system of non-homogeneous equations (4.13) shown below

f gen(X ) = c f gen(X ) = c 1

1

2

2

.. . fmgen(X ) = cm

(4.13)

Also let W denote the null subspace which de nes the solution subspace of the equivalent system of homogeneous equations. From a well established theorem in linear algebra, there exists a vector u 2 V such that the set of all possible solutions

CHAPTER 4. DATA DEPENDENCE DISTANCE VECTORS

71

for (4.13) is given by u + w 2 V, 8w 2 W [40]. Now fc1; c2; : : :; cmg describes the set of subscript values for the m elds in the de ning array variable. Therefore for any two iterations in J , i.e. I1 and I2, such that the system (4.13) is satis ed, an output dependence is de ned. Now I1 ; I2 2 V. Hence

I = u + w and I = u + w where w ; w 2 W 1

1

2

2

1

2

The output dependence vector is therefore given by

I ? I = (u + w ) ? (u + w ) = w ?w 1

2

1

1

2

2

Since W is a subspace in J  Z n, w1 ? w2 2 W. Hence I1 ? I2 2 W and is the solution for the system of homogeneous equations. 2

4.5.1 Computation procedure The straight-forward method of solving the system of homogeneous equations (4.12) is to pose the system as a matrix-vector expression A~x = ~c where A is an n  m matrix of the bounding coecients of the de ning array subscript expressions, ~x is the index vector (x1 ; : : :; xn)T and ~c is some integer vector (c1; : : :; cm )T . By performing elementary row reduction operations on the augmented matrix A, we can derive a matrix in echelon form with the number of rows equal to the rank r of A. If r  n there will exist a unique solution which can then be obtained by back-substitution. If r < n there will be many possible solutions expressed in terms of n ? r free variables. On the occasion when r < n, we are not required to derive all possible solution vectors. Instead, we only determine the minimum positive integral vector, since any larger output dependence vector will be subsumed by the smaller output dependence vector. The procedure to derive a unique minimum output dependence vector is thus summarised below: 1. If r = n there is a unique solution to the set of homogeneous equations. If the solution vector has integer coordinates then the output dependence vector is considered feasible. However if any of the coordinates is non-integer, the output dependence is considered infeasible, i.e. there is no output dependence for that particular de ning array variable.

CHAPTER 4. DATA DEPENDENCE DISTANCE VECTORS

72

2. If r < n there is a set of solutions in terms of n ? r free variables. We must rst arrange the solution set such that a vector of xed variables fixed = (xn?r+1, : : :, xn) is determined by a vector of free variables free = (x1, : : :, vn?r ). We substitute the smallest vector for free such that fixed is a feasible solution, i.e. positive and integral. The output dependence vector is then given by the concatenated n-vector (free jfixed ).

4.5.2 Examples of computation procedure Example 4.7: Consider the nested loop where DO z = 1, 100 DO y = 1, 100 DO x = 1, 100 A( x+3y+2z ) = ... ENDDO ENDDO ENDDO

To derive the output dependence vector we solve the homogeneous equation x+3

y+2z = 0

We wish to derive the minimum output dependence vector for the de ning array variable. To do this we express the index variable x in terms of the two free index variables y and z, i.e. fixed = (x) and free = (z,y), de ned by the equality x

= ?3  y ? 2  z

We substitute the minimum positive vector free = ( 0, 1 ), giving fixed = ( -3 ) and the minimum output dependence vector out = (free jfixed ) = ( 0, 1, -3 )

2

Example 4.8: Consider the nested loop kernel

CHAPTER 4. DATA DEPENDENCE DISTANCE VECTORS

73

DO x = 1, 10 DO y = 1, 10 DO x = 1, 10 A( x+3y+2z, 3x+2y+z ) = ... ENDDO ENDDO ENDDO

To obtain the output dependence vector we have to solve the set of homogeneous equations

y+x = 0 z+2y+3x = 0 z+3

We solve the system of homogeneous equations by forming the matrix vector expression

0 1 0 1 1B x C 0 C B 2 3 1 CA B y C = B@ 0 CA B@ B C 0 1 2 3 @ A z

Formulating the augmented matrix A and row-reducing, i.e. row2 = 2row2 ? row1, we can derive the lower echelon form,

1 0 2 3 1 0 CA B@ 0 1 5 0

For which the vectors free = equations,

(z)

y x

and fixed =

(y,x)

= ? 75  z = 1 z 7

The minimum output dependence vector is then given by out = (free jfixed ) = (7,-5,1)

2

are determined by the

CHAPTER 4. DATA DEPENDENCE DISTANCE VECTORS

74

Having derived the set of ow dependence and output dependence vectors, it is interesting to note that an alternative de nition of data independence arises which depends solely on the output and ow dependence vectors. This data independence property is revealed in remark 4.1. The result is intuitive in that a ow dependence will be killed if its associated output dependence vector for the de ning array variable is smaller. One practical consequence of the remark, in the case of a FORTRAN nested loop kernels de ned with J  Z n , is that any output dependence vector of the form (0; : : :; 0; 1) will always result in a data independence of the iterations in the algorithm.

Remark 4.1 Given a set of output dependence and ow dependence vectors ex-

tracted from a nested loop kernel, denoted outi and owi respectively, the iterations in J are data independent if outi  owi for all output and ow dependence vectors associated with the same de ning array variable.

4.6 Concluding remarks This chapter has described methods for determining the distance vectors for the ow, anti and output dependencies in a nested loop computation. The set of distance vectors di 2  de nes the data ow patterns in a nested loop computation; iteration (Ii + di(Ii )) 2 J is always data dependent on Ii 2 J . The set  can then be used to generate the data dependence DAG of the nested loop computation, with nodes representing iteration tasks and arcs representing data dependencies. The data dependence DAG is useful in that it provides us with exact data ow information between tasks in the computation. Deriving the data dependence DAG for a nested loop kernel better de nes the problem of mapping the computation onto a parallel architecture. We develop on this exact knowledge in later chapters. In particular, we show how we can parallelize loop kernels which were previously thought dicult. We also describe an exact data independence test for multi-dimensional array variables using the information determined from the data dependence distance vectors derived by the techniques presented in this chapter.

Chapter 5

Execution pro ling on parallel architectures In chapters 6 and 7 we determine the execution timing pro le of nested loop kernels executing on parallel processors with varying resource characteristics. This chapter summarises the main features of our performance estimator highlighting the simulation technique used and the parameters which determine an estimation run. Section 5.1 describes the general instrumentation technique used in our performance estimation experiments. Section 5.2 describes the parallel processor model used in our performance estimator. Finally, section 5.3 describes how shadow variables and tracking statements are added to the original program. The shadow variables and tracking statements determine, at run-time, the data ow pattern of the computation. The section shows how this data ow information is used by a scheduler in our performance estimator to determine the execution time pro le of a nested loop kernel executing on our parallel processor model.

5.1 Instrumenting the loop computation We estimate the performance of a nested loop kernel on a parallel processor by instrumenting the original loop kernel and executing it. The technique modi es the original body of the nested loop kernel by including a set of tracking statements. These tracking statements extract and record relevant data ow information about the computation during its execution. The data ow information extracted is stored 75

CHAPTER 5. EXECUTION PROFILING ON PARALLEL ARCHITECTURES76 in a set of shadow variables one for each iteration associated with the nested loop computation. Using this data ow information the performance estimator then schedules iterations when all its input data are signalled as ready and a processor is available. A similar instrumentation strategy was developed by Kumar [32] to measure the parallelism pro le of scienti c programs. The method has also been extended by Paul Peterson [47] to evaluate the e ectiveness of currently available compilers in parallelizing nested loop kernels. The similarities of their work to ours are in the target application domain, imperative nested loop kernels, and the general overall instrumentation strategy. The di erence is in the processor models adopted. Both Kumar and Peterson assume an idealised parallel processor model with in nite resources. We assume a more realistic model. The model used by both Kumar and Peterson lters out machine-dependent characteristics like nite storage, limited number of processors, memory latency, resource management policies, etc. This was acceptable in their experiments as they were simply measuring the intrinsic or exposed parallelism in a nested loop computation. We assume a more realistic parallel processor model which measures the e ect of a limited number of processing elements and data communication overheads on the execution time of a nested loop kernel.

5.2 The processor model The processor model used in our experiments is developed from the idealised parallel processor model used by Kumar [32]. We add processor dependent characteristics which we intend to measure. The model assumes the simple schematic shown in gure 5.1. We can summarise the behaviour of the processor model as follows:

 Each cluster is assumed to contain a number of processors and memory mod-

ules connected in a local interconnection network. We denote the number of processors within a cluster by M throughout.

 A task takes a nite integral number of cycles to complete its execution on a processor and is denoted by the term COMPUTE. The time period de ned in

CHAPTER 5. EXECUTION PROFILING ON PARALLEL ARCHITECTURES77 cluster

cluster

cluster

Interconnection Network

Figure 5.1: The processor model is assumed to amortise all memory accesses and arithmetic/control operations for tasks executing on a processor. In our context, each task is assumed to be an iteration and the task computation involves the loop body.

COMPUTE

 A cost is associated with the communication of data in the parallel processor. If data have to be communicated between tasks assigned to processors in the same cluster this cost is denoted by a function clocal . If the communication is between clusters this cost is denoted by a function cdistant. The nature of these costs for di erent interconnection networks used in our performance estimation is explained later.

 There is only a nite number of clusters in the parallel processor model. We

denote the number of clusters in the parallel processor model by N throughout.

 Tasks are self scheduled on the parallel processor model. That is, a task is

scheduled to execute when all its input data are signalled as ready, subject to the availability of a free processor. Given a set of P ready tasks and a function avail(t) which determines the number of free processors at time step t, we select and schedule avail(t) tasks randomly if P > avail(t), assuming no prior information about the nature of the tasks selected. Ready tasks not able to execute at cycle t will be scheduled at some time step t0 > t as soon as a processor is made available.

CHAPTER 5. EXECUTION PROFILING ON PARALLEL ARCHITECTURES78 Interconnection network Cost function Cross-bar 1 Bus B mP K-cube B mKq Omega packet switched B  m  log P

Table 5.1: Cost functions for characterising transmission delays in an interconnection network

5.2.1 Communicating cost functions Data communicated across an interconnection network will incur a cost as noted before. In many parallel processors it is usual for clocal  cdistant. This section will present some well-established cost functions which can be used to approximate clocal or cdistant. Such cost functions characterise the average transmission latency of a message travelling across a particular interconnection network. We take B to be the message size communicated, m to be the basic arbitration or transmission time for an item of data, and P to be the number of processors connected by the interconnection network. The cost functions characterising the transmission delay of a message packet for some common interconnection networks are listed in table 5.1. Note that for a K -cube interconnection network the processors are connected in a cube of dimension K with size q in each dimension. The interested reader can obtain the derivation of the cost functions which are listed in table 5.1 from [16]. The parallel processor architecture in our performance estimator is concisely described by the parametric set

fM; N; clocal; cdistant; COMPUTEg Thus, a distributed memory multicomputer like the Meiko CS-2 can be described with fM; clocal g  1 and fN; cdistant, COMPUTE g assuming typical values. Shared memory multiprocessors like the Cray Y-MP will have fM; clocal ; cdistantg  1 with fN; COMPUTEg assuming typical values. Experimental hierarchical multiprocessors like Cedar [35] and virtual shared memory architectures like KSR1 and BBN GP1000 can also be modelled with fM , N; clocal ; cdistant; COMPUTEg assuming their

CHAPTER 5. EXECUTION PROFILING ON PARALLEL ARCHITECTURES79 associated typical values.

5.3 Tracking statements and shadow variables In our performance estimation scheme the original program S is modi ed by adding a set of shadow variables and tracking statements to produce a new program S 0. The new program is then compiled and executed iteratively until the entire computation has been scheduled on our parallel processor model. In practise, the original loop body computation need not be executed. Instead, the modi ed program simply extracts and records the data ow of the computation using sync functions which enforce the data dependencies and synchronise the execution of the iterations in the nested loop computation. For sync functions de ned by , which is the set of dependence distance vectors extracted from the nested loop kernel, our schedule becomes data-driven and iterations are placed on the ready queue as soon as all its input data are de ned. Note that other sync functions can also be used so long as the data dependencies in the nested loop computation are guaranteed to be preserved. As we noted earlier, the original program is augmented with the set of shadow variables. A shadow variable is declared for each iteration in the nested loop computation. These shadow variables record a set of useful information needed by our performance estimator. The information elds in a shadow variable include:

 The scheduled flag eld which indicates whether an iteration has been sched-

uled. At the beginning of an estimation run iterations are assumed not to be scheduled and hence scheduled flag will be in its unset position. When an iteration is scheduled in a time slot on a processor in our parallel processor model scheduled flag is set.

 The scheduled time eld which records the time slot in which an iteration is scheduled. When an iteration is scheduled in an available time slot on a processor in our parallel processor model we update the scheduled time eld with the appropriate value.

 The cluster eld which records the cluster in which an iteration is assigned.

Iterations are assumed to have been assigned to clusters prior to the start of

CHAPTER 5. EXECUTION PROFILING ON PARALLEL ARCHITECTURES80 an estimation run and the assigned cluster ID for the iteration concerned is recorded in the cluster eld of its associated shadow variable. For the case of the two-nested loop below: DO I = 1, 100 DO J = 1, 100

We generate an appropriately translated C program and declare a 100  100 global shadow array with each element associated with an iteration in the nested loop kernel. The translation is shown below: SHADOW task[100][100]; for ( i = 1; i 1 and the processors and memory modules in each cluster are assumed to be connected by a bus. The memory model, therefore, simulates hierarchical architectures like the experimental Cedar multiprocessor [35]. As was noted in chapter 5, our processor model is concisely described by the parametric set f M; N; clocal ; cdistant COMPUTE g. We summarise the parametric sets used to de ne our performance estimation experiments in tables 6.2, 6.3 and 6.4.

CHAPTER 6. SHARED MEMORY ARCHITECTURES

DO ACROSS i = 1, 500 DO ACROSS j = 1, 500 A(i+j,3*i+j+3) = ... ...

= A(i+j+1,i+2*j+4) ...

ENDDO ENDDO

Figure 6.6: Synthetic loop kernel one DO ACROSS i = 1, 500 DO ACROSS j = 1, 500 A(2*j+3,i+1) = ... ...

= A(i+j+3,2*i+1) ...

ENDDO ENDDO

Figure 6.7: Synthetic loop kernel two DO ACROSS i = 1, 500 DO ACROSS j = 1, 500 B(2*j+3,i+1) = ... ...

= B(i+j+3,2*i+1) ...

A(i+j,3*i+j+3) = ... ...

= A(i+j+1,i+2*j+4) ...

ENDDO ENDDO

Figure 6.8: Synthetic loop kernel three

105

CHAPTER 6. SHARED MEMORY ARCHITECTURES

106

The derivation of the COMPUTE value deserves an explanation. The COMPUTE value de nes the number of cycles needed to complete the execution of a task on a processor in our model. In our case, we de ne a task to be an instance of the body of the loop kernel. We estimate the value of COMPUTE for the BDV and DD synchronised versions of the three loop kernels by adding up the essential operation cycles needed to implement them. The DD synchronised version of loop kernels one, two and three are illustrated in gures 6.10, 6.11 and 6.12 respectively. Generating the COMPUTE value involves adding up the cycles associated with each arithmetic operation involved in generating the data dependent iterations from the derived distance vector, and the array index values. The cycle cost of the arithmetic operations are assumed to be

f; =g  2 cycles and f+; ?g  1 cycle Similarly, the BDV synchronised versions of loop kernels one, two and three are illustrated in gures 6.13, 6.14 and 6.15 respectively. The COMPUTE values for all three loop kernels are also estimated by adding up the cycles needed to generate the synchronising iteration instance and the array index values. Note that we adhere to the scheme suggested by Tzen and Ni [65] in that we serialise the inner loop and only synchronise on the second vector, (1,t), from the BDV set. The allocation strategy used for assigning tasks to processors in the parallel computer is column cyclic. This strategy for our two nested synthetic loop kernels and a two processor parallel computer is illustrated in gure 6.9: column J is allocated to processor (J mod 2). Note that the BDV scheme must always adopt a column partitioning allocation strategy because it serialises the execution of the inner J loop in the same processor. Figure 6.9 illustrates a part of the column cyclic allocation strategy, with the rst vector of the BDV set assuming (0,1). The execution time pro les for the three synthetic loop kernels executed on the virtual SMA are shown in gures 6.16, 6.17 and 6.5. We plot 1/T versus P, where T is de ned to be the execution time of the loop kernel, and P to be the number of processors in our at and hierarchical memory parallel processor models. We note that in our at and hierarchical memory parallel processor models, the execution time pro les for loop kernels one and two have the BDV parallelized

CHAPTER 6. SHARED MEMORY ARCHITECTURES

107

I

4

3

2

1 J

Processors =

0

1

0

1

Figure 6.9: A column cyclic allocation scheme loop versions demonstrating better performance for up to 75 processors. The better performance by BDV over DD at this stage is primarily due to the smaller cost associated with determining the synchronising iterations. However, for the two parallel processor models with more then 75 processors, the DD scheme consistently derives better execution time pro les. This improved execution times in the DD parallelized versions is the direct result of DD unravelling all the available parallelism in the loop kernels, compensating for the additional run-time overhead in determining exactly all data dependent iterations. The execution time pro le of loop kernel three illustrates the dramatically better parallelizing property of DD compared to BDV. As was noted in section 6.4.2, loop kernel three cannot be parallelized with the BDV scheme; loop kernel three is serialised when the BDV scheme is adopted. The DD scheme, however, synchronises directly on the dependence distance vectors, unravelling all the available parallelism, and improved execution times are derived.

6.5 Concluding remarks This chapter has described a strategy for generating the DO ACROSS version of a nested loop kernel with non-constant data dependencies. Its advantage over other similar schemes is its ability to unravel more parallelism with much less e ort. The

CHAPTER 6. SHARED MEMORY ARCHITECTURES

108

DD scheme is shown to have a much lower space complexity then the RDC scheme and, as shown through our performance estimation experiments, unravels much more parallelism then BDV; currently the most relevant method. Perhaps the biggest contribution of this chapter is that it demonstrates the practicality of the DO ACROSS loop as a feasible mechanism for parallelizing loops which were previously thought impossible. An obvious extension to the work presented is the incorporation of our DO ACROSS strategy in a realistic parallelizing compiler to measure the method's potential on real world programs other then loop kernels. For the remainder of this thesis we further develop the DO ACROSS by deriving the message-passing version of the construct for implementation on distributed memory architectures like the iPSC/2 and the iPSC/860.

CHAPTER 6. SHARED MEMORY ARCHITECTURES

Scheme M BDV 1 DD 1 Scheme M BDV 8 DD 8

109

Flat memory model

N

c

local

c

distant

1 $ 150 1 4 log N 1 $ 150 1 4 log N Hierarchical memory model

N

c c 1 $ 19 0:5  M 4 log N 1 $ 19 0:5  M 4 log N local

distant

COMPUTE

14 20 COMPUTE

14 20

Table 6.2: Processor model for the execution time pro ling of loop kernel one Scheme M BDV 1 DD 1 Scheme M BDV 8 DD 8

Flat memory model

N

c

local

c

distant

1 $ 150 1 4 log N 1 $ 150 1 4 log N Hierarchical memory model

N

c c 1 $ 19 0:5  M 4 log N 1 $ 19 0:5  M 4 log N local

distant

COMPUTE

12 16 COMPUTE

12 16

Table 6.3: Processor model for the execution time pro ling of loop kernel two Scheme M BDV 1 DD 1 Scheme M BDV 8 DD 8

Flat memory model

N

c

local

c

distant

1 $ 150 1 4 log N 1 $ 150 1 4 log N Hierarchical memory model

N

c c 1 $ 19 0:5  M 4 log N 1 $ 19 0:5  M 4 log N local

distant

COMPUTE

22 37 COMPUTE

22 37

Table 6.4: Processor model for the execution time pro ling of loop kernel three

CHAPTER 6. SHARED MEMORY ARCHITECTURES

DO ACROSS i = 1, 500 DO ACROSS j = 1, 500 i1 = i-1-2*i+j j1 = j+2*i-j /* WAIT on vaild V(i1,j1) */ A(i+j,3*i+j+3) = ... ...

= A(i+j+1,i+2*j+4) ...

/* SET V(i,j) */ ENDDO ENDDO

Figure 6.10: Synthetic loop kernel one synchronised using the DD scheme DO ACROSS i = 1, 500 DO ACROSS j = 1, 500 i1 = i-j/2 j1 = j+j-i/2 /* WAIT on valid V(i1,j1) */ A(2*j+3,i+1) = ... ...

= A(i+j+3,2*i+1) ...

/* SET V(i,j) */ ENDDO ENDDO

Figure 6.11: Synthetic loop kernel two synchronised using the DD scheme

110

CHAPTER 6. SHARED MEMORY ARCHITECTURES

DO ACROSS i = 1, 500 DO ACROSS j = 1, 500 i1 = i-1-2*i+j j1 = j+2*i-j i2 = i-j/2 j2 = j+j-i/2 /* WAIT on valid V(i1,j1) and V(i2,j2) */ B(2*j+3,i+1) = ... ...

= B(i+j+3,2*i+1) ...

A(i+j,3*i+j+3) = ... ...

= A(i+j+1,i+2*j+4) ...

/* SET V(i,j) */ ENDDO ENDDO

Figure 6.12: Synthetic loop kernel three synchronised using the DD scheme DO ACROSS i = 1, 500 DO j = 1, 500 i1 = i+1 j1 = j+1 /* WAIT on vaild V(i1,j1) */ A(i+j,3*i+j+3) = ... ...

= A(i+j+1,i+2*j+4) ...

/* SET V(i,j) */ ENDDO ENDDO

Figure 6.13: Synthetic loop kernel one synchronised using the BDV scheme

111

CHAPTER 6. SHARED MEMORY ARCHITECTURES

DO ACROSS i = 1, 500 DO j = 1, 500 i1 = i+1 j1 = j-2 /* WAIT on valid V(i1,j1) */ A(2*j+3,i+1) = ... ...

= A(i+j+3,2*i+1) ...

/* SET V(i,j) */ ENDDO ENDDO

Figure 6.14: Synthetic loop kernel two synchronised using the BDV scheme DO ACROSS i = 1, 500 DO j = 1, 500 i1 = i+1 j1 = j-499 /* WAIT on valid V(i1,j1) */ B(2*j+3,i+1) = ... ...

= B(i+j+3,2*i+1) ...

A(i+j,3*i+j+3) = ... ...

= A(i+j+1,i+2*j+4) ...

/* SET V(i,j) */ ENDDO ENDDO

Figure 6.15: Synthetic loop kernel three synchronised using the BDV scheme

112

CHAPTER 6. SHARED MEMORY ARCHITECTURES

0.00014

0.0002 0.00018 0.00016

BDV scheme DD scheme

0.00012

BDV scheme DD scheme

0.00014 0.00012

0.0001

1/T

1/T

113

0.0001

8e-05 6e-05

8e-05 6e-05

4e-05

4e-05 2e-05

2e-05

0

0 0

20

40

60

80 P

100

120

140

0

160

(a) Flat memory model

20

40

60

80 P

100

120

140

160

140

160

(b) Hierarchical memory model

Figure 6.16: 1/T versus P plot for loop kernel one

0.00025

0.00016 0.00014 BDV scheme DD scheme

BDV scheme DD scheme

0.00012 0.0001

0.00015

1/T

1/T

0.0002

0.0001

8e-05 6e-05 4e-05

5e-05

2e-05 0

0 0

20

40

60

80 P

100

(a) Flat memory model

120

140

160

0

20

40

60

80 P

100

120

(b) Hierarchical memory model

Figure 6.17: 1/T versus P plot for loop kernel two

Chapter 7

Distributed memory architectures An important implication of our ability to derive the data ow information of the data dependencies in a nested loop computation, is our consequent ability to parallelize FORTRAN nested loop kernels with non-constant data dependencies onto DMAs. A DMA does not have a globally shared memory space, and so data will have to be explicitly communicated from memory module to processor if it does not reside in the local memory. This style of programming is called message-passing because explicit messages have to be sent between processors which generate data and those which use the data. This chapter will show how the message-passing version of a nested DO ACROSS can be automatically generated by a parallelizing FORTRAN compiler. We also compare our method with other schemes used by current parallelizing compilers.

7.1 The problem de nition The model of computation we adopt is the static macro data ow model where a program is represented as a direct acyclic graph (DAG), with nodes representing computation tasks and arcs representing communication requirements between the tasks. Each task node is augmented with a computation time cost and each arc is augmented with a communication time cost. Task execution is triggered by the arrival of all its input data. The model also assumes that data is sent immediately 114

CHAPTER 7. DISTRIBUTED MEMORY ARCHITECTURES

115

to all dependent tasks on completion of the current tasks. The problem of generating the message-passing DO ACROSS version is an instance of the general mapping problem in parallel processing. The mapping problem concerns assigning tasks in the DAG to processors in a parallel computer such that the execution time is minimised, when the computation and communication costs are assumed. In general this is a very dicult problem. It has been shown by several researchers that the general mapping problem is NP-complete [58, 45]. We therefore concentrate on developing heuristics which solve the mapping problem quickly, giving as near an optimal solution to the problem as possible. As was mentioned in chapter 3, much work into the mapping of FORTRAN nested loop kernels onto DMAs has concentrated on the mapping of the PARALLEL DO loop. Such loops are easier to map onto DMAs then DO ACROSS loops because of the absence of inter-loop data dependencies, i.e. there is no communication of data between iterations. This chapter suggests a strategy for generating the message passing version of a nested DO ACROSS for DMAs using data ow information derived from the set  obtained in an earlier compilation stage. The strategy is summarised by four distinct steps: 1. Task allocation: allocating the iterations in the nested loop kernel onto processors in the parallel computer. 2. Data layout: determining how the array structures, accessed in the nested loop computation, are partitioned across the local memories of the processors in the parallel computer. 3. Communication requirement generation: summarising the communication needs for the loop body computation. 4. Code generation: generating the execution code kernel for each processor such that each knows which iterations are allocated to it and where the relevant data are stored. We describe our solution to each of the above steps in the remainder of this chapter. The basis of our scheme is the accurate determination of the distance vectors of \carried" array values in the computation. The message-passing version of the

CHAPTER 7. DISTRIBUTED MEMORY ARCHITECTURES

116

computation is generated from it's data ow patterns determined by these data dependence distance vectors. Section 7.2 describes our problem of generating the message-passing variant of the DO ACROSS as a version of the mapping problem; a pivotal problem in parallel processing. Section 7.3 describes some non-intuitive execution timing results when mapping a computation onto parallel computers based solely on metrics which produce good load balance and minimum communication costs. The section emphasises the need for determining a mapping which deliberately derives a minimal execution time. Reduced communication costs and good load balance do not necessarily produce good mappings. Sections 7.4, 7.5, 7.6 and 7.7 describe our task allocation strategy, our data layout policy, our communication requirement generation strategy and our code generation scheme respectively. Section 7.8 describes experimental results which evaluate the e ectiveness of our task allocation strategy compared to those currently proposed for HPF [26]. And nally, section 7.9 describes a related DO ACROSS generation scheme proposed by Su [63]. We compare and contrast his suggested scheme with ours.

7.2 The mapping problem We pose the problem of mapping a FORTRAN nested loop kernel onto a DMA as a problem of mapping the macro data ow DAG representation of the computation onto distinct clusters such that the parallel computation time is minimised. Iterations in the nested loop computation are mapped onto clusters in the set fC1; C2; : : :; Crg where each cluster represents either a single processor or a collection of processors connected by a local high speed interconnection network. We de ne a mapping to be the translation of the task graph de ned by the tuple (Jlabel ; E , ; cdistant) to the task graph de ned by the tuple (Jlabel ; E;  , cdistant; clocal ). Both task graphs de ne a nested loop computation, the former and latter de ned to be unmapped and mapped respectively. We assume Jlabel to be the set of all the computation tasks and E to be the set of communication arcs. We also assume jJlabel j = v to be the total number of tasks and jE j = e to be the number of communication arcs in the nested loop computation. Note that the communication

CHAPTER 7. DISTRIBUTED MEMORY ARCHITECTURES

117

arcs represent precedence vectors, determining the order in which the tasks can be executed, as well as representing the data communication needs for a correct execution. Every computation task is assumed to take an equal amount of time to complete execution and this time is denoted by  . A data communication, represented by an arc in the task graph, can take one of two di erent costs: cdistant and clocal . The communication cost, cdistant, represents the costs incurred when a data value has to be communicated across the interconnection network to reach a distant cluster. The communication cost, clocal , represents the costs incurred when a data value is communicated within the local cluster. Arcs linking tasks which have been assigned to the same cluster are internalised whereas arcs linking tasks not assigned to the same cluster are externalised. The unmapped task graph is assumed to have all arcs initially externalised. The mapping function can thus be viewed, as suggested by Sarkar [58], as an internalising pre-pass which determines which arcs are to be internalised by assigning connected tasks onto the same cluster. By doing this in some deliberate way we aim to minimise the execution time of the overall program when the task graph is eventually executed on our parallel architecture.

7.2.1 Correspondence between task graph and nested loop computation Note that there is an explicit relationship between the macro data ow DAG and the nested loop computation. The set Jlabel corresponds to the set J in that every iteration in J is represented by a uniquely labelled task in Jlabel . The set E corresponds to a subset of the data dependence distance vector set . The structure of the DAG for a speci c form of a nested loop kernel can then be completely recognised by the set J and the set , both of which can be automatically extracted from the nested loop computation by a parallelizing compiler using the techniques described in chapter 4. For our current exposition, the form of the nested loop kernel, for which our technique applies, will have the following constraints imposed on it: 1. there are no output dependencies for the array variables generated by the nested loop computation,

CHAPTER 7. DISTRIBUTED MEMORY ARCHITECTURES

118

2. array variable A is the only array structure being computed by the nested loop computation, 3. the loop body computation can reference scalar variables or elements of array A, and 4. the loop bounds for the nested loop computation must be known at compiletime. For the results described in this chapter, conditions 1 and 2 may be discarded with little modi cation to our suggested scheme. Relaxing condition 4 will require some form of run-time inspector/executor style data ow analysis, as presented in the work of Saltz et al [57] and Koebel et al [30]. Condition 3 represents a separate problem to that addressed in this chapter, i.e. the alignment problem. The alignment problem is further discussed in section 7.5.1, where we present a simple solution.

7.3 The impact of reducing the communication costs The impact of reducing the communication costs on the parallel execution time is much less signi cant in the DO ACROSS loop as compared to the PARALLEL DO loop. Anomalous behaviour can occur where a mapping with fewer inter-cluster communications can take longer to execute than a mapping with more. This behaviour can unfortunately arise quite frequently. An example of this behaviour is shown in two di erent mappings of the DAG shown in gure 7.1. We assume that cdistant = 1 and clocal = 0. In other words, data communication within clusters incur zero cost while the same across di erent clusters incur the cost of one cycle. Every task is also assumed to have an execution time of one cycle (i.e.  = 1). In the rst mapping strategy, illustrated in gure 7.1(a), the chain of tasks on the left is collected into cluster one, and the chain on the right into cluster two. The numbers next to each node indicate the time at which the tasks can be executed. Note that the minimum completion time is ve cycles and that there are three inter-cluster communications. The next mapping strategy, as illustrated in gure 7.1(b), is interesting because it achieves a better load balance then the rst and

CHAPTER 7. DISTRIBUTED MEMORY ARCHITECTURES

119

1 1

2

3

1

cluster 1 2

3

1

4 1

3

1

4

1

4 4

5

5

cluster 2

5

cluster 2 7

6 cluster 1

(a)

(b)

Figure 7.1: A clustering example with communication costs the number of inter-cluster communications have also been reduced. However the mapping strategy has resulted in an increase in the execution time of the DAG. To generate the DO ACROSS version of a nested loop kernel, it is imperative that we derive techniques which map the task graph of the computation onto the DMA such that near-optimal execution times are obtained. We have seen that techniques involving only improving the load balance and decreasing the communication costs are not enough. The mapping strategy must attempt to minimise directly the execution time of the parallelized DAG. Because such a mapping strategy may eventually be employed in an interactive parallel programming environment or in the optimising back-end of a parallelizing compiler, the mapping strategy must also be reasonably fast. Figure 7.1(a) illustrates a distinct class of mapping strategies called linear clustering [17]. In linear clustering we assign chains of tasks onto separate clusters. In the non-linear clustering example shown in gure 7.1(b), there exists tasks which have no data dependence between them and which have been assigned to the same cluster. The intuitive reason why a linear clustering is better than a non-linear clustering, when all computation costs and communication costs are the same, is that a non-linear clustering inhibits parallelism by allocating tasks, which may otherwise

CHAPTER 7. DISTRIBUTED MEMORY ARCHITECTURES

120

execute in parallel on the same processor. A linear clustering will not inhibit parallelism because tasks in a dependence chain, which have to be executed in series anyway, are allocated onto the same processor. Gerasoulis and Yang have further proved that the upper-bound for the parallel schedule derived from a linear clustering is always, at worse, twice that of the optimal parallel schedule for coarse grain task graphs [18]. Note that a DAG will have many possible dependence chains and a linear clustering algorithm must possess some policy for making decisions as to which chains are picked for clustering. Furthermore, since there are only ever a nite number of processors available in a multicomputer, a pure linear clustering strategy can never be implemented, i.e. where every linear dependence chain is assigned an individual processor. Some form of decision has to be made regarding which dependence chains are clustered onto which processors. In the next section we present a very fast linear clustering algorithm which is shown to derive good execution times when compared to other currently proposed allocation schemes.

7.4 A fast clustering algorithm Our task allocation strategy employs an algorithm called linear clustering with fork migration (LCFM). It takes as input the data dependence DAG of the nested loop computation and outputs the assignment of tasks in the DAG into a bounded set of clusters. The LCFM algorithm is a level-directed mapping strategy which selects dependence chains for clustering from the longest independent dependence chain downwards. It assigns a cluster to an entire dependence chain and considers merging dependence chains onto the same cluster if and only if the execution time can be improved.

7.4.1 The LCFM algorithm

The first part of the algorithm visits each task node in the DAG in the lexical order in which they appear in the nested loop computation. Nodes encountered which have not been marked visited are labelled root nodes, and a depth-first tree walk along the data dependence vector set Δ is initiated. Within this walk each node visited is marked as such, and the longest dependence path from the current node to the exit node, with all arcs assumed to be internalised, is recorded in the field nx → level, where nx denotes the node being visited. After a depth-first tree walk is completed, the currently considered root node is added to a priority list of root nodes, denoted STARTLIST, sorted according to the magnitude of the node's level field. The priority list maintains the root nodes in an order such that longer dependence chains are placed ahead of shorter chains. When all the nodes have been visited, the algorithm is ready to start clustering along dependence chains. The pseudo code for this preprocessing phase of the LCFM algorithm is illustrated below:

DO ∀nx ∈ J
    IF !( nx → visited ) THEN
        DEPTH-FIRST-WALK( nx )
        INSERT-PRIORITY-LIST( nx, STARTLIST )
    ENDIF
ENDDO
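For concreteness, the preprocessing walk can be sketched in C. This is only a minimal illustration under the assumptions of example 7.1 (unit task cost, levels counted in arcs to the exit node); the DAG layout and all identifiers are hypothetical, not the thesis implementation.

#include <stdio.h>

#define MAX_NODES 64
#define MAX_SUCC  8

/* Hypothetical DAG storage: succ[n] lists the successors of node n. */
static int nsucc[MAX_NODES];
static int succ[MAX_NODES][MAX_SUCC];
static int level[MAX_NODES];      /* longest dependence path to an exit node */
static int visited[MAX_NODES];

/* level(n) = 0 for an exit node, otherwise 1 + max level of its successors,
 * i.e. every arc is assumed internalised and each task costs one unit.      */
static int depth_first_walk(int n)
{
    if (visited[n])
        return level[n];
    visited[n] = 1;
    int best = -1;
    for (int k = 0; k < nsucc[n]; k++) {
        int l = depth_first_walk(succ[n][k]);
        if (l > best)
            best = l;
    }
    level[n] = (best < 0) ? 0 : best + 1;
    return level[n];
}

int main(void)
{
    /* A small example DAG: 0 -> 1 -> 2 and 0 -> 2. */
    nsucc[0] = 2; succ[0][0] = 1; succ[0][1] = 2;
    nsucc[1] = 1; succ[1][0] = 2;
    nsucc[2] = 0;

    for (int n = 0; n < 3; n++)
        if (!visited[n])
            depth_first_walk(n);
    for (int n = 0; n < 3; n++)
        printf("node %d: level %d\n", n, level[n]);
    return 0;
}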

The "fork" and "join" conditions

The LCFM algorithm clusters tasks along dependence chains. Two dependence chains are merged into the same cluster if and only if the execution time is likely to decrease. Otherwise, the two dependence chains are assigned to clusters which enable them to start at the earliest possible time. The algorithm is very fast because it has to make a merging decision in only two situations: when a fork or a join is encountered. A fork defines an arc out of a task node in a dependence chain, as shown in figure 7.2(a); the task node from which the arc emanates is called the forking task. A join defines an arc into a task node in a dependence chain, as shown in figure 7.2(b); the task node on which the arc is incident is called the joining task.

Figure 7.2: A fork and join example

We also define cdistant to be the inter-cluster communication cost for a fork, clocal to be the communication cost of an arc internalised in a cluster, and length(depchain) to be a function which returns the length of a dependence chain. We also introduce the concept of dominant and subdominant dependence chains. A dominant chain is defined to be the longest dependence chain from the current node to the exit node; all other dependence chains from the current node are denoted subdominant. We now consider the case of a fork and a join in turn, and present our policy for deciding whether dependence chains are merged into the same cluster. We define the execution time, denoted PT, to be the current execution time of the portion of the DAG that has already been clustered. The parallel execution time is therefore a monotonically increasing function of the number of task nodes that have been clustered.

In the LCFM algorithm a decision is made at a fork whether to merge a subdominant chain with a dominant chain. Consider the fork scenario depicted in figure 7.3, where chain2 is the dominant dependence chain which has already been assigned a cluster (i.e. pcluster) and chain1 is a subdominant dependence chain. The subdominant dependence chain will be assigned to a different cluster (i.e. npcluster) unless assigning it to pcluster improves the parallel execution time.

Figure 7.3: Condition to merge dependence chains at a fork

For the example in figure 7.3, if we assume nx to be the forking node and the fork to be a flow dependence, the parallel execution time of the clustered DAG with chain1 and chain2 merged into the same cluster is given by the expression:

    PT_merge = ST(nx) + length(chain1) + length(chain2) + clocal

For the case when chain1 and chain2 have been assigned to different clusters, the parallel execution time is given by the expression:

    PT_¬merge = ST(nx) + max[length(chain1) + clocal, length(chain2) + cdistant]

In the LCFM algorithm all dependence chains are assigned to different clusters unless an improvement can be shown otherwise. In the above case, if we can show that PT_merge < PT_¬merge, we merge the two dependence chains, chain1 and chain2, into the same cluster. This is the merging rule at a fork, and it is summarised in condition C1 below.

C1 (Condition for merging at a fork): We merge a dominant and a subdominant dependence chain at a fork if and only if

    length(dom chain) + length(subdom chain) + clocal
        < max[length(dom chain) + clocal, length(subdom chain) + cdistant]        (7.1)

where dom chain and subdom chain are the dominant and subdominant dependence chains from the forking task respectively, clocal is the local cluster communication cost and cdistant is the inter-cluster communication cost. □

Note that in the case where the fork is an anti dependence arc, the dominant and subdominant dependence chains are always assigned to different clusters. This policy is enforced in our LCFM algorithm so as to eliminate the anti dependencies in the nested loop computation when the single assignment memory management policy, described in section 7.6.1, is adopted.

For the case of a join, two scenarios are considered:

• the joining task has not been assigned to a cluster, and
• the joining task has been assigned to a cluster.

When an unassigned joining task is encountered, the LCFM algorithm continues clustering along the dominant chain of the joining task. If the joining task is also a forking task, the decision to merge the subdominant dependence chain with the dominant chain is taken on condition C1. When an assigned joining task is encountered, we simply stop clustering.
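Condition C1 reduces to a small integer predicate. The sketch below is a minimal illustration, assuming chain lengths and communication costs are plain integers as in example 7.1; the function name is hypothetical.

#include <stdio.h>

/* Returns non-zero if merging the subdominant chain into the same cluster
 * as the dominant chain is predicted to reduce the parallel time (C1).   */
static int merge_at_fork(int len_dom, int len_subdom,
                         int c_local, int c_distant)
{
    int merged     = len_dom + len_subdom + c_local;
    int dom_side   = len_dom + c_local;
    int sub_side   = len_subdom + c_distant;
    int not_merged = dom_side > sub_side ? dom_side : sub_side;
    return merged < not_merged;
}

int main(void)
{
    /* The fork at T1 in example 7.1: lengths 2 and 1, clocal = 0, cdistant = 1. */
    printf("merge? %d\n", merge_at_fork(2, 1, 0, 1));   /* prints 0: do not merge */
    return 0;
}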

Selecting clusters

The LCFM algorithm assumes a multiprocessor with a finite number of clusters. The algorithm selects a cluster at the beginning of the traversal of a dependence chain, and then assigns it to all tasks in the dependence chain until either the end of the dependence chain or a join is encountered. To select a cluster our algorithm picks the one with the earliest schedulable start time. We approximate this rule by storing the latest scheduled time slot for a cluster, pcluster, in an array entry start_time[pcluster]. During the course of the clustering, the start time of the cluster is adjusted

    start_time[pcluster] = start_time[pcluster] + K * τ

where τ is the per-task computation cost and K is the number of tasks assigned to pcluster. To select a cluster, the function NewMinCluster is invoked, which chooses a cluster with the smallest entry in the array start_time. Note that at the beginning of the LCFM algorithm, the array structure start_time is initialised to zero.
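A minimal sketch of this cluster-selection rule, assuming a fixed number of clusters, a start_time array initialised to zero and a uniform per-task cost τ; all identifiers are hypothetical.

#include <stdio.h>

#define NUM_CLUSTERS 8     /* illustrative; the thesis assumes a finite set */

static int start_time[NUM_CLUSTERS];   /* initialised to zero at start-up */

/* Approximation of "earliest schedulable start time": pick the cluster
 * with the smallest recorded start_time entry.                          */
static int new_min_cluster(void)
{
    int best = 0;
    for (int p = 1; p < NUM_CLUSTERS; p++)
        if (start_time[p] < start_time[best])
            best = p;
    return best;
}

int main(void)
{
    int tau = 1;                        /* per-task computation cost */
    int p = new_min_cluster();
    start_time[p] += 3 * tau;           /* a chain of K = 3 tasks was assigned */
    printf("first cluster %d, next cluster %d\n", p, new_min_cluster());
    return 0;
}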

The clustering algorithm

The main part of the LCFM algorithm selects a task node, nx, from the head of the priority list STARTLIST and a cluster, pcluster, determined by the function NewMinCluster. The procedure LinearCluster is then called, which performs the clustering for all dependence chains for which the node nx is the root. After this the next task in STARTLIST is selected and another cluster is determined using NewMinCluster, until there are no more root nodes left in STARTLIST. This phase is summarised by the pseudo code illustrated in figure 7.4.

WHILE ( nx = HEAD(STARTLIST) != NULL )
    pcluster = NewMinCluster()
    LinearCluster( nx, pcluster )
    delete( nx, STARTLIST )
ENDWHILE

Figure 7.4: Traversing root nodes in the LCFM algorithm

The procedure LinearCluster does a depth-first tree walk with the rule that the branch task with the larger level is always traversed first. The procedure travels down a dependence chain assigning task nodes to pcluster until a fork is encountered. The algorithm then checks condition C1 to determine whether a new cluster, npcluster, will have to be selected for the traversal of the subdominant dependence chain. LinearCluster then travels down the subdominant branch, clustering the dependence chain into pcluster or npcluster. As LinearCluster clusters a dependence chain it also accumulates the number of tasks being assigned. This information is then used to update start_time[pcluster] or start_time[npcluster] before a new cluster has to be selected. The procedure LinearCluster is illustrated with the pseudo code shown in figure 7.5.

Example 7.1: Consider the example DAG illustrated in figure 7.6. For the example we assume

    τ = 1, clocal = 0, cdistant = 1

The LCFM algorithm first derives the level information for each task node in the DAG. Note again that the level information for task node nx is the longest dependence chain from nx to the exit node, with all arcs assumed to be internalised.

PROC LinearCluster( node *ny, int pcluster )
    IF ( ny has already been assigned a cluster ) THEN
        update start_time[pcluster]
        RETURN
    ENDIF
    FORKLIST = sorted list of successor nodes of ny
    WHILE ( nx = head(FORKLIST) != NULL )
        IF ( C1 ) THEN
            update start_time[pcluster]
            npcluster = NewMinCluster()
            LinearCluster( nx, npcluster )
        ELSE
            LinearCluster( nx, pcluster )
        ENDIF
    ENDWHILE
ENDPROC

Figure 7.5: The linear clustering procedure

The level information for the set of tasks Jlabel = {T1, T2, T3, T4, T5, T6, T7} is shown in brackets next to their respective nodes in figure 7.6(a). From the set Jlabel, we then construct the root node priority list,

    STARTLIST = {T1, T6, T7}

The LCFM algorithm picks T1 from STARTLIST to begin traversing first. The dominant chain is clustered as shown by the dotted enclosure in figure 7.6(b). For the fork encountered at T1 the algorithm must decide whether to merge the subdominant chain with the dominant chain. The algorithm therefore computes the LHS of (7.1),

    length(dom chain) + length(subdom chain) + clocal = 2 + 1 + 0 = 3

and the RHS of (7.1),

    max[length(dom chain) + clocal, length(subdom chain) + cdistant] = max[2 + 0, 1 + 1] = 2

The algorithm decides not to merge the subdominant chain with the dominant chain, as a benefit cannot be predicted, and assigns a separate cluster to the subdominant chain from T2, as shown in figure 7.6(c). The algorithm continues with the clustering procedure for root node T6, figure 7.6(d), and similarly for root node T7, figure 7.6(e). The task allocation is therefore defined by the clusters illustrated in figure 7.6(e). □

Figure 7.6: Example clustering trace of the LCFM algorithm

7.4.2 The time complexity of the LCFM algorithm

The first part of the LCFM algorithm involves traversing the set Jlabel and inserting the set of start nodes into a priority list. Assuming the worst case, where all v tasks have to be inserted into the priority list, and a balanced tree structure is adopted to maintain this list, the time complexity of the first part is O(v log v). For the second part of the algorithm, where the procedure LinearCluster does a depth-first traversal of the DAG, each arc is visited only once; hence the time complexity is O(e). The total time complexity of the LCFM algorithm is therefore O(v log v + e). The LCFM algorithm exhibits a better time complexity than the


critical path method proposed by Sarkar [58] and the dominant sequence clustering (DSC) algorithm proposed by Gerasoulis and Yang [71]. The critical path algorithm exhibits a time complexity of O((v + e)e) while the DSC algorithm exhibits a time complexity of O((v + e) log v ). The DSC algorithm has, however, been shown to be optimal for certain classes of DAGs [71] for which our algorithm is not.

7.5 Data distribution via task distribution

We view the data distribution policy as a consequence of the task distribution policy. This is a valid approach as it adheres to the owner computes data layout policy adopted by almost all parallelization systems for DMAs. Under the owner computes rule all variables computed by tasks assigned to a processor are owned by that processor. All accesses to array variables non-local to the processor, i.e. not owned by it, will be read-only. In mapping the tasks in the DAG of a nested loop computation we therefore also implicitly partition the array structure generated by the computation. The parallel computer onto which the tasks and array data are partitioned is assumed to consist of a set of processors with similar execution time characteristics.

The task distribution is defined by the one-to-one mapping exec : Jlabel → [1, P], where Jlabel is the set of iterations and [1, P] defines the range of processor IDs in the parallel computer. The data distribution is similarly defined by the one-to-one mapping data_part : D → [1, P], where D denotes the domain of the array variable being partitioned. To illustrate the owner computes data layout policy, consider the example below.

Example 7.2: A simple example of the owner computes policy is the loop kernel

DO i = 1, 1000
  A(i) = A(i+2) ...
END

In the loop, iteration i0 computes the value of A(i0). Thus if the set J is partitioned cyclically such that exec(i0) = {proc : proc ≡ i0 mod P}, then array A is also partitioned such that data_part(i0) = {proc : proc ≡ i0 mod P}. □
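A minimal sketch of this cyclic owner-computes mapping, assuming P = 4 processors purely for illustration.

#include <stdio.h>

#define P 4   /* number of processors; an assumption for this illustration */

/* Cyclic owner-computes mapping: iteration i and the element A(i)
 * it defines are both placed on processor i mod P.                */
static int exec_of(int i)      { return i % P; }
static int data_part_of(int i) { return i % P; }

int main(void)
{
    for (int i = 1; i <= 8; i++)
        printf("iteration %d -> processor %d (owns A(%d))\n",
               i, exec_of(i), data_part_of(i));
    return 0;
}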


7.5.1 The alignment problem

Our exposition does not consider the computation of an array structure involving other array structures, as shown below:

DO ∀I ∈ J
  A( f^gen(I) ) = ... B( f^use(I) ) ...
END

We have, therefore, ignored the alignment problem, which is how best to partition the array B such that the need for inter-cluster communication is minimised. One possible solution is to adopt one of a few regular data layout policies for array B, and to run an instrumented version of the kernel to determine the policy resulting in the smallest number of inter-cluster communications. The instrumented version of the code checks the assignment of an iteration together with the corresponding placement of the referenced array element B(f^use(I)), accumulating the instances when an inter-cluster communication is needed. Some possible regular data layout policies are illustrated in figure 7.7, where a 6 × 6 array structure is partitioned across a DMA with three processors. Let us assume that the task allocation phase has derived the distribution function exec, and that the data distribution function for the array B is defined as data_partB for the selected data layout policy. The instrumented version of the loop kernel, for determining the inter-cluster cost of the selected data layout policy, can take the form:

COMMS_COST = 0
DO ∀I ∈ J
  IF ( exec(I) ≠ data_partB(f^use(I)) ) THEN
    COMMS_COST++
  ENDIF
ENDDO

The instrumented code can then be run with B adopting different data layout policies and COMMS_COST determined. The data layout policy which results in the smallest inter-cluster communication cost can then be chosen as the data partition for array B.
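The instrumentation can be expressed directly in C. The sketch below assumes a small 6 × 6 array as in figure 7.7, an illustrative row-cyclic exec mapping and an identity f^use, none of which are the thesis's measured configuration.

#include <stdio.h>

#define N 6     /* 6 x 6 array, as in figure 7.7 */
#define P 3     /* three processors              */

/* Illustrative task mapping: iterations are owned row-cyclically. */
static int exec_of(int i, int j) { (void)j; return i % P; }

/* Candidate data layouts for B. */
static int col_cyclic(int i, int j)     { (void)i; return j % P; }
static int col_contiguous(int i, int j) { (void)i; return j / (N / P); }
static int row_cyclic(int i, int j)     { (void)j; return i % P; }
static int row_contiguous(int i, int j) { (void)j; return i / (N / P); }

/* Count the iterations whose referenced element of B is non-local. */
static int comms_cost(int (*layout)(int, int))
{
    int cost = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            /* f_use(i,j) is taken to be (i,j) itself for brevity. */
            if (exec_of(i, j) != layout(i, j))
                cost++;
    return cost;
}

int main(void)
{
    printf("column cyclic     : %d\n", comms_cost(col_cyclic));
    printf("column contiguous : %d\n", comms_cost(col_contiguous));
    printf("row cyclic        : %d\n", comms_cost(row_cyclic));
    printf("row contiguous    : %d\n", comms_cost(row_contiguous));
    return 0;
}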

Figure 7.7: Some regular data partitioning schemes for a 6 × 6 array structure (column cyclic, column contiguous, row cyclic, row contiguous)

7.6 Generating the communication requirements

In this section we describe how the communication requirements of a nested loop computation can be derived. This is an important task in parallelizing a nested loop kernel on a DMA, as it determines the explicit message-passing requirements of the program.

7.6.1 Eliminating the anti dependencies

We generate the communication requirements for the tasks in the macro data flow DAG of the nested loop computation from its associated data dependence DAG. In the data dependence DAG, nodes represent the loop body computation, as in the macro data flow DAG, and arcs define the data dependencies in the computation. The data dependencies define the precedence relationships between task nodes when the computation is executed on an SMA, i.e. a uniprocessor or a shared memory multiprocessor. The precedence relationships between task nodes executing on a DMA need, however, only be defined by flow dependencies, because memory-related dependencies can easily be broken with a deliberate memory management policy. An anti dependence is a memory-related dependence; it is a result of a shared memory


module assuming a single store for a common array element. As such, the order of accesses to an array element must be carefully preserved to ensure the integrity of the computation. Consider the anti dependence in the code fragment below,

I:   A(3) = A(4) ...
I':  A(4) = A(5) ...

where I and I' denote the iterations to which the statements belong. The anti dependence is a result of iteration I needing to reference A(4) and iteration I' needing to redefine A(4). In the case of a shared memory location for A(4), I and I' must be executed in order. If I and I' are mapped onto separate processors, each with a separate memory store for A(4), iterations I and I' can be executed in parallel.

In the owner computes policy, every array variable is owned by the processor to which the iteration that computes its value is assigned. An inter-cluster communication will result if an iteration requires a datum defined by an iteration not assigned to the same processor. This is represented by a flow dependence in the data dependence DAG. An anti dependence spanning different processors can be ignored if we assume the following memory management policy:

1. all processors are downloaded with the initialised array structure A,

2. a processor can only update the portion of the array A which it owns; the portion of array A which it does not own is declared read-only and is never updated, and

3. all communicated data, defined on a distant processor, is stored in a hash table in the local memory together with the label of the iteration instance which needs it.

To understand the memory management policy better, consider the small portion of a data dependence DAG illustrated in figure 7.8(a). Without an explicit single assignment memory management policy, as discussed before, the iterations in figure 7.8(a) must be executed in series; that is, the order of execution for the iterations is defined Ii → Ij → Ik. Assuming Ii and Ik have been assigned to processor one and Ij to processor two, our memory management policy ensures that processor one

owns data variable A(4) and processor two owns data variable A(6), from the owner computes policy. Thus when Ii is executed on processor one, the new value for A(4) is written into its local write store. The value is also communicated to the hash table of processor two, which is used to store externally updated data variables needed in a local computation. Because processors one and two have their own copies of A(4), iterations Ij and Ik can now be executed in parallel. The anti dependence between Ij and Ik is therefore broken.

Figure 7.8: Eliminating anti dependence vectors
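A minimal sketch of the receive-side store implied by this single assignment policy: externally computed elements are kept in a small table keyed by array index and consuming iteration, rather than overwriting the read-only copy of A. The open-addressing table and all identifiers are assumptions for illustration only.

#include <stdio.h>

#define TABLE_SIZE 128

/* One entry of the hash table holding externally updated array
 * elements, keyed by the array index and the consuming iteration. */
struct remote_value {
    int in_use;
    int array_index;     /* e.g. the 4 in A(4)                   */
    int consumer_iter;   /* label of the iteration that needs it */
    double value;
};

static struct remote_value table[TABLE_SIZE];

static unsigned hash(int array_index, int consumer_iter)
{
    return (unsigned)(array_index * 31 + consumer_iter) % TABLE_SIZE;
}

/* Store a value received from a distant processor (linear probing). */
static void store_remote(int array_index, int consumer_iter, double v)
{
    unsigned h = hash(array_index, consumer_iter);
    while (table[h].in_use)
        h = (h + 1) % TABLE_SIZE;
    table[h].in_use = 1;
    table[h].array_index = array_index;
    table[h].consumer_iter = consumer_iter;
    table[h].value = v;
}

/* Look up a value needed by a local iteration; returns 0 if absent. */
static int load_remote(int array_index, int consumer_iter, double *out)
{
    unsigned h = hash(array_index, consumer_iter);
    for (int probes = 0; probes < TABLE_SIZE; probes++) {
        if (table[h].in_use &&
            table[h].array_index == array_index &&
            table[h].consumer_iter == consumer_iter) {
            *out = table[h].value;
            return 1;
        }
        h = (h + 1) % TABLE_SIZE;
    }
    return 0;
}

int main(void)
{
    double v;
    store_remote(4, 7, 3.14);          /* A(4), needed by iteration 7 */
    if (load_remote(4, 7, &v))
        printf("A(4) for iteration 7 = %g\n", v);
    return 0;
}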

7.6.2 Generating the macro data flow DAG

The arcs in the macro data flow DAG represent communication requirements between tasks. The conversion from a data dependence DAG to the macro data flow DAG is summarised in figure 7.9. This conversion is achieved by two immediate transformations. The first transformation removes all anti dependencies. As was noted in the previous section, anti dependencies can be eliminated because they do not incur any communication requirements when a memory management policy similar to the one outlined earlier is adopted. The second transformation is a consequence of the task allocation procedure we described earlier. At the completion of task allocation, the iterations of a nested loop computation are explicitly assigned to processors in the architecture. Iteration instances which are flow dependent and have been assigned to the same processor are said to have their flow dependence arcs internalised. As such, the internalised flow arcs can be removed from the data dependence DAG, because there is no longer any explicit need to communicate data between the data dependent iterations.

Figure 7.9: Transformation from a data dependence DAG to a macro data flow DAG (remove the anti dependence vectors, then remove the internalised dependence vectors)

We summarise the communication requirements of an iteration by generating a group of communication sets for each iteration. Each set represents the input and output data requirements of an iteration I ∈ J. Assuming FLOW(Δ) to be the set of flow dependence arcs and BACK(FLOW(Δ)) to be the corresponding set of backward vectors, we first define flow_out(I) to be the set of tasks flow dependent on iteration I, i.e. flow_out(I) = {i : ∀ flow ∈ FLOW(Δ), I + flow = i s.t. i ∈ J and i > I}. We also define flow_in(I) to be the set of tasks on which the current iteration I is flow dependent, i.e. flow_in(I) = {i : ∀ flow' ∈ BACK(FLOW(Δ)), I + flow' = i s.t. i ∈ J and i < I}. We further define the set local(p) to be the set of iterations assigned to processor p, i.e. local(p) = {i : ∀i ∈ J and exec(i) ≡ p}, and the set distant(p) to be the set of iterations not assigned to the processor p. The communication sets are then defined:

    send_set(I) = {i : flow_out(I) ∩ distant(p)}
    receive_set(I) = {i : flow_in(I) ∩ distant(p)}

The communication sets, receive_set(I) and send_set(I), define the distant input and output communication requirements of iteration I respectively. This, of course, assumes that the task allocation has already been performed and has assigned I to some processor p.

Example 7.3: For code fragment 6.1, with forward flow dependence vectors

    flow = (−1 − 2i + j, 2i − j)

and backward flow dependence vectors

    flow' = (−i + j/2, i − j/2 + 1)

the flow_out and flow_in sets are defined

    flow_out(i, j) = {(i', j') : (i, j) + (−1 − 2i + j, 2i − j) = (i', j'), s.t. (i', j') ∈ J and (i', j') > (i, j)}

    flow_in(i, j) = {(i', j') : (i, j) + (−i + j/2, i − j/2 + 1) = (i', j'), s.t. (i', j') ∈ J and (i', j') < (i, j)}

Figure 7.10: Communication requirement for iteration (1,6)

Also assuming local(p) = {(i, j) : i ≡ p} and distant(p) = {(i, j) : i ≠ p}, the communication sets for iteration (1,6) are given by:

    send_set(1,6) = {(4, 2)}
    receive_set(1,6) = ∅

The communication requirement from iteration (1,6) to iteration (4,2) is therefore represented by the arc in the macro data flow DAG shown in figure 7.10. Note that if the local function had been such that both iterations (1,6) and (4,2) were allocated onto the same processor, both send_set and receive_set would be empty. □
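The membership tests behind these sets can be sketched mechanically. The code below checks iteration (1,6) of example 7.3 against the forward and backward flow vectors quoted above, assuming the i ≡ p allocation of the example; the loop bounds are illustrative assumptions.

#include <stdio.h>

#define LO 1
#define HI 6      /* illustrative loop bounds; the real kernel's differ */

static int in_J(int i, int j) { return i >= LO && i <= HI && j >= LO && j <= HI; }
static int exec_of(int i, int j) { (void)j; return i; }   /* local(p) = {(i,j): i = p} */

/* lexicographic comparison of iterations */
static int lex_gt(int i1, int j1, int i2, int j2)
{ return i1 > i2 || (i1 == i2 && j1 > j2); }

int main(void)
{
    int i = 1, j = 6;                       /* iteration (1,6) of example 7.3 */

    /* forward flow vector: (-1 - 2i + j, 2i - j) */
    int fi = i + (-1 - 2*i + j), fj = j + (2*i - j);
    if (in_J(fi, fj) && lex_gt(fi, fj, i, j) && exec_of(fi, fj) != exec_of(i, j))
        printf("send_set(%d,%d) contains (%d,%d)\n", i, j, fi, fj);

    /* backward flow vector: (-i + j/2, i - j/2 + 1); j must be even */
    if (j % 2 == 0) {
        int bi = i + (-i + j/2), bj = j + (i - j/2 + 1);
        if (in_J(bi, bj) && lex_gt(i, j, bi, bj) && exec_of(bi, bj) != exec_of(i, j))
            printf("receive_set(%d,%d) contains (%d,%d)\n", i, j, bi, bj);
        else
            printf("receive_set(%d,%d) is empty for this vector\n", i, j);
    }
    return 0;
}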

7.7 Execution code kernel

This section describes the execution code kernel compiled for each processor in the DMA. A correct execution code kernel must perform three important tasks:

1. It must be able to identify, for the local processor, the iterations for which it is responsible, i.e. the function exec.

2. It must communicate its computed data to the appropriate processors at the completion of every iteration.

3. It must be able to wait for the receipt of the necessary non-locally computed data before it permits an iteration to start execution.

These three tasks describe the essential duties of the execution code kernel. The kernel is summarised in the high level pseudo-C shown in figure 7.11. The set local(p) used in the execution kernel is as defined previously and returns the set of iterations which the LCFM mapping strategy has assigned to processor p. A simple implementation of the function local can be achieved by using an N-element

for ( I ∈ local(P) ) {
    if ( receive_set(I) ≡ ∅ ) {
        /* EXECUTE LOOP BODY */
        /* Send out computed data */
        for ( k ∈ send_set(I) ) {
            fill_buffer( label(k), data );
            send( exec(k), buffer );
        } /* for loop */
    } else {
        /* Insert iteration instance into hash table
           for suspended tasks */
    }
} /* for loop */

while( !BARRIER ) {
    /* Wait on a receive */
    receive( buffer );
    /* Extract relevant iteration instance from the
       hash table for suspended tasks */
    if ( not all non local input data have been received )
        continue;
    /* EXECUTE LOOP BODY */
    /* Send out computed data */
    for ( k ∈ send_set(I) ) {
        fill_buffer( label(k), data );
        send( exec(k), buffer );
    } /* for loop */
} /* while loop */

Figure 7.11: The high level description of the execution code kernel

bit array, where N is the number of iterations in the nested loop computation. We call this bit array the ownership vector. The ownership of an iteration by a processor p can then be indicated by setting the corresponding bit position in the ownership vector distributed to that processor.

The top level for loop in the kernel executes all iterations which have no inter-cluster input requirements, as determined by a null receive_set. On completion of a scheduled iteration, computed data needed by iterations assigned to distant processors, as determined by send_set and exec, is sent. Where an iteration is encountered which is data dependent on an externally computed data value, i.e. receive_set is not empty, the iteration is suspended and placed in a hash table for suspended tasks.

Having scheduled all iterations with no inter-cluster input requirements, the execution kernel then enters a while loop. The test condition for the while loop is a global synchronisation flag, BARRIER, which is set when all iterations in the nested loop computation have been scheduled. The start of each iteration of the while loop is blocked by a receive which waits for incoming message packets from distant processors in the multicomputer. When a message packet is received, the relevant iteration at which the packet is targeted is extracted from the hash table of suspended tasks; the identity of the target iteration is assumed to be contained in the data packet which the processor receives. The extracted iteration is then checked to see if all its inter-cluster input requirements have been satisfied. If not, the received data is stored in the hash table for externally computed values, the satisfaction of one of the inter-cluster input requirements for the current iteration is recorded, and the kernel returns to its suspended state. If the current iteration has all its inter-cluster input requirements satisfied, it is executed. When the current iteration has completed execution, its inter-cluster output requirement is satisfied by sending the newly computed data to the non-locally assigned data dependent iterations, as explained before. When this is done, control flows back to the head of the while loop and the whole process is repeated until all iterations have been scheduled, i.e. BARRIER is released.
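A minimal sketch of the ownership vector, assuming iterations are numbered consecutively from zero; the bit-array layout is an implementation assumption.

#include <stdio.h>
#include <string.h>

#define N_ITERS (200*200)                /* iterations in the nested loop */

static unsigned char ownership[(N_ITERS + 7) / 8];   /* one bit per iteration */

static void set_owned(int iter) { ownership[iter / 8] |= (unsigned char)(1u << (iter % 8)); }
static int  is_owned(int iter)  { return (ownership[iter / 8] >> (iter % 8)) & 1; }

int main(void)
{
    memset(ownership, 0, sizeof ownership);
    set_owned(42);                       /* marked by the allocation phase */
    printf("iteration 42 owned: %d\n", is_owned(42));
    printf("iteration 43 owned: %d\n", is_owned(43));
    return 0;
}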

    Memory model : Flat
    M            : 1
    N            : 1 – 256
    c_local      : 1
    c_distant    : 20 · K
    COMPUTE      : 10

Table 7.1: Parametric description of the K-cube interconnected distributed memory parallel processor model

7.8 Experimental evaluation of the task partitioner

This section evaluates the effectiveness of the LCFM iteration task partitioner against schemes currently advocated for the HPF standard being drafted for massively parallel machines [26]. We concentrate on the partitioning directives present in the preliminary proposal. In HPF, the task partition is derived through the data partition using the owner computes rule; specifying the distribution of array structures determines where iterations in a nested loop kernel are mapped. HPF defines the DISTRIBUTE primitive, which is used to assign independent attributes to each dimension of an array structure. The predefined attributes are BLOCK and CYCLIC. Thus for the statement

    DISTRIBUTE A( :, BLOCK )

where ":" denotes don't care, the array structure A is partitioned in a column contiguous fashion, as shown in figure 7.7. For the other regular partitioning schemes in figure 7.7, HPF defines

    Column cyclic    →  DISTRIBUTE A( :, CYCLIC )
    Row cyclic       →  DISTRIBUTE A( CYCLIC, : )
    Row contiguous   →  DISTRIBUTE A( BLOCK, : )

In our experiments, we estimate the execution time profiles of the three synthetic loop kernels shown in figures 7.12, 7.13 and 7.14. We compare the execution time profiles when the LCFM scheme is used to partition iterations in the nested loop kernels and when the regular schemes, as proposed for HPF, are used.
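For reference, the two attributes correspond to simple index-to-processor maps. The sketch below is a generic illustration rather than HPF's normative definition; the ceiling-based block size is an assumption for extents that do not divide evenly.

#include <stdio.h>

/* CYCLIC: element k of an extent-N dimension goes to processor k mod P. */
static int cyclic_owner(int k, int N, int P) { (void)N; return k % P; }

/* BLOCK: contiguous blocks of ceil(N/P) elements per processor. */
static int block_owner(int k, int N, int P)
{
    int block = (N + P - 1) / P;
    return k / block;
}

int main(void)
{
    int N = 6, P = 3;
    for (int k = 0; k < N; k++)
        printf("k=%d  cyclic->P%d  block->P%d\n",
               k, cyclic_owner(k, N, P), block_owner(k, N, P));
    return 0;
}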

7.8.1 K-cube connected distributed memory multicomputers

The parallel processor model used in our performance estimation defines a K-cube connected DMA. The parameters describing the parallel processor model used in

DO ACROSS i = 1, 200
  DO ACROSS j = 1, 200
    A(i+j,3*i+j+3) = ...
    ... = A(i+j+1,i+2*j+4) ...
    ... = A(i+3,2*i+j) ...
  ENDDO
ENDDO

Figure 7.12: Synthetic loop kernel one

DO ACROSS i = 1, 200
  DO ACROSS j = 1, 200
    A(i+j,3*i+j+3) = ...
    ... = A(i+j+1,i+2*j+4) ...
    ... = A(i+j-1,2*i-j) ...
  ENDDO
ENDDO

Figure 7.13: Synthetic loop kernel two

DO ACROSS i = 1, 200
  DO ACROSS j = 1, 200
    A(i,j) = ...
    ... = A(i-3,j-2) ...
    ... = A(i-2,j+1) ...
  ENDDO
ENDDO

Figure 7.14: Synthetic loop kernel three

    Kernel   sync functions: ( next_i(), next_j() )
      1      (−1 − 2·i + j, 2·i − j)    (j − 3, i − 2·j + 9)
      2      (−1 − 2·i + j, 2·i − j)    (2, −1)
      3      (3, 2)                     (2, −1)

Table 7.2: Sync functions for task allocation performance estimation experiments

all our experiments are summarised in table 7.1. A detailed description of the parameters and the performance estimation technique used in our experiments can be found in chapter 5. Examples of parallel processors described by our set of parameters are the iPSC/2 and iPSC/860, which are K-cube connected DMAs. In our experiments each node is assumed to be a single processor, and we estimate the performance for K-cubes in the range K ∈ [2, 8] with q (dimension size) = 2.

The sync functions, which define the temporal partial order of the schedule for the iterations of the three loop kernels, are shown in table 7.2. The sync functions describe the flow dependence data flow patterns in our synthetic nested loop kernels, derived using the techniques described in chapter 4. The sync functions, together with the task partitioning function being modelled in our experiments, define the message passing requirements of the nested loop kernel. We model each loop kernel in the form:

DO i = 1, 200
  DO j = 1, 200
    *** RECEIVE MESSAGE ***
    *** PERFORM COMPUTATION ***
    *** SEND MESSAGE ***
  ENDDO
ENDDO

Our experiments amortise the COMPUTE value for all three nested loop kernels into a constant value, representing the work load of the computation part of the loop body, across the simulation runs with different task partitioners. As shown in table 7.1, the COMPUTE value used in our experiments is the constant ten. Every time a message has to be transferred across clusters in our parallel processor model, the cost cdistant is incurred in numbers of cycles. The ratio of the loop computation cost, COMPUTE, over the message passing cost, cdistant, is therefore given by

    COMPUTE / cdistant = 10 / (20 · K)

A further aim of our experiments was to determine whether the nature of the data dependencies in a loop kernel has an effect on the desired choice of task partitioner. Loop kernel one has data dependencies which are all non-constant; loop kernel two has a constant and a non-constant data dependence; loop kernel three has only constant data dependencies.

The execution time profiles for the three loop kernels are shown in figures 7.15, 7.16 and 7.17. The execution time profiles are in the form of 1/T versus P plots, where T is the execution time of the loop kernel and P the number of processors. It is clear from the profiles that the LCFM task allocation scheme consistently derives vastly improved execution times compared to the regular task allocation schemes currently proposed for HPF. We assert that regular task allocation schemes are inadequate for distributing the computation of loop kernels with inter-loop data dependencies.

We have further instrumented our performance estimator to monitor the load balance of the estimation run for the 256-PE, i.e. 8-cube, processor model when our LCFM allocation policy is adopted. The load balance profiles for the three nested loop kernels are shown in figures 7.18, 7.19 and 7.20. Note that even though the LCFM algorithm does not explicitly try to balance the work load across the parallel computer, it still generates a very even load balance for loop kernels one and three. Also note that even with an uneven load profile, as is seen for the LCFM task allocation of loop kernel two, execution times better than those obtainable from a regular task allocation scheme are derived. We emphasize that load profiles are not, in some cases, a good indicator of performance.

7.9 Related work

A scheme similar to the one presented in this chapter has been proposed by Su [63]. He proposes a strategy for generating the static message-passing version of the DO ACROSS on a distributed shared memory multiprocessor. Our scheme is similar to his in that message binding is predetermined at compile time: communicated messages are placed in predetermined memory locations in the local store of each processor. His scheme, however, differs from ours in a number of very important ways.

Firstly, the scheme proposed by Su is only valid for nested loop kernels with constant data dependence vectors, i.e. uniform recurrence equations. Su proposes a scheme which inserts communication primitives directly, based on the knowledge that data dependent iterations are always a constant distance apart. Our scheme applies equally well to loop kernels with constant as well as non-constant data dependencies. Su does not discuss how the data flow patterns of the computation can be derived; we determine the data flow pattern directly and instrument the loop kernel with communication primitives which enforce the data dependencies at run time.

Secondly, Su proposes a task allocation strategy similar to the DISTRIBUTE primitive in HPF. He describes CYCLIC and BLOCK distribution schemes which assign iterations in a loop kernel cyclically or in contiguous blocks to the processors in the architecture. As seen in section 7.8, regular task allocation strategies produce bad execution times for the DO ACROSS loop. Our scheme performs the task allocation with the deliberate aim of reducing the execution time: we distribute iterations in a loop kernel by linearly clustering data dependent iterations onto the same processor. Execution time improvements of our approach over a regular task allocation scheme can be quite dramatic, as seen in figures 7.15, 7.16 and 7.17.

7.10 Concluding remarks

This chapter has described a strategy for generating the message-passing version of the DO ACROSS on a DMA. The strategy relies on the exact data flow information derived from the accurate determination of the data dependence distance vectors of the nested loop kernel. The chapter also highlights the LCFM task allocation strategy

which derives improved execution times over the regular allocation schemes currently proposed for HPF. The implication is clear: the current DISTRIBUTE primitive in HPF is inadequate. A more flexible scheme, whereby the compiler determines the distribution of iterations using an allocation scheme like LCFM, will result in better execution times. The lesson we learn from our discussions in this chapter is that deriving improved information about the data flow patterns in a computation results in improved mappings. It is envisioned that our proposed strategy will eventually be implemented in a realistic compiler framework to assess its potential on real world problems.

Figure 7.15: 1/T versus P plot for loop kernel one (LCFM mapping, row cyclic, column cyclic, row contiguous, column contiguous)

Figure 7.16: 1/T versus P plot for loop kernel two (LCFM mapping, row cyclic, column cyclic, row contiguous, column contiguous)

Figure 7.17: 1/T versus P plot for loop kernel three (LCFM mapping, row cyclic, column cyclic, row contiguous, column contiguous)

Figure 7.18: Task load profile for each processor in a 256-PE architecture executing loop kernel one

Figure 7.19: Task load profile for each processor in a 256-PE architecture executing loop kernel two

Figure 7.20: Task load profile for each processor in a 256-PE architecture executing loop kernel three

Chapter 8

Exact data independence testing

An important task in any optimising compiler is the ability to decide if the def-ref array variable pair

S1:  A( f^gen ) = ...
S2:  ... = ... A( f^use ) ...

is data independent. Where this can be done accurately, more effective optimising transformations may be applied to the program, since the statements S1 and S2 may then be moved in any order relative to each other. In the context of a parallelizing compiler, any two independent statements may be allowed to execute in parallel without need for synchronisation.

This chapter presents a data independence test for multidimensional array variables called the distance test. The test is presented for the case where an array variable pair is defined within a nested loop kernel with generalised trapezoidal loop bounds. Such cases are especially important because they generalise the majority of nested loop kernels used in typical scientific and engineering programs [61]. Furthermore, the distance test is an exact test in that it can decide the data independence question exactly, given the resources, without resorting to approximating assumptions. In a realistic situation, where such a test is used in a parallelizing FORTRAN compiler, we show how approximating assumptions may be made to ensure that the test is performed in a reasonable amount of time. Our experiments show that the occasions where these approximating assumptions are

needed are few and far between for two- and three-deep nested loop kernels. This class of loop kernels is important because it accounts for 99% of the loop kernels found in typical scientific programs [61].

Section 8.1 presents an overview of the distance test. It briefly describes the suite of tests needed to determine the data independence of a def-ref array variable pair. Sections 8.2, 8.3 and 8.4 elucidate the details of the distance test; they describe, in detail, the three phases which the test goes through to decide if a def-ref array variable pair is data independent. Section 8.5 discusses the results of a timing analysis experiment, which engaged a version of the distance test on a large number of possible scenarios; it shows that the distance test is practical, when carefully constrained, for use in an optimising compiler. Finally, section 8.6 compares the distance test with another important exact test recently proposed in the literature, the Omega test [53]. Even though both tests have worst-case exponential time complexities, the distance test is shown to be potentially quicker.

8.1 The distance test

A data independence test decides whether a pair of m-dimensional array variables, defined in an n-nested loop kernel, will access the same memory location during the execution of the loop. The distance test adopts a three-phase resolution procedure:

Phase 1 derives the distance vector for the flow dependence between the array variable pair.

Phase 2 determines the feasibility of a flow dependence within the real bounds of the nested loop kernel.

Phase 3 decides on the existence of a flow dependence at a true iteration of the nested loop computation.

We first define the index set X = {x_1, x_2, ..., x_n} ∈ R^n and assume the definition and reference array variables have subscript expressions defined by

    f_i^gen(X) = A_{i,1} x_1 + A_{i,2} x_2 + ... + A_{i,n} x_n,   (i = 1, ..., m)
    f_j^use(X) = B_{j,1} x_1 + B_{j,2} x_2 + ... + B_{j,n} x_n,   (j = 1, ..., m)

The first phase of the distance test generates the flow dependence distance vector using the procedure described in chapter 4. This procedure is summarised in two steps.

Step 1 generates the difference vector Δ = (δ_1, δ_2, ..., δ_m) where

    δ_i = f_i^gen(X) − f_i^use(X),   (i = 1, ..., m)

Step 2 then forms the intermediate flow dependence system

    Σ_{i=1}^{n} coeff(x_i, f_j^use) · IDist_i = δ_j,   (j = 1, ..., m)        (8.1)

from which the flow dependence distance vector, flow = (IDist_1, IDist_2, ..., IDist_n), is derived.

If a flow dependence distance vector is derivable from system (8.1), i.e. solution(8.1) ≢ ∅, we infer that a flow dependence exists for the def-ref array variable pair. That is, for the definition of an array variable location at iteration (l_1, ..., l_n) ∈ J, the flow dependent iteration, or equivalently the iteration instance which references the same array variable location, is given by

    (l_1 + IDist_1, ..., l_n + IDist_n)        (8.2)

The n-tuple (8.2) is a feasible integer point if and only if it is in J and it represents an integer vector. Assuming that we denote the flow dependent iteration by P_int(I) = (P_1(I), P_2(I), ..., P_n(I)), where the subscript int emphasizes that it is an integer vector, we state below the fundamental theorem for data dependence between an array variable pair, whose proof is obvious.

Theorem 8.1 A data dependence exists between a defining and referencing array variable pair if and only if P_int(I) ∈ J for some integer vector I ∈ J.

The second phase of the data independence test determines if the solution set J_P = {I : P_real(I) ∈ J} for the intermediate flow dependence system (8.1) is empty. The set J_P is derived through a process of variable elimination steps, in which the bounds on the index variables of the index set X are determined for which a real vector P_real(X) is valid in J. The bounds for the index set X define a polyhedron

N, within which P_real(X) ∈ J is true, and J_P ≡ ∅ is true if and only if N is empty. If we deduce that J_P ≡ ∅, we conclude that there is no data dependence between the def-ref array variable pair. If J_P ≢ ∅, a data dependence may exist.

Phase three of the distance test then checks for the existence of P_int(X) in P_real(X). We derive a resolution procedure in which only a small subset of J_P needs to be checked to determine if the integer vector P_int(I) exists for an integer point I ∈ J_P. A data dependence between a def-ref array variable pair is then concluded if and only if such an integer point exists.

In summary, our data independence test concludes a data dependence between a def-ref array variable pair if and only if

    solution(8.1) ≢ ∅  ∧  J_P ≢ ∅  ∧  ∃I ∈ J_P such that P_int(I) ∈ J

8.2 Phase 1: Checking for flow dependence

The first part of the distance test involves generating the flow dependence distance vector. The flow dependence distance vector defines the set of instances, in R^n, data dependent on an iteration. If no flow dependence distance vector exists for the def-ref array variable pair, we conclude no data dependence.

8.2.1 Generating the flow dependence distance vector

We generate the flow dependence distance vector by first formulating the system of intermediate flow distances as a matrix-vector expression A · x = c. Note that A is the m × n matrix of the coefficients of the LHS of system (8.1), x is the vector (IDist_1, IDist_2, ..., IDist_n)^T and c is the vector (δ_1, δ_2, ..., δ_m)^T. The solution of (8.1) thus involves solving the expression

    A · x = c  ⟹

    [ B_{1,1}  B_{1,2}  ...  B_{1,n} ]   [ IDist_1 ]     [ δ_1 ]
    [ B_{2,1}  B_{2,2}  ...  B_{2,n} ] · [ IDist_2 ]  =  [ δ_2 ]
    [   ...      ...    ...    ...   ]   [   ...   ]     [ ...  ]
    [ B_{m,1}  B_{m,2}  ...  B_{m,n} ]   [ IDist_n ]     [ δ_m ]        (8.3)

where every solution of (8.3) is a solution of (8.1) and vice versa.

Solving (8.3) involves first forming the augmented matrix (A | c),

    [ B_{1,1}  B_{1,2}  ...  B_{1,n}  δ_1 ]
    [ B_{2,1}  B_{2,2}  ...  B_{2,n}  δ_2 ]
    [   ...      ...    ...    ...    ... ]
    [ B_{m,1}  B_{m,2}  ...  B_{m,n}  δ_m ]        (8.4)

and row-reducing the augmented matrix into an echelon matrix A' using Gaussian elimination modified for integer values. The echelon matrix A' determines the equivalent echelon system for (8.1) in the following form:

    A'_{1,j1} IDist_{j1} + ... + A'_{1,n} IDist_n = δ'_1
    A'_{2,j2} IDist_{j2} + ... + A'_{2,n} IDist_n = δ'_2
        ...                                                        (8.5)
    A'_{r,jr} IDist_{jr} + ... + A'_{r,n} IDist_n = δ'_r

where j_1 < j_2 < ... < j_r ≤ n and r ≡ rank(A). A free variable is an unknown IDist_i which does not appear at the beginning of any of the equations in (8.5). In the following discussion we denote such a free variable by t_i, and the set of all free variables from (8.5) by T. The solution to system (8.5) is completely characterised by the following three cases:

• r = n. There are as many equations as there are unknowns. The system therefore has a unique solution for the flow dependence distance vector.

• r < n. There are fewer equations than there are unknowns, so there is no unique solution. The flow dependence distance vector will be parameterised by n − r free variables. To solve the system we can arbitrarily assign values to the free variables to derive solutions which satisfy (8.3).

• rank(A) ≠ rank(A'). There are rows in (8.5) of the form 0x_1 + 0x_2 + ... + 0x_n = δ'_i with δ'_i ≠ 0. Such a system is inconsistent, and we can conclude that there is no solution and hence that no flow dependence exists.
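The row-reduction step can be sketched with a fraction-free (cross-multiplication) elimination that keeps all entries integral. This is only an illustration of forming the echelon system and reading off the rank, not the thesis's modified Gaussian elimination; the fixed sizes and the numeric right-hand side are assumptions.

#include <stdio.h>

#define M 2            /* equations */
#define N 3            /* unknowns  */

/* Augmented integer matrix (A | c): M rows, N+1 columns. */
static long a[M][N + 1];

/* Cross-multiplication elimination: keeps all entries integral. */
static int echelon(void)
{
    int row = 0;
    for (int col = 0; col < N && row < M; col++) {
        int piv = -1;
        for (int r = row; r < M; r++)
            if (a[r][col] != 0) { piv = r; break; }
        if (piv < 0)
            continue;                       /* no pivot in this column: free variable */
        for (int k = 0; k <= N; k++) {      /* swap the pivot row into place */
            long t = a[row][k]; a[row][k] = a[piv][k]; a[piv][k] = t;
        }
        for (int r = row + 1; r < M; r++) {
            long f = a[r][col], p = a[row][col];
            for (int k = 0; k <= N; k++)
                a[r][k] = a[r][k] * p - a[row][k] * f;
        }
        row++;
    }
    return row;                             /* rank of the coefficient part */
}

int main(void)
{
    /* Example 8.2 with the symbolic deltas fixed at I = J = K = 1:
       delta1 = I - K = 0, delta2 = K - 2I = -1.                     */
    long init[M][N + 1] = { {1, 1, 1, 0}, {2, 1, 1, -1} };
    for (int r = 0; r < M; r++)
        for (int k = 0; k <= N; k++)
            a[r][k] = init[r][k];

    int r = echelon();
    printf("rank = %d (%s)\n", r,
           r == N ? "unique distance vector" : "free variables remain");
    for (int i = 0; i < M; i++) {
        for (int k = 0; k <= N; k++) printf("%4ld ", a[i][k]);
        printf("\n");
    }
    return 0;
}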

Example 8.1: Consider the following nested loop kernel fragment:

DO I = 0, 20
  DO J = 0, 20
    A(I,J) = ... A(I-1,I-2) ...
  ENDDO
ENDDO

for which X = (I, J). We then generate the difference vector Δ = (δ_1, δ_2),

    δ_1 = I − I + 1 = 1
    δ_2 = J − I + 2

and formulate the matrix-vector expression (8.3) for the intermediate flow system:

    [ 1  0 ] [ IDist_I ]   [ δ_1 ]
    [ 1  0 ] [ IDist_J ] = [ δ_2 ]

Forming the augmented matrix (8.4) and row-reducing, we derive

    [ 1  0  δ_1       ]        [ 1  0  1         ]
    [ 0  0  δ_2 − δ_1 ]   ⟹   [ 0  0  J − I + 1 ]

Since 0·IDist_I + 0·IDist_J ≢ J − I + 1, we must conclude that there is no solution to the intermediate flow dependence system, and hence no data dependence between the array variable pair. □

Example 8.2: Consider another nested loop kernel fragment:

DO I = 1, 200
  DO J = 1, 200
    DO K = 1, 200
      A(2I,J) = ... A(I+K,2I+J-K) ...
    ENDDO
  ENDDO
ENDDO

for which X = (I, J, K). We then generate the difference vector Δ,

    δ_1 = 2I − I − K = I − K
    δ_2 = J − 2I − J + K = K − 2I

and formulate (8.3):

    [ 1  1  1 ] [ IDist_I ]   [ δ_1 ]
    [ 2  1  1 ] [ IDist_J ] = [ δ_2 ]
                [ IDist_K ]

Row-reducing the augmented matrix,

    [ 1   1   1   δ_1          ]
    [ 0  −1  −1   −2δ_1 + δ_2  ]

determines the equivalent echelon system,

    IDist_I + IDist_J + IDist_K = I − K
    −IDist_J − IDist_K = K − 2I

Since r = 2 and n = 3, there is no unique solution to the intermediate flow system. Instead we let IDist_K be the free variable, i.e. IDist_K ≡ t_1 ∈ T, and solve in terms of the free variable:

    IDist_I = −I
    IDist_J = 2I − K − t_1
    IDist_K = t_1

We thus derive the flow dependent iteration instance

    P(X) = (0, J + 2I − K − t_1, K + t_1)  □

8.3 Phase 2: Checking for feasibility

The second part of the distance test involves determining the feasible region within J for which a data dependence exists. We derive this feasible region by first formulating the inequality constraints defining the condition under which a data dependence will exist, and then reducing them using a version of the Fourier-Motzkin variable elimination method for systems of inequalities. If we deduce that the feasible region in J is empty, we conclude no data dependence for the array variable pair concerned.


8.3.1 Hyperplanes and polyhedra

We first define a hyperplane in R^n to be an (n − 1)-dimensional flat, where a flat is a translation of a linear subspace of R^n. Just as a hyperplane in R^3 is described by the equation a_1 x_1 + a_2 x_2 + a_3 x_3 = β, a hyperplane in R^n can similarly be described by the equation

    f(x) = a_1 x_1 + a_2 x_2 + ... + a_n x_n = β

For the remainder of this chapter, we let the set of points {p ∈ R^n : f(p) = β} describe a hyperplane denoted by H ≡ [f : β]. Also, given a hyperplane H ≡ [f : β], we define the closed half-spaces of H to be the sets {p : f(p) ≥ β} or {p : f(p) ≤ β}, denoted by H^+.

An n-polyhedron, denoted N, defines a (possibly degenerate) polytope in R^n, bounded by hyperplanes parallel to the coordinate axes, and hence is assumed compact. The n-polyhedron is defined by a set of n lower and upper bounds for each of the coordinates of a point p_i = (x_1, x_2, ..., x_n) ∈ N, i.e.

    x_1^low ≤ x_1 ≤ x_1^high
    x_2^low ≤ x_2 ≤ x_2^high
        ...
    x_n^low ≤ x_n ≤ x_n^high

For example, a 2-polyhedron describes a rectangle, and a 3-polyhedron a rectangular region defined by rectangular boundary surfaces. The vertices describe the extreme points of the polyhedron. Assuming we amortise the upper and lower bounds of the coordinate x_i as x_i^B = {x_i^high, x_i^low}, the vertices of the polyhedron are described by the set of points V = {(x_1^B, x_2^B, ..., x_n^B)}. Pairs of vertices which do not share the same faces are known as distinguished vertices, and they are described by the pair {(x_1^B, x_2^B, ..., x_n^B), (x̂_1^B, x̂_2^B, ..., x̂_n^B)}, where the hat operator is defined

    x̂_i^B = x_i^high  if B ≡ low
    x̂_i^B = x_i^low   if B ≡ high

The diagonals of N, in the context of this chapter, are defined as the vectors joining the pairs of distinguished vertices. Thus a diagonal is defined as a one-

dimensional vector d_i = (x_1^B − x̂_1^B, x_2^B − x̂_2^B, ..., x_n^B − x̂_n^B). The total set of diagonals of N is defined to be D = {d_i : i = 1, ..., 2^{n−1}}. Now N tightly, or strictly, bounds a convex set, denoted by S, if S is contained in N and every hyperplane defining N intersects S at least once. We also define the centroid of N, denoted by C, to be the point in N whose coordinates are equal to the mid-point of the range of N along each axis.

8.3.2 Formulating the dependence constraint system

The dependence constraint system, S(X), is defined by two sets of inequality constraints: the index boundaries and the dependence boundaries. The index boundaries are a set of linear inequality constraints in echelon form, defined by the generalised trapezoidal loop bounds of the n-nested loop kernel. These boundary conditions, denoted by S1 throughout, are defined:

Lower bounds

    L_1 ≤ x_1                       ⟹  B_1(X) ≥ 0
    L_2(x_1) ≤ x_2                  ⟹  B_2(X) ≥ 0
    L_3(x_1, x_2) ≤ x_3             ⟹  B_3(X) ≥ 0
        ...
    L_n(x_1, ..., x_{n−1}) ≤ x_n    ⟹  B_n(X) ≥ 0

Upper bounds

    U_1 ≥ x_1                       ⟹  B_{n+1}(X) ≥ 0
    U_2(x_1) ≥ x_2                  ⟹  B_{n+2}(X) ≥ 0
    U_3(x_1, x_2) ≥ x_3             ⟹  B_{n+3}(X) ≥ 0
        ...
    U_n(x_1, ..., x_{n−1}) ≥ x_n    ⟹  B_{2n}(X) ≥ 0

Also, given a flow dependent iteration, defined by P(X) = (P_1(X), P_2(X), ..., P_n(X)), the system S(X) is further augmented with the set of dependence boundaries. These define the set of linear inequalities which P(X) must satisfy for it to

be a point in J. This system of inequalities, denoted by S2 throughout, is defined:

Lower bounds

    L_1 ≤ P_1(X)                             ⟹  B_{2n+1}(X) ≥ 0
    L_2(P_1(X)) ≤ P_2(X)                     ⟹  B_{2n+2}(X) ≥ 0
        ...
    L_n(P_1(X), ..., P_{n−1}(X)) ≤ P_n(X)    ⟹  B_{3n}(X) ≥ 0

Upper bounds

    U_1 ≥ P_1(X)                             ⟹  B_{3n+1}(X) ≥ 0
    U_2(P_1(X)) ≥ P_2(X)                     ⟹  B_{3n+2}(X) ≥ 0
        ...
    U_n(P_1(X), ..., P_{n−1}(X)) ≥ P_n(X)    ⟹  B_{4n}(X) ≥ 0

Geometrically, the dependence constraint system describes a set of half-spaces, defined by the hyperplanes H_i = [B_i : 0]. For a data dependence to exist, a necessary condition is that H_1^+ ∩ H_2^+ ∩ ... ∩ H_{4n}^+ ≠ ∅. In other words, there must exist some convex region S in R^n for which every inequality in the system S(X) is true. For example, consider the echelon bounds for the index set i, j:

    0 ≤ i ≤ 14
    −(1/2)·i + 3 ≤ j ≤ i

The convex region S is illustrated by the shaded area in figure 8.1. From convex set theory, we know that the region S is completely defined by its vertices. We do not, however, need to derive all the vertices of S in order to decide if it is non-empty. We need only derive the saddle points of S, which are a subset of the vertices of S, defined to be points resting on a supporting hyperplane x_i = β_i where S ⊄ {x_i R β_i}, with R being either the relationship < or >. The supporting hyperplanes, on which the saddle points rest, define the tightly constraining N for S, for which N is empty if and only if S is empty. We will show later how this constraining N can be obtained by implementing a version of the standard Fourier-Motzkin variable elimination method. If N is found to be empty,

we conclude that a data dependence does not exist. If N is not empty, a data dependence may exist.

Figure 8.1: Feasible convex region S and saddle points

8.3.3 Fourier-Motzkin variable elimination

Given a system of inequalities,

    Σ_{j=1}^{n} a_{i,j} x_j ≤ b_i,   i = (1, ..., m)        (8.6)

we may take a term x_j as the pivot, and partition system (8.6) into three sets of inequalities according to whether a_{i,j} > 0, a_{i,j} < 0, or a_{i,j} ≡ 0:

    a_{i,j} < 0:   x_j ≥ L_1(X − {x_j})
                       ...
                   x_j ≥ L_p(X − {x_j})

    a_{i,j} > 0:   x_j ≤ U_1(X − {x_j})
                       ...
                   x_j ≤ U_q(X − {x_j})

    a_{i,j} ≡ 0:   0 ≤ F_1(X − {x_j})
                       ...
                   0 ≤ F_r(X − {x_j})
                                                        (8.7)

System (8.7) thus defines a reduced system

    L_l(X − {x_j}) ≤ U_u(X − {x_j}),   l = (1, ..., p), u = (1, ..., q)
    0 ≤ F_k(X − {x_j}),                k = (1, ..., r)                        (8.8)

where finding an x_j which satisfies

    max_l L_l(X − {x_j}) ≤ x_j ≤ min_u U_u(X − {x_j}),   l = (1, ..., p), u = (1, ..., q)        (8.9)

implies that x_j satisfies (8.6). This is the standard definition of the Fourier-Motzkin variable elimination method. Note that we may have to repeatedly derive the reduced system (8.8), eliminating further variables, until an irreducible conclusion is derived for (8.9). An irreducible conclusion is defined to be the state

    c_j + 0x_1 + ... + 0x_{j−1} + 0x_{j+1} + ... + 0x_n  R  x_j

where R = {≤, ≥}. Consequently, the worst case time complexity of the standard method approaches O(m^{2^n}).

If an upper or lower bound is found to be irreducible for a term x_j, we conclude that the feasible set of values for x_j is unbounded in (8.6) in the relevant direction. The dependence constraint system S(X) is, however, always bounded for all variables x_j ∈ X. This can be deduced from the sequence for determining the lower bound for x_i from S1. The sequence assumes the intermediate lower bounds to be in terms of some subset of X, or else the deduction is immediate. Hence for x_i,

    B_i(X) ≥ 0  ⟹

    L_i^1(x_1, ..., x_{i−1}) ≤ x_i
    ⟹ L_i^2(x_1, ..., x_{i−2}, x_i) ≤ x_{i−1}                                  (8.10)

    B_{n+i−1}(X) ≥ 0 ⟹ −U_{i−1}(x_1, ..., x_{i−2}) ≤ −x_{i−1}                  (8.11)

    (8.10) + (8.11) ⟹ L_i^3(x_1, ..., x_{i−3}, x_i) ≤ x_{i−2}                  (8.12)

    B_{n+i−2}(X) ≥ 0 ⟹ −U_{i−2}(x_1, ..., x_{i−3}, x_i) ≤ −x_{i−2}             (8.13)

    (8.12) + (8.13) ⟹ L_i^4(x_1, ..., x_{i−4}, x_i) ≤ x_{i−3}
        ...
    L_i^{n−1} ≤ x_i

and so on for all x_j ∈ X − {x_i}. We therefore assert the dependence constraint system S(X) to be compact.
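A single elimination step of this kind can be sketched as follows, with constraints held in the form a·x ≤ b and every lower bound on the pivot paired with every upper bound through a positive linear combination. The dense arrays, fixed capacities and the example system (the bounds of figure 8.1, scaled to integers) are illustrative assumptions only.

#include <stdio.h>

#define NVARS 3
#define MAXC  64

/* A constraint  a[0]x0 + a[1]x1 + a[2]x2 <= b. */
struct ineq { int a[NVARS]; int b; };

static struct ineq sys[MAXC];
static int nsys;

/* Eliminate variable j: every pair (upper bound, lower bound) on x_j is
 * combined with positive multipliers so that x_j cancels; constraints not
 * mentioning x_j are copied through unchanged.                            */
static int eliminate(int j, struct ineq *out)
{
    int nout = 0;
    for (int p = 0; p < nsys; p++) {
        if (sys[p].a[j] == 0) {                  /* F-type constraint */
            out[nout++] = sys[p];
            continue;
        }
        for (int q = 0; q < nsys; q++) {
            if (sys[p].a[j] > 0 && sys[q].a[j] < 0) {
                struct ineq c;
                int mp = -sys[q].a[j];           /* > 0 */
                int mq =  sys[p].a[j];           /* > 0 */
                for (int k = 0; k < NVARS; k++)
                    c.a[k] = mp * sys[p].a[k] + mq * sys[q].a[k];
                c.b = mp * sys[p].b + mq * sys[q].b;
                out[nout++] = c;                 /* c.a[j] is now zero */
            }
        }
    }
    return nout;
}

int main(void)
{
    /* 0 <= i <= 14 and -i/2 + 3 <= j <= i, written with integer coefficients:
       -i <= 0,  i <= 14,  -i - 2j <= -6,  -i + j <= 0                        */
    struct ineq in[] = {
        {{-1,  0, 0},  0}, {{ 1, 0, 0}, 14},
        {{-1, -2, 0}, -6}, {{-1, 1, 0},  0},
    };
    nsys = 4;
    for (int k = 0; k < nsys; k++) sys[k] = in[k];

    struct ineq out[MAXC];
    int n = eliminate(1, out);                   /* eliminate j (index 1) */
    for (int k = 0; k < n; k++)
        printf("%d*i + %d*j + %d*k <= %d\n",
               out[k].a[0], out[k].a[1], out[k].a[2], out[k].b);
    return 0;
}

On this example the elimination of j produces the constraint −3i ≤ −6, i.e. i ≥ 2, consistent with the feasible region sketched in figure 8.1.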

8.3.4 Deriving the compact constraining polyhedron N

Note that a Fourier-Motzkin variable elimination step is equivalent to a positive linear combination of two inequality constraints

    ... + a_{i,k} x_k + ... ≤ β_i        (8.14)
    ... − a_{j,k} x_k + ... ≤ β_j        (8.15)

such that

    a_{j,k} · (8.14) + a_{i,k} · (8.15)  ⟹  ... + 0x_k + ... ≤ a_{j,k} · β_i + a_{i,k} · β_j

for which the term x_k has now been eliminated. From the theory of linear inequalities, the derived consequence is valid for the original system if the additive linear combination involves only positive multiplying constants. We will use this basic variable elimination step in our procedure, described below, to derive the set of irreducible bounds for the set of terms x_i ∈ X.

We describe a procedure which derives the constraining N, for the feasible convex region S of S(X), if it exists. The process involves deliberately generating consequences through positive linear combinations of existing inequalities such that the consequences have a monotonically decreasing number of unknown terms. This


Figure 8.2: Hash table for generated constraint inequalities in terms of (x1, x2, x3)

The first part of the procedure constructs a hash table, H, for which each tabular entry maintains a list of inequality constraints. Every list is distinguished by the terms present in the constraints within the list. Thus, for example, the set of all inequality constraints with terms x1 and x2 only are kept in the same list. The hash table for the dependence constraint system with index terms {x1, x2, x3} is illustrated in figure 8.2. In the example, hash table entry 7 (i.e. 111) maintains the list for constraints S_i(x1 x2 x3), defined to be

    a_{i,1} x1 + a_{i,2} x2 + a_{i,3} x3 ≤ b_i

for which a_{i,1} ≠ 0, a_{i,2} ≠ 0, and a_{i,3} ≠ 0. For convenience we denote this particular table entry as H(x1 x2 x3), for which x1 x2 x3 ≡ 111. Taking another example, table entry 6, i.e. 110, maintains the list for constraints S_i(x1 x2 x̄3), defined to be

    a_{i,1} x1 + a_{i,2} x2 ≤ b_i

for which a_{i,1} ≠ 0 and a_{i,2} ≠ 0. Similarly, we denote the table entry as H(x1 x2 x̄3), for which x1 x2 x̄3 ≡ 110. The procedure forms an elimination graph, and propagates intermediate constraints generated by variable elimination steps. Continuing our example with the dependence constraint system with index terms {x1, x2, x3}, the elimination graph is illustrated in figure 8.3.
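A natural way to realise this keying scheme is to encode the set of terms present in a constraint as a bitmask over the index variables, so that entry 7 (binary 111) holds constraints mentioning x1, x2 and x3, entry 6 (110) those mentioning only x1 and x2, and so on. The sketch below is an illustration of that idea only; the coefficient-array representation and function name are assumptions.

    /* Key a constraint by the set of index terms with non-zero coefficients:
     * bit (nvars-1-t) is set when x_{t+1} appears, so a constraint over
     * (x1, x2, x3) maps to binary 111 = 7 and one over (x1, x2) to 110 = 6,
     * matching the table entries of figure 8.2. */
    static unsigned hash_key(const double a[], int nvars)
    {
        unsigned key = 0;
        for (int t = 0; t < nvars; t++)
            if (a[t] != 0.0)
                key |= 1u << (nvars - 1 - t);
        return key;
    }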



Figure 8.3: Inequality constraint elimination graph

Initially, the inequality constraints in S(X) are distributed across H accordingly. Then, for all S_i ∈ H(x1 x2 x3), we generate intermediate constraints for H'(x1 x x), H'(x x2 x) and H'(x x x3), with x ≡ "don't care". For the step

    H(x1 x2 x3) ⟶ H'(x1 x x)

the procedure successively picks each inequality constraint in H(x1 x2 x3) as a pivot and eliminates the term x1 from the pivot through a positive linear combination with another constraint. It does this by scanning all inequality constraints in the lists H(x1 x2 x3), H(x1 x2 x̄3), H(x1 x̄2 x3) and H(x1 x̄2 x̄3), i.e. H(x1 x x), generating a new constraint with the term x1 eliminated, and appending the new constraint to its relevant list. The procedure then continues, after every constraint in H(x1 x2 x3) has been considered as a pivot, with the step

    H(x1 x2 x3) ⟶ H'(x x2 x)

and so on until the new lists H(x1 x x), H(x x2 x) and H(x x x3) have been generated. The procedure then continues with the steps

    H(x1 x2 x̄3) ⟶ H(x1 x x)
    H(x1 x̄2 x3) ⟶ H(x1 x x)
    H(x̄1 x2 x3) ⟶ H(x x2 x)

until we have derived the irreducible lists H(x1 x̄2 x̄3), H(x̄1 x2 x̄3), and H(x̄1 x̄2 x3), i.e. entries 100, 010 and 001. The whole procedure is succinctly described by the elimination graph illustrated in figure 8.3.


Having derived the set of irreducible bounds for x_i ∈ X, a set of upper bound constraints may collectively be true:

    x_i ≤ U_i^1
    x_i ≤ U_i^2
       ⋮
    x_i ≤ U_i^k

In such a situation we derive the tightest upper bound for x_i such that

    x_i ≤ min(U_i^1, U_i^2, ..., U_i^k) ≡ U_i^min

since x_i ≤ U_i^min ≤ U_i^u for every u = 1, ..., k. Similarly, a set of lower bound constraints may collectively be true for a term x_i:

    x_i ≥ L_i^1
    x_i ≥ L_i^2
       ⋮
    x_i ≥ L_i^k

We again derive the tightest lower bound for x_i such that

    x_i ≥ max(L_i^1, L_i^2, ..., L_i^k) ≡ L_i^max

since x_i ≥ L_i^max ≥ L_i^l for every l = 1, ..., k. The set of bounds {L_i^max ≤ x_i ≤ U_i^min} defines the compact constraining polyhedron N for the dependence constraint system. The polyhedron is empty if U_i^min < L_i^max for some i, from which we can deduce that a data dependence is not feasible.

Example 8.3: Continuing the earlier example with the nested loop kernel
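The bound tightening and emptiness check can be stated very compactly in code. The sketch below assumes the per-variable upper and lower bound candidates have already been collected into arrays; the array and function names are illustrative, not those of our implementation.

    #include <float.h>

    /* Collapse k upper-bound and k lower-bound candidates for each of n variables
     * into U_min and L_max, and report an empty polyhedron N if they cross. */
    static int polyhedron_empty(int n, int k, const double *upper, const double *lower,
                                double Umin[], double Lmax[])
    {
        for (int i = 0; i < n; i++) {
            Umin[i] = DBL_MAX;
            Lmax[i] = -DBL_MAX;
            for (int c = 0; c < k; c++) {
                if (upper[c * n + i] < Umin[i]) Umin[i] = upper[c * n + i];
                if (lower[c * n + i] > Lmax[i]) Lmax[i] = lower[c * n + i];
            }
            if (Umin[i] < Lmax[i])
                return 1;   /* N is empty: no data dependence is feasible */
        }
        return 0;
    }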


DO I = 1, 200
  DO J = 1, 200
    DO K = 1, 200
      A(2*I,J) = ... A(I+K,2*I+J-K) ...
    ENDDO
  ENDDO
ENDDO

from which we have derived P(X) = (0, J + 2I − K − t_1, K + t_1). We can then formulate the dependence system S(X), with the index boundaries (S1)

    1 ≤ I ≤ 200
    1 ≤ J ≤ 200
    1 ≤ K ≤ 200

and the dependence boundaries

    1 ≤ 0 ≤ 200   (!)
    1 ≤ J + 2I − K − t_1 ≤ 200
    1 ≤ K + t_1 ≤ 200

The dependence system is inconsistent (i.e. 1 ≤ 0 ≤ 200 cannot hold). We therefore deduce that there is no flow dependence between the array variables within the loop bounds of the nested loop kernel. □

Example 8.4: Consider the dependence system S(X) derived from the nested loop kernel

DO I = 1, 500
  DO J = 2*I, 500
    A(J,I) = ... A(I,J) ...
  ENDDO
ENDDO


such that S1 is defined to be

    1 ≤ I ≤ 500    ⟹   I − 500 ≤ 0,   −I + 1 ≤ 0
    2I ≤ J ≤ 500   ⟹   J − 500 ≤ 0,   2I − J ≤ 0

We derive the distance vector for the flow dependence between the array variable pair to be d_flow = (J − I, I − J), from which we define the flow dependent iteration to be P(X) = (J, I). Thus S2 is defined to be

    1 ≤ J ≤ 500    ⟹   J − 500 ≤ 0,   −J + 1 ≤ 0
    2I ≤ I ≤ 500   ⟹   I − 500 ≤ 0,   I ≤ 0

The hash table H(IJ) is initially filled as follows:

    H(11)                  H(10)           H(01)
    2I − J ≤ 0  (pivot)    I − 500 ≤ 0     J − 500 ≤ 0
                           −I + 1 ≤ 0      −J + 1 ≤ 0
                           I ≤ 0

Since H(11) has 2I − J ≤ 0 as the sole constraint in its list, it will be the only pivot constraint in the elimination steps

    H(IJ) ⟶ H(Ix)
    H(IJ) ⟶ H(xJ)

For the step H(IJ) ⟶ H(Ix), we generate a new constraint {−J + 2 ≤ 0} into H(01). Further, for the step H(IJ) ⟶ H(xJ), we generate a new constraint {2I − 500 ≤ 0} into H(10). The collective bounds of H(01) and H(10) are found to be inconsistent. In particular, U_I^min ≤ 0 and L_I^max ≥ 1. Hence N ≡ ∅, and we conclude there is no data dependence between the array variable pair defining the dependence constraint system S(X). □


8.4 Phase 3: Checking for integer solutions

Now, given a flow dependent iteration defined to be P(X) = (P_1(X), ..., P_n(X)), a data dependence exists if and only if there exists some integer point I = (l_1, ..., l_n) ∈ S such that P(I) is an integer point in J. Or, geometrically,

    ∃ I ∈ S such that P_int(I) ∈ J

where S is a feasible convex solution set for the dependence constraint system. Checking for integer solutions in a convex subspace is known to be a problem in integer programming and hence NP-complete. In our approach we break the dependence test into two parts:

- a test to determine if an integer solution is feasible for a point P(I) in R^n, and
- an exact technique to determine if there is an integer point I in S such that there exists an integer point P_int(I) in J.

Application of the first part only will give an approximate solution to whether a data dependence exists between a def-ref array variable pair. The advantage, of course, is that it gives an approximate decision quickly. Application of both parts of the test will derive an exact solution to the data independence test for the variable pair concerned. The cost is the additional compilation time needed in the analysis.

8.4.1 Integer solvability

The coordinates of a flow dependent iteration P(X) define the integer constraint system:

    P_1(X) = a_{1,1} x_1 + a_{1,2} x_2 + ... + a_{1,n} x_n + c_1 = σ_1
    P_2(X) = a_{2,1} x_1 + a_{2,2} x_2 + ... + a_{2,n} x_n + c_2 = σ_2
       ⋮                                                                              (8.16)
    P_n(X) = a_{n,1} x_1 + a_{n,2} x_2 + ... + a_{n,n} x_n + c_n = σ_n

with a_{i,j}, c_i, σ_i ∈ R, ∀ i, j ∈ [1, n]. System (8.16) is integer-solvable if and only if there exists I = (l_1, ..., l_n) ∈ Z^n such that P(I) = (σ_1, ..., σ_n) ∈ Z^n. We further define two sub-classes of integer solvability: row integer-solvability and column integer-solvability.


The system (8.16) is row integer-solvable if it passes the GCD test. Assume a linear functional

    b_1 x_1 + b_2 x_2 + ... + b_n x_n = c                                             (8.17)

with b_1, ..., b_n, c ∈ Z. We know from number theory that (8.17) is integer-solvable for an integer set (l_1, ..., l_n) if and only if GCD(b_1, ..., b_n) divides c evenly. Now, the integer constraint system (8.16) may be normalised by a set of integer factors, λ_i, such that

    λ_1 P_1(X) = a'_{1,1} x_1 + a'_{1,2} x_2 + ... + a'_{1,n} x_n + c'_1 = σ'_1
    λ_2 P_2(X) = a'_{2,1} x_1 + a'_{2,2} x_2 + ... + a'_{2,n} x_n + c'_2 = σ'_2
       ⋮                                                                              (8.18)
    λ_n P_n(X) = a'_{n,1} x_1 + a'_{n,2} x_2 + ... + a'_{n,n} x_n + c'_n = σ'_n

with a'_{i,j}, σ'_i ∈ Z, ∀ i, j ∈ [1, n]. The integer constraint system (8.16) is then defined to be row integer-solvable if

    GCD(a'_{i,1}, ..., a'_{i,n}) evenly divides c'_i,    i = 1, ..., n

in which case we can deduce that there exists an integer set (l_1, ..., l_n, σ_i) which satisfies each coordinate equation P_i(l_1, ..., l_n) = σ_i separately. We can further reformulate the integer constraint system (8.16) into a coupled form:

    φ_{1,1}(x_1) + φ_{1,2}(x_2) + ... + φ_{1,n}(x_n) = σ_1
    φ_{2,1}(x_1) + φ_{2,2}(x_2) + ... + φ_{2,n}(x_n) = σ_2
       ⋮                                                                              (8.19)
    φ_{n,1}(x_1) + φ_{n,2}(x_2) + ... + φ_{n,n}(x_n) = σ_n

where φ_{i,j}(x_j) = a_{i,j} x_j + {0, c_i}. The form depicts P_i(X) = σ_i as a sum of n linear sub-functionals, in which one and only one will take the form a_{i,j} x_j + c_i for j = 1, ..., n. The coupled form (8.19) is column integer-solvable if, for every column j = 1, ..., n, there exists an integer set (l_1, ..., l_n, t_1, ..., t_n) such that

    φ_{1,j}(l_j) = a_{1,j} l_j + K_{1,j} = t_1
    φ_{2,j}(l_j) = a_{2,j} l_j + K_{2,j} = t_2
       ⋮                                                                              (8.20)
    φ_{n,j}(l_j) = a_{n,j} l_j + K_{n,j} = t_n
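A sketch of the row test, assuming each row has already been normalised to integer coefficients a'_{i,j} and constant c'_i as in (8.18), might look as follows; the function and array names are illustrative only.

    /* Euclid's algorithm for the greatest common divisor of two integers. */
    static long gcd(long a, long b)
    {
        if (a < 0) a = -a;
        if (b < 0) b = -b;
        while (b != 0) { long r = a % b; a = b; b = r; }
        return a;
    }

    /* Row integer-solvability (GCD test): every normalised row must have
     * GCD(a'_{i,1}, ..., a'_{i,n}) dividing c'_i evenly.
     * `a` is the n*n coefficient matrix in row-major order. */
    static int row_integer_solvable(int n, const long *a, const long *c)
    {
        for (int i = 0; i < n; i++) {
            long g = 0;
            for (int j = 0; j < n; j++)
                g = gcd(g, a[i * n + j]);
            if (g != 0 && c[i] % g != 0)
                return 0;   /* this row admits no integer solution */
        }
        return 1;
    }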


where K_{i,j} ∈ {0, c_i}, i = 1, ..., n. Now, for every

    a_{i,j} x_j + K_{i,j} = t_i                                                       (8.21)

we can always scale (8.21) with an integer factor λ_i such that

    λ_i a_{i,j} x_j + λ_i K_{i,j} = λ_i t_i,   i.e.   a'_{i,j} x_j + K'_{i,j} = λ_i t_i    (8.22)

with a'_{i,j}, K'_{i,j}, λ_i ∈ Z. The form (8.22) is a linear diophantine equation in two variables, the solution of which may be derived using the Extended Euclid's Algorithm. Letting g denote the greatest common divisor of a'_{i,j} and λ_i, (8.22) has an integer solution for (x_j, t_i) if and only if g divides K'_{i,j} evenly. If K'_{i,j} is divisible by g, (8.22) has an infinite number of solutions, where the general solution for (8.22) in parametric form, in terms of the free integer set (t_i^1, z_i^1), is given by

    x_j = (λ_i / g) t_i^1 − (K'_{i,j} / g) x_j^0
    t_i = (a'_{i,j} / g) z_i^1 − (K'_{i,j} / g) t_i^0

with (x_j^0, t_i^0) being a specific solution to a'_{i,j} x_j^0 − λ_i t_i^0 = g. We can therefore deduce the integer solvability of the column (8.20) by deriving the parametric form for the column term x_j:

    φ_{1,j}(x_j) = t_1^1   ⟹   x_j = ψ_{1,j}^1(t_1^1)
    φ_{2,j}(x_j) = t_2^1   ⟹   x_j = ψ_{2,j}^1(t_2^1)
       ⋮                                                                              (8.23)
    φ_{n,j}(x_j) = t_n^1   ⟹   x_j = ψ_{n,j}^1(t_n^1)

and performing a triangular solve:

    step = 1                    step = 2                         ...   step = n
    x_j = ψ_{1,j}^1(t_1^1)      t_1^1 = ψ_{2,j}^2(t_2^2)               t_1^{n−1} = ψ_{n,j}^n(t_n^{n−1})
    x_j = ψ_{2,j}^1(t_2^1)         ⋮                                                                    (8.24)
       ⋮                        t_1^1 = ψ_{n−1,j}^2(t_{n−1}^2)
    x_j = ψ_{n,j}^1(t_n^1)
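The two-variable diophantine step above rests on the extended Euclidean algorithm. The sketch below computes g = gcd(a, b) together with a specific solution, from which the parametric family of solutions can be written down as in the text; it is an illustrative sketch under the stated assumptions, not our implementation.

    /* Extended Euclid: returns g = gcd(a, b) and fills (*x, *y) with a pair
     * satisfying a*(*x) + b*(*y) = g. */
    static long ext_gcd(long a, long b, long *x, long *y)
    {
        if (b == 0) { *x = 1; *y = 0; return a; }
        long x1, y1;
        long g = ext_gcd(b, a % b, &x1, &y1);
        *x = y1;
        *y = x1 - (a / b) * y1;
        return g;
    }

    /* Solve a'*xj - lambda*t = -Kp, i.e. equation (8.22) rearranged.
     * Returns 0 if no integer solution exists; otherwise (*xj0, *t0) is a
     * specific solution and the general solution is
     *   xj = *xj0 + (lambda/g)*k,   t = *t0 + (a'/g)*k,   for any integer k. */
    static int solve_dioph(long ap, long lambda, long Kp, long *xj0, long *t0)
    {
        long u, v;
        long g = ext_gcd(ap, -lambda, &u, &v);   /* ap*u + (-lambda)*v = g */
        if (g == 0 || Kp % g != 0) return 0;
        long s = -Kp / g;
        *xj0 = u * s;
        *t0  = v * s;
        return 1;
    }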


where for each step = 1, ..., n−1, we continuously solve

    ψ_{1,j}^{step}(t_1^{step}) = ψ_{k,j}^{step}(t_k^{step}),    ∀ k ∈ [2, n − step + 1]          (8.25)

The integer constraint system (8.16) is not column integer-solvable if there is no coupled form (8.19) which is integer-solvable for all column terms x_j, j = 1, ..., n.

The procedure we have described for determining the column integer-solvability of (8.19) also derives the general solution for the column term x_j in parametric form. The solution for x_j can be deduced by back-substitution of all the t_i^{step} derived for step = 1, ..., n−1:

    x_j = ψ_{1,j}^1(ψ_{1,j}^2( ... (ψ_{n,j}^{n−1}(t_n^{n−1})) ... )) = α_j t_n^{n−1} + β_j        (8.26)

Note that because the terms φ_{i,j}(x_j) are assumed integers, which implies the σ_i are integers, the t_i^{step} are also integers, and an integer value for t_i^{n−1} determines an integer value for x_j. This interpretation of the parametric solution for the column term x_j in (8.16) will become useful later in the exact form of the integer solvability test. Having determined whether (8.16) is either column or row integer-solvable or both, we can then further deduce whether there exists an integer solution set (l_1, ..., l_n, σ_1, ..., σ_n) which satisfies (8.16) from the theorem below.

Theorem 8.2 The integer constraint system is integer solvable if and only if it is both row integer-solvable and column integer-solvable.

Proof: We first show that it is a necessary condition that (8.16) be row integer-solvable for the system to be integer-solvable. We prove this by contradiction: assume to the contrary that P_1(X) = σ_1 is not integer-solvable. Then there cannot be a set of integer vectors I = (l_1, ..., l_n) such that σ_1 is also integer. The row integer-solvability case is therefore a necessary condition. Now assume P_i(X) = σ_i, i = 1, ..., n, are all row integer-solvable. We prove the sufficiency of column integer-solvability to indicate that (8.16) is also integer-solvable. For the if case: assume (8.19) is column integer-solvable. Then there exists φ_{i,j}(l_j) (∀ i, j ∈ [1, n]) which is integer for some integer index set (l_1, ..., l_n). Since φ_{i,j}(l_j) (∀ i, j ∈ [1, n]) is integer, σ_i


(∀ i ∈ [1, n]) is also integer. Hence (8.16) is integer-solvable. For the only if case: assume (8.16) is integer-solvable. Then there exists some integer index set (l_1, ..., l_n) such that σ_i (∀ i ∈ [1, n]) is integer. For this there must exist a set of rational numbers φ_{i,j}(l_j) (∀ i, j ∈ [1, n]) such that σ_i (∀ i ∈ [1, n]) is integer. Since φ_{i,j}(l_j) (∀ i, j ∈ [1, n]) is rational, we can always find an integer scale factor λ_i such that λ_i φ_{i,j}(l_j) (∀ i, j ∈ [1, n]) is integer, for which σ_i (∀ i ∈ [1, n]) must also be integer. Hence (8.16) must be column integer-solvable. Thus (8.16) is integer-solvable if and only if it is both row and column integer-solvable. □

Theorem 8.2 tells us that determining the row and column integer-solvability properties of the integer constraint system allows us to deduce whether the system itself is integer-solvable. If the integer constraint system is found not to be integer-solvable, we can conclude that there is no data dependence between the def-ref array variables concerned. If the integer constraint system is integer-solvable, there may be an integer set (l_1, ..., l_n) ∈ S such that there exists an integer point (σ_1, ..., σ_n) ∈ J for a data dependence to exist. The next section describes how an exact solution to this question can be derived.

8.4.2 The exact integer test

Having decided that an integer solution must exist for the integer constraint system (8.16), we next determine if such an integer solution lies within the defined feasible region. To illustrate the concepts presented in this section, we first consider an example.

Example 8.5: Consider figure 8.4, where a convex region S' in R^2 is defined by a line on which the two points (1,1) and (7,3) lie. Assume we are given a point in S', p = (2.5, 1.5), and our task is to determine the closest integer points to p in S'. The solution is easily determined if we derive the parametric form of the equation for S', for which S' = {(x, y) : x = 3t + 1, y = t + 1} and p = (2.5, 1.5) ⟹ t = 0.5. Because the parametric form for S' has only integer coefficients,

    x = 3t + 1
    y = t + 1



Figure 8.4: The convex region S'

The nearest integer points to p in S' are therefore given by the closest integers to the free value t = 0.5:

    (3⌈0.5⌉ + 1, ⌈0.5⌉ + 1) = (4, 2)
    (3⌊0.5⌋ + 1, ⌊0.5⌋ + 1) = (1, 1)

which is precisely what we can derive from figure 8.4. □

Now, from the previous test for integer solvability, we have already derived the general parametric form of the solution for x_j in the feasible convex region S,

    x_j = α_j t_j + β_j

where α_j, β_j ∈ Z, and t_j is the free variable such that

    t_j ∈ Z  ⟹  φ_{i,j}(x_j) ∈ Z  ⟹  σ_i ∈ Z

The exact integer test thus takes the form:

- look for an initial feasible solution in S, and
- determine the set of nearest integer solutions for the integer constraint system and check to see if it is in S.

For the first part, determining an initial feasible solution in S turns out to be easy: the centroid of the compact constraining polyhedron N is always in S. To derive this conclusion, first consider lemma 8.1 below.


Lemma 8.1: The smallest convex sets strictly bounded by N are described by the set D.

Proof: We assume the nondegenerate case where dim N ≥ 1. Now, a straight line is a convex set, because a line joining any two points on the original line must also lie in the convex set. The smallest convex region bounded by N must therefore be a straight line (except for the trivial case when N is a point). For a convex region to be strictly bounded by N, it must be intersected by all the bounding hyperplanes. This is so only for the lines joining the distinguished vertices. To see this, assume to the contrary that the pair {(x_1^B, x_2^B, ..., x_n^B), (x_1^B, x̂_2^B, ..., x̂_n^B)}, which does not define a diagonal, describes the smallest strictly bounded convex region. The line joining the two points never intersects the hyperplane x_1 = x̂_1^B, hence it cannot be a convex set strictly bounded by N. □

Because the smallest strictly bounded convex sets in N must be the diagonals joining the distinguished vertices, i.e. the diagonals in D, and because we also know from geometry that these diagonals must intersect the centroid of N, we thus conclude the following corollary.

Corollary 8.1: The centroid is always part of the bounded convex region within the n-polyhedron N.

Having determined the polyhedron bounds

    L_1 ≤ x_1 ≤ U_1
    L_2 ≤ x_2 ≤ U_2
       ⋮
    L_n ≤ x_n ≤ U_n

the centroid C can always be determined to be the point (c_1, ..., c_n), where

    c_i = (U_i − L_i)/2 + L_i

The centroid C is used as the initial feasible point in S. Note that if C is an integer point, and P(C) is also an integer point, we have already determined that a feasible integer point exists.
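A direct rendering of this initial-point choice, assuming the tightest bounds have already been collected into arrays L and U as in the earlier sketch, is shown below; it only hints at the integrality check, since the check on P(C) depends on the particular dependence system.

    #include <math.h>

    /* Centroid of the box L[i] <= x_i <= U[i]; returns 1 if every coordinate of
     * the centroid is already an integer (so only P(C) remains to be checked). */
    static int centroid(int n, const double L[], const double U[], double C[])
    {
        int integral = 1;
        for (int i = 0; i < n; i++) {
            C[i] = (U[i] - L[i]) / 2.0 + L[i];
            if (floor(C[i]) != C[i])
                integral = 0;
        }
        return integral;
    }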


For the case where the coordinates of C are not integer, we determine the closest points (l_1, ..., l_n, P_1(l_1, ..., l_n), ..., P_n(l_1, ..., l_n)) around (c_1, ..., c_n, P_1(c_1, ..., c_n), ..., P_n(c_1, ..., c_n)), and check for feasibility. For this, we first determine the free value t_i^0 at which c_i occurs:

    t_i^0 = (c_i − β_i) / α_i

Now, defining

    ⌊c_i⌋ = α_i ⌊t_i^0⌋ + β_i    and    ⌈c_i⌉ = α_i ⌈t_i^0⌉ + β_i

we can determine the closest surrounding integer points

    q_1     = (⌊c_1⌋, ⌊c_2⌋, ..., ⌊c_n⌋)
    q_2     = (⌊c_1⌋, ⌊c_2⌋, ..., ⌈c_n⌉)
       ⋮
    q_{2^n} = (⌈c_1⌉, ⌈c_2⌉, ..., ⌈c_n⌉)

and check that the boundary points

    e_i = (q_i | P(q_i)),    ∀ q_i ∈ Q

are integer points, with Q = {q_i : i = 1, ..., 2^n}. If e_i is an integer point, we check for the feasibility of e_i by determining if the predicate
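Enumerating the 2^n floor/ceiling combinations around the centroid is a simple bit-pattern walk. The sketch below builds each candidate q by rounding the free values t_i^0 and mapping them back through x_i = α_i t_i + β_i; the feasibility predicate itself (membership in S and J) is left abstract, since it depends on the particular constraint system, and the function names are assumptions.

    #include <math.h>

    /* Enumerate the 2^n surrounding points: bit i of `mask` selects ceil (1) or
     * floor (0) of t_i^0 for coordinate i.  Returns 1 as soon as the supplied
     * feasibility predicate accepts a candidate.  Assumes n <= 64 for the sketch. */
    static int surrounding_points(int n, const double t0[], const double alpha[],
                                  const double beta[],
                                  int (*feasible)(int n, const double q[]))
    {
        double q[64];
        for (unsigned long mask = 0; mask < (1ul << n); mask++) {
            for (int i = 0; i < n; i++) {
                double t = ((mask >> i) & 1) ? ceil(t0[i]) : floor(t0[i]);
                q[i] = alpha[i] * t + beta[i];
            }
            if (feasible(n, q))
                return 1;   /* a feasible integer point exists */
        }
        return 0;
    }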

    q_i ∈ S  ∧  P(q_i) ∈ J

is true. If not, a feasible integer point may exist tightly nestled between some boundary points. To check for such an integer point, we define

    U(c_i) = α_i (⌊t_i^0⌋ →^{1/α_i} t_i^0) + β_i
    L(c_i) = α_i (⌈t_i^0⌉ →^{1/α_i} t_i^0) + β_i

and check for feasibility:

    e'_1     = (L(c_1), ..., L(c_n)) ∈ S  ∧  P(e'_1) ∈ J
    e'_2     = (L(c_1), ..., U(c_n)) ∈ S  ∧  P(e'_2) ∈ J
       ⋮
    e'_{2^n} = (U(c_1), ..., U(c_n)) ∈ S  ∧  P(e'_{2^n}) ∈ J


We take strides of 1/α_i for each free variable t_i because t_i = ⌊t_i^0⌋ + K(1/α_i) and t_i = ⌈t_i^0⌉ − K(1/α_i), for some integer K, also determine an integer value for the term x_i. If a feasible integer point has still not been found at the end of the procedure described, we have determined exactly that no integer solution to the integer constraint system satisfies the feasible bounds. We can then definitely conclude that there is no data dependence between the def-ref array variables concerned.

8.5 Experimental results

In this section we look at an implementation of the distance test. We demonstrate the feasibility of the distance test for use in a parallelizing FORTRAN compiler by measuring the execution time of the test to resolve the data dependence for a statistically significant number of randomly generated array subscript pairs. We show that our implemented version of the distance test exhibits tractable execution times. We also look at the specific optimisations to the algorithms in our implementation which produce these tractable execution times.

8.5.1 The implemented algorithm

We have implemented a preliminary version of the distance test in C. The control flow for the main top-level procedure is described in the pseudo code below:


Input:  subscript expressions for array variable pair, dimensions of array variable and loop kernel
Output: dependent/independent, flow dependence distance vector

solve intermediate flow system
if ( no solution exists )
    return independent
else
    formulate dependence constraint system
    derive polyhedron bounds through variable elimination
    if ( polyhedron empty )
        return independent
    else
        check for integer solvability
        if ( not integer solvable )
            return independent
        else
            generate surrounding integer points
            scan enclosing region
            if ( valid integer point found )
                return dependent
            else
                return independent
            endif
        endif
    endif
endif

The implemented distance test is simply the collection of tests associated with the three phases described in the main part of this chapter: it derives the flow dependence distance vector, checks for feasibility within the real bounds of the execution loops, and finally checks for integer solvability within these bounds. We have run a series of timing experiments on a set of randomly generated scenarios which the distance test may encounter in a parallelizing FORTRAN compiler. Each scenario succinctly describes a synthetic nested loop kernel of the form

    Parameter          Range
    n                  2 → 3
    m                  1 → 2 (if n = 2);  2 → 3 (if n = 3)
    u_{i,j}            -2 → 2
    c_i                500 → 700
    A_{i,j}, B_{i,j}   -4 → 4

Table 8.1: Parameter ranges for timing analysis experiment

DO x1 = 1, c1
  DO x2 = 1, u_{2,1} x1 + c2
    ...
      DO xn = 1, u_{n,1} x1 + ... + u_{n,n-1} x_{n-1} + cn
        A(f_1^gen(X), ..., f_m^gen(X)) = ...
        ... = A(f_1^use(X), ..., f_m^use(X)) ...
      ENDDO
    ...
  ENDDO
ENDDO

with X being the index set {x1, ..., xn} and f_i^gen(X), f_i^use(X) being linear functionals in terms of X defined to be

    f_i^gen(X) = A_{i,1} x_1 + ... + A_{i,n} x_n + A_{i,n+1}
    f_i^use(X) = B_{i,1} x_1 + ... + B_{i,n} x_n + B_{i,n+1}

For our experiment we generate random values, within specified ranges, for the set of defining parameters {n, m, c_i, u_{i,j}, A_{i,j}, B_{i,j}}, where the subscripts i and j range as defined in the above synthetic loop kernel. The random number generator used is the lrand48() function in UNIX. We summarise the ranges generated for the set of defining parameters in our experiments in table 8.1.
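A scenario generator along these lines is straightforward to write with lrand48(); the helper below draws an integer uniformly from a closed range and is only an illustrative sketch of how the parameters of table 8.1 might be drawn, not a transcript of the experimental harness.

    #include <stdlib.h>

    /* Uniform integer in [lo, hi] using the UNIX lrand48() generator. */
    static long draw(long lo, long hi)
    {
        return lo + (long)(lrand48() % (hi - lo + 1));
    }

    static void make_scenario(void)
    {
        long n  = draw(2, 3);                 /* loop nest depth                   */
        long m  = (n == 2) ? draw(1, 2)       /* array dimensions                  */
                           : draw(2, 3);
        long c1 = draw(500, 700);             /* a loop bound constant c_i         */
        long u  = draw(-2, 2);                /* a loop bound coefficient u_{i,j}  */
        long A  = draw(-4, 4);                /* a subscript coefficient A_{i,j}   */
        (void)m; (void)c1; (void)u; (void)A;  /* values would be filled into the kernel */
    }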


8.5.2 Implementation details and approximating assumptions

Our implemented version of the distance test is not exact, the reason being that we make conservative approximations when we determine that the execution time of the algorithm may become intolerable. We need to make these conservative assumptions because of the worst case exponential time complexity of the distance test. In this section, we discuss some interesting aspects of our implementation of the distance test, and describe the approximating assumptions that are sometimes made by our implemented version of the test.

In the distance test two possible situations may occur where the execution time of the algorithm can become intolerable. The first situation is the intermediate constraint explosion which can result from using the Fourier-Motzkin variable elimination solution method. The second is when there is a very large number of points, surrounding the centroid, which are required to be examined to determine an exact solution to the dependence constraint system.

As was previously noted, the time complexity of the Fourier-Motzkin variable elimination procedure is O(m^{2^n}). As the number of dimensions increases, the number of intermediate constraints that can be generated increases at a rate which can become intolerable. To improve the efficiency of the Fourier-Motzkin method, we introduce two specific optimisations.

The first optimisation decides if an intermediate constraint, generated in the elimination procedure, adds to our current knowledge. If it does not, we throw it away; otherwise the intermediate constraint is added to the original system. We determine this by checking if the linear functional defining the intermediate constraint intersects the compact constraining polyhedron N already derived. Three cases completely define the situation when an intermediate constraint, defined by a hyperplane H_i = [f_i(X) ≤ 0], is generated:

1. The linear functional f_i(X) intersects N, in which case a new, tighter polyhedron may be derived.

2. The linear functional f_i(X) does not intersect N and S ⊆ H_i^+, in which case the constraint is redundant.


3. The linear functional f_i(X) does not intersect N and S ⊄ H_i^+, in which case the polyhedron N is not in H_i^+; H_i is inconsistent and hence the system is inconsistent.

This can be easily illustrated for the case in two dimensions, as shown in figure 8.5. Checking which case describes a constraint is relatively straightforward. A linear functional f_i(X) intersects the polyhedron if there exist some v, v' ∈ V (i.e. the set of vertices) such that f_i(v) < 0 and f_i(v') > 0. If f_i(v) < 0 for all v ∈ V, we can conclude that the intermediate constraint is redundant. If f_i(v) > 0 for all v ∈ V, we must conclude that the system is inconsistent.

Our strategy is therefore quite simple. We first generate the largest possible polyhedron bounds that a term can take. That is, given the bound for a term initially defined as

    x_1  R  l_2 x_2 + l_3 x_3 + ... + l_n x_n + l_{n+1}                               (8.27)

and assuming R ≡ {≤}, we can derive the initial upper polyhedron bound for x_1 by adopting the substitution policy where we replace the x_k term in the RHS of (8.27) with

- its largest extreme value if l_k > 0, or
- its lowest extreme value if l_k < 0.

Similarly, assuming R ≡ {≥}, we can derive the initial lower polyhedron bound for x_1 by adopting the substitution policy where we replace the x_k term in the RHS of (8.27) with

- its lowest extreme value if l_k > 0, or
- its largest extreme value if l_k < 0.

In our implementation, the bounds for the polyhedron N are kept in the two arrays PhysicalUpper and PhysicalLower, which also define the currently derived set V. The PhysicalUpper and PhysicalLower arrays are then continuously updated through a MIN and MAX operator every time a polyhedron bound constraint is generated in the hash table H during the elimination process. Then, before an intermediate constraint is added to the original system, we simply check for one of the three cases above.
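The case analysis can be implemented by evaluating the intermediate constraint's linear functional at the 2^n corner vertices defined by the PhysicalLower and PhysicalUpper arrays. The sketch below illustrates that check under the coefficient-array representation assumed in the earlier sketches; the enum names are invented for the example.

    enum cls { TIGHTENS, REDUNDANT, INCONSISTENT };

    /* Evaluate f(X) = sum_i a[i]*x[i] - b at every corner of the box
     * [PhysicalLower, PhysicalUpper] and classify the constraint f(X) <= 0.
     * Assumes n is small enough that 2^n corners can be enumerated. */
    static enum cls classify(int n, const double a[], double b,
                             const double PhysicalLower[], const double PhysicalUpper[])
    {
        int neg = 0, pos = 0;
        for (unsigned long mask = 0; mask < (1ul << n); mask++) {
            double f = -b;
            for (int i = 0; i < n; i++)
                f += a[i] * (((mask >> i) & 1) ? PhysicalUpper[i] : PhysicalLower[i]);
            if (f < 0) neg = 1;
            if (f > 0) pos = 1;
            if (neg && pos) return TIGHTENS;      /* the hyperplane cuts the box: case 1 */
        }
        if (neg) return REDUNDANT;                /* box entirely on the feasible side: case 2 */
        return INCONSISTENT;                      /* box entirely on the infeasible side: case 3 */
    }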



Figure 8.5: The three cases in two dimensions: i) C1 defines a new tighter polyhedron N, ii) C2 is redundant, and iii) C3 is inconsistent.

If the constraint belongs to case one, we add the constraint to the system. If it is case two, we delete the constraint, and if it is case three, we exit the procedure, concluding that the system is inconsistent.

Our second optimisation is the data structure adopted to implement the hash table H. We designed the hash table so as to enable fast lookup for variable elimination, i.e. for the selection of the pivot constraint and of the constraint involved in a variable elimination with the pivot. To this end, we have adopted a two-tier hash table scheme. At the top layer, a constraint is hashed according to the presence of a term. Thus, as was explained before, a constraint S_i(x1 x̄2 x3) is hashed to 101, since only the terms x1 and x3 are present. Each hash table entry H(X) also points to another hash table, with the constraints hashed according to a unique key determined from the signs of the terms present. We denote this hash table by a prime, H', and the hash function which indexes it by F(X). A constraint S_i(X) (intermediate or otherwise) is always added to the hash table entry H'(F(S_i(X))). The H hash table entry does not hold any constraints. Instead it holds a pair of array fields, POS[i] and NEG[i].
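A sketch of how such a two-tier entry might be laid out is given below. The structure and field names mirror the POS and NEG arrays just introduced, but the concrete types are assumptions made for illustration rather than the actual C data structures of our implementation.

    #define MAXVARS 8   /* assumed maximum nest depth for the sketch */

    struct constraint_node;   /* a constraint plus a link to the next one in its list */

    /* Second-tier bucket: constraints sharing both the same set of present
     * terms and the same sign pattern F(X). */
    struct h2_bucket {
        struct constraint_node *constraints;
        struct h2_bucket       *next;
    };

    /* Top-tier entry H(X): one per set of present terms.  It stores no
     * constraints itself; POS[i] and NEG[i] point at the second-tier buckets
     * in which the term x_{i+1} appears with a positive or negative coefficient. */
    struct h1_entry {
        struct h2_bucket *POS[MAXVARS];
        struct h2_bucket *NEG[MAXVARS];
    };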



Figure 8.6: Two-tier hash scheme for maintaining the system of inequalities

Each of POS[i] and NEG[i] points to the set of hash entries in H' in which the term x_i is positive or negative, respectively. This two-tier scheme is illustrated in figure 8.6. The strategy, then, is always to choose the pivot constraints from the POS field in the hash table entry, H(X), currently being considered in the elimination graph. The constraint to eliminate the term x_i can then be accessed directly from the set of hash table entries H'(H(X ... x_i ... X)→NEG[i]). The advantage of this scheme is that we avoid ever having to search for appropriate pairs of constraints to eliminate a variable; looking up such a pair has time complexity O(1).

Our implementation, however, makes an approximating assumption when the number of intermediate constraints augmented to the original system reaches a threshold value. Thus, while the elimination process is proceeding, a counter records the number of constraints added to the original system. When the counter reaches a threshold number, the elimination procedure exits, and the array variable pair is assumed dependent. For the timing analysis experiment described in the next section, this threshold number has been set to seventy thousand.

The second possible situation in which the execution time of the distance test can become intolerable occurs when there is a large number of points, surrounding the centroid, which have to be scanned in order to determine if a feasible integer solution exists for the dependence constraint system S(X). Given that for each term x_i we scan with strides of 1/α_i from ⌊t_i^0⌋ → t_i^0 and then t_i^0 → ⌈t_i^0⌉, the number of candidate points can become quite large indeed. The cardinality of this candidate set is defined to be

    ∏_{i=1}^{n} α_i                                                                   (8.28)


Our implemented distance test makes the approximating assumption that an array variable pair is data dependent when the value for (8.28) is above some threshold value. For our experiments, the threshold value for (8.28) has been set to seventy thousand.

8.5.3 Timing analysis

We have randomly generated 500 scenarios with parameters derived within the ranges shown in table 8.1. For each scenario the distance test is applied. The execution time for the test, on an IBM RS6000, to determine whether the array variable pair in each scenario is either dependent or independent is then measured using the UNIX clock() utility. The test is exact in all cases except when the algorithm determines it has too much to do; for such cases the approximating assumptions described in the previous section are applied. For our experiment, we have also calculated the certainty factor, which is the percentage of scenarios in which the exact test has been performed. The certainty factor is calculated as

    certainty factor (%) = (Number of exact conclusions / Total number of scenarios) × 100

For the experimental run reported in this section, twelve scenarios out of the total of 500 had an approximating assumption made. The certainty factor for our experimental run is therefore 98.6%. The result of our timing run is shown in the frequency versus execution time graph in figure 8.7. For the 500 scenarios, the mean execution time for the distance test is found to be approximately 0.7 seconds. The probability density function for the execution time of the distance test, assuming a Gaussian distribution, is shown in figure 8.8.

These observations are encouraging because they indicate that the mean and the probable execution time is small, making the distance test suitable for use in a parallelizing compiler. Where approximating assumptions are made, we do so deliberately, so we know when an uncertain scenario has been encountered. This is important because when such a situation arises, the compiler can then prompt the programmer for suggestions on how to proceed.


Figure 8.7: Frequency versus execution time of the distance test with certainty factor 98.6%

Figure 8.8: Probability density function for the execution time of the distance test with certainty factor 98.6%


8.6 Comparison with the Omega test

The distance test is an exact data independence test for multi-dimensional array variables. The Omega test, proposed by Bill Pugh [53], is also an exact test; it uses a modified version of the Fourier-Motzkin variable elimination method for the solution of systems of linear inequalities. Comparison between the two tests is difficult, since any timing experiments will depend, in particular, on the efficiency of the implementation of the elimination procedure. The time complexity of our test will, however, be slightly better than that of the Omega test in the worst case. This is the result of the way in which the dependence constraint problem is formulated in the two tests.

For an n-nested loop kernel, the Omega test will formulate the dependence constraint system in terms of 2n variables. This is due to the Omega test having always to constrain a pair of read and write index sets. That is, it formulates the problem in terms of an index set instance in which the array variable is written, and an index set instance in which the array variable is read. Thus, to illustrate, for the loop kernel

DO I = 1, 10
  DO J = I+1, I+25
    A(I,J) = A(I-1,J-1) ...
  ENDDO
ENDDO

the Omega test formulates the dependence problem with the constraints

    1 ≤ I ≤ 10        I + 1 ≤ J ≤ I + 25
    1 ≤ I' ≤ 10       I' + 1 ≤ J' ≤ I' + 25

where the tuples (I, J) and (I', J') denote the read index pair and write index pair respectively. The distance test, described in this chapter, generates constraints in terms of only n + card(T) variables, where card(T) is the cardinality of the set of free


variables, and n + card(T) is always less than 2n. For the above example, the dependence distance vector is defined to be d_1 = (1, 1) and the data dependent iteration is defined to be P(X) = (I + 1, J + 1). Formulating S1,

    1 ≤ I ≤ 10
    I + 1 ≤ J ≤ I + 25

and S2,

    0 ≤ I ≤ 9
    I + 1 ≤ J ≤ I + 25

the dependence constraint system S(X) defined for the distance test is therefore given by

    1 ≤ I ≤ 9
    I + 1 ≤ J ≤ I + 25

8.7 Concluding remarks

This chapter has described an exact data independence test called the distance test. It uses the data dependence distance vectors derived from the techniques described in chapter 4 and formulates a constraint system which is solved in three phases. We have employed a version of the distance test on a very large number of possible subscript expressions and we show that the average analysis time is approximately 0.7 seconds when executed on an IBM RS6000. The version of the distance test in our experiments is not exact, however: the test makes approximating assumptions when it realises that the constraint system has become too difficult to solve quickly. Experiments show that these approximating assumptions are made on only a few occasions: twelve times in the total of five hundred runs. Also, since the approximating assumptions are made deliberately, the compiler can inform the programmer when an analysis for a def-ref array variable pair is approximate. We therefore assert the distance test to be suitable for use in a FORTRAN parallelizing compiler. Further, we show that, in the worst case, the distance test should have a slightly better time complexity than the Omega test [53], another exact data independence test. Both tests, however, have worst case exponential time complexities.

Chapter 9

Conclusions and future directions

The central hypothesis of this thesis has been that extracting exact data flow information results in the improved mapping of nested loop kernels onto parallel architectures. The thesis has developed a method which derives such exact data flow information and has shown how this information can be exploited in a number of important areas in the development of a parallelizing compiler. For the remainder of this chapter we summarise the results presented in this thesis. This chapter will also suggest directions in which the results presented can be extended and presents some open-ended questions that have still to be answered.

9.1 Major contributions

It has been noted in chapter 3 that current state-of-the-art parallelizing FORTRAN compilers derive very unsatisfactory performance gains for the potential parallelism found in "real world" programs [8, 43, 47]. The problems with these compilers are twofold: 1) the inability to handle compile-time unknowns in loop bounds and subscript expressions, and 2) not knowing a priori which subset of the restructuring transformations described in chapter 3 have to be applied to derive the most benefit. The work described in this thesis takes us some way towards a solution to the latter problem.

We do not advocate the need for iteration space transformations to expose the


parallelism in a nested loop kernel. We derive the data flow information of a nested loop computation directly by determining the distance vectors of the data dependencies which define a nested loop computation. Note, however, that current syntax transformations can still be useful in increasing the available parallelism and isolating the parallelizable portions from the non-parallelizable portions in a piece of FORTRAN code. We believe such transformations still have an important place in a parallelizing compiler's repertoire. To summarise the main contributions of this thesis:

- We have determined a direct method for deriving the distance vectors for data dependencies in a nested loop kernel. There were previously no systematic methods by which the distance vector of a data dependence between a def-ref array variable pair could be derived. This thesis shows how the forward and backward distance vectors of a data dependence can be determined in terms of the current iteration instance. The forward distance vector determines lexically later iterations which are data dependent on the current iteration, while the backward distance vector determines lexically earlier iterations upon which the current iteration is data dependent.

- We have suggested parallelization schemes for nested loop kernels with non-constant data dependencies. For such loop kernels the data dependencies vary from iteration to iteration. Loop kernels involving array variables with coupled subscripts often have such data dependence properties. The thesis has suggested parallelization schemes for SMAs and DMAs. For SMAs, we describe how synchronisation statements can be inserted into a nested loop kernel to implement its DO ACROSS version. We show that our synchronisation scheme unravels more parallelism with much less effort than current alternative schemes proposed by Polychronopolous [50] and Tzen et al. [65]. For DMAs, we describe how the message-passing version of the DO ACROSS of a nested loop kernel can be synthesised. We contrast our technique with that of a similar scheme proposed by Su [63] and describe the advantages of our scheme. We also compare our task allocation and data layout strategy to that proposed for HPF. We conclude that the currently proposed task allocation and data


layout strategies for HPF are inadequate and that a more sophisticated scheme, as suggested in this thesis, should be adopted.

- The thesis has presented an exact data independence test for multi-dimensional array variables. By determining data independent statements a parallelizing compiler can then schedule them in parallel. The thesis suggests a data independence test called the distance test. The test formulates the dependence constraint problem using the data dependence vectors determined by techniques described in this thesis and solves the constraint problem using a combination of techniques. The distance test is an exact test in that a data dependence is determined if and only if one exists. Although the distance test has exponential worst case time complexity, we show how a carefully constrained version of the test can exhibit "tractable" execution times for a very large set of potential array subscript expressions. We constrain the test by allowing it to make approximating assumptions when it determines that it has too much to do. We further show that the accuracy of the test is not unduly affected: the approximating assumptions were made on only 1.4% of the scenarios generated by our experiments. We have also compared the distance test with the Omega test, another exact data independence test, and we show that our test has potentially better worst case time complexity.

9.2 Future directions

The results we present are "theoretical stepping-stones" in the development of the ideal parallelizing compiler. The ideal parallelizing compiler is defined in this thesis as one which takes a program developed in a language with no explicit parallelizing semantic extensions, and 1) derives its optimal parallel form and 2) maps this onto the parallel architecture such that its execution time is minimised. This thesis takes us a step closer towards such an ideal for a parallelizing FORTRAN compiler.

The logical "next step" to the research presented in this thesis is an empirical evaluation of our suggested parallelization schemes. An empirical evaluation would implement our parallelization schemes in a state-of-the-art parallelizing FORTRAN compiler (i.e. KAP, PTRAN etc.), and its effectiveness would be assessed using a framework


similar to that developed by Peterson and Padua [47]. Integrating our proposed parallelization schemes into a state-of-the-art compiler is essential because some of the transformations discussed in chapter 3 are needed to enable our proposed schemes to work effectively. An example is the loop distribution transformation, which isolates parallelizable portions of a loop kernel to allow the compiler to parallelize a kernel not otherwise possible. Furthermore, dependence breaking transformations, as described in chapter 3, have been shown to dramatically improve the available parallelism exploitable by the compiler [8, 62]. Some immediate questions that need to be answered in an empirical evaluation are:

1. What selection of loop transformations will produce the most benefit for our proposed schemes when applied to "real world" programs? Parallelism exposing and dependence breaking transformations may provide the most benefit. Iteration space transformations such as loop reversal and loop permutation may be unnecessary in our proposed schemes. Further work will have to closely examine the relationship between previously proposed transformations and our proposed parallelization schemes.

2. What additional overheads are incurred when our proposed schemes are applied? Will the synchronisation overheads incurred in our DD synchronisation scheme overwhelm the benefits of unravelling all the available parallelism in a loop kernel? For loop kernels with non-constant data dependencies, the need for synchronisation is quite sparse. In uniform recurrence type loops, synchronisation overheads may quickly overwhelm the potential benefits, particularly when the SET and WAIT synchronising primitives are not efficiently implemented on the target architecture. What trade-offs must be applied if such additional overheads are detected? Perhaps a way of quantifying the overheads incurred by the synchronising primitives is needed to predict the performance of a parallelized loop before it is executed. Depending on whether a benefit can be predicted, we may invoke either the parallelized version or the serial version of the loop kernel.

3. How are unknown loop bounds to be handled, especially in the context of the


LCFM task allocation algorithm? The LCFM task allocator requires the loop bounds to be known at compile-time in order that the data dependence DAG can be traversed. The question of how best to deal with unknown values in pivotal variables is currently recognised to be an important problem in parallelizing compiler technology. Pivotal variables may be variables in loop bounds or array variable subscript expressions. Shen, Li and Yew [61] propose that a programmer should be allowed to insert assertions into a program suggesting possible values such pivotal variables can take. Another alternative strategy is to provide the compiler with the ability to predict values of unknown terms from information determined through execution profiling. The issues involved in this area are extensive and will require further in-depth investigation.

4. How are data dependencies which depend on conditional branches to be handled? This thesis assumes data dependencies which are valid throughout a computation. However, data dependencies which cross conditional branches may be invalidated during a computation run. An example of such a data dependence is seen in the code fragment below:

= ...

A(I-1)

ENDIF

Thus, depending on condition C1, the ow dependence between the array variable pair (A(I) $ A(I-1)) may not always be valid. Do such data dependencies occur often? Resolving these issues will require further analysis of \real world" programs and developing techniques which deal with such nondeterminism eciently. 5. We have described techniques for generating the data dependence distance vectors for array variables with generalised coupled subscripts. How are subscripted subscripts of the form A(B(i)) to be treated? It may be that a vector form is not sucient to predict the data ow patterns and some tabular mechanism generated through an instrumented prerun is needed.


None of the above questions can be answered without deliberate experimentation by the application developers using a parallelizing FORTRAN compiler incorporating our parallelization schemes. We hope such experiments will be possible in the future.

Bibliography

[1] A. V. Aho, R. Sethi and J. D. Ullman, Compilers: principles, techniques, and tools, Addison-Wesley, 1986.
[2] A. Aiken and A. Nicolau, "Perfect pipelining: A new loop parallelization technique", in Proc. 1988 European Symp. on Programming, Springer-Verlag Lecture Notes in Computer Science, No. 300, March 1988.
[3] J. R. Allen and K. Kennedy, "Pfc: A Program to Convert FORTRAN to Parallel Form", in Supercomputers: Design and Applications, ed. K. Hwang, IEEE Computer Society Press, 1985, pp. 186-205.
[4] Gene M. Amdahl, "Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities", in AFIPS Proc. SJCC, vol. 31, 1967, pp. 483-485.
[5] U. Banerjee, Dependence Analysis for Supercomputing, Kluwer Academic Publishers, Norwell, Mass., 1988.
[6] U. Banerjee, "A Theory of Loop Permutation", in Languages and Compilers for Parallel Computing, eds. D. Gelernter, A. Nicolau and D. Padua, MIT Press, Cambridge, 1989, pp. 54-74.
[7] U. Banerjee, R. Eigenmann, A. Nicolau and D. Padua, "Automatic Program Parallelization", Proceedings IEEE, vol. 18, no. 2, 1993, pp. 211-243.
[8] W. Blume and R. Eigenmann, "Performance Analysis of Parallelizing Compilers on the Perfect Benchmark Programs", IEEE Trans. Parallel and Distributed Systems, vol. 3, no. 6, 1992, pp. 643-656.


[9] B. M. Chapman, P. Mehrotra and H. P. Zima, "High performance Fortran without templates: an alternative model for distribution and alignment", in Fourth ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming, vol. 28, no. 7, July 1993, pp. 92-101.
[10] Doreen Cheng and D. M. Pase, "An Evaluation of Automatic and Interactive Parallel Programming Tools", in Proc. Supercomputing '91, 1991, pp. 412-422.
[11] Marina Chen and James Cowie, "Optimizing FORTRAN-90 Programs for Data Motion on Massively Parallel Systems", Technical Report, Dept. of Computer Science, Yale University, 1993.
[12] R. Cytron, "Doacross: Beyond vectorization for multiprocessors", in Proc. Intl. Conf. on Parallel Processing, August 1986, pp. 836-845.
[13] Ronald Gary Cytron, Compile-Time Scheduling and Optimization for Asynchronous Machines, PhD Thesis, Univ. of Illinois at Urbana-Champaign, Dept. of Computer Science, Oct. 1984.
[14] E. H. D'Hollander, "Partitioning and Labeling of Loops by Unimodular Transformations", IEEE Trans. Parallel and Distributed Systems, vol. 3, no. 4, 1992, pp. 465-476.
[15] Z. Fang, P. Tang, P-C. Yew and C-Q. Zhu, "Dynamic Processor Self-Scheduling for General Parallel Nested Loops", IEEE Trans. Computers, vol. 39, no. 7, 1990, pp. 919-929.
[16] Horace P. Flatt and Ken Kennedy, "Performance of parallel processors", Parallel Computing, vol. 12, pp. 1-20, 1989.
[17] A. Gerasoulis and Tao Yang, "A Comparison of Clustering Heuristics for Scheduling DAGs on Multiprocessors", Journal of Parallel and Distributed Computing, vol. 16, no. 4, 1992, pp. 276-291.
[18] A. Gerasoulis and Tao Yang, "On the Granularity and Clustering of Directed Acyclic Task Graphs", IEEE Trans. Parallel and Distributed Systems, vol. 4, no. 6, June 1993, pp. 686-701.


[19] K. Gopinath, Copy Elimination in Single-Assignment Languages, PhD Thesis, Computer Systems Laboratory, Stanford University, 1989.
[20] A. Gottlieb, R. Grishman, C. P. Kruskal, K. P. McAuliffe, L. Rudolph and M. Snir, "The NYU Ultracomputer - designing an MIMD shared memory parallel computer", IEEE Trans. Computers, vol. 32, 1983, pp. 175-189.
[21] G. Goff, K. Kennedy and Chau-Wen Tseng, "Practical Dependence Testing", in Proc. ACM SIGPLAN'91 Conf. on Programming Language Design and Implementation, Toronto, Canada, 1991, pp. 15-29.
[22] Thomas Gross and Peter Steenkiste, "Structured Dataflow Analysis for Arrays and its Use in an Optimizing Compiler", Software - Practice and Experience, vol. 20(2), Feb. 1990, pp. 133-155.
[23] J. L. Gustafson, "Reevaluating Amdahl's Law", Comms. ACM, vol. 31, no. 5, 1988, pp. 532-533.
[24] Manish Gupta and P. Banerjee, "Demonstration of Automatic Data Partitioning Techniques for Parallelizing Compilers for Multicomputers", IEEE Trans. Parallel and Distributed Systems, vol. 3, no. 2, 1992, pp. 179-193.
[25] S. Hiranandani, K. Kennedy, C. Koelbel, U. Kremer and C-W. Tseng, "An Overview of the FORTRAN-D Programming System", Proc. 4th Workshop on Languages and Compilers for Parallel Computing, Santa Clara, CA, August 1991.
[26] High Performance FORTRAN Forum, High Performance FORTRAN Journal of Development, CRPC-TR93300, Center for Research on Parallel Computation, Rice University, Houston, May 1993.
[27] R. M. Karp, R. E. Miller and S. Winograd, "The Organisation of Computations for Uniform Recurrence Equations", JACM, vol. 14, no. 3, 1967, pp. 563-590.


[28] C-T. King, W-H. Chou and L. M. Ni, "Pipelined Data-Parallel Algorithms: Part I - Concept and Modeling", IEEE Trans. Parallel and Distributed Systems, vol. 1, no. 4, 1990, pp. 470-485.
[29] C-T. King, W-H. Chou and L. M. Ni, "Pipelined Data-Parallel Algorithms: Part II - Design", IEEE Trans. Parallel and Distributed Systems, vol. 1, no. 4, 1990, pp. 486-499.
[30] Charles Koelbel and Piyush Mehrotra, "Compiling Global Name-Space Parallel Loops for Distributed Execution", IEEE Trans. Parallel and Distributed Systems, vol. 2, no. 4, 1991, pp. 440-451.
[31] X. Kong, D. Klappholz and K. Psarris, "The I Test: An Improved Dependence Test for Automatic Parallelization and Vectorization", IEEE Trans. Parallel and Distributed Systems, vol. 2, no. 3, July 1991, pp. 342-349.
[32] Manoj Kumar, "Measuring Parallelism in Computation-Intensive Scientific/Engineering Applications", IEEE Trans. Computers, vol. 37, no. 9, Sept. 1988, pp. 1088-1098.
[33] D. J. Kuck, The Structure of Computers and Computation, Volume I, John Wiley, NY, 1978.
[34] D. Kuck, P. Budnik, S-C. Chen, Jr. E. Davis, J. Han, P. Kraska, D. Lawrie, Y. Muraoka, R. Strebendt and R. Towle, "Measurements of Parallelism in Ordinary FORTRAN Programs", Computer, vol. 7, no. 1, 1974, pp. 37-46.
[35] D. Kuck, E. Davidson, D. Lawrie, A. Sameh, D. Padua, and P. Yew, "The Cedar System and an Initial Performance Study", University of Illinois at Urbana-Champaign, CSRD Tech. Report No. 1261, 1993.
[36] L. Lamport, "The Parallel Execution of DO Loops", Comms. ACM, vol. 17, Feb. 1974, pp. 83-93.
[37] P-Z. Lee and Z. M. Kedem, "Mapping Nested Loop Algorithms into Multidimensional Systolic Arrays", IEEE Trans. Parallel and Distributed Systems, vol. 1, no. 1, Jan. 1990, pp. 64-76.
[38] Z. Li, Pen-Chung Yew and Chuan-Qi Zhu, "An Efficient Data Dependence Analysis for Parallelizing Compilers", IEEE Trans. Parallel and Distributed Systems, vol. 1, no. 1, January 1990, pp. 26-34.


[39] J. Li and Marina Chen, "Compiling Communication-Efficient Programs for Massively Parallel Machines", IEEE Trans. Parallel and Distributed Systems, vol. 2, no. 3, 1991, pp. 361-376.
[40] S. Lipschutz, Schaum's Linear Algebra, McGraw-Hill, New York, 1974.
[41] L. Lu and M. Chen, "Subdomain Dependence Test for Massive Parallelism", in Proc. Supercomputing'90, New York, Nov. 1990.
[42] S. Midkiff and D. Padua, "Compiler Algorithms for Synchronization", IEEE Trans. Computers, Dec. 1987, pp. 1485-1495.
[43] H. Nobayashi and C. Eoyang, "A Comparison Study of Automatically Vectorizing FORTRAN Compilers", in Proc. Supercomputing'89, 1989.
[44] Michael O'Boyle and G. A. Hedayat, "Load Balancing of Parallel Affine Loops by Unimodular Transformations", Dept. of Computer Science, University of Manchester, Tech. Report UMCS-92-1-1, 1992.
[45] C. Papadimitriou and M. Yannakakis, "Toward an architecture independent analysis of parallel algorithms", SIAM J. Computing, vol. 19, 1990, pp. 322-328.
[46] D. Padua and M. J. Wolfe, "Advanced Compiler Optimization for Supercomputers", Comms. ACM, vol. 29, no. 12, 1986, pp. 1184-1201.
[47] Paul Peterson and David Padua, "Machine-Independent Evaluation of Parallelizing Compilers", University of Illinois at Urbana-Champaign, CSRD Tech. Report No. 1173, 1993.
[48] J-K. Peir and R. Cytron, "Minimum Distance: A Method for Partitioning Recurrences for Multiprocessors", IEEE Trans. Computers, vol. 38, 1989, pp. 1203-1211.
[49] C. Polychronopolous, M. Girkar, M. Reza Haghigat, Chia-Ling Lee, B. Leung and D. Schouten, "Parafrase-2: A New Generation Parallelizing Compiler", Int. Journal of High Speed Computing, vol. 1, May 1989, pp. 45-72.


[50] C. Polychronopoulos, "Compiler Optimizations for Enhancing Parallelism and Their Impact on Architectural Design", IEEE Trans. Computers, vol. C-37, no. 8, 1988, pp. 991-1004.
[51] C. Polychronopolous, D. J. Kuck and D. Padua, "Utilizing Multidimensional Loop Parallelism on Large-Scale Parallel Processor Systems", IEEE Trans. Computers, vol. 38, no. 9, 1989, pp. 1285-1296.
[52] C. Polychronopolous, "Guided Self-Scheduling: A Practical Scheduling Scheme for Parallel Supercomputers", IEEE Trans. Computers, vol. C-36, no. 12, 1987, pp. 1425-1439.
[53] William Pugh, "A Practical Algorithm for Exact Array Dependence Analysis", Comms. ACM, vol. 35, no. 8, August 1992, pp. 102-115.
[54] P. Quinton, "Automatic Synthesis of Systolic Arrays from Uniform Recurrent Equations", in Proc. 11th Intl. Symp. on Computer Architectures, 1984, pp. 208-214.
[55] J. Ramanujam and P. Sadayappan, "Compile-Time Techniques for Data Distribution in Distributed Memory Machines", IEEE Trans. Parallel and Distributed Systems, vol. 2, no. 4, 1991, pp. 472-481.
[56] S. K. Rao and T. Kailath, "Regular Iterative Algorithms and their Implementation on Processor Arrays", Proc. IEEE, vol. 76, no. 3, 1988, pp. 259-269.
[57] Joel H. Saltz, Ravi Mirchandaney and Kay Crowley, "Run-Time Parallelization and Scheduling of Loops", IEEE Trans. Computers, vol. 40, no. 5, 1991, pp. 603-612.
[58] V. Sarkar, Partitioning and Scheduling Parallel Programs for Execution on Multiprocessors, MIT Press, Cambridge, MA, 1989.
[59] Vivek Sarkar, "PTRAN - The IBM Parallel Translation System", in Parallel Functional Languages and Compilers, ACM Press, 1991, pp. 309-391.


[60] W. Shang and J. Fortes, "Time Optimal Linear Schedules for Algorithms with Uniform Dependencies", IEEE Trans. Computers, vol. 40, no. 6, 1991, pp. 723-742.
[61] Z. Shen, Z. Li and Pen-Chung Yew, "An Empirical Study of FORTRAN Programs for Parallelizing Compilers", IEEE Trans. Parallel and Distributed Systems, vol. 1, no. 3, July 1990, pp. 356-364.
[62] J. P. Singh and J. L. Hennessy, "An Empirical Investigation of the Effectiveness and Limitations of Automatic Parallelization", in Proc. Intl. Symp. on Shared Memory Multiprocessing, Tokyo, April 1991.
[63] Hong-Men Su, Multiprocessor Synchronization and Data Transfer, PhD Thesis, Dept. of Computer Science, University of Illinois at Urbana-Champaign, 1991.
[64] T. H. Tzen and L. M. Ni, "Trapezoid Self-Scheduling: A Practical Scheduling Scheme for Parallel Compilers", IEEE Trans. Parallel and Distributed Systems, vol. 4, no. 1, 1993, pp. 87-98.
[65] T. H. Tzen and Lionel M. Ni, "Dependence Uniformization: A Loop Parallelization Technique", IEEE Trans. Parallel and Distributed Systems, vol. 4, no. 5, 1993, pp. 547-558.
[66] C-M. Wang and S-D. Wang, "Efficient Processor Assignment Algorithms and Loop Transformations for Executing Nested Parallel Loops on Multiprocessors", IEEE Trans. Parallel and Distributed Systems, vol. 3, no. 1, 1992, pp. 71-82.
[67] D. Wallace, "Dependence of Multi-Dimensional Array References", in Proc. 2nd Intl. Conf. on Supercomputing, St. Malo, France, July 1988.
[68] M. E. Wolfe and M. S. Lam, "A Loop Transformation Theory and Algorithm to Maximize Parallelism", TR No. CSL-TR-91-457, Computer Systems Laboratory, Stanford University, 1991.


[69] Michael Wolfe and Chau-Wen Tseng, "The Power Test for Data Dependence", IEEE Trans. Parallel and Distributed Systems, vol. 3, no. 5, Sept. 1992, pp. 591-601.
[70] Y. Wu, "Parallel Decomposed Simplex Algorithms and Loop Spreading", PhD Thesis, Department of Computer Science, Oregon State University, 1988.
[71] Tao Yang and A. Gerasoulis, "A Fast Static Scheduling Algorithm for DAGs on an Unbounded Number of Processors", in Proc. Supercomputing'91, Albuquerque, NM, 1991, pp. 633-647.
