© Copyright 1997 Ton Anh Ngo

The Role of Performance Models in Parallel Programming and Languages by Ton Anh Ngo

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy, University of Washington, 1997

Approved by

Program Authorized to Offer Degree

Date

(Chairperson of Supervisory Committee)

Doctoral Dissertation

In presenting this dissertation in partial fulfillment of the requirements for the Doctoral degree at the University of Washington, I agree that the Library shall make its copies freely available for inspection. I further agree that extensive copying of this dissertation is allowable only for scholarly purposes, consistent with "fair use" as prescribed in the U.S. Copyright Law. Requests for copying or reproduction of this dissertation may be referred to University Microfilms, 300 North Zeeb Road, Ann Arbor, Michigan 48106, to whom the author has granted "the right to reproduce and sell (a) copies of the manuscript in microform and/or (b) printed copies of the manuscript made from microform."

Signature Date

University of Washington

Abstract

The Role of Performance Models in Parallel Programming and Languages by Ton Anh Ngo

Chairperson of the Supervisory Committee: Professor Lawrence Snyder
Department of Computer Science and Engineering

A program must be portable, easy to write and yield high performance. Unfortunately, these three qualities present conflicting goals that require difficult and delicate balancing, and for parallel programs, the parallelism adds yet another level of complexity. The difficulty has proved to be a persistent obstacle to parallel systems, even for the data parallel class of applications where the parallelism is abundant and the computation is highly regular. This thesis is an effort to find this delicate balance between scalability, portability and convenience in parallel programming. To establish the framework for analyzing the many disparate and conflicting issues, I develop the concept of modeling and apply it throughout the study. In addition, the empirical nature of the problem calls for experimental comparisons across current parallel machines, languages and compilers. The nonshared memory model is first shown to be more portable and scalable because it exhibits better locality, then two data parallel languages, HPF and ZPL, are studied in detail. HPF offers many advantages and enjoys wide attention in industry and academia, but HPF suffers from a significant gap in its performance model that forces the user to rely completely on the optimization capability of the compiler, which in turn is nonportable. On the other hand, ZPL's foundation in a programming model leads to predictable language behavior, allowing the user to reliably choose the best algorithm and implementation. In this respect, a new language abstraction is proposed that promotes scalability and convenient programming: mighty scan generalizes the parallel prefix operation to apply to conceptually sequential computation that is difficult to parallelize.

Table of Contents

List of Figures
List of Tables

Chapter 1: Introduction
  1.1 Modeling
  1.2 Parallel programming and modeling
    1.2.1 Model for machines
    1.2.2 Model for compilers
    1.2.3 Model for languages
    1.2.4 The larger picture
  1.3 Thesis outline

Chapter 2: Programming Models for Data Parallel Problems
  2.1 Introduction
  2.2 Methodology
    2.2.1 The Machines
    2.2.2 The applications and the implementations
  2.3 LU Decomposition
    2.3.1 The problem
    2.3.2 The parallel algorithms
    2.3.3 Volume of data references
    2.3.4 Performance results
  2.4 Molecular Dynamics Simulation
    2.4.1 The problem
    2.4.2 The algorithm
    2.4.3 Volume of data references
    2.4.4 Performance results
  2.5 Conclusion

Chapter 3: Two Data Parallel Languages, HPF and ZPL
  3.1 Introduction
  3.2 A review of HPF
  3.3 A review of ZPL
  3.4 Expressing parallelism
    3.4.1 Parallel computation
    3.4.2 Array reference
  3.5 Distributing parallelism
    3.5.1 Distributing the data
    3.5.2 Distributing the computation
  3.6 Conclusion

Chapter 4: The Performance Model in HPF and ZPL
  4.1 Introduction
    4.1.1 The performance model
    4.1.2 ZPL's performance model
    4.1.3 HPF's performance model
    4.1.4 A methodology to evaluate the performance model
  4.2 Selecting an implementation: array assignment
    4.2.1 Quantifying the performance model
    4.2.2 Results
  4.3 Selecting an algorithm: matrix multiplication
    4.3.1 HPF versions
    4.3.2 ZPL versions
    4.3.3 Results
  4.4 Current HPF solutions
  4.5 Conclusions

Chapter 5: Benchmark Comparison
  5.1 Introduction
  5.2 Methodology
    5.2.1 Overall approach
    5.2.2 Benchmark selection
    5.2.3 Platform
    5.2.4 Embarrassingly Parallel
    5.2.5 Multigrid
    5.2.6 Fourier Transform
  5.3 Parallel Performance
    5.3.1 NAS EP benchmark
    5.3.2 NAS MG benchmark
    5.3.3 NAS FT benchmark
    5.3.4 Communication
    5.3.5 Data Dependences
  5.4 Conclusion

Chapter 6: Mighty Scan, parallelizing sequential computation
  6.1 Introduction
  6.2 A case study
    6.2.1 Idealized execution
    6.2.2 An algorithmic approach
    6.2.3 Pipelining
  6.3 Implementations by HPF and ZPL
    6.3.1 DO loop implementation
    6.3.2 An array oriented approach: F90 and ZPL
  6.4 A new construct for ZPL
  6.5 Conclusions

Chapter 7: Conclusions
  7.1 Contributions
  7.2 Summary
  7.3 Future works

Bibliography

Appendix A: Performance data
  A.1 LU on shared-memory machines
  A.2 WATER on shared-memory machines
  A.3 HPF and ZPL programs on nonshared-memory machine

List of Figures

1.1 A user perspective of parallel system
1.2 Realizing a parallel programming model
1.3 Experimental comparisons of the programming models and languages
2.1 Machine memory hierarchy
2.2 LU speedup based on model: n=512, t=4
2.3 LU execution time: Sequent and Cedar
2.4 LU execution time: Butterfly and KSR
2.5 LU execution time: DASH
2.6 LU speedup: Sequent and Cedar
2.7 LU speedup: Butterfly and KSR
2.8 LU speedup: DASH
2.9 WATER speedup based on model
2.10 WATER execution time: Sequent and Cedar
2.11 WATER execution time: Butterfly and KSR
2.12 WATER execution time: DASH
2.13 WATER speedup: Sequent and Cedar
2.14 WATER speedup: Butterfly and KSR
2.15 WATER speedup: DASH
2.16 Pns/Ps factor for all machines
3.1 Reference to a distributed array in ZPL
3.2 Example of HPF data layout
3.3 Example of ZPL data layout
4.1 Cross compiler performance for HPF array assignment
4.2 Communication for array assignment using different index expressions: p=8
4.3 Execution time for array assignment using different index expressions: p=8
4.4 Matrix Multiplications by ZPL and 3 HPF compilers: 2000x2000, p=16
5.1 Illustrations of EP as implemented in HPF and ZPL
5.2 Illustrations for the Multigrid algorithm
5.3 Illustrations of FT implementation
5.4 Performance for EP (log scale)
5.5 Performance for MG (log scale)
5.6 Performance for FT (log scale)
6.1 Two methods for array partitioning
6.2 Three possible parallel executions

List of Tables

2.1 Characteristics of the shared memory machines
2.2 Volume of data references for LU
2.3 Volume of data references for WATER
3.1 High level comparison of HPF and ZPL
4.1 Array assignment in HPF and ZPL
4.2 Choices of array assignment in HPF
4.3 Matrix multiplication algorithms expressed in HPF
4.4 Matrix multiplication algorithms expressed in ZPL
4.5 Pseudo-code for the matrix multiplication algorithms by three HPF compilers
4.6 Speedup ratio from 16 to 64 processors, 2000 x 2000 matrix multiplication
5.1 Sources of NAS benchmarks for HPF and ZPL
5.2 Communication statistics for EP class A, MG class S and FT class S: p=8
5.3 Dependence ratio m/n for EP, MG and FT
6.1 The forward elimination step from tomcatv
6.2 Pseudo-code for the tomcatv segment by HPF and ZPL compilers
6.3 tomcatv expressed using SCAN and the resulting SPMD code
A.1 Performance (seconds) of LU Decomposition on 5 shared memory machines
A.2 Performance (seconds) of WATER on 5 shared memory machines
A.3 Characteristics of the IBM SP2 used for the cross-compiler comparison
A.4 Performance (seconds) of Matrix Multiplication on the IBM SP2: 2000x2000
A.5 Performance (seconds) of NAS 2.1 benchmarks on the IBM SP2: EP, FT, MG

Acknowledgments

This dissertation would not have been possible without the ever ready encouragement and guidance from my adviser, Lawrence Snyder. I am also indebted to the IBM Thomas Watson Research Laboratory, Yorktown Heights, NY for supporting my entire graduate program, and to my managers Fran Allen, Kevin McAuliffe, William Brantley, Edith Schonberg, Lee Nackman, Jim Russell and John Barton. The experimental work in this thesis required the parallel computing resources of the Lawrence Livermore National Laboratory, Stanford University, the University of Illinois at Urbana-Champaign, the Cornell Theory Center, the Caltech Center for Advanced Computing Research and the San Diego Supercomputing Center. I would like to express my gratitude to these institutes for making their resources available. I also would like to thank my colleagues Manish Gupta and Bradford Chamberlain who have always been ready to help with issues on HPF and ZPL.


Dedications

My parents, Dinh Pham and Bao Ngo, and my brothers and sisters Tuan, Dung, Tu and Thu have always been supportive in quietly reminding me that I do need to finish my graduate study some time. Dung has been especially stimulating despite the extremely difficult life challenge he is facing. To them, I dedicate this work. My wife, Hue, and I have found that medical school, graduate school and children all at once is not exactly fun, but along our arduous journey together, we have discovered a strong sense of closeness. Nevertheless, we would not recommend to anyone to try it, at least not without a lot of help and a healthy sense of humor.


Chapter 1

Introduction

model 1(n) (Webster 7th dictionary) DEFINITIONS: [1 (obs)] a set of plans for a building [2 (dial Brit)] COPY, IMAGE [3] structural design [4] a miniature representation of something; also: a pattern of something to be made [5] an example for imitation or emulation [6] a person or thing that serves as a pattern for an artist; esp: one who poses for an artist [7] ARCHETYPE [8] one who is employed to display clothes or other merchandise: MANNEQUIN [9a] a type or design of clothing [9b] a type or design of product (as a car or airplane) [10] a description or analogy used to help visualize something (as an atom) that cannot be directly observed [11] a system of postulates, data, and inferences presented as a mathematical description of an entity or state of affairs

1.1 Modeling

This thesis is based on the concept of modeling. Although the term model is heavily used in many fields and our focus in this study will be quite narrow in contrast to this general term, it is helpful to remember that it forms the underlying basis for this work.

In the simplest terms, a model reflects our understanding of how a certain system operates; it captures the information that is necessary and sufficient for our expected use of the system. For instance, when we operate an automobile, we have in mind a certain model of the machinery: we expect features such as a steering wheel, an accelerator pedal, a brake pedal, etc. We expect these features to function in a certain manner: the accelerator should move the car and the brake should stop it. We operate these features and indeed they function as expected. We are also conscious of the size of the car, i.e., how far ahead and behind are the front and back bumpers, so that we can avoid colliding with other cars. This conformity to a consistent model allows a person to step into a car of any make or model and drive with no difficulty. A model may capture different levels of detail depending on the needs of the user. For instance, an auto mechanic may look at an automobile and think in terms of the engine, transmission, carburetor, electrical wiring, while a car salesman may think in terms of front-wheel drive, antilock brakes, and service contract. Furthermore, the model must be accurate for the system to be operated effectively. Consider a race car driver whose goal is to drive as fast as possible without wrecking the car. Among other things, he must learn precisely how the car steering responds at high speeds, or more formally, he must acquire in his mind an accurate model of the car's steering behavior. A number of scenarios can occur that prevent such a mental model: (1) the steering is soft because of the car's poor design, (2) some last minute repair work had altered the steering and the instruction given to the driver concerning the change turns out to be inaccurate, or (3) the car is new and he has not gained enough experience with its behavior. Clearly, if the driver is not sure how the car will steer at high speeds, he will have to be conservative and drive at a lower speed. Modeling is also pervasive in computer science. An example of a successful model can be found in the design of the memory for a sequential machine. Computer architects effectively model a program as a stream of references that exhibit some temporal and spatial locality. Designers then optimize to this model, and the result is the cache, a

ubiquitous and very successful feature. Interestingly, this example can be carried further to illustrate an instance when the model fails. A cache under the expected operating conditions will yield an average memory access time that is much better than that of the main memory, thereby creating an illusion (i.e., model) of a uniformly fast memory. A user then can program according to this model by largely ignoring the memory access time. However, unfavorable behaviors can occur, such as a long sequence of consecutive addresses or a stride in the references which makes use of only one word per cache line. In these cases, the original model of temporal and spatial locality fails, resulting in the failure of the cache and consequently of the user perception of a uniformly fast memory. The user then encounters surprisingly poor performance. Fortunately in this particular case, cache blocking by the compiler has been successful in identifying these reference patterns and breaking the long sequence or adjusting the memory stride. In other words, the compiler is able to transform the program so that it fits the original model on which the cache design is based. In this thesis, I will focus on the topics of programming for parallel machines, and the modeling concept is used to study the issues and problems.
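As a concrete illustration of the cache example above, the following sketch is my own and is not taken from the thesis; the routine and array names are hypothetical. Both routines sum an n x n array, but the first runs the inner loop across a row: since Fortran stores arrays column-major, successive references are n words apart, only one word of each cache line is used, and the locality model behind the cache breaks down. Interchanging the loops, the kind of adjustment a compiler can make, restores unit stride.

c     Hypothetical illustration (not code from the thesis).  With j in
c     the inner loop, consecutive references to a(i,j) are n words
c     apart in Fortran's column-major layout, defeating spatial
c     locality and hence the illusion of a uniformly fast memory.
      real function badsum(a, n)
      integer n, i, j
      real a(n,n)
      badsum = 0.0
      do 10 i = 1, n
         do 10 j = 1, n
            badsum = badsum + a(i,j)
 10   continue
      return
      end

c     Interchanging the loops yields unit-stride references, which fit
c     the temporal and spatial locality that the cache design assumes.
      real function goodsum(a, n)
      integer n, i, j
      real a(n,n)
      goodsum = 0.0
      do 20 j = 1, n
         do 20 i = 1, n
            goodsum = goodsum + a(i,j)
 20   continue
      return
      end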

1.2 Parallel programming and modeling

Parallel systems are among the few disciplines in computer science in which very substantial benefits appear feasible but have remained elusive despite extensive research. Consequently, it is not surprising that the interest in parallel systems has waxed and waned over the years. The late 80's and the early 90's saw a surge in the development of parallel machines and software, driven by the availability of cheap microprocessors. In the last few years however the number of vendors as well as the variety of parallel machines has been dwindling. The level of interest in parallel systems at the current time seems low, yet the relentless quest for more computing power has not eased and it is conceptually clear that parallel machines can lead to faster execution times than a sequential machine. This motivates the question, "What are the major obstacles in

realizing the potential of parallel systems?" In seeking an answer, it is helpful to consider a more elementary question, "What does a user expect from a parallel machine?" The following expectations are identified. Foremost is performance, or more precisely scalable performance, the principal benefit from a parallel machine. It is clear that a user who considers a parallel machine does so out of necessity. Without scalable performance, there is little motivation for the user to depart from the familiar sequential machine. Scalable performance has two components: (1) competitive scalar performance against a state of the art sequential machine, and (2) good speedup. The second expectation from users is the durability of their software. Program development is costly even without the added difficulty of parallel programming; therefore, the users expect their programs to be portable not only across different parallel platforms but also across generations of parallel machines. It should be stressed that portability must include both correctness and performance; otherwise, a correct but slow program is no more useful than a fast program that produces incorrect results. The third expectation is ease of use. This may seem to have lower priority than the first two, but it is clear that parallel machines will not gain a critical mass of users until they can be made readily accessible. We next consider how various aspects of parallel systems are meeting these requirements.

A contract between the user and the system

Figure 1.1(a) shows a user's perspective in developing a parallel program. From the problem specification, a user typically chooses an algorithm, implements it in a particular language, compiles the program with a particular compiler, then runs the program on a parallel machine. On a closer look (Figure 1.1(b)), we can see that the portability issue mainly involves the compilers and the parallel machines since the goal in portability is to be able to run the same program without modifications on different platforms. The language is only involved in the sense that one standard language is used. The ease of use issue falls entirely within the language since it depends on the type of abstractions

for parallelism that the language provides. Scalability ultimately rests with the user in the choice of the algorithm.

Figure 1.1: A user perspective of parallel system. (a) Developing a parallel program. (b) Satisfying the three user expectations (scalability rests with the algorithm, ease of use with the language, and portability with the compiler and machine).

Since the language is the only interface accessible to the user, he/she would learn the abstractions and functionality provided by the language and, in the process, form a mental picture of how the parallel execution is to proceed. This mental picture, or more formally the programming model, is then used to select the best algorithm for the problem and to implement the algorithm in the most efficient manner. The responsibility of the compiler and the machine is to accurately implement the programming model that the language presents. The programming model thus serves as the contract between the user and the parallel system. It follows that if either of the parties of the contract fails, e.g., the user fails to find a scalable solution or the system fails to meet the expectation, then scalability is not achieved.

As an example, consider the case of KAP, a parallelizing compiler for Fortran programs. A KAP user would program in the standard Fortran syntax with the understanding

that when a loop is encountered, a number of processors will wake up and execute the loop in parallel. Under this programming model, the user will choose and implement an algorithm based on a number of assumptions: (1) the loop that is expected to execute in parallel will indeed execute in parallel, (2) the overhead for entering the parallel execution is negligible, and (3) the data can be accessed with an average low latency. Then, the task for the user is to ensure that the DO loop dominates the program execution, while the task for the parallelizing compiler is to deliver the expected performance. If the compiler meets the expectation, then the program scales; otherwise, the program does not scale. In the latter case, the failure is not due to the algorithm since it is indeed scalable according to the programming model; rather, the failure rests with the compiler in implementing the model.
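A hypothetical sketch of the kind of code this model assumes follows (my example, not one from the thesis). The user writes ordinary Fortran and arranges for a loop like this one, which carries no dependence between iterations, to dominate the execution; the contract is that the parallelizing compiler will distribute its iterations across processors.

c     Hypothetical sketch, not taken from the thesis.  Under the
c     parallelizing-compiler model the user writes a plain DO loop;
c     each iteration touches only x(i) and y(i), so there is no
c     loop-carried dependence and the compiler is free to execute
c     the iterations on different processors.
      subroutine scaleadd(n, alpha, x, y)
      integer n, i
      real alpha, x(n), y(n)
      do 10 i = 1, n
         y(i) = y(i) + alpha * x(i)
 10   continue
      return
      end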

The glue for the components of a system

Thus far we have only discussed modeling as an interface between the user and the language, yet modeling exists between other components as well. Consider the models in a native system as illustrated in Figure 1.2(a). In the development of many parallel systems, the effort has tended to focus on the hardware, leaving the software often under-developed and primitive. In addition, machine vendors are often motivated toward proprietary software that closely mirrors the functionality of the hardware. The result is a tight coupling between the machine, the compiler, and the associated language. In other words, the machine is designed for a particular architecture, the language is designed to match the machine capability, and the compiler translates directly from the language syntax to the machine function. The models that exist between these components contain specific details that help make the system efficient. However a disadvantage is that programs tuned to the system may not be portable. An example is the Connection Machine CM-2 and its C* language: the where construct is an excellent match for the SIMD control mechanism in the hardware, but implementing this language abstraction on an MIMD may involve more overhead. Figure 1.2(b) shows a more loosely coupled scheme in which a language, preferably

a standard one, is targeted by many compilers. The compiler in turn can target a number of parallel machines. In this approach, a language designer must make a number of assumptions about the capability of the compilers, i.e. a compiler model. Likewise, a compiler developer who intends to port the compiler implementation to multiple machines must have a certain model common to all the machines. A compiler for distributed memory machines could expect the machine to provide common communication functions such as asynchronous send/receive or even a standard communication interface such as MPI. The decoupling of the components renders the modeling aspect more critical because a mismatch in the chain of models has the potential of degrading the scalability of the entire system. Accurate modeling thus becomes the "glue" that ties together the components of a fully portable parallel system. Clearly, any reasonable model can be used between any two components. The only requirement is that the model can be implemented faithfully, or conversely, the model accurately captures the performance of the components being modeled. Having established the high level picture, we can summarize the overall approach for scalable, portable and easy to use systems: (1) the parallel language must be designed carefully for ease of use; (2) for portability, the language must be supported on multiple compilers and machines; (3) for scalability, the modeling must be kept accurate across all components of the parallel system. We now briefly survey the state of the art in parallel machines, compilers and languages to understand the types of model that are in use.

1.2.1 Model for machines

The model at the machine level can typically be derived from a description of the machine architecture and organization. Parallel architectures proposed over the years cover a wide spectrum ranging from shared memory and distributed memory to dataflow and multithreading. Since the processing elements are not different from the uniprocessors, the models for parallel machines are generally distinguished by the memory system:

shared memory and nonshared memory. Note that we use the term nonshared instead of the more conventional term distributed to avoid confusion since a shared memory machine may have a distributed memory organization. Shared memory architecture is characterized by a single address space for all processors, while in the nonshared memory architecture each processor explicitly manages its address space and shares data only through explicit messages that involve both the sender and the receiver. The shared memory model can be further qualified by the knowledge of the physical location of an address. In other words, a processor is able to directly access any memory address, but it may or may not be able to determine whether the address is local or nonlocal. This knowledge imparts on the shared memory model some characteristics of the nonshared memory model; therefore for our purpose, this specialized model is called a hybrid model.

Figure 1.2: Realizing a parallel programming model. (a) Native models: tight coupling between the machine, the compiler and the language leads to poor portability. (b) More portable models: each phase is decoupled.

Shared memory machines

An early memory model for shared memory parallel machines is the Parallel Random Access Memory model (PRAM), which extends the sequential Random Access Memory

model (RAM) by assuming that each processor can access any memory location in unit time. The simplification allows a programmer to reason about the complexity of different algorithms so that an efficient algorithm can be chosen. For the shared memory architectures, a number of machines present a pure shared memory model by implementing a memory with a uniform memory access time (UMA). These include symmetric multiprocessors based on a bus or crossbar (SMP, currently marketed by numerous vendors such as Sequent, SGI), as well as earlier machines that use a multistage interconnect network (MIN) to connect the processors to a global memory. In the actual implementation, the memory access time is not strictly uniform, but, as with the cache in a sequential machine, there is no direct method for the user to control the access time. Other architectures fall into the category of pure shared memory as well. Cache Only Memory Architecture (COMA) implemented in the Kendall Square Research machine does not allow the general user to differentiate among memory locations since the machine's responsibility is to dynamically relocate the memory sections in use to the local processor. Dataflow machines such as the Monsoon machine [Papadopoulos & Culler 90] closely match the dataflow programming model, which has no explicit notion of data location. Finally, the Tera machine relies on the program parallelism to mask the high latency of the global shared memory, so that if sufficient parallelism exists, the global memory will appear to be uniformly fast.

Hybrid machines

Large scale shared memory machines by necessity must distribute their physical memory. To take advantage of the processor locality, many machines provide some mechanisms for the user to distinguish different levels in the memory hierarchy, resulting in a model of nonuniform memory access time (NUMA). The IBM RP3 and BBN Butterfly distinguished local and remote addresses while the Stanford DASH could recognize node, cluster and remote memory addresses. Recent advances in network technology in reducing the latency and improving the bandwidth are also shifting the focus to clusters of SMP's. These clusters (Distributed Shared Memory, or DSM) are generally connected

through a high performance switch and have a hardware coherent cache. Thus departing from pure shared memory, these machines present a hybrid model that has both shared and nonshared memory characteristics: hardware for shared memory is provided, but the underlying communication mechanism is also exposed to varying degrees, which could be as simple as fetching a fixed size cache line, or more full-fledged with put/get of variable size messages (Cray T3D, T3E).

Nonshared memory machines

At the other end of the spectrum are the message passing machines with the pure nonshared memory model. Early machines such as the Caltech Hypercube support little more than the basic send and receive, but more recent machines (SP2, Paragon, CM-5) enrich the communication functionality to include asynchronous and collective communication. The level of communication tends to be coarse in most machines, thereby encouraging coarse grained parallelism, but fine grained communication at the level of program variables is also possible in machines such as the Transputer.

1.2.2 Model for compilers

A language designer must be aware of the capability of the compiler technology to ensure a viable implementation of the language. Within the framework in Figure 1.2(b), this awareness constitutes the compiler model. In general however, the model that compilers present is not commonly recognized as a separate entity. It is not always clear whether a language design is based on an existing compiler model or assumes a future compiler model with the expectation that new compiler optimizations will be developed. In the latter case, the language designer has in effect created and used a new model that some future compiler must satisfy. We briefly survey some current approaches in parallel compilers to understand the models in use. When a parallel language is extended from a sequential language through libraries, no additional effort is required from the compiler. The compiler treats a parallel program no differently than a sequential program; therefore no compiler model exists with respect

to parallelism. Examples include C/Fortran compilers for shared memory programs that use the conventional thread abstractions (lock, spawn, etc.) as well as message passing programs that use the standard communication libraries (PVM, MPI, etc.). Compilers for sequential languages typically optimize for locality in the spatial and temporal dimensions of the reference pattern. Compilers for shared memory parallel languages must deal with the additional processor dimension in the reference pattern. On a cache-coherent shared memory machine, a compiler may manipulate the data sharing pattern, for instance by reorganizing the memory layout to minimize the false sharing effect [Jeremiassen & Eggers 95]. In this case, the compiler model would include the capability to identify the shared data, analyze the access patterns and perform the necessary transformations. For parallelizing compilers, the model requires extensive capabilities. The compiler must be able to analyze and disambiguate dependences in the DO loops and perform the necessary transformations so that the loops can be executed in parallel. If the memory hierarchy is not uniform, the compiler must partition the data to maximize locality. The compiler must also minimize any overhead involved in creating the parallelism such as data copying, synchronization, managing worker threads, etc. HPF and directive-based parallel Fortran variants assume a compiler able to exploit the data distribution directives. Since these languages mainly target distributed memory machines, the compiler must (1) analyze the index expression to determine whether each memory reference is local or remote, (2) generate the communication required, and most importantly, (3) optimize the communication so that the program performance is scalable. In addition, the standard Fortran semantics must be observed. For ZPL [Lin & Snyder 93], the design of the language follows the philosophy that an efficient compiler implementation must exist for the abstractions. Although it seems contradictory to be proposing new powerful language abstractions that only require existing compiler optimization techniques, being designed from first principles and being developed incrementally allows the language to evolve with new compiler technology. In

this case, the compiler model is simply one that efficiently implements the language.

1.2.3 Model for languages

Models at the language level are also called programming models. They are interesting for several reasons. First, they are the direct interface between a user and the machine; therefore they must focus on the user's needs, e.g., ease of use. Second, it is easy to create and implement a new programming model because it can be easily interpreted in software. This has led to a rich diversity of models, which allow for many ways to categorize them; however to be consistent with the machine model, we will classify them as shared-memory and nonshared-memory and discuss the variations within each class. In the shared memory class, the most elementary model is one that directly reflects the functionality of the hardware. Control for hardware features such as lock, barrier synchronization, cache prefetch, etc. is coded in libraries which are then used to extend a sequential language. The model is the most efficient since it accurately captures the capabilities of the machine, but tends to be rigid. Differences in syntax and functionalities between machines also hamper program portability, although standard macros such as those from Argonne National Laboratory have been developed to ease this problem. For languages targeted by parallelizing compilers such as KAP or PTRAN, the original sequential programming model is only extended with the awareness that parallelism will arise from the loops. A user would only need to ensure that loops constitute a major part of the dynamic execution of the otherwise sequential program. Elementary nonshared memory languages are also extended from sequential languages with communication libraries. These libraries control the low level hardware and may add functionality such as message buffering, error recovery, and collective communication; therefore the programming model is usually richer than the actual machine. Differences in syntax and functionality lead to some portability problems, although they are alleviated to some degree by standard communication interfaces such as PVM and MPI.

Data parallel languages present an interesting model that has some characteristics of nonshared memory but allows a global view of the program computation. Examples include C*, High Performance Fortran and similar variations, and ZPL. The nonshared memory quality can exist as data partitioning directives as with HPF, or can be implicit in the language abstractions as in ZPL. Under this model, the user programs with the knowledge that the data will be distributed, but is spared the tedium of coding low level communication to manage the distributed data.
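A small hypothetical fragment in the HPF style illustrates this model (my sketch, not an example from the thesis). The assignment is written with a global view of the arrays; the directives only assert how the data is spread across processors, and it is the compiler's job to generate whatever communication the distribution implies.

      program shift
      real a(1024), b(1024)
!HPF$ PROCESSORS p(4)
!HPF$ DISTRIBUTE a(BLOCK) ONTO p
!HPF$ ALIGN b(:) WITH a(:)
c     Global-view assignment of a shifted section.  With a BLOCK
c     distribution, the last element of each processor's piece of a
c     needs an element of b owned by the neighboring processor, so
c     the compiler must insert the boundary communication itself.
      b = 1.0
      a(1:1023) = b(2:1024)
      end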

1.2.4 The larger picture

Our discussion thus far has identified the following points:

1. Parallel programs must be scalable, portable and easy to write
2. Scalability originates in the choice of the algorithm
3. Ease of use is provided by the language
4. Portability requires the support of multiple compilers and machines
5. The appropriate models bind together the algorithm, the language, the compiler and the machine to form an effective system.

In this framework, a system as a whole is only as effective as the least effective link. This observation thus lays the basis for the methodology throughout the thesis: to evaluate a system, it suffices to evaluate the models that bind together the components. Evaluating the model in turn will involve formulating some questions that can be answered by experimental data.

1.3 Thesis outline

Having laid out the framework, the point of departure for this thesis can be stated as follows. Our goal is to understand the best approach for achieving scalability, portability and ease of use in parallel programming. The scope is limited to the class of data parallel problems. The studies involve experimental comparison across parallel machines, compilers and languages. The concept of modeling serves as the basis for analyzing the observed behavior. This thesis asserts that an accurate performance model is a fundamental requirement for a parallel system. It makes the following contributions:

- An experimental comparison and analysis of two general programming models: the shared and nonshared memory models.

- An experimental comparison and analysis of two data parallel languages: HPF and ZPL.

- A new high level data parallel abstraction that promotes scalability, portability and ease of use: Mighty scan.

Chapter 2 begins with the issues of scalability and portability, omitting for the moment the question of ease of use. Given the shared and nonshared memory programming models on different machines (Figure 1.3(a)), the question is which model is more portable across the machines. The data provides experimental evidence to support the choice of a nonshared memory programming model.

Figure 1.3: Experimental comparisons of the programming models and languages. (a) Evaluating the portability of programming models. (b) Evaluating the portability of languages. (The lighter shaded diagrams belong in the overall picture but are not currently under consideration.)

Having established a candidate programming model, we next consider the issue of ease of use. In the last few years, considerable research has focused on high level parallel languages that are based on the nonshared memory programming model and that target both scalability and portability. Although the data parallel model is not adequate to address general parallel programming, the prevalence of data parallel problems in scientific applications has motivated the development of many data parallel languages, all with the goal of enabling easier programming while achieving scalable and portable performance. In this thesis we will consider HPF and ZPL (Figure 1.3(b)). Compilers for both HPF and ZPL are now widely available on many platforms. HPF vendors include APR (Applied Parallel Research), PGI (Portland Group Inc.), IBM, and DEC, and the supported platforms include the IBM SP2, Cray T3D, Intel Paragon, DEC SMP AlphaServer, and SGI PowerChallenge. The ZPL compiler from the University of Washington supports the Intel Paragon, Cray T3D, IBM SP2, KSR2 and network of DEC Alpha workstations. Such a wide availability allows for detailed comparisons and analysis between HPF and ZPL as well as among HPF compilers.

The reader may question at this point the practical purpose of comparing HPF and ZPL since it appears that HPF, being based on Fortran, is more likely to be accepted as the standard parallel language. My contention, to be borne out by the results of this thesis, is that HPF lacks a robust programming model to assure the programmers that an algorithm implemented in HPF will be scalable and portable. In this respect, the choice of ZPL for a comparison with HPF is appropriate since ZPL has a strong foundation in a programming model. One can argue that because of the high visibility of HPF, a failure of the language due to performance problems will not only disappoint the user community but will also have an adverse effect on continuing research in parallel systems as a whole. Therefore, if HPF is failing to meet user expectations, it is imperative that we understand the source of the problem.

Chapter 3 will focus on key features of both languages that are critical to scalability and portability. While it is difficult to measure the syntactic appeal of a language abstraction, the ease of use ultimately depends on whether the performance of the abstraction meets the user expectation. In Chapter 4 this will be quantified through an experimental characterization of a number of basic language abstractions in both languages. Chapter 5 will carry the comparison to the NAS benchmarks. This chapter will also briefly consider the interesting issue of native and nonnative compilation. Chapter 6 will consider the problem of a computation with a recurring dependence. A number of solutions are studied and a new language abstraction is proposed that generalizes the parallel prefix operation. Finally Chapter 7 will summarize the findings of this thesis and outline future work.

Chapter 2

Programming Models for Data Parallel Problems

2.1 Introduction

In Chapter 1 portability and scalability have been identified as critical requirements for software development, particularly for parallel applications. Portability for parallel programs, however, is hampered by the rich diversity in parallel architectures. Parallel machines in general present one of two memory models: shared memory and nonshared memory. If the programmer is to adhere to the memory model of the machine, then different versions of the program must be written, contradicting the goal of portability. At the same time, choosing one model may preclude executing the program on machines that do not support the model. To remedy this problem, a straightforward solution is to emulate the program's programming model if such a model is not directly supported by the machine. Since a programming model is an abstraction, it can be readily constructed in software, albeit at a certain cost. For instance, on a nonshared memory machine, a software layer can provide the view of a single address space. Conversely, a nonshared memory programming model can be created trivially on a shared memory machine by emulating the send

and receive. The focus then shifts to the cost of such an emulation because this cost directly affects the scalability. If it is negligible, then the portability problem with respect to the shared and nonshared memory models can be considered solved. Shared memory programs can be executed on nonshared memory machines by using the Shared Virtual Memory system proposed by Li and Hudak [Li & Hudak 89]. In this approach, the operating system maintains a consistent cache of memory pages to create the view of shared memory in a nonshared memory machine. Priol and Lahjomri [Andre & Priol 92] measured the performance of a number of shared memory programs running on a Shared Virtual Memory system for the iPSC/2 and compared against the native nonshared memory programs on the same machine. They found that the shared memory programs tend to have difficulty with the granularity of sharing. In this chapter, we examine the case of executing nonshared memory programs on shared memory machines. Real world instances of this approach can readily be found. Large scale nonshared memory machines (e.g., Intel Paragon, IBM SP2) are targeted by data parallel languages such as HPF [Forum 93] and ZPL [Lin & Snyder 93]. At the same time, as a testimony to the low cost and effectiveness of the class of small scale bus-based shared-memory machines, many computer vendors presently offer symmetric multiprocessors (SMP). It is logical then to add the proper software emulation on these SMPs to execute the nonshared memory programs generated by HPF and ZPL. In fact, ZPL programs can be run on both shared memory machines such as the KSR and nonshared memory machines such as the Paragon. Similarly, the DEC HPF compiler supports both a network of workstations and the DEC SMP server [Harris et al. 95]. One motivation for choosing the nonshared memory model is that this model offers accurate performance prediction. Anderson and Snyder [Anderson & Snyder 91] analyzed several algorithms developed with the shared and nonshared memory models and found that the shared memory model produces overly optimistic performance prediction that leads to suboptimal algorithms. To achieve portability for the nonshared memory model requires emulation that may affect the scalability and this tradeoff needs to be

quantified. In this chapter, this question is formulated as a comparison of shared memory and nonshared memory versions of two applications, LU Decomposition and Molecular Dynamics simulation. The comparisons are made in two ways:

1. The data reference pattern of each application is used to construct a simple analytical model of the parallel execution to predict the behavior.
2. The performance of the programs is measured on five shared memory machines with widely differing memory organizations.

In related work, Lin and Snyder [Lin & Snyder 90] have examined the performance of shared and nonshared memory programs on several shared-memory machines. They found that the nonshared memory program can outperform the shared memory version in many cases. However, the study included only two simple programs, matrix multiplication and Jacobi iteration, and only two machines, the Sequent Symmetry and the BBN Butterfly GP1000. Although the results were interesting, the data set was deemed insufficient for making generalizations. Leblanc [LeBlanc 86] also made a similar comparison using Gaussian Elimination on the BBN Butterfly. He observed that the model should be chosen based on the nature of the application, and that the shared model may encourage too much communication. However, since the version of Gaussian Elimination used did not include the difficult partial pivoting step, the computation is not very different from matrix multiplication. In addition, the measurements were collected from only one older class of machine that uses relatively slow processors. The study in this chapter contributes to the results from previous work by including many contemporary large scale shared memory machines and by considering an analytical model for the benchmarks that corroborates the observed performance. The second point is particularly significant because measured performance may include unknown system effects that may bias the results. An analytical model that is validated by real data will confirm that the hypothesized effect, i.e., the memory model, is indeed the primary effect.
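For intuition only, one generic form such a reference-volume model can take is sketched below; this is my assumption for illustration, not the model actually constructed in Sections 2.3 and 2.4.

    T_p \approx \frac{W}{p} + V_{\mathrm{local}}(p)\, t_{\mathrm{local}} + V_{\mathrm{remote}}(p)\, t_{\mathrm{remote}},
    \qquad S(p) = \frac{T_1}{T_p}

where W is the total work, V_local and V_remote are the per-processor volumes of local and remote data references implied by the partitioning, and t_local and t_remote are the corresponding access times. The shared and nonshared memory versions of a program differ chiefly in how the reference volume splits between the two terms.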

Figure 2.1: Machine memory hierarchy

The remainder of the chapter is organized as follows. Section 2.2 defines the scope of the problem and describes the methodology of the experiments, including the machines and the implementation. Sections 2.3 and 2.4 describe in detail each of the two applications and the results. Conclusions are found in Section 2.5.

2.2 Methodology

To facilitate the discussion in this chapter, we use the subscripts s and ns: Ps is a shared memory program while Pns is a nonshared memory program.

2.2.1 The Machines

The shared-memory machines used in the experiment are the Sequent Symmetry, the BBN Butterfly, the Cedar at CSRD, Illinois, the Kendall Square Research KSR-1, and the DASH at Stanford. They represent a wide range of memory hierarchies from relatively uniform access (Sequent) to non-uniform access (Cedar, Butterfly, KSR-1, and

Table 2.1: Characteristics of the shared memory machines
(Note: (1) the access ratio is normalized to the access time of the memory closest to the processor; (2) * is the sum of all local memories)

Machine       Sequent          CSRD Cedar       Butterfly        Kendall Square   DASH
Model         Symmetry A       Cedar            TC2000           KSR-1            DASH
Site          U. Washington    U. Illinois      LLNL             U. Washington    Stanford
nodes         20               32               128              32               48
processors    Intel 80286      MC 68020         MC 88100         custom           MIPS R3000
I cache       combined         16 Kb/node       32 Kb/node       256 Kb/node      64 Kb/node
D cache       64 Kb/node       128 Kb/cluster   16 Kb/node       256 Kb/node      64 Kb L1, 256 Kb L2, 128 Kb/cluster
local mem     0                32 Mb/cluster    16 Mb/node       32 Mb/node       14 Mb/node
global mem    32 Mb            256 Mb           2 Gb*            1 Gb*            168 Mb*
network       bus-based        omega            butterfly        ring             mesh
access ratio  1                1:4.5            1:3.7:12.7       1:10:75:285      1:4.5:8:26

DASH). The Sequent, KSR-1 and DASH employ hardware coherent caches while the Cedar and the Butterfly do not. Figure 2.1 and Table 2.1 show the general organization and some relevant characteristics of each machine. Note that the access ratio is the access time of each memory level normalized to the access time of the level closest to the processor. This parameter is intended to give an indication of the depth of the memory hierarchy. One important point worth noting is that although several machines are being studied, the focus of the comparisons is between two programs on each machine. Therefore, while the speed of the processor relative to the memory performance plays a central role in determining the performance of a program, it can be factored out in a comparison between programs on the same machine. The same argument applies to the differences in the memory performance between the machines due to the memory architecture, e.g., memory bandwidth, cache line size, etc. Following is a brief description for each parallel machine used in this study.

The Sequent Symmetry is a small scale bus-based machine. Because of the low speed of the processors, the bus is able to support 20 nodes. Cache coherency is maintained by bus-snooping using a modified Illinois protocol [Gifford 87]. Since the main memory

resides on the bus, the access time is the same from any processor cache, and since there is no control over the cache, each processor will only see a certain average access time. Because of these characteristics, the Sequent bears the closest resemblance to the PRAM model (the Parallel Random Access Memory model, in which each processor can access any memory location in unit time) when compared to the other machines. For this reason, the access ratio of the Sequent in Table 2.1 is approximated as 1.

The Cedar has a cluster architecture: each bus-based cluster contains 8 processors sharing a local memory and the clusters are connected to the global memories through a switch. Therefore there are two levels of memory hierarchy. Note that the intra-cluster communication can be done at the cluster memory level, but inter-cluster communication must use the global memory.

In the Butterfly TC2000, each processor node contains a cache and a local memory. Each node is connected to the local memories of all other nodes through a multistage interconnection network (MIN); the global memory thus consists of the aggregate of all local memories. There are three levels of memory hierarchy: the cache, the local memory, and the remote memory. Although memory locations can be assigned a variety of cache attributes, the most commonly used attributes are cacheable local and non-cacheable global, which represent only two levels of memory hierarchy. Since there is no hardware coherency mechanism, the machine provides various cache invalidation functions to support software caching. Finally, to avoid hot spots in this memory organization, the global address space is interleaved across all nodes.

The KSR-1 employs an AllCache architecture [Kendall Square Research 92]: instead of a main memory, each node possesses a large cache that is kept coherent with all other caches through a snoopy mechanism at the ring level and a directory-based scheme within the ring hierarchy. In addition to the main cache, there are also instruction and data subcaches on each node. The nodes are connected in a hierarchy of rings; therefore, there are multiple levels of memory hierarchy, beginning with the subcache, to the local cache, the caches within the same ring, the caches within the next ring level, and so on. Because

23 the unit of communication on the ring is a 128 byte cache line, the granularity of shared data is an important issue. While the architecture provides a combining mechanism by servicing a cache miss at the lowest level of the ring hierarchy, the machine used in the experiment only contains one ring; therefore this combining e ect is not visible. The Stanford DASH also uses a directory scheme for cache coherency [Lenoski et al. 93]; however, it is organized as a collection of bus-based clusters connected in a mesh topology. Each cluster contains 4 processor nodes, a cluster memory and a cluster cache. Each node in turn contains a level 1 and a level 2 cache, with coherency maintained at the level 2 cache through bus snooping. Consequently, there are at least 5 levels of memory hierarchy: level 1 cache, level 2 cache, cluster memory, remote cluster memory, and remote dirty cluster memory. The cluster cache that resides in the interface between a cluster and the other clusters combines requests to the same address. Since there is a home node for each memory location, the system also provides functions for data placement.

2.2.2 The applications and the implementations In this study, we focus on the class of data parallel applications by using the LU decomposition problem and the WATER benchmark from the Stanford SPLASH suite. Although there are other important classes of parallel applications, the data parallel class covers a large number of scienti c problems and this class stands to bene t immediately and signi cantly from parallelization. Ecient algorithms exist for both problems in both shared and nonshared memory models, and the algorithms employ static data partitioning and load balancing. The following sections will analyze these algorithms in detail. Note that because our focus is on the portability issue of parallel programs, we are not considering applications with an irregular structure for which an ecient nonshared memory solution may not be obvious. While an algorithm for such irregular applications may perform reasonably well on a shared memory machine, the same algorithm is unlikely to perform well on

a nonshared memory machine with emulated shared memory because of the significant difference in the memory latency. In other words, with respect to portability, we consider that portable solutions for such problems have not yet been developed. In implementing the algorithms, all programs are written in the SPMD style. The performance is measured only for the useful computation; the initialization costs are not included. In Ps, data resides in the global memory and the standard lock and barrier mechanisms are used for synchronization. In instances where an architecture provides a major feature to support the shared memory, simple machine-specific optimizations for data placement are used, and they are described in the following sections. In general, however, aggressive optimizations that are not portable are excluded. In Pns, data is placed in the level of memory closest to the processor. On machines with coherent caches, this effect is achieved automatically through the data reference pattern: since a processor only accesses its data partition, data is allowed to migrate to the highest level of the cache. Note that although data in Ps also migrates to the level closest to a processor, the data may be written by other processors and as a result be forced out of this level. In general, data locality (in the processor dimension) is an inherent advantage of Pns. The sends and receives are emulated with straightforward block copy, and simple one-reader/one-writer ports implement some connection topology (binary tree, mesh, etc.). The buffers used for communication are placed in the global memory. Data communication is more expensive in Pns than in Ps because it must be emulated through software, incurring costs in code to manage the buffers as well as additional loads and stores to read and write the buffers. In contrast, Ps communicates data simply by writing to the shared address space. Although the software emulation can be further optimized by passing pointers, such optimizations are excluded in this study. Given the advantages and disadvantages described above, the performance tradeoff for Pns will be between the communication overhead and better processor locality.
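A minimal sketch of the emulated one-reader/one-writer port is shown below. The routine names and the spin-wait protocol are illustrative only; a real implementation also needs machine-specific synchronization details (memory fences, volatile storage, backoff) that are omitted here.

    ! Emulated one-reader/one-writer port for the nonshared memory program Pns.
    ! buf and full live in shared (global) memory.
    subroutine port_send(buf, full, data, n)
      implicit none
      integer, intent(in)    :: n
      real(8), intent(inout) :: buf(n)
      integer, intent(inout) :: full
      real(8), intent(in)    :: data(n)
      do while (full /= 0)        ! spin until the reader has emptied the port
      end do
      buf(1:n) = data(1:n)        ! block copy into the shared buffer
      full = 1                    ! mark the port full
    end subroutine port_send

    subroutine port_recv(buf, full, data, n)
      implicit none
      integer, intent(in)    :: n
      real(8), intent(in)    :: buf(n)
      integer, intent(inout) :: full
      real(8), intent(out)   :: data(n)
      do while (full == 0)        ! spin until the writer has filled the port
      end do
      data(1:n) = buf(1:n)        ! block copy out of the shared buffer
      full = 0                    ! mark the port empty
    end subroutine port_recv

The extra buffer management and the two block copies in this sketch are exactly the software overhead charged against Pns above.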

25 An important issue in implementing the shared and nonshared memory versions is the relative ease of the implementations. Given the current state of the art in parallel languages, a shared memory program is generally easier to write than a nonshared memory program and this holds true in our implementations. However, since our focus is portability, there are several points worth noting. First, it is easier to arrive at a working shared memory program from scratch because the model allows the programmer to ignore data placement. In many cases, the pattern of reference allows the data to migrate naturally to the processors that use them, yielding good performance with little e ort by the user. However, it is generally recognized that the program must be optimized for data locality to obtain scalable performance. This optimization step is dicult since machine speci c techniques are not portable and portable techniques will likely require signi cant restructuring of the program or the algorithm. Second, the diculty in writing a nonshared memory program re ects the lack of appropriate language support for the nonshared memory model rather than a fundamental limit in the model. Current developments in parallel languages for nonshared memory machines such as HPF and ZPL may provide a solution to this problem. As a result, we believe the issue of ease of programming should be excluded for the focus of this chapter. This issue will be revisited in the following chapters.

2.3 LU Decomposition

The following two sections describe the two sets of experiments. The problem and the algorithm are first described, followed by an analysis of the data locality in the Ps and Pns versions. Results and discussion follow.


2.3.1 The problem Typical of matrix problems, LU decomposition is a numerical method for solving large systems of linear equations. Given the system of equations:

Ax = b

where A is the coefficient matrix and b is the constant vector. A is decomposed into a lower and an upper triangular matrix, L and U:

LUx = b

Letting Ux = y, we can solve directly for y by forward-elimination,

Ly = b

and then for x by backward-substitution,

Ux = y

The sequential algorithm consists of iterating over the diagonal of the matrix, each iteration consisting of two steps: partial pivoting (line 2) and row update (lines 3:6). LUseq (1) for (k=0; k Pr, the remaining array dimensions are collapsed; if Ar < Pr, the array is distributed over Ar dimensions on the processor grid and is anchored on the first order of the remaining processor dimensions. The current compiler implementation supports a 2-dimensional processor grid. An array can be aligned with another by selecting the proper index range and stride of its region. For instance, Figure 3.3 shows array B[9..15,5..11] with a stride of 2 being overlaid over array A[1..20,1..20]. It is possible to reproduce the HPF layout in Figure 3.2, but in general HPF distribution is richer. Compared to HPF, ZPL does not map an arbitrary array dimension to a processor grid dimension. This precludes mapping the array to a transposed processor grid as in Figure 3.2. ZPL also does not separate the offset from the index range when two arrays must be laid out relative to each other. For instance, array B in the example has 4 × 4 elements, but it cannot be indexed as [1..4,1..4]. On the other hand, the processor grid is specified at runtime, delaying the binding and thus allowing different processor configurations to be invoked without recompiling the program.

Redistribution

HPF allows an array distribution to be changed in two ways. It can be changed at any point in the program via REDISTRIBUTE and REALIGN to be tailored to each phase of the computation. In addition, since a subroutine can specify a different data distribution within its scope, an array that is passed as a formal argument may have to be redistributed on the subroutine entry to match the local distribution and then again on the exit to return to the caller's distribution. This presents a potential inefficiency if an array is repeatedly redistributed across the subroutine boundary. This can be avoided if the caller and the callee reside in the same program by ensuring that both specify the same distribution. However, for separately compiled subroutines, the caller's distribution is unknown. In addition, the distribution inherited from the caller may be a poor match for the subroutine and result in excessive communication.
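As a hypothetical illustration of the first mechanism, the sketch below changes an array's layout between two phases. The program is ours, not taken from any benchmark; the directives shown are the standard HPF spellings, and REDISTRIBUTE requires the array to be declared DYNAMIC.

    program phases
      implicit none
      integer, parameter :: n = 512
      integer :: i
      real :: a(n, n)
    !HPF$ DYNAMIC A
    !HPF$ DISTRIBUTE A(BLOCK, *)        ! rows are local during the first phase
      a = 1.0
      forall (i = 1:n) a(i, :) = a(i, :) * real(i)     ! row-oriented work
    !HPF$ REDISTRIBUTE A(*, BLOCK)      ! executable directive: repartition by columns
      forall (i = 1:n) a(:, i) = a(:, i) + real(i)     ! column-oriented work
      print *, sum(a)
    end program phases

Each REDISTRIBUTE here stands for the full cost described below: computing the new map, allocating and copying buffers, and communicating every element that changes owner.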

region bigR[1..20,1..20]
       smallR[9..15,5..11] by [2,2];
var    A: [bigR] integer;
       B: [smallR] integer;

(a) ZPL code segment

(b) Logical mapping: array B[9..15,5..11] by 2 is aligned over array A[1..20,1..20].

(c) Physical mapping onto the 2 × 2 processor grid: P(0,0) holds A[1..10,1..10] and B[9..9,5..9] by 2; P(0,1) holds A[1..10,11..20] and B[9..9,11..11] by 2; P(1,0) holds A[11..20,1..10] and B[11..15,5..9] by 2; P(1,1) holds A[11..20,11..20] and B[11..15,11..11] by 2.

Figure 3.3: Example of ZPL data layout. Note: the bounding box for A and B is used to compute the partitioning.

HPF does not solve this problem but rather provides the user the options to specify the appropriate action. The normal DISTRIBUTE directive is called prescriptive since the array must be redistributed as specified. The INHERIT directive is transcriptive: the array distribution is to be inherited from the caller without any redistribution. The DISTRIBUTE * directive is descriptive: the programmer asserts that redistribution can be omitted because the distribution in the caller and in the subroutine will be identical, thereby relieving the compiler and the runtime system of the task of checking the distribution. Thus with careful use of these directives, the user can balance between the redistribution cost and the computation in the subroutine. Procedures in ZPL are distinguished as sequential or parallel. A procedure is sequential if it does not contain any reference to a parallel variable (a region); otherwise it is a parallel procedure. An example of a sequential procedure is one that accepts an integer and returns its square. This distinction requires the two types of procedure to be treated differently; specifically, the concept of promoting a procedure only applies to a sequential procedure that has no side effects. Calling a procedure does not involve communication in ZPL for the following reason. Calling a sequential procedure requires no communication since by definition, the procedure only operates on data local to the processor. Calling a parallel procedure without passing any parallel array also results in no communication since parallel arrays declared within the procedure scope are not visible to the caller. If a parallel array is passed, the compiler will analyze references to the parallel array in the interprocedural analysis step to compute the appropriate bounding box for partitioning. The result is that the parallel array will remain perfectly aligned across the procedure boundary, thus requiring no communication.
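To make the three HPF options described above concrete, the sketch below shows how a dummy argument might be annotated under each form; the routine and its body are hypothetical, and only one of the three directives would be used in practice.

    subroutine update(a, n)      ! hypothetical routine operating on a distributed dummy array
      implicit none
      integer, intent(in) :: n
      real, intent(inout) :: a(n, n)
    !HPF$ INHERIT A              ! transcriptive: accept the caller's distribution unchanged
    ! Alternatives, each used in place of INHERIT:
    !   DISTRIBUTE A(BLOCK,BLOCK)    (prescriptive: force this layout, redistributing if needed)
    !   DISTRIBUTE A *(BLOCK,BLOCK)  (descriptive: assert A already arrives laid out this way)
      a = a + 1.0
    end subroutine update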

Discussion

HPF clearly provides a wide range of complex data distributions. In addition, the actual distribution of an array is completely decoupled from the reference to array elements since HPF must adhere to the Fortran syntax; in other words, an array element

72 is accessed in the same manner regardless of how it is distributed. These features culminate in a high level of convenience and expressiveness for HPF programs. In terms of the modeling framework, they capture the nonshared memory characteristics of the programming model by requiring the user to consciously distribute the data. At the same time they allow the problem to be solved from a global point of view, sparing the user the tedious low level programming typical of message passing programs. Since HPF's abstractions to implement the nonshared memory programming model appear to be reasonable, we proceed to consider whether they can be implemented eciently and consistently. The challenge to the compiler lies in two areas: optimizing the data redistribution and the nonlocal array references. The latter has been discussed in detail in Section 3.4.2. Redistribution is clearly a heavy weight process due to the complex mapping and the communication involved. The compiler can generate some code for redistribution, but the runtime system must handle most cases because many variables are unknown. Redistribution requires computing the distribution map of the source and destination arrays, intersecting the maps to determine all the senders and receivers, generate the communication and marshalling the data to and from the communication bu ers. The cost thus includes the computation, the bu er allocation and copying, the communication and the degradation in cache performance. For this reason, the APR compiler provides the option of allocating the full storage for a distributed array at each node, of which only the local array section is used. This reduces the overhead for bu er allocation and the data marshalling, but the memory requirement is not scalable. Recalling our modeling framework, the signi cant redistribution cost should be visible in the programming model. The explicit REDISTRIBUTE and REALIGN directives meet this requirement, however the implicit redistribution that occurs across subroutine boundaries does not. Since the default action by the compiler is to enforce redistribution to ensure correctness, a programmer may be surprised by unexpected redistribution calls inserted by the compiler. The presence of REDISTRIBUTE or REALIGN in the source

program also adds complexity to the compiler since it interferes with many standard optimizations for the sequential code; for this reason, the IBM HPF compiler currently does not support REDISTRIBUTE and REALIGN.

3.5.2 Distributing the computation

Section 3.4.1 describes how the user can express the parallel computation. The next step is to distribute the parallel computation among the processors and ultimately, the degree of parallelism achieved depends on how this distribution is actually implemented. In the ideal case, the computation will somehow be evenly distributed among the processors with no overhead. In reality, however, the distribution strategy not only affects the workload balance but also the necessary communication, since nonlocal data needed for the computation must be fetched to the site that performs the computation and the result must be stored to the appropriate site. Since parallelism in both languages is derived from some form of looping, the common method for distributing the loop is to adjust the loop bounds and stride to reflect the local workload. If the loop cannot be distributed, the bounds will not be adjusted and all processors would execute all iterations, resulting in no parallelism.

Can the loop be distributed?

Before a loop can be distributed, an HPF compiler must first analyze the dependencies to determine whether correctness can be preserved. Clearly the distribution is inhibited if the analysis fails, and in this respect HPF provides several directives to help the compiler. A loop with no cross-iteration dependency can be annotated with INDEPENDENT. A subroutine call within the loop may have side effects which would constitute a loop dependency; a subroutine with no side effects can be annotated with PURE to assert this fact. Whether a user needs to insert these directives depends on the analysis capability of the compiler.
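As an illustration, the fragment below marks a loop INDEPENDENT and declares the called function PURE; the routine names and bodies are hypothetical.

    pure function scale3(x) result(y)   ! no side effects, so it may be called in a distributed loop
      real, intent(in) :: x
      real :: y
      y = 3.0 * x
    end function scale3

    subroutine apply(a, b, n)
      implicit none
      integer, intent(in) :: n
      real, intent(in)  :: b(n)
      real, intent(out) :: a(n)
      integer :: i
      interface
        pure function scale3(x) result(y)
          real, intent(in) :: x
          real :: y
        end function scale3
      end interface
    !HPF$ INDEPENDENT                    ! assert: no cross-iteration dependences
      do i = 1, n
        a(i) = scale3(b(i))
      end do
    end subroutine apply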

How to distribute?

In contrast to its extensive data layout capability, HPF does not specify how the computation is to be distributed, leaving this decision to the compiler implementation5. One motivation for this choice is to leave open optimization opportunities for the compiler. A common convention adopted by several compilers (IBM, PGI, DEC) is the owner-computes rule: the processor that owns the LHS of an assignment statement is to execute the statement. In this scheme, the compiler will generate the communication necessary for the owner of the LHS to gather the data referenced in the RHS before the computation takes place. The advantage of this scheme is twofold: it is simple to implement and the assignment is made locally. There are situations where the owner-computes rule is suboptimal. For instance, if many array elements referenced in the RHS reside on the same processor, the communication volume can be reduced by allowing this processor to perform the computation and to send only the result to the owner of the LHS. Various proposals have been made to optimize the communication by selecting the best site to perform the computation [Amarasinghe & Lam 93], but this proved to be a difficult problem. Other schemes are also possible; for instance, the APR compiler simply selects (automatically using some set of rules or with the user's assistance) a favorable loop to distribute [App 95]. From the modeling point of view, the absence of a definite scheme to distribute the computation leaves a gap in the programming model. This gap is both necessary to preserve the Fortran syntax and intentional to make the programming convenient. Yet it forces a programmer either to (1) rely completely on the compiler to parallelize the program or (2) close the gap by tuning the program to the specific compiler and sacrificing portability in the process. In other words, it may be difficult to balance the workload in a portable manner since the parallelization strategy varies with the platform. A case study of the NAS EP benchmark in Chapter 5 will illustrate this problem.

5 HPF 2.0 addresses this issue by adding the ON HOME directive that enforces the owner-computes rule [Forum 96].
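To make the owner-computes rule concrete, the following hand-written SPMD sketch shows one plausible translation of the statement A(i) = B(i-1) + C(i) over BLOCK-distributed 1-D arrays. It is an illustration of the idea, not the output of any of the compilers discussed, and the sizes are placeholders.

    ! SPMD sketch of owner-computes for:  A(i) = B(i-1) + C(i),  i = 2..N
    ! A, B, C are BLOCK-distributed; each process owns nloc contiguous elements.
    program owner_computes
      implicit none
      include 'mpif.h'
      integer, parameter :: nloc = 1000           ! local block size (illustrative)
      integer :: rank, nprocs, left, right, ierr, i, lo
      integer :: status(MPI_STATUS_SIZE)
      real(8) :: a(nloc), b(0:nloc), c(nloc)      ! b(0) is the ghost cell for B(i-1)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      b(1:nloc) = 1.0d0
      c = 2.0d0                                   ! stand-in initialization
      left  = rank - 1
      if (left < 0) left = MPI_PROC_NULL
      right = rank + 1
      if (right >= nprocs) right = MPI_PROC_NULL
      ! the owner of A(i) fetches the single nonlocal RHS element B(i-1):
      call MPI_SENDRECV(b(nloc), 1, MPI_DOUBLE_PRECISION, right, 0, &
                        b(0),    1, MPI_DOUBLE_PRECISION, left,  0, &
                        MPI_COMM_WORLD, status, ierr)
      lo = 1
      if (rank == 0) lo = 2                       ! the global index starts at 2
      do i = lo, nloc
        a(i) = b(i-1) + c(i)                      ! purely local after the exchange
      end do
      call MPI_FINALIZE(ierr)
    end program owner_computes

Because the nonlocal references are constant shifts, the whole boundary can be fetched in one message per neighbor; this is the message vectorization that the compilers discussed later attempt to perform automatically.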

ZPL distributes its computation according to a number of rigid rules, leaving little opportunity for deviation or, in fact, optimization. First, the global region representing the whole program will be partitioned by block onto the processor grid, thereby distributing any arrays and statements defined over the region. Second, the computation is aligned with the LHS, thereby ensuring that any ZPL implementation will follow the owner-computes rule. Distribution in ZPL thus strictly implements the semantics of the language: a ZPL compiler only has to compute the local loop bounds for the array on the LHS; it does not need to perform any analysis to arrive at the decision to parallelize the loop. The difference between HPF and ZPL can be summarized as follows. An HPF compiler begins with an undistributed loop and attempts to distribute it by analyzing the dependencies and/or by relying on user directives; if the analysis fails, the loop is not distributed and all processors execute the full iteration. ZPL on the other hand strictly implements the language semantics; in other words, a statement within a region scope is a parallel statement and is guaranteed to execute in parallel. In terms of modeling, the HPF model of parallelization is weaker since it includes two variables: whether the loop can be parallelized and how the workload will be distributed. The ZPL model is stronger since the language semantics clearly define the behavior of these two variables. For HPF compilers that use the owner-computes rule, the computation distribution shadows the data distribution. The wide range of choice for data distribution thus leads to more flexibility in implementing an algorithm. For instance, the LU decomposition would benefit from a block cyclic distribution since the workload is more evenly distributed, while an SOR solution would prefer a block decomposition to exploit the neighbor locality. ZPL's choice for distribution is more limited in this respect since only block distribution is supported.
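For example, the two workloads might be declared as follows; the sizes and the block width are illustrative only.

    program layouts
      implicit none
      integer, parameter :: n = 1024, m = 1024
      real :: a(n, n)      ! LU-style workload: benefits from block cyclic
      real :: g(m, m)      ! SOR-style stencil: benefits from plain block
    !HPF$ DISTRIBUTE A(CYCLIC(16), CYCLIC(16))
    !HPF$ DISTRIBUTE G(BLOCK, BLOCK)
      a = 0.0
      g = 0.0
      print *, 'layouts declared'
    end program layouts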

3.6 Conclusion The issue is not whether one can achieve performance in a particular language but whether this can be done conveniently. Given time and resources, an ardent enthusiast of a language can tune a program in the language to achieve good scalar and parallel

76 performance, but it may not be convenient to arrive at the result and the result may not be portable. An example of this can be found in the benchmarks used by some HPF vendors: the APR benchmarks for HPF have been carefully tuned to match the APR compiler; as a result they perform well with the APR compiler but not with other HPF compilers. PGI versions of the benchmarks also exhibit this characteristic. An optimization technique is most successful when it can cover the majority of the cases. When it only covers speci c cases, several problems arise: the case must rst occur in the program and then the compiler must be able to recognize the case to apply the optimization. The latter problem can be particularly dicult, but both problems reduce the e ectiveness of the optimization. Optimizations for array references in HPF seem to su er from this phenomenon: the ability to make global array references leads to many possible patterns of reference, with no particular prevalent pattern that can be easily recognized. Given that parallel language is still a developing area, one can safely presume that the designers of neither HPF nor ZPL assert that their current language is the nal solution for parallel programming. The discussion in this chapter clearly shows the bene ts and liabilities of two fundamentally di erent parallel languages. By capitalizing on the Fortran base, HPF succeeds in establishing a standard in parallel languages where none existed before and this should be recognized as a very signi cant achievement. However, building from an inherently sequential language carries many implications that are now becoming more apparent as experience is gained from implementing and using the language. One general observation is that the ow of information in an HPF program is largely from the user to the compiler. In other words, the user provides information about the program to aid the compiler, but the compiler guarantees to the user little more than correctness. ZPL is based on a model of parallel machines. Being designed from rst principles, ZPL can freely incorporate new concepts and language constructs that work well. Because these concepts and constructs are intrinsic to the language, the language speci-

77 cation serves as a secure contract to the user that the program will behave as expected. Information thus ows both ways, from the language to the user and vice versa. Being new also carries the burden of having to prove ZPL's merit and to gain acceptance from users and software developers. In the following chapters, we will further compare HPF and ZPL, rst with respect to the concept of performance model (Chapter 4), then experimentally using the current HPF and ZPL compilers (Chapter 5).

Chapter 4

The Performance Model in HPF and ZPL

"HPF must be used carefully, because efficient and inefficient codes look very similar. An HPF translator generates code to implicitly perform whatever communication and overhead required to correctly execute the HPF program statements. On the other hand, with a program parallelized by explicit insertion of message passing communication calls it is very apparent where overhead is introduced. The ability of a translator to generate whatever communication and overhead is required, is in a way a defect, in that it obscures what really happens at the runtime."

APR Fortran Parallelization Handbook [Friedman et al. 95]

4.1 Introduction In Chapter 1 we discussed the pervasive use of modeling. Having examined qualitatively in Chapter 3 the facilities in HPF and ZPL for parallel programming, we nd that a clear distinction between HPF and ZPL is in the modeling aspect of the languages. In this chapter, we begin by identifying this modeling aspect, namely the performance model.

79 One may question whether this di erence ultimately leads to any tangible di erence in the program performance. In other words, what is the bene t of a performance model? We will seek a quantitative answer to this question by rst formulating a methodology and then by considering two in-depth case studies.

4.1.1 The performance model Recall that a model re ects our understanding of how a certain system operates; it captures the information that is necessary and sucient for us to use the system. It follows that if the model is inaccurate or insucient, it will be dicult to use the system e ectively. Conversely, too much complexity makes the model dicult to use and defeats its very purpose. A model must capture the right level of information. In particular, when performance is an objective, the model must capture some information about performance. For a concrete example, consider sequential languages such as C or Fortran. Pro cient programmers in these languages have an approximate understanding of how each language abstraction correlates with its low level implementation, and they use this information routinely without any special consideration. For instance, procedure calls are convenient and helpful in structuring a large program, however, they do incur cost in pushing the stack and passing arguments, and this cost is be ttingly associated with the text required to set up and call a procedure. As a result, a programmer would not create procedures unnecessarily, nor would he/she create procedures thinking that doing so would make the program run faster. In the context of program development, this approximate understanding helps form a coarse but reliable classi cation of language abstractions in terms of their cost (execution time). We call this relative ranking the performance model. The performance model is coarse because it is not possible to use it to determine the actual execution time of the program independent of the targeted machine. In this respect, it is di erent from parameterized, analytical models such as the formula that is commonly used to predict

the communication cost:

    time = startup + msgsize / bandwidth
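For instance, with hypothetical parameters of startup = 50 microseconds and bandwidth = 40 MB/s (40 bytes per microsecond), an 8 KB message would cost roughly 50 + 8192/40 ≈ 255 microseconds, while an 8-byte message costs essentially the startup alone. The numbers are invented for illustration; the point of the performance model is precisely that it does not attempt this kind of machine-specific estimate.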

The performance model is reliable because a language abstraction classi ed as expensive will always take more time than one classi ed as inexpensive. For this reason, a more suggestive name for the model may be \WYSIWYG". For the terminology, Snyder in previous work [Snyder 95] de nes a machine model as a set of common facilities for a large class of parallel machines; the set contains sucient detail to allow the programmer to choose between programming alternatives. A programming model then extends the machine model by adding new abstractions with \known" costs. For the focus of this chapter, the term performance model refers to the performance aspect of Snyder's programming model. In other words, given the programming model, the performance model de nes the relative cost of the abstractions. Revisiting C and Fortran from above, their performance model arises naturally from the close aliation between the language and the hardware. The abstract von Neumann machine, by encompassing most sequential machines, thus emerges as the machine model for these languages. How is the performance model used? The performance model indicates the relative performance; therefore it helps the programmer in two important tasks: (1) given several alternative algorithms for a problem, how to choose the appropriate algorithm, and (2) given several alternatives to implement an algorithm, how to choose the appropriate implementation. These two tasks are critical because ultimately, the most e ective optimization rests with the user. No compiler optimization can transform a poor algorithm into an optimal algorithm. How does a language incorporate a performance model? Because no explicit formula or equation is involved, a performance model is typically derived from several sources. For a frequently used programming technique that has been encapsulated in a high level abstraction, its implementation and therefore its cost is well understood. An example is the DO loop. The language semantics, if well de ned

81 and concise, can provide valuable information. For instance, the register directive in C suggests that the attributed variable can be accessed faster than normal variables. Visual cues in the language syntax help to associate a relative cost with a particular language construct. For instance, the quantity of text can serve as the cue: a construct that is fast (e.g., an arithmetic operation) should be expressed with little text while a slower construct (e.g., a library function call) should require more text. Such a loose association leads to the coarse quality of the model, while the reliability of the model depends on the accurate association between the construct and the actual cost. The only requirement is that the part of the language that needs to have a cost associated must be visually identi able. Is the performance model a standard component of all languages? Many languages exist that do not contain a performance model. For example, Prolog was designed for expressing predicate calculus. Because there is no performance information in the language construct, it is dicult to infer whether a Prolog statement will execute quickly or take an exponential amount of time. Similarly, data ow and functional languages are built from mathematical abstractions with little consideration for the physical implementation. As a result, no performance model exists to guide the users in writing data ow or functional programs that can be emulated eciently on conventional machines.

4.1.2 ZPL's performance model Through extended research in programming models, ZPL designers recognized the existence and the importance of the performance model as a distinct entity. ZPL was designed with the performance model as a clear objective; as a result, one of ZPL's signi cant contribution is the careful integration of a performance model into the language. ZPL's machine model (the CTA) and programming model have been described earlier. For the performance model, ZPL abstractions exhibit the following behavior, listed in the order of highest to lowest performance [Snyder 94].

82 1. Element-wise array operations execute fully in parallel with no communication and achieve the highest performance. These operations are clearly distinguished since they involve only arrays declared as parallel arrays. 2. @ operation is likely to involve communication: the cost can be approximated as one message per @. 3. Flood operation requires one or more messages to replicate an array section; it is identi ed by the ood operator >>. 4. Reduction and scan operation require collective communication that involves multiple messages; they are identi ed by the operators ; ; > [S] B; A(i) = B(S) spread(B(S), A(i) = B(S) enddo 1,N) point A(1) = B(N) A(1) = B(N) A(1) = B(N) [F] Af := >> [N] B; to point [1..1] A := Af; dynamic do i=1, N forall (i=1:N/2) if (Index1 > operator in the program as speci ed by the performance model. Interestingly, the SUMMA performance by all compilers is consistently the best compared to the other algorithms. IBM and APR in particular achieve good performance although their implementations do not re ect SUMMA. SUMMA has been shown to be superior in many aspects such as memory usage, generality and exibility for non-square matrices [van de Geijn & Watts 95]. Our limited experiment in this case does not cover the parameter space suciently to illustrate the advantage of SUMMA. Therefore, the good performance belies the fact that the actual implementations do not re ect the same algorithm. The ZPL performance SUMMA is lower than the HPF's performance because of the scalar component. An inspection of the assembly codes generated by the Fortran and C compilers shows that the Fortran compiler can generate a much better instruction schedule for pipelining the superscalar processor in the SP2. Finally, we consider the self-relative speedup from 16 to 64 processors (Table 4.6). Compared to the ideal speedup of 4, we nd that APR and PGI fail to achieve any speedup for the DO loop version; APR's Cannon also fails because of high memory allocation. IBM and APR only achieve a modest speedup when the DO loop version is optimized, while PGI achieves superlinear speedup. This unpredictable variation in the

speedup is another indication of the weakness of HPF's performance model. In contrast, both ZPL implementations scale well. In summary, we find that little correlation exists between the algorithm expressed in the HPF programs and the actual implementations. By HPF's design, the compiler has significant freedom in transforming the program to arrive at an implementation. This characteristic in effect constitutes a gap in the performance model that essentially prevents the users from making any performance prediction for an algorithm. In contrast, ZPL correctly predicts that SUMMA is a better choice than Cannon, which in turn is better than the triply nested DO loop (not implemented). ZPL thus satisfies equation 4.2 in each case by matching the relative performance predicted by the performance model with the relative performance of the actual implementations.

4.4 Current HPF solutions A consequence of the weak performance model in HPF is the general diculty in predicting the behavior of an HPF program. This is evident in the solutions o ered by software vendors or features that programmers rely on for tuning HPF programs. A part of the HPF compiler from APR is an extensive set of compiler options and tools for pro ling and analyzing the program performance. The program can be automatically instrumented to gather statistics at each DO loop level and the statistics can serve as a database to the compiler for further recompilation and optimization. The listing generated by the compiler also provides performance details not available in the source program such as which loop level is being parallelized, which statement may incur communication, where communication is inserted and which array is being communicated. In addition, the FORGE Explorer Distributed Memory Parallelizer allows a user to interactively choose the arrays to partition and the loops to distribute. The user can also insert additional APR directives to aid the compiler in parallelizing loops or reducing communication. The capability of the APR system may suciently supplement the HPF performance

109 model so that a user can achieve good performance with the system. Indeed, users pro cient with the FORGE system may have learned a fairly complete model for producing good parallel programs. The success of this approach is evident in the benchmarks that APR made publicly available in conjunction with their published HPF performance; the benchmarks are highly tuned to the APR compiler, containing liberal APR speci c directives to aid the compiler. Unfortunately, such a model is a superset rather than a part of the HPF language speci cation. It is not portable to other HPF compilers, and there is no evidence that it should be formalized as HPF's model. Another program tuning technique that is more generally accessible is the intermediate SPMD code generated by the compilers. This option is available in all three HPF compilers being studied, even though the IBM compiler is a native compiler and does not need to generate the intermediate output. The intermediate code can be dicult to decipher, but has proven indispensable in providing important clues such as the communication generated for a particular statement.

4.5 Conclusions In this chapter, we identify the performance model as a crucial component of the language that programmers rely on for selecting the best implementation or algorithm. Because HPF and ZPL are data parallel languages for distributed memory parallel machines, the relevant aspects of the performance model are the communication and the overhead for managing the data distribution. To quantify the bene t of the performance model, we formulate a framework which leads to two requirements. For a language with no performance model, equation 4.1 requires the compiler to neutralize any performance di erence between two alternative programs because the programmer has no means to make a selection. For a language to have a performance model, equation 4.2 requires that the relative performance predicted by the model matches the implementation. Two case studies are performed using the array assignment and the matrix multi-

110 plication. In each case, HPF fails to satisfy both of the requirements above, while ZPL consistently meets the requirements. Since the performance model is needed to select between alternatives, the potential cost or bene t of the model is equivalent to the performance di erence between the alternatives. The data from both studies shows that this performance di erence is at least one order of magnitude. This result indicates that HPF users will face diculties in achieving consistent and portable performance in their programs. In practice, a performance model can be extended beyond the language speci cation. A user accustomed to a particular HPF compiler will over time learn the behavior of the compiler and supplement any missing information in the HPF performance model with compiler speci c information. For instance, the user may assume that F90 array assignment will be the most ecient. The supplemented model may enable the user to write high performance programs using the particular compiler, but since this model is not portable across compilers and consequently across platforms, the program performance will not be portable. It appears that part of HPF problem lies in the multiple alternatives for expressing the parallel computation. One may conjecture whether HPF can limit the number of alternatives to present a more consistent model. However, our discussion shows that a performance model requires a more careful e ort than simply limiting the alternatives. Furthermore, this con icts with the goal of compatibility with Fortran since the DO loop is an integral part of Fortran 77 as the array syntax is an integral part of Fortran 90. At the same time, the new Forall construct is desirable since the DO loop is dicult to analyze while the F90 syntax is too restrictive. In summary, this chapter demonstrates the importance of the performance model. The detailed study of the implementations shows that to create a robust performance model, the language speci cation must be suciently concise for the programmer to rely on and for the compiler to implement with consistency.

Chapter 5

Benchmark Comparison 5.1 Introduction Recently, several commercial HPF compilers have become available, enabling users to learn HPF, program in it, and directly evaluate the language. The Cornell Supercomputer Center has made available to the scienti c community a comprehensive set of HPF compilers and tools for production use. A ZPL compiler developed at the University of Washington is also publicly available for several parallel platforms, including the KSR, Intel Paragon, IBM SP2, and Cray T3D. Since no broad evaluation of the language and the compilers is yet available, this chapter is focused on the e ectiveness of HPF and ZPL in achieving portable and scalable performance for data parallel applications. We will study in-depth the performance of three NAS benchmarks compiled with three commercial HPF compilers on the IBM SP2. The benchmarks are: Embarrassingly Parallel (EP), Multigrid (MG), and Fourier Transform (FT). The HPF compilers include Applied Parallel Research, Portland Group, and IBM. To evaluate the e ect of data dependences on compiler analysis, we consider two versions of each benchmark: one programmed using DO loops, and the second using F90 constructs and/or HPF's Forall statement. The ZPL compiler is a version ported to the IBM SP2. For comparison, we also consider the performance of each benchmark written in Fortran with MPI.

112 The MPI results represent a level of performance that the HPF program should target to be considered a viable alternative. The ZPL version gives an indication of the performance achievable when the compiler is not hampered by language features unrelated to parallel computation. The results show some successes with the F90/Forall programs but the results are not uniform. For the other programs, the results suggest that Fortran's sequential nature causes considerable diculty for the compiler's analysis and optimization of the communication. The varying degrees of success among the compilers in parallelizing the programs, coupled with the absence of a clear model to guide the insertion of directives, results in an uncertain programming model for users as well as portability problems for HPF programs. In related work, APR published the performance of its HPF compiler for a suite of HPF programs, along with detailed descriptions of their program restructuring process using the APR FORGE tool to improve the codes [App 95, Friedman et al. 95]. The programs are well tuned to the APR compiler and in many cases rely on the use of APRspeci c directives rather than standard HPF directives. Although the approach that APR advocates (program development followed by pro ler-based program restructuring) is successful for these instances, the resulting programs may not be portable with respect to performance, particularly in cases that employ APR directives. Therefore, we believe that the suite of APR benchmarks is not well suited for evaluating HPF compilers in general. Similarly, papers by vendors describing their individual HPF compilers typically show some performance numbers, but the benchmarks tend to be selected to highlight the speci c compiler's strengths. Consequently, it is dicult to perform comparisons across compilers [Harris et al. 95, Bozkus et al. 95, Gupta et al. 95]. Lin et al. used the APR benchmark suite to compare the performance of ZPL versions of the programs against the corresponding HPF performance published by APR and found that ZPL generally outperforms HPF. However, the lack of access to the

113 APR compiler did not allow detailed analysis, limiting the comparison to the aggregate timings [Lin et al. 95]. The goals in this chapter are: 1. An in-depth comparison and analysis of the performance of HPF programs with three current HPF compilers and alternative languages (MPI, ZPL). 2. A comparison of the DO loop with the F90 array syntax and the Forall construct. 3. A comparison of machine-speci c and portable HPF compilers. 4. An assessment of the parallel programming model presented by HPF. The chapter is organized as follows: Section 5.2 describes the methodology for the study, including a description of the algorithms and the benchmark implementations. In Section 5.3, we examine and analyze the benchmarks' performance, detailing the communication generated in each implementation and quantifying the e ects of data dependences in the HPF programs. Section 5.4 provides our observations and our conclusions.

5.2 Methodology 5.2.1 Overall approach We study the NAS benchmarks EP, MG and FT [Bailey et al. 91]. MG and FT are derived from the NAS benchmark version 2.1 (NPB2.1), published by the Numerical Aerodynamic Simulation group at NASA Ames [Bailey et al. 95]. Previous versions of the NAS benchmarks were implemented by computer vendors and were intended to measure the best performance possible on a parallel machine without regard to portability. NPB2.1 is an MPI implementation by NAS and is intended to measure the best portable performance for an application. Because NPB2.1 programs are portable and are originally parallel, they constitute an ideal base for our study. Speci cally, NPB2.1 serves two purposes:

114 1. The actual NPB2.1 performance serves as the reference point for the HPF implementations or for any high level language [Saphir et al. 95]. 2. The HPF implementations are derived by reverse engineering the MPI programs: communication calls are removed and the local loop bounds are replaced with global loop bounds. HPF directives are then added to parallelize the programs. Conceptually, the task for the HPF compilers is to repartition the problem as speci ed by the HPF directives and to regenerate the communication. Since EP is not available in NPB2.1, we use the version from the benchmark suite published by APR. An important issue in an HPF program is how the computation is expressed. It was recognized that the conventional Fortran DO loop may over-specify the data dependences in data parallel computations; therefore the Fortran 90 array syntax and the Forall construct were proposed as better alternatives for expressing parallelism [Forum 93]. Since the NPB2.1 programs (and the derived HPF programs) are written in Fortran 77 with DO loops, we also consider a version in which the DO loops for whole-array operations are replaced with Fortran 90 syntax or the HPF Forall construct. To focus on the portability issue, the HPF directives are used according to the HPF speci cation within the functionality limit of the compilers; in other words, the programs are not tuned to any speci c compiler. This may put the APR compiler at a disadvantage since it relies on the APR-provided pro ling and program restructuring tools. The chosen benchmarks use only the basic HPF intrinsics and require only the basic BLOCK distribution that is supported by all three compilers. Therefore they can stress the compilers without exceeding their capability. The implementations in the ZPL language are derived from the same source as the HPF implementations, but in the following manner: the sequential computation is translated directly from Fortran to the corresponding ZPL syntax, while the parallel execution is expressed using ZPL's parallel constructs.
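For illustration, the contrast between the loop styles mentioned above can be seen in a generic whole-array update; the fragment is ours and is not taken from the benchmarks.

    program three_styles
      implicit none
      integer, parameter :: n = 1000
      integer :: i
      real :: a(n), b(n)
      b = 1.0
      a = 0.0

      do i = 2, n-1                                        ! Fortran 77 style DO loop
         a(i) = 0.5 * (b(i-1) + b(i+1))
      end do

      a(2:n-1) = 0.5 * (b(1:n-2) + b(3:n))                 ! Fortran 90 array syntax

      forall (i = 2:n-1) a(i) = 0.5 * (b(i-1) + b(i+1))    ! HPF/F95 Forall

      print *, a(n/2)
    end program three_styles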


5.2.2 Benchmark selection NPB2.1 contains 7 benchmarks1 , all of which should ideally be included in the study. Unfortunately, a portable HPF version of these benchmarks is not available, severely limiting an independent comparison. While APR and other HPF vendors publish the benchmarks used to acquire their performance measurements, these benchmarks are generally tuned to the speci c compiler and are not portable. This limitation forces us to carefully derive HPF versions from existing benchmarks with the focus on portability while avoiding external e ects such as algorithmic di erences. The following criteria are used: 1. The benchmarks should be derived from an independent source to insure objectivity. 2. A message passing version should be included in the study since the comparison is not only between HPF and ZPL but also against the target that HPF and ZPL are to achieve. 3. For HPF, there should be separate versions that employ F77 DO loop and F90/Forall because there is a signi cant di erence between the two types of construct. 4. There should be no algorithmic di erences between versions of the same benchmark. 5. Tuning must adhere to the language speci cation rather than any speci c compiler capability. 6. Because support for HPF features is not uniform, the benchmarks should not require any feature that is not supported by all HPF compilers. Considering the benchmark availability in Table 5.1, the sources for NPB1 are generally sequential implementations. Although they could be valid HPF programs, the 1

One was added recently after this writing

116 sequential nature of the algorithms may be too dicult for the compilers to parallelize and may not re ect a natural approach to parallel programming. In other words, an HPF programmer may have chosen a speci c parallel algorithm and simply wants to implement it in HPF. APR's sources, as mentioned above, are tuned to the APR compiler; therefore they are not appropriate as a portable implementation. The NPB2.1 sources are the best choice since they implement inherently parallel algorithms and they use the same MPI interface as all compilers being studied. Among the benchmarks, CG is eliminated simply because it is not available in NPB2.1. SP, BT and LU cannot be included because they require a block cyclic and 3-D data distribution that is not supported by all HPF compilers and ZPL. The limitations in themselves do not prevent these benchmarks to be implemented in HPF and ZPL (indeed they are); however the implementations will have an algorithmic di erence that cannot be factored from the performance. This leaves only FT and MG as potential candidates. Fortunately, EP is by de nition highly parallel; therefore its sequential implementation can be trivially parallelized.

5.2.3 Platform The targeted parallel platform is the IBM SP2 at the Cornell Theory Center. The system is a distributed memory machine with 512 processors, 48 of which are wide nodes2 . The compilers used in the study include:

- Portland Group pghpf version 2.1
- IBM xlhpf version 1.0
- Applied Parallel Research xhpf version 2.0
- ZPL compiler (SP2 port) from the University of Washington

One potential source of difference between ZPL and HPF performance is that the ZPL compiler produces C as the intermediate code. In general, Fortran compilers have

Wide SP2 nodes have wider data path and larger caches.

more opportunities for scalar optimization than a C compiler. To obtain a coarse approximation of this difference, we converted the FT benchmark from Fortran to C and compared the performance. For 1 to 8 processors, the Fortran version is 32% to 42% faster than the C version. This suggests that the observed scalar performance for ZPL will tend to be conservative.

    Processors   Fortran      C            % difference vs. C
    1            4.70 secs    7.55 secs    38%
    4            1.49 secs    2.20 secs    32%
    8            0.83 secs    1.43 secs    42%

All compilers generate MPI calls for the communication and use the same MPI library, ensuring that the communication fabric is identical for all measurements. Besides the aggregate timings, the program execution is also traced using the UTE facility [IBM 95] to measure the major phases of the program and the communication.

Table 5.1: Sources of NAS benchmarks for HPF and ZPL. The left 3 columns show the availability of the sources; the right 3 columns show the versions needed for the study and the source used.

          Available                        Needed for study
          NPB1   tuned APR   NPB2.1        HPF DO       HPF F90     ZPL
    EP    yes    yes         N/A           tuned APR    NPB1        NPB1
    FT    yes    yes         yes           NPB2.1       NPB2.1      NPB2.1
    CG    yes    N/A         N/A           N/A          N/A         N/A
    MG    yes    yes         yes           NPB2.1       NPB2.1      NPB2.1
    SP    yes    yes         yes           N/A          N/A         NPB1
    BT    yes    yes         yes           N/A          N/A         N/A
    LU    yes    N/A         yes           N/A          N/A         N/A

All measurements use the same compiler options and system environment that NAS and APR specified in their publications, and spot checks confirmed that the published NAS and APR performances are reproduced in this computing environment. The following sections will briefly describe the benchmarks and give further details on how the HPF (DO loop and F90/Forall) and ZPL implementations were created.

5.2.4 Embarrassingly Parallel

The benchmark EP generates N pairs of pseudo-random floating point values (x_j, y_j) in the interval (0,1) according to the specified algorithm, then redistributes each value x_j and y_j onto the range (-1,1) by scaling them as 2x_j - 1 and 2y_j - 1. Each pair is tested for the condition:

    t_j ≤ 1   where   t_j = x_j² + y_j²

If true, the independent Gaussian deviates are computed:

    X = x_j √((-2 log t_j)/t_j)        Y = y_j √((-2 log t_j)/t_j)

Then the new pair (X, Y) is tested to see if it falls within one of the 10 square annuli and a total count is tabulated for each annulus:

    l ≤ max(|X|, |Y|) < l + 1   where   0 ≤ l < 9

The pseudo-random numbers are generated according to the following linear congruential recursion:

    x_k = a·x_{k-1} mod 2^46   where   a = 5^15, x_0 = 271828183

The values in a pair (x_j, y_j) are consecutive values of the recursion. To scale to the (0,1) range, the value x_k is divided by 2^46. Figure 5.1 illustrates the data structure, the general flow of the computation and the pseudo-codes. Clearly, the computation for each pair of Gaussian deviates can proceed independently. Each processor would maintain its own counts of the Gaussian deviates and communicate at the end to obtain the global sum. The random number generation, however, presents a challenge. There are two ways to compute a random value x_k:

1. x_k can be computed quickly from the preceding value x_{k-1} using only one multiplication and one mod operation, leading to a complexity of O(n). However, the major drawback is the true data dependence on the value x_{k-1}.

2. x_k can be computed independently using k and the defined values of a and x_0. This will result in an overall complexity of O(n²). Fortunately, the property of the mod operation allows x_k to be computed in O(log k) steps by using a binary exponentiation algorithm [Bailey et al. 91] (a sketch of this computation appears at the end of this subsection).

The goal then is to balance between method (1) and (2) to achieve parallelism while maintaining the O(n) cost. Because EP is not available in the NPB2.1 suite, we use the implementation provided by APR as the DO loop version. This version is structured to achieve the balance between (1) and (2) by batching (see Figure 5.1(b)): the random values are generated in one sequential batch at a time and saved; the seed of the batch is computed using the more expensive method (2), and the remaining values are computed using the less expensive method (1). A DO loop then iterates to compute the number of batches required, and this constitutes the opportunity for parallel execution. The F90/Forall version is derived from the DO loop version with the following modifications (Figure 5.1(c)):

- All variables in the main DO loop that cause an output dependence are expanded into arrays of the size of the loop iteration. In other words, the output dependence is eliminated by essentially renaming the variables so that the computation can be expressed in a fully data parallel manner. Since the iteration count is just the number of sequential batches, the expansion is not excessive.

- Directives are added to partition the arrays onto a 1-D processor grid.

- The DO loop for the final summation is also recoded using the HPF reduction intrinsic.

A complication arises involving the subroutine call within the Forall loop, which must be free of side effects in order for the loop to be distributed. Some slight code rearrangement was done to remove a side effect in the original subroutine, then the PURE directives were added to assert freedom from side effects. Unfortunately, support for PURE varies among the compilers. For instance, APR does not support the PURE and INTENT directives, apparently because it performs interprocedural analysis to detect the side effects. APR and PGI do not allow a function to return an array, thus precluding an implementation similar to the ZPL implementation. The ZPL version is translated in a straightforward manner from the DO loop version. The only notable difference is the use of the ZPL region construct to express the independent batch computation (Figure 5.1(d)).
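Returning to the random-number recursion above, the O(log k) seed computation can be sketched as follows. This is a minimal 64-bit integer version written for illustration; it is not the NAS or APR reference code (which implements the same arithmetic with double precision splitting), and the module and routine names are ours.

    ! Jump the generator x_k = a*x_{k-1} mod 2**46 directly to element k
    ! in O(log k) multiplications (binary exponentiation). Every product is
    ! formed from 23-bit halves so no intermediate overflows 64-bit integers.
    module ep_seed
      implicit none
      integer, parameter :: i8 = selected_int_kind(15)
      integer(i8), parameter :: TWO23 = 2_i8**23, TWO46 = 2_i8**46
    contains
      function mulmod46(a, b) result(c)      ! (a*b) mod 2**46, with 0 <= a,b < 2**46
        integer(i8), intent(in) :: a, b
        integer(i8) :: c, a1, a0, b1, b0, t
        a1 = a / TWO23
        a0 = mod(a, TWO23)
        b1 = b / TWO23
        b0 = mod(b, TWO23)
        t  = mod(a1*b0 + a0*b1, TWO23)       ! middle partial product, mod 2**23
        c  = mod(t*TWO23 + a0*b0, TWO46)
      end function mulmod46

      function seed_at(k, a, x0) result(xk)  ! x_k = a**k * x0 mod 2**46
        integer(i8), intent(in) :: k, a, x0
        integer(i8) :: xk, p, e
        p  = a
        e  = k
        xk = x0
        do while (e > 0)
          if (mod(e, 2_i8) == 1) xk = mulmod46(xk, p)
          p = mulmod46(p, p)
          e = e / 2
        end do
      end function seed_at
    end module ep_seed

With the constants quoted above, each processor can call seed_at for the first element of its own batch without generating any of the preceding values, which is exactly what method (2) requires.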

5.2.5 Multigrid

Multigrid is interesting for several reasons. First, it illustrates the need for data parallel languages such as HPF or ZPL. The NPB2.1 implementation contains over 700 lines of code for the communication (about 30% of the program), which are eliminated when the program is written in a data parallel language. Second, since the main computation is a 27-point stencil, the reference pattern that requires communication is simply a shift by a constant, which results in a simple neighbor exchange in the processor grid. All compilers (ZPL and HPF) recognize this pattern well and employ optimizations such as message vectorization and storage preallocation for the nonlocal data [App 95, Gupta et al. 95, Chamberlain et al. 95, Bozkus et al. 95]. Therefore, although the benchmark is rather complex, the initial indication is that both HPF and ZPL should be able to produce efficient parallel programs.

(a) NPB1 sequential version: a single batch buffer X of random pairs, a per-annulus count array Q, and a scalar count.

(b) Tuned APR version: the batch buffer is replicated on each processor, the counts are kept locally and combined with a global reduction.
    !HPF$ INDEPENDENT
    DO 1:NB
        DO
            starting seed by binary exponentiation
        END
        CALL vranlc()            -- generate remaining batch
        DO 1:BS
            compute Gaussian deviates
            tally count in concentric square annuli
        END
    END

(c) F90/FORALL version: the batches and counts are distributed, followed by a global reduction.
    FORALL 1:NB
        compute NB starting seeds
        DO 1:BS
            X(1:NB) = NB next pairs
            X(1:NB) = NB Gaussian deviates
            tally count in concentric square annuli
        END
    sum(Q)

(d) ZPL version: the batches and counts are distributed, followed by a global reduction.
    region B = [1..NB];
    Gc, Q: [B] integer;
    [B] begin
        Gc := RandPairs(..,Q);
        for i:=0 to 9
            q[i] := +\ Q[i];
        gc := +\Gc;
    end;

Figure 5.1: Illustrations of EP as implemented in HPF and ZPL. Note: the pseudo-codes and the data structures are shown distributed onto two processors.

The benchmark is a V-cycle multigrid algorithm for computing an approximate solution to the discrete Poisson problem

    \nabla^2 u = v

where \nabla^2 is the Laplacian operator, \nabla^2 u = \partial^2 u/\partial x^2 + \partial^2 u/\partial y^2 + \partial^2 u/\partial z^2. The algorithm consists of 4 iterations of the following three steps:

    r = v - A u        (1) evaluate residual
    z = M^k r          (2) compute correction
    u = u + z          (3) apply correction

where A is the trilinear finite element discretization of the Laplace operator \nabla^2 and M^k is the V-cycle multigrid operator as defined in the NPB1 benchmark specification (section 2.2.2). Figure 5.2 illustrates these three steps together with the data structures: the down cycle and up cycle constitute step (2), compute correction. The interpolation and projection of the hierarchical grids during the down and up cycles are also illustrated for the 2-D case; note that the actual arrays are 3-D. For further details, the reader is referred to the NPB1 specification [Bailey et al. 91] as well as other publications on the vendor implementations [Agarwal et al. 95].

The algorithm implemented in the NPB2.1 version consists of three phases: the first phase computes the residual, the second phase is a set of steps that applies the M^k operator to compute the correction, and the last phase applies the correction. The HPF DO loop version is derived from the NPB2.1 implementation as follows:

- The MPI calls are removed.
- The local loop bounds are replaced with the global bounds.
- The use of a COMMON block of storage to hold a set of arrays of different sizes is incompatible with HPF; therefore the arrays are renamed and declared explicitly.

- HPF directives are added to partition the arrays onto a 3-D processor grid. The array distribution is maintained across subroutine calls by using transcriptive directives to prevent unnecessary redistribution.

Figure 5.2: Illustrations for the Multigrid algorithm: (a) the MG algorithm, showing the 4 iterations of residual evaluation, the down cycle and up cycle that apply the M^k operator, and the correction step; (b) the stencil computation for the hierarchical grid. Note: in (a), the arrays U, V, R_k, Z_k are full-sized 3-D arrays, while R_i, Z_i for k-1 >= i >= 1 are hierarchically scaled arrays; (b) is a 2-D example of the multigrid interpolation and projection; the actual computation is 3-D.

The HPF F90/Forall version requires the additional step of rewriting all data parallel loops in F90 syntax. The ZPL version has a structure similar to the HPF F90/Forall version, the notable difference being the use of strided regions to express the hierarchy of 3-D grids. A strided region is a sparse index set over which data can be declared and computation can be specified.
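To make the reference pattern concrete, the residual step r = v - A.u applies a 27-point stencil whose weights depend only on whether a neighbor shares a face, an edge, or a corner with the center point. The following C sketch illustrates this access pattern; the array names, bounds, and coefficient array are illustrative, not taken from the benchmark source.

    #include <stdlib.h>

    #define N 66   /* illustrative grid size, including boundary layers */

    /* r = v - A.u for a 27-point stencil: a[0] weights the centre point,
     * a[1] the 6 face neighbours, a[2] the 12 edge neighbours and
     * a[3] the 8 corner neighbours. */
    void residual(double r[N][N][N], double v[N][N][N],
                  double u[N][N][N], double a[4])
    {
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                for (int k = 1; k < N - 1; k++) {
                    double s = 0.0;
                    for (int di = -1; di <= 1; di++)
                        for (int dj = -1; dj <= 1; dj++)
                            for (int dk = -1; dk <= 1; dk++) {
                                int w = abs(di) + abs(dj) + abs(dk);  /* 0..3 */
                                s += a[w] * u[i + di][j + dj][k + dk];
                            }
                    r[i][j][k] = v[i][j][k] - s;
                }
    }

Every off-center reference is a shift by a compile-time constant, which is why both the HPF and ZPL compilers can vectorize the resulting neighbor exchange.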

5.2.6 Fourier Transform

Consider the partial differential equation for a point x in 3-D space:

    \partial u(x, t) / \partial t = \nabla^2 u(x, t)

The FT benchmark solves the PDE by (1) computing the forward 3-D Fourier Transform of u(x, 0), (2) multiplying the result by a set of exponential values, and (3) computing the inverse 3-D Fourier Transform. The problem statement requires 6 solutions; therefore the benchmark consists of 1 forward FFT and 6 pairs of dot products and inverse FFTs.

The NPB2.1 implementation follows a standard parallelization scheme, illustrated in Figure 5.3 [Bailey et al. 95, Agarwal et al. 94a]. The 3-D FFT computation consists of traversing and applying the 1-D FFT along each dimension. The 3-D array is partitioned along the third dimension to allow each processor to independently carry out the 1-D FFT along the first and second dimensions. Then the array is transposed to enable the traversal of the third dimension. The transpose operation constitutes most of the communication in the program. Note that the program requires moving the third dimension to the first dimension in the transpose so that the memory stride is favorable for the 1-D FFT; therefore the HPF REDISTRIBUTE directive alone is not sufficient, since HPF data distribution specifies the partition-to-processor mapping, not the memory layout.

Figure 5.3: Illustration of the FT implementation. The forward FFT applies the 1-D FFT along dimensions 1 and 2 of X1, transposes X1 into X0, and applies the FFT along dimension 3; each of the NT iterations then performs the dot multiply with the exponent array, followed by the inverse FFT (dimension 3 on X2, transpose X2 into X1, dimensions 2 and 1 on X1) and a checksum. Note: (1) the 3-D arrays X0, X2 and X3 are partitioned along dimension 2, while X1 is partitioned along dimension 3; (2) the arrows in the figure show the direction of the 1-D FFT.

The HPF DO loop implementation is derived with the following modifications:

1. HPF directives are added to distribute the arrays along the appropriate dimension. Transcriptive directives are used at subroutine boundaries to prevent unnecessary redistribution.

2. The communication for the transpose step is replaced with a global assignment statement.

3. A scratch array that is recast into arrays of different ranks and sizes between subroutines is replaced with multiple arrays of constant rank and size. Although passing an array section as a formal argument is legitimate in HPF, some HPF compilers have difficulty managing array sections.

The HPF F90/Forall version requires the additional step of rewriting all data parallel loops in F90 syntax. The ZPL implementation allocates the 3-D arrays as regions of 2-D arrays; the transpose operation is realized with the ZPL permute operator.
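Because the transpose dominates the communication, it is worth seeing the pattern it induces. The following C sketch, which is not taken from any of the benchmark versions, shows the classic pack / all-to-all / local-transpose form of a distributed transpose for a square matrix stored by block rows; the 3-D transpose in FT follows the same structure with an extra local dimension. The function name and the assumption that N is divisible by the number of processes are ours.

    #include <mpi.h>
    #include <stdlib.h>

    void transpose_block_rows(const double *a, double *at, int N, MPI_Comm comm)
    {
        int p;
        MPI_Comm_size(comm, &p);
        int nloc = N / p;                       /* rows owned by each process */

        double *sendbuf = malloc((size_t)nloc * N * sizeof(double));
        double *recvbuf = malloc((size_t)nloc * N * sizeof(double));

        /* pack: chunk d holds the nloc x nloc block destined for process d */
        for (int d = 0; d < p; d++)
            for (int i = 0; i < nloc; i++)
                for (int j = 0; j < nloc; j++)
                    sendbuf[((size_t)d * nloc + i) * nloc + j] =
                        a[(size_t)i * N + d * nloc + j];

        /* every process exchanges one block with every other process */
        MPI_Alltoall(sendbuf, nloc * nloc, MPI_DOUBLE,
                     recvbuf, nloc * nloc, MPI_DOUBLE, comm);

        /* unpack, transposing each received block locally */
        for (int d = 0; d < p; d++)
            for (int i = 0; i < nloc; i++)
                for (int j = 0; j < nloc; j++)
                    at[(size_t)j * N + d * nloc + i] =
                        recvbuf[((size_t)d * nloc + i) * nloc + j];

        free(sendbuf);
        free(recvbuf);
    }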

5.3 Parallel Performance

In this section we examine the performance of the programs. Because the execution time may be excessive depending on the success of the compilers, we first examine the small problem size (class S), and then examine the programs with reasonable performance and speedup on the large problem size (class A). Figures 5.4, 5.5 and 5.6 show the aggregate timing for all versions (MPI, HPF, ZPL) and for the small and large problem sizes (class S, class A). The following discussion focuses on (1) the scalability and (2) the scalar performance, and examines the causes of any problems with (1) and (2).

5.3.1 NAS EP benchmark

In Figure 5.4(a), the first surprising observation is that the IBM and PGI compilers achieve no speedup with the HPF DO loop version, although the APR compiler produces a program that scales well (recall that the EP DO loop version is from the APR suite). Inspecting the code reveals that no distribution directives were specified for the arrays, resulting in a default data distribution. Although the default distribution is implementation dependent, the conventional choice is to replicate the array.

Figure 5.4: Performance for EP (log scale): (a) EP class S, (b) EP class A. Each panel plots execution time in seconds against the number of processors for the ZPL, APR DO loop, IBM DO loop, IBM F90, PGI DO loop and PGI F90 versions.

Figure 5.5: Performance for MG (log scale): (a) MG class S, (b) MG class A. Each panel plots execution time in seconds against the number of processors for the MPI, ZPL, APR DO loop, APR F90, IBM DO loop, IBM F90, PGI DO loop and PGI F90 versions. Note: APR is not shown in (b) because the execution timed out.

Figure 5.6: Performance for FT (log scale): (a) FT class S, (b) FT class A. The versions and axes are the same as in Figure 5.5.

The IBM and PGI compilers distribute the computation by the owner-computes rule (PGI can also deviate from the owner-computes rule in some cases); therefore, in order for the program to be parallelized, some data structures must be distributed. Since the arrays in EP are replicated by default, no computation is partitioned among the processors: each processor executes the full program and achieves no speedup. By contrast, the APR parallelization strategy does not strictly adhere to the owner-computes rule. This allows the main loop to be partitioned despite the fact that none of the arrays within the loop are distributed.

Note that the HPF language specification specifies neither the default distribution for the data nor the partitioning scheme for the computation. The omission was likely intended to maximize the opportunity for the compiler to optimize; however, the observation for EP suggests that the different schemes adopted by the compilers may result in a portability problem for HPF programs.

When directives were inserted to distribute the arrays, it was found that the main array in EP is intended to hold pseudo-random values generated sequentially; therefore there exists a true dependence in the loop computing the values. If the array is distributed, the compiler will adjust the loop bounds to the local partition, but the computation will be serialized. The HPF F90/Forall version corrects this problem by explicitly distributing the arrays, and the IBM and PGI compilers were able to parallelize it.

The class A performance in Figure 5.4(b) shows that all compilers achieve the expected linear speedup. However, expanding the arrays to express the computation in a more data parallel form introduces overhead and degrades the scalar performance. It is possible for advanced compiler optimizations such as loop fusion and array contraction to remove this overhead, but these optimizations were either not available or not successful in this case. The ZPL version scales linearly as expected, and its scalar performance is slightly better than the APR version's despite the C/Fortran difference described earlier.


5.3.2 NAS MG benchmark

Compared to EP, MG allows a more complete and rigorous test of the languages and compilers. We first discuss the performance for class S in Figure 5.5(a). The p=1 column shows considerable variation in the scalar performance, with all versions showing overhead of 1 to 2 orders of magnitude over the Fortran/MPI performance.

For the base cases, both the original MPI program and the ZPL version scale well. The ZPL compiler partitions the problem in a straightforward manner according to the region and strided region semantics, and the communication is vectorized with little effort. The scalar performance does, however, show over a 6x overhead compared to the MPI version.

The HPF DO loop version clearly does not scale with any HPF compiler. The PGI compiler performs poorly in vectorizing the communication when the computation is expressed with DO loops: the communication call tends to remain in the innermost loop, resulting in a very large number of small messages being generated. In addition, the program uses guards within the loop instead of adjusting the loop bounds. The APR compiler only supports a 1-D processor grid; therefore the 3-D distribution specified in the HPF directives is collapsed by default to a 1-D distribution. This limitation affects the asymptotic speedup but does not necessarily limit the parallelization of the 27-point stencil computation. For one subroutine, the compiler detects through interprocedural analysis an alias between two formal arguments, which inhibits loop parallelization within that subroutine; however, the analysis did not go further to detect from the index expressions of the array references that no dependence actually exists. For most of the major loops in the program, the APR compiler correctly partitions the computation along the distributed array dimension, but generates very conservative communication before the loop to obtain the latest value for the RHS and after the loop to update the LHS. As a result, the performance degrades with the number of processors. The IBM compiler does not parallelize because it detects an output dependence on a

number of variables although the arrays are replicated. In this case, the compiler appears to be overly conservative in maintaining the consistency of the replicated variables. Other loops do not parallelize because they contain an IF statement. The INDEPENDENT directive is treated differently by the compilers: the PGI compiler interprets the directive literally and parallelizes the loop as directed, while the IBM compiler nevertheless performs a more rigorous dependence check and elects not to parallelize the loop because of detected dependences.

For the HPF F90/Forall version, the IBM and PGI compilers are more successful. The IBM compiler's performance and scalability approach ZPL's, while the PGI compiler now experiences little problem in vectorizing the communication; indeed, PGI's scalar performance now exceeds IBM's. The APR compiler does not produce a slowdown but does not achieve any speedup either. It partitions the computation in the F90/Forall version similarly to the DO loop version, but is able to reduce the amount of communication. It continues to be limited by its 1-D distribution as well as an alias problem with one subroutine. Note that the version of MG from the APR suite employs APR's directives to suppress unnecessary communication. These directives are not used in our study because they are not a part of HPF, but it is worth noting that it is possible to use APR's tools to analyze the program and manually insert APR's directives to improve the speedup with the APR compiler.

Given that the DO loop version fails to scale with any compiler, one may ask whether the program could be written differently to aid the compilers. The specific causes for the failure of each compiler described above suggest that the APR compiler would be more successful if APR's directives were used, that the PGI compiler might benefit from the HPF INDEPENDENT directive, and that the IBM compiler would require actual removal of some data dependences. Therefore, it does not appear that any single solution is portable across the compilers.

Since the HPF DO loop version does not scale, the class A data only includes MPI, ZPL and HPF F90/Forall (Figure 5.5(b)). MPI and ZPL again exhibit good speedup but the

ZPL overhead persists. IBM and PGI also achieve speedup, but PGI appears to level off quickly while IBM shows yet higher overhead than ZPL. APR, on the other hand, does not achieve any speedup, as predicted from the small problem size; the APR plot is not shown because the execution runs timed out. Note that the MG performance published by APR is competitive with ZPL's performance. However, the MG benchmark made available by APR not only relies on APR's directives but can be compiled by neither IBM nor PGI because of its use of SEQUENCE, an incompatible memory layout feature. MG thus illustrates that (1) HPF programs can achieve some speedup, and (2) when tuned to a specific compiler such as APR, the programs can achieve the same level of performance as ZPL. However, it is difficult to guarantee scalability and portability in HPF programs.

5.3.3 NAS FT benchmark

FT presents a different challenge to the HPF compilers. In terms of the reference pattern, FT consists of a dot product and the FFT butterfly pattern. The former requires no communication and is readily parallelized by all compilers. For the latter, the index expression is far too complex for a compiler to optimize the communication. Fortunately, the index variable is limited to one dimension at a time; therefore the task for the compiler is to partition the computation along the appropriate dimensions. The intended data distribution is 1-D and is thus within the capability of the APR compiler.

Figure 5.6(a) shows the full set of performance results for the small problem size. As with MG, the MPI and ZPL versions scale well, and the scalar performance of all HPF and ZPL implementations shows an overhead of 1 to 2 orders of magnitude over the MPI implementation. For the HPF DO loop version, the APR compiler exhibits the same problem as with MG: it generates very conservative communication before and after many loops. In addition, the APR compiler does not choose the correct loop to parallelize. The discrepancy arises because APR's strategy is to choose the partitioning based on the

array references within the loop. In the program, the main computation and thus the array references are packaged in a subroutine called from the loop, so that when the loop is parallelized the subroutine operates on the local data. When the APR compiler proceeds to analyze the loops in this subroutine (the 1-D FFT), it finds that the loops are not parallelizable. The PGI compiler also generates poor communication, although its principal limitation is in vectorizing the messages. The IBM compiler does not parallelize because of assignments to replicated variables.

The HPF F90/Forall version requires considerable experimentation and code restructuring to arrive at a version that is accepted by all compilers, partly because of differences in supported features among the compilers and partly because of the nested subroutine structure of the original program. All HPF compilers achieve speedup to varying degrees. APR is particularly successful since the principal parallel loop has been moved to the innermost subroutine; its scalar performance approaches the MPI performance, although communication overhead limits the speedup. PGI shows good speedup, while IBM's speedup is more limited.

For the class A problem size, the memory requirement proves to be considerable, since several programs fail in the p=8 configuration. Unfortunately, the Cornell system has only 48 wide nodes out of its 512 nodes; this limits the set of data points. The memory requirement also manifests itself in some superlinear speedup for all HPF and ZPL programs. Nevertheless, the overall observation is that the programs achieve the expected speedup. ZPL scalar performance is lower than the HPF performance; however, when we take into account the 40% slowdown from C to Fortran measured earlier for FT, ZPL parallel performance is comparable to HPF.

5.3.4 Communication

Table 5.2 shows the total number of MPI message passing calls generated and the differences in the communication schemes employed by each compiler.

Table 5.2: Communication statistics for EP class A, MG class S and FT class S; p=8.

    Benchmark      Version       Point-to-point   Collective   Type of MPI calls
    EP (class A)   ZPL                      0           120    Allreduce, Barrier
                   APR DO loop             31             0    Send, Recv
                   IBM F90                 70           120    Send, Recv, Bcast
                   PGI F90                240             0    Send, Recv
    MG (class S)   MPI                   2736            40    Send, Irecv, Allreduce, Barrier
                   ZPL                   9504            56    Isend, Recv, Barrier
                   APR F90             126775             8    Send, Recv, Barrier
                   IBM F90               9636            32    Send, Recv, Irecv, Bcast
                   PGI F90              22191             0    Send, Recv
    FT (class S)   MPI                      0           104    Alltoall, Reduce
                   ZPL                   1064            32    Isend, Recv, Barrier
                   APR F90              58877             8    Send, Recv, Barrier
                   IBM F90                728        258048    Send, Irecv, Bcast
                   PGI F90              64603             0    Send, Recv

The APR and PGI compilers use only the generic send and receive, while the IBM compiler also uses nonblocking calls and collective communication; this may have ramifications for the portability of the IBM compiler to other platforms. The ZPL compiler uses nonblocking MPI calls to overlap computation with communication, as well as MPI collective communication.
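As an illustration of what such overlap looks like at the MPI level (this is a generic idiom, not ZPL's actual generated code), a one-dimensional halo exchange can post its nonblocking sends and receives, update the interior points that need no remote data while the messages are in flight, and only then wait for the halo values needed at the boundary:

    #include <mpi.h>

    /* Owned cells are x[1..nloc]; x[0] and x[nloc+1] are halo cells.
     * up and down are the neighbouring ranks (MPI_PROC_NULL at the ends). */
    void relax_with_overlap(double *x, double *xnew, int nloc,
                            int up, int down, MPI_Comm comm)
    {
        MPI_Request req[4];

        MPI_Irecv(&x[0],        1, MPI_DOUBLE, up,   0, comm, &req[0]);
        MPI_Irecv(&x[nloc + 1], 1, MPI_DOUBLE, down, 1, comm, &req[1]);
        MPI_Isend(&x[1],        1, MPI_DOUBLE, up,   1, comm, &req[2]);
        MPI_Isend(&x[nloc],     1, MPI_DOUBLE, down, 0, comm, &req[3]);

        /* interior points depend only on locally owned data */
        for (int i = 2; i <= nloc - 1; i++)
            xnew[i] = 0.5 * (x[i - 1] + x[i + 1]);

        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

        /* the end points needed the halo values */
        xnew[1]    = 0.5 * (x[0] + x[2]);
        xnew[nloc] = 0.5 * (x[nloc - 1] + x[nloc + 1]);
    }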

5.3.5 Data Dependences

HPF compilers derive parallelism from the data distribution and the loops that operate on the data. Loops with no dependences are readily parallelized by adjusting the loop

bounds to the local bounds. Loops with dependences may still be parallelizable but will require analysis; for instance, the IBM compiler can detect some dependence patterns of loops that perform a reduction and generate the appropriate HPF reduction intrinsic. In other instances, loop distribution may isolate the portion containing the dependence to allow the remainder of the original loop to be parallelized. To approximately quantify the degree of difficulty that a program presents to the parallelizing compiler in terms of dependence analysis, we use the following simple metric:

    (count of all loops with dependences) / (count of all loops)

A value of 0 would indicate that all loops can be trivially parallelized, while a value of 1 would indicate that whether any loop is parallelizable depends on the analysis capability of the compiler. Using the KAPF tool, we collect the loop statistics from the benchmarks for the major subroutines; they are listed in Table 5.3. This metric is not complete, since it does not account for the data distribution; for instance, with 3 nested loops and a 1-D distribution, only 1 loop needs to be partitioned to parallelize the program, and 2 loops may contain dependences with no ill effect. However, the metric gives a coarse indication of the demands on the compiler.

The loop dependence statistics show clear trends that correlate directly with the performance data. We observe the expected reduction in dependences from the DO loop version to the F90/Forall version. The reduction greatly aids the compilers in parallelizing the F90/Forall programs, but also highlights the difficulty of parallelizing programs with DO loops. For MG, the difference is significant: the array syntax eliminates the dependences in most cases. Some HPF compilers implement optimizations for array references that are affine functions of the DO loop indices, particularly for functions with constants. These optimizations should have been sufficient for the MG DO loop version; however, it does not appear that they were successful. Note that the loops in the subroutine norm2u3 are replaced altogether with the HPF reduction intrinsics.

Table 5.3: Dependence ratio m/n for EP, MG and FT. Note: m is the count of loops with data dependences or subroutine calls, and n is the total loop count.

         Subroutine         DO     F90/Forall          Subroutine   DO     F90/Forall
    EP   embar              3/5    1/31           MG   hmg          1/2    1/27
         vranlc             1/1                        mg3P up      1/1    1/22
         get start seed     1/1                        mg3P down    1/1    1/1
    FT   tpde               2/16   2/16                psinv        4/4    0/6
         cfft3              0/6    0/6                 resid        4/4    0/6
         cffts1             2/6    1/7                 rprj3        4/4    0/6
         cffts2             2/6    1/7                 interp       7/21   0/30
         cfftz              3/5    1/4                 norm2u3      3/3    0/0
         fftz2              3/3    3/4                 comm3        0/6    0/18

For FT, the low number of dependences in tpde comes from the dot products, which are easily parallelized. The top-down order of the subroutines listed also represents their nesting level. The increasing dependences in the inner subroutines reflect the need to achieve parallelism at the higher level. As explained earlier, this proves to be a challenge for the APR compiler, which focuses on analyzing individual loops to partition the work.

5.4 Conclusion

Our objective in this chapter has been to subject the current state-of-the-art compilers for data parallel languages to more substantive applications. Three NAS benchmarks were studied across three current HPF compilers and a ZPL compiler. We examined different styles of expressing the computation in HPF, and we also considered the same benchmarks written in MPI to understand the limits of the performance.

The three HPF compilers show a general difficulty in detecting parallelism from DO loops. They are more successful with the F90 array syntax and the Forall construct, although even in this case the success in parallelization is not uniform. Significant variation in the scalar performance also exists between the compilers. The ZPL compiler shows consistent speedup and performance competitive with the HPF compilers.

While differences between compilers will always be present, the differences must preserve a certain performance model in order for program portability to be maintained in the language. In other words, the user must be able to use any compiler to develop a program that scales, and then have the option of migrating to a particular machine or compiler for better scalar performance. This requires a tight coupling between the language specification and the compiler, in the sense that the compiler must reliably implement the abstraction provided in the language. To this end, the language specification must serve as a consistent contract with the programmer; more formally, the language must provide a concise performance model.

In the case of HPF, the results point to two difficulties. First, while the HPF directives and constructs provide information on the data and computation partitioning, the sequential semantics of Fortran leave many potential dependences in the program. An HPF compiler must analyze these dependences, and when unable to do so, it must make a conservative assumption. Although this analysis capability differentiates the vendor implementations, the compilers' difficulty in parallelizing reliably leads to a difficulty for the user in predicting the parallel behavior, and thus the speedup, of the program. A direct result is that the user needs to continually experiment with the compilers to learn their actual behavior. In doing so, the user is effectively supplementing the performance model provided by the language with empirical information. Yet such an enhanced model tends to be platform specific and not portable.

Second, the optional nature of the directives, while fostering compatibility and a

smoother transition from a current language, leads to an uncertain performance model for the user. In other words, it is not clear how much effort from the user is necessary or sufficient; for instance, the INDEPENDENT directive may or may not parallelize a loop depending on the compiler implementation. In this respect, the ZPL language addresses some of these problems by providing a clear demarcation between parallel and sequential execution. The results demonstrate that ZPL offers a consistent performance model and scalable performance.

The results also show that significant overhead remains in all implementations compared to the MPI programs. One source of the overhead is the large number of temporary arrays generated by the compilers across subroutine calls and parallelized loops; they require dynamic allocation, deallocation and copying, and generally degrade the cache performance. It is clear that to become a viable alternative to explicit message passing, compilers for data parallel languages must achieve a much lower overhead.

Chapter 6

Mighty Scan, parallelizing sequential computation

6.1 Introduction

The general objective for a parallelizing compiler is to analyze and detect the dependences in the program so that, when there are none, the computation can be scheduled to proceed in parallel. However, when a true dependence exists, serialization occurs and a different approach must be employed to achieve parallelism. In this discussion, we focus on the case where the dependence occurs along one dimension of one or several arrays. Such a computation typically involves traversing the array(s) and updating each element using the values of the preceding elements. Because a true dependence exists, the computation is conceptually sequential. A simple example is the parallel prefix operation (scan) on an array. A more complex example involving multiple arrays and arithmetic operations is the forward elimination and backward substitution steps in a solver for linear equations.

The solution for this problem exists in many forms depending on the particular case. When the operator is commutative and associative and only one array is involved, the computation is known as the parallel prefix operation. Because it is frequently used and an efficient parallel algorithm exists, parallel prefix is often supported directly

in communication interfaces such as MPI (MPI_Scan) and in high level data parallel languages such as HPF and ZPL. Since these implementations are well optimized, we can consider the parallel prefix problem to be solved. However, when the operator is not commutative or associative and multiple arrays are involved, no direct support currently exists in either the libraries or the languages.

For message passing programs, this does not present a serious obstacle, since the relatively low programming level allows considerable freedom in implementing any algorithm. In this case, pipelining has proven to be very effective when coarse grain parallelism exists in addition to the sequential computation. When the program is written in a data parallel language, the solution is also straightforward if the programmer has freedom in choosing the array partitioning scheme. For instance, if the sequential computation proceeds along the first dimension of a 2-D array (as illustrated in Figure 6.1(a)), the array can simply be partitioned along the second dimension onto a 1-D processor grid so that there is no interprocessor dependence. This allows each processor to proceed independently. However, it is often the case that the best partitioning scheme for the overall problem requires the first dimension to be partitioned also, for instance a 2-D array onto a 2-D processor grid as shown in Figure 6.1(b). In this case, all processors in the first dimension will be serialized during the sequential computation.

The problem described above thus sets the stage for our discussion. Assume: (1) a program written in a data parallel language (HPF or ZPL), (2) a computation that is semantically sequential along dimension i of some array(s), and (3) a partitioning scheme that requires dimension i to be distributed onto a processor grid. The goal is to avoid the serialization of the processor set onto which the sequential computation is mapped. In the remainder of this chapter, Section 6.2 describes a number of solutions, their implementations in HPF and ZPL, and their shortcomings. Section 6.4 then proposes the specification of a new language construct that offers the best solution.
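As a point of reference for the library support mentioned above, the following small C program shows the commutative/associative case handled directly by MPI: each rank contributes one value and MPI_Scan returns the inclusive prefix sum. It is only an illustration of the interface, not code from this study.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        double local, prefix;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        local = (double)(rank + 1);                     /* 1, 2, 3, ... */
        MPI_Scan(&local, &prefix, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        printf("rank %d: inclusive prefix sum = %g\n", rank, prefix);
        MPI_Finalize();
        return 0;
    }

The rest of this chapter is concerned precisely with the cases that this interface does not cover: non-associative update rules that involve several arrays at once.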


Figure 6.1: Two methods for array partitioning. Note: (a) no dependence across processors, fully parallel; (b) dependence exists, some processors serialized.

6.2 A case study

For the case study, we will use the widely studied tomcatv benchmark, since it embodies the essential characteristics while remaining sufficiently simple to facilitate our analysis. A much more complex and realistic case is the NAS SP and BT benchmarks, which are large scale partial differential equation (PDE) solvers typically used in computational fluid dynamics (CFD).

The version of tomcatv under study is from the suite of HPF benchmarks published by APR, which has been restructured and annotated with directives for HPF. The computation consists of a 2-phase iteration over several 2-D arrays:

Phase 1 is a 9-point stencil computation and Phase 2 is a solver for a tridiagonal system of linear equations. Phase 2 exhibits a true dependence along the first dimension; it therefore motivates distributing the array along the second dimension. However, Phase 1 favors a 2-D partition for better scalability. The forward elimination step of Phase 2 consists of the following (the backward substitution step is similar):

      DO i = 2, n
        DO j = 2, m
          r(1,i)  = aa(j,i) * d(j-1,i)
          d(j,i)  = 1. / (dd(j,i) - aa(j-1,i) * r(1,i))
          rx(j,i) = rx(j,i) - rx(j-1,i) * r(1,i)
          ry(j,i) = ry(j,i) - ry(j-1,i) * r(1,i)
        ENDDO
      ENDDO

The characteristics of this computation can be summarized as:

1. Sequential dependence in the j dimension: no parallelism.
2. No dependence in the i dimension: available parallelism.
3. Multiple arrays and arithmetic operations are involved.

Since tomcatv is a small program, the difference in the partitioning choice may not be significant (this proves to be a mitigating circumstance for the APR compiler, which only accepts a 1-D processor grid). However, for larger problem sizes or larger programs such as the NAS SP and BT benchmarks, the asymptotic difference becomes clear. Specifically, Naik showed that for SP on the IBM SP/1, a 3-D partitioning can be 66% faster than a 1-D partitioning on 16 processors [Naik 94].

6.2.1 Idealized execution

Figure 6.2 shows three possible parallel executions of the loops, assuming that an m × n = 4 × 4 array is distributed onto a P_r × P_c = 2 × 2 processor grid.

Figure 6.2: Three possible parallel executions on processors P0-P3: (a) parallel in the i dimension; (b) pipelined in the j dimension; (c) fully parallel. Solid time lines show computation; the arrows between them show communication.

The solid time line for each processor represents the computation required for the 2 × 2 array section that it owns. In all cases, loop i is fully parallel; the only difference is in how loop j is parallelized. In case (a), loop j remains serialized: the P_r processors fail to execute in parallel and the speedup is limited to P_c. In case (b), loop j still executes sequentially, but by employing pipelining, partial results are forwarded to allow the waiting processors to initiate their execution earlier. The overhead to start and finish the pipeline degrades the overall speedup to b P_c P_r, where b < 1. In case (c), loop j proceeds effectively in parallel. This is possible if an efficient parallel algorithm exists for the j loop: if the parallelization overhead can be limited to O(log P_r), a speedup of O(P_c P_r / log P_r) can be achieved.
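For concreteness, case (b) can be expressed directly in a message passing program. The following C fragment is a hypothetical sketch of such a pipeline for the tridiagonal-style update above; the names (x, b, jlo, jhi, ilo, ihi, up, down) and the one-column granularity are ours, and a production version would forward blocks of columns to amortize the message overhead.

    #include <mpi.h>

    #define NI 1024                 /* illustrative number of local columns */

    /* Each processor in a column of the grid owns rows jlo..jhi of its block;
     * up and down are the neighbouring ranks (MPI_PROC_NULL at the ends).
     * Row jlo-1 of both x and b is a halo row filled by the processor above. */
    void pipelined_scan(double x[][NI], double b[][NI],
                        int jlo, int jhi, int ilo, int ihi,
                        int up, int down, MPI_Comm comm)
    {
        for (int i = ilo; i <= ihi; i++) {
            /* wait for the boundary row of the block above for this column */
            MPI_Recv(&x[jlo - 1][i], 1, MPI_DOUBLE, up, i, comm,
                     MPI_STATUS_IGNORE);
            for (int j = jlo; j <= jhi; j++)          /* sequential in j */
                x[j][i] = x[j][i] + x[j - 1][i] * b[j - 1][i];
            /* forward our last row so the block below can start column i */
            MPI_Send(&x[jhi][i], 1, MPI_DOUBLE, down, i, comm);
        }
    }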

6.2.2 An algorithmic approach

We now consider an algorithmic approach for parallelizing in the j dimension. It is algorithmic since it relies on unique properties of the intended computation instead of

the language construct. This approach parallelizes the scan operation itself, resulting in an execution similar to Figure 6.2(c). One important advantage is that no other source of parallelism is required (e.g., parallelism in the i dimension). Parallel prefix is an example of a sequential computation that can be parallelized effectively with an overhead of O(log P). The parallel algorithm in this case requires that the operator be associative so that the partial results can be combined arbitrarily. This requirement is too restrictive for the general scan operation that we are considering. However, with careful manipulation, it may be possible to factor out from the sequential computation certain components that can be computed in parallel. Then we can examine whether the remaining components can be computed with a reasonable overhead. These two steps would constitute the two phases of a parallel implementation of scan: (1) compute locally and (2) communicate and update.

Consider a scan computation that is slightly more complex than a simple summation and that is similar (but not identical) to that found in tomcatv. For arrays x(1:16) and b(1:16):

      DO i = 2, 16
        x(i) = x(i) + x(i-1) * b(i-1)
      ENDDO

Although the code has the structure of a scan operation, we cannot use the parallel prefix algorithm because the combination of + and * is not associative. However, if the terms are expanded completely, the computation can be expressed as follows:

    x_1 = x_1
    x_2 = x_1 b_1 + x_2
    x_3 = x_1 b_1 b_2 + x_2 b_2 + x_3
    x_4 = x_1 b_1 b_2 b_3 + x_2 b_2 b_3 + x_3 b_3 + x_4
    ...
    x_k = \sum_{j=1}^{k-1} ( x_j \prod_{i=j}^{k-1} b_i ) + x_k

Let processor P own a section (m:n) of the arrays. Then, in computing each new element in its array section, the summation expression can be split into two components. For element x_n specifically:

    x_n = \sum_{j=1}^{m-1} ( x_j \prod_{i=j}^{n-1} b_i ) + \sum_{j=m}^{n-1} ( x_j \prod_{i=j}^{n-1} b_i ) + x_n

Note that the second term involves only local elements in the array section (m:n); therefore processor P can proceed to compute this component independently. Note also that for the first term, the upper limit of the summation is m-1 while the upper limit of the product is n-1. This allows us to factor out the common b's, thus further splitting the first term into three factors:

    x_n = ( \prod_{j=m}^{n-1} b_j ) \cdot b_{m-1} \cdot ( \sum_{j=1}^{m-2} ( x_j \prod_{i=j}^{m-2} b_i ) + x_{m-1} ) + \sum_{j=m}^{n-1} ( x_j \prod_{i=j}^{n-1} b_i ) + x_n

Clearly, the first factor of the first term, \prod_{j=m}^{n-1} b_j, can be computed locally. The third factor is simply the value of x_{m-1} computed by its owner processor, which incidentally also owns the second factor b_{m-1}. A parallel algorithm can finally be described:

1. Compute locally: x_k = \sum_{j=m}^{k-1} ( x_j \prod_{i=j}^{k-1} b_i ) + x_k, for m <= k <= n
2. Compute locally: \prod_{j=m}^{n-1} b_j
3. Receive from the preceding processor: x_{m-1} * b_{m-1}
4. Update x_n first, using the results from steps 1, 2 and 3
5. Send to the next processor: x_n * b_n
6. Update the remaining elements x_{m:n-1}

Considering the complexity, the local computation in steps 1, 2 and 6 requires O(N/P), where N is the array size and P is the number of processors. Although the communication is sequential, it occurs after the main parallel computation and only one operation is interposed between each message; therefore the serialization effect is limited

to O(P). A compiler that implements this algorithm may generate the following SPMD program.

      /* local computation */

      for (i = mylo+1; i <= myhi; i++)
          x[i] = x[i] + x[i-1] * b[i-1];
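The original listing continues beyond this excerpt. As a hypothetical sketch of how the six steps fit together in an SPMD program (the names mylo, myhi, me, nprocs, the completed loop bound above, and the use of blocking MPI calls are ours, not the compiler's actual output):

    #include <mpi.h>

    /* x[mylo..myhi] and b[mylo..myhi] are the locally owned sections. */
    void mighty_scan(double *x, double *b, int mylo, int myhi,
                     int me, int nprocs, MPI_Comm comm)
    {
        int i;
        double bprod = 1.0, incoming = 0.0, outgoing;

        /* step 1: local scan; x[mylo] is left unmodified */
        for (i = mylo + 1; i <= myhi; i++)
            x[i] = x[i] + x[i - 1] * b[i - 1];

        /* step 2: local product b[mylo] * ... * b[myhi-1] */
        for (i = mylo; i < myhi; i++)
            bprod *= b[i];

        /* step 3: receive x[mylo-1] * b[mylo-1] from the preceding processor */
        if (me > 0)
            MPI_Recv(&incoming, 1, MPI_DOUBLE, me - 1, 0, comm,
                     MPI_STATUS_IGNORE);

        /* step 4: correct the last local element first */
        x[myhi] += bprod * incoming;

        /* step 5: forward x[myhi] * b[myhi] immediately to the next processor */
        if (me < nprocs - 1) {
            outgoing = x[myhi] * b[myhi];
            MPI_Send(&outgoing, 1, MPI_DOUBLE, me + 1, 0, comm);
        }

        /* step 6: correct the remaining local elements x[mylo..myhi-1] */
        bprod = 1.0;
        for (i = mylo; i < myhi; i++) {
            x[i] += bprod * incoming;
            bprod *= b[i];
        }
    }

Only one multiplication and one send separate a processor's receive from the send to its successor, which is the O(P) serialization claimed above; all of the O(N/P) work in steps 1, 2 and 6 proceeds in parallel.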
