The Performance and Scalability of Parallel Systems

Neil James Davies

A thesis submitted to the University of Bristol in accordance with the requirements for the degree of Ph.D in the Faculty of Engineering, Department of Computer Science. December 1994

© N. J. Davies, 1994

Abstract

In this thesis we develop an analytical performance model for parallel computer systems. This model is built on three abstract performance elements: loading intensity, contention, and delay. These elements correspond to performance measures that are the outcome of features of both software and hardware components of a computing system. The profile of these components can in turn be derived from an analysis of the performance-related behaviour of the individual processes that constitute a complete system. We show how such models of particular systems can be used for performance prediction. They can be used to predict the performance of a specified number of processors, to derive the maximum performance that can be expected as the number of processors increases, and to predict the amount of latency hiding required to achieve a particular performance profile. We illustrate the use of the model with a particular concrete application. The interaction between the three performance elements is examined, with particular attention paid to the relationship between loading intensity and delay in certain classes of parallel system. We examine the consequences that this relationship has for the ability to measure certain types of performance data. In examining the relationship between delay and the other performance elements we also look at the effect of allocating many processes to each processor. As a consequence of using this modelling technique on a particular parallel system, individual behaviour characteristics of the real system may need to be approximated. We examine the effects of this approximation, looking at the particular circumstances under which our model may not give appropriate quantitative results for the individual system.


Acknowledgements

I would like to thank all at the Department of Computer Science, both corporately and individually, for their support during the evolution of this thesis. I am particularly indebted to the inhabitants of the Computer Science Annex for their encouragement, and especially to Ian Holyer for patiently correcting my mathematics and listening to my thoughts, even when they were only half formed; to Mike Rogers for providing the milestones to keep it moving; to Brian Stonebridge for all the hours of fun with those piles of pennies; and to Hussein Zedan for his incisive comments. I would also like to thank Len, Babs and Gene for all the building work that they have done over the last few years while I was working on this. Last, but by no means least, thanks to Sally, Hannah and Lucy, for all their help and for putting up with the ever-busy father and husband.


To Hannah and Lucy, for putting up with a part-time dad. To Sally, for putting up with a part-time husband, and everything else.


Declaration

The work in this dissertation is the independent and original work of the author except where explicit reference to the contrary has been made. It has not been submitted for any other degree or award to this or any other university or educational institution. The views expressed in this dissertation are those of the author and not of the University of Bristol.

N. J. Davies


Contents

1 Introduction
   1.1 Performance issues
   1.2 Support for the design process
   1.3 Motivation and aims of this thesis
   1.4 Outline of chapters
       1.4.1 Chapter 2
       1.4.2 Chapter 3
       1.4.3 Chapter 4
       1.4.4 Chapter 5
       1.4.5 Chapter 6

2 Performance Models and their Measurement
   2.1 Components of performance
       2.1.1 Sequential portion
       2.1.2 Communications delay
       2.1.3 Algorithmic limitations
       2.1.4 Algorithmic synchronisation costs
       2.1.5 Resource saturation
       2.1.6 Other factors
   2.2 Classification of scaling and speedup models
       2.2.1 Output measures
   2.3 Overview of scaling and speedup: models and metrics
       2.3.1 Speedup formulæ
             2.3.1.1 Solution Rate
             2.3.1.2 Speedup
             2.3.1.3 Efficiency
             2.3.1.4 Response Time, Throughput, and Power
       2.3.2 Macroscopic measures
       2.3.3 Microscopic measures
   2.4 Scaling models
       2.4.1 Fixed size speedup (Amdahl's law)
       2.4.2 Scaled speedup (Gustafson's law)
       2.4.3 Fixed time speedup
       2.4.4 Available concurrency measures
       2.4.5 Comparison of scaling models
       2.4.6 Performance metrics
             2.4.6.1 Serial fraction
             2.4.6.2 Incremental efficiency
       2.4.7 Cost and performance
       2.4.8 Solution rates
   2.5 Predictability
       2.5.1 Algorithmic analysis approaches
       2.5.2 Approximate modelling and measures
             2.5.2.1 Gustafson's fixed time speedup
             2.5.2.2 Eager's analysis based on average available concurrency
   2.6 Structurally-based performance models
   2.7 Summary

3 Behaviour Based Models of Performance
   3.1 Cycles of behaviour
   3.2 Outline of basic queueing theory
       3.2.1 Random variables and their probability distributions
       3.2.2 Exponential distribution
       3.2.3 Stochastic processes and Markov chains
             3.2.3.1 Steady state
             3.2.3.2 Conservation of flow
             3.2.3.3 Kendall's notation
             3.2.3.4 Little's law
       3.2.4 Properties of the M/M/1 queue
       3.2.5 Performance applications
   3.3 Performance parameters
       3.3.1 Speedup
             3.3.1.1 Relative speedup
             3.3.1.2 Absolute speedup
       3.3.2 Efficiency, load and utilisation
       3.3.3 Response time: average and distribution
   3.4 Use of queueing theory for performance modelling
   3.5 Summary

4 Single Thread Per Processor
   4.1 Without response delay
       4.1.1 Derivation of properties of this model
             4.1.1.1 General properties
             4.1.1.2 Distribution related properties
       4.1.2 Performance measures of this model
             4.1.2.1 Server loading
             4.1.2.2 Average queue length and response time
             4.1.2.3 Processor idleness/processor utilisation
             4.1.2.4 Speedup
       4.1.3 Practical use of this model
             4.1.3.1 Applicability of exponential distributions to real systems
             4.1.3.2 Service time distributions
             4.1.3.3 Processor operating time distributions
             4.1.3.4 Use of the formulæ
             4.1.3.5 Finding u when it cannot be measured directly
       4.1.4 Shortcomings in this model
   4.2 With response delay
       4.2.1 M/M/1/K/K with delay
       4.2.2 Derivation of properties of this model
             4.2.2.1 General properties
             4.2.2.2 Distribution related properties
       4.2.3 Performance measures of this model
             4.2.3.1 Server loading
             4.2.3.2 Average queue length and response time
             4.2.3.3 Processor idleness
             4.2.3.4 Speedup
       4.2.4 Use of the formulae
       4.2.5 Model of delay
   4.3 Use of single threaded models
       4.3.1 Importance of ρ
       4.3.2 Comparison between models and `linear speedup'
       4.3.3 Equivalence of the models
       4.3.4 Practical analysis using this model
   4.4 Summary

5 Multiple Threads per Processor
   5.1 Single processor without delay
       5.1.1 Derivation of the properties of this model
       5.1.2 Performance measures
             5.1.2.1 Server loading
             5.1.2.2 Processor idleness
             5.1.2.3 Speedup
   5.2 Single processor with delay
       5.2.1 Performance measures
             5.2.1.1 Server loading
             5.2.1.2 Processor idleness
             5.2.1.3 Speedup
       5.2.2 Levels of multiprocessing and latency hiding
   5.3 Multiple processors without delay
       5.3.1 Simplification of the Markov chain representation
       5.3.2 Finding the population of a cohort
   5.4 Multiple processors with response delay
       5.4.1 Hiding number for multiple processors
   5.5 Summary

6 Uses of the model
   6.1 Evaluating and extrapolating performance
       6.1.1 Derivation of u0 and other parameters
       6.1.2 Performance of other physical systems
   6.2 Threads of execution and processors
   6.3 Service elements
   6.4 Delay
       6.4.1 Example 1
       6.4.2 Example 2
   6.5 Examples of the use of the modelling technique
       6.5.1 Quantifying the effects of algorithm change
             6.5.1.1 Caching in a distributed Virtual Memory
       6.5.2 Quantifying the effects of changes in topology
             6.5.2.1 Growth of delay with the number of processors
             6.5.2.2 Growth of load intensity with the number of processors
   6.6 Informing design decisions
   6.7 Summary

7 Conclusions
   7.1 Future work
       7.1.1 Equivalence of the delay-free and delay-full multi-threaded systems
       7.1.2 Asymmetry
       7.1.3 Modelling the effects of other overheads
             7.1.3.1 Loading profiles
             7.1.3.2 Effects of finite buffer space
             7.1.3.3 Including other contention points and servers
       7.1.4 Quantitative measures of parallel algorithm improvement
       7.1.5 Inclusion of other forms of synchronisation
       7.1.6 Monitoring of performance and load balancing
       7.1.7 Cycles within cycles of behaviour

A Proofs
   A.1 Multiple processors, single thread with delay
   A.2 Single processor multiple threads with delay
   A.3 Equality of state probabilities within a cohort, without delay
   A.4 Probability distributions in cohorts with delay

B Glossary of Terms
   B.1 Queueing Theory
   B.2 Petri Net Descriptions
   B.3 Performance Terms
   B.4 Other Nomenclature

List of Figures

2.1 Scaling surface (after Worlton [131])
2.2 Measures of usefulness of parallel systems
3.1 Representation of the simplest cycle of behaviour
3.2 Block diagram of system with two processors and service facility
3.3 Example Petri net representing two processors and a server
3.4 Reduced Petri net representation of the example system
3.5 Two processor representation with queueing
3.6 Two processor representation with queueing and timed transitions
3.7 Alternative view of figure 3.6 as a Markov chain
3.8 Elements of a queueing system
3.9 State-transition-rate diagram for M/M/1 queueing system
3.10 Average wait time as a function of loading intensity
4.1 Petri net representation of figure 4.2
4.2 Outline of M/M/1/K/K queueing system
4.3 State-transition diagram for M/M/1/K/K system
4.4 Server utilisation as a function of load intensity (u ∈ [0, 3])
4.5 Server utilisation as a function of load intensity (u ∈ [0, 0.05])
4.6 The calculate, request, calculate cycle of a single worker processor
4.7 Absolute speedup as a function of load intensity (u ∈ [0, 3])
4.8 Absolute speedup as a function of load intensity (u ∈ [0, 0.1])
4.9 Relative speedup as a function of load intensity (u ∈ [0, 3])
4.10 Relative speedup as a function of load intensity (u ∈ [0, 0.1])
4.11 Outline of the M/M/1/K/K queueing system with delay
4.12 Petri net representation of figure 4.11
4.13 State-transition diagram for a two processor system with delay
4.14 The calculate, request, calculate cycle of a single processor
4.15 Absolute speedup for u = 0.05 for varying dr
4.16 Absolute speedup for u = 0.1 for varying dr
4.17 Relative speedup for u = 0.05 for varying dr
4.18 Relative speedup for u = 0.1 for varying dr
4.19 Comparison between idealised server utilisation and ρ
4.20 Difference in server utilisation between idealised and ρ
5.1 Behaviour of a multi-threaded single processor
5.2 State-transition-rate diagram for behaviour of figure 5.1
5.3 Variation of absolute speed for fixed level of multiprogramming
5.4 Variation of server utilisation for fixed level of multiprogramming
5.5 Variation of absolute speed for fixed loading intensity
5.6 Variation of server utilisation for fixed loading intensity
5.7 Absolute reduction in processor idle for fixed loading intensity
5.8 Relative reduction in processor idle for fixed loading intensity
5.9 Behaviour of a multi-threaded single processor with delay
5.10 State-transition-rate diagram for behaviour of figure 5.9
5.11 Contour Plot of Hiding Number for Single Processor
5.12 Behaviour of two processors with many threads each
5.13 State-transition-rate diagram for 2 processors each with 2 threads
5.14 Multiple threads per processor as a birth death system
5.15 State-transition-rate diagram for cohorts in a system with delay
6.1 Relative speedup for different cache sizes
6.2 Effect of multiple connectivity
6.3 Relative error as a function of loading intensity
6.4 Effect of queues in series
6.5 Relative error as a function of loading intensity
6.6 Maximum relative error as a function of the ratio of service times
6.7 Behaviour of a single processor with cache
6.8 Relative performance improvement for different costs of algorithm
6.9 The effect of growth of delay on absolute speed
6.10 The effect of growth of contention on absolute speed

List of Tables

2.1 Basic taxonomy of models
2.2 Basic taxonomy of models (continuation of table 2.1)
2.3 Properties of scaling models (derived from Worlton [131])
4.1 Comparison of the two physical interpretations
4.2 Correspondence between the physical and the behavioural models
4.3 Model characteristics not explicitly dependent on delay
4.4 Model characteristics explicitly dependent on delay
4.5 Number of processors for > 5% deviation from linear
6.1 Raw data for rendering of picture on tertiary tree of depth 1
6.2 Raw data for rendering of picture on tertiary tree of depth 2
6.3 u0 values as result of least square fit
6.4 Service and delay rates by cache size

Chapter 1

Introduction

Since Amdahl's paper in 1967 [8], which painted a bleak and limited future for the "multiple processor approach in terms of application to real problems", the understanding of the performance and scalability of such multiprocessor systems has been a subject of continuing interest and research.

The rationale and reasoning behind most of the existing models of scalability is based upon the post-event analysis of specific executions of the computing system, ie the results observed after many actual trial runs with different numbers of processors have taken place; the predictive capacity of such models rests upon the fitting of the recorded data to some curve. The choice of functions for these curves has ranged from arbitrary polynomials, through polynomials based upon an analysis of an algorithm's data requirements, to functions based upon some underlying model of hardware performance.

Such models tend to aggregate the performance factors affecting scalability into large conceptual units such as `serial fraction' or `overhead'. This is an understandable position, given that the opposite extreme, modelling at the level of every instruction in every process on every processor, is not feasible because of the extreme complexity of the actual behaviour of multiprocessor systems. The increase in complexity due to the number of processors that are present is dwarfed by the increase in complexity due to the many possible interleavings of operation that physically concurrent behaviour makes possible.

A disadvantage of aggregating all the factors into a single unit is that this aggregation process is not easily reversed. Once the constituent factors have been combined, it is very difficult to isolate the contribution of an original factor to the output from the performance model. Another difficulty in the interpretation and evaluation of current work is that there are no agreed metrics and standards for the presentation of results. Even such terms as `speedup' have many possible interpretations, and the conclusions that can be drawn from a `good speedup' can be entirely contradictory; the emergence of benchmarks in the area of message passing multiprocessors is therefore a welcome development. A long term effect of this lack of agreed metrics is that it is difficult to feed back any performance improvements made for a particular system into the general design/development process to improve future systems.

1.1 Performance issues

There is a plethora of possible performance issues that may be of interest to the users and designers of a parallel processing system. The advent of relatively cheap multiprocessor systems has meant that the user often takes on the rôle of system designer, which involves not only the arrangement of the hardware components, in which there may not be much flexibility, but also the selection and use of appropriate software architectures for the particular problem at hand. This brief is markedly different from the traditional rôle fulfilled by the designers of new computer systems.

In building a traditional large sequential machine, the effort that would be expended in producing the computer's architecture was of such magnitude that it could usually only be contemplated by large organisations such as computer hardware manufacturers. The design would be taken on by a team assembled with the specific task of producing such a new system. This team would include many specialists, some of whom had extensive past experience in modelling and building complex hardware and its associated operating systems. Typically, their brief would be to use their knowledge to design and build a relatively general purpose system to optimise performance for the perceived needs of a particular market sector.

The designer or user of a multiprocessor, or highly distributed, system today needs a different set of skills and aims. The individual is unlikely to be part of a large design team which possesses the complete set of skills. Also, the set of goals that needs to be satisfied has been enlarged; when designing a scalable system, with its large and possibly arbitrary number of processors, the problem is far more complex than the one addressed when designing a system for a fixed number of processors. Such a fixed system is effectively `single-use' and is likely to be used only for the problem for which it was designed. An additional pressure is to produce a prototype with a small number of processors (possibly for a cut down problem) which will `scale' to solve the same problem in some larger form.

The designer of such a scalable system may have many potential performance factors to take into account. Examples of such parameters are: memory usage, price/performance, response time, run time (wall clock time), processor idleness, and the scalability limits of the hardware, the system software and the algorithm. There are three distinct aspects to each of these performance parameters:

Measurement: What performance parameters can be measured? Which performance parameters are worth measuring? How can a given performance parameter be measured? What effect does the measuring of a performance parameter have on the performance of the system?

Extrapolation: Given that you have an existing system with associated performance measurements, what performance characteristics can be extrapolated for systems containing more processors? How accurate are such extrapolations? What are the limiting factors in your design/implementation? How and when will these factors limit the scalability?

Prediction: This is the Holy Grail of performance analysis. Can the performance parameter be predicted from some performance model even before the system has been built? What information needs to be gathered about the operation of the proposed computer system in order for the performance to be predictable? How accurate will that prediction be? How are the individual parameters interrelated?

1.2 Support for the design process

It is often easier to talk about performance in terms of some abstract model; such a performance model should also give guidance as to how to achieve good scalability. The guidance given by existing models tends to be so general as to be of limited use. They may imply that the `serial portion' should be kept small, but give little advice on how to achieve this.

A good performance and scaling model should be able to give support for the "what if?" questions that arise in the design process. Ideally, it should be able to give quantitative comparisons of various options, but even a qualitative comparison is useful. A general model should also be able to provide a starting point for a more extensive modelling or simulation of the intended system if this is felt to be appropriate. Any abstract model should also have the ability to provide a framework for the understanding of any improvements in the performance of a completed system. This will allow such improvements to be expressed in a form that can be applied to future systems.

1.3 Motivation and aims of this thesis

The performance model, and its use in the analysis of scaling, presented in this thesis arose out of the desire to understand the performance behaviour of an existing system, both in the aspect of prediction of its performance, and in the limits on its scalability. There were three main aims in building such a model:

- To keep it as simple and as small as possible.

- To relate the entities in the model to those entities that are present in a multiprocessor system, the nature of these entities not being limited to either hardware or software components.

- To support "what if?" questions, from the point of view of, firstly, analysing the increase in performance with the addition of more processors; and, secondly, of finding the best way of mapping a problem onto a particular architecture.

In order to fulfil these aims, it was felt that the use of system-specific simulation models would have limited usefulness and, as such, an analytical approach was taken. The analytical technique that is used is that of Markov chain queueing theory. The different levels of complexity of the various models correspond to different queueing systems.

The main objective of this thesis is to develop a modelling framework that captures aspects of scalability by capturing specific performance criteria. Models are built around three basic components: use of (and contention for) computation components (ie. CPUs), contention for shared resources (eg. remote services, virtual memory, communication resources) and delay (eg. propagation delays, use of other resources not subject to contention). An important feature of the approach is that these components can be combined in a manner that takes account of their `behaviour', and which allows, at least in principle, the prediction of the rate of occurrence of observable events. The inclusion of observable events within the framework allows for the objective validation of the accuracy of a given model against actual data, for the quantification of different design approaches, and for the designer to gain some `feel' for the tradeoffs that are always present in designing and implementing distributed and parallel systems. This use of behavioural models is in contrast to several of the existing scalability approaches, which consider amorphous structure (eg. ratios of sequential to parallel sections of code) entirely divorced from its actual order of invocation or its mutual interaction.

For such a framework to be successful it needs to be `expressive'. It needs to allow for the capture of many different performance aspects, and to permit different models of the same system which may concentrate on different aspects. Another aspect of this expressiveness is the ability to aggregate several factors, in a decomposable way, thus containing the complexity of the model while making it richer. As with all models, it should describe the way in which its approximations interact with the final results, so that the user of any model can place an appropriate degree of trust in the output from the modelling process.

The framework is particularly useful for modelling systems with cyclical behaviour where those systems are in a steady state. Examples of such behaviour are the computational levels of:

- Distributed Virtual Memory algorithms in message passing systems;

- latency hiding in distributed systems, including analysis of the tradeoff between increasing the level of multi-threading and increasing the number of processors;

- the use of remote procedure calls in distributed systems.

In addition to its use in analysing existing and proposed systems, the framework also sheds some light on some of the arguments over the nature of scaling in parallel systems, and over its fundamental limitations.

1.4 Outline of chapters

1.4.1 Chapter 2

This chapter examines the motivation for models of scalability and performance. In it we attempt to categorise existing models. We also examine the factors that affect performance and look at approaches, in the literature, for the inclusion of these factors into scaling models.

1.4.2 Chapter 3

This chapter introduces the structural model that is adopted for future chapters. We introduce this model in two complementary ways. To emphasise the importance of the behaviour of individual components, Petri nets are used. These can be enhanced with stochastic information to generate Generalised Stochastic Petri Nets. This behavioural view is then shown to be equivalent to a birth-death Markov chain system. We then define the performance metrics used in this thesis in terms of observable properties of the model, and briefly survey the existing Markov chain performance literature which relates to multi-processor systems. This chapter concludes with a description of the actual implementation that provided the initial impetus for this modelling exercise.

1.4.3 Chapter 4

In this chapter we take the basic model as a starting point to develop a performance model for systems which contain several processors; this performance model is then expressed as a birth-death Markov chain system. For the purposes of this development, we assume that there are no delays present in the system. From this Markov system we then derive useful performance quantities, such as the fractional idleness of the processors and the absolute rate of computation of the system, and discuss the applicability of the underlying assumptions of Markov chain systems to the analysis of the performance behaviour of parallel systems. We then enhance this basic model of behaviour by introducing delay, deriving and solving the resulting system. The performance characteristics of the model with delay are investigated and compared with the results of the model without delay.

The central rôle of the quantity ρ, the server utilisation, is discussed and used to derive the limit on the absolute rate of computation that is present in systems to which our modelling technique applies. We then identify a previously unreported equivalence relationship between the models with and without delay and discuss the consequences of this relationship. This relationship raises questions of external observability: we show that for systems of this type it is not possible, from external observation alone, to determine whether or not delay is present.
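For orientation, the server utilisation ρ of the delay-free model (the M/M/1/K/K, or machine-repairman, queue outlined in Chapter 4) can be computed by solving the birth-death balance equations numerically. The sketch below is a standard textbook calculation, not code from the thesis, and its parameter values are purely illustrative:

    def mm1kk(k, lam, mu):
        # k processors each issue requests at rate lam while computing;
        # a single server satisfies requests at rate mu.  State i is the
        # number of requests queued or in service at the server.
        p = [1.0]
        for i in range(k):
            # detailed balance: (k - i) * lam * p[i] = mu * p[i + 1]
            p.append(p[i] * (k - i) * lam / mu)
        total = sum(p)
        p = [x / total for x in p]
        rho = 1.0 - p[0]                                    # server utilisation
        busy = sum((k - i) * pi for i, pi in enumerate(p))  # mean busy processors
        return rho, busy

    rho, busy = mm1kk(k=16, lam=0.05, mu=1.0)  # load intensity u = lam/mu = 0.05
    print(f"server utilisation {rho:.3f}; effective processors {busy:.2f}")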

1.4.4 Chapter 5

Having, in the previous chapter, quantified the idleness that individual processors can experience when there is only one process per processor in a system where delay and contention are present, we elaborate again on the basic model, this time to include many processes per processor. We do this, first, by examining the effects on performance of many processes on a single processor, both with and without delay, and then go on to develop the many-processor version of the model. Although Markov chain based performance models exist for many processes on shared memory multi-processing systems, they are not applicable to message passing systems. We discuss how the shared memory performance models make the underlying assumption that the system is `work conserving'. This work conservation assumption makes implicit use of the ease with which a process can be rescheduled on another processor within a closely-coupled shared-memory environment. These assumptions are not reasonable within a message-passing environment, as the costs both of knowing which processors are idle at any one time, and of rescheduling processes on those idle processors, are significant. The model that we develop for the message passing case has to cope with the additional requirements that responses be returned to the processor that originated the requests, and that a processor may be idle even though there is work (in the form of other processes) on other processors that could be run. This model is developed for systems both with and without delay, and we show how the potentially N-dimensional Markov chain system can be represented in only two dimensions. We introduce a new measure, the hiding number, which can be used to quantify how the performance effects of delay can be mitigated by the use of additional processes per processor.

One important factor in creating this model is the development of conditions under which the many states present in the original Markov chain representation can be aggregated, and hence the state space dramatically reduced. This reduction makes the application of the model to medium size problems (tens of processors, tens of processes per processor) computationally feasible. The extent of this simplification is easily illustrated: the original system has of the order of nm((nm)!/(m!)^n) states, whereas in the aggregated system the total number of states is around (nm)^2/2, where n is the number of processors and m is the number of threads per processor. For 100 processors each with 10 processes the aggregation reduces the original state space of about 10^1915 states to about 500,000.
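The scale of this reduction is easy to check numerically; the short sketch below evaluates both expressions exactly (Python's integers are unbounded) and is for illustration only:

    from math import factorial

    def full_states(n, m):
        # Unaggregated chain: of the order of nm * ((nm)! / (m!)^n) states.
        return n * m * factorial(n * m) // factorial(m) ** n

    def aggregated_states(n, m):
        # After aggregation into cohorts: around (nm)^2 / 2 states.
        return (n * m) ** 2 // 2

    n, m = 100, 10                      # 100 processors, 10 processes each
    print(len(str(full_states(n, m))))  # about 1915 digits, matching 10^1915
    print(aggregated_states(n, m))      # 500000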

1.4.5 Chapter 6

In this chapter we validate the model against measurements taken of a multiprocessor system. The observed results agree with the model to better than 0.1%. We go on to examine under what circumstances the proposed models of performance are applicable, and illustrate the sort of error that may be introduced in approximating the true behaviour of the system in order to apply our model. We also show how our model can be used to investigate scalability issues by looking at the effect of changing load intensity and delay as the number of processors changes. Our model can also be used quantitatively to examine the effect of design choices; this use is illustrated in §6.5.1.

Chapter 2

Performance Models and their Measurement

The factors that have been identified in the literature as affecting the performance and scalability of computer systems are almost as numerous as the papers themselves. Many authors [2, 8, 51, 52, 126, 128] argue that their particular set of characteristics or factors is, in some way, `objective', in that it corresponds with some static properties of the system under examination. This has meant that much effort has been expended in examining the influences of the `objective' component of computer systems, the hardware [31, 71, 96, 107, 109, 112]. Software influences have taken a background rôle and have been aggregated into such measures as `serial portion' and `efficiency'. For many of the authors the main motivation for generating these models has been to use them predictively, often to allow for the extrapolation of an observed performance profile to some larger system.

In this chapter, after a brief look at the subjective as well as other aspects of performance, we will endeavour to categorise the physical factors that go to make up performance measures as they appear in the literature, and examine their taxonomy and their interrelationship. Given the many interpretations that have already been placed on the terms `performance' and `scalability', it is useful to elucidate the way that these terms are used here.

Performance: We will use this term to represent the measure of how `well' a system, or a constituent component, accomplishes its assigned task. For some authors the continuous utilisation of a particular system component (eg CPU) is a dominant goal, leading to performance measures such as `efficiency' [36, 38, 40, 68, 101]. For others the performance issue is the reduction in the runtime (as measured by the elapsed time or `wall-clock') for a given problem [8, 31, 39, 40, 44, 91], or the running of the largest feasible problem on the available system [54, 55, 116]; these two differing views lead to the models of fixed-size speedup and scaled speedup respectively. As can be seen from just these few examples, there are differing views of the term performance. In several of the proposed models the qualities being measured are implicit, in that there is not necessarily a correspondence between the performance characteristic and a directly measurable quantity in the system under study.

some authors the continuous utilisation of a particular system component (eg CPU) is a dominant goal, leading to performance measures such as `eciency' [36, 38, 40, 68, 101]. For others the performance issue is the reduction in the runtime (as measured by the elapsed time or `wall-clock') for a given problem [8, 31, 39, 40, 44, 91], or the running of the largest feasible problem on the available system [54, 55, 116]; These two di ering views lead to the models of xed-size speedup and scaled speedup respectively. As can be seen from just these few examples, there are di ering views of this term performance. In several of the proposed models the qualities being measured are implicit, in that there is not necessarily a correspondence between the performance characteristic and a directly measurable quantity in the system under study.

Scalability This is the study of the change of performance measures of a system

as particular characteristics of that system are varied. The characteristic that is usually varied is that of the number of processors, but we will not constrain ourselves to just this; the `granularity' of the problem or the amount and type of memory (eg cache) could equally be of interest. It is to be hoped that an output from the analysis of scalability may be that any limitations can be identi ed as changes are made in the system's characteristics. For example the limitation on the speedup of the system as the number of processors increase, or the e ects of changing the software implementation on the number of processors that can be e ectively utilised, could be isolated.

Our examination will generate models which can be used for the analysis of existing systems, but it is biased towards the prediction of one or more performance measures as the parallel system is scaled. This scaling may be in the number of processors in the system or in the variation of some other factor. In discussing models of performance and scaling it is important to note that they are, in general, based around a perceived dominant characteristic extracted from the modeller's view of the parallel system. To structure the discussion of these characteristics, we have chosen to group the models into two distinct sets, macroscopic and microscopic. The distinguishing difference between these two categories is the size of the conceptual components.

Those models we have classified as microscopic tend to have basic components that correspond to quantities that, in theory, could be measured, and are not aggregates of other quantities. As such they could be viewed as atomic. The macroscopic models tend to have components that are conglomerates of many measurable factors, or that can be regarded as factors that are not directly measurable but are only available through the application of the model. Before discussing these factors in depth, it is useful to stand back and look at motivation: one point that is rarely made, but which we hope to illustrate here, is that end-users' perceptions of parallel processing cannot all be the same [131]. In an end-user's view there are many components and characteristics that play a part in their perception of its performance.

Problem size: The range of variation of problem size is vast, from problems of a fixed size, such as searching in a known size search space, to those in which the problem can be scaled to use all the available resources, both in terms of processing and memory.

Time constraints: These can range from "as quickly as possible" for such things as stock market trend prediction systems, through problems with a fixed upper bound in time such as weather forecasting and interactive design tools, to problems where solution time is not the limiting factor.

Accuracy: This is clearly related to the above categories. Some problems are more tolerant of inaccuracy than others. Structural analysis of a beam does not need to be carried to many decimal places, as the beam will be used well below its failure point. A model of the dynamics of a chemical process, where an increase in accuracy reduces the consumption of an expensive reagent, can, on the other hand, be seen as extremely worthwhile.

Cost: When used in the sense of financial cost, this is usually a constraining factor. However, strictly money-based models would be at the mercy of so many factors as to be of little long term use. Other measures based more on opportunity cost, such as the silicon requirement for implementation of a particular strategy, may have longer useful lives.

In its widest sense, the performance analysis of parallel systems is a very old discipline. Until the last century, all the processing units would have been human beings (with their skills and tools) and the tasks physical ones.

Although performance and the striving for `efficiency' are very much in vogue at present, this is usually applied to human systems, with all the associated uncertainties in behaviour and `processing' capacity. Being, at least in principle, entirely predictable automata, computers should demonstrate performance characteristics, and hence scalability features, that are entirely predictable. However, the complexity of interaction when several of these individually predictable automata are collectively involved, combined with the desire to predict, or at least to bound, performance, has led to divergent models of parallel processor scaling. These models have often been seen as being in conflict with each other, with the factors they seek to measure defying exact analysis.

The existing models of scaling and performance tend to take a two-dimensional view of speedup, namely of processors and (wall-clock) run-time. This choice of factors represents a particular path along the actual performance surface. Taking more factors into account, ie striving for a higher dimensional view, it is possible to see that the models formerly perceived as being in conflict are not; they merely face in different directions and choose different paths [131]. The arguments as to which is the `right' model are numerous and well documented, and the answer is highly dependent on what the user hopes to achieve with their system. It is also dependent on the lengths to which the user is willing to go in order to optimise their code for their parallel system, and on their ability to choose (or develop) a new algorithm for the problem that they wish to solve. The aim of this thesis is not to propose a universally correct model but to view the modelling problem from a different perspective, one that will hopefully give more opportunity for feedback into the design process.

2.1 Components of performance

2.1.1 Sequential portion

This performance component arises from Amdahl's [8, 128] original model but is the least well defined. Most authors [40, 70, 116, 126] who use the term implicitly define it as that portion of the computation that is not parallelisable. It has also been used as a catch-all to encompass all the factors that the author is not explicitly dealing with in their model [26, 39, 129, 136]. Given this lack of an explicit definition in terms of observable and quantifiable measures, the measurement and prediction of the sequential portion of any algorithm's computation is difficult. Amdahl [8] saw this serial portion as part of the inevitable `housekeeping'; others [39, 70, 116, 126] have seen it as an inescapable constant overhead or some function of the number of processors in the system. Gustafson [54] viewed it as a function of both the number of processors and the `ensemble' size, the granularity of the executed problem.

The observation [8, 51, 55] that most algorithms have some inherently sequential portion, which does not yield to parallelisation, appears to be perfectly reasonable. This is especially true given the context in which the early authors were writing, describing the performance characteristics of vector processors. However, techniques have been developed that remove or reduce the sequential effects of algorithms, thus moving the classification of the factor from this category to one of the others described in this section. There are many possible examples of such techniques: one is the use of alternatives to critical regions for access to shared data values, such as `fetch and increment' [41]; another is the loading of programs, in logarithmic time, into large numbers of processors. This logarithmic loading technique has been reported in [55] as 46 and 83 times faster, on a 64 and a 1024 node system respectively, than a simple serial loading approach.

The apparently downbeat future for parallel programming and the dominant influence of the `housekeeping' factor led many to look for the underlying cause of this housekeeping overhead, the so-called `serial fraction'. One consequence of these studies was the analysis of data dependency and instruction-level processor utilisation, which is discussed further in §2.1.3.
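The logarithmic loading technique mentioned above admits a simple sketch if one assumes, purely for illustration (the actual scheme in [55] may differ), that every processor already holding the program forwards one copy per round, so the number of loaded processors doubles each round:

    def loading_rounds(n):
        # Rounds until all n processors hold the program, when each
        # holder forwards one copy per round (doubling each round).
        loaded, rounds = 1, 0
        while loaded < n:
            loaded *= 2
            rounds += 1
        return rounds

    for n in (64, 1024):
        print(f"{n} nodes: {loading_rounds(n)} rounds versus {n - 1} serial sends")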

2.1.2 Communications delay

The delay associated with the transmission of data values and synchronisation information has been isolated as another factor that affects performance. The algebraic scaling models take the view that any delays are either not present at all, or are static; the typical physical intuition presented is the delay to transmit data over a communications medium. Such delay can be viewed as consisting of two distinct components: the first is a constant delay akin to the transmission delay; the second is the delay due to contention for some limited resource, such as the servers managing the communication medium or the bandwidth of the communication medium itself.

Most models, especially those which are based on Amdahl's and Gustafson's models, treat this component as having a constant effect or, at best, as having an effect that is directly related to topological features, such as distance. Thus they are only considering propagation delay and ignoring, usually implicitly, any delay due to contention. There is a common assumption that computation and communication do not overlap, or, if they do, that the portion of the communication time that does not overlap is constant. This is the quantity that is taken as the communication overhead [23, 109]. When scaling is considered it is usually combined with an analysis based solely on the path length that communications must travel [2, 90, 109, 112].

In view of the importance that communication effects have in message passing systems, much effort has been expended on the study of optimal interconnection networks in order to reduce this delay, usually through the minimisation of the topological distance. However, the realisation that the bandwidth of the communications resource is finite, with the consequent study of the contention for this finite resource, has received little attention. Paddon [100] has looked at the effect of message densities on scaleup as the number of processors increases, whereas Harrison [58, 59, 60] has looked at the behaviour of various multi-stage interconnection networks. Even Valiant's Bulk-Synchronous Parallel approach [97, 125] to modelling parallel computation explicitly excludes the possibility of this sort of contention from its analysis.

Hockney's approach to modelling the performance of vector processing units [66] should be included under this heading. There is a correspondence between the costs of starting up a vector processor pipeline and initiating a communications connection. This approach is also interesting in that the analysis is based on the behaviour of the actual hardware component.

One novel approach to measuring the effects of communications was that of Glasser and Zukowski [46]. They took a continuous density approximation to communication requirements and showed that, for an infinite number of processors with a constant density of processors per unit volume and inter-processor communication dependent solely on the distance between processors, the communication density must fall off faster than the fourth power of distance.


2.1.3 Algorithmic limitations

If it were possible to design and build a hardware platform which did not restrict the scaling of computer systems, there would still be an ultimate limitation on the reduction in the time taken to generate solutions: a property of the actual algorithm and its inherent capability for parallel execution. Such limitations come in several forms; the number of units of computation that an algorithm can make available for concurrent processing is one, and data dependencies within the algorithm itself also have an effect. There has been much theoretical study of the maximum concurrency that algorithms offer; this is well summarised in [7, chap. 4]. The effect of simple communication latency on speedups was investigated by Kruskal, Rudolph and Snir [80].

The fundamental analysis of algorithmic limitations has usually been performed in the context of the PRAM model [7, §4.2]. This underlying virtual execution model, one of uniform and consistent access to memory by an arbitrary number of processors, represents a hypothetical ideal that is not achievable in physical reality. Actual systems limit these freedoms in order to achieve physically realisable hardware components. Thus studies of algorithmic performance on PRAM machines represent an upper bound; implementations on available hardware do not achieve these limits, and many choices have to be made in achieving a workable high-performance solution on a parallel system.

An alternative to the PRAM model for the parallel execution of algorithms uses a data-flow like concept, that of data dependency. Gelenbe [43, chap. 5-6] takes a stochastic view of this algorithmic dependency effect. His starting point, the execution of tasks that spawn dependent tasks in accordance with some random distribution, leads to models of the potential performance of such algorithms. The assumptions made in the model, that the dependencies are acyclic and that there is no contention for any system resource (a practically infinite supply of processors, sufficient communications bandwidth, and other resources), lead to lower bounds on the execution time of such systems.

Gelenbe's work represents an extension of the series-parallel program structure approach, whose programming paradigm is also known as fork and join. This approach to coding parallel programs has also been a subject of study [93, 94]. These studies illustrate the finiteness of the concurrency available in problems, which, when combined with the observation that the work units will be generated in many locations in the system, leads to the problem of optimal load sharing in distributed systems. There have been several investigations into achieving this goal [84, 111] but they lack a theoretical model to allow for comparisons.

2.1.4 Algorithmic synchronisation costs

Data dependency expresses the finest grain of computational dependency. However, many numerical algorithms have points in their execution where data must be broadcast to many participating processors, or where some global condition in the calculation must hold in order to continue. These represent a `barrier', which may affect some or all of the processes in the computation. In most problems this barrier will not be reached at the same moment by all the processors. As no process may continue beyond this synchronisation barrier until all processes have reached it, there is some penalty associated with its existence [13, 49]. Usually such synchronisation is due to a fundamental property of the algorithm; for example, the checking for convergence of an iterative algorithm, or the distribution of updated values to some or all of the processors in the system.

The choice of the granularity of the work unit can affect the relative cost of barrier synchronisation. If the amount of computation in the work units is comparatively large and the distribution of work amongst processors is not perfectly even, then one or more processors will have to wait until all the processors have reached the barrier. Reduction of the amount of computation (granularity) per work unit, combined with low cost work distribution, can reduce this overhead. In the limiting case, with zero cost work distribution and infinitely small work units which attract no other overheads, the cost could be eliminated. Greenbaum [49] has produced a theoretical model of this cost with one unit of concurrency per processor and has shown, in this case, that barrier synchronisation can add 30-35% to the run time of an algorithm on a 1024 processor system.

The requirement for synchronisation comes in many forms. The global interaction that characterises barrier synchronisation represents one extreme, involving all the processes. Synchronisation requirements can be more localised, perhaps affecting only neighbouring processors, and hence have a less dramatic effect on performance [49, 85].
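The waiting cost at a barrier is easy to make concrete with a toy simulation; the exponential work-time distribution below is an illustrative assumption, not Greenbaum's model:

    import random

    def mean_barrier_waste(p, trials=2000):
        # Fraction of processor-time spent idle at a barrier when p
        # processors each perform an exponentially distributed unit of work.
        waste = 0.0
        for _ in range(trials):
            work = [random.expovariate(1.0) for _ in range(p)]
            slowest = max(work)
            # every processor is held at the barrier until the slowest arrives
            waste += sum(slowest - w for w in work) / (p * slowest)
        return waste / trials

    for p in (4, 64, 1024):
        print(p, "processors:", round(mean_barrier_waste(p), 2))

The idle fraction grows with p because the expected arrival time of the slowest processor grows while the mean work per processor stays the same.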

The importance of the effects of such synchronisation in parallel systems is such that a whole new approach to the design of parallel systems has been proposed by Valiant [123, 124, 125], which he has called the Bulk-Synchronous Parallel (BSP) model. An implementation of the BSP model would provide the ability to synchronise the components of the system (processors and memory units) at regular intervals, as well as providing a mechanism for communication between these components. It has been shown that such a system can be scaled provided that there is sufficient bandwidth and sufficient excess concurrency in the algorithm [123].

The BSP model contains some assumptions [97] which require practical solutions. The first is the need to ensure a random communication pattern to avoid communications contention or `hot spots'; this randomisation depends on using a `Fast Good Hash Function', and such functions have been proven to exist but have not yet been found. The BSP model also requires that the communication bandwidth increase logarithmically with the number of processors, and as such has implications for the manufacture of the components of a parallel system based on this model. The model explicitly ignores any overheads that may relate to the managing of communication and load balancing.

2.1.5 Resource saturation

As has been illustrated, most models ignore contention for resources and the influence of their saturation on performance and scalability. Where a model does include an examination of resources, it assumes that the overheads are static or a simple function of the granularity and topology. Although some work has been done on memory-bank contention in vector processing systems [15], and on the performance of certain classes of interconnection network under load [59], it has tended to view these components in the context of an open system, one in which there is a constant flow of requests against the resource. The consequent effects of such contention on the performance of the system as a whole have not been studied.

2.1.6 Other factors

We discussed communications delay in §2.1.2, where such delay is modelled solely on the basis of average topological distance. Such a measure does not take into account any contention for the communications resource; neither has any attempt been made to model the consequential effects that exist due to the routeing of messages in store-and-forward message passing systems.

One implicit assumption that is almost universal is that the cost of accessing memory (both for data and for the program) remains constant as other factors vary; this is known as the `flat memory approximation' [54, §3.4]. This assumption is inherent in all of the models that have been discussed here. The relationship between size (in both code and data) and the existence of caching has not been modelled. Several claims of super-linear speedup [20, 68, 101] can be shown to be due to the breach of this approximation.

Another factor related to the granularity of distributed computations which has not yet been studied is multi-threading [67, §9.2]; there is an inherent conflict between decreasing the computational effort on each processor and the need to keep the multiple ALUs present in today's super-scalar architectures active. Fulfilling this imperative is likely to have a noticeable effect long before the problem reaches its granularity limit.
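A toy cost model (our construction, not taken from [20], [68] or [101]) shows how breaching the flat memory approximation manifests as super-linear speedup: once each processor's share of the data fits in its cache, the cost per access drops, and k processors can run more than k times faster.

```python
def sweep_time(data_size, num_procs, cache_size=1_000_000):
    """Time for one sweep over the data; accesses are cheaper when a
    processor's share fits in its cache (illustrative costs)."""
    share = data_size / num_procs
    cost_per_access = 1.0 if share <= cache_size else 4.0  # assumed cache/memory costs
    return share * cost_per_access

t1 = sweep_time(8_000_000, 1)    # share misses the cache: slow accesses
t16 = sweep_time(8_000_000, 16)  # share fits the cache: fast accesses
print(t1 / t16)                  # 64.0: a "speedup" of 64 on 16 processors
```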

2.2 Classification of scaling and speedup models

Although many of the models use as their starting point some architectural feature of the system's hardware, the use of an architecturally based taxonomy does not reflect their basic differences and similarities. In order to achieve a general basis for comparison of performance and scaling models we have identified the following six attributes. These attributes do not necessarily correspond to explicit assumptions made by authors, which are often expressed in architectural terms, but are more abstract and often only implicitly present. These abstract properties can find expression in both software and hardware aspects of the system. For each model we will identify its properties under the following general headings:

Work preserving scheduling This is the assumption that if a processor is idle and there exists in the system a process that is able to be run, then it will be run. It is often implicitly made in shared memory models, hence ignoring costs such as the maintenance of cache consistency and the inherent contention for the resources associated with such global scheduling.

Granularity overheads It is often assumed that there is little or no cost in reducing the granularity of the algorithm's operation. Instances of this include: the flat memory approximation, in which all memory accesses are of constant cost, ignoring caching and virtual memory effects; constant data synchronisation overhead, in which there is no change in the costs of acquiring and distributing data or in achieving global conditions like convergence; and unlimited subdivision, in which there is no lower bound on the size of the computational unit, or the bound is a single instruction.

Non-Computational Delay This is related to work-preserving scheduling, but covers those factors that delay the execution of the algorithm without incurring processing overhead, eg communications. Factors that contribute to this heading represent the opportunity for multithreading within the system.

Algorithm concurrency This represents the model's assumptions on the content and availability of concurrency in the execution of the algorithm. The upper bound comes from PRAM studies, but the realisation on particular hardware, and the way in which it varies, represents the opportunity for multithreading and processor usage.

Resource finiteness Many modelling approaches do not consider contention for resources, ie they assume an adequate supply of processors, memory or communication bandwidth. If a finite resource is included within the model there is usually only one, eg available concurrency or processors.

Output measure What is the measure of interest in the model? The type and characteristics of these measures are discussed further below.

With the exception of work preserving scheduling these form the column headings in table 2.1. Work preserving scheduling is a common assumption, usually implicit, for all the existing performance and scaling models.

2.2.1 Output measures

The desire to reduce the complexity of performance issues to a single quantity is very strong. Unfortunately, once this single value has been arrived at, it takes on a nature all of its own, with the compromises that were made to achieve it forgotten.

[Table 2.1: Basic taxonomy of models. Columns: Output Metric; Modelling Method; References; Granularity Issues; Non-Computational Delay; Algorithmic Concurrency; Resource Finiteness. Default entries: granularity implicitly assumed infinitely variable; only computation considered; concurrency implicitly assumed infinitely variable; infinite resource. The rows, whose tabular layout was lost in extraction, classify the fixed-size, scaled and fixed-time speedup models by modelling method (algebraic serial fraction, complexity function or abstract characterisation), covering [8], [13], [20], [23], [26], [36], [39], [49], [51], [54], [55], [86], [92], [109], [116], [117], [126], [128] and [134], with noted departures from the defaults such as synchronisation costs, barrier costs, data dependency, parallelisation overhead and assumed-sufficient concurrency.]

[Table 2.2: Basic taxonomy of models (continuation of table 2.1). The rows classify efficiency, serial fraction, solution time, solution rate, response time and throughput/power models, covering [26], [29], [31], [36], [39], [40], [54], [66], [70], [77], [86], [96], [109], [112], [126] and [136], with noted departures such as constant or topologically dependent communications delay, pipeline start-up, synchronisation costs, fork/join data dependency, stochastic arrivals in an open system, fixed configurations and finite processors.]

This pattern is well illustrated by the measure `speedup'. Once it is proposed as the output measure, there is a tendency to place the maximisation of this value above other issues. It has even been suggested to the author that, as speedup is very dependent on the time taken to solve the problem on one processor, making sure that this single-processor solution is inefficient was a desirable aim.

Speedup, as an issue, was brought to the fore by Amdahl's famous 1967 paper [8] and formalised by Ware [128]. Its seminal position as the measure of scalability can be seen by its widespread use as a basis for scalability comparison [13, 20, 22, 26, 36, 38, 54, 68, 86, 98, 101, 109, 117, 127, 129, 134], and it remained unchallenged in the literature until Gustafson in 1988 [51]. The subsequent analysis instigated by Gustafson showed that there were other ways of viewing such speedup, and the original model of Amdahl has become known as fixed-size speedup; its starting point is a problem of fixed size (in terms of abstract computational requirement), and the aim of allocating more processors to its solution is to minimise the time to reach a solution.

Gustafson started from a different viewpoint. His limiting factor was the amount of memory available in the parallel system; hence he was attacking a class of problems for which a single-processor solution was not feasible. Instead of taking as the starting point the properties of the execution of the problem on one processor and, through experimentation and extrapolation, looking at the way the same problem runs on many processors, Gustafson's starting point is the execution on a large number of processors; working backwards, he predicts how long the problem would take to run on just one processor. This change in measuring approach led to speedups in excess of 1000 on 1024 processors [55]. This model has become known as scaled speedup [39, 51, 54, 55, 92].

Another constraint on the usefulness of parallel algorithms is the timeliness of the output from the computation, the archetypal example being weather prediction. The observation that there are problems in which accuracy can be traded for speed led to the fixed-time speedup model [50, 54, 116, 117], which measures the amount of problem-related computation that can be performed in a specified time on differing numbers of processors.

Speedup, in its various forms, represents the usual way in which scalability results have been presented. Other measures have included efficiency [26, 36, 44, 54, 70, 112, 131], in which the ability of the system to perform the task, compared against a hypothetical maximum, is seen as the crucial measure. Also closely related is the issue of serial and parallel fraction [25, 26, 40, 70, 86, 109, 126, 136]. These lines of enquiry have their roots in Amdahl's original thesis, and have been more fully developed by Karp and Flatt [70].

All of these speedup measures are dimensionless. Their common formulation as the ratio of execution times on systems with differing numbers of processors gives rise to a common weakness: the choice of the value in their denominator. The selection of a baseline of a single-processor runtime for a problem has been argued extensively; all of the models cited above differ only in their recommendations of the configuration over which the denominator should be measured. This similarity has led to some work on `unifying' models [54, 136] which have tried to characterise algorithms by the use of linear algebraic descriptions of their execution time. Worlton [131] has identified the dimensionless nature of the metrics as a major obstacle, and several authors [31, 39, 96, 107] have deliberately presented results as `time to solution' to avoid it. To counter this over-reliance on dimensionless characteristics, Hockney [66] has proposed that solution rates should be used, and Kleinrock and Huang [77] have independently examined this issue and proposed similar measures of throughput and power. This final measure allows for the capture of the tension between processor efficiency and mean response time. Although Worlton has attempted a taxonomy of performance metrics and scalability measures, he has been hindered by the way in which these models view the performance factors as entirely hardware-based. This led him to develop a 48-way taxonomy, which does not help in acquiring a compact overview.

2.3 Overview of scaling and speedup: models and metrics

In order to write a synopsis of the existing literature it is necessary to group the constituent concepts together. Any act of grouping these existing models and measures of parallel systems requires the author to single out one attribute as predominant over many possible others. Table 2.1 represents one classification of the literature in terms of the implicit and explicit assumptions and the output metric. This simple taxonomy still has a large number of classifications, although it does serve as a start in bringing together the literature under a small number of headings. The motivating phenomena for all of these models are either data dependency or hardware-related effects.

The way in which these models concentrate on individual physically related phenomena, while aggregating the other factors into a single constant, has led us to an additional taxonomy of the models which uses the level of detail in the model formulation as the basis of differentiation. We classify those models which take a broad view of the system under study as macroscopic, whilst those models which use individual components that are seen to correspond with physical factors we term microscopic.

In addition to this compartmentalisation based on the size of the component factors, there are several other issues that can be used to distinguish the models, several of which have already been identified in §2.2. One distinguishing feature is the way in which the model is formulated. At one extreme is Amdahl's view, which is entirely algebraic and not related to any structural aspects of the system being modelled; the other extreme is the inclusion of so many factors as to obscure any abstract principles.

Many architectural assumptions are implicit in these models. When attempting to apply the models to message passing environments, one assumption that is often breached is that the work loading of all processors can be known accurately throughout the whole system and that work can be transferred between processors at negligible cost. The work conserving scheduling strategy typifies these assumptions and implicitly lies at the heart of several performance models [8, 22, 26, 39, 136]. Amongst the other assumptions made are that communication (of data values) carries no cost, or that the cost is constant, or that any variation is negligible.

Synchronisation is another factor that has received individual treatment. Although the effects of barrier synchronisation are well documented, there is an interaction between such synchronisation and other factors. One form of synchronisation that has not received much attention is that associated with the client/server model of execution. Traditionally this model has been seen as applying to local or wide area networking; however, it is equally applicable to paradigms used in message passing multiprocessor systems, such as data sharing mechanisms.

Another viewpoint on a particular model is its view of the effects of granularity. Most models make assumptions of infinitely divisible work and communication. One assumption that most authors, and we, will make is that the execution environment is error free. Although this may seem reasonable when considering tens of processors with their associated memory, this is not an assumption that is likely to continue to hold as the number of processors increases.

2.3.1 Speedup formulæ

We will take the opportunity here to introduce the nomenclature and quantities that will be used throughout the rest of this thesis. They are introduced now to give an unambiguous definition of these terms, which are often left ill-defined in the literature, and to allow for some general discussion of their use.

2.3.1.1 Solution Rate

Although speedup, in its various formulations, has been seen as the measure of the performance of parallel computer systems, its dimensionless nature detracts from its usefulness. We prefer as a measure that of solution rate, as proposed by Hockney [66]. This has the advantage of representing a quantity of direct interest to the end user: solutions per unit time. We will represent this by the symbol R. This quantity may be seen as the simple reciprocal of the time taken to solve the problem, which we will represent as T. It has two advantages:

• There is often some sub-portion of the overall computation which can be seen as `a solution'. Thus the same quantity can capture the performance of some part of the overall computation.

• Much of the discussion as to the applicability of the various models of speedup hinges around different views of this quantity. Amdahl's law is the maximisation of this quantity given that the problem size remains constant; Gustafson's scaled speedup maximises this quantity in the presence of an increasing solution size. On the other hand, fixed-time speedup maximises this quantity while keeping the time constant.

2.3.1.2 Speedup

This is traditionally seen as the ratio of two times: the time taken to solve `the problem' on one processor, T_1, against the time taken on some number of processors, T_k. We will represent this quantity by the symbol S.

Given our view on the importance of the rate of solution, we will view speedup as the ratio of two solution rates on different configurations, ie

(2.1)    S_k = \frac{T_1}{T_k} = \frac{R_k}{R_1}

2.3.1.3 Efficiency

This is yet another quantity that is used loosely in the literature. It is often used as a measure of processor non-idle time, perhaps because idle time is a quantity that is amenable to measurement. We would prefer to see this as a measure of the useful processing power expended on obtaining the solution, although this may be difficult to measure in actual systems. We will denote this quantity with the symbol E.

2.3.1.4 Response Time, Throughput, and Power

Although response time and throughput have the same units as time taken and solution rate, they have a more precise meaning in the performance literature in general. There is also a difference in the emphasis of the physical quantities being measured. Solution time and solution rate would normally be seen as relating to a single solution, whereas response time and throughput are viewed as average measures of many events over some period of time. These terms will be used when we are viewing such averages.

Another reason for introducing these terms is to introduce a measure of `power'. This has been defined for a general queueing system [77] as

(2.2)    \frac{\rho}{T/\bar{x}}

where \rho is defined as the system utilisation, T as the mean response time and \bar{x} as the average service time. This allows for the capture of the traditional performance dichotomy: high utilisation requires long queues of customers; customers do not want long queues, as these lead to large response times.
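As a concrete reading of equation (2.2) (our illustration, for an M/M/1 queue, a case not singled out in [77]): the mean response time is T = \bar{x}/(1-\rho), so the power reduces to

    \frac{\rho}{T/\bar{x}} = \rho(1-\rho)

which is maximised at \rho = 1/2; pushing utilisation beyond one half costs more in response time than it gains in throughput.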


2.3.2 Macroscopic measures

Amdahl's seminal paper of 1967 [8] is the first example of an attempt to qualify the usefulness of parallel processing systems. In its formulation by Ware [128], it can be seen as the first attempt at a scaling model. Amdahl examined the prospects for the unbounded reduction of time for the solution of problems on multi-processor computer systems. Through his analysis he saw a limited future for parallel systems. His premise, which he saw as a historically proven fact, was that the amount of sequential processing in a program was a large fraction of the overall computation. This serial portion, of tens of percent, was also seen as immutable, and as such was the limiting factor to scalability. Although reported work of the 70's and 80's [87] did not support this pessimism, it was not until 1988 that an alternative scaling model was proposed.

In 1988 Gustafson, Montry and Benner [51, 55] questioned the validity both of the assumption that the serial fraction was immutable and of the resulting pessimism. They had achieved speedups of over 1000 on `real' problems involving 1024 processors in which the measured serial portion, on a single processor, was between 0.4 and 0.8 percent (the application of Amdahl's law to their problems having predicted speedups in the range 125 to 250). This led Gustafson to propose a reformulation of Amdahl's law which he called scaled speedup.

In formulating his model, Gustafson adopted a basic tenet which differed markedly from that of Amdahl. Gustafson saw the total available memory resource as the factor limiting the usefulness (to the end-user) of a parallel system. As the number of processors (each with its own memory) increased, he saw the problem scaling to fit the available memory, with the associated increase in computation requirements. In Amdahl's view, the reduction in run time (while the memory and total computational requirements remained the same) was the overriding aim.

In both Amdahl's and Gustafson's scaling models there is no attempt at predicting performance on a particular number of processors; they are both viewed as models that predict the bounds on scaling. This, unfortunately, has not stopped other authors attempting to use them for such predictions.

An alternative model, which can be seen as a compromise between the extremes, has also been proposed. In this model, the total time for the solution of the problem is fixed. The problem size is allowed to increase as the number of processors increases, provided the total execution time remains the same. This scaling model is known as fixed-time speedup. It was first discussed in an exchange of letters about Gustafson's comments on Amdahl's law [61, 53], and the effects of this model have been expounded by both Worley [129] and Gustafson [54].

All the approaches outlined above see `sequentiality' as the factor that bounds the speedup of parallel processing systems. They may see this sequentiality as a constant for the problem or as a complex function of the characteristics of the particular system under study. Most authors treat such sequentiality as an indivisible unit, an approach that typifies the macroscopic view of speedup models and metrics. These effects have also been studied by Gelenbe [43, Chap 3] in the presence of granularity effects and inter-process communication; the granularity effect studied there is the imbalance in load on the individual processors, whether due to the size of the individual schedulable units of computation or to the properties of the program itself.

Complementing these scaling models, there are two macroscopic metrics of parallel computing performance based around this `serialness': Karp and Flatt's [70] serial fraction and the measure of incremental efficiency proposed by Worlton [131]. These metrics attempt neither to model performance nor to give predictive scalings, but strive to provide a basis for the analysis of performance results from parallel systems. They are expanded upon in §2.4.6.

2.3.3 Microscopic measures

Less macroscopic models and metrics also exist; they look to the physical characteristics of an algorithm or system, identifying the cost, in terms of time, of individual components in solving a given problem in a parallel system. In categorising these approaches it is useful to refer to some overall model and view the published work as contributing towards the understanding of one or more of the particular components of that model. The model that will be used here is common in the literature; it views the total time taken as comprising three individual components, and can be expressed as follows:

(2.3)    T_k = T_s + \frac{T_{par}}{k} + T_{sc}(k)

where T_k is the total time taken for the work on k processors. The components are as follows:

• T_s is the purely sequential portion of the problem,

• T_{par} is that portion of the problem that can be run in parallel, and

• T_{sc} is the time spent in synchronisation and other overheads.

The overhead function T_{sc}(k) can be viewed in various ways, but in almost all cases it is expected to be a monotonically increasing function of the number of processors. Viewing equation (2.3) as a solution rate we get

(2.4)    R_k = \frac{k}{k(T_s + T_{sc}(k)) + T_{par}}

As mentioned above, the three components of equations (2.3) and (2.4) have received individual attention in the literature.
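The interplay of the three components is easy to exercise numerically. In the sketch below the overhead form T_{sc}(k) = a ln k is an assumption of ours (one of many satisfying the monotonicity expected of the overhead function), and the parameter values are arbitrary.

```python
import math

def total_time(k, t_s=1.0, t_par=100.0, a=0.05):
    """Equation (2.3) with an assumed logarithmic overhead T_sc(k) = a*ln(k)."""
    return t_s + t_par / k + a * math.log(k)

def solution_rate(k):
    """Equation (2.4), equivalently the reciprocal of the total time."""
    return 1.0 / total_time(k)

# The growing overhead term eventually outweighs the shrinking parallel term,
# so the solution rate peaks at a finite number of processors (here k = 2000,
# since the derivative of the total time vanishes at k = t_par / a).
best_k = max(range(1, 4097), key=solution_rate)
print(best_k, solution_rate(best_k))
```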

T_s Serial effects have been investigated by many authors, both in the abstract [8, 51, 54, 53, 55, 129, 61, 116] and in relation to specific algorithms or classes of algorithms run on specific systems [23, 31, 96]. There is evidence that the sequential component can be reduced by various techniques, thus moving the contribution to the overall time cost from the sequential element to some other component of the equation. Indeed, Gustafson appears to claim that it can effectively be eliminated [55, p609], with the attendant removal of the barrier to speedup.

T_{par} The size of this measure quantifies the total amount of potential concurrency in the algorithm/problem. However, it does not attempt to quantify the potential limits on the subdivision of this work. The study of the amount of work that can be done in parallel can be classified in three distinct ways:

Statically predictable potential concurrency Problems that fall into this class tend to arise from models of physical systems. They usually map well onto grid solutions. Even though the total concurrency available may vary with time, its variation will alternate between statically predictable values. The source of these variations is the potentially available concurrency in different stages of the overall algorithm. The allocation of the work units is usually done statically. Typical example problem areas would be fluid dynamics and radiosity.

Predictable potential concurrency, continuous variation with time Problems that fall into this class tend to exhibit the following property: the potential concurrency available as the solution progresses grows monotonically to some peak (which may be input-data dependent) and then reduces; the number of work units created by existing work units tends to follow some simple formulation. Such problems, where work units are created throughout the system, also have the accompanying problem of the dynamic distribution of such units and of their balance across the system. Typical examples of problems in this class are those amenable to branch-and-bound and divide-and-conquer techniques.

Probabilistic potential concurrency Problems in this class tend to arise where the amount of concurrency is data dependent and probabilistic in nature. All the comments on the preceding class apply. The difference between this class and the class above is that the technique for modelling the amount of potential concurrency is probabilistic. Typical examples of problems that lie in this area are those of macro-dataflow execution and parallel execution models of declarative languages.

The influence of such program structures has also been studied [43], and predictive models of concurrency have been developed for the last two classes above for certain sets of applications; these models use stochastic techniques to derive averages and distributions for the potential concurrency [62, 93, 94, 104].

T_{sc} The approaches to this component exhibit the most variation. In most of its uses it is intended to encompass the overheads associated with both synchronisation and communication. However, it is useful to view these two aspects separately.

Synchronisation In the literature, synchronisation is usually synonymous with barrier synchronisation. Such barriers represent a point in the flow of computation where progress beyond the barrier is dependent on every processor (from a particular set) having arrived at that point. This set of processors is often all the processors in the system, as is the case when global information has to be exchanged. A typical application class which exhibits this property is certain iterative techniques for the solution of dense matrices. The effects of such synchronisation have been studied by Axelrod [13] and Greenbaum [49], and a general model of execution of parallel programs viewing such synchronisation as the major component in its design has been proposed by Valiant [124, 125].

Communications The modelling of the overhead of communications has been included in several models. Initial models have assumed a constant time overhead [31] or a linear function of the message length [109, 129].

One other approach to this T_{sc} component is more macroscopic. It was first proposed by Flatt and Kennedy [39], and a less restrictive model has been proposed by Muller-Wichards [92]. In this approach the complex nature of the T_{sc} function is acknowledged and, instead of trying to produce an explicit formulation for it, generally accepted intuitive properties of the overhead are used. These properties lead to constraints on the form of the overhead function T_{sc}. To satisfy these constraints a function would need to be a monotonically increasing function of the number of processors in the system, as well as continuous, differentiable and non-negative for all positive numbers.
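For reference, the linear model just mentioned is commonly written with two parameters (the symbols are our notation, not drawn from [109] or [129]):

    T_{msg}(L) = t_0 + \frac{L}{B}

where t_0 is a per-message start-up latency, L the message length and B the link bandwidth; the constant-overhead assumption of [31] corresponds to keeping only the first term.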

2.4 Scaling models

2.4.1 Fixed size speedup (Amdahl's law)

As well as being the best known, Amdahl's law [8] is the most widely criticised. A basic tenet of this model is that there exists a serial portion of the overall execution, and that this portion is immutable. Amdahl's formulation has also set the attitude that the metric of interest is that of `speedup'. Amdahl argues his principle from the necessity of the `housekeeping' associated with data management and other such purely sequential features of programs. His law, which will be termed fixed-size speedup to distinguish it from other scaleup models, can be formulated in several ways; one way is to take the actual times spent in the serial portion and the parallel portion of the computation:

(2.5)    \text{fixed size speedup} = \frac{T_s + T_p}{T_s + T_p/k}

where k is the number of processors, and T_s and T_p represent the time spent in the sequential and parallelisable portions of the algorithm, respectively.

This serial time corresponds to Amdahl's `housekeeping'. He quantified such `housekeeping' at 40% of execution time, based on an historical analysis. The formulation of equation (2.5), combined with the observation that the time taken on a single processor is T_s + T_p, leads to the following alternative formulation of relative speedup:

(2.6)    \text{fixed size speedup} = \frac{1}{s + (1 - s)/k}

where s is the serial fraction of the algorithm and T_s + T_p is normalised to 1. This formulation is due to Ware [128].

Although Amdahl's 1967 paper saw "an upper limit on throughput [of a parallel system] of five to seven times the sequential processing rate ...", this was not the reality that several systems saw in the 70's and 80's. Such machines as C.mmp [132] and Cm* [118] saw speedups in excess of Amdahl's predicted limits [87]. It is thus somewhat surprising that this prognosis expressed by Amdahl was not challenged in the literature until 1988.
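Equation (2.6) is easily exercised numerically; the serial fractions below are illustrative choices, with s = 0.4 echoing Amdahl's 40% housekeeping estimate.

```python
def fixed_size_speedup(s, k):
    """Ware's formulation of Amdahl's law, equation (2.6)."""
    return 1.0 / (s + (1.0 - s) / k)

print(fixed_size_speedup(0.4, 1024))   # ~2.5: bounded by 1/s however large k grows
print(fixed_size_speedup(0.05, 1024))  # ~19.6: still far short of 1024
```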

2.4.2 Scaled speedup (Gustafson's law)

In his Karp prize winning paper, Gustafson, with Montry and Benner [55], argues that in most problems the sequential portion of a computation can be minimised by the use of various techniques, such as the overlapping of communications with computation, load balancing, latency reduction and the analysis of sequential operational dependencies. In such a problem the sequential portion of the task can be effectively decreased as the problem size is increased, thus "removing the barrier to speedup as the number of processors is increased" [55, p609].

The inadequacy of Amdahl's law to explain their observed results was one justification for Gustafson's scaled speedup. Another justification for this model as a better measure of parallel processing performance was the hypothesis that

"... except when doing academic research; in practice, the problem size scales with the number of processors." [51, p533]

and that the limiting factor on the usability of parallel processing systems was the time that a user was prepared to wait for the result, not the problem size. His approach was to take the time for the largest problem that could be run on the maximum configuration available and extrapolate the time that would be required for the solution on a single processor (given that it had sufficient memory). As has been noted [131, p1087], this approach ignores a whole class of problems whose solution is beyond currently available computational power. Problems in this class are the `interesting' ones, and occur at the frontiers of science and engineering.

The modelling approach taken was to extrapolate the time that would be taken on one processor for a problem that would only fit (due to memory constraints) on a k processor system. Thus scaled speedup is expressed as follows:

(2.7)    \text{scaled speedup} = \frac{T_s + kT_p}{T_s + T_p}

(2.8)                         = \frac{s + pk}{s + p} = k - s(k - 1)

given that the time taken on the k processor system is unity, ie s + p = 1.

In reducing these overheads Gustafson saw that other effects began to dominate. They were found to be due to the invocation of error-correcting codes on certain processors, and to the data-dependent execution times of certain arithmetic operations. These effects are reported to have caused variations of up to 10% in the execution times of a step of the algorithm thought to take a constant time. In all the sample problems reported by Gustafson et al, the vast majority of communications were with near-neighbour processors (within a hypercube interconnection network topology). The amount of computation per step was not data-dependent, and the load balancing was performed statically at program loading. In one case a small amount of global communication (apart from the returning of results to the host processor) was present. This was a Beam Strain analysis using the Preconditioned Conjugate Gradient (PCG) technique, in which the global inner products were exchanged by spanning-tree-based communication to all hosts. They report the following speedups.¹

¹It should be noted that there is some confusion as to the actual speedups that the group reported; the results reported in [55] are much better than the results in the original technical note, due to the re-evaluation of the run time of the best sequential version [53].

                             Speedup on 1024 processors
                             fixed size      scaled
  Wave mechanics                637           1020
  Flux corrected transport      519           1009
  Beam Strain                   502           1019

It is reported that this work was seen by many (especially by the more popular trade magazines) as `violating' Amdahl's law, and several attempts [70, 136] were made to unify the approaches. One outcome from an exchange of letters between Heath [61] and Gustafson [53] has apparently led to the development of fixed-time speedup.

It is interesting to look at the attempts that have been made to `unify' these approaches, the most notable of which are Van-Catledge [126], Zhou [136] and Sun and Ni [117]. Van-Catledge proposes a modification to the modelling of the time taken in the parallel and sequential portions of the execution. His view is that the time taken to solve a particular problem should be expressed as

(2.9)    T(k) = f(n)s + g(n)(1 - s)

where f and g are `scaling functions' representing factors which affect the T_s and T_p of equations (2.5) and (2.7). These factors are dependent on the `size' of the problem, n. Van-Catledge goes on to show that, for suitable choices of f and g, variations of the parameter n can cause the scaling properties of this model to correspond either to fixed-size speedup (f(n) constant) or to scaled speedup (f(n) = k). Unfortunately Van-Catledge is unable to offer justification for his choice of scaling function and does not discuss what the choice of scaling function means in terms of a scaling model. He argues solely on the basis of Gustafson's published results. Zhou offers a similar approach in which there is only one scaling function, affecting only the serial portion of the problem.

Sun and Ni [117] approach the unification of the scaling models by looking at the time taken to perform the work under some measure of the available parallelism of the problem; this is an extension of the work discussed here and of Eager, Zahorjan and Lazowska [36]. It is discussed in more detail in §2.4.4.

The two scaling models so far discussed seem to contradict each other. In one way their conceptual starting points could not be further apart. Amdahl could not foresee the need for more than a handful of processors as

"Overhead alone would place an upper limit on throughput of five to seven times the sequential processing rate, ..." [8]

This is now known to be overly pessimistic. Gustafson, on the other hand, states that when scaling the problem proportionally with the number of processors, the serial portion

"s can effectively decrease, removing the barrier to speedup as the number of processors is increased." [55]

The apparent contradiction between these authorities lies in their implicit axioms, and it is useful to uncover their implicit views by looking at their attitudes to several factors:

Serial portion Amdahl views this as immutable; it has always been there and will always be there. On the other hand, Gustafson's position is that it can always be reduced as the size of the problem grows; there is therefore no conceptual lower bound.

User's motivation What is the user's aim in running the problem on a parallel system? Amdahl: to reduce the time taken for a fixed problem; Gustafson: to solve the largest problem (in terms of memory requirements) that will fit on the given system.

User's effort How much effort is the user willing to put into the parallelisation of the code, both in terms of learning new techniques and in terms of system-specific optimisation? Amdahl does not make any real comment on this point except to say that

"A fairly obvious conclusion which can be drawn at this point is that the effort expended on achieving high parallel processing rates is wasted unless it is accompanied by achievements in sequential processing rates of very nearly the same magnitude." [8]

Gustafson sees the user's potential expertise and effort as almost unlimited. The number and complexity of the techniques employed in the production of the results contained in [55] is very impressive. Although he does not examine the overall effect of these measures, he states some of the individual effects. The time taken to load programs into the processors is mentioned, the spanning-tree version being 40 times faster than the sequential version. The time taken to communicate values from one processor to the next is 10.3 msec unoptimised and 3.5 msec optimised. The recoding of time-critical buffering routines in assembler produced code that was 11 times faster than the original FORTRAN.

Other factors There are several implicit axioms common to the approaches embodied by these models. Both assume that the parallelisable portion of the computation is continuously divisible, ie that it can be evenly distributed over k processors and that there is no limit on k imposed by this division. They also presume that the parallel portion remains constant for the execution of the algorithm, ie that there are no variations in the concurrency of the algorithm; this allows static load balancing to be close to optimal. Thirdly, they assume that there are no other factors that delay the execution, eg no synchronisation (in any of its forms). In short, the statement "there is a portion that is serial" is reasonable. It is interesting to note that both authors use the serial portion s as the important factor in their formulæ. However, when it comes to results and reasoning, they treat this serial portion not as a first-class citizen, but solely as the portion of the problem that did not parallelise.
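The numerical gulf between the two positions can be made concrete. Taking a serial fraction in the range Gustafson's group measured (s = 0.004 here, an illustrative value), the two formulæ give:

```python
def fixed_size_speedup(s, k):   # Amdahl/Ware, equation (2.6)
    return 1.0 / (s + (1.0 - s) / k)

def scaled_speedup(s, k):       # Gustafson, equation (2.8)
    return k - s * (k - 1)

s, k = 0.004, 1024
print(fixed_size_speedup(s, k))  # ~201: in the range Amdahl's law predicted for [55]
print(scaled_speedup(s, k))      # ~1020: of the order actually reported in [55]
```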

2.4.3 Fixed time speedup

The concept of fixed-time speedup is based on the observation that a specific period of time, usually the time taken for the current solution of a particular problem, provides the upper bound on the run time of a parallel solution of that problem. Within this time constraint, the aim is to solve the largest problem on the available processors. This is inherently a two-dimensional scaling model. Gustafson argues [54, p119] that fixed-time speedup is more useful as a measure of performance than either fixed-size speedup or scaled speedup. He bases this judgement on several specific points:

Historical precedence Gustafson argues that throughout the last 100 years, as computation speeds have increased, the problem has grown, keeping the time for the solution roughly constant. Thus problems previously regarded as taking too long have been rendered accessible by the increase in computational speed made available by developing technologies. This, as pointed out by Worlton [131, p1087], dismisses whole classes of time-intractable problems whose run-time needs to be reduced while keeping their computational complexity constant.

Does not favour slower processors It is well known that fixed-size and scaled speedup favour slower processors. Gustafson shows that for a canonical problem the fixed-time speedup measure is independent of processor speed. He goes on to say that for real problems this anomaly still persists, but to a lesser degree than with other scaling models. This apparently anomalous effect argues against all uses of speedup and is discussed further in §2.4.8.

Unifies efficiency measures Again, for the canonical problem the efficiency metric measures only the constant overhead and not the processing speed.

Creates a new type of super-linear speedup This effect can be observed when the algorithm has two (or more) distinct sections which scale (with relation to the problem size) at different rates. The speedup is dependent on the relative rates of work in both these sections and, given the right circumstances, super-linear speedup can be observed. This argument is similar to that of Parkinson [101]. It can be shown that breaches of the `flat memory approximation' (discussed below) can also manifest themselves as super-linear effects.

Predicts new limits for speedup Gustafson argues that, except for the canonical speedup case, fixed-time speedup predicts that communication and other parallel overheads will eventually dominate and thus limit the speedup.

His argument is that both scaled and fixed-size speedup make the implicit assumption that access times to information are constant as the problem grows. He terms this the `Flat Memory Approximation'. He argues that just as a planar approximation to the earth is acceptable except for very small (few inches) and large (few miles) structures, likewise the assumption that the access times to memory are constant is not true for small quantities of memory (due to caching) and large quantities (due to the use of virtual memory). The use of a fixed-time approximation constrains the memory access costs of an algorithm to lie in the same memory performance band (main memory, say) as the base sequential case.

The fixed-time speedup measure attempts to constrain the run time on the multiprocessor system to be as close as possible to the time taken for the sequential case, yet bounded by it. To be able to use this model of speedup requires that the user have several pieces of knowledge about the scaling properties of the problem. Ideally, one needs to be able to predict the run-time for a particular complexity of problem on a particular number of processors. This, in effect, is complete knowledge of the scaling properties of the problem under investigation. Gustafson accepts that the determination of the form of the scaling function (and hence its inverse) is complex, and suggests that it can be done by simple trial and error or by the numerical inversion of some `complexity' equation.

One difficulty with this approach is the determination of the complexity of a parallelised version of the problem of a given size. In many algorithms, the computational complexity grows worse than linearly in terms of the size of the problem. This is especially true of computer models of physical systems. Hence keeping the problem solution time bounded as the number of processors increases requires that the `size' of the problem on each individual processor decreases. This decrease in the memory requirements of the problem places two new bounds on the usability of this model of parallel computation.

1. By Gustafson's own arguments, the Flat Memory Approximation will be breached as the size of the problem decreases; this will manifest itself as super-linear speedup effects.

2. There is a minimum granularity at which the physical system is soluble, ie the data representing one particle or one grid point. Gustafson sees this as a new type of limit to the parallel solution of certain problems.

The size of the problem does not necessarily have a linear relationship with the accuracy or usefulness of the results produced. This factor, combined with the non-linear growth in complexity, implies that doubling the number of processors in a system which exhibits perfect fixed-time speedup does not guarantee that the solution is twice as `useful' to the end user.
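Using fixed-time speedup in practice means inverting a complexity function: given k processors and a time budget, find the largest problem that fits. The sketch below does this by bisection for an assumed cost model (a serial term linear in n plus an O(n^3) parallelisable term; both the model and the numbers are purely illustrative).

```python
def run_time(n, k, serial=1.0e-3):
    """Assumed cost model: a serial part linear in n plus an O(n**3)
    parallelisable part spread over k processors."""
    return serial * n + (n ** 3) / k

def fixed_time_problem_size(k, budget, hi=1.0e6):
    """Bisect for the largest problem size n with run_time(n, k) <= budget."""
    lo = 0.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if run_time(mid, k) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

budget = run_time(100.0, 1)                   # time for the base problem on one processor
print(fixed_time_problem_size(1024, budget))  # ~1008: a ~10x larger problem, not 1024x
```

The cube-root growth in problem size is exactly the non-linear complexity effect discussed above.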

2.4.4 Available concurrency measures

All of the above models make the assumption, either explicitly or implicitly, that there is sufficient concurrency in the problem under investigation to occupy all the available processors. The availability of this resource is another factor in the scaling of multiprocessor systems. While acknowledging this fundamental bound on speedup, Eager, Zahorjan and Lazowska [36] have combined the available concurrency of algorithms with that of efficiency to derive several interesting results.

One central measure in their analysis is that of the average available concurrency that a problem offers; this they call average parallelism. Eager et al assert that there are four equivalent definitions of this measure. However, all these definitions are couched in terms of observable behaviour and make implicit assumptions about the system which is being observed. Their first definition of this measure is "the average number of processors that are busy during the execution time of the software system in question, given an unbounded number of available processors" [36, p411]. Contained within this and their equivalent measures are the following assumptions:

1. There are no overheads that exist outside the concurrency unit. Overheads do not grow as the number of processors or the number of active concurrency units grows.

2. There is no latency in communication. The costs of task initiation and termination are fixed and, as such, can be considered as part of the cost of execution of the concurrency unit.

3. The scheduling of concurrency units on processors is `work preserving', ie no processor is left idle if there is work available anywhere for it. This makes very strong assumptions about the distribution of information within the system.

They are not alone in making these assumptions; the same ones are also made by Sun and Ni [117] and Sun and Gustafson [116]. All take it as read that allocating more processors to the solution of the problem is, at worst, harmless. Through an analysis of the total idleness in the system, based upon a precedence graph of the computation, they show that, given the above conditions, the speedup S satisfies:

(2.10)    S(k) \geq \frac{kA}{k + A - 1}

where k is the number of processors and A is the average available concurrency.

Their claim that this analysis holds for

"... any work-conserving scheduling discipline. No matter how poorly designed such a discipline may be, or how baroque a software structure is presented, the behaviour of the software system can be no worse than the stated bounds." [36, p412]

appears to be a panacea for the practitioners of parallel computing. This suggests that, as this is not achieved in practice, the effects of their implicit assumptions must dominate. As a dual to their lower bound analysis, they go on to derive an upper bound formula and show that the speedup estimate \hat{S} given by

(2.11)    \hat{S} = \frac{2 \min(k, A) \, \frac{kA}{k + A - 1}}{\min(k, A) + \frac{kA}{k + A - 1}}

has a relative error of less than 34 percent of the actual speedup.
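A quick numerical reading of these bounds (our sketch; the values of k and A are arbitrary, and equation (2.11) is used as reconstructed above):

```python
def lower_bound(k, a):
    """Equation (2.10): speedup guaranteed under a work-conserving schedule."""
    return k * a / (k + a - 1)

def estimate(k, a):
    """Equation (2.11): the harmonic mean of the two bounds."""
    upper = min(k, a)                 # speedup can exceed neither k nor A
    lower = lower_bound(k, a)
    return 2 * upper * lower / (upper + lower)

k, a = 64, 32                         # 64 processors, average parallelism of 32
print(lower_bound(k, a), min(k, a), estimate(k, a))  # ~21.6, 32, ~25.8
```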

2.4.5 Comparison of scaling models

Various attempts have been made to unify fixed-size, scaled and fixed-time speedup. All of the unification attempts are based upon the need to model the serial and parallelisable portions of the computation for particular configurations. A configuration consists of a number of processors and the amounts of the two types of computation, serial and parallel, required for the solution. The ways in which authors have compared the scaling models differ in the underlying mathematical model of the scaling, and in the extent to which the model is justified by well-known behaviour of algorithms. There are three distinct approaches:

1. Arbitrary functions to represent the amount of serial and/or parallel computation [126, 136]. The particular functions proposed have little justification.

2. Algebraic scaling functions. There are two sub-approaches:

   • The first borrows ideas from complexity analysis, examines the algorithm and produces a complexity function for the growth of computation. The coefficients of the growth terms represent the measure of the cost of that portion of the work. These coefficients can be ascertained either by detailed analysis or by experimentation and curve fitting.

   • The second is based entirely on curve fitting to performance data and using this data for extrapolation and interpolation.

3. Complexity weighting. This uses the ideas expressed in the average parallelism analysis, combined with a weighting of the execution of the threads, to produce scaling functions.

One unifying view of the scaling models presented here is as a two-dimensional surface in 3D space, after Worlton [131], the axes of this space being the number of processors, the total complexity of the problem and the total execution time. This is illustrated in figure 2.1. All the scaling models can be represented as particular paths on this surface. The models and their properties are listed in table 2.3.

[Figure 2.1: Scaling surface (after Worlton [131]). The diagram could not be recovered from the extracted text; it plots execution time T over the number of processors and the problem complexity, with complexity-driven (Amdahl), timeliness-driven and symmetric (m = k) scaling paths connecting T(P,N), T(P,k·N), T(m·P,N) and T(m·P,k·N).]

  Scaling model     System size  Problem size  Time         Speedup (S)            Ideal (I)  Efficiency (E = S/I)
  Original problem  P            N             T(P,N)       1.0                    1.0        1.0
  Problem scaling   P            k·N           T(P,k·N)     k·T(P,N)/T(P,k·N)      k          S/k
  Fixed size        m·P          N             T(m·P,N)     T(P,N)/T(m·P,N)        m          S/m
  Scaled size       m·P          k·N           T(m·P,k·N)   k·T(P,N)/T(m·P,k·N)    m·k        S/(m·k)

  Table 2.3: Properties of scaling models (derived from Worlton [131])

fixed processors, scaled problem This scaling model is not really discussed in the literature. It corresponds to increasing the problem size while keeping the number of processors fixed. Although this may be of limited interest in general, it can be used to illustrate one of the causes of super-linear speedup. Keeping the number of processors constant, a doubling of the problem size (in terms of its computational effort) would be expected to lead to a doubling of the execution time. However, such effects as vector start-up costs [101], or other costs that decrease as the granularity of execution increases, mean that the execution time may not double, giving a larger increase in speedup than expected.
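The vector start-up effect can be made precise with a two-parameter timing model of the kind Hockney popularised, t(n) = t_0 + n/r_\infty, with start-up cost t_0 and asymptotic rate r_\infty (our choice of illustration, not an analysis from [101]):

    \frac{t(2n)}{t(n)} = \frac{t_0 + 2n/r_\infty}{t_0 + n/r_\infty} < 2 \quad \text{whenever } t_0 > 0

so doubling the work less than doubles the time, and the measured speedup grows faster than expected.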

increasing processors, fixed problem This is basically Amdahl's law. It represents a slice through the scaling surface parallel to the complexity axis.

increasing processors, increasing problem When the problem is scaled at the same rate as the number of processors, this represents Gustafson's scaled speedup. Fixed-time speedup corresponds to a contour-following path on this surface.

All of these scaling surface models see execution time as the fundamental metric for performance. There are two issues in this approach to performance modelling: the first is the change in the complexity as the problem is distributed over more processors (whether the problem complexity grows or not); the second is the way in which the execution of that complexity changes with the number of processors. Most models do not make such an explicit distinction between these two components and deal with them jointly.

There have been several attempts at producing characterisations for the scaling surface just described. They take various starting points, ranging from modifications to Amdahl's law to complexity-analysis-based scaling functions. Typical of the approach of adding some simple factors to the original formulations of Amdahl and Gustafson is the work of Sherson and Corbett [109]. They simply introduce a factor for the performance component of interest (in their case the overhead of non-overlapped communications) and manipulate the resulting modifications to the original formulae. Zhou [136] takes a similar approach.

A more general approach to the inclusion of non-linear influences on scaling has been expounded by Flatt and Kennedy [39] and Muller-Wichards [92]. They look at the scaling properties as an arbitrary function. They then list the properties of such a function, eg monotonicity and differentiability, and derive a performance measure using these functions. The choice of function is still arbitrary but may allow for such factors as synchronisation. Muller-Wichards showed how Flatt and Kennedy's original criteria for the scaling functions could be weakened, and extended their results. A conceptual difficulty with these approaches is the justification for the scaling functions that are chosen; their choice is based mainly on intuition or curve fitting. They do not make any use of known properties of the parallel system or algorithm apart from the number of processors.

The use of such algorithmic properties distinguishes the third category of scaling model. This is the approach taken by Li, Sun, Gustafson and Worley [54, 116, 117, 129]. Gustafson [54] approaches this by expressing both the time and space complexity of a particular problem, after some complexity analysis, in terms of a polynomial. The coefficients of this polynomial are determined by experimentation. Worley [129] takes a similar approach but derives the scaling functions from an analysis of the problem; he illustrates this with several near-neighbour problems. He also includes the effects of different distributions of the total complexity amongst a particular number of processors. The effects of applying this approach in an algorithmically independent fashion have been expounded by Sun, Li and Gustafson [116, 117]. They take an `instruction counting' approach to represent the computational complexity and represent the processing capacity in a similar fashion. Through this they compare the available scaling models.

One other practical approach of interest which does not seem to have been developed is that of Calzorossa et al [25], which adopts a phase-diagram-like representation for the effects of different portions of a problem on a vector processor; this representation technique captures the inherent complexity and the potential trade-offs in a readily digestible form.

2.4.6 Performance metrics

A complementary approach to the scaling models in the examination of the properties of parallel computer systems is that of finding measures of factors that change under scaling.

2.4.6.1 Serial fraction

The approach taken by Karp and Flatt [70] is to examine the rôle of the serial fraction in performance measurement. They acknowledge the complexity of performance and do not attempt to use their scaling model in any predictive way. Their viewpoint is that it is possible to ascertain by experiment that fraction of the computation that did not scale. Their conclusion is that this measure should be monotonically increasing, and that the best that can be hoped for is a linear increase. The use of the serial fraction then allows for the re-writing of both Amdahl's and Gustafson's laws. Amdahl's law becomes

(2.12)    T(p) = T(1)f + \frac{T(1)(1 - f)}{p}

or, in terms of the speedup s,

(2.13)    \frac{1}{s} = f + \frac{1 - f}{p}

This can then be used to find the serial fraction, namely

(2.14)    f = \frac{1/s - 1/p}{1 - 1/p}

allowing for the experimental determination of this fraction. They see the use of this fraction as a diagnostic tool: irregular changes in this measure may point to ineffective load balancing, excessive synchronisation costs or non-optimal use of the vector units in processors. They have a similar analysis of Gustafson's law and go on to analyse the data from the Gustafson, Montry and Benner paper [55]. It is interesting that their analysis shows the greatest variability in the Beam Strain problem, this being the only problem with global communications.
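Used as a diagnostic, equation (2.14) turns measured speedups into serial fractions; the measurements below are invented purely for illustration.

```python
def serial_fraction(speedup, p):
    """Karp and Flatt's experimentally determined serial fraction, equation (2.14)."""
    return (1.0 / speedup - 1.0 / p) / (1.0 - 1.0 / p)

# Invented measurements: (processors, measured speedup).
for p, s in [(2, 1.95), (4, 3.7), (8, 6.6), (16, 10.5)]:
    print(p, round(serial_fraction(s, p), 4))
# A smooth, slowly rising f suggests genuinely serial work; jumps or sharp
# rises point to load imbalance or synchronisation costs, as noted above.
```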

2.4.6.2 Incremental efficiency

A similar approach is to look at the change in the efficiency of the implementation as the number of processors increases. This metric is proposed by Worlton [131].

[Figure 2.2: Measures of usefulness of parallel systems. The diagram could not be recovered from the extracted text; it shows a triangle of competing measures whose vertices are Accuracy, Cost and Timeliness.]

The incremental efficiency metric looks at the ratio of successive efficiencies for different numbers of processors, a quantity that tends to unity in the limit. Worlton goes on to propose this as a predictive model and shows its correspondence to some of Gustafson's results [55].
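A minimal reading of Worlton's metric, with invented run times (our sketch):

```python
def efficiency(t1, tk, k):
    """E(k) = speedup / k for a problem taking t1 on one processor."""
    return (t1 / tk) / k

t1 = 100.0
times = {64: 2.0, 128: 1.1, 256: 0.65}     # invented run times on k processors
ks = sorted(times)
for k_prev, k_next in zip(ks, ks[1:]):
    ratio = efficiency(t1, times[k_next], k_next) / efficiency(t1, times[k_prev], k_prev)
    print(k_prev, "->", k_next, round(ratio, 3))
# Ratios below one show efficiency draining away as processors are added;
# ratios approaching one indicate that the scaling has settled.
```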

2.4.7 Cost and performance

All measures are an attempt to quantify the qualitative sense of `usefulness' to the end user. One possible way of looking at this is encapsulated in figure 2.2. Use of the measures of usefulness shown above presents some unique difficulties, as the end-user usually has one criterion which overrides all the others. This difficulty is at its most acute with the measurement of `cost'. The use of monetary measures is not appropriate, as these are at the whim of marketing strategy, currency conversions etc. Even the use of relative financial measures is difficult, as technological changes can radically alter these ratios.

Perhaps a better measure is that of opportunity cost: in making a certain decision (like a particular approach in the silicon), the use of silicon for that purpose means that some other feature may not be possible. This cost relationship is part of the reasoning behind Dally's work [32] on k-ary n-cube interconnection networks. The growth of the requirement for physical space for routing as systems grow is one of the factors that has spurred interest in the use of opto-electronics for the building of interconnection networks; there the whole volume can be used for the interconnect, not just the plane. Such opportunity cost arguments have also been used to justify pursuing the development of heterogeneous multi-computer systems over homogeneous ones [10].

2.4.8 Solution rates

The majority of authors acknowledge that all the above performance measurements are not a great deal of use in comparing machine with machine or implementation with implementation. The fixed time speedup was intended to address some of these problems. A major problem with `scaleup' is the contingent nature of the measurements taken and, more seriously perhaps, the failure of some observers to recognise that contingency. A set of impressive relative speedup results surely begs the question, "relative to what?". The use of speedup hides the fact that those systems with the highest speedup do not, necessarily, run in the minimum time. One of the end-user's pressing needs is the knowledge of the rate at which solutions are generated by any particular system. This is the approach that has been taken by Hockney [66]. His objections [66] are:

1. Speedup is performance arbitrarily scaled for one processor.

2. Speedup is performance measured in arbitrary units that will differ from algorithm to algorithm if the one-processor time changes. This is inherent in the nature of speedup, it being a dimensionless ratio.

3. Speedup cannot be used to compare the relative performance of two algorithms, unless the single processor time is identical for both.

4. The program with the worst speedup may execute in the least time, and therefore be the best algorithm.

5. The use of speedup as a measure throws away all knowledge of the absolute performance of the algorithm; the number generated is dimensionless.

This last point is also made by Worlton [131]. He goes further to say that it is a sign of maturity of understanding of a subject when the appropriate measures (in terms of their dimensions) can be found. Hockney [66] goes on to argue that even seemingly absolute measurements of performance of algorithms on computer systems are not what they appear. The

MFlop/s ratings of a particular solution are dependent on the analytical approach taken and rely on the assumption that implementors will make decisions to recalculate rather than to distribute values if this approach is more advantageous for the particular system. He goes on to say that the MFlop/s ratings quoted for benchmarks such as Linpack are entirely benchmark related. The difference between the benchmark MFlop/s and the raw hardware MFlop/s that a manufacturer may quote is purely a measure of the efficiency with which that benchmark maps onto the particular hardware platform. He argues that his objections would disappear if the performance were interpreted in terms of the number of solutions per second, thus allowing the comparison between different hardware platforms and algorithms. This still leaves the issue "what is a solution?", but most scalable algorithms have intermediate points of the computation that may suffice as the basis of such a measurement. As many other performance measures can be derived from a knowledge of the solution rate of the system, this measure appears to give a better way of discussing the effects of scaling in multi-processor systems, and as such will be used as one of the output measures of the models proposed in this thesis. This approach opens up a whole area of study that could examine the way in which the amount of computation relates to the rate at which the solution is generated.
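Hockney's fourth objection is easily made concrete. A small sketch, with invented timings for two hypothetical algorithms solving the same problem, shows how ranking by relative speedup and ranking by solution rate can disagree:

```python
# Invented single-processor and N-processor times (seconds per solution)
# for two hypothetical algorithms solving the same problem.
algorithms = {
    "A": {"t1": 100.0, "tN": 10.0},  # speedup 10, but slow in absolute terms
    "B": {"t1": 20.0,  "tN": 5.0},   # speedup only 4, yet the faster program
}

for name, t in algorithms.items():
    speedup = t["t1"] / t["tN"]   # dimensionless relative speedup
    rate = 1.0 / t["tN"]          # solutions per second: a dimensioned measure
    print(f"{name}: speedup = {speedup:4.1f}, rate = {rate:.2f} solutions/s")

# A: speedup 10.0 but rate 0.10/s; B: speedup 4.0 but rate 0.20/s.
# Comparing solution rates ranks the two correctly; speedup does not.
```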

2.5 Predictability

Although the use of scaleup models allows for the post-event analysis of speedup, the results do not, of themselves, allow for the prediction of the achieved speedup on a given number of processors. To predict the performance it is necessary to have a predictive model both of the complexity and of the way in which the computation of that complexity is realised, ie the Tpar, Tsc and Ts factors in equation (2.3). Some attempts have been made to take scaleup models and turn them into predictive speedup models. The approaches fall into two specific categories: firstly, those that run the problem on a certain number of processors and fit the observed results to some, relatively arbitrary, curve; and, secondly, those that look closely at the algorithm (on a particular system), examining operation counts and communication costs as the system is scaled.

2.5.1 Algorithmic analysis approaches

The complete complexity analysis of all stages of an algorithm, although technically possible, does not seem to have been attempted. This is not surprising as the vast majority of the computational cost of most algorithms is found in some small computational kernel (this is also the basis of several benchmarks). The approach that has been taken is to analyse both the temporal behaviour and the computation costs of such a kernel and find some model of the execution time; this model includes the computation cost and the communication cost. All the published analyses taking this approach are in two areas: one, matrix operations (typically multiplication) and, two, finite difference approximations to partial differential equations.

Stewart [112] analyses the effects of choice between two communications disciplines and two multiprocess topologies on the scaling of such matrix operations. He admits that the O(n³) nature of such operations cannot be defeated by the linear growth in processor numbers as the problem sizes increase. All the configuration options make simple scaling changes to the solution rates. One assumption that he makes is that the ratio of arithmetic rates to communication rates for the processors remains constant, and therefore it is only the problem size and the number of processors which vary. He presents his results in the form of fixed time speedup contours.

Worley [130] performs a detailed analysis of the scaling properties of the solution of linear Partial Differential Equations. His general framework is that of a set of identical cooperating sequential portions of code that consume input, execute and then generate output. The analysis is based on the properties of the sequential code fragment from both the point of view of the computation cost that the fragment contains and of the communication costs and requirements. He derives a series of lower bounds on this serial portion. Hence, by looking at their data requirements it is possible to derive a parallel cost for the algorithm. Among the interesting conclusions from this analysis is that increasingly large instances of a problem cannot be solved in a fixed amount of time no matter how many processors are available.

2.5.2 Approximate modelling and measures

2.5.2.1 Gustafson's fixed time speedup

In [54] Gustafson uses the concept of the algebraic property model (modelling memory and communications requirements of the computational kernel) to compare the behaviour of the various scaling models. He also says that this property model approach can be used for performance prediction, the coefficients of the property model being determined by a series of experimental measurements.
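A minimal sketch of this kind of coefficient fitting (the polynomial form and the timing data below are assumptions for illustration, not Gustafson's data): measured kernel times are fitted to a postulated complexity polynomial by least squares, and the fitted model is then used for prediction.

```python
import numpy as np

# Assumed: measured kernel execution times (s) for several problem sizes n.
n = np.array([100, 200, 400, 800, 1600], dtype=float)
t = np.array([0.021, 0.080, 0.330, 1.290, 5.200])  # invented timings

# Postulate a time complexity of the form t(n) = a*n^2 + b*n + c and
# determine the coefficients by least squares, as one might from experiment.
A = np.vstack([n**2, n, np.ones_like(n)]).T
(a, b, c), *_ = np.linalg.lstsq(A, t, rcond=None)
print(f"t(n) ~ {a:.3e}*n^2 + {b:.3e}*n + {c:.3e}")

# The fitted polynomial can then be evaluated to predict the time for an
# unmeasured problem size, eg n = 3200.
print("predicted t(3200) ~", a * 3200**2 + b * 3200 + c)
```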

2.5.2.2 Eager's analysis based on average available concurrency

Eager [36] approaches the problem of prediction through the use of average parallelism measures. These measures are based on the potential concurrency that the problem offers. From this measure he derives bounds on the speedup that a problem with known concurrency may have. He arrives at an estimation formula for speedup that, given this knowledge of average parallelism, is within 34% relative error of the fixed time speedup achievable. However, his model assumes a work conserving scheduler and takes no account of other overheads.

2.6 Structurally-based performance models

There is one entirely distinct approach to performance modelling of parallel computer systems; this is based on direct behaviour modelling and simulation of such models. Such approaches tend to be used to analyse particular options or configurations and are not generally used to study scalability issues. These models can capture such factors as data value precedence [37, 45] or the effects of adding hardware to a given number of processors [4]. Their use in the study of scalability is limited as the models are only amenable to numerical solutions; however, given that a processor and software architecture has been chosen, they provide a suitable means of evaluating the implementation choices.

2.7 Summary

In this chapter we have examined and contrasted the available scaling models and attempted to classify them on the basis of the mapping between their conceptual basis and the physical realities of the system that they attempt to model. We have identified five factors: work preserving scheduling, granularity overheads, non-computational delay, algorithmic concurrency and resource finiteness. We have used these factors, combined with the output measure, to form a basic taxonomy of existing performance and scalability models. This taxonomy alights on entirely different factors from the ones explicitly discussed in the referenced articles. This difference is especially noticeable with regard to the work preserving scheduling paradigm; this assumption is present in all the models found, usually implicitly. This six-fold taxonomy is summarised in table 2.1 and will provide a means of differentiating the models proposed in this thesis from existing work. However, this taxonomy is too bare for a detailed comparison of the individual models. For those discussions we need to use a framework closer to the original papers by broadly classifying the models into microscopic and macroscopic.

We also note that, with the exception of data dependency, the machinery of execution is seen to dominate over the actual programs. This hardware-centric viewpoint has meant that the more unified view of such machines as co-operating systems of behaviour has not received any attention. The three themes of work preserving scheduling, non-computational delay and behaviour form the original motivation for the performance and scalability models discussed in the subsequent chapters of this thesis.

Chapter 3

Behaviour Based Models of Performance

All the models of scaling and performance that have been discussed in the previous chapter look only at the average composition of the components of the overall system, ie the fraction of the total computation that was serial or spent in communication. An alternative starting point is to view the system under study as being made up of components that interact and can evolve as the computation progresses. In this chapter we will discuss the behaviour based approaches from the following points of view.

Stochastic Evolution  This allows for the modelling of classes of problems by abstracting away from actual execution times of program segments. This can be done by representing execution times as a suitably chosen random variable.

Resource Contention  Many models assume an idealised configuration in which there is no upper bound on the number of processors or no limit on the communications bandwidth available. Unfortunately such limitations are present in real systems. Given these restrictions the goal becomes to make the `best' possible use of the available resources.

Process versus State-Space based representation  The complexity of the construction of the model is an important issue. Ideally the solution method should allow for the simple composition of entities that correspond closely to natural units of design, eg parallel process composition in process algebras. However the solution methods for stochastically based systems are based on an analysis of the transitions within a state-space. Naive composition of such state-spaces generates state-spaces which are the Cartesian product of the state-spaces of the constituent processes, with consequential exponential growth. The challenge here is to achieve a solution methodology that can hide, as much as possible, the state-space representation behind the scenes. Such an approach would contain the apparent complexity of the solution method, at least in the perception of the model user.

Algebraic Solution  Ideally, in conjunction with all the above criteria, the model outputs should be expressible in an algebraic form. However, to be contained within this form, some modelling approximations may have to be made. Such a model could then inform design decisions and allow recourse to numerical and simulation techniques for more sophisticated analysis.

There have been several performance models based on examining the evolution of behaviour of parallel systems. They have taken different models of parallel algorithm operation which represent various compromises between capturing the operation of the algorithm and the feasibility of its realisation in hardware. The PRAM model is such a model; its capacity to model performance is highly dependent on its assumption of uniform access to all memory at instruction level granularity. Such a system represents the most favourable environment for the execution of any parallel algorithm. The PRAM's limitation as a modelling tool comes from the practical difficulties of physically realising the theoretical ideal. The PRAM model has been used to derive complexity bounds for algorithms and does not usually consider any resource saturation issues. As it does not admit of simple implementation in hardware, it is of limited use as a performance model for parallel systems, but provides for extensive analysis of algorithms.

Instruction level approaches have also been used to formulate a unifying model for fixed-size, scaled and fixed time speedup [54, 116]. Here the modelling assumption is that the computation consists of small granularity computations (usually instructions) offered as a workload that the processor must perform. The basis of the modelling assumes full knowledge of the execution and does not allow for any resource contention. The use of the model in practical performance prediction requires such intimate knowledge of the actual execution on the hardware as to be of limited use as a general predictive tool.

Moving away from the instruction level view of the execution is the Serial Parallel Task Graph (SPTG). Here the points of interest are when a task (a small unit of computation in this model) spawns its successor(s), and the measure of interest is the time taken for a task to complete. Gelenbe [43, Chap. 5] has used the SPTG model with stochastic variables to capture the evolution of behaviour within parallel algorithms. These approaches have assumptions in common with the models discussed in table 2.1. They all assume work preserving scheduling and sufficient resource(s) for there to be no contention. In addition, as these models look at the complete execution of an algorithm, they insist on acyclic behaviour to assure termination.

Valiant has proposed a computation model based on a variation of the PRAM [124, 125] that addresses some of these issues. His bulk-synchronous parallel computer is directly aimed at an effective implementation of a PRAM on distributed memory machines. His main result is the use of universal random routing as an approach to resolving communication hot-spots when exchanging variables. Such a technique assures fair service from the underlying infrastructure, given that bandwidth grows sufficiently quickly and that the algorithms distribute their messages sufficiently diffusely and have sufficient excess concurrency within them to allow for latency hiding. Gelenbe has also included a simple model of communication overheads within the SPTG framework. This model assigns a fixed temporal cost on the basis of the topology, again excluding any saturation of the communications resource. These models assume that the actual algorithm has been written on the parallel machine and that its execution can be reproduced with sufficient accuracy. To get to this level of detail it is implicit that many design decisions have already been made, for example the decomposition of the algorithm and the allocation of processes to processors.

A different viewpoint is taken by process algebras. They characterise processes by their observable behaviour, abstracting away from the actual execution mechanisms [65, 88]. Combining the observed events with a model of the passage of time allows for the capturing of some performance characteristics. Time has been included in process algebras in several ways, but the ones of most interest to us are those that admit a stochastic reading. This can either be by explicitly modelling a special form of rendezvous which lasts a random interval (Hillston [63, 64]) or by allowing the construction of processes with stochastic behaviour (Tofts [121, 122]).

Although such models allow for the incorporation of stochastic evolution of behaviour, they still implicitly assume unbounded computational resources. It is possible to encode processor contention explicitly within a process algebra model of a particular system. However, to use such techniques for general performance modelling would require the separate identification of transitions dependent on the particular state of those portions of the algebra that represent the contended-for resources.

One modelling technique that supports the majority of the aims that we set out is Petri Nets. In their basic form they support resource contention modelling and can be used to encode aspects of observational behaviour. Their stochastic variants, Stochastic Petri Nets and Generalised Stochastic Petri Nets (GSPN), allow for the inclusion of probabilistic models of system behaviour. Their descriptive methodology is akin to that of process algebras, and one solution methodology is based on the enumeration of all the reachable states followed by numerical solution. Even given these limitations, they represent the most suitable starting point for describing the behavioural basis of our performance and scalability model.

Petri nets are one example of a graphical representation of algorithmic behaviour; such representations have often been used as a modelling technique in computing, eg finite state automata. Such abstract models of computation view the act of processing as travelling from one state to another within some graph-like structure. The graph encapsulates the complete set of possible journeys. Such systems provide a simple model of the operational behaviour of processes. When combined with weights either on the arcs within the graph or associated with the nodes in the graph, such models can also be interpreted as a simple performance model of sequential algorithms. The weights are used to represent the computational effort required to traverse that arc. However, such models are not sufficiently rich to capture the value passing/synchronisation requirements that exist in concurrent systems. Petri nets can be used to capture such qualitative behaviour as synchronisation requirements as well as general computational behaviour and, as such, have been used as a basis for the operational semantics of concurrent specification tools such as CSP and CCS [119].

There are different variations on basic Petri nets, each having several ways of being formally represented [103, 105]. As the only use of Petri nets made here is as a descriptive tool, the differences in formal representation do not concern us. We will only be using Marked Petri Nets (MPN) [105, part 2] and Generalised Stochastic Petri Nets (GSPN) [3, 4, chap. 4]. A brief description of each of these follows.

Marked Petri Nets  These are bipartite graphs consisting of places and transitions connected by arcs. Arcs travel only from places to transitions and from transitions to places. A MPN contains a number of tokens, each of which is said to be at a particular place. A transition can fire when all its input arcs have tokens present in the places that feed them; the act of firing a transition consumes all these input tokens and generates a token for each of the output arcs. The tokens then appear at the places to which these arcs are connected. If more than one transition can be fired for a particular distribution of marks amongst the places (this distribution being known as a marking), the choice as to which transition fires is nondeterministic. If a place has two output arcs to two different transitions then the act of firing one of those transitions removes the token; thus the other transition is unable to fire until another token arrives. Such Petri nets have been used to examine the behavioural properties of parallel computer systems and such issues as deadlock [105, chap. 7].

Generalised Stochastic Petri Nets  These are an extension of the Marked Petri Nets above. In GSPNs the transitions can be of two types, immediate and timed. The immediate transitions have a probability associated with them, so that when two (or more) transitions are enabled by a particular marking, the nondeterministic choice is performed with regard to the relative probabilities of the enabled transitions. A timed transition has associated with it a firing rate (which may be dependent on the marking of the net), the actual time to fire being taken as an exponentially distributed random variable. Such GSPNs can capture synchronisation as well as modelling the time taken for computation, and other performance items such as delay. They can be shown to be equivalent to Continuous Time Markov Chain systems [3].

In using behaviour to model performance one may be tempted to take the view that each operation has a known cost (in terms of time) and that a system which captures the time for each operation and allows for the sequential and parallel composition of these times, combined with a means of representing synchronisation, would allow for the production of all the pertinent performance data (this would lead to something akin to Timed CSP [33]). Such a fine-grained approach may provide a good conceptual model. However, the complexity of the run-time behaviour, due to the data dependent choices that are made in the flow of execution of an algorithm, means that the direct use of such a model as a performance tool is limited.

The aim of the performance models presented in this chapter is to allow for the extraction of macroscopic properties from the known microscopic structure. This aim allows, in many cases, for the reduction of the inherent complexity, by viewing the specific choices that are made within the execution of the algorithm probabilistically. To take such a stochastic view of a particular choice-point in the execution implies that there are many occasions when the execution comes to that point and the choice has to be made. This implies some cyclical properties in the behaviour of the system. In terms of a Petri net model, this cyclical property means that the same place in the graph is returned to repeatedly. The particular path that is taken through the net from that place corresponds to a particular computation. Many factors determine the choice made at a particular time; these choices are made by reference to the internal state of the process and by the values communicated to it. From the performance point of view we are not interested in the particular path that was taken, but in the time taken for traversal of that path. Given a sufficient number of such cycles, the modelling of the time taken for each choice as a sample from some probabilistic distribution is not unreasonable.

3.1 Cycles of behaviour

A central feature of our behaviourally based model is cycles of behaviour. Observing some of the events which form these cycles, combined with a stochastic model of the time intervals between observations, provides the input to our modelling process. This section introduces our model through building a simple concrete example, showing how cyclical behaviour can be captured and related to a random walk around a state space. This simple example is developed into our performance and scalability model in later chapters.


[Figure 3.1: Representation of the simplest cycle of behaviour, with places p1, p2 and transitions t1, t2]

The simplest cycle of behaviour that involves interaction is illustrated in figure 3.1. In this simple model the processor performs `useful' work while the token is at the place p1, and then interacts with its environment, signified by the token being at the place p2. The maximisation of the computational performance corresponds to the minimisation of the fraction of the time taken for that interaction (ie maximising the time that the token is at p1, or minimising the time that the token is at p2). This simple cyclical model of execution can be seen at the kernel of many existing algorithms.
• In N-point approximations to differential equations on two dimensional arrays of message passing computers, the place p1 can be seen as corresponding to the computational section and the place p2 as the communication of values to neighbours and receipt of their new values.

• In processor farms, where tasks are distributed to processors from some central pool, p1 corresponds to the processor performing work and p2 the processor awaiting the allocation of the next task.

• In problems with global communication, p2 corresponds to the synchronisation where a global update of common values is performed.

• In models of computation within a distributed virtual memory system, the place p1 can be seen as the processor performing computation, the transition t1 as a request to the virtual memory service, and the transition t2 as the response to that request.

The performance of all these algorithms is dependent on the relative time that the token spends in the places p1 and p2. The use of this model to predict the performance on a particular number of processors raises a difficulty: how do the processors interfere with each other, given that p2 is contended for in some way? Throughout the rest of this thesis we will attempt to answer this question using models which correspond to variations on this basic structure. In such models, p1 will correspond to the processor performing `useful' computation and other places in the net will correspond to such factors as queueing for and receiving service, experiencing overhead and delay.

In the discussion presented so far, the marking of the tokens has represented the activity of a single processor. We are interested in systems with many processors; as with most modelling tools there are alternative ways of approaching the representation of the problem, all of which are equivalent. The nature of this equivalence is best illustrated by a simple example. In doing this we also wish to illustrate the correspondence between the physical structural view and the temporal view that Petri nets embody.

[Figure 3.2: Block diagram of system with two processors and service facility]

Figure 3.2 is an outline representation of the physical structure of a simple system consisting of two processors that compute and make requests of some common service. Each processor computes, makes a request, waits for the response and then continues processing. We assume that the processors and their tasks are homogeneous. Figure 3.3 is a representation of this behaviour.

[Figure 3.3: Example Petri net representing two processors and a server, with places run1, run2, serv1, serv2 and idles, and transitions req1, req2, resp1 and resp2]

When the mark is in the place run1
the first processor is performing useful work. The transition req1 represents this processor making a request. This request can be processed when the server is in the place idles, whereupon the transition req1 can fire, marking the place serv1. If the second processor attempts to issue a request, its transition is unable to fire as there is no token on the arc from idles. When the service has been completed, the token fires the resp1 transition, generating two tokens: one representing the first processor restarting computation, and the other representing the server becoming available. The Petri net representation in this figure is consistent with the intuitive behaviour of the system outlined in figure 3.2. The rate at which this system produces solutions is directly proportional to the time that the tokens are present in the respective `run' places.

The representation of figure 3.3 is not unique, and figure 3.4 represents an alternative rendering. In this representation there are two tokens, each representing a processor. A running processor which makes a request will fire the request transition, preventing the remaining processor from successfully issuing a similar request. It can be seen that this model is equivalent to the previous model in figure 3.3, where processors have become anonymous; it makes the implicit assumption
that the processors are identical and as such this anonymity is a reasonable supposition.

[Figure 3.4: Reduced Petri net representation of the example system, with places running, server idle and serving, and transitions request and response]

Although the behaviour of this two processor system is reasonably captured by this model, it is not sufficiently rich to be used for performance measurement. One obvious factor that has not yet been captured is the time spent queueing for service if the service facility is already busy serving another processor. This can be accomplished by incorporating another place into the net in which tokens can wait (queue) until the service facility is available. This system is illustrated in figure 3.5.

[Figure 3.5: Two processor representation with queueing, adding a queueing place and a receive-service transition]

An observer noting the marking and how it changes with time would be able to extract many performance parameters from this net. For example the rate of computation of this system is dependent on the product of the number of tokens that mark the place running and the times spent with that particular marking. Similar expressions can easily be found for fractional server utilisation and the average number present in the queue.

Having captured the behaviour and arrived at a means of assessing the performance of this simple system, it is now necessary to include some idea of periods of
time and the frequency of transition into the model in order to be able to produce some quantitative analysis. This involves the introduction of timed transitions into the representation and the introduction of a probabilistic distribution from which these times are chosen. It may appear at first that the easiest timing distribution to reason with is when the timings are deterministic; however this is not the case. The way in which this timing information is introduced is important. Taking the representation in figure 3.5, where each token represents a processor, and given that there is more than one token in the place running, when will the next request be made? In a world where the execution times are deterministic there is the need to differentiate between tokens. Hence the timings would need to be related to the places and the periods that the tokens dwell at them, as opposed to being associated with the transitions. The inherent complexity of associating times with tokens, and thus having to distinguish between tokens, combined with the necessity to model the distribution of transition times, lends itself towards viewing the transition timings as exponentially distributed random variables. The memoryless property of the exponential distribution allows for easy handling of marking dependent transition probabilities in a token-independent fashion, and for reasoning where the choice of transitions for a particular marking is not unique. This is the basic extension of Marked Petri Nets that is offered by Generalised Stochastic Petri Nets.

The inclusion of such timed transitions into our example is illustrated in figure 3.6. In this figure the transitions t1 and t3 have become timed transitions, and t2 has become an immediate transition. The rate at which t1 fires is dependent on the number of tokens that mark p1 (that is m1) and the rate at which an individual token fires (that is λ). The total rate at which requests are made (ie tokens transit from p1 to p2) is thus m1λ. Requests are serviced (ie tokens go from p3 to p1) at the rate μ. A transition from p2 to p3 is made whenever there is an outstanding request (ie m2 > 0) and the service facility is available (ie a token is present at s1). It can be seen that this diagram could just as well be used to represent the behaviour of an arbitrary number of processors, each processor being represented by a corresponding token.

[Figure 3.6: Two processor representation with queueing and timed transitions: timed transitions t1 (rate m1λ) and t3 (rate μ), immediate transition t2, places p1, p2, p3 and server place s1]

As has already been asserted, such GSPNs can be transformed into a finite continuous-time discrete-state Markov Chain system; however an automatic transformation
is unlikely to end up with an `optimal' Markov chain representation.

[Figure 3.7: Alternative view of figure 3.6 as a Markov chain, with states 0, 1 and 2]

Markov chains (MC) view systems as probabilistically moving between different states. The attribute that characterises the state from the performance point of view is the number of tokens that are present at p1 or, conversely, the number of tokens that are not present at p1 but present at either p2 or p3, ie queueing or receiving service. This latter number is more typically used in MC analyses, hence the use of the values 0, 1 and 2 in the MC representation of our example system in figure 3.7. In this view of the system the server is busy whenever the system is not in the state S0. The state variable captures the number of processors queueing for or receiving service. The rate at which the system is processing is dependent on the state that the system is in. As before, the rate at which requests are generated is dependent on the number of running processors and hence the state. The next section outlines the basic results from queueing theory needed for the subsequent performance analysis.
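Before developing the theory, the behaviour just described can be checked by brute force. The sketch below (rates λ and μ chosen arbitrarily) simulates the two-processor chain of figure 3.7 directly, exploiting the memoryless property, and estimates the long-run fraction of time spent in each state; the estimates can be compared against the closed forms (3.12)-(3.14) derived below.

```python
import random

# Each running processor requests service at rate lam; the single server
# completes requests at rate mu; all times are exponentially distributed.
lam, mu = 1.0, 4.0          # arbitrary illustrative rates
random.seed(1)

state = 0                   # processors queueing for or receiving service
time_in_state = [0.0, 0.0, 0.0]
t, t_end = 0.0, 100_000.0

while t < t_end:
    running = 2 - state                  # tokens in the `running' place
    rate_req = running * lam             # total request rate in this state
    rate_srv = mu if state > 0 else 0.0  # service rate if the server is busy
    total = rate_req + rate_srv
    dwell = random.expovariate(total)    # memoryless sojourn in this state
    time_in_state[state] += dwell
    t += dwell
    # Choose which transition fires, in proportion to its rate.
    state += 1 if random.random() < rate_req / total else -1

# For lam = 1, mu = 4 the closed forms give roughly (0.615, 0.308, 0.077).
print("estimated steady state:", [round(x / t, 4) for x in time_in_state])
```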
3.2 Outline of basic queueing theory

Queueing systems represent an example of a much broader class of interesting dynamic systems, `systems of flow'. Such systems are ones in which some commodity flows, or moves, through finite capacity channels from one point in the system to another. This broad outline encompasses such things as traffic flow in a road system, the demand of customers upon a telephone exchange, and the flow of water in and out of a reservoir. The main interest in this thesis is in the analysis of the cyclical flow of the computation between states, each state representing a stage in the computation of some algorithm. However, it is useful to look at some of the underlying concepts that exist in queueing theory.

There are several component concepts that are built upon to make queueing theory. These are:
• Random variables

• Probability distributions

• Stochastic processes

• Markov chains

3.2.1 Random variables and their probability distributions

When modelling using stochastic techniques, the factors of interest within the structural elements of a system are viewed as random variables. A random variable can be formally defined as a real-valued function that is defined over some sample space. For example, a random variable that describes the toss of a coin would be defined over the set {head, tail}, whereas a random variable that describes the amount of rainfall on a particular day would have a continuous sample space. One important property of a sample space (and hence of a random variable) is the associated distribution function. More formally, given a random variable X which is defined over some sample space Ω, X = x is written for the event

    {ω : ω ∈ Ω and X(ω) = x}    (3.1)

Similarly, one can write X ≤ x for the event

    {ω : ω ∈ Ω and X(ω) ≤ x}    (3.2)

The probability distribution function (often written PDF) F of the random variable X can be defined for each real x by

    F(x) = P[X ≤ x]    (3.3)

where the notation P[expr] is the probability that the expression expr is true. Such a distribution function is monotonically increasing and tends in the limit to one.


Another useful concept associated with a random variable is that of the probability density function (often written pdf) f, which is related to the PDF by

    dF/dx = f(x)    (3.4)

3.2.2 Exponential distribution

This is the most important distribution from the point of view of queueing theory. A continuous random variable X has an exponential distribution with parameter λ > 0 if its pdf f is defined by

    f(x) = λ e^(-λx) for x > 0, and f(x) = 0 otherwise    (3.5)

Hence the distribution function F is given by

    F(x) = 1 - e^(-λx) for x > 0, and F(x) = 0 otherwise    (3.6)

One reason for the importance of the exponential distribution in queueing theory and elsewhere is the Markov property, also known as the memoryless property, given by:

    P[X > t + h | X > t] = P[X > h]    for t > 0, h > 0    (3.7)

This states that the conditional probability of X being greater than t + h, given that X is greater than t, is dependent only on the value of h. The random variable does not `remember' how long it has been since the last event. Hence, when observing a system with this property, the moment when you started observing the system has no influence on the outcome. This is the only continuous probability distribution with this property. The mean of an exponentially distributed random variable is E[X] = 1/λ, and its variance is Var[X] = 1/λ² = E[X]².
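The memoryless property is easy to check empirically. A small sketch (the parameter and threshold values are arbitrary) estimates both sides of equation (3.7) from samples:

```python
import random

lam, n = 0.5, 1_000_000
random.seed(0)
samples = [random.expovariate(lam) for _ in range(n)]

t, h = 2.0, 1.5  # arbitrary thresholds for the conditional probability

# Unconditional estimate of P[X > h].
p_uncond = sum(x > h for x in samples) / n

# Conditional estimate of P[X > t+h | X > t], using only survivors past t.
survivors = [x for x in samples if x > t]
p_cond = sum(x > t + h for x in survivors) / len(survivors)

# For the exponential distribution both estimates agree (up to sampling
# noise) with exp(-lam*h), confirming equation (3.7).
print(p_cond, p_uncond)
```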

3.2.3 Stochastic processes and Markov chains

Queueing systems are a subset of the family of stochastic processes. Stochastic processes are about the random walk of particles between states over time. The properties of such stochastic processes depend on three things:
• A state space, which represents the valid locations of the particles.

• Some index parameter (usually time) at which the location of the particle is measured.

• Statistical dependencies among the random variables for different values of the index parameter.

In this thesis we are looking only at state spaces which are discrete in nature, in which the index parameter represents time and where this time parameter is continuous. The interdependencies between states are simple, in that the future behaviour of the process is dependent only on the state that the process is currently in, and not at all dependent on any past history. These criteria define precisely the area known as Markov chains. The property that the process's complete past history is completely summarised in its current state has the implication that it is not possible to find or to use the time that the stochastic process has been in the current state. This implies that the random variables which describe transitions from the current state to the succeeding state have no memory. As has already been asserted in the preceding section, the only such distribution for a continuous random variable is the exponential distribution.

Much has been written on Markov chains (MC). Of the properties that MCs may have, there are several interesting ones that are used in later chapters. These properties are summarised here:

homogeneous A MC is said to be homogeneous if transition probabilities are independent of the value of the index variable, ie the probability of going between states is not time dependent.

irreducible A MC is said to be irreducible if every state can be reached from every other state, in some arbitrary number of steps.

recurrent nonnull A MC is said to be recurrent if the probability of the process returning to the current state at some time in the future is 1. If the process is also nonnull then the time between such recurrences is finite.

aperiodic A MC which is recurrent is periodic if a state is returned to at times T, 2T, 3T, .... There is a theorem [75, page 29] that states that an irreducible recurrent MC is either periodic, such that every state is visited with the same period, or is aperiodic.

All the systems that are studied in this thesis possess all the above properties. This allows for the use of the theorem [75, §2.3] that states that a steady state solution exists. This steady state solution is independent of the initial state and it has a unique determination through the solution of a set of linear equations. A state which is aperiodic recurrent nonnull is said to be ergodic; if all states in a MC are ergodic then the MC is said to be ergodic. It can be shown that a finite aperiodic irreducible MC is ergodic and hence has a steady state solution.

3.2.3.1 Steady state

The concept of a starting-state-independent steady state solution is central to the extraction of performance criteria from the models that follow. As can readily be seen, a multiprocessor system that has just been started will commence execution in a particular state. Such a system will make requests in a particular sequence, which may differ from invocation to invocation. However, it is possible to calculate the transient probabilities of the system being in a particular state at any particular time. The steady state captures these probabilities in the long term. The rate of convergence to this steady state is rapid, usually being exponential. The steady state probability associated with a state corresponds to the probability of finding the system in that state at an arbitrary point in time. This also corresponds to the fraction of time that the system spends in that state. These probabilities can therefore be used in the generation of performance characteristics.

3.2.3.2 Conservation of flow

One important concept in Markov chains is that of flow. Flow captures the effect of the transition probabilities with respect to time. When the system is in equilibrium, the flow into a state (or collection of mutually independent states) must equal the flow out of that state (or collection of states). With reference to figure 3.7, the set of local equilibrium flow equations is

    2λ S0 = μ S1    (3.8)
    (λ + μ) S1 = 2λ S0 + μ S2    (3.9)
    μ S2 = λ S1    (3.10)
    S0 + S1 + S2 = 1    (3.11)


where the last equation (the normalisation equation) states that the system has to be in one of the three states and thus uniquely defines the solution. This gives that the probabilities of the system being found in a particular state, once the system is in equilibrium, are:

    S0 = 1 / (1 + 2λ/μ + 2(λ/μ)²)    (3.12)
    S1 = (2λ/μ) / (1 + 2λ/μ + 2(λ/μ)²) = (2λ/μ) S0    (3.13)
    S2 = 2(λ/μ)² / (1 + 2λ/μ + 2(λ/μ)²) = 2(λ/μ)² S0    (3.14)

This illustrates an important property of such solutions; namely, that the probabilities of the system being in a particular state can be expressed as a geometric factor of some base state (S0 here).
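As a check on this solution, the flow equations can be solved directly as a small linear system. The sketch below uses arbitrary rates, and notes that equation (3.9) is redundant given the other three; it confirms the geometric relationships in (3.13) and (3.14).

```python
import numpy as np

lam, mu = 1.0, 4.0  # arbitrary illustrative rates

# Flow equations (3.8) and (3.10) plus the normalisation equation (3.11),
# written as A @ [S0, S1, S2] = b; equation (3.9) is linearly dependent.
A = np.array([
    [2 * lam, -mu,  0.0],   # (3.8)  2*lam*S0 = mu*S1
    [0.0,     lam, -mu],    # (3.10) lam*S1   = mu*S2
    [1.0,     1.0,  1.0],   # (3.11) S0 + S1 + S2 = 1
])
b = np.array([0.0, 0.0, 1.0])
S0, S1, S2 = np.linalg.solve(A, b)

# Each probability is a geometric factor of the base state S0, as in the text.
assert abs(S1 - (2 * lam / mu) * S0) < 1e-12
assert abs(S2 - 2 * (lam / mu) ** 2 * S0) < 1e-12
print(S0, S1, S2)
```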

3.2.3.3 Kendall's notation

[Figure 3.8: Elements of a queueing system: a customer population feeding a queue served by servers 1, 2, ..., c]

A general model of queueing systems is illustrated in figure 3.8. An arbitrary queueing system consists of a customer population (which may be infinite), members of which enter the system forming a queue until served. The service facility consists of one or more identical servers. This is a very general model and there exists a lot of scope for variation: in the size of the population, the arrival distribution, the queueing discipline and the number of servers.

A shorthand notation, called the Kendall notation after David Kendall, has been developed to describe queueing systems; it has the form A/B/c/K/m/Z. In this notation A describes the interarrival time distribution, B the service time distribution, c the number of servers, K the system capacity (the size of the queue plus the number of servers), m the size of the source population and Z the service discipline. Usually the shorter notation A/B/c is used, and it is assumed to be equivalent to A/B/c/∞/∞/FCFS, where FCFS stands for the First-Come First-Served service discipline. The symbols traditionally used for A and B are described in the glossary (B).

3.2.3.4 Little's law

One important result, which expresses the relationship between the average time a customer spends in the service facility and the number of customers in that service facility, is known as Little's result. It states that the average number of customers in a queueing system is equal to the average arrival rate of customers to that system, times the average time spent in that system. This result holds for steady state queueing systems under very general conditions. It is often expressed as

    L = λW    (3.15)

and

    Lq = λWq    (3.16)

where L is the average number of customers in the service facility and W is the average wait time. The subscript q denotes only those customers in the queue (ie it excludes those customers actually receiving service).

3.2.4 Properties of the M/M/1 queue

The M/M/1 queue is the classical queueing system. It represents an infinite population which requests service at a rate λ. They form a first-come first-served
queue, which is serviced at a rate μ.

[Figure 3.9: State-transition-rate diagram for the M/M/1 queueing system: states 0, 1, 2, ..., k-1, k, ... with arrival rate λ and service rate μ]

The state-transition-rate diagram for this system is illustrated in figure 3.9. The properties of this system can be derived in many ways. The approach illustrated here will make use of the conservation of flow, and hence equilibrium and the existence of a steady state. All this follows, for the system is ergodic. For a general node in the chain, conservation of flow implies that

    (λ + μ) Pk = λ Pk-1 + μ Pk+1    (3.17)

Allowing for the special case of P0, from the above it can be seen that the following holds:

    P1 = (λ/μ) P0    (3.18)
    P2 = (λ/μ)² P0    (3.19)
    Pk = (λ/μ)^k P0    (3.20)

This, combined with the conservation relation

    Σ (k=0 to ∞) Pk = 1    (3.21)

gives the following value for P0

    1/P0 = 1 + (λ/μ) + (λ/μ)² + ... + (λ/μ)^k + ...    (3.22)

which is the well known summation of a geometric series. It converges to 1/(1 - λ/μ) for λ/μ < 1.

Substituting u = λ/μ gives the probability of the system being in S0 as

    P0 = 1 - u    (3.23)

and the general equation for the probability of being in a particular state as

    Pk = (1 - u) u^k    (3.24)

The quantity u is known as the loading intensity.

3.2.5 Performance applications

There are several performance related parameters that can be extracted from the M/M/1 queue. One of interest to a system designer would be the fraction of the time that the server is occupied. The server is occupied whenever the system is not in S0, ie with probability 1 - P0. Hence the server utilisation is given by

    1 - P0 = 1 - (1 - u) = u    (3.25)

Another factor of interest is the average number of customers that are in the queue. This can be derived through the following argument. Whenever the system is in S0 there are no customers in the queue; when the system is in S1 there is one customer in the queue, and so on. Given that we already know the probability of the system being in a particular state, and that these probabilities correspond to the fraction of the time that the system is in that state, the average length of the queue is

    0·P0 + 1·P1 + 2·P2 + ... = Σ (k=0 to ∞) k Pk    (3.26)

this gives

    Σ (k=0 to ∞) k Pk = (1 - u) Σ (k=0 to ∞) k u^k = (1 - u) u Σ (k=1 to ∞) k u^(k-1) = (1 - u) u / (1 - u)² = u / (1 - u)    (3.27)

10
Average Wait =

u = 1=u = E [s] (1 , u) 1 , u 1 , u

where E [s] is the average service time. Figure 3.10 contains a graph of the average wait time (where E [s] has been normalised to 1). This shows how the wait time grows rapidly as u ! 1. A small change in the loading intensity can cause a large change in the waiting time when u is near 1.

3.3 Performance parameters The de nitions of performance measures of parallel computing systems still tend to vary from author to author. For the purposes of this thesis we use the following de nitions:

73

3.3.1 Speedup

As has been mentioned, the use of the term `speedup' in performance measurement is open to many possible interpretations. The definition we wish to use here is based on Hockney's view that the performance of parallel systems is best measured in terms of the generation of solutions at a particular rate (§2.4.8). This starting point leads to two views of speedup, one based on the absolute rate of computation that the system is operating at, the second based on a relative measure of rates of work on differing numbers of processors. The speedup measures we use here assume that the total amount of computation within the system does not vary as the number of processors varies, ie its assumptions are akin to the fixed size speedup model presented in §2.4.1. We also make the assumption that the rate of solutions is directly proportional to the rate at which the processor performs useful computation. We will examine this relative measure first.

3.3.1.1 Relative speedup

For a given problem, in which the total amount of computation is fixed:

    Total Computation = Processing Rate × Time Taken    (3.29)

for two configurations, one with N processors, the other with one. The total computation C being constant, C = R1 × T1 and C = RN × TN. This gives the relative speedup of

    RN / R1 = T1 / TN = Time taken on one processor / Time taken on N processors    (3.30)

R N  Prate = N = N Relative Speedup = N = R 1  Prate 1 1

74

That is, linear speedup. However, given that the processors may idle while waiting for a response, they are only performing work while they are non-idle. This gives the following relationship

N , Total idleness of N proc. system)  Prate (3.32) Relative Speedup = ((1 , Total idleness of 1 proc. system)  P rate

This is independent of the rate at which the individual processors compute.

3.3.1.2 Absolute speedup The relative measure of speedup can be greatly in uenced by the rate at which a single processor performs the task. The measure `absolute speedup' that is used here has been chosen to highlight this single processor e ect and to allow for the comparison of di erent algorithmic approaches; this would not be possible if the only measure of speedup used were that of relative speedup. Absolute speedup is seen as the rate at which the system is performing computation that is directly related to generating the solution, hence it excludes such factors as idleness and overheads. Given that the motivation for embarking on this work was the study of a particular distributed virtual memory system (described in x3.5) the measure of absolute speedup has an associated physical intuition. If a single processor could be attached to a virtual memory system which responded to requests instantaneously then all the computational power of that processor would be delivered to the solution of the given problem. Such a processor would run at the absolute rate of one. Given that such an ideal system is not physically possible, a single processor will run at some rate dependent on the virtual memory system and other factors, hence some period of time will be spent awaiting a response from the virtual memory system. Thus in all practical systems the absolute rate of processing of a single processor is less that one, and hence so is its absolute speedup. We will derive, in later chapters, formul for this absolute rate of work. The relationship between absolute and relative speedup is (3.33)

Relative Speedup =

absolute speedup on N processors absolute speedup on 1 processor 75

3.3.2 Eciency, load and utilisation This set of measures is also relative, due to the diculty in ascertaining absolutes in computer systems. In the models that will be investigated all these values are interrelated. Eciency will tend to be used when examining the performance of the system with respect to the best absolute performance that the model generates. Load and utilisation will usually be used in reference to the use made of the virtual memory system. The importance of the load that additional processors generate when added to a multiprocessor system gives rise to the concept of loading intensity, which is a measure of this e ect.

3.3.3 Response time - average and distribution In discussing message passing systems with a request-response cycle the question of response time is often raised. The average response time and the distribution of such times can be important in the design of real-time systems. These times are measurable inside the computer systems and can lead to more con dence in the results of the model. It is also possible (via various theorems) to place probabilistic bounds on the response times. The issue of response time is less important outside the area of real-time systems, so much so that for the systems that will be studied here the distribution of the response time has no e ect on the performance of the system.

3.4 Use of queueing theory for performance modelling In modelling the performance of a computer system there is rarely a single component of interest. In addition to the CPU in a computer system there are peripherals. An individual job may have to visit the processor and peripherals several times before running to completion. Queueing networks can be used to model such things as jobs traversing between several service centres, queueing at each one, until completion. They can also be used to model other physical systems, such as the properties of packet switched computer networks, or the properties of computer systems with CPU and peripherals. 76

There are two major variations within this basic model. In the rst, known as open networks or in nite models, jobs enter and leave the queueing network, as well as circulating between the nodes. In the second, known as closed or nite models, the number of jobs in the system is constant and these individual jobs circulate around the system. One point to note is that in these networks, as in Markov chains, the jobs have no history. Their behaviour (in terms of which node they are going to visit next in the network) is not determined by the nodes that they have previously visited, only by the node that they are currently at. Although the use of such queueing networks for single processors with peripherals and for communications networks is well documented [6, 21, 75, 76, 89] their use in predicting the performance of multiprocessor systems appears to have been limited to two speci c areas. One area is the contention for memory modules in multiprocessors with a bus architecture [15, 24, 79, 4, 133] in which the e ects of multiple processors accessing one (or more) internal buses to a limited number of memory units is examined. A variation on this in which multi-stage interconnection networks were analysed has been investigated by Harrison [58, 59]. The other area that has been modelled is the e ect of the varying concurrency available in shared memory parallel machines in terms of fork and join [9, 62, 93, 94]. The analysis associated with these models is performed under the assumption that the environment is work conserving, ie no processor is ever idle if there is an available task to be run. This allows for the treatment of the available work as one queue to one (or more) servers, any processor being available to run any work. This also implies that the tasks can be allocated to any processor; it makes use of the memoryless nature of the queueing network. This work conserving paradigm is not a reasonable supposition in a message passing environment. The global knowledge and the transfer of work is not possible due to the distributed nature of such knowledge and nite bandwidth of the interconnection network. In the analysis of multi-threaded execution, Alkalaj and Boppana [5] investigate the e ects of two thread execution strategies in the presence of a thread transfer latency. In this model they assume a shared memory implementation (the work conserving scheduling strategy) and use the threads to hide the cost of creation and synchronisation. They do not take into account the implicit cost involved in 77

scheduling a thread on a processor di erent from the one on which it was last run.

3.5 Summary In this chapter we have outlined how behaviour can be used to formulate stochastic performance models. One important feature of such an approach is that it allows for the composition of behaviours representing individual components of the overall system. Such compositionality is important, as it eases performance and scalability modelling within a traditional function decomposition design approach. Preserving the outline of the process within the nal model also aids the application of knowledge gained from the performance model to the design and implementation. Combining this behaviour based modelling with the taxonomy of the previous chapter, we have identi ed the following conceptual objects as the structural components of our model:

- Processors
- Processes (levels of multi-threaded execution)
- Delay
- Contention (synchronisation between two processes)

As was mentioned in the introduction, this work was motivated by a desire to understand the performance characteristics of an existing parallel processing system. This system was developed by Green [48] to study the application of massive parallelism to an area of computer graphics. Our initial impetus was to understand the performance and scalability properties of the distributed virtual memory system that was employed by Green. Green's system studied ray-tracing; the two major data structures, that of the objects in the scene and the octree (a data structure which aids the testing of which objects intersect with a given ray), were not required to be resident on each processor. To illustrate how the algorithm operated, consider the acquisition of an object from the scene (a similar mechanism is used for the accessing of voxels out of the octree). During the execution of the ray-tracing algorithm it is necessary to test for intersection between objects in the scene and the ray that is being rendered. Green's

approach to this was to request the object from a local data manager, which would check for its presence in a local cache; if this succeeded, the object would be returned to the requesting process. If the object was not present then the request was passed to another processor, which looked in its local cache, and so on. The system was designed in such a way that there was a unique path to a specialised processor which contained copies of all the objects in the scene. From this outline of the system's physical structure and behaviour it can be seen that each ray-tracing process has a basic cycle of computation, request and response. In §6.1 this virtual memory model will be used as a concrete example of the application of our modelling technique.

The basic cycle of compute, request and response bears a marked similarity to Remote Procedure Calls (RPCs). In the case of the ray-tracing system the computation portion of the process makes a procedure call on the memory manager, which may or may not generate a remote procedure call on a succession of processors. It is interesting to note that although such RPCs are a commonplace programming paradigm, especially in networks of minicomputers, there has not been much analysis of their abstract behaviour. There appears to have been but a single study of the performance implications of data dependency (Zhou and Molinari [135]) and one application of queueing theory to RPC systems, in which the server modelled required time to recover before the next request could be processed. This study by Mujamdar et al [85] gives some performance bounds on such systems.


Chapter 4

Single Thread Per Processor

Within the context of the behaviourally based models discussed in the previous chapter, this chapter begins the development of our class of models. We will also describe how such models can be used in analysing and predicting performance and scalability. First we need to place our model within the context of the taxonomy of §2.2. As we are explicitly considering message passing systems where the communications infrastructure is finite, we do not make the work-conserving scheduling assumption: if a processor has no work to perform because all of the processes assigned to it are awaiting some interaction, then the processor will idle, even if there are available, runnable, processes elsewhere in the system. As our models develop we will include non-computational delay; to include such delay in the model we will need to place some independence restrictions on the nature of the delay. These are not very restrictive and will easily allow us to capture delay which varies with the topology of the system. One of the basic assumptions is that our system is engaging in simple synchronisation, in the form of a rendezvous. This rendezvous can be used to represent many different mechanisms; the original motivation was to model requests of a distributed memory system. It could equally well be used to capture simple remote procedure call mechanisms. We have also taken as our starting point that we have finite capacities in processors, processes and other resources. The assumption of finite processes will allow us to capture explicitly a specific granularity of execution, where this is appropriate.

Our initial model allows for the classification of the performance components

within the system into three distinct categories. The first is a computational resource, such as a processor: this processor has a task to perform; in performing that task it occasionally makes requests of some remote service. While awaiting a response from this remote service it is idle (the case where there are other tasks that it can perform is covered in the next chapter). The second component is the remote service; requests from remote processors arrive at this resource and are serviced on a first-come first-served basis, the total delay at this facility being the time spent queueing and the time to process the request. The final component is that of pure delay; this captures other time-consuming activities, such as communication. There is an assumption that pure delay is not dependent on the values of the other two components above.

In modelling the computational resource we are treating the generated stream of requests as a random stream. The use of such a random stream is usually chosen for one of two reasons.

1. The source of requests is completely independent of the environment in which it is placed, and the rate and distribution of these requests is unaffected by the state of the total system.

2. The above criteria do not hold, but an approximation is made, as a true model of the interaction introduces too much complexity. This complexity affects the tractability of the problem.

As discussed in the previous chapter, the cyclical nature of the requests and responses, combined with the finite population of requests, leads to a closed queueing system. The finite and cyclic nature of the problem occurs in other areas, such as communication networks with credit flow control, which cannot be adequately modelled without taking account of these interactions.

The question that drives the analysis in this chapter is "given knowledge of the behaviour of a single processor, how much can be said about the composite behaviour of several such processors in a system?". This is a fundamental question for any modelling approach that wishes to have compositional properties. Some answers to this question are developed in this chapter and in chapter 5. In this chapter we will develop the basic approach for the single thread of computation. In §4.1 this will be for the idealised case where the system has no delay in the transmission of the requests and responses. The addition of such delay into this model is discussed in §4.2. This approach will be enhanced in chapter 5, when many threads (processes) are present in each processor, as would be the case where latency hiding is employed.

4.1 Without response delay

In keeping with the emphasis on composability we will develop this model by looking at the individual components. As discussed in section 3.1, the Generalised Stochastic Petri Net formulation can be viewed as a representation that captures elements of both behaviour and performance. The correspondence between the behavioural model of a GSPN and the Markov chain representation of that model, for two processors accessing a single service point, was illustrated in figures 3.6 and 3.7 in §3.1. It is the quantitative analysis of the general form of this model that we wish to develop here. That two-processor example on page 63 was just one case out of a family of systems. It is this family whose behaviour is analysed in this chapter. Initially the members of the family will differ only in the number of processors they contain. In this section we concentrate on systems in which there are no delays other than the delay introduced by queueing and receiving service. The addition of other delays is studied in §4.2.

We will show how this family of systems corresponds to the queueing system known as M/M/1/K/K. As described in §3.2.3.3, this notation signifies a Markov arrival pattern, a Markov service pattern, one server, a capacity of K jobs queueing and receiving service, and a total population of K jobs. This M/M/1/K/K queueing system has been studied since the 1950s [12, 19], when it was used to study the performance of weaving machines that would break after random intervals, given that these breakages would need to be fixed by an operator who tended several machines. This model was developed further in the mid 1970s and applied to computer performance modelling [6, 35]. It found applications in the modelling of processes waiting for the fulfilling of I/O requests, those requests coming from several independent processes. It was also used in the modelling of user queries against some database, where the 'operating' time is the time taken by the user to generate their next query. This abstract model is often referred to in the queueing theory literature as the "Machine Repair with One Repairman" or the "Machine Interference" model.

[Figure 4.1: Petri net representation of figure 4.2: places p_1 (processors, K tokens), p_2 (queueing) and p_3 (receiving service); transitions t_1 (rate m_1 λ_r), t_2 and t_3 (rate μ).]

[Figure 4.2: Outline of the M/M/1/K/K queueing system: K processors each operate for time τ (1/λ_r = E[τ]) between requests to a single server; λ = composite rate of requests, w = total wait time, q = time in queue, s = time receiving service, N = number in queue or receiving service.]

It can be seen that for a process making requests of a distributed virtual memory system there is a correspondence with this model. The 'operating' time is the time that the processor is performing useful work between requests for data, while the 'broken' time is the time spent waiting for a response from the virtual memory system. The physical structural model, whose behaviour is illustrated in figure 4.2, can be described as follows: Consider K identical processors (or machines) which run for a period of time before requesting service (or breaking). The processor (machine) then waits for a response (to be fixed) before starting the cycle again.

There are several important performance parameters that can be derived from this model. Some of these assume an exponential distribution for both the service time and the time for which the processor runs before requesting service, whereas others only require the conservation of flow principle and hence hold for other request and service time distributions. Table 4.1 summarises these parameters and highlights the difference in physical interpretation between the modelling of parallel computing systems and the more usual machine repair model.

Parameter | Virtual Memory Performance Model | Machine Repair
E[τ] (1/λ_r) | Average time between requests for data | Average time between breakdowns, the operating time
s | Time taken to service a particular request | Time taken to repair a particular breakage
E[s] (1/μ) | Average time to service a data request | Average time to repair a machine
q | Number of requests queueing awaiting service | Number of broken machines awaiting repair
N | Number of requests outstanding (= number of processors that are idle awaiting data) | Number of machines that are down
λ | Average arrival rate of requests from the parallel system | Average rate at which the K machines break down

Table 4.1: Comparison of the two physical interpretations

Parameter | Physical Model | Petri Net Model
λ_r | Request rate from a single processor | Rate of firing for a single token
λ | Composite rate from all processors | m_1 λ_r; the total rate of transitions from p_1 to p_2
q | Number of requests awaiting service | The number of tokens in p_2 (m_2)
w | Total time queueing and receiving service (round trip time) | Latency of a token in p_2 and p_3
s | Time taken to process a particular request | Latency of a token in p_3
N | Total number of requests that are outstanding | m_2 + m_3

Table 4.2: Correspondence between the physical and the behavioural models

The correspondence between the physical model of figure 4.2 and the behavioural model of figure 4.1 is summarised in table 4.2. It can be seen that the performance of the system corresponds to assigning meaning to the duration of the location of the tokens in various locations within the representation. The same sort of technique has been used in various other settings: the Probabilistic Duration Calculus [27] maps the state of the system to either 1 or 0; the integration of this variable can then be used to quantify an attribute of the system over the phases of its operation, like gas being released by an unlit pilot light. It allows for specifying safety properties; for this example the system is safe provided that the amount of un-ignited gas within the system is bounded. Stochastic Reward Networks [30] map the state (in the Markov chain sense) of the system to a reward value. Here the system is trying to use the average sojourn time associated with sets of states to give performance measures.

In the queueing theory literature, the exploration of the practical problem of machine repair by an operator started as a model of the performance of both machine and operator, in factory situations, in order to assess the performance of the machines and the demands upon the operator. In its multi-server form (M/M/c/K/K) it has also been used to estimate the number of operators required to give a particular upper bound on the repair time of machines. In its application to computer performance the model has been used to find bounds on the number of users of transaction processing systems [6, p190]; to estimate the maximum number of sources that can share packet switching network connections [89, p75]; and to examine the response times for multiple users of a CPU system [72, p56].

The application of this model to measure the composite work rate from a total system does not seem to have been given any attention since the 1950s, when the rate of production of weaving machines under the control of an 'operative', given particular ratios of failure to a deterministic repair time, was studied. This contention for a common resource (which in the case of Benson and Cox [19] was the operator, and which in our case is the virtual memory system) is the ultimate limiting factor in the increase of the rate of production. Once the service facility is continuously occupied, the addition of processors (or machines) to the system offers no benefit in the rate of production.


[Figure 4.3: State-transition diagram for the M/M/1/K/K system: a birth-death chain on states 0, 1, 2, ..., K-1, K, with request (birth) rates Kλ_r, (K-1)λ_r, ..., λ_r and service (death) rate μ.]

4.1.1 Derivation of properties of this model

The view of the behavioural model of figure 4.1 as a Markov chain is again based on the number of requests queued for or receiving service (ie m_2 + m_3). Given that the number of tokens in the complete system is constant (K), the number of tokens at p_1 must be K - (m_2 + m_3). This gives rise to the birth-death Markov chain view of figure 4.3, in which each state represents the number of tokens at p_2 and p_3, ie K - m_1. There are two distinct stages in using this Markov chain system to derive the properties of the system. The first is based on the conservation of flow, which yields results that can be used without regard to the distribution of request and service rates; the second uses the steady-state probabilities associated with the Markov chain to predict the number of outstanding requests, and allows for the prediction of the behaviour of systems that have suitable known distributions of request and service rates.

4.1.1.1 General properties

The general properties of this system could be derived equally well by reference either to the structural model or to the behavioural model combined with the conservation of flow. We will use the structural model here. The relationship between the request rate, service rate, server loading and wait time can be derived using only the conservation of flow. Let the utilisation of the server be ρ. The server is assumed to be eager, ie it processes a request whenever there is one in the queue to be processed. The system throughput, T, is defined as the average number of requests that are processed per unit time (this equals the number of requests that arrive from the processors per unit time, by the conservation of flow). As the network processes μ requests (on average) per unit time while the server is busy, then

(4.1)   T = ρμ

Letting the response time from the server be W (this includes the time spent queueing for service and the time spent receiving service), the average interval between successive requests from each processor in the system is W + 1/λ_r. This implies that the average rate of requests from each processor is 1/(W + 1/λ_r). Since there are K processors, all running independently, the total arrival rate of requests, which equals the throughput of the system T, is

(4.2)   T = K / (W + 1/λ_r)

By the conservation of flow, equations (4.2) and (4.1) must be equal. Solving these equations for the average response time W gives

(4.3)   W = K/T - 1/λ_r = K/(ρμ) - 1/λ_r

As already stated, this argument does not rely upon the distribution of either the service times or the request inter-arrival times. The resulting equation (4.3), and other results derived from it, are valid for any queueing system in which the average service time is 1/μ and the average interval between requests is 1/λ_r.

4.1.1.2 Distribution related properties

The relationship between the load intensity (u) and the server loading (ρ) is dependent upon both the request and service rates being exponentially distributed. The length of the queue (N in figure 4.2 and table 4.1; m_2 + m_3 in figure 4.1) can be used to represent the state of the system. The system that is represented by these states can be viewed as a Markov birth-death system. In such a system all the running processors are generating requests independently of each other. As illustrated in figure 4.3, the following transitions between the states can occur:

- state S_j to S_{j-1}: this represents a request being serviced, the re-starting of an idle processor and a reduction in the number of items to be serviced.

- state S_j to S_{j+1}: this represents a new request being made by an active processor, that processor becoming idle, and an increase in the number of items to be serviced.

In this model a processor can generate a request only while it is active and, once a request has been made, it cannot generate another request until that request has been serviced. Thus the state number N, which is equivalent to the number of items queued, represents the current number of processors that are idle. When the system is in state S_0 all K processors are running.

First, looking at the instantaneous transition rate from state S_j to S_{j+1}, and making the same assumptions as in sections 3.2.2 and 3.2.3 on renewals in stochastic processes, the probability that any one processor will make a request in a time interval h is λ_r h + o(h). Also, given that S_j represents j idle processors, there must be K - j processors running and hence possible candidates to generate the next request. Thus in S_j this set of processors, which are running independently of each other, has an overall probability that a request will be made of (K - j)λ_r h + o(h). This gives an instantaneous transition rate from S_j to S_{j+1} of (K - j)λ_r. The instantaneous transition rate from state S_j to S_{j-1} does not change with the state of the system, as there is only one service point and this is operating at a rate of μ.

Following the same line of reasoning as in section 3.2.3.2, and denoting the steady-state probability of state S_j by P_j, we have the following set of local balance equations:

Kλ_r P_0 = μ P_1
(K-1)λ_r P_1 = μ P_2
  ⋮
λ_r P_{K-1} = μ P_K

which is equivalent to

(4.4)   (K - j + 1)λ_r P_{j-1} = μ P_j,   1 ≤ j ≤ K

From these equations it can be seen that it is possible to express the probabilities of all the states in terms of P_0, namely:

(4.5)   P_j = [K! / (K-j)!] u^j P_0,   1 ≤ j ≤ K

where u = λ_r/μ.

[Figure 4.4: Server utilisation ρ as a function of load intensity u ∈ [0, 3], for K = 1, 2, 3, 5.]

The normalisation equation (Σ_{i=0}^{K} P_i = 1) implies that the value of P_0 is:

(4.6)   P_0 = [ Σ_{i=0}^{K} (K! / (K-i)!) (λ_r/μ)^i ]^{-1}
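These formulae are mechanical to evaluate numerically. The following sketch is our own illustration, not code from the thesis; it is plain Python with hypothetical helper names, computing P_0 from equation (4.6) and the server utilisation ρ = 1 - P_0 from equation (4.7), which follows:

```python
from math import factorial

def p0(u, K):
    # P0 = [ sum_{i=0}^{K} K!/(K-i)! * u**i ]^(-1)   -- equation (4.6)
    return 1.0 / sum(factorial(K) // factorial(K - i) * u**i for i in range(K + 1))

def server_utilisation(u, K):
    # rho = 1 - P0   -- equation (4.7)
    return 1.0 - p0(u, K)

if __name__ == "__main__":
    # Reproduces the shape of figure 4.4: utilisation rises more steeply with K.
    for K in (1, 2, 3, 5):
        print(K, round(server_utilisation(0.5, K), 4))
```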

4.1.2 Performance measures of this model

4.1.2.1 Server loading

The server has work to perform whenever any one of the processors has made a request, ie whenever the system is not in state S_0. This gives the server fractional utilisation ρ as:

(4.7)   ρ = 1 - P_0

The variation of server utilisation against load for various numbers of processors is illustrated in figures 4.4 and 4.5. Another useful measure is the total rate of requests from all the processors.

[Figure 4.5: Server utilisation ρ as a function of load intensity u ∈ [0, 0.05], for K = 1, 5, 10, 20, 50.]

This measure λ must equal the rate at which the server is processing them, thus:

(4.8)   λ = ρμ = (1 - P_0)μ

This value λ can be viewed as the message density that the server is processing.

4.1.2.2 Average queue length and response time

The average response time W has already been defined in equation (4.3), and by Little's result (§3.2.3.4) the average length of the queue for processors requesting and receiving service is:

(4.9)   L = λW = ρμW = K - ρμ/λ_r = K - ρ/u

where ρ = 1 - P_0 from equation (4.7). The total response time W consists of two components: the time spent in the queue waiting for service, W_q, and the time spent receiving service, W_s.

[Figure 4.6: The calculate, request, calculate cycle of a single worker processor: an operating period between a response and the next request, followed by the waiting and service time w + s.]

W_s is the average service time, E[s], which equals 1/μ. Thus the time spent queueing is:

(4.10)   W_q = W - W_s = W - E[s] = K/(ρμ) - 1/λ_r - 1/μ
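Equations (4.3), (4.9) and (4.10) translate directly into a few lines of code. A small sketch (ours, for illustration only; `server_utilisation` is repeated from the previous fragment so that this one is self-contained):

```python
from math import factorial

def server_utilisation(u, K):
    # rho = 1 - P0, with P0 from equation (4.6)
    return 1.0 - 1.0 / sum(factorial(K) // factorial(K - i) * u**i for i in range(K + 1))

def performance(u, K, mu):
    # Return (W, Wq, L): response time (4.3), queueing time (4.10), queue length (4.9).
    lam_r = u * mu                      # per-processor request rate, since u = lam_r / mu
    rho = server_utilisation(u, K)
    W = K / (rho * mu) - 1.0 / lam_r    # equation (4.3)
    Wq = W - 1.0 / mu                   # equation (4.10)
    L = K - rho / u                     # equation (4.9)
    return W, Wq, L

# Example: K = 10 processors, u = 0.1, unit service rate.
print(performance(0.1, 10, 1.0))
```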

4.1.2.3 Processor idleness/processor utilisation

As there are no other overheads in this model, a processor is idle whenever it is waiting for a response to a request; this is W as derived in equation (4.3). When a processor is running it generates requests at the rate λ_r; each processor cycles between the two states as illustrated in figure 4.6. This cycle corresponds to the transition of a token from p_1 via p_2 and p_3 back to p_1 in figure 4.1. To complete one of these cycles takes W + 1/λ_r period of time. Hence the idleness of this processor is:

(4.11)   Fraction Idle = W / (W + 1/λ_r)

If equation (4.11) is rewritten to find W in terms of the fractional idle, and combined with the definition of W from equation (4.3), an equation for u using the measure of fractional idle can be derived:

(4.12)   (1/λ_r)(idle/(1 - idle)) = K/(ρμ) - 1/λ_r,   giving   u = ρ / (K(1 - idle))

This provides for a measure of the loading intensity u based on the measurement of

the per-processor idle time. However, when using the definition of ρ from equation (4.7), a polynomial of order K is generated, for which a direct algebraic solution is too complex in the general case. Such polynomials are amenable to numerical solution. In the case K = 1 it simplifies to

(4.13)   u = idle / (1 - idle)
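Because ρ itself depends on u, equation (4.12) is implicit in u; a simple bisection on the residual f(u) = ρ(u, K) - uK(1 - idle) recovers u numerically. A sketch (our illustration; the expanding bracket is an assumption that the root lies above the initial interval):

```python
from math import factorial

def server_utilisation(u, K):
    return 1.0 - 1.0 / sum(factorial(K) // factorial(K - i) * u**i for i in range(K + 1))

def u_from_idle(idle, K, tol=1e-10):
    # Solve rho(u, K) = u * K * (1 - idle) for u   -- equation (4.12)
    f = lambda u: server_utilisation(u, K) - u * K * (1.0 - idle)
    lo, hi = 1e-12, 1.0
    while f(hi) > 0:            # expand until the residual changes sign
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

# Check against the closed form (4.13) for K = 1: idle = 0.25 gives u = 1/3.
print(u_from_idle(0.25, 1))
```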

4.1.2.4 Speedup

Absolute speedup. As outlined in section 3.3.1.2 the absolute speedup of this system is equivalent to the rate at which the system generates solutions on N processors. Taking the case of a single processor in the system, as can be seen in figure 4.3, when K = 1 there are two states, S_0 and S_1. The system is performing useful work when it is not in S_1, ie when the single processor is not idle. Due to the normalising condition, P_0 + P_1 = 1, the system is performing work at the fraction P_0 of its maximum rate. From the definition in equation (4.6) the value is:

P_0 = [ Σ_{i=0}^{1} (1! / (1-i)!) u^i ]^{-1}

which reduces to

(4.14)   P_0 = 1 / (1 + u)

which is the rate of work when there is one processor. There are two approaches to evaluating the combined rate of computation on K processors. The first looks at the system as a whole; the second approaches the composite rate by looking at the rate of processing of an individual processor.

First approach. This approach is based on looking at the system as a whole, noting that the state numbers for the states S_0, S_1, ..., S_{K-1}, S_K represent the number of processors that are idle; hence when the system is in a particular state S_i it is computing the solution at the rate (K - i). The system is in the state S_i with a probability of P_i, hence the total rate at which the system is processing is the sum, over the states, of the product of the number of processors running in a state with the probability of the system being in that state:

(4.15)   K P_0 + (K-1)P_1 + ... + (K-(K-1))P_{K-1} + (K-K)P_K = Σ_{i=0}^{K} (K-i)P_i = K Σ_{i=0}^{K} P_i - Σ_{i=0}^{K} i P_i

As Σ_{i=0}^{K} P_i is the left-hand side of the normalising equation, and hence equal to 1, and as Σ_{i=0}^{K} i P_i is the usual definition of the average queue length (L) of those seeking service, the rate at which this system runs is

(4.16)   K P_0 + (K-1)P_1 + ... + (K-(K-1))P_{K-1} + (K-K)P_K = K - L

This equates to the average number of processors that are running. This is not a surprising result; however it is useful in that the model yields a concept that is in keeping with the intuitive feel for such systems. Combining this result with the definition of the average queue length from equation (4.9), we get the average rate of processing for an M/M/1/K/K system of K processors:

(4.17)   Rate = K - L = K - (K - ρ/u) = ρ/u

This gives the rate of processing for a system consisting of K processors as the server utilisation over the offered load.
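The speedup ceiling implied by (4.17) is easy to see numerically: ρ ≤ 1, so ρ/u can never exceed 1/u. A short illustration (ours) tabulating how the absolute speedup saturates as K grows:

```python
from math import factorial

def server_utilisation(u, K):
    return 1.0 - 1.0 / sum(factorial(K) // factorial(K - i) * u**i for i in range(K + 1))

def absolute_speedup(u, K):
    # Rate of processing on K processors   -- equation (4.17)
    return server_utilisation(u, K) / u

u = 0.1
for K in (1, 5, 10, 20, 50):
    print(K, round(absolute_speedup(u, K), 3), "ceiling:", 1.0 / u)
```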

Second approach. This approach is based on an individual processor's view of the system. As illustrated in figure 4.6, the processor views the time spent queueing for and receiving service as non-productive time; it is actually performing work for E[τ] units of time out of every E[τ] + E[s] + E[w] time units. This is equivalent to (1 - Fraction idle) as defined in equation (4.11); given that E[τ] = 1/λ_r and that E[s] + E[w] = W, which is defined in equation (4.3), the processing rate of one processor in a K-processor system is

(4.18)   Single Processor Rate = (1/λ_r) / ((1/λ_r) + W) = 1 / (1 + λ_r W)

(4.19)   = 1 / (1 + λ_r(K/(ρμ) - 1/λ_r)) = ρ/(Ku)

Thus the rate of processing for K processors is ρ/u. The value for K processors is the same as that already derived in equation (4.17). It is interesting to note that this second approach is based entirely on averages and not at all on the distribution of either the service or inter-arrival times, and hence holds for other distributions where the average service time is 1/μ and the average interval between requests is 1/λ_r.

There is an alternative informal way of deriving these results: the generation of a request implies that a certain amount of processing of the problem in hand has been completed. This request has the effect of increasing the intensity of the load on the server by u (= λ_r/μ). Thus, when a server is being operated at a fraction ρ of its capacity, it is performing work at the rate ρ/u. As the server can only ever run at full capacity (ρ = 1), the maximum rate of work is thus 1/u; this is more formally derived in the following section.

Asymptotic performance. The asymptotic performance of a system as the number of processors grows can be viewed as

(4.20)   lim_{N→∞} ρ/u = (1/u) lim_{N→∞} ( 1 - [ Σ_{k=0}^{N} (N! / (N-k)!) u^k ]^{-1} ) = 1/u

[Figure 4.7: Absolute speedup as a function of load intensity (u ∈ [0, 3]), for K = 1, 3, 5.]

[Figure 4.8: Absolute speedup as a function of load intensity (u ∈ [0, 0.1]), plotted against the number of processors (5-50) for u = 0.01, 0.02, 0.04, 0.06, 0.08, 0.10.]

[Figure 4.9: Relative speedup as a function of load intensity (u ∈ [0, 3]), for K = 1, 3, 5.]

This result is illustrated by the graphs contained in figures 4.7 and 4.8. A similar result to this asymptotic limit, for the maximum productivity of weaving machines with deterministic clearing times under the control of one operator, has been reported by Benson and Cox [19].

Relative speedup. As defined in section 3.3.1.1, relative speedup can be seen as the ratio of the rate of computation on K processors to the rate of computation on a single processor. Using the absolute rate of computation from equation (4.17) and the absolute rate on a single processor from equation (4.14), the relative speedup is

(4.21)   [(1 - P_0)/u] / [1/(1 + u)] = ρ(1 + u)/u

This is illustrated in figures 4.9 and 4.10 for various values of u and K. To derive the asymptotic performance of this model we need to look at limits on the speedup as the number of processors grows. From the formulation of P_0 in

[Figure 4.10: Relative speedup as a function of load intensity (u ∈ [0, 0.1]), plotted against the number of processors for u = 0.01, 0.02, 0.04, 0.06, 0.08, 0.10.]

equation (4.6), the maximum relative speedup is reached as

(4.22)   lim_{N→∞} ρ(1 + u)/u = ((1 + u)/u) lim_{N→∞} ( 1 - [ Σ_{k=0}^{N} (N! / (N-k)!) u^k ]^{-1} ) = (1 + u)/u

)!

4.1.3 Practical use of this model Although straightforward and not complex, the results of the queueing system that describes this model have many potential uses. Within the limitations of the original model, those being the absence of delay and overhead, the qualitative results hold for any distribution of request and service rates where use of an average is reasonable. In the use of quantitative results there is a need for the user of this model to have justi cation that the exponential distribution assumptions are reasonable for the particular application.

4.1.3.1 Applicability of exponential distributions to real systems Exponential distributions (and the renewal Poisson processes derived from them) have many physical manifestations, and many physical processes are adequately 98

modelled by them [120]. The available examples of the importance of this distribution range from the well known nature of radioactive decay, to the accidental death of soldiers in the American army from being kicked by their horses [42]. Even if a single random process is not well approximated by a Poisson process, there is justi cation that, as the output from several random processes are superimposed, the resulting random process can be approximated more and more accurately by a Poisson process [89, x2.2]. Even if, given these justi cations, it is still felt that the distribution criteria are not met, then the only applicable formulae are those of equations (4.3), (4.9), (4.10), (4.12) and the processing rate equations (4.17) and (4.19).

4.1.3.2 Service time distributions Although the discussion has been related to a virtual memory server, this model models something broader, namely resource contention. In a particular system it may not be the software of the virtual memory server that is the subject of contention but the communications between processors. The Poisson process has been successfully used for the modelling of communication behaviour ([89, chap. 6], [75, chaps. 5&6], [72, chap. 14]). The usual justi cation is that the data being transmitted is of varying length and the transmission time is related to the length of the message being transmitted. If the subject of contention is some software server then the major portion of the service time is the execution of the algorithm that implements the service. Although most searching algorithms have a uniformly distributed execution time, the use of caching and the underlying scheduling of processes will perturb this.

4.1.3.3 Processor operating time distributions One of the assumptions of this modelling technique is that the intervals between requests from processors are random and independent of each other. In the modelling of virtual memory systems for applications without barrier synchronisation these appear to be reasonable assumptions for the following reasons: 1. Local caching will a ect the interval between requests; a hit on the cache will increase the delay between two requests. The number of hits between any two misses on the cache can be viewed as a random variable. 99

2. Most algorithms do not take exactly the same amount of time to execute each step; there are tests where the branches have unequal execution times. Even if the algorithm has equal execution times per step, the following point still applies. 3. As the time taken to receive a response depends on queueing, only one request can be processed at any one time, so even if the requests start in synchrony, after one request they will get out of step and the distribution of the round trip time will tend to become exponentially distributed. (This is a consequence of the Heavy Trac Approximation Theorem of Kingman [73]).

4.1.3.4 Use of the formulae

If it is felt that the distribution of request and service times is not reasonably modelled by Poisson processes, the formulae that are derived by substituting 1 - P_0 for the value of ρ would need to be avoided. This still allows for the comparison of various measures but does not allow for the predictive aspects, as these need to predict the value of ρ for a number of processors. Where the distribution assumptions are felt to be reasonable it is only necessary to know u in order to predict the various performance measures of the system. This can be derived by direct measurement, by analysis of the code itself, or indirectly. The knowledge of u allows for a prediction of the maximum of both absolute and relative speedup (1/u and (1 + u)/u) and the various configurations that would achieve the required performance. Examples of the use of these equations can be found in §6.1. Although many performance-related measures can be found from the use of the formulae derived in this section, we will concentrate on their application to scalability and speedup.

4.1.3.5 Finding u when it cannot be measured directly

Given that one of the aims expounded in the introduction to this thesis was the understanding of existing systems, there is a need to be able to apply this theoretical performance model to existing multiprocessor systems. The use of this theoretical model depends on knowledge of u; this may be difficult to measure in an existing system. Moreover, the instrumentation necessary to measure λ_r and μ may change the characteristics of the system. However, the model allows several approaches to the approximation of u. These methods are outlined below and their use is illustrated in §6.1.

Response time. Using equation (4.3), it is possible to take the average response time to a request and, by finding the roots of a polynomial, estimate u for a given configuration. The analytical solution of the polynomial is not required, and any root-finding technique that finds the real root between 0 and 1 should be appropriate.

Processor idle. If it is possible to measure the idleness of individual processors, then the use of equation (4.12), combined with finding the root of a polynomial, can be used to approximate u.

Server idle. If this is measurable, then the solution of the polynomial for P_0 can yield an approximation for u.

Curve fitting. If the only available measurement of performance of the system under examination is the wall-clock time to solve a particular problem on different numbers of processors, then a least-squares fitting technique can be used to find the value of u. This technique appears to give very good fits; it is discussed in more depth in §6.1.
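The curve-fitting method can be sketched with a plain grid search: the model predicts the run time on K processors to be proportional to u/ρ(u, K), the best scale factor C for each candidate u has a closed form, and the u with the smallest squared error wins. This is our illustration of the idea, not the fitting code used in the thesis; `times` holds measured wall-clock times for the processor counts in `Ks`, and the search range for u is an assumption:

```python
from math import factorial

def server_utilisation(u, K):
    return 1.0 - 1.0 / sum(factorial(K) // factorial(K - i) * u**i for i in range(K + 1))

def fit_u(Ks, times, steps=2000, u_max=2.0):
    # Least-squares fit: run time on K processors ~ C * u / rho(u, K)
    best_u, best_err = None, float("inf")
    for n in range(1, steps + 1):
        u = u_max * n / steps
        model = [u / server_utilisation(u, K) for K in Ks]
        C = sum(m * t for m, t in zip(model, times)) / sum(m * m for m in model)
        err = sum((C * m - t) ** 2 for m, t in zip(model, times))
        if err < best_err:
            best_u, best_err = u, err
    return best_u
```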

4.1.4 Shortcomings in this model

This model assumes that a processor waits for a response to every request before continuing 'useful' work; this waiting can constitute a significant portion of the run time. The hiding of this latency is investigated in chapter 5. It assumes that there are no delays in moving data around the system; a simple model of delay is incorporated into this model in §4.2. It assumes that μ and λ_r remain constant as the configuration changes; this is not always so. The request rate λ_r depends on many factors which change with the number of processors and with topologies. Some of these effects are investigated in §6.5.2. Finally, the model takes no account of other overheads such as CPU costs for initiating requests, the costs of routeing messages, or other work such as load balancing. Suggestions of ways of incorporating some of these factors into the model are made in §7.1.3.

4.2 With response delay

In this section we wish to expand upon the concept of 'pure delay', discussing its effects from the point of view of the building of performance models as well as the programming paradigms used to mitigate its effects in parallel and distributed computing systems. Pure delay is the name we give to a delay which exists solely because of the structure of the system. It would typically be used to model the use of a communications medium, or some computation that is solely due to the parallelisation of a sequential algorithm. The most important property of this pure delay is that its value is independent of the loading placed on the system being modelled.

It is surprising that such delay is not discussed in the queueing literature. This can be explained by the fact that the vast majority of the queueing literature deals with open queueing systems (systems with infinite source populations), the results for closed queueing systems being presented as extensions of this work. In an open queueing system introducing such a pure delay has no effect on the arrival density, and hence on many of the performance parameters. This lack of effect can easily be seen given that the request rate from an infinite population is unaffected by any delay, provided the random process which describes that delay is independent. The effect of such a pure delay on response time is easily calculated: as loading, and hence server-related delay, is unchanged, the response time is simply increased by the amount of the pure delay.

When studying a closed system, such as the system described in §4.1, pure delay does have an effect. In such a system delay represents lost computational effort, and hence an overall decrease in the observed solution rate. The quantitative effects of this pure delay are discussed in the rest of this section. It is interesting to note that the effect of pure delay on server utilisation, and hence on the related performance measures, will decrease as the source population increases. This is because, in some senses, open queueing systems represent an upper bound on the properties of closed queueing systems [99]. Given this property it is likely

[Figure 4.11: Outline of the M/M/1/K/K queueing system with delay: as figure 4.2, with a delay stage on the return path from the server to the processors; τ = operating time, λ = composite rate of requests, w = total wait time, q = time in queue, s = time receiving service, N = number in queue or receiving service.]

that the effect of pure delay on response time will begin to mirror the simple effect of such delays in an open queueing system as the total population is allowed to increase.

4.2.1 M/M/1/K/K with delay

Delay can be an important factor in the performance of parallel systems [67]. This section takes the basic model from x4.1 and adds delay to it. The initial model of delay that is adopted is that of a Poisson process. In adding delay to this model it is assumed that there is no contention for the resource that represents the delay portion of the system. One way of viewing such a model of delay is to consider a service facility consisting of an in nite number of identical servers (a M=M=1 queueing system). This means that each delay that is applied to a response is independent of all other delays. The block diagram model for this system is illustrated in gure 4.11. Note that there is only one place where the delay is applied to the response. The positioning of delay in the cycle of behaviour has no e ect on either the performance or the response time. This is because the the operating time is still a Poisson process even when shifted in time; the important factor is the interval between requests. 103

K

p

1

t (m r ) 1

p

1

2

t (m  ) 4

4

t

2

p

4

p

3

t () 3

Figure 4.12: Petri net representation of gure 4.11

[Figure 4.13: State-transition diagram for a two-processor system with delay: states S_{i,j} (i outstanding requests, j of them undergoing delay) S_{0,0}, S_{1,0}, S_{2,0}, S_{1,1}, S_{2,1}, S_{2,2}; requests at rates 2λ_r and λ_r, service at rate μ, delay expiry at rates ν and 2ν.]

The behavioural diagram for this system is illustrated in figure 4.12. In this behavioural model the delay is represented as a transition from a place (p_4) whose rate is dependent on the number of tokens at that place. The average delay that each token experiences is 1/ν, giving the rate at which the delay occurs for a single request as ν. In constructing the Markov chain corresponding to this behavioural model, a state in the Markov chain has to capture two characteristics of the system. The first is the number of outstanding requests (ie m_2 + m_3 + m_4); the second is the number of these that are actually needing service (or conversely the number undergoing delay, m_4). This leads to a triangular system of states in the Markov chain. Figure 4.13 illustrates this, portraying a system consisting of two processors. In this figure the request and service rates (λ_r and μ) have the same interpretation as in the previous section.

To clarify the correspondence between the Markov chain representation and the physical system, consider the following sequence of events for this two-processor system. Let the two processors be A and B and assume that both processors are running, ie the system is in state S_{0,0}.

the request. 2. The request is processed (system goes from S1;0 to S1;1). Processor A remains idle. Processor B continues running. The server becomes idle. Processor A is now awaiting the `expiry' of the delay. 3. Processor B issues a request. System goes from S1;1 to S2;1. Processors A and B are both idle. The server becomes busy. Processor A still awaits the expiry of the delay. 4. The delay on the response to the original request expires. The system goes from S2;1 to S1;0. Processor A restarts. Processor B remains idle. The server remains busy. Note that due to the memoryless nature of the Poisson process, in the above scenario the transition between S1;1 and S0;0 is just as probable as the transition between S2;1 and S1;0; the changing of states does not have any e ect on the expiry of the delay. The expiry of a delay  corresponds to a processor re-commencing useful work. It should be noted that in this model the server is idle in several of the possible states (S0;0, S1;1 and S2;2 in gure 4.13), and the system is performing useful work at the same rate in several states, eg S0;0 (2 processors running), S1;0 and S1;1 (1 processor running).

4.2.2 Derivation of properties of this model As for the M=M=1=K=K model, the derivation of properties fall into the same two distinct stages as described in x4.1.1.

4.2.2.1 General properties The relationship between the request rate, service rate, delay rate, server loading and wait time can be derived using the same conservation of ow argument as in section 4.1.1.1. The throughput of the service facility is the product of the server loading times and the average rate at which requests are processed: (4.23)

T =  106

request

#

serviced

response

#

waiting & service time ! w+s

#

request

#

operating time !

delay !

Figure 4.14: The calculate, request, calculate cycle of a single processor The rate at which requests are arriving at the server can be derived by looking at the behaviour of a processor. A processor generates a request once per cycle, as illustrated in gure 4.14. Such a cycle takes W + E [delay] + E [ ] period of time where W represents the time spent queueing for and receiving service, E [delay] the average delay that a response to such a request experiences after being serviced, and E [ ] the average operating time of a processor before making another request. Hence a single processor generates requests at a rate: (4.24)

1

W + E [delay] + E [ ]

and the total rate at which requests are generated by the system of K processors is: (4.25)

K W + E [delay] + E [ ]

By the conservation of ow, the rate at which the requests are generated must equal the rate at which they are processed: (4.26)

K  = W + E [delay] + E [ ]

Replacing E [delay] with 1= and E [ ] with 1=r , the average wait time for a request queueing for and receiving service is: (4.27)

K ,1, 1 W =   

r

where  represents the server fractional utilisation. This formula is similar in structure to equation (4.3).

107

4.2.2.2 Distribution related properties

The relationship between the load intensity, delay and server loading is dependent upon the request, service and delay rate distributions being exponential. The mapping between the Markov chain states and the performance of the system is more complex than in the M/M/1/K/K queueing system. We need to identify those states in the Markov chain representation that correspond to a specific number of tokens at the various places in the Petri net description of the system in figure 4.12. Note that the state S_{i,j} represents i requests outstanding (i processors idle; m_1 = K - i), of which j requests have been serviced and are undergoing delay before being returned to the requesting processor. Hence the number of requests that are queued or receiving service is i - j (m_2 + m_3). The server is idle when there are no outstanding requests, or when all the outstanding requests are undergoing a delay, ie when i = j. The total server idle is thus the sum of the probabilities of all the states in which the server is idle:

(4.28)   Server Idle = Σ_{i=0}^{K} P_{i,i}

108

be written down:

Kr S0;0 =  S1;1 ( + (K , 1)r )S1;0 = Kr S0;0 +  S2;1 ( + (K , 1)r )S1;1 = S1;0 + 2 S2;2

.. . ( + j + (K , 1)r)Si;j = (K , i + 1)rSi,1;j + Si;j,1 + (j + 1) Si 1;j 1 if j < i ^ j > 0 ^ i < k (an interior node) +

+

( + (K , 1)r)Si;j = (K , i + 1)rSi,1;j + (j + 1) Si 1;j if j = 0 ^ i < k (typical top edge node) +

1

+

( + j + (K , 1)r)Si;j = Si;j,1 + (j + 1) Si 1;j 1 if i = j ^ i < k (typical lower edge node) +

+

In order to contain the complexity, it is useful to be able to write a general equation for any node in the state graph. The complications arise at the edges of the state graph and can best be dealt with by defining a step function σ with the following properties:

(4.29)   σ(x) = 1 if x > 0, and σ(x) = 0 otherwise

Thus the general local flow balance equation for any state in the Markov chain system can be written:

(4.30)   (σ(K-i)(K-i)λ_r + σ(i-j)μ + σ(j)jν) S_{i,j} = σ(i)σ(i-j)(K-i+1)λ_r S_{i-1,j} + σ(j)μ S_{i,j-1} + σ(K-i)(j+1)ν S_{i+1,j+1}

! ui dj P Pi;j = (K K , i)!j ! r 0;0

where u = r = (loading intensity) and dr = = (delay ratio). The delay ratio dr is the ratio of average delay over average service time. As the average delay 1= tends to zero, the delay `rate' tends to 1 and the delay ratio dr tends to zero. Using the normalising condition, the probability of being in the state S0;0 is (4.32)

2K i XX P0;0 = 4 i

=0

j

=0

3,

K ! ui dj 5 (K , i)!j ! r

1

It can be demonstrated that the value of  for this system, an M=M=1=K=K system with delay, tends to that of a M=M=1=K=K system without delay as dr ! 0.

110

4.2.3 Performance measures of this model

4.2.3.1 Server loading

The server loading ρ is the period for which the server is not idle. The server is idle in the states identified in equation (4.28):

(4.33)   ρ = 1 - Σ_{i=0}^{K} P_{i,i} = 1 - Σ_{i=0}^{K} (K! / ((K-i)! i!)) u^i d_r^i P_{0,0} = 1 - (1 + u d_r)^K / ( Σ_{i=0}^{K} Σ_{j=0}^{i} (K! / ((K-i)! j!)) u^i d_r^j )

The total rate of requests from all processors has already been derived in the previous section:

(4.34)   λ = ρμ = ( 1 - Σ_{i=0}^{K} P_{i,i} ) μ
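Equations (4.32) and (4.33) are as mechanical to evaluate as their delay-free counterparts. A sketch (our illustration only) that computes ρ for the delay-full model and checks that d_r = 0 recovers the delay-free utilisation:

```python
from math import factorial

def rho_with_delay(u, dr, K):
    # rho = 1 - (1 + u*dr)**K / norm   -- equations (4.32) and (4.33)
    norm = sum(factorial(K) / (factorial(K - i) * factorial(j)) * u**i * dr**j
               for i in range(K + 1) for j in range(i + 1))
    return 1.0 - (1.0 + u * dr) ** K / norm

def rho_no_delay(u, K):
    return 1.0 - 1.0 / sum(factorial(K) // factorial(K - i) * u**i for i in range(K + 1))

# As dr -> 0 the delay-full model reduces to the delay-free one:
assert abs(rho_with_delay(0.1, 0.0, 8) - rho_no_delay(0.1, 8)) < 1e-9
print(rho_with_delay(0.1, 5.0, 8))   # delay depresses the utilisation
```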

4.2.3.2 Average queue length and response time

The average response time has already been defined in equation (4.27) and, by Little's result, the average length of the queue of processors requesting service is

(4.35)   L = λW = ρμW = K - ρμ/λ_r - ρμ/ν = K - ρ(d_r + 1/u)

where ρ is defined in equation (4.33), d_r = μ/ν and u = λ_r/μ. The total response time W consists of the queueing time W_q plus the service time

W_s. The queueing time W_q is thus:

W_q = W - W_s = W - 1/μ = K/(ρμ) - 1/ν - 1/λ_r - 1/μ

W + E [delay] W + E [delay] + E [ ] + 1= = W +W1= + 1=r r (W + 1) =  (W + 1) +  r

Fractional Idle =

(4.36)

The wait can be expressed in terms of fractional idle + idle , r W = r idle r  (1 , idle) (r +  ) , r = idle (1 , idle)r Combining this with the de nition of W in equation (4.27) we get:

(4.37)

idle(r +  ) , r K 1 1  ,  , r = (1 , idle)r u = K (1 , idle)

This equation is identical to the corresponding equation (4.12). It is interesting to note that the delay rate  does not appear explicitly in this equation, although  is dependent upon its value. This implies that for a physical system it is the server utilisation that has a direct e ect upon the processor idle time, and hence other performance values, not the delay itself.

112

Taking the value of  from equation (4.33) this becomes

P

K Pi;i u = 1K,(1 ,i idle ) =0

Although  is a polynomial in u and dr , the roots of the resulting polynomial are not dicult to nd. This can be done by numerical means, given a knowledge of the delay dr .

4.2.3.4 Speedup Absolute speedup As in the M=M=1=K=K case there are two approaches, one

based on the system state as a whole and the other on an individual processor. As the second approach is in some ways clearer, that approach will be used here. From the processing cycle as described in gure 4.14 the rate of processing of useful work on one processor is: 1=r W + 1= + 1=r

(4.38)

Hence the rate on K processors is:

K r W + r + 

(4.39)

Substituting W from equation (4.27) the rate of processing is: (4.40)

K  =   ,  , r r + r + 

K

1

1

This is identical in form with the corresponding equation for the model without delay (equation (4.17)). The relationship between loading intensity and delay ratio is illustrated in gures 4.15 and 4.16. The asymptotic limit on the absolute speedup is dependent on K  = 1 lim 1 , X lim Pi;i K !1 u u K !1 i

!

!! K X 1 = u 1 , Klim Pi;i !1 i =0

=0

113

Absolute Speedup

20 18 16 14 12 10 8 6 4 2 0

dr = 0 dr = 1 dr = 2 dr = 5 dr = 10 dr = 20 dr = 50 5

10

15

20 25 30 35 Number of Processors

40

45

50

Absolute Speedup

Figure 4.15: Absolute speedup for u = 0:05 for varying dr 10 9 8 7 6 5 4 3 2 1 0

dr = 0 dr = 1 dr = 2 dr = 5 dr = 10 dr = 20 dr = 50 5

10

15

20 25 30 35 Number of Processors

40

45

Figure 4.16: Absolute speedup for u = 0:1 for varying dr 114

50

P



As is shown below the limK !1 Ki Pi;i is 0. The absolute speedup for this model is thus 1=u. This again is identical to the model without delay, and shows that pure delay does not e ect the asymptotic performance of a parallel processing system. To show that the asymptotic limit on absolute speedup is 1=u it is sucient to show that the server idleness tends to zero as the number of processors increases. As already discussed the server is idle in the states Si;i ; i 2 0::K , hence the server fractional idle becomes: Server Idle =

K X i

=0

Pi;i

!

=0

=

K X

K ! ui di P (K , i)!i! r 0;0

i K X

!2

=0

3,

K X i K ! ui di 4X K ! ui dj 5 = r r i (K , i)!i! i j (K , i)!j ! (1 + udr )K = PK P j i K i j K ,i j ui dr )K = PK (1K+ udrP i dj i K ,i ui j jr =0

=0

1

=0

!

=0

=0 (

)! !

!

2K 3 i dj , K i X X (1 + ud ) u r 4 r5 = K! ( K , i ) j ! i j =0 (

=0

)

!

1

=0

Noting that

i dj X r j

=0

j!

= edr ,

=0

1 dj X r j i

= +1

j!

that the rst term in the series is unity, and the rest of the terms in the series are all positive (as dr > 0), then the series is bounded by 1

Xi djr j

=0

dr j!  e

8i  0

P Also note that with the substitution j = K , i the series Ki Ku,i i becomes =0 (

K j K X (1=u) = u j! j j K j!

X uK,j 0

=0

=

115

)!

As K increases this summation tends to e =u hence 1

(4.41)

K X i

=0

ui = uK e =u ,  (K , i)! 1

where  ! 0 as K ! 1 P j Taking the least advantageous bound for the series ij djr , that of 1, the server fractional idle becomes =0

"

!

K (1 + udr )K X ui Server Idle  K! i (K , i)! K  (1 +Kud! r) uK (e 1=u , )

#,

1

=0





 udr K

1

1+

u

K!

Hence, the server fractional idle as the number of processors increases is less than or equal to



(4.42)

lim

K !1

 udr K u

1+

K!

which is 0.

Relative speedup As de ned in x3.3.1.1, relative speedup can be seen as the

ratio of rate of computation on K processors against the rate of computation on a single processor. Using the processing rate for this system from equation (4.40) and de nition of  from equation (4.33), the rate of processing on a single processor is

(4.43)

1 (1 , (P + P )) = 1 (1 , (1 + ud )P ) 0;0 1;1 r 0;0 u u   1 1 + ud r = u 1 , 1 + u + ud ) r 1 = 1 + u + ud r

Noting that udr = r = = E [delay]=E [ ], in the single processor case the difference between the relative speedup with and without delay is dependent on the ratio of the delay to the average operating time. 116

Relative Speedup

50 45 40 35 30 25 20 15 10 5 0

dr = 0 dr = 1 dr = 2 dr = 5 dr = 10 dr = 20 dr = 50

5

10

15

20 25 30 35 Number of Processors

40

45

50

Relative Speedup

Figure 4.17: Relative speedup for u = 0:05 for varying dr 50 45 40 35 30 25 20 15 10 5 0

dr = 0 dr = 1 dr = 2 dr = 5 dr = 10 dr = 20 dr = 50

5

10

15

20 25 30 35 Number of Processors

40

45

Figure 4.18: Relative speedup for u = 0:1 for varying dr 117

50

Relative speedup being the ratio of the absolute speedups, this gives

(4.44)   Relative Speedup = ρ (1 + u + u d_r)/u = ( 1 - Σ_{i=0}^{K} P_{i,i} ) (1 + u + u d_r)/u

The graphs in figures 4.17 and 4.18 illustrate the relative speedup for some values of the loading intensity and delay ratio.

4.2.4 Use of the formulae

For this model with delay, the measurable parameters that are available are the same as outlined in §4.1.3.5. However, in the model without delay there was only the requirement to find u, whereas in this delay-full model the equations yield polynomials in both u and d_r. To resolve the two unknowns, two independent measurements of performance parameters are necessary. As pointed out in the derivation of the properties of this model (§4.2.2), there are two distinct stages to the analysis: one uses the conservation of flow, the other the properties of Markov chain systems. In order to use one set of observations it is necessary to take measures that are derived independently of each other. In this model the derivation of ρ in equation (4.33) is dependent only on the Markov chain analysis, whereas all the other performance-related equations either require knowledge of λ (equation (4.34)) or make use of the conservation of flow via equation (4.27). Hence, there is a need to know the server idle as well as another performance measure in order to find u and d_r from the measurement of performance parameters extracted from a single configuration. If the results from multiple configurations are used it is possible to derive sets of simultaneous non-linear equations which can then be solved.
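The two-unknown problem can be attacked numerically in the same spirit as before. Assuming a measured server idle and a measured per-processor idle for one configuration, a coarse grid search over (u, d_r), minimising the squared residuals of equations (4.33) and (4.37), gives a usable starting estimate; the search ranges below are assumptions. This is our illustration only, not a method prescribed by the thesis:

```python
from math import factorial

def rho_with_delay(u, dr, K):
    norm = sum(factorial(K) / (factorial(K - i) * factorial(j)) * u**i * dr**j
               for i in range(K + 1) for j in range(i + 1))
    return 1.0 - (1.0 + u * dr) ** K / norm

def estimate_u_dr(K, server_idle, proc_idle, steps=100):
    # Grid search for (u, dr) matching the two measured idle fractions.
    best = None
    for a in range(1, steps + 1):
        u = 2.0 * a / steps                       # u in (0, 2]
        for b in range(steps + 1):
            dr = 50.0 * b / steps                 # dr in [0, 50]
            rho = rho_with_delay(u, dr, K)
            r1 = (1.0 - rho) - server_idle        # server idle residual, from (4.33)
            r2 = u * K * (1.0 - proc_idle) - rho  # processor idle residual, from (4.37)
            err = r1 * r1 + r2 * r2
            if best is None or err < best[0]:
                best = (err, u, dr)
    return best[1], best[2]
```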

4.2.5 Model of delay

The analysis has been presented here only with regard to delay which can be modelled as an exponential distribution; the proof in §A.1 uses a more general mechanism. One property of this proof is that for systems where there are as many servers as there are requests, as in the case of our model of delay, the actual distribution has no effect on such things as the contended-for server's utilisation, this being solely a function of the average delay. This property holds for all delay distributions which can be represented as a Coxian distribution (ie have a rational Laplace transformation). Thus the analysis presented here applies to most possible forms of delay, from the deterministic through the truly random to such distributions as the hyperexponential.

Model Characteristic | Definition | Defining equation (without delay) | Defining equation (with delay)
Message density at service point | λ = ρμ | 4.8 | 4.34
u in terms of per-processor idleness | u = ρ/(K(1 - idle)) | 4.12 | 4.37
Absolute speedup | ρ/u | 4.17 | 4.40
Asymptotic limit | 1/u | 4.22 | 4.42

Table 4.3: Model characteristics not explicitly dependent on delay

4.3 Use of single threaded models The aim of this section is to look at the relationship between the two models, with and without delay, and to show how the formul that the models generate can be used.

4.3.1 Importance of  As can be seen from inspection of the formul in the preceding sections in this chapter, the server utilisation () occupies a pivotal r^ole. The server fractional utilisation is central to many measurements. This is because, in the basic model that has been adopted, the performance of `work' is dependent upon the completion of cycles, each cycle requiring the response to some request. The passage of a request through the server equates to a certain amount of computation having been performed. The rate at which requests are processed through the service facility is thus the rate at which computation is being performed in the system as a whole. It is possible to divide the equations derived in the delay-free and the delay-full 119

De ning Equations

Model Characteristic

without delay K , Response Time W =  r Server Fractional  = 1 , PK K ui 1

1

Utilisation Processor Idleness Server Average Queue Length Time Spent queueing for service Relative Speedup

i=0 (K ,i)! !

W W =r

K , , Wq =  r  1

u

(1+ )

u

1

1

(1+

)

!

=0 (

=0

(

L = K , u 

K , , W =  4.3 4.27  Kr ud r 4.7  = 1 , PK Pi K i j 4.33 K ,i j u dr i j r W 4.11 4.36 r W  (

+1

1

with delay

)! !

+1)

+1)+





4.9

L = K ,  dr + u

4.10

K , , , Wq =   r 

4.36

4.21

P 1 , Ki Pi;i

4.44

1

1

1

1

=0

Table 4.4: Model characteristics explicitly dependent on delay models into two categories, those dependent solely on  and u and those that are dependent on the delay. These two sets are summarised in tables 4.3 and 4.4. This similarity is more than an apparent likeness in the derivation of equations of the same form: there is an underlying property that leads to the concept of ` ow-equivalence' between models with and without delay; this is expanded in x4.3.3.

4.3.2 Comparison between models and `linear speedup' Any model of parallel processor performance needs to be able to cast light on the question of the achievability (or not) of linear speedup. As has already been mentioned, the rate at which a system is performing computation is proportional to the fractional server utilisation. This fractional utilisation is bounded by unity, hence there is an upper limit on the maximum speedup of the model. This upper limit (in terms of absolute speedup) is 1=u (equations (4.22) and (4.42)). This implies that unbounded speedup, let alone linear speedup, is not possible in the presence of contention for a common resource. Although there have been many arguments against such unbounded speedup from the 1970's onwards [8, 39, 51, 82] there has been much discussion of `linear' speedup and many models have already shown that it is not achievable in practice [39, 54, 55]. 120

4.35

Idealised

Server Utilisation

1 0.8 0.6 0.4 0.2 0

5

10

15 Processors

20

25

30

Figure 4.19: Comparision between idealised server utilisation and  In terms of a contention model used here, linear speedup equates to a growth in fractional server utilisation which is directly proportional to the number of processors. This would require the fractional server utilisation to have the following formulation: 0 = min(1; K  0 ) 1

where 0 is the fractional server utilisation that is experienced when one processor is in the system. As has already been illustrated, as the number of processors increases,  does not increase linearly then level o , but approaches unity asymptotically. The graph in gure 4.19 illustrates this for u = 0:1. As can be seen from the graph in gure 4.20 the fractional di erence between the idealised case and the server utilisation for this model grows to a maximum at (K = 11) and the di erence is small for both small and large numbers of processors. As speedup is directly related to server utilisation the di erences between the idealised and the actual server utilisation correspond directly to fractional di erences between linear speedup and actual speedup. As u decreases, the two curves remain close to each other for larger and larger numbers of processors. The table 4.5 illustrates this by showing the number of 1

121

0.18 Di erence in Server Utilisation

0.16 0.14 0.12 0.1 0.08 0.06 0.04 0.02 0

5

10 15 Number of Processors

20

25

Figure 4.20: Di erence in server utilisation between idealised and  processors in a system before the deviation from the linear utilisation is more than 5%. K represents the maximum relative speedup for the given loading intensity u, K represents the number of processors at which the di erence from linear speedup rst exceeds 5%. Given the close t of a linear model to the value of  for small u, it is not unreasonable that the original workers in this area saw such close- tting results as being consistent with linear speedup. This supposition is even more understandable in the presence of delay, as is illustrated in gures 4.17 and 4.18. 0

1

Max Relative Speedup (K ) 0.5 3 0.2 5 0.1 11 0.05 21 0.025 41 0.01 101 0.005 201

u

0

K

1

2 3 8 16 35 96 201

% Di erence % Di erence at K at K 6.67 21.05 5.07 19.91 6.56 16.32 5.39 13.14 5.39 10.18 5.22 6.97 5.13 5.13 1

0

Table 4.5: Number of processors for > 5% deviation from linear 122

4.3.3 Equivalence of the models As relative speedup can be derived from the run time of a problem on di ering numbers of processors, relative speedup is the only performance measure that is quanti able while treating the operation of the system as a `black box'. As shown in both x4.1.2.4 and x4.2.3.4 relative speedup is proportional to server utilisation. It is reasonable to question whether the two variations of the model can be distinguished solely by the measurement of relative speedup. Initial empirical studies showed that this did not appear to be the case and, moreover, that there was an equivalence between the systems with and without delay. This relationship is between the two loading intensities:

u0 = 1 +uud

(4.45)

r

where u0 is the loading intensity for the model without delay and u and dr are the loading intensity and delay ratio, respectively, for the model with delay.

Equivalence of models with and without delay If the relative speedup for two models is identical then the following relationship must hold: 1 + u0 0 = 1 + u + udr 

u0

u

where u0 and 0 are the loading intensity and server utilisation for the model without delay. Using the substitution from equation (4.45) this becomes 1 + u=(1 + udr ) 0 = 1 + u + udr  u=(1 + udr ) u 1 + u + udr 0 = 1 + u + udr  u u Which reduces the proof of equivalence to showing that 0 =  under the substitution of equation (4.45).

123

From equations (4.7), (4.6) and (4.33) this becomes (1 + udr )K = 1 , P P j K i K 0i i j K ,i j ui dr K ,i u 1 (1 + udr )K =

1 , PK 1K i

0 = 

!

!

=0 (

PK i

K K ,i !

=0 (

=0

)!

)!

PK Pi

u0i

i

j

=0

=0 (

)! !

K i j K ,i j u dr !

=0 (

)! !

Inverting these equations, bringing out the K ! term, taking the (1 + udr )K term inside the summation and making the substitution of equation (4.45) we get

K!

K X i

=0

 u i K X i X ui djr = K ! (K , i)! 1 + ud (1 + ud )K (K , i)!j !

K X i

=0

1

r

i

=0

j

r

=0

K X i ui (1 + ud ),i = X uidjr r K (K , i)! i j (1 + udr ) (K , i)!j ! =0

=0

Multiplying by (1 + udr )K K X i

=0

K X i ui (1 + ud )K,i = X ui djr r (K , i)! i j (K , i)!j ! =0

=0

The next stage is to show, term by term, equality between the two sides. Using the binomial expansion for (1 + udr )K , on the lhs: 1

K X i

=0

=

,i (K , i)!(ud )k ui KX r (K , i)! k k!(K , i , k)!

K KX ,i X i

=0

letting j = i +k

=0

k

=0

ui k dkr (K , i)! k!(K , i , k)!(K , i)! +

j K X X j

=0

k

=0

uj dkr k!(K , j )!

which is term by term equivalent to the rhs. This linear relationship between the loading intensities has several interesting consequences: 1. From the inspection of relative speedup alone it is not possible to distinguish between the model with delay and the model without delay. This implies 124

that such systems are not distinguishable by external observation alone. 2. Each value of u0 (loading intensity without delay) has an equivalence family of u and dr values related by u0 = u=(1 + udr ) 3. The asymptotic value of the relative speedup of a system without delay is equal to the corresponding asymptotic value for the equivalence family. 4. To use the model with delay to derive all performance parameters, it is necessary to be able to measure one additional parameter such as per processor idleness, loading intensity, response time, etc, all of which are only internally visible. 5. The increase of delay in a system (increase of dr ) reduces the u0 family to which it belongs, thus increasing the maximum relative speedup. This phenomenon has been reported by Gustafson [54]. What is the physical interpretation of these equivalence families? Noting that u = r = = E [service]=E [ ] and that dr = = = E [delay]=E [service]

E [service] 1 u = 1 + udr E [ ] 1 + E [delay]=E [ ] E [ ] = E [service] E [ ] E [ ] + E [delay] = E [ E][service] + E [delay] Assuming that  is the same between the two models this gives the following relationship (4.46)

1 = 1 +1 0r r 

where 0r is the processor request rate in the model without delay. This implies that, from the point of view of server utilisation, the model with delay behaves as if it were the model without delay, given that the delay time was added into the average time between requests from the processor. This is the ` ow equivalence' referred to earlier. This, in turn, implies that not only can all the delays in a system be moved into one delay, but also that this one delay can be moved out of the general model into 125

the model of the processors' behaviour without any loss of generality. Hence the simpler delay-free model can be used. This property can be viewed in another way. A system with delay can be modelled in the same way as the simpler delay-free model by replacing the portion of the delay-free model that represents the processor by an equivalent unit. This equivalent unit is chosen so as to generate the same rate of requests as if it were a processor with delay. Using the appropriate substitutions from equations (4.45) and (4.46) from tables 4.3 and 4.4, it can be seen that all the performance equations for the system with delay are equivalent to those for the system without delay after appropriate substitutions. The one exception is that of per-processor idle. The reason for this exception is that this is the one equation that, in using the internal behaviour of the processor in its derivation, looks at E [ ] and delay as separate entities and not in combination. Hence the principle of ow equivalence does not hold. The di erence between these two formul is best explained by viewing the idle time in the delay-full model as consisting of two forms of idle time; contentionbased and delay-based. In the delay-free model the formula for idleness (equation (4.11) is based solely on the contention for the server. In the delay-full model the idleness consists of both forms of idleness. The concept of delay-based idleness gives another viewpoint on why the asymptotic absolute speedup is una ected by the presence of delay. The fraction of a processor's time that is spent in delay-based idleness is

E [delay] W + E [ ] + E [delay] As the number of processors in the system increases, the time spent queueing for and receiving service (W ) increases while the delay time remains constant, thus the delay-based idleness decreases.

4.3.4 Practical analysis using this model Section 6.1 contains an analysis of data obtained over several months of a computationally intense parallel algorithm. The system was built with substantial 126

performance instrumentation to allow for validation of the model for at least one particular system. The original system was designed to examine the e ects of caching on the execution time of the problem; this version was instrumented to measure performance issues such as processor idle, request rates etc. The raw information, along with equations from this chapter, is used to derive performance characteristics used in the model (eg r , dr ) and validate the derived performance measures (eg processor idle). The model ts the observed scalability performance very closely, with the standard deviation of the predicted to observed load (u0 ) being less that 2  10, . 3

4.4 Summary In this chapter we have developed two Markov chain models of multi-processor performance. These models have been generated from the analysis of the behavioural characteristics of processes that make up a parallel system. In deriving the performance parameters based on the Markov chain representation, we have shown the central r^ole that the server utilisation takes in the determining of the absolute rate of processing. We have used the properties of  to give a possible explanation for the reporting of linear-speedup and have shown that a linear approximation to speedup becomes better as the per-processor loading intensity u gets smaller. The reduction in u can be due, either to reduced demands of the contended-for service, or by increasing the delay within the system. In studying systems with delay, we have shown that there is an equivalence between the delay in the system and the value of the loading intensity u. This equivalence is such, that by external observation alone, it is not possible to distinguish systems that have no delay present from systems that have delay. This has several implications on the measuring of parallel system performance as well as their modelling.

127

Chapter 5

Multiple Threads per Processor As was illustrated in gures 4.15 and 4.16, the request for service combined with delay can lead to large periods of time in which the processor remains idle awaiting a response. To mitigate this e ect the concept of latency hiding is often used [67]. In a system with latency hiding there is more than one thread of computation allocated to each processor. When the processor is awaiting a response to a request for a particular thread, it can continue with the computation of another thread. If there is an abundance of potential threads available in a particular problem, then the opportunity cost of allocating many such threads to an individual processor is small. However, if the number of such threads is of the same order of magnitude as the number of processors, or if the problem is such that the number of threads in the problem is variable, then the choice of the number of threads to allocate to each processor may become an important factor in the overall performance of a parallel system. The use of several threads per processor attracts problems of its own; one of these is the load balancing of the available threads amongst the available processors. This is an area that has attracted a large amount of interest [17, 28, 67, 84, 108]. In a comparison of some of the available techniques, Luling, Monien and Ramme [84] have demonstrated that there are good techniques for sharing the available work between processors, where the available work on each processor can be classi ed into three categories: low, medium and high. This has transformed the problem into the choice of numerical values for these three categories; currently this is done by trial and error [83]. 128

It is hoped that the quantitative use of the results of this chapter will provide a basis for the estimation of the `optimum' level of multi-threading in systems in which there is contention for common resources. In addition it will provide means of estimating overheads associated with thread switching, and, in the case of those systems in which the granularity of the thread can be altered, provide some guidance for the optimal choice of both the granularity and the number of threads per processor. In the analysis in this chapter it is assumed that the number of threads per processor is constant, and that this balance can be maintained with no extra overhead. The consequences of an uneven allocation of threads to processor is discussed in x7.1.2. As in previous chapters the concrete modelling example will be that of a distributed virtual memory system. In looking at the e ects of increasing the number of threads on a processor the simplest model is that of a single processor with a variable number of threads. This is considered here before moving onto systems with many processors.

5.1 Single processor without delay As in the case of the multiple processor systems investigated in the last chapter we will look at the case of a multi-threaded single processor both with and without delay present. The rst case to be considered is that of a multi-threaded single processor without delay. The petri net description of the behaviour of this system is illustrated in gure 5.1. In this system there are K threads available for execution. When being executed, each thread makes a request of the underlying virtual memory system at the rate r . There is a queue of unsatis ed requests (m ), the action of the service centre being represented by the cycling of the token from p to p . Taking the number of outstanding requests (m + m ) as the state variable, this behaviour is equivalent to the Markov chain system illustrated in gure 5.2. Before discussing the solution to this system it is worth brie y addressing the e ects of the scheduling strategy being used to apportion the processors' time amongst the threads that are active on that processor. In the case where there is no cost in switching between the threads, any (per-processor) work conserving scheduling strategy produces the same overall rate of requests. This is best 2

4

2

129

3

3

::: p K

1

t (r ) 1

p

2

p

4

t

2

p

3

t () 3

Figure 5.1: Behaviour of a multi-threaded single processor

r 0

r 1



r K ,1

2



K 

Figure 5.2: State-transition-rate diagram for behaviour of gure 5.1 130

illustrated by taking the two extreme examples. One extreme is that of the processor being allocated entirely to a single thread until that thread makes a request, the processor being then allocated to another thread (if one is present). During any (suitably small) time interval the probability of a request occurring is equivalent to the probability of any single thread making a request in the same interval, which is r . This is true even when a new thread is chosen, as the exponential distribution has no memory of past events and hence the time to the next event is not dependent on the time that the thread has been active. The other extreme is that the processor is being equally allocated (in in nitely small quanta) to each of the available threads. Given the assumption that there is no overhead in this processor-sharing operation, at any one moment let the number of available threads be m (0 < m  K ). Each one of the threads is receiving 1=m of the available processor power and is hence performing its task at 1=m of its maximum rate. The rate of requests from a particular task is thus r =m. However there are m such threads, and by the super-imposition property of the Poisson process, this is equivalent to the sum of the rates ie m  r =m which is r . It can readily be seen that any scheduling algorithm that is work-conserving will cause the processor to generate requests at the rate r until there are no more threads to be executed. To de ne the performance characteristics it is rst necessary to derive the probabilities of states and other factors from the Markov chain system.

5.1.1 Derivation of the properties of this model As can be seen from the state-transition diagram in gure 5.2, the system generates requests at an average rate of r until the Kth request has been made. This request behaviour is similar to the queueing system M=M=1=K , in which there is a limited waiting room, where requests arriving when the waiting room is full (ie the system is in SK) are discarded.

131

The steady-state balance equations are:

r P0 = P1 r P1 = P2

.. . r PK,1 = PK that is

Pj = uj P0

(5.1)

where u = r = This, combined with the normalising equation, implies that

P0 = 1 ,1 ,uKu

(5.2)

+1

which gives the steady-state distribution of the queueing system

, u)uj ; Pj = (1 1 , uK

(5.3)

0jK

+1

This equation has a singularity at u = 1; however, by application of L'Hopital's rule it can be shown that at u = 1, Pj = K for all j . This gives 1

+1

(5.4)

8 < Pj = :

,u uj ,uK K (1

)

1

+1

1

+1

for r 6=  for r = 

0jK

5.1.2 Performance measures 5.1.2.1 Server loading In this model there are two measures of the intensity of the loading: the rst is that of u(= r =), which can be thought of as the thread loading intensity; the second is the loading intensity that the server perceives, this being the combined intensities of the threads that are running on the processor. This is the product 132

of the time that the processor is actively processing any thread, and the thread loading intensity ie (1 , PK)u. From the server's point of view, the server is active whenever the processor is not in the state P0: (5.5)

8 <  = (1 , P0) = : 1 , 1,

,u ,uK K

r 6=  r = 

1

+1

1

1

1+

5.1.2.2 Processor idleness The processor is idle when the system is in the state PK from equation (5.4). Namely: (5.6)

8 < PK = :

,u uK ,uK K

(1

r 6=  r = 

)

+1

1

1

1+

5.1.2.3 Speedup In this model of a single processor system, the maximum rate at which work can be performed occurs when the processor is never idle. This is precisely the de nition of absolute speedup in x3.3.1.2. As there is only one processor in the system under consideration, the concept of relative speedup does not really apply. However absolute speedup is a reasonable measure. The system is performing work related to the task in hand whenever it is not in state PK: Absolute Speed = 18, PK < 1 , ,,uuKuK = : 1, K (1

(5.7)

8 < = :

1

1+

,uK ,uK K K 1

1

)

+1

1

+1

1+

r 6=  r =  r 6=  r = 

In the case of a multi-threaded execution, where the threads come from the same basic task, there are two e ects that limit the absolute speedup. The rst is the processor being completely utilised in the performance of the task (ie there is no idle time on that processor). The second is that the resource contended for is saturated; even if there is idle time on the processor and more available threads, 133

1

K= 1 K= 2 K= 5 K = 10

0.9 Absolute Speed

0.8 0.7 0.6 0.5 0.4 0.3

0

0.2

0.4

0.6

0.8 1 1.2 Load Intensity u

1.4

1.6

1.8

2

Server Utilisation 

Figure 5.3: Variation of absolute speed for xed level of multiprogramming 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

K= 1 K= 2 K= 5 K = 10 0

0.2

0.4

0.6

0.8 1 1.2 Load Intensity u

1.4

1.6

1.8

2

Figure 5.4: Variation of server utilisation for xed level of multiprogramming 134

1

Absolute Speed

0.9 0.8 0.7

u = 0:1 u = 0:25 u = 0:5 u = 0:75 u = 1:2

0.6 0.5 0.4

1

2

3

4 5 6 7 8 9 10 Level of Multiprogramming Figure 5.5: Variation of absolute speed for xed loading intensity

it is not possible to execute the task any faster. This leads to an asymptotic performance as K ! 1 of max(1; 1=u). As can be seen in gures 5.3 and 5.4, both the absolute speed at which this system runs and the server utilisation are dependent, rstly, on the loading intensity u and, secondly, on the level of multiprogramming K . The e ect of increasing the level of multiprogramming (increasing K ) is to keep the processor running at higher thread loading intensities (u) ( gure 5.3). As this loading intensity approaches and passes 1, the absolute speed reduces; this is the e ect of the system going from processor saturation to server saturation. It is illustrated in the graph in gure 5.4. These inherent limitations are perhaps better illustrated in gures 5.5 and 5.6. Here it can be seen that for low loading intensities ( 1) the absolute speed is dependent on the level of multiprogramming, the server utilisation remaining low. In the other cases the absolute speed is soon limited by the server utilisation. In the introduction we asked what the quantitative e ects of increasing the perprocessor concurrency for a given system would be. Here, even though the system only consists of a single processor, we can begin to answer this question. In most systems the value of r = will be xed by the algorithm and the system 135

1.2

Loading Intensity

u = 0:1 u = 0:25 0:5 1 uu = = 0:75 u = 1:2

0.8 0.6 0.4 0.2 0

1

2

3

4 5 6 7 8 9 10 Level of Multiprogramming Figure 5.6: Variation of server utilisation for xed loading intensity

0.18

u = 0:1 u = 0:25 u = 0:5 u = 0:75 u = 1:2

Absolute improvement

0.16 0.14 0.12 0.1 0.08 0.06 0.04 0.02 0

2

3

4

5 6 7 8 9 10 Level of Multiprogramming Figure 5.7: Absolute reduction in processor idle for xed loading intensity 136

0.4

u = 0:1 u = 0:25 u = 0:5 u = 0:75 u = 1:2

Relative improvement

0.35 0.3 0.25 0.2 0.15 0.1 0.05 0

2

3

4

5 6 7 8 9 10 Level of Multiprogramming Figure 5.8: Relative reduction in processor idle for xed loading intensity

designer would wish to see the e ects of varying the level of multiprogramming on both the fractional idle of the processor (with the corresponding increase in the absolute speed) and the server utilisation. As a processor is idle when the Markov chain is in state PK, the absolute di erence in idleness between two consecutive levels of multiprogramming is: processor idle = PKK , PKK,,1 , u)uK , (1 , u)uK, = (1 K 1 , uK  1 ,u u  1 K , = (1 , u)u 1 , uK , 1 , uK (

)

(

1)

1

+1

(5.8)

1

+1

which is illustrated in gure 5.7 for various values of u. As can be seen, the e ects of adding another thread to a processor with only one thread are highly dependent on the loading intensity u. Where u = 1:2, the addition of another thread increases the absolute speed by 15% { making the twothreaded system run faster by 33% than the system with only one thread. However, the addition of the third thread, while raising the absolute performance by 22%, gives a relative performance improvement of 12% over the two-threaded case, which is 49% better than the single-threaded case. Where the loading intensity 137

is much less, ie u = 0:1, the absolute gain in performance is much less, being 8.2% and 9% for the second and third threads. This is a relative performance improvement of only 9% in the two-threaded case and 9.9% in the three-threaded case. The relative improvement of the three-threaded case over the two-threaded case is therefore only 0.8%. Alternatively, the designer may be interested in the relative gain in increasing the processor's level of multiprogramming, ie the measure

PKK , PKK,,1 1 , PKK (

(5.9)

)

(

(

1)

)

which is illustrated in gure 5.8.

5.2 Single processor with delay In the case just considered the increase in multiprogramming serves only to reduce the time that the processor spends idle (ie time spent in SK), this having the direct e ect of increasing the server utilisation. The bene ts of latency-hiding are much more obvious when there is delay present in the system. In most of its practical uses, the major aim of increasing the level of multiprogramming is to mitigate the e ects of delay. The behaviour of a single processor system with multiple threads and delay is illustrated in gure 5.9. In this system the requests are generated as for the previous case, however the tokens (the responses) are delayed in p . This is a pure delay, ie each response is delayed independently of the number of responses that are currently experiencing a delay. This is coped with in the model by a marking related transition, t . The delay service rate is directly proportional to the number of tokens (responses) at p . The delay is again modelled as a service centre with an exponential distribution. The Markov chain corresponding to the system described in gure 5.9 is illustrated in gure 5.10. This has the same structure as the Markov chain in gure 4.13 for single-threaded multiple processors with delay. The di erence is the rate at which requests are generated. This is independent of the number of requests that are already in the queue. Using the same techniques as used in x4.2.2.2, namely the equilibrium ow balance 5

4

5

138

::: p K

1

t (r ) 1

(m  ) t 5

p

4

2

p

4

p

t

5

2

p

3

t () 3

Figure 5.9: Behaviour of a multi-threaded single processor with delay

139

r 0; 0

r 1; 0





2; 0



K; 0



r 1; 1

2; 1 2

 2; 2

K

K; K

Figure 5.10: State-transition-rate diagram for behaviour of gure 5.9

140

equations, the general equation for the Markov chain system is:

(5.10)

( (k , i) r +  (i , j )  +  (j ) j ) Si;j =  (i) (i , j ) r Si,1;j +  (j )  Si;j,1 +  (K , i) (j + 1) Si 1;j 1 +

+

It can be shown (xA.2) that the probability of the system being in a particular state is given by: i j

Pi;j = ujd! r P0;0

(5.11)

where u = r = (loading intensity) and dr = = (delay ratio). Using the normalising condition the probability of being in the state S0;0 is given by

3 2K i i dj , X X u r5 P0;0 = 4 j!

1

(5.12)

i

=0

j

=0

5.2.1 Performance measures 5.2.1.1 Server loading As in the case of the M=M=1=K=K queueing system with delay (page 108), the server is idle in many states, ie at any time when the number of requests undergoing P delay is equal to the number of outstanding requests, that is, Ki Si;i . The server is busy at all other times. This gives the server utilisation as =0

 = 1,

K X i

Pi;i

K ui di ! X r P = 1, 0;0 i ! i 2K i 3 K ui dj ! X i dj , X X u r 4 r5 = 1, =0

=0

1

(5.13)

i

=0

j!

141

i

=0

j

=0

j!

5.2.1.2 Processor idleness

P The processor is idle when all the K threads are awaiting a response ie Kj PK;j. Thus the processor fractional idleness is given by =0

fractional idle =

K X j

PK;j

0K 1 K dj X u rAP = @ 0;0 j ! j 12 K i 3 0K i dj , K dj X X X u u r5 rA4 = @ j ! j ! i j j 0 K 12 K i 3 j i dj , X X X d u rA4 r5 = uK @ =0

=0

1

=0

=0

=0

1

(5.14)

j

=0

j!

i

=0

j

=0

j!

5.2.1.3 Speedup As for the case without delay the only meaningful measure is that of absolute speedup which is simply 1 , fractional idle ie Absolute Speed = 1 ,

K X j

PK;j

0K 1 K dj X u rAP = 1,@ 0;0 j ! j 0K, i 12 K i 3, i j i j X X X X u dr A 4 u dr 5 = @ =0

=0

(5.15)

1

1

i

=0

j

=0

j!

i

=0

j

=0

j!

The same asymptotic limits on the performance exist as for the case without delay.

5.2.2 Levels of multiprocessing and latency hiding It is interesting to ask what physical interpretation can be made of the multithreading of individual processors. In contrast to the case of adding delay to the single threaded model in the previous chapter, there does not appear to be a simple substitution. There are two apparent metrics associated with the use of many threads on a processor: 142

45 40 35 30 25 20 15 10 5

0.2

0.3

0.4 0.5 0.6 0.7 Loading Intensity u

0.8

Delay Ratio dr

90 80 70 60 50 40 30 20 10 0.9

Figure 5.11: Contour Plot of Hiding Number for Single Processor

 The number of threads per processor that are necessary to produce the same

absolute speedup for a system with delay, compared to the same system without delay. This gives a cost measure for the delay that is present in the system. It is a quantitative measure of the cost of hiding the latency introduced in the system. We will call this quantity the hiding number.

 Number of threads per processor required to keep the processor running at, or above, a certain percentage utilisation. This incorporates the hiding of latency with the hiding of queueing and service times, with their knock-on e ects on scalability.

The graph 5.11 illustrates this hiding number for the single processor case. The hiding number represents that level of multithreading that is required to totally mitigate the e ect of delay on the systems performance. The starting point for this calculation is the `perfect' system in which there is no pure delay. The hiding number represents the level of multithreading that is required to cause a system which includes delay to perform at the same rate. The hiding number can be de ned for the single processor case as the minimum 143

value of N that satis es: (5.16)

PN , Pi j i PN Pi 1

=0

i

=0

j

=0

=0

ui djr j ui djr j !

!

 11,,uu

2

It is interesting that at high levels of utilisation (u ! 1), the hiding number appears to have a linear relationship with the delay ratio. This is consistent with physical intuition; as the utilisation approaches unity with many threads on the processor there is a constant stream of requests. The latency hiding required is simply the number of requests that are required to keep the delay portion of the network occupied. P The djr =j ! term in 5.16 appears to account for the general exponential shape of the contours of equal hiding number. This only strengthens the importance of the e ect of the loading intensity on the overall performance.

5.3 Multiple processors without delay This section looks at the extension of this model to include multiple processors, each with multiple threads of execution allocated to them. Other models of multiple processors with multiple threads of execution have always been used for shared memory multiprocessors [5, 15, 24, 62, 93, 94] combined with a global work-conserving scheduling strategy. In modelling distributed memory systems the scheduling strategy can no longer be viewed as globally workconserving, as the cost of moving execution of a thread to another processor cannot any longer be ignored. The approach taken here is not to model such load balancing but to insist that a request must return to the originating processor, thus re-enabling the thread on that processor. This approach means that the requests from one particular processor are distinct from the requests from all other processors, and any model must preserve this distinction. This class of problem is perhaps best introduced through an example. The simplest system in this class is the case with two processors (A and B ) each with many threads. The behaviour of this two-processor example can be found in gure 5.12. In such a system there are two places (pA and pB ) where requests can be queued. There are also two places (pA and pB ) where service can occur, however, only 2

3

3

144

2

:::

pA

pB

r

tB

1

M

tA 1

pA 2

:::

1

M

r

1

S

pB

1

tA

2

tB

2

2

pA

pB

3

tA 3

3

tB



3



Figure 5.12: Behaviour of two processors with many threads each

145

A

A

B

B

A

B

B

A

A

B

B

A B

B

B

A

A

A

Figure 5.13: State-transition-rate diagram for 2 processors each with 2 threads one such service can be active at any time. In order to return the requests to the originating processors, the source of the requests must be recorded. This requirement leads to a branching Markov chain state space which is illustrated in gure 5.13, for the case M = 2. Taking such a system and assuming that it is currently in the state in which no requests are outstanding, there are two possible ways in which the rst request can be made. Either processor A or processor B can make a request. For the purposes of this illustration let us assume that processor A makes the rst request. This leaves both processors running, but only one thread remaining on processor A, with two threads remaining on processor B . For the next request let us look at both cases.

A: if processor A makes a request then there are now two requests requiring service; processor A is idle whereas processor B is running and still has two threads available.

B : if processor B makes a request then there are now two requests requiring

service; neither processor is idle and each processor has one thread available. 146

This scenario can be repeated, the number of arcs from the current state to its successor being the number of processors still running (ie with remaining threads). Thus the rate that the system is performing computation while in a particular state is dependent on the number of arcs leaving that state to its successors. This is the request ow that is illustrated by the solid lines in gure 5.13. To keep the ordering of the events distinct, this being a requirement to model the return of a request to its originating process, in the underlying Markov chain it is necessary to distinguish between the situation where processor A generates a request followed by a request by processor B from the situation in which processor B makes a request followed by processor A. This is readily achieved where the states are named by the sequence of events that would lead to them. So in case of the examples above, the rst case (processor A followed by B ) would place the system in the S A;B whereas the second case (processor B followed by A) would lead the system into the state S B;A . In such a naming scheme the state in which no requests are outstanding is represented by S . Having dealt with the motion of the Markov chain system through the state space for the generation of requests we now turn to the motion of the system through the state space when requests have been serviced. Taking the queueing discipline to be FCFS, the satisfying of a request in S A;B would take the system back to S B . De ne a function on states, path , [

]

[

]

[]

[

[

]

]

path (Sx) , x

and a function target ,

target (Sx) , tail path (Sx )

where tail gives the tail of the sequence. The function target yields the state which is arrived at by the server completing processing of the oldest request in its queue. The application of target to all the states in the system generates the service ow which is illustrated by the dashed lines in gure 5.13

5.3.1 Simpli cation of the Markov chain representation The solution of this system as a system of linear equations soon becomes computationally non-viable as the total number of states grows exponentially in the product of the number of processors and the number of threads. However it is 147

possible to reduce the complexity of the state space by the aggregation of these states. The basis of this aggregation is to view the system as a linear birth-death Markov chain [6]. The state variable in this system is the number of requests queued or receiving service. For a system of N processors each with M threads the state variable lies between 0 and NM . We will call the states of the original branching Markov chain system that are aggregated into a single state in the birth-death Markov chain system a cohort. In viewing this as a linear chain, the easiest transition rate to visualise is that of the servicing of requests. As there is only one service centre operating at the rate , the transition rate from Sn to Sn,1 in the linear birth-death system is always , that is 8n 2 [1 : : :NM ]  n =  The transition rate for requesting service is based on the number of processors that still have threads available to be run. It can easily be seen that for the rst M , 1 requests no processor can yet have consumed all of the threads assigned to it. As each processor is generating requests at the rate r and each is a Poisson process independent of all the other processors, the rate at which the system is generating requests is Nr . However when there are M requests or more, some of the possible states represent a processor having consumed all of its threads, thus the request transition rate from this cohort must be less than Nr . In reasoning about systems with this structure it is useful to be able to talk about the set of states that represents n requests having been made. The function cohort extracts these states and is de ned as: cohort (S N;M ; n) , fx j x 2 S N;M  len (path (x)) = ng

where S N;M represents the set of all possible states and len yields the length of a sequence. Note that this is also the set of states that are reached after n requests have been made. It is also useful to talk about the size of a cohort; for this we de ne the function population which returns the number of states in a particular cohort: population (Ci ) , card Ci

Where Cn is a shorthand notation for cohort(S N;M ; n). We will use this shorthand extensively in the rest of this chapter. 148

In looking at the transition rate from cohort(S N;M ; n) to cohort(S N;M ; n + 1) it is necessary to make the assumption (proof in xA.3) that for every cohort in S N;M the steady state probabilities of every state within a particular cohort are equal. That is

8n 2 [0 : : :NM ]  8s ; s 2 cohort(S N;M ; n)  prob(s ) = prob(s )

(5.17)

1

2

1

2

Under this assumption, for a given cohort, Cn say, there will be some number of states in which N processors are running, some other number in which N , 1 processors will be running, and so on. If ri represents the number of states in Cn for which i processors are running then the following identity must hold: population (Cn ) =

(5.18)

N X i

ri

=0

As has been already mentioned in the cohorts Ci ; i 2 [0 : : :M , 1] every processor is always running, ie rn = population (Cn ). The other simple case is the request ow from CNM , to CNM . It can readily be seen that the population of CNM is the permutation of the NM possible events given that there are N sets of M duplicates, this gives 1

population (CNM ) =

(5.19)

(NM )! (M !)N

as each one of these states represents no processors running, the number of arcs from each component state in CNM , (each state representing one of the N processors left running) must be one. Thus the rate at which the system is processing thus varies from N when the Markov chain system is in fCi ji 2 [0 : : :M , 1]g to 1 when the system is in CNM , . To calculate the transition rate from Cn to Cn (n 2 [0 : : :NM , 1], there is no request ow from CNM ), let us consider the number of arcs from a state s 2 Cn . This state represents a certain number of processors running; let this number be run(s). For each running processor there is an arc into Cn with transition rate r . Note that each arc corresponds to a distinct state in Cn . Given that the probability that the system is in Cn is Pn , then the probability that 1

1

+1

+1

+1

149

population(Cn )r

population(Cn )r +1

Cn



Cn



+1

Figure 5.14: Multiple threads per processor as a birth death system the system is in state s is given by the multiplication rule of conditional probability.

P [system in Cn & system in s] = P [system in Cn ]P [system in s given system in Cn]

(5.20)

under the assumption that the probabilities of being in any state in a cohort are equal: 1 P [system in s given system in Cn] = population (Cn)

(5.21)

This gives the transition rate from s as

run(s) population(Cn) r

(5.22)

Hence the transition rate from Cn to Cn is +1

X X run(s) r  = run(s) r population (Cn ) population (Cn ) s2C s2Cn n

(5.23)

P

However s2Cn run(s) is precisely the number of states in Cn . This gives the transition rate of (5.24)

+1

(Cn )  n = population population (C ) r +1

n

This is illustrated in gure 5.14. Returning to the general solution of linear birth-death Markov chain systems; the general form ([75, eqn 3.10]) of the coecients of such a birth-death system is: (5.25)

Pk =      k, P0 0

1

1

1

2

150

k



where

K kY ,  #, X i P0 = 1 + 

"

(5.26)

1

k

=1

i

=0

1

i

+1

where K represents the total number of threads in the system (NM ). From equation (5.24), the equation for the general coecients for the linear birthdeath system representing a system of N processors each with M threads is

Pk = P0

(5.27)

kY , population (C ) n u population ( C n) i 1

+1

=0

where u = r =. However, in the product of the above equation, the populations of each cohort except for C and Ck appear exactly once in both the numerator and the denominator. Also given that C represents the initial state with no requests outstanding and whose population is always 1, equation (5.27) becomes 0

0

(5.28)

Pk = population (Ck)uk P0

where (5.29)

"

P0 = 1 +

K X k

population (Ck )uk

#,

1

=1

As can be seen from the performance characteristics derived in the previous chapter, the knowledge of P0 gives most of the interesting performance metrics; the derivation of these metrics is along identical lines to those in chapter 4 and is not reproduced here. The problem of generating performance measurements for multiple processors with a multiple number of threads has now been reduced to the nding of the populations of the individual cohorts.

5.3.2 Finding the population of a cohort The problem of nding the population of a cohort can be expressed in many ways. One equivalent problem is the number of distinct strings of length n that can be 151

produced when choosing from an alphabet of N characters given that there are M copies of each of these characters available. There does not appear to be a simple closed formula for this number; however there are several techniques for generating these populations algorithmically. These approaches are summarised in [113] and have been used to generate the cohort population for medium size problems (processors 2 [1 : : : 100], threads per processor 2 [1 : : : 10]).

5.4 Multiple processors with response delay The inclusion of delay within the model introduces yet another increase in complexity of the Markov chain system. We do not attempt to illustrate the underlying chain as any such representation would have to multi-dimensional. The starting point for the discussion of this system is the aggregated states (cohorts) introduced in the previous section. As in both the previous cases of adding delay into the basic linear Markov chain system, the system becomes triangular. In this case the starting point of the linear states is that of each state representing a cohort of states as arrived at in x5.3. To demonstrate that the triangular system of states is valid with respect to the aggregation of states into cohorts, consider a state Si in a cohort Cn , starting with the case in which there are no requests undergoing delay. Considering this state Si in the model without delay, it has run(Si) arcs to the successor cohort. The state also has one arc to the predecessor cohort Cn, (representing service ow). In the case with delay this single arc is now to a new state in a different cohort Cn; where the ` ' in the subscript represents the set of all states where there is one request undergoing delay. There are up to n such cohorts each representing n requests undergoing delay. This is illustrated in gure 5.15. It is important to note that the state in Cn; , to which the state Si maps when the rst item in the queue has been serviced, possesses exactly the same number of arcs into its successor cohort Cn ; as the corresponding state in Cn; does to Cn ; . Thus the number of arcs from Cn; to Cn ; (and hence the number of states in cohort Cn ; ) is the same as the number of arcs from Cn; to Cn ; . Thus the transition rate into the successor cohort is dependent only on the base cohort (ie the value of n), and not dependent on the number of requests that have been serviced by the service centre. 1

1

1

1

+1 1

0

+1 0

1

+1 1

+1 1

0

152

+1 0

Cn, ;

xn r

xn r +1

Cn;

0

1 0





Cn;

;

+1 0

 xn r +1

1

Cn



Cn

;

+1 1



Cn;

2

Figure 5.15: State-transition-rate diagram cohorts in a system with delay

153

The e ect of servicing a request can be tied into the naming of states (recalling that the function path (Si) extracts the sequence of requests that corresponds to the system being in the state Si ). In a system where delay is present, the act of servicing reduces the queue and transfers the system into a new state in Cn; ; call this state Sx;y , where x represents the base path (ie path (Si) in Cn; ) and y represents the requests that have been serviced and are undergoing delay. Note that the requests serviced and awaiting delay behave di erently from the requests queued for service. This is due to the model of delay being an individual server per request; servicing related to that delay ( , the delay rate) is memoryless. This implies that if two responses have been serviced and are thus undergoing delay, it is possible for them to arrive at their originating processors in an order that is di erent from the order in which they were serviced. It is actually an advantage to make this behaviour explicit in the model. When communicating from a service point to di erent processors it is likely that the requests would travel by di erent routes, each with the potential for a di erent delay at any particular moment. The only area where it may be more dicult to accept is in the case where two requests have been serviced from the same processor. Even in this case the ordering of requests is unimportant. At no point have we distinguished between di erent requests from the same processor, only the number of such requests has been important. An additional advantage of this freedom from delivery order is that the underlying physical system can use any means of delivery, such as multiple routes, providing that the same service pro le is maintained. Returning to the processing of requests and taking an arbitrary cohort Cn;j , the target state (in the cohort Cn;j ), after processing a request from a state Sx;y (in the cohort Cn;j ), is given by 1

0

+1

service (Sx;y ) , let n0 = y + 1 y0 = y y [x[n0]]

in Sx;y0

where y is the sequence concatenation operator, len y = j , and the notation x[n] extracts the nth element from the sequence x. The service function embodies the following actions of the system: Assume that the system is currently in a state in Cn;j , let that particular state be Sx;y . This 154

implies that of the n outstanding requests (the originating processors being the sequence of elements x), j have already been processed. The servicing of the next request takes the j +1 element from the sequence x and appends it to the sequence of processed requests (y). The service facility will be idle when x = y. This state Sx;y in Cn;j , will have j arcs from states in the cohort Cn, ;j , , one for each serviced request whose delay could expire. This set of states in Cn, ;j , represent the path of outstanding requests, given that an element of y has both received service and undergone the delay. The set of states which are the targets in Cn, ;j , will be given by 1

1

1

1

1

1

delayTargets (Sx;y ) ,

fSx0;y0 ji 2 [1 : : : len y]  x0 = remove(x; i) ^ y0 = remove(y; i)g

where remove (x; n) produces a new sequence with the nth item in the sequence removed. Note that the position in both sequences of an element is the same, ie 8i 2 [1 : : : len y]  x[i] = y[i] Under the same assumption as for the case without delay, namely that all probabilities within a cohort are equally likely, the system with delay has the same triangular structure as has been seen before. Given that this system follows the same structure as the other systems with delay (x4.1 and x5.2) and using Pi;j to express the probability that the system is in cohort Ci;j ; the probability of being in a particular cohort Ci;j (from equation (5.28), outline proof in xA.4) is: (5.30) where

j

Pi;j = population (Ci)ui djr! P0;0

2 NM i 3 j , X X d P0;0 = 41 + population (Ci)ui r 5 j!

1

(5.31)

i

=1

j

=0

5.4.1 Hiding number for multiple processors In x5.2.2 we de ned hiding number in terms of the absolute speed of the system. In the style of systems that we are examining here, the calculation of the absolute 155

speed for a particular system requires the computation of the probability of being in a cohort weighted by the average rate at which the system is processing while in that cohort. An easier, and equivalent, way of calculating the hiding number is to look at the utilisation of the service centre, . We can recast the de nition of x5.2.2 and (5.16) as the minimum M (number of threads) for which the utilisation of the service centre is greater than or equal to the equivalent, delay free system. This being the case the following statement would hold:

MN;dr   N;

(5.32)

0

(

where  N; is de ned by (4.7)

)

(

0)

0 (

0)

1,

"X N i

=0

N ! ui (N , i)!

#,

1

and represents the server loading in a N processor system with one thread per processor and no delay. The MN;dr term can be calculated by noting that the server is idle when the number of processed requests equals the number of outstanding requests, ie for all the states Pi;i . This gives system utilisation as (

)

(5.33)

MN;dr (

)

=1 ,

NM X i

Pi;i

=0

=1 ,

NM X i NM X =0

=1 ,

i

=0

!

di population (Ci)ui r P0;0 i!

!2

3,

NM i X Xi d dj i r 4 population (Ci)u population (Ci)ui r 5 i! i j j! =0

=0

The condition of equation (5.32) becomes the smallest M such that (5.34)

PNM population (C )ui dir 1 i PNMi Pi population (C )iui djr  PNi N ui N ,i i j i j =0

!

!

=0

=0

!

is true. 156

=0 (

)!

1

5.5 Summary The analysis presented in this chapter allows for the prediction of performance characteristics where there are many threads per processor. We have taken the simpler case of the single processor and shown that there are now two possible factors that may limit speedup. The rst of these is the saturation of the server; this will occur when u > 1. The second is the complete use of the processor's computation power as the level of multi-threading increases. With the consumption of the concurrency that the system o ers being greater than the number of processors, the correct approach to maximise performance becomes more dependent on the problem. The use of performance formul would allow for quantitative comparison between the possible options. Given sucient concurrency, and a constant loading intensity from each thread of execution, it is possible to compare the overall rate of work of two systems that di er in the number of processors and/or number of threads per processor. Any increase in the number of threads or processors will always give some performance improvement (however small). The fact that this theoretical prediction does not always agree with the observed performance of systems could be because of the implicit assumption that there is always sucient bu ering space within the system under study. This requirement is not always reasonable, and ways of extending the models presented here to include the e ect of nite bu ering are examined in x7.1.3.2. Where loading intensity is some function of the number of units of concurrency (where the grain size may vary) or there is some additional cost related to the number of threads per processor, such as some scheduling overhead, the interrelationship is not as clear-cut. An example of the inter-relationship of these factors is given in x6.5.1. We have introduced the concept of hiding number as a measure of the required excess concurrency within an algorithm to mitigate the e ects of the delay present within a system.

157

Chapter 6

Uses of the model The analysis described in chapters 4 and 5 build their performance models out of several distinct conceptual units.

 Request/demand generating entities such as processors and threads of execution.

 Request/demand satisfaction | this representing the service elements in the model.

 Delay  Cycles of behaviour One point that has not yet been addressed is the breadth of applicability of the models described, and their use in the design and performance evaluation process. In looking at the possible uses of this model, there are several ways of applying the analysis. The justi cation for each level of use is dependent on the particular level of applicability claimed. The strongest level of applicability is the use of the model to make accurate predications of a multiprocessor's performance for a particular problem, both in terms of the solution rates for that problem and in terms of the limit in increase of these solution rates. In discussing the suitability of this model to predict performance it is useful to look at the reasonableness of the concepts we have used as the basic building blocks, as well as the way in which they are combined. 158

database     number of    runtime     fractional   request   average      relative
fraction     processors   (secs)      idle (%)     rate      wait (µsec)  speedup
(% of DB)
1            1            29843.270   16.422       648.932   302.788      1.000
1            2            15184.690   17.814       649.042   333.948      1.965
1            3            10167.250   18.054       648.789   339.574      2.935
2            1            23487.180   12.886       491.045   301.249      1.000
2            2            11907.000   13.957       491.352   330.140      1.973
2            3            7961.710    14.085       491.504   333.547      2.950
5            1            22347.370   12.210       456.029   304.997      1.000
5            2            11467.950   12.110       447.021   308.227      1.949
5            3            7699.290    12.941       452.390   328.528      2.903
10           1            18201.230   8.307        297.101   304.914      1.000
10           2            9273.920    8.286        295.731   305.516      1.963
10           3            6214.260    8.780        298.276   322.625      2.929

Table 6.1: Raw data for rendering of picture on tertiary tree of depth 1

6.1 Evaluating and extrapolating performance

To validate the abstract model, the distributed ray tracing application described in §3.5 was enhanced to collect performance data. Tables 6.1 and 6.2 contain the aggregated raw results over several runs for each configuration. This physical system was the original impetus to develop the performance and behavioural models described in this thesis.

The instrumentation of the software to capture the performance data was extensive. This was to make sure that values of several of the performance parameters that are related in the model could be measured, to assure the internal consistency of the model. The way in which the performance-gathering changes were made was assessed for its effect on the overall behaviour of the processes involved. Any such effects were minimised as far as possible by ensuring that changes to any part of the behavioural cycle portion of the code took constant time (eg storing a simple value or some simple arithmetic operation). The more complex operations were performed outside the cycle (ie while awaiting a response) or after the termination of the execution of the system. There were no additional communications during the main execution of the system, so as not to increase the load on the communications infrastructure; the performance information was gathered by the host system after the termination of the normal execution.

database     number of    runtime     fractional   request   average      relative
fraction     processors   (secs)      idle (%)     rate      wait (µsec)  speedup
(% of DB)
1            1            32570.470   23.340       649.069   469.073      1.000
1            2            16627.130   24.934       650.477   510.641      1.959
1            3            11146.070   25.259       650.916   519.200      2.922
1            4            8467.460    26.146       651.330   543.535      3.847
1            5            6798.990    26.363       651.909   549.184      4.790
1            6            5762.100    27.513       652.337   581.849      5.653
1            9            4146.190    32.633       653.460   741.303      7.856
2            1            25174.860   18.667       491.683   466.794      1.000
2            2            12801.580   19.974       493.243   506.015      1.967
2            3            8571.480    20.209       494.176   512.520      2.937
2            4            6491.800    20.888       495.030   533.362      3.878
2            5            5205.300    20.985       495.904   535.567      4.836
2            6            4391.910    21.846       496.692   562.785      5.732
2            7            3794.140    22.396       497.342   580.291      6.635
2            8            3353.300    23.053       497.964   601.653      7.507
2            9            3070.500    25.223       498.592   676.556      8.199
5            1            23866.720   17.742       456.750   472.214      1.000
5            2            12131.400   19.023       458.620   512.237      1.967
5            3            8122.450    19.233       459.734   517.980      2.938
5            4            6150.190    19.879       460.729   538.533      3.881
5            5            4930.410    19.964       461.755   540.196      4.841
5            6            4187.180    21.313       462.638   585.476      5.700
5            7            3591.590    21.313       463.515   584.351      6.645
5            8            3170.710    21.924       464.258   604.852      7.527
5            9            2858.430    22.637       463.375   631.467      8.350
10           1            19044.350   12.330       298.201   471.641      1.000
10           2            9643.480    13.278       300.754   509.085      1.975
10           3            6451.770    13.415       302.559   512.101      2.952
10           4            4872.080    13.849       304.246   528.392      3.909
10           5            3904.230    13.857       305.778   526.062      4.878
10           6            3295.290    14.788       307.217   564.884      5.779
10           7            2825.010    14.701       308.431   558.776      6.741
10           8            2485.880    15.035       309.646   571.482      7.661
10           9            2227.780    15.393       310.159   586.585      8.549

Table 6.2: Raw data for rendering of picture on tertiary tree of depth 2


In this system, to keep the execution environment as homogeneous as possible, all the processing was performed at the leaves of a tertiary tree of processors; the data in table 6.1 is for a tree of depth 1 (hence the maximum of 3 processing nodes), whereas the data in table 6.2 is for a tree of depth 2, which has a maximum of 9 processing nodes. Each line in these tables was obtained by rendering the same picture (the `rings' model out of the Haines set of standard procedural models [56]) with the maximum cache size set to a different fraction of the total database of objects.

The first two columns in tables 6.1 and 6.2 are the cache size (in terms of the fraction of the database) and the number of processors in the particular configuration. The third column is the total runtime (measured in seconds) for the completion of the ray tracing algorithm on the picture. The fourth column is the average processor idleness, which was measured by observing each request/response pair from the point of view of the processor. The next column is the average per-processor request rate; this is calculated by taking the total number of requests and dividing by the total amount of CPU time that was spent processing on a per-processor basis. The penultimate column is the average time in microseconds that a processor had to wait for a response to a request. The final column is the speedup of the configuration relative to the single-processor runtime for the given cache size. It is interesting to note that the per-processor request rate increases as the number of processors grows; this may be due to the cache becoming less effective, since each processor is processing less of the total image as the number of processors increases.

6.1.1 Derivation of u_0 and other parameters

As a result of the equivalence result introduced in §4.3.3 we will first try to find the equivalence family (in terms of equation (4.45)) to which each of these results belongs. In figure 6.1 the points we have plotted are the relative speedup values from table 6.2, whereas the curves are generated using equation (4.21). The values of u have been derived by a least-squares fit of this equation to the raw data values (the exact algorithm is described in [114]). These results are summarised in table 6.3. Having determined the family (the u_0) to which a set of observations belongs, we now have one relationship between λ_r, μ and δ, through equation (4.45).

[Figure 6.1: Relative speedup for different cache sizes. Relative speedup (0–9) against number of processors (1–9); the points are the observed values from table 6.2 and the curves are the fitted model for cache sizes of 10% (u = 0.0706), 5% (u = 0.0829) and 1% (u = 0.1061) of the database.]

database fraction   u_0       standard deviation
(% of DB)
1                   0.10611   9.89 × 10⁻⁴
2                   0.08764   1.18 × 10⁻³
5                   0.08288   1.42 × 10⁻³
10                  0.07058   1.40 × 10⁻³

Table 6.3: u_0 values as a result of least-squares fit


Given that we know λ_r (the request rate, column 5 in table 6.2), through the use of equation (4.27), which relates the average waiting time for a response to a request to λ_r, μ and δ, we have the two sets of independently derived data to solve the resulting simultaneous equations. The observed waiting time is recorded in column 6 of table 6.2. The solution of these two simultaneous equations (equations (4.45) and (4.27)) leads to the estimates of μ, δ and d_r which are contained in table 6.4. Thus from the raw data we have derived estimated values for the parameters λ_r, μ and δ, and hence u_0, u and d_r.

The pattern of the changing values for μ and δ in table 6.4 suggests that there is some other factor that is not accounted for in our approximate model of this system's behaviour. Whatever this factor is, it shows up as an increase in the service rate, which implies that the loading intensity is decreasing as the number of processors increases. This is borne out in the changing values of u in the final column of this table. The application of the model has shown that there are other scaling factors at work in the execution of this ray tracing system, and points the way to additional analysis or estimation in order to give some idea of the maximum scalability of the rendering of this particular image.
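To make the fitting step concrete, the sketch below recovers u_0 for the 1% cache size from the relative speedups in table 6.2. It assumes that equation (4.21) takes the delay-free `machine repair' form used in chapter 4, in which the absolute speedup of K single-threaded processors is ρ/u with ρ = 1 − P_0; that assumption, the function names and the use of scipy here are ours for illustration, not the original algorithm of [114].

```python
from math import factorial
from scipy.optimize import minimize_scalar

def rho(K, u):
    # Server utilisation (1 - P0) for K single-threaded processors at
    # loading intensity u, in the finite-population 'machine repair' form.
    s = sum(factorial(K) // factorial(K - i) * u**i for i in range(K + 1))
    return 1.0 - 1.0 / s

def relative_speedup(K, u):
    # Absolute speedup is rho/u; relative speedup normalises to K = 1.
    return rho(K, u) / rho(1, u)

# Observed relative speedups for the 1% cache size (table 6.2).
procs    = [1, 2, 3, 4, 5, 6, 9]
observed = [1.000, 1.959, 2.922, 3.847, 4.790, 5.653, 7.856]

# Least-squares fit of the single parameter u0.
fit = minimize_scalar(
    lambda u: sum((relative_speedup(K, u) - s) ** 2
                  for K, s in zip(procs, observed)),
    bounds=(1e-4, 0.999), method='bounded')
print(f"u0 = {fit.x:.5f}")   # close to the 0.10611 reported in table 6.3
```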

6.1.2 Performance of other physical systems

There is a major difficulty in acquiring suitable raw performance data from the literature. The figures and graphs that are published have usually been heavily processed, and the way in which the values were gathered and the exact configuration details are not quoted. Given these restrictions, there are a few published sources of data that it is possible to evaluate [31, 25]. Using the curve-fitting techniques above, this data has been fitted, yielding standard deviations ranging from a tenth of a percent to a few percent. However, there are usually very few (fewer than 5) data points, and this does not allow any statistical accuracy to be claimed for these calculated values.

6.2 Threads of execution and processors

As was made clear in the introduction, the proposed model does not try to incorporate all the possible issues that may affect performance.

database    number of    service    delay      delay        loading
fraction    processors   rate (μ)   rate (δ)   ratio (d_r)  intensity (u)
(% of DB)
1           1            2131.864   347.227    6.140        0.304460
1           2            2146.186   350.410    6.125        0.303085
1           3            2327.330   397.922    5.849        0.279683
1           4            2467.071   437.700    5.636        0.264009
1           5            2728.145   520.705    5.239        0.238956
1           6            2897.419   581.509    4.983        0.225144
1           9            3376.247   793.019    4.257        0.193546
2           1            2142.273   303.726    7.053        0.229514
2           2            2135.467   301.582    7.081        0.230976
2           3            2288.727   337.625    6.779        0.215917
2           4            2399.234   365.532    6.564        0.206328
2           5            2620.327   427.713    6.126        0.189252
2           6            2749.901   468.157    5.874        0.180621
2           7            2958.167   541.555    5.462        0.168125
2           8            3183.538   634.524    5.017        0.156418
2           9            3177.696   630.874    5.037        0.156903
5           1            2117.684   285.048    7.429        0.215683
5           2            2101.638   280.851    7.483        0.218220
5           3            2246.862   313.006    7.178        0.204611
5           4            2346.964   336.646    6.972        0.196308
5           5            2553.301   390.647    6.536        0.180846
5           6            2584.117   398.782    6.480        0.179031
5           7            2855.349   483.512    5.905        0.162332
5           8            3059.279   558.667    5.476        0.151754
5           9            3268.194   651.993    5.013        0.141783
10          1            2120.257   300.398    7.058        0.140643
10          2            2093.809   290.546    7.206        0.143639
10          3            2226.003   326.821    6.811        0.135920
10          4            2315.320   353.037    6.558        0.131405
10          5            2505.287   419.284    5.975        0.122053
10          6            2523.559   423.838    5.954        0.121739
10          7            2771.223   534.633    5.183        0.111297
10          8            2956.650   639.990    4.620        0.104728
10          9            3157.942   792.133    3.987        0.098215

Table 6.4: Service and delay rates by cache size

Possibly one of the most important omissions is that of the multi-way synchronisation of processes (ie involving more than two processes); however, suggested ways of incorporating this into the model are outlined in §7.1.5.

The basic assumptions that underpin the model used here portray a process (thread of execution) running independently of all other processes, in all except two distinct ways:

1. All processes contend for some common resource, such as the services of a process on another processor, or for some physical resource such as a communications medium.

2. A subset of the processes contend for the computational resources represented by a processor; this set is fixed and no account is taken of the cost of performing any load balancing.

The process generates requests at a given rate, λ_r, while it is being executed, the times between requests being drawn from a random variable with a negative exponential probability distribution function. This is not as restrictive as it may at first appear. In the modelling of a process's activity, as observed in section 3.1, the execution time of a cycle of behaviour is not unreasonably seen as a random variable. The execution of this cycle gives rise to the externally observable event of making a request. The nature of the probability distribution function associated with the inter-request time may be discrete, or even totally deterministic. However, the presence of such things as caching, and the collective effect of such individually minor effects as the variation in time to execute instructions like multiplication, will tend to give the random variable associated with the inter-request time a more continuous nature.

The applicability of the exponential function to the modelling of inter-request times has been discussed in §4.1.3.1. It is always possible to construct a counter-example for the use of this probability distribution, given that all the possible aspects of the system are under the ultimate control of the programmer. Although this may be seen as a drawback, this fact has not prevented the writing of a vast body of literature on the performance of computer systems using probability, and the successful use of this underlying theory in the construction of many systems, both hardware and software. This is in no small part due to the fact that even the most apparently deterministic calculations on distributed processors possess some random elements, for example

execution times of instructions being dependent on the values of their arguments, the slightly differing clock speeds of each of the processors, or the effects of the location of the processor on a bus. Gustafson has even reported the effects of differing speeds of memory access due to the invocation of error-correcting codes in the memory subsystems [55, p637].

6.3 Service elements

Much of what has been said in the previous section on the execution component of this model applies to the modelling of the service component of the system as well. In viewing the combined behaviour of the two elements, that of the requester and the responder, additional factors suggest themselves.

The observation was made in §4.3.3 that there is an equivalence between the two models of system performance, both with and without delay. An outcome of this equivalence is that the effect of delay can be accounted for by the `moving' of delay into the model of the processor's activity. This equivalence has some interesting implications for the applicability of this model to systems in which the processes cannot at first be adequately modelled by a simple exponential pdf. The presence of delay in the model of the processes' behaviour seems to imply that the model is also applicable to behaviour which can be modelled with a non-exponential pdf. This delay equivalence, combined with the properties of type 3 BCMP nodes [16], implies that, at least in the case of a single thread per processor, the inter-request time distribution of the processes is unimportant.

There is another noteworthy fact which arises when servers are being used near their saturation point. This comes out of the heavy traffic approximation of Kingman [73]. This theorem states that under these conditions the waiting time in the queue can be approximated by a Poisson process, with properties that are determined by the variances of the requester and the server. This result was derived for GI/G/1 queues, and appears to be general; it has been extended to apply to several servers in parallel [78] and to servers in series [57]. The existence of this heavy traffic approximation theorem does not bode well for system designers who rely upon deterministic inter-request times and deterministic service times as a means of achieving linear speedup, up to the point where the server saturates. Any variation from this completely deterministic behaviour will cause the system to begin to behave like the model described in the previous chapters.

[Figure 6.2: Effect of multiple connectivity. Traffic from a source Y crosses communication link A into processing element X, which is connected onwards to the network by the two links B1 and B2.]

The further softening of the use of the model, as a qualitative comparison of the performance of two different systems, is also possible. All that is required is that the server utilisation be a monotonic function of the loading intensity; this is true for most models (but see the comments on blocking in §7.1.3.2). This version of the model, although limited in its quantitative uses, may find many applications, especially in the comparison of different algorithmic approaches to the same problem; this is expanded on in §6.5.1 and §7.1.4.

6.4 Delay

One approximation in our model is that all the service-like activity is represented by a single, overriding, point of contention. All the other effects are amalgamated into pure delay. There is no attempt to say that contention for the resources that make up the pure delay does not increase the round-trip time of a request/response. Where this increase is small compared with the time involved in queueing for and receiving service at the dominant service point, our model can still be applied. One important question is: when, in practice, is this simplification of the effect of delay acceptable? This question is perhaps best addressed through the use of two examples.

6.4.1 Example 1

Consider a system that has several, identical, service points in it. The worst possible scenario for our technique is if all these service points are in series and all have to be traversed. In these circumstances our model is not going to yield exact results. However, viewing these service points as, for example, communication links, and assuming that they are configured as in figure 6.2, such that all the traffic on A is routed through the processing element X, but such that the links B1 and B2 are connected to the same network (which is not the dominant service point), what is the effect of the queueing at the exit from X to enter the network via one of the links B? Let the arrival rate of information to be shipped over A (from Y) be λ, and let the rate at which the communication links A, B1 and B2 operate be μ. Modelling the effect of contention for A as an M/M/1 queueing system, and the effect of the two connections B as an M/M/2 queueing system, we can calculate that the time taken to queue and receive service in the two halves of the system is:

$$\text{Delay at A } (M/M/1):\ \frac{E[s]}{1-u} \qquad\qquad \text{Delay at B } (M/M/2):\ \frac{E[s]}{1-(u/2)^2}$$

where E[s] is the service time (1/μ) and u = λ/μ.

Our first model will be an exact analysis of this queueing system. We will look at the total wait time experienced by explicitly modelling the effects of the communication channels A and B. In this case, where the components of the queueing network are queueing systems of the form M/M/c, connected in series with no feedback, it is possible to model the individual components exactly in isolation [72, §10.1]. This allows for the construction of our first model, where the waiting time at each of the service centres is incorporated in the overall delay:

$$W_A = \frac{E[s]}{1-u} + \frac{E[s]}{1-(u/2)^2} \qquad (6.1)$$

In our second, approximate, model we adopt the simplification of representing the effect of the second element in the queueing network solely by its average service time E[s]. This gives

$$W_B = \frac{E[s]}{1-u} + E[s] \qquad (6.2)$$

[Figure 6.3: Relative error as a function of loading intensity. Relative error (0–0.035) against loading intensity (0–1).]

The quantity of interest is the relative error of using the second model in preference to the first, ie

$$\frac{W_A - W_B}{W_A} \qquad (6.3)$$

namely

$$\frac{W_A - W_B}{W_A}
  = \frac{\left(\dfrac{E[s]}{1-u} + \dfrac{E[s]}{1-(u/2)^2}\right) - \left(\dfrac{E[s]}{1-u} + E[s]\right)}
         {\dfrac{E[s]}{1-u} + \dfrac{E[s]}{1-(u/2)^2}}
  = \frac{\dfrac{1}{1-(u/2)^2} - 1}{\dfrac{1}{1-u} + \dfrac{1}{1-(u/2)^2}}
  = \frac{u^2(u-1)}{u^2 + 4u - 8}$$

As can be seen in figure 6.3, this approximation introduces a relative error which has a maximum (for u ∈ (0…1)) at u = 0.745, at which value the relative error is only 3.17%. This maximum relative error is very small and is likely to be acceptable in almost all cases.
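This closed form is easy to check numerically. The sketch below, with E[s] normalised to 1 (the relative error is independent of E[s]), evaluates equations (6.1)–(6.3) directly and locates the worst case; it is purely illustrative.

```python
import numpy as np

def W_A(u):
    # Exact model, equation (6.1): M/M/1 stage plus M/M/2 stage (E[s] = 1).
    return 1.0 / (1.0 - u) + 1.0 / (1.0 - (u / 2.0) ** 2)

def W_B(u):
    # Approximate model, equation (6.2): the second stage is replaced by
    # its mean service time alone.
    return 1.0 / (1.0 - u) + 1.0

u = np.linspace(1e-6, 1.0 - 1e-6, 200001)
err = (W_A(u) - W_B(u)) / W_A(u)          # equation (6.3)
i = err.argmax()
print(f"maximum relative error {err[i]:.4f} at u = {u[i]:.3f}")
# prints a maximum of about 0.0317 at u of about 0.745
```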

[Figure 6.4: Effect of queues in series. Two M/M/1 queues, with service rates μ1 and μ2, connected in series with arrival rate λ.]

6.4.2 Example 2

In this example we look at the effects of two exponential delays in series, each having a different value for its average service time. The purpose of this example is to illustrate the effects of the domination of the lower service rate element over the other one.

The system illustrated in figure 6.4 consists of two M/M/1 queues connected together in series. As in the system above, the arrival rate at both of the constituent servers is identical, being λ. However, the service rates of the two service centres are different. Assuming that μ_1 is the dominant service centre, ie μ_1 < μ_2, let the two service rates be related by μ_2 = (1/R)μ_1 where R ∈ (0…1). This gives the loading intensity u_2 = Ru_1 and the average service times E[s_2] = RE[s_1]. Following the same structure as for the previous example, the two delay models are based on, firstly, the exact analysis of the delay at both queues:

$$W_A = W(u_1) + W(u_2) \qquad (6.4)$$

and, secondly, the approximation in which the delay incurred at the second, more lightly loaded, service centre is estimated solely by its mean service time:

$$W_B = W(u_1) + E[s_2] \qquad (6.5)$$

where the wait time is that for an M/M/1 queue, namely:

$$W(u) = \frac{E[s]}{1-u} \qquad (6.6)$$


[Figure 6.5: Relative error as a function of loading intensity, plotted for R = 0.1, 0.25, 0.5, 0.75 and 0.9.]

This gives the relative error of this modelling approximation as

$$\frac{W_A - W_B}{W_A}
  = \frac{\bigl(W(u_1) + W(u_2)\bigr) - \bigl(W(u_1) + E[s_2]\bigr)}{W(u_1) + W(u_2)}
  = \frac{\dfrac{RE[s_1]}{1-Ru_1} - RE[s_1]}{\dfrac{E[s_1]}{1-u_1} + \dfrac{RE[s_1]}{1-Ru_1}}
  = \frac{R^2 u_1 (1-u_1)}{1 + R - 2Ru_1}$$

As is illustrated in figure 6.5, for small values of R this relative error is very small; the maximum relative error is dependent on R and is achieved at different values of the loading intensity. The value of the loading intensity at which the maximum relative error occurs is:

$$u_1 = \frac{1 + R - \sqrt{1 - R^2}}{2R} \qquad (6.7)$$

The maximum relative error for a given value of the service time ratio R is illustrated in figure 6.6. Thus the relative error in making this approximation can be calculated; for all u the error is less than 50%. However, if the ratio of the two service times, R, is less than 0.5, the maximum relative error is around 6%. Also, at low (or high) loading intensities, this approximation can yield remarkably accurate results.

[Figure 6.6: Maximum relative error (0–0.45) as a function of the ratio of average service times (0–1).]
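Again the result can be checked numerically. The following sketch evaluates the relative error at the worst-case loading intensity of equation (6.7) for a range of service-time ratios; E[s_1] is normalised to 1 and the function names are illustrative.

```python
import math

def relative_error(u1, R):
    # (W_A - W_B) / W_A for two M/M/1 queues in series, with the more
    # lightly loaded server approximated by its mean service time alone.
    return R**2 * u1 * (1.0 - u1) / (1.0 + R - 2.0 * R * u1)

def u1_at_max(R):
    # Loading intensity maximising the relative error, equation (6.7).
    return (1.0 + R - math.sqrt(1.0 - R**2)) / (2.0 * R)

for R in (0.1, 0.25, 0.5, 0.75, 0.9):
    u1 = u1_at_max(R)
    print(f"R = {R:4.2f}: maximum relative error {relative_error(u1, R):.3f} "
          f"at u1 = {u1:.3f}")
# R = 0.50 gives a worst case of about 0.067, in line with the 'around 6%'
# quoted above; small R gives a very small worst case.
```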

6.5 Examples of the use of the modelling technique

6.5.1 Quantifying the effects of algorithm change

There are many possible examples that could be used to illustrate the application of this performance model.

6.5.1.1 Caching in a distributed Virtual Memory

In this example we look at the effects on performance of a change in a simple caching algorithm. The behaviour of this system is similar to that described in §3.5; the differences lie in the way in which the requests generated by the computation are handled. In this system each local processor has a cache (managed by software) that holds some fraction of the global information; a request for a datum can be satisfied by

[Figure 6.7: Behaviour of a single processor with cache. A cycle over the places p1–p6 and transitions t1–t6, where:
p1 = processor computing; p2 = lookup in local cache; p3 = evaluating effectiveness of lookup; p4 = queueing awaiting VM service; p5 = server idle; p6 = server busy;
t1 = make request; t2 = local cache examined; t3 = in local cache (probability α); t4 = not in local cache (probability 1 − α); t5 = acquire server; t6 = service request;
α = cache hit ratio; λ_c = request generation rate; λ_l = cache lookup rate; μ = service rate.
The places p1–p3, ie the processor itself, are enclosed by a dashed line.]

this cache or passed on to a server that holds all the information. The server is shared amongst all the processors. This behaviour is illustrated in figure 6.7.

In this system the algorithm generates requests (to the virtual memory system) at a rate λ_c. Each request requires 1/λ_l units of time to be processed by the local cache mechanism. This processing will either return the required value (with a probability α) or transmit the request to the server (probability 1 − α). Taking the composite view of the processor (ie all that lies within the dashed line in figure 6.7), the server sees a processor which generates requests at a rate of:

$$\frac{1-\alpha}{1/\lambda_c + 1/\lambda_l} = \frac{(1-\alpha)\lambda_c\lambda_l}{\lambda_c + \lambda_l} \qquad (6.8)$$

This is the λ_r that is used in the equations in chapters 4 and 5. The quantity of interest here is the actual rate at which the solution is generated (ie excluding the computational effort of looking up the local cache). Referring to figure 6.7, the solution is being computed while the system is at p1. Any processing rate that is arrived at by the application of the formulae in chapters 4 and 5 will give the fraction of the time that the processor spends within the set of places {p1, p2, p3} (ie within the dashed line in the figure). This quantity, f_run say, needs to be converted into the observed solution rate. The place p3 is only occupied transiently and the processor does not spend any time in it; it is there to allow for the probabilistic choice between the possible outcomes of the lookup against the local cache. Hence f_run is the time that the processor spends in {p1, p2}. Therefore the observed solution rate is:

$$\frac{1/\lambda_c}{1/\lambda_c + 1/\lambda_l}\, f_{run} = \frac{\lambda_l}{\lambda_c + \lambda_l}\, f_{run} \qquad (6.9)$$

With the equivalences embodied in equations (6.8) and (6.9), it is possible to apply the performance equations of the models in chapters 4 and 5.

It is now possible to explore the effects of an `improvement' in the caching algorithm, ie one for which α is bigger. Let this improved algorithm increase the cache hit ratio by a factor R_α, ie the new cache hit ratio is R_α α; as this improved version of the cache has some cost in the increased complexity of the lookup algorithm, the `rate' at which the lookup portion of the code is executed decreases; let this be by a factor R_l. Hence the effect of an improvement in the lookup algorithm is represented by an increase in the hit rate (R_α ≥ 1) and a decrease in the rate of lookup processing (R_l ≤ 1).

The first question is: what difference does the improvement in the caching algorithm make to the asymptotic performance of the system? As has been shown, the maximum absolute speedup is given by 1/u = μ/λ_r. This corresponds to the quantity f_run in equation (6.9). With A representing the original system, and B being the system with the improved caching algorithm, their respective performance characteristics are:

Caching Algorithm       A                                                                  B

Speedup (f_run)         $\dfrac{\mu(\lambda_l+\lambda_c)}{(1-\alpha)\lambda_l\lambda_c}$   $\dfrac{\mu(R_l\lambda_l+\lambda_c)}{(1-R_\alpha\alpha)R_l\lambda_l\lambda_c}$

True Solution Rate      $\dfrac{\lambda_l}{\lambda_c+\lambda_l}\, f_{run}$                 $\dfrac{R_l\lambda_l}{\lambda_c+R_l\lambda_l}\, f_{run}$

Maximum Solution Rate   $\dfrac{\mu}{(1-\alpha)\lambda_c}$                                 $\dfrac{\mu}{(1-R_\alpha\alpha)\lambda_c}$

Thus the relative increase in the solution rate between the two caching algorithms is:

$$\frac{\dfrac{\mu}{(1-R_\alpha\alpha)\lambda_c} - \dfrac{\mu}{(1-\alpha)\lambda_c}}{\dfrac{\mu}{(1-\alpha)\lambda_c}}
  = \frac{\alpha(R_\alpha - 1)}{1 - R_\alpha\alpha} \qquad (6.10)$$

This shows us that the asymptotic solution rate is dependent only on the rate at which the requests are generated by the algorithm and on the effectiveness of the caching. It is not dependent on the cost of the caching algorithm. This asymptotic relative increase in solution generation rate comes solely from the difference in caching efficiency between the two schemes.

This fact is likely to be of limited interest for users of a fixed number of processors; they will be far more interested in the effective change of solution rate

[Figure 6.8: Relative performance improvement for different costs of algorithm. Relative observed solution rate (−0.02 to 0.015) against number of processors (5–40), for R_l = 0.9, 0.5 and 0.2.]

Figure 6.8: Relative performance improvement for di erent costs of algorithm on a certain number of processors. This requires us to apply a particular one of our performance models; the chosen model could be any of those discussed in previous chapters, but for the ease of illustration we shall use the model in x4.1 (single-thread per processor). The rate at which systems of this type generate solutions is given by equation (4.40), this being equivalent to the quantity frun . From equation (6.9); this gives an observed solution rate for the system of

$$\left(\frac{\lambda_l}{\lambda_c + \lambda_l}\right)\frac{\mu\rho}{\lambda_r} \qquad (6.11)$$

where $\rho = (1 - P_0) = 1 - \left[\sum_{i=0}^{K} \frac{K!}{(K-i)!}\, u^i\right]^{-1}$ and, from equation (6.8), $u = \frac{(1-\alpha)\lambda_l\lambda_c}{\mu(\lambda_l+\lambda_c)}$. Applying these substitutions to equation (6.11), the solution rate simplifies to

$$\frac{\rho\mu}{(1-\alpha)\lambda_c} \qquad (6.12)$$

The relative effective increase of two particular caching strategies can thus be calculated. This is illustrated in figure 6.8. This graph shows the perceived performance improvement (in terms of observed solution rate) where α = 0.1, λ_c = 0.05, λ_l = 0.05, μ = 1 and R_α = 1.1. The respective graphs illustrate the effect on

observed solution rate achieved by this 10% improvement in the cache hit ratio, for various values of the cost associated with the improvement in the local caching algorithm. As can be seen from these results, it is sometimes more cost-effective (in terms of observed solution rate from a system) to use a less expensive caching algorithm which is not as effective for small numbers of processors, and to use a more expensive, better hit rate solution for larger processor numbers. This implies that a general purpose solution to software caching may need to consist of several different types of solutions that are chosen on the basis of the level of saturation of the central server. Similar observations have been made by means of simulation studies by Bennett et al [18], Bagrodia [14], Stumm and Zhou [115], Adve et al [1] and Archibald and Baer [11].
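The comparison itself is mechanical once equations (6.8) and (6.12) are available. The sketch below illustrates that calculation, again assuming the delay-free machine-repair form of ρ used in §6.1.1, and also prints the asymptotic gain of equation (6.10); it does not attempt to reproduce figure 6.8 exactly, since the parameter values quoted above are partially obscured in the source.

```python
from math import factorial

def rho(K, u):
    # Server utilisation, rho = 1 - P0, for K single-threaded processors.
    s = sum(factorial(K) // factorial(K - i) * u**i for i in range(K + 1))
    return 1.0 - 1.0 / s

def solution_rate(K, alpha, lam_c, lam_l, mu):
    # Observed solution rate via equations (6.8) and (6.12).
    u = (1.0 - alpha) * lam_l * lam_c / (mu * (lam_l + lam_c))
    return rho(K, u) * mu / ((1.0 - alpha) * lam_c)

alpha, lam_c, lam_l, mu, R_alpha = 0.1, 0.05, 0.05, 1.0, 1.1
# Asymptotic relative gain of equation (6.10), independent of R_l.
print("asymptotic gain:", alpha * (R_alpha - 1) / (1 - R_alpha * alpha))

for R_l in (0.9, 0.5, 0.2):
    for K in (5, 20, 40):
        a = solution_rate(K, alpha, lam_c, lam_l, mu)
        b = solution_rate(K, R_alpha * alpha, lam_c, R_l * lam_l, mu)
        print(f"R_l = {R_l}, K = {K:2d}: relative change {(b - a) / a:+.4f}")
```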

6.5.2 Quantifying the effects of changes in topology

6.5.2.1 Growth of delay with the number of processors

Another noteworthy aspect of scalability is the effect of those factors that increase, for topological or other reasons, with the increase in the number of processors. In this section we will look briefly at the effects of increasing delay and increasing contention on absolute speedup, given increasing numbers of processors. In the case of the growth of delay, the sort of scenario envisaged is one in which the delay introduced by the underlying communications network increases as the number of processors connected to it grows.

The starting-point for these examples is the absolute speedup formulation in equation (4.40), although the corresponding formulation from §5.4 could equally well have been used. The graph in figure 6.9 illustrates the effects of this growth of delay. For the purposes of this illustration, the values of u and d_r chosen were 0.1 and 5 respectively. The linear factor chosen was 0.25, which would be typical of a connected ring of processors where the processors have four connections each, whereas the growth in the log case is to the base 2, typical of some binary branching networks.

[Figure 6.9: The effect of growth of delay on absolute speed. Absolute speedup (0–10) against number of processors (up to 50), for: no delay, constant delay, linear delay growth and logarithmic delay growth.]

It appears that linear delay growth, although having less effect on absolute performance for small numbers of processors, does have an effect on the asymptotic performance. Whether this is true in the general case, and if so by what amount the maximum speedup is reduced, requires finding the limit of P_{0,0} (equation (4.32)). In the case of the example shown in figure 6.9, the absolute performance rises from just over 6 with 50 processors, through 7.23 on 100 processors, to 7.58 on 200 processors. It does not appear to have reached its asymptote, but an absolute increase in solution rate of 0.35 for the addition of 100 processors imposes an economic, if not a theoretical, limit on the scalability.

The effect of linear delay growth has a potential impact on the design and use of planar networks, even where the bandwidth is unlimited. The software architectures designed for use on them will have either to limit the distance of communication paths between requester and server, or to turn to the use of more concurrency per processor, to cover this latency. This result complements the continuous density model of communications proposed by Glasser and Zukowski [46], which derives limits on the density of communication for planar structures, showing that the message density must fall off with order greater than 3, where the maximum density per unit area is finite, ie in all practical cases.

[Figure 6.10: The effect of growth of contention on absolute speed. Absolute speedup (0–12) against number of processors (up to 50), for: no contention growth, linear contention growth, logarithmic (base 10 and base 2) contention growth, and logarithmic contention shrinkage.]

6.5.2.2 Growth of load intensity with the number of processors

Figure 6.10 illustrates absolute speedup given the presence of changing contention. Again, the chosen value of loading intensity (u) is 0.1. Linear growth in contention has a dramatic effect on the speed of the system and, in the limit, the system grinds to a halt; even logarithmic growth of contention has an appreciable detrimental effect. Growth of contention may occur as the problem grows with the number of processors, giving rise to some increase in the demands upon the server. Another possibility is the reduction of loading intensity as the number of processors increases; this may occur in systems with distributed caching, where, as the number of processors increases, the likelihood of finding the datum without recourse to the central data repository also increases.

These two examples illustrate the sort of applications to which the formulae in chapters 4 and 5 could be put. Their quantitative predictive capacity is not extensive in these circumstances, but their qualitative use allows for theoretical comparisons between different techniques, which can give guidance to the implementor.

6.6 Informing design decisions

Much of the material in this chapter illustrates the use of the models in evaluating implementation strategies. Although the model allows for choices in loading intensity, delay ratio and number of processors, it is likely that the design freedom will lie in the choice between the level of multi-threading per processor and the components of loading intensity (request and service rate). The concept of hiding number, discussed in section 5.2.2, provides a broad metric for assessing design choices.

As with all models and approaches there is the need to gain practical experience in applying the techniques, to find their strengths and weaknesses. The model described in this thesis gives a quantitative and qualitative basis for supporting informed decision making; but, like all design systems, it does not enforce it.

6.7 Summary

In this chapter we have looked at some of the possible practical uses to which the performance and scaling models expounded here could be put. We have examined the sort of criteria that would need to be employed in deciding the suitability of applying our models to a particular system. We have validated the model against an instrumented large application and shown that the performance model fits the observed performance well. We have taken a small number of examples which illustrate the use of the models, and shown that generalisations about `good' properties or algorithms are not appropriate to all cases.


Chapter 7

Conclusions

In this thesis we have taken a particular view of the limiting factors in design and implementation that affect the scalability of parallel systems. The abstract factors that we have studied are loading intensity, contention, and delay. We accept that there are many other factors that may affect parallel performance, such as the variation of computational complexity and of communication intensity. These factors may dominate as a particular algorithmic solution to a problem is mapped onto the underlying hardware and software architecture. Although the route that we have chosen has been tangential to other research effort in this area, outlined in chapter 2, we are able to take account of the results presented by other authors in our modelling. This is done by equating their results to variations of the factors summarised above.

The property of absolute speedup, that has been used throughout this thesis, is a measure of the fraction of the total available computational power of the system that is being devoted to the computational task in hand. The fact that this computational requirement may be varying is not directly modelled. The scaling model that we have developed can be viewed as orthogonal to the fixed size, scaled size and fixed time scaling models summarised in chapter 2. In those scaling models many factors have been amalgamated into some quantity (the `serial portion') whose scaling properties are then approximated. The non-linear view of the interaction of processes that our model pursues also differs in approach from the more microscopic performance models presented in the same chapter.

Our approach to modelling has been based on looking at the behaviour of the individual processes that make up the system, not at the effects of some general aggregation of factors. The result of our approach is that we have generated an analytical model that explicitly includes the individual factors, an approach which we have not seen duplicated elsewhere. Through the application of our analysis it is to be hoped that any insight gained can positively influence the design and implementation of systems. This benefit can apply in the development of a specific implementation, as well as to a more general understanding. Our model may find particular use in the analysis of the factors affecting the scalability and performance of existing systems, the analysis contained within the model accounting for the effects of contention and delay, thus exposing other factors that are present in the actual system.

In looking at the scaling behaviour of parallel systems, apart from providing a rationale for such reported phenomena as linear speedup, and for the assertion that `slower' systems have greater speedup, the equivalence between single-threaded systems with delay and single-threaded systems without delay has interesting consequences for the type of conclusions that can be drawn from the observation of the run-time of a system alone. If correct, this has important consequences for the measuring of performance of parallel systems; as far as we know it has not been reported elsewhere. This equivalence also gives pointers as to what properties should be measured in a parallel system in order to assess its performance.

The models discussed in §5.2.2 and the latency hiding measure, hiding number, give a quantitative and qualitative insight into the use of algorithmic concurrency to hide delay in a system. They also give a quantitative measure of the cost of not hiding that delay, and illustrate the consequences when contention (in the form of the loading intensity) could be reduced, but only at the cost of increasing the delay.

In developing the model for multiple processors, each with many threads, we have shown that the important concept is the population of the cohorts, a cohort being the number of distinct ways of achieving a particular number of outstanding messages from the pool of possible messages. In the derivation presented here for the symmetric case, it is easy to see that each of the states that make up the ways of generating a particular number of messages is equally probable. This situation also holds when there is not an equal allocation per processor; the consequences of this are discussed below. It is interesting to note that the effect of delay on the final equations for the multi-threaded case is similar to its effect in the single-threaded case; this may point to some underlying property of delay as

used in this thesis, and raises the question of an equivalence relationship between the delay-free and delay-full multi-threaded models.

In chapter 6 we addressed the assumption of a single point of contention, and introduced the sort of approach that would be necessary for analysis of the error introduced in the approximation of the non-dominant points of contention by their delay. Even where this error analysis shows that the chosen behaviour approximation leads to unacceptable error in the quantitative results from our model, the qualitative results will still hold. Even in this qualitative case, our model provides a good starting point for the more complete analysis of the parallel system through the further application of queueing theory or simulation.

In producing our basic scaling model we made an explicit assumption about the nature of the interconnection network. This assumption is that each processor receives the same service from both the communications network and the service point. Translating this into a network topology, this would typically mean a star network with the service facility at the centre. In §6.5.2 we examined the effects of changing the network and processor attributes, combined with scaling, though still in the context of this homogeneous environment.

7.1 Future work

Although the model we have presented offers a predictive solution for certain types of parallel system, it raises several issues which could be the subject of further investigation.

7.1.1 Equivalence of the delay-free and delay-full multi-threaded systems

The delay equivalence that was shown in §4.3.3 for a single thread per processor begs the question as to whether such an equivalence exists when there are multiple threads per processor. A cursory examination shows that a simple linear model along the lines of that in §4.3.3 is not sufficient, and also raises the question of the physical interpretation of any such relationship, if it exists.

In the single-threaded case the equivalence highlights the importance of the physical quantity, loading intensity; in the multi-threaded case there are two quantities of interest: loading intensity and `threadedness'. It appears likely that any equivalence would map delay into a change in these two quantities. The difficulty of

interpretation would arise if the amount of `threadedness' required is non-integral; what does such a quantity mean in the physical reality of the system? Also, non-integral threadedness would require the use of continuous-space Markov chains, as opposed to the discrete-state Markov chain modelling that has been used in this thesis.

7.1.2 Asymmetry

There are several modifications to the work presented here that would break the symmetry that has been used throughout.

Asymmetric allocation of threads to processors. The analysis presented in chapter 5 keeps the number of threads per processor equal. When a system is being scaled this restriction may be too severe. The symmetry is not essential for the derivation of the equality of probabilities within a cohort (§A.3); this is solely dependent on the memoryless properties of the Poisson process, which implies in this case that the probability of a processor generating a request is independent of the number of threads remaining on (or allocated to) that processor. However, this symmetry is used to assert that the visit ratios used with the BCMP theorem are the same for all jobs of all classes. Any breach of this symmetry would require the explicit calculation of these visit ratios, which is not an onerous task. Any asymmetric variation of our system would still lie within the remit of the BCMP theorem, and the simple use of the ratio of successive cohort populations would still provide the request flow in the birth-death model. Care would have to be taken in interpreting the individual processor's results to take account of the asymmetry.

Asymmetry in the service received. Another requirement explicit in our modelling is the homogeneity of the processors and of the properties of the system that they see. One possible extension of this work is to break this homogeneity by allowing each processor's requests to experience a different delay, thus modelling networks where the topological distance between the requesting processor and the service centre differs from processor to processor. This would again remain within the remit of the BCMP theorem, but would require an alternative approach to the state aggregation that the cohorts represent.

The existence of the delay equivalence might mean that this problem could be equivalent, at least in the single-threaded case, to that of modelling delay-free systems where the loading intensity varies from processor to processor.

7.1.3 Modelling the effects of other overheads

The starting point for our model made certain assumptions about the nature of the system. There are several avenues for extending the work presented here; amongst these are:

7.1.3.1 Loading profiles

The assumption that the request rate (λ_r) is a constant per thread may not always be applicable, and various scenarios where this assumption does not hold were looked at in §6.5.2. The models of change of loading intensity chosen there were arbitrary and are not intended to be representative of any particular system. The analysis of the change in computational and communication requirements, along the lines of the analyses of Worley [130] and Gustafson [54], could be used to derive the way in which the loading intensity changes with scaling of the system, and hence give performance and scalability predictions for these systems.

7.1.3.2 Effects of finite buffer space

The main reason why any increase in the threadedness or number of processors gives rise to a monotonic increase in performance within our model is the assumption that there is always sufficient buffering space available in the system. The extension of the model to systems which have only finite buffering would have several interesting consequences:

• The performance (absolute speed) will no longer be monotonic as the number of threads and processors increases, but will reach a maximum and could even drop off.

• The strategy of how to cope with full buffers will have differing effects on where this maximum occurs and on how the system behaves beyond this point.

The inclusion of finite buffering in the model can take the solution of the system outside the product-form solutions of the BCMP theorem; however, work reported by Onvural [99] allows for the bounding of the performance of systems with finite buffer space. He also gives algorithms for the evaluation of the maximum performance point, and reports that their practical use can yield results that are within 1% relative error of the exact results. The inclusion of finite buffering into the model does not appear to require a complete reworking of the theory presented here, but could be added as an additional step in the design process.

7.1.3.3 Including other contention points and servers

As shown in chapter 6, the approximation of queueing systems by their mean service time can be appropriate in many circumstances. A possible area for future work would be to incorporate several servers explicitly into the model.

In the same vein, the model we have presented here assumes a single service point. Although the extension to more service points is relatively easy, the standard techniques are not likely to lead to a correct model of the physical reality. The presence of delay, the inherent physical difference between shared and distributed memory systems, means that the two or more servers could not be seen as deriving their input from the same queue. If there were many identical servers, the paths to each of which would not interact to cause contention, then such a system could be modelled as a single-server system with an appropriate reduction in load intensity. If the servers were not identical, as in the case of a non-replicated distributed memory, then the approach would be more complex. We would need to model the effects of a server not having the requested data item and thus forwarding the request (or returning an indication that the request needs to be sent elsewhere). This complexity, combined with the inherent delay and the asymmetry in perceived service, could lead to an analytical model of such distributed, non-replicated memories.

7.1.4 Quantitative measures of parallel algorithm improvement

The importance of the loading intensity, u, in the derivation of all the performance parameters is a constant theme in chapters 4–6. The variation of this loading

intensity can be seen as one measure among many in the comparison of different algorithms that perform the same function. One of the issues that seems difficult, at present, to resolve in the coding of parallel distributed memory algorithms is the benefit of recalculation compared with communication. This issue often arises in the situation where a thread, in order to complete a calculation, requires a value which is known to exist on another processor in the system. The system designer has a choice between performing computation that would yield this value on the local processor, or requesting the value from some other processor. The former requires duplication of computational effort and hence a reduction in the absolute rate of performing the given task. The latter introduces delay (in terms of the round trip time of the request and the associated response) and additional load on the underlying virtual memory system (or communication network). Given that the system designer can estimate the effect of the different approaches in two ways:

• The difference in the rate of request generation (λ_r) between the alternative methods.

• The cost difference (in terms of reduction in the solution generation rate) between the methods,

the results outlined in chapters 4–6 would allow the system designer to estimate the effect on the resulting solution rate, and hence to make a more informed choice between the two options.

7.1.5 Inclusion of other forms of synchronisation

As was mentioned in the introduction, the only synchronisation that has been explicitly modelled in this thesis is the two-party version of client and server. The existence of other forms of synchronisation, such as barrier synchronisation, has attracted the interest of many authors. One possible way of directly including such synchronisation in our model would be to view the barrier as a service facility. The existence of barrier synchronisation can be detected only by some external observer of the system; from the point of view of the processors, the synchronisation is only a service they request with an unusual service time distribution. If

it were possible to transform the barrier into an equivalent Coxian service system, where the average service time was the average wait of a processor for the rest of the processors to reach the barrier, then barrier synchronisation could be approximated within our model. There are several additional points that would also need attention, not least of which is the steady state assumption of Markov chains, given that barrier synchronisation will cause the behaviour of each of the processors to start from the same point in their cycle. This is also not consistent with the assumption that the processors can be modelled by independent random variables.

7.1.6 Monitoring of performance and load balancing

One advantage of having a performance model is the ability to use it to decide what, of the possible performance measurements, should be measured, and which are not necessary, as was discussed in §4.1.3.1. One of the areas of interest within the topic of load balancing in message-passing parallel computers is the difficulty of deciding the relative loading level on each processor [83]. Current solutions rely on the communication of loading levels between processors as part of the load balancing algorithm, thus increasing the load on the communication infrastructure. The present model may allow for this loading information to be obtained without any additional communication requirements.

As was shown in §6.1, the equations which predict the response time can be used to ascertain the loading intensity. These equations also allow for the derivation of the server loading (ρ), and hence the load on the remote service. The raw data for this analysis is dependent solely on parameters that can be measured on the local processor, ie round trip time and request rate. This theme has been developed by the author in [34]. The performance data could then be used to calculate the loading levels on every server with which the local processor has communicated; it would know what fraction of that load it was generating, and would thus be able to estimate the load for which other processors were responsible.

The use of such a mechanism would depend on the equations for average wait time holding; this would imply that the system had reached some steady state. It would require the distributed load balancing algorithm to be able to detect that the steady state had been reached.

This detection that the system has reached a steady state has a direct parallel in the simulation of queueing networks, and the sort of techniques for steady state detection mentioned in [102] may be applicable.

7.1.7 Cycles within cycles of behaviour

The basic model that has been presented here is for a single cycle of behaviour. It is reasonable to question its applicability to systems where the behaviour changes as the task progresses. There are two obvious ways in which the basic cyclical behaviour could be modified: the first is to allow the task to consist of several distinct stages, each with its own cycle of behaviour; the second is to study algorithms in which there are behavioural cycles within other cycles.

The use of our model to analyse systems such as these is dependent on the rate at which the system enters its steady state. Where there are many changes from one type of cycle (hence one steady state) to another, the transitional effects may become significant. The use of the Method of Layers [106], which uses the concept of cycles within cycles, may be applicable here. Analysis of the transient behaviour of the queueing models in this thesis would be necessary in order to give some bound on the frequency with which such changes between cycles of behaviour could occur without making a significant difference to the quantitative predictive power of the model.


Bibliography

[1] Sarita V. Adve, Vikram S. Adve, Mark D. Hill, and Mary K. Vernon. Comparison of hardware and software cache coherence schemes. Proceedings of the 18th International Symposium on Computer Architecture, 19(3):298–308, May 1991. In Computer Architecture News.

[2] Dharma P. Agrawal and Virendra Janakiram. Evaluating the performance of multicomputer configurations. Computer, 19(5):23–37, May 1986.

[3] M. Ajmone Marsan. Stochastic Petri nets: an elementary introduction. In Grzegorz Rozenberg, editor, Advances in Petri Nets 1989, number 424 in Lecture Notes in Computer Science, pages 1–29. Springer-Verlag, New York, 1990.

[4] M. Ajmone Marsan, G. Balbo, and G. Conte. Performance Models of Multiprocessor Systems. Computer Systems Series. The MIT Press, 1986.

[5] Leon Alkalaj and Rejendra Boppana. An analytical model of multi-threaded execution in a shared-memory multiprocessor. In Joosen and Milgrom [69], pages 108–122.

[6] Arnold O. Allen. Probability, Statistics, and Queueing Theory with Computer Science Applications. Academic Press, 1978.

[7] George Almasi and Allan Gottlieb. Highly Parallel Computing. Benjamin/Cummings, Redwood City, CA, 1989.

[8] Gene M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. AFIPS conference proceedings, 30:483–485, April 1967.


[9] Mostafa H. Ammar and Stanley B. Gershwin. Equivalence-relations in queuing models of fork/join networks with blocking. Performance Evaluation, 10(3):233–245, 1989.

[10] John B. Andrews. An analytical approach to performance/cost modelling of parallel computers. Journal of Parallel and Distributed Computing, 12:343–356, 1991.

[11] James Archibald and Jean-Loup Baer. Cache coherence protocols: Evaluation using a multiprocessor simulation model. ACM Transactions on Computer Systems, 4(4):273–298, November 1986.

[12] H. Ashcroft. The productivity of several machines under the care of one operator. Journal of the Royal Statistical Society, Series B, 12(1):145–151, 1950.

[13] Tim S. Axelrod. Effects of synchronization barriers on multiprocessor performance. Parallel Computing, 3:129–140, 1986.

[14] Rajive Bagrodia. Process synchronization: Design and performance evaluation of distributed algorithms. IEEE Transactions on Software Engineering, 15(9):1053–1065, September 1989.

[15] David H. Bailey. Vector computer memory bank contention. IEEE Transactions on Computers, C-36(3):293–298, March 1987.

[16] F. Baskett, K. M. Chandy, R. R. Muntz, and F. G. Palacios. Open, closed and mixed networks of queues with different classes of customers. Journal of the ACM, 22(2):249–260, April 1975.

[17] Yasmina Belhamissi and Maurice Jegado. Scheduling in distributed systems: Survey and questions. Research Report 1478, INRIA-Rennes, July 1991.

[18] John K. Bennett, John B. Carter, and Willy Zwaenepoel. Adaptive software cache management for distributed shared memory. Technical Report COMP TR90-109, Department of Computer Science, Rice University, March 1990.

[19] F. Benson and D. R. Cox. The productivity of machines requiring attention at random intervals. Journal of the Royal Statistical Society, Series B, 13(1):65–82, 1951.

[20] J. Boreddy and A. Paulraj. On the performance of transputer arrays for dense linear systems. Parallel Computing, 15:107–117, 1990.

[21] John W. Boyse and David R. Warn. A straightforward model for computer performance prediction. ACM Computing Surveys, 7:73–93, 1975.

[22] G. Bracha and S. Toueg. A distributed algorithm for generalized deadlock detection. Technical Report 83-559, Cornell University, Department of Computer Science, June 1983.

[23] Luigi Brochard and Alex Freau. Designing algorithms on hierarchical memory multiprocessors. Proceedings of 1990 International Conference on Supercomputing, 18(3):414–427, September 1990. In Computer Architecture News.

[24] Ingrid Y. Bucher and Donald A. Calahan. Access conflicts in multiprocessor memories: queueing models and simulation studies. Proceedings of the 1990 International Conference on Supercomputing, 18(3):428–438, September 1990.

[25] M. Calzarossa, V. Comincioli, G. Meloni, and G. Serazzi. Experimental studies of the performance of a supercomputer. In Kartashev and Kartashev [81], pages 343–349.

[26] Edward A. Carmona and Michael D. Rice. Modelling the serial and parallel fractions of a parallel algorithm. Journal of Parallel and Distributed Computing, 13:268–298, 1991.

[27] Zhou Chaochen, C. A. R. Hoare, and Anders P. Ravn. A calculus of durations. Information Processing Letters, 40:269–276, 1991.

[28] Ken Chen and Paul Muhlethaler. A family of scheduling algorithms for real-time systems using time value functions. Research Report 1530, INRIA-Rocquencourt, September 1991.

[29] Wesley W. Chu, Chi-Man Sit, and Kin K. Leung. Task response time for real-time distributed systems with resource contentions. IEEE Transactions on Software Engineering, 17(10):1076–1092, October 1991.

[30] G. Ciardo and K. S. Trivedi. A decomposition approach for stochastic reward net models. Performance Evaluation, 18(1):37–59, 1993.

[31] M. Cosnard, Y. Robert, and B. Tourancheau. Evaluating speedups on distributed memory architectures. Parallel Computing, 10:247–253, 1989.
[32] William J. Dally. Performance analysis of k-ary n-cube interconnection networks. IEEE Transactions on Computers, 39(6):775–785, June 1990.
[33] Jim Davies. Specification and Proof in Real-Time Systems. PhD thesis, Oxford University Computing Laboratory, Oxford University, UK, 1991. Also available as Oxford PRG Technical Monograph PRG-93.
[34] Neil Davies. Use of an observationally-based performance model for informing scheduling decisions. Technical Report CSTR-94-11, Department of Computer Science, University of Bristol, September 1994.
[35] Peter J. Denning and Jeffrey P. Buzen. The operational analysis of queueing network models. Computing Surveys, 10(3):225–261, September 1978.
[36] Derek L. Eager, John Zahorjan, and Edward D. Lazowska. Speedup versus efficiency in parallel systems. IEEE Transactions on Computers, 38(3):408–423, March 1989.
[37] P. Evripidou, W. Najjar, and J-L. Gaudiot. A single-assignment language in a distributed memory multiprocessor. In PARLE '89 Conference Proceedings, Volume 2, number 366 in Lecture Notes in Computer Science, pages 304–320, 1989.
[38] V. Faber, M. Lubeck, and Andrew B. White Jr. Superlinear speedup of an efficient sequential algorithm is not possible. Parallel Computing, 3:259–260, 1986.
[39] Horace P. Flatt and Ken Kennedy. Performance of parallel processors. Parallel Computing, 12:1–20, 1989.
[40] H. P. Flatt. A simple model for parallel processing. Computer, 17(11):95, November 1984.
[41] Eric Freudenthal and Allan Gottlieb. Process coordination with fetch-and-increment. ASPLOS 4 Proceedings, 19(2):260–268, April 1991. In Computer Architecture News.
[42] T. C. Fry. Probability and its Engineering Uses. Van Nostrand (New York), 1928.

[43] Erol Gelenbe. Multiprocessor Performance. John Wiley & Sons, 1989.
[44] Alessandro Genco. Parallel performance prediction of time and space scalable problems. In Joosen and Milgrom [69], pages 160–163.
[45] Apostolos Gerasoulis, Sesh Venugopal, and Tao Yang. Clustering task graphs for message passing architectures. Proceedings of 1990 International Conference on Supercomputing, 18(3):447–456, September 1990.
[46] Lance A. Glasser and Charles A. Zukowski. Continuous models for communication density constraints on multiprocessor performance. IEEE Transactions on Computers, 37(6):652–656, June 1988.
[47] Stuart Green. Parallel Processing for Computer Graphics. PhD thesis, University of Bristol, UK, 1989. Later published as [48].
[48] Stuart Green. Parallel Processing for Computer Graphics. Research Monographs in Parallel and Distributed Computing. Pitman Publishing, 1991. This is a revised version of [47].
[49] Anne Greenbaum. Synchronization costs on multiprocessors. Parallel Computing, 10:3–14, 1989.
[50] John Gustafson, Diane Rover, Stephen Elbert, and Michael Carter. The design of a scalable, fixed-time computer benchmark. Journal of Parallel and Distributed Computing, 12:388–401, 1991.
[51] John L. Gustafson. Reevaluating Amdahl's law. Communications of the ACM, 31(5):532–533, May 1988.
[52] John L. Gustafson. Bridging the gap between Amdahl's Law and Sandia Laboratory's result. Communications of the ACM, 32(8):1014–1016, 1989.
[53] John L. Gustafson. Response to "Once Again, Amdahl's law". Communications of the ACM, 32(2):263–264, February 1989.
[54] John L. Gustafson. The consequences of fixed time performance measurement. In Scriver [110].
[55] John L. Gustafson, Gary R. Montry, and Robert E. Benner. Development of parallel methods for a 1024-processor hypercube. SIAM J. Sci. Stat. Comput., 9(4):609–638, July 1988.

[56] Eric Haines. Standard procedural databases. Available through netlib at [email protected], May 1988. Version 2.4.
[57] J. M. Harrison. The heavy traffic approximation for single server queues in series. Journal of Applied Probability, 10(3):613–629, 1973.
[58] Peter G. Harrison. Analytical models for multi-stage interconnection networks. In Newcastle Systems Modelling Conference 1990 [95]. Published in Newcastle Technical Report Series.
[59] Peter G. Harrison. Analytic models for multistage interconnection networks. Journal of Parallel and Distributed Computing, 12:357–369, 1991.
[60] Peter G. Harrison and Naresh M. Patel. Modelling circuit-switched multistage interconnection networks. In Newcastle Systems Modelling Conference 1990 [95]. Published in Newcastle Technical Report Series.
[61] Michael Heath and Patrick Worley. Once again, Amdahl's law. Communications of the ACM, 32(2):263, February 1989.
[62] Philip Heidelberger and Kishor Trivedi. Analytic queueing models for programs with internal concurrency. IEEE Transactions on Computers, C-32(1):73–82, January 1983.
[63] Jane Hillston. PEPA: Performance enhanced process algebra. Research Report CSR-24-93, Department of Computer Science, University of Edinburgh, March 1993.
[64] Jane Hillston. A Compositional Approach to Performance Modelling. PhD thesis, University of Edinburgh, UK, 1994.
[65] C. A. R. Hoare. Communicating Sequential Processes. Prentice-Hall, 1985.
[66] Roger Hockney. Performance parameters and benchmarking of supercomputers. Parallel Computing, 17:1111–1130, 1991.
[67] Kai Hwang. Advanced Computer Architecture: Parallelism, Scalability, Programmability. McGraw-Hill, 1993.
[68] R. Janssen. A note on superlinear speedup. Parallel Computing, 4:211–213, 1987.

[69] W. Joosen and E. Milgrom, editors. Parallel Computing: From Theory to Sound Practice. IOS Press, 1992.
[70] A. Karp and H. Flatt. Measuring parallel processor performance. Communications of the ACM, 33(5):539–543, May 1990.
[71] Shin-Dug Kim, Mark A. Nichols, and Howard Jay Siegel. Modelling overlapped operation between the control unit and processing elements in an SIMD machine. Journal of Parallel and Distributed Computing, 12:329–342, 1991.
[72] Peter J. B. King. Computer and Communication Systems Performance Modelling. Prentice Hall, 1990.
[73] J. F. C. Kingman. On queues in heavy traffic. Journal of the Royal Statistical Society, Series B, 24:383–392, 1962.
[74] J. F. C. Kingman. Markov population processes. Journal of Applied Probability, 6:1–18, 1969.
[75] Leonard Kleinrock. Queueing Systems Volume 1: Theory. John Wiley & Sons, 1975.
[76] Leonard Kleinrock. Queueing Systems Volume 2: Computer Applications. John Wiley & Sons, 1976.
[77] Leonard Kleinrock and Jau-Hsiung Huang. On parallel processing systems: Amdahl's law generalized and some results on optimal design. IEEE Transactions on Software Engineering, 18(5):434–447, May 1992.
[78] J. Kollerstrom. Heavy traffic theory for queues with several servers. I. Journal of Applied Probability, 11:544–552, 1974.
[79] H. Kordecki. Some problems of multiprocessing – selected survey. Systems Analysis Model Simulation, 6(11-12):915–922, 1989.
[80] Clyde P. Kruskal, Larry Rudolph, and Marc Snir. A complexity theory of efficient parallel algorithms. Theoretical Computer Science, 71:95–132, 1990.
[81] Lana P. Lartashev and Steven I. Kartashev, editors. SUPERCOMPUTING'87, volume III, Suite B-309, 2000-34th Street, South, St. Petersburg, Florida, 33711, 1987. International Supercomputing Institute, Inc.

[82] Ruby Bei-Loh Lee. Performance bounds in parallel processor organizations. In D. J. Kuck, D. Lawrie, and A. Sameh, editors, High speed computer and algorithm organization, pages 453–455, New York, 1977. Academic Press.
[83] R. Luling. Private communication, 1991.
[84] R. Luling, B. Monien, and F. Ramme. Load balancing in large networks: A comparative study. In IEEE Symposium on Parallel and Distributed Processing, Dallas, pages 686–689, 1991.
[85] S. Majumdar, C. M. Woodside, J. E. Neilson, and D. C. Petriu. Performance bounds for concurrent software with rendezvous. Performance Evaluation, 13(4):207–236, 1991.
[86] Joanne L. Martin. Performance evaluation: Applications and architectures. In Lartashev and Kartashev [81], pages 369–373.
[87] Patrick F. McGehearty. Performance Evaluation of a Multiprocessor under Interactive Workloads. PhD thesis, Department of Computer Science, Carnegie-Mellon, August 1980.
[88] Robin Milner. Communication and Concurrency. Prentice-Hall, 1989.
[89] I. Mitrani. Modelling of computer and communications systems. Number 24 in Cambridge Computer Science Texts. Cambridge University Press, 1987.
[90] James Mohan. Performance of parallel programs: Model and analyses. Technical Report CMU-CS-84-141, Carnegie-Mellon, Department of Computer Science, July 1984.
[91] Cleve Moler. Matrix computation on distributed memory multiprocessors. In M. Heath, editor, Hypercube Multiprocessors, pages 181–195, Philadelphia, 1986. SIAM Publications.
[92] Dieter Muller-Wichards. Problem size scaling in the presence of parallel overhead. Parallel Computing, 17:1361–1376, 1991.
[93] R. Nelson and A. N. Tantawi. Approximate analysis of fork/join synchronization in parallel queues. IEEE Transactions on Computers, 37(6):739–743, June 1988.


[94] Randolph Nelson, Don Towsley, and Asser N. Tantawi. Performance analysis of parallel processing systems. IEEE Transactions on Software Engineering, 14(4):532–540, April 1988.
[95] University of Newcastle-on-Tyne and ICL System Modelling Conference, September 1990. Published in Newcastle Technical Report Series.
[96] David M. Nicol and Frank H. Willard. Problem size, parallel architecture, and optimal speedup. Journal of Parallel and Distributed Computing, 5:404–420, 1987.
[97] Michael G. Norman. Bulk synchronous parallelism. WoTUG Newsletter, 17:32–35, July 1992.
[98] Daniel Nussbaum and Anant Agarwal. Scalability of parallel machines. Communications of the ACM, 34(3):57–61, March 1991.
[99] Raif O. Onvural. Survey of closed queueing networks with blocking. ACM Computing Surveys, 22(2):83–121, 1990.
[100] Derek J. Paddon and Stuart A. Green. Handling graphical databases in parallel architectures. In Parallel Processing for Display, BCS Computer Graphics and Display, London, April 1989. British Computer Society.
[101] D. Parkinson. Parallel efficiency can be greater than unity. Parallel Computing, 3:261–262, 1986.
[102] Krzysztof Pawlikowski. Steady-state simulation of queueing processes: A survey of problems and solutions. ACM Computing Surveys, 22(2):123–170, June 1990.
[103] James L. Peterson. Petri net theory and the modelling of systems. Prentice-Hall, Englewood Cliffs, NJ, 1981.
[104] Brigitte Plateau and Jean-Michel Fourneau. A methodology for solving Markov models of parallel systems. Journal of Parallel and Distributed Computing, 12(4):370–387, August 1991.
[105] Wolfgang Reisig. Petri Nets: An introduction. Springer-Verlag, New York, 1985. Translation of Petrinetze, Springer-Verlag Berlin, 1982.


[106] Jerome Alexander Rolia. Predicting the Performance of Software Systems. PhD thesis, University of Toronto, January 1992. Also published as Technical Report CSRI-260, Computer Systems Research Institute.
[107] Youcef Saad and Martin H. Schultz. Data communication in parallel architectures. Parallel Computing, 11:131–150, 1989.
[108] Vivek Sarkar. Partitioning and Scheduling Parallel Programs for Multiprocessors. The MIT Press, Cambridge, Mass, 1989.
[109] Isaac D. Scherson and Peter F. Corbett. Communications overhead and the expected speedup of multidimensional mesh-connected parallel processors. Journal of Parallel and Distributed Computing, 11:86–96, 1991.
[110] Bruce D. Scriver, editor. Proceedings of the Hawaii International Conference on System Sciences, volume 2, Los Alamitos, California, January 1992. IEEE Computer Society Press.
[111] Kenneth C. Sevcik. Characterizations of parallelism in applications and their use in scheduling. Performance Evaluation Review, pages 171–180, 1989.
[112] G. W. Stewart. Communication and matrix computations on large message passing systems. Parallel Computing, 16:27–40, 1990.
[113] Brian R. Stonebridge. Cardinality of strings drawn from multisets. Technical Report CSTR-92-21, University of Bristol, Department of Computer Science, June 1992.
[114] Brian R. Stonebridge and Richard L. Soulsby. Damped nonlinear least-squares computation of a model for scouring of the seabed around a vertical cylinder. Technical Report TR-91-34, University of Bristol, Department of Computer Science, December 1991.
[115] Michael Stumm and Songnian Zhou. Algorithms implementing distributed shared memory. Computer, 23(5):54–64, May 1990.
[116] Xian-He Sun and John L. Gustafson. Toward a better parallel performance metric. Parallel Computing, 17:1093–1109, 1991.
[117] Xian-He Sun and Lionel M. Ni. Another view on parallel speedup. In Supercomputing 90, pages 324–333, 1990.

[118] R. J. Swan, S. H. Fuller, and D. P. Siewiorek. Cm* – a modular, multi-microprocessor. In Proceedings AFIPS 1977 Fall Joint Computer Conference, volume 46, pages 637–644, 1977.
[119] Dirk Taubner. Finite Representations of CCS and TCSP Programs by Automata and Petri Nets. Number 369 in LNCS. Springer-Verlag, 1989.
[120] Henk C. Tijms. Stochastic Modelling and Analysis: a computational approach, chapter 1. John Wiley & Sons, 1986.
[121] Chris Tofts. The autosynchronisation of Leptothorax Acervorum (Fabricius) described in WSCCS. Research Report ECS-LFCS-90-128, LFCS, University of Edinburgh, December 1990.
[122] Chris Tofts. A synchronous calculus of relative frequency. In J. C. M. Baeten and J. W. Klop, editors, CONCUR'90, number 458 in Lecture Notes in Computer Science, pages 467–480. Springer-Verlag, 1990.
[123] L. G. Valiant. General purpose parallel architectures. In Handbook of Theoretical Computer Science. North Holland, Amsterdam, 1989.
[124] Leslie G. Valiant. Bulk-synchronous parallel computers. In Mike Reeve and Steven Ericsson Zenith, editors, Parallel Processing and Artificial Intelligence, chapter 2, pages 15–22. Wiley, UK, 1989.
[125] Leslie G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103–111, 1990.
[126] F. A. Van-Catledge. Toward a general model for evaluating the relative performance of computer systems. The International Journal of Supercomputing Applications, 3(2):100–108, 1989.
[127] Mary K. Vernon, Edward D. Lazowska, and John Zahorjan. An accurate and efficient performance analysis technique for multiprocessor snooping cache-consistency protocols. Proceedings of the 15th Annual International Symposium on Computer Architecture, 16(2):308–315, May 1988. In Computer Architecture News.
[128] Willis H. Ware. The ultimate computer. IEEE Spectrum, 9:84–91, March 1972.

[129] Patrick H. Worley. The effect of time constraints on scaled speedup. SIAM Journal of Scientific and Statistical Computing, 11(5), September 1990.
[130] Patrick H. Worley. Limits on parallelism in the numerical solution of linear partial differential equations. SIAM J. Sci. Stat. Comput., 12(1):1–35, January 1991.
[131] Jack Worlton. Toward a taxonomy of performance metrics. Parallel Computing, 17:1073–1092, 1991.
[132] William A. Wulf and C. Gordon Bell. C.mmp – a multi-mini-processor. In Proceedings of the AFIPS 1972 Fall Joint Computer Conference, volume 41, pages 766–777, 1972.
[133] Qing Yang and Laxmi N. Bhuyan. Performance of multiple-bus interconnections for multiprocessors. Journal of Parallel and Distributed Computing, 8(3):267–273, March 1990.
[134] Xiaodong Zhang. Performance measurement and modelling to evaluate various effects on a shared memory multiprocessor. IEEE Transactions on Software Engineering, 17(1):87–93, January 1991.
[135] Wanlei Zhou and Brian Molinari. A model of execution time estimating for RPC-oriented programs. In Proceedings of Advances in Computing and Information, Niagara Falls, May 1990, number 468 in Lecture Notes in Computer Science, pages 376–384. Springer-Verlag, 1990.
[136] X. Zhou. Bridging the gap between Amdahl's law and Sandia laboratory's results. Communications of the ACM, 32(8):1014–1015, 1989.


Appendix A

Proofs

The basis of the proofs in this section is the BCMP theorem [16]. This was a collective effort of Baskett, Chandy, Muntz and Palacios and represented a major step forward in the use of queueing theory for the analysis of networks of queues. The general theory, with its proof, is not reproduced here but can be found in many sources [16, 4, 99]. As all the systems considered here come under the category of `closed' systems (the number of jobs in the system remains constant), the slightly simplified form for such finite systems that can be found in Onvural [99] is used.

In its most general form the BCMP theorem can describe the behaviour of queueing networks in which jobs traverse from one service centre (node) in the system to another; they receive service and then go on to other service centres. The system can consist of several classes of job that can be routed through the nodes in the system separately. The number of jobs in the system can be constant, when the system is termed a closed system, or jobs can arrive and depart according to some rules, this being known as an open system.

Within its limits the BCMP theorem is very general; one aspect of this generality important to us here is the allowable distributions for type 3 nodes. The service discipline for these nodes can be anything which can be described by a Coxian distribution. Coxian distributions can be used to approximate, as closely as desired, any probability distribution with a rational Laplace transform [72, p133]. This covers almost all distributions seen in practice.

The use of the BCMP theorem generates product-form solutions for networks of queues:

(A.1)    p(s) = G \prod_{i=1}^{N} f_i(s_i)

In this theorem s is a state vector, the components of which represent the individual populations at individual service centres (this is very similar to the concept of population processes in [74]); i ranges over the number of nodes in the queueing network. The contribution of each component to the general system is dependent on the particular characteristics of that node in the queueing system, the difference being denoted by the functions f_i. The constant G is picked so that \sum_{s \in S} p(s) = 1, where S is the set of all possible state vectors. As can be seen the number of states in the state space grows rapidly; a ten node network with ten customers has 1,847,560 states. As such this theorem is generally used as a basis for numerical solutions (a small computational sketch of such a solution is given after the node-type descriptions below).

As has been mentioned, the factor f_i(s_i) is dependent on the type of the node i, namely:

Type 1 These are nodes at which all the customers have the same service time distributions, that is, negatively exponentially distributed with mean 1/μ_i. The service discipline is FCFS. The state s_i is the vector (r_1, r_2, ..., r_{n_i}) where n_i is the number of customers present at node i and r_j is the class index of the j-th customer in FCFS order. There is a single server whose speed C_i(n_i) depends on the number of customers at node i; that is, the instantaneous service completion rate at node i is μ_i C_i(n_i). Then the factor f_i(s_i) for this type of node is given by:

(A.2)    f_i(s_i) = \prod_{j=1}^{n_i} \frac{e_{ir_j}}{\mu_i C_i(j)}

The term e_{ir} is the visit ratio of the jobs of this class to the node as compared to all other jobs of other classes that visit the same node. This visit ratio appears in the formulation for each node type.

Type 2 These nodes consist of a single server where the service discipline is processor-sharing (ie when there are n customers at the node, each receives service at 1/n of the service rate). Nodes of this type are not required for the analysis below, hence they are not expanded upon here.

Type 3 These nodes consist of as many servers as there may be customers. Each customer is assigned a separate server as it arrives at the node. The service times may be different for each class r, and can be an arbitrary Coxian distribution. The node vector s_i represents not only the number of customers from each class, but the particular number of customers of a class at each of the substates of the Coxian distribution.

(A.3)    f_i(s_i) = \prod_{r=1}^{R} \prod_{l=1}^{L_{ir}} \frac{1}{n_{irl}!} \left( \frac{e_{ir} A_{irl}}{\mu_{irl}} \right)^{n_{irl}}

where R is the total number of job classes and A_{irl}, L_{ir} and μ_{irl} characterise the Coxian distribution. In the case where there is only one job class and the service distribution is a single negative exponential with mean 1/μ_i (the simplest Coxian distribution) the equation above reduces to

(A.4)    f_i(s_i) = \frac{1}{n_i!} \left( \frac{e_i}{\mu_i} \right)^{n_i}

Type 4 These nodes consist of a single server with preemptive last come first served service discipline. Nodes of this type are not present in the systems studied here.
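To make the numerical use of the theorem concrete, the following Python sketch enumerates the states of a small closed network of the kind analysed in §A.1 below (a queue-dependent type 1 node, a fixed-rate type 1 node and a type 3 delay node) and computes the normalisation constant by brute force. This is an illustration only, not part of the original development; the parameter values, rate names (lam_r, mu, delta) and the weight function are assumptions made for the example.

from math import factorial
from itertools import product

# Assumed illustrative parameters: K jobs circulating, request rate lam_r,
# service rate mu, delay rate delta.
K, lam_r, mu, delta = 4, 1.0, 2.0, 0.5

def weight(s1, s2, s3):
    # Product of the per-node factors, cf. equations (A.6)-(A.8) below.
    f1 = (1.0 / lam_r) ** s1 / factorial(s1)   # type 1, completion rate j*lam_r
    f2 = (1.0 / mu) ** s2                      # type 1, fixed rate mu
    f3 = (1.0 / delta) ** s3 / factorial(s3)   # type 3 delay, one server per job
    return f1 * f2 * f3

# Closed system: enumerate every state vector whose populations sum to K.
states = [s for s in product(range(K + 1), repeat=3) if sum(s) == K]
norm = sum(weight(*s) for s in states)         # this sum is 1/G
for s in states:
    print(s, weight(*s) / norm)                # steady-state probability p(s)

Even this brute-force enumeration scales poorly with the number of nodes and customers, which is why purpose-built numerical algorithms are normally used for larger networks.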

A.1 Multiple processors, single thread with delay

The model that has been presented in §4.2 can be viewed as one consisting of three nodes. The first is a service centre that represents the pool of processors.

Processor Pool The rate at which jobs are processed (ie the rate at which requests against the virtual memory system are generated) is proportional to the number of jobs (processors) that are present (running) at the service centre. Each processor services jobs (generates requests) at the rate λ_r. This corresponds to a type 1 BCMP node. This is the first node in the state vector.

Service Centre The rate at which jobs (requests) are serviced (satisfied by the virtual memory system) is independent of the number of the jobs (requests) that are present at this node, a type 1 BCMP node. This is the second node in the state vector.

Delay Node Delay is represented by as many servers as there are requests; each delay is seen as a service centre with an exponentially distributed service. This corresponds to the type 3 BCMP node. This is the third node in the state vector.

A job (cycle of behaviour) travels from the processor pool to the service centre and back to the processor pool via the delay. All processors are assumed to be homogeneous and there is no distinction between the requests of different processors. This corresponds to there being only one job class in the BCMP system. This system is a closed system and the total number of jobs is the same as the number of processors in the processor pool. The state vector s has three components representing the number of jobs at each successive node; these values always sum to K, the number of processors in the system. As every job visits every node in succession all the visit ratios are one. From the BCMP theorem the probability of being in a particular state is given by

(A.5)    p(s) = G f_1(s_1) f_2(s_2) f_3(s_3)

where

(A.6)    f_1(s_1) = \prod_{j=1}^{s_1} \frac{1}{j \lambda_r}

(A.7)    f_2(s_2) = \prod_{j=1}^{s_2} \frac{1}{\mu}

(A.8)    f_3(s_3) = \frac{1}{s_3!} \left( \frac{1}{\delta} \right)^{s_3}

This reduces to

(A.9)    p(s) = G \, \frac{1}{s_1!} \left( \frac{1}{\lambda_r} \right)^{s_1} \left( \frac{1}{\mu} \right)^{s_2} \frac{1}{s_3!} \left( \frac{1}{\delta} \right)^{s_3}

Ignoring the normalisation constant (G) for the moment, and noting that (in terms of the nomenclature in §4.2.2.2) the total number of jobs (processors) is K, the number of requests that are present in the system (ie receiving service or undergoing delay) is i, and the number of responses that are undergoing delay is j, the values of the components of s are

(A.10)    s_1 = K - i, \quad s_2 = i - j, \quad s_3 = j

Substituting this into equation (A.9) gives

(A.11)    \frac{1}{(K-i)! \, j!} \left[ \left( \frac{1}{\lambda_r} \right)^{K-i} \left( \frac{1}{\mu} \right)^{i-j} \left( \frac{1}{\delta} \right)^{j} \right]

Substituting d_r for μ/δ, this becomes

(A.12)    \frac{1}{(K-i)! \, j!} \left[ \left( \frac{1}{\lambda_r} \right)^{K-i} \left( \frac{1}{\mu} \right)^{i} \right] d_r^{\,j}

Noting that (1/λ_r)^K is constant for all terms and is hence a common factor in all terms that can be removed, and using u = λ_r/μ, this becomes

(A.13)    \frac{u^i \, d_r^{\,j}}{(K-i)! \, j!}

which is identical to the expression of the state probability in equation (4.31). Thus the system described in §4.2 is equivalent to a BCMP system where

p(s) = P_{i,j}    and    G = P_{0,0}
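Equation (A.13) is straightforward to evaluate numerically. The following Python sketch (an illustration only, not part of the original development; the values of K, u and d_r are assumed) normalises the weights of equation (A.13) over all reachable (i, j) pairs and computes the expected number of processors that are running:

from math import factorial

# Assumed illustrative values: K processors, loading intensity u = lam_r/mu
# and delay ratio d_r = mu/delta.
K, u, d_r = 8, 0.4, 2.0

# Unnormalised weights from equation (A.13), for 0 <= j <= i <= K.
w = {(i, j): u**i * d_r**j / (factorial(K - i) * factorial(j))
     for i in range(K + 1) for j in range(i + 1)}
total = sum(w.values())                        # normalise so probabilities sum to 1

p = {state: v / total for state, v in w.items()}
expected_running = sum((K - i) * prob for (i, j), prob in p.items())
print("expected number of running processors:", expected_running)

With i requests outstanding, K - i processors are running, so the final sum gives the mean degree of useful parallelism under the assumed parameters.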

A.2 Single processor multiple threads with delay

Again this is a BCMP queueing network consisting of three nodes. The first is a node that represents the pool of threads within the single processor.

Processor The rate at which jobs are processed (requests against the virtual memory system are generated) is constant while there are threads at the node remaining to be run. This corresponds to a type 1 BCMP node. This is the first node in the state vector.

Service Centre The rate at which jobs (requests) are serviced (satisfied by the virtual memory system) is independent of the number of the jobs (requests) that are present at this node; a type 1 BCMP node. This is the second node in the state vector.

Delay Node Delay is represented by as many servers as there are requests; each delay is an exponential service. This corresponds to the type 3 BCMP node. This is the third node in the state vector.

A job (cycle of behaviour) travels from the processor to the service centre and back to the processor via the delay element. This corresponds to there being only one job class in the BCMP system. This system is closed and the total number of jobs is the same as the number of threads assigned to the processor. The state vector s has three components representing the number of jobs at each successive node; these values always sum to K, the number of threads in the system. As every job visits every node in succession all the visit ratios are one. From the BCMP theorem the probability of being in a particular state is given by

(A.14)    p(s) = G f_1(s_1) f_2(s_2) f_3(s_3)

where

(A.15)    f_1(s_1) = \prod_{j=1}^{s_1} \frac{1}{\lambda_r}

(A.16)    f_2(s_2) = \prod_{j=1}^{s_2} \frac{1}{\mu}

(A.17)    f_3(s_3) = \frac{1}{s_3!} \left( \frac{1}{\delta} \right)^{s_3}

This reduces to

(A.18)    p(s) = G \left( \frac{1}{\lambda_r} \right)^{s_1} \left( \frac{1}{\mu} \right)^{s_2} \frac{1}{s_3!} \left( \frac{1}{\delta} \right)^{s_3}

Ignoring the normalisation constant for the moment, and noting that (following the nomenclature in §5.2) the total number of jobs (threads) is K, the number of requests that are present in the system (ie receiving service or undergoing delay) is i, and the number of responses that are undergoing delay is j, the values of s are

(A.19)    s_1 = K - i, \quad s_2 = i - j, \quad s_3 = j

Substituting this into equation (A.18) gives

(A.20)    \frac{1}{j!} \left[ \left( \frac{1}{\lambda_r} \right)^{K-i} \left( \frac{1}{\mu} \right)^{i-j} \left( \frac{1}{\delta} \right)^{j} \right]

Substituting d_r for μ/δ, this becomes

(A.21)    \frac{1}{j!} \left[ \left( \frac{1}{\lambda_r} \right)^{K-i} \left( \frac{1}{\mu} \right)^{i} \right] d_r^{\,j}

Noting that (1/λ_r)^K is present in all terms, and is hence a common factor in these terms and can be removed, and using u = λ_r/μ, this becomes

(A.22)    \frac{u^i \, d_r^{\,j}}{j!}

which is identical to the expression for the state probability given in equation (5.11). Thus the system described in §5.2 can be viewed as a BCMP system where

p(s) = P_{i,j}    and    G = P_{0,0}
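One practical use of equation (A.22) is to estimate how many threads are needed to keep the single processor busy. The sketch below is illustrative only (u and d_r are assumed values, and the function name busy_probability is ours); it normalises the weights of (A.22) and reports the probability that the processor still has a runnable thread, for increasing K:

from math import factorial

# Assumed illustrative values: loading intensity u and delay ratio d_r.
u, d_r = 0.4, 2.0

def busy_probability(K):
    # Unnormalised weights from equation (A.22), for 0 <= j <= i <= K.
    w = {(i, j): u**i * d_r**j / factorial(j)
         for i in range(K + 1) for j in range(i + 1)}
    total = sum(w.values())
    # The processor is idle only when every one of the K threads is
    # awaiting a response, ie when i == K.
    idle = sum(v for (i, j), v in w.items() if i == K) / total
    return 1.0 - idle

for K in (1, 2, 4, 8, 16):
    print(K, busy_probability(K))

As K grows the idle probability falls, which is the latency-hiding effect examined in the body of the thesis.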

A.3 Equality of state probabilities within a cohort – without delay

There are two different approaches to demonstrating this property of systems; the first is based entirely on symmetry, the latter on a use of the BCMP theorem.

The first equality argument is based on an appeal to the inherent symmetry that the system must have, combined with the complete solution of small systems supporting this argument. It must be accepted that, in the systems that have been considered, the processors are homogeneous in all respects that may affect their performance. This being the case, given that the system starts in S_{[]}, there can be no difference in the numerical value of the probabilities of the system being in S_{[x]} where x ∈ {processors}. This argument can be easily continued for the first M - 1 cohorts (ie where each state has N successor states). For the succeeding cohorts the same type of argument can be used for all those states within a cohort that have the same number of successor arcs. In the final cohort each state represents the system where there are no processors running and the cohort consists of states that represent all possible interleavings of requests from processors. If these states had different probabilities this would imply that some interleaving of requests was more likely or some processor received different quality of service from other processors. As the path from the initial state is the traversal down a tree of states, the equality of probability of the leaves of that tree implies that for the predecessor cohort, all the probabilities of the states in that cohort are also identical. This argument can then be used recursively until the initial cohort of the tree is reached; there being only one state in this cohort, the equality condition is trivially satisfied.

The alternative approach uses the BCMP theorem. In reasoning about multiple threads with multiple processors the simplifications of the BCMP theorem used in the previous sections are no longer applicable. In the case of multiple threads we have to preserve the identity of the processor originating each request. Expressing this in terms of the queueing network model that BCMP offers, each processor corresponds to a node and the threads allocated to that processor correspond to a given job class. The associated cycle of behaviour is more complex. The ability to route on the basis of job class with the BCMP model means that the cycle of behaviour from the point of view of the processors is as follows: a processor X issues a request (this corresponds to a job in class X moving from the processor pool to the service point); this request is serviced in a FCFS fashion along with all other requests from all other processors regardless of their origin (and hence their class). All the responses are returned to the originating processor.

For a system with N processors the corresponding BCMP system has a state vector s with N + 1 nodes, the additional node corresponding to the server. All the nodes in this network are of type 1 and have service rates that are independent of the number of jobs (requests) that are present at that node. From the general BCMP formulation, equation (A.1), the state probabilities are

given by

(A.23)    p(s) = G f_1(s_1) f_2(s_2) \ldots f_N(s_N) f_{N+1}(s_{N+1})

+1

+1

 1 sN

fN (sN ) = 

(A.24)

+1

+1

+1

When the system is in cohort n, this represents n requests (jobs) present at node N + 1 (the service point) in the queueing network. As the sum of all the jobs in the system is NM this implies that the probability function for each state in a cohort Cn is of the form (A.25)

 1 NM ,n  1 n

p(s) = G  r



8s 2 Cn

Noting that (1=r)NM is a constant, and present for all states in the system, the probability of each state in a cohort is (A.26)

p(s) = G0

 r n 

= G0 un

8s 2 Cn

Hence the probability of each state in a cohort is dependent only on the number of requests outstanding, not on their ordering.

210

A.4 Probability distributions in cohorts with delay The inclusion of delay into the multi-processor, multi-threaded model can be seen as a simple amendment to the model discussed in the previous section. The di erence is the addition of one extra node in the queueing network to represent the delay. The application of the BCMP theorem yields a probability equation of the form (A.27)

p(s) = G

"Y N

#

fi (si ) fN (sN )fN (sN ) +1

i

+1

+2

+2

=1

where the nodes N + 1 and N + 2 correspond to the service centre and delay respectively. As before, for the rst N nodes, the corresponding fi is (1=r )si . The visit ratio is 1 for the nodes by the same argument. This gives the form of fN as

 1 sN

fN (sN ) = 

(A.28)

+1

+1

+1

+1

The delay node treats all requests equally and the associated visit ratio is also one. This gives

 sN fN (sN ) = s 1 ! 1 N

(A.29)

+2

+2

which reduces the state probability equation to (A.30)

+2

+2

p(s) = G

"Y N  1 si #  1 sN i

=1

r

+1



1  1 sN sN ! 

+2

+2

Given that there are NM potential threads (jobs) in the system of which i are currently awaiting a response (ie sN + sN = i), and j are undergoing delay, we get +1

+2

 NM ,i  i,j  j

1 1 1 p(s) = G 1  j !  r      j NM i ir 1 j 1 = G j1! 1 r

(A.31)

= G j1! ui djr

211

given that u = λ_r/μ, d_r = μ/δ and the factor (1/λ_r)^{NM} is a constant present in all the terms in the system. Thus the probabilities within a cohort C_{i,j} are dependent solely on the values of the subscripts i and j.


Appendix B

Glossary of Terms

B.1 Queueing Theory

S_x

This represents the state signified by x; x can be a non-negative integer, in the case of birth-death systems, or a sequence of non-negative labels, where the state is denoted by a path from some distinguished state.

P_x

The probability that the system, when in its steady state, will be found in S_x. This is equal to the fraction of time that the system spends in that state.

E [x]

The expected value of x, ie its arithmetic mean.

Poisson process

A stochastic process in which the interval between events has a negative exponential distribution. Poisson processes are memoryless, ie the time to the next event is not dependent on when the observation of the process commenced. (see §3.2.2)

Markov Chain (MC)

A Markov Chain is a stochastic process whose future behaviour is dependent solely on its current state. The state space can be either discrete or continuous. (see §3.2.3)

birth-death MC

A birth-death Markov Chain is a stochastic process where an increase of the state variable corresponds to some arrival (birth) into the system, and a decrease corresponds to a departure (death).

request rate (λ)

The symbol λ is traditionally used to represent the rate of arrivals into a queueing system; the average time between arrivals is 1/λ.

Kendall's Notation

A shorthand notation to capture the major components of a queueing system; the general description is: A/B/c/K/m/Z. The symbols traditionally used for A and B are: GI general independent interarrival time; G general service time distribution; H_k k-stage hyperexponential interarrival or service time distribution; E_k Erlang-k interarrival or service time distribution; M exponential interarrival or service time distribution; D deterministic (constant) interarrival or service time distribution. Thus an M/G/1 queueing model has exponentially distributed interarrival time, a `general' service time distribution (very few assumptions are made about it) and one server. There is an infinite population and there is no limit on the amount of queueing. The service discipline is first-come first-served.

service rate (μ)

The symbol μ is traditionally used to represent the rate at which jobs receive service. The average service time is 1/μ.

loading intensity (u)

The loading intensity is the ratio of the request rate to the service rate (usually λ/μ). It represents the fraction of the server's capacity that is required to handle the offered requests. When used in the context of a scalable number of sources of requests, u represents the fraction of the server's capacity that each source of request could utilise.

server utilisation (ρ)

This is the fraction of time that the server is occupied in servicing requests.

B.2 Petri Net Descriptions

MPN

A Marked Petri Net is a Petri net in which tokens travel from place to place; each distribution of tokens is known as a marking. (see chapter 3)

GSPN

A General Stochastic Petri Net is an MPN where there is timing information associated with the transitions; GSPNs can be shown to be homomorphic with discrete-time Markov Chains.

p_x

This is the notation used to name a place which is present in a Petri net.

t_x

This is the notation used to name a transition which is present in a Petri net.

m_x

This is the number of tokens that are present at p_x.

B.3 Performance Terms

serial portion

This is the term used by Amdahl and others to express the fraction of the total computation that is not amenable to parallelisation.

serial fraction

This is a measure of the fraction of an algorithm that has been run sequentially, ie the algorithm's serial portion. It is used to study the effects of scaling. (see §2.4.6)

speedup

This term is generally used in the literature to mean the ratio of the reduction in the run time on many processors in comparison with the run time for the same problem on a single processor. Whether the size of the problem may have been allowed to change is dependent on the particular author.

absolute speedup

We define this to be the absolute rate at which the system is computing the required solution. A single processor is said to be operating at an absolute speedup of 1 if all its computational power is being expended in the calculation of this solution. (see §3.3.1.2)

relative speedup

Relative speedup is the ratio of the time taken to solve a problem on one processor over the time taken to solve the same problem on a specific number of processors. We only use this term when the total amount of computation is constant in both cases. We define this to be the ratio of the two absolute speedups. (see `absolute speedup' and §3.3.1.1)

fixed size speedup

This metric is generated when the size of the problem is fixed as the number of processors is allowed to increase. (see §2.4.1)

scaled speedup

This speedup is achieved when the given problem is scaled so as to fit the available memory as the parallel system is scaled. (see §2.4.2)

fixed time speedup

This speedup is generated when the time for the computation is fixed and the size of the problem is scaled so that this requirement is fulfilled. (see §2.4.3)

granularity

This term has two uses; firstly, it can express the amount of computation for a certain amount of communication, and as such is similar to the concept of loading intensity; secondly, it can be used to mean the computational size of a thread of execution.

flat memory approx.

This is the assumption that the time to access memory locations by a process remains constant as a parallel system is scaled. (see §2.4.3)

potential concurrency

This is the number of threads of execution that can coexist during the solution of a given problem.

B.4 Other Nomenclature

thread

A thread (of execution) is a portion of the overall computation required for the solution. It can be executed separately from other portions. A processor is said to be multi-threaded when more than one thread of execution has been allocated to it.

cohort

A collection of states in the Markov Chain representation of a multi-threaded multi-processor system. The members of a particular cohort are those states that can be reached, given that a particular number of requests are outstanding within the system. (see §5.3)

C_n

This is the shorthand representation for the cohort of states which represent n outstanding requests in the system.

barrier synchronisation

Barrier synchronisation is said to occur when a set of processes must all reach a particular point in their execution for any one of those processes to continue.

work conservation

Work is said to be conserved when no processor is allowed to be idle if there is a process, anywhere in the system, that can be executed. Although this concept has been used extensively in the analysis of the performance of shared-memory parallel systems, it is not applicable for models of distributed-memory systems.

λ_r

We use λ_r to represent the rate at which a thread in our model generates a request.



δ

We use 1/δ to represent the delay that is being modelled. Such a delay is said to expire at a rate δ.

d_r

We use d_r to represent the ratio of the delay to the average service time, ie μ/δ.
