Workload Modeling for Parallel Processing Systems

submitted by:

Gabriele Kotsis

DISSERTATION for the attainment of the academic degree Doctor rerum socialium oeconomicarumque (Doctor of Social and Economic Sciences)

Faculty of Social and Economic Sciences, University of Vienna

Supervisors/Reviewers:

O. Univ. Prof. Dr. G. Haring
O. Univ. Prof. Dr. G. Vinek

Vienna, May 1995

Preface

A number of people contributed to the success of this work, my dissertation, and I would like to thank them warmly. First of all, I would like to thank my supervisor and reviewer, Prof. Haring, who drew my attention to the interesting field of workload modeling. From the very beginning he was available to discuss problems and difficulties, and especially in the final phase of the work, when, as many know, ideas and enthusiasm often threaten to fade, he encouraged me in regular, constructive meetings to carry on. I would also like to thank Prof. Vinek for the time and effort he devoted to reviewing my work. I owe thanks to my colleagues for their support. I would particularly like to mention Johannes Lüthi, my office mate, who was always ready to discuss problems ad hoc, and my colleague Alois Ferscha, from whom I could often obtain professional advice. Above all, I want to thank him for teaching me, by his example, the joy of and enthusiasm for scientific work. I would like to thank my students, especially those who, in the course of lab courses and term papers, implemented programs and carried out analyses that I could then use as the basis for examples in this work. The most valuable support, however, came from my parents who, although they could hardly advise me on the content, gave me, through their love and their unwavering confidence in me, the backing I needed to complete this work.

Gabriele Kotsis

Contents

1 Introduction
  1.1 Motivation
    1.1.1 Open Problems in Performance Evaluation
    1.1.2 Demands on a Workload Modeling Framework and Methodology for Parallel Processing Systems
  1.2 Addressed Types of Systems
    1.2.1 User System Interaction
    1.2.2 System Under Test and Component Under Study
    1.2.3 Parallel Architectures
    1.2.4 Parallel Programs
  1.3 Contribution of the Presented Work
    1.3.1 Summary of the Framework and the Methodology
    1.3.2 Outline of the Thesis

2 Workload Modeling - Framework and Methodology
  2.1 Definition of Workload
    2.1.1 Workload Components
    2.1.2 Specifying the Test Workload
    2.1.3 Parallel Workload Characteristics
  2.2 Workload Model Construction in a Performance Evaluation Cycle
    2.2.1 Influence of the Objective of the Evaluation Study
    2.2.2 Influence of the Evaluation Technique
  2.3 Constructing a Non-Executable Test Workload Model
    2.3.1 Model Parametrization
    2.3.2 Validation Criteria for Workload Models

3 Survey and Comparison of Parallel Workload Models
  3.1 Parameter Based Models
  3.2 Profiles, Signatures and Shapes
  3.3 Ratios for Varying System Size
  3.4 Graph Models
    3.4.1 Undirected Graph Models
    3.4.2 Directed Graph Models

4 Scalability Analysis
  4.1 Scalability Concept
    4.1.1 Problem Definition
    4.1.2 Scalability Index
    4.1.3 Characterizing the Architecture Size A
    4.1.4 Characterizing the Amount of Work W
    4.1.5 Sources of Overhead
    4.1.6 Evaluation Techniques for Scalability Analysis
  4.2 Visual Approach for Scalability Analysis
    4.2.1 Performance Characteristics
    4.2.2 Model Parameters and Definitions
    4.2.3 Comparison to other Approaches
  4.3 Examples
    4.3.1 Example 1: An Embarrassingly Parallel Problem
    4.3.2 Example 2: Householder Reduction
    4.3.3 Example 3: FFT
    4.3.4 Example 4: Finite Element Method

5 Summary and Future Work

A Abbreviations and Symbols

B Definitions and Derivations
  B.1 Petri Nets
  B.2 Series Parallel Graphs
  B.3 Stochastic Process
  B.4 Queuing Networks
  B.5 Complexity Notation
  B.6 Bounds on and Relations between Ratios
    B.6.1 Bounds
    B.6.2 Relations

Bibliography

Chapter 1

Introduction

1.1 Motivation

Increasing performance was, and still is, one of the driving forces in computer development. The term performance can be used in a very general way, characterizing the functions that a computer system provides in a quantitative and qualitative way. Performance in this broad interpretation covers the following properties of a system. First, it is used to denote the processing power of a system, which is characterized by the time needed to solve a given problem or by the size and number of problems solved in a given time. Second, the reliability of a system, e.g. the time periods of availability and unavailability or the time between failures, is an aspect of performance. Finally, functional aspects of performance include the correctness of the solutions and the ergonomic properties of the interaction between the user and the system. In this work only the first aspect of performance will be considered, and in the following the term performance will be used in this narrow sense.

Increasing demands for performance (in the sense of processing power) in scientific applications lead to a seemingly never ending quest for "peak performance" [1]. Although incredible progress was achieved in speeding and sizing up the hardware components of computer systems, the power of a single computer is limited, and new architectural concepts are needed to further improve performance. A solution to this problem was found in parallel processing.

[1] This peak performance is usually measured in FLOP rates, which give the number of floating point operations processed during a specified time interval. Typically one second is the selected time period.
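To make the footnote concrete: a theoretical peak rate is typically the product of the number of processing elements, the clock frequency, and the floating point operations completed per cycle. With invented figures (no particular machine implied):

\[ R_{\mathrm{peak}} = p \cdot f \cdot c = 1024 \cdot 150\,\mathrm{MHz} \cdot 2\,\mathrm{FLOP/cycle} \approx 0.3\,\mathrm{TFLOP/s}. \]

The gap between such a peak figure and the rate actually sustained by real applications is the subject of the following paragraphs.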



A variety of architectural concepts supporting parallel processing have been proposed, and several commercial machines are (more or less successfully) on the market. While the theoretical peak performance reported by hardware vendors has surpassed the Teraflop/s mark [2], it has been observed and documented in many case studies that the actual performance achieved with parallel computers when running real applications is far off from these optimum values. This gap between peak performance and actually attainable performance raises, on the one hand, the question of effective and efficient development of parallel software in many application domains. Effective development means that the achieved performance meets the expectations; efficient means that the development effort is in a reasonable range. Therefore special emphasis must be given to performance evaluation during the development of parallel applications, to detect and solve performance problems as early as possible, thus minimizing both the loss in performance and the effort in program development. On the other hand, new metrics, methods and techniques for comparing the performance of parallel architectures have to be found, since the benchmarks giving this peak performance as FLOP rates obviously fail to characterize the type of problems solved using parallel architectures.

Since the performance of a system can only be interpreted and compared correctly with respect to the processed load, workload modeling, i.e. selecting and characterizing the load, is a central issue in performance evaluation. Recently the topic of workload characterization has been addressed in a survey paper [Calz 93], where a chapter is also devoted to parallel processing systems. But the state of the art is still far from a systematic approach for workload modeling of parallel processing systems (PPS).

In a workload modeling framework, the entities that a workload consists of are to be defined in a systematic way, together with a possible set of characteristics describing these entities. In the corresponding workload modeling methodology, the steps necessary to construct a test workload are given, embedded in a performance evaluation cycle. The framework can be seen as the entity oriented view, while the methodology is the process oriented view of workload modeling.

[2] Cray has announced that a CRAY T3D could be configured to reach the Teraflop/s mark, but actually this machine has not been built yet.


The need for and the demands on a parallel workload modeling framework and methodology are discussed in the next section.

1.1.1 Open Problems in Performance Evaluation

In sequential processing, performance evaluation has a long tradition and is a well established discipline [3]. Based on a solid theory, a variety of techniques, tools and evaluation environments have been developed, and performance engineering activities [Smit 90, Smit 91] have become an integral part of system development [Beil 88].

When browsing through the literature on performance evaluation for parallel systems, many of the published papers are mainly descriptions of case studies, i.e. they report on measurement experiences for a particular application that has been (more or less) successfully implemented on or ported to a parallel machine. Sometimes an attempt is made to construct a model for an architecture or an application (or both). All these efforts have not yet merged into a broadly accepted and usable performance evaluation methodology.

In the following, the main issues in both keystones of performance evaluation, namely measurement and modeling, will be sketched briefly, emphasizing the importance of workload modeling.

[3] See for example the following books, which provide excellent introductions into the field of performance evaluation: [Ferr 83, Jain 91a].

Performance Measurement

The performance of a system can be evaluated using measurement techniques, i.e. using hardware and/or software tools that allow the collection of performance data during program execution. The "states" of the program and the processing system are recorded at specific instants of time. The selection of these instants in time can either be event driven (upon the occurrence of predefined events) or time driven (at certain time intervals). The collected data can either be forwarded immediately to a host system for analysis and visualization (real-time) or stored in a trace file to be processed later (post-mortem). Much has been published on measurement and visualization for parallel processing systems, including case studies, tool descriptions, and general frameworks and methodologies. A representative selection of different topics within this broad field can be found in [Hari 93].


In all measurements, the execution of a single program or of a set of programs on a real architecture is investigated. Depending on the focus of the evaluation, the following distinctions are relevant for workload characterization.

a) Investigate the program
b) Investigate the architecture
   b1) under real load
   b2) under hypothetical load

In the literature, the first two objectives (a) and (b1) are summarized under the term monitoring, while the last objective (b2) is called benchmarking.

The focus in a) is to gain insight into the behavior of the program and to detect, and subsequently eliminate, performance bottlenecks in the program (performance debugging or tuning). The major problems are (1) the massive amount of data obtained from measurements, (2) the intrusion into program behavior due to measurements, and (3) the problem of having no global clock to relate the timings obtained from different processing elements. As a consequence of (1), an additional problem arises, namely the visualization and interpretation of the results. Although workload modeling will not help in solving problems (2) and (3), it might help in reducing the amount of measurement data: a description of the load can be presented to the user before performing the measurements, so that he or she can select the aspects of interest (see for example [Musi 94, Hari 94], where a tool following this user-oriented approach to monitoring has been developed).

When evaluating a parallel architecture under real load (b1), performance indices related to the architectural components are to be obtained (e.g. processor speed for solving a particular type of problem, throughput of the interconnection network, ...). Workload modeling activities are restricted to selecting the time interval for the observations and to describing the load in order to be able to interpret the performance results.

In benchmarking (b2), the objective is to evaluate and compare the performance of parallel architectures under hypothetical load. This load can either be a selection of the real load or be artificially generated. In benchmarking, workload modeling is a central issue, since the workloads have to be selected carefully to allow a comparison between several architectures.
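As an illustration of the mechanisms named above, the following is a minimal sketch (our own, not taken from any cited tool) of an event-driven software monitor with post-mortem trace output; the routine names and the trace format are invented:

    import json
    import time

    trace = []  # in-memory event buffer, written out post-mortem

    def record(event, **info):
        # event-driven probe: invoked upon occurrence of a predefined event
        trace.append({"t": time.perf_counter(), "event": event, **info})

    def monitored(func):
        # wrap a routine so that its start and end are recorded as events
        def wrapper(*args, **kwargs):
            record("start", routine=func.__name__)
            result = func(*args, **kwargs)
            record("end", routine=func.__name__)
            return result
        return wrapper

    @monitored
    def work(n):
        return sum(i * i for i in range(n))

    work(100000)
    # post-mortem analysis: persist the trace for later processing
    with open("trace.json", "w") as f:
        json.dump(trace, f, indent=2)

A time-driven monitor would instead sample the program state at fixed intervals, trading completeness of the event record for bounded intrusion.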

Performance Modeling

Performance modeling summarizes all analytic or simulation techniques that aim to evaluate the performance of a system. In parallel processing there are three major factors influencing the performance: the program, the parallel machine, and the mapping of program parts and data to resources of the machine. Within each category numerous parameters influence the performance; their interdependencies and interactions have not been studied sufficiently yet.

Several modeling techniques that have already been applied to the analysis of sequential systems have been applied in the analysis of parallel systems as well [Erha 91]. These models consider either only the program [Mahg 92, Brui 88], or the architecture [Akyi 92, Butl 86, Chen 92], or a monolithic combination of both [Cand 91, Deme 91]. But a general methodology or framework for parallel performance modeling is still missing.

In the past few years several approaches have been presented that can be seen as a first step towards a systematic parallel performance modeling methodology (see also [Herz 91], where similar observations are reported). Instead of directly constructing a monolithic performance model of both the program and the architecture, the concept of separating the description of the program and the parallel machine is pursued [Fers 90a, Peas 91, Gemu 93]. The program model and the architecture model are (sometimes automatically) transformed into a combined performance model (see Figure 1.1). A methodology based on this idea is called cooperative, in contrast to monolithic approaches.

[Figure 1.1: A Cooperative Performance Modeling Methodology. A (set of) programs is represented in a workload model; the parallel architecture, onto which the programs are mapped, is represented in an architecture model; the two models are combined into a combined performance model.]
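The division of labor in a cooperative methodology can be made tangible with a toy model. The structures and numbers below are our own illustration, not the formalism of any cited approach: program and architecture are described separately and only combined for evaluation.

    # program model: tasks with operation counts and messages between tasks
    program = {
        "tasks": {"t1": 4e6, "t2": 6e6, "t3": 2e6},
        "messages": [("t1", "t2", 8e3), ("t2", "t3", 4e3)],  # (src, dst, bytes)
    }

    # architecture model: processing speed and network parameters
    architecture = {
        "ops_per_sec": 50e6,
        "net_latency": 1e-4,      # seconds per message
        "net_bandwidth": 10e6,    # bytes per second
    }

    mapping = {"t1": 0, "t2": 1, "t3": 1}  # task -> processing element

    def combined_model(prog, arch, mapping):
        # combine the three inputs into a crude execution time estimate:
        # tasks on the same PE serialize; each cross-PE message costs
        # latency plus transfer time
        busy = {}
        for task, ops in prog["tasks"].items():
            pe = mapping[task]
            busy[pe] = busy.get(pe, 0.0) + ops / arch["ops_per_sec"]
        comm = sum(arch["net_latency"] + size / arch["net_bandwidth"]
                   for src, dst, size in prog["messages"]
                   if mapping[src] != mapping[dst])
        return max(busy.values()) + comm

    print(combined_model(program, architecture, mapping))

Because the three inputs are independent, a different mapping or architecture can be evaluated without touching the program model, which is exactly the flexibility the cooperative approach aims at.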


At the University of Vienna, Institute of Applied Computer Science and Information Systems, Department of Advanced Computer Engineering, two approaches following this concept have been developed and are partly implemented in tool sets.

The general methodology for the PAPS (Performance Analysis of Parallel Systems) tool set [Wabn 94a] is based on four layers. The first layer, called the specification layer, contains information about the workload of the parallel algorithm, hardware characteristics, and mapping information. These three inputs can be varied to a high degree independently of each other, which makes it possible to easily experiment with various different hardware configurations and mappings of workload elements to the hardware. The second layer is the transformation layer, which takes the information defined in the specification layer to automatically generate performance models. The third layer is a set of evaluation techniques appropriate for the generated model. Various characteristics of parallel programs and objectives of the investigation make it necessary to use different approaches and methods for performance analysis. The methods can be based on various performance models (e.g. Petri nets, queuing networks). In the current implementation a timed Petri net model is generated and simulated. The resulting trace file information and/or performance figures are depicted at the presentation layer (layer 4).

In the N-MAP approach [Fers 94], performance and behavior prediction is proposed based on real (skeletal) codes rather than on models in the early development stages. At this level, mainly communication patterns among independent (sequential) tasks are specified, reflecting the kind of parallelism actually exploited for a set of virtual processors or a specific, dedicated parallel hardware architecture. Obviously, at that level the most critical performance decisions are being made, and performance prediction becomes essential. The task structure specification can be parametrized by performance characteristics of the target architecture and user provided estimates of resource requirements. A parse, translate and compile system, the N-MAP tool, has been developed, taking the above specifications as input and producing a simulated execution run on a virtual (multi-)processor or a physical target system. All relevant program or algorithm performance characteristics can be deduced from the trace generated by the simulation engine or a multiprocessor scheduler.

Part of the contribution of this work has to be seen in the context of these two approaches, in that it proposes a framework and methodology for characterizing the program part, i.e. the workload.


1.1.2 Demands on a Workload Modeling Framework and Methodology for Parallel Processing Systems

The steps for constructing a workload model and the techniques for workload selection and characterization for sequential (uniprocessor) systems are well known and commonly used, but a workload modeling methodology is missing for parallel systems. The following principles can be identified for a workload modeling framework and methodology.

1. Principle of Hierarchy
The load that a parallel system has to process is not a monolithic block of work, but a structured (sometimes complex) set of tasks. Therefore it is necessary and natural to represent such a load in a structured and hierarchical way in the workload model. Following the principle of hierarchy will result in a model which is easier to understand and to construct.

2. Principle of Modularity
The entities of the load and their descriptions should be modular, so that parts of the load can also be described. Similar to the principle of modularity in software engineering, a modular characterization also supports the reuse of previously defined and described entities of the load.

3. Principle of Flexibility
The workload model should be flexible, in that the entities can be characterized using different types of parameters.

4. Principle of Universality
Since a characterization of the load is necessary in both measurement and modeling, a systematic methodology should be applicable in measurements as well as in modeling. The framework has to be universal enough to provide information for selecting and constructing an appropriate executable load for measurements, and to abstract and capture, in non-executable workload models, all relevant load parameters that have to be considered in a performance model. This principle is of particular importance in performance analysis accompanying a program or system development cycle, where both performance modeling and measurement techniques are to be applied. A universal framework supporting both descriptions suitable for modeling and descriptions suitable for measurements has the advantage that both types of evaluations can be made within the same framework.

5. Principle of Target Orientation
Different objectives of the performance analysis study influence the aspects of the load that have to be considered. A workload modeling methodology should be adaptable to the different objectives. Note that modularity and flexibility are preconditions for this principle.

A workload modeling concept following these principles serves on the one hand as a unifying framework for categorizing and comparing existing approaches to workload modeling. On the other hand, it should guide workload model selection and construction for any type of performance evaluation study.

Before summarizing the approach presented in this work, the addressed types of systems (architectures and programs) have to be discussed.

1.2 Addressed Types of Systems

1.2.1 User System Interaction

In sequential computing the interaction between the user and the processing system is typically described as follows: the user submits certain requests to the system, these requests are processed by the system, and the results (and the performance) are fed back to the user. Three different types are common for modeling this interaction:

1. a transaction workload, which is characterized in terms of arrival rates of requests,
2. a batch workload, which is characterized by a constant number of requests, and
3. a terminal workload, characterized by the number of users and their think times, i.e. the time between receiving the result and submitting the next request.

Current practice in using parallel processing systems exhibits a user behavior comparable to transaction loads, where the processing requests are submitted to a host system, characterized by an arrival stream of programs at the computer system. The load of the actual parallel system (i.e. the load that the processing elements have to process) is more adequately characterized as a batch workload, and in many evaluation studies the number of requests is equal to one, i.e. the behavior of the system when executing a single program is investigated. Figure 1.2 gives a schematic representation of the interaction between the user and the processing system.

[Figure 1.2: Interaction between user, workload, and system. The user submits the workload, which is received and processed by the host system and the parallel processing system; the system returns the performance, which is fed back to the user.]

If the system is used in a single-programming mode, i.e. when executing a single program exclusively, the reported performance includes just the actual execution of a single program on a dedicated architecture; possible interactions with the user before starting the program or after finishing it are usually not considered. Therefore user commands can be neglected. The workload model consists only of a quantitative description of the single program.

In a multi-programming mode, i.e. when several programs are executed on the same system and have to compete for resources, it is necessary to consider the user commands, typically in terms of the rate at which new programs arrive at the parallel system, in addition to a characterization of the single programs. Furthermore, different types of programs can be submitted, resulting in a multi-class workload model.

The framework and methodology to be presented in this work will be suitable for both single and multiple program environments, but emphasis is given to single-program environments. This is motivated by the focus on scalability analysis in the second part of the work, where only a single program is analyzed.
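For the terminal workload type, the characterization by number of users and think time feeds directly into the classical interactive response time law (a standard result; the numbers are invented for illustration): for $N$ users, think time $Z$, and system throughput $X$, the mean response time is

\[ R = \frac{N}{X} - Z, \]

so that, e.g., $N = 20$ users, $X = 2$ requests/s, and $Z = 8$ s yield $R = 20/2 - 8 = 2$ s.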


1.2.2 System Under Test and Component Under Study

Every performance evaluation study is performed in the context of a particular system, called the system under test (SUT) [Jain 91a], which is to be investigated. The analysis itself can either focus on the evaluation of the whole SUT, or be oriented towards the evaluation of a certain part or component of the system, which is then called the component under study (CUS) [Jain 91a]. Note that the SUT may include not only the actual processing system, but also the other entities depicted in Figure 1.2. For example, in modeling the execution time of a parallel program to be executed on a parallel system, the SUT comprises both the workload and the processing system, with the workload (i.e. the single program) itself being the CUS. In the comparison of different interconnection topologies in a multiprogrammed parallel system, the SUT would be all entities, but the CUS would only be the interconnection network.

As can be seen from these examples, the separation between the CUS and the SUT depends on the objective of the evaluation study and is crucial for every performance evaluation study. On the one hand, performance indices have to be defined that are representative for the CUS. On the other hand, the characterization of the workload has to be selected accordingly. If, for example, the interconnection network of a parallel processing system is to be evaluated, the workload has to be specified in terms of the communication demands. A workload that tests only CPU performance would obviously be inappropriate in that case. Also, the performance indices have to be defined in terms of communication delays or network throughput. A measure characterizing the performance of the whole system (like speedup) will not give the necessary insight into network performance.

1.2.3 Parallel Architectures

A variety of concepts for parallel architectures has been proposed [4]; some of the concepts have never been put into practice, some have never surpassed the stage of prototypes, and only a few have turned into commercial systems. In the following, the different types of architecture will be discussed briefly to point out the types of system addressed in this work.

While parallelism can already be exploited in uniprocessor systems [5] (overlapping CPU and I/O operations, multiprogramming and timesharing, ...), new classes of architectures have been developed allowing different types of parallelism to be exploited. In pipelined computers, temporal parallelism is exploited by overlapping the instruction execution. In array computers, multiple processing elements are synchronized to perform the same operations at the same speed in a lockstep fashion, thus achieving spatial parallelism.

Asynchronous spatial parallelism is used in most of today's parallel processing systems, called multiprocessor systems [6]. The processing elements perform their operations on their local data. Whenever a point of synchronization is necessary in the program, information is exchanged either via accessing a global memory (shared memory architectures) or by message passing (distributed memory architectures). Therefore this class can be further distinguished into shared and distributed memory asynchronous parallel architectures [7]. In a shared memory system, the processing elements have access to a shared memory. In a distributed memory system, each processing element has its own local memory module; exchange of information is only possible by exchange of messages. Because of physical limitations, the number of processing elements is restricted in shared memory systems, while distributed memory systems are suitable for a larger number of processing elements (the price one has to pay is higher communication costs because of a larger distance among the processing elements). Several parallel architectures are implemented as distributed memory systems, but offer a virtual shared address space (memory) to the programmer.

In the previous approaches the execution of the program is control flow driven, i.e. the order of computations is determined by the order of the statements in the program code. A different concept is data driven parallel processing, where each operation may be executed as soon as all its operands are available.

[4] See for example Chapter 1 in [Hwan 93a].
[5] Throughout this thesis a system which can only execute a single instruction at a time will be called a uniprocessor system.
[6] Some authors distinguish between multiprocessor and multicomputer systems. A multiprocessor system is a system which consists of a large number of processing elements, each having only a restricted functionality. A multicomputer system is characterized by a smaller number of more powerful processing elements. Treleaven [Trel 87, Trel 88] gives a very illustrative comparison for these two types of systems: they are compared to a group of beavers and an army of ants, both "working" on chewing down a tree. The multiprocessor approach is represented by the army of ants (many workers, each not very powerful), while the multicomputer approach corresponds to the group of beavers (fewer workers, but each more effective in solving the task).
[7] For a survey and comparison, see for example [Bail 88].


Since temporal, synchronous spatial, and data parallelism are exploited at a physical level using specialized hardware, these machines are often tailored for processing a particular type of application and are called special purpose architectures. The processing elements used to exploit asynchronous parallelism are similar to the processing elements used in uniprocessor systems, enhanced with particular facilities for message exchange. Therefore the classes of multiprocessor and multicomputer systems are called general purpose parallel computers. Temporal, synchronous spatial, and data parallelism [Lewi 92] can still be exploited as programming paradigms at the software level in these architectures.

The proposed workload modeling approach is restricted to shared and distributed memory asynchronous parallel architectures, which hold a substantial share of all parallel architectures in the commercial and scientific market [Rama 93]. It is not investigated whether and how the framework could be extended or modified to be applied in the analysis of other types of parallel architectures.

Finally, it is worth mentioning the difference between parallel and distributed processing as understood in this work. Depending on whether the processing facilities are located in a centralized or decentralized way, processing is called parallel or distributed. Since there are many similarities between distributed processing in computer networks [8] and parallel processing on distributed memory (asynchronous spatial) architectures, it will be an interesting issue for future work to investigate the suitability of the proposed approach for modeling the load of computer networks.

[8] A computer network consists of several (uni)processor systems connected via a network. Each machine in the network computes a single task, but may request services from other machines.

1.2.4 Parallel Programs

A parallel program is a program that contains some potential to be executed (partly) in parallel, i.e. using more than one processing element. Parallelism in programs can be exploited at different levels:

- Several programs (either identical or different) can be executed in parallel. Typical applications are simulations, where several runs of the simulation (possibly with parameter variations) are necessary to obtain data.


- Threads [9] in a program can be executed in parallel.
- Instructions within a thread can be parallelized (typically in do-loops).

- The basic machine operations within an instruction can be parallelized.

Parallelism at the program level is characterized as coarse grain (a whole program is the unit to be executed in parallel) and is suitable for architectures where each processing element is a fully equipped machine (e.g. in computer networks). The communication media can be rather slow, since usually no communication is necessary between the different program runs. The next level is still coarse grained (threads are to be executed in parallel), but communication occurs more frequently and threads have to be synchronized. The threads can either perform the same operations on different data (SPMD or data parallelism) or different operations (MIMD or task parallelism). Parallelism at the instruction level is categorized as fine grain; typically the same instruction is executed simultaneously on different data (SIMD). The fourth type of parallelism (within an instruction) is usually implemented in hardware (vector computers) and will not be considered in this study on workloads.

The proposed approach allows a characterization of all different types of parallelism within programs: coarse grain (program parallel), medium grain (SPMD and MIMD), but also fine grain SIMD parallelism, depending on the level of characterization (see next section).

There are basically two different styles for programming parallel architectures. In an explicit programming style, the programmer has to specify the concurrency in the program and to provide the necessary mechanisms for synchronization and communication. In implicit parallel programming, the programmer writes a sequential program which is translated automatically into a parallel program by a parallelizing or vectorizing compiler. The result of both approaches is a parallel program, and this program will be the object of investigation for workload modeling. The methodology can be applied to a parallel program generated either by a compiler or by a human being.

[9] In this work the term "thread" is used to denote a flow of control to be executed on a processing element. A physical processing element is able to execute more than one thread of control in parallel; therefore threads can be interpreted as virtual processing elements.
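As an illustration of the explicit, medium grain SPMD style in a modern notation (the language choice is ours, purely for illustration): every worker executes the same program text on a different slice of the data, and the main process synchronizes by collecting the partial results.

    from multiprocessing import Pool

    def partial_sum(chunk):
        # identical code for all workers, different data per worker (SPMD)
        return sum(x * x for x in chunk)

    if __name__ == "__main__":
        data = list(range(1_000_000))
        n_workers = 4
        chunks = [data[i::n_workers] for i in range(n_workers)]
        with Pool(n_workers) as pool:
            partials = pool.map(partial_sum, chunks)  # fork, distribute, join
        print(sum(partials))

A task parallel (MIMD) variant would submit different functions instead of the same one; an implicit approach would leave this decomposition to a parallelizing compiler.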


1.3 Contribution of the Presented Work

1.3.1 Summary of the Framework and the Methodology

In this work a systematic framework and a corresponding methodology for workload modeling of parallel systems are proposed. The presented framework serves on the one hand as a classification scheme for existing approaches to workload modeling for parallel systems, and on the other hand as a basis for deriving new workload models according to the methodology.

The objectives in workload characterization for parallel systems are to provide adequate representations of the programs and the associated data, and to characterize the way in which these programs are submitted to the system (user interaction). Refining the user-workload relation from Figure 1.2, the proposed framework is summarized in Figure 1.3.

[Figure 1.3: Schematic Representation of the Proposed Methodology. Users, described by their user behavior, submit programs and data. The programs are described in terms of components at three levels (application, algorithm, routine). The aspects of the components are abstracted in characteristics (functional composition, dependency structure, degree of parallelism, communication behavior, memory access behavior, and communication, computation, memory, and I/O demands), which are represented in workload model parameters.]

This framework is based on the concept of components and characteristics. A component is an entity of the workload, which can be specified at three distinct hierarchical levels. At the application level, the whole program (application) is specified in terms of the algorithms it consists of. At the next level, the algorithm level, each algorithm is defined by the routines it is composed of. At the lowest level (routine level), each routine is described in terms of its statements. Therefore the components at each level are the algorithms, the routines, and the statements. The corresponding data that are processed are not considered explicitly in a data model, but are included in the characterization of the components whenever they have significant influence on the aspect to be represented. For each component, a set of characteristics can be defined. These characteristics represent the functional, behavioral and quantitative aspects of the components.

When considering a single program as the workload, no user interaction has to be specified. When considering a multiple program or multiple user environment, the characteristics and layers described above are specified for each class of program (or user), similar to multiple class workload models for uniprocessor systems. In addition, the distribution of the different types of programs in the total workload has to be specified. This characteristic is called the "user behavior", but the term is used here in a more abstract way than in human computer interaction: only the quantitative aspects of user-computer interaction are considered, not the psychological or ergonomic aspects.

The proposed framework is hierarchical and modular. Depending on the objective of the study and on the evaluation techniques, levels or characteristics can be omitted.

The corresponding methodology specifies the steps for constructing a workload model within this framework. It can be applied in the analysis of a multiple program (user) environment, where different classes of programs are to be specified (e.g. in scheduling studies), in the analysis of a single program, or in the analysis of program components at different levels of detail (e.g. in scalability analysis or in program development).

1.3.2 Outline of the Thesis

In the first part of the thesis the fundamentals of workload modeling for parallel systems are explained (Chapter 2), and a framework and methodology for workload modeling for parallel systems is proposed. The problem of how to define the workload of a parallel system (Section 2.1) is addressed by identifying the basic components that a model representing a parallel system's workload has to capture (programs and data), by specifying the types of test workloads (executable and non-executable workloads), and by discussing the characteristics of workload components, pointing out the dissimilarities to the workload of uniprocessor systems.

A workload modeling methodology is proposed which is based on this framework (i.e. the components and characteristics) and which is embedded in a performance evaluation cycle (Section 2.2). The influence of the objectives of the analysis and of the performance analysis techniques (including both measurement and modeling approaches) to be applied in the evaluation study is discussed.

Narrowing the focus to non-executable workloads (to be used in modeling studies), Section 2.3 discusses the construction, parametrization, and validation of these types of workloads within the proposed framework of components and characteristics. The concept of parameters is introduced to model the characteristics of the workload components.

The first part is concluded by a survey of the state of the art in parallel workload modeling (Chapter 3). The proposed framework is now used as a classification scheme for workload models, based on the following criteria: the characteristics considered in the model (functional composition, dependency structure, degree of parallelism, communication and memory access behavior, computation, communication, memory and I/O demands, user behavior), the way of representation (system independent or system dependent, and deterministic or stochastic), and the type of workload model, i.e.

1. parameter based models (program, resource and structure oriented) (Section 3.1),
2. fixed system size ratios (profiles, signatures, shapes) (Section 3.2),
3. varying system size ratios (execution time function, speedup, efficiency, sequential and parallel fractions) (Section 3.3), and
4. behavior graph based models (communication graphs, task graphs, data dependency graphs, Petri nets, PERT networks) (Section 3.4).

In the second part of this work (Chapter 4), the proposed methodology is applied to scalability analysis of parallel systems, where the objective is to investigate the ability of a system to maintain a certain level of performance if architecture and workload size are increased. This field of performance evaluation has been selected because of the central role of workload characterization in scalability analysis. First, the concept of scalability is introduced (Section 4.1). A general definition of scalability is given and a scalability index is proposed. This index is based on a characterization of the scaling in architecture size and in the amount of work. Based on a visual representation of this index, a scalability analysis approach is presented (Section 4.2). To derive this index, different techniques (analytic, simulation, and measurement) are investigated. A workload model is proposed which serves on the one hand as input to a performance model based on complexity analysis, and on the other hand as input to an analysis tool for evaluations based on simulations. The proposed approach is demonstrated on selected examples (Section 4.3).
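For later reference, the ratios listed under item 3 of this classification have standard definitions. With $T(p)$ denoting the execution time on $p$ processing elements,

\[ S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p}, \]

and, if a fraction $s$ of the work is inherently sequential, Amdahl's bound $S(p) \le 1/(s + (1-s)/p)$ limits the achievable speedup.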

Chapter 2

Workload Modeling - Framework and Methodology

Modeling, i.e. selecting and characterizing, the workload of a parallel system is an important task in every performance evaluation study, as all the estimated or observed performance results will depend on the load that the system under test has to process.

In this chapter an introduction into the fundamental issues in workload modeling of parallel processing systems is given, which includes both the selection and the characterization of the load. A systematic approach for workload modeling is proposed, based on workload components, characteristics, and parameters. This approach is a framework for the construction of any type of test workload (executable and non-executable), which can be adapted to meet the particular demands posed by the objectives of the evaluation study and the selected evaluation technique.

In the first section of this chapter (2.1) a definition of the workload for parallel systems is given. A hierarchical description of the workload in terms of its components is proposed (Section 2.1.1). Before discussing the characteristics of the components, i.e. their functional, behavioral, and quantitative aspects, different types of test workloads (real or artificial, and executable or non-executable models) are described (Section 2.1.2). A set of characteristics is proposed (Section 2.1.3) allowing both the description and construction of executable models and the representation of non-executable models.

In Section 2.2 workload modeling is discussed in the context of a performance evaluation cycle. The influence of the objectives of the evaluation (Section 2.2.1) and of the evaluation techniques (Section 2.2.2) is discussed.


The last section (2.3) focuses on the construction, parametrization, and validation of non-executable workloads within the proposed framework of components and characteristics. The concept of parameters, which are used to model the characteristics, is introduced (Section 2.3.1). Workload model validation criteria for parallel systems are identified (Section 2.3.2).

2.1 Definition of Workload

2.1.1 Workload Components

Traditionally, the workload is defined as the set of all inputs (processing requests) from the users to the processing system during a specified time interval [Ferr 78]. Depending on the objective of the performance evaluation study, the workload may include programs, commands, and data submitted by the users, but also system programs (compilers, editors, ...) requested by the users.

The same components can be identified in a parallel system's workload. Because of the possibility of concurrent execution and the resulting problems of synchronization and communication, however, the techniques for a quantitative description of components (workload characterization) of sequential systems cannot be applied directly to the workload of parallel systems. In order to determine characteristics that will capture the workload of parallel systems, the properties of the submitted programs, commands and data have to be investigated.

Programs

A parallel program is typically characterized by a hierarchical structure, i.e. it is composed of smaller parts. Reflecting this hierarchy also in the workload model has several advantages:

- It is easier to construct a hierarchical model for a hierarchical system. A flat representation would not be a natural representation.
- A hierarchical model is more flexible and makes the model more tractable.
- Model construction is simplified, since it is easier to quantify smaller components and then aggregate the results.

The following three hierarchical layers for the characterization of a parallel program are proposed [Hari 95, Calz 95].


Application Layer: At the very top layer the load is characterized at a coarse granularity, in that a whole application corresponds to the problem to be solved (e.g., biomolecular modeling, fluid dynamics). As solving a problem requires a set of methods to be applied, an application $A$ can be described in terms of the underlying algorithms $A_i$ ($i = 1, 2, \ldots$) in the solution domain.

Definition 1: A parallel program at the application layer is described in terms of the algorithms it is composed of:
\[ A = \{A_1, A_2, \ldots\} \]
where $A$ denotes the parallel application, and the $A_i$ are the algorithms.

Algorithm Layer: At the intermediate layer of the methodology, the granularity is refined in order to characterize the components of an application. For numerical applications, algorithms solving systems of nonlinear equations are examples of the methods used at this layer. An algorithm $A_i$ can be described by means of the routines $r_j^i$ ($j = 1, 2, \ldots$).

Definition 2: A parallel program at the algorithm layer is described in terms of the routines it is composed of:
\[ A_i = \{r_1^i, r_2^i, \ldots\} \]
where $A_i$ denotes the algorithm, and the $r_j^i$ are the routines.

Routine Layer: At the bottom layer, the specific routines used to implement an algorithm are considered. A routine $r_j^i$ is composed of code segments $s_{jk}$ ($k = 1, 2, \ldots$), e.g. subroutines, functions, loops. At the very finest granularity, a routine can even be a single statement.

Definition 3: A parallel program at the routine layer is described in terms of the statements it is composed of:
\[ r_j^i = \{s_{j1}, s_{j2}, \ldots\} \]
where $r_j^i$ denotes the routine, and the $s_{jk}$ are the statements.


According to the application under study and/or the objectives of the analysis to be performed, the granularity identified for the components at each layer may vary. Sometimes a few methods can be considered as a single algorithm component because they are logically related. Furthermore, characterization layers can be omitted: a small problem, which might require only one solution method, can be characterized starting at the algorithm layer, hence omitting the application layer. The components at these three layers (the algorithms at the application layer, the routines at the algorithm layer, and the statements at the routine layer) are the basic building blocks of the proposed framework.
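The three definitions map naturally onto a nested data structure. The sketch below is our own illustration (the class, field names, and values are hypothetical); it also shows the aggregation of a quantitative characteristic over the hierarchy, the model construction step mentioned above:

    from dataclasses import dataclass, field

    @dataclass
    class Component:
        # a workload component at any layer of the hierarchy
        name: str
        level: str  # "application" | "algorithm" | "routine" | "statement"
        children: list = field(default_factory=list)
        characteristics: dict = field(default_factory=dict)

    def aggregate(comp, key):
        # sum a quantitative characteristic over the whole subtree
        return (comp.characteristics.get(key, 0)
                + sum(aggregate(c, key) for c in comp.children))

    # A = {A1}, A1 = {r1}, r1 = {s11, s12} -- invented example values
    s11 = Component("s11", "statement", characteristics={"computation_demand": 1.5e6})
    s12 = Component("s12", "statement", characteristics={"computation_demand": 3.0e5})
    r1 = Component("r1", "routine", children=[s11, s12])
    A1 = Component("A1", "algorithm", children=[r1])
    A = Component("A", "application", children=[A1])

    print(aggregate(A, "computation_demand"))  # 1800000.0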

Data

The influence on performance of the input data that a computer system has to process is obvious for both sequential and parallel systems. While this influence is mainly determined by the size and the values of the input data in sequential systems, the situation is more complex in parallel systems, because the aspects of data dependencies and data distributions have to be considered. In performance modeling, the problem arises of how to represent the data in the model. Two alternatives can be considered.

In a direct representation, the workload description consists of a program model and a separately constructed data model. This data model has to contain not only the size and values of the input data, but also the data dependencies and distributions. To obtain performance results, both the data model and the program model have to be considered. This approach has the advantage of a clear separation (thus increasing model flexibility). But finding the appropriate level of granularity for the representation of data dependencies is a non-trivial task. The finest granularity of representation is a data flow graph, but this representation is unfeasible for any problem of realistic size.

Therefore a more concise, indirect representation of data, where the characteristics are represented within the program model, seems more justified, considering that the operations in the program are the major components causing load on a system. Of course, the characteristics of these operations will usually depend on the data they process, but this just confirms that data are to be seen as an additional aspect that should be considered within the model of the program. For example, the duration of a routine can be specified as a function of the problem size, or the probability of executing either one routine or another may be given depending on the values of the input data. This indirect representation has the advantage that only those aspects of the data which are relevant for the model are captured, thus supporting a more concise representation.

In performance measurements, the major problem is to find a representative set of the input data, since it is impossible to test all possible input data (combinations). The analyst has to restrict the investigations to a subset of input data, which can be selected either to stress the system beyond its capacity (worst case scenario), or assuming an ideal combination of input data (best case scenario), or trying to find typical combinations (average case scenario).
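As a concrete (invented) instance of the indirect representation, a routine's demand can be stored as a function of the problem size, and a branch probability as a function of a data property, instead of building a separate data model; all coefficients below are hypothetical:

    import math

    # indirect data representation: data influence is folded into the
    # program model as parametrized characteristics
    routine_demand = {
        "factorize": lambda n: 0.7 * n ** 3,           # operations as f(problem size)
        "solve":     lambda n: 2.0 * n ** 2,
        "fft":       lambda n: 5.0 * n * math.log2(n),
    }

    def branch_probability(n):
        # probability of taking the refinement branch, as a
        # (hypothetical) function of an input data property
        return 0.1 if n < 1000 else 0.4

    n = 4096
    expected_ops = (routine_demand["factorize"](n)
                    + branch_probability(n) * routine_demand["solve"](n))
    print(f"expected operation count for n={n}: {expected_ops:.3e}")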

Commands and Subsidiary Programs

Subsidiary programs on a parallel system can include automatic mappers, schedulers, or routers. These programs do have a significant influence on the performance. They can either be considered as part of the workload, i.e. they have to be represented in the workload model, or as part of the processing system, i.e. they have to be considered in the model of the architecture. In this work it is assumed that the influence on performance caused by subsidiary programs is considered in the architecture model.

The commands that are submitted by the user to the system are basically only commands for executing the program; all editing and compile activities are performed on a front end computer. Once the execution of a program has started, there is nearly no interaction with the user.

2.1.2 Specifying the Test Workload

The workload that is used in a performance evaluation study is called a test workload [Ferr 83]. A test workload may either be executable (to be used in measurement) or non-executable (to be used in modeling). A test workload may consist of the real workload, of artificial components, or of a mix of both. Figure 2.1 depicts the possible types of workload models, which will be discussed in the following.

[Figure 2.1: Executable and Non-Executable Test Workloads. The real workload (actual programs, commands and data) and synthetic workloads (application suites, benchmarks) are executable; the artificial workload comprises executable forms (mimicked commands, programs, and data: instruction mixes, kernels, synthetic scripts) and non-executable forms (analytic models, distribution driven models).]

Real test workloads: If performance evaluations are made on a system when executing the actual programs, commands, and data submitted by the user(s) during a particular time interval, then the performance measures are said to be obtained under real workload. Consequently, real test workloads can only be used in measurement experiments on real systems. Since the system can only be observed during a specified time interval, in fact only a sample of the real workload is considered. The selection of this time interval is the only parameter that can be varied by the analyst when using a real test workload.

In practice, several problems prohibit the use of real workloads. On the one hand, it is often not possible to measure a real system during execution, for security reasons (e.g. in database systems) or because a measurement would disturb the system behavior in a way that is significant and unacceptable for the users. On the other hand, the usefulness of the obtained results is limited, because it can be guaranteed neither that the selected time interval for observation is representative, nor that the obtained data are reproducible. Furthermore, the amount of obtained data is tremendous. Finally, a real workload is observable and measurable, but not controllable. This implies that no evaluations can be performed to obtain results on system performance in best and worst cases.

There is only little literature reporting on measurement studies of parallel systems under real workload. See for example [VanV 94], where the workload on an iPSC/860 system is investigated, or [Mutk 88], where the load on a network of workstations is observed.

Artificial workload: An artificial workload mimics the quantitative and/or functional behavior of real workloads. In system measurements an executable model has to be used. Consisting of a set of operations that are either typical for a given set of applications or chosen in order to stress the system in ways not possible with an actual application, an executable artificial workload does not necessarily have to produce any useful computational results. Executable models include instruction mixes, kernels, synthetic scripts or programs, interactive drivers, and traces. In the literature, several approaches [Bart 92, Lind 92] and tools [Mehr 92, Kao 92, Roge 93] for executable artificial workload generation for parallel systems can be found. In some workload generators, the workload parameters are adjustable to mimic, for example, computation or communication intensive applications, or particular memory access patterns and frequencies.

In performance modeling, the workload model has to provide the necessary information on the program, the data, and the user interaction that is needed to construct the performance model. Therefore a non-executable model, which has to represent the performance relevant characteristics and aspects of the real workload, is required in performance modeling. A particular modeling technique typically needs a particular type of input, with respect to both the information contained in the workload model and the way it is represented in the model. The demands on the workload model implied by the selected modeling technique are discussed in Section 2.2.2. A variety of approaches for non-executable workload models exists; a survey will be given in Chapter 3.
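A toy version of such an adjustable generator (entirely our own sketch; the parameters are hypothetical and not those of the cited tools) could expose the computation/communication ratio as a single knob:

    import time

    def synthetic_task(compute_fraction, total_ops=10**6, msg_size=1024, n_msgs=100):
        # burn CPU and emulate communication in a tunable ratio
        acc = 0
        for i in range(int(total_ops * compute_fraction)):      # computation phase
            acc += i * i
        for _ in range(int(n_msgs * (1 - compute_fraction))):   # communication phase
            payload = bytes(msg_size)  # message assembly
            time.sleep(0.001)          # stand-in for network transfer time
        return acc

    synthetic_task(compute_fraction=0.9)  # mimics a computation intensive load
    synthetic_task(compute_fraction=0.2)  # mimics a communication intensive load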

Synthetic workload A synthetic workload is a mix of real and executable artificial workload components. If only real programs are used, the test workload is called a natural workload or benchmark; if the synthetic workload consists of real programs and artificial components, it is called a hybrid model. As in artificial workload models, the mix can either be chosen to represent the typical workload or some extremal situation, depending on the objective of the analysis.


Well known examples of benchmarks for parallel systems are the NAS Benchmark suite (which contains both application benchmarks and kernels) [Bail 91, Bail 94], the SPLASH suite [Sing 91], the Perfect Benchmarks [Blum 92], and the LinPack Benchmark [Dong 93]. Several other concepts and methods related to parallel computer benchmarking can be found in [Gunt 88] (PARCBench), in [Mess 90] (general methodologies and guidelines), or in [Brad 93] (comparison of different benchmarking techniques).

2.1.3 Parallel Workload Characteristics

Depending on the type of test workload, three different purposes for characterizing the load can be identified.

1. Description. When using the real load or a subset of the real load, the analyst has no influence on the aspects of the load. But it is important to describe the load in order to be able to interpret the performance results correctly. For example, it is helpful to know the potential degree of parallelism of an application when interpreting the measured and observed speedup. In the following, all activities related to modeling the real load are referred to as describing the load.

2. Specification. When constructing artificial executable workload models, it is necessary to specify the load, i.e. the type of behavior that the artificial workload has to mimic [Ferr 84]. This specification can either be used to select several programs and benchmarks for a workload mix, or to parametrize a synthetic workload generator. Workload modeling of artificial executable test workloads will be called workload model specification.

3. Representation. In a non-executable model, the aspects of the real load which are relevant input data for the performance model have to be represented in a way suitable as input for a performance model. All activities within this abstraction process from the real executable load to a non-executable workload model are summarized as representing the load.


Figure 2.2: Workload Components and Characteristics

In the following, a set of parallel workload characteristics is defined, which shall provide a basis for description, specification, and representation. These characteristics include functional and behavioral aspects as well as quantitative aspects, and the type of interaction between the user and the system. The proposed characteristics are grouped into four categories (see Figure 2.2). While functional, behavioral, and quantitative characteristics can be defined at any desired level (application, algorithm, or routine), the user interaction is described independently of the levels when viewing the total system.

Functional Aspects of the Components Although a workload model mainly has to capture the quantitative aspects of the program, the identification of the basic components is the starting point for any further description. Since these components have been defined related to the functional units of the program (remember the three layers of application, algorithm, and routine), this characteristic is called "functional behavior".


Definition 4 The characterization of the workload in terms of the components is called Functional Composition.

The statements at the routine layer, the routines at the algorithm layer, and finally the algorithms at the application layer are the building blocks for aggregating the components.

For a representation in a non-executable model it is sufficient to give the aggregation and "mapping" of program parts into components. When characterizing executable workload models, a description and specification of the type of functions or computations may be included.

Behavioral Aspects of the Components The components identified in the functional description are now related to each other in a description of their dependency structure. Typically, this description will result in a graph model, where the nodes correspond to the components and the arcs denote dependence relations. The dependencies can be either data or control dependencies. There is no difference between a description, a specification, or a representation.

Definition 5 The Dependency Structure characterizes the control and/or data dependencies among the components at each layer.

The parallelism contained in an application, algorithm, or routine is characterized by the degree of parallelism among the corresponding components.

Definition 6 The Degree of Parallelism characterizes the number of components to be processed in parallel at each layer.

This can either be the physical parallelism observed from measurements or simulations (i.e. an architecture dependent parameter) or the potential (ideal) parallelism independent of the physical restrictions of a parallel architecture. The potential parallelism can be given for both executable and non-executable models. The physical parallelism is mainly used to describe or specify executable workloads.


The degree of parallelism may vary during execution or in different phases of the program. Instead of characterizing all these changes, it is possible to give only the maximum, minimum, or average degree of parallelism. Depending on the assumed paradigm for exchanging information between components, either communication or memory access patterns can be defined. A pattern has to represent the communication partners (or access references) and the frequency of communications (accesses).

Definition 7 The Communication Behavior characterizes the interaction between the components in terms of communication partners, direction of communication, and frequency of communication. The Memory Access Behavior characterizes the patterns, frequencies, and the locality of accesses to memory for the components at each layer.

The communication or memory access volume is separated from the behavioral description and specified in the corresponding demands (communication demands and memory demands).

Quantitative Aspects of the Components The characterization of the quantitative aspects of the components can be compared to the characterization of "resource demands" in classical workload modeling.

Definition 8 The amount of processing resources required by the workload components is characterized in terms of the Computation Demands.

The computation demands can either be given in units of work (number and types of instructions) to be processed or in units of time. A characterization by units of work is architecture independent; a characterization by units of time is architecture dependent, since the particular processing speed of the underlying architecture is implicitly included. In characterizing real workloads, a description in terms of work units is to be favored if several architectures are to be compared. In the specification of executable models, it depends on the type of workload generator whether a characterization by work or by time is required. In constructing non-executable workload models, the type of input requested by the modeling technique has to be considered.

Figure 2.3: Kiviat Diagrams for Characterizing Parallel Programs (semiaxes: computation, memory, I/O, communication)

Definition 9 The volume and type of data to be exchanged via communication among the components is characterized in terms of the Communication Demands.

The hierarchical organization of the memories of parallel systems and the resulting non-uniform memory access times are important performance parameters. Therefore the workload model has to contain information on the memory demands (volume and type of data).

Definition 10 The volume and type of data to be accessed from memory by the components is characterized in terms of the Memory Demands.

An important aspect, which is often neglected in parallel performance evaluation, is the I/O requirements. Volume and type of input and output data are to be part of the workload characterization.

Definition 11 The volume and type of data to be accessed from storage media by the components is characterized in terms of the Input/Output Demands.

Figure 2.4: Different Viewpoints of Analysis and Corresponding Consideration of User Interaction (single program environment: consider the program model only; multiple program environment with a single program investigated: consider the program model and the background load; multiple programs investigated: consider the program models and the user behavior)

Similar to characterizing the computation demands either by the amount of work or by the amount of time, an alternative characterization for communication, memory access, and I/O demands can be the time needed to complete the corresponding communication, memory, or I/O requests.

When comparing the computation, communication, memory, and I/O demands, a program can be characterized as being either computation, communication, memory, or I/O bound. A way of representing these relations are Kiviat diagrams [Kole 73] as shown in Figure 2.3. For each characteristic, a suitable metric has to be found that can be represented on a semiaxis in the circle. Typically, those metrics are given as percentage values, starting with 0 in the middle. In the example of Figure 2.3, the percentages are the busy times of the corresponding resources (i.e. CPUs, memory subsystem, communication network, and I/O subsystem). The resulting shape of the polygon can then easily be identified (in the example a computation bound application); a minimal sketch of such a classification is given below.
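As a minimal sketch of this Kiviat-style classification (the resource names and busy-time percentages are assumed example values, not measurements from this work), the dominating semiaxis can be determined as follows:

    def classify(busy_percent):
        """busy_percent: resource name -> busy time in percent (one Kiviat semiaxis each)."""
        dominant = max(busy_percent, key=busy_percent.get)
        return dominant + " bound"

    # assumed example profile corresponding to the shape in Figure 2.3
    profile = {"computation": 85.0, "communication": 30.0,
               "memory": 40.0, "I/O": 10.0}
    print(classify(profile))   # -> "computation bound"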

Characteristics Related to User Interaction In a single program evaluation study (which would be the typical case in program analysis and design studies or in sensitivity or scalability analysis), the characteristics defined above would have to be described and quantified for the particular program under study. Investigating the behavior of a system when executing multiple programs (for example in scheduling studies) requires a more complex representation in the sense that, on the one hand, the characteristics have to be specified for each program (class), and in addition, the arrival characteristics of programs as caused by the users have to be considered (see Figure 2.4).

Definition 12 The User Behavior characterizes the type of interaction between user and processing system in terms of the frequencies at which (possibly different types of) programs are submitted to the system.

A detailed model for characterizing the user behavior is provided by user behavior graphs [Calz 90]. Simpler characterizations consider the user behavior either as a terminal load (giving the number of users and their think times), as a batch load (giving a constant number of requests), or as a transaction load (giving a (stochastic) arrival rate of requests); see the sketch at the end of this section. Characterizing the arrival behavior as a stochastic process is a well known technique and does not have to be explained in detail here; the interested reader is referred to [Klei 75], [Triv 84], or [Fros 94]. Note that the user behavior is typically represented at the application layer, i.e. the user submits a whole program and not just its components [Kant 88]. But in many scheduling or mapping studies, the objects that can be scheduled or mapped are of a finer granularity. In this case, the arrival of applications has to be transformed into arrivals of the particular components.

These characteristics are the second keystone (besides the components) of the proposed framework. The decision on which characteristics are to be selected and on how to parametrize them depends on the objectives of the evaluation study and on the selected evaluation techniques. A workload modeling methodology which describes the necessary steps for adapting the proposed framework according to the evaluation techniques and objectives is presented in the next section (2.2).
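A minimal sketch of the three simple user-behavior characterizations named above, assuming exponentially distributed interarrival and think times (the rates and counts are illustrative assumptions):

    import random

    def transaction_arrivals(rate, horizon):
        """Transaction load: Poisson arrival stream up to a time horizon."""
        t, arrivals = 0.0, []
        while True:
            t += random.expovariate(rate)   # exponential interarrival times
            if t > horizon:
                return arrivals
            arrivals.append(t)

    def terminal_think_times(users, mean_think):
        """Terminal load: number of users, one sampled think time each."""
        return [random.expovariate(1.0 / mean_think) for _ in range(users)]

    batch_load = 20   # batch load: a constant number of requests
    print(len(transaction_arrivals(rate=2.0, horizon=100.0)))   # around 200 arrivals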

2.2 Workload Model Construction in a Performance Evaluation Cycle

The decision upon which type of test workload or which level of characterization to use can only be made when considering the objectives and restrictions of the whole performance evaluation study.

Figure 2.5: A Schematic Performance Evaluation Cycle (phases: Model Formulation; Workload Model Construction; System Modeling/Measurement; Data Interpretation and Validation; with loops back for improvement and refinement. Guiding questions: What is the objective of the analysis? What are the relevant performance indices? What analysis technique will be used? What kind of test workload is used? What is the level of characterization? How are the workload parameters obtained and quantified? Is the quality of the model satisfying? Is the model sensitive with respect to the input parameters?)

By recalling the basic steps in a performance evaluation cycle, as described for example in [Ferr 83], the influence of the performance evaluation framework on the definition and construction of the test workload will be discussed. A typical performance evaluation process is depicted in Figure 2.5.

In the first phase, called model formulation, the framework for all further decisions has to be set up. The objectives of the analysis are to be defined clearly, since all following decisions in the workload modeling process (and also in the whole evaluation study) will depend upon these objectives. Performance indices are to be defined that characterize either the speed, the efficiency, the effectiveness, or the productivity of the system (see [Worl 91] for a detailed discussion on performance indices for parallel systems). It is not sufficient to specify the indices; it is also necessary to give a range of desired (or acceptable) values as a basis for comparison in the validation phase.

After defining the evaluation environment, the workload has to be selected and characterized. Workload selection refers to the process of choosing the components that have to be modeled, and the type of workload model (real or artificial). The workload has to be selected either to represent a typical case or a stress case. Workload characterization refers to all the decisions related to describing, specifying, or representing the selected workload type and components. For executable workloads, the (artificial) or real programs have to be prepared for execution on the processing system. In constructing non-executable workload models, the characteristics of the components, and parameters quantifying these characteristics, have to be found.

In the next step, the workload characterization is combined with the parallel architecture. In measurements, this means that the programs are to be executed on the system; in modeling, that the workload model has to be integrated in a performance model which is to be solved by the previously selected techniques. Comparing the results with respect to validation criteria and interpreting the obtained performance results either concludes the modeling study, if the objectives are met, or causes further improvements or refinements in the model.

According to this performance evaluation cycle, eleven steps in adapting the workload modeling framework are identified, resulting in a parallel workload modeling methodology which is summarized in Figure 2.6. In the first step (1), the evaluation environment is to be defined, followed by a selection of the evaluation approach (2), namely modeling or measurements, and the corresponding evaluation technique. While a particular measurement technique (software or hardware monitoring, event or time driven, real time or post mortem) does not influence the workload modeling, the selected modeling technique is a key factor determining the aspects to be represented in the workload model. In the next two subsections (2.2.1 and 2.2.2), the influence of the decisions made in steps (1) and (2) on modeling the workload is discussed in detail.

Steps (3) to (7) are those steps related to the actual modeling of the workload. First, the type of test workload determined by the evaluation approach has to be selected (non-executable models in modeling studies, executable (real) loads in measurements). Next, in step (4), the type of user interaction to be considered in the model has to be selected. This depends on the separation between the system under test and the component under study according to the objectives of the analysis and will also be discussed in Section 2.2.1. In steps (5) and (6), the level and granularity at which the programs are to be characterized are chosen, i.e. the functional composition of the workload is characterized and the behavioral characteristics (the structure of the components) are given. Depending on the type of workload model, this characterization is either an abstracted representation (to be included in a performance model), a description (characterizing the real load), or a specification (for the construction of an executable model) of the load.

1. Define objectives, performance indices, and validation criteria
2. Select evaluation approach
   2.1. Select modeling technique
   2.2. Select measurement technique
3. Select type of test workload
   3.1. Non-executable test workload
   3.2. Real workload
   3.3. Artificial or synthetic executable workload
4. Define type of user interaction
5. Define granularity (functional composition)
   5.1. For representation
   5.2. For description
   5.3. For specification
6. Define behavioral characteristics
   6.1. Representing
   6.2. Describing
   6.3. Specifying
7. Define quantitative characteristics
   7.1. Estimate load parameters
   7.2. Select time interval and prepare real load
   7.3. Generate/select/prepare artificial and synthetic load
8. Evaluate the system
   8.1. Construct architecture model and combine it into performance model
   8.2. Measure the system under the real load
   8.3. Measure the system under the artificial or synthetic load
9. Reduce and visualize data
10. Validate and interpret the results
11. (Optional) loop back to one of the previous steps

Figure 2.6: A Workload Modeling Methodology

In step (7), the quantitative load characteristics are defined. While in non-executable models an estimate (based on measurements or guesses) for the quantitative characteristics has to be provided, for real workloads it is sufficient to select the time interval for the measurements and to prepare the load for the measurements; in artificial or synthetic executable models the load is generated or selected and prepared for the measurements. These preparations may, for example, include instrumentations of the program code in software monitoring. To evaluate the system (step (8)), the workload model and the system under test have to be brought together. In performance modeling, a combined performance model has to be constructed and solved (8.1). In measurements, the system has to be measured under the real (8.2) or artificial/synthetic (8.3) load. Steps (9) and (10) are related to data reduction, visualization, interpretation, and validation. Depending on the conclusions drawn from these investigations, it might be necessary to refine the model or to conduct further measurements to gain more insight into system behavior and performance (loop back, step (11)).

While the influence of the objectives and the evaluation techniques on the workload model is discussed considering all three different types of workload models (real, executable, and non-executable models) in Sections 2.2.1 and 2.2.2, a separate section (2.3.1) will be devoted to the discussion of steps (4) to (7) for non-executable models only, motivated by the particular importance of these steps for non-executable workload model construction. The application of the whole methodology for a particular type of evaluation study (scalability analysis) will be demonstrated in Chapter 4.

2.2.1 Influence of the Objective of the Evaluation Study

Evaluation/Comparison of Parallel Architectures One possible question in the analysis of parallel systems is to compare or evaluate the performance of a particular parallel machine. The system under test is the parallel architecture; the component under study is either again the parallel architecture, or subcomponents of the parallel architecture (e.g. the CPUs or the interconnection network). The interaction with the user is typically not considered in the workload; the system is usually studied under a batch load, frequently when executing only a single program. System related performance indices (e.g. computation or communication rates) are to be obtained.

Such evaluations are frequently done by measurements, where either real programs can be used or synthetic or artificial programs can be constructed. The workload model (program and data) can either represent the typical workload that such a system has to process, or extreme workloads to stress the system. The literature on using benchmarks for the evaluation of parallel architectures is numerous; a few examples are [Naik 94], [Bozk 92], [Berr 92], [Fagi 90], and [Gust 91].

Program Analysis and Design In program analysis and design, again the performance of a parallel program (to be) executed on a parallel architecture is investigated. But in contrast to the evaluation and comparison of architectures, the focus here is on how well the program performs (or will perform) on a particular architecture; the component under study is now the parallel program. Possible questions include predicting the execution time of a parallel algorithm and developing or selecting an optimum algorithm with respect to performance indices like speedup or efficiency. These questions can be answered either by measuring the performance of the program or by trying to develop a model. In program analysis, a program is investigated that has already been implemented. Therefore either measurement or modeling can be applied. In program design, the program is still under development and not completely available for execution. As a consequence, mainly modeling techniques have to be applied; measurements for already implemented parts of the program can support the investigations. In both cases (modeling and measurements), the workload consists of the particular program to be investigated and is itself the component under study. Therefore the question of selecting the workload is already answered, and the major questions are

1. what to measure to obtain the desired performance indices and to gain insight into the program behavior, and

2. how to measure in order not to disturb the system behavior significantly.

Several answers to both questions can be found, for example, in the theses of Malony [Malo 90] and Kesselman [Kess 91].

Mapping and Scheduling In contrast to program analysis, where only a single program is investigated, the workload in mapping (assigning parts of a program to processors) and scheduling (determining their order of execution) typically consists of several programs, with parts of the architecture being the component under study (e.g. mapping and scheduling programs and facilities). Questions in scheduling and mapping include: analyzing mapping or scheduling techniques (complexity and quality of the achieved mapping or schedule), determining optimum mappings/schedules in a single or multiple user environment, and determining the optimum number of processors to be allocated to a program. Various optimality criteria (like maximum throughput or minimum average response time) are possible. If a certain mapping or scheduling technique is investigated, the workload has to be selected (to represent either the typical behavior or a stress case), while in determining the optimum schedule/mapping for a particular set of programs, the workload is given by this set of programs. The arrival of programs at the system (i.e. the user interaction) has to be considered. The question of how to characterize the workload remains for both objectives. Several approaches can be found in the literature. In [Norm 93] a survey on models to be used in mapping studies is given. Characteristics to be used in scheduling are investigated for example in [Ferr 88], [Sevc 89, Sevc 94], [Rost 94], [Rein 94], and [El R 90].

Performance Debugging Whenever the performance of a parallel system is unsatisfactory, it is necessary to detect the sources of performance degradation and to improve the system. Performance debugging describes all activities related to the detection and removal of performance bottlenecks in parallel programs and is therefore a consequence of program analysis if the observed performance does not fulfill the expectations. Improvements can concern the program, the hardware (e.g. assigning a different number of processing elements to the problem), or the mapping (assigning the problem parts in a different way). The component under study is the parallel program; the system is investigated in a single program mode, therefore no user interaction is considered. Performance debugging mainly includes improvements of the program or the mapping; variations concerning the architecture are typically restricted to varying the system size (e.g. the number of processing elements). Although it would also be possible to evaluate a system using modeling techniques, typically measurement techniques are applied, since rather accurate results are needed. To support the process of problem detection, a variety of monitoring and visualization tools have been developed (see [Blas 92] for a survey). These tools help in gaining insight into performance and program behavior, but only a few are helpful in providing hints for actually improving the performance. Concerning the workload, the problems are very similar to those discussed for the objective "program analysis".

Sensitivity and Scalability Analysis Finally, an important objective is to investigate the influence of changes in the input parameters on changes in the performance results. In general, these studies are summarized under the term sensitivity analysis. They should be an integral part of any modeling study, since the accuracy of the modeled performance results can only be interpreted correctly if it is known how sensitive they are with respect to the workload parameters. In the analysis of parallel systems, the investigation of scalability is of particular interest. The objective is to predict the performance of the parallel system when the number of processors and/or the problem size is increased [Hwan 93b]. As a consequence, the system under test and the component under study are a particular pair of architecture and program. A characterization of the program in terms of functional, behavioral, and quantitative aspects is sufficient, since user interaction can be neglected when analyzing a single program. A workload model supporting sensitivity and scalability analysis has to be easy to modify (flexible and in particular scalable) in order to allow the necessary parameter variations. Usage costs have to be rather low to support a fast evaluation of a large number of variations. The problem of scalability analysis will be addressed in Chapter 4.

2.2.2 Influence of the Evaluation Technique

Performance Measurement Measurement techniques can be applied whenever the parallel system, the program, and the data are available in the analysis experiment. Measurement techniques are distinguished into hardware, software, and hybrid monitoring approaches. In hardware monitoring, the processing system is observed using specialized, dedicated hardware for the measurements. The advantage of hardware monitoring is that the system behavior is not disturbed significantly; the disadvantage is the high cost of the additional hardware. In software monitoring, the program itself is modified to produce additional information on system behavior. This modification can consist either of instrumentations of the code (source code or object code) or of additional programs at the operating system level. The disadvantage is a significant influence on performance (see for example the studies in [Malo 92]). In hybrid monitoring, both hardware and software components are used. While most of the monitoring tools for parallel systems are software monitors (see for example the tools described in [Nich 91] or [Liao 92], which are based on code instrumentation, or the tool proposed in [Klar 92], which collects monitoring information at the operating system level), some hardware monitoring tools exist (usually commercial tools, developed by the hardware vendors), and there are also a few reports on hybrid approaches [Haba 90]. The test workload to be used in measurements is not influenced by the particular monitoring technique. The only restriction is that only the class of executable models can be used in performance measurements. Although there are several interesting aspects in workload characterization for performance measurements (like selecting the appropriate workload or benchmark [McDo 92, Berr 92], or the time interval for the observations if a real workload is used), these problems will not be discussed in this work. Here the focus is on performance modeling approaches, which will be discussed in the following.

Figure 2.7: Comparing Amdahl's Law and Gustafson's Law. Amdahl assumes a fixed-size workload with the total sequential time normalized to 1 = W_seq + W_par, giving speedup = sequential time / parallel time = 1 / (W_seq + W_par/P). Gustafson assumes a scaled workload with the parallel time normalized to 1 = W_seq + W_par, giving speedup = (hypothetical) sequential time / parallel time = W_seq + P * W_par. (W_seq: sequential part of work, W_par: parallel part of work, P: number of processing elements.)

Performance Modeling Whenever parts of the real system under investigation are not available, it is only possible to model the system and to predict its performance. A variety of performance modeling techniques exists; in this work only a selection of techniques will be discussed that have proven to be useful in the evaluation of parallel systems. The discussion shall not explain these methods in detail, but point out the areas of application and the demands posed on the workload model. A more formal definition of the models discussed here is given in Appendix B.

Fundamental Laws One of the most famous laws in parallel processing is Amdahl's law [Amda 67], which gives an upper bound for the attainable speedup (defined here as the ratio of the execution time on a system with one processor to the execution time on a system with P processors) in a parallel program as a function of the sequential fraction of code. The rather pessimistic conclusion is that the speedup is bounded by 1/W_seq, where W_seq denotes the sequential fraction in the application, i.e. the amount of work to be executed sequentially (see left of Figure 2.7). A different viewpoint of the problem was found by Gustafson [Gust 88]. By assuming that the amount of work to be processed will increase when using larger systems, an (ideal) linear speedup bound can be derived (see right of Figure 2.7). It is important to mention that neither the one nor the other approach is wrong; they just differ in the assumptions they make about the workload. While Amdahl assumed a fixed size workload, Gustafson assumed a scaled size workload.

The input to both models is a characterization of the degree of parallelism in terms of the sequential and parallel fractions of computation, given as deterministic, architecture dependent values. In practice, neither the one nor the other model will obtain really accurate bounds on performance, because no overhead is considered. Several amendments have been made to consider communication overhead, contention, and load imbalance. We will come back to this topic in Chapters 3 and 4.
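A minimal sketch of the two bounds, with the workload characterized only by its sequential fraction W_seq (normalized so that W_seq + W_par = 1):

    def amdahl_speedup(w_seq, p):
        """Fixed size workload: speedup = 1 / (W_seq + W_par / P)."""
        return 1.0 / (w_seq + (1.0 - w_seq) / p)

    def gustafson_speedup(w_seq, p):
        """Scaled size workload: speedup = W_seq + P * W_par."""
        return w_seq + p * (1.0 - w_seq)

    for p in (4, 16, 64):
        print(p, amdahl_speedup(0.05, p), gustafson_speedup(0.05, p))
    # Amdahl saturates below 1/0.05 = 20, Gustafson grows almost linearly in P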

Analytic, Deterministic Models In analytic, deterministic models, a set of relations between the load parameters and the performance values is derived. These relations are either given as closed forms derived from observed measurement results or as a complexity notation derived from program analysis. It is unrealistic to assume that the complex interactions between the program and the architecture and the resulting performance can be captured in simple closed forms for realistic applications. Therefore analytic models should rather be seen as a first step in performance prediction, providing either rough bounds on the performance (similar to the fundamental laws) or at least showing the direction of dependencies among workload parameters and performance results, i.e. identifying the parameters of influence and the kind of their influence (directly or inversely proportional). Examples in the literature where the use of analytic, deterministic models is demonstrated are [Agar 92] and [Ajmo 86].

Stochastic Graph Modeling Another type of performance model are stochastic graph models, where the program is represented as a directed graph: nodes correspond to the components of (or events in) the program and arcs represent precedence relations between components (events). The durations of the events are specified by random variables. This corresponds to a description of the workload in terms of the dependency structure and the computation demands. The objective is to evaluate symbolically the distribution of the total duration of the program or to calculate moments of the distribution [Gele 86, Chim 88, Hart 93].


Assuming that nodes (or subgraphs) are combined either in series or in parallel, a set of rules for calculating the distribution of such a Series-Parallel Graph can be given (see the appendix for the rules and formal definitions) [Sahn 85, Gele 86]. Since solving such models manually is tedious, tools have been developed supporting the specification and analysis of stochastic task graphs [Triv 90]. Stochastic graph models of arbitrary structure are to be evaluated using simulation techniques; a minimal simulation sketch for the series-parallel case is given below. Note that in this model all information on the architecture is hidden in the workload model. In fact, the workload model, specified by the dependency structure and by the stochastic, architecture dependent computation demands, is the only performance model. Communication can be modeled within this formalism by introducing additional nodes representing the duration of the communication between two components.
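A minimal Monte Carlo sketch for series-parallel stochastic graphs (the graph, its exponential durations, and the node encoding are illustrative assumptions, not taken from the cited tools): series composition adds the sampled component durations, parallel composition with a full join takes their maximum.

    import random

    def sample(node):
        kind, body = node
        if kind == "task":          # body is a duration sampler
            return body()
        times = [sample(child) for child in body]
        return sum(times) if kind == "series" else max(times)

    exp_task = lambda mean: ("task", lambda: random.expovariate(1.0 / mean))
    graph = ("series", [exp_task(2.0),
                        ("parallel", [exp_task(5.0), exp_task(5.0), exp_task(5.0)]),
                        exp_task(1.0)])

    runs = [sample(graph) for _ in range(100000)]
    print(sum(runs) / len(runs))    # estimated mean of the total duration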

Stochastic Processes A stochastic process is a family of random variables {X(t), t ∈ T}, where t is an index parameter (usually time) and X(t) ∈ S, with S the state space [Triv 84]. Since both the state space and the index parameter are either discrete or continuous, four different types of stochastic processes can be distinguished. A further criterion for classification is the dependence or independence of the state transition probabilities. So called Markov processes, where the transition probabilities depend only on the current state and not on the history of previous states (memoryless property), are frequently used in performance modeling, because the memoryless property allows analytic evaluation techniques. Depending on the semantics of states and state transitions, a variety of aspects related to a parallel program or a parallel architecture can be modeled [Moll 89, Iaze 93]. But because of the complexity of defining the state space, usually higher level specification models such as Petri nets or queuing networks are used in practice [Jonk 93]. A minimal numerical sketch for the discrete-time Markov case is given below.
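In this sketch (the three states and the transition matrix are assumed example values), the stationary distribution, i.e. the long-run fraction of time spent in each state, is approximated by power iteration:

    def stationary(P, iters=1000):
        n = len(P)
        pi = [1.0 / n] * n
        for _ in range(iters):
            pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        return pi

    # assumed states: 0 = computing, 1 = communicating, 2 = waiting
    P = [[0.7, 0.2, 0.1],
         [0.5, 0.3, 0.2],
         [0.6, 0.1, 0.3]]
    print(stationary(P))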

Petri Nets, Timed Petri Nets It is assumed that the reader is familiar with basic Petri net theory and Petri net properties as described in [Mura 89]. In Appendix B.1 a definition of a Petri net is given to introduce the notation used in this work.

Table 2.1: Analysis Techniques for TTPNs

Class                              Type of Transitions allowed        Analysis Techniques
nontimed PN                        I                                  no quantitative analysis possible
Timed Marked Graph                 D-DT, D-CT, M-DT, M-CT,            recurrence equations
                                   G-DT, G-CT
Discrete Time Stochastic PN        M-DT                               isomorphic to DTMC
Continuous Time Stochastic PN      M-CT                               isomorphic to CTMC
Discrete Time Generalized SPN      I, M-DT                            DT semi-Markov chain
Continuous Time Generalized SPN    I, M-CT                            CT semi-Markov chain
Deterministic and Stochastic PN    I, M-CT, maximum 1 D-DT            generalized semi-Markov
                                   enabled in each marking            process
Transition Timed PN                any                                simulation

Legend: I = immediate transition; D = deterministic value; M = distribution with Markov property; G = generally distributed; DT = discrete time; CT = continuous time; MC = Markov chain.


While Petri nets are excellent models for representing all aspects of parallelism (concurrency, synchronization, ...), thus supporting qualitative analysis of parallel programs and system models, a time dimension has to be added to capture also quantitative evaluation. Several approaches have been proposed in the literature for associating time. In Place-Timed PNs a timing function τ : P → IR assigns holding times τ_i to places p_i ∈ P. In Transition-Timed PNs (TTPNs) the timing function τ : T → IR assigns firing delays τ_i to transitions t_i ∈ T. Arc-Timed PNs (ATPNs) are characterized by a timing function τ : (P × T) ∪ (T × P) → IR which assigns propagation delays τ_i to the arcs. If aging functions are assigned to tokens, which are incremented when tokens flow through places, transitions, or arcs, the net is called a Token-Timed PN. It is important to note that almost every timing concept has an equivalent representation as a TTPN; TTPNs have thus become the most popular class of timed PNs. Depending on the type of transition firing functions in the net, different analysis techniques can be applied, which are summarized in Table 2.1. In all these models, there is a trade-off between modeling power and the complexity of solving the model (decision power). A state space based analysis, which would provide the most accurate and versatile performance results, is exponential in the number of places and the number of tokens in the initial marking. The decision power can generally be improved by reducing the modeling power, e.g. polynomial analysis for restricted PN classes (Marked Graphs, Free-Choice Nets). Therefore, simulation is an attractive alternative to analytical methods, because the simulation model is likely to be of smaller complexity and is applicable to any class of nets; a minimal simulation sketch is given below.

Using the Petri net formalism, the dependency structure and the degree of parallelism can be easily specified in the workload model. The transitions in the net would represent the components in the program, together with additional transitions for representing synchronization, communication, or memory access structures. The corresponding demands are the timings associated with the transitions. Both architecture independent and architecture dependent characterizations are possible, as well as stochastic or deterministic parameters for the firing times (see also Table 2.1). The Petri net description of the workload can be easily combined with a model of the architecture into an integrated performance model; see for example the PRM approach proposed by Ferscha [Fers 90a, Fers 92] and further developed in [Wabn 94a]. A variety of other papers give examples on how to model parallel systems within the Petri net formalism [Chio 93a, Chio 93b, Gran 92], but in most of these approaches there is no clear separation between the workload and the system model.
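A minimal simulation sketch for a transition-timed Petri net with exponential firing delays under a race policy (the fork/join net and its rates are illustrative assumptions; the fast fork and join transitions approximate immediate transitions):

    import random

    def enabled(marking, pre):
        return all(marking.get(p, 0) >= n for p, n in pre.items())

    def simulate(transitions, marking, rates):
        clock = 0.0
        while True:
            ready = [t for t, (pre, _) in transitions.items() if enabled(marking, pre)]
            if not ready:
                return clock
            delays = {t: random.expovariate(rates[t]) for t in ready}
            t = min(delays, key=delays.get)     # race policy: earliest sample fires
            clock += delays[t]
            pre, post = transitions[t]
            for p, n in pre.items():            # consume input tokens
                marking[p] -= n
            for p, n in post.items():           # produce output tokens
                marking[p] = marking.get(p, 0) + n

    net = {"fork":   ({"start": 1}, {"a": 1, "b": 1}),
           "work_a": ({"a": 1}, {"da": 1}),
           "work_b": ({"b": 1}, {"db": 1}),
           "join":   ({"da": 1, "db": 1}, {"end": 1})}
    rates = {"fork": 1000.0, "work_a": 0.5, "work_b": 0.5, "join": 1000.0}
    runs = [simulate(net, {"start": 1}, rates) for _ in range(10000)]
    print(sum(runs) / len(runs))    # mean completion time of the fork/join structure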

Queuing Network Models Queuing theory is one of the most commonly used analytic performance modeling methods. The evaluation techniques are well established and the theory is well understood. Typically, a queuing network model (QNM) is used to represent the components of a processing system as stations, and the processing requests (the workload) are modeled as job flows between the stations. Each job is characterized in terms of its service requests per station and in terms of its visit behavior; these are therefore the aspects that the workload model has to represent. Although solution techniques for product form queuing networks are well known, the necessary assumptions are often not fulfilled when modeling a parallel system. A further drawback of queuing models for the purpose of predicting the performance of a parallel algorithm is that these models are architecture oriented, i.e. the queuing network model represents the architecture in detail, but the program structure is considered insufficiently. Finally, the representation of contention and synchronization is more complicated in QNMs than in Petri nets. Only extensions to queuing networks allow the representation of contention and synchronization [Bacc 90], and this class of networks can usually only be solved using simulation. Most literature on queuing theoretical performance analysis of parallel systems focuses on the analysis of parallel architectures [Kush 93, Lauw 93, Chia 91]; some approaches propose a stochastic graph model as a specification of the workload and a queuing network model of the architecture [Nels 88, Mak 90]. A combination of Petri nets and the queuing network formalism has been presented in [Alme 88, Chan 89].
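A minimal sketch of exact Mean Value Analysis for a closed, single-class product form queuing network, one of the standard QNM solution techniques (service demands, population, and think time are assumed example values):

    def mva(demands, n_jobs, think=0.0):
        """demands[k]: visit count x mean service time at station k."""
        q = [0.0] * len(demands)                 # mean queue lengths
        x, r = 0.0, list(demands)
        for n in range(1, n_jobs + 1):
            r = [d * (1.0 + qk) for d, qk in zip(demands, q)]  # residence times
            x = n / (think + sum(r))             # system throughput
            q = [x * rk for rk in r]
        return x, sum(r)

    throughput, response = mva(demands=[0.10, 0.05, 0.02], n_jobs=20, think=1.0)
    print(throughput, response)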

Process Algebras A convenient formalism for the specification of systems which contain concurrency are process algebras, which are based on the concept of cooperating and communicating agents. To apply process algebras in the context of performance modeling, a timing concept has to be introduced. Stochastic process algebras [Goet 93, Hill 94] have proven to be a suitable extension to address the problem of performance analysis. When modeling a system within a (time extended or stochastic) process algebra, the system is described in terms of its processes (the agents) and the interactions between these processes (see for comparison the concept of CSP as proposed by [Hoar 78, Hoar 85]). The major motivation for using stochastic process algebras is to have a unique formalism for both functional specification and performance modeling. Less emphasis was given to a structured approach to performance evaluation in terms of a clear separation between workload and architecture description. The workload is basically considered in the stochastic parameters assigned to the processes, which represent the durations of these processes.

2.3 Constructing a Non-Executable Test Workload Model

A model is an abstraction from the real world. In workload modeling, the components of the workload are this "real world" that has to be represented in a model. The aspects that are abstracted from these components are the characteristics. Depending on which aspects of the workload are modeled, on the stochastic nature of the models, and on the architecture dependence or independence of the models, different types of parameters can be used to model the characteristics of the workload, as depicted in Figure 2.8. The cells in this matrix scheme represent the possible instances for parameters describing the component characteristics. For example, the maximum cut in a task graph representation of a parallel program would be a deterministic, architecture independent parameter, giving an upper bound for the degree of parallelism contained in a program. An estimate of the execution time for a certain routine, given as a random variable, would be a stochastic, architecture dependent parameter for the computation demands. Note that this characterization scheme is to be seen as orthogonal to the previously defined hierarchy of workload components (application, algorithm, routine), i.e. either the total workload (application) or its components can be characterized according to the scheme.

Figure 2.8: Workload Model Classification Scheme. The represented aspects (Functional Composition, Dependency Structure, Degree of Parallelism, Communication Behavior, Memory Access Behavior, Computation Demands, Communication Demands, Memory Demands, I/O Demands, and User Behavior) form the rows of a matrix whose columns distinguish the nature of the model: stochastic versus deterministic, and architecture dependent versus architecture independent.

2.3.1 Model Parametrization

The characteristics depicted in Figure 2.8 can be represented in non-executable test workloads using different parameters. According to the proposed methodology, the representation of the characteristics in terms of models and parameters is to be performed in steps four to seven (see Figure 2.6). In the following, each characteristic will be discussed with respect to the type of parameters suitable for non-executable test workloads. The corresponding step of the methodology, where the particular characteristic is to be defined, is given in brackets.

User Behavior (step 4)

In modeling a single application, no user behavior has to be considered. In modeling the behavior of a parallel system when executing several programs (either submitted by a single user or by a group of users), the arrival rate of programs in an open system (or the number of programs to be processed in a closed system) must be given. While the arrival rate is a stochastic parameter, the number of programs is a deterministic value (but can, for example, be varied in a scenario analysis).


Functional Composition (step 5)

For a non-executable workload model, the description of the functional composition mainly defines the granularity. It is therefore sufficient to provide the set of components without any further description of their functionality.

The granularity at each layer depends on the workload under study and on the objective of the analysis. For example, it is possible to lump two algorithms together into a single algorithm component A_i at the application layer. This lumping implies that all the other characteristics can only be specified in terms of the component A_i. Since this description is merely an enumeration of the components, it is by default architecture independent and deterministic. (If a certain component is only included in the workload with a certain probability, this will be represented in the dependency structure.)

Dependency Structure (step 6)

The data and control dependencies among the components can either be specified in the workload model or be considered in the performance model. When considering them in the workload model, a graph model is a convenient and frequently used representation, where the nodes correspond to the components and the arcs represent the dependencies. These models may either be deterministic (i.e. there are no branching probabilities; the graph depicts a fixed and given order among the components) or stochastic (certain branches are taken with a certain probability). In an architecture independent representation, no particular mapping of components to processing elements can be assumed, and therefore all possible dependencies among the components have to be specified. In an architecture dependent representation, the known assignment of components to processing elements can reduce the representation complexity. For example, a sequence of components assigned to a single processing element can be represented as a single node. Considering the dependency structure in the performance model instead reduces model flexibility and understandability, as the information on the workload is "hidden" in the monolithic performance model.


Degree of Parallelism (step 6)

In performance modeling, the degree of parallelism that can be given in advance (i.e. before solving the model) is typically the potential degree of parallelism obtained from a structural analysis of the program code or a specification, and is therefore an architecture independent parameter. The actual degree of parallelism, depending on the physical restrictions of the parallel architecture, is an output of the evaluation study. In studies where different mapping or scheduling policies are investigated, previous runs of the programs to be mapped or scheduled can be used to characterize the actually observed degree of parallelism (DOP) as an architecture dependent parameter; a minimal sketch of such a characterization is given below. Based on previous measurements, also a stochastic model of the DOP can be derived.
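A minimal sketch for deriving the observed DOP from a measured trace of component start and end times (the trace format and values are assumed):

    def dop_profile(intervals):
        """intervals: list of (start, end); returns (maximum DOP, time-averaged DOP)."""
        events = sorted((t, d) for s, e in intervals for t, d in ((s, 1), (e, -1)))
        active, max_dop, area, last_t = 0, 0, 0.0, events[0][0]
        for t, d in events:
            area += active * (t - last_t)   # accumulate DOP over time
            active += d
            max_dop = max(max_dop, active)
            last_t = t
        span = events[-1][0] - events[0][0]
        return max_dop, area / span

    print(dop_profile([(0, 4), (1, 3), (2, 6), (5, 6)]))   # -> (3, about 1.83)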

Communication/Memory Access Behavior (step 6)

Depending on the assumed paradigm for exchanging information between components, either a communication or a memory access model can be defined. This model has to represent the communication partners (or access patterns) and the frequency of communication (accesses).

The communication or memory access volume is separated from the behavioral description and specified in the corresponding demands (communication demands and memory demands). This separation is motivated by the fact that some performance modeling techniques (e.g. queuing network models) only need resource demands as input parameters; structural dependencies are represented in the performance model, not in the workload model. Note that this characteristic is in some sense architecture dependent, since a certain communication paradigm (either message passing or shared address space) is assumed.

Computation Demands (step 7)

To derive a performance model, information is needed about the internal characteristics of the components in terms of their processing requirements. An architecture independent characterization can be obtained if the computation demands are specified in a logical way [Ferr 83], e.g. by counting the number of operations that a component consists of. These counts can then be combined with the architecture specific execution times per operation in a performance model; a minimal sketch of this combination is given below. In an architecture dependent characterization, the execution times would be directly specified in the workload model. This has the advantage of a less complex performance model, but the workload model is obviously less flexible.
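In this sketch (operation classes, counts, and per-operation times are assumed example values), the architecture independent counts stay in the workload model and are multiplied with machine specific operation times only when a concrete architecture is evaluated:

    # architecture independent part of the workload model: operation counts
    op_counts = {"flop": 2.0e6, "int_op": 5.0e5, "load_store": 1.2e6}

    def execution_time(counts, sec_per_op):
        """Combine counts with one architecture's seconds-per-operation table."""
        return sum(n * sec_per_op[op] for op, n in counts.items())

    machine_a = {"flop": 1.0e-7, "int_op": 5.0e-8, "load_store": 8.0e-8}
    machine_b = {"flop": 4.0e-8, "int_op": 4.0e-8, "load_store": 9.0e-8}
    print(execution_time(op_counts, machine_a),
          execution_time(op_counts, machine_b))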

Communication Demands (step 7)

While the structure of the communication is specified in the communication or memory access model, the description in terms of volume and type of data is summarized in the communication demands. In an architecture independent description, no timing information about the actual processing system is included. All aspects of contention or competition for resources and machine specific delays are then to be represented explicitly in the performance model, thus increasing its complexity. Depending on the objective of the evaluation, it might be desirable to include these aspects already in the workload model, thus moving the characterization effort from the performance model into the workload model. Once the architecture dependent timings have been obtained, a more concise representation is possible, but again at the cost of losing flexibility.

Memory Demands (step 7)

To be able to consider the hierarchical organization of the memories of parallel systems (including caches) and the resulting non-uniform memory access times in a performance model, already the workload model has to contain information on the memory demands, again specified either in terms of data volume or in terms of access times (the latter being more architecture specific).

I/O Demands (step 7)

An important aspect, which is often neglected in parallel performance evaluation, is the I/O requirements. Research on parallel I/O is just at its beginning; therefore I/O is often the bottleneck in a parallel system. Again, a characterization can either be architecture independent (specifying the I/O volume) or architecture dependent (specifying the requested service times for I/O).

Figure 2.9: Deriving the Load Parameters (architecture independent (static) parameters are derived at the application, algorithm, and routine layers; architecture dependent (dynamic) parameters are measured; the parameters are then synthesized across the layers)

After discussing the suitable parameters, the question of how to derive these parameters has to be discussed. Figure 2.9 depicts possible ways of deriving load parameters. While architecture independent parameters (e.g. the number of statements or their dependencies as determined by the control structures of the program) can be derived from an analysis of the program, architecture dependent parameters (e.g. the time to execute a certain statement) are typically obtained from measurements. Although both techniques can be applied at any desired layer of the program, typically measurements are made at lower levels and the results are then aggregated to parameters at higher layers (e.g. aggregating the execution times for single statements in a routine into the total execution time of the routine).

Parameters Obtained from Measurements When deriving load parameters from measurements, the following problems have to be considered. First, the workload used in these measurements is typically not identical to the workload that has to be modeled. Usually, only parts of the workload (e.g. selected components) can be measured, and the workload model has to be constructed based on these sample measurements. Therefore it is crucial to select carefully what to measure. Second, the data obtained from measurements have to be statistically processed. Outliers must be carefully considered; the total amount of measurement data will not be used directly to parametrize the workload model, but several reduction techniques have to be applied. This process of reducing the measured data to obtain a more concise model is sometimes called workload characterization in the literature. Note that in this work the term workload characterization is used in a wider sense, including all activities related to the construction of a workload model. A detailed discussion of techniques for workload characterization in the narrow sense (e.g. clustering or fitting techniques) is far beyond the scope of this work. A survey can be found in [Jain 91b], where also several suggestions for further reading are given. An example of a tool for workload characterization is MEDEA [Merl 93]. Discussing the problem of what to measure in order to obtain meaningful input data in general would also exceed the scope of this work, but it will be discussed in the context of scalability analysis in Chapter 4. In the literature, many examples of using measurements as input to performance models can be found; see for example [Bron 90].

Parameters Obtained from Program Analysis Complexity analysis is one possible technique to derive load parameters of a program. The techniques and notations known from sequential processing can be applied (O, Ω, Θ notation) [Quin 93a]. In addition, in parallel processing, the concept of computation costs has been introduced. The computation costs of a parallel algorithm are defined as the product of its time complexity and the number of processing elements. A parallel algorithm is said to be cost effective if its costs are not larger than the complexity of the best known sequential algorithm. Complexity functions are frequently used in scalability analysis to characterize a parallel program. Another possibility is to characterize the workload by counting the type and number of instructions in the program. These counts are sometimes used in benchmark programs to quote a certain computation or communication speed, given by the number of instructions processed in a certain time interval or by the number of bytes transferred in a certain time interval. Techniques for analyzing the data dependencies in a program can also provide performance characteristics, but are mainly used in parallel compiler technology.


Finally, a graph model representation of the program can serve as a basis for deriving several other program parameters.

Stochastic versus Deterministic Parameters

A deterministic model is typically easier to solve than a stochastic model. Unfortunately, there are several reasons why stochastic models should be used instead of deterministic models. On the one hand, there can be non-deterministic behavior within the program or within the architecture. But even if all factors of influence are of a deterministic nature, it might not be possible to specify their interaction in an accurate way (because there are too many parameters, or some of the parameters are not known or not quantifiable), thus motivating the use of a stochastic model. Whenever a stochastic model is used, the stochastic nature of the parameters must be specified by giving their distributions or by giving some characteristics or moments of their distribution. Recently, several authors have investigated the influence of the stochastic nature of parameters on the derived performance results [Adve 93]. It has been observed that the error when using a deterministic model instead of a stochastic model was surprisingly small in several case studies. However, much more research has to be done on sensitivity analysis before a general conclusion can be drawn from these observations.

Architecture Independent versus Architecture Dependent Parameters

The decision between an architecture independent representation of parameters and an architecture dependent representation is a decision between workload model flexibility and performance model complexity. If the workload model has to be flexible, e.g. if it is necessary to compare several systems using identical workload models, an architecture independent characterization is favorable. However, the architecture dependencies then have to be considered in the performance model, thus enlarging the modeling complexity.

2.3.2 Validation Criteria for Workload Models

Step 10 (in Figure 2.6) summarizes all activities related to the validation and interpretation of the results, including also a validation of the workload model.


A set of criteria has been established for validating workload models for uniprocessor systems and computer networks [Ferr 78]. In the context of workload models for parallel systems, the following criteria seem to be relevant: representativeness, flexibility, simplicity of construction, compactness, usage costs, and system independence.

Representativeness

How does the performance and/or the behavior under the real load compare to the performance and/or behavior under the modeled load?

Representativeness is the most frequently used validation criterion. We distinguish between performance-oriented (with respect to some user-supplied performance index) and resource-oriented representativeness. A workload model is said to be performance-representative if the performance indices of the system using the workload model are the same as the performance indices observed with the real workload. A workload model fulfills resource-representativeness if the system resources are utilized in the same way when using the test workload and the real workload. Representativeness is evaluated by comparing the measured performance indices or resource utilizations under the real load against the results under the generated, modeled loads. Since the real loads are frequently not available, the models are compared to simulation results or measurements on other test workloads. It is important to note that in those cases the simulation or measurement itself requires a workload model, which can also introduce sources of inaccuracy.

Flexibility

What is the effort for adapting the model to changes in the real workload?

In a flexible model it is easy to change parameters in order to represent changes in the workload. A subcriterion of flexibility is scalability, which refers to the possibility of extending the workload model to larger problem sizes without increasing the model complexity too much.


Simplicity of Construction, Compactness, Usage Costs

What is the effort in model construction? Is the model clear in its structure and easy to understand? What is the effort for using the model?

All these criteria address the "usability" of the workload model. On the one hand, it seems that one goal supports the other, e.g. a model that is simple to construct should also be compact and easy to use. But practice shows that there is sometimes a trade-off between simplicity on the one side and compactness and usage costs on the other, if we consider not only the usage costs of the workload model, but also the usage costs of the performance evaluation technique where the workload model is going to be used.

System Independence

Is the workload model valid for different instances within a class of parallel architectures?

System independence, which is an important criterion for workload models for sequential systems, has to be redefined in the context of parallel systems. Although a workload model may be independent of the particular instance of a parallel machine within a class of parallel machines (e.g. a particular shared memory computer), it is in general unreasonable to claim system independence among the different architecture classes. The classes differ significantly in their characteristics, and this difference must already be considered in the workload model. The workload model should allow a system independent representation of the workload in the sense that it is possible to modify several hardware parameters (like number of processors, processing speed, communication speed) without changing the workload model. Changes in the architecture class (e.g. from distributed memory multicomputers to vector processors) will in general also influence the workload model, since other parameters become relevant. A workload model which is system independent with respect to all architectures would have to capture all aspects that might ever become relevant for a certain architecture and would therefore be too costly.

Chapter 3

Survey and Comparison of Parallel Workload Models

It has been shown in the previous chapter that the proposed methodology is general enough to be adapted to any type of evaluation study. All the following considerations are restricted to the class of non-executable models1, since the major objective of this work is to develop a workload modeling methodology mainly to be applied in performance modeling studies. A variety of approaches for a quantitative description of parallel programs can be found in the literature. In this chapter, these approaches are compared to the proposed framework in terms of the characteristics they are able to represent. The models are discussed in four groups, which differ according to the way in which information on the parallel program is presented, namely parameter based models, the family of profiles, signatures and shapes, ratios for varying system size, and behavior (graph) based models. Each group is described briefly in the following.

Parameter based models try to capture the workload characteristics in either a single value or a set of values. The parameters can either be specified in a domain oriented, resource oriented, or structure oriented way. Typically, program and resource oriented parameters are used to depict the different types of demands; these parameters are used as input to performance models like queuing network models or timed Petri nets for specifying "service demands" or transition firing times. Structure oriented parameters do not explicitly represent detailed structural or behavioral information of the workload or dependencies between

1 In the following, "non-executable workload models" are called "workload models" for short.


workload components, but try to include structural and behavioral information in a single parameter.

Profiles, signatures, and shapes are all based on the concept of the varying degree of activity within a program, where different interpretations of "activity" can be distinguished. Activities can be defined in terms of computations or communications, but also in terms of "non-activity", depicting the phases of idle times within a parallel program. Phases of activity can either be represented as a function of time (profiles and shapes) or as a function of the number of active units2 (signatures). From these diagrams, the sequential or parallel fraction of a program can be derived (or it can be derived directly from an analysis of the program or from measurements). If these fractions are not only given for a certain system size (i.e. for a single number of processing elements P), but as a function of P, further metrics like speedup and efficiency can be derived. Since these models depict performance ratios for a parallel program when the system size is varied, they are summarized in a class called ratios for varying system size.

Different types of graph models have been proposed to represent the workload. The most popular representation are task graphs, but timed Petri nets or PERT networks can also model the characteristics of the workload. All of these models represent some aspect of program behavior (structure, dependencies, ...) and may also serve as a performance model, i.e. there exist analysis techniques that allow the performance to be analyzed directly based on the information contained in the behavior graph based workload model (e.g. analysis techniques for stochastic series parallel graphs).

To compare the variety of approaches in a systematic way, the classification of Section 2.3.1 in Figure 2.8 is used, which distinguishes workload models according to the abstracted aspects, their architecture dependence or independence, and their stochastic or deterministic nature. While the advantages of a stochastic versus a deterministic model and of an architecture independent versus an architecture dependent model will be discussed in the text only, the represented aspects (the characteristics) are depicted in Figure 3.1, which gives a schematic survey of the workload models discussed in this section. A cell entry filled in grey indicates that the model does represent the aspect given on top of the column. The white-grey pattern indicates that the model does only partially

2 The term unit will be more precisely specified when discussing these models.

Figure 3.1: Schematic Representation of Workload Models (rows: ratios for varying system size, comprising execution time function, speedup, efficiency, efficacy, processor working set, quality of parallelism, utilization, and redundancy; profiles, signatures, and shapes, comprising activity profiles, activity signatures, activity shapes, and sequential/parallel fractions; behavior graph based models, comprising task graphs, event graphs, PERT networks, Petri nets, control flow graphs, data flow graphs, data dependency graphs, and communication graphs; and parameter based models, comprising resource, program, and structure oriented parameters; columns: computation demands, memory demands, communication demands, I/O demands, memory access behavior, communication behavior, degree of parallelism, data dependency, control dependency, and functional composition; cell shading marks each aspect as not represented, partially represented, or represented; black pointed arcs indicate that model B can be derived from model A, white pointed arcs that information from model A is included in B)

represent the particular aspect. Partially represented aspects are either characterized at a very abstract level (e.g. a characterization of a parallel program in terms of its speedup gives only a very rough idea of the degree of parallelism), or they are only an optional, additional parameter in the model (e.g. in most behavior graph based models arc and node annotations can represent various types of demands). Finally, it is worth mentioning that the models are neither used exclusively (e.g. resource and domain oriented parameters can be used to annotate arcs and nodes in behavior graph based models) nor are they independent of each other. Within and between these groups, several models can be derived from others. These dependencies are represented by arcs. The black pointed arcs connecting model rows indicate that the information contained in the source model (i.e. the model where the arc originates) is sufficient to derive the target model (i.e. the model where the arc points to). Note that the arcs represent a transitive relation, i.e. if model B can be derived from model A and model C can be derived from model B, then model C can also be derived from model A. White pointed arcs indicate that some information of the source model is included in the target model. In the following, each type of model will be discussed by giving a basic definition of the model, by explaining possible extensions to the model, by pointing out advantages and disadvantages and areas of application, and by giving some examples and references to literature where these models have been used. Special emphasis was given to the discussion of the family of profiles, signatures, and shapes and to ratios for varying system size, since these models are the fundamentals for scalability analysis, to be discussed in the last chapter.

Example To illustrate the different types of models, a parallel implementation of the assignment problem was chosen. In the formulation of the assignment problem as implemented in this work, m "tasks" are to be assigned to m "persons"; to each individual assignment a certain profit is associated, and the objective is to find an assignment which maximizes the total profit, defined as the sum over all individual profits3. This program has been implemented on a workstation cluster in C using PVM [Begu 93] for message passing by a group of students within the "Projektstudium Parallelverarbeitung, Studienjahr 1993/94" [Kope 94]. The parallel implementation is based on the auction algorithm [Bert 89a], which is an iterative master worker application (see Figures 3.10 and 3.12 for a task graph and a control flow graph model of this application). The models were obtained from an analysis of the program code (for the graph models) and from measurements using event based source code instrumentation. The recorded events are summarized in Table 3.1.

MCB  begin of computation block at the master
MCE  end of computation block at the master
WCB  begin of computation block at the worker
WCE  end of computation block at the worker
MSB  begin of send at the master
MSE  end of send at the master
WSB  begin of send at the worker
WSE  end of send at the worker
MOB  begin of overhead at the master
MOE  end of overhead at the master
WOB  begin of overhead at the worker
WOE  end of overhead at the worker

Table 3.1: Measured Events of the Example Program

A computation block is a sequence of statements containing no communication. The begin of the send statement was the actual pvm_send statement; computations for packing the data were considered as overhead. Other sources of overhead were times spent waiting for communication. Within the interval between begin of computation block and end of computation block, a processing unit (master or worker) is said to be in state "computing" in this example. The communication phase is bounded by the begin and end of sends, and all other states are classified as overhead. Note that the receive time is regarded as overhead; therefore it is possible that an odd number of units may be in state communicating.

3 Several other formulations of the assignment problem can be found in the literature, see for example [Papa 82].
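As an illustration of how such an event trace can be turned into per-unit state intervals, consider the following sketch. The trace format (tuples of timestamp, unit, and event code from Table 3.1) is a plausible assumption for illustration; the dissertation does not specify the actual trace layout.

# Map the begin/end event codes of Table 3.1 to states; the trace format
# (timestamp, unit, event code) is an assumed layout for illustration.
BEGIN = {"MCB": "computing", "WCB": "computing",
         "MSB": "communicating", "WSB": "communicating",
         "MOB": "overhead", "WOB": "overhead"}
END = {"MCE", "WCE", "MSE", "WSE", "MOE", "WOE"}

def state_intervals(trace):
    """trace: list of (time, unit, event) sorted by time.
    Returns a list of (unit, state, begin, end) intervals."""
    open_state = {}          # unit -> (state, begin time)
    intervals = []
    for time, unit, event in trace:
        if event in BEGIN:
            open_state[unit] = (BEGIN[event], time)
        elif event in END:
            state, begin = open_state.pop(unit)
            intervals.append((unit, state, begin, time))
    return intervals

# A tiny hypothetical trace: the master computes, then sends.
trace = [(0, "master", "MCB"), (8, "master", "MCE"),
         (8, "master", "MSB"), (19, "master", "MSE")]
print(state_intervals(trace))
# [('master', 'computing', 0, 8), ('master', 'communicating', 8, 19)]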


3.1 Parameter Based Models

Resource Oriented Parameters

In a resource oriented workload model the parameters are characterized at the resource level, i.e. in terms of architecture dependent processing requests (service demands or execution times of parts of the program or of the whole program). Resource oriented parameters are usually obtained from profiling or measuring the execution of the real program itself (as in [Gohb 91]) or from benchmarking, i.e. profiling or measuring an artificial or synthetic load (see for example [Liu 91]). The following definition for resource oriented parameters can be given.

Definition 13 A resource-oriented parameter is an operational quantity related to the amount of consumed resources.

Resource oriented parameters capture the physical, system dependent, quantitative aspects of the program, i.e. its computation, communication, memory, and I/O demands. Depending on the techniques applied in deriving these parameters from measurements, they can be given either as deterministic or as stochastic values. Resource oriented parameters can either be used in a classification of programs (e.g. distinguishing between computation and communication bound problems) or as input to performance models (e.g. representing the service demands in queuing networks [Mass 93] [Mabb 94] or the transition firing times in transition timed Petri nets [Wang 91] [Ho 93]). Several studies investigating the speedup and scalability of parallel programs are based on resource oriented parameters [Cyph 93] [Gohb 91]. For the assignment problem, resource oriented parameters were obtained from measurements. A detailed instrumentation of the program allows the collection of communication times for each communication in the program, and of computation times for each sequential block of computations. Aggregating these values results in the total communication demands and in the total computation demands; e.g. for 6 processing elements and a matrix of dimension 100 (average density), 51.876 seconds of computation demands (with 28.481 seconds spent in sequential computations) and 45.667 seconds of communication demands were obtained.
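Continuing the event-trace sketch above, the aggregation step reduces the per-interval measurements to totals of the kind quoted in the text; this is a minimal illustration, assuming intervals in the (unit, state, begin, end) form used earlier and purely hypothetical numbers.

from collections import defaultdict

def total_demands(intervals):
    """Aggregate (unit, state, begin, end) intervals into total
    demands per state, summed over all processing units."""
    totals = defaultdict(float)
    for unit, state, begin, end in intervals:
        totals[state] += end - begin
    return dict(totals)

intervals = [("master", "computing", 0, 8),
             ("worker1", "computing", 2, 10),
             ("master", "communicating", 8, 19)]
print(total_demands(intervals))
# {'computing': 16.0, 'communicating': 11.0}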


Structure Oriented Parameters

Definition 14 A structure oriented parameter is a static quantity derived from analysis of the program or its specification, which summarizes information about the structure of a program in a single number.

In contrast to resource oriented parameters, structure oriented parameters do not characterize the resource demands, but structural aspects of the program in a concise way. This means that not all dependencies, communication patterns, or other structural aspects are represented (as they would be, for example, in a graph model); instead, a single number should provide insight into the structure of the program. Typical examples are the (maximum, minimum, or average) degree of parallelism of a program. Structure oriented parameters can be derived either from the program code itself (an example might be the number of nested loops, indicating possible levels of exploitable parallelism) or from a description of the program in terms of behavior oriented models (e.g. the maximum cut in a task graph as a bound on the maximum attainable degree of parallelism). Structural parameters are used in mapping [Cand 92] or scheduling studies [Mass 93], and as parameters for constructing benchmark programs [Bart 92]. Several structure oriented parameters can also be used as input to queuing network models. For example, the average degree of parallelism could be interpreted as the number of customers in a closed QNM representing a parallel architecture [Ghos 90] [Lauw 93]. For the purpose of performance prediction, a pure structural description might be inappropriate, and it is therefore usually combined with a description in terms of resource oriented parameters. But investigating the structural parameters that influence the performance might help in understanding system behavior and can therefore support a performance evaluation study. The work reported in [Cand 93] is a first step in this direction. Considering the example program, a structure oriented parameter is the outdegree of the nodes in the task graph representation (trivially 5 for the 5 worker implementation). Another structural parameter is given by the ratio of the number of tasks in the task


graph and the product of the longest path and width (the maximum cut) of the task graph. This is an index (defined in the range between 0 and 1) for the deviation from an ideal parallel application, where the number of tasks in each step of the program is constant, i.e. there are no sequential phases. For the assignment problem, this index evaluates to 0.48, indicating potential performance bottlenecks due to phases of loss in the degree of parallelism. Note that in this index the actual durations of the tasks are not considered. If the computation demands in the parallel phases are large compared to the computation demands in the sequential phases, the parallel application might still exhibit a good speedup. But considering also the measured resource oriented parameters for the assignment problem, it can be observed that actually 28.481 seconds of the total 51.876 seconds were spent in sequential computations; thus there is a considerable sequential bottleneck.
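A sketch of how this deviation index can be computed from a task graph follows; the small fork-join example graph is hypothetical, not the task graph of the assignment problem, and the width is approximated by the largest level of a longest-path layering of the graph.

def deviation_index(tasks, preds):
    """tasks: list of node ids in topological order; preds: dict node -> predecessors.
    Index = #tasks / (longest path length x width), where width is the
    maximum number of tasks on one level of the graph."""
    level = {}
    for t in tasks:
        level[t] = 1 + max((level[p] for p in preds.get(t, [])), default=0)
    depth = max(level.values())          # longest path, counted in tasks
    width = max(sum(1 for t in tasks if level[t] == l)
                for l in range(1, depth + 1))
    return len(tasks) / (depth * width)

# Hypothetical fork-join graph: master m1 forks two workers, master m2 joins.
tasks = ["m1", "w1", "w2", "m2"]
preds = {"w1": ["m1"], "w2": ["m1"], "m2": ["w1", "w2"]}
print(deviation_index(tasks, preds))     # 4 / (3 * 2) = 0.666...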

Domain Oriented Parameters

The performance of a program may depend not only on the program structure or architectural characteristics, but also on factors inherent to the problem or solution domain. This type of parameter will be called domain oriented.

Definition 15 A domain oriented parameter is a quantity which characterizes a performance relevant feature of the problem domain.

The problem size is the most frequently used domain oriented parameter, but also very program specific parameters are used, like convergence rates in simulated annealing or rollback probabilities in distributed simulation. Domain oriented parameters have similar fields of application as resource oriented parameters. They can be used instead of resource oriented parameters for characterizing demands in an architecture independent way (e.g. specifying the communication demands in terms of the type and size of messages to be transferred [Kats 92] and not in terms of the communication times). In performance modeling, domain oriented parameters are mainly used in scalability studies [Azmy 92] [Jako 93] [Sing 93], but also in load balancing [Nico 89] or in providing assistance in the parallelization of sequential programs [Fahr 93].


The problem size for the assignment problem is given by the dimension of the matrix. In the examples in this chapter, a problem size of 100 (i.e. a 100 x 100 matrix) was chosen4. Although the density of the matrix is another parameter influencing the performance, it is not considered in the examples of this chapter. The matrix was generated using a random number generator only once; this matrix was used in all experiments and measurements. All parameter based models characterize the workload aspects at a high level of abstraction, i.e. the aspects are not specified in detail, but summarized in global values. While resource and domain oriented parameters are mainly used as input to other models (e.g. quantifying the demands in graph based models), structure oriented parameters are obtained from graph models to give rough but concise estimates of the workload, to be used for example in the comparison of different programs.

3.2 Profiles, Signatures and Shapes

Profile

The behavior of a parallel program is typically characterized by a varying degree of parallelism, i.e. the number of "active" units5 is not constant but changes over time. A function characterizing these changes is called a profile. Different types of profiles can be distinguished depending on

- the type of units, e.g. processing elements, threads of control, tasks, algorithms, routines, ...
- the states of the units, e.g. computing, communicating, idling, ...
- the concept of time, e.g. observed execution times or virtual time steps.

A general definition of a profile is given below, where the term "active" is used as a fill-in for an arbitrary state of the units.

4 Unfortunately, this problem size was too small to obtain a speedup, but a larger problem size would have increased the amount of measurement data up to an intractable size.
5 Depending on the degree of granularity and on the viewpoint of the system, a unit may either be a physical resource, i.e. a processing element, or a program or its component (algorithm, routine, statement).


Definition 16 An "activity" profile P(t) of a parallel program using P processors is defined as a step-function

$$P : t \to \mathbb{N}$$

giving the number of "active" units at (time) step t. Time instants t_i (0 <= i <= i_max) denote those points in the execution where the degree of "activity" changes (points of jump discontinuity); i_max denotes the total number of changes. A unit may either be a processing element, a thread of control, or a workload component (specified at any desired level of detail).

To make the definition more precise, the term "active" has to be specified. A unit can either perform computations, or it can communicate, or it can be idle. The computations can be further distinguished into useful computations related to the solution of the problem, and wasteful computations arising from solving the problem in parallel (e.g. redundant or additional computations). Communications, idle times, and wasteful computations are summarized as total overhead. For each of these "states" a profile can be specified. The values of t can either be time instants during the (observed or predicted) execution time, or virtual time steps defining a precedence relation between the different stages in program behavior. In both cases, either the theoretical (architecture independent) maximum number of active units or the actual (architecture dependent) number of active units can be specified. The following types of profiles are typically used in characterizing parallel programs.

Definition 17 A representation of the theoretical maximum number of units (typically the finest granularity at which parallelism can be exploited) being simultaneously active during each step in the program, assuming infinite processing capabilities, is called the degree of parallelism (DOP) or ideal parallelism profile P^inf of a program. This measure is an upper bound for the maximum exploitable parallelism in a program.

Definition 18 A function showing the number of processing elements performing computations as a function of t is called a computation profile P^comp, with 0 <= t <= T(P), where T(P) denotes the total execution time of the parallel program using P processing elements. The set of possible values for a computation profile is between 0 and P.

Figure 3.2: Profiles of the Assignment Problem (computation, communication, and overhead profiles)

Definition 19 A function showing the number of communicating processing elements as a function of t is called a communication profile P^comm, with 0 <= t <= T(P), where T(P) denotes the total execution time of the parallel program using P processing elements. The set of possible values for a communication profile is between 0 and P.

Definition 20 A function showing the number of processing elements performing computations or communications as a function of t is called an execution profile P^exec, with 0 <= t <= T(P), where T(P) denotes the total execution time of the parallel program using P processing elements. The set of possible values for an execution profile is between 0 and P.

Assuming that communications and computations do not overlap, the execution profile is the sum of the computation and the communication profile. Figure 3.2 represents the computation profile, the communication profile, and the overhead profile of a parallel master worker implementation of the assignment problem using five processing elements. Only a part of the total profile is depicted (i.e. only a selected time interval).


Time   P^comp(t)   P^comm(t)   P^exec(t)   P^ovh(t)
 0         1           0           1           4
 8         0           1           1           4
19         1           1           2           3
28         2           1           3           2
37         2           2           4           1
47         3           0           3           2
50         2           1           3           2
61         1           1           2           3
74         0           1           1           4
84         0           0           0           5

Table 3.2: Relative Profile Values for the Third Iteration

In this diagram, the profiles are depicted using stacked bars, i.e. the bars representing the computation profile are always shown starting from the bottom horizontal line. Thus, the corresponding function values can be directly read from the y-axis. The bars of the communication profile are put above the bars of the computation profile (if they exist for this time interval); thus the distance from the top of the bar to the x-axis gives the sum of both profiles, i.e. the execution profile. Finally, the overhead profile is depicted on top of the two other profiles. This way of representation has been chosen to illustrate that the total number of processing elements is always the sum of the three profiles, i.e. a processing element can either compute, communicate, or perform activities judged as overhead. The corresponding timing values are given in detail for all four profiles in Table 3.2, but for brevity only for the third iteration. Note that the time index is based on a relative time scale (i.e. the begin of the third iteration is assumed to be 0). Profiles are typically used to characterize the workload whenever a decision has to be made on the number of processing elements to be assigned to a problem (e.g. mapping or scheduling [Klei 92]), but they are not suitable for performance prediction during program design, where the execution signature should be the output of the study, and not the input. But they can be used to estimate the speedup of parallel programs [Carl 92].


Signature

A different viewpoint on the degree of activity is provided by a signature. The following definitions are based on the definition of the computation profile, denoted as P^comp(t), i.e. the units are the computing processing elements, depicted on a time axis. Similar definitions can be given for other activity profiles.

Definition 21 A computation signature sigma^comp(p) of a parallel program using P processors is defined as a function

$$\sigma^{comp} : p \to \mathbb{R}, \quad p \in \{0, 1, \ldots, P\}$$

of p, the number of computing processors, giving the total time during program execution where p processing elements were computing.

A signature is a (graphical) representation of the "states" of a program in terms of the states of its processors. A signature can be derived from a profile as follows:

$$\sigma^{comp}(p) = \sum_{i=0}^{i_{max}-1} f(P^{comp}(t_i), p)$$

$$f(P^{comp}(t_i), p) = \begin{cases} t_{i+1} - t_i & \text{if } P^{comp}(t_i) = p \\ 0 & \text{else} \end{cases}$$
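A direct transcription of this derivation into code might look as follows; it is a minimal sketch that reproduces the computation signature of the third iteration from the breakpoints of Table 3.2.

def signature(breakpoints, num_pes):
    """breakpoints: list of (t_i, value) pairs of a step-function profile,
    ending with the final time instant (whose value is ignored).
    Returns sigma[p] = total time during which exactly p units were active."""
    sigma = [0] * (num_pes + 1)
    for (t, value), (t_next, _) in zip(breakpoints, breakpoints[1:]):
        sigma[value] += t_next - t
    return sigma

# Computation profile of the third iteration (Table 3.2).
comp_profile = [(0, 1), (8, 0), (19, 1), (28, 2), (37, 2),
                (47, 3), (50, 2), (61, 1), (74, 0), (84, 0)]
print(signature(comp_profile, 5))   # [21, 30, 30, 3, 0, 0], as in Figure 3.3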

According to the definitions of profiles and signatures, an execution signature6 is not equal to the sum of the computation and the communication signature. The computation, communication, execution, and overhead signatures for the example program are shown in Figure 3.3. The sum of the computation and the communication signature is shown in the execution signature diagram (black lines) to illustrate that the execution signature is different from this sum. The derivation from the profiles is shown by example for the computation signature; for all other signatures, only the results are tabulated.

6 Note that in this work, the term "execution signature" is defined differently from [Dowd 93], where an "execution signature" is a function showing the total execution times of a parallel program for varying system size (e.g. as a function of the number of processing elements used in the execution). Here, this type of function is called "execution time function" and will be discussed in Section 3.3.

i    t_i   P^comp(t_i)    f(P^comp(t_i), p) for p = 0 1 2 3 4 5
0      0        1           0  8  0  0  0  0
1      8        0          11  0  0  0  0  0
2     19        1           0  9  0  0  0  0
3     28        2           0  0  9  0  0  0
4     37        2           0  0 10  0  0  0
5     47        3           0  0  0  3  0  0
6     50        2           0  0 11  0  0  0
7     61        1           0 13  0  0  0  0
8     74        0          10  0  0  0  0  0
9     84

sigma^comp(p) = 21 30 30 3 0 0
sigma^comm(p) = 11 63 10 0 0 0
sigma^exec(p) = 0 29 22 23 10 0
sigma^ovh(p) = 0 10 23 22 29 0
sigma^comm(p) + sigma^comp(p) = 32 93 40 3 0 0

Figure 3.3: Signatures of the Assignment Program (computation, communication, execution, and overhead signatures, each plotted over the number of PEs)

Intervals for x    0-25    26-61    62-96    97-100
phi^comp(x)          0        1        2         3

Intervals for x    0-13    14-98    99-100
phi^comm(x)          0        1         2

Intervals for x    0-34    35-60    61-87    87-100
phi^exec(x)          1        2        3         4

Intervals for x    0-21    22-40    41-75    76-100
phi^ovh(x)           1        2        3         4

Table 3.3: Deriving the Computation Shape from the Computation Signature

Signatures can provide insight into program behavior in performance tuning and debugging, but they can also serve as a basis for deriving input parameters for a queuing network model of a multiprocessor system [Leuz 89], or as a characterization of the load in scheduling [Ghos 92] and mapping.

Shape

The information of a profile or a signature can be represented in a cumulative way, called a shape phi(x), where x is the percentage of time during which the metric took a particular value.

Definition 22 A computation shape phi^comp(x) of a parallel program using P processors is defined as a step-function

$$\phi^{comp} : x \to \mathbb{N}, \quad x \in [0..100]$$

giving the distribution of the number of computing processors, i.e. phi^comp(x) = p means that at most p processors were computing during x percent of the total execution time.

Note that this definition is only valid if the parameter on the x-axis of the profile is given on an interval scale (e.g. the parameter t has to be a time index, not a step counter); otherwise it is wrong to calculate percentages. A computation shape can be derived from the aggregated (cumulative) values of the computation signature.

Figure 3.4: Shapes of the Assignment Problem (computation, communication, execution, and overhead shapes)

Let

$$\sigma^{agg\,comp}(p) = \begin{cases} 0 & \text{if } p \le 0 \\ \sum_{i=0}^{p-1} \sigma^{comp}(i) & \text{if } 0 < p \le P \\ T(P) & \text{if } p > P \end{cases}$$

denote the time where less than p processing elements were computing. The computation shape phi^comp(x) is given as

$$\phi^{comp}(x) = p \iff \frac{x}{100}\,T(P) \in [\sigma^{agg\,comp}(p),\ \sigma^{agg\,comp}(p) + \sigma^{comp}(p)]$$
$$\iff \frac{x}{100}\,T(P) \in [\sigma^{agg\,comp}(p),\ \sigma^{agg\,comp}(p+1)]$$
$$\iff \frac{x}{100} \in \left[\frac{\sigma^{agg\,comp}(p)}{T(P)},\ \frac{\sigma^{agg\,comp}(p+1)}{T(P)}\right]$$
$$\iff \left(\frac{x}{100} - \frac{\sigma^{agg\,comp}(p)}{T(P)}\right) \in \left[0,\ \frac{\sigma^{comp}(p)}{T(P)}\right]$$

The shapes for the assignment problem are plotted in Figure 3.4 and tabulated in Table 3.3. In the diagrams, the dashed line is drawn to support readability.
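The cumulative derivation can again be written down directly; the sketch below turns the computation signature from above into the percentage intervals of Table 3.3 (rounded to whole percent, which is an assumption about how the table was produced).

def shape_intervals(sigma):
    """sigma: signature values sigma[p], p = 0..P.
    Returns a list (lo, hi, p): at most p units were active
    during lo..hi percent of the total execution time."""
    total = sum(sigma)
    intervals, cum = [], 0
    for p, value in enumerate(sigma):
        if value > 0:
            lo = round(100 * cum / total)
            cum += value
            intervals.append((lo, round(100 * cum / total), p))
    return intervals

print(shape_intervals([21, 30, 30, 3, 0, 0]))
# [(0, 25, 0), (25, 61, 1), (61, 96, 2), (96, 100, 3)], i.e. the
# interval boundaries printed in Table 3.3 as 0-25, 26-61, 62-96, 97-100.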


While signatures, profiles, and shapes provide a rather detailed view of the program behavior for a given system size P, more concise representations have to be found for comparing the behavior when the system size varies. Those characteristics are discussed in the next section.

3.3 Ratios for Varying System Size

Speedup, Efficiency, Efficacy

One possible motivation for using a parallel system instead of a uniprocessor system for solving a given problem of given size is a reduction in execution time. Unfortunately, the reduction in the time spent for useful computations by doing them in parallel is accompanied by additional overhead, which increases with the number of processing elements being used. An important characterization for parallel programs is therefore the variation in execution time when the number of processing elements is changed.

Execution Time Function

The execution time of a parallel program is given by the total time between start and termination of the program.

Definition 23 The execution time function T(P) of a parallel program when executing on a system with P processing elements is defined as

$$T(P) = T^{use}(P) + T^{ovh}(P)$$

where T^use(P) is the time spent for useful computations7 and T^ovh(P) is the time related to overhead, assuming that there is no overlapping of useful computations and overhead.

Note that T(P) is a discrete function. Lines connecting the corresponding function values should only be interpreted as supporting the readability of the graph and are not part of the function in a graphical representation.

7 Useful computations are directly related to the solution of the problem and would also be necessary in a sequential program.

Figure 3.5: Execution Times and Derived Ratios. Top: task graph model of the hypothetical application (sequential phase, broadcast in log2 P steps, N^2 parallel computations, collect in log2 P steps), annotated with the work and time expressions

$$W(P) = \underbrace{\underbrace{2N}_{W^{seq}} + \underbrace{N^2}_{W^{par}}}_{W(1)} + \underbrace{2(P-1)}_{W^{ovh}}$$

$$T(P) = \underbrace{2N + \left\lceil \frac{N^2}{P} \right\rceil}_{T^{use}(P)} + \underbrace{2 \log_2 P}_{T^{ovh}(P)}$$

Bottom: W(P) and T(P) decomposed into their sequential, parallel, and overhead parts, and the derived ratios S(P), E(P), eta(P), R(P), U(P), and Q(P), plotted for P = 2, 4, ..., 1024.

In most cases, it is infeasible to derive or measure the function T(P) for every value of P, the number of processing elements. Therefore T(P) is only evaluated for selected values of P. Again, lines connecting these "sample points" may not be interpreted as the function T(P); they only support readability. Unfortunately, it is common practice to draw continuous diagrams of T(P) based on only a few sample points, thus resulting in misleading interpretations of such a graph. Whenever T(P) is given only for a specific subset of values for P, this subset must also be clearly identified in a graphical representation. Because of the poor behavior of the assignment problem for the selected problem size, a hypothetical application has been chosen for comparing the different ratios for varying system size. Instead of deriving parameters from measurements, the computation demands are given as a function of the problem size N and of the system size P.

A (task) graph model of this parallel application, which consists of three phases, is shown in Figure 3.5 (top). First, some sequential computing of order N is necessary. After distributing the data (which is assumed to take log2 P steps, where P denotes the number of processing elements), the following N^2 computations can be performed in parallel. The results are again sent back (requiring a communication complexity or overhead of log2 P) and are aggregated to a final result (see the graph model in Figure 3.5, top). The corresponding expression for T(P) is given by distinguishing between useful computations (to be executed sequentially or in parallel) and overhead (see also the lower left diagram in Figure 3.5). The expressions for work and the other diagrams are explained in the following sections.

Speedup and Derived Ratios

Based on the information of T(1), the execution time of a sequential program (or the execution time when using only one processing element), performance ratios can be derived. The most popular measure used in parallel processing is speedup.

Definition 24 The speedup S(P) of a parallel program is defined as

$$S(P) = \frac{T(1)}{T(P)}$$

the ratio of the execution time of a sequential program (or of the run on a single processor) to the execution time on P processors.

There has been a discussion in the literature on whether this definition of speedup is a fair definition. It might be argued that the comparison should be based on the best (known) sequential and the best (known) parallel algorithm for a given P. For a survey of workload models, the definition given above is sufficient, since the objective is not to give a fair comparison between the execution times of two programs, but to characterize the behavior of a given program when the system size is varied.

Definition 25 The efficiency E(P) of a parallel program is defined as

$$E(P) = \frac{S(P)}{P}$$

the speedup divided by the number of processors.


Efficiency is an indicator of how much each processing element contributes on average to the overall speedup (see for example [Eage 89] for a study on speedup and efficiency evaluations in parallel systems) and can also be interpreted as a measure for the average utilization of the processing elements. A value close to one indicates that nearly all work could be split equally among the processing elements; a value close to 0 indicates that either the overhead or the sequential portion of the program is too large for efficiently using all of the P processing elements. An increase in the number of processing elements usually results in an increase in speedup8, but efficiency will decrease. An index comparing the costs with the profits when using more processing elements is efficacy.

Definition 26 The efficacy eta(P) of a parallel program is defined as

$$\eta(P) = S(P) \cdot E(P)$$

the product of speedup and efficiency. The maximum of this function is called the processor working set pws.

Efficacy relates the benefits (profit) of using more processing elements (an increase in speedup) with the loss in performance (costs) due to overhead (a decrease in efficiency). It has been shown by several authors that concise measures like speedup, efficiency, efficacy, or processor working sets can be efficiently applied in scheduling studies. Those parameters are more compact than a representation of the execution signature or the execution profile, but provide in many cases enough information for scheduling decisions [Park 92] [Dowd 93].

8 Assuming that the problem size is sufficiently large. This topic will be studied in detail in the next chapter.
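To make these ratios concrete, the following sketch evaluates them on the execution time function of the hypothetical application of Figure 3.5 (with N = 32 as an assumed problem size) and locates the processor working set; the numbers are illustrative only.

import math

def T(P, N=32):
    # Execution time of the hypothetical application of Figure 3.5:
    # sequential part, parallel part, and broadcast/collect overhead.
    return 2 * N + math.ceil(N**2 / P) + 2 * math.log2(P)

def speedup(P):    return T(1) / T(P)
def efficiency(P): return speedup(P) / P
def efficacy(P):   return speedup(P) * efficiency(P)

Ps = [2**k for k in range(11)]          # P = 1, 2, 4, ..., 1024
pws = max(Ps, key=efficacy)             # processor working set
print(pws, round(speedup(pws), 2), round(efficiency(pws), 2))
# e.g. 16 8.0 0.5 under these assumptions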

So far, it has been assumed that these are timing values. But the execution time function, speedup, and related measures can also be given as complexity functions. For this kind of characterization, several other indices are meaningful.

Definition 27 The costs C(P) of a parallel program are defined as

$$C(P) = T(P) \cdot P$$

the product of execution time and the number of processing elements.

Note that several other definitions of "costs" are possible (e.g. the loss in performance in the previously given definition of efficacy), but in the following the above definition of costs will be used. Following this definition, efficiency can also be interpreted as the ratio of the costs of the sequential program, C(1) = T(1) * 1, to the costs of the parallel program, C(P):

$$E(P) = \frac{C(1)}{C(P)}$$

The amount of work in the sequential program and in the parallel program can be given as a complexity function of the number of operations.

Definition 28 The amount of work in a sequential program W(1) is given by the number of operations in the sequential program. The amount of work in a parallel program W(P) is given by the number of operations in the parallel program. The overhead work in a parallel program W^ovh(P) is given by the difference W(P) - W(1) and characterizes the additional work in the parallel program.

The amount of work for the hypothetical example is shown in the top left diagram in Figure 3.5. The total amount of work is composed of the useful work (part of which can be executed in parallel) and of overhead. In this example, the complexity of useful work can easily be derived from the time complexity (and vice versa) by distinguishing between the sequential and the parallel fraction (see next section). The relation for overhead is not as obvious, but assuming an efficient implementation of a broadcast, it is known (and can easily be shown) that the P - 1 point to point communications can be performed in log2 P time steps. As can be seen in this example, there exists a relation between the amount of operations and the time complexity of a program. Although the relations will not always be as simple as in this example, there are several assumptions which are realistic for most parallel programs.

1. W(1) = T(1), i.e. the amount of work and the time to execute this work are of the same order.


2. T(1)/P <= T(P) <= T(1), i.e. the parallel execution time is at least the sequential execution time divided by the number of processing elements; superlinear speedup is not considered9. The parallel execution time does not exceed the sequential execution time, i.e. a "slowdown" is not considered.

3. T(P) < W(P) <= P * T(P), i.e. some of the operations can be performed in parallel, and the total amount of work may not exceed the complexity of the costs.

4. W(1) <= W(P) due to possible overhead.

The ratio of the number of operations in the parallel program to the number of operations in the sequential program is called the redundancy and gives the overhead due to parallelism in terms of operations. Note that all functions in the following definitions are assumed to be given in a complexity notation instead of exact timing values.

9 Superlinear speedup may arise for example from faster memory access due to a reduced problem size per processor in the parallel application. See for example [Helm 90], where some reasons for superlinear speedup are discussed.

Redundancy and Derived Ratios

Definition 29 The redundancy R(P) of a parallel program is defined as

$$R(P) = W(P)/W(1)$$

the ratio of the total number of operations in a parallel program to the total number of operations in a sequential program.

Based on the definition of redundancy, the utilization function of a parallel program can be defined by relating the total number of operations in the parallel program to the costs P * T(P).

Definition 30 The utilization U(P) of a parallel program is defined as

$$U(P) = \frac{W(P)}{P \cdot T(P)} = R(P) \cdot E(P)$$

the product of redundancy and efficiency.

Metric   Bounds                 Type
S(P)     1 <= S(P) <= P         HB
E(P)     1/P <= E(P) <= 1       AVG
eta(P)   1/P <= eta(P) <= P     HB
R(P)     1 <= R(P) <= P         LB
U(P)     1/P < U(P) <= 1        AVG
Q(P)     1/P < Q(P) <= P        HB

Table 3.4: Summary of Ratios

If a program contains redundancy, then utilization is larger than efficiency, i.e. the processing elements are computing, but on wasteful, redundant work.

Definition 31 The quality of parallelism Q(P) of a parallel program is defined as

$$Q(P) = S(P) \cdot E(P) / R(P) = \eta(P) / R(P)$$

the ratio of efficacy and redundancy.

The analogous measure to efficacy when considering the redundancy is the quality of parallelism. Since efficacy is divided by redundancy, a higher quality implies a lower redundancy for a given efficacy. According to the definitions of the different ratios and based on the assumptions on the functions T(1), T(P), W(1), W(P), bounds and relations between the ratios can be given; the bounds are summarized in Table 3.4. In the second column of the table, the type of the ratio is given, where HB denotes a higher is better metric, LB a lower is better metric, and AVG an average is best metric. Pairwise relations between the metrics follow from the definitions, e.g. E(P) <= eta(P) <= S(P), E(P) <= U(P) <= R(P), and Q(P) <= eta(P), while other pairs (such as S(P) and R(P)) can be higher or lower depending on the program. In the appendix (B.6) these relations are derived in detail.


These relations are also confirmed by the selected example, which fulfills the properties stated above. The corresponding functions for speedup and derived ratios are shown in Figure 3.5 (the two diagrams on the right). The efficiency curve is below the utilization curve, but both curves are close to each other due to low redundancy. The parallel execution time can be reduced using more processing elements up to a certain point; afterwards an increase is observed (there is not enough work to be done in parallel). Speedup is limited due to overhead and the sequential fraction. Since the increase in speedup decreases, the efficacy function and the quality of parallelism first increase, then decrease. The maximum (the processor working set) is located at P = 8.

Sequential and Parallel Fraction

In the previous section, the sequential and parallel total execution times T(1) and T(P) and the total amounts of work in the parallel and in the sequential program, W(P) and W(1), have been defined. In this section the sequential and the parallel fraction (of time or of work) within a parallel program are investigated.

Definition 32 The amount of work characterized by a degree of parallelism of i is denoted by W^par(i). The sequential fraction of work W^seq = W^par(1) of a parallel program is the amount of work that can be executed using only one processing element. The parallel fraction of work W^par = sum_{i=2}^{m} W^par(i) of a parallel program is the amount of work that can be executed using more than one processing element, where m is the maximum DOP in the work. The time to execute W^par(i) using p processing elements is given by

$$t(p, W^{par}(i)) = \frac{W^{par}(i)}{i \cdot \Delta(W^{par}(i))} \left\lceil \frac{i}{p} \right\rceil$$

where Delta(W^par(i)) denotes the speed at which the work can be processed using a single processing element. All PEs are assumed to have identical speed. The sequential time fraction t^seq of a parallel program is the amount of time spent in serial computations (using one processing element). The parallel time fraction t^par(P) of a parallel program is the amount of time spent in parallel computations when using a total of P processing elements.

Figure 3.6: Fixed-Load Speedup (workload W and execution time T for an increasing number of PEs)

Characterizing a program in terms of its sequential and parallel fraction gives a first rough idea of the performance characteristics of a parallel program. These characteristics are used in the analysis of speedup and scalability as well as in scheduling [Nels 90]. In the following, different speedup models are discussed. The approaches differ in their assumptions on the amount of work (fixed or scalable) and in the level of detail at which the parallel execution time T(P) is characterized.

Fixed-Load Speedup (Amdahl)

Assuming that the workload is characterized by W^par(i), the amount of work to be executed at degree of parallelism DOP = i, fixed-load speedup is defined as follows:

$$S_{fix}(P) = \frac{T(1)}{T(P)} = \frac{\sum_{i=1}^{m} t(1, W^{par}(i))}{\sum_{i=1}^{m} t(P, W^{par}(i)) + T^{ovh}(P)} = \frac{\sum_{i=1}^{m} \frac{W^{par}(i)}{\Delta(W^{par}(i))}}{\sum_{i=1}^{m} \frac{W^{par}(i)}{i \cdot \Delta(W^{par}(i))} \left\lceil \frac{i}{P} \right\rceil + T^{ovh}(P)}$$

where m denotes the maximum DOP, and T^ovh(P) summarizes all possible sources of overhead. In this speedup model it is assumed that the system is used to solve a problem of given size, i.e. the amount of work (or the load) is constant (or fixed) when the system size is increased.
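The general fixed-load model can be evaluated numerically once the work at each degree of parallelism is known. The sketch below does this for an assumed DOP distribution, unit speed Delta = 1, and zero overhead; all numbers are illustrative.

import math

def s_fix(P, w_par, t_ovh=0.0, delta=1.0):
    """Fixed-load speedup for work w_par[i] at DOP i (dict), speed delta."""
    t1 = sum(w / delta for w in w_par.values())
    tp = sum(w / (i * delta) * math.ceil(i / P) for i, w in w_par.items())
    return t1 / (tp + t_ovh)

# Assumed workload: 100 operations sequential, 900 at DOP 10.
w_par = {1: 100, 10: 900}
for P in (1, 2, 5, 10, 20):
    print(P, round(s_fix(P, w_par), 2))
# For this two-mode workload the values coincide with Amdahl's law for
# a sequential fraction of 0.1, e.g. s_fix(10) = 5.26.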


Amdahl's law is a special case of this general definition, assuming that the system operates either in sequential or in fully parallel mode (i.e. W^par(i) = 0 if i != 1 and i != n) at a constant speed Delta, and assuming T^ovh(P) = 0. Therefore, T(1) = (W^seq + W^par)/Delta and T(P) = (W^seq + W^par/P)/Delta, and the speedup is given by

$$S(P) = \frac{W^{seq} + W^{par}}{W^{seq} + \frac{W^{par}}{P}}$$

where W^par(1) = W^seq and W^par(n) = W^par for brevity. By normalizing alpha = W^seq/(W^seq + W^par) and (1 - alpha) = W^par/(W^seq + W^par),

$$S(P) = \frac{P}{1 + \alpha(P - 1)}$$

where alpha is called the sequential bottleneck. Amdahl's Law is based on the assumption that the sequential and parallel amounts of work (denoted by W^seq and W^par) are fixed when the number of processors is increased (see Figure 3.6). Execution time will decrease, but since only the parallel work can be solved in a shorter time, there remains a significant sequential bottleneck, which bounds speedup: in the limit of infinitely many processing elements, S(P) approaches 1/alpha.

Scaled Load Speedup

In practice, parallel architectures are not only used to solve a given problem in shorter time, but also to solve larger problems. Assuming that the workload scales with the number of processing elements such that T(1) = T'(P), i.e. the sequential execution time for solving the unscaled problem equals the parallel execution time for solving the scaled problem, a fixed-time (or scaled load) speedup model can be defined. Let m' denote the maximum DOP with respect to the scaled workload W(i)'. Then

$$S_{scaled}(P) = \frac{T'(1)}{T'(P)} = \frac{\sum_{i=1}^{m'} \frac{W^{par}(i)'}{\Delta(W^{par}(i)')}}{\sum_{i=1}^{m'} \frac{W^{par}(i)'}{i \cdot \Delta(W^{par}(i)')} \left\lceil \frac{i}{P} \right\rceil + T^{ovh}(P)} = \frac{\sum_{i=1}^{m'} W^{par}(i)'/\Delta(W^{par}(i)')}{\sum_{i=1}^{m} W^{par}(i)/\Delta(W^{par}(i))}$$

The speedup model of Gustafson [Gust 88] follows the same concept, assuming a constant speed Delta, W^par(i) = 0 if i != 1 and i != n, and T^ovh(P) = 0.

Figure 3.7: Fixed-Time Speedup (workload W and execution time T for an increasing number of PEs)

Under these assumptions,

$$S_{scaled}(P) = \frac{W^{seq} + W^{par'}}{W^{seq} + W^{par}}$$

where W^seq and W^par' denote the scaled workload with W^par' = P * W^par (fixed-time assumption). Therefore

$$S_{scaled}(P) = \frac{W^{seq} + P \cdot W^{par}}{W^{seq} + W^{par}}$$

and by normalization

$$S_{scaled}(P) = P - \alpha(P - 1)$$

In this model the workload scales with the number of processing elements, as represented in Figure 3.7.

Fixed-Memory Speedup

Finally, in many systems the memory capacity increases with the number of processing elements, again enabling the solution of larger problems. Let m* denote the maximum DOP with respect to the scaled workload W(i)*. Then

$$S_{memory}(P) = \frac{T^*(1)}{T^*(P)} = \frac{\sum_{i=1}^{m^*} \frac{W^{par}(i)^*}{\Delta(W^{par}(i)^*)}}{\sum_{i=1}^{m^*} \frac{W^{par}(i)^*}{i \cdot \Delta(W^{par}(i)^*)} \left\lceil \frac{i}{P} \right\rceil + T^{ovh}(P)}$$


Figure 3.8: Comparison of Different Scaling Functions for Speedup (speedup curves for G(P) = 1, G(P) = P, G(P) = ln P, and G(P) = P ln P)

Assuming a constant speed Delta, W^par(i)* = 0 if i != 1 and i != n, and T^ovh(P) = 0, the model simplifies to

$$S_{memory}(P) = \frac{W^{seq} + W^{par^*}}{W^{seq} + \frac{W^{par^*}}{P}} = \frac{W^{seq} + G(P) \cdot W^{par}}{W^{seq} + G(P) \cdot \frac{W^{par}}{P}}$$

where G(P) reflects the increase in workload as the memory increases by a factor of P. Three different cases for G(P) are to be distinguished:

1. If G(P) = 1, then the workload does not increase with system size, and fixed-load speedup is obtained.

2. If G(P) = P, then the workload increases linearly with system size, resulting in a fixed-time speedup model.

3. If G(P) > P, then a higher speedup than with the fixed-time model can be explained.

Figure 3.8 plots different scaled speedup curves for selected functions G(P) (a small numeric sketch follows below). The curves G(P) = 1 and G(P) = P correspond to the fixed-load and to the fixed-time speedup model. The function G(P) = log2 P is a scaled model, but the workload is scaled at a lower rate than in the fixed-time model, thus resulting in a lower speedup curve. The function G(P) = P log2 P is a scaled model where the workload is scaled at a higher rate than in the fixed-time model; consequently the speedup is higher. All these values are plotted assuming a sequential fraction of 1/6. The dotted line in the figure represents the ideal speedup of S(P) = P.
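The following sketch reproduces the qualitative comparison numerically, using the simplified model above with a sequential fraction of 1/6 (i.e. W^seq = 1 and W^par = 5, an arbitrary normalization).

import math

def s_memory(P, G, w_seq=1.0, w_par=5.0):
    """Simplified memory-bounded speedup with scaling function G(P)."""
    return (w_seq + G(P) * w_par) / (w_seq + G(P) * w_par / P)

scalings = {"G=1": lambda P: 1,                 # fixed-load (Amdahl)
            "G=P": lambda P: P,                 # fixed-time (Gustafson)
            "G=log2 P": lambda P: math.log2(P),
            "G=P log2 P": lambda P: P * math.log2(P)}

for name, G in scalings.items():
    print(name, [round(s_memory(P, G), 2) for P in (2, 4, 8, 16)])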


Generalized Speedup and Sizeup

In [Sun 91a] [Sun 91b] new metrics for speedup are presented, relating the parallel and the sequential work (sizeup S_Size) and the parallel and the sequential speed (generalized speedup S_speed) instead of the time fractions. The generalized speedup is defined as the ratio of the speed of the parallel system to the speed of the sequential system and is given as follows:

$$S_{speed} = \frac{\Delta^{par}(P, W')}{\Delta^{seq}(W)} = \frac{W'/T(P, W')}{W/T(1, W)}$$

where Delta^par(P, W') denotes the speed at which work W' is solved on a system of size P, and Delta^seq(W) is the speed of a sequential system when solving work W. Assuming constant work W = W', this definition is identical to the basic speedup definition T(1)/T(P); assuming scaled work W' > W and constant time T(P, W') = T(1, W), a definition of sizeup (or scaled speedup) is obtained:

$$S_{Size} = \frac{W'}{W}$$

which gives the increase in the amount of work W' that can be solved on a parallel system compared to the amount of work W that can be solved on a sequential system in the same time. All the models given above characterize overhead as a single parameter, but the overhead in parallel computations can be attributed to several factors related to the program or the architecture. A variety of models have been proposed giving a more detailed characterization of overhead. Since these models are mainly used in scalability analysis, some of them will be discussed in Chapter 4. Table 3.5 summarizes the different speedup models by giving the name of the model and/or a reference to the literature, and a summary of the input parameters.


Approach                        Parameters
Fixed-Load Speedup              sequential and parallel work (time) for fixed workload size, total overhead
Fixed-Time Speedup              sequential and parallel work (time) for scaled workload size, total overhead
Fixed-Memory Speedup            sequential and parallel work (time) for scaled workload size, total overhead
Amdahl's Law [Amda 67]          sequential and parallel time fraction for fixed workload size
[Gust 88]                       sequential and parallel time fraction for scaled workload size
Generalized Speedup [Sun 91a]   sequential and parallel execution speed, total overhead
Sizeup [Sun 91a]                sequential and parallel work for fixed time, total overhead
[Klei 92]                       sequential and parallel work (time) characterized as stochastic values
[Gele 89]                       sequential and parallel work, overhead due to inhomogeneous processor usage, load imbalance, and communication
[Chri 91]                       sequential and parallel time, communication overhead, delays due to interconnection topology
[Mari 93]                       sequential and parallel work (time), redundant work, blocking delays, communication volume
[Helm 90]                       sequential and parallel instructions (time), fractions of selected operations with different scaling behavior
[Anna 92]                       sequential and parallel instructions (time), locality of code, fraction of I/O, fraction of non-local memory references, remote access penalty, communication distance

Table 3.5: Comparison of Methods for Characterizing the Sequential and Parallel Fraction

[Figure 3.9: Graph Models of the Assignment Problem. (a) Static Process Graph, (b) Communication Graph; each shows a master node connected to worker nodes.]

3.4 Graph Models

In contrast to parameter-oriented models, behavior graph models specify the dependence and communication structure of the workload explicitly. A frequently used representation is a directed or undirected graph, where the nodes represent tasks, processes, or events in the program, and directed arcs or undirected edges represent relations among the nodes. The relations are typically precedence orders if the arcs are directed, or indicate the need for communication if the arcs are undirected. Two other graph models suitable for expressing parallelism are Petri nets and GERT networks, which will also be discussed in this section. They, too, are directed graph models, but are characterized by different types of nodes.

3.4.1 Undirected Graph Models

Definition 33 A static process graph G = (V, E, f, e) consists of a set V of nodes representing the processes in a parallel or distributed program, a set E of undirected arcs representing two-way communication, a function f associating computation costs with the nodes, and a function e associating communication costs with the arcs.

Static process graphs, although first defined for sequential systems [Ston 77], are one of the oldest models for a quantitative representation of parallel programs. This model is suitable for representing systems with coarse granularity, where the processes persist during the main parts of the execution time and have stable communication patterns [Chu 84]. Performance prediction based on this model is restricted, because the model contains no information on how frequently communication may occur: arc weights specify only the total communication demands, node weights only the total computation demands. The model has been applied in mapping studies [Norm 93].
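As a small illustration of how such a graph supports mapping studies, the following sketch evaluates one assignment of processes to processors under a simple cost model (load of the busiest processor plus communication across the cut). All node and edge weights are hypothetical, and this cost function is only one of many used in the mapping literature.

# Hypothetical static process graph: node weights = total computation costs,
# edge weights = total communication costs between process pairs.
comp = {"master": 4.0, "w1": 10.0, "w2": 10.0}
comm = {("master", "w1"): 2.0, ("master", "w2"): 2.0}

def mapping_cost(assign):
    # cost of the busiest processor plus communication over cut edges
    load = {}
    for proc, pe in assign.items():
        load[pe] = load.get(pe, 0.0) + comp[proc]
    cut = sum(c for (a, b), c in comm.items() if assign[a] != assign[b])
    return max(load.values()) + cut

print(mapping_cost({"master": 0, "w1": 0, "w2": 1}))  # 14 + 2 = 16
print(mapping_cost({"master": 0, "w1": 1, "w2": 1}))  # 20 + 4 = 24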


3.4.2 Directed Graph Models

Representing a parallel program is most naturally done by drawing the system as a directed graph. The nodes in the graph model can represent blocks of computation of different granularity. Fine granularity refers to a node containing a single or only a few statements, while the representation of an entire algorithm as a single node is coarse granularity. The nodes may not only be parts of the program, but also events in the program's execution (e.g. the events defined for the assignment problem in Table 3.1). The arcs in the graph can represent dependences among program parts, communication among processes, temporal relations between events, the control flow, or the data flow of the program. Depending on possible restrictions on the graph structure (series-parallel graphs, fork-join structures, meshes, ...) and the semantics of incoming and outgoing links (AND, exclusive or inclusive OR, m out of n), a further distinction is possible. The models also differ in how timing information is associated with the graph. Timing information associated with the nodes usually represents execution time or computation demands; timing associated with the links represents communication time or message transfer demands. Whenever time is associated, hardware characteristics are already included, and the workload model is therefore hardware dependent. If only the corresponding demands are specified (number of instructions, data volume), the workload model is architecture independent. This has the advantage of flexibility, i.e. the same model can be used for the evaluation of different architectures, but it usually increases the complexity of solving the performance model, where the hardware aspects then have to be considered. Finally, the parameters associated with a model may or may not be random variables, resulting in probabilistic (stochastic) or deterministic models. Some of these characteristics are summarized in Table 3.6 in a comparison of the graph models presented in this section. An entry n:n in the arc types column denotes AND semantics (i.e. all n of the n out-going links are taken, and all n inputs of in-going links must be available); an m:n entry denotes that m of the n out-going links are selected and m of n inputs from incoming links must be available, thus representing either exclusive-or semantics (m = 1) or the selection of several alternatives. Any graph model that supports m:n arc types also supports n:n arc types; an entry m:n in the table therefore indicates that the model supports both AND and OR semantics, while n:n denotes a model where only AND semantics are supported.

[Table 3.6: Comparison of Directed Graph Models. Columns: Approach, Granularity, In/Out Arc Types, Node Annotations, Arc Annotations, Restrictions in Structure. Compared approaches: Data Dependency Graphs, Data Flow Graphs, Control Flow Graphs, Task Graphs, Communication Graphs, Event Graphs, Stochastic Task Graphs, Series Parallel Graphs, and GERT networks. Granularities range from fine to coarse; arc types are n:n/n:n or m:n/m:n; annotations range from entries available only in extensions, over empirically observed random variables, to exponential, Coxian phase-type, and general distributions; structural restrictions include serial and parallel composition, restrictions necessary for some solution techniques, and several restrictions for validity.]

Node and arc annotations are classified according to the way of representation (e.g. deterministic values, distributions, ...). The entry "in extensions" means that in the original definition of the model no annotations were defined, but several extensions have been proposed. Restrictions in structure are only mentioned here and will be discussed in more detail together with the different types of models.

Communication Graphs

Definition 34 A communication graph CG = (V, E) is a directed graph, with V a set of nodes representing tasks or processes in a parallel program, and E a set of directed arcs. An arc going from node i to node j represents communication from node i to node j.

The times for executing parts of the program are represented as node weights and can either be obtained from the actual execution of these parts (i.e. they already have to be implemented), be given in terms of distributions (possibly derived from measurements of similar program parts), or be given in terms of system-independent, domain-oriented parameters [Hart 90]. Weights associated with the arcs may represent the data transfer volume [Iane 94] [Hart 90], but also system-dependent, resource-oriented communication times [Gemu 93]. In some approaches, a specification language is proposed instead of a graphical formalism [Andr 87] [Gemu 93]. Communication graphs give only a rough picture of the communication behavior, as they specify only whether there is communication between two processes or not. Neither a sequence among the processes nor the frequency of communication is specified. Therefore communication graphs are mainly used in studies on (static) mapping policies [Dona 92].

Task Graphs

Task graphs are probably the most frequently used model for representing the dependency structures in parallel programs [Bert 89b].


Definition 35 A task graph TG = (V, E) is a directed, acyclic graph, where V is a set of nodes representing the tasks in a parallel program, and E is a set of directed arcs representing the precedence relation among tasks.

Figure 3.10 shows the task graph for the assignment problem (1 master process, 5 workers). The pattern is a regular sequence of alternating phases of sequential and parallel computations; such a structure is called a fork-join structure. In the literature, a variety of task graph models have been proposed; the two major areas of application are scheduling [Gera 93] and performance prediction [Wabn 93] [Ghos 88]. Task graphs may be augmented by assigning weights to the nodes and arcs, representing computation and communication costs. Such a model is used in [Noh 92], where computation time functions are associated with the nodes and data transfer functions are associated with the arcs. Both functions may depend on the number of processors, the problem size, and other program-dependent factors, but the values given are assumed to be deterministic. The assumption of deterministic computation and communication demands may not always allow a representative characterization of the workload. Therefore probabilistic or stochastic models have been proposed, where computation and communication demands may be defined as random variables [Vinc 88]. Several performance evaluation techniques based on such a stochastic graph model are presented in [Hart 93, Hart 92, Sotz 90]. Since the exact solution of these graphs is very costly and even intractable for larger systems, several approximation techniques have been proposed and implemented in a tool called PEPP [Daup 93]. In order to deal with the complex representation of large task graphs, hierarchical structures have been proposed. A composite task graph [Lewi 91] or hierarchical task graph may contain regularly structured sub-task graphs as components, which may be aggregated to a single node at a higher level in the hierarchy. Model parameters are the set of tasks to be executed, the precedence relation between the tasks, a communication matrix showing the amount of data transfer between any two tasks, and a function characterizing the operations needed to execute a task. This function may, for example, be a complexity function depending on the problem size.


[Figure 3.10: Task Graph of the Assignment Problem. Tasks t1 to t42 with AND synchronization at the fork and join points of each iteration, EXOR branches at the termination test, and a dummy node.]


Computation and communication times are derived by combining hardware parameters (execution rate, communication speed) and workload parameters (amount of data, number of operations). Expressions for deriving the execution time and the optimum number of processors are given for a certain time complexity function and a certain subset of sub-task structures (divide and conquer, nearest neighbor mesh). A tool supporting the evaluation of different scheduling strategies based on this model is presented in [Lewi 93]. By restricting the structure of the task graph to certain regular structures, more efficient solution techniques can be given. Frequently used structures are fork-join models, where a single task spawns a set of (sub)tasks performing independent computations in parallel; after a computation phase, the tasks are joined together at a synchronization point. Based on task graph models following this restricted structure, bounds on speedup are derived in [Agra 92]. Performance indices relevant in scheduling (job response time, throughput) are derived in [Maju 88] and [Sree 92]. Execution time predictions and speedup models for task graphs with regular structures are presented in [Mada 91] and in [Mena 92]. Another structure, frequently found in linear algebra problems, are triangular task graphs. Such graphs have a synchronization structure similar to fork-join models, but the number of tasks spawned either decreases or increases during the iteration, resulting in a triangular shape. Examples of analyzing this class of task graphs can be found in [Mak 90] and in [Staf 93]. As the construction of a task graph is rather time consuming, it would be desirable to have tools automating this process, i.e. generating a task graph automatically from a program or its specification. One approach is reported in [Cosn 94], where a task graph is generated from an annotated sequential program. Another solution to overcome the analysis complexity of task graphs is proposed in [Wabn 94b]: the task graph is used only to model the program, and this model, together with an architecture and a mapping model, is automatically translated into a Petri net, which is simulated to obtain performance results. In the approach presented in [Fers 95a], a tool is proposed for the (graphical) construction of scalable task graphs. Here the task graph is defined for a small instance of the problem, and several predefined and user-defined functions for specifying the scaling behavior of the graph are provided (including the scaling of the number of nodes and the number and pattern of communication). A task graph for a larger problem size can then be generated automatically from the definition of its small instance and of its scaling behavior.
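As a minimal structural sketch, the following Python fragment encodes one iteration of the assignment problem's fork-join task graph (one master task, five workers, one join task) with hypothetical deterministic node weights, and derives the makespan on unlimited processors by a longest-path computation; communication costs are ignored here.

# Tasks of one fork-join iteration: weights are hypothetical execution times.
tasks = {"m1": 2.0, "w1": 5.0, "w2": 5.0, "w3": 5.0, "w4": 5.0, "w5": 5.0, "m2": 1.0}
preds = {"m1": [], "m2": ["w1", "w2", "w3", "w4", "w5"],
         "w1": ["m1"], "w2": ["m1"], "w3": ["m1"], "w4": ["m1"], "w5": ["m1"]}

finish = {}
def finish_time(t):
    # earliest finish = max over predecessors (AND join semantics) + own weight
    if t not in finish:
        finish[t] = max((finish_time(p) for p in preds[t]), default=0.0) + tasks[t]
    return finish[t]

print(max(finish_time(t) for t in tasks))  # 8.0 = 2 + 5 + 1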

Event Graphs

Definition 36 An event graph EG = (V, E) is a directed, acyclic graph, where V is a set of nodes denoting events in (the execution of) a parallel program, and E is a set of directed arcs representing temporal relationships among events.

Usually, all events associated with a certain process (or processor) are shown along a line, similar to a Gantt chart. Event graphs are derived from an analysis of the execution trace of a parallel program obtained from monitoring [Hick 92] [Yang 88]. Event graphs are mainly used in debugging and in performance tuning, but also in mapping studies [Lo 91] [Ibar 90]. In Figure 3.11, a sample of the event graph for the assignment problem is shown, covering the same time interval as the figures in Section 3.2. Each line represents a process; the occurrences of events are depicted as marks on each line along the horizontal time axis. Dependencies among events are depicted by arrows from one process line to another. In the top diagram, communication events are represented; in the diagram at the bottom, the durations of the computation blocks are shown. The dark shaded boxes represent computations at the master, grey boxes represent computations at the workers.

Control Flow Graphs

Definition 37 A control flow graph CFG = (V, E) is a directed graph where the nodes represent computations and the arcs represent the order in the flow of computations.

The control flow graph depicts the control structure of the program. In contrast to task graphs, control flow graphs do not necessarily have to be acyclic. Node types include decision nodes, fork nodes, join nodes, operation nodes, and start and end nodes.

[Figure 3.11: Event Graphs of the Assignment Problem.]

[Figure 3.12: Control Flow Graph of the Assignment Problem (2-worker implementation). Nodes t1 to t63 with AND, EXOR, and dummy nodes; legend: EEW = worker terminated, EEM = master terminated.]

To each node, a cost function is associated, characterizing the time elapsed when executing the particular operation(s). Flow and cost analysis can be applied to derive the total execution time of the program based on the timings of its components [Qin 93]. Another technique, based on a geometric concurrency model, is applied in [Abra 87]. Restricting the execution times associated with the nodes to exponentially distributed random variables, and neglecting communication times, analytical performance evaluation techniques based on a hierarchical aggregation of segments can be applied [Kape 92]. Because of its graphical complexity (large number of nodes and arcs), the control flow graph in Figure 3.12 is shown only for a 2-worker implementation.

Data Flow Graphs

Definition 38 A data flow graph DFG = (V, E) is a directed graph where the nodes represent computations and the arcs represent the flow of data.

A data flow graph representing the parallel program may either be specified directly or derived from the code of an existing program. To obtain performance results, it is necessary to specify the computation and communication demands; the execution can then be simulated. Tools supporting this specification and simulation process are available [Luqu 92]. If the computation and communication demands are given as exponentially distributed random variables, analysis based on Markov chains is possible [Hous 90].

Petri Nets

When the parallel program is modeled in terms of (stochastic) Petri nets, this model serves on the one hand as the workload model, but may on the other hand already be the performance model (see the discussion in Section 2.2.2). With respect to expressive power, Petri nets extended with a timing concept are the most powerful formalism of all models presented here. All important structural aspects of a parallel program (parallelism, synchronization, communication, conditional branches) can be expressed, and the association of a timing concept supports performance analysis.


[Figure 3.13: Petri net Model of the Assignment Problem.]

A detailed modeling example of a parallel algorithm is given in [Mahg 92], where a colored stochastic Petri net is used for modeling the behavior of a parallel algorithm. Within the Petri net formalism, the parallel architecture can be described as well. In [Fers 91, Fers 92] a method is proposed combining both models, a program model and an architecture model given as a timed Petri net, in a unique model using PRM-nets [Fers 90b, Fers 90a]. The major problem in the use of Petri nets is the complexity of analysis due to the exploding number of states. The evaluation of large systems using analytic techniques (Markov analysis) is (nearly) impossible. Although several attempts have been made to make these models analytically tractable (see for example [Gran 92]), simulation will often be the only possibility to analyze them. A Petri net model of the assignment problem is given in Figure 3.13. This is a simplified model, where only the basic structure of a single iteration is modeled. First, the master performs some computations (the first timed transition, represented by a hollow bar). Then the five workers are spawned (communication is not modeled, therefore spawning is represented by an immediate transition drawn as a thin bar), and they perform their computations (five timed transitions). All results are sent back to the master, which can then continue with its computations. A loop back indicates that this sequence is repeated several times.

[Figure 3.14: GERT Network for the Assignment Problem. Legend: in-going node semantics AND, OR, XOR; out-going semantics deterministic or stochastic.]

GERT Networks

The technique of GERT networks, frequently used in project planning, may also be applied in the domain of parallel processing.

Definition 39 A GERT network G = (V, E, T) is a directed, acyclic graph, where V is a set of nodes (representing the tasks in a project) and E is a set of directed arcs connecting two nodes if there exists some precedence relation. T is a set of timing functions associating a transition time (a random variable with given distribution) with each task. Nodes can have AND, OR, or XOR in-going semantics, and deterministic or stochastic out-going semantics.

Applied to the model of a parallel program, the nodes represent the parts of the program and the arcs represent the dependences between the parts. The timing functions correspond to the parts' execution times. Note that, similar to Petri net modeling, the workload model can be used directly as the performance model. A network where all nodes are of AND/AND or XOR/XOR type belongs to the class of series-parallel graphs, for which exact analytic solution techniques are feasible. All other networks have to be solved using simulation or approximation techniques. Simulation languages like SLAM are available for a convenient specification of GERT networks.


A comparison between the modeling power of Petri nets and GERT networks (modeled in SLAM) can be found in [Taqi 92], but the comparison is restricted to structural aspects; no timing aspects are considered. Examples of performance models using GERT networks as workload and performance models can be found in [Sinz 93] and in [Cuba 91]. The GERT network of the assignment problem is similar in structure to the Petri net (compare Figures 3.13 and 3.14).

Chapter 4

Scalability Analysis

It would be nice to believe that an increase in system size (e.g. using more processing power) results in a proportional increase in performance (e.g. a reduced execution time, a higher speedup, ...). It has already been shown by Amdahl that the expected improvement in speedup is limited by the reciprocal value of the sequential fraction of time in the program (and by other sources of overhead) if the amount of work is assumed to be fixed (see also Section 3.3). But what are the effects on performance if both work and system size are scaled? Finding an answer to this question is the objective of scalability analysis, which is discussed in this chapter. First, the concept of scalability is introduced (Section 4.1) and a general definition of scalability is given. Depending on the performance metrics used to evaluate scalability, different scalability models can be defined. Fundamental to all these models is a characterization of the workload in terms of the amount of useful work and the amount of overhead. Based on the methodology and framework presented in Chapter 2, a scalability modeling approach is developed. To characterize the structure and the parallelism of the application under study, a task graph model is used. The corresponding computation demands (distinguished into useful and redundant or wasteful work) are given as functions of the scalable problem and system parameters; communication demands are specified in the same way. To evaluate this model, two approaches are investigated. First, the use of complexity functions is demonstrated on selected examples (see Section 4.2.2). These functions characterize either the total amount of work (useful and overhead) in a work-oriented analysis, or


the total execution time (useful and overhead) in a time-oriented analysis. The expressions for work or time, respectively, are derived from behavioral characteristics (the dependency structure and the degree of parallelism characterized by the task graph) and from demands specified as domain-oriented parameters. The second technique is based on simulations, where the behavioral characteristics are represented in an executable simulation model and the demands are specified using deterministic and stochastic estimates (see Section 4.2.2).

4.1 Scalability Concept

4.1.1 Problem Definition

Scalability characterizes the ability of a system to maintain a certain level of performance when the amount of work and the processing power are increased. The term system here denotes the combination of the architecture and the program. Either a specific pair of program and architecture can be investigated, or the behavior of a program on an idealized architecture is studied (program scalability), or the scalability of the architecture for a certain class of programs is investigated (architecture scalability). Program scalability may be used to design and compare parallel programs from an algorithmic point of view, i.e. to find the best parallel algorithm under the assumption that the number of processing elements is unlimited. The maximum attainable speedup on an idealized architecture is investigated, for example, in [Nuss 91]. Comparing this maximum speedup to the speedup achieved on a real architecture leads to the analysis of architecture scalability. In this work, the scalability of program-architecture pairs is investigated, which is the most frequent purpose of scalability studies. Scaling the program means increasing the amount of work that the system has to process. For the time being, this amount of work will be denoted by W without any further specification; a more detailed discussion on how to characterize W will follow. To represent the scaling of the architecture, a general parameter A is introduced, which may represent any scalable architecture-related characteristic. The only restriction is that this characteristic can be mapped onto a quantitative parameter. A could, for example, be the number of processing elements if a variation of the system size is investigated, or the network diameter if different topologies are investigated. Based on these


two quantities A and W, a general definition of scalability is given below.

Definition 40 A parallel program is said to be scalable on a parallel architecture if

    ∀A, A', W ∃W' : A' = A·(1 + Δ_A/A) ∧ W' = W·(1 + Δ_W/W) ∧ Performance(A, W) = Performance(A', W')

where A, A' denote the original and the scaled architecture, W, W' denote the original and the scaled amount of work, and "Performance" is the performance metric selected for comparison.

Based on this definition, two problems can be formulated:

Problem 1: What are the effects (changes) in performance if the parameters are scaled? Given A, A', W, W', and Performance(A, W), the ratio Performance(A', W')/Performance(A, W) is sought.

Problem 2: What are the necessary changes in the parameters if performance is to be kept constant? Given A, A', W, and Performance(A, W), the scaled work W' with Performance(A', W')/Performance(A, W) = 1 is sought.

Problem 1 is a typical question in sensitivity analysis. In scalability analysis, problems of the second type are to be solved, i.e. a function giving W' for any combination of A, A', and W is to be found. In the following, the questions related to the investigation of Problem 2 are discussed. First, a scalability index is proposed, supporting an analytic and graphical evaluation of scalability. The scalable architecture and work parameters are investigated and possibilities for their characterization are discussed. Different sources of overhead are identified. Finally, the three major techniques for performance evaluation (measurements, analytical models, and simulation) are studied with respect to their suitability for scalability evaluation.
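Problem 2 can be treated numerically as a root-finding task whenever a performance model is available in executable form. The sketch below assumes a hypothetical efficiency model (one unit of sequential work, W units of perfectly parallel work, and a logarithmic communication overhead) and finds W' by bisection; the model and its constants are illustrative only, not part of the methodology itself.

import math

def efficiency(A, W):
    # hypothetical model: T(1) = 1 + W, T(A) = 1 + W/A + 0.05*log2(A)
    return (1.0 + W) / (A * (1.0 + W / A + 0.05 * math.log2(A)))

def scaled_work(A0, W0, A1, hi=1e12):
    # bisection for W' with efficiency(A1, W') = efficiency(A0, W0);
    # efficiency is monotonically increasing in W for this model
    target, lo = efficiency(A0, W0), 0.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if efficiency(A1, mid) < target else (lo, mid)
    return hi

print(round(scaled_work(2, 100.0, 4), 1))  # work required on 4 PEs for equal efficiency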

4.1.2 Scalability Index

Figure 4.1 shows five different methods for representing the increase in work necessary to maintain a certain level of performance when scaling the architecture. (In this section, the term performance is not further specified; it may be any performance metric relevant in the analysis of a parallel application, e.g. speedup or efficiency. In the next section, two particular metrics that have proven useful in scalability analysis are presented.)

[Figure 4.1: Different Representations for Scalability. (a) work scaling function, (b) absolute work scaling rate, (c) relative work scaling rate, (d) ratio k_A/k_W (absolute), (e) ratio k_A/k_W (relative); each is plotted over system size for linear (left column) and exponential (right column) architecture scaling, for the work scaling functions W = log_2(A), W = cA, W = A log_2 A, W = A^c, and W = c^A.]


Obviously, representations for scalability depend on the rate at which the architecture is scaled. Note that the architecture scaling can be determined by the particular system under study; e.g. hardware or software limitations may allow only a scaling with a certain step size Δ_A (see for example case study 3 in Section 4.3, where only exponential scaling is possible). Whenever the step size for architecture scaling is not determined but can be chosen by the analyst, it is important to note that the scaling behavior will look different for different architecture scaling rates. When comparing different systems, the same architecture scaling must be used, otherwise the interpretation will be misleading. In this work, two different ways to scale up the architecture are investigated. A linear scaling of the architecture (Δ_A = 1) corresponds to a small, constant increase in system size, where the increase is independent of the current size of the system (see the five diagrams on the left in Figure 4.1). In an exponential scaling (Δ_A = A), the architecture size is doubled, so the increase depends on the current size of the reference system (five diagrams on the right). The effects of these two architecture scalings on the slope of the scaling functions are discussed below. In order to define three different scaling functions, we first introduce the following terminology:

Definition 41 Let A_0, A_1, A_2, ... denote the sequence of scaled architecture sizes, with A_0 being the original size and

    A_i = A_{i-1} + 1   for linear scaling
    A_i = A_{i-1} · 2   for exponential scaling

for all i > 0.

As can be seen from this definition, linear scaling results in an arithmetic sequence of architecture sizes, while exponential scaling results in a geometric sequence. Next, the scaled amount of work, or the work scaling function, can be defined as follows:

Definition 42 The work scaling function W_scaled = (W_1, W_2, ...) is defined as the series of work sizes W_i for which the following relation holds:

    Performance(W_0, A_0) = Performance(W_i, A_i)   ∀i > 0


where (W_0, A_0) are the work and architecture size of the reference system and (W_i, A_i) are the scaled sizes.

In the top two diagrams (a), the values of the work scaling function (W_1, W_2, ...) are plotted as a function of the scaled architecture size A_i, with A_0 = 2. These diagrams give an illustrative impression of the order of increase in work depending on the increase in system size. A flat curve indicates good scalability; the steeper the curve, the higher the necessary increase in work (poor scalability). This type of representation is suitable for comparing different applications with respect to their scaling behavior. Instead of representing the work scaling function (i.e. the amount of work), the increase in work can be depicted. This increase can be computed either with respect to the reference size W_0 (absolute increase) or relative to the previous amount of work.

Definition 43 The absolute work scaling rate k_W^0(i) is defined as

    k_W^0(i) = W_i / W_0   ∀i > 0

The relative work scaling rate k_W(i) is defined as

    k_W(i) = W_i / W_{i-1}   ∀i > 0

The two diagrams in Figure 4.1 (b) depict the absolute work scaling rates, while the diagrams in (c) show the relative scaling rates. By definition, the absolute work scaling rates are a linear transformation (multiplication by a constant) of the work scaling function; the only advantage is that the effect of the original workload size W_0 is eliminated. Again, steeper functions and higher function values represent poor scalability. The relative scaling rates represent the incremental change in the amount of work, i.e. the change from one architecture size to the next. A constant function (horizontal line) represents a constant incremental change. For example, the horizontal line (at height 2) in diagram (c), linear scaling, means that the amount of work is doubled when increasing the architecture size. Scalability is poor in this case, as the work sizes form a geometric sequence while the architecture sizes form an arithmetic sequence (linear scaling). For exponential scaling, the horizontal lines at heights two and four represent good scalability, as both the architecture and the workload sizes are geometric sequences. In fact, the lower line at height two indicates ideal scaling behavior, as architecture and workload are scaled at the same rate.


It can be seen from these examples that the scaling behavior can only be interpreted correctly with respect to the architecture scaling rate, which can be defined analogously to the work scaling rate: k_A^0(i) = A_i/A_0 denotes the absolute scaling rate and k_A(i) = A_i/A_{i-1} denotes the relative scaling rate. Based on the ratio of the architecture and work scaling rates, a scaling index is proposed.

Definition 44 The absolute work scaling index WSI^0(i) is defined as

    WSI^0(i) = k_A^0(i) / k_W^0(i)   ∀i > 0

The relative work scaling index WSI(i) is defined as

    WSI(i) = k_A(i) / k_W(i)   ∀i > 0

Wi = O(Ai) () kA (i) = kW (i) 8i Work and architecture are scaled at the same rate, the application architecture pair is linear scalable. The ratio of kA =kW is constant for both, linear and exponential scaling of the architecture. The amount of work is of lower order than the architecture size:

Wi < O(Ai) () kA (i) > kW (i)8i Alternatively, also kW =kA could be dened. But by putting kA in the denominator, the ratios for poor scalability (kW > kA) approach to innity, and the visualization of these curves is more dicult. By putting kW in the denominator, 0 is an asymptotic lower bound for poor scalability. 2

CHAPTER 4. SCALABILITY ANALYSIS

107

The necessary increase in work is smaller than the given increase in architecture size (in the diagrams in Figure 4.1 W = O(log2 A)). This type of function will be called superunitary scalability. For linear architecture scaling, the ratio approaches asymptotically to one (e.g. to linear scalability), for exponential increase in architecture size, the ratio increases and approaches asymptotically to two. The amount of work is of higher order than the architecture size. Three subclasses are shown in the diagrams:

Wi = O(Ai log2 Ai) =) kA (i) < kW (i) 8i The ratio of kA =kW is smaller than one, but approaches asymptotically to one with (linear or exponential) increase in architecture size.

Wi = O((Ai)c ) =) kA (i) < kW (i) 8i The scaling ratio is smaller than one, but approaches asymptotically to one for a linear scaling of the architecture and is constant for an exponential scaling (in the diagram c = 2).

Wi = O(c(Ai)) =) kA (i) < kW (i) 8i With increasing architecture size, the rate kA =kW decreases and approaches asymptotically to zero in exponential scaling. In practice, this means, that the performance will drop despite of increased work when surpassing a certain system size (in the diagram c = 2).

4.1.3 Characterizing the Architecture Size A When talking about architecture size, usually the number of processing elements are considered as a scalable factor. Other parameters, which do have inuence on performance are assumed to be constant. These parameters include the processing capacities (clock rate, cpu speed), the communication capacities (network bandwidth, transfer speed, transfer latencies), and memory organization (capacity, hierarchies, and access speed). It will be a topic for future work to investigate scalability with respect to changes in other parameters (or even with respect to changes in more than one parameter). In this

CHAPTER 4. SCALABILITY ANALYSIS

108

... ...

[Figure 4.2: Examples of Typical Program Structures. (a) Fork-Join Model, (b) Iterative Pattern, (c) Divide&Conquer Structure.]

In this thesis, only the influence of the number of processing elements P will be considered. Note that the performance metric chosen in a particular scalability analysis study may and will also depend on other architecture parameters, but these parameters are not varied in the study.

CHAPTER 4. SCALABILITY ANALYSIS

109

of useful work in the sequential application W use (1) is equal to the amount of useful work in the parallel application W use (P ), while the time to process W typically decreases when using more processing elements due to exploitable parallelism in W . The amount of work can be graphically depicted as a parallelism signature  1(p) giving the amount of work as a function of the degree of parallelism p assuming that an innite number of processing elements is available for execution. This function will depend mainly on the problem size, which is one parameter, inuencing the amount of work. Other parameters are the type of input (e.g. density of matrices) or the computation accuracy (see example 4 in Section 4.3). The degree of parallelism and the dependency structure can also be represented as a task graph (or as a component graph, depending on the level of characterization). Figure 4.2 depicts three task graphs with distinct patterns of computation and communication. All three examples have a layered structure, i.e. the nodes in the graph can be distinguished into K classes such that there are only arcs that point from nodes in class i to nodes in class i + 1 and there are no arcs connecting nodes within a class. The classes can be interpreted as \phases" of computation, the arcs connecting one phase with the next represent the communication phase. The rst example (a) is a typical fork and join application, characterized by varying phases of parallelism and phases of sequential computations (global synchronization points). The second example (b) is a similar pattern of several parallel phases, but without global synchronization. The communication patterns between two dierent phases of parallel computations may include a direct or shifted 1:1 communication (rst and second communication in the depicted example) or several broadcasts (last communication). The third example (c) is a divide and conquer pattern, which consists of two tree shaped parts. In the rst part (upper tree), the degree of parallelism increases, until the maximum DOP is reached. In the second part, the DOP decreases until nally a single node remains. For these three patterns, a set of parameters is identied (see Table 4.1). These parameters dene on the one hand the particular structure of the task graph (the number of parallel phases and the number of tasks in each phase), and on the other hand the computation and communication demands associated to the nodes and arcs. The inuence of these parameters on the total amount of useful work and overhead is discussed in the next section, when proposing a scalability analysis approach. The quanti-

CHAPTER 4. SCALABILITY ANALYSIS

K DOPi ci wuse (i j ) wredundant (i j ) wcomm (i j ) tuse (i j ) tredundant (i j ) tcomm (i) k DOPi ci k DOPi h d DOPi ci

110

Parameters for all Patterns number of phases degree of parallelism in phase i, i = 1 : : : K communication complexity from phase i to i + 1, i = 1 : : : K ; 1 useful work per node j in phase i redundant work per node j in phase i communication volume per link j from phase i to i + 1 time spent for useful work per node j in phase i time spent for redundant work per node j in phase i time spent for communication from phase i to i + 1 Parameters for Pattern (a) number of parallel phases, thus K = 2k + 1 = DOPmax if i is odd = 1 if i is even broadcast to DOPmax PEs if i odd, i = 1 3 : : : K ; 2 collect from DOPmax PEs if i even, i = 2 4 : : : K ; 1 Parameters for Pattern (b) number of parallel phases, thus K = k + 2 = DOPmax for all i Parameters for Pattern (c) height (= number of levels in the tree minus 1) of half tree, thus K = 2h + 1 branching degree (= aggregation degree) di;1 if i  h + 1 (increases with i) dK;i if i > h + 1 (decreases with i) DOPi broadcasts to d if i  h + 1 DOPi+1 collects from d if i > h + 1

Table 4.1: Parameters Characterizing Typical Program Structures

CHAPTER 4. SCALABILITY ANALYSIS

111

cation of the parameters will be discussed when using an analytical model (Section 4.2.2), a simulation model (Section 4.2.2), and measurements (Section 4.2.2) to evaluate scalability.

4.1.5 Sources of Overhead Any portion of the execution time in a parallel program, which is not spent for useful computations for solving the problem and which would not be necessary in the equivalent sequential application, is summarized under the term overhead. There are several sources of overhead, including program dependent overhead (e.g. additional, redundant computations, load imbalance, idle times caused by the synchronization structure of the program) and architecture dependent overhead (e.g. communication latencies, contention for shared resources, CPU cycles for preprocessing messages to be sent). Ideally, the total amount of useful work W use should be the same as the costs (P T (P )) of a parallel program. Thus, each of the P processing elements would be busy during the total execution time T (P ) in performing useful computations. In practice there is a dierence between useful computations and costs P T (P ) ; W use > 0 due to overhead. This overhead summarizes the total time during the execution of a parallel program, which was not directly spent for solving the problem. It includes idle times (e.g. due to synchronization or due to load imbalance) or time spent in computing \wasteful" work, i.e. all computations that do not directly contribute to the solution of the actual problem but which are necessary for solving the problem in parallel, e.g. redundant (duplicated) computations to avoid communication or pre- and post-computations for distributing data or aggregating results. Overhead can arise due to ineciencies in the program (poor data distribution resulting in an imbalanced load), can be introduced by architecture related factors (e.g. communication delays). Both, sources for program and for architecture related overhead are discussed in the following, distinguishing between overhead due to idle times and overhead due to additional computations.

Program Related Overhead 1. Idle Times To characterize overhead due to idle PEs either time can be specied or the amount of

CHAPTER 4. SCALABILITY ANALYSIS

112

useful work, that could have been processed during the phases of overhead. PEs can be idle for three dierent reasons. 1.1. Load imbalance Distributing the amount of work evenly among the processing elements is one of the most challenging problems in developing parallel applications. In an unbalanced application, some components may nish their work earlier than others, because less work was assigned to them. When monitoring the parallel program to obtain the idle times due to load imbalance, it is dicult or even impossible to eliminate the other sources of architecture related overhead. An accurate estimate for program related load imbalance can only be derived from simulations. To consider load imbalance in analytical models, a load imbalance factor (either a deterministic value or a random variable) can be introduced, which approximates the eects of load imbalance by increasing or decreasing the computation demands of the components Pete 93]. 1.2. Insucient Degree of Parallelism The potential degree of parallelism changes in most parallel applications. Therefore, there will be phases in the program execution, where the DOP is smaller than the available number of processing elements (as an extremum, consider sequential phases, where only a single PE can be kept busy). The larger the phases of small DOP, the higher the contribution to total overhead due to idle PEs3 . 1.3. Synchronization/Communication The synchronization or communication structure of the program may introduce idle times, if one or more PE have to wait for messages from other PEs or for the completion of other PEs. Overhead increases with a higher frequency of synchronization or communication points and a larger number of involved components. Note, that in multiprogrammed environments scheduling strategies have been proposed, which try to \ll" these idle times with tasks from other programs, thus minimizing the total idle times of the system. But these aspects do not have to be considered in this work for two reasons: First, only a single program architecture pair is considered, there are no other programs to ll the gaps. Second, such policies will reduce the total idle times of the system, but from the viewpoint of the analysis of a single application, no improvement is achieved with respect to idle times. 3

CHAPTER 4. SCALABILITY ANALYSIS

113

2. Additional Work The amount of additional work can be derived from analysis of the program code, timing values can be obtained from proling the program's execution. Load imbalance and Synchronization/Communication may cause additional work. The third source of overhead are redundant computations. 2.1. Load Imbalance In dynamic load balancing strategies, the assignment of work to processing elements can be changed if load imbalance is detected. This reassignment typically introduces additional work (computations and communications). Therefore tradeo between the gain in using the processing power of idle processing elements and the overhead introduced by the additional work for redistribution has to be found. 2.2. Synchronization/Communication The necessity to synchronize threads of control and to communicate messages also introduces additional work. It might be necessary to preprocess (e.g. pack) or postprocess (e.g. unpack) data before sending them. 2.3. Redundant work Sometimes it is advisable to process work redundantly instead of processing it only once and then distributing the results. This redundant work is regarded as overhead, but it is important to consider only the additional work as overhead. E.g. if N useful operations are to be computed P times (on the same data on each processing element), then N contributes to the amount of useful work and (P ; 1)N is the redundant work regarded as overhead.

Architecture Overhead Architecture overhead summarizes all overhead introduced by the particular architecture under study. This overhead includes communication delays, contention for shared resources, inhomogeneous processing speed, or increased execution time due to background load. Note, that all these factors should not be included in the description of the load, but the actual overhead is a result of the performance evaluation. In measurements, the diculty is to isolate the eects of architecture related overhead and program related overhead. Simulations

CHAPTER 4. SCALABILITY ANALYSIS

114

allow a more accurate estimate of architecture overhead, because it is possible to compare the results assuming an ideal architecture with the results for a particular given architecture. In modeling, parameters inuencing the architecture overhead are to be specied separately from the specication of work in order to keep the model modular and exible.

4.1.6 Evaluation Techniques for Scalability Analysis To nd the appropriate scaling rate or increase in W , i.e. to evaluate the scalability of a system, either empirical analysis, or analytic and simulation models can be applied.

Empirical Studies In empirical studies, the program and the architecture must be available for measurements. Although the objective is to nd values for W 0 (see Problem 2), only the performance can be obtained from measurements. Therefore, dierent combinations of varying system size and work size are necessary to empirically evaluate the scaling behavior. The number of experiments can become rather large, if several parameter combinations for A A0 W , and W 0 are to be investigated.

Analytical Models In analytical models, an expression for the selected performance metric given as a closed form depending on the scalable parameters is to be solved with respect to W 0. The major diculty is not to solve this equation, but to represent the complex interactions between performance and program and architecture parameters in a closed form. Simplifying assumptions are necessary to come up with analytically tractable models. Although these models may not lead to accurate predictions of the scalability, they could be used in best or worst case scenarios to establish bounds on scalability or to provide rough estimates. These bounds and estimates could be used in early stages of program development to compare the suitability of dierent parallel algorithms for a particular parallel architecture before actually implementing them.

Simulation Models When neither analytical models (due to unsatisfying accuracy) nor empirical studies (due to unavailability of the program and/or architecture) are possible, simulation

CHAPTER 4. SCALABILITY ANALYSIS

115

techniques can be applied. The interaction between the workload (program) and the system has to be captured in the simulation model, which is parametrized with the particular characteristics of the load and the architecture. To support scalability analysis, the simulation technique should be exible, i.e. it should be easy to modify these parameters to investigate dierent combinations of work and architecture size. In this work a characterization of work is proposed, which can be rened to be used in both, analytical as well as simulation models. An analytic evaluation method based on a complexity analysis of work and time is discussed. To be able to represent more complex dependencies (in particular to represent load imbalance) simulation techniques will be used. Finally, measurements are investigated briey.

4.2 Visual Approach for Scalability Analysis To apply the proposed framework and methodology, each of the steps listed in Figure 2.6 (Chapter 2) will be discussed in the context of scalability analysis. The objectives of a scalability study (step 1) have already been dened in the introduction to this chapter, where also an architecture-work scaling ratio has been introduced as a metric for scalability. The remaining problem of selecting the performance metric that is to be kept constant, will be discussed in Section 4.2.1. For the validation, the accuracy of the approach and the costs of evaluation are considered. Three dierent evaluation techniques will be investigated (step 2): an evaluation of the complexity of work and time (analytic), an approach based on simulations of the execution time, and measurements on a real system, where in the last two approaches speedup and eciency are calculated based on the simulated or measured execution times. As a consequence, the type of test workload has to be an executable (i.e. the actual program) in measurements, a partly executable partly non-executable model in simulations, and a non-executable model in the analytic approach. It will be shown, how the information used in the non-executable model can be transformed in information suitable for the simulation model, thus supporting to some extend the idea of a renement in modeling. I.e. the analytic approach, where results can be obtained quickly but at a rather high degree of

CHAPTER 4. SCALABILITY ANALYSIS

116

inaccuracy, can be \rened" in the simulation model. As the simulation tool used in this study (N-MAP Fers 94] Fers 95b]), which is based on a description of the program structure in a programming language, provides also a module for translating the model into an executable parallel program (for a CM-5), the measurements can be seen as the nal stage of renement. The measurement results will also be used to validate the predictions obtained from the analytic approach and the simulation. With respect to step 4 (characterizing the user interaction) it is emphasized again, that only a single program is investigated executed exclusively on a parallel architecture. Therefore, neither user interaction nor background load will be considered. Steps 5 to 7 (the actual process of workload characterization) will be discussed in detail in Section 4.2.2. Steps 8 to 11 (evaluation, data visualization, interpretation and validation, model renements), will be demonstrated on selected examples in the last section (4.3).

4.2.1 Performance Characteristics So far the performance index, that has to be kept constant, has not been specied. It has to be a measure, that depends on both, the number of processing elements and the workload. Therefore, the indices discussed in Section 3.3 are appropriate. In particular, eciency and redundancy are used in many scalability studies Kuma 91] Gram 93].

Eciency E (P N ) = T (1 N )=C (P N ) Eciency relates the costs of the sequential computation to the costs of the parallel computation. In this work a scalability function obtained when keeping eciency constant, is called a time oriented scalability function, as eciency is based on time characteristics (the sequential and parallel execution time): use E (P N ) = P T use(1TN ) +(1PN )T ovh(P N )

Redundancy R(P N ) = W (P N )=W (1 N ) Redundancy seems to be a suitable performance index, as it is directly based on the ratio of work in the parallel and in the sequential program. But redundancy is bounded between 1 and P , thus it is not a normalized index. Therefore the reciprocal value of redundancy is used which is given by

CHAPTER 4. SCALABILITY ANALYSIS

117

Total Useful Work E 0(P N ) = 1=R(P N ) = Total Useful Work + Total Overhead This expression is similar to the denition of eciency, but is based on work and will therefore be called redundancy based or work oriented eciency. In a more formal notation, use (N ) 1 W 0 E (P N ) = W use (N ) + W ovh(P N ) = W ovh(P N )=W use (N ) + 1 where W use (N ) denotes all useful work (which is independent of the number of PEs) and W ovh (P N ) is an expression for the overhead when solving a problem of size N on a system with P processing elements. The fraction W ovh(P N )=W use (N ) is called overhead ratio in Carm 91]. The objective is to nd a workload model, which provides (together with a specication of architecture parameters) the necessary information to derive expressions for W use  W ovh T use and T ovh. Such a model is presented in the next section.

4.2.2 Model Parameters and De nitions To obtain the values for E (P N ) (time oriented eciency) and E 0(P N ) (work oriented eciency) a set of parameters is introduced (see the summary in Table 4.2)4. The relations between these parameters are given below.

Work Oriented Eciency The amount of work necessary for solving the problem on a uniprocessor system is equal to the amount of useful work in the solution on a parallel system and is given by

W (1 N ) = W use (1 N ) = W use (P N ) DOP Xmax par W (i) = W seq (P N ) + W par(P N ) = i=1

In this chapter, only message passing, distributed memory architectures are considered. A similar approach could be derived for shared memory systems by investigating memory access instead of communication behavior. 4

CHAPTER 4. SCALABILITY ANALYSIS

Work Oriented:
W : total amount of work
W(1,N) : total amount of work in the sequential program
W(P,N) : total amount of work in the parallel program
W^{use}(1,N) : total amount of useful work in the sequential program
W^{use}(P,N) : total amount of useful work in the parallel program
W^{ovh}(P,N) : total amount of overhead work in the parallel program
E' : work oriented efficiency

Time Oriented:
T(1,N) : sequential execution time
T(P,N) : parallel execution time
T^{use}(1,N) : time spent for useful computations in the sequential program
T^{use}(P,N) : time spent for useful computations in the parallel program
T^{ovh}(P,N) : overhead time in the parallel program
\mu : processing speed
E : time oriented efficiency

Table 4.2: Parameters Characterizing Useful Work and Overhead


The total amount of work in the parallel algorithm is given by the sum of the amount of useful work and the amount of overhead (either redundant work or communication):

W(P,N) = W^{use}(P,N) + W^{ovh}(P,N), \qquad W^{ovh}(P,N) = W^{red}(P,N) + W^{comm}(P,N)

Substituting these expressions in the equation for work oriented efficiency,

E'(P,N) = \frac{\sum_{i=1}^{DOP_{max}} W^{par}(i)}{\sum_{i=1}^{DOP_{max}} W^{par}(i) + W^{red}(P,N) + W^{comm}(P,N)} = \frac{W^{use}(1,N)}{W^{use}(1,N) + W^{ovh}(P,N)} = \frac{1}{W^{ovh}(P,N)/W^{use}(1,N) + 1}

and solving this equation with respect to the amount of work,

W'_{scaled} = \frac{E'}{1-E'}\,W^{ovh}(P,N)

gives an expression for the work scaling function. It can be seen from this function that work has to grow at the same rate as the overhead to keep work oriented efficiency constant.
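The work scaling function is simple enough to be evaluated directly. As an illustration, a minimal Python sketch is given below; the function names are ours and the overhead model 2(P-1) is an assumption borrowed from the EMBAR example later in this chapter:

    # Work-oriented isoefficiency: W'_scaled = E'/(1-E') * W_ovh(P,N).

    def work_oriented_efficiency(w_use, w_ovh):
        # E' = W_use / (W_use + W_ovh)
        return w_use / (w_use + w_ovh)

    def scaled_work(e_target, w_ovh):
        # useful work needed so that E' equals e_target
        return e_target / (1.0 - e_target) * w_ovh

    for p in (2, 4, 8, 16):
        w_ovh = 2 * (p - 1)                       # assumed overhead model
        w = scaled_work(0.9, w_ovh)
        print(p, w, work_oriented_efficiency(w, w_ovh))   # E' stays at 0.9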

Time Oriented Efficiency

The execution time for the uniprocessor algorithm is given by

T(1,N) = T^{use}(1,N) = W(1,N)/\mu

where \mu denotes the processing speed of a single processing element. As part of W(1,N) can be executed in parallel, the time to execute this useful work is given by

T^{use}(P,N) = T^{seq}(P,N) + T^{par}(P,N) = \sum_{i=1}^{DOP_{max}} \frac{W^{par}(i)}{\mu\,i}\left\lceil\frac{i}{P}\right\rceil


and the total parallel execution time is given by

T(P,N) = T^{use}(P,N) + T^{ovh}(P,N)

with

T^{ovh}(P,N) = T^{ovh\,prog}(P,N) + T^{ovh\,arch}(P,N)
T^{ovh\,prog}(P,N) = T^{redundant}(P,N) + T^{imbalance}(P,N)
T^{ovh\,arch}(P,N) = T^{comm}(P,N)

Again, these expressions can be substituted in the definition of efficiency and solved with respect to the amount of work:

E(P,N) = \frac{T(1,N)}{P\,T(P,N)} = \frac{\left(\sum_{i=1}^{DOP_{max}} W^{par}(i)\right)/\mu}{P\left(\sum_{i=1}^{DOP_{max}} \frac{W^{par}(i)}{\mu\,i}\left\lceil\frac{i}{P}\right\rceil + T^{ovh}(P,N)\right)}
= \frac{(W^{seq}(N) + W^{par}(P,N))/\mu}{P\,(W^{seq}(N)/\mu + W^{par}(P,N)/(P\mu) + T^{ovh}(P,N))}
= \frac{W^{seq}(N) + W^{par}(P,N)}{P\,W^{seq}(N) + W^{par}(P,N) + P\,\mu\,T^{ovh}(P,N)}

W_{scaled} = \frac{E}{1-E}\left((P-1)\,W^{seq}(N) + P\,\mu\,T^{ovh}(P,N)\right)

The amount of work to keep time oriented efficiency fixed depends on the overhead and on the sequential fraction.
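This rule, too, is easy to evaluate numerically. A minimal sketch (helper names and the overhead-time model are illustrative assumptions):

    # Time-oriented isoefficiency: the total work W = W_seq + W_par
    # that keeps E constant.

    def time_oriented_efficiency(w_seq, w_par, p, t_ovh, mu=1.0):
        # E = (W_seq + W_par) / (P*W_seq + W_par + P*mu*T_ovh)
        return (w_seq + w_par) / (p * w_seq + w_par + p * mu * t_ovh)

    def scaled_total_work(e_target, p, w_seq, t_ovh, mu=1.0):
        # W_scaled = E/(1-E) * ((P-1)*W_seq + P*mu*T_ovh)
        return e_target / (1.0 - e_target) * ((p - 1) * w_seq + p * mu * t_ovh)

    w_seq, e = 10.0, 0.5
    for p in (2, 4, 8):
        t_ovh = 2.0 * (p - 1)                     # assumed overhead time
        w = scaled_total_work(e, p, w_seq, t_ovh)
        print(p, w, time_oriented_efficiency(w_seq, w - w_seq, p, t_ovh))

For P = 2 this yields a total work of 14 and a verified efficiency of exactly 0.5, confirming that the closed form returns the total amount of work W^{seq} + W^{par}.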

Analysis Based on a Complexity Notation

For the work oriented analysis (keeping E' constant), it is necessary to specify the total amount of useful and wasteful computations. With respect to overhead, only communication and redundant work are considered. To characterize the communication overhead, an


expression has to be found relating the complexity of the communication pattern on a particular architecture to the amount of data to be sent. Redundant work must be specified in a mutually exclusive way to useful work, i.e. if P processing elements are performing X identical operations on the same data to avoid communication, the useful work already includes X of these operations; redundant work is given by the remaining (P-1)\,X operations. All these parameters are to be specified as complexity functions, depending on the problem size N and on the system size P (the number of processing elements). The task graph model and the corresponding parameters introduced in Section 4.1.4 provide enough information on the amount of work to derive expressions for the characteristics in Table 4.2. As the amount of work summarizes all work that has to be performed in the (part of the) application under study, the amount of useful work is given by w_i^{use} = \sum_j w_{i,j}^{use} and the amount of redundant work is given by w_i^{redundant} = \sum_j w_{i,j}^{redundant}. Considering also the communication complexity and the communication volume to derive the amount of communication work w^{comm}(c_i, w_{i,j}^{comm}), the total amount of useful work and the total amount of overhead are given by

W(1,N) = \sum_{i=1}^{K} w_i^{use}
W^{ovh}(P,N) = \sum_{i=1}^{K} w_i^{redundant} + \sum_{i=1}^{K-1} w^{comm}(c_i, w_{i,j}^{comm})

In a time oriented analysis, the corresponding values for T^{use}(P,N) and T^{red}(P,N) are derived from the characterization of work, given that the processing speed \mu of a processing element is known (this speed can for example be obtained from measurements). Note that if the speed for processing one unit of work is constant (i.e. independent of P and of the total amount of work), it does not have to be considered in a complexity analysis. In the examples, a constant speed of \mu = 1 is assumed. The time to execute a phase containing a single task (sequential phase) is given by t^{use}(i) = w_i^{use}/\mu, as there is only a single task (DOP_i = 1). The time to execute a parallel phase (DOP_i > 1, i.e. j possibly different tasks) is given by the execution time of the processing element with the heaviest load, t^{use}(i) = \max_k\left(\sum_{j \text{ mapped on } P_k} w_{i,j}^{use}/\mu\right), where P_k denotes the k-th PE. To derive communication times, additional knowledge on the particular architecture under study is necessary. In one of the examples (example 3), the influence of the interconnection topology on the scaling behavior is investigated.


Expressions for time characteristics based on task graph parameters are summarized below.

T^{seq} = \sum_{i=1}^{K} w_i^{use}/\mu \qquad \text{(for phases with } DOP_i = 1\text{)}
T^{par}(P) = \sum_{i=1}^{K} t_i^{use}\left\lceil DOP_i/P \right\rceil \qquad \text{(for phases with } DOP_i > 1\text{)}
T^{ovh}(P) = \sum_{i=1}^{K} t_i^{redundant}\left\lceil DOP_i/P \right\rceil + \sum_{i=1}^{K-1} t^{comm}(c_i, w_{i,j}^{comm})

where t^{comm}(c_i, w_{i,j}^{comm}) is an architecture dependent expression for the communication time from phase i to phase i+1.

Analysis Based on Simulation

All work oriented parameters can be used as inputs to a performance model, where this information is combined with information on the architecture to obtain time oriented values (compare the PAPS and the N-MAP approach). In this work, a simulation tool called N-MAP [Fers 94] is used for the analysis. In N-MAP, the program structure is specified in a so called Task Structure Specification (TSS), which describes the parts of the program (tasks) and their interaction (dependencies) in a language extension of C. In addition, timing requirements for tasks and for communication can be specified (by an arbitrarily complex expression or function in C). The functional behavior of the tasks (Task Behavior Specification, TBS) can also be implemented, but for performance analysis, a specification of the structure and the requirements is sufficient. In the simulation based approach, the structure of the task graph has to be transformed into a task structure specification in N-MAP[5]. The parameters characterizing the computation and communication demands can be included in the task requirements specification in a straightforward way.

Analysis Using Measurements

The task structure specification of N-MAP, which provides the algorithmic skeleton, can be refined by adding task behavior specifications, which contain the actual code of the program


to be implemented. In the current development state of N-MAP, a translator is available for generating CM-5 source code from the TSS and TBS. To obtain isoefficiency curves (i.e. lines connecting combinations of architecture and work size with identical efficiency), several combinations of P and W have to be executed and measured. A previous analytic evaluation might provide helpful insights in choosing these combinations to reduce the number of execution runs.

[5] Currently, the integration of PatternTool [Fers 95a], a tool for graphical editing of task graphs, with N-MAP is investigated in two diploma theses. A prototype of a tool that automatically transforms a task graph modeled in PatternTool into a TSS has been developed. In this work the transformation has been done "manually", by explicitly generating the TSS.

4.2.3 Comparison to Other Approaches

Besides the two metrics used in this work (E and E'), several other performance measures can be selected as a basis for scalability analysis. Two of them, namely speed and asymptotic speedup, are discussed below.

Speed I(P,N)

Based on the concept of generalized speedup (see Section 3), an isospeed function can be defined, similar to the isoefficiency concept. This function gives an expression for the necessary increase in problem size if the system size is increased and speed is to be kept constant, i.e.

\frac{W(P_0,N_0)}{P_0\,T(P_0,N_0)} = \frac{W(P_i,N_i)}{P_i\,T(P_i,N_i)}

In this definition, two systems are compared against each other: the reference system, characterized by N_0 and P_0, and the scaled system, characterized by N_i and P_i. Solving this equation with respect to the scaled work W' = W(P_i,N_i) gives the following scalability function based on the concept of isospeed [Sun 94]:

W' = \frac{W(P_0,N_0)\,P_i\,T(P_i,N_i)}{P_0\,T(P_0,N_0)}

Assuming that the values for the reference system are constant, the scaled amount of work has to grow at the same order as the parallel computation costs C(P).
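The isospeed rule is a one-line computation; a minimal sketch with made-up parameter values:

    # Scaled work W' that keeps the speed W/(P*T) constant between a
    # reference system (P0, N0) and a scaled system (Pi, Ni).

    def isospeed_scaled_work(w0, p0, t0, pi, ti):
        return w0 * pi * ti / (p0 * t0)

    # doubling the PEs while the parallel run time stays the same
    print(isospeed_scaled_work(w0=1000.0, p0=4, t0=2.0, pi=8, ti=2.0))  # 2000.0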

Asymptotic Speedup

The ratio between the execution time of the best sequential algorithm and the execution time of the parallel algorithm when using as many processors as necessary is called asymptotic speedup and is given as follows:

S(P_{opt},N) = \frac{T(1,N)}{T(P_{opt},N)}


where N denotes the problem size and P_{opt} denotes the optimum number of PEs for the particular algorithm-machine combination, i.e. the system size which yields minimum total execution time. Assuming an ideal parallel architecture (i.e. there is no architecture related overhead) leads to the definition of ideal asymptotic speedup:

S_I(P_{I,opt},N) = \frac{T(1,N)}{T_I(P_{I,opt},N)}

Note that the optimum number of PEs may be different for an ideal parallel architecture and a given real architecture. In general it is reasonable to assume that P_{I,opt} \geq P_{opt} because of the additional architecture related overhead in a real system. Based on the concept of asymptotic and ideal asymptotic speedup, a definition of scalability can be given as follows:

\Phi(P_{opt},N) = \frac{S(P_{opt},N)}{S_I(P_{I,opt},N)} = \frac{T_I(P_{I,opt},N)}{T(P_{opt},N)}

As additional overhead can only increase the execution time, T_I \leq T, thus this scalability index is normalized (bounded between 0 and 1). This index characterizes the degree of performance degradation due to architecture overhead and is therefore suitable for the analysis of architecture scalability. Several other approaches can be found in the literature which are based on the evaluation of efficiency as proposed in this work, but they differ in the methods to characterize the amount of work or the execution times. A selection of approaches is discussed below.

[Gele 89] One of the earlier approaches for analytical models of parallel execution time and speedup was proposed by Gelenbe. In this approach, the ideal parallelism profile P^\infty(t) (assuming an unlimited number of PEs) with the ideal execution time T^{opt} and the derived average DOP are used for the characterization. When using P processing elements, P^\infty(t) will sometimes be larger than P and sometimes smaller. Thus, the "free" capacities (P^\infty(t) < P) are given by

T^{free}(P) = \sum_{t=0}^{T^{opt}(P)} \max\{0,\;P - P^\infty(t)\}

and the periods of "overloading" (P^\infty(t) > P) are given by

T^{over}(P) = \sum_{t=0}^{T^{opt}(P)} \max\{0,\;P^\infty(t) - P\}

This "free time" may be used to compute (part of) the load with higher degree of parallelism in the


best case (by rearranging the order of the program parts), but in the worst case it is not possible to "fill" the free times (because of dependencies in the program structure). Therefore, the following lower and upper bounds on the execution time can be derived:

T^{opt} + \max\left\{0,\;\frac{T^{over}(P) - T^{free}(P)}{P}\right\} \;\leq\; T(P) \;\leq\; T^{opt} + \frac{T^{over}(P)}{P}
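A minimal sketch of how T^{free}, T^{over}, and the resulting execution time bounds follow from a discrete ideal parallelism profile (the profile data is made up):

    def profile_bounds(p_inf, P):
        t_opt = len(p_inf)                    # ideal execution time
        t_free = sum(max(0, P - d) for d in p_inf)
        t_over = sum(max(0, d - P) for d in p_inf)
        lower = t_opt + max(0, (t_over - t_free) / P)
        upper = t_opt + t_over / P
        return lower, upper

    p_inf = [1, 4, 8, 8, 4, 1]                # ideal DOP over time
    print(profile_bounds(p_inf, P=4))         # -> (6.5, 8.0)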

Based on these bounds for the execution time, bounds for the (time oriented) efficiency can be derived:

\frac{DOP^{avg}}{P + \frac{T^{over}(P)}{T^{opt}}} \;\leq\; E(P) = \frac{S(P)}{P} \;\leq\; \frac{DOP^{avg}}{P + \frac{T^{over}(P) - T^{free}(P)}{T^{opt}}}

where

DOP^{avg} = \frac{1}{T^{opt}} \int_0^{T^{opt}} P^\infty(t)\,dt

gives the average degree of parallelism in the application. Instead of specifying T^{use} by a profile, the varying degree of parallelism can be expressed in a more concise model by introducing a probability \alpha_i for using i processing elements out of a maximum of P available PEs. Thus,

T^{use} = \sum_{i=1}^{P} \alpha_i \frac{1}{i}

where T(1) = 1 and neither the sequential fraction nor any overhead are considered. Note that \alpha_i does not characterize the varying DOP within a particular program in a single execution run, but should be interpreted as an average value when a family of programs is considered. To consider overhead caused by load imbalance, a load imbalance factor \beta_i is introduced, giving the additional time consumed by the most loaded PE and resulting in

T(P) = \sum_{i=1}^{P} \alpha_i \left(\frac{1}{i} + \beta_i\right)

and

E(P) = \frac{1}{(1-\pi)(1+\beta) + P \sum_{i=1}^{P-1} \alpha_i \left(\frac{1}{i} + \beta_i\right)}

where \pi = \sum_{i=1}^{P-1} \alpha_i is the probability that not all processors are used, \beta = P\,\beta_P, and T(1) = 1. Depending on different assumptions on \alpha_i, different models are obtained:


1. Following an optimistic assumption, where \alpha_i = 0 for 1 \leq i \leq (P-2) and \pi = \alpha_{P-1}, efficiency is given by

E(P) = \frac{1}{(1-\pi)(1+\beta) + P\,\pi\,\frac{1}{P-1}} = \frac{1}{1 + \beta(1-\pi) + \frac{\pi}{P-1}}

which would explain a "quasi-linear" speedup for large values of P.

2. In a pessimistic assumption, \alpha_i = (1-\pi) = \frac{1}{P} for 1 \leq i \leq (P-1), and efficiency is given by

E(P) = \frac{1}{\sum_{i=1}^{P} \frac{1}{i} + \sum_{i=1}^{P} \beta_i} \approx \frac{1}{\log_2 P}

i.e. a "logarithmic" speedup.

3. As a "compromise", \alpha_i = \frac{\pi}{P-1} (uniform distribution):

E(P) = \frac{1}{(1-\pi)(1+\beta) + \frac{P\,\pi}{P-1} \sum_{i=1}^{P-1} \left(\frac{1}{i} + \beta_i\right)} \approx \frac{1}{(1-\pi)(1+\beta) + \pi \log_2 P} \quad \text{for large } P

To evaluate the scalability, the functions for efficiency have to be evaluated for both the reference system and the scaled system. Depending on the level of detail at which the work can be characterized, either the detailed model (characterization of the parallelism profile) or the concise model (characterization by probabilities for using i PEs and by a load imbalance factor) can be applied.

[Klei 92] A couple of years later, Kleinrock proposed a similar approach, which is also based on the workload being modeled by an ideal parallelism profile. This profile is either a discrete function with a constant, deterministic step size (the time difference between two points of discontinuity), or a continuous function P^\infty(t). In both cases, the sequential execution time T(1) for a job is given by the total number of its tasks W, since it is assumed that each task can be executed in one time unit. In the continuous case,

W(P) = W = \int_0^{T^{opt}} P^\infty(t)\,dt \qquad\text{and}\qquad T(P) = \int_0^{t_P} 1\,dt + \int_{t_P}^{T^{opt}} \frac{P^\infty_{inc}(t)}{P}\,dt

where P^\infty_{inc} is the parallelism profile rearranged into a non-decreasing function, and t_P is the time instant in the rearranged profile where the ideal DOP exceeds P, the available number of PEs. Thus, efficiency is given by

E(P) = \frac{W}{P\,t_P + \int_{t_P}^{T^{opt}} P^\infty_{inc}(t)\,dt}


The considerations for the discrete case are analogous, with the integration replaced by a summation. In the paper cited above, this model is not used for scalability analysis of a single program, but as a characterization of programs in a multiprogrammed system. This characterization is the input to a queuing network modeling the behavior of a multiprogrammed system. To use this model for scalability analysis, the equations given above can either be solved with respect to W analytically, thus giving the general scalability behavior, or be evaluated for a reference system and one or more scaled systems, thus obtaining a scalability function for particular scaling ratios (only a set of samples of the scalability behavior).
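A minimal sketch of the discrete case, with unit-time tasks and made-up profile data; the rearrangement is done by sorting:

    def kleinrock_efficiency(p_inf, P):
        W = sum(p_inf)                        # T(1): one time unit per task
        inc = sorted(p_inf)                   # rearranged, non-decreasing
        # a step runs at full speed while the DOP fits on P PEs,
        # and is slowed by a factor DOP/P afterwards
        t_total = sum(1.0 if d <= P else d / P for d in inc)
        return W / (P * t_total)

    print(kleinrock_efficiency([1, 4, 8, 8, 4, 1], P=4))  # -> 0.8125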

[Mari 93] Recently, Marinescu and Rice have proposed a model where the sequential and parallel execution times are given as the ratio of work and speed. The parallel work consists of the useful work, the duplicated (redundant) work, and the work related to communication. All values are given in terms of complexity functions of P, the number of processing elements. Following this model, efficiency is given by

E(P) = \frac{T(1)}{P\,T(P)} = \frac{W^{use}(1)/\sigma(1)}{W(P)/\sigma(P) + P\,T^{blk}(P)}

with

T(1) = W^{use}(1)/\sigma(1)
T(P) = T^{calc}(P) + T^{blk}(P)
T^{calc}(P) = \frac{W(P)}{P\,\sigma(P)}
W(P) = W^{use} + W^{dupl}(P) + W^{comm}(P)
W^{dupl}(P) = (P-1)\,f^{dupl}\,W(1)

where T^{calc}(P) denotes the calculation time, T^{blk}(P) the blocking time, W^{dupl}(P) the duplicated work, f^{dupl} the fraction of duplicated work, and W^{comm}(P) the communication complexity. Assuming that the processing speeds per processor are equal, i.e. \sigma(1) = \sigma(P) = \sigma, the model simplifies to

E(P) = \frac{W^{use}}{W(P) + P\,\sigma\,T^{blk}(P)}
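A minimal sketch of the simplified model with equal per-processor speed; all parameter values below are made up:

    def efficiency_mr(w_use, f_dupl, w_comm, t_blk, P, sigma=1.0):
        w_dupl = (P - 1) * f_dupl * w_use     # duplicated (redundant) work
        w_total = w_use + w_dupl + w_comm     # parallel work W(P)
        return w_use / (w_total + P * sigma * t_blk)

    print(efficiency_mr(w_use=1000.0, f_dupl=0.01, w_comm=50.0,
                        t_blk=2.0, P=16))     # ~0.81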


The characteristics used in this approach are similar to the characteristics proposed in this thesis, but differ in that it is assumed that all useful work, duplicated work, and communication demands can be processed fully in parallel (see the expression for T^{calc}(P)).

[Sun 91b] In the previous model, the speed per processing element might be different for the uniprocessor system compared to the parallel system. In this model, the execution time is derived from the amount of work, where the speed may depend on the type of work: k different types of work W(i,j) can be executed at different speeds \mu_j, where i denotes the DOP in W and j denotes the type of work.

T(P) = T^{ovh}(W,P) + \sum_{i=1}^{DOP_{max}} \sum_{j=1}^{k} \frac{W(i,j)}{i\,\mu_j}\left\lceil\frac{i}{P}\right\rceil

Obviously, this model requires a more detailed characterization of work than proposed in this work, but if this information is available, it can easily be included in the proposed approach.

4.3 Examples

In this section, selected examples are given to illustrate the proposed approach. To demonstrate the application of all three evaluation techniques and to show the previously mentioned process of refinement, a simple example of an embarrassingly parallel application (one of the NAS kernels) has been chosen. In a modification of this type of application, the effects of a lower degree of parallelism in the program are investigated. The second example, Householder reduction, belongs to a class of applications frequently used in performance evaluation studies because of their simple, deterministic structure. For this example, only the analytic approach will be demonstrated. Another popular example is parallel FFT [Quin 93b], which is the third example. In this study, the influence of the type of architecture on scalability is investigated, comparing a hypercube topology with a grid of processing elements. Again, the evaluations are based on complexity analysis. In the last example, the scalability of a Finite Element Method is investigated. Here, not only the problem size (number of grid points) but also the desired accuracy has significant influence on the amount of work.


W(P) = \underbrace{0.1}_{W^{seq}} + \underbrace{w\,2^N}_{W^{par}} + \underbrace{2(P-1)}_{W^{ovh}}

T(P) = \underbrace{0.1/\mu}_{T^{seq}} + \underbrace{\frac{w}{\mu}\left\lceil\frac{2^N}{P}\right\rceil}_{T^{par}} + \underbrace{2\log_2 P}_{T^{ovh}}

Figure 4.3: Task Graph and Parameters for EMBAR

4.3.1 Example 1: An Embarrassingly Parallel Problem

A parallel application which is (nearly) perfectly parallelizable is called an embarrassingly parallel program (see the NAS benchmarks). This is a special case of a communication pattern of type (a) with k = 1 and w(1) + w(3) \ll w(2), i.e. a (nearly) negligible sequential phase. In a perfectly parallel program, w(1) + w(3) = T^{ovh}_{total} = 0. Assuming that the load is perfectly balanced, the execution time is given by T(P,N) = T(1,N)/\min\{P, DOP^{max}(N)\}, thus efficiency E(P) = \min\{1, DOP^{max}(N)/P\}. As long as P \leq DOP^{max}(N), efficiency is always 100% and no scaling of the workload is necessary. If P is scaled above DOP^{max}, the workload has to be scaled such that DOP^{max} = P'. The assumptions for a perfectly parallel program are not realistic (no overhead, perfect balance of load), but there are some applications which are "nearly" perfectly parallel. As an example let us consider EMBAR, one of the benchmarks from the NAS suite [Bail 91], where a set of 2^N random numbers is to be generated. The DOP in the parallel phase is bounded by the problem size 2^N, and each subtask is assumed to require a constant amount of work w, independent of N; thus W^{par}(N) = 2^N\,w. Phases of sequential computation at the beginning and at the end of the execution are independent of the amount of work and are comparatively small (assumed to be w(1) + w(3) = W^{seq}(N) = 0.1). (P-1) communication steps are necessary to activate the parallel tasks and (P-1) steps are required for gathering the results. By assuming a constant data volume of one, W^{comm} is given by 2(P-1). The execution times for the sequential and the parallel phase are derived straightforwardly following the definitions given in the previous section, T^{seq} = 0.1/\mu and T^{par} = \lceil 2^N/P\rceil\,w/\mu. Assuming that the broadcast operation is efficiently implemented on a hypercube, the communication overhead is T^{comm} = 2\log_2 P. The task graph and the expressions for W and T are summarized in Figure 4.3.
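A minimal sketch evaluating these expressions numerically, with w = \mu = 1 as assumed for Figure 4.4:

    from math import ceil, log2

    def t_embar(P, N, w=1.0, mu=1.0):
        # T(P) = 0.1/mu + (w/mu)*ceil(2^N / P) + 2*log2(P)
        return 0.1 / mu + w / mu * ceil(2**N / P) + 2 * log2(P)

    def speedup(P, N):
        return t_embar(1, N) / t_embar(P, N)

    for N in (6, 20, 28):
        print(N, [round(speedup(P, N), 1) for P in (2, 8, 64, 1024)])

For N = 20 the speedup on 64 PEs comes out close to 64 (nearly linear), while for N = 6 it saturates around 3, matching the qualitative behavior described below.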


Figure 4.4: Predicted Speedup and Efficiency for EMBAR (left: speedup over system size; right: efficiency per system size, for problem sizes 2^6 to 2^28)

Based on these expressions, speedup and efficiency can be predicted. As no particular architecture is considered so far, the architecture dependent parameters w and \mu are assumed to be one. Figure 4.4 shows these functions for a varying number of random numbers to be computed (2^6 to 2^28). In the left diagram, the predicted speedup is shown. For a small workload, the degree of parallelism is too low to obtain a significant speedup. In fact, if the DOP becomes smaller than the number of PEs (e.g. 64 for a DOP of 2^6), speedup decreases with increasing system size. If the workload is sufficiently large (> 2^20), speedup is nearly linear. A similar behavior can be seen from the efficiency curves in the diagram on the right. In this diagram, there is an axis from 0 to 1 for each system size. A point on an axis represents the corresponding efficiency value for the particular system size and workload size (see the legend). To make the diagram more readable, the points are connected; thus decreasing efficiency is depicted by a spiral line towards the center. For a large amount of work, efficiency is close to one.


Complexity Analysis

Work oriented efficiency (see page 116) is given by

E'(P,N) = \frac{0.1 + w\,2^N}{0.1 + w\,2^N + 2(P-1)}

Solving this equation with respect to W = 0.1 + w\,2^N gives the following expression for the scaled load:

W'_{scaled} = \frac{E'}{1-E'}\,2(P-1) = O(P)

i.e. the application is linearly scalable. The work scaling function (see Section 4.1.2, Definition 42) is depicted in Figure 4.5 for linear (a) and exponential (b) scaling (the larger diagrams in the middle). The relative and absolute scaling rates and scaling ratios (Definitions 43 and 44) are also depicted in those figures. Note that it is possible to obtain any desired level of efficiency with scaling rates of the same order (the functions depicting the scaling rates for efficiencies of 0.1, 0.3, 0.5, and 0.7 are hidden by the 0.9 line). In linear and exponential scaling, the scaling index approaches one, i.e. work and architecture are scaled at the same rate (linearly scalable). Similar conclusions can be drawn from the diagrams showing the scaling rates. Absolute scaling rates grow linearly for linear scaling and exponentially for exponential scaling; relative rates approach 1 in linear scaling and 2 in exponential scaling.

The time oriented efficiency (see page 116) is given by

E(P,N) = \frac{0.1 + w\,2^N}{0.1\,P + w\,2^N + 2P\log_2 P}

Thus,

W_{scaled} = \frac{E}{1-E}\left(0.1(P-1) + 2P\log_2 P\right) = O(P\log_2 P)

Thus, the relative scaling ratio k_A/k_W is smaller than one, but also converges asymptotically to one as P increases, for both linear and exponential scaling (see Figure 4.6). Comparing time and work oriented scalability, the absolute amount of work to maintain time oriented efficiency is larger, but the scaling rates show a similar behavior.
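This convergence can be checked numerically. A minimal sketch for the time-oriented isoefficiency curve of EMBAR (with \mu = 1) under exponential scaling of P:

    from math import log2

    def w_scaled(E, P):
        # W_scaled = E/(1-E) * (0.1*(P-1) + 2*P*log2(P)) = O(P log2 P)
        return E / (1.0 - E) * (0.1 * (P - 1) + 2 * P * log2(P))

    E = 0.5
    prev_p, prev_w = 2, w_scaled(E, 2)
    for p in (4, 8, 16, 32, 64):
        w = w_scaled(E, p)
        ratio = (p / prev_p) / (w / prev_w)     # k_A / k_W, relative
        print(p, round(w, 1), round(ratio, 3))  # ratio slowly approaches 1
        prev_p, prev_w = p, w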

Figure 4.5: Work Oriented Scalability Behavior for EMBAR ((a) linear increase in P, (b) exponential increase in P; panels: scaled work, relative and absolute work scaling rates, relative and absolute k_A/k_W over system size, for efficiency levels 0.1, 0.3, 0.5, 0.7, and 0.9)

Figure 4.6: Time Oriented Scalability Behavior for EMBAR ((a) linear increase in P, (b) exponential increase in P; panels: scaled work, relative and absolute work scaling rates, relative and absolute k_A/k_W over system size, for efficiency levels 0.1, 0.3, 0.5, 0.7, and 0.9)


As a modification, let us consider the hypothetical example from Section 3.3 (explained in Figure 3.5), which has the same structure as the embarrassingly parallel problem, but a poor ratio of sequential and parallel work. For this example, called HYPO in the following,

E'(P,N) = \frac{2N + N^2}{2N + N^2 + 2P} \qquad W'_{scaled} = \frac{E'}{1-E'}\,2P = O(P)

i.e. a sequential fraction which increases with the problem size does not change the work oriented scaling behavior. The diagrams showing the scalability behavior of HYPO would be identical to those of EMBAR (as shown in Figure 4.5). Time oriented efficiency is given by

E(P,N) = \frac{(2N + N^2)/\mu}{(2NP + N^2)/\mu + 2P\log_2 P} \qquad W_{scaled} = \frac{E}{1-E}\left(2N(P-1) + 2P\log_2 P\right) = O(NP + P\log_2 P)

As the sequential fraction increases with N, the amount of work necessary to keep time oriented efficiency constant is larger than the required increase if the sequential fraction is constant (compare the diagrams for EMBAR and HYPO in Figures 4.6 and 4.7). The relative scaling rates approach asymptotically one in linear scaling of P (ideal) and 0.5 in exponential scaling, i.e. the work scaling rate is twice as high as the architecture scaling rate. Absolute scaling rates show poor behavior if the increase in architecture becomes large. Furthermore, the scaling rates for different values of E differ slightly, but are still of the same order.

Figure 4.7: Time Oriented Scalability Behavior for HYPO ((a) with linear increase in P, (b) with exponential increase in P; panels: scaled work, work scaling rates, and k_A/k_W over system size, for efficiency levels 0.1, 0.5, and 0.9)


Figure 4.8: Simulated Relative Speedup and Efficiency for EMBAR (relative speedup and efficiency over system size for problem sizes 2^20, 2^24, 2^28, and the optimum)

Analysis Based on Simulations

For the simulations, the task graph model was transformed into a TSS, and the amount of work to be computed by each task was given in the requirements specification. To obtain a timing value, it is necessary to specify the processing speed \mu. Measurement experiments provided an estimate of \mu = 7.64. From several simulation runs with varying system and workload size, the execution times were recorded. Based on these simulated values, relative speedup and efficiency were computed[6]. The obtained relative speedups and efficiencies are depicted in Figure 4.8. The simulation shows similar results as the analytical model: speedup increases if the problem size is sufficiently large. The loss in efficiency in the simulation model is due to higher communication costs. In the analytical model it was assumed that the broadcast operation can be performed in log_2(P) time units, but in the simulation model a single PE activates all others, thus requiring P-1 communication steps[7]. This also explains the loss

in efficiency (there are larger idle times for the PEs).

[6] The basis was the execution on a system with P = 2.
[7] The version of N-MAP used during this work did not support an efficient implementation of a broadcast operation.

Figure 4.9: Measured Relative Speedup and Efficiency for EMBAR (system sizes 32, 64, and 128; problem sizes 2^20, 2^24, 2^28)

Measurements

The functionality of EMBAR has also been implemented within N-MAP by specifying the task behavior specifications (see [Pfne 95] for a description of the N-MAP implementation). Source code for a CM-5 was generated and the program was executed on a 128-node machine in Paris[8]. From the measurement experiments, timings were obtained and are given in Figure 4.9. Note that only partitions of 32, 64, and 128 nodes were available, thus the number of experiments was limited and not all combinations of interest could be evaluated. Furthermore, the partitions containing 32 and 64 nodes could only be used in a shared mode (i.e. other programs can be executed simultaneously, thus measurements can be influenced). Based on the measured execution times, relative speedup and efficiency were computed and are depicted in Figure 4.9. The observed values match closely with the analytical model,

i.e. relative speedup doubles if the system size is doubled and efficiencies are close to one. The reason for the difference to the simulation model lies again in the implementation of the broadcast operation: on the CM-5 an efficient implementation of the broadcast was possible.

[8] With the kind permission of A. Ferscha and G. Chiola, who share this account, and of Centre National de Calcul Parallele en Sciences de la Terre, the owner of the machine.

The task graph iterates the steps eliminate(i), broadcast(i), and (i-1) parallel transform(i) tasks followed by a synchronization, for i = N down to 2, ending with eliminate(1). The parameters are:

W(P) = \underbrace{N(N-1)}_{W^{seq}} + \underbrace{\sum_{i=0}^{N-2}(N-i)(N-i-1)}_{W^{par}} + \underbrace{\sum_{i=0}^{N-2}\min\{N-i,P\}\,(N-i)}_{W^{ovh}} \approx N^2 + N^3 + N^3

T(P) = \underbrace{\frac{N(N-1)}{\mu}}_{T^{seq}} + \underbrace{\sum_{i=0}^{N-2}\frac{N-i}{\mu}\left\lceil\frac{N-i}{P}\right\rceil}_{T^{par}} + \underbrace{\sum_{i=0}^{N-2}\log_2(\min\{N-i,P\})\left\lceil\frac{N-i}{P}\right\rceil(N-i)}_{T^{ovh}} \approx \frac{N^2}{\mu} + \frac{N^2}{\mu}\left\lceil\frac{N}{P}\right\rceil + N^2\log_2 P\left\lceil\frac{N}{P}\right\rceil

Figure 4.10: Task Graph and Parameters for Parallel HH

4.3.2 Example 2: Householder Reduction

Among the most frequently cited simple examples in the analysis of parallel programs are LU decomposition, Gaussian elimination, and Householder reduction for solving a system of linear equations with coefficients given in matrix form (dimension N). All these applications have an appealing, simple structure, but still provide a pattern worth being investigated. In this study, a column oriented parallel version of Householder reduction will be investigated (as described in [Brin 92]), i.e. the computation of the upper triangular matrix of the algorithm will be investigated, excluding the actual back substitution for the solution of the equations. The basic steps consist of eliminating the first column, broadcasting the newly computed values, and then transforming the remaining columns in parallel. The matrix is reduced by the first row and the first column, and the same steps are performed on the reduced matrix. Thus, after a total of N-1 iterations over the steps eliminate, broadcast, and transform, the matrix has been transformed into an upper triangular matrix[9].

[9] Let us assume that a solution exists.


Figure 4.11: Speedup and Efficiency for HH (speedup and efficiency over system size for N = 10 and N = 100)

A simplified[10] task graph of this algorithm is shown in Figure 4.10, where also the parameters characterizing W and T are summarized. For two different dimensions of the matrix (N = 10 and N = 100), the performance of HH is depicted in Figure 4.11. Speedup is rather low for both the smaller and the larger problem size, because of the comparatively large sequential fraction. Efficiency drops fast with increasing system size. The communication overhead makes it impossible to obtain an efficiency larger than 0.5. In this example, the amount of overhead work is independent of P and of the same order as W, thus it is not meaningful to compute work oriented scalability. Time oriented efficiency (see page 116) is given by

E = \frac{(N^2 + N^3)/\mu}{(PN^2 + N^3)/\mu + N^3\log_2 P} \qquad W_{scaled} = \frac{E}{1-E}\left(N^2(P-1) + N^3\log_2 P\right)

[10] Actually, the eliminate task can be performed as soon as the transformation of the corresponding column is finished; it does not depend on the completion of the transformations of the following columns. To obtain a fork-join structure, these dependencies have been added.

Figure 4.12: Time Oriented Scalability Behavior for HH ((a) with linear scaling of P, (b) with exponential scaling of P; panels: scaled work, work scaling rates, and k_A/k_W over system size, for efficiency levels 0.1 and 0.3)


The isoefficiency function and scaling rates are plotted in Figure 4.12 for linear and exponential scaling of P. The diagrams show a poor scaling behavior. In fact, only the curves for E = 0.1 and a few sample points for an efficiency level of 0.3 are represented in the diagrams; although it would be theoretically possible to obtain an efficiency of 50%, the functions could not be computed[11]. An analysis of HH using simulations in N-MAP can be found in [Fers 94].

[11] A spreadsheet was used to evaluate the expression for W'; the values necessary to obtain a higher efficiency were out of range, no feasible solution was found.

W(P) = \underbrace{0}_{W^{seq}} + \underbrace{N(\log_2 N + 1)}_{W^{par}} + \underbrace{N\log_2 P}_{W^{ovh}}

T_{cube}(P) = \underbrace{0}_{T^{seq}} + \underbrace{\frac{\log_2 N + 1}{\mu}\left\lceil\frac{N}{P}\right\rceil}_{T^{par}} + \underbrace{\log_2 P\left\lceil\frac{N}{P}\right\rceil}_{T^{ovh}}

T_{grid}(P) = \underbrace{0}_{T^{seq}} + \underbrace{\frac{\log_2 N + 1}{\mu}\left\lceil\frac{N}{P}\right\rceil}_{T^{par}} + \underbrace{\sqrt{P}\left\lceil\frac{N}{P}\right\rceil}_{T^{ovh}}

Figure 4.13: Task Graph and Parameters for Parallel FFT

4.3.3 Example 3: FFT

Techniques for parallel FFT (Fast Fourier Transform) have been well known for many years. Let N be the number of points; then the DOP of this application is N for all phases. The parallel FFT is an application of type (b) where the number of phases is determined by log_2 N. The communications among phases are shuffled point-to-point communications, which form the well known butterfly pattern. Two different types of architectures are compared: a hypercube of dimension n, i.e. P = 2^n processing elements, and a grid of the same size. It is well known that the communication pattern of FFT can be mapped efficiently onto a hypercube topology with a communication overhead of N log_2(P) [Gupt 93]. Considering an implementation of FFT on a mesh

[Gupt 93], the communication complexity is given by N\sqrt{P}. The task graph and the corresponding expressions for W and T are given in Figure 4.13. Speedup and efficiency derived from these expressions are shown in Figures 4.14 and 4.15. Efficiency decreases slowly as long as the DOP is larger than the number of PEs, but drops significantly if P > DOP, for both the implementation on the cube and on the grid. As expected, on the cube the values for speedup and efficiency are higher than on the grid. In FFT (as in HH) the overhead is independent of P and of the same order as W, thus no meaningful work oriented scalability can be obtained. Time oriented efficiencies (see page 116) for the cube are given by

E = \frac{(N\log_2 N)/\mu}{(N\log_2 N)/\mu + N\log_2 P} \qquad W_{scaled} = O(N\log_2 P)

and for the grid by

E = \frac{(N\log_2 N)/\mu}{(N\log_2 N)/\mu + N\sqrt{P}} \qquad W_{scaled} = O(N\sqrt{P})

Scalability of FFT on a grid is only better as long as \sqrt{P} is smaller than \log_2 P, i.e. up to 16 processing elements (assuming that PEs are only scalable at a power of 2). The scalability behavior is depicted in Figure 4.16 for both topologies (only an exponential increase in PEs is depicted, because the number of nodes in a hypercube topology must be a power of two). The relative scaling index for the cube implementation is closer to one than the index for the grid implementation, thus the scaling behavior on the cube is better than that of the grid implementation.
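The crossover point between the two overhead terms can be checked directly; a minimal sketch:

    from math import log2, sqrt

    # the grid overhead term sqrt(P) only stays below the cube term
    # log2(P) up to P = 16
    for P in (4, 8, 16, 64, 256):
        better = "grid" if sqrt(P) < log2(P) else "cube (or equal)"
        print(P, log2(P), sqrt(P), better)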


Figure 4.14: Speedup and Efficiency for FFT on a Hypercube (speedup and efficiency over system size for several problem sizes)

Figure 4.15: Speedup and Efficiency for FFT on a Grid (speedup and efficiency over system size for several problem sizes)

Figure 4.16: Time Oriented Scalability Behavior for FFT ((a) on a hypercube, (b) on a grid; panels: scaled work, work scaling rates, and k_A/k_W over system size)

W(P) = \underbrace{0}_{W^{seq}} + \underbrace{N^2\,K}_{W^{par}} + \underbrace{4KP\frac{N}{\sqrt{P}}}_{W^{ovh}}

T(P) = \underbrace{0}_{T^{seq}} + \underbrace{\frac{K}{\mu}\left\lceil\frac{N^2}{P}\right\rceil}_{T^{par}} + \underbrace{4K\frac{N}{\sqrt{P}}}_{T^{ovh}}

Figure 4.17: Task Graph and Parameters for FEM (the N x N grid points, labeled 00 to 33 in the 4 x 4 illustration, are iteratively updated and distributed block-wise onto the PEs)

4.3.4 Example 4: Finite Element Method

A variety of parallel applications exhibit an iterative nearest neighbor communication pattern, which is a pattern of type (b) with a communication complexity of P times the number of neighbors (typically four, assuming that the PEs are arranged in a grid). An example are finite element methods (FEM), where the values of points arranged in a grid are iteratively updated; the new value of a grid point is computed out of its old value and the values of the neighboring points. This process is repeated until convergence, i.e. until the difference between the old value and the new value is within a tolerable error range for all grid points. For scalability, the interesting point in FEM is that the amount of work does not only depend on the problem size (grid dimension N), but also on the desired accuracy. Let \alpha denote the probability that the desired accuracy is not reached after an iteration; then the expected number of iterations is given by E[x] = \frac{1}{1-\alpha}, as the probability of reaching the desired accuracy in the k-th iteration is given by P(x = k) = (1-\alpha)\,\alpha^{k-1}. The total number of phases is K = 1 + E[x]. At the finest granularity, the DOP in each iteration is N^2, the number of grid points. For simplicity, it is assumed that the amount of work for updating a single grid point is 1. Let us consider an implementation on a grid of PEs, where the grid is assigned block-wise to the PEs, i.e. each processing element receives a subgrid of dimension N/\sqrt{P} \times N/\sqrt{P}. Assuming that all PEs can send and receive a message simultaneously, the amount of overhead is given


by W^{ovh} = (K-1)\,P\,4\,\frac{N}{\sqrt{P}}, i.e. each of the P PEs has to send a message of size N/\sqrt{P} (one border row or column) to each of its 4 neighbors, (K-1) times. Figure 4.17 shows the task graph and summarizes the expressions for W and T. Speedup and efficiency (assuming \alpha = 0.9) for grid sizes 16, 64, and 1024 are shown in Figure 4.18. Efficiency is equal to 0.5 as long as the DOP is sufficiently large (larger than the number of PEs), indicating a promising scaling behavior.

Figure 4.18: Speedup and Efficiency for FEM (speedup and efficiency over system size for grid sizes 16, 64, and 1024)

Because of the assumption of simultaneous send and receive of all PEs, all overhead is "perfectly parallel", i.e. T^{ovh} = W^{ovh}/P. Furthermore, there is no sequential fraction, thus work and time oriented efficiency (see page 116) are of the same order (they differ only in absolute values because of the processing speed \mu).

E' = \frac{K\,N^2}{K\,N^2 + 4KP\frac{N}{\sqrt{P}}} \qquad W'_{scaled} = O\!\left(\frac{PN}{\sqrt{P}}\right) = O(N\sqrt{P})

E = \frac{(K\,N^2)/\mu}{(K\,N^2)/\mu + 4KP\frac{N}{\sqrt{P}}}
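A minimal sketch evaluating the expected number of phases and the efficiency formula above (with \alpha = 0.9 as in Figure 4.18 and \mu = 1; the function names are ours):

    from math import sqrt

    def fem_efficiency(P, N, alpha=0.9, mu=1.0):
        K = 1 + 1 / (1 - alpha)               # expected number of phases
        useful = K * N**2 / mu                # sequential time T(1)
        p_times_t = K * N**2 / mu + 4 * K * P * N / sqrt(P)
        return useful / p_times_t

    for P in (4, 64, 256):
        print(P, round(fem_efficiency(P, N=64), 3))

With a grid dimension of N = 64 this yields efficiencies of roughly 0.89, 0.67, and 0.5, i.e. efficiency decays only slowly with P as long as the subgrids remain large.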


Figure 4.19: Time Oriented Scalability Behavior for FEM ((a) with linear increase in P, (b) with exponential increase in P; panels: scaled work, work scaling rates, and k_A/k_W over system size)


W_{scaled} = \frac{E}{1-E}\,4KP\frac{N}{\sqrt{P}} = O\!\left(\frac{PN}{\sqrt{P}}\right) = O(N\sqrt{P})

The diagrams for the scalability behavior are given in Figure 4.19. As already indicated by the plot of the efficiency (Figure 4.18), the scaling behavior is linear. The work scaling index is one for any desired level of efficiency, thus workload and architecture are scaled at the same rate. This ideal behavior may not be observed on real systems, because here it was assumed that all sends and receives may occur simultaneously, which might be unrealistic for most parallel architectures. But the objective was not to provide a realistic model of an architecture, but to demonstrate the characterization of the workload for scalability.

Chapter 5

Summary and Future Work

The major objective of this work was to develop a systematic methodology for workload modeling of parallel systems, emphasizing the quantitative characterization of the parallel program. To address this problem, first several fundamental issues related to workload modeling, i.e. selecting and characterizing the load that a system has to process, are discussed. A definition of the workload for parallel systems is given, based on a hierarchical description of the workload in terms of its components. These components correspond to the parts of a parallel program, namely the whole application, algorithms, routines, and statements. A set of characteristics for these components is proposed, allowing both the description and specification of executable models and the representation and modeling of non-executable models. Based on this framework, a workload modeling methodology embedded in a performance evaluation cycle is developed. This methodology can be applied in any performance evaluation study and can be adapted to meet the particular demands of the objective of the evaluation study and the selected evaluation technique (measurement or modeling). The major contribution of this part is a classification and characterization methodology for parallel workloads. While the issues related to the use of executable workload models are discussed rather briefly, the questions related to the construction of non-executable workload models have been described in more detail, since the focus of the other part of this work is on non-executable models, i.e. workload descriptions to be used in performance modeling. The suitability of the proposed framework as a classification scheme is demonstrated by


giving a state of the art review of non-executable workload models. Emphasis is given to a consistent notation for the different models (in particular, a consistent notation for profiles, shapes, and signatures, and for ratios for varying system size like speedup and efficiency). To demonstrate the integration of the framework and methodology in a performance evaluation study, the proposed workload modeling approach is applied for performance prediction in the second part of the work. A particular subproblem of performance prediction, namely scalability analysis, has been chosen. Two efficiency metrics are used for scalability analysis. A work based efficiency metric is defined, which relates W(1) and W(P), thus characterizing the amount of overhead work (work that is only necessary in the parallel solution) as a value between two extremes, 0 (infinite amount of overhead) and 1 (no overhead). The time based efficiency metric is the well known ratio of speedup and number of processing elements. An approach for scalability analysis is derived based on a characterization of the load and the architecture. To characterize the structure and the parallelism of the application under study (the load), a task graph model is used. The corresponding computation demands (distinguishing useful from redundant or wasteful work) are given as a function of the scalable problem and system parameters. Communication demands are specified in the same way. To combine this model with information on the architecture, three approaches are investigated. First, the use of a complexity function is discussed. For this type of analysis, the information on dependencies and on the degree of parallelism represented in the graph model is combined with the computation and communication demands to derive expressions for the total amount of work assuming a sequential execution, denoted by W(1), and the total amount of work in a parallel execution, denoted by W(P). The second technique is based on simulations, where the behavioral characteristics are represented in an executable simulation model and the demands are specified using deterministic and stochastic estimates. Finally, scalability behavior is evaluated by measurements on a real system. Several examples of programs with typical parallelism profiles are studied to illustrate the proposed approach. The contribution of this work can be summarized as follows.

1. A workload modeling methodology has been proposed following the principles of


hierarchy, modularity, flexibility, universality, and target orientation.

2. A workload modeling framework based on components, characteristics, and parameters has been developed, which is flexible enough to be adapted to meet the different demands posed by the performance evaluation technique and the objective of the evaluation study.

3. A systematic survey of non-executable workload models has been given.

4. The proposed approach has been applied in scalability analysis, providing new insights into this particular subproblem of performance evaluation.

Although this work has (hopefully) answered several questions related to workload modeling of parallel systems, there are still open problems. In particular, the following topics for future work are identified.

1. Investigate the suitability of the proposed approach over a variety of parallel architectures (including shared memory systems, heterogeneous systems, computer networks).

2. Apply the proposed approach in other performance evaluation studies.

3. Tighten the connection to the existing analysis tools developed at the department (N-MAP and PAPS) by providing a workload characterization tool with an interface to these analysis tools.

Finally, it would be interesting to apply the proposed methodology to a "real world" example, e.g. a large and complex parallel application, but limitations in time have prohibited this project so far.

Appendix A

Abbreviations and Symbols

Abbreviations

ATPN : arc timed Petri net
CAPSE : computer aided parallel software engineering
CFG : control flow graph
CG : communication graph
CUT : component under study
DFG : data flow graph
DOP : degree of parallelism
EG : event graph
I/O : input/output
MIMD : multiple instruction multiple data
PAPS : performance analysis of parallel systems
PE : processing element
GERT : graphical process evaluation review technique
PN : Petri net
PPS : parallel processing system
PRM : program resource mapping
QNM : queuing network model
SIMD : single instruction multiple data
SPMD : single program multiple data


SPN : stochastic Petri net
SUT : system under test
TG : task graph
TTPN : transition timed Petri net
WL : workload

Symbols

A : application component
A_i : algorithm component
r_{ij} : routine component
s_{jk} : code segment
A : architecture size
A' : scaled architecture size
C(P) : parallel computation costs
E(P) : efficiency
E'(P) : work based efficiency
\eta(P) : efficacy
I : processing speed
k_A : architecture scaling rate
k_W : work scaling rate
\mu : processing rate
\mu^{par} : parallel processing rate
\mu^{seq} : sequential processing rate
\mu(W) : processing rate as a function of work
N : problem size
IN : set of natural numbers
P : number of processing elements
P(t) : activity profile
P^\infty(t) : ideal parallelism profile

APPENDIX A. ABBREVIATIONS AND SYMBOLS

154

P comm (t) : : : : : :: : : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : communication prole P comp (t) : :: : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : computation prole P exec (t) : : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : execution prole P ovh (t) : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : : overhead prole

(x) :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : activity shape comm (x) : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : communication shape comp(x) : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : computation shape exec (x) : : :: : : : : :: : : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : execution shape ovh (x) :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : : :: : : : overhead shape Q(P ) : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : quality of parallelism R(P ) : : : :: : : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : : redundancy IR :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : set of real numbers S (P ) : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : speedup S (Popt N ) : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : :: : : : asymptotic speedup Sfix : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : :: xed load speedup SI (PI opt N ) : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : : :: : : : : ideal asymptotic speedup Smemory : : :: : : : : :: : : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : memory bounded speedup Sscaled : : : :: : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : : xed time speedup Sspeed : : : :: : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : generalized speedup t : : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : time index ti : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : time instant t(P W ) : : :: : : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : time to execute W using P PEs T : : : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : execution time T (1) : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : sequential execution time T (P ) : : : : :: : : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : (total, parallel) execution time function T comm : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : communication time T comp : : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : : :: : : : computation time T dupl : : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : : time spent in duplicated (redundant) computations T ovh : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : : :: : : : : :: : : total time spent for overhead T 
use :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : : :: : : : : :: : : total time spent for useful computations  (p) : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : activity signature  comm (p) : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : communication signature

APPENDIX A. ABBREVIATIONS AND SYMBOLS

155

 comp (p) :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : : :: : : computation signature  exec (p) : : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : execution signature  ovh (p) : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : overhead signature U (P ) : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : : :: : : : : :: utilization W :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : amount of work W 0 : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : scaled amount of work for xed time speedup W : : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : scaled amount of work for xed memory speedup 0 Wscaled : :: : : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : scaled amount of work for work based eciency Wscaled : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : scaled amount of work for time based eciency W (1) : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: total amount of work in sequential program W (P ) :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: total amount of work in parallel program W ovh (P ) : : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : total amount of overhead work W par : : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : parallel fraction of work W par (i) :: : : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : fraction of work with DOP i W seq : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : : :: : : : : :: : : : : : :: : : : : : :: : : : : sequential fraction of work

Appendix B

Definitions and Derivations

B.1 Petri Nets

A Petri net is a tuple PN = (P, T, F, W, Π, μ(0)) where:

(i) P = {p_1, p_2, ..., p_nP} is a finite set of P-elements (places) and T = {t_1, t_2, ..., t_nT} is a finite set of T-elements (transitions) with P ∩ T = ∅ and a nonempty set of nodes (P ∪ T ≠ ∅).

(ii) F ⊆ (P × T) ∪ (T × P) is a finite set of arcs between P-elements and T-elements denoting input flows (P × T) to and output flows (T × P) from transitions.

(iii) W: F → ℕ⁺ assigns weights w(f) to elements f ∈ F denoting the multiplicity of the (otherwise unary) arcs between the connected nodes.

(iv) Π: T → ℕ⁺ assigns priorities π_i to T-elements t_i ∈ T.

(v) μ(0) is a marking vector generated by the marking function M: P → ℕ₀, expressing for every p the number of tokens μ(0)(p) initially assigned to it.

Notation:

I(t) = { p_i | (p_i, t) ∈ F ∩ (P × T) } ... input places of transition t
O(t) = { p_i | (t, p_i) ∈ F ∩ (T × P) } ... output places of transition t
I(p) = { t_i | (t_i, p) ∈ F ∩ (T × P) } ... input transitions of place p
O(p) = { t_i | (p, t_i) ∈ F ∩ (P × T) } ... output transitions of place p
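
The firing rule implied by this definition can be made concrete in a few lines of code. The following is only a minimal sketch in Python, assuming untimed semantics and ignoring priorities during conflict resolution; all identifiers (PetriNet, enabled, fire) are illustrative and not part of the notation above.

    from dataclasses import dataclass

    @dataclass
    class PetriNet:
        # PN = (P, T, F, W, Pi, mu0) as defined above
        places: set
        transitions: set
        arcs: set          # F: pairs (place, transition) or (transition, place)
        weights: dict      # W: arc -> multiplicity (default 1)
        priorities: dict   # Pi: transition -> priority
        marking: dict      # mu: place -> current number of tokens

        def I(self, t):
            # input places of transition t: { p | (p, t) in F }
            return {p for (p, x) in self.arcs if x == t and p in self.places}

        def O(self, t):
            # output places of transition t: { p | (t, p) in F }
            return {p for (x, p) in self.arcs if x == t and p in self.places}

        def enabled(self, t):
            # t may fire if every input place holds at least w(p, t) tokens
            return all(self.marking[p] >= self.weights.get((p, t), 1)
                       for p in self.I(t))

        def fire(self, t):
            # firing consumes tokens on I(t) and produces tokens on O(t)
            assert self.enabled(t)
            for p in self.I(t):
                self.marking[p] -= self.weights.get((p, t), 1)
            for p in self.O(t):
                self.marking[p] += self.weights.get((t, p), 1)

    net = PetriNet({"p1", "p2"}, {"t1"}, {("p1", "t1"), ("t1", "p2")},
                   {}, {"t1": 1}, {"p1": 1, "p2": 0})
    net.fire("t1")   # the token moves from p1 to p2: marking becomes {p1: 0, p2: 1}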

B.2 Series Parallel Graphs

The distribution of the execution time of a series parallel graph G can be defined as follows.

1. G is a single node k ⟹ F_G(t) = F_k(t).

2. G is a serial composition of G_1, G_2, ..., G_n ⟹ F_G(t) = F_{G_1}(t) ⊗ F_{G_2}(t) ⊗ ... ⊗ F_{G_n}(t), the convolution of the subgraph distributions.

3. G is a parallel composition of G_1, G_2, ..., G_n and

3.1. G_i is taken with probability p_i ⟹ F_G(t) = Σ_{i=1}^{n} p_i F_{G_i}(t) (theorem of total distribution)

3.2. the G_i are OR-parallel ⟹ F_G(t) = 1 − Π_{i=1}^{n} (1 − F_{G_i}(t)) (1st order statistic)

3.3. the G_i are AND-parallel ⟹ F_G(t) = Π_{i=1}^{n} F_{G_i}(t) (n-th order statistic)

3.4. k of the n subgraphs are executed in parallel ⟹ F_G(t) = Σ_{j=k}^{n} C(n, j) F_{G_i}(t)^j (1 − F_{G_i}(t))^{n−j} (k-th order statistic, with C(n, j) the binomial coefficient and identically distributed subgraphs G_i)
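
These composition rules can be evaluated numerically once the distributions are discretized. The sketch below is illustrative only; the grid representation (values F(t) at equally spaced points) and all function names are assumptions, not notation from this appendix.

    import numpy as np

    def serial(F1, F2):
        # serial composition: convolve the point masses, re-accumulate the CDF
        f1 = np.diff(F1, prepend=0.0)
        f2 = np.diff(F2, prepend=0.0)
        return np.cumsum(np.convolve(f1, f2))[: len(F1)]

    def probabilistic(Fs, ps):
        # branch G_i taken with probability p_i (theorem of total distribution)
        return np.tensordot(ps, np.asarray(Fs), axes=1)

    def or_parallel(Fs):
        # OR-parallel: done when the first subgraph finishes (1st order statistic)
        return 1.0 - np.prod(1.0 - np.asarray(Fs), axis=0)

    def and_parallel(Fs):
        # AND-parallel: done when all subgraphs have finished (n-th order statistic)
        return np.prod(np.asarray(Fs), axis=0)

    t = np.linspace(0.0, 10.0, 1001)
    F = 1.0 - np.exp(-t)               # exponential node, rate 1
    print(and_parallel([F, F])[-1])    # close to 1.0 at t = 10
    print(serial(F, F)[-1])            # Erlang-2 distribution, also close to 1.0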

B.3 Stochastic Process

A stochastic process is a family of random variables {X(t), t ∈ T}, where t is an index parameter (usually time) and X(t) ∈ S, with S the state space. A stochastic process is described by means of the joint probability distribution function of its random variables:

F_X(x_1, ..., x_n; t_1, ..., t_n) = P{X(t_1) ≤ x_1, X(t_2) ≤ x_2, ..., X(t_n) ≤ x_n}
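
For illustration, this joint distribution function can be estimated empirically by sampling paths of a process; the random walk below is a hypothetical example chosen only to make the definition concrete.

    import numpy as np

    # sample 100000 paths of a simple random walk X(1), ..., X(50)
    rng = np.random.default_rng(0)
    paths = rng.choice([-1, 1], size=(100_000, 50)).cumsum(axis=1)

    # empirical estimate of F(x1, x2; t1, t2) = P{X(t1) <= x1, X(t2) <= x2}
    t1, x1, t2, x2 = 10, 2, 40, 4
    F = np.mean((paths[:, t1 - 1] <= x1) & (paths[:, t2 - 1] <= x2))
    print(F)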

B.4 Queuing Networks

A queuing network model is a directed graph (V, E), where each vertex v_i ∈ V is called a station, and an edge connecting nodes v_i and v_j denotes a flow of jobs from station i to station j. Weights associated with the arcs denote the probability of arriving at station j after finishing service at station i. For each station, the following parameters are specified: the arrival process, the service process, the number of servers, the buffer size, and the queuing discipline.
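
Such a model is thus fully specified by a few per-station parameters plus a routing matrix. The encoding below is one possible sketch; the Station fields and the string-valued process descriptions are assumptions, not a fixed format.

    from dataclasses import dataclass

    @dataclass
    class Station:
        arrival: str       # arrival process, e.g. "Poisson(rate=2.0)"; "-" if internal only
        service: str       # service process, e.g. "Exp(rate=5.0)"
        servers: int       # number of servers
        buffer: float      # buffer size; float("inf") for an unbounded queue
        discipline: str    # queuing discipline, e.g. "FCFS"

    stations = [Station("Poisson(rate=2.0)", "Exp(rate=5.0)", 1, float("inf"), "FCFS"),
                Station("-", "Exp(rate=3.0)", 2, 10, "FCFS")]

    # routing[i][j]: probability of moving to station j after service at station i;
    # a row may sum to less than one, the remainder leaves the network
    routing = [[0.0, 1.0],
               [0.2, 0.0]]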


B.5 Complexity Notation

B.6 Bounds on and Relations between Ratios

Assumptions (recalled from section 3.3):

1. W(1) = T(1), i.e. the amount of work and the time to execute this work are of the same order.

2. T(1)/P ≤ T(P) ≤ T(1), i.e. the parallel execution time is at least the sequential execution time divided by the number of processing elements (superlinear speedup is not considered), and the parallel execution time does not exceed the sequential execution time (a "slowdown" is not considered).

3. T(P) < W(P) ≤ P·T(P), i.e. some of the operations can be performed in parallel, and the total amount of work may not exceed the complexity of the costs.

4. W(1) ≤ W(P) due to possible overhead.

B.6.1 Bounds

Speedup. As T(1)/P ≤ T(P) ≤ T(1) (assumption 2) and S(P) = T(1)/T(P), it follows that 1 ≤ S(P) ≤ P.

Efficiency. As E(P) = S(P)/P and 1 ≤ S(P) ≤ P, it follows that 1/P ≤ E(P) ≤ 1.

Efficacy. As η(P) = S(P)·S(P)/P and 1 ≤ S(P) ≤ P, it follows that 1/P ≤ η(P) ≤ P.

Redundancy. As W(1) ≤ W(P) (assumption 4) and R(P) = W(P)/W(1), it follows that 1 ≤ R(P). As W(1) = T(1) (assumption 1), W(P) ≤ P·T(P) (assumption 3), and R(P) = W(P)/W(1), it follows that R(P) ≤ P·T(P)/T(1), and from assumption 2 (T(P) ≤ T(1)) it follows that R(P) ≤ P.

Utilization. As T(P) < W(P) ≤ P·T(P) (assumption 3) and U(P) = W(P)/(P·T(P)), it follows that 1/P < U(P) ≤ 1.

Quality of Parallelism. As Q(P) > E(P) (see below) and E(P) ≥ 1/P, it follows that Q(P) > 1/P.

B.6.2 Relations

Efficiency and Speedup. As 1 ≤ S(P) and E(P) ≤ 1, it follows that E(P) ≤ S(P). If S(P) = 1, then E(P) = 1/P; thus E(P) < S(P) for all P > 1.

Efficacy and Speedup. As η(P) = S(P)·E(P) and E(P) ≤ 1, it follows that η(P) ≤ S(P).

Efficacy and Efficiency. As η(P) = S(P)·E(P) and S(P) ≥ 1, it follows that η(P) ≥ E(P).

Redundancy and Speedup. As R(P) = W(P)/W(1) and S(P) = T(1)/T(P), with W(1) = T(1) (assumption 1) and T(P) < W(P) (assumption 3), it follows that R(P) < S(P).

Redundancy and Efficiency. As E(P) ≤ 1 and R(P) ≥ 1, it follows that R(P) ≥ E(P).

Utilization and Speedup. As U(P) ≤ 1 and S(P) ≥ 1, it follows that U(P) ≤ S(P).

Utilization and Efficiency. As U(P) = R(P)·E(P) and R(P) ≥ 1, it follows that U(P) ≥ E(P).

Utilization and Efficacy. As U(P) = R(P)·E(P), η(P) = S(P)·E(P), and S(P) > R(P), it follows that U(P) < η(P).

Utilization and Redundancy. As U(P) ≤ 1 and R(P) ≥ 1, it follows that U(P) ≤ R(P).

Quality of Parallelism and Speedup. As Q(P) = S(P)·E(P)/R(P) and R(P) ≥ E(P), it follows that Q(P) ≤ S(P).

Quality of Parallelism and Efficiency. As Q(P) = S(P)·E(P)/R(P) and R(P) < S(P), it follows that Q(P) > E(P).

Quality of Parallelism and Efficacy. As Q(P) = η(P)/R(P) and R(P) ≥ 1, it follows that Q(P) ≤ η(P).

Quality of Parallelism and Utilization. As Q(P) = S(P)·E(P)/R(P) = S(P)·U(P)/(R(P))², it follows that Q(P) < U(P) if and only if S(P) < (R(P))².
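
The definitions collected above translate directly into code. The following sketch (all names are mine, chosen for illustration) computes the six ratios from measured values, after asserting assumptions 1-4:

    def ratios(T1, TP, W1, WP, P):
        assert W1 == T1                  # assumption 1: W(1) = T(1)
        assert T1 / P <= TP <= T1        # assumption 2: no superlinearity, no slowdown
        assert TP < WP <= P * TP         # assumption 3
        assert W1 <= WP                  # assumption 4: overhead only adds work
        S = T1 / TP                      # speedup, 1 <= S(P) <= P
        E = S / P                        # efficiency, 1/P <= E(P) <= 1
        eta = S * E                      # efficacy, 1/P <= eta(P) <= P
        R = WP / W1                      # redundancy, 1 <= R(P) <= P
        U = WP / (P * TP)                # utilization, 1/P < U(P) <= 1
        Q = S * E / R                    # quality of parallelism
        return S, E, eta, R, U, Q

    # e.g. T(1) = W(1) = 100, T(8) = 20, W(8) = 120 on P = 8 PEs:
    print(ratios(100, 20, 100, 120, 8))  # (5.0, 0.625, 3.125, 1.2, 0.75, 2.604...)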

List of Figures

1.1 A Cooperative Performance Modeling Methodology ... 5
1.2 Interaction between user, workload, and system ... 9
1.3 Schematic Representation of the Proposed Methodology ... 15

2.1 Executable and Non-Executable Test Workloads ... 23
2.2 Workload Components and Characteristics ... 26
2.3 Kiviat Diagrams for Characterizing Parallel Programs ... 29
2.4 Different Viewpoints of Analysis and Corresponding Consideration of User Interaction ... 30
2.5 A Schematic Performance Evaluation Cycle ... 32
2.6 A Workload Modeling Methodology ... 34
2.7 Comparing Amdahl's Law and Gustafson's Law ... 40
2.8 Workload Model Classification Scheme ... 47
2.9 Deriving the Load Parameters ... 51

3.1 Schematic Representation of Workload Models ... 58
3.2 Profiles of the Assignment Problem ... 66
3.3 Signatures of the Assignment Program ... 69
3.4 Shapes of the Assignment Problem ... 71
3.5 Execution Times and Derived Ratios ... 73
3.6 Fixed-Load Speedup ... 80
3.7 Fixed-Time Speedup ... 82
3.8 Comparison of Different Scaling Functions for Speedup ... 83
3.9 Graph Models of the Assignment Problem ... 86
3.10 Task Graph of the Assignment Problem ... 91
3.11 Event Graphs of the Assignment Problem ... 94
3.12 Control Flow Graph of the Assignment Problem ... 95
3.13 Petri net Model of the Assignment Problem ... 97
3.14 GERT Network for the Assignment Problem ... 98

4.1 Different Representations for Scalability ... 103
4.2 Examples of Typical Program Structures ... 108
4.3 Task Graph and Parameters for EMBAR ... 129
4.4 Predicted Speedup and Efficiency for EMBAR ... 130
4.5 Work Oriented Scalability Behavior for EMBAR ... 132
4.6 Time Oriented Scalability Behavior for EMBAR ... 133
4.7 Time Oriented Scalability Behavior for HYPO ... 135
4.8 Simulated Relative Speedup and Efficiency for EMBAR ... 136
4.9 Measured Relative Speedup and Efficiencies for EMBAR ... 137
4.10 Task Graph and Parameters for Parallel HH ... 138
4.11 Speedup and Efficiency for HH ... 139
4.12 Time Oriented Scalability Behavior for HH ... 140
4.13 Task Graph and Parameters for Parallel FFT ... 141
4.14 Speedup and Efficiency for FFT on a Hypercube ... 143
4.15 Speedup and Efficiency for FFT on a Grid ... 143
4.16 Time Oriented Scalability Behavior for FFT ... 144
4.17 Task Graph and Parameters for FEM ... 145
4.18 Speedup and Efficiency for FEM ... 146
4.19 Time Oriented Scalability Behavior for FEM ... 147

List of Tables

2.1 Analysis Techniques for TTPNs ... 43
3.1 Measured Events of the Example Program ... 60
3.2 Relative Profile Values for the Third Iteration ... 67
3.3 Deriving the Computation Shape from the Computation Signature ... 70
3.4 Summary of Ratios ... 78
3.5 Comparison of Methods for Characterizing the Sequential and Parallel Fraction ... 85
3.6 Comparison of Directed Graph Models ... 88
4.1 Parameters Characterizing Typical Program Structures ... 110
4.2 Parameters Characterizing Useful Work and Overhead ... 118

Bibliography

[Abra 87] M. Abrams and A. K. Agrawala. "Automated Measurement and Prediction of Unconditionally Synchronizing Distributed Algorithms". In: Proceedings of the Seventh Conference on Distributed Memory Computer Systems, pp. 498-505, IEEE, 1987. The objective of this paper is to study parallel algorithms through measurement and modeling. A flow chart is used to model the behavior of a process in a parallel algorithm. The nodes correspond to phases of computation and directed arcs represent communication, but communication partners are not depicted. Communication may not occur in a conditional statement, and each process loops without termination. The parallel program consists of several processes and is described by its states, which are in turn defined by the states of each process (nodes in the process flow chart). The execution times for each node are assumed to be deterministic. For the purpose of performance modeling, the states of the flow charts are translated into a geometric concurrency model, which is solved to obtain the steady-state trajectory (a sequence and timing of transitions in the flow charts, which is repeated forever). A measurement tool allows automatic instrumentation of the program to obtain the actual trajectory during the execution. Actual and predicted trajectories are compared to validate the model.

[Adve 93] V. S. Adve and M. K. Vernon. "The Influence of Random Delays on Parallel Execution Times". ACM Performance Evaluation Review, Special Issue, Proceedings of the 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, 1993. The authors show that in current and foreseeable shared-memory programs, communication delays merely influence the execution time between synchronization points. Other sources of randomness, like non-deterministic computational requirements, also have no significant influence, thus motivating the use of deterministic models instead of more complex stochastic models. Analytic models and detailed measurements are used to verify these results.

[Agar 92] A. Agarwal. "Performance Tradeoffs in Multithreaded Processors". IEEE Transactions on Parallel and Distributed Systems, Vol. 3, No. 5, pp. 525-539, September 1992. An analytic performance model is presented for the analysis of multithreaded processors. Parameters of influence include cache interference, network contention, context-switching overhead, and data-sharing effects. The results obtained from the analytic model are compared to simulations.

[Agra 92] V. D. Agrawal and S. T. Chakradhar. "Performance Analysis of Synchronized Iterative Algorithms on Multiprocessor Systems". IEEE Transactions on Parallel and Distributed Systems, Vol. 3, No. 6, pp. 739-746, November 1992. In this article an upper bound on the speedup of iterative synchronized algorithms is derived. The algorithm is modeled as a task consisting of a number of atoms, i.e. parts of the execution that cannot be further decomposed and whose execution takes exactly one unit of time. An atom becomes active with a certain probability and has to be executed on the processor where it is allocated. Tasks (and atoms, respectively) are distributed statically among the processes. The total execution is composed of a given number of iterations over this task with synchronization points between two successive iterations, i.e. only after all atoms have finished computation can the next iteration start (fork-join structure). Although in a real application messages are usually exchanged at a synchronization point, the communication time is neglected in this model. The characterizing parameters are the number of iterations, the number of atoms, and the probability for an atom to become active. It is assumed that all atoms have the same probabilities.

[Ajmo 86] M. Ajmone Marsan, G. Balbo, and G. Conte. Performance Models of Multiprocessor Systems. MIT Press Series in Computer Systems, MIT Press, Cambridge, Massachusetts, 1986.

[Akyi 92] I. F. Akyildiz. "Performance Analysis of a Multiprocessor System Model with Process Communication". The Computer Journal, Vol. 35, No. 1, pp. 52-61, 1992. In this article a closed queuing network model is used to analyze process communication in a multiprocessor system. The network contains one multiple server, representing the identical processors in the system, and K delay stations, representing the blocked state of a processor (waiting for communication). After finishing one time slice at the queuing station, a process will either join the ready queue again or, if communication is required, branch to one of the delay stations. The workload is given by the (exponentially distributed) service time of a process. The effect of process communication is considered on the one hand by the delay servers, where for each process the service time is equal to its blocking time, and on the other hand by determining the transition probabilities of the processes. The resulting performance measures include utilization, throughput, response time, mean queue length, and blocking overhead due to communication.

[Alme 88] V. A. F. Almeida and D. A. Menasce. "An Integrated Queueing Network and Stochastic Petri Net Model of Supercomputer Performance". 1988.

[Amda 67] G. M. Amdahl. "Validity of a single-processor approach to achieving large scale computing capabilities". AFIPS Conference Proceedings, Vol. 30 (April 18-20, 1967), pp. 483-485. Amdahl's Law.

[Andr 87] F. Andre and A. Joubert. "SiGLe: An evaluation tool for distributed systems". In: Proceedings of the Seventh Conference on Distributed Memory Computer Systems, pp. 466-472, 1987. A tool is presented for the analysis of parallel systems. The authors characterize a parallel program by its communication structure using a description language. The time for executing sequential parts of the program is obtained from actual execution of these parts (i.e. they must already be implemented); different processor speeds may be simulated. Communication times are determined from the amount of data and the transfer rate (both have to be known, constant values); the synchronization and communication delays are obtained by a simulator tool. SiGLe is a simulation tool for studying the performance of distributed algorithms on a user-defined architecture. Questions that can be answered with this tool are the comparison of various mappings, the choice of an architecture optimal for a given class of applications, and finding the appropriate degree of parallelism for a program to be executed on a given architecture. The input to the tool is the description of the architecture (in the Language for ARchitectures), the specification of the distributed algorithm (using the Language for ALgorithms), and the mapping (using the Implementation Language). LAR, LAL, and IL have a similar and easy syntax; LAR and LAL provide basic components (such as processor, memory, and buses in LAL) that can be aggregated into compound structures (using array and connect constructors). To each type of resource a list of parameters is associated specifying the performance characteristics (e.g. processor speed, memory size, communication speed and bandwidth). The sequential parts of the program may be written in any programming language; only the communication primitives are provided by LAL. In the mapping part, IL instructions are used for associating processes to processors and communication requests to links. Simulation results (currently only provided in textual representation) include total execution time, execution time of processes, utilization of processors and buses, and access conflicts to shared resources (along with the corresponding waiting times for resources).

[Anna 92] M. Annaratone. "MPPs, Amdahl's Law, and Comparing Computers". In: H. Siegel, Ed., Proceedings Frontiers '92, The Fourth Symposium on the Frontiers of Massively Parallel Computation, pp. 465-470, IEEE Computer Society Press, Los Alamitos, California, 1992. By introducing six parameters (fraction of serial execution time, locality of code in the parallel section, fraction of I/O in serial code, fraction of non-local memory references, remote access penalty, and average number of hops that a message has to take) the author derives a bound on speedup, which is used in a case study comparing two hypothetical parallel machines, one with a small number of fast PEs and the other with a large number of slow PEs.

[Azmy 92] Y. Y. Azmy. "Performance and Performance Modeling of a Parallel Algorithm for Solving the Neutron Transport Equation". Journal of Supercomputing, Vol. 6, pp. 211-235, 1992. In this article a parallel algorithm for the neutron transport problem was implemented on an iPSC/2 hypercube. Performance results obtained from measurements were used to construct a performance model for execution time. The load parameters in these models are the problem size (described by three application-specific parameters) and the number of processors. The model was used to predict the scaling behavior of the application.

[Bacc 90] F. Baccelli and A. Makowski. "Synchronization in Queueing Systems". In: H. Takagi, Ed., Stochastic Analysis of Computer and Communication Systems, pp. 57-130, North-Holland, 1990.

C. F. Baillie. \Comparing shared and distributed memory computers". Parallel Computing, No. 8, pp. 101{110, 1988. North Holland.

Bail 91]

D. Bailey, J.Barton, T. Lasinski, and H. Simon. \The NAS Parallel Benchmarks". Tech. Rep. Report RNR-91-002 Revision 2, Numerical Aerodynamic Simulation (NAS) Systems Division, NASA Ames Research Center, Moett Field, USA, Aug. 1991.

Bail 94]

D. H. Bailey, E. Baraszcz, L. Dagum, and H. D. Simon. \NAS Parallel Benchmark Results 3-94". In: Proceedings of the SHPCC'94. Scalable High-Performance Computing Conference, May 23-25, 1994, Knoxville, Tennessee, pp. 111{120, IEEE, IEEE Computer Society Press, Los Alamitos, CA, May 1994. In this paper the recent results on the NAS parallel benchmarks on various architectures are tabulated.

Bart 92]

E. Bartscht and J. Brehm. \A Multiprocessor Communication Benchmark". 1992. Esprit 3, V1.5, Internal Report.

Begu 93]

A. Beguelin, J. Dongarra, A. Geist, W. Jiang, R. Manchek, and V. Sunderam. \PVM 3 User's Guide and Reference Manual". Technical Report ORNL/TM-12187, Oak Ridge National Laboratory, May 1993.

Beil 88]

H. Beilner, J. Mater, and N. Weienberg. \Towards a Performance Modelling Environment: News on HIT". In: Proceedings of the International Conference on Modelling Techniques and Tools for Computer Evaluation, Plenum Press, 1988. HIT is a modeling tool oering an integrated methodology: it operates within a framework designed to be closer to system description than to performance modeling.

Berr 92]

R. Berry. \Computer Benchmark Evaluation and Design, a Case Study". IEEE Transactions on Computers, Vol. 41, No. 10, pp. 1279{1289, October 1992. A case study of the implementation of two scientic benchmarks on a CRAY X-MP is presented. The objective of the study was to determine the impact on performance of the most dominant loops in the applications.

Bert 89a] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Mehtods, Chap. 5.3.1 The Auction Algorithm for the Asiggnment Problem. Prentice Hall, 1989. Bert 89b] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Mehtods, Chap. 1.2.1 Models. Prentice Hall, 1989. Parallel algorithms are represented by directed acyclic graphs specifying what operations are to be performed on what operands imposing certain precedence constraints. In order to specify a parallel algorithm completely, a schedule is dened associating a processor (and a processing time) to each (non-input) node in the DAG. Blas 92]

H. Blaschek, G. Drsticka, A. Ferscha, and G. Kotsis. \Visualization of Parallel Program Behavior". Tech. Rep., University of Vienna, Dept. of Appl. Comp. Sci., 1992. Internal Report.

[Blum 92] W. Blume and R. Eigenmann. "Performance Analysis of Parallelizing Compilers on the Perfect Benchmarks Programs". IEEE Transactions on Parallel and Distributed Systems, Vol. 3, No. 6, pp. 643-656, November 1992. The Perfect Benchmarks have been implemented on an Alliant FX/80 system using on the one hand a modified version of the KAP source-to-source code restructurer and on the other hand the VAST restructurer, built into the Fortran compilation system of the Alliant FX/80 machine. The achieved performance of both parallelizations is compared.

[Bozk 92] Z. Bozkus, S. Ranka, and G. Fox. "Benchmarking the CM-5 multicomputer". In: H. Siegel, Ed., Proceedings Frontiers '92, The Fourth Symposium on the Frontiers of Massively Parallel Computation, pp. 100-107, IEEE Computer Society Press, Los Alamitos, California, 1992. A number of benchmarks are presented for the analysis of the CM-5 architecture. Basic communication primitives are measured, as well as the performance of global operations and complex communication patterns (complete exchange).

[Brad 93] D. K. Bradley and J. L. Larson. "A Parallelism-Based Analytic Approach to Performance Evaluation Using Application Programs". Proceedings of the IEEE, Vol. 81, No. 8, pp. 1126-1135, Aug. 1993. The path to performance is proposed as a framework describing the way from the physical problem formulation to the achieved performance of the problem solution on a parallel architecture. Two tracks are identified on this path, the software track and the architecture track; performance may be lost in each step on the path, and in particular at the point where the two tracks merge (i.e. if the software parallelism does not match the parallelism offered by the hardware). Different current benchmarking techniques (ranging from kernels to application benchmarks) are investigated with respect to where they set their probes in the path to performance. Furthermore, traditional time-oriented performance measures (execution time, speedup, efficiency, utilization) are reformulated as parallelism-oriented metrics. In a case study, measurements of the Perfect Benchmarks on a CRAY Y-MP are performed and analyzed in terms of parallelism-oriented measures. The authors conclude that current benchmarks may not represent the actual workload adequately and that better techniques are needed to examine the suitability of benchmarks. A parallelism-based evaluation may be a first step in the right direction.

[Brin 92] P. Brinch Hansen. "Householder Reduction of Linear Equations". ACM Computing Surveys, Vol. 24, No. 2, pp. 185-194, June 1992.

[Bron 90] E. C. Bronson, T. L. Casavant, and L. H. Jamieson. "Experimental Application-Driven Architecture Analysis of an SIMD/MIMD Parallel Processing System". IEEE Transactions on Parallel and Distributed Systems, Vol. 1, No. 2, pp. 195-205, April 1990. The authors report on performance measurement experiments on a prototype of the PASM parallel architecture. The applications used in the experiment were three different implementations of an FFT algorithm, the first one programmed in SIMD style, the second one in MIMD style, and the third one in a combination of both. Detailed measurement results could be used to extrapolate the execution time, speedup, and efficiency to a larger system.

[Brui 88] A. de Bruin, A. H. G. R. Kan, and H. W. J. M. Trienekens. "A Simulation Tool for the Performance Evaluation of Parallel Branch and Bound Algorithms". Mathematical Programming, Vol. 42, pp. 245-271, 1988.

[Butl 86] J. M. Butler and A. Y. Oruc. "EUCLID: An architectural multiprocessor simulator". In: Proceedings of the Sixth Conference on Distributed Memory Computer Systems, pp. 280-287, 1986. The processing network model for parallel computations contains a set of processors, a set of terminals (which can be I/O facilities, memories, or other components), and a function associating a subset of terminals with each processor. The simulation tool EUCLID supports the analysis of parallel programs (written in the Multisoft programming language) on multiprocessor architectures described in the processing network formalism. Performance statistics of the architecture and of the user's program are available during the simulation run as well as after the simulation. An example is given comparing the performance of an algorithm (finding a real root or zero-crossing of a polynomial) running on a 4-processor and on an 8-processor multiprocessor system.

[Calz 90] M. Calzarossa, R. Marie, and K. Trivedi. "System Performance with User Behavior Graphs". Performance Evaluation, Vol. 11, pp. 155-164, 1990. In this article user behavior graphs are presented as a practical means of workload characterization. Each node in the UBG corresponds to a possible command that the user may issue. Arcs connecting the nodes i and j represent the probability of issuing command j after having issued command i. A UBG can be modeled as a discrete-time Markov chain.

[Calz 93] M. Calzarossa and G. Serazzi. "Workload Characterization: A Survey". Proceedings of the IEEE, Vol. 81, No. 8, pp. 1136-1150, August 1993. The state of the art in workload characterization techniques is presented in this paper. After recalling the basic steps in the construction of workload models, the application of these techniques to selected systems (including parallel systems) is surveyed.

[Calz 95] M. Calzarossa, G. Haring, G. Kotsis, A. Merlo, and D. Tessera. "A Hierarchical Approach to Workload Characterization for Parallel Systems". In: B. Hertzberger and G. Serazzi, Eds., High Performance Computing and Networking, LNCS Vol. 919, pp. 102-109, 1995.

[Cand 91] R. Candlin and N. Skilling. "A Modelling System for the Investigation of Parallel Program Performance". In: Proceedings of the 5th International Conference on Modelling Techniques and Tools for Computer Performance Evaluation, pp. 381-393, February 1991. The MIMD modeling system is a simulation tool for analyzing the performance of a parallel program running on a parallel architecture. The user describes the algorithm (either a synthetic program directly written in SIMULA or derived from an existing program), the architecture (defining parameters for processors and links along with the interconnection topology), and the mapping (calling procedures that set up "load unit" objects), which can either be done by hand or by using built-in mapping strategies. Currently the system is tailored to OCCAM and transputer networks, providing predefined classes and topologies, but it would be easy to define new classes representing different programming languages and architectures. The statistics obtained from the tool include processor and link utilization on the hardware side, and process and channel activity on the software side.

[Cand 92] R. Candlin, P. Fisk, L. Phillips, and N. Skilling. "Studying the Performance of Concurrent Programs by Simulation Experiments on Synthetic Programs". Performance Evaluation Review, Special Issue, 1992 ACM SIGMETRICS, Vol. 20, No. 1, pp. 239-241, June 1992. Poster Session of Extended Abstracts. In this article the influence of several program- and resource-oriented parameters on different mapping strategies is analyzed.

[Cand 93] R. Candlin. "Estimating Performance from the Macroscopic Properties of Parallel Programs". December 1993. University of Edinburgh. In this article an effort was made to identify and qualify those parameters which most influence the performance of a parallel application. Experiments were made to investigate the influence of the following parameters: the number of nodes in the task graph and the average node degree as structure-oriented parameters, and the computation time of a task (assumed to be normally distributed with known mean and standard deviation) and the message length (again assumed to be normally distributed with known mean and standard deviation) as resource-oriented parameters.

[Carl 92] B. Carlson, T. Wagner, L. Dowdy, and P. H. Worley. "Speedup Properties of Phases in the Execution Profile of Distributed Parallel Programs". In: R. Pooley and J. Hillston, Eds., Computer Performance Evaluation '92 - Modelling Techniques and Tools, pp. 83-95, Edinburgh, Scotland, 1992.

[Carm 91] E. A. Carmona and M. D. Rice. "Modeling the Serial and Parallel Fractions of a Parallel Algorithm". Journal of Parallel and Distributed Computing, Vol. 13, No. 3, pp. 286-298, Nov. 1991. In this paper new formulations of speedup and efficiency are given. Instead of using time as the unit of measure, the authors propose a measure in terms of the amount of work, distinguishing between the work accomplished, the work expended, and the work wasted (defined as the difference of the first two). The applicability of these measures is demonstrated on two examples.

[Chan 89] C. K. Chang, Y.-F. Chang, L. Yang, C.-R. Chou, and J.-J. Chen. "Modeling a Real-Time Multitasking System in a Timed PQ Net". IEEE Software, pp. 46-51, March 1989. Performance analysis of today's complex systems requires a model capable of representing both queuing behavior and asynchronous concurrent behavior; thus traditional modeling approaches fail. In this paper an enhanced model is presented combining queuing networks and Petri nets. A graphical modeling tool (TPQN) is provided for describing timed PQ nets. A timed PQ net is a timed Petri net (where the density functions of the enabling time, the firing time, and the finishing time are specified) plus a special set Q denoting a finite set of queues (specified by their entry and exit places, server input and output conditions, a queuing discipline, and the queue size). A queue is a special transition holding several waiting jobs. To simulate a TPQN model, TPQL (a simulation language) and TPQS (the simulator itself) are provided. A real-time multiprocessor scheduler is modeled as an example.

[Chen 92] S. P. Cheng and S. Dandamudi. "Performance Analysis of a Hierarchical Task Queue Organization for Parallel Systems". In: M. T. Liu, Ed., Proceedings of the 12th International Conference on Distributed Computing Systems, pp. 286-293, IEEE Computer Society Press, Los Alamitos, California, 1992. In this article queuing network models are used to compare centralized and decentralized task queue organizations in parallel systems. The workload is assumed to have a fork-and-join job structure, where each job is assumed to consist of a set of independent tasks which may be executed concurrently on any processor. There is no communication among tasks within a job. The workload is modeled by an arrival rate of jobs and a random variable giving the number of tasks per job. Each task is assumed to have an exponentially distributed execution time with given mean.

Chim 88] P. F. Chimento and K. S. Trivedi. \The Performance of Block Structured Programs on Processors Subject to Failure and Repair". In: E. Gelenbe, Ed., High Performance Computer Systems, pp. 269{280, North-Holland, Amsterdam, 1988. Chio 93a] G. Chiola, M. Ajmone Marsan, G. Balbo, and G. Conte. \Generalized Stochastic Petri Nets: A Denition at the Net Level and its Implications". IEEE Transactions on Software Engineering, Vol. 19, No. 2, Feb. 1993. Chio 93b] G. Chiola, C. Dutheillet, G. Franceschinis, and S. Haddad. \Stochastic Well-Formed Coloured Nets and Symmetric Modelling Applications". IEEE Transactions on Computers, Vol. 42, No. 11, pp. 1343{1360, Nov. 1993. Chri 91]

B. Christianson. \Amdahl's Law and the End of System Design". Performance Evaluation Review, Vol. 19, No. 2, pp. 30{32, August 1991. In this article theoretical bounds on speedup are derived, considering the overhead due to communication, where architectural characteristics (interconnection topology) are taken into account.

BIBLIOGRAPHY

172

Chu 84]

W. Chu and M.-T. L. andJ. Hellerstein. \Estimation of Intermodule Communication". IEEE Transactions on Computers, Vol. 33, No. 8, pp. 691{699, Aug. 1984.

Cosn 94]

M. Cosnard and M. Loi. \Automatic Task Graph Generation Techniques". 1994. Preprint. A compact, problem size independent model, called parametrized task graph is proposed. A method is presented to extract such a task graph from an annotated sequential program. A short survey of related work is included.

Cuba 91] P. H. Cubaud. \Evaluation of Parallel Programs Completion Time using Bilogic PERT Networks". Tech. Rep. 91-7, EHEI, Ecole des Hautes Etudes en Informatique, 45, rue des SaintsPeres, 75006 Paris, France, September 1991. In this article a bilogic extension of PERT networks (called BIPERT) is proposed as a suitable model for parallel programs. In a BIPERT network, ingoing and outgoing links may either be inclusive (IN) or exclusive (EX), thus allowing to represent amongst parallelism also conditional branches and alternative parallelism, but also dead ends (if a node spawns several inclusive outgoing links, which are collected later on in an exclusive ingoing). Graphs containing such dead constructs are prohibited and called unlegal. A network where all nodes are IN/IN or EX/EX types, belongs to the class of series-parallel graphs, where exact analytic solution techniques are feasible. All other networks are to be solved using simulation or approximation techniques. Cyph 93] R. Cypher, A. Ho, S. Konstantinidou, and P. Messina. \Architectural Requirements of Parallel Scientic Applications with Explicit Communication". In: Proceedings of the 20th International Symposium on Computer Architecture, May 16-19 1993, San Diego, California, May 1993. The authors investigate the behavior of several parallel scientic applications by quantifying the following parameters: memory requirements, I/O requirements, computation versus communication ratio, and characteristics of message trac. The authors develop an analytical model for studying the eects of increasing problem size and increasing number of processors on these parameters for a set of scientic applications. Daup 93] P. Dauphin, M. Kienow, V. Mertsiotakis, A. Quick, and F. Sotz. \PEPP: Performance Evaluation of Parallel Programs. User's Guide - Version 3.0". Tech. Rep., IMMD, Universitat Erlangen-Nurnberg, Germany, 1993. Deme 91] I. M. Demeure and G. J. Nutt. \The VISA Distributed Computation Modeling System". In:

Proceedings of the 5th International Conference on Modelling Techniques and Tools for Computer Performance Evaluation, pp. 133{147, Torino, Italy, February 1991. When developing

parallel and distributed algorithms, partitioning the computation into processes and designing ecient interprocess communication strategies are challenging tasks. VISA is an environment for modeling and simulating dierent partitioning and communication strategies for large-grained MIMD computations. The parallel, distributed computation graph model (ParaDiGM), a formal graph model, is used as the basis for VISA's operation. The user can create, edit and

[Dona 92] S. Donatelli and C. Anglano. "On the use of communication graphs in automatic mapping". 1992. Submitted to the 6th International Conference on Modelling Techniques and Tools for Computer Performance Evaluation, Edinburgh, 1992. The authors investigate the use of communication graphs as a suitable model of a parallel algorithm in mapping studies. It turns out that this model has several weaknesses, even if applied to applications whose run-time behavior can be determined by static analysis.

[Dong 93] J. J. Dongarra. "Performance of Various Computers Using Standard Linear Equations Software". Tech. Rep. CS-89-85, University of Tennessee and Oak Ridge National Laboratory, Sep. 1993.

[Dowd 93] L. Dowdy and M. Leuze. "On Modeling Partitioned Multiprocessor Systems". Int. Journal of High Speed Computing, 1993. (to appear).

[Eage 89] D. Eager, J. Zahorjan, and E. Lazowska. "Speedup versus Efficiency in Parallel Systems". IEEE Trans. on Computers, Vol. 38, No. 3, pp. 408-423, 1989.

[El R 90] H. El-Rewini and T. G. Lewis. "Scheduling Parallel Program Tasks onto Arbitrary Target Machines". Journal of Parallel and Distributed Computing, Vol. 9, pp. 185-202, 1990. Task Grapher is a tool for studying optimal parallel program task scheduling on (arbitrarily interconnected) processors. Once the parallel program (represented as a task graph) along with the interconnection topology of the target machine (drawn in a topology window) is entered, the user can apply several heuristics for scheduling. Task Grapher provides the following displays: a Gantt chart showing the schedule, a speedup line graph, a critical path in the task graph, bar charts for processor utilization and average processor efficiency, and a dynamic activity display (animation of the topology window).

[Erha 91] W. Erhard, A. Grefe, M. Gutzmann, and D. Pöschle. Vergleich von Leistungsbewertungsverfahren für Unkonventionelle Rechner. Vol. 24/5 of Arbeitsberichte des Instituts für Mathematische Maschinen und Datenverarbeitung, Universität Erlangen-Nürnberg, July 1991.

[Fagi 90] B. S. Fagin and A. M. Despain. "The Performance of Parallel Prolog Programs". IEEE Transactions on Computers, Vol. 39, No. 12, pp. 1434-1445, Dec. 1990.

[Fahr 93] T. Fahringer and H. P. Zima. "A Static Parameter based Performance Prediction Tool for Parallel Programs". In: Proc. 1993 ACM Int. Conf. on Supercomputing, July 1993, Tokyo, Japan, ACM, 1993. In this paper characteristics are collected from profiling the execution of a sequential run of the program. The obtained parameters (frequency information, loop counts, and true ratios) serve as the input to a performance prediction tool, where parameters characterizing the behavior of the parallel program (work distribution, number and amount of data transfers, transfer times, network contention, cache misses) are computed. The performance prediction tool, which is part of the Vienna Fortran Compilation Toolset, allows the analysis and comparison of different parallelization strategies.

[Ferr 78] D. Ferrari. Computer Systems Performance Evaluation. Prentice-Hall, Englewood Cliffs, New Jersey, 1978.

[Ferr 83] D. Ferrari, G. Serazzi, and A. Zeigner. Measurement and Tuning of Computer Systems. Prentice-Hall, 1983.

[Ferr 84] D. Ferrari. "On the Foundations of Artificial Workload Design". In: Proc. ACM SIGMETRICS Conference, pp. 8-14, 1984.

[Ferr 88] D. Ferrari and S. Zhou. "An Empirical Investigation of Load Indices For Load Balancing Applications". In: P.-J. Courtois and G. Latouche, Eds., Performance '87, pp. 515-528, North-Holland, 1988. In this paper the effects on performance of load indices used in dynamic load balancing strategies and the characteristics of workloads are studied.

[Fers 90a] A. Ferscha. Modellierung und Leistungsanalyse Paralleler Systeme mit dem PRM-Netz Modell. PhD thesis, University of Vienna, Institute of Statistics and Computer Science, May 1990. When analyzing the performance of parallel systems, the performance of the hardware itself, the structure of the parallel program, and the assignment of tasks to resources must be taken into account. Program Resource Mapping (PRM) net models have been introduced providing a graphical formalism (Petri nets) for describing and analyzing all three aspects. Modeling examples are given for some important programming paradigms (pipelining, systolic computation, and compute-aggregate-broadcast).

[Fers 90b] A. Ferscha. "The PRM-Net Model - An Integrated Performance Model for Parallel Systems". Tech. Rep., Austrian Center for Parallel Computation, University of Vienna, 1990.

[Fers 91] A. Ferscha and G. Kotsis. "Eliminating Routing Overheads in Neural Network Simulation Using Chordal Ring Interconnection Topologies". Tech. Rep. ACPC/TR 91-8, Austrian Center for Parallel Computation, February 1991.

[Fers 92] A. Ferscha. "A Petri Net Approach for Performance Oriented Parallel Program Design". Journal of Parallel and Distributed Computing, Vol. 15, pp. 188-206, August 1992. In this paper a method is proposed combining the program and resource models in a single net using PRM-nets. The application as well as the architecture is modeled as a generalized stochastic Petri net; the mapping can also be expressed within the Petri net formalism. For each transition in the program net representing the execution of a program part, resource demands can be specified. When assigning processors from the resource net to this transition, the actual firing time will be computed considering the architectural parameters associated with the node representing the processor in the resource graph. The same technique is applied for determining the communication times. Therefore the model allows a clear separation between the description of the workload and the description of the architecture, but still provides an easy-to-use formalism to combine both into a performance model.

[Fers 94] A. Ferscha and J. Johnson. "Performance Oriented Development of SPMD Programs Based on Task Structure Specifications". In: B. Buchberger and J. Volkert, Eds., Parallel Processing: CONPAR 94 - VAPP VI, pp. 51-65, Springer Verlag, 1994.

[Fers 95a] A. Ferscha and B. Gruber. "Prototype Graphical Communication Pattern Editor". Project Report D6H-2, CEI-Project, University of Vienna, Feb. 1995.

[Fers 95b] A. Ferscha and J. Johnson. "N-MAP: A Virtual Processor Discrete Event Simulation Tool for Performance Prediction in CAPSE". In: Proceedings of the HICSS-28, pp. 276-285, IEEE Computer Society Press, 1995.

[Fros 94] V. S. Frost and B. Melamed. "Traffic Modeling for Telecommunications Networks". IEEE Communications Magazine, pp. 70-80, March 1994.

[Gele 86] E. Gelenbe, E. Montagne, and R. Suros. "A Performance Model of Block Structured Parallel Programs". In: M. Cosnard, P. Quinton, Y. Robert, and M. Tchuente, Eds., Parallel Algorithms and Architectures, pp. 127-138, North-Holland, Amsterdam, 1986. In this article the structure of the task graph representing a parallel program is not given deterministically but follows certain stochastic rules. The workload characteristics (number of tasks, maximum and average parallelism) are derived.

[Gele 89] E. Gelenbe. Multiprocessor Performance. John Wiley & Sons, Chichester, 1989.

[Gemu 93] A. J. C. van Gemund. "Performance Prediction of Parallel Processing Systems: The PAMELA Methodology". In: Proc. 1993 ACM Int. Conf. on Supercomputing, July 1993, Tokyo, Japan, ACM, 1993. A performance prediction methodology is presented for the analysis of parallel systems. A performance modeling language is specified for the description of the parallel algorithm and the parallel machine. A resource-oriented approach is used, specifying the dependence structure and the demands for computation and communication on the corresponding resources (processors, communication network). The structure is described in a program description language. The parallel architecture is also modeled in this language. Performance prediction is based on the method of serialization analysis.

[Gera 93] A. Gerasoulis and T. Yang. "On the Granularity and Clustering of Directed Acyclic Task Graphs". IEEE Transactions on Parallel and Distributed Systems, Vol. 4, No. 6, pp. 686-701, June 1993. The problem of scheduling typically consists of two parts: allocating the tasks of the program to the processors (considering the precedence relation specified by a directed acyclic task graph) and establishing an order for the execution of tasks on a processor. The allocation part is frequently called clustering if an infinite number of totally connected processing elements is assumed. In this work nonlinear and linear clustering techniques are introduced and compared. A new metric for quantifying the granularity of a task graph is proposed. It is proven that for coarse grain task graphs every nonlinear clustering (where parallel tasks are mapped into the same cluster) can be transformed into a linear clustering of better or at least equal performance (smaller execution time).

BIBLIOGRAPHY

176

coarse grain task graphs every nonlinear clustering (where parallel tasks are mapped in the same cluster) can be transformed in a linear clustering of better or at least equal performance (small execution time). Ghos 88] D. Ghosal, L. N. Bhuyan, and U. Choudhury. \Approximate Analysis of Task Graphs for Parallel Processing Systems". Performance Evaluation Review, Special Issue, 1988 ACM SIGMETRICS, Vol. 16, No. 1, p. 274, May 1988. Poster Session Abstract. Ghos 90] D. Ghosal and L. N. Bhuyan. \Performance Evaluation of a Dataow Architecture". IEEE Transactions on Computers, Vol. 39, No. 5, pp. 615{627, May 1990. In this paper a closed queuing network model is used for analyzing the performance of the Manchester dataow computer. The average parallelism derived from a data ow graph is used to represent the population in a closed queuing network modeling the Manchester data ow computer. In order to solve this model for the dynamic dataow architecture, which is of nonproduct form, ctitious servers are added thus approximating the NPFQN by a PFQN. The accuracy of the approxiamtion depends on the proportion of dyadic instructions in the task graph. The results obtained from the model are compared to experimental results and the error was always less than 10 percent. Ghos 92] D. Ghosal, S. K. Tripathi, G. Serazzi, S. Noh, P. Lenzi, and A. K. Agrawala. \Characterizing Parallel Program Behavior: An Experimental Study". 1992. The objective of this paper is to investigate the sensitivity of performance measures on workload algorithm parameters. The execution signature and communication and computation proles are obtained from the execution of selected example programs running on a Connection Machine and on a Transputer based system. From those characteristics even more concise parameters (average parallelism, maximum parallelism, processor working set) have been derived. The usability of these parameters in scheduling problems is discussed. Goet 93]

N. Goetz, U. Herzog, and M. Rettenbach. \Multiprocessor and Distributed System Design: The Integration of Functional Specication and Performance Analysis Using Stochastic Process Algebras". In: Tutorial Proceedings of the 16th IFIP W.G. &.3 International Symposium on

Computer Performance Modelling, Measurement and Evaluation, PERFORMANCE '93, Lecture Notes in Computer Science, Springer Verlag, 1993. In this tutorial stochastic process alge-

bras are introduced for modeling the performance and behavior of parallel and distributed systems. The main advantage of this approach is, that it combines formal specication techniques with performance analysis. Extensions of classical process algebras are presented providing the necessary parameters for performance analysis. Several examples illustrate this approach.

Gohb 91] I. Gohberg, I. Koltracht, A. Averbuch, and B. Shoham. "Timing analysis of a parallel algorithm for Toeplitz matrices on a MIMD parallel machine". Parallel Computing, Vol. 17, pp. 563-577, 1991. In this article the implementation of a parallel Levinson-type algorithm for Toeplitz matrices on a shared-memory architecture is analyzed. The authors derive a simple processing timing model, dependent on the number of processors and the problem size, the time factor
for computing within one iteration, and the overhead time factor due to communication per iteration (obtained from measurements using a least squares fit method). The model was used to determine the optimum number of processors. Unfortunately, the optimality criterion is not given by the authors. Gram 93] A. Y. Grama, A. Gupta, and V. Kumar. "Isoefficiency: Measuring the Scalability of Parallel Algorithms and Architectures". IEEE Parallel and Distributed Technology, Vol. 1, No. 3, pp. 12-21, Aug. 1993. The scalability of parallel systems is investigated using the isoefficiency function. Isoefficiency allows the investigation of program-specific and machine-specific characteristics. Gran 92] M. Granda, J. M. Drake, and J. A. Gregorio. "Performance Evaluation of Parallel Systems by Using Unbounded Generalized Stochastic Petri Nets". IEEE Transactions on Software Engineering, Vol. 18, No. 1, pp. 55-71, January 1992. In this paper techniques are presented for the analysis of parallel systems using unbounded stochastic Petri nets. A method based on state aggregation is presented to overcome model complexity. Gunt 88] N. J. Gunther. "PARCBench: A Benchmark Methodology for Multiprocessors". Performance Evaluation Review, Special Issue, 1988 ACM SIGMETRICS, Vol. 16, No. 1, p. 277, May 1988. Poster Session Abstract. Gupt 93] A. Gupta and V. Kumar. "The Scalability of FFT on Parallel Computers". IEEE Transactions on Parallel and Distributed Systems, Vol. 4, No. 8, pp. 922-932, August 1993. Gust 88]

J. L. Gustafson. "Reevaluating Amdahl's Law". Communications of the ACM, Vol. 31, No. 5, pp. 532-533, May 1988. The author shows that Amdahl's law, which gives an upper bound for speedup, is unsuitable for the concept of massive parallelism. Amdahl's law states that speedup is bounded by (s + p)/(s + p/N), where s is the sequential part of a program, p is the parallel part, and N is the number of processors; it is based on the assumption that the parallel fraction p is independent of the problem size and of N (fixed-size speedup). In this article it is argued that this is virtually never the case. Under the assumption that the parallel part of a program scales with the problem size, a scaled speedup (suggested by E. Barsis at Sandia) seems more realistic: (s + p·N)/(s + p), where p is now the parallel time spent on the parallel system, so that speedup is bounded by s + p·N. If s + p is set to 1 for algebraic reasons, this function is a simple line of slope (1 - N) in the serial fraction s.
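To make the contrast concrete, a minimal Python sketch of the two bounds (the serial fraction s = 0.05 is an invented value for illustration; both formulas are the ones quoted in the entry above):

    def amdahl_speedup(s, n):
        # Fixed-size speedup: serial fraction s, parallel fraction 1 - s.
        return 1.0 / (s + (1.0 - s) / n)

    def scaled_speedup(s, n):
        # Gustafson/Barsis scaled speedup, with s + p normalized to 1.
        return s + (1.0 - s) * n

    for n in (16, 256, 1024):
        print(n, round(amdahl_speedup(0.05, n), 1), scaled_speedup(0.05, n))

For N = 1024 the fixed-size bound saturates near 1/s = 20, while the scaled bound grows linearly in N.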

Gust 91]

J. Gustafson, D. Rover, S. Elbert, and M. Carter. "The Design of a Scalable Fixed-Time Computer Benchmark". Journal of Parallel and Distributed Computing, Vol. 12, No. 4, pp. 388-401, August 1991.

Haba 90] D. Haban and D. Wybranietz. "A Hybrid Monitor for Behavior and Performance Analysis of Distributed Systems". IEEE Transactions on Software Engineering, Vol. 16, No. 2, pp. 197-211, February 1990. Hari 93]

G. Haring and G. Kotsis, Eds. Performance Measurement and Visualization of Parallel Systems. Vol. 7 of Advances in Parallel Computing, G. R. Joubert and U. Schendel (Series Eds.), North-Holland, 1993.

Hari 94]

G. Haring, S. Musil, and G. Kotsis. "Monitoring Parallel Processing Systems with Distributed Memory". Tech. Rep., University of Vienna, Sep. 1994. Final report for research contract BMWF GZ 613.542/1-26/92.

Hari 95]

G. Haring and G. Kotsis. "Workload Modeling for Parallel Processing Systems". In: P. Dowd and E. Gelenbe, Eds., MASCOTS'95, Proc. of the 3rd Int. Workshop on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, pp. 8-12, IEEE Computer Society Press, 1995. Invited Paper. http://www.ani.univie.ac.at/~gabi/papers/mascots95.abs.

Hart 90]

E. Hart and S. Flavell. "Prototyping Transputer Applications". In: H. Zedan, Ed., Real-Time Systems with Transputers, pp. 241-247, Occam User Group, IOS Press, Amsterdam, 1990.

Hart 92]

F. Hartleb and V. Mertsiotakis. "Bounds for the Mean Runtime of Parallel Programs". In: R. Pooley and J. Hillston, Eds., Computer Performance Evaluation '92: Modelling Techniques and Tools, Proceedings of the Sixth International Conference on Modelling Techniques and Tools for Computer Performance Evaluation, Edinburgh, Scotland, 16-18 September 1992, pp. 197-210, Edinburgh Univ. Press, Edinburgh, Scotland, 1992. ISBN 0-7486-0425-1. Stochastic graph models are frequently used in performance analysis of parallel systems, since they can represent the characteristics of a parallel program adequately. Whenever the timing costs associated with the nodes are generally distributed, the analysis of such graphs is a complex task, even if their structure is restricted to series-parallel graphs. In this paper three different methods are presented to calculate lower and upper bounds on the execution time of a parallel program represented by a stochastic graph. The graph can contain basic nodes, parallel nodes, cyclic nodes, and hierarchical nodes. The execution time of tasks (nodes) is given by a numerical distribution.
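The paper's three bounding methods are more refined than the following, but the basic idea of bracketing the mean runtime of a parallel (fork-join) node can be sketched in Python; the two branch distributions are invented, and the bounds used are the elementary ones: max of means <= E[max] <= sum of means for independent, nonnegative branch runtimes.

    import random

    def mean(xs):
        return sum(xs) / len(xs)

    # Two parallel branches with hypothetical numerical runtime
    # distributions, given as lists of equiprobable sample values.
    branch_a = [2.0, 4.0, 6.0]   # mean 4.0
    branch_b = [3.0, 5.0]        # mean 4.0

    lower = max(mean(branch_a), mean(branch_b))   # E[max] >= max of means
    upper = mean(branch_a) + mean(branch_b)       # E[max] <= sum of means

    # Check against a Monte Carlo estimate of the fork-join runtime.
    est = mean([max(random.choice(branch_a), random.choice(branch_b))
                for _ in range(100_000)])
    print(lower, est, upper)   # roughly 4.0, 4.83, 8.0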

Hart 93]

F. Hartleb. "Stochastic Graph Models for Performance Evaluation of Parallel Programs and the Evaluation Tool PEPP". February 1993. University of Erlangen-Nürnberg, IMMD 7, Internal Report 3/93. Several methods for predicting the execution time of parallel programs are presented, based on a stochastic graph model. The analysis techniques include series-parallel reduction and bounding methods. These techniques have been implemented in a modeling tool called PEPP.

Helm 90] D. P. Helmbold and C. E. McDowell. "Modelling Speedup (n) Greater than n". IEEE Transactions on Parallel and Distributed Systems, Vol. 1, No. 2, pp. 250-256, Apr. 1990. In this article
an effort is made to explain superunitary speedup. When estimating speedup based on the serial and parallel amount of work and the costs to perform one unit of sequential work, Amdahl's law is obtained as a bound on speedup. A more detailed characterization of the workload may help in explaining superunitary speedup. In some applications superunitary speedup was detected because the costs per operation increase with the number of processes on a processor for some operations (e.g. resource management). To detect and show such behavior, it is necessary to distinguish not only between the serial and parallel fraction of work, but also the fraction of those particular operations. Several other examples are given by the authors (including analysis of cache size). Herz 91]

U. Herzog. "Performance Evaluation and Formal Description". In: V. A. Monaco and R. Negrini, Eds., Proceedings of the 5th Annual European Computer Conference on Advanced Computer Technology, Reliable Systems and Applications, pp. 750-756, IEEE, May 1991. The author observes two major trends in analytic performance modeling: a drastic change in the modeling methodology (unique, monolithic approaches are replaced by two separate models, one representing the system load, the other representing the system configuration) and gradual changes in the model-description techniques (standard QNMs are no longer adequate). The trend goes towards a unified specification and evaluation technique which ideally possesses the following properties:

- a unified system description and analysis with respect to functionality, structure, timing, and possibly other predicates

- a conflict-free representation of all system aspects
- a methodology supporting the analysis and synthesis of complex systems (hierarchical models)
- being mathematically tractable (efficient solution algorithms and tools)

An approach is proposed integrating functional specification, time representation, and performance analysis. Hick 92]

T. J. Hickey, J. Cohen, H. Hotta, and T. Petitjean. "Computer-Assisted Microanalysis of Parallel Programs". ACM Transactions on Programming Languages and Systems, Vol. 14, No. 1, pp. 54-106, Jan. 1992. In this article a parallel program is characterized by the analysis of execution traces. Such a trace is a collection of time-stamped events collected during the execution on a real architecture. An event marks the occurrence of a primitive action performed on a processor. Based on this event trace, the execution graph and the event graph are derived. The execution graph is a machine-dependent representation of the program. Each node in the execution graph corresponds to the occurrence of an action associated with a process and a processor. Directed arcs connecting the nodes represent precedence relations. The event graph is a machine-independent representation, which is obtained from the execution graph by removing all machine-dependent
information (i.e. processor numbers, duration between the occurrence of events). Now a node represents the occurrence of an action associated with a process. Finally, a timing function may be derived from the event graph, which is an expression relating the program execution time to the execution time of primitive actions. Hill 94]

J. Hillston. A Compositional Approach to Performance Modelling. PhD thesis, University of Edinburgh, 1994.

Ho 93]

C. Ho, S. A. Mabbs, and K. E. Forward. "Performance modeling of the MR-1 multiprocessor using extended deterministic and stochastic Petri nets". Computer System Science & Engineering, Vol. 8, No. 4, pp. 195-209, Oct. 1993. In this article a clustered shared-memory multiprocessor (MR-1) is modeled using extended deterministic and stochastic Petri nets. The workload is considered on the one hand by determining the probabilities for communication and memory access requests, and on the other hand by determining the firing rate of the transitions representing the computation and the memory access times.

Hoar 78]

C. A. R. Hoare. "Communicating Sequential Processes". Communications of the ACM, Vol. 21, No. 8, Aug. 1978.

Hoar 85]

C. A. R. Hoare. Communicating Sequential Processes. Series in Computer Science, Prentice Hall International, UK, 1985.

Hous 90]

C. E. Houstis. "Module Allocation of Real-Time Applications to Distributed Systems". IEEE Transactions on Software Engineering, Vol. 16, No. 7, pp. 699-709, July 1990. In this article a parallel application is modeled by a stochastic dataflow graph, which is a weighted precedence and-or logic graph, where the node and link weights represent the computation and communication requirements. The weights are exponentially distributed random variables with given mean. Because of the existence of conditional branches and loops, it is not known in advance how often a module will actually be executed or how much data will be transferred. But the execution frequencies can be obtained from the analysis of the Markov chain described by the stochastic dataflow graph. After obtaining these frequencies, it is possible to derive a deterministic dataflow graph, which has a simpler structure. Now the weights of the nodes give the total processing requirements and the weights of the links, which are now undirected, give the total communication demands. Parameters derived from both the stochastic and the deterministic graph are used to model the workload of real-time systems for the purpose of developing optimum allocation strategies with respect to minimizing the total run time of the application.
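A small sketch of the frequency derivation described above (the branch probabilities are invented; the expected visit counts v solve v = e + P^T v for a transient chain and are obtained here by fixed-point iteration):

    # prob[i][j] = probability that module j executes right after module i.
    # Module 0 is the entry; the absorbing exit is implicit (rows may sum
    # to less than 1).
    prob = [
        [0.0, 0.6, 0.4],   # after module 0: 60% -> 1, 40% -> 2
        [0.0, 0.3, 0.0],   # module 1 repeats itself with probability 0.3
        [0.0, 0.0, 0.0],   # module 2 always exits
    ]

    # Expected visit counts: v_j = e_j + sum_i v_i * prob[i][j],
    # with e = (1, 0, 0) marking one entry at module 0.
    v = [1.0, 0.0, 0.0]
    for _ in range(1000):
        v = [(1.0 if j == 0 else 0.0) +
             sum(v[i] * prob[i][j] for i in range(3))
             for j in range(3)]
    print(v)   # approx [1.0, 0.857, 0.4]

Multiplying each module's visit count by its mean demand then yields the node weights of the deterministic graph.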

Hwan 93a] K. Hwang. Advanced Computer Architecture: Parallelism, Scalability, Programmability, Chap. 1, Parallel Computer Models, pp. 3-50. McGraw-Hill International Publishers, 1993.

Hwan 93b] K. Hwang. Advanced Computer Architecture: Parallelism, Scalability, Programmability, Chap. 3, Principles of Scalable Performance, pp. 105-150. McGraw-Hill International Publishers, 1993. Iane 94] G. Iannello. "Communication Workload Analysis for Symmetric Concurrent Systems". Journal of Parallel and Distributed Computing, Vol. 20, pp. 225-235, 1994. In this article the communication aspect of the workload is modeled for a class of data-parallel algorithms, characterized by alternating phases of (individual, independent) computations and data exchange (synchronization). An algorithm is modeled as a directed graph, where the nodes represent tasks and the arcs represent communication. Weights associated with the arcs represent the data transfer volume. Together with information on the architecture topology and the mapping, a hardware communication graph is derived. In this graph, only those processors and links are shown which are involved in the communication demands of a given process. The links are directed and weighted with the aggregation of all weights from the corresponding demands in the algorithm graph for this process. If the algorithm graph is symmetric, an expression for communication overhead can be derived. Iaze 93] G. Iazeolla and F. Marinuzzi. "LISPACK - A Methodology and Tool for the Performance Analysis of Parallel Systems and Applications". IEEE Transactions on Software Engineering, Vol. 19, No. 5, pp. 486-502, May 1993. LISPACK is a tool for the performance prediction of complex parallel systems (architecture and algorithm) using Markov chain analysis techniques. The proposed methodology uses string manipulation, lumping, and recursive simulation to reduce the complexity of model analysis. In a case study the usability of the approach is demonstrated. Ibar 90] O. H. Ibarra and S. M. Sohn. "On Mapping Systolic Algorithms onto the Hypercube". IEEE Transactions on Parallel and Distributed Systems, Vol. 1, No. 1, pp. 48-63, January 1990. In this paper strategies for mapping systolic algorithms onto a hypercube are investigated. The parallel algorithm is represented as a time-space graph, where the nodes represent the states of the cells in the systolic array and directed edges represent dependencies of the states. Selected algorithms have been implemented on an NCUBE/7 to validate the techniques. Jain 91a] R. Jain. The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling. John Wiley and Sons, Inc., 1991. Jain 91b] R. Jain. The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling, Chap. 6, Workload Characterization Techniques, pp. 71-92. John Wiley and Sons, Inc., 1991. Jako 93] A. Jakobs and R. W. Gerling. "Scaling aspects for the performance of parallel algorithms". Parallel Computing, Vol. 19, pp. 1063-1073, 1993. In this article a detailed analysis of a two-dimensional cellular automata simulation was performed to derive application-specific expressions for the computation and communication time for each phase of the algorithm. Based on these expressions, speedup and the scaling behavior were modeled and compared to measurements on a transputer-based system and on a SUPRENUM cluster.
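In the spirit of the phase analysis just described, though with invented cost functions rather than the paper's measured, application-specific expressions, a phase-based speedup model can be sketched as follows:

    # Illustrative phase model: per sweep, a compute phase whose work
    # splits across n processors, plus a boundary exchange (assumed forms).
    def t_compute(cells, n, t_cell=1e-6):
        return cells / n * t_cell            # perfectly divided updates

    def t_comm(cells, n, t_word=1e-5):
        side = int(cells ** 0.5)
        return 2 * side / (n ** 0.5) * t_word  # boundary of a square block

    def speedup(cells, n):
        t1 = t_compute(cells, 1)             # sequential run: no exchange
        return t1 / (t_compute(cells, n) + t_comm(cells, n))

    for n in (4, 16, 64):
        print(n, round(speedup(10**6, n), 1))   # 3.8, 14.8, 55.2

As in the paper, communication grows more slowly than computation shrinks, so efficiency degrades gracefully with the processor count for a fixed problem size.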

Jonk 93]

H. Jonkers. "Introduction to Probabilistic Performance Modelling of Parallel Applications". Tech. Rep. 1-68340-44(1993)04, Delft University of Technology, April 1993. This report surveys techniques for performance modeling of parallel applications. The presented techniques include queuing networks, Petri nets, task graphs, and hybrid approaches. The techniques are compared with respect to their expressive power, analytical complexity, and accuracy. A case study (LU factorization) is included.

Kant 88]

K. Kant. "Application Level Modelling of Parallel Machines". Performance Evaluation Review, Special Issue, 1988 ACM SIGMETRICS, Vol. 16, No. 1, pp. 83-93, May 1988. In this article the workload consists of a continuous flow of applications. Applications are assumed to arrive at the parallel machine in a Poisson arrival stream, requiring a certain substructure of processing elements. A fixed set of K different substructures may arrive (represented by different job classes in a queuing network model); for each type of substructure an exponentially distributed holding time with given mean is assumed. The objective is to find allocation schemes for assigning subsets of processors to applications. A queuing model is constructed for determining system throughput, queuing delays, and processor utilization. Unfortunately, the load balance assumption will not hold for this model in general, thus solution methods become expensive.

Kao 92]

W.-I. Kao and R. K. Iyer. "A User-Oriented Synthetic Workload Generator". In: Proceedings of the 12th International Conference on Distributed Computing Systems, pp. 270-277, IEEE Computer Society Press, Los Alamitos, California, 1992.

Kape 92] A. Kapelnikov, R. R. Muntz, and M. D. Ercegovac. "A Methodology for Performance Analysis of Parallel Computations with Looping Constructs". Journal of Parallel and Distributed Computing, Vol. 14, pp. 105-120, 1992. In this article a computation control graph (ccg) is proposed to model the program behavior. Nodes, representing the tasks, are connected via directed arcs. Multiple ingoing arcs can either be grouped conjunctively (AND) or disjunctively (XOR); probability weights may be assigned to multiple disjunctive outgoing arcs. In contrast to task graphs, ccgs do not necessarily have to be acyclic. The execution times are assumed to be exponentially distributed with a given mean. Analytical performance evaluation techniques are based on a hierarchical aggregation of segments of the ccg. Kats 92]

C. Katsinis. "Performance modeling of message-based multiprocessors under heavy traffic". Computer System Science & Engineering, Vol. 7, No. 3, pp. 190-198, July 1992. In this article the behavior of a single processor in a message-passing architecture is analyzed using a queuing network model. The workload is characterized by an arrival rate of messages; the service time for each message is assumed to be proportional to the message length. Since each message is assumed to leave the station immediately after receiving service and to join the next station, the inter-arrival times between messages are proportional to the message length. Therefore a single processor can be modeled as a single queue with inter-dependent arrival and service rates.
Conditions for the existence of a stationary distribution are derived and solutions for a number of inter-arrival time distributions are developed. Kess 91]

C. Kesselman. Tools and Techniques for Performance Measurement and Performance Improvement in Parallel Programs. PhD thesis, University of California, Los Angeles, 1991.

Klar 92]

R. Klar. "Event-Driven Monitoring of Parallel Systems". In: G. Haring and G. Kotsis, Eds.,

Advances in Parallel Computing: Monitoring and Visualization of Parallel Processing Systems, Workshop, Moravany, CSFR, North-Holland, 1992.

Klei 75]

L. Kleinrock. Queueing Systems, Volume I: Theory. John Wiley & Sons, New York, 1975.

Klei 92]

L. Kleinrock and J.-H. Huang. "On Parallel Processing Systems: Amdahl's Law Generalized and Some Results on Optimal Design". IEEE Transactions on Software Engineering, Vol. 18, No. 5, pp. 434-447, May 1992. In this article the workload is characterized by the number of processors P' desired for execution during a certain interval of time (the parallelism profile of the application). The program is assumed to be divisible, i.e. the time for executing w units of work is given by max(w/P, w/P'), where P denotes the number of processors actually used in the execution. w is assumed to be a random variable (with given mean and coefficient of variation), but the time for executing one unit of work is assumed to be deterministic (exactly one unit of time). In this model neither precedence restrictions nor communication or I/O overheads are considered. Based on this workload model, speedup and power (defined as the ratio of throughput and response time) are derived using queuing theoretic analysis.
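A numeric sketch of this divisible-workload characterization (the parallelism profile below is invented): a phase holding w work units and desiring P' processors takes max(w/P, w/P') time units on P processors, so speedup over a single processor follows directly from the profile.

    # Hypothetical parallelism profile: (w, p_desired) pairs, one per phase.
    profile = [(100, 1), (400, 8), (200, 4)]

    def exec_time(profile, p):
        # A phase of w units desiring p_des processors effectively runs
        # on min(p, p_des) of them.
        return sum(max(w / p, w / p_des) for w, p_des in profile)

    t1 = exec_time(profile, 1)
    for p in (1, 2, 4, 8, 16):
        print(p, round(t1 / exec_time(profile, p), 2))

Beyond P = 8 the speedup stays at 3.5 here: no phase desires more processors, which is exactly the saturation effect the parallelism profile captures.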

Kole 73]

K. Kolence and P. Kiviat. "Software Unit Profiles and Kiviat Figures". ACM SIGMETRICS, Performance Evaluation Review, Vol. 2, No. 3, pp. 2-12, September 1973.

Kope 94] R. Kopecny, M. Studnicka, P. M. Spindler, R. Sziderics, and H. Weissenbock. "Parallel Implementation of the Assignment Problem on a Workstation Cluster". Tech. Rep., Universität Wien, 1994. Seminar and lab course report, parallel processing project course, academic year 1993/94. Kuma 91] V. Kumar and V. Singh. "Scalability of Parallel Algorithms for the All-Pairs Shortest-Path Problem". Journal of Parallel and Distributed Computing, Vol. 13, No. 2, pp. 124-138, Oct. 1991. The isoefficiency metric is used for analyzing the scalability of several parallel algorithms for the all-pairs shortest-path problem. Isoefficiency functions have been shown to be a compact and useful performance indicator. Kush 93] R. Kushawa. "Methodology for predicting performance of distributed and parallel systems". Performance Evaluation, Vol. 18, pp. 189-204, 1993. In this article a queuing network model is constructed for analyzing the performance of distributed systems. The system consists of a set of clients and a set of servers connected via a network. Clients request access to files from the servers. Model parameters are the number and think time of the clients (the time between two requests), the network transfer time, and the file access times at the servers. The performance
of this system (utilization of servers, waiting times, and system throughput) is analyzed with respect to the sensitivity to these parameters. Lauw 93] R. Lauwereins and H. Peperstraete. "Queueing Theoretical Analysis of Processor Utilization in Parallel Computers". Computer System Science & Engineering, Vol. 8, No. 1, pp. 13-23, Jan. 1993. In this article the performance of a parallel architecture with point-to-point link connections is analyzed using queuing theory. The tasks of the program are modeled by the jobs in the queuing network; the processors and links are modeled by queuing centers. After receiving service, a task may request service from the same processor or from any other processor in the network according to certain routing probabilities. Service demands per task on a server are exponentially distributed; communication or synchronization restrictions are not considered. Furthermore, it is assumed that the degree of parallelism (i.e. the number of jobs in the system) is constant over time. This assumption is unrealistic, but necessary from a queuing theoretic point of view for the system to operate in steady state. The influences of the communication-computation ratio, the application program parallelism, the network topology, and the load imbalance on system performance are investigated. Leuz 89]

M. R. Leuze, L. W. Dowdy, and K. H. Park. "Multiprogramming a distributed-memory multiprocessor". Concurrency: Practice and Experience, Vol. 1, No. 1, pp. 19-33, September 1989. In this article the performance of a parallel system, measured in terms of system throughput, is analyzed under a two-program workload using a closed queuing network model. Both programs may be at the host computer, both may be receiving service at the multiprocessor, or one program is at the host while the other is at the multiprocessor system. These are the different states describing the system. The service rate may be state-dependent on the multiprocessor system and state-independent on the host. An additional potential factor is introduced on the multiprocessor system, since a parallel program may not execute at its optimum potential due to contention for shared resources. The execution rate for a program receiving exclusive service from the multiprocessor is derived from the execution signature of the application. The model is used to calculate system throughput. Validation experiments on an Intel iPSC/2 hypercube were performed with a workload consisting of a parallel wavefront algorithm for solving triangular systems of linear equations.

Lewi 91]

T. G. Lewis. "Piece-Wise Scheduling of Composite Task Graphs onto Distributed Memory Parallel Computers". Tech. Rep. 91-60-15, Oregon State University, Computer Science Department, Corvallis, OR 97331-3202, 1991. In this paper an approach is presented to schedule task graphs with regular structures (replicated, tree, mesh) considering process and message startup costs.

Lewi 92]

T. Lewis and H. El-Rewini. Introduction to Parallel Computing, Chap. 7, Data Parallel Programming. Prentice Hall, 1992.

Lewi 93]

T. Lewis and H. El-Rewini. "Parallax: A Tool for Parallel Program Scheduling". IEEE Parallel and Distributed Technology, Vol. 1, pp. 62-72, May 1993. A tool is presented for scheduling tasks on various parallel architectures. Within the tool several scheduling strategies have been implemented, giving the user the possibility to test and compare different strategies. The output (the predicted performance of the schedule) can be visualized.

Liao 92]

Y. Liao and D. Cohen. "A Specificational Approach to High Level Program Monitoring and Measuring". IEEE Transactions on Software Engineering, Vol. 18, No. 11, pp. 969-978, Nov. 1992.

Lind 92]

H. Lindmeier and D. Rauh. "Classification of Systems of a Set of Parameters". 1992. Esprit 3-OMI, Project No. 6271, Deliverable 1.

Liu 91]

L. Liu and J.-K. Peir. "A Performance Evaluation Methodology for Coupled Multiple Supercomputers". In: International Conference on Parallel Processing, Vol. 1, pp. 198-202, 1991. A performance evaluation methodology for the analysis of clustered parallel systems is proposed. In the first step benchmarks and workload characterization techniques are applied. In the second step a model is constructed and validated. Finally performance projections for various clustered systems are made. The methodology is applied in a case study on a fluid dynamics application.

Lo 91]

V. M. Lo, S. Rajopadhye, S. Gupta, D. Keldsen, M. A. Mohamed, B. Nitzberg, J. A. Telle, and X. Zhong. "OREGAMI: Tools for Mapping Parallel Computations to Parallel Architectures". International Journal of Parallel Programming, Vol. 20, No. 3, pp. 237-270, 1991. In this article a tool supporting the mapping of parallel programs onto parallel architectures is presented. The program is modeled as a temporal communication graph, which incorporates the concepts of process-time graphs, static task graphs, and directed acyclic graphs.

Luqu 92] E. Luque, R. Suppi, and J. Sorribes. "Designing parallel systems: a performance prediction problem". Information and Software Technology, Vol. 34, No. 12, pp. 813-823, Dec. 1992. In this article a tool called PSEE is presented for modeling and analyzing the behavior of parallel algorithms. The behavioral dataflow graph, representing the parallel program in an architecture-independent way, may either be specified directly or derived from the code of an existing program together with computation and communication demands. Together with a description of the architecture and a mapping, the execution of the program can be simulated and performance results are obtained. Mabb 94] S. A. Mabbs and K. E. Forward. "Performance Analysis of the MR-1, a Clustered Shared-Memory Multiprocessor". Journal of Parallel and Distributed Computing, Vol. 20, pp. 158-175, 1994. In this article a queuing network model is used to analyze the performance of a clustered shared-memory multiprocessor (MR-1). Since a queuing network model requires a characterization of the workload at a physical level, the workload parameters are the resource demands (memory access requests) and the arrival rates of the requests. To enhance the tractability of
the model, the arrival stream is assumed to be a Poisson process, and the access times are assumed to be exponentially distributed. Mada 91] S. Madala and J. B. Sinclair. "Performance of Synchronous Parallel Algorithms with Regular Structures". IEEE Transactions on Parallel and Distributed Systems, Vol. 2, No. 1, pp. 105-116, January 1991. In this paper the performance of algorithms is analyzed whose task graphs have regular structures, where the tasks can be grouped in certain levels and there are only dependences among nodes in adjacent levels. Examples of such regular structures are divide-and-conquer patterns or multiphase algorithms with alternating sequential and parallel phases. For each task the execution time is given by its mean and variance. Communication and contention costs are assumed to be either incorporated in the (nondeterministic) task execution time or are neglected. Based on these assumptions it is possible to estimate speedup or to determine analytically the optimum number of processing elements (to minimize execution time) for several allocation strategies. The authors present new methods for estimating the mean execution time of parallel algorithms that have either a partitioning task graph structure or alternating serial and parallel phases (multiphase structure). Architecture-independent bounds and approximations on the total execution time are obtained, given some information about the distribution of task execution times. Mahg 92] I. O. Mahgoub and A. K. Elmagarmid. "An Example of Modeling and Evaluation of a Concurrent Program Using Colored Stochastic Petri Nets: Lamport's Fast Mutual Exclusion Algorithm". IEEE Transactions on Parallel and Distributed Systems, Vol. 3, No. 2, pp. 221-240, March 1992. Maju 88] S. Majumdar, D. L. Eager, and R. B. Bunt. "Scheduling in Multiprogrammed Parallel Systems". Performance Evaluation Review, Special Issue, 1988 ACM SIGMETRICS, Vol. 16, No. 1, pp. 104-113, May 1988. In this paper the effects of variations in the workload parameters are investigated with respect to the performance achieved by several scheduling strategies in a multiprogrammed parallel system. A queuing network model is used for the analysis. A set of jobs of a fork-join structure serves as a workload model in scheduling. Jobs arrive at a certain rate and fork into a certain number of tasks to be executed in parallel. The number of tasks is a random variable with given mean and coefficient of variation. The cumulative computation demands (I/O and memory demands are neglected) for each job are also modeled as a random variable with given mean and coefficient of variation. A job is completed after completion of all its tasks. One particular instance of this model is investigated, where the job service demands are linearly correlated with the number of tasks, which might be a realistic assumption for classes of applications. Mak 90]

V. W. Mak and S. F. Lundstrom. "Predicting Performance of Parallel Computers". IEEE Transactions on Parallel and Distributed Systems, Vol. 1, No. 3, pp. 257-269, July 1990. In this
article the performance of parallel programs is investigated whose structure can be represented as a series-parallel directed acyclic graph. To each node representing a task, an exponentially distributed random variable is associated, characterizing the task's execution time. The parallel system is modeled as a queuing network representing the hardware resources. Execution time of parallel computations is increased by two kinds of delays: queuing delays and synchronization delays. The proposed prediction method takes both models as input and, using an iterative algorithm, produces performance measures (such as residence time, completion time, utilization, and queue lengths) as output. Malo 90]

A. D. Malony. Performance Observability. PhD thesis, Department of Computer Science, University of Illinois, 1304 W. Springfield Avenue, Urbana, IL 61801, October 1990.

Malo 92]

A. D. Malony, D. A. Reed, and H. A. G. Wijshoff. "Performance Measurement Intrusion and Perturbation Analysis". IEEE Transactions on Parallel and Distributed Systems, Vol. 3, No. 4, pp. 433-450, July 1992.

Mari 93]

D. C. Marinescu and J. R. Rice. "Speedup, Communication Complexity and Blocking - A La Recherche du Temps Perdu". In: IEEE, Ed., Proc. of the 7th International Parallel Processing Symposium, pp. 712-721, 1993.

Mass 93]

L. Massari. "Performance Analysis of Applications in Parallel Systems". 1993. Available upon request from [email protected]. Dynamic parameters are obtained from an analysis of the program execution on a real system. Based on the measurement of the execution time, speedup, efficiency, and efficacy are derived and used for describing the workload. Based on these parameters a queuing network model was constructed representing a parallel architecture. Performance indices are obtained from the solution of this model.

McDo 92] K. J. McDonell. "Benchmark Frameworks and Tools for Modelling the Workload Profile". May 1992. Submitted to Tools'92. Mehr 92] P. Mehra and B. W. Wah. "Physical-Level Synthetic Workload Generation for Load-Balancing Experiments". In: Proceedings of the First International Symposium on High-Performance Distributed Computing (HPDC-1), September 9-11, 1992, Syracuse, New York, pp. 208-217, IEEE Computer Society Press, Los Alamitos, California, 1992. Mena 92] D. A. Menasce and L. A. Barroso. "A Methodology for Performance Evaluation of Parallel Applications on Multiprocessors". Journal of Parallel and Distributed Computing, Vol. 14, pp. 1-14, 1992. In this article the average execution time of parallel applications for shared
memory architectures is calculated. The parallel program is represented as a task graph together with an assignment function allocating the tasks to processors. The computation time for a task k is given by a function A_k + B_k · t_k^shr, where A_k is a deterministic factor estimating the actual computation time, B_k is a deterministic factor estimating the number of shared memory accesses, and t_k^shr is a nondeterministic function accounting for the average delay due to contention for a single shared memory access. A_k and B_k are application and architecture dependent and may be obtained from measurement experiments and benchmarking. t_k^shr depends on the access rates or probabilities, which are obtained in an iterative procedure from the deterministic part of the execution time A_k, the number of access requests per task, and the interconnection network cycle time. This iterative procedure calculates the total execution time. The reported results are compared to simulation results; convergence and time complexity of the algorithm are analysed. Merl 93]

A. Merlo. "MEDEA: MEasurement Description Evaluation and Analysis tool. User's Guide - Release 1.0". Tech. Rep. 3/117, Università di Pavia, April 1993. Progetto Finalizzato Sistemi Informatici e Calcolo Parallelo, Subproject 3: Parallel Architectures. Coordinator: Marco Vanneschi.

Mess 90]

P. Messina, C. Bailie, P. Hipes, J. Rogers, A. Alagar, A. Kamrath, R. Leary, W. Pfeiffer, R. Williams, and D. Walker. "Benchmarking advanced architecture computers". Concurrency: Practice and Experience, Vol. 2, No. 3, pp. 195-255, Sep. 1990. In this article a broad range of parallel architectures has been evaluated using a number of scientific application programs. Methodologies and guidelines for the development of standard benchmark suites for parallel systems are discussed.

Moll 89]

M. K. Molloy. Fundamentals of Performance Modeling. Macmillan Publishing Company, 1989.

Mura 89] T. Murata. "Petri Nets: Properties, Analysis and Applications". Proceedings of the IEEE, Vol. 77, No. 4, pp. 541-580, Apr. 1989. Musi 94]

S. Musil and G. Haring. "Monitoring Parallel Programs Using InHouse". Tech. Rep., KFKI Technical Reports, 1994.

Mutk 88] M. W. Mutka and M. Livny. "Profiling Workstations' Available Capacity For Remote Execution". In: P.-J. Courtois and G. Latouche, Eds., Performance'87, pp. 529-544, North-Holland, 1988. Naik 94]

V. K. Naik. "Performance of NAS Parallel Application-Benchmarks on IBM SP1". In: Proceedings of the SHPCC'94 Scalable High-Performance Computing Conference, May 23-25, 1994, Knoxville, Tennessee, pp. 121-128, IEEE Computer Society Press, Los Alamitos, CA,

May 1994. Three applications of the NAS parallel benchmark suite have been implemented on an IBM Scalable POWERparallel 1 system. Results are presented in this paper. The implementation of one of the problems is discussed in detail.

Nels 88]

R. Nelson, D. Towsley, and A. N. Tantawi. "Performance Analysis of Parallel Processing Systems". IEEE Transactions on Software Engineering, Vol. 14, No. 4, pp. 532-539, April 1988. In this article the response time of jobs arriving at a parallel system is modeled using queuing network analysis. Jobs, arriving at the system according to a Poisson stream, consist of several independent tasks that may be executed concurrently. The number of tasks is given by a random variable with known probability distribution; the execution times per task are independent and identically exponentially distributed with given mean. The jobs are to be executed on a parallel system consisting of c identical processing elements. The effects of using a distributed or a centralized queue for collecting the arriving jobs, and of job splitting versus no splitting, are compared. The performance values are obtained from solving the steady-state equations of this system modeled as a continuous-time, discrete-state Markov process. When modeling the performance of parallel systems both the job and the system structure must be taken into account. In this paper a bulk-arrival M^x/M/c queuing system is used to model a parallel processing system with a central queue and job splitting. An expression for the mean job response time is obtained. Four parallel processing models are studied: distributed/splitting, distributed/no splitting, centralized/splitting, and centralized/no splitting.

Nels 90]

R. Nelson. "A Performance Evaluation of a General Parallel Processing Model". Performance Evaluation Review, Special Issue, 1990 ACM SIGMETRICS, Vol. 18, No. 1, pp. 13-26, May 1990. The response time of a parallel multiprogrammed system is analyzed as a function of the parallelism in the workload. Since it may not always be possible to characterize the workload precisely, i.e. to determine the distribution of the number of tasks in a job for a certain application, an alternative model is proposed, based on the fractions of sequential and parallel work and on the average number of parallel tasks. The relative error in the results when using this workload model compared to the results when using the more detailed model was comparatively small, thus motivating a more concise workload representation. For both models a generalization of Amdahl's law is presented, giving an estimation for speedup.

Nich 91]

K. Nichols and P. W. Oman. "Navigating Complexity to Achieve High Performance". IEEE Software, pp. 11-15, September 1991. There are two major approaches to performance analysis: measuring and modeling. Measurement, embodied in hardware and software instrumentation and tracing, quantifies a real system. Modeling lets us simulate and analyze a real or proposed system. Two measurement case studies are described (1. Measuring and Analyzing Real-Time Performance; 2. Finite-Element Analysis on a PC) along with four modeling techniques (1. Traceview; 2. ParaGraph; 3. Interactive Visual Modelling for Performance Analysis; 4. Performability Modelling with UltraSAN).

Nico 89]

D. M. Nicol and J. C. Townsend. "Accurate Modeling of Parallel Scientific Computations". Performance Evaluation Review, Special Issue, 1989 ACM SIGMETRICS, Vol. 17, No. 1, pp. 165-170, May 1989. In this article a workload model is derived from an analysis of the loop structure
of the parallel program. The critical loops (i.e. those parts where most of the execution time will be spent) are to be identified. For those loops an analytic model for predicting the execution time has to be constructed. The parameters are obtained from measuring critical sections of the parallel program. The objective of the study is to find remapping strategies balancing the workload among the processors. Noh 92] S. H. Noh and A. A. Agrawala. "Performance prediction of message passing SIMD multiprocessor systems". In: H. Siegel, Ed., Proceedings Frontiers '92, The Fourth Symposium on the Frontiers of Massively Parallel Computation, pp. 560-561, IEEE Computer Society Press, Los Alamitos, California, 1992. The authors of this short paper present an approach for predicting the execution signature of a parallel SIMD application. Computation time functions are associated with the nodes and data transfer functions are associated with the arcs of a task graph. Both functions may depend on the number of processors, the problem size, and other application-dependent factors, but the values given are assumed to be deterministic. These parameters are then used for estimating the communication and computation signatures, assuming an execution on a SIMD architecture. For estimating the communication signature, additional parameters characterizing the hardware are needed: an allocation and a contention factor and the time to transfer one data unit. These values were obtained from experimental measurements on real hardware (CM-2). Norm 93] M. G. Norman and P. Thanisch. "Models of Machines and Computation for Mapping in Multicomputers". ACM Computing Surveys, Vol. 25, No. 3, pp. 263-302, Sep. 1993. When developing mapping algorithms, but also in the complexity analysis of mapping problems, both an abstract model of the parallel architecture and of the program are needed. In this article a framework is presented for discussing and classifying various models. The workload models presented include a set of modules without any precedence relations, tasks with precedence relations given by a DAG, with and without considering costs for communication, and undirected process or communication graphs. Nuss 91] D. Nussbaum and A. Agarwal. "Scalability of Parallel Machines". Communications of the ACM, Vol. 34, No. 3, pp. 57-61, 1991. Papa 82] C. H. Papadimitriou and K. Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Prentice-Hall, Englewood Cliffs, 1982. Park 92] K. Park, L. Dowdy, and T. Wagner. "Parallel Workload Characterizations: Comparisons and Mappings". Tech. Rep., Dept. of Computer Science, Vanderbilt University, 1992. Peas 91] D. Pease, A. Ghafoor, I. Ahmad, D. L. Andrews, K. Foudil-Bey, T. E. Karpinski, M. A. Mikki, and M. Zerrouki. "PAWS: A Performance Evaluation Tool for Parallel Computing Systems". IEEE Computer, Vol. 24, No. 1, pp. 18-29, January 1991. PAWS (Parallel Assessment Window System), a performance evaluation tool for parallel computing systems, is presented. PAWS consists of four tools:
1. The application characterization tool translates applications written in a high-level language into a data dependence graph. This dataflow graph is a machine-independent intermediate representation.
2. The architecture characterization tool allows the user to create descriptions of machines.
3. The interactive graphical display tool provides the user interface for accessing all PAWS tools.
4. The performance assessment tool allows the user to evaluate the performance of any application entered.
The performance metrics include speedup curves (the average amount of computation performed in one step with unlimited processors), parallelism profile curves, and execution profiles. It shows both the ideal parallelism inherent in the machine-independent dataflow graph and the predicted parallelism of the dataflow graph on the target machine.

G. D. Peterson and R. D. Chamberlain. "Beyond Execution Time: Expanding the Use of Performance Models". IEEE Parallel and Distributed Technology, Vol. 2, No. 2, 1993.

Pfne 95]

H. Pfneiszl and G. Kotsis. "Parallel Implementation and Analysis of a NAS Kernel Using NMAP". Tech. Rep., Universität Wien, 1995. Internal Report, in German; English version in preparation.

Qin 93]

B. Qin and R. A. Ammar. "Analytic Approach to Deriving Time Costs of Parallel Computations". Computer System Science & Engineering, Vol. 8, No. 2, pp. 90-100, Apr. 1993. In this article flow analysis and time cost analysis techniques are presented to derive the execution time for a parallel program specified by its control graph. This graph may contain the following types of nodes: start, end, decision, operation, fork, join, or. It is assumed that the computation costs for each of these basic nodes are known and deterministic. The basic idea is to replace a parallel structure by a single node, thus transforming the parallel control graph into a sequential graph, where well-known flow analysis techniques can be applied. The time costs for a parallel structure are determined and then associated with the sequential node which replaces the parallel structure. For some types of parallel structures, the flow count information is insufficient for accurately determining the time costs. In those cases assumptions have to be made, resulting in an approximation of the time costs. The derived time costs are a function of the program and data structure, the input, and the processor speed. Future work will focus on considering also the number of processors and contention.
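The reduction idea, though not the paper's exact flow analysis algorithm, can be sketched as a recursive evaluation of a nested control structure with deterministic node costs (the structure and numbers below are invented):

    # A structure is a nested tuple: ("seq", parts...), ("par", branches...),
    # ("loop", count, body), ("dec", [(prob, branch), ...]), or a float cost.
    def cost(node):
        if isinstance(node, (int, float)):
            return float(node)           # basic node with known cost
        kind, *args = node
        if kind == "seq":                # sequence: costs add up
            return sum(cost(a) for a in args)
        if kind == "par":                # fork/join: slowest branch dominates
            return max(cost(a) for a in args)
        if kind == "loop":               # loop: iteration count times body
            count, body = args
            return count * cost(body)
        if kind == "dec":                # decision: probability-weighted mean
            (branches,) = args
            return sum(p * cost(b) for p, b in branches)
        raise ValueError(kind)

    prog = ("seq", 2.0,
            ("par", ("loop", 10, 1.5), ("seq", 4.0, 3.0)),
            ("dec", [(0.7, 1.0), (0.3, 5.0)]))
    print(cost(prog))   # 2 + max(15, 7) + (0.7*1 + 0.3*5) = 19.2

Each "par" evaluation is exactly the replacement of a parallel structure by a single sequential node carrying its time cost.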

Quin 93a] M. J. Quinn. Parallel Computing: Theory and Practice, Chap. 2, PRAM Algorithms. McGraw-Hill International Publishers, New York, 1993. Quin 93b] M. J. Quinn. Parallel Computing: Theory and Practice, Chap. 8, The Fast Fourier Transform. McGraw-Hill International Publishers, New York, 1993.

Rama 93] G. Ramanathan and J. Oren. "Survey of Commercial Parallel Machines". Computer Architecture News, Vol. 21, No. 3, pp. 13-32, June 1993. Rein 94]

A. Reinefeld and V. Schnecke. "Work-Load Balancing in Highly Parallel Depth-First Search". In: Proceedings of the SHPCC'94 Scalable High-Performance Computing Conference, May 23-25, 1994, Knoxville, Tennessee, pp. 773-780, IEEE Computer Society Press, Los Alamitos, CA, May 1994. Two techniques for load balancing in a parallel DFS application are compared. The first strategy employs dynamic load balancing, while the second strategy starts with an initial distribution and redistributes the load only on demand. A scalability analysis of the schemes indicates that the first method is not suitable for massively parallel systems.

Roge 93]

S. A. Rogers. "A Synthetic Workload Generator for Evaluating Distributed Fault-tolerant Environments". Tech. Rep. ESL-AFT-040-93, MCC, 1993.

Rost 94]

E. Rosti, E. Smirni, L. W. Dowdy, G. Serazzi, and B. M. Carlson. "Robust Partitioning Policies of Multiprocessor Systems". Performance Evaluation, Vol. 19, No. 2-3, pp. 141-166, May 1994.

Sahn 85]

R. A. Sahner and K. S. Trivedi. "SPADE: A tool for performance and reliability evaluation". In: N. A. El Ata, Ed., Modelling Techniques and Tools for Performance Analysis '85, pp. 147-164, North-Holland, June 1985. SPADE (Series PArallel Directed acyclic graph Evaluator) is a tool for analyzing node-activity networks which are directed series-parallel graphs. Each node in the graph represents an event, and the length of an event (representing the amount of time required for traversing this node) has a distribution of exponential polynomial form. SPADE will derive the distribution (CDF) of the time for traversing the whole graph by using the graph decomposition tree. The user must specify all nodes of the graph and their distributions (including "zero", indicating that the time for traversing this node is 0), the edges along with the type of parallelism for parallel subgraphs ("exit" types of the nodes, which can be either maximum, minimum, or probabilistic), and a set of intervals over which to evaluate the overall graph CDF. SPADE is useful in modeling concurrent program execution and in reliability analysis.

Sevc 89]

K. C. Sevcik. "Characterization of Parallelism in Applications and Their Use in Scheduling". Performance Evaluation Review, Special Issue, 1989 ACM SIGMETRICS, Vol. 17, No. 1, pp. 171-180, May 1989. The author has characterized the workload of a parallel system for the purpose of scheduling. It was shown that, on the one hand, single-parameter characterizations such as the ratio of parallel and sequential code are insufficient for this purpose. On the other hand, the most detailed information described in a full data dependency graph is not tractable. Therefore a set of parameters is proposed, small enough to be handled efficiently but still representing the basic characteristics of the workload necessary for deriving a scheduling strategy. From a parallelism profile, the following parameters are extracted: sequential and parallel fraction, average, maximum and minimum parallelism, and variance in parallelism. The behavior of three static scheduling policies is investigated under varying parameter mixes. It turned out that scheduling strategies which take these parameters into account behave comparatively well.
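The extraction of these parameters from a parallelism profile is straightforward; a minimal sketch over an invented profile (one parallelism value per unit-time slice):

    # profile[t] = degree of parallelism during time slice t (invented data).
    profile = [1, 1, 4, 8, 8, 8, 2, 1]

    total_time = len(profile)
    total_work = sum(profile)

    seq_fraction = sum(1 for p in profile if p == 1) / total_time
    avg_par = total_work / total_time      # average parallelism
    max_par = max(profile)
    min_par = min(profile)
    var_par = sum((p - avg_par) ** 2 for p in profile) / total_time

    print(seq_fraction, avg_par, max_par, min_par, round(var_par, 2))
    # 0.375, 4.125, 8, 1, 9.61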

Sevc 94]

K. C. Sevcik. "Application Scheduling and Processor Allocation in Multiprogrammed Parallel Processing Systems". Performance Evaluation, Vol. 19, No. 2-3, pp. 107-140, May 1994.

Sing 91]

J. P. Singh, W.-D. Weber, and A. Gupta. "SPLASH: Stanford Parallel Applications for Shared-Memory". Tech. Rep., Computer Systems Laboratory, Stanford University, Apr. 1991.

Sing 93]

J. P. Singh, J. L. Hennessy, and A. Gupta. "Scaling Parallel Programs for Multiprocessors: Methodology and Examples". IEEE Computer, pp. 42-50, July 1993.

Sinz 93]

J. Sinz. Simulation von GERT-Netzplänen mittels SLAM. Master's thesis, Institute of Applied Computer Science, University of Vienna, 1993.

Smit 90]

R. Smith and K. Trivedi. "The Analysis of Computer Systems Using Markov Reward Processes". In: H. Takagi, Ed., Stochastic Analysis of Computer and Communication Systems, pp. 589-630, North-Holland, 1990. Performance engineering includes modeling, measurement, and the use of results.

Smit 91]

C. U. Smith. "Integrating New and "Used" Modelling Tools for Performance Engineering". In: Proceedings of the 5th International Conference on Modelling Techniques and Tools for Computer Performance Evaluation, pp. 148-158, February 1991. Software performance engineering methods cover performance data collection, quantitative analysis techniques, prediction strategies, management of uncertainties, data presentation and tracking, model verification and validation, critical success factors, and performance design principles, using two models: the software execution model (representing the workload) and the system execution model. Three strategies for deriving quantitative assessments from models can be applied: the best-worst-case strategy, the adapt-to-precision strategy, and the simple-to-realistic strategy. In this article some guidelines are given for the design of tools which will efficiently support the SPE process. Important issues are the user interface (direct-manipulative VUI), portability and extendibility, and the provided functions (several description and solution techniques). Another crucial aspect is the integration of CASE and SPE tools, which must be capable of sharing data.

F. Sötz. "A Method for Performance Prediction of Parallel Programs". In: H. Burkhart, Ed., CONPAR 90 - VAPP IV, Proc. of the Joint Int. Conf. on Vector and Parallel Processing, pp. 98-107, 1990. In this article several performance evaluation techniques are described based on a stochastic graph model of the workload. In this model, iterations and loops are represented by cyclic nodes or hierarchical loop nodes. The execution time for a task can be given by a distribution (deterministic, exponential, Erlang) or numerically. Since the exact solution of non-series-parallel graphs is very costly and even intractable for larger systems, several approximation techniques are proposed and implemented in a tool called PEPP.

Sree 92]

H. V. Sreekantaswamy, S. Chanson, and A. Wagner. "Performance Prediction Modeling of Multicomputers". In: Proceedings of the 12th International Conference on Distributed Computing Systems, pp. 278-285, IEEE Computer Society Press, Los Alamitos, California, 1992. In this
article the workload of a multicomputer system is modeled as a collection of tasks in a divide-and-conquer structure. Tasks enter the multiprocessor system at a single root processor. Upon arrival of a task, each processor either computes the task locally or splits it into subtasks. The resulting task structure is a tree. Nodes within one level are assumed to have identical split, join, and execution times. Also the communication time between two successive levels is assumed to be equal for all nodes. Here only the actual overhead is considered; it is assumed that the actual time to transfer the data may be overlapped with computation.

A. Stafylopatis. "Performance of Parallel Computations of Triangular Structure". Computer System Science & Engineering, Vol. 8, No. 1, pp. 24-32, Jan. 1993. In this article the performance of parallel applications with a triangular task graph structure is investigated. The workload is specified by a DAG with probabilistic task execution times; communication costs are neglected. The performance is evaluated under two different scheduling strategies by the analysis of Markov chains.

Ston 77]

H. Stone. "Multiprocessor Scheduling with the Aid of Network Flow Algorithms". IEEE Trans. on Software Engineering, Vol. SE-3, pp. 85-93, 1977.

Sun 91a]

X.-H. Sun and J. L. Gustafson. "Sizeup: A New Parallel Performance Metric". In: International Conference on Parallel Processing, Vol. 2, pp. 298-299, 1991. In this article new metrics are presented, relating the parallel and the sequential amount of work (sizeup) and the parallel and the sequential speed (generalized speedup) instead of the time fractions. The commonly used performance metric parallel speedup is "unfair" in that it favors slow processors and poorly coded programs.

Sun 91b]

X.-H. Sun and J. L. Gustafson. "Toward a better parallel performance metric". Parallel Computing, Vol. 17, pp. 1093-1109, 1991. After recalling the traditional definitions of absolute and relative speedups and formulating those metrics in terms of operation costs and amount of work, the authors point out that these measures are unfair in that they favor poorly coded programs or slow processors. Two new metrics are proposed, which allow a fairer comparison of algorithms and architectures: sizeup, which relates the sequential and parallel amount of work, and generalized speedup, which relates the sequential and parallel speed. Since speed is defined as the ratio of total work to execution time, both speedup and sizeup can be seen as special cases of the generalized speedup. Under the assumption that the sequential fraction does not scale with the problem size, both metrics are machine-independent and programming-independent. The theoretical results are verified by measurements of a real application (parallel radiosity) on an nCUBE2.
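Following the definitions recalled above (speed as the ratio of work to execution time), a minimal sketch with invented numbers shows how generalized speedup and sizeup coincide in the fixed-time case:

    # Speed = work / time. Generalized speedup = parallel / sequential speed.
    # Sizeup = parallel work / sequential work, both solved in the same time.
    def speed(work, time):
        return work / time

    seq_work, seq_time = 1.0e9, 100.0    # operations, seconds (1 processor)
    par_work, par_time = 3.2e10, 100.0   # a scaled problem, same wall time

    generalized_speedup = speed(par_work, par_time) / speed(seq_work, seq_time)
    sizeup = par_work / seq_work         # equal times, so both are 32.0
    print(generalized_speedup, sizeup)

Because the metric is based on work rather than time, a slower sequential baseline no longer inflates the result, which is the fairness argument the authors make.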

Sun 94]

X.-H. Sun and D. T. Rover. "Scalability of Parallel Algorithm-Machine Combinations". IEEE Transactions on Parallel and Distributed Systems, Vol. 5, No. 6, pp. 599-613, June 1994.

Taqi 92]

A. A. Q. Taqi, A. J. Al-Sammak, A. A. Khan, and N. Ahmed. "A comparative study between Petri Net and SLAM". Simulation, Vol. 59, No. 5, pp. 339-344, Nov. 1992.

Trel 87]

P. C. Treleaven. Computer architectures for artificial intelligence, pp. 416-492. Vol. 272 of Lecture Notes in Computer Science: Future Parallel Computers, Springer Verlag, 1987.

Trel 88]

P. C. Treleaven. "Parallel architecture overview". Parallel Computing, No. 8, pp. 59-70, 1988. North-Holland.

Triv 84]

K. S. Trivedi. Probability and Statistics with Reliability, Queueing and Computer Science Applications. Prentice-Hall, Englewood Cliffs, New Jersey, 1984.

Triv 90] K. S. Trivedi. "SHARPE: Symbolic Hierarchical Automated Reliability and Performance Evaluator". Tech. Rep., Department of Computer Science, Duke University, Durham, NC 27705, 1990.

VanV 94] B. VanVoorst. "Profiling the Communication Workload of an iPSC/860". In: Proceedings of the SHPCC'94 Scalable High-Performance Computing Conference, May 23-25, 1994, Knoxville, Tennessee, pp. 221-228, IEEE Computer Society Press, Los Alamitos, CA, May 1994.

The author reports on an measurement experiment, where the communication workload of an iPSC/860 was measured during a 10 days period. Measurements were obtained from a modied kernel running on the cube, so no user interaction was necessary. Statistics collected include the total number of dierent send commands (csend, isend, broadcast, communication to host), distance that messages travel, process execution time, and memory usage per node. Vinc 88] J.-M. Vincent. \Stability condition of a service system with precedence constraints between tasks". Rapport de recherche 89-6, Ecole des Hautes Etudes en Informatique (EHEI), 45, rue des Saints-Peres, 75006 Paris, France, December 1988. In this article the tasks arriving at a parallel system, which consists of an unlimited number of servers and a queue of innite capacity, may only be processed considering precedence relation given by an acyclic task graph. The processing time and the inter arrival times are assumed to be independent and general distributed. A general expressions for the stability condition of such systems was derived. Wabn 93] H. Wabnig, G. Kotsis, and G. Haring. \Performance Prediction of Parallel Programs". In: B. Walke and O. Spaniol, Eds., Messung, Modellierung und Bewertung von Rechen- und Kommunikationssystemen, Informatik Aktuell, pp. 64{76, Springer Verlag, 1993. Wabn 94a] H. Wabnig and G. Haring. \PAPS - The Parallel Program Performance Prediction Toolset". In: G. Haring and G. Kotsis, Eds., Proc. of the 7th Int. Conf. on Modelling Techniques and Tools for Computer Performance Evaluation., pp. 284{304, Springer-Verlag, 1994. Wabn 94b] H. Wabnig and G. Haring. \Performance Prediction of Parallel Systems with Scalable Specications - Methodology and Case Study (Extended Abstract)". In: Proceedings of the ACM SIGMETRICS Conference, 1994. Accepted as Poster at the 1994 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems.


Wang 91]

C.-J. Wang and V. P. Nelson. "Petri net performance modeling of a modified mesh-connected parallel computer". Parallel Computing, Vol. 17, pp. 75-84, 1991. In this paper a modified mesh-connected parallel architecture is proposed and analyzed. A stochastic Petri net is used to model the interconnection topology (a two-dimensional grid with link and bus connections) of the parallel system. The workload, characterized in terms of computation and communication demands, is assumed to be uniformly distributed among all processing elements. The analysis shows that the combination of link and bus connections is promising as a cost-effective general-purpose architecture.

Worl 91]

J. Worlton. "Toward a taxonomy of performance metrics". Parallel Computing, No. 17, pp. 1073-1092, 1991. North Holland.

Yang 88]

C.-Q. Yang and B. P. Miller. "Critical Path Analysis for the Execution of Parallel and Distributed Programs". In: Proceedings of the Seventh Conference on Distributed Memory Computer Systems, pp. 366-373, IEEE, 1988. Critical path analysis of a program's execution history is a technique for automatically guiding the user to performance problems. In this paper an approach is presented in which a program activity graph (PAG) is created from the program's execution trace. In a PAG each node corresponds to a certain event during the execution, and connecting arcs indicate the time between the occurrences of the events. Usually all events associated with a certain processor are shown on one line, so that communication can easily be detected as arcs connecting events on different lines. The longest path in the PAG represents the critical path of the program's execution. A parallel simplex method following the master-slave paradigm is used as an example.
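As an illustration of the core computation, the following Python sketch (the graph format and all names are hypothetical, not taken from the paper) finds the longest, i.e. critical, path in a small program activity graph whose arcs carry elapsed times between events.

# Minimal sketch of critical path analysis on a program activity graph.
# Assumed input format: arcs[(u, v)] = elapsed time between events u and v.
from collections import defaultdict

def critical_path(arcs):
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for (u, v), w in arcs.items():
        succ[u].append((v, w))
        indeg[v] += 1
        nodes.update((u, v))
    # Topological order via Kahn's algorithm (the PAG is acyclic).
    order, stack = [], [n for n in nodes if indeg[n] == 0]
    while stack:
        u = stack.pop()
        order.append(u)
        for v, _ in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                stack.append(v)
    # Longest distance from any source node, with predecessors for recovery.
    dist = {n: 0.0 for n in nodes}
    pred = {}
    for u in order:
        for v, w in succ[u]:
            if dist[u] + w > dist[v]:
                dist[v] = dist[u] + w
                pred[v] = u
    end = max(nodes, key=lambda n: dist[n])
    path = [end]
    while path[-1] in pred:
        path.append(pred[path[-1]])
    return dist[end], list(reversed(path))

if __name__ == "__main__":
    # Two processors: events p0_*, p1_*; one arc models a message between them.
    arcs = {("p0_start", "p0_send"): 4.0,
            ("p0_send", "p1_recv"): 1.0,
            ("p1_start", "p1_recv"): 2.0,
            ("p1_recv", "p1_end"): 3.0}
    length, path = critical_path(arcs)
    print("critical path length:", length)
    print("critical path:", " -> ".join(path))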